[torqueusers] reply code=15001...

Åke Sandgren ake.sandgren at hpc2n.umu.se
Wed Oct 11 00:41:20 MDT 2006


On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
> On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
> > Hi!
> > 
> > I think this has been addressed before, but I can't find any info.
> > 
> > We are getting loads of
> > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> > REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
> > 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
> > PBS_Server at ingrid-i.hpc2n.umu.se
> > 
> > I think they are related to stage-in/out, but what exactly should we be
> > looking for?
> > 
> > torque version ranging from 2.0.0p4 to 2.1.2.
> 
> This happens with every job, right?  And you are using maui/moab, right?
> 
> If so, that is maui/moab resetting the job's neednodes resource after
> starting the job.  This is a work-around for a mythical bug in job
> starts in OpenPBS that no one has ever been able to demonstrate to me.

It doesn't happen on every job, only on those that do explicit stagein/out.
The attrlist is "resource", and this is what happens...

And yes, this is with maui.
Jobs without the initial CopyFiles request never get any ModifyJob
rejects.

The "attrlist : resource" in the "Reject reply" output below is my own
addition: I made pbs_mom report which attr the request is really trying
to modify when it can't find the job.

===================
10/10/2006 20:24:17;0080;   pbs_mom;Svr;add_conn;added connection to fd
10 - num_connections=5
10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
command CopyFiles from PBS_Server
10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type CopyFiles request received
from PBS_Server at pbsserver.some.dom.ain, sock=10
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
CopyFiles from host pbsserver.some.dom.ain received
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
CopyFiles from host pbsserver.some.dom.ain allowed
10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
request CopyFiles on sd=10
10/10/2006 20:24:17;0008;
pbs_mom;Job;393264.pbsserver.some.dom.ain;dispatching request CopyFiles
on sd=10
10/10/2006 20:24:17;0004;   pbs_mom;Fil;N/A;forking to user, uid: xxxxx
gid: yyyy  homedir: '/some-home-dir'
10/10/2006 20:24:17;0002;   pbs_mom;n/a;mom_close_poll;entered
10/10/2006 20:24:17;0080;   pbs_mom;Svr;add_conn;added connection to fd
11 - num_connections=6
10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
command ModifyJob from PBS_Server
10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type ModifyJob request received
from PBS_Server at pbsserver.some.dom.ain, sock=11
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
ModifyJob from host pbsserver.some.dom.ain received
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
ModifyJob from host pbsserver.some.dom.ain allowed
10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
request ModifyJob on sd=11
10/10/2006 20:24:17;0080;   pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=nodex.some.dom.ain MSG=modify job
failed, unknown job 393264.pbsserver.some.dom.ain attrlist : resource),
aux=0, type=ModifyJob, from PBS_Server at pbsserver.some.dom.ain
10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
command Disconnect from PBS_Server
10/10/2006 20:24:17;0080;   pbs_mom;Svr;close_conn;closed connection to
fd 11 - num_connections=5
10/10/2006 20:24:17;0008;   pbs_mom;Job;scan_for_terminated;pid 24814
not tracked, exitcode=0
10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
command Disconnect from PBS_Server
10/10/2006 20:24:17;0080;   pbs_mom;Svr;close_conn;closed connection to
fd 10 - num_connections=4
10/10/2006 20:24:17;0080;   pbs_mom;Svr;add_conn;added connection to fd
10 - num_connections=5
10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
command QueueJob from PBS_Server
10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type QueueJob request received
from PBS_Server at pbsserver.some.dom.ain, sock=10
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
QueueJob from host pbsserver.some.dom.ain received
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
QueueJob from host pbsserver.some.dom.ain allowed
10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
request QueueJob on sd=10
10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
command JobScript from PBS_Server
10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type JobScript request received
from PBS_Server at pbsserver.some.dom.ain, sock=10
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
JobScript from host pbsserver.some.dom.ain received
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
JobScript from host pbsserver.some.dom.ain allowed
10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
request JobScript on sd=10
10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
command JobScript from PBS_Server
10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type JobScript request received
from PBS_Server at pbsserver.some.dom.ain, sock=10
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
JobScript from host pbsserver.some.dom.ain received
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
JobScript from host pbsserver.some.dom.ain allowed
10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
request JobScript on sd=10
...
10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
command ReadyToCommit from PBS_Server
10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type ReadyToCommit request
received from PBS_Server at pbsserver.some.dom.ain, sock=10
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
ReadyToCommit from host pbsserver.some.dom.ain received
10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
ReadyToCommit from host pbsserver.some.dom.ain allowed
10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
request ReadyToCommit on sd=10
10/10/2006 20:24:17;0008;
pbs_mom;Job;393264.pbsserver.some.dom.ain;ready to commit job
10/10/2006 20:24:17;0008;
pbs_mom;Job;393264.pbsserver.some.dom.ain;ready to commit job completed
10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
command Commit from PBS_Server
===================
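
For anyone who wants to check their own mom logs for the same pattern,
here is a rough Python sketch. It assumes each log record sits on a single
line (the excerpt above was wrapped by the mail client) and that the
messages look like the ones shown here. The function name and the two
regular expressions are made up for this example and are not part of
torque. It reports every ModifyJob reject with code 15001 and whether a
CopyFiles request for the same job id was seen earlier in the log.

```python
import re

# Job-specific CopyFiles dispatch, e.g.
#   pbs_mom;Job;393264.pbsserver.some.dom.ain;dispatching request CopyFiles on sd=10
COPY_RE = re.compile(r";(?P<job>\d+\.[^;\s]+);dispatching request CopyFiles")

# ModifyJob reject with code 15001; the job id follows "unknown job".
REJECT_RE = re.compile(
    r"req_reject;Reject reply code=15001\(.*?unknown job "
    r"(?P<job>\d+\.[^\s)]+).*?type=ModifyJob"
)


def modify_rejects(lines):
    """Return (job_id, had_copyfiles) for each ModifyJob reject with code 15001."""
    staged = set()   # job ids for which a CopyFiles dispatch was logged
    rejects = []
    for line in lines:
        copy = COPY_RE.search(line)
        if copy:
            staged.add(copy.group("job"))
            continue
        reject = REJECT_RE.search(line)
        if reject:
            job = reject.group("job")
            rejects.append((job, job in staged))
    return rejects


if __name__ == "__main__":
    import sys
    # Usage: python modify_rejects.py <mom log file>
    with open(sys.argv[1], errors="replace") as log:
        for job, had_copy in modify_rejects(log):
            print(job, "CopyFiles seen" if had_copy else "no CopyFiles seen")
```

If the observation above holds, every reject it reports should be flagged
as having a preceding CopyFiles request.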


