[torqueusers] reply code=15001...

Garrick Staples garrick at clusterresources.com
Wed Oct 11 10:55:17 MDT 2006


On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
> On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
> > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
> > > Hi!
> > > 
> > > I think this has been addressed before, but I can't find any info.
> > > 
> > > We are getting loads of
> > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> > > REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
> > > 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
> > > PBS_Server at ingrid-i.hpc2n.umu.se
> > > 
> > > I think they are related to stage-in/out, but what exactly should we
> > > be looking for?
> > > 
> > > The torque versions range from 2.0.0p4 to 2.1.2.
> > 
> > This happens with every job, right?  And you are using maui/moab, right?
> > 
> > If so, that is maui/moab resetting the job's neednodes resource after
> > starting the job.  This is a work-around for a mythical bug in job
> > starts in OpenPBS that no one has ever been able to demonstrate to me.
> 
> It doesn't happen on every job, only those that do explicit stagein/out.
> The attrlist is "resource" and this is what happens...
> 
> And yes this is with maui.
> Jobs without the initial CopyFiles request never get any Modify
> rejects.

IIRC, it is actually a race condition.  Stagein and longer prologues
will cause the error message.  It is mostly harmless, but there are some
rare bad things.  I have a patch for maui if you want (moab has a
tunable, something like NOAUTONEEDNODE).
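
As a rough illustration of the race, here is a small Python sketch that
scans unwrapped pbs_mom log lines and reports jobs for which a
code=15001 ModifyJob reject was logged before the MOM finished the
ReadyToCommit step for the same job.  The function name, the regular
expressions and the sample lines are assumptions modelled on the log
excerpt quoted in this thread; none of this is part of TORQUE or maui.

```python
import re

# Assumed pattern: a code=15001 reject of a ModifyJob request, capturing the job id.
REJECT_RE = re.compile(
    r"Reject reply code=15001\(.*?unknown job (\S+?)[\s)].*type=ModifyJob"
)
# Assumed pattern: the MOM's record that ReadyToCommit completed for a job.
READY_RE = re.compile(r"pbs_mom;Job;([^;]+);ready to commit job completed")


def early_modify_rejects(lines):
    """Return job ids whose ModifyJob reject precedes their ReadyToCommit record."""
    rejected = set()  # job ids seen in a 15001 ModifyJob reject so far
    early = []        # job ids later accepted by the MOM, in log order
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            rejected.add(match.group(1))
            continue
        match = READY_RE.search(line)
        if match and match.group(1) in rejected and match.group(1) not in early:
            early.append(match.group(1))
    return early


# Sample lines copied from the quoted log, joined back into single lines.
SAMPLE = [
    "10/10/2006 20:24:17;0080;   pbs_mom;Req;req_reject;Reject reply "
    "code=15001(Unknown Job Id REJHOST=nodex.some.dom.ain MSG=modify job "
    "failed, unknown job 393264.pbsserver.some.dom.ain attrlist : resource), "
    "aux=0, type=ModifyJob, from PBS_Server at pbsserver.some.dom.ain",
    "10/10/2006 20:24:17;0008;   "
    "pbs_mom;Job;393264.pbsserver.some.dom.ain;ready to commit job completed",
]

if __name__ == "__main__":
    for job_id in early_modify_rejects(SAMPLE):
        print("ModifyJob arrived before the job was committed:", job_id)
```

A job reported here matches the ordering seen in the quoted log: the
ModifyJob request reaches the MOM while only the CopyFiles (stagein)
request has been handled, and the QueueJob/ReadyToCommit sequence
follows afterwards.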

 
> The "attrlist : resource" in the "Reject reply" output was added by
> checking what attr it's really trying to modify when it can't find the
> job.
> 
> ===================
> 10/10/2006 20:24:17;0080;   pbs_mom;Svr;add_conn;added connection to fd
> 10 - num_connections=5
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
> command CopyFiles from PBS_Server
> 10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type CopyFiles request received
> from PBS_Server at pbsserver.some.dom.ain, sock=10
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> CopyFiles from host pbsserver.some.dom.ain received
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> CopyFiles from host pbsserver.some.dom.ain allowed
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
> request CopyFiles on sd=10
> 10/10/2006 20:24:17;0008;
> pbs_mom;Job;393264.pbsserver.some.dom.ain;dispatching request CopyFiles
> on sd=10
> 10/10/2006 20:24:17;0004;   pbs_mom;Fil;N/A;forking to user, uid: xxxxx
> gid: yyyy  homedir: '/some-home-dir'
> 10/10/2006 20:24:17;0002;   pbs_mom;n/a;mom_close_poll;entered
> 10/10/2006 20:24:17;0080;   pbs_mom;Svr;add_conn;added connection to fd
> 11 - num_connections=6
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
> command ModifyJob from PBS_Server
> 10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type ModifyJob request received
> from PBS_Server at pbsserver.some.dom.ain, sock=11
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> ModifyJob from host pbsserver.some.dom.ain received
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> ModifyJob from host pbsserver.some.dom.ain allowed
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
> request ModifyJob on sd=11
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;req_reject;Reject reply
> code=15001(Unknown Job Id REJHOST=nodex.some.dom.ain MSG=modify job
> failed, unknown job 393264.pbsserver.some.dom.ain attrlist : resource),
> aux=0, type=ModifyJob, from PBS_Server at pbsserver.some.dom.ain
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
> command Disconnect from PBS_Server
> 10/10/2006 20:24:17;0080;   pbs_mom;Svr;close_conn;closed connection to
> fd 11 - num_connections=5
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;scan_for_terminated;pid 24814
> not tracked, exitcode=0
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
> command Disconnect from PBS_Server
> 10/10/2006 20:24:17;0080;   pbs_mom;Svr;close_conn;closed connection to
> fd 10 - num_connections=4
> 10/10/2006 20:24:17;0080;   pbs_mom;Svr;add_conn;added connection to fd
> 10 - num_connections=5
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
> command QueueJob from PBS_Server
> 10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type QueueJob request received
> from PBS_Server at pbsserver.some.dom.ain, sock=10
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> QueueJob from host pbsserver.some.dom.ain received
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> QueueJob from host pbsserver.some.dom.ain allowed
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
> request QueueJob on sd=10
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
> command JobScript from PBS_Server
> 10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type JobScript request received
> from PBS_Server at pbsserver.some.dom.ain, sock=10
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> JobScript from host pbsserver.some.dom.ain received
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> JobScript from host pbsserver.some.dom.ain allowed
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
> request JobScript on sd=10
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
> command JobScript from PBS_Server
> 10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type JobScript request received
> from PBS_Server at pbsserver.some.dom.ain, sock=10
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> JobScript from host pbsserver.some.dom.ain received
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> JobScript from host pbsserver.some.dom.ain allowed
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
> request JobScript on sd=10
> ...
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
> command ReadyToCommit from PBS_Server
> 10/10/2006 20:24:17;0100;   pbs_mom;Req;;Type ReadyToCommit request
> received from PBS_Server at pbsserver.some.dom.ain, sock=10
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> ReadyToCommit from host pbsserver.some.dom.ain received
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;process_request;request type
> ReadyToCommit from host pbsserver.some.dom.ain allowed
> 10/10/2006 20:24:17;0008;   pbs_mom;Job;dispatch_request;dispatching
> request ReadyToCommit on sd=10
> 10/10/2006 20:24:17;0008;
> pbs_mom;Job;393264.pbsserver.some.dom.ain;ready to commit job
> 10/10/2006 20:24:17;0008;
> pbs_mom;Job;393264.pbsserver.some.dom.ain;ready to commit job completed
> 10/10/2006 20:24:17;0080;   pbs_mom;Req;dis_request_read;decoding
> command Commit from PBS_Server
> ===================

