[torquedev] Resource : neednodes, PBS_NODEFILE vanishes if stagein requirement is specified

rishi pathak mailmaverick666 at gmail.com
Wed Dec 5 23:14:45 MST 2007


Our configuration is as follows:
torque version: 2.1.6
Moab server version 5.1.0p4
The problem we are facing is that when a job specifies a stagein
requirement, the PBS_NODEFILE environment variable (the list of allocated
nodes) is not available to the job. Below is the moab log for the job:
12/06 11:45:51 WARNING:  cannot set job '7142.head.compute.in' attr
'Resource_List:neednodes' to '' (rc: 15001 'Unknown Job Id')
12/06 11:45:51 INFO:     job '7142' successfully started
12/06 11:45:51 INFO:     starting job '7142'
12/06 11:45:51 INFO:     1 jobs started on iteration 1

The corresponding pbs_mom log is:
12/06/2007 11:38:54;0080;   pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=amd16.compute.in MSG=modify job failed,
unknown job 7142.amd01.head.compute.in), aux=0, type=ModifyJob, from
PBS_Server at head.compute.in
12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type QueueJob request received from
PBS_Server at head.compute.in, sock=11
12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type JobScript request received
from PBS_Server at amd01.npsf.cdac.ernet.in, sock=11
12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type ReadyToCommit request received
from PBS_Server at head.compute.in, sock=11
12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type Commit request received from
PBS_Server at head.compute.in, sock=11
12/06/2007 11:38:54;0001;   pbs_mom;Job;TMomFinalizeJob3;job
7142.head.compurte.in started, pid = 2687
12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type StatusJob request received
from PBS_Server at head.compute.in, sock=10
12/06/2007 11:38:54;0080;
pbs_mom;Job;7142.head.compute.in;scan_for_terminated: job
7142.head.compute.in task 1 terminated, sid 2687
12/06/2007 11:38:54;0008;   pbs_mom;Job;7142.head.compute.in;job was
terminated
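
For anyone who wants to pull the rejected requests out of a longer pbs_mom log, here is a rough Python sketch. It assumes the record layout seen above (timestamp;event;daemon;class;object;message) and that wrapped lines belong to the previous record; the function names are mine, not part of torque.

```python
import re

# A new record starts with "MM/DD/YYYY HH:MM:SS;" as in the log above.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")
CODE = re.compile(r"code=(\d+)")
REQ_TYPE = re.compile(r"type=(\w+)")
JOB = re.compile(r"unknown job ([^\s)]+)")


def join_wrapped(lines):
    """Glue mail-wrapped continuation lines back onto their record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def find_rejects(lines):
    """Return one dict per req_reject record found in the log lines."""
    hits = []
    for rec in join_wrapped(lines):
        fields = rec.split(";", 5)
        if len(fields) < 6 or fields[4] != "req_reject":
            continue
        msg = fields[5]
        code = CODE.search(msg)
        req = REQ_TYPE.search(msg)
        job = JOB.search(msg)
        hits.append({
            "time": fields[0],
            "code": int(code.group(1)) if code else None,
            "type": req.group(1) if req else None,
            "job": job.group(1) if job else None,
        })
    return hits
```

Feeding it the lines of the log above should report a single ModifyJob reject with code 15001, which arrives before the QueueJob request for the same job.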

I found a reference to this on the torque mailing list. Below is the actual
mail content:
---------------------BEGIN MAIL-----------------------------------------------
Garrick Staples garrick at clusterresources.com

On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:
> On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
> > On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
> > > On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
> > > > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
> > > > > Hi!
> > > > >
> > > > > I think this have been adressed before but i can't find any info.
> > > > >
> > > > > We are getting loads of
> > > > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> > > > > REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
> > > > > 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
> > > > > PBS_Server at ingrid-i.hpc2n.umu.se
> > > > >
> > > > > I think they are related to stage-in/out but exactly what should we be
> > > > > looking for.
> > > > >
> > > > > torque version ranging from 2.0.0p4 to 2.1.2.
> > > >
> > > > This happens with every job, right?  And you are using maui/moab, right?
> > > >
> > > > If so, that is maui/moab reseting the job's neednodes resource after
> > > > starting the job.  This is a work-around for a mythical bug in job
> > > > starts in OpenPBS that noone has ever been able to demonstrate to me.
> > >
> > > It doesn't happen on every job, only those that do explicit stagein/out.
> > > The attrlist is "resource" and this is what happens...
> > >
> > > And yes this is with maui.
> > > Jobs without the initial CopyFiles request never gets any Modify
> > > rejects.
> >
> > IIRC, it is actually a race condition.  stagein and longer prologues
> > will cause the error message.  It is mostly harmless, but there are some
> > rare bad things.  I have a patch for maui if you want (moab has
> > tuneable, something like NOAUTONEEDNODE).
>
> Yes definitely something i want.
>
> But isn't this something that should really be done in torque?
> Shouldn't it get a jobid to the mom before starting stagein?

You'd think so, but no.  stagein happens before the job is moved to the
node.  I think the idea is to allow for "pre-stagein".
---------------------END MAIL-------------------------------------------------

I just added 'NOAUTONEEDNODE' to moab.cfg. The job starts, but the errors
are the same and the PBS_NODEFILE environment variable is still absent.
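
In case it helps anyone reproduce this, a check along the following lines can be run from inside the job to see whether the variable is present. This is only a minimal Python sketch: the helper name read_nodefile is made up, and it assumes the usual convention that PBS_NODEFILE holds the path to a file with one allocated host per line.

```python
import os


def read_nodefile(env=None):
    """Return the allocated hosts, or None if PBS_NODEFILE is unusable."""
    env = os.environ if env is None else env
    path = env.get("PBS_NODEFILE")
    if not path:
        return None  # variable missing or empty, as in the failing jobs
    try:
        with open(path) as fh:
            # One host name per line; skip blank lines.
            return [line.strip() for line in fh if line.strip()]
    except OSError:
        return None  # variable set, but the file cannot be read


if __name__ == "__main__":
    nodes = read_nodefile()
    if nodes is None:
        print("PBS_NODEFILE is not set or not readable")
    else:
        print("allocated nodes:", " ".join(nodes))
```

Running it once in a job with a stagein requirement and once in a job without one should show whether only the stagein jobs lose the variable.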


This seems to be a known bug, but I was not able to find much reference
material (or a solution) for it. I also couldn't find any reference in the
moab documentation to the 'NOAUTONEEDNODE' parameter mentioned by Garrick
Staples.

Has this bug been fixed, or is there any workaround for the problem?

-- 
Regards--
Rishi Pathak