[torqueusers] Re: [Moabusers] Resource : neednodes, PBS_NODEFILE vanishes if stagein requirement is specified

rishi pathak mailmaverick666 at gmail.com
Fri Dec 7 01:07:58 MST 2007


Hi Brady,

I tested with TORQUE 2.2.1. The node file still does not get created.
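
One way to confirm the symptom from inside a job is to check whether PBS_NODEFILE is set and points at a real file. The following is only a minimal sketch in Python; the helper name read_nodefile is hypothetical, and the only thing it relies on is the PBS_NODEFILE environment variable discussed in this thread.

```python
import os


def read_nodefile(env=None):
    """Return the hosts listed in $PBS_NODEFILE, or None if it is unset or missing.

    read_nodefile is a hypothetical helper name, not part of TORQUE or Moab.
    """
    env = os.environ if env is None else env
    path = env.get("PBS_NODEFILE")
    # Treat an unset variable and a path with no file behind it the same way.
    if not path or not os.path.isfile(path):
        return None
    with open(path) as fh:
        # One host name per line; skip blank lines.
        return [line.strip() for line in fh if line.strip()]


if __name__ == "__main__":
    nodes = read_nodefile()
    if nodes is None:
        print("PBS_NODEFILE is unset or the file does not exist")
    else:
        print("allocated nodes:", " ".join(nodes))
```

Run from the job script, once with and once without the stagein requirement, this would show whether the node file is missing only in the stagein case.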

On 12/6/07, Brady Kimball <bkimball at clusterresources.com> wrote:
>
> Rishi,
>
> Try using the new configure option (as of TORQUE 2.2.1)
> "--enable-force-nodefile".  This should remove the check for neednodes
> when writing the node file.  Let me know if this doesn't work.
>
> rishi pathak wrote:
> > Our configuration is as follows:
> > torque version: 2.1.6
> > Moab server version 5.1.0p4
> > The problem we are facing is that when a job specifies a stagein
> > requirement, the PBS_NODEFILE (allocated nodes) environment variable is
> > not available to the job. Below is the Moab log for the job:
> > 12/06 11:45:51 WARNING:  cannot set job '7142.head.compute.in' attr
> > 'Resource_List:neednodes' to '' (rc: 15001 'Unknown Job Id')
> > 12/06 11:45:51 INFO:     job '7142' successfully started
> > 12/06 11:45:51 INFO:     starting job '7142'
> > 12/06 11:45:51 INFO:     1 jobs started on iteration 1
> >
> > The corresponding pbs_mom log is:
> > 12/06/2007 11:38:54;0080;   pbs_mom;Req;req_reject;Reject reply
> > code=15001(Unknown Job Id REJHOST=amd16.compute.in MSG=modify job
> > failed, unknown job 7142.amd01.head.compute.in), aux=0,
> > type=ModifyJob, from PBS_Server at head.compute.in
> > 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type QueueJob request
> > received from PBS_Server at head.compute.in, sock=11
> > 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type JobScript request
> > received from PBS_Server at amd01.npsf.cdac.ernet.in, sock=11
> > 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type ReadyToCommit request
> > received from PBS_Server at head.compute.in, sock=11
> > 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type Commit request received
> > from PBS_Server at head.compute.in, sock=11
> > 12/06/2007 11:38:54;0001;   pbs_mom;Job;TMomFinalizeJob3;job
> > 7142.head.compurte.in started, pid = 2687
> > 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type StatusJob request
> > received from PBS_Server at head.compute.in, sock=10
> > 12/06/2007 11:38:54;0080;   pbs_mom;Job;7142.head.compute.in;scan_for_terminated:
> > job 7142.head.compute.in task 1 terminated, sid 2687
> > 12/06/2007 11:38:54;0008;   pbs_mom;Job;7142.head.compute.in;job was
> > terminated
> >
> > I found a reference to this on the torqueusers mailing list. Below is
> > the actual mail content:
> > ----------------------------- BEGIN MAIL -----------------------------
> > Garrick Staples garrick at clusterresources.com
> >
> > On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:
> > > On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
> > > > On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
> > > > > On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
> > > > > > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
> > > > > > > Hi!
> > > > > > >
> > > > > > > I think this has been addressed before but I can't find any
> > > > > > > info.
> > > > > > >
> > > > > > > We are getting loads of
> > > > > > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> > > > > > > REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
> > > > > > > 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
> > > > > > > PBS_Server at ingrid-i.hpc2n.umu.se
> > > > > > >
> > > > > > > I think they are related to stage-in/out but exactly what
> > > > > > > should we be looking for.
> > > > > > >
> > > > > > > torque version ranging from 2.0.0p4 to 2.1.2.
> > > > > >
> > > > > > This happens with every job, right?  And you are using
> > > > > > maui/moab, right?
> > > > > >
> > > > > > If so, that is maui/moab resetting the job's neednodes resource
> > > > > > after starting the job.  This is a work-around for a mythical
> > > > > > bug in job starts in OpenPBS that no one has ever been able to
> > > > > > demonstrate to me.
> > > > >
> > > > > It doesn't happen on every job, only those that do explicit
> > > > > stagein/out.
> > > > > The attrlist is "resource" and this is what happens...
> > > > >
> > > > > And yes this is with maui.
> > > > > Jobs without the initial CopyFiles request never get any Modify
> > > > > rejects.
> > > >
> > > > IIRC, it is actually a race condition.  stagein and longer prologues
> > > > will cause the error message.  It is mostly harmless, but there are
> > > > some rare bad things.  I have a patch for maui if you want (moab has
> > > > tuneable, something like NOAUTONEEDNODE).
> > >
> > > Yes definitely something I want.
> > >
> > > But isn't this something that should really be done in torque?
> > > Shouldn't it get a jobid to the mom before starting stagein?
> >
> > You'd think so, but no.  stagein happens before the job is moved to the
> > node.  I think the idea is to allow for "pre-stagein".
> > ------------------------------ END MAIL ------------------------------
> >
> > I just added 'NOAUTONEEDNODE' to moab.cfg. The job starts, but the
> > errors are the same and the PBS_NODEFILE environment variable is still
> > absent.
> >
> >
> >
> > It seems like this is a known bug, but I was not able to find much
> > reference to it (or a solution). I also couldn't find any reference in
> > the Moab documentation to the 'NOAUTONEEDNODE' parameter mentioned by
> > Garrick Staples.
> >
> > Has this bug been fixed, or is there any workaround for the problem?
> >
> > --
> > Regards--
> > Rishi Pathak


-- 
Regards--
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing(C-DAC)
Pune University Campus,Ganesh Khind Road
Pune-Maharastra