[torqueusers] reply code=15001...
Åke Sandgren
ake.sandgren at hpc2n.umu.se
Wed Oct 11 11:17:01 MDT 2006
On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
> On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
> > On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
> > > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
> > > > Hi!
> > > >
> > > > I think this has been addressed before, but I can't find any info.
> > > >
> > > > We are getting loads of
> > > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> > > > REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
> > > > 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
> > > > PBS_Server at ingrid-i.hpc2n.umu.se
> > > >
> > > > I think they are related to stage-in/out, but what exactly should we be
> > > > looking for?
> > > >
> > > > torque version ranging from 2.0.0p4 to 2.1.2.
> > >
> > > This happens with every job, right? And you are using maui/moab, right?
> > >
> > > If so, that is maui/moab resetting the job's neednodes resource after
> > > starting the job. This is a work-around for a mythical bug in job
> > > starts in OpenPBS that no one has ever been able to demonstrate to me.
> >
> > It doesn't happen on every job, only those that do explicit stagein/out.
> > The attrlist is "resource" and this is what happens...
> >
> > And yes this is with maui.
> > Jobs without the initial CopyFiles request never get any Modify
> > rejects.
>
> IIRC, it is actually a race condition. stagein and longer prologues
> will cause the error message. It is mostly harmless, but there are some
> rare bad things. I have a patch for maui if you want (moab has a
> tunable, something like NOAUTONEEDNODE).
Yes, that is definitely something I want.
But isn't this something that should really be done in torque?
Shouldn't it get a jobid to the mom before starting stagein?
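
To see which jobs actually trigger these rejects, something like the
following could tally them from the mom logs. It is only a sketch: it
assumes each reject message sits on a single log line in the form quoted
above, and the names parse_reject and count_modify_rejects are made up
for illustration, not part of torque or maui.

```python
import re
from collections import Counter

# Matches the pbs_mom reject message quoted above.
# Assumption: the whole message is on one log line.
REJECT_RE = re.compile(
    r"Reject reply code=(?P<code>\d+)\(.*?unknown job (?P<jobid>\S+?)\), "
    r"aux=\d+, type=(?P<type>\w+)"
)


def parse_reject(line):
    """Return code, job id and request type for a reject line, else None."""
    m = REJECT_RE.search(line)
    if m is None:
        return None
    return {
        "code": int(m.group("code")),
        "jobid": m.group("jobid"),
        "type": m.group("type"),
    }


def count_modify_rejects(lines):
    """Count 15001 ModifyJob rejects per job id."""
    counts = Counter()
    for line in lines:
        rec = parse_reject(line)
        if rec and rec["code"] == 15001 and rec["type"] == "ModifyJob":
            counts[rec["jobid"]] += 1
    return counts


# The message from this thread, joined onto one line.
SAMPLE = (
    "pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id "
    "REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job "
    "392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from "
    "PBS_Server at ingrid-i.hpc2n.umu.se"
)

if __name__ == "__main__":
    print(parse_reject(SAMPLE))
```

The resulting job ids can then be compared against the jobs that
requested stagein/stageout to confirm the correlation.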
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se