[torqueusers] reply code=15001...
Garrick Staples
garrick at clusterresources.com
Wed Oct 11 11:19:37 MDT 2006
On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:
> On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
> > On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
> > > On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
> > > > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
> > > > > Hi!
> > > > >
> > > > > I think this has been addressed before but I can't find any info.
> > > > >
> > > > > We are getting loads of
> > > > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> > > > > REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
> > > > > 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
> > > > > PBS_Server at ingrid-i.hpc2n.umu.se
> > > > >
> > > > > I think they are related to stage-in/out, but what exactly should we
> > > > > be looking for?
> > > > >
> > > > > torque version ranging from 2.0.0p4 to 2.1.2.
> > > >
> > > > This happens with every job, right? And you are using maui/moab, right?
> > > >
> > > > If so, that is maui/moab resetting the job's neednodes resource after
> > > > starting the job. This is a work-around for a mythical bug in job
> > > > starts in OpenPBS that no one has ever been able to demonstrate to me.
> > >
> > > It doesn't happen on every job, only those that do explicit stagein/out.
> > > The attrlist is "resource" and this is what happens...
> > >
> > > And yes this is with maui.
> > > Jobs without the initial CopyFiles request never get any Modify
> > > rejects.
> >
> > IIRC, it is actually a race condition. stagein and longer prologues
> > will cause the error message. It is mostly harmless, but there are some
> > rare bad things. I have a patch for maui if you want (moab has a
> > tunable, something like NOAUTONEEDNODE).
>
> Yes, definitely something I want.
>
> But isn't this something that should really be done in torque?
> Shouldn't it get a jobid to the mom before starting stagein?
You'd think so, but no. stagein happens before the job is moved to the
node. I think the idea is to allow for "pre-stagein".
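
To make the ordering concrete, here is a minimal sketch in C. It is not
TORQUE or maui code: mom_t, mom_receive_job and mom_modify_job are
hypothetical names made up for illustration, and the mom is reduced to a
list of job ids. The only detail taken from the real system is the 15001
"Unknown Job Id" code seen in the log. It assumes a modify request is
rejected whenever the mom has not yet been given the job.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PBSE_NONE     0
#define PBSE_UNKJOBID 15001   /* "Unknown Job Id", as in the log message */

#define MAX_JOBS  8           /* arbitrary limits for this sketch */
#define JOBID_LEN 64

/* Hypothetical stand-in for the set of jobs a mom knows about. */
typedef struct {
    char jobs[MAX_JOBS][JOBID_LEN];
    int  count;
} mom_t;

/* The job is moved to the node; with stagein this happens only after
 * the file copy has finished. Returns -1 if the table is full. */
int mom_receive_job(mom_t *mom, const char *jobid)
{
    assert(mom != NULL && jobid != NULL);
    if (mom->count >= MAX_JOBS)
        return -1;
    snprintf(mom->jobs[mom->count], JOBID_LEN, "%s", jobid);
    mom->count++;
    return PBSE_NONE;
}

/* A modify request for a job: rejected with 15001 if the mom has not
 * seen that job id yet, accepted otherwise. */
int mom_modify_job(const mom_t *mom, const char *jobid)
{
    assert(mom != NULL && jobid != NULL);
    for (int i = 0; i < mom->count; i++) {
        if (strcmp(mom->jobs[i], jobid) == 0)
            return PBSE_NONE;
    }
    return PBSE_UNKJOBID;
}
```

With an empty mom_t, calling mom_modify_job() before mom_receive_job()
returns 15001, and calling it afterwards returns 0. A stagein or a long
prologue just widens the window in which the first ordering can occur.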
-------------- next part --------------
Index: src/moab/MPBSI.c
===================================================================
RCS file: /usr/local/nfs/src/cvs_repository/maui/src/moab/MPBSI.c,v
retrieving revision 1.14
diff -u -r1.14 MPBSI.c
--- src/moab/MPBSI.c 5 Nov 2005 02:42:08 -0000 1.14
+++ src/moab/MPBSI.c 23 May 2006 01:50:11 -0000
@@ -1792,6 +1792,7 @@
return(FAILURE);
}
+/*
if (MPBSJobModify(
J,
R,
@@ -1826,6 +1827,7 @@
J->Name,
HostList);
}
+*/
}
else
{
@@ -1904,7 +1906,7 @@
MJobGetName(J,NULL,R,tmpJobName,sizeof(tmpJobName),mjnRMName);
- rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,MasterHost,NULL);
+ rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,HostList,NULL);
if (rc != 0)
{
@@ -1928,6 +1930,7 @@
JobStartFailed = TRUE;
}
+/*
if (J->NeedNodes != NULL)
{
if (MPBSJobModify(
@@ -1949,6 +1952,7 @@
J->NeedNodes);
}
}
+*/
if (JobStartFailed == TRUE)
{