[torqueusers] reply code=15001...

Garrick Staples garrick at clusterresources.com
Wed Oct 11 11:19:37 MDT 2006


On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:
> On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
> > On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
> > > On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
> > > > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
> > > > > Hi!
> > > > > 
> > > > > I think this has been addressed before, but I can't find any info.
> > > > > 
> > > > > We are getting loads of
> > > > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> > > > > REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
> > > > > 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
> > > > > PBS_Server at ingrid-i.hpc2n.umu.se
> > > > > 
> > > > > I think they are related to stage-in/out, but what exactly should we
> > > > > be looking for?
> > > > > 
> > > > > Torque versions range from 2.0.0p4 to 2.1.2.
> > > > 
> > > > This happens with every job, right?  And you are using maui/moab, right?
> > > > 
> > > > If so, that is maui/moab resetting the job's neednodes resource after
> > > > starting the job.  This is a work-around for a mythical bug in job
> > > > starts in OpenPBS that no one has ever been able to demonstrate to me.
> > > 
> > > It doesn't happen on every job, only those that do explicit stagein/out.
> > > The attrlist is "resource" and this is what happens...
> > > 
> > > And yes this is with maui.
> > > Jobs without the initial CopyFiles request never get any Modify
> > > rejects.
> > 
> > IIRC, it is actually a race condition.  stagein and longer prologues
> > will cause the error message.  It is mostly harmless, but there are some
> > rare bad things.  I have a patch for maui if you want (moab has a
> > tunable for this, something like NOAUTONEEDNODE).
> 
> Yes, definitely something I want.
> 
> But isn't this something that should really be done in torque?
> Shouldn't it get a jobid to the mom before starting stagein?

You'd think so, but no.  stagein happens before the job is moved to the
node.  I think the idea is to allow for "pre-stagein".

-------------- next part --------------
Index: src/moab/MPBSI.c
===================================================================
RCS file: /usr/local/nfs/src/cvs_repository/maui/src/moab/MPBSI.c,v
retrieving revision 1.14
diff -u -r1.14 MPBSI.c
--- src/moab/MPBSI.c	5 Nov 2005 02:42:08 -0000	1.14
+++ src/moab/MPBSI.c	23 May 2006 01:50:11 -0000
@@ -1792,6 +1792,7 @@
       return(FAILURE);
       }
 
+/*
     if (MPBSJobModify(
           J,
           R,
@@ -1826,6 +1827,7 @@
         J->Name,
         HostList);
       }
+*/
     }
   else
     {
@@ -1904,7 +1906,7 @@
 
   MJobGetName(J,NULL,R,tmpJobName,sizeof(tmpJobName),mjnRMName);       
 
-  rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,MasterHost,NULL);
+  rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,HostList,NULL);
 
   if (rc != 0)
     {
@@ -1928,6 +1930,7 @@
     JobStartFailed = TRUE;
     }
 
+/*
   if (J->NeedNodes != NULL)
     {
     if (MPBSJobModify(
@@ -1949,6 +1952,7 @@
         J->NeedNodes);
       }
     }
+*/
 
   if (JobStartFailed == TRUE)
     {

