[torqueusers] reply code=15001...

nathaniel.x.woody at gsk.com nathaniel.x.woody at gsk.com
Thu Oct 25 11:12:32 MDT 2007


Huh, to follow up on this, what are the rare Bad Things that can happen 
here (I decided years ago to ignore the millions of these we get)? 

Best,
Nate





"Gonzalo Merino" <merino at pic.es> 
Sent by: torqueusers-bounces at supercluster.org
Date: 25-Oct-2007 13:05
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] reply code=15001...






Hello Garrick and others,

We are running these versions of maui and torque:
 maui-3.2.6p19
 torque-2.1.8

And we see lots of these 15001 errors all the time. Sometimes the job starts 
immediately after the error appears in the pbs_mom log, but other times 
the job never starts and simply fails.

It definitely smells like some race condition, as you mentioned. 
Do you know if the patch you sent one year ago is already included in some 
recent maui version?

thanks a lot,
Gonzalo

Garrick Staples wrote: 
On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:
 
On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
 
On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
 
On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
 
On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
 
Hi!

I think this has been addressed before, but I can't find any info.

We are getting loads of
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
PBS_Server at ingrid-i.hpc2n.umu.se

I think they are related to stage-in/out, but what exactly should we be
looking for?

torque version ranging from 2.0.0p4 to 2.1.2.
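
[Editor's sketch] To see which jobs trigger the reject quoted above, the pbs_mom log can be scanned for it. The following is a minimal sketch in C, not part of TORQUE: the function name `parse_modify_reject` is made up for illustration, and it assumes each log record sits on a single line with the "unknown job <id>)" wording shown in the excerpt (the excerpt above is wrapped by the mail client).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not a TORQUE API.
 * Returns 1 and copies the job id into `jobid` if `line` looks like the
 * 15001 ModifyJob reject quoted in this thread; returns 0 otherwise.
 * Assumes the record is on one line and contains "unknown job <id>)". */
int parse_modify_reject(const char *line, char *jobid, size_t cap)
{
    static const char marker[] = "unknown job ";
    const char *p;
    size_t n;

    if (cap == 0 || strstr(line, "reply code=15001") == NULL)
        return 0;
    if (strstr(line, "type=ModifyJob") == NULL)
        return 0;

    /* Case-sensitive search: skips "Unknown Job Id", finds the MSG text. */
    p = strstr(line, marker);
    if (p == NULL)
        return 0;
    p += sizeof(marker) - 1;

    /* The job id runs up to the closing parenthesis or whitespace. */
    n = strcspn(p, "), \t\r\n");
    if (n == 0 || n >= cap)
        return 0;

    memcpy(jobid, p, n);
    jobid[n] = '\0';
    return 1;
}
```

Feeding each log line through this helper and tallying the returned job ids would show whether the rejects cluster on jobs that request stage-in, which is the question raised here.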
 
This happens with every job, right?  And you are using maui/moab, right?

If so, that is maui/moab resetting the job's neednodes resource after
starting the job.  This is a work-around for a mythical bug in job
starts in OpenPBS that no one has ever been able to demonstrate to me.
 
It doesn't happen on every job, only those that do explicit stagein/out.
The attrlist is "resource" and this is what happens...

And yes this is with maui.
Jobs without the initial CopyFiles request never get any Modify
rejects.
 
IIRC, it is actually a race condition.  stagein and longer prologues
will cause the error message.  It is mostly harmless, but there are some
rare bad things.  I have a patch for maui if you want (moab has a
tunable for this, something like NOAUTONEEDNODE).
 
Yes, that is definitely something I want.

But isn't this something that should really be done in torque?
Shouldn't it get a jobid to the mom before starting stagein?
 

You'd think so, but no.  stagein happens before the job is moved to the
node.  I think the idea is to allow for "pre-stagein".
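
[Editor's sketch] The ordering described in this thread can be modelled as a toy timeline: the scheduler's follow-up ModifyJob reaches the MOM either before or after the job itself does, and anything that delays the job (stage-in, a long prologue) widens the window for the reject. This is only an illustration under stated assumptions; `modify_reply` and the tick values are invented, and only the code 15001 comes from the log.

```c
#include <stdio.h>

/* Reject code taken from the log excerpt in this thread. */
#define REJECT_UNKNOWN_JOB 15001

/* Toy model, not TORQUE code. Times are arbitrary ticks counted from the
 * moment the run request is accepted.
 *   stagein_ticks: delay before the job is known to the MOM (assumed to
 *                  cover stage-in and prologue time)
 *   modify_tick:   when the follow-up ModifyJob arrives at the MOM
 * Returns 0 if the MOM already knows the job, else REJECT_UNKNOWN_JOB. */
int modify_reply(int stagein_ticks, int modify_tick)
{
    int job_known_at = stagein_ticks;

    if (modify_tick >= job_known_at)
        return 0;                  /* job arrived first: modify succeeds */
    return REJECT_UNKNOWN_JOB;     /* modify won the race: reject logged */
}
```

With no stage-in (`stagein_ticks` of 0) any modify succeeds, which matches the report that jobs without the initial CopyFiles request never see the reject.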

 


Index: src/moab/MPBSI.c
===================================================================
RCS file: /usr/local/nfs/src/cvs_repository/maui/src/moab/MPBSI.c,v
retrieving revision 1.14
diff -u -r1.14 MPBSI.c
--- src/moab/MPBSI.c             5 Nov 2005 02:42:08 -0000 1.14
+++ src/moab/MPBSI.c             23 May 2006 01:50:11 -0000
@@ -1792,6 +1792,7 @@
       return(FAILURE);
       }
 
+/*
     if (MPBSJobModify(
           J,
           R,
@@ -1826,6 +1827,7 @@
         J->Name,
         HostList);
       }
+*/
     }
   else
     {
@@ -1904,7 +1906,7 @@
 
   MJobGetName(J,NULL,R,tmpJobName,sizeof(tmpJobName),mjnRMName); 
 
-  rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,MasterHost,NULL);
+  rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,HostList,NULL);
 
   if (rc != 0)
     {
@@ -1928,6 +1930,7 @@
     JobStartFailed = TRUE;
     }
 
+/*
   if (J->NeedNodes != NULL)
     {
     if (MPBSJobModify(
@@ -1949,6 +1952,7 @@
         J->NeedNodes);
       }
     }
+*/
 
   if (JobStartFailed == TRUE)
     {
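
[Editor's sketch] The patch above disables the post-start modify by commenting it out, and also passes HostList instead of MasterHost to pbs_runjob. The thread mentions that moab exposes a tunable for the same behaviour. A runtime switch of that kind might look roughly like the sketch below. Every name in it (`job_t`, `stub_job_modify`, `post_start_fixup`, `auto_neednodes`) is hypothetical and stands in for the real maui structures and for MPBSJobModify, whose actual signature is not shown in full in the diff.

```c
#include <stdio.h>

/* Hypothetical stand-in for the scheduler's job record. */
typedef struct {
    const char *name;
    const char *need_nodes;   /* NULL when no neednodes value is set */
} job_t;

static int modify_calls = 0;  /* counts how often the stub is invoked */

/* Stub standing in for the real modify request; returns 1 when "sent". */
static int stub_job_modify(const job_t *j)
{
    (void)j;
    modify_calls++;
    return 1;
}

/* Sends the post-start neednodes modify only when the (hypothetical)
 * auto_neednodes switch is on. Returns 1 if a modify was sent, else 0. */
int post_start_fixup(const job_t *j, int auto_neednodes)
{
    if (!auto_neednodes)
        return 0;             /* behaviour of the patch: never modify */
    if (j->need_nodes == NULL)
        return 0;             /* nothing to reset */
    return stub_job_modify(j);
}
```

A switch like this keeps the original behaviour available for sites that depend on it, whereas the commented-out version removes it at build time.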
 


_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
