[torqueusers] reply code=15001...

Gonzalo Merino merino at pic.es
Thu Oct 25 11:05:57 MDT 2007


Hello Garrick and others,

We are running this version of maui and torque:
 maui-3.2.6p19
 torque-2.1.8

And we see lots of these 15001 errors all the time. Sometimes the job
starts immediately after the error appears in the pbs_mom log, but other
times the job never starts and simply fails.
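
For anyone wanting to check which jobs hit the reject, here is a minimal
sketch in C that picks the job ID out of such a pbs_mom log line. The
function name parse_modify_reject is made up for illustration, and the
field layout is assumed from the log excerpt quoted further down in this
thread, so it may need adjusting for other torque versions.

```c
#include <string.h>

/*
 * Return 1 if `line` looks like a pbs_mom "Reject reply code=15001"
 * entry for a ModifyJob request, and copy the job ID that follows
 * "unknown job " into `jobid`. Return 0 otherwise.
 * Assumption: the layout matches the log excerpt quoted in this thread.
 */
int parse_modify_reject(const char *line, char *jobid, size_t size)
{
    static const char marker[] = "unknown job ";
    const char *p;
    size_t len;

    if (size == 0)
        return 0;
    if (strstr(line, "Reject reply code=15001") == NULL)
        return 0;
    if (strstr(line, "type=ModifyJob") == NULL)
        return 0;

    /* Case-sensitive search, so the earlier "Unknown Job Id" is skipped. */
    p = strstr(line, marker);
    if (p == NULL)
        return 0;
    p += sizeof(marker) - 1;

    /* The job ID runs up to a closing parenthesis, comma, or whitespace. */
    len = strcspn(p, "), \t\r\n");
    if (len == 0 || len >= size)
        return 0;

    memcpy(jobid, p, len);
    jobid[len] = '\0';
    return 1;
}
```

Called on each line read from the mom log (for example with fgets), the
collected job IDs can then be compared against the jobs that never started.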

It definitely smells like a race condition, as you mentioned.
Do you know if the patch you sent one year ago is already included in
some recent maui version?

thanks a lot,
Gonzalo

Garrick Staples wrote:
> On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:
>   
>> On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
>>     
>>> On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
>>>       
>>>> On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
>>>>         
>>>>> On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
>>>>>           
>>>>>> Hi!
>>>>>>
>>>>>> I think this has been addressed before, but I can't find any info.
>>>>>>
>>>>>> We are getting loads of
>>>>>> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
>>>>>> REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
>>>>>> 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
>>>>>> PBS_Server at ingrid-i.hpc2n.umu.se
>>>>>>
>>>>>> I think they are related to stage-in/out, but what exactly should we
>>>>>> be looking for?
>>>>>>
>>>>>> torque version ranging from 2.0.0p4 to 2.1.2.
>>>>>>             
>>>>> This happens with every job, right?  And you are using maui/moab, right?
>>>>>
>>>>> If so, that is maui/moab resetting the job's neednodes resource after
>>>>> starting the job.  This is a work-around for a mythical bug in job
>>>>> starts in OpenPBS that no one has ever been able to demonstrate to me.
>>>>>           
>>>> It doesn't happen on every job, only those that do explicit stagein/out.
>>>> The attrlist is "resource" and this is what happens...
>>>>
>>>> And yes this is with maui.
>>>> Jobs without the initial CopyFiles request never get any Modify
>>>> rejects.
>>>>         
>>> IIRC, it is actually a race condition.  stagein and longer prologues
>>> will cause the error message.  It is mostly harmless, but there are some
>>> rare bad things.  I have a patch for maui if you want (moab has a
>>> tunable, something like NOAUTONEEDNODE).
>>>       
>> Yes, definitely something I want.
>>
>> But isn't this something that should really be done in torque?
>> Shouldn't it get a jobid to the mom before starting stagein?
>>     
>
> You'd think so, but no.  stagein happens before the job is moved to the
> node.  I think the idea is to allow for "pre-stagein".
>
>   
> ------------------------------------------------------------------------
>
> Index: src/moab/MPBSI.c
> ===================================================================
> RCS file: /usr/local/nfs/src/cvs_repository/maui/src/moab/MPBSI.c,v
> retrieving revision 1.14
> diff -u -r1.14 MPBSI.c
> --- src/moab/MPBSI.c	5 Nov 2005 02:42:08 -0000	1.14
> +++ src/moab/MPBSI.c	23 May 2006 01:50:11 -0000
> @@ -1792,6 +1792,7 @@
>        return(FAILURE);
>        }
>  
> +/*
>      if (MPBSJobModify(
>            J,
>            R,
> @@ -1826,6 +1827,7 @@
>          J->Name,
>          HostList);
>        }
> +*/
>      }
>    else
>      {
> @@ -1904,7 +1906,7 @@
>  
>    MJobGetName(J,NULL,R,tmpJobName,sizeof(tmpJobName),mjnRMName);       
>  
> -  rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,MasterHost,NULL);
> +  rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,HostList,NULL);
>  
>    if (rc != 0)
>      {
> @@ -1928,6 +1930,7 @@
>      JobStartFailed = TRUE;
>      }
>  
> +/*
>    if (J->NeedNodes != NULL)
>      {
>      if (MPBSJobModify(
> @@ -1949,6 +1952,7 @@
>          J->NeedNodes);
>        }
>      }
> +*/
>  
>    if (JobStartFailed == TRUE)
>      {
>   
> ------------------------------------------------------------------------
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>   