[torquedev] Rerunable jobs not restarting when nodes reboot

Glen Beane glen.beane at gmail.com
Fri May 22 21:58:08 MDT 2009


On Thu, May 7, 2009 at 11:27 AM, Victor Gregorio
<vgregorio at penguincomputing.com> wrote:
> On Wed, May 06, 2009 at 08:20:21AM -0400, Glen Beane wrote:
>> On Tue, May 5, 2009 at 2:40 PM, Victor Gregorio
>> <vgregorio at penguincomputing.com> wrote:
>> > Hey folks,
>> >
>> > I believe Josh Bernstein and I are close to a solution for
>> > 2.3-fixes.
>> >
>> > We believe that examine_all_running_jobs() should not assume the
>> > ti_exitstat is 0 when "no active process [is] found".  Replacing 0
>> > with JOB_EXEC_INITABT (-4 /* job aborted on MOM initialization */ )
>> > allows multi-node, rerunable jobs to properly restart when all
>> > execution nodes reboot.
>> >
>> > Otherwise, the 2.3-fixes tree will only properly rerun rerunable
>> > jobs if the job runs on a single execution node (no sisters).
>> >
>> > Index: src/resmom/mom_main.c
>> > ===================================================================
>> > --- src/resmom/mom_main.c   (revision 2909)
>> > +++ src/resmom/mom_main.c   (working copy)
>> > @@ -7830,7 +7830,7 @@
>> >               "no active process found");
>> >             }
>> >
>> > -          ptask->ti_qs.ti_exitstat = 0;
>> > +          ptask->ti_qs.ti_exitstat = JOB_EXEC_INITABT;
>> >
>> >           ptask->ti_qs.ti_status = TI_STATE_EXITED;
>> >           pjob->ji_qs.ji_un.ji_momt.ji_exitstat = 0;
>> >
>> > Note that also setting ji_exitstat to JOB_EXEC_INITABT made no
>> > difference as long as ti_exitstat was set to JOB_EXEC_INITABT.
>> > Should both be set?
>> >
>> > Either way, is it sane to be setting a task's exit status using
>> > JOB_EXEC_INITABT?  We noticed that using -1 did not solve the problem.
>> >
>> > Finally, note that we tested using two patches: the above patch to
>> > mom_main.c and the patch below to requests.c.  All tests look good
>> > so far.
>>
>>
>>
>> Thanks for the patches, although I think this will need more
>> investigation before it should be considered the final solution to the
>> problem.  I think using JOB_EXEC_INITABT to set ti_exitstat in this
>
> No problem.  We absolutely agree that this is no final solution :)
>
>> case is abusing it, since that is not conveying the correct failure (I
>> don't think just assigning a random value to ti_exitstat is the right
>> thing to do).
>
> Well, we picked JOB_EXEC_INITABT (-4) purposely. Logs showed rerunable
> jobs without sisters restarting properly after pbs_server reported the
> jobs exiting with status -4.
>
> PBS_Server;Job;6.tesla;job exit status -4 handled
>
>> I would like to look into what effect ptask->ti_qs.ti_status =
>> TI_STATE_EXITED has.  TI_STATE_EXITED means that ti_exitstat is valid.
>>  I would guess (I don't know for sure yet) that if the status is
>> TI_STATE_EXITED then the ti_exitstat is eventually getting used for
>> the job's exit status, and a -1 ji_exitstat as you tried means "job
>
> I think that the task status eventually trickles down to the job
> status...
>
> src/resmom/catch_child.c:
>
>       if (ptask->ti_qs.ti_parenttask == TM_NULL_TASK)
>         {
>         /* master task is in state TI_STATE_EXITED */
>
>         pjob->ji_qs.ji_un.ji_momt.ji_exitstat = ptask->ti_qs.ti_exitstat;
>
>
>> exec failed, before files, no retry".   Perhaps the "no retry"
>> prevents it from being rerun.
>
> Honestly, setting the -1 value was just to check that the -4 value did
> not "work" just because it was a negative number.  I don't know the code
> well enough yet, so I was double checking my stabs in the dark.
>
>> Also,  is there any chance that in some cases setting ti_exitstat to
>> zero might be the correct thing to do?  We don't want to break
>> anything else here.
>
> Good point.
>
>> I would like to understand how all of this stuff works together rather
>> than just assigning a different value to ti_exitstat and saying "OK,
>> it works for my case" without knowing why it works or if it breaks
>> anything else. If you have done this, please let me know, so I don't
>> have to repeat your investigation.
>
> Agreed.  Things just felt on the right track and we needed your
> assistance.  What is the next step?  How can we help?


At this point you probably know as much about the code involved in
this as I do.

I guess we need to make sure there is no case where we want to set the
exit status to zero when no active processes are found.  (What if the
processes exited successfully while pbs_mom was shut down?)  The other
thing to do is to find the right status (or create a new one) to set
ptask->ti_qs.ti_exitstat to, one that describes the situation more
accurately than JOB_EXEC_INITABT.
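
To make those two points concrete, here is a rough sketch in C.  It is
not TORQUE code and not a proposed patch.  The struct task_probe, the
function classify_task() and the status JOB_EXEC_LOSTPROC_HYPOTHETICAL
(with its value of -100) are all made up for illustration; only the -1
and -4 values and their meanings come from this thread.  The sketch also
assumes pbs_mom could somehow tell that a task had already exited with a
known status, which is exactly the part that still needs investigating.

```c
#include <stdbool.h>

/* Values and meanings as quoted in this thread. */
#define JOB_EXEC_FAIL1    (-1)  /* job exec failed, before files, no retry */
#define JOB_EXEC_INITABT  (-4)  /* job aborted on MOM initialization */

/* HYPOTHETICAL: not an existing TORQUE status; the value is arbitrary. */
#define JOB_EXEC_LOSTPROC_HYPOTHETICAL  (-100)

/* HYPOTHETICAL, simplified view of what MOM knows about a task on restart. */
struct task_probe
  {
  bool process_found;      /* an active process exists for the task */
  bool exit_recorded;      /* assumption: a real exit status was saved */
  int  recorded_exitstat;  /* only meaningful when exit_recorded is true */
  };

enum probe_result { TASK_STILL_RUNNING, TASK_EXITED };

/*
 * Decide what exit status to report for a task examined at MOM startup.
 * Writes to *exitstat only when the task is classified as exited.
 */
enum probe_result classify_task(const struct task_probe *p, int *exitstat)
  {
  if (p->process_found)
    return TASK_STILL_RUNNING;

  if (p->exit_recorded)
    {
    /* The task finished on its own: keep its real status, even zero. */
    *exitstat = p->recorded_exitstat;
    }
  else
    {
    /* Nothing is known about how the task ended: say so explicitly
     * instead of claiming success (0) or borrowing JOB_EXEC_INITABT. */
    *exitstat = JOB_EXEC_LOSTPROC_HYPOTHETICAL;
    }

  return TASK_EXITED;
  }
```

With something like this, a task that really did finish with status 0
while pbs_mom was down would still report 0, and only the "we have no
idea" case would get the new status.  Whether pbs_server would then
rerun the job for that new status is a separate question that the
sketch does not answer.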

