[torquedev] Rerunable jobs not restarting when nodes reboot

Victor Gregorio vgregorio at penguincomputing.com
Thu May 7 09:27:56 MDT 2009


On Wed, May 06, 2009 at 08:20:21AM -0400, Glen Beane wrote:
> On Tue, May 5, 2009 at 2:40 PM, Victor Gregorio
> <vgregorio at penguincomputing.com> wrote:
> > Hey folks,
> >
> > I believe Josh Bernstein and I are close to a solution for
> > 2.3-fixes.
> >
> > We believe that examine_all_running_jobs() should not assume that
> > ti_exitstat is 0 when "no active process [is] found".  Replacing 0
> > with JOB_EXEC_INITABT (-4 /* job aborted on MOM initialization */)
> > allows multi-node, rerunable jobs to properly restart when all
> > execution nodes reboot.
> >
> > Otherwise, the 2.3-fixes tree will only properly rerun rerunable
> > jobs if the job runs on a single execution node (no sisters).
> >
> > Index: src/resmom/mom_main.c
> > ===================================================================
> > --- src/resmom/mom_main.c   (revision 2909)
> > +++ src/resmom/mom_main.c   (working copy)
> > @@ -7830,7 +7830,7 @@
> >               "no active process found");
> >             }
> >
> > -          ptask->ti_qs.ti_exitstat = 0;
> > +          ptask->ti_qs.ti_exitstat = JOB_EXEC_INITABT;
> >
> >           ptask->ti_qs.ti_status = TI_STATE_EXITED;
> >           pjob->ji_qs.ji_un.ji_momt.ji_exitstat = 0;
> >
> > Note that also setting ji_exitstat to JOB_EXEC_INITABT made no
> > difference as long as ti_exitstat was set to JOB_EXEC_INITABT.
> > Should both be set?
> >
> > Either way, is it sane to be setting a task's exit status using
> > JOB_EXEC_INITABT?  We noticed that using -1 did not solve the
> > problem.
> >
> > Finally, note that we tested using two patches.  The above patch to
> > mom_main.c and below patch to requests.c.  All tests look good so
> > far.
> 
> 
> 
> Thanks for the patches, although I think this will need more
> investigation before it should be considered the final solution to the
> problem.  I think using JOB_EXEC_INITABT to set ti_exitstat in this

No problem.  We absolutely agree that this is no final solution :)

> case is abusing it, since that is not conveying the correct failure (I
> don't think just assigning a random value to ti_exitstat is the right
> thing to do).

Well, we picked JOB_EXEC_INITABT (-4) purposely. Logs showed rerunable
jobs without sisters restarting properly after pbs_server reported the
jobs exiting with status -4.

PBS_Server;Job;6.tesla;job exit status -4 handled

> I would like to look into what effect ptask->ti_qs.ti_status =
> TI_STATE_EXITED has.  TI_STATE_EXITED means that ti_exitstat is valid.
> I would guess (I don't know for sure yet) that if the status is
> TI_STATE_EXITED then the ti_exitstat is eventually getting used for
> the job's exit status, and a -1 ji_exitstat as you tried means "job

I think that the task status eventually trickles down to the job
status...

src/resmom/catch_child.c:

      if (ptask->ti_qs.ti_parenttask == TM_NULL_TASK)
        {
        /* master task is in state TI_STATE_EXITED */

        pjob->ji_qs.ji_un.ji_momt.ji_exitstat = ptask->ti_qs.ti_exitstat;

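
To make that propagation concrete, here is a rough standalone sketch of
what I believe the fragment above amounts to.  It is not TORQUE code:
the structs are flattened stand-ins for the real ti_qs / ji_qs layouts,
the helper name sketch_propagate_exitstat() is made up, and the numeric
values of SKETCH_TM_NULL_TASK and the SKETCH_TI_STATE_* constants are
assumptions.  Only the -4 for JOB_EXEC_INITABT comes from the comment
quoted earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; values are NOT taken from the TORQUE headers,
 * except -4, which is quoted earlier in this thread. */
#define SKETCH_TM_NULL_TASK     0
#define SKETCH_TI_STATE_EXITED  1
#define SKETCH_TI_STATE_RUNNING 2
#define SKETCH_JOB_EXEC_INITABT (-4)

struct sketch_task
  {
  int ti_parenttask;  /* SKETCH_TM_NULL_TASK marks the master task */
  int ti_status;
  int ti_exitstat;    /* only meaningful once ti_status is EXITED */
  };

struct sketch_job
  {
  int ji_exitstat;
  };

/* Copy the master task's exit status into the job.
 * Returns 1 if the job's exit status was updated, 0 otherwise. */
int sketch_propagate_exitstat(struct sketch_job *pjob,
                              const struct sketch_task *ptask)
  {
  assert(pjob != NULL && ptask != NULL);

  if (ptask->ti_status != SKETCH_TI_STATE_EXITED)
    return 0;  /* exit status is not valid yet */

  if (ptask->ti_parenttask != SKETCH_TM_NULL_TASK)
    return 0;  /* not the master task */

  pjob->ji_exitstat = ptask->ti_exitstat;
  return 1;
  }
```

If the real code path behaves like this sketch, whatever
examine_all_running_jobs() writes to ji_exitstat would later be
overwritten by the master task's ti_exitstat.  That would be consistent
with our observation that setting ji_exitstat made no difference.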

> exec failed, before files, no retry".   Perhaps the "no retry"
> prevents it from being rerun.

Honestly, setting the -1 value was just to check that the -4 value did
not "work" just because it was a negative number.  I don't know the code
well enough yet, so I was double checking my stabs in the dark.
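
For the record, the outcomes we have seen so far can be written down as
a small lookup.  This is only a summary of the observations reported in
this thread for multi-node rerunable jobs; it says nothing about how
pbs_server actually decides to rerun a job, and every name in it is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Observations from this thread: multi-node rerunable jobs after all
 * execution nodes reboot.  NOT derived from the pbs_server source. */
struct sketch_observation
  {
  int ti_exitstat;  /* value assigned in examine_all_running_jobs() */
  int job_reran;    /* 1 = job restarted, 0 = it did not */
  };

static const struct sketch_observation sketch_observed[] =
  {
  {  0, 0 },  /* current 2.3-fixes behaviour */
  { -1, 0 },  /* control value: negative, but no rerun */
  { -4, 1 },  /* JOB_EXEC_INITABT */
  };

/* Returns 1 or 0 for a recorded value, -1 if the value was never tested. */
int sketch_lookup_observation(int ti_exitstat)
  {
  size_t i;
  size_t n = sizeof(sketch_observed) / sizeof(sketch_observed[0]);

  assert(n > 0);

  for (i = 0; i < n; i++)
    {
    if (sketch_observed[i].ti_exitstat == ti_exitstat)
      return sketch_observed[i].job_reran;
    }

  return -1;  /* untested value */
  }
```

Any value that returns -1 here is simply untested, which is most of
them, so this table is a starting point for the investigation rather
than a conclusion.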

> Also,  is there any chance that in some cases setting ti_exitstat to
> zero might be the correct thing to do?  We don't want to break
> anything else here.

Good point.

> I would like to understand how all of this stuff works together rather
> than just assigning a different value to ti_exitstat and saying "OK,
> it works for my case" without knowing why it works or if it breaks
> anything else. If you have done this, please let me know, so I don't
> have to repeat your investigation.

Agreed.  Things just felt on the right track and we needed your
assistance.  What is the next step?  How can we help?

-- 
Victor Gregorio
Penguin Computing


