[torquedev] Rerunable jobs not restarting when nodes reboot
vgregorio at penguincomputing.com
Thu May 7 09:27:56 MDT 2009
On Wed, May 06, 2009 at 08:20:21AM -0400, Glen Beane wrote:
> On Tue, May 5, 2009 at 2:40 PM, Victor Gregorio
> <vgregorio at penguincomputing.com> wrote:
> > Hey folks,
> > I believe Josh Bernstein and I are close to a solution for 2.3-fixes.
> > We believe that examine_all_running_jobs() should not assume the
> > ti_exitstat is 0 when "no active process [is] found". Replacing 0 with
> > JOB_EXEC_INITABT (-4 /* job aborted on MOM initialization */ ) allows
> > multi-node, rerunable jobs to properly restart when all execution nodes
> > reboot.
> > Otherwise, the 2.3-fixes tree will only properly rerun rerunable jobs if
> > the job runs on a single execution node (no sisters).
> > Index: src/resmom/mom_main.c
> > ===================================================================
> > --- src/resmom/mom_main.c (revision 2909)
> > +++ src/resmom/mom_main.c (working copy)
> > @@ -7830,7 +7830,7 @@
> > "no active process found");
> > }
> > - ptask->ti_qs.ti_exitstat = 0;
> > + ptask->ti_qs.ti_exitstat = JOB_EXEC_INITABT;
> > ptask->ti_qs.ti_status = TI_STATE_EXITED;
> > pjob->ji_qs.ji_un.ji_momt.ji_exitstat = 0;
> > Note that also setting ji_exitstat to JOB_EXEC_INITABT made no
> > difference as long as ti_exitstat was set to JOB_EXEC_INITABT. Should
> > both be set?
> > Either way, is it sane to be setting a task's exit status using
> > JOB_EXEC_INITABT? We noticed that using -1 did not solve the problem.
> > Finally, note that we tested using two patches. The above patch to
> > mom_main.c and below patch to requests.c. All tests look good so far.
> Thanks for the patches, although I think this will need more
> investigation before it should be considered the final solution to the
> problem.
No problem. We absolutely agree that this is no final solution :)
> I think using JOB_EXEC_INITABT to set ti_exitstat in this
> case is abusing it, since that is not conveying the correct failure (I
> don't think just assigning a random value to ti_exitstat is the right
> thing to do).
Well, we picked JOB_EXEC_INITABT (-4) purposely. Logs showed rerunable
jobs without sisters restarting properly after pbs_server reported the
jobs exiting with status -4.
PBS_Server;Job;6.tesla;job exit status -4 handled
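To make the observation concrete, here is a minimal sketch of how an exit status could drive a requeue decision. It is not the pbs_server code: should_requeue() is a hypothetical helper, and the names JOB_EXEC_OK and JOB_EXEC_FAIL1 are assumptions for illustration. Only the values -4 and -1 and their comments are taken from this thread.

```c
#include <stdbool.h>

/* Values and comments for -4 and -1 are the ones quoted in this thread.
 * The macro names other than JOB_EXEC_INITABT are assumptions. */
#define JOB_EXEC_OK        0  /* assumed name for a normal exit */
#define JOB_EXEC_FAIL1    -1  /* job exec failed, before files, no retry */
#define JOB_EXEC_INITABT  -4  /* job aborted on MOM initialization */

/* Hypothetical helper, not a TORQUE function: models the behaviour seen
 * in the logs, where a rerunable job reported with status -4 is rerun
 * while one reported with -1 ("no retry") or 0 is not. */
bool should_requeue(int exitstat, bool rerunable)
{
  switch (exitstat)
    {
    case JOB_EXEC_INITABT:
      /* aborted during MOM initialization: rerun only if allowed */
      return rerunable;

    case JOB_EXEC_FAIL1:  /* "no retry" */
    case JOB_EXEC_OK:     /* treated as a normal completion */
    default:
      return false;
    }
}
```

Under this model, a task exit status of 0 or -1 would never lead to a rerun regardless of the job's rerunable flag, which is consistent with what we saw in testing.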
> I would like to look into what effect ptask->ti_qs.ti_status =
> TI_STATE_EXITED has. TI_STATE_EXITED means that ti_exitstat is valid.
> I would guess (I don't know for sure yet) that if the status is
> TI_STATE_EXITED then the ti_exitstat is eventually getting used for
> the job's exit status, and a -1 ji_exitstat as you tried means "job
> exec failed, before files, no retry".
I think that the task status eventually trickles down to the job:
  if (ptask->ti_qs.ti_parenttask == TM_NULL_TASK)
    /* master task is in state TI_STATE_EXITED */
    pjob->ji_qs.ji_un.ji_momt.ji_exitstat = ptask->ti_qs.ti_exitstat;
> Perhaps the "no retry"
> prevents it from being rerun.
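The propagation in the excerpt above can be modelled in isolation. This is only a sketch under stated assumptions: task_sketch, job_sketch and propagate_exit() are hypothetical stand-ins for the real MOM structures, and the numeric values chosen for TM_NULL_TASK and TI_STATE_EXITED are placeholders rather than the definitions from the TORQUE headers.

```c
/* Placeholder values for illustration only. */
#define TM_NULL_TASK       0
#define TI_STATE_EXITED    2
#define JOB_EXEC_INITABT  -4  /* job aborted on MOM initialization */

/* Hypothetical, flattened stand-ins for the task and job records. */
struct task_sketch
  {
  int ti_parenttask;  /* TM_NULL_TASK marks the master task */
  int ti_status;
  int ti_exitstat;
  };

struct job_sketch
  {
  int ji_exitstat;
  };

/* Copy the exit status of an exited master task into the job.
 * Returns 1 if the job's exit status was updated, 0 otherwise. */
int propagate_exit(const struct task_sketch *ptask, struct job_sketch *pjob)
{
  if (ptask->ti_status != TI_STATE_EXITED)
    return 0;  /* ti_exitstat is not valid yet */

  if (ptask->ti_parenttask != TM_NULL_TASK)
    return 0;  /* only the master task decides the job's status */

  pjob->ji_exitstat = ptask->ti_exitstat;

  return 1;
}
```

If the real code behaves like this model, whatever examine_all_running_jobs() writes into the master task's ti_exitstat becomes the job's exit status, which would explain why changing ji_exitstat alone made no difference.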
Honestly, setting the -1 value was just to check that the -4 value did
not "work" just because it was a negative number. I don't know the code
well enough yet, so I was double checking my stabs in the dark.
> Also, is there any chance that in some cases setting ti_exitstat to
> zero might be the correct thing to do? We don't want to break
> anything else here.
> I would like to understand how all of this stuff works together rather
> than just assigning a different value to ti_exitstat and saying "OK,
> it works for my case" without knowing why it works or if it breaks
> anything else. If you have done this, please let me know, so I don't
> have to repeat your investigation.
Agreed. Things just felt on the right track and we needed your
assistance. What is the next step? How can we help?