[torquedev] Rerunable jobs not restarting when nodes reboot
Victor Gregorio
vgregorio at penguincomputing.com
Thu May 7 09:27:56 MDT 2009
On Wed, May 06, 2009 at 08:20:21AM -0400, Glen Beane wrote:
> On Tue, May 5, 2009 at 2:40 PM, Victor Gregorio
> <vgregorio at penguincomputing.com> wrote:
> > Hey folks,
> >
> > I believe Josh Bernstein and I are close to a solution for
> > 2.3-fixes.
> >
> > We believe that examine_all_running_jobs() should not assume the
> > ti_exitstat is 0 when "no active process [is] found". Replacing 0
> > with JOB_EXEC_INITABT (-4 /* job aborted on MOM initialization */)
> > allows multi-node, rerunable jobs to properly restart when all
> > execution nodes reboot.
> >
> > Otherwise, the 2.3-fixes tree will only properly rerun rerunable
> > jobs if the job runs on a single execution node (no sisters).
> >
> > Index: src/resmom/mom_main.c
> > ===================================================================
> > --- src/resmom/mom_main.c (revision 2909)
> > +++ src/resmom/mom_main.c (working copy)
> > @@ -7830,7 +7830,7 @@
> > "no active process found");
> > }
> >
> > - ptask->ti_qs.ti_exitstat = 0;
> > + ptask->ti_qs.ti_exitstat = JOB_EXEC_INITABT;
> >
> > ptask->ti_qs.ti_status = TI_STATE_EXITED;
> > pjob->ji_qs.ji_un.ji_momt.ji_exitstat = 0;
> >
> > Note that also setting ji_exitstat to JOB_EXEC_INITABT made no
> > difference as long as ti_exitstat was set to JOB_EXEC_INITABT.
> > Should both be set?
> >
> > Either way, is it sane to be setting a task's exit status using
> > JOB_EXEC_INITABT? We noticed that using -1 did not solve the
> > problem.
> >
> > Finally, note that we tested using two patches: the above patch to
> > mom_main.c and the below patch to requests.c. All tests look good
> > so far.
>
>
>
> Thanks for the patches, although I think this will need more
> investigation before it should be considered the final solution to the
> problem. I think using JOB_EXEC_INITABT to set ti_exitstat in this

No problem. We absolutely agree that this is no final solution :)

> case is abusing it, since that is not conveying the correct failure (I
> don't think just assigning a random value to ti_exitstat is the right
> thing to do).

Well, we picked JOB_EXEC_INITABT (-4) purposely. Logs showed rerunable
jobs without sisters restarting properly after pbs_server reported the
jobs exiting with status -4:

    PBS_Server;Job;6.tesla;job exit status -4 handled

> I would like to look into what effect ptask->ti_qs.ti_status =
> TI_STATE_EXITED has. TI_STATE_EXITED means that ti_exitstat is valid.
> I would guess (I don't know for sure yet) that if the status is
> TI_STATE_EXITED then the ti_exitstat is eventually getting used for
> the job's exit status, and a -1 ji_exitstat as you tried means "job

I think that the task status eventually trickles down to the job
status...

src/resmom/catch_child.c (around line 546):

    if (ptask->ti_qs.ti_parenttask == TM_NULL_TASK)
      {
      /* master task is in state TI_STATE_EXITED */

      pjob->ji_qs.ji_un.ji_momt.ji_exitstat = ptask->ti_qs.ti_exitstat;

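The flow discussed here can be modelled in a few lines of C. This is only a toy sketch, not TORQUE code: the `sketch_*` structs, functions, and macro names are hypothetical stand-ins, and the rerun policy encodes nothing more than what was observed in this thread (an exit status of -4 led to a rerun, while 0 and -1 did not). The real pbs_server decision logic may well differ.

```c
#include <assert.h>
#include <stddef.h>

/* Exit codes quoted in this thread; macro names are illustrative only. */
#define SKETCH_EXEC_OK        0  /* task ran and exited normally */
#define SKETCH_EXEC_FAIL1    -1  /* "job exec failed, before files, no retry" */
#define SKETCH_EXEC_INITABT  -4  /* "job aborted on MOM initialization" */

/* Hypothetical, trimmed-down stand-ins for the MOM task and job records. */
struct sketch_task {
  int is_master;  /* stands in for ti_parenttask == TM_NULL_TASK */
  int exitstat;   /* stands in for ti_qs.ti_exitstat */
};

struct sketch_job {
  int exitstat;   /* stands in for ji_qs.ji_un.ji_momt.ji_exitstat */
  int rerunable;  /* nonzero if the job was submitted as rerunable */
};

/* Mirrors the catch_child.c excerpt: only the master task's exit
 * status is copied into the job's exit status. */
void sketch_propagate(struct sketch_job *job, const struct sketch_task *task)
{
  assert(job != NULL && task != NULL);

  if (task->is_master)
    job->exitstat = task->exitstat;
}

/* Toy server-side policy (an assumption based on the behaviour seen in
 * this thread, not the real pbs_server logic): rerun only a rerunable
 * job whose exit status is the "aborted on MOM initialization" code. */
int sketch_should_rerun(const struct sketch_job *job)
{
  assert(job != NULL);

  return job->rerunable && job->exitstat == SKETCH_EXEC_INITABT;
}
```

Under this model, a master task marked exited with status 0 makes the job look like a normal completion, which would explain why it is not rerun after the nodes reboot.
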
> exec failed, before files, no retry". Perhaps the "no retry"
> prevents it from being rerun.

Honestly, setting the -1 value was just to check that the -4 value did
not "work" just because it was a negative number. I don't know the code
well enough yet, so I was double checking my stabs in the dark.

> Also, is there any chance that in some cases setting ti_exitstat to
> zero might be the correct thing to do? We don't want to break
> anything else here.

Good point.

> I would like to understand how all of this stuff works together rather
> than just assigning a different value to ti_exitstat and saying "OK,
> it works for my case" without knowing why it works or if it breaks
> anything else. If you have done this, please let me know, so I don't
> have to repeat your investigation.

Agreed. It felt like we were on the right track, and we need your
assistance. What is the next step? How can we help?

--
Victor Gregorio
Penguin Computing