[torquedev] Rerunable jobs not restarting when nodes reboot
glen.beane at gmail.com
Wed May 6 06:20:21 MDT 2009
On Tue, May 5, 2009 at 2:40 PM, Victor Gregorio
<vgregorio at penguincomputing.com> wrote:
> Hey folks,
> I believe Josh Bernstein and I are close to a solution for 2.3-fixes.
> We believe that examine_all_running_jobs() should not assume the
> ti_exitstat is 0 when "no active process [is] found". Replacing 0 with
> JOB_EXEC_INITABT (-4 /* job aborted on MOM initialization */ ) allows
> multi-node, rerunable jobs to properly restart when all execution nodes
> reboot.
> Otherwise, the 2.3-fixes tree will only properly rerun rerunable jobs if
> the job runs on a single execution node (no sisters).
> Index: src/resmom/mom_main.c
> --- src/resmom/mom_main.c (revision 2909)
> +++ src/resmom/mom_main.c (working copy)
> @@ -7830,7 +7830,7 @@
> "no active process found");
> - ptask->ti_qs.ti_exitstat = 0;
> + ptask->ti_qs.ti_exitstat = JOB_EXEC_INITABT;
> ptask->ti_qs.ti_status = TI_STATE_EXITED;
> pjob->ji_qs.ji_un.ji_momt.ji_exitstat = 0;
> Note that also setting ji_exitstat to JOB_EXEC_INITABT made no
> difference as long as ti_exitstat was set to JOB_EXEC_INITABT. Should
> both be set?
> Either way, is it sane to be setting a task's exit status using
> JOB_EXEC_INITABT? We noticed that using -1 did not solve the problem.
> Finally, note that we tested using two patches. The above patch to
> mom_main.c and below patch to requests.c. All tests look good so far.
Thanks for the patches, although I think this will need more
investigation before it should be considered the final solution to the
problem. I think using JOB_EXEC_INITABT to set ti_exitstat in this
case is abusing it, since that is not conveying the correct failure (I
don't think just assigning a random value to ti_exitstat is the right
thing to do).
I would like to look into what effect ptask->ti_qs.ti_status =
TI_STATE_EXITED has. TI_STATE_EXITED means that ti_exitstat is valid.
I would guess (I don't know for sure yet) that if the status is
TI_STATE_EXITED then the ti_exitstat is eventually getting used for
the job's exit status, and a -1 ji_exitstat as you tried means "job
exec failed, before files, no retry". Perhaps the "no retry"
prevents it from being rerun.
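
To make that guess concrete, here is a rough C sketch of the behavior I am
describing. It is not TORQUE code. The sketch_* types and functions are
hypothetical stand-ins, the only values taken from this thread are -4
(JOB_EXEC_INITABT) and -1 ("job exec failed, before files, no retry"), and
the names JOB_EXEC_OK and JOB_EXEC_FAIL1 are my assumption for the 0 and -1
codes and should be checked against the headers. The rerun rule simply
encodes what was reported for multi-node jobs, not the server's real logic.

```c
#include <stdbool.h>

/* Exit codes mentioned in this thread. The real definitions live in the
 * TORQUE headers; JOB_EXEC_OK and JOB_EXEC_FAIL1 are assumed names. */
#define JOB_EXEC_OK        0   /* assumed name for a zero exit status */
#define JOB_EXEC_FAIL1   (-1)  /* job exec failed, before files, no retry */
#define JOB_EXEC_INITABT (-4)  /* job aborted on MOM initialization */

/* Hypothetical stand-ins for the task state and task record. */
enum sketch_task_state { SKETCH_TASK_RUNNING, SKETCH_TASK_EXITED };

struct sketch_task {
    enum sketch_task_state status;
    int exitstat;  /* only meaningful once status is SKETCH_TASK_EXITED */
};

/* The guess under discussion: once a task is marked exited, its exit
 * status is what ends up as the job's exit status. Returns false if the
 * task has not exited, in which case *out is left untouched. */
static bool sketch_job_exitstat(const struct sketch_task *task, int *out)
{
    if (task->status != SKETCH_TASK_EXITED)
        return false;
    *out = task->exitstat;
    return true;
}

/* Encodes only what was reported for multi-node jobs in this thread:
 * 0 and -1 did not lead to a rerun, JOB_EXEC_INITABT did. */
static bool sketch_would_rerun(int job_exitstat)
{
    return job_exitstat == JOB_EXEC_INITABT;
}
```

If the real code behaves like this, then the patch works only because -4
happens to be a value that permits a rerun, which is why I would like to
find where the job exit status is actually interpreted.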
Also, is there any chance that in some cases setting ti_exitstat to
zero might be the correct thing to do? We don't want to break
anything else here.
I would like to understand how all of this stuff works together rather
than just assigning a different value to ti_exitstat and saying "OK,
it works for my case" without knowing why it works or if it breaks
anything else. If you have done this, please let me know, so I don't
have to repeat your investigation.