[torquedev] Rerunable jobs not restarting when nodes reboot
vgregorio at penguincomputing.com
Tue May 5 12:40:07 MDT 2009
I believe Josh Bernstein and I are close to a solution for 2.3-fixes.
We believe that examine_all_running_jobs() should not assume the
ti_exitstat is 0 when "no active process [is] found". Replacing 0 with
JOB_EXEC_INITABT (-4 /* job aborted on MOM initialization */ ) allows
multi-node, rerunable jobs to properly restart when all execution nodes
reboot.
Otherwise, the 2.3-fixes tree will only properly rerun rerunable jobs if
the job runs on a single execution node (no sisters).
--- src/resmom/mom_main.c (revision 2909)
+++ src/resmom/mom_main.c (working copy)
@@ -7830,7 +7830,7 @@
"no active process found");
- ptask->ti_qs.ti_exitstat = 0;
+ ptask->ti_qs.ti_exitstat = JOB_EXEC_INITABT;
ptask->ti_qs.ti_status = TI_STATE_EXITED;
pjob->ji_qs.ji_un.ji_momt.ji_exitstat = 0;
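
For anyone reading the mom_main.c hunk without the tree at hand, here is a minimal, self-contained sketch of the idea. It is not the real examine_all_running_jobs(): struct task_sketch, has_process, mark_lost_tasks() and the numeric TI_STATE_* values are hypothetical stand-ins. Only JOB_EXEC_INITABT (-4) is taken from the source.

```c
#include <stddef.h>

/* From the source: job aborted on MOM initialization. */
#define JOB_EXEC_INITABT (-4)

/* Hypothetical stand-ins: the real TORQUE task structure is much larger,
 * and these numeric state values are assumptions for illustration. */
#define TI_STATE_RUNNING 1
#define TI_STATE_EXITED  2

struct task_sketch {
    int ti_status;    /* one of the TI_STATE_* values above */
    int ti_exitstat;  /* exit status recorded for the task */
    int has_process;  /* nonzero if an active process was found */
};

/* Mark every running task that has no active process as exited,
 * recording JOB_EXEC_INITABT instead of 0 as its exit status.
 * Returns the number of tasks that were marked. */
int mark_lost_tasks(struct task_sketch *tasks, size_t ntasks)
{
    int marked = 0;

    for (size_t i = 0; i < ntasks; i++) {
        if (tasks[i].ti_status == TI_STATE_RUNNING && !tasks[i].has_process) {
            tasks[i].ti_exitstat = JOB_EXEC_INITABT;
            tasks[i].ti_status = TI_STATE_EXITED;
            marked++;
        }
    }
    return marked;
}
```

The point of the sketch is only that a task lost across a reboot ends up with a nonzero, distinguishable exit status rather than one that looks like a clean exit.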
Note that also setting ji_exitstat to JOB_EXEC_INITABT made no
difference as long as ti_exitstat was set to JOB_EXEC_INITABT. Should
both be set?
Either way, is it sane to be setting a task's exit status using
JOB_EXEC_INITABT? We noticed that using -1 did not solve the problem.
Finally, note that we tested using two patches: the patch above to
mom_main.c and the patch below to requests.c. All tests look good so far.
--- src/resmom/requests.c (revision 2909)
+++ src/resmom/requests.c (working copy)
@@ -2619,9 +2619,9 @@
- if (((rc = return_file(pjob, StdOut, sock)) != 0) ||
- ((rc = return_file(pjob, StdErr, sock)) != 0) ||
- ((rc = return_file(pjob, Chkpt, sock)) != 0))
+ if (((rc = return_file(pjob, StdOut, sock)) == 0) ||
+ ((rc = return_file(pjob, StdErr, sock)) == 0) ||
+ ((rc = return_file(pjob, Chkpt, sock)) == 0))
/* FAILURE - cannot report file to server */
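
Because the requests.c hunk only flips the comparison, it may help to see how the two conditions behave under C's short-circuit evaluation. The sketch below uses a hypothetical stub in place of return_file() and assumes that a return value of 0 means the file was returned successfully. The names return_file_stub, cond_original, cond_patched and stub_calls are made up for illustration.

```c
/* Number of stub calls made by the most recent condition check. */
int stub_calls = 0;

/* Hypothetical stand-in for return_file(): returns a canned result.
 * Assumption: 0 means the file was returned successfully. */
static int return_file_stub(const int results[3], int which)
{
    stub_calls++;
    return results[which];
}

/* Original condition: true as soon as any call returns nonzero.
 * *rc holds the result of the last call that was actually made. */
int cond_original(const int results[3], int *rc)
{
    stub_calls = 0;
    return ((*rc = return_file_stub(results, 0)) != 0) ||
           ((*rc = return_file_stub(results, 1)) != 0) ||
           ((*rc = return_file_stub(results, 2)) != 0);
}

/* Patched condition: true as soon as any call returns zero,
 * so the remaining calls are skipped after the first zero result. */
int cond_patched(const int results[3], int *rc)
{
    stub_calls = 0;
    return ((*rc = return_file_stub(results, 0)) == 0) ||
           ((*rc = return_file_stub(results, 1)) == 0) ||
           ((*rc = return_file_stub(results, 2)) == 0);
}
```

With results {0, 0, 0}, cond_original() makes all three calls and evaluates to false, while cond_patched() stops after the first call and evaluates to true. Whether that matches the intent of the branch commented as FAILURE depends on what return_file() actually returns in the tree being patched.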
On Mon, May 04, 2009 at 09:56:58PM +1000, Chris Samuel wrote:
> ----- "Victor Gregorio" <vgregorio at penguincomputing.com> wrote:
> > So, while 2.3-fixes is sometimes working, trunk does not. How can I
> > help you solve this problem? It is important to us that rerunable
> > jobs restart after execution nodes reboot.
> Personally I'd be happier running 2.3-fixes rather
> than trunk at present, src/resmom/linux/cpuset.c
> still has my patch committed by Glen in June 2008
> to avoid an infinite loop due to never completed
> code (commit 2173).
> Hence the comment:
> /* What was meant to be here ? - csamuel at vpac.org */
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency