[torquedev] Rerunable jobs not restarting when nodes reboot

Victor Gregorio vgregorio at penguincomputing.com
Tue May 5 12:40:07 MDT 2009


Hey folks, 

I believe Josh Bernstein and I are close to a solution for 2.3-fixes.

We believe that examine_all_running_jobs() should not assume the
ti_exitstat is 0 when "no active process [is] found".  Replacing 0 with
JOB_EXEC_INITABT (-4 /* job aborted on MOM initialization */ ) allows
multi-node, rerunable jobs to properly restart when all execution nodes
reboot.

Without this change, the 2.3-fixes tree only reruns rerunable jobs
correctly when the job runs on a single execution node (no sisters).

Index: src/resmom/mom_main.c
===================================================================
--- src/resmom/mom_main.c   (revision 2909)
+++ src/resmom/mom_main.c   (working copy)
@@ -7830,7 +7830,7 @@
               "no active process found");
             }
 
-          ptask->ti_qs.ti_exitstat = 0;
+          ptask->ti_qs.ti_exitstat = JOB_EXEC_INITABT;
 
           ptask->ti_qs.ti_status = TI_STATE_EXITED;
           pjob->ji_qs.ji_un.ji_momt.ji_exitstat = 0;

Note that additionally setting ji_exitstat to JOB_EXEC_INITABT made no
difference once ti_exitstat was set to JOB_EXEC_INITABT.  Should both
be set?

Either way, is it sane to be setting a task's exit status using
JOB_EXEC_INITABT?  We noticed that using -1 did not solve the problem.  
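
To illustrate the idea behind the mom_main.c change outside of TORQUE,
here is a minimal sketch.  Only JOB_EXEC_INITABT and its value (-4) come
from the patch above; the struct, the enum, and both function names are
hypothetical stand-ins and not the real MOM code.  It shows only that a
task recorded with exit status 0 cannot be told apart from a clean
exit, while an abort code can.  It does not model how the server
decides to rerun a job.

```c
#include <assert.h>
#include <stddef.h>

/* Value quoted in the patch: job aborted on MOM initialization. */
#define JOB_EXEC_INITABT (-4)

/* Hypothetical stand-ins for the real task structures. */
enum task_state { TASK_RUNNING, TASK_EXITED };

struct task {
    int             has_process; /* nonzero if an active process was found */
    enum task_state state;
    int             exitstat;
};

/*
 * Mark every task that has no active process as exited and record
 * lost_status as its exit status.  Returns the number of tasks marked.
 */
int mark_lost_tasks(struct task *tasks, size_t n, int lost_status)
{
    int    marked = 0;
    size_t i;

    assert(tasks != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!tasks[i].has_process && tasks[i].state == TASK_RUNNING) {
            tasks[i].exitstat = lost_status;
            tasks[i].state    = TASK_EXITED;
            marked++;
        }
    }
    return marked;
}

/* A task "looks clean" if it exited with status 0. */
int looks_like_clean_exit(const struct task *t)
{
    return t->state == TASK_EXITED && t->exitstat == 0;
}
```

Calling mark_lost_tasks(tasks, n, 0) leaves a lost task looking like a
clean exit, whereas mark_lost_tasks(tasks, n, JOB_EXEC_INITABT) keeps
the two cases distinguishable.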

Finally, note that we tested with two patches applied together: the
patch to mom_main.c above and the patch to requests.c below.  All tests
look good so far.

Index: src/resmom/requests.c
===================================================================
--- src/resmom/requests.c   (revision 2909)
+++ src/resmom/requests.c   (working copy)
@@ -2619,9 +2619,9 @@
     exit(0);
     }
 
-  if (((rc = return_file(pjob, StdOut, sock)) != 0) ||
-      ((rc = return_file(pjob, StdErr, sock)) != 0) ||
-      ((rc = return_file(pjob, Chkpt, sock)) != 0))
+  if (((rc = return_file(pjob, StdOut, sock)) == 0) ||
+      ((rc = return_file(pjob, StdErr, sock)) == 0) ||
+      ((rc = return_file(pjob, Chkpt, sock)) == 0))
     {
     /* FAILURE - cannot report file to server */

Any thoughts?

-- 
Victor Gregorio
Penguin Computing

On Mon, May 04, 2009 at 09:56:58PM +1000, Chris Samuel wrote:
> 
> ----- "Victor Gregorio" <vgregorio at penguincomputing.com> wrote:
> 
> > So, while 2.3-fixes is sometimes working, trunk does not.  How can I
> > help you solve this problem?  It is important to us that rerunable
> > jobs restart after execution nodes reboot.
> 
> Personally I'd be happier running 2.3-fixes rather
> than trunk at present, src/resmom/linux/cpuset.c
> still has my patch committed by Glen in June 2008
> to avoid an infinite loop due to never completed
> code (commit 2173).
> 
> Hence the comment:
> 
>    /* What was meant to be here ? - csamuel at vpac.org */
> 
> cheers,
> Chris
> -- 
> Christopher Samuel - (03) 9925 4751 - Systems Manager
>  The Victorian Partnership for Advanced Computing
>  P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency

