[torquedev] Rerunable jobs not restarting when nodes reboot

Victor Gregorio vgregorio at penguincomputing.com
Wed Apr 22 11:53:42 MDT 2009


Interesting, Michael.  Thanks for the reply.  

I am not sure this is the right approach, but the following patch to the
2.1-fixes Subversion branch (pbs_server --version reports 2.1.12) allows
rerunable jobs to restart properly when the execution nodes reboot.

svn diff src/resmom/requests.c
Index: src/resmom/requests.c
===================================================================
--- src/resmom/requests.c   (revision 2898)
+++ src/resmom/requests.c   (working copy)
@@ -2344,9 +2344,9 @@
     exit(0);
     }
 
-  if (((rc = return_file(pjob,StdOut,sock)) != 0) ||
-      ((rc = return_file(pjob,StdErr,sock)) != 0) ||
-      ((rc = return_file(pjob,Chkpt,sock)) != 0)) 
+  if (((rc = return_file(pjob,StdOut,sock)) == 0) ||
+      ((rc = return_file(pjob,StdErr,sock)) == 0) ||
+      ((rc = return_file(pjob,Chkpt,sock)) == 0)) 
     {
     /* FAILURE - cannot report file to server */

I have been unsuccessful in my attempts to fix trunk or the 2.3-fixes
branch.  Any advice is appreciated.

Thanks,

-- 
Victor Gregorio
Penguin Computing

On Wed, Apr 22, 2009 at 09:58:37AM -0400, Michael Barnes wrote:
> On Mon, Apr 20, 2009 at 12:02:15PM -0700, Victor Gregorio wrote:
> > I have found that with Torque versions >= 2.1.10, rerunable jobs are not
> > restarting after the execution nodes reboot.  Torque version 2.1.9 works
> > as expected: rerunable jobs restart from the beginning after execution
> > nodes are rebooted.
> > 
> > Here is the [torqueusers] thread on the issue:
> > http://www.supercluster.org/pipermail/torqueusers/2009-April/008945.html
> > 
> > I am not certain why, but if I remove this patch (below) from 2.1.10,
> > rerunable jobs begin to restart properly after the execution nodes
> > reboot.  Please note that this patch was introduced in 2.1.10.
> 
> I've looked at torque-2.4.1b1-snap.200903101336 and torque-2.1.10, and
> both versions have logic inconsistencies within the function
> return_file() and at the places where it is called.  For example:
> 
> The source says:
> 
> /* return 0 on failure */
> 
> static int return_file( ...
>     ...
>     return 0; // the /dev/null part and others
>     return PBS_SYSTEM; // PBS_SYSTEM failure/non-zero
>     return rc; // from other calls where 0 is NOT a failure
> 
> and in req_rerunjob() there is:
> 
>   if (((rc = return_file(pjob, StdOut, sock, TRUE)) != 0) ||
>       ((rc = return_file(pjob, StdErr, sock, TRUE)) != 0) ||
>       ((rc = return_file(pjob, Checkpoint, sock, TRUE)) != 0))
>     {
>     /* FAILURE - cannot report file to server */
> 
> 
> The 2.4 version calls return_file() in more functions than the 2.1
> version, so this could explain the differences between them.
> 
> -mb
> 
> -- 
> +-----------------------------------------------
> | Michael Barnes
> |
> | Thomas Jefferson National Accelerator Facility
> | 12000 Jefferson Ave.
> | Newport News, VA 23606
> | (757) 269-7634
> +-----------------------------------------------
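
To make the mismatch Michael describes easier to see, here is a minimal
sketch in C.  It is not Torque code: sketch_return_file(), the CASE_*
names, and the value of SKETCH_PBS_SYSTEM are hypothetical stand-ins that
only mimic the return paths quoted above, and -1 is an arbitrary non-zero
code for a failed helper call.  The two predicates correspond to the
"!= 0" test in req_rerunjob() and the "== 0" test in the patch.

```c
/* Placeholder for the non-zero system-error code; the real value is
 * not given in this thread. */
#define SKETCH_PBS_SYSTEM 1

/* Hypothetical outcomes, one per return path quoted from return_file(). */
enum sketch_case {
    CASE_DEV_NULL,       /* "return 0" on the /dev/null path        */
    CASE_SYSTEM_ERROR,   /* "return PBS_SYSTEM", a non-zero failure */
    CASE_HELPER_OK,      /* "return rc" where the helper returned 0 */
    CASE_HELPER_FAILED   /* "return rc" where the helper failed     */
};

/* Stand-in for return_file(): reproduces only the mixed return values. */
static int sketch_return_file(enum sketch_case c)
{
    switch (c) {
    case CASE_DEV_NULL:      return 0;
    case CASE_SYSTEM_ERROR:  return SKETCH_PBS_SYSTEM;
    case CASE_HELPER_OK:     return 0;
    case CASE_HELPER_FAILED: return -1;  /* arbitrary non-zero code */
    }
    return SKETCH_PBS_SYSTEM;
}

/* Caller convention in req_rerunjob(): non-zero means failure. */
static int failed_if_nonzero(int rc) { return rc != 0; }

/* Caller convention implied by "return 0 on failure" and used by the
 * patch: zero means failure. */
static int failed_if_zero(int rc) { return rc == 0; }
```

Because the two predicates are exact complements, every return path is
classified as a failure by one caller convention and as a success by the
other.  The sketch does not say which reading is the intended one; it only
shows that the comment, the return statements, and the callers cannot all
be right at once.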