[torquedev] Rerunable jobs not restarting when nodes reboot

Michael Barnes barnes at jlab.org
Wed Apr 22 07:58:37 MDT 2009


On Mon, Apr 20, 2009 at 12:02:15PM -0700, Victor Gregorio wrote:
> I have found that with Torque versions >= 2.1.10, rerunable jobs are not
> restarting after the execution nodes reboot.  Torque version 2.1.9 works
> as expected: rerunable jobs restart from the beginning after execution
> nodes are rebooted.
> 
> Here is the [torqueusers] thread on the issue:
> http://www.supercluster.org/pipermail/torqueusers/2009-April/008945.html
> 
> I am not certain why, but if I remove this patch (below) from 2.1.10,
> rerunable jobs begin to restart properly after the execution nodes
> reboot.  Please note that this patch was introduced in 2.1.10.

I've looked at torque-2.4.1b1-snap.200903101336 and torque-2.1.10, and
both versions have logic inconsistencies, both within the function
return_file() and where it is called.  For example:

The source says:

/* return 0 on failure */

static int return_file( ...
    ...
    return 0; // the /dev/null part and others
    return PBS_SYSTEM; // PBS_SYSTEM failure/non-zero
    return rc; // from other calls where 0 is NOT a failure

and in req_rerunjob() there is:

  if (((rc = return_file(pjob, StdOut, sock, TRUE)) != 0) ||
      ((rc = return_file(pjob, StdErr, sock, TRUE)) != 0) ||
      ((rc = return_file(pjob, Checkpoint, sock, TRUE)) != 0))
    {
    /* FAILURE - cannot report file to server */


The 2.4 version calls return_file() from more functions than the 2.1
version does, so this could explain the difference in behavior between
them.
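
To make the mismatch concrete, here is a minimal sketch.  It is not the
Torque source: mock_return_file(), MOCK_PBS_SYSTEM (with an arbitrary
non-zero value), and the two helper predicates are hypothetical names
made up for illustration.  The mock follows the convention the
req_rerunjob() check appears to assume (0 means success), and the
helpers show how the same return value is judged under that convention
versus under the "return 0 on failure" comment.

```c
#include <stdio.h>

/* Hypothetical stand-in for the PBS_SYSTEM error code; value is arbitrary. */
#define MOCK_PBS_SYSTEM 1

/* Hypothetical file kinds, named after the arguments in the excerpt. */
enum mock_which { MOCK_STDOUT, MOCK_STDERR, MOCK_CHECKPOINT };

/* Mock only: returns 0 on the "/dev/null" path and on a successful send,
 * and a non-zero code on a simulated system error. */
static int mock_return_file(enum mock_which which, int is_dev_null, int sys_error)
{
    (void)which;                 /* unused in this sketch */
    if (is_dev_null)
        return 0;                /* nothing to send */
    if (sys_error)
        return MOCK_PBS_SYSTEM;  /* non-zero failure code */
    return 0;                    /* file sent */
}

/* Chained check shaped like the req_rerunjob() excerpt:
 * any non-zero return is treated as a failure. */
static int mock_rerun_check(int is_dev_null, int sys_error)
{
    int rc;

    if (((rc = mock_return_file(MOCK_STDOUT, is_dev_null, sys_error)) != 0) ||
        ((rc = mock_return_file(MOCK_STDERR, is_dev_null, sys_error)) != 0) ||
        ((rc = mock_return_file(MOCK_CHECKPOINT, is_dev_null, sys_error)) != 0))
        return rc;               /* FAILURE - cannot report file to server */

    return 0;                    /* all three files handled */
}

/* Interpretation used by the caller: non-zero means failure. */
static int failed_per_caller(int rc)
{
    return rc != 0;
}

/* Interpretation implied by the comment "return 0 on failure". */
static int failed_per_comment(int rc)
{
    return rc == 0;
}
```

With these definitions, a 0 returned on the /dev/null path is a success
according to failed_per_caller() but a failure according to
failed_per_comment(), so any caller written against the comment rather
than against the req_rerunjob() convention would take the wrong branch.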

-mb

-- 
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------

