[torquedev] Rerunable jobs not restarting when nodes reboot
Michael Barnes
barnes at jlab.org
Wed Apr 22 07:58:37 MDT 2009
On Mon, Apr 20, 2009 at 12:02:15PM -0700, Victor Gregorio wrote:
> I have found that with Torque versions >= 2.1.10, rerunable jobs are not
> restarting after the execution nodes reboot. Torque version 2.1.9 works
> as expected: rerunable jobs restart from the beginning after execution
> nodes are rebooted.
>
> Here is the [torqueusers] thread on the issue:
> http://www.supercluster.org/pipermail/torqueusers/2009-April/008945.html
>
> I am not certain why, but if I remove this patch (below) from 2.1.10,
> rerunable jobs begin to restart properly after the execution nodes
> reboot. Please note that this patch was introduced in 2.1.10.
I've looked at torque-2.4.1b1-snap.200903101336 and torque-2.1.10, and
both versions have logic inconsistencies within the function
return_file() and in how it is called. For example:
The source says:

    /* return 0 on failure */
    static int return_file( ...
      ...
      return 0;           // the /dev/null part and others
      return PBS_SYSTEM;  // PBS_SYSTEM failure/non-zero
      return rc;          // from other calls where 0 is NOT a failure
and in req_rerunjob() there is:

    if (((rc = return_file(pjob, StdOut, sock, TRUE)) != 0) ||
        ((rc = return_file(pjob, StdErr, sock, TRUE)) != 0) ||
        ((rc = return_file(pjob, Checkpoint, sock, TRUE)) != 0))
      {
      /* FAILURE - cannot report file to server */

So the header comment says 0 means failure, while this caller treats
any non-zero value as the failure.
The 2.4 version calls return_file() in more functions than the 2.1
version, so this could explain the differences between them.
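
To make the mismatch concrete, here is a minimal standalone sketch. It is
not TORQUE code: demo_return_file(), the demo_outcome enum, DEMO_SYSTEM_ERROR
and both caller functions are hypothetical names made up for illustration.
It only mimics the three kinds of return paths quoted above and compares a
caller that trusts the "return 0 on failure" comment with one that tests
"!= 0" the way req_rerunjob() does.

```c
#include <stdio.h>

/* Hypothetical stand-in for a non-zero system error code
 * (an assumption, not the real TORQUE value). */
#define DEMO_SYSTEM_ERROR 1

/* Hypothetical labels for the three kinds of return paths quoted above. */
enum demo_outcome
  {
  DEMO_NOTHING_TO_SEND,  /* e.g. the /dev/null path: returns 0      */
  DEMO_SYSTEM_FAILURE,   /* returns a non-zero error code           */
  DEMO_SENT_OK           /* returns rc from a call where 0 means OK */
  };

/* Mimics the mixed conventions: 0 is returned both when nothing needs
 * to be sent and when the transfer worked; non-zero is a real failure. */
static int demo_return_file(enum demo_outcome what)
  {
  switch (what)
    {
    case DEMO_NOTHING_TO_SEND:
      return 0;

    case DEMO_SYSTEM_FAILURE:
      return DEMO_SYSTEM_ERROR;

    case DEMO_SENT_OK:
      return 0;
    }

  return DEMO_SYSTEM_ERROR;
  }

/* Caller written like the req_rerunjob() excerpt: non-zero is failure.
 * Returns 1 if the caller would take its failure branch. */
int caller_nonzero_is_failure(enum demo_outcome what)
  {
  return demo_return_file(what) != 0;
  }

/* Caller written from the header comment: zero is failure.
 * Returns 1 if the caller would take its failure branch. */
int caller_zero_is_failure(enum demo_outcome what)
  {
  return demo_return_file(what) == 0;
  }

/* Prints, for each path, what each style of caller would conclude. */
void demo_print_table(void)
  {
  static const char *names[] = { "nothing to send", "system failure", "sent ok" };
  int i;

  for (i = DEMO_NOTHING_TO_SEND; i <= DEMO_SENT_OK; i++)
    {
    printf("%-16s  '!= 0' caller fails: %d   '== 0' caller fails: %d\n",
           names[i],
           caller_nonzero_is_failure((enum demo_outcome)i),
           caller_zero_is_failure((enum demo_outcome)i));
    }
  }
```

Calling demo_print_table() from a main() shows that the two readings
disagree on every path: a caller that believes the comment would treat a
successful transfer as a failure and a system failure as success. Whether
any real caller in 2.1.10 or 2.4 is written that way is something I have
not verified; the sketch only shows why the comment and the code cannot
both be right.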
-mb
--
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------