[torquedev] Rerunable jobs not restarting when nodes reboot
Victor Gregorio
vgregorio at penguincomputing.com
Wed Apr 22 11:53:42 MDT 2009
Interesting, Michael. Thanks for the reply.
I am not sure this is the right approach, but the following patch to the
2.1-fixes subversion branch (pbs_server --version 2.1.12) allows
rerunable jobs to properly restart when the execution nodes reboot.
svn diff src/resmom/requests.c
Index: src/resmom/requests.c
===================================================================
--- src/resmom/requests.c (revision 2898)
+++ src/resmom/requests.c (working copy)
@@ -2344,9 +2344,9 @@
exit(0);
}
- if (((rc = return_file(pjob,StdOut,sock)) != 0) ||
- ((rc = return_file(pjob,StdErr,sock)) != 0) ||
- ((rc = return_file(pjob,Chkpt,sock)) != 0))
+ if (((rc = return_file(pjob,StdOut,sock)) == 0) ||
+ ((rc = return_file(pjob,StdErr,sock)) == 0) ||
+ ((rc = return_file(pjob,Chkpt,sock)) == 0))
{
/* FAILURE - cannot report file to server */
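
For anyone reading the diff, here is a minimal sketch of how the chained `||` test short-circuits under each comparison. It is not Torque code: stub_return_file() and the scripted return values are hypothetical stand-ins, and the sketch makes no claim about what return_file() actually returns in any branch.

```c
/* Hypothetical stand-in for return_file(); NOT the Torque function.
 * The caller scripts the return value for each of the three files
 * (StdOut, StdErr, checkpoint) so both comparisons can be tried. */
enum { N_FILES = 3 };

static int calls_made;           /* number of stub calls actually evaluated */
static const int *stub_results;  /* scripted return values, one per file */

static int stub_return_file(int which)
{
    calls_made++;
    return stub_results[which];
}

/* Same shape as the unpatched test: the block is entered as soon as
 * one call returns non-zero; later calls are then skipped. */
static int enters_block_unpatched(const int results[N_FILES])
{
    stub_results = results;
    calls_made = 0;
    return (stub_return_file(0) != 0) ||
           (stub_return_file(1) != 0) ||
           (stub_return_file(2) != 0);
}

/* Same shape as the patched test: the block is entered as soon as
 * one call returns zero; later calls are then skipped. */
static int enters_block_patched(const int results[N_FILES])
{
    stub_results = results;
    calls_made = 0;
    return (stub_return_file(0) == 0) ||
           (stub_return_file(1) == 0) ||
           (stub_return_file(2) == 0);
}
```

With all three stubs returning 0, the unpatched test evaluates all three calls and skips the block, while the patched test enters the block after the first call and never evaluates the other two. So the patch changes both which branch is taken and how many files are sent.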
I have been unsuccessful in my attempts to fix trunk or the 2.3-fixes
branch. Any advice is appreciated.
Thanks,
--
Victor Gregorio
Penguin Computing
On Wed, Apr 22, 2009 at 09:58:37AM -0400, Michael Barnes wrote:
> On Mon, Apr 20, 2009 at 12:02:15PM -0700, Victor Gregorio wrote:
> > I have found that with Torque versions >= 2.1.10, rerunable jobs are not
> > restarting after the execution nodes reboot. Torque version 2.1.9 works
> > as expected: rerunable jobs restart from the beginning after execution
> > nodes are rebooted.
> >
> > Here is the [torqueusers] thread on the issue:
> > http://www.supercluster.org/pipermail/torqueusers/2009-April/008945.html
> >
> > I am not certain why, but if I remove this patch (below) from 2.1.10,
> > rerunable jobs begin to restart properly after the execution nodes
> > reboot. Please note that this patch was introduced in 2.1.10.
>
> I've looked at torque-2.4.1b1-snap.200903101336 and torque-2.1.10, and
> these versions have logic inconsistencies within the function
> return_file() and in how it is called. For example:
>
> The source says:
>
> /* return 0 on failure */
>
> static int return_file( ...
> ...
> return 0; // the /dev/null part and others
> return PBS_SYSTEM; // PBS_SYSTEM failure/non-zero
> return rc; // from other calls where 0 is NOT a failure
>
> and in req_rerunjob() there is:
>
> if (((rc = return_file(pjob, StdOut, sock, TRUE)) != 0) ||
> ((rc = return_file(pjob, StdErr, sock, TRUE)) != 0) ||
> ((rc = return_file(pjob, Checkpoint, sock, TRUE)) != 0))
> {
> /* FAILURE - cannot report file to server */
>
>
> The 2.4 version calls return_file() in more functions than the 2.1
> version, so this could explain the differences between them.
>
> -mb
>
> --
> +-----------------------------------------------
> | Michael Barnes
> |
> | Thomas Jefferson National Accelerator Facility
> | 12000 Jefferson Ave.
> | Newport News, VA 23606
> | (757) 269-7634
> +-----------------------------------------------