[torquedev] Rerunable jobs not restarting when nodes reboot
Victor Gregorio
vgregorio at penguincomputing.com
Tue May 5 12:39:26 MDT 2009
Michael,
I agree, the logic inside req_rerunjob() might be off.
I think that resmom/requests.c's req_rerunjob() needs the patch below to
properly test for a return_file() failure. Can someone confirm?
--
Victor Gregorio
Penguin Computing
On Wed, Apr 22, 2009 at 10:53:42AM -0700, Victor Gregorio wrote:
>
> svn diff src/resmom/requests.c
> Index: src/resmom/requests.c
> ===================================================================
> --- src/resmom/requests.c (revision 2898)
> +++ src/resmom/requests.c (working copy)
> @@ -2344,9 +2344,9 @@
> exit(0);
> }
>
> - if (((rc = return_file(pjob,StdOut,sock)) != 0) ||
> - ((rc = return_file(pjob,StdErr,sock)) != 0) ||
> - ((rc = return_file(pjob,Chkpt,sock)) != 0))
> + if (((rc = return_file(pjob,StdOut,sock)) == 0) ||
> + ((rc = return_file(pjob,StdErr,sock)) == 0) ||
> + ((rc = return_file(pjob,Chkpt,sock)) == 0))
> {
> /* FAILURE - cannot report file to server */
> On Wed, Apr 22, 2009 at 09:58:37AM -0400, Michael Barnes wrote:
> >
> > I've looked at torque-2.4.1b1-snap.200903101336 and torque-2.1.10, and
> > these versions have logic inconsistencies within the function
> > return_file() and in how it is called. For example:
> >
> > The source says:
> >
> > /* return 0 on failure */
> >
> > static int return_file( ...
> > ...
> > return 0; // the /dev/null part and others
> > return PBS_SYSTEM; // PBS_SYSTEM failure/non-zero
> > return rc; // from other calls where 0 is NOT a failure
> >
> >
> >
> >
> > and in req_rerunjob() there is:
> >
> > if (((rc = return_file(pjob, StdOut, sock, TRUE)) != 0) ||
> > ((rc = return_file(pjob, StdErr, sock, TRUE)) != 0) ||
> > ((rc = return_file(pjob, Checkpoint, sock, TRUE)) != 0))
> > {
> > /* FAILURE - cannot report file to server */
> >
> >
> > The 2.4 version calls return_file() in more functions than the 2.1
> > version, so this could explain the differences between them.
> >
> > -mb
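
To illustrate the mismatch described above: if return_file() can return 0 on some failure paths and also 0 on some success paths, then neither the "!= 0" test nor the "== 0" test in req_rerunjob() can be reliable. A minimal sketch of the alternative, assuming a single convention (0 = success, non-zero = failure) on every path, could look like the following. All names here (return_file_sketch, return_all_files_sketch, SKETCH_OK, SKETCH_FAILURE, and the simulate_failure_on parameter) are hypothetical stand-ins for illustration, not the actual TORQUE code or signatures.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the TORQUE definitions. */
enum file_kind { STD_OUT, STD_ERR, CHKPT };

#define SKETCH_OK      0
#define SKETCH_FAILURE 1  /* stands in for a non-zero code such as PBS_SYSTEM */

/*
 * One convention on every path: 0 = success, non-zero = failure.
 * simulate_failure_on selects which file kind "fails" (use -1 for none);
 * it exists only so the sketch can be exercised without real files.
 */
static int return_file_sketch(enum file_kind which, int simulate_failure_on)
{
    if ((int)which == simulate_failure_on)
        return SKETCH_FAILURE;

    /* "Nothing to send" cases (e.g. output going to /dev/null) count as success. */
    return SKETCH_OK;
}

/*
 * Caller in the style of req_rerunjob(): stop at the first failure.
 * Returns 0 if all three files were returned, else the first failing code.
 */
static int return_all_files_sketch(int simulate_failure_on)
{
    int rc;

    if (((rc = return_file_sketch(STD_OUT, simulate_failure_on)) != 0) ||
        ((rc = return_file_sketch(STD_ERR, simulate_failure_on)) != 0) ||
        ((rc = return_file_sketch(CHKPT,   simulate_failure_on)) != 0))
    {
        /* FAILURE - cannot report file to server */
        fprintf(stderr, "sketch: return_file failed, rc=%d\n", rc);
        return rc;
    }

    return SKETCH_OK;
}

/* Small self-check showing the intended behaviour of the sketch. */
static void sketch_self_check(void)
{
    assert(return_all_files_sketch(-1) == SKETCH_OK);
    assert(return_all_files_sketch(STD_ERR) == SKETCH_FAILURE);
}
```

With one convention in place, the existing "!= 0" test in the caller reads correctly as written, and the fix would belong inside return_file() (and its header comment) rather than in req_rerunjob(). Whether that matches the intended behaviour in 2.1 and 2.4 still needs confirming against the real sources.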