[torquedev] Rerunnable jobs not restarting when nodes reboot

Victor Gregorio vgregorio at penguincomputing.com
Tue May 5 12:39:26 MDT 2009


Michael,

I agree, the return-value test inside req_rerunjob() looks inverted.

I think that req_rerunjob() in src/resmom/requests.c needs the patch below
to properly test for a return_file() failure, since return_file() is
commented as returning 0 on failure.  Can someone confirm?
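
To make the question concrete, here is a minimal sketch of how the two
tests behave, assuming return_file() really has the three kinds of return
paths Michael describes below.  Everything named fake_* or FAKE_* is a
hypothetical stand-in, not Torque code, and the value of FAKE_PBS_SYSTEM is
only a placeholder for some non-zero error code.

```c
#include <assert.h>

/* Hypothetical stand-ins, NOT the real Torque definitions. */
#define FAKE_PBS_SYSTEM 1   /* placeholder for a non-zero error code */

enum fake_outcome { SKIPPED_DEV_NULL, SYSTEM_ERROR, SENT_OK };

/* Mimics the three kinds of return paths quoted from return_file(). */
int fake_return_file(enum fake_outcome o)
{
  switch (o)
    {
    case SKIPPED_DEV_NULL:
      return 0;                 /* the /dev/null path: nothing to send */
    case SYSTEM_ERROR:
      return FAKE_PBS_SYSTEM;   /* non-zero system failure */
    case SENT_OK:
      return 0;                 /* rc from a call where 0 is NOT a failure */
    }

  assert(!"unreachable: unknown outcome");
  return FAKE_PBS_SYSTEM;
}

/* Existing test in req_rerunjob(): a non-zero rc counts as failure. */
int fails_with_ne_zero(enum fake_outcome o)
{
  return fake_return_file(o) != 0;
}

/* Patched test: a zero rc counts as failure. */
int fails_with_eq_zero(enum fake_outcome o)
{
  return fake_return_file(o) == 0;
}
```

Under this model the existing != 0 test flags only the system error, while
the == 0 test flags both zero-returning paths (including a successful send)
and misses the system error.  So the patch is only right if every 0 return
from return_file() really means failure, which is the part I would like
confirmed.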

-- 
Victor Gregorio
Penguin Computing

On Wed, Apr 22, 2009 at 10:53:42AM -0700, Victor Gregorio wrote:
>
> svn diff src/resmom/requests.c
> Index: src/resmom/requests.c
> ===================================================================
> --- src/resmom/requests.c   (revision 2898)
> +++ src/resmom/requests.c   (working copy)
> @@ -2344,9 +2344,9 @@
>      exit(0);
>      }
>  
> -  if (((rc = return_file(pjob,StdOut,sock)) != 0) ||
> -      ((rc = return_file(pjob,StdErr,sock)) != 0) ||
> -      ((rc = return_file(pjob,Chkpt,sock)) != 0)) 
> +  if (((rc = return_file(pjob,StdOut,sock)) == 0) ||
> +      ((rc = return_file(pjob,StdErr,sock)) == 0) ||
> +      ((rc = return_file(pjob,Chkpt,sock)) == 0)) 
>      {
>      /* FAILURE - cannot report file to server */
 
 
> On Wed, Apr 22, 2009 at 09:58:37AM -0400, Michael Barnes wrote:
> > 
> > I've looked at torque-2.4.1b1-snap.200903101336 and torque-2.1.10, and
> > these versions have logic inconsistencies within the function
> > return_file() and in how it is called.  For example:
> > 
> > The source says:
> > 
> > /* return 0 on failure */
> > 
> > static int return_file( ...
> >     ...
> >     return 0; // the /dev/null part and others
> >     return PBS_SYSTEM; // PBS_SYSTEM failure/non-zero
> >     return rc; // from other calls where 0 is NOT a failure
> >       
> > 
> > 
> > 
> > and in req_rerunjob() there is:
> > 
> >   if (((rc = return_file(pjob, StdOut, sock, TRUE)) != 0) ||
> >       ((rc = return_file(pjob, StdErr, sock, TRUE)) != 0) ||
> >       ((rc = return_file(pjob, Checkpoint, sock, TRUE)) != 0))
> >     {
> >     /* FAILURE - cannot report file to server */
> > 
> > 
> > The 2.4 version calls return_file() in more functions than the 2.1
> > version, so this could explain the differences between them.
> > 
> > -mb

