[torquedev] Rerunable jobs not restarting when nodes reboot

Victor Gregorio vgregorio at penguincomputing.com
Thu Apr 23 11:02:42 MDT 2009


Hey folks,

I discovered something interesting: the 2.3-fixes branch (pbs_server
--version 2.3.7) properly reruns a rerunable job after execution nodes
reboot if a single job is running per node.  No patches were needed.

If more than one job is running per node or a job is using more than
one node, rerunable jobs do not rerun when the execution nodes reboot.
Instead, the jobs change from RUNNING to EXITING and then COMPLETE.

On the other hand, I cannot get the SVN trunk (pbs_server --version
2.4.1b1) to properly rerun a rerunable job when the execution nodes
reboot -- even with only one job running per node.

So, while 2.3-fixes works in some cases, trunk does not.  How can I
help you solve this problem?  It is important to us that rerunable jobs
restart after execution nodes reboot.

Thank you,

-- 
Victor Gregorio
Penguin Computing

On Wed, Apr 22, 2009 at 10:53:42AM -0700, Victor Gregorio wrote:
> Interesting, Michael.  Thanks for the reply.  
> 
> I am not sure this is the right approach, but the following patch to the
> 2.1-fixes subversion branch (pbs_server --version 2.1.12)  allows
> rerunable jobs to properly restart when the execution nodes reboot.
> 
> svn diff src/resmom/requests.c
> Index: src/resmom/requests.c
> ===================================================================
> --- src/resmom/requests.c   (revision 2898)
> +++ src/resmom/requests.c   (working copy)
> @@ -2344,9 +2344,9 @@
>      exit(0);
>      }
>  
> -  if (((rc = return_file(pjob,StdOut,sock)) != 0) ||
> -      ((rc = return_file(pjob,StdErr,sock)) != 0) ||
> -      ((rc = return_file(pjob,Chkpt,sock)) != 0)) 
> +  if (((rc = return_file(pjob,StdOut,sock)) == 0) ||
> +      ((rc = return_file(pjob,StdErr,sock)) == 0) ||
> +      ((rc = return_file(pjob,Chkpt,sock)) == 0)) 
>      {
>      /* FAILURE - cannot report file to server */
> 
> I have been unsuccessful in my attempts to fix trunk or the 2.3-fixes
> branch.  Any advice is appreciated.
> 
> Thanks,
> 
> -- 
> Victor Gregorio
> Penguin Computing
> 
> On Wed, Apr 22, 2009 at 09:58:37AM -0400, Michael Barnes wrote:
> > On Mon, Apr 20, 2009 at 12:02:15PM -0700, Victor Gregorio wrote:
> > > I have found that with Torque versions >= 2.1.10, rerunable jobs are not
> > > restarting after the execution nodes reboot.  Torque version 2.1.9 works
> > > as expected: rerunable jobs restart from the beginning after execution
> > > nodes are rebooted.
> > > 
> > > Here is the [torqueusers] thread on the issue:
> > > http://www.supercluster.org/pipermail/torqueusers/2009-April/008945.html
> > > 
> > > I am not certain why, but if I remove this patch (below) from 2.1.10,
> > > rerunable jobs begin to restart properly after the execution nodes
> > > reboot.  Please note that this patch was introduced in 2.1.10.
> > 
> > I've looked at torque-2.4.1b1-snap.200903101336 and torque-2.1.10, and
> > these versions have logic inconsistencies within the function
> > return_file() and in the places where it is called.  For example:
> > 
> > The source says:
> > 
> > /* return 0 on failure */
> > 
> > static int return_file( ...
> >     ...
> >     return 0; // the /dev/null part and others
> >     return PBS_SYSTEM; // PBS_SYSTEM failure/non-zero
> >     return rc; // from other calls where 0 is NOT a failure
> > 
> > and in req_rerunjob() there is:
> > 
> >   if (((rc = return_file(pjob, StdOut, sock, TRUE)) != 0) ||
> >       ((rc = return_file(pjob, StdErr, sock, TRUE)) != 0) ||
> >       ((rc = return_file(pjob, Checkpoint, sock, TRUE)) != 0))
> >     {
> >     /* FAILURE - cannot report file to server */
> > 
> > 
> > The 2.4 version calls return_file() in more functions than the 2.1
> > version, so this could explain the differences between them.
> > 
> > -mb
> > 
> > -- 
> > +-----------------------------------------------
> > | Michael Barnes
> > |
> > | Thomas Jefferson National Accelerator Facility
> > | 12000 Jefferson Ave.
> > | Newport News, VA 23606
> > | (757) 269-7634
> > +-----------------------------------------------
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev

