[torquedev] Rerunable jobs not restarting when nodes reboot

Victor Gregorio vgregorio at penguincomputing.com
Mon Apr 20 13:02:15 MDT 2009


Hello all, 

I have found that with Torque versions >= 2.1.10, rerunable jobs are not
restarting after the execution nodes reboot.  Torque version 2.1.9 works
as expected: rerunable jobs restart from the beginning after execution
nodes are rebooted.

Here is the [torqueusers] thread on the issue:
http://www.supercluster.org/pipermail/torqueusers/2009-April/008945.html

I am not certain why, but if I revert the patch below in 2.1.10,
rerunable jobs restart properly again after the execution nodes
reboot.  Note that this patch was first introduced in 2.1.10.

diff -u torque-2.1.9/src/resmom/requests.c torque-2.1.10/src/resmom/requests.c
--- torque-2.1.9/src/resmom/requests.c  2007-08-31 10:51:13.000000000 -0700
+++ torque-2.1.10/src/resmom/requests.c 2007-12-11 15:53:27.000000000 -0800
@@ -581,6 +581,11 @@
 
   filename = std_file_name(pjob,which,&amt); /* amt is place holder */
 
+  if (strcmp(filename,"/dev/null") == 0)
+    {
+    return(0);
+    }
+
   fds = open(filename,O_RDONLY,0);
 
   if (fds < 0)

This is the SVN log for the patch:
------------------------------------------------------------------------
r1662 | garrick | 2007-11-26 14:18:48 -0800 (Mon, 26 Nov 2007) | 1 line

b - handle /dev/null correctly when job rerun
------------------------------------------------------------------------
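
For anyone who wants to poke at the behaviour outside of pbs_mom, here
is a minimal standalone sketch of the control flow that the hunk adds.
It is not Torque code: the function name, the "skipped" flag, and the
-1 return on a failed open() are assumptions made for illustration.
Only the strcmp() check, the return(0), and the open() call are taken
from the diff.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, NOT a Torque function.  It mirrors only the
 * control flow of the patched hunk in src/resmom/requests.c.
 * Returns 0 on success (or when the file is skipped), -1 if the
 * open fails.  The -1 value is an assumption for this sketch. */
int open_spool_file_sketch(const char *filename, int *skipped)
  {
  int fds;

  *skipped = 0;

  if (strcmp(filename, "/dev/null") == 0)
    {
    /* check added in r1662: report success without opening anything */
    *skipped = 1;

    return(0);
    }

  fds = open(filename, O_RDONLY, 0);

  if (fds < 0)
    {
    return(-1);
    }

  /* the real code goes on to use the descriptor; the sketch just
   * closes it again */
  close(fds);

  return(0);
  }
```

With the check in place, a call with "/dev/null" returns 0 and never
reaches open(), so the caller sees the same return value as a
successful open.  Whether that early success is what keeps the job
from being rerun is exactly the open question here.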

Unfortunately, removing this patch does not fix the problem in 2.3.6.
Thoughts?

-- 
Victor Gregorio
Penguin Computing


