[torquedev] Rerunable jobs not restarting when nodes reboot
Victor Gregorio
vgregorio at penguincomputing.com
Mon Apr 20 13:02:15 MDT 2009
Hello all,
I have found that with Torque versions >= 2.1.10, rerunable jobs are not
restarting after the execution nodes reboot. Torque version 2.1.9 works
as expected: rerunable jobs restart from the beginning after execution
nodes are rebooted.
Here is the [torqueusers] thread on the issue:
http://www.supercluster.org/pipermail/torqueusers/2009-April/008945.html
I am not certain why, but if I revert the patch below in 2.1.10,
rerunable jobs restart properly after the execution nodes reboot.
Please note that this patch was first introduced in 2.1.10.
diff -u torque-2.1.9/src/resmom/requests.c torque-2.1.10/src/resmom/requests.c
--- torque-2.1.9/src/resmom/requests.c 2007-08-31 10:51:13.000000000 -0700
+++ torque-2.1.10/src/resmom/requests.c 2007-12-11 15:53:27.000000000 -0800
@@ -581,6 +581,11 @@
   filename = std_file_name(pjob,which,&amt); /* amt is place holder */
+  if (strcmp(filename,"/dev/null") == 0)
+    {
+    return(0);
+    }
+
   fds = open(filename,O_RDONLY,0);
   if (fds < 0)
This is the SVN log for the patch:
------------------------------------------------------------------------
r1662 | garrick | 2007-11-26 14:18:48 -0800 (Mon, 26 Nov 2007) | 1 line
b - handle /dev/null correctly when job rerun
------------------------------------------------------------------------
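To make the effect of the added check easier to see in isolation, here is a
minimal standalone sketch of the control flow. It is not the Torque source:
the function name send_output_file and the -1 error return are hypothetical
stand-ins, and the part of the real code that reads and sends the file
contents is left out.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the patched section of requests.c.
 * Returns 0 on success or when there is nothing to send,
 * and -1 (an assumed error value) when the file cannot be opened.
 */
int send_output_file(const char *filename)
{
    int fds;

    /* The check added in 2.1.10: /dev/null means nothing to return. */
    if (strcmp(filename, "/dev/null") == 0)
    {
        return 0;
    }

    fds = open(filename, O_RDONLY, 0);
    if (fds < 0)
    {
        return -1;
    }

    /* The real code would read the file and send its contents here. */
    close(fds);
    return 0;
}
```

With this sketch, send_output_file("/dev/null") returns 0 without opening
anything, while a path that does not exist falls through to open() and
returns -1. Before the patch, the /dev/null case would also have gone
through open().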
Unfortunately, removing this patch does not fix the problem in 2.3.6.
Thoughts?
--
Victor Gregorio
Penguin Computing