[torqueusers] What does "rerunable" really mean?

Rob torque at theknack.net
Tue Aug 15 10:22:29 MDT 2006


I am running some unreliable 3rd party simulation software on an 
unreliable cluster.  The simulation is nondeterministic and will have 
different results (including whether or not it halts within the allotted 
time) even if re-run with identical parameters.

I would like my batch jobs to be re-run when they fail for any of the 
following reasons:

1) the program deadlocks or loops forever, exceeds its wallclock 
allocation, and is killed by torque
2) the program crashes (or is aborted by the OS) and returns an error code
3) NFS fails on a node, in which case the node still responds to pings, 
but user processes are frozen and don't terminate
4) part or all of the cluster is rebooted causing jobs to be aborted.

From what I have seen, "rerunable" (qsub -r y) only seems to apply to 
case 4).

Manually resubmitting a job is problematic because I have chains of 
multi-job dependencies (-W depend=afterok), so a failed job cascades to 
many other jobs.
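
To make cases 1) and 2) concrete: one possible workaround is to retry 
inside the job itself, so that torque (and any afterok dependency) only 
sees the final outcome. The following is a minimal sketch, not a torque 
feature. The function name run_with_retries and its parameters are 
hypothetical, and it assumes the per-attempt timeout is set well below 
the job's wallclock allocation. It does nothing for cases 3) and 4), and 
the timeout only kills the direct child process, not any processes that 
child has spawned.

```python
import subprocess
import sys


def run_with_retries(cmd, max_attempts=3, timeout_s=3600):
    """Run cmd until it exits with status 0.

    Retries on a non-zero exit code (case 2) or when an attempt runs
    longer than timeout_s seconds (case 1). Returns the 1-based number
    of the attempt that succeeded, or 0 if every attempt failed.
    All names here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it has not finished within timeout_s seconds.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: timed out", file=sys.stderr)
            continue
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}: exit code {result.returncode}",
              file=sys.stderr)
    return 0
```

A job script could call this on the simulation command and exit with 
status 0 only when the return value is non-zero, so that a dependent 
afterok job is released only after some attempt has succeeded.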

Is there a good solution to this?

Thanks,
Rob.


