[torqueusers] What does "rerunable" really mean?
torque at theknack.net
Tue Aug 15 10:22:29 MDT 2006
I am running some unreliable 3rd party simulation software on an
unreliable cluster. The simulation is nondeterministic and will have
different results (including whether or not it halts within the allotted
time) even if re-run with identical parameters.
I would like my batch jobs to be re-run when they fail for any of the
following reasons:
1) the program deadlocks or loops forever, exceeds its wallclock
allocation, and is killed by Torque
2) the program crashes (or is aborted by the OS) and returns an error code
3) NFS fails on a node, in which case the node still responds to pings,
but user processes are frozen and don't terminate
4) part or all of the cluster is rebooted causing jobs to be aborted.
From what I have seen, "rerunable" (qsub -r y) only seems to apply to
Manually resubmitting a job is problematic because I have chains of
multi-job dependencies (-W depend=afterok), so a failed job cascades to many
Is there a good solution to this?
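
One possible partial workaround, sketched here under stated assumptions and not a Torque feature: handle cases 1 and 2 inside the job itself, so that the job only reports failure after several attempts and the afterok chain is not broken by a single bad run. The function name run_with_retries and its parameters are hypothetical, and the command passed to it stands in for the real simulation. It does nothing for cases 3 and 4, and the timeout only kills the direct child process, not anything that child spawned.

```python
import subprocess
import sys


def run_with_retries(cmd, max_attempts=3, timeout_s=3600):
    """Run cmd until it exits with status 0.

    Returns the number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails or times out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # On timeout, subprocess.run kills the child and raises TimeoutExpired.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: timed out after {timeout_s}s", file=sys.stderr)
            continue
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}: exit status {result.returncode}", file=sys.stderr)
    raise RuntimeError(f"{cmd!r} failed after {max_attempts} attempts")
```

A job script would call this with the simulation's command line as a list of arguments and exit non-zero if RuntimeError is raised. The requested walltime has to cover max_attempts times timeout_s, otherwise Torque will kill the job before the retries are used up.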