[torqueusers] Rescheduling jobs after a node failure
durket at highwire.stanford.edu
Wed Jun 11 09:38:28 MDT 2008
We've experienced this scenario:
1) User(s) submit jobs specifying nodes by their attributes (features),
   not by a specific host.
2) Based on the specified features, one or more jobs are scheduled by
   Maui onto a specific node (but not all of them run at the same
   time - some stay in the queue).
3) The node goes down (and a 'pbsnodes -l' will list it as down).
4) All jobs still in the queue that were scheduled onto that node in (2)
   are stuck and won't run until the node comes back up (which could be
   hours or days if you need a part).
I'm not sure if this is a Torque or Maui question, but couldn't one of
those rescan the queue looking for jobs scheduled on a "down" node that
requested the node by "feature", and resubmit those jobs for
rescheduling onto other nodes with that same feature set? That is, redo
the scheduling assignment in (2) above as if it never happened.
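
To make the request concrete, here is a rough sketch of the rescan I have in mind, written as standalone Python over in-memory data. It is not Torque or Maui code and calls neither of them: the Node and Job classes, their field names, and reassign_stranded_jobs() are all made up for illustration, and a real version would have to get node state and job requirements from the batch system itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    # Hypothetical model of a compute node; not a Torque/Maui structure.
    name: str
    features: frozenset
    up: bool = True


@dataclass
class Job:
    # Hypothetical model of a queued job that asked for features, not a host.
    job_id: str
    required_features: frozenset
    assigned_node: Optional[str] = None  # planned node, job not yet running


def reassign_stranded_jobs(jobs: List[Job], nodes: List[Node]) -> Dict[str, str]:
    """Redo the node assignment for queued jobs planned onto a down node.

    Returns a mapping of job_id -> new node name for every job that moved.
    Jobs with no suitable "up" node are left unassigned (still queued).
    """
    by_name = {n.name: n for n in nodes}
    moved: Dict[str, str] = {}
    for job in jobs:
        node = by_name.get(job.assigned_node)
        if node is None or node.up:
            continue  # unassigned, unknown node, or node is fine: leave alone
        # Any "up" node whose features cover everything the job requested.
        candidates = [
            n for n in nodes
            if n.up and job.required_features <= n.features
        ]
        if candidates:
            job.assigned_node = candidates[0].name
            moved[job.job_id] = candidates[0].name
        else:
            job.assigned_node = None  # forget the old assignment entirely
    return moved
```

The sketch just takes the first matching node; a real scheduler would apply its normal priority and load policies at that point. The part that matters is that the old assignment is dropped and the feature match is redone against nodes that are currently up.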