[torqueusers] Rescheduling jobs after a node failure
Michael Durket
durket at highwire.stanford.edu
Wed Jun 11 09:38:28 MDT 2008
A question:

We've experienced this scenario:

1) User(s) submit jobs specifying nodes by their attributes (features),
   not by a specific host.
2) Based on specified features, one or more jobs are scheduled by Maui
   onto a specific node (but not all of them run at the same time - some
   stay in the queue).
3) Node goes down (and a 'pbsnodes -l' will list it as down).
4) All jobs still in the queue scheduled onto that node in (2) above
   are stuck and won't run until the node comes back up (which could
   be hours or days if you need a part).

I'm not sure if this is a Torque or Maui question, but couldn't one of
those rescan the queue looking for jobs scheduled on a "down" node that
originally specified the node by "feature" and resubmit those jobs for
rescheduling onto other (running) nodes with that same feature set?
That is, redo the scheduling assignment in (2) above as if it never
happened.
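
A minimal sketch of the rescan described above, in Python, to make the
idea concrete. It is an illustration only, not something Torque or Maui
provides. It assumes 'pbsnodes -l' prints one node per line as a name
followed by a comma-separated state, and the job and node tables
(queued_jobs, node_features) are hypothetical structures that would have
to be filled in from the scheduler by some other means. It only computes
which stuck jobs could move and where; it does not resubmit anything.

```python
def parse_down_nodes(pbsnodes_l_output):
    """Return the names of nodes whose state includes 'down'.

    Assumes each line looks like '<node name> <state[,state...]>'.
    """
    down = set()
    for line in pbsnodes_l_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and "down" in parts[1].split(","):
            down.add(parts[0])
    return down


def plan_reschedule(queued_jobs, node_features, down_nodes):
    """Map each stuck job to the up nodes that satisfy its features.

    queued_jobs   (hypothetical): {job_id: (assigned_node, set_of_features)}
    node_features (hypothetical): {node_name: set_of_features}
    """
    plan = {}
    for job_id, (assigned_node, wanted) in queued_jobs.items():
        if assigned_node not in down_nodes:
            continue  # job is not tied to a down node, leave it alone
        plan[job_id] = sorted(
            node
            for node, features in node_features.items()
            if node not in down_nodes and wanted <= features
        )
    return plan
```

A job whose candidate list comes back empty has no running node with the
requested features and would have to keep waiting either way.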
Michael Durket