[torqueusers] Rescheduling jobs after a node failure

Michael Durket durket at highwire.stanford.edu
Wed Jun 11 09:38:28 MDT 2008


A question:

      We've experienced this scenario:

         1) Users submit jobs that specify nodes by their attributes
            (features), not by a specific host.

         2) Based on the specified features, Maui schedules one or more
            jobs onto a specific node (but not all of them run at the
            same time; some stay in the queue).

         3) The node goes down (and 'pbsnodes -l' lists it as down).

         4) All jobs still in the queue that were scheduled onto that
            node in (2) above are stuck and won't run until the node
            comes back up (which could be hours or days if you need a
            part).

      I'm not sure if this is a Torque or a Maui question, but couldn't
one of them rescan the queue for jobs that are scheduled on a "down" node
but originally specified the node by "feature", and resubmit those jobs
for rescheduling onto other (running) nodes with the same feature set?
That is, redo the scheduling assignment in (2) above as if it never
happened.
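
      To make the request concrete, here is a minimal Python sketch of
the rescan described above. It is only an illustration of the logic: it
is not Torque or Maui code and calls neither of them. The Node and Job
classes, their fields, and the function reschedule_stuck_jobs are
hypothetical names invented for this example, and picking the first
matching node alphabetically is an arbitrary assumption, not how Maui
chooses nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical model of a compute node (not a Torque data structure).
    name: str
    features: frozenset
    up: bool


@dataclass
class Job:
    # Hypothetical model of a queued job (not a Torque data structure).
    job_id: str
    required_features: frozenset   # features the user asked for
    assigned_node: Optional[str]   # node tentatively chosen in step (2)
    by_feature: bool = True        # False if the user named a specific host


def reschedule_stuck_jobs(jobs, nodes):
    """Redo the node assignment for queued jobs whose node is down.

    Only jobs that requested nodes by feature are touched. Returns a dict
    mapping job_id to the new node name, or to None when no running node
    offers the required features (the job then simply stays queued).
    """
    by_name = {n.name: n for n in nodes}
    moved = {}
    for job in jobs:
        node = by_name.get(job.assigned_node)
        # Skip jobs with no assignment, a healthy node, or an explicit host.
        if node is None or node.up or not job.by_feature:
            continue
        # Any running node whose feature set covers the request qualifies.
        candidates = sorted(
            n.name for n in nodes
            if n.up and job.required_features <= n.features
        )
        job.assigned_node = candidates[0] if candidates else None
        moved[job.job_id] = job.assigned_node
    return moved


if __name__ == "__main__":
    nodes = [
        Node("n1", frozenset({"bigmem", "ib"}), up=False),
        Node("n2", frozenset({"bigmem", "ib"}), up=True),
        Node("n3", frozenset({"ib"}), up=True),
    ]
    jobs = [
        Job("101", frozenset({"bigmem"}), "n1"),
        Job("102", frozenset({"ib"}), "n1", by_feature=False),
        Job("103", frozenset({"ib"}), "n3"),
    ]
    print(reschedule_stuck_jobs(jobs, nodes))
```

In this made-up example only job 101 is moved (to n2): job 102 named its
host explicitly, and job 103 sits on a node that is still up.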

      Michael Durket


