[torqueusers] Rescheduling jobs after a node failure

Stewart.Samuels at sanofi-aventis.com Stewart.Samuels at sanofi-aventis.com
Wed Jun 11 10:15:30 MDT 2008

Fellow Torque Users,

I have noticed similar behaviour while testing the HA features of
various snapshots and incantations of Torque 2.3.0 and 2.4.0.  This is
easy to simulate with VMware sessions: when I simply power off the
primary master, executing jobs continue to run and complete once the
secondary master comes online.  However, jobs that were sitting in the
queue stay there as "deferred" for very long periods of time, well
beyond the completion of the executing jobs.

In at least the latest 2.4 snapshots, I have had at least one instance
where, after waiting long enough, the queued jobs actually did go into
execution and completed.  But this took a very long time.
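For reference, the failover setup I'm testing looks roughly like this.
This is only a sketch: the hostnames are made up, and you should check
the docs for your exact snapshot, since the HA bits changed between
2.3.0 and 2.4.0.

```
# Start pbs_server in high-availability mode on both masters
# (primary first, then secondary):
pbs_server --ha

# In mom_priv/config on each compute node, list both servers so the
# moms can follow a failover (hostnames here are hypothetical):
$pbsserver master1
$pbsserver master2
```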

I continue to test this, but I believe these experiences are related.


-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Michael
Sent: Wednesday, June 11, 2008 11:38 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] Rescheduling jobs after a node failure

A question:

      We've experienced this scenario:

         1) User(s) submit jobs specifying nodes by their attributes
            (features), not by a specific host.

         2) Based on the specified features, one or more jobs are
            scheduled by Maui onto a specific node (but not all of them
            run at the same time - some stay in the queue).

         3) The node goes down (and 'pbsnodes -l' will list it as
            down).

         4) All jobs still in the queue that were scheduled onto that
            node in (2) above are stuck and won't run until the node
            comes back up (which could be hours or days if you need a
            part).

        I'm not sure if this is a Torque or Maui question, but couldn't
one of those rescan the queue looking for jobs scheduled on a "down"
node that originally specified the node by "feature", and resubmit
those jobs for rescheduling onto other (running) nodes with the same
feature set? That is, redo the scheduling assignment in (2) above as if
it had never happened?
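        The rescan described above could be approximated outside the
scheduler, at least for jobs whose node spec names a host explicitly.
Here is a minimal sketch; the hostnames, job ids, and sample output are
made-up assumptions in the style of 'pbsnodes -l' and 'qstat -f'
Resource_List.nodes, not real captures:

```python
# Hypothetical sketch: cross-reference down nodes against queued jobs'
# node requests to spot jobs that may be stuck waiting on a dead node.

# Sample output in the style of `pbsnodes -l` (assumed, not captured live).
PBSNODES_L = """\
node03               down
node07               down,offline
"""

# Sample job-id -> Resource_List.nodes spec, as one might collect from
# `qstat -f` for queued jobs; purely illustrative values.
QUEUED_JOBS = {
    "1234.master": "node03:ppn=4",     # pinned to an explicit host
    "1235.master": "1:ppn=2:bigmem",   # feature-based request
    "1236.master": "node07",           # pinned to an explicit host
}

def down_nodes(pbsnodes_output):
    """Parse node names out of `pbsnodes -l`-style output."""
    return {line.split()[0]
            for line in pbsnodes_output.splitlines() if line.strip()}

def stuck_jobs(jobs, down):
    """Return ids of jobs whose node spec explicitly names a down node."""
    stuck = []
    for jobid, spec in jobs.items():
        # The first ':'-delimited token is either a host name or a
        # node count; only an explicit host can match a down node.
        if spec.split(":")[0] in down:
            stuck.append(jobid)
    return stuck

if __name__ == "__main__":
    down = down_nodes(PBSNODES_L)
    for jobid in stuck_jobs(QUEUED_JOBS, down):
        # In a real setting one might qalter the node request or
        # resubmit the job; here we just report it.
        print(jobid, "appears pinned to a down node")
```

Of course, a reservation Maui makes internally for a feature-based
request (job 1235 above) isn't visible in the job's own node spec,
which is part of why a proper rescan would have to happen inside the
scheduler itself.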

      Michael Durket
