[torqueusers] Moving jobs off dead nodes?

Troy Baer troy at osc.edu
Thu Mar 17 15:38:17 MST 2005

One of the problems we have occasionally with our heavily patched
in-house version of OpenPBS (most of which is in TORQUE as well IIRC) is
that it is sometimes difficult to persuade PBS to let you move a
running job off of a node that is dead or dying.  I don't mean
checkpoint/restart here (nice though it would be), but rather just being
able to stop a job and rerun it on a different set of nodes.  You can do
this easily enough if all the pbs_moms are up -- you qhold the job,
stop all the moms, delete the job scripts on all the nodes allocated to
it, restart the moms, and then qrls the job.  However, if one of the
nodes is dead or not responding, this doesn't work.
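
For concreteness, the all-moms-up procedure above can be sketched as a
dry-run plan that only builds the commands and never runs them.  The use
of ssh, the init script path, the spool directory, and the assumption
that job files are named after the job id are all site-specific guesses,
not something PBS guarantees; rerun_plan is a hypothetical helper.

```python
import shlex

# Assumed, site-specific values (not taken from any PBS default).
MOM_JOBS_DIR = "/var/spool/pbs/mom_priv/jobs"  # assumed PBS_HOME layout
MOM_INIT = "/etc/init.d/pbs_mom"               # hypothetical init script


def rerun_plan(job_id, nodes):
    """Return, in order, the commands for rerunning a job elsewhere.

    Dry run only: the commands are built as strings, never executed.
    """
    job = shlex.quote(job_id)
    hosts = [shlex.quote(n) for n in nodes]
    # Assumes the job's files on each node are named <jobid>.*
    remove = shlex.quote(f"rm -f {MOM_JOBS_DIR}/{job}.*")

    plan = [f"qhold {job}"]                               # hold the job
    plan += [f"ssh {h} {MOM_INIT} stop" for h in hosts]   # stop all the moms
    plan += [f"ssh {h} {remove}" for h in hosts]          # delete job scripts
    plan += [f"ssh {h} {MOM_INIT} start" for h in hosts]  # restart the moms
    plan.append(f"qrls {job}")                            # release the job
    return plan
```

Calling rerun_plan("1234.server", ["node01", "node02"]) returns the
commands in the order given above, for an admin to review and run by
hand.  Every ssh step needs a reachable node, which is exactly where
this breaks down.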

The typical scenario goes something like this:  Job X gets allocated a
list of nodes that includes node foo and starts running.  In the middle
of the job, node foo's $MOVING_PART has a fatal problem that causes the
node to go down.  In OpenPBS, the job is now stuck; the pbs_server
thinks it is still there and running on all the nodes, but since node
foo's pbs_mom won't respond, you (the admin) can't qhold or qdel it. 
All you can do is stop the pbs_server process, delete the files
associated with the job, and restart pbs_server.
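
The server-side workaround can be sketched the same way, again as a
dry run that builds commands without executing them.  Using
"qterm -t quick" to stop the server, the server_priv/jobs path, and the
<jobid>.* file naming are assumptions about a typical install;
purge_plan is a hypothetical helper.

```python
import shlex

# Assumed location of the server's job files; varies by installation.
SERVER_JOBS_DIR = "/var/spool/pbs/server_priv/jobs"


def purge_plan(job_id):
    """Return, in order, the commands for purging a stuck job by hand.

    Dry run only: the commands are built as strings, never executed.
    """
    job = shlex.quote(job_id)
    return [
        "qterm -t quick",                     # stop pbs_server, leave other jobs running
        f"rm -f {SERVER_JOBS_DIR}/{job}.*",   # assumes files are named <jobid>.*
        "pbs_server",                         # restart the server
    ]
```

Note that this discards the job outright rather than requeueing it, so
it would have to be resubmitted afterwards.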

Does TORQUE deal with this sort of thing more gracefully than OpenPBS
does?  If so, how?  If not, is there interest in fixing it?

