[torqueusers] Jobs not being killed due to failed nodes - ie needing to do a qdel -p all the time

Peter Wyckoff wyckoff at yahoo-inc.com
Tue Jul 3 00:30:48 MDT 2007


Torque does not seem to consider a job killed until every node it
started on has confirmed the kill. If some nodes fail while the job is
being killed, or if we have modified Torque to ignore failed nodes, the
system never releases the job's resources.

Is there a way to tell Torque to kill jobs on a best-effort basis, i.e.,
kill the job on all nodes it can talk to and assume the best for nodes
that are down? That seems reasonable given that we can configure Torque
to start the pbs_moms without restarting active jobs.

We have a big installation and some long-running jobs, so this is a real
problem for us; it comes up just about daily.
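
For reference, the manual workaround from the subject line can be
sketched roughly as below. This is only a minimal sketch, not something
Torque ships: the function name force_delete, the wait time, and the
injectable run/sleep parameters are hypothetical, and it assumes that
qstat exits non-zero once the server no longer knows the job.

```python
import subprocess
import time


def force_delete(job_id, run=subprocess.run, wait_seconds=60, sleep=time.sleep):
    """Try a normal qdel first; fall back to qdel -p if the job lingers.

    Hypothetical helper: `run` and `sleep` are injectable so the logic
    can be exercised without a Torque installation.
    """
    # Ask the server to kill the job the normal way.
    run(["qdel", job_id], check=False)

    # Give the reachable nodes some time to confirm the kill.
    sleep(wait_seconds)

    # Assumption: qstat returns a non-zero exit status for an unknown job.
    result = run(["qstat", job_id], check=False, capture_output=True)
    if result.returncode == 0:
        # Job is still listed, so purge it without waiting for the nodes.
        run(["qdel", "-p", job_id], check=False)
        return "purged"
    return "deleted"
```

Calling force_delete("1234.headnode") with a real job id would block for
the wait time and then purge only if the job is still listed. It does
not solve the underlying issue, since the purge is still an explicit
second step rather than a best-effort kill inside Torque itself.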

Thanks for any help here, pete


