[torqueusers] Jobs not being killed due to failed nodes - ie needing to do a qdel -p all the time

Garrick Staples garrick at clusterresources.com
Tue Jul 10 14:22:47 MDT 2007


On Mon, Jul 02, 2007 at 11:30:48PM -0700, Peter Wyckoff alleged:
> 
> I'm seeing that Torque doesn't allow a job to be killed until every node
> it started on confirms that the job was killed. If some nodes fail while
> the job is being killed, or if we were to modify Torque to ignore failed
> nodes, the system never releases those resources.
> 
> Is there a way to tell Torque to kill jobs on a best-effort basis -
> i.e., kill on all the nodes it can talk to, and assume the best for nodes
> that are down? We can already configure Torque to start the pbs_moms
> without restarting active jobs.
> 
> We have a big installation and some long running jobs, so this is a real
> problem for us - just about daily.

Torque definitely kills jobs when sister nodes are down.  It happens
here all the time.
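For reference, the force-purge workaround mentioned in the subject line looks like this (the job ID is illustrative). It should be a last resort: `-p` removes the job record from pbs_server without cleaning up processes on nodes the server cannot reach.

```shell
# Forcibly purge a job that is stuck because a sister node is down.
# The -p flag tells pbs_server to delete the job even if some MOMs
# cannot confirm the kill; leftover processes on unreachable nodes
# must be cleaned up when those nodes come back.
qdel -p 12345
```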

