[torqueusers] Jobs not being killed due to failed nodes - ie
needing to do a qdel -p all the time
Garrick Staples
garrick at clusterresources.com
Tue Jul 10 14:22:47 MDT 2007
On Mon, Jul 02, 2007 at 11:30:48PM -0700, Peter Wyckoff alleged:
>
> I'm seeing that Torque doesn't seem to allow a job to be killed until all
> the nodes it started on can confirm the job was killed. If some nodes fail
> while killing the job or if we happened to modify torque to ignore failed
> nodes, the system won't release those resources.
>
> Is there a way to tell Torque to kill jobs on a best-effort basis, i.e.,
> kill the job on all nodes it can talk to and assume the best for nodes that
> are down? This seems reasonable, as we can configure Torque to start the
> pbs_moms without restarting active jobs.
>
> We have a big installation and some long running jobs, so this is a real
> problem for us - just about daily.
Torque definitely kills jobs even when sister nodes are down. It happens
here all the time.