[torqueusers] Jobs not being killed due to failed nodes - ie needing to do a qdel -p all the time

Walid walid.shaari at gmail.com
Sun Jul 15 09:32:01 MDT 2007


We are running 2.1.6 and 2.1.8, and we do see the problem Peter describes:
users with operator rights cannot delete a job unless we run qdel -p and
then clean up the nodes afterwards.
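For reference, the workaround described above can be scripted. This is a
minimal sketch, not a fix; the job id and node name are placeholders, and the
commands are wrapped in a dry-run helper (they are echoed, not executed) so
the sequence can be read safely:

```shell
#!/bin/sh
# Hedged sketch: force-purge a stuck job, then mark its failed sister node
# offline so the scheduler stops placing work on it.
JOBID=1234.server   # hypothetical job id - substitute a real one
NODE=node042        # hypothetical failed node - substitute a real one

# Dry-run wrapper: prints each command instead of running it.
# Replace the body with "$@" to execute for real.
run() { echo "+ $*"; }

run qdel -p "$JOBID"     # purge the job from pbs_server without waiting
                         # for confirmation from unreachable pbs_moms
run pbsnodes -o "$NODE"  # mark the unresponsive node offline
```

Note that qdel -p only purges the server's record of the job; any processes
still alive on surviving nodes must be cleaned up by hand (or by a script
over the node list), which is the "clean up the nodes afterwards" step.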



On 7/10/07, Garrick Staples <garrick at clusterresources.com> wrote:
> On Mon, Jul 02, 2007 at 11:30:48PM -0700, Peter Wyckoff alleged:
> >
> > I'm seeing that Torque doesn't seem to allow a job to be killed until
> > all the nodes it started on can confirm the job was killed. If some
> > nodes fail while killing the job, or if we happened to modify Torque
> > to ignore failed nodes, the system won't release those resources.
> >
> > Is there a way to tell Torque to kill jobs on a best-effort basis -
> > i.e., on all nodes it can talk to, assuming the best for nodes that
> > are down? We can already configure Torque to start the pbs_moms
> > without restarting active jobs.
> >
> > We have a big installation and some long-running jobs, so this is a
> > real problem for us - just about daily.
> Torque definitely kills jobs when sister nodes are down.  It happens
> over here all the time.
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
