[torqueusers] pbs_mom trying to kill job that does not exist

Walid walid.shaari at gmail.com
Tue May 12 07:22:32 MDT 2009


2009/5/11 Roger Moye <moye at rice.edu>

>
>
> I have also found that the problem can be reproduced if we do "qdel -p"
> to purge a job that has exceeded the walltime but is stuck in the queue
> unable to exit.  If the pbs_mom on the compute node is still running and
> is still trying to kill the stuck job, it will continue to do so even
> after a successful "qdel -p" and will eventually crash the torque
> server.  To avoid this I always restart pbs_mom on the compute node
> after running "qdel -p".  This resolves the problem.  So it would seem
> that there are circumstances where pbs_mom on the compute nodes is not
> aware that the job it is trying to kill is already gone and it keeps
> trying to kill it indefinitely.


Roger,

Can't you run "momctl -c <jobid>" (or "momctl -c all") on the nodes where the
job should be killed before doing "qdel -p"?
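
Something along these lines is what I have in mind. It is only a rough sketch, not tested against your setup: it assumes "momctl -c" clears the stale job record on the MOM named with "-h", and that the job ID and node name (made up here) are passed in by hand. The run and clear_and_purge helpers are just names I picked for illustration. By default it only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch: clear a stale job on a compute node's pbs_mom first,
# then purge the job from the server with qdel -p.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: $1 = job ID, $2 = compute node running pbs_mom.
clear_and_purge() {
    jobid="$1"
    node="$2"
    # Ask the MOM on that node to drop its stale record of the job.
    run momctl -h "$node" -c "$jobid" || return 1
    # Only then purge the job on the server side.
    run qdel -p "$jobid"
}
```

For example, "clear_and_purge 1234.headnode node042" (both values invented) would print the momctl command followed by the qdel command. Doing the MOM side first is the point: the MOM should no longer have a job to keep killing by the time the server forgets about it.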

Kind regards,

Walid Shaari