[torqueusers] pbs_mom trying to kill job that does not exist
walid.shaari at gmail.com
Tue May 12 07:22:32 MDT 2009
2009/5/11 Roger Moye <moye at rice.edu>
> I have also found that the problem can be reproduced if we do "qdel -p"
> to purge a job that has exceeded the walltime but is stuck in the queue
> unable to exit. If the pbs_mom on the compute node is still running and
> is still trying to kill the stuck job, it will continue to do so even
> after a successful "qdel -p" and will eventually crash the torque
> server. To avoid this I always restart pbs_mom on the compute node
> after running "qdel -p". This resolves the problem. So it would seem
> that there are circumstances where pbs_mom on the compute nodes is not
> aware that the job it is trying to kill is already gone and it keeps
> trying to kill it indefinitely.
Can't you run momctl -c <jobid> (or momctl -c all) on the nodes where you
want the job killed, before doing qdel -p?