[torqueusers] pbs_mom trying to kill job that does not exist

Roger Moye moye at rice.edu
Mon May 11 10:34:36 MDT 2009


We are using torque 2.3.0 and have run into a few situations where 
pbs_mom on one of our cluster nodes continues to try to kill a job that 
has already exited the system.  If this behavior goes undetected for a 
few days the torque server will eventually corrupt the jobs in the queue 
such that qstat returns "End of File."  I have found that this usually 
means that one or more files in /var/spool/torque/server_priv/jobs have 
been corrupted.  We have also seen this behavior on torque 2.1.9 on one 
of our older clusters.
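
A rough sketch of how the "End of File" symptom could be watched for 
from a cron job, so the corruption is noticed early.  It assumes the 
message appears somewhere in qstat's combined stdout/stderr; the exact 
wording and stream may differ between versions, and the function names 
are made up for illustration.

```python
import subprocess


def queue_looks_corrupted(qstat_output):
    """Return True if qstat's output contains the "End of File" message.

    Assumption: the message text is matched case-insensitively anywhere
    in the output; the real wording may vary by Torque version.
    """
    return "end of file" in qstat_output.lower()


def check_queue(run_qstat=None):
    """Run qstat (or a stand-in callable) and report whether it looks broken."""
    if run_qstat is None:
        def run_qstat():
            # Capture stdout and stderr together, since we do not know
            # which stream carries the error message.
            proc = subprocess.run(
                ["qstat"],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
            )
            return proc.stdout
    return queue_looks_corrupted(run_qstat())
```

Calling check_queue() with no arguments runs the real qstat; passing a 
callable that returns a string lets the check be tried without a 
Torque installation.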

I have also found that the problem can be reproduced if we do "qdel -p" 
to purge a job that has exceeded the walltime but is stuck in the queue 
unable to exit.  If the pbs_mom on the compute node is still running and 
is still trying to kill the stuck job, it will continue to do so even 
after a successful "qdel -p" and will eventually crash the torque 
server.  To avoid this I always restart pbs_mom on the compute node 
after running "qdel -p".  This resolves the problem.  So it would seem 
that there are circumstances where pbs_mom on the compute nodes is not 
aware that the job it is trying to kill is already gone and it keeps 
trying to kill it indefinitely.
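
The workaround described above (purge the job, then restart pbs_mom on 
the compute node) could be scripted roughly as follows.  This is only a 
sketch under assumptions: the node name is supplied by the caller, the 
node is reachable over ssh, and pbs_mom is restarted with "service 
pbs_mom restart", which depends on how the init script was installed at 
your site.  The function names are hypothetical.

```python
import subprocess


def build_commands(job_id, node, restart_cmd=("service", "pbs_mom", "restart")):
    """Return the commands for the workaround, in the order they should run.

    restart_cmd is an assumption; substitute whatever restarts pbs_mom
    on your compute nodes.
    """
    return [
        ["qdel", "-p", job_id],           # purge the stuck job on the server
        ["ssh", node] + list(restart_cmd),  # then restart pbs_mom on the node
    ]


def purge_and_restart(job_id, node, run=subprocess.check_call,
                      restart_cmd=("service", "pbs_mom", "restart")):
    """Purge the job, then restart pbs_mom on the node that was running it.

    With the default runner, a failing qdel raises CalledProcessError and
    the restart is not attempted.
    """
    for cmd in build_commands(job_id, node, restart_cmd):
        run(cmd)
```

For example, purge_and_restart("1234.head", "node17") would run 
"qdel -p 1234.head" and then "ssh node17 service pbs_mom restart"; the 
job id and node name here are placeholders.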

Is there something we can do in our configuration to avoid this 
problem?  If not, perhaps a future release of Torque could detect this 
condition and mark the node offline rather than eventually crashing the 
server.

Thanks!
-Roger

-- 
=======================================
Roger Moye
Linux Cluster Administrator
TeraGrid Campus Champion
Rice University
Dept. of Academic and Research Computing
Research Computing Support Group
(713) 348-5756
moye at rice.edu


