[torqueusers] pbs_mom trying to kill job that does not exist
moye at rice.edu
Mon May 11 10:34:36 MDT 2009
We are using torque 2.3.0 and have run into a few situations where
pbs_mom on one of our cluster nodes continues to try to kill a job that
has already exited the system. If this behavior goes undetected for a
few days, the torque server eventually corrupts the jobs in the queue,
and qstat returns "End of File". I have found that this usually
means that one or more files in /var/spool/torque/server_priv/jobs have
been corrupted. We have also seen this behavior on torque 2.1.9 on one
of our older clusters.
I have also found that the problem can be reproduced by running "qdel -p"
to purge a job that has exceeded its walltime but is stuck in the queue,
unable to exit. If the pbs_mom on the compute node is still running and
is still trying to kill the stuck job, it will continue to do so even
after a successful "qdel -p" and will eventually crash the torque
server. To avoid this I always restart pbs_mom on the compute node
after running "qdel -p". This resolves the problem. So it would seem
that there are circumstances where pbs_mom on the compute nodes is not
aware that the job it is trying to kill is already gone and it keeps
trying to kill it indefinitely.
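
The workaround described above (purge the job, then restart pbs_mom on
the node that was running it) could be sketched as a small Python
helper. This is only an illustration under several assumptions: the job
ID and node name are placeholders, the node is assumed to be reachable
over ssh, and the restart command "/etc/init.d/pbs_mom restart" is a
guess at a typical init script that will differ between sites. Only
"qdel -p" comes from the description above.

```python
import subprocess


def workaround_commands(job_id, node,
                        restart_cmd="/etc/init.d/pbs_mom restart"):
    """Return the two workaround commands, in the order they should run.

    restart_cmd is an ASSUMPTION: replace it with whatever restarts
    pbs_mom on your compute nodes.
    """
    return [
        # Step 1: purge the stuck job from the server (from the post).
        ["qdel", "-p", job_id],
        # Step 2: restart pbs_mom on the node so it stops trying to
        # kill the job that no longer exists (ssh access is assumed).
        ["ssh", node, restart_cmd],
    ]


def run_workaround(job_id, node, dry_run=True):
    """Print the commands (dry_run=True) or actually execute them."""
    for cmd in workaround_commands(job_id, node):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)
```

Calling run_workaround("12345.headnode", "node042") with the default
dry_run=True only prints the two commands, which makes it easy to
review them before running anything against a production queue. Both
arguments here are hypothetical values.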
Is there something we can do in our configuration to avoid this
problem? If not, perhaps a future release of Torque could detect this
condition and mark the node offline rather than eventually crashing
the server.
Linux Cluster Administrator
TeraGrid Campus Champion
Dept. of Academic and Research Computing
Research Computing Support Group
moye at rice.edu