[torqueusers] pbs_mom endless kill loop
George Wm Turner
turnerg at indiana.edu
Tue Oct 21 12:34:34 MDT 2008
I'll add a "me too!"
I've seen it with versions up to Torque 2.3.4; later versions (2.3.3, 2.3.4)
are better, i.e. not as likely to tip over into this mode.
2.3.2 was very bad about getting into this state.
I suspect that with each iteration of the loop it opens another socket back
to the pbs_server; I quickly run out of privileged ports, and then NFS
(which also wants privileged ports) starts failing.
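If you want to watch for this, a rough Linux-only check is to count how many TCP sockets are holding privileged (<1024) local ports by parsing /proc/net/tcp; this is my own diagnostic sketch, not anything Torque ships:

```shell
# Count TCP sockets bound to privileged local ports (<1024).
# In /proc/net/tcp the local address is field 2, "ADDR:PORT",
# with the port in uppercase hex; convert it by hand so this
# works in any awk, not just gawk.
n=$(awk 'NR > 1 {
        split($2, a, ":")
        p = 0
        h = a[2]
        for (i = 1; i <= length(h); i++)
            p = p * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
        if (p < 1024) c++
    }
    END { print c + 0 }' /proc/net/tcp)
echo "privileged local ports in use: $n"
```

If that number climbs toward 1024 while the mom is looping, you're about to lose NFS too.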
Kill off the mom, clean out the files owned by the dead job from .../
mom_priv/jobs, then restart the mom.
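For reference, that clean-up as a script looks roughly like the sketch below. The node name, job id, and spool path are placeholders to adapt (/var/spool/torque is a common default), and RUN="echo" makes it a dry run:

```shell
#!/bin/sh
# Dry-run sketch of the per-node clean-up; drop RUN="echo" to execute.
# Node name, job id, and spool path are placeholders -- substitute your own.
RUN="echo"
NODE="node01"
JOBID="12345.server.example.com"
SPOOL="/var/spool/torque"   # default TORQUE spool dir on many installs

$RUN ssh "$NODE" momctl -s                              # shut the mom down
$RUN ssh "$NODE" rm -f "$SPOOL/mom_priv/jobs/$JOBID.*"  # stale job files
$RUN ssh "$NODE" pbs_mom                                # restart the mom
```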
I've gotten into the habit of rebooting the node too, because I've seen
it associated with out-of-memory situations at times.
Usually the job will run OK after that (the clean-up; the reboot is
probably just extra insurance).
It's sort of like playing whack-a-mole until you get all the job's
files cleaned out. The job will try to start and fail; Torque will then
move it to another node; the problem occurs there; iterate until you put a
hold on the job, offline the nodes, clean out the job files, then
restart. When the mom is in this state, other jobs can get started on
the bad node; by then, the NFS home directory is offline and other
bad things start happening.
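That hold/offline/clean/restart cycle can be sketched as a script; the job id and node list here are placeholders, and the leading echo makes it a dry run:

```shell
#!/bin/sh
# Containment sketch for a job bouncing between nodes (dry run);
# drop RUN="echo" to execute.  Job id and node list are placeholders.
RUN="echo"
JOBID="12345.server.example.com"
BADNODES="node01 node02 node03"

$RUN qhold "$JOBID"             # stop the scheduler re-dispatching the job
for n in $BADNODES; do
    $RUN pbsnodes -o "$n"       # mark each suspect node offline
done
# ...clean mom_priv/jobs on each node and restart the moms, then:
for n in $BADNODES; do
    $RUN pbsnodes -c "$n"       # clear the offline flag
done
$RUN qrls "$JOBID"              # release the hold once the nodes are healthy
```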
I've seen the behavior start when a user runs a node out of memory,
when we have network outages, or at times that I can't explain. I
wanted to document it better and actually discover/explain what's
happening before presenting it, but.... Well, yeah,
I've seen it too.
george wm turner
high performance systems
812 855 5156
On Oct 20, 2008, at 9:45 PM, Kevin Murphy wrote:
> Torque 2.3.0 on CentOS 5 (Rocks V).
> For reasons unknown, our moms occasionally explode in an orgy of
> logging, repeatedly writing messages like this:
> 10/20/2008 12:15:47;0080; pbs_mom;Svr;preobit_reply;top of
> 10/20/2008 12:15:47;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 10/20/2008 12:15:47;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 10/20/2008 12:15:47;0001; pbs_mom;Job;19139.variome.chop.edu;scan_for_exiting: sending signal 9, "KILL" to job 19140.variome.chop.edu, reason: local task termination
> These messages are endlessly repeated. Thoughts?
> torqueusers mailing list
> torqueusers at supercluster.org