[torqueusers] pbs_mom endless kill loop

George Wm Turner turnerg at indiana.edu
Tue Oct 21 12:34:34 MDT 2008


I'll add a "me too!"

I've seen it with versions up to Torque 2.3.4.  Later versions (2.3.3,
2.3.4) are better, i.e. not as likely to tip over into this mode;
2.3.2 was very bad about getting into this state.

I suspect with each iteration of the loop it opens another socket back  
to the pbs_server; I quickly run out of privileged ports and then NFS  
goes offline.
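
One rough way to check the privileged-port suspicion on a Linux node is to count sockets whose local port is below 1024. The sketch below parses text in the /proc/net/tcp layout; the function name is made up for illustration, and the count also includes ordinary listeners on low ports (sshd and the like), so it only means something when compared against a healthy node.

```python
def count_privileged_local_ports(proc_net_tcp_text):
    """Count sockets with a local port below 1024 in /proc/net/tcp-style text."""
    count = 0
    # The first row is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[1] is local_address, e.g. "0100007F:03FF" (hex IP : hex port).
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port < 1024:
            count += 1
    return count
```

On a node, pass it the contents of /proc/net/tcp, e.g. count_privileged_local_ports(open("/proc/net/tcp").read()), and watch whether the number climbs while the mom is looping.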

Kill off the mom, clean out the files owned by the dead job from
.../mom_priv/jobs, then restart the mom.
I've gotten into the habit of rebooting the node too, because I've seen
the problem associated with out-of-memory situations at times.

Usually the job will run OK after that (the clean up; the reboot is  
optional).

It's sort of like playing whack-a-mole until you get all the job's
files cleaned out.  The job will try to start and fail; Torque will then
move it to another node, where the problem occurs again.  This repeats
until you put a hold on the job, offline the nodes, clean the job
files, then restart.  When the mom is in this state, other jobs can
still get started on the bad node; by then, the NFS home directory is
offline and other bad things start happening.

I've seen the behavior start when a user runs a node out of memory,  
when we have network outages, or at times that I can't explain.  I  
wanted to document it better and actually discover/explain what's  
happening better than this before presenting it but....  Well, yeah,
I've seen it too.


george wm turner
high performance systems
812 855 5156



On Oct 20, 2008, at 9:45 PM, Kevin Murphy wrote:

> Torque 2.3.0 on CentOS 5 (Rocks V).
>
> For reasons unknown, our moms occasionally explode in an orgy of  
> logging, repeatedly writing messages like this:
>
> 10/20/2008 12:15:47;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 10/20/2008 12:15:47;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 10/20/2008 12:15:47;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 10/20/2008 12:15:47;0001; pbs_mom;Job;19139.variome.chop.edu;scan_for_exiting: sending signal 9, "KILL" to job 19140.variome.chop.edu, reason: local task termination detected
>
> These messages are endlessly repeated.  Thoughts?
>
> Thanks,
> Kevin
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
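
To spot a mom that has tipped into this loop, one option is to count the repeated scan_for_exiting KILL messages per job in the mom log. The sketch below matches the message text from the log excerpt quoted above; the function names and the threshold of 10 are arbitrary choices for illustration, not anything Torque defines.

```python
import re
from collections import Counter

# Matches the message text shown in the quoted log excerpt.
KILL_RE = re.compile(r'scan_for_exiting: sending signal 9, "KILL" to job ([^,\s]+)')


def count_kill_messages(log_lines):
    """Count scan_for_exiting KILL messages per target job ID."""
    counts = Counter()
    for line in log_lines:
        match = KILL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def jobs_in_kill_loop(log_lines, threshold=10):
    """Return job IDs that were sent KILL at least `threshold` times."""
    counts = count_kill_messages(log_lines)
    return sorted(job for job, n in counts.items() if n >= threshold)
```

Feeding it the lines of a mom log file (an open file object works, since it iterates line by line) gives the job IDs whose files need cleaning out.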


