[torqueusers] death of pbs_mom's

Brock Palen brockp at umich.edu
Wed Oct 11 09:45:21 MDT 2006


We have been using torque-2.1.2 for a while on an amd64 RHEL4 cluster.
In the last few days we have started seeing moms (pbs_mom) die on the
nodes. They are not dying en masse, and there is no noticeable
correlation between the nodes where they have died.

The logs say nothing.  When a mom dies, its jobs hang in the queue in
the Running state and can only be removed by purging the job.  These
hung jobs then make Moab keep trying to remove the now very old job,
so Moab never sleeps.

Has anyone else ever seen a problem when a mom dies?

mom_log

10/10/2006 01:58:40;0002;   pbs_mom;Svr;Log;Log opened
10/10/2006 01:58:40;0080;   pbs_mom;Job;37188.nyx.engin.umich.edu;scan_for_terminated: job 37188.nyx.engin.umich.edu task 4 terminated, sid 25548
10/10/2006 22:36:47;0008;   pbs_mom;Job;37583.nyx.engin.umich.edu;Job Modified at request of PBS_Server at nyx.engin.umich.edu
10/10/2006 22:39:56;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 141.212.31.193:15003

Mom log after restart:

10/11/2006 11:45:41;0002;   pbs_mom;Svr;Log;Log opened
10/11/2006 11:45:41;0002;   pbs_mom;Svr;usecp;nyx.engin.umich.edu:/home/ /home/
10/11/2006 11:45:41;0002;   pbs_mom;Svr;usecp;nyx-login.engin.umich.edu:/home/ /home/
10/11/2006 11:45:41;0002;   pbs_mom;Svr;node_check_script;/var/spool/PBS/mom_priv/health_check.sh
10/11/2006 11:45:41;0080;   pbs_mom;n/a;add_static;config[0] add name node_check_script value /var/spool/PBS/mom_priv/health_check.sh
10/11/2006 11:45:41;0002;   pbs_mom;Svr;node_check_interval;15
10/11/2006 11:45:41;0080;   pbs_mom;n/a;add_static;config[0] add name node_check_interval value 15
10/11/2006 11:45:41;0002;   pbs_mom;n/a;initialize;independent
10/11/2006 11:45:41;0001;   pbs_mom;Svr;pbs_mom;Success (0) in recov_tmsock, read
10/11/2006 11:45:41;0001;   pbs_mom;Svr;pbs_mom;job_recov, warning: tmsockets not recovered from /var/spool/PBS/mom_priv/jobs/37583.nyx.e.JB (written by an older pbs_mom?)
10/11/2006 11:45:41;0002;   pbs_mom;Svr;pbs_mom;Is up
10/11/2006 11:45:41;0002;   pbs_mom;Svr;mom_main;MOM executable path and mtime at launch: /home/software/rhel4/torque-2.1.2/sbin/pbs_mom 1154658165
10/11/2006 11:45:41;0002;   pbs_mom;n/a;mom_main;hello sent to server nyx
10/11/2006 11:45:43;0002;   pbs_mom;Svr;im_eof;End of File from addr 141.212.31.100:15001
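
In case it helps anyone compare mom logs across nodes, here is a
minimal Python sketch (not part of TORQUE) that rejoins mail-wrapped
log lines like the ones above and reports the last record and the
longest gap between records. The field layout
(timestamp;event code;daemon;class;name;message) is an assumption
inferred from these excerpts, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumption: every record starts with "MM/DD/YYYY HH:MM:SS;" as in the excerpts.
STAMP = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

# First excerpt from this post, with the mail client's line wrapping left in.
SAMPLE = """\
10/10/2006 01:58:40;0002;   pbs_mom;Svr;Log;Log opened
10/10/2006 01:58:40;0080;   pbs_mom;Job;
37188.nyx.engin.umich.edu;scan_for_terminated: job
37188.nyx.engin.umich.edu task 4 terminated, sid 25548
10/10/2006 22:36:47;0008;   pbs_mom;Job;37583.nyx.engin.umich.edu;Job
Modified at request of PBS_Server at nyx.engin.umich.edu
10/10/2006 22:39:56;0002;   pbs_mom;Svr;im_eof;Premature end of
message from addr 141.212.31.193:15003
"""


def join_wrapped(text):
    """Merge lines without a leading timestamp into the preceding record."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if STAMP.match(line):
            records.append(line)
        elif records:
            # A wrap right after ';' needs no space; otherwise assume a word break.
            sep = "" if records[-1].endswith(";") else " "
            records[-1] += sep + line
        # Lines before the first timestamped record are ignored.
    return records


def parse_record(record):
    """Split one joined record into its six assumed fields."""
    stamp, code, daemon, klass, name, message = [
        field.strip() for field in record.split(";", 5)
    ]
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "code": code,
        "daemon": daemon,
        "class": klass,
        "name": name,
        "message": message,
    }


def summarize(text):
    """Return (last record, longest gap in seconds between consecutive records)."""
    records = [parse_record(r) for r in join_wrapped(text)]
    gaps = [
        (later["time"] - earlier["time"]).total_seconds()
        for earlier, later in zip(records, records[1:])
    ]
    return records[-1], max(gaps, default=0.0)


if __name__ == "__main__":
    last, longest = summarize(SAMPLE)
    print("last record:", last["time"], last["name"], last["message"])
    print("longest gap (s):", longest)
```

This only shows when a mom stopped logging, not why it died; a path
that was wrapped mid-word by the mailer may also come back with an
extra space in the message field.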



Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



