[torqueusers] death of pbs_mom's
Brock Palen
brockp at umich.edu
Wed Oct 11 09:45:21 MDT 2006
We have been using torque-2.1.2 for a while on an amd64 RHEL4 cluster.
In the last few days we started seeing pbs_moms dying on the nodes.
They are not dying en masse, and there has been no noticeable
correlation between the nodes where they have died.
The logs say nothing. This causes jobs to hang in the queue in the
Running state, and they can only be removed by purging the job. These
hung jobs go on to make Moab angry: it never stops trying to remove
the now very old job, and so it never sleeps.
Has anyone else ever seen a problem when a mom dies?
mom_log
10/10/2006 01:58:40;0002; pbs_mom;Svr;Log;Log opened
10/10/2006 01:58:40;0080; pbs_mom;Job;37188.nyx.engin.umich.edu;scan_for_terminated: job 37188.nyx.engin.umich.edu task 4 terminated, sid 25548
10/10/2006 22:36:47;0008; pbs_mom;Job;37583.nyx.engin.umich.edu;Job Modified at request of PBS_Server at nyx.engin.umich.edu
10/10/2006 22:39:56;0002; pbs_mom;Svr;im_eof;Premature end of message from addr 141.212.31.193:15003
Mom log after restart:
10/11/2006 11:45:41;0002; pbs_mom;Svr;Log;Log opened
10/11/2006 11:45:41;0002; pbs_mom;Svr;usecp;nyx.engin.umich.edu:/home/ /home/
10/11/2006 11:45:41;0002; pbs_mom;Svr;usecp;nyx-login.engin.umich.edu:/home/ /home/
10/11/2006 11:45:41;0002; pbs_mom;Svr;node_check_script;/var/spool/PBS/mom_priv/health_check.sh
10/11/2006 11:45:41;0080; pbs_mom;n/a;add_static;config[0] add name node_check_script value /var/spool/PBS/mom_priv/health_check.sh
10/11/2006 11:45:41;0002; pbs_mom;Svr;node_check_interval;15
10/11/2006 11:45:41;0080; pbs_mom;n/a;add_static;config[0] add name node_check_interval value 15
10/11/2006 11:45:41;0002; pbs_mom;n/a;initialize;independent
10/11/2006 11:45:41;0001; pbs_mom;Svr;pbs_mom;Success (0) in recov_tmsock, read
10/11/2006 11:45:41;0001; pbs_mom;Svr;pbs_mom;job_recov, warning: tmsockets not recovered from /var/spool/PBS/mom_priv/jobs/37583.nyx.e.JB (written by an older pbs_mom?)
10/11/2006 11:45:41;0002; pbs_mom;Svr;pbs_mom;Is up
10/11/2006 11:45:41;0002; pbs_mom;Svr;mom_main;MOM executable path and mtime at launch: /home/software/rhel4/torque-2.1.2/sbin/pbs_mom 1154658165
10/11/2006 11:45:41;0002; pbs_mom;n/a;mom_main;hello sent to server nyx
10/11/2006 11:45:43;0002; pbs_mom;Svr;im_eof;End of File from addr 141.212.31.100:15001
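
Since the mom logs nothing when it dies, one way to see how long a node
was silent is to measure the gap between the last entry written and the
next "Log opened" entry. Below is a minimal Python sketch of that idea.
It assumes each log entry is a single line of six semicolon-separated
fields, as in the excerpts above with the mail line-wrapping undone. The
field names (event_code, obj_class, obj_name) and the function names are
my own labels for illustration, not anything defined by TORQUE.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class MomLogEntry(NamedTuple):
    timestamp: datetime
    event_code: str   # e.g. "0002" (name is an assumption)
    daemon: str       # e.g. "pbs_mom"
    obj_class: str    # e.g. "Svr", "Job", "n/a" (name is an assumption)
    obj_name: str     # e.g. "Log", "im_eof", or a job id
    message: str


def parse_mom_log_line(line: str) -> Optional[MomLogEntry]:
    """Parse one unwrapped log line; return None if it does not match."""
    # Split on the first five semicolons only, so the message may contain more.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        ts = datetime.strptime(parts[0].strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return MomLogEntry(ts, *(p.strip() for p in parts[1:]))


def restart_gaps(lines):
    """Return (last_entry, log_opened_entry, gap) for every restart seen."""
    gaps = []
    previous = None
    for line in lines:
        entry = parse_mom_log_line(line)
        if entry is None:
            continue
        is_restart = entry.obj_name == "Log" and entry.message == "Log opened"
        if is_restart and previous is not None:
            gaps.append((previous, entry, entry.timestamp - previous.timestamp))
        previous = entry
    return gaps


# Last entry before the mom died and first entry after the restart,
# both taken from the logs in this message.
sample = [
    "10/10/2006 22:39:56;0002; pbs_mom;Svr;im_eof;"
    "Premature end of message from addr 141.212.31.193:15003",
    "10/11/2006 11:45:41;0002; pbs_mom;Svr;Log;Log opened",
]

for before, after, gap in restart_gaps(sample):
    print(f"silent for {gap} before restart at {after.timestamp}")
# prints: silent for 13:05:45 before restart at 2006-10-11 11:45:41
```

The gap is only an upper bound on the time the mom was dead, since a
healthy but idle mom may also go a while without logging anything.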
Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985