[torqueusers] pbs_mom failure

Stijn De Weirdt stijn.deweirdt at ugent.be
Tue Mar 17 08:46:35 MDT 2009


hi all,

i'm investigating a pbs_mom that stopped working on a node. i can find
2 strange things in the logs:

a. the following message repeated 76891 times (yes, 76k times in approx
50 seconds)

03/17/2009 12:46:43;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
03/17/2009 12:46:43;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/17/2009 12:46:43;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/17/2009 12:46:43;0001;   pbs_mom;Job;198973.master;scan_for_exiting: sending signal 9, "KILL" to job 198973.master, reason: local task termination detected
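
a rough way to quantify a burst like this is to count identical messages per timestamp. the following is only a minimal python sketch: it assumes each log entry has been joined onto a single line and follows the `timestamp;code;daemon;class;object;message` layout of the excerpt; the function name `count_repeats` and the shortened sample are made up for illustration.

```python
from collections import Counter

def count_repeats(lines):
    """Count (timestamp, object, message) triples in pbs_mom-style log lines."""
    counts = Counter()
    for line in lines:
        # Split into at most 6 fields so a ';' inside the message is preserved.
        parts = [p.strip() for p in line.rstrip("\n").split(";", 5)]
        if len(parts) != 6:
            continue  # skip wrapped or malformed lines
        timestamp, _code, _daemon, _cls, obj, message = parts
        counts[(timestamp, obj, message)] += 1
    return counts

# Hypothetical sample built from the entries quoted in this mail.
sample = [
    "03/17/2009 12:46:43;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply",
    "03/17/2009 12:46:43;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply",
    "03/17/2009 12:46:43;0001;   pbs_mom;Job;198973.master;scan_for_exiting: "
    "sending signal 9, \"KILL\" to job 198973.master, reason: local task "
    "termination detected",
]

counts = count_repeats(sample)
for key, n in counts.most_common():
    print(n, key)
```

run against the real mom log (with `open(path)` as the `lines` argument), the most common triples would show which messages make up the 76k repetitions and in which second they occur.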


b. after those 76k repetitions, the following:


03/17/2009 12:46:43;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
03/17/2009 12:46:43;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/17/2009 12:46:43;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/17/2009 12:46:43;0001;   pbs_mom;Job;198973.master;first host DOES NOT match me: node034/7+node034/6+node034/5+node034/4+node034/3+node034/2+node034/1+node034/0 != node006


how did this node (node006) get a job from node034? (pbs_mom spool dirs
are on local disk).
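
to make the mismatch concrete: the last message appears to compare the first entry of the job's host list with the local node name. the python sketch below only illustrates that reading of the message; it is not pbs_mom's actual code, and the helper names `first_exec_host` and `is_first_host` are hypothetical.

```python
def first_exec_host(exec_host):
    """Return the host name of the first 'host/slot' entry in a '+'-joined list."""
    first_entry = exec_host.split("+", 1)[0]
    return first_entry.split("/", 1)[0]

def is_first_host(exec_host, local_name):
    """True if the first host in the list is the local node."""
    return first_exec_host(exec_host) == local_name

# Host list and node name taken from the log message quoted in this mail.
hosts = ("node034/7+node034/6+node034/5+node034/4+node034/3"
         "+node034/2+node034/1+node034/0")

print(first_exec_host(hosts))           # node034
print(is_first_host(hosts, "node006"))  # False
```

under this reading, every slot of job 198973.master is on node034, so node006 does not appear in the host list at all.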



many thanks,

stijn
-- 
The system will shutdown in 5 minutes.


