[torquedev] mother superior <-> sister communication problem

Glen Beane glen.beane at gmail.com
Wed Jan 6 10:57:54 MST 2010

I have a cluster where about 4 nodes have stopped working for multi-node
jobs (they still work fine for single-node jobs).  If one of these nodes is
used as a sister in a job, the job is unable to start and bounces between
the R and Q states as Moab keeps trying to start it. The cluster was running
TORQUE 2.3.6 and was upgraded to 2.3.8 a couple of days ago; the problem
persisted after the upgrade, and even after a node reboot.

The mom_log on the mother superior looks like this:

01/06/2010 08:36:19;0008;   pbs_mom;Job;60812.HOST;Job Modified at request of PBS_Server at HOST
01/06/2010 08:39:27;0002;   pbs_mom;Svr;im_eof;Premature end of message from
01/06/2010 08:39:27;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 60812.HOST join_job failed from node NODEXXX 1 - recovery attempted)
01/06/2010 08:39:27;0001;   pbs_mom;Svr;pbs_mom;sister could not communicate (15059) in 60812.HOST, job_start_error from node HOSTXXX in job_start_error
01/06/2010 08:39:27;0008;   pbs_mom;Req;send_sisters;sending ABORT to
01/06/2010 08:39:27;0001;   pbs_mom;Job;60812.HOST;send_sisters:  sister #1 (NODEXXX) is not ok (1099)
01/06/2010 08:39:27;0001;   pbs_mom;Svr;pbs_mom;exec_bail, exec_bail: sent 0 ABORT requests, should be 1
01/06/2010 08:39:27;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/06/2010 08:39:27;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/06/2010 08:39:27;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/06/2010 08:39:27;0008;   pbs_mom;Job;60812.HOST;checking job post-processing routine
01/06/2010 08:39:27;0080;   pbs_mom;Job;60812.HOST;obit sent to server

The sister node has no record of the job in its log. I don't think this is a
TORQUE bug or a configuration problem; it looks like a communication problem
between the mother superior and the sister.  Where should the sysadmin and
networking guys start looking?
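
For anyone who wants a quick first check, here is a minimal sketch of a
connectivity test that could be run on the mother superior against a suspect
sister (and then in the other direction). It is not part of TORQUE. The port
numbers 15002 and 15003 are assumed to be the pbs_mom defaults and should be
checked against the site configuration, and the function names are
hypothetical. It only exercises TCP and name resolution, so any UDP traffic
between the MOMs would need a packet capture instead.

```python
import socket

# Assumption: default pbs_mom ports; verify against your own configuration.
ASSUMED_MOM_PORTS = (15002, 15003)


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def resolves_consistently(host):
    """Return True if host -> address -> name -> address gives the same address."""
    try:
        addr = socket.gethostbyname(host)
        name, _aliases, _addrs = socket.gethostbyaddr(addr)
        return socket.gethostbyname(name) == addr
    except OSError:
        # Covers forward and reverse lookup failures alike.
        return False


def check_node(host, ports=ASSUMED_MOM_PORTS):
    """Collect the name-resolution and per-port TCP results for one node."""
    report = {"dns_consistent": resolves_consistently(host)}
    for port in ports:
        report[port] = tcp_port_open(host, port)
    return report
```

Calling check_node("NODEXXX") with the real sister hostname returns a small
dictionary; a False entry for a port or for dns_consistent on the broken nodes
but not on the working ones would point at a firewall, routing, or name
resolution difference rather than at TORQUE itself.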