[torquedev] mother superior <-> sister communication problem
Joshua Bernstein
jbernstein at penguincomputing.com
Thu Jan 7 16:32:53 MST 2010
Glen Beane wrote:
> I have a cluster where about 4 nodes just stopped working for multi-node
> jobs (they still work fine for single node jobs). If one of these nodes
> is used as a sister in a job it is unable to start and the job bounces
> between R and Q states as Moab keeps trying to start it. The cluster was
> running TORQUE 2.3.6 and was upgraded to 2.3.8 a couple days ago and the
> problem persisted after the upgrade, and even a node reboot.
>
> the mom_log on the mother superior looks like this:
>
>
> 01/06/2010 08:36:19;0008; pbs_mom;Job;60812.HOST;Job Modified at
> request of PBS_Server at HOST
> 01/06/2010 08:39:27;0002; pbs_mom;Svr;im_eof;Premature end of message
> from addr 10.9.4.19:15003
> 01/06/2010 08:39:27;0001; pbs_mom;Svr;pbs_mom;node_bailout, 60812.HOST
> join_job failed from node NODEXXX 1 - recovery attempted)
> 01/06/2010 08:39:27;0001; pbs_mom;Svr;pbs_mom;sister could not
> communicate (15059) in 60812.HOST, job_start_error from node HOSTXXX in
> job_start_error
> 01/06/2010 08:39:27;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 01/06/2010 08:39:27;0001; pbs_mom;Job;60812.HOST;send_sisters: sister
> #1 (NODEXXX) is not ok (1099)
> 01/06/2010 08:39:27;0001; pbs_mom;Svr;pbs_mom;exec_bail, exec_bail:
> sent 0 ABORT requests, should be 1
> 01/06/2010 08:39:27;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 01/06/2010 08:39:27;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 01/06/2010 08:39:27;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 01/06/2010 08:39:27;0008; pbs_mom;Job;60812.HOST;checking job
> post-processing routine
> 01/06/2010 08:39:27;0080; pbs_mom;Job;60812.HOST;obit sent to server
>
>
> The sister node has no record of the job in its log. I don't think this
> is a TORQUE bug or configuration problem, but it seems like there is
> some communication problem between the mother superior and sister.
> Where should the sysadmin and networking guys start looking for problems?
Is it possible that, for some reason, those 4 nodes didn't get upgraded
properly and are thus still running the previous version of TORQUE (2.3.6 in
your case)?
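
One quick way to test that theory is to collect the pbs_mom version each node
reports and look for the odd ones out. Below is a rough Python sketch of just
the comparison step. It assumes you have already gathered a host -> version
mapping by whatever means you prefer; the function name, the node names and
the version values are made up for illustration and are not taken from the
cluster in question.

```python
from collections import Counter


def version_outliers(versions):
    """Return the hosts whose version differs from the most common one.

    `versions` maps a host name to the version string it reported.
    How that mapping is collected is left to the admin (assumption).
    """
    if not versions:
        return {}
    # Treat the most frequently reported version as the expected one.
    expected, _count = Counter(versions.values()).most_common(1)[0]
    return {host: ver for host, ver in versions.items() if ver != expected}


# Hypothetical example data, not from the real cluster.
reported = {
    "node01": "2.3.8",
    "node02": "2.3.8",
    "node03": "2.3.6",
    "node04": "2.3.8",
}
print(version_outliers(reported))  # {'node03': '2.3.6'}
```

If the 4 problem nodes show up in that result, a partial upgrade is the likely
culprit. If every node reports the same version, the network path between the
mother superior and the sisters is the next thing to check.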
-Joshua Bernstein
Senior Software Engineer
Penguin Computing