[torquedev] mother superior <-> sister communication problem

Joshua Bernstein jbernstein at penguincomputing.com
Thu Jan 7 16:32:53 MST 2010



Glen Beane wrote:
> I have a cluster where about 4 nodes just stopped working for multi-node 
> jobs (they still work fine for single node jobs).  If one of these nodes 
> is used as a sister in a job it is unable to start and the job bounces 
> between R and Q states as Moab keeps trying to start it. The cluster was 
> running TORQUE 2.3.6 and was upgraded to 2.3.8 a couple days ago and the 
> problem persisted after the upgrade, and even a node reboot.
> 
> the mom_log on the mother superior looks like this:
> 
> 
> 01/06/2010 08:36:19;0008;   pbs_mom;Job;60812.HOST;Job Modified at 
> request of PBS_Server at HOST
> 01/06/2010 08:39:27;0002;   pbs_mom;Svr;im_eof;Premature end of message 
> from addr 10.9.4.19:15003
> 01/06/2010 08:39:27;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 60812.HOST 
> join_job failed from node NODEXXX 1 - recovery attempted)
> 01/06/2010 08:39:27;0001;   pbs_mom;Svr;pbs_mom;sister could not 
> communicate (15059) in 60812.HOST, job_start_error from node HOSTXXX in 
> job_start_error
> 01/06/2010 08:39:27;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 01/06/2010 08:39:27;0001;   pbs_mom;Job;60812.HOST;send_sisters:  sister 
> #1 (NODEXXX) is not ok (1099)
> 01/06/2010 08:39:27;0001;   pbs_mom;Svr;pbs_mom;exec_bail, exec_bail: 
> sent 0 ABORT requests, should be 1
> 01/06/2010 08:39:27;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 01/06/2010 08:39:27;0080;   
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top 
> of while loop
> 01/06/2010 08:39:27;0080;   pbs_mom;Svr;preobit_reply;in while loop, no 
> error from job stat
> 01/06/2010 08:39:27;0008;   pbs_mom;Job;60812.HOST;checking job 
> post-processing routine
> 01/06/2010 08:39:27;0080;   pbs_mom;Job;60812.HOST;obit sent to server
> 
> 
> The sister node has no record of the job in its log. I don't think this 
> is a TORQUE bug or configuration problem, but it seems like there is 
> some communication problem between the mother superior and sister.  
> Where should the sysadmin and networking guys start looking for problems?

Is it possible that those 4 nodes didn't get upgraded properly and are
still running the previous version of TORQUE (2.3.6 in your case)?
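
One rough way to check this is to ask each pbs_mom what version it reports
and compare that against 2.3.8. The sketch below is only an illustration: it
assumes momctl is on the PATH of the host you run it from, and that
"momctl -d 0 -h <node>" prints a "Version:" field in its diagnostics (please
verify that on your install). The function names and the node names in the
usage comment are made up for the example.

```python
import re
import subprocess

# Version the cluster was upgraded to, per the report above.
EXPECTED = "2.3.8"


def parse_mom_version(diag_output):
    """Return the token following 'Version:' in the diagnostic text, or None."""
    match = re.search(r"Version:\s*(\S+)", diag_output)
    return match.group(1) if match else None


def query_mom_version(node, timeout=10):
    """Ask the pbs_mom on `node` for diagnostics and extract its version.

    Assumption: `momctl -d 0 -h <node>` includes a 'Version:' field.
    Returns None if no such field appears in the output.
    """
    result = subprocess.run(
        ["momctl", "-d", "0", "-h", node],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return parse_mom_version(result.stdout)


def find_stale_nodes(versions, expected=EXPECTED):
    """Return the sorted node names whose reported version is not `expected`.

    Nodes that reported nothing (None) are included, since they need a look too.
    """
    return sorted(node for node, ver in versions.items() if ver != expected)


# Hypothetical usage, with placeholder node names:
#   versions = {n: query_mom_version(n) for n in ["node17", "node18"]}
#   print(find_stale_nodes(versions))
```

Any node that shows up in the result is either running a different pbs_mom
than expected or did not answer the query, and either one is worth checking
before looking at the network.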

-Joshua Bernstein
Senior Software Engineer
Penguin Computing

