[torquedev] mother superior <-> sister communication problem

Glen Beane glen.beane at gmail.com
Thu Jan 7 16:45:13 MST 2010


On Thu, Jan 7, 2010 at 6:32 PM, Joshua Bernstein <jbernstein at penguincomputing.com> wrote:

>
>
> Glen Beane wrote:
>
>> I have a cluster where about 4 nodes just stopped working for multi-node
>> jobs (they still work fine for single node jobs).  If one of these nodes is
>> used as a sister in a job it is unable to start and the job bounces between
>> R and Q states as Moab keeps trying to start it. The cluster was running
>> TORQUE 2.3.6 and was upgraded to 2.3.8 a couple days ago and the problem
>> persisted after the upgrade, and even a node reboot.
>>
>> the mom_log on the mother superior looks like this:
>>
>>
>> 01/06/2010 08:36:19;0008;   pbs_mom;Job;60812.HOST;Job Modified at request
>> of PBS_Server at HOST
>> 01/06/2010 08:39:27;0002;   pbs_mom;Svr;im_eof;Premature end of message
>> from addr 10.9.4.19:15003
>>
>> 01/06/2010 08:39:27;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 60812.HOST
>> join_job failed from node NODEXXX 1 - recovery attempted)
>> 01/06/2010 08:39:27;0001;   pbs_mom;Svr;pbs_mom;sister could not
>> communicate (15059) in 60812.HOST, job_start_error from node HOSTXXX in
>> job_start_error
>> 01/06/2010 08:39:27;0008;   pbs_mom;Req;send_sisters;sending ABORT to
>> sisters
>> 01/06/2010 08:39:27;0001;   pbs_mom;Job;60812.HOST;send_sisters:  sister
>> #1 (NODEXXX) is not ok (1099)
>> 01/06/2010 08:39:27;0001;   pbs_mom;Svr;pbs_mom;exec_bail, exec_bail: sent
>> 0 ABORT requests, should be 1
>> 01/06/2010 08:39:27;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 01/06/2010 08:39:27;0080;
>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of
>> while loop
>> 01/06/2010 08:39:27;0080;   pbs_mom;Svr;preobit_reply;in while loop, no
>> error from job stat
>> 01/06/2010 08:39:27;0008;   pbs_mom;Job;60812.HOST;checking job
>> post-processing routine
>> 01/06/2010 08:39:27;0080;   pbs_mom;Job;60812.HOST;obit sent to server
>>
>>
>> The sister node has no record of the job in its log. I don't think this is
>> a TORQUE bug or configuration problem, but it seems like there is some
>> communication problem between the mother superior and sister.  Where should
>> the sysadmin and networking guys start looking for problems?
>>
>
> Is it possible that for some reason those 4 nodes didn't get upgraded
> properly and thus are running the previous version of TORQUE (2.3.6 in your
> case?)


The problem started before the upgrade to 2.3.8 and persisted afterwards. I
have confirmed that all of the MOMs are now running 2.3.8.
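
On the networking side, a quick first check from the mother superior is whether
the suspect sisters' MOM ports can be reached at all, since the "Premature end
of message" from 10.9.4.19:15003 in the log suggests the inter-MOM connection is
being dropped. Below is a minimal Python sketch for that check; the host name
"nodexxx" and the default MOM ports (15002 service, 15003 manager, matching the
port in the log) are assumptions, so adjust them for your site.

#!/usr/bin/env python
"""Reachability check from the mother superior to a sister's MOM ports.

Only a diagnostic sketch: the sister host name and the default TORQUE MOM
ports (15002 service, 15003 manager/RM) are assumed values.
"""
import socket
import sys

SISTER = sys.argv[1] if len(sys.argv) > 1 else "nodexxx"   # hypothetical node name
MOM_PORTS = (15002, 15003)                                 # TORQUE defaults

for port in MOM_PORTS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((SISTER, port))
        print("%s:%d reachable" % (SISTER, port))
    except socket.error as err:
        # A timeout or "connection refused" here points at a firewall rule,
        # a routing problem, or a dead pbs_mom rather than a TORQUE bug.
        print("%s:%d NOT reachable: %s" % (SISTER, port, err))
    finally:
        s.close()

If the ports are reachable, it is also worth comparing forward and reverse name
resolution (/etc/hosts, DNS) on the affected sisters against a working node,
since a hostname mismatch can make join_job fail even when the TCP connection
itself is fine.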