[torqueusers] mom server communication failings.

Steve Traylen s.traylen at rl.ac.uk
Mon Aug 2 13:53:28 MDT 2004


 I'm using versions.


 After recently increasing the size of our maui/torque farm I've
 been having more or less constant problems.

 There appears to be a breakdown in the communication between
 pbs_server and the pbs_moms.

 For instance 

 + In the maui log, all queued jobs get blocked because of errors like
   these:

   08/02 20:38:27 ERROR:    job '26673' cannot be started:
        (rc: 15057  errmsg: 'Cannot execute at specified host because 
        of checkpoint or stagein files'  hostlist: 'lcg0279.gridpp.rl.ac.uk')
   08/02 20:38:27 MPBSJobModify(26673,Resource_List,Resource,1)
   08/02 20:38:27 ERROR:    MBFFirstFit:  cannot start job 26673.0
   08/02 20:38:27 MRMJobStart(26711,Msg,SC)
   08/02 20:38:27 MPBSJobStart(26711,base,Msg,SC)
   08/02 20:38:27 MPBSJobModify(26711,Resource_List,Resource,
   08/02 20:38:30 ERROR:    job '26711' cannot be started: 
        (rc: 15070  errmsg: 'Server could not connect to MOM'
         hostlist: 'lcg0279.gridpp.rl.ac.uk')

 On the corresponding mom, the log shows:

 pbs_mom;Svr;pbs_mom;im_eof, Premature end of message
        from addr

 The node is generally running fine.

 + Another common observation is that two-job nodes stay in the 'busy'
   state after one of their jobs has finished, even though this frees
   up a slot.

 + The pbs_server regularly goes into a loop and saturates the CPU.
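
 To see whether the start failures in the maui log cluster on particular
 hosts or return codes, something like the following rough sketch could
 tally them. It assumes the log entries look exactly like the excerpt
 above; the regular expression, the function name tally_start_failures
 and the SAMPLE text are illustrative only and not part of maui or torque.

```python
import re
from collections import Counter

# Assumption: "cannot be started" entries follow the multi-line layout seen
# in the excerpt (rc, errmsg and hostlist inside one parenthesised group).
START_FAILURE = re.compile(
    r"job '(?P<job>\d+)' cannot be started:\s*"
    r"\(rc:\s*(?P<rc>\d+)\s+errmsg:\s*'(?P<msg>[^']*)'\s*"
    r"hostlist:\s*'(?P<hosts>[^']*)'\)"
)


def tally_start_failures(log_text):
    """Count start failures per (rc, hostlist) pair in maui log text."""
    counts = Counter()
    for match in START_FAILURE.finditer(log_text):
        # The hostlist is kept as one string; its multi-host format is unknown.
        counts[(int(match.group("rc")), match.group("hosts"))] += 1
    return counts


# Sample input copied from the log excerpt in this message.
SAMPLE = """\
08/02 20:38:27 ERROR:    job '26673' cannot be started:
     (rc: 15057  errmsg: 'Cannot execute at specified host because
     of checkpoint or stagein files'  hostlist: 'lcg0279.gridpp.rl.ac.uk')
08/02 20:38:27 ERROR:    MBFFirstFit:  cannot start job 26673.0
08/02 20:38:30 ERROR:    job '26711' cannot be started:
     (rc: 15070  errmsg: 'Server could not connect to MOM'
      hostlist: 'lcg0279.gridpp.rl.ac.uk')
"""

if __name__ == "__main__":
    for (rc, hosts), n in sorted(tally_start_failures(SAMPLE).items()):
        print(rc, hosts, n)
```

 Run against a full log file (read into a string), this would show
 whether a handful of hosts account for most of the failures.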

 Any suggestions? 

Steve Traylen
s.traylen at rl.ac.uk
