[torqueusers] mom server communication failings.
Steve Traylen
s.traylen at rl.ac.uk
Mon Aug 2 13:53:28 MDT 2004
Hi,
I'm using the following versions:
maui-3.2.6p6
torque-1.0.1p6
After recently increasing the size of our maui/torque farm I've
been having more or less constant problems.
There appears to be a breakdown in the communication between pbs_server
and the pbs_moms.
For instance:
+ In the maui log, all queued jobs get blocked because of errors like
08/02 20:38:27 ERROR: job '26673' cannot be started:
(rc: 15057 errmsg: 'Cannot execute at specified host because
of checkpoint or stagein files' hostlist: 'lcg0279.gridpp.rl.ac.uk')
08/02 20:38:27 MPBSJobModify(26673,Resource_List,Resource,1)
08/02 20:38:27 ERROR: MBFFirstFit: cannot start job 26673.0
08/02 20:38:27 MRMJobStart(26711,Msg,SC)
08/02 20:38:27 MPBSJobStart(26711,base,Msg,SC)
08/02 20:38:27 MPBSJobModify(26711,Resource_List,Resource,
lcg0279.gridpp.rl.ac.uk)
08/02 20:38:30 ERROR: job '26711' cannot be started:
(rc: 15070 errmsg: 'Server could not connect to MOM'
hostlist: 'lcg0279.gridpp.rl.ac.uk')
On the corresponding mom, the log shows
pbs_mom;Svr;pbs_mom;im_eof, Premature end of message
from addr 130.246.183.188:15001
The node is generally running fine.
+ Another common observation is that two-job nodes stay in the 'busy'
state after one of their jobs has finished, even though this frees
a slot.
+ The pbs_server regularly goes into a loop and uses all of the CPU.
Any suggestions?
--
Steve Traylen
s.traylen at rl.ac.uk
http://www.gridpp.ac.uk/