[torqueusers] mom server communication failings.
s.traylen at rl.ac.uk
Tue Aug 3 09:30:07 MDT 2004
On Mon, Aug 02, 2004 at 08:53:28PM +0100 or thereabouts, Steve Traylen wrote:
> I'm using versions.
> After recently increasing the size of our maui/torque farm I've
> been having more or less constant problems.
Looking at this some more, it looks like a job can get into a state that
it can never get out of.
The exec_host appears to have been set during the first execution attempt,
when the pbs_mom rejected the job. So now the job is queued with an exec_host set:
# qstat -f 27346 | grep exec_host
exec_host = lcg0453.gridpp.rl.ac.uk/1
# qstat -f 27346 | grep job_state
job_state = Q
The node in question, though, is now busy with other, newer jobs:
# qmgr -c 'l n lcg0453.gridpp.rl.ac.uk'
state = job-exclusive,busy
np = 2
properties = lcgpro
ntype = cluster
jobs = 0/27079.lcgce02.gridpp.rl.ac.uk,
This is because the pbs_mom failure was just a transient error.
If anyone has a way to unset the exec_host, this would help. It all results
in a bad interaction with maui: since this stuck job is now there, soft limits are
imposed for the user and not hard limits. All of the jobs from this user pile up behind this job.
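To find every job in this state rather than checking one at a time, something like the following could scan the full `qstat -f` output for jobs that are queued yet still carry an exec_host. This is only a rough sketch: the function names are made up for illustration, and it assumes the usual `qstat -f` layout of a `Job Id:` header followed by indented `name = value` lines (wrapped continuation lines are simply ignored).

```python
import re


def parse_qstat_full(text):
    """Parse `qstat -f` style text into {job_id: {attribute: value}}.

    Assumes each job starts with a 'Job Id: <id>' line followed by
    indented 'name = value' lines. Wrapped continuation lines that do
    not look like 'name = value' are skipped.
    """
    jobs = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"Job Id:\s*(\S+)", line)
        if header:
            current = jobs.setdefault(header.group(1), {})
            continue
        attr = re.match(r"\s+([\w.]+) = (.*)", line)
        if attr and current is not None:
            current[attr.group(1)] = attr.group(2).strip()
    return jobs


def queued_with_exec_host(jobs):
    """Return (job_id, exec_host) for jobs in state Q that have exec_host set."""
    return [
        (job_id, attrs["exec_host"])
        for job_id, attrs in jobs.items()
        if attrs.get("job_state") == "Q" and "exec_host" in attrs
    ]
```

Feeding it the captured output of `qstat -f` and printing the result of `queued_with_exec_host(parse_qstat_full(text))` should list the candidates. It only reports them; it does not attempt to clear the exec_host.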
> There appears to be a breakdown in the communication between pbs_server
> and the pbs_moms.
> For instance
> + In a maui log, all queued jobs get blocked because of
> 08/02 20:38:27 ERROR: job '26673' cannot be started:
> (rc: 15057 errmsg: 'Cannot execute at specified host because
> of checkpoint or stagein files' hostlist: 'lcg0279.gridpp.rl.ac.uk')
> 08/02 20:38:27 MPBSJobModify(26673,Resource_List,Resource,1)
> 08/02 20:38:27 ERROR: MBFFirstFit: cannot start job 26673.0
> 08/02 20:38:27 MRMJobStart(26711,Msg,SC)
> 08/02 20:38:27 MPBSJobStart(26711,base,Msg,SC)
> 08/02 20:38:27 MPBSJobModify(26711,Resource_List,Resource,
> 08/02 20:38:30 ERROR: job '26711' cannot be started:
> (rc: 15070 errmsg: 'Server could not connect to MOM'
> hostlist: 'lcg0279.gridpp.rl.ac.uk')
> On the corresponding mom
> pbs_mom;Svr;pbs_mom;im_eof, Premature end of message
> from addr 126.96.36.199:15001
> The node is generally running fine.
> + Another common observation is that two-job nodes stay in 'busy'
> state when one of their jobs has finished despite this now creating
> a new slot.
> + The pbs_server regularly goes into a loop and fills the CPU.
> Any suggestions?
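To see whether the start failures cluster on particular nodes or return codes, the "cannot be started" entries in the maui log can be tallied. The sketch below is only an illustration: the function name is hypothetical, the pattern is written against the entries quoted above and nothing more, and it expects the raw log text (without the "> " quoting of this mail).

```python
import re
from collections import Counter

# Pattern modelled on the log entries quoted in this thread; other maui
# versions may format these messages differently.
START_FAILURE = re.compile(
    r"ERROR:\s+job '(?P<job>\d+)' cannot be started:\s*"
    r"\(rc: (?P<rc>\d+) errmsg: '(?P<msg>[^']*)'\s*"
    r"hostlist: '(?P<hosts>[^']*)'\)"
)


def tally_start_failures(log_text):
    """Count 'cannot be started' errors per (hostlist, rc) pair."""
    # Collapse all whitespace so entries wrapped over several lines
    # are matched the same way as single-line entries.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in START_FAILURE.finditer(flat):
        counts[(match.group("hosts"), int(match.group("rc")))] += 1
    return counts
```

Calling `tally_start_failures(text).most_common()` on the log contents gives the (hostlist, rc) pairs ordered by how often they occur.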
> Steve Traylen
> s.traylen at rl.ac.uk
s.traylen at rl.ac.uk