[Mauiusers] maui stops scheduling when finds a non real busy node
Arnau Bria
arnaubria at pic.es
Wed Sep 14 08:32:34 MDT 2011
Hi all,
We recently updated our Torque version to 2.5.8 but, since this looks
like a Maui issue, I am asking here first.
Our combination is:
# rpm -qa|egrep 'maui-server|torque-server'
maui-server-3.3-1.x86_64
torque-server-2.5.8-1.cri.x86_64
Maui works fine, but if it finds a node in busy status during a
scheduling cycle, it does not schedule any other job in that cycle:
09/14 16:20:48 INFO: job '20420500' successfully started
09/14 16:20:48 MRMJobStart(20420265,Msg,SC)
09/14 16:20:48 MPBSJobStart(20420265,base,Msg,SC)
09/14 16:20:48 ERROR: job '20420265' cannot be started: (rc: 15046 errmsg: 'Resource temporarily unavailable REJHOST=td578.pic.es MSG=cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)' hostlist: 'td578.pic.es')
09/14 16:20:48 ERROR: cannot start job '20420265' in partition DEFAULT
09/14 16:20:48 MJobPReserve(20420265,DEFAULT,ResCount,ResCountRej)
09/14 16:20:48 INFO: no priority reservations created (bf/rsv policy)
09/14 16:20:48 MRMJobStart(20420306,Msg,SC)
09/14 16:20:48 MPBSJobStart(20420306,base,Msg,SC)
09/14 16:20:48 ERROR: job '20420306' cannot be started: (rc: 15046 errmsg: 'Resource temporarily unavailable REJHOST=td578.pic.es MSG=cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)' hostlist: 'td578.pic.es')
09/14 16:20:48 ERROR: cannot start job '20420306' in partition DEFAULT
09/14 16:20:48 MJobPReserve(20420306,DEFAULT,ResCount,ResCountRej)
09/14 16:20:48 INFO: no priority reservations created (bf/rsv policy)
09/14 16:20:48 MRMJobStart(20420268,Msg,SC)
Torque says that the node is busy:
09/14/2011 03:00:13;0008;PBS_Server;Job;20401629.pbs03.pic.es;could not locate requested resources 'td578.pic.es' (node_spec failed) cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)
09/14/2011 03:00:13;0080;PBS_Server;Req;req_reject;Reject reply code=15046(Resource temporarily unavailable REJHOST=td578.pic.es MSG=cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)), aux=0, type=RunJob, from root at pbs03.pic.es
09/14/2011 03:00:13;0008;PBS_Server;Job;20401630.pbs03.pic.es;could not locate requested resources 'td578.pic.es' (node_spec failed) cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)
However, the node is not really "busy". It is only busy for a
few seconds: when I run pbsnodes $nodename 3-4 seconds after seeing
the error, the node shows as free.
On the next scheduling cycle, if it does not find any "busy" node, all jobs are scheduled.
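To see which hosts trigger these rejections and how often, the Maui log can be tallied per host. Below is a minimal Python sketch, assuming the rejection lines always have the format shown in the excerpt above; the function name count_rejections is only illustrative and is not part of Maui or Torque.

```python
import re
from collections import Counter

# Matches rejection lines of the form seen in the Maui log excerpt:
#   ERROR: job '<id>' cannot be started: (... REJHOST=<host> ... (state: <state>) ...)
# The exact format is an assumption based on that excerpt.
REJECT_RE = re.compile(
    r"ERROR:\s+job '(?P<job>\d+)' cannot be started: .*?"
    r"REJHOST=(?P<host>\S+) .*?\(state: (?P<state>[\w-]+)\)"
)


def count_rejections(lines):
    """Count rejected job starts per (host, state) pair."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("state"))] += 1
    return counts
```

Passing an open log file to count_rejections and printing the result of most_common() lists the hosts that were rejected most often, which may help correlate the short "busy" windows with what Torque reports for those nodes.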
I'm wondering whether I could configure Maui to bypass those failing
nodes and keep scheduling other jobs while I figure out why Torque
marks those nodes as busy when they are not.
TIA,
Arnau