[Mauiusers] maui stops scheduling when finds a non real busy node

Arnau Bria arnaubria at pic.es
Wed Sep 14 08:32:34 MDT 2011


Hi all,

We recently updated Torque to 2.5.8. Since this looks like a Maui
issue to me, I am asking here first.

Our combination is:
# rpm -qa|egrep 'maui-server|torque-server'
maui-server-3.3-1.x86_64
torque-server-2.5.8-1.cri.x86_64

Maui works fine, but if it finds a node in busy state during a
scheduling cycle, it does not schedule any other job in that cycle:

09/14 16:20:48 INFO:     job '20420500' successfully started
09/14 16:20:48 MRMJobStart(20420265,Msg,SC)
09/14 16:20:48 MPBSJobStart(20420265,base,Msg,SC)
09/14 16:20:48 ERROR:    job '20420265' cannot be started: (rc: 15046  errmsg: 'Resource temporarily unavailable REJHOST=td578.pic.es MSG=cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)'  hostlist: 'td578.pic.es')
09/14 16:20:48 ERROR:    cannot start job '20420265' in partition DEFAULT
09/14 16:20:48 MJobPReserve(20420265,DEFAULT,ResCount,ResCountRej)
09/14 16:20:48 INFO:     no priority reservations created (bf/rsv policy)
09/14 16:20:48 MRMJobStart(20420306,Msg,SC)
09/14 16:20:48 MPBSJobStart(20420306,base,Msg,SC)
09/14 16:20:48 ERROR:    job '20420306' cannot be started: (rc: 15046  errmsg: 'Resource temporarily unavailable REJHOST=td578.pic.es MSG=cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)'  hostlist: 'td578.pic.es')
09/14 16:20:48 ERROR:    cannot start job '20420306' in partition DEFAULT
09/14 16:20:48 MJobPReserve(20420306,DEFAULT,ResCount,ResCountRej)
09/14 16:20:48 INFO:     no priority reservations created (bf/rsv policy)
09/14 16:20:48 MRMJobStart(20420268,Msg,SC)
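Rejections like the ones in this excerpt can be tallied per host to see whether a single node is responsible for the failed starts in a cycle. The following is only a minimal sketch in Python, assuming the Maui log keeps the line format shown above. The names `rejected_starts` and `count_by_host` are made up for illustration and are not part of Maui or Torque.

```python
import re
from collections import Counter

# Matches the "cannot be started" lines as they appear in the excerpt above.
# Assumption: the format (job id, rc, REJHOST=, "(state: ...)") is stable.
REJECT_RE = re.compile(
    r"ERROR:\s+job '(?P<job>\d+)' cannot be started: "
    r"\(rc: (?P<rc>\d+)\s+errmsg: '.*?REJHOST=(?P<host>\S+)"
    r".*?\(state: (?P<state>\w+)\)"
)


def rejected_starts(lines):
    """Yield (job_id, host, state) for every rejected job start."""
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            yield match.group("job"), match.group("host"), match.group("state")


def count_by_host(lines):
    """Count rejected starts per (host, state) pair."""
    return Counter((host, state) for _job, host, state in rejected_starts(lines))


# Three lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "09/14 16:20:48 ERROR:    job '20420265' cannot be started: (rc: 15046  errmsg: 'Resource temporarily unavailable REJHOST=td578.pic.es MSG=cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)'  hostlist: 'td578.pic.es')",
    "09/14 16:20:48 INFO:     no priority reservations created (bf/rsv policy)",
    "09/14 16:20:48 ERROR:    job '20420306' cannot be started: (rc: 15046  errmsg: 'Resource temporarily unavailable REJHOST=td578.pic.es MSG=cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)'  hostlist: 'td578.pic.es')",
]

if __name__ == "__main__":
    for (host, state), n in count_by_host(SAMPLE).most_common():
        print(f"{host} state={state} rejected_starts={n}")
```

In practice the lines would come from the Maui log file rather than from `SAMPLE`, for example by passing an open file object to `count_by_host`.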



Torque says that the node is busy:

09/14/2011 03:00:13;0008;PBS_Server;Job;20401629.pbs03.pic.es;could not locate requested resources 'td578.pic.es' (node_spec failed) cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)
09/14/2011 03:00:13;0080;PBS_Server;Req;req_reject;Reject reply code=15046(Resource temporarily unavailable REJHOST=td578.pic.es MSG=cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)), aux=0, type=RunJob, from root at pbs03.pic.es
09/14/2011 03:00:13;0008;PBS_Server;Job;20401630.pbs03.pic.es;could not locate requested resources 'td578.pic.es' (node_spec failed) cannot allocate node 'td578.pic.es' to job - node not currently available (state: busy)


However, the node is not really busy. It is only busy for a few
seconds: when I run pbsnodes $nodename 3-4 seconds after seeing the
error, the node shows as free.
On the next scheduling cycle, if Maui does not find any "busy" node, all jobs are scheduled.
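To measure how long the node really stays busy, rather than checking by hand a few seconds after the error, the node state could be sampled repeatedly. This is a rough sketch in Python, assuming `pbsnodes <node>` prints a line of the form `state = free` as Torque normally does. `parse_state` and `watch_node` are hypothetical helper names, and the interval and sample count are arbitrary.

```python
import re
import subprocess
import time

# Assumption: pbsnodes output contains an indented "state = <value>" line.
STATE_RE = re.compile(r"^\s*state\s*=\s*(\S+)", re.MULTILINE)


def parse_state(pbsnodes_output):
    """Return the node state found in pbsnodes output, or None."""
    match = STATE_RE.search(pbsnodes_output)
    return match.group(1) if match else None


def watch_node(node, interval=1.0, samples=10, run=subprocess.run):
    """Sample the node state `samples` times, `interval` seconds apart.

    Returns a list of (timestamp, state) tuples. `run` can be replaced
    with a stub so the function can be exercised without a Torque server.
    """
    history = []
    for _ in range(samples):
        result = run(["pbsnodes", node], capture_output=True, text=True)
        history.append((time.time(), parse_state(result.stdout)))
        time.sleep(interval)
    return history


# Example use on a host where pbsnodes is available:
#   for stamp, state in watch_node("td578.pic.es", interval=0.5, samples=20):
#       print(time.strftime("%H:%M:%S", time.localtime(stamp)), state)
```

Comparing the sampled timestamps with the timestamps of the rejections in the Maui log would show whether the busy state coincides with the scheduling cycle.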


I'm wondering whether I could configure Maui to bypass those failing
nodes and keep scheduling other jobs while I figure out why Torque
marks those nodes as busy when they are not.


TIA,
Arnau

