[Mauiusers] jobs being set to Deferred (NoResources)

Hein Zelle hein.zelle at bmtargoss.com
Wed Jan 27 03:48:16 MST 2010


Dear Maui users,

I'm struggling with a problem that has plagued our cluster since its
installation.  Occasionally, jobs are suddenly put in the Deferred
state, and checkjob reports NoResources for them.  I have not been
able to find out why this happens.

Symptoms: when submitting jobs to one of our front end nodes with a
disk pack, we request that one specific node.  Sometimes Maui
apparently considers this node unavailable.  We've seen it marked as
"down" in the pbsnodes output, but this morning the node reported
itself as "free" with no apparent problems, and the problem still
occurred.  Any job requesting that specific node label is immediately
deferred.  It happens most often with jobs that request
1:frontend+32:blade, i.e. one CPU on the specific front end and 32
CPUs on the computation nodes.  The problem typically occurs when
there is more queue activity, with several jobs running and/or
waiting in the queue.
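
To make the request format concrete, here is a minimal Python sketch of
how a node request such as 1:frontend+32:blade splits into groups.  The
function name parse_node_request is hypothetical and this is only an
illustration of the syntax, not how Torque or Maui parse it internally.
Each "+"-separated group appears to correspond to one numbered request
in the Maui log (18451:0 and 18451:1 in the excerpt further down).

```python
def parse_node_request(spec):
    """Split a nodes= style request into (count, properties) groups.

    Illustrative only: handles the simple count:property form used in
    this message, not every variant the resource manager accepts.
    """
    groups = []
    for part in spec.split("+"):
        fields = part.split(":")
        # A numeric first field is the count; otherwise the group names
        # a host or property and the count is taken to be 1.
        if fields[0].isdigit():
            count, props = int(fields[0]), fields[1:]
        else:
            count, props = 1, fields
        groups.append((count, props))
    return groups


if __name__ == "__main__":
    print(parse_node_request("1:frontend+32:blade"))
```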

We thought we had tracked it down to excessive load, where the machine
becomes temporarily unavailable.  This morning, however, there seemed
to be no such load problem.  As far as I've been able to find, the log
files show no specific messages related to the problem.

Does this problem sound familiar to anyone?  We'd really like to find
a solution or workaround, as it effectively makes the cluster unusable
for operational jobs that MUST run.  

If I can provide more specific information or log output that would
help in tracking the problem, please let me know.  If this question is
more appropriate for the Torque mailing list, also let me know.

System: Scientific Linux 5.4 (based on RHEL 5.4)
kernel: 2.6.18-164.9.1.el5
maui:   maui-3.2.6p21-42_cvos5.0.x86_64
torque: torque-2.3.7-87_cm5.0.x86_64

Thank you for your help!
Kind regards,

     Hein Zelle


Below is an excerpt from the Maui log file showing one of these jobs
(18451), which was set to Deferred this morning.  Omitted lines are
marked with [...]



01/27 08:45:03 MPBSJobLoad(18451,18451.master.cm.cluster,J,TaskList,0)
01/27 08:45:03 MReqCreate(18451,SrcRQ,DstRQ,DoCreate)
01/27 08:45:03 MReqCreate(18451,SrcRQ,DstRQ,DoCreate)
01/27 08:45:03 INFO:     processing node request line '1:projects+32:blade'
01/27 08:45:03 MJobSetCreds(18451,wrf,agstaff,)
01/27 08:45:03 INFO:     default QOS for job 18451 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
01/27 08:45:03 INFO:     default QOS for job 18451 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
01/27 08:45:03 INFO:     default QOS for job 18451 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
01/27 08:45:03 INFO:     job '18451' loaded:   1      wrf  agstaff   7200       Idle   0 1264581902   [NONE] [NONE] [NONE] >=      0 >=      0 [projects] 1264581903

[...]
01/27 08:45:03 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
[...]
01/27 08:45:03 INFO:     job '18451' Priority:        1
  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
[...]

01/27 08:45:03 MStatClearUsage([NONE],Idle)
01/27 08:45:03 INFO:     total jobs selected (ALL): 1/8 [State: 7]
01/27 08:45:03 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
01/27 08:45:03 INFO:     total jobs selected in partition ALL: 1/1 
01/27 08:45:03 MQueueScheduleRJobs(Q)
01/27 08:45:03 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
01/27 08:45:03 INFO:     total jobs selected in partition ALL: 1/1 
01/27 08:45:03 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
01/27 08:45:03 INFO:     total jobs selected in partition DEFAULT: 1/1 
01/27 08:45:03 MQueueScheduleIJobs(Q,DEFAULT)
01/27 08:45:03 INFO:     6 feasible tasks found for job 18451:0 in partition DEFAULT (1 Needed)
01/27 08:45:03 INFO:     64 feasible tasks found for job 18451:1 in partition DEFAULT (32 Needed)
01/27 08:45:03 INFO:     inadequate feasible tasks found for job 18451:1 (0 < 32)
01/27 08:45:03 MJobPReserve(18451,DEFAULT,ResCount,ResCountRej)
01/27 08:45:03 MJobReserve(18451,Priority)
01/27 08:45:03 INFO:     6 feasible tasks found for job 18451:0 in partition DEFAULT (1 Needed)
01/27 08:45:03 INFO:     64 feasible tasks found for job 18451:1 in partition DEFAULT (32 Needed)
01/27 08:45:03 INFO:     6 feasible tasks found for job 18451:0 in partition DEFAULT (1 Needed)
01/27 08:45:03 INFO:     64 feasible tasks found for job 18451:1 in partition DEFAULT (32 Needed)
01/27 08:45:03 INFO:     located resources for 1 tasks (6) in best partition DEFAULT for job 18451 at time 1:46:293 INFO:     6 feasible tasks found for job 18451:0 in partition DEFAULT (1 Needed)
01/27 08:45:03 ALERT:    inadequate tasks to allocate to job 18451:0 (0 < 1)
01/27 08:45:03 WARNING:  cannot allocate tasks for job 18451 at 1:46:29
01/27 08:45:03 INFO:     6 feasible tasks found for job 18451:0 in partition DEFAULT (1 Needed)
01/27 08:45:03 INFO:     64 feasible tasks found for job 18451:1 in partition DEFAULT (32 Needed)
01/27 08:45:03 INFO:     located resources for 1 tasks (12) in best partition DEFAULT for job 18451 at time 1:46:29 INFO:     64 feasible tasks found for job 18451:1 in partition DEFAULT (32 Needed)
01/27 08:45:03 ALERT:    inadequate tasks to allocate to job 18451:0 (0 < 1)
01/27 08:45:03 WARNING:  cannot allocate tasks for job 18451 at 1:58:38
01/27 08:45:03 ERROR:    cannot allocate tasks for job 18451 at any time
01/27 08:45:03 ALERT:    cannot create new reservation for job 18451 (shape[1] 1)
01/27 08:45:03 ALERT:    cannot create new reservation for job 18451
01/27 08:45:03 MJobSetHold(18451,16,00:05:00,NoResources,cannot create reservation for job '18451' (intital reservation attempt)
)
01/27 08:45:03 ALERT:    job '18451' cannot run (deferring job for 300 seconds)
01/27 08:45:03 WARNING:  cannot reserve priority job '18451'
Active Jobs------
------------------
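
One thing that stands out in the excerpt above is that Maui reports
feasible tasks for both requests (6 for 18451:0, 64 for 18451:1) and
then reports "inadequate" tasks with a count of 0.  A rough Python
sketch for pulling just those lines out of a larger log follows.  The
function name summarize and both regular expressions are my own
assumptions, written only against the message wording shown in this
excerpt; other Maui versions may word these lines differently.

```python
import re

# Patterns written against the log lines in this message only.
FEASIBLE = re.compile(
    r"(\d+) feasible tasks found for job (\d+):(\d+) .*\((\d+) Needed\)"
)
INADEQUATE = re.compile(
    r"inadequate (?:feasible )?tasks .* job (\d+):(\d+) \((\d+) < (\d+)\)"
)


def summarize(lines, job_id):
    """Collect feasible-task counts and 'inadequate' reports for one job.

    Returns (feasible, inadequate) where feasible maps the request index
    to (found, needed) from the last matching line, and inadequate is a
    list of (request index, available, needed) in log order.
    """
    feasible = {}
    inadequate = []
    for line in lines:
        m = FEASIBLE.search(line)
        if m and m.group(2) == job_id:
            feasible[int(m.group(3))] = (int(m.group(1)), int(m.group(4)))
        m = INADEQUATE.search(line)
        if m and m.group(1) == job_id:
            inadequate.append(
                (int(m.group(2)), int(m.group(3)), int(m.group(4)))
            )
    return feasible, inadequate


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python scan.py maui.log 18451
    with open(sys.argv[1]) as log:
        found, short = summarize(log, sys.argv[2])
    print("feasible:", found)
    print("inadequate:", short)
```

A request that shows a non-zero feasible count but also appears in the
inadequate list with 0 available is the pattern seen for job 18451.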



-- 

Dr. Hein Zelle
Advisor Meteorology & Oceanography

Tel:    +31 (0)527-242299
Fax:    +31 (0)527-242016
Email:  hein.zelle at bmtargoss.com
Web:    www.bmtargoss.com

BMT ARGOSS
P.O. Box 61, 8325 ZH Vollenhove
Voorsterweg 28, 8316 PT Marknesse
The Netherlands


