[Mauiusers] Diagnosing batchHold NoResources

Naveed Near-Ansari naveed at caltech.edu
Tue Aug 16 10:33:33 MDT 2011


Do you have any tricks for diagnosing jobs that are held with NoResources?

I have a job that keeps being put into this state, but I can't see which
resources are missing.  It is a 2001-core job, and showq shows that there
are 3164 cores on the system.  I had the user drop the memory requirements
to see whether it would go through, and as far as I can see, nothing else
was requested.

checking job 141698

State: Idle  EState: Deferred
Creds:  user:user  group:group  class:default  qos:dedicated
WallTime: 00:00:00 of 6:00:00:00
SubmitTime: Mon Aug 15 16:00:56
  (Time Queued  Total: 17:31:43  Eligible: 00:22:39)

Total Tasks: 2001

Req[0]  TaskCount: 2001  Partition: ALL
Network: [NONE]  Memory >= 1024M  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [default]
Dedicated Resources Per Task: PROCS: 1  MEM: 1024M


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE PREEMPTEE DEDICATEDNODE
Attr:        PREEMPTEE

job is deferred.  Reason:  NoResources  (cannot create reservation for
job '141698' (intital reservation attempt)
)
Holds:    Batch  Defer  (hold reason:  NoResources)
PE:  2001.00  StartPriority:  22
cannot select job 141698 for partition DEFAULT (job hold active)
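
To make the request in the checkjob output easier to compare against what the cluster offers, here is a minimal Python sketch that pulls out the task count, per-task resources, flags and hold reason, and multiplies them out. The excerpt is copied from the output above; the function name summarize_checkjob and the regular expressions are my own and only assume the field layout shown here, not any documented checkjob format.

```python
import re

# Excerpt copied from the checkjob output above.
CHECKJOB = """\
State: Idle  EState: Deferred
Total Tasks: 2001
Dedicated Resources Per Task: PROCS: 1  MEM: 1024M
Flags:       RESTARTABLE PREEMPTEE DEDICATEDNODE
Holds:    Batch  Defer  (hold reason:  NoResources)
"""


def summarize_checkjob(text):
    """Extract the fields most relevant to a NoResources hold (hypothetical helper)."""
    tasks = int(re.search(r"Total Tasks:\s*(\d+)", text).group(1))
    per_task = re.search(r"PROCS:\s*(\d+)\s+MEM:\s*(\d+)M", text)
    procs_per_task = int(per_task.group(1))
    mem_per_task_mb = int(per_task.group(2))
    flags = re.search(r"^Flags:\s*(.+)$", text, re.MULTILINE).group(1).split()
    reason = re.search(r"hold reason:\s*(\w+)", text).group(1)
    return {
        "tasks": tasks,
        "total_procs": tasks * procs_per_task,      # processors requested overall
        "total_mem_mb": tasks * mem_per_task_mb,    # dedicated memory requested overall
        "flags": flags,
        "hold_reason": reason,
    }


summary = summarize_checkjob(CHECKJOB)
print(summary)
```

For this job the totals come to 2001 processors and 2,049,024 MB of dedicated memory, and the flag list includes DEDICATEDNODE, so the request is more than just a processor count.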



These are the logs after releasing the hold:


08/16 09:30:05 MQueueScheduleIJobs(Q,DEFAULT)
08/16 09:30:05 INFO:     2988 feasible tasks found for job 141698:0 in partition DEFAULT (2001 Needed)
08/16 09:30:05 MJobPReserve(141698,DEFAULT,ResCount,ResCountRej)
08/16 09:30:05 INFO:     2988 feasible tasks found for job 141698:0 in partition DEFAULT (2001 Needed)
08/16 09:30:05 ALERT:    job 141698 cannot run in any partition
08/16 09:30:05 ALERT:    cannot create new reservation for job 141698 (shape[1] 2001)
08/16 09:30:05 ALERT:    cannot create new reservation for job 141698
08/16 09:30:05 ALERT:    job '141698' cannot run (deferring job for 300 seconds)
08/16 09:30:05 INFO:     batch hold placed on job '141698', reason: 'NoResources'
08/16 09:30:05 MSysRegEvent(JOBHOLD:  batch hold placed on job '141698'.  defercount: 33  reason: 'NoResources',0,0,1)
08/16 09:30:05 MSysLaunchAction(ASList,1)
08/16 09:30:05 WARNING:  cannot reserve priority job '141698'
Active Jobs------
------------------
08/16 09:30:05 INFO:     resources available after scheduling: N: 185 P: 1908
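
The numbers in these log lines can be set side by side with a short Python sketch. It is only an illustration: the three messages are copied from the log above, and the function name log_counts and the regular expressions are my own assumptions about the message layout, not part of Maui. It restates what the log says and does not by itself show why the reservation fails.

```python
import re

# Three messages copied from the log excerpt above.
LOG = """\
08/16 09:30:05 INFO:     2988 feasible tasks found for job 141698:0 in
partition DEFAULT (2001 Needed)
08/16 09:30:05 ALERT:    cannot create new reservation for job 141698
(shape[1] 2001)
08/16 09:30:05 INFO:     resources available after scheduling: N: 185
P: 1908
"""


def log_counts(text):
    """Pull the task and resource counts out of the log text (hypothetical helper)."""
    # Collapse whitespace so messages wrapped by the mail client still match.
    flat = " ".join(text.split())
    feasible, needed = map(int, re.search(
        r"(\d+) feasible tasks found for job \S+ in partition \S+ \((\d+) Needed\)",
        flat).groups())
    nodes, procs = map(int, re.search(
        r"resources available after scheduling: N: (\d+) P: (\d+)",
        flat).groups())
    return {
        "feasible": feasible,
        "needed": needed,
        "free_nodes": nodes,
        "free_procs": procs,
    }


counts = log_counts(LOG)
print(counts)
print("feasible - needed   =", counts["feasible"] - counts["needed"])
print("needed - free procs =", counts["needed"] - counts["free_procs"])
```

So the scheduler reports 987 more feasible tasks than the job needs, while the processors left free after scheduling (1908 on 185 nodes) are 93 short of 2001. The feasible count and the free count are clearly measuring different things.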




The job was submitted to the default queue, whose default QOS is dedicated:

QOSCFG[dedicated]    QFLAGS=PREEMPTEE:DEDICATED
CLASSCFG[default]    QDEF=dedicated
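
As a quick cross-check of how the class maps to the QOS flags, the two maui.cfg lines above can be traced with a few lines of Python. This is just a sketch: parse_cfg and qos_flags_for_class are hypothetical helpers, and the parsing assumes only the PARAM[index] KEY=VALUE layout and the colon-separated QFLAGS list as they appear in these two lines.

```python
import re

# The two maui.cfg lines quoted above.
MAUI_CFG = """\
QOSCFG[dedicated]    QFLAGS=PREEMPTEE:DEDICATED
CLASSCFG[default]    QDEF=dedicated
"""


def parse_cfg(text):
    """Map (PARAM, index) -> {attribute: value} for lines like PARAM[index] K=V."""
    cfg = {}
    for line in text.splitlines():
        m = re.match(r"(\w+)\[(\w+)\]\s+(.*)", line.strip())
        if not m:
            continue  # skip anything that is not in PARAM[index] form
        param, index, rest = m.groups()
        attrs = dict(pair.split("=", 1) for pair in rest.split())
        cfg.setdefault((param, index), {}).update(attrs)
    return cfg


def qos_flags_for_class(cfg, class_name):
    """Follow CLASSCFG[class] QDEF to the matching QOSCFG entry and return its flags."""
    qos = cfg[("CLASSCFG", class_name)]["QDEF"]
    flags = cfg[("QOSCFG", qos)]["QFLAGS"].split(":")
    return qos, flags


cfg = parse_cfg(MAUI_CFG)
qos, flags = qos_flags_for_class(cfg, "default")
print(qos, flags)
```

The class default resolves to QOS dedicated with flags PREEMPTEE and DEDICATED, which lines up with the qos:dedicated credential and the PREEMPTEE and DEDICATEDNODE entries on the Flags line of the checkjob output.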


create queue default
set queue default queue_type = Execution
set queue default resources_default.pmem = 1500mb
set queue default enabled = True
set queue default started = True



