[Mauiusers] Diagnosing batchHold NoResources
Naveed Near-Ansari
naveed at caltech.edu
Tue Aug 16 10:33:33 MDT 2011
Do you have any tricks for diagnosing jobs that are held on NoResources?
I have a job that keeps being put into this state, but I can't see which
resources are missing. It is a 2001-core job, but showq shows that there
are 3164 cores on the system. I had the user drop the memory requirements
to see if it would go through, and as far as I can see, nothing else was
requested.
checking job 141698
State: Idle EState: Deferred
Creds: user:user group:group class:default qos:dedicated
WallTime: 00:00:00 of 6:00:00:00
SubmitTime: Mon Aug 15 16:00:56
(Time Queued Total: 17:31:43 Eligible: 00:22:39)
Total Tasks: 2001
Req[0] TaskCount: 2001 Partition: ALL
Network: [NONE] Memory >= 1024M Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [default]
Dedicated Resources Per Task: PROCS: 1 MEM: 1024M
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE PREEMPTEE DEDICATEDNODE
Attr: PREEMPTEE
job is deferred. Reason: NoResources (cannot create reservation for
job '141698' (intital reservation attempt)
)
Holds: Batch Defer (hold reason: NoResources)
PE: 2001.00 StartPriority: 22
cannot select job 141698 for partition DEFAULT (job hold active)
These are the logs after releasing the hold:
08/16 09:30:05 MQueueScheduleIJobs(Q,DEFAULT)
08/16 09:30:05 INFO: 2988 feasible tasks found for job 141698:0 in
partition DEFAULT (2001 Needed)
08/16 09:30:05 MJobPReserve(141698,DEFAULT,ResCount,ResCountRej)
08/16 09:30:05 INFO: 2988 feasible tasks found for job 141698:0 in
partition DEFAULT (2001 Needed)
08/16 09:30:05 ALERT: job 141698 cannot run in any partition
08/16 09:30:05 ALERT: cannot create new reservation for job 141698
(shape[1] 2001)
08/16 09:30:05 ALERT: cannot create new reservation for job 141698
08/16 09:30:05 ALERT: job '141698' cannot run (deferring job for 300
seconds)
08/16 09:30:05 INFO: batch hold placed on job '141698', reason:
'NoResources'
08/16 09:30:05 MSysRegEvent(JOBHOLD: batch hold placed on job
'141698'. defercount: 33 reason: 'NoResources',0,0,1)
08/16 09:30:05 MSysLaunchAction(ASList,1)
08/16 09:30:05 WARNING: cannot reserve priority job '141698'
Active Jobs------
------------------
08/16 09:30:05 INFO: resources available after scheduling: N: 185
P: 1908
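To make the numbers in these log lines easier to compare, here is a minimal Python sketch that pulls out the feasible task count, the needed task count, and the free nodes and processors reported after the scheduling pass. The regular expressions and the summarize() helper are my own assumptions based only on the lines pasted above (with the wrapped lines rejoined); they are not part of Maui, and the comparison is only a sanity check, not a diagnosis of why the reservation fails.

```python
import re

# Sample lines copied from the log excerpt above, with wrapped lines rejoined.
LOG = """\
08/16 09:30:05 INFO: 2988 feasible tasks found for job 141698:0 in partition DEFAULT (2001 Needed)
08/16 09:30:05 ALERT: cannot create new reservation for job 141698 (shape[1] 2001)
08/16 09:30:05 INFO: resources available after scheduling: N: 185 P: 1908
"""

# Patterns are assumptions derived from the excerpt, not an official log format.
FEASIBLE = re.compile(
    r"INFO:\s+(\d+) feasible tasks found for job (\d+):\d+ "
    r"in partition (\S+) \((\d+) Needed\)"
)
AVAILABLE = re.compile(
    r"resources available after scheduling: N:\s*(\d+)\s+P:\s*(\d+)"
)


def summarize(log_text):
    """Collect feasible/needed task counts and free nodes/procs from log text."""
    summary = {}
    for line in log_text.splitlines():
        m = FEASIBLE.search(line)
        if m:
            summary["feasible"] = int(m.group(1))
            summary["job"] = m.group(2)
            summary["partition"] = m.group(3)
            summary["needed"] = int(m.group(4))
            continue
        m = AVAILABLE.search(line)
        if m:
            summary["free_nodes"] = int(m.group(1))
            summary["free_procs"] = int(m.group(2))
    return summary


s = summarize(LOG)
print(s)
print("feasible >= needed:", s["feasible"] >= s["needed"])
print("free procs >= needed:", s["free_procs"] >= s["needed"])
```

On this excerpt the feasible task count exceeds what the job needs, while the free processor count reported after scheduling does not. Whether that gap is related to the DEDICATEDNODE flag on the job is something I can't tell from the log alone.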
This was submitted to the default queue, which has a QOS of dedicated.
The Maui settings and the Torque queue definition are:
QOSCFG[dedicated] QFLAGS=PREEMPTEE:DEDICATED
CLASSCFG[default] QDEF=dedicated
create queue default
set queue default queue_type = Execution
set queue default resources_default.pmem = 1500mb
set queue default enabled = True
set queue default started = True