[Mauiusers] priority job failing to get reservation

Naveed Near-Ansari naveed at caltech.edu
Fri Apr 6 12:29:45 MDT 2012


Hi all,

I am having an issue with a priority job not getting a reservation.
When I set RESERVATIONDEPTH to 2, the second priority job does get a
reservation, though.

The cluster has 3552 cores available for the queue the job is submitted
to; at the moment they are all in use.  Since the job has the highest
priority, it should start reserving nodes (and it does try).  When I
change RESERVATIONDEPTH to 2, the second-highest-priority job does
get a reservation, though it is a much smaller job.

We don't have a size limit on jobs and the cluster does have the
resources for this job.

Does anyone know what may be going on here?  We have this type of
workflow, where some people send in very large jobs and some small
ones, so I would like to figure out what is happening.

Here is the checkjob output.  As you can see, the job isn't requesting
any resources other than cores.  I have no idea where it is getting the
idle procs count from, since no procs are actually idle:

checking job 213152

State: Idle
Creds:  user:user  group:group  class:default  qos:dedicated
WallTime: 00:00:00 of 1:12:00:00
SubmitTime: Fri Apr  6 03:35:23
  (Time Queued  Total: 7:45:59  Eligible: 1:30:06)

Total Tasks: 1501

Req[0]  TaskCount: 1501  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [default]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE PREEMPTEE DEDICATEDNODE
Attr:        PREEMPTEE

PE:  1501.00  StartPriority:  144235
job cannot run in partition DEFAULT (insufficient idle procs available:
1056 < 1501)


Here are the relevant log entries:

04/06 03:35:24 MJobPReserve(213152,DEFAULT,ResCount,ResCountRej)
04/06 03:35:24 INFO:     3552 feasible tasks found for job 213152:0 in
partition DEFAULT (1501 Needed)
04/06 03:35:24 ALERT:    job 213152 cannot run in any partition
04/06 03:35:24 ALERT:    cannot create new reservation for job 213152
(shape[1] 1501)
04/06 03:35:24 ALERT:    cannot create new reservation for job 213152
04/06 03:35:24 ALERT:    job '213152' cannot run (deferring job for 3600
seconds)
04/06 03:35:24 WARNING:  cannot reserve priority job '213152'
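
To make the mismatch concrete, here is a small Python sketch that pulls
the numbers out of the checkjob message and the log line above.  It is
not part of Maui; the function names are made up for illustration, and
the two message strings are simply copied from the output in this mail
with the line wrapping removed.

```python
import re

# Messages copied from the checkjob output and the Maui log in this mail
# (wrapped lines rejoined).
CHECKJOB_MSG = (
    "job cannot run in partition DEFAULT "
    "(insufficient idle procs available: 1056 < 1501)"
)
LOG_MSG = (
    "INFO:     3552 feasible tasks found for job 213152:0 in "
    "partition DEFAULT (1501 Needed)"
)


def parse_idle(msg):
    """Return (idle_procs, needed_procs) from the checkjob message, or None."""
    m = re.search(r"insufficient idle procs available:\s*(\d+)\s*<\s*(\d+)", msg)
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_feasible(msg):
    """Return (feasible_tasks, needed_tasks) from the log line, or None."""
    m = re.search(r"(\d+) feasible tasks found .*\((\d+) Needed\)", msg)
    return (int(m.group(1)), int(m.group(2))) if m else None


idle, needed = parse_idle(CHECKJOB_MSG)
feasible, needed_log = parse_feasible(LOG_MSG)

# Hypothetical helper output: puts the three figures side by side.
print(f"feasible={feasible} idle={idle} needed={needed}")
```

So the scheduler considers all 3552 tasks feasible for the job, yet
checkjob reports only 1056 idle procs against the 1501 needed, and that
second comparison is the one I can't account for.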

-- 
Naveed Near-Ansari
E: naveed at caltech.edu
O: 626-395-2212
M: 626-394-3845

