[torqueusers] priority job failing to get reservation

Naveed Near-Ansari naveed at caltech.edu
Fri Apr 20 15:37:43 MDT 2012

I know this isn't technically a Torque question, but I haven't seen any
activity on the Maui list and I thought there might be some overlap in
users here.

Torque 2.5.9
Maui 3.3.1

I am having an issue with a priority job not getting a reservation. When
I set reservation depth to 2, the second priority job does get a
reservation though.

The cluster has 3552 cores available for the queue the job is submitted
to, and at the moment they are all in use.  Since the job has the highest
priority, it should start reserving nodes (and it does try).  When I
change RESERVATIONDEPTH to 2, the second-highest-priority job does
get a reservation, though this is a much smaller job.  Perhaps I am
misunderstanding how these reservations work.  Is there a timeframe
within which it has to reserve nodes?

We don't have a size limit on jobs and the cluster does have the
resources for this job.

Does anyone know what may be going on here?  We have this type of
workflow, where some people submit very large jobs and some small ones,
so I would like to figure out what is happening.  Do you have any good
strategies for dealing with this type of workflow?

Here is the checkjob output; as you can see, the job isn't requesting any
resources other than cores.  I have no idea where it is getting the
idle procs from, since none are actually idle.  Perhaps it has to do with
reservable nodes?  The idle procs count tends to fluctuate over time.

checking job 213152

State: Idle
Creds:  user:user  group:group  class:default  qos:dedicated
WallTime: 00:00:00 of 1:12:00:00
SubmitTime: Fri Apr  6 03:35:23
  (Time Queued  Total: 7:45:59  Eligible: 1:30:06)

Total Tasks: 1501

Req[0]  TaskCount: 1501  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [default]

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Attr:        PREEMPTEE

PE:  1501.00  StartPriority:  144235
job cannot run in partition DEFAULT (insufficient idle procs available:
1056 < 1501)
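
Since the idle procs figure in that last message fluctuates, it may help
to pull the two numbers out of saved checkjob output and compare them
over time.  The following is only a sketch: the function name
parse_idle_shortfall is made up for illustration, and the regular
expression assumes the message wording shown in the output above.

```python
import re

# Matches the shortfall message from checkjob. \s+ tolerates the line wrap
# that appears between "available:" and the numbers in the pasted output.
_SHORTFALL = re.compile(
    r"insufficient\s+idle\s+procs\s+available:\s*(\d+)\s*<\s*(\d+)"
)


def parse_idle_shortfall(checkjob_text):
    """Return (idle, needed) from checkjob output, or None if absent.

    Hypothetical helper: it only understands the message format seen in
    this post and is not part of Maui or Torque.
    """
    match = _SHORTFALL.search(checkjob_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = """job cannot run in partition DEFAULT (insufficient idle procs available:
1056 < 1501)"""

idle, needed = parse_idle_shortfall(sample)
print(idle, needed, needed - idle)  # prints: 1056 1501 445
```

Running this against checkjob output captured at intervals would show
whether the reported idle count ever reaches the 1501 tasks requested.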

Here are the relevant log entries:

04/06 03:35:24 MJobPReserve(213152,DEFAULT,ResCount,ResCountRej)
04/06 03:35:24 INFO:     3552 feasible tasks found for job 213152:0 in
partition DEFAULT (1501 Needed)
04/06 03:35:24 ALERT:    job 213152 cannot run in any partition
04/06 03:35:24 ALERT:    cannot create new reservation for job 213152
(shape[1] 1501)
04/06 03:35:24 ALERT:    cannot create new reservation for job 213152
04/06 03:35:24 ALERT:    job '213152' cannot run (deferring job for 3600
04/06 03:35:24 WARNING:  cannot reserve priority job '213152'
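
The odd part of these entries is that the scheduler reports enough
feasible tasks (3552 against 1501 needed) and still refuses the
reservation.  To check whether other jobs hit the same pattern, the log
can be scanned along these lines.  This is a sketch under assumptions:
the function name unreserved_but_feasible is hypothetical, and the
patterns are based only on the log wording quoted above.

```python
import re

# "<count> feasible tasks found for job <id>:<req> in partition <p> (<n> Needed)"
# \s+ is used throughout because the quoted log lines are wrapped.
FEASIBLE = re.compile(
    r"(\d+)\s+feasible\s+tasks\s+found\s+for\s+job\s+(\d+):\d+\s+in\s+"
    r"partition\s+(\S+)\s+\((\d+)\s+Needed\)"
)
# "cannot reserve priority job '<id>'"
NO_RESERVE = re.compile(r"cannot\s+reserve\s+priority\s+job\s+'(\d+)'")


def unreserved_but_feasible(log_text):
    """List (job, feasible, needed) for priority jobs that could not be
    reserved even though the feasible task count covered the request."""
    feasible = {}
    for match in FEASIBLE.finditer(log_text):
        count, job, _partition, needed = match.groups()
        feasible[job] = (int(count), int(needed))

    result = []
    for job in NO_RESERVE.findall(log_text):
        if job in feasible and feasible[job][0] >= feasible[job][1]:
            result.append((job, feasible[job][0], feasible[job][1]))
    return result


log_sample = """04/06 03:35:24 INFO:     3552 feasible tasks found for job 213152:0 in
partition DEFAULT (1501 Needed)
04/06 03:35:24 ALERT:    job 213152 cannot run in any partition
04/06 03:35:24 WARNING:  cannot reserve priority job '213152'
"""

print(unreserved_but_feasible(log_sample))  # prints: [('213152', 3552, 1501)]
```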

Naveed Near-Ansari
