[torqueusers] [Mauiusers] priority job failing to get reservation
Naveed Near-Ansari
naveed at caltech.edu
Fri Apr 20 21:16:30 MDT 2012
Thanks,
No recurring reservations at all. The reservation policy is already set that way:
RESERVATIONPOLICY CURRENTHIGHEST
I have been having a dickens of a time figuring out the best policy for our cluster. Lots of long jobs, some small, some large.
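
As a rough sanity check on the numbers quoted further down the thread, here is a small Python sketch. It assumes every node has 12 cores (3768 procs / 314 nodes, per diagnose -t) and that the "ded" QOS hands out whole nodes. The function names are made up for illustration and are not part of Maui or TORQUE.

```python
import math

NODES = 314             # configured nodes, from diagnose -t
CORES_PER_NODE = 12     # assumption: uniform 12-core nodes (3768 / 314)
REQUESTED_TASKS = 1501  # task count of job 220559


def nodes_needed(tasks, cores_per_node):
    """Whole nodes required when tasks may not share a node with other jobs."""
    return math.ceil(tasks / cores_per_node)


def can_start_now(idle_procs, tasks):
    """Mirror the check behind 'insufficient idle procs available: X < Y'."""
    return idle_procs >= tasks


total_procs = NODES * CORES_PER_NODE
whole_nodes = nodes_needed(REQUESTED_TASKS, CORES_PER_NODE)

print("configured procs:", total_procs)
print("dedicated nodes needed:", whole_nodes)
print("procs tied up by those nodes:", whole_nodes * CORES_PER_NODE)

# Idle proc counts reported by checkjob at two different times.
for idle in (1056, 744):
    print("idle procs", idle, "-> can start now:", can_start_now(idle, REQUESTED_TASKS))
```

If the 12-core assumption holds, the job fits within the configured procs but not within the idle procs seen at either point, which is why it depends on a future reservation rather than an immediate start.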
Naveed Near-Ansari
On Apr 20, 2012, at 7:41 PM, Lyn Gerner wrote:
> Yes, the idle nodes do fluctuate. If you have any recurring
> reservations (say, for a weekly maintenance window), then it may not
> be able to find a big enough window to run a large, 4-day job, on
> dedicated nodes.
>
> You might also want to check to see if RESERVATIONPOLICY is set to
> HIGHEST, to make sure that the job keeps its priority reservation, if
> it ever gets to the top of the queue.
>
> Good luck,
> Lyn
>
> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>> Thanks.
>>
>> The idle proc count actually fluctuates:
>>
>> job cannot run in partition DEFAULT (insufficient idle procs available: 744 < 1501)
>>
>> I don't think it is mapping to procs since there are 628 procs on the system
>> (314 nodes * 2 procs)
>>
>> The QOS does request dedicated nodes. I have seen no issue with this on all
>> other jobs. When someone requests 12 tasks, they get one 12-core machine.
>>
>> I think I may be misunderstanding how priority reservations work. Does it
>> try to find available nodes to reserve within a fixed time frame (in which
>> case no procs may become available within that window), or is it supposed
>> to look out indefinitely to find the procs? We have a lot of long-running
>> processes, so if it is looking within a time frame (say a month), it may
>> not be able to find the resources. If this is the case, is it possible to
>> change how far ahead it looks?
>>
>> I couldn't find anything in the documentation that describes specifically
>> how it finds resources for priority based reservations.
>>
>> Naveed Near-Ansari
>>
>>
>> On Apr 20, 2012, at 5:50 PM, Lyn Gerner wrote:
>>
>>> So does the checkjob for 220559 still show the "insufficient idle
>>> procs available: 1056 < 1501" msg?
>>>
>>> Seems like somehow the TASKS request is not mapping to cores (of which
>>> I surmise you have 3576) but rather procs (which in the above you have
>>> 1056).
>>>
>>> I am really grasping at straws on this: is the "ded" QOS requesting
>>> dedicated nodes, and you don't have enough?
>>>
>>> Not sure where else to tell you to look.
>>>
>>> Best of luck,
>>> Lyn
>>>
>>> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>>>>
>>>>
>>>> On 04/20/2012 04:23 PM, Lyn Gerner wrote:
>>>>> Naveed,
>>>>>
>>>>> It looks like your setup is only showing 1056 procs, not 3552:
>>>>>> PE: 1501.00 StartPriority: 144235
>>>>>> job cannot run in partition DEFAULT (insufficient idle procs available: 1056 < 1501)
>>>>> You might play w/diagnose -t (partition) and diagnose -j (job) to see
>>>>> what they tell you. Also, you could try to explicitly make a
>>>>> reservation for the job, and maybe then you could get info from
>>>>> diagnose -r (though attempting the setres may give enough error info).
>>>>>
>>>>> Good luck,
>>>>> Lyn
>>>>
>>>> Thanks for looking.
>>>>
>>>> I think it is configured for 3768 (I said 3552 because the queue it was
>>>> sent to has that many available to it). I didn't see anything clear in
>>>> either diagnose command. I attempted to create a reservation, but it
>>>> failed.
>>>>
>>>> # setres -u ortega -d 4:00:00:00 TASKS==1501
>>>> ERROR: 'setres' failed
>>>> ERROR: cannot select 1501 tasks for reservation for 3:13:33:56
>>>> ERROR: cannot select requested tasks for 'TASKS==1501'
>>>>
>>>>
>>>>
>>>> #diagnose -t
>>>> Displaying Partition Status
>>>>
>>>> System Partition Settings: PList: DEFAULT PDef: DEFAULT
>>>>
>>>> Name Procs
>>>>
>>>> DEFAULT 3768
>>>>
>>>> Partition  Configured        Up     U/C  Dedicated     D/U    Active     A/U
>>>>
>>>> NODE----------------------------------------------------------------------------
>>>> DEFAULT           314       313  99.68%        297  94.89%       297  94.89%
>>>> PROC----------------------------------------------------------------------------
>>>> DEFAULT          3768      3756  99.68%       3564  94.89%      3000  79.87%
>>>> MEM----------------------------------------------------------------------------
>>>> DEFAULT      15156264  15107978  99.68%   14335282  94.89%         0   0.00%
>>>> SWAP----------------------------------------------------------------------------
>>>> DEFAULT      30227950  30131665  99.68%   28590985  94.89%   1400704   4.65%
>>>> DISK----------------------------------------------------------------------------
>>>> DEFAULT           314       313  99.68%        297  94.89%         0   0.00%
>>>>
>>>> Class/Queue State
>>>>
>>>> [<CLASS> <AVAIL>:<UP>]...
>>>>
>>>> DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu 3756:3756]
>>>>
>>>>
>>>>
>>>> #diagnose -j 220559
>>>> Name    State  Par  Proc  QOS  WCLimit     R  Min   User    Group   Account  QueuedTime  Network  Opsys   Arch    Mem  Disk  Procs  Class        Features
>>>>
>>>> 220559  Idle   ALL  1501  ded  4:00:00:00  0  1501  ortega  simons  -        1:23:34:41  [NONE]   [NONE]  [NONE]  >=0  >=0   NC0    [default:1]  [default]