[torqueusers] [Mauiusers] priority job failing to get reservation
Naveed Near-Ansari
naveed at caltech.edu
Mon Apr 23 12:49:39 MDT 2012
Thanks,
I have changed it to HIGHEST and will see what happens. Reading the
docs suggests to me that it probably won't make a difference: this
particular job has been waiting so long that it has by far the highest
priority and has probably never dropped out of the reservation depth.
It is possible, though, that it did drop out and lost its reservation.
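
To make the distinction concrete, here is a toy sketch of how I read the two
policies. It is not Maui's scheduler code. The jobs "A" and "B", the priority
orderings, and the depth of 1 are made up for illustration; only job 220559
comes from this thread.

```python
# Toy model only: illustrates my reading of the docs, not Maui internals.
def reserved_jobs(policy, depth, rankings):
    """Return, for each scheduling pass, the sorted list of jobs holding
    a priority reservation.

    rankings: one list per pass, jobs ordered highest priority first.
    depth:    stand-in for RESERVATIONDEPTH.
    """
    held = set()
    history = []
    for queue in rankings:
        top = set(queue[:depth])
        if policy == "CURRENTHIGHEST":
            # Only the jobs currently within the depth hold reservations.
            held = top
        elif policy == "HIGHEST":
            # A job that ever got a reservation keeps it while still queued.
            held = (held & set(queue)) | top
        else:
            raise ValueError("unknown policy: " + policy)
        history.append(sorted(held))
    return history


# Hypothetical queue orderings: job A overtakes 220559 on the second pass.
rankings = [
    ["220559", "A", "B"],
    ["A", "220559", "B"],
    ["A", "220559", "B"],
]

print(reserved_jobs("CURRENTHIGHEST", 1, rankings))
print(reserved_jobs("HIGHEST", 1, rankings))
```

In this sketch the big job loses its reservation under CURRENTHIGHEST as soon
as another job outranks it, and keeps it under HIGHEST. If the job never left
the top of the queue, both policies behave the same, which is why I doubt the
change will help here.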
On 04/20/2012 08:29 PM, Lyn Gerner wrote:
> No, I'm suggesting you change the RESERVATIONPOLICY to HIGHEST.
> Completely different effect than CURRENTHIGHEST.
>
> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>> Thanks,
>>
>> No recurring reservations at all. The reservation policy is already
>> set that way:
>>
>> RESERVATIONPOLICY CURRENTHIGHEST
>>
>> I have been having a dickens of a time figuring out the best policy for our
>> cluster. Lots of long jobs, some small, some large.
>>
>> Naveed Near-Ansari
>>
>>
>>
>>
>> On Apr 20, 2012, at 7:41 PM, Lyn Gerner wrote:
>>
>>> Yes, the idle nodes do fluctuate. If you have any recurring
>>> reservations (say, for a weekly maintenance window), then it may not
>>> be able to find a big enough window to run a large, 4-day job, on
>>> dedicated nodes.
>>>
>>> You might also want to check to see if RESERVATIONPOLICY is set to
>>> HIGHEST, to make sure that the job keeps its priority reservation, if
>>> it ever gets to the top of the queue.
>>>
>>> Good luck,
>>> Lyn
>>>
>>> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>>>> Thanks.
>>>>
>>>> The idle proc count actually fluctuates:
>>>>
>>>> job cannot run in partition DEFAULT (insufficient idle procs available: 744 < 1501)
>>>>
>>>> I don't think it is mapping to procs, since there are 628 procs on the
>>>> system (314 nodes * 2 procs).
>>>>
>>>> The QOS does request dedicated nodes. I have seen no issue with this on
>>>> all other jobs. When someone requests 12 tasks, they get one 12-core
>>>> machine.
>>>>
>>>> I think I may be misunderstanding how priority reservations work. Does
>>>> it try to find nodes to reserve within a fixed time frame, and fail if
>>>> no procs will be available within that time frame, or is it supposed to
>>>> look ahead indefinitely until it finds the procs? We have a lot of
>>>> long-running processes, so if it is looking within a time frame (say a
>>>> month), it may not be able to find the resources. If this is the case,
>>>> is it possible to change how far ahead it looks?
>>>>
>>>> I couldn't find anything in the documentation that describes specifically
>>>> how it finds resources for priority based reservations.
>>>>
>>>> Naveed Near-Ansari
>>>>
>>>>
>>>> On Apr 20, 2012, at 5:50 PM, Lyn Gerner wrote:
>>>>
>>>>> So does the checkjob for 220559 still show the "insufficient idle
>>>>> procs available: 1056 < 1501" msg?
>>>>>
>>>>> Seems like somehow the TASKS request is not mapping to cores (of which
>>>>> I surmise you have 3576) but rather procs (which in the above you have
>>>>> 1056).
>>>>>
>>>>> I am really grasping at straws on this: is the "ded" QOS requesting
>>>>> dedicated nodes, and you don't have enough?
>>>>>
>>>>> Not sure where else to tell you to look.
>>>>>
>>>>> Best of luck,
>>>>> Lyn
>>>>>
>>>>> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>>>>>>
>>>>>> On 04/20/2012 04:23 PM, Lyn Gerner wrote:
>>>>>>> Naveed,
>>>>>>>
>>>>>>> It looks like your setup is only showing 1056 procs, not 3552:
>>>>>>>> PE: 1501.00 StartPriority: 144235
>>>>>>>> job cannot run in partition DEFAULT (insufficient idle procs available: 1056 < 1501)
>>>>>>> You might play w/diagnose -t (partition) and diagnose -j (job) to see
>>>>>>> what they tell you. Also, you could try to explicitly make a
>>>>>>> reservation for the job, and maybe then you could get info from
>>>>>>> diagnose -r (though attempting the setres may give enough error info).
>>>>>>>
>>>>>>> Good luck,
>>>>>>> Lyn
>>>>>> Thanks for looking.
>>>>>>
>>>>>> I think it is configured for 3768 (I said 3552 because the queue it was
>>>>>> sent to has that many available to it). I didn't see anything clear in
>>>>>> either diagnose command. I attempted to create a reservation, but it
>>>>>> failed.
>>>>>>
>>>>>> # setres -u ortega -d 4:00:00:00 TASKS==1501
>>>>>> ERROR: 'setres' failed
>>>>>> ERROR: cannot select 1501 tasks for reservation for 3:13:33:56
>>>>>> ERROR: cannot select requested tasks for 'TASKS==1501'
>>>>>>
>>>>>>
>>>>>>
>>>>>> # diagnose -t
>>>>>> Displaying Partition Status
>>>>>>
>>>>>> System Partition Settings: PList: DEFAULT PDef: DEFAULT
>>>>>>
>>>>>> Name Procs
>>>>>>
>>>>>> DEFAULT 3768
>>>>>>
>>>>>> Partition  Configured  Up        U/C     Dedicated  D/U     Active   A/U
>>>>>>
>>>>>> NODE----------------------------------------------------------------------------
>>>>>> DEFAULT    314         313       99.68%  297        94.89%  297      94.89%
>>>>>> PROC----------------------------------------------------------------------------
>>>>>> DEFAULT    3768        3756      99.68%  3564       94.89%  3000     79.87%
>>>>>> MEM----------------------------------------------------------------------------
>>>>>> DEFAULT    15156264    15107978  99.68%  14335282   94.89%  0        0.00%
>>>>>> SWAP----------------------------------------------------------------------------
>>>>>> DEFAULT    30227950    30131665  99.68%  28590985   94.89%  1400704  4.65%
>>>>>> DISK----------------------------------------------------------------------------
>>>>>> DEFAULT    314         313       99.68%  297        94.89%  0        0.00%
>>>>>>
>>>>>> Class/Queue State
>>>>>>
>>>>>> [<CLASS> <AVAIL>:<UP>]...
>>>>>>
>>>>>> DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu 3756:3756]
>>>>>>
>>>>>>
>>>>>>
>>>>>> # diagnose -j 220559
>>>>>> Name    State  Par  Proc  QOS  WCLimit     R  Min   User    Group   Account  QueuedTime  Network  Opsys   Arch    Mem  Disk  Procs  Class        Features
>>>>>>
>>>>>> 220559  Idle   ALL  1501  ded  4:00:00:00  0  1501  ortega  simons  -        1:23:34:41  [NONE]   [NONE]  [NONE]  >=0  >=0   NC0    [default:1]  [default]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>
--
Naveed Near-Ansari
E: naveed at caltech.edu
O: 626-395-2212
M: 626-394-3845