[torqueusers] [Mauiusers] priority job failing to get reservation

Naveed Near-Ansari naveed at caltech.edu
Mon Apr 23 12:49:39 MDT 2012


Thanks,

I have changed it to HIGHEST and will see what happens.  Reading the
docs suggests to me that it probably won't help, since this particular
job has been waiting so long that it has by far the highest priority and
probably never dropped out of the reservation depth, but it is possible
that it did drop out at some point and lost its reservation.
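
For anyone following along, the change is a one-line edit in maui.cfg; a
minimal sketch of the relevant settings (RESERVATIONDEPTH is shown at its
documented default of 1, purely for illustration):

  # keep a priority reservation for the highest-priority idle job(s)
  RESERVATIONPOLICY     HIGHEST
  # how many of the top-priority idle jobs hold reservations at once
  RESERVATIONDEPTH      1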

On 04/20/2012 08:29 PM, Lyn Gerner wrote:
> No, I'm suggesting you change the RESERVATIONPOLICY to HIGHEST.
> Completely different effect than CURRENTHIGHEST.
>
> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>> Thanks,
>>
>> No recurring reservations at all. The reservation policy is already set that way:
>>
>> RESERVATIONPOLICY     CURRENTHIGHEST
>>
>> I have been having a dickens of a time figuring out the best policy for our
>> cluster: lots of long jobs, some small, some large.
>>
>> Naveed Near-Ansari
>>
>>
>>
>>
>> On Apr 20, 2012, at 7:41 PM, Lyn Gerner wrote:
>>
>>> Yes, the idle nodes do fluctuate.  If you have any recurring reservations
>>> (say, for a weekly maintenance window), then it may not be able to find a
>>> big enough window to run a large, 4-day job on dedicated nodes.
>>>
>>> You might also want to check to see if RESERVATIONPOLICY is set to
>>> HIGHEST, to make sure that the job keeps its priority reservation, if
>>> it ever gets to the top of the queue.
>>>
>>> Good luck,
>>> Lyn
>>>
>>> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>>>> Thanks.
>>>>
>>>> The number of idle procs actually fluctuates:
>>>>
>>>> job cannot run in partition DEFAULT (insufficient idle procs available: 744 < 1501)
>>>>
>>>> I don't think it is mapping to procs, since there are 628 procs on the
>>>> system (314 nodes * 2 procs).
>>>>
>>>> The QOS does request dedicated nodes. I have seen no issue with this on
>>>> all other jobs: when someone requests 12 tasks, they get one 12-core machine.
>>>>
>>>> I think I may be misunderstanding how priority reservations work.  Does it
>>>> try to find available nodes to reserve within a limited time frame (such
>>>> that no procs will become available within that window), or is it supposed
>>>> to look out indefinitely to find the available procs?  We have a lot of
>>>> long-running jobs, so if it is only looking within a fixed time frame (say
>>>> a month), it may not be able to find the resources.  If that is the case,
>>>> is it possible to change how far ahead it looks?
>>>>
>>>> I couldn't find anything in the documentation that describes specifically
>>>> how it finds resources for priority-based reservations.
>>>>
>>>> Naveed Near-Ansari
>>>>
>>>>
>>>> On Apr 20, 2012, at 5:50 PM, Lyn Gerner wrote:
>>>>
>>>>> So does the checkjob for 220559 still show the "insufficient idle
>>>>> procs available: 1056 < 1501" msg?
>>>>>
>>>>> Seems like somehow the TASKS request is not mapping to cores (of which
>>>>> I surmise you have 3576) but rather procs (which in the above you have
>>>>> 1056).
>>>>>
>>>>> I am really grasping at straws on this: is the "ded" QOS requesting
>>>>> dedicated nodes, and you don't have enough?
>>>>>
>>>>> Not sure where else to tell you to look.
>>>>>
>>>>> Best of luck,
>>>>> Lyn
>>>>>
>>>>> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>>>>>>
>>>>>> On 04/20/2012 04:23 PM, Lyn Gerner wrote:
>>>>>>> Naveed,
>>>>>>>
>>>>>>> It looks like your setup is only showing 1056 procs, not 3552:
>>>>>>>> PE:  1501.00  StartPriority:  144235
>>>>>>>> job cannot run in partition DEFAULT (insufficient idle procs
>>>>>>>> available:
>>>>>>>> 1056 < 1501)
>>>>>>> You might play w/diagnose -t (partition) and diagnose -j (job) to see
>>>>>>> what they tell you.  Also, you could try to explicitly make a
>>>>>>> reservation for the job, and maybe then you could get info from
>>>>>>> diagnose -r (though attempting the setres may give enough error info).
>>>>>>>
>>>>>>> Good luck,
>>>>>>> Lyn
>>>>>> Thanks for looking.
>>>>>>
>>>>>> I think it is configured for 3768 (I said 3552 because the queue it was
>>>>>> sent to has that many procs available to it).  I didn't see anything
>>>>>> clear in either diagnose command.  I attempted to create a reservation,
>>>>>> but it failed:
>>>>>>
>>>>>> # setres -u ortega -d 4:00:00:00 TASKS==1501
>>>>>> ERROR:    'setres' failed
>>>>>> ERROR:    cannot select 1501 tasks for reservation for 3:13:33:56
>>>>>> ERROR:    cannot select requested tasks for 'TASKS==1501'
>>>>>>
>>>>>>
>>>>>>
>>>>>> # diagnose -t
>>>>>> Displaying Partition Status
>>>>>>
>>>>>> System Partition Settings:  PList: DEFAULT PDef: DEFAULT
>>>>>>
>>>>>> Name                    Procs
>>>>>>
>>>>>> DEFAULT                  3768
>>>>>>
>>>>>> Partition    Configured         Up     U/C  Dedicated     D/U     Active     A/U
>>>>>>
>>>>>> NODE----------------------------------------------------------------------------
>>>>>> DEFAULT             314        313  99.68%        297  94.89%        297  94.89%
>>>>>> PROC----------------------------------------------------------------------------
>>>>>> DEFAULT            3768       3756  99.68%       3564  94.89%       3000  79.87%
>>>>>> MEM-----------------------------------------------------------------------------
>>>>>> DEFAULT        15156264   15107978  99.68%   14335282  94.89%          0   0.00%
>>>>>> SWAP----------------------------------------------------------------------------
>>>>>> DEFAULT        30227950   30131665  99.68%   28590985  94.89%    1400704   4.65%
>>>>>> DISK----------------------------------------------------------------------------
>>>>>> DEFAULT             314        313  99.68%        297  94.89%          0   0.00%
>>>>>>
>>>>>> Class/Queue State
>>>>>>
>>>>>>            [<CLASS> <AVAIL>:<UP>]...
>>>>>>
>>>>>>    DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu 3756:3756]
>>>>>>
>>>>>>
>>>>>>
>>>>>> # diagnose -j 220559
>>>>>> Name                  State Par Proc QOS     WCLimit R  Min     User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs       Class Features
>>>>>>
>>>>>> 220559                 Idle ALL 1501 ded  4:00:00:00 0 1501   ortega   simons        -  1:23:34:41   [NONE]  [NONE]  [NONE]   >=0    >=0    NC0  [default:1] [default]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>
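
For completeness, once 220559 reaches the top of the queue after the policy
change, the standard Maui client commands should confirm whether it actually
holds a priority reservation, e.g.:

  checkjob -v 220559   # verbose job status, including any reservation it holds
  showres              # list active reservations, job reservations included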

-- 
Naveed Near-Ansari
E: naveed at caltech.edu
O: 626-395-2212
M: 626-394-3845



