[torqueusers] [Mauiusers] priority job failing to get reservation

Naveed Near-Ansari naveed at caltech.edu
Fri Apr 20 21:16:30 MDT 2012


Thanks,

No recurring reservations at all, and the reservation policy is already set that way:

RESERVATIONPOLICY     CURRENTHIGHEST

I have been having a dickens of a time figuring out the best policy for our cluster: lots of long jobs, some small, some large.
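For anyone following along, the reservation-related parameters in maui.cfg that seem relevant here are roughly the following (values illustrative, not necessarily what we actually run; RESERVATIONDEPTH controls how many priority jobs get future reservations at once, and BACKFILLPOLICY controls what is allowed to run around them):

RESERVATIONPOLICY     CURRENTHIGHEST
RESERVATIONDEPTH      1
BACKFILLPOLICY        FIRSTFIT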

Naveed Near-Ansari




On Apr 20, 2012, at 7:41 PM, Lyn Gerner wrote:

> Yes, the idle node count does fluctuate.  If you have any recurring
> reservations (say, for a weekly maintenance window), then the scheduler
> may not be able to find a big enough window to run a large 4-day job on
> dedicated nodes.
> 
> You might also want to check to see if RESERVATIONPOLICY is set to
> HIGHEST, to make sure that the job keeps its priority reservation, if
> it ever gets to the top of the queue.
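> To check both of those, something along these lines (showres and
> showconfig are standard Maui commands) should list any standing
> reservations and the current reservation-related settings:
> 
>   showres -n
>   showconfig | grep -i reservation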
> 
> Good luck,
> Lyn
> 
> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>> Thanks.
>> 
>> The idle proc count actually fluctuates:
>> 
>> job cannot run in partition DEFAULT (insufficient idle procs available: 744
>> < 1501)
>> 
>> I don't think it is mapping to procs since there are 628 procs on the system
>> (314 nodes  * 2 procs)
>> 
>> The QOS does request dedicated nodes.  I have seen no issue with this on
>> any other jobs: when someone requests 12 tasks, they get one 12-core
>> machine.
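>> If it matters, the QOS definition in maui.cfg is something along the
>> lines of the following, though I would have to double-check the exact
>> flags we use:
>> 
>>   QOSCFG[ded] QFLAGS=DEDICATED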
>> 
>> I think I may be misunderstanding how priority reservations work.  Does
>> the scheduler only try to find nodes to reserve within some limited time
>> frame (and give up if no procs will be available within that window), or
>> is it supposed to look out indefinitely to find the procs?  We have a lot
>> of long-running jobs, so if it is only looking within a limited window
>> (say a month), it may not be able to find the resources.  If that is the
>> case, is it possible to change how far ahead it looks?
>> 
>> I couldn't find anything in the documentation that describes
>> specifically how Maui finds resources for priority-based reservations.
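>> For reference, is checkjob -v the right place to look?  My understanding
>> is that the verbose output includes whatever reservation Maui has
>> computed for a priority job, if one exists:
>> 
>>   checkjob -v 220559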
>> 
>> Naveed Near-Ansari
>> 
>> 
>> On Apr 20, 2012, at 5:50 PM, Lyn Gerner wrote:
>> 
>>> So does the checkjob for 220559 still show the "insufficient idle
>>> procs available: 1056 < 1501" msg?
>>> 
>>> Seems like somehow the TASKS request is not mapping to cores (of which
>>> I surmise you have 3576) but rather procs (which in the above you have
>>> 1056).
>>> 
>>> I am really grasping at straws on this: is the "ded" QOS requesting
>>> dedicated nodes, and you don't have enough?
>>> 
>>> Not sure where else to tell you to look.
>>> 
>>> Best of luck,
>>> Lyn
>>> 
>>> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>>>> 
>>>> 
>>>> On 04/20/2012 04:23 PM, Lyn Gerner wrote:
>>>>> Naveed,
>>>>> 
>>>>> It looks like your setup is only showing 1056 procs, not 3552:
>>>>>> PE:  1501.00  StartPriority:  144235
>>>>>> job cannot run in partition DEFAULT (insufficient idle procs available:
>>>>>> 1056 < 1501)
>>>>> You might play w/diagnose -t (partition) and diagnose -j (job) to see
>>>>> what they tell you.  Also, you could try to explicitly make a
>>>>> reservation for the job, and maybe then you could get info from
>>>>> diagnose -r (though attempting the setres may give enough error info).
>>>>> 
>>>>> Good luck,
>>>>> Lyn
>>>> 
>>>> Thanks for looking.
>>>> 
>>>> I think it is configured for 3768 procs (I said 3552 because the queue
>>>> the job was sent to has that many available to it).  I didn't see
>>>> anything clear in the output of either diagnose command.  I attempted
>>>> to create a reservation, but it failed:
>>>> 
>>>> # setres -u ortega -d 4:00:00:00 TASKS==1501
>>>> ERROR:    'setres' failed
>>>> ERROR:    cannot select 1501 tasks for reservation for 3:13:33:56
>>>> ERROR:    cannot select requested tasks for 'TASKS==1501'
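>>>> I may also try the same reservation with an explicit future start time,
>>>> something like the following (my understanding is that setres accepts a
>>>> relative start time via -s), to see whether the nodes can be found at
>>>> all once the currently running jobs have drained:
>>>> 
>>>> # setres -s +4:00:00:00 -u ortega -d 4:00:00:00 TASKS==1501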
>>>> 
>>>> 
>>>> 
>>>> #diagnose -t
>>>> Displaying Partition Status
>>>> 
>>>> System Partition Settings:  PList: DEFAULT PDef: DEFAULT
>>>> 
>>>> Name                    Procs
>>>> 
>>>> DEFAULT                  3768
>>>> 
>>>> Partition    Configured         Up     U/C  Dedicated     D/U     Active     A/U
>>>> 
>>>> NODE----------------------------------------------------------------------------
>>>> DEFAULT             314        313  99.68%        297  94.89%        297  94.89%
>>>> PROC----------------------------------------------------------------------------
>>>> DEFAULT            3768       3756  99.68%       3564  94.89%       3000  79.87%
>>>> MEM----------------------------------------------------------------------------
>>>> DEFAULT        15156264   15107978  99.68%   14335282  94.89%          0   0.00%
>>>> SWAP----------------------------------------------------------------------------
>>>> DEFAULT        30227950   30131665  99.68%   28590985  94.89%    1400704   4.65%
>>>> DISK----------------------------------------------------------------------------
>>>> DEFAULT             314        313  99.68%        297  94.89%          0   0.00%
>>>> 
>>>> Class/Queue State
>>>> 
>>>>            [<CLASS> <AVAIL>:<UP>]...
>>>> 
>>>>    DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu 3756:3756]
>>>> 
>>>> 
>>>> 
>>>> #diagnose -j 220559
>>>> Name                  State Par Proc QOS     WCLimit R  Min     User   Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs       Class Features
>>>> 
>>>> 220559                 Idle ALL 1501 ded  4:00:00:00 0 1501   ortega  simons        -  1:23:34:41   [NONE] [NONE] [NONE]    >=0    >=0    NC0 [default:1] [default]
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
