[torqueusers] [Mauiusers] priority job failing to get reservation

Naveed Near-Ansari naveed at caltech.edu
Fri Apr 20 18:59:56 MDT 2012


Thanks.

The idle procs count actually fluctuates:

job cannot run in partition DEFAULT (insufficient idle procs available: 744 < 1501)

I don't think it is mapping to procs, since there are 628 procs on the system (314 nodes * 2 procs).
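To double-check what Maui is counting, the per-node diagnostics should show the configured proc count it is working from. A quick sketch (checknode takes a real node name; node001 below is just a placeholder):

# diagnose -n | head     (per-node state and configured procs as Maui sees them)
# checknode node001      (detailed resource view of a single node)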

The QOS does request dedicated nodes. I have seen no issue with this on any other jobs. When someone requests 12 tasks, they get one 12-core machine.
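For reference, the QOS flags can be confirmed with diagnose -Q; below is a minimal sketch of what the maui.cfg definition might look like if the ded QOS requests dedicated nodes via the DEDICATED flag (illustrative only, the actual config may use different attributes):

# diagnose -Q            (shows each QOS and its flags as Maui sees them)

QOSCFG[ded]   FLAGS=DEDICATED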

I think I may be misunderstanding how priority reservations work. Does the scheduler try to find available nodes to reserve within a fixed time window (in which case no procs may become available inside that window), or is it supposed to look out indefinitely until it finds the procs? We have a lot of long-running jobs, so if it only looks within a limited window (say a month), it may never find the resources. If that is the case, is it possible to change how far ahead it looks?
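From what I can tell, priority-reservation behavior is governed by a couple of maui.cfg parameters; a sketch with illustrative values only (not a recommendation, and I may be missing others):

RESERVATIONPOLICY    CURRENTHIGHEST   # which priority jobs get reservations
RESERVATIONDEPTH[0]  1                # how many priority jobs hold reservations at once

My understanding is that the reservation start time is derived from the wallclock limits of the running jobs, so very long walltimes can push the earliest feasible start far into the future.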

I couldn't find anything in the documentation that describes specifically how resources are found for priority-based reservations.
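One experiment that might narrow it down: request the same reservation with an explicit future start time and see whether it ever succeeds. A sketch, assuming setres accepts a -s starttime with a relative offset per the Maui time spec (the offsets below are arbitrary):

# setres -u ortega -s +7:00:00:00 -d 4:00:00:00 TASKS==1501     (one week out)
# setres -u ortega -s +30:00:00:00 -d 4:00:00:00 TASKS==1501    (one month out)

If even the month-out request fails, the limit is presumably capacity or policy rather than how far ahead the scheduler looks.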

Naveed Near-Ansari


On Apr 20, 2012, at 5:50 PM, Lyn Gerner wrote:

> So does the checkjob for 220559 still show the "insufficient idle
> procs available: 1056 < 1501" msg?
> 
> Seems like somehow the TASKS request is not mapping to cores (of which
> I surmise you have 3576) but rather procs (which in the above you have
> 1056).
> 
> I am really grasping at straws on this: is the "ded" QOS requesting
> dedicated nodes, and you don't have enough?
> 
> Not sure where else to tell you to look.
> 
> Best of luck,
> Lyn
> 
> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>> 
>> 
>> On 04/20/2012 04:23 PM, Lyn Gerner wrote:
>>> Naveed,
>>> 
>>> It looks like your setup is only showing 1056 procs, not 3552:
>>>> PE:  1501.00  StartPriority:  144235
>>>> job cannot run in partition DEFAULT (insufficient idle procs available:
>>>> 1056 < 1501)
>>> You might play w/diagnose -t (partition) and diagnose -j (job) to see
>>> what they tell you.  Also, you could try to explicitly make a
>>> reservation for the job, and maybe then you could get info from
>>> diagnose -r (though attempting the setres may give enough error info).
>>> 
>>> Good luck,
>>> Lyn
>> 
>> Thanks for looking.
>> 
>> I think it is configured for 3768 (I said 3552 because the queue it was
>> sent to has that many available to it). I didn't see anything clear in
>> either diagnose command.  I attempted to create a reservation, but it
>> failed.
>> 
>> # setres -u ortega -d 4:00:00:00 TASKS==1501
>> ERROR:    'setres' failed
>> ERROR:    cannot select 1501 tasks for reservation for 3:13:33:56
>> ERROR:    cannot select requested tasks for 'TASKS==1501'
>> 
>> 
>> 
>> #diagnose -t
>> Displaying Partition Status
>> 
>> System Partition Settings:  PList: DEFAULT PDef: DEFAULT
>> 
>> Name                    Procs
>> 
>> DEFAULT                  3768
>> 
>> Partition    Configured         Up     U/C  Dedicated     D/U     Active     A/U
>> 
>> NODE----------------------------------------------------------------------------
>> DEFAULT             314        313  99.68%        297  94.89%        297  94.89%
>> PROC----------------------------------------------------------------------------
>> DEFAULT            3768       3756  99.68%       3564  94.89%       3000  79.87%
>> MEM----------------------------------------------------------------------------
>> DEFAULT        15156264   15107978  99.68%   14335282  94.89%          0   0.00%
>> SWAP----------------------------------------------------------------------------
>> DEFAULT        30227950   30131665  99.68%   28590985  94.89%    1400704   4.65%
>> DISK----------------------------------------------------------------------------
>> DEFAULT             314        313  99.68%        297  94.89%          0   0.00%
>> 
>> Class/Queue State
>> 
>>             [<CLASS> <AVAIL>:<UP>]...
>> 
>>     DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu 3756:3756]
>> 
>> 
>> 
>> #diagnose -j 220559
>> Name                  State Par Proc QOS     WCLimit R  Min     User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs       Class Features
>> 
>> 220559                 Idle ALL 1501 ded  4:00:00:00 0 1501   ortega   simons        -  1:23:34:41   [NONE] [NONE] [NONE]    >=0    >=0    NC0  [default:1] [default]
>> 
>> 
>> 
>> 
> 
