[torqueusers] [Mauiusers] priority job failing to get reservation
Naveed Near-Ansari
naveed at caltech.edu
Fri Apr 20 18:59:56 MDT 2012
Thanks.
The number of idle procs actually fluctuates:
job cannot run in partition DEFAULT (insufficient idle procs available: 744 < 1501)
I don't think it is mapping to procs, since there are only 628 procs on the system (314 nodes * 2 procs).
The QOS does request dedicated nodes, but I have seen no issue with this on any other job. When someone requests 12 tasks, they get one 12-core machine.
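As a quick sanity check of those numbers, here is a rough Python sketch. It assumes every node has the same core count, taken as 3768 configured procs / 314 nodes from the diagnose -t output quoted below, and it assumes a dedicated-node request is rounded up to whole nodes. The variable names are made up for illustration.

```python
import math

nodes = 314
sockets_per_node = 2                  # "314 nodes * 2 procs"
configured_procs = 3768               # PROC "Configured" column in diagnose -t

# Assumption: uniform nodes, so cores per node is a simple division.
cores_per_node = configured_procs // nodes        # 12

physical_procs = nodes * sockets_per_node         # 628 sockets in total
requested_tasks = 1501
idle_reported = [744, 1056]           # the two "idle procs" values seen so far

# Both idle counts exceed the socket count, so they cannot be socket counts.
exceeds_sockets = [n > physical_procs for n in idle_reported]

# Assumption: with dedicated nodes, the request is rounded up to whole nodes.
nodes_needed = math.ceil(requested_tasks / cores_per_node)   # 126
idle_nodes = [n // cores_per_node for n in idle_reported]    # [62, 88]

print(cores_per_node, physical_procs, nodes_needed, idle_nodes, exceeds_sockets)
```

Both reported idle counts are exact multiples of 12, which fits the idea that the scheduler is counting cores rather than sockets. Under these assumptions the job would need about 126 whole nodes, against 62 or 88 idle ones.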
I think I may be misunderstanding how priority reservations work. Does it try to find nodes to reserve within a fixed time frame, and give up if not enough procs will be available within that time frame, or is it supposed to look ahead indefinitely until it finds the procs? We have a lot of long-running processes, so if it is only looking within a time frame (say a month), it may not be able to find the resources. If this is the case, is it possible to change how far ahead it looks?
I couldn't find anything in the documentation that describes specifically how it finds resources for priority-based reservations.
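To make the two possibilities in the question concrete, here is a toy Python model. It is not Maui's actual algorithm, which is exactly what is unknown here. The function name `earliest_start`, the `horizon` parameter, and the release times in the example are all hypothetical.

```python
def earliest_start(free_now, releases, needed, horizon=None):
    """Return the earliest time (hours from now) at which `needed` procs are free.

    releases: iterable of (hours_from_now, procs_freed) for running jobs.
    horizon:  optional look-ahead limit in hours; None means no limit.
    Returns None if the request cannot be met within the horizon.
    """
    free = free_now
    if free >= needed:
        return 0.0
    # Walk through job completions in time order, accumulating freed procs.
    for when, procs in sorted(releases):
        if horizon is not None and when > horizon:
            return None          # limited look-ahead: give up past the horizon
        free += procs
        if free >= needed:
            return float(when)
    return None                  # never enough procs, even with no limit


# Made-up example: 744 idle procs now, two long-running jobs ending later.
releases = [(200, 240), (900, 600)]
print(earliest_start(744, releases, 1501, horizon=720))  # limited to 30 days
print(earliest_start(744, releases, 1501))               # unlimited look-ahead
```

With these made-up numbers, the limited search finds no start time inside 30 days, while the unlimited one finds a slot once the second job ends. That is the difference in behavior the question is asking about.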
Naveed Near-Ansari
On Apr 20, 2012, at 5:50 PM, Lyn Gerner wrote:
> So does the checkjob for 220559 still show the "insufficient idle
> procs available: 1056 < 1501" msg?
>
> Seems like somehow the TASKS request is not mapping to cores (of which
> I surmise you have 3576) but rather procs (which in the above you have
> 1056).
>
> I am really grasping at straws on this: is the "ded" QOS requesting
> dedicated nodes, and you don't have enough?
>
> Not sure where else to tell you to look.
>
> Best of luck,
> Lyn
>
> On 4/20/12, Naveed Near-Ansari <naveed at caltech.edu> wrote:
>>
>>
>> On 04/20/2012 04:23 PM, Lyn Gerner wrote:
>>> Naveed,
>>>
>>> It looks like your setup is only showing 1056 procs, not 3552:
>>>> PE: 1501.00 StartPriority: 144235
>>>> job cannot run in partition DEFAULT (insufficient idle procs available:
>>>> 1056 < 1501)
>>> You might play w/diagnose -t (partition) and diagnose -j (job) to see
>>> what they tell you. Also, you could try to explicitly make a
>>> reservation for the job, and maybe then you could get info from
>>> diagnose -r (though attempting the setres may give enough error info).
>>>
>>> Good luck,
>>> Lyn
>>
>> Thanks for looking.
>>
>> I think it is configured for 3768 (I said 3552 because the queue it was
>> sent to has that many available to it). I didn't see anything clear in
>> the output of either diagnose command. I attempted to create a
>> reservation, but it failed.
>>
>> # setres -u ortega -d 4:00:00:00 TASKS==1501
>> ERROR: 'setres' failed
>> ERROR: cannot select 1501 tasks for reservation for 3:13:33:56
>> ERROR: cannot select requested tasks for 'TASKS==1501'
>>
>>
>>
>> # diagnose -t
>> Displaying Partition Status
>>
>> System Partition Settings: PList: DEFAULT PDef: DEFAULT
>>
>> Name Procs
>>
>> DEFAULT 3768
>>
>> Partition  Configured        Up     U/C  Dedicated     D/U    Active     A/U
>>
>> NODE----------------------------------------------------------------------------
>> DEFAULT           314       313  99.68%        297  94.89%       297  94.89%
>> PROC----------------------------------------------------------------------------
>> DEFAULT          3768      3756  99.68%       3564  94.89%      3000  79.87%
>> MEM----------------------------------------------------------------------------
>> DEFAULT      15156264  15107978  99.68%   14335282  94.89%         0   0.00%
>> SWAP----------------------------------------------------------------------------
>> DEFAULT      30227950  30131665  99.68%   28590985  94.89%   1400704   4.65%
>> DISK----------------------------------------------------------------------------
>> DEFAULT           314       313  99.68%        297  94.89%         0   0.00%
>>
>> Class/Queue State
>>
>> [<CLASS> <AVAIL>:<UP>]...
>>
>> DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu 3756:3756]
>>
>>
>>
>> # diagnose -j 220559
>> Name State Par Proc QOS WCLimit R Min User
>> Group Account QueuedTime Network Opsys Arch Mem Disk
>> Procs Class Features
>>
>> 220559 Idle ALL 1501 ded 4:00:00:00 0 1501 ortega
>> simons - 1:23:34:41 [NONE] [NONE] [NONE] >=0 >=0 NC0
>> [default:1] [default]
>>
>>
>>
>>
>