[torqueusers] queue routing based on mem resource not working properly...

Lech Nieroda lnieroda at gmail.com
Tue Aug 3 06:11:41 MDT 2010


On Wed, Jul 28, 2010 at 7:48 PM, Garrick Staples <garrick at usc.edu> wrote:
> On Wed, Jul 28, 2010 at 02:17:29PM +0200, Lech Nieroda alleged:
>> Dear list,
>>
>> we have a cluster with 3 groups of machines - some have 24GB, some
>> have 48GB, another group has 96GB, our maui version is 3.2.6p21.
>> The general idea is to keep the larger nodes free for jobs that
>> actually need that much RAM and thus route jobs with >48GB
>> automatically to the 96GB nodes, >24GB to the 48GB nodes and the rest
>> to the 24GB nodes.
>>
>> How this was implemented:
>> - each node has been given a "property" according to its available
>> memory, i.e. ram96gb, ram48gb, ram24gb
>> - there is a queue for each memory size, with appropriate "neednodes"
>> and "resources_min.mem" statements, i.e.
>>    set queue qram48gb resources_default.neednodes = ram48gb
>>    set queue qram48gb resources_min.mem = 24gb
>> - finally, there is a routing queue, which routes the jobs, i.e.
>>     set queue default queue_type = Route
>>     set queue default route_destinations = qram96gb
>>     set queue default route_destinations += qram48gb
>>     set queue default route_destinations += qram24gb
>>
>> However, this isn't working properly: jobs are routed according to
>> their total mem requirement, not the per-node value. For example, a
>> job with "-l nodes=2:ppn=4,mem=50gb" requires only 25gb per node, but
>> it is routed to qram96gb since 50gb>48gb. Supplying more resource
>> limits in the queue setup, like pmem and pvmem, doesn't change this
>> behavior - the jobs are still routed to the larger nodes even though
>> smaller ones would suffice.
>>
>> Any ideas, experiences with such routing?
>
> I think you want to order your queues the other way around using resources_max.mem.

Since the mem parameter isn't divided by the number of nodes when the
routing check occurs, and we have many more 24gb nodes than 48gb or
96gb ones, this probably wouldn't work: a job on 200 nodes that needs
20GB per node would request mem=4000gb. A resources_max.mem limit
cannot handle that case effectively.
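
To make the arithmetic concrete, here is a rough Python sketch of the
two ways the check could behave. It is only a model, not Torque code:
the function names are made up, and the 48gb minimum for qram96gb is an
assumption (only the 24gb minimum for the 48gb queue is in the config
quoted above).

```python
# Hypothetical model of the routing check, not actual Torque code.
# Destinations are tried in route_destinations order; each entry pairs a
# queue with its assumed resources_min.mem in GB (0 means no minimum).
ROUTE_DESTINATIONS = [
    ("qram96gb", 48),  # assumed minimum, not shown in the quoted config
    ("qram48gb", 24),  # from "resources_min.mem = 24gb"
    ("qram24gb", 0),
]


def route(mem_gb):
    """Return the first queue whose minimum memory the request meets."""
    for queue, min_mem_gb in ROUTE_DESTINATIONS:
        if mem_gb >= min_mem_gb:
            return queue
    raise ValueError("no destination accepts this job")


def route_by_total(nodes, mem_gb):
    """What we observe: the total mem request is compared as-is."""
    return route(mem_gb)


def route_by_per_node(nodes, mem_gb):
    """What we would like: mem divided by the node count first."""
    return route(mem_gb / nodes)


# -l nodes=2:ppn=4,mem=50gb
print(route_by_total(2, 50))       # qram96gb
print(route_by_per_node(2, 50))    # qram48gb

# 200 nodes with 20GB each, i.e. mem=4000gb
print(route_by_total(200, 4000))     # qram96gb
print(route_by_per_node(200, 4000))  # qram24gb
```

Under the total-mem check, every wide job lands in the largest queue no
matter how little it needs per node, and reordering the queues with
max.mem limits runs into the same total.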

> Are you using Maui? You could just use the MINRESOURCE node allocation policy.
> Or just order the nodes in your server_priv/nodes file the way you want them
> allocated.

Yes, we are using Maui (3.2.6p21), and have tried the MINRESOURCE
policy. But there seems to be a problem: I've noticed that even though
the "NODEALLOCATIONPOLICY MINRESOURCE" is set, jobs are sometimes
assigned to the larger nodes even though their resources could be
satisfied by a smaller node. Strange.

> I use the LASTAVAILABLE policy and order my largest nodes at the top of the
> list.
>
> And you get rid of the queues and let the scheduler find the nodes based on the
> mem resource.

That's an alternative, which we'll probably end up using. However,
it would be best for our users if the larger nodes were used only by
jobs that actually require the larger RAM, and were not swamped with
smaller jobs whenever the smaller nodes fill up. Basically, one could
just use the queues with "neednodes" restrictions and ask the users to
pick the right one, but an automatic routing solution would be
preferable...
Any ideas?


Regards,
Lech

