[torqueusers] cgroup memory allocation problem

Brock Palen brockp at umich.edu
Mon Aug 13 12:15:12 MDT 2012


On Aug 12, 2012, at 9:55 PM, Gareth.Williams at csiro.au wrote:

>> -----Original Message-----
>> From: Brock Palen [mailto:brockp at umich.edu]
>> Sent: Friday, 10 August 2012 8:18 AM
>> To: Torque Users Mailing List
>> Subject: [torqueusers] cgroup memory allocation problem
>> 
>> I filed this with adaptive but others should be aware of a major
>> problem for high memory use jobs on pbs_moms using cgroups:
>> 
>> cgroups in torque4 are assigning memory banks in numa systems based on
>> core layout only.
>> 
>> Example:
>> 
>> An 8-core, 48GB-memory, two-socket machine: valid cpus 0-7, valid mems 0-1.
>> 
>> If a job is only on the first socket it is assigned mems 0; if it
>> is on the second, mems 1; if a job is assigned cores on both it is
>> assigned both.
>> 
>> The above is fine.
>> 
>> Now if I request 1 core and more memory than one bank holds (the node
>> has two 24GB memory banks):
>> qsub procs=1,mem=47gb
>> 
>> the cpuset is set to mems 0 and cpus 0. When my job hits 24GB (the size
>> of mems 0) it starts to swap rather than getting all the memory it was
>> assigned.
>> 
>> A similar case:
>> procs=1,mem=20gb
>> procs=1,mem=20gb
>> procs=1,mem=20gb
>> 
>> On an empty node, if they all land on the same node they get assigned
>> cpus 0, 1, and 2, but all get mems 0 and the jobs swap.
>> 
>> Is there a way to just assign all NUMA memory nodes to jobs and only
>> use CPU binding?  Currently we are most interested in CPU binding.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
> 
> Hi Brock,
> 
> For reference, we've noticed something related on our UV system. To work around it we set the NUMA virtual node configuration so that each virtual node corresponds to a socket, and we ask/force users to request whole nodes except for low-processor-count, low-memory jobs.  The machine topology would be better reflected if we defined virtual nodes to correspond to socket pairs.
> 
>> Is there a way to just assign all NUMA memory nodes to jobs and only
>> use CPU binding?  Currently we are most interested in CPU binding.
> 
> You could use a submit filter to round requests up to full nodes or reject jobs...  Also you could use the prologue to alter the existing cpuset to include more mems.
> 
> Note we are running a Torque 3 version with cpusets rather than cgroups per se, if that matters.

Gareth,  

Looking at the mom source in cpuset.c, it looks like all Torque does is find the memory domains that overlap with the assigned cpus.  So there is no bookkeeping at all to report to the scheduler about the memory layout, what has been assigned, etc.
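
If you want to check what the mom actually assigned on a node, something like the following shows it (the paths assume the cpuset filesystem is mounted at /dev/cpuset and that per-job cpusets live under torque/<jobid>; on newer kernels the files may be named cpuset.cpus and cpuset.mems):

  cat /dev/cpuset/torque/<jobid>/cpus   # e.g. 0    - the core(s) the job was bound to
  cat /dev/cpuset/torque/<jobid>/mems   # e.g. 0    - only the memory node behind those cores
  cat /dev/cpuset/mems                  # e.g. 0-1  - every memory node on the machine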

So we did exactly what you said and created a prologue that copies the memory nodes of the entire system into the job's cpuset when it starts.  Very simple, works well, and it appears to have solved our problem for now.
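
Roughly, the prologue does something like this (a sketch rather than our exact script; it assumes the cpuset filesystem is mounted at /dev/cpuset, that pbs_mom has already created the per-job cpuset under /dev/cpuset/torque/<jobid> by the time the prologue runs, and that the file is named mems rather than cpuset.mems; adjust to match your moms):

  #!/bin/sh
  # Give the job's cpuset every memory node on the machine, leaving the
  # cpu binding that pbs_mom set up untouched.
  jobid=$1                          # prologue argument 1 is the job id
  jobset=/dev/cpuset/torque/$jobid
  allmems=$(cat /dev/cpuset/mems)   # all memory nodes, e.g. "0-1"
  if [ -d "$jobset" ]; then
      echo "$allmems" > "$jobset/mems"
  fi
  exit 0

The cpu binding still comes from whatever pbs_mom wrote into the cpuset's cpus file, so jobs stay pinned to their cores but can allocate memory from any bank.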


> 
> Gareth
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


