[torqueusers] cgroup memory allocation problem
Brock Palen
brockp at umich.edu
Thu Aug 9 16:18:12 MDT 2012
I filed this with Adaptive, but others should be aware of a major problem for high-memory jobs on pbs_moms using cgroups:
cgroups in TORQUE 4 assign memory banks on NUMA systems based on core layout only.
Example:
8-core, 48GB memory, two-socket machine: valid cpus 0-7, valid mems 0-1
If a job is only on the first socket, it is assigned mems 0; if it is only on the second, mems 1; if it is assigned cores on both, it gets both.
The above is fine.
Now suppose I request 1 core and more memory than one bank holds (the node has two 24GB memory banks):
qsub -l procs=1,mem=47gb
The mems is set to 0 and the cpus to 0. When my job hits 24GB (the size of mems 0), it starts to swap rather than getting all the memory it was assigned.
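
To make the single-job case concrete, here is a rough Python sketch of the behaviour described above. It is only a model, not TORQUE code. The core numbering (cores 0-3 on socket 0, cores 4-7 on socket 1) is an assumption, and the function names are made up for illustration.

```python
# Hypothetical model of the described behaviour; this is not TORQUE code.
# Assumption: cores 0-3 sit on socket 0 and cores 4-7 on socket 1.
CORE_TO_BANK = {core: core // 4 for core in range(8)}
BANK_SIZE_GB = 24  # two banks of 24GB each, 48GB in total


def mems_for_cores(cores):
    """Return the memory banks implied by core placement alone."""
    return sorted({CORE_TO_BANK[c] for c in cores})


def reachable_gb(cores):
    """Memory a job can touch before it is pushed to swap."""
    return BANK_SIZE_GB * len(mems_for_cores(cores))


requested_gb = 47
job_cores = [0]  # procs=1 placed on cpu 0

print("mems:", mems_for_cores(job_cores))
print("reachable GB:", reachable_gb(job_cores))
print("swaps before reaching request:", reachable_gb(job_cores) < requested_gb)
```

The requested memory never enters the mems calculation, which is the whole problem: a 1-core job is confined to one 24GB bank no matter how much it asked for.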
A similar case:
procs=1,mem=20gb
procs=1,mem=20gb
procs=1,mem=20gb
On an empty node, if they all land on the same node, they get assigned cpus 0, 1, and 2 but all get mems 0, and the jobs swap.
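
The arithmetic for the three-job case can be sketched the same way. Again this is a hypothetical model, assuming cores 0-3 map to mems 0 and cores 4-7 map to mems 1. The job list and the helper name are illustrative only.

```python
# Hypothetical accounting of the three-job case; this is not TORQUE code.
# Assumption: cores 0-3 are on socket 0 (mems 0), cores 4-7 on socket 1 (mems 1).
BANK_SIZE_GB = 24

jobs = [
    {"core": 0, "mem_gb": 20},
    {"core": 1, "mem_gb": 20},
    {"core": 2, "mem_gb": 20},
]


def demand_per_bank(jobs, cores_per_socket=4):
    """Sum the requested memory of all jobs confined to each bank."""
    demand = {}
    for job in jobs:
        bank = job["core"] // cores_per_socket
        demand[bank] = demand.get(bank, 0) + job["mem_gb"]
    return demand


demand = demand_per_bank(jobs)
for bank, gb in sorted(demand.items()):
    print(f"mems {bank}: {gb}GB requested, {BANK_SIZE_GB}GB available")
```

All 60GB of requests land on mems 0 while mems 1 sits idle, even though the node as a whole has enough memory for all three jobs.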
Is there a way to just assign all NUMA nodes to every job and rely on CPU binding alone? Currently we are most interested in CPU binding.
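
As a sketch of the behaviour being asked for: bind cpus to the job's cores but leave every memory bank allowed. This is not an existing TORQUE option as far as this post knows. The function names are hypothetical, and the list of valid mems is taken from the example node above.

```python
# Hypothetical sketch of the requested policy; not an existing TORQUE setting.
ALL_MEMS = [0, 1]  # valid mems on the example node


def to_list_format(ids):
    """Format ids as a comma-separated list (cpuset files also accept ranges such as 0-1)."""
    return ",".join(str(i) for i in sorted(ids))


def cpuset_for_job(cores, all_mems=ALL_MEMS):
    """Restrict cpus to the job's cores but allow every memory bank."""
    return {"cpus": to_list_format(cores), "mems": to_list_format(all_mems)}


print(cpuset_for_job([0]))
```

With both banks allowed, the kernel's default policy still prefers the local bank and should only fall back to the remote one when the local bank is full, so the cost is slower remote access rather than swapping.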
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp at umich.edu
(734)936-1985