[torqueusers] numa problems

Wannes Van Causbroeck wannes.van.causbroeck at imdc.be
Thu Sep 29 06:39:11 MDT 2011


Hello everyone!
I sent this message before, but I don't know if it arrived correctly, so I'll try again (sorry if this is a duplicate).


We're just starting out with Torque, but we've run into a problem. We
have a 48-core AMD system (4 sockets with 12 cores each). Linux sees
this as 8 NUMA nodes with 6 cores each.
I've tried compiling Torque 3.02 with --enable-cpuset and
--enable-numa-support. (I also tried without cpuset, but the result was
the same; I even got an error telling me I had to mount /dev/cpuset,
even without this switch.)
Anyway, our mom.layout looks like this:

cpus=0,4,8,12,16,20    mem=0
cpus=24,28,32,36,40,44    mem=1
cpus=1,5,9,13,17,21    mem=2
cpus=25,29,33,37,31,45    mem=3
cpus=2,6,10,14,18,22    mem=4
cpus=26,30,34,38,42,46    mem=5
cpus=3,7,11,15,19,23    mem=6
cpus=27,31,35,39,43,47    mem=7

It's a bit strange, but this is how it's reported by Linux.
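
A layout like this is easy to mistype, so it may be worth checking that every CPU appears exactly once. The following is only a minimal sketch, assuming the comma-separated `cpus=... mem=...` line format shown above (no ranges such as `0-5`); `parse_layout` and `check_layout` are hypothetical helper names, not part of Torque.

```python
from collections import Counter

# The mom.layout exactly as posted above.
LAYOUT = """\
cpus=0,4,8,12,16,20    mem=0
cpus=24,28,32,36,40,44    mem=1
cpus=1,5,9,13,17,21    mem=2
cpus=25,29,33,37,31,45    mem=3
cpus=2,6,10,14,18,22    mem=4
cpus=26,30,34,38,42,46    mem=5
cpus=3,7,11,15,19,23    mem=6
cpus=27,31,35,39,43,47    mem=7
"""


def parse_layout(text):
    """Return a list of (cpu_list, mem_node) tuples, one per non-empty line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is a set of whitespace-separated key=value fields.
        fields = dict(field.split("=", 1) for field in line.split())
        cpus = [int(cpu) for cpu in fields["cpus"].split(",")]
        entries.append((cpus, int(fields["mem"])))
    return entries


def check_layout(entries, total_cpus):
    """Return (CPUs listed more than once, CPUs never listed)."""
    counts = Counter(cpu for cpus, _ in entries for cpu in cpus)
    duplicates = sorted(cpu for cpu, n in counts.items() if n > 1)
    missing = sorted(set(range(total_cpus)) - set(counts))
    return duplicates, missing


duplicates, missing = check_layout(parse_layout(LAYOUT), total_cpus=48)
print("listed more than once:", duplicates)
print("never listed:", missing)
```

Run against the layout as posted, this reports CPU 31 as listed twice and CPU 41 as never listed, so the `mem=3` line probably has 31 where 41 was intended.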
When I start a job with these parameters:

#PBS -N JobMPI
#PBS -l nodes=1:ppn=4
#PBS -m abe

It starts 4 processes in a really weird way. Sometimes it uses cores
0, 1, 2 and 3; sometimes 2 processes run on one core, then it jumps to
core 24, etc.
The system takes a big performance hit when the processes don't run on
cores sharing the same memory, so we want to lock the tasks to the
same NUMA node.

What am I doing wrong?


Greetings,
Wannes

