[torqueusers] numa problems

Wannes Van Causbroeck wannes.van.causbroeck at imdc.be
Mon Oct 3 00:29:47 MDT 2011


Thanks for the input, guys!
Running lstopo, I get the following output:

Machine (126GB)
  Socket L#0 (32GB)
    NUMANode L#0 (P#0 16GB) + L3 L#0 (5118KB)
      L2 L#0 (512KB) + L1 L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1 L#1 (64KB) + Core L#1 + PU L#1 (P#4)
      L2 L#2 (512KB) + L1 L#2 (64KB) + Core L#2 + PU L#2 (P#8)
      L2 L#3 (512KB) + L1 L#3 (64KB) + Core L#3 + PU L#3 (P#12)
      L2 L#4 (512KB) + L1 L#4 (64KB) + Core L#4 + PU L#4 (P#16)
      L2 L#5 (512KB) + L1 L#5 (64KB) + Core L#5 + PU L#5 (P#20)
    NUMANode L#1 (P#1 16GB) + L3 L#1 (5118KB)
      L2 L#6 (512KB) + L1 L#6 (64KB) + Core L#6 + PU L#6 (P#24)
      L2 L#7 (512KB) + L1 L#7 (64KB) + Core L#7 + PU L#7 (P#28)
      L2 L#8 (512KB) + L1 L#8 (64KB) + Core L#8 + PU L#8 (P#32)
      L2 L#9 (512KB) + L1 L#9 (64KB) + Core L#9 + PU L#9 (P#36)
      L2 L#10 (512KB) + L1 L#10 (64KB) + Core L#10 + PU L#10 (P#40)
      L2 L#11 (512KB) + L1 L#11 (64KB) + Core L#11 + PU L#11 (P#44)
  Socket L#1 (32GB)
    NUMANode L#2 (P#2 16GB) + L3 L#2 (5118KB)
      L2 L#12 (512KB) + L1 L#12 (64KB) + Core L#12 + PU L#12 (P#1)
      L2 L#13 (512KB) + L1 L#13 (64KB) + Core L#13 + PU L#13 (P#5)
      L2 L#14 (512KB) + L1 L#14 (64KB) + Core L#14 + PU L#14 (P#9)
      L2 L#15 (512KB) + L1 L#15 (64KB) + Core L#15 + PU L#15 (P#13)
      L2 L#16 (512KB) + L1 L#16 (64KB) + Core L#16 + PU L#16 (P#17)
      L2 L#17 (512KB) + L1 L#17 (64KB) + Core L#17 + PU L#17 (P#21)
    NUMANode L#3 (P#3 16GB) + L3 L#3 (5118KB)
      L2 L#18 (512KB) + L1 L#18 (64KB) + Core L#18 + PU L#18 (P#25)
      L2 L#19 (512KB) + L1 L#19 (64KB) + Core L#19 + PU L#19 (P#29)
      L2 L#20 (512KB) + L1 L#20 (64KB) + Core L#20 + PU L#20 (P#33)
      L2 L#21 (512KB) + L1 L#21 (64KB) + Core L#21 + PU L#21 (P#37)
      L2 L#22 (512KB) + L1 L#22 (64KB) + Core L#22 + PU L#22 (P#41)
      L2 L#23 (512KB) + L1 L#23 (64KB) + Core L#23 + PU L#23 (P#45)
  Socket L#2 (32GB)
    NUMANode L#4 (P#4 16GB) + L3 L#4 (5118KB)
      L2 L#24 (512KB) + L1 L#24 (64KB) + Core L#24 + PU L#24 (P#2)
      L2 L#25 (512KB) + L1 L#25 (64KB) + Core L#25 + PU L#25 (P#6)
      L2 L#26 (512KB) + L1 L#26 (64KB) + Core L#26 + PU L#26 (P#10)
      L2 L#27 (512KB) + L1 L#27 (64KB) + Core L#27 + PU L#27 (P#14)
      L2 L#28 (512KB) + L1 L#28 (64KB) + Core L#28 + PU L#28 (P#18)
      L2 L#29 (512KB) + L1 L#29 (64KB) + Core L#29 + PU L#29 (P#22)
    NUMANode L#5 (P#5 16GB) + L3 L#5 (5118KB)
      L2 L#30 (512KB) + L1 L#30 (64KB) + Core L#30 + PU L#30 (P#26)
      L2 L#31 (512KB) + L1 L#31 (64KB) + Core L#31 + PU L#31 (P#30)
      L2 L#32 (512KB) + L1 L#32 (64KB) + Core L#32 + PU L#32 (P#34)
      L2 L#33 (512KB) + L1 L#33 (64KB) + Core L#33 + PU L#33 (P#38)
      L2 L#34 (512KB) + L1 L#34 (64KB) + Core L#34 + PU L#34 (P#42)
      L2 L#35 (512KB) + L1 L#35 (64KB) + Core L#35 + PU L#35 (P#46)
  Socket L#3 (32GB)
    NUMANode L#6 (P#6 16GB) + L3 L#6 (5118KB)
      L2 L#36 (512KB) + L1 L#36 (64KB) + Core L#36 + PU L#36 (P#3)
      L2 L#37 (512KB) + L1 L#37 (64KB) + Core L#37 + PU L#37 (P#7)
      L2 L#38 (512KB) + L1 L#38 (64KB) + Core L#38 + PU L#38 (P#11)
      L2 L#39 (512KB) + L1 L#39 (64KB) + Core L#39 + PU L#39 (P#15)
      L2 L#40 (512KB) + L1 L#40 (64KB) + Core L#40 + PU L#40 (P#19)
      L2 L#41 (512KB) + L1 L#41 (64KB) + Core L#41 + PU L#41 (P#23)
    NUMANode L#7 (P#7 16GB) + L3 L#7 (5118KB)
      L2 L#42 (512KB) + L1 L#42 (64KB) + Core L#42 + PU L#42 (P#27)
      L2 L#43 (512KB) + L1 L#43 (64KB) + Core L#43 + PU L#43 (P#31)
      L2 L#44 (512KB) + L1 L#44 (64KB) + Core L#44 + PU L#44 (P#35)
      L2 L#45 (512KB) + L1 L#45 (64KB) + Core L#45 + PU L#45 (P#39)
      L2 L#46 (512KB) + L1 L#46 (64KB) + Core L#46 + PU L#46 (P#43)
      L2 L#47 (512KB) + L1 L#47 (64KB) + Core L#47 + PU L#47 (P#47)

I guess the non-sequential core numbering is correct?
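
For reference, the physical numbering lstopo prints (the P# values) can be turned into mom.layout lines mechanically, one line per NUMANode. Below is a minimal Python sketch, assuming the plain-text lstopo format shown above; layout_from_lstopo is a hypothetical helper written for this illustration and is not part of TORQUE or hwloc.

```python
import re


def layout_from_lstopo(text):
    """Build mom.layout-style lines from plain-text lstopo output.

    Assumes each NUMANode line is followed by the PU lines that belong to it,
    as in the output above. Physical (P#) indices are used throughout.
    """
    nodes = []  # list of (physical NUMA node id, [physical PU ids])
    for line in text.splitlines():
        node = re.search(r"NUMANode L#\d+ \(P#(\d+)", line)
        if node:
            nodes.append((int(node.group(1)), []))
            continue
        pu = re.search(r"PU L#\d+ \(P#(\d+)\)", line)
        if pu and nodes:
            nodes[-1][1].append(int(pu.group(1)))
    return [
        "cpus=%s    mem=%d" % (",".join(str(c) for c in cpus), mem)
        for mem, cpus in nodes
    ]
```

Saving the lstopo output to a file and passing its contents to layout_from_lstopo returns one "cpus=... mem=..." string per NUMA node, which can then be compared line by line with the mom.layout in use.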


-----Original Message-----
From: torqueusers-bounces at supercluster.org on behalf of David Beer
Sent: Fri 9/30/2011 17:14
To: Torque Users Mailing List
Subject: Re: [torqueusers] numa problems
 


----- Original Message -----
> Hello everyone!
> I sent this message before, but I don't know if it arrived correctly,
> so I'll try again. (Sorry if this is a dupe.)
> 
> 
> We're just starting out with TORQUE, but we've run into a problem. We
> have a 48-core AMD system (4 sockets with 12 cores each). The Linux
> system sees this as 8 NUMA nodes with 6 cores each.
> I've tried compiling TORQUE 3.0.2 with --enable-cpuset and
> --enable-numa-support. (I also tried without cpuset, but the result
> was the same; I even got an error telling me I had to mount
> /dev/cpuset, even without this switch.)

NUMA support uses cpusets for its implementation, so yes, you'll get the same result whether or not you use the --enable-cpuset switch. You will definitely need to mount the cpuset filesystem in order to get things working.
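
On Linux the cpuset filesystem is typically mounted with "mount -t cpuset none /dev/cpuset" after creating the directory. Whether it is already mounted can be checked from /proc/mounts. The following is a minimal Python sketch; cpuset_mount_points is a hypothetical helper, and treating a cgroup mount with the cpuset option as equivalent is an assumption that may not hold for every TORQUE version.

```python
def cpuset_mount_points(mounts_text):
    """Return mount points from /proc/mounts-style text that expose cpusets.

    Each line has the form: device mount_point fs_type options dump pass.
    A line counts if its type is "cpuset", or (assumption) if it is a cgroup
    mount carrying the "cpuset" option.
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type == "cpuset" or (
            fs_type == "cgroup" and "cpuset" in options.split(",")
        ):
            points.append(mount_point)
    return points
```

Calling it as cpuset_mount_points(open("/proc/mounts").read()) on the compute node gives an empty list when nothing is mounted, which would be consistent with the error about /dev/cpuset.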

> Anyway, our mom.layout looks like this:
> 
> cpus=0,4,8,12,16,20    mem=0
> cpus=24,28,32,36,40,44    mem=1
> cpus=1,5,9,13,17,21    mem=2
> cpus=25,29,33,37,31,45    mem=3
> cpus=2,6,10,14,18,22    mem=4
> cpus=26,30,34,38,42,46    mem=5
> cpus=3,7,11,15,19,23    mem=6
> cpus=27,31,35,39,43,47    mem=7
> 
> It's a bit strange, but this is how it's reported by Linux.
> When I start a job with these parameters:
> 
> #PBS -N JobMPI
> #PBS -l nodes=1:ppn=4
> #PBS -m abe
> 
> It starts 4 processes in a really weird way. Sometimes it uses cores
> 0,1,2,3, sometimes 2 processes run on one core, then it jumps to
> core 24, etc.
> The system takes a big performance hit when the processes aren't run
> on cores sharing the same memory, so we want to lock the tasks to the
> same node.
> 
> What am I doing wrong?

I second Chris's suggestion - please send in the output of lstopo and we'll see what to do from there. I do wonder about your ordering - I'm not sure that TORQUE 3.0.* is well-equipped to handle a system with that kind of layout, but send in your lstopo output and we'll help you as much as we can. 
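
A layout file like the one quoted above can also be sanity-checked on its own, by confirming that every CPU id appears exactly once. Here is a minimal Python sketch; check_layout is a hypothetical helper, and the total CPU count is supplied by the caller rather than detected.

```python
import re
from collections import Counter


def check_layout(layout_text, total_cpus):
    """Return (duplicated, missing) CPU ids for mom.layout-style text.

    Only the "cpus=" field of each line is read, so quoted lines that start
    with "> " are handled as well.
    """
    seen = Counter()
    for line in layout_text.splitlines():
        match = re.search(r"cpus=([\d,]+)", line)
        if match:
            seen.update(int(c) for c in match.group(1).split(",") if c)
    duplicated = sorted(c for c, n in seen.items() if n > 1)
    missing = sorted(set(range(total_cpus)) - set(seen))
    return duplicated, missing
```

Run against the eight layout lines quoted above with total_cpus=48, it reports CPU 31 as listed twice (in the mem=3 and mem=7 lines) and CPU 41 as missing. The lstopo output at the top of this message places P#41 in NUMANode P#3, so the mem=3 line may simply contain a typo, either in the file itself or in the copy pasted into the mail.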

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1656 S. East Bay Blvd. Suite #300
     Provo, UT 84606

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
