[torquedev] Re: [torqueusers] Torque notes from SC'08

Michel Béland michel.beland at rqchp.qc.ca
Tue Dec 2 15:37:30 MST 2008


Chris Samuel wrote:

> a) CPUSET NUMA Support
> 
> First of all virtually everyone seems to want better NUMA support,
> i.e. adding in memory locality to the current cpusets support.
> 
> To illustrate how complicated these systems are internally,
> and why NUMA locality is important, I've attached a PDF file
> courtesy of Dave Singleton at the NCI in Canberra illustrating
> the architecture of a small and large Altix system.

It is funny how the topology differs from one Altix to the next... In the
figure you attached, one can see that a level-0 router (directly connected
to the nodes) is also connected to three other level-0 routers. On our
768-processor Altix 4700, it is connected to two level-0 routers (with a
double link to one of them). On a 512-processor Altix 3000, it is directly
connected to only one of them (again with a double link) and has to go
through level-1 routers to reach the other level-0 routers.

> Dave has his own PBS fork (ANUPBS) which includes Altix
> NUMA support, and he's been kindly talking through issues
> with me via private email whilst I was at SC.
> 
> Currently it'd be easy for the pbs_mom to add certain mems to
> a job, rather than all of them as it does now, if the scheduler
> tells it what to do, but for that to happen there are a few
> issues that need to be dealt with!
> 
> 
> 1) Determining the layout of the system and reporting it
> to the pbs_server.
> 
> Investigations of various systems show that the
> /sys/devices/system/node directory will (if present) include
> information on what NUMA nodes there are, which CPUs are on
> them and information about how much RAM is present and how
> much is used.

Here is what I get on our Altix 4700:

$cd /sys/devices/system/node/node0
$cat meminfo

Node 0 MemTotal:      8064400 kB
Node 0 MemFree:       6727968 kB
Node 0 MemUsed:       1336432 kB
Node 0 Active:         235824 kB
Node 0 Inactive:       194544 kB
Node 0 HighTotal:           0 kB
Node 0 HighFree:            0 kB
Node 0 LowTotal:      8064400 kB
Node 0 LowFree:       6727968 kB
Node 0 Dirty:               0 kB
Node 0 Writeback:    224795008 kB
Node 0 Mapped:          73024 kB
Node 0 AnonPages:     1653088 kB
Node 0 Slab:         11484688 kB
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
$
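
For what it is worth, pulling those numbers out of sysfs is straightforward.
Here is a rough Python sketch (the helper names are mine, not anything that
exists in pbs_mom) that reads the per-node meminfo files:

    import os
    import re

    SYSFS_NODES = "/sys/devices/system/node"   # standard sysfs location on NUMA kernels

    def numa_nodes():
        """List the NUMA node numbers present on this machine."""
        return sorted(int(d[4:]) for d in os.listdir(SYSFS_NODES)
                      if re.match(r"node\d+$", d))

    def node_meminfo(node):
        """Parse one node's meminfo into a dict, e.g. {'MemTotal': 8064400, ...} (kB)."""
        info = {}
        with open(os.path.join(SYSFS_NODES, "node%d" % node, "meminfo")) as f:
            for line in f:
                # Lines look like:  "Node 0 MemTotal:      8064400 kB"
                m = re.match(r"Node\s+\d+\s+(\w+):\s+(\d+)", line)
                if m:
                    info[m.group(1)] = int(m.group(2))
        return info

    for n in numa_nodes():
        mi = node_meminfo(n)
        print("node%d: %d kB total, %d kB free" % (n, mi["MemTotal"], mi["MemFree"]))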

> Newer kernels also have a distance file which contains the
> memory latency from that node to other NUMA nodes normalised
> against local memory (valued at 10).  So, for instance, on
> Dave's Altix in Canberra there's a large (2.5x) penalty for
> going off your NUMA node (from 10 to 25) but then to reach
> the furthest node is only a 3.7x penalty (from 10 to 37). 
> 
> Whilst this naively makes an NxN matrix (i.e. 1024 numbers
> for a small 64P Altix) we can simplify this as many numbers
> are repeated - here's a real life example of the info for
> NUMA node 0 on one:
> 
> 10 25 25 25 29 29 29 29 29 29 29 29 29 29 29 29 33 33 33 33 37 37 37 37 37 37 37
>  37 37 37 37 37

Again, here is what I get on our Altix 4700 (one 256-processor
partition, in fact) for node 0:

$cat distance
10 22 22 22 26 26 26 26 26 26 26 26 30 30 30 30 30 30 30 30 34 34 34 34
34 34 34 34 38 38 38 38 30 30 30 30 34 34 34 34 34 34 34 34 38 38 38 38
30 30 30 30 34 34 34 34 34 34 34 34 38 38 38 38
$

What is peculiar is that the distance does not always increase along the
row, because of the funny level-0 router topology that I described above.
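
Reading the distance files into a full matrix is just as easy. A rough
sketch (again, the helper name is mine), which also flags rows that are not
monotonically non-decreasing, as on our machine:

    import os
    import re

    SYSFS_NODES = "/sys/devices/system/node"

    def distance_matrix():
        """Read every node's distance file into a dict: node -> list of distances."""
        nodes = sorted(int(d[4:]) for d in os.listdir(SYSFS_NODES)
                       if re.match(r"node\d+$", d))
        matrix = {}
        for n in nodes:
            with open(os.path.join(SYSFS_NODES, "node%d" % n, "distance")) as f:
                # One line of whitespace-separated integers, local node valued at 10.
                matrix[n] = [int(x) for x in f.read().split()]
        return matrix

    for n, row in sorted(distance_matrix().items()):
        increasing = all(a <= b for a, b in zip(row, row[1:]))
        print("node%d: %s%s" % (n, " ".join(map(str, row)),
                                "" if increasing else "   (not monotonic)"))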

> So perhaps using a simple run length encoding scheme we can
> collapse that down to:
> 
> 10 25*3 29*12 33*4 37*12

Considering how it looks on my system, I think it is better to list these
latencies out in full. But why send this to the scheduler at all? It is
much easier to just let pbs_mom pick the nodes to be used by the cpuset,
according to the memory and cpu requirements of the job. PBS Pro does use
the scheduler for that, but it does so by virtualizing the Altix nodes so
that they are treated as machines in their own right. I am not sure that
Torque is ready for that yet.
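
That said, if the run-length encoding Chris suggests were adopted anyway,
both directions are only a few lines each. A sketch, where the
"value*count" notation is simply my reading of his example:

    def rle_encode(values):
        """Collapse runs of equal values into 'value*count' tokens (count omitted if 1)."""
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return " ".join("%d*%d" % (v, c) if c > 1 else str(v) for v, c in runs)

    def rle_decode(text):
        """Inverse of rle_encode: expand 'value*count' tokens back to a flat list."""
        values = []
        for tok in text.split():
            if "*" in tok:
                v, c = tok.split("*")
                values.extend([int(v)] * int(c))
            else:
                values.append(int(tok))
        return values

    # Chris's example row collapses to "10 25*3 29*12 33*4 37*12" and expands back.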

> Much more friendly to pbs_server, though not as artistic as
> Garrick's idea of drawing the layout of the system in ASCII
> art and letting the scheduler folks work it out from that. :-)

Here are the concatenated distance files for a 16-processor Altix 350,
which does not have any router:

10 21 21 27 27 33 33 39
21 10 51 21 45 27 39 33
21 51 10 45 21 39 27 33
27 21 45 10 39 21 33 27
27 45 21 39 10 33 21 27
33 27 39 21 33 10 27 21
33 39 27 33 21 27 10 21

You can see that the first line alone is not enough to describe this
machine... I do wonder whether something is broken in there, though. Here
is the topology of the machine:

4 - 6 - 7
|       |
2       5
|       |
0 - 1 - 3

You can see that the latency from node 0 to node 1 or node 2 is 21, but it
is 51 between nodes 1 and 2, as if the traffic went through six links
instead of two (going almost all the way around the circle, or should I say
square).
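
To check my own reading of that matrix, here is a small sketch; the edge
list is just my transcription of the diagram above, and the hop counts come
from a breadth-first search. With 10 for local, 21 for one hop and roughly
6 more per extra hop, the 51 between nodes 1 and 2 matches six hops rather
than the two that the topology allows:

    from collections import deque

    # Edges transcribed from the Altix 350 diagram above (a ring of 8 nodes).
    EDGES = [(0, 1), (1, 3), (3, 5), (5, 7), (7, 6), (6, 4), (4, 2), (2, 0)]

    def hops(src, nnodes=8):
        """Hop count from src to every node, by breadth-first search."""
        adj = {n: [] for n in range(nnodes)}
        for a, b in EDGES:
            adj[a].append(b)
            adj[b].append(a)
        dist = {src: 0}
        queue = deque([src])
        while queue:
            n = queue.popleft()
            for m in adj[n]:
                if m not in dist:
                    dist[m] = dist[n] + 1
                    queue.append(m)
        return [dist[n] for n in range(nnodes)]

    print(hops(0))   # [0, 1, 1, 2, 2, 3, 3, 4] -- matches "10 21 21 27 27 33 33 39"
    print(hops(1))   # [1, 0, 2, 1, 3, 2, 4, 3] -- node 2 is two hops away,
                     # yet the distance file says 51, which corresponds to six hops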

> The issue I have is that I don't have any examples from larger
> NUMA systems, so if people can send me an example I'd much
> appreciate it!  I just need to know the number of cores and
> sockets in the node and the contents of:
> 
> /sys/devices/system/node/node0/distance

Shown above for our Altix 4700. Each node has two sockets and four cores
(we have dual-core Montecito processors).

> If the system isn't running a kernel new enough to have that
> information we might just have to assume that all sockets are
> equi-distant.

I am not sure that all this information is really needed, at least as a
first step. It might be nice for knowing the best nodes to pick for a job,
but the really important thing to me is that jobs should be given whole
nodes in their cpuset, enough to cover their cpu and memory requirements.
The scheduler then needs to know about this so that it does not try to give
parts of those nodes to other jobs. The original memory and cpu
requirements should always be kept in case the job needs to be restarted.
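
To make that concrete, here is a minimal sketch of the allocation I have in
mind; the data structures and the helper name are invented for the example,
not anything that exists in pbs_mom today:

    def pick_whole_nodes(free_nodes, ncpus_req, mem_req_kb):
        """Greedily claim whole NUMA nodes until both the cpu and memory
        requests are covered.  free_nodes maps node id -> (ncpus, memfree_kb)
        for nodes not already assigned to another job's cpuset."""
        chosen = []
        cpus = mem = 0
        for node_id in sorted(free_nodes):
            if cpus >= ncpus_req and mem >= mem_req_kb:
                break
            ncpus, memfree = free_nodes[node_id]
            chosen.append(node_id)
            cpus += ncpus
            mem += memfree
        if cpus < ncpus_req or mem < mem_req_kb:
            raise RuntimeError("not enough free whole nodes for this job")
        return chosen   # these node ids become the cpuset's mems (and their cpus)

    # Example: four free nodes with 4 cpus / 8 GB each; a job wanting 6 cpus
    # and 12 GB gets nodes 0 and 1 in their entirety.
    free = {0: (4, 8064400), 1: (4, 8064400), 2: (4, 8064400), 3: (4, 8064400)}
    print(pick_whole_nodes(free, 6, 12000000))   # [0, 1]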

> 3) The scheduler needs to be able to tell Torque to allocate NUMA
> nodes as well as cores when it runs a job.
> 
> Dave Singleton has a modified exec_host string that he uses to
> relay that information to his pbs_mom's.  He tells me that the
> their format is:
> 
> #    "host/cpus=<id_range_spec>/mems=<id_range_spec>+...."
> #
> # We use the convention that the form "host/cpuid" is shorthand
> # for "host/cpus=cpuid/mems=allmems" so we haven't actually broken
> # the original format - just extended it.
> 
> To illustrate that here's an example from a running job there,
> as shown by qstat -f $JOBID :
> 
>     exec_host = ac10/cpus=8-23,40-55/mems=4-11,
>         20-27+ac11/cpus=40-63/mems=20-31+ac12/cpus=8-23/mems=4-11+ac13/cpus=8-
>         23,56-63/mems=4-11,28-31+ac14/cpus=8-39,56-63/mems=4-19,
>         28-31+ac15/cpus=24-39,56-63/mems=12-19,28-31
> 
> Given that this is proven to work perhaps Torque should adopt
> it too ?
> 
> In any case this means that *everything* that processes exec_host
> for information will need to be updated to handle this!

Altair did something like this with PBS Pro in a previous version, but they
reverted to the original exec_host syntax. Instead, they added a new string
called exec_vnode that looks like this:

    exec_vnode = (udem-sgi02[39]:mem=8077296kb:ncpus=2+
        udem-sgi02[47]:mem=7897104kb:ncpus=2)

Note that udem-sgi02[39] and udem-sgi02[47] are virtual nodes of the
machine udem-sgi02, here.
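
Whichever syntax ends up being chosen, the parsing side looks manageable.
Here is a rough sketch (the function names are mine) of parsing Dave's
extended exec_host form, with the old host/cpuid shorthand still accepted:

    def parse_id_range_spec(spec):
        """Expand an id_range_spec like '8-23,40-55' into a list of integers."""
        ids = []
        for part in spec.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                ids.extend(range(int(lo), int(hi) + 1))
            else:
                ids.append(int(part))
        return ids

    def parse_exec_host(exec_host):
        """Parse 'host/cpus=<spec>/mems=<spec>+...' entries.  The old
        'host/cpuid' shorthand means that cpu with all mems (None here)."""
        out = []
        for chunk in exec_host.split("+"):
            fields = chunk.split("/")
            host, cpus, mems = fields[0], [], None
            for f in fields[1:]:
                if f.startswith("cpus="):
                    cpus = parse_id_range_spec(f[5:])
                elif f.startswith("mems="):
                    mems = parse_id_range_spec(f[5:])
                else:
                    cpus = [int(f)]        # old-style "host/cpuid"
            out.append((host, cpus, mems))
        return out

    print(parse_exec_host("ac12/cpus=8-23/mems=4-11"))
    print(parse_exec_host("ac12/3"))       # old shorthand: cpu 3, all mems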


-- 
Michel Béland, analyste en calcul scientifique
michel.beland at rqchp.qc.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
téléphone   : 514 343-6111 poste 3892     télécopieur : 514 343-2155
RQCHP (Réseau québécois de calcul de haute performance)  www.rqchp.qc.ca

