[torquedev] 3.0-alpha branch added to TORQUE subversion tree
David.Singleton at anu.edu.au
Mon Apr 26 16:51:34 MDT 2010
On 04/27/2010 03:28 AM, Ken Nielson wrote:
> The last part of my last response was not as clear as I wanted.
> We definitely want to get user response about the Multi-MOM, but what I
> really would like to get input for is how people are using their NUMA
> systems. How do they lock down nodes and memory etc.
> Ken Nielson
Isn't a MOM per node-board or any other subset of an SMP a restriction
on shared memory job sizes? Why does it help in the NUMA case to have
a MOM per node board? Do these MOMs segregate NUMA node memory as
well?

Just to give an example of an alternative approach, we developed a very
NUMA-aware PBS and MPI system for an Altix cluster that we ran for about
4 years. Some of that functionality is still being used on our current
Nehalem cluster, some we need to recover, some is irrelevant.
- we ran/run a very broad range of apps (including:
- 48cpu OpenMP jobs
- many 100 cpu MPI jobs
  - some hybrid jobs
- single cpu 36GB jobs
- lots more)
- the Altix cluster had 30 64-way nodes but that was an arbitrary
partitioning and changed during the lifetime of the cluster
- we ran large MPI jobs within and across partitions as needed.
- the scheduler has a concept of arbitrarily sized "virtual nodes"
  ("vnodes") which are a subset of a real node/host, i.e. a host
  contains one or more vnodes. The expectation is that vnodes
  have some NUMA basis (a data-model sketch follows at the end of
  this message).
- MOMs do not know about virtual nodes although they do report the
NUMA structure of their host to the scheduler
- the scheduler is aware of which resources are local to a vnode and
  which are shared between vnodes, and allocates them accordingly.
- larger jobs are allocated multiple whole vnodes.
- on the Altix BX3700, 8P "C-bricks" were a natural topological unit -
they were 4 NUMA nodes sharing a NUMALink router (cpus/mems outside
a C-brick were an extra hop away). These were used for vnodes.
- The actual vnode size is at the admin's discretion and is set based
  on host NUMA structure and site job requirements. Typically they
  would be 4-, 6-, 8-, 12- or even 16-way units.
- with our 64-way nodes, a 64-cpu MPI job was allocated 8 vnodes which
may have been on 1, 2 or more 64-way nodes (partitions)
- users *must* request memory
- exechosts include both a cpus list and a mems (NUMA node) list
- all jobs are placed in cpusets derived from these lists, one cpuset
  per job (see the cpuset sketch below)
- if the user's memory request respects the NUMA node size, the job
  is allocated only mems local to its cpus. If the job is a larger
  memory job, its mems will extend "wider" than its cpus as needed.
- the mems of these "non-NUMA (large mem) jobs" are constrained within
  one or more vnodes so as not to interfere with large parallel jobs,
  e.g. the mems of running jobs only overlap if the jobs are using
  less than a vnode's worth of cpus.
- over and above the use of vnodes, the scheduler is "NUMA aware" in
  that it aligns jobs with NUMA chunks of the system, e.g. with our
  current dual quad-core Nehalem nodes, if cpu 0 is running a single-cpu
  job, an incoming 4-cpu job will be allocated cpus 4-7 (the other
  NUMA node). (A vnode = node/host = 2 NUMA nodes in this cluster;
  see the allocation sketch below.)
[- Yes, we do limit odd layouts that users might think they want to
request. Users are always free to request sufficient cpus/vnodes and
then use mpirun arguments to use a subset of these cpus to satisfy
whatever layout they think they want. They are charged for the whole
job ;-) The whole point of our scheme is to respect NUMA and isolate
jobs from one another in the NUMA sense (as much as possible). That
automatically limits layouts. ]
- a large (in cpu count) shared memory job is treated the same as a
  large MPI job except that its allocated vnodes must come from one
  host
- we wrote our own mpirun for SGI MPT so that each MPI task could
  optionally be placed in a subcpuset of the job cpuset. In simple
  cases, these would be single-cpu/single-mem cpusets but for hybrid
  jobs, they could be larger. Note that these subcpusets are *not*
  created by the MOM - the MOM cannot know what is required (see the
  per-rank sketch below).
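
To make the vnode idea above concrete, here is a minimal data-model
sketch. It is hypothetical (the names Host, Vnode and allocate_vnodes
are mine for illustration, not from our scheduler), but it captures
the rules above: vnodes partition a host, larger jobs take whole
vnodes, and shared memory jobs take all their vnodes from one host.

# Hypothetical sketch of the vnode data model described above.
from dataclasses import dataclass, field

@dataclass
class Vnode:
    cpus: list           # cpu ids local to this vnode
    mems: list           # NUMA node ids local to this vnode
    free: bool = True

@dataclass
class Host:
    name: str
    vnodes: list = field(default_factory=list)

def allocate_vnodes(hosts, nvnodes, single_host=False):
    """Allocate whole free vnodes. Shared memory jobs pass
    single_host=True so all their vnodes come from one host."""
    # Candidate pools: per-host for shared memory jobs, global for MPI.
    pools = [h.vnodes for h in hosts] if single_host \
            else [[v for h in hosts for v in h.vnodes]]
    for pool in pools:
        free = [v for v in pool if v.free]
        if len(free) >= nvnodes:
            for v in free[:nvnodes]:
                v.free = False
            return free[:nvnodes]
    return None   # not enough free vnodes anywhere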
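
The cpus/mems placement above was done with Linux cpusets. A minimal
sketch of job cpuset creation follows, assuming the legacy cpuset
pseudo-filesystem mounted at /dev/cpuset (as on Altix-era kernels;
current cgroup mounts name the files cpuset.cpus/cpuset.mems instead).
The mount point and job id are placeholders.

import os

CPUSET_ROOT = "/dev/cpuset"    # site-specific mount point (assumption)

def make_job_cpuset(jobid, cpus, mems, pid):
    """Confine a job to the cpus and mems the scheduler chose."""
    path = os.path.join(CPUSET_ROOT, jobid)
    os.mkdir(path)
    with open(os.path.join(path, "cpus"), "w") as f:
        f.write(cpus)          # e.g. "4-7"
    with open(os.path.join(path, "mems"), "w") as f:
        f.write(mems)          # e.g. "1", the NUMA node local to cpus 4-7
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))      # move the job's top process into the cpuset

# e.g. make_job_cpuset("1234.server", "4-7", "1", pid_of_job_script)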
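
The NUMA-alignment rule in the Nehalem example can be sketched as
below. numa_aligned_cpus is a hypothetical name, and the policy shown
(whole idle NUMA nodes first, then least-loaded nodes) is only the
simplest version of what a NUMA-aware scheduler might do.

# Hypothetical sketch of the NUMA-alignment rule described above:
# prefer NUMA nodes whose cpus are entirely free, so small jobs
# neither straddle nor fragment NUMA nodes.
def numa_aligned_cpus(numa_nodes, busy, ncpus):
    """numa_nodes: list of cpu-id lists, one per NUMA node.
    busy: set of cpu ids already allocated. Returns cpu ids or None."""
    # First pass: a wholly idle NUMA node big enough for the job.
    for node in numa_nodes:
        if ncpus <= len(node) and not any(c in busy for c in node):
            return node[:ncpus]
    # Fallback: any free cpus, least-loaded NUMA nodes first.
    ranked = sorted(numa_nodes, key=lambda n: sum(c in busy for c in n))
    free = [c for node in ranked for c in node if c not in busy]
    return free[:ncpus] if len(free) >= ncpus else None

# Dual quad-core Nehalem host: two NUMA nodes, cpu 0 already busy.
# An incoming 4-cpu job lands on the idle node, cpus 4-7.
print(numa_aligned_cpus([[0, 1, 2, 3], [4, 5, 6, 7]], {0}, 4))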
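
Finally, a sketch of the per-rank subcpusets our mpirun created
(hypothetical names and the legacy /dev/cpuset layout again; the real
SGI MPT integration is not shown). The point is that the launcher,
not the MOM, knows the task layout, so it creates the children of the
job cpuset itself.

import os

def make_rank_cpusets(job_cpuset, rank_cpus, rank_mems):
    """Create one subcpuset per MPI rank inside the job cpuset.
    rank_cpus/rank_mems hold per-rank cpu and mem list strings."""
    for rank, (cpus, mems) in enumerate(zip(rank_cpus, rank_mems)):
        path = os.path.join(job_cpuset, "rank%d" % rank)
        os.mkdir(path)
        with open(os.path.join(path, "cpus"), "w") as f:
            f.write(cpus)
        with open(os.path.join(path, "mems"), "w") as f:
            f.write(mems)
    # The launcher writes each rank's pid into its subcpuset's "tasks"
    # file just before exec'ing the MPI binary.

# Simple case: four single-cpu/single-mem subcpusets for a 4-way job:
# make_rank_cpusets("/dev/cpuset/1234.server",
#                   ["4", "5", "6", "7"], ["1", "1", "1", "1"])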