[torquedev] 3.0-alpha branch added to TORQUE subversion tree

David Singleton David.Singleton at anu.edu.au
Mon Apr 26 16:51:34 MDT 2010

On 04/27/2010 03:28 AM, Ken Nielson wrote:
> The last part of my last response was not as clear as I wanted.
> We definitely want to get user response about the Multi-MOM, but what I
> really would like to get input for is how people are using their NUMA
> systems. How do they lock down nodes and memory etc.
> Ken Nielson

Isn't a MOM per node-board or any other subset of an SMP a restriction
on shared memory job sizes?  Why does it help in the NUMA case to have
a MOM per node board?   Do these MOMs segregate NUMA node memory as

Just to give an example of an alternative approach, we developed a very
NUMA-aware PBS and MPI system for an Altix cluster that we ran for about
4 years.  Some of that functionality is still being used on our current
Nehalem cluster, some we need to recover, some is irrelevant.

  - we ran/run a very broad range of apps (including:
          - 48cpu OpenMP jobs
          - many 100 cpu MPI jobs
          - some hybrid jobs
          - single cpu 36GB jobs
          - lots more)

  - the Altix cluster had 30 64-way nodes but that was an arbitrary
    partitioning and changed during the lifetime of the cluster

  - we ran large MPI jobs within and across partitions as needed.

  - the scheduler has a concept of arbitrary sized "virtual nodes"
    /"vnodes" which are a subset of a real node/host, i.e. a host
    contains one or more vnodes. The expectation is that vnodes
    have some NUMA basis.

  - MOMs do not know about virtual nodes although they do report the
    NUMA structure of their host to the scheduler

  - the scheduler is aware of which resources are local to a vnode and
    which are shared between vnodes, and allocates them correctly.

  - larger jobs are allocated multiple whole vnodes.

  - on the Altix BX3700, 8P "C-bricks" were a natural topological unit -
    they were 4 NUMA nodes sharing a NUMALink router (cpus/mems outside
    a C-brick were an extra hop away). These were used for vnodes.

  - The actual vnode size is at the admin's discretion and is set based
    on host NUMA structure and site job requirements.  Typically they
    would be 4, 6, 8, 12 or even 16 way units.

  - with our 64-way nodes, a 64-cpu MPI job was allocated 8 vnodes which
    may have been on 1, 2 or more 64-way nodes (partitions)

  - users *must* request memory

  - exechosts include both a cpus list and a mems (NUMA node) list

  - all jobs are placed in cpusets derived from these lists, one cpuset
    per host

  - if the user's memory request respects the NUMA node size, the job
    is allocated only mems local to its cpus.  If the job is a larger
    memory job, its mems will extend "wider" than its cpus as needed.

  - the mems of these "non-NUMA (large mem) jobs" are constrained within
    one or more vnodes so as not to interfere with large parallel jobs, e.g.
    the mems of running jobs only overlap if the jobs are using less than
    a vnode's worth of cpus.

  - over and above the use of vnodes, the scheduler is "NUMA aware" in
    that it aligns jobs with NUMA chunks of the system, e.g. with our
    current dual quad-core Nehalem nodes, if cpu 0 is running a single cpu
    job, an incoming 4 cpu job will be allocated cpus 4-7 (the other
    NUMA node).  (A vnode = node/host = 2 NUMA nodes in this cluster.)

[- Yes, we do limit odd layouts that users might think they want to
    request. Users are always free to request sufficient cpus/vnodes and
    then use mpirun arguments to use a subset of these cpus to satisfy
    whatever layout they think they want.  They are charged for the whole
    job ;-)  The whole point of our scheme is to respect NUMA and isolate
    jobs from one another in the NUMA sense (as much as possible). That
    automatically limits layouts. ]

  - a large (in cpu count) shared memory job is treated the same as a
    large MPI job except that its allocated vnodes must come from one

  - we wrote our own mpirun for SGI MPT so that each MPI task could
    optionally be placed in a subcpuset of the job cpuset. In simple
    cases, these would be single cpu/single mems cpusets but for hybrid
    jobs, they could be larger.  Note that these subcpusets are *not*
    created by the MOM - the MOM cannot know what is required.
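
To make the cpus/mems selection in the list above concrete, here is a
minimal Python sketch.  It is an illustration only, not the scheduler
code described in this post: the function names pick_cpus and pick_mems,
the 12GB-per-NUMA-node figure in the usage note, and the "nearest NUMA
node by index" widening rule are all assumptions made for the example,
and the vnode constraint on large-memory jobs is not modelled.

```python
import math


def pick_cpus(numa_nodes, busy, ncpus):
    """Pick ncpus free cpus, preferring one completely idle NUMA node.

    numa_nodes: list of lists of cpu ids, one inner list per NUMA node.
    busy: set of cpu ids already allocated to running jobs.
    Returns a list of cpu ids, or None if the request cannot be met.
    """
    free = [[c for c in node if c not in busy] for node in numa_nodes]

    # First choice: an idle NUMA node that is large enough, so the new
    # job shares no NUMA node with any running job.
    for node, avail in zip(numa_nodes, free):
        if len(avail) == len(node) and len(avail) >= ncpus:
            return avail[:ncpus]

    # Second choice: any single NUMA node with enough free cpus.
    for avail in free:
        if len(avail) >= ncpus:
            return avail[:ncpus]

    # Fallback: span NUMA nodes, taking from the emptiest nodes first.
    chosen = []
    for avail in sorted(free, key=len, reverse=True):
        chosen.extend(avail[: ncpus - len(chosen)])
        if len(chosen) == ncpus:
            return sorted(chosen)
    return None


def pick_mems(numa_nodes, cpus, mem_request_gb, mem_per_node_gb):
    """Return the NUMA node indices (mems) for a job running on cpus.

    If the request fits in the NUMA nodes that hold the job's cpus, only
    those local nodes are returned.  Otherwise the list is widened with
    the nearest other nodes (nearest by index: an assumed stand-in for
    real topology distance).
    """
    cpu_set = set(cpus)
    local = [i for i, node in enumerate(numa_nodes) if cpu_set & set(node)]
    needed = math.ceil(mem_request_gb / mem_per_node_gb)

    others = [i for i in range(len(numa_nodes)) if i not in local]
    others.sort(key=lambda i: min(abs(i - j) for j in local))

    mems = list(local)
    while len(mems) < needed and others:
        mems.append(others.pop(0))
    return sorted(mems)
```

With the dual quad-core layout from the example above,
pick_cpus([[0, 1, 2, 3], [4, 5, 6, 7]], {0}, 4) returns [4, 5, 6, 7].
For a hypothetical host with four 2-cpu NUMA nodes of 12GB each,
pick_mems([[0, 1], [2, 3], [4, 5], [6, 7]], [0], 36, 12) returns
[0, 1, 2] (mems "wider" than cpus), while a 4GB request on the same cpu
returns just [0].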

