[torquedev] 3.0-alpha branch added to TORQUE subversion tree

David Beer dbeer at adaptivecomputing.com
Mon Apr 26 17:07:55 MDT 2010

----- Original Message -----
> On 04/27/2010 03:28 AM, Ken Nielson wrote:
> > The last part of my last response was not as clear as I wanted.
> >
> > We definitely want to get user response about the Multi-MOM, but
> > what I
> > really would like to get input for is how people are using their
> > NUMA systems. How do they lock down nodes and memory etc.
> >
> > Ken Nielson
> >
> Isn't a MOM per node-board or any other subset of an SMP a restriction
> on shared memory job sizes? Why does it help in the NUMA case to have
> a MOM per node board? Do these MOM's segregate NUMA node memory as
> well.

No, our implementation does not restrict the job size of a shared-memory job. This is only restricted by the amount of memory in the system.

Having a mom per node board just seemed like an easy way to do it. We're still deciding what the optimal configuration will be, and its possible that the final version checked into TORQUE will leave the ratio of moms to node boards configurable for each site. 

At least initially, having a one-to-one mapping of moms to nodeboards made the implementation very easy. Its conceivable that this could change, as we find out more about the NUMA systems that different sites are running. The more we know about how people are using them, the better product we can produce.

> Just to give an example of an alternative approach, we developed a
> very NUMA-aware PBS and MPI system for an Altix cluster that we ran
> for about
> 4 years. Some of that functionality is still being used on our current
> Nehalem cluster, some we need to recover, some is irrelevant.
> - we ran/run a very broad range of apps (including:
> - 48cpu OpenMP jobs
> - many 100 cpu MPI jobs
> - some hydrid jobs
> - single cpu 36GB jobs
> - lots more)
> - the Altix cluster had 30 64-way nodes but that was an arbitrary
> partitioning and changed during the lifetime of the cluster
> - we ran large MPI jobs within and across partitions as needed.
> - the scheduler has a concept of arbitrary sized "virtual nodes"
> /"vnodes" which are a subset of a real node/host, i.e. a host
> contains one or more vnodes. The expectation is that vnodes
> have some NUMA basis.
> - MOMs do not know about virtual nodes although they do report the
> NUMA structure of their host to the scheduler
> - the scheduler is aware of what resources are local to a vnode and
> which are shared between vnodes and are allocated correctly.
> - larger jobs are allocated multiple whole vnodes.
> - on the Altix BX3700, 8P "C-bricks" were a natural topological unit -
> they were 4 NUMA nodes sharing a NUMALink router (cpus/mems outside
> a C-brick were an extra hop away). These were used for vnodes.
> - The actual vnode size is at the admins discretion and is set based
> on host NUMA structure and site job requirements. Typically they
> would be 4, 6, 8, 12 or even 16 way units.
> - with our 64-way nodes, a 64-cpu MPI job was allocated 8 vnodes which
> may have been on 1, 2 or more 64-way nodes (partitions)
> - users *must* request memory
> - exechosts include both a cpus list and a mems (NUMA node) list
> - all jobs are placed in cpusets derived from these lists, one cpuset
> per host
> - if the users memory request respects the NUMA node size, the job is
> is allocated only mems local to its cpus. If the job is a larger
> memory job, it's mems will extend "wider" than it's cpus as needed.
> - the mems of these "non-NUMA (large mem) jobs" are constrained within
> one or more vnodes so as not to interfere with large parallel jobs,
> eg. the mems of running jobs only overlap if the jobs are using less
> than a vnode worth of cpus.
> - over and above the use of vnodes, the scheduler is "NUMA aware" in
> that it aligns jobs with NUMA chunks of the system, eg with our
> current dual quad core Nehalem nodes, if cpu 0 is running a singe cpu
> job, an incoming 4 cpu job will be allocated cpus 4-7 (the other
> NUMA node). (A vnode = node/host = 2 NUMA nodes in this cluster.)
> [- Yes, we do limit odd layouts that users might think they want to
> request. Users are always free to request sufficient cpus/vnodes and
> then use mpirun arguments to use a subset of these cpus to satisfy
> whatever layout they think they want. They are charged for the whole
> job ;-) The whole point of our scheme is to respect NUMA and isolate
> jobs from one another in the NUMA sense (as much as possible). That
> automatically limits layouts. ]
> - a large (in cpu count) shared memory job is treated the same as a
> large MPI job except that it's allocated vnodes must come from one
> host.
> - we wrote our own mpirun for SGI MPT so that each MPI task could
> optionally be placed in a subcpuset of the job cpuset. In simple
> cases, these would be single cpu/single mems cpusets but for hybrid
> jobs, they could be larger. Note that these subcpusets are *not*
> created by the MOM - the MOM cannot know what is required.
> Cheers,
> David

Thanks for so much detail about how you did things. This will help us in creating as flexible of a solution as possible. Our goal is to create a solution to support NUMA machines in general. From your description, it seems that we offer a lot of the same functionality even if we implement it differently.

I do have a question about your MPI jobs - are you using MPI to control a job running across nodeboards, or just across different NUMA machines? It seems that this is what you mean, I just want to make sure that that's what you're saying.

David Beer | Senior Software Engineer
Adaptive Computing

More information about the torquedev mailing list