[torquedev] 3.0-alpha branch added to TORQUE subversion tree

David Singleton David.Singleton at anu.edu.au
Mon Apr 26 17:43:30 MDT 2010


On 04/27/2010 09:07 AM, David Beer wrote:
> I do have a question about your MPI jobs - are you using MPI to control a job running across nodeboards, or just across different NUMA machines? It seems that this is what you mean, I just want to make sure that that's what you're saying.
>

I'm not sure what your question is.  IIRC, a nodeboard = 1 NUMA node in an
Altix 4700 (4 cpus).  What do you mean by "just across different NUMA
machines"?

We used mpirun to place individual MPI tasks in the smallest NUMA unit in
which they could operate.  Usually this meant allocating subcpusets of 1 cpu
and 1 mem per task.  Taken together, the MPI tasks of a job would be
spread across multiple NUMA nodes, vnodes and, usually, hosts.
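
To make "a subcpuset of 1 cpu and 1 mem per task" concrete, here's a rough
Python sketch of the sort of thing that happens at the cpuset-filesystem
level.  The mount point, job directory and values are made up for
illustration; this is not what our mpirun literally does:

    import os

    # Rough sketch only: carve a 1-cpu/1-mem child cpuset for each MPI rank
    # out of the MOM-created job cpuset.  The /dev/cpuset/torque/<jobid>
    # layout is an assumption for this example.
    def make_task_cpuset(job_cpuset_dir, rank, cpu, mem):
        task_dir = os.path.join(job_cpuset_dir, "rank%d" % rank)
        os.makedirs(task_dir)
        # Older kernels expose 'cpus'/'mems'; newer ones use
        # 'cpuset.cpus'/'cpuset.mems'.
        with open(os.path.join(task_dir, "cpus"), "w") as f:
            f.write(str(cpu))
        with open(os.path.join(task_dir, "mems"), "w") as f:
            f.write(str(mem))
        return task_dir

    # e.g. ranks 0-3 on cpus 8-11, two cpus per mem node (mem = cpu // 2)
    if __name__ == "__main__":
        job_dir = "/dev/cpuset/torque/12345.server"   # hypothetical job id
        for rank, cpu in enumerate(range(8, 12)):
            make_task_cpuset(job_dir, rank, cpu, cpu // 2)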

For example, a 64-cpu job may have (NUMA nodes here have 2 cpus each):

Job exechost:  ac2/cpus=8-23/mems=4-11+ac3/cpus=16-63/mems=8-31

ac2:  MOM-created job cpuset cpus=8-23/mems=4-11

       mpirun-created task subcpusets:  rank 0: cpus=8/mems=4
                                        rank 1: cpus=9/mems=4
                                        rank 2: cpus=10/mems=5
                                        rank 3: cpus=11/mems=5
                                        ....

ac3:  MOM-created job cpuset cpus=16-63/mems=8-31

       mpirun-created task subcpusets:  rank 16: cpus=16/mems=8
                                        rank 17: cpus=17/mems=8
                                        rank 18: cpus=18/mems=9
                                        rank 19: cpus=19/mems=9
                                        ....

For both hosts, vnodes are
            vnode 0    cpus=0-7/mems=0-3
            vnode 1    cpus=8-15/mems=4-7
            vnode 2    cpus=16-23/mems=8-11
            ....
but the MOM is totally unaware of this scheduling division.
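
And just to spell out the exec_host line above, a throwaway sketch that
expands it into per-host cpu and mem lists (the field layout is assumed
from the "ac2/cpus=8-23/mems=4-11+..." form shown above, not read out of
the TORQUE source):

    def expand(spec):
        # '8-23' -> [8, ..., 23]; also handles lists like '0,2-3'
        out = []
        for part in spec.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                out.extend(range(int(lo), int(hi) + 1))
            else:
                out.append(int(part))
        return out

    def parse_exechost(exechost):
        # 'ac2/cpus=8-23/mems=4-11+ac3/...' -> per-host cpu and mem lists
        hosts = {}
        for chunk in exechost.split("+"):
            host, cpus, mems = chunk.split("/")
            hosts[host] = {"cpus": expand(cpus.split("=")[1]),
                           "mems": expand(mems.split("=")[1])}
        return hosts

    print(parse_exechost("ac2/cpus=8-23/mems=4-11+ac3/cpus=16-63/mems=8-31"))
    # ac2: 16 cpus (8-23), 8 mems (4-11); ac3: 48 cpus (16-63), 24 mems (8-31)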


Cheers,
David

