[torquedev] Torque CPU topology requests
Craig West
cwest at astro.umass.edu
Fri Dec 4 12:40:59 MST 2009
I spoke about this to a few people at SC09; I'll detail my thoughts here
again.
Firstly, what I am talking about is CPU topology. It looks like the
start of work towards this has already begun. In the 2.4 tree there is a
"bitmap" option. I don't know if or how this is dealt with in MOAB, and
I doubt we have support in MAUI. What I mean by CPU topology is how the
actual processors are physically organised in a system. For consistency,
I will call a physical CPU a socket. A core, of course, is a core (one to
many per socket), and a node may have one to many sockets. AMD and (now)
Intel have memory directly attached to CPUs (other processor makers too)
which is something else that should be considered part of the CPU
topology. Currently the memory attaches directly to a socket. As I
understand it, Linux (not sure about other OSes) allocates RAM on the
socket from which the memory request came, and allocates it elsewhere
only if needed. If a running process gets moved to a different socket,
its access to that RAM has increased latency.
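To make the socket/core terms concrete, here is a minimal Python sketch
that groups logical CPUs by socket and core. It assumes a Linux system
that exposes physical_package_id and core_id under
/sys/devices/system/cpu/cpuN/topology/. The function name and the "root"
parameter are made up for illustration and are not part of Torque.

```python
import glob
import os
from collections import defaultdict

def read_topology(root="/sys/devices/system/cpu"):
    """Return {socket_id: {core_id: [logical cpu numbers]}}."""
    sockets = defaultdict(lambda: defaultdict(list))
    for cpu_dir in glob.glob(os.path.join(root, "cpu[0-9]*")):
        topo = os.path.join(cpu_dir, "topology")
        try:
            with open(os.path.join(topo, "physical_package_id")) as f:
                socket_id = int(f.read())
            with open(os.path.join(topo, "core_id")) as f:
                core_id = int(f.read())
        except OSError:
            continue  # offline CPU, or no topology information exposed
        cpu = int(os.path.basename(cpu_dir)[3:])  # "cpu12" -> 12
        sockets[socket_id][core_id].append(cpu)
    # Convert to plain dicts with sorted CPU lists for stable output.
    return {s: {c: sorted(cpus) for c, cpus in cores.items()}
            for s, cores in sockets.items()}
```

Calling read_topology() on a node returns one entry per socket. A core
that lists more than one logical CPU has hardware threads.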
From some quick examinations of the "bitmap" feature, it appears that
this allows you to define a layout of exactly which processors you would
like to use. However, this immediately breaks down if the systems are
inhomogeneous. Of course this also means that the users of these systems
need to understand the topology of the computers they are using to run
their jobs if they want to specify the correct bitmap.
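As a rough illustration of why a fixed bitmap does not carry across
inhomogeneous nodes, here is a small Python sketch. The two node layouts
are invented, and I am assuming that the rightmost character of the
bitmap refers to core 0; check the Torque documentation for the actual
bit order.

```python
def cores_from_bitmap(bitmap):
    """Return the core indices selected by a string of 0s and 1s.

    Assumption: the rightmost character refers to core 0.
    """
    return [i for i, bit in enumerate(reversed(bitmap)) if bit == "1"]

def sockets_used(bitmap, core_to_socket):
    """Map the selected cores onto sockets for one hypothetical node layout."""
    if len(bitmap) != len(core_to_socket):
        raise ValueError("bitmap length does not match the node's core count")
    return sorted({core_to_socket[c] for c in cores_from_bitmap(bitmap)})

# Two hypothetical 4-core nodes; index = core number, value = socket.
node_a = [0, 0, 1, 1]  # cores 0-1 on socket 0, cores 2-3 on socket 1
node_b = [0, 1, 0, 1]  # core numbering alternates between sockets

print(sockets_used("0011", node_a))  # both cores on socket 0
print(sockets_used("0011", node_b))  # one core on each socket
```

The same bitmap keeps the job on one socket on the first node and
splits it across both sockets on the second.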
For those interested, bitmaps are enabled at compile time of Torque (2.4
tree) using the geometry options. There is already documentation about
this on the Torque pages.
http://www.clusterresources.com/torquedocs21/3.6schedulingcores.shtml
One important note is that it appears only a single job can run on any
given node when using bitmaps, and that the node must be completely free
to allocate the bitmap to start with. I think a second job (or more)
could run on a bitmapped node if they didn't request further bitmaps.
Not sure how well this would/could work and I've not tested any of this.
What I would like to propose is a more general method of allowing users
to select the processors they are running on. Only some users will want
this, but as the number of cores in systems (nodes) increases I think it
will become more and more useful. This requires modifications to the
Torque system.
Basically, there are at least two parts to this in Torque, and one part in
the Scheduler.
* The system side is to get the information about the processor
topology into the resources list for a computer so that the scheduler
can allocate the resources correctly.
* The user side is to allow a simple method for the users to request
the cpu topology. The users should be able to request the topology
somewhat generically. Of course they still need to understand the system -
but not in as much detail, and if the information is viewable in the
node resources, they can see the options they have.
The items I see being requested would be number of cores per socket,
number of sockets per node, number of cores per node (the last one is
already available "ppn="). There are likely to be other options either
now, or in the future that could be given.
* The last part of this is that the scheduler needs to be able to allocate
the requested topology (avoiding conflicts) and keep it locked - which
requires cpusets/cgroups (in the kernel).
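The scheduler part above could look something like the following Python
sketch, which places a request for "N sockets with M cores each" onto
the free cores of a single node. Everything here (the function name, the
request parameters, the node layout) is hypothetical and is not existing
Torque, Maui or Moab syntax. Locking the job to the chosen cores with a
cpuset would be a separate step that is not shown.

```python
def allocate(free, sockets_wanted, cores_per_socket):
    """Pick cores for a request of sockets_wanted x cores_per_socket.

    free maps socket id -> set of free core ids on that socket.
    Returns {socket: [cores]} and removes those cores from free, or
    returns None (leaving free untouched) if the node cannot satisfy
    the request.
    """
    candidates = [s for s in sorted(free) if len(free[s]) >= cores_per_socket]
    if len(candidates) < sockets_wanted:
        return None
    chosen = {}
    for s in candidates[:sockets_wanted]:
        cores = sorted(free[s])[:cores_per_socket]
        free[s] -= set(cores)  # mark the cores as in use
        chosen[s] = cores
    return chosen

# Hypothetical node: 2 sockets with 4 cores each.
free = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}
job1 = allocate(free, sockets_wanted=1, cores_per_socket=4)  # fills socket 0
job2 = allocate(free, sockets_wanted=2, cores_per_socket=2)  # cannot fit
job3 = allocate(free, sockets_wanted=1, cores_per_socket=2)  # fits on socket 1
print(job1, job2, job3)
```

Because allocated cores are removed from the free set, two jobs can
share the node without conflicting, which is what the current bitmap
approach does not appear to allow.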
The OpenMPI group had a project called PLPA, which is being replaced
with HW-LOC. http://www.open-mpi.org/projects/hwloc/
This system is capable of providing all the information about the
layout of a system that would be required.
It just needs to be interfaced to Torque.
I have seen performance improvements in some of our codes by locking
processes to the cores they start on (more than 40% in some cases). I also
see the possibility for certain codes to benefit from choosing which
cores they are locked to. For a user that is requesting a whole system,
there are existing options for allocating the cores to use (OpenMPI can
use an appfile and numactl to control the given cores for each process
in the mpi stack, and recently OpenMPI also added locking to cores or
sockets - version 1.3.4).
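For reference, locking a process to a core can be sketched in a few
lines of Python on Linux using os.sched_setaffinity (Python 3.3 or
later). This is only an illustration of the idea, not what Open MPI or
numactl do internally, and the function name is made up.

```python
import os

def pin_current_process(core):
    """Restrict the calling process to one logical CPU; return the new mask."""
    os.sched_setaffinity(0, {core})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)

# Pin to the first CPU this process is currently allowed to run on.
allowed = sorted(os.sched_getaffinity(0))
print(pin_current_process(allowed[0]))
```

Children started after the call inherit the affinity mask, so a wrapper
could pin itself and then start the real code.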
Intel still appears to have their hyper threading in the latest CPUs,
and allocating those as cpus in a cluster can cause problems (or at
least it did for me in the past). Being able to identify and allocate
those processors correctly (some codes do benefit from that extra
'thread') would be nice to have in the queue/scheduling systems.
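Identifying the hyper-threaded "cpus" could be sketched as follows in
Python. On Linux each logical CPU has a thread_siblings_list file under
/sys/devices/system/cpu/cpuN/topology/ listing the logical CPUs that
share its core. The sibling data below is invented, and the function
names are only for illustration.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "0-3,8" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def one_thread_per_core(sibling_lists):
    """Given {cpu: sibling list text}, keep the lowest-numbered CPU of each core."""
    return sorted({parse_cpu_list(text)[0] for text in sibling_lists.values()})

# Hypothetical 2-core node with hyper-threading: cpus 0/2 and 1/3 share cores.
siblings = {0: "0,2", 1: "1,3", 2: "0,2", 3: "1,3"}
print(one_thread_per_core(siblings))  # [0, 1]
```

A scheduler could hand out only the CPUs returned here by default, and
offer the remaining sibling threads to codes that ask for them.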
Craig.