[torquedev] Torque CPU topology requests

Craig West cwest at astro.umass.edu
Fri Dec 4 12:40:59 MST 2009

I spoke about this with a few people at SC09; I'll detail my thoughts 
here.

Firstly, what I am talking about is CPU topology. It looks like the 
start of work towards this has already begun. In the 2.4 tree there is a 
"bitmap" option. I don't know if or how this is dealt with in MOAB, and 
I doubt we have support in MAUI. What I mean by CPU topology is how the 
actual processors are physically organised in a system. For consistency, 
I will call a physical CPU a socket. A core, of course, is a core (1 to 
many per socket). And a node may have 1 to many sockets. AMD and (now) 
Intel have memory directly attached to CPUs (other processor makers too) 
which is something else that should be considered part of the CPU 
topology. Currently the memory attaches directly to a socket. As I 
understand it, Linux (not sure about other OSes) allocates RAM on the 
socket from which the memory request came, and allocates it elsewhere 
only if needed. If a running process gets moved to a different socket, 
there is increased latency in accessing that RAM.

From a quick examination of the "bitmap" feature, it appears that 
this allows you to define a layout of exactly which processors you would 
like to use. However, this immediately breaks down if the systems are 
inhomogeneous. Of course this also means that the users of these systems 
need to understand the topology of the computers they are using to run 
their jobs if they want to specify the correct bitmap.
For those interested, bitmaps are enabled at compile time of Torque (2.4 
tree) using the geometry options. There is already documentation about 
this on the Torque pages. 
One important note is that it appears only a single job can run on any 
given node when using bitmaps, and that the node must be completely free 
to allocate the bitmap to start with. I think a second job (or more) 
could run on a bitmapped node if they didn't request further bitmaps. 
Not sure how well this would/could work and I've not tested any of this.
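To make the conflict concrete, here is a small sketch (in Python, not 
Torque's actual parser) of how a core bitmap selects cores and why two 
bitmapped jobs on one node can collide; the string format is an 
assumption for illustration, where bit i (left to right) stands for 
core i on the node:

```python
# Hypothetical sketch, not Torque's real bitmap/geometry code:
# decode a '0'/'1' bitmap string into the core indices it selects,
# and check whether two bitmapped jobs would fight over a core.

def bitmap_to_cores(bitmap):
    """Return the core indices selected by a '0'/'1' bitmap string."""
    return [i for i, bit in enumerate(bitmap) if bit == "1"]

def conflicts(bitmap_a, bitmap_b):
    """Two bitmap jobs conflict if they select any core in common."""
    return bool(set(bitmap_to_cores(bitmap_a)) & set(bitmap_to_cores(bitmap_b)))

print(bitmap_to_cores("1100"))     # cores 0 and 1
print(conflicts("1100", "0011"))   # disjoint cores -> False
print(conflicts("1100", "0110"))   # both want core 1 -> True
```

A second bitmapped job could only be admitted if this kind of check 
passed against every bitmap already active on the node.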

What I would like to propose is a more general method of allowing users 
to select the processors they are running on. Only some users will want 
this, but as the number of cores in systems (nodes) increase I think it 
will become more and more useful. This requires modifications to the 
Torque system.
Basically, there are at least two parts to this in Torque, and one part in 
the Scheduler.

 * The system side is to get the information about the processor 
topology into the resources list for a computer so that the scheduler 
can allocate the resources correctly.

 * The user side is to allow a simple method for the users to request 
the CPU topology. The users should be able to request the topology 
somewhat generically. Of course they still need to understand the system - 
but not in as much detail, and if the information is viewable in the 
node resources, they can see the options they have.
The items I see being requested would be the number of cores per socket, 
number of sockets per node, and number of cores per node (the last one 
is already available as "ppn="). There are likely other options, now or 
in the future, that could be added.
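As a sketch of the user side, a topology request could look something 
like a colon-separated resource string; the keywords below (sockets, 
corespersocket, ppn) are invented for illustration, not an existing 
Torque syntax:

```python
# Hypothetical request parser; the keyword names are assumptions,
# not real Torque resource keywords.

def parse_topology_request(spec):
    """Parse e.g. 'sockets=2:corespersocket=4' into a dict of ints."""
    allowed = {"sockets", "corespersocket", "ppn"}
    request = {}
    for item in spec.split(":"):
        key, _, value = item.partition("=")
        if key not in allowed:
            raise ValueError("unknown topology keyword: %s" % key)
        request[key] = int(value)
    return request

print(parse_topology_request("sockets=2:corespersocket=4"))
# {'sockets': 2, 'corespersocket': 4}
```

If the node's topology were also published in its resource list, a user 
could check what values of these keywords a given node can satisfy.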

 * The last part of this is that the scheduler needs to be able to allocate 
the requested topology (avoiding conflicts) and keep it locked - which 
requires cpusets/cgroups (in the kernel).
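The scheduler-side allocation could be sketched roughly as below: given 
each socket's remaining free cores, find a socket that can still satisfy 
a "cores on one socket" request and mark those cores busy (the chosen 
cores would then be handed to a cpuset). The data layout is an 
assumption for illustration, not Torque's internal representation:

```python
# Minimal conflict-avoiding allocator sketch.  free_cores maps a
# socket id to the set of core ids still unallocated on that socket.

def allocate_on_socket(free_cores, ncores):
    """Return (socket_id, cores) for ncores on one socket, removing
    the chosen cores from free_cores, or None if no socket has room."""
    for socket_id, cores in sorted(free_cores.items()):
        if len(cores) >= ncores:
            chosen = sorted(cores)[:ncores]
            cores.difference_update(chosen)
            return socket_id, chosen
    return None

free = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}
print(allocate_on_socket(free, 3))   # (0, [0, 1, 2])
print(allocate_on_socket(free, 3))   # socket 0 is too full -> (1, [4, 5, 6])
```

Because the chosen cores are removed from the free set, a later job 
can never be given a core that an earlier job already holds.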

The OpenMPI group had a project called PLPA, which is being replaced 
with hwloc. http://www.open-mpi.org/projects/hwloc/
This system is capable of providing all the information about the 
layouts of the system with all the information that would be required. 
It just needs to be interfaced to torque.
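hwloc can export the topology as XML (e.g. via lstopo), so one way to 
get the layout into the node's resource list would be to walk that 
tree. The XML below is a hand-written simplification of hwloc's real 
output, just to show the shape of the extraction:

```python
# Sketch of pulling a socket -> cores map out of an hwloc-style XML
# topology dump.  The sample XML is a simplified stand-in for real
# hwloc output, not its exact schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<topology>
  <object type="Machine">
    <object type="Socket" os_index="0">
      <object type="Core" os_index="0"/>
      <object type="Core" os_index="1"/>
    </object>
    <object type="Socket" os_index="1">
      <object type="Core" os_index="2"/>
      <object type="Core" os_index="3"/>
    </object>
  </object>
</topology>
"""

def cores_per_socket(xml_text):
    """Map socket os_index -> list of core os_index values."""
    root = ET.fromstring(xml_text)
    layout = {}
    for sock in root.iter("object"):
        if sock.get("type") != "Socket":
            continue
        cores = [int(c.get("os_index"))
                 for c in sock.iter("object") if c.get("type") == "Core"]
        layout[int(sock.get("os_index"))] = cores
    return layout

print(cores_per_socket(SAMPLE))   # {0: [0, 1], 1: [2, 3]}
```

pbs_mom could publish something like this map alongside the existing 
np/ppn information.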

I have seen performance improvements in some of our codes by locking 
processes to the cores they start on (more than 40% in some cases). I 
also see the possibility for certain codes to benefit from choosing 
which cores they are locked to. For a user requesting a whole system, 
there are existing options for allocating the cores to use (OpenMPI can 
use an appfile and numactl to control the cores given to each process 
in the MPI stack, and recently OpenMPI also added locking to cores or 
sockets - version 1.3.4).
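The underlying mechanism for that kind of pinning is the Linux affinity 
syscall, which is what numactl and taskset use; a minimal sketch 
(Linux-only, and not Torque's own code) of pinning the current process 
to one core:

```python
# Linux-only sketch: pin the calling process to a single core using
# the kernel's affinity interface (os.sched_setaffinity wraps
# sched_setaffinity(2)).  This is what numactl/taskset do under the
# hood; a batch system would do it per launched process.
import os

def pin_to_core(core):
    """Pin the calling process to one core and return the new mask."""
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)

original = os.sched_getaffinity(0)    # remember the full allowed mask
print(pin_to_core(min(original)))     # now restricted to one core
os.sched_setaffinity(0, original)     # restore the original mask
```

Once pinned, the process no longer migrates between sockets, which is 
where the RAM-latency penalty described earlier comes from.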

Intel still appears to have Hyper-Threading in its latest CPUs, and 
allocating those threads as CPUs in a cluster can cause problems (or at 
least it did for me in the past). Being able to identify and allocate 
those processors correctly (some codes do benefit from that extra 
'thread') would be nice to have in the queue/scheduling systems.
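Identifying those hyper-threads is doable from the sibling information 
the kernel exposes. On Linux each logical CPU advertises its siblings 
in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list; the 
sample strings below stand in for those files (using the 
comma-separated form, ignoring the range form the kernel can also 
emit):

```python
# Sketch of collapsing Hyper-Threading siblings into physical cores,
# so the scheduler can count (and allocate) real cores separately
# from the extra hardware threads.  The sibling data is a sample
# standing in for /sys/.../thread_siblings_list contents.

siblings = {0: "0,4", 1: "1,5", 2: "2,6", 3: "3,7",
            4: "0,4", 5: "1,5", 6: "2,6", 7: "3,7"}

def physical_cores(sibling_lists):
    """Keep one logical CPU (the lowest-numbered) per physical core."""
    return sorted({min(int(c) for c in s.split(","))
                   for s in sibling_lists.values()})

print(physical_cores(siblings))   # [0, 1, 2, 3]
```

With that distinction in the node's resource list, a job could ask for 
real cores only, or explicitly opt in to the extra 'thread'.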

