[torqueusers] Torque notes from SC'08

Chris Samuel csamuel at vpac.org
Mon Dec 1 16:28:26 MST 2008

OK, here are my memories of the issues and comments
that were made to me at SC'08 about Torque. Apologies
if I've missed things out; I was ill on the Wednesday
and it took me a few days to recover fully.

Can I suggest that discussion about implementing any of
these happen on the torquedev list, please?


A) CPUSET NUMA Support

First of all virtually everyone seems to want better NUMA support,
i.e. adding in memory locality to the current cpusets support.

To illustrate how complicated these systems are internally,
and why NUMA locality is important, I've attached a PDF file
courtesy of Dave Singleton at the NCI in Canberra illustrating
the architecture of a small and large Altix system.

Dave has his own PBS fork (ANUPBS) which includes Altix
NUMA support, and he's been kindly talking through issues
with me via private email whilst I was at SC.

Currently it'd be easy for the pbs_mom to add certain mems to
a job, rather than all of them as it does now, if the scheduler
tells it what to do, but for that to happen there are a few
issues that need to be dealt with!

1) Determining the layout of the system and reporting it
to the pbs_server.

Investigations of various systems show that the
/sys/devices/system/node directory will (if present) include
information on what NUMA nodes there are, which CPUs are on
them and information about how much RAM is present and how
much is used.
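As a sketch of that first step, here's how a mom might pull the memory
figures out of a node's meminfo file. The helper and the sample text are
mine (not Torque code); the field layout follows the kernel's
/sys/devices/system/node/nodeN/meminfo format:

```python
import re

def parse_node_meminfo(text):
    """Parse one node's meminfo blob ('Node N Field: value kB' lines)
    into a dict of kB values. Hypothetical helper, not Torque code."""
    info = {}
    for line in text.splitlines():
        m = re.match(r"\s*Node\s+\d+\s+(\w+):\s+(\d+)\s+kB", line)
        if m:
            info[m.group(1)] = int(m.group(2))
    return info

# Sample modelled on /sys/devices/system/node/node0/meminfo
sample = """\
Node 0 MemTotal:       16303888 kB
Node 0 MemFree:          842100 kB
Node 0 MemUsed:        15461788 kB"""

print(parse_node_meminfo(sample))
```

The mom could report a dict like that per NUMA node up to pbs_server
alongside the CPU list.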

Newer kernels also have a distance file which contains the
memory latency from that node to other NUMA nodes normalised
against local memory (valued at 10).  So, for instance, on
Dave's Altix in Canberra there's a large (2.5x) penalty for
going off your NUMA node (from 10 to 25) but then to reach
the furthest node is only a 3.7x penalty (from 10 to 37). 

Whilst this naively makes an NxN matrix (i.e. 1024 numbers
for a small 64P Altix) we can simplify this as many numbers
are repeated - here's a real life example of the info for
NUMA node 0 on one:

10 25 25 25 29 29 29 29 29 29 29 29 29 29 29 29 33 33 33 33 37 37 37 37 37 37 37 37 37 37 37 37

So perhaps using a simple run length encoding scheme we can
collapse that down to:

10 25*3 29*12 33*4 37*12

Much more friendly to pbs_server, though not as artistic as
Garrick's idea of drawing the layout of the system in ASCII
art and letting the scheduler folks work it out from that. :-)
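That run-length encoding is a few lines of code. A minimal sketch
(names are mine; the input is a distance row as read from the sysfs
distance file):

```python
from itertools import groupby

def rle_row(distance_text):
    """Collapse a whitespace-separated distance row, e.g. the 32-entry
    Altix row above, into the proposed 'value*count' form."""
    parts = []
    for value, run in groupby(distance_text.split()):
        n = len(list(run))
        parts.append(value if n == 1 else "%s*%d" % (value, n))
    return " ".join(parts)

row = " ".join(["10"] + ["25"] * 3 + ["29"] * 12 + ["33"] * 4 + ["37"] * 12)
print(rle_row(row))  # -> 10 25*3 29*12 33*4 37*12
```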

The issue I have is that I don't have any examples from larger
NUMA systems, so if people can send me an example I'd much
appreciate it!  I just need to know the number of cores and
sockets in the node and the contents of each node's distance
file (/sys/devices/system/node/node*/distance).

If the system isn't running a kernel new enough to have that
information we might just have to assume that all sockets are
equidistant.

2) The schedulers will need to know how to process this information.

I don't know if this is something within the remit of the simple
FIFO scheduler, it's probably more something for Maui and Moab instead.

3) The scheduler needs to be able to tell Torque to allocate NUMA
nodes as well as cores when it runs a job.

Dave Singleton has a modified exec_host string that he uses to
relay that information to his pbs_moms.  He tells me that
their format is:

#    "host/cpus=<id_range_spec>/mems=<id_range_spec>+...."
# We use the convention that the form "host/cpuid" is shorthand
# for "host/cpus=cpuid/mems=allmems" so we haven't actually broken
# the original format - just extended it.

To illustrate, here's an example from a running job there,
as shown by qstat -f $JOBID:

    exec_host = ac10/cpus=8-23,40-55/mems=4-11,

Given that this is proven to work, perhaps Torque should
adopt it too?

In any case this means that *everything* that processes exec_host
for information will need to be updated to handle this!
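To give a feel for what that updating involves, here's a sketch of a
parser for the extended format. The function names and the tuple layout
are my own invention; the grammar is taken from Dave's description
above, including the "host/cpuid means all mems" shorthand:

```python
def expand_ids(spec):
    """Expand an <id_range_spec> like '8-23,40-55' into a list of ints."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids

def parse_exec_host(exec_host):
    """Parse ANUPBS-style 'host/cpus=<spec>/mems=<spec>+...' entries
    into (host, cpus, mems) tuples. A plain 'host/cpuid' entry means
    all mems, represented here as mems=None."""
    out = []
    for entry in exec_host.split("+"):
        fields = entry.split("/")
        host, cpus, mems = fields[0], [], None
        for f in fields[1:]:
            if f.startswith("cpus="):
                cpus = expand_ids(f[5:])
            elif f.startswith("mems="):
                mems = expand_ids(f[5:])
            else:
                cpus = expand_ids(f)  # bare cpuid shorthand
        out.append((host, cpus, mems))
    return out

print(parse_exec_host("ac10/cpus=8-23,40-55/mems=4-11"))
```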

The rest follow in no particular order:

B) Better memory resource limitations

This is mostly just down to the fact that currently the Linux
pbs_mom only sets RLIMIT_DATA which will only restrict memory
allocations of less than 128KB as malloc() in glibc uses sbrk()
for those sizes. For allocations above that it uses mmap()
which is limited by RLIMIT_AS instead.

The fix should just be to change RLIMIT_DATA to RLIMIT_AS.

I'll test this on our small test cluster and if it works
I'll submit a patch.
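For anyone who wants to reproduce the behaviour before the patch lands,
here's a small self-test (mine, Linux-only; the limit and allocation
sizes are arbitrary) showing that RLIMIT_AS does catch a large,
mmap-backed allocation:

```python
import os
import resource

def alloc_refused_under_as_limit(limit_bytes, nbytes):
    """Fork a child, cap its address space with RLIMIT_AS, and try to
    allocate nbytes there.  Returns True if the allocation raised
    MemoryError, i.e. the limit caught an mmap-backed allocation
    (which RLIMIT_DATA alone would miss on older glibc)."""
    pid = os.fork()
    if pid == 0:
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes if hard == resource.RLIM_INFINITY else min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
        try:
            buf = bytearray(nbytes)  # far above glibc's ~128KB mmap threshold
            os._exit(1)              # allocation unexpectedly succeeded
        except MemoryError:
            os._exit(0)              # limit was enforced
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0

print(alloc_refused_under_as_limit(1 << 30, 1 << 32))  # 1 GiB limit, 4 GiB alloc
```

Running the same child with only RLIMIT_DATA capped should let the big
allocation through, which is exactly the hole described above.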

C) Job arrays

Glen Beane is after help with job array code, please contact him
via the torquedev list to assist.

D) Better (less resource intensive) tracking of processes under Linux

On a large SMP system, trawling through /proc (which is what
pbs_mom currently does to track processes) can be fairly
heavy work.

The suggestion has been made that on a system with cpusets enabled
it would be far easier to just traverse the cpusets to monitor jobs.
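The cpuset approach amounts to reading one file per job. A sketch,
assuming a cpuset mount point and per-job directory layout that are
entirely hypothetical (both vary by site and Torque version):

```python
import os

CPUSET_ROOT = "/dev/cpuset/torque"  # hypothetical mount point and layout

def job_pids(jobid, root=CPUSET_ROOT):
    """Return the PIDs belonging to a job by reading its cpuset's
    'tasks' file (one PID per line), instead of walking all of /proc.
    Sketch only: where cpusets are mounted and how per-job directories
    are named are site-specific."""
    with open(os.path.join(root, jobid, "tasks")) as f:
        return sorted(int(line) for line in f if line.strip())
```

The kernel keeps the tasks file up to date as processes fork and exit,
so the mom gets an authoritative membership list for free.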

E) Unambiguous way to request N cpus

This has been a longstanding bugbear for some of us with Torque:
by default there's no simple way to request > N/c CPUs on a system
with c cores per node without the nodect hack.

According to Scott Jackson who talked on CR's work on Torque at
SC'08 this will be in 2.4, but I forgot to note what it is!

F) Better handling of job cleanup

This was raised with me by Ole from the Technical
University of Denmark, and (from memory) was mostly
around processes from non-TM aware MPI job launchers.

Hopefully this is something that pbs_track can help with.

Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
-------------- next part --------------
A non-text attachment was scrubbed...
Name: altix3700_topology.pdf
Type: application/pdf
Size: 58234 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20081202/c7653f0e/altix3700_topology-0001.pdf
