[torquedev] Unused vnode cpusets - should they stay or go?

Chris Samuel csamuel at vpac.org
Fri Nov 20 22:17:35 MST 2009

Hi folks,

At SC'09 Michel Béland (I hope I got that right, sorry
if I didn't!), who has an Altix running Torque with
cpuset support, raised a very interesting problem with
the current implementation.

The current design has a job cpuset on each node
containing the cores allocated to the job. We then
create sub-cpusets for each vnode of the job, each
containing its assigned core.
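For anyone not familiar with the mechanics: a cpuset on Linux is
just a directory in the cpuset pseudo-filesystem (typically mounted
at /dev/cpuset), with its cores set by writing to a control file
inside it. A rough sketch of the idea - the helper name and layout
here are mine for illustration, not the actual pbs_mom code:

```c
/* Sketch of creating a cpuset via the Linux cpuset pseudo-filesystem.
 * Illustrative only, not the actual pbs_mom implementation; "base"
 * would normally be the cpuset mount point, e.g. /dev/cpuset/torque. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Create cpuset directory <base>/<name> and write the core list
 * (e.g. "4-7") into its "cpus" control file.  Returns 0 on success. */
static int make_cpuset(const char *base, const char *name, const char *cpus)
{
    char path[1024];

    snprintf(path, sizeof(path), "%s/%s", base, name);
    if (mkdir(path, 0755) != 0)
        return -1;

    strncat(path, "/cpus", sizeof(path) - strlen(path) - 1);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%s", cpus);
    fclose(f);
    return 0;
}
```

The job cpuset gets the whole allocation (say "4-7"), and then one
sub-cpuset per vnode gets a single core each; on a machine with
hundreds of cores per node that's hundreds of these mkdir-plus-write
pairs every time a job starts.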

In the initial design we put each TM request to start
a process into its per-vnode cpuset, but abandoned
that as soon as we realised that Open-MPI et al. were
launching a single daemon per node which then launched
the appropriate MPI ranks - and of course these were
all getting locked onto a single core! :-(

So now we put any TM request into the job cpuset *but*
we still create those per-vnode cpusets, even though
they are never used.
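Placing a spawned task is then just a write of its pid into the job
cpuset's tasks file - again a sketch following the Linux cpuset
interface, with a hypothetical helper name:

```c
/* Sketch: move process "pid" into the cpuset at "cpuset_path" by
 * writing its pid to the "tasks" control file.  Illustrative only;
 * the file name follows the Linux cpuset pseudo-filesystem. */
#include <stdio.h>

static int attach_to_cpuset(const char *cpuset_path, long pid)
{
    char path[1024];

    snprintf(path, sizeof(path), "%s/tasks", cpuset_path);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%ld\n", pid);
    fclose(f);
    return 0;
}
```

Because the pid now goes into the job-level cpuset, anything the MPI
daemon forks inherits the full core set rather than being pinned to
one vnode's single core.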

This isn't an issue on small systems, but Michel has
a large Altix system with hundreds of cores per node,
and when a large job using hundreds of cores starts up
he sees this creation process take a long time (and I
think he said it triggers some sort of timeout).

My guess is that this will also cause issues when
the job is being torn down, as pbs_mom iterates
through the directory recursively deleting any
sub-cpusets found.
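For reference, that teardown has to run depth-first, since a cpuset
directory can only be rmdir'd once all of its sub-cpusets are gone.
Something along these lines (illustrative, not the actual pbs_mom
code; on a real cpuset filesystem the control files vanish with the
rmdir, so the unlink branch is only needed on ordinary filesystems):

```c
/* Sketch of depth-first cpuset teardown: recurse into sub-cpuset
 * directories and remove them before rmdir'ing the parent.
 * Illustrative only, not the actual pbs_mom implementation. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int remove_cpuset_tree(const char *path)
{
    DIR *d = opendir(path);
    if (d == NULL)
        return -1;

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        char sub[1024];
        snprintf(sub, sizeof(sub), "%s/%s", path, e->d_name);

        struct stat st;
        if (lstat(sub, &st) == 0 && S_ISDIR(st.st_mode))
            remove_cpuset_tree(sub);   /* sub-cpusets are directories */
        else
            unlink(sub);               /* not needed on a real cpuset fs */
    }
    closedir(d);
    return rmdir(path);
}
```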

So - can anyone see any reason why this code
shouldn't just get ifdef'd/commented out (and
removed later)?

In 2.4 the code has been tidied up a lot, and so we
just need to not call create_vnodesets(), but the 2.3
code does all the work inline and so needs more care.
Trunk is currently the same as 2.4.

I don't think that we should touch the cpuset deletion
code. If nothing has changed then it will be a quick
process, but should the job have created cpusets itself
it will be necessary to clean them up, or we risk
not being able to delete the parent cpuset while there
are active sub-cpusets.

A final performance note - I've been working with the
Open-MPI folks to introduce cpuset-aware processor
affinity settings into the next release (1.3.4), so
that rather than naively assuming it can bind from
core 0 onwards it will (portably) look at the cores it
is permitted to access instead. There is real benefit
from this: one speaker at the HPC Advisory Council
meeting reported that CPU affinity produced up to a
10% performance boost in their testing of LS-Dyna (a
crash code).
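The portable trick is simply to ask the kernel which CPUs the
process is already allowed on and bind rank r to the r-th permitted
core, rather than counting up from core 0. A minimal sketch using
the Linux sched_getaffinity/sched_setaffinity interface (helper
names are mine, not Open-MPI's actual code):

```c
/* Sketch of cpuset-aware binding: instead of assuming core 0 upward
 * is usable, query the allowed mask and bind rank r to the r-th
 * permitted CPU.  Helper names are illustrative, not Open-MPI's. */
#define _GNU_SOURCE
#include <sched.h>

/* Return the n-th (0-based) CPU set in "mask", or -1 if fewer than
 * n+1 CPUs are permitted. */
static int nth_allowed_cpu(const cpu_set_t *mask, int n)
{
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, mask) && n-- == 0)
            return cpu;
    }
    return -1;
}

/* Bind the calling process (a given MPI rank) to its permitted CPU. */
static int bind_rank(int rank)
{
    cpu_set_t allowed;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    int cpu = nth_allowed_cpu(&allowed, rank % CPU_COUNT(&allowed));
    if (cpu < 0)
        return -1;

    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one);
}
```

With the cores confined by the job cpuset, sched_getaffinity only
ever reports the cores Torque handed the job, so the binding stays
correct however the cpuset is laid out.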

Thoughts?

Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
