[torquedev] cpuset support

Chris Samuel csamuel at vpac.org
Fri Nov 16 00:22:12 MST 2007


On Tue, 13 Nov 2007, Garrick Staples wrote:

> Here's what we came up with...

Just to capture some of the other aspects of this that Garrick, Craig 
and I have been discussing at SC'07.  None of this is set in stone 
and discussion is encouraged!



0)  If we create the cpusets for a job before the prologue runs, 
then cluster admins are able to use the prologue to modify the cpus 
allocated.   This is _probably_ a bad thing, as it seems that the 
right place to change this decision is in the scheduler.

However, we should work out what to do about this: do we check the 
cpusets after the prologue to record any changes and report those 
back so that the scheduler can update its view of the world, or do 
we instead make the cpusets *after* the job has started to prevent 
this happening in the first place?
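
To make the first option concrete, here is a rough Python sketch (not 
pbs_mom code) of comparing what the scheduler allocated against what a 
cpuset's "cpus" file says after the prologue.  The function names are 
made up for illustration, and it assumes the file holds the usual 
kernel list format such as "0-3,8".

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty cpuset or a trailing comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpuset_changes(allocated, cpus_file_text):
    """Return (added, removed) CPUs relative to the scheduler's allocation."""
    current = parse_cpu_list(cpus_file_text)
    return sorted(current - allocated), sorted(allocated - current)


# Hypothetical case: the scheduler allocated CPUs 0-1, but the prologue
# rewrote the job's "cpus" file to "0,2-3".
added, removed = cpuset_changes({0, 1}, "0,2-3\n")
print("added:", added, "removed:", removed)
```

Anything in either list would be what gets reported back to the 
scheduler.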



1)  If we are using cpusets then using load average to work out how 
busy a node is goes out the window.

This is because if a user submits a 1 cpu job which then fires off 20 
processes by accident, all those processes will be confined to 1 core, 
so the scheduler could use the other cores in the knowledge that they 
are unaffected by the rogue CPU usage.

This could make life easier for the scheduler if it knows that cpusets 
are enabled on this node.
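
As a sketch of what that could look like on the scheduler side (purely 
illustrative; the function name and job id are invented), free cpus 
would be worked out from the allocations alone, without looking at load 
average at all:

```python
def free_cpus(node_cpus, job_cpusets):
    """Return the CPUs on a node not claimed by any job's cpuset.

    node_cpus   -- iterable of CPU numbers present on the node
    job_cpusets -- dict mapping a job id to the set of CPUs in its cpuset
    """
    used = set().union(*job_cpusets.values())
    return sorted(set(node_cpus) - used)


# Hypothetical 4-core node with one 1-cpu job confined to CPU 0.  Even if
# that job forks 20 processes, the other three CPUs still count as free.
print(free_cpus({0, 1, 2, 3}, {"1234.example": {0}}))
```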



2)  The pbs_mom can easily tell at startup whether cpusets are enabled 
by checking for the presence of the cpuset pseudo-filesystem 
in /proc/filesystems - we won't try to second-guess the admin and 
load the kernel module if it's missing either.. :-)

This fact could be exported dynamically as a node property.
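
The check itself is tiny.  A sketch in Python rather than the C that 
pbs_mom would use (the function name is made up), assuming the usual 
/proc/filesystems layout where the filesystem name is the last field on 
each line:

```python
import os


def cpuset_supported(filesystems_text):
    """Return True if 'cpuset' is listed in /proc/filesystems-style text."""
    for line in filesystems_text.splitlines():
        fields = line.split()
        # Lines look like "nodev<TAB>cpuset" or "<TAB>ext3"; the name is last.
        if fields and fields[-1] == "cpuset":
            return True
    return False


if __name__ == "__main__" and os.path.exists("/proc/filesystems"):
    with open("/proc/filesystems") as f:
        print("cpusets available:", cpuset_supported(f.read()))
```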



3)  We need some way for the pbs_mom to advertise the organisation of 
which core is in which socket, and possibly higher levels of NUMA 
organisational awareness for systems such as the Altix, so that the 
scheduler can make decisions based upon this.



4)  To do this we need to be able to work out from what is in /proc how 
the system is arranged (and be able to handle the various layouts on 
different architectures and kernel versions).

On recent kernels this info is fairly easy to get (at least for a 
standard Intel or AMD system) but you *cannot* assume that all 
sockets (aka "physical id") and cores (i.e. processor numbers) are 
sequential!
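
To illustrate, a rough Python sketch that groups processors by socket 
without assuming anything about the numbering.  It only handles the 
x86-style "processor" and "physical id" fields; other architectures 
and older kernels lay /proc/cpuinfo out differently, and the function 
name and sample data are invented.

```python
from collections import defaultdict


def socket_map(cpuinfo_text):
    """Map each "physical id" to the sorted "processor" numbers on it."""
    sockets = defaultdict(list)
    # Each processor's entry is separated from the next by a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        # If "physical id" is missing, group the processor under None.
        phys = fields.get("physical id")
        socket = int(phys) if phys is not None else None
        sockets[socket].append(int(fields["processor"]))
    return {s: sorted(p) for s, p in sockets.items()}


# Invented sample: two sockets with ids 0 and 3, processors interleaved.
SAMPLE = """\
processor\t: 0
physical id\t: 0

processor\t: 1
physical id\t: 3

processor\t: 2
physical id\t: 0

processor\t: 3
physical id\t: 3
"""

print(socket_map(SAMPLE))
```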

We will need to collect various /proc/cpuinfo outputs along with 
details of the system and the output of the 'arch' and 'uname -a' 
commands to help us with this.

I'll be posting some in a bit from the systems I have access to..

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency