[torquedev] [Bug 93] Resource management semantics of Torque need to be well defined

Michel Béland michel.beland at rqchp.qc.ca
Mon Dec 6 16:38:38 MST 2010


>> The fact of the matter is that ppn hasn't been clearly defined over time, and
>> what it has become in practice is probably best described as processes per
>> node.
> 
> Describing it as "processes per node" is very misleading and completely
> inaccurate.  Take for example a multi-threaded program.  I routinely run
> multi-threaded code on our cluster.  We have 32 cores per node, and if I run a
> _single process_ that uses 32 threads, I request ppn=32.  If that meant
> _processes_ I would request ppn=1 because, after all, my multi-threaded program
> is still a single process. It is, however, using multiple cores.
> 
> Virtual processors per node is the correct definition of ppn - the number of
> virtual processors will typically be set to the total number of cores on a
> node. Redefining it as processes per node will lead to problems.

The fact that Torque generates a $PBS_NODEFILE containing one line per 
virtual processor seems to support the conclusion that ppn means virtual 
processors. But I have always considered this behaviour broken since it 
does not work well with programs that use MPI *and* OpenMP. Back in the 
days when I used PBS Pro 5.3, I liked what they implemented to submit a 
job like this. You could ask for, say, -lnodes=10:ppn=2:cpp=4 to get 
10 nodes with 2 processes per node and 4 CPUs per process. Then PBS Pro 
would generate a $PBS_NODEFILE containing all the nodes repeated twice 
(because of ppn=2) and also set the environment variables NCPUS and 
OMP_NUM_THREADS to 4. With this, it would really allocate 8 virtual 
processors per node (ppn*cpp). With Torque, you have to fiddle with 
$PBS_NODEFILE to make it work with hybrid parallel programs.
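
As a rough illustration of that fiddling, here is a minimal Python sketch. It assumes the node file lists each host once per allocated virtual processor, as described above. The function name hybrid_nodefile and its procs_per_node parameter are hypothetical and not part of Torque. It keeps only the requested number of lines per host and derives the thread count that PBS Pro would have exported as OMP_NUM_THREADS.

```python
from collections import OrderedDict


def hybrid_nodefile(lines, procs_per_node):
    """Reduce a one-line-per-virtual-processor host list to
    procs_per_node lines per host (hypothetical helper).

    Returns (hosts, threads), where threads is the number of virtual
    processors left for each process, or None if the list is empty.
    """
    if procs_per_node < 1:
        raise ValueError("procs_per_node must be at least 1")

    # Count virtual processors per host, keeping first-seen order.
    counts = OrderedDict()
    for raw in lines:
        host = raw.strip()
        if host:
            counts[host] = counts.get(host, 0) + 1

    hosts = []
    threads = None
    for host, nvp in counts.items():
        if nvp % procs_per_node != 0:
            raise ValueError(
                "%s: %d virtual processors do not divide evenly into "
                "%d processes" % (host, nvp, procs_per_node))
        per_proc = nvp // procs_per_node
        if threads is None:
            threads = per_proc
        elif threads != per_proc:
            raise ValueError("hosts differ in processors per process")
        # One line per MPI process on this host.
        hosts.extend([host] * procs_per_node)
    return hosts, threads
```

In a real job one would read the lines from the file named by $PBS_NODEFILE, write the returned hosts to a new machine file for the MPI launcher, and export OMP_NUM_THREADS with the returned thread count. How the machine file is passed to the launcher depends on the MPI implementation.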

True enough, PBS Pro did not preclude running more processes than 
requested, but ppn meant processes per node (at least in the restricted 
MPI sense) and that is what the documentation said.

Later, they introduced -lselect and deprecated -lnodes altogether. Now 
one can ask for -lselect=10:ncpus=8:mpiprocs=2:ompthreads=4 to get the 
same result, if I remember correctly, but I think that I liked ppn and 
cpp better...
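
The correspondence between the two request styles can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in this message: ncpus per chunk is ppn*cpp, mpiprocs is ppn, and the thread resource (spelled ompthreads here, which is how PBS Professional names it as far as I know) is cpp. The function nodes_to_select is a hypothetical helper, not a PBS tool.

```python
def nodes_to_select(nodes, ppn, cpp):
    """Build a -l select string equivalent to nodes=N:ppn=P:cpp=C
    (hypothetical helper; resource names as assumed in the text)."""
    ncpus = ppn * cpp  # virtual processors actually allocated per node
    return "select=%d:ncpus=%d:mpiprocs=%d:ompthreads=%d" % (
        nodes, ncpus, ppn, cpp)
```

For the example above, nodes_to_select(10, 2, 4) builds the request string for 10 chunks of 8 CPUs, each running 2 MPI processes with 4 threads.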

-- 
Michel Béland, scientific computing analyst
michel.beland at rqchp.qc.ca
office S-250, pavillon Roger-Gaudry (main building), Université de Montréal
telephone: 514 343-6111 ext. 3892     fax: 514 343-2155
RQCHP (Réseau québécois de calcul de haute performance)  www.rqchp.ca
Calcul Canada (computecanada.org)