[torqueusers] NUMA general use

David Beer dbeer at adaptivecomputing.com
Mon Apr 26 13:43:17 MDT 2010


Michel, 

Just a few questions below:

----- Original Message -----

> We use Torque with Maui on an Altix cluster which contains a 768-core
> Altix 4700 split into three 256-core partitions, a 32-processor Altix
> 3700 and a 16-processor Altix 350. The Altix 4700 has four cores per
> node and 2 GB per core, the Altix 3700 has two cores per node and
> 4 GB per core, and the Altix 350 has two cores per node and 3 GB per
> core. As Maui is not aware of cpusets, this is what we did:
> 
> - we have separate queues for each machine,
> - we wrote a qsub script that calls the real qsub after some
> modification to the request,
> - the qsub script looks at the resources requested by the job,
> computes how many nodes are needed to satisfy the request in cpus and
> memory for the requested queue, and calls qalter accordingly (see the
> sketch after this list),
> - to find the number of processors requested, it looks at both
> -lncpus and -lnodes and takes the maximum,
> - we make sure that all the jobs have -lncpus *and* -lnodes, because
> with Torque 2.3.5 -lnodes is necessary for the correct creation of
> cpusets and -lncpus is necessary to have qstat -a show the number of
> processors,
> - every job gets complete memory nodes, even sequential jobs,
> - the request also gets a node property to make sure it runs on the
> right machine according to the queue requested,
> - the job is moved to a routing queue that selects an appropriate
> queue according to how much walltime and how many cores were
> requested,
> - we also modified pbs_mom to get rid of per-cpu cpusets, which take
> forever to remove on a big system, even when they are not used,
> - we also modified pbs_mom so that only the memory of the nodes
> obtained by the job is in the cpuset,
> - so there is one cpuset per job (jobs are restricted to one
> machine), which can be used for OpenMP or MPI jobs,
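
If I follow the node computation correctly, it amounts to something
like the Python sketch below. The per-queue numbers are just the ones
you listed above, and nodes_needed() is a name I made up, not anything
in Torque:

    import math

    # cores per node and GB per core for each machine, from your description
    QUEUES = {
        "altix4700": {"cores_per_node": 4, "gb_per_core": 2},
        "altix3700": {"cores_per_node": 2, "gb_per_core": 4},
        "altix350":  {"cores_per_node": 2, "gb_per_core": 3},
    }

    def nodes_needed(queue, ncpus, nodes, mem_gb):
        """Whole memory nodes needed to satisfy both cpus and memory."""
        q = QUEUES[queue]
        cpus = max(ncpus, nodes)  # the max of -lncpus and -lnodes
        by_cpu = math.ceil(cpus / q["cores_per_node"])
        by_mem = math.ceil(mem_gb / (q["gb_per_core"] * q["cores_per_node"]))
        return max(by_cpu, by_mem)  # every job gets complete memory nodes

    # e.g. a 1-cpu job asking for 10 GB on the 4700 (8 GB per node)
    # still gets two whole nodes: nodes_needed("altix4700", 1, 1, 10) == 2

Is that the gist of what the wrapper feeds to qalter?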

Just to clarify, do your users run OpenMP and MPI jobs on these systems? If so, are those jobs always contained on a single Altix machine?

> - as we have a four-cpu boot cpuset on the three partitions of the
> Altix 4700 that Torque does not recognize, we launch a dummy job
> using the four cpus and the memory of one node, which ends up on top
> of the boot cpuset,
> - if Torque honoured preexisting cpusets (like the boot cpuset), this
> could create problems, as sometimes cpusets are not erased correctly
> and new jobs could fail (as it is, they just sit on top of the faulty
> cpuset and run).

As a note, our current NUMA branch is set up to respect the boot cpuset.
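
Concretely, "respecting" it could look something like the sketch
below: read the cpus that belong to the boot cpuset and keep them out
of the pool handed to jobs. This is only an illustration; the
/dev/cpuset/boot path and the plain "cpus" file name are the usual SGI
conventions, not something I am quoting from our branch:

    def parse_cpu_list(s):
        """Expand a cpuset list like '0-3,8' into {0, 1, 2, 3, 8}."""
        cpus = set()
        for part in s.strip().split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.update(range(int(lo), int(hi) + 1))
            elif part:
                cpus.add(int(part))
        return cpus

    def allocatable_cpus(all_cpus, boot="/dev/cpuset/boot/cpus"):
        """All cpus minus the ones reserved by the boot cpuset."""
        try:
            with open(boot) as f:
                reserved = parse_cpu_list(f.read())
        except FileNotFoundError:
            reserved = set()  # no boot cpuset configured
        return sorted(set(all_cpus) - reserved)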

> 
> We would have liked to have sequential jobs take only one cpu, but if
> we were to have one sequential job run on cpu 0 only, a 16-cpu job
> might run on cpus 1-16, for example, as pbs_mom gives processors to
> jobs one after the other. Then another job could run on processors
> 17-32. Sharing memory nodes with multi-node jobs is very bad if one
> job uses too much memory in the master, for example: it can make the
> first parallel job swap even if it is perfectly well behaved.
> 

This is an area that, to me, is best managed by the scheduler, as the scenarios are complicated and people would probably want highly customizable policies.
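
To make the hazard concrete, here is a toy comparison (Python, names
are mine, assuming the 4700's four cores per memory node). Packing
cpus one after the other puts the sequential job and the head of the
16-cpu job on the same memory node, which is exactly where the swap
risk comes from; rounding every request up to whole nodes, as your
wrapper does, avoids the sharing:

    CORES_PER_NODE = 4  # Altix 4700 layout

    def node_of(cpu):
        return cpu // CORES_PER_NODE

    def pack_sequential(free_cpus, want):
        """cpu-by-cpu packing, one job after the other."""
        return free_cpus[:want]

    def pack_whole_nodes(free_cpus, want):
        """Round the request up to a node boundary (the site's workaround)."""
        ncpus = -(-want // CORES_PER_NODE) * CORES_PER_NODE
        return free_cpus[:ncpus]

    free = list(range(32))
    seq_job = pack_sequential(free, 1)   # gets cpu [0]
    free = [c for c in free if c not in seq_job]
    par_job = pack_sequential(free, 16)  # gets cpus 1..16

    shared = {node_of(c) for c in seq_job} & {node_of(c) for c in par_job}
    print(shared)  # {0}: both jobs draw memory from node 0

    free2 = list(range(32))
    seq2 = pack_whole_nodes(free2, 1)    # gets all of node 0: cpus [0, 1, 2, 3]
    # so a later 16-cpu job starts on a node boundary and shares nothing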

> That is pretty much it for our site, unless I forgot something.
> 
> -- Michel Béland, scientific computing analyst
> michel.beland at rqchp.qc.ca
> office S-250, pavillon Roger-Gaudry (main building), Université de
> Montréal
> telephone: 514 343-6111, ext. 3892; fax: 514 343-2155
> RQCHP (Réseau québécois de calcul de haute performance)
> www.rqchp.qc.ca

Thanks for your quick reply. The more information we have, the better off we are when making these decisions.

-- 
David Beer | Senior Software Engineer
Adaptive Computing


