[torqueusers] NCPUS environment variable?

Robin Humble rjh at cita.utoronto.ca
Tue Jul 12 15:34:05 MDT 2005


On Tue, Jul 12, 2005 at 11:06:54AM -0700, Martin Siegert wrote:
>On Mon, Jul 11, 2005 at 09:24:37PM -0400, Robin Humble wrote:
>> and LAM queries PBS/torque for the node list via the tm interface and
>Unfortunately that is not an option: we support a bunch of different
>architectures, most systems come with the vendor's MPI distribution
>(e.g., the p595 IBM systems); others offer more than one MPI implementation

using the tm interface is probably still the best long-term solution.
maybe hassle your vendors?

>> you can do nodes=8:ppn=1, and in maui.cfg set 
>>   NODEACCESSPOLICY        SHARED
>> and make sure JOBNODEMATCHPOLICY is unset, and that'll work.
>>   http://www.clusterresources.com/products/maui/docs/a.fparameters.shtml
>
>Thanks! I'll try that. But in principle torque/maui (or torque/moab in our case)
>should support 3 different cases:
>
>1) nodes=4:ppn=2
>2) nodes=8:ppn=1
>   (e.g., for MPI jobs that require the full network bandwidth between
>   processes, but can tolerate a serial job on the other processor in the
>   nodes)
>3) ncpus=8
>   (the don't-care choice)
>
>As far as I can tell your solution supports 1 and 3, but not 2, correct?

yes. with the SHARED maui option, case 2 becomes the same as case 3 -
the 8 cpus are scheduled anywhere.

however, with both SHARED and
  JOBNODEMATCHPOLICY      EXACTNODE
(which is what we use), your cases 1 and 2 are handled correctly.
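
i.e. the relevant maui.cfg lines, together:
  NODEACCESSPOLICY        SHARED
  JOBNODEMATCHPOLICY      EXACTNODE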

I have no idea about case 3 - maybe try it and see! :)
it would be great if it worked...
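
a quick way to check would be an interactive job:
  qsub -I -l ncpus=8
  cat $PBS_NODEFILE
and see how many cpus the scheduler actually hands out and what lands
in the node file.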


your case 2 is a bit scary though - you really want to make sure that
the second job that PBS puts onto the shared nodes has small network
requirements.
perhaps there's a way to request a fraction of the network bandwidth
per node, like there is with memory per node; if there is, then that
would likely be the best way to make case 2 work effectively.
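
for comparison, the per-process memory request that does exist looks
something like this (500mb is just a made-up figure, job.sh a made-up
script name):
  qsub -l nodes=8:ppn=1,pmem=500mb job.sh
as far as I know there's no equivalent standard resource for network
bandwidth though, so that part is wishful thinking for now.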

we just use nodes=8:ppn=2 for jobs that fall into case 2, and then run
with (LAM syntax again)
  mpirun N
instead of the usual mpirun C. The 'N' says to fire up one process per
node instead of one per cpu. That way the job reserves a whole node's
worth of network (and memory), and nothing conflicting can be scheduled
onto it. low-tech, but it works.
mpirun N is also our solution for OpenMP + MPI jobs that want to fire up
multiple threads per node.
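
for concreteness, a minimal script for such a hybrid job might look
like this (the application name is made up, and with the tm boot module
the explicit lamboot may not even be needed):
  #PBS -l nodes=8:ppn=2
  lamboot $PBS_NODEFILE
  export OMP_NUM_THREADS=2
  mpirun N ./my_hybrid_app
  lamhalt
i.e. one MPI process per node, each with 2 OpenMP threads to fill both
cpus.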

cheers,
robin
--
    Robin Humble       http://www.cita.utoronto.ca/~rjh/

