[torqueusers] mpiexec not running on requested # of processors

Gus Correa gus at ldeo.columbia.edu
Thu Oct 9 09:16:33 MDT 2008


Hi Mary Ellen and list

There seems to be some confusion among three different numbers:

A) the number of CPUs/cores per node requested from PBS (4),
B) the number of nodes requested from PBS (6), and
C) the number of processes that will run your executable.

The last of these is launched by mpiexec and controlled by the -np (or -n)
option. It should be 24 (= 6*4) if you want neither to oversubscribe nor to
undersubscribe the CPUs/cores you requested from PBS.

I think your mpiexec command should use 24 processes rather than 6, i.e.:

mpiexec -np 24 /fs/userB1/mfitzpat/mpi_test

If you don't want to hardwire the "24",
you could also use your $NP variable
(a count of the lines in $PBS_NODEFILE, which is just nodes*ppn = 6*4 = 24):

mpiexec -np $NP /fs/userB1/mfitzpat/mpi_test
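
For what it's worth, the relevant part of the PBS script could then look
roughly like the sketch below. I have not run it on your cluster. The name
count_slots is a helper I made up for illustration, and the check on
PBS_NODEFILE is there only so that nothing is launched outside a Torque job.

```shell
#!/bin/sh
#PBS -l nodes=6:ppn=4

# count_slots: hypothetical helper (not part of Torque or MPICH2).
# Prints the number of lines in the node file given as $1.
# Torque writes one line per requested core, so this is nodes*ppn.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# PBS_NODEFILE is set by Torque only inside a running job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    NP=$(count_slots "$PBS_NODEFILE")
    echo "Number of processors is $NP"
    # One MPI process per requested core (24 for nodes=6:ppn=4).
    mpiexec -np "$NP" /fs/userB1/mfitzpat/mpi_test
fi
```

Reading the file through "<" keeps the file name out of the wc output, so
the awk step is not needed.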

I hope this helps,
Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Mary Ellen Fitzpatrick wrote:

> Hi,
> I am having trouble getting mpich2 to use all of the processors on the 
> number of nodes I specify.  I am running torque-2.3.2 and mpich2-1.0.7 
> on dual-dual core nodes.  My nodes file is defined as node1001 np=4, 
> node1002 np=4, etc.  I have started mpd on all of the nodes from the 
> head node.
>
> In my pbs script, I want my code (simple pi script) to run on 6 nodes 
> and use all 4 processors (dual-dual core CPUs).
> snippet of my pbs script:
> #PBS -l nodes=6:ppn=4
> # How many procs do I have?
> NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
> echo Number of processors is $NP
> #Run on nodes
> mpiexec -np 6 /fs/userB1/mfitzpat/mpi_test
>
> output:
> Begin PBS Prologue Wed Oct  8 15:35:40 EDT 2008 1223494540
> Job ID:         90.nona-man
> Username:       mfitzpat
> Group:          umass
> Nodes:          node1043 node1044 node1045 node1046 node1047 node1048
>
> Number of processors is 24
> Process 0 on node1048
> Process 1 on node1047
> Process 2 on node1001
> Process 3 on node1009
> Process 4 on node1026
> Process 5 on node1029
> pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> wall clock time = 0.004853
>
> It says from above the nodes used were node1043-node1048, but it 
> appears to have run on nodes 1001, 1009, 1026, 1029, 1047 and 1048.
> Looks like it only ran 6 processes instead of 24. 
> If I specify 24 instead of 6 in my command: mpiexec -np 6 
> /fs/userB1/mfitzpat/mpi_test
> Then the job hangs.
>
> any ideas where I am making the mistake?
>


