[torqueusers] Job can not be allocated correctly

Coyle, James J [ITACD] jjc at iastate.edu
Tue Mar 2 14:06:33 MST 2010


Chen Weiguang,

  In your mpd command I see ncpus=8.
Doesn't that mean 8 cpus per node?
Perhaps mpiexec starts all 8 on one node first
before starting any on other nodes.

   I suggest that you try the script again with nodes=2:ppn=8
and use mpiexec -n 16.

(Or, if you cannot use ppn=8, just use mpiexec -n 16 ...)
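
A minimal sketch of what the revised job script could look like, assuming the same queue and cpi binary as in the original script. Deriving the process count from $PBS_NODEFILE instead of hard-coding 16 is an addition for illustration, and count_slots is a hypothetical helper name:

```shell
#!/bin/bash
#PBS -N name
#PBS -q batch
#PBS -l nodes=2:ppn=8

# Torque writes one line per allocated core to $PBS_NODEFILE,
# so nodes=2:ppn=8 should give 16 lines.
count_slots() {
    # $1: path to a Torque node file; prints the number of non-empty lines
    grep -c . "$1"
}

# Only launch when actually running inside a Torque job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    NP=$(count_slots "$PBS_NODEFILE")
    mpiexec -n "$NP" ./cpi
fi
```

With ncpus=8 on every mpd, 16 processes is the smallest count that has to spill onto a second node if mpiexec really does fill one node first.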


-          Jim Coyle

From: Weiguang Chen [mailto:chenweiguang82 at gmail.com]
Sent: Tuesday, March 02, 2010 1:56 PM
To: Coyle, James J [ITACD]; torqueusers maillist
Subject: Re: [torqueusers] Job can not be allocated correctly

James,

Thanks for your reply.
In our cluster, mpd was started as a daemon when the cluster was set up, as follows:
00:00:01 python2.4 /home/software/mpich2-1.1.1p1-intel/bin/mpd --daemon --listenport=33013 --ncpus=8
Likewise, on the compute nodes, the following command is executed:
/home/software/mpich2-1.1.1p1-intel/bin/mpd --daemon --host=node1 --port=33013 --ncpus=8
I also ran the command mpdtrace -l, which shows that communication between these nodes works.

With this setup, is it still necessary to use mpdboot and mpdexit?
I also tried adding those commands to my job script, following the URL in your email, but the problem still occurs.
2010/3/2 Coyle, James J [ITACD] <jjc at iastate.edu>
ChenWeiguang,

MPICH is not aware of the nodes that the scheduler assigned.

For MPICH2 you need to use mpdboot at the beginning of your Torque script and mpdexit
at the end.  You can look at:

http://beige.ucs.indiana.edu/I590/node58.html

for an example of how these can be used in a PBS or Torque job.
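
A rough sketch of that pattern, not taken from the linked page: the hosts file name mpd.hosts and the helper unique_hosts are placeholders chosen here, and mpdallexit (which shuts down the whole ring) is used at the end; adjust to whatever your MPICH2 installation provides.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=4

# Reduce the Torque node file (one line per core) to one line per host,
# which is the form used for the mpdboot hosts file in this sketch.
unique_hosts() {
    # $1: path to a Torque node file
    sort -u "$1"
}

# Only run inside a Torque job on a machine that has the MPD tools.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpdboot >/dev/null 2>&1; then
    cd "$PBS_O_WORKDIR" || exit 1
    unique_hosts "$PBS_NODEFILE" > mpd.hosts
    NNODES=$(grep -c . mpd.hosts)        # number of distinct hosts
    NP=$(grep -c . "$PBS_NODEFILE")      # number of allocated cores

    mpdboot -n "$NNODES" -f mpd.hosts    # start one mpd per allocated node
    mpiexec -n "$NP" ./cpi               # run on the ring just started
    mpdallexit                           # shut the ring down again
fi
```

The point of building the hosts file from $PBS_NODEFILE is that the mpd ring then contains only the nodes Torque assigned to this job.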

  If Torque was built with the tm interface, then if you install and use OpenMPI
you won't need these, as the TM interface is used by OpenMPI to know which
nodes are assigned.  I changed from MPICH to OpenMPI when going from
MPI-1 to MPI-2 due to this issue.
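
One way to check whether an Open MPI installation was built with TM support is to look for a "tm" component in the ompi_info output. The sketch below assumes the usual "MCA <framework>: <component>" line format; has_tm_component is a hypothetical helper name:

```shell
# Succeeds if ompi_info-style text on stdin lists a "tm" component
# for the ras or plm framework.
has_tm_component() {
    grep -Eq 'MCA (ras|plm): tm( |$)'
}

# Report only when ompi_info is actually installed.
if command -v ompi_info >/dev/null 2>&1 && ompi_info | has_tm_component; then
    echo "Open MPI reports Torque (tm) support"
fi
```

If the check succeeds, running mpirun ./cpi inside a Torque job with no -np option and no host file should start one process per allocated slot.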

 James Coyle, PhD
 High Performance Computing Group
 115 Durham Center
 Iowa State Univ.           phone: (515)-294-2099
 Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Weiguang Chen
Sent: Tuesday, March 02, 2010 3:26 AM
To: torqueusers maillist
Subject: [torqueusers] Job can not be allocated correctly

Hi, all

In fact, I am not sure whether Torque or MPICH causes this problem. I will illustrate it with the following example script:
#!/bin/bash
### Job name
#PBS -N name
#PBS -q batch
### number of nodes and processes per node
#PBS -l nodes=2:ppn=4
### Job's error output
#PBS -e error
### Job's general output
#PBS -o stdout

cd $PBS_O_WORKDIR
echo "Job begin at "`date`
# program examples
mpiexec -n 8 $PBS_O_WORKDIR/cpi
echo "Job stop at "`date`

exit 0

cpi is an example program in the MPICH package. Each node in our cluster has two processors with 4 cores each, i.e. 8 cores per node. But when I submit this job, its output is as follows:
Process 0 on node5
Process 1 on node5
Process 2 on node5
Process 3 on node5
Process 5 on node5
Process 6 on node5
Process 4 on node5
Process 7 on node5

All processes ran on one node, although I requested 2 nodes. I don't know what causes this or how to solve it.
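
To check the placement without reading the output by eye, a small tally helps. This is only a sketch: count_ranks_per_node is a hypothetical helper name, and it assumes cpi lines of the form "Process N on host" as shown above.

```shell
# Print "host count" for every host that appears in cpi-style output.
count_ranks_per_node() {
    # $1: job output file containing lines like "Process 3 on node5"
    awk '$1 == "Process" && $3 == "on" { n[$4]++ }
         END { for (h in n) print h, n[h] }' "$1" | sort
}
```

For example, count_ranks_per_node stdout (the file named by #PBS -o) prints one line per host; for the output above there is a single line for node5, where a correct 2 x 4 placement would give two lines.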
Thanks

PS: Torque version:2.4.6, mpich:2-1.1.1p1, mpiexec:0.83
--
Best Wishes
ChenWeiguang

************************************************
#               Chen, Weiguang
#
#    Postgraduate,  Ph. D
#  75 University Road, Physics Building  #  218
#  School of Physics & Engineering
#  Zhengzhou University
#  Zhengzhou, Henan 450052  CHINA
#
#  Tel: 86-13203730117;
#  E-mail:chenweiguang82 at gmail.com;
#            chenweiguang82 at qq.com
#
**********************************************


