[torqueusers] Job can not be allocated correctly

Weiguang Chen chenweiguang82 at gmail.com
Tue Mar 2 12:56:27 MST 2010


James,

Thanks for your reply.
In our cluster, mpd was started as a daemon when the cluster was set up,
as follows:
00:00:01 python2.4 /home/software/mpich2-1.1.1p1-intel/bin/mpd --daemon
--listenport=33013 --ncpus=8
Likewise, on the compute nodes the following command is executed:
/home/software/mpich2-1.1.1p1-intel/bin/mpd --daemon --host=node1
--port=33013 --ncpus=8
I also ran the command mpdtrace -l, which shows that communication among
these nodes is fine.
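
For reference, a quick check from inside a job (a minimal sketch, assuming the
$PBS_NODEFILE host list that Torque writes and MPICH2's mpdtrace being in the
PATH) to compare the Torque allocation with the mpd ring would be:

#!/bin/bash
#PBS -l nodes=2:ppn=4
cd $PBS_O_WORKDIR
# Nodes Torque assigned to this job (the nodefile has one line per allocated core)
echo "Torque allocation:"
sort -u $PBS_NODEFILE
# Nodes in the mpd ring that mpiexec will actually use
echo "mpd ring:"
mpdtrace -l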

Given this setup, is it still necessary to use mpdboot and mpdexit?
I also tried adding those commands to my job script, following the URL in your
email, but the problem still occurs.

2010/3/2 Coyle, James J [ITACD] <jjc at iastate.edu>

>  ChenWeiguang,
>
> MPICH is not aware of the nodes that the scheduler assigned.
>
> For MPICH-2 you need to use mpdboot at the beginning of your Torque script
> and mpdexit at the end. You can look at:
>
> http://beige.ucs.indiana.edu/I590/node58.html
>
> for an example of how these can be used in a PBS or Torque job.
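>
> A minimal sketch of such a script (assuming MPICH2's mpdboot/mpdallexit and
> the $PBS_NODEFILE host list that Torque writes for each job) might look like:
>
> #!/bin/bash
> #PBS -l nodes=2:ppn=4
> cd $PBS_O_WORKDIR
> # Build a deduplicated host list from the nodes Torque assigned to this job
> sort -u $PBS_NODEFILE > mpd.hosts
> NNODES=$(wc -l < mpd.hosts)
> # Boot an mpd ring on exactly those nodes, run the job, then shut the ring down
> mpdboot -n $NNODES -f mpd.hosts
> mpiexec -n 8 $PBS_O_WORKDIR/cpi
> mpdallexit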
>
>   If Torque was built with the tm interface, then if you install and use
> OpenMPI you won't need these, as the TM interface is used by OpenMPI to know
> which nodes are assigned.  I changed from MPICH to OpenMPI when going from
> MPI-1 to MPI-2 due to this issue.
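>
> For comparison, a minimal sketch of the OpenMPI route (assuming OpenMPI was
> built with TM support, e.g. configured with --with-tm) would be:
>
> #!/bin/bash
> #PBS -l nodes=2:ppn=4
> cd $PBS_O_WORKDIR
> # With TM support there is no machinefile and no mpd ring to manage:
> # mpirun asks Torque which nodes and slots belong to this job and launches there
> mpirun -np 8 $PBS_O_WORKDIR/cpi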
>
>  James Coyle, PhD
>  High Performance Computing Group
>  115 Durham Center
>  Iowa State Univ.           phone: (515)-294-2099
>  Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc
>
> From: torqueusers-bounces at supercluster.org On Behalf Of Weiguang Chen
> Sent: Tuesday, March 02, 2010 3:26 AM
> To: torqueusers maillist
> Subject: [torqueusers] Job can not be allocated correctly
>
> Hi, all
>
> In fact, I am not sure whether Torque or MPICH is causing this problem. I will
> describe it with the following example script:
> #!/bin/bash
> ### Job name
> #PBS -N name
> #PBS -q batch
> ### number of nodes and processes per node
> #PBS -l nodes=2:ppn=4
> ### Job's error output
> #PBS -e error
> ### Job's general output
> #PBS -o stdout
>
> cd $PBS_O_WORKDIR
> echo "Job begin at "`date`
> # program examples
> mpiexec -n 8 $PBS_O_WORKDIR/cpi
> echo "Job stop at "`date`
>
> exit 0
>
> cpi is an example program from the MPICH package. Each node in our cluster
> has two processors with 4 cores each, i.e. 8 cores per node. But when I
> submit this job, the output is as follows:
> Process 0 on node5
> Process 1 on node5
> Process 2 on node5
> Process 3 on node5
> Process 5 on node5
> Process 6 on node5
> Process 4 on node5
> Process 7 on node5
>
> All processes ran on one node, even though I allocated 2 nodes. I don't know
> what causes this or how to solve it.
> Thanks
>
> PS: Torque version: 2.4.6, MPICH2: 1.1.1p1, mpiexec: 0.83
> --
> Best Wishes
> ChenWeiguang
>
> ************************************************
> #               Chen, Weiguang
> #
> #    Postgraduate,  Ph. D
> #  75 University Road, Physics Building  #  218
> #  School of Physics & Engineering
> #  Zhengzhou University
> #  Zhengzhou, Henan 450052  CHINA
> #
> #  Tel: 86-13203730117;
> #  E-mail: chenweiguang82 at gmail.com;
> #            chenweiguang82 at qq.com
> #
> **********************************************
>



-- 
Best Wishes
ChenWeiguang

************************************************
#               Chen, Weiguang
#
#    Postgraduate,  Ph. D
#  75 University Road, Physics Building  #  218
#  School of Physics & Engineering
#  Zhengzhou University
#  Zhengzhou, Henan 450052  CHINA
#
#  Tel: 86-13203730117;
#  E-mail: chenweiguang82 at gmail.com;
#            chenweiguang82 at qq.com
#
**********************************************