[torqueusers] Job can not be allocated correctly

Weiguang Chen chenweiguang82 at gmail.com
Wed Mar 3 06:53:32 MST 2010


Hi,

The problem has been resolved by using the OSC mpiexec (
http://www.osc.edu/~pw/mpiexec/), so I recommend this program for running
parallel jobs.
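For anyone hitting the same symptom, a minimal job script using the OSC mpiexec might look like the sketch below (the resource request and the cpi binary follow the example later in this thread; the key point is that OSC mpiexec talks to pbs_mom through Torque's TM interface, so no machinefile or mpd ring is needed):

```shell
#!/bin/bash
#PBS -N cpi-test
#PBS -q batch
#PBS -l nodes=2:ppn=4

cd $PBS_O_WORKDIR
# The OSC mpiexec obtains the allocated slots from Torque's TM
# interface, so the 8 ranks are spread 4-per-node as requested.
mpiexec -n 8 $PBS_O_WORKDIR/cpi
```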

Thank you to everyone who paid attention to this problem.

Weiguang Chen
2010/3/3 Weiguang Chen <chenweiguang82 at gmail.com>

> Jim Coyle,
>
> What you thought is right: there are 8 CPUs (cores) per node on our cluster,
> and the problem is exactly that the job starts all 8 processes on one node
> first, even though I allocated 4 CPUs on one node and 4 on the other.
>
> If I set ppn=8 the problem doesn't happen, but that is not the allocation I
> want.
> 2010/3/3 Coyle, James J [ITACD] <jjc at iastate.edu>
>
>   Chen Weiguang,
>>
>>
>>
>>   In your mpd command I see ncpus=8.
>>
>> Doesn't that mean 8 CPUs per node?
>>
>> Perhaps mpiexec starts all 8 on one node first
>>
>> before starting any on other nodes.
>>
>>
>>
>>    I suggest that you try the script again with nodes=2:ppn=8
>>
>> and use mpiexec -n 16
>>
>>
>>
>> (or, if you cannot use ppn=8, just use mpiexec -n 16 … )
>>
>>
>>
>> -          Jim Coyle
>>
>>
>>
>> *From:* Weiguang Chen [mailto:chenweiguang82 at gmail.com]
>> *Sent:* Tuesday, March 02, 2010 1:56 PM
>> *To:* Coyle, James J [ITACD]; torqueusers maillist
>> *Subject:* Re: [torqueusers] Job can not be allocated correctly
>>
>>
>>
>> James,
>>
>> Thanks for your reply.
>> On our cluster, mpd was started as a daemon when the cluster was set up,
>> as follows:
>> 00:00:01 python2.4 /home/software/mpich2-1.1.1p1-intel/bin/mpd --daemon
>> --listenport=33013 --ncpus=8
>> Likewise, on the compute nodes the following command is executed:
>> /home/software/mpich2-1.1.1p1-intel/bin/mpd --daemon --host=node1
>> --port=33013 --ncpus=8
>> I also ran "mpdtrace -l", which shows that communication among these
>> nodes is fine.
>>
>> Given this setup, is it still necessary to use mpdboot and mpdexit?
>> I also tried adding those commands to my job script, following the URL in
>> your email, but the problem still occurs.
>>
>> 2010/3/2 Coyle, James J [ITACD] <jjc at iastate.edu>
>>
>> ChenWeiguang,
>>
>>
>>
>> MPICH is not aware of the nodes that the scheduler assigned.
>>
>>
>>
>> For MPICH-2 you need to use mpdboot at the beginning of your Torque
>> script and mpdexit
>>
>> at the end.  You can look at:
>>
>>
>>
>> http://beige.ucs.indiana.edu/I590/node58.html
>>
>>
>>
>> for an example of how these can be used in a PBS or Torque job.
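The pattern on that page, sketched here against a simulated $PBS_NODEFILE (the node names are invented; in a real job Torque writes this file for you), amounts to deriving the unique hosts for mpdboot and the slot count for mpiexec:

```shell
# Simulate the nodefile Torque would write for nodes=2:ppn=4
PBS_NODEFILE=$(mktemp)
printf 'node5\nnode5\nnode5\nnode5\nnode6\nnode6\nnode6\nnode6\n' > "$PBS_NODEFILE"

# The nodefile has one line per slot; mpdboot wants one line per host
sort -u "$PBS_NODEFILE" > mpd.hosts
NNODES=$(wc -l < mpd.hosts)
NPROCS=$(wc -l < "$PBS_NODEFILE")

# In a real job script these would be executed rather than echoed:
echo "mpdboot -n $NNODES -f mpd.hosts"
echo "mpiexec -n $NPROCS ./cpi"
echo "mpdallexit"
```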
>>
>>
>>
>>   If Torque was built with the TM interface, then if you install and use
>> OpenMPI
>>
>> you won't need these, as OpenMPI uses the TM interface to learn which
>>
>> nodes are assigned.  I changed from MPICH to OpenMPI when going from
>>
>> MPI-1 to MPI-2 because of this issue.
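Assuming an OpenMPI installation built with TM support (an assumption about your install; a build can be checked with something like `ompi_info | grep tm`), the job script then needs neither a hostfile nor any mpd commands:

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=4

cd $PBS_O_WORKDIR
# OpenMPI's mpirun asks pbs_mom (via TM) for the allocated slots,
# so the ranks honor the 4-per-node request automatically.
mpirun -np 8 ./cpi
```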
>>
>>
>>
>>  James Coyle, PhD
>>
>>  High Performance Computing Group
>>
>>  115 Durham Center
>>
>>  Iowa State Univ.           phone: (515)-294-2099
>>
>>  Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc
>>
>>
>>
>> *From:* torqueusers-bounces at supercluster.org [mailto:
>> torqueusers-bounces at supercluster.org] *On Behalf Of *Weiguang Chen
>> *Sent:* Tuesday, March 02, 2010 3:26 AM
>> *To:* torqueusers maillist
>> *Subject:* [torqueusers] Job can not be allocated correctly
>>
>>
>>
>> Hi, all
>>
>> In fact, I am not sure whether Torque or MPICH causes this problem. Let me
>> illustrate it with an example script:
>> #!/bin/bash
>> ### Job name
>> #PBS -N name
>> #PBS -q batch
>> ### number of nodes and processes per node
>> #PBS -l nodes=2:ppn=4
>> ### Job's error output
>> #PBS -e error
>> ### Job's general output
>> #PBS -o stdout
>>
>> cd $PBS_O_WORKDIR
>> echo "Job begin at "`date`
>> # program examples
>> mpiexec -n 8 $PBS_O_WORKDIR/cpi
>> echo "Job stop at "`date`
>>
>> exit 0
>>
>> cpi is an example program in the MPICH package. Each node in our cluster
>> has two processors with 4 cores each, i.e. 8 cores per node. But when I
>> submit this job, the output is as follows:
>> Process 0 on node5
>> Process 1 on node5
>> Process 2 on node5
>> Process 3 on node5
>> Process 5 on node5
>> Process 6 on node5
>> Process 4 on node5
>> Process 7 on node5
>>
>> All processes ran on one node, even though I allocated 2 nodes. I don't
>> know what causes this or how to solve it.
>> Thanks
>>
>> PS: Torque 2.4.6, MPICH2 1.1.1p1, mpiexec 0.83
>> --
>> Best Wishes
>> ChenWeiguang
>>
>> ************************************************
>> #               Chen, Weiguang
>> #
>> #    Postgraduate,  Ph. D
>> #  75 University Road, Physics Building  #  218
>> #  School of Physics & Engineering
>> #  Zhengzhou University
>> #  Zhengzhou, Henan 450052  CHINA
>> #
>> #  Tel: 86-13203730117;
>> #  E-mail: chenweiguang82 at gmail.com;
>> #            chenweiguang82 at qq.com
>> #
>> **********************************************
>>
>>
>>
>>
>> --
>> Best Wishes
>> ChenWeiguang
>>
>>
>
>
>
> --
> Best Wishes
> ChenWeiguang
>
>
>


-- 
*****************************************************************************
*        Chen, Weiguang   (PhD Student)
*   Laboratory of Condensed Matter Theory and Computational Materials &
*   School of Physics and Engineering
*   75 North University Road, Physics Building  Rm#202
*   Zhengzhou University, Zhengzhou, 450052 Henan, China
*
*   Tel: 86-13203730117, 86-13783677861; Fax: 86-371-67767758;
*   Email: chenweiguang82 at gmail.com
****************************************************************************

