[torqueusers] MPICH2 support in TORQUE: do I need to run mpdboot myself?

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Wed Jul 2 11:21:03 MDT 2008



Martin MOKREJŠ wrote:
> Hi Frank and James,
> 
> Frank Mietke wrote:
>> Hi Martin,
>>
>>> the docs at http://www.clusterresources.com/torquedocs21/7.1mpi.shtml
>>> mention something about MPICH2 but it is just not enough.
>>>
>>>  I have found some scripts used by people, notably:
>>> http://www.cct.lsu.edu/~hsunda3/doc/#id241705
>>>
>>>  mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
>>>  mpiexec -n `cat $PBS_NODEFILE | wc -l` a.out
>>>  mpdallexit
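
For the record, a complete job script built around that pattern might look
like this (just a sketch; the resource requests and the program name a.out
are placeholders for whatever the job actually needs):

  #!/bin/sh
  #PBS -l nodes=2:ppn=2
  #PBS -j oe

  cd $PBS_O_WORKDIR

  # one mpd daemon per node, so count the unique hostnames in $PBS_NODEFILE
  NNODES=`sort $PBS_NODEFILE | uniq | wc -l`
  # one MPI process per allocated processor (the file has one line per processor)
  NPROCS=`cat $PBS_NODEFILE | wc -l`

  mpdboot --totalnum=$NNODES -f $PBS_NODEFILE
  mpiexec -n $NPROCS ./a.out
  mpdallexit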

Finally, some details from MPICH2 docs:
http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-doc-user.pdf

<quote>
5.7 Using MPICH2 with SLURM and PBS

MPICH2 can be used in both SLURM and PBS environments. If configured
with SLURM, use the srun job launching utility provided by SLURM. For
PBS, MPICH2 jobs can be launched in two ways: (i) using MPD or (ii)
using the OSC mpiexec.

5.7.1 MPD in the PBS environment

PBS specifies the machines allocated to a particular job in the file $PBS_NODEFILE.
But the format used by PBS is different from that of MPD. Specifically, PBS
lists each node on a single line; if a node (n0) has two processors, it is listed
twice. MPD, on the other hand, uses an identifier (ncpus) to describe how
many processors a node has, so if n0 has two processors, it is listed as n0:2.
One way to convert the node file to the MPD format is as follows:

sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }' > mpd.nodes

Once the PBS node file is converted, MPD can be started normally within the
PBS job script using mpdboot and torn down using mpdallexit.

mpdboot -f mpd.nodes -n [NUM_NODES_REQUESTED]
mpiexec -n [NUM_PROCESSES] ./my_test_program
mpdallexit

5.7.2 OSC mpiexec

Pete Wyckoff from the Ohio Supercomputer Center provides an alternate
utility called OSC mpiexec to launch MPICH2 jobs on PBS systems without
using MPD. More information can be found here: http://www.osc.edu/~pw/mpiexec
</quote>


So the approach in section 5.7.1 of the PDF document looks different from
the script above: it first converts the node file into a file with a
different syntax (node:ncpus) before calling mpdboot.
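
To make the difference concrete, for two dual-processor nodes the PBS file
and the converted MPD file would look roughly like this (hostnames made up):

  $ cat $PBS_NODEFILE
  node004
  node004
  node005
  node005

  $ sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }' > mpd.nodes
  $ cat mpd.nodes
  node004:2
  node005:2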


>>> From this Open MPI FAQ (http://www.open-mpi.org/faq/?category=tm),
>>> which Garrick pointed out in the mailing list archives, it is still
>>> unclear how it works
>>
>> OpenMPI has to be configured and built in such a way that it can find
>> the TORQUE libraries and activate the appropriate support. Then the job
>> can be started as simply as typing "mpirun <executable>".
> 
> What about MPICH2, which I use? With mpirun, could I forget about the
> environment variable tricks and the manual startup of mpdboot? Currently,
> I get the following with MPICH2:
> 
> 
> $ mpirun -np $NUM_NODES -machinefile $PBS_NODEFILE mb -i a.nex
> mpiexec_node004: cannot connect to local mpd 
> (/tmp/mpd2.console_mmokrejs); possible causes:
>  1. no mpd is running on this host
>  2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>    mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
> $
> 
> So clearly I have to start mpdboot myself and must not forget to run
> mpdallexit after the job is over. :( No, I do not want to place
> mpdallexit into my .logout. Maybe it is time to learn about those job
> post-execution scripts? Do I have to reinvent the wheel?

Seems I will have to, since the MPICH2 docs give no hint on how to do this
automatically on behalf of the user. Or maybe I will manage to get OSC
mpiexec compiled and configured.
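
In the meantime, letting the shell guarantee the teardown inside the job
script itself seems the least painful option. A sketch, assuming a
Bourne-style job script and my mb run from above:

  #!/bin/sh
  # run mpdallexit whenever the script exits, whether mb succeeded or not
  trap 'mpdallexit' EXIT

  mpdboot --totalnum=`sort $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
  mpiexec -n `cat $PBS_NODEFILE | wc -l` mb -i a.nex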

Martin


