[torqueusers] MPICH2 support in TORQUE: do I need to run mpdboot myself?

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Wed Jul 2 08:53:50 MDT 2008


Hi Frank and James,

Frank Mietke wrote:
> Hi Martin,
> 
>> the docs at http://www.clusterresources.com/torquedocs21/7.1mpi.shtml
>> mention something about MPICH2 but it is just not enough.
>>
>>  I have found some scripts used by people, notably:
>> http://www.cct.lsu.edu/~hsunda3/doc/#id241705
>>
>>  mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
>>  mpiexec -n `cat $PBS_NODEFILE | wc -l` a.out
>>  mpdallexit
> 
> this should work after doing "qsub -I ..." or place it in a job file for "qsub
> <job.script>". Instead of "--totalnum=.." you could use "-n .." as well, should
> be the same.
>> but that gives me:
>>
>> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>>    probable cause:  no mpd daemon on this machine
>>    possible cause:  unix socket /tmp/mpd2.console_root has been removed
>> mpiexec_node004 (__init__ 1190): forked process failed; status=255
> 
> Try the commands above in an interactive environment if possible to see if it
> works correctly. The "mpdboot" program has also some switches to turn on
> verbose or debugging mode to see what's going on.

So, I had somehow set a variable like USE_MPD_ROOT (or something similar) to 1
in ~/.mpd.conf. I removed it, and now mpdboot creates the
/tmp/mpd2.console_mmokrejs socket instead of /tmp/mpd2.console_root.
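
To see which console socket is actually present, a quick check along these
lines may help. It is only a sketch: mpd_console_path and has_mpd_console are
made-up helper names, and the /tmp/mpd2.console_<user> pattern is simply taken
from the messages quoted in this thread.

```shell
#!/bin/sh
# Sketch: look for the mpd console socket of a given user.
# The path pattern is copied from the error messages in this thread;
# the helper names are illustrative, not MPICH2 commands.

mpd_console_path() {
    echo "/tmp/mpd2.console_$1"
}

has_mpd_console() {
    # -S is true only if the path exists and is a socket.
    [ -S "$(mpd_console_path "$1")" ]
}

user=${USER:-$(id -un)}
if has_mpd_console "$user"; then
    echo "mpd console socket found: $(mpd_console_path "$user")"
else
    echo "no mpd console socket for $user"
fi
```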

Since then, the

mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
mpiexec -n `cat $PBS_NODEFILE | wc -l` a.out
mpdallexit

approach works fine for me (mpich2-1.0.6).
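
For reference, the three commands could be wrapped in a job script roughly as
follows. This is only a sketch: count_hosts and count_slots are helper names
invented for illustration, sort -u is used in place of uniq so that the host
count does not depend on the order of lines in the node file, and the launch
part runs only when TORQUE has set PBS_NODEFILE and mpdboot is installed.

```shell
#!/bin/sh
# Sketch of a TORQUE job script for MPICH2 with the mpd process manager.
# count_hosts and count_slots are illustrative helper names, not part of
# TORQUE or MPICH2.

# Number of distinct hosts in a node file: one mpd is started per host.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Number of lines in a node file: one MPI process per allocated slot.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Launch only inside a TORQUE job (PBS_NODEFILE set) with mpd available.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpdboot >/dev/null 2>&1; then
    mpdboot --totalnum="$(count_hosts "$PBS_NODEFILE")" -f "$PBS_NODEFILE"
    mpiexec -n "$(count_slots "$PBS_NODEFILE")" a.out
    mpdallexit
fi
```

With a node file listing node001 twice and node002 twice, count_hosts gives 2
and count_slots gives 4.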


>> https://www.liniac.upenn.edu/wiki/tiki-index.php?page=LAM+with+Torque
>> http://wwwas.oat.ts.astro.it/planck/index.php?option=com_content&task=view&id=30&Itemid=46
>>
>> From this Open MPI FAQ (http://www.open-mpi.org/faq/?category=tm) which Garrick
>> pointed out in the mailing list archives it is still unclear how it works
> 
> OpenMPI has to be configured and built in such a way that it could find
> the torque libraries for activating the appropriate support. Then the job could be
> started as simply as typing "mpirun <executable>".

What about MPICH2, which is what I use? With mpirun, could I forget about the
environment variable tricks and the manual mpdboot startup? Currently, MPICH2
gives me the following:


$ mpirun -np $NUM_NODES -machinefile $PBS_NODEFILE mb -i a.nex
mpiexec_node004: cannot connect to local mpd (/tmp/mpd2.console_mmokrejs); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
$

So clearly I have to start mpdboot myself and must not forget to run
mpdallexit after the job is over. :( No, I do not want to put
mpdallexit into my .logout. Maybe it is time to learn about those job
post-execution scripts? Do I have to re-invent the wheel?
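
One way to avoid relying on .logout might be a shell trap inside the job
script itself, so that the cleanup runs whenever the script exits. A minimal
sketch, assuming the job script is run by a POSIX shell; stop_mpd_ring and
run_with_cleanup are names invented here, and a job that is killed outright
with SIGKILL would still skip the cleanup.

```shell
#!/bin/sh
# Sketch: make sure the mpd ring is shut down when the job command ends.
# stop_mpd_ring and run_with_cleanup are illustrative names, not MPICH2
# or TORQUE commands.

stop_mpd_ring() {
    # Only call mpdallexit where it is actually installed.
    if command -v mpdallexit >/dev/null 2>&1; then
        mpdallexit
    fi
    echo "mpd cleanup done"
}

# Run a command in a subshell whose EXIT trap performs the cleanup.
run_with_cleanup() {
    (
        trap stop_mpd_ring EXIT
        # Turn TERM/INT into a normal exit so the EXIT trap also fires
        # when the job is cancelled.
        trap 'exit 1' TERM INT
        "$@"
        exit $?
    )
}
```

In a job script this would be used as, for example,
run_with_cleanup mpiexec -n 4 a.out, after mpdboot has started the ring.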


>> (how to configure the systems with Mpich2+Torque to get this behaviour). And,
>> nowhere was it explained how to start a parallel computation using the mpiexec
>> bundled in mpich2 and how it differs from using the mpiexec from
>> http://www.osc.edu/~pw/mpiexec
> 
> This is the replacement for the mpdboot/mpirun(mpiexec)/mpdallexit cycle when using
> torque. In this case you simply start your job with the new "mpiexec" and need
> no mpdboot/mpdallexit. Everything is done over the pbs_moms.

So why isn't this clearly explained in one place altogether? ;-)

> 
>>
>> Would be nice if somebody could clarify how to configure and use what.
> 
> Hope this helps.

Sure, thanks!
Martin

