[torqueusers] MPICH2 support in TORQUE: do I need to run mpdboot
mmokrejs at ribosome.natur.cuni.cz
Wed Jul 2 08:53:50 MDT 2008
Hi Frank and James,
Frank Mietke wrote:
> Hi Martin,
>> the docs at http://www.clusterresources.com/torquedocs21/7.1mpi.shtml
>> mention something about MPICH2 but it is just not enough.
>> I have found some scripts used by people, notably:
>> mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
>> mpiexec -n `cat $PBS_NODEFILE | wc -l` a.out
> this should work after doing "qsub -I ..." or place it in a job file for "qsub
> <job.script>". Instead of "--totalnum=.." you could use "-n .." as well, should
> be the same.
>> but that gives me:
>> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>> probable cause: no mpd daemon on this machine
>> possible cause: unix socket /tmp/mpd2.console_root has been removed
>> mpiexec_node004 (__init__ 1190): forked process failed; status=255
> Try the commands above in an interactive environment if possible to see if it
> works correctly. The "mpdboot" program has also some switches to turn on
> verbose or debugging mode to see what's going on.
So, it turned out that my ~/.mpd.conf somehow set a variable like USE_MPD_ROOT (or
something similar) to 1. I removed that, and now mpdboot creates /tmp/mpd2.console_mmokrejs.
Since then, the
mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
mpiexec -n `cat $PBS_NODEFILE | wc -l` a.out
approach works fine for me (mpich2-1.0.6).
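
For reference, the two commands above could sit in a job script roughly like the one below. This is only a sketch: the "#PBS -l nodes=2:ppn=2" request, the helper names count_hosts and count_slots, and the ./a.out binary are placeholders of mine, and the guard around the mpd commands is there so the script does nothing outside a TORQUE job or on a machine without MPICH2's mpd tools.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2
# Sketch of a TORQUE job script for MPICH2 with the mpd process manager.
# The resource request above and ./a.out are placeholders.

# One mpd daemon per distinct host; sort -u also copes with a node file
# whose repeated host names are not adjacent.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# One MPI rank per line (slot) of the node file.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Only act inside a TORQUE job on a machine that has MPICH2's mpd tools.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpdboot >/dev/null 2>&1; then
    mpdboot -n "$(count_hosts "$PBS_NODEFILE")" -f "$PBS_NODEFILE"
    mpiexec -n "$(count_slots "$PBS_NODEFILE")" ./a.out
    mpdallexit
fi
```

The only change from the one-liners is that the counting is done with sort -u and wc -l on the file directly instead of cat piped through uniq; for a node file that lists each host in consecutive lines both give the same numbers.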
>> From this Open MPI FAQ (http://www.open-mpi.org/faq/?category=tm) which Garrick
>> pointed out in the mailing list archives it is still unclear how it works
> OpenMPI has to be configured and built in such a way that it could find
> the torque libraries for activating the appropriate support. Then the job could be
> started as simply as typing "mpirun <executable>".
What about MPICH2, which is what I use? With mpirun, could I forget about the
environment variable tricks and the manual mpdboot startup? Currently, with
MPICH2 I get the following:
$ mpirun -np $NUM_NODES -machinefile $PBS_NODEFILE mb -i a.nex
mpiexec_node004: cannot connect to local mpd (/tmp/mpd2.console_mmokrejs); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
So clearly I have to run mpdboot myself and must remember to run
mpdallexit once the job is over. :( No, I do not want to put
mpdallexit into my .logout. Maybe it is time to learn about those job
post-execution scripts? Do I have to re-invent the wheel?
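
One way to avoid both .logout and a separate post-execution script might be to let the job script itself tear the ring down with a shell EXIT trap. The sketch below assumes a plain POSIX shell; run_with_teardown and the MPD_TEARDOWN variable are hypothetical names of mine, not part of MPICH2 or TORQUE, and ./a.out is a placeholder.

```shell
#!/bin/sh
# Sketch: make sure the mpd ring is shut down however the MPI run ends.

# Command used for the teardown; mpdallexit by default (overridable).
MPD_TEARDOWN=${MPD_TEARDOWN:-mpdallexit}

# Run the given command in a subshell whose EXIT trap performs the teardown.
# The subshell's exit status is that of the command, not of the teardown.
run_with_teardown() {
    (
        trap "$MPD_TEARDOWN" EXIT
        "$@"
    )
}

# Only act inside a TORQUE job on a machine that has MPICH2's mpd tools.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpdboot >/dev/null 2>&1; then
    hosts=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')
    slots=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
    mpdboot -n "$hosts" -f "$PBS_NODEFILE" &&
        run_with_teardown mpiexec -n "$slots" ./a.out
fi
```

With this, mpdallexit runs after mpiexec returns whether the program succeeded or failed, and the job still exits with the program's own status. It does not help if the whole job is killed in a way that gives the shell no chance to run the trap.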
>> (how to configure the systems with Mpich2+Torque to get this behaviour). And,
>> nowhere was explained how to start parallel computation using the mpiexec bundled
>> in mpich2 and how does it differ to using the mpiched from
> This is the replacement for mpdboot/mpirun(mpiexec)/mpdallexit cycle when using
> torque. In this case you simply start your job with the new "mpiexec" and need
> no mpdboot/mpdallexit. Everything is done over the pbs_moms.
So why isn't this clearly explained in one place altogether? ;-)
>> Would be nice if somebody could clarify how to configure and use what.
> Hope this helps.