[torqueusers] MPICH2 support in TORQUE: do I need to run mpdboot myself?

Frank Mietke frank.mietke at informatik.tu-chemnitz.de
Wed Jul 2 07:35:48 MDT 2008


Hi Martin,

> the docs at http://www.clusterresources.com/torquedocs21/7.1mpi.shtml
> mention something about MPICH2, but it is just not enough.
>
>  I have found some scripts used by people, notably:
> http://www.cct.lsu.edu/~hsunda3/doc/#id241705
>
>  mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
>  mpiexec -n `cat $PBS_NODEFILE | wc -l` a.out
>  mpdallexit

This should work either interactively, after "qsub -I ...", or from a job file
submitted with "qsub <job.script>". Instead of "--totalnum=.." you can use the
short form "-n ..", which is equivalent.
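
As a rough sketch, such a job file could look like the following. The resource
request, the "a.out" name and the "hosts.$PBS_JOBID" file name are placeholders,
and count_hosts/count_slots are helper names made up for this example. It uses
"sort -u" instead of "uniq" because "uniq" only collapses adjacent duplicates.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2
# Sketch of an MPD-based MPICH2 job; the resource request above is a placeholder.

# Number of distinct hosts in a node file (one mpd is started per host).
count_hosts() { sort -u "$1" | wc -l | tr -d ' '; }

# Number of lines in a node file (one MPI process per allocated slot).
count_slots() { wc -l < "$1" | tr -d ' '; }

# Only act inside a TORQUE job with the MPD tools on the PATH.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpdboot >/dev/null 2>&1; then
    cd "$PBS_O_WORKDIR" || exit 1
    hosts="hosts.$PBS_JOBID"                 # hypothetical file name
    sort -u "$PBS_NODEFILE" > "$hosts"       # one line per host
    mpdboot -n "$(count_hosts "$PBS_NODEFILE")" -f "$hosts"
    mpiexec -n "$(count_slots "$PBS_NODEFILE")" ./a.out
    mpdallexit                               # always shut the ring down again
    rm -f "$hosts"
fi
```

With nodes=2:ppn=2 the node file has four lines on two hosts, so this would
boot two mpds and start four processes.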
>
> but that gives me:
>
> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>    probable cause:  no mpd daemon on this machine
>    possible cause:  unix socket /tmp/mpd2.console_root has been removed
> mpiexec_node004 (__init__ 1190): forked process failed; status=255

Try the commands above in an interactive job ("qsub -I") if possible, to see
whether they work correctly. The "mpdboot" program also has switches that turn
on verbose or debugging output, so you can see what is going on.
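
A minimal sketch of such a check, meant to be run inside an interactive job.
It uses the "-v" switch that also appears in the CITA script quoted below, plus
"mpdtrace" to list the ring members. The socket path follows the pattern in
your error message; the two function names and the "hosts.$PBS_JOBID" file name
are made up for this example.

```shell
#!/bin/sh
# Path of the local mpd console socket for a user, following the pattern
# seen in the error message (/tmp/mpd2.console_<user>).
mpd_console_path() { echo "/tmp/mpd2.console_$1"; }

# Report whether an mpd console socket exists for the current user.
check_local_mpd() {
    sock=$(mpd_console_path "$(id -un)")
    if [ -S "$sock" ]; then
        echo "mpd console socket found: $sock"
    else
        echo "no mpd console socket at $sock"
    fi
}

# Only act inside a TORQUE job with the MPD tools on the PATH.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpdboot >/dev/null 2>&1; then
    hosts="hosts.$PBS_JOBID"                 # hypothetical file name
    sort -u "$PBS_NODEFILE" > "$hosts"
    # -v makes mpdboot print what it is doing while it starts the ring.
    mpdboot -v -n "$(wc -l < "$hosts" | tr -d ' ')" -f "$hosts"
    check_local_mpd
    mpdtrace -l                              # list the mpds that joined the ring
    mpdallexit
    rm -f "$hosts"
fi
```

Note that the socket name contains the user name, so the mpd has to be started
by the same user who later runs mpiexec.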

> I also found:
> http://www.cita.utoronto.ca/mediawiki/index.php/Sunnyvale#Submitting_Jobs
>
> #!/bin/csh
> #PBS -l nodes=8:ppn=8
> #PBS -q workq
> #PBS -r n
> #PBS -l walltime=00:35:00
> set NNODES=8
> set NCPUS=64
> cd $PBS_O_WORKDIR
> cat $PBS_NODEFILE > nodes
> foreach NN (`cat $PBS_NODEFILE | uniq`)
> echo `echo $NN | cut -f1 -d.` >> machinefile.$PBS_JOBID
> end
> mpdboot -n $NNODES -f $PBS_O_WORKDIR/machinefile.$PBS_JOBID -v
> mpiexec -n $NCPUS $PBS_O_WORKDIR/a.out
> mpdallexit
> unset NNODES
> unset NCPUS

This is a job script of the kind mentioned above.


>
>
> https://www.liniac.upenn.edu/wiki/tiki-index.php?page=LAM+with+Torque
> http://wwwas.oat.ts.astro.it/planck/index.php?option=com_content&task=view&id=30&Itemid=46
>
> From this Open MPI FAQ (http://www.open-mpi.org/faq/?category=tm), which Garrick
> pointed out in the mailing list archives, it is still unclear how it works

Open MPI has to be configured and built so that it can find the TORQUE
libraries; that activates its TORQUE (tm) support. The job can then be started
simply by typing "mpirun <executable>".
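
Roughly, and only as a sketch: Open MPI's configure script takes a
"--with-tm=<dir>" option, and "ompi_info" lists the components that were built.
The TORQUE path and "a.out" are placeholders, tm_components is a helper name
made up for this example, and the ompi_info line format it matches is an
assumption that may differ between versions.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2
# Build step, done once outside the job (the path is a placeholder):
#   ./configure --with-tm=/path/to/torque && make && make install

# Filter ompi_info output (read from stdin) down to the "tm" components.
# The line format matched here is an assumption.
tm_components() { grep 'MCA [a-z]*: tm '; }

# Check whether this Open MPI installation was built with tm support.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | tm_components || echo "no tm components found"
fi

# Only launch inside a TORQUE job with mpirun on the PATH.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpirun >/dev/null 2>&1; then
    cd "$PBS_O_WORKDIR" || exit 1
    # With tm support, mpirun takes the allocation from TORQUE itself,
    # so no host file and no process count are given here.
    mpirun ./a.out
fi
```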


> (how to configure the systems with MPICH2+Torque to get this behaviour). And
> nowhere is it explained how to start a parallel computation using the mpiexec
> bundled with MPICH2, and how that differs from using the mpiexec from
> http://www.osc.edu/~pw/mpiexec

That mpiexec is a replacement for the mpdboot / mpirun (mpiexec) / mpdallexit
cycle when using TORQUE. You simply start your job with the new "mpiexec" and
need no mpdboot/mpdallexit; everything is done through the pbs_moms.
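
A job script for that case could be as short as the sketch below. It assumes
the OSC mpiexec comes first in the PATH (ahead of the mpiexec bundled with
MPICH2) and was built with support for your MPI library; "a.out" and the
resource request are placeholders, and in_torque_job is a helper name made up
for this example.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2
# Sketch of a job script for the OSC mpiexec; no mpd ring is started.

# True only inside a TORQUE job, where PBS_JOBID is set.
in_torque_job() { [ -n "${PBS_JOBID:-}" ]; }

if in_torque_job && command -v mpiexec >/dev/null 2>&1; then
    cd "$PBS_O_WORKDIR" || exit 1
    # Without -n, the OSC mpiexec starts one process per allocated processor.
    mpiexec ./a.out
fi
```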

>
>
> Would be nice if somebody could clarify how to configure and use what.

Hope this helps.

Best Regards,
Frank


> Thanks,
> Martin
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

-- 
Dipl.-Inf. Frank Mietke     |     Fakultätsrechen- und Informationszentrum
Tel.: 0371 - 531 - 35538    |     Fak. für Informatik
Fax:  0371 - 531 8 35538    |     TU-Chemnitz
Key-ID: 60F59599            |     frank.mietke at informatik.tu-chemnitz.de

