[torqueusers] start intel mpi in pbs

Alan_E_Solomon at whirlpool.com
Thu Aug 16 08:15:48 MDT 2007


Chaucer Cao wrote:

Hi all,

In the PBS script file I can't start mpd (Intel MPI) using the
following command:

****************************************************************************


mpdboot  --rsh=ssh -v -n `cat mpd.hosts|wc -l`  -f mpd.hosts

****************************************************************************


It gives:

----------------------------------------------------------------------------

totalnum=4  numhosts=3

there are not enough hosts on which to start all processes

----------------------------------------------------------------------------

But I can manually start mpd using the same command.

My posting:
I have encountered what seems to be the same problem and think I may have
solved it.  We have a cluster of 20 4-way Xeon nodes and use the Intel MPI
for LS-Dyna.  About 1 in every 10 jobs fails to start and we see "not
enough hosts on which to start all processes" just after we run "mpdboot".

Logging into a compute node to try to debug this, I set my path & library
path by hand and created an "mpd.hosts" file by hand, then tried running
mpdboot.   I get the same error every time I do "mpdboot" when the local
node hostname is NOT the first one listed in  mpd.hosts!  Every time I ran
"mpdboot" with the local node listed first it ran OK and started the
expected number of mpd daemons.   I have added code to my job script that
makes sure that the node the job is launched on (the Mother Superior node)
is first in the list.  Since this change we have so far not seen this error
again.
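
The check described above can be sketched as a small shell function. This is only an illustration, not part of the original job script: the function name first_host_is_local is hypothetical, and it assumes the host file holds one hostname per line, written in the same form that `hostname` prints (short name versus fully qualified name may differ between sites).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the first entry of a host file
# is the given hostname (defaults to the local node's hostname).
first_host_is_local() {
    hostfile=$1
    local_host=${2:-$(hostname)}
    # Only the first line of the file matters for this check.
    first=$(head -n 1 "$hostfile")
    [ "$first" = "$local_host" ]
}
```

In a job script this could be called just before mpdboot, for example `first_host_is_local "$STAGE_PATH/machinefile" || echo "local node is not first in host list" >&2`.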

Here's a snip of code that does this in my job script.  I actually name
this host list file "machinefile", not mpd.hosts, but the name is
arbitrary:

# Create hosts list file with Mother Superior listed first (accommodate
# Intel MPI bug)
rm -f "$STAGE_PATH/machinefile"
MOTHER=`hostname`
echo "$MOTHER" > "$STAGE_PATH/machinefile"
# -x -F: drop only exact matches, so "node1" does not also remove "node10"
for NODE in `sort -u "$PBS_NODEFILE" | grep -v -x -F "$MOTHER"`; do
    echo "$NODE" >> "$STAGE_PATH/machinefile"
done
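
For completeness, here is one way the reordered file might then be handed to mpdboot, using the same options as the command quoted at the top of this message. It is a sketch under assumptions: the function names count_hosts and boot_mpd_ring are made up for illustration, and the host count is taken from non-empty lines rather than from `cat | wc -l`, so that a stray blank line in the file is not counted as a host.

```shell
#!/bin/sh
# Hypothetical helper: print the number of non-empty lines in a host file.
count_hosts() {
    grep -c . "$1"
}

# Hypothetical wrapper: start one mpd per listed host, using the options
# from the mpdboot command quoted earlier in this message.
boot_mpd_ring() {
    hostfile=$1
    nhosts=$(count_hosts "$hostfile")
    mpdboot --rsh=ssh -v -n "$nhosts" -f "$hostfile"
}
```

With the file built as above, the call would be `boot_mpd_ring "$STAGE_PATH/machinefile"`.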

Regards,

Al Solomon

