[torquedev] MPI TM problem and fix

A.Th.C. Hulst hulst at argoss.nl
Tue Sep 2 15:37:10 MDT 2008


Hello,

I'm a new Torque user setting up a test cluster. We use several MPI 
applications, so I was trying to use the MPI TM interface (lam-7.1.3). I've 
been using torque-2.3.2 so far.

SETUP:

   LAN
      |
      | public network
      |
HEADNODE
      |
      | private network
      |
 CLUSTER

The headnode has both a public and a private address, and its hostname is set 
to the public name (a requirement).

The public name is neptune, the private name is agcnms2.

On the headnode a pbs_mom is running as well.

PROBLEM:
I need to start an MPI application with the first node (n0) on the headnode, 
as that is the one with the diskpack and some applications collect all data 
on the first node.

When I start a job with the script:

<file>
#PBS -N mpi_hello
#PBS -l nodes=agcnms2:ppn=4+3:ppn=4
#
cd $PBS_O_WORKDIR
lamboot -d -ssi boot tm
mpiexec ./hello_mpi
lamhalt
</file>

Booting LAM then fails. The reason is that the private name agcnms2 is 
translated to the public name neptune, so a boot is attempted across two 
networks, which fails.

Digging through the code and running strace showed that when Torque collects 
its hosts for the LAM environment, it contacts all relevant MOMs, probably 
to update the resource availability (I don't know exactly). Part of the 
info returned by each MOM is uname. However, that contains the machine's 
hostname, which is not necessarily the name of the MOM. Usually these are the 
same, but not in my case.

In $torquesrc/src/resmom/mom_main.c, from line 785 onward, I've changed:

-----------------
sprintf(ret_string,"%s %s %s %s %s",
      n.sysname,
      n.hostname,
      n.release,
      n.version,
      n.machine);
----- TO -----
sprintf(ret_string,"%s %s %s %s %s",
      n.sysname,
      mom_short_name,
      n.release,
      n.version,
      n.machine);
----------------
Now the reported name follows the -H flag of pbs_mom, and MPI TM boots properly.

As far as I can see this hack does not influence any other behavior, but I 
can't be sure, as I'm just starting.

Best regards,
Sander

-- 
ARGOSS: Atmospheric, marine & coastal information, systems and consultancy.
P.O. Box 61
8325 ZH Vollenhove
The Netherlands
Tel: +31 (0)527-242299
Fax: +31 (0)527-242016
E-mail: hulst at argoss.nl
Web: www.argoss.nl



