[torqueusers] LAM-MPI won't boot with torque-1.2.0p6

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Thu Sep 15 03:27:18 MDT 2005


When we upgraded our test cluster from torque-1.2.0p4 to
torque-1.2.0p6 (final version), parallel jobs using LAM-MPI
(latest beta version lam-7.1.2b26) would no longer boot the
LAM daemons.  I've rebuilt the latest LAM-MPI after
torque-1.2.0p6 was installed, but that didn't help.
It would appear that something has changed within Torque
that breaks a LAM-MPI installation which worked nicely
with torque-1.2.0p4.

I've built torque-1.2.0p6 RPMs for Red Hat Enterprise Linux (RHEL) 4.0 based upon
the torque.spec files provided by Garrick Staples' source RPMs at
    http://mirrors.usc.edu/usc/usclinux/3AS/source/common/
The Torque configure flags used are as follows:
./configure  --prefix=/usr --mandir=/usr/share/man --enable-docs \
    --enable-server --enable-mom --enable-clients --with-scp \
    --enable-syslog --set-server-home=/var/spool/torque \
    --set-default-server=localhost --libdir=/usr/lib/torque \
    --disable-gui --without-tclx --without-tcl --disable-filesync \
    --disable-rpp

The problem with Torque is specific to LAM-MPI (serial jobs run
perfectly well).  When LAM-MPI selects a boot module, it defaults
to the Torque/OpenPBS "tm" module, which reports priority 50
versus 10 for "rsh" (see output below).  Unfortunately, the tm
module is unable to boot correctly.  If I force LAM-MPI to use the
"rsh" boot module by lowering the tm priority below that of rsh
(export LAM_MPI_SSI_boot_tm_priority=1), everything with LAM-MPI
works just fine!  It is of course possible that LAM-MPI used to
default to the "rsh" boot module with torque-1.2.0p4, but we can't
verify that any more.
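
For anyone wanting to reproduce the workaround, here is a minimal
sketch of how it might look inside a job script.  It assumes a
bash job script and that recon is on the PATH; only the
LAM_MPI_SSI_boot_tm_priority setting is taken from our actual setup.

```shell
# Sketch of the workaround inside a Torque job script (bash assumed).
# In the recon output, tm reports priority 50 and rsh priority 10,
# so a tm priority of 1 should let the rsh module win the selection.
export LAM_MPI_SSI_boot_tm_priority=1

# Probe LAM only if recon is actually installed on this node, and
# show which boot module was selected.
if command -v recon >/dev/null 2>&1; then
    recon -d 2>&1 | grep 'selected boot module'
fi
```

With the variable set, the "selected boot module" line should name
rsh rather than tm.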

Question:  Is LAM-MPI's "tm" boot module for Torque supposed to
work correctly with torque-1.2.0p6?  I'd love to get it to
work because of the performance improvements promised in the
LAM-MPI documentation.

Thanks,
Ole

Output from LAM-MPI booting:
----------------------------

$ recon -d
n-1<15810> ssi:boot:open: opening
n-1<15810> ssi:boot:open: opening boot module globus
n-1<15810> ssi:boot:open: opened boot module globus
n-1<15810> ssi:boot:open: opening boot module rsh
n-1<15810> ssi:boot:open: opened boot module rsh
n-1<15810> ssi:boot:open: opening boot module slurm
n-1<15810> ssi:boot:open: opened boot module slurm
n-1<15810> ssi:boot:open: opening boot module tm
n-1<15810> ssi:boot:open: opened boot module tm
n-1<15810> ssi:boot:select: initializing boot module tm
n-1<15810> ssi:boot:tm: module initializing
n-1<15810> ssi:boot:tm:verbose: 1000
n-1<15810> ssi:boot:tm:priority: 50
n-1<15810> ssi:boot:select: boot module available: tm, priority: 50
n-1<15810> ssi:boot:select: initializing boot module slurm
n-1<15810> ssi:boot:slurm: not running under SLURM
n-1<15810> ssi:boot:select: boot module not available: slurm
n-1<15810> ssi:boot:select: initializing boot module rsh
n-1<15810> ssi:boot:rsh: module initializing
n-1<15810> ssi:boot:rsh:agent: ssh -x
n-1<15810> ssi:boot:rsh:username: <same>
n-1<15810> ssi:boot:rsh:verbose: 1000
n-1<15810> ssi:boot:rsh:algorithm: linear
n-1<15810> ssi:boot:rsh:no_n: 0
n-1<15810> ssi:boot:rsh:no_profile: 0
n-1<15810> ssi:boot:rsh:fast: 0
n-1<15810> ssi:boot:rsh:ignore_stderr: 0
n-1<15810> ssi:boot:rsh:priority: 10
n-1<15810> ssi:boot:select: boot module available: rsh, priority: 10
n-1<15810> ssi:boot:select: initializing boot module globus
n-1<15810> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<15810> ssi:boot:select: boot module not available: globus
n-1<15810> ssi:boot:select: finalizing boot module slurm
n-1<15810> ssi:boot:slurm: finalizing
n-1<15810> ssi:boot:select: closing boot module slurm
n-1<15810> ssi:boot:select: finalizing boot module rsh
n-1<15810> ssi:boot:rsh: finalizing
n-1<15810> ssi:boot:select: closing boot module rsh
n-1<15810> ssi:boot:select: finalizing boot module globus
n-1<15810> ssi:boot:globus: finalizing
n-1<15810> ssi:boot:select: closing boot module globus
n-1<15810> ssi:boot:select: selected boot module tm
n-1<15810> ssi:boot:tm: found the following 3 hosts:
n-1<15810> ssi:boot:tm:   n0 n469.dcsc.fysik.dtu.dk (cpu=1)
n-1<15810> ssi:boot:tm:   n1 n478.dcsc.fysik.dtu.dk (cpu=1)
n-1<15810> ssi:boot:tm:   n2 n477.dcsc.fysik.dtu.dk (cpu=1)
n-1<15810> ssi:boot:tm: starting RTE procs
n-1<15810> ssi:boot:base:linear_windowed: starting
n-1<15810> ssi:boot:base:linear_windowed: no startup protocol
n-1<15810> ssi:boot:base:linear_windowed: invoking linear
n-1<15810> ssi:boot:base:linear: starting
n-1<15810> ssi:boot:base:linear: booting n0 (n469.dcsc.fysik.dtu.dk)
n-1<15810> ssi:boot:tm: starting recon on (n469.dcsc.fysik.dtu.dk)
n-1<15810> ssi:boot:tm: starting on n0 (n469.dcsc.fysik.dtu.dk): /usr/local/lam-7.1.2-pgi/bin/tkill -N
n-1<15810> ssi:boot:tm: successfully launched on n0 (n469.dcsc.fysik.dtu.dk)
n-1<15810> ssi:boot:tm: waiting for completion on n0 (n469.dcsc.fysik.dtu.dk)
n-1<15810> ssi:boot:base:linear: Failed to boot n0 (n469.dcsc.fysik.dtu.dk)
n-1<15810> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
recon was not able to complete successfully.  There can be any number
of problems that did not allow recon to work properly.  You should use
the "-d" option to recon to get more information about each step that
recon attempts.

Any error message above may present a more detailed description of the
actual problem.

Here is general a list of prerequisites that *must* be fulfilled
before recon can work:

         - Each machine in the hostfile must be reachable and operational.
         - You must have an account on each machine.
         - You must be able to rsh(1) to the machine (permissions
           are typically set in the user's $HOME/.rhosts file).

         *** Sidenote: If you compiled LAM to use a remote shell program
             other than rsh (with the --with-rsh option to ./configure;
             e.g., ssh), or if you set the LAMRSH environment variable
             to an alternate remote shell program, you need to ensure
             that you can execute programs on remote nodes with no
             password.  For example:

         unix% ssh -x pinky uptime
         3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10

         - The LAM executables must be locatable on each machine, using
           the shell's search path and possibly the LAMHOME environment
           variable.
         - The shell's start-up script must not print anything on standard
           error.  You can take advantage of the fact that rsh(1) will
           start the shell non-interactively.  The start-up script (such
           as .profile or .cshrc) can exit early in this case, before
           executing many commands relevant only to interactive sessions
           and likely to generate output.
-----------------------------------------------------------------------------
n-1<15810> ssi:boot:tm: finalizing
n-1<15810> ssi:boot: Closing
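
Since the debug output is long, a small filter can reduce it to the
lines that matter for module selection.  This is only a sketch:
summarize_boot_selection is a hypothetical helper name, not a LAM
tool, and the patterns are written against the message wording in
the output above, which may differ in other LAM versions.

```shell
# Summarise boot-module selection from "recon -d" output on stdin.
# Prints one line per available module (with its priority), the
# module finally selected, and any node that failed to boot.
summarize_boot_selection() {
    sed -n \
        -e 's/.*ssi:boot:select: boot module available: \([a-z]*\), priority: \([0-9]*\).*/available \1 \2/p' \
        -e 's/.*ssi:boot:select: selected boot module \([a-z]*\).*/selected \1/p' \
        -e 's/.*ssi:boot:base:linear: Failed to boot \(n[0-9]*\) .*/failed \1/p'
}
```

It would be used as "recon -d 2>&1 | summarize_boot_selection".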



-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

