[torquedev] torque-4.0.x and openmpi

Martin Siegert siegert at sfu.ca
Mon May 28 18:45:16 MDT 2012


Hi,

I am wondering whether there is a way of running an MPI program
that was compiled with openmpi (configured with --with-tm=...) against
torque-2.5.x through the TM interface under torque-4.0.x?

The dependence on torque enters openmpi only through the
mca_plm_tm.so module which links with libtorque.so.2:

# ldd /usr/local/openmpi-1.4.3/lib64/openmpi/mca_plm_tm.so
        linux-vdso.so.1 =>  (0x00007fff63ffd000)
        libtorque.so.2 => /usr/local/torque-2.5.8/lib/libtorque.so.2 (0x00002b0286c14000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002b0286f2d000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002b0287145000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b0287348000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b02875cc000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b02877e7000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003a3e200000)
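
For checking many nodes or several openmpi installations, the same
lookup can be scripted. This is only a minimal sketch: the function
name libtorque_path is made up for illustration, and it assumes the
usual "name => path (address)" layout of ldd output shown above.

```shell
#!/bin/bash
# Hypothetical helper: read `ldd` output on stdin and print the path
# that libtorque resolves to ("not found" if ldd could not resolve it).
libtorque_path() {
    awk '$1 ~ /^libtorque\.so/ && $2 == "=>" {
        print ($3 == "not" ? "not found" : $3)
    }'
}

# Example, using the module path from the listing above:
#   ldd /usr/local/openmpi-1.4.3/lib64/openmpi/mca_plm_tm.so | libtorque_path
```

With the listing above as input this would print the torque-2.5.8
path, which shows which torque installation the module picks up at
run time.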

The program runs fine when I run it from the command line, i.e.,

mpiexec -n 20 -hostfile mfile ./myprog

and it also runs fine when torque 2.5.11 is running.
However, with torque-4.0.2 and using a submission script

#!/bin/bash
#PBS -l walltime=1:30:00
#PBS -l procs=20
cd $PBS_O_WORKDIR
mpiexec ./myprog

the job fails to run whenever the number of requested processors is
large enough that more than one node is involved in the computation.
This is the error message from mpiexec:
        b413 - daemon did not report back when launched
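
Since the failure depends on whether the job spans more than one node,
it can help to log the node count from inside the job. Torque lists the
allocated slots, one hostname per line, in the file named by
$PBS_NODEFILE. The following is a sketch only; the function name
count_nodes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the distinct hostnames in a nodefile
# (one hostname per line, one line per allocated slot).
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a submission script, before calling mpiexec:
#   echo "job spans $(count_nodes "$PBS_NODEFILE") node(s)"
```

A count of 1 would correspond to the case that still works; anything
larger is the case where mpiexec has to start daemons on remote nodes
through the TM interface.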

Since both torque-2.5.x and torque-4.0.2 come with libtorque.so.2
(i.e., the same soname), I would have expected the library to be
"backward compatible".
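
The soname claim can be checked directly with objdump -p, which prints
a SONAME entry in the dynamic section of a shared library. A minimal
sketch follows; the function name read_soname is hypothetical, and the
torque-4.0.2 path in the example is an assumed install location, not
one confirmed above.

```shell
#!/bin/bash
# Hypothetical helper: read `objdump -p` output on stdin and print
# the value of the SONAME entry, if there is one.
read_soname() {
    awk '$1 == "SONAME" { print $2 }'
}

# Example (the second path is an assumption):
#   objdump -p /usr/local/torque-2.5.8/lib/libtorque.so.2 | read_soname
#   objdump -p /usr/local/torque-4.0.2/lib/libtorque.so.2 | read_soname
```

Identical output from both commands only means that the dynamic linker
treats the two files as interchangeable; it says nothing about whether
the library behaves the same way.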

The solution to this problem appears to be to recompile the
mca_plm_tm.so module and replace only that file. This appears to
work, although I find it somewhat hair-raising. Does anybody have
more experience with this?
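
To make the replacement less hair-raising, the old module can be kept
as a backup and the new one moved into place with a rename, so that
the installed file is never half-written. This is a sketch under
assumptions: the function name replace_module and the ".bak" suffix
are made up, and the rebuilt module is assumed to come from an openmpi
build configured with --with-tm pointing at the new torque.

```shell
#!/bin/bash
# Hypothetical helper: replace an installed module with a rebuilt one.
#   $1 = rebuilt module, $2 = installed module to be replaced
replace_module() {
    [ -f "$1" ] || { echo "missing rebuilt module: $1" >&2; return 1; }
    [ -f "$2" ] || { echo "missing installed module: $2" >&2; return 1; }
    # Keep the old module so the change can be rolled back.
    cp -p "$2" "$2.bak" || return 1
    # Copy next to the target first, then rename over it; a rename
    # within one directory swaps the file in a single step.
    cp "$1" "$2.new" && mv "$2.new" "$2"
}

# Example (the first argument is a placeholder for the rebuilt file):
#   replace_module ./mca_plm_tm.so \
#       /usr/local/openmpi-1.4.3/lib64/openmpi/mca_plm_tm.so
```

Rolling back is then a matter of moving the ".bak" file back over the
installed module.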

Otherwise upgrading to torque-4.0.x would be almost impossible:
we would have to recompile all MPI programs on the system including
the users' programs.
Even when just replacing mca_plm_tm.so, we would still need to drain
all jobs, upgrade torque, and replace mca_plm_tm.so, since I cannot
imagine that a rolling upgrade can work: do moms from torque-2.5.11
talk to a torque-4.0.2 server?

Cheers,
Martin

-- 
Martin Siegert
Simon Fraser University
Burnaby, British Columbia

