[torquedev] torque-4.0.x and openmpi

Ken Nielson knielson at adaptivecomputing.com
Tue May 29 09:44:36 MDT 2012


On Mon, May 28, 2012 at 6:45 PM, Martin Siegert <siegert at sfu.ca> wrote:

> Hi,
>
> I am wondering whether there is a way of running an MPI program
> compiled with openmpi (configured with --with-tm=...) and torque-2.5.x
> using the TM interface under torque-4.0.x?
>
> The dependence on torque enters openmpi only through the
> mca_plm_tm.so module which links with libtorque.so.2:
>
> # ldd /usr/local/openmpi-1.4.3/lib64/openmpi/mca_plm_tm.so
>        linux-vdso.so.1 =>  (0x00007fff63ffd000)
>        libtorque.so.2 => /usr/local/torque-2.5.8/lib/libtorque.so.2 (0x00002b0286c14000)
>        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002b0286f2d000)
>        libutil.so.1 => /lib64/libutil.so.1 (0x00002b0287145000)
>        libm.so.6 => /lib64/libm.so.6 (0x00002b0287348000)
>        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b02875cc000)
>        libc.so.6 => /lib64/libc.so.6 (0x00002b02877e7000)
>        /lib64/ld-linux-x86-64.so.2 (0x0000003a3e200000)
>
> The program runs fine when I run it from the command line, i.e.,
>
> mpiexec -n 20 -hostfile mfile ./myprog
>
> and it also runs fine when torque 2.5.11 is running.
> However, with torque-4.0.2 and using a submission script
>
> #!/bin/bash
> #PBS -l walltime=1:30:00
> #PBS -l procs=20
> cd $PBS_O_WORKDIR
> mpiexec ./myprog
>
> the job fails to run (as long as the number of requested processors is
> so large that more than one node is involved in the computation).
> This is the error message from mpiexec:
>        b413 - daemon did not report back when launched
>
> I would have expected that, since both torque-2.5.x and torque-4.0.2
> come with libtorque.so.2 (i.e., same soname), the library is
> "backward compatible".
>
> The solution to this problem appears to be to just recompile the
> mca_plm_tm.so module and replace just that file. This appears to
> be working, although I find this somewhat hair-raising. Does anybody
> have more experience with this?
>
> Otherwise upgrading to torque-4.0.x would be almost impossible:
> we would have to recompile all MPI programs on the system including
> the users' programs.
> Even when just replacing mca_plm_tm.so, we would still need to drain
> all jobs, upgrade torque, and replace mca_plm_tm.so, since I cannot
> imagine that a rolling upgrade can work: do moms from torque-2.5.11
> talk to a torque-4.0.2 server?
>
> Cheers,
> Martin
>
> --
> Martin Siegert
> Simon Fraser University
> Burnaby, British Columbia
>

Martin,

Thanks for the report. Hopefully this is something we can fix. I can't
think of anything we did that would require a recompile for the apps that
use libtorque.so, but that doesn't mean we didn't.
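
If it helps narrow this down, here is a minimal sketch of one way to
compare the two builds of libtorque.so.2. It uses only readelf, nm, and
comm; the function names are made up for illustration, and the
torque-4.0.2 path in the usage line is an assumption about where that
build is installed. It can only show symbols that disappeared between
the two libraries. It cannot detect changes in struct layouts or in the
protocol spoken to pbs_mom, so an empty result does not prove
compatibility.

```shell
#!/bin/bash
# Sketch: compare the soname and exported symbols of two libtorque builds.
# Function names are illustrative, not part of torque or openmpi.
export LC_ALL=C   # keep sort and comm ordering consistent

# Print the SONAME recorded in a shared library.
soname_of() {
    readelf -d "$1" | sed -n 's/.*Library soname: \[\(.*\)\]/\1/p'
}

# Print the sorted, unique names of the defined dynamic symbols.
exported_symbols() {
    nm -D --defined-only "$1" | awk '{ print $NF }' | sort -u
}

# Print names found in the first sorted list but not in the second.
missing_symbols() {
    comm -23 "$1" "$2"
}

# Report sonames and any symbols the old library exports that the new
# one does not.
compare_libs() {
    local old_lib=$1 new_lib=$2 old_syms new_syms
    old_syms=$(mktemp)
    new_syms=$(mktemp)
    echo "old soname: $(soname_of "$old_lib")"
    echo "new soname: $(soname_of "$new_lib")"
    exported_symbols "$old_lib" > "$old_syms"
    exported_symbols "$new_lib" > "$new_syms"
    echo "symbols missing from the new library:"
    missing_symbols "$old_syms" "$new_syms"
    rm -f "$old_syms" "$new_syms"
}

# Run only when called with exactly two library paths.
if [ "$#" -eq 2 ]; then
    compare_libs "$1" "$2"
fi
```

Saved under a hypothetical name such as compare_libtorque.sh, it would
be run as

    ./compare_libtorque.sh /usr/local/torque-2.5.8/lib/libtorque.so.2 \
        /usr/local/torque-4.0.2/lib/libtorque.so.2

with the second path adjusted to wherever torque-4.0.2 actually lives.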

Ken