[torquedev] torque-4.0.x and openmpi
knielson at adaptivecomputing.com
Tue May 29 09:44:36 MDT 2012
On Mon, May 28, 2012 at 6:45 PM, Martin Siegert <siegert at sfu.ca> wrote:
> I am wondering whether there is a way of running an MPI program
> compiled with openmpi (configured with --with-tm=...) and torque-2.5.x
> using the TM interface under torque-4.0.x?
> The dependence on torque enters openmpi only through the
> mca_plm_tm.so module which links with libtorque.so.2:
> # ldd /usr/local/openmpi-1.4.3/lib64/openmpi/mca_plm_tm.so
> linux-vdso.so.1 => (0x00007fff63ffd000)
> libtorque.so.2 => /usr/local/torque-2.5.8/lib/libtorque.so.2
> libnsl.so.1 => /lib64/libnsl.so.1 (0x00002b0286f2d000)
> libutil.so.1 => /lib64/libutil.so.1 (0x00002b0287145000)
> libm.so.6 => /lib64/libm.so.6 (0x00002b0287348000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b02875cc000)
> libc.so.6 => /lib64/libc.so.6 (0x00002b02877e7000)
> /lib64/ld-linux-x86-64.so.2 (0x0000003a3e200000)
> The program runs fine when I run it from the command line, i.e.,
> mpiexec -n 20 -hostfile mfile ./myprog
> and it also runs fine when torque 2.5.11 is running.
> However, with torque-4.0.2 and using a submission script
> #PBS -l walltime=1:30:00
> #PBS -l procs=20
> cd $PBS_O_WORKDIR
> mpiexec ./myprog
> the job fails to run (as long as the number of requested processors is
> such that more than one node is involved in the computation).
> This is the error message from mpiexec:
> b413 - daemon did not report back when launched
> I would have expected that since both torque-2.5.x and torque-4.0.2
> come with libtorque.so.2 (i.e., same soname) that the library is
> "backward compatible".
> The solution to this problem appears to be to just recompile the
> mca_plm_tm.so module and replace just that file. This appears to
> be working although I find this somewhat hair-raising. Has somebody
> more experience with this?
> Otherwise upgrading to torque-4.0.x would be almost impossible:
> we would have to recompile all MPI programs on the system including
> the users' programs.
> Even when just replacing mca_plm_tm.so we still need to drain all
> jobs, upgrade torque and replace mca_plm_tm.so, since I cannot imagine
> that a rolling upgrade can work: do moms from torque-2.5.11 talk to a
> torque-4.0.2 server?
> Martin Siegert
> Simon Fraser University
> Burnaby, British Columbia
Thanks for the report. Hopefully this is something we can fix. I can't
think of anything we did that would require a recompile for the apps that
use libtorque.so, but that doesn't mean we didn't.
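
For anyone comparing the two builds, here is a minimal sketch of how to check the soname a library advertises and the path a module actually resolves it to. The helper names parse_soname and resolved_path are made up for illustration; the paths in the usage comments are the ones from the ldd output quoted above. A matching soname only says that the two libraries claim the same interface version. It does not by itself rule out a change in behavior behind that interface.

```shell
#!/bin/sh
# Hypothetical helpers (not part of torque or openmpi).

# Read `readelf -d <lib>` output on stdin and print the SONAME, e.g.
#   0x000000000000000e (SONAME)  Library soname: [libtorque.so.2]
parse_soname() {
    awk -F'[][]' '/\(SONAME\)/ { print $2 }'
}

# Read `ldd <module>` output on stdin and print the path that the
# library named in $1 resolves to.
resolved_path() {
    awk -v lib="$1" '$1 == lib && $2 == "=>" { print $3 }'
}

# Usage, with the paths from the original report:
#   readelf -d /usr/local/torque-2.5.8/lib/libtorque.so.2 | parse_soname
#   ldd /usr/local/openmpi-1.4.3/lib64/openmpi/mca_plm_tm.so \
#       | resolved_path libtorque.so.2
```

Running both commands once against the torque-2.5.x install and once against the torque-4.0.x install should show whether mca_plm_tm.so picks up the new library at run time and whether both libraries report the same soname.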