[torqueusers] OpenMPI and version changed to Torque

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Fri Jun 29 10:54:27 MDT 2012


On Fri, Jun 29, 2012 at 12:12 PM, David Beer
<dbeer at adaptivecomputing.com> wrote:
> Peter,
>
> I am under the impression that the different sites running 4.x (either on
> test or production systems) haven't had to recompile their version of MPI.
> It'd be nice to hear input from different admins on this subject, but my
> impression is that this isn't necessary, and I know that we didn't change
> the tm interface. I will respond to some of your other questions below.

you did change the soname of libtorque.so, right?

this is likely what keeps OpenMPI from working,
since the corresponding plugin won't load anymore.
here is the ldd output from an OpenMPI 1.4.x installation
on a Torque 2.5.5 machine:

[akohlmey at login2 openmpi]$ ldd mca_ras_tm.so
	linux-vdso.so.1 =>  (0x00002aaaaaacb000)
	libtorque.so.2 => /opt/torque-2.5.5/lib/libtorque.so.2 (0x00002aaaaaed0000)
	libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aaaab1f1000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaab40a000)
	libm.so.6 => /lib64/libm.so.6 (0x00002aaaab60d000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaab892000)
	libc.so.6 => /lib64/libc.so.6 (0x00002aaaabaae000)
	/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
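
to check this across many plugins or nodes, the ldd output can be
scanned mechanically. the following is only a rough sketch in python:
the function names parse_ldd and unresolved are made up for this
example, and AFTER_UPGRADE is a hypothetical illustration of what ldd
might print once libtorque.so.2 is gone, not output from a real
Torque 4.x machine.

```python
def parse_ldd(text):
    """Return {soname: path or None}; None means ldd printed "not found"."""
    deps = {}
    for line in text.splitlines():
        parts = line.split("=>")
        if len(parts) != 2:
            continue  # e.g. the dynamic loader line has no "=>"
        soname = parts[0].strip()
        rest = parts[1].strip()
        if rest == "not found":
            deps[soname] = None
        elif rest.startswith("/"):
            deps[soname] = rest.split()[0]  # drop the load address
        # anything else (e.g. linux-vdso, which has no file) is skipped
    return deps


def unresolved(deps):
    """Return the sorted sonames that the loader could not resolve."""
    return sorted(name for name, path in deps.items() if path is None)


# hypothetical sample, for illustration only
AFTER_UPGRADE = (
    "\tlibtorque.so.2 => not found\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00002aaaabaae000)\n"
)

print(unresolved(parse_ldd(AFTER_UPGRADE)))
```

feeding it the real output of "ldd mca_ras_tm.so" on the upgraded
machine would show whether libtorque.so.2 is the only missing piece.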

if there are no changes in the ABI (of what OpenMPI uses),
the workaround for keeping OpenMPI happy and working
may be as simple as creating a symlink named libtorque.so.2
that points to libtorque.so.4.
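
a minimal sketch of that symlink step, again in python and run only in
a scratch directory with an empty stand-in file. the helper name
add_compat_symlink is made up here, and libtorque.so.4 as the new
soname is an assumption that should be checked against the actual
Torque 4.x lib directory before trying this for real.

```python
import os
import tempfile


def add_compat_symlink(libdir, old_soname="libtorque.so.2",
                       new_soname="libtorque.so.4"):
    """Create libdir/old_soname as a relative symlink to new_soname."""
    target = os.path.join(libdir, new_soname)
    link = os.path.join(libdir, old_soname)
    if not os.path.exists(target):
        raise FileNotFoundError(target)  # nothing to point at
    if os.path.lexists(link):
        raise FileExistsError(link)  # never clobber an existing library
    # relative link, equivalent to: ln -s libtorque.so.4 libtorque.so.2
    os.symlink(new_soname, link)
    return link


# try it in a scratch directory; the empty file is NOT a real library
with tempfile.TemporaryDirectory() as scratch:
    open(os.path.join(scratch, "libtorque.so.4"), "w").close()
    link = add_compat_symlink(scratch)
    print(os.readlink(link))
```

the link only helps if the ABI that the plugin uses is really
unchanged, so it is worth testing with a multi-node job first.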

alternatively, i would try to recompile/relink only the one plugin
(mca_ras_tm.so) and replace it in the OpenMPI installation.

OpenMPI has a very modular structure and none of the application
binaries reference the plugin dependencies unless OpenMPI was
compiled for static linkage only.

HTH,
    axel.

>
> On Fri, Jun 29, 2012 at 9:09 AM, Peter A Ruprecht
> <peter.ruprecht at colorado.edu> wrote:
>>
>> Hi everyone,
>>
>> Currently we're using torque 2.5.11 and would like to migrate to 4.x
>> pretty soon.  However, some testing with 4.0.2 has shown that programs
>> linked against a version of OpenMPI (1.4.x) that was compiled with torque
>> 2.5 won't run across more than one node.  My guess is that the task
>> manager API has changed between 2.5 and 4.0.
>>
>> Certainly, best practices would suggest recompiling all libraries that
>> depend on torque when the torque version changes.  However, a significant
>> number of our users would be very unhappy having to re-test and possibly
>> recompile their codes with a recompiled OpenMPI.  I think that in some
>> cases they are even required to use identical libraries across a whole
>> suite of runs to guarantee consistency.  This makes it a little tough to
>> ever change the resource manager.
>>
>> So, getting around to my questions, is it likely that I am understanding
>> the dependency between torque, the task manager, and OpenMPI correctly?
>
>
> My two cents: it seems extremely unlikely that if you recompile your MPI
> version it would change the results of the job, especially if you recompile
> the same version of MPI. In the event that you have to recompile, it seems
> like overkill to make everyone re-test their applications. However, I'm by
> no means an expert in being an admin for HPC systems (I am a TORQUE
> developer) so hopefully some more in the community can weigh in.
>
>>
>> And if so, is it really going to be necessary to recompile OpenMPI?  What
>> do you all do in this situation?  Is it a bad idea to run torque (on a big
>> cluster, ~1400 nodes and >10000 jobs/day) without using the task manager?
>>
>
> There are a lot of sites that use (at least occasionally) versions of MPI
> that don't interface with TORQUE, or haven't been built to interface with
> TORQUE. The most common complaint I've heard from this is that sometimes
> they have stray processes left from jobs that don't get cleaned up by the
> mom because the mom isn't told when they are launched. Others may have more
> input here.
>
>>
>> Any commentary or pointers to relevant documentation appreciated!
>>
>> Pete Ruprecht
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>
>



-- 
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

