[torqueusers] OpenMPI and version changed to Torque

Martin Siegert siegert at sfu.ca
Fri Jun 29 11:18:40 MDT 2012


Hi David, Peter,

I can confirm Peter's observation - see my email to torquedev from
May 28. It is necessary to recompile OpenMPI's mca_plm_tm.so
(which does make upgrading difficult).
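
A quick way to see whether a given OpenMPI installation carries the tm components at all is to look for them in the output of ompi_info. Below is a minimal Python sketch, assuming ompi_info prints component lines of the form "MCA plm: tm (...)". The names tm_frameworks and installed_tm_frameworks and the sample text are illustrative only, not part of OpenMPI or Torque.

```python
import re
import subprocess

# Matches ompi_info component lines such as
#   "MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)"
_COMPONENT = re.compile(r"^\s*MCA\s+(\w+):\s+(\w+)\b")


def tm_frameworks(ompi_info_text):
    """Return the MCA frameworks (e.g. 'plm', 'ras') that list a 'tm' component."""
    found = []
    for line in ompi_info_text.splitlines():
        m = _COMPONENT.match(line)
        if m and m.group(2) == "tm":
            found.append(m.group(1))
    return found


def installed_tm_frameworks(ompi_info="ompi_info"):
    """Run ompi_info (assumed to be on PATH) and report its tm components."""
    out = subprocess.run([ompi_info], capture_output=True, text=True, check=True)
    return tm_frameworks(out.stdout)


# Illustrative sample text, not captured from a real system.
SAMPLE = """\
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)
"""
print(tm_frameworks(SAMPLE))
```

Note that this only shows whether the tm components are present. It does not tell you which Torque version they were built against, so it cannot by itself confirm that a recompile is needed.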

Cheers,
Martin

-- 
Martin Siegert
Simon Fraser University
Burnaby, British Columbia
Canada

On Fri, Jun 29, 2012 at 10:12:33AM -0600, David Beer wrote:
> 
>    Peter,
> 
>    I am under the impression that the different sites running 4.x (either
>    on test or production systems) haven't had to recompile their version
>    of MPI. It'd be nice to hear input from different admins on this
>    subject, but my impression is that this isn't necessary, and I know
>    that we didn't change the tm interface. I will respond to some of your
>    other questions below.
>    On Fri, Jun 29, 2012 at 9:09 AM, Peter A Ruprecht
>    <[1]peter.ruprecht at colorado.edu> wrote:
> 
>      Hi everyone,
>      Currently we're using torque 2.5.11 and would like to migrate to 4.x
>      pretty soon.  However, some testing with 4.0.2 has shown that programs
>      linked against a version of OpenMPI (1.4.x) that was compiled with
>      torque 2.5 won't run across more than one node.  My guess is that the
>      task manager API has changed between 2.5 and 4.0.
>      Certainly, best practices would suggest recompiling all libraries that
>      depend on torque when the torque version changes.  However, a
>      significant number of our users would be very unhappy having to
>      re-test and possibly recompile their codes with a recompiled OpenMPI.
>      I think that in some cases they are even required to use identical
>      libraries across a whole suite of runs to guarantee consistency.  This
>      makes it a little tough to ever change the resource manager.
>      So, getting around to my questions, is it likely that I am
>      understanding the dependency between torque, the task manager, and
>      OpenMPI correctly?
> 
>    My two cents: it seems extremely unlikely that if you recompile your
>    MPI version it would change the results of the job, especially if you
>    recompile the same version of MPI. In the event that you have to
>    recompile, it seems like overkill to make everyone re-test their
>    applications. However, I'm by no means an expert in being an admin for
>    HPC systems (I am a TORQUE developer) so hopefully some more in the
>    community can weigh in.
> 
>      And if so, is it really going to be necessary to recompile OpenMPI?
>      What do you all do in this situation?  Is it a bad idea to run torque
>      (on a big cluster, ~1400 nodes and >10000 jobs/day) without using the
>      task manager?
> 
>    There are a lot of sites that use (at least occasionally) versions of
>    MPI that don't interface with TORQUE, or haven't been built to
>    interface with TORQUE. The most common complaint I've heard from this
>    is that sometimes they have stray processes left from jobs that don't
>    get cleaned up by the mom because the mom isn't told when they are
>    launched. Others may have more input here.
> 
>      Any commentary or pointers to relevant documentation appreciated!
>      Pete Ruprecht
>      _______________________________________________
>      torqueusers mailing list
>      [2]torqueusers at supercluster.org
>      [3]http://www.supercluster.org/mailman/listinfo/torqueusers
> 
>    --
>    David Beer | Software Engineer
>    Adaptive Computing
> 
> References
> 
>    1. mailto:peter.ruprecht at colorado.edu
>    2. mailto:torqueusers at supercluster.org
>    3. http://www.supercluster.org/mailman/listinfo/torqueusers



