[torqueusers] OpenMPI and version changed to Torque
Martin Siegert
siegert at sfu.ca
Fri Jun 29 11:18:40 MDT 2012
Hi David, Peter,
I can confirm Peter's observation - see my email to torquedev from
May 28. It is necessary to recompile OpenMPI's mca_plm_tm.so
(which does make upgrading difficult).
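
For anyone unsure whether a given OpenMPI build has the tm components at
all (and so would need the rebuild), here is a minimal sketch. It assumes
that ompi_info prints one line per MCA component in the form
"MCA plm: tm (MCA v2.0, ...)". The function name and the sample text are
made up for illustration, so check them against your own installation.

```python
import re

# Assumed ompi_info line format (an assumption, verify on your system), e.g.:
#   "MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)"
MCA_LINE = re.compile(r"^\s*MCA\s+(\w+):\s+(\w+)\s+\(")


def tm_frameworks(ompi_info_text):
    """Return the MCA frameworks (e.g. 'plm', 'ras') that list a 'tm' component."""
    found = []
    for line in ompi_info_text.splitlines():
        m = MCA_LINE.match(line)
        if m and m.group(2) == "tm":
            found.append(m.group(1))
    return found


# Hypothetical sample output, for illustration only.
sample = """\
                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)
"""
print(tm_frameworks(sample))
```

On a real system you would feed it the output of ompi_info instead of the
sample string. An empty result suggests that build does not use the task
manager; a result containing "plm" suggests mca_plm_tm.so is present and is
the piece to rebuild against the new TORQUE (OpenMPI's configure takes
--with-tm for this).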
Cheers,
Martin
--
Martin Siegert
Simon Fraser University
Burnaby, British Columbia
Canada
On Fri, Jun 29, 2012 at 10:12:33AM -0600, David Beer wrote:
>
> Peter,
>
> I am under the impression that the different sites running 4.x (either
> on test or production systems) haven't had to recompile their version
> of MPI. It'd be nice to hear input from different admins on this
> subject, but my impression is that this isn't necessary, and I know
> that we didn't change the tm interface. I will respond to some of your
> other questions below.
> On Fri, Jun 29, 2012 at 9:09 AM, Peter A Ruprecht
> <[1]peter.ruprecht at colorado.edu> wrote:
>
> Hi everyone,
> Currently we're using torque 2.5.11 and would like to migrate to 4.x
> pretty soon. However, some testing with 4.0.2 has shown that programs
> linked against a version of OpenMPI (1.4.x) that was compiled with
> torque 2.5 won't run across more than one node. My guess is that the
> task manager API has changed between 2.5 and 4.0.
> Certainly, best practices would suggest recompiling all libraries that
> depend on torque when the torque version changes. However, a
> significant number of our users would be very unhappy having to
> re-test and possibly recompile their codes with a recompiled OpenMPI.
> I think that in some cases they are even required to use identical
> libraries across a whole suite of runs to guarantee consistency. This
> makes it a little tough to ever change the resource manager.
> So, getting around to my questions, is it likely that I am
> understanding the dependency between torque, the task manager, and
> OpenMPI correctly?
>
> My two cents: it seems extremely unlikely that if you recompile your
> MPI version it would change the results of the job, especially if you
> recompile the same version of MPI. In the event that you have to
> recompile, it seems like overkill to make everyone re-test their
> applications. However, I'm by no means an expert in being an admin for
> HPC systems (I am a TORQUE developer) so hopefully some more in the
> community can weigh in.
>
> And if so, is it really going to be necessary to recompile OpenMPI?
> What do you all do in this situation? Is it a bad idea to run torque
> (on a big cluster, ~1400 nodes and >10000 jobs/day) without using the
> task manager?
>
> There are a lot of sites that use (at least occasionally) versions of
> MPI that don't interface with TORQUE, or haven't been built to
> interface with TORQUE. The most common complaint I've heard from this
> is that sometimes they have stray processes left from jobs that don't
> get cleaned up by the mom because the mom isn't told when they are
> launched. Others may have more input here.
>
> Any commentary or pointers to relevant documentation appreciated!
> Pete Ruprecht
> _______________________________________________
> torqueusers mailing list
> [2]torqueusers at supercluster.org
> [3]http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>
> References
>
> 1. mailto:peter.ruprecht at colorado.edu
> 2. mailto:torqueusers at supercluster.org
> 3. http://www.supercluster.org/mailman/listinfo/torqueusers