[torqueusers] OpenMPI and version changed to Torque
dbeer at adaptivecomputing.com
Fri Jun 29 10:12:33 MDT 2012
I am under the impression that the different sites running 4.x (either on
test or production systems) haven't had to recompile their version of MPI.
It'd be nice to hear input from different admins on this subject, but my
impression is that this isn't necessary, and I know that we didn't change
the tm interface. I will respond to some of your other questions below.
On Fri, Jun 29, 2012 at 9:09 AM, Peter A Ruprecht <
peter.ruprecht at colorado.edu> wrote:
> Hi everyone,
> Currently we're using torque 2.5.11 and would like to migrate to 4.x
> pretty soon. However, some testing with 4.0.2 has shown that programs
> linked against a version of OpenMPI (1.4.x) that was compiled with torque
> 2.5 won't run across more than one node. My guess is that the task
> manager API has changed between 2.5 and 4.0.
> Certainly, best practices would suggest recompiling all libraries that
> depend on torque when the torque version changes. However, a significant
> number of our users would be very unhappy having to re-test and possibly
> recompile their codes with a recompiled OpenMPI. I think that in some
> cases they are even required to use identical libraries across a whole
> suite of runs to guarantee consistency. This makes it a little tough to
> ever change the resource manager.
> So, getting around to my questions, is it likely that I am understanding
> the dependency between torque, the task manager, and OpenMPI correctly?
My two cents: it seems extremely unlikely that if you recompile your MPI
version it would change the results of the job, especially if you recompile
the same version of MPI. In the event that you have to recompile, it seems
like overkill to make everyone re-test their applications. However, I'm by
no means an expert in being an admin for HPC systems (I am a TORQUE
developer) so hopefully some more in the community can weigh in.
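Before deciding whether to recompile, it may help to confirm whether a given Open MPI install was built with tm support at all. The sketch below is only an illustration: it assumes the human-readable output of ompi_info lists components on lines of the form "MCA <framework>: <component> (...)", and the function names are made up for this example.

```python
import re
import subprocess

# Assumption: ompi_info prints component lines such as
#   "MCA ras: tm (MCA v2.0, API v2.0, Component v1.4.3)"
TM_LINE = re.compile(r"^\s*MCA\s+(\w+):\s+tm\b")


def tm_components(ompi_info_text):
    """Return the MCA frameworks (e.g. 'ras', 'plm') that list a 'tm' component."""
    found = []
    for line in ompi_info_text.splitlines():
        match = TM_LINE.match(line)
        if match:
            found.append(match.group(1))
    return found


def installed_tm_components():
    """Run the ompi_info found on PATH and report its tm components."""
    result = subprocess.run(
        ["ompi_info"], capture_output=True, text=True, check=True
    )
    return tm_components(result.stdout)
```

An empty list would suggest that this build does not launch through the task manager, in which case the TORQUE version is unlikely to be what stops multi-node runs.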
> And if so, is it really going to be necessary to recompile OpenMPI? What
> do you all do in this situation? Is it a bad idea to run torque (on a big
> cluster, ~1400 nodes and >10000 jobs/day) without using the task manager?
There are a lot of sites that use (at least occasionally) versions of MPI
that don't interface with TORQUE, or haven't been built to interface with
TORQUE. The most common complaint I've heard from this is that sometimes
they have stray processes left from jobs that don't get cleaned up by
the mom because the mom isn't told when they are launched. Others may have
more input here.
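For sites that do run without the task manager, a rough sketch of a per-node stray process check might look like the following. It is a heuristic, not something TORQUE provides: the function name, the (pid, uid, username) input format, and the system UID cutoff are all assumptions for illustration, and how a site collects the set of users with running jobs on a node is left open.

```python
def find_stray_pids(processes, users_with_jobs, system_uid_max=999):
    """Return PIDs owned by ordinary users who have no running job on this node.

    processes       -- iterable of (pid, uid, username) tuples (hypothetical format)
    users_with_jobs -- set of usernames with a job currently running on this node
    system_uid_max  -- UIDs at or below this value are treated as system accounts
                       (assumed cutoff; adjust for the local site)
    """
    return sorted(
        pid
        for pid, uid, username in processes
        if uid > system_uid_max and username not in users_with_jobs
    )
```

A site would probably want to log the candidates rather than kill them automatically, since this check cannot tell a leftover MPI rank from a legitimate interactive login.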
> Any commentary or pointers to relevant documentation appreciated!
> Pete Ruprecht
David Beer | Software Engineer