[torquedev] MPI question

Ken Nielson knielson at adaptivecomputing.com
Fri Jul 23 06:26:33 MDT 2010


I have a customer who is having some problems installing OpenMPI with TORQUE. Does anyone know what might be going wrong given the information below?

Thanks


When OpenMPI is correctly built against Torque, it appears to use the
'tm' interface such that pbs_mom becomes the parent process of all the
MPI processes (via orted).
We have found that the stock build of OpenMPI, although it claims to
have tm support
(viz.
$ ompi_info |grep "tm "
                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.0.2)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.0.2)
)
does not appear to work - the parent of orted is just init, presumably
because orted was started via ssh.
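
One way to check which process launched orted on a compute node is to
look up its parent. This is only a rough sketch, assuming a Linux node
with /proc mounted and pgrep available; the helper names ppid_of,
parent_name and report_orted_parents are made up for illustration and
are not part of OpenMPI or Torque.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.
# Assumes Linux with /proc mounted.

# Print the parent PID of the process whose PID is $1.
ppid_of() {
    while read -r key value _; do
        if [ "$key" = "PPid:" ]; then
            echo "$value"
            return 0
        fi
    done < "/proc/$1/status"
    return 1
}

# Print the command name of the parent of the process whose PID is $1.
parent_name() {
    cat "/proc/$(ppid_of "$1")/comm"
}

# Report the parent of every running orted; prints nothing if none run.
report_orted_parents() {
    for pid in $(pgrep -x orted 2>/dev/null); do
        echo "orted $pid: parent is $(parent_name "$pid")"
    done
}

report_orted_parents
```

Run on a node while a job is active. Going by the description above,
a working tm launch should report pbs_mom as the parent, while the
failing case reports init.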

When running OpenMPI with proper Torque integration, users hit all sorts
of problems with missing libraries (orted is built against the Intel
compilers),
and users have to 'unset PBS_JOBID' to fool mpirun into ignoring the
Torque bindings.

If we rebuild OpenMPI on the cluster itself (and so against the
installed libtorque-devel headers), all works as hoped.

We suspect that it may be due to differences in the tm interface between
Torque and the flavour of pbs used on the machine where OpenMPI was built.
On a different system I have access to,  I see this in the mom_log:
07/22/2010 17:09:06;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file
descriptor (9) in tm_request, bad protocol version 2

Ken Nielson
Adaptive Computing

