[torquedev] MPI question
knielson at adaptivecomputing.com
Fri Jul 23 06:26:33 MDT 2010
I have a customer who is having some problems installing OpenMPI with TORQUE. Does anyone know what might be going wrong given the information below?
When OpenMPI is correctly built against Torque, it appears to use the
'tm' interface such that pbs_mom becomes the parent process of all the
MPI processes (via orted).
We have found that the stock build of OpenMPI, although it claims to have tm support:
$ ompi_info |grep "tm "
MCA ras: tm (MCA v2.0, API v2.0, Component v1.0.2)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.0.2)
does not appear to work - the parent of orted is just init - presumably
from an ssh.
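
One way to tell which launch path a running job took is to look at the parent of each orted on a compute node. The following is a minimal sketch, not something from the original report: the function names are made up for illustration, and treating systemd or sshd the same as init is an assumption for other systems.

```shell
# classify_parent NAME
# Map the command name of orted's parent process to a launch path.
# pbs_mom as parent suggests the tm interface was used; init (or, as an
# assumption for other systems, systemd or sshd) suggests an ssh launch.
classify_parent() {
    case "$1" in
        pbs_mom) echo "tm" ;;
        init|systemd|sshd) echo "ssh-or-detached" ;;
        *) echo "unknown" ;;
    esac
}

# report_orted_parents
# For every orted on this node, print its PID, its parent's command
# name, and the classification from classify_parent.
report_orted_parents() {
    for pid in $(pgrep -x orted); do
        ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
        parent=$(ps -o comm= -p "$ppid")
        echo "orted $pid parent=$parent launch=$(classify_parent "$parent")"
    done
}
```

Running report_orted_parents on a node while a job is active should show launch=tm for a working Torque-integrated build and launch=ssh-or-detached for the behaviour described above.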
When running OpenMPI with proper Torque integration, users hit all sorts
of problems with missing libraries (orted is built against the Intel
and users have to 'unset PBS_JOBID' to fool mpirun into ignoring the Torque
If we rebuild OpenMPI on the cluster itself (and so against the
installed libtorque-devel headers), all works as hoped.
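
For reference, a rebuild against the local Torque headers might look roughly like the sketch below. Open MPI's configure script takes a --with-tm option pointing at the Torque installation. The function name, both prefix arguments, and the assumption that tm.h sits directly under include/ of the Torque prefix are illustrative only; packaged installs may put the header elsewhere.

```shell
# build_ompi_with_tm TORQUE_PREFIX INSTALL_PREFIX
# Run from the top of an unpacked Open MPI source tree.
# Both arguments are hypothetical paths chosen by the caller.
build_ompi_with_tm() {
    torque_prefix=$1
    install_prefix=$2

    # Assumption: the TM header is at include/tm.h under the Torque prefix.
    # Stop early if it is missing rather than build without tm support.
    if [ ! -f "$torque_prefix/include/tm.h" ]; then
        echo "tm.h not found under $torque_prefix/include" >&2
        return 1
    fi

    ./configure --prefix="$install_prefix" --with-tm="$torque_prefix" &&
        make all install
}
```

After such a build, the ompi_info check shown earlier should still list the tm components, and the parent of orted under a job should be pbs_mom.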
We suspect that this may be due to differences in the tm interface between
Torque and the flavour of PBS used on the machine where OpenMPI was built.
On a different system I have access to, I see this in the mom_log:
07/22/2010 17:09:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file
descriptor (9) in tm_request, bad protocol version 2
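
To see whether a given node is hitting the same mismatch, the mom log can be searched for this message. The sketch below is only an illustration: the function name is made up, and it assumes the message appears on a single line in the real log (it is wrapped above only by the mail client).

```shell
# tm_protocol_errors
# Read a pbs_mom log on stdin and print each distinct protocol version
# that pbs_mom rejected in tm_request, one per line.
tm_protocol_errors() {
    sed -n 's/.*in tm_request, bad protocol version \([0-9][0-9]*\).*/\1/p' |
        sort -u
}
```

It would be used as "tm_protocol_errors < LOGFILE", where LOGFILE is that day's file under the mom_logs directory of the Torque spool (the exact location depends on the install). Any output means a client spoke a tm protocol version that pbs_mom did not accept, which would fit the suspicion of a build against a different PBS flavour.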