[torquedev] MPI question
garrick at usc.edu
Fri Jul 23 09:47:19 MDT 2010
On Jul 23, 2010, at 5:26 AM, Ken Nielson wrote:
> I have a customer who is having some problems installing OpenMPI with TORQUE. Does anyone know what might be going wrong given the information below?
> When OpenMPI is correctly built against Torque, it appears to use the
> 'tm' interface, so that pbs_mom becomes the parent process of all the
> MPI processes (via orted).
> We have found that the stock build of OpenMPI, although it claims to have
> tm support,
> $ ompi_info |grep "tm "
> MCA ras: tm (MCA v2.0, API v2.0, Component v1.0.2)
> MCA plm: tm (MCA v2.0, API v2.0, Component v1.0.2)
> does not appear to work - the parent of orted is just init - presumably
> from an ssh.
> When running OpenMPI with proper torque integration, users hit all sorts
> of problems with missing libraries (orted is built against the Intel
> and users have to 'unset PBS_JOBID' to fool mpirun to ignore the Torque
> If we rebuild OpenMPI on the cluster itself (and so against the
> installed libtorque-devel headers), all works as hoped.
> We suspect that it may be due to differences in the tm interface between
> Torque and the flavour of pbs used on the machine where OpenMPI was built.
> On a different system I have access to, I see this in the mom_log:
> 07/22/2010 17:09:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file
> descriptor (9) in tm_request, bad protocol version 2
Sounds like a reasonable assessment. What is this "stock" openmpi build? Where did the customer get it?
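
As the quoted report shows, ompi_info listing the tm components does not by itself prove that the tm launcher is actually used at run time; the parent process of orted is the more telling check. A minimal shell sketch of both checks follows. The function names count_tm_components and classify_orted_parent are hypothetical helpers invented for illustration, and the mapping from parent name to launch method is an assumption based on the behaviour described above, not documented Open MPI or TORQUE behaviour.

```shell
# Hypothetical helper: count the tm components in ompi_info-style
# output read from stdin (e.g. "MCA ras: tm (...)", "MCA plm: tm (...)").
count_tm_components() {
    grep -c 'MCA [a-z]*: tm '
}

# Hypothetical helper: given the command name of orted's parent process,
# guess how orted was launched. Assumption: a pbs_mom parent means the tm
# interface was used; init or sshd suggests an ssh-style launch instead.
classify_orted_parent() {
    case "$1" in
        pbs_mom)   echo "tm" ;;
        init|sshd) echo "ssh" ;;
        *)         echo "unknown" ;;
    esac
}

# Example use on a compute node while a job is running:
#   ompi_info | count_tm_components
#   ppid=$(ps -o ppid= -p "$(pgrep -o -x orted)" | tr -d ' ')
#   classify_orted_parent "$(ps -o comm= -p "$ppid")"
```

If the first check reports tm components but the second reports "ssh", the build probably carries tm support that is not being selected (or not working) under this TORQUE installation, which matches the symptom described in the quoted message.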