[torquedev] MPI question

Garrick Staples garrick at usc.edu
Fri Jul 23 09:47:19 MDT 2010


On Jul 23, 2010, at 5:26 AM, Ken Nielson wrote:

> I have a customer who is having some problems installing OpenMPI with TORQUE. Does anyone know what might be going wrong given the information below?
> 
> Thanks
> 
> 
> When OpenMPI is correctly built against Torque, it appears to use the
> 'tm' interface, so that pbs_mom becomes the parent process of all the
> MPI processes (via orted).
> We have found that the stock build of OpenMPI, although it claims to have
> tm support
> (viz.
> $ ompi_info |grep "tm "
>                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.0.2)
>                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.0.2)
> ),
> does not appear to work: the parent of orted is just init, presumably
> because it was launched over ssh.
> 
> When running OpenMPI with proper Torque integration, users hit all sorts
> of problems with missing libraries (orted is built against the Intel
> compilers),
> and users have to 'unset PBS_JOBID' to fool mpirun into ignoring the Torque
> bindings.
> 
> If we rebuild OpenMPI on the cluster itself (and so against the
> installed libtorque-devel headers), all works as hoped.
> 
> We suspect that it may be due to differences in the tm interface between
> Torque and the flavour of pbs used on the machine where OpenMPI was built.
> On a different system I have access to,  I see this in the mom_log:
> 07/22/2010 17:09:06;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file
> descriptor (9) in tm_request, bad protocol version 2

Sounds like a reasonable assessment. What is this "stock" openmpi build? Where did the customer get it?
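
One way to confirm the symptom described above (orted parented by init rather than pbs_mom) is to walk the parent chain in a process listing. The following is only a sketch: the function name has_ancestor and the two sample process tables are hypothetical, made up for illustration, and are not taken from the customer's system or from Torque or OpenMPI.

```python
def has_ancestor(ps_output: str, pid: int, ancestor_name: str) -> bool:
    """Return True if a process named ancestor_name is in pid's parent chain.

    ps_output is expected to have three columns: PID, PPID, COMMAND.
    """
    table = {}
    for line in ps_output.strip().splitlines():
        fields = line.split(None, 2)
        # Skip the header row and anything else that is not "pid ppid comm".
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        table[int(fields[0])] = (int(fields[1]), fields[2].strip())

    seen = set()
    current = pid
    # Follow parent links until we run off the table or detect a loop.
    while current in table and current not in seen:
        seen.add(current)
        ppid = table[current][0]
        if ppid in table and table[ppid][1] == ancestor_name:
            return True
        current = ppid
    return False


# Hypothetical listing where tm launching works: orted is a child of pbs_mom.
SAMPLE_TM = """
  PID  PPID COMMAND
    1     0 init
  900     1 pbs_mom
 1200   900 orted
 1201  1200 a.out
"""

# Hypothetical listing matching the reported symptom: orted is a child of init.
SAMPLE_SSH = """
  PID  PPID COMMAND
    1     0 init
  900     1 pbs_mom
 1300     1 orted
 1301  1300 a.out
"""

if __name__ == "__main__":
    print(has_ancestor(SAMPLE_TM, 1200, "pbs_mom"))
    print(has_ancestor(SAMPLE_SSH, 1300, "pbs_mom"))
```

On a compute node, the input could come from something like `ps -e -o pid,ppid,comm` while a job is running, with the pid of an orted process substituted in.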

