[torquedev] MPI question

Michael Jennings mej at lbl.gov
Fri Jul 23 11:29:24 MDT 2010


On Friday, 23 July 2010, at 06:26:33 (-0600),
Ken Nielson wrote:

> I have a customer who is having some problems installing OpenMPI
> with TORQUE. Does anyone know what might be going wrong given the
> information below?
> 
> We suspect that it may be due to differences in the tm interface between
> Torque and the flavour of pbs used on the machine where OpenMPI was built.
> On a different system I have access to, I see this in the mom_log:
> 07/22/2010 17:09:06;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, bad protocol version 2

Yeah, I think they've hit the nail on the head here.  This type of
message is most likely caused by a mismatch in function parameters
between the tm API that OpenMPI's tm support was built to expect and
the actual API in the cluster's libtorque.so.

They say they have a version of OpenMPI built on the cluster itself
that works fine, so if possible, they should use that.  If that's not
possible, at minimum they'll need to use the orted from that build.
They could also try supplying the libtorque.so.2 from the cluster
where OpenMPI was built, but it may or may not work with the pbs_mom
present on the cluster.  (You guys would know the answer to that
better than I would.)
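
If it helps to confirm which tm library a given build actually picks
up, here is a rough sketch (not anything shipped with TORQUE or
OpenMPI) that runs ldd on a binary and reports where any tm-related
library resolves.  The library-name prefixes "libtorque" and "libpbs"
and the script name are my assumptions; adjust them for the PBS flavour
in question.  If OpenMPI's tm support was built as a plugin rather than
linked into orted, point it at whichever file carries the tm support.

```python
import re
import subprocess
import sys


def parse_ldd_output(text):
    """Map each library name in ldd output to its resolved path.

    Libraries that ldd reports as "not found" map to None.  Lines
    without "=>" (the vdso and the dynamic loader) are skipped.
    """
    libs = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue
        name, _, target = line.strip().partition("=>")
        target = target.strip()
        if target.startswith("not found"):
            libs[name.strip()] = None
            continue
        # Drop the trailing load address, e.g. "(0x00007f1a2b3c4000)".
        path = re.sub(r"\s*\(0x[0-9a-fA-F]+\)\s*$", "", target)
        libs[name.strip()] = path or None
    return libs


def tm_libraries(libs, prefixes=("libtorque", "libpbs")):
    """Keep only libraries whose names look tm-related (assumed prefixes)."""
    return {n: p for n, p in libs.items() if n.startswith(prefixes)}


def main(binary):
    out = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    ).stdout
    found = tm_libraries(parse_ldd_output(out))
    if not found:
        print("no tm-related library linked directly into", binary)
    for name, path in sorted(found.items()):
        print(name, "->", path if path else "NOT FOUND")


# Only run against a real binary when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run it as, say, "python check_tm.py /path/to/orted" on both the build
machine and the cluster and compare the results.  This only shows which
library file gets loaded; it does not by itself prove that the library
speaks the same tm protocol version as the running pbs_mom.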

HTH,
Michael

-- 
Michael Jennings <mej at lbl.gov>
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E      W: 510-495-2687
MS 050C-3396        F: 510-486-8615

