[torquedev] MPI question
Michael Jennings
mej at lbl.gov
Fri Jul 23 11:29:24 MDT 2010
On Friday, 23 July 2010, at 06:26:33 (-0600),
Ken Nielson wrote:
> I have a customer who is having some problems installing OpenMPI
> with TORQUE. Does anyone know what might be going wrong given the
> information below?
>
> We suspect that it may be due to differences in the tm interface between
> Torque and the flavour of pbs used on the machine where OpenMPI was built.
> On a different system I have access to, I see this in the mom_log:
> 07/22/2010 17:09:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, bad protocol version 2
Yeah, I think they've hit the nail on the head here. This type of
message is most likely caused by a mismatch in function parameters
between the API OpenMPI's tm support was built to expect and the
actual API in libtorque.so.
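
For anyone who wants to check whether other nodes are logging the same
thing, here is a rough Python sketch that picks this specific error out
of a mom_log line. The field layout (six semicolon-separated fields,
message last) is inferred only from the single line quoted above, and
the names TmProtocolError and find_tm_protocol_error are made up for
illustration, so treat it as a starting point rather than a real log
parser.

```python
import re
from typing import NamedTuple, Optional

# Sample line copied from the quoted mom_log excerpt, rejoined onto one line.
SAMPLE = ("07/22/2010 17:09:06;0001; pbs_mom;Svr;pbs_mom;"
          "LOG_ERROR::Bad file descriptor (9) in tm_request, "
          "bad protocol version 2")


class TmProtocolError(NamedTuple):
    """Hypothetical record type for one matching log entry."""
    timestamp: str
    daemon: str
    version: int


# Matches only the message text seen in the quoted log line.
_PATTERN = re.compile(r"in tm_request, bad protocol version (\d+)")


def find_tm_protocol_error(line: str) -> Optional[TmProtocolError]:
    """Return the parsed entry if the line reports a bad tm protocol version."""
    # Assumption: six ';'-separated fields, with the free-text message last.
    fields = line.rstrip("\n").split(";", 5)
    if len(fields) < 6:
        return None
    match = _PATTERN.search(fields[5])
    if match is None:
        return None
    return TmProtocolError(
        timestamp=fields[0].strip(),
        daemon=fields[2].strip(),
        version=int(match.group(1)),
    )


if __name__ == "__main__":
    print(find_tm_protocol_error(SAMPLE))
```

Running it over each line of a node's mom_log and collecting the
non-None results would show how often, and when, the mismatch occurs.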
They say they have a version of OpenMPI built on the cluster itself
that works fine, so if possible, they should use that. If that's not
possible, at minimum they'll need to use the correctly built orted.
They could also try supplying the libtorque.so.2 from the machine
where OpenMPI was built, but it may or may not work with the pbs_mom
running on the cluster. (You guys would know the answer to that
better than I would.)
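
To see which libtorque a given binary actually resolves at run time,
something along these lines might help. It is only a sketch: it shells
out to ldd and parses the usual "name => path (address)" lines, and the
helper names parse_ldd and torque_libs are invented here. Note that,
depending on how OpenMPI was built, the tm support may live in a
separate plugin rather than in orted itself, so which file to inspect
is an assumption you would need to verify on the system.

```python
import subprocess
from typing import Dict, Optional


def parse_ldd(output: str) -> Dict[str, Optional[str]]:
    """Map each 'name => target' line of ldd output to its resolved path.

    Libraries reported as 'not found', or entries with no path, map to None.
    """
    resolved: Dict[str, Optional[str]] = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # e.g. the dynamic loader line, which has no '=>'
        name, _, target = line.strip().partition("=>")
        name = name.strip()
        target = target.strip()
        if target.startswith("not found"):
            resolved[name] = None
        else:
            # Drop the trailing load address, e.g. "(0x00007f...)".
            resolved[name] = target.split("(")[0].strip() or None
    return resolved


def torque_libs(binary: str) -> Dict[str, Optional[str]]:
    """Run ldd on `binary` and keep only the libtorque entries."""
    out = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    ).stdout
    return {n: p for n, p in parse_ldd(out).items() if n.startswith("libtorque")}
```

Calling torque_libs() on the binary or plugin in question on both the
build machine and a compute node, then comparing the two results, would
show whether each side picks up a different libtorque.so.2.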
HTH,
Michael
--
Michael Jennings <mej at lbl.gov>
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E W: 510-495-2687
MS 050C-3396 F: 510-486-8615