[torqueusers] MPI Problem

David Beer dbeer at adaptivecomputing.com
Wed Oct 13 13:06:41 MDT 2010


Hi all, 

This is on behalf of a user. I don't know if any of you have experience with this, but I thought this would be a good place to ask. If anyone can help, please copy Bermudez.Luis at orbital.com on your reply. Many thanks. Here is his email:

I have two applications that are giving me issues when I try to run them
on more than one compute node.  They both run fine outside of Torque
on any number of nodes, and inside Torque if only one node is requested.
These two applications were compiled with the Intel Fortran compiler.
We have other tools compiled with the same compiler which run fine on any
number of nodes.  Currently I am using OpenMPI 1.4.2 and Torque 2.4.3.

I have tried excluding InfiniBand (mpirun --mca btl ^openib...) and
relying only on the high-speed Ethernet, with identical results.  I have
also raised the file descriptor limit to 28768 and added explicit ulimit
calls for the file descriptor and locked-memory limits in the Torque
startup script.
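
For reference, a minimal sketch of a job script that applies the settings
described above and reports the limits the job shell actually ends up
with, assuming bash.  The resource request, the "unlimited" value for
locked memory, the report_limits helper and the ./my_app executable are
placeholders for illustration, not the actual configuration.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=4
# The resource request above is a placeholder; adjust it to the real job.

# Try to raise the limits. 28768 is the value quoted in the text;
# "unlimited" for locked memory is an assumption.
ulimit -n 28768 2>/dev/null || echo "could not raise open-file limit" >&2
ulimit -l unlimited 2>/dev/null || echo "could not raise locked-memory limit" >&2

# Hypothetical helper: print the limits this shell currently has.
report_limits() {
    echo "host=$(uname -n) nofile=$(ulimit -n) memlock=$(ulimit -l)"
}
report_limits

# Launch without the InfiniBand transport, as in the text.
# ./my_app is a placeholder; the launch is skipped if it is not present.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    mpirun --mca btl ^openib ./my_app
fi
```

Comparing the reported values with those seen in an interactive shell on
the same node may show whether the limits set in the startup script are
really in effect for the job.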

In all instances, the message I am getting when the run crashes is:


mpirun noticed that process rank 4 with PID 17894 on node <node hostname>
exited on signal 11 (Segmentation fault).



Any suggestions that could explain or help fix this issue would be very
much appreciated.

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1656 S. East Bay Blvd. Suite #300
     Provo, UT 84606


