[torqueusers] MPI Problem : Perhaps unlimit stacksize in ~/.cshrc file
Coyle, James J [ITACD]
jjc at iastate.edu
Wed Oct 13 13:58:23 MDT 2010
Since the mpirun works on one node but not on two, my guess is that something
set in the job script applies only to the first node.
Most likely the stacksize limit is smaller on the compute nodes than on whatever
interactive nodes you are using, and the user has unlimit stacksize in the script.
If the user places unlimit stacksize in the Torque job script, this applies only
to the first node, so the job will work on one node but not on more.
If you place unlimit stacksize in ~/.cshrc, this unlimits stacksize on all nodes,
because every csh or tcsh started on a remote node reads that file.
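One way to check this guess is to print the stack limits that a freshly started shell actually sees on each node. The sketch below is only an illustration: the function name report_stack_limit is made up here, and it is written for a POSIX sh or bash rather than csh.

```shell
# Hypothetical helper (not from the original post): print this node's name
# together with the soft and hard stack limits seen by the current shell.
report_stack_limit() {
    printf '%s: soft=%s hard=%s\n' "$(uname -n)" "$(ulimit -S -s)" "$(ulimit -H -s)"
}

report_stack_limit
```

If this is saved as a small script and started on every allocated node from inside the job (for example through mpirun, which can launch ordinary programs as well as MPI ones), a smaller soft value on the second and later nodes would support the explanation above.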
(For bash users, the equivalent command is
ulimit -s unlimited
.)
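A plain ulimit -s unlimited fails when the hard limit on a node is lower than unlimited. A slightly more defensive variant for a bash startup file is sketched below; the function name raise_stack_limit is an illustrative assumption, and which startup file bash reads for remotely started processes depends on the local setup.

```shell
# Hypothetical helper (not from the original post): raise the soft stack
# limit as far as this node permits.
raise_stack_limit() {
    # Try unlimited first; if the hard limit forbids that, fall back to
    # setting the soft limit equal to the hard limit.
    ulimit -S -s unlimited 2>/dev/null || ulimit -S -s "$(ulimit -H -s)" 2>/dev/null
    # Never let a failure here abort the startup file.
    return 0
}

raise_stack_limit
```

After this runs, the soft limit reported by ulimit -S -s should match the hard limit reported by ulimit -H -s.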
The alternative is for the sysadmin to unlimit stacksize on all compute nodes,
in /etc/security/limits.conf for example.
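As a rough sketch of that sysadmin-side change, the limits.conf entries would look something like the fragment below. The wildcard domain is an assumption made for illustration; a site may prefer to restrict it to a group. These limits are applied through PAM sessions, so it is worth verifying that processes started by the batch system actually pick them up.

```
# /etc/security/limits.conf on each compute node
# <domain>  <type>  <item>  <value>
*           soft    stack   unlimited
*           hard    stack   unlimited
```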
- Jim C.
James Coyle, PhD
High Performance Computing Group
115 Durham Center
Iowa State Univ.
Ames, Iowa 50011 web: http://www.public.iastate.edu/~jjc
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
>Sent: Wednesday, October 13, 2010 2:07 PM
>To: torqueusers at adaptivecomputing.com
>Subject: [torqueusers] MPI Problem
>This is on behalf of a user. I don't know if any of you have
>experience with this, but I thought this would be a good place to
>ask. If anyone can help, please copy Bermudez.Luis at orbital.com on
>your reply. Many thanks. Here is his email:
>I have two applications that are giving me issues when I try to run
>using more than one computing node. They both run fine outside of
>for any number of nodes and inside Torque if only one node is
>These two applications have been compiled using the Intel FORTRAN
>We have other tools compiled with the same compiler which run fine
>number of nodes. Currently I am using OpenMPI 1.4.2 and Torque
>I have tried to exclude the infiniband (mpirun --mca btl ^openib...)
>rely only on the high speed Ethernet with identical results. I have
>opened the file descriptor limit to 28768 and added explicit ulimit
>for the file descriptor and size memory locked in the torque startup
>In all instances, the message I am getting when the run crashes is:
>mpirun noticed that process rank 4 with PID 17894 on node <node
>exited on signal 11 (Segmentation fault).
>Any suggestion to explain and help fixing this issue will be very
>Direct Line: 801-717-3386 | Fax: 801-717-3738
> Adaptive Computing
> 1656 S. East Bay Blvd. Suite #300
> Provo, UT 84606