[torqueusers] MPI Problem
akohlmey at cmm.chem.upenn.edu
Wed Oct 13 13:59:39 MDT 2010
On Wed, Oct 13, 2010 at 3:45 PM, Brock Palen <brockp at umich.edu> wrote:
> He mentioned he already did ulimit, but was it on the stack size? Many old Fortran codes did not use ALLOCATE. Also, while you can set the stack size to unlimited in /etc/security/limits.conf,
An alternate approach to dealing with Intel Fortran stack requirements
that requires less reconfiguration and scripting would be to add
the flag '-heap-arrays' when compiling, which places automatic and
temporary arrays on the heap instead of the stack. This is supported
by all Intel Fortran compilers released in recent years.
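
For illustration, a minimal sketch of what the compile line could look
like. The helper name heap_arrays_cmd and the file names are made up for
this example, and it only prints the command rather than running the
compiler. I am assuming the MPI compiler wrapper (e.g. FC=mpif90) passes
the flag through to the underlying Intel compiler.

```shell
#!/bin/sh
# Print a compile command that adds -heap-arrays.
# heap_arrays_cmd is a hypothetical helper; FC defaults to ifort and may be
# overridden (for example FC=mpif90 when building an MPI code).
heap_arrays_cmd() {
    src="$1"   # Fortran source file (placeholder name)
    out="$2"   # output executable (placeholder name)
    echo "${FC:-ifort} -heap-arrays -o $out $src"
}

# Show the command for a made-up source file.
heap_arrays_cmd solver.f90 solver
```

In practice one would simply append -heap-arrays to the existing Fortran
flags in the application's makefile and rebuild.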
> it does not apply until after the pbs_mom init script starts, so the ulimit call needs to be in the pbs_mom init script before the mom starts.
> Have him in a pbs script run:
> pbsdsh bash -c 'ulimit -s'
> and send the output,
> Brock Palen
> Center for Advanced Computing
> brockp at umich.edu
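
To make the output of the check suggested above easier to read across
nodes, something like the following sketch could be used. The helper
names classify_stack and report_stack are hypothetical, and the 1 GB
threshold is an arbitrary assumption, not a known requirement of these
applications.

```shell
#!/bin/sh
# Sketch of a stack-limit check to run inside a Torque job.

# Classify a stack limit as reported by `ulimit -s`
# (a size in kilobytes, or the word "unlimited").
classify_stack() {
    limit="$1"
    min_kb="${2:-1048576}"   # assumed threshold: 1 GB, expressed in KB
    case "$limit" in
        unlimited) echo "ok" ;;
        ''|*[!0-9]*) echo "unknown" ;;   # empty or non-numeric value
        *) if [ "$limit" -ge "$min_kb" ]; then echo "ok"; else echo "low"; fi ;;
    esac
}

# Report the limit seen by this shell, tagged with the node name.
report_stack() {
    limit="$(ulimit -s)"
    echo "$(uname -n): stack=$limit ($(classify_stack "$limit"))"
}

report_stack
```

Saved on a shared filesystem and made executable, the script could be
launched on the allocated nodes from within a job script via pbsdsh,
e.g. "pbsdsh /path/to/check_stack.sh" (the path is a placeholder). A node
reporting "low" would suggest the limit was not raised before pbs_mom
started there.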
> On Oct 13, 2010, at 3:06 PM, David Beer wrote:
>> Hi all,
>> This is on behalf of a user. I don't know if any of you have experience with this, but I thought this would be a good place to ask. If anyone can help, please copy Bermudez.Luis at orbital.com on your reply. Many thanks. Here is his email:
>> I have two applications that are giving me issues when I try to run them
>> using more than one computing node. They both run fine outside of Torque
>> for any number of nodes and inside Torque if only one node is requested.
>> These two applications have been compiled using the Intel FORTRAN compiler.
>> We have other tools compiled with the same compiler which run fine on any
>> number of nodes. Currently I am using OpenMPI 1.4.2 and Torque 2.4.3.
>> I have tried to exclude the infiniband (mpirun --mca btl ^openib...) and
>> rely only on the high speed Ethernet with identical results. I have also
>> raised the file descriptor limit to 28768 and added explicit ulimit calls
>> for the file descriptor and locked memory size limits in the torque startup
>> In all instances, the message I am getting when the run crashes is:
>> mpirun noticed that process rank 4 with PID 17894 on node <node hostname>
>> exited on signal 11 (Segmentation fault).
>> Any suggestion to explain and help fixing this issue will be very much
>> David Beer
>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>> Adaptive Computing
>> 1656 S. East Bay Blvd. Suite #300
>> Provo, UT 84606
>> torqueusers mailing list
>> torqueusers at supercluster.org
Dr. Axel Kohlmeyer akohlmey at gmail.com
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.