[torqueusers] MPI Problem

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Wed Oct 13 13:59:39 MDT 2010


On Wed, Oct 13, 2010 at 3:45 PM, Brock Palen <brockp at umich.edu> wrote:
> He mentioned he already did ulimit, but was it on the stack size?  Many old Fortran codes did not use ALLOCATE. Also, you can set the stack size to unlimited in /etc/security/limits.conf.

FWIW,

An alternative approach to dealing with the Intel Fortran stack
requirements, one that needs less reconfiguration and scripting, is to
add the flag '-heap-arrays' when compiling. It is supported by all
Intel Fortran compilers released in recent years.
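
For example (the program and source file names here are just placeholders):

    ifort -heap-arrays -O2 -o myapp myapp.f90

The flag also takes an optional size threshold in kilobytes, e.g.
'-heap-arrays 64', so that only temporary arrays above that size are
moved to the heap.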

cheers,
    axel.


> That setting does not take effect until after the pbs_mom init script has started the mom, so the ulimit needs to be set in the pbs_mom init script itself, before the mom starts.
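>
> A minimal sketch of what that can look like (the exact init script path and layout vary by distribution, so treat this as an illustration, not the actual script):
>
>     # near the top of the pbs_mom init script, before the daemon is launched
>     ulimit -s unlimited    # stack limit inherited by the mom and the jobs it spawns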
>
> Have him run the following in a PBS script:
>
> pbsdsh bash -c 'ulimit -s'
>
> and send the output,
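>
> For example, wrapped in a throwaway job script (the node counts and walltime are just placeholders):
>
>     #!/bin/bash
>     #PBS -l nodes=2:ppn=1
>     #PBS -l walltime=00:05:00
>
>     pbsdsh bash -c 'ulimit -s'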
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
>
> On Oct 13, 2010, at 3:06 PM, David Beer wrote:
>
>> Hi all,
>>
>> This is on behalf of a user. I don't know if any of you have experience with this, but I thought this would be a good place to ask. If anyone can help, please copy Bermudez.Luis at orbital.com on your reply. Many thanks. Here is his email:
>>
>> I have two applications that are giving me issues when I try to run them
>> on more than one compute node.  They both run fine outside of Torque on
>> any number of nodes, and inside Torque if only one node is requested.
>> These two applications were compiled with the Intel Fortran compiler.
>> We have other tools compiled with the same compiler that run fine on any
>> number of nodes.  Currently I am using OpenMPI 1.4.2 and Torque 2.4.3.
>>
>> I have tried excluding the InfiniBand (mpirun --mca btl ^openib ...) and
>> relying only on the high-speed Ethernet, with identical results.  I have
>> also raised the file descriptor limit to 28768 and added explicit ulimit
>> calls for the file descriptor and locked memory limits in the Torque
>> startup script.
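>>
>> (That is, something along these lines; the executable name and process
>> count are placeholders:)
>>
>>     mpirun --mca btl ^openib -np 16 ./myapp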
>>
>> In all instances, the message I am getting when the run crashes is:
>>
>>
>> mpirun noticed that process rank 4 with PID 17894 on node <node hostname>
>> exited on signal 11 (Segmentation fault).
>>
>>
>>
>> Any suggestion to explain and help fix this issue would be very much
>> appreciated.
>>
>> --
>> David Beer
>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>>     Adaptive Computing
>>     1656 S. East Bay Blvd. Suite #300
>>     Provo, UT 84606
>>



-- 
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

