[torqueusers] openmpi failing when run under torque

Andrus, Brian Contractor bdandrus at nps.edu
Thu Jul 12 14:35:24 MDT 2012


All,

I am upgrading our cluster to CentOS 6.
I am having great grief running a simple MPI program.
It runs fine when I log in to the node directly; however, if I run it in an interactive session (qsub -I), it segfaults and dumps core.

Running when directly ssh-ing to the node:
============================================
[bdandrus at compute-3-1 OPENMPI]$ mpirun -np 4 ./a.out
Process 0 starting to receive
1: compute-3-1 says it is process 1 of 4. received
2: compute-3-1 says it is process 2 of 4. received
3: compute-3-1 says it is process 3 of 4. Received
=============================================

Running under 'qsub -I':
============================================
[bdandrus at compute-3-1 OPENMPI]$ mpirun -np 4 ./a.out
[compute-3-1:31659] *** Process received signal ***
[compute-3-1:31659] Signal: Segmentation fault (11)
[compute-3-1:31659] Signal code: Address not mapped (1)
[compute-3-1:31659] Failing at address: 0x40
[compute-3-1:31659] [ 0] /lib64/libpthread.so.0(+0xf500) [0x7ffff6b71500]
[compute-3-1:31659] [ 1] /lib64/libc.so.6(_IO_vfprintf+0x3679) [0x7ffff6817389]
[compute-3-1:31659] [ 2] /lib64/libc.so.6(vasprintf+0xba) [0x7ffff683e8da]
[compute-3-1:31659] [ 3] /opt/openmpi/1.6/lib/libmpi.so.1(opal_show_help_vstring+0x333) [0x7ffff7b51dc3]
[compute-3-1:31659] [ 4] /opt/openmpi/1.6/lib/libmpi.so.1(orte_show_help+0xac) [0x7ffff7ae19fc]
[compute-3-1:31659] [ 5] /opt/openmpi/1.6/lib/openmpi/mca_btl_openib.so(+0xb0ba) [0x7ffff30d60ba]
[compute-3-1:31659] [ 6] /opt/openmpi/1.6/lib/openmpi/mca_mpool_rdma.so(+0x15ff) [0x7ffff49715ff]
[compute-3-1:31659] [ 7] /opt/openmpi/1.6/lib/openmpi/mca_mpool_rdma.so(mca_mpool_rdma_alloc+0xa9) [0x7ffff49720c9]
[compute-3-1:31659] [ 8] /opt/openmpi/1.6/lib/libmpi.so.1(ompi_free_list_grow+0x280) [0x7ffff7a81980]
[compute-3-1:31659] [ 9] /opt/openmpi/1.6/lib/openmpi/mca_btl_openib.so(+0xc34a) [0x7ffff30d734a]
[compute-3-1:31659] [10] /opt/openmpi/1.6/lib/openmpi/mca_btl_openib.so(+0xfb6e) [0x7ffff30dab6e]
[compute-3-1:31659] [11] /opt/openmpi/1.6/lib/libmpi.so.1(mca_btl_base_select+0x114) [0x7ffff7ac3764]
[compute-3-1:31659] [12] /opt/openmpi/1.6/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12) [0x7ffff39199d2]
[compute-3-1:31659] [13] /opt/openmpi/1.6/lib/libmpi.so.1(mca_bml_base_init+0x99) [0x7ffff7ac2f49]
[compute-3-1:31659] [14] /opt/openmpi/1.6/lib/openmpi/mca_pml_ob1.so(+0x4ea0) [0x7ffff3d23ea0]
[compute-3-1:31659] [15] /opt/openmpi/1.6/lib/libmpi.so.1(mca_pml_base_select+0x1e4) [0x7ffff7ad2404]
[compute-3-1:31659] [16] /opt/openmpi/1.6/lib/libmpi.so.1(ompi_mpi_init+0x3ca) [0x7ffff7a96dba]
[compute-3-1:31659] [17] /opt/openmpi/1.6/lib/libmpi.so.1(MPI_Init+0x170) [0x7ffff7aacf00]
[compute-3-1:31659] [18] ./a.out(main+0x4f) [0x400ba3]
[compute-3-1:31659] [19] /lib64/libc.so.6(__libc_start_main+0xfd) [0x7ffff67eecdd]
[compute-3-1:31659] [20] ./a.out() [0x400a99]
[compute-3-1:31659] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31659 on node compute-3-1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
===========================================================

This is under Torque 3.0.5, with the Torque scheduler in use as well (for testing).
Any ideas what may be going on here?
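
The backtrace above fails inside mca_btl_openib and mca_mpool_rdma_alloc while orte_show_help is trying to print a message, so the difference between the two sessions may lie in the environment each one hands to the process. One thing worth comparing is the per-process resource limits. This is an assumption about where to look, not a diagnosis. The sketch below is a hypothetical helper (the names LIMITS, fmt and collect_limits are made up for illustration) that prints a few limits using Python's standard resource module; the idea would be to run it once in the ssh session and once under 'qsub -I' and compare the output.

```python
import resource

# Limits that can differ between a login shell and a batch-launched shell.
# Which of these (if any) matters here is an assumption, not a finding.
LIMITS = {
    "max locked memory": resource.RLIMIT_MEMLOCK,
    "stack size": resource.RLIMIT_STACK,
    "core file size": resource.RLIMIT_CORE,
    "open files": resource.RLIMIT_NOFILE,
}


def fmt(value):
    """Render a raw limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def collect_limits():
    """Return {name: (soft, hard)} as strings for the current process."""
    report = {}
    for name, rlimit in LIMITS.items():
        soft, hard = resource.getrlimit(rlimit)
        report[name] = (fmt(soft), fmt(hard))
    return report


if __name__ == "__main__":
    for name, (soft, hard) in collect_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

If the two sessions report different values, that narrows the search to how the job's shell is started; if they match, the limits can be ruled out and the environment variables are the next thing to compare.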


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238




