[torqueusers] Segmentation fault when using OpenMPI -pernode option

Damian Montaldo damianmontaldo at gmail.com
Fri Jun 1 13:29:16 MDT 2012


Hi, I need some help with Torque and a specific option of OpenMPI.
I'm sending this email again because the first one doesn't seem to have arrived.

I have nodes with 4 processors each, and I want to launch only one
process on each node using the -pernode option, because I also need to
make sure that Torque will not queue other jobs on those nodes.
As the OpenMPI manual says: "On each node, launch one process --
equivalent to -npernode 1."
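
For context, the relevant part of my submission script looks roughly
like this (the Python script name is a placeholder, and computing NP
from $PBS_NODEFILE is just one way to count the allocated nodes):

#!/bin/bash
#PBS -N TEST
#PBS -l nodes=4:ppn=4
#PBS -l walltime=1:00:00

# Requesting all 4 cores per node (ppn=4) keeps Torque from putting
# other jobs on these nodes; -pernode then starts one MPI rank per node.
# NP counts the unique nodes in the allocation (illustrative approach).
NP=$(sort -u "$PBS_NODEFILE" | wc -l)

mpiexec -verbose -pernode -np $NP python my_script.py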

This is the error I got. I tried to google it, but a segmentation
fault is a very common error, and most reports turn out to be bugs in
the binary executed by mpiexec. I think this is a Torque-specific
problem, because running mpirun with a host file and -pernode outside
of Torque seems to work (see the sketch after the log below).

$ cat TEST.e37495
[n52:04352] *** Process received signal ***
[n52:04352] Signal: Segmentation fault (11)
[n52:04352] Signal code: Address not mapped (1)
[n52:04352] Failing at address: 0x50
[n52:04352] [ 0] /lib/libpthread.so.0(+0xeff0) [0x2aca79ff4ff0]
[n52:04352] [ 1] /usr/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xbc) [0x2aca792c334c]
[n52:04352] [ 2] /usr/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x2d4) [0x2aca792d1ea4]
[n52:04352] [ 3] /usr/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x11e) [0x2aca792d596e]
[n52:04352] [ 4] /usr/lib/openmpi/lib/openmpi/mca_plm_tm.so(+0x1d4a) [0x2aca7b382d4a]
[n52:04352] [ 5] mpiexec() [0x403aaf]
[n52:04352] [ 6] mpiexec() [0x402f74]
[n52:04352] [ 7] /lib/libc.so.6(__libc_start_main+0xfd) [0x2aca7a220c8d]
[n52:04352] [ 8] mpiexec() [0x402e99]
[n52:04352] *** End of error message ***
/var/spool/torque/mom_priv/jobs/37495....SC: line 107:  4352 Segmentation fault      mpiexec -verbose -pernode -np $NP python ..args...
[n48:15977] [[10692,0],2] routed:binomial: Connection to lifeline [[10692,0],0] lost
[n49:15992] [[10692,0],1] routed:binomial: Connection to lifeline [[10692,0],0] lost
[n42:16290] [[10692,0],3] routed:binomial: Connection to lifeline [[10692,0],0] lost
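
For comparison, the invocation that does work when run directly,
outside of Torque, is along these lines (the hostfile contents and
script name are illustrative, using the node names from the log above):

$ cat hostfile
n52
n49
n48
n42
$ mpirun -hostfile hostfile -pernode python my_script.py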

$ qstat -f 37495
Job Id: 37495
   Job_Name = TEST
   resources_used.cput = 00:00:00
   resources_used.mem = 532kb
   resources_used.vmem = 9056kb
   resources_used.walltime = 00:00:01
   job_state = C
   queue = batch
   server = n0
   Checkpoint = u
   ctime = Thu May 31 20:42:47 2012
   exec_host = n52/3+n52/2+n52/1+n52/0+n49/3+n49/2+n49/1+n49/0+n48/3+n48/2+n4
       8/1+n48/0+n42/3+n42/2+n42/1+n42/0
   Hold_Types = n
   Join_Path = n
   Keep_Files = eo
   Mail_Points = abe
   mtime = Thu May 31 20:43:21 2012
   Priority = 0
   qtime = Thu May 31 20:42:47 2012
   Rerunable = True
   Resource_List.nodect = 4
   Resource_List.nodes = 4:ppn=4
   Resource_List.walltime = 01:00:00
   session_id = 4342
   Variable_List = PBS_O_LANG=es_AR.UTF-8,
       PBS_O_LOGNAME=dfslezak,
       PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games,
       PBS_O_SHELL=/bin/bash,PBS_SERVER=n0,
       PBS_O_QUEUE=batch,
       PBS_O_HOST=n0
   comment = Job started on Thu May 31 at 20:43
   etime = Thu May 31 20:42:47 2012
   exit_status = 0
   submit_args = -l walltime=1:00:00
   start_time = Thu May 31 20:43:21 2012
   Walltime.Remaining = 360
   start_count = 1
   fault_tolerant = False
   comp_time = Thu May 31 20:43:21 2012

$ mpiexec --version
mpiexec (OpenRTE) 1.4.2

It doesn't seem to be related to Python, but this is the version anyway.
$ python --version
Python 2.6.6

It is Debian Linux (squeeze, up to date) with this Torque version:
$ dpkg -l | grep torque
ii  libtorque2        2.4.8+dfsg-9squeeze1   shared library for Torque client and server
ii  torque-client     2.4.8+dfsg-9squeeze1   command line interface to Torque server
ii  torque-common     2.4.8+dfsg-9squeeze1   Torque Queueing System shared files
ii  torque-mom        2.4.8+dfsg-9squeeze1   job execution engine for Torque batch system
ii  torque-scheduler  2.4.8+dfsg-9squeeze1   scheduler part of Torque
ii  torque-server     2.4.8+dfsg-9squeeze1   PBS-derived batch processing server

If you need more specific info (perhaps the output of qmgr -c 'print
server'?), just ask. Of course, any help would be much appreciated!

