[torqueusers] Segmentation fault when using OpenMPI -pernode option
Damian Montaldo
damianmontaldo at gmail.com
Fri Jun 1 13:29:16 MDT 2012
Hi, I need some help with Torque and a specific option of OpenMPI.
I'm sending this email twice because the first one looks like it didn't arrive.
I have nodes with 4 processors each, and I want to launch only one
process on each node using the -pernode option, because I need to make
sure that Torque does not schedule other jobs on that node.
As the manual says: "On each node, launch one process -- equivalent to
-npernode 1."
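To make the intended placement concrete, here is a minimal sketch of what "one process per node" means in terms of the allocation. It assumes the usual Torque behaviour of listing a host once per allocated slot in the file named by PBS_NODEFILE (that variable is not shown elsewhere in this message), and the function name one_slot_per_node is only illustrative.

```python
import os


def one_slot_per_node(nodefile_lines):
    """Return each host exactly once, in first-seen order.

    Assumption: the input looks like a Torque node file, i.e. one host
    name per line, repeated once for every slot allocated on that host.
    """
    seen = set()
    hosts = []
    for line in nodefile_lines:
        host = line.strip()
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    # PBS_NODEFILE is only set inside a running Torque job; outside of
    # one this script prints nothing.
    path = os.environ.get("PBS_NODEFILE")
    if path:
        with open(path) as fh:
            print("\n".join(one_slot_per_node(fh)))
```

With nodes=4:ppn=4 this would reduce 16 node-file lines to 4 host names, which is the placement I expect from -pernode.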
This is the error I got. I tried to google it, but a segmentation fault
is a very common error, and it is usually reported as a problem with the
binary executed by mpiexec. I think this is a Torque-specific error,
because running mpirun with a host file and -pernode seems to work.
$ cat TEST.e37495
[n52:04352] *** Process received signal ***
[n52:04352] Signal: Segmentation fault (11)
[n52:04352] Signal code: Address not mapped (1)
[n52:04352] Failing at address: 0x50
[n52:04352] [ 0] /lib/libpthread.so.0(+0xeff0) [0x2aca79ff4ff0]
[n52:04352] [ 1] /usr/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xbc) [0x2aca792c334c]
[n52:04352] [ 2] /usr/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x2d4) [0x2aca792d1ea4]
[n52:04352] [ 3] /usr/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x11e) [0x2aca792d596e]
[n52:04352] [ 4] /usr/lib/openmpi/lib/openmpi/mca_plm_tm.so(+0x1d4a) [0x2aca7b382d4a]
[n52:04352] [ 5] mpiexec() [0x403aaf]
[n52:04352] [ 6] mpiexec() [0x402f74]
[n52:04352] [ 7] /lib/libc.so.6(__libc_start_main+0xfd) [0x2aca7a220c8d]
[n52:04352] [ 8] mpiexec() [0x402e99]
[n52:04352] *** End of error message ***
/var/spool/torque/mom_priv/jobs/37495....SC: line 107: 4352 Segmentation fault mpiexec -verbose -pernode -np $NP python ..args...
[n48:15977] [[10692,0],2] routed:binomial: Connection to lifeline [[10692,0],0] lost
[n49:15992] [[10692,0],1] routed:binomial: Connection to lifeline [[10692,0],0] lost
[n42:16290] [[10692,0],3] routed:binomial: Connection to lifeline [[10692,0],0] lost
$ qstat -f 37495
Job Id: 37495
Job_Name = TEST
resources_used.cput = 00:00:00
resources_used.mem = 532kb
resources_used.vmem = 9056kb
resources_used.walltime = 00:00:01
job_state = C
queue = batch
server = n0
Checkpoint = u
ctime = Thu May 31 20:42:47 2012
exec_host = n52/3+n52/2+n52/1+n52/0+n49/3+n49/2+n49/1+n49/0+n48/3+n48/2+n4
8/1+n48/0+n42/3+n42/2+n42/1+n42/0
Hold_Types = n
Join_Path = n
Keep_Files = eo
Mail_Points = abe
mtime = Thu May 31 20:43:21 2012
Priority = 0
qtime = Thu May 31 20:42:47 2012
Rerunable = True
Resource_List.nodect = 4
Resource_List.nodes = 4:ppn=4
Resource_List.walltime = 01:00:00
session_id = 4342
Variable_List = PBS_O_LANG=es_AR.UTF-8,
PBS_O_LOGNAME=dfslezak,
PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games,
PBS_O_SHELL=/bin/bash,PBS_SERVER=n0,
PBS_O_QUEUE=batch,
PBS_O_HOST=n0
comment = Job started on Thu May 31 at 20:43
etime = Thu May 31 20:42:47 2012
exit_status = 0
submit_args = -l walltime=1:00:00
start_time = Thu May 31 20:43:21 2012
Walltime.Remaining = 360
start_count = 1
fault_tolerant = False
comp_time = Thu May 31 20:43:21 2012
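As a quick sanity check on the allocation above, the exec_host value can be split into per-host slot counts. This is only a sketch: the function name slots_per_host is illustrative, and it assumes the "host/slot" entries joined by "+" seen in this job's output.

```python
def slots_per_host(exec_host):
    """Count how many slots each host contributes to the job.

    Assumption: exec_host has the form "host/slot+host/slot+...",
    as in the qstat -f output for job 37495.
    """
    counts = {}
    for entry in exec_host.split("+"):
        host = entry.partition("/")[0]
        counts[host] = counts.get(host, 0) + 1
    return counts


# exec_host from job 37495, with the wrapped line joined back together.
EXEC_HOST = (
    "n52/3+n52/2+n52/1+n52/0+n49/3+n49/2+n49/1+n49/0+"
    "n48/3+n48/2+n48/1+n48/0+n42/3+n42/2+n42/1+n42/0"
)

counts = slots_per_host(EXEC_HOST)
print("%d nodes, %d slots" % (len(counts), sum(counts.values())))
# prints "4 nodes, 16 slots"
```

So Torque did reserve all 4 slots on each of the 4 nodes, which matches Resource_List.nodes = 4:ppn=4.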
$ mpiexec --version
mpiexec (OpenRTE) 1.4.2
It doesn't seem to be related to Python, but this is the version:
$ python --version
Python 2.6.6
It's a Debian Linux system (squeeze, up to date) with this Torque version:
$ dpkg -l | grep torque
ii  libtorque2        2.4.8+dfsg-9squeeze1  shared library for Torque client and server
ii  torque-client     2.4.8+dfsg-9squeeze1  command line interface to Torque server
ii  torque-common     2.4.8+dfsg-9squeeze1  Torque Queueing System shared files
ii  torque-mom        2.4.8+dfsg-9squeeze1  job execution engine for Torque batch system
ii  torque-scheduler  2.4.8+dfsg-9squeeze1  scheduler part of Torque
ii  torque-server     2.4.8+dfsg-9squeeze1  PBS-derived batch processing server
If you need more specific information (perhaps the output of
qmgr -c 'print server'?), just let me know. Of course, any help would be
much appreciated!