[torqueusers] Segmentation fault when using OpenMPI -pernode option

Gus Correa gus at ldeo.columbia.edu
Mon Jun 4 09:48:30 MDT 2012


Hi Damian

Did you build your OpenMPI with Torque support?
Or did you install it from a Debian package?
The Debian OpenMPI package [if it exists] may not
have Torque support.
In this case, the mpiexec/mpirun probably won't
know how to coordinate with Torque regarding
nodes, cores, resources, etc.

You can run
'mpicc --showme'
to see if
'-ltorque'
appears in the output.
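
For example, something along these lines should show it
(just a sketch; the exact output depends on how your
OpenMPI was built):

$ mpicc --showme:link | grep torque
$ ompi_info | grep tm

If '-ltorque' shows up in the first command, and the second
one lists the 'tm' components (plm, ras), then that OpenMPI
was built with Torque/TM support.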

I hope this helps,
Gus Correa

On 06/01/2012 03:29 PM, Damian Montaldo wrote:
> Hi, I need some help with Torque and a specific option of OpenMPI.
> I'm sending this email again because the first one doesn't seem to have arrived.
>
> I have nodes with 4 processors each, and I want to launch only one
> process on each node using the -pernode option, because I need to make
> sure that Torque will not queue other jobs on those nodes.
> As the manual says: "On each node, launch one process (equivalent to
> -npernode 1)".
>
> This is the error I got. I tried to Google it, but a segmentation fault
> is a very common error, and it is usually reported as a problem with the
> binary executed by mpiexec. I think this is specifically a Torque-related
> error, because running mpirun with a host file and -pernode seems to work.
>
> $ cat TEST.e37495
> [n52:04352] *** Process received signal ***
> [n52:04352] Signal: Segmentation fault (11)
> [n52:04352] Signal code: Address not mapped (1)
> [n52:04352] Failing at address: 0x50
> [n52:04352] [ 0] /lib/libpthread.so.0(+0xeff0) [0x2aca79ff4ff0]
> [n52:04352] [ 1]
> /usr/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xbc)
> [0x2aca792c334c]
> [n52:04352] [ 2]
> /usr/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x2d4)
> [0x2aca792d1ea4]
> [n52:04352] [ 3]
> /usr/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x11e)
> [0x2aca792d596e]
> [n52:04352] [ 4] /usr/lib/openmpi/lib/openmpi/mca_plm_tm.so(+0x1d4a)
> [0x2aca7b382d4a]
> [n52:04352] [ 5] mpiexec() [0x403aaf]
> [n52:04352] [ 6] mpiexec() [0x402f74]
> [n52:04352] [ 7] /lib/libc.so.6(__libc_start_main+0xfd) [0x2aca7a220c8d]
> [n52:04352] [ 8] mpiexec() [0x402e99]
> [n52:04352] *** End of error message ***
> /var/spool/torque/mom_priv/jobs/37495....SC: line 107:  4352
> Segmentation fault      mpiexec -verbose -pernode -np $NP python
> ..args...
> [n48:15977] [[10692,0],2] routed:binomial: Connection to lifeline
> [[10692,0],0] lost
> [n49:15992] [[10692,0],1] routed:binomial: Connection to lifeline
> [[10692,0],0] lost
> [n42:16290] [[10692,0],3] routed:binomial: Connection to lifeline
> [[10692,0],0] lost
>
> $ qstat -f 37495
> Job Id: 37495
>     Job_Name = TEST
>     resources_used.cput = 00:00:00
>     resources_used.mem = 532kb
>     resources_used.vmem = 9056kb
>     resources_used.walltime = 00:00:01
>     job_state = C
>     queue = batch
>     server = n0
>     Checkpoint = u
>     ctime = Thu May 31 20:42:47 2012
>     exec_host = n52/3+n52/2+n52/1+n52/0+n49/3+n49/2+n49/1+n49/0+n48/3+n48/2+n4
>         8/1+n48/0+n42/3+n42/2+n42/1+n42/0
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = eo
>     Mail_Points = abe
>     mtime = Thu May 31 20:43:21 2012
>     Priority = 0
>     qtime = Thu May 31 20:42:47 2012
>     Rerunable = True
>     Resource_List.nodect = 4
>     Resource_List.nodes = 4:ppn=4
>     Resource_List.walltime = 01:00:00
>     session_id = 4342
>     Variable_List = PBS_O_LANG=es_AR.UTF-8,
>         PBS_O_LOGNAME=dfslezak,
>         PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games,
>         PBS_O_SHELL=/bin/bash,PBS_SERVER=n0,
>         PBS_O_QUEUE=batch,
>         PBS_O_HOST=n0
>     comment = Job started on Thu May 31 at 20:43
>     etime = Thu May 31 20:42:47 2012
>     exit_status = 0
>     submit_args = -l walltime=1:00:00
>     start_time = Thu May 31 20:43:21 2012
>     Walltime.Remaining = 360
>     start_count = 1
>     fault_tolerant = False
>     comp_time = Thu May 31 20:43:21 2012
>
> $ mpiexec --version
> mpiexec (OpenRTE) 1.4.2
>
> It doesn't seem to be related to Python, but this is the version.
> $ python --version
> Python 2.6.6
>
> It is Debian Linux (squeeze, up to date) with this Torque version:
> $ dpkg -l | grep torque
> ii  libtorque2             2.4.8+dfsg-9squeeze1   shared library for
> Torque client and server
> ii  torque-client         2.4.8+dfsg-9squeeze1   command line
> interface to Torque server
> ii  torque-common    2.4.8+dfsg-9squeeze1   Torque Queueing System shared files
> ii  torque-mom         2.4.8+dfsg-9squeeze1   job execution engine for
> Torque batch system
> ii  torque-scheduler   2.4.8+dfsg-9squeeze1   scheduler part of Torque
> ii  torque-server        2.4.8+dfsg-9squeeze1   PBS-derived batch
> processing server
>
> If you need more specific info (perhaps a 'qmgr print server'?), just
> ask, and of course any help would be very much appreciated!
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
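
P.S. Regarding -pernode: once you have an OpenMPI with TM support,
a minimal Torque script along these lines should be enough to test it
(the script contents and program name below are just placeholders):

#!/bin/bash
#PBS -N TEST
#PBS -l nodes=4:ppn=4
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
# With TM support, mpiexec takes the node list from the Torque job
# environment; -pernode then starts a single process on each node.
mpiexec -pernode ./my_mpi_program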


