Bug 193 - pbs_mom segfaults at job end (job using pbsdsh).
Status: RESOLVED FIXED
Product: TORQUE
Component: pbs_mom
Version: 4.0.*
Platform: PC Linux
Importance: P5 major
Assigned To: David Beer
Reported: 2012-05-08 15:30 MDT by Roy Dragseth
Modified: 2012-07-11 15:11 MDT (History)

Attachments
gdb session with pbs_mom crash. (13.23 KB, text/plain)
2012-05-08 15:30 MDT, Roy Dragseth

Description Roy Dragseth 2012-05-08 15:30:23 MDT
Created an attachment (id=106) [details]
gdb session with pbs_mom crash.

This case might be a bit outside the straight and narrow...

In my torque-roll for Rocks I have a trick for starting OpenMPI apps when OpenMPI
isn't compiled with libtm support (as is the case with most distro-provided
OpenMPIs).

The setup is rather simple: some environment trickery makes mpirun use
a wrapper around pbsdsh that behaves like ssh (several other people have posted
similar solutions lately).  The app seems to run fine, but pbs_mom crashes upon
job exit.  It seems to be a double-free error:

*** glibc detected *** /opt/torque/sbin/pbs_mom: double free or corruption
(!prev): 0x00000000015dd0c0 ***


The job itself is trivial:

[marve@hpc1 ~]$ qsub -lnodes=3:ppn=2,walltime=1000 -I
qsub: waiting for job 15.hpc1.local to start
qsub: job 13.hpc1.local ready

[marve@compute-0-2 ~]$ cat /opt/torque/etc/openmpi-setup.sh 
export OMPI_MCA_plm_rsh_agent="pbsdshwrapper"
export OMPI_MCA_orte_default_hostfile=$PBS_NODEFILE
export OMPI_MCA_orte_leave_session_attached=1
[marve@compute-0-2 ~]$ source /opt/torque/etc/openmpi-setup.sh 
[marve@compute-0-2 ~]$ /opt/openmpi/bin/mpirun mpi-verify.x 
Process 0 on compute-0-2.local
Process 1 on compute-0-2.local
Process 3 on compute-0-1.local
Process 5 on compute-0-0.local
Process 2 on compute-0-1.local
Process 4 on compute-0-0.local
[marve@compute-0-2 ~]$ logout

qsub: job 13.hpc1.local completed

As soon as the job exits, pbs_mom goes down in flames; see the attached file for
a backtrace taken while running under gdb.

The pbsdshwrapper is a Python program that tries very hard to behave like ssh.
It works with commercial MPI libraries and things like Gaussian-Linda too. In
the end it runs:

pbsdsh -h nodename whatever

to start the desired process on the sister compute nodes.
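
For readers unfamiliar with this kind of wrapper, the following is a minimal
sketch of the idea, not the actual pbsdshwrapper from the torque-roll. It
assumes the caller (e.g. mpirun) invokes it the way it would invoke ssh:
optional ssh-style options, then the host name, then the remote command. The
function name build_pbsdsh_command and the list of ssh options that take an
argument are illustrative assumptions.

```python
#!/usr/bin/env python
"""Hypothetical sketch of an ssh-like wrapper around pbsdsh."""
import os
import sys

# Single-letter ssh options that consume the following argument.
# This is an assumed subset for illustration, not an exhaustive list.
SSH_OPTS_WITH_ARG = set("bcDEeFIiJLlmOopQRSWw")


def build_pbsdsh_command(argv):
    """Translate ssh-style arguments into a 'pbsdsh -h host command...' list."""
    args = list(argv)
    # Drop leading ssh options; pbsdsh has no use for them.
    while args and args[0].startswith("-") and len(args[0]) > 1:
        opt = args.pop(0)
        # Options given as two separate words ("-l user") also eat the value.
        if len(opt) == 2 and opt[1] in SSH_OPTS_WITH_ARG and args:
            args.pop(0)
    if len(args) < 2:
        raise ValueError("usage: wrapper [ssh-options] host command [args...]")
    host, command = args[0], args[1:]
    return ["pbsdsh", "-h", host] + command


# Only act when called with arguments, so importing the file has no side effects.
if __name__ == "__main__" and len(sys.argv) > 1:
    try:
        cmd = build_pbsdsh_command(sys.argv[1:])
    except ValueError as err:
        sys.exit(str(err))
    # Replace this process with pbsdsh, which must be on PATH inside the job.
    os.execvp(cmd[0], cmd)
```

Called as "pbsdshwrapper -x compute-0-1 hostname", this sketch would exec
"pbsdsh -h compute-0-1 hostname". A real wrapper would likely need more care
with quoting and with options glued to their values.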

This is for torque 4.0.1; the setup works fine with torque 3 and earlier
releases.

Comment 1 David Beer 2012-07-11 15:11:17 MDT
Fixed in 4.0.3 and 4.1.0.