Bugzilla – Bug 193
pbs_mom segfaults at job end (job using pbsdsh).
Last modified: 2012-07-11 15:11:17 MDT
Created an attachment (id=106) [details]
gdb session with pbs_mom crash.

This case might be a bit outside the straight and narrow...

In my torque-roll for Rocks I have a trick for starting OpenMPI apps when OpenMPI isn't compiled with libtm support (as is the case with most distro-provided OpenMPIs). The setup is rather simple: some environment trickery makes mpirun use a wrapper around pbsdsh that behaves like ssh (several other people have posted similar solutions lately).

The app seems to run fine, but pbs_mom crashes upon job exit. It seems to be a double free error:

*** glibc detected *** /opt/torque/sbin/pbs_mom: double free or corruption (!prev): 0x00000000015dd0c0 ***

The job itself is trivial:

[marve@hpc1 ~]$ qsub -lnodes=3:ppn=2,walltime=1000 -I
qsub: waiting for job 15.hpc1.local to start
qsub: job 13.hpc1.local ready

[marve@compute-0-2 ~]$ cat /opt/torque/etc/openmpi-setup.sh
export OMPI_MCA_plm_rsh_agent="pbsdshwrapper"
export OMPI_MCA_orte_default_hostfile=$PBS_NODEFILE
export OMPI_MCA_orte_leave_session_attached=1
[marve@compute-0-2 ~]$ source /opt/torque/etc/openmpi-setup.sh
[marve@compute-0-2 ~]$ /opt/openmpi/bin/mpirun mpi-verify.x
Process 0 on compute-0-2.local
Process 1 on compute-0-2.local
Process 3 on compute-0-1.local
Process 5 on compute-0-0.local
Process 2 on compute-0-1.local
Process 4 on compute-0-0.local
[marve@compute-0-2 ~]$ logout

qsub: job 13.hpc1.local completed

As soon as the job exits, pbs_mom goes down in flames; see the attached file for a backtrace taken while running under gdb.

The pbsdshwrapper is a Python program that tries very hard to behave like ssh. It works with commercial MPI libraries and things like Gaussian-Linda too. At the end it runs

pbsdsh -h nodename whatever

to start the desired process on the sister compute nodes.

This is for torque 4.0.1; the setup works fine with torque 3 and earlier releases.
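For readers who have not seen such a wrapper, here is a minimal sketch of the idea. It is not the actual pbsdshwrapper from the torque-roll; the function name build_pbsdsh_command and the list of ssh options it skips are assumptions made for illustration. It only shows the core translation from an ssh-style command line to the "pbsdsh -h nodename whatever" form mentioned above.

```python
#!/usr/bin/env python
# Illustrative sketch only, not the real pbsdshwrapper.
import os
import sys

# ssh options that take a value (assumed subset, enough for illustration)
SSH_OPTS_WITH_VALUE = {"-l", "-p", "-o", "-i", "-F"}


def build_pbsdsh_command(args):
    """Translate ssh-style arguments into a pbsdsh argument list."""
    i = 0
    # Skip leading ssh options, together with their values where applicable.
    while i < len(args) and args[i].startswith("-"):
        i += 2 if args[i] in SSH_OPTS_WITH_VALUE else 1
    if i >= len(args):
        raise ValueError("no hostname given")
    # Accept user@host like ssh does, but keep only the host part.
    host = args[i].split("@")[-1]
    command = args[i + 1:]
    if not command:
        raise ValueError("no remote command given")
    return ["pbsdsh", "-h", host] + command


if __name__ == "__main__":
    cmd = build_pbsdsh_command(sys.argv[1:])
    # Replace this process with pbsdsh so its exit status reaches mpirun.
    os.execvp(cmd[0], cmd)
```

Note that real ssh hands the remote command to a login shell, while this sketch passes the arguments to pbsdsh unchanged, so quoting and shell expansion may behave differently.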
Fixed in 4.0.3 and 4.1.0.