Bugzilla – Bug 193
pbs_mom segfaults at job end (job using pbsdsh).
Last modified: 2012-07-11 15:11:17 MDT
Created an attachment (id=106)
gdb session with pbs_mom crash.
This case might be a bit outside the straight and narrow...
In my torque-roll for Rocks I have a trick for starting OpenMPI apps when OpenMPI
isn't compiled with libtm support (as is the case with most distro provided
The setup is rather simple: just do some environment trickery so that mpirun uses
a wrapper around pbsdsh that behaves like ssh (several other people have posted
similar solutions lately). The app seems to run fine, but pbs_mom crashes upon
job exit. It seems to be a double free error:
*** glibc detected *** /opt/torque/sbin/pbs_mom: double free or corruption
(!prev): 0x00000000015dd0c0 ***
The job itself is trivial:
[marve@hpc1 ~]$ qsub -lnodes=3:ppn=2,walltime=1000 -I
qsub: waiting for job 15.hpc1.local to start
qsub: job 13.hpc1.local ready
[marve@compute-0-2 ~]$ cat /opt/torque/etc/openmpi-setup.sh
[marve@compute-0-2 ~]$ source /opt/torque/etc/openmpi-setup.sh
[marve@compute-0-2 ~]$ /opt/openmpi/bin/mpirun mpi-verify.x
Process 0 on compute-0-2.local
Process 1 on compute-0-2.local
Process 3 on compute-0-1.local
Process 5 on compute-0-0.local
Process 2 on compute-0-1.local
Process 4 on compute-0-0.local
[marve@compute-0-2 ~]$ logout
qsub: job 13.hpc1.local completed
As soon as the job exits, pbs_mom goes down in flames; see the attached file for a
backtrace while running under gdb.
The pbsdshwrapper is a Python program that tries very hard to behave like ssh.
It works with commercial mpi-libs and things like Gaussian-Linda too. In the
end it runs
pbsdsh -h nodename whatever
to start the desired process on the sister compute nodes.
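
For readers who have not seen this kind of wrapper, here is a minimal sketch of the idea, not the actual pbsdshwrapper from the torque-roll. It assumes the caller passes ssh-style arguments (options, then a host, then the remote command) and turns them into the `pbsdsh -h nodename whatever` form mentioned above. The function names and the list of ssh options that take a value are assumptions made for illustration.

```python
import os

# Assumption: a subset of ssh options that consume the following argument.
SSH_OPTS_WITH_VALUE = {"-l", "-p", "-o", "-i", "-F", "-e", "-c", "-m"}


def build_pbsdsh_command(ssh_args):
    """Translate ssh-style arguments into a pbsdsh argument list.

    ssh_args is everything after the program name, e.g.
    ["-x", "compute-0-1", "some_program", "--flag"].
    """
    args = list(ssh_args)
    i = 0
    # Skip leading ssh options; pbsdsh has no use for them.
    while i < len(args) and args[i].startswith("-"):
        i += 2 if args[i] in SSH_OPTS_WITH_VALUE else 1
    if i >= len(args):
        raise ValueError("no host name given")
    # Accept user@host and keep only the host part.
    host = args[i].split("@")[-1]
    command = args[i + 1:]
    if not command:
        raise ValueError("no remote command given")
    # Note: real ssh hands the command to a remote shell; this sketch
    # passes the words straight to pbsdsh instead.
    return ["pbsdsh", "-h", host] + command


def main(argv):
    """Replace the current process with pbsdsh (hypothetical entry point)."""
    cmd = build_pbsdsh_command(argv[1:])
    os.execvp(cmd[0], cmd)
```

A wrapper script would call `main(sys.argv)`. For example, `build_pbsdsh_command(["-x", "compute-0-1", "hostname"])` yields the argument list for `pbsdsh -h compute-0-1 hostname`.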
This is for torque 4.0.1; the setup works fine with torque 3 and earlier.
Fixed in 4.0.3 and 4.1.0.