[torquedev] [Bug 193] New: pbs_mom segfaults at job end (job using pbsdsh).

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Tue May 8 15:30:23 MDT 2012


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=193

           Summary: pbs_mom segfaults at job end (job using pbsdsh).
           Product: TORQUE
           Version: 4.0.*
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P5
         Component: pbs_mom
        AssignedTo: knielson at adaptivecomputing.com
        ReportedBy: roy.dragseth at uit.no
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


Created an attachment (id=106)
 --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=106)
gdb session with pbs_mom crash.

This case might be a bit outside the straight and narrow...

In my torque-roll for Rocks I have a trick for starting OpenMPI apps when
OpenMPI isn't compiled with libtm support (as is the case with most
distro-provided OpenMPIs).

The setup is rather simple: just do some environment trickery so that mpirun
uses a wrapper around pbsdsh that behaves like ssh (several other people have
posted similar solutions lately).  The app seems to run fine, but pbs_mom
crashes upon job exit.  It seems to be a double free error:

*** glibc detected *** /opt/torque/sbin/pbs_mom: double free or corruption
(!prev): 0x00000000015dd0c0 ***


The job itself is trivial:

[marve at hpc1 ~]$ qsub -lnodes=3:ppn=2,walltime=1000 -I
qsub: waiting for job 15.hpc1.local to start
qsub: job 13.hpc1.local ready

[marve at compute-0-2 ~]$ cat /opt/torque/etc/openmpi-setup.sh 
export OMPI_MCA_plm_rsh_agent="pbsdshwrapper"
export OMPI_MCA_orte_default_hostfile=$PBS_NODEFILE
export OMPI_MCA_orte_leave_session_attached=1
[marve at compute-0-2 ~]$ source /opt/torque/etc/openmpi-setup.sh 
[marve at compute-0-2 ~]$ /opt/openmpi/bin/mpirun mpi-verify.x 
Process 0 on compute-0-2.local
Process 1 on compute-0-2.local
Process 3 on compute-0-1.local
Process 5 on compute-0-0.local
Process 2 on compute-0-1.local
Process 4 on compute-0-0.local
[marve at compute-0-2 ~]$ logout

qsub: job 13.hpc1.local completed

As soon as the job exits, pbs_mom goes down in flames; see the attached file
for a backtrace taken while running under gdb.

The pbsdshwrapper is a Python program that tries very hard to behave like ssh.
It works with commercial MPI libraries and things like Gaussian-Linda too.  At
the end it runs

pbsdsh -h nodename whatever

to start the desired process on the sister compute nodes.
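
For readers who have not seen such a wrapper, here is a minimal sketch of the
idea.  It is not the actual pbsdshwrapper from the torque-roll; the names
build_pbsdsh_argv and SSH_OPTS_WITH_ARG are hypothetical, and the list of ssh
options that take an argument is an assumed subset.  It only shows how an
ssh-style command line (options, host, command) could be mapped onto the
pbsdsh -h call above.

```python
"""Hypothetical sketch of an ssh-like wrapper around pbsdsh.

Not the real pbsdshwrapper: names and the option list are assumptions.
"""
import os

# ssh options that consume the following argument (assumed subset).
SSH_OPTS_WITH_ARG = {"-l", "-p", "-o", "-i", "-F"}


def build_pbsdsh_argv(ssh_args):
    """Translate ssh-style arguments into an argv list for pbsdsh."""
    args = list(ssh_args)
    i = 0
    # Skip leading ssh options; pbsdsh has no use for them.
    while i < len(args) and args[i].startswith("-"):
        i += 2 if args[i] in SSH_OPTS_WITH_ARG else 1
    # A host name and at least one command word must remain.
    if len(args) - i < 2:
        raise ValueError("usage: [ssh options] host command [args...]")
    # Drop a "user@" prefix if the caller supplied one.
    host = args[i].split("@")[-1]
    command = args[i + 1:]
    return ["pbsdsh", "-h", host] + command


def main(argv):
    """Replace the current process with pbsdsh, as an ssh stand-in would."""
    pbsdsh_argv = build_pbsdsh_argv(argv)
    os.execvp(pbsdsh_argv[0], pbsdsh_argv)
```

A real script would call main(sys.argv[1:]) at its entry point, and would
likely need more care with quoting and with whatever options the calling
MPI library passes to its rsh agent.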

This is with torque 4.0.1; the same setup works fine with torque 3 and earlier
releases.

