[torquedev] [Bug 208] New: pbs_mom segfaults in tm_request

bugzilla-daemon at supercluster.org
Tue Jul 24 13:10:42 MDT 2012


           Summary: pbs_mom segfaults in tm_request
           Product: TORQUE
           Version: 2.5.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: critical
          Priority: P5
         Component: pbs_mom
        AssignedTo: knielson at adaptivecomputing.com
        ReportedBy: siegert at sfu.ca
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

We see a large number of segfaults in pbs_mom. These are the corresponding
syslog entries:

Jul 22 02:25:35 b177 pbs_mom: LOG_ERROR::Bad file descriptor (9) in tm_request,
comm failed Protocol failure in commit
Jul 22 02:25:35 b177 kernel: pbs_mom[12931]: segfault at 0000000000000008 rip
00002b13f903a5ef rsp 00007fff035e0990 error 4
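
For anyone triaging the kernel line above, it can be decoded mechanically.
The following is a minimal sketch and not part of TORQUE: the helper name
decode_segfault and its field names are hypothetical, and it assumes the
older x86-64 kernel message layout shown here ("segfault at ... rip ... rsp
... error ...") with the error value being the standard x86 page-fault error
code printed in hex.

```python
import re

# Low bits of the x86 page-fault error code as reported by the Linux kernel.
PF_PROT = 0x1   # 0: page not present, 1: protection violation
PF_WRITE = 0x2  # 0: read access,      1: write access
PF_USER = 0x4   # 0: kernel mode,      1: user mode

# \s+ between fields tolerates the line wrapping introduced by mail clients.
SEGFAULT_RE = re.compile(
    r"(?P<prog>[\w.-]+)\[(?P<pid>\d+)\]:\s+segfault at\s+(?P<addr>[0-9a-fA-F]+)"
    r"\s+rip\s+(?P<rip>[0-9a-fA-F]+)\s+rsp\s+(?P<rsp>[0-9a-fA-F]+)"
    r"\s+error\s+(?P<err>[0-9a-fA-F]+)"
)


def decode_segfault(line):
    """Hypothetical helper: parse one kernel segfault line into a dict.

    Returns None if the line does not match the assumed layout.
    """
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        "rsp": int(m.group("rsp"), 16),
        "access": "write" if err & PF_WRITE else "read",
        "mode": "user" if err & PF_USER else "kernel",
        "cause": "protection violation" if err & PF_PROT else "page not present",
    }


# The entry from this report, including the line break added by the mailer.
line = (
    "Jul 22 02:25:35 b177 kernel: pbs_mom[12931]: segfault at 0000000000000008 rip\n"
    "00002b13f903a5ef rsp 00007fff035e0990 error 4"
)
info = decode_segfault(line)
```

Read this way, error 4 is a user-mode read of a page that is not mapped, and
the faulting address is 0x8. That is consistent with, though it does not
prove, a read through a NULL structure pointer at offset 8.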

This problem exists (at least) in torque 2.5.11 and 3.0.5.

As a consequence, we are losing jobs on the cluster:

=>> PBS: job killed: node 28 (b177) requested job terminate, 'EOF' (code
1099) - received SISTER_EOF attempting to communicate with sister MOM's
mpiexec: killing job...

- Martin

