[torquedev] [Bug 208] New: pbs_mom segfaults in tm_request
bugzilla-daemon at supercluster.org
Tue Jul 24 13:10:42 MDT 2012
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=208
Summary: pbs_mom segfaults in tm_request
Product: TORQUE
Version: 2.5.x
Platform: PC
OS/Version: Linux
Status: NEW
Severity: critical
Priority: P5
Component: pbs_mom
AssignedTo: knielson at adaptivecomputing.com
ReportedBy: siegert at sfu.ca
CC: torquedev at supercluster.org
Estimated Hours: 0.0
We see a large number of segfaults in pbs_mom. This is the syslog entry:
Jul 22 02:25:35 b177 pbs_mom: LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
Jul 22 02:25:35 b177 kernel: pbs_mom[12931]: segfault at 0000000000000008 rip 00002b13f903a5ef rsp 00007fff035e0990 error 4
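For triage, the kernel line can be decoded mechanically. The following is a minimal sketch, not part of TORQUE: the function name decode_segfault and the near_null heuristic (fault address below 4096) are assumptions made for illustration. It relies only on the standard x86 page-fault error-code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode) and on the kernel printing the error code in hex.

```python
import re

# x86 page-fault error-code bits as reported in the kernel "segfault" message.
PF_PROTECTION = 0x1  # set: protection violation; clear: page not present
PF_WRITE = 0x2       # set: write access; clear: read access
PF_USER = 0x4        # set: fault raised in user mode; clear: kernel mode

SEGFAULT_RE = re.compile(
    r"segfault at ([0-9a-f]+) rip ([0-9a-f]+) rsp ([0-9a-f]+) error ([0-9a-f]+)"
)


def decode_segfault(line):
    """Parse one kernel segfault line and describe the fault (hypothetical helper)."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a recognised segfault line")
    address, rip, rsp, error = (int(group, 16) for group in match.groups())
    return {
        "address": address,
        "rip": rip,
        "rsp": rsp,
        "access": "write" if error & PF_WRITE else "read",
        "mode": "user" if error & PF_USER else "kernel",
        "cause": "protection violation" if error & PF_PROTECTION else "page not present",
        # Assumed heuristic: an address inside the first page suggests a
        # dereference through a NULL pointer plus a small field offset.
        "near_null": address < 4096,
    }


if __name__ == "__main__":
    entry = (
        "Jul 22 02:25:35 b177 kernel: pbs_mom[12931]: segfault at "
        "0000000000000008 rip 00002b13f903a5ef rsp 00007fff035e0990 error 4"
    )
    for key, value in decode_segfault(entry).items():
        print(key, value)
```

Applied to the entry above, this reports a user-mode read of a page that is not present, at address 0x8. That would be consistent with reading a field at offset 8 through a NULL pointer, though only a backtrace from a core file can confirm where in tm_request this happens.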
This problem exists (at least) in torque 2.5.11 and 3.0.5.
As a consequence we are losing jobs on the cluster:
=>> PBS: job killed: node 28 (b177) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
mpiexec: killing job...
- Martin