[torquedev] [Bug 166] New: After upgrading to 2.5.9, MOMs keep segfaulting

bugzilla-daemon at supercluster.org
Mon Dec 12 09:21:36 MST 2011


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=166

           Summary: After upgrading to 2.5.9, MOMs keep segfaulting
           Product: TORQUE
           Version: 2.5.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P5
         Component: pbs_mom
        AssignedTo: knielson at adaptivecomputing.com
        ReportedBy: leggett at ci.uchicago.edu
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


I upgraded from TORQUE 2.5.7 to 2.5.9 last Tuesday, and since then the MOMs on
one of my clusters keep segfaulting and dying. In dmesg I see entries similar
to this:

pbs_mom[31409]: segfault at 0000000000000008 rip 0000003655618d6f rsp 00007fffc63f7f50 error 4


And in the mom logs I see this:


12/12/2011 09:59:13;0001;   pbs_mom;Job;35935.svc.uc.futuregrid.org;task not started, 'rm', stdio setup failed (see syslog)
12/12/2011 09:59:13;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit


And in syslog I see:


Dec 12 09:59:02 c32 mpd: mpd ending mpdid=c32.uc.futuregrid.org_44987 (inside cleanup)
Dec 12 09:59:07 c32 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 127.0.0.1:60305
Dec 12 09:59:11 c32 last message repeated 2 times
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in open_demux, open_demux: connect 127.0.0.1:60305
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in start_process, cannot open mux stdout port
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
Dec 12 09:59:13 c32 kernel: pbs_mom[31409]: segfault at 0000000000000008 rip 0000003655618d6f rsp 00007fffc63f7f50 error 4
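
For anyone triaging this, the kernel segfault line can be decoded mechanically.
The following is a minimal sketch, not part of TORQUE: the function name
decode_segfault and the 4096-byte "near NULL" threshold are assumptions made
for illustration. It relies on the x86 page-fault error-code bits (bit 0:
protection violation vs. page not present, bit 1: write vs. read, bit 2: user
vs. kernel mode) and on the "rip/rsp" message layout shown in this report.

```python
import re

# x86 page-fault error-code bits as reported in the kernel's segfault message.
PF_PROT = 0x1   # set: protection violation; clear: page not present
PF_WRITE = 0x2  # set: write access; clear: read access
PF_USER = 0x4   # set: fault raised in user mode; clear: kernel mode

# Matches the older "rip/rsp" layout seen in this report (all values are hex).
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    addr = int(m.group("addr"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": addr,
        "rip": int(m.group("rip"), 16),
        "access": "write" if err & PF_WRITE else "read",
        "mode": "user" if err & PF_USER else "kernel",
        "cause": "protection violation" if err & PF_PROT else "page not present",
        # Assumption: anything inside the first page is treated as "near NULL".
        "near_null": addr < 4096,
    }


if __name__ == "__main__":
    line = ("pbs_mom[31409]: segfault at 0000000000000008 "
            "rip 0000003655618d6f rsp 00007fffc63f7f50 error 4")
    print(decode_segfault(line))
```

Applied to the line in this report, the sketch classifies the fault as a
user-mode read of a not-present page at address 0x8. That is consistent with,
but does not prove, a NULL pointer being dereferenced at a small offset; a core
dump and backtrace would be needed to confirm where it happens.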


