Bug 166 - After upgrading to 2.5.9, MOMs keep segfaulting
Status: NEW
Product: TORQUE
Component: pbs_mom
Version: 2.5.x
Platform: PC Linux
Importance: P5 major
Assigned To: Ken Nielson
 
Reported: 2011-12-12 09:21 MST by Ti Leggett
Modified: 2011-12-12 09:21 MST

Description Ti Leggett 2011-12-12 09:21:35 MST
I upgraded from TORQUE 2.5.7 to 2.5.9 last Tuesday, and since then the MOMs on
one of my clusters keep segfaulting and dying. In dmesg I see something similar
to this:

pbs_mom[31409]: segfault at 0000000000000008 rip 0000003655618d6f rsp 00007fffc63f7f50 error 4
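
For reference, the kernel line above can be decoded mechanically. The following is a minimal sketch, assuming the older x86_64 message format shown in this report ("rip"/"rsp"; newer kernels print "ip"/"sp") and the standard x86 page-fault error-code bits (bit 0: protection fault vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The function name decode_segfault and the dictionary keys are illustrative choices, not part of any TORQUE or kernel tooling.

```python
import re

# Matches the kernel segfault message as it appears in this report.
# "r?ip" / "r?sp" also accepts the newer "ip"/"sp" spelling.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"r?ip (?P<ip>[0-9a-f]+) r?sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Split a kernel segfault line into its fields (hypothetical helper)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    err = int(m.group("err"), 16)  # the kernel prints the error code in hex
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": int(m.group("ip"), 16),
        "sp": int(m.group("sp"), 16),
        # x86 page-fault error-code bits
        "protection_fault": bool(err & 1),  # False: page was not present
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


line = ("pbs_mom[31409]: segfault at 0000000000000008 rip 0000003655618d6f "
        "rsp 00007fffc63f7f50 error 4")
info = decode_segfault(line)
for key, value in info.items():
    print(key, value)
```

Read this way, "error 4" is a user-mode read of a page that is not mapped, and the faulting address is 0x8. That is consistent with reading a field at a small offset from a NULL pointer, although the line alone does not show which pointer or which function was involved.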


And in the mom logs I see this:


12/12/2011 09:59:13;0001;   pbs_mom;Job;35935.svc.uc.futuregrid.org;task not started, 'rm', stdio setup failed (see syslog)
12/12/2011 09:59:13;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit


And in syslog I see:


Dec 12 09:59:02 c32 mpd: mpd ending mpdid=c32.uc.futuregrid.org_44987 (inside cleanup)
Dec 12 09:59:07 c32 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 127.0.0.1:60305
Dec 12 09:59:11 c32 last message repeated 2 times
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in open_demux, open_demux: connect 127.0.0.1:60305
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in start_process, cannot open mux stdout port
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
Dec 12 09:59:13 c32 kernel: pbs_mom[31409]: segfault at 0000000000000008 rip 0000003655618d6f rsp 00007fffc63f7f50 error 4
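
To make the order of failures leading up to the segfault easier to see, the pbs_mom LOG_ERROR entries in the syslog excerpt can be pulled out into (time, function, errno) tuples. This is a rough sketch: the regular expression is inferred only from the lines quoted in this report, not from a documented TORQUE log format, and parse_errors and LINUX_ERRNO are hypothetical names. The errno names are the Linux values for the three codes that appear here.

```python
import re

# Shape inferred from this report's syslog lines:
#   pbs_mom: LOG_ERROR::<strerror text> (<errno>) in <function>, <message>
LOG_ERROR_RE = re.compile(
    r"pbs_mom: LOG_ERROR::(?P<strerror>.+?) \((?P<errno>\d+)\) in "
    r"(?P<func>\w+), (?P<msg>.*)"
)

# Linux errno names for the codes seen in this report only.
LINUX_ERRNO = {9: "EBADF", 25: "ENOTTY", 111: "ECONNREFUSED"}

# Syslog lines copied from the report, with wrapped lines rejoined.
SYSLOG = """\
Dec 12 09:59:07 c32 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 127.0.0.1:60305
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in open_demux, open_demux: connect 127.0.0.1:60305
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in start_process, cannot open mux stdout port
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
"""


def parse_errors(text):
    """Return (timestamp, function, errno, errno name) for each LOG_ERROR line."""
    events = []
    for line in text.splitlines():
        m = LOG_ERROR_RE.search(line)
        if m:
            timestamp = line[:15]  # "Dec 12 09:59:07" is 15 characters
            code = int(m.group("errno"))
            events.append(
                (timestamp, m.group("func"), code, LINUX_ERRNO.get(code, "?"))
            )
    return events


events = parse_errors(SYSLOG)
for event in events:
    print(event)
```

In the excerpt the sequence is open_demux (connection refused, then ENOTTY), start_process (cannot open mux stdout port), tm_request (EBADF), followed by the kernel segfault line in the same second. The sketch only orders the events; it does not establish which of them triggers the crash.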