Bugzilla – Bug 166
After upgrading to 2.5.9, MOMs keep segfaulting
Last modified: 2011-12-12 09:21:35 MST
I upgraded to torque 2.5.9 from 2.5.7 last Tuesday, and since then the MOMs on one of my clusters keep segfaulting and dying. In dmesg I see something similar to this:

pbs_mom[31409]: segfault at 0000000000000008 rip 0000003655618d6f rsp 00007fffc63f7f50 error 4

And in the mom logs I see this:

12/12/2011 09:59:13;0001; pbs_mom;Job;35935.svc.uc.futuregrid.org;task not started, 'rm', stdio setup failed (see syslog)
12/12/2011 09:59:13;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit

And in syslog I see:

Dec 12 09:59:02 c32 mpd: mpd ending mpdid=c32.uc.futuregrid.org_44987 (inside cleanup)
Dec 12 09:59:07 c32 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 127.0.0.1:60305
Dec 12 09:59:11 c32 last message repeated 2 times
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in open_demux, open_demux: connect 127.0.0.1:60305
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in start_process, cannot open mux stdout port
Dec 12 09:59:13 c32 pbs_mom: LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
Dec 12 09:59:13 c32 kernel: pbs_mom[31409]: segfault at 0000000000000008 rip 0000003655618d6f rsp 00007fffc63f7f50 error 4
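For anyone triaging similar kernel lines, here is a minimal sketch of how the segfault message above could be decoded. It assumes the older x86-64 message layout shown in this report ("segfault at ... rip ... rsp ... error N") and the standard x86 page-fault error bits (bit 0: page present, bit 1: write access, bit 2: user mode). The function name `decode_segfault` and the dictionary keys are hypothetical, chosen only for illustration; none of this comes from TORQUE itself.

```python
import re

# Assumed message layout, taken from the kernel line quoted in this report.
# The kernel prints the addresses and the error code in hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    err = int(match.group("err"), 16)
    return {
        "program": match.group("prog"),
        "pid": int(match.group("pid")),
        "fault_address": int(match.group("addr"), 16),
        "rip": int(match.group("rip"), 16),
        # x86 page-fault error code bits
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


if __name__ == "__main__":
    sample = (
        "Dec 12 09:59:13 c32 kernel: pbs_mom[31409]: segfault at "
        "0000000000000008 rip 0000003655618d6f rsp 00007fffc63f7f50 error 4"
    )
    print(decode_segfault(sample))
```

Applied to the line above, this reports a user-mode read of an unmapped page at address 0x8. That pattern would be consistent with reading a field at a small offset through a NULL pointer, although the log alone does not confirm where in pbs_mom (or in a library it calls) that happens.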