[torqueusers] Torque 2.5.9 MOMs keep segfaulting

David Beer dbeer at adaptivecomputing.com
Wed Jan 11 13:52:14 MST 2012


Do they segfault right away? If you can't find a core file, would it be possible to run the mom in gdb and get a backtrace of the crash when it happens?
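
For reference, one way to get that backtrace without restarting the daemon is to attach gdb to the pbs_mom that is already running and let it sit until the crash. This is only a sketch under some assumptions: mom_backtrace is a made-up helper name (not part of TORQUE), the output path is arbitrary, gdb must be installed on the node, and you need to be root to attach.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): attach gdb to a running process,
# let it continue until it stops on a signal such as SIGSEGV, then write
# backtraces to a file. Attaching to pbs_mom requires root.
mom_backtrace() {
    pid="$1"
    out="${2:-/tmp/pbs_mom_bt.$pid.txt}"   # assumed output location
    gdb -p "$pid" -batch \
        -ex "handle SIGPIPE nostop noprint pass" \
        -ex "continue" \
        -ex "bt full" \
        -ex "thread apply all bt" > "$out" 2>&1
}

# Example, assuming exactly one pbs_mom is running on the node:
# mom_backtrace "$(pgrep -o -x pbs_mom)"
```

The call blocks until the process stops, so it is best left running in a spare terminal or under screen. gdb also stops on some signals other than SIGSEGV, so the captured trace may not be from the crash itself; check the signal named at the top of the output file.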

David

----- Original Message -----
> Torque was configured with --with-debug, and "ulimit -c unlimited" is
> in the init script right before the MOMs are started with
> "/usr/sbin/pbs_mom -p -d /var/spool/torque", but I'm still not seeing
> a core file anywhere.
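
A ulimit line in the init script does not guarantee that the running daemon ended up with that limit, so it can help to check the live process. The sketch below assumes a Linux node with /proc mounted; the function names are made up for illustration.

```shell
#!/bin/sh
# Print the soft "Max core file size" value from a /proc/<pid>/limits-style
# file. Fields on that line are: Max core file size <soft> <hard> <units>.
soft_core_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Report the settings that decide whether and where a given PID dumps core.
# With a relative core_pattern, the kernel writes the core file into the
# process's current working directory.
report_core_settings() {
    pid="$1"
    echo "soft core limit: $(soft_core_limit "/proc/$pid/limits")"
    echo "core_pattern:    $(cat /proc/sys/kernel/core_pattern)"
    echo "cwd:             $(readlink "/proc/$pid/cwd")"
}

# Example, assuming exactly one pbs_mom is running on the node:
# report_core_settings "$(pgrep -o -x pbs_mom)"
```

A soft limit of 0 means no core file will be written. If core_pattern is an absolute path or starts with a pipe character, the core goes wherever that setting points rather than to mom_priv.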
> 
> On Jan 11, 2012, at 10:26 AM, David Beer wrote:
> 
> > 
> > 
> > ----- Original Message -----
> >> I finally got around to doing this, but I don't see a core file in
> >> /var/spool/torque or in /usr/sbin. Where would the core get
> >> dumped?
> >> 
> > 
> > A mom's core file would be in /var/spool/torque/mom_priv. You need
> > to make sure ulimit -c is unlimited or set to a very large number.
> > 
> > David
> > 
> >> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote:
> >> 
> >>> ----- Original Message -----
> >>>> From: "Troy Baer" <tbaer at utk.edu>
> >>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >>>> Sent: Tuesday, December 20, 2011 8:59:56 AM
> >>>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
> >>>> 
> >>>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote:
> >>>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs
> >>>>> keep randomly segfaulting and dying. I see this in the MOM log
> >>>>> right before dying:
> >>>>> 
> >>>>> 12/08/2011 10:09:14;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
> >>>>> 
> >>>>> 
> >>>>> And something similar to this in dmesg:
> >>>>> 
> >>>>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4
> >>>> 
> >>>> We've also seen this on one of our systems and had to fall back
> >>>> to 2.5.8 on it.
> >>>> 
> >>>> 	--Troy
> >>>> --
> >>>> Troy Baer, HPC System Administrator
> >>>> National Institute for Computational Sciences, University of Tennessee
> >>>> http://www.nics.tennessee.edu/
> >>>> Phone:  865-241-4233
> >>> 
> >>> Could someone configure TORQUE using --with-debug and then send a
> >>> stack trace of the crash?
> >>> 
> >>> Ken
> >>> _______________________________________________
> >>> torqueusers mailing list
> >>> torqueusers at supercluster.org
> >>> http://www.supercluster.org/mailman/listinfo/torqueusers
> >> 
> > 
> > --
> > David Beer
> > Direct Line: 801-717-3386 | Fax: 801-717-3738
> >     Adaptive Computing
> >     1712 S East Bay Blvd, Suite 300
> >     Provo, UT 84606
> > 
> 
> 

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1712 S East Bay Blvd, Suite 300
     Provo, UT 84606


