[torqueusers] Torque 2.5.9 MOMs keep segfaulting
David Beer
dbeer at adaptivecomputing.com
Wed Jan 11 09:26:12 MST 2012
----- Original Message -----
> I finally got around to doing this, but I don't see a core file in
> /var/spool/torque or in /usr/sbin. Where would the core get dumped?
>
A MOM's core file is written to /var/spool/torque/mom_priv. Make sure ulimit -c is unlimited (or set to a very large number) in the environment that starts pbs_mom; if the core limit is 0, no core file is written.
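
If it helps, here is a minimal sketch for checking the core limit from Python rather than the shell. It reads RLIMIT_CORE through the standard resource module, which is the same limit ulimit -c reports (the module works in bytes). The function names core_limit_status and current_core_limit are only illustrative, and the script reports the limit of the process that runs it, which is inherited from the launching shell; it does not inspect an already running pbs_mom.

```python
import resource


def core_limit_status(soft_limit):
    """Describe an RLIMIT_CORE soft limit (hypothetical helper name)."""
    if soft_limit == resource.RLIM_INFINITY:
        return "unlimited"
    if soft_limit == 0:
        # A limit of 0 means the kernel will not write a core file.
        return "disabled"
    return "limited to %d bytes" % soft_limit


def current_core_limit():
    """Return the status of this process's soft core-size limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return core_limit_status(soft)


if __name__ == "__main__":
    print("core dumps:", current_core_limit())
```

Run it from the same shell or init script environment that starts pbs_mom; if it reports "disabled", raise the limit there before restarting the MOM.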
David
> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote:
>
> > ----- Original Message -----
> >> From: "Troy Baer" <tbaer at utk.edu>
> >> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >> Sent: Tuesday, December 20, 2011 8:59:56 AM
> >> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
> >>
> >> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote:
> >>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs
> >>> keep randomly segfaulting and dying. I see this in the MOM log
> >>> right before dying:
> >>>
> >>> 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
> >>>
> >>>
> >>> And something similar to this in dmesg:
> >>>
> >>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4
> >>
> >> We've also seen this on one of our systems and had to fall back to
> >> 2.5.8 on it.
> >>
> >> --Troy
> >> --
> >> Troy Baer, HPC System Administrator
> >> National Institute for Computational Sciences, University of
> >> Tennessee
> >> http://www.nics.tennessee.edu/
> >> Phone: 865-241-4233
> >
> > Could someone configure TORQUE using --with-debug and then send a
> > stack trace of the crash?
> >
> > Ken
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606