[torqueusers] Torque 2.5.9 MOMs keep segfaulting

David Beer dbeer at adaptivecomputing.com
Wed Jan 11 09:26:12 MST 2012



----- Original Message -----
> I finally got around to doing this, but I don't see a core file in
> /var/spool/torque or in /usr/sbin. Where would the core get dumped?
> 

A MOM's core file would be written to /var/spool/torque/mom_priv. Make sure ulimit -c is unlimited (or set to a very large number) in the environment that starts pbs_mom, since the daemon inherits that limit when it is launched.
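As a minimal sketch, assuming a Linux host, the Python snippet below performs the same check from a script. It reads the current core-file limit (the value ulimit -c reports) and raises the soft limit to the hard limit. It also reads /proc/sys/kernel/core_pattern, which controls where the kernel writes cores. A relative pattern such as "core" means the file lands in the crashing process's working directory. A pattern starting with "|" means cores are piped to a helper program instead. The helper names are hypothetical and not part of TORQUE. Resource limits are inherited by child processes, so raising the limit here only helps a pbs_mom that this same process (for example, a wrapper around its init script) starts afterwards.

```python
import resource


def describe_core_limit(limit):
    """Render an RLIMIT_CORE value roughly the way `ulimit -c` does."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit() reports bytes, while the shell's `ulimit -c` reports blocks.
    return f"{limit} bytes"


def raise_core_limit():
    """Raise the soft core-file limit to the hard limit; return (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Return the kernel's core_pattern string, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    soft, hard = raise_core_limit()
    print("core limit (soft/hard):",
          describe_core_limit(soft), "/", describe_core_limit(hard))
    pattern = read_core_pattern()
    if pattern is None:
        print("core_pattern: not available on this system")
    elif pattern.startswith("|"):
        # Cores go to a helper program, not to a file in the working directory.
        print("cores are piped to a helper:", pattern)
    else:
        # A relative pattern is resolved against the crashing process's cwd.
        print("core_pattern:", pattern)
```

If the hard limit itself is 0, only root can raise it. In that case the limit has to be changed where pbs_mom is started as root.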

David

> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote:
> 
> > ----- Original Message -----
> >> From: "Troy Baer" <tbaer at utk.edu>
> >> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >> Sent: Tuesday, December 20, 2011 8:59:56 AM
> >> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
> >> 
> >> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote:
> >>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then,
> >>> MOMs keep randomly segfaulting and dying. I see this in the MOM
> >>> log right before dying:
> >>> 
> >>> 12/08/2011 10:09:14;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
> >>> 
> >>> 
> >>> And something similar to this in dmesg:
> >>> 
> >>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4
> >> 
> >> We've also seen this on one of our systems and had to fall back
> >> to 2.5.8 on it.
> >> 
> >> 	--Troy
> >> --
> >> Troy Baer, HPC System Administrator
> >> National Institute for Computational Sciences, University of Tennessee
> >> http://www.nics.tennessee.edu/
> >> Phone:  865-241-4233
> > 
> > Could someone configure TORQUE using --with-debug and then send a
> > stack trace of the crash?
> > 
> > Ken
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> 

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1712 S East Bay Blvd, Suite 300
     Provo, UT 84606


