[torqueusers] Torque 2.5.9 MOMs keep segfaulting

Ti Leggett leggett at mcs.anl.gov
Wed Jan 11 12:05:25 MST 2012


Torque was configured with --with-debug, and "ulimit -c unlimited" is in the init script right before the MOMs are started with "/usr/sbin/pbs_mom -p -d /var/spool/torque", but I'm still not seeing a core file anywhere.
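
For anyone checking the same thing, here is a rough sketch of how one might confirm that the limit actually reached the running daemon. It assumes a Linux kernel that exposes /proc/<pid>/limits (older kernels may not), and the helper name core_soft_limit is made up for illustration; it is not part of TORQUE.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): print the soft "Max core file size"
# value from a /proc/<pid>/limits style file passed as $1.
core_soft_limit() {
    # Line layout: Max core file size <soft> <hard> <units>
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Example use on a node (assumes pidof exists and pbs_mom is running):
#   for pid in $(pidof pbs_mom); do
#       echo "pbs_mom $pid: $(core_soft_limit /proc/$pid/limits)"
#   done
#   cat /proc/sys/kernel/core_pattern   # where the kernel writes cores
```

A soft limit of 0 would mean the ulimit in the init script never reached the daemon. A core_pattern that is an absolute path or a pipe would mean the core goes somewhere other than the MOM's working directory.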

On Jan 11, 2012, at 10:26 AM, David Beer wrote:

> 
> 
> ----- Original Message -----
>> I finally got around to doing this, but I don't see a core file in
>> /var/spool/torque or in /usr/sbin. Where would the core get dumped?
>> 
> 
> A mom's core file would be in /var/spool/torque/mom_priv. You need to make sure ulimit -c is unlimited or set to a very large number.
> 
> David
> 
>> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote:
>> 
>>> ----- Original Message -----
>>>> From: "Troy Baer" <tbaer at utk.edu>
>>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>>>> Sent: Tuesday, December 20, 2011 8:59:56 AM
>>>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
>>>> 
>>>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote:
>>>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs
>>>>> keep randomly segfaulting and dying. I see this in the MOM log
>>>>> right before dying:
>>>>> 
>>>>> 12/08/2011 10:09:14;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
>>>>> 
>>>>> 
>>>>> And something similar to this in dmesg:
>>>>> 
>>>>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4
>>>> 
>>>> We've also seen this on one of our systems and had to fall back to
>>>> 2.5.8 on it.
>>>> 
>>>> 	--Troy
>>>> --
>>>> Troy Baer, HPC System Administrator
>>>> National Institute for Computational Sciences, University of Tennessee
>>>> http://www.nics.tennessee.edu/
>>>> Phone:  865-241-4233
>>> 
>>> Could someone configure TORQUE using --with-debug and then send a
>>> stack trace of the crash?
>>> 
>>> Ken
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>> 
>> 
>> 
> 
> -- 
> David Beer 
> Direct Line: 801-717-3386 | Fax: 801-717-3738
>     Adaptive Computing
>     1712 S East Bay Blvd, Suite 300
>     Provo, UT 84606
> 
