[torqueusers] Torque 2.5.9 MOMs keep segfaulting

Ti Leggett leggett at mcs.anl.gov
Mon Jan 16 09:44:18 MST 2012


They seem to die immediately. I can't really run them in gdb, since the crashes hit random nodes and I haven't found a way to trigger the failure.
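
Without gdb, one thing that can at least be checked on a node is whether the running pbs_mom really inherited the core limit set in the init script. A minimal sketch, assuming a Linux kernel that exposes /proc/<pid>/limits (mainline since 2.6.24); core_limit_of is a made-up helper name, not part of Torque:

```shell
#!/bin/sh
# core_limit_of PID: print the soft "Max core file size" limit of a process.
# Hypothetical helper; assumes /proc/<pid>/limits exists on this kernel.
core_limit_of() {
    # The relevant line looks like:
    #   Max core file size        0          unlimited          bytes
    # The name takes four fields, so field 5 is the soft limit.
    awk '/^Max core file size/ { print $5 }' "/proc/$1/limits"
}

# Demo on the current shell; on a compute node pass the pbs_mom PID instead.
core_limit_of $$
```

On a node this could be called as core_limit_of "$(pgrep -o -x pbs_mom)". A result of 0 would mean the ulimit never reached the daemon. It may also be worth looking at /proc/sys/kernel/core_pattern, since a relative pattern puts the core in the process's current working directory.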

On Jan 11, 2012, at 2:52 PM, David Beer wrote:

> Do they segfault right away? If you can't find a core file, would it be possible to run the mom in gdb and get a backtrace of the crash when it happens?
> 
> David
> 
> ----- Original Message -----
>> Torque was configured with --with-debug, and "ulimit -c unlimited" is in
>> the init script right before the MOMs are started with
>> "/usr/sbin/pbs_mom -p -d /var/spool/torque", but I'm still not seeing
>> a core file anywhere.
>> 
>> On Jan 11, 2012, at 10:26 AM, David Beer wrote:
>> 
>>> 
>>> 
>>> ----- Original Message -----
>>>> I finally got around to doing this, but I don't see a core file in
>>>> /var/spool/torque or in /usr/sbin. Where would the core get
>>>> dumped?
>>>> 
>>> 
>>> A mom's core file would be in /var/spool/torque/mom_priv. You need
>>> to make sure ulimit -c is unlimited or set to a very large number.
>>> 
>>> David
>>> 
>>>> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote:
>>>> 
>>>>> ----- Original Message -----
>>>>>> From: "Troy Baer" <tbaer at utk.edu>
>>>>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>>>>>> Sent: Tuesday, December 20, 2011 8:59:56 AM
>>>>>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
>>>>>> 
>>>>>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote:
>>>>>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs
>>>>>>> keep randomly segfaulting and dying. I see this in the MOM log
>>>>>>> right before dying:
>>>>>>> 
>>>>>>> 12/08/2011 10:09:14;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
>>>>>>> 
>>>>>>> 
>>>>>>> And something similar to this in dmesg:
>>>>>>> 
>>>>>>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4
>>>>>> 
>>>>>> We've also seen this on one of our systems and had to fall back
>>>>>> to 2.5.8 on it.
>>>>>> 
>>>>>> 	--Troy
>>>>>> --
>>>>>> Troy Baer, HPC System Administrator
>>>>>> National Institute for Computational Sciences, University of Tennessee
>>>>>> http://www.nics.tennessee.edu/
>>>>>> Phone:  865-241-4233
>>>>> 
>>>>> Could someone configure TORQUE using --with-debug and then send a
>>>>> stack trace of the crash?
>>>>> 
>>>>> Ken
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>> 
>>>> 
>>> 
>>> --
>>> David Beer
>>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>>>    Adaptive Computing
>>>    1712 S East Bay Blvd, Suite 300
>>>    Provo, UT 84606
>>> 
>> 
>> 
> 
> 
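
The dmesg line quoted in the original report can be read to some extent even without a core file. On x86 the "error" value is the page-fault error code, a bitmask: bit 0 distinguishes a protection violation from a not-present page, bit 1 a write from a read, and bit 2 user mode from kernel mode. A small sketch of a decoder follows; decode_pf_error is a name invented here for illustration, and it only looks at those three low bits:

```shell
#!/bin/sh
# decode_pf_error CODE: describe the low three bits of an x86 page-fault
# error code as printed in "segfault at ... error N" kernel messages.
# Hypothetical helper for illustration only.
decode_pf_error() {
    code=$1
    # Bit 0: set = protection violation, clear = page not present
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    # Bit 1: set = write access, clear = read access
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    # Bit 2: set = fault happened in user mode, clear = kernel mode
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

decode_pf_error 4
```

For "error 4" this prints "user-mode read, page not present". Together with the fault address 0x8, that would fit a read through a NULL pointer at offset 8, though only a backtrace can confirm it.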
