[torqueusers] Torque 2.5.9 MOMs keep segfaulting

Ti Leggett leggett at mcs.anl.gov
Fri Feb 3 08:29:41 MST 2012


Some more information on this problem: it is triggered by one user who runs the Intel MPI implementation with MPDs instead of hydra. My guess is that the MPDs try to communicate outside of the MOM, which confuses the MOMs and causes them to bail. I've asked the user to switch to hydra but haven't heard back yet.
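
For reference, the dmesg line quoted at the bottom of this thread ("segfault at 0000000000000008 ... error 4") can be decoded from the x86 page-fault error bits: bit 0 is set for a protection violation (clear means the page was not present), bit 1 for a write (clear means a read), bit 2 for user mode. A minimal sketch follows; decode_fault_error is a made-up helper name, not part of TORQUE or any standard tool.

```shell
#!/bin/sh
# Decode the low three bits of an x86 page-fault error code, as printed in a
# kernel "segfault at ... error N" line.
# decode_fault_error is a hypothetical name used only for this illustration.
decode_fault_error() {
    code=$1
    # bit 0: 0 = page not present, 1 = protection violation
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    # bit 1: 0 = read access, 1 = write access
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    # bit 2: 0 = kernel mode, 1 = user mode
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

decode_fault_error 4   # prints "user-mode read, page not present"
```

A user-mode read of an unmapped page at address 0x8 would be consistent with dereferencing a NULL pointer at a small offset (for example a struct member), but that is an inference from the log line, not something confirmed in this thread.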

On Jan 16, 2012, at 10:44 AM, Ti Leggett wrote:

> They seem to die immediately. I can't really run them in gdb since the crashes hit random nodes and I haven't found a way to trigger the failure.
> 
> On Jan 11, 2012, at 2:52 PM, David Beer wrote:
> 
>> Do they segfault right away? If you can't find a core file, would it be possible to run the mom in gdb and get a backtrace of the crash when it happens?
>> 
>> David
>> 
>> ----- Original Message -----
>>> torque was configured with --with-debug, "ulimit -c unlimited" is in
>>> the init script right before the moms are started like
>>> "/usr/sbin/pbs_mom -p -d /var/spool/torque" but I'm still not seeing
>>> a core file anywhere.
>>> 
>>> On Jan 11, 2012, at 10:26 AM, David Beer wrote:
>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>>> I finally got around to doing this, but I don't see a core file in
>>>>> /var/spool/torque or in /usr/sbin. Where would the core get
>>>>> dumped?
>>>>> 
>>>> 
>>>> A mom's core file would be in /var/spool/torque/mom_priv. You need
>>>> to make sure ulimit -c is unlimited or set to a very large number.
>>>> 
>>>> David
>>>> 
>>>>> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote:
>>>>> 
>>>>>> ----- Original Message -----
>>>>>>> From: "Troy Baer" <tbaer at utk.edu>
>>>>>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>>>>>>> Sent: Tuesday, December 20, 2011 8:59:56 AM
>>>>>>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
>>>>>>> 
>>>>>>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote:
>>>>>>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs
>>>>>>>> keep randomly segfaulting and dying. I see this in the MOM log
>>>>>>>> right before dying:
>>>>>>>> 
>>>>>>>> 12/08/2011 10:09:14;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
>>>>>>>> 
>>>>>>>> 
>>>>>>>> And something similar to this in dmesg:
>>>>>>>> 
>>>>>>>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4
>>>>>>> 
>>>>>>> We've also seen this on one of our systems and had to fall back to
>>>>>>> 2.5.8 on it.
>>>>>>> 
>>>>>>> 	--Troy
>>>>>>> --
>>>>>>> Troy Baer, HPC System Administrator
>>>>>>> National Institute for Computational Sciences, University of Tennessee
>>>>>>> http://www.nics.tennessee.edu/
>>>>>>> Phone:  865-241-4233
>>>>>> 
>>>>>> Could someone configure TORQUE using --with-debug and then send a
>>>>>> stack trace of the crash?
>>>>>> 
>>>>>> Ken
>>>>>> _______________________________________________
>>>>>> torqueusers mailing list
>>>>>> torqueusers at supercluster.org
>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> David Beer
>>>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>>>>   Adaptive Computing
>>>>   1712 S East Bay Blvd, Suite 300
>>>>   Provo, UT 84606
>>> 
>>> 
>> 
>> 
> 
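
On the missing core file discussed in the quoted messages above: a core is only written if the limit of the running daemon itself allows it, and it is written relative to the daemon's working directory unless /proc/sys/kernel/core_pattern holds an absolute path or a pipe. A minimal sketch for checking those three things on a compute node is shown here. It assumes a Linux kernel that exposes /proc/<pid>/limits and the procps pgrep; the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the soft core-size limit recorded in a limits
# file such as /proc/<pid>/limits (fifth whitespace-separated field).
core_soft_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Hypothetical helper: report the settings that decide whether and where a
# running pbs_mom would dump core. Run as root on the compute node.
report_mom_core_settings() {
    # -o picks the oldest matching process, -x requires an exact name match
    pid=$(pgrep -o -x pbs_mom) || { echo "pbs_mom is not running"; return 1; }
    echo "pid:          $pid"
    echo "core limit:   $(core_soft_limit "/proc/$pid/limits")"
    echo "working dir:  $(readlink "/proc/$pid/cwd")"
    echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
}
```

Calling report_mom_core_settings after the init script has started the MOM shows whether the "ulimit -c unlimited" actually reached the daemon: a core limit of 0 would mean it did not.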
