[torqueusers] Torque 2.5.9 MOMs keep segfaulting

Ti Leggett leggett at mcs.anl.gov
Wed Jan 11 09:05:17 MST 2012


I finally got around to rebuilding with --with-debug, but I don't see a core file in /var/spool/torque or in /usr/sbin. Where would the core get dumped?
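
Where a core lands on Linux generally depends on kernel.core_pattern, the daemon's current working directory, and its core size limit, so those are worth checking before searching further. A minimal Python sketch, assuming a Linux /proc filesystem that exposes /proc/<pid>/limits; the function names are made up for illustration and nothing here is TORQUE-specific:

```python
import os

def parse_core_limit(limits_text):
    """Return (soft, hard) 'Max core file size' values from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max core file size"):
            fields = line[len("Max core file size"):].split()
            return fields[0], fields[1]
    return None

def core_dump_hints(pid):
    """Gather the settings that decide whether and where a core file is written."""
    hints = {"core_pattern": None, "cwd": None, "core_limit": None}
    try:
        # A pattern without a leading '/' is relative to the process's cwd;
        # a leading '|' means the core is piped to a helper program.
        with open("/proc/sys/kernel/core_pattern") as f:
            hints["core_pattern"] = f.read().strip()
    except OSError:
        pass
    try:
        # The working directory of the target process (needs suitable privileges).
        hints["cwd"] = os.readlink("/proc/%d/cwd" % pid)
    except OSError:
        pass
    try:
        with open("/proc/%d/limits" % pid) as f:
            hints["core_limit"] = parse_core_limit(f.read())
    except OSError:
        pass
    return hints

if __name__ == "__main__":
    # Assumption: replace os.getpid() with the PID of the running pbs_mom.
    print(core_dump_hints(os.getpid()))
```

A soft limit of 0 normally means no core file is written at all, which would be one explanation for not finding one.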

On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote:

> ----- Original Message -----
>> From: "Troy Baer" <tbaer at utk.edu>
>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>> Sent: Tuesday, December 20, 2011 8:59:56 AM
>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
>> 
>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote:
>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs
>>> keep randomly segfaulting and dying. I see this in the MOM log
>>> right before dying:
>>> 
>>> 12/08/2011 10:09:14;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file
>>> descriptor (9) in tm_request, comm failed Protocol failure in
>>> commit
>>> 
>>> 
>>> And something similar to this in dmesg:
>>> 
>>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f
>>> rsp 00007fff19e96df0 error 4
>> 
>> We've also seen this on one of our systems and had to fall back to
>> 2.5.8 on it.
>> 
>> 	--Troy
>> --
>> Troy Baer, HPC System Administrator
>> National Institute for Computational Sciences, University of
>> Tennessee
>> http://www.nics.tennessee.edu/
>> Phone:  865-241-4233
> 
> Could someone configure TORQUE using --with-debug and then send a stack trace of the crash?
> 
> Ken 
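
The dmesg line quoted above can be unpacked mechanically. A rough Python sketch, assuming the x86-64 page-fault error-code layout (bit 0: page present, bit 1: write access, bit 2: user mode) and the older "rip ... rsp ..." message format shown in the quote; decode_segfault is a hypothetical helper, not a TORQUE or kernel tool:

```python
import re

# Matches the older kernel segfault message format seen in the quoted dmesg output.
SEGFAULT_RE = re.compile(
    r"(?P<name>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)

def decode_segfault(line):
    """Parse one segfault line and decode the low three error-code bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the kernel prints the error code in hex
    return {
        "process": m.group("name"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "instruction_pointer": int(m.group("rip"), 16),
        "page_present": bool(err & 1),          # 0: page was not mapped
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
    }

if __name__ == "__main__":
    # The quoted line, rejoined after mail wrapping.
    line = ("pbs_mom[22354]: segfault at 0000000000000008 "
            "rip 00002b585249ed6f rsp 00007fff19e96df0 error 4")
    print(decode_segfault(line))
```

For the quoted line this gives a user-mode read of an unmapped page at address 0x8, which is consistent with (though not proof of) reading a field through a NULL pointer.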
