[torqueusers] pbs_mom segfaulting

Jan Lindheim lindheim at cacr.caltech.edu
Thu Jan 29 17:15:17 MST 2009


On Wed, Jan 28, 2009 at 09:15:51AM -0700, Ken Nielson wrote:
> Jan,
> 
> A core file, or at least a stack trace, would be nice. I would like to know whether this is the same issue Joshua described and whether his fix might address your segfault.

Unfortunately, all I have is the message logged when pbs_mom failed.
The message was identical on both nodes where we saw the problem.

Jan 26 08:33:35 shc147 kernel: pbs_mom[31332]: segfault at 00000000000001ae rip 00002aaaaad33135 rsp 00007ffffff32848 error 4
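
For anyone reading a line like this without a core file, here is a rough
sketch of how the fields can be pulled apart. It assumes the usual x86 page
fault error code layout (bit 0 = page was present, bit 1 = write access,
bit 2 = fault came from user mode). The names decode_segfault and
SEGFAULT_RE are made up for this illustration and are not part of torque
or the kernel.

```python
import re

# Log line copied from the message above.
LOG_LINE = ("Jan 26 08:33:35 shc147 kernel: pbs_mom[31332]: segfault at "
            "00000000000001ae rip 00002aaaaad33135 rsp 00007ffffff32848 error 4")

# Hypothetical pattern for this older x86_64 kernel message format.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return the parsed fields of a kernel segfault line, or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the kernel prints the error code in hex
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        "rsp": int(m.group("rsp"), 16),
        # Assumed x86 page fault error code bits:
        "page_present": bool(err & 1),  # 0 means the page was not mapped
        "write": bool(err & 2),         # 0 means the access was a read
        "user_mode": bool(err & 4),     # 1 means the fault was in user space
    }


if __name__ == "__main__":
    for key, value in decode_segfault(LOG_LINE).items():
        print(key, hex(value) if key in ("fault_addr", "rip", "rsp") else value)
```

Under those assumptions, "error 4" reads as a user-mode read of an unmapped
page, and a fault address as small as 0x1ae would be consistent with reading
a field through a NULL pointer. That is only an interpretation of the log
line; a core file or stack trace would still be needed to confirm it.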

Jan

> 
> Ken
> 
> ----- Original Message -----
> From: "Joshua Bernstein" <jbernstein at penguincomputing.com>
> To: "Jan Lindheim" <lindheim at cacr.caltech.edu>
> Cc: torqueusers at supercluster.org
> Sent: Tuesday, January 27, 2009 5:49:29 PM GMT -07:00 US/Canada Mountain
> Subject: Re: [torqueusers] pbs_mom segfaulting
> 
> Hi Jan,
> 
> Jan Lindheim wrote:
> > After upgrading to torque 2.3.6 recently, we have seen pbs_mom
> > segfaulting and jobs getting stuck.  This is on an Opteron system, running
> > SLES9.1.  Has anybody else reported instability with pbs_mom lately?
> 
> I've personally had problems with 2.3.6 and other versions producing a SEGV. You 
> might want to read through the thread here:
> 
> http://www.clusterresources.com/pipermail/torquedev/2008-December/001276.html
> 
> I have an RPM of version 2.4.0 that contains the fix I proposed in that
> post. I'd be curious to see whether it fixes your issue. Ping me off list and
> I'd be happy to send you the RPM.
> 
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
> 

