[torqueusers] pbs_mom segfaulting
Jan Lindheim
lindheim at cacr.caltech.edu
Thu Jan 29 17:15:17 MST 2009
On Wed, Jan 28, 2009 at 09:15:51AM -0700, Ken Nielson wrote:
> Jan,
>
> A core file, or at least a stack trace, would be nice. I would like to know whether it is the same issue Joshua has described and whether his fix might address the seg fault you have.
Unfortunately, all I have is the message logged when pbs_mom failed.
It looked exactly the same on two different nodes where we saw the problem:
Jan 26 08:33:35 shc147 kernel: pbs_mom[31332]: segfault at 00000000000001ae rip 00002aaaaad33135 rsp 00007ffffff32848 error 4
Jan
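
A kernel segfault line like the one above can be read without a core file. The following is a minimal Python sketch, not part of the original message, that pulls the fields out of the logged line and decodes the low three bits of the x86 page-fault error code. The helper names parse_segfault and decode_error are hypothetical, and the regular expression assumes the older "rip ... rsp ... error" message layout shown in the log, with the error code printed in hexadecimal.

```python
import re

# Assumed layout: the older x86-64 kernel format seen in the log line above.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 2 else "read",
        # bit 2: 0 = fault raised in kernel mode, 1 = user mode
        "mode": "user" if code & 4 else "kernel",
    }


def parse_segfault(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    info = {
        "prog": m.group("prog"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),  # faulting address
        "rip": int(m.group("rip"), 16),    # instruction pointer
        "rsp": int(m.group("rsp"), 16),    # stack pointer
    }
    info.update(decode_error(int(m.group("err"), 16)))
    return info


# The line exactly as logged on shc147.
LINE = ("Jan 26 08:33:35 shc147 kernel: pbs_mom[31332]: segfault at "
        "00000000000001ae rip 00002aaaaad33135 rsp 00007ffffff32848 error 4")

if __name__ == "__main__":
    print(parse_segfault(LINE))
```

For this line, error 4 decodes to a user-mode read of a page that is not mapped. The faulting address 0x1ae is very close to zero, which is consistent with reading a field through a NULL pointer, although only a core file or stack trace would confirm that.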
>
> Ken
>
> ----- Original Message -----
> From: "Joshua Bernstein" <jbernstein at penguincomputing.com>
> To: "Jan Lindheim" <lindheim at cacr.caltech.edu>
> Cc: torqueusers at supercluster.org
> Sent: Tuesday, January 27, 2009 5:49:29 PM GMT -07:00 US/Canada Mountain
> Subject: Re: [torqueusers] pbs_mom segfaulting
>
> Hi Jan,
>
> Jan Lindheim wrote:
> > After recently upgrading to torque 2.3.6, we have seen pbs_mom
> > segfaulting and jobs getting stuck. This is on an Opteron system running
> > SLES9.1. Has anybody else reported instability with pbs_mom lately?
>
> I've personally had problems with 2.3.6 and other versions producing a SEGV. You
> might want to read through the thread here:
>
> http://www.clusterresources.com/pipermail/torquedev/2008-December/001276.html
>
> I have an RPM of version 2.4.0 that I can send you; it contains the fix I
> proposed in the aforementioned post. I'd be curious to see if that fixes your
> issue. Ping me off list and I'd be happy to send you the RPM.
>
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing