[torquedev] torque 2.4.6 crash

Martin Siegert siegert at sfu.ca
Fri Feb 26 11:34:49 MST 2010


Sorry, forgot to cc torquedev.

- Martin

----- Forwarded message from Martin Siegert <siegert at sfu.ca> -----

Date: Fri, 26 Feb 2010 10:31:12 -0800
From: Martin Siegert <siegert at sfu.ca>
To: David Beer <dbeer at adaptivecomputing.com>
Subject: Re: [torquedev] torque 2.4.6 crash

Hi David,

I attached gdb to pbs_server, set a breakpoint at stat_job.c:304, and then
ran "qstat -n". This is what I saw in the gdb session:

(gdb) b stat_job.c:304
Breakpoint 1 at 0x42c643: file stat_job.c, line 304.
(gdb) c
Continuing.

Breakpoint 1, status_attrib (pal=0x0, padef=0x64ca60, pattr=0x71cb50,
    limit=73, priv=1, phead=0x1d4cc7a8, bad=0x71a9c8, IsOwner=1)
    at stat_job.c:304
304               if ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET)
(gdb) n
306                 pres = (resource *)GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list);
(gdb) p (pattr + JOB_ATR_resource)->at_val.at_list
$1 = {ll_prior = 0x12c, ll_next = 0x0, ll_struct = 0x0}
(gdb) n

Program received signal SIGABRT, Aborted.
0x0000003b02830215 in raise () from /lib64/libc.so.6
(gdb)
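
For anyone following the trace: the list head printed above has
ll_next = 0x0 and ll_prior = 0x12c (300 decimal), which does not look like
an initialized list head, so the checked GET_NEXT at line 306 aborts. The
sketch below is a simplified model of that kind of sanity check and is not
the actual TORQUE source. Only the three field names are taken from the gdb
output; init_head, head_is_sane, and checked_next are hypothetical helpers,
and the sketch reports the problem instead of calling abort().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-in for a doubly linked list link. The field names
 * follow the gdb output; everything else is assumed for illustration. */
typedef struct list_link {
    struct list_link *ll_prior;
    struct list_link *ll_next;
    void             *ll_struct;
} list_link;

/* Hypothetical initializer: an empty list head points at itself
 * and carries no payload. */
static void init_head(list_link *head)
{
    assert(head != NULL);
    head->ll_prior  = head;
    head->ll_next   = head;
    head->ll_struct = NULL;
}

/* Hypothetical check: returns 1 if the head looks safe to traverse. */
static int head_is_sane(const list_link *head)
{
    if (head->ll_next == NULL || head->ll_prior == NULL)
        return 0;   /* never initialized, or overwritten */
    if (head->ll_next == head && head->ll_struct != NULL)
        return 0;   /* an empty head must not carry a payload */
    return 1;
}

/* Hypothetical checked accessor: returns the payload of the first entry,
 * or NULL if the list is empty or the head fails the check. */
static void *checked_next(const list_link *head, const char *file, int line)
{
    if (!head_is_sane(head)) {
        fprintf(stderr, "bad pointer in link: file \"%s\", line %d\n",
                file, line);
        return NULL;
    }
    return head->ll_next->ll_struct;
}
```

Under this model, a head holding the values from the gdb session
(ll_prior = 0x12c, ll_next = 0x0, ll_struct = 0x0) fails head_is_sane(),
while a head set up with init_head() passes and yields NULL for an empty
list.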

Cheers,
Martin

On Fri, Feb 26, 2010 at 11:01:48AM -0700, David Beer wrote:
> Hi, 
> 
> We seem to be unable to reproduce this bug (Ken and I have both tried) and we get normal output. Can you send in some more information about the crash? Is this job running on a single node or multiple nodes? Are there any special qmgr settings we should be aware of?
> 
> David
> 
> 
> ----- "Martin Siegert" <siegert at sfu.ca> wrote:
> 
> > Confirmed.
> > This is a show stopper for 2.4.6.
> > 
> > - Martin
> > 
> > -- 
> > Martin Siegert
> > Head, Research Computing
> > WestGrid Site Lead
> > IT Services                                phone: 778 782-4691
> > Simon Fraser University                    fax:   778 782-4242
> > Burnaby, British Columbia                  email: siegert at sfu.ca
> > Canada  V5A 1S6
> > 
> > On Fri, Feb 26, 2010 at 04:31:03PM +0100, Stijn De Weirdt wrote:
> > > I just built 2.4.6, but it crashes when doing the following:
> > > 
> > > qstat -n
> > > 
> > > (qstat (without -n) works)
> > > 
> > > 
> > > pbs_server -D output:
> > > 
> > > # pbs_server -D
> > > pbs_server is up
> > > Assertion failed, bad pointer in link: file "stat_job.c", line 306
> > > Aborted
> > > 
> > > spool/server_priv/jobs is empty. Previous settings come from 2.4.4.
> > > The OS is SL5.4 x86_64. I used the torque.spec file to build RPMs and
> > > do the upgrade.
> > > 
> > > strace doesn't reveal any obvious candidates that cause this.
> > > 
> > > 
> > > stijn
> > > 
> > > 
> > > -- 
> > > http://hasthelhcdestroyedtheearth.com/
> > > 
> > > 
> 
> -- 
> David Beer | Senior Software Engineer
> Adaptive Computing

----- End forwarded message -----

