[torquedev] torque 2.4.6 crash

David Beer dbeer at adaptivecomputing.com
Fri Feb 26 14:23:57 MST 2010


This has now been fixed and a snapshot is available at: http://www.clusterresources.com/downloads/torque/snapshots/torque-2.4.7-snap.201002261420.tar.gz
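
For anyone reading this in the archive, here is a minimal sketch of the guard discussed in the quoted thread below. It is not the code that was checked in. The struct is a simplified stand-in whose field names follow the gdb output in the thread, and safe_get_next is a hypothetical helper name used only for illustration. The idea is to test ll_next before dereferencing it:

```c
#include <stddef.h>

/* Simplified stand-in for the list link type; field names are taken
 * from the gdb output quoted in this thread. Assumption: the real
 * TORQUE definition may differ in detail. */
struct list_link {
    struct list_link *ll_prior;
    struct list_link *ll_next;
    void             *ll_struct;
};

/* Hypothetical helper (not part of TORQUE): return the payload of the
 * entry after head, or NULL when head or head->ll_next is NULL,
 * instead of dereferencing a null ll_next. */
static void *safe_get_next(const struct list_link *head)
{
    if (head == NULL || head->ll_next == NULL)
        return NULL;                  /* nothing to follow, so do not dereference */
    return head->ll_next->ll_struct;  /* safe: ll_next is known to be non-NULL */
}
```

In status_attrib this corresponds to checking ll_next on the JOB_ATR_resource list in addition to ATR_VFLAG_SET before the GET_NEXT call at stat_job.c:306.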

David

----- "Martin Siegert" <siegert at sfu.ca> wrote:

> Yup - this works: pbs_server no longer aborts.
> 
> - Martin
> 
> On Fri, Feb 26, 2010 at 12:18:39PM -0700, David Beer wrote:
> > I meant ll_next there.
> > 
> > ----- "David Beer" <dbeer at adaptivecomputing.com> wrote:
> > 
> > > Yes, that will fix this bug. I'm concerned as to how it's possible that
> > > the attribute has been set and it is still null. I didn't know that
> > > was possible. I'm going to check in your patch, except I'm going to
> > > move the check up into the if statement:
> > > 
> > > if (((pattr + JOB_ATR_resource)->at_val.at_list.at_next != NULL) &&
> > >     ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET))
> > > 
> > > David
> > > 
> > > ----- "Martin Siegert" <siegert at sfu.ca> wrote:
> > > 
> > > > Just tested the attached patch.
> > > > This indeed avoids the crash.
> > > > 
> > > > - Martin
> > > > 
> > > > On Fri, Feb 26, 2010 at 10:51:58AM -0800, Martin Siegert wrote:
> > > > > As far as I can tell
> > > > > 
> > > > > GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list)
> > > > > 
> > > > > is equivalent to
> > > > > 
> > > > > ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
> > > > > 
> > > > > However, if ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next is
> > > > > NULL, you must not access
> > > > > ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
> > > > > 
> > > > > (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
> > > > > $2 = (struct list_link *) 0x0
> > > > > (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next.ll_struct
> > > > > Cannot access memory at address 0x10
> > > > > (gdb)
> > > > > 
> > > > > Thus, you must check ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
> > > > > first before using the GET_NEXT macro.
> > > > > 
> > > > > Cheers,
> > > > > Martin
> > > > > 
> > > > > On Fri, Feb 26, 2010 at 10:34:49AM -0800, Martin Siegert wrote:
> > > > > > Sorry, forgot to cc torquedev.
> > > > > > 
> > > > > > - Martin
> > > > > > 
> > > > > > ----- Forwarded message from Martin Siegert <siegert at sfu.ca> -----
> > > > > > 
> > > > > > Date: Fri, 26 Feb 2010 10:31:12 -0800
> > > > > > From: Martin Siegert <siegert at sfu.ca>
> > > > > > To: David Beer <dbeer at adaptivecomputing.com>
> > > > > > Subject: Re: [torquedev] torque 2.4.6 crash
> > > > > > 
> > > > > > Hi David,
> > > > > > 
> > > > > > I attach gdb to pbs_server, set a breakpoint at stat_job.c:304, and then
> > > > > > run "qstat -n". This is what I see in the gdb session:
> > > > > > 
> > > > > > (gdb) b stat_job.c:304
> > > > > > Breakpoint 1 at 0x42c643: file stat_job.c, line 304.
> > > > > > (gdb) c
> > > > > > Continuing.
> > > > > > 
> > > > > > Breakpoint 1, status_attrib (pal=0x0, padef=0x64ca60, pattr=0x71cb50,
> > > > > >     limit=73, priv=1, phead=0x1d4cc7a8, bad=0x71a9c8, IsOwner=1)
> > > > > >     at stat_job.c:304
> > > > > > 304               if ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET)
> > > > > > (gdb) n
> > > > > > 306                 pres = (resource *)GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list);
> > > > > > (gdb) p (pattr + JOB_ATR_resource)->at_val.at_list
> > > > > > $1 = {ll_prior = 0x12c, ll_next = 0x0, ll_struct = 0x0}
> > > > > > (gdb) n
> > > > > > 
> > > > > > Program received signal SIGABRT, Aborted.
> > > > > > 0x0000003b02830215 in raise () from /lib64/libc.so.6
> > > > > > (gdb)
> > > > > > 
> > > > > > Cheers,
> > > > > > Martin
> > > > > > 
> > > > > > On Fri, Feb 26, 2010 at 11:01:48AM -0700, David Beer wrote:
> > > > > > > Hi, 
> > > > > > > 
> > > > > > > We have been unable to reproduce this bug (Ken and I have both
> > > > > > > tried); we get normal output. Can you send some more information
> > > > > > > about the crash? Is this job running on a single node or multiple
> > > > > > > nodes? Are there any special qmgr settings we should be aware of?
> > > > > > > 
> > > > > > > David
> > > > > > > 
> > > > > > > 
> > > > > > > ----- "Martin Siegert" <siegert at sfu.ca> wrote:
> > > > > > > 
> > > > > > > > Confirmed.
> > > > > > > > This is a show stopper for 2.4.6.
> > > > > > > > 
> > > > > > > > - Martin
> > > > > > > > 
> > > > > > > > -- 
> > > > > > > > Martin Siegert
> > > > > > > > Head, Research Computing
> > > > > > > > WestGrid Site Lead
> > > > > > > > IT Services                                phone: 778 782-4691
> > > > > > > > Simon Fraser University                    fax:   778 782-4242
> > > > > > > > Burnaby, British Columbia                  email: siegert at sfu.ca
> > > > > > > > Canada  V5A 1S6
> > > > > > > > 
> > > > > > > > On Fri, Feb 26, 2010 at 04:31:03PM +0100, Stijn De Weirdt wrote:
> > > > > > > > > I just built 2.4.6, but it crashes when doing the following:
> > > > > > > > > 
> > > > > > > > > qstat -n
> > > > > > > > > 
> > > > > > > > > (qstat (without -n) works)
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > pbs_server -D output:
> > > > > > > > > 
> > > > > > > > > # pbs_server -D
> > > > > > > > > pbs_server is up
> > > > > > > > > Assertion failed, bad pointer in link: file "stat_job.c", line 306
> > > > > > > > > Aborted
> > > > > > > > > 
> > > > > > > > > spool/server_priv/jobs is empty. The previous settings come from 2.4.4.
> > > > > > > > > The OS is SL5.4 x86_64. I used the torque.spec file to build RPMs and
> > > > > > > > > do the upgrade.
> > > > > > > > > 
> > > > > > > > > strace doesn't reveal any obvious candidates that cause this.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > stijn

-- 
David Beer | Senior Software Engineer
Adaptive Computing


