[torquedev] torque 2.4.6 crash

David Beer dbeer at adaptivecomputing.com
Fri Feb 26 12:18:39 MST 2010


I meant ll_next there, not at_next.
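
For anyone reading this in the archive, here is a minimal sketch of the
corrected check with ll_next in place of at_next. The struct layouts are
simplified stand-ins based on the gdb output quoted below, the value of
ATR_VFLAG_SET is made up for illustration, and resource_list_usable() and
first_resource() are hypothetical helper names, not functions from the
TORQUE source.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the list head; field names follow the gdb
 * output in this thread, the rest of the real layout is omitted. */
typedef struct list_link {
  struct list_link *ll_prior;
  struct list_link *ll_next;
  void             *ll_struct;
} list_link;

/* Assumed value, for illustration only. */
#define ATR_VFLAG_SET 0x01

/* Simplified stand-in for a job attribute that holds a resource list. */
typedef struct attribute {
  unsigned int at_flags;
  struct {
    list_link at_list;
  } at_val;
} attribute;

/* Hypothetical helper: returns 1 only if the list head has a non-NULL
 * ll_next AND the attribute is flagged as set, otherwise 0. */
static int resource_list_usable(const attribute *pattr)
{
  assert(pattr != NULL);

  return (pattr->at_val.at_list.ll_next != NULL) &&
         (pattr->at_flags & ATR_VFLAG_SET);
}

/* Hypothetical helper: follows ll_next only after the guard passes,
 * so a NULL ll_next yields NULL instead of a bad dereference. */
static void *first_resource(const attribute *pattr)
{
  if (!resource_list_usable(pattr))
    return NULL;

  return pattr->at_val.at_list.ll_next->ll_struct;
}
```

With an attribute whose flag is set but whose ll_next is NULL (the state
shown in the gdb session), first_resource() returns NULL rather than
following the pointer.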

----- "David Beer" <dbeer at adaptivecomputing.com> wrote:

> Yes, that will fix this bug. I'm concerned as to how it's possible that
> the attribute has been set and it is still null. I didn't know that
> was possible. I'm going to check in your patch except I'm going to
> move the check up into the if statement:
> 
> if (((pattr + JOB_ATR_resource)->at_val.at_list.at_next != NULL) &&
>     ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET))
> 
> David
> 
> ----- "Martin Siegert" <siegert at sfu.ca> wrote:
> 
> > Just tested the attached patch.
> > This indeed avoids the crash.
> > 
> > - Martin
> > 
> > On Fri, Feb 26, 2010 at 10:51:58AM -0800, Martin Siegert wrote:
> > > As far as I can tell
> > > 
> > > GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list)
> > > 
> > > is equivalent to
> > > 
> > > ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
> > > 
> > > However, if ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next is
> > > NULL, you must not access
> > > ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
> > > 
> > > (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
> > > $2 = (struct list_link *) 0x0
> > > (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next.ll_struct
> > > Cannot access memory at address 0x10
> > > (gdb)
> > > 
> > > Thus, you must check ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
> > > first before using the GET_NEXT macro.
> > > 
> > > Cheers,
> > > Martin
> > > 
> > > On Fri, Feb 26, 2010 at 10:34:49AM -0800, Martin Siegert wrote:
> > > > Sorry, forgot to cc torquedev.
> > > > 
> > > > - Martin
> > > > 
> > > > ----- Forwarded message from Martin Siegert <siegert at sfu.ca> -----
> > > > 
> > > > Date: Fri, 26 Feb 2010 10:31:12 -0800
> > > > From: Martin Siegert <siegert at sfu.ca>
> > > > To: David Beer <dbeer at adaptivecomputing.com>
> > > > Subject: Re: [torquedev] torque 2.4.6 crash
> > > > 
> > > > Hi David,
> > > > 
> > > > I attach gdb to pbs_server, set a breakpoint at stat_job.c:304, and then
> > > > run "qstat -n". This is what I see in the gdb session:
> > > > 
> > > > (gdb) b stat_job.c:304
> > > > Breakpoint 1 at 0x42c643: file stat_job.c, line 304.
> > > > (gdb) c
> > > > Continuing.
> > > > 
> > > > Breakpoint 1, status_attrib (pal=0x0, padef=0x64ca60, pattr=0x71cb50,
> > > >     limit=73, priv=1, phead=0x1d4cc7a8, bad=0x71a9c8, IsOwner=1)
> > > >     at stat_job.c:304
> > > > 304               if ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET)
> > > > (gdb) n
> > > > 306                 pres = (resource *)GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list);
> > > > (gdb) p (pattr + JOB_ATR_resource)->at_val.at_list
> > > > $1 = {ll_prior = 0x12c, ll_next = 0x0, ll_struct = 0x0}
> > > > (gdb) n
> > > > 
> > > > Program received signal SIGABRT, Aborted.
> > > > 0x0000003b02830215 in raise () from /lib64/libc.so.6
> > > > (gdb)
> > > > 
> > > > Cheers,
> > > > Martin
> > > > 
> > > > On Fri, Feb 26, 2010 at 11:01:48AM -0700, David Beer wrote:
> > > > > Hi, 
> > > > > 
> > > > > We seem to be unable to reproduce this bug (Ken and I have both
> > > > > tried) and we get normal output. Can you send in some more information
> > > > > about the crash? Is this job running on a single node or multiple
> > > > > nodes? Are there any special qmgr settings we should be aware of?
> > > > > 
> > > > > David
> > > > > 
> > > > > 
> > > > > ----- "Martin Siegert" <siegert at sfu.ca> wrote:
> > > > > 
> > > > > > Confirmed.
> > > > > > This is a show stopper for 2.4.6.
> > > > > > 
> > > > > > - Martin
> > > > > > 
> > > > > > -- 
> > > > > > Martin Siegert
> > > > > > Head, Research Computing
> > > > > > WestGrid Site Lead
> > > > > > IT Services                                phone: 778 782-4691
> > > > > > Simon Fraser University                    fax:   778 782-4242
> > > > > > Burnaby, British Columbia                  email: siegert at sfu.ca
> > > > > > Canada  V5A 1S6
> > > > > > 
> > > > > > On Fri, Feb 26, 2010 at 04:31:03PM +0100, Stijn De Weirdt wrote:
> > > > > > > I just built 2.4.6, but it crashes when doing the following:
> > > > > > > 
> > > > > > > qstat -n
> > > > > > > 
> > > > > > > (qstat (without -n) works)
> > > > > > > 
> > > > > > > 
> > > > > > > pbs_server -D output:
> > > > > > > 
> > > > > > > # pbs_server -D
> > > > > > > pbs_server is up
> > > > > > > Assertion failed, bad pointer in link: file "stat_job.c", line 306
> > > > > > > Aborted
> > > > > > > 
> > > > > > > spool/server_priv/jobs is empty. The previous settings come from 2.4.4.
> > > > > > > The OS is SL5.4 x86_64. I used the torque.spec file to build RPMs and do
> > > > > > > the upgrade.
> > > > > > > 
> > > > > > > strace doesn't reveal any obvious candidates that cause this.
> > > > > > > 
> > > > > > > 
> > > > > > > stijn
> > > > > > > 
> > > > > > > 
> > > > > > > -- 
> > > > > > > http://hasthelhcdestroyedtheearth.com/
> > > > > > > 
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > torqueusers mailing list
> > > > > > > torqueusers at supercluster.org
> > > > > > > http://www.supercluster.org/mailman/listinfo/torqueusers
> > > > > > _______________________________________________
> > > > > > torquedev mailing list
> > > > > > torquedev at supercluster.org
> > > > > > http://www.supercluster.org/mailman/listinfo/torquedev
> > > > > 
> > > > > -- 
> > > > > David Beer | Senior Software Engineer
> > > > > Adaptive Computing
> > > > 
> > > > ----- End forwarded message -----
> > > 
> > 
> 

-- 
David Beer | Senior Software Engineer
Adaptive Computing


