[torquedev] torque 2.4.6 crash

Martin Siegert siegert at sfu.ca
Fri Feb 26 11:51:58 MST 2010


As far as I can tell

GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list)

is equivalent to

((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct

However, if ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next is
NULL, you must not access
((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct:

(gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
$2 = (struct list_link *) 0x0
(gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next.ll_struct
Cannot access memory at address 0x10
(gdb)

Thus, you must check that ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
is not NULL before using the GET_NEXT macro.
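
A minimal sketch of such a guard, in C. The struct below is only a
stand-in whose field names are copied from the gdb output; it is not
TORQUE's actual header, and first_entry_or_null is a hypothetical helper
name, not an existing TORQUE function. The sketch also assumes that an
initialised, empty list head points back at itself with ll_struct set to
NULL, so that case ends up returning NULL as well.

```c
#include <stddef.h>

/* Stand-in for the list link type; field names taken from the gdb
 * output (ll_prior, ll_next, ll_struct). Assumption, not the real header. */
typedef struct list_link {
    struct list_link *ll_prior;
    struct list_link *ll_next;
    void             *ll_struct;
} list_link;

/* Hypothetical helper: return the payload of the first entry, or NULL
 * when the head was never initialised (ll_next == NULL). */
void *first_entry_or_null(const list_link *head)
{
    if (head == NULL || head->ll_next == NULL)
        return NULL;                  /* nothing safe to dereference */

    return head->ll_next->ll_struct;  /* ll_next is non-NULL here */
}
```

With something like this, the call at stat_job.c:306 could become
pres = (resource *)first_entry_or_null(&(pattr + JOB_ATR_resource)->at_val.at_list);
Whether a NULL ll_next at that point should be treated as an empty list
or as a sign of a bug elsewhere is a separate question.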

Cheers,
Martin

On Fri, Feb 26, 2010 at 10:34:49AM -0800, Martin Siegert wrote:
> Sorry, forgot to cc torquedev.
> 
> - Martin
> 
> ----- Forwarded message from Martin Siegert <siegert at sfu.ca> -----
> 
> Date: Fri, 26 Feb 2010 10:31:12 -0800
> From: Martin Siegert <siegert at sfu.ca>
> To: David Beer <dbeer at adaptivecomputing.com>
> Subject: Re: [torquedev] torque 2.4.6 crash
> 
> Hi David,
> 
> I attach gdb to pbs_server, set a breakpoint at stat_job.c:304, and then
> run "qstat -n". This is what I see in the gdb session:
> 
> (gdb) b stat_job.c:304
> Breakpoint 1 at 0x42c643: file stat_job.c, line 304.
> (gdb) c
> Continuing.
> 
> Breakpoint 1, status_attrib (pal=0x0, padef=0x64ca60, pattr=0x71cb50,
>     limit=73, priv=1, phead=0x1d4cc7a8, bad=0x71a9c8, IsOwner=1)
>     at stat_job.c:304
> 304               if ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET)
> (gdb) n
> 306                 pres = (resource *)GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list);
> (gdb) p (pattr + JOB_ATR_resource)->at_val.at_list
> $1 = {ll_prior = 0x12c, ll_next = 0x0, ll_struct = 0x0}
> (gdb) n
> 
> Program received signal SIGABRT, Aborted.
> 0x0000003b02830215 in raise () from /lib64/libc.so.6
> (gdb)
> 
> Cheers,
> Martin
> 
> On Fri, Feb 26, 2010 at 11:01:48AM -0700, David Beer wrote:
> > Hi, 
> > 
> > We seem to be unable to reproduce this bug (Ken and I have both tried) and we get normal output. Can you send in some more information about the crash? Is this job running on a single node or multiple nodes? Are there any special qmgr settings we should be aware of?
> > 
> > David
> > 
> > 
> > ----- "Martin Siegert" <siegert at sfu.ca> wrote:
> > 
> > > Confirmed.
> > > This is a show stopper for 2.4.6.
> > > 
> > > - Martin
> > > 
> > > -- 
> > > Martin Siegert
> > > Head, Research Computing
> > > WestGrid Site Lead
> > > IT Services                                phone: 778 782-4691
> > > Simon Fraser University                    fax:   778 782-4242
> > > Burnaby, British Columbia                  email: siegert at sfu.ca
> > > Canada  V5A 1S6
> > > 
> > > On Fri, Feb 26, 2010 at 04:31:03PM +0100, Stijn De Weirdt wrote:
> > > > I just built 2.4.6, but it crashes when doing the following:
> > > > 
> > > > qstat -n
> > > > 
> > > > (qstat (without -n) works)
> > > > 
> > > > 
> > > > pbs_server -D output:
> > > > 
> > > > # pbs_server -D
> > > > pbs_server is up
> > > > Assertion failed, bad pointer in link: file "stat_job.c", line 306
> > > > Aborted
> > > > 
> > > > spool/server_priv/jobs is empty. Previous settings come from 2.4.4. The
> > > > OS is SL5.4 x86_64. I used the torque.spec file to build rpms and do the
> > > > upgrade.
> > > > 
> > > > strace doesn't reveal any obvious candidates that cause this.
> > > > 
> > > > 
> > > > stijn
> > > > 
> > > > 
> > > > -- 
> > > > http://hasthelhcdestroyedtheearth.com/
> > > > 
> > > > 
> > > > _______________________________________________
> > > > torqueusers mailing list
> > > > torqueusers at supercluster.org
> > > > http://www.supercluster.org/mailman/listinfo/torqueusers
> > > _______________________________________________
> > > torquedev mailing list
> > > torquedev at supercluster.org
> > > http://www.supercluster.org/mailman/listinfo/torquedev
> > 
> > -- 
> > David Beer | Senior Software Engineer
> > Adaptive Computing
> 
> ----- End forwarded message -----

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6

