[torquedev] torque 2.4.6 crash

Martin Siegert siegert at sfu.ca
Fri Feb 26 12:00:05 MST 2010


Just tested the attached patch.
This indeed avoids the crash.

- Martin
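
For illustration, here is a minimal sketch of the check described in the quoted message below. The attached patch is not reproduced in this text, so this is not its content. The struct layout is an assumption taken from the three fields gdb prints (ll_prior, ll_next, ll_struct), and GET_NEXT_UNCHECKED, get_next_checked and ll_struct_offset are hypothetical names, not TORQUE functions.

```c
#include <stddef.h>

/* Assumed layout: the three fields gdb prints in the quoted session
 * (ll_prior, ll_next, ll_struct), in that order. The real TORQUE
 * header may differ. */
typedef struct list_link {
    struct list_link *ll_prior;
    struct list_link *ll_next;
    void             *ll_struct;
} list_link;

/* Unchecked access, as GET_NEXT is described in the quoted message:
 * undefined behaviour when ll_next is NULL. */
#define GET_NEXT_UNCHECKED(head) ((head).ll_next->ll_struct)

/* Hypothetical helper (not a TORQUE function): test ll_next before
 * following it, and return NULL for an uninitialised list head. */
void *get_next_checked(const list_link *head)
{
    if (head == NULL || head->ll_next == NULL)
        return NULL;
    return head->ll_next->ll_struct;
}

/* Offset that gets added to ll_next when ll_struct is read. With
 * 8-byte pointers this is 16 (0x10), consistent with gdb's
 * "Cannot access memory at address 0x10" for a NULL ll_next. */
size_t ll_struct_offset(void)
{
    return offsetof(list_link, ll_struct);
}
```

Under these assumptions, a caller at stat_job.c:306 could assign the result of get_next_checked(&(pattr + JOB_ATR_resource)->at_val.at_list) to pres and skip the resource handling when it is NULL.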

On Fri, Feb 26, 2010 at 10:51:58AM -0800, Martin Siegert wrote:
> As far as I can tell
> 
> GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list)
> 
> is equivalent to
> 
> ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
> 
> However, if ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next is
> NULL, you must not access
> ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
> 
> (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
> $2 = (struct list_link *) 0x0
> (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next.ll_struct
> Cannot access memory at address 0x10
> (gdb)
> 
> Thus, you must check that ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
> is not NULL before using the GET_NEXT macro.
> 
> Cheers,
> Martin
> 
> On Fri, Feb 26, 2010 at 10:34:49AM -0800, Martin Siegert wrote:
> > Sorry, forgot to cc torquedev.
> > 
> > - Martin
> > 
> > ----- Forwarded message from Martin Siegert <siegert at sfu.ca> -----
> > 
> > Date: Fri, 26 Feb 2010 10:31:12 -0800
> > From: Martin Siegert <siegert at sfu.ca>
> > To: David Beer <dbeer at adaptivecomputing.com>
> > Subject: Re: [torquedev] torque 2.4.6 crash
> > 
> > Hi David,
> > 
> > I attached gdb to pbs_server, set a breakpoint at stat_job.c:304, and then
> > ran "qstat -n". This is what I see in the gdb session:
> > 
> > (gdb) b stat_job.c:304
> > Breakpoint 1 at 0x42c643: file stat_job.c, line 304.
> > (gdb) c
> > Continuing.
> > 
> > Breakpoint 1, status_attrib (pal=0x0, padef=0x64ca60, pattr=0x71cb50,
> >     limit=73, priv=1, phead=0x1d4cc7a8, bad=0x71a9c8, IsOwner=1)
> >     at stat_job.c:304
> > 304               if ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET)
> > (gdb) n
> > 306                 pres = (resource *)GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list);
> > (gdb) p (pattr + JOB_ATR_resource)->at_val.at_list
> > $1 = {ll_prior = 0x12c, ll_next = 0x0, ll_struct = 0x0}
> > (gdb) n
> > 
> > Program received signal SIGABRT, Aborted.
> > 0x0000003b02830215 in raise () from /lib64/libc.so.6
> > (gdb)
> > 
> > Cheers,
> > Martin
> > 
> > On Fri, Feb 26, 2010 at 11:01:48AM -0700, David Beer wrote:
> > > Hi, 
> > > 
> > > We seem to be unable to reproduce this bug (Ken and I have both tried) and we get normal output. Can you send in some more information about the crash? Is this job running on a single node or multiple nodes? Are there any special qmgr settings we should be aware of?
> > > 
> > > David
> > > 
> > > 
> > > ----- "Martin Siegert" <siegert at sfu.ca> wrote:
> > > 
> > > > Confirmed.
> > > > This is a show stopper for 2.4.6.
> > > > 
> > > > - Martin
> > > > 
> > > > -- 
> > > > Martin Siegert
> > > > Head, Research Computing
> > > > WestGrid Site Lead
> > > > IT Services                                phone: 778 782-4691
> > > > Simon Fraser University                    fax:   778 782-4242
> > > > Burnaby, British Columbia                  email: siegert at sfu.ca
> > > > Canada  V5A 1S6
> > > > 
> > > > On Fri, Feb 26, 2010 at 04:31:03PM +0100, Stijn De Weirdt wrote:
> > > > > I just built 2.4.6, but it crashes when I run the following:
> > > > > 
> > > > > qstat -n
> > > > > 
> > > > > (qstat (without -n) works)
> > > > > 
> > > > > 
> > > > > pbs_server -D output:
> > > > > 
> > > > > # pbs_server -D
> > > > > pbs_server is up
> > > > > Assertion failed, bad pointer in link: file "stat_job.c", line 306
> > > > > Aborted
> > > > > 
> > > > > spool/server_priv/jobs is empty. Previous settings come from 2.4.4.
> > > > > The OS is SL5.4 x86_64. I used the torque.spec file to build RPMs
> > > > > and do the upgrade.
> > > > > 
> > > > > strace doesn't reveal any obvious candidates that cause this.
> > > > > 
> > > > > 
> > > > > stijn
> > > > > 
> > > > > 
> > > > > -- 
> > > > > http://hasthelhcdestroyedtheearth.com/
> > > > > 
> > > > > 
> > > > > _______________________________________________
> > > > > torqueusers mailing list
> > > > > torqueusers at supercluster.org
> > > > > http://www.supercluster.org/mailman/listinfo/torqueusers
> > > > _______________________________________________
> > > > torquedev mailing list
> > > > torquedev at supercluster.org
> > > > http://www.supercluster.org/mailman/listinfo/torquedev
> > > 
> > > -- 
> > > David Beer | Senior Software Engineer
> > > Adaptive Computing
> > 
> > ----- End forwarded message -----

-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque-2.4.6-stat_job.patch
Type: text/x-patch
Size: 718 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20100226/8f68db7c/attachment.bin 

