[torqueusers] [torquedev] torque 2.4.6 crash
Ramon Bastiaans
ramon.bastiaans at sara.nl
Fri Mar 26 08:22:00 MDT 2010
Can this fix be released?
If qstat -n is broken in Torque 2.4.6, then why keep a broken release as the
latest version?
Kind regards,
- Ramon.
On 02/26/2010 10:23 PM, David Beer wrote:
> This has now been fixed and a snapshot is available at: http://www.clusterresources.com/downloads/torque/snapshots/torque-2.4.7-snap.201002261420.tar.gz
>
> David
>
> ----- "Martin Siegert" <siegert at sfu.ca> wrote:
>
>
>> Yup - this works: pbs_server no longer aborts.
>>
>> - Martin
>>
>> On Fri, Feb 26, 2010 at 12:18:39PM -0700, David Beer wrote:
>>
>>> I meant ll_next there.
>>>
>>> ----- "David Beer" <dbeer at adaptivecomputing.com> wrote:
>>>
>>>
>>>> Yes, that will fix this bug. I'm concerned as to how it's possible that
>>>> the attribute has been set and it is still null. I didn't know that
>>>> was possible. I'm going to check in your patch, except I'm going to
>>>> move the check up into the if statement:
>>>>
>>>> if (((pattr + JOB_ATR_resource)->at_val.at_list.at_next != NULL) &&
>>>>     ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET))
>>>>
>>>> David
>>>>
>>>> ----- "Martin Siegert" <siegert at sfu.ca> wrote:
>>>>
>>>>
>>>>> Just tested the attached patch.
>>>>> This indeed avoids the crash.
>>>>>
>>>>> - Martin
>>>>>
>>>>> On Fri, Feb 26, 2010 at 10:51:58AM -0800, Martin Siegert wrote:
>>>>>
>>>>>> As far as I can tell
>>>>>>
>>>>>> GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list)
>>>>>>
>>>>>> is equivalent to
>>>>>>
>>>>>> ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
>>>>>>
>>>>>> However, if ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next is
>>>>>> NULL, you must not access
>>>>>> ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
>>>>>>
>>>>>> (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
>>>>>> $2 = (struct list_link *) 0x0
>>>>>> (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next.ll_struct
>>>>>> Cannot access memory at address 0x10
>>>>>> (gdb)
>>>>>>
>>>>>> Thus, you must check ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
>>>>>> first before using the GET_NEXT macro.
>>>>>>
>>>>>> Cheers,
>>>>>> Martin
>>>>>>
>>>>>> On Fri, Feb 26, 2010 at 10:34:49AM -0800, Martin Siegert wrote:
>>
>>>>>>> Sorry, forgot to cc torquedev.
>>>>>>>
>>>>>>> - Martin
>>>>>>>
>>>>>>> ----- Forwarded message from Martin Siegert <siegert at sfu.ca> -----
>>>>>>>
>>>>>>> Date: Fri, 26 Feb 2010 10:31:12 -0800
>>>>>>> From: Martin Siegert <siegert at sfu.ca>
>>>>>>> To: David Beer <dbeer at adaptivecomputing.com>
>>>>>>> Subject: Re: [torquedev] torque 2.4.6 crash
>>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I attach gdb to pbs_server, set a breakpoint at stat_job.c:304, and then
>>>>>>> run "qstat -n". This is what I see in the gdb session:
>>>>>>>
>>>>>>> (gdb) b stat_job.c:304
>>>>>>> Breakpoint 1 at 0x42c643: file stat_job.c, line 304.
>>>>>>> (gdb) c
>>>>>>> Continuing.
>>>>>>>
>>>>>>> Breakpoint 1, status_attrib (pal=0x0, padef=0x64ca60, pattr=0x71cb50,
>>>>>>>     limit=73, priv=1, phead=0x1d4cc7a8, bad=0x71a9c8, IsOwner=1)
>>>>>>>     at stat_job.c:304
>>>>>>> 304         if ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET)
>>>>>>> (gdb) n
>>>>>>> 306         pres = (resource *)GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list);
>>>>>
>>>>>>> (gdb) p (pattr + JOB_ATR_resource)->at_val.at_list
>>>>>>> $1 = {ll_prior = 0x12c, ll_next = 0x0, ll_struct = 0x0}
>>>>>>> (gdb) n
>>>>>>>
>>>>>>> Program received signal SIGABRT, Aborted.
>>>>>>> 0x0000003b02830215 in raise () from /lib64/libc.so.6
>>>>>>> (gdb)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Martin
>>>>>>>
>>>>>>> On Fri, Feb 26, 2010 at 11:01:48AM -0700, David Beer wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We seem to be unable to reproduce this bug (Ken and I have both
>>>>>>>> tried) and we get normal output. Can you send in some more information
>>>>>>>> about the crash? Is this job running on a single node or multiple
>>>>>>>> nodes? Are there any special qmgr settings we should be aware of?
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- "Martin Siegert" <siegert at sfu.ca> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Confirmed.
>>>>>>>>> This is a show stopper for 2.4.6.
>>>>>>>>>
>>>>>>>>> - Martin
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Martin Siegert
>>>>>>>>> Head, Research Computing
>>>>>>>>> WestGrid Site Lead
>>>>>>>>> IT Services                        phone: 778 782-4691
>>>>>>>>> Simon Fraser University            fax:   778 782-4242
>>>>>>>>> Burnaby, British Columbia          email: siegert at sfu.ca
>>>>>>>>> Canada V5A 1S6
>>>>>>>>>
>>>>>>>>> On Fri, Feb 26, 2010 at 04:31:03PM +0100, Stijn De Weirdt wrote:
>>>>>
>>>>>>>>>> I just built 2.4.6, but it crashes when doing the following:
>>
>>>>>>>>>> qstat -n
>>>>>>>>>>
>>>>>>>>>> (qstat (without -n) works)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> pbs_server -D output:
>>>>>>>>>>
>>>>>>>>>> # pbs_server -D
>>>>>>>>>> pbs_server is up
>>>>>>>>>> Assertion failed, bad pointer in link: file "stat_job.c", line 306
>>>>>>>>>> Aborted
>>>>>>>>>>
>>>>>>>>>> spool/server_priv/jobs is empty. Previous settings come from 2.4.4.
>>>>>>>>>> The OS is SL5.4 x86_64. I used the torque.spec file to build rpms
>>>>>>>>>> and do the upgrade.
>>>>>>>>>>
>>>>>>>>>> strace doesn't reveal any obvious candidates that cause this.
>>>>>>>>>>
>>>>>>>>>> stijn
>>>>>>>>>>
>
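
For anyone reading this thread later, the guard discussed in the quoted
messages can be sketched in isolation. This is only a simplified
illustration and not the actual Torque source: struct list_link below is a
stand-in whose field names are taken from the gdb output in the thread, the
value of ATR_VFLAG_SET is an assumption, and first_entry_or_null is a
hypothetical helper that does not exist in Torque.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the list link seen in the gdb output.
 * Field names follow the thread; the real Torque definition may differ. */
struct list_link {
    struct list_link *ll_prior;
    struct list_link *ll_next;
    void             *ll_struct;
};

/* Assumed value, for illustration only. */
#define ATR_VFLAG_SET 0x01

/* Hypothetical helper: return the payload of the first list entry,
 * or NULL when the attribute is not set or the head was never linked.
 * Both conditions are checked before ll_next is dereferenced. */
void *first_entry_or_null(const struct list_link *head, unsigned int flags)
{
    if ((flags & ATR_VFLAG_SET) == 0)
        return NULL;            /* attribute not set */
    if (head->ll_next == NULL)
        return NULL;            /* flag set, but list head uninitialised */
    return head->ll_next->ll_struct;
}
```

With a head like the one in the gdb session (ll_next = 0x0), the helper
returns NULL instead of following the null pointer.
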
--
R. Bastiaans, B.ICT :: Systems Programmer, HPC&V
SARA - Computing & Networking Services
Science Park 121 PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167