[torquedev] [torqueusers] torque 2.4.6 crash

Ramon Bastiaans ramon.bastiaans at sara.nl
Fri Mar 26 08:22:00 MDT 2010


Can this be released?

If Torque and qstat -n are broken, then why keep a broken release as the
latest version?


Kind regards,
- Ramon.

On 02/26/2010 10:23 PM, David Beer wrote:
> This has now been fixed and a snapshot is available at: http://www.clusterresources.com/downloads/torque/snapshots/torque-2.4.7-snap.201002261420.tar.gz
>
> David
>
> ----- "Martin Siegert" <siegert at sfu.ca> wrote:
>
>> Yup - this works: pbs_server no longer aborts.
>>
>> - Martin
>>
>> On Fri, Feb 26, 2010 at 12:18:39PM -0700, David Beer wrote:
>>      
>>> I meant ll_next there.
>>>
>>> ----- "David Beer" <dbeer at adaptivecomputing.com> wrote:
>>>
>>>> Yes, that will fix this bug. I'm concerned as to how it's possible that
>>>> the attribute has been set and it is still null. I didn't know that
>>>> was possible. I'm going to check in your patch except I'm going to
>>>> move the check up into the if statement:
>>>>
>>>> if (((pattr + JOB_ATR_resource)->at_val.at_list.at_next != NULL) &&
>>>>     ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET))
>>>>
>>>> David
>>>>
>>>> ----- "Martin Siegert"<siegert at sfu.ca>  wrote:
>>>>
>>>>          
>>>>> Just tested the attached patch.
>>>>> This indeed avoids the crash.
>>>>>
>>>>> - Martin
>>>>>
>>>>> On Fri, Feb 26, 2010 at 10:51:58AM -0800, Martin Siegert wrote:
>>>>>            
>>>>>> As far as I can tell
>>>>>>
>>>>>> GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list)
>>>>>>
>>>>>> is equivalent to
>>>>>>
>>>>>> ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
>>>>>>
>>>>>> However, if ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next is
>>>>>> NULL, you must not access
>>>>>> ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next->ll_struct
>>>>>>
>>>>>> (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
>>>>>> $2 = (struct list_link *) 0x0
>>>>>> (gdb) p ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next.ll_struct
>>>>>> Cannot access memory at address 0x10
>>>>>> (gdb)
>>>>>>
>>>>>> Thus, you must check ((pattr + JOB_ATR_resource)->at_val.at_list).ll_next
>>>>>> first before using the GET_NEXT macro.
>>>>>>
>>>>>> Cheers,
>>>>>> Martin
>>>>>>
>>>>>> On Fri, Feb 26, 2010 at 10:34:49AM -0800, Martin Siegert wrote:
>>>>>>> Sorry, forgot to cc torquedev.
>>>>>>>
>>>>>>> - Martin
>>>>>>>
>>>>>>> ----- Forwarded message from Martin Siegert <siegert at sfu.ca> -----
>>>>>>>
>>>>>>> Date: Fri, 26 Feb 2010 10:31:12 -0800
>>>>>>> From: Martin Siegert<siegert at sfu.ca>
>>>>>>> To: David Beer<dbeer at adaptivecomputing.com>
>>>>>>> Subject: Re: [torquedev] torque 2.4.6 crash
>>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I attach gdb to pbs_server, set a breakpoint at stat_job.c:304, and then
>>>>>>> run "qstat -n". This is what I see in the gdb session:
>>>>>>>
>>>>>>> (gdb) b stat_job.c:304
>>>>>>> Breakpoint 1 at 0x42c643: file stat_job.c, line 304.
>>>>>>> (gdb) c
>>>>>>> Continuing.
>>>>>>>
>>>>>>> Breakpoint 1, status_attrib (pal=0x0, padef=0x64ca60, pattr=0x71cb50,
>>>>>>>      limit=73, priv=1, phead=0x1d4cc7a8, bad=0x71a9c8, IsOwner=1)
>>>>>>>      at stat_job.c:304
>>>>>>> 304               if ((pattr + JOB_ATR_resource)->at_flags & ATR_VFLAG_SET)
>>>>>>> (gdb) n
>>>>>>> 306                 pres = (resource *)GET_NEXT((pattr + JOB_ATR_resource)->at_val.at_list);
>>>>>>> (gdb) p (pattr + JOB_ATR_resource)->at_val.at_list
>>>>>>> $1 = {ll_prior = 0x12c, ll_next = 0x0, ll_struct = 0x0}
>>>>>>> (gdb) n
>>>>>>>
>>>>>>> Program received signal SIGABRT, Aborted.
>>>>>>> 0x0000003b02830215 in raise () from /lib64/libc.so.6
>>>>>>> (gdb)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Martin
>>>>>>>
>>>>>>> On Fri, Feb 26, 2010 at 11:01:48AM -0700, David Beer wrote:
>>>>>>>                
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We seem to be unable to reproduce this bug (Ken and I have both
>>>>>>>> tried) and we get normal output. Can you send in some more information
>>>>>>>> about the crash? Is this job running on a single node or multiple
>>>>>>>> nodes? Are there any special qmgr settings we should be aware of?
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- "Martin Siegert"<siegert at sfu.ca>  wrote:
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> Confirmed.
>>>>>>>>> This is a show stopper for 2.4.6.
>>>>>>>>>
>>>>>>>>> - Martin
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Martin Siegert
>>>>>>>>> Head, Research Computing
>>>>>>>>> WestGrid Site Lead
>>>>>>>>> IT Services                                phone: 778 782-4691
>>>>>>>>> Simon Fraser University                    fax:   778 782-4242
>>>>>>>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>>>>>>>> Canada  V5A 1S6
>>>>>>>>>
>>>>>>>>> On Fri, Feb 26, 2010 at 04:31:03PM +0100, Stijn De Weirdt wrote:
>>>>>>>>>> I just built 2.4.6, but it crashes when doing the following:
>>>>>>>>>>
>>>>>>>>>> qstat -n
>>>>>>>>>>
>>>>>>>>>> (qstat (without -n) works)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> pbs_server -D output:
>>>>>>>>>>
>>>>>>>>>> # pbs_server -D
>>>>>>>>>> pbs_server is up
>>>>>>>>>> Assertion failed, bad pointer in link: file "stat_job.c", line 306
>>>>>>>>>> Aborted
>>>>>>>>>>
>>>>>>>>>> spool/server_priv/jobs is empty. Previous settings come from 2.4.4. The
>>>>>>>>>> OS is SL5.4 x86_64. I used the torque.spec file to build RPMs and do the
>>>>>>>>>> upgrade.
>>>>>>>>>>
>>>>>>>>>> strace doesn't reveal any obvious candidates that cause this.
>>>>>>>>>>
>>>>>>>>>> stijn
>>>>>>>>>>                      
>    
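
For readers following the thread, the guard discussed in the quoted messages
can be sketched in standalone C. This is an illustrative model only, not the
TORQUE source: the struct mirrors the three fields visible in the gdb output
(ll_prior, ll_next, ll_struct), while safe_get_next, has_resource_list, and
DEMO_FLAG_SET are hypothetical names made up for this sketch.

```c
#include <stddef.h>

/* Model of the link structure; field names are taken from the gdb output. */
struct list_link {
    struct list_link *ll_prior;
    struct list_link *ll_next;
    void             *ll_struct;
};

/* Hypothetical stand-in for ATR_VFLAG_SET; the real value is not assumed. */
#define DEMO_FLAG_SET 0x01u

/* Return the payload of the first entry, or NULL if the list head has no
 * next link. This avoids dereferencing a null ll_next. */
static void *safe_get_next(const struct list_link *head)
{
    if (head == NULL || head->ll_next == NULL)
        return NULL;
    return head->ll_next->ll_struct;
}

/* Combined condition in the spirit of the check proposed in the thread:
 * the attribute must be flagged as set AND the list must have a next link. */
static int has_resource_list(unsigned int flags, const struct list_link *head)
{
    return head != NULL
        && head->ll_next != NULL
        && (flags & DEMO_FLAG_SET) != 0;
}
```

With the state shown in the gdb session (ll_next = 0x0), safe_get_next
returns NULL and has_resource_list returns 0, so a caller would skip the
resource list instead of following a null pointer.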


-- 
R. Bastiaans, B.ICT :: Systems Programmer, HPC&V

SARA - Computing & Networking Services
Science Park 121     PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167

