[torquedev] pbs_mom segfault in TMomCheckJobChild

Joshua Bernstein jbernstein at penguincomputing.com
Sat Dec 20 15:11:13 MST 2008


On Dec 20, 2008, at 4:57 AM, Glen Beane wrote:

> On Sat, Dec 20, 2008 at 3:55 AM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
>>
>> On Dec 17, 2008, at 3:30 PM, Joshua Bernstein wrote:
>>
>>>
>>>
>>> Garrick Staples wrote:
>>>>
>>>> On Wed, Dec 17, 2008 at 02:41:23PM -0800, Joshua Bernstein alleged:
>>>>>
>>>>> Garrick Staples wrote:
>>>>>>
>>>>>> On Tue, Dec 16, 2008 at 07:18:24PM -0500, Glen Beane alleged:
>>>>>>>
>>>>>>> On Tue, Dec 16, 2008 at 7:17 PM, Glen Beane <glen.beane at gmail.com> wrote:
>>>>>>>>
>>>>>>>> On Tue, Dec 16, 2008 at 3:06 PM, Joshua Bernstein
>>>>>>>> <jbernstein at penguincomputing.com> wrote:
>>>>>>>>
>>>>>>>>> if (i == -1)
>>>>>>>>>      if (errno == EINTR)
>>>>>>>>>         continue;
>>>>>>>>>
>>>>>>>>> The ordering is important.  Otherwise the compiler sees
>>>>>>>>> if (a && b) and is allowed to look at 'b' first to handle
>>>>>>>>> short-circuit evaluation.
>>>>>>>>
>>>>>>>> I would NEVER use such a brain dead compiler.  Compound Boolean
>>>>>>>> expressions are evaluated left to right.
>>>>>>>> if (ptr == NULL && ptr->foo == bar) is never going to access a
>>>>>>>> null pointer because a correct compiler is never going to do
>>>>>>>> the ptr->foo == bar test first.
>>>>>>>
>>>>>>> I mean if (ptr != NULL && ptr->foo == bar)
>>>>>>
>>>>>> According to the C FAQ (a reference that I deeply trust), these
>>>>>> constructs are perfectly legal.  The || and && operators (and the
>>>>>> ?: and comma operators) create sequence points between the
>>>>>> operands and guarantee the order of evaluation.
>>>>>>
>>>>>> http://c-faq.com/expr/seqpointops.html
>>>>>> http://c-faq.com/expr/shortcircuit.html
>>>>>> http://c-faq.com/expr/seqpoints.html
>>>>>
>>>>> Ah well. I stand corrected about the ordering issue. Though the
>>>>> fact remains that, later on, errno is assigned even if the read()
>>>>> call didn't fail.
>>>>
>>>> But nothing ever reads RC.  While I agree that it is sloppy to
>>>> assign a possibly bogus value, I don't see an actual bug anywhere.
>>>> It's not a pointer that gets followed to a bogus memory address to
>>>> segfault or bus error.  It's just an int that is never acted upon.
>>>> RC is never read once assigned the (possibly) bogus value in errno,
>>>> right?
>>>
>>> Agreed. Does RC have some sort of global scope, so that it is perhaps
>>> read elsewhere? If so, then I'd imagine I'd see the segfault when the
>>> value is read, not assigned.
>>>
>>> What if, over several calls to this function, the region of memory
>>> that once held a valid value for errno now contains a null pointer,
>>> so that the assignment fails? Consider:
>>>
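>>> int main(void) {
>>>        int *i;        /* uninitialized: points at an arbitrary address */
>>>        *i = '\0';     /* undefined behavior; typically a segfault */
>>>        return 0;
>>> }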
>>>
>>> This produces a segfault. What further bolsters my theory is that
>>> several jobs run through this code just fine, so it doesn't happen
>>> every time we enter the function. But given this workload, we
>>> *always* get the segfault within 10 minutes or so.
>>>
>>>> I hate to beat the point, but it seems you are looking for 2 real
>>>> bugs and I'd hate for you to stop looking at this point :)
>>>
>>> I won't stop looking, and of course I'm convinced I'm right here,
>>> but I'm open to other suggestions as to what could be going wrong.
>>> I didn't guess here, GDB told me so. ;-)
>>
>> Well, I've been able to easily reproduce the failure on two other
>> clusters in house in relatively short periods of time. Eventually
>> pbs_mom leaks enough file descriptors (1024), at which point it
>> either locks up and eats 100% of the CPU, or simply segfaults in the
>> way previously shown. The two patches I provided seem to make the
>> problem go away.
>>
>> So where does that leave us with 2.3.6? What is the new timeline?
>>
>
> how are you reproducing the failure?

After running, say, 500 to 1000 very short jobs through a stock queue
configuration, eventually one, then all of the pbs_moms either throw
an out-of-file-descriptors error and sit eating 100% of a CPU in an
epoll() loop, or sometimes simply segfault if given more fds. It's
really pretty easy to replicate. This of course is seen with the 2.3.3
release, though I've been able to reproduce it at another site with
2.3.5 and 2.1.9, and even 2.4.0.
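
For anyone who wants to see the descriptor-exhaustion failure mode in
isolation, here is a minimal sketch. It is NOT pbs_mom code: the helper
name leak_until_failure(), the use of /dev/null, and the MAX_TRACKED cap
are all made up for illustration. It lowers the soft RLIMIT_NOFILE,
opens descriptors without closing them the way a leak would, and
records the errno from the first open() that fails. It assumes a
POSIX system with /dev/null available.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

#define MAX_TRACKED 4096   /* arbitrary cap chosen for this sketch */

/*
 * Hypothetical helper, not part of TORQUE.  Lowers the soft
 * RLIMIT_NOFILE to `limit`, then opens /dev/null repeatedly without
 * closing it until open() fails.  Returns the number of descriptors
 * "leaked" and stores the errno of the failing open() in *err_out.
 * Returns -1 if the limit could not be read or changed.
 */
int leak_until_failure(rlim_t limit, int *err_out)
{
    static int leaked[MAX_TRACKED];
    struct rlimit old, lowered;
    int n = 0;
    int i;

    *err_out = 0;
    if (getrlimit(RLIMIT_NOFILE, &old) != 0)
        return -1;
    if (limit > MAX_TRACKED)
        limit = MAX_TRACKED;
    if (limit > old.rlim_max)
        limit = old.rlim_max;   /* soft limit may not exceed the hard limit */
    lowered = old;
    lowered.rlim_cur = limit;
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
        return -1;

    while (n < MAX_TRACKED) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd == -1) {
            *err_out = errno;   /* errno is only meaningful after a failure */
            break;
        }
        leaked[n++] = fd;       /* deliberately not closed here */
    }

    /* Clean up so the caller's process is left as it was found. */
    for (i = 0; i < n; i++)
        close(leaked[i]);
    setrlimit(RLIMIT_NOFILE, &old);
    return n;
}
```

Calling leak_until_failure(1024, &err) should come back with err set to
EMFILE and a count a little under 1024, since stdin, stdout, stderr and
anything else already open take up slots. Every descriptor-creating call
fails the same way once the table is full, which would fit with the
moms falling over only after enough jobs have gone through.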

-Josh


