[torquedev] pbs_mom segfault in TMomCheckJobChild

Joshua Bernstein jbernstein at penguincomputing.com
Sat Dec 20 01:55:55 MST 2008


On Dec 17, 2008, at 3:30 PM, Joshua Bernstein wrote:

>
>
> Garrick Staples wrote:
>> On Wed, Dec 17, 2008 at 02:41:23PM -0800, Joshua Bernstein alleged:
>>>
>>> Garrick Staples wrote:
>>>> On Tue, Dec 16, 2008 at 07:18:24PM -0500, Glen Beane alleged:
>>>>> On Tue, Dec 16, 2008 at 7:17 PM, Glen Beane  
>>>>> <glen.beane at gmail.com> wrote:
>>>>>> On Tue, Dec 16, 2008 at 3:06 PM, Joshua Bernstein
>>>>>> <jbernstein at penguincomputing.com> wrote:
>>>>>>
>>>>>>> if (i == -1)
>>>>>>>       if (errno == EINTR)
>>>>>>>          continue;
>>>>>>>
>>>>>>> The ordering is important.  Otherwise the compiler sees if (a && b)
>>>>>>> and is allowed to look at 'b' first to handle short-circuit
>>>>>>> evaluation.
>>>>>> I would NEVER use such a brain dead compiler.  Compound Boolean
>>>>>> expressions are evaluated left to right.
>>>>>> if (ptr == NULL && ptr->foo == bar) is never going to access a null
>>>>>> pointer because a correct compiler is never going to do the
>>>>>> ptr->foo == bar test first.
>>>>> i mean if (ptr != NULL && ptr->foo == bar)
>>>> According to the C faq (a reference that I deeply trust), these
>>>> constructs are perfectly legal.  The || and && operators (and the ?:
>>>> and comma operators) create sequence points between the operands and
>>>> guarantee the order of evaluation.
>>>>
>>>> http://c-faq.com/expr/seqpointops.html
>>>> http://c-faq.com/expr/shortcircuit.html
>>>> http://c-faq.com/expr/seqpoints.html
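
For reference, a minimal sketch of the guarantee described in the FAQ
entries above. The names node_matches, foo_equals and rhs_evaluations
are made up for illustration and are not from the TORQUE source. The
counter only records whether the right-hand operand of && was ever
evaluated:

```c
#include <stddef.h>

struct node { int foo; };

/* Hypothetical counter: how many times the right-hand operand ran. */
int rhs_evaluations = 0;

/* Right-hand operand; it dereferences ptr, so it must never see NULL. */
static int foo_equals(const struct node *ptr, int bar)
{
    rhs_evaluations++;
    return ptr->foo == bar;
}

/* && evaluates its left operand first and skips the right operand
 * entirely when the left one is false, so a NULL ptr is never
 * dereferenced here. */
int node_matches(const struct node *ptr, int bar)
{
    return ptr != NULL && foo_equals(ptr, bar);
}
```

Calling node_matches(NULL, 7) returns 0 and leaves rhs_evaluations
untouched, which is the left-to-right behavior the standard requires.
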
>>> Ah well. I stand corrected about the ordering issue. Though the
>>> fact remains that, later on, errno is assigned even if the read()
>>> call didn't fail.
>> But nothing ever reads RC.  While I agree that it is sloppy to assign a
>> possibly bogus value, I don't see an actual bug anywhere.  It's not a
>> pointer that gets followed to a bogus memory address to segfault or bus
>> error.  It's just an int that is never acted upon.  RC is never read
>> once assigned the (possibly) bogus value in errno, right?
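
To make the errno point concrete, here is a rough sketch of the pattern
being discussed: retry read() on EINTR and copy errno into a return
code only when the call actually failed. This is not the pbs_mom code;
read_retry and its rc parameter are hypothetical names used only for
illustration:

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper, not taken from TORQUE.
 * Retries read() while it is interrupted by a signal, then reports
 * errno through *rc only if the final call failed. */
ssize_t read_retry(int fd, void *buf, size_t len, int *rc)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
    } while (n == -1 && errno == EINTR);

    /* errno is only meaningful after a failure, so store 0 otherwise. */
    *rc = (n == -1) ? errno : 0;
    return n;
}
```

With this shape, a caller never sees a stale errno value left over from
some earlier, unrelated call.
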
>
> Agreed. Does RC have some sort of global scope that is perhaps read
> elsewhere? If so, then I'd imagine I'd see the segfault when the
> value is read, not assigned.
>
> What if, after several calls through this function, the region of
> memory that once held a valid value for errno now contains a null
> pointer, so the assignment fails? Consider:
>
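> #include <stddef.h>
>
> int main(void) {
>         int *i = NULL;  /* null pointer, as in the scenario above */
>         *i = '\0';      /* write through NULL: typically SIGSEGV */
>         return 0;
> }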
>
> This produces a segfault. What further bolsters my theory is that
> several jobs run through this code just fine, so it doesn't happen
> every time we enter the function. But given this workload, we
> *always* get the segfault within 10 minutes or so.
>
>> I hate to beat the point, but it seems you are looking for 2 real  
>> bugs and I'd
>> hate for you to stop looking at this point :)
>
> I won't stop looking, and of course I'm convinced I'm right here,  
> but I'm open to other suggestions as to what could be going wrong.  
> I didn't guess here, GDB told me so. ;-)

Well, I've been able to easily reproduce the failure on two other
clusters in house in relatively short periods of time. Eventually
pbs_mom leaks enough file descriptors (1024), at which point it either
locks up and eats 100% of the CPU, or simply segfaults in the way
previously shown. The two patches I provided seem to make the
problem go away.
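
For anyone who wants to watch for a leak like this, below is a small
diagnostic sketch that counts the descriptors currently open in the
calling process by probing each one with fcntl(). It is not part of
TORQUE or of the patches; count_open_fds and the 65536 scan cap are
assumptions made for the example. Note that 1024 is a common default
for both the RLIMIT_NOFILE soft limit and FD_SETSIZE, though which of
the two pbs_mom actually runs into is not established here:

```c
#include <fcntl.h>
#include <sys/resource.h>

/* Hypothetical diagnostic helper, not taken from TORQUE.
 * Returns the number of open file descriptors in this process,
 * or -1 if the descriptor limit cannot be queried. */
int count_open_fds(void)
{
    struct rlimit lim;
    long max;
    int fd;
    int open_count = 0;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;

    /* Cap the scan; the soft limit may be unlimited or very large. */
    if (lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur > 65536)
        max = 65536;
    else
        max = (long)lim.rlim_cur;

    /* fcntl(F_GETFD) fails with EBADF for descriptors that are closed. */
    for (fd = 0; fd < max; fd++)
        if (fcntl(fd, F_GETFD) != -1)
            open_count++;

    return open_count;
}
```

Logging this value periodically from a test build should show the count
climbing steadily if descriptors are being leaked, and staying flat
once the leak is fixed.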

So where does that leave us with 2.3.6? What is the new timeline?

-Joshua Bernstein
Software Engineer
Penguin Computing

