[torquedev] pbs_mom segfault in TMomCheckJobChild

Glen Beane glen.beane at gmail.com
Sat Dec 20 05:57:03 MST 2008


On Sat, Dec 20, 2008 at 3:55 AM, Joshua Bernstein
<jbernstein at penguincomputing.com> wrote:
>
> On Dec 17, 2008, at 3:30 PM, Joshua Bernstein wrote:
>
>>
>>
>> Garrick Staples wrote:
>>>
>>> On Wed, Dec 17, 2008 at 02:41:23PM -0800, Joshua Bernstein alleged:
>>>>
>>>> Garrick Staples wrote:
>>>>>
>>>>> On Tue, Dec 16, 2008 at 07:18:24PM -0500, Glen Beane alleged:
>>>>>>
>>>>>> On Tue, Dec 16, 2008 at 7:17 PM, Glen Beane <glen.beane at gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, Dec 16, 2008 at 3:06 PM, Joshua Bernstein
>>>>>>> <jbernstein at penguincomputing.com> wrote:
>>>>>>>
>>>>>>>> if (i == -1)
>>>>>>>>      if (errno == EINTR)
>>>>>>>>         continue;
>>>>>>>>
>>>>>>>> The ordering is important.  Otherwise the compiler sees if (a && b)
>>>>>>>> and is allowed to look at 'b' first to handle short-circuit
>>>>>>>> evaluation.
>>>>>>>
>>>>>>> I would NEVER use such a brain dead compiler.  Compound Boolean
>>>>>>> expressions are evaluated left to right.
>>>>>>> if (ptr == NULL && ptr->foo == bar) is never going to access a null
>>>>>>> pointer because a correct compiler is never going to do the ptr->foo
>>>>>>> == bar test first.
>>>>>>
>>>>>> I mean if (ptr != NULL && ptr->foo == bar)
>>>>>
>>>>> According to the C faq (a reference that I deeply trust), these
>>>>> constructs are perfectly legal.  The || and && operators (and the ?:
>>>>> and comma operators) create sequence points between the operands and
>>>>> guarantee the order of evaluation.
>>>>>
>>>>> http://c-faq.com/expr/seqpointops.html
>>>>> http://c-faq.com/expr/shortcircuit.html
>>>>> http://c-faq.com/expr/seqpoints.html
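
A minimal sketch of the guarantee described above, assuming a hypothetical
struct and helper names (node, foo_matches, safe_match) that are not part of
the TORQUE source. The counter only shows whether the right-hand operand of &&
was evaluated at all:

```c
#include <stddef.h>

struct node { int foo; };

/* Counts how many times the right-hand operand actually ran. */
static int calls = 0;

/* Hypothetical helper: dereferences ptr, so it must never see NULL. */
static int foo_matches(const struct node *ptr, int bar)
{
    calls++;
    return ptr->foo == bar;
}

/*
 * && evaluates its left operand first and skips the right operand when
 * the left one is false, so foo_matches() is never called with NULL.
 */
static int safe_match(const struct node *ptr, int bar)
{
    return ptr != NULL && foo_matches(ptr, bar);
}
```

Calling safe_match(NULL, 42) returns 0 and leaves the counter untouched, since
the right-hand operand is never evaluated for a null pointer.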
>>>>
>>>> Ah well. I stand corrected about the ordering issue. Though the fact
>>>> remains that, later on, errno is assigned to RC even if the read() call
>>>> didn't fail.
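
One way the errno concern could be avoided, sketched under assumptions: the
helper name read_retry and its rc out-parameter are hypothetical and this is
not the actual pbs_mom code. The idea is to retry on EINTR and to copy errno
only when read() has really failed:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to len bytes from fd, retrying when a
 * signal interrupts the call. *rc receives errno on failure and 0 on
 * success, so it never holds a stale errno value.
 */
static ssize_t read_retry(int fd, void *buf, size_t len, int *rc)
{
    ssize_t i;

    for (;;) {
        i = read(fd, buf, len);
        if (i == -1 && errno == EINTR)
            continue;   /* interrupted before any data arrived: try again */
        break;
    }

    /* errno is only meaningful immediately after a failed call. */
    *rc = (i == -1) ? errno : 0;
    return i;
}
```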
>>>
>>> But nothing ever reads RC.  While I agree that it is sloppy to assign a
>>> possibly bogus value, I don't see an actual bug anywhere.  It's not a
>>> pointer that gets followed to a bogus memory address to segfault or bus
>>> error.  It's just an int that is never acted upon.  RC is never read once
>>> assigned the (possibly) bogus value in errno, right?
>>
>> Agreed. Does RC have some sort of global scope that is perhaps read
>> elsewhere? If so, then I'd imagine I'd see the segfault when the value is
>> read, not assigned.
>>
>> What if, over several calls to this function, the region of memory
>> that once held a valid value for errno now contains a null pointer, so
>> the assignment fails? Consider:
>>
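>> int main(void) {
>>        int *i = NULL;   /* null pointer, as in the scenario above */
>>        *i = '\0';       /* undefined behavior; typically a SIGSEGV */
>>        return 0;
>> }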
>>
>> This produces a segfault. What further bolsters my theory is that several
>> jobs run through this code just fine, so it doesn't happen every time we
>> enter the function. But given this workload, we *always* get the segfault
>> within 10 minutes or so.
>>
>>> I hate to beat the point, but it seems you are looking for 2 real bugs
>>> and I'd hate for you to stop looking at this point :)
>>
>> I won't stop looking, and of course I'm convinced I'm right here, but I'm
>> open to other suggestions as to what could be going wrong. I didn't guess
>> here, GDB told me so. ;-)
>
> Well, I've been able to easily reproduce the failure on two other clusters
> in house in relatively short periods of time. Eventually pbs_mom leaks
> enough file descriptors (1024), in which case it either locks up and eats
> 100% of the CPU, or simply segfaults in the way previously shown. The two
> patches I provided seem to make the problem go away.
>
> So where does that leave us with 2.3.6? What is the new timeline?
>

How are you reproducing the failure?
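
For watching the descriptor leak described above while reproducing it,
something like the following could be called periodically. This is only a
sketch: count_open_fds is a hypothetical helper, not anything in TORQUE, and
the 65536 cap and the 1024 fallback are assumptions chosen to keep the scan
cheap:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical diagnostic: count the descriptors currently open in this
 * process by probing each candidate number with fcntl(F_GETFD).
 */
static int count_open_fds(void)
{
    long max_fd = sysconf(_SC_OPEN_MAX);
    long fd;
    int count = 0;

    if (max_fd < 0)
        max_fd = 1024;      /* assumption: fall back to a common default */
    if (max_fd > 65536)
        max_fd = 65536;     /* assumption: cap the scan on huge limits */

    for (fd = 0; fd < max_fd; fd++) {
        if (fcntl((int)fd, F_GETFD) != -1)
            count++;        /* fcntl succeeds only for open descriptors */
    }
    return count;
}
```

A count that climbs steadily toward the per-process limit between jobs would
be consistent with the leak reported above.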

