[torquedev] pbs_mom segfault in TMomCheckJobChild
Joshua Bernstein
jbernstein at penguincomputing.com
Sat Dec 20 01:55:55 MST 2008
On Dec 17, 2008, at 3:30 PM, Joshua Bernstein wrote:
>
>
> Garrick Staples wrote:
>> On Wed, Dec 17, 2008 at 02:41:23PM -0800, Joshua Bernstein alleged:
>>>
>>> Garrick Staples wrote:
>>>> On Tue, Dec 16, 2008 at 07:18:24PM -0500, Glen Beane alleged:
>>>>> On Tue, Dec 16, 2008 at 7:17 PM, Glen Beane
>>>>> <glen.beane at gmail.com> wrote:
>>>>>> On Tue, Dec 16, 2008 at 3:06 PM, Joshua Bernstein
>>>>>> <jbernstein at penguincomputing.com> wrote:
>>>>>>
>>>>>>> if (i == -1)
>>>>>>> if (errno == EINTR)
>>>>>>> continue;
>>>>>>>
>>>>>>> The ordering is important. Otherwise the compiler sees if (a && b)
>>>>>>> and is allowed to look at 'b' first to handle short-circuit
>>>>>>> evaluation.
>>>>>> I would NEVER use such a brain dead compiler. Compound Boolean
>>>>>> expressions are evaluated left to right.
>>>>>> if (ptr == NULL && ptr->foo == bar) is never going to access a null
>>>>>> pointer because a correct compiler is never going to do the
>>>>>> ptr->foo == bar test first.
>>>>> i mean if (ptr != NULL && ptr->foo == bar)
>>>> According to the C FAQ (a reference that I deeply trust), these
>>>> constructs are perfectly legal. The || and && operators (and the ?: and
>>>> comma operators) create sequence points between the operands and
>>>> guarantee the order of evaluation.
>>>>
>>>> http://c-faq.com/expr/seqpointops.html
>>>> http://c-faq.com/expr/shortcircuit.html
>>>> http://c-faq.com/expr/seqpoints.html
>>> Ah well. I stand corrected about the ordering issue. Though the
>>> fact later on that errno is assigned even if the read() call
>>> didn't fail still remains.
>> But nothing ever reads RC. While I agree that it is sloppy to assign a
>> possibly bogus value, I don't see an actual bug anywhere. It's not a
>> pointer that gets followed to a bogus memory address to segfault or bus
>> error. It's just an int that is never acted upon. RC is never read once
>> assigned the (possibly) bogus value in errno, right?
>
> Agreed. Does RC have some sort of global scope that is perhaps read
> elsewhere? If so, then I'd imagine I'd see the segfault when the
> value is read, not assigned.
>
> What if, over several calls to this function, the region of
> memory that once held a valid value for errno now contains a null
> pointer, so the assignment fails? Consider:
>
> #include <stddef.h>
> int main(void) {
>     int *i = NULL;  /* null pointer, so the write below faults */
>     *i = '\0';
>     return 0;
> }
>
> This produces a segfault. What further bolsters my theory is that
> several jobs run through this code just fine, so it doesn't happen
> every time we enter the function. But given this workload, we
> *always* get the segfault within 10 minutes or so.
>
>> I hate to beat the point, but it seems you are looking for 2 real
>> bugs and I'd hate for you to stop looking at this point :)
>
> I won't stop looking, and of course I'm convinced I'm right here,
> but I'm open to other suggestions as to what could be going wrong.
> I didn't guess here, GDB told me so. ;-)
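
For reference, the retry pattern argued over in the quoted text can be sketched so that errno is consulted only after read() has actually failed. This is a minimal sketch and not the pbs_mom code; read_retry and read_status are hypothetical names introduced only for illustration.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper, not TORQUE code: read up to len bytes, retrying on EINTR. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
        /* && evaluates left to right, so errno is inspected
         * only when read() has returned -1. */
    } while (n == -1 && errno == EINTR);

    return n;
}

/* Hypothetical helper: derive a status code from errno only on failure,
 * so a stale errno from an earlier call is never copied into the result. */
int read_status(int fd, void *buf, size_t len)
{
    ssize_t n = read_retry(fd, buf, len);

    return (n == -1) ? errno : 0;
}
```

Folding both tests into one && condition is safe because the right-hand operand is not evaluated unless the left-hand one is true.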
Well, I've been able to easily reproduce the failure on two other
in-house clusters in relatively short periods of time. Eventually
pbs_mom leaks enough file descriptors (1024), at which point it either
locks up and eats 100% of the CPU, or simply segfaults in the way
previously shown. The two patches I provided seem to make the
problem go away.
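
One way to watch a descriptor leak like this build up is to count the descriptors a process currently holds open. The following is a rough diagnostic sketch, not part of TORQUE; count_open_fds is a hypothetical name, and the scan cap of 4096 is an assumption chosen only to keep the loop short.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical diagnostic, not TORQUE code: count the descriptors
 * currently open in the calling process. */
int count_open_fds(void)
{
    long max_fd = sysconf(_SC_OPEN_MAX);
    int count = 0;

    /* Assumed cap: fall back to, or clamp at, 4096 so the scan stays cheap. */
    if (max_fd < 0 || max_fd > 4096)
        max_fd = 4096;

    for (long fd = 0; fd < max_fd; fd++) {
        /* fcntl(F_GETFD) fails with EBADF when fd is not an open descriptor. */
        if (fcntl((int)fd, F_GETFD) != -1 || errno != EBADF)
            count++;
    }

    return count;
}
```

Logging this value periodically from a test build would show whether the count climbs steadily toward the limit as jobs run.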
So where does that leave us with 2.3.6? What is the new timeline?
-Joshua Bernstein
Software Engineer
Penguin Computing