[torquedev] Re: pbm_mom segfault in TMomCheckJobChild
Josh Butikofer
josh at clusterresources.com
Tue Dec 16 14:37:22 MST 2008
In your patch to net_server.c, you leave the FD_CLR(i, &readset) line,
but if I read the TORQUE code correctly, this is done in the
close_conn() function. So I am going to take out that line as well.
Josh Butikofer
Cluster Resources, Inc.
#############################
Joshua Bernstein wrote:
> Hi,
>
> As some of you may know, last week I posted to torqueusers about a
> pbs_mom segfault that I observered in at least three version of TORQUE,
> including 2.1.9, 2.3.3, and 2.3.5. A bit more background about the
> segfault can be found in my original post:
>
> http://www.clusterresources.com/pipermail/torqueusers/2008-December/008411.html
>
>
> One symptom I failed to mention in the original post was that
> pbs_mom seemed to be leaking file descriptors. In some cases, pbm_mom
> held open 800 to 900 of them, and rarely did that number drop during
> operation. So, the two fixes I propose below not only fix the issue
> related to the segfault, but also appear to properly clean up the file
> descriptor issue.
>
> So, after some debugging I went to checkout the latest CVS and built
> my patches around the December 9 snapshot of 2.4.0. For whatever reason
> I must have glossed over the 2.3.6 snaps.
>
> Though in preparation for for the 2.3.6 release, I'd like to see my
> fixes commited to mainline, and thus I've redone them against the 2.3.6
> snapshot from December 11. I propose that at least these two patches,
> (along with other relevant required changes) be applied to 2.3.6 before
> its release. Below I'll try to explain the rational for each patch.
>
> The first patch is to src/lib/Libnet/net_server.c. It seems it was
> already applied to the 2.4.0 branch, but for whatever reason is still
> left uncorrected in the 2.3.6 snapshot. I also talked about this fix in
> my original post:
>
> - close(i);
> -
> - num_connections--; /* added by CRI - should this be here? */
> -
> + close_conn(i);
>
> Basically close(i) is incorrect, as well as the decrementing of
> num_connections. close_conn(i) is used elsewhere, has the added benefit
> of clearing out some relevant data structures and also takes care of
> decrementing num_connections. Thus the fix involves changing the close()
> to a close_conn() and removing the decrementing of num_connections. The
> reset of net_server.c seems to relatively unchanged between 2.4.0 and
> 2.3.6 (other then some formatting changes) and there should be low risk
> applying this patch. Though this fix didn't seem to make the SegFault
> disappears it did decrease the rate at which fd's were eaten up. My
> patch is attached to this e-mail as net_server.closeconn.patch.
>
> The second patch, while small, I think is the most important. Even after
> applying my fix above pbs_mom still segfaulted. And as shown in the
> original post, this was happening inside of TMomCheckJobChild. I've
> change the exact output so it would wrap nicely, but the ideas are the
> same. The disassembly leads us to:
>
> 0x000000000042add8 <TMomCheckJobChild+319>: callq 0x4064f0
> 0x000000000042addd <TMomCheckJobChild+324>: mov (%rax),%eax
> 0x000000000042addf <TMomCheckJobChild+326>: mov %eax,(%rbx)
> 0x000000000042ade1 <TMomCheckJobChild+328>: mov [sic](%rbp),%rdx
> 0x000000000042ade5 <TMomCheckJobChild+332>: mov [sic](%rbp),%eax
> 0x000000000042ade8 <TMomCheckJobChild+335>: mov %eax,(%rdx)
> 0x000000000042adea <TMomCheckJobChild+337>: cmpl [sic]<LOGLEVEL>
>
> The address is question is the third line down (42addf), and is situated
> between the callq ( to read(), and the cmpl() to call the logging
> routine. Thus the problem has to be one of the two assignments found on
> lines 6454 and 6456. The last assignment will always be valid as both
> *Count and the varible i, will both have valid values. However, the
> assignment of errno may cause a problem. If you look at the spec for any
> system call, in this case the previous read(), upon success, the value
> of errno is undefined. Thus during a successful read() errno is *not*
> reset back to 0, and thus errno may just hold garbage from the stack.
> There are two fixes for this situation, I simply chose the simplest. In
> this case, the simple thing to do, is just to set errno to 0, going into
> the fuction before the both the select() and read(). This is the path I
> took as it lead to the least about of code. The other option is to
> properly recode the loop that tests for error. Line 6448 would become
> something like:
>
> if (i == -1)
> if (errno == EINTR)
> continue;
>
> The ordering is important. Otherwise the compiler sees if (a && b)
> and is allowed to look at 'b' first to handle short-circuit evaluation.
> Anyway, my patch simply sets errno to 0, early on. My patch is attached
> to this e-mail as start_exec.errno.patch. Applying this patch, fixes the
> segfault, and keeps the number of open file descriptors used pbs_mom to
> a much more reasonable level (12 to 16), without aggressive growth.
>
> Now, start_exec seems to have changed quite abit between 2.3.6 and
> 2.4.0, so hopefully this is the only part of the file that is affected.
>
> Whew! Cheers!
>
> -Joshua Bernstein
> Software Engineer
> Penguin Computing
>
More information about the torquedev
mailing list