[torquedev] Re: pbm_mom segfault in TMomCheckJobChild

Josh Butikofer josh at clusterresources.com
Tue Dec 16 13:27:59 MST 2008


Josh,

Thanks a lot for this detailed description and the patches. We will get 
them into the mainline 2.3.x before 2.3.6 is released and we will also 
put this into 2.4!

Awesome sleuthing. :)

Josh Butikofer
Cluster Resources, Inc.
#############################


Joshua Bernstein wrote:
> Hi,
> 
>     As some of you may know, last week I posted to torqueusers about a 
> pbs_mom segfault that I observered in at least three version of TORQUE, 
> including 2.1.9, 2.3.3, and 2.3.5. A bit more background about the 
> segfault can be found in my original post:
> 
> http://www.clusterresources.com/pipermail/torqueusers/2008-December/008411.html 
> 
> 
>     One symptom I failed to mention in the original post was that 
> pbs_mom seemed to be leaking file descriptors. In some cases, pbm_mom 
> held open 800 to 900 of them, and rarely did that number drop during 
> operation. So, the two fixes I propose below not only fix the issue 
> related to the segfault, but also appear to properly clean up the file 
> descriptor issue.
> 
>     So, after some debugging I went to checkout the latest CVS and built 
> my patches around the December 9 snapshot of 2.4.0. For whatever reason 
> I must have glossed over the 2.3.6 snaps.
> 
>     Though in preparation for for the 2.3.6 release, I'd like to see my 
> fixes commited to mainline, and thus I've redone them against the 2.3.6 
> snapshot from December 11. I propose that at least these two patches, 
> (along with other relevant required changes) be applied to 2.3.6 before 
> its release. Below I'll try to explain the rational for each patch.
> 
>     The first patch is to src/lib/Libnet/net_server.c. It seems it was 
> already applied to the 2.4.0 branch, but for whatever reason is still 
> left uncorrected in the 2.3.6 snapshot. I also talked about this fix in 
> my original post:
> 
> -        close(i);
> -
> -        num_connections--;  /* added by CRI - should this be here? */
> -
> +        close_conn(i);
> 
> Basically close(i) is incorrect, as well as the decrementing of 
> num_connections. close_conn(i) is used elsewhere, has the added benefit 
> of clearing out some relevant data structures and also takes care of 
> decrementing num_connections. Thus the fix involves changing the close() 
> to a close_conn() and removing the decrementing of num_connections. The 
> reset of net_server.c seems to relatively unchanged between 2.4.0 and 
> 2.3.6 (other then some formatting changes) and there should be low risk 
> applying this patch. Though this fix didn't seem to make the SegFault 
> disappears it did decrease the rate at which fd's were eaten up. My 
> patch is attached to this e-mail as net_server.closeconn.patch.
> 
> The second patch, while small, I think is the most important. Even after 
> applying my fix above pbs_mom still segfaulted. And as shown in the 
> original post, this was happening inside of TMomCheckJobChild. I've 
> change the exact output so it would wrap nicely, but the ideas are the 
> same. The disassembly leads us to:
> 
> 0x000000000042add8 <TMomCheckJobChild+319>:     callq  0x4064f0
> 0x000000000042addd <TMomCheckJobChild+324>:     mov    (%rax),%eax
> 0x000000000042addf <TMomCheckJobChild+326>:     mov    %eax,(%rbx)
> 0x000000000042ade1 <TMomCheckJobChild+328>:     mov    [sic](%rbp),%rdx
> 0x000000000042ade5 <TMomCheckJobChild+332>:     mov    [sic](%rbp),%eax
> 0x000000000042ade8 <TMomCheckJobChild+335>:     mov    %eax,(%rdx)
> 0x000000000042adea <TMomCheckJobChild+337>:     cmpl   [sic]<LOGLEVEL>
> 
> The address is question is the third line down (42addf), and is situated 
> between the callq ( to read(), and the cmpl() to call the logging 
> routine. Thus the problem has to be one of the two assignments found on 
> lines 6454 and 6456. The last assignment will always be valid as both 
> *Count and the varible i, will both have valid values. However, the 
> assignment of errno may cause a problem. If you look at the spec for any 
> system call, in this case the previous read(), upon success, the value 
> of errno is undefined. Thus during a successful read() errno is *not* 
> reset back to 0, and thus errno may just hold garbage from the stack. 
> There are two fixes for this situation, I simply chose the simplest. In 
> this case, the simple thing to do, is just to set errno to 0, going into 
> the fuction before the both the select() and read(). This is the path I 
> took as it lead to the least about of code. The other option is to 
> properly recode the loop that tests for error. Line 6448 would become 
> something like:
> 
> if (i == -1)
>     if (errno == EINTR)
>            continue;
> 
> The ordering is important.  Otherwise the compiler sees if (a && b)
> and is allowed to look at 'b' first to handle short-circuit evaluation. 
> Anyway, my patch simply sets errno to 0, early on. My patch is attached 
> to this e-mail as start_exec.errno.patch. Applying this patch, fixes the 
> segfault, and keeps the number of open file descriptors used pbs_mom to 
> a much more reasonable level (12 to 16), without aggressive growth.
> 
> Now, start_exec seems to have changed quite abit between 2.3.6 and 
> 2.4.0, so hopefully this is the only part of the file that is affected.
> 
> Whew! Cheers!
> 
> -Joshua Bernstein
> Software Engineer
> Penguin Computing
> 


More information about the torquedev mailing list