[torquedev] pbm_mom segfault in TMomCheckJobChild

Joshua Bernstein jbernstein at penguincomputing.com
Tue Dec 16 13:06:30 MST 2008


Hi,

	As some of you may know, last week I posted to torqueusers about a 
pbs_mom segfault that I observered in at least three version of TORQUE, 
including 2.1.9, 2.3.3, and 2.3.5. A bit more background about the 
segfault can be found in my original post:

http://www.clusterresources.com/pipermail/torqueusers/2008-December/008411.html

	One symptom I failed to mention in the original post was that pbs_mom 
seemed to be leaking file descriptors. In some cases, pbm_mom held open 
800 to 900 of them, and rarely did that number drop during operation. 
So, the two fixes I propose below not only fix the issue related to the 
segfault, but also appear to properly clean up the file descriptor issue.

	So, after some debugging I went to checkout the latest CVS and built my 
patches around the December 9 snapshot of 2.4.0. For whatever reason I 
must have glossed over the 2.3.6 snaps.

	Though in preparation for for the 2.3.6 release, I'd like to see my 
fixes commited to mainline, and thus I've redone them against the 2.3.6 
snapshot from December 11. I propose that at least these two patches, 
(along with other relevant required changes) be applied to 2.3.6 before 
its release. Below I'll try to explain the rational for each patch.

	The first patch is to src/lib/Libnet/net_server.c. It seems it was 
already applied to the 2.4.0 branch, but for whatever reason is still 
left uncorrected in the 2.3.6 snapshot. I also talked about this fix in 
my original post:

-        close(i);
-
-        num_connections--;  /* added by CRI - should this be here? */
-
+        close_conn(i);

Basically close(i) is incorrect, as well as the decrementing of 
num_connections. close_conn(i) is used elsewhere, has the added benefit 
of clearing out some relevant data structures and also takes care of 
decrementing num_connections. Thus the fix involves changing the close() 
to a close_conn() and removing the decrementing of num_connections. The 
reset of net_server.c seems to relatively unchanged between 2.4.0 and 
2.3.6 (other then some formatting changes) and there should be low risk 
applying this patch. Though this fix didn't seem to make the SegFault 
disappears it did decrease the rate at which fd's were eaten up. My 
patch is attached to this e-mail as net_server.closeconn.patch.

The second patch, while small, I think is the most important. Even after 
applying my fix above pbs_mom still segfaulted. And as shown in the 
original post, this was happening inside of TMomCheckJobChild. I've 
change the exact output so it would wrap nicely, but the ideas are the 
same. The disassembly leads us to:

0x000000000042add8 <TMomCheckJobChild+319>:     callq  0x4064f0
0x000000000042addd <TMomCheckJobChild+324>:     mov    (%rax),%eax
0x000000000042addf <TMomCheckJobChild+326>:     mov    %eax,(%rbx)
0x000000000042ade1 <TMomCheckJobChild+328>:     mov    [sic](%rbp),%rdx
0x000000000042ade5 <TMomCheckJobChild+332>:     mov    [sic](%rbp),%eax
0x000000000042ade8 <TMomCheckJobChild+335>:     mov    %eax,(%rdx)
0x000000000042adea <TMomCheckJobChild+337>:     cmpl   [sic]<LOGLEVEL>

The address is question is the third line down (42addf), and is situated 
between the callq ( to read(), and the cmpl() to call the logging 
routine. Thus the problem has to be one of the two assignments found on 
lines 6454 and 6456. The last assignment will always be valid as both 
*Count and the varible i, will both have valid values. However, the 
assignment of errno may cause a problem. If you look at the spec for any 
system call, in this case the previous read(), upon success, the value 
of errno is undefined. Thus during a successful read() errno is *not* 
reset back to 0, and thus errno may just hold garbage from the stack. 
There are two fixes for this situation, I simply chose the simplest. In 
this case, the simple thing to do, is just to set errno to 0, going into 
the fuction before the both the select() and read(). This is the path I 
took as it lead to the least about of code. The other option is to 
properly recode the loop that tests for error. Line 6448 would become 
something like:

if (i == -1)
	if (errno == EINTR)
            continue;

The ordering is important.  Otherwise the compiler sees if (a && b)
and is allowed to look at 'b' first to handle short-circuit evaluation. 
Anyway, my patch simply sets errno to 0, early on. My patch is attached 
to this e-mail as start_exec.errno.patch. Applying this patch, fixes the 
segfault, and keeps the number of open file descriptors used pbs_mom to 
a much more reasonable level (12 to 16), without aggressive growth.

Now, start_exec seems to have changed quite abit between 2.3.6 and 
2.4.0, so hopefully this is the only part of the file that is affected.

Whew! Cheers!

-Joshua Bernstein
Software Engineer
Penguin Computing

-------------- next part --------------
A non-text attachment was scrubbed...
Name: net_server.closeconn.patch
Type: text/x-patch
Size: 371 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20081216/3b09550d/net_server.closeconn.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: start_exec.errno.patch
Type: text/x-patch
Size: 250 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20081216/3b09550d/start_exec.errno.bin


More information about the torquedev mailing list