[torquedev] pbm_mom segfault in TMomCheckJobChild
Joshua Bernstein
jbernstein at penguincomputing.com
Tue Dec 16 13:06:30 MST 2008
Hi,
As some of you may know, last week I posted to torqueusers about a
pbs_mom segfault that I observered in at least three version of TORQUE,
including 2.1.9, 2.3.3, and 2.3.5. A bit more background about the
segfault can be found in my original post:
http://www.clusterresources.com/pipermail/torqueusers/2008-December/008411.html
One symptom I failed to mention in the original post was that pbs_mom
seemed to be leaking file descriptors. In some cases, pbm_mom held open
800 to 900 of them, and rarely did that number drop during operation.
So, the two fixes I propose below not only fix the issue related to the
segfault, but also appear to properly clean up the file descriptor issue.
So, after some debugging I went to checkout the latest CVS and built my
patches around the December 9 snapshot of 2.4.0. For whatever reason I
must have glossed over the 2.3.6 snaps.
Though in preparation for for the 2.3.6 release, I'd like to see my
fixes commited to mainline, and thus I've redone them against the 2.3.6
snapshot from December 11. I propose that at least these two patches,
(along with other relevant required changes) be applied to 2.3.6 before
its release. Below I'll try to explain the rational for each patch.
The first patch is to src/lib/Libnet/net_server.c. It seems it was
already applied to the 2.4.0 branch, but for whatever reason is still
left uncorrected in the 2.3.6 snapshot. I also talked about this fix in
my original post:
- close(i);
-
- num_connections--; /* added by CRI - should this be here? */
-
+ close_conn(i);
Basically close(i) is incorrect, as well as the decrementing of
num_connections. close_conn(i) is used elsewhere, has the added benefit
of clearing out some relevant data structures and also takes care of
decrementing num_connections. Thus the fix involves changing the close()
to a close_conn() and removing the decrementing of num_connections. The
reset of net_server.c seems to relatively unchanged between 2.4.0 and
2.3.6 (other then some formatting changes) and there should be low risk
applying this patch. Though this fix didn't seem to make the SegFault
disappears it did decrease the rate at which fd's were eaten up. My
patch is attached to this e-mail as net_server.closeconn.patch.
The second patch, while small, I think is the most important. Even after
applying my fix above pbs_mom still segfaulted. And as shown in the
original post, this was happening inside of TMomCheckJobChild. I've
change the exact output so it would wrap nicely, but the ideas are the
same. The disassembly leads us to:
0x000000000042add8 <TMomCheckJobChild+319>: callq 0x4064f0
0x000000000042addd <TMomCheckJobChild+324>: mov (%rax),%eax
0x000000000042addf <TMomCheckJobChild+326>: mov %eax,(%rbx)
0x000000000042ade1 <TMomCheckJobChild+328>: mov [sic](%rbp),%rdx
0x000000000042ade5 <TMomCheckJobChild+332>: mov [sic](%rbp),%eax
0x000000000042ade8 <TMomCheckJobChild+335>: mov %eax,(%rdx)
0x000000000042adea <TMomCheckJobChild+337>: cmpl [sic]<LOGLEVEL>
The address is question is the third line down (42addf), and is situated
between the callq ( to read(), and the cmpl() to call the logging
routine. Thus the problem has to be one of the two assignments found on
lines 6454 and 6456. The last assignment will always be valid as both
*Count and the varible i, will both have valid values. However, the
assignment of errno may cause a problem. If you look at the spec for any
system call, in this case the previous read(), upon success, the value
of errno is undefined. Thus during a successful read() errno is *not*
reset back to 0, and thus errno may just hold garbage from the stack.
There are two fixes for this situation, I simply chose the simplest. In
this case, the simple thing to do, is just to set errno to 0, going into
the fuction before the both the select() and read(). This is the path I
took as it lead to the least about of code. The other option is to
properly recode the loop that tests for error. Line 6448 would become
something like:
if (i == -1)
if (errno == EINTR)
continue;
The ordering is important. Otherwise the compiler sees if (a && b)
and is allowed to look at 'b' first to handle short-circuit evaluation.
Anyway, my patch simply sets errno to 0, early on. My patch is attached
to this e-mail as start_exec.errno.patch. Applying this patch, fixes the
segfault, and keeps the number of open file descriptors used pbs_mom to
a much more reasonable level (12 to 16), without aggressive growth.
Now, start_exec seems to have changed quite abit between 2.3.6 and
2.4.0, so hopefully this is the only part of the file that is affected.
Whew! Cheers!
-Joshua Bernstein
Software Engineer
Penguin Computing
-------------- next part --------------
A non-text attachment was scrubbed...
Name: net_server.closeconn.patch
Type: text/x-patch
Size: 371 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20081216/3b09550d/net_server.closeconn.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: start_exec.errno.patch
Type: text/x-patch
Size: 250 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20081216/3b09550d/start_exec.errno.bin
More information about the torquedev
mailing list