[torqueusers] Fwd: pbsserver 4.1.4 accumulates CLOSE_WAIT connections, bug ?

Michael Jennings mej at lbl.gov
Wed Apr 24 09:57:38 MDT 2013


On Wednesday, 24 April 2013, at 11:10:35 (+0200),
Lech Nieroda wrote:

> >     'netstat -nap' snippet on the server side:
> >     [snip]
> >     tcp      461      0 172.18.0.197:15001   172.18.1.167:422     CLOSE_WAIT  11478/pbs_server
> >     tcp      417      0 172.18.0.197:15001   172.18.1.101:356     CLOSE_WAIT  11478/pbs_server
> >     tcp      413      0 172.18.0.197:15001   172.18.4.60:312      CLOSE_WAIT  11478/pbs_server
> >     tcp      380      0 172.18.0.197:15001   172.18.4.142:394     CLOSE_WAIT  11478/pbs_server
> >     tcp      440      0 172.18.0.197:15001   172.18.3.13:270      CLOSE_WAIT  11478/pbs_server
> >     tcp      417      0 172.18.0.197:15001   172.18.2.124:382     CLOSE_WAIT  11478/pbs_server
> >     tcp      415      0 172.18.0.197:15001   172.18.4.138:398     CLOSE_WAIT  11478/pbs_server
> >     tcp        1      0 172.18.0.197:15001   172.18.0.194:52162   CLOSE_WAIT  11478/pbs_server
> >     tcp        1      0 172.18.0.197:15001   172.18.6.14:51976    CLOSE_WAIT  11478/pbs_server

Unlike TIME_WAIT, CLOSE_WAIT should be avoidable.  The CLOSE_WAIT
state occurs on the passive side of the closing connection.  In this
case, the MOM called close() (or shutdown()) on the socket, and
pbs_server received and ACKed the resulting FIN but has not yet called
close() (or shutdown()) on its end of the same socket.  The socket
stays in CLOSE_WAIT until it does.

In multi-threaded processes, or processes which fork while socket
connections are open, it is vitally important that the active closer
call shutdown(sock, SHUT_RDWR) when done using the socket, and
subsequently call close() once EOF is received from the other side.
The passive closer should call both as well if there is any
possibility that other threads/processes hold file descriptors
associated with that socket.

It is always safe to do one of the following two procedures, depending
on the scenario:

1.  Send any writes on the socket
2.  Call shutdown(sock, SHUT_WR)
3.  Read until EOF
4.  Call shutdown(sock, SHUT_RD)
5.  Call close(sock)

 -OR-

1.  Complete all reads/writes on the socket
2.  Call shutdown(sock, SHUT_RDWR)
3.  Call close(sock)

If there's a possibility of pending data on the socket, use SHUT_WR
first and read until EOF.  If not, use SHUT_RDWR.
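
The two procedures above can be sketched in C roughly as follows.  The
function names are my own, the first one simply discards whatever it
drains, and both assume a blocking socket on which all writes have
already been sent (step 1).  A real daemon would also want a timeout
on the drain loop so a silent peer cannot stall it:

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Procedure 1 (hypothetical name): the peer may still be sending.
 * Returns 0 on success, -1 with errno set on the first failure.
 * The descriptor is always closed. */
int close_after_drain(int sock)
{
    char buf[4096];
    ssize_t n;
    int saved = 0;

    /* Step 2: no more writes from us; this starts the FIN sequence. */
    if (shutdown(sock, SHUT_WR) == -1 && errno != ENOTCONN)
        saved = errno;

    /* Step 3: read (and here, discard) until EOF from the peer. */
    if (saved == 0) {
        do {
            n = read(sock, buf, sizeof buf);
        } while (n > 0 || (n == -1 && errno == EINTR));
        if (n == -1)
            saved = errno;
    }

    /* Step 4: nothing more to read; a failure here is not interesting. */
    (void)shutdown(sock, SHUT_RD);

    /* Step 5: release the descriptor and pick up any pending error. */
    if (close(sock) == -1 && saved == 0)
        saved = errno;

    if (saved != 0) {
        errno = saved;
        return -1;
    }
    return 0;
}

/* Procedure 2 (hypothetical name): all reads and writes are done. */
int close_now(int sock)
{
    int saved = 0;

    if (shutdown(sock, SHUT_RDWR) == -1 && errno != ENOTCONN)
        saved = errno;
    if (close(sock) == -1 && saved == 0)
        saved = errno;

    if (saved != 0) {
        errno = saved;
        return -1;
    }
    return 0;
}
```

ENOTCONN from shutdown() is ignored in both because it only means the
connection is already gone, and close() is called on every path so the
descriptor is not leaked.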

The purpose of calling shutdown() in addition to close() is that it
prevents other processes/threads which may have open file descriptors
on the socket from utilizing them, and it forces the TCP FIN sequence
to begin.  If multiple processes/threads have the same socket's file
descriptor open, a simple close() will not actually terminate the
connection at the TCP level.
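
A small experiment makes the difference visible.  This is only a
sketch: it uses an AF_UNIX socketpair() as a stand-in for a TCP
connection, dup() as a stand-in for the copy a forked child would
hold, and function names I made up for the purpose:

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the peer of fd has signalled EOF, 0 if nothing shows
 * up within 100 ms (or the peek fails), -1 if poll() fails. */
static int sees_eof(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    char c;
    int r = poll(&p, 1, 100);

    if (r <= 0)
        return r;
    return recv(fd, &c, 1, MSG_PEEK) == 0;
}

/* Hypothetical demo: does the far end see EOF after a plain close()
 * of one descriptor, and after shutdown() on the remaining one?
 * Returns 0 on success, -1 if the sockets could not be set up. */
int shared_fd_demo(int *eof_after_close, int *eof_after_shutdown)
{
    int sv[2], copy;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* Second descriptor for the same underlying socket. */
    copy = dup(sv[0]);
    if (copy == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* close() only drops one reference; the socket stays open. */
    close(sv[0]);
    *eof_after_close = sees_eof(sv[1]);

    /* shutdown() acts on the socket itself, whoever else holds it. */
    shutdown(copy, SHUT_RDWR);
    *eof_after_shutdown = sees_eof(sv[1]);

    close(copy);
    close(sv[1]);
    return 0;
}
```

On the Linux systems I would expect this to be run on, the first flag
comes back 0 and the second comes back 1: the peer hears nothing after
the lone close(), and sees EOF only once shutdown() is called.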

It's still important to call close(), however, for all the same
reasons you call close() any other time (most importantly obtaining
pending write errors and not leaking file descriptors).

More information is available at:
http://michael.toren.net/mirrors/sock-faq/unix-socket-faq-2.html#ss2.5

HTH,
Michael

-- 
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

