[torqueusers] using non-privileged ports

Michael Jennings mej at lbl.gov
Fri Oct 28 18:15:56 MDT 2011


On Friday, 28 October 2011, at 13:57:15 (-0700),
Martin Siegert wrote:

> # netstat -na | grep 15002
> tcp        0      0 172.18.1.0:629              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:701              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:689              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:685              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:951              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:979              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:962              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:669              172.18.1.154:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:662              172.18.1.154:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:804              172.18.1.154:15002          TIME_WAIT
> ...
> # netstat -na | grep 15002 | wc -l
> 974
> 
> For some reason the mom-server connections are not closed correctly and we
> end up with all these sockets in the TIME_WAIT state. Note that there are
> even several ones for the same node. Consequently we run out of ports.
> 
> Is this a torque problem?
> 
> 
> Now we work around the problem by setting
> 
> net.ipv4.tcp_tw_recycle = 1
> net.ipv4.tcp_tw_reuse = 1

This points strongly to the following being missing somewhere:

    int i = 1;

    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, (char *)&i, sizeof(i));

Possibly in the code that opens the sockets to connect to the moms?
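
For illustration, here is a minimal sketch of where that call would sit,
assuming the connecting side binds a local port before calling connect().
The option has to be set before bind() to have any effect on it. The helper
names open_reusable_socket() and reuseaddr_is_set() are hypothetical and are
not taken from the TORQUE source; whether this actually cures the symptom
above would need to be checked against the real code.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not a TORQUE function): create a TCP socket,
 * set SO_REUSEADDR on it, then bind it to the given local port.
 * Returns the socket descriptor, or -1 on failure. */
int open_reusable_socket(unsigned short local_port)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    /* Must happen before bind(); setting it afterwards is too late. */
    int on = 1;
    if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        close(sock);
        return -1;
    }

    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(local_port);   /* 0 lets the kernel pick a port */

    if (bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}

/* Hypothetical helper: returns 1 if SO_REUSEADDR is set on sock,
 * 0 if it is not, and -1 if getsockopt() fails. */
int reuseaddr_is_set(int sock)
{
    int val = 0;
    socklen_t len = sizeof(val);
    if (getsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

A caller would do something like fd = open_reusable_socket(0), then
connect() on fd as usual, and close(fd) when done. The second helper is only
there to confirm that the option actually took effect on a given socket.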

Michael

-- 
Michael Jennings <mej at lbl.gov>
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E      W: 510-495-2687
MS 050C-3396        F: 510-486-8615

