[torqueusers] using non-privileged ports
Michael Jennings
mej at lbl.gov
Fri Oct 28 18:15:56 MDT 2011
On Friday, 28 October 2011, at 13:57:15 (-0700),
Martin Siegert wrote:
> # netstat -na | grep 15002
> tcp 0 0 172.18.1.0:629 172.18.1.152:15002 TIME_WAIT
> tcp 0 0 172.18.1.0:701 172.18.1.152:15002 TIME_WAIT
> tcp 0 0 172.18.1.0:689 172.18.1.152:15002 TIME_WAIT
> tcp 0 0 172.18.1.0:685 172.18.1.152:15002 TIME_WAIT
> tcp 0 0 172.18.1.0:951 172.18.1.152:15002 TIME_WAIT
> tcp 0 0 172.18.1.0:979 172.18.1.152:15002 TIME_WAIT
> tcp 0 0 172.18.1.0:962 172.18.1.152:15002 TIME_WAIT
> tcp 0 0 172.18.1.0:669 172.18.1.154:15002 TIME_WAIT
> tcp 0 0 172.18.1.0:662 172.18.1.154:15002 TIME_WAIT
> tcp 0 0 172.18.1.0:804 172.18.1.154:15002 TIME_WAIT
> ...
> # netstat -na | grep 15002 | wc -l
> 974
>
> For some reason the mom-server connections are not closed correctly and we
> end up with all these sockets in the TIME_WAIT state. Note that there are
> even several ones for the same node. Consequently we run out of ports.
>
> Is this a torque problem?
>
>
> Now we work around the problem by setting
>
> net.ipv4.tcp_tw_recycle = 1
> net.ipv4.tcp_tw_reuse = 1
This points strongly to the following being missing somewhere:
    int i = 1;
    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, (char *)&i, sizeof(i));
Possibly in the code that opens the sockets to connect to the moms?
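
To make the suggestion concrete, here is a minimal sketch of where such a call would normally sit: after socket() and before bind(). The helper name open_reusable_socket() and its port argument are hypothetical and not taken from the TORQUE source. Whether this actually cures the TIME_WAIT buildup described above is untested here.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper (not from the TORQUE source): create a TCP socket
 * with SO_REUSEADDR set and bind it to local_port on any local address.
 * Passing 0 lets the kernel pick a port. Returns the descriptor, or -1
 * on error. */
int open_reusable_socket(unsigned short local_port)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("socket");
        return -1;
    }

    /* The option has to be set before bind() to influence whether the
     * local address may be reused. */
    int on = 1;
    if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        perror("setsockopt(SO_REUSEADDR)");
        close(sock);
        return -1;
    }

    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(local_port);

    if (bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0) {
        perror("bind");
        close(sock);
        return -1;
    }

    return sock;
}
```

A caller would then connect() the returned descriptor to the mom's address. Binding to a port below 1024, as in the netstat output above, requires root privileges, so an unprivileged test should pass 0 or a high port.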
Michael
--
Michael Jennings <mej at lbl.gov>
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E W: 510-495-2687
MS 050C-3396 F: 510-486-8615