[torqueusers] using non-privileged ports
siegert at sfu.ca
Fri Oct 28 14:57:15 MDT 2011
On Fri, Oct 28, 2011 at 10:39:56AM -0700, Martin Siegert wrote:
> we just recompiled torque with --disable-privports
> (since we constantly ran out of ports). Now we have a different
> problem which is just as bad:
> # qstat -an1
> Connection timed out
> qstat: cannot connect to server b0 (errno=110) Connection timed out
> This does not appear right away after starting the server, but after
> a few hours of running. As far as I can tell the only way to get the
> server out of this state is to restart it.
> But there must be many sites that run torque with --disable-privports.
> Thus: what am I missing?
We gave up: --disable-privports does not appear to be working. Now we
are back to our previous problem (this is on the server - there are
no connections in the TIME_WAIT state on the nodes):
# netstat -na | grep 15002
tcp 0 0 172.18.1.0:629 172.18.1.152:15002 TIME_WAIT
tcp 0 0 172.18.1.0:701 172.18.1.152:15002 TIME_WAIT
tcp 0 0 172.18.1.0:689 172.18.1.152:15002 TIME_WAIT
tcp 0 0 172.18.1.0:685 172.18.1.152:15002 TIME_WAIT
tcp 0 0 172.18.1.0:951 172.18.1.152:15002 TIME_WAIT
tcp 0 0 172.18.1.0:979 172.18.1.152:15002 TIME_WAIT
tcp 0 0 172.18.1.0:962 172.18.1.152:15002 TIME_WAIT
tcp 0 0 172.18.1.0:669 172.18.1.154:15002 TIME_WAIT
tcp 0 0 172.18.1.0:662 172.18.1.154:15002 TIME_WAIT
tcp 0 0 172.18.1.0:804 172.18.1.154:15002 TIME_WAIT
# netstat -na | grep 15002 | wc -l
For some reason the mom-server connections are not closed correctly and we
end up with all these sockets in the TIME_WAIT state. Note that there are
even several for the same node. Consequently we run out of ports.
Is this a torque problem?
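
To see how the TIME_WAIT sockets are spread over the nodes, something like the following could be used. This is only a rough sketch: the function name count_time_wait is made up for illustration, and it assumes IPv4 addresses and the usual Linux `netstat -na` column layout (remote address in column 5, state in column 6).

```shell
# Hypothetical helper (not part of torque): count TIME_WAIT sockets per
# remote host for a given remote port. Reads `netstat -na` output on stdin.
# Assumes IPv4 "address:port" pairs and the state in the sixth column.
count_time_wait() {
    awk -v port="$1" '
        # keep only TIME_WAIT sockets whose remote end is the given port
        $6 == "TIME_WAIT" && $5 ~ (":" port "$") {
            split($5, addr, ":")
            count[addr[1]]++
        }
        END { for (host in count) print count[host], host }
    ' | sort -rn
}

# Usage on the server (15002 is the pbs_mom port seen above):
#   netstat -na | count_time_wait 15002
```

Each output line is a count followed by the node's address, highest count first, so nodes that hold many sockets in TIME_WAIT show up at the top.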
Now we work around the problem by setting
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
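
A small sketch for checking which of these two settings are actually in effect on a machine. The function name show_tw_settings and its optional directory argument are made up for illustration; it only assumes that the values are exposed under /proc/sys/net/ipv4, as on Linux. Note that tcp_tw_recycle was removed in Linux 4.12, so on newer kernels that entry will be missing.

```shell
# Hypothetical helper: print the current values of the two TIME_WAIT
# related settings. The optional argument is the sysctl root directory
# (defaults to /proc/sys); it exists only to make the function easy to try out.
show_tw_settings() {
    root="${1:-/proc/sys}"
    for key in tcp_tw_recycle tcp_tw_reuse; do
        file="$root/net/ipv4/$key"
        if [ -r "$file" ]; then
            echo "net.ipv4.$key = $(cat "$file")"
        else
            echo "net.ipv4.$key = (not present on this kernel)"
        fi
    done
}

# Usage:
#   show_tw_settings
# The values can be set at runtime as root with, for example:
#   sysctl -w net.ipv4.tcp_tw_reuse=1
```

Values set with `sysctl -w` are lost at reboot; putting the two lines above into /etc/sysctl.conf makes them persistent.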