[torqueusers] using non-privileged ports

Martin Siegert siegert at sfu.ca
Fri Oct 28 14:57:15 MDT 2011


Hi,

On Fri, Oct 28, 2011 at 10:39:56AM -0700, Martin Siegert wrote:
> Hi,
> 
> we just recompiled torque with
> 
> --disable-privports
> 
> (since we constantly ran out of ports). Now we have a different
> problem which is just as bad:
> 
> # qstat -an1
> Connection timed out
> qstat: cannot connect to server b0 (errno=110) Connection timed out
> 
> This does not appear right away after starting the server, but after
> a few hours of running. As far as I can tell the only way to get the
> server out of this state is to restart it.
> 
> But there must be many sites that run torque with --disable-privports.
> Thus: what am I missing?

We gave up: --disable-privports does not appear to work. We are now
back to our previous problem. The output below is from the server; on
the nodes there are no connections in the TIME_WAIT state:

# netstat -na | grep 15002
tcp        0      0 172.18.1.0:629              172.18.1.152:15002          TIME_WAIT
tcp        0      0 172.18.1.0:701              172.18.1.152:15002          TIME_WAIT
tcp        0      0 172.18.1.0:689              172.18.1.152:15002          TIME_WAIT
tcp        0      0 172.18.1.0:685              172.18.1.152:15002          TIME_WAIT
tcp        0      0 172.18.1.0:951              172.18.1.152:15002          TIME_WAIT
tcp        0      0 172.18.1.0:979              172.18.1.152:15002          TIME_WAIT
tcp        0      0 172.18.1.0:962              172.18.1.152:15002          TIME_WAIT
tcp        0      0 172.18.1.0:669              172.18.1.154:15002          TIME_WAIT
tcp        0      0 172.18.1.0:662              172.18.1.154:15002          TIME_WAIT
tcp        0      0 172.18.1.0:804              172.18.1.154:15002          TIME_WAIT
...
# netstat -na | grep 15002 | wc -l
974

For some reason the mom-server connections are not closed correctly, and we
end up with all these sockets in the TIME_WAIT state. Note that there are
even several for the same node. Consequently we run out of ports.
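
To see how the TIME_WAIT sockets are spread over the nodes, a small helper
along the following lines may be useful. This is only a sketch:
count_time_wait is a made-up name, and it assumes the Linux "netstat -na"
column layout shown above (foreign address in column 5, state in column 6)
and IPv4 addresses.

```shell
# Hypothetical helper (not part of torque): count TIME_WAIT sockets per
# remote node for one remote port. Reads "netstat -na" style output on
# stdin; the port argument defaults to 15002 (the mom port seen above).
count_time_wait() {
    port="${1:-15002}"
    awk -v port="$port" '
        $6 == "TIME_WAIT" {
            # Foreign address looks like 172.18.1.152:15002 (IPv4 assumed)
            n = split($5, parts, ":")
            if (parts[n] == port) count[parts[1]]++
        }
        END { for (host in count) print count[host], host }
    ' | sort -rn
}

# Usage: netstat -na | count_time_wait 15002
```

Each output line is a count followed by the node address, largest count
first, so nodes with several lingering connections show up at the top.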

Is this a torque problem?


Now we work around the problem by setting

net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

Cheers,
Martin

