[torqueusers] using non-privileged ports

Ken Nielson knielson at adaptivecomputing.com
Fri Oct 28 15:13:26 MDT 2011


----- Original Message -----
> From: "Martin Siegert" <siegert at sfu.ca>
> To: torqueusers at supercluster.org
> Sent: Friday, October 28, 2011 2:57:15 PM
> Subject: Re: [torqueusers] using non-privileged ports
> 
> Hi,
> 
> On Fri, Oct 28, 2011 at 10:39:56AM -0700, Martin Siegert wrote:
> > Hi,
> > 
> > we just recompiled torque with
> > 
> > --disable-privports
> > 
> > (since we constantly ran out of ports). Now we have a different
> > problem which is just as bad:
> > 
> > # qstat -an1
> > Connection timed out
> > qstat: cannot connect to server b0 (errno=110) Connection timed out
> > 
> > This does not appear right away after starting the server, but after
> > a few hours of running. As far as I can tell the only way to get the
> > server out of this state is to restart it.
> > 
> > But there must be many sites that run torque with --disable-privports.
> > Thus: what am I missing?
> 
> We gave up: --disable-privports does not appear to be working. Now we
> are back to our previous problem (this is on the server - there are
> no connections in the TIME_WAIT state on the nodes):
> 
> # netstat -na | grep 15002
> tcp        0      0 172.18.1.0:629              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:701              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:689              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:685              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:951              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:979              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:962              172.18.1.152:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:669              172.18.1.154:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:662              172.18.1.154:15002          TIME_WAIT
> tcp        0      0 172.18.1.0:804              172.18.1.154:15002          TIME_WAIT
> ...
> # netstat -na | grep 15002 | wc -l
> 974
> 
> For some reason the mom-server connections are not closed correctly and
> we end up with all these sockets in the TIME_WAIT state. Note that there
> are even several for the same node. Consequently we run out of ports.
> 
> Is this a torque problem?
> 
> 
> Now we work around the problem by setting
> 
> net.ipv4.tcp_tw_recycle = 1
> net.ipv4.tcp_tw_reuse = 1
> 
> Cheers,
> Martin

Martin,

Thanks for the information. I will see what is happening with this.
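
One way to see which nodes hold the most lingering sockets is to tally
the TIME_WAIT entries per remote address. This is only a minimal sketch,
assuming the Linux net-tools netstat column layout shown above (proto,
recv-q, send-q, local address, foreign address, state); the function name
count_time_wait is made up for illustration and is not part of torque:

```shell
#!/bin/sh
# Summarise TIME_WAIT sockets to a given remote port, per remote host.
# Reads `netstat -na`-style lines on stdin. Assumed column layout:
#   proto recv-q send-q local-address foreign-address state
# count_time_wait is a hypothetical helper name, not a torque tool.
count_time_wait() {
    port="${1:-15002}"   # remote port to match; defaults to 15002
    awk -v port="$port" '
        $1 ~ /^tcp/ && $6 == "TIME_WAIT" {
            # Split the foreign address on ":"; the last piece is the port.
            n = split($5, parts, ":")
            if (parts[n] == port) {
                host = substr($5, 1, length($5) - length(parts[n]) - 1)
                count[host]++
            }
        }
        END { for (h in count) print count[h], h }
    ' | sort -rn         # busiest remote hosts first
}

# Typical use on the server:
#   netstat -na | count_time_wait 15002
```

Each output line is a count followed by the remote address, so a node
that shows up with many entries is easy to spot.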

Ken

