[torqueusers] many server-client connections in TIME_WAIT
Jason Williams
jasonw at jhu.edu
Wed Jun 24 06:08:01 MDT 2009
Tom Rudwick wrote:
> We fixed that problem with these kernel tuning parameters in sysctl.conf:
>
> # release sockets faster because we use a lot of them
> net.ipv4.tcp_fin_timeout = 20
> # Reuse sockets as fast as possible
> net.ipv4.tcp_tw_reuse = 1
> net.ipv4.tcp_tw_recycle = 1
>
> I think especially the bottom two are most effective.
>
> Tom
>
>
> Arnau Bria wrote:
>
>> Hi all,
>>
>> our server has many connections from clients in TIME_WAIT status:
>> # netstat -puta|grep pbs|wc -l
>> 1071
>>
>> [...]
>> tcp 0 0 pbs02.pic.es:748 td062.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:736 td062.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:709 td062.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:638 td062.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:918 td062.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:1016 td062.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:773 td062.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:924 td060.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:809 td060.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:689 td058.pic.es:pbs_mom TIME_WAIT -
>> tcp 0 0 pbs02.pic.es:682 td058.pic.es:pbs_mom TIME_WAIT -
>> [...]
>>
>> If I restart pbs_server, all of them die, but after a few seconds I have
>> 1000 connections again.
>>
>> What could be preventing the sockets from being closed?
>>
>> # rpm -qa|grep torque
>> torque-client-2.3.0-snap.200801151629.2cri.slc4
>> torque-server-2.3.0-snap.200801151629.2cri.slc4
>> torque-2.3.0-snap.200801151629.2cri.slc4
>>
>>
>> TIA,
>> Arnau
>>
My advice to anyone looking to use these or any other kernel parameters
is to research them as heavily as possible and find out whether they will
help or hurt your environment before you use them. Changing these
parameters, or applying similar fixes, modifies the way your system
talks TCP over the network. I ran into suggestions like this when I was
fighting the TIME_WAIT fight and shied away from them because the
problem isn't with the kernel, it's with the software and its
implementation. Plus, I did not want to modify the way my head node
talks TCP, because the cluster isn't the only thing it talks to.
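
For what it's worth, one step that doesn't touch the kernel at all is to see which peers the TIME_WAIT sockets belong to. The sketch below is only an illustration, not anything that ships with Torque: count_time_wait is a made-up helper name, and it assumes netstat-style lines laid out like the ones quoted above (foreign address in the fifth column, state in the sixth).

```python
from collections import Counter


def count_time_wait(lines):
    """Count TIME_WAIT sockets per remote host in netstat-style output.

    Assumes the column layout quoted earlier in the thread:
    proto, recv-q, send-q, local address, foreign address, state, ...
    (count_time_wait is a hypothetical helper, not part of Torque.)
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip headers, blank lines, and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "TIME_WAIT":
            continue
        # The foreign address looks like "host:port"; keep only the host.
        host = fields[4].rsplit(":", 1)[0]
        counts[host] += 1
    return counts
```

Feeding it the saved output of a netstat run (one string per line) and printing counts.most_common() should show whether a handful of nodes account for most of the sockets, which is worth knowing before reaching for sysctl.conf.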

In summary, make sure you understand what you're doing with kernel
params before blindly applying suggestions that involve them. And
remember that you are modifying the behavior of the OS, not just
Torque. But that's just my humble $0.02.
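
If you do decide to experiment, it probably makes sense to record what the parameters are set to beforehand, so you know what you are changing and can put it back. Here is a minimal sketch, assuming the usual Linux convention that a sysctl named a.b.c is readable at /proc/sys/a/b/c. The names read_sysctl and snapshot are invented for this example, and the parameter list is just the three from Tom's message.

```python
import os

# The three parameters mentioned earlier in this thread.
PARAMS = (
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",
)


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a string, or None if absent.

    Relies on the Linux convention that "a.b.c" lives at <root>/a/b/c.
    (read_sysctl and the root argument are illustrative, not a standard API.)
    """
    path = os.path.join(root, *name.split("."))
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        # Not every kernel exposes every parameter; report it as missing.
        return None


def snapshot(names=PARAMS, root="/proc/sys"):
    """Collect the current values so they can be recorded before any change."""
    return {name: read_sysctl(name, root) for name in names}
```

Calling snapshot() on the head node and saving the result somewhere gives you a baseline to compare against. It only reads values; it does not change anything.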
--
Jason Williams
Systems Administrator
Johns Hopkins University
Physics and Astronomy Department