[torqueusers] many server-client connections in TIME_WAIT

Jason Williams jasonw at jhu.edu
Wed Jun 24 06:08:01 MDT 2009


Tom Rudwick wrote:
> We fixed that problem with these kernel tuning parameters in sysctl.conf:
>
> # release sockets faster because we use a lot of them
> net.ipv4.tcp_fin_timeout = 20
> # Reuse sockets as fast as possible
> net.ipv4.tcp_tw_reuse = 1
> net.ipv4.tcp_tw_recycle = 1
>
> I think especially the bottom two are most effective.
>
> Tom
>
>
> Arnau Bria wrote:
>   
>> Hi all,
>>
>> our server has many connections from clients in TIME_WAIT status:
>> # netstat -puta|grep pbs|wc -l
>> 1071
>>
>> [...]
>> tcp        0      0 pbs02.pic.es:748            td062.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:736            td062.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:709            td062.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:638            td062.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:918            td062.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:1016           td062.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:773            td062.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:924            td060.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:809            td060.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:689            td058.pic.es:pbs_mom        TIME_WAIT   -                   
>> tcp        0      0 pbs02.pic.es:682            td058.pic.es:pbs_mom        TIME_WAIT   -                   
>> [...]
>>
>> If I restart pbs_server, all of them die, but after a few seconds I have
>> 1000 connections again.
>>
>> What could be preventing the sockets from being closed?
>>
>> # rpm -qa|grep torque
>> torque-client-2.3.0-snap.200801151629.2cri.slc4
>> torque-server-2.3.0-snap.200801151629.2cri.slc4
>> torque-2.3.0-snap.200801151629.2cri.slc4
>>
>>
>> TIA,
>> Arnau
>>     

My advice to anyone looking to use these or any other kernel parameters 
is to research them as thoroughly as possible and find out whether they 
will help or hurt your environment before you use them.  Changing these 
kernel parameters, like any similar fix, changes the way your system 
talks TCP over the network.  I ran into suggestions like this when I was 
fighting the TIME_WAIT fight and shied away from them because the 
problem isn't with the kernel, it's with the software and its 
implementation.  Plus, I did not want to modify the way my head node 
talked TCP, because the cluster isn't the only thing it talks to.
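
As one way to do that homework before touching sysctl.conf, here is a 
minimal read-only sketch in Python.  It assumes a Linux host that exposes 
/proc/sys/net/ipv4 and /proc/net/tcp; the names read_params and 
count_time_wait are made up for illustration and are not part of Torque 
or any library.  It only reports the current values of the three 
parameters quoted above and counts TIME_WAIT sockets per remote port.  
It changes nothing.

```python
from collections import Counter
from pathlib import Path

# Hypothetical helper names; the /proc paths are the standard Linux locations.
PARAMS = ("tcp_fin_timeout", "tcp_tw_reuse", "tcp_tw_recycle")
TIME_WAIT = "06"  # hex state code for TIME_WAIT in /proc/net/tcp


def read_params(base="/proc/sys/net/ipv4"):
    """Return the current value of each parameter, or None if it is absent."""
    values = {}
    for name in PARAMS:
        path = Path(base) / name
        values[name] = path.read_text().strip() if path.exists() else None
    return values


def count_time_wait(proc_net_tcp_text):
    """Count TIME_WAIT sockets per remote port from /proc/net/tcp content."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields: sl, local_address, rem_address, st, ...
        if len(fields) < 4 or fields[3] != TIME_WAIT:
            continue
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)  # port is hex
        counts[remote_port] += 1
    return counts


if __name__ == "__main__":
    for name, value in read_params().items():
        print(f"net.ipv4.{name} = {value}")
    tcp_table = Path("/proc/net/tcp")
    if tcp_table.exists():
        counts = count_time_wait(tcp_table.read_text())
        for port, n in counts.most_common(5):
            print(f"remote port {port}: {n} sockets in TIME_WAIT")
```

A parameter that the running kernel does not provide shows up as None, 
which is itself worth knowing before copying a setting from a mailing 
list.  Running this before and after any change gives you a baseline to 
compare against instead of a guess.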

In summary, make sure you understand what you're doing with kernel 
params before blindly applying suggestions that involve them. And 
remember that you are modifying the behavior of the OS, not just 
Torque.  But that's just my humble $0.02.



-- 
Jason Williams
Systems Administrator
Johns Hopkins University
Physics and Astronomy Department



