[torqueusers] many server-client connections in TIME_WAIT

Jason Williams jasonw at jhu.edu
Tue Jun 23 06:17:50 MDT 2009


Arnau Bria wrote:
> Hi all,
>
> our server has many connections from clients in TIME_WAIT status:
> # netstat -puta|grep pbs|wc -l
> 1071
>
> [...]
> tcp        0      0 pbs02.pic.es:748            td062.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:736            td062.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:709            td062.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:638            td062.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:918            td062.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:1016           td062.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:773            td062.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:924            td060.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:809            td060.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:689            td058.pic.es:pbs_mom        TIME_WAIT   -                   
> tcp        0      0 pbs02.pic.es:682            td058.pic.es:pbs_mom        TIME_WAIT   -                   
> [...]
>
> If I restart pbs_server, all of them die, but after a few seconds I
> have 1000 connections again.
>
> What could be blocking the socket to be closed?
>
> # rpm -qa|grep torque
> torque-client-2.3.0-snap.200801151629.2cri.slc4
> torque-server-2.3.0-snap.200801151629.2cri.slc4
> torque-2.3.0-snap.200801151629.2cri.slc4
>
>
> TIA,
> Arnau
>   

Hey Arnau,

I've run into the same problem for a very long time.  It's a result of 
the way that the torque server and the torque mom clients communicate.  
I have a system with only 154 compute nodes, and when I ran the same 
command I also used to see a large number of TIME_WAIT connections.  
After a huge amount of digging, I found that raising the job_stat_rate 
on 'large' clusters is the recommended way to fix this.
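
By the way, a variation of your own netstat command will count just the 
TIME_WAIT sockets headed for the moms.  This is only a sketch to adapt 
for your site; it relies on pbs_mom resolving as a service name, the 
same way it does in your output:

# count TIME_WAIT connections from the server to the pbs_mom daemons
netstat -ta | grep pbs_mom | grep -c TIME_WAIT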

I have set my job_stat_rate in qmgr to 240; the default is 45.  Here's 
the link to the server parameters page:

http://www.clusterresources.com/products/torque/docs/a.bserverparameters.shtml

It states:

"specifies the maximum age of mom level job data which is allowed when 
servicing a qstat request.  If data is older than this value, the 
pbs_server daemon will contact mom's with stale data to request an 
update.  *NOTE*:  For large systems, this value should be increased to 5 
minutes or higher." 
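
In case it saves you some digging, setting it is a one-liner in qmgr.  
240 is just the value I landed on; pick whatever makes sense for your 
cluster:

# raise job_stat_rate from the default 45 seconds to 240
qmgr -c "set server job_stat_rate = 240"
# confirm it took
qmgr -c "print server" | grep job_stat_rate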

Adjusting this value on my 154 node cluster cut my TIME_WAIT connections 
by almost a third, and I have yet to see any real detrimental effects. 
It effectively 'throttles' the connections to the mom clients for job 
data, which apparently can happen a bit too quickly on 'large' clusters 
and pile up a lot of TIME_WAIT connections.  It took me a good day or 
two of digging and trying all sorts of things before I finally found 
out about that option.  Hope it helps.
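
If you want to convince yourself it's working, one way (again, just a 
sketch using the same count as above) is to watch the number for a few 
minutes before and after the change; the drop should be pretty obvious:

# sample the TIME_WAIT count toward the moms every 5 seconds
watch -n 5 'netstat -ta | grep pbs_mom | grep -c TIME_WAIT'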


-- 
Jason Williams
Systems Administrator
Johns Hopkins University
Physics and Astronomy Department


