[torqueusers] many server-client connections in TIME_WAIT
jasonw at jhu.edu
Tue Jun 23 06:17:50 MDT 2009
Arnau Bria wrote:
> Hi all,
> our server has many connections from clients in TIME_WAIT status:
> # netstat -puta|grep pbs|wc -l
> tcp 0 0 pbs02.pic.es:748 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:736 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:709 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:638 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:918 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:1016 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:773 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:924 td060.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:809 td060.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:689 td058.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:682 td058.pic.es:pbs_mom TIME_WAIT -
> If I restart pbs_server, all of them die, but after a few seconds I have
> 1000 connections again.
> What could be preventing the sockets from being closed?
> # rpm -qa|grep torque
I ran into the same problem for a very long time. It's a result of
the way the torque server and the torque mom clients communicate.
I have a system with only 154 compute nodes, and when I ran the same
command I used to get a large number of TIME_WAITs, just like you. After
a huge amount of digging I found that raising job_stat_rate is a
recommended way to help with this on 'large' clusters.
I have set job_stat_rate to 240 (seconds) in qmgr; the default is 45.
Here is how the server parameters page describes it:
"specifies the maximum age of mom level job data which is allowed when
servicing a qstat request. If data is older than this value, the
pbs_server daemon will contact mom's with stale data to request an
update. *NOTE*: For large systems, this value should be increased to 5
minutes or higher."
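
For reference, a minimal sketch of how a value like this can be set and
checked from the shell. It assumes qmgr is in the PATH and that it is run
on the pbs_server host by a user with Torque manager rights; 240 is just
the value used on this cluster, not a general recommendation.

```shell
# Assumption: run on the pbs_server host as a Torque manager.
if command -v qmgr >/dev/null 2>&1; then
    # Raise the maximum age of mom-level job data (in seconds).
    qmgr -c "set server job_stat_rate = 240"
    # Show the value the server now reports.
    qmgr -c "print server" | grep job_stat_rate
else
    echo "qmgr not found; run this on the pbs_server host" >&2
fi
```
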
Adjusting this value on my 154-node cluster cut my TIME_WAIT connections
by almost 1/3, and I have yet to see any real detrimental effects. It
effectively 'throttles' the server's requests to the mom clients for job
data, which apparently can happen a bit too quickly on 'large' clusters
and leave a lot of connections in TIME_WAIT. It took me a good day or two
of digging and trying all sorts of things before I finally found out
about that option. Hope it helps.
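
To compare before and after a change like this, something along these
lines may be handy. It is only a rough sketch: count_time_wait is a
made-up helper name, and it assumes netstat output in the column layout
shown in the quoted message (state in column 6, peer as host:port in
column 5), so it would need adjusting for IPv6 or numeric output.

```shell
# Count TIME_WAIT connections per remote host from netstat-style
# output on stdin. Hypothetical helper, not part of Torque.
count_time_wait() {
    awk '$6 == "TIME_WAIT" { split($5, peer, ":"); n[peer[1]]++ }
         END { for (h in n) print n[h], h }' | sort -k1,1nr -k2,2
}

# Usage on the server (netstat flags taken from the quoted message):
#   netstat -puta | grep pbs_mom | count_time_wait
```

Running it once before changing job_stat_rate and again some time
afterwards gives a per-node count to compare, busiest nodes first.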
Johns Hopkins University
Physics and Astronomy Department