[torqueusers] many server-client connections in TIME_WAIT
Jason Williams
jasonw at jhu.edu
Tue Jun 23 06:17:50 MDT 2009
Arnau Bria wrote:
> Hi all,
>
> our server has many connections from clients in TIME_WAIT status:
> # netstat -puta|grep pbs|wc -l
> 1071
>
> [...]
> tcp 0 0 pbs02.pic.es:748 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:736 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:709 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:638 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:918 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:1016 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:773 td062.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:924 td060.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:809 td060.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:689 td058.pic.es:pbs_mom TIME_WAIT -
> tcp 0 0 pbs02.pic.es:682 td058.pic.es:pbs_mom TIME_WAIT -
> [...]
>
> If I restart pbs_server, all of them die, but after a few seconds I have
> 1000 connections again.
>
> What could be blocking the socket to be closed?
>
> # rpm -qa|grep torque
> torque-client-2.3.0-snap.200801151629.2cri.slc4
> torque-server-2.3.0-snap.200801151629.2cri.slc4
> torque-2.3.0-snap.200801151629.2cri.slc4
>
>
> TIA,
> Arnau
>
Hey Arnau,
I ran into the same problem for a very long time. It's a result of
the way that the torque server and the torque mom clients communicate.
I have a system with only 154 compute nodes, and when I ran the same
command I used to get a large number of TIME_WAITs like you. After a
huge amount of digging I found that increasing job_stat_rate on
'large' clusters is a recommended way to help fix this.
I have set my job_stat_rate in qmgr to 240. The default is 45. Here's
the link to the server parameters page:
http://www.clusterresources.com/products/torque/docs/a.bserverparameters.shtml
It states:
"specifies the maximum age of mom level job data which is allowed when
servicing a qstat request. If data is older than this value, the
pbs_server daemon will contact mom's with stale data to request an
update. *NOTE*: For large systems, this value should be increased to 5
minutes or higher."
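For reference, the change can be made from the command line roughly as
follows. This is only a sketch: it assumes qmgr is on the PATH of the
pbs_server host and that you run it as a TORQUE manager (usually root).
The value 240 is simply what I use; pick whatever suits your cluster.

```shell
# Sketch: raise job_stat_rate on the pbs_server host.
# 240 is the value used in this post; the docs quoted above suggest
# 5 minutes or higher for large systems.
if command -v qmgr >/dev/null 2>&1; then
    qmgr -c "set server job_stat_rate = 240"
    # Show the server settings and confirm the new value is listed.
    qmgr -c "print server" | grep job_stat_rate
else
    echo "qmgr not found; run this on the pbs_server host" >&2
fi
```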
Adjusting this value on my 154 node cluster cut my TIME_WAIT connections
by almost 1/3. And I have yet to see any real detrimental effects. It
effectively 'throttles' the connections to the mom clients for job data
which, apparently on 'large' clusters, can happen a bit too quickly and
cause a lot of TIME_WAIT connections. It took me a good day or two of
digging and trying all sorts of things before I finally found
out about that option. Hope it helps.
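
If you want to check whether the change helps, a rough way is to count
TIME_WAIT connections per compute node before and after. The function
below is only a sketch: the name count_time_wait_by_peer is made up for
illustration, and it assumes netstat prints the same columns as in your
output (foreign address in column 5, state in column 6) with host:port
style addresses, so it will not handle IPv6 literals.

```shell
# Count TIME_WAIT connections per remote host from netstat-style
# output on stdin. Assumed column layout (as in the original post):
#   proto recv-q send-q local-address foreign-address state pid/program
count_time_wait_by_peer() {
    awk '$6 == "TIME_WAIT" {
             split($5, peer, ":")   # drop the :port or :service suffix
             count[peer[1]]++
         }
         END {
             for (host in count) print count[host], host
         }' | sort -k1,1nr -k2,2    # busiest hosts first, ties by name
}

# Usage on a live server (same netstat flags as in the original post):
#   netstat -puta | count_time_wait_by_peer
```

Running it once before changing job_stat_rate and again some minutes
after gives a per-node comparison instead of just a single total.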
--
Jason Williams
Systems Administrator
Johns Hopkins University
Physics and Astronomy Department