[torqueusers] Hanging TIME_WAIT

Josh Butikofer josh at clusterresources.com
Tue Feb 17 15:28:14 MST 2009


Tim,

I'm going to try to address both your post and Jason's at the same time. I 
will use Jason's text as the basis for my responses:

Jason Williams wrote:
 > Hello All,
 > I've spent some time googling around for an answer to this, and not
 > really found one.  I have, however, found several people complaining
 > of the same issue.   The problem I am having is that my pbs_server
 > machine seems to be running out of available reserved ports (ports <
 > 1024).  I've actually traced the issue to what looks like outgoing
 > communications to all my pbs_mom instances on my compute nodes.  It
 > seems that the server is using a reserved port on the local side of the
 > connection, and then, for some reason, the connection drops into
 > TIME_WAIT and sits there when I examine netstat.  The cluster has about
 > 120 nodes on it, so the reserved ports can fill up quite fast causing
 > all automounted NFS mounts to basically die.
 >
 > I've searched this list's archives with the search function on the
 > mailing list page and didn't really come up with anything.  So I am
 > wondering if anyone else has seen this and has a possible solution?  Any
 > suggestions are welcome as it's causing my users some significant
 > amounts of grief.
 >
 > I'm also kind of curious to know if anyone happens to know why what
 > looks like an outgoing connection is using a reserved port on the local
 > side.  That strikes me as a bit odd, but I'm sure there's a good reason
 > for it.

Believe it or not, this usage of privileged ports is a feature of TORQUE and is 
currently the way that TORQUE ensures that it can trust communication from 
client commands and pbs_mom daemons. The theory behind this is that only a 
process with superuser privileges can bind to a privileged port when creating an 
outgoing TCP connection. The remote process (in this case pbs_server) accepts 
the socket and examines the origin port to see if it is a port < 1024. This lets 
the remote process (pbs_server) know that the connecting process is running as 
root and can be trusted. This explanation is an oversimplification of all the 
steps involved, but I think it makes the point.
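
To illustrate the idea (a rough sketch only, not TORQUE's actual code; the 
function names below are made up), the client binds its outgoing socket to a 
port below 1024 before connecting, and the server checks the peer's source 
port after accept():

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Client side: bind the outgoing socket to a reserved port (< 1024)
     * before connecting.  Only a root-owned process can do this. */
    int connect_from_reserved_port(const char *server_ip, unsigned short server_port)
    {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0)
            return -1;

        struct sockaddr_in local;
        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);

        int bound = 0;
        for (unsigned short p = IPPORT_RESERVED - 1; p >= 512; p--) {
            local.sin_port = htons(p);
            if (bind(sock, (struct sockaddr *)&local, sizeof(local)) == 0) {
                bound = 1;
                break;
            }
        }
        if (!bound) {
            /* Reserved range exhausted: this is exactly the failure mode
             * being reported when TIME_WAIT sockets pile up. */
            close(sock);
            return -1;
        }

        struct sockaddr_in remote;
        memset(&remote, 0, sizeof(remote));
        remote.sin_family = AF_INET;
        remote.sin_port = htons(server_port);
        inet_pton(AF_INET, server_ip, &remote.sin_addr);

        if (connect(sock, (struct sockaddr *)&remote, sizeof(remote)) < 0) {
            close(sock);
            return -1;
        }
        return sock;
    }

    /* Server side: after accept(), a peer source port below 1024 implies
     * the connecting process was running as root. */
    int peer_is_privileged(int connfd)
    {
        struct sockaddr_in peer;
        socklen_t len = sizeof(peer);
        if (getpeername(connfd, (struct sockaddr *)&peer, &len) < 0)
            return 0;
        return ntohs(peer.sin_port) < IPPORT_RESERVED;
    }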

I, too, have noticed a lot of customers recently complaining about privileged 
ports getting used up fast. I'm not sure if this is due to a regression in 
TORQUE or if it is simply the result of more jobs being present in clusters.

I would like to ask the community, especially long-time users of TORQUE that 
have upgraded to TORQUE 2.3.x, if they have noticed any problems related to 
privileged port usage. For those who use TORQUE 2.1.x, do you see privileged 
ports sitting in a TIME_WAIT state after they are used?

There is a way to disable the usage of privileged ports, but doing so has big 
security implications. You can disable privileged ports using the 
"--disable-privports" configure option. If this is done, however, it is possible 
for a competent malicious user to hijack pbs_iff and submit jobs as other users, 
cancel other users' jobs, etc. In other words, they can "lie" to pbs_server 
about their UID. Disabling privileged ports works in some environments where 
this security risk is not a concern, but most sites shy away from this option.

Another potential code change we could make in TORQUE (perhaps made 
configurable) is to have the clients set the SO_REUSEADDR socket option, so that 
local ports still sitting in TIME_WAIT could be reused after a connection is 
closed. I haven't tested this, though, so I may be wrong.
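
As a rough, untested sketch (not a real TORQUE patch), the change would amount 
to something like the following on the client socket before it is bound. Note 
that SO_REUSEADDR mostly relaxes the bind-time check, letting a local port that 
is still in TIME_WAIT be bound again, rather than removing the TIME_WAIT state 
itself:

    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Set SO_REUSEADDR before bind() so a reserved port left in
     * TIME_WAIT can be bound again instead of being skipped over. */
    int enable_port_reuse(int sock)
    {
        int on = 1;
        return setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    }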

Josh Butikofer
Cluster Resources, Inc.
#############################


Tim Freeman wrote:
> With Torque 2.3.6, we are seeing many connections settle in to the TIME_WAIT
> state and clog up the cluster because of privileged port socket exhaustion.
> 
> Jason Williams reported what looks to be the exact thing last month:
> 
> http://supercluster.org/pipermail/torqueusers/2009-January/008548.html
> 
> We're seeing the same netstat output too; the foreign socket is printed as the
> local address.
> 
> Is there anything we can do?  Does this imply some misconfiguration?
> 
> Thank you,
> Tim
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

