[torqueusers] Hanging TIME_WAIT
josh at clusterresources.com
Tue Feb 17 15:28:14 MST 2009
I'm going to try to address both your and Jason's posts at the same time. I
will use Jason's text as the basis for my responses:
Jason Williams wrote:
> Hello All,
> I've spent some time googling around for an answer to this, and not
> really found one. I have, however, found of several people complaining
> of the same issue. The problem I am having is that my pbs_server
> machine seems to be running out of available reserved ports (ports <
> 1024). I've actually traced the issue to what looks like outgoing
> communications to all my pbs_mom instances on my compute nodes. It
> seems that the server is using a reserved port on the local side of the
> connection, and then, for some reason, the connection drops into
> TIME_WAIT and sits there when I examine netstat. The cluster has about
> 120 nodes on it, so the reserved ports can fill up quite fast causing
> all automounted NFS mounts to basically die.
> I've searched this list's archives with the search function on the
> mailing list page and didn't really come up with anything. So I am
> wondering if anyone else has seen this and has a possible solution? Any
> suggestions are welcome as it's causing my users some significant
> amounts of grief.
> I'm also kind of curious to know if any one happens to know why what
> looks like an out going connection is using a reserved port on the local
> side. That strikes me as a bit odd, but I'm sure there's a good reason
> for it.
Believe it or not, this usage of privileged ports is a feature of TORQUE and is
currently the way that TORQUE ensures it can trust communication from
client commands and pbs_mom daemons. The theory behind this is that only a process
with superuser privileges can bind to a privileged port when creating an
outgoing TCP connection. The remote process (in this case pbs_server) accepts
the socket and examines the origin port to see if it is a port < 1024. This lets
the remote process (pbs_server) know that the connecting process is running as
root and can be trusted. This explanation is an oversimplification of all the
steps going on, but I think it makes the point.
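To make the trust check concrete, here is a minimal Python sketch of the idea (TORQUE's actual code is C and does more than this; the function name and demo setup below are illustrative, not TORQUE's):

```python
import socket

IPPORT_RESERVED = 1024  # on Unix, binding a port below this requires root

def peer_is_privileged(conn):
    """pbs_server-style check: trust the peer only if its source port is
    privileged, since only root could have bound that port on its side."""
    _host, port = conn.getpeername()[:2]
    return port < IPPORT_RESERVED

# Demonstration: an unprivileged client gets a kernel-assigned ephemeral
# source port (>= 1024), so it fails the check.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
conn, _addr = server.accept()

trusted = peer_is_privileged(conn)
print(trusted)  # False, unless the client explicitly bound a reserved port as root

client.close(); conn.close(); server.close()
```

A real privileged client binds a port below 1024 before calling connect(), which is exactly why each pbs_server-to-pbs_mom exchange consumes one of those scarce ports.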
I, too, have noticed a lot of customers recently complaining about privileged
ports getting used up fast. I'm not sure if this is due to a regression in
TORQUE or if it is simply the result of more jobs being present in clusters.
I would like to ask the community, especially long-time users of TORQUE that
have upgraded to TORQUE 2.3.x, if they have noticed any problems related to
privilege port usage. For those who use TORQUE 2.1.x, do you see privileged
ports sitting in a TIME_WAIT state after they are used?
There is a way to disable the usage of privileged ports, but doing so has big
security implications. You can disable privileged ports using the
"--disable-privports" configure option. If this is done, however, it is possible
for a competent malicious user to hijack pbs_iff and submit jobs as other users,
cancel other users' jobs, etc. In other words, they can "lie" to pbs_server
about their UID. Disabling privileged ports works in some environments where they
aren't concerned about this possible security risk--but most sites shy away from it.
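For reference, the option is applied at build time when configuring TORQUE, along the usual lines of:

```shell
# Rebuild TORQUE without the privileged-port trust check.
# This weakens client authentication as described above -- use with care.
./configure --disable-privports
make && make install
```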
Another potential code change we could make in TORQUE (perhaps make it
configurable) is to have the clients set the SO_REUSEADDR option so that a
source port still sitting in TIME_WAIT can be bound again after a connection is
closed. I haven't tested this, though, so I may be wrong about whether it helps
here.
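As a rough sketch of that idea (untested against TORQUE itself, and assuming an ordinary BSD-sockets client), SO_REUSEADDR would be set before binding the reserved source port:

```python
import socket

# Hypothetical client-side setup (a sketch, not TORQUE's actual code):
# setting SO_REUSEADDR before bind tells the kernel to allow rebinding
# a local port that is still sitting in TIME_WAIT.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

reuse_enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
print(reuse_enabled)  # True

# A client running as root would then bind a reserved port before connect(),
# e.g. s.bind(("0.0.0.0", 1001)); the same bind as an unprivileged user
# raises PermissionError.
s.close()
```

Note that SO_REUSEADDR does not make the TIME_WAIT state go away; it only lets a new socket bind a port that TIME_WAIT is still holding, which is what matters when the reserved range is being exhausted.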
Cluster Resources, Inc.
Tim Freeman wrote:
> With Torque 2.3.6, we are seeing many connections settle into the TIME_WAIT
> state and clog up the cluster because of privileged port socket exhaustion.
> Jason Williams reported what looks to be the exact thing last month:
> We're seeing the same netstat output too, the foreign socket is printed as the
> local address.
> Is there anything we can do? Does this imply some misconfiguration?