[torqueusers] Hanging TIME_WAIT

Jason Williams jasonw at jhu.edu
Tue Feb 17 19:58:51 MST 2009


Josh Butikofer wrote:
> Tim,
>
> I'm going to try to address both your post and Jason's at the same
> time. I will use Jason's text as a basis for my responses:
>
> Jason Williams wrote:
> > Hello All,
> > I've spent some time googling around for an answer to this, and not
> > really found one.  I have, however, found several people complaining
> > of the same issue.  The problem I am having is that my pbs_server
> > machine seems to be running out of available reserved ports (ports <
> > 1024).  I've actually traced the issue to what looks like outgoing
> > communications to all my pbs_mom instances on my compute nodes.  It
> > seems that the server is using a reserved port on the local side of the
> > connection, and then, for some reason, the connection drops into
> > TIME_WAIT and sits there when I examine netstat.  The cluster has about
> > 120 nodes on it, so the reserved ports can fill up quite fast,
> > causing all automounted NFS mounts to basically die.
> >
> > I've searched this list's archives with the search function on the
> > mailing list page and didn't really come up with anything.  So I am
> > wondering if anyone else has seen this and has a possible solution?
> > Any suggestions are welcome, as it's causing my users a significant
> > amount of grief.
> >
> > I'm also kind of curious to know if anyone happens to know why what
> > looks like an outgoing connection is using a reserved port on the
> > local side.  That strikes me as a bit odd, but I'm sure there's a
> > good reason for it.
>
> Believe it or not, this usage of privileged ports is a feature of
> TORQUE and is currently the way that TORQUE ensures that it can trust
> communication from client commands and pbs_mom daemons. The theory
> behind this is that only a process with superuser privileges can bind
> to a privileged port when creating an outgoing TCP connection. The
> remote process (in this case pbs_server) accepts the socket and
> examines the origin port to see if it is a port < 1024. This lets the
> remote process (pbs_server) know that the connecting process is
> running as root and can be trusted. This explanation is an
> oversimplification of all the steps going on, but I think it makes
> the point.
>
> I, too, have noticed a lot of customers recently complaining about 
> privileged ports getting used up fast. I'm not sure if this is due to 
> a regression in TORQUE or if it is simply the result of more jobs 
> being present in clusters.
>
> I would like to ask the community, especially long-time users of
> TORQUE who have upgraded to TORQUE 2.3.x, whether they have noticed
> any problems related to privileged port usage. For those who use
> TORQUE 2.1.x, do you see privileged ports sitting in a TIME_WAIT
> state after they are used?
>
> There is a way to disable the usage of privileged ports, but doing so
> has big security implications. You can disable privileged ports using
> the "--disable-privports" configure option. If this is done, however,
> it is possible for a competent malicious user to hijack pbs_iff and
> submit jobs as other users, cancel other users' jobs, etc. In other
> words, they can "lie" to pbs_server about their UID. Disabling
> privileged ports works in some environments where this security risk
> is not a concern, but most sites shy away from this option.
>
> Another potential code change we could make in TORQUE (perhaps making
> it configurable) is to have the clients set the SO_REUSEADDR option
> to avoid the TIME_WAIT after a connection is closed. I haven't tested
> this, though, so I may be wrong.
>
> Josh Butikofer
> Cluster Resources, Inc.
> #############################
>
>
> Tim Freeman wrote:
>> With Torque 2.3.6, we are seeing many connections settle into the
>> TIME_WAIT state and clog up the cluster because of privileged port
>> socket exhaustion.
>>
>> Jason Williams reported what looks to be the exact thing last month:
>>
>> http://supercluster.org/pipermail/torqueusers/2009-January/008548.html
>>
>> We're seeing the same netstat output too: the foreign socket is
>> printed as the local address.
>>
>> Is there anything we can do?  Does this imply some misconfiguration?
>>
>> Thank you,
>> Tim

Josh,
Thank you very much for the very detailed response.  I can tell you
that the errors we were seeing were actually cropping up when we had
>500 jobs running on the system.  I am not sure if this matters
either, but that host is also an NFS server and an xCAT server.  Both
of those applications, from my understanding, also use reserved ports
heavily, but TORQUE was the only one I could see that wasn't releasing
them fast enough.  I was hoping there was some sort of a simple yet
secure-ish solution, but it looks like I might have to do some code
diving after all.
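
Before I dive in, let me make sure I actually understand the mechanism
you described.  My mental model is roughly the sketch below; this is
just my reading of your explanation, not the actual TORQUE code, and
the function names are mine:

/* Rough sketch of the privileged-port handshake as I understand it.
 * Client side: a root-owned process asks the kernel for a local port
 * below 1024 before connecting out (bindresvport() walks the reserved
 * range for you). */
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int connect_from_reserved_port(const char *server_ip,
                               unsigned short server_port)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;

    /* Only root can do this; the kernel picks a free port below 1024. */
    if (bindresvport(sock, &local) != 0) {
        close(sock);
        return -1;
    }

    struct sockaddr_in remote;
    memset(&remote, 0, sizeof(remote));
    remote.sin_family = AF_INET;
    remote.sin_port   = htons(server_port);
    inet_pton(AF_INET, server_ip, &remote.sin_addr);

    if (connect(sock, (struct sockaddr *)&remote, sizeof(remote)) != 0) {
        close(sock);
        return -1;
    }
    return sock;
}

/* Server side: after accept(), trust the peer only if its source port
 * is privileged, since only a root-owned process could have bound it. */
int peer_is_privileged(int conn)
{
    struct sockaddr_in peer;
    socklen_t len = sizeof(peer);

    if (getpeername(conn, (struct sockaddr *)&peer, &len) != 0)
        return 0;
    return ntohs(peer.sin_port) < 1024;
}

If that is about right, I can see why every client and MOM connection
chews through one of fewer than a thousand usable ports, and why a
pile of sockets stuck in TIME_WAIT hurts so badly.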
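
On the SO_REUSEADDR idea, I assume the change you have in mind is
basically setting the option on the socket before the reserved-port
bind, so a local port still sitting in TIME_WAIT can be picked up
again.  This is just my untested guess at what it would look like:

/* Sketch of the SO_REUSEADDR idea: call this right after socket() and
 * before the reserved-port bind, so a local port lingering in
 * TIME_WAIT can be reused.  Untested guess, not TORQUE code. */
#include <sys/socket.h>

static int allow_local_port_reuse(int sock)
{
    int on = 1;
    return setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
}

Whether the kernel actually lets a new outgoing connect() pick up a
TIME_WAIT port that way is something I would want to test, as you say.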

I am curious about one thing, though.  Have you guys ever considered
SSL-based communication or some other sort of internal authentication
of client-to-server communications?  I haven't looked at the code
around the functionality you mention yet, and it will probably be the
weekend before I get around to it, but perhaps something a bit
different might be a good idea, especially on medium to large-ish
shared clusters like the one I am running.  Maybe a sort of
'certificate' or 'passcode' based client connection verification,
roughly along the lines of the sketch below, would help out.  It would
definitely save on reserved ports. :-)
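
To make that a little more concrete, what I have in mind is something
in the spirit of the following: both sides share a secret, the client
sends an HMAC over its claimed UID plus a server-issued nonce, and
pbs_server checks that instead of looking at the source port.  This is
purely hypothetical; nothing like it exists in TORQUE today, the names
are made up, and it leans on OpenSSL just for illustration:

/* Hypothetical 'passcode' check: instead of trusting a privileged
 * source port, the server verifies an HMAC computed over the client's
 * claimed identity and a server-issued nonce with a shared secret.
 * Pure sketch; none of these names exist in TORQUE. */
#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

/* Client side: tag = HMAC-SHA256(secret, "uid:nonce"). */
static unsigned int make_tag(const unsigned char *secret, int secret_len,
                             const char *uid, const char *nonce,
                             unsigned char *tag /* EVP_MAX_MD_SIZE bytes */)
{
    unsigned char msg[256];
    unsigned int tag_len = 0;

    snprintf((char *)msg, sizeof(msg), "%s:%s", uid, nonce);
    HMAC(EVP_sha256(), secret, secret_len,
         msg, strlen((char *)msg), tag, &tag_len);
    return tag_len;
}

/* Server side: recompute the tag with the same secret and compare.
 * (A real implementation should use a constant-time comparison.) */
static int tag_is_valid(const unsigned char *secret, int secret_len,
                        const char *uid, const char *nonce,
                        const unsigned char *client_tag,
                        unsigned int client_tag_len)
{
    unsigned char expected[EVP_MAX_MD_SIZE];
    unsigned int expected_len =
        make_tag(secret, secret_len, uid, nonce, expected);

    return client_tag_len == expected_len &&
           memcmp(expected, client_tag, expected_len) == 0;
}

The connection could still run over SSL on top of that if we wanted
the traffic encrypted as well, and nothing about it needs a reserved
port on the client side.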

Just a thought.  Any comments before I go diving into the code and
wind up getting my boss to sign off on me taking the time to write
such an animal (if possible) into the TORQUE code for our
implementation?

--
Jason
