[torqueusers] Hanging TIME_WAIT

Josh Butikofer josh at clusterresources.com
Tue Feb 17 22:13:24 MST 2009

> Josh,
> Thank you very much for the very detailed response.  I can tell you that
> the errors we were seeing were actually cropping up when we had >500
> jobs running on the system. I am not sure if this matters either, but
> that host is also an NFS server as well as an xCat server.  Both of
> those applications, from my understanding, also use reserved ports
> heavily.

Without seeing the network statistics and port usage at your site, it is hard to say whether 500 jobs would explain running out of privileged ports so quickly. If, however, NFS and xCat are also actively using privileged ports, then this is probably not a TORQUE regression. It is common for clusters with large numbers of jobs to report conflicts with NFS due to overuse of these ports.
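
As a rough way to gather that kind of port-usage data on a Linux host, here is a minimal sketch that counts sockets sitting in TIME_WAIT on a privileged local port (below 1024). It assumes the /proc/net/tcp text format, where the state column holds "06" for TIME_WAIT. The function name and the sample rows are made up for illustration and are not part of TORQUE.

```python
TIME_WAIT = "06"          # hex state code for TIME_WAIT in /proc/net/tcp (Linux)
PRIVILEGED_LIMIT = 1024   # ports below this require root to bind


def count_privileged_time_wait(proc_net_tcp_text):
    """Count TIME_WAIT sockets whose local port is privileged.

    Takes the text of /proc/net/tcp (hypothetical helper, Linux only).
    """
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed rows
        # fields[1] is "HEXADDR:HEXPORT" for the local end of the socket
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if fields[3] == TIME_WAIT and local_port < PRIVILEGED_LIMIT:
            count += 1
    return count


# Made-up sample: one privileged TIME_WAIT socket, one unprivileged
# TIME_WAIT socket, and one listening socket.
SAMPLE = (
    "  sl  local_address rem_address   st tx_queue rx_queue\n"
    "   0: 0100007F:03FF 0200007F:3A99 06 00000000:00000000\n"
    "   1: 0100007F:C350 0200007F:3A99 06 00000000:00000000\n"
    "   2: 00000000:0016 00000000:0000 0A 00000000:00000000\n"
)

print(count_privileged_time_wait(SAMPLE))
```

On a real host the same function could be fed the contents of /proc/net/tcp, sampled while a few hundred jobs are running, to see how many privileged ports are tied up at once.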

> But torque was the only one I could see that wasn't releasing
> them fast enough.  I was hoping there was some sort of a simple yet
> secure-ish solution, but it looks like I might have to do some code
> diving after all.

Well, there may be a simple way to get better mileage out of the current security model until we can implement something more scalable. As I mentioned previously, we can eliminate the TIME_WAIT state by setting options on the sockets that cause them to close immediately.
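
To illustrate the general technique (not the actual TORQUE change, which would be made in its C code), here is a minimal sketch. It assumes the socket option in question is SO_LINGER with linger enabled and a timeout of zero, which makes close() reset the connection instead of going through the normal shutdown, so the closing side does not enter TIME_WAIT. The helper names are hypothetical, and the two-int struct layout is assumed to match Linux and other POSIX systems.

```python
import socket
import struct

# struct linger { int l_onoff; int l_linger; } on Linux/POSIX (assumed layout)
LINGER_FORMAT = "ii"


def make_abortive_close_socket():
    """Create a TCP socket that is reset, rather than lingering, on close()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # l_onoff=1 enables linger; l_linger=0 means "do not wait at all":
    # close() discards unsent data and sends RST, so no TIME_WAIT entry.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack(LINGER_FORMAT, 1, 0))
    return sock


def linger_settings(sock):
    """Return (l_onoff, l_linger) as currently set on the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize(LINGER_FORMAT))
    return struct.unpack(LINGER_FORMAT, raw)


sock = make_abortive_close_socket()
onoff, seconds = linger_settings(sock)
print("linger enabled:", onoff != 0, "timeout:", seconds)
sock.close()
```

The trade-off is that an abortive close throws away any data still queued for sending and the peer sees a connection reset, so it only suits connections whose exchange is already complete.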

> I am curious about one thing though.  Have you guys ever considered a
> sort of ssl communication or some sort of internal authentication of
> client-to-server communications?  I haven't looked at the code around
> the functionality you mention yet, and it will probably be the weekend
> before I get around to it, but perhaps something a bit different might
> be a good idea.  Especially on medium to large-ish, shared clusters like
> the one I am running.  Maybe a sort of 'certificate' or 'passcode' based
> client connection verification would help out.  It would definitely save
> on reserved ports. :-)

Yes--we have thought about this and it has been one of the "wishlist" items for a while now. It is on our development roadmap, so eventually we plan on improving the system.

> Just a thought.  Any comments before I go diving into the code and
> wind up getting my boss to sign off on me taking the time to write such an
> animal (if possible) into the torque code for our implementation?

Well, if you are able to donate the time and code, it would definitely help out a lot of TORQUE users. We can give you assistance in many ways, including showing you where in the code communication is handled, advice on what we think would work best, performing tests in our labs, etc.

