[torqueusers] Hanging TIME_WAIT

Tim Freeman tfreeman at mcs.anl.gov
Tue Feb 17 16:41:12 MST 2009


On Tue, 17 Feb 2009 15:28:14 -0700
Josh Butikofer <josh at clusterresources.com> wrote:

> Tim,
> 
> I'm going to try to address both your post and Jason's at the same time. I 
> will use Jason's text as a basis for my responses:
> 
> Jason Williams wrote:
>  > Hello All,
>  > I've spent some time googling around for an answer to this, and not
>  > really found one.  I have, however, found several people complaining
>  > of the same issue.   The problem I am having is that my pbs_server
>  > machine seems to be running out of available reserved ports (ports <
>  > 1024).  I've actually traced the issue to what looks like outgoing
>  > communications to all my pbs_mom instances on my compute nodes.  It
>  > seems that the server is using a reserved port on the local side of the
>  > connection, and then, for some reason, the connection drops into
>  > TIME_WAIT and sits there when I examine netstat.  The cluster has about
>  > 120 nodes on it, so the reserved ports can fill up quite fast causing
>  > all automounted NFS mounts to basically die.
>  >
>  > I've searched this list's archives with the search function on the
>  > mailing list page and didn't really come up with anything.  So I am
>  > wondering if anyone else has seen this and has a possible solution?  Any
>  > suggestions are welcome as it's causing my users some significant
>  > amounts of grief.
>  >
>  > I'm also kind of curious to know if anyone happens to know why what
>  > looks like an outgoing connection is using a reserved port on the local
>  > side.  That strikes me as a bit odd, but I'm sure there's a good reason
>  > for it.
> 
> Believe it or not, this usage of privileged ports is a feature of TORQUE and
> is currently the way that TORQUE ensures that it can trust communication from 
> client commands and pbs_mom daemons. The theory behind this is that only a process 
> with super user privileges can attach to a privileged port when creating an 
> outgoing TCP connection. The remote process (in this case pbs_server) accepts 
> the socket and examines the origin port to see if it is a port < 1024. This lets 
> the remote process (pbs_server) know that the connecting process is running
> as root and can be trusted. This explanation is an oversimplification of all
> the steps going on, but I think it makes the point.
> 
> I, too, have noticed a lot of customers recently complaining about privileged 
> ports getting used up fast. I'm not sure if this is due to a regression in 
> TORQUE or if it is simply the result of more jobs being present in clusters.
> 
> I would like to ask the community, especially long-time users of TORQUE that 
> have upgraded to TORQUE 2.3.x, if they have noticed any problems related to 
> privileged port usage. For those who use TORQUE 2.1.x, do you see privileged 
> ports sitting in a TIME_WAIT state after they are used?
> 
> There is a way to disable the usage of privileged ports, but doing so has big 
> security implications. You can disable privileged ports using the 
> "--disable-privports" configure option. If this is done, however, it is
> possible for a competent malicious user to hijack pbs_iff and submit jobs as
> other users, cancel other users' jobs, etc. In other words, they can "lie" to
> pbs_server about their UID. Disabling privileged ports works in some
> environments as they aren't concerned about this possible security risk--but
> most sites shy away from this option.
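
[Editorial sketch] The trust check Josh describes above can be illustrated
roughly as follows. This is not TORQUE's code: the function names, the
1023-down-to-512 search order, and the error handling are all assumptions made
for illustration. The client half needs root to bind below 1024, so only the
server-side predicate is exercised here.

```python
import socket

PRIVILEGED_PORT_LIMIT = 1024  # on Unix, binding below this normally requires root


def peer_uses_privileged_port(peer_addr):
    """Server side: True if the peer's source port is a reserved port (< 1024).

    peer_addr is the (host, port) tuple returned by socket.accept().
    """
    port = peer_addr[1]
    return 0 < port < PRIVILEGED_PORT_LIMIT


def connect_from_privileged_port(host, port, first=1023, last=512):
    """Client side (hypothetical helper): bind locally to a reserved port, then connect.

    Walks downward from `first` to `last`, skipping ports that are busy.
    Raises PermissionError when not run with sufficient privileges.
    """
    for local_port in range(first, last - 1, -1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("", local_port))
        except PermissionError:
            sock.close()
            raise  # not root: no reserved port will work
        except OSError:
            sock.close()
            continue  # port unavailable (e.g. still held in TIME_WAIT); try the next
        try:
            sock.connect((host, port))
        except OSError:
            sock.close()
            raise
        return sock
    raise OSError("no free privileged port in the requested range")
```

Each outgoing connection made this way occupies one of the roughly one thousand
reserved ports until its TIME_WAIT period expires, which is why a busy server
talking to many nodes can exhaust them.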

Josh, thank you for responding and thank you for the suggestion.

This is on a private VM-based cluster with its own LAN, so the security issue
doesn't really apply.

I went ahead and tried --disable-privports but got the error below. The build
works fine when configure is run without it; I also tried make clean, etc.


if /bin/sh ../../../libtool --tag=CC --mode=compile gcc -m32 -DHAVE_CONFIG_H -I. -I. -I../../../src/include  -I../../../src/include -I../Libdis -DIFF_PATH=\"/opt/torque-2.3.6/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/opt/torque-2.3.6/PBS_spool//server_name\"   -g -O2 -D_LARGEFILE64_SOURCE -W -Wall -Wno-unused-parameter -Wno-long-long -pedantic -Werror -MT net_client.lo -MD -MP -MF ".deps/net_client.Tpo" -c -o net_client.lo `test -f '../Libnet/net_client.c' || echo './'`../Libnet/net_client.c; \

then mv -f ".deps/net_client.Tpo" ".deps/net_client.Plo"; else rm -f ".deps/net_client.Tpo"; exit 1; fi
 gcc -m32 -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -I../Libdis -DIFF_PATH=\"/opt/torque-2.3.6/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/opt/torque-2.3.6/PBS_spool//server_name\" -g -O2 -D_LARGEFILE64_SOURCE -W -Wall -Wno-unused-parameter -Wno-long-long -pedantic -Werror -MT net_client.lo -MD -MP -MF .deps/net_client.Tpo -c ../Libnet/net_client.c  -fPIC -DPIC -o .libs/net_client.o

../Libnet/net_client.c: In function `client_to_svr':
../Libnet/net_client.c:198: warning: unused variable `flags'
../Libnet/net_client.c:216: warning: label `retry' defined but not used
make[3]: *** [net_client.lo] Error 1
make[3]: Leaving directory `/opt/torque-2.3.6-src/src/lib/Libpbs'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/opt/torque-2.3.6-src/src/lib'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/opt/torque-2.3.6-src/src'
make: *** [all-recursive] Error 1


Thanks,
Tim
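
[Editorial sketch] For readers who want to check whether reserved ports are
piling up in TIME_WAIT as described earlier in the thread, the count can be
taken from `netstat -tan` output along these lines. The function name is
hypothetical, the six-column layout assumes Linux netstat, and the sample rows
are made up for illustration, not taken from any real cluster.

```python
def count_reserved_time_wait(netstat_output):
    """Count TCP sockets in TIME_WAIT whose local port is below 1024.

    Expects Linux `netstat -tan` rows of the form:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip headers and non-TCP rows
        local_addr, state = fields[3], fields[5]
        port_text = local_addr.rsplit(":", 1)[-1]
        if state == "TIME_WAIT" and port_text.isdigit() and int(port_text) < 1024:
            count += 1
    return count


# Made-up sample output for illustration only.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:1021           10.0.0.17:15002         TIME_WAIT
tcp        0      0 10.0.0.1:877            10.0.0.23:15002         TIME_WAIT
tcp        0      0 10.0.0.1:15001          10.0.0.40:1019          ESTABLISHED
tcp        0      0 10.0.0.1:40112          10.0.0.9:22             TIME_WAIT
"""

print(count_reserved_time_wait(SAMPLE))
```

On a live system the same function could be fed the text captured from running
netstat, for example via subprocess.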

