[torqueusers] Hanging TIME_WAIT
Tim Freeman
tfreeman at mcs.anl.gov
Tue Feb 17 16:41:12 MST 2009
On Tue, 17 Feb 2009 15:28:14 -0700
Josh Butikofer <josh at clusterresources.com> wrote:
> Tim,
>
> I'm going to try to address both your post and Jason's at the same time. I
> will use Jason's text as a basis for my responses:
>
> Jason Williams wrote:
> > Hello All,
> > I've spent some time googling around for an answer to this, and not
> > really found one. I have, however, found several people complaining
> > of the same issue. The problem I am having is that my pbs_server
> > machine seems to be running out of available reserved ports (ports <
> > 1024). I've actually traced the issue to what looks like outgoing
> > communications to all my pbs_mom instances on my compute nodes. It
> > seems that the server is using a reserved port on the local side of the
> > connection, and then, for some reason, the connection drops into
> > TIME_WAIT and sits there when I examine netstat. The cluster has about
> > 120 nodes on it, so the reserved ports can fill up quite fast causing
> > all automounted NFS mounts to basically die.
> >
> > I've searched this list's archives with the search function on the
> > mailing list page and didn't really come up with anything. So I am
> > wondering if anyone else has seen this and has a possible solution? Any
> > suggestions are welcome as it's causing my users some significant
> > amounts of grief.
> >
> > I'm also kind of curious to know if anyone happens to know why what
> > looks like an outgoing connection is using a reserved port on the local
> > side. That strikes me as a bit odd, but I'm sure there's a good reason
> > for it.
>
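
For anyone who wants to measure the symptom Jason describes on their own server, here is a minimal sketch. It assumes Linux-style numeric output from `netstat -tan` (columns: Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The function name and the sample addresses are made up for illustration; they are not taken from a real cluster.

```python
def count_reserved_time_wait(netstat_text, limit=1024):
    """Count TIME_WAIT TCP sockets whose local port is below `limit`."""
    count = 0
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP socket line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "TIME_WAIT":
            continue
        # Local address looks like 10.0.0.1:1021; the port follows the last colon.
        port_text = fields[3].rsplit(":", 1)[-1]
        if port_text.isdigit() and int(port_text) < limit:
            count += 1
    return count


# Hypothetical sample output, for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:1021           10.0.0.17:15002         TIME_WAIT
tcp        0      0 10.0.0.1:1019           10.0.0.18:15002         TIME_WAIT
tcp        0      0 10.0.0.1:45112          10.0.0.19:15002         TIME_WAIT
tcp        0      0 10.0.0.1:15001          10.0.0.20:1023          ESTABLISHED
"""

if __name__ == "__main__":
    print(count_reserved_time_wait(SAMPLE))
```

Feeding it the real output of `netstat -tan` on the pbs_server host every few seconds should show whether the number of reserved ports stuck in TIME_WAIT keeps climbing.
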
> Believe it or not, this usage of privileged ports is a feature of TORQUE and
> is currently the way that TORQUE ensures that it can trust communication from
> client commands and pbs_mom daemons. The theory behind this is that only a
> process with superuser privileges can bind to a privileged port when creating
> an outgoing TCP connection. The remote process (in this case pbs_server)
> accepts the socket and examines the origin port to see if it is a port < 1024.
> This lets the remote process (pbs_server) know that the connecting process is
> running as root and can be trusted. This explanation is an oversimplification
> of all the steps going on, but I think it makes the point.
>
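
To make sure I follow the mechanism, here is how I would sketch it. This is only an illustration of the general idea Josh describes, not TORQUE's actual code: the function names are hypothetical, and the 512-1023 search range is my assumption, borrowed from the usual reserved-port convention.

```python
import socket

PRIVILEGED_LIMIT = 1024  # ports below this normally require root to bind


def is_privileged_port(port):
    """True if `port` is in the privileged range."""
    return 0 < port < PRIVILEGED_LIMIT


def peer_is_trusted(conn):
    """Receiving side: trust the peer only if its source port is privileged."""
    peer_port = conn.getpeername()[1]
    return is_privileged_port(peer_port)


def bind_privileged(sock, start=1023, stop=512):
    """Connecting side: bind to the first free privileged port (needs root).

    The range is an assumption for this sketch. Each outgoing connection
    consumes one port from this small pool.
    """
    for port in range(start, stop - 1, -1):
        try:
            sock.bind(("", port))
            return port
        except OSError:  # port in use, or permission denied
            continue
    raise OSError("no privileged port available")
```

If that picture is roughly right, it also explains the symptom: the pool of ports below 1024 is small, so connections lingering in TIME_WAIT can use it up quickly.
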
> I, too, have noticed a lot of customers recently complaining about privileged
> ports getting used up fast. I'm not sure if this is due to a regression in
> TORQUE or if it is simply the result of more jobs being present in clusters.
>
> I would like to ask the community, especially long-time users of TORQUE that
> have upgraded to TORQUE 2.3.x, if they have noticed any problems related to
> privileged port usage. For those who use TORQUE 2.1.x, do you see privileged
> ports sitting in a TIME_WAIT state after they are used?
>
> There is a way to disable the usage of privileged ports, but doing so has big
> security implications. You can disable privileged ports using the
> "--disable-privports" configure option. If this is done, however, it is
> possible for a competent malicious user to hijack pbs_iff and submit jobs as
> other users, cancel other users' jobs, etc. In other words, they can "lie" to
> pbs_server about their UID. Disabling privileged ports works in some
> environments as they aren't concerned about this possible security risk--but
> most sites shy away from this option.
Josh, thank you for responding and thank you for the suggestion.
This is on a private VM-based cluster with its own LAN, so the security issue
doesn't really apply.
I went ahead and tried --disable-privports but got the error below. The build
works fine when configure is run without it; I also tried make clean, etc.
if /bin/sh ../../../libtool --tag=CC --mode=compile gcc -m32 -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -I../Libdis -DIFF_PATH=\"/opt/torque-2.3.6/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/opt/torque-2.3.6/PBS_spool//server_name\" -g -O2 -D_LARGEFILE64_SOURCE -W -Wall -Wno-unused-parameter -Wno-long-long -pedantic -Werror -MT net_client.lo -MD -MP -MF ".deps/net_client.Tpo" -c -o net_client.lo `test -f '../Libnet/net_client.c' || echo './'`../Libnet/net_client.c; \
then mv -f ".deps/net_client.Tpo" ".deps/net_client.Plo"; else rm -f ".deps/net_client.Tpo"; exit 1; fi
gcc -m32 -DHAVE_CONFIG_H -I. -I. -I../../../src/include -I../../../src/include -I../Libdis -DIFF_PATH=\"/opt/torque-2.3.6/sbin/pbs_iff\" -DPBS_DEFAULT_FILE=\"/opt/torque-2.3.6/PBS_spool//server_name\" -g -O2 -D_LARGEFILE64_SOURCE -W -Wall -Wno-unused-parameter -Wno-long-long -pedantic -Werror -MT net_client.lo -MD -MP -MF .deps/net_client.Tpo -c ../Libnet/net_client.c -fPIC -DPIC -o .libs/net_client.o
../Libnet/net_client.c: In function `client_to_svr':
../Libnet/net_client.c:198: warning: unused variable `flags'
../Libnet/net_client.c:216: warning: label `retry' defined but not used
make[3]: *** [net_client.lo] Error 1
make[3]: Leaving directory `/opt/torque-2.3.6-src/src/lib/Libpbs'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/opt/torque-2.3.6-src/src/lib'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/opt/torque-2.3.6-src/src'
make: *** [all-recursive] Error 1
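Reading the output again, gcc reports no error of its own, only two warnings, and -Werror is among the flags. My guess is that the warnings are what fail the build, since -Werror makes gcc treat warnings as errors. A small sketch that pulls those warnings out of a build log follows; the function name is hypothetical and the gcc line in the sample is abbreviated from the output above.

```python
import re

# Matches lines such as: ../Libnet/net_client.c:198: warning: unused variable `flags'
WARNING_RE = re.compile(r"^(?P<file>[^:\s]+):(?P<line>\d+): warning: (?P<msg>.*)$")


def werror_warnings(build_log):
    """Return (file, line, message) for each warning if -Werror was in effect."""
    if "-Werror" not in build_log:
        return []  # warnings alone would not stop the build
    found = []
    for line in build_log.splitlines():
        m = WARNING_RE.match(line.strip())
        if m:
            found.append((m.group("file"), int(m.group("line")), m.group("msg")))
    return found


# Abbreviated copy of the build output, for illustration.
LOG = """\
gcc -m32 -W -Wall -pedantic -Werror -c ../Libnet/net_client.c -o .libs/net_client.o
../Libnet/net_client.c: In function `client_to_svr':
../Libnet/net_client.c:198: warning: unused variable `flags'
../Libnet/net_client.c:216: warning: label `retry' defined but not used
make[3]: *** [net_client.lo] Error 1
"""

if __name__ == "__main__":
    for path, lineno, msg in werror_warnings(LOG):
        print(f"{path}:{lineno}: {msg}")
```

If that reading is correct, the `flags` variable and the `retry` label in client_to_svr are presumably only used when privileged ports are enabled, but I have not checked the source to confirm.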
Thanks,
Tim