[torqueusers] Hanging TIME_WAIT

Josh Butikofer josh at clusterresources.com
Tue Feb 24 09:53:13 MST 2009


Everyone,

We have figured out why configuring with "--disable-privports" causes failures 
with TORQUE 2.3.x. When UNIX domain sockets were added to TORQUE 2.3.x (to 
improve the performance of TORQUE client commands), a small side effect was 
introduced: client commands that use UNIX sockets fail if privileged ports are 
disabled. If you run the client commands from a remote host, so that a TCP 
connection is used, the commands work properly.

This was an easy fix and is now available in the latest TORQUE 2.3.7 snapshot. 
You can also apply the attached patch, which resolves the problem without 
requiring you to download a new tarball.
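
For anyone who wants to see the change without reading the diff, here is a
rough standalone sketch of how add_conn() sets cn_authen before and after the
patch. The enum, the two macros, the privports_disabled flag, and both function
names are stand-ins I made up for illustration; they are not TORQUE's actual
definitions, and the real code selects the branch at compile time with
NOPRIVPORTS rather than with a runtime flag.

```c
#include <assert.h>

/* Hypothetical stand-ins for TORQUE's PBS_SOCK_INET and its UNIX counterpart. */
enum sock_kind { SOCK_KIND_INET, SOCK_KIND_UNIX };

/* Hypothetical stand-in for PBS_NET_CONN_FROM_PRIVIL; the real value may differ. */
#define CONN_FROM_PRIVIL 1

/* Ports below this are privileged; IPPORT_RESERVED is 1024 on common systems. */
#define RESERVED_PORT_LIMIT 1024

/* Behaviour of the lines the patch removes. */
int authen_before_fix(enum sock_kind kind, unsigned int port, int privports_disabled)
{
  assert(kind == SOCK_KIND_INET || kind == SOCK_KIND_UNIX);

  if (privports_disabled)
    return CONN_FROM_PRIVIL;  /* every connection, UNIX sockets included */

  if (kind == SOCK_KIND_INET && port < RESERVED_PORT_LIMIT)
    return CONN_FROM_PRIVIL;

  return 0;
}

/* Behaviour of the lines the patch adds. */
int authen_after_fix(enum sock_kind kind, unsigned int port, int privports_disabled)
{
  assert(kind == SOCK_KIND_INET || kind == SOCK_KIND_UNIX);

  if (kind != SOCK_KIND_INET)
    return 0;  /* AF_UNIX sockets are never marked privileged */

  if (privports_disabled || port < RESERVED_PORT_LIMIT)
    return CONN_FROM_PRIVIL;  /* all TCP, or TCP from a reserved port */

  return 0;
}
```

The only case where the two functions differ is a UNIX socket connection with
privileged ports disabled, which is the case that was failing.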

Regards,

Josh Butikofer
Cluster Resources, Inc.
#############################


Tim Freeman wrote:
> On Wed, 18 Feb 2009 14:54:23 -0700
> Josh Butikofer <josh at clusterresources.com> wrote:
> 
>> And you re-compiled and re-installed all components of TORQUE: the
>> pbs_server, client commands, and pbs_mom daemons?
> 
> Yes.  I guess I could also try it from complete scratch and see if this repeats.
> 
> Tim
> 
>> Josh Butikofer
>> Cluster Resources, Inc.
>> #############################
>>
>>
>> Tim Freeman wrote:
>>> On Tue, 17 Feb 2009 21:56:52 -0700 (MST)
>>> Josh Butikofer <josh at clusterresources.com> wrote:
>>>
>>>> Tim,
>>>>
>>>>> Josh, thank you for responding and thank you for the suggestion.
>>>> No problem.
>>>>  
>>>>> This is on a private VM based cluster with its own LAN so the security
>>>>> issue doesn't really apply.
>>>>>
>>>>> I went ahead and tried --disable-privports out but got this error (works
>>>>> fine when configure is run without it, I also tried make clean, etc.).
>>>> Some compiler warnings are preventing it from fully compiling. Add
>>>> "--disable-gcc-warnings" to the configure arguments as well and you should
>>>> get better results.
>>> Well, that compiles and installs but now Torque does not function anymore.
>>>
>>> Getting "15056 Bad DIS based Request Protocol" errors in the logs when
>>> running qmgr and qsub.
>>>
>>>
>>> $ echo "hostname" | qsub
>>> qsub: Invalid request MSG=no job owner specified
>>>
>>>
>>> $ qmgr -c "print server"
>>>
>>> #
>>> # Create queues and set their attributes.
>>> #
>>> #
>>> # Create and define queue defaultq
>>> #
>>> create queue defaultq
>>> set queue defaultq queue_type = Route
>>> set queue defaultq route_destinations = batchq
>>> set queue defaultq enabled = True
>>> set queue defaultq started = True
>>> #
>>> # Create and define queue batchq
>>> #
>>> create queue batchq
>>> set queue batchq queue_type = Execution
>>> set queue batchq enabled = True
>>> set queue batchq started = True
>>> qmgr obj= svr=default: Bad DIS based Request Protocol MSG=cannot decode message
>>>
-------------- next part --------------
Index: src/lib/Libnet/net_server.c
===================================================================
--- src/lib/Libnet/net_server.c	(revision 2779)
+++ src/lib/Libnet/net_server.c	(working copy)
@@ -637,16 +637,31 @@
 
 #ifndef NOPRIVPORTS
 
-  if (socktype == PBS_SOCK_INET && port < IPPORT_RESERVED)
+  if ((socktype == PBS_SOCK_INET) && (port < IPPORT_RESERVED))
+    {
     svr_conn[sock].cn_authen = PBS_NET_CONN_FROM_PRIVIL;
+    }
   else
+    {
+    /* AF_UNIX sockets */
     svr_conn[sock].cn_authen = 0;
+    }
 
 #else /* !NOPRIVPORTS */
-  svr_conn[sock].cn_authen = PBS_NET_CONN_FROM_PRIVIL;
 
-#endif /* NOPRIVPORTS */
+  if (socktype == PBS_SOCK_INET)
+    {
+    /* All TCP connections are privileged */
+    svr_conn[sock].cn_authen = PBS_NET_CONN_FROM_PRIVIL;
+    }
+  else
+    {
+    /* AF_UNIX sockets */
+    svr_conn[sock].cn_authen = 0;
+    }
 
+#endif /* !NOPRIVPORTS */
+
   return;
   }  /* END add_conn() */
 

