<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "http://www.clusterresources.com/bugzilla/bugzilla.dtd">

<bugzilla version="3.4.5"
          urlbase="http://www.clusterresources.com/bugzilla/"
          
          maintainer="support@clusterresources.com"
>

    <bug>
          <bug_id>85</bug_id>
          
          <creation_ts>2010-10-05 14:42:00 -0600</creation_ts>
          <short_desc>Potential 4+ hour hang in pbs_server</short_desc>
          <delta_ts>2010-12-23 01:45:41 -0700</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>TORQUE</product>
          <component>pbs_server</component>
          <version>2.5.x</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>NEW</bug_status>
          
          
          
          
          
          
          <priority>P5</priority>
          <bug_severity>enhancement</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="David Beer">dbeer</reporter>
          <assigned_to name="David Beer">dbeer</assigned_to>
          <cc>dbeer</cc>
    
          <cc>efocht</cc>
          <cc>SimonT</cc>
          <cc>thzeiser</cc>
          <cc>torquedev</cc>

      

      

      

          <long_desc isprivate="0">
            <who name="David Beer">dbeer</who>
            <bug_when>2010-10-05 14:42:20 -0600</bug_when>
            <thetext>In src/lib/Libnet/net_client.c, when a socket can&apos;t be accessed for normal reasons, including operation-in-progress and timeout errors, the code keeps retrying with different possible sockets until it runs out. In some cases, such as a node dying in the middle of communication, all of these retries will fail. This is what is happening to the client.

In the current state of TORQUE (and this has been true for a long time), it will retry 880 times. Each attempt can take up to 18 seconds (two 5-second timeouts and one 8-second timeout by default). This means that pbs_server can be stuck retrying against a dead node for 4.4 hours (880 * 18 s = 15,840 s). I don&apos;t think this would be acceptable in any scenario.

The patch I sent them makes a hard retry limit configurable in TORQUE, but my personal opinion is that since no one is likely to find a 4.4-hour wait acceptable, we ought to change the default. I propose deciding on a maximum number of retries and using that by default. What are your thoughts on this?</thetext>
          </long_desc>
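<!--
The worst-case figure in the comment above can be checked with a few lines of
arithmetic. This is only a sketch: the constants come from the report (880
retries, two 5-second timeouts and one 8-second timeout per attempt), while the
function and variable names are made up for illustration and do not exist in
TORQUE. The cap of 10 is the value suggested in the following comment.

```python
# Figures taken from the report; all names here are illustrative only.
RETRIES = 880                 # attempts made before giving up
TIMEOUTS_PER_TRY = (5, 5, 8)  # two 5-second timeouts and one 8-second timeout


def worst_case_seconds(retries, timeouts):
    """Upper bound on the stall if every attempt hits every timeout."""
    return retries * sum(timeouts)


default_stall = worst_case_seconds(RETRIES, TIMEOUTS_PER_TRY)
capped_stall = worst_case_seconds(10, TIMEOUTS_PER_TRY)  # hypothetical cap of 10

print(default_stall / 3600)  # worst case in hours with the current default
print(capped_stall / 60)     # worst case in minutes with a cap of 10
```

With the default timeouts, a cap of 10 retries would bound the stall at
180 seconds instead of 15,840 seconds.
-->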
          <long_desc isprivate="0">
            <who name="David Beer">dbeer</who>
            <bug_when>2010-10-05 14:42:45 -0600</bug_when>
            <thetext>Josh Bernstein wrote:

I think the maximum number of retries makes sense here. How about 
something like 10? If this was a server configuration variable that 
would be a nice thing as well.

Please do file a bug on this so it can be tracked in Bugzilla properly.

-Josh</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Simon Toth">SimonT</who>
            <bug_when>2010-10-06 00:00:12 -0600</bug_when>
            <thetext>If creation of a socket fails (on all 880 retries) then you can&apos;t really use the software anyway. Sure, you can fall back after a certain number of retries, but does that really help you? You can&apos;t create the socket in the first place, therefore you will just make the server go to another request and create more havoc.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="David Beer">dbeer</who>
            <bug_when>2010-10-06 09:17:30 -0600</bug_when>
            <thetext>(In reply to comment #2)
&gt; If creation of a socket fails (on all 880 retries) then you can&apos;t really use
&gt; the software anyway. Sure, you can fall back after a certain number of retries,
&gt; but does that really help you? You can&apos;t create the socket in the first place,
&gt; therefore you will just make the server go to another request and create more
&gt; havoc.

Actually, you can still use the software. You couldn&apos;t use it if this were happening on every node, but if it happens only on one or two nodes out of your entire cluster, then your pbs_server is hanging endlessly and the rest of your cluster is going unused. This is why a limit can be useful.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Simon Toth">SimonT</who>
            <bug_when>2010-10-06 11:07:49 -0600</bug_when>
            <thetext>(In reply to comment #3)
&gt; (In reply to comment #2)
&gt; &gt; If creation of a socket fails (on all 880 retries) then you can&apos;t really use
&gt; &gt; the software anyway. Sure, you can fall back after a certain number of retries,
&gt; &gt; but does that really help you? You can&apos;t create the socket in the first place,
&gt; &gt; therefore you will just make the server go to another request and create more
&gt; &gt; havoc.
&gt; 
&gt; Actually, you can still use the software. You couldn&apos;t use it if this were
&gt; happening on every node, but if it happens only on one or two nodes out of your
&gt; entire cluster, then your pbs_server is hanging endlessly and the rest of your
&gt; cluster is going unused. This is why a limit can be useful.

Sorry, you lost me. What is hanging? The server, or the node?

If the server is hanging because the sockets are failing then they will fail for all nodes. It&apos;s just like an out-of-memory error. Or could you please explain exactly which part of the code this refers to?</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="David Beer">dbeer</who>
            <bug_when>2010-10-06 16:37:46 -0600</bug_when>
            <thetext>(In reply to comment #4)

&gt; 
&gt; Sorry, you lost me. What is hanging? The server, or the node?
&gt; 

The server.

&gt; If the server is hanging because the sockets are failing then they will fail
&gt; for all nodes. It&apos;s just like an out-of-memory error. Or could you please explain
&gt; exactly which part of the code this refers to?

The server is hanging because the node it is in the middle of communicating with dies, mid-communication. Please read the first post on this ticket for more information.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Simon Toth">SimonT</who>
            <bug_when>2010-10-07 01:40:46 -0600</bug_when>
            <thetext>&gt; &gt; If the server is hanging because the sockets are failing then they will fail
&gt; &gt; for all nodes. It&apos;s just like an out-of-memory error. Or could you please explain
&gt; &gt; exactly which part of the code this refers to?
&gt; 
&gt; The server is hanging because the node it is in the middle of communicating
&gt; with dies, mid-communication. Please read the first post on this ticket for
&gt; more information.

OK, the real issue I&apos;m pointing out here is that we shouldn&apos;t limit the number of tries but rather handle return values correctly. What exactly is the return value of the bind() call in this case?</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Erich Focht">efocht</who>
            <bug_when>2010-11-04 08:48:13 -0600</bug_when>
            <thetext>(In reply to comment #6)
&gt; OK, the real issue I&apos;m pointing out here is that we shouldn&apos;t limit the number
&gt; of tries but rather handle return values correctly. What exactly is the return value
&gt; of the bind() call in this case?

We&apos;re seeing this issue as well (with 2.5.3) and it is really annoying.

For the return code, here&apos;s a trace:
bind(11, {sa_family=AF_INET, sin_port=htons(301), sin_addr=inet_addr(&quot;0.0.0.0&quot;)}, 16) = 0
connect(11, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr(&quot;10.188.11.238&quot;)}, 16) = -1 EINPROGRESS (Operation now in progr)

So connect() returns EINPROGRESS, then times out.

It&apos;s easy to test: start a job, then kill the job&apos;s head node.

BTW: we increased tcp_timeout to 120 since it&apos;s a rather big cluster, so
just reducing the number of retries is not quite ... useful.

Regards,
Erich</thetext>
          </long_desc>
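<!--
The point in the comment above, that a retry cap alone does not help once the
per-attempt timeout is large, suggests bounding total wall-clock time as well.
The following is a minimal sketch of that idea, not TORQUE code:
connect_with_deadline, attempt, and the 300-second budget are all hypothetical,
and charging a flat 120 seconds per failed attempt is an assumption made only
to keep the simulation simple.

```python
import time


def connect_with_deadline(attempt, max_retries, deadline_seconds,
                          clock=time.monotonic):
    """Call attempt() until it succeeds, retries run out, or the deadline passes.

    attempt is a hypothetical callable returning True on success. It stands in
    for one bind/connect cycle and is expected to enforce its own timeout.
    """
    start = clock()
    tries = 0
    while tries < max_retries and clock() - start < deadline_seconds:
        tries += 1
        if attempt():
            return True, tries
    return False, tries


# Simulated dead node: every attempt burns 120 seconds and then fails.
now = [0.0]


def dead_node_attempt():
    now[0] += 120.0
    return False


ok, tries = connect_with_deadline(dead_node_attempt, max_retries=880,
                                  deadline_seconds=300, clock=lambda: now[0])
print(ok, tries, now[0])
```

Because the deadline is only checked between attempts, the total stall can
exceed the budget by at most one attempt. In the simulated run the loop gives
up after three attempts rather than 880.
-->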
          <long_desc isprivate="0">
            <who name="thzeiser">thzeiser</who>
            <bug_when>2010-12-23 01:45:41 -0700</bug_when>
            <thetext>Is there any progress on this bug? The problem reported is a real show-stopper, not merely a request for a minor enhancement!</thetext>
          </long_desc>
      
      

    </bug>

</bugzilla>