[torquedev] [Bug 85] New: Potential 4+ hour hang in pbs_server

bugzilla-daemon at supercluster.org
Tue Oct 5 14:42:21 MDT 2010


           Summary: Potential 4+ hour hang in pbs_server
           Product: TORQUE
           Version: 2.5.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: pbs_server
        AssignedTo: dbeer at adaptivecomputing.com
        ReportedBy: dbeer at adaptivecomputing.com
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

In src/lib/Libnet/net_client.c, when a socket can't be accessed for normal
reasons, including operation-in-progress and timeout errors, the code keeps
retrying with different possible sockets until it runs out. In some cases, such
as a node dying in the middle of communication, all of these retries will fail.
This is what is happening to the client. In the current state of TORQUE (and
this has been true for a long time), it will retry 880 times. Each attempt can
take up to 18 seconds (two 5-second timeouts and one 8-second timeout by
default), so pbs_server can be stuck retrying against a dead node for up to
880 * 18 = 15,840 seconds, or 4.4 hours. I don't think this would be acceptable
in any scenario. The patch I sent them makes a hard retry limit configurable in
TORQUE, but my personal opinion is that, since no one is likely to find a
4.4-hour wait acceptable, we ought to change the default. I propose deciding on
a maximum number of retries and using that by default. What are your thoughts on
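
To make the numbers above concrete, here is a minimal sketch of the worst-case
arithmetic, plus one possible way to derive a retry cap from a time budget. It
is not the code in net_client.c and not the patch mentioned above; the function
names and the idea of a "budget" are assumptions made for illustration only.
The figures 880, 5, 5 and 8 are the ones quoted in this report.

```c
/* Figures quoted in the report: 880 retries, and per attempt two
 * 5-second timeouts plus one 8-second timeout (18 seconds total). */
enum {
    REPORTED_RETRIES = 880,
    ATTEMPT_SECONDS  = 5 + 5 + 8
};

/* Worst-case time, in seconds, spent retrying against a dead node
 * when every attempt runs into all of its timeouts. */
long worst_case_seconds(long retries, long seconds_per_attempt)
{
    return retries * seconds_per_attempt;
}

/* Hypothetical helper (not part of TORQUE): the largest retry count
 * whose worst case still fits inside budget_seconds. Returns 0 when
 * the per-attempt time is not positive. */
long retries_for_budget(long budget_seconds, long seconds_per_attempt)
{
    if (seconds_per_attempt <= 0)
        return 0;
    return budget_seconds / seconds_per_attempt;
}
```

With the reported values, worst_case_seconds(REPORTED_RETRIES, ATTEMPT_SECONDS)
is 15840 seconds, which is the 4.4 hours quoted above. Going the other way, a
budget of one minute would allow retries_for_budget(60, ATTEMPT_SECONDS) = 3
attempts. The one-minute budget is only an example, not a proposed default.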

