[torquedev] Potential 4+ hour hang in pbs_server

David Beer dbeer at adaptivecomputing.com
Tue Oct 5 14:00:16 MDT 2010


Hi all,

A customer reported this issue, and I gave him a fix that currently isn't checked in to TORQUE. I'm wondering what the best way to fix this issue is:

In src/lib/Libnet/net_client.c, when a socket can't be accessed for an ordinary reason, such as an operation-in-progress or timeout error, the code keeps retrying with different possible sockets until it runs out. In some cases, such as a node dying in the middle of communication, all of these retries will fail. This is what the customer is seeing.

In the current state of TORQUE (and this has been true for a long time), it will retry 880 times. Each retry can take up to 18 seconds (two 5-second timeouts and one 8-second timeout by default), so pbs_server can be stuck retrying against a dead node for 880 x 18 = 15,840 seconds, or 4.4 hours. I don't think this would be acceptable in any scenario.

The patch I sent them makes a hard retry limit something that can be configured into TORQUE, but my personal opinion is that, since no one is likely to find a 4.4-hour wait acceptable, we ought to change the default. I propose deciding on a maximum number of retries and using that by default. What are your thoughts on this?
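
To make the numbers concrete, here is a rough sketch in C of the worst-case arithmetic and of what a hard retry limit could look like. It is not the code in net_client.c and not the patch I sent. The names worst_case_seconds, connect_with_retry_limit, connect_fn and dead_node are made up for illustration, and MAX_RETRIES_DEFAULT = 3 is only a placeholder, since no value has been agreed on.

```c
#include <assert.h>
#include <stddef.h>

/* Figures quoted in this message. */
#define RETRY_COUNT        880  /* retries attempted today */
#define SHORT_TIMEOUT_SEC    5  /* two of these per retry (default) */
#define LONG_TIMEOUT_SEC     8  /* one of these per retry (default) */

/* Placeholder only: the actual default is what this thread should decide. */
#define MAX_RETRIES_DEFAULT  3

/* Worst-case seconds spent if every one of `retries` attempts times out. */
static long worst_case_seconds(long retries)
{
    return retries * (2L * SHORT_TIMEOUT_SEC + LONG_TIMEOUT_SEC);
}

/* Hypothetical callback type: returns 0 on success, -1 on a retryable failure. */
typedef int (*connect_fn)(int attempt, void *ctx);

/*
 * Call try_connect at most max_retries times.
 * Returns the 0-based index of the attempt that succeeded,
 * or -1 if every attempt failed.
 */
static int connect_with_retry_limit(connect_fn try_connect, void *ctx,
                                    int max_retries)
{
    int attempt;

    assert(try_connect != NULL);

    for (attempt = 0; attempt < max_retries; attempt++) {
        if (try_connect(attempt, ctx) == 0)
            return attempt;
    }
    return -1; /* hard limit reached, give up instead of hanging */
}

/* Stand-in for a dead node: every attempt fails; ctx counts the calls. */
static int dead_node(int attempt, void *ctx)
{
    (void)attempt;
    ++*(int *)ctx;
    return -1;
}
```

With the defaults above, worst_case_seconds(RETRY_COUNT) is 15840 seconds (4.4 hours), while the placeholder limit of 3 would cap the wait at 54 seconds. Calling connect_with_retry_limit(dead_node, &calls, MAX_RETRIES_DEFAULT) with an int counter returns -1 after exactly three attempts.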

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1656 S. East Bay Blvd. Suite #300
     Provo, UT 84606


