[torquedev] Potential 4+ hour hang in pbs_server

Joshua Bernstein jbernstein at penguincomputing.com
Tue Oct 5 14:28:14 MDT 2010



David Beer wrote:
> Hi all,
> 
> A customer reported this issue, and I gave him a fix that currently
> isn't checked in to TORQUE. I'm wondering what the best way to fix
> this issue is:
> 
> In src/lib/Libnet/net_client.c, when a socket can't be accessed for
> normal reasons, including operation in progress and timeout errors,
> it will continue to retry different possible sockets until it runs
> out. In some cases, such as a node dying in the middle of
> communication, all of these retries will fail. This is what is
> happening to the client. Now, in the current state of TORQUE (and
> this has been true for a long time) it will retry 880 times. Each
> time can take up to 18 seconds (2 5-second timeouts and 1 8-second
> timeout by default). This means that the pbs_server can be stuck
> retrying against a dead node for 4.4 hours. I'm thinking that this
> wouldn't be acceptable in any scenario. The patch I sent them makes a
> hard retry limit something that can be configured into TORQUE, but
> my personal opinion is that since no one is likely to find a 4.4 hour
> wait acceptable, we ought to change the default. I propose deciding
> on a maximum number of retries, and using that by default. What are
> your thoughts on this?

David, I think a maximum number of retries makes sense here. How about
something like 10? If this were a server configuration variable, that
would be nice as well.
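
To put numbers on that, here is a quick sketch of the worst-case
arithmetic in C. The constant and function names are made up for
illustration and are not taken from net_client.c; the values are the
ones quoted above, plus my suggested cap of 10.

```c
/* Hypothetical names for illustration only; not from net_client.c. */
enum {
    SHORT_TIMEOUT_SECS = 5,   /* two of these per attempt, per the report */
    LONG_TIMEOUT_SECS  = 8,   /* one of these per attempt, per the report */
    CURRENT_RETRIES    = 880, /* retry count described in the report */
    PROPOSED_RETRIES   = 10   /* suggested cap, an assumption */
};

/* Worst-case seconds a single attempt can block. */
static long secs_per_attempt(void)
{
    return 2L * SHORT_TIMEOUT_SECS + LONG_TIMEOUT_SECS;
}

/* Worst-case total seconds if every one of `retries` attempts times out. */
static long worst_case_secs(long retries)
{
    return retries * secs_per_attempt();
}
```

With the figures above, worst_case_secs(CURRENT_RETRIES) comes to 15840
seconds (4.4 hours), while worst_case_secs(PROPOSED_RETRIES) is 180
seconds (3 minutes).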

Please do file a bug on this so it can be tracked in Bugzilla properly.

-Josh

