[torquedev] Potential 4+ hour hang in pbs_server
"Mgr. Šimon Tóth"
SimonT at mail.muni.cz
Wed Oct 6 00:01:28 MDT 2010
>> A customer reported this issue, and I gave him a fix that currently
>> isn't checked in to TORQUE. I'm wondering what the best way to fix
>> this issue is:
>> In src/lib/Libnet/net_client.c, when a socket can't be accessed for
>> normal reasons, including operation in progress and timeout errors,
>> it will continue to retry different possible sockets until it runs
>> out. In some cases, such as a node dying in the middle of
>> communication, all of these retries will fail. This is what is
>> happening to the client. Now, in the current state of TORQUE (and
>> this has been true for a long time) it will retry 880 times. Each
>> attempt can take up to 18 seconds (two 5-second timeouts and one
>> 8-second timeout by default). This means that pbs_server can be stuck
>> retrying against a dead node for 4.4 hours. I'm thinking that this
>> wouldn't be acceptable in any scenario. The patch I sent them makes a
>> hard retry limit something that can be configured into TORQUE, but
>> my personal opinion is that since no one is likely to find a 4.4 hour
>> wait acceptable, we ought to change the default. I propose deciding
>> on a maximum number of retries, and using that by default. What are
>> your thoughts on this?
> David, I think a maximum number of retries makes sense here. How about
> something like 10? If this were a server configuration variable, that
> would be nice as well.
> Please do file a bug on this so it can be tracked properly in Bugzilla.
If creation of a socket fails on all 880 retries, then you can't really
use the software anyway. Sure, you can fall back after a certain number
of retries, but does that really help you? You can't create the socket
in the first place, so you will just make the server move on to another
request and create more havoc.
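
For reference, the capped-retry proposal being discussed could look roughly like the sketch below. This is not the code in src/lib/Libnet/net_client.c. The names MAX_RETRIES_DEFAULT, try_connect_fn, connect_with_retry_limit, worst_case_wait_secs and the two stub callbacks are all hypothetical, and the 18-second figure is simply the per-attempt worst case quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical default cap; the thread suggests "something like 10". */
#define MAX_RETRIES_DEFAULT 10

/* Worst case per attempt as quoted in the thread: 5 + 5 + 8 seconds. */
#define WORST_CASE_SECS_PER_TRY 18

/* Hypothetical callback: returns 0 on success, nonzero on a retryable
 * failure (e.g. operation in progress or timeout). */
typedef int (*try_connect_fn)(int attempt, void *ctx);

/* Upper bound, in seconds, on the time spent retrying. */
long worst_case_wait_secs(int max_retries)
{
    assert(max_retries >= 0);
    return (long)max_retries * WORST_CASE_SECS_PER_TRY;
}

/* Try up to max_retries times. Returns the number of attempts used
 * (1-based) on success, or -1 once the cap is reached. */
int connect_with_retry_limit(try_connect_fn try_connect, void *ctx,
                             int max_retries)
{
    int attempt;

    assert(try_connect != NULL);
    for (attempt = 0; attempt < max_retries; attempt++) {
        if (try_connect(attempt, ctx) == 0)
            return attempt + 1;
    }
    return -1; /* give up and let the caller report the failure */
}

/* Stub simulating a dead node: every attempt fails. */
int always_fail(int attempt, void *ctx)
{
    (void)attempt;
    (void)ctx;
    return 1;
}

/* Stub that fails *ctx times and then succeeds. */
int fail_n_times(int attempt, void *ctx)
{
    return attempt < *(int *)ctx ? 1 : 0;
}
```

With 880 retries the bound works out to 880 * 18 = 15840 seconds, i.e. the 4.4 hours mentioned above. With a cap of 10 it is 180 seconds. Whether giving up after the cap actually helps the server is exactly the open question here.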
Mgr. Šimon Tóth