[torquedev] Re: [Mauiusers] [patch] Work around Maui freezes due to the slow responses of Torque server

Garrick Staples garrick at usc.edu
Mon Jun 23 17:57:39 MDT 2008


On Mon, Jun 23, 2008 at 01:12:24PM +0400, Eygene Ryabinkin alleged:
> Good day.
> 
> Sorry for cross-posting, but this concerns both Torque and Maui.
> I was not able to find a workaround that touches only one
> product.
> 
> I noticed that Maui on our production system would freeze for
> 15 minutes at a time.  During this time no requests were processed,
> and strace showed that Maui was blocked inside the read() system call.
> 
> Investigation showed that the Torque server was not responding to
> Maui within the 9-second interval, so Maui tried to close the
> connection via pbs_disconnect().  But the latter posts another
> request (Disconnect), and Torque reads these two requests in one
> read() call: they are effectively coalesced.  Maui's timeout occurs
> because Torque was busy processing other requests (it timed out
> twice while connecting to the worker nodes, which is enough to
> exceed the 9-second timeout).  So Maui's first request is not lost:
> Torque processes it, but only after Maui's call to pbs_disconnect(),
> and as a result the Disconnect request is effectively lost.
> 
> But pbs_disconnect() tries to read all outstanding data from the
> Torque server, and this leads to the blocking read(): once all
> outstanding data from Torque has been read, the final read() should
> return end-of-file, but it will not do so until Torque's side of the
> channel is closed.  And that happens only after the 15-minute
> timeout: remember that the Disconnect request is lost.
> 
> The two attached patches cure the problem: Maui drops the connection
> with a new function, pbs_abort_connection().  I am also attaching
> my internal notes about this problem: they contain strace outputs
> and my thoughts about the problem; this may be of some help to
> developers.
> 
> Since this new function extends the Torque API, I added a configure
> check for it, and its use in Maui is conditional at compile time.
> 
> I have been evaluating this patch for two days: it has shown no
> problems so far.  Moreover, it cures the original freeze ;))

First off, I certainly agree with your analysis of the situation.

But instead of adding a new client function, why not just change the existing
client lib to simply close the socket on timeout?  Any further attempts to use
the socket would return an error (or we could even get really nifty and
auto-reconnect).
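
To illustrate the close-on-timeout idea, here is a minimal sketch in C.  It
is not taken from the Torque client library: the helper name read_or_close()
and its signature are made up for illustration, and it assumes a plain POSIX
stream socket with poll() providing the timeout.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a Torque API function).
 *
 * Waits up to timeout_ms for data on *sock and reads it.  On timeout or
 * on a read/poll error the socket is closed locally and *sock is set to
 * -1, so any later call fails immediately with EBADF instead of blocking.
 * Returns the number of bytes read (0 means the peer closed its side),
 * or -1 with errno set.
 */
static ssize_t read_or_close(int *sock, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    ssize_t n;
    int rc;
    int saved_errno;

    if (*sock < 0) {
        errno = EBADF;          /* already dropped by an earlier timeout */
        return -1;
    }

    pfd.fd = *sock;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if interrupted by a signal; for simplicity this sketch
     * restarts the full timeout each time. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc > 0) {
        n = read(*sock, buf, len);
        if (n >= 0)
            return n;
    }

    /* Timeout (rc == 0) or error: drop the connection without waiting
     * for the server to close its side. */
    saved_errno = (rc == 0) ? ETIMEDOUT : errno;
    close(*sock);
    *sock = -1;
    errno = saved_errno;
    return -1;
}
```

Because the descriptor is closed locally as soon as the timeout fires, nothing
is left waiting for the server's end-of-file.  An auto-reconnect could
presumably be layered on top by checking for the -1 descriptor before the next
request.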
