[Mauiusers] [patch] Work around Maui freezes due to the slow responses of Torque server

Eygene Ryabinkin rea+maui at grid.kiae.ru
Wed Jun 25 03:33:40 MDT 2008


Garrick, good day.

Mon, Jun 23, 2008 at 04:57:39PM -0700, Garrick Staples wrote:
> But instead of adding a new client function, why not just change the existing
> client lib to simply close the socket on timeout?  Any further attempts to use
> the socket would return an error (or we could even get really nifty and
> auto-reconnect).

I tried to make the API calls explicit: if a client wants to close
the connection prematurely, it should do so with pbs_abort_connection().

Closing the socket on timeout makes sense, but it can break some consumers
of the Torque API: think Maui.  I looked over Maui's code (moab/MPBSI.c)
and it turns out to open an initial connection to the PBS server and then
make multiple queries in some places.  On the other hand, it will
detect errors and will hopefully handle them properly, so the
breakage won't happen.
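
To make the "close on timeout, fail afterwards" behaviour concrete, here
is a minimal sketch.  It is not Torque code: struct conn and
conn_read_timeout() are hypothetical names, and the error codes
(ETIMEDOUT, then ENOTCONN on later calls) are an assumption about what
a client library might choose to report.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical connection handle; not a Torque structure. */
struct conn {
    int fd;   /* -1 once the connection has been closed */
};

/*
 * Read up to len bytes, waiting at most timeout_ms for data.
 * On timeout the descriptor is closed and marked invalid, so every
 * later call fails at once with ENOTCONN instead of blocking again.
 * Returns the byte count, 0 on EOF, or -1 with errno set.
 * Note: an EINTR restarts the wait with the full timeout.
 */
ssize_t conn_read_timeout(struct conn *c, void *buf, size_t len,
                          int timeout_ms)
{
    assert(c != NULL);
    if (c->fd < 0) {              /* closed by an earlier timeout */
        errno = ENOTCONN;
        return -1;
    }

    struct pollfd p = { .fd = c->fd, .events = POLLIN, .revents = 0 };
    int rc;
    do {
        rc = poll(&p, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc == 0) {                /* timed out: close and remember it */
        close(c->fd);
        c->fd = -1;
        errno = ETIMEDOUT;
        return -1;
    }
    if (rc < 0)
        return -1;                /* poll() itself failed */
    return read(c->fd, buf, len);
}
```

A caller that keeps one connection open for several queries, as Maui
does, would see the first timed-out query fail with ETIMEDOUT and every
following one fail with ENOTCONN until it opens a new connection.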

Auto-reconnect will be fine if the performance overhead is acceptable:
with lazy reconnect, you'll need to check whether the socket is closed and
reopen it every time the client wants to write something.  Another
way is to reconnect immediately after the close, thus eliminating the
need for the checks, but it puts an additional burden on the Torque
server -- it will need to handle slightly more connections in
situations where the client's sequence is
  1. open connection;
  2. issue request;
  3. disconnect.
Reconnecting immediately after a failure at step 2 will be
inefficient -- the client will disconnect almost immediately and the server
will be asked unnecessarily for one more connection.  On the
other hand, timeouts shouldn't happen often: if they do, then
something should be done to prevent them in the first place.
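
The lazy variant could look roughly like the sketch below.  All names
(struct lazy_conn, lazy_write(), lazy_drop()) are hypothetical and the
code uses a plain TCP connect rather than anything from the Torque
client library; it only illustrates where the per-write check sits.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle; the names are illustrative, not Torque's. */
struct lazy_conn {
    int fd;                    /* -1 while disconnected */
    struct sockaddr_in peer;   /* where to (re)connect */
    unsigned opens;            /* connections opened so far */
};

/* Open a fresh TCP connection to c->peer; 0 on success, -1 on error. */
static int lazy_open(struct lazy_conn *c)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&c->peer, sizeof c->peer) < 0) {
        int saved = errno;     /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    c->fd = fd;
    c->opens++;
    return 0;
}

/* Mark the connection dead, e.g. after a timeout. */
void lazy_drop(struct lazy_conn *c)
{
    assert(c != NULL);
    if (c->fd >= 0)
        close(c->fd);
    c->fd = -1;
}

/* Lazy reconnect: the "is it closed?" check is paid on every write. */
ssize_t lazy_write(struct lazy_conn *c, const void *buf, size_t len)
{
    assert(c != NULL);
    if (c->fd < 0 && lazy_open(c) < 0)
        return -1;
    return write(c->fd, buf, len);
}
```

With this shape a client that disconnects right after a failed request
never triggers the extra connection, because nothing is reopened until
the next write.  The cost is the one integer comparison per write.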

Perhaps the most radical change to prevent timeouts from happening is
to make the Torque server's calls to MOMs non-blocking and event-driven,
but that is a lot of work.  In principle, Torque already uses
poll()/select(), so such functionality is already partly present,
but as my case showed, connect() poses some trouble too.
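
For the connect() part, the usual POSIX technique is a non-blocking
connect bounded by poll().  The sketch below shows that general
pattern; connect_timeout() is a hypothetical helper, not a function
from Torque, and it handles only IPv4 for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Connect to addr, giving up after timeout_ms milliseconds.
 * Returns a connected socket switched back to blocking mode, or -1
 * with errno set (ETIMEDOUT if the peer did not answer in time).
 */
int connect_timeout(const struct sockaddr_in *addr, int timeout_ms)
{
    assert(addr != NULL);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    if (connect(fd, (const struct sockaddr *)addr, sizeof *addr) < 0) {
        if (errno != EINPROGRESS)
            goto fail;            /* immediate, real failure */

        /* Handshake is in progress: wait until the socket is writable. */
        struct pollfd p = { .fd = fd, .events = POLLOUT, .revents = 0 };
        int rc;
        do {
            rc = poll(&p, 1, timeout_ms);
        } while (rc < 0 && errno == EINTR);
        if (rc == 0)
            errno = ETIMEDOUT;
        if (rc <= 0)
            goto fail;

        /* Writable does not mean connected: fetch the final status. */
        int err = 0;
        socklen_t len = sizeof err;
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
            goto fail;
        if (err != 0) {
            errno = err;
            goto fail;
        }
    }

    if (fcntl(fd, F_SETFL, flags) < 0)   /* restore blocking mode */
        goto fail;
    return fd;

fail: {
        int saved = errno;        /* close() may overwrite errno */
        close(fd);
        errno = saved;
    }
    return -1;
}
```

A fully event-driven server would keep the socket non-blocking and fold
the POLLOUT wait into its main poll()/select() loop instead of waiting
here; this helper only bounds how long one connect attempt can stall.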


I think that some consensus should be reached prior to any
modifications.  I will be happy to work around this problem and
test it on our resources, but perhaps in two weeks -- I am a bit busy
now and don't want to put the production cluster into test mode
just now.  Since the two major consumers of the Torque API are Torque itself
and Maui, and I am talking to both communities, could other people
say something about possible solutions?

Thanks a lot!
-- 
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"

