[torquedev] [patch] Work around Maui freezes due to the slow responses of Torque server

Eygene Ryabinkin rea+maui at grid.kiae.ru
Mon Jun 23 03:12:24 MDT 2008


Good day.

Sorry for cross-posting, but this concerns both Torque and Maui.
At least I was not able to find a workaround that touches only
one product.

I noticed that Maui on our production system would freeze for
15 minutes at a time.  During this time no requests were processed,
and strace showed that Maui was blocked inside the read() system call.

Investigation showed that the Torque server was not responding to
Maui within the 9-second interval, so Maui tried to close the
connection via pbs_disconnect().  But the latter posts another
request (Disconnect), and Torque reads the two requests in one
read() call: they are effectively coalesced.  Maui's timeout is due
to Torque being busy with other requests (it timed out twice while
connecting to worker nodes, which is enough to exceed the 9-second
limit).  So Maui's first request is not lost: Torque processes it,
but only after Maui's call to pbs_disconnect(), so the Disconnect
request is effectively lost.

But pbs_disconnect() tries to read all outstanding data from the
Torque server, and this leads to the blocking read(): once all
outstanding data from Torque has been read, the final read() should
return end-of-file, but it will not do so until Torque's side of the
channel is closed.  And that happens only after the 15-minute
timeout: remember that the Disconnect request is lost.
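
To illustrate the mechanism, here is a simplified sketch of the
drain-then-close pattern that hangs (my own illustration, not the
actual pbs_disconnect() source):
-----
/* Simplified sketch, not the actual Torque source. */
#include <unistd.h>

static void drain_and_close(int sock)
{
    char buf[65536];

    /* Read until EOF.  If the peer never closes its side -- e.g.
     * because the Disconnect request was swallowed together with the
     * previous request -- the last read() blocks here until the
     * server's idle timeout (15 minutes) finally closes the socket. */
    while (read(sock, buf, sizeof(buf)) > 0)
        ;  /* discard outstanding data */

    close(sock);
}
-----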

The two attached patches cure the problem: Maui drops the connection
with a new function, pbs_abort_connection().  I am also attaching
my internal notes about this problem: they contain strace outputs
and my thoughts on the problem; this may be of some help to
developers.

Since this new function is an extension of the Torque API, I added
a configure check for it, and the use of the new function in Maui
is conditionalized at compile time.
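
For illustration, the compile-time switch in Maui could look roughly
like this (HAVE_PBS_ABORT_CONNECTION is my guess at the macro such a
configure check would define; the name in the actual patch may differ):
-----
#ifdef HAVE_PBS_ABORT_CONNECTION
    pbs_abort_connection(sock);  /* drop the connection immediately */
#else
    pbs_disconnect(sock);        /* may block reading up to EOF */
#endif
-----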

I have been evaluating this patch for two days: it has shown no
problems yet.  Moreover, it cures the original freeze ;))


A side note: I also changed configure.ac at the line where the
Makefiles for the various batch systems are included into the main
file.  It is not related to the current problem, but autoconf 2.61
fails to properly substitute multiple variables on one line if these
variables are substituted with the contents of a file
(AC_SUBST_FILE).  So, no matter whether these patches are accepted,
it would be good to take a look at line 21 of Makefile.in:
-----
@ll_definitions@@sdr_definitions@@pbs_definitions@@sge_definitions@@lsf_definitions@@mx_definitions@@pcre_definitions@
-----
These variables would better be placed on separate lines.


And the diff chunks for configure (the longest part of the Maui
patch) can be dropped, since you will likely regenerate configure
yourself if the patch is accepted.

Sorry for such a long letter ;))  And thanks for your patience!
-- 
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Use-pbs_abort_connection-for-failed-calls-to-Torqu.patch
Type: text/x-diff
Size: 317599 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20080623/18ddceb1/0001-Use-pbs_abort_connection-for-failed-calls-to-Torqu-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Add-routine-pbs_abort_connection.patch
Type: text/x-diff
Size: 3029 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20080623/18ddceb1/0001-Add-routine-pbs_abort_connection-0001.bin
-------------- next part --------------
Torque/MAUI saga
================
Internal notes about a problem found in the Maui <-> Torque interaction.
Started: 5th June 2008
Ended: 20th June 2008.

OK, the thing is that sometimes Maui hangs ;))  The visible symptom:
it becomes unresponsive for 15 minutes.  No scheduling, no answers
to external commands, nothing.  But after 15 minutes Maui unfreezes
and continues to do its job.  This (presumably) happens only when
Maui schedules Torque jobs.  Although I have not checked the cases
where Maui drives other batch engines, post-mortem code analysis
showed that the problem is Torque-specific.

Strace showed that Maui hangs in the read() system call.  After 15
minutes of inactivity read() returns 0 (EOF condition) and Maui
continues to run.  This happens inside Torque's function
pbs_disconnect(): it tries to read the remainder of the data from
the passed socket, but gets blocked waiting for EOF.  All outstanding
data has been read, but the other side has not closed the socket, so
read() blocks until the other side (the Torque server in our case)
closes the descriptor associated with the connection in question.

Here is an excerpt from Maui's strace output, with some comments
denoted by []:
-----
14:02:16.043050 write(4, "06/20 14:02:16 INFO:     startin"..., 49) = 49
14:02:16.043232 write(4, "06/20 14:02:16 MRMGetInfo()\n", 28) = 28
14:02:16.043424 write(4, "06/20 14:02:16 MRMClusterQuery()"..., 33) = 33
14:02:16.043581 write(4, "06/20 14:02:16 MPBSClusterQuery("..., 48) = 48
[ Asking Torque for node status, PBS_BATCH_StatusNode = 58 ]
14:02:16.043805 write(8, "+2+12+58+4root+0+0+0", 20) = 20
14:02:16.044016 poll([{fd=8, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 9000) = 1
[ Torque answered, reading chunk one... ]
14:02:16.048711 read(8, "+2+1+0+0+63+120+32+13n1.lcgwn.ki"..., 262144) = 32768
14:02:16.049261 poll([{fd=8, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 9000) = 1
[ ...and chunk two ]
14:02:16.049464 read(8, "3+397opsys=linux,uname=Linux n55"..., 262144) = 38243
14:02:16.050487 write(4, "06/20 14:02:16 __MPBSGetNodeStat"..., 52) = 52
14:02:16.050685 write(4, "06/20 14:02:16 INFO:     PBS nod"..., 73) = 73
14:02:16.050909 write(4, "06/20 14:02:16 MPBSNodeUpdate(n1"..., 69) = 69
14:02:16.051082 write(4, "06/20 14:02:16 MPBSLoadQueueInfo"..., 56) = 56
[ Asking Torque for the Queue status, PBS_BATCH_StatusQue = 20 ]
14:02:16.051275 write(8, "+2+12+20+4root+0+0+0", 20) = 20
14:02:16.051461 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 9000) = 0
[ Ooops, poll timeout (9 seconds) elapsed, but no reply was sent.
  Going to write log entry and close the connection ]
14:02:25.050789 write(4, "06/20 14:02:25 ERROR:    cannot "..., 73) = 73
[ Called pbs_disconnect(), issuing request PBS_BATCH_Disconnect = 59 ]
14:02:25.050996 write(8, "+2+12+59+4root", 14) = 14
14:02:25.051173 rt_sigaction(SIGALRM, {SIG_IGN}, {SIG_DFL}, 8) = 0
14:02:25.051325 alarm(9)                = 0
[ Reading outstanding data ... ]
14:02:25.051462 read(8, "+2+1+0+0+62+10+1+5dteam2+122+212"..., 65536) = 5400
[ ... and one more chunk (actually, we read all data during the previous
 read() and it happened to be the reply to the StatusQue request) ... ]
14:02:26.239137 read(8, 0x2aaaaace1aa0, 65536) = ? ERESTARTSYS (To be restarted)
[ ... but it is interrupted with the signal and will be restarted ... ]
14:02:34.052922 --- SIGALRM (Alarm clock) @ 0 (0) ---
[ ... and this call freezes Maui for 15 minutes ... ]
14:02:34.053054 read(8,  <unfinished ...>
-----

What's happening?  Heh, the answer is, as always, simple:

1. Torque was not able to handle our request within 9 seconds -- it
   was busy with other tasks (maybe waiting on other timeouts);

2. but once it got some spare time, it handled our request and even
   replied to us with the status of the queue;

3. BUT the two requests from Maui (StatusQue and Disconnect) were
   coalesced into one read on Torque's side.  Torque processed the
   first one and even replied to Maui (too late: Maui had already
   decided to close the connection), but failed to catch the second
   one.  A short demonstration of this coalescing follows the Torque
   strace excerpt below.

Here are the excerpts from Torque's strace:
-----
[ StatusNode request from MAUI ]
14:02:16.044185 read(19, "+2+12+58+4root+0+0+0", 262144) = 20
14:02:16.044282 open("/etc/hosts", O_RDONLY) = 11
* ... many lines skipped ... *
[ StatusNode reply to MAUI ]
14:02:16.048533 write(19, "+2+1+0+0+63+120+32+13n1.lcgwn.ki"..., 71011) = 71011
14:02:16.049659 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 11
* ... many lines skipped ... *
14:02:26.236395 poll([{fd=19, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 6000) = 1
[ And this is two requests coming from MAUI: one of them (20, StatusQue)
  was already timed out from the MAUI side, so it sent Disconnect request
  (59) and two packets coalesced into one.  Note that 34 = 20 (size of
  Maui's StatusQue request) + 14 (size of Maui's Disconnect request) ]
14:02:26.236502 read(19, "+2+12+20+4root+0+0+0+2+12+59+4ro"..., 262144) = 34
14:02:26.236628 open("/etc/hosts", O_RDONLY) = 11
* ... many lines skipped ... *
-----
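
To make the coalescing concrete, here is a small self-contained
demonstration (my own illustration, not Torque code): two separate
write() calls on a stream socket can arrive in a single read() on
the other end, just as the 20-byte StatusQue and 14-byte Disconnect
requests arrived above as one 34-byte read:
-----
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    char buf[256];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 1;

    /* Two "requests", written separately, as Maui does. */
    write(sv[0], "+2+12+20+4root+0+0+0", 20);  /* StatusQue  */
    write(sv[0], "+2+12+59+4root", 14);        /* Disconnect */

    /* One read on the "server" side picks up both requests. */
    ssize_t n = read(sv[1], buf, sizeof(buf));
    printf("read %zd bytes\n", n);  /* typically: read 34 bytes */
    return 0;
}
-----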


The end?  Nope, let's hack some code!
=====================================

What now?  Torque should be fixed to handle such coalesced requests.
The remainder of this file describes how it was done and why it was
done this particular way.

At the lowest level, incoming requests are dispatched inside the
function dis_request_read() in server/dis_read.c.  This function is
good at handling a single request, so we don't want to modify it to
handle multiple requests: the preferred way to handle them is to call
this function repeatedly until the incoming data buffer is exhausted.
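
Schematically, such a caller-side loop could look like this (a sketch
only: dis_data_remaining() and dispatch_request() are invented names,
and the absence of anything like the former in Torque's DIS layer is
exactly the obstacle discussed below):
-----
/* Sketch of repeated dispatch; dis_data_remaining() and
 * dispatch_request() are hypothetical. */
do {
    if (dis_request_read(sfds, request) != 0)
        break;                       /* malformed request: bail out */
    dispatch_request(sfds, request); /* handle one request */
} while (dis_data_remaining(sfds));  /* more buffered input? */
-----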

The only consumer of dis_request_read() in Torque is the function
process_request() in server/process_request.c.  That function is
invoked from two places, in different branches of preprocessor
conditionals (one for the MOM case and another for the non-MOM case).
I decided not to touch this function unless there is no other way to
cope with the problem: logically, it handles only one request, so it
should be invoked once for every request.

Grep'ping through Torque's code, I found that process_request()
serves as the callback that is passed to the functions init_network()
and add_conn().  It would be logical to add the missing functionality
to the functions that invoke these callbacks: they route the incoming
requests, so it should be their job to see whether more than one
request arrived in a single transaction and to handle such cases.
But strace shows that in our case read() gets the whole buffer, so
there is an additional buffering layer somewhere, and we cannot
determine whether there is more data to process given only the
descriptor number.

And since the buffering takes place inside the DIS routines, and
DIS_tcp_setup(), which initialises and resets the read and write
buffers, is called at the beginning of dis_request_read(), the
natural place to handle our case is within dis_request_read() itself.
Handling it anywhere else would require many modifications to
dis_request_read() so that it could detect outstanding data and
report it to the caller.

But there is another route: since Torque seems to support only
request-reply conversations without any pipelining, we can check
whether outstanding data is present at the end of request handling
and drop the connection if the client side posted two or more
requests in a row.  This also handles our case more or less
gracefully, since the client will read from a socket that has been
closed on the server side, and thus the read() in pbs_disconnect()
will not block.

Strictly speaking, read() will block, but only until the unhandled
request is processed.  This is not too bad, but it effectively
defeats Maui's timeouts and prolongs its inactivity.
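
The server-side check could take roughly the following shape (a
sketch under assumptions: dis_buffer_has_data() is an invented
accessor for the DIS read buffer -- a plain poll() on the descriptor
would not do, since the data may already have been read into user
space; close_conn() is Torque's routine for dropping a connection):
-----
/* Hypothetical end-of-request check: if the client pipelined a
 * second request, drop the connection instead of serving it. */
if (dis_buffer_has_data(sfds)) {
    /* The client posted another request before reading our reply:
     * close the connection so its read() sees EOF instead of
     * blocking until the server-side idle timeout. */
    close_conn(sfds);
}
-----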

Another modification is to create a new API function,
pbs_abort_connection(), that will not try to read outstanding data,
but will just close the connection and free the resources.  This
function is intended to be used only in emergency situations, and
Maui's situation looks like one -- something went wrong on the
server side and the connection should be dropped.  It can harm the
other side, since writes will fail with EPIPE and SIGPIPE will be
raised.  So we should ignore SIGPIPE when the server writes data
back.  Luckily, this is already done in server/pbsd_init.c: I think
it was protection from clients that closed their side of the
connection unexpectedly.

So, for now I am taking the easiest route: add pbs_abort_connection()
and use it in Maui.
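
A minimal sketch of what such a function could look like (simplified
and written from memory of libpbs's connection table; the field names
ch_socket and ch_inuse follow the OpenPBS-style struct connect_handle,
and the actual patch does more cleanup than this):
-----
/* Close the channel without sending Disconnect or draining the
 * socket, so it can never block.  Simplified sketch. */
int pbs_abort_connection(int connect)
{
    if (connect < 0)
        return -1;

    close(connection[connect].ch_socket);  /* drop the TCP channel */
    connection[connect].ch_inuse = 0;      /* mark the slot free   */
    return 0;
}
-----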

-- 
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"

