[torquedev] Re: [Mauiusers] [patch] Work around Maui freezes due to the slow responses of Torque server

Craig Macdonald craigm at dcs.gla.ac.uk
Mon Jun 23 06:30:52 MDT 2008


Hello Eygene,

I have experienced these pauses before. I resolved them by running nscd
on the master node. However, a workaround in the code is probably desirable.

C

Eygene Ryabinkin wrote:
> Good day.
>
> Sorry for cross-posting, but this concerns both Torque and Maui.
> At least, I was not able to find a workaround that touches
> only one product.
>
> I noticed that Maui on our production system would freeze for
> 15 minutes.  During this time no requests were processed, and strace
> showed that Maui was blocked inside the read() system call.
>
> Investigation showed that the Torque server was not responding to
> Maui within the 9-second timeout, so Maui tried to close the
> connection via pbs_disconnect().  But pbs_disconnect() posts another
> request (Disconnect), and Torque reads the two requests in a single
> read() call: they are effectively coalesced.  Maui timed out because
> Torque was busy processing other requests (its connections to the
> worker nodes timed out twice, which is enough to exceed the 9-second
> timeout).  So Maui's first request is not lost: Torque processes it,
> but only after Maui has called pbs_disconnect(), and as a result the
> Disconnect request is effectively lost.
>
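A minimal sketch of the coalescing described above, assuming nothing about
Torque's real wire format: a stream socket does not preserve message
boundaries, so two requests written separately can be delivered by a single
read() on the other side. The function name and the request strings are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo, unrelated to Torque's actual protocol.
 * Writes two separate "requests" to one end of a stream socket pair,
 * then performs ONE read() on the other end.
 * Returns the number of bytes that single read() delivered, or -1 on error. */
ssize_t bytes_from_single_read(const char *first, size_t first_len,
                               const char *second, size_t second_len)
{
    int sv[2];
    char buf[128];
    ssize_t n = -1;

    assert(first_len + second_len <= sizeof buf);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Two distinct write() calls: the sender sees two requests. */
    if (write(sv[0], first, first_len) == (ssize_t)first_len &&
        write(sv[0], second, second_len) == (ssize_t)second_len) {
        /* One read(): the receiver sees a byte stream, not messages. */
        n = read(sv[1], buf, sizeof buf);
    }

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

If the returned count is larger than the length of the first request, the
receiver got both requests in one read() and has to split them itself.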
> But pbs_disconnect() tries to read all outstanding data from the
> Torque server, and this leads to the blocking read(): once all
> outstanding data has been read, the final read() should return
> end-of-file, but it will not do so until Torque closes its side of
> the channel.  That happens only after a 15-minute timeout, because
> the Disconnect request was lost.
>
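The blocking behaviour described above can be sketched as follows. This is
not the code from the attached patches and not Torque's pbs_disconnect();
drain_until_eof() is a hypothetical helper that reads until end-of-file but
gives up when nothing arrives within a timeout, which is the point where an
unbounded read() would hang until the peer closes its end.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of Torque or Maui.
 * Reads and discards data from fd until the peer closes its end (EOF).
 * timeout_ms bounds each individual wait, not the total time.
 * Returns 1 on EOF, 0 if a wait timed out with the peer still open,
 * -1 on error. */
int drain_until_eof(int fd, int timeout_ms)
{
    char buf[256];

    assert(fd >= 0);

    for (;;) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int ready = poll(&p, 1, timeout_ms);

        if (ready < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (ready == 0)
            return 0;   /* no data and no EOF: a plain read() would block here */

        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return 1;   /* peer closed its side: end-of-file */
        /* n > 0: outstanding data discarded, keep draining */
    }
}
```

With a peer that has sent its reply but keeps the connection open, the
helper returns 0 after the timeout; once the peer closes, it returns 1.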
> The two attached patches cure the problem: Maui now drops the
> connection with a new function, pbs_abort_connection().  I am also
> attaching my internal notes on this problem: they contain strace
> output and my analysis, which may be of some help to developers.
>
> Since this new function changes the Torque API, I added a configure
> check for it, and Maui's use of the new function is conditional at
> compile time.
>
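A compile-time guard of this kind might look like the sketch below. The
macro name HAVE_PBS_ABORT_CONNECTION is an assumption based on autoconf's
usual AC_CHECK_FUNCS naming, the signature of pbs_abort_connection() is
assumed to mirror pbs_disconnect(int), and drop_server_connection() is a
hypothetical wrapper. The two pbs_* functions here are stand-ins so that
the sketch compiles without Torque; real code would include pbs_ifl.h and
link against the Torque library instead.

```c
#include <assert.h>

/* Records which path was taken, so the sketch can be checked without Torque. */
enum drop_path { DROP_NONE, DROP_GRACEFUL, DROP_ABORT };
enum drop_path last_drop_path = DROP_NONE;

/* Stand-in for Torque's pbs_disconnect(); not the real implementation. */
int pbs_disconnect(int connect)
{
    (void)connect;
    last_drop_path = DROP_GRACEFUL;
    return 0;
}

#ifdef HAVE_PBS_ABORT_CONNECTION
/* Stand-in with an ASSUMED signature; the real one is defined by the patch. */
int pbs_abort_connection(int connect)
{
    (void)connect;
    last_drop_path = DROP_ABORT;
    return 0;
}
#endif

/* Hypothetical wrapper: use the new call when configure found it,
 * otherwise fall back to the ordinary disconnect. */
int drop_server_connection(int connect)
{
    assert(connect >= 0);
#ifdef HAVE_PBS_ABORT_CONNECTION
    return pbs_abort_connection(connect);
#else
    return pbs_disconnect(connect);
#endif
}
```

Built without the macro, the wrapper takes the pbs_disconnect() path, so
the same source still compiles against an unpatched Torque.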
> I have been evaluating this patch for two days: it has shown no
> problems so far.  Moreover, it cures the original freeze ;))
>
>
> A side note: I also changed configure.ac at the line where the
> Makefiles for the various batch systems are included into the main
> file.  It is not related to the current problem, but autoconf 2.61
> fails to properly substitute multiple variables on one line when
> those variables are replaced by the contents of a file
> (AC_SUBST_FILE).  So, whether or not these patches are accepted,
> it would be good to take a look at line 21 of Makefile.in:
> -----
> @ll_definitions@@sdr_definitions@@pbs_definitions@@sge_definitions@@lsf_definitions@@mx_definitions@@pcre_definitions@
> -----
> These variables would be better placed on separate lines.
>
>
> The diff chunks for configure (the longest part of the Maui
> patch) can be dropped, since you will likely generate your own
> configure if the patch is accepted.
>
> Sorry for such a long letter ;))  And thanks for your patience!
