[torquedev] Re: [Mauiusers] [patch] Work around Maui freezes due to
the slow responses of Torque server
rea+maui at grid.kiae.ru
Mon Jun 23 06:41:40 MDT 2008
Craig, good day.
Mon, Jun 23, 2008 at 01:30:52PM +0100, Craig Macdonald wrote:
> I have experienced these pauses before.
The 15-minute ones, where Maui blocked on read()?
> This was resolved by using nscd on the master node.
In my case I can clearly see from the strace of pbs_server that
select() just returns many descriptors that are ready for reading.
But pbs_server then fails to contact two cluster nodes,
each with a 5-second timeout, and Maui times out 1 second before
its request gets handled. So my problem seems to be unrelated
to NSCD (and LDAP; I assume you mean that you use LDAP
authentication and NSS). I had very bad luck with NSCD and LDAP
in the past (with RHEL 3.x), so I am not very eager
to test it once again: back then nscd just got stuck at some point
in its operation, so nodes were almost completely unresponsive to
external logins.
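
To make the timing concrete, here is a rough sketch of the arithmetic, not
anything taken from the pbs_server or Maui sources. It assumes the server
handles ready descriptors strictly one after another and that both
unreachable nodes happen to be serviced before Maui's request. The 5-second
node timeout is from the strace; the 9-second client timeout is only
inferred from "1 second before", and the 0.1-second cost of Maui's own
request is made up.

```python
# Toy model of a single-threaded server that handles ready descriptors
# one at a time. Only the 5-second node timeout comes from the strace;
# the other numbers are assumptions made for illustration.

NODE_TIMEOUT = 5.0    # seconds the server blocks per unreachable node
CLIENT_TIMEOUT = 9.0  # assumed Maui-side timeout, 1 second short of the stall


def service_start_times(costs):
    """Return the time at which each queued request starts being handled.

    `costs` lists, in service order, how long each request blocks the
    server. Requests are handled strictly one after another.
    """
    starts = []
    clock = 0.0
    for cost in costs:
        starts.append(clock)
        clock += cost
    return starts


# Two unreachable nodes are serviced first, then the scheduler's request.
queue = [NODE_TIMEOUT, NODE_TIMEOUT, 0.1]
starts = service_start_times(queue)
maui_start = starts[-1]                      # time the server gets to Maui
maui_gives_up = maui_start > CLIENT_TIMEOUT  # client times out before that

print(f"Maui request reached after {maui_start:.0f} s; "
      f"client timeout is {CLIENT_TIMEOUT:.0f} s; gives up: {maui_gives_up}")
```

With these numbers the server reaches Maui's request 10 seconds after
select() returns, so a client that waits only 9 seconds has already given up.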
> However a workaround in the code is probably desirable.
Maybe my case is not related to yours. Will you be able to test
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"