[torquedev] Re: [Mauiusers] [patch] Work around Maui freezes due to
the slow responses of Torque server
Eygene Ryabinkin
rea+maui at grid.kiae.ru
Mon Jun 23 06:41:40 MDT 2008
Craig, good day.
Mon, Jun 23, 2008 at 01:30:52PM +0100, Craig Macdonald wrote:
> I have experienced these pauses before.
Pauses of 15 minutes, where Maui blocked on read()?
> This was resolved by using nscd on the master node.
In my case, the strace of pbs_server clearly shows that select() just
returns many descriptors that have data to read. The server then fails
to contact two cluster nodes, each with a 5-second timeout, and Maui
times out 1 second before its request gets handled. So my problem seems
to be unrelated to NSCD (and LDAP; I assume you mean that you use LDAP
authentication and NSS). I had very bad luck with NSCD and LDAP in the
past (with RHEL 3.x), so I am not very eager to test it again: back
then nscd simply got stuck at some point, and the nodes became almost
completely unresponsive to external logins.
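The timing described above can be illustrated with a small model. This is only a sketch, not Torque or Maui code: it assumes the server handles ready requests strictly one after another, that the two requests hitting the unreachable nodes are queued ahead of the scheduler's request, and that the client timeout is 9 seconds (inferred from "1 second before" the 10 seconds the two nodes cost). The function name and the queue layout are hypothetical.

```python
# Hypothetical model, not Torque or Maui code: a single-threaded server
# handles ready requests strictly one after another.

def handling_start_times(service_times):
    """Return the time at which each queued request starts being handled."""
    starts = []
    clock = 0.0
    for duration in service_times:
        starts.append(clock)
        clock += duration
    return starts


NODE_TIMEOUT = 5.0    # seconds lost per unreachable node, as seen in the strace
CLIENT_TIMEOUT = 9.0  # assumed value: 1 second less than two node timeouts

# Assumed queue order: two requests that hit dead nodes, then the scheduler's.
queue = [NODE_TIMEOUT, NODE_TIMEOUT, 0.1]
starts = handling_start_times(queue)

scheduler_start = starts[-1]
timed_out = scheduler_start > CLIENT_TIMEOUT

print(f"scheduler request reached at t={scheduler_start:.1f}s, "
      f"client gives up at t={CLIENT_TIMEOUT:.1f}s, timed out: {timed_out}")
```

Under these assumptions the scheduler's request is reached only after both node timeouts have elapsed, so any client timeout shorter than their sum expires first, whatever the name service setup is.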
> However a workaround in the code is probably desirable.
Maybe my case is not related to yours. Would you be able to test
the patches?
Thank you!
--
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"