[Mauiusers] [patch] Work around Maui freezes due to the slow responses of Torque server

Eygene Ryabinkin rea+maui at grid.kiae.ru
Mon Jun 23 06:41:40 MDT 2008


Craig, good day.

Mon, Jun 23, 2008 at 01:30:52PM +0100, Craig Macdonald wrote:
> I have experienced these pauses before.

The 15-minute ones, where Maui blocks on read()?

> This was resolved by using nscd on the master node.

In my case the strace of pbs_server clearly shows that select()
returns many descriptors that are ready for reading.  But pbs_server
then fails to contact two cluster nodes, each with a 5-second timeout,
and Maui times out 1 second before its request gets handled.  So my
problem seems to be unrelated to NSCD (and LDAP; I assume you mean
that you use LDAP authentication and NSS).  I had very bad luck with
NSCD and LDAP in the past (with RHEL 3.x), so I am not very eager to
test it again: back then nscd would get stuck at some point of its
operation, leaving the nodes almost completely unresponsive to
external logins.

> However a workaround in the code is probably desirable.

Maybe my case is not related to yours.  Will you be able to test
the patches?
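
To illustrate the general kind of workaround (this is not the patch
itself, just a minimal sketch under my own assumptions): a client can
avoid blocking indefinitely in read() by waiting on the descriptor with
select() and a timeout first.  The names read_with_timeout and
READ_TIMED_OUT are hypothetical and do not come from the Maui or Torque
sources.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical return code for "no data arrived within the timeout". */
#define READ_TIMED_OUT (-2)

/*
 * Wait up to timeout_sec seconds for fd to become readable, then read.
 * Returns the number of bytes read, 0 on end-of-file, -1 on error
 * (errno is set), or READ_TIMED_OUT if nothing arrived in time.
 * fd must be smaller than FD_SETSIZE.
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_sec)
{
    fd_set readfds;
    struct timeval tv;
    int ready;

    do {
        /* select() may modify both arguments, so rebuild them on
         * every attempt.  Note that a retry after a signal restarts
         * the full timeout; this keeps the sketch simple. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    } while (ready < 0 && errno == EINTR);

    if (ready < 0)
        return -1;              /* select() itself failed */
    if (ready == 0)
        return READ_TIMED_OUT;  /* peer did not answer in time */

    return read(fd, buf, len);  /* data or EOF is available now */
}
```

The caller then decides what to do on READ_TIMED_OUT, for example
dropping the connection and retrying on the next scheduling iteration
instead of hanging.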

Thank you!
-- 
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"

