[torquedev] Re: [Mauiusers] [patch] Work around Maui freezes due to
the slow responses of Torque server
Craig Macdonald
craigm at dcs.gla.ac.uk
Mon Jun 23 07:00:54 MDT 2008
Eygene Ryabinkin wrote:
> Craig, good day.
>
> Mon, Jun 23, 2008 at 01:30:52PM +0100, Craig Macdonald wrote:
>
>> I have experienced these pauses before.
>>
>
> The 15-minute ones, where Maui blocked on read()?
>
Yes, absolutely. See
http://www.clusterresources.com/pipermail/torquedev/2007-February/000495.html
IIRC Maui says it's doing a non-blocking read, but that's not the case in
pbs_disconnect.
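
For illustration only, here is a minimal sketch of the difference between
a read that can hang forever and one bounded by a timeout. It is not Maui
or Torque code: read_with_timeout and the socket pair are hypothetical
stand-ins, and it assumes a platform where socket.socketpair() is
available.

```python
import select
import socket


def read_with_timeout(sock, nbytes, timeout_s):
    """Return up to nbytes from sock, or None if no data arrives in time.

    A plain sock.recv() on a blocking socket waits indefinitely when the
    peer stays silent; waiting in select() first puts a bound on that.
    """
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        return None  # peer sent nothing within timeout_s seconds
    return sock.recv(nbytes)


if __name__ == "__main__":
    client, server = socket.socketpair()
    # The "server" end is silent, so the bounded read gives up.
    print(read_with_timeout(client, 64, 0.2))
    # Once the "server" answers, the same call returns the data.
    server.sendall(b"ok")
    print(read_with_timeout(client, 64, 0.2))
    client.close()
    server.close()
```

A disconnect path that calls recv() or read() directly, without such a
guard, can block for as long as the other side stays quiet.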
>> This was resolved by using nscd on the master node.
>>
>
> In my case I can clearly see from the strace of pbs_server that it just
> gets many readable descriptors back from the select() call. But it then
> fails to contact two cluster nodes, each with a 5-second timeout, and
> Maui times out 1 second before its request gets handled. So my problem
> seems to be unrelated to NSCD (and LDAP; I assume you mean that you use
> LDAP authentication and NSS). I had very bad luck with NSCD and LDAP
> in the past (with RHEL 3.x), so I am not very eager to test it again:
> back then nscd just got stuck at some point of its operation, leaving
> the nodes almost completely unresponsive to external logins.
>
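
The timing described above can be sketched as simple arithmetic. This
assumes the two node timeouts run back to back and nothing else delays
the request, which the message does not state; the function names and
the 9-second client timeout are hypothetical, the latter chosen only
because it is 1 second short of the combined wait.

```python
def wait_before_service(blocking_timeouts_s):
    """Seconds a queued request waits if the server works through each
    unreachable node in turn, blocking for its full timeout."""
    return sum(blocking_timeouts_s)


def client_gives_up_first(client_timeout_s, blocking_timeouts_s):
    """True if the client's own timeout expires before the server gets
    around to handling its request."""
    return client_timeout_s < wait_before_service(blocking_timeouts_s)


if __name__ == "__main__":
    node_timeouts = [5, 5]  # two unreachable nodes, 5 seconds each
    print(wait_before_service(node_timeouts))
    print(client_gives_up_first(9, node_timeouts))   # assumed client timeout
    print(client_gives_up_first(11, node_timeouts))  # a longer one, for contrast
```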
We use NIS for authentication. I didn't manage to strace pbs_server; I
just presumed pbs_server was doing some lookup. I could easily trigger
the pause by submitting about 50 jobs at once.
> Maybe my case is not related to yours. Will you be able to test
> the patches?
>
I'm sorry, I'm unable to test such a patch, as I don't have root access
on our cluster machines.
C