[Mauiusers] [patch] Work around Maui freezes due to the slow responses of Torque server

Craig Macdonald craigm at dcs.gla.ac.uk
Mon Jun 23 07:00:54 MDT 2008


Eygene Ryabinkin wrote:
> Craig, good day.
>
> Mon, Jun 23, 2008 at 01:30:52PM +0100, Craig Macdonald wrote:
>   
>> I have experienced these pauses before.
>>     
>
> 15 minutes one where Maui blocked on read()?
>   
Yes, absolutely. See 
http://www.clusterresources.com/pipermail/torquedev/2007-February/000495.html
IIRC, Maui says it's doing a non-blocking read, but that's not the case in 
pbs_disconnect.
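
To illustrate the difference between a read() that can block indefinitely and
one bounded by a timeout, here is a minimal sketch. It is in Python rather than
the C that Maui and Torque are written in, and it is not taken from either
code base: read_with_timeout and demo are hypothetical names, and the 0.2
second timeout is an arbitrary value chosen for the demonstration.

```python
import select
import socket


def read_with_timeout(sock, nbytes, timeout_s):
    """Return up to nbytes from sock, or None if nothing arrives in time.

    select() waits at most timeout_s seconds for the socket to become
    readable, so the caller cannot hang forever on a silent peer.
    """
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        return None  # timed out: the peer never answered
    return sock.recv(nbytes)


def demo():
    # A connected local socket pair stands in for the client/server link
    # (an assumption made purely for illustration).
    client, server = socket.socketpair()
    try:
        # The "server" is silent, so a plain recv() would block here.
        silent = read_with_timeout(client, 64, 0.2)
        # Once the "server" replies, the same call returns the data.
        server.sendall(b"reply")
        answered = read_with_timeout(client, 64, 0.2)
        return silent, answered
    finally:
        client.close()
        server.close()
```

Calling demo() returns a pair: the result of reading from a silent peer, then
the result after the peer has sent something. A plain recv() in place of the
first call would wait for as long as the peer stayed silent.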

>> This was resolved by using nscd on the master node.
>>     
>
> In my case I clearly see from the strace of pbs_server that it just
> receives many descriptors that have something to read from via the
> select() call.  But it then fails to contact two cluster nodes,
> each one with 5 seconds timeout; and Maui times out 1 second before
> its request goes to be handled.  So my problem seems to be unrelated
> to the NSCD (and LDAP; I assume you mean that you use LDAP
> authentication and NSS).  I had very bad luck with NSCD and LDAP
> in the past (with RHEL 3.x), so I am not feeling myself very eager
> to test it once again: in the past nscd just got stuck at some point
> of its operation, so nodes were almost completely unresponsive to
> the external logins.
>   


We use NIS for authentication. I didn't manage to strace pbs_server; I 
just presumed pbs_server
was doing some lookup. I could easily trigger it by submitting about 50 jobs 
at once.

> May be my case is not related to yours.  Will you be able to test
> the patches?
>   
I'm sorry, I'm unable to test such a patch, as I don't have root access 
on our cluster machines.

C

