[torquedev] read_nonblocking_socket() wtf?

Garrick Staples garrick at usc.edu
Fri Jul 16 14:31:32 MDT 2010


On Fri, Jul 16, 2010 at 04:29:32PM -0400, Michael Barnes alleged:
> 
> On Jul 16, 2010, at 4:06 PM, Garrick Staples wrote:
> 
> > I've been looking into a problem regarding maui sometimes hanging in a read()
> > on its socket to pbs_server. The hangs happen in pbs_disconnect() after a
> > normal timeout. I thought this was weird because we define read() to be
> > read_nonblocking_socket(), which is a nice little 30-second loop around a
> > nonblocking read().
> > 
> > The define to read_nonblocking_socket() replaces a blocking read wrapped with
> > an ALRM of pbs_tcp_timeout seconds.
> > 
> > So why would maui hang on a non-blocking read()? Is there something broken in
> > my kernel? What a mystery!
> > 
> > It turns out that read_nonblocking_socket does the exact opposite of what it
> > says because the fcntl() call is commented out! WTF? A neat little ALRM-wrapped
> > read() call is replaced with a broken hard-wired implementation.
> > 
> > I'm on 2.1.x. Is this all fixed up in later branches?
> 
> 
> I'm on 2.1.10, and my read_nonblocking_socket() has the proper
> fcntl call, but we still see hangs from Maui from time to time.
> I believe the hangs are on the order of 15 minutes or so.

You uncommented the fcntl() call yourself?
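
For reference, a minimal sketch of what a nonblocking read loop with a
deadline can look like once the fcntl() call is actually made. This is not
the TORQUE source: set_nonblocking() and read_with_deadline() are
hypothetical names, and the 10 ms pause between retries is an arbitrary
choice for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not TORQUE code): set O_NONBLOCK on fd. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical sketch: retry a nonblocking read() until data arrives,
 * EOF is seen, a real error occurs, or timeout_sec seconds have passed.
 * Returns the byte count, 0 on EOF, or -1 on error. On timeout it
 * returns -1 with errno set to ETIMEDOUT.
 */
static ssize_t read_with_deadline(int fd, void *buf, size_t count,
                                  int timeout_sec)
{
    time_t start = time(NULL);
    struct timespec pause = { 0, 10 * 1000 * 1000 }; /* 10 ms */

    /* Without this call the read() below would block indefinitely. */
    if (set_nonblocking(fd) == -1)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0)
            return n; /* data or EOF */
        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1; /* real error */
        if (time(NULL) - start >= timeout_sec) {
            errno = ETIMEDOUT;
            return -1;
        }
        nanosleep(&pause, NULL); /* avoid spinning at full speed */
    }
}
```

If the fcntl() line is dropped, the loop never gets a chance to check the
deadline, because the first read() on an idle socket simply blocks. A call
such as read_with_deadline(sock, buf, sizeof buf, 30) would correspond to
the 30-second loop described above.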

I just changed the read() to read_blocking_socket() and am testing it now.
Keeping the alarm() calls will be nice.
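
The ALRM-wrapped blocking read mentioned in this thread can be sketched
roughly as follows. This is an illustration under stated assumptions, not
the TORQUE implementation: read_with_alarm() and on_alarm() are
hypothetical names, and timeout_sec stands in for a value such as
pbs_tcp_timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Empty handler: delivery of SIGALRM alone is enough to interrupt read(). */
static void on_alarm(int signo)
{
    (void)signo;
}

/*
 * Hypothetical sketch: a blocking read() guarded by alarm().
 * Returns whatever read() returns. If the alarm fires while read() is
 * blocked, read() returns -1 with errno set to EINTR.
 */
static ssize_t read_with_alarm(int fd, void *buf, size_t count,
                               unsigned int timeout_sec)
{
    struct sigaction sa, old_sa;
    ssize_t n;
    int saved_errno;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* no SA_RESTART, so read() is not restarted */
    if (sigaction(SIGALRM, &sa, &old_sa) == -1)
        return -1;

    alarm(timeout_sec);               /* arm the timer */
    n = read(fd, buf, count);         /* blocks until data, EOF, or signal */
    saved_errno = errno;
    alarm(0);                         /* cancel any pending alarm */
    sigaction(SIGALRM, &old_sa, NULL); /* restore the previous handler */

    errno = saved_errno;
    return n;
}
```

One caveat with this pattern: if SIGALRM is delivered after alarm() is
called but before read() is entered, the read() is not interrupted and can
still block. It also takes over the process-wide SIGALRM timer for the
duration of the call.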


-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Life is Good!