[torquedev] read_nonblocking_socket() wtf?

Michael Barnes Michael.Barnes at jlab.org
Fri Jul 16 14:29:32 MDT 2010


On Jul 16, 2010, at 4:06 PM, Garrick Staples wrote:

> I've been looking into a problem regarding maui sometimes hanging in a read()
> on its socket to pbs_server. The hangs happen in pbs_disconnect() after a
> normal timeout. I thought this was weird because we define read() to be
> read_nonblocking_socket(), which is a nice little 30-second loop around a
> nonblocking read().
> 
> The define to read_nonblocking_socket() replaces a blocking read wrapped with
> an ALRM of pbs_tcp_timeout seconds.
> 
> So why would maui hang on a non-blocking read()? Is there something broken in
> my kernel? What a mystery!
> 
> It turns out that read_nonblocking_socket does the exact opposite of what it
> says because the fcntl() call is commented out! WTF? A neat little ALRM-wrapped
> read() call is replaced with a broken hard-wired implementation.
> 
> I'm on 2.1.x. Is this all fixed up in later branches?
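For reference, the older approach that the define replaced (a blocking read() bounded by an ALRM of pbs_tcp_timeout seconds) can be sketched roughly as below. This is a minimal illustration, not Torque's code: the helper name `read_with_alarm` and its `timeout_sec` parameter are hypothetical, with `timeout_sec` standing in for pbs_tcp_timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* The handler does nothing; its only job is to interrupt the blocked read(). */
static void on_alarm(int sig) { (void)sig; }

/* Hypothetical helper: blocking read() bounded by SIGALRM after timeout_sec
 * seconds. Returns bytes read, 0 on EOF, or -1 with errno set (EINTR when
 * the alarm fired before any data arrived). */
ssize_t read_with_alarm(int fd, void *buf, size_t len, unsigned timeout_sec)
{
    struct sigaction act, old;
    memset(&act, 0, sizeof act);
    act.sa_handler = on_alarm;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;              /* no SA_RESTART: read() must fail with EINTR */
    if (sigaction(SIGALRM, &act, &old) == -1)
        return -1;

    alarm(timeout_sec);
    ssize_t n = read(fd, buf, len);
    int saved = errno;             /* preserve read()'s errno across cleanup */
    alarm(0);                      /* cancel any alarm still pending */
    sigaction(SIGALRM, &old, NULL);
    errno = saved;
    return n;
}
```

The key detail is installing the handler without SA_RESTART, so the interrupted read() returns -1 with EINTR instead of being transparently restarted and blocking again.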


I'm on 2.1.10, and my read_nonblocking_socket() has the proper
fcntl call, but we still see hangs from Maui from time to time.
I believe the hangs are on the order of 15 minutes or so.
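What a "loop around a nonblocking read()" has to look like can be sketched as follows. This is a minimal sketch assuming POSIX fcntl(), poll() and a millisecond timeout; `read_nonblocking_sketch` is a hypothetical name and this is not Torque's actual read_nonblocking_socket(). The fcntl() call setting O_NONBLOCK is the step that was commented out in the 2.1.x code described above; without it, the first read() simply blocks and the timeout logic never runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Milliseconds elapsed on the monotonic clock since *start. */
static long elapsed_ms(const struct timespec *start)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) * 1000L
         + (now.tv_nsec - start->tv_nsec) / 1000000L;
}

/* Hypothetical sketch: read() from fd, giving up after timeout_ms.
 * Returns bytes read, 0 on EOF, or -1 with errno set (ETIMEDOUT on timeout). */
ssize_t read_nonblocking_sketch(int fd, void *buf, size_t len, long timeout_ms)
{
    /* The crucial step: make the descriptor nonblocking so read()
     * returns EAGAIN instead of waiting forever. */
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;

    struct timespec start;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;                      /* data or EOF */
        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;                     /* a real error */

        long left = timeout_ms - elapsed_ms(&start);
        if (left <= 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        /* Wait until fd is readable or the remaining time runs out. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, (int)left) == -1 && errno != EINTR)
            return -1;
    }
}
```

This sketch leaves O_NONBLOCK set on the descriptor after it returns; a caller that relies on blocking semantics elsewhere would need to restore the saved flags.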

-mb

--
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| Scientific Computing Group
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------