[torquedev] [Bug 81] Timeouts caused by hanging Disconnect requests

bugzilla-daemon at supercluster.org
Thu Sep 23 12:10:52 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=81

--- Comment #4 from Simon Toth <SimonT at mail.muni.cz> 2010-09-23 12:10:52 MDT ---
(In reply to comment #3)
> (In reply to comment #2)
> > I'm pretty confident that it will solve the problem.
> > 
> > What I'm less confident about is whether this is actually valid. The whole point of
> > the code block is to actually confirm the disconnect. If we don't need that,
> > then we can simply skip the read part completely and just close the socket.
> 
> You have a point. It does not look like we really care about what is read. The
> block simply looks for read to return 0 or -1 and then it does not act on any
> data that may be received. 
> 
> But if you look at process_request on the server side you will see that when
> PBS_BATCH_Disconnect is received process_request calls close_conn which calls
> close() on the socket. The read in pbs_disconnect will receive the FIN from the
> close and we now know the server side of the socket is done.

Yes and that's exactly what the read is waiting for.
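
To make the wait concrete, here is a minimal sketch of "read until the peer
closes, but give up after a timeout". It is not the pbs_disconnect code; the
helper name wait_for_peer_close and its return convention are assumptions made
up for illustration. Any data that arrives is discarded, since only the EOF
matters.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not TORQUE code: wait for the peer to close its end
 * of fd. Returns 1 on EOF, 0 on timeout, -1 on error. The timeout applies
 * to each wait, so it restarts whenever data arrives. */
int wait_for_peer_close(int fd, int timeout_ms)
{
    char buf[256];

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc == 0)
            return 0;                 /* timed out: no EOF seen yet */
        if (rc < 0) {
            if (errno == EINTR)
                continue;             /* interrupted by a signal, retry */
            return -1;
        }

        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)
            return 1;                 /* EOF: the peer closed the socket */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* n > 0: data is read and ignored, keep waiting for EOF */
    }
}
```

A caller would close its own descriptor afterwards regardless of the result;
a return value of 0 corresponds to the hang discussed in this bug.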

> > Plus I'm betting that bug 76 is actually caused by the same problem as this
> > one. On the other side of the connection there is a forked process that still
> > holds the open socket (although it shouldn't).
> 
> There are no forked processes. It is all handled in process_request.

Well, sort of. The fork actually happens in the previous process_request call.
It took two days of running strace to find this, but if a disconnect follows a
run request (the disconnect can come from a different source), this is what
happens:

- processing run request
- forking for send_job
- sending reply
- processing disconnect
- closing socket
- send_job is still running and still holds the socket, so EOF is never
detected on the other side
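
The sequence above can be reproduced outside of TORQUE with a small sketch.
It uses a socketpair as a stand-in for the client/server connection and a
forked child as a stand-in for send_job; the function name client_sees_eof and
its parameters are assumptions for illustration only. The point it shows is
that close() in the parent does not produce EOF on the other end while a
forked child still holds a copy of the descriptor.

```c
#include <poll.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo, not TORQUE code. Returns 1 if the "client" end sees EOF
 * within timeout_ms after the "server" closes its socket, 0 if it does not,
 * -1 on setup failure. If child_closes_socket is nonzero, the forked child
 * closes its inherited copy of the server socket right after fork(). */
int client_sees_eof(int child_closes_socket, int timeout_ms)
{
    int sv[2];                        /* sv[0]: client end, sv[1]: server end */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid_t pid = fork();               /* stand-in for the fork for send_job */
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);                 /* the child is not the client */
        if (child_closes_socket)
            close(sv[1]);             /* drop the inherited server socket */
        pause();                      /* simulate a long-running send_job */
        _exit(0);
    }

    close(sv[1]);                     /* server processes the disconnect */

    struct pollfd pfd = { .fd = sv[0], .events = POLLIN };
    int seen = 0;
    if (poll(&pfd, 1, timeout_ms) > 0) {
        char c;
        seen = (read(sv[0], &c, 1) == 0);   /* read() == 0 means EOF */
    }

    kill(pid, SIGKILL);               /* clean up the child */
    waitpid(pid, NULL, 0);
    close(sv[0]);
    return seen;
}
```

With child_closes_socket set to 0 the client times out, which matches the
behaviour seen in strace; with it set to 1 the client sees EOF as soon as the
child has closed its copy.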
