[torquedev] [Bug 81] Timeouts caused by hanging Disconnect requests

bugzilla-daemon at supercluster.org
Thu Sep 23 23:11:17 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=81

--- Comment #8 from Simon Toth <SimonT at mail.muni.cz> 2010-09-23 23:11:17 MDT ---
(In reply to comment #7)
> > Well, sort of. The fork actually happens in the previous process_request call.
> > This took two days of running strace, but if you have the disconnect following
> > a run request (can be a different source), then what will happen is:
> > 
> > - processing run request
> > - forking for send_job
> > - sending reply
> > - processing disconnect
> > - closing socket
> > - send_job still running and holding the socket and therefore EOF is not
> > detected on the other side
> 
> What client utility are you running when this happens?


This is the qsub -> server -> node + pbs_sched interaction (for interactive jobs).

I made one mistake: the reply is actually sent after send_job finishes.
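
The underlying effect (a forked child keeping an inherited socket open, so the peer never sees EOF when the parent closes its copy) can be reproduced outside TORQUE. The following is only a minimal sketch in Python, not TORQUE code: demo(), eof_seen() and the child_closes_copy flag are hypothetical names made up for the illustration, and the sleeping child merely stands in for the forked send_job.

```python
import os
import select
import socket
import time


def eof_seen(sock, timeout):
    """Return True if EOF from the peer is visible within `timeout` seconds."""
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable) and sock.recv(1) == b""


def demo(child_lifetime=1.0, child_closes_copy=False):
    # `client` plays the remote side, `server` the daemon's end of the connection.
    client, server = socket.socketpair()

    pid = os.fork()
    if pid == 0:
        # Child: stand-in for the forked send_job (assumption for this sketch).
        # It inherits a copy of the server-side descriptor.
        client.close()
        if child_closes_copy:
            server.close()
        time.sleep(child_lifetime)
        os._exit(0)

    # Parent: "processing disconnect" -> "closing socket".
    server.close()

    # While the child still holds its copy, the peer gets no EOF.
    before = eof_seen(client, 0.2)

    # Once the child exits, the last copy is gone and EOF is delivered.
    os.waitpid(pid, 0)
    after = eof_seen(client, 0.2)

    client.close()
    return before, after


if __name__ == "__main__":
    print(demo())                         # child keeps the inherited socket
    print(demo(child_closes_copy=True))   # child drops its copy right away
```

In the first call the peer sees EOF only after the child has exited, which matches the hang described above. The second call shows the general behaviour when the child closes the inherited descriptor immediately; whether that is safe for send_job is a separate question that this sketch does not answer.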
