Bug 81 - Timeouts caused by hanging Disconnect requests
Status: RESOLVED FIXED
Product: TORQUE
Component: pbs_server
Version: 2.4.x
Hardware: PC Linux
Importance: P5 normal
Assigned To: Glen
Reported: 2010-09-15 09:47 MDT by Simon Toth
Modified: 2010-09-23 23:11 MDT
CC: 3 users

Description Simon Toth 2010-09-15 09:47:29 MDT
When a disconnect request is preceded by a run-job request, the disconnect will
hang until the send_job fork finishes (because the forked child still holds a
reference to the closed socket).

This is specifically true for qsub and can lead to a state where no interactive
jobs can be run.

* qsub tries to disconnect and hangs, because send_job still holds the socket
* mom receives the job and tries to contact qsub
* send_job finishes on the server and exits
* qsub is finally unblocked, because no one holds the socket anymore
* mom has meanwhile timed out on the read request for the terminal type
* qsub is ready to talk to mom

Therefore each forked child on the server should close all connections (well,
except those related to the request being processed).
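
A minimal sketch of that idea, assuming a simple loop over the descriptor
table; the helper name and the MAX_FDS bound are hypothetical, not TORQUE's
actual connection-handling code:

#include <unistd.h>

#define MAX_FDS 1024  /* assumed upper bound on open descriptors */

/* Hypothetical helper, called in the child immediately after fork():
 * close every inherited descriptor except the one this request is being
 * served on, so that the parent's later close() drops the last reference
 * and the peer sees FIN right away. */
static void close_inherited_fds(int keep_fd)
  {
  int fd;

  for (fd = 3; fd < MAX_FDS; fd++)  /* 0-2 are stdin/stdout/stderr */
    {
    if (fd != keep_fd)
      close(fd);                    /* EBADF on unused fds is harmless */
    }
  }
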
Comment 1 Ken Nielson 2010-09-21 14:08:08 MDT
A patch for bug 76 has been applied and released in 2.4.11. You can try that
version and see if this problem is fixed. If so, we can count this bug as a
duplicate and resolve it.

Ken Nielson
Comment 2 Simon Toth 2010-09-22 05:48:04 MDT
I'm pretty confident that it will solve the problem.

What I'm not so confident about is whether this is actually valid. The whole
point of the code block is to confirm the disconnect. If we don't need that,
then we can simply skip the read part completely and just close the socket.

Plus I'm betting that bug 76 is actually caused by the same problem as this
one. On the other side of the connection there is a forked process that still
holds the socket open (although it shouldn't).
Comment 3 Ken Nielson 2010-09-23 11:23:33 MDT
(In reply to comment #2)
> I'm pretty confident that it will solve the problem.
> 
> What I'm not so confident about is whether this is actually valid. The whole
> point of the code block is to confirm the disconnect. If we don't need that,
> then we can simply skip the read part completely and just close the socket.

You have a point. It does not look like we really care about what is read. The
block simply looks for read() to return 0 or -1, and it does not act on any
data that may be received.

But if you look at process_request on the server side, you will see that when
PBS_BATCH_Disconnect is received, process_request calls close_conn, which calls
close() on the socket. The read() in pbs_disconnect will receive the FIN from
the close, and we then know the server side of the socket is done.
> 
> Plus I'm betting that bug 76 is actually caused by the same problem as this
> one. On the other side of the connection there is a forked process that still
> holds the socket open (although it shouldn't).

There are no forked processes. It is all handled in process_request.

Ken Nielson
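
In simplified form, the confirmation read described above looks roughly like
this; the function name is hypothetical, and this is a sketch of the pattern,
not the literal pbs_disconnect code:

#include <unistd.h>

/* Block until the server's close() delivers FIN (read() returns 0) or an
 * error occurs (read() returns -1).  Any bytes that do arrive are simply
 * discarded; only the EOF/error condition matters. */
static void wait_for_server_close(int sock)
  {
  char    buf[64];
  ssize_t rc;

  do
    {
    rc = read(sock, buf, sizeof(buf));
    } while (rc > 0);   /* 0 = EOF (server closed), -1 = error */

  close(sock);
  }
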
Comment 4 Simon Toth 2010-09-23 12:10:52 MDT
(In reply to comment #3)
> (In reply to comment #2)
> > I'm pretty confident that it will solve the problem.
> > 
> > What I'm not so confident about is whether this is actually valid. The whole
> > point of the code block is to confirm the disconnect. If we don't need that,
> > then we can simply skip the read part completely and just close the socket.
> 
> You have a point. It does not look like we really care about what is read. The
> block simply looks for read() to return 0 or -1, and it does not act on any
> data that may be received.
> 
> But if you look at process_request on the server side, you will see that when
> PBS_BATCH_Disconnect is received, process_request calls close_conn, which calls
> close() on the socket. The read() in pbs_disconnect will receive the FIN from
> the close, and we then know the server side of the socket is done.

Yes, and that's exactly what the read is waiting for.

> > Plus I'm betting that bug 76 is actually caused by the same problem as this
> > one. On the other side of the connection there is a forked process that still
> > holds the socket open (although it shouldn't).
> 
> There are no forked processes. It is all handled in process_request.

Well, sort of. The fork actually happens in the previous process_request call.
This took two days of running strace, but if you have a disconnect following a
run request (it can come from a different source), then what happens is:

- processing the run request
- forking for send_job
- sending the reply
- processing the disconnect
- closing the socket
- send_job is still running and holding the socket, so EOF is not
  detected on the other side
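
The underlying POSIX behaviour can be reproduced in isolation. In this toy
program (standing in for server, send_job and qsub; not TORQUE code), the
parent's close() does not deliver EOF to the peer while a forked child still
holds the descriptor:

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* A socket stays open until every process holding it has closed it.
 * The child plays send_job, the parent's close() plays close_conn(),
 * and the final read() plays qsub waiting for EOF. */
int main(void)
  {
  int  sv[2];
  char c;

  socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

  if (fork() == 0)
    {                  /* child: inherits sv[0], like the send_job fork */
    sleep(10);         /* keeps the descriptor open while it "works" */
    _exit(0);          /* only now is the last reference released */
    }

  close(sv[0]);        /* parent "closes" the connection... */

  read(sv[1], &c, 1);  /* ...but this blocks for ~10 seconds, until the
                        * child exits, not until the parent's close() */
  puts("EOF finally arrived");
  return 0;
  }
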
Comment 5 Eygene Ryabinkin 2010-09-23 12:25:33 MDT
Sorry for jumping in, but isn't this the same issue that was raised in 2008,
  http://www.clusterresources.com/pipermail/torquedev/2008-June/001111.html
and was, to my knowledge, never actually fixed in the mainline code?
Comment 6 Ken Nielson 2010-09-23 13:51:40 MDT
(In reply to comment #5)
> Sorry for jumping in, but isn't this the same issue that was raised in 2008,
>   http://www.clusterresources.com/pipermail/torquedev/2008-June/001111.html
> and was, to my knowledge, never actually fixed in the mainline code?

It sounds like it is the same bug.

Ken Nielson
Comment 7 Ken Nielson 2010-09-23 13:52:38 MDT
> Well, sort of. The fork actually happens in the previous process_request call.
> This took two days of running strace, but if you have a disconnect following a
> run request (it can come from a different source), then what happens is:
> 
> - processing the run request
> - forking for send_job
> - sending the reply
> - processing the disconnect
> - closing the socket
> - send_job is still running and holding the socket, so EOF is not
>   detected on the other side

What client utility are you running when this happens?

Ken
Comment 8 Simon Toth 2010-09-23 23:11:17 MDT
(In reply to comment #7)
> > Well, sort of. The fork actually happens in the previous process_request call.
> > This took two days of running strace, but if you have a disconnect following a
> > run request (it can come from a different source), then what happens is:
> > 
> > - processing the run request
> > - forking for send_job
> > - sending the reply
> > - processing the disconnect
> > - closing the socket
> > - send_job is still running and holding the socket, so EOF is not
> >   detected on the other side
> 
> What client utility are you running when this happens?


This is the qsub->server->node + pbs_sched interaction (for interactive jobs).

I made one mistake above: the reply is actually sent only after send_job
finishes.