Bug 85 - Potential 4+ hour hang in pbs_server
Status: NEW
Product: TORQUE
Component: pbs_server
Version: 2.5.x
Platform: PC Linux
Importance: P5 enhancement
Assigned To: David Beer
Reported: 2010-10-05 14:42 MDT by David Beer
Modified: 2014-04-15 15:16 MDT
CC: 6 users

Description David Beer 2010-10-05 14:42:20 MDT
In src/lib/Libnet/net_client.c, when a socket can't be accessed for normal
reasons, including operation-in-progress and timeout errors, the code keeps
retrying different possible sockets until it runs out. In some cases, such as
a node dying in the middle of communication, all of these retries will fail.
This is what is happening to the client.

In the current state of TORQUE (and this has been true for a long time), it
will retry 880 times, and each attempt can take up to 18 seconds (two
5-second timeouts and one 8-second timeout by default). This means pbs_server
can be stuck retrying against a dead node for 4.4 hours. I can't think of any
scenario where that would be acceptable.

The patch I sent makes the hard retry limit configurable in TORQUE, but my
personal opinion is that, since no one is likely to find a 4.4-hour wait
acceptable, we ought to change the default as well. I propose deciding on a
maximum number of retries and using that by default. What are your thoughts
on this?
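
For illustration only, here is a rough sketch of the bounded-retry idea. The
function name and the MAX_CONNECT_RETRIES constant are hypothetical, not the
actual net_client.c code:

#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical cap; this could instead come from a server config variable. */
#define MAX_CONNECT_RETRIES 10

static int connect_with_retry_limit(const struct sockaddr_in *addr)
  {
  int attempt;

  for (attempt = 0; attempt < MAX_CONNECT_RETRIES; attempt++)
    {
    int sock = socket(AF_INET, SOCK_STREAM, 0);

    if (sock < 0)
      return(-1);  /* descriptor/resource problem: retrying won't help */

    if (connect(sock, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
      return(sock);  /* connected */

    close(sock);  /* failed attempt: close and try the next one */
    }

  return(-1);  /* bounded failure instead of a multi-hour hang */
  }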
Comment 1 David Beer 2010-10-05 14:42:45 MDT
Josh Bernstein wrote:

I think a maximum number of retries makes sense here. How about something
like 10? It would also be nice if this were a server configuration variable.

Please do file a bug on this so it can be tracked in Bugzilla properly.

-Josh
Comment 2 Simon Toth 2010-10-06 00:00:12 MDT
If creating a socket fails on all 880 retries, then you can't really use the
software anyway. Sure, you can fall back after a certain number of retries,
but does that really help? If you can't create the socket in the first place,
the server will just move on to another request and create more havoc.
Comment 3 David Beer 2010-10-06 09:17:30 MDT
(In reply to comment #2)
> If creating a socket fails on all 880 retries, then you can't really use the
> software anyway. Sure, you can fall back after a certain number of retries,
> but does that really help? If you can't create the socket in the first place,
> the server will just move on to another request and create more havoc.

Actually, you can still use the software. You couldn't use it if this were
happening on every node, but if it happens on only one or two nodes out of
your entire cluster, then your pbs_server hangs endlessly while the rest of
your cluster goes unused. This is why a limit can be useful.
Comment 4 Simon Toth 2010-10-06 11:07:49 MDT
(In reply to comment #3)
> (In reply to comment #2)
> > If creating a socket fails on all 880 retries, then you can't really use
> > the software anyway. Sure, you can fall back after a certain number of
> > retries, but does that really help? If you can't create the socket in the
> > first place, the server will just move on to another request and create
> > more havoc.
> 
> Actually, you can still use the software. You couldn't use it if this were
> happening on every node, but if it happens on only one or two nodes out of
> your entire cluster, then your pbs_server hangs endlessly while the rest of
> your cluster goes unused. This is why a limit can be useful.

Sorry, you lost me. What is hanging? The server, or the node?

If the server is hanging because socket creation is failing, then it will
fail for all nodes. It's just like an out-of-memory error. Could you please
explain exactly what part of the code this refers to?
Comment 5 David Beer 2010-10-06 16:37:46 MDT
(In reply to comment #4)

> Sorry, you lost me. What is hanging? The server, or the node?

The server.

> If the server is hanging because socket creation is failing, then it will
> fail for all nodes. It's just like an out-of-memory error. Could you please
> explain exactly what part of the code this refers to?

The server is hanging because the node it is communicating with dies
mid-communication. Please read the first post on this ticket for more
information.
Comment 6 Simon Toth 2010-10-07 01:40:46 MDT
> > If the server is hanging because socket creation is failing, then it will
> > fail for all nodes. It's just like an out-of-memory error. Could you please
> > explain exactly what part of the code this refers to?
> 
> The server is hanging because the node it is communicating with dies
> mid-communication. Please read the first post on this ticket for more
> information.

OK, the real issue I'm pointing out here is that we shouldn't limit the
number of retries but instead handle the return values correctly. What
exactly is the return value of the bind() call in this case?
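
As an illustration of that idea, the retry decision could key off errno. The
grouping below is a hypothetical sketch, not taken from net_client.c:

#include <errno.h>

/* Decide from errno whether another attempt could plausibly succeed. */
static int errno_is_worth_retrying(int err)
  {
  switch (err)
    {
    case EINPROGRESS:   /* non-blocking connect still in flight */
    case ETIMEDOUT:     /* peer may recover */
    case ECONNREFUSED:  /* remote daemon may still be starting */
      return(1);

    default:            /* e.g. EADDRNOTAVAIL, EBADF: fail fast instead */
      return(0);
    }
  }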
Comment 7 Erich Focht 2010-11-04 08:48:13 MDT
(In reply to comment #6)
> OK, the real issue I'm pointing out here is that we shouldn't limit the
> number of retries but instead handle the return values correctly. What
> exactly is the return value of the bind() call in this case?

We're seeing this issue as well (with 2.5.3), and it is really annoying.

For the return code, here's a trace:
bind(11, {sa_family=AF_INET, sin_port=htons(301), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(11, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("10.188.11.238")}, 16) = -1 EINPROGRESS (Operation now in progress)

So connect() returns EINPROGRESS, then times out.
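
For reference, the usual POSIX pattern for bounding that wait after
EINPROGRESS looks roughly like this (a generic sketch, not TORQUE's actual
code):

#include <sys/select.h>
#include <sys/socket.h>

/* Wait a bounded time for a non-blocking connect() that returned
 * EINPROGRESS, then check whether it actually succeeded. */
static int wait_for_connect(int sock, int timeout_sec)
  {
  fd_set         wfds;
  struct timeval tv;
  int            err = 0;
  socklen_t      len = sizeof(err);

  FD_ZERO(&wfds);
  FD_SET(sock, &wfds);
  tv.tv_sec  = timeout_sec;
  tv.tv_usec = 0;

  /* wait at most timeout_sec for the socket to become writable */
  if (select(sock + 1, NULL, &wfds, NULL, &tv) <= 0)
    return(-1);  /* timed out (or select failed): caller can give up */

  /* writable does not mean connected: fetch the deferred error */
  if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
    return(-1);

  return(0);  /* connection established */
  }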

It's easy to test: start a job, then kill the job's head node.

BTW: we increased tcp_timeout to 120 since it's a rather big cluster, so
just reducing the number of retries is not quite ... useful.
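
(For scale, assuming each attempt can block for the full tcp_timeout: even a
reduced cap of 10 retries at 120 seconds each is still a 20-minute stall, so
the per-attempt timeout matters as much as the retry count.)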

Regards,
Erich
Comment 8 thzeiser 2010-12-23 01:45:41 MST
Is there any progress on this bug? The reported problem is a real
show-stopper, not merely a minor enhancement request!
Comment 9 Wade Colson 2014-04-15 15:16:03 MDT
*** Bug 260998 has been marked as a duplicate of this bug. ***