Bugzilla – Full Text Bug Listing
| Summary: | Potential 4+ hour hang in pbs_server | ||
|---|---|---|---|
| Product: | TORQUE | Reporter: | David Beer <dbeer> |
| Component: | pbs_server | Assignee: | David Beer <dbeer> |
| Status: | NEW | ||
| Severity: | enhancement | CC: | dbeer, efocht, SimonT, thzeiser, torquedev |
| Priority: | P5 | ||
| Version: | 2.5.x | ||
| Hardware: | PC | ||
| OS: | Linux | ||
Josh Bernstein wrote: I think the maximum number of retries makes sense here. How about something like 10? If this was a server configuration variable that would be a nice thing as well. Please do file a bug on this so it can be tracked in Bugzilla properly. -Josh
If creation of a socket fails (on all 880 retries), then you can't really use the software anyway. Sure, you can fall back after a certain number of retries, but does that really help you? You can't create the socket in the first place, so you will just make the server move on to another request and create more havoc.
(In reply to comment #2)
> If creation of a socket fails (on all 880 retries) then you can't really use
> the software anyway. Sure you can fall-back after certain amount of retries,
> but does that really help you? You can't create the socket in the first place,
> therefore you will just make the server go to another request and create more
> havoc.
Actually, you can still use the software. You couldn't use it if this were happening on every node, but if it happens only on one or two nodes out of your entire cluster, then your pbs_server is hanging endlessly and the rest of your cluster is going unused. This is why a limit can be useful.
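The limit being argued for here can be illustrated with a minimal sketch. This is not TORQUE code: `call_with_retry_limit`, `max_retries`, and `delay` are hypothetical names, and the default of 10 is simply the value Josh suggested above. The idea is that the caller gets control back after a bounded number of failures and can mark the one node as unreachable instead of blocking the whole server.

```python
import time


def call_with_retry_limit(attempt, max_retries=10, delay=0.0):
    """Call attempt() until it succeeds or max_retries is exhausted.

    Hypothetical helper, not part of TORQUE. Returns (True, result) on
    success and (False, last_error) once the limit is reached, so the
    caller can give up on this node and go on serving the others.
    """
    last_error = None
    for _ in range(max_retries):
        try:
            return True, attempt()
        except OSError as exc:  # socket(), bind() and connect() failures
            last_error = exc
            if delay:
                time.sleep(delay)
    return False, last_error
```

With a limit like this the worst-case stall per node is roughly `max_retries` times the per-attempt timeout, so the two values need to be chosen together.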
(In reply to comment #3)
> (In reply to comment #2)
> > If creation of a socket fails (on all 880 retries) then you can't really use
> > the software anyway. Sure you can fall-back after certain amount of retries,
> > but does that really help you? You can't create the socket in the first place,
> > therefore you will just make the server go to another request and create more
> > havoc.
>
> Actually, you can still use the software. You couldn't use it if this were
> happening on every node, but if it happens only on one or two nodes out of your
> entire cluster, then your pbs_server is hanging endlessly and the rest of your
> cluster is going unused. This is why a limit can be useful.
Sorry, you lost me. What is hanging? The server, or the node? If the server is hanging because the sockets are failing, then they will fail for all nodes. It's just like an out-of-memory error. Or could you please explain exactly which part of the code this refers to?
(In reply to comment #4)
>
> Sorry you lost me. What is hanging? The server, or the node?
>
the server
> If the server is hanging because the sockets are failing then they will fail
> for all nodes. Its just like out of memory error. Or could you please explain
> what part of the code is this referring to exactly?
The server is hanging because the node it is in the middle of communicating with dies, mid-communication. Please read the first post on this ticket for more information.
> > If the server is hanging because the sockets are failing then they will fail
> > for all nodes. Its just like out of memory error. Or could you please explain
> > what part of the code is this referring to exactly?
>
> The server is hanging because the node it is in the middle of communicating
> with dies, mid-communication. Please read the first post on this ticket for
> more information.
OK, the real issue I'm pointing out here is that we shouldn't limit the amount
of tries but handle return values correctly. What exactly is the return value
of the bind() call in this case?
(In reply to comment #6)
> OK, the real issue I'm pointing out here is that we shouldn't limit the amount
> of tries but handle return values correctly. What exactly is the return value
> of the bind() call in this case?
We're seeing this issue as well (with 2.5.3) and it is really annoying. For the return code, here's a trace:
bind(11, {sa_family=AF_INET, sin_port=htons(301), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(11, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("10.188.11.238")}, 16) = -1 EINPROGRESS (Operation now in progr)
So bind() succeeds; it is connect() that returns EINPROGRESS and then times out. It's easy to test: start a job, then kill the job's head node.
BTW: we increased tcp_timeout to 120 since it's a rather big cluster, so just reducing the number of retries is not quite ... useful.
Regards, Erich
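The trace shows a non-blocking connect() that returns EINPROGRESS and never completes because the peer is dead. One way to handle that return value, rather than retrying blindly, is to wait for the socket to become writable under a single overall deadline and then read SO_ERROR. The sketch below illustrates the pattern in Python; it is not how pbs_server is implemented, and `connect_with_deadline` and its `timeout` parameter are hypothetical names.

```python
import errno
import select
import socket
import time


def connect_with_deadline(host, port, timeout):
    """Non-blocking TCP connect bounded by one overall deadline.

    Hypothetical helper, not TORQUE code. Returns a connected socket,
    or None if the peer refused, was unreachable, or did not answer
    within `timeout` seconds.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    deadline = time.monotonic() + timeout
    try:
        err = sock.connect_ex((host, port))
        if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            sock.close()
            return None  # immediate failure, e.g. ECONNREFUSED
        if err != 0:
            # Connect is in progress: wait until writable or deadline.
            remaining = max(deadline - time.monotonic(), 0.0)
            _, writable, _ = select.select([], [sock], [], remaining)
            if not writable:
                sock.close()
                return None  # timed out; the peer is probably dead
            # Writable only means the attempt finished; check its result.
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if err != 0:
                sock.close()
                return None
        return sock
    except OSError:
        sock.close()
        return None
```

Because the deadline covers the whole attempt, a large per-connection timeout (such as the tcp_timeout of 120 mentioned above) bounds the stall for a dead node once, instead of being multiplied by the number of retries.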
Is there any progress with this bug? The problem reported is a real show-stopper, not merely a request for a minor enhancement!