[torquedev] [Bug 85] Potential 4+ hour hang in pbs_server

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Wed Oct 6 11:07:49 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=85

--- Comment #4 from Simon Toth <SimonT at mail.muni.cz> 2010-10-06 11:07:49 MDT ---
(In reply to comment #3)
> (In reply to comment #2)
> > If creation of a socket fails (on all 880 retries) then you can't really use
> > the software anyway. Sure you can fall-back after certain amount of retries,
> > but does that really help you? You can't create the socket in the first place,
> > therefore you will just make the server go to another request and create more
> > havoc.
> 
> Actually, you can still use the software. You couldn't use it if this were
> happening on every node, but if it happens only on one or two nodes out of your
> entire cluster, then your pbs_server is hanging endlessly and the rest of your
> cluster is going unused. This is why a limit can be useful.

Sorry, you lost me. What is hanging: the server or the node?

If the server is hanging because socket creation is failing, then it will fail
for all nodes. It's just like an out-of-memory error. Or could you please
explain exactly which part of the code this refers to?
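
To make the "limit" under discussion concrete, here is a minimal sketch of a
bounded retry around socket creation. It is not the pbs_server code. The names
create_socket_bounded, socket_fn, flaky_socket and fail_remaining are
hypothetical, and the choice of which errno values count as transient is an
assumption made for illustration only.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical signature matching socket(2), so a stub can stand in for it. */
typedef int (*socket_fn)(int domain, int type, int protocol);

/*
 * Try to create a TCP socket at most max_retries times, sleeping
 * delay_seconds between failed attempts. Returns the descriptor on
 * success, or -1 once the retry budget is spent or a non-transient
 * error is seen. The caller can then mark only that request as failed
 * and move on instead of blocking indefinitely.
 */
int create_socket_bounded(socket_fn make_socket, int max_retries,
                          unsigned int delay_seconds)
{
    for (int attempt = 0; attempt < max_retries; attempt++) {
        int fd = make_socket(AF_INET, SOCK_STREAM, 0);
        if (fd >= 0)
            return fd;

        /* Assumption: only resource exhaustion is worth retrying. */
        if (errno != EMFILE && errno != ENFILE &&
            errno != ENOBUFS && errno != ENOMEM)
            break;

        if (delay_seconds > 0)
            sleep(delay_seconds);
    }
    return -1;
}

/* Demonstration stub: fails fail_remaining times, then returns a fake fd. */
int fail_remaining = 0;

int flaky_socket(int domain, int type, int protocol)
{
    (void)domain; (void)type; (void)protocol;
    if (fail_remaining > 0) {
        fail_remaining--;
        errno = EMFILE;
        return -1;
    }
    return 42; /* fake descriptor, never used for I/O */
}
```

Passing socket itself as the first argument gives the real behaviour; the stub
only exists so the give-up path can be exercised without exhausting
descriptors. Whether giving up helps depends on the question above: if the
failure is process-wide, the next request will hit the same error.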
