[torquedev] [Bug 85] Potential 4+ hour hang in pbs_server
bugzilla-daemon at supercluster.org
Thu Nov 4 08:48:14 MDT 2010
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=85
Erich Focht <efocht at gmail.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |efocht at gmail.com
--- Comment #7 from Erich Focht <efocht at gmail.com> 2010-11-04 08:48:13 MDT ---
(In reply to comment #6)
> OK, the real issue I'm pointing out here is that we shouldn't limit the amount
> of tries but handle return values correctly. What exactly is the return value
> of the bind() call in this case?
We're seeing this issue as well (with 2.5.3) and it is really annoying.
For the return code, here's a trace:
bind(11, {sa_family=AF_INET, sin_port=htons(301), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(11, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("10.188.11.238")}, 16) = -1 EINPROGRESS (Operation now in progress)
So connect() returns EINPROGRESS, then times out.
It's easy to test: start a job, then kill the job's head node.
BTW: we increased tcp_timeout to 120 since it's a rather big cluster, so
just reducing the number of retries is not quite ... useful.
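
To illustrate the point about handling return values rather than counting retries, here is a minimal sketch of a non-blocking connect that treats EINPROGRESS as "wait for the result" and enforces one overall deadline. This is not the pbs_server code; the function name connect_with_deadline and its parameters are made up for the example, and it assumes Linux socket semantics.

```python
import errno
import select
import socket


def connect_with_deadline(host, port, timeout):
    """Probe a TCP endpoint; return 0 on success or an errno value on failure.

    Hypothetical helper for illustration only, not part of TORQUE.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # connect_ex() returns the errno instead of raising an exception.
        rc = sock.connect_ex((host, port))
        if rc == 0:
            return 0
        if rc != errno.EINPROGRESS:
            # A hard failure (e.g. ECONNREFUSED) is known immediately.
            return rc
        # EINPROGRESS: the handshake is still pending. Wait until the
        # socket becomes writable or the deadline expires.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return errno.ETIMEDOUT
        # Writable only means the attempt finished; SO_ERROR says how.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()
```

With this shape the caller sees ETIMEDOUT, ECONNREFUSED or EHOSTUNREACH as distinct results and can decide per error whether another attempt makes sense, so the worst-case wait is bounded by the deadline and not by retries multiplied by tcp_timeout.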
Regards,
Erich