[torquedev] [Bug 85] Potential 4+ hour hang in pbs_server

bugzilla-daemon at supercluster.org
Thu Nov 4 08:48:14 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=85

Erich Focht <efocht at gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |efocht at gmail.com

--- Comment #7 from Erich Focht <efocht at gmail.com> 2010-11-04 08:48:13 MDT ---
(In reply to comment #6)
> OK, the real issue I'm pointing out here is that we shouldn't limit the amount
> of tries but handle return values correctly. What exactly is the return value
> of the bind() call in this case?

We're seeing this issue as well (with 2.5.3) and it is really annoying.

For the return code, here's a trace:
bind(11, {sa_family=AF_INET, sin_port=htons(301), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(11, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("10.188.11.238")}, 16) = -1 EINPROGRESS (Operation now in progress)

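So bind() succeeds (returns 0). connect() returns -1 with errno EINPROGRESS,
which means the socket is non-blocking, and the connection attempt then times
out.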

It's easy to test: start a job, then kill the job's head node.

BTW: we increased tcp_timeout to 120 since it's a rather big cluster, so
just reducing the number of retries is not quite ... useful.

Regards,
Erich

-- 
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.