[torqueusers] torque-2.5.3-1cri.x86_64 hang when a node falls

Arnau Bria arnaubria at pic.es
Wed Mar 23 09:45:21 MDT 2011


On Fri, Mar 4, 2011 at 4:52 PM, David Beer <dbeer at adaptivecomputing.com> wrote:

> Arnau,
>

Hi again David,


> This is due to a "bug" that was fixed in 2.5.5. Bug is in quotes because
> the code worked as designed, but the design allowed for TORQUE to hang for
> over 4 hours in some circumstances. If you configure TORQUE with
> --with-tcp-retry-limit=2 on TORQUE 2.5.5 it will only retry twice and then
> move on. The problem is that TORQUE would retry about 900 times and could
> take 18+ seconds with each retry, which meant that TORQUE could be retrying
> the same node for about 4.5 hours. The reason this happens is that when a
> node dies in the middle of the connecting process, the error it gets (as you
> saw) is EINPROGRESS, which is usually a transient error. However, it will
> get an EINPROGRESS for every connection on any port, and therefore cause the
> hang that you observed. Please use 2.5.5 and configure with the tcp retry
> limit to avoid this error.
>
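As a quick sanity check of the figures quoted above, here is the arithmetic as a small Python sketch. The numbers are the approximate ones from David's message ("about 900" retries, "18+ seconds" each); the function name is only illustrative and is not part of TORQUE.

```python
# Approximate figures taken from the quoted explanation above.
RETRIES_BEFORE_FIX = 900   # "about 900 times"
SECONDS_PER_RETRY = 18     # "18+ seconds with each retry" (lower bound)


def worst_case_seconds(retries, seconds_per_retry=SECONDS_PER_RETRY):
    """Lower bound on the time spent retrying a single dead node."""
    return retries * seconds_per_retry


# Old behaviour: 900 * 18 s = 16200 s
print(worst_case_seconds(RETRIES_BEFORE_FIX) / 3600)  # 4.5 (hours)

# With --with-tcp-retry-limit=2: 2 * 18 s
print(worst_case_seconds(2))  # 36 (seconds)
```

So with the limit set to 2, a dead node should cost on the order of half a minute rather than hours, if the limit is really in effect.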


I compiled torque 2.5.5 with --with-tcp-retry-limit=2 and it worked fine for a
few days. But today, with a newly hung node, I saw similar behaviour:

# strace -p22192
Process 22192 attached - interrupt to quit
select(17, NULL, [16], NULL, {1, 662000}) = 0 (Timeout)
select(17, NULL, [16], NULL, {5, 0})    = 0 (Timeout)
close(16)                               = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 16
fcntl(16, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(16, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
setsockopt(16, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(16, {sa_family=AF_INET, sin_port=htons(1006), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(16, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.239")}, 16) = -1 EINPROGRESS (Operation now in progress)
select(17, NULL, [16], NULL, {5, 0})    = 0 (Timeout)
select(17, NULL, [16], NULL, {5, 0})    = 0 (Timeout)
close(16)                               = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 16
fcntl(16, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(16, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
setsockopt(16, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(16, {sa_family=AF_INET, sin_port=htons(1007), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
bind(16, {sa_family=AF_INET, sin_port=htons(1008), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(16, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.239")}, 16) = -1 EINPROGRESS (Operation now in progress)
[...]

and so on...
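
To make clear what I would expect a retry limit to do, here is a minimal Python sketch of the same pattern seen in the trace (non-blocking connect, EINPROGRESS, select with a timeout, close, retry) with a bounded number of attempts. This is not TORQUE's code: the function name, the parameters and the 5-second default are my own assumptions for illustration, and the sketch leaves out the bind to a reserved local port that the trace shows.

```python
import errno
import select
import socket


def connect_with_retry_limit(host, port, retry_limit=2, timeout=5.0):
    """Try a non-blocking TCP connect at most `retry_limit` times.

    Returns a connected socket, or None once the limit is exhausted.
    Hypothetical helper for illustration; not a TORQUE function.
    """
    for _ in range(retry_limit):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)  # equivalent of fcntl(F_SETFL, O_NONBLOCK)
        rc = sock.connect_ex((host, port))
        if rc == 0:
            return sock  # connected immediately
        if rc == errno.EINPROGRESS:
            # Wait until the socket becomes writable or the timeout expires,
            # like the select() calls in the trace.
            _, writable, _ = select.select([], [sock], [], timeout)
            # SO_ERROR is 0 only if the handshake actually completed.
            if writable and sock.getsockopt(
                    socket.SOL_SOCKET, socket.SO_ERROR) == 0:
                return sock
        sock.close()  # timed out or failed: drop this socket and try again
    return None  # limit reached: give up and move on
```

With the values from the trace, connect_with_retry_limit("192.168.100.239", 15002) against a dead node would give up after about two 5-second timeouts instead of looping the way the strace output above does.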


My configure line looks like this:

./configure --prefix=/usr --with-server-home=/var/spool/pbs \
  --enable-maxdefault --disable-drmaa \
  --with-default-server=pbs.pic.es --disable-xopen-networking \
  --disable-gui --with-rcp=scp \
  --enable-high-availability --enable-libreadline --with-tcp-retry-limit=2

After configure finished with no errors, I ran "make rpm".

Have I done something wrong?
How can I check whether my binary was compiled with tcp-retry-limit?


> David