[torqueusers] torque-2.5.3-1cri.x86_64 hangs when a node goes down

David Beer dbeer at adaptivecomputing.com
Fri Mar 4 08:52:20 MST 2011


Arnau,

This is due to a "bug" that was fixed in 2.5.5. "Bug" is in quotes because the code worked as designed, but the design allowed TORQUE to hang for over 4 hours in some circumstances.

The problem is that TORQUE would retry the connection about 900 times, and each retry could take 18+ seconds, so TORQUE could spend about 4.5 hours retrying the same node. This happens because when a node dies in the middle of the connection handshake, the error the server gets (as you saw) is EINPROGRESS, which is usually a transient error. However, it will get EINPROGRESS for every connection attempt on any port, which produces the hang you observed.

If you configure TORQUE 2.5.5 with --with-tcp-retry-limit=2, it will retry only twice and then move on. Please upgrade to 2.5.5 and configure with the tcp retry limit to avoid this error.
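
In case it helps to visualize the fix, below is a minimal sketch (not TORQUE's actual source) of a bounded-retry, non-blocking connect. The socket/fcntl/connect/select sequence mirrors the strace output you quoted; TCP_RETRY_LIMIT stands in for the value passed to --with-tcp-retry-limit, and the sockaddr setup is assumed for illustration.

#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

#define TCP_RETRY_LIMIT 2  /* stands in for --with-tcp-retry-limit=2 */

/* Sketch only: try to connect to a MOM, giving up after a fixed
 * number of attempts instead of retrying ~900 times. */
static int try_connect(const struct sockaddr_in *addr)
{
    for (int attempt = 0; attempt < TCP_RETRY_LIMIT; attempt++) {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0)
            return -1;

        /* Non-blocking, as in the trace: connect() returns
         * -1/EINPROGRESS and we wait for writability instead. */
        fcntl(sock, F_SETFL, fcntl(sock, F_GETFL) | O_NONBLOCK);

        if (connect(sock, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
            return sock;                   /* connected immediately */

        if (errno == EINPROGRESS) {
            fd_set wfds;
            struct timeval tv = { 5, 0 };  /* 5s, matching the select() calls */
            FD_ZERO(&wfds);
            FD_SET(sock, &wfds);
            if (select(sock + 1, NULL, &wfds, NULL, &tv) > 0) {
                int err = 0;
                socklen_t len = sizeof(err);
                getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &len);
                if (err == 0)
                    return sock;           /* handshake completed */
            }
        }

        close(sock);  /* timed out or failed; retry or give up */
    }
    return -1;  /* after TCP_RETRY_LIMIT attempts, move on to other work */
}

The key difference from 2.5.3 is simply the cap on the loop; everything else matches what your strace shows.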

David

----- Original Message -----
> Hi all,
> 
> I've noticed that our torque server hangs when one (or more) nodes
> go "down". Only a restart of the server makes things work again.
> 
> Today I found such a case and ran strace on the server pid, and I
> found that the server enters a loop and stops doing anything else:
> 
> bind(15, {sa_family=AF_INET, sin_port=htons(726), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> connect(15, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
> select(16, NULL, [15], NULL, {5, 0}) = 0 (Timeout)
> select(16, NULL, [15], NULL, {5, 0}) = 0 (Timeout)
> close(15) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 15
> fcntl(15, F_GETFL) = 0x2 (flags O_RDWR)
> fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> setsockopt(15, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(15, {sa_family=AF_INET, sin_port=htons(727), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
> bind(15, {sa_family=AF_INET, sin_port=htons(728), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
> bind(15, {sa_family=AF_INET, sin_port=htons(729), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
> bind(15, {sa_family=AF_INET, sin_port=htons(730), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> connect(15, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
> select(16, NULL, [15], NULL, {5, 0}) = 0 (Timeout)
> select(16, NULL, [15], NULL, {5, 0}) = 0 (Timeout)
> close(15) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 15
> fcntl(15, F_GETFL) = 0x2 (flags O_RDWR)
> fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> setsockopt(15, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(15, {sa_family=AF_INET, sin_port=htons(731), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> connect(15, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
> select(16, NULL, [15], NULL, {5, 0}) = 0 (Timeout)
> select(16, NULL, [15], NULL, {5, 0}) = 0 (Timeout)
> close(15) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 15
> fcntl(15, F_GETFL) = 0x2 (flags O_RDWR)
> fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> setsockopt(15, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(15, {sa_family=AF_INET, sin_port=htons(732), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> connect(15, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
> 
> logs:
> # date
> Fri Mar 4 13:01:17 CET 2011
> 
> # tail /var/spool/pbs/server_logs/20110304
> 03/04/2011 12:55:54;0010;PBS_Server;Job;15855903.pbs03.pic.es;Exit_status=0 resources_used.cput=00:00:26 resources_used.mem=29952kb resources_used.vmem=1037280kb resources_used.walltime=00:02:27
> 03/04/2011 12:55:54;0100;PBS_Server;Job;15855903.pbs03.pic.es;dequeuing from gshort_sl5, state COMPLETE
> 03/04/2011 12:55:58;000d;PBS_Server;Job;15855916.pbs03.pic.es;Post job file processing error; job 15855916.pbs03.pic.es on host td560.pic.es/3
> 03/04/2011 12:55:58;0100;PBS_Server;Job;15855916.pbs03.pic.es;dequeuing from glong_sl5, state COMPLETE
> 03/04/2011 12:55:58;0010;PBS_Server;Job;15853833.pbs03.pic.es;Exit_status=0 resources_used.cput=11:02:05 resources_used.mem=732592kb resources_used.vmem=1753108kb resources_used.walltime=11:11:46
> 03/04/2011 12:55:59;0100;PBS_Server;Job;15853833.pbs03.pic.es;dequeuing from glong_sl5, state COMPLETE
> 03/04/2011 12:56:07;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.3, loglevel = 0
> 03/04/2011 12:56:08;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::stream_eof, connection to td412.pic.es is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (End of File). setting node state to down
> 03/04/2011 12:56:21;0010;PBS_Server;Job;15855904.pbs03.pic.es;Exit_status=0 resources_used.cput=00:00:28 resources_used.mem=35512kb resources_used.vmem=1109988kb resources_used.walltime=00:03:45
> 03/04/2011 12:56:21;0100;PBS_Server;Job;15855904.pbs03.pic.es;dequeuing from gshort_sl5, state COMPLETE
> 
> 
> Even when the node comes back up, torque is still stuck in that loop...
> 
> So, is there any parameter with which I can tell torque to ignore
> nodes that do not respond?
> 
> TIA,
> Arnau
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1656 S. East Bay Blvd. Suite #300
     Provo, UT 84606


