[torqueusers] torque-2.5.3-1cri.x86_64 hang when a node falls

Arnau Bria arnaubria at pic.es
Fri Mar 4 05:04:03 MST 2011


Hi all,

I've noticed that our torque server hangs when one (or more) nodes
goes "down". Only a restart of the server makes things work again.

Today I found one such case, ran strace on the server pid, and found
that the server enters a loop and stops doing anything else:

bind(15, {sa_family=AF_INET, sin_port=htons(726), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(15, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
select(16, NULL, [15], NULL, {5, 0})    = 0 (Timeout)
select(16, NULL, [15], NULL, {5, 0})    = 0 (Timeout)
close(15)                               = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 15
fcntl(15, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
setsockopt(15, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(15, {sa_family=AF_INET, sin_port=htons(727), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
bind(15, {sa_family=AF_INET, sin_port=htons(728), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
bind(15, {sa_family=AF_INET, sin_port=htons(729), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
bind(15, {sa_family=AF_INET, sin_port=htons(730), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(15, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
select(16, NULL, [15], NULL, {5, 0})    = 0 (Timeout)
select(16, NULL, [15], NULL, {5, 0})    = 0 (Timeout)
close(15)                               = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 15
fcntl(15, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
setsockopt(15, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(15, {sa_family=AF_INET, sin_port=htons(731), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(15, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
select(16, NULL, [15], NULL, {5, 0})    = 0 (Timeout)
select(16, NULL, [15], NULL, {5, 0})    = 0 (Timeout)
close(15)                               = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 15
fcntl(15, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
setsockopt(15, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(15, {sa_family=AF_INET, sin_port=htons(732), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(15, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("192.168.100.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
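The trace above is the classic non-blocking connect pattern: connect() returns EINPROGRESS, the server then select()s on the socket for writability with a 5-second timeout, and on timeout it closes the socket, opens a new one, and retries against the same dead node. A minimal sketch of that pattern (Python; `try_connect` and its parameters are illustrative, not Torque code):

```python
import select
import socket

def try_connect(host, port, timeout=5.0, retries=2):
    """Mimic the syscall sequence in the strace: non-blocking
    connect(), then select() for writability with a timeout,
    retrying a couple of times before giving up."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.connect((host, port))
    except BlockingIOError:
        pass                      # EINPROGRESS: completion is pending
    except OSError:
        s.close()                 # immediate failure (e.g. refused)
        return False
    for _ in range(retries):
        # Equivalent to: select(16, NULL, [fd], NULL, {5, 0})
        _, writable, _ = select.select([], [s], [], timeout)
        if writable:
            # Writable means the connect finished; check how.
            err = s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            s.close()
            return err == 0
    s.close()                     # timed out, as in the trace
    return False
```

If a single-threaded server calls something like this serially for each node, one unreachable node (packets silently dropped, so select() only ever times out) can stall the whole loop for (retries x timeout) seconds per attempt, which matches the hang described here.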

logs:
# date
Fri Mar  4 13:01:17 CET 2011

# tail /var/spool/pbs/server_logs/20110304 
03/04/2011 12:55:54;0010;PBS_Server;Job;15855903.pbs03.pic.es;Exit_status=0 resources_used.cput=00:00:26 resources_used.mem=29952kb resources_used.vmem=1037280kb resources_used.walltime=00:02:27
03/04/2011 12:55:54;0100;PBS_Server;Job;15855903.pbs03.pic.es;dequeuing from gshort_sl5, state COMPLETE
03/04/2011 12:55:58;000d;PBS_Server;Job;15855916.pbs03.pic.es;Post job file processing error; job 15855916.pbs03.pic.es on host td560.pic.es/3
03/04/2011 12:55:58;0100;PBS_Server;Job;15855916.pbs03.pic.es;dequeuing from glong_sl5, state COMPLETE
03/04/2011 12:55:58;0010;PBS_Server;Job;15853833.pbs03.pic.es;Exit_status=0 resources_used.cput=11:02:05 resources_used.mem=732592kb resources_used.vmem=1753108kb resources_used.walltime=11:11:46
03/04/2011 12:55:59;0100;PBS_Server;Job;15853833.pbs03.pic.es;dequeuing from glong_sl5, state COMPLETE
03/04/2011 12:56:07;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.3, loglevel = 0
03/04/2011 12:56:08;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::stream_eof, connection to td412.pic.es is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (End of File).  setting node state to down
03/04/2011 12:56:21;0010;PBS_Server;Job;15855904.pbs03.pic.es;Exit_status=0 resources_used.cput=00:00:28 resources_used.mem=35512kb resources_used.vmem=1109988kb resources_used.walltime=00:03:45
03/04/2011 12:56:21;0100;PBS_Server;Job;15855904.pbs03.pic.es;dequeuing from gshort_sl5, state COMPLETE


Even when the node comes back up, torque is still stuck in that loop...

So, is there any parameter I could use to tell torque to ignore nodes
that do not respond?

TIA,
Arnau

