[torqueusers] Fwd: pbsserver 4.1.4 accumulates CLOSE_WAIT connections, bug ?

Lech Nieroda nieroda.lech at uni-koeln.de
Tue Apr 23 10:08:47 MDT 2013


Dear list,

we have encountered a problem with our torque 4.1.4/maui 3.3.1
scheduling system. It seems to be a bug within pbs_server, where under
certain conditions a socket isn't closed properly.

The symptoms:
- the server stops writing to the log and stops responding to commands.
The only way to restart it is 'kill -9'
- thousands of CLOSE_WAIT connections accumulate in a short period of
time (we've had ca 20k of those)
- clients log error messages like "LOG_ERROR::read_tcp_reply,
Mismatching protocols. Expected protocol 4 but read reply for 0"

Workaround:
- a restart of the server removes all CLOSE_WAIT connections. Once
restarted, the server writes to the log again but remains unresponsive.
The clients still log the "mismatching protocols" messages
- after the server restart, all pbs_moms are restarted (again, only a
'kill -9' is able to stop the pbs_moms)

Afterwards, the system is operational again.

Here are some logs:

'netstat -nap' snippet on the server side:
[snip]
tcp      461      0 172.18.0.197:15001          172.18.1.167:422
     CLOSE_WAIT  11478/pbs_server
tcp      417      0 172.18.0.197:15001          172.18.1.101:356
     CLOSE_WAIT  11478/pbs_server
tcp      413      0 172.18.0.197:15001          172.18.4.60:312
     CLOSE_WAIT  11478/pbs_server
tcp      380      0 172.18.0.197:15001          172.18.4.142:394
     CLOSE_WAIT  11478/pbs_server
tcp      440      0 172.18.0.197:15001          172.18.3.13:270
     CLOSE_WAIT  11478/pbs_server
tcp      417      0 172.18.0.197:15001          172.18.2.124:382
     CLOSE_WAIT  11478/pbs_server
tcp      415      0 172.18.0.197:15001          172.18.4.138:398
     CLOSE_WAIT  11478/pbs_server
tcp        1      0 172.18.0.197:15001          172.18.0.194:52162
     CLOSE_WAIT  11478/pbs_server
tcp        1      0 172.18.0.197:15001          172.18.6.14:51976
     CLOSE_WAIT  11478/pbs_server
[snap]

last server messages:
[snip]
04/21/2013 11:38:03;0004;PBS_Server.11481;Svr;svr_is_request;message
received from addr 172.18.2.72:405: mom_port 15002  - rm_port 15003
04/21/2013 11:38:03;0004;PBS_Server.11481;Svr;svr_is_request;message
STATUS (4) received from mom on host cheops40218 (172.18.2.72:405) (sock 12)
04/21/2013 11:38:03;0004;PBS_Server.11481;Svr;svr_is_request;IS_STATUS
received from cheops40218
04/21/2013 11:38:03;0040;PBS_Server.11481;Req;is_stat_get;received
status from node cheops40218
04/21/2013
11:38:03;0040;PBS_Server.11481;Req;update_node_state;adjusting state for
node cheops40218 - state=0, newstate=0
04/21/2013 11:38:03;0004;PBS_Server.11525;Svr;svr_is_request;message
received from addr 172.18.2.67:921: mom_port 15002  - rm_port 15003
04/21/2013 11:38:03;0004;PBS_Server.11525;Svr;svr_is_request;message
STATUS (4) received from mom on host cheops40213 (172.18.2.67:921) (sock 13)
04/21/2013 11:38:03;0004;PBS_Server.11525;Svr;svr_is_request;IS_STATUS
received from cheops40213
04/21/2013 11:38:03;0040;PBS_Server.11525;Req;is_stat_get;received
status from node cheops40213
04/21/2013
11:38:03;0040;PBS_Server.11525;Req;update_node_state;adjusting state for
node cheops40213 - state=0, newstate=0
04/21/2013 11:38:11;0100;PBS_Server.11605;Req;;Type StatusJob request
received from root at cheops10, sock=11
[snap]


netstat on the client side (show ESTABLISHED or SYN_SENT on port 15001):
[snip]
cheops21410: tcp        0    375 172.18.4.128:890
172.18.0.197:15001          ESTABLISHED 8676/pbs_mom
[snap]

[snip]
cheops41511: tcp        0      1 172.18.5.165:716
172.18.0.197:15001          SYN_SENT    9602/pbs_mom
[snap]

client messages:
[snip]
Apr 21 11:43:25 s_sys at cheops21410 pbs_mom: LOG_ERROR::read_tcp_reply,
Mismatching protocols. Expected protocol 4 but read reply for 0
Apr 21 11:43:25 s_sys at cheops21410 pbs_mom: LOG_ERROR::read_tcp_reply,
Could not read reply for protocol 4 command 4: End of File
Apr 21 11:43:25 s_sys at cheops21410 pbs_mom:
LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
Apr 21 11:43:25 s_sys at cheops21410 pbs_mom:
LOG_ERROR::mom_server_all_update_stat, Could not contact any of the
servers to send an update
[snap]

Is this a server bug?

Regards,
Lech Nieroda
-- 
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Raum 309 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de




More information about the torqueusers mailing list