[torqueusers] Fwd: pbs_server 4.1.4 accumulates CLOSE_WAIT connections, bug?

Lech Nieroda nieroda.lech at uni-koeln.de
Wed Apr 24 03:10:35 MDT 2013


Ken,

On 23.04.2013 20:00, Ken Nielson wrote:
> On Tue, Apr 23, 2013 at 10:08 AM, Lech Nieroda
> <nieroda.lech at uni-koeln.de> wrote:
>
>     Dear list,
>
>     We have encountered a problem with our Torque 4.1.4/Maui 3.3.1
>     scheduling system. It appears to be a bug in pbs_server: under
>     certain conditions a socket is not closed properly.
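>
>     For illustration: a connection sits in CLOSE_WAIT once the remote
>     side has closed its end and the local process has received the
>     FIN but never calls close() on its descriptor. A minimal sketch
>     that reproduces the state (deliberately simplified, not Torque
>     code; the port is arbitrary and error checks are omitted):
>
>     #include <netinet/in.h>
>     #include <string.h>
>     #include <sys/socket.h>
>     #include <unistd.h>
>
>     int main(void)
>     {
>         int srv = socket(AF_INET, SOCK_STREAM, 0);
>         struct sockaddr_in addr;
>
>         memset(&addr, 0, sizeof(addr));
>         addr.sin_family      = AF_INET;
>         addr.sin_port        = htons(15099);
>         addr.sin_addr.s_addr = htonl(INADDR_ANY);
>         bind(srv, (struct sockaddr *)&addr, sizeof(addr));
>         listen(srv, 5);
>
>         int conn = accept(srv, NULL, NULL);
>         /* the peer disconnects, but close(conn) is never called, */
>         /* so the kernel keeps the connection in CLOSE_WAIT       */
>         sleep(600);
>         return 0;
>     }
>
>     Compile and run it, connect with e.g. 'telnet localhost 15099',
>     quit telnet, and 'netstat -nap' shows the stuck CLOSE_WAIT entry.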
>
>     The symptoms:
>     - the server stops writing to its log and stops responding to
>     commands; the only way to stop it is 'kill -9'
>     - thousands of CLOSE_WAIT connections accumulate in a short period
>     of time (we have seen ca. 20,000 of them)
>     - the clients log error messages like "LOG_ERROR::read_tcp_reply,
>     Mismatching protocols. Expected protocol 4 but read reply for 0"
>
>     Workaround:
>     - a restart of the server removes all CLOSE_WAIT connections. Once
>     restarted, the server writes to its log again but remains
>     unresponsive; the clients still log the "mismatching protocols"
>     messages
>     - after the server restart, all pbs_moms are restarted as well
>     (again, only 'kill -9' is able to stop the pbs_moms)
>
>     Afterwards, the system is operational again.
>
>     Here are some logs:
>
>     'netstat -nap' snippet on the server side:
>     [snip]
>     tcp      461      0 172.18.0.197:15001   172.18.1.167:422     CLOSE_WAIT  11478/pbs_server
>     tcp      417      0 172.18.0.197:15001   172.18.1.101:356     CLOSE_WAIT  11478/pbs_server
>     tcp      413      0 172.18.0.197:15001   172.18.4.60:312      CLOSE_WAIT  11478/pbs_server
>     tcp      380      0 172.18.0.197:15001   172.18.4.142:394     CLOSE_WAIT  11478/pbs_server
>     tcp      440      0 172.18.0.197:15001   172.18.3.13:270      CLOSE_WAIT  11478/pbs_server
>     tcp      417      0 172.18.0.197:15001   172.18.2.124:382     CLOSE_WAIT  11478/pbs_server
>     tcp      415      0 172.18.0.197:15001   172.18.4.138:398     CLOSE_WAIT  11478/pbs_server
>     tcp        1      0 172.18.0.197:15001   172.18.0.194:52162   CLOSE_WAIT  11478/pbs_server
>     tcp        1      0 172.18.0.197:15001   172.18.6.14:51976    CLOSE_WAIT  11478/pbs_server
>     [snap]
>
>     last server messages:
>     [snip]
>     04/21/2013 11:38:03;0004;PBS_Server.11481;Svr;svr_is_request;message received from addr 172.18.2.72:405: mom_port 15002  - rm_port 15003
>     04/21/2013 11:38:03;0004;PBS_Server.11481;Svr;svr_is_request;message STATUS (4) received from mom on host cheops40218 (172.18.2.72:405) (sock 12)
>     04/21/2013 11:38:03;0004;PBS_Server.11481;Svr;svr_is_request;IS_STATUS received from cheops40218
>     04/21/2013 11:38:03;0040;PBS_Server.11481;Req;is_stat_get;received status from node cheops40218
>     04/21/2013 11:38:03;0040;PBS_Server.11481;Req;update_node_state;adjusting state for node cheops40218 - state=0, newstate=0
>     04/21/2013 11:38:03;0004;PBS_Server.11525;Svr;svr_is_request;message received from addr 172.18.2.67:921: mom_port 15002  - rm_port 15003
>     04/21/2013 11:38:03;0004;PBS_Server.11525;Svr;svr_is_request;message STATUS (4) received from mom on host cheops40213 (172.18.2.67:921) (sock 13)
>     04/21/2013 11:38:03;0004;PBS_Server.11525;Svr;svr_is_request;IS_STATUS received from cheops40213
>     04/21/2013 11:38:03;0040;PBS_Server.11525;Req;is_stat_get;received status from node cheops40213
>     04/21/2013 11:38:03;0040;PBS_Server.11525;Req;update_node_state;adjusting state for node cheops40213 - state=0, newstate=0
>     04/21/2013 11:38:11;0100;PBS_Server.11605;Req;;Type StatusJob request received from root@cheops10, sock=11
>     [snap]
>
>
>     netstat on the client side (showing ESTABLISHED or SYN_SENT connections on port 15001):
>     [snip]
>     cheops21410: tcp        0    375 172.18.4.128:890     172.18.0.197:15001   ESTABLISHED 8676/pbs_mom
>     [snap]
>
>     [snip]
>     cheops41511: tcp        0      1 172.18.5.165:716     172.18.0.197:15001   SYN_SENT    9602/pbs_mom
>     [snap]
>
>     client messages:
>     [snip]
>     Apr 21 11:43:25 s_sys@cheops21410 pbs_mom: LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
>     Apr 21 11:43:25 s_sys@cheops21410 pbs_mom: LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
>     Apr 21 11:43:25 s_sys@cheops21410 pbs_mom: LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
>     Apr 21 11:43:25 s_sys@cheops21410 pbs_mom: LOG_ERROR::mom_server_all_update_stat, Could not contact any of the servers to send an update
>     [snap]
>
>     Is this a server bug?
>
>     Regards,
>     Lech Nieroda
>     --
>     Dipl.-Wirt.-Inf. Lech Nieroda
>     Regionales Rechenzentrum der Universität zu Köln (RRZK)
>     Universität zu Köln
>     Weyertal 121
>     Room 309 (3rd floor)
>     D-50931 Köln
>     Germany
>
>     Tel.: +49 (221) 470-89606
>     E-Mail: nieroda.lech at uni-koeln.de
>
>
>     _______________________________________________
>     torqueusers mailing list
>     torqueusers at supercluster.org
>     http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> Lech,
>
> Have you been able to reproduce the condition on demand?

No, we haven't yet found out what triggers these undead
connections. It has happened twice since we installed the patched
Torque 4.1.4 version, and both times there was no specific problem
visible in the logs before the connections started to accumulate.

I suspect there is a rarely executed if-branch that lacks a proper
close() call and thus creates the hanging connections...
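
To make that concrete, here is a purely hypothetical sketch of the
kind of branch I have in mind (made-up function and names, not
actual pbs_server code):

#include <unistd.h>

static void handle_is_request(int sock)  /* hypothetical handler */
{
    char    buf[1024];
    ssize_t n = read(sock, buf, sizeof(buf));

    if (n <= 0)
    {
        /* BUG: early return without close(sock); once the mom
           closes its end, the descriptor sits in CLOSE_WAIT */
        return;
    }

    /* ... normal processing of the status update ... */

    close(sock);  /* only the success path releases the socket */
}

If it is something like this, the fix would simply be to close the
descriptor in the error branch as well.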

Regards,
Lech

-- 
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Room 309 (3rd floor)
D-50931 Köln
Germany

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de

