[torqueusers] Fwd: pbsserver 4.1.4 accumulates CLOSE_WAIT connections, bug ?

Ken Nielson knielson at adaptivecomputing.com
Tue Apr 23 12:00:37 MDT 2013


On Tue, Apr 23, 2013 at 10:08 AM, Lech Nieroda <nieroda.lech at uni-koeln.de>wrote:

> Dear list,
>
> we have encountered a problem with our torque 4.1.4/maui 3.3.1
> scheduling system. It seems to be a bug within pbs_server, where under
> certain conditions a socket isn't closed properly.
>
> The symptoms:
> - the server stops writing to the log and stops responding to commands.
> The only way to restart it is 'kill -9'
> - thousands of CLOSE_WAIT connections accumulate in a short period of
> time (we've had ca 20k of those)
> - clients log error messages like "LOG_ERROR::read_tcp_reply,
> Mismatching protocols. Expected protocol 4 but read reply for 0"
>
> Workaround:
> - a restart of the server removes all CLOSE_WAIT connections. Once
> restarted, the server writes to the log again but remains unresponsive.
> The clients still log the "mismatching protocols" messages
> - after the server restart, all pbs_moms are restarted (again, only a
> 'kill -9' is able to stop the pbs_moms)
>
> Afterwards, the system is operational again.
>
> Here are some logs:
>
> 'netstat -nap' snippet on the server side:
> [snip]
> tcp      461      0 172.18.0.197:15001          172.18.1.167:422
>      CLOSE_WAIT  11478/pbs_server
> tcp      417      0 172.18.0.197:15001          172.18.1.101:356
>      CLOSE_WAIT  11478/pbs_server
> tcp      413      0 172.18.0.197:15001          172.18.4.60:312
>      CLOSE_WAIT  11478/pbs_server
> tcp      380      0 172.18.0.197:15001          172.18.4.142:394
>      CLOSE_WAIT  11478/pbs_server
> tcp      440      0 172.18.0.197:15001          172.18.3.13:270
>      CLOSE_WAIT  11478/pbs_server
> tcp      417      0 172.18.0.197:15001          172.18.2.124:382
>      CLOSE_WAIT  11478/pbs_server
> tcp      415      0 172.18.0.197:15001          172.18.4.138:398
>      CLOSE_WAIT  11478/pbs_server
> tcp        1      0 172.18.0.197:15001          172.18.0.194:52162
>      CLOSE_WAIT  11478/pbs_server
> tcp        1      0 172.18.0.197:15001          172.18.6.14:51976
>      CLOSE_WAIT  11478/pbs_server
> [snap]
>
> last server messages:
> [snip]
> 04/21/2013 11:38:03;0004;PBS_Server.11481;Svr;svr_is_request;message
> received from addr 172.18.2.72:405: mom_port 15002  - rm_port 15003
> 04/21/2013 11:38:03;0004;PBS_Server.11481;Svr;svr_is_request;message
> STATUS (4) received from mom on host cheops40218 (172.18.2.72:405) (sock
> 12)
> 04/21/2013 11:38:03;0004;PBS_Server.11481;Svr;svr_is_request;IS_STATUS
> received from cheops40218
> 04/21/2013 11:38:03;0040;PBS_Server.11481;Req;is_stat_get;received
> status from node cheops40218
> 04/21/2013
> 11:38:03;0040;PBS_Server.11481;Req;update_node_state;adjusting state for
> node cheops40218 - state=0, newstate=0
> 04/21/2013 11:38:03;0004;PBS_Server.11525;Svr;svr_is_request;message
> received from addr 172.18.2.67:921: mom_port 15002  - rm_port 15003
> 04/21/2013 11:38:03;0004;PBS_Server.11525;Svr;svr_is_request;message
> STATUS (4) received from mom on host cheops40213 (172.18.2.67:921) (sock
> 13)
> 04/21/2013 11:38:03;0004;PBS_Server.11525;Svr;svr_is_request;IS_STATUS
> received from cheops40213
> 04/21/2013 11:38:03;0040;PBS_Server.11525;Req;is_stat_get;received
> status from node cheops40213
> 04/21/2013
> 11:38:03;0040;PBS_Server.11525;Req;update_node_state;adjusting state for
> node cheops40213 - state=0, newstate=0
> 04/21/2013 11:38:11;0100;PBS_Server.11605;Req;;Type StatusJob request
> received from root at cheops10, sock=11
> [snap]
>
>
> netstat on the client side (show ESTABLISHED or SYN_SENT on port 15001):
> [snip]
> cheops21410: tcp        0    375 172.18.4.128:890
> 172.18.0.197:15001          ESTABLISHED 8676/pbs_mom
> 7[snap]
>
> [snip]
> cheops41511: tcp        0      1 172.18.5.165:716
> 172.18.0.197:15001          SYN_SENT    9602/pbs_mom
> [snap]
>
> client messages:
> [snip]
> Apr 21 11:43:25 s_sys at cheops21410 pbs_mom: LOG_ERROR::read_tcp_reply,
> Mismatching protocols. Expected protocol 4 but read reply for 0
> Apr 21 11:43:25 s_sys at cheops21410 pbs_mom: LOG_ERROR::read_tcp_reply,
> Could not read reply for protocol 4 command 4: End of File
> Apr 21 11:43:25 s_sys at cheops21410 pbs_mom:
> LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
> Apr 21 11:43:25 s_sys at cheops21410 pbs_mom:
> LOG_ERROR::mom_server_all_update_stat, Could not contact any of the
> servers to send an update
> [snap]
>
> Is this a server bug?
>
> Regards,
> Lech Nieroda
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Raum 309 (3. Etage)
> D-50931 Köln
> Deutschland
>
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

Lech,

Have you been able to reproduce the condition on demand?

-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130423/4bb69e33/attachment.html 


More information about the torqueusers mailing list