Bug 212 - server spins on select() with expired sockets
Status: RESOLVED FIXED
Product: TORQUE
Component: pbs_server
Version: 4.0.*
Platform: PC Linux
Importance: P5 major
Assigned To: David Beer
Reported: 2012-08-06 04:55 MDT by Viktor Štujber
Modified: 2012-08-15 11:29 MDT

Description Viktor Štujber 2012-08-06 04:55:36 MDT
Our TORQUE 4.1.0 server often goes into a CPU-consuming loop. Here is the
information I have gathered so far.

> strace -p 28590
select(1024, [8 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
33 34 35 36 37 38 39 41 43 44], NULL, NULL, {5, 0}) = 31 (in [8 12 13 14 15 16
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 33 34 35 36 37 38 39 41 43 44],
left {4, 999992})
nanosleep({0, 100000}, NULL)            = 0
tgkill(28590, 28592, SIG_0)             = 0
tgkill(28590, 28593, SIG_0)             = 0
<repeat infinitely>

(gdb) p svr_conn[8]
$34 = {cn_addr = 2477722413, cn_handle = 0, cn_port = 15002, cn_authen = 1,
cn_socktype = 2, cn_active = ToServerDIS, cn_lasttime = 1344247871, cn_func =
0, cn_oncl = 0, cn_mutex = 0x2a12ce0, cn_stay_open = 0}

I shut down all the connected clients, and all of the socket descriptors listed
above are now in the CLOSE_WAIT state. The select() call signals activity on all
31 of them and returns immediately, but every one is of type 'ToServerDIS' (not
'Idle') and none of them has a cn_func assigned, so wait_request() in
lib/Libnet/net_server.c just keeps spinning over all 10240 blank connection
slots with no sleep, causing significant CPU usage.
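
For illustration, the condition described above can be handled in isolation. This is a minimal sketch, not the fix that was checked into TORQUE: poll_once() and peer_has_closed() are hypothetical helpers, and the sketch assumes that a descriptor flagged readable by select() whose recv() with MSG_PEEK returns 0 belongs to a peer that has already closed. Such a descriptor can be closed and dropped from the watched set instead of being reported again on every pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper (not a TORQUE function): true if the peer has closed
 * its end. recv() returning 0 on a stream socket means orderly shutdown.
 * MSG_PEEK leaves any pending data in place; MSG_DONTWAIT (available on
 * Linux) keeps the check from blocking. */
static bool peer_has_closed(int fd)
{
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    return n == 0;
}

/* Hypothetical helper: one polling pass over the descriptors in *watched.
 * Descriptors whose peer has closed are closed locally and removed from
 * *watched, so the next select() no longer returns immediately for them.
 * Returns select()'s result; *reaped receives the number of closed sockets. */
static int poll_once(fd_set *watched, int maxfd, int *reaped)
{
    assert(maxfd >= 0 && maxfd < FD_SETSIZE);

    fd_set ready = *watched;            /* select() modifies its argument */
    struct timeval tv = {0, 100000};    /* 100 ms timeout */
    int n = select(maxfd + 1, &ready, NULL, NULL, &tv);

    *reaped = 0;
    if (n <= 0)
        return n;                       /* timeout or error: nothing to do */

    for (int fd = 0; fd <= maxfd; fd++) {
        if (FD_ISSET(fd, &ready) && peer_has_closed(fd)) {
            FD_CLR(fd, watched);        /* stop watching the dead socket */
            close(fd);                  /* leaves CLOSE_WAIT */
            (*reaped)++;
        }
    }
    return n;
}
```

With a descriptor whose peer has gone away, the first pass reaps it, and the next select() call waits for its timeout rather than returning immediately.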

Comment 1 David Beer 2012-08-15 11:29:54 MDT
I checked a fix into 4.1-fixes.