[torqueusers] torque: 2.5.x and blocking sockets

Lukasz Flis l.flis at cyf-kr.edu.pl
Thu Jan 24 15:57:07 MST 2013


Hi All,

This is a question for experienced torque developers, before I
break something by fixing it the wrong way.

We are seeing pbs_server lock up from time to time. When that happens,
the server has to be restarted to make it work again.

We are running torque 2.5.12.

strace of the locked pbs_server process shows an unfinished write()
syscall that never returns.

The gdb backtrace tells more:
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffcf371000
0x0000003b300c6860 in __write_nocancel () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003b300c6860 in __write_nocancel () from /lib64/libc.so.6
#1  0x00000033dcc13f67 in write_nonblocking_socket (fd=165, buf=0x283a9ac4, count=10205738) at ../Libifl/nonblock.c:31
#2  0x00000033dcc1f7ef in DIS_tcp_wflush (fd=165) at ../Libifl/tcp_dis.c:377
#3  0x000000000042869a in dis_reply_write (sfds=165, preply=0x134e3748) at reply_send.c:188
#4  0x0000000000428815 in reply_send (request=0x134e32d0) at reply_send.c:283
#5  0x0000000000443aea in req_stat_job_step2 (cntl=0x1089a940) at req_stat.c:725
#6  0x0000000000442ebd in req_stat_job (preq=0x134e32d0) at req_stat.c:308
#7  0x00000000004272f5 in dispatch_request (sfds=165, request=0x134e32d0) at process_request.c:984
#8  0x000000000042701e in process_request (sfds=165) at process_request.c:730
#9  0x00000033dcc2cdca in wait_request (waittime=1, SState=0x747418) at ../Libnet/net_server.c:508
#10 0x0000000000423b6c in main_loop () at pbsd_main.c:1203
#11 0x0000000000424b07 in main (argc=3, argv=0x7fffcf288e58) at pbsd_main.c:1802
(gdb)


To me it looks like write_nonblocking_socket sometimes gets a blocking
socket for some reason and then waits in write() forever.

In that case the timeout check doesn't work, because it is built on the
assumption that write() never blocks here.

This way a misbehaving client with an unstable network or firewall
problems can cause pbs_server to stop serving requests.

I see no check in this function for whether the socket is blocking or not.
read_nonblocking_socket, on the other hand, does perform such a check and
turns O_NONBLOCK on if necessary.
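For reference, a check of this kind can be done with two fcntl() calls. The following is only a sketch of the idea, not a copy of what read_nonblocking_socket does; ensure_nonblocking is a hypothetical name and does not exist in torque.

```c
#include <fcntl.h>

/* Hypothetical helper, not part of TORQUE: make sure fd is in
 * non-blocking mode. Returns 0 on success, -1 if fcntl() fails. */
static int ensure_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;               /* already non-blocking, nothing to do */

    /* Keep the existing status flags and add O_NONBLOCK. */
    return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) ? -1 : 0;
}
```

When the flag is already set this costs one extra fcntl(F_GETFL) call per invocation. Note that O_NONBLOCK is a property of the open file description, so setting it also affects any other descriptor that shares it (for example after dup() or fork()).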

As write_nonblocking_socket is a quite frequently used piece of code, I
would like to ask whether changing the socket's blocking mode inside
the function might break some functionality or slow things down significantly.

Can anyone confirm that such pbs_server hangs have been seen around the globe?


PS> I would like to remind you of the unofficial #torque IRC channel on
the freenode network, where you can sometimes get quick help :)

Best Regards
--
Lukasz Flis

