[torqueusers] [torquedev] torque: 2.5.x and blocking sockets

Ken Nielson knielson at adaptivecomputing.com
Fri Jan 25 16:04:28 MST 2013

On Thu, Jan 24, 2013 at 3:57 PM, Lukasz Flis <l.flis at cyf-kr.edu.pl> wrote:

> Hi All,
> This is a question for experienced torque developers before I
> break something by fixing it wrong.
> We are facing pbs_server lockups from time to time. In such cases a
> server restart is required to make it work again.
> We are running torque 2.5.12.
> strace of the locked pbs_server process shows an unfinished write()
> syscall that waits forever.
> The gdb backtrace tells more:
> warning: no loadable sections found in added symbol-file system-supplied
> DSO at 0x7fffcf371000
> 0x0000003b300c6860 in __write_nocancel () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0000003b300c6860 in __write_nocancel () from /lib64/libc.so.6
> #1  0x00000033dcc13f67 in write_nonblocking_socket (fd=165, buf=0x283a9ac4, count=10205738) at ../Libifl/nonblock.c:31
> #2  0x00000033dcc1f7ef in DIS_tcp_wflush (fd=165) at ../Libifl/tcp_dis.c:377
> #3  0x000000000042869a in dis_reply_write (sfds=165, preply=0x134e3748) at reply_send.c:188
> #4  0x0000000000428815 in reply_send (request=0x134e32d0) at reply_send.c:283
> #5  0x0000000000443aea in req_stat_job_step2 (cntl=0x1089a940) at req_stat.c:725
> #6  0x0000000000442ebd in req_stat_job (preq=0x134e32d0) at req_stat.c:308
> #7  0x00000000004272f5 in dispatch_request (sfds=165, request=0x134e32d0) at process_request.c:984
> #8  0x000000000042701e in process_request (sfds=165) at process_request.c:730
> #9  0x00000033dcc2cdca in wait_request (waittime=1, SState=0x747418) at ../Libnet/net_server.c:508
> #10 0x0000000000423b6c in main_loop () at pbsd_main.c:1203
> #11 0x0000000000424b07 in main (argc=3, argv=0x7fffcf288e58) at pbsd_main.c:1802
> (gdb)
> To me it looks like write_nonblocking_socket sometimes gets a blocking
> socket for some reason and then waits on write forever.
> In that case the timeout check doesn't work, because it is written on
> the assumption that write never blocks here.
> This way a misbehaving client with an unstable network or firewall
> problems can cause pbs_server to stop serving requests.
> I see no check in this function for whether the socket is blocking.
> read_nonblocking_socket, on the other hand, does such a check and turns
> O_NONBLOCK on if necessary.
> As write_nonblocking_socket is a frequently used piece of code, I
> would like to ask whether changing the socket's blocking mode inside
> the function might break some functionality or slow things down
> significantly.
> Can anyone confirm that such pbs_server hangs have been seen elsewhere?
> PS> I would like to remind you of the unofficial #torque IRC channel on
> the freenode network, where you can sometimes get quick help :)
> Best Regards
> --
> Lukasz Flis


From what you are describing and the backtrace, I would say you are correct.
It seems we are getting stuck on a blocking write. Ensuring the socket is
non-blocking is the right thing to do.

Something else to think about: if a write gets blocked, it is because the
receiving side is not reading the data.

