[torqueusers] Socket issues in Torque 4.1.x

Douglas Holt douglas.holt at st.com
Wed Sep 12 07:26:06 MDT 2012


Since switching from the 3.0.x branch to 4.1.x, we've been running into
an issue where we appear to run out of available sockets while
queuing/scheduling jobs. We routinely queue tens of thousands of jobs
at a time (up to around 30-40k total), and after several hundred to a
thousand submissions I start seeing the errors below in the logs, and
random jobs get dropped (not queued). I've tried limiting the rate at
which I add jobs, raising the open-file limit (ulimit -n 32788),
lowering the FIN timeout (/proc/sys/net/ipv4/tcp_fin_timeout) from 60
to 5 seconds, etc. This is essentially a brand-new system with a
default installation of Torque 4.1.1.
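
To double-check that the tuning above actually reached the running
daemon, a small Linux-only Python sketch along these lines may help. It
is not part of Torque; the function names are made up for illustration,
and it only reads the standard /proc files. The point is that
"ulimit -n" applies to the shell it is run in (and processes started
from it), so the limit of an already-running server process can differ.

```python
import os
from pathlib import Path


def nofile_limits_of(pid):
    """Return (soft, hard) 'Max open files' of a process as strings.

    Reads /proc/<pid>/limits (Linux only). Values are returned as
    strings because the kernel may report 'unlimited'.
    Returns None if the file cannot be read or the row is missing.
    """
    try:
        text = Path(f"/proc/{pid}/limits").read_text()
    except OSError:
        return None
    for line in text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Layout: Max open files <soft> <hard> files
            return fields[3], fields[4]
    return None


def fin_timeout(path="/proc/sys/net/ipv4/tcp_fin_timeout"):
    """Return the current tcp_fin_timeout in seconds, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # Replace os.getpid() with the PID of the server process to inspect it.
    print("open-file limits:", nofile_limits_of(os.getpid()))
    print("tcp_fin_timeout:", fin_timeout())
```

Run as shown it reports the limits of the Python process itself;
passing the server's PID instead shows what the daemon is running with.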


09/08/2012 12:16:54;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,
closed connections to fd 29 - num_connections=74 (select bad socket)
09/08/2012 12:16:54;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,
closed connections to fd 12 - num_connections=68 (select bad socket)
09/08/2012 12:16:56;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,
closed connections to fd 12 - num_connections=59 (select bad socket)
09/08/2012 12:16:56;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,
closed connections to fd 29 - num_connections=54 (select bad socket)
09/08/2012 12:16:56;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,
closed connections to fd 32 - num_connections=52 (select bad socket)
09/08/2012 12:16:58;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,
closed connections to fd 78 - num_connections=27 (select bad socket)
09/08/2012 12:16:59;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,
closed connections to fd 11 - num_connections=17 (select bad socket)

09/08/2012 00:13:32;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad
file descriptor (9) in wait_request, Unable to select sockets to read
requests
09/08/2012 00:14:43;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad
file descriptor (9) in wait_request, Unable to select sockets to read
requests
09/08/2012 00:16:09;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad
file descriptor (9) in wait_request, Unable to select sockets to read
requests
09/08/2012 00:18:24;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad
file descriptor (9) in wait_request, Unable to select sockets to read
requests
09/08/2012 00:19:23;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad
file descriptor (9) in wait_request, Unable to select sockets to read
requests
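
For context on the "Bad file descriptor (9)" text in these messages:
errno 9 is EBADF on Linux, which select() reports when one of the
descriptors handed to it is not open. The Python sketch below
reproduces just that generic OS behaviour with a deliberately closed
pipe descriptor. It is an illustration only and does not claim to show
what pbs_server does internally; the function name is hypothetical.

```python
import errno
import os
import select


def select_errno_on_closed_fd():
    """Call select() on a closed descriptor and return the errno raised."""
    r, w = os.pipe()
    os.close(w)
    os.close(r)  # r is now a stale descriptor number
    try:
        # Zero timeout: return immediately instead of blocking.
        select.select([r], [], [], 0)
    except OSError as exc:
        return exc.errno
    return None  # select() did not complain


if __name__ == "__main__":
    code = select_errno_on_closed_fd()
    print(code, errno.errorcode.get(code), os.strerror(code) if code else "")
```

If the server's descriptor set ever contains a socket that has already
been closed elsewhere, select() would fail in the same way, which would
be consistent with the log lines above, though that is only a guess.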


Even when there are only a few hundred socket connections I'll still
get messages like this when using various commands:

-bash-4.1# qstat
Error code - 15096 : message [cannot connect to port -1 in
socket_connect_addr - errno:9 Bad file descriptor]
parse_daemon_response error
Error communicating with node(xxx.xxx.xxx.xxx)
Communication failure.
qstat: cannot connect to server node (errno=15096) Error getting
connection to socket

Any suggestions? Thanks,

Doug Holt



