[torqueusers] internal socket table full log message

Stuart Barkley stuartb at 4gh.net
Fri Mar 25 15:44:24 MDT 2011


Catching up on things.

We have also seen this problem (with 2.5.3) once.  I studied it in
some detail at the time and decided it was probably a leak in the TCP
connection table.  My guess is that it is related to our users who
start 2500 array jobs all at once (and sometimes have them all die at
once).  I've been waiting for it to recur so I could check the
actual number of connections and look at things in more detail.

I take that back.

I think we were only seeing the "half-full" warning logged a few lines
above the "full" message in src/server/svr_connect.c.  I didn't trace
through all the code, but it looked like the "num_connections is 6"
part indicates how full the table should really be, hence my
conclusion that something was losing track of closed connections.

On Mon, 7 Mar 2011 at 10:33 -0000, Ken Nielson wrote:

> ulimit is not what is causing this problem. The error comes from an internal table being full. In TORQUE 2.5.5 the table size is 10240.
>
> How busy and large is your system? Do you have lingering sockets? Try netstat and see how many open TCP connections you have.
>
> Ken Nielson
> Adaptive Computing
>
> ----- Original Message -----
> From: "\"Hung-Sheng Tsao (Lao Tsao 老曹) Ph. D.\"" <laotsao at gmail.com>
> To: torqueusers at supercluster.org
> Sent: Monday, March 7, 2011 3:44:19 AM
> Subject: Re: [torqueusers] internal socket table full log message
>
> What is the server's ulimit -n?
> By default it is set to 1024.
> One can increase it; try 2048 or 4096.
> regards
>
> On 3/6/2011 11:52 PM, Abhishek Gupta wrote:
>
> This message keeps popping up in our PBS log file:
>
> 03/06/2011 23:13:33;0001;PBS_Server;Svr;PBS_Server;LOG_ALERT::socket_to_handle, internal socket table full (1024) - num_connections is 6
> 03/06/2011 23:13:34;0001;PBS_Server;Svr;PBS_Server;LOG_ALERT::socket_to_handle, internal socket table full (1024) - num_connections is 7
>
> Does anyone have any idea about this message? Because of it, no one was able to run jobs and I had to restart the service.
> Thanks,
> Abhi.
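
Following up on Ken's netstat suggestion: next time this happens I
intend to compare what pbs_server actually holds open with what the
table claims, along these lines (assuming the server is on its default
port, 15001; adjust if yours differs):

  # TCP connections involving the pbs_server port
  netstat -tan | grep -c ':15001'

  # file descriptors actually held open by the pbs_server process
  lsof -p $(pgrep pbs_server) | wc -l

If both numbers stay small while the "table full" alerts keep coming,
that would point at leaked table slots rather than a real flood of
connections.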

