[torqueusers] Slow response of torque when jobs are running

Craig Macdonald craigm at dcs.gla.ac.uk
Fri Dec 4 11:09:47 MST 2009


Hi Luc,

I have had problems with timeouts before, particularly with maui, when I didn't 
have nscd running. Various things froze up, and timeouts 
were the symptom. I'm not saying this is the problem, just an idea...
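
One rough way to check whether name lookups are the slow part is to time them directly on the head node and on a compute node. The sketch below is only an illustration: the function name time_lookups is made up for this example, and "localhost" is a placeholder for the node names in your own setup. If repeated lookups of the same name each take a noticeable fraction of a second, missing name-service caching becomes a plausible suspect.

```python
import socket
import time


def time_lookups(names, resolver=socket.gethostbyname):
    """Time one lookup per name and return (name, seconds, ok) tuples.

    `resolver` is any callable taking a host name; it defaults to the
    system resolver, so the timings include whatever caching (or lack
    of it) is active on this machine.
    """
    results = []
    for name in names:
        start = time.perf_counter()
        try:
            resolver(name)
            ok = True
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError
            ok = False
        elapsed = time.perf_counter() - start
        results.append((name, elapsed, ok))
    return results


if __name__ == "__main__":
    # "localhost" is a placeholder; replace it with your own node names.
    # Repeating a name shows whether later lookups are served faster.
    for name, seconds, ok in time_lookups(["localhost"] * 3):
        print(f"{name}: {seconds:.4f}s {'ok' if ok else 'FAILED'}")
```

Passing resolver=socket.gethostbyaddr with IP addresses instead of names times reverse lookups the same way.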

Craig

Luc Vereecken wrote:
> Hi all,
>
> I have upgraded my queuing system to torque-2.4.3-snap.200912031436, 
> and as far as I can tell, everything is working correctly. However, 
> when there are jobs running, response from torque commands, such as 
> pbsnodes, qstat, qdel, etc becomes very slow at times, sometimes 
> taking 30 seconds up to 5 minutes to do anything, both on the head 
> node and the compute nodes.
>
> It is not related to load on the head node, the network seems to be 
> working fine, but it seems as if pbs_server is waiting for a timeout 
> or something. Since I have only 40 nodes, I'm surprised to be 
> confronted with something like this, so I'm fairly baffled. With a 
> fully loaded cluster, pbs_iff also fails on the nodes and head node 
> (pbs_iff: cannot read reply from pbs_server), which I suspect is due 
> to a timeout against the slow server. In the server logs, I find 
> messages such as these:
> --------
> 12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 17 to host 2886730506 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 18 to host 2886730762 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 21 to host 2886730499 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 38 to host 2886730760 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:43:41;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
> 12/04/2009 17:47:29;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
> 12/04/2009 17:49:05;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
> 12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 54 to host 2886729985 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 56 to host 0 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:49:20;0040;PBS_Server;Svr;gweyring;Scheduler was sent the command time
> -------
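
The "host" values in those messages look like IPv4 addresses printed as plain unsigned integers. That is an assumption on my part, not something the log states, but under it 2886730506 decodes to 172.16.3.10, which is at least a plausible private cluster address. A small sketch for decoding and counting them follows; the helper name timed_out_hosts is invented for this example.

```python
import ipaddress
import re
from collections import Counter

# Matches the numeric host field in the wait_request timeout messages.
HOST_RE = re.compile(r"to host (\d+) has timed out")


def timed_out_hosts(log_text):
    """Count timeouts per host, decoding each integer as an IPv4 address.

    Assumes the integer is the address in its usual numeric form
    (most significant byte first); this is a guess about the log format.
    """
    counts = Counter()
    for value in HOST_RE.findall(log_text):
        counts[str(ipaddress.IPv4Address(int(value)))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "connection 17 to host 2886730506 has timed out after 900 seconds\n"
        "connection 56 to host 0 has timed out after 900 seconds\n"
    )
    for address, count in timed_out_hosts(sample).items():
        print(address, count)
```

Decoded addresses can then be compared with the node list to see whether the stale connections come from particular nodes or from all of them.
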
> Oddly, I have good response times and no timeouts with only a few jobs running.
>
> Any idea what might be causing this, and how to get a snappier 
> response from user commands? I have no idea where to start looking 
> for a solution, as the problem seems to scale with the 
> number of running jobs...
>
> Luc
>
>


