[torqueusers] Slow response of torque when jobs are running

Luc Vereecken Luc.Vereecken at chem.kuleuven.be
Fri Dec 4 09:53:54 MST 2009


Hi all,

I have upgraded my queuing system to torque-2.4.3-snap.200912031436, 
and as far as I can tell, everything is working correctly. However, 
when jobs are running, torque commands such as pbsnodes, qstat, and 
qdel become very slow at times, taking anywhere from 30 seconds to 
5 minutes to respond, on both the head node and the compute nodes.

It is not related to load on the head node, and the network seems to 
be working fine; it looks as if pbs_server is waiting for a timeout 
of some kind. Since I have only 40 nodes, I am surprised to run into 
something like this, and I am fairly baffled. With a fully loaded 
cluster, pbs_iff also fails on the compute nodes and the head node 
(pbs_iff: cannot read reply from pbs_server), which I suspect is due 
to a timeout against the slow server. In the server logs, I find 
messages such as the following:
--------
12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 17 to host 2886730506 has timed out after 900 seconds - closing stale connection
12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 18 to host 2886730762 has timed out after 900 seconds - closing stale connection
12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 21 to host 2886730499 has timed out after 900 seconds - closing stale connection
12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 38 to host 2886730760 has timed out after 900 seconds - closing stale connection
12/04/2009 17:43:41;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
12/04/2009 17:47:29;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
12/04/2009 17:49:05;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 54 to host 2886729985 has timed out after 900 seconds - closing stale connection
12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 56 to host 0 has timed out after 900 seconds - closing stale connection
12/04/2009 17:49:20;0040;PBS_Server;Svr;gweyring;Scheduler was sent the command time
-------
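The host values in the wait_request lines look like IPv4 addresses 
printed as plain 32-bit integers. Assuming that reading is right (and 
that the integer is in the usual big-endian dotted-quad order, which I 
have not verified against the torque source), a small Python sketch 
can decode them and count stale connections per host. The names 
decode_host and stale_hosts are made up for this illustration and are 
not part of torque:

```python
import ipaddress
import re
from collections import Counter

# Matches the timeout part of a wait_request log entry and captures the host number.
TIMEOUT_RE = re.compile(r"connection \d+ to host (\d+) has timed out after \d+ seconds")


def decode_host(value):
    """Treat the logged host number as a big-endian 32-bit IPv4 address (an assumption)."""
    return str(ipaddress.IPv4Address(int(value)))


def stale_hosts(log_text):
    """Count timed-out connections per decoded host address."""
    # Mail clients wrap long log entries, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    return Counter(decode_host(m.group(1)) for m in TIMEOUT_RE.finditer(flat))


# Two entries copied from the log excerpt, including the line wrapping.
sample = """12/04/2009
17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request,
connection 17 to host 2886730506 has timed out after 900 seconds -
closing stale connection
12/04/2009
17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request,
connection 56 to host 0 has timed out after 900 seconds - closing
stale connection"""

if __name__ == "__main__":
    for host, count in stale_hosts(sample).items():
        print(host, count)
```

Under that assumption the nonzero hosts in the excerpt all decode to 
addresses in 172.16.x.x, a private range, while host 0 decodes to 
0.0.0.0, which suggests a connection with no recorded peer address.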
Oddly, I have good response times and no timeouts with only a few jobs running.

Any idea what might be causing this, and how to get a snappier 
response from user commands? I have no idea where to start looking 
for a solution, as the problem seems to scale with the number of 
running jobs.

Luc



