[torqueusers] Slow response of torque when jobs are running
Luc Vereecken
Luc.Vereecken at chem.kuleuven.be
Fri Dec 4 09:53:54 MST 2009
Hi all,
I have upgraded my queuing system to torque-2.4.3-snap.200912031436,
and as far as I can tell, everything is working correctly. However,
when there are jobs running, responses from torque commands such as
pbsnodes, qstat, and qdel become very slow at times, taking anywhere
from 30 seconds to 5 minutes to do anything, both on the head node
and on the compute nodes.
It is not related to load on the head node, and the network seems to
be working fine; it looks as if pbs_server is waiting for a timeout
or something similar. Since I have only 40 nodes, I'm surprised to be
confronted with something like this, and I'm fairly baffled. With a
fully loaded cluster, pbs_iff also fails on the nodes and the head
node (pbs_iff: cannot read reply from pbs_server), which I suspect is
due to a timeout against the slow server. In the server logs, I find
messages such as the following:
--------
12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 17 to host 2886730506 has timed out after 900 seconds - closing stale connection
12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 18 to host 2886730762 has timed out after 900 seconds - closing stale connection
12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 21 to host 2886730499 has timed out after 900 seconds - closing stale connection
12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 38 to host 2886730760 has timed out after 900 seconds - closing stale connection
12/04/2009 17:43:41;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
12/04/2009 17:47:29;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
12/04/2009 17:49:05;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 54 to host 2886729985 has timed out after 900 seconds - closing stale connection
12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 56 to host 0 has timed out after 900 seconds - closing stale connection
12/04/2009 17:49:20;0040;PBS_Server;Svr;gweyring;Scheduler was sent the command time
-------
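To see which machines these stale connections belong to, the numeric
"host" values can be turned into dotted-quad addresses. The sketch
below assumes the host field is an IPv4 address printed as an unsigned
32-bit integer with the first octet most significant; that reading is
my assumption, not something the log itself states. Under it, the
values above fall in the private 172.16.0.0/12 range, which is
consistent with the assumption but does not prove it. The helper names
decode_host and timed_out_hosts are made up for this illustration.

```python
import ipaddress
import re

# Matches the timeout messages quoted above, even when the mail client
# has wrapped them across several lines (whitespace is collapsed first).
TIMEOUT_RE = re.compile(
    r"connection (\d+) to host (\d+) has timed out after (\d+) seconds"
)


def decode_host(value: int) -> str:
    """Render a 32-bit integer as a dotted-quad IPv4 address.

    Assumption: the first octet is the most significant byte.
    """
    return str(ipaddress.IPv4Address(value))


def timed_out_hosts(log_text: str):
    """Return (connection, address, timeout_seconds) for each timeout entry."""
    flat = " ".join(log_text.split())  # undo line wrapping
    return [
        (int(conn), decode_host(int(host)), int(secs))
        for conn, host, secs in TIMEOUT_RE.findall(flat)
    ]


if __name__ == "__main__":
    sample = """
    17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request,
    connection 17 to host 2886730506 has timed out after 900 seconds -
    closing stale connection
    """
    for entry in timed_out_hosts(sample):
        print(entry)  # (17, '172.16.3.10', 900)
```

If the assumption holds, the first entry above is 172.16.3.10, while
the "host 0" entry decodes to 0.0.0.0, meaning no usable peer address
was recorded for that connection.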
Oddly, I have good response times and no timeouts with only a few jobs running.
Any idea what might be causing this, and how to get a snappier
response from user commands? I have no idea where to start looking
for a solution, as the problem seems to scale with the number of
running jobs.
Luc