[torqueusers] Slow response of torque when jobs are running
Craig Macdonald
craigm at dcs.gla.ac.uk
Fri Dec 4 11:09:47 MST 2009
Hi Luc,
I had problems with timeouts before, particularly with maui, when I didn't
have nscd running. This caused various things to freeze up, and timeouts
were the symptom. I'm not saying this is the problem, but it's just an idea...
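
One way to check whether name resolution is the slow part is to time a few
lookups from the head node. The following is only a minimal sketch, not
anything torque or maui provides: the function name time_lookups and its
parameters are made up for illustration, and it assumes the nodes are
resolved through the normal system resolver.

```python
import socket
import time


def time_lookups(hostnames, resolver=socket.gethostbyname, repeats=3):
    """Time repeated name lookups.

    Returns {hostname: [elapsed seconds for each attempt]}.
    Failed lookups are timed as well, since a slow failure is
    exactly the kind of delay we are looking for.
    """
    timings = {}
    for name in hostnames:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            try:
                resolver(name)
            except OSError:
                # socket.gaierror and socket.herror are OSError subclasses
                pass
            samples.append(time.perf_counter() - start)
        timings[name] = samples
    return timings
```

Call it with the real node names, for example time_lookups(["node01",
"node02"]), where node01 and node02 are placeholders. If the first attempt
for each name takes seconds and later ones do not, or every attempt is slow,
a missing cache such as nscd or a slow name service would be a plausible
suspect.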
Craig
Luc Vereecken wrote:
> Hi all,
>
> I have upgraded my queuing system to torque-2.4.3-snap.200912031436,
> and as far as I can tell, everything is working correctly. However,
> when there are jobs running, response from torque commands, such as
> pbsnodes, qstat, qdel, etc., becomes very slow at times, sometimes
> taking from 30 seconds up to 5 minutes to do anything, both on the head
> node and the compute nodes.
>
> It is not related to load on the head node, the network seems to be
> working fine, but it seems as if pbs_server is waiting for a timeout
> or something. Since I have only 40 nodes, I'm surprised to be
> confronted with something like this, so I'm fairly baffled. With a
> fully loaded cluster, pbs_iff also fails on the nodes and headnode
> (pbs_iff: cannot read reply from pbs_server) which I suspect is due
> to a timeout against the slow server. In the server logs, I find
> messages such as those below:
> --------
> 12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 17 to host 2886730506 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 18 to host 2886730762 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 21 to host 2886730499 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 38 to host 2886730760 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:43:41;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
> 12/04/2009 17:47:29;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
> 12/04/2009 17:49:05;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
> 12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 54 to host 2886729985 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 56 to host 0 has timed out after 900 seconds - closing stale connection
> 12/04/2009 17:49:20;0040;PBS_Server;Svr;gweyring;Scheduler was sent the command time
> -------
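
The "host" values in these log lines look like IPv4 addresses written as
plain 32-bit integers. That is an assumption on my part, not something
confirmed in this thread, but if it holds, a short Python sketch can turn
them back into dotted addresses so the stale connections can be matched to
nodes. The helper name host_to_ip is made up for illustration.

```python
import ipaddress


def host_to_ip(host):
    """Decode a 32-bit integer into dotted-quad IPv4 notation.

    Assumes the logged value is the address as an unsigned integer
    with the first octet in the most significant byte.
    """
    return str(ipaddress.IPv4Address(int(host)))


# Host values copied from the log excerpt above
logged_hosts = [2886730506, 2886730762, 2886730499, 2886730760, 2886729985, 0]
decoded = {h: host_to_ip(h) for h in logged_hosts}
for h, ip in decoded.items():
    print(h, "->", ip)
```

Under that assumption the first entry decodes to 172.16.3.10 and the others
fall in the same 172.16.x.x private range, while host 0 decodes to 0.0.0.0,
which would suggest a connection with no recorded peer address.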
> Oddly, I have good response times and no timeouts with only a few jobs running.
>
> Any idea what might be causing this, and how to get a snappier
> response from user commands? I have no idea where to start looking
> for a solution, as this problem seems to scale with the
> number of running jobs...
>
> Luc