[torqueusers] Slow response of torque when jobs are running

Luc Vereecken Luc.Vereecken at chem.kuleuven.be
Fri Dec 11 11:50:02 MST 2009

Hi all,

Status issue: resolved (it seems)

Thanks to all who tried to help me with this. Turns out it was a 
less-than-obvious driver/kernel issue on the hardware used (see below).


Long explanation:
After checking everything I could think of that might have anything
at all to do with this issue, it appeared more and more likely that
torque was triggering something VERY odd on my system. After all,
many sites use torque successfully, as I have myself for many years.
Time to start looking at the problem from the other side. This is new
hardware, so "apparently" working might not be "correctly" working.

Symptoms: all network traffic is fast and correct, except for torque,
which works correctly but is very slow.

It could be strictly hardware related. But at the wire level, torque
traffic is not any different from any other traffic, which all did
work ok. The problem was also not affected by offloading to the NIC.

The next alternative is a NIC driver issue, or a kernel issue 
(equivalent for the current problem).
Most of the traffic I have is tcp (name resolution, ssh, file
copying, ...); torque, however, often uses udp. A udp bug in the
driver/kernel could perhaps be the reason everything worked other
than torque. I did some tests some time ago running udp streams
across the network to test udp, but they did not reveal anything
unusual. Then again, at that time I did not know as much about the
problem, so maybe I just never looked at the time lapse for
communication initiation, which seems to be the problem evidenced in
the straces shown in this thread. Can't remember.
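
For anyone wanting to repeat that kind of udp check with attention to
the initiation delay rather than throughput, a minimal sketch in
Python follows. It is not a torque tool: the function names, the echo
server and the probe payloads are all made up for illustration, and
it only times individual udp datagram round trips, so that a slow
first exchange stands out from the later ones.

```python
import socket
import threading
import time


def start_udp_echo(host="127.0.0.1"):
    """Start a tiny UDP echo server in a daemon thread.

    Returns (socket, port). Hypothetical helper, for illustration only.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind((host, 0))      # port 0: let the OS pick a free port
    srv.settimeout(0.2)      # wake up regularly so a close is noticed

    def loop():
        while True:
            try:
                data, addr = srv.recvfrom(2048)
            except socket.timeout:
                continue
            except OSError:
                return       # socket was closed
            srv.sendto(data, addr)

    threading.Thread(target=loop, daemon=True).start()
    return srv, srv.getsockname()[1]


def udp_round_trips(host, port, count=5, timeout=2.0):
    """Send `count` datagrams one at a time and time each round trip.

    Returns a list of seconds per probe; None marks a lost or
    mismatched reply.
    """
    times = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as cli:
        cli.settimeout(timeout)
        for i in range(count):
            payload = ("probe-%d" % i).encode()
            t0 = time.perf_counter()
            cli.sendto(payload, (host, port))
            try:
                reply, _ = cli.recvfrom(2048)
                ok = reply == payload
            except socket.timeout:
                ok = False
            times.append(time.perf_counter() - t0 if ok else None)
    return times


if __name__ == "__main__":
    srv, port = start_udp_echo()
    for n, rtt in enumerate(udp_round_trips("127.0.0.1", port)):
        print(n, "lost" if rtt is None else "%.6f s" % rtt)
    srv.close()
```

To test a real link, the echo part would run on one node and the
probe part on another, with the actual host name and port filled in.
A first probe that takes much longer than the rest would point at the
initiation delay rather than at steady-state udp throughput.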

I'm administering this machine from abroad at the moment, but I 
happened to be physically near the machine for half a day on Wednesday.
So, I upgraded the kernel and also switched to a different network 
driver, keeping everything else identical (hardware, torque version, 
utility software, network setup, etc). The kernel upgrade I did was 
only a small (security) upgrade of the basic kernel used in the suse 
10.3 family, and I moved to using the sky2 driver. Lo and behold!
Torque became responsive and well-behaved. Problem solved as far as I
can see. So, I can only conclude that I was triggering some odd thing 
in the old kernel and/or the old nic driver.

Note that the sky2 driver on this hardware is not ideal. If I do an 
ifdown, followed by an ifup, the hardware comes up but I get no 
(software) network connectivity back despite ifconfig showing the nic 
as up. Connectivity does get restored if I rmmod and re-insmod the 
sky2 module (or if I reboot of course). Having the system possibly 
not recover from network changes is tricky, given the remote 
administration. Still, having to ask a local friend-administrator to
help me out with an rmmod/insmod in an unlikely situation is a lot
better than having a virtually unusably slow torque server.

So what caused the problem: kernel or driver? I don't know. I didn't 
have enough time on-site to test the earlier nic driver again (a 
hand-compiled version of the venerable sk98lin driver that did 
support ifdown/ifup cycles and worked well other than for torque). It 
remains odd that only torque was affected (I still think it's 
udp-related). Also, I didn't have time yet to install the (better) 
Intel NICs I want this machine to use. That might have been an
alternative way to test the driver dependence. I might or might not 
retry the driver thing to further isolate the driver-or-kernel 
problem next time I'm on-site (near Christmas probably). At any rate,
it is now a hardware/driver/kernel problem, not a torque issue.

Other note: I was surprised before that a qstat triggered pbs_server
to contact each and every execution node with a job running, causing
the delays to become dependent on the number of jobs running. In
hindsight, this was caused by the bad communication between mom and
server. The server caches job information (with an adjustable
validity time), but because of the lousy communication pbs_server
only had stale information available, so it had to ask the nodes
again on every request.
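
The caching effect can be illustrated with a toy model in Python.
This is not torque's code; the class, its method names and the
30-second validity time are invented for the sketch. Entries are
refreshed when a node reports in, and a status query only contacts a
node when its entry is older than the validity time.

```python
import time


class JobInfoCache:
    """Toy cache: an entry is reused while younger than max_age seconds."""

    def __init__(self, poll_node, max_age, clock=time.monotonic):
        self.poll_node = poll_node  # hypothetical callable: job_id -> status
        self.max_age = max_age      # validity time in seconds
        self.clock = clock
        self.entries = {}           # job_id -> (timestamp, status)
        self.poll_count = 0         # how often a node had to be contacted

    def update(self, job_id, status):
        """Record a status report pushed by an execution node."""
        self.entries[job_id] = (self.clock(), status)

    def get(self, job_id):
        """Return the cached status, polling the node if it is stale."""
        now = self.clock()
        entry = self.entries.get(job_id)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]
        self.poll_count += 1
        status = self.poll_node(job_id)
        self.entries[job_id] = (now, status)
        return status


class FakeClock:
    """Manually advanced clock so the demo needs no real waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def demo(jobs=40):
    clock = FakeClock()
    cache = JobInfoCache(lambda job_id: "R", max_age=30, clock=clock)
    for job_id in range(jobs):
        cache.update(job_id, "R")      # nodes report in at t = 0

    clock.now = 10                     # reports are still fresh
    for job_id in range(jobs):
        cache.get(job_id)
    fresh_polls = cache.poll_count

    clock.now = 100                    # no reports arrived since t = 0
    for job_id in range(jobs):
        cache.get(job_id)
    stale_polls = cache.poll_count - fresh_polls
    return fresh_polls, stale_polls


if __name__ == "__main__":
    print(demo())  # (0, 40)
```

With working communication the query is answered entirely from the
cache; once the reports stop arriving, every running job costs one
node contact, which matches the delay growing with the number of
running jobs.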

At 16:53 04/12/2009, Luc Vereecken wrote:
>Hi all,
>I have upgraded my queuing system to torque-2.4.3-snap.200912031436,
>and as far as I can tell, everything is working correctly. However,
>when there are jobs running, response from torque commands, such as
>pbsnodes, qstat, qdel, etc becomes very slow at times, sometimes
>taking 30 seconds up to 5 minutes to do anything, both on the head
>node and the compute nodes.
>It is not related to load on the head node, the network seems to be
>working fine, but it seems as if pbs_server is waiting for a timeout
>or something. Since I have only 40 nodes, I'm surprised to be
>confronted with something like this, so I'm fairly baffled. With a
>fully loaded cluster, pbs_iff also fails on the nodes and headnode
>(pbs_iff: cannot read reply from pbs_server) which I suspect is due
>to a timeout against the slow server. In the serverlogs, I find
>messages such as the below:
>connection 17 to host 2886730506 has timed out after 900 seconds -
>closing stale connection
>connection 18 to host 2886730762 has timed out after 900 seconds -
>closing stale connection
>connection 21 to host 2886730499 has timed out after 900 seconds -
>closing stale connection
>connection 38 to host 2886730760 has timed out after 900 seconds -
>closing stale connection
>12/04/2009 17:43:41;0002;PBS_Server;Svr;PBS_Server;Torque Server
>Version = 2.4.3-snap.200912031436, loglevel = 0
>12/04/2009 17:47:29;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
>12/04/2009 17:49:05;0002;PBS_Server;Svr;PBS_Server;Torque Server
>Version = 2.4.3-snap.200912031436, loglevel = 0
>connection 54 to host 2886729985 has timed out after 900 seconds -
>closing stale connection
>connection 56 to host 0 has timed out after 900 seconds - closing
>stale connection
>12/04/2009 17:49:20;0040;PBS_Server;Svr;gweyring;Scheduler was sent
>the command time
>Oddly, I have good response times and no timeouts with only a few 
>jobs running.
>Any idea what might be causing this, and how to get a snappier
>response from user commands ? I have no idea where to start looking
>for a solution for this, as this problem seems to scale with the
>number of running jobs...
>torqueusers mailing list
>torqueusers at supercluster.org
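
A side note on reading the quoted log: the "host" values in the
timeout messages look like IPv4 addresses printed as plain unsigned
integers. Assuming that reading is right (the log format itself is
not documented in this thread), a short Python sketch using the
standard ipaddress module turns them back into dotted notation. The
function name is made up for the example.

```python
import ipaddress


def host_number_to_ip(number):
    """Interpret an unsigned 32-bit integer as an IPv4 address string.

    Assumption: the log prints the address as one integer with the
    first octet in the most significant byte.
    """
    return str(ipaddress.IPv4Address(number))


if __name__ == "__main__":
    # Host numbers copied from the quoted server log
    for number in (2886730506, 2886730762, 2886730499,
                   2886730760, 2886729985, 0):
        print(number, "->", host_number_to_ip(number))
    # 2886730506 -> 172.16.3.10
    # 2886730762 -> 172.16.4.10
    # 2886730499 -> 172.16.3.3
    # 2886730760 -> 172.16.4.8
    # 2886729985 -> 172.16.1.1
    # 0 -> 0.0.0.0
```

Under that assumption all the non-zero hosts fall in the private
172.16.0.0/12 range, which is at least consistent with them being
cluster nodes.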
