[torqueusers] Slow response of torque when jobs are running
Luc Vereecken
Luc.Vereecken at chem.kuleuven.be
Fri Dec 11 11:50:02 MST 2009
Hi all,
Status issue: resolved (it seems)
Thanks to all who tried to help me with this. Turns out it was a
less-than-obvious driver/kernel issue on the hardware used (see below).
Cheers,
Luc
Long explanation:
After checking everything I could think of that might have anything
at all to do with this issue, it appeared more and more likely that
torque was triggering something VERY odd on my system. After all,
many sites are using torque to good success, including myself for
many years. Time to start looking at the problem from the other side.
This is new hardware, so "apparently" working might not be "correctly" working.
Symptoms: all network traffic is fast and correct, except for torque,
which works correctly but is very slow.
It could be strictly hardware related. But at the wire level, torque
traffic is no different from any other traffic, all of which worked
fine. The slowness was also not affected by offloading to the NIC.
The next alternative is a NIC driver issue, or a kernel issue
(equivalent for the current problem).
Most of the traffic I have is TCP (name resolution, ssh, file
copying, ...), whereas torque often uses UDP. A UDP bug in the
driver/kernel could perhaps be the reason everything worked other
than torque. Some time ago I ran UDP streams across the network to
test UDP, but they did not reveal anything unusual. Then again, at
that time I did not know as much about the problem, so maybe I just
never looked at the time needed to initiate communication, which
seems to be the problem evidenced in the straces shown in this
thread. I can't remember.
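A throughput test with UDP streams can look healthy while the delay on
each individual request/reply exchange is bad. A minimal sketch of a
probe that times every datagram separately, including the very first
one, could look like the following. It is not the test that was run
here: the function names and the local echo server are hypothetical
helpers, and on a real cluster the echo side would run on a compute
node rather than on loopback.

```python
import socket
import threading
import time


def start_echo_server(host="127.0.0.1"):
    """Start a throwaway UDP echo server on an OS-chosen port (hypothetical helper)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind((host, 0))  # port 0 lets the OS pick a free port

    def loop():
        while True:
            try:
                data, addr = srv.recvfrom(2048)
            except OSError:
                return  # socket was closed
            srv.sendto(data, addr)

    threading.Thread(target=loop, daemon=True).start()
    return srv


def udp_round_trips(addr, count=5, timeout=2.0):
    """Return one round-trip time in seconds per datagram, or None on timeout."""
    times = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for i in range(count):
            payload = f"probe-{i}".encode()
            start = time.perf_counter()
            sock.sendto(payload, addr)
            try:
                reply, _ = sock.recvfrom(2048)
                ok = reply == payload
            except socket.timeout:
                ok = False
            times.append(time.perf_counter() - start if ok else None)
    return times


if __name__ == "__main__":
    server = start_echo_server()
    for i, rtt in enumerate(udp_round_trips(server.getsockname())):
        print(i, "timeout" if rtt is None else f"{rtt * 1000:.3f} ms")
```

Comparing the first value with the later ones is the point: a
driver/kernel problem in setting up the exchange would show up as a
large first round trip or as timeouts, even when bulk streams are fast.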
I'm administering this machine from abroad at the moment, but I
happened to be physically near the machine for half a day on
Wednesday. So I upgraded the kernel and also switched to a different
network driver, keeping everything else identical (hardware, torque
version, utility software, network setup, etc.). The kernel upgrade
was only a small (security) upgrade of the basic kernel used in the
SUSE 10.3 family, and I moved to the sky2 driver. Lo and behold!
Torque became responsive and well behaved. Problem solved as far as I
can see. I can only conclude that I was triggering something odd
in the old kernel and/or the old NIC driver.
Note that the sky2 driver on this hardware is not ideal. If I do an
ifdown, followed by an ifup, the hardware comes up but I get no
(software) network connectivity back despite ifconfig showing the nic
as up. Connectivity does get restored if I rmmod and re-insmod the
sky2 module (or if I reboot of course). Having the system possibly
not recover from network changes is tricky, given the remote
administration. Still, having to ask a local friend-administrator to
help me out with an rmmod/insmod in an unlikely situation is a lot
better than having a virtually unusably slow torque server.
So what caused the problem: kernel or driver? I don't know. I didn't
have enough time on-site to test the earlier nic driver again (a
hand-compiled version of the venerable sk98lin driver that did
support ifdown/ifup cycles and worked well other than for torque). It
remains odd that only torque was affected (I still think it's
udp-related). Also, I haven't yet had time to install the (better)
Intel NICs I want this machine to use. That might have been an
alternative way to test the driver dependence. I might or might not
retry the driver thing to further isolate the driver-or-kernel
problem next time I'm on-site (near Christmas probably). At any rate,
it is now a hardware/driver/kernel problem, not a torque issue.
Other note: I was surprised earlier that a qstat triggered pbs_server
to contact each and every execution node with a job running, making
the delays depend on the number of jobs running. In hindsight, this
was caused by the bad communication between mom and server.
pbs_server caches job information (with an adjustable validity
time), but the lousy communication meant that pbs_server only ever
had stale information available.
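To make the caching argument concrete, here is a toy model of a status
cache with a validity time. It is only a sketch of the general idea,
not pbs_server's actual code: the class name, the fetch callback and
the 60 second validity are all made up for illustration.

```python
import time


class JobStatusCache:
    """Toy model of a server-side job status cache with a validity time."""

    def __init__(self, fetch, max_age=60.0, clock=time.monotonic):
        self.fetch = fetch        # callable(job_id) -> status, raises OSError on failure
        self.max_age = max_age    # seconds an entry stays valid (arbitrary value)
        self.clock = clock
        self.entries = {}         # job_id -> (status, time of last good update)
        self.remote_calls = 0     # how often a node had to be contacted

    def status(self, job_id):
        now = self.clock()
        cached = self.entries.get(job_id)
        if cached is not None and now - cached[1] <= self.max_age:
            return cached[0]      # fresh entry: answered without network traffic
        self.remote_calls += 1    # stale or missing entry: ask the node
        try:
            value = self.fetch(job_id)
        except OSError:
            # Update failed, so the entry stays stale and the next query pays again.
            return cached[0] if cached is not None else None
        self.entries[job_id] = (value, now)
        return value

    def status_all(self, job_ids):
        return {job_id: self.status(job_id) for job_id in job_ids}
```

With a working fetch, a second status_all() within the validity time
makes no remote calls at all. With a fetch that keeps failing or timing
out, every status_all() makes one remote call per running job, which
matches delays that grow with the number of jobs.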
At 16:53 04/12/2009, Luc Vereecken wrote:
>Hi all,
>
>I have upgraded my queuing system to torque-2.4.3-snap.200912031436,
>and as far as I can tell, everything is working correctly. However,
>when there are jobs running, response from torque commands, such as
>pbsnodes, qstat, qdel, etc becomes very slow at times, sometimes
>taking 30 seconds up to 5 minutes to do anything, both on the head
>node and the compute nodes.
>
>It is not related to load on the head node, the network seems to be
>working fine, but it seems as if pbs_server is waiting for a timeout
>or something. Since I have only 40 nodes, I'm surprised to be
>confronted with something like this, so I'm fairly baffled. With a
>fully loaded cluster, pbs_iff also fails on the nodes and headnode
>(pbs_iff: cannot read reply from pbs_server) which I suspect is due
>to a timeout against the slow server. In the serverlogs, I find
>messages such as the below:
>--------
>12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 17 to host 2886730506 has timed out after 900 seconds - closing stale connection
>12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 18 to host 2886730762 has timed out after 900 seconds - closing stale connection
>12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 21 to host 2886730499 has timed out after 900 seconds - closing stale connection
>12/04/2009 17:38:17;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 38 to host 2886730760 has timed out after 900 seconds - closing stale connection
>12/04/2009 17:43:41;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
>12/04/2009 17:47:29;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
>12/04/2009 17:49:05;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.3-snap.200912031436, loglevel = 0
>12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 54 to host 2886729985 has timed out after 900 seconds - closing stale connection
>12/04/2009 17:49:05;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 56 to host 0 has timed out after 900 seconds - closing stale connection
>12/04/2009 17:49:20;0040;PBS_Server;Svr;gweyring;Scheduler was sent the command time
>-------
>Oddly, I have good response times and no timeouts with only a few
>jobs running.
>
>Any idea what might be causing this, and how to get a snappier
>response from user commands ? I have no idea where to start looking
>for a solution for this, as this problem seems to scale with the
>number of running jobs...
>
>Luc
>
>
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers
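The wait_request entries in the quoted server log identify each host by
a bare integer. Assuming these are IPv4 addresses printed as unsigned
32-bit numbers with the first octet most significant (an assumption,
although the values do decode to a private address range), a short
sketch like this turns them back into dotted-quad form so they can be
matched to nodes:

```python
import ipaddress

# Host values copied from the quoted pbs_server log.
LOGGED_HOSTS = [2886730506, 2886730762, 2886730499, 2886730760, 2886729985, 0]


def decode_host(value):
    """Interpret an integer as an IPv4 address, most significant octet first."""
    return str(ipaddress.IPv4Address(value))


for value in LOGGED_HOSTS:
    print(value, "->", decode_host(value))
# 2886730506 -> 172.16.3.10
# 2886730762 -> 172.16.4.10
# 2886730499 -> 172.16.3.3
# 2886730760 -> 172.16.4.8
# 2886729985 -> 172.16.1.1
# 0 -> 0.0.0.0
```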