[torqueusers] Considerations for Clusters Running Lots of Small Jobs
meo at intrinsity.com
Thu Dec 4 21:53:33 MST 2008
Joshua Bernstein said...
| The TORQUE documentation contains a nice explanation for running TORQUE
|on a large cluster. But are these ideas also pertinent to, say, a very
|small four node cluster running many thousands of short lived
|jobs? It's very common in the BioIT space to have a comparatively small
|cluster, but with many thousands of jobs lasting only a few seconds.
|Does anybody have any guidance on configuration or even source level
|changes for a high throughput, small cluster with short lived jobs? Or
|would we expect the same changes for a large cluster to also be
|applicable to this configuration?
We meant to write up what we did, but nobody had the
time, and it wasn't all easy to reconstruct. We have
a mix of jobs that run anywhere from days (not that many)
to hours (more, but still not huge numbers) to minutes
(quite a few) to seconds (lots). We have 42 queues.
Things were fine so long as we only had a couple of
hundred single CPU systems, none blazingly fast. But
as we moved into dual CPU and multi-core systems, moved
away from buying the cheapest hardware we could get,
and upgraded to a gigabit network with higher speed
backbone, we ran into massive problems.
We tried a lot of things; we can't pin down a single
change that solved "the problem", but together they
all helped a lot. I'll mention as many as I can
recall or dredge up from emails.
1) We moved the torque and maui server processes onto
a much faster, dual core system. With a small
   cluster and just a few queues, this may not be
   necessary.
2) We bundled up small jobs to the extent we could
reasonably do so.
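   A minimal sketch of what "bundling up" can look like:
   instead of one qsub per tiny command, group N commands
   into each job script. The file names, chunk size, and
   the generated command list below are all made up for
   illustration; the submit step (qsub) is left as a
   comment since it only makes sense on a real cluster.

   ```shell
   #!/bin/sh
   # Fake a list of 250 one-line commands for the demo.
   seq 1 250 | sed 's/^/echo task /' > commands.txt

   CHUNK=100                        # commands per bundled job
   split -l "$CHUNK" commands.txt bundle.

   # Wrap each chunk in a runnable job script.
   # On the cluster you would then: for f in bundle.??.sh; do qsub "$f"; done
   for f in bundle.??; do
       { echo '#!/bin/sh'; cat "$f"; } > "$f.sh"
       chmod +x "$f.sh"
   done
   ```

   Here 250 tiny submissions collapse into three, which is
   the whole point: the per-job server overhead dominates
   when jobs last only seconds.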
3) Scripts that submit many jobs are now generally
throttled by monitoring the queued jobs and not
queuing more than N at a time (we set "max_queueable"
and some submitters probably just depend on that
to help with throttling). N varies with the
queue and job type, but is always 1,000 or less.
[It can be handy to throttle job submission rate
as well; if the server is too busy handling requests
to queue up jobs, it's not actually running as many.
Again, on a small cluster this may not matter.]
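   The throttling logic itself is simple; a self-contained
   sketch follows. In production the poll would be
   something like "qselect -s Q | wc -l"; here a stub
   counter (FAKE_QUEUED) stands in for it so the loop is
   runnable on its own, and the qsub call is echoed rather
   than executed. Names and values are illustrative.

   ```shell
   #!/bin/sh
   MAX=1000            # our N; pick per queue and job type
   FAKE_QUEUED=1800    # stub for: qselect -s Q | wc -l

   submit_throttled() {
       # Block until the queued-job count drops below MAX.
       while [ "$FAKE_QUEUED" -ge "$MAX" ]; do
           # production: sleep 30, then re-poll qselect;
           # stub: pretend 400 jobs drain per poll
           FAKE_QUEUED=$((FAKE_QUEUED - 400))
       done
       echo "qsub $1"  # production: qsub "$1"
   }

   submit_throttled job_0001.sh
   ```

   Pairing a loop like this with "max_queueable" gives
   belt-and-suspenders throttling: the script backs off
   voluntarily, and the server enforces the cap anyway.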
4) Switch from RPP to TCP. We can't say for sure that in
the long run this helped a lot, but at the time we
tried it, it helped. Garrick (IIRC) says it shouldn't
   matter, but we were getting killed with RPP when we
expanded past a certain point. May be irrelevant
for a small cluster.
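   If memory serves, the transport is chosen when TORQUE is
   built; 2.x releases accept a configure switch for it.
   Verify the exact flag against ./configure --help on your
   version before relying on this:

   ```
   # build-time selection of TCP instead of RPP
   # (check your release's configure help first)
   ./configure --disable-rpp
   make && make install
   ```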
5) Tweak your torque/maui server kernel!
- Play with swappiness, pagecache, or whatever related
variables are in your kernel. We pushed ours toward
very low swappiness because we have plenty of RAM and
     don't ever want to wait on the server processes. We
don't run anything on this system we don't have to.
- We made the following network related changes on the
torque/maui server (2.6.9-55.0.2.ELsmp):
# Speed up NFS read performance
vm.min-readahead = 15
vm.max-readahead = 255
# release sockets faster because we use a lot of them
net.ipv4.tcp_fin_timeout = 20
# Reuse sockets as fast as possible
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
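     To make settings like these survive a reboot, the usual
     approach on these kernels is to put them in
     /etc/sysctl.conf and reload with "sysctl -p" as root;
     a fragment:

     ```
     # /etc/sysctl.conf (apply with: sysctl -p)
     # release sockets faster because we use a lot of them
     net.ipv4.tcp_fin_timeout = 20
     # reuse sockets as fast as possible
     net.ipv4.tcp_tw_reuse = 1
     net.ipv4.tcp_tw_recycle = 1
     ```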
6) Run nscd. As the network and job processing sped
up, we started seeing UID, GID and host lookup errors.
We had to run nscd to cache passwd, group and host
info. We tried keeping the hosts file up to date
on the torque/maui server, and that helped, but it
was too painful in our changing environment. So we
added that to nscd. We run nscd on the server and
all client nodes, and in fact on all the Linux-based
systems that figure into anything even vaguely related
to torque. This has its own problems; the daemon is
not rock solid. We sometimes have to restart it on
every system to make sure it refreshes. [Torque and/or
maui were making insane numbers of passwd/group NIS
   lookups, and for some reason insist on checking NIS
for hostnames. We've always used DNS for those.]
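   For reference, the table caching we depend on maps to
   nscd.conf entries roughly like the following (the TTL
   value is illustrative, not a recommendation); and when
   a cache does go stale, "nscd -i <table>" invalidates a
   single table without a full restart:

   ```
   # /etc/nscd.conf fragment (illustrative TTL)
   enable-cache            passwd  yes
   enable-cache            group   yes
   enable-cache            hosts   yes
   positive-time-to-live   hosts   600
   ```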
7) Play with every server parameter possible. One of
the biggest ones for us was "scheduler_iteration".
I don't know that we'll ever have the One True Value;
we end up readjusting it every so often. We try to
keep it so the scheduler stays busy. We also dropped
things like "node_check_rate" down quite a bit.
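   These knobs are all set through qmgr; a fragment for
   the ones mentioned here (the values are placeholders --
   as noted above, we end up re-tuning ours periodically):

   ```
   # qmgr input (placeholder values)
   set server scheduler_iteration = 60
   set server node_check_rate = 150
   set queue batch max_queueable = 1000
   ```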
We also tried using routing queues to feed run queues.
That seemed to help for a while, but as we changed other
things we ended up doing away with them. You may want
to consider them, though.
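If you do try routing queues, the setup is a few qmgr
commands; queue names below are made up for illustration:

```
# qmgr input: a routing queue feeding two execution queues
create queue feed queue_type = route
set queue feed route_destinations = short
set queue feed route_destinations += long
set queue feed enabled = true
set queue feed started = true
```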
We're looking at upgrading the torque/maui server again.
Sometimes maui is running 100% on a core for fairly
long periods of time.
We bumped RESERVATIONDEPTH to 100 to help assure that
high priority jobs always run ahead of lower priority
jobs.
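For anyone hunting for that knob: it lives in maui.cfg,
not in torque. A minimal fragment:

```
# maui.cfg fragment
RESERVATIONDEPTH  100
```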