[torqueusers] Considerations for Clusters Running Lots of Small Jobs

Miles O'Neal meo at intrinsity.com
Thu Dec 4 21:53:33 MST 2008

Joshua Bernstein said...

|	The TORQUE documentation contains a nice explanation for running TORQUE 
|on a large cluster. But are these ideas also pertinent to, say, a very 
|small four-node cluster running many thousands of short-lived 
|jobs? It's very common in the BioIT space to have a comparatively small 
|cluster, but with many thousands of jobs lasting only a few seconds. 
|Does anybody have any guidance on configuration or even source-level 
|changes for a high-throughput, small cluster with short-lived jobs? Or 
|would we expect the same changes for a large cluster to also be 
|applicable to this configuration?

We meant to write up what we did, but nobody had the
time, and it wasn't all easy to reconstruct.  We have
a mix of jobs that run anywhere from days (not that
many) to hours (more, but still not huge numbers) to
minutes (quite a few) to seconds (lots).  We have 42
queues.  Things were fine as long as we only had a
couple of hundred single-CPU systems, none blazingly
fast.  But as we moved into dual-CPU and multi-core
systems, moved away from buying the cheapest hardware
we could get, and upgraded to a gigabit network with a
higher speed backbone, we ran into massive problems.

We tried a lot of things; we can't pin down a single
change that solved "the problem", but together they
all helped a lot.  I'll mention as many as I can
recall or dredge up from emails.

1) We moved the torque and maui server processes onto
   a much faster, dual core system.  With a small
   cluster and just a few queues, this may not be
   as important.

2) We bundled up small jobs to the extent we could
   reasonably do so.
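   The bundling in (2) can be sketched roughly like this;
   the file names and the ./analyze command are invented
   for illustration, not our actual scripts:

```shell
#!/bin/sh
# Hypothetical sketch: collapse many short commands (one per
# line in tasks.txt) into a single PBS job, so the server
# queues one job instead of hundreds.
cat > tasks.txt <<'EOF'
./analyze sample1
./analyze sample2
./analyze sample3
EOF

{
  echo '#!/bin/sh'
  echo '#PBS -N bundled_tasks'
  cat tasks.txt       # run the bundled commands in sequence
} > bundle.pbs

# Then submit one job instead of three:
#   qsub bundle.pbs
```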

3) Scripts that submit many jobs are now generally
   throttled by monitoring the queued jobs and not
   queuing more than N at a time (we set "max_queueable"
   and some submitters probably just depend on that
   to help with throttling).  N varies with the
   queue and job type, but is always 1,000 or less.
   [It can be handy to throttle job submission rate
   as well; if the server is too busy handling requests
   to queue up jobs, it's not actually running as many.
   Again, on a small cluster this may not matter.]
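   A throttling wrapper along the lines of (3) might look
   like this minimal sh sketch.  COUNT_CMD and SUBMIT_CMD
   are placeholders for whatever your site uses, not our
   actual commands:

```shell
#!/bin/sh
# Hypothetical throttle sketch.  On a real cluster the
# placeholders might be:
#   COUNT_CMD='qselect -s Q | wc -l'   (count of queued jobs)
#   SUBMIT_CMD='qsub job.pbs'
# Server-side backstop (real Torque queue attribute):
#   qmgr -c 'set queue batch max_queueable = 1000'
MAX_QUEUED=1000

throttled_submit() {
    # Wait until the queued-job count drops below MAX_QUEUED,
    # then submit one more job.
    while [ "$(eval "$COUNT_CMD")" -ge "$MAX_QUEUED" ]; do
        sleep 10
    done
    eval "$SUBMIT_CMD"
}
```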

4) Switch from RPP to TCP.  We can't say for sure that in
   the long run this helped a lot, but at the time we
   tried it, it helped.  Garrick (IIRC) says it shouldn't
   matter, but we were getting killed with RPP when we
   expanded past a certain point.  May be irrelevant
   for a small cluster.
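   If memory serves, RPP could be disabled at build time in
   the TORQUE of that era; check ./configure --help in your
   source tree, since flag names vary between releases:

```shell
# From the TORQUE source tree (build-time configuration;
# flag name may differ in your release):
./configure --disable-rpp
make
make install    # typically as root
```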

5) Tweak your torque/maui server kernel!
   - Play with swappiness, pagecache, or whatever related
     variables are in your kernel.  We pushed ours toward
     very low swappiness because we have plenty of RAM and
     don't ever want to wait on the server processes.  We
     don't run anything on this system we don't have to.
   - We made the following network related changes on the
     torque/maui server (2.6.9-55.0.2.ELsmp):

        # Speed up NFS read performance
        vm.min-readahead = 15
        vm.max-readahead = 255

        # release sockets faster because we use a lot of them
        net.ipv4.tcp_fin_timeout = 20
        # Reuse sockets as fast as possible
        net.ipv4.tcp_tw_reuse = 1
        net.ipv4.tcp_tw_recycle = 1

6) Run nscd.  As the network and job processing sped
   up, we started seeing UID, GID and host lookup errors.
   We had to run nscd to cache passwd, group and host
   info.  We tried keeping the hosts file up to date
   on the torque/maui server, and that helped, but it
   was too painful in our changing environment.  So we
   added that to nscd.  We run nscd on the server and
   all client nodes, and in fact on all the Linux-based
   systems that figure into anything even vaguely related
   to torque.  This has its own problems; the daemon is
   not rock solid.  We sometimes have to restart it on
   every system to make sure it refreshes.  [Torque and/or
   maui were making insane numbers of passwd/group NIS
   lookups, and for some reason insist on checking NIS
   for hostnames.  We've always used DNS for those.]
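   For reference, a minimal nscd.conf fragment covering
   those three maps might look like the following (glibc
   nscd syntax; the TTLs here are guesses, not our values --
   see nscd.conf(5) on your system):

```
enable-cache            passwd  yes
positive-time-to-live   passwd  600
enable-cache            group   yes
positive-time-to-live   group   600
enable-cache            hosts   yes
positive-time-to-live   hosts   600
```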

7) Play with every server parameter possible.  One of
   the biggest ones for us was "scheduler_iteration".
   I don't know that we'll ever have the One True Value;
   we end up readjusting it every so often.  We try to
   keep it so the scheduler stays busy.  We also dropped
   things like "node_check_rate" down quite a bit.
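   Adjusting those attributes is just qmgr one-liners; the
   values below are illustrative, not recommendations (we
   keep readjusting ours):

```shell
qmgr -c 'set server scheduler_iteration = 60'
qmgr -c 'set server node_check_rate = 120'
qmgr -c 'print server'    # inspect the current settings
```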

We also tried using routing queues to feed run queues.
That seemed to help for a while, but as we changed other
things we ended up doing away with them.  You may want
to consider them, though.
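For anyone wanting to try routing queues, the standard
qmgr incantation is roughly as follows (queue names are
invented):

```shell
# A routing queue "feeder" that forwards jobs into an
# execution queue "short":
qmgr -c 'create queue feeder queue_type = route'
qmgr -c 'set queue feeder route_destinations = short'
qmgr -c 'set queue feeder enabled = true'
qmgr -c 'set queue feeder started = true'
```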

We're looking at upgrading the torque/maui server again.
Sometimes maui is running 100% on a core for fairly
long periods of time.

We bumped RESERVATIONDEPTH to 100 to help assure that
high priority jobs always run ahead of lower priority
jobs.
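
For Maui, that's a one-line setting in maui.cfg:

```
RESERVATIONDEPTH    100
```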

