[torqueusers] torque not scaling well

Miles O'Neal meo at intrinsity.com
Wed Aug 1 20:44:06 MDT 2007


Chris Samuel responded:

|> But we still start seeing slowdowns in job run rates somewhere between 1000
|> and 1500 jobs queued, and once we get up around 3K jobs queued, forget it.
|
|It is worth keeping in mind that this could be a Maui scaling problem, not
|Torque.  Maui is the part that is scanning the queues and trying to work out
|the best order to run jobs in based on your policy.

We've been looking at everything: I/O stats,
memory usage, network usage, Torque, Maui,
etc.

Running at max debug shows that a scheduling
pass goes from well under 1 second below
roughly 1500 queued jobs up to 10 or 15 minutes
as the queue length increases, and nearly all
of that time is spent waiting on pbs_server
to respond to Maui.  We're not yet sure
what that means, but it really makes
us wonder about Torque.
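For anyone who wants to watch the same thing, this is just the stock
Maui knobs; a minimal sketch, assuming a standard maui.cfg (LOGLEVEL
and RMPOLLINTERVAL are normal Maui parameters, and the log path below
is only an example -- yours will differ):

    # maui.cfg -- temporarily crank the log level so each scheduling
    # iteration, and the time spent waiting on pbs_server, is visible
    LOGLEVEL        9
    RMPOLLINTERVAL  00:01:00

    # then watch the iterations go by (example path)
    tail -f /usr/local/maui/log/maui.log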

Once it's in this state, any Torque command
(qsub, qdel, qstat, qmgr, etc.) either takes
a ridiculously long time to execute or simply fails.
Maui commands that require Maui to talk to
Torque have the same problem.
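A quick way to put numbers on that, using nothing but the standard
client commands (nothing here is specific to our site):

    # round trip to pbs_server
    time qstat -a > /dev/null

    # round trip to Maui, which in turn has to query pbs_server
    time showq > /dev/null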

|One suggestion based on what we do here is to look at limiting the number of
|running and idle jobs each user can have to make life fairer across the board
|and to stop people queue stuffing.
|
|On our clusters we set something like:
|
|USERCFG[DEFAULT]        MAXJOB=35,20
|USERCFG[DEFAULT]        MAXIJOB=5

I don't know if we can come up with reasonable
values for everyone (we have a fairly random
job mix in terms of users and queues at any given
time), but I'll discuss it with our users and see.
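If we do go that route, it would presumably be a couple of lines in
maui.cfg along the lines of Chris's example; the numbers below are
placeholders, not values we've settled on:

    USERCFG[DEFAULT]   MAXJOB=50,25
    USERCFG[DEFAULT]   MAXIJOB=10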

[Today we implemented some queue limits with
Torque's max_queuable on key queues (until now
we just had run limits), but that may break a
lot of scripts.  And since this should be
scaling a lot better than it is, we consider
this a very temporary workaround.]
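For the record, that's just the standard qmgr queue attribute; the
queue name "batch" below is a stand-in for one of our key queues and
2000 is an arbitrary cap:

    # cap how many jobs can sit in the queue at once
    qmgr -c "set queue batch max_queuable = 2000"

    # confirm it took
    qmgr -c "list queue batch"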

Thanks,
Miles

