[torqueusers] Problems with maui scalability

Ronny T. Lampert telecaadmin at gmail.com
Thu Aug 16 02:23:00 MDT 2007


Hi,

Yesterday maui behaved about as badly as it possibly could.
I had around 32K jobs queued (for 7 nodes with np=2 each).
I know that maui only considers the first 3-4K jobs, which would be
totally fine for me, since my configuration amounts to more or less
FIFO scheduling.

maui went to 100% CPU, did not respect RMPOLLINTERVAL (45s in my setup)
and completely choked on the load, filling up the logs with:

08/16 09:36:03 WARNING:  job buffer overflow (cannot add job '498677')
08/16 09:36:03 ERROR:    job buffer is full  (ignoring job '498677.SERVER.DOMAIN')
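
To get a feel for how many jobs are being dropped, the log can be
summarized with a small script. This is only a rough sketch: the regular
expression is derived from the two messages quoted above (other maui
versions may word them differently), the function name is my own, and it
assumes that '498677' and '498677.SERVER.DOMAIN' refer to the same job.

```python
import re
from collections import Counter

# Pattern assumed from the two log messages quoted above; it is not
# taken from the maui source and may need adjusting for other versions.
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>WARNING|ERROR):\s+"
    r"job buffer (?:overflow|is full)\s+"
    r"\((?:cannot add|ignoring) job '(?P<job>[^']+)'\)"
)

def summarize_overflows(lines):
    """Return (message count per level, set of numeric job ids)."""
    levels = Counter()
    jobs = set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a job-buffer message
        levels[m.group("level")] += 1
        # Assumption: the part before the first dot is the job number.
        jobs.add(m.group("job").split(".", 1)[0])
    return levels, jobs

sample = [
    "08/16 09:36:03 WARNING:  job buffer overflow (cannot add job '498677')",
    "08/16 09:36:03 ERROR:    job buffer is full  (ignoring job '498677.SERVER.DOMAIN')",
]
levels, jobs = summarize_overflows(sample)
print(dict(levels), sorted(jobs))
```

For a real log, pass an open file object instead of the sample list.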

It barely kept 5-10 of the 14 CPUs busy.
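
To put numbers on that, here is the arithmetic as a tiny sketch (the
helper name is just for illustration; the node and CPU counts are the
ones given above):

```python
def utilization(busy_cpus, total_cpus):
    """Fraction of CPUs that are running jobs."""
    return busy_cpus / total_cpus

nodes, np_per_node = 7, 2
total = nodes * np_per_node  # 7 nodes with np=2 each

for busy in (5, 10):
    print(f"{busy}/{total} CPUs busy = {utilization(busy, total):.0%}")
```

So roughly a third to a half of the cluster was sitting idle while
tens of thousands of jobs were waiting.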

So my questions are:

1) How can I tell maui that it's OK to only consider the first 4K jobs?
2) How can I keep maui playing along nicely?
3) How can I keep my 14 CPUs busy?

I know of this thread:

http://www.supercluster.org/pipermail/mauiusers/2004-August/001303.html

but ramping up maui's memory footprint to 1G is not feasible for me.

Here's the maui.cfg; nothing too complicated, I think:

BACKFILLPOLICY          FIRSTFIT
RESERVATIONPOLICY       NEVER

NODEALLOCATIONPOLICY    PRIORITY
NODECFG[DEFAULT] PRIORITYF='-JOBCOUNT'

QUEUETIMEWEIGHT         1
CREDWEIGHT              1
USERWEIGHT              0
GROUPWEIGHT             0
QOSWEIGHT               3
CLASSWEIGHT             1
USAGEWEIGHT             1
USAGEEXECUTIONTIMEWEIGHT 1

QOSCFG[high]  PRIORITY=1000 QFLAGS=PREEMPTOR
QOSCFG[low] PRIORITY=-1000 QFLAGS=PREEMPTEE

CLASSCFG[default]       QDEF=low
CLASSCFG[short]         QDEF=high MAXNODE=4,14 MAXJOB=4,14


Cheers,
Ronny

