[torqueusers] Problems with maui scalability
Ronny T. Lampert
telecaadmin at gmail.com
Thu Aug 16 02:23:00 MDT 2007
Yesterday maui was behaving in the worst way possible.
I had around 32K jobs queued (for 7 np=2 nodes).
I know that maui only considers the first 3-4K jobs, which would be
totally fine for me, since I have more or less FIFO scheduling configured.
maui went to 100% CPU, didn't respect the RMPOLLINTERVAL (45s for me)
and completely choked on the load, filling up the logs with:
08/16 09:36:03 WARNING: job buffer overflow (cannot add job '498677')
08/16 09:36:03 ERROR: job buffer is full (ignoring job
It barely kept 5-10 CPUs (out of 14 CPUs) running.
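To get a rough idea of how many jobs are being turned away, the overflow warnings can be tallied with a short script. This is only a sketch: the regular expression is modelled on the single WARNING line quoted above, the function name count_overflows is made up for illustration, and other maui versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the WARNING line quoted in this post (an assumption:
# other maui versions may format the message differently).
OVERFLOW_RE = re.compile(
    r"WARNING:\s+job buffer overflow \(cannot add job '(\d+)'\)"
)

def count_overflows(lines):
    """Return (total warnings, number of distinct job ids) seen in lines."""
    hits = Counter()
    for line in lines:
        match = OVERFLOW_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return sum(hits.values()), len(hits)

# Hypothetical usage (the log path is a placeholder, not a known location):
#   with open("/path/to/maui.log") as fh:
#       total, distinct = count_overflows(fh)
#       print(total, "warnings for", distinct, "distinct jobs")
```

If the same job ids show up again on every scheduling pass, the total grows much faster than the distinct count, which would explain how quickly the logs fill up.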
So my questions are:
1) How can I tell maui that it's OK to only consider the first 4K jobs?
2) How can I keep maui playing along nicely?
3) How can I keep my 14 CPUs busy?
I know of this thread
but ramping up maui's footprint to 1G is not feasible.
Here's the maui.cfg, nothing too difficult I think:
QOSCFG[high] PRIORITY=1000 QFLAGS=PREEMPTOR
QOSCFG[low] PRIORITY=-1000 QFLAGS=PREEMPTEE
CLASSCFG[short] QDEF=high MAXNODE=4,14 MAXJOB=4,14