[Mauiusers] (my) maui configuration error?
Jones de Andrade
johannesrs at gmail.com
Thu Sep 10 12:43:58 MDT 2009
Hi all.
We've been using Torque+Maui on our cluster for some time now. It's a small
cluster, composed of 8 (now only 6 online) quad-core nodes plus a master
machine.
The way the queue should work is:
1 - Each user may use at most 8 processors/cores at the same time, at *any*
moment;
2 - No (sub-)group of users (there are three) may use more than 16
processors/cores at the same time, at *any* moment.
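To make the two rules concrete, here is a minimal Python sketch of the check I expect the scheduler to apply before starting a job. It is only an illustration of the intended policy, not Maui's actual logic; the function name, the tuple layout and the user/group names are all made up for the example.

```python
# Hypothetical sketch of the intended limits; not Maui's actual scheduling logic.
MAX_PROCS_PER_USER = 8    # rule 1
MAX_PROCS_PER_GROUP = 16  # rule 2


def can_start(running, user, group, procs):
    """Return True if starting a job of `procs` cores keeps both limits.

    `running` is a list of (user, group, procs) tuples for jobs already running.
    """
    user_total = sum(p for u, _, p in running if u == user)
    group_total = sum(p for _, g, p in running if g == group)
    return (user_total + procs <= MAX_PROCS_PER_USER
            and group_total + procs <= MAX_PROCS_PER_GROUP)


# Made-up example: three 4-core jobs already running in one group.
running = [("alice", "g1", 4), ("alice", "g1", 4), ("bob", "g1", 4)]
print(can_start(running, "alice", "g1", 4))  # False: alice would reach 12 cores
print(can_start(running, "bob", "g1", 4))    # True: bob at 8, group at 16
print(can_start(running, "carol", "g1", 8))  # False: group would reach 20
```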
I'm quite sure I tested the following configuration to make certain it
enforces these usage rules. Unfortunately, for reasons that are still
unexplained, the cluster is only now starting to be heavily used, more than a
month after it was officially started. And now some really strange behaviour
has been noticed, which seems to be related to the way I configured Maui:
1 - There are 6 jobs running at the same time, using 24 cores, all from users
of the same group. I guess that if there were more nodes available, the two
queued jobs from that same group would also start to run.
2 - There is one user with three jobs of his own, using 12 cores. :(
How can I correct this? Due to "internal policies", there is no problem in
having a spare node sitting idle with no processes (we intend to deal with
that by implementing some sort of "wake on LAN" procedure), but by no means
may a group or a user go over these established limits. :(
Here is my maui.cfg. I only removed the server name, for security reasons:
*******************
# maui.cfg 3.2.6p20
SERVERHOST server
# primary admin must be first in list
ADMIN1 root
# Resource Manager Definition
RMCFG[server] TYPE=PBS
# Allocation Manager Definition
AMCFG[bank] TYPE=NONE
RMPOLLINTERVAL 00:00:30
SERVERPORT 42559
SERVERMODE NORMAL
# Admin: http://supercluster.org/mauidocs/a.esecurity.html
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html
QUEUETIMEWEIGHT 1
# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
#FSPOLICY PSDEDICATED
#FSDEPTH 7
#FSINTERVAL 86400
#FSDECAY 0.80
# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html
CLASSCFG[cluster] MAXPROC[GROUP]=16 MAXPROC[USER]=8
CLASSCFG[qm] MAXPROC[USER]=8
# Backfill: http://supercluster.org/mauidocs/8.2backfill.html
BACKFILLPOLICY FIRSTFIT #NONE
RESERVATIONPOLICY CURRENTHIGHEST
# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
NODEALLOCATIONPOLICY MINRESOURCE #CPULOAD or FIRSTAVAILABLE ???!!!
# QOS: http://supercluster.org/mauidocs/7.3qos.html
# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html
# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00
# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR
********************
Here is the "showq" output:
***********************
ACTIVE JOBS--------------------
JOBNAME  USERNAME    STATE    PROC  REMAINING    STARTTIME
311      gullit      Running  4     94:05:48:49  Fri Sep  4 21:27:41
313      msegala     Running  4     96:08:39:28  Mon Sep  7 00:18:20
314      msegala     Running  4     96:20:13:42  Mon Sep  7 11:52:34
318      ricksander  Running  4     98:19:45:23  Wed Sep  9 11:24:15
320      msegala     Running  4     98:23:29:31  Wed Sep  9 15:08:23
321      william     Running  4     99:08:03:39  Wed Sep  9 23:42:31

6 Active Jobs    24 of 24 Processors Active (100.00%)
                  6 of  6 Nodes Active      (100.00%)

IDLE JOBS----------------------
JOBNAME  USERNAME    STATE    PROC  WCLIMIT      QUEUETIME
322      gullit      Idle     4     99:23:59:59  Thu Sep 10 08:51:42
323      gullit      Idle     4     99:23:59:59  Thu Sep 10 11:08:50

2 Idle Jobs

BLOCKED JOBS----------------
JOBNAME  USERNAME    STATE    PROC  WCLIMIT      QUEUETIME

Total Jobs: 8   Active Jobs: 6   Idle Jobs: 2   Blocked Jobs: 0
***********************
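For reference, the per-user totals can be tallied from the active-job rows with a few lines of Python. This is only a rough sketch: the rows are pasted by hand from the showq output above, the helper name is made up, and the parsing assumes the whitespace-separated layout shown here.

```python
from collections import Counter

# Active-job rows taken from the showq output above, one job per line.
ACTIVE = """\
311 gullit Running 4 94:05:48:49 Fri Sep 4 21:27:41
313 msegala Running 4 96:08:39:28 Mon Sep 7 00:18:20
314 msegala Running 4 96:20:13:42 Mon Sep 7 11:52:34
318 ricksander Running 4 98:19:45:23 Wed Sep 9 11:24:15
320 msegala Running 4 98:23:29:31 Wed Sep 9 15:08:23
321 william Running 4 99:08:03:39 Wed Sep 9 23:42:31
"""

MAX_PROCS_PER_USER = 8  # the per-user limit I intended


def procs_per_user(text):
    """Sum the PROC column per USERNAME for rows in the Running state."""
    totals = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "Running":
            totals[fields[1]] += int(fields[3])
    return totals


usage = procs_per_user(ACTIVE)
over = {user: n for user, n in usage.items() if n > MAX_PROCS_PER_USER}
print(dict(usage))          # {'gullit': 4, 'msegala': 12, 'ricksander': 4, 'william': 4}
print(over)                 # {'msegala': 12}
print(sum(usage.values()))  # 24, all in one group, against a limit of 16
```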
And here the "qstat -q" output:
***********************
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
qm                 --      --       --      --    0   0 --   E R
cluster            --      --       --      --    6   2 --   E R
                                               ----- -----
                                                   6     2
***********************
Any clues here? By the way, is there any way to enforce any corrections I
make immediately, that is, to automatically put the most recently started
jobs above back into a "waiting" state?
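To show which jobs I mean, here is a rough Python sketch that replays the running jobs in start-time order and lists the ones that would not have started had both limits been applied. It only illustrates the outcome I am after and says nothing about how Maui itself would do it; the function name is made up, and a single group counter is used because all six jobs belong to users of the same group.

```python
# Hypothetical replay: which running jobs exceed the intended limits?
MAX_PROCS_PER_USER = 8
MAX_PROCS_PER_GROUP = 16

# (job id, user, procs) in start-time order, from the showq output above.
JOBS = [(311, "gullit", 4), (313, "msegala", 4), (314, "msegala", 4),
        (318, "ricksander", 4), (320, "msegala", 4), (321, "william", 4)]


def jobs_over_limit(jobs):
    """Admit jobs in order while both limits hold; return the ids left out."""
    per_user, group_total, held = {}, 0, []
    for job_id, user, procs in jobs:
        user_total = per_user.get(user, 0)
        if (user_total + procs > MAX_PROCS_PER_USER
                or group_total + procs > MAX_PROCS_PER_GROUP):
            held.append(job_id)  # would break a limit, so it should wait
            continue
        per_user[user] = user_total + procs
        group_total += procs
    return held


print(jobs_over_limit(JOBS))  # [320, 321]
```

Job 320 breaks the per-user limit (msegala would reach 12 cores) and job 321 breaks the group limit (the group would reach 20 cores).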
Thanks a lot in advance for any help with this matter!
Sincerely yours,
Jones