[Mauiusers] (my) maui configuration error?

Jones de Andrade johannesrs at gmail.com
Thu Sep 10 12:43:58 MDT 2009


Hi all.

We've been using Torque+Maui on our cluster for some time now. It's a small
cluster, composed of 8 (currently only 6 online) quad-core nodes plus a
master machine.

The way the queue should be working is (see the sketch just below the list):
1 - Every user has a maximum of 8 processors/cores in use at the same time,
at *any* moment;
2 - No (sub-)group of users (there are three) should be able to use more
than 16 processors/cores at the same time, at *any* moment.
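
For reference, here is a minimal sketch of how I understand these two limits
could also be written with Maui's per-credential throttling policies (the
USERCFG/GROUPCFG parameters are from the Maui throttling documentation;
whether they are the right tool here, rather than the CLASSCFG limits I
actually used below, is part of my question):

*******************
# Hypothetical alternative to the CLASSCFG lines in my maui.cfg:
# hard per-credential limits, independent of the submission queue.
USERCFG[DEFAULT]   MAXPROC=8     # no single user above 8 cores at a time
GROUPCFG[DEFAULT]  MAXPROC=16    # no single group above 16 cores at a time
*******************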

I'm quite sure I tested the following configuration to be certain about
these usage policies. Unfortunately, for reasons so far unexplained, the
cluster is only now starting to be heavily used, more than a month after it
was officially started. And now a really strange behaviour has been noticed,
which seems to be related to the way I configured Maui:

1 - There are 6 jobs from users of the same group running at the same time,
using 24 cores. I guess that if more nodes were available, the two queued
jobs from that same group would also start to run.

2 - One user alone has three jobs running, using 12 cores. :(

How can I correct this? Due to "internal policies", there is no problem in
having a spare node available with no processes running (we intend to deal
with that by implementing some sort of wake-on-LAN procedure), but under no
circumstances may a group or a user exceed these established limits. :(

Here follows my maui.cfg; I have just removed the server name for safety
reasons:

*******************
# maui.cfg 3.2.6p20

SERVERHOST            server
# primary admin must be first in list
ADMIN1                root

# Resource Manager Definition

RMCFG[server] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank]  TYPE=NONE

RMPOLLINTERVAL        00:00:30

SERVERPORT            42559
SERVERMODE            NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

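# Intended limits: in class "cluster", each group is capped at 16 cores and
# each user at 8; class "qm" caps each user at 8 cores.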
CLASSCFG[cluster]  MAXPROC[GROUP]=16 MAXPROC[USER]=8
CLASSCFG[qm]       MAXPROC[USER]=8

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT #NONE
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE  # CPULOAD or FIRSTAVAILABLE ???!!!

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR
********************
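
If it helps the diagnosis: the limits as Maui actually parsed them can be
inspected with the standard showconfig client command, for example:

*******************
# dump the scheduler's in-memory configuration and filter for the limits
showconfig | grep -i maxproc
*******************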

Here is the "showq" output:

***********************
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

311                  gullit    Running     4 94:05:48:49   Fri Sep  4 21:27:41
313                 msegala    Running     4 96:08:39:28   Mon Sep  7 00:18:20
314                 msegala    Running     4 96:20:13:42   Mon Sep  7 11:52:34
318              ricksander    Running     4 98:19:45:23   Wed Sep  9 11:24:15
320                 msegala    Running     4 98:23:29:31   Wed Sep  9 15:08:23
321                 william    Running     4 99:08:03:39   Wed Sep  9 23:42:31

     6 Active Jobs      24 of   24 Processors Active (100.00%)
                         6 of    6 Nodes Active      (100.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

322                  gullit       Idle     4 99:23:59:59   Thu Sep 10 08:51:42
323                  gullit       Idle     4 99:23:59:59   Thu Sep 10 11:08:50

2 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 8   Active Jobs: 6   Idle Jobs: 2   Blocked Jobs: 0
***********************

And here the "qstat -q" output:

***********************
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
qm                 --      --       --      --    0   0 --   E R
cluster            --      --       --      --    6   2 --   E R
                                               ----- -----
                                                   6     2
***********************

Any clues here? By the way, is there any way to enforce whatever corrections
I make immediately? That would mean automatically placing the most recently
started jobs above back into a "waiting" state.
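
In case it matters, what I had in mind was something along these lines
(assuming the jobs were submitted as rerunnable, and that Maui only re-reads
maui.cfg when the daemon is restarted):

*******************
# restart Maui so it picks up the edited maui.cfg
# (the init mechanism depends on the installation)
/etc/init.d/maui restart

# requeue the most recently started offending jobs; qrerun (Torque)
# kills a running job and puts it back in the queue
qrerun 320 321
*******************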

Thanks a lot in advance for any help with this matter!

Sincerely yours,

Jones