[Mauiusers] Exceedingly high CPU usage for Maui

Wickliffe, Blake W blake.wickliffe at aramco.com
Sat Jan 24 22:59:47 MST 2009


Hello,

I'm seeing strange behavior from my Maui process.  I am running a cluster of 2127 nodes.  On the master node, Maui stays at a constant 100% CPU usage.  There are currently about 750 jobs in the queue, and Maui no longer responds to CLI commands (e.g., showq); they simply time out.

Also, I have LOGLEVEL set to 0, but Maui is still writing thousands of "WARNING" and "ERROR" messages to the log.  I am not sure whether these messages are relevant:

01/25 08:55:58 WARNING:  cannot allocate tasks for job 43251 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43251 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43251 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43251 at   INFINITY
01/25 08:55:58 ERROR:    cannot allocate tasks for job 43251 at any time
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY

Each job generates dozens of these messages.  It seems to me that simply not having enough resources available right now for a job shouldn't count as a WARNING or ERROR level condition, so I suspect something else is wrong.
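
To see how many of these messages each job actually produces, a short script along these lines could tally them.  This is only a sketch: the regular expression is inferred from the log lines quoted above, and the names summarize and SAMPLE are made up for illustration and are not part of Maui.  In practice the lines would be read from maui.log rather than from an embedded sample.

```python
import re
from collections import Counter

# Pattern inferred from the excerpt above, for example:
# 01/25 08:55:58 WARNING:  cannot allocate tasks for job 43251 at   INFINITY
# (assumption: other Maui versions may format these lines differently)
LINE_RE = re.compile(
    r"^\d{2}/\d{2} \d{2}:\d{2}:\d{2} (WARNING|ERROR):\s+"
    r"cannot allocate tasks for job (\d+) at\s+(.*)$"
)


def summarize(lines):
    """Count 'cannot allocate tasks' messages per (job id, level) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            level, job_id, _when = match.groups()
            counts[(job_id, level)] += 1
    return counts


# The ten log lines quoted in this message.
SAMPLE = """\
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43251 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43251 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43251 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43251 at   INFINITY
01/25 08:55:58 ERROR:    cannot allocate tasks for job 43251 at any time
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY
01/25 08:55:58 WARNING:  cannot allocate tasks for job 43252 at   INFINITY
"""

if __name__ == "__main__":
    for (job_id, level), n in sorted(summarize(SAMPLE.splitlines()).items()):
        print(f"job {job_id}: {n} {level}")
```

Sorting the result by count would show whether a handful of jobs account for most of the log volume.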

Here is our (partial) maui.cfg file.  Thanks for any help anyone can provide...

# maui.cfg 3.2.6p16

SERVERHOST            xlch
# primary admin must be first in list
ADMIN1                root
ADMIN2                disco
ADMIN3                ALL

# Resource Manager Definition

RMCFG[xlch] TYPE=PBS

# Allocation Manager Definition

#AMCFG[bank]  TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL        00:05:00

NODEPOLLFREQUENCY     3
CLIENTTIMEOUT         00:01:30

SERVERPORT            42559
SERVERMODE            NORMAL
#SERVERMODE             TEST

ENABLEMULTIREQJOBS    TRUE
#USEMACHINESPEED              TRUE

# Admin: http://supercluster.org/mauidocs/a.esecurity.html


LOGFILE               maui.log
LOGFILEMAXSIZE        500000000
LOGLEVEL              0

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       0

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT
#RESERVATIONPOLICY     CURRENTHIGHEST
RESERVATIONPOLICY       NEVER


# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
#NODEALLOCATIONPOLICY  FASTEST
#NODEALLOCATIONPOLICY  PRIORITY
JOBNODEMATCHPOLICY    EXACTNODE
NODEACCESSPOLICY      SHARED

DEFERTIME             00

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

USERCFG[DEFAULT]      FSTARGET=25.0+
# USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR
FSPOLICY              DEDICATEDPS
FSINTERVAL            24:00:00
FSDEPTH               12
FSDECAY               0.5
FSWEIGHT              100
FSUSERWEIGHT          100
FSQOSWEIGHT           0
FSGROUPWEIGHT         0

NODECFG[DEFAULT] PRIORITYF='PRIORITY * 1'


Blake Wickliffe
Saudi Aramco
ENOD/CSYS/USG HPC Team
(873-4417)
