[torqueusers] performance issues with maui & torque

Ian Miller ianm at uchicago.edu
Wed Oct 17 09:02:27 MDT 2012


Hi,
I have maui version 3.3.1 and torque version 2.5.7,
and I seem to have a few nodes sitting idle that should be running jobs.  They have been able to run jobs in the past, but the cluster has never run at 80-90% utilization.
The output of showq is as follows (I omitted the job lists):
119 Active Jobs     130 of  344 Processors Active (37.79%)
                        15 of   35 Nodes Active      (42.86%)
Total Jobs: 467   Active Jobs: 119   Idle Jobs: 0   Blocked Jobs: 348
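
As a sanity check on the summary above, the percentages and job counts can be recomputed from the raw numbers. This is only a minimal sketch in Python: the helper name parse_showq_summary and the regular expressions are illustrative assumptions, not part of Maui, and they only cover the three summary lines pasted here.

```python
import re

# The showq summary lines exactly as pasted above.
SHOWQ_SUMMARY = """\
119 Active Jobs     130 of  344 Processors Active (37.79%)
                        15 of   35 Nodes Active      (42.86%)
Total Jobs: 467   Active Jobs: 119   Idle Jobs: 0   Blocked Jobs: 348
"""


def parse_showq_summary(text):
    """Extract processor/node usage and job counts from a showq summary."""
    usage = {}
    # Matches fragments such as "130 of  344 Processors Active".
    pattern = r"(\d+)\s+of\s+(\d+)\s+(Processors|Nodes)\s+Active"
    for used, total, kind in re.findall(pattern, text):
        usage[kind.lower()] = (int(used), int(total))
    # Matches fields such as "Blocked Jobs: 348".
    jobs = {
        name.lower(): int(count)
        for name, count in re.findall(r"(Total|Active|Idle|Blocked) Jobs:\s*(\d+)", text)
    }
    return usage, jobs


usage, jobs = parse_showq_summary(SHOWQ_SUMMARY)
for kind, (used, total) in usage.items():
    print(f"{kind}: {used}/{total} = {100 * used / total:.2f}% active")

# Jobs that are neither active, idle, nor blocked (should be zero).
unaccounted = jobs["total"] - jobs["active"] - jobs["idle"] - jobs["blocked"]
print("unaccounted jobs:", unaccounted)
```

The recomputed figures agree with showq (37.79% of processors, 42.86% of nodes), and active + idle + blocked adds up to the total, so every job that is not running is counted as Blocked rather than Idle.
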
When I try to force-run a job, I get:
root@beast$ qrun 209054
qrun: Execution server rejected request MSG=cannot send job to mom, state=PRERUN 209054.beast-net
30 out of the 34 worker nodes are in one queue (batch), with 2 of the 30 shared with another queue.  Currently 33 of the total jobs (467) are in a different queue (short) and are running fine; the rest are in the default queue (batch).  My question is: how can I get the idle nodes to run these jobs?
What might be the problem?


Qmgr: print queue batch
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 200
set queue batch resources_default.neednodes = batch
set queue batch resources_default.nodes = 1
set queue batch max_user_run = 150
set queue batch keep_completed = 300
set queue batch enabled = True
set queue batch started = True
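
To compare the queue limits above with the job counts from showq, the "set queue" lines can be parsed into a dictionary. This is a rough sketch in Python under stated assumptions: parse_queue_attrs is a hypothetical helper, not a Torque tool, and the figure of 119 active jobs is the cluster-wide count from showq, which also includes jobs in the short queue.

```python
# The qmgr "print queue batch" output as pasted above (comment lines omitted).
QMGR_OUTPUT = """\
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 200
set queue batch resources_default.neednodes = batch
set queue batch resources_default.nodes = 1
set queue batch max_user_run = 150
set queue batch keep_completed = 300
set queue batch enabled = True
set queue batch started = True
"""


def parse_queue_attrs(text, queue):
    """Collect 'set queue <queue> <name> = <value>' lines into a dict of strings."""
    attrs = {}
    prefix = f"set queue {queue} "
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            name, _, value = line[len(prefix):].partition(" = ")
            attrs[name.strip()] = value.strip()
    return attrs


attrs = parse_queue_attrs(QMGR_OUTPUT, "batch")

# Assumption: 119 is the cluster-wide active job count reported by showq,
# so it is an upper bound on jobs running in batch and on jobs per user.
active_jobs = 119
for limit in ("max_running", "max_user_run"):
    below = active_jobs < int(attrs[limit])
    print(f"{limit} = {attrs[limit]}, active jobs below limit: {below}")
```

Both limits are above the active job count, which suggests that max_running and max_user_run are not what is holding jobs back here. It does not rule out other causes.
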

# maui.cfg 3.3.1
SERVERHOST            beast
# primary admin must be first in list
ADMIN1                root
# Resource Manager Definition
RMCFG[BEAST] TYPE=PBS
# Allocation Manager Definition
AMCFG[bank]  TYPE=NONE
# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration
RMPOLLINTERVAL        00:00:30
SERVERPORT            42559
SERVERMODE            NORMAL
# Admin: http://supercluster.org/mauidocs/a.esecurity.html
LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3
# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html
QUEUETIMEWEIGHT       1
# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='0.01*AMEM - 2*LOAD'
NODEAVAILABILITYPOLICY COMBINED:MEM

SRCFG[Reinitz] HOSTLIST=minion1[2-9]
SRCFG[Reinitz] GROUPLIST=Reinitz

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

USERCFG[DEFAULT]        MAXIJOB=2000
# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR


Ian Miller
Research Computing Administrator
ianm at uchicago.edu
(312) 402-6170
