[Mauiusers] configuration issues...

James Wigdahl james at wigdahl.com
Mon Apr 10 11:52:26 MDT 2006


I have a conundrum I need some help with.

First of all the particulars:
OSCAR 4.2 - TORQUE 1.2.0p5 - Maui 3.2.5p2

I have a cluster with 30 nodes all of which have 2 CPUs and are  
configured as such in TORQUE. All job scheduling policy other than  
max walltime is configured in Maui.

I have a new queue ("simvision") I want to set up for jobs that, once
started, are not CPU or memory intensive, and I would like to shove
these jobs onto cluster nodes even when those nodes are already
running 2 jobs. Here's what I've done so far:

1st attempt:
Created a "simvision" queue in TORQUE and a corresponding CLASSCFG in
Maui. In Maui, assigned a QOS ("simvis") that gives the jobs high
priority, matching that of the other high-priority jobs on the cluster
(queue "ncsim"), and added "FLAGS=IGNSYSTEM" to the "simvis" QOS.

What I found with this was that when all nodes were full of running
jobs, "simvision" jobs would not run. My deductive reasoning, such as
it is, and some knowledge gleaned from the Maui docs led me to
believe that these jobs would not be scheduled because the resource
manager was only reporting 2 CPUs on each node. With those job slots
filled, Maui would not permit these jobs to run.

2nd attempt:
Configure 20 nodes (nodes 11-30) as having 3 processors in
/var/spool/pbs/server_priv/nodes. Add "MAXJOB=2" and "MAXLOAD=1.8" to
all NODECFGs in maui.cfg. Still have "FLAGS=IGNSYSTEM" on the "simvis"
QOS. Restart pbs_server and maui.
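
For illustration, the node layout of this second attempt can be sketched in a few lines of Python. This is only a sketch under assumptions: the `np=` form is the usual TORQUE nodes-file syntax, the host names node001..node030 are taken from the NODECFG entries in maui.cfg (the real nodes file may use different names), and the helper functions are made up for this example.

```python
# Sketch only: rebuild the node layout from the second attempt.
# Assumption: hosts are named node001..node030, as in the NODECFG lines.

def nodes_file_lines(total_nodes=30, first_oversubscribed=11):
    """Return nodes-file lines: np=2 for nodes 1-10, np=3 for nodes 11-30."""
    lines = []
    for n in range(1, total_nodes + 1):
        np_count = 3 if n >= first_oversubscribed else 2
        lines.append(f"node{n:03d} np={np_count}")
    return lines


def total_procs(lines):
    """Sum the np= values, i.e. the processor total TORQUE should report."""
    return sum(int(line.split("np=")[1]) for line in lines)


lines = nodes_file_lines()
print(lines[0])            # node001 np=2
print(lines[10])           # node011 np=3
print(total_procs(lines))  # 80
```

Ten nodes at 2 processors plus twenty nodes at 3 processors gives 10*2 + 20*3 = 80.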

What I would expect is to see no more than 2 jobs running per node,
with further jobs queueing, unless a job came into the "simvision"
queue. Such a job should be allocated to one of the nodes with 3
processors configured in TORQUE, since those jobs should ignore system
throttling policies. I confirmed that TORQUE was seeing more
processors (it showed 80 procs instead of 60). But when I fired off a
bunch of jobs with a QOS that did *not* have IGNSYSTEM set (queue
"long"), they were allocated to the nodes with 3 CPUs anyway and
started running, as if the MAXJOB and MAXLOAD directives were being
ignored.
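
The expectation described above can be written down as a toy model. This is a sketch of the intended policy only, not of Maui's actual scheduling logic; the function name and parameters are hypothetical, and MAXLOAD is left out to keep it short.

```python
# Toy model of the EXPECTED behaviour; this is not how Maui is implemented.
# Assumptions: one processor per job, MAXLOAD ignored for brevity.

def expected_to_start(running_jobs, np, ignores_system_limits, maxjob=2):
    """Return True if one more job is expected to start on a node."""
    if running_jobs >= np:
        # No free processor slot is reported for this node.
        return False
    if ignores_system_limits:
        # A QOS with IGNSYSTEM is expected to bypass the MAXJOB limit.
        return True
    return running_jobs < maxjob


# A 3-processor node that already runs 2 jobs:
print(expected_to_start(2, 3, ignores_system_limits=False))  # False ("long")
print(expected_to_start(2, 3, ignores_system_limits=True))   # True ("simvision")
# A 2-processor node that already runs 2 jobs (the first attempt):
print(expected_to_start(2, 2, ignores_system_limits=True))   # False
```

The observed behaviour differs from the first case: "long" jobs did start as a third job on the 3-processor nodes.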

Many thanks to anyone who can help me figure this out. Configs follow
(total CPUs set back to 60):


######################################################################
# TORQUE config
######################################################################
#
# Create and define queue interact
#
create queue interact
set queue interact queue_type = Execution
set queue interact resources_max.cput = 04:00:00
set queue interact resources_max.ncpus = 60
set queue interact resources_max.nodect = 30
set queue interact resources_max.walltime = 04:00:00
set queue interact resources_min.cput = 00:00:01
set queue interact resources_min.ncpus = 1
set queue interact resources_min.nodect = 1
set queue interact resources_min.walltime = 00:00:01
set queue interact resources_default.cput = 04:00:00
set queue interact resources_default.ncpus = 1
set queue interact resources_default.nodect = 1
set queue interact resources_default.walltime = 04:00:00
set queue interact resources_available.nodect = 30
set queue interact enabled = True
set queue interact started = True
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long resources_max.cput = 10000:00:00
set queue long resources_max.ncpus = 60
set queue long resources_max.nodect = 30
set queue long resources_max.walltime = 10000:00:00
set queue long resources_min.cput = 00:00:01
set queue long resources_min.ncpus = 1
set queue long resources_min.nodect = 1
set queue long resources_min.walltime = 00:00:01
set queue long resources_default.cput = 10000:00:00
set queue long resources_default.ncpus = 1
set queue long resources_default.nodect = 1
set queue long resources_default.walltime = 10000:00:00
set queue long resources_available.nodect = 30
set queue long enabled = True
set queue long started = True
#
# Create and define queue swbuild
#
create queue swbuild
set queue swbuild queue_type = Execution
set queue swbuild resources_max.cput = 02:00:00
set queue swbuild resources_max.ncpus = 60
set queue swbuild resources_max.nodect = 30
set queue swbuild resources_max.walltime = 02:00:00
set queue swbuild resources_min.cput = 00:00:01
set queue swbuild resources_min.ncpus = 1
set queue swbuild resources_min.nodect = 1
set queue swbuild resources_min.walltime = 00:00:01
set queue swbuild resources_default.cput = 02:00:00
set queue swbuild resources_default.ncpus = 1
set queue swbuild resources_default.nodect = 1
set queue swbuild resources_default.walltime = 02:00:00
set queue swbuild resources_available.nodect = 30
set queue swbuild enabled = True
set queue swbuild started = True
#
# Create and define queue matlab
#
create queue matlab
set queue matlab queue_type = Execution
set queue matlab resources_max.cput = 10000:00:00
set queue matlab resources_max.ncpus = 60
set queue matlab resources_max.nodect = 30
set queue matlab resources_max.walltime = 10000:00:00
set queue matlab resources_min.cput = 00:00:01
set queue matlab resources_min.ncpus = 1
set queue matlab resources_min.nodect = 1
set queue matlab resources_min.walltime = 00:00:01
set queue matlab resources_default.cput = 10000:00:00
set queue matlab resources_default.ncpus = 1
set queue matlab resources_default.nodect = 1
set queue matlab resources_default.walltime = 10000:00:00
set queue matlab resources_available.nodect = 30
set queue matlab enabled = True
set queue matlab started = True
#
# Create and define queue long-loprio
#
create queue long-loprio
set queue long-loprio queue_type = Execution
set queue long-loprio resources_max.cput = 10000:00:00
set queue long-loprio resources_max.ncpus = 60
set queue long-loprio resources_max.nodect = 30
set queue long-loprio resources_max.walltime = 10000:00:00
set queue long-loprio resources_min.cput = 00:00:01
set queue long-loprio resources_min.ncpus = 1
set queue long-loprio resources_min.nodect = 1
set queue long-loprio resources_min.walltime = 00:00:01
set queue long-loprio resources_default.cput = 10000:00:00
set queue long-loprio resources_default.ncpus = 1
set queue long-loprio resources_default.nodect = 1
set queue long-loprio resources_default.walltime = 10000:00:00
set queue long-loprio resources_available.nodect = 30
set queue long-loprio enabled = True
set queue long-loprio started = True
#
# Create and define queue ncsim
#
create queue ncsim
set queue ncsim queue_type = Execution
set queue ncsim resources_max.cput = 10000:00:00
set queue ncsim resources_max.ncpus = 60
set queue ncsim resources_max.nodect = 30
set queue ncsim resources_max.walltime = 10000:00:00
set queue ncsim resources_min.cput = 00:00:01
set queue ncsim resources_min.ncpus = 1
set queue ncsim resources_min.nodect = 1
set queue ncsim resources_min.walltime = 00:00:01
set queue ncsim resources_default.cput = 10000:00:00
set queue ncsim resources_default.ncpus = 1
set queue ncsim resources_default.nodect = 1
set queue ncsim resources_default.walltime = 10000:00:00
set queue ncsim resources_available.nodect = 30
set queue ncsim enabled = True
set queue ncsim started = True
#
# Create and define queue short
#
create queue short
set queue short queue_type = Execution
set queue short resources_max.cput = 02:00:00
set queue short resources_max.ncpus = 60
set queue short resources_max.nodect = 30
set queue short resources_max.walltime = 02:00:00
set queue short resources_min.cput = 00:00:01
set queue short resources_min.ncpus = 1
set queue short resources_min.nodect = 1
set queue short resources_min.walltime = 00:00:01
set queue short resources_default.cput = 02:00:00
set queue short resources_default.ncpus = 1
set queue short resources_default.nodect = 1
set queue short resources_default.walltime = 02:00:00
set queue short resources_available.nodect = 30
set queue short enabled = True
set queue short started = True
#
# Create and define queue simvision
#
create queue simvision
set queue simvision queue_type = Execution
set queue simvision resources_max.cput = 10000:00:00
set queue simvision resources_max.ncpus = 60
set queue simvision resources_max.nodect = 30
set queue simvision resources_max.walltime = 10000:00:00
set queue simvision resources_min.cput = 00:00:01
set queue simvision resources_min.ncpus = 1
set queue simvision resources_min.nodect = 1
set queue simvision resources_min.walltime = 00:00:01
set queue simvision resources_default.cput = 10000:00:00
set queue simvision resources_default.ncpus = 1
set queue simvision resources_default.nodect = 1
set queue simvision resources_default.walltime = 10000:00:00
set queue simvision resources_available.nodect = 30
set queue simvision enabled = True
set queue simvision started = True
#
# Set server attributes.
#
set server scheduling = False
set server default_queue = short
set server log_events = 64
set server mail_from = adm
set server query_other_jobs = True
set server resources_available.ncpus = 60
set server resources_available.nodect = 30
set server resources_available.nodes = 30
set server resources_max.ncpus = 60
set server resources_max.nodes = 30
set server scheduler_iteration = 60
set server node_ping_rate = 300
set server node_check_rate = 150
set server tcp_timeout = 6
set server job_stat_rate = 30
######################################################################
# END OF TORQUE config
######################################################################




######################################################################
# maui.cfg
######################################################################
SERVERHOST node001.cluster
SERVERPORT 42559
SERVERMODE NORMAL
ADMIN1 root
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

RMCFG[base] TYPE=PBS TIMEOUT=90
RMPOLLINTERVAL 00:00:10

DEFERTIME 1:00
DEFERCOUNT 999
DEFERSTARTCOUNT 10

BACKFILLPOLICY FIRSTFIT
NODEACCESSPOLICY SHARED
PREEMPTPOLICY SUSPEND
RESERVATIONPOLICY NEVER
FSPOLICY UTILIZEDPS
NODEALLOCATIONPOLICY PRIORITY
RESOURCELIMITPOLICY ALWAYS:CANCEL:MEM
RESERVATIONPOLICY NEVER

CREDWEIGHT             5
CLASSWEIGHT            8
QOSWEIGHT              2
QUEUETIMEWEIGHT        1
TARGETQUEUETIMEWEIGHT  1
CONSUMEDWEIGHT         3

QOSCFG[lopri]  PRIORITY=10 QFLAGS=PREEMPTEE FLAGS=PREEMPTEE JOBFLAGS=PREEMPTEE
QOSCFG[hipri]  PRIORITY=10000 QFLAGS=PREEMPTOR FLAGS=PREEMPTOR JOBFLAGS=PREEMPTOR
QOSCFG[simvis] PRIORITY=10000 FLAGS=IGNSYSTEM

CLASSCFG[interact]      PRIORITY=950 QDEF=hipri  MAXJOBPERUSER=4
CLASSCFG[ncsim]         PRIORITY=900 QDEF=hipri  MAXJOB=6  MAXJOBPERUSER=2
CLASSCFG[simvision]     PRIORITY=900 QDEF=simvis MAXJOB=11
CLASSCFG[matlab]        PRIORITY=900 QDEF=hipri  MAXJOB=8
CLASSCFG[swbuild]       PRIORITY=700 QDEF=hipri
CLASSCFG[short]         PRIORITY=500 QDEF=lopri
CLASSCFG[long]          PRIORITY=200 QDEF=lopri  MAXMEM=1200
CLASSCFG[long-loprio]   PRIORITY=150 QDEF=lopri  MAXMEM=1200  MAXJOBPERUSER=30

USERCFG[DEFAULT] QTTARGET=0:00:01 QLIST=lopri,hipri

NODECFG[node001] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node002] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node003] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node004] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node005] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node006] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node007] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node008] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node009] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node010] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1
NODECFG[node011] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node012] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node013] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node014] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node015] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node016] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node017] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node018] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node019] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node020] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node021] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node022] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node023] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node024] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node025] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node026] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node027] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node028] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node029] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
NODECFG[node030] MAXJOB=2 MAXLOAD=1.8 PRIORITYF=SPEED-10*JOBCOUNT SPEED=1.1
######################################################################
# END OF maui.cfg
######################################################################


