[Mauiusers] Preemption not working : job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available'
Tom Rudwick
trudwickiii at apple.com
Tue Apr 20 14:45:04 MDT 2010
Hi Andre,
We have preemption working at our site on that version of Maui.
We have found that the settings below seem to be necessary for
it to work here. I don't see a SYSCFG line in your config,
and I don't see a GROUPCFG for the admins group either. I may be off
base on these, since I know some bugs have been fixed since we
got this working, but you may want to try setting those.
This line sets the default QOS of the admins class to "sys", but I
don't see a QOSCFG[sys] anywhere else in your config:
CLASSCFG[admins] MAXPROC=280 QDEF=sys PRIORITY=2001
The only preemptor QOS I see defined is this "admins" one:
QOSCFG[admins] QFLAGS=PREEMPTOR PRIORITY=1000
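For example, a consistent version of those lines might look something
like the sketch below. This is untested, and it assumes you meant the
admins class to default to the "admins" QOS rather than to a separate
"sys" QOS. The SYSCFG and GROUPCFG lines are modeled on our config,
not taken from yours, so treat them as assumptions too.

```
# QOS definitions, copied from your maui.cfg
QOSWEIGHT 1
QOSCFG[admins]  QFLAGS=PREEMPTOR PRIORITY=1000
QOSCFG[general] QFLAGS=PREEMPTEE PRIORITY=100

# assumption: list both QOSes system-wide, the way we do with SYSCFG
SYSCFG QLIST=admins,general

# assumption: give the admins group explicit access to the admins QOS
GROUPCFG[admins] QDEF=admins QLIST=admins,general

# QDEF now names a QOS that has a matching QOSCFG entry
CLASSCFG[general] QDEF=general PRIORITY=100
CLASSCFG[admins]  MAXPROC=280 QDEF=admins PRIORITY=2001
```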
Good luck,
Tom
( this is a fragment of our maui config file ...)
QOSWEIGHT 1
SYSCFG QLIST=bigmem,integration,interactive,debug,regress,contingent
QOSCFG[bigmem] PRIORITY=1 QFLAGS=PREEMPTOR,RESTARTPREEMPT
QOSCFG[integration] PRIORITY=1 QFLAGS=USERESERVED
QOSCFG[interactive] PRIORITY=2 QFLAGS=PREEMPTOR,RESTARTPREEMPT
QOSCFG[debug] PRIORITY=1
QOSCFG[regress] PRIORITY=-1
QOSCFG[contingent] PRIORITY=-2 QFLAGS=PREEMPTEE
GROUPCFG[users] QDEF=DEFAULT QLIST=bigmem,integration,interactive,debug,regress,contingent
CLASSCFG[regress] QDEF=contingent
Andre Gauthier wrote:
> Hi, I'm trying to get preemption to work with Maui and Torque. I
> have a dozen queues, but one is defined as a preemptee (general queue &
> qos) and another as a preemptor (admins queue & qos). I submit a job
> to the queue that is a preemptee, then a job to the preemptor. The
> preemptor does not run. Maui version 3.2.6p21, Torque version
> 2.3.6-1.
>
> qstat:
>
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 459.hpc-test              sleep.sh         user2           00:00:00 R general
> 460.hpc-test              sleep.sh         user1                  0 Q admins
>
>
> checkjob 460:
>
> checking job 460
>
> State: Idle EState: Deferred
> Creds: user:user1 group:admins class:admins qos:admins
> WallTime: 00:00:00 of 1:00:00
> SubmitTime: Tue Apr 20 11:41:28
> (Time Queued Total: 00:00:02 Eligible: 00:00:01)
>
> StartDate: 00:00:00 Tue Apr 20 11:41:30
> Total Tasks: 8
>
> Req[0] TaskCount: 8 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1 MEM: 32M
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> Flags: RESTARTABLE PREEMPTOR
>
> job is deferred. Reason: RMFailure (cannot start job - RM failure,
> rc: 15044, msg: 'Resource temporarily unavailable MSG=job allocation
> request exceeds currently available cluster nodes, 1 requested, 0
> available')
> Holds: Defer (hold reason: RMFailure)
> PE: 8.00 StartPriority: 3001
> cannot select job 460 for partition DEFAULT (job hold active)
>
>
> checkjob 459:
>
> checking job 459
>
> State: Running
> Creds: user:user2 group:user2 class:general qos:general
> WallTime: 00:03:05 of 1:00:00
> SubmitTime: Tue Apr 20 11:41:11
> (Time Queued Total: 00:00:19 Eligible: 00:00:01)
>
> StartTime: Tue Apr 20 11:41:30
> Total Tasks: 96
>
> Req[0] TaskCount: 96 Partition: DEFAULT
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1 MEM: 2M
> Allocated Nodes:
> [compute-0-15:8][compute-0-13:8][compute-0-12:8][compute-0-11:8]
> [compute-0-10:8][compute-0-9:8][compute-0-8:8][compute-0-7:8]
> [compute-0-6:8][compute-0-5:8][compute-0-4:8][compute-0-3:8]
>
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 2
> PartitionMask: [ALL]
> Flags: RESTARTABLE PREEMPTEE
> Attr: PREEMPTEE
>
> Reservation '459' (-00:03:06 -> 00:56:54 Duration: 1:00:00)
> PE: 96.00 StartPriority: 200
>
>
>
>
>
> showconfig:
>
> [root at hpc-test maui]# showconfig
> # Maui version 3.2.6p21 (PID: 16046)
> # global policies
>
> REJECTNEGPRIOJOBS[0] FALSE
> ENABLENEGJOBPRIORITY[0] FALSE
> ENABLEMULTINODEJOBS[0] TRUE
> ENABLEMULTIREQJOBS[0] FALSE
> BFPRIORITYPOLICY[0] [NONE]
> JOBPRIOACCRUALPOLICY QUEUEPOLICY
> NODELOADPOLICY ADJUSTSTATE
> USEMACHINESPEED FALSE
> USESYSTEMQUEUETIME TRUE
> USELOCALMACHINEPRIORITY FALSE
> NODEUNTRACKEDLOADFACTOR 1.2
> JOBNODEMATCHPOLICY[0]
>
> JOBMAXSTARTTIME[0] INFINITY
>
> METAMAXTASKS[0] 0
> NODESETPOLICY[0] [NONE]
> NODESETATTRIBUTE[0] [NONE]
> NODESETLIST[0]
> NODESETDELAY[0] 00:00:00
> NODESETPRIORITYTYPE[0] MINLOSS
> NODESETTOLERANCE[0] 0.00
>
> BACKFILLPOLICY[0] FIRSTFIT
> BACKFILLDEPTH[0] 0
> BACKFILLPROCFACTOR[0] 0
> BACKFILLMAXSCHEDULES[0] 10000
> BACKFILLMETRIC[0] PROCS
>
> BFCHUNKDURATION[0] 00:00:00
> BFCHUNKSIZE[0] 0
> PREEMPTPOLICY[0] REQUEUE
> MINADMINSTIME[0] 00:00:00
> RESOURCELIMITPOLICY[0]
> NODEAVAILABILITYPOLICY[0] COMBINED:[DEFAULT]
> NODEALLOCATIONPOLICY[0] MINRESOURCE
> TASKDISTRIBUTIONPOLICY[0] DEFAULT
> RESERVATIONPOLICY[0] NEVER
> RESERVATIONRETRYTIME[0] 00:00:00
> RESERVATIONTHRESHOLDTYPE[0] NONE
> RESERVATIONTHRESHOLDVALUE[0] 0
>
> FSPOLICY [NONE]
> FSPOLICY [NONE]
> FSINTERVAL 12:00:00
> FSDEPTH 8
> FSDECAY 1.00
>
>
>
> # Priority Weights
>
> SERVICEWEIGHT[0] 1
> TARGETWEIGHT[0] 1
> CREDWEIGHT[0] 1
> ATTRWEIGHT[0] 1
> FSWEIGHT[0] 1
> RESWEIGHT[0] 1
> USAGEWEIGHT[0] 1
> QUEUETIMEWEIGHT[0] 1
> XFACTORWEIGHT[0] 0
> SPVIOLATIONWEIGHT[0] 0
> BYPASSWEIGHT[0] 0
> TARGETQUEUETIMEWEIGHT[0] 0
> TARGETXFACTORWEIGHT[0] 0
> USERWEIGHT[0] 1
> GROUPWEIGHT[0] 1
> ACCOUNTWEIGHT[0] 0
> QOSWEIGHT[0] 1
> CLASSWEIGHT[0] 1
> FSUSERWEIGHT[0] 0
> FSGROUPWEIGHT[0] 0
> FSACCOUNTWEIGHT[0] 0
> FSQOSWEIGHT[0] 0
> FSCLASSWEIGHT[0] 0
> ATTRATTRWEIGHT[0] 0
> ATTRSTATEWEIGHT[0] 0
> NODEWEIGHT[0] 0
> PROCWEIGHT[0] 0
> MEMWEIGHT[0] 0
> SWAPWEIGHT[0] 0
> DISKWEIGHT[0] 0
> PSWEIGHT[0] 0
> PEWEIGHT[0] 0
> WALLTIMEWEIGHT[0] 0
> UPROCWEIGHT[0] 0
> UJOBWEIGHT[0] 0
> CONSUMEDWEIGHT[0] 0
> USAGEEXECUTIONTIMEWEIGHT[0] 0
> REMAININGWEIGHT[0] 0
> PERCENTWEIGHT[0] 0
> XFMINWCLIMIT[0] 00:02:00
>
>
> # partition DEFAULT policies
>
> REJECTNEGPRIOJOBS[1] FALSE
> ENABLENEGJOBPRIORITY[1] FALSE
> ENABLEMULTINODEJOBS[1] TRUE
> ENABLEMULTIREQJOBS[1] FALSE
> BFPRIORITYPOLICY[1] [NONE]
> JOBPRIOACCRUALPOLICY QUEUEPOLICY
> NODELOADPOLICY ADJUSTSTATE
> JOBNODEMATCHPOLICY[1]
>
> JOBMAXSTARTTIME[1] INFINITY
>
> METAMAXTASKS[1] 0
> NODESETPOLICY[1] [NONE]
> NODESETATTRIBUTE[1] [NONE]
> NODESETLIST[1]
> NODESETDELAY[1] 00:00:00
> NODESETPRIORITYTYPE[1] MINLOSS
> NODESETTOLERANCE[1] 0.00
>
> # Priority Weights
>
> XFMINWCLIMIT[1] 00:00:00
>
> RMAUTHTYPE[0] CHECKSUM
>
> CLASSCFG[[NONE]] DEFAULT.FEATURES=[NONE]
> CLASSCFG[[ALL]] DEFAULT.FEATURES=[NONE]
> CLASSCFG[DEFAULT] DEFAULT.FEATURES=[NONE]
> CLASSCFG[batch] DEFAULT.FEATURES=[NONE]
> CLASSCFG[interactive] DEFAULT.FEATURES=[NONE]
> CLASSCFG[general] DEFAULT.FEATURES=[NONE]
> CLASSCFG[priya] DEFAULT.FEATURES=[NONE]
> CLASSCFG[admins] DEFAULT.FEATURES=[NONE]
> CLASSCFG[sohrab] DEFAULT.FEATURES=[NONE]
> CLASSCFG[micro] DEFAULT.FEATURES=[NONE]
> CLASSCFG[altonji] DEFAULT.FEATURES=[NONE]
> CLASSCFG[easther] DEFAULT.FEATURES=[NONE]
> CLASSCFG[berry] DEFAULT.FEATURES=[NONE]
> CLASSCFG[hpcprog] DEFAULT.FEATURES=[NONE]
> CLASSCFG[macro] DEFAULT.FEATURES=[NONE]
> QOSPRIORITY[0] 0
> QOSQTWEIGHT[0] 0
> QOSXFWEIGHT[0] 0
> QOSTARGETXF[0] 0.00
> QOSTARGETQT[0] 00:00:00
> QOSFLAGS[0]
> QOSPRIORITY[1] 0
> QOSQTWEIGHT[1] 0
> QOSXFWEIGHT[1] 0
> QOSTARGETXF[1] 0.00
> QOSTARGETQT[1] 00:00:00
> QOSFLAGS[1]
> QOSPRIORITY[2] 100
> QOSQTWEIGHT[2] 0
> QOSXFWEIGHT[2] 0
> QOSTARGETXF[2] 100.00
> QOSTARGETQT[2] 00:00:00
> QOSFLAGS[2]
> QOSPRIORITY[3] -1000
> QOSQTWEIGHT[3] 0
> QOSXFWEIGHT[3] 0
> QOSTARGETXF[3] 0.00
> QOSTARGETQT[3] 00:00:00
> QOSFLAGS[3]
> QOSPRIORITY[4] 1000
> QOSQTWEIGHT[4] 0
> QOSXFWEIGHT[4] 0
> QOSTARGETXF[4] 0.00
> QOSTARGETQT[4] 00:00:00
> QOSFLAGS[4] PREEMPTOR
> QOSPRIORITY[5] 100
> QOSQTWEIGHT[5] 0
> QOSXFWEIGHT[5] 0
> QOSTARGETXF[5] 0.00
> QOSTARGETQT[5] 00:00:00
> QOSFLAGS[5] PREEMPTEE
> # SERVER MODULES: MX
> SERVERMODE NORMAL
> SERVERNAME
> SERVERHOST hpc-test.wss.yale.edu
> SERVERPORT 42559
> LOGFILE maui.log
> LOGFILEMAXSIZE 10000000
> LOGFILEROLLDEPTH 1
> LOGLEVEL 3
> LOGFACILITY fALL
> SERVERHOMEDIR /opt/maui/
> TOOLSDIR /opt/maui/tools/
> LOGDIR /opt/maui/log/
> STATDIR /opt/maui/stats/
> LOCKFILE /opt/maui/maui.pid
> SERVERCONFIGFILE /opt/maui/maui.cfg
> CHECKPOINTFILE /opt/maui/maui.ck
> CHECKPOINTINTERVAL 00:05:00
> CHECKPOINTEXPIRATIONTIME 3:11:20:00
> TRAPJOB
> TRAPNODE
> TRAPFUNCTION
> RESDEPTH 24
>
> RMPOLLINTERVAL 00:00:30
> NODEACCESSPOLICY SHARED
> ALLOCLOCALITYPOLICY [NONE]
> SIMTIMEPOLICY [NONE]
> ADMIN1 maui root
> ADMINHOSTS ALL
> NODEPOLLFREQUENCY 0
> DISPLAYFLAGS
> DEFAULTDOMAIN
> DEFAULTCLASSLIST [DEFAULT:1]
> FEATURENODETYPEHEADER
> FEATUREPROCSPEEDHEADER
> FEATUREPARTITIONHEADER
> DEFERTIME 1:00:00
> DEFERCOUNT 24
> DEFERSTARTCOUNT 1
> JOBPURGETIME 0
> NODEPURGETIME 2140000000
> APIFAILURETHRESHHOLD 6
> NODESYNCTIME 600
> JOBSYNCTIME 600
> JOBMAXOVERRUN 00:10:00
> NODEMAXLOAD 0.0
>
> PLOTMINTIME 120
> PLOTMAXTIME 245760
> PLOTTIMESCALE 11
> PLOTMINPROC 1
> PLOTMAXPROC 512
> PLOTPROCSCALE 9
> SCHEDCFG[] MODE=NORMAL SERVER=hpc-test.wss.yale.edu:42559
> # RM MODULES: PBS SSS WIKI NATIVE
> RMCFG[base] AUTHTYPE=CHECKSUM EPORT=15004 TIMEOUT=00:01:30 TYPE=PBS
> SIMWORKLOADTRACEFILE workload
> SIMRESOURCETRACEFILE resource
> SIMAUTOSHUTDOWN OFF
> SIMSTARTTIME 0
> SIMSCALEJOBRUNTIME FALSE
> SIMFLAGS
> SIMJOBSUBMISSIONPOLICY CONSTANTJOBDEPTH
> SIMINITIALQUEUEDEPTH 16
> SIMWCACCURACY 0.00
> SIMWCACCURACYCHANGE 0.00
> SIMNODECOUNT 0
> SIMNODECONFIGURATION NORMAL
> SIMWCSCALINGPERCENT 100
> SIMCOMRATE 0.10
> SIMCOMTYPE ROUNDROBIN
> COMINTRAFRAMECOST 0.30
> COMINTERFRAMECOST 0.30
> SIMSTOPITERATION -1
> SIMEXITITERATION -1
>
>
>
> cat maui.cfg:
>
>
> # maui.cfg.tmpl for Maui v3.2.5
>
> # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
> # use the 'schedctl -l' command to display current configuration
>
> RMPOLLINTERVAL 00:00:30
>
> SERVERHOST hpc-test.wss.yale.edu
> SERVERPORT 42559
> SERVERMODE NORMAL
>
> RMCFG[base] TYPE=PBS TIMEOUT=90
>
> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
> # ADMIN1 users have full scheduler control
>
> ADMIN1 maui root
>
> LOGFILE maui.log
> LOGFILEMAXSIZE 10000000
> LOGLEVEL 3
>
> # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html
>
> QUEUETIMEWEIGHT 1
>
> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>
> #FSPOLICY PSDEDICATED
> #FSDEPTH 7
> #FSINTERVAL 86400
> #FSDECAY 0.80
>
> # Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>
> # NONE SPECIFIED
>
> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>
> BACKFILLPOLICY FIRSTFIT
> RESERVATIONPOLICY NEVER # set to never for preemption.
>
> # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
>
> NODEALLOCATIONPOLICY MINRESOURCE
>
> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>
> QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
> QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>
> # Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html
>
> # SRSTARTTIME[test] 8:00:00
> # SRENDTIME[test] 17:00:00
> # SRDAYS[test] MON TUE WED THU FRI
> # SRTASKCOUNT[test] 20
> # SRMAXTIME[test] 0:30:00
>
> #PREEMPTPOLICY set by AG
> PREEMPTIONPOLICY REQUEUE
>
> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>
> USERCFG[DEFAULT] FSTARGET=25.0
> USERCFG[john] PRIORITY=100 FSTARGET=10.0-
> GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
> CLASSCFG[batch] FLAGS=PREEMPTEE
> CLASSCFG[interactive] FLAGS=PREEMPTOR
>
> ### set QOS needed for preemption
> QOSWEIGHT 1
> QOSCFG[admins] QFLAGS=PREEMPTOR PRIORITY=1000
> QOSCFG[general] QFLAGS=PREEMPTEE PRIORITY=100
>
> GROUPWEIGHT 1
> CLASSWEIGHT 1
> CREDWEIGHT 1
> USERWEIGHT 1
>
>
> CLASSCFG[general] QDEF=general PRIORITY=100
>
> GROUPWEIGHT 1
> CLASSCFG[DEFAULT] MAXPROC=280 QDEF=general PRIORITY=200
> CLASSCFG[admins] MAXPROC=280 QDEF=sys PRIORITY=2001