[Mauiusers] Preemption not working : job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available'

Tom Rudwick trudwickiii at apple.com
Tue Apr 20 14:45:04 MDT 2010


Hi Andre,

We have preemption working at our site on that version of maui.

We have found that the settings below seem to be necessary for
it to work at our site. I don't see a SYSCFG in your config,
and I don't see a GROUPCFG for the admins group? I may be off
base on these, since I know some bugs have been fixed since we
got this working, but you may want to try setting those.

On this line you set the "sys" QOS but I don't see it elsewhere...

CLASSCFG[admins]	MAXPROC=280 QDEF=sys   PRIORITY=2001

I see this "admins" one...

QOSCFG[admins]		QFLAGS=PREEMPTOR  PRIORITY=1000

Good luck,

Tom

( this is a fragment of our maui config file ...)

QOSWEIGHT 1
SYSCFG QLIST=bigmem,integration,interactive,debug,regress,contingent
QOSCFG[bigmem]       PRIORITY=1  QFLAGS=PREEMPTOR,RESTARTPREEMPT
QOSCFG[integration]  PRIORITY=1  QFLAGS=USERESERVED
QOSCFG[interactive]  PRIORITY=2  QFLAGS=PREEMPTOR,RESTARTPREEMPT
QOSCFG[debug]        PRIORITY=1
QOSCFG[regress]      PRIORITY=-1
QOSCFG[contingent]   PRIORITY=-2 QFLAGS=PREEMPTEE
GROUPCFG[users] QDEF=DEFAULT 
QLIST=bigmem,integration,interactive,debug,regress,contingent
CLASSCFG[regress] QDEF=contingent



Andre Gauthier wrote:
> HI, I'm trying to get preemption to work with Maui and Torque.     I
> have dozen queues, but one is define as a preemptee (general queue &
> qos) and another as a preemptor (admins queue & qos).  I submit a job
> to the queue that is a premptee then a job to the preemptor.  The
> preemptor does not run.  Maui version 3.2.6p21, Torque Version
> 2.3.6-1.
>
> qstat:
>
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 459.hpc-test              sleep.sh         user2           00:00:00 R
> general
> 460.hpc-test              sleep.sh         user1                  0 Q admins
>
>
> checkjob 460:
>
> checking job 460
>
> State: Idle  EState: Deferred
> Creds:  user:user1  group:admins  class:admins  qos:admins
> WallTime: 00:00:00 of 1:00:00
> SubmitTime: Tue Apr 20 11:41:28
>   (Time Queued  Total: 00:00:02  Eligible: 00:00:01)
>
> StartDate: 00:00:00  Tue Apr 20 11:41:30
> Total Tasks: 8
>
> Req[0]  TaskCount: 8  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1  MEM: 32M
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE PREEMPTOR
>
> job is deferred.  Reason:  RMFailure  (cannot start job - RM failure,
> rc: 15044, msg: 'Resource temporarily unavailable MSG=job allocation
> request exceeds currently available cluster nodes, 1 requested, 0
> available')
> Holds:    Defer  (hold reason:  RMFailure)
> PE:  8.00  StartPriority:  3001
> cannot select job 460 for partition DEFAULT (job hold active)
>
>
> checkjob 459:
>
> checking job 459
>
> State: Running
> Creds:  user:user2  group:user2  class:general  qos:general
> WallTime: 00:03:05 of 1:00:00
> SubmitTime: Tue Apr 20 11:41:11
>   (Time Queued  Total: 00:00:19  Eligible: 00:00:01)
>
> StartTime: Tue Apr 20 11:41:30
> Total Tasks: 96
>
> Req[0]  TaskCount: 96  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1  MEM: 2M
> Allocated Nodes:
> [compute-0-15:8][compute-0-13:8][compute-0-12:8][compute-0-11:8]
> [compute-0-10:8][compute-0-9:8][compute-0-8:8][compute-0-7:8]
> [compute-0-6:8][compute-0-5:8][compute-0-4:8][compute-0-3:8]
>
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 2
> PartitionMask: [ALL]
> Flags:       RESTARTABLE PREEMPTEE
> Attr:        PREEMPTEE
>
> Reservation '459' (-00:03:06 -> 00:56:54  Duration: 1:00:00)
> PE:  96.00  StartPriority:  200
>
>
>
>
>
> showconfig:
>
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 2
> PartitionMask: [ALL]
> Flags:       RESTARTABLE PREEMPTEE
> Attr:        PREEMPTEE
>
> Reservation '459' (-00:03:06 -> 00:56:54  Duration: 1:00:00)
> PE:  96.00  StartPriority:  200
>
> [root at hpc-test maui]# showconfig
> # Maui version 3.2.6p21 (PID: 16046)
> # global policies
>
> REJECTNEGPRIOJOBS[0]              FALSE
> ENABLENEGJOBPRIORITY[0]           FALSE
> ENABLEMULTINODEJOBS[0]            TRUE
> ENABLEMULTIREQJOBS[0]             FALSE
> BFPRIORITYPOLICY[0]               [NONE]
> JOBPRIOACCRUALPOLICY            QUEUEPOLICY
> NODELOADPOLICY                  ADJUSTSTATE
> USEMACHINESPEED                 FALSE
> USESYSTEMQUEUETIME              TRUE
> USELOCALMACHINEPRIORITY         FALSE
> NODEUNTRACKEDLOADFACTOR         1.2
> JOBNODEMATCHPOLICY[0]
>
> JOBMAXSTARTTIME[0]                  INFINITY
>
> METAMAXTASKS[0]                   0
> NODESETPOLICY[0]                  [NONE]
> NODESETATTRIBUTE[0]               [NONE]
> NODESETLIST[0]
> NODESETDELAY[0]                   00:00:00
> NODESETPRIORITYTYPE[0]            MINLOSS
> NODESETTOLERANCE[0]                 0.00
>
> BACKFILLPOLICY[0]                 FIRSTFIT
> BACKFILLDEPTH[0]                  0
> BACKFILLPROCFACTOR[0]             0
> BACKFILLMAXSCHEDULES[0]           10000
> BACKFILLMETRIC[0]                 PROCS
>
> BFCHUNKDURATION[0]                00:00:00
> BFCHUNKSIZE[0]                    0
> PREEMPTPOLICY[0]                  REQUEUE
> MINADMINSTIME[0]                  00:00:00
> RESOURCELIMITPOLICY[0]
> NODEAVAILABILITYPOLICY[0]         COMBINED:[DEFAULT]
> NODEALLOCATIONPOLICY[0]           MINRESOURCE
> TASKDISTRIBUTIONPOLICY[0]         DEFAULT
> RESERVATIONPOLICY[0]              NEVER
> RESERVATIONRETRYTIME[0]           00:00:00
> RESERVATIONTHRESHOLDTYPE[0]       NONE
> RESERVATIONTHRESHOLDVALUE[0]      0
>
> FSPOLICY                        [NONE]
> FSPOLICY                        [NONE]
> FSINTERVAL                      12:00:00
> FSDEPTH                         8
> FSDECAY                         1.00
>
>
>
> # Priority Weights
>
> SERVICEWEIGHT[0]                  1
> TARGETWEIGHT[0]                   1
> CREDWEIGHT[0]                     1
> ATTRWEIGHT[0]                     1
> FSWEIGHT[0]                       1
> RESWEIGHT[0]                      1
> USAGEWEIGHT[0]                    1
> QUEUETIMEWEIGHT[0]                1
> XFACTORWEIGHT[0]                  0
> SPVIOLATIONWEIGHT[0]              0
> BYPASSWEIGHT[0]                   0
> TARGETQUEUETIMEWEIGHT[0]          0
> TARGETXFACTORWEIGHT[0]            0
> USERWEIGHT[0]                     1
> GROUPWEIGHT[0]                    1
> ACCOUNTWEIGHT[0]                  0
> QOSWEIGHT[0]                      1
> CLASSWEIGHT[0]                    1
> FSUSERWEIGHT[0]                   0
> FSGROUPWEIGHT[0]                  0
> FSACCOUNTWEIGHT[0]                0
> FSQOSWEIGHT[0]                    0
> FSCLASSWEIGHT[0]                  0
> ATTRATTRWEIGHT[0]                 0
> ATTRSTATEWEIGHT[0]                0
> NODEWEIGHT[0]                     0
> PROCWEIGHT[0]                     0
> MEMWEIGHT[0]                      0
> SWAPWEIGHT[0]                     0
> DISKWEIGHT[0]                     0
> PSWEIGHT[0]                       0
> PEWEIGHT[0]                       0
> WALLTIMEWEIGHT[0]                 0
> UPROCWEIGHT[0]                    0
> UJOBWEIGHT[0]                     0
> CONSUMEDWEIGHT[0]                 0
> USAGEEXECUTIONTIMEWEIGHT[0]       0
> REMAININGWEIGHT[0]                0
> PERCENTWEIGHT[0]                  0
> XFMINWCLIMIT[0]                   00:02:00
>
>
> # partition DEFAULT policies
>
> REJECTNEGPRIOJOBS[1]              FALSE
> ENABLENEGJOBPRIORITY[1]           FALSE
> ENABLEMULTINODEJOBS[1]            TRUE
> ENABLEMULTIREQJOBS[1]             FALSE
> BFPRIORITYPOLICY[1]               [NONE]
> JOBPRIOACCRUALPOLICY            QUEUEPOLICY
> NODELOADPOLICY                  ADJUSTSTATE
> JOBNODEMATCHPOLICY[1]
>
> JOBMAXSTARTTIME[1]                  INFINITY
>
> METAMAXTASKS[1]                   0
> NODESETPOLICY[1]                  [NONE]
> NODESETATTRIBUTE[1]               [NONE]
> NODESETLIST[1]
> NODESETDELAY[1]                   00:00:00
> NODESETPRIORITYTYPE[1]            MINLOSS
> NODESETTOLERANCE[1]                 0.00
>
> # Priority Weights
>
> XFMINWCLIMIT[1]                   00:00:00
>
> RMAUTHTYPE[0]                     CHECKSUM
>
> CLASSCFG[[NONE]]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[[ALL]]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[DEFAULT]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[batch]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[interactive]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[general]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[priya]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[admins]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[sohrab]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[micro]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[altonji]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[easther]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[berry]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[hpcprog]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[macro]  DEFAULT.FEATURES=[NONE]
> QOSPRIORITY[0]                    0
> QOSQTWEIGHT[0]                    0
> QOSXFWEIGHT[0]                    0
> QOSTARGETXF[0]                      0.00
> QOSTARGETQT[0]                    00:00:00
> QOSFLAGS[0]
> QOSPRIORITY[1]                    0
> QOSQTWEIGHT[1]                    0
> QOSXFWEIGHT[1]                    0
> QOSTARGETXF[1]                      0.00
> QOSTARGETQT[1]                    00:00:00
> QOSFLAGS[1]
> QOSPRIORITY[2]                    100
> QOSQTWEIGHT[2]                    0
> QOSXFWEIGHT[2]                    0
> QOSTARGETXF[2]                    100.00
> QOSTARGETQT[2]                    00:00:00
> QOSFLAGS[2]
> QOSPRIORITY[3]                    -1000
> QOSQTWEIGHT[3]                    0
> QOSXFWEIGHT[3]                    0
> QOSTARGETXF[3]                      0.00
> QOSTARGETQT[3]                    00:00:00
> QOSFLAGS[3]
> QOSPRIORITY[4]                    1000
> QOSQTWEIGHT[4]                    0
> QOSXFWEIGHT[4]                    0
> QOSTARGETXF[4]                      0.00
> QOSTARGETQT[4]                    00:00:00
> QOSFLAGS[4]                       PREEMPTOR
> QOSPRIORITY[5]                    100
> QOSQTWEIGHT[5]                    0
> QOSXFWEIGHT[5]                    0
> QOSTARGETXF[5]                      0.00
> QOSTARGETQT[5]                    00:00:00
> QOSFLAGS[5]                       PREEMPTEE
> # SERVER MODULES:  MX
> SERVERMODE                      NORMAL
> SERVERNAME
> SERVERHOST                      hpc-test.wss.yale.edu
> SERVERPORT                      42559
> LOGFILE                         maui.log
> LOGFILEMAXSIZE                  10000000
> LOGFILEROLLDEPTH                1
> LOGLEVEL                        3
> LOGFACILITY                     fALL
> SERVERHOMEDIR                   /opt/maui/
> TOOLSDIR                        /opt/maui/tools/
> LOGDIR                          /opt/maui/log/
> STATDIR                         /opt/maui/stats/
> LOCKFILE                        /opt/maui/maui.pid
> SERVERCONFIGFILE                /opt/maui/maui.cfg
> CHECKPOINTFILE                  /opt/maui/maui.ck
> CHECKPOINTINTERVAL              00:05:00
> CHECKPOINTEXPIRATIONTIME        3:11:20:00
> TRAPJOB
> TRAPNODE
> TRAPFUNCTION
> RESDEPTH                        24
>
> RMPOLLINTERVAL                  00:00:30
> NODEACCESSPOLICY                SHARED
> ALLOCLOCALITYPOLICY             [NONE]
> SIMTIMEPOLICY                   [NONE]
> ADMIN1                          maui root
> ADMINHOSTS                      ALL
> NODEPOLLFREQUENCY               0
> DISPLAYFLAGS
> DEFAULTDOMAIN
> DEFAULTCLASSLIST                [DEFAULT:1]
> FEATURENODETYPEHEADER
> FEATUREPROCSPEEDHEADER
> FEATUREPARTITIONHEADER
> DEFERTIME                       1:00:00
> DEFERCOUNT                      24
> DEFERSTARTCOUNT                 1
> JOBPURGETIME                    0
> NODEPURGETIME                   2140000000
> APIFAILURETHRESHHOLD            6
> NODESYNCTIME                    600
> JOBSYNCTIME                     600
> JOBMAXOVERRUN                   00:10:00
> NODEMAXLOAD                     0.0
>
> PLOTMINTIME                     120
> PLOTMAXTIME                     245760
> PLOTTIMESCALE                   11
> PLOTMINPROC                     1
> PLOTMAXPROC                     512
> PLOTPROCSCALE                   9
> SCHEDCFG[]                        MODE=NORMAL
> SERVER=hpc-test.wss.yale.edu:42559
> # RM MODULES: PBS SSS WIKI NATIVE
> RMCFG[base] AUTHTYPE=CHECKSUM EPORT=15004 TIMEOUT=00:01:30 TYPE=PBS
> SIMWORKLOADTRACEFILE            workload
> SIMRESOURCETRACEFILE            resource
> SIMAUTOSHUTDOWN                 OFF
> SIMSTARTTIME                    0
> SIMSCALEJOBRUNTIME              FALSE
> SIMFLAGS
> SIMJOBSUBMISSIONPOLICY          CONSTANTJOBDEPTH
> SIMINITIALQUEUEDEPTH            16
> SIMWCACCURACY                   0.00
> SIMWCACCURACYCHANGE             0.00
> SIMNODECOUNT                    0
> SIMNODECONFIGURATION            NORMAL
> SIMWCSCALINGPERCENT             100
> SIMCOMRATE                      0.10
> SIMCOMTYPE                      ROUNDROBIN
> COMINTRAFRAMECOST               0.30
> COMINTERFRAMECOST               0.30
> SIMSTOPITERATION                -1
> SIMEXITITERATION                -1
>
>
>
> cat maui.cfg:
>
>
> # maui.cfg.tmpl for Maui v3.2.5
>
> # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
> # use the 'schedctl -l' command to display current configuration
>
> RMPOLLINTERVAL		00:00:30
>
> SERVERHOST		hpc-test.wss.yale.edu
> SERVERPORT		42559
> SERVERMODE		NORMAL
>
> RMCFG[base]		TYPE=PBS TIMEOUT=90
>
> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
> # ADMIN1 users have full scheduler control
>
> ADMIN1                maui root
>
> LOGFILE               maui.log
> LOGFILEMAXSIZE        10000000
> LOGLEVEL              3
>
> # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html
>
> QUEUETIMEWEIGHT       1
>
> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>
> #FSPOLICY              PSDEDICATED
> #FSDEPTH               7
> #FSINTERVAL            86400
> #FSDECAY               0.80
>
> # Throttling Policies:
> http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>
> # NONE SPECIFIED
>
> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     NEVER # set to never for premption.
>
> # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
>
> NODEALLOCATIONPOLICY  MINRESOURCE
>
> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>
>  QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>  QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>
> # Standing Reservations:
> http://supercluster.org/mauidocs/7.1.3standingreservations.html
>
> # SRSTARTTIME[test] 8:00:00
> # SRENDTIME[test]   17:00:00
> # SRDAYS[test]      MON TUE WED THU FRI
> # SRTASKCOUNT[test] 20
> # SRMAXTIME[test]   0:30:00
>
> #PREEMPTPOLICY set by  AG
> PREEMPTIONPOLICY REQUEUE
>
> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>
>  USERCFG[DEFAULT]      FSTARGET=25.0
>  USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>  GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>  CLASSCFG[batch]       FLAGS=PREEMPTEE
>  CLASSCFG[interactive] FLAGS=PREEMPTOR
>
> ###set QOS needed for premptions
> QOSWEIGHT 1
> QOSCFG[admins]		QFLAGS=PREEMPTOR  PRIORITY=1000
> QOSCFG[general]	  	QFLAGS=PREEMPTEE PRIORITY=100
>
> GROUPWEIGHT 1
> CLASSWEIGHT 1
> CREDWEIGHT 1
> USERWEIGHT 1
>
>
> CLASSCFG[general] QDEF=general PRIORITY=100
>
> GROUPWEIGHT 1
> CLASSCFG[DEFAULT]	MAXPROC=280 QDEF=general  PRIORITY=200
> CLASSCFG[admins]	MAXPROC=280 QDEF=sys   PRIORITY=2001
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers
>
>   



More information about the mauiusers mailing list