Resource Usage Limits
Moab Workload Manager®

10.4 Resource Usage Limits

Resource usage limits constrain the amount of resources a given job may consume. These limits are generally proportional to the resources requested and may include walltime, any standard resource, or any specified generic resource. The parameter RESOURCELIMITPOLICY controls which resources are limited, what limit policy is enforced per resource, and what actions the scheduler should take in the event of a policy violation.

10.4.1 Configuring Actions

The RESOURCELIMITPOLICY parameter accepts resources, policies, and actions using the following format and values:

Format

RESOURCELIMITPOLICY <RESOURCE>:[<SPOLICY>,]<HPOLICY>:[<SACTION>,]<HACTION>[:[<SVIOLATIONTIME>,]<HVIOLATIONTIME>]...

Resource  Description
CPUTIME   Maximum total job processor-seconds used by any single job. (Allows scheduler enforcement of cpulimit.)
DISK      Local disk space (in MB) used by any single job task.
JOBMEM    Maximum real memory/RAM (in MB) used by any single job.
JOBPROC   Maximum processor load associated with any single job.
MEM       Maximum real memory/RAM (in MB) used by any single job task.
NETWORK   Maximum network load associated with any single job task.
PROC      Maximum processor load associated with any single job task.
SWAP      Maximum virtual memory/SWAP (in MB) used by any single job task.
WALLTIME  Requested job walltime.

Policy               Description
ALWAYS               Takes action whenever a violation is detected.
EXTENDEDVIOLATION    Takes action only if a violation is detected and persists for greater than the specified time limit.
BLOCKEDWORKLOADONLY  Takes action only if a violation is detected and the constrained resource is required by another job.

Action      Description
CANCEL      Terminates the job.
CHECKPOINT  Checkpoints and terminates job.
MIGRATE     Requeues the job and requires a different set of hosts for execution.
NOTIFY      Notifies administrator(s) and job owner(s) regarding violation.
REQUEUE     Terminates and requeues the job.
SUSPEND     Suspends the job and leaves it suspended for an amount of time defined by the X parameter.
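
The three policies above differ only in when a detected violation triggers the configured action. The rule can be sketched in a few lines of Python. This is an illustration of the policy descriptions, not Moab's implementation; the function name and its arguments are hypothetical.

```python
def should_act(policy, in_violation, violation_seconds=0,
               time_limit=0, blocks_other_job=False):
    """Return True if the given policy calls for action right now.

    Hypothetical helper that mirrors the policy descriptions only.
    """
    if not in_violation:
        # No policy acts unless a violation has been detected.
        return False
    if policy == "ALWAYS":
        return True
    if policy == "EXTENDEDVIOLATION":
        # Act only once the violation has outlasted the time limit.
        return violation_seconds > time_limit
    if policy == "BLOCKEDWORKLOADONLY":
        # Act only if another job needs the constrained resource.
        return blocks_other_job
    raise ValueError(f"unknown policy: {policy!r}")


# A violation that has lasted 4 minutes against a 5 minute limit
print(should_act("ALWAYS", True))                       # True
print(should_act("EXTENDEDVIOLATION", True, 240, 300))  # False
```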

Example - Notify and then Cancel Job if Requested Memory is Exceeded

moab.cfg
# if job exceeds memory usage, immediately notify owner
# if job exceeds memory usage for more than 5 minutes, cancel the job

RESOURCELIMITPOLICY MEM:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:00:05:00

Example - Checkpoint Job on Walltime Violations

moab.cfg
# if job exceeds requested walltime, checkpoint job
RESOURCELIMITPOLICY WALLTIME:ALWAYS:CHECKPOINT

# when checkpointing, send term signal, followed by kill 1 minute later
RMCFG[base] TYPE=PBS CHECKPOINTTIMEOUT=00:01:00 CHECKPOINTSIG=SIGTERM

Example - Migrate Job When It Blocks Other Workload

moab.cfg
RESOURCELIMITPOLICY JOBPROC:BLOCKEDWORKLOADONLY:MIGRATE
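
To see how the values in the examples above map onto the Format line, the following Python sketch splits a single RESOURCELIMITPOLICY value into its soft and hard parts. It is not part of Moab; the names LimitPolicy and parse_limit_policy are hypothetical. It assumes one specification per value (the trailing "..." in the Format line is not handled) and does not check resource names, since generic resources are also permitted.

```python
from dataclasses import dataclass
from typing import Optional

POLICIES = {"ALWAYS", "EXTENDEDVIOLATION", "BLOCKEDWORKLOADONLY"}
ACTIONS = {"CANCEL", "CHECKPOINT", "MIGRATE", "NOTIFY", "REQUEUE", "SUSPEND"}


@dataclass
class LimitPolicy:
    resource: str
    soft_policy: Optional[str]
    hard_policy: str
    soft_action: Optional[str]
    hard_action: str
    soft_time: Optional[str] = None
    hard_time: Optional[str] = None


def _soft_hard(field):
    """Split '[soft,]hard' into (soft or None, hard)."""
    parts = field.split(",")
    if len(parts) == 1:
        return None, parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"expected at most two values in {field!r}")


def parse_limit_policy(value):
    """Parse '<RESOURCE>:<policies>:<actions>[:<times>]' (hypothetical helper)."""
    # maxsplit=3 keeps the colons inside HH:MM:SS violation times intact.
    fields = value.strip().split(":", 3)
    if len(fields) < 3:
        raise ValueError(f"need resource, policy and action: {value!r}")
    soft_policy, hard_policy = _soft_hard(fields[1])
    soft_action, hard_action = _soft_hard(fields[2])
    soft_time = hard_time = None
    if len(fields) == 4:
        soft_time, hard_time = _soft_hard(fields[3])
    for name in (soft_policy, hard_policy):
        if name is not None and name not in POLICIES:
            raise ValueError(f"unknown policy: {name!r}")
    for name in (soft_action, hard_action):
        if name is not None and name not in ACTIONS:
            raise ValueError(f"unknown action: {name!r}")
    return LimitPolicy(fields[0], soft_policy, hard_policy,
                       soft_action, hard_action, soft_time, hard_time)


p = parse_limit_policy("MEM:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:00:05:00")
print(p.soft_policy, p.soft_action)               # ALWAYS NOTIFY
print(p.hard_policy, p.hard_action, p.hard_time)  # EXTENDEDVIOLATION CANCEL 00:05:00
```

When only one policy, action, or time is given, the sketch treats it as the hard value, matching the optional brackets around the soft fields in the Format line.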

10.4.2 Specifying Hard and Soft Policy Violations

Moab can take different actions for hard and soft policy violations. Most resource management systems provide no mechanism for the user to specify both a hard and a soft limit. To address this, Moab provides the RESOURCELIMITMULTIPLIER parameter, which specifies per-partition and per-resource multiplier factors used to generate the actual hard and soft limits. If the factor is less than one, the soft limit is lower than the specified value, and Moab takes the soft-limit action before the specified limit is reached. If the factor is greater than one, the hard limit is set higher than the specified limit, leaving a buffer before the hard-limit action is taken.

In the following example, job owners are notified by email when their memory reaches 100% of the target, and the job is canceled if it reaches 125% of the target. For wallclock usage, the job is requeued when it reaches 90% of the specified limit and is checkpointed when it reaches the full limit.

moab.cfg
RESOURCELIMITPOLICY       MEM:ALWAYS:NOTIFY,CANCEL
RESOURCELIMITPOLICY       WALLTIME:ALWAYS:REQUEUE,CHECKPOINT

RESOURCELIMITMULTIPLIER   MEM:1.25,WALLTIME:0.9
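
The arithmetic behind this example can be written out in Python. The function name effective_limits is hypothetical, and the rule it encodes is inferred from the description above (a factor below one lowers the soft limit, a factor above one raises the hard limit). It is not taken from Moab's source.

```python
def effective_limits(requested, multiplier):
    """Return (soft_limit, hard_limit) for a requested amount.

    Assumed rule: the limit that the multiplier does not move
    stays at the requested value.
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    soft = requested * min(1.0, multiplier)
    hard = requested * max(1.0, multiplier)
    return soft, hard


# MEM:1.25 with 2048 MB requested: notify at 2048 MB, cancel at 2560 MB
mem_soft, mem_hard = effective_limits(2048, 1.25)
print(mem_soft, mem_hard)

# WALLTIME:0.9 with 3600 s requested: requeue near 3240 s, checkpoint at 3600 s
wall_soft, wall_hard = effective_limits(3600, 0.9)
print(round(wall_soft), round(wall_hard))
```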

10.4.3 Constraining Walltime Usage

While Moab constrains walltime using the parameter RESOURCELIMITPOLICY like other resources, it also allows walltime exception policies that are not available with other resources. In particular, Moab allows jobs to exceed the requested wallclock limit by an amount specified globally using the JOBMAXOVERRUN parameter or per credential using the OVERRUN attribute of the *CFG credential parameters.

moab.cfg
JOBMAXOVERRUN    00:10:00
CLASSCFG[debug]  overrun=00:00:30  

# send USR1 signal 5 minutes before job is to be terminated
CLASSCFG[viz]    overrun=00:30:00  PRETERMINATIONSIGNAL=13@00:05:00
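
As a rough illustration of the timing in the viz class example, the Python sketch below computes when a job would receive its pre-termination signal and when it would be terminated. The helper names are hypothetical. The sketch assumes times in HH:MM:SS form only, that termination occurs at the requested walltime plus the applicable overrun, and that the signal offset is counted back from that termination point, as the comment in the example suggests. Which overrun applies when both a global and a per-credential value are set is left to the caller.

```python
def parse_hms(text):
    """Convert 'HH:MM:SS' to seconds (other time formats are not handled)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def termination_schedule(walltime, overrun, signal_lead):
    """Return (signal_at, terminate_at) in seconds of job runtime.

    Assumes termination at walltime + overrun, with the signal sent
    signal_lead seconds before that point.
    """
    terminate_at = walltime + overrun
    signal_at = max(0, terminate_at - signal_lead)
    return signal_at, terminate_at


# A viz-class job that requested one hour of walltime
signal_at, terminate_at = termination_schedule(
    walltime=parse_hms("01:00:00"),
    overrun=parse_hms("00:30:00"),
    signal_lead=parse_hms("00:05:00"),
)
print(signal_at, terminate_at)  # 5100 5400
```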
