Managing Consumable Generic Resources

12.6 Managing Consumable Generic Resources

  • 12.6.1 Configuring Node-Locked Consumable Generic Resources
    • 12.6.1.1 Requesting Consumable Generic Resources
  • 12.6.2 Dynamic Generic Resource Examples
  • 12.6.3 Managing Generic Resource Race Conditions

Each time a job is allocated to a compute node, it consumes one or more types of resources. Standard resources such as CPU, memory, disk, network adapter bandwidth, and swap are automatically tracked and consumed by Moab. However, nodes often provide additional resources that jobs consume and that must also be tracked. The purpose of this tracking may include accounting, billing, or the prevention of resource over-subscription. Generic consumable resources may be used to manage software licenses, I/O usage, bandwidth, application connections, or any other aspect of the larger compute environment; they may be associated with compute nodes, networks, storage systems, or other real or virtual resources.

These additional resources can be managed within Moab by defining one or more generic resources. The first step in defining a generic resource involves naming the resource. Generic resource availability can then be associated with various compute nodes and generic resource usage requirements can be associated with jobs.

Differences between Node Features and Consumable Resources

A node feature (or node property) is an opaque string label that is associated with a compute node. Each compute node may have any number of node features assigned to it and jobs may request allocation of nodes that have specific features assigned. Node features are labels and their association with a compute node is not conditional, meaning they cannot be consumed or exhausted. A consumable generic resource, by contrast, is configured with a count; jobs consume part of that count while they hold the resource, so the resource can be exhausted.

12.6.1 Configuring Node-Locked Consumable Generic Resources

Consumable generic resources are supported within Moab using either direct configuration or resource manager auto-detect. For direct configuration, node-locked consumable generic resources (or generic resources) are specified using the NODECFG parameter's GRES attribute. This attribute is specified using the format <ATTR>:<COUNT> as in the following example:

NODECFG[titan001] GRES=tape:4
NODECFG[login32]  GRES=matlab:2,prime:4
NODECFG[login33]  GRES=matlab:2
...

Note By default, Moab supports up to 128 independent generic resource types.
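
When many nodes carry GRES entries, it can help to total the configured counts per resource before loading the configuration. The following is a minimal sketch, not a Moab tool: gres_totals is a hypothetical helper name, and it assumes every entry uses the <ATTR>:<COUNT> format shown above.

```shell
#!/bin/sh
# Sum configured generic resource counts per resource name.
# Reads NODECFG lines on stdin and prints "<name> <total>", sorted by name.
# Assumption: every GRES entry is written as <ATTR>:<COUNT>.
gres_totals() {
  awk '
    /^NODECFG\[/ {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^GRES=/) {
          list = substr($i, 6)            # drop the leading "GRES="
          n = split(list, items, ",")     # one item per resource
          for (j = 1; j <= n; j++) {
            split(items[j], kv, ":")      # kv[1] = name, kv[2] = count
            total[kv[1]] += kv[2]
          }
        }
      }
    }
    END { for (name in total) print name, total[name] }
  ' | sort
}

gres_totals <<'EOF'
NODECFG[titan001] GRES=tape:4
NODECFG[login32]  GRES=matlab:2,prime:4
NODECFG[login33]  GRES=matlab:2
EOF
```

For the three lines above, this prints matlab 4, prime 4, and tape 4, one per line.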

12.6.1.1 Requesting Consumable Generic Resources

Generic resources can be requested on a per task or per job basis using the GRES resource manager extension. If the generic resource is located on a compute node, requests are by default interpreted as a per task request. If the generic resource is located on a shared, cluster-level resource (such as a network or storage system), then the request defaults to a per job interpretation.

If using TORQUE, the GRES or software resource can be requested as in the following examples:

Example 1: Per Task Requests

NODECFG[compute001] GRES=dvd:2 SPEED=2200
NODECFG[compute002] GRES=dvd:2 SPEED=2200
NODECFG[compute003] GRES=dvd:2 SPEED=2200
NODECFG[compute004] GRES=dvd:2 SPEED=2200
NODECFG[compute005] SPEED=2200
NODECFG[compute006] SPEED=2200
NODECFG[compute007] SPEED=2200
NODECFG[compute008] SPEED=2200

# submit job which will allocate only from nodes 1 through 4 requesting one dvd per task
> qsub -l nodes=2,walltime=100,gres=dvd job.cmd

In this example, Moab determines that compute nodes exist that possess the requested generic resource. A compute node is a node object that possesses processors on which compute jobs actually execute. License server, network, and storage resources are typically represented by non-compute nodes. Because compute nodes exist with the requested generic resource, Moab interprets this job as requesting two compute nodes each of which must also possess a DVD generic resource.

Example 2: Per Job Requests

NODECFG[network] PARTITION=shared GRES=bandwidth:2000000

# submit job which will allocate 2 nodes and 10000 units of network bandwidth
> qsub -l nodes=2,walltime=100,gres=bandwidth:10000 job.cmd

In this example, Moab determines that no compute nodes possess the generic resource bandwidth, so the job is translated into a multiple-requirement (multi-req) job. Moab creates a job that has one requirement for two compute nodes and a second requirement for 10000 bandwidth generic resources. Because this is a multi-req job, Moab knows that it can locate these resources separately.

Using Generic Resource Requests in Conjunction with other Constraints

Jobs can explicitly specify generic resource constraints. However, if a job also specifies a hostlist, the hostlist constraint overrides the generic resource constraint when the request is for per task allocation. In Example 1: Per Task Requests, if the job also specified a hostlist, the DVD request would be ignored.

Requesting Resources with No Generic Resources

In some cases, it is valuable to allocate nodes that currently have no generic resources available. This can be done using the special value none as in the following example:

> qsub -l nodes=2,walltime=100,gres=none job.cmd

In this case, the job only allocates compute nodes that have no generic resources associated with them.

Requesting Generic Resources Automatically within a Queue/Class

Generic resource constraints can be assigned to a queue or class and inherited by any jobs that do not have a gres request. This allows targeting of specific resources, automation of co-allocation requests, and other uses. To enable this, use the DEFAULT.GRES attribute of the CLASSCFG parameter as in the following example:

CLASSCFG[viz] DEFAULT.GRES=graphics:2

With this configuration, a viz job that has no gres request of its own also requests two graphics cards for each node it requests.

12.6.2 Dynamic Generic Resource Examples

Dynamic Large Data Set Cluster - Background

Suppose an organization wants to efficiently manage large multi-terabyte data sets within their cluster. To do this, they have partitioned their 1,028-node cluster into 4 quadrants, each with its own high-speed interconnect, 256 compute nodes, and its own associated file server that is also able to execute compute jobs. Compute jobs running on this cluster typically require their input data to be pre-staged to a given set of compute nodes and once staged, the data set may be re-used multiple times before being discarded. Each compute node may only contain a single simultaneous data set and a given data set may not span a partition.

Dynamic Large Data Set Configuration

The first step in setting up this environment is defining the partitions. This can be done by setting a node property indicating the partition associated with each node and by setting an additional node property to indicate whether the node is a file server node. If using TORQUE, this could be done by setting the following in the nodes file:

fs1      p1 fs
fs2      p2 fs
fs3      p3 fs
fs4      p4 fs
node001  p1
node002  p1
...
node257  p2
node258  p2
...
node513  p3
node514  p3
...
node769  p4
node770  p4
...
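
Rather than typing more than a thousand entries by hand, the nodes file can be generated. The following is a minimal sketch under stated assumptions: compute nodes are named node001 through node1024, each quadrant holds 256 consecutively numbered nodes, and gen_nodes_file is a hypothetical helper name.

```shell
#!/bin/sh
# Print a TORQUE nodes file for the four-quadrant layout described above.
# Assumptions: compute nodes are named node001..node1024, and quadrant pN
# holds 256 consecutively numbered nodes plus the file server fsN.
gen_nodes_file() {
  per_quadrant=256
  quadrants=4

  # File servers first: fsN belongs to partition pN and carries the fs property.
  q=1
  while [ "$q" -le "$quadrants" ]; do
    printf 'fs%d p%d fs\n' "$q" "$q"
    q=$((q + 1))
  done

  # Compute nodes: 1-256 map to p1, 257-512 to p2, and so on.
  n=1
  total=$((per_quadrant * quadrants))
  while [ "$n" -le "$total" ]; do
    q=$(( (n - 1) / per_quadrant + 1 ))
    printf 'node%03d p%d\n' "$n" "$q"
    n=$((n + 1))
  done
}

# Show the first six generated lines.
gen_nodes_file | sed -n '1,6p'
```

The output is 1,028 lines in total: four file server entries followed by 1,024 compute node entries. Redirect the function's output to a file to review it before installing it as the nodes file.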

The next step is to associate these features with Moab partition boundaries. This is done using the FEATUREPARTITIONHEADER parameter as in the following example:

FEATUREPARTITIONHEADER p

With partitions now set up, jobs run entirely within a single partition unless explicitly granted special QoS-based permission to run otherwise.

Note If not using partitions, or if using partitions for other purposes, the needed behavior can also be accomplished using Node Sets.

Dynamic Large Data Set Job Submission

The next step is for the users to submit one or more data staging jobs. As each data staging job successfully completes, it should indicate to Moab that the allocated nodes are now enabled with the needed data set. This is accomplished using the mnodectl -m command to modify the node's generic resources. For example, to indicate that the bio13 data set is now in place and available within a data staging job, add the following to the end of the data staging job script:

for i in `cat $PBS_NODEFILE`
  do
    mnodectl -m resource.bio13=1,1 $i
  done

Note In most cases, the resource count should be set to match the number of processors on the compute node. For example, in a dual processor cluster, the mnodectl command used in the datastage.cmd script should be mnodectl -m resource.bio13=2,2 $i.
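
One possible way to derive the per-node count inside the staging job is to count how often each host appears in the node file. This is a sketch only. It assumes the node file lists each host once per allocated processor and that the job was allocated every processor on each node; gres_commands is a hypothetical helper name. The commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# Print one mnodectl command per allocated node, setting the generic
# resource count to the number of times the node appears in the node file.
# Assumption: the node file lists each host once per allocated processor.
gres_commands() {
  nodefile=$1
  gres=$2
  # uniq -c emits "<count> <host>"; read splits that into two variables.
  sort "$nodefile" | uniq -c | while read -r count host; do
    printf 'mnodectl -m resource.%s=%s,%s %s\n' "$gres" "$count" "$count" "$host"
  done
}

# Only meaningful inside a batch job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
  gres_commands "$PBS_NODEFILE" bio13
fi
```

Piping the printed commands to sh would execute them, once the output has been checked.
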
Note By default, end-users do not have authority to use the mnodectl command. This can be enabled via the ADMINCFG parameter as in the following:

ADMINCFG[4] USERS=ALL SERVICES=mnodectl

This command will modify the generic resources available on the compute nodes. To submit a data staging job that initializes the per node data sets as needed, the qsub command can be used with a generic resource constraint:

> qsub -l nodes=1:fs+64,gres=none datastage.cmd

Note By requesting a generic resource of none, the job requests resources that are exclusive of all other currently active data sets.

A compute job may now request to run on nodes with the generic resource in place, again using the qsub gres constraint as in the following:

> qsub -l nodes=1:fs+64,gres=bio13

As many of these jobs as desired may be submitted and queued. They are blocked until the data staging job successfully completes and are then steered toward the target nodes that have the needed data set in place. When the data set is no longer needed, it can be removed automatically by a clean-up job or manually using the mnodectl command as in the following:

mnodectl -m resource.bio13=0,0 ALL

Note The clean-up job can be submitted with a job dependency or start time constraint.
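
As one way to express the job dependency, the clean-up job could wait on the compute jobs using qsub's -W depend option. The following sketch only builds the dependency argument. The helper name depend_afterany, the script name cleanup.cmd, and the job IDs are assumptions for illustration.

```shell
#!/bin/sh
# Build a qsub dependency argument that waits for every listed job to end.
# afterany is satisfied whether those jobs succeed or fail, so the data set
# marker is cleared either way.
depend_afterany() {
  dep="depend=afterany"
  for id in "$@"; do
    dep="$dep:$id"
  done
  printf '%s\n' "$dep"
}

# Hypothetical usage. cleanup.cmd is an assumed script that runs
#   mnodectl -m resource.bio13=0,0 ALL
# and the job IDs are whatever qsub printed for the compute jobs:
#   qsub -W "$(depend_afterany 1201.head 1202.head)" cleanup.cmd
depend_afterany 1201.head 1202.head
```

This prints depend=afterany:1201.head:1202.head. A start time constraint would use qsub's -a option instead of a dependency.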

12.6.3 Managing Generic Resource Race Conditions

A software license race condition "window of opportunity" opens when Moab checks a license server for sufficient available licenses and closes when the user's software actually checks out the software licenses. The time between these two events can be seconds to many minutes depending on overhead factors such as node OS provisioning, job startup, licensed software startup, and so forth.

During this window, another Moab-scheduled job, or a user or job external to the cluster or cloud, can obtain enough software licenses that by the time the job attempts to obtain its own licenses, an insufficient quantity is available. In such cases, the job sits and waits for the license, and while it waits it occupies but does not use resources that another job could have used. Use the STARTDELAY parameter to prevent such a situation.

GRESCFG[<license>] STARTDELAY=<window_of_opportunity>

With the STARTDELAY parameter set for a generic resource, once Moab starts a job that requests that resource, it blocks any idle jobs requesting the same generic resource from starting until the <window_of_opportunity> passes. The window is defined by the customer on a per generic resource basis.
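
One way to choose a window is to measure how long past jobs took to check out their licenses and take the longest delay. The following is a rough sketch, not a Moab feature: the input format is an assumption, job start time stands in for the moment Moab checked the license server, and max_window is a hypothetical helper name.

```shell
#!/bin/sh
# Report the longest observed delay, in seconds, between a job starting and
# its software checking out a license.
# Assumed input format, one line per job:
#   <job_start_epoch> <license_checkout_epoch>
max_window() {
  awk '
    NF >= 2 {
      delay = $2 - $1
      if (delay > max) max = delay
    }
    END { print max + 0 }   # prints 0 when there is no input
  '
}

max_window <<'EOF'
1700000000 1700000045
1700000100 1700000310
1700000200 1700000212
EOF
```

For the sample data this prints 210. The result is only a starting point; check the Moab parameter reference for the time format that STARTDELAY accepts.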
