12.6 Managing Consumable Generic Resources
Each time a job is allocated to a compute node, it consumes one or more types of resources. Standard resources such as CPU, memory, disk, network adapter bandwidth, and swap are automatically tracked and consumed by Moab. In many cases, however, nodes provide and jobs consume additional resources that must also be tracked. The purpose of this tracking may include accounting, billing, or the prevention of resource over-subscription. Generic consumable resources may be used to manage software licenses, I/O usage, bandwidth, application connections, or any other aspect of the larger compute environment; they may be associated with compute nodes, networks, storage systems, or other real or virtual resources.
These additional resources can be managed within Moab by defining one or more generic resources. The first step in defining a generic resource involves naming the resource. Generic resource availability can then be associated with various compute nodes and generic resource usage requirements can be associated with jobs.
Differences between Node Features and Consumable Resources
A node feature (or node property) is an opaque string label that is associated with a compute node. Each compute node may have any number of node features assigned to it and jobs may request allocation of nodes that have specific features assigned. Node features are labels and their association with a compute node is not conditional, meaning they cannot be consumed or exhausted.
Consumable generic resources are supported within Moab using either direct configuration or resource manager auto-detect. For direct configuration, node-locked consumable generic resources (or generic resources) are specified using the NODECFG parameter's GRES attribute. This attribute is specified using the format <ATTR>:<COUNT> as in the following example:
NODECFG[titan001] GRES=tape:4
NODECFG[login32]  GRES=matlab:2,prime:4
NODECFG[login33]  GRES=matlab:2
...
Generic resources can be requested on a per-task or per-job basis using the GRES resource manager extension. If the generic resource is located on a compute node, requests are by default interpreted as per-task requests. If the generic resource is located on a shared, cluster-level resource (such as a network or storage system), then the request defaults to a per-job interpretation.
Example 1: Per Task Requests
NODECFG[compute001] GRES=dvd:2 SPEED=2200
NODECFG[compute002] GRES=dvd:2 SPEED=2200
NODECFG[compute003] GRES=dvd:2 SPEED=2200
NODECFG[compute004] GRES=dvd:2 SPEED=2200
NODECFG[compute005] SPEED=2200
NODECFG[compute006] SPEED=2200
NODECFG[compute007] SPEED=2200
NODECFG[compute008] SPEED=2200
# submit job which will allocate only from nodes 1 through 4 requesting one dvd per task
> qsub -l nodes=2,walltime=100,gres=dvd job.cmd
In this example, Moab determines that compute nodes exist that possess the requested generic resource. A compute node is a node object that possesses processors on which compute jobs actually execute. License server, network, and storage resources are typically represented by non-compute nodes. Because compute nodes exist with the requested generic resource, Moab interprets this job as requesting two compute nodes each of which must also possess a DVD generic resource.
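The repetitive node entries in Example 1 can also be generated rather than typed by hand. The following is a minimal sketch, not a Moab tool: the helper name emit_nodecfg is hypothetical, and it assumes node names consist of a prefix followed by a three-digit number, as in the example.

```shell
#!/bin/sh
# Hypothetical helper: print NODECFG lines for a numbered range of nodes.
# Usage: emit_nodecfg PREFIX FIRST LAST ATTRIBUTES
emit_nodecfg() {
    prefix=$1; first=$2; last=$3; attrs=$4
    i=$first
    while [ "$i" -le "$last" ]; do
        # %03d pads the node number to three digits (compute001, compute002, ...)
        printf 'NODECFG[%s%03d] %s\n' "$prefix" "$i" "$attrs"
        i=$((i + 1))
    done
}

# Nodes 1 through 4 carry two dvd resources each; nodes 5 through 8 carry none.
emit_nodecfg compute 1 4 'GRES=dvd:2 SPEED=2200'
emit_nodecfg compute 5 8 'SPEED=2200'
```

The output is plain text intended to be reviewed and pasted into the Moab configuration; the script does not modify any file.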
Example 2: Per Job Requests
NODECFG[network] PARTITION=shared GRES=bandwidth:2000000
# submit job which will allocate 2 nodes and 10000 units of network bandwidth
> qsub -l nodes=2,walltime=100,gres=bandwidth:10000 job.cmd
In this example, Moab determines that no compute nodes possess the generic resource bandwidth, so this job is translated into a multiple-requirement (multi-req) job. Moab creates a job that has a requirement for two compute nodes and a second requirement for 10000 bandwidth generic resources. Because this is a multi-req job, Moab knows that it can locate these needed resources separately.
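The default interpretation rule described in Examples 1 and 2 can be summarized in a few lines of shell. This is only an illustration of the rule as stated in this section, not Moab's implementation. The function name gres_request_mode is hypothetical, and it assumes the caller supplies a file containing the NODECFG lines of compute nodes only (non-compute nodes such as network are left out).

```shell
#!/bin/sh
# Hypothetical helper: report how a gres request would be interpreted by default.
# $1 = generic resource name
# $2 = file holding NODECFG lines for compute nodes only (assumed input)
gres_request_mode() {
    res=$1; compute_cfg=$2
    # Match the resource as a whole entry in a GRES=name:count[,name:count] list.
    if grep -Eq "GRES=([^ ]*,)?${res}:[0-9]+" "$compute_cfg"; then
        echo per-task   # some compute node offers the resource
    else
        echo per-job    # resource lives only on shared, non-compute nodes
    fi
}

# Usage with two of the compute nodes from Example 1.
cfg=$(mktemp)
printf '%s\n' \
    'NODECFG[compute001] GRES=dvd:2 SPEED=2200' \
    'NODECFG[compute005] SPEED=2200' > "$cfg"
gres_request_mode dvd "$cfg"
gres_request_mode bandwidth "$cfg"
rm -f "$cfg"
```

With this input, the dvd request is classified as per-task because a compute node offers it, while the bandwidth request falls through to per-job.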
Using Generic Resource Requests in Conjunction with other Constraints
Jobs can explicitly specify generic resource constraints. However, if a job also specifies a hostlist, the hostlist constraint overrides the generic resource constraint when the request is for per-task allocation. In Example 1: Per Task Requests, if the job also specifies a hostlist, the DVD request is ignored.
Requesting Resources with No Generic Resources
In some cases, it is valuable to allocate nodes that currently have no generic resources available. This can be done using the special value none as in the following example:
> qsub -l nodes=2,walltime=100,gres=none job.cmd
In this case, the job only allocates compute nodes that have no generic resources associated with them.
Requesting Generic Resources Automatically within a Queue/Class
Generic resource constraints can be assigned to a queue or class and inherited by any jobs that do not have a gres request. This allows targeting of specific resources, automation of co-allocation requests, and other uses. To enable this, use the DEFAULT.GRES attribute of the CLASSCFG parameter as in the following example:
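A minimal sketch of such an entry, written as a shell snippet that appends it to a configuration file. The class name viz follows the description in this section; the generic resource name graphics and the file path in MOAB_CFG are assumptions made for illustration only.

```shell
#!/bin/sh
# MOAB_CFG is a hypothetical path; point it at the real moab.cfg when ready.
MOAB_CFG=${MOAB_CFG:-./moab.cfg.sketch}

# "graphics" is an assumed generic resource name, not one defined in this section.
cat >> "$MOAB_CFG" <<'EOF'
CLASSCFG[viz] DEFAULT.GRES=graphics:2
EOF
```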
For each node requested by a viz job, also request two graphics cards.
Suppose an organization wants to efficiently manage large multi-terabyte data sets within its cluster. To do this, it has partitioned its 1,028-node cluster into 4 quadrants, each with its own high-speed interconnect, 256 compute nodes, and an associated file server that is also able to execute compute jobs. Compute jobs running on this cluster typically require their input data to be pre-staged to a given set of compute nodes; once staged, the data set may be reused multiple times before being discarded. Each compute node may hold only one data set at a time, and a given data set may not span partitions.
Dynamic Large Data Set Configuration
The first step in setting up this environment is defining the partitions. This can be done by setting a node property indicating the partition associated with each node and by setting an additional node property to indicate whether the node is a file server node. If using TORQUE, this could be done by setting the following in the nodes file:
fs1 p1 fs
fs2 p2 fs
fs3 p3 fs
fs4 p4 fs
node001 p1
node002 p1
...
node256 p2
node257 p2
...
node512 p3
node513 p3
...
node768 p4
node769 p4
...
The next step is to associate these features with Moab partition boundaries. This is done using the FEATUREPARTITIONHEADER parameter as in the following example:
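A minimal sketch of such an entry, written as a shell snippet that appends it to a configuration file. It assumes the parameter takes the leading string shared by the partition features (here p, so that p1 through p4 are treated as partitions while fs remains an ordinary feature). Both that value and the MOAB_CFG path are assumptions and should be checked against the Moab parameter reference.

```shell
#!/bin/sh
# MOAB_CFG is a hypothetical path; point it at the real moab.cfg when ready.
MOAB_CFG=${MOAB_CFG:-./moab.cfg.sketch}

# Assumed value: "p" is the common prefix of the features p1, p2, p3 and p4.
cat >> "$MOAB_CFG" <<'EOF'
FEATUREPARTITIONHEADER p
EOF
```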
With partitions now set up, jobs run entirely within a single partition unless explicitly granted special QoS based permission to run otherwise.
Dynamic Large Data Set Job Submission
The next step is for the users to submit one or more data staging jobs. As each data staging job successfully completes, it should indicate to Moab that the allocated nodes are now enabled with the needed data set. This is accomplished using the mnodectl -m command to modify the node's generic resources. For example, to indicate that the bio13 data set is now in place and available within a data staging job, add the following to the end of the data staging job script:
for i in `cat $PBS_NODEFILE`
do
  mnodectl -m resource.bio13=1,1 $i
done
This command will modify the generic resources available on the compute nodes. To submit a data staging job that initializes the per node data sets as needed, the qsub command can be used with a generic resource constraint:
> qsub -l nodes=1:fs+64,gres=none datastage.cmd
A compute job may now request to run on nodes with the generic resource in place, again using the qsub gres constraint as in the following:
> qsub -l nodes=1:fs+64,gres=bio13
As many of these jobs as desired may be submitted and queued. They are blocked until the data staging job successfully completes and are then steered toward the target nodes that have the needed data set in place. When the data set is no longer needed, it can be removed automatically by a clean-up job or manually using the mnodectl command, as in the following:
mnodectl -m resource.bio13=0,0 ALL
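Because $PBS_NODEFILE can list the same host more than once (one line per allocated processor), a staging or clean-up script may want to issue one update per distinct host and review the commands before running them. The following dry-run sketch only prints the commands; the helper name print_gres_updates is hypothetical, and the mnodectl syntax is copied from the examples above.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) one mnodectl command per distinct host.
# $1 = node file (for example "$PBS_NODEFILE")
# $2 = data set name (for example bio13)
# $3 = value to set (1,1 to mark the data set present, 0,0 to clear it)
print_gres_updates() {
    sort -u "$1" | while read -r host; do
        [ -n "$host" ] || continue   # skip blank lines
        printf 'mnodectl -m resource.%s=%s %s\n' "$2" "$3" "$host"
    done
}

# Usage with a small stand-in for $PBS_NODEFILE.
nodefile=$(mktemp)
printf 'node001\nnode001\nnode002\n' > "$nodefile"
print_gres_updates "$nodefile" bio13 1,1
rm -f "$nodefile"
```

Piping the output to sh would execute the commands; keeping the print step separate makes it easier to check which nodes will be changed.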
A software license race condition "window of opportunity" opens when Moab checks a license server for sufficient available licenses and closes when the user's software actually checks out the software licenses. The time between these two events can be seconds to many minutes depending on overhead factors such as node OS provisioning, job startup, licensed software startup, and so forth.
During this window, another Moab-scheduled job, or a user or job external to the cluster or cloud, can obtain enough software licenses that too few remain available by the time the job attempts to check out its own. In such cases the job sits and waits for the license, and while it waits it occupies but does not use resources that another job could have used. Use the STARTDELAY parameter to prevent such a situation.
With the STARTDELAY parameter enabled (on a per generic resource basis) Moab blocks any idle jobs requesting the same generic resource from starting until the <window_of_opportunity> passes. The window is defined by the customer on a per generic resource basis.
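The blocking behavior can be pictured as a simple time comparison. The sketch below models the rule as described in this section and is not Moab's implementation. The function name may_start is hypothetical, and treating the window as having passed once the elapsed time equals the delay is an assumption.

```shell
#!/bin/sh
# Hypothetical model: succeed (exit status 0) if a job requesting a generic
# resource may start now.
# $1 = time the previous job using that resource started (epoch seconds)
# $2 = start delay, i.e. the window of opportunity, in seconds
# $3 = current time (epoch seconds)
may_start() {
    [ $(( $3 - $1 )) -ge "$2" ]
}

# A job using the license started at t=1000 and the window is 120 seconds.
if may_start 1000 120 1060; then echo start; else echo blocked; fi   # prints "blocked"
if may_start 1000 120 1120; then echo start; else echo blocked; fi   # prints "start"
```

A longer window reduces the chance of two jobs competing for the same licenses but also delays queued jobs that request that resource, so the value is best chosen from observed provisioning and application startup times.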
© 2001-2010 Adaptive Computing Enterprises, Inc.