Managing Networks
Moab Workload Manager®

13.9 Managing Networks

13.9.1 Network Management Overview

Network resources can be tightly integrated with the rest of a compute cluster using the Moab multi-resource manager management interface. This interface has the following capabilities:

  • Dynamic per job and per partition VLAN creation and management
  • Monitoring and reporting of network health and failure events
  • Monitoring and reporting of network load
  • Creation of subnets with guaranteed performance criteria
  • Automated workload-aware configuration and router maintenance
  • Intelligent network-aware scheduling algorithms

13.9.2 Dynamic VLAN Creation

Most sites using dynamic VLANs operate under the following assumptions:

  • Each compute node has access to two or more networks, one of which is the compute network and another of which is the administrator network.
  • Each compute node may only access other compute nodes via the compute network.
  • Each compute node may only communicate with the head node via the administrator network.
  • Logins on the head node may not be requested from a compute node.

In this environment, organizations may choose to have VLANs automatically configured that encapsulate individual jobs or VPC requests. These VLANs essentially disconnect the job from both incoming and outgoing communication with other compute nodes.

13.9.2.1 Configuring VLANs

Automated VLAN management can be enabled by setting up a network resource manager that supports dynamic VLAN configuration and a QoS that requests this feature. The following example configuration illustrates this setup:

moab.cfg
...
RMCFG[cisco] TYPE=NATIVE RESOURCETYPE=NETWORK FLAGS=VLAN
RMCFG[cisco] CLUSTERQUERYURL=exec://$TOOLSDIR/node.query.cisco.pl
RMCFG[cisco] SYSTEMMODIFYURL=exec://$TOOLSDIR/system.modify.cisco.pl

QOSCFG[secure] SECURITY=VLAN

13.9.2.2 Requesting a VLAN

VLANs can be requested on a per-job basis, either directly using the associated resource manager extension or indirectly by requesting a QoS with a VLAN security requirement.

job submission requesting VLAN
> qsub -l nodes=256,walltime=24:00:00,qos=secure biojob.cmd

143325.umc.com submitted

13.9.3 Network Health Monitoring

Network-level health monitoring is enabled by supporting the cluster query action in the network resource manager and specifying the appropriate CLUSTERQUERYURL attribute in the associated resource manager interface. Node (virtual node) query commands (mnodectl, checknode) can be used to view this health information, which is also correlated with associated workload and written to persistent accounting records. Network health event information can also be fed into generic events and used to drive appropriate event-based triggers.

At present, health attributes such as fan speed, temperature, port failures, and various core switch failures can be monitored and reported. Additional failure events are monitored and reported as support is added within the network management system.

13.9.4 Network Load Monitoring

Network-level load monitoring is enabled by supporting the cluster query action in the network resource manager and specifying the appropriate CLUSTERQUERYURL attribute in the associated resource manager interface. Node (virtual node) query commands (mnodectl, checknode) can be used to view this load information, which is also correlated with associated workload and written to persistent accounting records. Load information can also be fed into generic metrics and used to drive appropriate load-based triggers.

13.9.5 Providing Per-QoS and Per-Job Bandwidth and Latency Guarantees

Intra-job bandwidth and latency guarantees can be requested on a per-job and/or per-QoS basis using the BANDWIDTH and LATENCY resource manager extensions (for jobs) and the MINBANDWIDTH and MAXLATENCY QoS attributes (for QoS limits). If these are specified, Moab does not allow a job to start unless the criteria can be satisfied via proper resource allocation or dynamic network partitions. As needed, Moab makes future resource reservations so that it can guarantee the required allocations.

Example

requesting minimum bandwidth with TORQUE
> qsub -l nodes=24,walltime=8:00:00,bandwidth=1000 hex3chem.cmd

job 44362.qjc submitted

NOTE: If dynamic network partitions are enabled, a NODEMODIFYURL attribute must be properly configured to drive the network resource manager. (See Native Resource Manager Overview for details.)

13.9.6 Enabling Workload-Aware Network Maintenance

Network-aware maintenance is enabled by supporting the modify action in the network resource manager and specifying the appropriate NODEMODIFYURL attribute in the associated resource manager interface. Administrator resource management commands (mnodectl and mrmctl) are then routed directly through the resource manager to the network management system. In addition, reservation triggers and real-time generic event and generic metric triggers can be configured to intelligently drive these facilities for maintenance and auto-recovery purposes.

Maintenance actions can include powering the switch on and off as well as rebooting/recycling all or part of the network. Additional operations are enabled as supported by the underlying networks.

13.9.7 Enabling Network-Aware Scheduling Decisions

Moab has the ability to support network-aware resource allocation algorithms either via its resource allocation plug-in interface or by way of direct interaction with an intelligent network management system.

13.9.7.1 Plug-in Based Network Aware Allocation Algorithms

If a plug-in interface is used, the algorithm will be responsible for allocating resources in such a way as to do the following:

  • satisfy all per job bandwidth and network latency requirements (if specified)
  • deliver maximum bandwidth and minimum latency to the current job
  • allocate resources to maintain the maximum allocation flexibility for subsequent allocation requests
  • maximize allocation affinity hints

As input, each call to the allocation algorithm will include the following:

  • job - high-level job structure including job credentials, requested quality of service (QoS), duration, and other attributes
  • job req (job taskgroup) - specific resource requirement that must be satisfied including required task count, node count, and task packing information, user and system specified bandwidth and latency requirements, and other task constraints
  • feasible node list - list of nodes (with associated available taskcounts) that are available for allocation and can satisfy all other aspects of job and task requirements
  • task affinity map - list of all nodes with associated task affinities taking into account ownership, reservations, resource preferences, and other factors (affinities include positive, negative, required, and neutral)
  • start time - time at which allocation request must be satisfied

The algorithm returns SUCCESS if a satisfactory allocation is made; otherwise, FAILURE is returned.

Upon successful completion, the algorithm returns a list of nodes and associated taskcounts that can be allocated to the specified job req (taskgroup). Upon failure, the algorithm returns a failure status code and a human readable message indicating the reason for the failure.
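
The inputs and outputs described above can be pictured with a minimal sketch. It is not Moab's plug-in API: the names `JobReq`, `allocate`, `AFFINITY_RANK`, and the per-node bandwidth map are hypothetical, and the greedy strategy (filter by bandwidth, then prefer higher affinity and higher bandwidth) is only one simple way to meet the stated goals.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

SUCCESS, FAILURE = "SUCCESS", "FAILURE"

# Hypothetical ranking of the affinity classes named in the text
# (higher values are preferred when choosing nodes).
AFFINITY_RANK = {"required": 3, "positive": 2, "neutral": 1, "negative": 0}


@dataclass
class JobReq:
    """Hypothetical stand-in for the job req (taskgroup) input."""
    task_count: int
    min_bandwidth: Optional[float] = None  # per-job bandwidth requirement, if any


def allocate(
    req: JobReq,
    feasible: Dict[str, int],          # node name -> available taskcount
    affinity: Dict[str, str],          # node name -> affinity class
    node_bandwidth: Dict[str, float],  # node name -> deliverable bandwidth (assumed input)
) -> Tuple[str, Dict[str, int], str]:
    """Return (status, {node: taskcount}, human-readable message)."""
    # Keep only nodes that can satisfy the bandwidth requirement, if one was given.
    candidates = [
        (node, avail)
        for node, avail in feasible.items()
        if req.min_bandwidth is None
        or node_bandwidth.get(node, 0.0) >= req.min_bandwidth
    ]
    # Prefer higher affinity, then higher bandwidth; the node name breaks ties.
    candidates.sort(
        key=lambda item: (
            -AFFINITY_RANK[affinity.get(item[0], "neutral")],
            -node_bandwidth.get(item[0], 0.0),
            item[0],
        )
    )
    allocation: Dict[str, int] = {}
    remaining = req.task_count
    for node, avail in candidates:
        if remaining == 0:
            break
        take = min(avail, remaining)
        if take > 0:
            allocation[node] = take
            remaining -= take
    if remaining > 0:
        placed = req.task_count - remaining
        return FAILURE, {}, f"only {placed} of {req.task_count} tasks could be placed"
    return SUCCESS, allocation, ""
```

A production algorithm would also weigh latency and the flexibility left for later requests, which this sketch leaves out.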

NOTE: This algorithm is called once per job as jobs are started as well as once per job as future job reservations are made. Depending on workload and policies, this may result in this algorithm being called hundreds or thousands of times per scheduling iteration. Depending on cluster size, appropriate scaling considerations should be taken into account to allow appropriate responsiveness.

NOTE: The job level QoS credential indicates if the job is authorized to create dynamic network partitions with bandwidth and/or latency guarantees. If authorized by the QoS and supported by the network management system, this algorithm can contact the network resource manager directly and make appropriate calls. These calls should only be made for immediate allocations and not for future reservations as specified via the start time parameter.

13.9.7.2 Use of Intelligent Network Scheduling APIs

If the network management system (NMS) supports an allocation query API, Moab can be configured to use this to enhance its existing allocation policies. Depending on the underlying capabilities of the NMS, the following queries can be used:

  • return best list of nodes to allocate and optionally report expected bandwidth/latency
  • for an exact allocation list, return the expected bandwidth and latency
  • for an exact list, return Boolean indicating if allocation is possible

In each case, Moab will pass to the NMS service a list of nodes that can be considered for allocation together with the number of tasks required.

NOTE: Both networks and certain exotic architectures can impose various allocation constraints. In such cases, the feasible allocation query should return an allocation consistent with both the network and the underlying hardware architecture.
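
To make the three query styles concrete, the following sketch models them against a toy, in-memory stand-in for an NMS. The class `ToyNMS`, its method names, and the two-level bandwidth model (one figure within a switch, a lower one across switches) are all assumptions made for illustration; a real NMS defines its own API and topology model, and latency is omitted here.

```python
from typing import Dict, List, Optional


class ToyNMS:
    """Hypothetical stand-in for a network management system."""

    def __init__(self, switch_of: Dict[str, str], intra_bw: float, inter_bw: float):
        self.switch_of = switch_of  # node name -> switch name
        self.intra_bw = intra_bw    # bandwidth when all nodes share one switch
        self.inter_bw = inter_bw    # bandwidth when the list spans switches

    def expected_bandwidth(self, nodes: List[str]) -> float:
        """Exact-list query: report the expected bandwidth for these nodes."""
        switches = {self.switch_of[n] for n in nodes}
        return self.intra_bw if len(switches) <= 1 else self.inter_bw

    def is_feasible(self, nodes: List[str], min_bandwidth: float) -> bool:
        """Exact-list query: Boolean answer only."""
        return self.expected_bandwidth(nodes) >= min_bandwidth

    def best_nodes(self, candidates: List[str], tasks: int) -> Optional[List[str]]:
        """Best-list query: pick nodes (one task per node) from the candidates."""
        by_switch: Dict[str, List[str]] = {}
        for node in candidates:
            by_switch.setdefault(self.switch_of[node], []).append(node)
        # Prefer an allocation that fits entirely within one switch.
        for switch in sorted(by_switch):
            if len(by_switch[switch]) >= tasks:
                return sorted(by_switch[switch])[:tasks]
        # Otherwise fall back to spanning switches, if enough nodes exist.
        if len(candidates) >= tasks:
            return sorted(candidates)[:tasks]
        return None
```

In each method the caller supplies the candidate node list and task count, mirroring what the text says Moab passes to the NMS service.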

13.9.8 Creating a Resource Management Interface for a New Network

Many popular networks are supported using interfaces provided in the Moab tools directory. If a required network interface is not available, a new one can be created using the following guidelines:

General Requirements

In all cases, a network resource manager should respond to a cluster query request by reporting a single node with a node name that does not conflict with any existing compute node. At a minimum, this node should report the state attribute.

Monitoring Load

Network load is reported to Moab using the generic resource bandwidth. For greatest value, both configured and used bandwidth (in megabytes per second) should be reported as in the following example:

network.dat
force10 state=idle ares=bandwidth:5466 cres=bandwidth:10000
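
A cluster query script only has to print a line in this form. The sketch below builds that line; the function name `format_network_node` and its parameters are hypothetical, and the caller decides which measurements go into the `ares` and `cres` fields. How the figures are collected from the switch is outside the scope of the sketch.

```python
def format_network_node(
    name: str, state: str, ares_bandwidth: int, cres_bandwidth: int
) -> str:
    """Build one report line in the same form as the network.dat example."""
    if ares_bandwidth < 0 or cres_bandwidth < 0:
        raise ValueError("bandwidth values must be non-negative")
    return (
        f"{name} state={state} "
        f"ares=bandwidth:{ares_bandwidth} cres=bandwidth:{cres_bandwidth}"
    )


if __name__ == "__main__":
    # Reproduces the example line shown above.
    print(format_network_node("force10", "idle", 5466, 10000))
```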

Monitoring Failures

Network warning and failure events can be reported to Moab using the gevent metric. If automated responses are enabled, embedded epochtime information should be included.

network.dat
force10 state=idle gevent[checksum]='ECC failure detected on port 13'

Controlling Router State

Router power state can be controlled once a system modify interface is created that supports the commands on, off, and reset.
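
As a rough sketch of such an interface, the script below dispatches the three commands to handler functions. The argument order (`<target> <command>`), the handler names, and the returned messages are assumptions for illustration only; the arguments Moab actually passes to a system modify script should be taken from the native resource manager documentation, and real handlers would call the switch vendor's management tools.

```python
import sys
from typing import Callable, Dict, List


def power_on(target: str) -> str:
    # Placeholder: a real handler would invoke the vendor's management tool.
    return f"{target}: power on requested"


def power_off(target: str) -> str:
    return f"{target}: power off requested"


def reset(target: str) -> str:
    return f"{target}: reset requested"


# The three commands named in the text, mapped to hypothetical handlers.
COMMANDS: Dict[str, Callable[[str], str]] = {
    "on": power_on,
    "off": power_off,
    "reset": reset,
}


def handle(argv: List[str]) -> int:
    """Dispatch '<target> <command>' (assumed argument order); return an exit code."""
    if len(argv) != 2 or argv[1] not in COMMANDS:
        print("usage: <target> on|off|reset", file=sys.stderr)
        return 1
    target, command = argv
    print(COMMANDS[command](target))
    return 0


if __name__ == "__main__":
    sys.exit(handle(sys.argv[1:]))
```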

Creating VLANs

VLAN creation, management, and reporting are more advanced, requiring persistent VLAN ID tracking, global pool creation, and other features. Using the existing routing interface tools as templates is highly advised. VLAN management requires use of both the cluster query interface and the system modify interface.

13.9.9 Per-Job Network Monitoring

It is possible to gather network usage on a per-job basis using the native interface. When the native interface has been configured to report netin and netout, Moab automatically gathers this data throughout the life of a job and reports total usage statistics upon job completion.

Example Native Content
...
node99  netin=78658 netout=1256  
...

This information is visible to users and administrators via command-line utilities, the web portal, and the desktop graphical interfaces.
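
The idea of accumulating per-job totals from native interface reports can be sketched as follows. This is an illustration, not Moab's internal accounting: the function names are hypothetical, and the sketch assumes each reported netin/netout value is a per-sample quantity that can simply be summed over the nodes allocated to a job.

```python
from typing import Dict, Iterable, Set, Tuple


def parse_native_line(line: str) -> Tuple[str, Dict[str, int]]:
    """Split 'node99  netin=78658 netout=1256' into (node, {attr: value})."""
    fields = line.split()
    node = fields[0]
    attrs: Dict[str, int] = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key in ("netin", "netout"):
            attrs[key] = int(value)
    return node, attrs


def total_usage(lines: Iterable[str], job_nodes: Set[str]) -> Dict[str, int]:
    """Sum netin/netout over all samples reported for the job's nodes."""
    totals = {"netin": 0, "netout": 0}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped == "...":
            continue  # skip blank lines and elided content
        node, attrs = parse_native_line(stripped)
        if node in job_nodes:
            for key, value in attrs.items():
                totals[key] += value
    return totals
```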
