8.1 Monitoring Resources
8.1.1 Resource Overview
A primary task of any resource manager is to monitor the state, health, configuration, and utilization of managed resources. TORQUE is specifically designed to monitor compute hosts for use in a batch environment. TORQUE is not designed to monitor non-compute host resources such as software licenses, networks, file systems, and so forth, although these resources can be integrated into the cluster using some scheduling systems.
With regard to monitoring compute nodes, TORQUE reports a number of attributes, grouped into three major categories: configuration, utilization, and state.
Configuration includes both detected hardware configuration and specified batch attributes.
||operating system of the node
||The value reported is a derivative of the operating system installed.
|Node Features (properties)
||arbitrary string attributes associated with the node
||No node features are specified by default. If required, they are set using the nodes file located in the $TORQUEHOME/server_priv directory. They may specify any string and are most commonly used to allow users to request certain subsets of nodes when submitting jobs.
|Local Disk (size)
||configured local disk
||By default, local disk space is not monitored. If the size parameter is set in the MOM configuration, TORQUE reports, in kilobytes, the configured disk space within the specified directory.
||Local memory/RAM is monitored and reported in kilobytes.
||The number of processors detected by TORQUE is reported via the ncpus attribute. However, for scheduling purposes, other factors are taken into account. In its default configuration, TORQUE operates in dedicated mode with each node possessing a single virtual processor. In dedicated mode, each job task will consume one virtual processor and TORQUE will accept workload on each node until all virtual processors on that node are in use. While the number of virtual processors per node defaults to 1, this may be configured using the nodes file located in the $TORQUEHOME/server_priv directory. An alternative to dedicated mode is timeshared mode. If TORQUE's timeshared mode is enabled, TORQUE will accept additional workload on each node until the node's maxload limit is reached.
||Virtual memory/Swap is monitored and reported in kilobytes.
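The difference between dedicated and timeshared mode described above can be illustrated with a toy model. This is a minimal sketch, not TORQUE code: the Node class, its field names, and its methods are hypothetical. It assumes only the rules stated in this section: in dedicated mode each task consumes one virtual processor, and in timeshared mode work is accepted until the load reaches the node's maxload limit.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy model of a compute node (all names here are hypothetical)."""
    virtual_procs: int = 1      # one virtual processor per node by default
    tasks_running: int = 0      # tasks currently placed on the node
    timeshared: bool = False    # False means dedicated mode
    maxload: float = 1.0        # only consulted in timeshared mode
    loadave: float = 0.0        # current load average

    def can_accept(self) -> bool:
        """Return True if the node would take one more task under this model."""
        if self.timeshared:
            # Timeshared: keep accepting work until the load limit is reached.
            return self.loadave < self.maxload
        # Dedicated: each task consumes exactly one virtual processor.
        return self.tasks_running < self.virtual_procs

    def place_task(self) -> bool:
        """Place one task if allowed and report whether it was placed."""
        if not self.can_accept():
            return False
        self.tasks_running += 1
        return True


# Dedicated node configured with two virtual processors: the third task is refused.
dedicated = Node(virtual_procs=2)
placed = [dedicated.place_task() for _ in range(3)]
print(placed)  # [True, True, False]

# Timeshared node below its load limit still accepts work.
shared = Node(timeshared=True, maxload=4.0, loadave=3.5)
print(shared.can_accept())  # True
```

In this model, raising the number of virtual processors is the only way a dedicated node takes more tasks, while a timeshared node ignores the task count and looks only at load.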
Utilization includes information about the amount of node resources currently in use, as well as information about who or what is consuming them.
||local disk availability
||By default, local disk space is not monitored. If the size parameter is set in the MOM configuration, TORQUE reports, in kilobytes, both the configured and the currently available disk space within the specified directory.
||Available real memory/RAM is monitored and reported in kilobytes.
||local network adapter usage
||Reports total number of bytes transferred in or out by the network adapter.
|Processor Utilization (loadave)
||node's CPU load average
||Reports the node's 1-minute BSD load average.
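Monitoring scripts often need these attributes in a structured form. The sketch below assumes the values arrive as a comma-separated string of key=value pairs with a kb suffix on kilobyte quantities. That input format, the sample string, the key name availmem, and the function parse_status are illustrative assumptions; ncpus, loadave, and idletime are the attribute names used in this section.

```python
def parse_status(status: str) -> dict:
    """Split 'key=value,key=value' text into a dict with numeric values converted."""
    parsed = {}
    for field in status.split(","):
        key, sep, value = field.partition("=")
        if not sep:
            continue  # skip fragments that contain no '='
        key, value = key.strip(), value.strip()
        if value.endswith("kb") and value[:-2].isdigit():
            parsed[key] = int(value[:-2])  # keep the quantity in kilobytes
            continue
        try:
            parsed[key] = float(value) if "." in value else int(value)
        except ValueError:
            parsed[key] = value  # leave non-numeric values as text
    return parsed


# Hypothetical sample; real output may carry more fields or different names.
sample = "ncpus=4,loadave=0.25,idletime=1200,availmem=2048kb"
info = parse_status(sample)

# Load per detected processor, a common way to compare nodes of different sizes.
load_per_cpu = info["loadave"] / info["ncpus"]
print(info)
print(load_per_cpu)
```

Dividing the load average by the processor count is a convention chosen for this example, not something TORQUE reports itself.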
State information includes administrative status, general node health information, and general usage status.
|Idle Time (idletime)
||time since local keyboard/mouse activity was last detected
||Time in seconds since local keyboard/mouse activity was last detected.
||monitored/admin node state
A node can be in one or more of the following states:
- busy - node is full and will not accept additional work
- down - node is failing to report, is detecting local failures with node configuration or resources, or is marked down by an administrator
- free - node is ready to accept additional work
- job-exclusive - all available virtual processors are assigned to jobs
- job-sharing - node has been allocated to run multiple shared jobs and will remain in this state until jobs are complete
- offline - node has been instructed by an admin to no longer accept work
- reserve - node has been reserved by the server
- time-shared - node always allows multiple jobs to run concurrently
- unknown - node has not been detected
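Because a node can be in more than one state at once, a script that decides whether a node is usable has to consider the whole set. The following sketch assumes the states arrive as a comma-separated string such as "down,offline". The function names and the grouping of states into "blocking" ones are assumptions made for illustration, based on the descriptions in the list above.

```python
# States whose descriptions above rule out new work (this grouping is an assumption).
BLOCKING_STATES = {"busy", "down", "job-exclusive", "offline", "unknown"}
KNOWN_STATES = BLOCKING_STATES | {"free", "job-sharing", "reserve", "time-shared"}


def parse_states(raw: str) -> set:
    """Turn 'down,offline' into {'down', 'offline'}, rejecting unlisted names."""
    states = {s.strip() for s in raw.split(",") if s.strip()}
    unexpected = states - KNOWN_STATES
    if unexpected:
        raise ValueError(f"unrecognized node state(s): {sorted(unexpected)}")
    return states


def accepts_work(raw: str) -> bool:
    """Return True if the node has at least one state and none of them blocks work."""
    states = parse_states(raw)
    return bool(states) and not (states & BLOCKING_STATES)


print(accepts_work("free"))          # True
print(accepts_work("down,offline"))  # False
```

Treating the check as "no blocking state present" rather than "state equals free" keeps the logic correct when several states are reported together.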