12.8 Enabling Generic Events
Generic events are used to identify failures and other occurrences that Moab or other systems must be made aware of. This information may result in automated resource recovery, notifications, adjustments to statistics, or changes in policy. Generic events can also carry an arbitrary human-readable message that may be attached to associated objects or passed to administrators or external systems. Generic events typically signify the occurrence of a specific event, as opposed to generic metrics, which indicate a change in a measured value. Using generic events, Moab can be configured to automatically address many failures and environmental changes, improving overall system performance. Some sample events that sites may be interested in monitoring, recording, and taking action on include:
12.8.1 Configuring Generic Events
Generic events are defined in the moab.cfg file and have several configuration options. The only required option is ACTION. The full list of configurable options for generic events is given in the following table:
12.8.1.1 Action Types
The impact of the event is controlled using the ACTION attribute of the GEVENTCFG parameter. The ACTION attribute is comma-delimited and may include any combination of the actions in the following table:
12.8.1.2 Named Events
In general, generic events are named, with the exception of those based on generic metrics. Names serve primarily to differentiate events and have no intrinsic meaning to Moab. Administrators should choose names that denote specific meanings within the organization.

Example

# Note: cpu failures require admin attention, create maintenance reservation
GEVENTCFG[cpufail] action=notify,record,disable,reserve rearm=01:00:00

# Note: power failures are transient, minimize future use
GEVENTCFG[powerfail] action=notify,record rearm=00:05:00

# Note: fs full can be automatically fixed
GEVENTCFG[fsfull] action=notify,execute:/home/jason/MyPython/cleartmp.py?$OID?nodefix

# Note: memory errors can cause invalid job results, clear node immediately
GEVENTCFG[badmem] action=notify,record,preempt,disable,reserve

12.8.1.3 Generic Metric (GMetric) Events
GMetric events are generic events based on generic metrics. They execute an action when a generic metric crosses a defined threshold. Unlike named events, GMetric events are not named and use the following format:

GEVENTCFG[GMETRIC<COMPARISON>VALUE] ACTION=...

Example

GEVENTCFG[cputemp>150] action=off

This form of generic event uses the GMetric name as returned by a GMETRIC attribute in a native Resource Manager interface.
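To illustrate how a threshold key such as cputemp>150 breaks down, the following Python sketch splits a GMetric event key into its metric name, comparison, and value, then tests a reading against it. It is not Moab code and does not reflect how Moab parses its configuration. The function names are hypothetical, and the operator set is an assumption made for illustration; only > appears in the example above.

```python
import operator
import re

# ASSUMPTION: this operator set is illustrative only. The example in this
# section confirms ">" and nothing else.
COMPARISONS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

# Two-character operators come first in the alternation so that ">=" is not
# read as ">" followed by "=150".
_KEY_RE = re.compile(
    r"^(?P<metric>[A-Za-z_][\w.]*)"
    r"(?P<op>>=|<=|==|>|<)"
    r"(?P<value>-?\d+(?:\.\d+)?)$"
)


def parse_gmetric_key(key):
    """Split a key such as 'cputemp>150' into (metric, op, threshold)."""
    match = _KEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a GMetric event key: {key!r}")
    return match["metric"], match["op"], float(match["value"])


def threshold_crossed(key, reading):
    """Return True when `reading` satisfies the comparison encoded in `key`."""
    _, op, threshold = parse_gmetric_key(key)
    return COMPARISONS[op](reading, threshold)
```

With this sketch, threshold_crossed("cputemp>150", 151) is true and threshold_crossed("cputemp>150", 150) is false, while a named event key such as cpufail raises ValueError because it has no comparison.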
Valid comparative operators are shown in the following table:
12.8.2 Reporting Generic Events
Unlike generic metrics, generic events can optionally be configured at the global level to adjust rearm policies and other behaviors. In all cases, this is accomplished using the GEVENTCFG parameter. To report an event associated with a job or node, use the native Resource Manager interface or the mjobctl or mnodectl commands. If using the native Resource Manager interface, use the GEVENT attribute as in the following example:
node001 GEVENT[hitemp]='temperature exceeds 150 degrees'
node017 GEVENT[fullfs]='/var/tmp is full'
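A site-specific monitoring script might generate such lines itself. The following minimal Python sketch formats one line per event in the same shape as the example above. The function name format_gevent_line is hypothetical, and rejecting embedded single quotes is a conservative assumption, since the quoting rules of the native interface are not covered in this section.

```python
def format_gevent_line(node, name, message):
    """Build one node line carrying a single GEVENT attribute.

    ASSUMPTION: messages containing a single quote are rejected rather than
    escaped, because this section does not describe an escaping rule.
    """
    if "'" in message:
        raise ValueError("message must not contain a single quote")
    if not node or not name:
        raise ValueError("node and event name are required")
    return f"{node} GEVENT[{name}]='{message}'"


if __name__ == "__main__":
    # Hypothetical readings gathered by a site monitor.
    reports = [
        ("node001", "hitemp", "temperature exceeds 150 degrees"),
        ("node017", "fullfs", "/var/tmp is full"),
    ]
    for node, name, message in reports:
        print(format_gevent_line(node, name, message))
```

Run as a script, this prints the two lines shown in the example above.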
The messages specified after GEVENT are routed to Moab Cluster Manager for graphical display and can be used to dynamically adjust scheduling behavior.

12.8.3 Generic Events Attributes
Each node records the following about reported generic events:
Each event can be individually cleared, annotated, or deleted by cluster administrators using the mnodectl command.
12.8.4 Recording Job Events
Job events occur when a job undergoes a definitive change in state. Job events include submission, starting, cancellation, migration, and completion. This feature is useful for sites that prefer not to run an external accounting system and instead derive their clusters' accounting statistics from these logged events. Moab may be configured to record these events in the appropriate event file found in the Moab stats/ directory. To enable job event recording for both local and remotely staged jobs, use the RECORDEVENTLIST parameter. For example:
RECORDEVENTLIST JOBCANCEL,JOBCOMPLETE,JOBSTART,JOBSUBMIT
...

This configuration records an event each time a remote or local job is canceled, runs to completion, starts, or is submitted. The Event Logs section details the format of these records.

12.8.5 Manually Creating Generic Events
Generic events may be manually created on a physical node or VM.

To add GEVENT "event" with message "hello" to node02, do the following:

> mnodectl -m gevent=event:"hello" node02

To add GEVENT "event" with message "hello" to myvm, do the following:

> mvmctl -m gevent=event:"hello" myvm
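When events are raised from a script rather than typed at a prompt, the two commands above can be assembled programmatically. The following Python sketch builds the argument list for either command. The helper name gevent_command is hypothetical, and the sketch uses only the -m gevent=NAME:MESSAGE form shown above. It builds the command without running it, so it can be inspected on a host where Moab is not installed.

```python
import shlex


def gevent_command(target, name, message, vm=False):
    """Return the argv list that adds generic event `name` to `target`.

    Uses mvmctl when `vm` is true and mnodectl otherwise, mirroring the two
    commands shown in this section. The message is passed as part of a single
    argument, so no shell quoting is needed in the list form.
    """
    if not target or not name:
        raise ValueError("target and event name are required")
    tool = "mvmctl" if vm else "mnodectl"
    return [tool, "-m", f"gevent={name}:{message}", target]


if __name__ == "__main__":
    for argv in (
        gevent_command("node02", "event", "hello"),
        gevent_command("myvm", "event", "hello", vm=True),
    ):
        # shlex.join renders the list as a shell-safe command line for review.
        print(shlex.join(argv))
```

On a host where the Moab client commands are installed, the returned list can be passed to subprocess.run(argv, check=True). Passing a list rather than a shell string avoids quoting problems with longer messages, although how Moab treats messages containing spaces or colons is not covered here.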
© 2001-2010 Adaptive Computing Enterprises, Inc.