14.1 Internal Diagnostics/Diagnosing System Behavior and Problems
Maui provides a number of commands for diagnosing system behavior. These diagnostic commands present detailed state information about various aspects of the scheduling problem, summarize performance, and evaluate current operation, reporting any unexpected or potentially erroneous conditions found. Where possible, Maui's diagnostic commands can also correct detected problems if desired.
At a high level, the diagnostic commands are organized along functional and object-based lines. Diagnostic commands exist to help prioritize workload, evaluate fairness, and determine the effectiveness of scheduling optimizations. Commands are also available to evaluate reservations, reporting state information, potential reservation conflicts, and possible corruption issues. Scheduling is a complicated task. Failures and unexpected conditions can occur as a result of resource failures, job failures, or conflicting policies.
Maui's diagnostics can intelligently organize information to help isolate these failures and allow them to be resolved quickly. Another powerful use of the diagnostic commands is to address the situation in which there are no hard failures. In these cases, the jobs, compute nodes, and scheduler are all functioning properly, but the cluster is not behaving exactly as desired. Maui diagnostics can help a site determine how the current configuration is performing and how it can be changed to obtain the desired behavior.
14.1.1 Diagnose Command
The cornerstone of Maui's diagnostics is a command named, aptly enough, diagnose. This command provides detailed information about scheduler state and also performs a large number of internal sanity checks, presenting any problems it finds as warning messages.
Currently, the diagnose command provides in-depth analysis of the following objects and subsystems
14.1.2 Other Diagnostic Commands
Beyond diagnose, the checkjob and checknode commands also provide detailed information and sanity checking on individual jobs and nodes, respectively. These commands can indicate why a job cannot start, which nodes may be available to it, and which recent events have affected the current job or node state.
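As a minimal sketch of how the two commands might be used together, the helper below runs checkjob on a job and then checknode on a node the job was expected to use. The function name `inspect_job_and_node` and both arguments are hypothetical; only the bare `checkjob <job>` and `checknode <node>` invocations are assumed, with no additional flags.

```shell
#!/bin/sh
# Hypothetical helper: show the scheduler's view of one job and one node.
#   $1 = job ID, $2 = node name (both supplied by the caller)
inspect_job_and_node() {
    # Run both checks even if the first one reports a problem,
    # so the job and node reports can be read side by side.
    checkjob "$1"
    checknode "$2"
}
```

Called as `inspect_job_and_node <jobid> <nodename>`, this prints the job report followed by the node report.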
14.1.3 Using Maui Logs for Troubleshooting
Maui logging is extremely useful in determining the cause of a problem. Where other systems may be cursed for not providing adequate logging to diagnose a problem, Maui may be cursed for the opposite reason. If the logging level is configured too high, huge volumes of log output may be recorded, potentially obscuring the problems in a flood of data. Intelligent searching, combined with the use of the LOGLEVEL and LOGFACILITY parameters, can extract the needed information. Key information associated with various problems is generally marked with the keywords WARNING, ALERT, or ERROR. See the Logging Overview for further information.
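The keyword search described above can be sketched with standard grep. The function names `maui_log_problems` and `maui_log_summary` are hypothetical, and the log file path is passed in by the caller because no default location is assumed here.

```shell
#!/bin/sh
# Print only the log lines that carry one of the problem keywords.
#   $1 = path to a Maui log file (supplied by the caller)
maui_log_problems() {
    grep -E 'WARNING|ALERT|ERROR' "$1"
}

# Print one "KEYWORD count" line per keyword, to show which kind dominates.
maui_log_summary() {
    for kw in WARNING ALERT ERROR; do
        # grep -c prints 0 when nothing matches, so every keyword gets a line.
        printf '%s %s\n' "$kw" "$(grep -c "$kw" "$1")"
    done
}
```

Piping the first function through `tail`, for example `maui_log_problems <logfile> | tail -n 20`, keeps the output manageable when the logging level is high.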
14.1.4 Using a Debugger
If other methods do not resolve the problem, the use of a debugger can provide missing information. While output recorded in the Maui logs can specify which routine is failing, the debugger can actually locate the very source of the problem. Log information can help you pinpoint exactly which section of code needs to be examined and which data is suspicious. Historically, combining log information with debugger flexibility has made locating and correcting Maui bugs a relatively quick and straightforward process.
To use a debugger, you can either attach to a running Maui process or start Maui under the debugger. Starting Maui under a debugger requires that the MAUIDEBUG environment variable be set to the value 'yes' to prevent Maui from daemonizing and backgrounding itself. The following example shows a typical debugging startup using gdb.
----
> export MAUIDEBUG=yes
> cd <MAUIHOMEDIR>/src/moab
> gdb ../../bin/maui
> b MQOSInitialize
> r
>
----
The gdb debugger supports conditional breakpoints, which make debugging much easier. For debuggers that lack this capability, the 'TRAP*' parameters are of value, allowing breakpoints to be set that trigger only when specific routines are processing particular nodes, jobs, or reservations. See the TRAPNODE, TRAPJOB, TRAPRES, and TRAPFUNCTION parameters for more information.
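The other option mentioned in this section, attaching to a running Maui process, might look like the following sketch. It assumes that the scheduler's process name is `maui` (matching the binary name used in the example above) and that `pgrep` is available on the host; the function name `attach_maui` is hypothetical.

```shell
#!/bin/sh
# Attach gdb to an already running scheduler instead of starting it under gdb.
#   $1 = process name to look for (assumed default: maui)
attach_maui() {
    # -x matches the process name exactly; -o picks the oldest match.
    pid=$(pgrep -x -o "${1:-maui}") || {
        echo "no running process named ${1:-maui}" >&2
        return 1
    }
    gdb -p "$pid"
}
```

Because the process is already daemonized in this case, MAUIDEBUG does not need to be set before attaching.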
14.1.5 Controlling behavior after a 'crash'
The MAUICRASHMODE environment variable can be set to control scheduler action in the case of a catastrophic internal failure. Valid values include trap, ignore, and die.
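Since the variable must be present in the environment that starts the scheduler, one way to avoid typos is to validate the value before exporting it. The sketch below is only an illustration; the function name `set_crash_mode` is hypothetical, and the accepted values are limited to the three listed above.

```shell
#!/bin/sh
# Export MAUICRASHMODE only if the value is one of the documented modes.
#   $1 = desired mode: trap, ignore, or die
set_crash_mode() {
    case "$1" in
        trap|ignore|die)
            MAUICRASHMODE=$1
            export MAUICRASHMODE
            ;;
        *)
            # Leave any previously exported value untouched.
            echo "invalid MAUICRASHMODE: $1 (expected trap, ignore, or die)" >&2
            return 1
            ;;
    esac
}
```

Running `set_crash_mode die` in the shell that later launches Maui leaves the variable exported for the scheduler process.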
|© 2001-2010 Adaptive Computing Enterprises, Inc.|