21.2 Enabling High Availability Features
21.2.1 High Availability Overview

High availability allows Moab to run on two different machines: a primary and a secondary server. There are two configuration methods to achieve this behavior. The first uses a networked file system to configure two Moab servers with only one operating at a time. The second does not rely on a networked file system: a master Moab server operates until its machine crashes; after the master has been down for a certain amount of time, a secondary (fallback) server takes control until the master returns to activity.

It is recommended that site administrators use the networked file system configuration, as there is no failover delay after a failure and there is less chance of a synchronization failure.

21.2.2.1 Networked File System High Availability Overview (recommended configuration)

When Moab is configured to run on a networked file system (any networked file system that supports file locking can be used), the first Moab server that starts locks a particular file. The second Moab server waits on that lock and begins scheduling only when it gains control of the lock. This method achieves near-instantaneous turnover after a failure and eliminates the need for the two Moab servers to synchronize information periodically, because both access the same database/checkpoint file.

21.2.2.2 Configuring High Availability on a Networked File System

Because the two Moab servers access the same files, configuration is required only in the moab.cfg file. The two hosts that run Moab must be configured with the SERVER and FBSERVER parameters. File locking is turned on using the FLAGS=filelockha parameter. Finally, the lock file is specified with the HALOCKFILE parameter.
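The lock handoff described in 21.2.2.1 can be illustrated with a short Python sketch. It is not Moab's implementation: the LockFileServer class, the demo function, and the lock file name are hypothetical, and the use of flock-style advisory locking is an assumption made for illustration (this section does not say which locking call Moab uses). The sketch runs against a local temporary directory on a Unix-like system rather than a real networked file system.

```python
import fcntl
import os
import tempfile


class LockFileServer:
    """Hypothetical stand-in for one Moab server competing for the HA lock file."""

    def __init__(self, name, lock_path):
        self.name = name
        self.lock_path = lock_path
        self._fd = None  # open descriptor while this server holds the lock

    def try_become_active(self):
        """Try to take the exclusive lock; return True if this server may schedule."""
        if self._fd is not None:
            return True  # already holding the lock
        fd = os.open(self.lock_path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            # Non-blocking attempt; a real fallback server would keep waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            return False
        self._fd = fd
        return True

    def stop(self):
        """Simulate a shutdown or crash: closing the descriptor releases the lock."""
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None


def demo():
    """Run the primary/fallback handoff once and return the three outcomes."""
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "moab_ha.lock")  # assumed file name
        primary = LockFileServer("primary", lock_path)
        fallback = LockFileServer("fallback", lock_path)
        outcomes = [
            primary.try_become_active(),   # primary starts first and locks
            fallback.try_become_active(),  # fallback is refused for now
        ]
        primary.stop()                     # primary goes away
        outcomes.append(fallback.try_become_active())  # fallback takes over
        fallback.stop()
        return outcomes
```

Calling demo() returns [True, False, True]: the primary acquires the lock, the fallback is refused while the primary holds it, and the fallback acquires it as soon as the primary stops. Because the lock is released the moment the holder's descriptor closes, no polling interval is involved, which is the property the networked file system method relies on.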
The following example illustrates a possible configuration:

21.2.2.3 Confirming High Availability on a Networked File System

Administrators can run the mdiag -S -v command to view which Moab server is currently scheduling and responding to client requests.

21.2.3.1 Master Slave High Availability Overview

High availability allows Moab to run on two different machines: a primary and a secondary server. While both are running, the secondary (fallback) server continually updates internal statistics, reservations, and other information to stay synchronized with the primary server. Should the primary server stop running, the secondary server picks up all responsibilities of the primary server and begins scheduling jobs and tracking internal data. When the primary server comes back online, the secondary server hands over its data and resumes functioning as the secondary server.

NOTE: By default, the fallback server pings the primary server every 30 seconds. If two successive ping attempts fail, the fallback server takes over scheduling duties. The HAPOLLINTERVAL parameter can be tuned to adjust how quickly the fallback server responds to failures.

21.2.3.2 Configuring Master Slave High Availability

NOTE: When Moab is compiled separately on the primary and fallback servers, ensure that the MBUILD_SKEY defined in include/moab-local.h is the same for both builds.

For high availability to function correctly, both servers must have a properly configured moab.cfg file (this can be the same NFS-mounted file for both servers) with the following lines:

Both SERVER and FBSERVER are of the format <HOST>[:<PORT>]. It is also necessary to ensure a few configuration settings for correct operation:
By default, the secondary server waits for two iterations before deciding to take over as the primary server. During this time (~30 seconds by default), client commands are unresponsive because neither the primary nor the secondary server is servicing requests. Proper high availability configuration and the health status of the primary and fallback servers can be checked using the mdiag -S command.

Example

21.2.3.3 Confirming Master Slave High Availability Configuration

The following explains how to verify that the high availability configuration is active and working as expected:

To confirm that the fallback Moab server can communicate with the primary Moab server correctly, issue mdiag -R -v on the fallback system. The output should indicate that the State field for the resource manager shows an Active connection.

To confirm that the fallback Moab server is correctly communicating with the primary resource manager, use the mdiag -n command, which results in output similar to the following:

If you do not get similar output, check the following:
21.2.4 Other High Availability Configuration

Moab has many features to improve the availability of a cluster beyond the ability to automatically relocate to another execution server. The following table describes some of these features.
© 2001-2008 Cluster Resources, Incorporated