Enabling High Availability Features

21.2 Enabling High Availability Features

21.2.1 High Availability Overview

High availability allows Moab to run on two different machines: a primary server and a secondary server. The configuration method uses a networked file system to set up two Moab servers, only one of which operates at a time.

When Moab is configured to run on a networked file system (any networked file system that supports file locking will work), the first Moab server that starts locks a particular file. The second Moab server waits on that lock and begins scheduling only when it gains control of the lock. This method achieves near-instantaneous failover when the active server fails, and it eliminates the need for the two Moab servers to synchronize information periodically, because both servers access the same database/checkpoint file.
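The lock handoff described above can be illustrated with a minimal Python sketch. This is not Moab's implementation. It only shows the general pattern of an exclusive advisory file lock that a standby acquires once the holder goes away. The function names try_acquire and demo are hypothetical, and the sketch uses flock on a local temporary file; lock behavior on a real networked file system depends on how that file system is mounted and configured.

```python
import fcntl
import os
import tempfile


def try_acquire(lock_path):
    """Try to take an exclusive advisory lock without blocking.

    Returns the open file object on success (it must stay open to keep
    the lock), or None if another holder already has the lock.
    """
    handle = open(lock_path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: behave like the waiting standby.
        handle.close()
        return None
    return handle


def demo():
    """Simulate a primary, a waiting secondary, and a takeover."""
    lock_path = os.path.join(tempfile.mkdtemp(), ".demo_lock")

    primary = try_acquire(lock_path)    # first server to start gets the lock
    secondary = try_acquire(lock_path)  # second server cannot get it yet

    primary.close()                     # primary "fails": its lock is released
    takeover = try_acquire(lock_path)   # secondary now gains the lock

    result = (primary is not None, secondary is None, takeover is not None)
    takeover.close()
    return result


if __name__ == "__main__":
    print(demo())
```

A real standby would block or poll on the lock instead of trying once. The single non-blocking attempt keeps the handoff easy to follow.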

21.2.2.1 Configuring High Availability on a Networked File System

Because the two Moab servers access the same files, configuration is required only in the moab.cfg file. The two hosts that run Moab must be configured with the SERVER and FBSERVER parameters. File locking is turned on with the FLAGS=filelockha parameter. Finally, the lock file is specified with the HALOCKFILE parameter. The following example illustrates a possible configuration:

SCHEDCFG[Moab]	SERVER=host1:42559
SCHEDCFG[Moab]	FBSERVER=host2
SCHEDCFG[Moab]	FLAGS=filelockha
SCHEDCFG[Moab]	HALOCKFILE=/opt/moab/.moab_lock

Note: FBSERVER does not take a port number. If you specify a port for FBSERVER, it is ignored.
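As a quick sanity check before starting either server, the four settings above can be verified with a small script. The following Python sketch is a hypothetical helper and not part of Moab. The names parse_schedcfg and check_ha are invented for illustration, and the parsing rules are assumptions: whitespace-separated KEY=VALUE tokens on SCHEDCFG lines, # starting a comment, and a case-insensitive match on filelockha.

```python
REQUIRED = ("SERVER", "FBSERVER", "FLAGS", "HALOCKFILE")


def parse_schedcfg(text):
    """Collect KEY=VALUE attributes from every SCHEDCFG[...] line."""
    attrs = {}
    for line in text.splitlines():
        # Assumption: '#' starts a comment in the configuration file.
        line = line.split("#", 1)[0].strip()
        if not line.startswith("SCHEDCFG["):
            continue
        # Tokens after the SCHEDCFG[...] label are KEY=VALUE attributes.
        for token in line.split()[1:]:
            key, sep, value = token.partition("=")
            if sep:
                attrs[key.upper()] = value
    return attrs


def check_ha(text):
    """Return a list of problems; an empty list means the checks passed."""
    attrs = parse_schedcfg(text)
    problems = [f"missing {key}" for key in REQUIRED if key not in attrs]
    if "FLAGS" in attrs and "filelockha" not in attrs["FLAGS"].lower():
        problems.append("FLAGS does not include filelockha")
    if ":" in attrs.get("FBSERVER", ""):
        problems.append("FBSERVER has a port, which is ignored")
    return problems


if __name__ == "__main__":
    sample = (
        "SCHEDCFG[Moab]\tSERVER=host1:42559\n"
        "SCHEDCFG[Moab]\tFBSERVER=host2\n"
        "SCHEDCFG[Moab]\tFLAGS=filelockha\n"
        "SCHEDCFG[Moab]\tHALOCKFILE=/opt/moab/.moab_lock\n"
    )
    print(check_ha(sample))
```

The check only confirms that the settings are present. It does not confirm that the lock file path is on a shared file system that both hosts can reach.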

21.2.2.2 Confirming High Availability on a Networked File System

Administrators can run the mdiag -S -v command to see which Moab server is currently scheduling and responding to client requests.

21.2.3 Other High Availability Configuration

Moab has many features to improve the availability of a cluster beyond the ability to automatically relocate to another execution server. The following table describes some of these features.

Feature	Description
JOBACTIONONNODEFAILURE	If a node allocated to an active job fails, the job can continue running indefinitely even though the output it produces is of no value. Setting this parameter allows the scheduler to automatically preempt such jobs when a node failure is detected, possibly allowing the job to run elsewhere and freeing the other allocated nodes for use by other jobs.
RECOVERYACTION	If a catastrophic failure event occurs (a SIGSEGV or SIGILL signal is triggered), Moab can be configured to automatically restart, trap the failure, ignore the failure, or behave in the default manner for the specified signal. These actions are specified using the values RESTART, TRAP, IGNORE, or DIE, respectively, as in the following example:

SCHEDCFG[bas] MODE=NORMAL RECOVERYACTION=RESTART

