Case Study 24: Cluster Environment Event Handling

A.24  Case Study: Cluster Environment Event Handling

Overview

An organization requires intelligent management of cluster behavior in the event of various periodic failures including the following:

  • Machine Room Chiller Fails
    • notify admins, preempt all jobs, power off nodes
  • External Power Failure, UPS Triggers
    • notify admins, preempt low priority jobs, power off unused nodes
  • Compute Node Temperature Exceeds Desired Threshold
    • notify admins, modify scheduling policies to minimize node usage
  • Storage Manager Reports Warnings
    • notify admins, block jobs requiring storage manager resources until warnings cease
  • Compute Node Local Disk Fills Up
    • launch script to purge unneeded files on compute node
  • Effective Node Throughput Drops Below Desired Threshold
    • notify admins, launch script to investigate, correct, and recycle node
  • Major Network Failure
    • notify admins, dynamically establish a peer relationship with alternate company resources or connect to a remote on-demand center

Solution

Moab's event management features (generic events, generic metrics, and triggers) allow an organization to respond to each of these events in a way that maximizes cluster availability and protects the most important workload.

The configuration below binds a generic event to each of the failures above and specifies the actions Moab takes when that event is reported.

SCHEDCFG[master] SERVER=main.ifl.com MODE=NORMAL

# interface to monitor/manage services
RMCFG[direct]   TYPE=Loadleveler

# load information regarding UPS, Chiller and Storage Manager
RMCFG[local] TYPE=native  

# enable connection to utility computing resources
# only enable if local failures occur

RMCFG[uci] TYPE=moab://utilitycomputinginc.com:22000 STATE=disabled

# cooling has failed, power down cluster immediately
GEVENTCFG[coolfail] action=notify,record,preempt,execute:/tools/powerdown

# external power failure detected.  powerdown nodes associated with 
# low priority jobs
GEVENTCFG[powerfail] action=notify,record,execute:/tools/powerdown-lpo

# UPS is almost empty, shutdown cluster
GEVENTCFG[powerfail2] action=notify,record,execute:/tools/powerdown

# minimize use of 'hot' nodes
GEVENTCFG[hitemp] action=notify,record,avoid

# temporarily block jobs which require failing storage resources 
# while warnings are reported
GEVENTCFG[storagefail] action=notify,record,reserve

# purge full filesystems
GEVENTCFG[fsfull] action=record,execute:/tools/purgefs.pl

# investigate/recover nodes with low throughput
GEVENTCFG[slownode] action=record,execute:/tools/recovernode.pl

# local cluster is unavailable, activate remote resources
GEVENTCFG[netfailure] action=notify,record,enable:rm:uci
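
The GEVENTCFG entries above only define how Moab reacts; something still has to report the events. The following is a minimal sketch of a monitoring helper that a native resource manager script might call. It assumes that events are reported as one line per node in the form `<node> GEVENT[<name>]='<message>'`, which should be checked against the native resource manager documentation for the installed Moab version. The threshold values and all function names are hypothetical; only the event names `hitemp` and `fsfull` come from the configuration above.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds; the case study does not give concrete values.
TEMP_LIMIT_C = 85.0
DISK_FULL_FRACTION = 0.95


def node_events(temp_c: float, disk_used_fraction: float) -> Dict[str, str]:
    """Map raw readings for one node to generic event names and messages."""
    events: Dict[str, str] = {}
    if temp_c > TEMP_LIMIT_C:
        events["hitemp"] = f"temperature {temp_c:.0f}C exceeds {TEMP_LIMIT_C:.0f}C"
    if disk_used_fraction >= DISK_FULL_FRACTION:
        events["fsfull"] = f"local disk {disk_used_fraction:.0%} full"
    return events


def format_node_line(node: str, events: Dict[str, str]) -> str:
    """Render one node line in the assumed GEVENT[name]='message' layout."""
    parts = [node] + [f"GEVENT[{name}]='{msg}'" for name, msg in events.items()]
    return " ".join(parts)


def report(readings: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return one line per node that has at least one event to report."""
    lines = []
    for node, (temp_c, disk_used) in sorted(readings.items()):
        events = node_events(temp_c, disk_used)
        if events:
            lines.append(format_node_line(node, events))
    return lines


if __name__ == "__main__":
    # Sample readings: (temperature in Celsius, fraction of local disk used).
    sample = {"node01": (72.0, 0.40), "node02": (91.0, 0.97)}
    for line in report(sample):
        print(line)
```

Nodes with no active events are omitted from the output, so in this sketch an event stops being reported as soon as the underlying reading returns to normal.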

To submit a batch application request which requires operating system provisioning, use standard batch submission commands.

> msub -l nodes=1,walltime=300,arch=x86,os=suse91 applaunch:data3.txt

moab.1043 submitted

Moab will schedule applications across the cluster, grid, or utility computing resource and will package the application with the requested operating system. As needed, Moab will reprovision resources to provide the bundled OS/application on the best available compute node.

With this model, batch and rigidly scheduled applications can be intermixed.

  • resource allocation optimized to minimize provisioning overhead
  • forced provisioning option to allow added security
  • per architecture image tracking
  • configurable policies to prevent oversubscription of provisioning resources
  • automated failure recovery
