Case Study 6
A.14 Case Study: Overflow Grid
Overview
An organization must serve a workload whose volume fluctuates widely and whose urgency can shift from basic 'business as usual' to 'a matter of national security' at a moment's notice. The organization needs to flow over-capacity workload seamlessly to other internal clusters, to clusters belonging to partner organizations, and to alternate commercial sources.
The infrastructure must also tolerate the complete loss of internal computing capacity, automatically redirecting workload to whatever capacity remains available. Further, if workload demand continues to exceed compute capacity, the system must automatically ensure that urgent workload is able to run, even if that requires preempting or destroying less critical jobs.
The organization will be billed for the use of both partner and commercial resources.
Resources
- Two 128-processor compute clusters
- High-speed interconnect between clusters
- Shared storage system between clusters
- High-speed network connection to the external Internet
Workload
- short duration jobs requiring between 10 minutes and 3 hours to complete
- small jobs requiring between 1 and 32 processors to complete
- relatively small data set requirements with input data < 20 MB
- significant variation in workload priority from basic to urgent
- variations in resource affinity based on optimal execution platforms and political considerations
Solution
Establish a peer-to-peer grid across all internal clusters, allowing automatic load balancing across active clusters. Enable per-lab submission points that can migrate workload to local, partner, or commercial resources. By default, allow only priority or urgent workload to flow to external resources. Enable automated workload rollover and resubmission in the event of internal network or cluster failures, and notify administrators before rollover so that they can override it manually. Allow manual reconfiguration of external resource access rights so that external resources can be opened to production use during extended internal failures or periods of excessive workload.
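The routing policy described above can be illustrated with a minimal Python sketch. It is not scheduler configuration and does not reflect any actual product API: the names `Cluster`, `Job`, `choose_cluster`, and `allow_external_production` are hypothetical, and the preference order of local, then partner, then commercial resources is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed priority scale for this sketch: higher number means more urgent.
PRIORITY_LEVELS = {"basic": 0, "priority": 1, "urgent": 2}

# Assumed preference order: try local first, then partner, then commercial.
KIND_ORDER = {"local": 0, "partner": 1, "commercial": 2}


@dataclass
class Cluster:
    name: str
    kind: str            # "local", "partner", or "commercial"
    free_procs: int
    online: bool = True


@dataclass
class Job:
    name: str
    procs: int
    priority: str        # key of PRIORITY_LEVELS


def choose_cluster(job: Job, clusters: List[Cluster],
                   allow_external_production: bool = False) -> Optional[Cluster]:
    """Return the cluster the job should run on, or None if it must wait."""
    # By default only priority or urgent workload may leave local resources.
    # An administrator can lift that restriction during extended failures.
    external_ok = (allow_external_production
                   or PRIORITY_LEVELS[job.priority] >= PRIORITY_LEVELS["priority"])
    candidates = [
        c for c in clusters
        if c.online
        and c.free_procs >= job.procs
        and (c.kind == "local" or external_ok)
    ]
    if not candidates:
        return None
    # Prefer the nearest kind of resource; within a kind, balance load by
    # picking the cluster with the most free processors.
    return min(candidates, key=lambda c: (KIND_ORDER[c.kind], -c.free_procs))
```

In this sketch, a failed internal cluster is simply marked `online=False`, so resubmitted jobs fall through to the next eligible resource. Basic jobs keep waiting unless `allow_external_production` is set, which mirrors the manual reconfiguration step.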
Enable service level agreements within local, partner, and commercial resources to support next-to-run scheduling and automated preemption based on workload priority. Allow workload to be redirected automatically as local workload levels drop or local systems are brought back online. This solution allows users to see and use all potential compute resources as if they were local, through local portals and graphical interfaces, even in the event of major local and remote failures.
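Priority-based preemption can be sketched in the same spirit. The function below is a hypothetical illustration, not the scheduler's actual algorithm. It assumes that jobs are described by `(name, procs, priority)` tuples with integer priorities, that only strictly lower-priority jobs may be preempted, and that the least critical jobs are chosen first.

```python
from typing import List, Tuple

# A running job as (name, processors in use, integer priority); higher
# priority means more urgent. This layout is an assumption of the sketch.
RunningJob = Tuple[str, int, int]


def select_preemption_victims(running: List[RunningJob],
                              needed_procs: int,
                              incoming_priority: int) -> List[str]:
    """Pick jobs to preempt so that an incoming job can start.

    Returns the names of the jobs to preempt, or an empty list when no
    preemption is needed or when preemption cannot free enough processors.
    """
    # Only jobs less critical than the incoming one are eligible. Take the
    # lowest priority first; among equals, take the largest job first so
    # that fewer jobs are disturbed.
    eligible = sorted(
        (job for job in running if job[2] < incoming_priority),
        key=lambda job: (job[2], -job[1]),
    )
    victims: List[str] = []
    freed = 0
    for name, procs, _priority in eligible:
        if freed >= needed_procs:
            break
        victims.append(name)
        freed += procs
    # Do not preempt anything if the incoming job still would not fit.
    return victims if freed >= needed_procs else []
```

Returning an empty list when the request cannot be satisfied avoids destroying less critical work without any benefit to the urgent job. In that case the urgent job would instead be a candidate for migration to partner or commercial resources.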