A.6 Case Study: Peer-To-Peer Overflow
Overview
NCSA owns a 512-processor cluster, part of which they want to make available to clients for use when needed. They want to create a protected area of their cluster, containing 64 processors, to which clients have priority access. When those resources are not in use by clients, they would like to use them for internal workload. However, any client workload that shows up would be able to preempt NCSA internal workload on those processors.
NCSA wants to establish a service level agreement with each client, including a billing rate. Each client should have control over when NCSA resources are used. Tracking the actual resource usage by each client should be as automated as possible.
NCSA wants to maintain a high level of security by maintaining tight control over access to their shared resources and encrypting communication to and from each client. They also want to minimize the overhead of providing access to clients so as to avoid affecting scheduling performance for their clients and themselves.
The goals of establishing a relationship with a peer site in order to share extra capacity include:
The best way to avoid any user-visible changes is to allow Moab to migrate jobs transparently by creating a peer relationship with another instance of Moab. This allows users to continue to submit jobs normally, while benefiting from the shorter turnaround time gained by using the additional resources.
Moab has flexible and convenient mechanisms for handling authorization and mapping user credentials on one system to a different set of credentials on a peer system. It also supports encryption and authentication using secret keys or certificates.
Moab is designed to handle large numbers of jobs during scheduling. The additional work of evaluating jobs for migration to a peer instance of Moab adds only a minor amount of computation. Moab can immediately begin migrating jobs when a preconfigured backlog threshold is reached.
Client (Job Source) Configuration
# this name is used for identification
SCHEDCFG[client1] TYPE=NORMAL
SCHEDCFG[client1] SERVER=localserver:42559

# allow jobs to run locally as well
RMCFG[base] TYPE=PBS

# create the peer connection
RMCFG[overflow] TYPE=MOAB
RMCFG[overflow] SERVER=moab://moab.ncsa.uiuc.edu:42559

# map local credentials to remote equivalent
RMCFG[overflow] CREDMAP=file:///opt/moab/credmap.txt

# only use NCSA resources when backlog threshold is exceeded
SRCFG[shared] QLIST=shared
QOSCFG[shared] BACKLOGTHRESHOLD=24
# key should be the same for both peers
CLIENTCFG[RM:overflow] KEY=thisisthekey AUTH=admin1
# format: [LOCALCRED],[REMOTECRED]
user:clientuser,peeruser
user:steve,sjohnson
group:test,company1
class:batch,serial
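To make the credential map format concrete, the following minimal Python sketch parses lines of the form shown above and looks up the remote equivalent of a local credential. It is an illustration only and not part of Moab: the names `parse_credmap` and `map_credential` are hypothetical, and the sketch returns `None` for unmapped credentials because how Moab itself treats them is not covered here.

```python
# Hypothetical helper, not part of Moab: read credmap lines of the form
# "<type>:<localcred>,<remotecred>" and translate a local credential.

CREDMAP_TEXT = """\
# format: [LOCALCRED],[REMOTECRED]
user:clientuser,peeruser
user:steve,sjohnson
group:test,company1
class:batch,serial
"""


def parse_credmap(text):
    """Return {(cred_type, local_name): remote_name} from credmap text."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cred_type, _, names = line.partition(":")
        local, _, remote = names.partition(",")
        if not (cred_type and local and remote):
            raise ValueError(f"malformed credmap line: {raw!r}")
        mapping[(cred_type.strip(), local.strip())] = remote.strip()
    return mapping


def map_credential(mapping, cred_type, local_name):
    """Return the remote credential, or None if no mapping is defined."""
    return mapping.get((cred_type, local_name))


credmap = parse_credmap(CREDMAP_TEXT)
print(map_credential(credmap, "user", "steve"))
print(map_credential(credmap, "class", "batch"))
```

With the sample map, the local user `steve` is looked up as `sjohnson` on the peer, and the local class `batch` as `serial`.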
NCSA (Job Destination) Configuration
# define the scheduler
SCHEDCFG[ncsa] TYPE=NORMAL
SCHEDCFG[ncsa] SERVER=moab://scheduler:42559

# allow jobs to run locally as well
RMCFG[base] TYPE=PBS

# define each client using local credentials
RMCFG[client1] FLAGS=CLIENT CLIENT=client1 AUTHGLIST=company1
RMCFG[client2] FLAGS=CLIENT CLIENT=client2 AUTHGLIST=company2

# limit incoming job rate for each client
RMCFG[client1] FLOWMETRIC=jobs FLOWLIMIT=20 FLOWINTERVAL=5:00:00
RMCFG[client2] FLOWMETRIC=jobs FLOWLIMIT=12 FLOWINTERVAL=0:30:00

# use QOS to allow local jobs to be preempted by client jobs
PREEMPTPOLICY REQUEUE
QOSWEIGHT 1
QOSCFG[localjobs] QFLAGS=PREEMPTEE

# put all local jobs into the localjobs QOS by default
USERCFG[DEFAULT] QDEF=localjobs

# create the grid sandbox
SRCFG[sandbox1] PERIOD=INFINITY
SRCFG[sandbox1] HOSTLIST=node0[1-9],node[1-5][0-9],node6[0-4]

# enable grid jobs and allow the clients to preempt jobs
SRCFG[sandbox1] FLAGS=ALLOWGRID,OWNERPREEMPT
SRCFG[sandbox1] OWNER=CLUSTER:client1,CLUSTER:client2

# allow the clients to have access to sandbox
SRCFG[sandbox1] CLUSTERLIST=client1,client2

# allow local jobs to use unused resources within sandbox
SRCFG[sandbox1] QOSLIST=localjobs
# define key for each client
CLIENTCFG[RM:client1] KEY=thisisthekey AUTH=admin1
CLIENTCFG[RM:client2] KEY=mykeyiscool1 AUTH=admin1
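Two numbers in the destination configuration are easy to sanity-check by hand: the size of the sandbox and the per-client flow limits. The Python sketch below does both. It is not a Moab tool. It assumes the sandbox HOSTLIST uses only single-character bracket ranges such as `[1-5]`, that FLOWINTERVAL is written as H:MM:SS, and that FLOWLIMIT is the number of jobs accepted per interval. How Moab enforces the interval window is not modeled, and all function names are hypothetical.

```python
import itertools
import re

# Sandbox host expression copied from the configuration above.
HOSTLIST = "node0[1-9],node[1-5][0-9],node6[0-4]"


def expand_pattern(pattern):
    """Expand one pattern such as 'node[1-5][0-9]' into host names.

    Only single-character ranges like [1-9] are supported.
    """
    # Split into literal text and bracketed ranges: 'node', '[1-5]', '[0-9]'.
    parts = re.findall(r"\[[^\]]+\]|[^\[\]]+", pattern)
    choices = []
    for part in parts:
        if part.startswith("["):
            m = re.fullmatch(r"\[(\w)-(\w)\]", part)
            if m is None:
                raise ValueError(f"unsupported range: {part!r}")
            lo, hi = ord(m.group(1)), ord(m.group(2))
            choices.append([chr(c) for c in range(lo, hi + 1)])
        else:
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]


def expand_hostlist(hostlist):
    """Expand a comma-separated list of patterns into host names."""
    hosts = []
    for pattern in hostlist.split(","):
        hosts.extend(expand_pattern(pattern))
    return hosts


def interval_seconds(text):
    """Convert an 'H:MM:SS' interval into seconds."""
    hours, minutes, seconds = (int(p) for p in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def jobs_per_hour(flow_limit, flow_interval):
    """Average hourly job rate implied by a limit and an interval."""
    return flow_limit * 3600 / interval_seconds(flow_interval)


hosts = expand_hostlist(HOSTLIST)
print(len(hosts), hosts[0], hosts[-1])
print(jobs_per_hour(20, "5:00:00"))  # client1
print(jobs_per_hour(12, "0:30:00"))  # client2
```

Under these assumptions the host expression covers node01 through node64, which matches the 64-processor sandbox from the overview if each node contributes one processor. The limits average out to 4 jobs per hour for client1 and 24 jobs per hour for client2.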
The mdiag -R command will allow both NCSA and their clients to monitor how many jobs flow over to NCSA's grid sandbox. Detailed information about what types of jobs are migrating can be obtained using job profiles.
Creating peer relationships between clusters allows great flexibility in how sites can work with each other to increase their overall efficiency and meet their own goals. Creating a unidirectional peer relationship from each of the clients allows NCSA to provide an overflow service that is beneficial to both parties. Clients can choose to use NCSA's resources to reduce their workflow backlog, and NCSA can provide overflow capacity to their clients and bill for usage.
© 2001-2010 Adaptive Computing Enterprises, Inc.