
A.3  Case Study: Development O2K

Overview:

    A 64-processor O2K system must be scheduled while carrying a significant 'background' load.

Resources:

    Compute Nodes:       64 processor, 32 GB O2K system
    Resource Manager:    OpenPBS 2.3
    Network:             Internal SGI network

Workload:

    Job Size:             jobs range in size from 1 to 32 processors.

    Job Length:           jobs range in length from 15 minutes to 48 hours.

    Job Owners:         various

    NOTES:              This is a login/development machine, meaning that at any given time there may be a significant load originating from jobs/processes outside of the resource manager's view or control.  The major scheduling-relevant impact of this is on CPU load and real memory consumption.

Constraints: (Must do)

    The scheduler must run the machine at maximum capacity without overcommitting either memory or processors.  A significant and variable background load exists from jobs submitted outside of the resource manager's view or control.  The scheduler must track and account for this load and allow space for some variability and growth of this load over time.  The scheduler should also 'kill' any job which violates its requested resource allocation and notify the associated user of this violation.

Goals: (Should do)

    The scheduler should maximize the throughput of the queued jobs, with avoiding starvation as a secondary concern.

Analysis:

    The background load causes many problems in any mixed batch/interactive environment.  One such problem arises when the highest priority batch job cannot run.  Maui can make a reservation for this job, but because there are no constraints on the background load, Maui cannot determine when the load will drop enough to allow the job to run.  By default, it optimistically attempts a reservation for the next scheduling iteration, perhaps 1 minute out.  With this reservation in place one minute out, Maui's backfill can only consider jobs which request less than one minute or which can fit 'beside' the high priority job.  On the next iteration, Maui still cannot run the job because the background load has not dropped, and it again creates a new reservation for one minute out.

    The background load has basically turned batch scheduling into an exercise in 'resource scavenging'.  If the priority job reservation were not there, other smaller queued jobs might be able to run.  Thus, to maximize the 'scavenging' effect, the scheduler should be configured to allow this high priority job 'first dibs' on all available resources but prevent it from reserving these resources if it cannot run immediately.
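
    The effect described above can be illustrated with a small Python sketch.  This is a toy model, not Maui's backfill algorithm: the function name 'backfill', the job tuples, and the '(start_in_minutes, procs_beside)' reservation shape are all assumptions made for illustration.

```python
def backfill(queue, free_procs, reservation=None):
    """Return the names of jobs started in one scheduling iteration.

    queue:       list of (name, procs, minutes) in priority order
    reservation: hypothetical (start_in_minutes, procs_beside) pair for the
                 blocked top job, or None when no reservation is made
    """
    started = []
    start_in, beside = reservation if reservation is not None else (None, None)
    for name, procs, minutes in queue:
        if procs > free_procs:
            continue  # does not fit on the processors free right now
        if reservation is not None:
            if minutes <= start_in:
                pass  # finishes before the reservation begins
            elif procs <= beside:
                beside -= procs  # fits 'beside' the reserved processors
            else:
                continue  # would collide with the reservation
        started.append(name)
        free_procs -= procs
    return started


# Job A is the blocked high priority job; 24 processors are free right now.
queue = [("A", 32, 600), ("B", 16, 120), ("C", 8, 30), ("D", 4, 1)]
print(backfill(queue, 24, reservation=(1, 4)))  # reservation one minute out
print(backfill(queue, 24))                      # no reservation for A
```

    In this model, job A is still examined first in both cases, so it keeps 'first dibs'.  Only the reservation changes: with it, just the one-minute job D starts; without it, jobs B and C are able to scavenge the free processors.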

Configuration:

  The configuration needs to accomplish several main objectives including:

    -    track the background load to prevent oversubscription
    -    favor small, short jobs to maximize job turnaround
    -    prevent blocked high priority jobs from creating reservations
    -    interface to an allocation manager to charge for all resource usage based on utilized CPU time
    -    cancel jobs which exceed specified resource limits
    -    notify users of job cancellation due to resource utilization limit violations

    The following Maui config file should work.

maui.cfg
-----
# allow jobs to share node
NODEACCESSPOLICY  SHARED

# track background load
NODELOADPOLICY            ADJUSTPROCS
NODEUNTRACKEDLOADFACTOR   1.2

# favor short jobs, disfavor large jobs
QUEUETIMEWEIGHT   0
RESOURCEWEIGHT    -10
PROCWEIGHT        128
MEMWEIGHT         1
XFACTOR           1000

# disable priority reservations for the default QOS
QOSFLAGS[0]       NORESERVATION

# debit by CPU
BANKTYPE          QBANK
BANKSERVER        develop1
BANKPORT          2334
BANKCHARGEMODE    DEBITSUCCESSFULCPU

# kill resource hogs
RESOURCEUTILIZATIONPOLICY ALWAYS
RESOURCEUTILIZATIONACTION CANCEL

# notify user of job events

NOTIFYSCRIPT  tools/notify.pl
-----
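
    The configuration above sets the queue time weight to 0 and gives the expansion factor a large weight.  The sketch below shows why the expansion factor favors short jobs, using the usual definition (queue time plus wallclock limit, divided by wallclock limit).  The function name and the sample values are illustrative assumptions, and the way Maui combines this factor with the other weights is not modeled here.

```python
def xfactor(queued_minutes, wallclock_limit_minutes):
    """Expansion factor: (queue time + wallclock limit) / wallclock limit."""
    return (queued_minutes + wallclock_limit_minutes) / wallclock_limit_minutes


# Two jobs that have both waited one hour in the queue.
short_job = xfactor(60, 15)        # 15 minute job
long_job = xfactor(60, 48 * 60)    # 48 hour job

print(f"15 minute job: {short_job:.2f}")
print(f"48 hour job:   {long_job:.2f}")
```

    After the same wait, the 15 minute job has an expansion factor of 5.0 while the 48 hour job is still close to 1.0, so a heavily weighted expansion factor pushes short jobs toward the front of the queue.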

Monitoring:

    The most difficult aspect of this environment is properly 'reserving' space for the untracked 'background' load.  Since this load is outside the view and control of the scheduler and resource manager, there are no constraints on what it can do.  It could instantly grow and overwhelm the machine, or just as easily disappear.  The parameter 'NODEUNTRACKEDLOADFACTOR' provides 'slack' for this background load to grow and shrink.  However, since there is no control over the load, the effectiveness of this parameter will depend on the statistical behavior of this load.  The greater the value, the more slack is provided and the less likely the system is to be overcommitted; however, a larger value also means more resources are held in this 'reserve' and are unavailable for scheduling.

    The right solution is to migrate the users over to the batch system or provide them with a constrained resource 'box' to play in, whether through a processor partition, another system, or a logical software system.  The value of the 'box' is that it prevents this unpredictable background load from wreaking havoc with an otherwise sane dedicated resource reservation system.  Maui can reserve resources for jobs according to all information currently available.  However, the unpredictable nature of the background load may mean those resources are not available when they should be, resulting in cancelled reservations and the inability to enforce site policies and priorities.
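
    The trade-off behind NODEUNTRACKEDLOADFACTOR can be sketched in Python.  This is a rough model only: it assumes the measured untracked load is simply multiplied by the factor and the result is withheld from scheduling.  The function name, its arguments, and the rounding are hypothetical and are not taken from Maui.

```python
import math


def schedulable_procs(configured, batch_in_use, untracked_load, factor=1.2):
    """Processors left for new batch jobs under the assumed slack model."""
    # Assumption: the reserve is the untracked load scaled by the factor,
    # rounded up to whole processors.
    reserve = math.ceil(untracked_load * factor)
    return max(0, configured - batch_in_use - reserve)


# 64 processors, 20 running batch work, untracked load of 10.5 processors.
for factor in (1.0, 1.2, 2.0):
    print(factor, schedulable_procs(64, 20, 10.5, factor))
```

    A larger factor leaves fewer processors for batch jobs but tolerates more growth in the background load before the machine is overcommitted.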

    The second aspect of this environment which must be monitored is the trade-off between high job throughput and job starvation.  The 'locally greedy' approach of favoring the smallest, shortest jobs will have a negative effect on larger and longer jobs.  The large, long jobs which have been queued for some time can be pushed to the front of the queue by increasing the QUEUETIMEWEIGHT factor until a satisfactory balance is achieved.
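
    The balance between throughput and starvation can be explored with a simplified model.  The sketch below is not Maui's full priority calculation: it assumes priority is just a queue time term plus an expansion factor term, and the job records and function names are made up for illustration.

```python
def priority(job, queuetime_weight, xfactor_weight=1000):
    """Toy priority: weighted queue time plus weighted expansion factor."""
    queued, limit = job["queued_min"], job["limit_min"]
    xfactor = (queued + limit) / limit
    return queuetime_weight * queued + xfactor_weight * xfactor


def order(jobs, queuetime_weight):
    """Job names sorted from highest to lowest toy priority."""
    ranked = sorted(jobs, key=lambda j: priority(j, queuetime_weight), reverse=True)
    return [j["name"] for j in ranked]


jobs = [
    {"name": "short", "queued_min": 30, "limit_min": 15},       # new, small
    {"name": "long", "queued_min": 1440, "limit_min": 2880},    # waited a day
]

print(order(jobs, queuetime_weight=0))  # short job first
print(order(jobs, queuetime_weight=2))  # long-waiting job moves ahead
```

    With a queue time weight of 0 the short job always wins.  Raising the weight lets the job that has waited a full day overtake it, which is the kind of adjustment to repeat until starvation is acceptable.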

Conclusions:

    Mixed batch/non-batch systems are very, very nasty.  :)