
A.1   Case Study: Mixed Parallel/Serial Homogeneous Cluster

Overview:

    A multi-user site wishes to control the distribution of compute cycles while minimizing job turnaround time and maximizing overall system utilization.

Resources:

    Compute Nodes:       64 2-way SMP Linux-based nodes, each with 512 MB of RAM and 16 GB of local scratch space
    Resource Manager:    OpenPBS 2.3
    Network:             100 Mbit switched Ethernet
 

Workload:

    Job Size:            Jobs range in size from 1 to 32 processors, with roughly one quarter of jobs falling in each of the ranges 1 - 2, 3 - 8, 9 - 24, and 25 - 32 processors.

    Job Length:          Jobs range in length from 1 to 24 hours.

    Job Owners:          Jobs are submitted by 6 major groups comprising a total of about 50 users.

    NOTES:               During primetime hours, the majority of submitted jobs are smaller, short-running development jobs in which users test new code and new data sets.  The owners of these jobs are often unable to proceed with their work until a submitted job completes, and many of these jobs are interactive in nature.  Throughout the day, larger, longer-running production jobs are also submitted, but these jobs do not face comparable turnaround time pressure.

Constraints: (Must do)

    The groups 'Meteorology' and 'Statistics' should receive approximately 45% and 35% of the total delivered cycles, respectively.  Nodes cannot be shared amongst tasks from different jobs.

Goals: (Should do)

    The system should attempt to minimize job turnaround time during primetime hours (Mon - Fri, 8:00 AM to 5:00 PM) and maximize system utilization at all other times.  System maintenance should be efficiently scheduled around the existing workload.

Analysis:

    The network topology is flat and the nodes are homogeneous, which significantly simplifies the scheduling problem.  The focus for this site is controlling the distribution of compute cycles without negatively impacting overall system turnaround and utilization.  Currently, the best mechanism for doing this is fairshare.  This feature can be used to adjust the priority of jobs to favor or disfavor them based on fairshare targets and historical usage.  In essence, fairshare improves the turnaround time of jobs that are not meeting their fairshare target at the expense of those that are.  Depending on how critical the delivered cycle distribution constraints are, this site might also wish to consider an allocations bank such as PNNL's QBank, which enables more stringent control over the amount of resources delivered to various users.
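
    Beyond the targets themselves, the shape of the fairshare history window also influences behavior.  As a rough sketch (assuming Maui's standard fairshare parameters; the values shown are illustrative, not site requirements), the window could be tuned in maui.cfg along these lines:

-----
# illustrative fairshare window settings (values are assumptions)
FSPOLICY      DEDICATEDPS    # charge usage as dedicated processor-seconds
FSINTERVAL    24:00:00       # each fairshare window covers one day
FSDEPTH       7              # retain one week of usage history
FSDECAY       0.80           # weight older windows progressively less
-----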

    To manage the primetime job turnaround goals, a standing reservation is probably the best approach.  A standing reservation can be used to set aside a subset of the nodes for quick turnaround jobs, and it can be configured with a time-based access constraint so that only jobs which will complete within some time X may utilize these resources.  In this case, the reservation has advantages over a typical queue-based solution in that quick turnaround jobs can run anywhere resources are available, inside or outside the reservation, even crossing reservation boundaries.  The site does not have any hard constraints on acceptable turnaround time, so the best approach would probably be to analyze the site's workload under a number of configurations using the simulator and observe the corresponding scheduling behavior, as sketched below.
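
    As a sketch of such an analysis (the trace file names here are hypothetical), Maui can be run in simulation mode against recorded resource and workload traces:

-----
# run Maui in simulation mode against recorded traces
# (the trace file names are hypothetical)
SERVERMODE              SIMULATION
SIMRESOURCETRACEFILE    traces/resource.trace
SIMWORKLOADTRACEFILE    traces/workload.trace
-----

    Candidate reservation sizes and access constraints can then be compared against the same workload trace without any risk to the production system.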

    For general optimization, there are a number of scheduling aspects to consider: the scheduling algorithm, reservation policies, node allocation policies, and job prioritization.  It is almost always a good idea to utilize the scheduler's backfill capability, since it tends to increase average system utilization and decrease average turnaround time in a surprisingly fair manner.  Backfill does tend to somewhat favor small, short jobs over others, which is exactly what this site desires.  Reservation policies are often best left alone unless rare starvation issues arise or quality of service policies are desired.  Node allocation policies are effectively meaningless since the system is homogeneous.  The final aspect, job prioritization, can play a significant role in meeting site goals.  To maximize overall system utilization, a significant Resource priority factor will favor large resource (processor) jobs, pushing them to the front of the queue; large jobs, though often only a small portion of a site's job count, regularly account for the majority of a site's delivered compute cycles.  To minimize job turnaround, the XFactor priority factor will favor short-running jobs.  Finally, for fairshare to be effective, a significant Fairshare priority factor must be included.
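
    To illustrate how these factors combine, Maui computes a job's priority as a weighted sum of its priority components.  The numbers below are purely illustrative (the weights are those from the configuration in the next section; the job values are invented):

-----
# illustrative priority calculation using the weights configured below
#   job A: 32 processors, XFactor 1.2, fairshare component 0.0
#   job B:  2 processors, XFactor 4.0, fairshare component 0.5
#
#   Priority(A) = 20*32 + 100*1.2 + 100*0.0 = 640 + 120 +  0 = 760
#   Priority(B) = 20* 2 + 100*4.0 + 100*0.5 =  40 + 400 + 50 = 490
-----

    The large job A wins here on its Resource component, but job B's short length (high XFactor) and unmet fairshare target substantially narrow the gap; the balance can be tilted either way by adjusting the weights.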
 

Configuration:

    For this scenario, a resource manager configuration consisting of a single, global queue/class with no constraints would give Maui the maximum flexibility and the most opportunities for optimization.
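
    As a sketch, such a single-queue OpenPBS setup might be created via qmgr (the queue name 'batch' is an arbitrary choice):

-----
> qmgr -c "create queue batch"
> qmgr -c "set queue batch queue_type = Execution"
> qmgr -c "set queue batch enabled = True"
> qmgr -c "set queue batch started = True"
> qmgr -c "set server default_queue = batch"
-----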

    The following Maui configuration would be a good starting point:

maui.cfg
-----
# reserve 16 processors during primetime for jobs requiring less than 2 hours to complete

SRNAME[0]        fast
SRTASKCOUNT[0]   16
SRDAYS[0]        MON TUE WED THU FRI
SRSTARTTIME[0]   8:00:00
SRENDTIME[0]     17:00:00
SRMAXTIME[0]     2:00:00

# prioritize jobs for Fairshare, XFactor, and Resources

RESOURCEWEIGHT   20
XFACTORWEIGHT    100
FAIRSHAREWEIGHT  100

# disable SMP node sharing

NODEACCESSPOLICY  DEDICATED
-----

fs.cfg
-----
Group:Meteorology  FSTARGET=45
Group:Statistics   FSTARGET=35
-----
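
    One note on backfill: Maui is believed to enable backfill by default (BACKFILLPOLICY FIRSTFIT), so no additional configuration is strictly required; if the site prefers to make this explicit, a line such as the following could be added to maui.cfg:

-----
# make the (default) backfill behavior explicit
BACKFILLPOLICY  FIRSTFIT
-----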
 

Monitoring:

    The command 'diagnose -f' will allow you to monitor the effectiveness of the fairshare component of your job prioritization.  Adjusting the Fairshare priority factor up or down will make fairshare more or less effective; note that a tradeoff must occur between fairshare and the other goals managed via job prioritization.  'diagnose -p' will help you analyze the priority distributions of the currently idle jobs.  The 'showgrid AVGXFACTOR' command will provide a good indication of average job turnaround, while the 'profiler' command will give an excellent analysis of longer-term historical performance statistics.
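
    As a sketch, these checks can be gathered in a single pass (the commands are those named above; the annotations are informal):

-----
> diagnose -f              # fairshare usage vs. configured targets
> diagnose -p              # priority component breakdown of idle jobs
> showgrid AVGXFACTOR      # average expansion factor by job duration/size
-----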
 

Conclusions:

    Any priority configuration will need to be tuned over time because the effect of priority weights is highly dependent upon the site-specific workload.  Additionally, the priority weights themselves are part of a feedback loop which adjusts the site workload.  However, most sites stabilize quickly, and significant priority tuning is unnecessary after a few days.