The first step of most simulations is to determine the primary purpose of the simulation. Purposes may include identifying the impact of certain resource or workload changes on current cluster performance. Simulations may also focus on system utilization or workload distribution across resources or credentials. Further, simulations may be used for training purposes, allowing risk-free evaluation of behavior, facilities, and commands. With the purpose known, metrics of success can be specified and a proper simulation created. While performance metrics may not be critical to training-based simulations, they are key to successful evaluation in most other cases.
16.3.1.2 Selecting Resources
As in the real world, a simulation requires a set of resources (compute hosts) on which to run. In Moab, this is specified using a resource trace file. This resource trace file may be obtained from specific hardware or generated for the specific purpose.
16.3.1.3 Selecting Workload
In addition to resources, a simulation also requires a workload (batch jobs) to schedule onto the available resources. This workload is specified within a workload trace file. Like the resource traces, this workload information may be based on recorded data or generated to meet the need of the particular simulation.
16.3.1.4 Selecting Policies
The final aspect of a simulation is the set of policies and configuration to be used to determine how a workload is to be scheduled onto the available resources. This configuration is placed in the moab.cfg file just as would be done in production (or normal) mode operation.
16.3.1.5 Initial Configuration Using the Sample Traces
While mastering simulations may take some time, initial configuration is straightforward. To start, edit the moab.cfg file and do the following:
Change the SCHEDCFG attribute MODE from NORMAL or MONITOR to SIMULATION.
Add the following lines:
The preceding steps specify that the scheduler should run in simulation mode and use the referenced resource and workload trace files. In addition, leaving the SIMSTOPITERATION parameter at zero indicates that Moab should stop before the first scheduling iteration and wait for further instructions. If you want the simulation to run as soon as you start Moab, remove (or comment out) this line. To continue scheduling, run the mschedctl -r command.
You also may need to add these lines to the moab.cfg file:
The second set of parameters is helpful if you want to generate charts or reports from Moab Cluster Manager. Since events in the workload trace may reference credentials that are not listed in your moab.cfg file, set CREDDISCOVERY to true, which allows Moab to create simulated credentials for credentials that do not yet exist. Setting SIMAUTOSHUTDOWN to false prevents Moab from terminating after it has finished running all the jobs in the workload trace, which allows you to generate charts after all the simulated jobs have finished. Ensure that SIMSTARTTIME is set to the epoch time (in seconds) of the first event in your workload trace file. This sets Moab's internal clock to the time of the workload trace's first event, which prevents issues caused by the difference between the time the workload trace was created and the time reported by the system clock. Otherwise, Moab treats the time reported by the system clock as the current time, even though the simulated jobs that showq reports as currently running actually ran at the time the workload trace was created. To avoid this confusion, set SIMSTARTTIME. The lines that specify ENABLEPROFILING=true are necessary for Moab to keep track of the statistics generated by the simulated jobs. Without these lines, charts and reports contain all zero values.
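As a rough illustration of the second parameter set, the following Python sketch renders those lines from the epoch time of the first workload event. The helper name profiling_config_lines, the sample epoch value, and the choice of USERCFG[DEFAULT] and GROUPCFG[DEFAULT] as the lines that carry ENABLEPROFILING=true are assumptions made for illustration only; use the lines listed in this section and the first event time from your own trace.

```python
from typing import Iterable, List


def profiling_config_lines(first_event_epoch: int,
                           cred_types: Iterable[str] = ("USER", "GROUP")) -> List[str]:
    """Render the optional simulation lines for moab.cfg (illustrative only).

    first_event_epoch: epoch time, in seconds, of the first event in the
    workload trace. cred_types is a hypothetical selection of credential
    types that should collect statistics.
    """
    if first_event_epoch <= 0:
        raise ValueError("SIMSTARTTIME must be a positive epoch time in seconds")
    lines = [
        "CREDDISCOVERY     TRUE",
        "SIMAUTOSHUTDOWN   FALSE",
        f"SIMSTARTTIME      {first_event_epoch}",
    ]
    # One profiling line per credential type (assumed set, see lead-in).
    lines += [f"{cred}CFG[DEFAULT] ENABLEPROFILING=true" for cred in cred_types]
    return lines


if __name__ == "__main__":
    # 1200000000 is a placeholder, not a value taken from the sample traces.
    print("\n".join(profiling_config_lines(1200000000)))
```

Generating the lines from the trace's first event time keeps SIMSTARTTIME and the trace in step when you switch to a different workload trace.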
16.3.1.6 Starting a Simulation
As in all cases, start Moab by issuing the moab command. Note that in simulation mode Moab does not daemonize, so it does not move itself into the background. To verify proper operation, run any common user command such as showq, which displays the jobs currently in the scheduler's queue. The jobs displayed by showq are taken from the workload trace file specified earlier, and those marked as running are running on resources described in the resource trace file. At any point, a detailed summary of available resources may be obtained by running the mdiag -n command.
A closer look at the output of showq shows that the jobs are organized into three categories: active jobs that are currently running, idle jobs that will start as soon as the required resources become available, and blocked jobs that are currently ineligible to run because they violate some configured policy.
16.3.1.7 Interactive Tutorial
The rest of this section provides an interactive tutorial that demonstrates the basic capabilities of the Moab simulator. Commands to issue are shown with a leading prompt, as in > showq, followed by the expected output.
Next, verify that Moab is running by executing showq:
Out of the thousands of jobs in the workload trace, only 16 jobs are either active or eligible because of the default settings of the SIMINITIALQUEUEDEPTH parameter. Sixteen jobs are put in the idle queue, seven of which immediately run. Issuing the command showq -r allows a more detailed look at the active (or running) jobs. The output is sorted by job completion time and indicates that the first job will complete in one day (1:00:00:00).
While showq details information about the queues, scheduler statistics may be viewed using the showstats command. The field Current Active/Total Procs shows current system utilization, for example.
You might be wondering why only 140 of 196 processors are active (as shown with showq) when the first job in the queue (fr1n04.362.0) requires only 20 processors. To determine why it is not running, use the checkjob command, which reports detailed job state information and diagnostic output for a particular job:
The checkjob output not only reports the job's wallclock limit and the number of requested nodes (elided in the ellipsis) but also explains why the job was rejected. The Job Eligibility Analysis shows that 48 of the processors rejected this job due to memory limitations and that another 140 processors rejected it because of their state (that is, they are running other jobs). Notice the >= 256 M(B) memory requirement.
If you run checkjob with the ID of a running job, it also reports exactly which nodes have been allocated to that job. The checkjob command page describes the additional information in more detail.
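The eligibility numbers above can be checked with simple arithmetic. The sketch below uses the figures quoted in this section (196 processors, 48 rejected for memory, 140 rejected for state, 20 requested); the function names are hypothetical, and it assumes the rejection categories do not overlap.

```python
def available_procs(total: int, rejected: dict) -> int:
    """Processors left after subtracting every rejection category.

    Assumes each processor is counted under at most one rejection reason.
    """
    remaining = total - sum(rejected.values())
    if remaining < 0:
        raise ValueError("rejections exceed total processors")
    return remaining


def can_start(requested: int, total: int, rejected: dict) -> bool:
    """True if enough processors remain to start the job right now."""
    return available_procs(total, rejected) >= requested


rejections = {"Memory": 48, "State": 140}
print(available_procs(196, rejections))   # 8
print(can_start(20, 196, rejections))     # False
```

Only 8 processors remain after the rejections, fewer than the 20 the job requests, so the job cannot start even though many processors are not busy.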
Advancing the simulator an iteration, the following happens:
The scheduler control command, mschedctl, controls various aspects of scheduling behavior. It can be used to manage scheduling activity, kill the scheduler, and create resource trace files. The -S argument indicates that the scheduler should run for a single iteration and then stop. Specifying a number, n, after -S causes the simulator to advance n steps. You can determine which iteration you are currently on using showstats -v.
The line that starts with statistics for iteration <X> specifies the iteration you are currently on. Each iteration advances the simulator RMPOLLINTERVAL seconds. To see what RMPOLLINTERVAL is set to, use the showconfig command:
By default, RMPOLLINTERVAL is set to 30 seconds. With showconfig, you can see the current value of all configurable parameters.
The showq -r command can be used to display the running (active) jobs to see what happened in the last iteration:
Notice that two new jobs started (without waiting in the eligible queue). Also notice that job fr8n01.187.0, along with the rest that are summarized in the ellipsis, did NOT advance its REMAINING or STARTTIME. The simulator needs one iteration to do a sanity check. Setting the parameter SIMSTOPITERATION to 1 causes Moab to stop after the first scheduling iteration and wait for further instructions.
The showq -i command displays the idle (eligible) jobs.
Notice that none of the eligible jobs requests 19 or fewer processors (the number of idle processors). Also notice the * after the job ID fr1n04.362.0. This means that this job now has a reservation. The showres command shows all reservations currently on the system.
Here, the S column is the job's state (R = running, I = idle). All the active jobs have a reservation, as does idle job fr1n04.362.0. This reservation was created by the backfill scheduler for the highest priority idle job as a way to prevent starvation while lower priority jobs are being backfilled. (The backfill documentation describes the mechanics of backfill scheduling more fully.)
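To make the idea of a starvation-preventing reservation concrete, here is a minimal sketch of a generic EASY-style backfill rule. It is not Moab's actual algorithm, which the backfill documentation describes. The 19 idle processors and the 20-processor request come from this tutorial; the running jobs, class name, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RunningJob:
    procs: int       # processors held by the job
    remaining: int   # seconds until its wallclock limit expires


def plan_reservation(idle: int, needed: int,
                     running: List[RunningJob]) -> Tuple[int, int]:
    """Return (start_offset_s, spare_procs) for the top-priority idle job.

    Assumes every running job holds its processors until its wallclock limit.
    """
    if idle >= needed:
        return 0, idle - needed
    free = idle
    for job in sorted(running, key=lambda j: j.remaining):
        free += job.procs
        if free >= needed:
            start = job.remaining
            # Count every job that has ended by the reservation start.
            free_at_start = idle + sum(j.procs for j in running
                                       if j.remaining <= start)
            return start, free_at_start - needed
    raise ValueError("job needs more processors than the system has")


def may_backfill(procs: int, wallclock: int, idle: int,
                 start: int, spare: int) -> bool:
    """A lower-priority job may start now only if it cannot delay the reservation."""
    if procs > idle:
        return False  # does not fit in the idle processors at all
    # Either it ends before the reservation starts, or it fits in the spare.
    return wallclock <= start or procs <= spare


running = [RunningJob(procs=4, remaining=300),
           RunningJob(procs=32, remaining=86400)]
start, spare = plan_reservation(idle=19, needed=20, running=running)
print(start, spare)                              # 300 3
print(may_backfill(8, 3600, 19, start, spare))   # False
print(may_backfill(8, 240, 19, start, spare))    # True
```

The check considers one candidate at a time; a real scheduler would also update the idle and spare counts after each job it backfills.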
To display information about the nodes that job fr1n04.362.0 has reserved, use showres -n <JOBID>.
Now advance the simulator an iteration to allow some jobs to actually run.
Next, check the queues to see what happened.
Two new jobs, fr8n01.963.0 and fr8n01.1016.0, are in the eligible queue. Also, note that the first job will now complete in 4 minutes 30 seconds rather than 5 minutes because the simulation has just advanced by 30 seconds (one RMPOLLINTERVAL). It is important to note that when the simulated jobs were created, both the job's wallclock limit and its actual run time were recorded. The wallclock limit is specified by the user as a best estimate of an upper bound on how long the job will run. The run time is how long the job actually ran before completing and releasing its allocated resources. For example, a job with a wallclock limit of 1 hour will be given the needed resources for up to an hour but may complete in only 20 minutes.
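The distinction between wallclock limit and run time can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes a job that reaches its wallclock limit is stopped at that limit.

```python
def release_offset(wallclock_limit: int, run_time: int) -> int:
    """Seconds after start at which a job frees its resources.

    Assumes a job that exceeds its limit is stopped when the limit expires.
    """
    return min(run_time, wallclock_limit)


def unused_allocation(wallclock_limit: int, run_time: int) -> int:
    """Seconds of the requested allocation that the job never used."""
    return wallclock_limit - release_offset(wallclock_limit, run_time)


# The example from the text: a 1 hour limit, finished in 20 minutes.
print(release_offset(3600, 1200))      # 1200
print(unused_allocation(3600, 1200))   # 2400
```

The scheduler plans with the wallclock limit, while the simulation replays the recorded run time, which is why resources can free up earlier than the queue output suggested.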
Stop the simulation at iteration 6.
The -s 6I argument indicates that the scheduler will stop at iteration 6 and will (I)gnore user input until it gets there. This prevents the possibility of obtaining showq output from iteration 5 rather than iteration 6.
Job fr8n01.804.0 is still 2 minutes 30 seconds away from completing, as expected, but notice that both jobs fr8n01.189.0 and fr8n01.191.0 have completed early. Although they had almost 24 hours of wallclock limit remaining, they terminated. In reality, they probably failed on the real-world system where the trace file was recorded. Their completion freed up 40 processors, which the scheduler was able to use immediately by starting several more jobs.
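The remaining times quoted in this tutorial follow from the iteration count. The sketch below assumes, as the tutorial's numbers suggest, that the first iteration is a sanity check that does not advance the clock, that every later iteration advances it by RMPOLLINTERVAL (30 seconds by default), and that fr8n01.804.0 is the job that showed 5 minutes remaining. The function names are hypothetical.

```python
def simulated_elapsed(iteration: int, poll_interval: int = 30) -> int:
    """Simulated seconds elapsed once the given iteration has run.

    The first iteration is treated as a sanity check that does not
    advance the clock (an assumption drawn from the tutorial output).
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return max(iteration - 1, 0) * poll_interval


def remaining(initial: int, iteration: int, poll_interval: int = 30) -> int:
    """Seconds left for a job that started with `initial` seconds remaining."""
    return max(initial - simulated_elapsed(iteration, poll_interval), 0)


def fmt(seconds: int) -> str:
    """Format seconds as m:ss."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


for it in (1, 2, 6):
    print(it, fmt(remaining(300, it)))
# 1 5:00
# 2 4:30
# 6 2:30
```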
Note the system statistics:
A few more fields are filled in now that some jobs have completed providing information on which to generate statistics.
Decrease the default LOGLEVEL with mschedctl -m to avoid unnecessary logging and speed up the simulation.
You can use mschedctl -m to immediately change the value of any parameter. The change is only made to the currently running Moab server and is not propagated to the configuration file. Changes can also be made by modifying the configuration file and restarting the scheduler.
Stop at iteration 580 and pull up the scheduler's statistics.
You may notice that showq hangs for a while as the scheduler simulates up to iteration 580. The output shows that currently only 156 of the 196 nodes are busy, yet at first glance three jobs (fr8n01.963.0, fr8n01.1075.0, and fr8n01.1076.0) appear to be ready to run.
The checkjob command reveals that job fr8n01.963.0 found only 20 of 32 processors. The remaining 20 idle processors could not be used because the configured memory on the node did not meet the job's requirements. The other jobs cannot find enough nodes because of ReserveTime. This indicates that the processors are idle but have a reservation in place that will start before the job being checked could complete.
Use the mdiag -n command to verify that the idle nodes either do not have enough memory configured or are already reserved. This command provides detailed information about the state of the nodes Moab is currently tracking. The mdiag command can be used with various flags to obtain detailed information about accounts, fair share, groups, jobs, nodes, QoS, queues, reservations, the resource manager, and users. The command also performs a number of sanity checks on the data provided and presents warning messages if discrepancies are detected.
The grep filter keeps the command header and the idle nodes. All the idle nodes with 256 MB of memory installed already have a reservation. (See the Rsv column.) The rest of the idle nodes have only 128 MB of memory.
Running checknode reveals that job fr8n01.963.0 holds the reservation.
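The reasoning applied to the node listing can be expressed as a small filter. The node records below are made up for illustration and are not taken from the sample traces; the class and function names are hypothetical, and the reason labels simply mirror the terms used in this section (state, memory, ReserveTime).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    name: str
    state: str        # e.g. "Idle" or "Busy"
    memory_mb: int    # configured memory
    reserved: bool    # True if a reservation already covers the node


def rejection_reason(node: Node, required_mb: int) -> str:
    """Return why a node cannot take the job, or "" if it can."""
    if node.state != "Idle":
        return "State"
    if node.memory_mb < required_mb:
        return "Memory"
    if node.reserved:
        return "ReserveTime"
    return ""


def summarize(nodes: List[Node], required_mb: int) -> Dict[str, int]:
    """Count nodes per rejection reason; usable nodes are counted as "Usable"."""
    counts: Dict[str, int] = {}
    for node in nodes:
        reason = rejection_reason(node, required_mb) or "Usable"
        counts[reason] = counts.get(reason, 0) + 1
    return counts


# Hypothetical nodes: the 256 MB idle node is reserved, the rest are too small.
nodes = [
    Node("n1", "Busy", 256, True),
    Node("n2", "Idle", 256, True),
    Node("n3", "Idle", 128, False),
    Node("n4", "Idle", 128, False),
]
print(summarize(nodes, required_mb=256))
# {'State': 1, 'ReserveTime': 1, 'Memory': 2}
```

No node ends up in the "Usable" bucket, which matches the situation described above: every idle node is either reserved or short on memory.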
Moving ahead:
We now know that the scheduler is scheduling efficiently. So far, system utilization as reported by showstats -v looks very good. An important and subjective question is whether the scheduler is scheduling fairly. Look at the user and group statistics to see if there are any glaring problems.
Suppose you now need to take down the entire system for maintenance on Thursday from 2:00 to 8:00 a.m. To do this, create a reservation with setres.
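When planning such a window, it can help to compute its start, end, and duration explicitly. The sketch below finds the next Thursday 2:00 to 8:00 a.m. window after a given time; the function name and the sample date are hypothetical. In a simulation, the reference time should be the simulated clock (which follows SIMSTARTTIME) rather than the system clock. The setres syntax itself is not shown here.

```python
from datetime import datetime, timedelta
from typing import Tuple

THURSDAY = 3  # datetime.weekday(): Monday is 0


def next_thursday_window(now: datetime, start_hour: int = 2,
                         end_hour: int = 8) -> Tuple[datetime, datetime]:
    """Return (start, end) of the next Thursday maintenance window after `now`."""
    days_ahead = (THURSDAY - now.weekday()) % 7
    start = (now + timedelta(days=days_ahead)).replace(
        hour=start_hour, minute=0, second=0, microsecond=0)
    if start <= now:
        # This week's window has already begun; use next week's.
        start += timedelta(days=7)
    end = start.replace(hour=end_hour)
    return start, end


# Sample reference time: Monday, 1 January 2024, 12:00 (placeholder).
start, end = next_thursday_window(datetime(2024, 1, 1, 12, 0))
print(start, end, end - start)
# 2024-01-04 02:00:00 2024-01-04 08:00:00 6:00:00
```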