TORQUE Administrator's Manual - 2.6 Job Checkpoint and Restart
2.6 Job Checkpoint and Restart
While TORQUE has had a job checkpoint and restart capability
for many years, this was tied to machine-specific features.
An architecture-independent package that provides process
checkpoint and restart is now available. The package is BLCR
(Berkeley Lab Checkpoint/Restart), and TORQUE now supports it.
Introduction to BLCR
BLCR is a kernel-level package. It must be downloaded from the
BLCR project and installed separately.
After the package is built, its kernel modules must be loaded
into the running kernel. The modules can be listed in the file
/etc/modules so that they load at boot, but all of the testing
was done with explicit invocations of modprobe.
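The module-loading step can be sketched as follows. The module names blcr_imports and blcr are an assumption based on BLCR 0.8.x; other releases may ship a different set, so check what your build actually installed. The commands are wrapped in functions only so the file can be sourced without side effects.

```shell
# Sketch: load the BLCR kernel modules by hand (run as root).
# Module names are assumed from BLCR 0.8.x and may differ in your release.
load_blcr_modules() {
    modprobe blcr_imports &&
    modprobe blcr
}

# Succeeds if at least one module whose name starts with "blcr" is loaded.
blcr_modules_loaded() {
    lsmod | grep -q '^blcr'
}
```

Calling load_blcr_modules followed by blcr_modules_loaded gives a quick check that the kernel side is in place before pbs_mom is started.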
The BLCR system provides four command line utilities:
cr_checkpoint, cr_info, cr_restart and cr_run.
Note that the support for BLCR is at the Beta stage and must be
enabled explicitly when configuring and building TORQUE.
BLCR support is available in the 2.3 and 2.4 (trunk)
versions of TORQUE. The 2.4 version is documented below because
the command line syntax, the number of parameters passed to the
checkpoint script, and the configurable options available in the
mom config file all changed between 2.3 and 2.4.
The following shows the configure options used in
developing and testing the implementation of support for BLCR.
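A minimal sketch of such a build is shown here. It assumes only the --enable-blcr switch; any other options used during the original testing are not reproduced, and the install prefix is left at its default.

```shell
# Sketch: configure, build and install TORQUE with BLCR support.
# Run from the top of the TORQUE source tree; "make install" usually
# needs root. Only --enable-blcr is assumed here.
build_torque_with_blcr() {
    ./configure --enable-blcr &&
    make &&
    make install
}
```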
Configuration files and scripts
The pbs_mom config file located in /var/spool/torque/mom_priv must
be modified to identify the script names associated with invoking the
BLCR commands. The following variables should be used in the config
file when using BLCR checkpointing.
$checkpoint_interval - How often periodic job checkpoints will be taken (minutes).
$checkpoint_script - The name of the script file to execute to perform a job checkpoint.
$restart_script - The name of the script file to execute to perform a job restart.
$checkpoint_run_exe - The name of an executable program to be run when starting a checkpointable job (for BLCR, cr_run).
The following example shows the contents of the config
file used for testing the BLCR feature in TORQUE.
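As an illustration, the four variables might be set as in the sketch below. The script names, the cr_run path and the 5-minute interval are hypothetical placeholders, and MOM_CONFIG defaults to a scratch file so that the snippet does not overwrite a live config file in /var/spool/torque/mom_priv.

```shell
# Sketch: write a pbs_mom config fragment for BLCR checkpointing.
# All paths and the interval value below are placeholders.
MOM_CONFIG="${MOM_CONFIG:-./mom_config.example}"

cat > "$MOM_CONFIG" <<'EOF'
$checkpoint_interval 5
$checkpoint_script /var/spool/torque/mom_priv/blcr_checkpoint_script
$restart_script /var/spool/torque/mom_priv/blcr_restart_script
$checkpoint_run_exe /usr/local/bin/cr_run
EOF
```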
Note that the checkpoint and restart script files must
be executable by the user. Be sure to use chmod to set their permissions
to 754.
Starting a checkpointable job
Not every job is checkpointable.
A job for which checkpointing is desirable must be started
with the -c command line option. This option takes
a comma-separated list of arguments that control
checkpointing behavior. The list of valid options is shown below.
Note that there is an older and a newer version of the syntax.
The older version uses single characters, some of which imply
multiple options, while the newer version spells out each option in full.
The new syntax is preferred because it allows more precise control of the behavior.
It is present in the 2.4 version of TORQUE and is the only syntax documented below.
none - No checkpointing (not highly useful but included for completeness).
enabled - Specify that checkpointing is allowed but must be explicitly invoked
by either the qhold or qchkpt commands.
shutdown - Specify that checkpointing is to be done on a job at pbs_mom shutdown.
periodic - Specify that periodic checkpointing is enabled.
The default interval is 10 minutes and can be changed by the $checkpoint_interval option in the mom config file or
by specifying an interval when the job is submitted.
interval=minutes - Specify the checkpoint interval in minutes.
depth=number - Specify a number (depth) of checkpoint images to be kept in the checkpoint directory.
dir=path - Specify a checkpoint directory (default is /var/spool/torque/checkpoint).
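The options in the list above can be combined in a single -c argument. The following sketch wraps two such submissions in functions; the interval of 5, the depth of 3 and the /tmp directory are illustrative values, and the job script name is whatever is passed as the first argument.

```shell
# Sketch: submit a job that can be checkpointed on demand, at pbs_mom
# shutdown, and every 5 minutes, keeping 3 checkpoint images.
submit_checkpointable() {
    qsub -c enabled,shutdown,periodic,interval=5,depth=3 "$1"
}

# Sketch: allow on-demand checkpoints and store the images in /tmp
# instead of the default checkpoint directory.
submit_with_ckpt_dir() {
    qsub -c enabled,dir=/tmp "$1"
}
```

For example, submit_checkpointable job.sh submits a hypothetical script named job.sh.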
If you have no scheduler running, you might need to start the job with
qrun.
As the job runs, it writes its output to a file in /var/spool/torque/spool.
This file can be observed with the command tail -f.
Checkpointing a job
Jobs are checkpointed by issuing a qhold command.
This causes an image file representing the state of the process to
be written to disk. The directory by default is "/var/spool/torque/checkpoint".
This default can be altered at the queue level with the qmgr
command. For example, the command qmgr -c 'set queue batch checkpoint_dir=/tmp'
would change the checkpoint directory to /tmp.
The checkpoint directory can also be specified at job submission time with the
-c dir=/tmp command line option.
The name of the checkpoint directory and the name of the checkpoint image file
become attributes of the job and can be observed with the command qstat -f.
Notice in the output the names checkpoint_dir and checkpoint_name.
The variable checkpoint_name is set when the image file is created and will
not exist if no checkpoint has been taken.
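A small filter makes these two attributes easier to spot. The sketch below assumes that qstat -f prints each attribute on its own line as "name = value" with leading whitespace; the job id in the usage comment is hypothetical.

```shell
# Sketch: keep only the checkpoint-related attributes from "qstat -f"
# output read on standard input.
checkpoint_attrs() {
    grep -E '^[[:space:]]*checkpoint_(dir|name)[[:space:]]*='
}

# Usage (job id 42 is hypothetical):
#   qstat -f 42 | checkpoint_attrs
```

If only checkpoint_dir is printed, no checkpoint image has been written for the job yet.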
A job can also be checkpointed without stopping or holding the job with the
command qchkpt.
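The two on-demand paths can be sketched side by side. Both take a job id as their only argument and assume the job was submitted with checkpointing enabled through the -c option.

```shell
# Sketch: checkpoint the job and leave it in the Held state.
checkpoint_and_hold() {
    qhold "$1"
}

# Sketch: checkpoint the job and let it keep running.
checkpoint_and_continue() {
    qchkpt "$1"
}
```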
Periodic job checkpoints
A job can have checkpoints taken at a regular interval without causing the
job to be terminated. This option is enabled by starting the job with the
qsub -c periodic,interval=n job-script syntax.
The n argument specifies the number of minutes between checkpoints.
See the manual section on qsub for more details.
Restarting a job in the Held state
The qrls command is used to restart the hibernated job. If you were
using the tail -f command to watch the output file, you will see the test
program start counting again.
It is possible to use the qalter command to change the name of the
checkpoint file associated with a job. This could be useful if
several job checkpoints were taken and it was desired to restart the job from an
older image.
Restarting a job in the Completed state
In this case, the job must first be moved to the Queued state with the
qrerun command. The job must then
go to the Run state, either by action of the scheduler or, if there is no
scheduler, by using the qrun command.
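The two steps can be sketched as one helper. NO_SCHEDULER is a hypothetical variable introduced only for this example: set it to 1 on a system without a scheduler so that the job is started by hand.

```shell
# Sketch: restart a job that is in the Completed state.
# $1 is the job id. NO_SCHEDULER is a made-up switch for this example.
restart_completed_job() {
    # Move the job back to the Queued state.
    qrerun "$1" || return 1
    # With a scheduler running, the job is picked up from here on its own.
    if [ "${NO_SCHEDULER:-0}" = 1 ]; then
        qrun "$1"
    fi
}
```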
Acceptance tests
A number of tests were made to verify the functioning of the BLCR implementation.
See tests-2.3 or
tests-2.4 for a description of the testing.