TORQUE Resource Manager

TORQUE Administrator's Manual - 2.6 Job Checkpoint and Restart

2.6 Job Checkpoint and Restart

TORQUE has had a job checkpoint and restart capability for many years, but it was tied to machine-specific features. An architecture-independent package that provides process checkpoint and restart, BLCR (Berkeley Lab Checkpoint/Restart), is now available, and TORQUE supports it.

Introduction to BLCR

BLCR is a kernel-level package. It must be downloaded from the BLCR project and installed.

After the package is built, its modules must be loaded into the kernel with commands like the following. The modules can also be listed in the file /etc/modules, but all of the testing was done by loading the modules explicitly.

Installing BLCR into the kernel
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_imports.ko
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_vmadump.ko
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr.ko
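
One way to confirm that the three modules are loaded is to filter the output of lsmod. The following is a minimal sketch, not part of BLCR or TORQUE: missing_blcr_modules is a hypothetical helper name, and the module names are assumed to match the .ko file names shown above.

```shell
# Print each BLCR module that is absent from lsmod-style input on stdin.
# Assumption: module names match the .ko files loaded above.
missing_blcr_modules() (
    # The first column of lsmod output is the module name.
    loaded=$(awk '{ print $1 }')
    for mod in blcr_imports blcr_vmadump blcr; do
        # -x requires a whole-line match so "blcr" does not match "blcr_imports".
        printf '%s\n' "$loaded" | grep -qx "$mod" || printf '%s\n' "$mod"
    done
)
```

Running lsmod | missing_blcr_modules prints nothing when all three modules are present.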

The BLCR system provides four command-line utilities: cr_checkpoint, cr_info, cr_restart, and cr_run.
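
Before configuring TORQUE, it can be useful to verify that these utilities are on the PATH. The sketch below assumes nothing about BLCR itself; missing_commands is a hypothetical helper that reports any named command the shell cannot find.

```shell
# Print each named command that cannot be found on the PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

For example, missing_commands cr_checkpoint cr_info cr_restart cr_run prints the names of any utilities that are not installed where the shell can find them.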

For more info about BLCR, see the BLCR Admin Guide.

Note that support for BLCR is at the beta stage and must be enabled explicitly when configuring and building TORQUE. BLCR support is available in the 2.3 and 2.4 (trunk) versions of TORQUE. The 2.4 version is documented below because the command-line syntax changed between 2.3 and 2.4, as did the number of parameters passed to the checkpoint script and the configurable options available in the mom config file.

The following shows the configure options used in developing and testing the implementation of support for BLCR.

Configuring and Building TORQUE for BLCR
>  ./configure --enable-unixsockets=no --enable-blcr
>  make
>  sudo make install

Configuration files and scripts

The pbs_mom config file, located in /var/spool/torque/mom_priv, must be modified to identify the scripts that invoke the BLCR commands. The following variables should be set in the config file when using BLCR checkpointing.

  • $checkpoint_interval - How often periodic job checkpoints will be taken (minutes).
  • $checkpoint_script - The name of the script file to execute to perform a job checkpoint.
  • $restart_script - The name of the script file to execute to perform a job restart.
  • $checkpoint_run_exe - The name of an executable program to be run when starting a checkpointable job (for BLCR, cr_run).

The following examples show the file permissions, config file, and scripts used for testing the BLCR feature in TORQUE.

Note that the script files must be executable by their owner (root in the listing below). Use chmod to set the permissions to 754.

Script file permissions
# chmod 754 blcr*
# ls -l
total 20
-rwxr-xr-- 1 root root 2112 2008-03-11 13:14 blcr_checkpoint_script
-rwxr-xr-- 1 root root 1987 2008-03-11 13:14 blcr_restart_script
-rw-r--r-- 1 root root  215 2008-03-11 13:13 config
drwxr-x--x 2 root root 4096 2008-03-11 13:21 jobs
-rw-r--r-- 1 root root    7 2008-03-11 13:15 mom.lock

mom_priv/config
$checkpoint_script  /var/spool/torque/mom_priv/blcr_checkpoint_script
$restart_script  /var/spool/torque/mom_priv/blcr_restart_script
$checkpoint_run_exe /usr/local/bin/cr_run
$pbsserver makua.cridomain
$loglevel 7
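
A misconfigured path in this file is easy to miss, so a quick sanity check can help. The following is a minimal sketch under stated assumptions: check_mom_config is a hypothetical helper, it looks only at the three checkpoint-related directives described above, and it assumes the paths contain no whitespace.

```shell
# Report any script or executable named in a pbs_mom config file that is
# missing or not executable. Returns non-zero if a problem was found.
# Assumption: paths in the config file contain no whitespace.
check_mom_config() (
    config=$1
    status=0
    # Keep only the checkpoint-related directives; field 2 is the path.
    paths=$(awk '$1 == "$checkpoint_script" || $1 == "$restart_script" ||
                 $1 == "$checkpoint_run_exe" { print $2 }' "$config")
    for path in $paths; do
        if [ ! -x "$path" ]; then
            printf 'not executable: %s\n' "$path"
            status=1
        fi
    done
    return "$status"
)
```

It would be run as check_mom_config /var/spool/torque/mom_priv/config by the same user that runs pbs_mom.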

mom_priv/blcr_checkpoint_script
#! /usr/bin/perl
################################################################################
#
# Usage: checkpoint_script <sessionId> <jobId> <userId> <checkpointDir> <checkpointName> <signalNum> <depth>
#
# This script is invoked by pbs_mom to checkpoint a job.
#
################################################################################
use strict;
use Sys::Syslog;

# Log levels:
# 0 = none -- no logging
# 1 = fail -- log only failures
# 2 = info -- log invocations
# 3 = debug -- log all subcommands
my $logLevel = 3;

logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");

my ($sessionId, $jobId, $userId, $checkpointDir, $checkpointName, $signalNum, $depth);
my $usage =
  "Usage: $0 <sessionId> <jobId> <userId> <checkpointDir> <checkpointName> <signalNum> <depth>\n";

# Note that depth is not used in this script but could control a limit to the number of checkpoint
# image files that are preserved on the disk.
#
# Note also that a request was made to identify whether this script was invoked by the job's
# owner or by a system administrator.  While this information is known to pbs_server, it
# is not propagated to pbs_mom and thus it is not possible to pass this to the script.
# Therefore, a workaround is to invoke qmgr and attempt to set a trivial variable.
# This will fail if the invoker is not a manager.

if (@ARGV == 7)
{
    ($sessionId, $jobId, $userId, $checkpointDir, $checkpointName, $signalNum, $depth) =
      @ARGV;
}
else { logDie(1, $usage); }

# Change to the checkpoint directory where we want the checkpoint to be created.
# The chdir must run regardless of $logLevel; logDie decides whether to log.
chdir $checkpointDir
  or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");

my $cmd = "cr_checkpoint";
$cmd .= " --signal $signalNum" if $signalNum;
$cmd .= " --tree $sessionId";
$cmd .= " --file $checkpointName";
my $output = `$cmd 2>&1`;
my $rc     = $? >> 8;
# Always fail on a non-zero exit code; logPrint itself honors $logLevel.
logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output") if $rc;
logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output");
exit 0;

################################################################################
# logPrint($level, $message)
# Write a message to syslog if $level does not exceed $logLevel
################################################################################
sub logPrint
{
    my ($level, $message) = @_;
    my @severity = ('none', 'warning', 'info', 'debug');

    return if $level > $logLevel;

    openlog('checkpoint_script', '', 'user');
    syslog($severity[$level], '%s', $message);    # '%s' keeps any % in the message literal
    closelog();
}

################################################################################
# logDie($level, $message)
# Write a message (to syslog) and die
################################################################################
sub logDie
{
    my ($level, $message) = @_;

    logPrint($level, $message);
    die($message);
}
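
The comment in the script above notes that depth is unused but could limit how many checkpoint images are kept on disk. The following is one possible sketch of such pruning, written in shell rather than as part of the Perl script. It is not TORQUE's implementation: prune_checkpoints is a hypothetical helper, and it assumes that every regular file in the checkpoint directory is a checkpoint image and that file names contain no newlines.

```shell
# Keep only the newest "depth" files in a checkpoint directory.
# Assumptions: every regular file in the directory is a checkpoint image,
# and file names contain no newlines.
prune_checkpoints() (
    dir=$1
    depth=$2
    cd "$dir" || exit 1
    # ls -t lists newest first; tail skips the first "depth" entries.
    ls -t | tail -n +"$((depth + 1))" | while IFS= read -r name; do
        if [ -f "$name" ]; then
            rm -f -- "$name"
        fi
    done
)
```

For example, prune_checkpoints /var/spool/torque/checkpoint 3 would remove all but the three most recently modified files in that directory, so it should only be pointed at a directory that holds nothing else.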

mom_priv/blcr_restart_script
#! /usr/bin/perl
################################################################################
#
# Usage: restart_script <sessionId> <jobId> <userId> <checkpointDir> <restartName>
#
# This script is invoked by pbs_mom to restart a job.
#
################################################################################
use strict;
use Sys::Syslog;

# Log levels:
# 0 = none -- no logging
# 1 = fail -- log only failures
# 2 = info -- log invocations
# 3 = debug -- log all subcommands
my $logLevel = 3;

logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");

my ($sessionId, $jobId, $userId, $checkpointDir, $restartName);
my $usage =
  "Usage: $0 <sessionId> <jobId> <userId> <checkpointDir> <restartName>\n";
if (@ARGV == 5)
{
    ($sessionId, $jobId, $userId, $checkpointDir, $restartName) =
      @ARGV;
}
else { logDie(1, $usage); }

# Change to the checkpoint directory that holds the checkpoint image.
# The chdir must run regardless of $logLevel; logDie decides whether to log.
chdir $checkpointDir
  or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");


my $cmd = "cr_restart";
$cmd .= " $restartName";
my $output = `$cmd 2>&1`;
my $rc     = $? >> 8;
# Always fail on a non-zero exit code; logPrint itself honors $logLevel.
logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output") if $rc;
logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output");
exit 0;

################################################################################
# logPrint($level, $message)
# Write a message to syslog if $level does not exceed $logLevel
################################################################################
sub logPrint
{
    my ($level, $message) = @_;
    my @severity = ('none', 'warning', 'info', 'debug');

    return if $level > $logLevel;

    openlog('restart_script', '', 'user');
    syslog($severity[$level], '%s', $message);    # '%s' keeps any % in the message literal
    closelog();
}

################################################################################
# logDie($level, $message)
# Write a message (to syslog) and die
################################################################################
sub logDie
{
    my ($level, $message) = @_;

    logPrint($level, $message);
    die($message);
}


Starting a checkpointable job

Not every job is checkpointable. A job for which checkpointing is desirable must be started with the -c command-line option. This option takes a comma-separated list of arguments that control checkpointing behavior. The list of valid options is shown below.

Note that there is an older and a newer version of the syntax. The older version uses single characters, while the newer one spells out each option in full. In the older version, some single characters implied multiple options. The new syntax is preferred because it allows more precise control of the behavior. It is present in the 2.4 version of TORQUE, and only the new syntax is documented below.

  • none - No checkpointing (not highly useful but included for completeness).
  • enabled - Specify that checkpointing is allowed but must be explicitly invoked by either the qhold or qchkpt commands.
  • shutdown - Specify that checkpointing is to be done on a job at pbs_mom shutdown.
  • periodic - Specify that periodic checkpointing is enabled. The default interval is 10 minutes and can be changed by the $checkpoint_interval option in the mom config file or by specifying an interval when the job is submitted.
  • interval=minutes - Specify the checkpoint interval in minutes.
  • depth=number - Specify a number (depth) of checkpoint images to be kept in the checkpoint directory.
  • dir=path - Specify a checkpoint directory (default is /var/spool/torque/checkpoint).
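
Because the options are passed as a single comma-separated string, a small wrapper can catch typos before the job is submitted. The following is a minimal sketch, assuming only the option names listed above; join_ckpt_opts is a hypothetical helper, and the requirement that interval and depth values be plain digits is this sketch's own assumption.

```shell
# Join checkpoint options into the comma-separated form passed to qsub -c,
# rejecting anything outside the list documented above.
join_ckpt_opts() (
    joined=
    for opt in "$@"; do
        case $opt in
            none|enabled|shutdown|periodic) ;;
            interval=*|depth=*)
                # Assumption: the value must be a non-empty run of digits.
                case ${opt#*=} in
                    ''|*[!0-9]*) printf 'bad value: %s\n' "$opt" >&2; exit 1 ;;
                esac ;;
            dir=?*) ;;
            *) printf 'unknown option: %s\n' "$opt" >&2; exit 1 ;;
        esac
        # Add a comma only between options, not before the first one.
        joined=${joined:+$joined,}$opt
    done
    printf '%s\n' "$joined"
)
```

It could be used as qsub -c "$(join_ckpt_opts enabled periodic interval=1)" test.sh.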

Sample test program
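#include <stdio.h>
#include <unistd.h>   /* sleep() */

int main(int argc, char *argv[])
{
    int i;

    /* Print a counter once per second so progress is visible across a restart. */
    for (i = 0; i < 100; i++)
    {
        printf("i = %d\n", i);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}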

Instructions for building test program
>  gcc -o test test.c

Sample test script
#!/bin/bash

./test

Starting the test job
>  qstat
>  qsub -c enabled,periodic,shutdown,interval=1 test.sh
77.jakaa.cridomain
>  qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
77.jakaa                  test.sh          jsmith                 0 Q batch          
>  

If you have no scheduler running, you might need to start the job with qrun.

As this program runs, it writes its output to a file in /var/spool/torque/spool. This file can be observed with the command tail -f.

Checkpointing a job

Jobs are checkpointed by issuing a qhold command. This causes an image file representing the state of the process to be written to disk. The directory by default is "/var/spool/torque/checkpoint".

This default can be altered at the queue level with the qmgr command. For example, the command qmgr -c 'set queue batch checkpoint_dir=/tmp' would change the checkpoint directory to /tmp.

The checkpoint directory can also be specified at job submission time with the -c dir=/tmp command-line option.

The name of the checkpoint directory and the name of the checkpoint image file become attributes of the job and can be observed with the command qstat -f. Notice in the output the names checkpoint_dir and checkpoint_name. The variable checkpoint_name is set when the image file is created and will not exist if no checkpoint has been taken.
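
To pull just these two attributes out of the qstat -f output, a small filter is enough. The sketch below assumes each attribute appears on a single line in the form "name = value"; job_attr is a hypothetical helper, and a value that qstat wraps onto a continuation line would be cut short.

```shell
# Print the value of one job attribute from qstat -f style input on stdin.
# Assumption: the attribute is on one line, formatted as "name = value".
job_attr() (
    attr=$1
    # Strip the optional leading whitespace and "name = ", print what remains.
    sed -n "s/^[[:space:]]*$attr = //p"
)
```

For a job with id 77, qstat -f 77 | job_attr checkpoint_name prints the image file name, or nothing if no checkpoint has been taken yet.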

A job can also be checkpointed without stopping or holding the job with the command qchkpt.

Periodic job checkpoints

A job can have checkpoints taken at a regular interval without causing the job to be terminated. This option is enabled by starting the job with the qsub -c periodic,interval=n job-script syntax. The n argument specifies the number of minutes between checkpoints. See the manual section on qsub for more details.

Restarting a job in the Held state

The qrls command is used to restart the hibernated job. If you were using the tail -f command to watch the output file, you will see the test program start counting again.

It is possible to use the qalter command to change the name of the checkpoint file associated with a job. This could be useful if several job checkpoints had been taken and the job needed to be restarted from an older image.

Restarting a job in the Completed state

In this case, the job must first be moved to the Queued state with the qrerun command. The job then goes to the Run state either by action of the scheduler or, if there is no scheduler, through the qrun command.
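
The two restart paths can be summarized by the job state letter shown in the S column of qstat. The following is a minimal sketch; restart_command is a hypothetical helper, and it covers only the Held (H) and Completed (C) states discussed in this section.

```shell
# Map a job state letter to the command that begins a restart.
# H (Held) uses qrls; C (Completed) uses qrerun, after which the scheduler
# or qrun must still move the job to the Run state.
restart_command() {
    case $1 in
        H) echo qrls ;;
        C) echo qrerun ;;
        *) return 1 ;;
    esac
}
```

Any other state returns a non-zero status, since no restart from a checkpoint applies.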

Acceptance tests

A number of tests were made to verify the functioning of the BLCR implementation. See tests-2.3 or tests-2.4 for a description of the testing.