Acceptance tests for BLCR support in Torque

Date of last revision: Wed Mar 12 11:58:23 MDT 2008

This document describes a set of tests to be performed to determine if BLCR is correctly supported in its Torque implementation.

Note that the location of the checkpoint image files can be configured explicitly. This is done either by specifying an option on job submission, e.g. -W checkpoint_dir=/home/test, or by setting an attribute on the execution queue with the command qmgr -c 'set queue batch checkpoint_dir=/home/test'.

Rather than duplicating every test with this option as the only difference, the tests are described once and then run under both scenarios: with the default directory and with a specified directory. The test summary below has a column showing the results for each case.
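
As an illustration of running one test under both scenarios, the following is a minimal sketch. The function name submit_command and the scenario labels "default" and "specified" are hypothetical helpers introduced here; only the qsub option itself comes from the text above. The function prints the command line rather than running it, so it can be inspected without a Torque installation.

```shell
#!/bin/bash
# Print the qsub command line for one scenario (hypothetical helper).
# "default"   : no checkpoint directory given, Torque uses its default.
# "specified" : directory passed per job with -W checkpoint_dir=...
submit_command() {
    local scenario=$1 script=$2 dir=${3:-/home/test}
    case "$scenario" in
        default)   echo "qsub $script" ;;
        specified) echo "qsub -W checkpoint_dir=$dir $script" ;;
        *)         echo "unknown scenario: $scenario" >&2; return 1 ;;
    esac
}

# Show the submission command for each scenario in turn.
for scenario in default specified; do
    submit_command "$scenario" test.sh
done
```

The queue-attribute variant of the specified-directory scenario needs no change to the qsub command line, since the directory is then set once with qmgr.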

Test Summary

Test Number  Description                        Result w/Default Dir  Result w/Specified Dir
Test 1       Basic Operation                    Pass                  Pass
Test 2       Persistence of checkpoint images   Pass                  Pass
Test 3       Restart after checkpoint           Pass                  Pass
Test 4       Multiple checkpoint/restart        Pass                  Pass
Test 5       Periodic checkpoint                Pass                  Pass
Test 6       Restart from previous image        Pass                  Pass
Test 7       Restart from a completed job       Pass                  Pass

Note about Test 7, Restart from a completed job: no requirements document has been developed for this project, so it is hard to determine whether this particular test scenario is correctly formulated.

Test environment

All these tests assume the following test program and shell script, test.sh.

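#include <stdio.h>
#include <unistd.h>

/* Print a counter once per second so that progress shows up in the job output. */
int main( int argc, char *argv[] )
{
int i;

    for (i=0; i<100; i++)
    {
        printf("i = %d\n", i);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}

#!/bin/bash

/home/test/test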

Test 1 - Basic operation

Introduction

This test determines if the proper environment has been established.

Test Steps

Submit a test job and then issue a hold on the job.

> qsub test.sh
999.xxx.yyy
> qhold 999

Possible Failures

Normally qhold produces no output. If an error message is produced saying that qhold is not a supported feature, then one of the following configuration errors might be present.

Successful Results

If no specific directory was configured for the checkpoint files, the default location is the checkpoint directory under the Torque directory, which in this test environment is /var/spool/torque/checkpoint.

Otherwise, go to the directory specified for the checkpoint image files. This was done either by specifying an option on job submission, e.g. -W checkpoint_dir=/home/test, or by setting an attribute on the execution queue with the command qmgr -c 'set queue batch checkpoint_dir=/home/test'.

Doing a directory listing shows the following.


# find /var/spool/torque/checkpoint
/var/spool/torque/checkpoint
/var/spool/torque/checkpoint/999.xxx.yyy.CK
/var/spool/torque/checkpoint/999.xxx.yyy.CK/ckpt.999.xxx.yyy.1205266630
# find /var/spool/torque/checkpoint |xargs ls -l
-r-------- 1 root root 543779 2008-03-11 14:17 /var/spool/torque/checkpoint/999.xxx.yyy.CK/ckpt.999.xxx.yyy.1205266630

/var/spool/torque/checkpoint:
total 4
drwxr-xr-x 2 root root 4096 2008-03-11 14:17 999.xxx.yyy.CK

/var/spool/torque/checkpoint/999.xxx.yyy.CK:
total 536
-r-------- 1 root root 543779 2008-03-11 14:17 ckpt.999.xxx.yyy.1205266630
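
The trailing number in the image name looks like a Unix epoch timestamp. That is an assumption drawn from the listing above, not something this document states. Under that assumption, the following sketch extracts the field and renders it as a time. The helper names ckpt_epoch and epoch_to_utc are hypothetical, and the date invocation uses GNU date syntax.

```shell
#!/bin/bash
# Extract the trailing numeric field of a checkpoint image name.
# Assumption: the field is a Unix epoch timestamp in seconds.
ckpt_epoch() {
    local name=${1##*/}      # strip any leading directory
    echo "${name##*.}"       # keep what follows the last dot
}

# Render an epoch value as UTC (GNU date syntax).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

image=/var/spool/torque/checkpoint/999.xxx.yyy.CK/ckpt.999.xxx.yyy.1205266630
epoch_to_utc "$(ckpt_epoch "$image")"
```

For the image shown above this prints 2008-03-11 20:17:10 UTC, which is 14:17:10 MDT and agrees with the 14:17 modification time in the listing.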

Doing a qstat -f command should show the job in a held state, job_state = H. Note that the attribute checkpoint_name is set to the name of the file seen above.

If a checkpoint directory has been specified, there will also be an attribute checkpoint_dir in the output of qstat -f.


$ qstat -f
Job Id: 999.xxx.yyy
    Job_Name = test.sh
    Job_Owner = test@xxx.yyy
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:06
    job_state = H
    queue = batch
    server = xxx.yyy
    Checkpoint = u
    ctime = Tue Mar 11 14:17:04 2008
    Error_Path = xxx.yyy:/home/test/test.sh.e999
    exec_host = test/0
    Hold_Types = u
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Mar 11 14:17:10 2008
    Output_Path = xxx.yyy:/home/test/test.sh.o999
    Priority = 0
    qtime = Tue Mar 11 14:17:04 2008
    Rerunable = True
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    session_id = 9402
    substate = 20
    Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=test,
        PBS_O_PATH=/usr/local/perltests/bin:/home/test/bin:/usr/local/s
        bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games,
        PBS_O_SHELL=/bin/bash,PBS_SERVER=xxx.yyy,
        PBS_O_HOST=xxx.yyy,PBS_O_WORKDIR=/home/test,
        PBS_O_QUEUE=batch
    euser = test
    egroup = test
    hashname = 999.xxx.yyy
    queue_rank = 3
    queue_type = E
    comment = Job started on Tue Mar 11 at 14:17
    exit_status = 271
    submit_args = test.sh 
    start_time = Tue Mar 11 14:17:04 2008
    start_count = 1
    checkpoint_name = ckpt.999.xxx.yyy.1205266630
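
The two attributes of interest can be picked out of this output mechanically. The following is a minimal sketch; the helper names qstat_attr and held_with_checkpoint are hypothetical, and the parsing assumes the "name = value" layout shown above. Wrapped values such as Variable_List are not reassembled.

```shell
#!/bin/bash
# Print the value of one attribute from `qstat -f` style text on stdin.
# Only single-line "name = value" attributes are handled.
qstat_attr() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # drop leading indentation
            pos = index(line, " = ")
            if (pos > 0 && substr(line, 1, pos - 1) == key) {
                print substr(line, pos + 3)
                exit
            }
        }'
}

# Succeed only if the job is held and a checkpoint image name is recorded.
held_with_checkpoint() {
    local text state name
    text=$(cat)
    state=$(printf '%s\n' "$text" | qstat_attr job_state)
    name=$(printf '%s\n' "$text" | qstat_attr checkpoint_name)
    [ "$state" = "H" ] && [ -n "$name" ]
}

# Usage on a system with Torque installed:
#   qstat -f 999 | held_with_checkpoint && echo "job is held with a checkpoint image"
```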

Test 2 - Persistence of checkpoint images

Introduction

This test determines if the checkpoint files remain in the default directory after the job is removed from the Torque queue.

Note that this behavior was requested by a customer but may not be the right thing to do, as it leaves the checkpoint files on the execution node. These will gradually build up on the node over time, limited only by disk space. The right behavior would seem to be to copy the checkpoint files to the user's home directory after the job is purged from the execution node.

Test Steps

Assuming the steps of Test 1, delete the job and then wait until the job leaves the queue after the completed job hold time. Then look at the contents of the default checkpoint directory to see if the files are still there.

> qsub test.sh
999.xxx.yyy
> qhold 999
> qdel 999
> sleep 100
> qstat
>
> find /var/spool/torque/checkpoint
... files ...
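
The fixed sleep 100 in the steps above depends on how long completed jobs are kept. As an alternative, the following sketch polls until the job no longer appears in the qstat listing. The function name wait_until_gone is hypothetical, and the match assumes that each qstat line starts with the numeric job id followed by a dot, as in 999.xxx.yyy.

```shell
#!/bin/bash
# Wait until a job id no longer appears in `qstat` output, or give up.
# Arguments: numeric job id, maximum number of polls, seconds between polls.
# Assumption: qstat lines begin with "<jobid>." for each listed job.
wait_until_gone() {
    local job=$1 tries=${2:-60} pause=${3:-5}
    while [ "$tries" -gt 0 ]; do
        if ! qstat 2>/dev/null | grep -q "^${job}\."; then
            return 0            # job has left the queue
        fi
        sleep "$pause"
        tries=$((tries - 1))
    done
    return 1                    # still listed after all polls
}

# Usage on a system with Torque installed:
#   wait_until_gone 999 && find /var/spool/torque/checkpoint
```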

Possible Failures

The files are not there. In that case, check whether Test 1 actually passed.

Successful Results

The files are there.


Test 3 - Restart after checkpoint

Introduction

This test determines if the job can be restarted after a checkpoint hold.

Test Steps

Assuming the steps of Test 1, issue a qrls command. Have another window open in the /var/spool/torque/spool directory and tail the job's output file.

Successful Results

After the qrls, the job's output should resume.

Test 4 - Multiple checkpoint/restart

Introduction

This test determines if the checkpoint/restart cycle can be repeated multiple times.

Test Steps

Start a job and then, while tailing the job output, do multiple qhold/qrls operations.

> qsub test.sh
999.xxx.yyy
> qhold 999
> qrls 999
> qhold 999
> qrls 999
> qhold 999
> qrls 999

Successful Results

After each qrls, the job's output should resume.
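
One way to confirm this without watching the output by eye is to check that the counter printed by the test program never skips or repeats a value. The following is a minimal sketch; the function name check_consecutive is hypothetical, and it assumes that a correct resume continues the count exactly where the hold interrupted it.

```shell
#!/bin/bash
# Read the test program's output on stdin and check that the counter
# advances by exactly one on every "i = N" line.
# Prints the first out-of-sequence pair and fails if there is one.
check_consecutive() {
    awk '
        $1 == "i" && $2 == "=" {
            if (seen && $3 != prev + 1) {
                printf "break: %d followed by %d\n", prev, $3
                bad = 1
                exit
            }
            prev = $3; seen = 1
        }
        END { exit bad + 0 }'
}

# Usage, with the path to the job's output file filled in:
#   check_consecutive < test.sh.o999
```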

Test 5 - Periodic checkpoint

Introduction

This test determines if automatic periodic checkpoint will work.

Test Steps

Start the job with the option -c c=1 and watch the checkpoint directory. A checkpoint image should be generated about every minute.

> qsub -c c=1 test.sh
999.xxx.yyy

Successful Results

The checkpoint directory should contain multiple checkpoint images and the time on the files should be roughly a minute apart.
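
The spacing can also be read from the image names rather than the file times, assuming that the trailing number in each name is an epoch timestamp in seconds. That is an assumption based on the names seen in Test 1. The function name ckpt_gaps is hypothetical.

```shell
#!/bin/bash
# Read checkpoint image names on stdin, one per line, and print the gap in
# seconds between consecutive images, oldest first.
# Assumption: the trailing numeric field of each name is an epoch timestamp.
ckpt_gaps() {
    sed 's/.*\.//' | sort -n | awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# Usage on the execution node:
#   ls /var/spool/torque/checkpoint/999.xxx.yyy.CK | ckpt_gaps
```

With a one-minute checkpoint interval, each printed value should be close to 60.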


Test 6 - Restart from previous image

Introduction

This test determines if the job can be restarted from a previous checkpoint image.

Test Steps

Start the job with the option -c c=1 and watch the checkpoint directory until several checkpoint images have been generated, about one per minute. Do a qhold on the job to stop it. Change the attribute checkpoint_name to the name of an earlier image with the qalter command. Then do a qrls to restart the job.

> qsub -c c=1 test.sh
999.xxx.yyy
> qhold 999
> qalter -W checkpoint_name=ckpt.999.xxx.yyy.1234567 999
> qrls 999

Successful Results

The job output file should be truncated back and the count should resume at an earlier number.
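
The image name ckpt.999.xxx.yyy.1234567 used in the steps is a placeholder. An actual earlier image can be picked from the checkpoint directory, as in the following sketch. The function name previous_image is hypothetical, and the ordering assumes that the trailing number in each image name is an epoch timestamp, so that a larger number means a newer image.

```shell
#!/bin/bash
# Print the name of the second-newest checkpoint image in a directory,
# ordered by the trailing numeric field of each file name.
# Fails if the directory holds fewer than two images.
previous_image() {
    local sorted count
    sorted=$(ls "$1" | grep '^ckpt\.' \
        | awk -F. '{ print $NF, $0 }' | sort -n | cut -d' ' -f2-)
    count=$(printf '%s\n' "$sorted" | awk 'NF { n++ } END { print n + 0 }')
    [ "$count" -ge 2 ] || return 1
    printf '%s\n' "$sorted" | tail -n 2 | head -n 1
}

# Usage on the execution node, between the qhold and the qrls:
#   previous_image /var/spool/torque/checkpoint/999.xxx.yyy.CK
```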


Test 7 - Restart from a completed job

Introduction

This test determines if the job can be restarted after it has already completed. Note that this only works while the job is still in the queue, which implies that the time for which completed jobs are kept (the keep_completed attribute) is set reasonably high.

Test Steps

The job must be re-run and then held. Then the job must be altered to specify a previous checkpoint image, and finally the job must be released.

> qrerun 999
> qhold 999
> qalter -W checkpoint_name=ckpt.999.xxx.yyy.1234567 999
> qrls 999

Successful Results

The job output file will be a new file and the count should resume at an earlier number.