[torquedev] TORQUE checkpointing

Andrew Keen keenandr at msu.edu
Mon May 10 10:08:33 MDT 2010


I've been working with TORQUE's BLCR support, but have noticed some 
issues in my testing and would like feedback on whether I'm 
misunderstanding the behavior I'm seeing.

1. qhold copies checkpoints back from the node to the PBS server. The 
scp command is run as root on the node but the target of the copy is the 
user running the job; the equivalent of:

root# scp checkpointfile user@head-node:/var/spool/torque/checkpoint

Unless root has a key for user or /var/spool/torque is set as usecp, 
this copy command will fail.

2. The server's checkpoint directory is hardcoded to PBS_HOME/checkpoint, 
the permissions of which are set by default so that only root can 
write to it (which breaks the behavior described in #1).

3. Checkpoint files are not deleted on the child after qhold migration.

4. BLCR supports image migration between nodes, and the qhold command 
migrates the checkpoint file back to the server, but TORQUE specifically 
checks and only allows held jobs with checkpoint files to run on the node 
the checkpoint was taken on. Is there a specific reason TORQUE disallows 
node migration?

5. cr_restart appears to run as root, so BLCR cannot reopen files that 
root cannot access (e.g. files on an NFS share with root_squash on) and 
is unable to restart from the checkpoint file.

We're experimenting with handling checkpointing within the job, but 
because TERM is sent to all processes rather than only to the parent 
shell, we can't just trap the signal in the shell and run a 
checkpoint-and-terminate command on receipt of the TERM. Is it possible 
for Moab to send a signal (USR1?) to a job before terminating/preempting it?
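
The in-job approach could be sketched roughly as follows. This is only an 
illustration under stated assumptions: it assumes a warning signal (USR1 
here) reaches the parent shell before the job is terminated, which is 
exactly the open question above. The names run_job, on_warning, 
CHECKPOINT_CMD and WARN_SIGNAL are hypothetical. The default 
CHECKPOINT_CMD is a harmless placeholder; with BLCR it would presumably be 
a cr_checkpoint invocation that terminates the process after saving it 
(check the cr_checkpoint man page for the exact options), and the payload 
would need to be started so that it is checkpointable (e.g. under cr_run).

```shell
#!/bin/sh
# Sketch only: trap a warning signal in the job's parent shell and
# checkpoint the payload process when it arrives.

# Hypothetical settings. The default command is a placeholder that only
# prints a message; replace it with a real checkpoint-and-terminate command.
CHECKPOINT_CMD=${CHECKPOINT_CMD:-"echo would-checkpoint"}
WARN_SIGNAL=${WARN_SIGNAL:-USR1}

payload_pid=""
checkpointed=0

# Signal handler: run the checkpoint command against the payload PID.
on_warning() {
    if [ -n "$payload_pid" ]; then
        # Intentionally unquoted so the command and its options are split.
        $CHECKPOINT_CMD "$payload_pid"
    fi
    checkpointed=1
}

# Start the payload in the background and wait for it, so the shell stays
# free to handle the warning signal while the payload runs.
run_job() {
    checkpointed=0
    trap on_warning "$WARN_SIGNAL"
    "$@" &
    payload_pid=$!
    # wait returns early when a trapped signal is delivered, so keep
    # waiting until the payload has really gone away.
    while kill -0 "$payload_pid" 2>/dev/null; do
        wait "$payload_pid"
    done
    trap - "$WARN_SIGNAL"
}

# Example use inside a job script (the application name is hypothetical):
#   run_job ./my_application
```

The payload is run in the background and waited on because a shell 
normally defers trap handlers until the current foreground command 
finishes; with wait, the handler runs as soon as the signal is delivered.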

Does anyone have any experience using BLCR in production?

As an aside, what tools do people recommend for working with the TORQUE 
source code? My generic system administrator vim configuration is not 
holding up. Are there any particular IDEs or tools that are useful?

