[torquedev] TORQUE checkpointing
keenandr at msu.edu
Mon May 10 10:08:33 MDT 2010
I've been working with TORQUE's BLCR support, but have noticed some
issues in my testing and would like feedback on whether I'm
misunderstanding the behavior I'm seeing.
1. qhold copies checkpoints back from the node to the PBS server. The
scp command is run as root on the node but the target of the copy is the
user running the job; the equivalent of:
root# scp checkpointfile user@head-node:/var/spool/torque/checkpoint
Unless root has a key for user or /var/spool/torque is set as usecp,
this copy command will fail.
2. The server's checkpoint directory is hardcoded to PBS_HOME/checkpoint,
the permissions of which are set by default so that only root can
write to it (which breaks the behavior described in #1).
3. Checkpoint files are not deleted on the child after qhold migration.
4. BLCR supports image migration between nodes, and the qhold command
migrates the checkpoint file back to the server, but TORQUE
specifically checks and only allows
held jobs with checkpoint files to run on the node the checkpoint was
taken on. Is there a specific reason TORQUE disallows node migration of
checkpointed jobs?
5. cr_restart appears to run as root, so BLCR cannot reopen files that
root cannot access (e.g. files on an NFS share with root_squash on) and
is unable to restart from the checkpoint file.
We're experimenting with handling checkpointing within the job, but
because TERM is sent to all processes in the job rather than only to the
parent shell, we can't just trap the signal in the shell and run a
checkpoint-and-terminate command on receipt of the TERM. Is it possible
for Moab to send a signal (USR1?) to a job before terminating/preempting it?
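For illustration, the in-job approach could look roughly like the sketch
below. It assumes a warning signal (USR1 here) reaches the job's parent
shell before any TERM does. checkpoint_and_terminate is a hypothetical
placeholder for whatever checkpoint command a site would use, and
"sleep 30" stands in for the real workload. The self-sent USR1 only
simulates the scheduler for demonstration.

```shell
#!/bin/sh
# Sketch: trap USR1 in the job's parent shell and checkpoint before exiting.
# Assumption: the scheduler delivers USR1 to this shell ahead of TERM.

checkpointed=0

# Hypothetical placeholder for a site-specific checkpoint-and-terminate
# step. Here it only records that it ran and stops the workload.
checkpoint_and_terminate() {
    checkpointed=1
    kill -TERM "$1" 2>/dev/null || true
}

# Stand-in for the real workload, started in the background so the
# parent shell remains free to handle signals.
sleep 30 &
work_pid=$!

# Install the handler in the parent shell only.
trap 'checkpoint_and_terminate "$work_pid"' USR1

# Simulate the scheduler's warning signal (demonstration only).
kill -USR1 $$

# Collect the workload's exit status once it has been stopped.
status=0
wait "$work_pid" || status=$?
```

In a real job the signal would normally arrive while the shell is blocked
in wait. In that case wait returns early, the handler runs, and the script
may need to call wait again to collect the workload's final status.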
Does anyone have any experience using BLCR in production?
As an aside, what tools do people recommend for working with the TORQUE
source code? My generic system administrator vim configuration is not
holding up. Are there any particular IDEs or tools that are useful?