[torqueusers] Setting up checkpointing
lloyd_brown at byu.edu
Wed Jan 18 15:39:11 MST 2012
Can anyone enlighten me on the current state of BLCR-style checkpointing
in Torque? I've been trying to get it to work, and so far, I see that
it's invoking my checkpoint script, that script calls cr_checkpoint, and
the checkpoint files/directories are created, but something is calling
the mom_checkpoint_delete_files function, which in turn calls
delete_blcr_files, and the checkpoints get deleted.
Also, when I do a "qhold" on my job to try to initiate the checkpoint,
is it really supposed to terminate my job? Perhaps that's related, i.e.
the job is ending, so the files get cleaned up.
Basically, does anyone have it working, and if so, can you give me advice?
Fulton Supercomputing Lab
Brigham Young University