[torqueusers] Setting up checkpointing

Lloyd Brown lloyd_brown at byu.edu
Wed Jan 18 15:39:11 MST 2012


Can anyone enlighten me on the current state of BLCR-style checkpointing
in Torque?  I've been trying to get it to work.  So far, I can see that
Torque invokes my checkpoint script, the script calls cr_checkpoint, and
the checkpoint files/directories are created.  But then something calls
the mom_checkpoint_delete_files function, which in turn calls
delete_blcr_files, and the checkpoints get deleted.
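
For context, here is a rough sketch of the kind of wrapper described
above, i.e. a checkpoint script that builds and runs a cr_checkpoint
command.  It is illustrative only, not the actual script in use: the
function name, its parameters, and the idea that the session ID,
checkpoint directory, checkpoint name, and signal number are handed to
the script are all assumptions.  The only BLCR pieces relied on are the
cr_checkpoint options --tree, --file, and --signal.

```python
import os
import subprocess


def build_checkpoint_command(session_id, checkpoint_dir, checkpoint_name,
                             signal_num=None):
    """Build a cr_checkpoint argument list (hypothetical helper).

    All parameters are assumptions about what a checkpoint script
    might be given; they are not taken from the Torque source.
    """
    # --tree: treat the trailing ID as a process ID and include its descendants.
    cmd = ["cr_checkpoint", "--tree"]
    # --file: write the context file to an explicit path.
    cmd += ["--file", os.path.join(checkpoint_dir, checkpoint_name)]
    # --signal: optionally deliver a signal to the processes after the checkpoint.
    if signal_num:
        cmd += ["--signal", str(int(signal_num))]
    # The ID to checkpoint goes last.
    cmd.append(str(int(session_id)))
    return cmd


def run_checkpoint(cmd):
    """Run the command and return its exit status (requires BLCR installed)."""
    return subprocess.call(cmd)
```

Calling build_checkpoint_command and logging the resulting list before
passing it to run_checkpoint makes it easy to confirm which file path
the checkpoint is written to, and therefore which files are later
removed.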

Also, when I do a "qhold" on my job to try to initiate the checkpoint,
is it really supposed to terminate my job?  Perhaps that's related, e.g.,
the job is ending, so the files get cleaned up.

Basically, does anyone have it working, and if so, can you give me some
advice?

Thanks,

-- 
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

