[torqueusers] Setting up checkpointing

Al Taufer ataufer at adaptivecomputing.com
Thu Jan 26 09:55:37 MST 2012


----- Original Message -----
> Can anyone enlighten me on the current state of BLCR-style checkpointing
> in Torque?  I've been trying to get it to work, and so far, I see that
> it's invoking my checkpoint script, that script calls cr_checkpoint, and
> the checkpoint files/directories are created, but something is calling
> the mom_checkpoint_delete_files function, which in turn calls
> delete_blcr_files, and the checkpoints get deleted.

I hope you are seeing normal behavior.  If I remember correctly, when a job gets checkpointed, the checkpoint files remain on the mom until the mom completes the job, or until the job is put on hold and is no longer on the mom.  At that point the checkpoint files are transferred to the server, where they remain until the job is removed from the server.  When the job gets restarted, which may or may not be on the original mom node, the checkpoint files are transferred to that mom, which can then restart the job from the checkpoint file.
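
The lifecycle described above can be sketched as a small state model. This is only an illustration of the description in this message, not TORQUE code: the event names and the Location enum are made up for the example, and the real pbs_mom/pbs_server logic may differ in detail. If the description is right, the move from the mom to the server would account for the files disappearing from the mom.

```python
from enum import Enum


class Location(Enum):
    """Where the checkpoint files currently live (illustrative names only)."""
    NONE = "none"
    MOM = "mom"
    SERVER = "server"


# Hypothetical event names mapped to {current location: next location}.
# These follow the description in the message, not the TORQUE source.
TRANSITIONS = {
    # The job is checkpointed while running on a mom.
    "checkpoint": {Location.NONE: Location.MOM, Location.MOM: Location.MOM},
    # The job completes, or is held and leaves the mom: files go to the server.
    "job_leaves_mom": {Location.MOM: Location.SERVER},
    # The job is restarted (possibly on a different mom): files go to that mom.
    "restart": {Location.SERVER: Location.MOM},
    # The job is removed from the server: files are gone.
    "job_removed_from_server": {Location.SERVER: Location.NONE},
}


def next_location(current, event):
    """Return the new file location, or raise if the event does not apply."""
    allowed = TRANSITIONS[event]
    if current not in allowed:
        raise ValueError(f"event {event!r} is not valid while files are at {current.value}")
    return allowed[current]


def trace(events, start=Location.NONE):
    """Apply events in order and return the location after each one."""
    locations = []
    current = start
    for event in events:
        current = next_location(current, event)
        locations.append(current)
    return locations
```

For example, trace(["checkpoint", "job_leaves_mom", "restart"]) walks the files from the mom to the server and back to a mom.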

> 
> Also, when I do a "qhold" on my job to try to initiate the checkpoint,
> is it really supposed to terminate my job?  Perhaps that's related, e.g.
> the job is ending so the files get cleaned up.

qhold is behaving as designed and as documented in its man page.  If you want to just checkpoint the job and allow it to continue running, use qchkpt.
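
As a rough sketch of that choice, a wrapper script might pick the command based on whether the job should keep running. The function names and the job ID used here are hypothetical; the only assumption about TORQUE itself is that both qchkpt and qhold accept a job identifier as their argument.

```python
import subprocess


def checkpoint_command(job_id, keep_running=True):
    """Build the command line for checkpointing a job.

    keep_running=True  -> qchkpt: checkpoint and let the job continue.
    keep_running=False -> qhold: checkpoint and take the job off the mom,
                          as discussed in this thread.
    """
    tool = "qchkpt" if keep_running else "qhold"
    return [tool, job_id]


def run_checkpoint(job_id, keep_running=True):
    """Run the chosen command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(checkpoint_command(job_id, keep_running), check=True)
```

Calling checkpoint_command("1234.example") gives ["qchkpt", "1234.example"], where "1234.example" is a placeholder job ID.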

> 
> Basically, does anyone have it working, and can give me advice?
> 
> Thanks,
> 
> --
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 

