[torqueusers] Setting up checkpointing
ataufer at adaptivecomputing.com
Thu Jan 26 11:18:43 MST 2012
Are you just using "-c interval=x"? If so, that only sets the checkpoint interval; it does not enable checkpointing. Try changing it to "-c periodic,interval=x".
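
As a rough illustration of that option string, here is a minimal sketch. It assumes the interval is a positive whole number of minutes, which is my reading of the qsub man page. checkpoint_opts is a hypothetical helper and job.sh is a placeholder script name; neither is part of Torque. The command is only printed, not submitted.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): build the value for qsub's -c
# option so that periodic checkpointing is enabled, not just given an interval.
checkpoint_opts() {
    case "$1" in
        # Assumption: the interval must be a positive whole number of minutes.
        ''|*[!0-9]*|0)
            echo "interval must be a positive whole number of minutes" >&2
            return 1
            ;;
    esac
    # "periodic" turns checkpointing on; "interval=" only sets the spacing.
    printf 'periodic,interval=%s' "$1"
}

# Print the command instead of running it; job.sh is a placeholder.
opts=$(checkpoint_opts 30) && echo qsub -c "$opts" job.sh
```

Remove the "echo" to actually submit the job once the printed command looks right.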
----- Original Message -----
> Thanks for the update. I guess the use case our users are really
> after is to have either a one-time or a periodic checkpoint, with the
> wait time before the checkpoint specified by the user. The
> "-c interval=" parameter to qsub makes it look like this should work.
> But when I tried that, I couldn't get the job to actually checkpoint
> without manually calling qhold/qchkpt. Maybe I'm just misinterpreting
> something, or don't have it set up right, but the idea here is to not
> require the users to manually checkpoint their job.
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> On 01/26/2012 09:55 AM, Al Taufer wrote:
> > ----- Original Message -----
> >> Can anyone enlighten me on the current state of BLCR-style
> >> checkpointing in Torque? I've been trying to get it to work, and so
> >> far, I see that it's invoking my checkpoint script, that script calls
> >> cr_checkpoint, and the checkpoint files/directories are created, but
> >> something is calling the mom_checkpoint_delete_files function, which
> >> in turn calls delete_blcr_files, and the checkpoints get deleted.
> > I hope you are seeing normal behavior. If I remember correctly,
> > when a job gets checkpointed, the checkpoint files remain on the
> > mom until the mom completes the job or until the job is put on
> > hold and is no longer on the mom. At that time the checkpoint
> > files are transferred to the server where they remain until the
> > job is removed from the server. When the job gets restarted,
> > which may or may not be on the original mom node, the checkpoint
> > files are transferred to the mom which can then restart the job
> > from the checkpoint file.
> >> Also, when I do a "qhold" on my job to try to initiate the
> >> checkpoint, is it really supposed to terminate my job? Perhaps
> >> that's related, e.g. the job is ending so the files get cleaned up.
> > qhold is behaving as designed and as documented in its man page.
> > If you want to just checkpoint the job and allow it to continue
> > running, use qchkpt.
> >> Basically, does anyone have it working, and can give me advice?
> >> Thanks,
> >> --
> >> Lloyd Brown
> >> Systems Administrator
> >> Fulton Supercomputing Lab
> >> Brigham Young University
> >> http://marylou.byu.edu
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers