[torqueusers] Setting up checkpointing

Al Taufer ataufer at adaptivecomputing.com
Thu Jan 26 11:18:43 MST 2012


Are you just using "-c interval=x"?  If so, that only sets the checkpoint interval; it does not enable checkpointing.  Try changing it to "-c periodic,interval=x".
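
As a rough sketch of what the submission might look like (the script name job.sh and the 30-minute interval are placeholders I made up, not values from this thread, and the interval unit is minutes as far as I recall from the qsub man page):

```shell
#!/bin/sh
# Sketch only: job.sh and the 30-minute interval are placeholder values.
interval_minutes=30
job_script="job.sh"

# "periodic" enables periodic checkpointing; "interval=" sets how often it runs.
# On its own, "interval=" does not enable checkpointing.
ckpt_opts="periodic,interval=${interval_minutes}"

# Print the command for review rather than submitting it directly.
echo "qsub -c ${ckpt_opts} ${job_script}"
```

This only prints the command; paste or eval it on a submit host once the options look right.  The job still needs a working BLCR checkpoint script configured on the mom.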

----- Original Message -----
> Al,
> 
> Thanks for the update.  I guess the use case our users are really after
> is to have either a one-time or a periodic checkpoint, with the wait
> time before the checkpoint specified by the user.  The "-c interval="
> parameter to qsub makes it look like this should work.  But when I did
> that, I couldn't get the job to actually checkpoint without manually
> calling qhold/qchkpt.  Maybe I'm just misinterpreting something, or
> don't have it set up right, but the idea here is to not require the
> users to manually checkpoint their job.
> 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> 
> On 01/26/2012 09:55 AM, Al Taufer wrote:
> > 
> > ----- Original Message -----
> >> Can anyone enlighten me on the current state of BLCR-style
> >> checkpointing in Torque?  I've been trying to get it to work, and so
> >> far, I see that it's invoking my checkpoint script, that script calls
> >> cr_checkpoint, and the checkpoint files/directories are created, but
> >> something is calling the mom_checkpoint_delete_files function, which
> >> in turn calls delete_blcr_files, and the checkpoints get deleted.
> > 
> > I hope you are seeing normal behavior.  If I remember correctly,
> > when a job gets checkpointed, the checkpoint files remain on the
> > mom until the mom completes the job or until the job is put on
> > hold and is no longer on the mom.  At that time the checkpoint
> > files are transferred to the server where they remain until the
> > job is removed from the server.  When the job gets restarted,
> > which may or may not be on the original mom node, the checkpoint
> > files are transferred to the mom which can then restart the job
> > from the checkpoint file.
> > 
> >>
> >> Also, when I do a "qhold" on my job to try to initiate the
> >> checkpoint, is it really supposed to terminate my job?  Perhaps
> >> that's related, e.g. the job is ending so the files get cleaned up.
> > 
> > qhold is behaving as designed and as documented in its man page.
> >  If you want to just checkpoint the job and allow it to continue
> > running, use qchkpt.
> > 
> >>
> >> Basically, does anyone have it working, and can give me advice?
> >>
> >> Thanks,
> >>
> >> --
> >> Lloyd Brown
> >> Systems Administrator
> >> Fulton Supercomputing Lab
> >> Brigham Young University
> >> http://marylou.byu.edu
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> >>