[torqueusers] Setting up checkpointing

Al Taufer ataufer at adaptivecomputing.com
Thu Jan 26 15:31:37 MST 2012


This is a good way to accomplish what you want.  The only thing I can add is that when you run configure for the server build, you can use the --with-servchkptdir option to specify where the server will keep its checkpoint files, and that location can be a remotely mounted path.
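
As a rough sketch of how the two settings fit together: the script below only assembles and prints the relevant lines, since actually running configure needs a TORQUE source tree. The server-side path is a made-up placeholder, and I am assuming the option takes its directory in the usual autoconf --with-NAME=VALUE form. The pbs_mom line is the one from the quoted message below.

```shell
#!/bin/sh
# Hypothetical path (not from this thread): a large, remotely mounted
# filesystem where pbs_server should keep its checkpoint files.
SERVER_CHKPT_DIR=/mnt/bigdisk/torque/checkpoint

# Build-time option for the server. Assumed to follow the standard
# autoconf --with-NAME=VALUE form; printed here, not executed.
CONFIGURE_CMD="./configure --with-servchkptdir=${SERVER_CHKPT_DIR}"
echo "$CONFIGURE_CMD"

# Line for the pbs_mom config file on each compute node, as given in the
# quoted message. Single quotes keep the shell from expanding the leading $.
MOM_CONFIG_LINE='$remote_checkpoint_dirs /opt/torque/checkpoint'
echo "$MOM_CONFIG_LINE"
```

The first printed line would be run in the TORQUE source directory when building the server; the second would go into the pbs_mom config file on the compute nodes.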

----- Original Message -----
> Hi,
> 
> This is what I did to keep checkpoint images off the server node.
> 
> Modify the pbs_mom's config file to specify what checkpointing
> directories are remotely mounted. This can be done by adding
> something like:
> 
> $remote_checkpoint_dirs /opt/torque/checkpoint
> 
> Here /opt/torque/checkpoint is remotely mounted onto
> /opt/torque/checkpoint on each compute node. It doesn't have to be
> /opt/torque/checkpoint on the server node; it can be any other
> directory there. I linked /opt/torque/checkpoint on the server node
> to some other directory with lots of space.
> 
> Best,
> Sreedhar.
> 
> 
> On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote:
> 
> > Al,
> > 
> > I tried a number of combinations of params, but after your last
> > email, I tried it with "-c periodic,interval=x", and I do see the
> > checkpoint being created in the TORQUEMOMHOME/checkpoints directory.
> > I haven't been able to test beyond that, since some other things
> > came up.
> > 
> > From what you've said, though, I have to ask if there's any way to
> > specify where the checkpoint goes, especially when it would
> > otherwise be copied back to the host where pbs_server is running.
> > You see, our use case involves checkpointing some really big-memory
> > (e.g. 256 GB) processes, and we simply don't have the space to store
> > that on the pbs_server host.
> > 
> > 
> > 
> > Lloyd Brown
> > Systems Administrator
> > Fulton Supercomputing Lab
> > Brigham Young University
> > http://marylou.byu.edu
> > 
> > On 01/26/2012 11:18 AM, Al Taufer wrote:
> >> Are you just using "-c interval=x"?  If so, that only specifies
> >> the checkpoint interval; it does not enable checkpointing.
> >> Try changing it to "-c periodic,interval=x".
> >> 
> >> ----- Original Message -----
> >>> Al,
> >>> 
> >>> Thanks for the update.  I guess the use case our users are really
> >>> after is to have either a one-time or a periodic checkpoint, with
> >>> the wait time before the checkpoint specified by the user.  The
> >>> "-c interval=" parameter to qsub makes it look like this should
> >>> work.  But when I did that, I couldn't get the job to actually
> >>> checkpoint without manually calling qhold/qchkpt.  Maybe I'm just
> >>> misinterpreting something, or don't have it set up right, but the
> >>> idea here is to not require the users to manually checkpoint their
> >>> job.
> >>> 
> >>> Lloyd Brown
> >>> Systems Administrator
> >>> Fulton Supercomputing Lab
> >>> Brigham Young University
> >>> http://marylou.byu.edu
> >>> 
> >>> On 01/26/2012 09:55 AM, Al Taufer wrote:
> >>>> 
> >>>> ----- Original Message -----
> >>>>> Can anyone enlighten me on the current state of BLCR-style
> >>>>> checkpointing in Torque?  I've been trying to get it to work,
> >>>>> and so far, I see that it's invoking my checkpoint script, that
> >>>>> script calls cr_checkpoint, and the checkpoint files/directories
> >>>>> are created, but something is calling the
> >>>>> mom_checkpoint_delete_files function, which in turn calls
> >>>>> delete_blcr_files, and the checkpoints get deleted.
> >>>> 
> >>>> I hope you are seeing normal behavior.  If I remember correctly,
> >>>> when a job gets checkpointed, the checkpoint files remain on the
> >>>> mom until the mom completes the job or until the job is put on
> >>>> hold and is no longer on the mom.  At that time the checkpoint
> >>>> files are transferred to the server where they remain until the
> >>>> job is removed from the server.  When the job gets restarted,
> >>>> which may or may not be on the original mom node, the checkpoint
> >>>> files are transferred to the mom which can then restart the job
> >>>> from the checkpoint file.
> >>>> 
> >>>>> 
> >>>>> Also, when I do a "qhold" on my job to try to initiate the
> >>>>> checkpoint, is it really supposed to terminate my job?  Perhaps
> >>>>> that's related, e.g. the job is ending so the files get cleaned
> >>>>> up.
> >>>> 
> >>>> qhold is behaving as designed and as documented in its man page.
> >>>> If you want to just checkpoint the job and allow it to continue
> >>>> running, use qchkpt.
> >>>> 
> >>>>> 
> >>>>> Basically, does anyone have it working, and can give me advice?
> >>>>> 
> >>>>> Thanks,
> >>>>> 
> >>>>> --
> >>>>> Lloyd Brown
> >>>>> Systems Administrator
> >>>>> Fulton Supercomputing Lab
> >>>>> Brigham Young University
> >>>>> http://marylou.byu.edu
> >>>>> _______________________________________________
> >>>>> torqueusers mailing list
> >>>>> torqueusers at supercluster.org
> >>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
> >>>>> 
> >>> 
> 
> 
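
To pull together the commands discussed in the quoted messages, here is a minimal sketch. It only prints the command lines rather than running them, so it works without a TORQUE installation. The interval value, job script name, and job id are hypothetical placeholders; check the qsub man page for the units of interval.

```shell
#!/bin/sh
# Hypothetical values for illustration only (none come from the thread).
INTERVAL=30        # the "x" in interval=x; see the qsub man page for units
JOB_SCRIPT=job.sh  # placeholder job script
JOB_ID=1234        # placeholder job id as reported by qsub

# "-c interval=x" alone only sets the interval; adding "periodic" is what
# enables periodic checkpointing.
SUBMIT_CMD="qsub -c periodic,interval=${INTERVAL} ${JOB_SCRIPT}"

# qchkpt checkpoints the job and lets it keep running.
CHKPT_CMD="qchkpt ${JOB_ID}"

# qhold checkpoints the job and takes it off the mom, so the job stops.
HOLD_CMD="qhold ${JOB_ID}"

echo "$SUBMIT_CMD"
echo "$CHKPT_CMD"
echo "$HOLD_CMD"
```

With periodic checkpointing enabled at submission, users should not need to call qchkpt or qhold by hand.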

