[torqueusers] Setting up checkpointing

Al Taufer ataufer at adaptivecomputing.com
Tue Jan 31 09:40:17 MST 2012


I am not sure, but I don't think it's currently possible. The server always wants to transfer the checkpoint file back to its checkpoint directory. If that is a remotely accessible path that the compute nodes are set up to use, then the actual transfer will not happen.
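
For example, one way to make the server's checkpoint directory remotely accessible is to export it over NFS and mount it at the same path on every compute node. A minimal sketch, with illustrative paths:

# on the pbs_server host, /etc/exports:
/var/spool/torque/checkpoint *(rw,sync,no_root_squash)

# on each compute node, /etc/fstab:
headnode:/var/spool/torque/checkpoint /var/spool/torque/checkpoint nfs defaults 0 0

With the same path visible everywhere (and listed in each mom's $remote_checkpoint_dirs, as discussed below), nothing actually has to be copied over the network.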

----- Original Message -----
> Hi Al,
> 
> Is there a way to keep checkpoint files off both the compute nodes
> and the head node? I mean, I want them to go into users' working
> directories. We have /scratch space mounted on the compute nodes,
> but not on the head node, and overall we have limited space on both.
> If the jobs are huge, I'm afraid the checkpoint images might
> eventually fill all the space, leading to job failures.
> 
> I know that "qsub -c dir=<path to checkpoint>" puts the file in the
> specified path. If we do this, does the server still keep the
> checkpoint image (this directory is remotely mounted on the compute
> nodes), or does it stay only in the path specified with dir=?
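> 
> For example, with an illustrative path:
> 
> qsub -c dir=/scratch/checkpoints job.sh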
> 
> I appreciate your help.
> 
> Thanks,
> Sreedhar.
> 
> On Jan 26, 2012, at 5:31 PM, Al Taufer wrote:
> 
> > This is a good method for accomplishing what is wanted. The only
> > thing I can add is that when you configure the server, you could
> > use the --with-servchkptdir option to specify where the server
> > will keep its checkpoint files, which can be a remotely mounted
> > path.
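> > 
> > For example (the mount point is illustrative):
> > 
> > ./configure --with-servchkptdir=/mnt/shared/torque/checkpoint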
> > 
> > ----- Original Message -----
> >> Hi,
> >> 
> >> This is what I did to avoid checkpoint images going onto server
> >> node.
> >> 
> >> Modify the pbs_mom's config file to specify which checkpoint
> >> directories are remotely mounted. This can be done by adding
> >> something like:
> >> 
> >> $remote_checkpoint_dirs /opt/torque/checkpoint
> >> 
> >> Here /opt/torque/checkpoint is remotely mounted at
> >> /opt/torque/checkpoint on each compute node. It doesn't have to
> >> be /opt/torque/checkpoint on the server node; it can be any other
> >> directory there. I linked /opt/torque/checkpoint on the server
> >> node to another directory with lots of space.
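> >> 
> >> Put together, it looks something like this (paths illustrative):
> >> 
> >> # in each mom's mom_priv/config:
> >> $remote_checkpoint_dirs /opt/torque/checkpoint
> >> 
> >> # on the server node, point the path at a filesystem with space:
> >> ln -s /bigdisk/torque/checkpoint /opt/torque/checkpoint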
> >> 
> >> Best,
> >> Sreedhar.
> >> 
> >> 
> >> On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote:
> >> 
> >>> Al,
> >>> 
> >>> I tried a number of combinations of params, but after your last
> >>> email I tried it with "-c periodic,interval=x", and I do see the
> >>> checkpoint being created in the TORQUEMOMHOME/checkpoints
> >>> directory. I haven't been able to test beyond that, since some
> >>> other things came up.
> >>> 
> >>> From what you've said, though, I have to ask if there's any way
> >>> to specify where the checkpoint goes, especially when it would
> >>> otherwise be copied back to the host where pbs_server is
> >>> running. You see, our use case involves checkpointing some
> >>> really big-memory (e.g., 256 GB) processes, and we simply don't
> >>> have the space to store that on the pbs_server host.
> >>> 
> >>> 
> >>> 
> >>> Lloyd Brown
> >>> Systems Administrator
> >>> Fulton Supercomputing Lab
> >>> Brigham Young University
> >>> http://marylou.byu.edu
> >>> 
> >>> On 01/26/2012 11:18 AM, Al Taufer wrote:
> >>>> Are you just using "-c interval=x"? If so, that only specifies
> >>>> the checkpoint interval; it does not enable checkpointing. Try
> >>>> changing it to "-c periodic,interval=x".
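> >>>> 
> >>>> For example, to checkpoint every 10 minutes (the interval is in
> >>>> minutes, if I remember right):
> >>>> 
> >>>> qsub -c periodic,interval=10 job.sh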
> >>>> 
> >>>> ----- Original Message -----
> >>>>> Al,
> >>>>> 
> >>>>> Thanks for the update. I guess the use case our users are
> >>>>> really after is either a one-time or a periodic checkpoint,
> >>>>> with the wait time before the checkpoint specified by the
> >>>>> user. The "-c interval=" parameter to qsub makes it look like
> >>>>> this should work, but when I did that, I couldn't get the job
> >>>>> to actually checkpoint without manually calling qhold/qchkpt.
> >>>>> Maybe I'm just misinterpreting something, or don't have it set
> >>>>> up right, but the idea here is to not require the users to
> >>>>> manually checkpoint their jobs.
> >>>>> 
> >>>>> Lloyd Brown
> >>>>> Systems Administrator
> >>>>> Fulton Supercomputing Lab
> >>>>> Brigham Young University
> >>>>> http://marylou.byu.edu
> >>>>> 
> >>>>> On 01/26/2012 09:55 AM, Al Taufer wrote:
> >>>>>> 
> >>>>>> ----- Original Message -----
> >>>>>>> Can anyone enlighten me on the current state of BLCR-style
> >>>>>>> checkpointing in Torque? I've been trying to get it to work,
> >>>>>>> and so far I see that it's invoking my checkpoint script,
> >>>>>>> that script calls cr_checkpoint, and the checkpoint
> >>>>>>> files/directories are created, but something is calling the
> >>>>>>> mom_checkpoint_delete_files function, which in turn calls
> >>>>>>> delete_blcr_files, and the checkpoints get deleted.
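> >>>>>>> 
> >>>>>>> For context, the script is essentially the usual
> >>>>>>> cr_checkpoint wrapper, roughly like the sketch below. The
> >>>>>>> argument order is my reading of the sample script in the
> >>>>>>> TORQUE docs, so treat it as an assumption:
> >>>>>>> 
> >>>>>>> #!/bin/sh
> >>>>>>> # invoked by pbs_mom as its $checkpoint_script; argument
> >>>>>>> # positions assumed from the TORQUE sample -- verify them
> >>>>>>> sessionId=$1      # root of the job's process tree
> >>>>>>> jobId=$2          # unused in this sketch
> >>>>>>> userId=$3         # unused in this sketch
> >>>>>>> signalNum=$4      # signal to deliver after the checkpoint
> >>>>>>> checkpointDir=$5
> >>>>>>> checkpointName=$6
> >>>>>>> 
> >>>>>>> # checkpoint the whole tree, writing the image where
> >>>>>>> # pbs_mom asked for it
> >>>>>>> cr_checkpoint --signal "$signalNum" --tree "$sessionId" \
> >>>>>>>     --file "$checkpointDir/$checkpointName"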
> >>>>>> 
> >>>>>> I believe you are seeing normal behavior. If I remember
> >>>>>> correctly, when a job gets checkpointed, the checkpoint files
> >>>>>> remain on the mom until the mom completes the job or until
> >>>>>> the job is put on hold and is no longer on the mom. At that
> >>>>>> time the checkpoint files are transferred to the server,
> >>>>>> where they remain until the job is removed from the server.
> >>>>>> When the job gets restarted, which may or may not be on the
> >>>>>> original mom node, the checkpoint files are transferred to
> >>>>>> the mom, which can then restart the job from the checkpoint
> >>>>>> file.
> >>>>>> 
> >>>>>>> 
> >>>>>>> Also, when I do a "qhold" on my job to try to initiate the
> >>>>>>> checkpoint, is it really supposed to terminate my job?
> >>>>>>> Perhaps that's related, e.g., the job is ending, so the
> >>>>>>> files get cleaned up.
> >>>>>> 
> >>>>>> qhold is behaving as designed and as documented in its man
> >>>>>> page. If you want to just checkpoint the job and allow it to
> >>>>>> continue running, use qchkpt.
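> >>>>>> 
> >>>>>> For example (job id illustrative):
> >>>>>> 
> >>>>>> qchkpt 1234.server   # checkpoint; the job keeps running
> >>>>>> qhold 1234.server    # checkpoint, then terminate and hold
> >>>>>> qrls 1234.server     # release the hold; the job restarts
> >>>>>>                      # from its checkpoint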
> >>>>>> 
> >>>>>>> 
> >>>>>>> Basically, does anyone have it working, and can give me
> >>>>>>> advice?
> >>>>>>> 
> >>>>>>> Thanks,
> >>>>>>> 
> >>>>>>> --
> >>>>>>> Lloyd Brown
> >>>>>>> Systems Administrator
> >>>>>>> Fulton Supercomputing Lab
> >>>>>>> Brigham Young University
> >>>>>>> http://marylou.byu.edu
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 

