[torqueusers] Setting up checkpointing

Lloyd Brown lloyd_brown at byu.edu
Thu Jan 26 11:03:24 MST 2012


Al,

Thanks for the update.  I guess the use case our users are really after
is to have either a one-time or a periodic checkpoint, with the wait
time before the checkpoint specified by the user.  The "-c interval="
parameter to qsub makes it look like this should work.  But when I did
that, I couldn't get the job to actually checkpoint without manually
calling qhold/qchkpt.  Maybe I'm just misinterpreting something, or
don't have it set up right, but the idea here is to not require the
users to manually checkpoint their job.
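
For reference, a minimal sketch of the kind of submission described
above. It only builds the qsub command line and does not run it. The
option names (enabled, periodic, interval=) are my reading of the "-c"
section of the qsub man page, with the interval taken to be in minutes;
the helper name and job.sh are hypothetical. Whether the mom actually
takes the checkpoint on that schedule is exactly the open question here.

```python
import shlex


def build_qsub_command(script, interval_minutes, periodic=True):
    """Build (but do not run) a qsub argv list requesting checkpointing.

    Assumption: the "-c" option accepts a comma-separated list such as
    "enabled,periodic,interval=N", with N in minutes of wall time.
    """
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be greater than zero")
    options = ["enabled"]
    if periodic:
        options.append("periodic")
    options.append(f"interval={interval_minutes}")
    return ["qsub", "-c", ",".join(options), script]


if __name__ == "__main__":
    # job.sh is a placeholder script name.
    cmd = build_qsub_command("job.sh", 30)
    print(shlex.join(cmd))
```

On a submit host the returned list could be handed to subprocess.run();
keeping it as a list avoids shell quoting problems in the script path.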

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/26/2012 09:55 AM, Al Taufer wrote:
> 
> ----- Original Message -----
>> Can anyone enlighten me on the current state of BLCR-style
>> checkpointing in Torque?  I've been trying to get it to work, and so
>> far, I see that it's invoking my checkpoint script, that script calls
>> cr_checkpoint, and the checkpoint files/directories are created, but
>> something is calling the mom_checkpoint_delete_files function, which
>> in turn calls delete_blcr_files, and the checkpoints get deleted.
> 
> I hope you are seeing normal behavior.  If I remember correctly, when a job gets checkpointed, the checkpoint files remain on the mom until the mom completes the job or until the job is put on hold and is no longer on the mom.  At that time the checkpoint files are transferred to the server where they remain until the job is removed from the server.  When the job gets restarted, which may or may not be on the original mom node, the checkpoint files are transferred to the mom which can then restart the job from the checkpoint file.
> 
>>
>> Also, when I do a "qhold" on my job to try to initiate the checkpoint,
>> is it really supposed to terminate my job?  Perhaps that's related,
>> e.g. the job is ending so the files get cleaned up.
> 
> qhold is behaving as designed and as documented in its man page.  If you want to just checkpoint the job and allow it to continue running, use qchkpt.
> 
>>
>> Basically, does anyone have it working, and can give me advice?
>>
>> Thanks,
>>
>> --
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>