[torqueusers] Setting up checkpointing

Sreedhar Manchu sm4082 at nyu.edu
Thu Jan 26 11:24:09 MST 2012


Hi,

Try this:

qmgr -c 'set queue serial checkpoint_defaults="enabled,shutdown,periodic,interval=1,depth=2"'

serial is the queue name.
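
A minimal sketch of applying the setting and then reading it back, assuming the queue is named serial as above and that you run it as a Torque manager. The variable names are only for illustration, and the guard skips the qmgr calls on a machine where Torque is not installed:

```shell
#!/bin/sh
# Queue name and defaults are taken from the command above; adjust as needed.
QUEUE=serial
DEFAULTS="enabled,shutdown,periodic,interval=1,depth=2"
SET_CMD="set queue $QUEUE checkpoint_defaults=\"$DEFAULTS\""

# Show the command that would be run.
echo "qmgr -c '$SET_CMD'"

# Only touch the server when qmgr is actually available.
if command -v qmgr >/dev/null 2>&1; then
    qmgr -c "$SET_CMD"
    # Read the queue attributes back and look for the new default.
    qmgr -c "list queue $QUEUE" | grep checkpoint_defaults
fi
```

If the grep prints nothing, the attribute was not set on that queue.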

The depth setting doesn't work out of the box. You need to modify the Perl checkpoint script that comes with the BLCR package so that it handles this variable.

Has anyone modified the checkpoint scripts to do this? Did it work?

Thanks,
Sreedhar.

On Jan 26, 2012, at 1:18 PM, Al Taufer wrote:

> Are you just using "-c interval=x"?  If so, that only sets the checkpoint interval; it does not enable checkpointing.  Try changing it to "-c periodic,interval=x".
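
As a sketch of what Al describes, a per-job submission could look like the following. The interval value and the job script name job.sh are placeholders, not values from this thread, and the guard keeps the script harmless where qsub or the job script is missing:

```shell
#!/bin/sh
# INTERVAL and JOB_SCRIPT are hypothetical values for illustration.
INTERVAL=5
JOB_SCRIPT=job.sh
# "periodic" turns periodic checkpointing on; "interval" only sets the period.
CKPT_OPTS="periodic,interval=$INTERVAL"

# Show the command that would be run.
echo "qsub -c $CKPT_OPTS $JOB_SCRIPT"

# Submit only if qsub exists and the job script is present.
if command -v qsub >/dev/null 2>&1 && [ -f "$JOB_SCRIPT" ]; then
    qsub -c "$CKPT_OPTS" "$JOB_SCRIPT"
fi
```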
> 
> ----- Original Message -----
>> Al,
>> 
>> Thanks for the update.  I guess the use case our users are really after
>> is to have either a one-time or a periodic checkpoint, with the wait
>> time before the checkpoint specified by the user.  The "-c interval="
>> parameter to qsub makes it look like this should work.  But when I did
>> that, I couldn't get the job to actually checkpoint without manually
>> calling qhold/qchkpt.  Maybe I'm just misinterpreting something, or
>> don't have it set up right, but the idea here is to not require the
>> users to manually checkpoint their job.
>> 
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>> 
>> On 01/26/2012 09:55 AM, Al Taufer wrote:
>>> 
>>> ----- Original Message -----
>>>> Can anyone enlighten me on the current state of BLCR-style checkpointing
>>>> in Torque?  I've been trying to get it to work, and so far, I see that
>>>> it's invoking my checkpoint script, that script calls cr_checkpoint, and
>>>> the checkpoint files/directories are created, but something is calling
>>>> the mom_checkpoint_delete_files function, which in turn calls
>>>> delete_blcr_files, and the checkpoints get deleted.
>>> 
>>> I hope you are seeing normal behavior.  If I remember correctly,
>>> when a job gets checkpointed, the checkpoint files remain on the
>>> mom until the mom completes the job or until the job is put on
>>> hold and is no longer on the mom.  At that time the checkpoint
>>> files are transferred to the server where they remain until the
>>> job is removed from the server.  When the job gets restarted,
>>> which may or may not be on the original mom node, the checkpoint
>>> files are transferred to the mom which can then restart the job
>>> from the checkpoint file.
>>> 
>>>> 
>>>> Also, when I do a "qhold" on my job to try to initiate the checkpoint,
>>>> is it really supposed to terminate my job?  Perhaps that's related,
>>>> e.g. the job is ending so the files get cleaned up.
>>> 
>>> qhold is behaving as designed and as documented in its man page.
>>> If you want to just checkpoint the job and allow it to continue
>>> running, use qchkpt.
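
To make the difference concrete, here is a small sketch. The job id is a made-up placeholder, and the qhold/qrls lines are left as comments so that running the script never stops a job:

```shell
#!/bin/sh
# JOBID is a hypothetical job id; replace it with one from qstat.
JOBID=1234.server

if command -v qchkpt >/dev/null 2>&1; then
    # Checkpoint the job and let it keep running.
    qchkpt "$JOBID"
    # By contrast, qhold checkpoints the job and takes it off the mom;
    # qrls releases the hold so it can be restarted from the checkpoint:
    #   qhold "$JOBID"
    #   qrls  "$JOBID"
else
    echo "qchkpt not found; would run: qchkpt $JOBID"
fi
```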
>>> 
>>>> 
>>>> Basically, does anyone have it working, and can give me advice?
>>>> 
>>>> Thanks,
>>>> 
>>>> --
>>>> Lloyd Brown
>>>> Systems Administrator
>>>> Fulton Supercomputing Lab
>>>> Brigham Young University
>>>> http://marylou.byu.edu
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>> 
>> 