[torqueusers] Setting up checkpointing

Lloyd Brown lloyd_brown at byu.edu
Thu Jan 26 14:30:41 MST 2012


Al,

I tried a number of combinations of params, but after your last email, I
tried it with "-c periodic,interval=x", and I do see the checkpoint
being created in the TORQUEMOMHOME/checkpoints directory.  I haven't
been able to test beyond that, since some other things came up.
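
A minimal sketch, in Python, of how the option string that worked above is put together. The helper names, the script name job.sh, and the interval value 30 are illustrative assumptions and do not come from this thread; check the qsub man page for your TORQUE version for the unit of the interval. The sketch only builds the command line and does not run qsub.

```python
import shlex


def checkpoint_option(interval, periodic=True):
    """Build the value passed to qsub's -c flag.

    Per the thread, "interval=x" alone only sets the interval;
    "periodic" must also be present to enable checkpointing.
    """
    parts = []
    if periodic:
        parts.append("periodic")
    parts.append(f"interval={interval}")
    return ",".join(parts)


def qsub_command(script, interval):
    """Return the qsub argument list for a periodically checkpointed job."""
    return ["qsub", "-c", checkpoint_option(interval), script]


# job.sh and 30 are placeholder values for illustration only.
print(shlex.join(qsub_command("job.sh", 30)))
# prints: qsub -c periodic,interval=30 job.sh
```

Calling checkpoint_option(30, periodic=False) yields "interval=30", the form that set the interval but never triggered a checkpoint.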

From what you've said, though, I have to ask if there's any way to
specify where the checkpoint goes, especially when it would otherwise be
copied back to the host where pbs_server is running.  You see, our use
case involves checkpointing some really big-memory (e.g., 256 GB)
processes, and we simply don't have the space to store that on the
pbs_server host.



Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/26/2012 11:18 AM, Al Taufer wrote:
> Are you just using the "-c interval=x"?  If so, that only specifies the checkpoint interval; it does not enable checkpointing.  Try changing it to "-c periodic,interval=x".
> 
> ----- Original Message -----
>> Al,
>>
>> Thanks for the update.  I guess the use case our users are really after
>> is to have either a one-time or a periodic checkpoint, with the wait
>> time before the checkpoint specified by the user.  The "-c interval="
>> parameter to qsub makes it look like this should work.  But when I did
>> that, I couldn't get the job to actually checkpoint without manually
>> calling qhold/qchkpt.  Maybe I'm just misinterpreting something, or
>> don't have it set up right, but the idea here is to not require the
>> users to manually checkpoint their job.
>>
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>>
>> On 01/26/2012 09:55 AM, Al Taufer wrote:
>>>
>>> ----- Original Message -----
>>>> Can anyone enlighten me on the current state of BLCR-style checkpointing
>>>> in Torque?  I've been trying to get it to work, and so far, I see that
>>>> it's invoking my checkpoint script, that script calls cr_checkpoint, and
>>>> the checkpoint files/directories are created, but something is calling
>>>> the mom_checkpoint_delete_files function, which in turn calls
>>>> delete_blcr_files, and the checkpoints get deleted.
>>>
>>> I hope you are seeing normal behavior.  If I remember correctly,
>>> when a job gets checkpointed, the checkpoint files remain on the
>>> mom until the mom completes the job or until the job is put on
>>> hold and is no longer on the mom.  At that time the checkpoint
>>> files are transferred to the server where they remain until the
>>> job is removed from the server.  When the job gets restarted,
>>> which may or may not be on the original mom node, the checkpoint
>>> files are transferred to the mom which can then restart the job
>>> from the checkpoint file.
>>>
>>>>
>>>> Also, when I do a "qhold" on my job to try to initiate the checkpoint,
>>>> is it really supposed to terminate my job?  Perhaps that's related,
>>>> e.g. the job is ending so the files get cleaned up.
>>>
>>> qhold is behaving as designed and as documented in its man page.
>>>  If you want to just checkpoint the job and allow it to continue
>>> running, use qchkpt.
>>>
>>>>
>>>> Basically, does anyone have it working, and can give me advice?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Lloyd Brown
>>>> Systems Administrator
>>>> Fulton Supercomputing Lab
>>>> Brigham Young University
>>>> http://marylou.byu.edu
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>

