[torqueusers] Setting up checkpointing
lloyd_brown at byu.edu
Thu Jan 26 14:30:41 MST 2012
I tried a number of combinations of params, but after your last email, I
tried it with "-c periodic,interval=x", and I do see the checkpoint
being created in the TORQUEMOMHOME/checkpoints directory. I haven't
been able to test beyond that, since some other things came up.
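
For anyone following along, here is a minimal sketch of the two submissions being compared. The job script name (job.sh) and the 30-minute interval are made up for illustration, and as far as I know the interval is given in minutes. The snippet only prints the commands rather than submitting anything:

```shell
#!/bin/sh
# Hypothetical values for illustration; neither appears in this thread.
JOB_SCRIPT="job.sh"
INTERVAL_MIN=30

# "-c interval=x" on its own only sets the interval; per Al's reply,
# it does not enable checkpointing.
INTERVAL_ONLY="qsub -c interval=${INTERVAL_MIN} ${JOB_SCRIPT}"

# Adding "periodic" is what enables periodic checkpointing.
PERIODIC="qsub -c periodic,interval=${INTERVAL_MIN} ${JOB_SCRIPT}"

# Print both command lines so they can be compared or pasted.
echo "$INTERVAL_ONLY"
echo "$PERIODIC"
```

The second form is the one that produced a checkpoint for me.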
From what you've said, though, I have to ask if there's any way to
specify where the checkpoint goes, especially when it would otherwise be
copied back to the host where pbs_server is running. You see, our use
case involves checkpointing some really big-memory (e.g. 256 GB)
processes, and we simply don't have the space to store that on the
Fulton Supercomputing Lab
Brigham Young University
On 01/26/2012 11:18 AM, Al Taufer wrote:
> Are you just using the "-c interval=x"? If so, that just specifies what the checkpoint interval is, but it does not enable the checkpointing. Try changing it to "-c periodic,interval=x".
> ----- Original Message -----
>> Thanks for the update. I guess the use case our users are really after
>> is to have either a one-time or a periodic checkpoint, with the wait
>> time before the checkpoint specified by the user. The "-c interval="
>> parameter to qsub makes it look like this should work. But when I tried
>> that, I couldn't get the job to actually checkpoint without manually
>> calling qhold/qchkpt. Maybe I'm just misinterpreting something, or
>> don't have it set up right, but the idea here is to not require the
>> users to manually checkpoint their job.
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> On 01/26/2012 09:55 AM, Al Taufer wrote:
>>> ----- Original Message -----
>>>> Can anyone enlighten me on the current state of BLCR-style checkpointing
>>>> in Torque? I've been trying to get it to work, and so far, I see that
>>>> it's invoking my checkpoint script, that script calls
>>>> the checkpoint files/directories are created, but something is calling
>>>> the mom_checkpoint_delete_files function, which in turn calls
>>>> delete_blcr_files, and the checkpoints get deleted.
>>> I hope you are seeing normal behavior. If I remember correctly,
>>> when a job gets checkpointed, the checkpoint files remain on the
>>> mom until the mom completes the job or until the job is put on
>>> hold and is no longer on the mom. At that time the checkpoint
>>> files are transferred to the server where they remain until the
>>> job is removed from the server. When the job gets restarted,
>>> which may or may not be on the original mom node, the checkpoint
>>> files are transferred to the mom which can then restart the job
>>> from the checkpoint file.
>>>> Also, when I do a "qhold" on my job to try to initiate the checkpoint,
>>>> is it really supposed to terminate my job? Perhaps that's
>>>> the job is ending so the files get cleaned up.
>>> qhold is behaving as designed and as documented in its man page.
>>> If you want to just checkpoint the job and allow it to continue
>>> running, use qchkpt.
>>>> Basically, does anyone have it working, and can give me advice?
>>>> Lloyd Brown
>>>> Systems Administrator
>>>> Fulton Supercomputing Lab
>>>> Brigham Young University
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org