[torqueusers] Setting up checkpointing

Sreedhar Manchu sm4082 at nyu.edu
Thu Jan 26 14:43:44 MST 2012


Hi,

This is what I did to keep checkpoint images from being copied onto the server node.

Modify pbs_mom's config file to specify which checkpoint directories are remotely mounted. This can be done by adding something like:

$remote_checkpoint_dirs /opt/torque/checkpoint

Here /opt/torque/checkpoint is remotely mounted onto /opt/torque/checkpoint on each compute node. It doesn't have to be /opt/torque/checkpoint on the server node; it can be any other directory there. I linked /opt/torque/checkpoint on the server node to another directory with lots of space.
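
A minimal sketch of that setup, in case it helps. The big-disk path and both function names are hypothetical (they are not Torque commands), and the pbs_mom config file is typically mom_priv/config under the Torque home directory; adjust everything for your site.

```shell
# Sketch only: helper names and the big-disk path are illustrative assumptions.

# On the server node: make the exported checkpoint path a symlink to a
# directory with plenty of space.
link_checkpoint_dir() {
    ckpt_dir=$1   # exported path, e.g. /opt/torque/checkpoint
    big_dir=$2    # roomy directory (hypothetical, e.g. /bigdisk/torque_ckpt)
    mkdir -p "$big_dir" || return 1
    # Refuse to replace a real (non-symlink) file or directory.
    if [ -e "$ckpt_dir" ] && [ ! -L "$ckpt_dir" ]; then
        echo "refusing to replace existing $ckpt_dir" >&2
        return 1
    fi
    ln -sfn "$big_dir" "$ckpt_dir"
}

# On each compute node: print the line to add to the pbs_mom config file.
mom_config_line() {
    printf '$remote_checkpoint_dirs %s\n' "$1"
}

# Example (run as root, paths are placeholders):
#   link_checkpoint_dir /opt/torque/checkpoint /bigdisk/torque_ckpt
#   mom_config_line /opt/torque/checkpoint >> /path/to/mom_priv/config
```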

Best,
Sreedhar.


On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote:

> Al,
> 
> I tried a number of combinations of params, but after your last email, I
> tried it with "-c periodic,interval=x", and I do see the checkpoint
> being created in the TORQUEMOMHOME/checkpoints directory.  I haven't
> been able to test beyond that, since some other things came up.
> 
> From what you've said, though, I have to ask if there's any way to
> specify where the checkpoint goes, especially when it would otherwise be
> copied back to the host where pbs_server is running.  You see, our use
> case involves checkpointing some really big-memory (e.g. 256 GB)
> processes, and we simply don't have the space to store that on the
> pbs_server host.
> 
> 
> 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> 
> On 01/26/2012 11:18 AM, Al Taufer wrote:
>> Are you just using "-c interval=x"?  If so, that only specifies the checkpoint interval; it does not enable checkpointing.  Try changing it to "-c periodic,interval=x".
>> 
>> ----- Original Message -----
>>> Al,
>>> 
>>> Thanks for the update.  I guess the use case our users are really after
>>> is to have either a one-time or a periodic checkpoint, with the wait
>>> time before the checkpoint specified by the user.  The "-c interval="
>>> parameter to qsub makes it look like this should work.  But when I did
>>> that, I couldn't get the job to actually checkpoint without manually
>>> calling qhold/qchkpt.  Maybe I'm just misinterpreting something, or
>>> don't have it set up right, but the idea here is to not require the
>>> users to manually checkpoint their job.
>>> 
>>> Lloyd Brown
>>> Systems Administrator
>>> Fulton Supercomputing Lab
>>> Brigham Young University
>>> http://marylou.byu.edu
>>> 
>>> On 01/26/2012 09:55 AM, Al Taufer wrote:
>>>> 
>>>> ----- Original Message -----
>>>>> Can anyone enlighten me on the current state of BLCR-style checkpointing
>>>>> in Torque?  I've been trying to get it to work, and so far, I see that
>>>>> it's invoking my checkpoint script, that script calls cr_checkpoint, and
>>>>> the checkpoint files/directories are created, but something is calling
>>>>> the mom_checkpoint_delete_files function, which in turn calls
>>>>> delete_blcr_files, and the checkpoints get deleted.
>>>> 
>>>> I hope you are seeing normal behavior.  If I remember correctly,
>>>> when a job gets checkpointed, the checkpoint files remain on the
>>>> mom until the mom completes the job or until the job is put on
>>>> hold and is no longer on the mom.  At that time the checkpoint
>>>> files are transferred to the server where they remain until the
>>>> job is removed from the server.  When the job gets restarted,
>>>> which may or may not be on the original mom node, the checkpoint
>>>> files are transferred to the mom which can then restart the job
>>>> from the checkpoint file.
>>>> 
>>>>> 
>>>>> Also, when I do a "qhold" on my job to try to initiate the checkpoint,
>>>>> is it really supposed to terminate my job?  Perhaps that's related, e.g.
>>>>> the job is ending so the files get cleaned up.
>>>> 
>>>> qhold is behaving as designed and as documented in its man page.
>>>> If you want to just checkpoint the job and allow it to continue
>>>> running, use qchkpt.
>>>> 
>>>>> 
>>>>> Basically, does anyone have it working, and can give me advice?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> --
>>>>> Lloyd Brown
>>>>> Systems Administrator
>>>>> Fulton Supercomputing Lab
>>>>> Brigham Young University
>>>>> http://marylou.byu.edu
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>> 

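To recap the qsub advice quoted above as something copy-pasteable, here is a small sketch. The helper name ckpt_opt, the script name job.sh, the interval value, and the job id are all placeholders of mine, not anything from Torque itself.

```shell
# Sketch only: ckpt_opt is a hypothetical helper, not a Torque command.
# "-c interval=x" by itself only sets the interval; adding "periodic"
# is what actually enables periodic checkpointing.
ckpt_opt() {
    printf '%s' "-c periodic,interval=$1"
}

# Usage (placeholders throughout):
#   qsub $(ckpt_opt 30) job.sh   # submit with periodic checkpointing
#   qchkpt 1234                  # checkpoint job 1234 and let it keep running
#   qhold 1234                   # checkpoint job 1234 and take it off the mom
```

The qchkpt/qhold distinction in the comments follows Al's description above: qhold stops the job as documented in its man page, while qchkpt leaves it running.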

