[torqueusers] Setting up checkpointing
Sreedhar Manchu
sm4082 at nyu.edu
Thu Jan 26 14:43:44 MST 2012
Hi,
This is what I did to keep checkpoint images from going onto the server node.
Modify pbs_mom's config file to specify which checkpoint directories are remotely mounted. This can be done by adding something like:
$remote_checkpoint_dirs /opt/torque/checkpoint
Here /opt/torque/checkpoint is remotely mounted at /opt/torque/checkpoint on each compute node. It doesn't have to be /opt/torque/checkpoint on the server node; it can be any other directory on the server node. I linked /opt/torque/checkpoint on the server node to another directory with lots of space.
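A minimal sketch of adding that line from the shell, assuming the mom config lives at TORQUE_HOME/mom_priv/config (the usual location, but check your install). The helper name add_remote_checkpoint_dir and the scratch file are made up for illustration; they are not part of Torque.

```shell
#!/bin/sh
# Hypothetical helper (not a Torque tool): append a $remote_checkpoint_dirs
# line to a pbs_mom config file unless that exact line is already there.
add_remote_checkpoint_dir() {
    config_file=$1      # assumed to be TORQUE_HOME/mom_priv/config on a real node
    checkpoint_dir=$2   # directory that is remotely mounted on the compute node
    line="\$remote_checkpoint_dirs $checkpoint_dir"
    # grep -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq "$line" "$config_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$config_file"
    fi
}

# Exercise the helper against a scratch file rather than a live config.
demo_config=$(mktemp)
add_remote_checkpoint_dir "$demo_config" /opt/torque/checkpoint
add_remote_checkpoint_dir "$demo_config" /opt/torque/checkpoint   # second call adds nothing
cat "$demo_config"

# On the server side, the directory can be a symlink to a larger disk, e.g.
#   ln -s /some/large/disk/checkpoint /opt/torque/checkpoint
# (/some/large/disk is a placeholder path, not one from this thread).
```

On a real node, pass the actual mom config path instead of the scratch file. pbs_mom will most likely need a restart or reconfigure before it picks up the change.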
Best,
Sreedhar.
On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote:
> Al,
>
> I tried a number of combinations of params, but after your last email, I
> tried it with "-c periodic,interval=x", and I do see the checkpoint
> being created in the TORQUEMOMHOME/checkpoints directory. I haven't
> been able to test beyond that, since some other things came up.
>
> From what you've said, though, I have to ask if there's any way to
> specify where the checkpoint goes, especially when it would otherwise be
> copied back to the host where pbs_server is running. You see, our use
> case involves checkpointing some really big-memory (e.g. 256 GB)
> processes, and we simply don't have the space to store that on the
> pbs_server host.
>
>
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
> On 01/26/2012 11:18 AM, Al Taufer wrote:
>> Are you just using "-c interval=x"? If so, that only specifies the checkpoint interval; it does not enable checkpointing. Try changing it to "-c periodic,interval=x".
>>
>> ----- Original Message -----
>>> Al,
>>>
>>> Thanks for the update. I guess the use case our users are really
>>> after is to have either a one-time or a periodic checkpoint, with the
>>> wait time before the checkpoint specified by the user. The
>>> "-c interval=" parameter to qsub makes it look like this should work.
>>> But when I did
>>> that, I couldn't get the job to actually checkpoint without manually
>>> calling qhold/qchkpt. Maybe I'm just misinterpreting something, or
>>> don't have it set up right, but the idea here is to not require the
>>> users to manually checkpoint their job.
>>>
>>> Lloyd Brown
>>> Systems Administrator
>>> Fulton Supercomputing Lab
>>> Brigham Young University
>>> http://marylou.byu.edu
>>>
>>> On 01/26/2012 09:55 AM, Al Taufer wrote:
>>>>
>>>> ----- Original Message -----
>>>>> Can anyone enlighten me on the current state of BLCR-style
>>>>> checkpointing in Torque? I've been trying to get it to work, and so
>>>>> far, I see that it's invoking my checkpoint script, that script calls
>>>>> cr_checkpoint, and the checkpoint files/directories are created, but
>>>>> something is calling the mom_checkpoint_delete_files function, which
>>>>> in turn calls delete_blcr_files, and the checkpoints get deleted.
>>>>
>>>> I hope you are seeing normal behavior. If I remember correctly,
>>>> when a job gets checkpointed, the checkpoint files remain on the
>>>> mom until the mom completes the job or until the job is put on
>>>> hold and is no longer on the mom. At that time the checkpoint
>>>> files are transferred to the server where they remain until the
>>>> job is removed from the server. When the job gets restarted,
>>>> which may or may not be on the original mom node, the checkpoint
>>>> files are transferred to the mom which can then restart the job
>>>> from the checkpoint file.
>>>>
>>>>>
>>>>> Also, when I do a "qhold" on my job to try to initiate the
>>>>> checkpoint, is it really supposed to terminate my job? Perhaps
>>>>> that's related, e.g. the job is ending so the files get cleaned up.
>>>>
>>>> qhold is behaving as designed and as documented in its man page.
>>>> If you want to just checkpoint the job and allow it to continue
>>>> running, use qchkpt.
>>>>
>>>>>
>>>>> Basically, does anyone have it working, and can give me advice?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Lloyd Brown
>>>>> Systems Administrator
>>>>> Fulton Supercomputing Lab
>>>>> Brigham Young University
>>>>> http://marylou.byu.edu
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>