[torqueusers] Setting up checkpointing

Lloyd Brown lloyd_brown at byu.edu
Fri Jan 27 11:04:08 MST 2012


I have to apologize for being so dense, but it seems I still need a
little help.

Thanks to Al's and Sreedhar's help, I've been able to get the checkpoint
files to be generated (either in TORQUEMOMDIR/checkpoints, or wherever I
specify via "-c dir=").  When the job ends (qdel, runs out of walltime,
etc.), though, it sounds like the checkpoint should be copied back to
the pbs_server host, either to the location specified via configure or
qmgr, or to PBSSERVERDIR/checkpoint by default.  The problem is that
while the checkpoints get deleted on the mom, they never show up on the
server.  This occurs both with and without "qmgr -c 's q queuename
checkpoint_dir=..'", as described in the docs.  I haven't yet tried
recompiling the server with the config param Al mentioned.
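
For reference, the qsub options discussed in this thread ("periodic", "interval=" and "dir=") can be combined into a single "-c" argument. The following is a minimal sketch that only assembles and prints the command line, without submitting anything. The helper name ckpt_arg, the interval of 30, the directory /scratch/ckpt and the script name job.sh are illustrative placeholders, not values from this thread.

```shell
#!/bin/sh
# Hypothetical helper: assemble the value for qsub's -c option from an
# interval and an optional checkpoint directory. It only builds a
# string and never contacts pbs_server.
ckpt_arg() {
    interval=$1
    dir=$2
    # "periodic" is what enables periodic checkpointing; "interval="
    # alone only sets the interval (see Al's note later in the thread).
    arg="periodic,interval=${interval}"
    # Append dir= only when a directory was supplied.
    if [ -n "$dir" ]; then
        arg="${arg},dir=${dir}"
    fi
    printf '%s\n' "$arg"
}

# Dry run: print the command instead of executing it.
echo "qsub -c $(ckpt_arg 30 /scratch/ckpt) job.sh"
```

Printing the command first makes it easy to check the option string before handing it to a real qsub.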

I'm still deciding whether I like Torque's checkpointing behavior, and
whether it will fit my users' use case, but so far I can't replicate
the expected behavior.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/26/2012 02:43 PM, Sreedhar Manchu wrote:
> Hi,
> 
> This is what I did to keep checkpoint images from going onto the server node.
> 
> Modify the pbs_mom's config file to specify which checkpoint directories are remotely mounted. This can be done by adding something like:
> 
> $remote_checkpoint_dirs /opt/torque/checkpoint
> 
> Here /opt/torque/checkpoint is remotely mounted onto /opt/torque/checkpoint on each compute node. It doesn't have to be /opt/torque/checkpoint on the server node; it can be any other directory there. I linked /opt/torque/checkpoint on the server node to another directory with lots of space.
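
A minimal sketch of adding that line to a mom config without duplicating it on repeated runs. The helper name add_remote_ckpt_dir is hypothetical, and the demonstration writes to a scratch file rather than a real pbs_mom config; the actual config path depends on the installation and is not assumed here.

```shell
#!/bin/sh
# Hypothetical helper: add a $remote_checkpoint_dirs line to a pbs_mom
# config file unless an identical line is already present.
add_remote_ckpt_dir() {
    config=$1
    dir=$2
    line="\$remote_checkpoint_dirs ${dir}"
    # -x matches whole lines only, -F treats the pattern as a fixed string.
    if ! grep -qxF -- "$line" "$config" 2>/dev/null; then
        printf '%s\n' "$line" >> "$config"
    fi
}

# Demonstrate on a scratch file rather than the real mom config.
demo_config=$(mktemp)
add_remote_ckpt_dir "$demo_config" /opt/torque/checkpoint
add_remote_ckpt_dir "$demo_config" /opt/torque/checkpoint   # no duplicate added
cat "$demo_config"
```

On a real node, pbs_mom would then need to re-read its config (for example after a restart) before the setting takes effect.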
> 
> Best,
> Sreedhar.
> 
> 
> On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote:
> 
>> Al,
>>
>> I tried a number of combinations of params, but after your last email, I
>> tried it with "-c periodic,interval=x", and I do see the checkpoint
>> being created in the TORQUEMOMHOME/checkpoints directory.  I haven't
>> been able to test beyond that, since some other things came up.
>>
>> From what you've said, though, I have to ask if there's any way to
>> specify where the checkpoint goes, especially when it would otherwise be
>> copied back to the host where pbs_server is running.  You see, our use
>> case involves checkpointing some really big-memory (e.g. 256 GB)
>> processes, and we simply don't have the space to store that on the
>> pbs_server host.
>>
>>
>>
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>>
>> On 01/26/2012 11:18 AM, Al Taufer wrote:
>>> Are you just using "-c interval=x"?  If so, that only specifies the checkpoint interval; it does not enable checkpointing.  Try changing it to "-c periodic,interval=x".
>>>
>>> ----- Original Message -----
>>>> Al,
>>>>
>>>> Thanks for the update.  I guess the use case our users are really
>>>> after is to have either a one-time or a periodic checkpoint, with the
>>>> wait time before the checkpoint specified by the user.  The
>>>> "-c interval=" parameter to qsub makes it look like this should work.
>>>> But when I did that, I couldn't get the job to actually checkpoint
>>>> without manually calling qhold/qchkpt.  Maybe I'm just
>>>> misinterpreting something, or don't have it set up right, but the
>>>> idea here is to not require the users to manually checkpoint their
>>>> job.
>>>>
>>>> Lloyd Brown
>>>> Systems Administrator
>>>> Fulton Supercomputing Lab
>>>> Brigham Young University
>>>> http://marylou.byu.edu
>>>>
>>>> On 01/26/2012 09:55 AM, Al Taufer wrote:
>>>>>
>>>>> ----- Original Message -----
>>>>>> Can anyone enlighten me on the current state of BLCR-style
>>>>>> checkpointing in Torque?  I've been trying to get it to work, and so
>>>>>> far, I see that it's invoking my checkpoint script, that script
>>>>>> calls cr_checkpoint, and the checkpoint files/directories are
>>>>>> created, but something is calling the
>>>>>> mom_checkpoint_delete_files function, which in turn calls
>>>>>> delete_blcr_files, and the checkpoints get deleted.
>>>>>
>>>>> I hope you are seeing normal behavior.  If I remember correctly,
>>>>> when a job gets checkpointed, the checkpoint files remain on the
>>>>> mom until the mom completes the job or until the job is put on
>>>>> hold and is no longer on the mom.  At that time the checkpoint
>>>>> files are transferred to the server where they remain until the
>>>>> job is removed from the server.  When the job gets restarted,
>>>>> which may or may not be on the original mom node, the checkpoint
>>>>> files are transferred to the mom which can then restart the job
>>>>> from the checkpoint file.
>>>>>
>>>>>>
>>>>>> Also, when I do a "qhold" on my job to try to initiate the
>>>>>> checkpoint, is it really supposed to terminate my job?  Perhaps
>>>>>> that's related, e.g. the job is ending so the files get cleaned up.
>>>>>
>>>>> qhold is behaving as designed and as documented in its man page.
>>>>> If you want to just checkpoint the job and allow it to continue
>>>>> running, use qchkpt.
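
The distinction above can be sketched as a small dry-run helper that picks the command for the desired outcome. The helper name ckpt_cmd, the mode names and the job id 1234.server are hypothetical placeholders; the script only prints commands and does not talk to a Torque server.

```shell
#!/bin/sh
# Hypothetical helper: choose the command for a checkpoint request.
#   keep-running -> qchkpt (checkpoint, job continues running)
#   hold         -> qhold  (checkpoint, job stops running and is held)
ckpt_cmd() {
    mode=$1
    jobid=$2
    case "$mode" in
        keep-running) printf 'qchkpt %s\n' "$jobid" ;;
        hold)         printf 'qhold %s\n' "$jobid" ;;
        *)            echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# Dry run with a placeholder job id; prints the commands only.
ckpt_cmd keep-running 1234.server
ckpt_cmd hold 1234.server
```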
>>>>>
>>>>>>
>>>>>> Basically, does anyone have it working, and can give me advice?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>> Lloyd Brown
>>>>>> Systems Administrator
>>>>>> Fulton Supercomputing Lab
>>>>>> Brigham Young University
>>>>>> http://marylou.byu.edu
>>>>>> _______________________________________________
>>>>>> torqueusers mailing list
>>>>>> torqueusers at supercluster.org
>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>
>>>>
> 

