[torqueusers] Setting up checkpointing

Sreedhar Manchu sm4082 at nyu.edu
Mon Jan 30 15:19:36 MST 2012


Hi Al,

Is there a way to keep checkpoint files off both the compute nodes and the head node? I would like them to go into the users' working directories instead. We have /scratch space mounted on the compute nodes but not on the head node. Overall, local space is limited on the head node as well as on the compute nodes. If the jobs are huge, I'm afraid the checkpoint images might eventually fill all of that space and cause job failures.

I know that qsub -c dir=<path to checkpoint> puts the file in the specified path. If we do this, and the directory is remotely mounted on the compute nodes, does the server still keep a copy of the checkpoint image, or does the image stay only in the path given with dir=?
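
To make the question concrete, here is a minimal sketch of the kind of submission I have in mind. It assumes the comma-separated -c options discussed in this thread (periodic, interval=, dir=) can be combined in one string. The interval value, the /scratch/someuser/checkpoints path, and job.sh are placeholders, and the last line only prints the command rather than submitting anything.

```shell
# Compose the value for qsub's -c option from an interval and a directory.
# Assumption: periodic, interval= and dir= may be combined with commas.
build_checkpoint_opts() {
    interval=$1   # checkpoint interval, passed through unchanged
    dir=$2        # directory that should receive the checkpoint files
    printf 'periodic,interval=%s,dir=%s\n' "$interval" "$dir"
}

# Dry run: print the command instead of submitting it.
# The path and job.sh are hypothetical names for illustration.
echo qsub -c "$(build_checkpoint_opts 30 /scratch/someuser/checkpoints)" job.sh
```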

I appreciate your help.

Thanks,
Sreedhar.

On Jan 26, 2012, at 5:31 PM, Al Taufer wrote:

> This is a good method for accomplishing what is wanted.  The only thing I can add is that when you configure the server you could use the --with-servchkptdir option to specify where the server will keep its checkpoint files, which can be a remotely mounted path.
> 
> ----- Original Message -----
>> Hi,
>> 
>> This is what I did to keep checkpoint images off the server node.
>> 
>> Modify the pbs_mom's config file to specify what checkpointing
>> directories are remotely mounted. This can be done by adding
>> something like:
>> 
>> $remote_checkpoint_dirs /opt/torque/checkpoint
>> 
>> Here /opt/torque/checkpoint is remotely mounted onto
>> /opt/torque/checkpoint on each compute node. It doesn't have to be
>> /opt/torque/checkpoint on server node. It can be any other directory
>> on server node. I linked /opt/torque/checkpoint on server node to
>> some other directory with lots of space.
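
As a sketch of the config step quoted above, the helper below appends the $remote_checkpoint_dirs line to a pbs_mom config file only if it is not already there. The function name and both arguments are my own placeholders; the location of the mom config file depends on the installation, so it is passed in rather than assumed.

```shell
# Append a $remote_checkpoint_dirs line to a pbs_mom config file, once.
# Both arguments are placeholders: supply your own config path and directory.
add_remote_checkpoint_dir() {
    config_file=$1
    checkpoint_dir=$2
    line="\$remote_checkpoint_dirs $checkpoint_dir"
    # -q quiet, -x whole-line match, -F fixed string (no regex meaning for "$")
    if ! grep -qxF -- "$line" "$config_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$config_file"
    fi
}
```

Calling it twice with the same arguments leaves a single line in the file, so it should be safe to run from a node setup script.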
>> 
>> Best,
>> Sreedhar.
>> 
>> 
>> On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote:
>> 
>>> Al,
>>> 
>>> I tried a number of combinations of params, but after your last
>>> email, I tried it with "-c periodic,interval=x", and I do see the
>>> checkpoint being created in the TORQUEMOMHOME/checkpoints
>>> directory.  I haven't been able to test beyond that, since some
>>> other things came up.
>>> 
>>> From what you've said, though, I have to ask if there's any way to
>>> specify where the checkpoint goes, especially when it would
>>> otherwise be copied back to the host where pbs_server is running.
>>> You see, our use case involves checkpointing some really big-memory
>>> (e.g. 256 GB) processes, and we simply don't have the space to
>>> store that on the pbs_server host.
>>> 
>>> 
>>> 
>>> Lloyd Brown
>>> Systems Administrator
>>> Fulton Supercomputing Lab
>>> Brigham Young University
>>> http://marylou.byu.edu
>>> 
>>> On 01/26/2012 11:18 AM, Al Taufer wrote:
>>>> Are you just using the "-c interval=x"?  If so that just specifies
>>>> what the checkpoint interval is but it does not enable the
>>>> checkpointing.  Try changing it to "-c periodic,interval=x".
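
The distinction above is easy to miss, so here is a small sketch of a pre-submission check. It only inspects the option string; it does not talk to TORQUE. The function name has_periodic is hypothetical, and the only rule it encodes is the one stated above: "interval=x" alone sets the interval, while "periodic" is what enables periodic checkpointing.

```shell
# Return 0 if a -c option string lists "periodic" among its
# comma-separated items, 1 otherwise.
has_periodic() {
    case ",$1," in
        *,periodic,*) return 0 ;;
        *) return 1 ;;
    esac
}

for opts in "interval=10" "periodic,interval=10"; do
    if has_periodic "$opts"; then
        echo "$opts: periodic checkpointing requested"
    else
        echo "$opts: interval set, but periodic checkpointing not enabled"
    fi
done
```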
>>>> 
>>>> ----- Original Message -----
>>>>> Al,
>>>>> 
>>>>> Thanks for the update.  I guess the use case our users are really
>>>>> after is to have either a one-time or a periodic checkpoint, with
>>>>> the wait time before the checkpoint specified by the user.  The
>>>>> "-c interval=" parameter to qsub makes it look like this should
>>>>> work.  But when I did that, I couldn't get the job to actually
>>>>> checkpoint without manually calling qhold/qchkpt.  Maybe I'm just
>>>>> misinterpreting something, or don't have it set up right, but the
>>>>> idea here is to not require the users to manually checkpoint
>>>>> their job.
>>>>> 
>>>>> Lloyd Brown
>>>>> Systems Administrator
>>>>> Fulton Supercomputing Lab
>>>>> Brigham Young University
>>>>> http://marylou.byu.edu
>>>>> 
>>>>> On 01/26/2012 09:55 AM, Al Taufer wrote:
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>>> Can anyone enlighten me on the current state of BLCR-style
>>>>>>> checkpointing in Torque?  I've been trying to get it to work,
>>>>>>> and so far, I see that it's invoking my checkpoint script, that
>>>>>>> script calls cr_checkpoint, and the checkpoint
>>>>>>> files/directories are created, but something is calling the
>>>>>>> mom_checkpoint_delete_files function, which in turn calls
>>>>>>> delete_blcr_files, and the checkpoints get deleted.
>>>>>> 
>>>>>> I hope you are seeing normal behavior.  If I remember correctly,
>>>>>> when a job gets checkpointed, the checkpoint files remain on the
>>>>>> mom until the mom completes the job or until the job is put on
>>>>>> hold and is no longer on the mom.  At that time the checkpoint
>>>>>> files are transferred to the server where they remain until the
>>>>>> job is removed from the server.  When the job gets restarted,
>>>>>> which may or may not be on the original mom node, the checkpoint
>>>>>> files are transferred to the mom which can then restart the job
>>>>>> from the checkpoint file.
>>>>>> 
>>>>>>> 
>>>>>>> Also, when I do a "qhold" on my job to try to initiate the
>>>>>>> checkpoint, is it really supposed to terminate my job?  Perhaps
>>>>>>> that's related, e.g. the job is ending so the files get cleaned
>>>>>>> up.
>>>>>> 
>>>>>> qhold is behaving as designed and as documented in its man page.
>>>>>> If you want to just checkpoint the job and allow it to continue
>>>>>> running, use qchkpt.
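
A rough sketch of that choice, for anyone wrapping these commands in a script: the helper below picks qchkpt when the job should keep running and qhold when it should be checkpointed and stopped. The function name and the job id 1234.headnode are placeholders, and the helper only prints the command it would run.

```shell
# Pick the TORQUE command for a checkpoint request.
#   keep_running=yes -> qchkpt (checkpoint, job continues running)
#   anything else    -> qhold  (checkpoint, job stops and is held)
checkpoint_cmd() {
    job_id=$1
    keep_running=$2
    if [ "$keep_running" = "yes" ]; then
        echo "qchkpt $job_id"
    else
        echo "qhold $job_id"
    fi
}

# Dry run with a hypothetical job id.
checkpoint_cmd 1234.headnode yes
checkpoint_cmd 1234.headnode no
```

A job stopped with qhold would later be released with qrls so it can be restarted from its checkpoint.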
>>>>>> 
>>>>>>> 
>>>>>>> Basically, does anyone have it working, and can give me advice?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> --
>>>>>>> Lloyd Brown
>>>>>>> Systems Administrator
>>>>>>> Fulton Supercomputing Lab
>>>>>>> Brigham Young University
>>>>>>> http://marylou.byu.edu
>>>>>>> _______________________________________________
>>>>>>> torqueusers mailing list
>>>>>>> torqueusers at supercluster.org
>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>> 
