[torqueusers] Setting up checkpointing

Sreedhar Manchu sm4082 at nyu.edu
Fri Jan 27 11:33:55 MST 2012


Hi Lloyd,

I had the same problem. Then I realized that the per-job checkpoint directory, whether it lives in the server checkpoint directory or in a directory the user specified on the qsub command line, is deleted as soon as the job either completes or is deleted with qdel.

I don't remember exactly, but I think that if you keep the job information on the server using

qmgr -c "set server keep_completed = 300"

(300 seconds)

then you can restart from the checkpoint that qhold creates. I think that if you set up the server to keep the job information long enough after the job is done, and call qhold from the PBS script just before the walltime ends, you should be able to restart the job from the checkpoint created right before the job ends. Otherwise, once the job ends, Torque removes the job information and BLCR no longer has the files and information it needs to restart the job.
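
A minimal sketch of the "qhold just before walltime ends" idea, for illustration only. The helper names (walltime_to_seconds, hold_delay) and the WALLTIME and MARGIN values are assumptions, not anything Torque provides, and it presumes that compute nodes are allowed to run qhold against the server. It has not been verified against a live Torque/BLCR setup.

```shell
#!/bin/sh
# Sketch: hold (and thereby checkpoint) a Torque job shortly before its
# walltime ends. WALLTIME and MARGIN are hypothetical illustration values.

# Convert a walltime string of the form [[HH:]MM:]SS into seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Seconds to sleep before calling qhold: walltime minus a safety margin,
# never less than zero.
hold_delay() {
    total=$(walltime_to_seconds "$1")
    margin=$2
    if [ "$total" -gt "$margin" ]; then
        echo $((total - margin))
    else
        echo 0
    fi
}

WALLTIME=01:00:00   # assumed to match the walltime requested for the job
MARGIN=300          # assumed time needed to write the checkpoint, in seconds

# Only schedule the hold inside a Torque job, where PBS_JOBID is set.
if [ -n "${PBS_JOBID:-}" ]; then
    ( sleep "$(hold_delay "$WALLTIME" "$MARGIN")"; qhold "$PBS_JOBID" ) &
fi

# ... the actual workload would start here ...
```

The margin has to be large enough for the checkpoint image to be written; for big-memory jobs that may be considerably more than the 300 seconds assumed here.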

Since I use
$spool_as_final_name true
parameter in my /opt/torque/mom_priv/config on each compute node, I couldn't get Torque to restart jobs from checkpoint images. I think that, for whatever reason, BLCR sees the files as modified since they were created. Because of this I never used checkpointing from the Torque side: for us it is more useful to have the files in the user directories from the very beginning of the job than to copy them from the compute nodes after the job is done.

Regarding qdel: it deletes all the checkpoint files, since it also deletes all the job information from the server (if I am right). Without that information from Torque, the checkpoint images would not be much help anyway, because BLCR (I mean Torque compiled with BLCR) needs all of it to restart the job. I guess issuing qdel is like saying "I don't care about this job, so I don't need anything from it anymore."

The same happens once the job finishes. Torque sees no need to keep the checkpoint images, because the job has completed successfully. This is why you need to issue qhold just before the walltime ends; Torque then keeps the job information for the time you set in the keep_completed server parameter with qmgr -c.

A long time ago, I successfully checkpointed and restarted jobs with Torque (a simple C executable). The other thing I noticed was that it deletes the first checkpoint file as soon as the second one is created (I guess it assumes the first checkpoint is no longer needed once a later one exists). This helps keep space usage down.

You can try these things out. I might be wrong about any of these statements. I tried my best for days to make it work the way I wanted, but eventually I realized it wasn't going to (especially with the spool_as_final_name parameter), so I gave up. Now I am trying to do it with just BLCR and some scripts.
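
For anyone going the same BLCR-only route, a rough sketch of what such a script might look like, mimicking the "only the newest checkpoint is kept" behavior described above. The directory layout, the context.PID.SEQ file naming, and the function names are all hypothetical. cr_run, cr_checkpoint, and cr_restart are the standard BLCR utilities, but the exact options should be checked against the installed BLCR version.

```shell
#!/bin/sh
# Sketch: checkpoint a process with BLCR alone and keep only the newest
# images. The naming scheme and helper functions are hypothetical.

# Print the checkpoint file path for directory $1, PID $2, sequence number $3.
checkpoint_path() {
    echo "$1/context.$2.$3"
}

# Delete all but the newest $2 checkpoint files in directory $1,
# using modification time to decide which are newest.
prune_checkpoints() {
    ls -t "$1"/context.* 2>/dev/null | tail -n +"$(( $2 + 1 ))" | while IFS= read -r f; do
        rm -f -- "$f"
    done
}

# Checkpoint process $2 into directory $1 as sequence number $3, and prune
# older images only if the new checkpoint was written successfully.
checkpoint_once() {
    cr_checkpoint -f "$(checkpoint_path "$1" "$2" "$3")" "$2" && prune_checkpoints "$1" 1
}

# Example usage (paths and application name are placeholders):
#   cr_run ./my_app &                          # start the application under BLCR
#   checkpoint_once /scratch/ckpt "$!" 1       # first checkpoint
#   checkpoint_once /scratch/ckpt "$!" 2       # second checkpoint, first is pruned
#   cr_restart /scratch/ckpt/context.<pid>.2   # restart from the newest image
```

Pruning only after a successful cr_checkpoint means a failed checkpoint never removes the last good image.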

It would be great if it worked with Torque. If you succeed, please let us know how you did it. It would be really helpful if someone could help; I know for sure that there are people using Torque with BLCR support.

Good luck,
Sreedhar.


On Jan 27, 2012, at 1:04 PM, Lloyd Brown wrote:

> I have to apologize for being so dense, but it seems I still need a
> little help.
> 
> Thanks to Al's and Sreedhar's help, I've been able to get the checkpoint
> files to be generated (either in TORQUEMOMDIR/checkpoints, or whatever I
> specify via "-c dir=").  When the job ends (qdel, runs out of walltime,
> etc.), though, it sounds like it should be copied back to the pbs_server
> host somewhere, either where specified via configure or qmgr, or in
> PBSSERVERDIR/checkpoint by default.  The thing is that while the
> checkpoints get deleted on the mom, they never show up on the server.
> This occurs both with and without "qmgr -c 's q queuename
> checkpoint_dir=..'", as described in the docs.  I haven't tried
> recompiling the server with the config param Al mentioned yet.
> 
> I'm still deciding whether I like the behavior of torque with respect to
> checkpointing, and whether it will fit with my users' use case, but
> right now, I can't replicate the behavior yet.
> 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> 
> On 01/26/2012 02:43 PM, Sreedhar Manchu wrote:
>> Hi,
>> 
>> This is what I did to keep checkpoint images from going onto the server node.
>> 
>> Modify the pbs_mom's config file to specify what checkpointing directories are remotely mounted. This can be done by adding something like:
>> 
>> $remote_checkpoint_dirs /opt/torque/checkpoint
>> 
>> Here /opt/torque/checkpoint is remotely mounted onto /opt/torque/checkpoint on each compute node. It doesn't have to be /opt/torque/checkpoint on the server node; it can be any other directory there. I linked /opt/torque/checkpoint on the server node to another directory with lots of space.
>> 
>> Best,
>> Sreedhar.
>> 
>> 
>> On Jan 26, 2012, at 4:30 PM, Lloyd Brown wrote:
>> 
>>> Al,
>>> 
>>> I tried a number of combinations of params, but after your last email, I
>>> tried it with "-c periodic,interval=x", and I do see the checkpoint
>>> being created in the TORQUEMOMHOME/checkpoints directory.  I haven't
>>> been able to test beyond that, since some other things came up.
>>> 
>>> From what you've said, though, I have to ask if there's any way to
>>> specify where the checkpoint goes, especially when it would otherwise be
>>> copied back to the host where pbs_server is running.  You see, our use
>>> case involves checkpointing some really big-memory (e.g. 256 GB)
>>> processes, and we simply don't have the space to store that on the
>>> pbs_server host.
>>> 
>>> 
>>> 
>>> Lloyd Brown
>>> Systems Administrator
>>> Fulton Supercomputing Lab
>>> Brigham Young University
>>> http://marylou.byu.edu
>>> 
>>> On 01/26/2012 11:18 AM, Al Taufer wrote:
>>>> Are you just using the "-c interval=x"?  If so that just specifies what the checkpoint interval is but it does not enable the checkpointing.  Try changing it to "-c periodic,interval=x".
>>>> 
>>>> ----- Original Message -----
>>>>> Al,
>>>>> 
>>>>> Thanks for the update.  I guess the use case our users are really
>>>>> after is to have either a one-time or a periodic checkpoint, with the
>>>>> wait time before the checkpoint specified by the user.  The
>>>>> "-c interval=" parameter to qsub makes it look like this should work.
>>>>> But when I did that, I couldn't get the job to actually checkpoint
>>>>> without manually calling qhold/qchkpt.  Maybe I'm just misinterpreting
>>>>> something, or don't have it set up right, but the idea here is to not
>>>>> require the users to manually checkpoint their job.
>>>>> 
>>>>> Lloyd Brown
>>>>> Systems Administrator
>>>>> Fulton Supercomputing Lab
>>>>> Brigham Young University
>>>>> http://marylou.byu.edu
>>>>> 
>>>>> On 01/26/2012 09:55 AM, Al Taufer wrote:
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>>> Can anyone enlighten me on the current state of BLCR-style
>>>>>>> checkpointing in Torque?  I've been trying to get it to work, and so
>>>>>>> far, I see that it's invoking my checkpoint script, that script calls
>>>>>>> cr_checkpoint, and the checkpoint files/directories are created, but
>>>>>>> something is calling the mom_checkpoint_delete_files function, which
>>>>>>> in turn calls delete_blcr_files, and the checkpoints get deleted.
>>>>>> 
>>>>>> I hope you are seeing normal behavior.  If I remember correctly,
>>>>>> when a job gets checkpointed, the checkpoint files remain on the
>>>>>> mom until the mom completes the job or until the job is put on
>>>>>> hold and is no longer on the mom.  At that time the checkpoint
>>>>>> files are transferred to the server where they remain until the
>>>>>> job is removed from the server.  When the job gets restarted,
>>>>>> which may or may not be on the original mom node, the checkpoint
>>>>>> files are transferred to the mom which can then restart the job
>>>>>> from the checkpoint file.
>>>>>> 
>>>>>>> 
>>>>>>> Also, when I do a "qhold" on my job to try to initiate the
>>>>>>> checkpoint, is it really supposed to terminate my job?  Perhaps
>>>>>>> that's related, e.g. the job is ending so the files get cleaned up.
>>>>>> 
>>>>>> qhold is behaving as designed and as documented in its man page.
>>>>>> If you want to just checkpoint the job and allow it to continue
>>>>>> running, use qchkpt.
>>>>>> 
>>>>>>> 
>>>>>>> Basically, does anyone have it working, and can give me advice?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> --
>>>>>>> Lloyd Brown
>>>>>>> Systems Administrator
>>>>>>> Fulton Supercomputing Lab
>>>>>>> Brigham Young University
>>>>>>> http://marylou.byu.edu
>>>>>>> _______________________________________________
>>>>>>> torqueusers mailing list
>>>>>>> torqueusers at supercluster.org
>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers

---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110



