[torqueusers] Setting up checkpointing

Lloyd Brown lloyd_brown at byu.edu
Thu Jan 26 08:05:45 MST 2012


Thanks for your insight.  I apologize that I wasn't clear.  I have BLCR
working, at least manually.  My question has more to do with how the
integration with Torque works.

In the meantime, I'm pursuing the approach of having the checkpointing
occur within the job, e.g., scripting the job to call cr_run,
cr_checkpoint, cr_restart, etc., as needed.  Some clever work with
signals makes that reasonably easy.  The real problem is redirecting or
relocating the output spool files, e.g.,
<TORQUE_MOM_HOME>/spool/<JOBNAME>.{OU,ER}.  But if your script
checkpoints something it launched, rather than checkpointing itself, and
it uses shell redirection to files that persist on a central filesystem,
that's not too hard.
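
A minimal sketch of such a job script follows.  It assumes BLCR's cr_run
and cr_checkpoint are on the PATH and that SIGUSR1 is the signal chosen
to request a checkpoint.  The names CR_RUN, CR_CHECKPOINT, OUTDIR,
checkpoint_child and run_checkpointed, and the job.out, job.err and
context.<pid> file names, are made up for illustration; none of them
come from Torque or BLCR.

```shell
#!/bin/sh
# Sketch only: run a program under BLCR and checkpoint it on SIGUSR1.
# CR_RUN, CR_CHECKPOINT and OUTDIR are illustrative settings, not
# Torque or BLCR defaults.
CR_RUN=${CR_RUN:-cr_run}
CR_CHECKPOINT=${CR_CHECKPOINT:-cr_checkpoint}
OUTDIR=${OUTDIR:-$PWD}    # assumed to live on a shared filesystem

# Checkpoint the process tree rooted at PID $1 into OUTDIR.
checkpoint_child() {
    "$CR_CHECKPOINT" --tree "$1" --file "$OUTDIR/context.$1"
}

# Start "$@" under cr_run with stdout/stderr redirected into OUTDIR
# instead of the MOM spool, then wait for it to finish, taking a
# checkpoint each time this script receives SIGUSR1.
run_checkpointed() {
    "$CR_RUN" "$@" >"$OUTDIR/job.out" 2>"$OUTDIR/job.err" &
    child=$!
    trap 'checkpoint_child "$child"' USR1
    # wait returns early when a trapped signal arrives, so keep
    # waiting until the child has really exited.
    while kill -0 "$child" 2>/dev/null; do
        wait "$child"
        status=$?
    done
    trap - USR1
    return "$status"
}
```

The job script would end with something like "run_checkpointed ./my_app"
(my_app being a placeholder).  Whatever delivers SIGUSR1 to the script is
left out here, as is the restart path, which would call cr_restart on a
saved context file from a later job.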

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/26/2012 02:14 AM, Mahmood Naderan wrote:
> If you are using a Debian-based operating system, you will have a hard
> time getting BLCR to work.
> BLCR is primarily designed for Red Hat-based operating systems.
>  
> // Naderan Mahmood;
> 
> ------------------------------------------------------------------------
> *From:* Lloyd Brown <lloyd_brown at byu.edu>
> *To:* Torque Users Mailing List <torqueusers at supercluster.org>
> *Sent:* Thursday, January 19, 2012 2:09 AM
> *Subject:* [torqueusers] Setting up checkpointing
> 
> Can anyone enlighten me on the current state of BLCR-style checkpointing
> in Torque?  I've been trying to get it to work, and so far, I see that
> it's invoking my checkpoint script, that script calls cr_checkpoint, and
> the checkpoint files/directories are created, but something is calling
> the mom_checkpoint_delete_files function, which in turn calls
> delete_blcr_files, and the checkpoints get deleted.
> 
> Also, when I do a "qhold" on my job to try to initiate the checkpoint,
> is it really supposed to terminate my job?  Perhaps that's related, e.g.,
> the job is ending, so the files get cleaned up.
> 
> Basically, does anyone have it working, and can give me advice?
> 
> Thanks,
> 
> -- 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 

