[torquedev] BLCR checkpoint and restart checked into trunk

Steve Snelgrove ssnelgrove at clusterresources.com
Thu Feb 14 09:12:49 MST 2008


Garrick Staples wrote:
> On Wed, Feb 13, 2008 at 06:59:07PM -0700, Steve Snelgrove alleged:
>   
>> I checked in the BLCR checkpoint and restart changes into the trunk (2.3)
>> and made a snapshot release.  There might still be a few small problems
>> with this code but it does at least seem to work.
>>
>> I have done a documentation page for anyone who is interested in this
>> feature.
>>
>>  http://www.clusterresources.com/torquedocs21/2.6jobcheckpoint.shtml
>>     
>
> Does this work with multi-node jobs?
>
> Your documentation says to use --enable-unixsockets=no when configuring
> torque.  Is there a regression with the unix domain socket code?  Do the
> default CFLAGS not work with the BLCR code?  We don't have any C++ files.
>   
The documentation is very preliminary; I was basically just
documenting my test setup.  Some of these settings are
probably superstition.

There does appear to be some kind of problem with the unix
domain sockets in 2.3: if they are not disabled, things get
very slow and sometimes communication fails altogether.  I
have not looked into this much, as there has been some
pressure to get the BLCR work done.

It probably does not support multi-node jobs.  I have only
tried the test job as described on the web page.  In fact,
there is still a small problem with getting the restored
job into the right state so that the job completion detection
code works.  Dave thought that you might know the right way
to tweak things to fix this, if you could take a look.



