[torqueusers] Check point support
jacksond at clusterresources.com
Tue Feb 28 14:03:05 MST 2006
While it is true that TORQUE does not support the general 'system level
checkpointing' paradigm, it does support the more commonly used
'application level checkpointing' paradigm. In this model, the batch
system is responsible for determining which jobs should be checkpointed
and when, and it then sends signals to the application informing it that
it is time to create a checkpoint.
The application is responsible for trapping these signals and creating
its own custom checkpoint file which saves its state. When the job is
later restarted by the batch system, the application is responsible for
detecting this checkpoint file, recovering its state, and continuing its
operation from where it left off.
TORQUE is able to deliver a combination of signals to the application
indicating that it should start checkpointing soon, must checkpoint
immediately, and/or must exit immediately, allowing sites to set up a
'maximum checkpoint duration' policy within their scheduler. While
application level checkpointing is not as 'easy' as system level
checkpointing, it also does not have the same limitations (system level
checkpointing often only applies to serial jobs, or jobs with no
in-flight messages, or no open file descriptors, etc.). Further,
application level checkpointing generally provides far more efficient
checkpointing, as only relevant data is saved as opposed to saving off
the entire execution environment.
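To make the 'soon / immediately / exit' distinction concrete, here is a
small Python sketch of how an application might prioritise the three
kinds of request. The mapping of SIGUSR1, SIGUSR2, and SIGTERM to these
meanings is purely an assumption for illustration, as are the names
handle_signal, install_handlers, and next_action; the signals a site
actually uses depend on its TORQUE and scheduler configuration.

```python
import signal

# Hypothetical mapping, not TORQUE defaults.
SIGNAL_ACTIONS = {
    signal.SIGUSR1: "checkpoint_soon",  # checkpoint at the next convenient point
    signal.SIGUSR2: "checkpoint_now",   # checkpoint immediately
    signal.SIGTERM: "exit_now",         # stop without further work
}

# Requests received but not yet acted on.
pending = set()


def handle_signal(signum, frame):
    """Signal handler: record which action was requested."""
    action = SIGNAL_ACTIONS.get(signum)
    if action is not None:
        pending.add(action)


def install_handlers():
    """Trap all of the (assumed) signals. Call from the main thread."""
    for signum in SIGNAL_ACTIONS:
        signal.signal(signum, handle_signal)


def next_action(at_safe_point):
    """Return the most urgent pending action, or None.

    'exit_now' overrides everything, 'checkpoint_now' is honoured at any
    time, and 'checkpoint_soon' waits until the caller reports that it
    has reached a convenient point in its computation.
    """
    if "exit_now" in pending:
        pending.clear()
        return "exit_now"
    if "checkpoint_now" in pending:
        # An immediate checkpoint also satisfies an earlier 'soon' request.
        pending.discard("checkpoint_now")
        pending.discard("checkpoint_soon")
        return "checkpoint_now"
    if "checkpoint_soon" in pending and at_safe_point:
        pending.discard("checkpoint_soon")
        return "checkpoint_soon"
    return None
```

The application's main loop would poll next_action() between units of
work, passing at_safe_point=True whenever a checkpoint would be cheap to
write, and then save its state or exit accordingly.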
With regard to Altix checkpointing, are you able to drive this
manually? If so, let us know the process and we can probably integrate
either TORQUE or Maui/Moab to enable intelligent management of this
On Tue, 2006-02-28 at 13:48 +1000, Ashley Wright wrote:
> I am currently investigating whether it is possible to have check
> pointing of jobs in torque. I do not know anything about check pointing
> in Linux. (Our SGI machine supports it). Can someone please give me some
> ideas about what is supported in torque and what software to use.
> I am currently running Torque 1.2.0p6 with Maui 3.2.6p13 on Red Hat
> Enterprise 3. My machines are x86_64 Opterons.
> Thank you for any help you can provide,