
Fault Tolerance Research @ Open Systems Laboratory

Application Level Checkpoint/Restart Interfaces


Overview

The Application Checkpoint/Restart Library project uses an interposition library to abstract the application away from the checkpoint storage and notification mechanisms, which are often system specific. The library wraps the existing checkpoint and restart functionality, providing just enough separation between the system and the application C/R functionality for the library to coordinate their respective activities efficiently.
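One way to picture the wrapping described above is the minimal C sketch below. It is not the library's actual interface: every name in it (`app_cr_fn`, `system_hooks`, `cr_wrapper`, `wrapper_checkpoint`, `wrapper_restart`, and the `demo_*` helpers) is hypothetical, and the hook points are an assumption about where a system-specific service could be notified around the application's own routines.

```c
#include <assert.h>
#include <stddef.h>

/* All names in this sketch are hypothetical; none are taken from the
 * library's real API. */

/* Assumed signature for the application's existing C/R routines. */
typedef int (*app_cr_fn)(const char *name, void *state);

/* Assumed hooks standing in for system-specific services
 * (storage, notification, and so on). */
typedef struct {
    int (*before)(const char *name);             /* runs before the app routine */
    int (*after)(const char *name, int status);  /* runs after the app routine */
} system_hooks;

typedef struct {
    app_cr_fn checkpoint;
    app_cr_fn restart;
    system_hooks hooks;
} cr_wrapper;

/* Run one application routine with the system hooks placed around it. */
static int wrapped_call(const cr_wrapper *w, app_cr_fn fn,
                        const char *name, void *state)
{
    assert(fn != NULL);
    if (w->hooks.before != NULL && w->hooks.before(name) != 0) {
        return -1;  /* system side refused; skip the application routine */
    }
    int status = fn(name, state);
    if (w->hooks.after != NULL) {
        int hook_status = w->hooks.after(name, status);
        if (status == 0) {
            status = hook_status;  /* report a hook failure if the app succeeded */
        }
    }
    return status;
}

int wrapper_checkpoint(const cr_wrapper *w, const char *name, void *state)
{
    assert(w != NULL);
    return wrapped_call(w, w->checkpoint, name, state);
}

int wrapper_restart(const cr_wrapper *w, const char *name, void *state)
{
    assert(w != NULL);
    return wrapped_call(w, w->restart, name, state);
}

/* Demo application routines: keep a single int in memory instead of a file. */
int demo_saved = 0;
int demo_hook_calls = 0;

int demo_save(const char *name, void *state)
{
    (void)name;
    demo_saved = *(int *)state;
    return 0;
}

int demo_load(const char *name, void *state)
{
    (void)name;
    *(int *)state = demo_saved;
    return 0;
}

int demo_before(const char *name)
{
    (void)name;
    demo_hook_calls++;
    return 0;
}

int demo_after(const char *name, int status)
{
    (void)name;
    demo_hook_calls++;
    return status;
}
```

In this sketch the application keeps its own save and load routines unchanged and only registers them in a `cr_wrapper`; anything system specific stays behind the two hooks, so swapping the storage or notification service would not touch the application code.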

The principal goals of this project are to:

  • Transparently interact with specialized checkpoint/restart file systems (e.g., stdchk, PLFS).
  • Express requirements to the underlying runtime and messaging environments (e.g., MPI).
  • Interact with various fault notification services (e.g., CIFTS FTB).
  • Communicate optimizations directly to alternative checkpoint/restart services when applicable (e.g., BLCR).

Code Access

The Application Checkpoint/Restart Library source code is currently hosted at the site below:

For instructions on how to build and install from source see the Installation page.

Publications

  • Joshua Hursey, Scott S. Hampton, Pratul Agarwal, and Andrew Lumsdaine. An Adaptive Checkpoint/Restart Library for Large Scale HPC Applications. 14th SIAM Conference on Parallel Processing for Scientific Computing, 2010 (Poster).

Notes

Currently, only the CIFTS Fault Tolerance Backplane (FTB) events described in the API Reference are supported. More events may be added in the future.

Currently, the Application C/R Library is shipped as part of the Open MPI project. The only reason for this dependency is that we were too lazy to make our own build system, so we piggyback on theirs at the moment. In a later revision we will separate the two projects more cleanly.