Fault Tolerance Research @ Open Systems Laboratory

Application Level Checkpoint/Restart Interfaces

CIFTS FTB Events v1.0

Below is a list of the CIFTS FTB events that this library supports and the action the library takes for each. Others will be added as the library is expanded. These events are thrown in the FTB.checkpoint_sw.appcr event space.

FTB Event Severity Action Brief Description
CHKPT_REQ Info Caught Request a checkpoint from the application.
CHKPT_BEGIN Info Thrown A checkpoint has started at this rank.
CHKPT_END Info Thrown A checkpoint has finished at this rank.
CHKPT_ERROR Error Thrown A checkpoint has failed at this rank.
RSTRT_BEGIN Info Thrown A restart of the rank has started.
RSTRT_END Info Thrown A restart of the rank has finished.
RSTRT_ERROR Error Thrown A restart of the rank has failed.
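
A monitoring tool that consumes these events may want the table in machine-readable form. The following is a minimal sketch in C that transcribes the rows above into a lookup table. The type and function names (`appcr_event`, `appcr_event_lookup`, `appcr_event_is_failure`) are hypothetical and are not part of the library or of FTB.

```c
#include <stddef.h>
#include <string.h>

/* Severity and action of each event, transcribed from the table above. */
typedef enum { SEV_INFO, SEV_ERROR } event_severity;
typedef enum { ACT_CAUGHT, ACT_THROWN } event_action;

typedef struct {
    const char    *name;
    event_severity severity;
    event_action   action;   /* from the library's point of view */
} appcr_event;

const appcr_event appcr_events[] = {
    { "CHKPT_REQ",   SEV_INFO,  ACT_CAUGHT },
    { "CHKPT_BEGIN", SEV_INFO,  ACT_THROWN },
    { "CHKPT_END",   SEV_INFO,  ACT_THROWN },
    { "CHKPT_ERROR", SEV_ERROR, ACT_THROWN },
    { "RSTRT_BEGIN", SEV_INFO,  ACT_THROWN },
    { "RSTRT_END",   SEV_INFO,  ACT_THROWN },
    { "RSTRT_ERROR", SEV_ERROR, ACT_THROWN },
};

/* Return the table entry for `name`, or NULL if the event is not listed. */
const appcr_event *appcr_event_lookup(const char *name)
{
    size_t n = sizeof appcr_events / sizeof appcr_events[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(appcr_events[i].name, name) == 0) {
            return &appcr_events[i];
        }
    }
    return NULL;
}

/* Nonzero if `name` is one of the events with Error severity. */
int appcr_event_is_failure(const char *name)
{
    const appcr_event *ev = appcr_event_lookup(name);
    return ev != NULL && ev->severity == SEV_ERROR;
}
```

A consumer could call `appcr_event_is_failure` on each received event name to decide whether to raise an alert.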

Checkpoint/Restart API Reference v1.0

Below is a list of the API interfaces to the Checkpoint/Restart Library.

CR_Init

Initialize the CR library. This must be called after MPI_Init.

C Specification

#include <appcr.h>
int CR_Init(MPI_Comm comm);

Parameters

Name Arguments Description
comm MPI_COMM_WORLD (or other communicator) The communicator is duplicated so that a checkpoint running in thread context cannot deadlock against overlapping collectives on the application's communicator.

Notes

Upon restart, this function may call the previously registered CR_Restart_cb_fn_t function on supported platforms. Otherwise, the restart callback is called upon the first call to CR_Register_restart_cb.

As justification for the communicator restrictions, the MPI 2.2 Standard, Section 5.12 (Collective Communication: Correctness), states:

Finally, in multi-threaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error
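
The ordering constraints (CR_Init after MPI_Init, and CR_Finalize before MPI_Finalize) can be sketched as a small wrapper. This is an illustration under stated assumptions, not the library's prescribed usage: `run_with_cr` and the `USE_REAL_APPCR` switch are hypothetical names, the numeric value of `CR_ERROR` is assumed, and the stand-in functions in the `#else` branch exist only so the sketch compiles without MPI or the CR library.

```c
#include <stddef.h>

#ifdef USE_REAL_APPCR            /* hypothetical switch: build against the real headers */
#include <mpi.h>
#include <appcr.h>
#else
/* Stand-ins so the sketch compiles on its own. They only track call order. */
typedef int MPI_Comm;
#define MPI_COMM_WORLD 0
#define CR_SUCCESS 0
#define CR_ERROR   1             /* assumed value; the reference only fixes CR_SUCCESS as 0 */
static int mpi_up = 0, cr_up = 0;
int MPI_Init(int *argc, char ***argv) { (void)argc; (void)argv; mpi_up = 1; return 0; }
int MPI_Finalize(void) { mpi_up = 0; return 0; }
int CR_Init(MPI_Comm comm)
{
    (void)comm;
    if (!mpi_up) return CR_ERROR;           /* stand-in behaviour only */
    cr_up = 1;
    return CR_SUCCESS;
}
int CR_Finalize(void)
{
    if (!mpi_up || !cr_up) return CR_ERROR; /* stand-in behaviour only */
    cr_up = 0;
    return CR_SUCCESS;
}
#endif

/* Run `work` between CR_Init and CR_Finalize, inside MPI_Init/MPI_Finalize.
 * Returns 0 on success and nonzero on failure. */
int run_with_cr(int *argc, char ***argv, int (*work)(void))
{
    int rc = 1;

    MPI_Init(argc, argv);
    if (CR_Init(MPI_COMM_WORLD) == CR_SUCCESS) {   /* must follow MPI_Init */
        /* Callbacks would be registered here, directly after CR_Init. */
        rc = (work != NULL) ? work() : 0;
        if (CR_Finalize() != CR_SUCCESS) {         /* must precede MPI_Finalize */
            rc = 1;
        }
    }
    MPI_Finalize();
    return rc;
}
```

An application's `main` could then reduce to `return run_with_cr(&argc, &argv, my_work);`, where `my_work` is the application's own entry point.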

CR_Finalize

Finalize the CR library. This must be called before MPI_Finalize.

C Specification

#include <appcr.h>
int CR_Finalize(void);

Parameters

None.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error

CR_Attr_get

Get the value of an attribute.

C Specification

#include <appcr.h>
int CR_Attr_get(char * key, char **value);

Parameters

Name Arguments Description
key string Key from the Attributes table.
value string A buffer to store the string associated with the key.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error

CR_Attr_set

Set the value of an attribute.

C Specification

#include <appcr.h>
int CR_Attr_set(char * key, char *value);

Parameters

Name Arguments Description
key string Key from the Attributes table.
value string Value to associate with the key.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error

CR_Protect_enter

Enter a protected section in which no checkpoint should be taken. This keeps the checkpointing thread from being activated while global data structures are being updated.

C Specification

#include <appcr.h>
int CR_Protect_enter(void);

Parameters

None.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error

CR_Protect_leave

Leave a protected section in which no checkpoint should be taken.

C Specification

#include <appcr.h>
int CR_Protect_leave(void);

Parameters

None.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error
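
A protected section is typically a short bracket around an update that must not be observed half-finished. The sketch below assumes a hypothetical application state (`struct sim_state`, `g_state`) and a hypothetical helper `advance_state`. The `#else` branch holds stand-ins so the example compiles without the library, and `USE_REAL_APPCR` is a hypothetical build switch.

```c
#ifdef USE_REAL_APPCR            /* hypothetical switch: build against the real header */
#include <appcr.h>
#else
/* Stand-ins so the sketch compiles on its own. */
#define CR_SUCCESS 0
#define CR_ERROR   1             /* assumed value */
int cr_protected = 0;            /* stand-in flag: 1 while inside a protected section */
int CR_Protect_enter(void) { cr_protected = 1; return CR_SUCCESS; }
int CR_Protect_leave(void) { cr_protected = 0; return CR_SUCCESS; }
#endif

/* Hypothetical global state that a checkpoint callback would save.
 * Both fields must describe the same time step. */
struct sim_state { long step; double time; };
struct sim_state g_state = { 0, 0.0 };

/* Advance the state by one step without letting a checkpoint see
 * `step` updated but `time` not yet updated. */
int advance_state(double dt)
{
    if (CR_Protect_enter() != CR_SUCCESS) {
        return CR_ERROR;
    }
    g_state.step += 1;
    g_state.time += dt;
    return CR_Protect_leave();
}
```

Keeping the protected region this small limits how long a pending checkpoint has to wait.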

CR_Register_checkpoint_cb

Register a checkpoint callback function.

C Specification

#include <appcr.h>
int CR_Register_checkpoint_cb(CR_Checkpoint_cb_fn_t cb_func);

Parameters

Name Arguments Description
cb_func NULL or function pointer Pointer to a function with a signature matching CR_Checkpoint_cb_fn_t.

Notes

Multiple calls overwrite the last callback registered. Passing a NULL value deregisters the checkpoint callback.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error

CR_Checkpoint_cb_fn_t

The checkpoint callback function signature.

C Specification

#include <appcr.h>
typedef int (*CR_Checkpoint_cb_fn_t)(char *dir, int *seq, MPI_Comm ckpt_comm);

Parameters

Name Arguments Description
dir string Full path to the directory that this process can write its checkpoint to.
seq pointer to integer Sequence number, determined by the process, that identifies this checkpoint iteration.
ckpt_comm MPI_Comm A duplicate of the communicator passed to CR_Init to be used exclusively for communication in this callback.

Notes

If threaded checkpointing is enabled, this callback may be invoked in a different thread from normal execution, so take care that the function is thread safe.

Return Codes

Return Code Description
CR_SUCCESS (0) Successful checkpoint
CR_ERROR Failed checkpoint, directory will not be saved
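
As an illustration, a checkpoint callback might write the application's state into `dir` and report a sequence number through `seq`. The sketch below is one possible shape, not the library's reference implementation: `app_checkpoint_cb`, `app_step`, `app_ckpt_count`, and the file name `state.dat` are all hypothetical, and the `#else` branch provides stand-ins so the code compiles without MPI or the library (`USE_REAL_APPCR` is a hypothetical build switch).

```c
#include <stdio.h>

#ifdef USE_REAL_APPCR            /* hypothetical switch: build against the real headers */
#include <mpi.h>
#include <appcr.h>
#else
/* Stand-ins so the sketch compiles on its own. */
typedef int MPI_Comm;
#define CR_SUCCESS 0
#define CR_ERROR   1             /* assumed value */
#endif

long app_step = 0;               /* hypothetical application state to save */
int  app_ckpt_count = 0;         /* number of successful checkpoints so far */

/* Intended to match the CR_Checkpoint_cb_fn_t signature. */
int app_checkpoint_cb(char *dir, int *seq, MPI_Comm ckpt_comm)
{
    char path[4096];
    FILE *fp;
    int n;

    (void)ckpt_comm;             /* this sketch needs no communication */

    /* Build "<dir>/state.dat" and reject paths that do not fit. */
    n = snprintf(path, sizeof path, "%s/state.dat", dir);
    if (n < 0 || (size_t)n >= sizeof path) {
        return CR_ERROR;
    }

    fp = fopen(path, "w");
    if (fp == NULL) {
        return CR_ERROR;
    }
    if (fprintf(fp, "%ld\n", app_step) < 0) {
        fclose(fp);
        return CR_ERROR;
    }
    if (fclose(fp) != 0) {       /* a failed flush also means a failed checkpoint */
        return CR_ERROR;
    }

    *seq = app_ckpt_count++;     /* sequence number chosen by the process */
    return CR_SUCCESS;
}
```

Because the callback may run in a separate thread, a real application would also need to make sure `app_step` is not modified while it is being written, for example by updating it only inside a protected section.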

CR_Register_restart_cb

Register a restart callback function.

C Specification

#include <appcr.h>
int CR_Register_restart_cb(CR_Restart_cb_fn_t cb_func);

Parameters

Name Arguments Description
cb_func NULL or function pointer Pointer to a function with a signature matching CR_Restart_cb_fn_t.

Notes

Multiple calls overwrite the last callback registered. Passing a NULL value deregisters the restart callback.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error

CR_Restart_cb_fn_t

Restart callback function signature.

C Specification

#include <appcr.h>
typedef int (*CR_Restart_cb_fn_t)(char *dir, int seq, MPI_Comm ckpt_comm);

Parameters

Name Arguments Description
dir string Full path to the directory that this process can read its checkpoint from.
seq int Sequence number for this process to reference this checkpoint iteration.
ckpt_comm MPI_Comm A duplicate of the communicator passed to CR_Init to be used exclusively for communication in this callback.

Notes

Upon restart, this function may be called from CR_Init on supported platforms. Otherwise, it is called upon the first call to CR_Register_restart_cb. For this reason, it is strongly encouraged that you register callbacks directly after calling CR_Init.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error
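
A restart callback is the mirror image of the checkpoint callback: it reads back whatever was written to `dir`. The sketch below assumes, purely for illustration, that the checkpoint consists of a step counter written as text to a file named `state.dat`. The names `app_restart_cb` and `app_step` are hypothetical, and the `#else` branch provides stand-ins so the code compiles without MPI or the library (`USE_REAL_APPCR` is a hypothetical build switch).

```c
#include <stdio.h>

#ifdef USE_REAL_APPCR            /* hypothetical switch: build against the real headers */
#include <mpi.h>
#include <appcr.h>
#else
/* Stand-ins so the sketch compiles on its own. */
typedef int MPI_Comm;
#define CR_SUCCESS 0
#define CR_ERROR   1             /* assumed value */
#endif

long app_step = 0;               /* hypothetical application state to restore */

/* Intended to match the CR_Restart_cb_fn_t signature. */
int app_restart_cb(char *dir, int seq, MPI_Comm ckpt_comm)
{
    char path[4096];
    FILE *fp;
    long step;
    int n;

    (void)seq;                   /* unused here; could select among several files */
    (void)ckpt_comm;             /* this sketch needs no communication */

    n = snprintf(path, sizeof path, "%s/state.dat", dir);
    if (n < 0 || (size_t)n >= sizeof path) {
        return CR_ERROR;
    }

    fp = fopen(path, "r");
    if (fp == NULL) {
        return CR_ERROR;
    }
    if (fscanf(fp, "%ld", &step) != 1) {
        fclose(fp);
        return CR_ERROR;
    }
    fclose(fp);

    app_step = step;             /* commit only after a complete, valid read */
    return CR_SUCCESS;
}
```

Reading into a local variable first means a truncated or missing checkpoint file leaves the in-memory state untouched.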

CR_Checkpoint

A blocking checkpoint operation.

C Specification

#include <appcr.h>
int CR_Checkpoint(void);

Parameters

None.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error

CR_Icheckpoint

A non-blocking checkpoint operation.

C Specification

#include <appcr.h>
int CR_Icheckpoint(CR_Request_t *req);

Parameters

Name Arguments Description
req CR_Request_t Returns a request that can be passed to CR_Wait or CR_Test.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error

CR_Test

Test if the checkpoint operation is complete.

C Specification

#include <appcr.h>
int CR_Test(CR_Request_t *req, int *done);

Parameters

Name Arguments Description
req CR_Request_t Test a checkpoint request for completion.
done 0 = finished, 1 = not finished Whether the checkpoint has finished.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error
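
CR_Test makes it possible to overlap a checkpoint with useful work by polling between work units. The sketch below follows the `done` convention from the table above (0 = finished, 1 = not finished). The helper `checkpoint_overlapped` is a hypothetical name, and the `#else` branch holds stand-ins that simply complete after a fixed number of polls so the example compiles without the library (`USE_REAL_APPCR` is a hypothetical build switch).

```c
#include <stddef.h>

#ifdef USE_REAL_APPCR            /* hypothetical switch: build against the real header */
#include <appcr.h>
#else
/* Stand-ins so the sketch compiles on its own. The real CR_Request_t is opaque. */
typedef struct { int polls_left; } CR_Request_t;
#define CR_SUCCESS 0
#define CR_ERROR   1             /* assumed value */
int CR_Icheckpoint(CR_Request_t *req) { req->polls_left = 3; return CR_SUCCESS; }
int CR_Test(CR_Request_t *req, int *done)
{
    if (req->polls_left > 0) {
        req->polls_left--;
    }
    *done = (req->polls_left == 0) ? 0 : 1;   /* 0 = finished, 1 = not finished */
    return CR_SUCCESS;
}
#endif

/* Start a non-blocking checkpoint and call do_work() between polls.
 * Returns the number of work units overlapped, or -1 on error. */
int checkpoint_overlapped(void (*do_work)(void))
{
    CR_Request_t req;
    int done = 1;
    int units = 0;

    if (CR_Icheckpoint(&req) != CR_SUCCESS) {
        return -1;
    }
    for (;;) {
        if (CR_Test(&req, &done) != CR_SUCCESS) {
            return -1;
        }
        if (done == 0) {         /* 0 means the checkpoint has finished */
            break;
        }
        if (do_work != NULL) {
            do_work();
        }
        units++;
    }
    return units;
}
```

Note that this convention is the opposite of MPI_Test, whose flag is nonzero on completion, so the comparison is written out explicitly rather than as `if (done)`.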

CR_Wait

Wait for the checkpoint operation to complete.

C Specification

#include <appcr.h>
int CR_Wait(CR_Request_t *req, int *status);

Parameters

Name Arguments Description
req CR_Request_t A request to wait upon.
status Integer The return code of the checkpoint operation. Matches the return codes described in the CR_Checkpoint function definition.

Notes

If the checkpoint thread is available, it is used to make concurrent progress. Otherwise, most or all of the work is done during the CR_Wait operation.

Return Codes

Return Code Description
CR_SUCCESS (0) Success
CR_ERROR Error

CR Attributes

The following attributes are supported for use with the CR_Attr_get and CR_Attr_set routines.

Name Arguments Description
checkpoint_freq X min (0 = off) Automatically checkpoint approximately every X min.
am_i_restarting 1 (true), 0 (false) Set to true when restarting. Should be cleared by the user once restarted.
enable_cr_thread 1 (true), 0 (false) Enable the checkpointing thread. This is useful when setting the checkpoint_freq attribute so that asynchronous progress can be made as a result of a timer event.
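
The attributes above might be used as follows to turn on periodic checkpointing and to handle the restart flag. This is a sketch under several assumptions that the reference does not confirm: that values are passed as decimal strings such as "1" and "10", and that CR_Attr_get returns a pointer the caller must not free. The helpers `enable_periodic_checkpoints`, `is_restarting`, and `clear_restart_flag` are hypothetical names, and the `#else` branch is a stand-in key/value store so the example compiles without the library (`USE_REAL_APPCR` is a hypothetical build switch).

```c
#include <string.h>

#ifdef USE_REAL_APPCR            /* hypothetical switch: build against the real header */
#include <appcr.h>
#else
/* Stand-in attribute store so the sketch compiles on its own. */
#define CR_SUCCESS 0
#define CR_ERROR   1             /* assumed value */
struct attr { const char *key; char value[32]; };
struct attr attr_table[] = {
    { "checkpoint_freq",  "0" },
    { "am_i_restarting",  "0" },
    { "enable_cr_thread", "0" },
};
static struct attr *attr_find(const char *key)
{
    size_t n = sizeof attr_table / sizeof attr_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(attr_table[i].key, key) == 0) return &attr_table[i];
    }
    return NULL;
}
int CR_Attr_set(char *key, char *value)
{
    struct attr *a = attr_find(key);
    if (a == NULL || strlen(value) >= sizeof a->value) return CR_ERROR;
    strcpy(a->value, value);
    return CR_SUCCESS;
}
int CR_Attr_get(char *key, char **value)
{
    struct attr *a = attr_find(key);
    if (a == NULL) return CR_ERROR;
    *value = a->value;           /* assumption: pointer to library-owned storage */
    return CR_SUCCESS;
}
#endif

/* Enable the checkpoint thread, then request a checkpoint roughly
 * every `minutes` minutes (passed as a decimal string, e.g. "10"). */
int enable_periodic_checkpoints(char *minutes)
{
    char k_thread[] = "enable_cr_thread";
    char k_freq[]   = "checkpoint_freq";
    char on[]       = "1";

    if (CR_Attr_set(k_thread, on) != CR_SUCCESS) {
        return CR_ERROR;
    }
    return CR_Attr_set(k_freq, minutes);
}

/* Nonzero if the am_i_restarting attribute is currently "1". */
int is_restarting(void)
{
    char key[] = "am_i_restarting";
    char *value = NULL;

    if (CR_Attr_get(key, &value) != CR_SUCCESS || value == NULL) {
        return 0;
    }
    return strcmp(value, "1") == 0;
}

/* Clear the restart flag once the application has finished restarting. */
int clear_restart_flag(void)
{
    char key[] = "am_i_restarting";
    char off[] = "0";
    return CR_Attr_set(key, off);
}
```

The keys are copied into local arrays because the documented prototypes take `char *` rather than `const char *`.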

CR_Request_t

Checkpoint request type.

Example

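#include <appcr.h>

/* Start a non-blocking checkpoint, then block until it completes.
 * The function name is illustrative; it is not part of the library. */
int checkpoint_and_wait(void)
{
  CR_Request_t req;
  int status = CR_ERROR;  /* overwritten with the checkpoint's return code */

  if (CR_Icheckpoint(&req) != CR_SUCCESS) return CR_ERROR;
  if (CR_Wait(&req, &status) != CR_SUCCESS) return CR_ERROR;
  return status;
}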

FTB Event: CHKPT_REQ

FTB Event to request a checkpoint of the target application.
This event is caught by the library.
The library schedules a checkpoint operation to occur at the next opportunity. If the checkpoint thread is enabled, the checkpoint occurs immediately. Otherwise, it is postponed until the process next calls into the checkpoint library.

FTB Event: CHKPT_BEGIN

FTB Event to indicate that the checkpoint operation has begun at the designated rank.
This event is thrown by the library.

FTB Event: CHKPT_END

FTB Event to indicate that the checkpoint operation has finished at the designated rank.
This event is thrown by the library.

FTB Event: CHKPT_ERROR

FTB Event to indicate that the checkpoint operation has failed at the designated rank.
This event is thrown by the library.

FTB Event: RSTRT_BEGIN

FTB Event to indicate that the restart operation has begun at the designated rank.
This event is thrown by the library.

FTB Event: RSTRT_END

FTB Event to indicate that the restart operation has finished at the designated rank.
This event is thrown by the library.

FTB Event: RSTRT_ERROR

FTB Event to indicate that the restart operation has failed at the designated rank.
This event is thrown by the library.
