Below is a list of the CIFTS FTB events supported by this library and the
action the library takes for each. Others will be added as the library is
expanded. All of these events belong to the FTB.checkpoint_sw.appcr event space.
| FTB Event | Severity | Action | Brief Description |
|---|---|---|---|
| CHKPT_REQ | Info | Caught | Request a checkpoint from the application. |
| CHKPT_BEGIN | Info | Thrown | A checkpoint has started at this rank. |
| CHKPT_END | Info | Thrown | A checkpoint has finished at this rank. |
| CHKPT_ERROR | Error | Thrown | A checkpoint has failed at this rank. |
| RSTRT_BEGIN | Info | Thrown | A restart of the rank has started. |
| RSTRT_END | Info | Thrown | A restart of the rank has finished. |
| RSTRT_ERROR | Error | Thrown | A restart of the rank has failed. |
Below are the functions, types, and attributes that make up the API of the Checkpoint/Restart Library.
Initialize the CR library. This must be called after MPI_Init.

    #include <appcr.h>
    int CR_Init(MPI_Comm comm);

| Name | Arguments | Description |
|---|---|---|
| comm | MPI_COMM_WORLD (or other communicator) | Because the process may checkpoint in thread context, this communicator is duplicated to avoid any deadlock caused by overlapping collectives. |

Upon restart, this function may call the previously registered CR_Restart_cb_fn_t function on supported platforms. Otherwise, the restart callback is called upon the first call to CR_Register_restart_cb.

As justification for the communicator restrictions, the MPI 2.2 Standard, Section 5.12 (Collective Communication: Correctness), states:

"Finally, in multi-threaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process."

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
Finalize the CR library. This must be called before MPI_Finalize.

    #include <appcr.h>
    int CR_Finalize(void);

| Name | Arguments | Description |
|---|---|---|
| (none) | | This function takes no arguments. |

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
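The ordering constraints on CR_Init and CR_Finalize can be sketched as follows. This is a minimal illustration, not code from the library: so that it compiles without MPI or appcr.h, the four calls are replaced by stand-in stubs that only log their names, `MPI_Comm` is a plain `int`, and the value of `CR_ERROR` is an assumption (the tables above only fix `CR_SUCCESS` at 0). The names `run_app`, `call_log`, and `log_call` are hypothetical.

```c
#include <string.h>

/* Stand-ins so the sketch compiles without <mpi.h> and <appcr.h>.
 * In a real program, include those headers and delete this section. */
typedef int MPI_Comm;
#define MPI_COMM_WORLD 0
#define CR_SUCCESS 0
#define CR_ERROR (-1) /* assumed value, not stated in the documentation */

static char call_log[128]; /* records the order in which the stubs run */

static void log_call(const char *name) {
    strncat(call_log, name, sizeof(call_log) - strlen(call_log) - 1);
    strncat(call_log, ";", sizeof(call_log) - strlen(call_log) - 1);
}

static int MPI_Init(int *argc, char ***argv) {
    (void)argc; (void)argv;
    log_call("MPI_Init");
    return 0;
}
static int MPI_Finalize(void) { log_call("MPI_Finalize"); return 0; }
static int CR_Init(MPI_Comm comm) {
    (void)comm;
    log_call("CR_Init");
    return CR_SUCCESS;
}
static int CR_Finalize(void) { log_call("CR_Finalize"); return CR_SUCCESS; }

/* Required nesting: MPI_Init, CR_Init, application work, CR_Finalize,
 * MPI_Finalize. */
static int run_app(int argc, char **argv) {
    int rc = CR_ERROR;

    MPI_Init(&argc, &argv);
    if (CR_Init(MPI_COMM_WORLD) == CR_SUCCESS) {
        /* ... register callbacks, compute, checkpoint ... */
        rc = CR_Finalize();
    }
    MPI_Finalize();
    return rc;
}
```

In a real application `run_app` would simply be the body of `main`; the point of the sketch is only that the CR calls sit strictly inside the MPI_Init/MPI_Finalize pair.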
Get the value of an attribute.

    #include <appcr.h>
    int CR_Attr_get(char *key, char **value);

| Name | Arguments | Description |
|---|---|---|
| key | string | Key from the Attributes table. |
| value | string | A buffer to store the string associated with the key. |

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
Set the value of an attribute.

    #include <appcr.h>
    int CR_Attr_set(char *key, char *value);

| Name | Arguments | Description |
|---|---|---|
| key | string | Key from the Attributes table. |
| value | string | Value to associate with the key. |

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
Enter a protected section in which no checkpoint should be taken. This keeps the checkpointing thread from being activated while global data structures are being updated.

    #include <appcr.h>
    int CR_Protect_enter(void);

| Name | Arguments | Description |
|---|---|---|
| (none) | | This function takes no arguments. |

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
Leave a protected section in which no checkpoint should be taken.

    #include <appcr.h>
    int CR_Protect_leave(void);

| Name | Arguments | Description |
|---|---|---|
| (none) | | This function takes no arguments. |

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
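A protected section is typically wrapped around an update that must be seen by a checkpoint either entirely or not at all. The sketch below is illustrative only: `CR_Protect_enter` and `CR_Protect_leave` are replaced by stand-in stubs that count nesting depth so the block compiles without appcr.h, the value of `CR_ERROR` is assumed, and `sim_state`, `state`, `protect_depth`, and `advance` are hypothetical names.

```c
/* Stand-ins for <appcr.h>; protect_depth exists only to make the sketch
 * observable. The real library would block the checkpoint thread instead. */
#define CR_SUCCESS 0
#define CR_ERROR (-1) /* assumed value, not stated in the documentation */

static int protect_depth = 0;

static int CR_Protect_enter(void) {
    protect_depth++;
    return CR_SUCCESS;
}

static int CR_Protect_leave(void) {
    if (protect_depth == 0)
        return CR_ERROR; /* leave without a matching enter */
    protect_depth--;
    return CR_SUCCESS;
}

/* Hypothetical application state that a checkpoint callback would save. */
struct sim_state {
    int step;
    double energy;
};
static struct sim_state state = {0, 0.0};

/* Update both fields inside one protected section, so a checkpoint never
 * observes a new step paired with a stale energy value. */
static int advance(double new_energy) {
    if (CR_Protect_enter() != CR_SUCCESS)
        return CR_ERROR;
    state.step += 1;
    state.energy = new_energy;
    return CR_Protect_leave();
}
```

Keeping the protected section as short as possible limits how long a pending checkpoint request has to wait.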
Register a checkpoint callback function.

    #include <appcr.h>
    int CR_Register_checkpoint_cb(CR_Checkpoint_cb_fn_t cb_func);

| Name | Arguments | Description |
|---|---|---|
| cb_func | NULL or function pointer | Pointer to a function with a signature matching CR_Checkpoint_cb_fn_t. |

Each call replaces the previously registered callback. Passing NULL deregisters the checkpoint callback.

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
The checkpoint callback function signature.

    #include <appcr.h>
    typedef int (*CR_Checkpoint_cb_fn_t)(char *dir, int *seq, MPI_Comm ckpt_comm);

| Name | Arguments | Description |
|---|---|---|
| dir | string | Full path to the directory that this process can write its checkpoint to. |
| seq | pointer to integer | Sequence number, determined by the process, that identifies this checkpoint iteration. |
| ckpt_comm | MPI_Comm | A duplicate of the communicator passed to CR_Init, to be used exclusively for communication in this callback. |

If threaded checkpointing is enabled, this callback may run in a different thread from normal execution, so take care to make it thread safe.

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Successful checkpoint |
| CR_ERROR | Failed checkpoint; the directory will not be saved |
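A checkpoint callback matching the signature above might look like the following. This is a sketch under stated assumptions rather than the library's reference usage: `MPI_Comm` is a stand-in `int` and `CR_ERROR` has an assumed value so the block compiles without MPI or appcr.h, and the state variable `current_step`, the function name `save_state`, and the file name `state.txt` are all hypothetical. It also assumes the application chooses its step counter as the sequence number.

```c
#include <stdio.h>

/* Stand-ins so the sketch compiles without <mpi.h> and <appcr.h>. */
typedef int MPI_Comm;
#define CR_SUCCESS 0
#define CR_ERROR (-1) /* assumed value, not stated in the documentation */

static int current_step = 42; /* hypothetical application state */

/* Has the shape of CR_Checkpoint_cb_fn_t: write the state into dir and
 * report a sequence number for this checkpoint through seq. */
static int save_state(char *dir, int *seq, MPI_Comm ckpt_comm) {
    char path[4096];
    FILE *fp;
    int n, ok;

    (void)ckpt_comm; /* this per-process file needs no communication */

    n = snprintf(path, sizeof(path), "%s/state.txt", dir);
    if (n < 0 || n >= (int)sizeof(path))
        return CR_ERROR; /* path did not fit */

    fp = fopen(path, "w");
    if (fp == NULL)
        return CR_ERROR;

    ok = fprintf(fp, "%d\n", current_step) > 0;
    if (fclose(fp) != 0)
        ok = 0; /* a failed flush means the checkpoint is incomplete */
    if (!ok)
        return CR_ERROR;

    *seq = current_step; /* label this checkpoint with the step number */
    return CR_SUCCESS;
}
```

Such a function would be passed to CR_Register_checkpoint_cb. When the checkpoint thread is enabled, the main thread's updates to `current_step` would need to sit inside a CR_Protect_enter/CR_Protect_leave pair for this callback to read a consistent value.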
Register a restart callback function.

    #include <appcr.h>
    int CR_Register_restart_cb(CR_Restart_cb_fn_t cb_func);

| Name | Arguments | Description |
|---|---|---|
| cb_func | NULL or function pointer | Pointer to a function with a signature matching CR_Restart_cb_fn_t. |

Each call replaces the previously registered callback. Passing NULL deregisters the restart callback.

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
Restart callback function signature.

    #include <appcr.h>
    typedef int (*CR_Restart_cb_fn_t)(char *dir, int seq, MPI_Comm ckpt_comm);

| Name | Arguments | Description |
|---|---|---|
| dir | string | Full path to the directory that this process can read its checkpoint from. |
| seq | integer | Sequence number that this process uses to reference this checkpoint iteration. |
| ckpt_comm | MPI_Comm | A duplicate of the communicator passed to CR_Init, to be used exclusively for communication in this callback. |

Upon restart, this function may be called from CR_Init on supported platforms. Otherwise, the restart callback is called upon the first call to CR_Register_restart_cb. For this reason it is strongly encouraged that you register callbacks directly after calling CR_Init.

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
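A restart callback matching the signature above could restore state along these lines. This is a sketch, not the library's reference usage: `MPI_Comm` is a stand-in `int` and `CR_ERROR` has an assumed value so the block compiles without MPI or appcr.h, and `current_step`, `load_state`, and the file name `state.txt` are hypothetical. It assumes the checkpoint side wrote a single integer step to `state.txt` and used that same step as the sequence number.

```c
#include <stdio.h>

/* Stand-ins so the sketch compiles without <mpi.h> and <appcr.h>. */
typedef int MPI_Comm;
#define CR_SUCCESS 0
#define CR_ERROR (-1) /* assumed value, not stated in the documentation */

static int current_step = 0; /* hypothetical application state */

/* Has the shape of CR_Restart_cb_fn_t: read the saved state from dir and
 * accept it only if it belongs to checkpoint iteration seq. */
static int load_state(char *dir, int seq, MPI_Comm ckpt_comm) {
    char path[4096];
    FILE *fp;
    int n, step, matched;

    (void)ckpt_comm; /* this per-process file needs no communication */

    n = snprintf(path, sizeof(path), "%s/state.txt", dir);
    if (n < 0 || n >= (int)sizeof(path))
        return CR_ERROR; /* path did not fit */

    fp = fopen(path, "r");
    if (fp == NULL)
        return CR_ERROR;

    matched = fscanf(fp, "%d", &step);
    fclose(fp);

    /* Reject an unreadable file or one from a different iteration. */
    if (matched != 1 || step != seq)
        return CR_ERROR;

    current_step = step;
    return CR_SUCCESS;
}
```

Validating the file against `seq` before touching `current_step` means a failed restart leaves the in-memory state unchanged.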
A blocking checkpoint operation.

    #include <appcr.h>
    int CR_Checkpoint(void);

| Name | Arguments | Description |
|---|---|---|
| (none) | | This function takes no arguments. |

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
A non-blocking checkpoint operation.

    #include <appcr.h>
    int CR_Icheckpoint(CR_Request_t *req);

| Name | Arguments | Description |
|---|---|---|
| req | CR_Request_t | Returns a request that can be passed to CR_Test or CR_Wait. |

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
Test whether the checkpoint operation is complete.

    #include <appcr.h>
    int CR_Test(CR_Request_t *req, int *done);

| Name | Arguments | Description |
|---|---|---|
| req | CR_Request_t | The checkpoint request to test for completion. |
| done | 0 = finished, | Whether the checkpoint has finished. |

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
Wait for the checkpoint operation to complete.

    #include <appcr.h>
    int CR_Wait(CR_Request_t *req, int *status);

| Name | Arguments | Description |
|---|---|---|
| req | CR_Request_t | A request to wait upon. |
| status | integer | The return code of the checkpoint operation. Matches the return codes described in the CR_Checkpoint function definition. |

If the checkpoint thread is available, it is used to make concurrent progress; otherwise most or all of the work is done during the CR_Wait operation.

| Return Code | Description |
|---|---|
| CR_SUCCESS (0) | Success |
| CR_ERROR | Error |
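The usual reason to prefer CR_Icheckpoint plus CR_Wait over CR_Checkpoint is to overlap the checkpoint with other work. The sketch below shows that pattern under stated assumptions: `CR_Request_t`, `CR_Icheckpoint`, and `CR_Wait` are stand-in stubs (the real request type is opaque and the real work happens in the library), the value of `CR_ERROR` is assumed, and `compute_chunk`, `work_done`, and `checkpoint_overlapped` are hypothetical names.

```c
/* Stand-ins for <appcr.h>. In this stub the "checkpoint" simply completes
 * inside CR_Wait, which mirrors the no-checkpoint-thread case. */
#define CR_SUCCESS 0
#define CR_ERROR (-1) /* assumed value, not stated in the documentation */

typedef int CR_Request_t; /* stand-in; the real type is defined by the library */

static int CR_Icheckpoint(CR_Request_t *req) {
    *req = 1; /* mark the request as active */
    return CR_SUCCESS;
}

static int CR_Wait(CR_Request_t *req, int *status) {
    if (*req != 1)
        return CR_ERROR; /* not an active request */
    *req = 0;
    *status = CR_SUCCESS; /* outcome of the checkpoint operation itself */
    return CR_SUCCESS;
}

static long work_done = 0;
static void compute_chunk(void) { work_done += 1000; } /* hypothetical work */

/* Start a checkpoint, do independent work, then collect the result. */
static int checkpoint_overlapped(void) {
    CR_Request_t req;
    int status = CR_ERROR;

    if (CR_Icheckpoint(&req) != CR_SUCCESS)
        return CR_ERROR;

    compute_chunk(); /* overlaps only if the checkpoint thread is enabled */

    if (CR_Wait(&req, &status) != CR_SUCCESS)
        return CR_ERROR;
    return status; /* distinguish "wait failed" from "checkpoint failed" */
}
```

Note the two separate results: the return value of CR_Wait reports whether waiting succeeded, while `status` reports whether the checkpoint did. Work placed between the two calls should not modify data the checkpoint callback reads unless it is guarded by a protected section.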
List of supported attributes, for use with the CR_Attr_get and CR_Attr_set routines.

| Name | Values | Description |
|---|---|---|
| checkpoint_freq | X minutes (0 = off) | Automatically checkpoint approximately every X minutes. |
| am_i_restarting | 1 (true), 0 (false) | Set to true when restarting. Should be cleared by the user once restarted. |
| enable_cr_thread | 1 (true), 0 (false) | Enable the checkpointing thread. This is useful when setting the checkpoint_freq attribute so that asynchronous progress can be made as a result of a timer event. |
| restart_args | string | Set of additional arguments to pass to this process upon restart. |
| checkpoint_directory | string | Directory in which to store checkpoint files. |
| sync_seq_numbers | 1 (true), 0 (false) | Whether sequence numbers are synchronized across all processes (1) or kept independent (0). Synchronizing implies at least an MPI_Barrier before each checkpoint. |
| max_num_restarts | X (0 = unlimited) | Maximum number of times this process should be restarted before aborting. |
| quiesce_comm | 1 (true), 0 (false) | Drain all messages from the network before a checkpoint and resume after the checkpoint. |
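As an illustration of how these attributes combine, the sketch below turns on periodic checkpointing by setting enable_cr_thread and checkpoint_freq. It is a sketch under assumptions, not confirmed library behavior: because CR_Attr_set takes string arguments, it assumes numeric values are passed as decimal strings; `CR_Attr_set` itself is replaced by a stand-in stub that records key/value pairs so the block compiles without appcr.h; the value of `CR_ERROR` is assumed; and `enable_periodic_checkpoints`, `attr_keys`, `attr_vals`, and `attr_count` are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

#define CR_SUCCESS 0
#define CR_ERROR (-1) /* assumed value, not stated in the documentation */

/* Stand-in for the library call: remembers up to MAX_ATTRS pairs. */
#define MAX_ATTRS 8
#define ATTR_LEN 32
static char attr_keys[MAX_ATTRS][ATTR_LEN];
static char attr_vals[MAX_ATTRS][ATTR_LEN];
static int attr_count = 0;

static int CR_Attr_set(char *key, char *value) {
    if (attr_count == MAX_ATTRS ||
        strlen(key) >= ATTR_LEN || strlen(value) >= ATTR_LEN)
        return CR_ERROR;
    strcpy(attr_keys[attr_count], key);
    strcpy(attr_vals[attr_count], value);
    attr_count++;
    return CR_SUCCESS;
}

/* Enable the checkpoint thread first, then ask for a checkpoint roughly
 * every `minutes` minutes. Values are passed as strings (an assumption). */
static int enable_periodic_checkpoints(int minutes) {
    char key_thread[] = "enable_cr_thread";
    char key_freq[] = "checkpoint_freq";
    char on[] = "1";
    char freq[16];

    if (minutes <= 0)
        return CR_ERROR; /* 0 would switch periodic checkpoints off */

    snprintf(freq, sizeof(freq), "%d", minutes);
    if (CR_Attr_set(key_thread, on) != CR_SUCCESS)
        return CR_ERROR;
    return CR_Attr_set(key_freq, freq);
}
```

The thread is enabled before the frequency is set because, as the table notes, the timer-driven checkpoint relies on the checkpointing thread to make asynchronous progress.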
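Checkpoint request type (CR_Request_t), used with CR_Icheckpoint, CR_Test, and CR_Wait.

    #include <appcr.h>
    {
        CR_Request_t req;
        int status;

        /* Start a non-blocking checkpoint, then block until it completes. */
        CR_Icheckpoint(&req);
        /* On return, status holds the result of the checkpoint operation. */
        CR_Wait(&req, &status);
    }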

CHKPT_REQ: FTB event to request a checkpoint of the target application.
This event is caught by the library.
The library schedules a checkpoint operation for the next opportunity.
If the checkpoint thread is enabled, this happens immediately; otherwise
the checkpoint is postponed until the process next calls into the
checkpoint library.

CHKPT_BEGIN: FTB event indicating that the checkpoint operation has begun at the designated rank.
This event is thrown by the library.

CHKPT_END: FTB event indicating that the checkpoint operation has finished at the designated rank.
This event is thrown by the library.

CHKPT_ERROR: FTB event indicating that the checkpoint operation has failed at the designated rank.
This event is thrown by the library.

RSTRT_BEGIN: FTB event indicating that the restart operation has begun at the designated rank.
This event is thrown by the library.

RSTRT_END: FTB event indicating that the restart operation has finished at the designated rank.
This event is thrown by the library.

RSTRT_ERROR: FTB event indicating that the restart operation has failed at the designated rank.
This event is thrown by the library.