Fault Tolerance Research @ Open Systems Laboratory

Application Level Checkpoint/Restart Interfaces

Command Line Tools

The checkpoint/restart library provides two command line tools: cr-setup and cr-sync.

cr-setup

Set up the environment for checkpointing. This tool may establish the global checkpoint directory, initialize a checkpointing file system, or perform other actions necessary to set up a checkpointing environment for the application.

Interface

cr-setup \
    [-h | --help]
    [-v | --verbose]
    [-h | --handle]
    [-d | --storage]
    [-t | --tmp]
    [--script]
    [-r | --restart]
    [-s | --seq]
    [-l | --list]

Example

shell$ cat run-myapp.sh
#!/usr/bin/env bash

# A local /tmp on each node
export APPCR_TMP_DIR=/tmp/

# General Reference for this application
export APPCR_HANDLE=example

# Suggested path to store the checkpoint
export APPCR_STORAGE_DIR=$HOME/tmp/ckpt-appcr

# Set up the checkpoint system
cr-setup

# Launch the application
mpirun my-app

# Synchronize the working directory
cr-sync

# To restart use the following command
# cr-setup --restart

Arguments

Argument         Description
-h | --help      Display help.
-v | --verbose   Display verbose output.
-h | --handle    Override the APPCR_HANDLE environment variable.
-d | --storage   Override the APPCR_STORAGE_DIR environment variable.
-t | --tmp       Override the APPCR_TMP_DIR environment variable.
--script         Override the APPCR_SCRIPT environment variable.
-r | --restart   Set up the environment to restart a job.
-s | --seq       Request a restart from a specific sequence number. By default, restart uses the highest sequence number.
-l | --list      Display a list of checkpoint files available on this machine for the given handle.
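
The restart path mentioned in the closing comment of the example can be sketched as a second job script. This is only an illustration under stated assumptions: the `run` helper, the `restart_job` function, and the `DRY_RUN` variable are hypothetical names introduced here so the sequence can be previewed without the tools installed, and relaunching the application with the same mpirun line after `cr-setup --restart` is inferred from the example rather than confirmed by the tool.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the tools): when DRY_RUN is set,
# print the command instead of executing it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sketch of a restart job, mirroring the run-myapp.sh example.
restart_job() {
    export APPCR_HANDLE=example
    export APPCR_STORAGE_DIR=$HOME/tmp/ckpt-appcr

    # Set up the environment to restart a job; without --seq the
    # highest sequence number is used.
    run cr-setup --restart || return 1

    # Assumption: the application is relaunched the same way as before.
    run mpirun my-app

    # Synchronize the working directory before the job script ends.
    run cr-sync
}
```

Calling `restart_job` with `DRY_RUN=1` set prints the three commands in order without executing them, which is a cheap way to review the sequence before submitting the job.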

Notes

...

cr-sync

Synchronize the checkpoint file system to the globally mounted stable storage location specified by $APPCR_STORAGE_DIR. Usually this is called just before terminating a job script in order to flush the distributed file system storage cache to a file system that does not depend on the job allocation.

Interface

cr-sync  \
    [-h | --help]
    [-v | --verbose]
    [-c | --cleanup]
    [-h | --handle]
    [-d | --storage]
    [-t | --tmp]

Example

See the cr-setup example above.

Arguments

Argument         Description
-h | --help      Display help.
-v | --verbose   Display verbose output.
-h | --handle    Override the APPCR_HANDLE environment variable.
-d | --storage   Override the APPCR_STORAGE_DIR environment variable.
-t | --tmp       Override the APPCR_TMP_DIR environment variable.
-c | --cleanup   Clean up the stable storage directory. Removes backup and working checkpoints (synchronized checkpoints are preserved). Also inspects the synchronized checkpoints and displays a warning when not all ranks are represented.

Notes

Checkpoints may be synchronized while the application is running in order to prepare for unexpected job loss. However, the job submission script should always call cr-sync just before it terminates, and should allow sufficient time to flush large files from the distributed cache to the directory pointed to by $APPCR_STORAGE_DIR.
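
One way to make sure the final synchronization happens even when the application exits with an error is a shell EXIT trap. The sketch below is an assumption about how a job script might be organized, not something the tools require; `run_with_final_sync` is a hypothetical helper name, and the synchronization command is passed in as a parameter so the pattern can be tried without cr-sync installed.

```shell
#!/usr/bin/env bash

# run_with_final_sync SYNC_CMD APP [ARGS...]
# Hypothetical helper: runs APP in a subshell and always invokes
# SYNC_CMD when that subshell ends, whether APP succeeded or failed.
run_with_final_sync() (
    sync_cmd=$1
    shift

    # The EXIT trap fires when this subshell terminates.
    trap '"$sync_cmd"' EXIT

    "$@"
    exit $?
)

# In a job script this might be used as:
#   run_with_final_sync cr-sync mpirun my-app
```

The trap only guarantees that the synchronization is started. The job allocation still has to leave enough time for large files to be flushed.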

Environment Variables

Below is a list of environment variables reserved by the checkpoint/restart library. Some of these can be set by the user to influence the behavior of the library.

Variable            Description
APPCR_HANDLE        The user-defined handle used to reference the checkpoint set for this application.
APPCR_STORAGE_DIR   A globally mounted file system to which checkpoint files are synchronized.
APPCR_TMP_DIR       A locally mounted file system to which each rank temporarily writes its local checkpoint files.
APPCR_SCRIPT        Script file used to set up environment variables.
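
A job script may want to confirm that these variables are set before it calls cr-setup. The following is a minimal sketch, assuming the script expects the handle, storage, and temporary directory variables to be set explicitly. The library may well supply defaults, so treat the check as optional. `check_appcr_env` is a hypothetical function name, and the variable names are taken from the table above.

```shell
#!/usr/bin/env bash

# Hypothetical pre-flight check: returns 0 when every listed variable
# is set and non-empty, 1 otherwise, and names each missing variable.
check_appcr_env() {
    missing=0
    for name in APPCR_HANDLE APPCR_STORAGE_DIR APPCR_TMP_DIR; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# In a job script this might be used as:
#   check_appcr_env || exit 1
#   cr-setup
```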
