Command line tools.
Set up the environment for checkpointing. This command may establish the global checkpoint directory, initialize a checkpointing file system, or perform other actions necessary to set up a checkpointing environment for the application.
cr-setup \
[-h | --help]
[-v | --verbose]
[-h | --handle]
[-d | --storage]
[-t | --tmp]
[--script]
[-r | --restart]
[-s | --seq]
[-l | --list]
    shell$ cat run-myapp.sh
    #!/bin/env bash
    # A local /tmp on each node
    export APPCR_TMP_DIR=/tmp/
    # General reference for this application
    export APPCR_HANDLE=example
    # Suggested path to store the checkpoint
    export APPCR_STORAGE_DIR=$HOME/tmp/ckpt-appcr
    # Set up the checkpoint system
    cr-setup
    # Launch the application
    mpirun my-app
    # Synchronize the working directory
    cr-sync
    # To restart use the following command
    # cr-setup --restart
| Argument | Description |
|---|---|
| -h \| --help | Display help. |
| -v \| --verbose | Display verbose output. |
| -h \| --handle | Override the APPCR_HANDLE environment variable. |
| -d \| --storage | Override the APPCR_STORAGE_DIR environment variable. |
| -t \| --tmp | Override the APPCR_TMP_DIR environment variable. |
| --script | Override the APPCR_SCRIPT environment variable. |
| -r \| --restart | Set up the environment to restart a job. |
| -s \| --seq | Request to be restarted from a specific sequence number. By default, restart will use the highest sequence number. |
| -l \| --list | Display a list of checkpoint files available on this machine for the given handle. |
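As a rough illustration of how the restart-related options might be combined, the sketch below wraps them in a small bash helper. The function name `appcr_restart` and the `CR_SETUP` override variable are hypothetical and not part of the tools. It also assumes that `--seq` takes the sequence number as its next argument, which the table does not state explicitly.

```shell
# Hypothetical helper: list the available checkpoints, then set up a restart.
# CR_SETUP is an assumed override so the command can be swapped out for testing.
CR_SETUP=${CR_SETUP:-cr-setup}

appcr_restart() {
    local seq=${1:-}   # optional sequence number; empty means "use the default"

    # Show which checkpoint files exist for the current handle.
    "$CR_SETUP" --list || return 1

    if [ -n "$seq" ]; then
        # Assumption: --seq is followed by the sequence number.
        "$CR_SETUP" --restart --seq "$seq"
    else
        # Without --seq, restart uses the highest sequence number.
        "$CR_SETUP" --restart
    fi
}
```

Calling `appcr_restart` with no argument restarts from the default checkpoint, while `appcr_restart 3` would request sequence number 3 under the assumption above.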
...
Synchronize the checkpoint file system to the globally mounted stable storage
location specified by $APPCR_STORAGE_DIR.
Usually this is called just before terminating a job script in order to flush
the distributed file system storage cache to a file system that does not depend
on the job allocation.
cr-sync \
[-h | --help]
[-v | --verbose]
[-c | --cleanup]
[-h | --handle]
[-d | --storage]
[-t | --tmp]
See example above.
| Argument | Description |
|---|---|
| -h \| --help | Display help. |
| -v \| --verbose | Display verbose output. |
| -h \| --handle | Override the APPCR_HANDLE environment variable. |
| -d \| --storage | Override the APPCR_STORAGE_DIR environment variable. |
| -t \| --tmp | Override the APPCR_TMP_DIR environment variable. |
| -c \| --cleanup | Clean up the stable storage directory. Removes backup and working checkpoints (synchronized checkpoints are preserved). Also inspects the synchronized checkpoints and displays a warning when not all of the ranks are represented. |
The checkpoints may be synchronized while the application is running in order
to prepare for unexpected job loss. However, the job submission script should
always call cr-sync just before it terminates, and should allow sufficient
time to flush large files from the distributed cache to the directory pointed
to by $APPCR_STORAGE_DIR.
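One way a job script could follow this advice is sketched below: synchronize periodically in the background while the application runs, then synchronize once more after it exits. This is only a sketch. The names `run_with_sync`, `CR_SYNC`, and `SYNC_INTERVAL` are hypothetical, and it assumes that `cr-sync` can safely be called with no arguments while the application is still running.

```shell
# Hypothetical wrapper: run a command, sync periodically while it runs,
# and always sync once more after it ends.
# CR_SYNC and SYNC_INTERVAL are assumed names, not part of the library.
CR_SYNC=${CR_SYNC:-cr-sync}
SYNC_INTERVAL=${SYNC_INTERVAL:-600}   # seconds between background syncs

run_with_sync() {
    (
        # On TERM: stop the pending sleep and leave. Bash defers the trap
        # until a foreground sync has finished, so a sync is not cut short.
        trap 'kill "$nap" 2>/dev/null; exit 0' TERM
        while :; do
            sleep "$SYNC_INTERVAL" &
            nap=$!
            wait "$nap"        # interruptible wait, unlike a foreground sleep
            "$CR_SYNC"
        done
    ) &
    local sync_pid=$!

    # Run the application in the foreground and remember its exit status.
    "$@"
    local rc=$?

    # Stop the periodic loop, wait for any in-flight sync, then do the
    # final flush to stable storage.
    kill "$sync_pid" 2>/dev/null || true
    wait "$sync_pid" 2>/dev/null || true
    "$CR_SYNC"

    return "$rc"
}
```

In a job script this would replace the separate launch and sync steps, for example `run_with_sync mpirun my-app`. The interval should be chosen so that the final sync still fits within the remaining job allocation.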
Below is a list of environment variables reserved by the checkpoint/restart library. Some of these can be set by the user to influence the behavior of the library.
| Variable | Description |
|---|---|
| APPCR_HANDLE | The user-defined handle that can be used to reference the checkpoint set for this application. |
| APPCR_STORAGE_DIR | A globally mounted file system to which checkpoint files are synchronized. |
| APPCR_TMP_DIR | A locally mounted file system to which each rank temporarily writes its local checkpoint files. |
| APPCR_SCRIPT | Script file used to set up environment variables. |
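A job script might check these variables before calling cr-setup. The following bash sketch shows one possible pre-flight check. The function name `appcr_check_env` is hypothetical, and the rules it enforces are assumptions rather than documented requirements: that all three variables are set, and that the local temporary directory already exists and is writable. The storage directory is deliberately not tested, since cr-setup may establish it.

```shell
# Hypothetical pre-flight check for the user-settable variables.
# The rules below are assumptions, not requirements stated by the library.
appcr_check_env() {
    local var missing=0

    for var in APPCR_HANDLE APPCR_STORAGE_DIR APPCR_TMP_DIR; do
        # ${!var} is bash indirect expansion: the value of the variable
        # whose name is stored in $var.
        if [ -z "${!var:-}" ]; then
            echo "error: $var is not set" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] || return 1

    # Assumption: the node-local temporary directory must already exist.
    if [ ! -d "$APPCR_TMP_DIR" ] || [ ! -w "$APPCR_TMP_DIR" ]; then
        echo "error: APPCR_TMP_DIR=$APPCR_TMP_DIR is not a writable directory" >&2
        return 1
    fi
}
```

A job script could then run `appcr_check_env || exit 1` ahead of cr-setup so that a missing variable fails early rather than partway through the job.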