This configure option specifies the type of fault tolerance to enable in the
Open MPI build. By default no fault tolerance is enabled, which is equivalent
to specifying --without-ft. Currently only the cr option is supported.
./configure --with-ft=cr
This option enables a concurrent thread that helps the application make progress on a checkpoint operation while it is not inside the MPI library. To use this feature you must enable MPI threads in addition to the checkpointing thread. By default this is disabled.
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
This option specifies the path to the installation of the BLCR library. It is strongly suggested that users specify this option to ensure that the proper BLCR installation is selected.
./configure --with-ft=cr --with-blcr=/opt/blcr/
This option specifies the path to the library directory of the BLCR installation.
./configure --with-ft=cr --with-blcr=/opt/blcr/ --with-blcr-libdir=/opt/blcr/lib64
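The configure options described above can be combined in a single invocation. The following is a minimal sketch that assembles such a command line and prints it for review rather than running it. The prefix /opt/blcr is the example path used above, and the lib64 check is an assumption about how a local BLCR installation might be laid out, not something Open MPI requires.

```shell
# Hypothetical BLCR install prefix; change this to match the local system.
BLCR_PREFIX=/opt/blcr

# Options taken from the descriptions above.
CONFIGURE_ARGS="--with-ft=cr --enable-ft-thread --enable-mpi-threads"
CONFIGURE_ARGS="$CONFIGURE_ARGS --with-blcr=$BLCR_PREFIX"

# Pass --with-blcr-libdir only when a lib64 directory exists under the prefix.
if [ -d "$BLCR_PREFIX/lib64" ]; then
    CONFIGURE_ARGS="$CONFIGURE_ARGS --with-blcr-libdir=$BLCR_PREFIX/lib64"
fi

# Print the command for review instead of running it.
echo "./configure $CONFIGURE_ARGS"
```

Printing the command first makes it easy to confirm which BLCR paths were picked up before starting a long build.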
To enable checkpoint/restart fault tolerance for an MPI application you must
use the Aggregate MCA parameter ft-enable-cr. This enables the best
checkpoint/restart fault tolerance components currently available.
shell$ mpirun -am ft-enable-cr my-app
Verbose output for the OMPI layer Checkpoint/Restart functionality.
Default: 0 (off)
shell$ mpirun --mca ompi_cr_verbose 10 -am ft-enable-cr my-app
Verbose output for the ORTE layer Checkpoint/Restart functionality.
Default: 0 (off)
shell$ mpirun --mca orte_cr_verbose 10 -am ft-enable-cr my-app
Verbose output for the OPAL layer Checkpoint/Restart functionality.
Default: 0 (off)
shell$ mpirun --mca opal_cr_verbose 10 -am ft-enable-cr my-app
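The three verbose parameters above can be given together when it is unclear which layer is misbehaving. As a rough sketch, the loop below builds the repeated --mca arguments from the layer names and prints the resulting command. The level 10 and the application name my-app are simply the values used in the examples above.

```shell
# Verbosity level to apply to every layer (10 is the value used above).
VERBOSE_LEVEL=10

# Build one "--mca <layer>_cr_verbose <level>" pair per layer.
MCA_ARGS=""
for layer in ompi orte opal; do
    MCA_ARGS="$MCA_ARGS --mca ${layer}_cr_verbose $VERBOSE_LEVEL"
done

# Print the command for review instead of running it.
echo "mpirun$MCA_ARGS -am ft-enable-cr my-app"
```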
Enable fault tolerance for this program.
Default: 0 (disabled)
Automatically enabled by ft-enable-cr.
The user should never need to set this parameter.
Enable checkpoint timer
Default: 0 (disabled)
shell$ mpirun --mca opal_cr_enable_timer 1 -am ft-enable-cr my-app
Enable checkpoint timer barrier between stages to control for process skew.
Default: 0 (disabled)
shell$ mpirun --mca opal_cr_enable_timer_barrier 1 --mca opal_cr_enable_timer 1 \
-am ft-enable-cr my-app
MPI rank that should display the checkpoint timer.
Default: 0
shell$ mpirun --mca opal_cr_timer_target_rank 2 \
--mca opal_cr_enable_timer 1 \
-am ft-enable-cr my-app
Use an asynchronous thread to checkpoint this program.
Default: 0 (off)
Automatically enabled by ft-enable-cr when built with --enable-ft-thread.
The user should never need to set this parameter.
Time for the checkpoint thread to sleep between checking for a checkpoint.
Default: 0 microseconds
shell$ mpirun --mca opal_cr_thread_sleep_check 10 -am ft-enable-cr my-app
Time for the checkpoint thread to sleep when waiting for a process to exit the
MPI library.
Default: 0 microseconds
shell$ mpirun --mca opal_cr_thread_sleep_wait 10 -am ft-enable-cr my-app
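Both sleep parameters are expressed in microseconds, which is easy to get wrong when thinking in milliseconds. The sketch below converts millisecond values to microseconds before building the command. The chosen values (1 ms and 5 ms) are arbitrary illustrations, not recommended settings.

```shell
# Illustrative sleep intervals in milliseconds (arbitrary values).
SLEEP_CHECK_MS=1
SLEEP_WAIT_MS=5

# The parameters expect microseconds: 1 ms = 1000 microseconds.
SLEEP_CHECK_US=$((SLEEP_CHECK_MS * 1000))
SLEEP_WAIT_US=$((SLEEP_WAIT_MS * 1000))

# Print the command for review instead of running it.
echo "mpirun --mca opal_cr_thread_sleep_check $SLEEP_CHECK_US" \
     "--mca opal_cr_thread_sleep_wait $SLEEP_WAIT_US -am ft-enable-cr my-app"
```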
Whether this is a tool program, that is, whether it requires a fully operational OPAL or just enough to exec.
Default: 0 (false)
Automatically enabled when needed.
The user should never need to set this parameter.
Checkpoint/Restart signal used to initiate an OPAL-only checkpoint of a
program.
Default: SIGUSR1
shell$ mpirun --mca opal_cr_signal 14 -am ft-enable-cr my-app
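The example above passes the signal as a number, and signal numbering can differ between platforms. As a small sketch, the shell's own kill -l can be used to check which signal name a number maps to on the local machine before it is handed to mpirun.

```shell
# Signal number to use for the OPAL-only checkpoint (14 is the value used above).
SIGNUM=14

# Ask the shell which signal name this number corresponds to on this system.
SIGNAME=$(kill -l "$SIGNUM")
echo "signal $SIGNUM is SIG$SIGNAME on this system"

# Print the command for review instead of running it.
echo "mpirun --mca opal_cr_signal $SIGNUM -am ft-enable-cr my-app"
```

Choosing a signal that the application itself does not use avoids the checkpoint handler interfering with application logic.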
Activate a signal handler for debugging SIGPIPE Errors that can happen on restart.
Default: 0 (disabled)
shell$ mpirun --mca opal_cr_debug_sigpipe 1 -am ft-enable-cr my-app
Temporary directory in which to place rendezvous files for a checkpoint. Note
that this is not the checkpoint storage directory, but it should be on a file
system local to the machine.
Default: "/tmp"
shell$ mpirun --mca opal_cr_tmp_dir /tmp/ramdisk/ -am ft-enable-cr my-app
Which CRS component to use
Default: NULL (auto-select)
shell$ mpirun --mca crs blcr -am ft-enable-cr my-app
Set the verbose level for the CRS framework.
Default: 0 (off)
shell$ mpirun --mca crs_base_verbose 10 -am ft-enable-cr my-app
Directory to use when storing local snapshots. Note that this is only used if
you disable snapc_base_store_in_place.
Default: "/tmp"
shell$ mpirun --mca crs_base_snapshot_dir /tmp/ramdisk \
--mca snapc_base_store_in_place 0 \
-am ft-enable-cr my-app
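Since the local snapshot directory only takes effect together with snapc_base_store_in_place 0, it can help to prepare the directory and build both arguments in one place. The following is a sketch under assumptions: the directory name ompi_local_snapshots is hypothetical, and on a multi-node run the directory would presumably need to exist on each node.

```shell
# Hypothetical local snapshot directory under the system temporary directory.
SNAP_DIR="${TMPDIR:-/tmp}/ompi_local_snapshots"

# Create the directory if it does not exist yet (local node only).
mkdir -p "$SNAP_DIR"

# crs_base_snapshot_dir is only used when store-in-place is disabled.
SNAP_ARGS="--mca crs_base_snapshot_dir $SNAP_DIR --mca snapc_base_store_in_place 0"

# Print the command for review instead of running it.
echo "mpirun $SNAP_ARGS -am ft-enable-cr my-app"
```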
Set the Priority of the CRS BLCR component.
The component with the highest priority wins.
Default: 50
shell$ mpirun --mca crs_blcr_priority 100 -am ft-enable-cr my-app
Set the verbose level of the CRS BLCR component.
Default: 0 (set to match crs_base_verbose)
shell$ mpirun --mca crs_blcr_verbose 10 -am ft-enable-cr my-app
Save the local checkpoint to /dev/null.
Note: This is not for general use. It is a benchmarking and debugging option
that should be used with care.
Default: 0 (disabled)
shell$ mpirun --mca crs_blcr_dev_null 1 -am ft-enable-cr my-app
Set the Priority of the CRS SELF component. The component is only selected if
lt_dlsym can find functions in the user program with the correct signatures.
The component with the highest priority wins.
Default: 20
shell$ mpirun --mca crs_self_priority 100 -am ft-enable-cr my-app
Set the verbose level of the CRS SELF component.
Default: 0 (set to match crs_base_verbose)
shell$ mpirun --mca crs_self_verbose 10 -am ft-enable-cr my-app
Prefix for the user defined callback functions.
Default: "opal_crs_self_user"
shell$ mpirun --mca crs_self_prefix my_foo -am ft-enable-cr my-app
Start execution by calling the restart callback during MPI_INIT.
Default: 0 (disabled)
Automatically enabled when needed.
The user should never need to set this parameter.
Which FileM component to use
Default: NULL (auto-select)
shell$ mpirun --mca filem rsh -am ft-enable-cr my-app
Set the verbose level for the FileM framework.
Default: 0 (off)
shell$ mpirun --mca filem_base_verbose 10 -am ft-enable-cr my-app
Set the Priority of the FileM RSH component.
The component with the highest priority wins.
Default: 50
shell$ mpirun --mca filem_rsh_priority 100 -am ft-enable-cr my-app
Set the verbose level of the FileM RSH component.
Default: 0 (set to match filem_base_verbose)
shell$ mpirun --mca filem_rsh_verbose 10 -am ft-enable-cr my-app
The remote copy command for the rsh component to use.
Default: "scp"
shell$ mpirun --mca filem_rsh_rcp rcp -am ft-enable-cr my-app
The remote shell command for the rsh component to use.
Default: "ssh"
shell$ mpirun --mca filem_rsh_rsh rsh -am ft-enable-cr my-app
The UNIX cp command to use for local copy operations. Useful when moving files
from a local file system to a globally mounted file system (see
snapc_base_global_shared for more information).
Default: "cp"
shell$ mpirun --mca filem_rsh_cp my_cp -am ft-enable-cr my-app
Maximum number of incoming connections (0 = any).
Default: 10
shell$ mpirun --mca filem_rsh_max_incomming 50 -am ft-enable-cr my-app
Which SnapC component to use
Default: NULL (auto-select)
shell$ mpirun --mca snapc full -am ft-enable-cr my-app
Set the verbose level for the SnapC framework.
Default: 0 (off)
shell$ mpirun --mca snapc_base_verbose 10 -am ft-enable-cr my-app
The base directory to use when storing global snapshots. This is the directory
where all checkpoint files will be gathered during a checkpoint
operation. Usually this is a globally mounted file system, but it does not need
to be if using the FileM framework.
Default: $HOME
shell$ mpirun --mca snapc_base_global_snapshot_dir /home/me/ckpts \
-am ft-enable-cr my-app
If the snapc_base_global_snapshot_dir is on a shared file system that all nodes
can access, then the checkpoint files can be copied more efficiently when FileM
is used.
Default: 0 (disabled)
shell$ mpirun --mca snapc_base_global_shared 1 -am ft-enable-cr my-app
If the snapc_base_global_snapshot_dir is on a shared file system that all nodes
can access, then the checkpoint files can be stored in place instead of
incurring a remote copy.
Default: 1 (enabled)
shell$ mpirun --mca snapc_base_store_in_place 0 -am ft-enable-cr my-app
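The storage parameters above interact: storing in place only makes sense when the global snapshot directory is on a shared file system, and the local snapshot directory is only consulted when storing in place is disabled. The sketch below shows one way to pick a consistent set of arguments. The IS_SHARED flag is a hypothetical variable set by hand, since nothing here detects whether a file system is actually shared.

```shell
# Global snapshot directory (the example path used above).
GLOBAL_DIR=/home/me/ckpts

# Hypothetical flag, set by hand: 1 if GLOBAL_DIR is visible on all nodes.
IS_SHARED=1

STORE_ARGS="--mca snapc_base_global_snapshot_dir $GLOBAL_DIR"
if [ "$IS_SHARED" -eq 1 ]; then
    # Shared file system: write checkpoint files directly into place.
    STORE_ARGS="$STORE_ARGS --mca snapc_base_store_in_place 1"
else
    # Not shared: stage snapshots locally, then let FileM gather them.
    STORE_ARGS="$STORE_ARGS --mca snapc_base_store_in_place 0"
    STORE_ARGS="$STORE_ARGS --mca crs_base_snapshot_dir /tmp"
fi

# Print the command for review instead of running it.
echo "mpirun $STORE_ARGS -am ft-enable-cr my-app"
```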
Only store one sequence number (reusing the checkpoint directory)
Default: 0 (disabled)
shell$ mpirun --mca snapc_base_only_one_seq 1 -am ft-enable-cr my-app
Establish the global snapshot directory on job startup, instead of on the first
checkpoint operation.
Note that this is currently only lightly tested, and may not work properly.
Default: 0 (disabled)
shell$ mpirun --mca snapc_base_establish_global_snapshot_dir 1 -am ft-enable-cr my-app
Specify the global snapshot reference that should be used for this job.
Default: "ompi_global_snapshot_PID.ckpt" (where PID is the PID of the mpirun process)
shell$ mpirun --mca snapc_base_global_snapshot_ref my_ref -am ft-enable-cr my-app
Set the Priority of the SnapC FULL component.
The component with the highest priority wins.
Default: 20
shell$ mpirun --mca snapc_full_priority 100 -am ft-enable-cr my-app
Set the verbose level of the SnapC FULL component.
Default: 0 (set to match snapc_base_verbose)
shell$ mpirun --mca snapc_full_verbose 10 -am ft-enable-cr my-app
Only pretend to move files using FileM.
Note: This is not for general use. It is a benchmarking and debugging option
that should be used with care.
Default: 0 (disabled)
shell$ mpirun --mca snapc_full_skip_filem 1 -am ft-enable-cr my-app
Shortcut the application level coordination (do not start the INC or checkpoint
operations in the local processes, just pretend to do so).
Note: This is not for general use. It is a benchmarking and debugging option
that should be used with care.
Default: 0 (disabled)
shell$ mpirun --mca snapc_full_skip_app 1 -am ft-enable-cr my-app
Enable checkpoint timing information
Default: 0 (disabled)
shell$ mpirun --mca snapc_full_enable_timing 1 -am ft-enable-cr my-app
Maximum time to wait before the daemon gives up on the checkpoint
operation. Values less than or equal to 0 mean wait indefinitely.
Default: 20 seconds
shell$ mpirun --mca snapc_full_max_wait_time 60 -am ft-enable-cr my-app
Which CRCP component to use
Default: NULL (auto-select)
shell$ mpirun --mca crcp bkmrk -am ft-enable-cr my-app
Set the verbose level for the CRCP framework.
Default: 0 (off)
shell$ mpirun --mca crcp_base_verbose 10 -am ft-enable-cr my-app
Set the Priority of the CRCP BKMRK component.
The component with the highest priority wins.
Default: 20
shell$ mpirun --mca crcp_bkmrk_priority 100 -am ft-enable-cr my-app
Set the verbose level of the CRCP BKMRK component.
Default: 0 (set to match crcp_base_verbose)
shell$ mpirun --mca crcp_bkmrk_verbose 10 -am ft-enable-cr my-app
Enable performance timing for the Bookmark Exchange.
Default: 0 (disabled)
shell$ mpirun --mca crcp_bkmrk_timing 1 -am ft-enable-cr my-app
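Timing information is controlled by separate parameters in the OPAL, SnapC, and CRCP layers. When profiling a checkpoint end to end it may be convenient to switch all of them on at once. The sketch below builds that argument list from the three parameter names documented above and prints the resulting command. Whether the combined output is useful on a given installation is something to confirm by experiment.

```shell
# Enable every timing-related parameter described in this document.
TIMING_ARGS=""
for param in opal_cr_enable_timer snapc_full_enable_timing crcp_bkmrk_timing; do
    TIMING_ARGS="$TIMING_ARGS --mca $param 1"
done

# Print the command for review instead of running it.
echo "mpirun$TIMING_ARGS -am ft-enable-cr my-app"
```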