Fault Tolerance Research @ Open Systems Laboratory

Transparent Checkpoint/Restart in Open MPI

Configure Options

MCA Parameters

--with-ft

This configure option specifies the type of fault tolerance to enable in the Open MPI build. By default no fault tolerance is enabled, which is equivalent to specifying --without-ft. Currently cr is the only supported value.

./configure --with-ft=cr

--enable-ft-thread

This option enables a concurrent thread that helps a checkpoint operation make progress while the application is not inside the MPI library. The feature requires MPI thread support, so you must enable MPI threads in addition to the checkpoint thread. It is disabled by default.

./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads

--with-blcr

This option specifies the path to the installation of the BLCR library. It is strongly suggested that users specify this option to ensure that the proper BLCR installation is selected.

./configure --with-ft=cr --with-blcr=/opt/blcr/

--with-blcr-libdir

This option specifies the library directory of the BLCR installation. Use it when the libraries are not in the default location under the BLCR installation path.

./configure --with-ft=cr --with-blcr=/opt/blcr/ --with-blcr-libdir=/opt/blcr/lib64
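
The configure options above can be combined in one command line. The following is a minimal sketch that only prints such a command, it does not run configure. The helper name build_configure_cmd and its argument convention are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Print (do not run) a configure command line for a checkpoint/restart build.
# Arguments (hypothetical helper convention, not part of Open MPI):
#   $1  BLCR installation path, or empty to let configure search for it
#   $2  BLCR library directory, or empty to use the default location
#   $3  "yes" to add the checkpoint thread (needs MPI threads as well)
build_configure_cmd() {
    blcr_prefix=$1
    blcr_libdir=$2
    ft_thread=$3

    cmd="./configure --with-ft=cr"
    if [ "$ft_thread" = "yes" ]; then
        # The checkpoint thread requires MPI thread support.
        cmd="$cmd --enable-ft-thread --enable-mpi-threads"
    fi
    if [ -n "$blcr_prefix" ]; then
        cmd="$cmd --with-blcr=$blcr_prefix"
    fi
    if [ -n "$blcr_libdir" ]; then
        cmd="$cmd --with-blcr-libdir=$blcr_libdir"
    fi
    printf '%s\n' "$cmd"
}

build_configure_cmd /opt/blcr /opt/blcr/lib64 yes
```

Printing the command first makes it easy to review the selected BLCR paths before starting a long build.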

-am ft-enable-cr

To enable checkpoint/restart fault tolerance for an MPI application you must use the Aggregate MCA parameter ft-enable-cr. This selects the best checkpoint/restart fault tolerance components currently available.

shell$ mpirun -am ft-enable-cr my-app

--mca ompi_cr_verbose

Verbose output for the OMPI layer Checkpoint/Restart functionality.
Default: 0 (off)

shell$ mpirun --mca ompi_cr_verbose 10 -am ft-enable-cr my-app

--mca orte_cr_verbose

Verbose output for the ORTE layer Checkpoint/Restart functionality.
Default: 0 (off)

shell$ mpirun --mca orte_cr_verbose 10 -am ft-enable-cr my-app

--mca opal_cr_verbose

Verbose output for the OPAL layer Checkpoint/Restart functionality.
Default: 0 (off)

shell$ mpirun --mca opal_cr_verbose 10 -am ft-enable-cr my-app
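
Open MPI also reads MCA parameters from environment variables named OMPI_MCA_ followed by the parameter name, which can be convenient when all three layers should be verbose at once. The sketch below sets the three verbose parameters that way. The helper name set_cr_verbose is hypothetical.

```shell
#!/bin/sh
# Set the same verbosity level for the OMPI, ORTE, and OPAL
# Checkpoint/Restart layers through OMPI_MCA_* environment variables.
# set_cr_verbose is a hypothetical helper name used for illustration.
set_cr_verbose() {
    level=${1:-10}
    for layer in ompi orte opal; do
        export "OMPI_MCA_${layer}_cr_verbose=$level"
    done
}

set_cr_verbose 10

# Show what a subsequently started mpirun would inherit.
env | grep '^OMPI_MCA_.*_cr_verbose=' | sort
```

With these variables exported, running mpirun -am ft-enable-cr my-app from the same shell should behave as if the three --mca options had been given on the command line.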

--mca ft_cr_enabled

Enable fault tolerance for this program.
Default: 0 (disabled)
Automatically enabled by ft-enable-cr. The user should never need to set this parameter.

--mca opal_cr_enable_timer

Enable checkpoint timer
Default: 0 (disabled)

shell$ mpirun --mca opal_cr_enable_timer 1 -am ft-enable-cr my-app

--mca opal_cr_enable_timer_barrier

Enable checkpoint timer barrier between stages to control for process skew.
Default: 0 (disabled)

shell$ mpirun --mca opal_cr_enable_timer_barrier 1 --mca opal_cr_enable_timer 1 \
              -am ft-enable-cr my-app

--mca opal_cr_timer_target_rank

MPI rank that should display the checkpoint timer.
Default: 0

shell$ mpirun --mca opal_cr_timer_target_rank 2 \
              --mca opal_cr_enable_timer 1 \
              -am ft-enable-cr my-app
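
When the timer parameters are used together on every run, they can be kept in an MCA parameter file instead of being repeated on the command line. The sketch below writes such a file, one "name = value" pair per line. It writes to a scratch directory so that it has no side effects. Open MPI's per-user file is $HOME/.openmpi/mca-params.conf. The helper name write_timer_params is hypothetical.

```shell
#!/bin/sh
# Write the checkpoint timer settings to an MCA parameter file.
#   $1  file to write
#   $2  MPI rank that should display the timer (default 0)
# write_timer_params is a hypothetical helper name.
write_timer_params() {
    target_file=$1
    rank=${2:-0}
    mkdir -p "$(dirname "$target_file")" || return 1
    cat > "$target_file" <<EOF
# Checkpoint timer settings
opal_cr_enable_timer = 1
opal_cr_enable_timer_barrier = 1
opal_cr_timer_target_rank = $rank
EOF
}

# Demonstrate with a scratch directory rather than $HOME/.openmpi.
scratch=$(mktemp -d)
write_timer_params "$scratch/mca-params.conf" 2
cat "$scratch/mca-params.conf"
rm -rf "$scratch"
```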

--mca opal_cr_use_thread

Use an asynchronous thread to checkpoint this program.
Default: 0 (off)
Automatically enabled by ft-enable-cr when built with --enable-ft-thread. The user should never need to set this parameter.

--mca opal_cr_thread_sleep_check

Time for the checkpoint thread to sleep between checking for a checkpoint.
Default: 0 microseconds

shell$ mpirun --mca opal_cr_thread_sleep_check 10 -am ft-enable-cr my-app

--mca opal_cr_thread_sleep_wait

Time for the checkpoint thread to sleep when waiting for a process to exit the MPI library.
Default: 0 microseconds

shell$ mpirun --mca opal_cr_thread_sleep_wait 10 -am ft-enable-cr my-app

--mca opal_cr_is_tool

Indicates whether this is a tool program, that is, whether it requires a fully operational OPAL or just enough to exec.
Default: 0 (false)
Automatically enabled when needed. The user should never need to set this parameter.

--mca opal_cr_signal

Checkpoint/Restart signal used to initiate an OPAL-only checkpoint of a program.
Default: SIGUSR1

shell$ mpirun --mca opal_cr_signal 14 -am ft-enable-cr my-app
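
The example above passes the signal as a number, and signal numbers differ between platforms (14 is typically SIGALRM on Linux). The sketch below looks up the number for a signal name on the current machine using the shell's kill -l, then prints a matching mpirun command. The helper name signal_number is hypothetical, and the choice of SIGUSR2 is only an illustration.

```shell
#!/bin/sh
# Look up the number of a signal by name on the current platform.
# signal_number is a hypothetical helper, not an Open MPI tool.
signal_number() {
    name=${1#SIG}            # accept both "SIGUSR1" and "USR1"
    n=1
    while [ "$n" -le 64 ]; do
        # "kill -l N" prints the name of signal N without the SIG prefix.
        if [ "$(kill -l "$n" 2>/dev/null)" = "$name" ]; then
            printf '%s\n' "$n"
            return 0
        fi
        n=$((n + 1))
    done
    return 1                 # no signal with that name was found
}

sig=$(signal_number SIGUSR2) || exit 1
printf 'mpirun --mca opal_cr_signal %s -am ft-enable-cr my-app\n' "$sig"
```

Whichever signal is chosen should not be one the application itself relies on.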

--mca opal_cr_debug_sigpipe

Activate a signal handler for debugging SIGPIPE Errors that can happen on restart.
Default: 0 (disabled)

shell$ mpirun --mca opal_cr_debug_sigpipe 1 -am ft-enable-cr my-app

--mca opal_cr_tmp_dir

Temporary directory in which to place rendezvous files for a checkpoint. Note that this is not the checkpoint storage directory, but it should be on a file system local to the machine.
Default: "/tmp"

shell$ mpirun --mca opal_cr_tmp_dir /tmp/ramdisk/ -am ft-enable-cr my-app

--mca crs

Which CRS component to use
Default: NULL (auto-select)

shell$ mpirun --mca crs blcr -am ft-enable-cr my-app

--mca crs_base_verbose

Set the verbose level for the CRS framework.
Default: 0 (off)

shell$ mpirun --mca crs_base_verbose 10 -am ft-enable-cr my-app

--mca crs_base_snapshot_dir

Directory to use when storing local snapshots. Note that this is only used if you disable snapc_base_store_in_place.
Default: "/tmp"

shell$ mpirun --mca crs_base_snapshot_dir /tmp/ramdisk \
              --mca snapc_base_store_in_place 0 \
              -am ft-enable-cr my-app

--mca crs_blcr_priority

Set the Priority of the CRS BLCR component. The component with the highest priority wins.
Default: 50

shell$ mpirun --mca crs_blcr_priority 100 -am ft-enable-cr my-app

--mca crs_blcr_verbose

Set the verbose level of the CRS BLCR component.
Default: 0 (set to match crs_base_verbose)

shell$ mpirun --mca crs_blcr_verbose 10 -am ft-enable-cr my-app

--mca crs_blcr_dev_null

Save the local checkpoint to /dev/null. Note: This is not for general use. It is a benchmarking and debugging option that should be used with care.
Default: 0 (disabled)

shell$ mpirun --mca crs_blcr_dev_null 1 -am ft-enable-cr my-app

--mca crs_self_priority

Set the Priority of the CRS SELF component. Only selected if lt_dlsym can find functions in the user program with the correct signatures. The component with the highest priority wins.
Default: 20

shell$ mpirun --mca crs_self_priority 100 -am ft-enable-cr my-app
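
The rule "the component with the highest priority wins" can be illustrated with the default CRS priorities listed on this page (blcr = 50, self = 20). The sketch below only mimics the selection rule. It is not Open MPI's selection code, and it ignores the extra condition that SELF is eligible only when the user callbacks are found. The helper name pick_component is hypothetical.

```shell
#!/bin/sh
# Illustrate "the component with the highest priority wins".
# Input: lines of "component priority"; output: the winning component.
# pick_component is a hypothetical helper, not part of Open MPI.
pick_component() {
    sort -k2,2nr | head -n 1 | cut -d' ' -f1
}

# Default priorities from this page: blcr wins.
printf 'blcr 50\nself 20\n' | pick_component

# With crs_self_priority raised to 100, self outranks blcr.
printf 'blcr 50\nself 100\n' | pick_component
```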

--mca crs_self_verbose

Set the verbose level of the CRS SELF component.
Default: 0 (set to match crs_base_verbose)

shell$ mpirun --mca crs_self_verbose 10 -am ft-enable-cr my-app

--mca crs_self_prefix

Prefix for the user defined callback functions.
Default: "opal_crs_self_user"

shell$ mpirun --mca crs_self_prefix my_foo -am ft-enable-cr my-app

--mca crs_self_do_restart

Start execution by calling the restart callback during MPI_INIT.
Default: 0 (disabled)
Automatically enabled when needed. The user should never need to set this parameter.

--mca filem

Which FileM component to use
Default: NULL (auto-select)

shell$ mpirun --mca filem rsh -am ft-enable-cr my-app

--mca filem_base_verbose

Set the verbose level for the FileM framework.
Default: 0 (off)

shell$ mpirun --mca filem_base_verbose 10 -am ft-enable-cr my-app

--mca filem_rsh_priority

Set the Priority of the FileM RSH component. The component with the highest priority wins.
Default: 50

shell$ mpirun --mca filem_rsh_priority 100 -am ft-enable-cr my-app

--mca filem_rsh_verbose

Set the verbose level of the FileM RSH component.
Default: 0 (set to match filem_base_verbose)

shell$ mpirun --mca filem_rsh_verbose 10 -am ft-enable-cr my-app

--mca filem_rsh_rcp

The remote copy command to use (for example, rcp or scp).
Default: "scp"

shell$ mpirun --mca filem_rsh_rcp rcp -am ft-enable-cr my-app

--mca filem_rsh_rsh

The remote shell command to use (for example, rsh or ssh).
Default: "ssh"

shell$ mpirun --mca filem_rsh_rsh rsh -am ft-enable-cr my-app

--mca filem_rsh_cp

The UNIX cp command to use for local copy operations. Useful when moving files from a local file system to a globally mounted file system (see snapc_base_global_shared for more information).
Default: "cp"

shell$ mpirun --mca filem_rsh_cp my_cp -am ft-enable-cr my-app

--mca filem_rsh_max_incomming

Maximum number of incoming connections (0 = any)
Default: 10

shell$ mpirun --mca filem_rsh_max_incomming 50 -am ft-enable-cr my-app

--mca snapc

Which SnapC component to use
Default: NULL (auto-select)

shell$ mpirun --mca snapc full -am ft-enable-cr my-app

--mca snapc_base_verbose

Set the verbose level for the SnapC framework.
Default: 0 (off)

shell$ mpirun --mca snapc_base_verbose 10 -am ft-enable-cr my-app

--mca snapc_base_global_snapshot_dir

The base directory to use when storing global snapshots. This is the directory where all checkpoint files will be gathered during a checkpoint operation. Usually this is a globally mounted file system, but it does not need to be if using the FileM framework.
Default: $HOME

shell$ mpirun --mca snapc_base_global_snapshot_dir /home/me/ckpts \
              -am ft-enable-cr my-app

--mca snapc_base_global_shared

If the snapc_base_global_snapshot_dir is on a shared file system that all nodes can access, then the checkpoint files can be copied more efficiently when FileM is used.
Default: 0 (disabled)

shell$ mpirun --mca snapc_base_global_shared 1 -am ft-enable-cr my-app

--mca snapc_base_store_in_place

If the snapc_base_global_snapshot_dir is on a shared file system that all nodes can access, then the checkpoint files can be stored in place instead of incurring a remote copy.
Default: 1 (enabled)

shell$ mpirun --mca snapc_base_store_in_place 0 -am ft-enable-cr my-app
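
The storage parameters described so far suggest two layouts: store checkpoints in place on a shared global directory, or stage them in a node-local directory (crs_base_snapshot_dir) and let FileM gather them. The sketch below prints the --mca options for each layout. The helper name snapshot_storage_args and its arguments are hypothetical, and the pairing of options is an interpretation of the descriptions on this page rather than a tested recipe.

```shell
#!/bin/sh
# Print the --mca options for the two snapshot storage layouts.
#   $1  "shared" if the global snapshot directory is visible on every
#       node, anything else if it is not
#   $2  global snapshot directory
#   $3  node-local snapshot directory (non-shared case only, default /tmp)
# snapshot_storage_args is a hypothetical helper name.
snapshot_storage_args() {
    layout=$1
    global_dir=$2
    local_dir=${3:-/tmp}
    args="--mca snapc_base_global_snapshot_dir $global_dir"
    if [ "$layout" = "shared" ]; then
        # Write checkpoints straight into the global directory (the default).
        args="$args --mca snapc_base_store_in_place 1"
    else
        # Stage locally first; the files are gathered afterwards.
        args="$args --mca snapc_base_store_in_place 0"
        args="$args --mca crs_base_snapshot_dir $local_dir"
    fi
    printf '%s\n' "$args"
}

snapshot_storage_args shared /home/me/ckpts
snapshot_storage_args local /home/me/ckpts /tmp/ramdisk
```

The printed options can be placed between mpirun and -am ft-enable-cr my-app.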

--mca snapc_base_only_one_seq

Only store one sequence number (reusing the checkpoint directory)
Default: 0 (disabled)

shell$ mpirun --mca snapc_base_only_one_seq 1 -am ft-enable-cr my-app

--mca snapc_base_establish_global_snapshot_dir

Establish the global snapshot directory on job startup, instead of on the first checkpoint operation. Note that this is currently only lightly tested, and may not work properly.
Default: 0 (disabled)

shell$ mpirun --mca snapc_base_establish_global_snapshot_dir 1 -am ft-enable-cr my-app

--mca snapc_base_global_snapshot_ref

Specify the global snapshot reference that should be used for this job.
Default: "ompi_global_snapshot_PID.ckpt" (where PID is the PID of the mpirun process)

shell$ mpirun --mca snapc_base_global_snapshot_ref my_ref -am ft-enable-cr my-app
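
The default reference follows a fixed pattern, so it can be derived from the PID of the mpirun process. The sketch below builds that name, and also a full path under the assumption that the reference names an entry directly inside snapc_base_global_snapshot_dir (default $HOME). That path layout is an assumption, and both helper names are hypothetical.

```shell
#!/bin/sh
# Build the default global snapshot reference for a given mpirun PID.
# default_snapshot_ref and snapshot_path are hypothetical helpers.
default_snapshot_ref() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

# Assumption: the reference names an entry directly under
# snapc_base_global_snapshot_dir (default $HOME).
#   $1  PID of the mpirun process
#   $2  global snapshot directory (optional)
snapshot_path() {
    base_dir=${2:-$HOME}
    printf '%s/%s\n' "${base_dir%/}" "$(default_snapshot_ref "$1")"
}

default_snapshot_ref 1234
snapshot_path 1234 /home/me/ckpts
```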

--mca snapc_full_priority

Set the Priority of the SnapC FULL component. The component with the highest priority wins.
Default: 20

shell$ mpirun --mca snapc_full_priority 100 -am ft-enable-cr my-app

--mca snapc_full_verbose

Set the verbose level of the SnapC FULL component.
Default: 0 (set to match snapc_base_verbose)

shell$ mpirun --mca snapc_full_verbose 10 -am ft-enable-cr my-app

--mca snapc_full_skip_filem

Only pretend to move files using FileM. Note: This is not for general use. It is a benchmarking and debugging option that should be used with care.
Default: 0 (disabled)

shell$ mpirun --mca snapc_full_skip_filem 1 -am ft-enable-cr my-app

--mca snapc_full_skip_app

Shortcut the application level coordination (do not start the INC or checkpoint operations in the local processes, just pretend to do so). Note: This is not for general use. It is a benchmarking and debugging option that should be used with care.
Default: 0 (disabled)

shell$ mpirun --mca snapc_full_skip_app 1 -am ft-enable-cr my-app

--mca snapc_full_enable_timing

Enable checkpoint timing information
Default: 0 (disabled)

shell$ mpirun --mca snapc_full_enable_timing 1 -am ft-enable-cr my-app

--mca snapc_full_max_wait_time

Maximum time to wait before the daemon gives up on the checkpoint operation. Values less than or equal to 0 mean wait indefinitely.
Default: 20 seconds

shell$ mpirun --mca snapc_full_max_wait_time 60 -am ft-enable-cr my-app

--mca crcp

Which CRCP component to use
Default: NULL (auto-select)

shell$ mpirun --mca crcp bkmrk -am ft-enable-cr my-app

--mca crcp_base_verbose

Set the verbose level for the CRCP framework.
Default: 0 (off)

shell$ mpirun --mca crcp_base_verbose 10 -am ft-enable-cr my-app

--mca crcp_bkmrk_priority

Set the Priority of the CRCP BKMRK component. The component with the highest priority wins.
Default: 20

shell$ mpirun --mca crcp_bkmrk_priority 100 -am ft-enable-cr my-app

--mca crcp_bkmrk_verbose

Set the verbose level of the CRCP BKMRK component.
Default: 0 (set to match crcp_base_verbose)

shell$ mpirun --mca crcp_bkmrk_verbose 10 -am ft-enable-cr my-app

--mca crcp_bkmrk_timing

Enable performance timing for the Bookmark Exchange.
Default: 0 (disabled)

shell$ mpirun --mca crcp_bkmrk_timing 1 -am ft-enable-cr my-app
