This configure option specifies the type of fault tolerance to enable in the
Open MPI build. By default no fault tolerance is enabled, which is the same as
if the option --without-ft was specified. Currently only
the cr option is supported.
./configure --with-ft=cr
This option enables a concurrent thread that helps a checkpoint operation make progress while the application is not inside the MPI library. To use this feature you must enable MPI threads in addition to the checkpointing thread. This option is disabled by default.
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
After r22841, --enable-mpi-threads was replaced by
--enable-opal-multi-threads, so builds from later revisions should use the
following instead:
./configure --enable-ft-thread --with-ft=cr --enable-opal-multi-threads
This option specifies the path to the installation of the BLCR library. It is strongly suggested that users specify this option to ensure that the proper BLCR installation is selected.
./configure --with-ft=cr --with-blcr=/opt/blcr/
This option specifies the path to the library directory of the BLCR installation.
./configure --with-ft=cr --with-blcr=/opt/blcr/ --with-blcr-libdir=/opt/blcr/lib64
Introduced in r23587. Included in v1.5.1 and later releases.
This option activates the Checkpoint/Restart-enabled debugging support. See C/R Enabled Debugging.
./configure --with-ft=cr --enable-crdebug
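The configure options described above can be combined in a single invocation. The following is a minimal sketch that only prints the resulting command line (a dry run) and does not run configure. The helper name build_configure_cmd and its argument order are assumptions made for illustration and are not part of Open MPI. It uses the post-r22841 thread flag.

```shell
#!/bin/sh
# Dry-run sketch: assemble a ./configure command line for a C/R-enabled build.
# build_configure_cmd is a hypothetical helper, not an Open MPI tool.
# Arguments: use_thread(0|1) blcr_prefix blcr_libdir crdebug(0|1)

build_configure_cmd() {
    use_thread=$1; blcr_prefix=$2; blcr_libdir=$3; crdebug=$4
    cmd="./configure --with-ft=cr"
    # Optional checkpoint thread; it also needs the multi-thread flag
    # (--enable-opal-multi-threads after r22841).
    if [ "$use_thread" = "1" ]; then
        cmd="$cmd --enable-ft-thread --enable-opal-multi-threads"
    fi
    # Point configure at a specific BLCR installation when one is given.
    if [ -n "$blcr_prefix" ]; then
        cmd="$cmd --with-blcr=$blcr_prefix"
    fi
    if [ -n "$blcr_libdir" ]; then
        cmd="$cmd --with-blcr-libdir=$blcr_libdir"
    fi
    # Optional C/R-enabled debugging support.
    if [ "$crdebug" = "1" ]; then
        cmd="$cmd --enable-crdebug"
    fi
    printf '%s\n' "$cmd"
}

build_configure_cmd 1 /opt/blcr /opt/blcr/lib64 0
```

Printing the command first makes it easy to review the flag set before starting a long build.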
To enable checkpoint/restart fault tolerance for an MPI application you must
use the Aggregate MCA parameter ft-enable-cr. This selects the best
checkpoint/restart fault tolerance components currently available.
shell$ mpirun -am ft-enable-cr my-app
Introduced in r23629. Included in v1.5.1 and later releases.
To enable checkpoint/restart fault tolerance with automatic recovery and/or
process migration for an MPI application you must use the Aggregate MCA
parameter ft-enable-cr-recovery. This selects the best checkpoint/restart
fault tolerance components currently available.
shell$ mpirun -am ft-enable-cr-recovery my-app
Verbose output for the OMPI layer Checkpoint/Restart functionality.
Default: 0 (off)
shell$ mpirun --mca ompi_cr_verbose 10 -am ft-enable-cr my-app
Verbose output for the ORTE layer Checkpoint/Restart functionality.
Default: 0 (off)
shell$ mpirun --mca orte_cr_verbose 10 -am ft-enable-cr my-app
Verbose output for the OPAL layer Checkpoint/Restart functionality.
Default: 0 (off)
shell$ mpirun --mca opal_cr_verbose 10 -am ft-enable-cr my-app
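The three verbosity parameters above can also be supplied through the environment, since Open MPI reads MCA parameters from variables named OMPI_MCA_<parameter>. The sketch below sets all three layers to the same level. The helper name set_cr_verbose is an assumption made for illustration.

```shell
#!/bin/sh
# Sketch: set the OMPI, ORTE and OPAL C/R verbosity through the environment
# instead of repeating --mca on every mpirun command line.
# set_cr_verbose is a hypothetical helper name.

set_cr_verbose() {
    level=$1
    for layer in ompi orte opal; do
        # Builds and exports e.g. OMPI_MCA_opal_cr_verbose=<level>.
        export "OMPI_MCA_${layer}_cr_verbose=$level"
    done
}

set_cr_verbose 10
env | grep '^OMPI_MCA_.*_cr_verbose=' | sort

# The job would then be launched as usual, for example:
#   mpirun -am ft-enable-cr my-app
```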
Enable fault tolerance for this program.
Default: 0 (disabled)
Automatically enabled by ft-enable-cr.
The user should never need to set this parameter.
Enable checkpoint timer
Default: 0 (disabled)
shell$ mpirun --mca opal_cr_enable_timer 1 -am ft-enable-cr my-app
Enable checkpoint timer barrier between stages to control for process skew.
Default: 0 (disabled)
shell$ mpirun --mca opal_cr_enable_timer_barrier 1 --mca opal_cr_enable_timer 1 \
-am ft-enable-cr my-app
MPI rank that should display the checkpoint timer.
Default: 0
shell$ mpirun --mca opal_cr_timer_target_rank 2 \
--mca opal_cr_enable_timer 1 \
-am ft-enable-cr my-app
Use an asynchronous thread to checkpoint this program.
Default: 0 (off)
Automatically enabled by ft-enable-cr when built
with --enable-ft-thread.
The user should never need to set this parameter.
Time for the checkpoint thread to sleep between checking for a checkpoint.
Default: 0 microseconds
shell$ mpirun --mca opal_cr_thread_sleep_check 10 -am ft-enable-cr my-app
Time for the checkpoint thread to sleep when waiting for a process to exit the
MPI library.
Default: 1000 microseconds
(changed from 0 for v1.5 and later)
shell$ mpirun --mca opal_cr_thread_sleep_wait 10 -am ft-enable-cr my-app
Whether this is a tool program, that is, whether it requires a fully operational OPAL or just enough to exec.
Default: 0 (false)
Automatically enabled when needed.
The user should never need to set this parameter.
Checkpoint/Restart signal used to initiate an OPAL-only checkpoint of a
program.
Default: SIGUSR1
shell$ mpirun --mca opal_cr_signal 14 -am ft-enable-cr my-app
Activate a signal handler for debugging SIGPIPE Errors that can happen on restart.
Default: 0 (disabled)
shell$ mpirun --mca opal_cr_debug_sigpipe 1 -am ft-enable-cr my-app
Temporary directory in which to place rendezvous files for a checkpoint. Note
that this is not the checkpoint storage directory. It should be on a file
system that is local to the machine.
Default: "/tmp"
shell$ mpirun --mca opal_cr_tmp_dir /tmp/ramdisk/ -am ft-enable-cr my-app
Which CRS component to use
Default: NULL (auto-select)
shell$ mpirun --mca crs blcr -am ft-enable-cr my-app
Set the verbose level for the CRS framework.
Default: 0 (off)
shell$ mpirun --mca crs_base_verbose 10 -am ft-enable-cr my-app
Set the Priority of the CRS BLCR component.
The component with the highest priority wins.
Default: 50
shell$ mpirun --mca crs_blcr_priority 100 -am ft-enable-cr my-app
Set the verbose level of the CRS BLCR component.
Default: 0 (set to match
crs_base_verbose)
shell$ mpirun --mca crs_blcr_verbose 10 -am ft-enable-cr my-app
Save the local checkpoint to /dev/null.
Note: This is not for general use. It is a benchmarking and debugging option
that should be used with care.
Default: 0 (disabled)
shell$ mpirun --mca crs_blcr_dev_null 1 -am ft-enable-cr my-app
Set the Priority of the CRS SELF component. Only selected if
lt_dlsym can find functions in the user program with the correct
signatures.
The component with the highest priority wins.
Default: 20
shell$ mpirun --mca crs_self_priority 100 -am ft-enable-cr my-app
Set the verbose level of the CRS SELF component.
Default: 0 (set to match
crs_base_verbose)
shell$ mpirun --mca crs_self_verbose 10 -am ft-enable-cr my-app
Prefix for the user defined callback functions.
Default: "opal_crs_self_user"
shell$ mpirun --mca crs_self_prefix my_foo -am ft-enable-cr my-app
Start execution by calling the restart callback during
MPI_INIT.
Default: 0 (disabled)
Automatically enabled when needed.
The user should never need to set this parameter.
Which Compress component to use
Default: NULL (auto-select)
shell$ mpirun --mca compress gzip \
--mca sstore stage \
--mca sstore_stage_compress 1 \
-am ft-enable-cr my-app
Set the verbose level for the Compress framework.
Default: 0 (off)
shell$ mpirun --mca compress_base_verbose 10 -am ft-enable-cr my-app
Set the Priority of the Compress gzip component.
The component with the highest priority wins.
Default: 15
shell$ mpirun --mca compress gzip \
--mca compress_gzip_priority 100 \
--mca sstore stage \
--mca sstore_stage_compress 1 \
-am ft-enable-cr my-app
Set the verbose level for the Compress gzip component.
Default: 0 (off)
shell$ mpirun --mca compress gzip \
--mca compress_gzip_verbose 10 \
--mca sstore stage \
--mca sstore_stage_compress 1 \
-am ft-enable-cr my-app
Set the Priority of the Compress bzip component.
The component with the highest priority wins.
Default: 10
shell$ mpirun --mca compress bzip \
--mca compress_bzip_priority 100 \
--mca sstore stage \
--mca sstore_stage_compress 1 \
-am ft-enable-cr my-app
Set the verbose level for the Compress bzip component.
Default: 0 (off)
shell$ mpirun --mca compress bzip \
--mca compress_bzip_verbose 10 \
--mca sstore stage \
--mca sstore_stage_compress 1 \
-am ft-enable-cr my-app
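The Compress examples above all repeat the same group of parameters: a compress component, the stage SStore component, and sstore_stage_compress set to 1. The sketch below prints that group for a chosen component so it can be reused on a command line. The helper name compress_args is an assumption made for illustration, and the command is only printed, not run.

```shell
#!/bin/sh
# Sketch: emit the MCA arguments that enable snapshot compression.
# compress_args is a hypothetical helper name.

compress_args() {
    # Only the two components documented above are accepted.
    case "$1" in
        gzip|bzip) ;;
        *) echo "unknown compress component: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "--mca compress $1 --mca sstore stage --mca sstore_stage_compress 1"
}

# Dry run: show the full command instead of launching a job.
echo "mpirun $(compress_args gzip) -am ft-enable-cr my-app"
```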
Which FileM component to use
Default: NULL (auto-select)
shell$ mpirun --mca filem rsh -am ft-enable-cr my-app
Set the verbose level for the FileM framework.
Default: 0 (off)
shell$ mpirun --mca filem_base_verbose 10 -am ft-enable-cr my-app
Set the Priority of the FileM RSH component.
The component with the highest priority wins.
Default: 50
shell$ mpirun --mca filem_rsh_priority 100 -am ft-enable-cr my-app
Set the verbose level of the FileM RSH component.
Default: 0 (set to match
filem_base_verbose)
shell$ mpirun --mca filem_rsh_verbose 10 -am ft-enable-cr my-app
The remote copy (rcp-style) command to use for remote copy operations.
Default: "scp"
shell$ mpirun --mca filem_rsh_rcp rcp -am ft-enable-cr my-app
The remote shell (rsh-style) command to use.
Default: "ssh"
shell$ mpirun --mca filem_rsh_rsh rsh -am ft-enable-cr my-app
The UNIX cp command to use for local copy operations. Useful when
moving files from a local file system to a globally mounted file system
(see sstore_stage_global_is_shared for more
information).
Default: "cp"
shell$ mpirun --mca filem_rsh_cp my_cp -am ft-enable-cr my-app
Maximum number of incoming connections (0 = any).
Default: 10
shell$ mpirun --mca filem_rsh_max_incomming 50 -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Display Progress every X percentage done.
Default: 0 (off)
shell$ mpirun --mca filem_rsh_progress_meter 10 \
-am ft-enable-cr-recovery my-app
Which SnapC component to use
Default: NULL (auto-select)
shell$ mpirun --mca snapc full -am ft-enable-cr my-app
Set the verbose level for the SnapC framework.
Default: 0 (off)
shell$ mpirun --mca snapc_base_verbose 10 -am ft-enable-cr my-app
Only store one sequence number (reusing the checkpoint directory)
Default: 0 (disabled)
shell$ mpirun --mca snapc_base_only_one_seq 1 -am ft-enable-cr my-app
Set the Priority of the SnapC FULL component.
The component with the highest priority wins.
Default: 20
shell$ mpirun --mca snapc_full_priority 100 -am ft-enable-cr my-app
Set the verbose level of the SnapC FULL component.
Default: 0 (set to match
snapc_base_verbose)
shell$ mpirun --mca snapc_full_verbose 10 -am ft-enable-cr my-app
Shortcut the application level coordination (do not start the INC or checkpoint
operations in the local processes, just pretend to do so).
Note: This is not for general use. It is a benchmarking and debugging option
that should be used with care.
Default: 0 (disabled)
shell$ mpirun --mca snapc_full_skip_app 1 -am ft-enable-cr my-app
Enable checkpoint timing information
Default: 0 (disabled)
shell$ mpirun --mca snapc_full_enable_timing 1 -am ft-enable-cr my-app
Maximum time to wait before the daemon gives up on the checkpoint
operation. Values less than or equal to 0 mean wait indefinitely.
Default: 20 seconds
shell$ mpirun --mca snapc_full_max_wait_time 60 -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Display Progress every X percentage done.
Default: 0 (off)
shell$ mpirun --mca snapc_full_progress_meter 10 \
-am ft-enable-cr-recovery my-app
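SnapC settings that should apply to every run can be kept in an MCA parameter file rather than on the command line. Open MPI reads "name = value" lines from $HOME/.openmpi/mca-params.conf. The sketch below writes two of the parameters above to a scratch file so that it has no side effects. The names PARAM_FILE and write_snapc_params, and the chosen values, are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: persist SnapC settings in an MCA parameter file.
# This demo writes to a scratch file; a real setup would use
# $HOME/.openmpi/mca-params.conf.
# PARAM_FILE and write_snapc_params are hypothetical names.

PARAM_FILE="${TMPDIR:-/tmp}/mca-params-demo.conf"

write_snapc_params() {
    cat > "$1" <<EOF
# Wait up to 60 seconds before the daemon gives up on a checkpoint
snapc_full_max_wait_time = 60
# Store only one sequence number, reusing the checkpoint directory
snapc_base_only_one_seq = 1
EOF
}

write_snapc_params "$PARAM_FILE"
cat "$PARAM_FILE"
```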
Introduced in r23587. Included in v1.5.1 and later releases.
Which SStore component to use
Default: NULL (auto-select)
shell$ mpirun --mca sstore stage -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Set the verbose level for the SStore framework.
Default: 0 (off)
shell$ mpirun --mca sstore_base_verbose 10 -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
The base directory to use when storing global snapshots. This is the directory
where all checkpoint files will be gathered during a checkpoint
operation. Usually this is a globally mounted file system, but it does not need
to be if using the stage SStore component.
Default: $HOME
shell$ mpirun --mca sstore_base_global_snapshot_dir /home/me/ckpts \
-am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Specify the global snapshot reference that should be used for this job.
Default: "ompi_global_snapshot_PID.ckpt" (where PID is the PID of the mpirun process)
shell$ mpirun --mca sstore_base_global_snapshot_ref my_ref \
-am ft-enable-cr my-app
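The snapshot directory and snapshot reference are often set together. The sketch below creates the directory if needed and prints the matching MCA arguments, using the sstore_base_global_snapshot_dir and sstore_base_global_snapshot_ref parameter names. The helper name snapshot_args and the demo directory are assumptions made for illustration, and the mpirun command is only printed.

```shell
#!/bin/sh
# Sketch: prepare a global snapshot directory and emit the MCA arguments
# that point a job at it. snapshot_args is a hypothetical helper name.

snapshot_args() {
    dir=$1; ref=$2
    # Make sure the target directory exists before the job starts.
    mkdir -p "$dir" || return 1
    printf '%s\n' "--mca sstore_base_global_snapshot_dir $dir --mca sstore_base_global_snapshot_ref $ref"
}

# Demo directory under the temporary directory, to avoid touching $HOME.
ckpt_dir="${TMPDIR:-/tmp}/ckpts-demo"
echo "mpirun $(snapshot_args "$ckpt_dir" my_ref) -am ft-enable-cr my-app"
```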
Introduced in r23587. Included in v1.5.1 and later releases.
Set the Priority of the SStore Central component.
The component with the highest priority wins.
Default: 20
shell$ mpirun --mca sstore_central_priority 100 -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Set the verbose level of the SStore Central component.
Default: 0 (set to match
sstore_base_verbose)
shell$ mpirun --mca sstore_central_verbose 10 -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Set the Priority of the SStore Stage component.
The component with the highest priority wins.
Default: 10
shell$ mpirun --mca sstore_stage_priority 100 -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Set the verbose level of the SStore Stage component.
Default: 0 (set to match
sstore_base_verbose)
shell$ mpirun --mca sstore_stage_verbose 10 -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
If the
sstore_base_global_snapshot_dir
is on a shared file system that all nodes can access, then the checkpoint files
can be copied more efficiently when FileM is used in conjunction with the stage SStore.
Default: 0 (disabled)
shell$ mpirun --mca sstore_stage_global_is_shared 1 -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Only pretend to move files using FileM.
Note: This is not for general use. It is a benchmarking and debugging option
that should be used with care.
Default: 0 (disabled)
shell$ mpirun --mca sstore_stage_skip_filem 1 -am ft-enable-cr my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Maintain a node local cache of last checkpoint.
Default: 0 (disabled)
shell$ mpirun --mca sstore_stage_caching 1 \
-am ft-enable-cr-recovery my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Compress local snapshots.
Default: 0 (disabled)
shell$ mpirun --mca sstore_stage_compress 1 \
-am ft-enable-cr-recovery my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Seconds to delay the start of compression on sync()
Default: 0
shell$ mpirun --mca sstore_stage_compress_delay 5 \
-am ft-enable-cr-recovery my-app
Introduced in r23587. Included in v1.5.1 and later releases.
Display Progress every X percentage done.
Default: 0 (off)
shell$ mpirun --mca sstore_stage_progress_meter 10 \
-am ft-enable-cr-recovery my-app
Introduced in r23629. Included in v1.5.1 and later releases.
Enable Automatic Recovery feature.
Default: 0 (disabled)
Automatically enabled when using the -am ft-enable-cr-recovery parameter,
so the following two command lines are equivalent.
shell$ mpirun --mca errmgr_hnp_autor_enable 1 \
-am ft-enable-cr-recovery my-app
shell$ mpirun -am ft-enable-cr-recovery my-app
Introduced in r23629. Included in v1.5.1 and later releases.
Enable Automatic Recovery timing information.
Default: 0 (disabled)
shell$ mpirun --mca errmgr_hnp_autor_timing 1 \
-am ft-enable-cr-recovery my-app
Introduced in r23629. Included in v1.5.1 and later releases.
Number of seconds to wait before starting to recover the job after a failure.
Default: 1
shell$ mpirun --mca errmgr_hnp_autor_recovery_delay 10 \
-am ft-enable-cr-recovery my-app
Introduced in r23629. Included in v1.5.1 and later releases.
Skip the node on which the failed process was running, even if it is still available.
Default: 1 (Enabled)
shell$ mpirun --mca errmgr_hnp_autor_skip_oldnode 0 \
-am ft-enable-cr-recovery my-app
Introduced in r23629. Included in v1.5.1 and later releases.
Enable C/R migration feature.
Default: 0 (disabled)
Automatically enabled when using the -am ft-enable-cr-recovery parameter,
so the following two command lines are equivalent.
shell$ mpirun --mca errmgr_hnp_crmig_enable 1 \
-am ft-enable-cr-recovery my-app
shell$ mpirun -am ft-enable-cr-recovery my-app
Introduced in r23629. Included in v1.5.1 and later releases.
Enable C/R migration timing information.
Default: 0 (disabled)
shell$ mpirun --mca errmgr_hnp_crmig_timing 1 \
-am ft-enable-cr-recovery my-app
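The recovery and migration parameters above are typically passed together with ft-enable-cr-recovery. The sketch below prints such a command line with a chosen recovery delay and optional timing output for both features. The helper name recovery_cmd and its argument order are assumptions made for illustration, and nothing is launched.

```shell
#!/bin/sh
# Dry-run sketch: print an mpirun command line for automatic recovery.
# recovery_cmd is a hypothetical helper name.
# Arguments: app [delay_seconds (default 1)] [timing 0|1 (default 0)]

recovery_cmd() {
    app=$1; delay=${2:-1}; timing=${3:-0}
    printf 'mpirun --mca errmgr_hnp_autor_recovery_delay %s' "$delay"
    # Optionally report timing for both recovery and migration.
    if [ "$timing" = "1" ]; then
        printf ' --mca errmgr_hnp_autor_timing 1 --mca errmgr_hnp_crmig_timing 1'
    fi
    printf ' -am ft-enable-cr-recovery %s\n' "$app"
}

recovery_cmd my-app 10 1
```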
Which CRCP component to use
Default: NULL (auto-select)
shell$ mpirun --mca crcp bkmrk -am ft-enable-cr my-app
Set the verbose level for the CRCP framework.
Default: 0 (off)
shell$ mpirun --mca crcp_base_verbose 10 -am ft-enable-cr my-app
Set the Priority of the CRCP BKMRK component.
The component with the highest priority wins.
Default: 20
shell$ mpirun --mca crcp_bkmrk_priority 100 -am ft-enable-cr my-app
Set the verbose level of the CRCP BKMRK component.
Default: 0 (set to match
crcp_base_verbose)
shell$ mpirun --mca crcp_bkmrk_verbose 10 -am ft-enable-cr my-app
Enable performance timing for the Bookmark Exchange.
Default: 0 (disabled)
shell$ mpirun --mca crcp_bkmrk_timing 1 -am ft-enable-cr my-app
This option has been deprecated as of r23587.
v1.5.0 is the last release containing this option.
All later releases should use the following:
sstore_stage_local_snapshot_dir.
Directory to use when storing local snapshots. Note that this is only used if
you disable
snapc_base_store_in_place.
Default: "/tmp"
shell$ mpirun --mca crs_base_snapshot_dir /tmp/ramdisk \
--mca snapc_base_store_in_place 0 \
-am ft-enable-cr my-app
This option has been deprecated as of r23587.
v1.5.0 is the last release containing this option.
All later releases should use the following:
sstore_base_global_snapshot_dir.
The base directory to use when storing global snapshots. This is the directory
where all checkpoint files will be gathered during a checkpoint
operation. Usually this is a globally mounted file system, but it does not need
to be if using the FileM framework.
Default: $HOME
shell$ mpirun --mca snapc_base_global_snapshot_dir /home/me/ckpts \
-am ft-enable-cr my-app
This option has been deprecated as of r23587.
v1.5.0 is the last release containing this option.
All later releases should use the following:
sstore_stage_global_is_shared.
If the
snapc_base_global_snapshot_dir
is on a shared file system that all nodes can access, then the checkpoint files
can be copied more efficiently when FileM is used.
Default: 0 (disabled)
shell$ mpirun --mca snapc_base_global_shared 1 -am ft-enable-cr my-app
This option has been deprecated as of r23587.
v1.5.0 is the last release containing this option.
All later releases should use the following:
The 'stage' component of SStore.
If the
snapc_base_global_snapshot_dir
is on a shared file system that all nodes can access, then the checkpoint files
can be stored in place instead of incurring a remote copy.
Default: 1 (enabled)
shell$ mpirun --mca snapc_base_store_in_place 0 -am ft-enable-cr my-app
This option has been deprecated as of r23587.
v1.5.0 is the last release containing this option.
All later releases should use the following:
sstore_base_global_snapshot_ref.
Specify the global snapshot reference that should be used for this job.
Default: "ompi_global_snapshot_PID.ckpt" (where PID is the PID of the mpirun process)
shell$ mpirun --mca snapc_base_global_snapshot_ref my_ref -am ft-enable-cr my-app
This option has been deprecated as of r23587. v1.5.0 is the last release containing this option. This option was removed since it was never well supported.
Establish the global snapshot directory on job startup, instead of on the first
checkpoint operation.
Note that this is currently only lightly tested, and may not work properly.
Default: 0 (disabled)
shell$ mpirun --mca snapc_base_establish_global_snapshot_dir 1 -am ft-enable-cr my-app
This option has been deprecated as of r23587.
v1.5.0 is the last release containing this option.
All later releases should use the following:
sstore_stage_skip_filem.
Only pretend to move files using FileM.
Note: This is not for general use. It is a benchmarking and debugging option
that should be used with care.
Default: 0 (disabled)
shell$ mpirun --mca snapc_full_skip_filem 1 -am ft-enable-cr my-app
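When updating job scripts written for v1.5.0 or earlier, the deprecated parameter names above can be translated mechanically. The sketch below encodes only the replacements stated in this section. The helper name new_param_name is an assumption made for illustration.

```shell
#!/bin/sh
# Sketch: map MCA parameter names deprecated as of r23587 to the
# replacements named in this section.
# new_param_name is a hypothetical helper name.

new_param_name() {
    case "$1" in
        crs_base_snapshot_dir)          echo sstore_stage_local_snapshot_dir ;;
        snapc_base_global_snapshot_dir) echo sstore_base_global_snapshot_dir ;;
        snapc_base_global_shared)       echo sstore_stage_global_is_shared ;;
        snapc_base_global_snapshot_ref) echo sstore_base_global_snapshot_ref ;;
        snapc_full_skip_filem)          echo sstore_stage_skip_filem ;;
        snapc_base_store_in_place|snapc_base_establish_global_snapshot_dir)
            # No one-to-one replacement parameter is listed for these.
            echo "$1: no direct replacement" >&2
            return 1 ;;
        *)  echo "$1" ;;   # not deprecated: pass through unchanged
    esac
}

new_param_name snapc_base_global_shared
```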
Introduced in r23587. Included in v1.5.1 and later releases.
This configure option enables the optional C/R MPI Extension APIs.
./configure --with-ft=cr --enable-mpi-ext=cr
Introduced in r23587. Included in v1.5.1 and later releases.
All processes must call this function.
OMPI_CR_CHECKPOINT(handle, seq, info)
OUT handle Global snapshot reference (string)
OUT seq Sequence number (int)
INOUT info A set of key-value pairs providing additional
information to the MPI implementation regarding how
to continue after quiescence (handle, significant on
all ranks)
int OMPI_CR_CHECKPOINT(char **handle, int *seq, MPI_Info info);
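#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    char *handle = NULL;    // Global snapshot reference
    int seq = 0;            // Sequence number
    int i, max_iter = 10;   // Iteration count chosen for illustration

    MPI_Init(&argc, &argv);
    for(i=0; i < max_iter; ++i) {
#ifdef OMPI_HAVE_MPI_EXT_CR
        // Request a checkpoint before every step
        OMPI_CR_Checkpoint(&handle, &seq, MPI_INFO_NULL);
#endif
        // Resume normal operation.
    }
    MPI_Finalize();
    return 0;
}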
Introduced in r23587. Included in v1.5.1 and later releases.
Not all processes must call this function.
OMPI_CR_RESTART(handle, seq, info)
IN handle Global snapshot reference (string)
IN seq Sequence number (int)
INOUT info A set of key-value pairs providing additional
information to the MPI implementation regarding how
to continue after quiescence (handle, significant on
all ranks)
int OMPI_CR_RESTART(char *handle, int seq, MPI_Info info);
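#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    char *handle = NULL;    // Global snapshot reference
    int seq = 0;            // Sequence number
    int i, max_iter = 10;   // Iteration count chosen for illustration

    MPI_Init(&argc, &argv);
    for(i=0; i < max_iter; ++i) {
#ifdef OMPI_HAVE_MPI_EXT_CR
        // Request a checkpoint before every step
        OMPI_CR_Checkpoint(&handle, &seq, MPI_INFO_NULL);
#endif
        // Resume normal operation.
        if( MPI_SUCCESS != MPI_Send(...) ) {
#ifdef OMPI_HAVE_MPI_EXT_CR
            // Restart from the last checkpoint, and keep processing
            OMPI_CR_Restart(handle, seq, MPI_INFO_NULL);
#else
            MPI_Abort(MPI_COMM_WORLD, -1);
#endif
        }
    }
    MPI_Finalize();
    return 0;
}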
Introduced in r23587. Included in v1.5.1 and later releases.
This is a collective operation.
OMPI_CR_MIGRATE(comm, hostname, rank, info)
IN comm Communicator of processes to migrate
IN hostname Name of the machine to move this rank onto.
May be NULL. (string)
IN rank Process rank to move this rank close to.
May be negative, indicating NULL. (int)
INOUT info A set of key-value pairs providing hints to the MPI
implementation regarding how this function should
behave (handle, significant on all ranks)
int OMPI_CR_MIGRATE(MPI_Comm comm, char *hostname, int rank, MPI_Info info)
#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    MPI_Info qinfo;
    int i, max_iter = 10;   // Iteration count chosen for illustration

    MPI_Init(&argc, &argv);
    MPI_Info_create(&qinfo);
    for(i=0; i < max_iter; ++i) {
        // Receive notification that this node is going to fail
#ifdef OMPI_HAVE_MPI_EXT_CR
        // Ask to be migrated anywhere else in the system,
        // except this node.
        MPI_Info_set(qinfo, "CR_OFF_NODE", "true");
        OMPI_CR_MIGRATE(MPI_COMM_SELF, NULL, -1, qinfo);
#endif
        // Resume normal operation.
    }
    MPI_Info_free(&qinfo);
    MPI_Finalize();
    return 0;
}
#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>
#endif

{
    MPI_Init(&argc, &argv);
    ...
    // Stage 1: Communication Pattern A
    for(i=0; i < max_iter; ++i) {
        ...
    }
#ifdef OMPI_HAVE_MPI_EXT_CR
    // Since the communication pattern is changing,
    // re-position my processes by using process migration.
    neighbor_rank = get_best_neighbor(my_rank);
    OMPI_CR_MIGRATE(MPI_COMM_WORLD, NULL, neighbor_rank, MPI_INFO_NULL);
#endif
    // Stage 2: Communication Pattern B
    for(i=0; i < max_iter; ++i) {
        ...
    }
}
Introduced in r23587. Included in v1.5.1 and later releases.
This is a collective operation.
// INC Registration Function
int OMPI_CR_INC_register_callback(OMPI_CR_INC_callback_event_t event,
OMPI_CR_INC_callback_function function,
OMPI_CR_INC_callback_function *prev_function);
// INC Callback Function Signature
typedef int (*OMPI_CR_INC_callback_function)(OMPI_CR_INC_callback_event_t event,
OMPI_CR_INC_callback_state_t state);
OMPI_CR_INC_callback_event_t
    OMPI_CR_INC_PRE_CRS_PRE_MPI     Pre-checkpoint, before OMPI INC.
    OMPI_CR_INC_PRE_CRS_POST_MPI    Pre-checkpoint, after OMPI INC.
    OMPI_CR_INC_POST_CRS_PRE_MPI    Continue/Restart, before OMPI INC.
    OMPI_CR_INC_POST_CRS_POST_MPI   Continue/Restart, after OMPI INC.
OMPI_CR_INC_callback_state_t
    OMPI_CR_INC_STATE_PREPARE       Pre-checkpoint
    OMPI_CR_INC_STATE_CONTINUE      Continue
    OMPI_CR_INC_STATE_RESTART       Restart
    OMPI_CR_INC_STATE_ERROR         Error
Introduced in r23587. Included in v1.5.1 and later releases.
This is a collective operation.
OMPI_CR_QUIESCE_START(comm, info)
IN comm communicator (handle)
INOUT info A set of key-value pairs providing hints to the MPI
implementation regarding how this function should
behave (handle, significant on all ranks)
int OMPI_CR_QUIESCE_START(MPI_Comm comm, MPI_Info info);
#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
#ifdef OMPI_HAVE_MPI_EXT_CR
    OMPI_CR_Quiesce_start(MPI_COMM_WORLD, MPI_INFO_NULL);
    // Prepare application for application-level checkpoint.
    // Wait on any important outstanding receives
    // Save application state
    OMPI_CR_Quiesce_end(MPI_COMM_WORLD, MPI_INFO_NULL);
#endif
    // Resume normal operation.
    MPI_Finalize();
    return 0;
}
Introduced in r23587. Included in v1.5.1 and later releases.
This is a collective operation.
OMPI_CR_QUIESCE_CHECKPOINT(comm, handle, seq, info)
IN comm communicator (handle)
OUT handle Global snapshot reference (string)
OUT seq Sequence number (int)
INOUT info A set of key-value pairs providing hints to the MPI
implementation regarding how this function should
behave (handle, significant on all ranks)
int OMPI_CR_QUIESCE_CHECKPOINT(MPI_Comm comm, char **handle, int *seq,
MPI_Info info);
#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    char *handle = NULL;    // Global snapshot reference
    int seq = 0;            // Sequence number

    MPI_Init(&argc, &argv);
#ifdef OMPI_HAVE_MPI_EXT_CR
    OMPI_CR_Quiesce_start(MPI_COMM_WORLD, MPI_INFO_NULL);
    // Prepare application for checkpoint.
    // Wait on any important outstanding receives
    // Mark some memory regions for exclusion
    OMPI_CR_Quiesce_checkpoint(MPI_COMM_WORLD, &handle, &seq, MPI_INFO_NULL);
    OMPI_CR_Quiesce_end(MPI_COMM_WORLD, MPI_INFO_NULL);
#endif
    // Resume normal operation.
    MPI_Finalize();
    return 0;
}
Introduced in r23587. Included in v1.5.1 and later releases.
This is a collective operation.
OMPI_CR_QUIESCE_END(comm, info)
IN comm communicator (handle)
INOUT info A set of key-value pairs providing additional
information to the MPI implementation regarding how
to continue after quiescence (handle, significant on
all ranks)
int OMPI_CR_QUIESCE_END(MPI_Comm comm, MPI_Info info);
#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
#ifdef OMPI_HAVE_MPI_EXT_CR
    OMPI_CR_Quiesce_start(MPI_COMM_WORLD, MPI_INFO_NULL);
    // Prepare application for application-level checkpoint.
    // Wait on any important outstanding receives
    // Save application state
    OMPI_CR_Quiesce_end(MPI_COMM_WORLD, MPI_INFO_NULL);
#endif
    // Resume normal operation.
    MPI_Finalize();
    return 0;
}
Introduced in r23587. Included in v1.5.1 and later releases.
The self CRS must be used for these functions to work.
// Default Checkpoint Callback
int opal_crs_self_user_checkpoint(char **restart_cmd);
// SELF CRS Checkpoint Registration Function
int OMPI_CR_self_register_checkpoint_callback(OMPI_CR_self_checkpoint_fn function);
// SELF CRS Callback Function Signature
typedef int (*OMPI_CR_self_checkpoint_fn)(char **restart_cmd);
Introduced in r23587. Included in v1.5.1 and later releases.
This is a collective operation.
// Default Restart Callback
int opal_crs_self_user_restart(void);
// SELF CRS Restart Registration Function
int OMPI_CR_self_register_restart_callback(OMPI_CR_self_restart_fn function);
// SELF CRS Callback Function Signature
typedef int (*OMPI_CR_self_restart_fn)(void);
Introduced in r23587. Included in v1.5.1 and later releases.
This is a collective operation.
// Default Continue Callback
int opal_crs_self_user_continue(void);
// SELF CRS Continue Registration Function
int OMPI_CR_self_register_continue_callback(OMPI_CR_self_continue_fn function);
// SELF CRS Callback Function Signature
typedef int (*OMPI_CR_self_continue_fn)(void);