
Fault Tolerance Research @ Open Systems Laboratory

CIFTS in Open MPI

Overview

The Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) project aims to provide a coordinated infrastructure that lets fault-tolerant systems adapt holistically to faults occurring in the operating environment. Our work on this project focuses on integrating the CIFTS Fault Tolerant Backplane (FTB) into Open MPI and on making Open MPI more robust in the face of failure.

Initial support for the Fault Tolerant Backplane was introduced in Open MPI in February 2009. Open MPI relays fault-related information to the FTB through the Notifier framework interface. The FTB coordinates interaction among components internal to Open MPI as well as external system components such as job schedulers, resource managers, checkpoint/restart (C/R) libraries, and other FTB-enabled software.
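The coordination role described above can be pictured as a publish/subscribe exchange. The following is a minimal in-memory sketch of that idea only. It is not the FTB API or the Open MPI Notifier interface, and the class name `ToyBackplane` and its methods are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Hypothetical callback type: receives the event name and a text payload.
Callback = Callable[[str, str], None]


class ToyBackplane:
    """Illustrative in-memory stand-in for a fault-event backplane.

    This is an assumption-laden sketch, not the real FTB client library.
    """

    def __init__(self) -> None:
        # Maps an event name to the callbacks registered for it.
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, event_name: str, callback: Callback) -> None:
        """Register a component's interest in one event name."""
        self._subscribers[event_name].append(callback)

    def publish(self, event_name: str, payload: str) -> int:
        """Deliver an event to every subscriber; return how many were notified."""
        callbacks = self._subscribers.get(event_name, [])
        for callback in callbacks:
            callback(event_name, payload)
        return len(callbacks)
```

In this sketch, a resource manager could call `subscribe("MPI_NODE_DEAD", handler)` and an MPI runtime could call `publish("MPI_NODE_DEAD", "Node X is unreachable")`. Neither side needs to know about the other, which is the decoupling a backplane is meant to provide.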

Publications

Currently Supported

  • A list of supported FTB Events thrown by Open MPI is available in the API Reference.
  • A list of supported FTB Events caught by Open MPI is available in the API Reference.

Demonstration

Notes

Currently, only the events described in the API Reference are supported. More events related to checkpoint/restart, node status, job status, and message corruption are planned for future releases. Based on the scenarios described in the Open MPI FTB workflow, the following events are likely to be supported by Open MPI.

Planned Events

FTB Event              Type      Action  Description
NODE_DEAD              response  Caught  Check if node is dead
MPI_NODE_DEAD          normal    Thrown  Node X is unreachable
NODE_RESTORED          response  Caught  Add node X as an unallocated resource
MPI_NODE_RESTORED      normal    Thrown  Return node X to the available resource pool
NODE_MIGRATE           response  Caught  Migrate all ranks from Node X to Node Q
MPI_NODE_MIGRATE_DONE  normal    Thrown  Ranks migrated from Node X to Node Q
JOB_ABORT              response  Caught  Suspend or terminate job Z
MPI_JOB_ABORTED        normal    Thrown  MPI Job Z has been aborted
JOB_RESUME             response  Caught  Bring back job Z to a running state
MPI_JOB_RESUMED        normal    Thrown  MPI Job Z has been resumed
IFACE_DEAD             response  Caught  Physical interface P has failed
MPI_IFACE_DEAD         normal    Thrown  Mark physical interface P as dead
IFACE_RESTORED         normal    Caught  Physical interface P is back to service
MPI_IFACE_RESTORED     normal    Thrown  Add P to available physical interfaces
MPI_MSG_CORRUPT        normal    Thrown  Message corruption on interface P
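The planned events above appear to come in pairs: a caught event followed by the event Open MPI would throw in response. The sketch below encodes that reading as a lookup table. The pairing is an assumption inferred from the order of the table rows, not documented behaviour, and `CAUGHT_TO_THROWN` and `planned_response` are hypothetical names.

```python
from typing import Optional

# Assumed pairing of caught events to thrown events, inferred from
# adjacent rows of the Planned Events table (not confirmed by the source).
CAUGHT_TO_THROWN = {
    "NODE_DEAD": "MPI_NODE_DEAD",
    "NODE_RESTORED": "MPI_NODE_RESTORED",
    "NODE_MIGRATE": "MPI_NODE_MIGRATE_DONE",
    "JOB_ABORT": "MPI_JOB_ABORTED",
    "JOB_RESUME": "MPI_JOB_RESUMED",
    "IFACE_DEAD": "MPI_IFACE_DEAD",
    "IFACE_RESTORED": "MPI_IFACE_RESTORED",
}


def planned_response(caught_event: str) -> Optional[str]:
    """Return the event assumed to be thrown after a caught event, if any."""
    return CAUGHT_TO_THROWN.get(caught_event)
```

Under this reading, `planned_response("NODE_MIGRATE")` yields `"MPI_NODE_MIGRATE_DONE"`. MPI_MSG_CORRUPT has no caught counterpart in the table, so it does not appear as a key.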

Open MPI Supported Events

FTB Event              Severity  Description
MPI_INIT               info      Initialize the MPI execution environment
MPI_FINALIZE           info      Finalize the MPI execution environment
MPI_NODE_DEAD          error     Node X is unreachable
MPI_NODE_RESTORED      info      Node X is back to service
MPI_RANK_DEAD          error     Rank Y (on Node X) is presumably dead
MPI_RANK_RESTORED      info      Rank Y (on Node X) is back to service
MPI_NODE_MIGRATE_DONE  info      Ranks migrated from Node X to Node Q
MPI_JOB_ABORT_CMD      error     Command to abort MPI Job Z
MPI_JOB_RESUME_CMD     info      Command to resume MPI Job Z
MPI_JOB_ABORTED        error     MPI Job Z has been aborted
MPI_JOB_RESUMED        info      MPI Job Z has been resumed
MPI_MSG_CORRUPT        error     Message corruption on interface P
MPI_IFACE_DEAD         error     Mark physical interface P as dead
MPI_IFACE_RESTORED     info      Add P to available physical interfaces
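A tool that consumes these events might want to filter them by severity. The sketch below transcribes the event names and severities from the table above into a dictionary. The names `SUPPORTED_EVENTS` and `events_with_severity` are hypothetical and not part of Open MPI or the FTB.

```python
from typing import List

# Event name -> severity, transcribed from the Open MPI Supported Events table.
SUPPORTED_EVENTS = {
    "MPI_INIT": "info",
    "MPI_FINALIZE": "info",
    "MPI_NODE_DEAD": "error",
    "MPI_NODE_RESTORED": "info",
    "MPI_RANK_DEAD": "error",
    "MPI_RANK_RESTORED": "info",
    "MPI_NODE_MIGRATE_DONE": "info",
    "MPI_JOB_ABORT_CMD": "error",
    "MPI_JOB_RESUME_CMD": "info",
    "MPI_JOB_ABORTED": "error",
    "MPI_JOB_RESUMED": "info",
    "MPI_MSG_CORRUPT": "error",
    "MPI_IFACE_DEAD": "error",
    "MPI_IFACE_RESTORED": "info",
}


def events_with_severity(severity: str) -> List[str]:
    """Return the names of all listed events that carry the given severity."""
    return sorted(
        name for name, level in SUPPORTED_EVENTS.items() if level == severity
    )
```

For example, `events_with_severity("error")` returns the six events a monitoring component would most likely act on, such as MPI_NODE_DEAD and MPI_MSG_CORRUPT.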