The Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) project aims to provide a coordinated infrastructure that enables fault-tolerant systems to adapt holistically to faults occurring in the operating environment. Our work on this project focuses on integrating the CIFTS Fault Tolerant Backplane (FTB) into Open MPI and on making Open MPI more robust in the face of failure.
Initial support for the Fault Tolerant Backplane was introduced in Open MPI in February 2009. Open MPI relays fault-related information to the FTB through the Notifier framework interface. The FTB coordinates the interaction between components internal to Open MPI and external system components such as job schedulers, resource managers, checkpoint/restart (C/R) libraries, and other FTB-enabled software.
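
The coordination role described above follows a publish/subscribe pattern, which can be sketched in a few lines. This is a minimal in-process illustration only: the `Backplane` class and its `subscribe` and `publish` methods are hypothetical names chosen for this example and are not the FTB client API or the Open MPI Notifier interface.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[str, str], None]


class Backplane:
    """Hypothetical in-process stand-in for a backplane (not the real FTB API)."""

    def __init__(self) -> None:
        # Maps an event name to the handlers registered for it.
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        """Register a component's interest in one event name."""
        self._subscribers[event_name].append(handler)

    def publish(self, event_name: str, payload: str) -> int:
        """Deliver an event to every subscriber; return how many were notified."""
        handlers = self._subscribers.get(event_name, [])
        for handler in handlers:
            handler(event_name, payload)
        return len(handlers)


received = []
bus = Backplane()

# Two external components (names are illustrative) register for the same event.
bus.subscribe("MPI_NODE_DEAD", lambda name, msg: received.append(("scheduler", name, msg)))
bus.subscribe("MPI_NODE_DEAD", lambda name, msg: received.append(("checkpointer", name, msg)))

# The MPI library side relays a fault; both subscribers see it.
delivered = bus.publish("MPI_NODE_DEAD", "Node X is unreachable")
```

The point of the pattern is that the publisher does not need to know which components are listening, so schedulers, resource managers, and C/R libraries can be added without changing the code that reports the fault.
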
Currently, only the events described in the API Reference are supported. More events related to checkpoint/restart, node status, job status, and message corruption are planned for future releases. Based on the scenarios described in the Open MPI FTB workflow, the following events are likely to be supported by Open MPI:
| FTB Event | Type | Action | Description |
|---|---|---|---|
| NODE_DEAD | response | Caught | Check if node is dead |
| MPI_NODE_DEAD | normal | Thrown | Node X is unreachable |
| NODE_RESTORED | response | Caught | Add node X as an unallocated resource |
| MPI_NODE_RESTORED | normal | Thrown | Return node X to the available resource pool |
| NODE_MIGRATE | response | Caught | Migrate all ranks from Node X to Node Q |
| MPI_NODE_MIGRATE_DONE | normal | Thrown | Ranks migrated from Node X to Node Q |
| JOB_ABORT | response | Caught | Suspend or terminate job Z |
| MPI_JOB_ABORTED | normal | Thrown | MPI Job Z has been aborted |
| JOB_RESUME | response | Caught | Bring back job Z to a running state |
| MPI_JOB_RESUMED | normal | Thrown | MPI Job Z has been resumed |
| IFACE_DEAD | response | Caught | Physical interface P has failed |
| MPI_IFACE_DEAD | normal | Thrown | Mark physical interface P as dead |
| IFACE_RESTORED | normal | Caught | Physical interface P is back to service |
| MPI_IFACE_RESTORED | normal | Thrown | Add P to available physical interfaces |
| MPI_MSG_CORRUPT | normal | Thrown | Message corruption on interface P |
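
The table above alternates between events that Open MPI catches and events that it throws. One way to read it is as a set of request/reply pairs, sketched below. The pairing is an assumption inferred from the adjacent rows and matching names; the source does not state it explicitly, and `CAUGHT_TO_THROWN` and `react` are hypothetical names used only for illustration.

```python
from typing import Optional

# Assumed pairing of each caught event with the event thrown afterwards,
# inferred from adjacent table rows (not stated explicitly in the source).
CAUGHT_TO_THROWN = {
    "NODE_DEAD": "MPI_NODE_DEAD",
    "NODE_RESTORED": "MPI_NODE_RESTORED",
    "NODE_MIGRATE": "MPI_NODE_MIGRATE_DONE",
    "JOB_ABORT": "MPI_JOB_ABORTED",
    "JOB_RESUME": "MPI_JOB_RESUMED",
    "IFACE_DEAD": "MPI_IFACE_DEAD",
    "IFACE_RESTORED": "MPI_IFACE_RESTORED",
}


def react(caught_event: str) -> Optional[str]:
    """Return the event to throw after handling caught_event, or None if unknown."""
    return CAUGHT_TO_THROWN.get(caught_event)
```

Under this reading, `MPI_MSG_CORRUPT` is the only thrown event with no caught counterpart, so `react` has no entry that produces it.
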

The following table lists each MPI event together with its severity:

| FTB Event | Severity | Description |
|---|---|---|
| MPI_INIT | info | Initialize the MPI execution environment |
| MPI_FINALIZE | info | Finalize the MPI execution environment |
| MPI_NODE_DEAD | error | Node X is unreachable |
| MPI_NODE_RESTORED | info | Node X is back to service |
| MPI_RANK_DEAD | error | Rank Y (on Node X) is presumably dead |
| MPI_RANK_RESTORED | info | Rank Y (on Node X) is back to service |
| MPI_NODE_MIGRATE_DONE | info | Ranks migrated from Node X to Node Q |
| MPI_JOB_ABORT_CMD | error | Command to abort MPI Job Z |
| MPI_JOB_RESUME_CMD | info | Command to resume MPI Job Z |
| MPI_JOB_ABORTED | error | MPI Job Z has been aborted |
| MPI_JOB_RESUMED | info | MPI Job Z has been resumed |
| MPI_MSG_CORRUPT | error | Message corruption on interface P |
| MPI_IFACE_DEAD | error | Mark physical interface P as dead |
| MPI_IFACE_RESTORED | info | Add P to available physical interfaces |
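
A consumer of these events might keep the severity table as a simple lookup, for example to react only to errors. The sketch below copies the severities from the table above; the names `SEVERITY` and `events_with_severity` are hypothetical and do not come from Open MPI or the FTB.

```python
from typing import List

# Severity of each event, copied from the table above.
SEVERITY = {
    "MPI_INIT": "info",
    "MPI_FINALIZE": "info",
    "MPI_NODE_DEAD": "error",
    "MPI_NODE_RESTORED": "info",
    "MPI_RANK_DEAD": "error",
    "MPI_RANK_RESTORED": "info",
    "MPI_NODE_MIGRATE_DONE": "info",
    "MPI_JOB_ABORT_CMD": "error",
    "MPI_JOB_RESUME_CMD": "info",
    "MPI_JOB_ABORTED": "error",
    "MPI_JOB_RESUMED": "info",
    "MPI_MSG_CORRUPT": "error",
    "MPI_IFACE_DEAD": "error",
    "MPI_IFACE_RESTORED": "info",
}


def events_with_severity(level: str) -> List[str]:
    """Return the names of all events with the given severity, sorted by name."""
    return sorted(name for name, sev in SEVERITY.items() if sev == level)


error_events = events_with_severity("error")
```

Filtering on severity rather than on individual event names means a monitoring tool would keep working if further error-level events are added in later releases.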