The Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) project aims to provide a coordinated infrastructure that enables fault-tolerant systems to adapt holistically to faults occurring in their operating environment. Our work on this project focuses on integrating the CIFTS Fault Tolerant Backplane (FTB) into Open MPI and on making Open MPI more robust in the face of failure.
Initial support for the Fault Tolerant Backplane was introduced in Open MPI in February 2009. Open MPI relays fault-related information to the FTB through the Notifier framework interface. The FTB coordinates the interaction between components internal to Open MPI and external system components such as job schedulers, resource managers, checkpoint/restart (C/R) libraries, and other FTB-enabled software.
Currently, only the events described in the API Reference are supported.
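The relay pattern described above can be pictured with a small toy model. The sketch below is not the FTB API and not Open MPI code: `ToyBackplane`, `ToyNotifier`, `FaultEvent`, the event name `NODE_DOWN`, and the source label `toy_mpi` are all hypothetical names chosen for illustration. It assumes a simple publish/subscribe design in which a notifier forwards only events from a supported set and the backplane delivers them to any interested component.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List, Set


@dataclass(frozen=True)
class FaultEvent:
    """A fault-related event; the field names are illustrative only."""
    name: str       # hypothetical event name
    severity: str   # hypothetical severity label
    source: str     # component that reported the fault


Handler = Callable[[FaultEvent], None]


class ToyBackplane:
    """Stands in for the backplane: routes published events to subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        """Register a handler (e.g. a scheduler) for one event name."""
        self._subscribers[event_name].append(handler)

    def publish(self, event: FaultEvent) -> int:
        """Deliver the event and return how many handlers received it."""
        handlers = self._subscribers.get(event.name, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


class ToyNotifier:
    """Stands in for a notifier that relays faults to the backplane."""

    def __init__(self, backplane: ToyBackplane, supported_events: Set[str]) -> None:
        self._backplane = backplane
        self._supported = set(supported_events)

    def report(self, name: str, severity: str) -> bool:
        """Relay a supported event; return False if the event is not supported."""
        if name not in self._supported:
            return False
        self._backplane.publish(FaultEvent(name, severity, source="toy_mpi"))
        return True
```

In this sketch, an external component such as a job scheduler would call `subscribe("NODE_DOWN", handler)`, and a call to `report("NODE_DOWN", "fatal")` on the notifier would reach that handler, while an event outside the supported set would be dropped rather than relayed.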