Transparent checkpoint/restart process fault tolerance allows the state of an application to be preserved to a stable storage device and recovered at a later time. This technique does not require any changes to the application source code, making it a convenient solution for complex legacy applications and for scheduler-based dynamic resource management.
We provide a transparent checkpoint/restart process fault tolerance solution for MPI-1.3 compliant applications using Open MPI. Our solution was incorporated into the development trunk of Open MPI in March 2007, and later released as part of the v1.3 release series.
Open MPI provides a transparent, coordinated checkpoint/restart implementation, backed primarily by the Berkeley Lab Checkpoint/Restart (BLCR) library.
hnp ErrMgr component: C/R-enabled Process Migration
hnp ErrMgr component: C/R-enabled Automatic Recovery
No special code is required in an MPI application to take advantage of Open MPI's checkpoint/restart functionality, although some limitations may be imposed (depending on the back-end checkpointing system that is used).
Open MPI's checkpoint/restart functionality only involves MPI processes: the Open MPI runtime environment is not checkpointed.
Open MPI does not yet support checkpointing/restarting MPI-2 applications. In particular, Open MPI's behavior is undefined when checkpointing MPI processes that invoke any MPI-2 functionality (including dynamic functions and I/O).
Checkpoints can only be performed after all processes have called MPI_INIT and before any process has
Threaded checkpoint coordination support was added in Feb. 2008. This allows an application to make progress on a checkpoint operation whether or not the process is inside the MPI library. To enable this feature, you must enable both MPI threads and the checkpoint thread at configure time:
./configure --enable-ft-thread --with-ft=cr --enable-mpi-threads
After r22841, the --enable-mpi-threads option was replaced by --enable-opal-multi-threads, so you should use the following instead:
./configure --enable-ft-thread --with-ft=cr --enable-opal-multi-threads
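Which spelling of the threading flag applies depends on the revision of the source tree being built. The choice can be sketched as a small shell helper. The function name ft_configure_flags is hypothetical and not part of Open MPI, the revision number is assumed to be supplied by the caller, and "after r22841" is read here as strictly greater than 22841, which is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print the configure flags for a
# checkpoint/restart build with the checkpoint thread enabled.
#   $1 = numeric source revision, supplied by the caller (e.g. 23000)
ft_configure_flags() {
    rev=$1
    # Assumption: "after r22841" means revisions strictly greater than 22841.
    if [ "$rev" -gt 22841 ]; then
        thread_flag="--enable-opal-multi-threads"   # renamed flag
    else
        thread_flag="--enable-mpi-threads"          # original flag
    fi
    echo "--enable-ft-thread --with-ft=cr $thread_flag"
}
```

With the helper sourced into the current shell, a build could then be configured with, for example, ./configure $(ft_configure_flags 23000).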
Do not use the BLCR command line tools! You must use the tools provided by Open MPI.
It is currently undefined how Open MPI will behave if you use the BLCR command line tools instead.
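As a rough illustration of a checkpoint/restart cycle that stays within Open MPI's own tools, the sketch below assembles the relevant command lines for mpirun, ompi-checkpoint, and ompi-restart. The function names, process count, application name, and PID are hypothetical, and the commands are printed rather than executed so the sequence can be reviewed first. Exact options can differ between Open MPI releases, so check them against the installed version.

```shell
#!/bin/sh
# Sketch of a checkpoint/restart cycle using Open MPI's tools rather than
# BLCR's own cr_checkpoint/cr_restart. All function names and arguments are
# hypothetical; each function only prints the command it would run.

# Launch line: the ft-enable-cr parameter set enables checkpoint/restart.
launch_cmd() {        # $1 = number of processes, $2 = application
    echo "mpirun -np $1 -am ft-enable-cr $2"
}

# Checkpoint request: addressed to the PID of mpirun, not of an MPI process.
checkpoint_cmd() {    # $1 = PID of mpirun
    echo "ompi-checkpoint $1"
}

# Restart from the global snapshot reference reported by the checkpoint.
restart_cmd() {       # $1 = global snapshot reference
    echo "ompi-restart $1"
}
```

For example, checkpoint_cmd 1234 prints the checkpoint request for an mpirun whose PID is 1234, and restart_cmd takes whatever snapshot reference that checkpoint reported.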
Currently, the only fully supported threading model is MPI_THREAD_SINGLE. Other MPI threading models may work, but have not received any testing.
The SELF checkpoint interface has changed slightly. Be sure to read the attached documentation for the new function call specifications. If you require backwards compatibility, let us know on the users list and we can discuss options there.