Command line tools to support checkpoint/restart in Open MPI.
The ompi-checkpoint command is provided to checkpoint an MPI application. The
one required argument to this command is the PID of the mpirun process. This
command must be launched on the same machine as the running mpirun
process. Once a checkpoint request has completed, ompi-checkpoint will return a
global snapshot reference and a sequence number. This information will allow
you to properly restart the MPI job at a later time.
ompi-checkpoint PID_OF_MPIRUN \
[-h | --help]
[-v | --verbose]
[-V #]
[--term]
[--stop]
[-w | --nowait]
[-s | --status]
[-l | --list]
[-attach | --attach]
[-detach | --detach]
[-crdebug | --crdebug]
shell$ mpirun my-app <args> &
shell$ export PID_OF_MPIRUN=1234
shell$ ompi-checkpoint $PID_OF_MPIRUN
Snapshot Ref.: 0 ompi-global-snapshot-1234
shell$ ompi-checkpoint $PID_OF_MPIRUN
Snapshot Ref.: 1 ompi-global-snapshot-1234
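The example above takes two checkpoints by hand. A minimal sketch of repeating that on a timer is shown below. The function name `periodic_checkpoint` and the `CKPT_CMD` override are hypothetical helpers introduced here for illustration (the override exists only so the loop can be dry-run without Open MPI installed); only the plain `ompi-checkpoint PID_OF_MPIRUN` invocation comes from this page.

```shell
# Sketch: checkpoint a running job COUNT times, INTERVAL seconds apart.
# CKPT_CMD is a hypothetical override for dry runs; it defaults to ompi-checkpoint.
periodic_checkpoint() {
    pid=$1        # PID of the mpirun process
    interval=$2   # seconds to wait between checkpoints
    count=$3      # number of checkpoints to take
    n=0
    while [ "$n" -lt "$count" ]; do
        # Stop early if the mpirun process no longer exists
        # (kill -0 only tests for existence; it sends no signal).
        kill -0 "$pid" 2>/dev/null || return 1
        # Left unquoted on purpose so CKPT_CMD may contain several words.
        ${CKPT_CMD:-ompi-checkpoint} "$pid" || return 1
        n=$((n + 1))
        # Sleep between checkpoints, but not after the last one.
        [ "$n" -lt "$count" ] && sleep "$interval"
    done
    return 0
}
```

For instance, `periodic_checkpoint "$PID_OF_MPIRUN" 600 6` would request six checkpoints ten minutes apart, run on the same machine as mpirun. Each successful call should yield a new sequence number under the same global snapshot reference, as in the example above.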
| Argument | Description |
|---|---|
| PID_OF_MPIRUN | PID of the mpirun process |
| -h, --help | Display help |
| -v, --verbose | Display verbose output |
| -V # | Display verbose output up to a specified level |
| --term | Terminate the application after checkpoint. |
| --stop | Send SIGSTOP to application just after checkpoint (checkpoint will not finish until SIGCONT is sent) (Cannot be used with --term) |
| -w, --nowait | Not Implemented: Do not wait for the application to finish checkpointing before returning. |
| -s, --status | Display status messages describing the progression of the checkpoint. |
| -l, --list | Display a list of checkpoint files available on this machine |
| -attach, --attach | Introduced in r23587. Included in v1.5.1 and later releases. Wait for the debugger to attach directly after taking the checkpoint. |
| -detach, --detach | Introduced in r23587. Included in v1.5.1 and later releases. Do not wait for the debugger to reattach after taking the checkpoint. |
| -crdebug, --crdebug | Introduced in r23587. Included in v1.5.1 and later releases. Enable C/R Enhanced Debugging. |
Users familiar with LAM/MPI checkpoint/restart commands should notice that ompi-checkpoint does not require the user to tell it which checkpoint/restart service (e.g., BLCR or SELF) to use when checkpointing the application. This information is automatically detected and stored with the checkpoint snapshot.
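Because the global snapshot reference and sequence number are needed later for the restart, it can help to capture them in a script. The sketch below assumes the output line has the form shown in the example earlier on this page (`Snapshot Ref.:`, then the sequence number, then the reference); other Open MPI versions may format it differently. The function name `parse_snapshot_line` is a hypothetical helper, not part of Open MPI.

```shell
# Sketch: extract "<sequence number> <global snapshot reference>" from
# ompi-checkpoint output read on stdin. The line format is assumed from
# the example output on this page.
parse_snapshot_line() {
    sed -n 's/^Snapshot Ref\.:[[:space:]]*\([0-9][0-9]*\)[[:space:]][[:space:]]*\([^[:space:]][^[:space:]]*\).*$/\1 \2/p' |
        tail -n 1   # keep only the most recent matching line
}

# Demonstration on the sample line from this page (no Open MPI needed).
printf '%s\n' 'Snapshot Ref.: 0 ompi-global-snapshot-1234' | parse_snapshot_line
# prints: 0 ompi-global-snapshot-1234
```

In a real session the function would be fed from the command itself, for example `ompi-checkpoint "$PID_OF_MPIRUN" | parse_snapshot_line`.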
The ompi-restart command is provided to restart a previously-checkpointed MPI
application. The one required argument to this command is the global snapshot
reference returned by ompi-checkpoint. The global snapshot reference contains
all of the necessary information to properly restart an MPI
application. Invoking ompi-restart results in a new mpirun being launched.
ompi-restart GLOBAL_SNAPSHOT_REF \
[-h | --help]
[-v | --verbose]
[--fork]
[-s | --seq]
[--hostfile]
[--machinefile]
[-i | --info]
[-a | --apponly]
[-crdebug | --crdebug]
[-mpirun_opts | --mpirun_opts]
[--showme]
shell$ ompi-restart ompi-global-snapshot-1234
| Argument | Description |
|---|---|
| GLOBAL_SNAPSHOT_REF | Global snapshot reference |
| -h, --help | Display help |
| -v, --verbose | Display verbose output |
| --fork | Fork off a new process which is the restarted process instead of replacing orte_restart. |
| -s, --seq # | The sequence number of the checkpoint to start from. (Default: -1, or most recent) |
| --hostfile, --machinefile | Provide a hostfile to use for launch. |
| -i, --info | Display information about the checkpoint |
| -a, --apponly | Introduced in r23587. Included in v1.5.1 and later releases. Only create the app context file, do not restart from it. |
| -crdebug, --crdebug | Introduced in r23587. Included in v1.5.1 and later releases. Enable C/R Enhanced Debugging |
| -mpirun_opts, --mpirun_opts | Introduced in r23587. Included in v1.5.1 and later releases. Command line options to pass directly to mpirun (be sure to quote long strings, and escape internal quotes) |
| --showme | Introduced in r23587. Included in v1.5.1 and later releases. Display the full command line that would have been exec'ed. |
| -p, --preload | Deprecated in r23587. Deprecated in v1.5.1 and later releases. Preload the checkpoint files before restarting (Default = Disabled) |
Users familiar with LAM/MPI checkpoint/restart commands should notice that ompi-restart does not require the user to tell it which checkpoint/restart service (e.g., BLCR or SELF) was used when checkpointing the application. This information is stored with the checkpoint snapshot and automatically used by the ompi-restart command.
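To restart from a specific checkpoint rather than the most recent one, the `-s` option takes the sequence number reported by ompi-checkpoint. The sketch below only assembles and prints the command line, so it can be inspected before anything is launched. `build_restart_cmd` is a hypothetical helper, and placing the option before the snapshot reference is an assumption that mirrors the ompi-migrate examples on this page.

```shell
# Sketch: print an ompi-restart command line for a snapshot reference and
# an optional sequence number. Nothing is executed here.
build_restart_cmd() {
    ref=$1          # global snapshot reference, e.g. ompi-global-snapshot-1234
    seqno=${2:-}    # optional sequence number; empty means "most recent"
    if [ -n "$seqno" ]; then
        printf 'ompi-restart -s %s %s\n' "$seqno" "$ref"
    else
        printf 'ompi-restart %s\n' "$ref"
    fi
}

build_restart_cmd ompi-global-snapshot-1234 0
# prints: ompi-restart -s 0 ompi-global-snapshot-1234
```

Omitting the second argument leaves the choice to ompi-restart, which defaults to the most recent checkpoint according to the table above.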
Introduced in r23587. Included in v1.5.1 and later releases.
The ompi-migrate command is provided to migrate an MPI application.
The one required argument to this command is the PID of the mpirun process.
This command must be launched on the same machine as the running mpirun process.
ompi-migrate PID_OF_MPIRUN \
[-h | --help]
[-v | --verbose]
[-r | --ranks]
[-t | --onto]
[-x | --off]
shell$ ompi-migrate -x node123,node124 1234
shell$ ompi-migrate -x node123,node124 -t node125,node126 1234
shell$ ompi-migrate -r 1,3,5,7 1234
| Argument | Description |
|---|---|
| PID_OF_MPIRUN | PID of the mpirun process |
| -h, --help | Display help |
| -v, --verbose | Display verbose output |
| -r, --ranks | List of MPI_COMM_WORLD ranks to migrate (comma separated) |
| -t, --onto | List of nodes to migrate onto (comma separated) |
| -x, --off | List of nodes to migrate off of (comma separated) |
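
The `-r` option expects a comma-separated list of ranks, which is tedious to type for large jobs. The sketch below generates such a list from a start, a step, and an end value. `rank_list` is a hypothetical helper introduced for illustration and is not part of Open MPI.

```shell
# Sketch: print ranks FIRST, FIRST+STEP, ... up to LAST as a comma-separated list.
rank_list() {
    first=$1
    step=$2
    last=$3
    # A non-positive step would never terminate, so reject it.
    [ "$step" -gt 0 ] || return 1
    list=
    i=$first
    while [ "$i" -le "$last" ]; do
        # Append a comma only when the list already has an entry.
        list=${list:+$list,}$i
        i=$((i + step))
    done
    printf '%s\n' "$list"
}

rank_list 1 2 7
# prints: 1,3,5,7
```

The result can be substituted directly into the migrate command, for example `ompi-migrate -r "$(rank_list 1 2 7)" 1234`, which is equivalent to the third example above.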