MemVerge Transparent Checkpoint Operator User Guide

This guide explains how to use the MemVerge Transparent Checkpoint Operator to enable automatic checkpointing and restoration for your Kubernetes workloads. The operator provides automated snapshot and restore functionality for Kubernetes Pods. Designed for workloads that need high availability, fast recovery, and fault tolerance, it uses Kubernetes-native events to detect when Pods stop, fail, or are terminated, and automatically creates a snapshot of a Pod's state once such an event is detected. These snapshots are then used to restore the application when Pods are restarted, either manually or by the scheduler.

A checkpoint must be restored on the same type of GPU that was used to create it. MemVerge does not support using checkpoints to migrate workloads between different GPU models.

By applying specific labels to your Pod specifications, you can instruct the operator to manage the lifecycle of your application's running state.

This guide covers how to accomplish the following tasks: