MemVerge Transparent Checkpoint Operator for Kubernetes

The MemVerge Transparent Checkpoint Operator for Kubernetes provides automated snapshot and restore functionality for Kubernetes Pods. Designed for workloads that need high availability, fast recovery, and fault tolerance, the operator uses Kubernetes-native events to detect when Pods stop, fail, or are terminated, and automatically creates a snapshot of their state. These snapshots are then used to restore the application when the Pods are restarted, whether manually or by the scheduler.

Imagine:

  • Instant Recovery: Restore critical stateful applications in seconds after failures, minimizing downtime and data loss.
  • Enhanced Resilience: Protect your stateful workloads against node failures, planned maintenance, and accidental disruptions.
  • Simplified Operations: Automate checkpointing and restoration processes through Kubernetes-native controls.
  • GPU-Aware Checkpointing (Optional): Extend the benefits of transparent checkpointing to your GPU-accelerated applications.

Key Features

  • Automatic Checkpointing: Triggered upon Pod deletion (e.g., manual deletion, node drain).
  • Automatic Restoration: Initiated when a checkpointed Pod is recreated.
  • Granular Control: Specify which containers and files to include in checkpoints.
  • Flexible Storage: Supports dynamic provisioning of Persistent Volumes for checkpoint data.
  • Integration with Workload Controllers: Works seamlessly with Deployments, StatefulSets, Jobs, and more.
  • Optional GPU Operator Integration: Enables checkpointing of GPU-enabled workloads using Kubernetes CDI.
  • Comprehensive Label-Based Configuration: Easily manage checkpointing behavior using Kubernetes labels.
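
To give a feel for what label-based configuration can look like, here is a minimal Python sketch that turns a Pod's labels into a checkpoint policy. The label keys (`example.com/checkpoint`, `example.com/checkpoint-containers`) and the `parse_checkpoint_labels` helper are hypothetical and chosen only for illustration. They are not the operator's actual labels; the installation guide has the real ones. Because Kubernetes label values may not contain commas, the sketch assumes container names are separated by underscores.

```python
from typing import Dict, List, NamedTuple

# Hypothetical label keys for illustration only; NOT the operator's real labels.
ENABLE_LABEL = "example.com/checkpoint"
CONTAINERS_LABEL = "example.com/checkpoint-containers"


class CheckpointConfig(NamedTuple):
    enabled: bool          # whether the Pod opts in to checkpointing
    containers: List[str]  # containers to include; empty means "not specified"


def parse_checkpoint_labels(labels: Dict[str, str]) -> CheckpointConfig:
    """Derive a checkpoint policy from a Pod's metadata.labels mapping."""
    enabled = labels.get(ENABLE_LABEL, "false").lower() == "true"
    raw = labels.get(CONTAINERS_LABEL, "")
    # Label values cannot contain commas, so this sketch splits on "_".
    containers = [name for name in raw.split("_") if name]
    return CheckpointConfig(enabled, containers)


config = parse_checkpoint_labels({
    "app": "trainer",
    ENABLE_LABEL: "true",
    CONTAINERS_LABEL: "app_sidecar",
})
print(config)
```

Keeping the policy in labels means it travels with the Pod template, so a Deployment or StatefulSet can carry it without any extra configuration object.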

Get Started

Ready to bring the power of instant application recovery to your Kubernetes cluster? Follow our comprehensive installation guide to deploy the MemVerge Transparent Checkpoint Operator.

How It Works

The MemVerge Transparent Checkpoint Operator builds on the underlying capabilities of MMCloud Engine to create consistent, application-aware snapshots of your Pod's running state, including in-memory data. These snapshots are stored on Persistent Volumes within your Kubernetes cluster. When a Pod needs to be restarted, the operator orchestrates the restoration process, bringing the application back to its pre-failure state within seconds.
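
The checkpoint-on-delete, restore-on-recreate cycle described above can be pictured with a toy Python model. It is a conceptual sketch only: `Pod`, `ToyCheckpointController`, and its methods are invented for illustration, a plain dictionary stands in for the Persistent Volume, and none of it reflects how the operator or MMCloud Engine is actually implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Pod:
    """Toy stand-in for a Pod: a name plus some in-memory state."""
    name: str
    state: Dict[str, int] = field(default_factory=dict)


class ToyCheckpointController:
    """Illustrative only: snapshot on deletion, restore on recreation."""

    def __init__(self) -> None:
        # Stands in for checkpoint data kept on Persistent Volumes.
        self._snapshots: Dict[str, Dict[str, int]] = {}

    def on_pod_deleted(self, pod: Pod) -> None:
        # Copy the state so later changes to the Pod do not alter the snapshot.
        self._snapshots[pod.name] = dict(pod.state)

    def on_pod_created(self, name: str) -> Pod:
        # Restore from a snapshot if one exists; otherwise start empty.
        saved: Optional[Dict[str, int]] = self._snapshots.get(name)
        return Pod(name=name, state=dict(saved) if saved is not None else {})


controller = ToyCheckpointController()
worker = Pod("trainer-0", {"epoch": 7})
controller.on_pod_deleted(worker)                  # e.g. a node drain
restored = controller.on_pod_created("trainer-0")  # resumes at epoch 7
fresh = controller.on_pod_created("trainer-1")     # no snapshot, starts empty
print(restored.state, fresh.state)
```

The point of the model is the ordering: the snapshot is taken as part of handling the deletion event, so by the time a replacement Pod appears there is already state to restore from.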

Use Cases

  • Mission-Critical Applications: Ensure high availability and minimize downtime for your most important services.
  • Long-Running Batch Jobs: Preserve progress and quickly resume interrupted computations.
  • AI/ML Workloads: Protect the significant time and resources invested in training models.
  • Stateful Applications: Provide robust recovery for databases, message queues, and other stateful services.
  • Dev/Test Environments: Easily snapshot and restore application states for efficient testing and debugging.

Learn More