
MemVerge Transparent Checkpoint Operator Installation Guide

This guide walks Kubernetes administrators through installing the MemVerge Transparent Checkpoint Operator, which provides automated snapshot and restore for Kubernetes Pods. Designed for workloads that need high availability, fast recovery, and fault tolerance, the operator uses Kubernetes-native events to detect when Pods stop, fail, or are terminated, and automatically snapshots their state. These snapshots are then used to restore the application when its Pods are restarted, whether manually or by the scheduler.

Prerequisites

Before proceeding with the installation, ensure your Kubernetes environment meets the following requirements:

  • Kubernetes Cluster: Access to a Kubernetes v1.28+ cluster with cluster-admin role. Supported distributions include:
      • Vanilla Kubernetes
      • Rancher Kubernetes Engine 2 (RKE2)
      • K3s
  • Container Runtime Interface (CRI): Your Kubernetes nodes must be configured with one of the following supported CRI runtimes:
      • Containerd: version v1.7+
      • CRI-O: version v1.28+
      • Note: Other CRI runtimes are not currently supported. Refer to the official Kubernetes documentation for more information on CRI runtimes: https://kubernetes.io/docs/setup/production-environment/container-runtimes/
  • Storage Class: Your cluster must be configured with a StorageClass that supports dynamic provisioning of Persistent Volumes and the ability to move a Persistent Volume between different nodes. This is essential for the checkpoint functionality.
  • kubectl: Ensure you have kubectl version v1.28+ installed and configured to interact with your Kubernetes cluster.
  • Helm: Helm package manager version v3.14+ is required for installing the MemVerge Transparent Checkpoint Operator.

For a detailed list of system and software requirements, see the Requirements page.
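
The client-side version minimums listed above can be checked from a workstation before starting. The following is a minimal sketch rather than an official MemVerge tool: version_ge is a hypothetical helper that relies on sort -V (GNU coreutils), and the sed expression assumes that kubectl version --client -o json reports the client version in a gitVersion field.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B (a leading "v" is ignored).
# Hypothetical helper; depends on `sort -V` from GNU coreutils.
version_ge() {
  a=${1#v}
  b=${2#v}
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}

# Compare the locally installed client tools against the documented minimums.
if command -v kubectl >/dev/null 2>&1; then
  kubectl_ver=$(kubectl version --client -o json 2>/dev/null |
    sed -n 's/.*"gitVersion": *"\([^"]*\)".*/\1/p' | head -n 1)
  version_ge "$kubectl_ver" 1.28 && echo "kubectl $kubectl_ver: OK" \
    || echo "kubectl $kubectl_ver: v1.28+ required"
fi

if command -v helm >/dev/null 2>&1; then
  helm_ver=$(helm version --template '{{.Version}}' 2>/dev/null)
  version_ge "$helm_ver" 3.14 && echo "helm $helm_ver: OK" \
    || echo "helm $helm_ver: v3.14+ required"
fi
```

The cluster-side requirements can be inspected with kubectl get nodes -o wide (the CONTAINER-RUNTIME column shows each node's runtime and version) and kubectl get storageclass.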

Step 1: Acquire GitHub Token

To download the MemVerge Helm chart and container images, you need a personal access token from the mv-customer-support GitHub account. Please contact MemVerge Customer Support at support@memverge.com to obtain this token.

Step 2: Log in to GitHub Registry

Use the acquired personal access token to log in to the GitHub Container Registry (ghcr.io/memverge). Execute the following Helm command:

helm registry login ghcr.io/memverge
# Username: mv-customer-support
# Password: <your-personal-access-token>

Replace <your-personal-access-token> with the token you received from MemVerge Customer Support.
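
For scripted or CI installations, the same login can be done without the interactive prompt. This is a sketch under stated assumptions: MV_GHCR_TOKEN is a hypothetical environment variable that holds the token, and mv_registry_login is an illustrative wrapper name. The --username and --password-stdin flags are standard helm registry login options.

```shell
#!/bin/sh
# mv_registry_login: log in to ghcr.io/memverge without an interactive prompt.
# MV_GHCR_TOKEN is a hypothetical environment variable holding the token.
mv_registry_login() {
  if [ -z "${MV_GHCR_TOKEN:-}" ]; then
    echo "MV_GHCR_TOKEN is not set" >&2
    return 1
  fi
  # --password-stdin keeps the token off the helm command line.
  printf '%s' "$MV_GHCR_TOKEN" |
    helm registry login ghcr.io/memverge \
      --username mv-customer-support --password-stdin
}

# Usage:
#   export MV_GHCR_TOKEN='<your-personal-access-token>'
#   mv_registry_login
```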

Step 3: Create Image Pull Secret

Create the mvtco-system namespace, then create a Kubernetes Secret in it so that your cluster can pull images from the GitHub Container Registry. If the namespace already exists, kubectl create namespace reports an AlreadyExists error that you can safely ignore.

kubectl create namespace mvtco-system

kubectl create secret generic memverge-dockerconfig --namespace mvtco-system \
  --from-file=.dockerconfigjson=$HOME/.config/helm/registry/config.json \
  --type=kubernetes.io/dockerconfigjson

This command assumes that your Helm registry configuration is stored in the default location ($HOME/.config/helm/registry/config.json).
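
Because the Secret is built directly from that file, it can help to confirm that the file exists and contains a ghcr.io entry before creating the Secret. The sketch below makes two assumptions: has_registry_auth is a hypothetical helper that does a plain text match and does not validate the token, and the login from Step 2 is recorded under a key that starts with ghcr.io.

```shell
#!/bin/sh
# has_registry_auth FILE HOST: succeeds when FILE exists and mentions HOST.
# Hypothetical helper; it only does a text match and does not validate the token.
has_registry_auth() {
  [ -f "$1" ] && grep -q "\"$2" "$1"
}

# Default Helm registry config on Linux; HELM_REGISTRY_CONFIG overrides it.
config="${HELM_REGISTRY_CONFIG:-$HOME/.config/helm/registry/config.json}"

if has_registry_auth "$config" "ghcr.io"; then
  echo "Found ghcr.io credentials in $config"
else
  echo "No ghcr.io entry in $config; run 'helm registry login' first" >&2
fi
```

Running helm env prints the HELM_REGISTRY_CONFIG path that Helm actually uses, which is useful on systems where the default location differs.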

Step 4: Install Cert Manager (Optional)

Cert Manager is required if the MemVerge Transparent Checkpoint Operator needs to manage TLS certificates within your cluster. If cert-manager is already installed and configured, you can skip this step.

To install Cert Manager using Helm:

helm repo add jetstack https://charts.jetstack.io --force-update

helm install cert-manager jetstack/cert-manager --namespace cert-manager \
  --create-namespace --set crds.enabled=true

For alternative installation methods and more detailed configuration options, please refer to the official Cert Manager documentation.

Step 5: Install Nvidia's GPU Operator (Optional, for GPU Checkpointing)

If you intend to use the transparent checkpoint functionality for GPU-enabled workloads, you need to install the Nvidia GPU Operator with specific configurations to enable Kubernetes's native CDI (Container Device Interface) mode.

First, add the Nvidia Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update

Then, install the Nvidia GPU Operator with the necessary CDI configurations:

helm install --wait --generate-name -n gpu-operator --create-namespace \
    nvidia/gpu-operator --version v25.3.0 \
    --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
    --set-string toolkit.env[0].value=false \
    --set toolkit.env[1].name=CDI_ENABLED \
    --set-string toolkit.env[1].value=true \
    --set toolkit.env[2].name=NVIDIA_CONTAINER_RUNTIME_MODE \
    --set toolkit.env[2].value=cdi \
    --set toolkit.env[3].name=NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES \
    --set toolkit.env[3].value=cdi.k8s.io/ \
    --set toolkit.env[4].name=CRIO_CONFIG_MODE \
    --set toolkit.env[4].value=config \
    --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
    --set devicePlugin.env[0].value=cdi-annotations \
    --set devicePlugin.env[1].name=CDI_ANNOTATION_PREFIX \
    --set devicePlugin.env[1].value=cdi.k8s.io/ \
    --set devicePlugin.env[2].name=NVIDIA_CTK_PATH \
    --set devicePlugin.env[2].value=/usr/local/nvidia/toolkit/nvidia-ctk

If you are using Rancher Kubernetes Engine 2 (RKE2) or K3s, append the following additional configuration to the install command:

    --set toolkit.env[5].name=CONTAINERD_SOCKET \
    --set toolkit.env[5].value=/run/k3s/containerd/containerd.sock
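
Some shells, zsh in particular, treat the unquoted square brackets in toolkit.env[0] as a glob pattern and reject the command. One way around this is to express the same settings as a Helm values file. The following is a sketch of that equivalent form; the file name gpu-operator-values.yaml is an arbitrary choice, and the values are copied from the --set flags above, with the --set-string values quoted so that they remain strings.

```shell
#!/bin/sh
# Write the CDI settings from the --set flags above to a Helm values file.
# The file name gpu-operator-values.yaml is an arbitrary choice.
cat > gpu-operator-values.yaml <<'EOF'
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: CDI_ENABLED
      value: "true"
    - name: NVIDIA_CONTAINER_RUNTIME_MODE
      value: cdi
    - name: NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES
      value: cdi.k8s.io/
    - name: CRIO_CONFIG_MODE
      value: config
    # RKE2 or K3s only: uncomment the next two lines.
    # - name: CONTAINERD_SOCKET
    #   value: /run/k3s/containerd/containerd.sock
devicePlugin:
  env:
    - name: DEVICE_LIST_STRATEGY
      value: cdi-annotations
    - name: CDI_ANNOTATION_PREFIX
      value: cdi.k8s.io/
    - name: NVIDIA_CTK_PATH
      value: /usr/local/nvidia/toolkit/nvidia-ctk
EOF

# Then install with the values file instead of the individual --set flags:
#   helm install --wait --generate-name -n gpu-operator --create-namespace \
#     nvidia/gpu-operator --version v25.3.0 -f gpu-operator-values.yaml
```

A values file is also easier to keep under version control and to review than a long list of indexed flags.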

Step 6: Install MemVerge Transparent Checkpoint Operator

With the prerequisites met and the necessary components installed, you can now install the MemVerge Transparent Checkpoint Operator using Helm:

helm install --namespace mvtco-system mvtco oci://ghcr.io/memverge/charts/mvtco --version <version>

Replace <version> with the specific version of the MemVerge Transparent Checkpoint Operator you wish to install. Do not include the v prefix in the version number. If you omit the --version flag, the latest version will be installed.
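
If the version you were given is written as a release tag with a leading v, it can be normalized before it is passed to --version. This is a small sketch: chart_version is a hypothetical helper name, and the commented commands reuse the release name and namespace from this step.

```shell
#!/bin/sh
# chart_version TAG: print TAG without a leading "v", as --version expects.
chart_version() {
  printf '%s\n' "${1#v}"
}

# Example with a placeholder tag; substitute the release you were given.
#   helm install --namespace mvtco-system mvtco \
#     oci://ghcr.io/memverge/charts/mvtco --version "$(chart_version "<version>")"
#
# Afterwards, check the release and its Pods:
#   helm status mvtco --namespace mvtco-system
#   kubectl get pods --namespace mvtco-system
```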

Step 7: Uninstall MemVerge Transparent Checkpoint Operator

To uninstall the MemVerge Transparent Checkpoint Operator deployment, execute the following Helm command:

helm uninstall --namespace mvtco-system mvtco

This command will remove the operator's deployment but will leave the Custom Resource Definitions (CRDs) in your Kubernetes cluster.

To completely remove all MemVerge Transparent Checkpoint Operator resources, including the CRDs, run the following command:

kubectl delete crd engines.snapshot.memverge.ai
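
To confirm that nothing was left behind, the remaining CRDs can be filtered by API group. The sketch below assumes that any other operator CRDs, if present, also live in a group ending in memverge.ai; only engines.snapshot.memverge.ai is named in this guide. memverge_crds is a hypothetical helper name.

```shell
#!/bin/sh
# memverge_crds: read `kubectl get crd -o name` output on stdin and print
# only the CRD names in a memverge.ai API group. Hypothetical helper.
memverge_crds() {
  sed -n 's|^customresourcedefinition\.apiextensions\.k8s\.io/\(.*memverge\.ai\)$|\1|p'
}

# Usage: list anything left behind after the uninstall.
#   kubectl get crd -o name | memverge_crds
```

An empty result indicates that no CRDs in that group remain.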

Next Steps

Once the MemVerge Transparent Checkpoint Operator is successfully installed, you can begin leveraging its capabilities for your applications. Refer to the User Guide for detailed instructions on how to enable transparent checkpointing and restore for your Kubernetes workloads by applying specific labels to your pod specifications.

In the simplest scenario, enabling checkpointing for a pod involves adding the following label to its specification:

memverge.ai/checkpoint-mode: 'true'
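
To show where that label sits, here is a minimal sketch of a Pod manifest that carries it. The Pod name, container name, image, and command are placeholders chosen for illustration, and placing the label under metadata.labels is an assumption based on standard Kubernetes conventions; the User Guide remains the authoritative reference.

```shell
#!/bin/sh
# Write a minimal Pod manifest that carries the checkpoint label.
# The Pod name, container name, and image are placeholders for illustration.
cat > checkpoint-demo-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: checkpoint-demo
  labels:
    memverge.ai/checkpoint-mode: 'true'
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "i=0; while true; do i=$((i+1)); echo $i; sleep 1; done"]
EOF

# Apply it once the operator is running:
#   kubectl apply -f checkpoint-demo-pod.yaml
```

For Pods created by a Deployment or another controller, labels are set in the Pod template (spec.template.metadata.labels) rather than on the controller object itself.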

The User Guide also provides information on other available labels for customizing the checkpointing behavior, such as specifying containers to checkpoint, defining storage volumes, and configuring other advanced options.