MemVerge Transparent Checkpoint Operator Installation Guide

This guide provides a step-by-step procedure for Kubernetes administrators to install the MemVerge Transparent Checkpoint Operator. The operator provides automated snapshot and restore functionality for Kubernetes Pods. Designed for workloads that need high availability, fast recovery, and fault tolerance, it uses Kubernetes-native events to detect when Pods stop, fail, or are terminated, and automatically creates a snapshot of their state when such an event is detected. These snapshots are then used to restore the application when Pods are restarted manually or by the scheduler.

Prerequisites

Before proceeding with the installation, ensure your Kubernetes environment meets the following requirements:

  • Kubernetes Cluster: Access to a Kubernetes v1.28+ cluster with the cluster-admin role. Supported distributions include:
    • Vanilla Kubernetes
    • Rancher Kubernetes Engine 2 (RKE2)
    • K3s
  • Container Runtime Interface (CRI): Your Kubernetes nodes must be configured with one of the following supported CRI runtimes:
  • Storage Class: Your cluster must be configured with a StorageClass that supports dynamic provisioning of Persistent Volumes and the ability to move a Persistent Volume between different nodes. This is essential for the checkpoint functionality.
  • kubectl: Ensure you have kubectl version v1.28+ installed and configured to interact with your Kubernetes cluster.
  • Helm: Helm package manager version v3.14+ is required for installing the MemVerge Transparent Checkpoint Operator.

For a detailed list of system and software requirements, see Requirements.
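
The version minimums listed above can be checked before you start. The following is a minimal sketch, not part of the MemVerge tooling: version_ge is a hypothetical helper name, and it assumes a sort command that supports -V (version sort), as GNU coreutils does.

```shell
# version_ge ACTUAL MINIMUM
# Succeeds when ACTUAL is at least MINIMUM. A leading "v" is ignored.
# Hypothetical helper for illustration; relies on `sort -V`.
version_ge() {
  actual="${1#v}"
  minimum="${2#v}"
  # Version-sort the two values; MINIMUM must come out first (or be equal).
  lowest="$(printf '%s\n%s\n' "$minimum" "$actual" | sort -V | head -n 1)"
  [ "$lowest" = "$minimum" ]
}

# Example checks (run these where helm and kubectl are installed):
#   version_ge "$(helm version --template '{{.Version}}')" 3.14 \
#     || echo "Helm is older than v3.14" >&2
#   kubectl version            # compare client and server versions with v1.28
#   kubectl get storageclass   # confirm a StorageClass with a provisioner exists
```

A version sort is used instead of a plain string comparison because, as strings, 3.9 would sort after 3.14.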

Please Note: On Red Hat OpenShift, install the operator from the latest package available on Red Hat's Marketplace. That package walks you through the installation steps, and the rest of this guide does not apply.

Step 1: Acquire GitHub Token

To download the MemVerge Helm chart and container images, you need a personal access token from the mv-customer-support GitHub account. Please contact MemVerge Customer Support at support@memverge.com to obtain this token.

Step 2: Log in to GitHub Registry

Use the acquired personal access token to log in to the GitHub Container Registry (ghcr.io/memverge). Execute the following Helm command:

helm registry login ghcr.io/memverge
# Username: mv-customer-support
# Password: <your-personal-access-token>

Replace <your-personal-access-token> with the token you received from MemVerge Customer Support.
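
For scripted or CI installs, the interactive prompt can be avoided. The following is a minimal sketch, assuming the token has been exported in an environment variable. The variable name MV_GHCR_TOKEN and the function name registry_login are illustrative choices, not MemVerge conventions. It uses Helm's --username and --password-stdin flags so that the token is not passed as a command-line argument.

```shell
# registry_login: log in to ghcr.io/memverge without an interactive prompt.
# MV_GHCR_TOKEN is an assumed variable name holding the personal access token.
registry_login() {
  if [ -z "${MV_GHCR_TOKEN:-}" ]; then
    echo "MV_GHCR_TOKEN is not set" >&2
    return 1
  fi
  # Feed the token on stdin instead of placing it on the command line.
  printf '%s' "$MV_GHCR_TOKEN" |
    helm registry login ghcr.io/memverge \
      --username mv-customer-support \
      --password-stdin
}

# Example:
#   export MV_GHCR_TOKEN='<your-personal-access-token>'
#   registry_login
```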

Step 3: Create Image Pull Secret

Create a Kubernetes Secret in the mvtco-system namespace to allow your cluster to pull images from the GitHub Container Registry. The first command below creates the mvtco-system namespace; skip it if the namespace already exists.

kubectl create namespace mvtco-system

kubectl create secret generic memverge-dockerconfig --namespace mvtco-system \
  --from-file=.dockerconfigjson=$HOME/.config/helm/registry/config.json \
  --type=kubernetes.io/dockerconfigjson

This command assumes that your Helm registry configuration is stored in the default location ($HOME/.config/helm/registry/config.json).
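
If Helm stores its registry credentials elsewhere (for example, when the HELM_REGISTRY_CONFIG environment variable is set), the path passed to --from-file needs to change accordingly. The sketch below resolves the path and confirms that the file mentions ghcr.io before you create the Secret. Both function names are hypothetical, and the check is a plain text search rather than real JSON parsing.

```shell
# registry_config_path: print the registry config file Helm is expected to use.
# Honors HELM_REGISTRY_CONFIG, otherwise falls back to the default from this guide.
registry_config_path() {
  printf '%s\n' "${HELM_REGISTRY_CONFIG:-$HOME/.config/helm/registry/config.json}"
}

# has_ghcr_login FILE: succeed if FILE is readable and mentions ghcr.io.
has_ghcr_login() {
  [ -r "$1" ] && grep -q 'ghcr\.io' "$1"
}

# Example:
#   cfg="$(registry_config_path)"
#   has_ghcr_login "$cfg" || echo "No ghcr.io credentials in $cfg; repeat Step 2" >&2
```

Running helm env prints the HELM_REGISTRY_CONFIG value that Helm itself resolves, which is useful on systems where the default location differs.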

Step 4: Install Cert Manager (Optional)

Cert Manager is required if the MemVerge Transparent Checkpoint Operator needs to manage TLS certificates within your cluster. If cert-manager is already installed and configured, you can skip this step.

To install Cert Manager using Helm:

helm repo add jetstack https://charts.jetstack.io --force-update

helm install cert-manager jetstack/cert-manager --namespace cert-manager \
  --create-namespace --set crds.enabled=true

For alternative installation methods and more detailed configuration options, please refer to the official Cert Manager documentation.
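
Before moving on, it can help to confirm that the cert-manager Deployments have become available. The following is a minimal sketch built on kubectl wait. The function name and the 180-second default timeout are arbitrary choices for illustration.

```shell
# wait_for_deployments NAMESPACE [TIMEOUT]
# Blocks until every Deployment in NAMESPACE reports the Available condition,
# or fails once TIMEOUT (default 180s, an arbitrary value) has elapsed.
wait_for_deployments() {
  kubectl wait deployment --all \
    --namespace "$1" \
    --for=condition=Available \
    --timeout="${2:-180s}"
}

# Example:
#   wait_for_deployments cert-manager
```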

Step 5: Install Nvidia's GPU Operator (Optional, for GPU Checkpointing)

If you intend to use the transparent checkpoint functionality for GPU-enabled workloads, you need to install the Nvidia GPU Operator with specific configurations to enable Kubernetes's native CDI (Container Device Interface) mode.

First, add the Nvidia Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update

Then, install the Nvidia GPU Operator with the necessary CDI configurations:

helm install --wait --generate-name -n gpu-operator --create-namespace \
    nvidia/gpu-operator --version v25.3.0 \
    --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
    --set-string toolkit.env[0].value=false \
    --set toolkit.env[1].name=CDI_ENABLED \
    --set-string toolkit.env[1].value=true \
    --set toolkit.env[2].name=NVIDIA_CONTAINER_RUNTIME_MODE \
    --set toolkit.env[2].value=cdi \
    --set toolkit.env[3].name=NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES \
    --set toolkit.env[3].value=cdi.k8s.io/ \
    --set toolkit.env[4].name=CRIO_CONFIG_MODE \
    --set toolkit.env[4].value=config \
    --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
    --set devicePlugin.env[0].value=cdi-annotations \
    --set devicePlugin.env[1].name=CDI_ANNOTATION_PREFIX \
    --set devicePlugin.env[1].value=cdi.k8s.io/ \
    --set devicePlugin.env[2].name=NVIDIA_CTK_PATH \
    --set devicePlugin.env[2].value=/usr/local/nvidia/toolkit/nvidia-ctk

If you are using Rancher Kubernetes Engine 2 (RKE2) or K3s, append the following additional configuration to the install command:

    --set toolkit.env[5].name=CONTAINERD_SOCKET \
    --set toolkit.env[5].value=/run/k3s/containerd/containerd.sock

Step 6: Install MemVerge Transparent Checkpoint Operator

With the prerequisites met and the necessary components installed, you can now install the MemVerge Transparent Checkpoint Operator using Helm:

helm install --namespace mvtco-system mvtco oci://ghcr.io/memverge/charts/mvtco --version <version>

Replace <version> with the specific version of the MemVerge Transparent Checkpoint Operator you wish to install. Do not include the v prefix in the version number. If you omit the --version flag, the latest version will be installed.

Helm Chart Options:

| Option | Default Value | Description |
| --- | --- | --- |
| imagePullSecrets | ["name: memverge-dockerconfig"] | List of names of Secrets containing private registry credentials. |
| image.repository | ghcr.io/memverge/mvtco | Location of the Transparent Checkpoint Operator image. |
| image.tag | .Chart.appVersion | Tag of the Transparent Checkpoint Operator image. |
| engine.image.repository | ghcr.io/memverge/mmcloud-engine | Location of the MMCloud Engine image. |
| engine.image.tag | v3.5.0-mvtco-0.9.0 | Tag of the MMCloud Engine image. |
| engine.kubeletConfigFilePath | /var/lib/kubelet/config.yaml | Path of the kubelet configuration file. K3s and RKE2 do not have a separate kubelet configuration file, so set this to the distribution's own configuration file. |
| engine.runtimeRequestTimeout | 5m0s | Kubelet's runtime-request-timeout setting. It limits how long the runtime may take to respond to checkpoint and restore operations. Increase it for large workloads. |
| loki-stack.enabled | true | Whether to install loki-stack. |
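
The version rule and the chart options above can be combined on the command line with --set. The sketch below only prints the resulting command so that it can be reviewed before it is run. mvtco_install_cmd is a hypothetical helper, and 1.2.3 is a placeholder, not a real release number.

```shell
# mvtco_install_cmd VERSION [HELM_ARGS...]
# Prints the Step 6 install command, dropping a leading "v" from VERSION.
# Extra arguments are appended as-is (they are printed unquoted).
mvtco_install_cmd() {
  if [ "$#" -lt 1 ]; then
    echo "usage: mvtco_install_cmd VERSION [HELM_ARGS...]" >&2
    return 1
  fi
  version="${1#v}"
  shift
  printf '%s' "helm install --namespace mvtco-system mvtco oci://ghcr.io/memverge/charts/mvtco --version $version"
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

# Example: raise the runtime request timeout for large workloads.
# "v1.2.3" is a placeholder version.
mvtco_install_cmd v1.2.3 --set engine.runtimeRequestTimeout=10m0s
```

On K3s or RKE2, engine.kubeletConfigFilePath would be overridden the same way, using the path of the distribution's configuration file on your nodes.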

Step 7: Uninstall MemVerge Transparent Checkpoint Operator

To uninstall the MemVerge Transparent Checkpoint Operator deployment, execute the following Helm command:

helm uninstall --namespace mvtco-system mvtco

This command will remove the operator's deployment but will leave the Custom Resource Definitions (CRDs) in your Kubernetes cluster.

To completely remove all MemVerge Transparent Checkpoint Operator resources, including the CRDs, run the following command:

kubectl delete crd engines.snapshot.memverge.ai
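
To confirm that nothing is left behind, the cluster's CRDs can be filtered for the memverge.ai API group. This is a small sketch: filter_memverge_crds is a hypothetical helper, and it assumes that any CRD belonging to the operator ends in .memverge.ai, as the one named above does.

```shell
# filter_memverge_crds: read CRD names on stdin and print those whose
# name ends in ".memverge.ai". Prints nothing when none remain.
filter_memverge_crds() {
  grep '\.memverge\.ai$' || true
}

# Example:
#   kubectl get crd -o name | filter_memverge_crds
```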

Next Steps

Once the MemVerge Transparent Checkpoint Operator is successfully installed, you can begin leveraging its capabilities for your applications. Refer to the User Guide for detailed instructions on how to enable transparent checkpointing and restore for your Kubernetes workloads by applying specific labels to your pod specifications.

In the simplest scenario, enabling checkpointing for a pod involves adding the following label to its specification:

memverge.ai/checkpoint-mode: 'true'
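
As an illustration, the sketch below prints a minimal Pod manifest carrying that label. The function name, Pod name, container name, and image are placeholders chosen for this example. The label value is quoted because Kubernetes label values must be strings, and an unquoted true would be parsed by YAML as a boolean.

```shell
# checkpoint_pod_manifest NAME IMAGE
# Prints a minimal Pod manifest with the checkpoint label set.
# NAME and IMAGE are supplied by the caller; "app" is a placeholder container name.
checkpoint_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
  labels:
    memverge.ai/checkpoint-mode: 'true'
spec:
  containers:
    - name: app
      image: $2
EOF
}

# Example (hypothetical Pod name and image):
#   checkpoint_pod_manifest demo-app registry.example.com/demo:1.0 | kubectl apply -f -
```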

The User Guide also provides information on other available labels for customizing the checkpointing behavior, such as specifying containers to checkpoint, defining storage volumes, and configuring other advanced options.