MemVerge Transparent Checkpoint Operator User Guide

This guide explains how to use the MemVerge Transparent Checkpoint Operator to enable automatic checkpointing and restoration for your Kubernetes workloads. By applying specific labels to your Pod specifications, you can instruct the operator to manage the lifecycle of your application's running state.

Enabling Transparent Checkpointing

The MemVerge Transparent Checkpoint Operator is activated by adding specific labels to your Kubernetes Pod specifications. These labels can be applied directly in your YAML manifests or dynamically using kubectl.

Applying Labels in Pod Specifications

To enable checkpointing for a Pod, add the memverge.ai/checkpoint-mode: 'true' label to the metadata.labels section of the Pod specification, or to the spec.template.metadata.labels section of its workload controller (e.g., Deployment, StatefulSet, Job). Keep the value quoted: Kubernetes label values must be strings, and an unquoted true is parsed by YAML as a boolean.

Example: Enabling checkpointing for a Job

apiVersion: batch/v1
kind: Job
metadata:
  name: my-checkpointed-job
spec:
  template:
    metadata:
      labels:
        memverge.ai/checkpoint-mode: 'true'  # Enable checkpointing for pods created by this Job
        memverge.ai/checkpoint-volume-size: 2Gi # Optional: Specify checkpoint volume size
    spec:
      containers:
      - name: my-container
        image: my-image:latest
      restartPolicy: Never

When the Job creates a Pod, the memverge.ai/checkpoint-mode: 'true' label will instruct the operator to automatically checkpoint the pod's state when it is deleted (e.g., upon successful completion or failure). If the pod is recreated, the operator will automatically restore its state from the latest checkpoint.

Important Note for Workload Controllers: For controllers such as Deployments, StatefulSets, and Jobs, apply the MemVerge labels in the spec.template.metadata.labels section so that every Pod the controller creates carries them. Labels placed on the controller's own metadata.labels are not propagated to its Pods.

Applying Labels to Existing Pods using kubectl label

You can also add MemVerge labels to running Pods using the kubectl label command. This is useful for enabling checkpointing on workloads that are already running without modifying their original specifications.

Example: Enabling checkpointing for an existing Pod named my-running-pod

kubectl label pod my-running-pod memverge.ai/checkpoint-mode=true

To specify a checkpoint volume size for the same pod:

kubectl label pod my-running-pod memverge.ai/checkpoint-volume-size=2Gi

Note: Labels applied using kubectl label are live changes to the Pod object only. If the Pod is managed by a controller, the label is not part of the controller's Pod template, so any replacement Pod the controller creates (for example, after a rollout, eviction, or deletion) will not carry it. For persistent label changes in managed Pods, update the controller's Pod template instead.
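One non-interactive way to make the label persistent is to patch the controller's Pod template. The sketch below assumes the controller is a Deployment; the function name enable_checkpoint_in_template and the Deployment name my-deployment are illustrative and not part of the operator.

```shell
# Add the checkpoint label to a Deployment's Pod template with a JSON merge patch.
# Changing the Pod template starts a rolling update, so replacement Pods carry the label.
enable_checkpoint_in_template() {
  # $1: Deployment name (hypothetical, e.g. my-deployment)
  kubectl patch deployment "$1" --type merge \
    -p '{"spec":{"template":{"metadata":{"labels":{"memverge.ai/checkpoint-mode":"true"}}}}}'
}

# Usage (requires access to a cluster):
#   enable_checkpoint_in_template my-deployment
```

The value is written as the string "true" in the patch for the same reason it is quoted in YAML: label values must be strings.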

Applying Labels to All Pods in a Namespace

You can apply a label to all existing Pods within a specific namespace using kubectl label with the --all flag and the -n (namespace) option.

Example: Enabling checkpointing for all Pods in the default namespace

kubectl label pods --all -n default memverge.ai/checkpoint-mode=true --overwrite

Caution: Applying labels to all Pods in a namespace can have unintended consequences if not done carefully. Ensure you understand the impact on all applications running in that namespace before executing such a command.
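Before making a namespace-wide change, it can help to preview it. The following sketch lists the Pods that would be affected and then runs the label command as a client-side dry run, which reports the result without modifying anything. The function name preview_label_all_pods is illustrative.

```shell
# Preview a namespace-wide label change without modifying any Pod.
preview_label_all_pods() {
  # $1: namespace (e.g. default)
  # List the Pods that would be affected, together with their current labels.
  kubectl get pods -n "$1" --show-labels
  # Client-side dry run: reports what would be labeled but changes nothing.
  kubectl label pods --all -n "$1" memverge.ai/checkpoint-mode=true \
    --overwrite --dry-run=client
}

# Usage (requires access to a cluster):
#   preview_label_all_pods default
```

Dropping --dry-run=client from the second command applies the change for real.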

Applying Labels to Pods Based on Existing Selectors

You can target a specific set of Pods based on their existing labels using a selector with kubectl label.

Example: Enabling checkpointing for all Pods with the label app=my-app

kubectl label pods -l app=my-app memverge.ai/checkpoint-mode=true
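To confirm which Pods in the selected set now carry the label, the selector can be reused with kubectl get. This is a small sketch; the function name show_checkpoint_mode is illustrative.

```shell
# Show the checkpoint-mode label value for every Pod matching a selector.
show_checkpoint_mode() {
  # $1: label selector (e.g. app=my-app)
  # -L adds a column with each Pod's value for the given label key.
  kubectl get pods -l "$1" -L memverge.ai/checkpoint-mode
}

# Usage (requires access to a cluster):
#   show_checkpoint_mode app=my-app
```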

Setting Default Labels for Future Pods in a Namespace (using Mutating Admission Webhooks)

While not a direct kubectl command, you can configure Mutating Admission Webhooks (if your Kubernetes cluster supports them) to automatically add MemVerge labels to newly created Pods within a specific namespace. This approach ensures that all future Pods in that namespace will have checkpointing enabled by default. The configuration of such webhooks is beyond the scope of this basic user guide but is a powerful way to enforce checkpointing policies.

Removing Checkpointing Labels

To disable transparent checkpointing for a Pod, you can remove the MemVerge-related labels.

Removing Labels from Specific Pods using kubectl label

To remove a label, use the kubectl label command with a hyphen (-) appended to the label key. The --overwrite flag is not needed for removal.

Example: Disabling checkpointing for a Pod named my-checkpointed-pod

kubectl label pod my-checkpointed-pod memverge.ai/checkpoint-mode-

To remove the checkpoint volume size label as well:

kubectl label pod my-checkpointed-pod memverge.ai/checkpoint-volume-size-
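kubectl label accepts several label arguments in one invocation, so both labels can also be removed with a single command. A minimal sketch, with an illustrative function name:

```shell
# Remove both MemVerge labels used in this guide from one Pod in a single call.
remove_checkpoint_labels() {
  # $1: Pod name (e.g. my-checkpointed-pod)
  kubectl label pod "$1" \
    memverge.ai/checkpoint-mode- \
    memverge.ai/checkpoint-volume-size-
}

# Usage (requires access to a cluster):
#   remove_checkpoint_labels my-checkpointed-pod
```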

Removing Labels from Workload Controller Templates

To permanently disable checkpointing for Pods managed by a controller, remove the MemVerge labels from the spec.template.metadata.labels section of the controller's specification and apply the updated specification. Existing Pods retain the label until the controller replaces them.

Example: Disabling checkpointing in a Deployment

  1. Edit the Deployment:

    kubectl edit deployment my-deployment
    
  2. Remove the memverge.ai/checkpoint-mode and any other MemVerge labels from the template.metadata.labels section.

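  3. Save and close the editor. Because the Pod template changed, the Deployment starts a rolling update and replaces existing Pods with Pods that do not have the checkpointing labels. For controllers that do not replace Pods automatically (for example, a StatefulSet using the OnDelete update strategy), delete the existing Pods manually for the change to take effect on all instances.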
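The same change can be made without an interactive editor. The sketch below assumes a Deployment and uses a JSON merge patch, in which setting a key to null removes it; the function name disable_checkpoint_in_template is illustrative.

```shell
# Remove the checkpoint label from a Deployment's Pod template, then wait for the rollout.
disable_checkpoint_in_template() {
  # $1: Deployment name (hypothetical, e.g. my-deployment)
  # In a JSON merge patch, a null value deletes the key.
  kubectl patch deployment "$1" --type merge \
    -p '{"spec":{"template":{"metadata":{"labels":{"memverge.ai/checkpoint-mode":null}}}}}' || return 1
  # Block until the rolling update triggered by the template change completes.
  kubectl rollout status "deployment/$1"
}

# Usage (requires access to a cluster):
#   disable_checkpoint_in_template my-deployment
```

Other MemVerge labels in the template can be removed the same way by adding them to the patch with a null value.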

Removing Labels from All Pods in a Namespace

Similar to adding labels, you can remove a label from all Pods in a namespace using kubectl label with the --all flag and the label name followed by a hyphen.

Example: Disabling checkpointing for all Pods in the mynamespace namespace

kubectl label pods --all -n mynamespace memverge.ai/checkpoint-mode-

Caution: Exercise caution when removing labels from all Pods in a namespace, as it will affect all applications running there.

Removing Labels from Pods Based on Selectors

You can remove labels from a specific set of Pods based on their existing labels.

Example: Disabling checkpointing for all Pods with the label app=legacy-app

kubectl label pods -l app=legacy-app memverge.ai/checkpoint-mode-
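To check that the removal reached every Pod in the set, the original selector can be combined with a bare label key, which matches Pods where that label exists with any value. A sketch with an illustrative function name; empty output means no matching Pod still carries the label.

```shell
# List Pods matching a selector that still have the checkpoint-mode label.
list_pods_still_labeled() {
  # $1: label selector (e.g. app=legacy-app)
  # A bare key in a selector matches Pods on which that label is present.
  kubectl get pods -l "$1,memverge.ai/checkpoint-mode" -o name
}

# Usage (requires access to a cluster):
#   list_pods_still_labeled app=legacy-app
```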

Complete List of Labels

The following table describes the labels supported by the MemVerge Transparent Checkpoint Operator:

| Label | Description |
| --- | --- |
| memverge.ai/checkpoint-mode | Set to true to enable the MemVerge transparent checkpoint/restore service. |
| memverge.ai/checkpoint-containers | Comma-delimited list of container names to be checkpointed. If not set, all containers except istio-proxy and nginx-proxy are checkpointed. |
| memverge.ai/checkpoint-storage-volume | An existing volume in the pod used for checkpoint storage; the user must manage the lifecycle of this volume. If not set, a dynamically provisioned PV is used for checkpoint storage. The lifecycle of that PV is controlled by the operator, which requires that the pod has a controller. This option is therefore required for plain pods (pods with no workload controller such as a StatefulSet, Job, or CronJob). |
| memverge.ai/checkpoint-storage-class | The StorageClass name used to dynamically provision the Persistent Volume for checkpoint storage. If not set, the default StorageClass is used. Ignored if memverge.ai/checkpoint-storage-volume is set. |
| memverge.ai/checkpoint-volume-size | The size of the Persistent Volume for checkpoint storage. If not set, it is computed as the sum of the memory limits of all containers in the pod. Ignored if memverge.ai/checkpoint-storage-volume is set. |
| memverge.ai/checkpoint-files | Comma-delimited list of files/directories to be checkpointed. |
| memverge.ai/irmap-scan-paths | Comma-delimited list of paths for irmap scan. |
| memverge.ai/checkpoint-diagnosis | Set to true to preserve checkpoint images and logs for diagnostic purposes. |

By understanding and applying these labels, you can effectively manage the checkpointing behavior of your Kubernetes applications using the MemVerge Transparent Checkpoint Operator. Remember to consult the operator's logs and Kubernetes events for detailed information.