Use Cases

Job Suspension & Resumption

It is often desirable to suspend a Job when cluster resources are limited and a higher-priority Job needs to run in its place. The lower-priority Job can be suspended with a checkpoint and later resumed from that checkpoint without losing its progress.

  • Preparation: See Enabling Transparent Checkpointing for the Job.

  • Checkpoint: Checkpointing is automatically triggered when the Job is suspended. Suspend the Job with the following command (a quick way to verify the suspension is shown after this list):

    kubectl patch job tf-mnist -p '{"spec":{"suspend":true}}'

  • Restore: Restoring is automatically triggered when the Job is resumed. Resume the Job with the following command:

    kubectl patch job tf-mnist -p '{"spec":{"suspend":false}}'
    
    The restored Job picks up its progress from the point at which it was suspended (checkpointed).
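
To confirm that the suspension, and hence the checkpoint, took effect, you can inspect the Job's suspend field and its pods with standard kubectl commands (the Job name tf-mnist matches the example above):

# Prints "true" while the Job is suspended
kubectl get job tf-mnist -o jsonpath='{.spec.suspend}'

# The Job's pods are terminated while it is suspended and recreated
# (restored from the checkpoint) once it is resumed
kubectl get pods -l job-name=tf-mnist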

Node Maintenance

When a node is drained for maintenance, a Job's pod is evicted and rescheduled to another available node. If checkpointing has been enabled for the Jobs, their pods are automatically checkpointed when they are evicted and automatically restored, without losing their progress, when they are rescheduled to another node. The cluster admin does not need to manually trigger a checkpoint or restore for the pods; running the node drain command is sufficient:

kubectl drain <node-name>
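
In practice, kubectl drain refuses to proceed when the node runs DaemonSet-managed pods or pods using emptyDir volumes. The flags below are standard kubectl options, shown here as a typical invocation rather than a requirement of the operator:

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data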

Plain Pod

MemVerge Transparent Checkpoint Operator also supports plain pods (pods that are not owned by a workload controller such as a Deployment, StatefulSet, or Job), with two main limitations:

  • For plain pods, MemVerge Transparent Checkpoint Operator does not automatically create a Persistent Volume Claim (PVC) for checkpointing. Instead, the user must create the PVC manually, associate it with the pod, and pass it to the operator by setting the memverge.ai/checkpoint-storage-volume label to the name of the pod volume that is backed by the PVC.

  • MemVerge Transparent Checkpoint Operator does not introduce any rescheduling capabilities. Any plain pod that gets deleted must therefore be rescheduled by the user, i.e., by recreating the pod with the same name (see the sketch after the example below).

Example: Enabling checkpointing for a Plain Pod

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpointing-volume
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: <storage-class-name>
  resources:
    requests:
      storage: 2Gi  # Adjust the size to your needs
---
apiVersion: v1
kind: Pod
metadata:
  name: my-checkpointed-pod
  labels:
    memverge.ai/checkpoint-mode: 'true'  # Enable checkpointing
    memverge.ai/checkpoint-storage-volume: 'checkpointing'  # Specify volume for checkpoint storage
spec:
  containers:
  - name: my-container
    image: my-image:latest
  volumes:
  - name: checkpointing
    persistentVolumeClaim:
      claimName: checkpointing-volume
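
Because the operator does not reschedule plain pods, a deleted pod must be recreated by the user under the same name. A minimal sketch, assuming the manifests above were saved to a file named my-checkpointed-pod.yaml (a hypothetical filename):

# Recreate the plain pod after it was deleted (e.g., by a node drain).
# Reusing the same pod name allows the operator to restore it from the
# checkpoint stored on the associated PVC.
kubectl apply -f my-checkpointed-pod.yaml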