Use Cases
If checkpointing is enabled for a container, a checkpoint is triggered automatically when the container is removed, and a restore is triggered automatically when the container is recreated. Events that trigger a checkpoint include, but are not limited to:
- Pod deletion
- Job suspension (which leads to pod deletion)
- Node drain (which leads to pod deletion)

MemVerge Transparent Checkpoint Operator supports the following Kubernetes native workloads:
- Job: single-pod (indexed or non-indexed) or multi-pod indexed (multi-pod non-indexed Jobs are not supported)
- StatefulSet
- Deployment: single-pod only (multi-pod Deployments are not supported)
Job Suspension & Resumption
Suspending a Job is often useful when cluster resources are limited and a higher-priority Job needs to run in its place. The lower-priority Job can be suspended with a checkpoint and later resumed from that checkpoint without losing its progress.
- Preparation: See Enabling Transparent Checkpointing for the Job.
- Checkpoint: Checkpointing is automatically triggered when the Job is suspended.
- Restore: Restoring is automatically triggered when the Job is resumed. The restored Job picks up its progress from the point at which it was suspended (checkpointed).
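In standard Kubernetes, a Job is suspended and resumed by toggling its spec.suspend field. The following is a minimal sketch of that, assuming kubectl is configured for the cluster. The function names are illustrative and are not part of the operator.

```shell
# Sketch only: the Job name is supplied by the caller.

# Suspend a Job by setting .spec.suspend to true. The Job controller then
# deletes the Job's pods, which is the event that triggers the checkpoint.
suspend_job() {
  kubectl patch "job/$1" --type=strategic --patch '{"spec":{"suspend":true}}'
}

# Resume the Job by setting .spec.suspend back to false. The pods are
# recreated, which is the event that triggers the restore.
resume_job() {
  kubectl patch "job/$1" --type=strategic --patch '{"spec":{"suspend":false}}'
}
```

For example, run suspend_job my-job to free resources and resume_job my-job later, where my-job is a hypothetical Job name.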
Node Maintenance
When a node is drained for maintenance, a Job's pods are evicted and rescheduled to another available node. If checkpointing is enabled for the Job, its pods are automatically checkpointed when they are evicted and automatically restored, without losing their progress, when they are rescheduled. The cluster admin does not need to trigger the checkpoint or restore manually; draining the node is the only action required.
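A node drain is typically issued with kubectl. The following is a minimal sketch, assuming kubectl is configured for the cluster. The function names are illustrative, and the --ignore-daemonsets flag is a common choice rather than something the operator requires.

```shell
# Sketch only: the node name is supplied by the caller.

# Drain a node for maintenance. Evicting the pods is the event that
# triggers the checkpoint; --ignore-daemonsets lets the drain proceed
# when DaemonSet-managed pods are present on the node.
drain_node() {
  kubectl drain "$1" --ignore-daemonsets
}

# Mark the node schedulable again once maintenance is finished.
uncordon_node() {
  kubectl uncordon "$1"
}
```

For example, run drain_node worker-1 before maintenance and uncordon_node worker-1 afterwards, where worker-1 is a hypothetical node name.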
Plain Pod
MemVerge Transparent Checkpoint Operator also supports plain pods (pods that are not owned by a workload controller such as Deployment, StatefulSet, or Job), with two main limitations:
- For plain pods, MemVerge Transparent Checkpoint Operator does not automatically create a Persistent Volume Claim (PVC) for checkpointing. Instead, the user must manually create the PVC, associate it with the pod, and pass the PVC name to the MemVerge Transparent Checkpoint Operator. The label memverge.ai/checkpoint-storage-volume is used to associate the volume.
- MemVerge Transparent Checkpoint Operator does not introduce any rescheduling capabilities. Any plain pod that gets deleted must therefore be recreated by the user under the same name.
Example: Enabling checkpointing for a Plain Pod
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpointing-volume
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: <storage-class-name>
  resources:
    requests:
      storage: 2Gi # Adjust the size to your needs
---
apiVersion: v1
kind: Pod
metadata:
  name: my-checkpointed-pod
  labels:
    memverge.ai/checkpoint-mode: 'true' # Enable checkpointing
    memverge.ai/checkpoint-storage-volume: 'checkpointing' # Specify volume for checkpoint storage
spec:
  containers:
    - name: my-container
      image: my-image:latest
  volumes:
    - name: checkpointing
      persistentVolumeClaim:
        claimName: checkpointing-volume
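Because plain pods are not rescheduled automatically, the user has to delete and recreate the pod by hand. The following is a minimal sketch of that, assuming the manifest above is saved in a file (the file name plain-pod.yaml used below is hypothetical) and kubectl is configured for the cluster. The function names are illustrative.

```shell
# Sketch only: pod name and manifest path are supplied by the caller.

# Create the PVC and the pod from the manifest file.
apply_manifest() {
  kubectl apply -f "$1"
}

# Delete the plain pod (the deletion triggers the checkpoint), then
# recreate it under the same name from the same manifest (the recreation
# triggers the restore). The apply step is skipped if the delete fails.
recreate_plain_pod() {
  pod_name="$1"
  manifest="$2"
  kubectl delete pod "$pod_name" && kubectl apply -f "$manifest"
}
```

For example, run apply_manifest plain-pod.yaml once to create the PVC and pod, then recreate_plain_pod my-checkpointed-pod plain-pod.yaml whenever the pod needs to be moved or restarted.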