Release Notes

Release Version: v1.0.0

Release Date: July 2025

Introduction

This release of the MemVerge Transparent Checkpoint Operator introduces Multiple GPU Checkpointing. This feature provides checkpoint and restore capabilities for applications that use more than one GPU, but not all GPUs in the system.

Key Highlights

  • Improved GPU Restore Process: When a job with GPUs is restored, the system now automatically creates new, clean GPU device files to ensure a successful restart. This prevents old or corrupted device files from causing restore failures.
  • More Reliable Checkpointing: We've increased the default alarm timeout for the CUDA plugin from 10 seconds to 30 seconds. This gives the system more time to pause all GPU threads, making the checkpointing process more robust and preventing failures that were previously caused by timeouts.
  • Reduced Unnecessary Error Messages: You will see fewer error messages related to the CUDA plugin when it can't find GPU usage on a thread. This cleans up the logs and makes it easier to spot actual issues.

Bug Fixes

  • None in this release

Enhancements

  • Added Multiple GPU Checkpointing, which can be triggered by MemVerge AI (GPU Cluster Manager) or by automatic node deletion and pod draining.

Deprecations

  • None in this release

Installation Instructions

  • Refer to the Installation Guide for detailed instructions.
  • Please Note: To use Red Hat OpenShift, download the latest version of Transparent Checkpoint Operator from the Red Hat Marketplace, which walks you through the required installation steps.

Known Issues

User Guide

  • Refer to the User Guide for detailed information.

Support

Additional Notes

  • None in this release