Latest Release

A brief introduction to MMBatch, followed by what's new in the latest release.

Overview

Memory Machine Batch (MMBatch) captures the entire running state of a Batch Job into a consistent image and restores the Job on a new Compute Instance without losing any work progress. This ensures a high quality of service at the Batch level even on low-cost but unreliable Spot-based Compute Instances. For more details, visit the MMBatch website.

New in the 1.4 Release

The MMBatch 1.4 release adds new features, improves the user experience and enhances telemetry.

MemVerge GPU SpotSurfing

  • GPU SpotSurfing is a checkpointing and recovery feature designed to seamlessly manage Spot terminations on GPU instances. By integrating checkpointing with cloud automation, this service enables stateful workloads to run safely on GPU Spot instances, reducing compute costs by up to 70%, all without requiring any changes to your application code.

NVIDIA GPU Checkpoint & Restore Support

  • New support for checkpointing and restoring GPU-based workflows, such as:
    • parabricks-fq2bam
    • parabricks-germline

Job List View in Application Window

  • New job list view that shows job status, maximum root filesystem usage of the container, and detailed job events.

Downloadable Worker Node CSV

  • Users can now download a CSV file containing all finished worker node details.

Checkpoint Cleanup Improvements

  • When not in diagnostic mode, checkpoints are now automatically cleaned up if the restore succeeds or the job finishes.

Known Limitations

Cross Availability Zone Data Migration: If the batch queue is bound to a multi-availability zone compute environment, EBS volume data migration is required when a job restarts in a different zone. Migration time can range from minutes to hours, depending on AWS load.

EBS Volume Detachment Delays on AWS Instance Termination: When AWS terminates an instance, detaching its associated EBS volume may take longer than 5 minutes. This delay is due to AWS's instance termination behavior, which can prevent the instance from flushing in-memory data to the EBS volume, clearing file system caches, or cleanly unmounting the volume.

Leftover Volumes in Cross-Zone Restarts: If a restarted job runs before the old node is reclaimed, the system may fail to remove the old EBS volume, leaving an orphaned volume in the user’s environment.
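
To spot such orphans, one option is to list unattached EBS volumes in the affected region. Below is a minimal sketch using boto3; the region name is an assumption, and you may want to narrow the results with tag filters specific to your environment.

    import boto3

    # Illustrative sketch: list unattached ("available") EBS volumes that may be
    # orphans left behind after a cross-zone restart. The region is an assumption.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
        for vol in page["Volumes"]:
            print(vol["VolumeId"], vol["AvailabilityZone"], vol["Size"], "GiB")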

Spot reclaim protection is not supported during restore: If a Spot reclaim happens during a restore, MMBatch retries the restore on a new instance.

Managed EBS needs to be enabled: Managed EBS is required for file consistency when saving files under root on Amazon Linux 2 with JuiceFS used for the scratch and checkpoint directories.

MMBatch checkpoint does not support file-backed memory: File-backed memory corresponds to quadrants 3 and 4 in the Linux top command's memory classification. An example is data saved to files under /dev/shm, a temporary file storage filesystem (tmpfs) that uses memory as its backing store.
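
To make the distinction concrete, here is a plain-Python sketch (not MMBatch code) contrasting an anonymous mapping with a file-backed mapping created from a file under /dev/shm; the latter is the kind of memory this limitation applies to, and the file name is hypothetical.

    import mmap
    import os

    # Anonymous mapping: private pages with no backing file (quadrants 1/2 in top).
    anon = mmap.mmap(-1, 4096)
    anon[:5] = b"hello"

    # File-backed mapping: pages backed by a file on /dev/shm, a tmpfs whose
    # backing store is RAM (quadrants 3/4 in top). MMBatch checkpoints do not
    # capture this kind of memory.
    path = "/dev/shm/mmbatch_example.bin"   # hypothetical file name
    with open(path, "wb") as f:
        f.truncate(4096)
    fd = os.open(path, os.O_RDWR)
    file_backed = mmap.mmap(fd, 4096)
    file_backed[:5] = b"hello"

    file_backed.close()
    os.close(fd)
    os.remove(path)
    anon.close()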

AWS S3 Throughput Limitations with Concurrent Access: When using AWS S3 as a storage backend, particularly with multiple file systems accessing the same bucket concurrently (e.g., JuiceFS used as a scratch directory in Nextflow pipelines), you may encounter S3 throughput limits. This can manifest as [Errno 5] Input/output error messages in Nextflow logs. For best practices and mitigation strategies, refer to the AWS documentation on Optimizing Amazon S3 Performance.
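
In addition to the S3 tuning guidance, transient [Errno 5] failures can sometimes be absorbed at the application level. The following retry wrapper is an illustrative sketch, not a documented MMBatch or Nextflow feature; the function name, attempt count, and backoff values are assumptions.

    import errno
    import time

    def read_with_retry(path, attempts=5, base_delay=2.0):
        """Retry reads that fail with EIO ([Errno 5]), which can surface when an
        S3-backed file system such as JuiceFS hits bucket throughput limits."""
        for attempt in range(attempts):
            try:
                with open(path, "rb") as f:
                    return f.read()
            except OSError as exc:
                if exc.errno != errno.EIO or attempt == attempts - 1:
                    raise
                time.sleep(base_delay * (attempt + 1))  # linear backoff, then retry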

Pre-dump occurs even if MMC_CHECKPOINT_INTERVAL is set to 0: Setting MMC_CHECKPOINT_INTERVAL to 0 does not disable pre-dump. The only way to effectively disable it is to set MMC_CHECKPOINT_INTERVAL to a value longer than the job's expected runtime. If MMC_CHECKPOINT_INTERVAL is not set, the default is 15 minutes.
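
As an illustration of these semantics (not MMBatch source code), resolving the interval might look like the sketch below; treating the value as minutes is an assumption based on the documented 15-minute default.

    import os

    DEFAULT_INTERVAL_MINUTES = 15  # documented default when the variable is unset

    raw = os.environ.get("MMC_CHECKPOINT_INTERVAL")
    # A value of "0" still yields an interval of 0 here, and pre-dump still occurs.
    interval_minutes = int(raw) if raw else DEFAULT_INTERVAL_MINUTES

    # To effectively disable periodic checkpoints, set the interval larger than
    # the job's expected runtime, e.g. MMC_CHECKPOINT_INTERVAL=100000 (illustrative).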

'Total Run Time' differs between the AWS Batch engine and the MMBatch engine: When MMBatch performs its final checkpoint, it first pauses the job's container; the container is terminated only after checkpointing completes. Because AWS Batch calculates a job's end time from container termination, the reported run time includes the time MMBatch spent checkpointing after pausing the container. The resulting discrepancy is at most about 2 minutes (the maximum time before the EC2 Spot instance is reclaimed), which is negligible relative to a job's total runtime.
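
A worked illustration of the arithmetic, with hypothetical numbers:

    # A container does 60 min of work; MMBatch then pauses it and spends 90 s on
    # the final checkpoint before the container terminates.
    work_minutes = 60.0
    final_checkpoint_minutes = 90 / 60                            # 1.5 min

    aws_batch_runtime = work_minutes + final_checkpoint_minutes   # 61.5 min reported by AWS Batch
    mmbatch_runtime = work_minutes                                # 60.0 min of actual work
    print(aws_batch_runtime - mmbatch_runtime)                    # 1.5 min discrepancy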