
Latest Release

A brief introduction to MMBatch followed by what's new in the latest release.

Overview

Memory Machine Batch (MMBatch) captures the entire running state of a Batch Job into a consistent image and restores the Job on a new Compute Instance without losing any work progress. This lets Batch workloads maintain a high quality of service while running on low-cost but interruptible Spot-based Compute Instances. For more details, visit the MMBatch website.

New in the 1.3 Release

The MMBatch 1.3 release adds new features, improves the user experience, and enhances telemetry.

  • Managed EBS Support

    • Enables running Cromwell and MiniWDL workflows with large local root filesystems.
    • Adds configurations for specifying the type and size of managed EBS volumes.
    • Allows users to add custom tags to managed EBS volumes (a tag-query sketch follows this feature list).
  • New Configuration Options

    • Adds a “Close TCP Connection” enable/disable option to support applications that require persistent TCP connections.
    • Introduces a configuration dialog in the GUI for easy setup.
  • Enables users to query system summary metrics within a selected time range from the GUI.
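
The sketch below shows one way to verify the custom tags mentioned above from the AWS side. It is a minimal illustration using boto3, not part of MMBatch itself; the tag key mmbatch:project, its value, and the region are assumptions for the example, and the real tags are whatever you configure.

```python
# Minimal sketch, assuming boto3 is installed and AWS credentials are configured.
# Lists EBS volumes carrying a user-defined tag. The tag key "mmbatch:project"
# and its value are hypothetical examples of a custom tag on managed volumes.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

paginator = ec2.get_paginator("describe_volumes")
pages = paginator.paginate(
    Filters=[{"Name": "tag:mmbatch:project", "Values": ["genomics-demo"]}]
)

for page in pages:
    for vol in page["Volumes"]:
        print(vol["VolumeId"], vol["VolumeType"], vol["Size"], vol["AvailabilityZone"])
```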

Known Limitations

  • Cross Availability Zone Data Migration

    • If the batch queue is bound to a multi-availability zone compute environment, EBS volume data migration is required when a job restarts in a different zone.
    • Migration time can range from minutes to hours, depending on AWS load.
  • Leftover Volumes in Cross-Zone Restarts

    • If a restarted job runs before the old node is reclaimed, the system may fail to remove the old EBS volume, leaving an orphaned volume in the user’s environment. A detection sketch appears at the end of this section.
  • Spot reclaim protection is not supported during restore. If a Spot reclaim occurs while a job is being restored, the job re-runs from the beginning.

  • When running on Amazon Linux 2 with JuiceFS as the scratch and checkpoint directory, Managed EBS must be enabled so that files under the root filesystem are saved, preserving file consistency.

  • The MMBatch checkpoint does not support file-backed memory (quadrants #3 and #4 in the Linux top command's memory display). For example, data written to files under /dev/shm, a temporary file storage filesystem that uses memory as its backing store, is file-backed memory. A sketch for checking whether a path is memory-backed appears at the end of this section.

  • When using AWS S3 for storage, throughput can be limited when multiple file systems access the same bucket. For example, when JuiceFS over AWS S3 is used as the scratch directory in a Nextflow pipeline and nextflow.log shows [Errno 5] Input/output error on the Nextflow command, the cause may be an AWS S3 throughput limitation; a log-scan sketch at the end of this section shows how to check for this symptom. See the AWS guide Best practices design patterns: optimizing Amazon S3 performance.
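
For the leftover-volume limitation above, a quick way to spot orphaned volumes is to list EBS volumes in the available state, i.e. volumes not attached to any instance. The following is a minimal boto3 sketch, not an MMBatch tool; deleting anything it finds is left to the operator after confirming the volume is truly orphaned.

```python
# Minimal sketch, assuming boto3 is installed and AWS credentials are configured.
# An orphaned volume left behind by a cross-zone restart shows up in the
# "available" state because it is no longer attached to any instance.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)

for vol in resp["Volumes"]:
    tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
    print(vol["VolumeId"], vol["AvailabilityZone"], vol["Size"], tags)
    # Only after confirming the volume is orphaned:
    # ec2.delete_volume(VolumeId=vol["VolumeId"])
```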
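
A related check for the file-backed memory limitation: the sketch below reports whether a path sits on a tmpfs mount such as /dev/shm, using only the Python standard library and /proc/mounts. It is a minimal illustration that assumes the job runs on Linux.

```python
# Minimal sketch: report whether a path lives on a tmpfs mount such as
# /dev/shm. Data written there is file-backed memory, which the MMBatch
# checkpoint does not capture.
import os

def mount_point_of(path: str) -> str:
    """Walk up from path until the filesystem mount boundary is found."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path

def is_tmpfs(path: str) -> bool:
    mount = mount_point_of(path)
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _device, mnt, fstype = line.split()[:3]
            if mnt == mount:
                return fstype == "tmpfs"
    return False

print(is_tmpfs("/dev/shm"))  # typically True on most Linux distributions
print(is_tmpfs("/tmp"))      # depends on the distribution
```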
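
For the S3 throughput limitation, the symptom described above can be confirmed by scanning nextflow.log for the I/O error string. The sketch below does only that; the log path is an assumption, and a match is a hint rather than a diagnosis.

```python
# Minimal sketch: flag lines in nextflow.log containing the
# "[Errno 5] Input/output error" symptom described above.
# The log path is an assumption; point it at the pipeline's actual log.
from pathlib import Path

log_path = Path("nextflow.log")
symptom = "[Errno 5] Input/output error"

lines = log_path.read_text(errors="replace").splitlines()
matches = [(n, line.rstrip()) for n, line in enumerate(lines, start=1) if symptom in line]

for n, line in matches:
    print(f"{n}: {line}")

if matches:
    print("Possible AWS S3 throughput limitation; see the S3 best-practices guide.")
```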