Latest Release
A brief introduction to MMBatch, followed by what's new in the latest release.
Overview
Memory Machine Batch (MMBatch) captures the entire running state of a Batch Job into a consistent image and restores the Job on a new Compute Instance without losing any work progress. This allows MMBatch to ensure a high quality of service at the Batch level while using low-cost but unreliable Spot-based Compute Instances. For more details, visit the MMBatch website.
New in the 1.3 Release
The MMBatch 1.3 release adds new features, improves the user experience, and enhances telemetry.
- Managed EBS Support
  - Enables running Cromwell and MiniWDL workflows with large local root filesystems.
  - Adds configuration options for specifying the type and size of managed EBS volumes (see the sketch after this list).
  - Allows users to add custom tags to managed EBS volumes.
- New Configuration Options
  - Adds a “Close TCP Connection” enable/disable option to support applications that require persistent TCP connections.
  - Introduces a configuration dialog in the GUI for easy setup.
- Telemetry Enhancements
  - Enables users to query system summary metrics within a selected time range from the GUI.
  - Adopts the AWS default setting for IOPS.
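For illustration, the EC2 request that a managed EBS volume configuration maps to looks roughly like the boto3 sketch below. MMBatch provisions these volumes itself; the volume type, size, region, and tag values shown are example choices, not MMBatch defaults.

```python
# Illustrative only: a plain EC2 create-volume request with a type, a size,
# and custom tags, roughly what a managed EBS configuration expresses.
# All values below are example choices, not MMBatch defaults.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

response = ec2.create_volume(
    AvailabilityZone="us-east-1a",  # must match the Compute Instance's zone
    VolumeType="gp3",               # example type; uses the AWS default IOPS unless overridden
    Size=500,                       # size in GiB, sized for a large local root filesystem
    TagSpecifications=[
        {
            "ResourceType": "volume",
            "Tags": [
                {"Key": "project", "Value": "genomics-pipeline"},  # example custom tags
                {"Key": "owner", "Value": "batch-team"},
            ],
        }
    ],
)
print("created volume:", response["VolumeId"])
```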
Known Limitations
Cross Availability Zone Data Migration: If the Batch queue is bound to a compute environment that spans multiple Availability Zones, EBS volume data must be migrated whenever a job restarts in a different zone. Migration time can range from minutes to hours, depending on AWS load.
Leftover Volumes in Cross-Zone Restarts: If a restarted job begins running before the old node is reclaimed, the system may fail to remove the old EBS volume, leaving an orphaned volume in the user’s environment.
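If an orphaned volume does occur, it can be located and removed manually. The sketch below lists unattached EBS volumes that carry a given tag, for example one of the custom tags applied through the managed EBS configuration; the tag key and value shown are placeholders.

```python
# Sketch: find unattached ("available") EBS volumes carrying a given tag so
# that leftovers from cross-zone restarts can be reviewed and deleted.
# The tag key/value are placeholders; substitute the custom tags your
# managed EBS configuration actually applies.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

pages = ec2.get_paginator("describe_volumes").paginate(
    Filters=[
        {"Name": "status", "Values": ["available"]},        # not attached anywhere
        {"Name": "tag:created-by", "Values": ["mmbatch"]},  # placeholder tag filter
    ]
)

for page in pages:
    for vol in page["Volumes"]:
        print(vol["VolumeId"], vol["Size"], vol["CreateTime"])
        # After confirming the volume is truly orphaned:
        # ec2.delete_volume(VolumeId=vol["VolumeId"])
```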
Spot reclaim protection is not supported during restore: If a Spot reclaim occurs while a job is being restored, the job re-runs from the beginning.
Managed EBS needs to be enabled: Managed EBS is required for file consistency when saving files under the root filesystem on Amazon Linux 2 with JuiceFS used for the scratch and checkpoint directories.
MMBatch checkpoint does not support file-backed memory: File-backed memory corresponds to quadrants 3 and 4 in the Linux top command’s memory classification. An example of file-backed memory is files saved under /dev/shm, a temporary file storage filesystem (tmpfs) that uses memory as its backing store.
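As a rough self-check before relying on checkpointing, a job can read its resident-memory breakdown from /proc/&lt;pid&gt;/status, which on Linux 4.5 and later reports RssAnon, RssFile, and RssShmem separately (tmpfs mappings such as files under /dev/shm are counted in RssShmem). A minimal sketch:

```python
# Sketch: read the resident-memory breakdown of the current process from
# /proc/self/status (fields available since Linux 4.5). Large RssFile or
# RssShmem values suggest the workload relies on file-backed or tmpfs
# (e.g. /dev/shm) memory, which MMBatch checkpoints do not capture.
def resident_memory_breakdown(pid="self"):
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("RssAnon", "RssFile", "RssShmem"):
                fields[key] = int(rest.split()[0])  # value is reported in kB
    return fields

if __name__ == "__main__":
    for name, kb in resident_memory_breakdown().items():
        print(f"{name}: {kb} kB")
```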
AWS S3 Throughput Limitations with Concurrent Access: When AWS S3 is used as a storage backend, particularly with multiple file systems accessing the same bucket concurrently (e.g., JuiceFS used as a scratch directory in Nextflow pipelines), throughput limits may be reached. This can manifest as [Errno 5] Input/output error messages in the Nextflow logs. For best practices and optimization strategies to mitigate this, refer to the AWS S3 documentation on Optimizing Amazon S3 Performance.
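For applications that talk to the bucket directly through the AWS SDK, one mitigation in that spirit is enabling client-side retries with backoff, as in the boto3 sketch below; the bucket and key names are placeholders, and file systems such as JuiceFS have their own tuning options instead.

```python
# Sketch: configure an S3 client with boto3's adaptive retry mode so that
# throttled or failed requests back off and retry instead of surfacing
# immediately as I/O errors. Bucket and key names are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    config=Config(
        retries={
            "max_attempts": 10,  # total attempts, including the first
            "mode": "adaptive",  # client-side rate limiting plus backoff
        }
    ),
)

s3.download_file("example-bucket", "path/to/object", "/tmp/object")
```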