Latest Release
A brief introduction to MMBatch followed by what's new in the latest release.
Overview
Memory Machine Batch (MMBatch) captures the entire running state of a Batch Job into a consistent image and restores the Job on a new Compute Instance without losing any work progress. It ensures a high quality of service at the Batch level while using low-cost but unreliable Spot-based Compute Instances. For more details, visit the MMBatch website.
What's New in the 1.4.2 Patch Update
This patch update focuses on improving the Management Server with new features and key bug fixes that enhance system stability.
Management Server Updates
- Dashboard View Improvements: The Management Server Dashboard now provides detailed insights into Spot Protections, including their checkpoint and restore activity. You can now view overall, queue-level, and job-level reports that display the number of attempts, successes, failures, and the total success rate, providing a clear picture of MMBatch performance and cost savings.
- Engineering Mode: This release includes a new Engineering Mode setting for the Dashboard View, which provides more detailed information on spot protection savings at both the server and queue levels.
- New Log Bundle Feature: You can now access a log bundle with viewing and download capabilities directly from the Jobs view. This feature helps you quickly troubleshoot and resolve issues by providing logging at both the server and node levels.
What's New in the 1.4.1 Patch Update
This patch update delivers key improvements to Management Server monitoring capabilities and addresses high-priority bug fixes to enhance system stability.
Management Server Observability & Management
- Dashboard View Improvements: The Management Server Dashboard now provides detailed insights into Checkpoint and Restore activity. You can now view overall and queue-level reports showing the number of attempts, successful operations, failures, and the overall success rate, giving you a clearer picture of system performance.
- Enhanced Jobs View: The Jobs view now offers powerful new filtering options to help you quickly find specific job data. You can filter by the date a job began or ended, queue name, job status, spot protection failure, or even by a specific job identifier. Additionally, all data from the Jobs view can now be exported directly to a .csv file, enabling easier data analysis and record-keeping (see the sketch after this list).
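If you want to work with an export programmatically, a minimal Python sketch along the following lines may help. The file name and the Status column header are assumptions, so match them to the columns in your actual export.

```python
import csv
from collections import Counter

# Tally job outcomes from a Jobs-view export. "mmbatch_jobs_export.csv" and
# the "Status" column name are illustrative -- adjust to your actual export.
with open("mmbatch_jobs_export.csv", newline="") as f:
    statuses = Counter(row["Status"] for row in csv.DictReader(f))

for status, count in statuses.most_common():
    print(f"{status}: {count}")
```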
What's New in the 1.4 Release
The MMBatch 1.4 release adds new features, improves the user experience and enhances telemetry.
MemVerge GPU SpotSurfing
- GPU SpotSurfing is a checkpointing and recovery feature designed to seamlessly manage Spot terminations on GPU instances. By integrating checkpointing with cloud automation, this service enables stateful workloads to run safely on GPU Spot instances, reducing compute costs by up to 70%, all without requiring any changes to your application code.
NVIDIA GPU Checkpoint & Restore Support
- New support for checkpointing and restoring GPU-based workflows, such as:
  - parabricks-fq2bam
  - parabricks-germline
Job List View in Application Window
- New job list view that shows job status, maximum root filesystem usage of the container, and detailed job events.
Downloadable Worker Node CSV
- Users can now download a CSV file containing all finished worker node details.
Checkpoint Cleanup Improvements
- When not in diagnostic mode, checkpoints are now automatically cleaned up if the restore succeeds or the job finishes.
Known Limitations
Checkpoint and Restore Count Mismatch After Max Retries: The Checkpointing Count can be one higher than the Restore Succeed Count when an AWS Batch job hits its maximum retry limit. If a job is interrupted (for example, during host pre-emption), MMBatch attempts to checkpoint it and AWS Batch then retries the job on a new host. Once the job exhausts its allowed retries, it is not assigned to another host, so the final checkpoint attempt has no restore associated with it.
MiniWDL Job Status Can Get Stuck After Cancellation: When a MiniWDL job is interrupted and the user cancels or terminates the MiniWDL workflow, the interrupted job's status might not update to a final state such as "Failed" or "Succeeded." This happens because MiniWDL does not inform MMBatch that the workflow job has been cancelled or terminated following an interruption. Because MMBatch is waiting for that final "restored" or "failed" signal, the job's status never completes its transition.
Cross Availability Zone Data Migration: If the batch queue is bound to a multi-availability zone compute environment, EBS volume data migration is required when a job restarts in a different zone. Migration time can range from minutes to hours, depending on AWS load.
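For context on why cross-zone restarts can be slow, moving EBS data between Availability Zones generally means creating a snapshot and then a new volume in the target zone. The boto3 sketch below is illustrative only: MMBatch drives the migration itself, and the volume ID and target zone are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Snapshot the source volume (placeholder ID), wait for the snapshot to
# complete, then create a new volume from it in the target zone. Snapshot
# completion time scales with the amount of data and current AWS load.
snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                           Description="cross-AZ migration illustration")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

new_vol = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                            AvailabilityZone="us-east-1b")
print("Volume recreated in target zone:", new_vol["VolumeId"])
```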
EBS Volume Detachment Delays on AWS Instance Termination: When AWS terminates an instance, detaching its associated EBS volume may take longer than 5 minutes. This delay is due to AWS's instance termination behavior, which can prevent the instance from flushing in-memory data to the EBS volume, clearing file system caches, or cleanly unmounting the volume.
Leftover Volumes in Cross-Zone Restarts: If a restarted job runs before the old node is reclaimed, the system may fail to remove the old EBS volume, leaving an orphaned volume in the user’s environment.
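To spot such leftovers, one option is to list unattached volumes with boto3, roughly as sketched below. Any tag filtering you add to narrow the results to MMBatch-managed volumes depends on how your environment labels them, so review the list before deleting anything.

```python
import boto3

ec2 = boto3.client("ec2")

# List EBS volumes that are "available" (not attached to any instance),
# which is the state an orphaned volume is left in after a cross-zone restart.
resp = ec2.describe_volumes(Filters=[{"Name": "status", "Values": ["available"]}])
for vol in resp["Volumes"]:
    print(vol["VolumeId"], vol["AvailabilityZone"], f'{vol["Size"]} GiB', vol["CreateTime"])
```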
Spot Reclaim Protection is not Supported during Restore: If a spot reclaim happens during restore, MMBatch will retry the restore on a new instance.
Managed EBS needs to be enabled: Managed EBS is required for file consistency when saving files under root while running on Amazon Linux 2 and using JuiceFS for the scratch and checkpoint directories.
MMBatch checkpoint does not support file-backed memory: File-backed memory corresponds to quadrants #3 and #4 in Linux top command output. An example of file-backed memory is data saved to files under /dev/shm, a temporary file storage filesystem (tmpfs) that uses memory as its backing store.
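As a concrete illustration of this limitation, the Python sketch below creates a file-backed mapping under /dev/shm; pages held this way are file-backed rather than anonymous, so they are not part of the checkpointed state. The file name is a placeholder.

```python
import mmap
import os

# Create a file under /dev/shm (tmpfs) and map it into memory. The pages of
# this mapping are file-backed, not anonymous, so they fall under the
# limitation described above and are not captured by the checkpoint.
path = "/dev/shm/mmbatch_demo.bin"
with open(path, "wb") as f:
    f.truncate(4096)

fd = os.open(path, os.O_RDWR)
mm = mmap.mmap(fd, 4096)           # file-backed memory mapping
mm[:5] = b"hello"                  # lives in tmpfs pages, not anonymous memory
mm.close()
os.close(fd)
```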
AWS S3 Throughput Limitations with Concurrent Access: When using AWS S3 as a storage backend, particularly with multiple file systems accessing the same bucket concurrently (e.g., JuiceFS used as a scratch directory in Nextflow pipelines), throughput limitations may be encountered. This can manifest as [Errno 5] Input/output error messages in Nextflow logs. For best practices and optimization strategies to mitigate this, refer to the AWS S3 documentation on Optimizing Amazon S3 Performance.
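The AWS guidance is the primary mitigation; if your pipeline code can tolerate it, a simple retry around reads that hit transient EIO errors can also help. The sketch below is an assumption about a reasonable retry policy, not MMBatch or Nextflow behavior.

```python
import errno
import time

def read_with_retry(path, attempts=5, delay=2.0):
    """Read a file, retrying on transient [Errno 5] I/O errors (EIO)."""
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as exc:
            if exc.errno != errno.EIO or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # back off before retrying
```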
PreDump Occurs even if MMC_CHECKPOINT_INTERVAL is set to 0: The only way to disable PreDump is to use MMC_CHECKPOINT_INTERVAL to set the checkpoint interval to a value larger than the longest interval you need (for example, longer than the job's expected runtime). The current default, if MMC_CHECKPOINT_INTERVAL is not set, is 15 minutes.
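If you need to set the interval explicitly, the boto3 sketch below registers an AWS Batch job definition that supplies MMC_CHECKPOINT_INTERVAL, assuming it is read from the container environment. The job definition name, image, resource values, and the interval's value and unit are placeholders; confirm the exact semantics in the MMBatch configuration reference.

```python
import boto3

batch = boto3.client("batch")

# Register a job definition that sets MMC_CHECKPOINT_INTERVAL for the
# container. The name, image, resource values, and the interval value/unit
# are placeholders -- confirm the expected format in the MMBatch docs.
batch.register_job_definition(
    jobDefinitionName="mmbatch-example",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "4096"},
        ],
        "environment": [
            {"name": "MMC_CHECKPOINT_INTERVAL", "value": "30"},
        ],
    },
)
```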
'Total Run Time' differs between the AWS Batch engine and the MMBatch engine: When MMBatch performs its final checkpoint, it first pauses the job's container, and the container is terminated only after checkpointing is complete. Because AWS Batch calculates a job's end time from container termination, the reported run time includes the time MMBatch spent checkpointing after pausing the container. The discrepancy is at most about 2 minutes (the maximum time before the EC2 Spot instance is reclaimed), which is negligible in the total runtime of a job.