Previous Release Highlights
Brief summaries of new features in previous MMCloud releases.
Imperia 3.0 Release
The Imperia 3.0 release incorporates the enhancements from the 2.5.x patch releases, adds new features, and improves the overall reliability and scalability of the platform.
- Storage Service enables a user to "register" (that is, preconfigure) a storage service offered by a cloud service provider (such as AWS EBS or S3), or a network-based service (such as NFS), to provide the file system that is created when a job starts.
All members of a group can access storage registered by any member of the group, but only the user who registered the storage (or the admin user) can modify or delete it. After registration, a storage service is assigned a configurable name and an automatically generated identifier.
Every storage service requires configuration information (for example, an IP address or a bucket name and, in some cases, access credentials). By registering storage, a user allows other members of the group to attach it using only its name or identifier.
- Memory Machine Unified Snapshot Engine replaces the checkpoint/restore module used in earlier MMCloud releases, improves performance, and adds features such as GPU checkpoint and restore.
- GPU checkpoint and restore enables users to run AI/ML (or algorithmically similar) jobs on spot versions of GPU-enabled compute instances.
Applications such as AI/ML make extensive use of tensor calculations, which GPUs can accelerate, but GPU-enabled compute instances are expensive. For example, an on-demand AWS p4d.24xlarge instance (with eight NVIDIA A100 Tensor Core GPUs) costs approximately $32 per hour in the us-east-1 region, while the same instance costs about $8 per hour as a spot instance, a 75% discount. Protection against spot reclaims is therefore a significant benefit to users who run AI/ML workloads.
- Rocky Linux is now the base operating system for the OpCenter and worker nodes because of its robust support for NVIDIA drivers.
Rocky Linux is an open-source, community-supported Linux distribution designed to be 100% bug-compatible with Red Hat Enterprise Linux (RHEL). Earlier MMCloud releases relied on CentOS Stream, a community-driven Linux distribution that tracks just ahead (upstream) of RHEL. CentOS Stream uses a rolling release model, whereas Rocky Linux follows a traditional release model (scheduled updates) that tracks RHEL. Both distributions offer good performance and stability, but Rocky Linux prioritizes stability over being on the leading edge.
- High-performance, scalable, distributed file systems (JuiceFS and Lustre) are available as options when configuring data volumes for a job.
JuiceFS and Lustre are open-source file systems that present a standard POSIX interface while distributing data storage across multiple devices or services (such as S3). The result is a high-performance, cost-effective file system that scales easily. JuiceFS can also serve as the file system that stores snapshots; that is, a JuiceFS folder is mounted as /mnt/float-data.
- Process detail in WaveWatcher, available with a button click, shows timestamped process steps (start time and finish time) for each job as it runs. The timestamps let you match resource utilization to process steps, which in turn lets you optimize resources for similar runs in the future.
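The spot-pricing figures in the GPU checkpoint and restore item can be checked with a few lines of arithmetic. The following minimal sketch uses the approximate hourly prices quoted there ($32 on demand, $8 spot). The 100-hour run length is a hypothetical value chosen for illustration, and the function names are illustrative only. Real spot prices fluctuate, and the savings figure ignores any time lost to spot reclaims and restores.

```python
def spot_discount_pct(on_demand_hourly: float, spot_hourly: float) -> float:
    """Percentage saved per hour by using a spot instance instead of on demand."""
    return (on_demand_hourly - spot_hourly) / on_demand_hourly * 100.0


def run_savings(on_demand_hourly: float, spot_hourly: float, hours: float) -> float:
    """Dollars saved over a run, assuming constant prices and no time lost to reclaims."""
    return (on_demand_hourly - spot_hourly) * hours


if __name__ == "__main__":
    ON_DEMAND = 32.0  # approximate on-demand $/hour quoted in the text
    SPOT = 8.0        # approximate spot $/hour quoted in the text
    HOURS = 100       # hypothetical run length for illustration

    print(f"discount: {spot_discount_pct(ON_DEMAND, SPOT):.0f}%")  # discount: 75%
    print(f"saved over {HOURS} h: ${run_savings(ON_DEMAND, SPOT, HOURS):,.0f}")  # $2,400
```

The savings scale linearly with run length, so long GPU training jobs benefit most, provided checkpoint and restore keeps reclaims from discarding completed work.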
Half Moon Bay 2.5 Release
The Half Moon Bay 2.5 release adds major features and improves the overall reliability and scalability of the platform.
- SurfZone is a cost-management feature that lets an administrator configure a monthly budget (quota) for a group of users. When the spending limit is reached, jobs are either canceled immediately or suspended until the budget is replenished; which action applies is a configuration option. If a job is suspended, the SurfZone configuration determines whether the job resumes automatically when the budget is replenished or waits for user input.
- Workflow view of jobs lets the user examine all the tasks grouped in a single workflow. For example, a single Nextflow pipeline can contain hundreds of tasks. In the web interface, the Workflow Details screen includes a summary of the entire pipeline, such as wall time, CPU time, and the number of on-demand and spot instances created. A Timeline tab shows when individual jobs start and, if completed, when they stop. Current status is color-coded to indicate success, failure, running, and so on. This feature applies automatically to Nextflow pipelines and can be applied to other workflows by including identifying tags.
- Multiple Machine Images (a feature called Quiver) enables the OpCenter to create VMs from VMIs (virtual machine images) that are specialized for the task, for example, to support an instance with a GPU. Previous versions of the OpCenter software used the same VMI (based on CPUs with x86 architecture) for all instances.
- NVIDIA GPU support allows users to submit jobs that take advantage of NVIDIA drivers and hardware. The NVIDIA GPU support in the Half Moon Bay release does not include checkpoint and restore, so SpotSurfer and WaveRider are not available for these jobs.
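The SurfZone item describes a small decision procedure. Once a group's spending reaches its monthly budget, a configured action is applied to running jobs. When the budget is replenished, suspended jobs either resume on their own or wait for a user. The following minimal sketch models that logic under those assumptions. The names `OverBudgetAction`, `ResumePolicy`, `SurfZonePolicy`, and the job-state strings are hypothetical and are not part of any MMCloud API. They only illustrate the two configuration choices the text mentions.

```python
from dataclasses import dataclass
from enum import Enum


class OverBudgetAction(Enum):
    """Hypothetical: what happens to a running job when the budget is exhausted."""
    CANCEL = "cancel"    # stop the job immediately
    SUSPEND = "suspend"  # pause the job until the budget is replenished


class ResumePolicy(Enum):
    """Hypothetical: what happens to a suspended job when the budget is replenished."""
    AUTOMATIC = "automatic"          # resume without user action
    WAIT_FOR_USER = "wait_for_user"  # stay suspended until a user resumes it


@dataclass
class SurfZonePolicy:
    monthly_budget: float
    action: OverBudgetAction
    resume: ResumePolicy

    def on_spend_update(self, spent: float, job_state: str) -> str:
        """Return a job's new state given the group's spending so far this month."""
        if job_state == "running" and spent >= self.monthly_budget:
            return "canceled" if self.action is OverBudgetAction.CANCEL else "suspended"
        return job_state

    def on_budget_replenished(self, job_state: str) -> str:
        """Return a job's new state after the monthly budget is replenished."""
        if job_state == "suspended" and self.resume is ResumePolicy.AUTOMATIC:
            return "running"
        # Canceled jobs stay canceled; WAIT_FOR_USER leaves the job suspended.
        return job_state


if __name__ == "__main__":
    policy = SurfZonePolicy(monthly_budget=1000.0,
                            action=OverBudgetAction.SUSPEND,
                            resume=ResumePolicy.AUTOMATIC)
    state = policy.on_spend_update(spent=1000.0, job_state="running")
    print(state)                                # suspended
    print(policy.on_budget_replenished(state))  # running
```

Keeping the over-budget action and the resume behavior as independent settings mirrors the text, which treats cancel-versus-suspend and automatic-versus-manual resume as separate configuration choices.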
Goa 2.4 Release
The Goa 2.4 release adds two major features and a number of other product enhancements.
- AppCapsule++ improves the snapshot feature by increasing the size of the memory footprint that can be captured and by decreasing the time the application is "frozen" while the snapshot is taken. AppCapsule++ is supported only on AWS.
- New license model eliminates the Essential plan and changes how users on the Pro plan are charged: instead of a percentage of savings, usage is charged at a fixed rate per CPU core hour. The new license model is simpler and maps more accurately to how customers use MMCloud.
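The license change in the Goa 2.4 list replaces a charge based on savings with one based purely on usage. The following minimal sketch contrasts the two calculations. The rate ($0.01 per core hour), the savings share (20%), and the example job are hypothetical placeholders, not MMCloud prices. The function names are also illustrative only.

```python
def per_core_hour_charge(cores: int, hours: float, rate_per_core_hour: float) -> float:
    """Charge under a fixed rate per CPU core hour (the Goa 2.4 Pro plan model)."""
    return cores * hours * rate_per_core_hour


def savings_share_charge(savings: float, share: float) -> float:
    """Charge under a percentage-of-savings model (the model Goa 2.4 replaced)."""
    return savings * share


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 0.01   # $ per core hour, placeholder only
    HYPOTHETICAL_SHARE = 0.20  # fraction of savings, placeholder only

    # Hypothetical job: 16 cores for 10 hours that saved $50 versus on demand.
    print(f"per core hour: ${per_core_hour_charge(16, 10.0, HYPOTHETICAL_RATE):.2f}")
    print(f"share of savings: ${savings_share_charge(50.0, HYPOTHETICAL_SHARE):.2f}")
```

Under the per-core-hour model, the charge depends only on cores and run time, which can be known before a job runs. Under the savings-based model, the charge depended on how much a job happened to save.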
Other new features include additional functionality in the web interface (user management and system configuration), configuration options to improve performance (such as a compute instance deny list), and native Cloud Storage FUSE support for Google Cloud.
A job template to create a Script-of-Scripts (SOS) polyglot notebook server is added as a preview feature.