Latest Release
A brief introduction to MMCloud followed by what's new in the latest release.
Overview
Memory Machine Cloud (MMCloud) is a software platform that streamlines the deployment of containerized applications in the cloud or in a hybrid cloud arrangement. Based on customizable policy, MMCloud selects and instantiates cloud resources on behalf of the user. A built-in job scheduler deploys Docker containers (and other containers that comply with the Open Container Initiative image-spec) across a group of virtual machines.
MMCloud includes AppCapsule, MemVerge's checkpoint/restore (C/R) capability. An AppCapsule is a point-in-time snapshot of the application instance, including in-memory state and relevant files. AppCapsule supports workload mobility and workload continuity. Workload mobility means that a job can move from one virtual machine to another, for example, to a more powerful virtual machine that is a better fit for the next stage of execution. Workload mobility also provides high availability: if the underlying spot instance is reclaimed, the workload automatically moves to a new virtual machine and resumes running.
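The checkpoint-and-resume idea can be illustrated with a small, application-level toy in Python. This is only a sketch of the general concept, not how AppCapsule works: AppCapsule snapshots the running application instance itself, whereas the hypothetical `run_job` function below saves its own state explicitly with `pickle`. The function name, the checkpoint file name, and the `interrupt_at` parameter (which simulates a spot reclaim) are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def run_job(total_steps, checkpoint_path, interrupt_at=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    Returns the final sum, or None if the simulated reclaim fires first.
    """
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    while state["next_step"] < total_steps:
        if interrupt_at is not None and state["next_step"] == interrupt_at:
            # Simulated spot reclaim: the "instance" disappears mid-run.
            return None
        state["total"] += state["next_step"]
        state["next_step"] += 1
        # Write to a temporary file and rename it, so an interruption
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "job.ckpt")  # hypothetical checkpoint file
    first_attempt = run_job(10, path, interrupt_at=6)  # "reclaimed" at step 6
    second_attempt = run_job(10, path)  # a "new instance" resumes at step 6
    print(first_attempt, second_attempt)
```

The second call does not repeat steps 0 through 5; it picks up from the saved state and finishes the remaining work, which is the behavior that workload continuity is meant to provide without any changes to the application.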
Users interact with MMCloud using the float CLI or the MMCloud web interface. The web interface provides a real-time graphical display of resource utilization (CPU, memory, network, etc.) as a job executes.
New in the Imperia 3.0 Release
The Imperia 3.0 release accumulates the enhancements from the 2.5.x patch releases, adds new features, and improves the overall reliability and scalability of the platform.
-
Storage Service enables a user to "register" (that is, pre-configure) a storage service offered by a cloud service provider (such as AWS EBS or S3) or a network-based service (such as NFS) to serve as the file system created when a job starts.
All members of a group have access to storage registered by members of the group, although only the user who registered the storage (or the admin user) can delete or modify the storage. After registration, a storage service is assigned a name (configurable) and an identifier (automatic).
All storage services require configuration information (for example, an IP address or a bucket name, and access credentials in some cases). By registering storage, a user allows other members of the group to attach the storage using only the name or identifier.
-
Memory Machine Unified Snapshot Engine replaces the checkpoint/restore module used in earlier MMCloud releases and provides improved performance and additional features, such as GPU checkpoint and restore.
-
GPU checkpoint and restore capability enables users to run AI/ML (or algorithmically similar) jobs on spot versions of GPU-enabled compute instances.
Applications such as AI/ML make extensive use of tensor calculations, which can be accelerated using GPUs. GPU-enabled compute instances are expensive. For example, an on-demand AWS p4d.24xlarge instance (with eight NVIDIA A100 Tensor Core GPUs) costs approximately $32 per hour in the us-east-1 region. The same instance costs about $8 per hour as a spot instance, a 75% discount. Protection against spot reclaims is a significant benefit to users who run AI/ML workloads.
-
Rocky Linux is now the base operating system for the OpCenter and worker nodes because of its robust support for NVIDIA drivers.
Rocky Linux is an open-source, community-supported Linux distribution designed to be 100% bug-compatible with Red Hat Enterprise Linux (RHEL). Earlier MMCloud releases relied on CentOS Stream, a community-driven Linux distribution that tracks just ahead (upstream) of RHEL. CentOS Stream uses a rolling release model, whereas Rocky Linux follows a traditional release model (scheduled updates) that tracks RHEL. Although both distributions offer performance and stability, Rocky Linux stresses stability over being on the leading edge.
-
High-performance, scalable, distributed file systems (JuiceFS and Lustre) are available as options when configuring data volumes for a job.
JuiceFS and Lustre are open-source file systems that present a standard POSIX interface while distributing data storage among multiple devices or services (such as S3). The result is a high-performance, cost-effective file system that scales easily. JuiceFS can also be used as a file system to store snapshots, that is, a JuiceFS folder is mounted as /mnt/float-data.
-
Process detail in WaveWatcher, accessible with a button click, shows the timestamped process steps (start time and finish time) for each job as it runs. The timestamps allow you to match resource utilization with process steps, which in turn allows you to optimize resources for similar runs in the future.
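The spot pricing figures quoted in the GPU checkpoint and restore item above can be checked with a few lines of Python. The hourly prices are the approximate values given in that item; the 100-hour job length and the `spot_discount` helper are hypothetical, chosen only to show the scale of the saving. Actual spot prices vary over time and by region.

```python
def spot_discount(on_demand_hourly, spot_hourly):
    """Return the fractional discount of the spot price versus on-demand."""
    return (on_demand_hourly - spot_hourly) / on_demand_hourly


ON_DEMAND_HOURLY = 32.0  # approximate on-demand price quoted above (USD/hour)
SPOT_HOURLY = 8.0        # approximate spot price quoted above (USD/hour)
JOB_HOURS = 100          # hypothetical job length, for illustration only

discount = spot_discount(ON_DEMAND_HOURLY, SPOT_HOURLY)
savings = (ON_DEMAND_HOURLY - SPOT_HOURLY) * JOB_HOURS

print(f"discount: {discount:.0%}")                       # discount: 75%
print(f"savings over {JOB_HOURS} h: ${savings:,.0f}")    # savings over 100 h: $2,400
```

The saving is only realized if the job survives spot reclaims, which is why checkpoint and restore matters for long-running GPU workloads.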
Detailed descriptions of all the new features and improvements in the Imperia Release are available here.
Recommended Upgrade Procedure
Imperia 3.0 is a major release. The best practice is to perform the upgrade during a scheduled maintenance window when there are no active or suspended jobs.