New Features in MMCloud Imperia 3.0 Release

Date Released

Imperia 3.0 was released on 08-01-2024.

Supported Clouds

MMCloud is designed to work on any cloud infrastructure. The Imperia 3.0 release supports the following clouds:

  • AWS
  • Google Cloud
  • Alibaba Cloud

New Features in Imperia 3.0 Release

Type Domain Description
Feature Platform Storage Service enables a user to "register" (that is, pre-configure) a storage service offered by a cloud service provider (such as AWS EBS or S3) or a network-based service (such as NFS) to serve as the file system created when a job starts. An illustrative example follows the list below.
  • All members of a group have access to storage registered by members of the group, although only the user who registered the storage (or the admin user) can delete or modify the storage. After registration, a storage service is assigned a name (configurable) and an identifier (automatic).
  • All storage services require configuration information (for example, an IP address or a bucket name, and access credentials in some cases). By registering storage, a user allows other members of the group to attach the storage using only the name or identifier.
  • The permission associated with a storage service can be set to "public," in which case all users have access to the storage.
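
The sketch below illustrates the intended register-then-attach workflow under stated assumptions: the storage name (shared-data), the NFS endpoint, and the float storage register and --storage flags are hypothetical placeholders used only for illustration; consult the Storage Service documentation for the actual commands and flag names.

    # Hypothetical: register an NFS export once so that the group can reuse it by name.
    float storage register --name shared-data --type nfs --endpoint nfs://10.0.0.5/export

    # Hypothetical: any group member then attaches the registered storage by name at submission time.
    float submit -i ubuntu:22.04 -j job.sh --storage shared-data:/mnt/shared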
Feature Platform Memory Machine Unified Snapshot Engine replaces the checkpoint/restore module used in earlier MMCloud releases and provides improved performance and additional features, such as GPU checkpoint and restore.
Feature Platform GPU checkpoint and restore capability enables users to run AI/ML (or algorithmically similar) jobs on spot versions of GPU-enabled compute instances.

AI/ML applications make extensive use of tensor calculations, which can be accelerated using GPUs. GPU-enabled compute instances are expensive. For example, an on-demand AWS P4d.24xlarge instance (with eight NVIDIA A100 Tensor Core GPUs) costs approximately $32 per hour in the us-east-1 region. The same instance costs about $8 per hour as a spot instance, roughly a 75% discount. Protection against spot reclaims is therefore a significant benefit to users who run AI/ML workloads.
Feature Platform Rocky Linux is now the base operating system for the OpCenter and worker nodes because of its support for NVIDIA drivers.

Rocky Linux is an open-source, community-supported Linux distribution designed to be 100% bug-compatible with Red Hat Enterprise Linux (RHEL). Earlier MMCloud releases relied on CentOS Stream, a community-driven Linux distribution that tracks just ahead (upstream) of RHEL. CentOS Stream uses a rolling release model, whereas Rocky Linux follows a traditional release model (scheduled updates) that tracks slightly downstream of RHEL. Although both distributions offer good performance and stability, Rocky Linux emphasizes stability over being on the leading edge.
Feature Platform High-performance, scalable, distributed file systems (JuiceFS and Lustre) are available as options when configuring data volumes for a job.

JuiceFS and Lustre are open-source file systems that present a standard POSIX interface while distributing data storage among multiple devices or services (such as S3). The result is a high-performance, cost-effective file system that scales easily. JuiceFS can also be used as the file system for storing snapshots; in that case, a JuiceFS folder is mounted as /mnt/float-data.
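
As a minimal sketch of selecting a distributed file system for a job's data volume, the example below assumes that --dataVolume accepts a JuiceFS target by URI; the jfs:// scheme, the metadata-server address, and the mount point are illustrative assumptions, not confirmed syntax.

    # Assumption: a JuiceFS volume is referenced through --dataVolume using a URI-style scheme.
    float submit -i ubuntu:22.04 -j job.sh --dataVolume jfs://meta.example.internal:6379/vol1:/mnt/jfs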
Feature Platform Process detail in WaveWatcher, accessible with a button click, shows timestamped process steps (start time and finish time) for each job as it runs. The timestamps allow you to match resource utilization with process steps, which in turn allows you to optimize resources for similar runs in the future.
Feature Platform Configurable image volume type (AWS only) allows users to specify the type (gp2 or gp3) used for the container image volume created for every job. The image volume type can be set per-job (--imageVolType) or set globally (cloud.imageVolumeType). The default value is gp2.
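
For example, the volume type can be chosen for a single job or changed as the global default. The container image and job script below are placeholders, and the float config set form for the global setting is an assumption about how OpCenter configuration keys are modified.

    # Per-job: use a gp3 container image volume for this submission only.
    float submit -i ubuntu:22.04 -j job.sh --imageVolType gp3

    # Global default (assumed mechanism): change cloud.imageVolumeType for all subsequent jobs.
    float config set cloud.imageVolumeType gp3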
Feature Platform Ghost file size limit increase broadens the SpotSurfer support for applications that generate a large volume of ghost files.

Ghost files are files that do not show up in Linux tools such as ls but are still in use by applications and take up disk space. Because ghost files are deleted when the application terminates, they must be saved as part of the snapshot so that the application can resume execution at the point where the snapshot was captured. The Imperia release removes the 4GB hard limit imposed in earlier releases and allows the user to set the ghost file size limit to an arbitrary value, for example, 20GB.
Feature Platform Spot reclaim limit is an option that can be included in the VM creation policy. The syntax is [maxSpotReclaim=n], where n is an integer. When the number of spot reclaims reaches n, the job resumes on an on-demand instance. Setting [maxSpotReclaim=0] means that no limit is imposed. This feature is useful for riding out "spot instance storms," periods when spot reclaims become excessive, by moving the job to an on-demand instance instead of repeatedly retrying spot.
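
For example, assuming the VM creation policy is supplied through --vmPolicy (as in the CPU compatibility entry below), the option can be set at submission time; the image and job script are placeholders.

    # After the third spot reclaim, resume the job on an on-demand instance instead of retrying spot.
    float submit -i ubuntu:22.04 -j job.sh --vmPolicy [maxSpotReclaim=3]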
Feature Platform Adding custom tags to running jobs is a mechanism for users to group jobs associated with the same tag. The CLI syntax is float modify --addCustomTag stringArray

Custom tags provide a way to monitor multiple jobs as a single group. This is used extensively with Nextflow, where a single pipeline may spawn hundreds of jobs. In earlier releases, custom tags had to be included in the job submission request. In the Imperia release, custom tags can be added to a running job. This is useful in Nextflow for tagging the job running the Nextflow host with the same tag (or tags) as the jobs in the pipeline.
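
For example, a tag can be attached to a job that is already running; the job ID, the key:value form of the tag, and the -j flag on float modify are placeholders and assumptions based on the CLI syntax quoted above.

    # Add a custom tag to a running job (tag format and -j flag are assumptions for illustration).
    float modify -j JOB_ID --addCustomTag pipeline:rnaseq-run7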
Feature Platform CLI option to show container hook content provides users with an easy mechanism to display the content of container hook scripts.

The Open Container Initiative (OCI) defines a mechanism, called container hooks, to run scripts at various stages in a container lifecycle (typically when a container starts or ends). The float show command now has the option to display container hook scripts associated with a job, for example, float show -j JOB_ID --containerInitScript
Feature Platform Selectable VM instances for OpCenter server allows users, when deploying the OpCenter, to select from a list of pre-configured compute instances (ranging from extra small to large) or to input their own choice of compute instance by type (as long as the instance is based on an x86 CPU). The user can also select the block storage type mounted by the OpCenter. For example, in AWS, select gp2 or gp3.
Feature Platform Configurable CPU compatibility policy improves the robustness of the "restore" part of the checkpoint/restore process by implementing a configurable policy for filtering VM instance candidates based on the compatibility of their CPUs with the CPU of the VM on which the snapshot was taken. The policy can be set as a global default (cloud.cpuCompatibleMode) or configured per job (--vmPolicy [mode=loose|strict|same]); an example follows the list below. The modes are:
  • loose: only migrate if the candidate instance uses the same CPU family and the same (or a newer) version
  • strict: only migrate if the CPU metadata in the snapshot is compatible with the candidate instance's CPU
  • same: only migrate if the candidate instance uses the same CPU family and version
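
For instance, the policy can be pinned for a single job or changed globally; the image and job script below are placeholders, and the float config set form for the global default is an assumption about how OpCenter configuration keys are modified.

    # Per-job: only restore onto an instance with the same CPU family and version.
    float submit -i ubuntu:22.04 -j job.sh --vmPolicy [mode=same]

    # Global default (assumed mechanism): make strict matching the system-wide policy.
    float config set cloud.cpuCompatibleMode strict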