New Features in MMCloud Imperia 3.0 Release
Date Released
Imperia 3.0 was released on 08-01-2024.
Supported Clouds
MMCloud is designed to work on any cloud infrastructure. The Imperia 3.0 release supports the following clouds:
- AWS
- Google Cloud
- Alibaba Cloud
New Features in Imperia 3.0 Release
Type | Domain | Description |
---|---|---|
Feature | Platform | Storage Service enables a user to "register" (that is, pre-configure) a storage service offered by a cloud service provider (such as AWS EBS or S3) or a network-based service (such as NFS) to serve as the file system created when a job starts. |
Feature | Platform | Memory Machine Unified Snapshot Engine replaces the checkpoint/restore module used in earlier MMCloud releases and provides improved performance and additional features, such as GPU checkpoint and restore. |
Feature | Platform | GPU checkpoint and restore capability enables users to run AI/ML (or algorithmically similar) jobs on the spot versions of GPU-enabled compute instances. Applications such as AI/ML make extensive use of tensor calculations, which can be accelerated using GPUs. GPU-enabled compute instances are expensive. For example, an on-demand AWS P4d.24xlarge instance (with eight NVIDIA A100 Tensor Core GPUs) costs approximately $32 per hour in the us-east-1 region. The same instance costs about $8 per hour as a spot instance, roughly a 75% discount. Protection against spot reclaims is a significant benefit to users who run AI/ML workloads. |
Feature | Platform | Rocky Linux is now the base operating system for the OpCenter and worker nodes because of its support for NVIDIA drivers. Rocky Linux is an open-source, community-supported Linux distribution designed to be 100% bug-compatible with Red Hat Enterprise Linux (RHEL). Earlier MMCloud releases relied on CentOS Stream, a community-driven Linux distribution that tracks just ahead (upstream) of RHEL. CentOS Stream uses a rolling release model, whereas Rocky Linux follows a traditional release model (scheduled updates) that tracks slightly downstream of RHEL. Although both distributions offer performance and stability, Rocky Linux stresses stability over being on the leading edge. |
Feature | Platform | High-performance, scalable, distributed file systems (JuiceFS and Lustre) are available as options when configuring data volumes for a job. JuiceFS and Lustre are open-source file systems that present a standard POSIX interface while distributing data storage among multiple devices or services (such as S3). The result is a high-performance, cost-effective file system that scales easily. JuiceFS can also be used as a file system to store snapshots; that is, a JuiceFS folder is mounted as /mnt/float-data. |
Feature | Platform | Process detail in WaveWatcher, accessible by a button click, shows, for each job, timestamped process steps (start time and finish time) as the job runs. The timestamps allow you to match resource utilization with process steps which in turn allows you to optimize resources for similar runs in the future. |
Feature | Platform | Configurable image volume type (AWS only) allows users to specify the type (gp2 or gp3) used for the container image volume created for every job. The image volume type can be set per job (--imageVolType) or globally (cloud.imageVolumeType). The default value is gp2. See the first example after the table. |
Feature | Platform | Ghost file size limit increase broadens SpotSurfer support for applications that generate a large volume of ghost files. Ghost files are files that do not show up using Linux tools such as ls but are still in use by applications and take up disk space. Ghost files are deleted when the application terminates, so they must be saved as part of the snapshot to allow the application to resume execution at the point where the snapshot was captured. The Imperia release removes the hard limit of 4GB imposed by earlier releases; users can now set the ghost file size limit to an arbitrary value, for example, 20GB. |
Feature | Platform | Spot reclaim limit is an option that can be included in the VM creation policy. The syntax is [maxSpotReclaim=n], where n is an integer. When the number of spot reclaims reaches the value n, the job resumes on an on-demand instance. Setting [maxSpotReclaim=0] means that no limit is imposed. This feature is useful for riding out "spot instance storms," in which spot reclaims become excessive. See the example after the table. |
Feature | Platform | Adding custom tags to running jobs is a mechanism for users to group jobs associated with the same tag. The CLI syntax is float modify --addCustomTag stringArray. Custom tags are an option to monitor multiple jobs as a single group, used extensively in Nextflow, where a single pipeline may spawn hundreds of jobs. In earlier releases, custom tags had to be included in the job submission request. In the Imperia release, custom tags can be added to a running job, which is useful in Nextflow for tagging the job running the Nextflow host with the same tag (or tags) as the jobs in the pipeline. See the example after the table. |
Feature | Platform | CLI option to show container hook content provides users with an easy mechanism to display the content of container hook scripts. The Open Container Initiative (OCI) defines a mechanism, called container hooks, to run scripts at various stages in a container lifecycle (typically when a container starts or ends). The float show command now has an option to display the container hook scripts associated with a job, for example, float show -j JOB_ID --containerInitScript (see the example after the table). |
Feature | Platform | Selectable VM instances for OpCenter server allows users, when deploying the OpCenter, to select from a list of pre-configured compute instances (ranging from extra small to large) or to input their own choice of compute instance by type (as long as the instance is based on an x86 CPU). The user can also select the block storage type mounted by the OpCenter; for example, in AWS, select gp2 or gp3. |
Feature | Platform | Configurable CPU compatibility policy improves the robustness of the "restore" part of the checkpoint/restore process by implementing a configurable policy for filtering VM instance candidates based on the compatibility of the CPU with the CPU of the VM on which the snapshot was taken. The policy can be set as a global default (cloud.cpuCompatibleMode) or configured per job (--vmPolicy [mode=loose|strict|same]). The available modes are loose, strict, and same. See the example after the table. |
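CLI Examples
The sketches below show how some of the options described in the table might be invoked from the float CLI. Treat them as illustrative rather than authoritative: only the flags quoted in the table (--imageVolType, --vmPolicy, --addCustomTag, --containerInitScript) come from this release note; the surrounding submit arguments (-i for the container image, -j for the job script, --cpu, --mem) and all names and values are assumptions chosen for the examples.

This first sketch sets the container image volume type to gp3 for a single job (AWS only):

```bash
# Submit a job with a gp3 container image volume (AWS only).
# -i, -j, --cpu, and --mem are assumed submit arguments; the image and
# script names are placeholders. Without --imageVolType, the global
# default (cloud.imageVolumeType, initially gp2) applies.
float submit -i docker.io/library/ubuntu -j ./job.sh --cpu 4 --mem 8 \
    --imageVolType gp3
```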
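To cap the number of spot reclaims, maxSpotReclaim can be included in the VM creation policy. A sketch, assuming the policy string is passed through --vmPolicy as in the CPU compatibility entry in the table:

```bash
# After the third spot reclaim, the job resumes on an on-demand
# instance; maxSpotReclaim=0 would impose no limit. The policy string
# is quoted to keep the shell from treating the brackets as a glob.
float submit -i docker.io/library/ubuntu -j ./job.sh \
    --vmPolicy '[maxSpotReclaim=3]'
```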
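Custom tags can now be attached to a job that is already running. The --addCustomTag flag is quoted in the table; the -j JOB_ID selector mirrors the float show example there, and the key:value tag format is an assumption:

```bash
# Tag a running job so it can be monitored alongside the rest of a
# Nextflow pipeline; JOB_ID and the tag string are placeholders.
float modify -j JOB_ID --addCustomTag pipeline:rnaseq
```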
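The container hook command appears verbatim in the table and is repeated here with a placeholder job ID for completeness:

```bash
# Display the container init hook script associated with a job.
float show -j JOB_ID --containerInitScript
```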
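Finally, the CPU compatibility mode can be set per job through the same --vmPolicy option. A sketch using the strict mode (one of loose|strict|same; this release note does not define the individual modes, so no behavior is implied here beyond CPU-based filtering of restore candidates):

```bash
# Filter restore candidates by CPU compatibility with the VM on which
# the snapshot was taken; the policy string is quoted as above.
float submit -i docker.io/library/ubuntu -j ./job.sh \
    --vmPolicy '[mode=strict]'
```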