Memory Machine X - v1.5.0 Release Notes

Introduction

This document provides information about the Memory Machine X (MMX) version 1.5.0 release. This release introduces significant enhancements to Quality-of-Service (QoS) policies, expanded hardware support, new GPU telemetry features, and substantial updates to the User Interface (UI) and REST API, including crucial security improvements.

Key Features in This Release

This release of Memory Machine X focuses on delivering enhanced performance, manageability, and security. Highlights include:

  • Refined Quality-of-Service (QoS) Policy Engines:
    • Latency Optimized Policy (Tiering): improved page movement across CXL devices at the same NUMA distance.
    • Enhanced Bandwidth Optimized Policy with fixed ratio configuration and support for multiple CXL devices per CPU socket.
  • Expanded Hardware Support:
    • Support for the latest Intel (Sapphire Rapids, Emerald Rapids) and AMD (Genoa, Turin) CPUs.
    • Compatibility with a wider range of OEM servers and CXL expansion devices, including locally installed and externally connected CXL 2.0 solutions.
  • GPU Integration:
    • Addition of NVIDIA GPU Telemetry for monitoring and insights.
    • NVIDIA GPUs are now visible in the system topology view within the UI.
    • NVIDIA GPU telemetry data integrated into the UI dashboard.
  • User Interface (UI) Enhancements:
    • Multi-Node Management: Support for managing multiple compute nodes from a single UI instance, including a manual registration procedure for hosts and CXL appliances.
  • Security Upgrades:
    • Addition of a login/authentication page to secure UI access.
    • Management UI now runs on HTTPS by default for secure connections.
  • REST API Security:
    • The backend REST API server has been secured.
  • General Bug Fixes: Various bug fixes have been implemented to improve stability and performance for both Latency Optimized and Bandwidth Optimized policies.

Supported Hardware

CPUs

  • Intel Xeon 4th Gen (Sapphire Rapids)
  • Intel Xeon 5th Gen (Emerald Rapids)
  • AMD EPYC 4th Gen (Genoa)
  • AMD EPYC 5th Gen (Turin)

Servers

  • OEM servers must support locally installed CXL 1.1 expansion devices or externally connected CXL 2.0 devices.
  • Supported Server OEMs include (but are not limited to):
    • Supermicro
    • Wiwynn
    • MSI

CXL Expansion Devices

  • This release supports locally installed memory expanders within the server or externally connected devices using a CXL 2.0 switch (e.g., XConn Technologies).
  • Supported Expansion Device Vendors include (but are not limited to):
    • SMART Modular (AIC & E3.S)
    • Montage (AIC)
    • Astera Labs (AIC)
    • Samsung (E3.S)
    • Micron (E3.S)

Note

All CXL devices and DRAM modules within a system must be of the same make, model, and capacity. Mixed configurations are not supported or tested in this release.

GPU

  • NVIDIA GPUs (for telemetry data).

Memory Machine Features

Quality-of-Service (QoS) Policy Engines

This release includes refinements to the existing QoS Policy Engines:

1. Latency Optimized Policy (Tiering)

  • This policy utilizes intercept-less Hotness Tracking to dynamically tier memory between DRAM and CXL, prioritizing low-latency access for hot data.
  • Optimization: Page movement across CXL devices at the same NUMA distance has been optimized for improved performance (see the snippet following this list).
  • Includes general bug fixes.
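
For context, CXL memory expanders typically appear to the operating system as CPU-less NUMA nodes, and Linux exposes inter-node distances through sysfs. The snippet below is illustrative only and not part of MMX; it prints those distances so you can see which nodes sit at the same NUMA distance from a given socket:

    # Illustrative only, not part of MMX: print the Linux NUMA distance
    # matrix from sysfs. CXL expanders usually surface as CPU-less NUMA
    # nodes; such nodes sharing a distance value relative to a CPU socket
    # are "at the same NUMA distance".
    from pathlib import Path

    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        distances = (node / "distance").read_text().split()
        cpus = (node / "cpulist").read_text().strip()
        kind = f"cpus {cpus}" if cpus else "CPU-less (possibly CXL)"
        print(f"{node.name}: distances {' '.join(distances)} ({kind})")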

2. Bandwidth Optimized Policy

  • This policy utilizes a decision engine to select and migrate a defined proportion of memory pages for monitored processes between DRAM and CXL.
  • Fixed Ratio Configuration: The memory ratio (e.g., 80% DRAM, 20% CXL) is defined in the configuration file and can be modified directly by the user or through the Memory Machine user interface (UI).
  • Multi-CXL Device Support:
    • Supports one or more CXL Memory Expansion Devices per CPU socket, leveraging interleaved bandwidth.
    • If multiple CXL devices are attached to a CPU socket, the CXL portion of memory is shared equally across all CXL devices. For instance, with an 80:20 ratio and two CXL devices, each CXL device hosts 10% of the application's data; with four CXL devices, each hosts 5% (see the sketch following this list).
    • Supports up to four CXL devices per CPU socket, aligning with the capabilities of AMD and Intel platforms.
  • Includes general bug fixes.
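
To make the split concrete, here is a minimal sketch of the fixed-ratio arithmetic described above; the function name and inputs are illustrative, not part of the MMX configuration or API:

    # Minimal sketch of the fixed-ratio split across the CXL devices on
    # one socket. The names here are illustrative, not an MMX interface.
    def per_device_share(cxl_percent: float, num_cxl_devices: int) -> float:
        """Percentage of the application's data hosted on each CXL device."""
        return cxl_percent / num_cxl_devices

    # With an 80:20 DRAM:CXL ratio, the 20% CXL portion divides evenly:
    for devices in (1, 2, 4):
        print(f"{devices} CXL device(s): {per_device_share(20, devices):.0f}% each")
    # 1 CXL device(s): 20% each
    # 2 CXL device(s): 10% each
    # 4 CXL device(s): 5% each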

GPU Telemetry

  • NVIDIA GPU telemetry data is now integrated into the monitoring dashboard, providing deeper insights into GPU utilization.
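
The release notes do not specify how this data is collected. For reference, the sketch below uses NVIDIA's NVML library (via the nvidia-ml-py / pynvml Python bindings) to read the kinds of per-GPU figures a telemetry dashboard typically surfaces; it illustrates the data involved and is not MMX's collector:

    # Read basic per-GPU telemetry through NVML (pip install nvidia-ml-py).
    # Shown only to illustrate the kind of data involved; not MMX code.
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            print(f"GPU {i} ({name}): {util.gpu}% busy, "
                  f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB used")
    finally:
        pynvml.nvmlShutdown()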

User Interface (UI) and REST API Enhancements

UI Updates

  • Multi-Node Management:
    • The UI now allows users to view and manage all compute nodes running MMX from a centralized interface.
    • A manual registration procedure has been added to register hosts, switches, and CXL appliances on the CXL Fabric.
    • Users can select a specific compute node to:
      • View its System Topology.
      • Enable/Disable the QoS feature.
      • Enable/Disable WSS/Insights.
  • GPU Integration in Topology: GPUs are now displayed in the server topology view within the UI.
  • GPU Telemetry in Dashboard: GPU telemetry metrics are integrated into the UI dashboard.
  • UI Authentication: A login and authentication page has been implemented to secure access to the Management UI.
  • HTTPS by Default: The Management UI server now uses HTTPS by default, ensuring encrypted communication.
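
As a client-side illustration, the sketch below logs in to the Management UI over HTTPS with TLS verification enabled. The host, port, endpoint path, credential payload, and certificate location are all assumptions made for the example, not a documented MMX API:

    # Hypothetical example of connecting to the HTTPS-secured Management UI.
    # Host, port, endpoint, payload, and cert path are illustrative only;
    # consult the MMX documentation for the actual interface.
    import requests

    BASE = "https://mmx-host.example.com:8443"    # assumed host and port

    session = requests.Session()
    # Verify against the server's certificate instead of disabling TLS checks.
    session.verify = "/etc/mmx/ui-cert.pem"       # assumed certificate path

    resp = session.post(f"{BASE}/api/login",      # assumed login endpoint
                        json={"username": "admin", "password": "changeme"})
    resp.raise_for_status()
    print("Authenticated over HTTPS, status", resp.status_code)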

REST API Security

  • Secure Backend: The REST API server has been secured to protect against unauthorized access and ensure data integrity.

Bug Fixes

  • This release includes various bug fixes for both the Latency Optimized and Bandwidth Optimized QoS policies, enhancing stability and performance. (Specific bug IDs are tracked internally.)

Known Issues and Limitations

  • The Bandwidth Optimized Policy currently uses a fixed ratio for page allocation. More intelligent, dynamic page selection based on access patterns is a future consideration.
  • Support for Large (2MiB) and Huge (1GiB) pages is not included in this release.

Contact and Support

For support, documentation, and more information about Memory Machine X, please visit the MemVerge official website or contact your MemVerge support representative.