Infrastructure Map
The Infrastructure Map provides a dynamic, visual representation of the topology and real-time health of your GPU Cluster Manager cluster. It lets administrators quickly understand the relationships between nodes, node groups, and management components, and monitor their status and resource utilization at a glance.
The screen displays an interactive diagram illustrating the interconnected components of your GPU Cluster Manager deployment:
- Cluster Topology Visualization (Main Area): This central area graphically depicts your cluster components and their connections:
  - Nodes: Represented by small circles or squares, each labeled with its hostname (e.g., mvai-nvgpu01, mvai-nvgpu02, mvai-mgmt).
  - Node Groups: Larger circles represent logical groupings of nodes. In the image, ng-amd-23.49-nvidia-a10g-570.124.06-12.8 is shown as a Node Group containing "1 Node".
  - Connections: Lines connecting the nodes and node groups illustrate network or logical relationships within the cluster.
  - mvai-mgmt: A distinct node, likely representing the management or control plane of the MemVerge.AI system.
- Node Details Card (Left Panel): When a specific node is selected or hovered over (as Node - mvai-nvgpu02 is in the image), a detailed information card appears on the left. This card provides real-time metrics for that node:
  - Node Name and Status: Displays the full hostname and its current operational status (e.g., "Ready").
  - Resource Summary: Shows the total allocated CPU, Memory (in GiB), and number of GPUs on that node.
  - Utilization Metrics: Provides current percentage utilization for:
    - GPU
    - GPU Memory
    - CPU
    - CPU Memory

    (Note: In the image above, all utilization metrics are 0.00%, indicating the node is currently idle.)
- Map Controls (Bottom Right):
  - Zoom Slider: Allows you to zoom in and out of the infrastructure map for a more granular or broader view.
  - Reset View Button: An icon that resets the map to its default zoom level and position.
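One way to picture what the map and the details card display is as a small data model of nodes and node groups. The Python sketch below is illustrative only: the class and field names are hypothetical and are not part of any product API, the CPU, memory, and GPU figures are made-up sample values, and placing mvai-nvgpu02 inside the node group is an assumption (the description above only says the group contains "1 Node").

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one node as shown on the map."""
    hostname: str
    status: str = "Ready"      # e.g. "Ready", "Not Ready", "Unknown"
    cpu_cores: int = 0         # sample allocation values, not real data
    memory_gib: float = 0.0
    gpus: int = 0
    # Percent utilization for the four metrics listed on the details card.
    utilization: dict = field(default_factory=lambda: {
        "GPU": 0.0, "GPU Memory": 0.0, "CPU": 0.0, "CPU Memory": 0.0,
    })

    def is_idle(self) -> bool:
        """A node is treated as idle when every utilization metric is zero."""
        return all(pct == 0.0 for pct in self.utilization.values())

    def details_card(self) -> str:
        """Render a plain-text approximation of the details card."""
        lines = [
            f"Node - {self.hostname} ({self.status})",
            f"CPU: {self.cpu_cores}  Memory: {self.memory_gib} GiB  GPUs: {self.gpus}",
        ]
        lines += [f"{name}: {pct:.2f}%" for name, pct in self.utilization.items()]
        return "\n".join(lines)


@dataclass
class NodeGroup:
    """Hypothetical model of a logical grouping of nodes."""
    name: str
    nodes: list = field(default_factory=list)

    def label(self) -> str:
        """Group name plus a node count, similar to the map's group label."""
        count = len(self.nodes)
        return f"{self.name} ({count} Node{'' if count == 1 else 's'})"


# Sample values only; group membership is an assumption for illustration.
node = Node("mvai-nvgpu02", cpu_cores=8, memory_gib=32.0, gpus=1)
group = NodeGroup("ng-amd-23.49-nvidia-a10g-570.124.06-12.8", [node])

print(group.label())
print(node.details_card())
```

Keeping utilization as a dictionary keyed by metric name mirrors the four entries on the card, so the idle check in the note above reduces to testing that every value is zero.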
How to Use the Infrastructure Map:
- Visualize Cluster Topology: Understand the layout of your nodes, node groups, and management components.
- Monitor Node Health: Quickly identify which nodes are Ready or if any are experiencing Not Ready or Unknown states, which would be indicated by different colors in the node count and on the map.
- Inspect Node Resources: Click or select individual nodes to view their allocated resources and current utilization, which is crucial for capacity planning and troubleshooting performance issues.
- Troubleshooting: Pinpoint problematic nodes or connections visually during system diagnostics.
- Confirm Deployment: Verify that all expected nodes and node groups are registered and visible in the cluster.
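The health and deployment checks in this list can also be reasoned about as simple operations on a list of hostnames and statuses. The Python sketch below assumes you have such a list from some source; how (or whether) the product exports this data is not covered here, so the function names, the sample statuses, and the extra hostname mvai-nvgpu03 are all hypothetical.

```python
from collections import Counter


def summarize_health(nodes):
    """Count nodes per status, like a per-status node count."""
    return Counter(status for _, status in nodes)


def unhealthy(nodes):
    """Return hostnames whose status is anything other than "Ready"."""
    return sorted(host for host, status in nodes if status != "Ready")


def missing_nodes(nodes, expected):
    """Return expected hostnames that do not appear in the observed list."""
    seen = {host for host, _ in nodes}
    return sorted(set(expected) - seen)


# Hypothetical observations; statuses are sample values, not real output.
observed = [
    ("mvai-nvgpu01", "Ready"),
    ("mvai-nvgpu02", "Not Ready"),
    ("mvai-mgmt", "Ready"),
]
# mvai-nvgpu03 is an invented hostname used to show a missing node.
expected = ["mvai-nvgpu01", "mvai-nvgpu02", "mvai-nvgpu03", "mvai-mgmt"]

print(summarize_health(observed))
print(unhealthy(observed))
print(missing_nodes(observed, expected))
```

The first two functions correspond to the Monitor Node Health and Troubleshooting items, and the last one to Confirm Deployment: any hostname it returns is a node you expected but that is not visible in the cluster.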
This map offers administrators an intuitive way to manage and monitor the underlying infrastructure powering their AI workloads.