Managing Nodes
Nodes form the foundation of your MemVerge AI cluster. Each node is a computing resource (a physical or virtual machine) that runs one or more workloads. A node can provide CPU, GPU, memory, and storage resources to the containerized applications it runs.
Viewing the Node List
- Navigate to the Nodes Page
  - In the left navigation bar, select Nodes and Node Groups.
  - Click the Nodes tab at the top of the page to display a list of all nodes in your cluster.
- Node Overview
  The Nodes table shows the following details for each node:
  - Name: The node’s hostname or label.
  - Status: The readiness status of the node.
  - GPUs: The number of GPUs attached to the node.
  - Roles: A list of roles (e.g., control-plane, worker).
  - Version: The Kubernetes version running on the node (e.g., v1.31.6+k3s1).
  - Internal/External IP: The node’s IP addresses (internal or external).
  - OS-Image: The base operating system (e.g., Ubuntu 22.04.5 LTS).
  - Kernel-Version: The kernel version in use.
  - Age: How long the node has been part of the cluster.
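The columns above closely mirror fields on a standard Kubernetes Node object, so the same overview can be approximated outside the UI. The sketch below is illustrative only: it assumes the node data is available as a plain dictionary in the Kubernetes Node JSON shape, that GPUs are advertised under the `nvidia.com/gpu` resource name (the NVIDIA device plugin convention), and that roles come from `node-role.kubernetes.io/<role>` labels. The function name `node_row`, the IP address, and the kernel string in the sample are hypothetical; how the MemVerge AI console itself builds the table is not documented here.

```python
from datetime import datetime, timezone

ROLE_PREFIX = "node-role.kubernetes.io/"  # standard Kubernetes role label prefix


def node_row(node, now=None):
    """Flatten one Kubernetes Node object (a dict) into the table columns."""
    now = now or datetime.now(timezone.utc)
    meta = node.get("metadata", {})
    status = node.get("status", {})
    info = status.get("nodeInfo", {})
    labels = meta.get("labels", {})

    # A node is ready when its "Ready" condition has status "True".
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )

    # Roles are encoded as label keys of the form node-role.kubernetes.io/<role>.
    roles = sorted(k[len(ROLE_PREFIX):] for k in labels if k.startswith(ROLE_PREFIX))

    # Addresses are a list of {"type": ..., "address": ...} entries.
    addresses = {a["type"]: a["address"] for a in status.get("addresses", [])}

    created = datetime.strptime(
        meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)

    return {
        "Name": meta.get("name"),
        "Status": "Ready" if ready else "NotReady",
        # Assumption: GPUs are exposed as the nvidia.com/gpu extended resource.
        "GPUs": int(status.get("capacity", {}).get("nvidia.com/gpu", "0")),
        "Roles": roles or ["<none>"],
        "Version": info.get("kubeletVersion"),
        "Internal IP": addresses.get("InternalIP"),
        "External IP": addresses.get("ExternalIP"),
        "OS-Image": info.get("osImage"),
        "Kernel-Version": info.get("kernelVersion"),
        "Age (days)": (now - created).days,
    }


# Hypothetical sample node; the IP and kernel values are made up for illustration.
sample = {
    "metadata": {
        "name": "mvai-nvgpu02",
        "creationTimestamp": "2025-01-01T00:00:00Z",
        "labels": {
            "beta.kubernetes.io/arch": "amd64",
            "node-role.kubernetes.io/worker": "true",
        },
    },
    "status": {
        "capacity": {"cpu": "8", "nvidia.com/gpu": "1"},
        "conditions": [{"type": "Ready", "status": "True"}],
        "addresses": [{"type": "InternalIP", "address": "10.0.0.12"}],
        "nodeInfo": {
            "kubeletVersion": "v1.31.6+k3s1",
            "osImage": "Ubuntu 22.04.5 LTS",
            "kernelVersion": "6.8.0-generic",
        },
    },
}

if __name__ == "__main__":
    for column, value in node_row(sample).items():
        print(f"{column}: {value}")
```

If you have `kubectl` access to the cluster, the input dictionaries can be taken from the `items` list in the output of `kubectl get nodes -o json`.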
Viewing Node Details
Click on a node’s Name to open the detailed node dashboard. The dashboard is divided into multiple tabs and sections that provide deeper insight:
- Top Summary
  - Node Name and Status: Confirms the node name (e.g., mvai-nvgpu02) and readiness.
  - Labels: Shows key/value labels that categorize or configure scheduling on the node (e.g., beta.kubernetes.io/arch=amd64).
  - Node Group: Displays the Node Group to which this node belongs (if any).
  - GPUs: Number and model of attached GPUs (e.g., 1× NVIDIA-A10G).
  - CPU Cores: Number of CPU cores on the node.
  - Memory: Total system memory available on the node.
- GPUs Tab
  - GPU Model: Identifies the GPU vendor and model.
  - GPU Utilization: Real-time graph indicating the percentage of GPU usage over time.
  - GPU Memory Utilization: Tracks GPU memory usage as a percentage of total GPU memory.
- Metrics Tab
  - Provides detailed metrics for CPU, memory, and additional resource usage.
  - Displays usage trends over time to help you identify performance bottlenecks or spikes.
- Conditions/Logs Tab
  - Shows the node’s health conditions and status messages reported by Kubernetes.
  - May include logs or events relevant to node management, troubleshooting, and cluster operations.
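The health conditions reported by Kubernetes follow a fixed convention: `Ready` is healthy when its status is `True`, while the pressure and network conditions are healthy when their status is `False`. As a rough sketch of how those conditions can be read programmatically, the helper below lists every standard condition that deviates from its healthy value. It assumes a Node object in the Kubernetes JSON shape; the name `unhealthy_conditions` and the sample data are hypothetical, and conditions outside the standard set are ignored.

```python
# Expected status of each standard node condition when the node is healthy.
HEALTHY_STATUS = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}


def unhealthy_conditions(node):
    """Return (type, status, reason, message) for each condition that is not healthy."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        expected = HEALTHY_STATUS.get(cond.get("type"))
        # Skip condition types we do not know how to interpret.
        if expected is None:
            continue
        # Any other value, including "Unknown", counts as a problem.
        if cond.get("status") != expected:
            problems.append(
                (
                    cond["type"],
                    cond.get("status"),
                    cond.get("reason", ""),
                    cond.get("message", ""),
                )
            )
    return problems


# Hypothetical node that is Ready but reporting disk pressure.
sample_node = {
    "metadata": {"name": "mvai-nvgpu02"},
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {
                "type": "DiskPressure",
                "status": "True",
                "reason": "KubeletHasDiskPressure",
                "message": "kubelet has disk pressure",
            },
            {"type": "Ready", "status": "True"},
        ]
    },
}

if __name__ == "__main__":
    for ctype, cstatus, reason, message in unhealthy_conditions(sample_node):
        print(f"{ctype}={cstatus} ({reason}): {message}")
```

An empty result means every standard condition on the node is at its healthy value; a `Ready` status of `Unknown` is reported as a problem because the node has stopped posting status.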
Tip: Use the Nodes tab for a quick cluster-wide overview, then drill into any node’s detail page for real-time GPU utilization, CPU usage metrics, and health conditions. This information is essential when troubleshooting performance issues or monitoring resource availability.