Managing Nodes

Nodes form the foundation of your MemVerge AI cluster. Each node is a physical or virtual machine that runs one or more workloads and contributes CPU, GPU, memory, and storage resources to containerized applications.

Viewing the Node List

Navigate to the Nodes Page
- In the left navigation bar, select Nodes and Node Groups.
- Click the Nodes tab at the top of the page to display a list of all nodes in your cluster.
Node Overview
The Nodes table shows the following details for each node:
- Name: The node’s hostname or label.
- Status: The readiness status of the node.
- GPUs: The number of GPUs attached to the node.
- Roles: A list of roles (e.g., control-plane, worker).
- Version: The Kubernetes version running on the node (e.g., v1.31.6+k3s1).
- Internal/External IP: The node’s internal and external IP addresses, where assigned.
- OS-Image: The base operating system (e.g., Ubuntu 22.04.5 LTS).
- Kernel-Version: The kernel version in use.
- Age: How long the node has been part of the cluster.
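The columns above correspond closely to fields on a standard Kubernetes Node object, so the same overview can be cross-checked from the command line. The following is a minimal sketch, assuming you have kubectl access to the underlying cluster and have saved or piped the output of `kubectl get nodes -o json`. The function name `summarize_node`, the sample IP address, kernel string, and timestamps are illustrative and not part of the product, and the GPU count assumes GPUs are exposed through the `nvidia.com/gpu` resource.

```python
from datetime import datetime, timezone

ROLE_PREFIX = "node-role.kubernetes.io/"

def summarize_node(node: dict, now: datetime) -> dict:
    """Map one Kubernetes Node object to the columns of the Nodes table."""
    meta = node["metadata"]
    status = node["status"]
    labels = meta.get("labels", {})

    # Roles are encoded as labels of the form node-role.kubernetes.io/<role>.
    roles = sorted(k[len(ROLE_PREFIX):] for k in labels if k.startswith(ROLE_PREFIX))

    # Readiness comes from the condition whose type is "Ready".
    ready = next(
        (c["status"] for c in status.get("conditions", []) if c["type"] == "Ready"),
        "Unknown",
    )
    addresses = {a["type"]: a["address"] for a in status.get("addresses", [])}
    created = datetime.strptime(
        meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    info = status.get("nodeInfo", {})

    return {
        "name": meta["name"],
        "status": "Ready" if ready == "True" else "NotReady",
        # Assumption: GPUs are advertised as the nvidia.com/gpu resource.
        "gpus": int(status.get("capacity", {}).get("nvidia.com/gpu", "0")),
        "roles": roles or ["<none>"],
        "version": info.get("kubeletVersion", ""),
        "internal_ip": addresses.get("InternalIP", "<none>"),
        "external_ip": addresses.get("ExternalIP", "<none>"),
        "os_image": info.get("osImage", ""),
        "kernel_version": info.get("kernelVersion", ""),
        "age_days": (now - created).days,
    }

# Illustrative sample; real input would be one element of the "items" list
# in the JSON printed by `kubectl get nodes -o json`.
SAMPLE_NODE = {
    "metadata": {
        "name": "mvai-nvgpu02",
        "creationTimestamp": "2025-03-01T00:00:00Z",
        "labels": {
            "beta.kubernetes.io/arch": "amd64",
            "node-role.kubernetes.io/control-plane": "true",
        },
    },
    "status": {
        "capacity": {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
        "conditions": [{"type": "Ready", "status": "True"}],
        "addresses": [{"type": "InternalIP", "address": "10.0.0.12"}],
        "nodeInfo": {
            "kubeletVersion": "v1.31.6+k3s1",
            "osImage": "Ubuntu 22.04.5 LTS",
            "kernelVersion": "5.15.0-example",
        },
    },
}

if __name__ == "__main__":
    now = datetime(2025, 3, 11, tzinfo=timezone.utc)
    for column, value in summarize_node(SAMPLE_NODE, now).items():
        print(f"{column}: {value}")
```

Passing `now` explicitly keeps the age calculation deterministic; in practice you would pass `datetime.now(timezone.utc)`.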

Viewing Node Details

Click on a node’s Name to open the detailed node dashboard. The dashboard is divided into multiple tabs and sections that provide deeper insight:

Top Summary
- Node Name and Status: Confirms the node name (e.g., mvai-nvgpu02) and readiness.
- Labels: Shows key/value labels that categorize or configure scheduling on the node (e.g., beta.kubernetes.io/arch=amd64).
- Node Group: Displays the Node Group to which this node belongs (if any).
- GPUs: Number and model of attached GPUs (e.g., 1× NVIDIA-A10G).
- CPU Cores: Number of CPU cores on the node.
- Memory: Total system memory available on the node.
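The summary values can also be derived from a Node object's `status.capacity` and labels. The sketch below is an illustration under stated assumptions, not the dashboard's implementation: it assumes the GPU model is published in the `nvidia.com/gpu.product` label (set by NVIDIA GPU Feature Discovery when it is installed), and it handles only the common Kubernetes quantity suffixes. The helper names are hypothetical.

```python
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def parse_memory_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '32Gi' to bytes."""
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: the value is already in bytes

def parse_cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '8' or '7500m' (millicores) to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def top_summary(node: dict) -> dict:
    """Build the GPU, CPU, and memory lines of the node summary."""
    labels = node["metadata"].get("labels", {})
    capacity = node["status"].get("capacity", {})
    gpu_count = int(capacity.get("nvidia.com/gpu", "0"))
    # Assumption: the model label exists only if GPU Feature Discovery runs.
    gpu_model = labels.get("nvidia.com/gpu.product", "unknown")
    return {
        "gpus": f"{gpu_count}× {gpu_model}" if gpu_count else "none",
        "cpu_cores": parse_cpu_cores(capacity.get("cpu", "0")),
        "memory_gib": round(parse_memory_bytes(capacity.get("memory", "0")) / 2**30, 1),
    }

# Illustrative node fragment; the values are examples only.
EXAMPLE = {
    "metadata": {"labels": {"nvidia.com/gpu.product": "NVIDIA-A10G"}},
    "status": {"capacity": {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"}},
}

if __name__ == "__main__":
    print(top_summary(EXAMPLE))
```

Note that capacity reports the node's total resources; the amount actually schedulable for workloads is listed separately under `status.allocatable`.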
GPUs Tab
- GPU Model: Identifies the GPU vendor and model.
- GPU Utilization: Real-time graph indicating the percentage of GPU usage over time.
- GPU Memory Utilization: Tracks GPU memory usage as a percentage of total GPU memory.
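To sanity-check the two GPU graphs from a shell on the node itself, the same quantities can be read from `nvidia-smi`. This is a rough sketch: the documentation does not state where the dashboard gets its data, so treat this as an independent cross-check only. It computes memory utilization as used memory divided by total memory, matching the definition above. The function name and the sample line are illustrative.

```python
import csv
import io

# Output parsed here is assumed to come from:
#   nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# With "nounits", utilization is a percentage and memory values are in MiB.

def parse_gpu_rows(csv_text: str) -> list[dict]:
    """Parse nvidia-smi CSV rows into per-GPU utilization figures."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        name, util, used, total = (field.strip() for field in record)
        used_mib, total_mib = float(used), float(total)
        rows.append({
            "model": name,
            "gpu_util_pct": float(util),
            # Memory utilization as a percentage of total GPU memory.
            "mem_util_pct": round(100.0 * used_mib / total_mib, 1),
        })
    return rows

# Illustrative sample line, not captured from a real device.
SAMPLE_OUTPUT = "NVIDIA A10G, 37, 6000, 24000\n"

if __name__ == "__main__":
    for gpu in parse_gpu_rows(SAMPLE_OUTPUT):
        print(gpu)
```

The `utilization.memory` field that `nvidia-smi` also offers measures memory-controller activity rather than the share of memory in use, which is why the sketch derives the percentage from `memory.used` and `memory.total` instead.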
Metrics Tab
- Provides detailed metrics for CPU, memory, and additional resource usage.
- Displays usage trends over time to help you identify performance bottlenecks or spikes.
Conditions/Logs Tab
- Shows the node’s health conditions and status messages reported by Kubernetes.
- May include logs or events relevant to node management, troubleshooting, and cluster operations.
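When reading the health conditions, it helps to know which status is the healthy one for each condition type. In standard Kubernetes, `Ready` is healthy when its status is `True`, while pressure-style conditions such as `MemoryPressure`, `DiskPressure`, `PIDPressure`, and `NetworkUnavailable` are healthy when `False`. The sketch below applies that rule to the `status.conditions` list of a Node object; the function name is hypothetical, and treating any other condition type as healthy when `False` is an assumption.

```python
# Healthy status per standard condition type. Types not listed here are
# assumed (for this sketch) to be healthy when their status is "False".
HEALTHY_STATUS = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}

def unhealthy_conditions(conditions: list[dict]) -> list[tuple[str, str, str]]:
    """Return (type, status, message) for every condition that is not healthy."""
    problems = []
    for cond in conditions:
        expected = HEALTHY_STATUS.get(cond["type"], "False")
        if cond["status"] != expected:
            problems.append((cond["type"], cond["status"], cond.get("message", "")))
    return problems

# Illustrative conditions, shaped like status.conditions on a Node object.
SAMPLE_CONDITIONS = [
    {"type": "Ready", "status": "True", "message": "kubelet is posting ready status"},
    {"type": "MemoryPressure", "status": "False", "message": ""},
    {"type": "DiskPressure", "status": "True", "message": "kubelet has disk pressure"},
]

if __name__ == "__main__":
    for ctype, status, message in unhealthy_conditions(SAMPLE_CONDITIONS):
        print(f"{ctype}={status}: {message}")
```

A status of `Unknown` is reported as a problem as well, since it usually means the node has stopped reporting to the control plane.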

Tip: Use the Nodes tab for a quick cluster-wide overview, then drill into any node’s detail page for real-time GPU utilization, CPU usage metrics, and health conditions. This information is essential when troubleshooting performance issues or monitoring resource availability.