GPU Cluster Manager Dashboard Overview

The GPU Cluster Manager Dashboard provides a comprehensive, real-time overview of your AI workloads, cluster resources, and overall system health. It serves as your central monitoring hub, allowing administrators and users to quickly assess the status and utilization of their AI infrastructure.

Please Note: The GPU Cluster Manager Dashboard contains many views and a large amount of information about your GPUs, so be sure to scroll down in your browser to see all of it.

Below is a screenshot of the GPU Cluster Manager Dashboard on an idle system (figure: GPU Dashboard upon Initial Login).

Dashboard Breakdown

The dashboard is organized into several key sections, offering both high-level summaries and detailed utilization metrics:

  1. Header and Navigation:
    • "MemVerge.AI" Logo: Located at the top left, identifying MemVerge branding for the GPU Cluster Manager.
    • AI Factory: Located just under the MemVerge.AI logo, this shows the name of the cluster managed by GPU Cluster Manager.
    • "Dashboard" subtitle: Appears just under the cluster name and confirms the current view.
    • Global Actions (Top Right): A user avatar icon provides access to profile or settings; a system status icon may also appear here.
    • Left Navigation Bar: A series of icons (though not labeled in this view) indicates various application sections. For more information, check out the Navigation Bar section below.
  2. Workload Status Summary (Top Row Cards): This row provides a quick count of all workloads categorized by their current state:
    • Total Workloads: The total number of workloads defined in the system.
    • Running: Workloads currently active.
    • Pending: Workloads waiting to be scheduled or started.
    • Succeeded: Workloads that have completed successfully.
    • Failed: Workloads that terminated with an error.
    • Evicted: Workloads that were removed from a node (e.g., due to resource contention).
    • Preempted: Workloads that were stopped to free up resources for higher-priority tasks.
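The summary cards above amount to a count of workloads grouped by state. As a minimal sketch (the state names come from the dashboard itself; the workload list and its field names are hypothetical):

```python
from collections import Counter

# Workload states as shown on the dashboard's summary cards.
STATES = ["Running", "Pending", "Succeeded", "Failed", "Evicted", "Preempted"]

def summarize(workloads):
    """Count workloads by state; 'Total Workloads' is the overall count."""
    counts = Counter(w["state"] for w in workloads)
    summary = {"Total Workloads": len(workloads)}
    summary.update({s: counts.get(s, 0) for s in STATES})
    return summary

# Hypothetical sample data:
workloads = [
    {"name": "train-llm", "state": "Running"},
    {"name": "fine-tune", "state": "Pending"},
    {"name": "eval-run", "state": "Succeeded"},
]
print(summarize(workloads))
```

Note that states the cluster currently has no workloads in (e.g., Failed on a healthy system) still appear with a count of zero, matching the dashboard's always-visible cards.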
  3. Resource and Organizational Overview (Second Row Cards): This section summarizes key operational and resource counts:
    • Departments: Number of configured departments.
    • Projects: Number of active projects.
    • Node Groups: Number of defined groups of compute nodes.
    • Nodes: Total number of nodes in the cluster, with a breakdown of Ready and Not Ready/Unknown states.
    • GPUs: Total number of GPUs available across all nodes.
    • (Note: Counts are currently low or zero in the image, reflecting a new or minimally configured system.)
  4. Overall Cluster Utilization Graphs: Two large time-series graphs provide a historical view of aggregate resource consumption:
    • Overall GPU Utilization: Displays the total percentage of GPU resources being used over time, along with current utilization and memory utilization.
    • Overall CPU Utilization: Displays the total percentage of CPU resources being used over time, along with current utilization and memory utilization.
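An "overall" utilization figure like the ones in these graphs is an aggregate over all devices in the cluster. One plausible aggregation is a simple unweighted mean; the dashboard's actual method is not documented here, so this sketch is an assumption:

```python
def overall_utilization(per_device: list[float]) -> float:
    """Average per-device utilization percentages into one
    cluster-wide figure.

    Unweighted mean; the dashboard's real aggregation (e.g.,
    weighting by device count or capacity) is an assumption here.
    """
    if not per_device:
        return 0.0  # an idle or empty cluster reads 0%
    return sum(per_device) / len(per_device)

# Hypothetical samples for a 4-GPU cluster:
print(overall_utilization([10.0, 20.0, 30.0, 40.0]))  # 25.0
```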
  5. Projects-Specific Utilization (Placeholder Charts): Four circular charts are present, likely intended to display utilization metrics broken down by projects. In this empty state, they serve as placeholders:
    • Projects - GPU Utilization
    • Projects - GPU Memory Utilization
    • Projects - CPU Utilization
    • Projects - CPU Memory Utilization
    • (These will populate with data as projects are initiated and consume resources.)
  6. GPU Utilization Heatmap: Located at the bottom, this visualizer breaks down GPU utilization across segments or time intervals using a color-coded legend (e.g., 0-20% blue, >80% red), allowing for quick identification of highly utilized or idle GPUs.
    • (The heatmap in the image shows a consistent blue, indicating minimal or no GPU activity.)
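The heatmap's coloring boils down to mapping each utilization percentage into a legend bucket. A minimal sketch follows; only the 0-20% (blue) and >80% (red) bands are stated in the legend above, so the intermediate bands and their colors are illustrative assumptions:

```python
def heatmap_bucket(utilization: float) -> str:
    """Map a GPU utilization percentage to a legend color.

    The 0-20% (blue) and >80% (red) buckets come from the dashboard
    legend; the two intermediate bands are assumptions for illustration.
    """
    if utilization <= 20:
        return "blue"    # idle or lightly used
    elif utilization <= 50:
        return "green"   # assumed intermediate band
    elif utilization <= 80:
        return "yellow"  # assumed intermediate band
    else:
        return "red"     # heavily utilized

# An idle cluster, as in the screenshot, maps every GPU to blue:
print([heatmap_bucket(u) for u in (0, 5, 12)])  # ['blue', 'blue', 'blue']
```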
  7. Footer:
    • Copyright Information: Displays copyright details (e.g., "Copyright © 2018-2025 MemVerge Inc. All Rights Reserved.") at the very bottom.

The dashboard's design prioritizes clarity and immediate understanding of system status, making it easy to spot trends, identify bottlenecks, or confirm the health of your AI infrastructure at a glance.