Known Issues¶
This section outlines the currently known issues in this release. We are actively working on resolving these issues and will provide updates in future releases.
Reporting New Issues¶
If you encounter an issue not listed here, please report it to our support team. Visit the Support Page for more information.
Known Issue List¶
GPU Utilization Heatmap shows values 60-70 in the default color (dark or light mode)
- Issue #: AIP-1044
- Severity: Medium
- Affected Components: GPU Utilization Heatmap
- Impacted Versions: 0.5.0
- Description: The heatmap is missing the color-range mapping for values between 60 and 70. The values appear, but the corresponding heatmap color does not.
- Workaround: None
- Cause: Known
- Solution: This issue will be resolved in a future release.
Mismatch in workload count on the project summary page
- Issue #: AIP-1043
- Severity: Medium
- Affected Components: Reporting on Project Summary Page
- Impacted Versions: 0.5.0
- Description: Workload counts on the project list page may be inaccurate because the count does not update correctly when certain workloads are deleted. The project detail page displays the correct count.
- Workaround: None
- Cause: Known
- Solution: This issue will be resolved in a future release.
MemVerge GPU Cluster Manager is not supported on Red Hat Enterprise Linux (RHEL)
- Issue #: AIP-927
- Severity: Low
- Affected Components: Installation
- Impacted Versions: 0.3.0+
- Description: The AMD Operator is not supported on RHEL; therefore, the MemVerge GPU Cluster Manager is not supported on servers running RHEL with AMD GPUs.
- Cause: This issue is present in the AMD operator.
- Workaround: None
- Solution: The MemVerge GPU Cluster Manager will support RHEL with AMD GPUs when the AMD Operator supports RHEL.
Workloads get into an unknown state when there is insufficient CPU to process the request
- Issue #: AIP-712
- Severity: Low
- Affected Components: Scheduler
- Impacted Versions: 0.4.0
- Description: In the 0.4.0 release, Kueue is aware of all submitted workloads. If resources are not available, the workloads fail the admission check and remain in an "Unknown" state, without a status.admission field. They stay in this state (often visualized as gray in the UI) until the required resources become available.
- Workaround: Ensure the cluster has enough CPU, memory, and GPUs for all anticipated workloads; a way to inspect pending workloads is sketched after this list.
- Cause: Known
- Solution: The issue cannot be resolved because the behavior depends on Kubernetes and Kueue.
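The following is a minimal sketch of how an administrator might inspect workloads that have not been admitted, assuming cluster access with kubectl and that the Kueue CRDs are installed; the workload and namespace names are placeholders.

```sh
# List Kueue workloads across all namespaces; entries that have not been
# admitted are still waiting for resources.
kubectl get workloads.kueue.x-k8s.io -A

# Inspect a specific workload; an empty result for status.admission means
# it has not been admitted yet (shown as "Unknown"/gray in the UI).
kubectl get workload <workload-name> -n <project-namespace> \
  -o jsonpath='{.status.admission}'
```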
A new Department created using a previously deleted Department Name will have the old Billing/Usage Information
- Issue #: AIP-609
- Severity: Low
- Affected Components: Billing/Department
- Impacted Versions: 0.3.0+
- Description: When creating a new Department using the same name as a previously deleted department, the billing information from the previous department may be accessible.
- Cause: This is by design. Billing information primarily uses the Department name for uniqueness, so reusing an old name surfaces that department's historical information.
- Workaround: None
- Solution: Ensure Department names are unique, unless this is desired.
Cannot Create a Node Group with the same Name assigned to Different Departments
- Issue #: AIP-535
- Severity: Low
- Affected Components: Node Group
- Impacted Versions: 0.3.0+
- Description: Creating a new Node Group using the same name as an existing Node Group will fail, even if the Node Group is assigned to a different Department.
- Cause: Names are scoped globally across the cluster rather than per Department.
- Workaround: None
- Solution: Ensure Node Group names are unique across the entire cluster.
NVIDIA GPU PIDs/TIDs or Usage are not fully visible in the Workspace Terminal
- Issue #: AIP-509
- Severity: Low
- Affected Components: Workspaces
- Impacted Versions: 0.3.0+
- Description: Inside a user workspace, running `nvidia-smi` shows some, but not all, of the expected information (see the sketch after this list).
- Cause: This is a known security limitation of the NVIDIA Operator and containers. See Cannot see gpu threads in container for more information.
- Workaround: None
- Solution: None
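As a rough illustration of the limitation, and assuming `nvidia-smi` is available inside the workspace image, the commands below show what remains queryable; exact output depends on the driver version.

```sh
# Inside the workspace terminal: device information is reported, but
# per-process details (PIDs/TIDs) may be missing or incomplete because
# the container cannot see other PID namespaces.
nvidia-smi

# Per-GPU utilization can still be queried directly.
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv
```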
OAuth does not work with Enterprise GitHub Accounts
- Issue #: AIP-503
- Severity: Low
- Affected Components: Security
- Impacted Versions: 0.3.0+
- Description: When the GitHub OAuth provider is enabled, GitHub Enterprise users cannot log in.
- Cause: This is an issue in the Rancher OAuth provider.
- Workaround: None
- Solution: A fix will be made available in a future release.
When using a 'local' StorageClass, Persistent Volume Claims (PVCs) for Workspaces may not be reusable once the Workspace is deleted
- Issue #: AIP-482
- Severity: Low
- Affected Components: Workspace Volumes
- Impacted Versions: 0.3.0+
- Description: If a Volume is created using the `local` StorageClass (e.g., a local NVMe SSD) and used by a workspace, then once the workspace is stopped and deleted, the volume may not be usable by any other workspace.
- Cause: Once a PVC is claimed by a workspace pod, its status becomes Bound. If the pod is deleted, a new workload cannot reuse the PVC. If the workspace pod attempts to start on another node, it will not have access to the assigned PVC, which resides on a different worker node.
- Workaround: None
- Solution: Always use storage that is accessible by all worker nodes in the cluster, such as NFS, to avoid this issue. The sketch after this list shows how to inspect a stranded local PVC.
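The following is a minimal sketch, assuming cluster access with kubectl, of how to confirm that a leftover volume is pinned to a single worker; the PVC, PV, and namespace names are placeholders.

```sh
# Find the PV backing the workspace PVC and its binding status.
kubectl get pvc <workspace-pvc> -n <project-namespace>

# Local PVs carry a node affinity that pins them to the original worker;
# a replacement workspace scheduled on another node cannot mount them.
kubectl describe pv <bound-pv-name> | grep -A 5 "Node Affinity"
```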
User Account Retention Policy is Disabled
- Issue #: AIP-469
- Severity: Low
- Affected Components: RBAC/Security
- Impacted Versions: 0.3.0+
- Description: The user account retention policy, which automatically disables dormant accounts, has been disabled.
- Cause: N/A
- Workaround: None
- Solution: In a future release, this feature will be improved so that dormant accounts are automatically disabled, preventing those users from logging in. This is a security improvement.
Workspace Storage Usage via NFS is not Enforced
- Issue #: AIP-454
- Severity: Low
- Affected Components: Workspace Volumes/Storage
- Impacted Versions: 0.3.0+
- Description: When workspace volumes are backed by NFS, users may be able to write more data than was requested. For example, if a user creates a 10GB volume, more than 10GB can be written without ENOSPC or other errors (see the sketch after this list).
- Cause: This is a known issue. See nfs-subdir-external-provisioner: No restrictions on PVC.
- Workaround: None
- Solution: None
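As an illustrative check from inside a workspace, assuming the NFS-backed volume is mounted at /data (the mount path is a placeholder):

```sh
# The reported filesystem size is that of the underlying NFS export,
# not the requested PVC size, so a 10GB request is not enforced here.
df -h /data

# Actual space consumed on the workspace volume.
du -sh /data
```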
Multi-Pod Workloads/Workspaces may not be correctly admitted to the Kubernetes cluster
- Issue #: AIP-394
- Severity: Low
- Affected Components: Workspaces
- Impacted Versions: 0.3.0+
- Description: When a workspace with multiple Kubernetes pods is created, it is possible that only some of the pods are deployed; one way to check is sketched after this list.
- Cause: If there are not enough resources in the chosen project, a workload that would need to span multiple projects is not scheduled.
- Workaround: None
- Solution: In the future, we will improve this workflow and feature to allow multi-pod workloads to borrow resources from other projects when resources are unavailable in the primary project.
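The following is a hypothetical check with kubectl; the namespace, label selector, and pod name are placeholders that depend on how the workspace labels its pods.

```sh
# List the workspace's pods and where they landed; Pending pods were not
# admitted for lack of resources.
kubectl get pods -n <project-namespace> -l <workspace-label> -o wide

# The events at the end of describe usually name the missing resource.
kubectl describe pod <pending-pod-name> -n <project-namespace> | tail -n 20
```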
Creating a Node Group using the same GPU Make/Model in different Modes is not Supported
- Issue #: AIP-392
- Severity: Low
- Affected Components: Node Group
- Impacted Versions: 0.3.0+
- Description: Creating a Node Group using GPUs of the same make/model but configured in different modes (NVIDIA MIG, for example) is not supported. Node Group creation requires GPUs of the same make/model and mode.
- Cause: Hybrid GPU configurations are not supported in the same Node Group.
- Workaround: None
- Solution: Ensure all GPUs are of the same make/model and have been configured in the same mode before creating a Node Group (see the sketch after this list).
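A minimal pre-check sketch, assuming NVIDIA GPUs and a driver recent enough to report MIG mode; run it on each GPU node before creating the Node Group.

```sh
# List the GPU models present on the node.
nvidia-smi -L

# Confirm every GPU reports the same MIG mode.
nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv
```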
A Default Node Group is not Automatically Created when the Kubernetes Cluster is a Single Server, or when the Management/Control Plane Node has GPUs
- Issue #: AIP-388
- Severity: Low
- Affected Components: Node Group
- Impacted Versions: 0.3.0+
- Description: MemVerge.ai creates a default node group after installation. In very small clusters, where the cluster is a single node or the control plane is installed on a GPU worker, a default node group may not be automatically created.
- Cause: This is expected behavior: workloads running on the control/management node may cause performance issues under high demand.
- Workaround: None
- Solution: Manually create node groups using the available node, or install the control/management plane on a dedicated CPU-only host. The sketch after this list shows how to check which nodes expose GPUs.
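The following is a minimal sketch, assuming GPUs are advertised through the NVIDIA device plugin (the resource name is amd.com/gpu for AMD GPUs), of how to see which nodes carry schedulable GPUs before creating a node group manually.

```sh
# Show allocatable NVIDIA GPUs per node; <none> means the node exposes
# no GPUs to the scheduler.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```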