Known Issues¶
This section outlines the currently known issues in this release. We are actively working on resolving these issues and will provide updates in future releases.
Reporting New Issues¶
If you encounter an issue not listed here, please report it to our support team. Visit the Support Page for more information.
Known Issue List¶
GPU Utilization Heatmap shows values 60-70 in the default color (dark or light mode)
- Issue #: AIP-1044
- Severity: Medium
- Affected Components: GPU Utilization Heatmap
- Impacted Versions: 0.5.0
- Description: The heatmap is missing the color-range mapping for values between 60 and 70. The values appear, but the corresponding heatmap color does not.
- Workaround: None
- Cause: Known
- Solution: This issue will be resolved in a future release.
Mismatch in workload count on the project summary page
- Issue #: AIP-1043
- Severity: Medium
- Affected Components: Reporting on Project Summary Page
- Impacted Versions: 0.5.0
- Description: Workload counts on the project list page may be inaccurate because the count does not update correctly when certain workloads are deleted. The project detail page displays the correct count.
- Workaround: None
- Cause: Known
- Solution: This issue will be resolved in a future release.
MemVerge GPU Cluster Manager is not supported on Red Hat Enterprise Linux (RHEL)
- Issue #: AIP-927
- Severity: Low
- Affected Components: Installation
- Impacted Versions: 0.3.0+
- Description: The AMD Operator is not supported on RHEL; therefore, the MemVerge GPU Cluster Manager is not supported on servers running RHEL with AMD GPUs.
- Cause: This issue is present in the AMD operator.
- Workaround: None
- Solution: The MemVerge GPU Cluster Manager will support RHEL with AMD GPUs when the AMD Operator supports RHEL.
Workloads get into an unknown state when there is insufficient CPU to process the request
- Issue #: AIP-712
- Severity: Low
- Affected Components: Scheduler
- Impacted Versions: 0.4.0
- Description: In the 0.4.0 release, Kueue is aware of all submitted workloads. If resources are not available, the workloads fail the admission check and remain in an "Unknown" state, without a status.admission field. They stay in this state (often visualized as gray in the UI) until the required resources become available.
- Workaround: Ensure the cluster has enough CPU, memory, and GPUs for all anticipated workloads; a way to inspect pending workloads is sketched after this list.
- Cause: Known
- Solution: The issue cannot be resolved because the behavior depends on Kubernetes and Kueue.
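The following is a minimal sketch of how an administrator might inspect workloads that have not been admitted, assuming cluster access with kubectl and that the Kueue CRDs are installed; the workload and namespace names are placeholders.

```sh
# List Kueue workloads across all namespaces; entries that have not been
# admitted are still waiting for resources.
kubectl get workloads.kueue.x-k8s.io -A

# Inspect a specific workload; an empty result for status.admission means
# it has not been admitted yet (shown as "Unknown"/gray in the UI).
kubectl get workload <workload-name> -n <project-namespace> \
  -o jsonpath='{.status.admission}'
```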
A new Department created using a previously deleted Department Name will have the old Billing/Usage Information
- Issue #: AIP-609
- Severity: Low
- Affected Components: Billing/Department
- Impacted Versions: 0.3.0+
- Description: When creating a new Department using the same name as a previously deleted department, the billing information from the previous department may be accessible.
- Cause: This is by design. Billing information primarily uses the Department name for uniqueness, so reusing an old name surfaces that department's historical information.
- Workaround: None
- Solution: Ensure Department names are unique, unless this is desired.
Cannot Create a Node Group with the same Name assigned to Different Departments
- Issue #: AIP-535
- Severity: Low
- Affected Components: Node Group
- Impacted Versions: 0.3.0+
- Description: Creating a new Node Group using the same name as an existing Node Group will fail, even if the Node Group is assigned to a different Department.
- Cause: Names are scoped globally across the cluster rather than per Department.
- Workaround: None
- Solution: Ensure Node Group names are unique across the entire cluster.
NVIDIA GPU PIDs/TIDs or Usage are not fully visible in the Workspace Terminal
- Issue #: AIP-509
- Severity: Low
- Affected Components: Workspaces
- Impacted Versions: 0.3.0+
- Description: Inside a user workspace, running `nvidia-smi` shows some, but not all, of the expected information (see the sketch after this list).
- Cause: This is a known security limitation of the NVIDIA Operator and containers. See Cannot see gpu threads in container for more information.
- Workaround: None
- Solution: None
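As a rough illustration of the limitation, and assuming `nvidia-smi` is available inside the workspace image, the commands below show what remains queryable; exact output depends on the driver version.

```sh
# Inside the workspace terminal: device information is reported, but
# per-process details (PIDs/TIDs) may be missing or incomplete because
# the container cannot see other PID namespaces.
nvidia-smi

# Per-GPU utilization can still be queried directly.
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv
```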
OAuth does not work with Enterprise GitHub Accounts
- Issue #: AIP-503
- Severity: Low
- Affected Components: Security
- Impacted Versions: 0.3.0+
- Description: When the GitHub OAuth provider is enabled, GitHub Enterprise users cannot log in.
- Cause: This is an issue in the Rancher OAuth provider.
- Workaround: None
- Solution: A fix will be made available in a future release.
When using a 'local' StorageClass, Persistent Volume Claims (PVCs) for Workspaces may not be reusable once the Workspace is deleted
- Issue #: AIP-482
- Severity: Low
- Affected Components: Workspace Volumes
- Impacted Versions: 0.3.0+
- Description: If a Volume is created using the `local` StorageClass (e.g., a local NVMe SSD) and used by a workspace, then once the workspace is stopped and deleted, the volume may not be usable by any other workspace.
- Cause: Once a PVC is claimed by a workspace pod, its status becomes Bound. If the pod is deleted, a new workload cannot reuse the PVC. If the workspace pod attempts to start on another node, it will not have access to the assigned PVC, which resides on a different worker node.
- Workaround: None
- Solution: Always use storage that is accessible by all worker nodes in the cluster, such as NFS, to avoid this issue. The sketch after this list shows how to inspect a stranded local PVC.
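The following is a minimal sketch, assuming cluster access with kubectl, of how to confirm that a leftover volume is pinned to a single worker; the PVC, PV, and namespace names are placeholders.

```sh
# Find the PV backing the workspace PVC and its binding status.
kubectl get pvc <workspace-pvc> -n <project-namespace>

# Local PVs carry a node affinity that pins them to the original worker;
# a replacement workspace scheduled on another node cannot mount them.
kubectl describe pv <bound-pv-name> | grep -A 5 "Node Affinity"
```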
User Account Retention Policy is Disabled
- Issue #: AIP-469
- Severity: Low
- Affected Components: RBAC/Security
- Impacted Versions: 0.3.0+
- Description: The user account retention policy, which automatically disables dormant accounts, has been disabled.
- Cause: N/A
- Workaround: None
- Solution: In a future release, this feature will be improved so that dormant accounts are automatically disabled, preventing those users from logging in. This is a security improvement.
Workspace Storage Usage via NFS is not Enforced
- Issue #: AIP-454
- Severity: Low
- Affected Components: Workspace Volumes/Storage
- Impacted Versions: 0.3.0+
- Description: When workspace volumes are backed by NFS, users may be able to write more data than was requested. For example, if a user creates a 10GB volume, more than 10GB can be written without ENOSPC or other errors (see the sketch after this list).
- Cause: This is a known issue. See nfs-subdir-external-provisioner: No restrictions on PVC.
- Workaround: None
- Solution: None
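As an illustrative check from inside a workspace, assuming the NFS-backed volume is mounted at /data (the mount path is a placeholder):

```sh
# The reported filesystem size is that of the underlying NFS export,
# not the requested PVC size, so a 10GB request is not enforced here.
df -h /data

# Actual space consumed on the workspace volume.
du -sh /data
```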
Multi-Pod Workloads/Workspaces may not be correctly admitted to the Kubernetes cluster
- Issue #: AIP-394
- Severity: Low
- Affected Components: Workspaces
- Impacted Versions: 0.3.0+
- Description: When a workspace with multiple Kubernetes pods is created, it is possible that only some of the pods are deployed; one way to check is sketched after this list.
- Cause: If there are not enough resources in the chosen project, a workload that would need to span multiple projects is not scheduled.
- Workaround: None
- Solution: In the future, we will improve this workflow and feature to allow multi-pod workloads to borrow resources from other projects when resources are unavailable in the primary project.
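The following is a hypothetical check with kubectl; the namespace, label selector, and pod name are placeholders that depend on how the workspace labels its pods.

```sh
# List the workspace's pods and where they landed; Pending pods were not
# admitted for lack of resources.
kubectl get pods -n <project-namespace> -l <workspace-label> -o wide

# The events at the end of describe usually name the missing resource.
kubectl describe pod <pending-pod-name> -n <project-namespace> | tail -n 20
```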
Creating a Node Group using the same GPU Make/Model in different Modes is not Supported
- Issue #: AIP-392
- Severity: Low
- Affected Components: Node Group
- Impacted Versions: 0.3.0+
- Description: Creating a Node Group using GPUs of the same make/model but configured in different modes (NVIDIA MIG, for example) is not supported. Node Group creation requires GPUs of the same make/model and mode.
- Cause: Hybrid GPU configurations are not supported in the same Node Group.
- Workaround: None
- Solution: Ensure all GPUs are of the same make/model and have been configured in the same mode before creating a Node Group (see the sketch after this list).
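A minimal pre-check sketch, assuming NVIDIA GPUs and a driver recent enough to report MIG mode; run it on each GPU node before creating the Node Group.

```sh
# List the GPU models present on the node.
nvidia-smi -L

# Confirm every GPU reports the same MIG mode.
nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv
```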
A Default Node Group is not Automatically Created when the Kubernetes Cluster is a Single Server, or when the Management/Control Plane Node has GPUs
- Issue #: AIP-388
- Severity: Low
- Affected Components: Node Group
- Impacted Versions: 0.3.0+
- Description: MemVerge.ai creates a default node group after installation. In very small clusters, where the cluster is a single node or the control plane is installed on a GPU worker, a default node group may not be automatically created.
- Cause: This is expected behavior: workloads running on the control/management node may cause performance issues under high demand.
- Workaround: None
- Solution: Manually create node groups using the available node, or install the control/management plane on a dedicated CPU-only host. The sketch after this list shows how to check which nodes expose GPUs.
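The following is a minimal sketch, assuming GPUs are advertised through the NVIDIA device plugin (the resource name is amd.com/gpu for AMD GPUs), of how to see which nodes carry schedulable GPUs before creating a node group manually.

```sh
# Show allocatable NVIDIA GPUs per node; <none> means the node exposes
# no GPUs to the scheduler.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```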