Install GPU Cluster Manager¶

NOTE: It is currently not possible to upgrade GPU Cluster Manager (also known as MMAI or MVAI) from version 0.3.0 to 0.4.0. If you are running 0.3.0 and wish to use 0.4.0, follow the uninstall instructions in the version 0.3.0 documentation before installing 0.4.0.

GPU Cluster Manager is installed using Helm charts. These charts deploy and configure the following major components of GPU Cluster Manager:

NVIDIA GPU Operator
NVIDIA DCGM Telemetry Exporter
NVIDIA GPU Drivers
GPU Cluster Manager:
  Management/Control Plane
  Workload Queuing and Prioritization
  Metrics and Telemetry System
  Billing: Read "Configuring the Billing Database" for more information before installing MemVerge.ai.

GPU Cluster Manager's server is designed to be secure by default and requires SSL/TLS configuration. There are three recommended options for the source of the certificate used for TLS termination on the server:

Self-Generated Certificate
Let's Encrypt
Bring Your Own Certificate

Choose the best option for your needs and environment.

Get the LoadBalancer IP Address or Hostname¶

Before proceeding, you need to know the IP address or hostname of the LoadBalancer service. Use the following command to obtain this information:

kubectl get services --all-namespaces | grep -E "NAMESPACE|LoadBalancer"

Example:

$ kubectl get services --all-namespaces | grep -E "NAMESPACE|LoadBalancer"
NAMESPACE      NAME      TYPE           CLUSTER-IP      EXTERNAL-IP                  PORT(S)                   
kube-system    traefik   LoadBalancer   10.43.171.70    172.31.25.216,172.31.25.25   80:32536/TCP,443:30099/TCP

If you need more detailed information about this specific LoadBalancer service, you can use:

kubectl describe service traefik -n kube-system
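For scripted installs, the EXTERNAL-IP column can be extracted instead of read by eye. The following is a minimal sketch, not part of the product tooling: the helper name first_lb_ip is hypothetical, and the sample input is copied from the example output above so that the snippet runs without a cluster.

```shell
# Print the first EXTERNAL-IP of every LoadBalancer service found in
# `kubectl get services --all-namespaces` output read from stdin.
# Columns: NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
first_lb_ip() {
  awk '$3 == "LoadBalancer" { split($5, ips, ","); print ips[1] }'
}

# Sample copied from the example above (hypothetical stand-in for a live cluster).
sample='NAMESPACE      NAME      TYPE           CLUSTER-IP      EXTERNAL-IP                  PORT(S)
kube-system    traefik   LoadBalancer   10.43.171.70    172.31.25.216,172.31.25.25   80:32536/TCP,443:30099/TCP'

printf '%s\n' "$sample" | first_lb_ip
```

On a live cluster, pipe the real command into the helper: kubectl get services --all-namespaces | first_lb_ip. A service whose address has not been assigned yet shows <pending> in this column.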

Using the EXTERNAL-IP, perform a reverse name lookup:

$ nslookup 172.31.25.216
216.25.31.172.in-addr.arpa name = mvai-mgmt.

In this example, mvai-mgmt is the hostname of the LoadBalancer.
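If the hostname is needed in a script, the trailing dot that nslookup prints on the name has to be removed. A small sketch under that assumption follows; the helper name ptr_hostname is hypothetical, and the input line is the one from the example above.

```shell
# Extract the hostname from an nslookup reverse-lookup answer on stdin,
# dropping the trailing dot of the fully qualified name.
ptr_hostname() {
  awk '/name = / { host = $NF; sub(/\.$/, "", host); print host; exit }'
}

# Sample answer line copied from the example above.
answer='216.25.31.172.in-addr.arpa name = mvai-mgmt.'

LB_HOSTNAME="$(printf '%s\n' "$answer" | ptr_hostname)"
echo "$LB_HOSTNAME"
```

On a live system this could be used as nslookup 172.31.25.216 | ptr_hostname, with the result passed to --set hostname= in the install commands that follow.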

Add Billing Database (Optional)¶

If you are using an NFS storage class, you may wish to change the billing database storage class by following the instructions in Configure Billing Storage.

Installing GPU Cluster Manager with a Self-Generated Certificate¶

By default, GPU Cluster Manager generates its own CA and uses cert-manager to issue the certificate for access to the GPU Cluster Manager server interface. Use the following Helm command to install mvai (GPU Cluster Manager) with a self-signed certificate:

helm install --namespace cattle-system mvai oci://ghcr.io/memverge/charts/mvai \
  --wait --timeout 20m \
  --version 0.4.0 \
  --set hostname=<load-balancer-hostname> \
  --set bootstrapPassword=admin

Command Options Explained:

--namespace cattle-system: Specifies the Kubernetes namespace where mvai (GPU Cluster Manager) will be installed.
--wait: Instructs Helm to wait until all pods, PVCs, services, and minimum number of pods of a deployment are in a ready state before marking the release as successful.
--timeout 20m: Sets a 20-minute timeout for the installation process.
--version <version>: Specifies the version of the GPU Cluster Manager Helm chart to install. Replace <version> with the desired version number, e.g. 0.4.0.
--set hostname=<load-balancer-hostname>: Sets the hostname for accessing the GPU Cluster Manager. Ensure that the <load-balancer-hostname> is correctly set to match your cluster's ingress or load-balancer address. This should be the hostname of the management/control plane node. When using an external DNS name, such as demo.memverge.com, use that instead of the management hostname.
--set bootstrapPassword=admin: Sets the initial admin password for GPU Cluster Manager. For security reasons, change this from "admin" to a strong, unique password.
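One way to follow the advice about replacing "admin" is to generate the bootstrap password before running the install. This is a sketch only: the function name gen_password and the variable BOOTSTRAP_PASSWORD are hypothetical, and the 24-character length is an arbitrary choice, not a product requirement.

```shell
# Generate a 24-character password from the kernel's random source.
# 18 random bytes encode to exactly 24 base64 characters; '+' and '/'
# are mapped to '-' and '_' so the value contains no characters (such
# as commas) that Helm's --set syntax treats specially.
gen_password() {
  head -c 18 /dev/urandom | base64 | tr '+/' '-_'
}

BOOTSTRAP_PASSWORD="$(gen_password)"
echo "Generated a ${#BOOTSTRAP_PASSWORD}-character bootstrap password"

# Then pass it to the install command from this section, for example:
#   helm install ... --set bootstrapPassword="$BOOTSTRAP_PASSWORD"
```

The script deliberately prints only the length, so the password itself does not end up in terminal scrollback or CI logs.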

Installing GPU Cluster Manager with Let's Encrypt SSL Certificate¶

This section guides you through installing GPU Cluster Manager using a Helm chart while setting up automatic SSL certificate management with Let's Encrypt.

Use the following Helm command to install GPU Cluster Manager (mvai):

helm install --namespace cattle-system mvai oci://ghcr.io/memverge/charts/mvai \
  --wait --timeout 20m \
  --version 0.4.0 \
  --set hostname=<load-balancer-hostname> \
  --set bootstrapPassword=admin \
  --set ingress.tls.source=letsEncrypt \
  --set letsEncrypt.email=<me@example.org> \
  --set letsEncrypt.ingress.class=<ingress-controller-name>

Command Options Explained:

--namespace cattle-system: Specifies the Kubernetes namespace where mvai will be installed.
--wait: Instructs Helm to wait until all pods, PVCs, services, and minimum number of pods of a deployment are in a ready state before marking the release as successful.
--timeout 20m: Sets a 20-minute timeout for the installation process.
--version <version>: Specifies the version of MemVerge.ai to install. Replace <version> with the desired version number, e.g. 0.4.0.
--set hostname=<load-balancer-hostname>: Sets the hostname for accessing mvai. Replace <load-balancer-hostname> with your actual hostname.
--set bootstrapPassword=admin: Sets the initial admin password for MemVerge.ai. For security reasons, change this from "admin" to a strong, unique password.
--set ingress.tls.source=letsEncrypt: Configures the installation to use Let's Encrypt for SSL certificate management.
--set letsEncrypt.email=<me@example.org>: Specifies the email address that Let's Encrypt uses for important communications about your certificates, including expiration notices and critical updates. Replace <me@example.org> with a valid address that you or your team can monitor.
--set letsEncrypt.ingress.class=<ingress-controller-name>: Specifies which ingress controller to use. Replace <ingress-controller-name> with the appropriate value for your cluster (e.g., "traefik", "nginx"). To find it, run kubectl get ingressclasses and use the value in the NAME column.

Here is an example command to install GPU Cluster Manager:

helm install --namespace cattle-system mvai oci://ghcr.io/memverge/charts/mvai \
  --wait --timeout 20m \
  --version 0.4.0 \
  --set hostname=mvai-mgmt \
  --set bootstrapPassword=admin \
  --set ingress.tls.source=letsEncrypt \
  --set letsEncrypt.email=test.user@gmail.com \
  --set letsEncrypt.ingress.class=traefik
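With this many --set flags, some teams prefer to keep the same settings in a values file and pass it with Helm's -f option. The sketch below writes such a file using the values from the example command above. The file name mvai-values.yaml is a hypothetical choice, and the nesting simply follows Helm's rule that each dot in a --set key is one level of YAML.

```shell
# Write the settings from the example command into a values file.
# The quoted 'EOF' prevents the shell from expanding anything in the body.
cat > mvai-values.yaml <<'EOF'
hostname: mvai-mgmt
bootstrapPassword: admin   # change this to a strong, unique password
ingress:
  tls:
    source: letsEncrypt
letsEncrypt:
  email: test.user@gmail.com
  ingress:
    class: traefik
EOF

echo "Wrote mvai-values.yaml"

# The install command then becomes:
#   helm install --namespace cattle-system mvai oci://ghcr.io/memverge/charts/mvai \
#     --wait --timeout 20m --version 0.4.0 -f mvai-values.yaml
```

A values file also keeps the bootstrap password out of shell history, provided the file itself is stored with restricted permissions.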

Installing GPU Cluster Manager with Bring Your Own Certificate¶

In this option, Kubernetes secrets are created from your own certificates for MemVerge.ai to use.

Use one of the following Helm commands to install GPU Cluster Manager with your own certificate:

For standard certificates:

helm install --namespace cattle-system mvai oci://ghcr.io/memverge/charts/mvai \
  --wait --timeout 20m --version <version> \
  --set hostname=<load-balancer-hostname> --set bootstrapPassword=admin \
  --set ingress.tls.source=secret

For private CA-signed certificates:

helm install --namespace cattle-system mvai oci://ghcr.io/memverge/charts/mvai \
  --wait --timeout 20m --version <version> \
  --set hostname=<load-balancer-hostname> --set bootstrapPassword=admin \
  --set ingress.tls.source=secret --set privateCA=true

Command Options Explained:

--namespace cattle-system: Specifies the Kubernetes namespace where GPU Cluster Manager will be installed.
--wait: Instructs Helm to wait until all pods, PVCs, services, and minimum number of pods of a deployment are in a ready state before marking the release as successful.
--timeout 20m: Sets a 20-minute timeout for the installation process.
--version <version>: (Optional) Specifies the version of MemVerge.ai to install. Replace <version> with the desired version number, e.g. 0.4.0.
--set hostname=<load-balancer-hostname>: Sets the hostname for accessing mvai. Replace <load-balancer-hostname> with your actual hostname.
--set bootstrapPassword=admin: Sets the initial admin password for mvai. For security reasons, change this from "admin" to a strong, unique password.
--set ingress.tls.source=secret: Configures the installation to use your own certificate stored as a Kubernetes secret.
--set privateCA=true: (Optional) Specifies that the certificate is signed by a private CA. Include this option when your certificate is signed by a private CA.
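The secret itself is typically created with kubectl create secret tls. The sketch below only prints the command rather than running it, so that it can be reviewed first. Everything specific in it is an assumption: the helper name tls_secret_cmd, the secret name example-tls-secret, and the file names tls.crt and tls.key are hypothetical placeholders, and it assumes the secret belongs in the cattle-system namespace used by the install. Take the secret names that the chart actually expects from Adding TLS Secrets.

```shell
# Print (do not run) the kubectl command that would create a TLS secret.
# Arguments: secret name, certificate file, private key file.
tls_secret_cmd() {
  printf 'kubectl --namespace cattle-system create secret tls %s --cert=%s --key=%s\n' \
    "$1" "$2" "$3"
}

# Hypothetical placeholder values; substitute the documented secret name
# and the paths to your own PEM-encoded certificate and key.
tls_secret_cmd example-tls-secret tls.crt tls.key
```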

Example Output from a Successful Installation of GPU Cluster Manager¶

The following output was generated by an installation that uses the Let's Encrypt certificate option:

$ helm install --namespace cattle-system mvai oci://ghcr.io/memverge/charts/mvai \
  --wait --timeout 40m \
  --version 0.4.0 \
  --set hostname=demo.memvergelab.com \
  --set bootstrapPassword=admin \
  --set ingress.tls.source=letsEncrypt \
  --set letsEncrypt.email=noreply@memvergelab.com \
  --set letsEncrypt.ingress.class=traefik
Pulled: ghcr.io/memverge/charts/mvai:0.4.0
Digest: sha256:d9d445b79a4f6422e7f2af3dd115753a14c719072f9f1a714510dc5ffbb4ceb2
E0610 23:34:51.413315  671967 reflector.go:200] "Failed to watch" err="the server is currently unable to handle the request (get jobs.batch)" logger="UnhandledError" reflector="k8s.io/client-go@v0.33.0/tools/cache/reflector.go:285" type="*unstructured.Unstructured"
E0610 23:34:54.130194  671967 reflector.go:200] "Failed to watch" err="failed to list *unstructured.Unstructured: the server is currently unable to handle the request (get jobs.batch)" logger="UnhandledError" reflector="k8s.io/client-go@v0.33.0/tools/cache/reflector.go:285" type="*unstructured.Unstructured"
NAME: mvai
LAST DEPLOYED: Tue Jun 10 23:29:27 2025
NAMESPACE: cattle-system
STATUS: deployed
REVISION: 1
NOTES:
MemVerge.ai v0.4.0 has been deployed successfully!

Check out our docs at https://docs.memverge.com/AI/

If you provided your own bootstrap password during installation, browse to https://demo.memvergelab.com to get started.

To get just the bootstrap password on its own, run:

kubectl get secret --namespace cattle-system bootstrap-secret -o go-template='{{.data.bootstrapPassword|base64decode}}{{ "\n" }}'
$

Note: Some error messages, like the "Failed to watch" notices in the above output, may appear. This is normal and expected.
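Because those messages are interleaved with the result, an automated install may want to check the STATUS line explicitly rather than scan the whole output. A minimal sketch follows; the helper name release_status is hypothetical, and the sample text is the release summary copied from the output above.

```shell
# Print the value of the STATUS line from Helm release output on stdin.
release_status() {
  awk '/^STATUS:/ { print $2; exit }'
}

# Release summary copied from the example output above.
helm_output='NAME: mvai
LAST DEPLOYED: Tue Jun 10 23:29:27 2025
NAMESPACE: cattle-system
STATUS: deployed
REVISION: 1'

status="$(printf '%s\n' "$helm_output" | release_status)"
if [ "$status" = "deployed" ]; then
  echo "release is deployed"
else
  echo "unexpected status: $status" >&2
fi
```

On a live cluster the same helper can read from helm status mvai --namespace cattle-system.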

Now that GPU Cluster Manager is installed, publish the certificate files so that GPU Cluster Manager and the Ingress controller can use them. For more information, see Adding TLS Secrets.

Below is the output from a successful run of the kubectl command from the installation notes, which retrieves the bootstrap password:

$ kubectl get secret --namespace cattle-system bootstrap-secret -o go-template='{{.data.bootstrapPassword|base64decode}}{{ "\n" }}'
admin

The response of admin matches the password provided in the helm install command above:

--set bootstrapPassword=admin \
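Kubernetes stores Secret values base64-encoded, which is why the go-template in the command above includes a base64decode step. The same decoding can be illustrated by hand; the encoded string below is simply the base64 form of the example password "admin".

```shell
# "YWRtaW4=" is the base64 encoding of "admin", the example password above.
encoded='YWRtaW4='
decoded="$(printf '%s' "$encoded" | base64 --decode)"
echo "$decoded"

# Equivalent retrieval on a live cluster, using jsonpath instead of go-template:
#   kubectl get secret --namespace cattle-system bootstrap-secret \
#     -o jsonpath='{.data.bootstrapPassword}' | base64 --decode
```

Base64 is an encoding, not encryption, so anyone who can read the Secret can recover the password.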

The password must be reset upon initial Admin login to the GPU Cluster Manager via web browser.

In the example above, we have now successfully installed GPU Cluster Manager and published the certificate files using the "Let's Encrypt" SSL certificate.