Install GPU Manager

GPU Manager is installed using Helm charts. These charts deploy and configure the following major components of GPU Manager:

  • NVIDIA GPU Operator
  • NVIDIA DCGM Telemetry Exporter
  • NVIDIA GPU Drivers
  • GPU Manager
  • Management/Control Plane
  • Workload Queuing and Prioritization
  • Metrics and Telemetry System
  • Billing - Read "Configuring the Billing Database" for more information before installing MemVerge.ai.

The GPU Manager server is designed to be secure by default and requires SSL/TLS configuration. There are three recommended options for the source of the certificate used for TLS termination on the server:

  1. Self-Generated Certificate
  2. Let's Encrypt
  3. Bring Your Own Certificate

Choose the best option for your needs and environment.

Get the LoadBalancer IP Address or Hostname

Before proceeding, you need to know the IP address or hostname of the LoadBalancer service. Use the following command to list the LoadBalancer services in the cluster:

kubectl get services --all-namespaces | grep -E "NAMESPACE|LoadBalancer"

Example:

$ kubectl get services --all-namespaces | grep -E "NAMESPACE|LoadBalancer"
NAMESPACE      NAME      TYPE           CLUSTER-IP      EXTERNAL-IP                  PORT(S)                   
kube-system    traefik   LoadBalancer   10.43.171.70    172.31.25.216,172.31.25.25   80:32536/TCP,443:30099/TCP

If you need more detailed information about this specific LoadBalancer service, you can use:

kubectl describe service traefik -n kube-system

Using the EXTERNAL-IP, perform a name lookup:

$ nslookup 172.31.25.216
216.25.31.172.in-addr.arpa  name = mvai-mgmt.

In this example, mvai-mgmt is the hostname of the LoadBalancer.
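The lookup above can be partly scripted. The following is a minimal sketch, not part of the product: the helper name first_lb_ip is hypothetical, and it assumes the column layout shown in the example output (TYPE in the third column, EXTERNAL-IP in the fifth).

```shell
# Print the first EXTERNAL-IP of every LoadBalancer service read from stdin.
# Assumes the `kubectl get services --all-namespaces` column layout:
#   NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
# A service without an assigned address is printed as-is (e.g. <pending>).
first_lb_ip() {
  awk '$3 == "LoadBalancer" { split($5, ips, ","); print ips[1] }'
}

# Usage against a live cluster (requires kubectl access, not run here):
#   kubectl get services --all-namespaces | first_lb_ip
#   nslookup "$(kubectl get services --all-namespaces | first_lb_ip | head -n 1)"
```

If the cluster exposes more than one LoadBalancer service, the helper prints one address per service, so pick the line that belongs to your ingress controller.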

Installing GPU Manager with a Self-Generated Certificate

By default, GPU Manager generates its own CA and uses cert-manager to issue the certificate for access to the GPU Manager server interface. Use the following Helm command to install MMAI (GPU Manager) with a self-generated certificate:

helm install --namespace cattle-system mmai oci://ghcr.io/memverge/charts/mmai \
  --wait --timeout 20m \
  --version 0.3.0 \
  --set hostname=<load-balancer-hostname> \
  --set bootstrapPassword=admin

Command Options Explained:

  • --namespace cattle-system: Specifies the Kubernetes namespace where MMAI (GPU Manager) will be installed.
  • --wait: Instructs Helm to wait until all pods, PVCs, services, and the minimum number of pods of each deployment are in a ready state before marking the release as successful.
  • --timeout 20m: Sets a 20-minute timeout for the installation process.
  • --version <version>: Specifies the version of the MMAI (GPU Manager) Helm chart to install. Replace <version> with the desired version number, e.g. 0.3.0.
  • --set hostname=<load-balancer-hostname>: Sets the hostname for accessing GPU Manager. Ensure that <load-balancer-hostname> matches your cluster's ingress or load-balancer address. This should be the hostname of the management/control plane node.
  • --set bootstrapPassword=admin: Sets the initial admin password for MMAI (GPU Manager). For security reasons, change this from "admin" to a strong, unique password.
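One way to follow the advice on bootstrapPassword is to generate a random value instead of typing one. This is a minimal sketch: the helper name gen_password and the 24-character length are illustrative choices, and it assumes the chart accepts any alphanumeric string as the password.

```shell
# Generate a 24-character alphanumeric password from the kernel RNG.
# 32 random bytes are base64-encoded, then '/', '+', '=' and newlines are
# dropped so the value needs no shell or YAML escaping.
gen_password() {
  head -c 32 /dev/urandom | base64 | tr -d '/+=\n' | cut -c1-24
}

BOOTSTRAP_PASSWORD="$(gen_password)"

# Pass it to the install command shown above (not run here):
#   --set bootstrapPassword="$BOOTSTRAP_PASSWORD"
```

Store the generated value somewhere safe before installing, because it is needed for the first login.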

Installing GPU Manager with Let's Encrypt SSL Certificate

This section guides you through installing GPU Manager with a Helm chart while setting up automatic SSL certificate management with Let's Encrypt.

Use the following Helm command to install GPU Manager (MMAI):

helm install --namespace cattle-system mmai oci://ghcr.io/memverge/charts/mmai \
  --wait --timeout 20m \
  --version 0.3.0 \
  --set hostname=<load-balancer-hostname> \
  --set bootstrapPassword=admin \
  --set ingress.tls.source=letsEncrypt \
  --set letsEncrypt.email=<me@example.org> \
  --set letsEncrypt.ingress.class=<ingress-controller-name>

Command Options Explained:

  • --namespace cattle-system: Specifies the Kubernetes namespace where MMAI will be installed.
  • --wait: Instructs Helm to wait until all pods, PVCs, services, and minimum number of pods of a deployment are in a ready state before marking the release as successful.
  • --timeout 20m: Sets a 20-minute timeout for the installation process.
  • --version <version>: Specifies the version of MemVerge.ai to install. Replace <version> with the desired version number, e.g. 0.3.0.
  • --set hostname=<load-balancer-hostname>: Sets the hostname for accessing MMAI. Replace <load-balancer-hostname> with your actual hostname.
  • --set bootstrapPassword=admin: Sets the initial admin password for MemVerge.ai. For security reasons, change this from "admin" to a strong, unique password.
  • --set ingress.tls.source=letsEncrypt: Configures the installation to use Let's Encrypt for SSL certificate management.
  • --set letsEncrypt.email=<me@example.org>: Specifies the email address that Let's Encrypt uses for important communications about your certificates, including expiration notices and critical updates. Replace <me@example.org> with a valid address that you or your team monitor.
  • --set letsEncrypt.ingress.class=<ingress-controller-name>: Specifies which ingress controller to use. Replace <ingress-controller-name> with the appropriate value for your cluster (e.g., "traefik", "nginx"). To find it, run kubectl get ingressclasses and use the value in the NAME column.

Here is an example command to install MemVerge.ai:

helm install --namespace cattle-system mmai oci://ghcr.io/memverge/charts/mmai \
  --wait --timeout 20m \
  --version 0.3.0 \
  --set hostname=mvai-mgmt \
  --set bootstrapPassword=admin \
  --set ingress.tls.source=letsEncrypt \
  --set letsEncrypt.email=test.user@gmail.com \
  --set letsEncrypt.ingress.class=traefik
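The same settings can also be kept in a Helm values file, since each dotted --set path maps to nested YAML keys. The sketch below writes such a file. The helper name write_letsencrypt_values and the file name mmai-values.yaml are hypothetical, and the keys are only those already used by the --set flags above.

```shell
# Write a Helm values document equivalent to the Let's Encrypt --set flags.
# Arguments: <hostname> <email> <ingress-class>
write_letsencrypt_values() {
  cat <<EOF
hostname: "$1"
ingress:
  tls:
    source: letsEncrypt
letsEncrypt:
  email: "$2"
  ingress:
    class: "$3"
EOF
}

write_letsencrypt_values mvai-mgmt test.user@gmail.com traefik > mmai-values.yaml

# Install with the file on a machine with cluster access (not run here):
#   helm install --namespace cattle-system mmai oci://ghcr.io/memverge/charts/mmai \
#     --wait --timeout 20m --version 0.3.0 \
#     --set bootstrapPassword=<password> -f mmai-values.yaml
```

The password is deliberately left out of the file so that the file can be stored in version control.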

Installing GPU Manager with Bring Your Own Certificate

In this option, Kubernetes secrets are created from your own certificates for MemVerge.ai to use.

Use one of the following Helm commands to install MMAI (GPU Manager) with your own certificate:

For standard certificates:

helm install --namespace cattle-system mmai oci://ghcr.io/memverge/charts/mmai \
  --wait --timeout 20m --version <version> \
  --set hostname=<load-balancer-hostname> --set bootstrapPassword=admin \
  --set ingress.tls.source=secret

For private CA-signed certificates:

helm install --namespace cattle-system mmai oci://ghcr.io/memverge/charts/mmai \
  --wait --timeout 20m --version <version> \
  --set hostname=<load-balancer-hostname> --set bootstrapPassword=admin \
  --set ingress.tls.source=secret --set privateCA=true

Command Options Explained:

  • --namespace cattle-system: Specifies the Kubernetes namespace where MMAI will be installed.
  • --wait: Instructs Helm to wait until all pods, PVCs, services, and minimum number of pods of a deployment are in a ready state before marking the release as successful.
  • --timeout 20m: Sets a 20-minute timeout for the installation process.
  • --version <version>: Specifies the version of MemVerge.ai to install. Replace <version> with the desired version number, e.g. 0.3.0.
  • --set hostname=<load-balancer-hostname>: Sets the hostname for accessing MMAI. Replace <load-balancer-hostname> with your actual hostname.
  • --set bootstrapPassword=admin: Sets the initial admin password for MMAI. For security reasons, change this from "admin" to a strong, unique password.
  • --set ingress.tls.source=secret: Configures the installation to use your own certificate stored as a Kubernetes secret.
  • --set privateCA=true: (Optional) Specifies that the certificate is signed by a private CA. Include this option only if that is the case.
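The install commands for the three certificate options differ only in their TLS-related flags. The following sketch summarizes that difference. The function name tls_flags and its option keywords are hypothetical, and the flags it prints are exactly the ones used in the commands above.

```shell
# Print the TLS-related Helm flags for a given certificate option.
# $1: self | letsencrypt | secret | secret-private-ca
# $2: email (letsencrypt only)   $3: ingress class (letsencrypt only)
tls_flags() {
  case "$1" in
    self)
      ;;  # default behaviour: no extra flags needed
    letsencrypt)
      echo "--set ingress.tls.source=letsEncrypt --set letsEncrypt.email=$2 --set letsEncrypt.ingress.class=$3"
      ;;
    secret)
      echo "--set ingress.tls.source=secret"
      ;;
    secret-private-ca)
      echo "--set ingress.tls.source=secret --set privateCA=true"
      ;;
    *)
      echo "unknown TLS option: $1" >&2
      return 1
      ;;
  esac
}

# Example: show the extra flags for a certificate signed by a private CA.
tls_flags secret-private-ca
```

A wrapper script could append this output to the common helm install arguments, which keeps the hostname, version, and password handling in one place.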

Now that GPU Manager is deployed, see Adding TLS Secrets to publish your certificate files so GPU Manager and the Ingress controller can use them.