Confirm the Installation was Successful
After installing GPU Cluster Manager, it's important to verify that all components are running correctly. Follow these steps to confirm a successful installation:
- Check MVAI services:
Ensure all MVAI-related services are present and have a type of ClusterIP, including:
- mvai
- mvai-billing
- mvai-billing-mysql
- mvai-ctrl-controller-manager-metrics-service
- mvai-ctrl-metrics-aggregator-service
- mvai-ctrl-webhook-service
Example:
$ kubectl get services -n cattle-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
gpu-operator ClusterIP 10.43.84.16 <none> 8080/TCP 67m
kueue-controller-manager-metrics-service ClusterIP 10.43.48.39 <none> 8443/TCP 67m
kueue-visibility-server ClusterIP 10.43.241.212 <none> 443/TCP 67m
kueue-webhook-service ClusterIP 10.43.89.126 <none> 443/TCP 67m
mvai ClusterIP 10.43.23.252 <none> 80/TCP,443/TCP 72m
mvai-billing ClusterIP 10.43.63.29 <none> 8080/TCP 72m
mvai-billing-mysql ClusterIP 10.43.41.163 <none> 3306/TCP 72m
mvai-ctrl-controller-manager-metrics-service ClusterIP 10.43.197.99 <none> 8443/TCP 67m
mvai-ctrl-metrics-aggregator-service ClusterIP 10.43.185.32 <none> 9191/TCP 67m
mvai-ctrl-webhook-service ClusterIP 10.43.239.169 <none> 443/TCP 67m
mmcloud-operator-controller-manager-metrics-service ClusterIP 10.43.234.77 <none> 8443/TCP 67m
mmcloud-operator-webhook-service ClusterIP 10.43.99.35 <none> 443/TCP 67m
nvidia-dcgm-exporter ClusterIP 10.43.29.94 <none> 9400/TCP 67m
rancher-webhook ClusterIP 10.43.193.230 <none> 443/TCP 70m
- Verify pod status:
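All MVAI-related pods should be in the Running state. One-shot validation pods, such as nvidia-cuda-validator in the example, report Completed instead, which is expected.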
Example:
$ kubectl get pods -n cattle-system
NAME READY STATUS RESTARTS AGE
engine-q8fjl 1/1 Running 0 67m
engine-r9cw6 1/1 Running 0 67m
gpu-feature-discovery-2q454 1/1 Running 0 67m
gpu-operator-58dcc865fd-bzr5n 1/1 Running 0 68m
gpu-operator-node-feature-discovery-gc-7f546fd4bc-q67nl 1/1 Running 0 68m
gpu-operator-node-feature-discovery-master-8448c8896c-65w4w 1/1 Running 0 68m
gpu-operator-node-feature-discovery-worker-72m4v 1/1 Running 0 68m
gpu-operator-node-feature-discovery-worker-zzkxf 1/1 Running 0 68m
kueue-controller-manager-7f55cc5474-gf9xh 2/2 Running 0 67m
mvai-565bd9f48b-77hls 1/1 Running 0 73m
mvai-565bd9f48b-n4bqg 1/1 Running 0 72m
mvai-billing-6fb99c585d-2b2mt 1/1 Running 6 (70m ago) 73m
mvai-billing-mysql-6464ff86fb-2bbt4 1/1 Running 0 73m
mvai-ctrl-controller-manager-9cbd47d9-h47pm 1/1 Running 0 67m
mvai-ctrl-metrics-aggregator-5bd4d565d5-ksmwz 1/1 Running 0 67m
mmcloud-operator-controller-manager-974899777-vrbcf 1/1 Running 0 67m
nvidia-container-toolkit-daemonset-42dfn 1/1 Running 0 67m
nvidia-cuda-validator-z7drc 0/1 Completed 0 64m
nvidia-dcgm-exporter-hjtjp 1/1 Running 0 67m
nvidia-device-plugin-daemonset-6mx5b 1/1 Running 0 67m
nvidia-driver-daemonset-6gkq2 1/1 Running 0 68m
nvidia-operator-validator-ldf8k 1/1 Running 0 67m
rancher-webhook-5d7c7b486c-qk4ph 1/1 Running 0 71m
- Check ingress configuration:
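Confirm that the MVAI ingress is properly configured with the correct hostname and TLS settings. The HOSTS column should show the hostname you configured, and PORTS should include 443 when TLS is enabled.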
Example:
$ kubectl get ingress -n cattle-system
NAME CLASS HOSTS ADDRESS PORTS AGE
mvai traefik mvai-mgmt 172.31.25.216,172.31.25.25 80, 443 74m
- Validate MVAI version:
Ensure the image tag of the mvai deployment matches the version you intended to install.
Example:
$ kubectl describe deployment mvai -n cattle-system | grep Image
Image: ghcr.io/memverge/mvai:v0.4.0
- Test MVAI web interface accessibility:
Use a web browser to access the MVAI dashboard using the configured hostname. Verify that you can log in successfully.
- Check MVAI logs for any errors:
Review the logs for any error messages or warnings that might indicate configuration issues.
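Example (a minimal sketch, not an official procedure): the deployment name mvai and the cattle-system namespace come from the earlier examples, while the helper name scan_log_errors, the error|warn pattern, and the one-hour window are illustrative assumptions. Adjust the pattern to the log format you actually see.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print log lines that mention an error or warning
# (case-insensitive). Reads log text on stdin and prints nothing when no
# line matches.
scan_log_errors() {
  grep -iE 'error|warn' || true
}

# Usage against a live cluster (--since=1h is an arbitrary window):
#   kubectl logs deployment/mvai -n cattle-system --since=1h | scan_log_errors
```

When given a deployment, kubectl logs reads from a single pod, so repeat the command with each pod name to cover all replicas. An empty result only means no matching lines appeared in the window inspected; it does not by itself prove the service is healthy.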
If all these checks pass without errors, your MVAI installation is likely successful and ready for use.
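The service and pod checks can also be scripted. The following is a sketch rather than an official tool: the function names check_services and check_pods are hypothetical, and the parsing assumes the default kubectl table output shown in the examples (name in the first column, service type in the second, pod status in the third).

```shell
#!/usr/bin/env bash
# Services this guide expects to find in the cattle-system namespace.
REQUIRED_SERVICES="mvai mvai-billing mvai-billing-mysql mvai-ctrl-controller-manager-metrics-service mvai-ctrl-metrics-aggregator-service mvai-ctrl-webhook-service"

# Reads `kubectl get services` table output on stdin.
# Prints one line per required service that is missing or is not ClusterIP.
check_services() {
  local table name type
  table="$(cat)"
  for name in $REQUIRED_SERVICES; do
    type="$(printf '%s\n' "$table" | awk -v n="$name" '$1 == n { print $2 }')"
    if [ -z "$type" ]; then
      echo "MISSING $name"
    elif [ "$type" != "ClusterIP" ]; then
      echo "WRONGTYPE $name $type"
    fi
  done
}

# Reads `kubectl get pods` table output on stdin.
# Prints the name and status of every pod whose STATUS column is neither
# Running nor Completed. The header row is skipped.
check_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage against a live cluster:
#   kubectl get services -n cattle-system | check_services
#   kubectl get pods -n cattle-system | check_pods
```

Both functions print nothing when every check passes. Note that check_pods looks only at the STATUS column, so a pod that is Running but not yet ready (for example 0/1 in the READY column) is not reported.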