Troubleshooting

Even with careful setup and configuration, you may encounter issues with a system as complex as GPU Cluster Manager. This troubleshooting section is designed to help you diagnose and resolve common problems efficiently. Before diving into specific issues, it is important to gather relevant information and check the appropriate logs. Here are some general steps to follow when troubleshooting (command sketches that run several of these checks follow the list):

  1. Collect System Information:
    • Kubernetes version (kubectl version)
    • Node status (kubectl get nodes)
    • Pod status (kubectl get pods --all-namespaces)
    • Cluster events (kubectl get events --all-namespaces)
  2. Check the Component Logs:
    • Kubernetes system logs: /var/log/kube-* on master and worker nodes
    • Container logs: Use kubectl logs <pod-name> -n <namespace>
    • AMD or NVIDIA GPU Operator logs: Check pods in the gpu-operator-resources namespace
    • Kubeflow logs: Examine pods in the kubeflow namespace
  3. Verify Resource Usage:
    • Node resource utilization (kubectl top nodes)
    • Pod resource consumption (kubectl top pods --all-namespaces)
    • Pod resource requests and limits per node (kubectl describe node <hostname>)
    • GPU status on GPU-enabled nodes (AMD: rocm-smi, NVIDIA: nvidia-smi)
  4. Review Configuration Files:
    • Kubernetes configuration: /etc/kubernetes/*
    • Kubeflow manifests: Check your Kubeflow installation directory
    • NVIDIA GPU Operator values: Review your gpu-operator-values.yaml
  5. Network Diagnostics:
    • DNS resolution (nslookup, dig)
    • Network connectivity (ping, traceroute)
    • Service endpoints (kubectl get endpoints)
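
As a starting point, here is a minimal bash sketch (not part of GPU Cluster Manager) that collects the information from steps 1-3 into a timestamped directory. It assumes kubectl is already configured for the target cluster and that the metrics-server add-on is installed for the kubectl top commands; the output directory and file names are illustrative.

    #!/usr/bin/env bash
    # Sketch: collect basic cluster diagnostics into a timestamped directory.
    # Assumes kubectl is configured for the target cluster; "kubectl top"
    # requires the metrics-server add-on and is skipped on failure.
    set -euo pipefail

    OUT="cluster-diagnostics-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$OUT"

    # Step 1: system information
    kubectl version                           > "$OUT/version.txt"
    kubectl get nodes -o wide                 > "$OUT/nodes.txt"
    kubectl get pods --all-namespaces -o wide > "$OUT/pods.txt"
    kubectl get events --all-namespaces       > "$OUT/events.txt"

    # Step 2: component logs (namespaces taken from the list above;
    # adjust them to match your installation)
    for ns in gpu-operator-resources kubeflow; do
      for pod in $(kubectl get pods -n "$ns" -o name); do
        kubectl logs -n "$ns" "$pod" --all-containers \
          > "$OUT/$ns-$(basename "$pod").log" || true
      done
    done

    # Step 3: resource usage (requires metrics-server)
    kubectl top nodes                 > "$OUT/top-nodes.txt" || true
    kubectl top pods --all-namespaces > "$OUT/top-pods.txt" || true

    echo "Diagnostics written to $OUT/"

Run it from any machine with sufficient cluster access; the resulting directory can be attached to a support request or used as the basis for further analysis.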

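For step 5, DNS resolution and connectivity can also be checked from inside the cluster with a throwaway pod. The pod name, image, and test targets below are illustrative assumptions, not requirements of GPU Cluster Manager:

    # Hypothetical in-cluster network check; the pod is deleted when the
    # command exits. ping may fail if your pod security settings drop the
    # NET_RAW capability.
    kubectl run net-debug --rm -it --restart=Never --image=busybox:1.36 -- \
      sh -c 'nslookup kubernetes.default.svc.cluster.local && ping -c 3 8.8.8.8'
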
For problems outside of GPU Cluster Manager that might affect its operation, see the following Troubleshooting Guides specific to the GPU type you are using: