
Troubleshooting

Even with careful setup and configuration, you may encounter issues with this complex AI infrastructure. This troubleshooting section helps you diagnose and resolve common problems efficiently. Before diving into a specific issue, gather the relevant information and check the appropriate logs. The general steps below are a good starting point:
  • Collect System Information:
    • Kubernetes version (kubectl version)
    • Node status (kubectl get nodes)
    • Pod status (kubectl get pods --all-namespaces)
    • Cluster events (kubectl get events --all-namespaces)
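The collection steps above can be bundled into a single report you can attach to a support request. This is a minimal sketch that assumes kubectl is already configured for the target cluster; the cluster-report.txt filename is an arbitrary choice:

```shell
# Sketch: gather baseline cluster state into one file for later inspection.
{
  echo "=== Kubernetes version ==="
  kubectl version
  echo "=== Node status ==="
  kubectl get nodes -o wide
  echo "=== Pod status ==="
  kubectl get pods --all-namespaces
  echo "=== Recent events (oldest first) ==="
  kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
} > cluster-report.txt 2>&1
```

Redirecting stderr as well (2>&1) keeps connection or permission errors in the report, which is often the first clue to the underlying problem.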
  • Check the Component Logs:
    • Kubernetes system logs: /var/log/kube-* on control-plane and worker nodes (on systemd-based hosts, also check journalctl -u kubelet)
    • Container logs: Use kubectl logs <pod-name> -n <namespace>
    • NVIDIA GPU Operator logs: Check pods in the gpu-operator-resources namespace
    • Kubeflow logs: Examine pods in the kubeflow namespace
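When checking component logs, a quick way to sweep a whole namespace for unhealthy pods is sketched below. The kubeflow namespace is just an example; substitute gpu-operator-resources or any other namespace you are investigating:

```shell
# Sketch: tail recent logs from every pod in a namespace that is not Running.
# NS is an example value; adjust it to the namespace under investigation.
NS=kubeflow
for pod in $(kubectl get pods -n "$NS" \
    --field-selector=status.phase!=Running -o name); do
  echo "=== $pod ==="
  kubectl logs -n "$NS" "$pod" --tail=50 --all-containers=true
done
```

The --all-containers flag matters for multi-container pods (common in Kubeflow), where the failing container is often a sidecar rather than the main one.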
  • Verify Resource Usage:
    • Node resource utilization (kubectl top nodes)
    • Pod resource consumption (kubectl top pods --all-namespaces)
    • Pod resource requests and limits per node (kubectl describe node <hostname>)
    • GPU status (nvidia-smi on GPU-enabled nodes)
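The resource checks can be scripted as follows. Note that kubectl top requires the metrics-server add-on, and the NODE variable is a placeholder you would set to one of your own node names:

```shell
# Sketch: surface resource pressure across the cluster (needs metrics-server).
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory | head -n 20

# Allocatable vs. requested resources on one node (NODE is a placeholder):
NODE=worker-1
kubectl describe node "$NODE" | grep -A 8 "Allocated resources"

# GPU health, run on a GPU node (or inside a pod with GPU access):
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used \
  --format=csv
```

Comparing "Allocated resources" against nodes' capacity is the fastest way to spot pods that cannot schedule because requests exceed what is left on every node.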
  • Review Configuration Files:
    • Kubernetes configuration: /etc/kubernetes/*
    • Kubeflow manifests: Check your Kubeflow installation directory
    • NVIDIA GPU Operator values: Review your gpu-operator-values.yaml
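The configuration review can be done with a few commands. This sketch assumes a kubeadm-style install for the /etc/kubernetes paths, and a Helm-based GPU Operator install where the release name gpu-operator and the local gpu-operator-values.yaml path are assumptions to adjust:

```shell
# Sketch: configuration sanity checks (paths assume a kubeadm-style install).
ls -l /etc/kubernetes/manifests/   # static pod manifests for control-plane components
kubectl cluster-info               # confirms which API server the kubeconfig targets

# Compare the values actually deployed with your local file
# (release name, namespace, and values path are assumptions):
diff <(helm get values gpu-operator -n gpu-operator) gpu-operator-values.yaml
```

A non-empty diff here often explains "it works on one cluster but not the other" situations, since the deployed values are what the operator actually uses.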
  • Network Diagnostics:
    • DNS resolution (nslookup, dig)
    • Network connectivity (ping, traceroute)
    • Service endpoints (kubectl get endpoints)
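A convenient way to run the network checks from inside the cluster is a throwaway debug pod, sketched below; the busybox image tag is an example choice:

```shell
# Sketch: in-cluster DNS check via a temporary pod (image tag is an assumption).
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local

# Verify that Services have backing endpoints; an empty ENDPOINTS column
# usually means a label-selector mismatch or pods failing readiness probes:
kubectl get endpoints --all-namespaces
```

Running nslookup from a pod rather than a node is important: it exercises the cluster DNS (CoreDNS) path that your workloads actually use.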