Troubleshooting

Even with careful setup and configuration, you may encounter issues with a system as complex as GPU Cluster Manager. This troubleshooting section is designed to help you diagnose and resolve common problems efficiently. Before diving into specific issues, it is important to gather relevant information and check the appropriate logs. Here are some general steps to follow when troubleshooting (command sketches that run several of these checks follow the list):

  1. Collect System Information:
    • Kubernetes version (kubectl version)
    • Node status (kubectl get nodes)
    • Pod status (kubectl get pods --all-namespaces)
    • Cluster events (kubectl get events --all-namespaces)
  2. Check the Component Logs:
    • Kubernetes system logs: /var/log/kube-* on master and worker nodes
    • Container logs: Use kubectl logs <pod-name> -n <namespace>
    • AMD or NVIDIA GPU Operator logs: Check pods in the gpu-operator-resources namespace
    • Kubeflow logs: Examine pods in the kubeflow namespace
  3. Verify Resource Usage:
    • Node resource utilization (kubectl top nodes)
    • Pod resource consumption (kubectl top pods --all-namespaces)
    • Pod resource requests and limits per node (kubectl describe node <hostname>)
    • GPU status on GPU-enabled nodes (AMD: rocm-smi, NVIDIA: nvidia-smi)
  4. Review Configuration Files:
    • Kubernetes configuration: /etc/kubernetes/*
    • Kubeflow manifests: Check your Kubeflow installation directory
    • NVIDIA GPU Operator values: Review your gpu-operator-values.yaml
  5. Network Diagnostics:
    • DNS resolution (nslookup, dig)
    • Network connectivity (ping, traceroute)
    • Service endpoints (kubectl get endpoints)
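
As a starting point, here is a minimal bash sketch (not part of GPU Cluster Manager) that collects the information from steps 1-3 into a timestamped directory. It assumes kubectl is already configured for the target cluster and that the metrics-server add-on is installed for the kubectl top commands; the output directory and file names are illustrative.

    #!/usr/bin/env bash
    # Sketch: collect basic cluster diagnostics into a timestamped directory.
    # Assumes kubectl is configured for the target cluster; "kubectl top"
    # requires the metrics-server add-on and is skipped on failure.
    set -euo pipefail

    OUT="cluster-diagnostics-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$OUT"

    # Step 1: system information
    kubectl version                           > "$OUT/version.txt"
    kubectl get nodes -o wide                 > "$OUT/nodes.txt"
    kubectl get pods --all-namespaces -o wide > "$OUT/pods.txt"
    kubectl get events --all-namespaces       > "$OUT/events.txt"

    # Step 2: component logs (namespaces taken from the list above;
    # adjust them to match your installation)
    for ns in gpu-operator-resources kubeflow; do
      for pod in $(kubectl get pods -n "$ns" -o name); do
        kubectl logs -n "$ns" "$pod" --all-containers \
          > "$OUT/$ns-$(basename "$pod").log" || true
      done
    done

    # Step 3: resource usage (requires metrics-server)
    kubectl top nodes                 > "$OUT/top-nodes.txt" || true
    kubectl top pods --all-namespaces > "$OUT/top-pods.txt" || true

    echo "Diagnostics written to $OUT/"

Run it from any machine with sufficient cluster access; the resulting directory can be attached to a support request or used as the basis for further analysis.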

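For step 5, DNS resolution and connectivity can also be checked from inside the cluster with a throwaway pod. The pod name, image, and test targets below are illustrative assumptions, not requirements of GPU Cluster Manager:

    # Hypothetical in-cluster network check; the pod is deleted when the
    # command exits. ping may fail if your pod security settings drop the
    # NET_RAW capability.
    kubectl run net-debug --rm -it --restart=Never --image=busybox:1.36 -- \
      sh -c 'nslookup kubernetes.default.svc.cluster.local && ping -c 3 8.8.8.8'
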
For problems outside of GPU Cluster Manager that might affect its operation, see the following Troubleshooting Guides specific to the GPU type you are using: