When uninstalling MVAI using Helm, a timeout error may occur¶
Problem Summary¶
- Issue: When uninstalling MVAI using Helm, the following error may be observed:
$ helm uninstall --namespace cattle-system mvai --wait --timeout 30m
Error: 1 error occurred:
* timed out waiting for the condition
- Affected: GPU Cluster Manager v0.3.0 - current
- Cause: The timeout means that Helm waited up to 30 minutes for all Kubernetes resources associated with the mvai release to be deleted, but one or more resources were still not removed. Helm does not (by default) report which resources are stuck, so manual troubleshooting is necessary.
- Error Message: timed out waiting for the condition
Investigation Steps¶
- List Resources Still Present:
List all resources in the target namespace to see what's still there:
$ kubectl get all -n cattle-system
$ kubectl get pvc -n cattle-system
$ kubectl get crd -A | grep mvai
$ kubectl get events -n cattle-system --sort-by='.lastTimestamp'
Expected Output: If some resources are still in "Terminating" state, those are likely the cause.
- Check Resource Status and Details
Describe stuck resources for more detail:
# Replace <type> with the kind, e.g., pod, pvc, service, etc., and <name> with the resource name.
$ kubectl describe <type> <name> -n cattle-system
Expected Output: Look for finalizers, stuck volumes, or events indicating errors.
- Check for Finalizers Blocking Deletion
Many stuck resources (particularly CRDs, PVCs, or pods) have finalizers that block deletion. List and inspect them:
$ kubectl get <type> -n cattle-system -o json | jq '.items[].metadata.finalizers'
# Or for a specific resource
$ kubectl get <type> <name> -n cattle-system -o json | jq '.metadata.finalizers'
To remove a stuck finalizer (with caution):
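A common approach is to clear the finalizer list with a merge patch (shown here as a general example, not an MVAI-specific step):
# Removing finalizers skips the controller's normal cleanup; confirm the resource is safe to force-remove first
$ kubectl patch <type> <name> -n cattle-system -p '{"metadata":{"finalizers":[]}}' --type=merge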
- Check Helm Release Status
Even after a timeout, Helm may leave some metadata or secrets behind:
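# For example, check what Helm still knows about the release ("--all" also lists releases in states such as "uninstalling")
$ helm status mvai -n cattle-system
$ helm list -n cattle-system --all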
Expected Output: If the release is still listed as "uninstalling", consider deleting the stuck Helm release secrets:
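# Helm 3 stores release metadata in Secrets labeled owner=helm in the release namespace
$ kubectl get secrets -n cattle-system -l owner=helm,name=mvai
$ kubectl delete secret <release-secret-name> -n cattle-system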
- Inspect Logs for Stuck Pods or Jobs
If a pod or job is not terminating:
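# Replace <pod-name> and <job-name> with the stuck resources identified in Step 1
$ kubectl logs <pod-name> -n cattle-system
$ kubectl describe job <job-name> -n cattle-system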
Expected Output: Look for issues like volume detach errors, completed jobs stuck due to a finalizer, or orphaned resources.
- Commands to Verify Issue
Verify the secret is present and valid:
$ kubectl get secret memverge-dockerconfig -n cattle-system
If you see the following output, it indicates the secret is not available:
Error from server (NotFound): secrets "memverge-dockerconfig" not found
Resolution Steps¶
- Create a Kubernetes Docker Registry Secret
First, get a GitHub personal access token (PAT) from MemVerge Support (ensure it has the read:packages scope).
Then run:
$ kubectl create secret docker-registry memverge-dockerconfig \
--docker-server=ghcr.io \
--docker-username=<your-github-username> \
--docker-password=<your-github-token> \
--docker-email=ignored@example.com \
--namespace=cattle-system
- Replace <your-github-username> and <your-github-token> with the credentials provided by MemVerge Support.
- The email address is syntactically required, but ignored.
- Verify the Secret
Confirm the secret is now present in the cattle-system namespace; it should now show up:
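# Same check used under "Commands to Verify Issue" above
$ kubectl get secret memverge-dockerconfig -n cattle-system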
- Retry the Uninstall
After the secret is present, Helm should be able to proceed. You may need to manually delete or restart the failing jobs, or just re-run:
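# Same command as in the Problem Summary
$ helm uninstall --namespace cattle-system mvai --wait --timeout 30m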
If the uninstall jobs still fail, you can manually delete the stuck resources:
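# Example using the job names from the Problem Example below; adjust to match your cluster
$ kubectl delete job mvai-pre-delete mvai-pre-install -n cattle-system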
Then retry the uninstall.
Problem Example¶
The following example shows the output from the commands in Step 1 (List Resources Still Present):
ubuntu@mvai-mgmt:~$ kubectl get all -n cattle-system
NAME                         READY   STATUS             RESTARTS   AGE
pod/mvai-pre-delete-n6d7f    0/1     ImagePullBackOff   0          53m
pod/mvai-pre-install-987cq   0/1     ImagePullBackOff   0          8h

NAME                         STATUS    COMPLETIONS   DURATION   AGE
job.batch/mvai-pre-delete    Running   0/1           53m        53m
job.batch/mvai-pre-install   Running   0/1           8h         8h
ubuntu@mvai-mgmt:~$ kubectl get pvc -n cattle-system
No resources found in cattle-system namespace.
ubuntu@mvai-mgmt:~$ kubectl get crd -A | grep mvai
ubuntu@mvai-mgmt:~$ kubectl get events -n cattle-system --sort-by='.lastTimestamp'
LAST SEEN TYPE REASON OBJECT MESSAGE
54m Normal Scheduled pod/mvai-pre-delete-n6d7f Successfully assigned cattle-system/mvai-pre-delete-n6d7f to mvai-nvgpu02
54m Normal SuccessfulCreate job/mvai-pre-delete Created pod: mvai-pre-delete-n6d7f
53m Warning Failed pod/mvai-pre-delete-n6d7f Failed to pull image "ghcr.io/memverge/k8s-cli:v0.1.0": failed to pull and unpack image "ghcr.io/memverge/k8s-cli:v0.1.0": failed to resolve reference "ghcr.io/memverge/k8s-cli:v0.1.0": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3Amemverge%2Fk8s-cli%3Apull&service=ghcr.io: 401 Unauthorized
53m Warning Failed pod/mvai-pre-delete-n6d7f Error: ErrImagePull
53m Warning Failed pod/mvai-pre-delete-n6d7f Error: ImagePullBackOff
51m Normal Pulling pod/mvai-pre-delete-n6d7f Pulling image "ghcr.io/memverge/k8s-cli:v0.1.0"
4m37s Warning FailedToRetrieveImagePullSecret pod/mvai-pre-delete-n6d7f Unable to retrieve some image pull secrets (memverge-dockerconfig); attempting to pull the image may not succeed.
4m37s Normal BackOff pod/mvai-pre-delete-n6d7f Back-off pulling image "ghcr.io/memverge/k8s-cli:v0.1.0"
104s Warning FailedToRetrieveImagePullSecret pod/mvai-pre-install-987cq Unable to retrieve some image pull secrets (memverge-dockerconfig); attempting to pull the image may not succeed.
104s Normal BackOff pod/mvai-pre-install-987cq Back-off pulling image "ghcr.io/memverge/k8s-cli:v0.1.0"
The core issue is that both the mvai-pre-delete and mvai-pre-install jobs are failing because their pods cannot pull the image ghcr.io/memverge/k8s-cli:v0.1.0 due to authentication errors (401 Unauthorized and FailedToRetrieveImagePullSecret). This means the pods cannot fetch the image from the private GitHub Container Registry (ghcr.io/memverge) because the cluster does not have the necessary credentials. This is a common situation for images hosted in private or restricted registries.