When uninstalling MMAI using Helm, timeout error may occur¶

Problem Summary¶

Issue: When uninstalling MMAI using Helm, the following error may be observed

$ helm uninstall --namespace cattle-system mmai --wait --timeout 30m
Error: 1 error occurred:
        * timed out waiting for the condition

Affected: GPU Cluster Manager v0.3.0 - current
Cause: The timeout error means that Helm waited up to 30 minutes for all Kubernetes resources associated with the mmai release to be deleted, but one or more resources were still not removed. Helm cannot (by default) display which resources are stuck, which makes troubleshooting necessary
Error Message: * timed out waiting for the condition.

Investigation Steps¶

List Resources Still Present:

List all resources in the target namespace to see what's still there:

$ kubectl get all -n cattle-system
$ kubectl get pvc -n cattle-system
$ kubectl get crd -A | grep mmai
$ kubectl get events -n cattle-system --sort-by='.lastTimestamp'

Expected Output: If some resources are still in "Terminating" state, those are likely the cause.

Check Resource Status and Details

Describe stuck resources for more detail:

# Replace <type> with the kind, e.g., pod, pvc, service, etc., and <name> with the resource name.
$ kubectl describe <type> <name> -n cattle-system

Expected Output: Look for finalizers, stuck volumes, or events indicating errors.

Check for Finalizers Blocking Deletion

Many stuck resources (particularly CRDs, PVCs, or pods) have finalizers that block deletion. List and inspect them:

$ kubectl get <type> -n cattle-system -o json | jq '.items[].metadata.finalizers'
# Or for a specific resource
$ kubectl get <type> <name> -n cattle-system -o json | jq '.metadata.finalizers'

To remove a stuck finalizer (with caution):

kubectl patch <type> <name> -n cattle-system -p '{"metadata":{"finalizers":[]}}' --type=merge

Check Helm Release Status

Even after a timeout, Helm may leave some metadata or secrets behind:

$ helm list -A | grep mmai
$ kubectl get secret -n cattle-system | grep mmai

Expected Output: Consider deleting stuck Helm release secrets if the release is still listed as "uninstalling":

$ kubectl delete secret -n cattle-system sh.helm.release.v1.mmai.v*

Inspect Logs for Stuck Pods or Jobs

If a pod or job is not terminating:

$ kubectl logs <podname> -n cattle-system
$ kubectl describe pod <podname> -n cattle-system

Expected Output: Look for issues like volume detach errors, completed jobs stuck due to a finalizer, or orphaned resources.

Commands to Verify Issue

Verify the secret is present and valid:

kubectl get secret memverge-dockerconfig -n cattle-system -o yaml

If you see the following output, it indicates the secret is not available:

$ kubectl get secret memverge-dockerconfig -n cattle-system -o yaml | grep -A2 "imagePullSecrets"
Error from server (NotFound): secrets "memverge-dockerconfig" not found

Resolution Steps¶

Create a Kubernetes Docker Registry Secret

First, get a GitHub personal access token (PAT) from MemVerge Support (ensure it has the read:packages scope).

Then run:

$ kubectl create secret docker-registry memverge-dockerconfig \
  --docker-server=ghcr.io \
  --docker-username=<your-github-username> \
  --docker-password=<your-github-token> \
  --docker-email=ignored@example.com \
  --namespace=cattle-system

Replace and with the credentials provided by MemVerge Support.
The email address is syntactically required, but ignored.
Verify the Secret

$ kubectl get secret memverge-dockerconfig -n cattle-system

It should now show up.

Retry the Uninstall

After the secret is present, Helm should be able to proceed. You may need to manually delete or restart the failing jobs, or just re-run:

$ helm uninstall --namespace cattle-system mmai --wait --timeout 30m

If the uninstall jobs still fail, you can manually delete the stuck resources:

$ kubectl delete job mmai-pre-delete mmai-pre-install -n cattle-system

Then retry the uninstall.

Problem Example¶

The following example shows the output from the commands in Step 1

ubuntu@mmai-mgmt:~$ kubectl get all -n cattle-system
NAME                         READY   STATUS             RESTARTS   AGE
pod/mmai-pre-delete-n6d7f    0/1     ImagePullBackOff   0          53m
pod/mmai-pre-install-987cq   0/1     ImagePullBackOff   0          8h

NAME                         STATUS    COMPLETIONS   DURATION   AGE
job.batch/mmai-pre-delete    Running   0/1           53m        53m
job.batch/mmai-pre-install   Running   0/1           8h         8h

ubuntu@mmai-mgmt:~$ kubectl get pvc -n cattle-system
No resources found in cattle-system namespace.
ubuntu@mmai-mgmt:~$ kubectl get crd -A | grep mmai
ubuntu@mmai-mgmt:~$ kubectl get events -n cattle-system --sort-by='.lastTimestamp'
LAST SEEN   TYPE      REASON                            OBJECT                       MESSAGE
54m         Normal    Scheduled                         pod/mmai-pre-delete-n6d7f    Successfully assigned cattle-system/mmai-pre-delete-n6d7f to mmai-nvgpu02
54m         Normal    SuccessfulCreate                  job/mmai-pre-delete          Created pod: mmai-pre-delete-n6d7f
53m         Warning   Failed                            pod/mmai-pre-delete-n6d7f    Failed to pull image "ghcr.io/memverge/k8s-cli:v0.1.0": failed to pull and unpack image "ghcr.io/memverge/k8s-cli:v0.1.0": failed to resolve reference "ghcr.io/memverge/k8s-cli:v0.1.0": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3Amemverge%2Fk8s-cli%3Apull&service=ghcr.io: 401 Unauthorized
53m         Warning   Failed                            pod/mmai-pre-delete-n6d7f    Error: ErrImagePull
53m         Warning   Failed                            pod/mmai-pre-delete-n6d7f    Error: ImagePullBackOff
51m         Normal    Pulling                           pod/mmai-pre-delete-n6d7f    Pulling image "ghcr.io/memverge/k8s-cli:v0.1.0"
4m37s       Warning   FailedToRetrieveImagePullSecret   pod/mmai-pre-delete-n6d7f    Unable to retrieve some image pull secrets (memverge-dockerconfig); attempting to pull the image may not succeed.
4m37s       Normal    BackOff                           pod/mmai-pre-delete-n6d7f    Back-off pulling image "ghcr.io/memverge/k8s-cli:v0.1.0"
104s        Warning   FailedToRetrieveImagePullSecret   pod/mmai-pre-install-987cq   Unable to retrieve some image pull secrets (memverge-dockerconfig); attempting to pull the image may not succeed.
104s        Normal    BackOff                           pod/mmai-pre-install-987cq   Back-off pulling image "ghcr.io/memverge/k8s-cli:v0.1.0"

The core issue is that both the mmai-pre-delete and mmai-pre-install jobs are failing because their pods cannot pull the image ghcr.io/memverge/k8s-cli:v0.1.0 due to authentication errors:

Failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?...: 401 Unauthorized

and

Unable to retrieve some image pull secrets (memverge-dockerconfig); attempting to pull the image may not succeed.

This means the pods cannot fetch the image from the private GitHub Container Registry (ghcr.io/memverge) because the cluster does not have the necessary credentials. This is a common situation for images hosted in private or restricted registries.