InfraRunBook

    Kubernetes Cluster Upgrade Failing

    Kubernetes
    Published: Apr 18, 2026
    Updated: Apr 18, 2026

    Kubernetes cluster upgrades fail for predictable reasons. This runbook covers the five most common root causes — from missing etcd backups to deprecated APIs — with real commands and fixes for each.


    Symptoms

    You run

    kubeadm upgrade apply v1.29.0
    and something goes wrong. Maybe it hangs at the control plane step for ten minutes with no output. Maybe it exits with a cryptic error and you're left staring at a cluster where the API server version and kubelet versions are out of sync. Nodes report NotReady. Pods are stuck in Pending. Your on-call phone is ringing.

    Common symptoms you'll see across failed Kubernetes upgrades include:

    • The kubeadm upgrade apply command exits non-zero mid-flight with a timeout or API error
    • Control plane pods stuck in Pending or CrashLoopBackOff after the upgrade completes
    • Worker nodes stuck in SchedulingDisabled or NotReady state
    • etcd cluster showing unhealthy members or refusing writes
    • The API server returning 503 Service Unavailable or timing out on all requests
    • kubectl commands hanging or returning connection refused
    • Admission webhooks failing because the backing service never came back up

    The root causes are almost always one of a handful of well-known failure modes. Kubernetes upgrades are coordinated, multi-step processes where each component has strict compatibility rules. When something breaks, it usually breaks loudly — but knowing which failure mode you're hitting makes the difference between a 10-minute fix and a multi-hour incident. Let's go through each one systematically.

    Root Cause 1: etcd Backup Not Taken

    This isn't strictly a cause of upgrade failure — it's a preparation failure that turns a recoverable situation into an unrecoverable one. I've seen engineers confidently skip the backup step because "the upgrade will succeed anyway." That logic holds until it doesn't, and then you're staring at a corrupted etcd data directory with no path forward except rebuilding the cluster from scratch.

    etcd is the source of truth for your entire cluster state. If the upgrade corrupts etcd — which can happen when the control plane crashes mid-upgrade, when disk I/O spikes cause a write failure during the etcd binary swap, or when the etcd version bump hits an incompatibility — you need a snapshot to restore from. Without one, your cluster's configuration, secrets, deployments, and persistent volume bindings are gone. No snapshot means no rollback.
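    Since the failure mode here is "nobody took the snapshot," it's worth automating both the snapshot and the pruning of old ones. A sketch of the retention half; the directory, filename pattern, and keep-three policy are illustrative choices, not from this runbook:

```shell
# Prune old etcd snapshots so the backup disk never fills; keep the newest three.
# Directory, filename pattern, and retention count are illustrative.
BACKUP_DIR=$(mktemp -d)   # stands in for /var/backup/etcd
for d in 01 02 03 04 05; do : > "$BACKUP_DIR/snapshot-202604${d}-020000.db"; done
# Date-stamped names sort chronologically, so lexical order matches age order.
ls "$BACKUP_DIR"/snapshot-*.db | sort | head -n -3 | xargs -r rm --
ls "$BACKUP_DIR"
```

    Run this right after the etcdctl snapshot save step in your backup cron job, pointed at the real backup directory.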

    How to Identify It

    Before any upgrade, verify a backup exists and is recent. Take a manual snapshot from the control plane node and validate it:

    ETCDCTL_API=3 etcdctl snapshot save /var/backup/etcd/snapshot-$(date +%Y%m%d-%H%M%S).db \
      --endpoints=https://10.0.1.10:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key

    Verify the snapshot is valid and non-empty:

    ETCDCTL_API=3 etcdctl snapshot status /var/backup/etcd/snapshot-20260418-143000.db --write-out=table
    
    +----------+----------+------------+------------+
    |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
    +----------+----------+------------+------------+
    | 3a7b2c1d |   482931 |       2847 |     48 MB  |
    +----------+----------+------------+------------+

    If you're already in the middle of a failed upgrade and etcd is showing distress, you'll see this from the health check:

    ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://10.0.1.10:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key
    
    https://10.0.1.10:2379 is unhealthy: failed to commit proposal: context deadline exceeded

    How to Fix It

    If etcd is corrupted and you have a snapshot, restore it. First stop the static pod by moving the manifest out of the static pod directory — kubelet will immediately stop managing the etcd container:

    mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml.bak
    sleep 5
    # Confirm etcd is stopped
    crictl ps | grep etcd

    Restore the snapshot to a new data directory:

    ETCDCTL_API=3 etcdctl snapshot restore /var/backup/etcd/snapshot-20260418-143000.db \
      --data-dir=/var/lib/etcd-restore \
      --name=sw-infrarunbook-01 \
      --initial-cluster=sw-infrarunbook-01=https://10.0.1.10:2380 \
      --initial-cluster-token=etcd-cluster-1 \
      --initial-advertise-peer-urls=https://10.0.1.10:2380

    Update the etcd static pod manifest to reference /var/lib/etcd-restore as the data directory, then move the manifest back. etcd comes up, the API server reconnects, and your cluster state is restored to the snapshot point. Always take the backup. It takes 30 seconds and it's the difference between a recoverable incident and a disaster declaration.
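    In a kubeadm-default manifest this is typically a one-line change: repoint the hostPath volume at the restored directory and leave the in-container mount and --data-dir flag alone. A sketch of the relevant fragment; your manifest may differ:

```yaml
# /etc/kubernetes/manifests/etcd.yaml (fragment, kubeadm defaults assumed)
volumes:
- hostPath:
    path: /var/lib/etcd-restore   # was /var/lib/etcd
    type: DirectoryOrCreate
  name: etcd-data
```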


    Root Cause 2: Version Skew Too Large

    Kubernetes enforces a strict n±1 minor version skew policy between components. You cannot upgrade from 1.26 to 1.29 in a single jump. I've seen teams try this — usually because they've been deferring upgrades for months — and the result is either kubeadm refusing outright, or a partial upgrade that leaves the cluster in an inconsistent state where different nodes are running incompatible kubelet versions.

    The skew policy also applies between individual components. kubelets may lag the API server by at most three minor versions (two on releases before 1.28). If your kube-apiserver is running 1.28 but your kubelets are still on 1.24, you're outside the supported range. You'll start seeing strange behavior: pods not scheduling correctly, node conditions misreported, or kubelets failing to register after a restart.

    How to Identify It

    Check your current node versions before touching anything:

    kubectl get nodes -o wide
    
    NAME               STATUS   ROLES           AGE   VERSION    INTERNAL-IP
    sw-infrarunbook-01 Ready    control-plane   180d  v1.26.12   10.0.1.10
    worker-node-01     Ready    <none>          180d  v1.25.10   10.0.1.11
    worker-node-02     Ready    <none>          180d  v1.25.10   10.0.1.12

    If you attempt to jump more than one minor version, kubeadm will block you:

    kubeadm upgrade apply v1.29.0
    
    [preflight] Running pre-flight checks.
    error execution phase preflight: [preflight] Some fatal errors occurred:
      [ERROR InvalidNewVersion]: Specified version to upgrade to "v1.29.0" is too far
      from the current version "v1.26.12". Kubernetes only supports upgrades from one
      minor version to the next. Please upgrade from v1.26 to v1.27 first.

    Also check what's installed at the package level, not just what's reported via the API — the two can diverge after a partial upgrade:

    ssh infrarunbook-admin@10.0.1.10 "kubelet --version && kubeadm version -o short"
    
    Kubernetes v1.26.12
    v1.26.12

    How to Fix It

    Upgrade one minor version at a time, in strict sequence: 1.26 → 1.27 → 1.28 → 1.29. For each hop, update kubeadm first, run the control plane upgrade, then drain and upgrade each node's kubelet and kubectl:

    # On sw-infrarunbook-01 — upgrade kubeadm to the next minor version
    # (the -00 revision suffix matches the legacy apt.kubernetes.io repo;
    # use whatever suffix your repository publishes, e.g. 1.27.0-1.1 on pkgs.k8s.io)
    apt-mark unhold kubeadm && \
    apt-get update && \
    apt-get install -y kubeadm=1.27.0-00 && \
    apt-mark hold kubeadm
    
    # Verify the version before applying
    kubeadm version
    
    # Apply the control plane upgrade
    kubeadm upgrade apply v1.27.0
    
    # Then on each node, drain, upgrade kubelet and kubectl, uncordon
    kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data
    ssh infrarunbook-admin@10.0.1.11 \
      "apt-mark unhold kubelet kubectl && \
       apt-get install -y kubelet=1.27.0-00 kubectl=1.27.0-00 && \
       apt-mark hold kubelet kubectl && \
       systemctl daemon-reload && \
       systemctl restart kubelet"
    kubectl uncordon worker-node-01

    Repeat for each minor version hop. It feels tedious when you're trying to close a three-version gap, but there's no supported shortcut.
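    The hop sequence itself is mechanical enough to drive from a loop. A sketch, where hops is a hypothetical helper that assumes plain "1.&lt;minor&gt;" version strings; each emitted version stands for one full control-plane-plus-node-rollout cycle:

```shell
# Hypothetical helper: list every minor-version hop between current and target.
hops() {
  cur=${1#1.}; tgt=${2#1.}     # strip the "1." major prefix (illustrative assumption)
  i=$((cur + 1))
  while [ "$i" -le "$tgt" ]; do
    printf '1.%s\n' "$i"       # one kubeadm upgrade apply + node drain cycle each
    i=$((i + 1))
  done
}
hops 1.26 1.29
```

    Feeding each emitted version through the kubeadm/kubelet sequence above closes the gap without ever violating the skew policy.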


    Root Cause 3: Node Drain Failing

    Before upgrading the kubelet on any node, you need to drain it — evicting all pods so workloads reschedule elsewhere before the node goes offline for maintenance. This sounds straightforward, but it fails in production clusters more often than you'd expect. The usual culprits are PodDisruptionBudgets configured too aggressively, pods with stuck finalizers that will never terminate, and static or mirror pods that drain won't touch without extra flags.

    How to Identify It

    Run the drain command and watch what happens:

    kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data
    
    node/worker-node-01 cordoned
    evicting pod default/app-deployment-7d9f8c-xk2p9
    evicting pod default/app-deployment-7d9f8c-mn4r7
    error when evicting pods/"app-deployment-7d9f8c-xk2p9" -n "default" (will retry after 5s):
      Cannot evict pod as it would violate the pod's disruption budget.
    error when evicting pods/"app-deployment-7d9f8c-mn4r7" -n "default" (will retry after 5s):
      Cannot evict pod as it would violate the pod's disruption budget.

    The drain is blocked because the PDB's minAvailable threshold is leaving no room for disruption. Inspect the PDB:

    kubectl get pdb -A
    
    NAMESPACE   NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    default     app-pdb          3               N/A               0                     45d

    ALLOWED DISRUPTIONS: 0 means all three required replicas are already accounted for — evicting any one of them violates the budget. You might also encounter stuck finalizers. A pod waiting on a controller that no longer exists will never terminate on its own:

    kubectl get pod stuck-pod-abc123 -n default -o jsonpath='{.metadata.finalizers}'
    
    ["foregroundDeletion","example.io/cleanup"]

    How to Fix It

    For PDB violations, the correct fix depends on intent. If the budget is protecting a legitimately critical service, scale the deployment up so eviction headroom opens before you drain:

    # Scale up to create headroom, then retry the drain
    kubectl scale deployment app-deployment --replicas=6 -n default
    kubectl rollout status deployment/app-deployment -n default
    kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data

    If the PDB is misconfigured or this is a planned maintenance window where you need to proceed regardless:

    kubectl patch pdb app-pdb -n default --type='json' \
      -p='[{"op": "replace", "path": "/spec/minAvailable", "value": 0}]'
    
    # Restore after maintenance
    kubectl patch pdb app-pdb -n default --type='json' \
      -p='[{"op": "replace", "path": "/spec/minAvailable", "value": 3}]'

    For stuck finalizers, remove them manually after confirming the owning controller is gone and it's safe to do so:

    kubectl patch pod stuck-pod-abc123 -n default --type='json' \
      -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

    For pods that drain won't evict on its own — bare pods with no managing controller, for example — use --force with an explicit grace period. Note that --force doesn't touch kubelet-managed mirror pods; drain always skips those, and that's fine, because they restart with the node:

    kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data --force --grace-period=60

    Don't forget to uncordon the node after the upgrade is complete. I've seen nodes left in SchedulingDisabled for days because the engineer forgot this step after an incident.
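    A quick sweep for forgotten cordons is easy to script. Against a live cluster you'd pipe kubectl get nodes --no-headers into the same awk filter; printf stands in for that output below so the filter itself is visible:

```shell
# Flag nodes still cordoned after maintenance (column 2 carries the status).
# printf stands in for: kubectl get nodes --no-headers
printf '%s\n' \
  'worker-node-01   Ready,SchedulingDisabled   <none>   180d   v1.27.0' \
  'worker-node-02   Ready                      <none>   180d   v1.27.0' |
awk '$2 ~ /SchedulingDisabled/ {print $1}'
```

    Anything the filter prints needs a kubectl uncordon before you can call the maintenance done.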


    Root Cause 4: Addon Compatibility Issue

    Kubernetes addons — your CNI plugin, CoreDNS, kube-proxy, the metrics server, ingress controllers, storage provisioners — all have their own compatibility matrices. Upgrading the cluster without upgrading the addons, or upgrading addons to versions that don't support the new Kubernetes API surface, breaks things in ways that are easy to miss until users start reporting DNS resolution failures or networking dead zones.

    In my experience, CNI plugins are the most common culprit here. A Calico version that worked perfectly on 1.26 might fail to initialize on 1.29 due to changes in how node resources are managed. CoreDNS is another frequent issue — the ConfigMap format changes between releases and the kubeadm upgrade process doesn't always update it cleanly, especially if you've customized the Corefile.

    How to Identify It

    After the upgrade, scan the kube-system namespace for unhealthy pods immediately:

    kubectl get pods -n kube-system
    
    NAME                                    READY   STATUS             RESTARTS   AGE
    calico-node-4xk9m                       0/1     CrashLoopBackOff   8          12m
    calico-node-7r2hn                       0/1     CrashLoopBackOff   7          12m
    coredns-5d78c9869d-6p9m2                0/1     CrashLoopBackOff   3          8m
    kube-proxy-9sdkl                        1/1     Running            0          15m
    kube-apiserver-sw-infrarunbook-01       1/1     Running            0          20m

    Pull the logs from the crashing Calico pod:

    kubectl logs calico-node-4xk9m -n kube-system --previous | tail -15
    
    2026-04-18 14:32:11.421 [ERROR] Failed to read node resource:
      the server could not find the requested resource
      (get nodes.v1alpha1.crd.projectcalico.org)
    2026-04-18 14:32:11.421 [FATAL] Failed to initialize datastore:
      client is not compatible with this Kubernetes version
      ensure your Calico version supports Kubernetes v1.29

    For CoreDNS, check the logs for plugin or configuration errors:

    kubectl logs coredns-5d78c9869d-6p9m2 -n kube-system
    
    [ERROR] plugin/errors: 2 SERVFAIL (no healthy upstream)
    [FATAL] Failed to initialize: invalid Corefile: unknown plugin "kubernetes"
    plugin/forward: no healthy upstream host

    How to Fix It

    For CoreDNS, first check what version is deployed and what version is recommended for your Kubernetes release (see the Kubernetes changelog for each version):

    kubectl get deployment coredns -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[0].image}'
    
    registry.k8s.io/coredns/coredns:v1.9.3
    
    # Update to the recommended version for Kubernetes 1.29
    kubectl set image deployment/coredns \
      coredns=registry.k8s.io/coredns/coredns:v1.11.1 -n kube-system
    
    kubectl rollout status deployment/coredns -n kube-system

    For CNI plugins, consult the plugin's compatibility matrix and apply the appropriate manifests:

    # Check current Calico version
    kubectl get daemonset calico-node -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[0].image}'
    
    docker.io/calico/node:v3.25.0
    
    # Apply manifests for the Kubernetes 1.29-compatible version
    kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml
    
    # Watch the rollout
    kubectl rollout status daemonset/calico-node -n kube-system

    Don't rely on kubeadm to manage your third-party addons for you. It handles CoreDNS and kube-proxy, but everything else is your responsibility. Check addon release notes before the upgrade, not after.
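    A practical way to do that check is to inventory every image running in kube-system and walk the list against each addon's compatibility matrix. Live, the list comes from kubectl get pods -n kube-system -o jsonpath='{.items[*].spec.containers[*].image}' piped through tr and sort; printf stands in for that output here so the de-duplication step is runnable:

```shell
# De-duplicate the addon image list before checking compatibility matrices.
# printf stands in for:
#   kubectl get pods -n kube-system \
#     -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n'
printf '%s\n' \
  'docker.io/calico/node:v3.25.0' \
  'registry.k8s.io/coredns/coredns:v1.9.3' \
  'docker.io/calico/node:v3.25.0' | sort -u
```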


    Root Cause 5: API Deprecated Resource Blocking

    Kubernetes removes deprecated APIs on a predictable schedule — typically deprecated for two or three releases before hard removal. When you upgrade across a release that drops an API version, any manifests or Helm charts still using those API versions will stop working. In some cases, kubeadm's pre-flight checks will refuse to proceed until you clean them up.

    The removals that catch the most teams off guard: 1.22 dropped networking.k8s.io/v1beta1 Ingress resources (deprecated since 1.19), and 1.25 dropped batch/v1beta1 CronJobs, PodSecurityPolicy, and several other APIs deprecated since 1.21. If you haven't been tracking the deprecation notices in release notes, you'll hit this wall hard on upgrade day. The frustrating part is that your cluster was running fine — everything looked green — right up until the upgrade attempted to proceed.

    How to Identify It

    kubeadm's pre-flight phase will block the upgrade when it detects in-use removed APIs:

    kubeadm upgrade apply v1.25.0
    
    [preflight] Running pre-flight checks.
    [upgrade/config] Making sure the configuration is correct:
    error: unable to upgrade: the cluster has objects that use removed API versions:
      - batch/v1beta1 CronJob: default/nightly-backup
      - networking.k8s.io/v1beta1 Ingress: default/app-ingress
      - networking.k8s.io/v1beta1 Ingress: production/api-ingress
    
    Please update these objects to use current API versions before upgrading.

    Use pluto for a comprehensive sweep of both live cluster resources and Helm chart templates:

    pluto detect-all-in-cluster --target-versions k8s=v1.25.0
    
    NAME              NAMESPACE    KIND      VERSION                       REPLACEMENT            REMOVED   DEPRECATED
    nightly-backup    default      CronJob   batch/v1beta1                 batch/v1               true      true
    app-ingress       default      Ingress   networking.k8s.io/v1beta1     networking.k8s.io/v1   true      true
    api-ingress       production   Ingress   networking.k8s.io/v1beta1     networking.k8s.io/v1   true      true

    You can also check the live API groups to see what's actually available in the current cluster:

    kubectl api-versions | grep -E 'batch|networking'
    
    batch/v1
    networking.k8s.io/v1
    
    # batch/v1beta1 and networking.k8s.io/v1beta1 are gone post-1.25

    How to Fix It

    Migrate each affected resource to the current API version before the upgrade. For CronJobs, updating the apiVersion is the only change needed — the spec structure is identical:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: nightly-backup
      namespace: default
    spec:
      schedule: "0 2 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: backup
                image: backup-tool:1.2.3
              restartPolicy: OnFailure

    For Ingress resources, networking.k8s.io/v1 requires an explicit pathType field on each path rule, and the backend moves into a nested service block with name and port — swapping the apiVersion alone isn't enough:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app-ingress
      namespace: default
    spec:
      rules:
      - host: app.solvethenetwork.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80

    If you're using Helm charts that bake in the old API versions, you need to either update to a chart version that targets the new APIs or patch the chart templates yourself. Don't try to fix manifests and run the upgrade simultaneously — fix and verify first, then upgrade. The Kubernetes API migration guide for each release is the authoritative reference for what changed and what the replacement looks like.
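    A cheap first pass over a chart is to render it and grep for beta apiVersions. Live, you'd feed helm template &lt;release&gt; &lt;chart&gt; into the grep; a here-doc stands in for the rendered manifest here so the pattern is testable:

```shell
# Scan rendered chart output for beta apiVersions before touching the cluster.
# The here-doc stands in for: helm template <release> <chart>
grep -n 'apiVersion: .*v1beta1' <<'EOF'
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
---
apiVersion: batch/v1
kind: CronJob
EOF
```

    This catches only exact v1beta1 strings, so treat it as a smoke test; pluto remains the thorough check.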


    Prevention

    Most failed Kubernetes upgrades are predictable. The pattern I've seen repeatedly is that teams skip preparation steps because the last upgrade went fine. Then something changes — a new addon version, a long-deferred upgrade that accumulated version skew, a PDB that wasn't there last time — and suddenly they're in an incident at 11pm.

    Build a pre-upgrade checklist and actually enforce it. At minimum: take an etcd snapshot and validate it, read the target version's release notes for removed APIs, run pluto against your live cluster and all Helm releases, confirm every addon has a compatible version available, verify all nodes are in Ready state, check that PDBs won't block node drains, and run kubeadm upgrade plan to let the tool surface issues before you commit:

    kubeadm upgrade plan
    
    Components that must be upgraded manually after you upgrade the control plane:
    COMPONENT   CURRENT       TARGET
    kubelet     3 x v1.27.0   v1.28.6
    
    Upgrade to the latest stable version:
    COMPONENT                 CURRENT    TARGET
    kube-apiserver            v1.27.12   v1.28.6
    kube-controller-manager   v1.27.12   v1.28.6
    kube-scheduler            v1.27.12   v1.28.6
    kube-proxy                v1.27.12   v1.28.6
    CoreDNS                   v1.10.1    v1.11.1
    etcd                      3.5.9-0    3.5.12-0

    Test upgrades in a non-production environment that mirrors your production cluster as closely as possible. That means the same addons, same Helm chart versions, same custom resource definitions, and ideally a similar workload profile. A staging cluster that looks nothing like prod defeats the purpose — you'll validate the upgrade on paper and then hit surprises on the real thing.
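    One concrete way to keep staging honest is to diff its addon versions against prod before every rehearsal. Live, each file would come from listing kube-system container images with kubectl --context &lt;ctx&gt;; stand-in data keeps the diff runnable here:

```shell
# Diff addon versions between staging and prod; any output is drift to reconcile
# before trusting the staging rehearsal. printf stands in for the kubectl output.
printf '%s\n' 'calico/node:v3.27.0' 'coredns:v1.11.1' > /tmp/staging-addons.txt
printf '%s\n' 'calico/node:v3.25.0' 'coredns:v1.11.1' > /tmp/prod-addons.txt
diff /tmp/staging-addons.txt /tmp/prod-addons.txt || true
```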

    During the upgrade, keep two terminals open: one watching kubectl get nodes -w and another watching kubectl get pods -n kube-system -w. The moment a pod enters CrashLoopBackOff or a node flips to NotReady, you want to know immediately rather than discovering it after 20 minutes of compounding failures.

    Finally, document the upgrade after it succeeds. Write down the exact sequence, which addon versions you moved to, and any issues you hit along the way. That document becomes your runbook for the next upgrade and saves the next engineer from rediscovering the same failure modes from scratch.

    Frequently Asked Questions

    Can I skip multiple minor versions when upgrading Kubernetes?

    No. Kubernetes enforces a strict n±1 minor version skew policy. You must upgrade one minor version at a time — for example, 1.26 to 1.27, then 1.27 to 1.28. Attempting to jump multiple versions will cause kubeadm to abort during pre-flight checks.

    What should I do if kubectl drain hangs indefinitely?

    A stuck drain is usually caused by a PodDisruptionBudget with zero allowed disruptions, a pod with a stuck finalizer, or a pod that kubectl can't evict automatically. Check the PDB with kubectl get pdb -A, inspect pod finalizers with kubectl get pod -o jsonpath='{.metadata.finalizers}', and either scale up the deployment to create headroom or temporarily relax the PDB before retrying.

    How do I find deprecated API resources before a Kubernetes upgrade?

    Use the pluto CLI tool with pluto detect-all-in-cluster --target-versions k8s=&lt;target-version&gt;. It scans live cluster resources and Helm releases for deprecated or removed API versions. kubeadm upgrade apply will also list blocking deprecated resources during its pre-flight phase.

    How do I verify an etcd snapshot is valid before an upgrade?

    Run ETCDCTL_API=3 etcdctl snapshot status <snapshot-file> --write-out=table. This command displays the snapshot hash, revision number, total key count, and data size. A valid snapshot will show a non-zero key count and a reasonable data size. A zero-byte or corrupt snapshot will return an error immediately.

    Why are my addon pods crashing after a Kubernetes upgrade?

    Addon crashes after an upgrade are almost always a compatibility mismatch — the addon version you're running doesn't support the new Kubernetes API surface. Check the addon logs with kubectl logs -n kube-system, then consult the addon's compatibility matrix to find the version that supports your new Kubernetes release. Update CoreDNS, your CNI plugin, and any other cluster-level addons before declaring the upgrade complete.
