InfraRunBook

    Kubernetes Cluster Upgrade Failing

    Kubernetes
    Published: Apr 18, 2026
    Updated: Apr 18, 2026

    Kubernetes cluster upgrades fail for predictable reasons. This runbook covers the five most common root causes — from missing etcd backups to deprecated APIs — with real commands and fixes for each.


    Symptoms

    You run

    kubeadm upgrade apply v1.29.0
    and something goes wrong. Maybe it hangs at the control plane step for ten minutes with no output. Maybe it exits with a cryptic error and you're left staring at a cluster where the API server version and kubelet versions are out of sync. Nodes report NotReady. Pods are stuck in Pending. Your on-call phone is ringing.

    Common symptoms you'll see across failed Kubernetes upgrades include:

    • The kubeadm upgrade apply command exits non-zero mid-flight with a timeout or API error
    • Control plane pods stuck in Pending or CrashLoopBackOff after the upgrade completes
    • Worker nodes stuck in SchedulingDisabled or NotReady state
    • etcd cluster showing unhealthy members or refusing writes
    • The API server returning 503 Service Unavailable or timing out on all requests
    • kubectl commands hanging or returning connection refused
    • Admission webhooks failing because the backing service never came back up

    The root causes are almost always one of a handful of well-known failure modes. Kubernetes upgrades are coordinated, multi-step processes where each component has strict compatibility rules. When something breaks, it usually breaks loudly — but knowing which failure mode you're hitting makes the difference between a 10-minute fix and a multi-hour incident. Let's go through each one systematically.

    Root Cause 1: etcd Backup Not Taken

    This isn't strictly a cause of upgrade failure — it's a preparation failure that turns a recoverable situation into an unrecoverable one. I've seen engineers confidently skip the backup step because "the upgrade will succeed anyway." That logic holds until it doesn't, and then you're staring at a corrupted etcd data directory with no path forward except rebuilding the cluster from scratch.

    etcd is the source of truth for your entire cluster state. If the upgrade corrupts etcd — which can happen when the control plane crashes mid-upgrade, when disk I/O spikes cause a write failure during the etcd binary swap, or when the etcd version bump hits an incompatibility — you need a snapshot to restore from. Without one, your cluster's configuration, secrets, deployments, and persistent volume bindings are gone. No snapshot means no rollback.
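    Since the failure mode here is "nobody took the snapshot," it's worth automating both the snapshot and the pruning of old ones. A sketch of the retention half; the directory, filename pattern, and keep-three policy are illustrative choices, not from this runbook:

```shell
# Prune old etcd snapshots so the backup disk never fills; keep the newest three.
# Directory, filename pattern, and retention count are illustrative.
BACKUP_DIR=$(mktemp -d)   # stands in for /var/backup/etcd
for d in 01 02 03 04 05; do : > "$BACKUP_DIR/snapshot-202604${d}-020000.db"; done
# Date-stamped names sort chronologically, so lexical order matches age order.
ls "$BACKUP_DIR"/snapshot-*.db | sort | head -n -3 | xargs -r rm --
ls "$BACKUP_DIR"
```

    Run this right after the etcdctl snapshot save step in your backup cron job, pointed at the real backup directory.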

    How to Identify It

    Before any upgrade, verify a backup exists and is recent. Take a manual snapshot from the control plane node and validate it:

    ETCDCTL_API=3 etcdctl snapshot save /var/backup/etcd/snapshot-$(date +%Y%m%d-%H%M%S).db \
      --endpoints=https://10.0.1.10:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key

    Verify the snapshot is valid and non-empty:

    ETCDCTL_API=3 etcdctl snapshot status /var/backup/etcd/snapshot-20260418-143000.db --write-out=table
    
    +----------+----------+------------+------------+
    |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
    +----------+----------+------------+------------+
    | 3a7b2c1d |   482931 |       2847 |     48 MB  |
    +----------+----------+------------+------------+

    If you're already in the middle of a failed upgrade and etcd is showing distress, you'll see this from the health check:

    ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://10.0.1.10:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key
    
    https://10.0.1.10:2379 is unhealthy: failed to commit proposal: context deadline exceeded

    How to Fix It

    If etcd is corrupted and you have a snapshot, restore it. First stop the static pod by moving the manifest out of the static pod directory — kubelet will immediately stop managing the etcd container:

    mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml.bak
    sleep 5
    # Confirm etcd is stopped
    crictl ps | grep etcd

    Restore the snapshot to a new data directory:

    ETCDCTL_API=3 etcdctl snapshot restore /var/backup/etcd/snapshot-20260418-143000.db \
      --data-dir=/var/lib/etcd-restore \
      --name=sw-infrarunbook-01 \
      --initial-cluster=sw-infrarunbook-01=https://10.0.1.10:2380 \
      --initial-cluster-token=etcd-cluster-1 \
      --initial-advertise-peer-urls=https://10.0.1.10:2380

    Update the etcd static pod manifest to reference /var/lib/etcd-restore as the data directory, then move the manifest back. etcd comes up, the API server reconnects, and your cluster state is restored to the snapshot point. Always take the backup. It takes 30 seconds and it's the difference between a recoverable incident and a disaster declaration.
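    In a kubeadm-default manifest this is typically a one-line change: repoint the hostPath volume at the restored directory and leave the in-container mount and --data-dir flag alone. A sketch of the relevant fragment; your manifest may differ:

```yaml
# /etc/kubernetes/manifests/etcd.yaml (fragment, kubeadm defaults assumed)
volumes:
- hostPath:
    path: /var/lib/etcd-restore   # was /var/lib/etcd
    type: DirectoryOrCreate
  name: etcd-data
```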


    Root Cause 2: Version Skew Too Large

    Kubernetes enforces a strict n±1 minor version skew policy between components. You cannot upgrade from 1.26 to 1.29 in a single jump. I've seen teams try this — usually because they've been deferring upgrades for months — and the result is either kubeadm refusing outright, or a partial upgrade that leaves the cluster in an inconsistent state where different nodes are running incompatible kubelet versions.

    The skew policy also applies between individual components. kubelets may lag the API server by at most three minor versions (two on releases before 1.28). If your kube-apiserver is running 1.28 but your kubelets are still on 1.24, you're outside the supported range. You'll start seeing strange behavior: pods not scheduling correctly, node conditions misreported, or kubelets failing to register after a restart.

    How to Identify It

    Check your current node versions before touching anything:

    kubectl get nodes -o wide
    
    NAME               STATUS   ROLES           AGE   VERSION    INTERNAL-IP
    sw-infrarunbook-01 Ready    control-plane   180d  v1.26.12   10.0.1.10
    worker-node-01     Ready    <none>          180d  v1.25.10   10.0.1.11
    worker-node-02     Ready    <none>          180d  v1.25.10   10.0.1.12

    If you attempt to jump more than one minor version, kubeadm will block you:

    kubeadm upgrade apply v1.29.0
    
    [preflight] Running pre-flight checks.
    error execution phase preflight: [preflight] Some fatal errors occurred:
      [ERROR InvalidNewVersion]: Specified version to upgrade to "v1.29.0" is too far
      from the current version "v1.26.12". Kubernetes only supports upgrades from one
      minor version to the next. Please upgrade from v1.26 to v1.27 first.

    Also check what's installed at the package level, not just what's reported via the API — the two can diverge after a partial upgrade:

    ssh infrarunbook-admin@10.0.1.10 "kubelet --version && kubeadm version -o short"
    
    Kubernetes v1.26.12
    v1.26.12

    How to Fix It

    Upgrade one minor version at a time, in strict sequence: 1.26 → 1.27 → 1.28 → 1.29. For each hop, update kubeadm first, run the control plane upgrade, then drain and upgrade each node's kubelet and kubectl:

    # On sw-infrarunbook-01 — upgrade kubeadm to the next minor version
    # (the -00 revision suffix matches the legacy apt.kubernetes.io repo;
    # use whatever suffix your repository publishes, e.g. 1.27.0-1.1 on pkgs.k8s.io)
    apt-mark unhold kubeadm && \
    apt-get update && \
    apt-get install -y kubeadm=1.27.0-00 && \
    apt-mark hold kubeadm
    
    # Verify the version before applying
    kubeadm version
    
    # Apply the control plane upgrade
    kubeadm upgrade apply v1.27.0
    
    # Then on each node, drain, upgrade kubelet and kubectl, uncordon
    kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data
    ssh infrarunbook-admin@10.0.1.11 \
      "apt-mark unhold kubelet kubectl && \
       apt-get install -y kubelet=1.27.0-00 kubectl=1.27.0-00 && \
       apt-mark hold kubelet kubectl && \
       systemctl daemon-reload && \
       systemctl restart kubelet"
    kubectl uncordon worker-node-01

    Repeat for each minor version hop. It feels tedious when you're trying to close a three-version gap, but there's no supported shortcut.
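    The hop sequence itself is mechanical enough to drive from a loop. A sketch, where hops is a hypothetical helper that assumes plain "1.&lt;minor&gt;" version strings; each emitted version stands for one full control-plane-plus-node-rollout cycle:

```shell
# Hypothetical helper: list every minor-version hop between current and target.
hops() {
  cur=${1#1.}; tgt=${2#1.}     # strip the "1." major prefix (illustrative assumption)
  i=$((cur + 1))
  while [ "$i" -le "$tgt" ]; do
    printf '1.%s\n' "$i"       # one kubeadm upgrade apply + node drain cycle each
    i=$((i + 1))
  done
}
hops 1.26 1.29
```

    Feeding each emitted version through the kubeadm/kubelet sequence above closes the gap without ever violating the skew policy.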


    Root Cause 3: Node Drain Failing

    Before upgrading the kubelet on any node, you need to drain it — evicting all pods so workloads reschedule elsewhere before the node goes offline for maintenance. This sounds straightforward, but it fails in production clusters more often than you'd expect. The usual culprits are PodDisruptionBudgets configured too aggressively, pods with stuck finalizers that will never terminate, and static or mirror pods that drain won't touch without extra flags.

    How to Identify It

    Run the drain command and watch what happens:

    kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data
    
    node/worker-node-01 cordoned
    evicting pod default/app-deployment-7d9f8c-xk2p9
    evicting pod default/app-deployment-7d9f8c-mn4r7
    error when evicting pods/"app-deployment-7d9f8c-xk2p9" -n "default" (will retry after 5s):
      Cannot evict pod as it would violate the pod's disruption budget.
    error when evicting pods/"app-deployment-7d9f8c-mn4r7" -n "default" (will retry after 5s):
      Cannot evict pod as it would violate the pod's disruption budget.

    The drain is blocked because the PDB's minAvailable threshold is leaving no room for disruption. Inspect the PDB:

    kubectl get pdb -A
    
    NAMESPACE   NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    default     app-pdb          3               N/A               0                     45d

    ALLOWED DISRUPTIONS: 0 means all three required replicas are already accounted for — evicting any one of them violates the budget. You might also encounter stuck finalizers. A pod waiting on a controller that no longer exists will never terminate on its own:

    kubectl get pod stuck-pod-abc123 -n default -o jsonpath='{.metadata.finalizers}'
    
    ["foregroundDeletion","example.io/cleanup"]

    How to Fix It

    For PDB violations, the correct fix depends on intent. If the budget is protecting a legitimately critical service, scale the deployment up so eviction headroom opens before you drain:

    # Scale up to create headroom, then retry the drain
    kubectl scale deployment app-deployment --replicas=6 -n default
    kubectl rollout status deployment/app-deployment -n default
    kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data

    If the PDB is misconfigured or this is a planned maintenance window where you need to proceed regardless:

    kubectl patch pdb app-pdb -n default --type='json' \
      -p='[{"op": "replace", "path": "/spec/minAvailable", "value": 0}]'
    
    # Restore after maintenance
    kubectl patch pdb app-pdb -n default --type='json' \
      -p='[{"op": "replace", "path": "/spec/minAvailable", "value": 3}]'

    For stuck finalizers, remove them manually after confirming the owning controller is gone and it's safe to do so:

    kubectl patch pod stuck-pod-abc123 -n default --type='json' \
      -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

    For pods that drain won't evict on its own — bare pods with no managing controller, for example — use --force with an explicit grace period. Note that --force doesn't touch kubelet-managed mirror pods; drain always skips those, and that's fine, because they restart with the node:

    kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data --force --grace-period=60

    Don't forget to uncordon the node after the upgrade is complete. I've seen nodes left in SchedulingDisabled for days because the engineer forgot this step after an incident.
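    A quick sweep for forgotten cordons is easy to script. Against a live cluster you'd pipe kubectl get nodes --no-headers into the same awk filter; printf stands in for that output below so the filter itself is visible:

```shell
# Flag nodes still cordoned after maintenance (column 2 carries the status).
# printf stands in for: kubectl get nodes --no-headers
printf '%s\n' \
  'worker-node-01   Ready,SchedulingDisabled   <none>   180d   v1.27.0' \
  'worker-node-02   Ready                      <none>   180d   v1.27.0' |
awk '$2 ~ /SchedulingDisabled/ {print $1}'
```

    Anything the filter prints needs a kubectl uncordon before you can call the maintenance done.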


    Root Cause 4: Addon Compatibility Issue

    Kubernetes addons — your CNI plugin, CoreDNS, kube-proxy, the metrics server, ingress controllers, storage provisioners — all have their own compatibility matrices. Upgrading the cluster without upgrading the addons, or upgrading addons to versions that don't support the new Kubernetes API surface, breaks things in ways that are easy to miss until users start reporting DNS resolution failures or networking dead zones.

    In my experience, CNI plugins are the most common culprit here. A Calico version that worked perfectly on 1.26 might fail to initialize on 1.29 due to changes in how node resources are managed. CoreDNS is another frequent issue — the ConfigMap format changes between releases and the kubeadm upgrade process doesn't always update it cleanly, especially if you've customized the Corefile.

    How to Identify It

    After the upgrade, scan the kube-system namespace for unhealthy pods immediately:

    kubectl get pods -n kube-system
    
    NAME                                    READY   STATUS             RESTARTS   AGE
    calico-node-4xk9m                       0/1     CrashLoopBackOff   8          12m
    calico-node-7r2hn                       0/1     CrashLoopBackOff   7          12m
    coredns-5d78c9869d-6p9m2                0/1     CrashLoopBackOff   3          8m
    kube-proxy-9sdkl                        1/1     Running            0          15m
    kube-apiserver-sw-infrarunbook-01       1/1     Running            0          20m

    Pull the logs from the crashing Calico pod:

    kubectl logs calico-node-4xk9m -n kube-system --previous | tail -15
    
    2026-04-18 14:32:11.421 [ERROR] Failed to read node resource:
      the server could not find the requested resource
      (get nodes.v1alpha1.crd.projectcalico.org)
    2026-04-18 14:32:11.421 [FATAL] Failed to initialize datastore:
      client is not compatible with this Kubernetes version
      ensure your Calico version supports Kubernetes v1.29

    For CoreDNS, check the logs for plugin or configuration errors:

    kubectl logs coredns-5d78c9869d-6p9m2 -n kube-system
    
    [ERROR] plugin/errors: 2 SERVFAIL (no healthy upstream)
    [FATAL] Failed to initialize: invalid Corefile: unknown plugin "kubernetes"
    plugin/forward: no healthy upstream host

    How to Fix It

    For CoreDNS, first check what version is deployed and what version is recommended for your Kubernetes release (see the Kubernetes changelog for each version):

    kubectl get deployment coredns -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[0].image}'
    
    registry.k8s.io/coredns/coredns:v1.9.3
    
    # Update to the recommended version for Kubernetes 1.29
    kubectl set image deployment/coredns \
      coredns=registry.k8s.io/coredns/coredns:v1.11.1 -n kube-system
    
    kubectl rollout status deployment/coredns -n kube-system

    For CNI plugins, consult the plugin's compatibility matrix and apply the appropriate manifests:

    # Check current Calico version
    kubectl get daemonset calico-node -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[0].image}'
    
    docker.io/calico/node:v3.25.0
    
    # Apply manifests for the Kubernetes 1.29-compatible version
    kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml
    
    # Watch the rollout
    kubectl rollout status daemonset/calico-node -n kube-system

    Don't rely on kubeadm to manage your third-party addons for you. It handles CoreDNS and kube-proxy, but everything else is your responsibility. Check addon release notes before the upgrade, not after.
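    A practical way to do that check is to inventory every image running in kube-system and walk the list against each addon's compatibility matrix. Live, the list comes from kubectl get pods -n kube-system -o jsonpath='{.items[*].spec.containers[*].image}' piped through tr and sort; printf stands in for that output here so the de-duplication step is runnable:

```shell
# De-duplicate the addon image list before checking compatibility matrices.
# printf stands in for:
#   kubectl get pods -n kube-system \
#     -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n'
printf '%s\n' \
  'docker.io/calico/node:v3.25.0' \
  'registry.k8s.io/coredns/coredns:v1.9.3' \
  'docker.io/calico/node:v3.25.0' | sort -u
```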


    Root Cause 5: API Deprecated Resource Blocking

    Kubernetes removes deprecated APIs on a predictable schedule — typically deprecated for two or three releases before hard removal. When you upgrade across a release that drops an API version, any manifests or Helm charts still using those API versions will stop working. In some cases, kubeadm's pre-flight checks will refuse to proceed until you clean them up.

    The removals that catch the most teams off guard: 1.22 dropped networking.k8s.io/v1beta1 Ingress resources (deprecated since 1.19), and 1.25 dropped batch/v1beta1 CronJobs, PodSecurityPolicy, and several other APIs deprecated since 1.21. If you haven't been tracking the deprecation notices in release notes, you'll hit this wall hard on upgrade day. The frustrating part is that your cluster was running fine — everything looked green — right up until the upgrade attempted to proceed.

    How to Identify It

    kubeadm's pre-flight phase will block the upgrade when it detects in-use removed APIs:

    kubeadm upgrade apply v1.25.0
    
    [preflight] Running pre-flight checks.
    [upgrade/config] Making sure the configuration is correct:
    error: unable to upgrade: the cluster has objects that use removed API versions:
      - batch/v1beta1 CronJob: default/nightly-backup
      - networking.k8s.io/v1beta1 Ingress: default/app-ingress
      - networking.k8s.io/v1beta1 Ingress: production/api-ingress
    
    Please update these objects to use current API versions before upgrading.

    Use pluto for a comprehensive sweep of both live cluster resources and Helm chart templates:

    pluto detect-all-in-cluster --target-versions k8s=v1.25.0
    
    NAME              NAMESPACE    KIND      VERSION                       REPLACEMENT            REMOVED   DEPRECATED
    nightly-backup    default      CronJob   batch/v1beta1                 batch/v1               true      true
    app-ingress       default      Ingress   networking.k8s.io/v1beta1     networking.k8s.io/v1   true      true
    api-ingress       production   Ingress   networking.k8s.io/v1beta1     networking.k8s.io/v1   true      true

    You can also check the live API groups to see what's actually available in the current cluster:

    kubectl api-versions | grep -E 'batch|networking'
    
    batch/v1
    networking.k8s.io/v1
    
    # batch/v1beta1 and networking.k8s.io/v1beta1 are gone post-1.25

    How to Fix It

    Migrate each affected resource to the current API version before the upgrade. For CronJobs, updating the apiVersion is the only change needed — the spec structure is identical:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: nightly-backup
      namespace: default
    spec:
      schedule: "0 2 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: backup
                image: backup-tool:1.2.3
              restartPolicy: OnFailure

    For Ingress resources, networking.k8s.io/v1 requires an explicit pathType field on each path rule, and the backend moves into a nested service block with name and port — swapping the apiVersion alone isn't enough:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app-ingress
      namespace: default
    spec:
      rules:
      - host: app.solvethenetwork.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80

    If you're using Helm charts that bake in the old API versions, you need to either update to a chart version that targets the new APIs or patch the chart templates yourself. Don't try to fix manifests and run the upgrade simultaneously — fix and verify first, then upgrade. The Kubernetes API migration guide for each release is the authoritative reference for what changed and what the replacement looks like.
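    A cheap first pass over a chart is to render it and grep for beta apiVersions. Live, you'd feed helm template &lt;release&gt; &lt;chart&gt; into the grep; a here-doc stands in for the rendered manifest here so the pattern is testable:

```shell
# Scan rendered chart output for beta apiVersions before touching the cluster.
# The here-doc stands in for: helm template <release> <chart>
grep -n 'apiVersion: .*v1beta1' <<'EOF'
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
---
apiVersion: batch/v1
kind: CronJob
EOF
```

    This catches only exact v1beta1 strings, so treat it as a smoke test; pluto remains the thorough check.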


    Prevention

    Most failed Kubernetes upgrades are predictable. The pattern I've seen repeatedly is that teams skip preparation steps because the last upgrade went fine. Then something changes — a new addon version, a long-deferred upgrade that accumulated version skew, a PDB that wasn't there last time — and suddenly they're in an incident at 11pm.

    Build a pre-upgrade checklist and actually enforce it. At minimum: take an etcd snapshot and validate it, read the target version's release notes for removed APIs, run pluto against your live cluster and all Helm releases, confirm every addon has a compatible version available, verify all nodes are in Ready state, check that PDBs won't block node drains, and run kubeadm upgrade plan to let the tool surface issues before you commit:

    kubeadm upgrade plan
    
    Components that must be upgraded manually after you upgrade the control plane:
    COMPONENT   CURRENT       TARGET
    kubelet     3 x v1.27.0   v1.28.6
    
    Upgrade to the latest stable version:
    COMPONENT                 CURRENT    TARGET
    kube-apiserver            v1.27.12   v1.28.6
    kube-controller-manager   v1.27.12   v1.28.6
    kube-scheduler            v1.27.12   v1.28.6
    kube-proxy                v1.27.12   v1.28.6
    CoreDNS                   v1.10.1    v1.11.1
    etcd                      3.5.9-0    3.5.12-0

    Test upgrades in a non-production environment that mirrors your production cluster as closely as possible. That means the same addons, same Helm chart versions, same custom resource definitions, and ideally a similar workload profile. A staging cluster that looks nothing like prod defeats the purpose — you'll validate the upgrade on paper and then hit surprises on the real thing.
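    One concrete way to keep staging honest is to diff its addon versions against prod before every rehearsal. Live, each file would come from listing kube-system container images with kubectl --context &lt;ctx&gt;; stand-in data keeps the diff runnable here:

```shell
# Diff addon versions between staging and prod; any output is drift to reconcile
# before trusting the staging rehearsal. printf stands in for the kubectl output.
printf '%s\n' 'calico/node:v3.27.0' 'coredns:v1.11.1' > /tmp/staging-addons.txt
printf '%s\n' 'calico/node:v3.25.0' 'coredns:v1.11.1' > /tmp/prod-addons.txt
diff /tmp/staging-addons.txt /tmp/prod-addons.txt || true
```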

    During the upgrade, keep two terminals open: one watching kubectl get nodes -w and another watching kubectl get pods -n kube-system -w. The moment a pod enters CrashLoopBackOff or a node flips to NotReady, you want to know immediately rather than discovering it after 20 minutes of compounding failures.

    Finally, document the upgrade after it succeeds. Write down the exact sequence, which addon versions you moved to, and any issues you hit along the way. That document becomes your runbook for the next upgrade and saves the next engineer from rediscovering the same failure modes from scratch.

    Frequently Asked Questions

    Can I skip multiple minor versions when upgrading Kubernetes?

    No. Kubernetes enforces a strict n±1 minor version skew policy. You must upgrade one minor version at a time — for example, 1.26 to 1.27, then 1.27 to 1.28. Attempting to jump multiple versions will cause kubeadm to abort during pre-flight checks.

    What should I do if kubectl drain hangs indefinitely?

    A stuck drain is usually caused by a PodDisruptionBudget with zero allowed disruptions, a pod with a stuck finalizer, or a pod that kubectl can't evict automatically. Check the PDB with kubectl get pdb -A, inspect pod finalizers with kubectl get pod -o jsonpath='{.metadata.finalizers}', and either scale up the deployment to create headroom or temporarily relax the PDB before retrying.

    How do I find deprecated API resources before a Kubernetes upgrade?

    Use the pluto CLI tool with pluto detect-all-in-cluster --target-versions k8s=&lt;target-version&gt;. It scans live cluster resources and Helm releases for deprecated or removed API versions. kubeadm upgrade apply will also list blocking deprecated resources during its pre-flight phase.

    How do I verify an etcd snapshot is valid before an upgrade?

    Run ETCDCTL_API=3 etcdctl snapshot status <snapshot-file> --write-out=table. This command displays the snapshot hash, revision number, total key count, and data size. A valid snapshot will show a non-zero key count and a reasonable data size. A zero-byte or corrupt snapshot will return an error immediately.

    Why are my addon pods crashing after a Kubernetes upgrade?

    Addon crashes after an upgrade are almost always a compatibility mismatch — the addon version you're running doesn't support the new Kubernetes API surface. Check the addon logs with kubectl logs -n kube-system, then consult the addon's compatibility matrix to find the version that supports your new Kubernetes release. Update CoreDNS, your CNI plugin, and any other cluster-level addons before declaring the upgrade complete.
