Symptoms
You run kubeadm upgrade apply v1.29.0 and something goes wrong. Maybe it hangs at the control plane step for ten minutes with no output. Maybe it exits with a cryptic error and you're left staring at a cluster where the API server version and kubelet versions are out of sync. Nodes report NotReady. Pods are stuck in Pending. Your on-call phone is ringing.
Common symptoms you'll see across failed Kubernetes upgrades include:
- The kubeadm upgrade apply command exits non-zero mid-flight with a timeout or API error
- Control plane pods stuck in Pending or CrashLoopBackOff after the upgrade completes
- Worker nodes stuck in SchedulingDisabled or NotReady state
- etcd cluster showing unhealthy members or refusing writes
- The API server returning 503 Service Unavailable or timing out on all requests
- kubectl commands hanging or returning connection refused
- Admission webhooks failing because the backing service never came back up
The root causes are almost always one of a handful of well-known failure modes. Kubernetes upgrades are coordinated, multi-step processes where each component has strict compatibility rules. When something breaks, it usually breaks loudly — but knowing which failure mode you're hitting makes the difference between a 10-minute fix and a multi-hour incident. Let's go through each one systematically.
Root Cause 1: etcd Backup Not Taken
This isn't strictly a cause of upgrade failure — it's a preparation failure that turns a recoverable situation into an unrecoverable one. I've seen engineers confidently skip the backup step because "the upgrade will succeed anyway." That logic holds until it doesn't, and then you're staring at a corrupted etcd data directory with no path forward except rebuilding the cluster from scratch.
etcd is the source of truth for your entire cluster state. If the upgrade corrupts etcd — which can happen when the control plane crashes mid-upgrade, when disk I/O spikes cause a write failure during the etcd binary swap, or when the etcd version bump hits an incompatibility — you need a snapshot to restore from. Without one, your cluster's configuration, secrets, deployments, and persistent volume bindings are gone. No snapshot means no rollback.
How to Identify It
Before any upgrade, verify a backup exists and is recent. Take a manual snapshot from the control plane node and validate it:
ETCDCTL_API=3 etcdctl snapshot save /var/backup/etcd/snapshot-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://10.0.1.10:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Verify the snapshot is valid and non-empty:
ETCDCTL_API=3 etcdctl snapshot status /var/backup/etcd/snapshot-20260418-143000.db --write-out=table
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 3a7b2c1d | 482931 | 2847 | 48 MB |
+----------+----------+------------+------------+
If you're already in the middle of a failed upgrade and etcd is showing distress, you'll see this from the health check:
ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://10.0.1.10:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
https://10.0.1.10:2379 is unhealthy: failed to commit proposal: context deadline exceeded
How to Fix It
If etcd is corrupted and you have a snapshot, restore it. First stop the static pod by moving the manifest out of the static pod directory — kubelet will immediately stop managing the etcd container:
mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml.bak
sleep 5
# Confirm etcd is stopped
crictl ps | grep etcd
Restore the snapshot to a new data directory:
ETCDCTL_API=3 etcdctl snapshot restore /var/backup/etcd/snapshot-20260418-143000.db \
--data-dir=/var/lib/etcd-restore \
--name=sw-infrarunbook-01 \
--initial-cluster=sw-infrarunbook-01=https://10.0.1.10:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://10.0.1.10:2380
Update the etcd static pod manifest to reference /var/lib/etcd-restore as the data directory, then move the manifest back. etcd comes up, the API server reconnects, and your cluster state is restored to the snapshot point. Always take the backup. It takes 30 seconds and it's the difference between a recoverable incident and a disaster declaration.
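That manifest edit can be sketched as follows. The sed expression assumes the default kubeadm-generated etcd manifest, where /var/lib/etcd appears in both the --data-dir flag and the hostPath volume; verify the result before moving the file back:

```shell
# Point both the --data-dir flag and the hostPath volume at the restored
# directory (assumes the default kubeadm etcd manifest layout).
sed -i 's|/var/lib/etcd|/var/lib/etcd-restore|g' /tmp/etcd.yaml.bak
grep 'etcd-restore' /tmp/etcd.yaml.bak   # eyeball the change before proceeding
# Move the manifest back; kubelet restarts etcd within seconds
mv /tmp/etcd.yaml.bak /etc/kubernetes/manifests/etcd.yaml
crictl ps | grep etcd
```

The global replace is safe here because the certificate paths in the manifest live under /etc/kubernetes/pki/etcd and don't match the /var/lib/etcd pattern.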
Root Cause 2: Version Skew Too Large
Kubernetes enforces a strict n±1 minor version skew policy between components. You cannot upgrade from 1.26 to 1.29 in a single jump. I've seen teams try this — usually because they've been deferring upgrades for months — and the result is either kubeadm refusing outright, or a partial upgrade that leaves the cluster in an inconsistent state where different nodes are running incompatible kubelet versions.
The skew policy also applies between individual components. The kubelet may lag the kube-apiserver by only a bounded number of minor versions (up to three as of Kubernetes 1.28, and it must never be newer). If your kube-apiserver is running 1.28 but your kubelets are still on 1.24, you're outside the supported range. You'll start seeing strange behavior: pods not scheduling correctly, node conditions misreported, or kubelets failing to register after a restart.
How to Identify It
Check your current node versions before touching anything:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP
sw-infrarunbook-01 Ready control-plane 180d v1.26.12 10.0.1.10
worker-node-01 Ready <none> 180d v1.25.10 10.0.1.11
worker-node-02 Ready <none> 180d v1.25.10 10.0.1.12
If you attempt to jump more than one minor version, kubeadm will block you:
kubeadm upgrade apply v1.29.0
[preflight] Running pre-flight checks.
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR InvalidNewVersion]: Specified version to upgrade to "v1.29.0" is too far
from the current version "v1.26.12". Kubernetes only supports upgrades from one
minor version to the next. Please upgrade from v1.26 to v1.27 first.
Also check what's installed at the package level, not just what's reported via the API — the two can diverge after a partial upgrade:
ssh infrarunbook-admin@10.0.1.10 "kubelet --version && kubeadm version -o short"
Kubernetes v1.26.12
v1.26.12
How to Fix It
Upgrade one minor version at a time, in strict sequence: 1.26 → 1.27 → 1.28 → 1.29. For each hop, update kubeadm first, run the control plane upgrade, then drain and upgrade each node's kubelet and kubectl:
# On sw-infrarunbook-01 — upgrade kubeadm to the next minor version
apt-mark unhold kubeadm && \
apt-get update && \
apt-get install -y kubeadm=1.27.0-00 && \
apt-mark hold kubeadm
# Verify the version before applying
kubeadm version
# Apply the control plane upgrade
kubeadm upgrade apply v1.27.0
# Then on each node, drain, upgrade kubelet and kubectl, uncordon
kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data
ssh infrarunbook-admin@10.0.1.11 \
"apt-mark unhold kubelet kubectl && \
apt-get install -y kubelet=1.27.0-00 kubectl=1.27.0-00 && \
apt-mark hold kubelet kubectl && \
systemctl daemon-reload && \
systemctl restart kubelet"
kubectl uncordon worker-node-01
Repeat for each minor version hop. It feels tedious when you're trying to close a three-version gap, but there's no supported shortcut.
Root Cause 3: Node Drain Failing
Before upgrading the kubelet on any node, you need to drain it — evicting all pods so workloads reschedule elsewhere before the node goes offline for maintenance. This sounds straightforward, but it fails in production clusters more often than you'd expect. The usual culprits are PodDisruptionBudgets configured too aggressively, pods with stuck finalizers that will never terminate, and static or mirror pods that drain won't touch without extra flags.
How to Identify It
Run the drain command and watch what happens:
kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data
node/worker-node-01 cordoned
evicting pod default/app-deployment-7d9f8c-xk2p9
evicting pod default/app-deployment-7d9f8c-mn4r7
error when evicting pods/"app-deployment-7d9f8c-xk2p9" -n "default" (will retry after 5s):
Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pods/"app-deployment-7d9f8c-mn4r7" -n "default" (will retry after 5s):
Cannot evict pod as it would violate the pod's disruption budget.
The drain is blocked because the PDB's minAvailable threshold leaves no room for disruption. Inspect the PDB:
kubectl get pdb -A
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
default app-pdb 3 N/A 0 45d
ALLOWED DISRUPTIONS: 0 means all three required replicas are already accounted for — evicting any one of them violates the budget. You might also encounter stuck finalizers. A pod waiting on a controller that no longer exists will never terminate on its own:
kubectl get pod stuck-pod-abc123 -n default -o jsonpath='{.metadata.finalizers}'
["foregroundDeletion","example.io/cleanup"]
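Before starting a round of drains, you can sweep the whole cluster for budgets that already leave zero headroom. This is a read-only sketch that assumes jq is installed on the machine running kubectl:

```shell
# Find every PDB that currently allows zero disruptions -- these will
# block any drain that tries to evict one of their pods.
kubectl get pdb -A -o json | jq -r '
  .items[]
  | select(.status.disruptionsAllowed == 0)
  | "\(.metadata.namespace)/\(.metadata.name) allows 0 disruptions"'
```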
How to Fix It
For PDB violations, the correct fix depends on intent. If the budget is protecting a legitimately critical service, scale the deployment up so eviction headroom opens before you drain:
# Scale up to create headroom, then retry the drain
kubectl scale deployment app-deployment --replicas=6 -n default
kubectl rollout status deployment/app-deployment -n default
kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data
If the PDB is misconfigured or this is a planned maintenance window where you need to proceed regardless:
kubectl patch pdb app-pdb -n default --type='json' \
-p='[{"op": "replace", "path": "/spec/minAvailable", "value": 0}]'
# Restore after maintenance
kubectl patch pdb app-pdb -n default --type='json' \
-p='[{"op": "replace", "path": "/spec/minAvailable", "value": 3}]'
For stuck finalizers, remove them manually after confirming the owning controller is gone and it's safe to do so:
kubectl patch pod stuck-pod-abc123 -n default --type='json' \
-p='[{"op": "remove", "path": "/metadata/finalizers"}]'
For bare pods with no owning controller — which drain refuses to evict by default — use --force with an explicit grace period (mirror pods managed directly by the kubelet are skipped by drain automatically):
kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data --force --grace-period=60
Don't forget to uncordon the node after the upgrade is complete. I've seen nodes left in SchedulingDisabled for days because the engineer forgot this step after an incident.
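A quick post-maintenance sanity check catches this. The jq-based sweep below is a sketch (jq assumed installed) that lists any node still cordoned:

```shell
# List nodes whose spec.unschedulable flag is still set, so none are
# accidentally left out of the scheduling rotation after maintenance.
kubectl get nodes -o json | jq -r '
  .items[]
  | select(.spec.unschedulable == true)
  | .metadata.name + " is still cordoned"'
```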
Root Cause 4: Addon Compatibility Issue
Kubernetes addons — your CNI plugin, CoreDNS, kube-proxy, the metrics server, ingress controllers, storage provisioners — all have their own compatibility matrices. Upgrading the cluster without upgrading the addons, or upgrading addons to versions that don't support the new Kubernetes API surface, breaks things in ways that are easy to miss until users start reporting DNS resolution failures or networking dead zones.
In my experience, CNI plugins are the most common culprit here. A Calico version that worked perfectly on 1.26 might fail to initialize on 1.29 due to changes in how node resources are managed. CoreDNS is another frequent issue — the ConfigMap format changes between releases and the kubeadm upgrade process doesn't always update it cleanly, especially if you've customized the Corefile.
How to Identify It
After the upgrade, scan the kube-system namespace for unhealthy pods immediately:
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-node-4xk9m 0/1 CrashLoopBackOff 8 12m
calico-node-7r2hn 0/1 CrashLoopBackOff 7 12m
coredns-5d78c9869d-6p9m2 0/1 CrashLoopBackOff 3 8m
kube-proxy-9sdkl 1/1 Running 0 15m
kube-apiserver-sw-infrarunbook-01 1/1 Running 0 20m
Pull the logs from the crashing Calico pod:
kubectl logs calico-node-4xk9m -n kube-system --previous | tail -15
2026-04-18 14:32:11.421 [ERROR] Failed to read node resource:
the server could not find the requested resource
(get nodes.v1alpha1.crd.projectcalico.org)
2026-04-18 14:32:11.421 [FATAL] Failed to initialize datastore:
client is not compatible with this Kubernetes version
ensure your Calico version supports Kubernetes v1.29
For CoreDNS, check the logs for plugin or configuration errors:
kubectl logs coredns-5d78c9869d-6p9m2 -n kube-system
[ERROR] plugin/errors: 2 SERVFAIL (no healthy upstream)
[FATAL] Failed to initialize: invalid Corefile: unknown plugin "kubernetes"
plugin/forward: no healthy upstream host
How to Fix It
For CoreDNS, first check what version is deployed and what version is recommended for your Kubernetes release (see the Kubernetes changelog for each version):
kubectl get deployment coredns -n kube-system \
-o jsonpath='{.spec.template.spec.containers[0].image}'
registry.k8s.io/coredns/coredns:v1.9.3
# Update to the recommended version for Kubernetes 1.29
kubectl set image deployment/coredns \
coredns=registry.k8s.io/coredns/coredns:v1.11.1 -n kube-system
kubectl rollout status deployment/coredns -n kube-system
For CNI plugins, consult the plugin's compatibility matrix and apply the appropriate manifests:
# Check current Calico version
kubectl get daemonset calico-node -n kube-system \
-o jsonpath='{.spec.template.spec.containers[0].image}'
docker.io/calico/node:v3.25.0
# Apply manifests for the Kubernetes 1.29-compatible version
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml
# Watch the rollout
kubectl rollout status daemonset/calico-node -n kube-system
Don't rely on kubeadm to manage your third-party addons for you. It handles CoreDNS and kube-proxy, but everything else is your responsibility. Check addon release notes before the upgrade, not after.
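One low-effort habit that helps here: snapshot the image versions of everything running in kube-system before you start, so you have a concrete list to check against each addon's compatibility matrix. A sketch, assuming jq is available (the output filename is arbitrary):

```shell
# Record every container image currently running in kube-system.
# Compare this list against each addon's compatibility matrix before upgrading.
kubectl get pods -n kube-system -o json \
  | jq -r '.items[].spec.containers[].image' \
  | sort -u | tee pre-upgrade-addon-images.txt
```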
Root Cause 5: API Deprecated Resource Blocking
Kubernetes removes deprecated APIs on a predictable schedule — typically deprecated for two or three releases before hard removal. When you upgrade across a release that drops an API version, any manifests or Helm charts still using those API versions will stop working. In some cases, kubeadm's pre-flight checks will refuse to proceed until you clean them up.
The removals that catch the most teams off guard are networking.k8s.io/v1beta1 Ingress (removed in 1.22) and batch/v1beta1 CronJob (removed in 1.25), along with several other APIs that had been deprecated for multiple releases beforehand. If you haven't been tracking the deprecation notices in release notes, you'll hit this wall hard on upgrade day. The frustrating part is that your cluster was running fine — everything looked green — right up until the upgrade attempted to proceed.
How to Identify It
kubeadm's pre-flight phase may block the upgrade when it detects in-use removed APIs:
kubeadm upgrade apply v1.25.0
[preflight] Running pre-flight checks.
[upgrade/config] Making sure the configuration is correct:
error: unable to upgrade: the cluster has objects that use removed API versions:
- batch/v1beta1 CronJob: default/nightly-backup
- networking.k8s.io/v1beta1 Ingress: default/app-ingress
- networking.k8s.io/v1beta1 Ingress: production/api-ingress
Please update these objects to use current API versions before upgrading.
Use pluto for a comprehensive sweep of both live cluster resources and Helm chart templates:
pluto detect-all-in-cluster --target-versions k8s=v1.25.0
NAME NAMESPACE KIND VERSION REPLACEMENT REMOVED DEPRECATED
nightly-backup default CronJob batch/v1beta1 batch/v1 true true
app-ingress default Ingress networking.k8s.io/v1beta1 networking.k8s.io/v1 true true
api-ingress production Ingress networking.k8s.io/v1beta1 networking.k8s.io/v1 true true
You can also check the live API groups to see what's actually available in the current cluster:
kubectl api-versions | grep -E 'batch|networking'
batch/v1
networking.k8s.io/v1
# batch/v1beta1 and networking.k8s.io/v1beta1 are gone post-1.25
How to Fix It
Migrate each affected resource to the current API version before the upgrade. For CronJobs, changing the apiVersion is the primary change — the spec structure is the same:
apiVersion: batch/v1
kind: CronJob
metadata:
name: nightly-backup
namespace: default
spec:
schedule: "0 2 * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: backup-tool:1.2.3
restartPolicy: OnFailure
For Ingress resources, networking.k8s.io/v1 requires an explicit pathType field on each path rule — that's the one field that'll bite you if you just swap the apiVersion and move on:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: app-ingress
namespace: default
spec:
rules:
- host: app.solvethenetwork.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: app-service
port:
number: 80
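If you have a pile of manifests to migrate, the kubectl-convert plugin (distributed separately from kubectl) can mechanically rewrite apiVersions; the filenames below are placeholders, and you should still review the converted output by hand:

```shell
# kubectl-convert is a separate plugin, installed from the Kubernetes
# release artifacts. Review the converted manifest before applying it.
kubectl convert -f old-ingress.yaml --output-version networking.k8s.io/v1 > new-ingress.yaml
```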
If you're using Helm charts that bake in the old API versions, you need to either update to a chart version that targets the new APIs or patch the chart templates yourself. Don't try to fix manifests and run the upgrade simultaneously — fix and verify first, then upgrade. The Kubernetes API migration guide for each release is the authoritative reference for what changed and what the replacement looks like.
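For charts you deploy yourself, you can render the templates locally and pipe them through pluto before touching the cluster. The release and chart names below are placeholders for your own:

```shell
# Render a Helm release's manifests and scan the output for removed APIs,
# without deploying anything. "my-release" and "./my-chart" are placeholders.
helm template my-release ./my-chart | pluto detect - --target-versions k8s=v1.25.0
```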
Prevention
Most failed Kubernetes upgrades are predictable. The pattern I've seen repeatedly is that teams skip preparation steps because the last upgrade went fine. Then something changes — a new addon version, a long-deferred upgrade that accumulated version skew, a PDB that wasn't there last time — and suddenly they're in an incident at 11pm.
Build a pre-upgrade checklist and actually enforce it. At minimum: take an etcd snapshot and validate it, read the target version's release notes for removed APIs, run pluto against your live cluster and all Helm releases, confirm every addon has a compatible version available, verify all nodes are in Ready state, check that PDBs won't block node drains, and run kubeadm upgrade plan to let the tool surface issues before you commit:
kubeadm upgrade plan
Components that must be upgraded manually after you upgrade the control plane:
COMPONENT CURRENT TARGET
kubelet 3 x v1.27.0 v1.28.6
Upgrade to the latest stable version:
COMPONENT CURRENT TARGET
kube-apiserver v1.27.12 v1.28.6
kube-controller-manager v1.27.12 v1.28.6
kube-scheduler v1.27.12 v1.28.6
kube-proxy v1.27.12 v1.28.6
CoreDNS v1.10.1 v1.11.1
etcd 3.5.9-0 3.5.12-0
Test upgrades in a non-production environment that mirrors your production cluster as closely as possible. That means the same addons, same Helm chart versions, same custom resource definitions, and ideally a similar workload profile. A staging cluster that looks nothing like prod defeats the purpose — you'll validate the upgrade on paper and then hit surprises on the real thing.
During the upgrade, keep two terminals open: one watching kubectl get nodes -w and another watching kubectl get pods -n kube-system -w. The moment a pod enters CrashLoopBackOff or a node flips to NotReady, you want to know immediately rather than discovering it after 20 minutes of compounding failures.
Finally, document the upgrade after it succeeds. Write down the exact sequence, which addon versions you moved to, and any issues you hit along the way. That document becomes your runbook for the next upgrade and saves the next engineer from rediscovering the same failure modes from scratch.
