Symptoms
When a pod enters CrashLoopBackOff, Kubernetes is signalling that the container started, crashed, and is being restarted repeatedly — with each restart delayed exponentially (10s, 20s, 40s, 80s, up to a maximum of 5 minutes). The pod never stabilises. You will see it clearly when listing pods:
kubectl get pods -n production
NAME READY STATUS RESTARTS AGE
api-deployment-7d4b9c-xkp2z 0/1 CrashLoopBackOff 8 14m
worker-deployment-5f8bb-r9t4k 0/1 CrashLoopBackOff 12 22m
Additional symptoms you will observe:
- The RESTARTS counter climbs steadily over time.
- kubectl describe pod shows Back-off restarting failed container in the Events section.
- Logs are either empty, cut short, or show a clear application-level error just before the process exits.
- The pod may briefly flash Running before falling back into the crash loop.
- Dependent workloads that rely on this pod begin failing their own health checks.
The exit code in the pod's lastState.terminated field is a key diagnostic signal: exit code 1 typically means the application itself returned an error; exit code 137 (128 + 9) means the container was OOMKilled; exit code 143 (128 + 15) means SIGTERM was delivered and not handled. This runbook covers every major cause systematically.
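As a quick mental model, any exit code above 128 encodes a signal number (code − 128). A small illustrative helper — not part of any kubectl tooling — makes the arithmetic explicit:

```shell
#!/bin/sh
# Illustrative helper: map a container exit code to its likely cause.
# Codes above 128 mean the process was killed by a signal (code - 128).
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$(( code - 128 ))
    case $sig in
      9)  echo "SIGKILL (often OOMKilled)" ;;
      15) echo "SIGTERM (terminated, signal not handled)" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  elif [ "$code" -ne 0 ]; then
    echo "application error (exit $code)"
  else
    echo "clean exit"
  fi
}

decode_exit 137   # SIGKILL (often OOMKilled)
decode_exit 143   # SIGTERM (terminated, signal not handled)
```

The same arithmetic explains the codes you will meet throughout this runbook: 137 and 143 are signal deaths, while small codes like 1 or 2 come from the application itself.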
Root Cause 1: Bad Config or Environment Variable
Why It Happens
Applications commonly read critical parameters — database URIs, port numbers, feature flags, log levels — from environment variables at startup. If a required variable is absent, misspelled in the manifest, or holds an invalid value (a string where an integer is expected, a malformed URL, an unsupported flag value), the application process exits immediately with a non-zero exit code. Kubernetes treats any non-zero exit as a container failure and schedules a restart, producing the crash loop.
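To make the failure mode concrete, here is a minimal sketch of the kind of startup validation that produces an immediate non-zero exit. The variable name and messages are illustrative, not taken from any real application:

```shell
#!/bin/sh
# Hypothetical startup guard: refuse to start when a required variable
# is missing or non-numeric, exiting non-zero so the kubelet restarts us.
validate_config() {
  port=${DATABASE_PORT:-}
  if [ -z "$port" ]; then
    echo 'FATAL: DATABASE_PORT is not set' >&2
    return 1
  fi
  case $port in
    *[!0-9]*) echo "FATAL: invalid value for DATABASE_PORT: \"$port\" (expected integer)" >&2
              return 1 ;;
  esac
  echo "config ok: port $port"
}

DATABASE_PORT="abc"
validate_config || echo "process exits non-zero; kubelet schedules a restart"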
How to Identify It
Fetch logs from the previous (already-crashed) container instance — the current container may not have produced output yet:
kubectl logs api-deployment-7d4b9c-xkp2z --previous
A misconfigured environment variable typically produces output similar to:
FATAL: invalid value for DATABASE_PORT: "abc" (expected integer)
exit status 1
or a Go-style panic:
panic: DATABASE_URL environment variable is not set
goroutine 1 [running]:
main.mustGetEnv(...)
/app/main.go:22 +0x5a
exit status 2
List all environment variables currently injected into the running (or last-run) container:
kubectl exec api-deployment-7d4b9c-xkp2z -- env | sort
Compare that against what the deployment manifest declares:
kubectl get deployment api-deployment -o yaml | grep -A 60 'env:'
How to Fix It
Correct the bad value directly on the deployment:
kubectl set env deployment/api-deployment DATABASE_PORT=5432
For a more durable fix, edit the manifest and apply it:
kubectl edit deployment api-deployment
Then watch the rollout complete cleanly:
kubectl rollout status deployment/api-deployment
Waiting for deployment "api-deployment" rollout to finish: 1 old replicas are pending termination...
deployment "api-deployment" successfully rolled out
Root Cause 2: Missing Kubernetes Secret
Why It Happens
Pods referencing a Secret via envFrom, env.valueFrom.secretKeyRef, or a volume mount will fail to start if the secret does not exist in the same namespace. The container never actually launches — the kubelet marks it as failed during setup. Similarly, if the secret object exists but a specific key referenced by secretKeyRef is absent from that secret's data map, the pod enters CrashLoopBackOff without ever running application code.
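A cheap pre-deploy check is to extract every secret name a manifest references and diff that set against what `kubectl get secrets` reports. This sketch uses a small inline sample manifest for illustration; point the grep at your real file instead:

```shell
#!/bin/sh
# Sketch: list every Secret a manifest references so the set can be
# compared against the secrets that actually exist in the namespace.
cat > /tmp/deploy-sample.yaml <<'EOF'
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: api-db-credentials
              key: DB_PASSWORD
        envFrom:
        - secretRef:
            name: api-feature-flags
EOF
grep -A1 -E 'secretKeyRef:|secretRef:' /tmp/deploy-sample.yaml \
  | awk '$1 == "name:" {print $2}' | sort -u
```

This relies on the name line directly following the ref line, which holds for conventionally formatted manifests; a linter such as kubeconform is the robust version of the same idea.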
How to Identify It
Describe the pod and scan the Events block at the bottom:
kubectl describe pod api-deployment-7d4b9c-xkp2z -n production
The event section will show one of these messages:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 3s (x4 over 45s) kubelet Error: secret "api-db-credentials" not found
Or, for a key that is missing inside an existing secret:
Warning Failed 5s kubelet Error: couldn't find key DB_PASSWORD in Secret production/api-db-credentials
List all secrets in the namespace to confirm what actually exists:
kubectl get secrets -n production
NAME TYPE DATA AGE
api-tls-cert Opaque 2 30d
registry-pull-secret Opaque 1 45d
Note that api-db-credentials is absent entirely. If the secret exists, inspect its keys without exposing values:
kubectl get secret api-db-credentials -n production \
-o jsonpath='{.data}' | python3 -c \
"import sys, json; print(list(json.load(sys.stdin).keys()))"
['DB_USER', 'DB_HOST']
This confirms DB_PASSWORD is missing from the secret's data map.
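The patch command in the fix section writes into the secret's data map, which stores values base64-encoded. A quick local round-trip confirms the encoding step before you run the patch (the password is the sample value used in this runbook):

```shell
#!/bin/sh
# Secret .data values must be base64-encoded; encode and decode locally
# to confirm the value survives the round trip before patching.
plain='s3cur3P@ss!'
encoded=$(printf '%s' "$plain" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "encoded: $encoded"
[ "$decoded" = "$plain" ] && echo "round-trip ok"
```

Using printf rather than echo avoids accidentally encoding a trailing newline into the secret, a classic source of "correct password, failed login" incidents.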
How to Fix It
Create the missing secret in the correct namespace:
kubectl create secret generic api-db-credentials \
--from-literal=DB_USER='infrarunbook-admin' \
--from-literal=DB_PASSWORD='s3cur3P@ss!' \
--from-literal=DB_HOST='10.0.1.45' \
-n production
If the secret exists but is missing a key, patch it:
kubectl patch secret api-db-credentials -n production \
--type='json' \
-p='[{"op":"add","path":"/data/DB_PASSWORD","value":"'$(echo -n 's3cur3P@ss!' | base64)'"}]'
Force the deployment to pick up the updated secret:
kubectl rollout restart deployment/api-deployment -n production
kubectl rollout status deployment/api-deployment -n production
Root Cause 3: OOMKilled (Out of Memory)
Why It Happens
Every container in Kubernetes runs under a cgroup-enforced memory ceiling defined by the resources.limits.memory field. When the container's memory consumption crosses that ceiling — whether due to a memory leak, a burst in workload, or a limit that was never sized correctly — the Linux kernel's OOM killer terminates the container process with SIGKILL (signal 9). Kubernetes records this as an OOMKilled exit reason and exit code 137 (128 + 9). If the application consistently needs more memory than the limit allows, the pod will be killed and restarted in a continuous loop.
How to Identify It
Describe the pod and look for the Last State block under the container status:
kubectl describe pod api-deployment-7d4b9c-xkp2z -n production
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Fri, 04 Apr 2026 08:12:01 +0000
Finished: Fri, 04 Apr 2026 08:12:44 +0000
Ready: False
Restart Count: 9
Confirm live memory consumption versus the configured limit:
kubectl top pod api-deployment-7d4b9c-xkp2z --containers -n production
POD NAME CPU(cores) MEMORY(bytes)
api-deployment-7d4b9c-xkp2z api 38m 503Mi
Then inspect the spec limits:
kubectl get pod api-deployment-7d4b9c-xkp2z -n production \
-o jsonpath='{.spec.containers[0].resources}'
{"limits":{"memory":"512Mi"},"requests":{"memory":"256Mi"}}
The pod is consuming 503Mi against a 512Mi hard limit — any brief spike will trigger the kill.
How to Fix It
Increase the memory limit on the deployment:
kubectl set resources deployment api-deployment \
--limits=memory=1Gi \
--requests=memory=512Mi \
-n production
If the OOM is caused by a genuine application memory leak rather than an undersized limit, you need to fix the code. In the meantime, consider enabling a Vertical Pod Autoscaler in recommendation mode to get data-driven sizing suggestions:
kubectl get vpa api-deployment-vpa -n production -o yaml | grep -A 10 'recommendation'
Long-term, instrument the application with heap profiling tooling (pprof for Go, jmap for JVM workloads, memory_profiler for Python) and analyse allocations under realistic load before setting final limits.
Root Cause 4: Dependency Service Unavailable
Why It Happens
Many applications perform a blocking connection check during their startup sequence — connecting to PostgreSQL, Redis, RabbitMQ, or a downstream gRPC service before declaring themselves ready. If the dependency is unreachable due to a wrong hostname, a missing DNS record, a firewall rule, a NetworkPolicy, or because the dependency itself is crashing, the application's startup routine fails and the process exits. In microservices deployments, this is especially common when multiple services are deployed simultaneously and the dependency simply hasn't started yet — a classic race condition that results in both services entering a crash loop.
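One application-side mitigation is to retry the dependency check with backoff instead of exiting on the first refused connection. A generic sketch — the probe command and attempt count are illustrative choices, not a prescribed pattern:

```shell
#!/bin/sh
# Retry an arbitrary probe command with doubling delay instead of
# crashing on the first failed connection attempt.
wait_for() {
  delay=1
  attempt=1
  while [ "$attempt" -le 5 ]; do
    if "$@"; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
  echo "dependency never became ready" >&2
  return 1
}

# Inside a pod this would be something like:
#   wait_for nc -z postgres-service.production.svc.cluster.local 5432
wait_for true
```

Exiting non-zero only after the retries are exhausted still lets Kubernetes restart the pod, but it absorbs the common case where the dependency is merely a few seconds behind.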
How to Identify It
Pull the previous container's logs to see what the application was trying to connect to:
kubectl logs api-deployment-7d4b9c-xkp2z --previous -n production
Error: unable to connect to PostgreSQL at postgres-service.production.svc.cluster.local:5432
dial tcp 10.0.1.45:5432: connect: connection refused
failed to initialise application, exiting
Check the state of the dependency itself:
kubectl get pods -n production -l app=postgres
NAME READY STATUS RESTARTS AGE
postgres-0 0/1 CrashLoopBackOff 5 8m
The dependency is itself crashing — confirming the root cause. If the dependency appears healthy, test DNS resolution and TCP reachability from within the cluster using a debug pod:
kubectl run debug --image=busybox:1.35 --restart=Never --rm -it -n production -- \
nslookup postgres-service.production.svc.cluster.local
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: postgres-service.production.svc.cluster.local
Address 1: 10.0.1.45
kubectl run debug --image=busybox:1.35 --restart=Never --rm -it -n production -- \
nc -zv 10.0.1.45 5432
10.0.1.45 (10.0.1.45:5432) open
Also check for NetworkPolicies that may be blocking traffic:
kubectl get networkpolicies -n production
kubectl describe networkpolicy allow-api-to-postgres -n production
How to Fix It
Resolve the dependency's own crash first, then prevent the race condition from recurring by adding an init container that gates startup on the dependency being ready:
initContainers:
- name: wait-for-postgres
image: busybox:1.35
command:
- sh
- -c
- |
until nc -z postgres-service.production.svc.cluster.local 5432; do
echo "waiting for postgres at postgres-service.production.svc.cluster.local:5432..."
sleep 3
done
echo "postgres is ready"
This init container runs to completion before the main application container starts, eliminating the startup race condition regardless of deployment ordering.
Root Cause 5: Liveness Probe Misconfiguration
Why It Happens
A liveness probe failure causes Kubernetes to kill and restart the container. A misconfigured liveness probe — one pointing at a non-existent URL path, using the wrong port, or with an initialDelaySeconds that is shorter than the application's actual startup time — will kill a perfectly healthy container before it has finished initialising. This produces a crash loop that is indistinguishable from an application crash unless you look carefully at the Events section. Note that a readiness probe failure alone does not cause a restart — it only removes the pod from Service endpoints — but when both liveness and readiness probes share a misconfigured endpoint, both fail simultaneously and the liveness kill drives the restart loop.
How to Identify It
Describe the pod and inspect the Events section closely:
kubectl describe pod api-deployment-7d4b9c-xkp2z -n production
Liveness: http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:8080/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 18s (x3 over 38s) kubelet Liveness probe failed: HTTP probe failed with statuscode: 404
Warning Killing 18s kubelet Container api failed liveness probe, will be restarted
The 404 shows the probe is hitting a path that does not exist. Verify the correct health endpoint directly inside the container:
kubectl exec api-deployment-7d4b9c-xkp2z -n production -- \
wget -qO- http://localhost:8080/health
{"status":"ok","uptime":42}
The actual endpoint is /health, not /healthz. Also check how long the application takes to start listening:
kubectl logs api-deployment-7d4b9c-xkp2z -n production | grep -i 'listen\|ready\|started\|binding'
2026-04-04T08:15:33Z INFO server listening on :8080 after 22s startup
The application takes 22 seconds to start, but initialDelaySeconds is only 5 seconds — the liveness probe fires and fails 17 seconds before the server is ready.
How to Fix It
Correct the probe path and give the application enough headroom to initialise. A well-tuned probe configuration:
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 40
periodSeconds: 15
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
startupProbe:
httpGet:
path: /health
port: 8080
failureThreshold: 30
periodSeconds: 5
The startupProbe is particularly valuable here — it gives the application up to 150 seconds (30 × 5s) to start before the liveness probe takes over, preventing premature kills during slow initialisation. Apply and verify:
kubectl apply -f deployment.yaml
kubectl rollout status deployment/api-deployment -n production
deployment "api-deployment" successfully rolled out
Root Cause 6: Application Code Crash or Missing Runtime Dependency
Why It Happens
The container image itself may contain a bug, a missing Python module, a missing shared library, or an incorrect entrypoint script. The application process exits the moment it starts — often before producing meaningful logs — with a non-zero exit code. This is common after image rebuilds where a dependency was accidentally removed or after a bad merge.
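For the "incorrect entrypoint" variant, two shell exit codes are worth memorising: 127 means the command was not found, 126 means it was found but is not executable. This is reproducible locally (the path is deliberately fake):

```shell
#!/bin/sh
# A container whose entrypoint path is wrong dies with exit code 127
# before the application emits a single log line.
code=0
sh -c '/no/such/entrypoint' 2>/dev/null || code=$?
echo "missing entrypoint exit code: $code"   # 127
```

Seeing 127 in lastState.terminated therefore points at the image's ENTRYPOINT/CMD or a missing binary, not at application logic.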
How to Identify It
kubectl logs api-deployment-7d4b9c-xkp2z --previous -n production
Traceback (most recent call last):
File "/app/main.py", line 14, in <module>
from utils.auth import verify_token
ModuleNotFoundError: No module named 'utils.auth'
Or for a missing shared library:
/app/server: error while loading shared libraries: libssl.so.3: cannot open shared object file: No such file or directory
Get the exact exit code from the terminated state:
kubectl get pod api-deployment-7d4b9c-xkp2z -n production \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
1
How to Fix It
Fix the application code or Dockerfile, rebuild the image, push it to your registry, and update the deployment:
kubectl set image deployment/api-deployment \
api=registry.solvethenetwork.com/api:v1.2.2 \
-n production
kubectl rollout status deployment/api-deployment -n production
To roll back to the last known-good image while the fix is in progress:
kubectl rollout undo deployment/api-deployment -n production
Root Cause 7: Init Container Failure
Why It Happens
Init containers execute sequentially before the main container and must exit with code 0. Common uses include running database migrations, fetching secrets from Vault, rendering config files, or seeding a volume. If an init container fails — because the migration SQL is broken, the secret store is unreachable, or a pre-flight assertion fails — Kubernetes retries it according to the pod's restartPolicy, producing Init:CrashLoopBackOff in the status column.
How to Identify It
kubectl get pods -n production
NAME READY STATUS RESTARTS AGE
api-deployment-7d4b9c-xkp2z 0/1 Init:CrashLoopBackOff 4 6m
Fetch logs from the init container by name using the -c flag:
kubectl logs api-deployment-7d4b9c-xkp2z -c init-migrate --previous -n production
running migration 004_add_sessions_table.sql
ERROR: relation "users" does not exist
LINE 1: ALTER TABLE users ADD COLUMN last_seen TIMESTAMP;
migration failed — exiting with code 1
How to Fix It
Resolve the underlying issue the init container is reporting. In this case, a previous migration was never applied, so the schema is in an inconsistent state. After fixing the database state and correcting the migration files:
kubectl rollout restart deployment/api-deployment -n production
kubectl rollout status deployment/api-deployment -n production
Prevention
Preventing CrashLoopBackOff requires discipline across your entire delivery pipeline — from manifest authoring to production monitoring.
- Lint manifests in CI/CD: Run kubeval, kubeconform, or kube-score against every manifest change in your pipeline. Catch missing secretKeyRef targets and invalid field values before they ever reach the cluster.
- Use init containers for dependency readiness: Never assume a dependency is up. Gate every application container behind an init container with a TCP or HTTP readiness check and an exponential-backoff retry loop.
- Set realistic resource limits: Run load tests in staging with kubectl top monitoring before committing to production limits. Use VPA recommendations as a baseline, then add a 30–50% headroom buffer.
- Tune probes carefully: Always set initialDelaySeconds to at least 120% of your observed worst-case startup time. Use a startupProbe for slow-starting applications. Test probe endpoints explicitly in staging under load.
- Manage secrets lifecycle with GitOps: Use External Secrets Operator, Vault Agent Injector, or Sealed Secrets to ensure secrets are provisioned before workloads are deployed. Never rely on manual kubectl create secret steps in production.
- Alert on restart counts before they spiral: Add a Prometheus alert: increase(kube_pod_container_status_restarts_total[15m]) > 3 so your team is notified well before a pod enters the 5-minute backoff phase.
- Use immutable image tags: Never deploy :latest. Tag images with the full Git commit SHA so every running pod can be traced back to an exact code state and rolled back deterministically.
- Enforce namespace-level LimitRanges: Apply LimitRange objects to namespaces to set sensible default requests and limits, preventing pods from being scheduled with no resource constraints at all.
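The headroom rule above can be applied mechanically. For instance, adding 40% headroom over an observed peak of 503Mi and rounding up:

```shell
#!/bin/sh
# Add 40% headroom to an observed peak memory figure (in Mi),
# rounding up to the next whole Mi.
headroom_mi() { echo "$(( ( ${1%Mi} * 140 + 99 ) / 100 ))Mi"; }
echo "suggested limit: $(headroom_mi 503Mi)"   # 705Mi
```

In practice you would round the result up further to a tidy value (say 768Mi or 1Gi) so limits stay easy to audit across a fleet.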
Frequently Asked Questions
Q: What is the difference between CrashLoopBackOff and Error status in Kubernetes?
A: Error is a transient state shown immediately after a container exits with a non-zero code. CrashLoopBackOff appears after Kubernetes has attempted several restarts and has begun applying an exponential backoff delay between attempts. CrashLoopBackOff means the crash is persistent and recurring — it is not a different type of failure, just a later stage of the same failure being rate-limited by the kubelet.
Q: How do I get logs from a container that is in CrashLoopBackOff if it crashes too fast?
A: Use the --previous flag to retrieve logs from the last terminated container instance rather than the currently running (or starting) one: kubectl logs <pod-name> --previous -n <namespace>. If the application exits before writing any logs, check the exit code via kubectl describe pod and look at the Reason field under Last State: Terminated for an OS-level signal name like OOMKilled.
Q: What does exit code 137 mean in a Kubernetes pod?
A: Exit code 137 equals 128 + 9, where 9 is the SIGKILL signal number. It usually means the container process was forcibly killed by the Linux kernel's OOM killer because it exceeded its memory limit. Look for Reason: OOMKilled in kubectl describe pod to confirm. The fix is to increase the memory limit or address a memory leak in the application.
Q: How long does Kubernetes wait between CrashLoopBackOff restarts?
A: The backoff follows an exponential progression: 10s, 20s, 40s, 80s, 160s, then caps at 300s (5 minutes). The backoff resets to the start of this sequence if the container runs successfully for at least 10 minutes before crashing again. You can observe the current backoff timer in the Events section of kubectl describe pod.
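The doubling-and-cap behaviour is easy to model — this is a sketch of the documented progression, not the kubelet's actual implementation:

```shell
#!/bin/sh
# Model the kubelet's crash backoff: double from 10s, cap at 300s.
delay=10
sequence=""
for crash in 1 2 3 4 5 6 7; do
  sequence="$sequence ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
echo "backoff delays:$sequence"   # 10s 20s 40s 80s 160s 300s 300s
```

Note how quickly the cap is reached: by the sixth crash every further restart attempt is five minutes apart, which is why alerting on early restarts matters.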
Q: Can a CrashLoopBackOff pod be fixed without redeploying?
A: It depends on the cause. If the crash is caused by a missing secret or a bad env var that exists outside the pod spec (e.g., a ConfigMap value), fixing the external resource and then running kubectl rollout restart deployment/<name> is sufficient. If the crash is in the pod spec itself (wrong image, bad probe path, wrong memory limit), you must update the deployment spec, which triggers a new pod rollout. You cannot edit a running pod's spec in place for managed fields.
Q: What is the difference between a liveness probe failure and a readiness probe failure causing a restart?
A: Only a liveness probe failure causes Kubernetes to kill and restart the container. A readiness probe failure only removes the pod from the load balancer's endpoint list — the container keeps running, it just stops receiving traffic. This means readiness probe failures alone do not produce CrashLoopBackOff. If you see CrashLoopBackOff with probe-related events, the liveness probe is the culprit.
Q: How do I temporarily stop CrashLoopBackOff to debug the container?
A: Override the container's command to prevent the application from starting, giving you a shell to investigate: kubectl debug -it <pod-name> --copy-to=debug-pod --container=api -- /bin/sh. Alternatively, scale the deployment to zero (kubectl scale deployment/api-deployment --replicas=0), then create a standalone debug pod using the same image and environment variables to reproduce the crash interactively.
Q: How do I tell if CrashLoopBackOff is caused by the application or by Kubernetes infrastructure?
A: Run the container image locally using Docker or Podman with the same environment variables: docker run --env-file=.env registry.solvethenetwork.com/api:v1.2.1. If it runs cleanly locally, the issue is infrastructure-side — a missing secret, a NetworkPolicy blocking a dependency, a misconfigured probe, or a resource limit. If it crashes locally with the same error, the issue is in the application code or image itself.
Q: What Prometheus alerts should I set up to catch CrashLoopBackOff early?
A: Two key alerts. First, increase(kube_pod_container_status_restarts_total[15m]) > 3 fires as soon as a container has restarted 3 times in 15 minutes, before the 5-minute backoff kicks in. Second, kube_pod_status_phase{phase="Failed"} > 0 catches pods that have transitioned to the Failed phase. Both give your team time to intervene before the situation becomes a full outage.
Q: Can a CrashLoopBackOff be caused by a node-level issue rather than the application?
A: Yes. Node-level disk pressure, memory pressure, or a corrupted container runtime (containerd, CRI-O) can cause containers to fail to start or to be killed immediately. Check node conditions: kubectl describe node sw-infrarunbook-01 | grep -A 10 Conditions. If you see MemoryPressure=True, DiskPressure=True, or PIDPressure=True, the node itself is the problem and pods across the node may be affected — not just yours.
Q: Does CrashLoopBackOff affect all replicas or just one?
A: If the root cause is configuration-based (bad env var, missing secret, misconfigured probe), all replicas of the deployment will enter CrashLoopBackOff since they all share the same pod spec. If the crash is caused by a transient node-level issue or a specific data condition that only affects one pod, you may see only one or a subset of replicas crashing while others remain healthy. The pattern of which pods are affected is itself a diagnostic clue.
Q: How do I roll back a deployment that is stuck in CrashLoopBackOff?
A: Use the built-in rollout undo command: kubectl rollout undo deployment/api-deployment -n production. Kubernetes will immediately start rolling out the previous ReplicaSet. To roll back to a specific revision, first list history with kubectl rollout history deployment/api-deployment, then target it: kubectl rollout undo deployment/api-deployment --to-revision=3. This is why immutable image tags and recorded rollout changes are essential — without them, rollback history is difficult to interpret.
