Symptoms
When a pod enters CrashLoopBackOff, Kubernetes is signalling that the container started, crashed, and is being restarted repeatedly — with each restart delayed exponentially (10s, 20s, 40s, 80s, up to a maximum of 5 minutes). The pod never stabilises. You will see it clearly when listing pods:
kubectl get pods -n production
NAME READY STATUS RESTARTS AGE
api-deployment-7d4b9c-xkp2z 0/1 CrashLoopBackOff 8 14m
worker-deployment-5f8bb-r9t4k 0/1 CrashLoopBackOff 12 22m
Additional symptoms you will observe:
- The RESTARTS counter climbs steadily over time.
- kubectl describe pod shows Back-off restarting failed container in the Events section.
- Logs are either empty, cut short, or show a clear application-level error just before the process exits.
- The pod may briefly flash Running before falling back into the crash loop.
- Dependent workloads that rely on this pod begin failing their own health checks.
The exit code in the pod's lastState.terminated field is a key diagnostic signal: exit code 1 typically means the application itself returned an error; exit code 137 (128 + 9) means the container was OOMKilled; exit code 143 (128 + 15) means SIGTERM was delivered and not handled. This runbook covers every major cause systematically.
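As a quick mental model, any exit code above 128 encodes a signal number (code − 128). A small illustrative helper — not part of any kubectl tooling — makes the arithmetic explicit:

```shell
#!/bin/sh
# Illustrative helper: map a container exit code to its likely cause.
# Codes above 128 mean the process was killed by a signal (code - 128).
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$(( code - 128 ))
    case $sig in
      9)  echo "SIGKILL (often OOMKilled)" ;;
      15) echo "SIGTERM (terminated, signal not handled)" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  elif [ "$code" -ne 0 ]; then
    echo "application error (exit $code)"
  else
    echo "clean exit"
  fi
}

decode_exit 137   # SIGKILL (often OOMKilled)
decode_exit 143   # SIGTERM (terminated, signal not handled)
```

The same arithmetic explains the codes you will meet throughout this runbook: 137 and 143 are signal deaths, while small codes like 1 or 2 come from the application itself.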
Root Cause 1: Bad Config or Environment Variable
Why It Happens
Applications commonly read critical parameters — database URIs, port numbers, feature flags, log levels — from environment variables at startup. If a required variable is absent, misspelled in the manifest, or holds an invalid value (a string where an integer is expected, a malformed URL, an unsupported flag value), the application process exits immediately with a non-zero exit code. Kubernetes treats any non-zero exit as a container failure and schedules a restart, producing the crash loop.
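To make the failure mode concrete, here is a minimal sketch of the kind of startup validation that produces an immediate non-zero exit. The variable name and messages are illustrative, not taken from any real application:

```shell
#!/bin/sh
# Hypothetical startup guard: refuse to start when a required variable
# is missing or non-numeric, exiting non-zero so the kubelet restarts us.
validate_config() {
  port=${DATABASE_PORT:-}
  if [ -z "$port" ]; then
    echo 'FATAL: DATABASE_PORT is not set' >&2
    return 1
  fi
  case $port in
    *[!0-9]*) echo "FATAL: invalid value for DATABASE_PORT: \"$port\" (expected integer)" >&2
              return 1 ;;
  esac
  echo "config ok: port $port"
}

DATABASE_PORT="abc"
validate_config || echo "process exits non-zero; kubelet schedules a restart"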
How to Identify It
Fetch logs from the previous (already-crashed) container instance — the current container may not have produced output yet:
kubectl logs api-deployment-7d4b9c-xkp2z --previous
A misconfigured environment variable typically produces output similar to:
FATAL: invalid value for DATABASE_PORT: "abc" (expected integer)
exit status 1
or a Go-style panic:
panic: DATABASE_URL environment variable is not set
goroutine 1 [running]:
main.mustGetEnv(...)
/app/main.go:22 +0x5a
exit status 2
List all environment variables currently injected into the running (or last-run) container:
kubectl exec api-deployment-7d4b9c-xkp2z -- env | sort
Compare that against what the deployment manifest declares:
kubectl get deployment api-deployment -o yaml | grep -A 60 'env:'
How to Fix It
Correct the bad value directly on the deployment:
kubectl set env deployment/api-deployment DATABASE_PORT=5432
For a more durable fix, edit the manifest and apply it:
kubectl edit deployment api-deployment
Then watch the rollout complete cleanly:
kubectl rollout status deployment/api-deployment
Waiting for deployment "api-deployment" rollout to finish: 1 old replicas are pending termination...
deployment "api-deployment" successfully rolled out
Root Cause 2: Missing Kubernetes Secret
Why It Happens
Pods referencing a Secret via envFrom, env.valueFrom.secretKeyRef, or a volume mount will fail to start if the secret does not exist in the same namespace. The container never actually launches — the kubelet marks it as failed during setup. Similarly, if the secret object exists but a specific key referenced by secretKeyRef is absent from that secret's data map, the pod enters CrashLoopBackOff without ever running application code.
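A cheap pre-deploy check is to extract every secret name a manifest references and diff that set against what `kubectl get secrets` reports. This sketch uses a small inline sample manifest for illustration; point the grep at your real file instead:

```shell
#!/bin/sh
# Sketch: list every Secret a manifest references so the set can be
# compared against the secrets that actually exist in the namespace.
cat > /tmp/deploy-sample.yaml <<'EOF'
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: api-db-credentials
              key: DB_PASSWORD
        envFrom:
        - secretRef:
            name: api-feature-flags
EOF
grep -A1 -E 'secretKeyRef:|secretRef:' /tmp/deploy-sample.yaml \
  | awk '$1 == "name:" {print $2}' | sort -u
```

This relies on the name line directly following the ref line, which holds for conventionally formatted manifests; a linter such as kubeconform is the robust version of the same idea.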
How to Identify It
Describe the pod and scan the Events block at the bottom:
kubectl describe pod api-deployment-7d4b9c-xkp2z -n production
The event section will show one of these messages:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 3s (x4 over 45s) kubelet Error: secret "api-db-credentials" not found
Or, for a key that is missing inside an existing secret:
Warning Failed 5s kubelet Error: couldn't find key DB_PASSWORD in Secret production/api-db-credentials
List all secrets in the namespace to confirm what actually exists:
kubectl get secrets -n production
NAME TYPE DATA AGE
api-tls-cert Opaque 2 30d
registry-pull-secret Opaque 1 45d
Note that api-db-credentials is absent entirely. If the secret exists, inspect its keys without exposing values:
kubectl get secret api-db-credentials -n production \
-o jsonpath='{.data}' | python3 -c \
"import sys, json; print(list(json.load(sys.stdin).keys()))"
['DB_USER', 'DB_HOST']
This confirms DB_PASSWORD is missing from the secret's data map.
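The patch command in the fix section writes into the secret's data map, which stores values base64-encoded. A quick local round-trip confirms the encoding step before you run the patch (the password is the sample value used in this runbook):

```shell
#!/bin/sh
# Secret .data values must be base64-encoded; encode and decode locally
# to confirm the value survives the round trip before patching.
plain='s3cur3P@ss!'
encoded=$(printf '%s' "$plain" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "encoded: $encoded"
[ "$decoded" = "$plain" ] && echo "round-trip ok"
```

Using printf rather than echo avoids accidentally encoding a trailing newline into the secret, a classic source of "correct password, failed login" incidents.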
How to Fix It
Create the missing secret in the correct namespace:
kubectl create secret generic api-db-credentials \
--from-literal=DB_USER='infrarunbook-admin' \
--from-literal=DB_PASSWORD='s3cur3P@ss!' \
--from-literal=DB_HOST='10.0.1.45' \
-n production
If the secret exists but is missing a key, patch it:
kubectl patch secret api-db-credentials -n production \
--type='json' \
-p='[{"op":"add","path":"/data/DB_PASSWORD","value":"'$(echo -n 's3cur3P@ss!' | base64)'"}]'
Force the deployment to pick up the updated secret:
kubectl rollout restart deployment/api-deployment -n production
kubectl rollout status deployment/api-deployment -n production
Root Cause 3: OOMKilled (Out of Memory)
Why It Happens
Every container in Kubernetes runs under a cgroup-enforced memory ceiling defined by the resources.limits.memory field. When the container's memory consumption crosses that ceiling — whether due to a memory leak, a burst in workload, or a limit that was never sized correctly — the Linux kernel's OOM killer terminates the container process with SIGKILL (signal 9). Kubernetes records this as an OOMKilled exit reason and exit code 137 (128 + 9). If the application consistently needs more memory than the limit allows, the pod will be killed and restarted in a continuous loop.
How to Identify It
Describe the pod and look for the Last State block under the container status:
kubectl describe pod api-deployment-7d4b9c-xkp2z -n production
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Fri, 04 Apr 2026 08:12:01 +0000
Finished: Fri, 04 Apr 2026 08:12:44 +0000
Ready: False
Restart Count: 9
Confirm live memory consumption versus the configured limit:
kubectl top pod api-deployment-7d4b9c-xkp2z --containers -n production
POD NAME CPU(cores) MEMORY(bytes)
api-deployment-7d4b9c-xkp2z api 38m 503Mi
Then inspect the spec limits:
kubectl get pod api-deployment-7d4b9c-xkp2z -n production \
-o jsonpath='{.spec.containers[0].resources}'
{"limits":{"memory":"512Mi"},"requests":{"memory":"256Mi"}}
The pod is consuming 503Mi against a 512Mi hard limit — any brief spike will trigger the kill.
How to Fix It
Increase the memory limit on the deployment:
kubectl set resources deployment api-deployment \
--limits=memory=1Gi \
--requests=memory=512Mi \
-n production
If the OOM is caused by a genuine application memory leak rather than an undersized limit, you need to fix the code. In the meantime, consider enabling a Vertical Pod Autoscaler in recommendation mode to get data-driven sizing suggestions:
kubectl get vpa api-deployment-vpa -n production -o yaml | grep -A 10 'recommendation'
Long-term, instrument the application with heap profiling tooling (pprof for Go, jmap for JVM workloads, memory_profiler for Python) and analyse allocations under realistic load before setting final limits.
Root Cause 4: Dependency Service Unavailable
Why It Happens
Many applications perform a blocking connection check during their startup sequence — connecting to PostgreSQL, Redis, RabbitMQ, or a downstream gRPC service before declaring themselves ready. If the dependency is unreachable due to a wrong hostname, a missing DNS record, a firewall rule, a NetworkPolicy, or because the dependency itself is crashing, the application's startup routine fails and the process exits. In microservices deployments, this is especially common when multiple services are deployed simultaneously and the dependency simply hasn't started yet — a classic race condition that results in both services entering a crash loop.
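One application-side mitigation is to retry the dependency check with backoff instead of exiting on the first refused connection. A generic sketch — the probe command and attempt count are illustrative choices, not a prescribed pattern:

```shell
#!/bin/sh
# Retry an arbitrary probe command with doubling delay instead of
# crashing on the first failed connection attempt.
wait_for() {
  delay=1
  attempt=1
  while [ "$attempt" -le 5 ]; do
    if "$@"; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
  echo "dependency never became ready" >&2
  return 1
}

# Inside a pod this would be something like:
#   wait_for nc -z postgres-service.production.svc.cluster.local 5432
wait_for true
```

Exiting non-zero only after the retries are exhausted still lets Kubernetes restart the pod, but it absorbs the common case where the dependency is merely a few seconds behind.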
How to Identify It
Pull the previous container's logs to see what the application was trying to connect to:
kubectl logs api-deployment-7d4b9c-xkp2z --previous -n production
Error: unable to connect to PostgreSQL at postgres-service.production.svc.cluster.local:5432
dial tcp 10.0.1.45:5432: connect: connection refused
failed to initialise application, exiting
Check the state of the dependency itself:
kubectl get pods -n production -l app=postgres
NAME READY STATUS RESTARTS AGE
postgres-0 0/1 CrashLoopBackOff 5 8m
The dependency is itself crashing — confirming the root cause. If the dependency appears healthy, test DNS resolution and TCP reachability from within the cluster using a debug pod:
kubectl run debug --image=busybox:1.35 --restart=Never --rm -it -n production -- \
nslookup postgres-service.production.svc.cluster.local
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: postgres-service.production.svc.cluster.local
Address 1: 10.0.1.45
kubectl run debug --image=busybox:1.35 --restart=Never --rm -it -n production -- \
nc -zv 10.0.1.45 5432
10.0.1.45 (10.0.1.45:5432) open
Also check for NetworkPolicies that may be blocking traffic:
kubectl get networkpolicies -n production
kubectl describe networkpolicy allow-api-to-postgres -n production
How to Fix It
Resolve the dependency's own crash first, then prevent the race condition from recurring by adding an init container that gates startup on the dependency being ready:
initContainers:
- name: wait-for-postgres
image: busybox:1.35
command:
- sh
- -c
- |
until nc -z postgres-service.production.svc.cluster.local 5432; do
echo "waiting for postgres at postgres-service.production.svc.cluster.local:5432..."
sleep 3
done
echo "postgres is ready"
This init container runs to completion before the main application container starts, eliminating the startup race condition regardless of deployment ordering.
Root Cause 5: Liveness Probe Misconfiguration
Why It Happens
A liveness probe failure causes Kubernetes to kill and restart the container. A misconfigured liveness probe — one pointing at a non-existent URL path, using the wrong port, or with an initialDelaySeconds that is shorter than the application's actual startup time — will kill a perfectly healthy container before it has finished initialising. This produces a crash loop that is indistinguishable from an application crash unless you look carefully at the Events section. Note that a readiness probe failure alone does not cause a restart — it only removes the pod from Service endpoints — but when both liveness and readiness probes share a misconfigured endpoint, both fail simultaneously and the liveness kill drives the restart loop.
How to Identify It
Describe the pod and inspect the Events section closely:
kubectl describe pod api-deployment-7d4b9c-xkp2z -n production
Liveness: http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:8080/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 18s (x3 over 38s) kubelet Liveness probe failed: HTTP probe failed with statuscode: 404
Warning Killing 18s kubelet Container api failed liveness probe, will be restarted
The 404 shows the probe is hitting a path that does not exist. Verify the correct health endpoint directly inside the container:
kubectl exec api-deployment-7d4b9c-xkp2z -n production -- \
wget -qO- http://localhost:8080/health
{"status":"ok","uptime":42}
The actual endpoint is /health, not /healthz. Also check how long the application takes to start listening:
kubectl logs api-deployment-7d4b9c-xkp2z -n production | grep -i 'listen\|ready\|started\|binding'
2026-04-04T08:15:33Z INFO server listening on :8080 after 22s startup
The application takes 22 seconds to start, but initialDelaySeconds is only 5 seconds — the liveness probe fires and fails 17 seconds before the server is ready.
How to Fix It
Correct the probe path and give the application enough headroom to initialise. A well-tuned probe configuration:
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 40
periodSeconds: 15
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
startupProbe:
httpGet:
path: /health
port: 8080
failureThreshold: 30
periodSeconds: 5
The startupProbe is particularly valuable here — it gives the application up to 150 seconds (30 × 5s) to start before the liveness probe takes over, preventing premature kills during slow initialisation. Apply and verify:
kubectl apply -f deployment.yaml
kubectl rollout status deployment/api-deployment -n production
deployment "api-deployment" successfully rolled out
Root Cause 6: Application Code Crash or Missing Runtime Dependency
Why It Happens
The container image itself may contain a bug, a missing Python module, a missing shared library, or an incorrect entrypoint script. The application process exits the moment it starts — often before producing meaningful logs — with a non-zero exit code. This is common after image rebuilds where a dependency was accidentally removed or after a bad merge.
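For the "incorrect entrypoint" variant, two shell exit codes are worth memorising: 127 means the command was not found, 126 means it was found but is not executable. This is reproducible locally (the path is deliberately fake):

```shell
#!/bin/sh
# A container whose entrypoint path is wrong dies with exit code 127
# before the application emits a single log line.
code=0
sh -c '/no/such/entrypoint' 2>/dev/null || code=$?
echo "missing entrypoint exit code: $code"   # 127
```

Seeing 127 in lastState.terminated therefore points at the image's ENTRYPOINT/CMD or a missing binary, not at application logic.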
How to Identify It
kubectl logs api-deployment-7d4b9c-xkp2z --previous -n production
Traceback (most recent call last):
File "/app/main.py", line 14, in <module>
from utils.auth import verify_token
ModuleNotFoundError: No module named 'utils.auth'
Or for a missing shared library:
/app/server: error while loading shared libraries: libssl.so.3: cannot open shared object file: No such file or directory
Get the exact exit code from the terminated state:
kubectl get pod api-deployment-7d4b9c-xkp2z -n production \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
1
How to Fix It
Fix the application code or Dockerfile, rebuild the image, push it to your registry, and update the deployment:
kubectl set image deployment/api-deployment \
api=registry.solvethenetwork.com/api:v1.2.2 \
-n production
kubectl rollout status deployment/api-deployment -n production
To roll back to the last known-good image while the fix is in progress:
kubectl rollout undo deployment/api-deployment -n production
Root Cause 7: Init Container Failure
Why It Happens
Init containers execute sequentially before the main container and must exit with code 0. Common uses include running database migrations, fetching secrets from Vault, rendering config files, or seeding a volume. If an init container fails — because the migration SQL is broken, the secret store is unreachable, or a pre-flight assertion fails — Kubernetes retries it according to the pod's restartPolicy, producing Init:CrashLoopBackOff in the status column.
How to Identify It
kubectl get pods -n production
NAME READY STATUS RESTARTS AGE
api-deployment-7d4b9c-xkp2z 0/1 Init:CrashLoopBackOff 4 6m
Fetch logs from the init container by name using the -c flag:
kubectl logs api-deployment-7d4b9c-xkp2z -c init-migrate --previous -n production
running migration 004_add_sessions_table.sql
ERROR: relation "users" does not exist
LINE 1: ALTER TABLE users ADD COLUMN last_seen TIMESTAMP;
migration failed — exiting with code 1
How to Fix It
Resolve the underlying issue the init container is reporting. In this case, a previous migration was never applied, so the schema is in an inconsistent state. After fixing the database state and correcting the migration files:
kubectl rollout restart deployment/api-deployment -n production
kubectl rollout status deployment/api-deployment -n production
Prevention
Preventing CrashLoopBackOff requires discipline across your entire delivery pipeline — from manifest authoring to production monitoring.
- Lint manifests in CI/CD: Run kubeval, kubeconform, or kube-score against every manifest change in your pipeline. Catch missing secretKeyRef targets and invalid field values before they ever reach the cluster.
- Use init containers for dependency readiness: Never assume a dependency is up. Gate every application container behind an init container with a TCP or HTTP readiness check and an exponential-backoff retry loop.
- Set realistic resource limits: Run load tests in staging with kubectl top monitoring before committing to production limits. Use VPA recommendations as a baseline, then add a 30–50% headroom buffer.
- Tune probes carefully: Always set initialDelaySeconds to at least 120% of your observed worst-case startup time. Use a startupProbe for slow-starting applications. Test probe endpoints explicitly in staging under load.
- Manage secrets lifecycle with GitOps: Use External Secrets Operator, Vault Agent Injector, or Sealed Secrets to ensure secrets are provisioned before workloads are deployed. Never rely on manual kubectl create secret steps in production.
- Alert on restart counts before they spiral: Add a Prometheus alert: increase(kube_pod_container_status_restarts_total[15m]) > 3 so your team is notified well before a pod enters the 5-minute backoff phase.
- Use immutable image tags: Never deploy :latest. Tag images with the full Git commit SHA so every running pod can be traced back to an exact code state and rolled back deterministically.
- Enforce namespace-level LimitRanges: Apply LimitRange objects to namespaces to set sensible default requests and limits, preventing pods from being scheduled with no resource constraints at all.
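The headroom rule above can be applied mechanically. For instance, adding 40% headroom over an observed peak of 503Mi and rounding up:

```shell
#!/bin/sh
# Add 40% headroom to an observed peak memory figure (in Mi),
# rounding up to the next whole Mi.
headroom_mi() { echo "$(( ( ${1%Mi} * 140 + 99 ) / 100 ))Mi"; }
echo "suggested limit: $(headroom_mi 503Mi)"   # 705Mi
```

In practice you would round the result up further to a tidy value (say 768Mi or 1Gi) so limits stay easy to audit across a fleet.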
Frequently Asked Questions
Q: What is the difference between CrashLoopBackOff and Error status in Kubernetes?
A: Error is a transient state shown immediately after a container exits with a non-zero code. CrashLoopBackOff appears after Kubernetes has attempted several restarts and has begun applying an exponential backoff delay between attempts. CrashLoopBackOff means the crash is persistent and recurring — it is not a different type of failure, just a later stage of the same failure being rate-limited by the kubelet.
Q: How do I get logs from a container that is in CrashLoopBackOff if it crashes too fast?
A: Use the --previous flag to retrieve logs from the last terminated container instance rather than the currently running (or starting) one: kubectl logs <pod-name> --previous -n <namespace>. If the application exits before writing any logs, check the exit code via kubectl describe pod and look at the Reason field under Last State: Terminated for an OS-level signal name like OOMKilled.
Q: What does exit code 137 mean in a Kubernetes pod?
A: Exit code 137 equals 128 + 9, where 9 is the SIGKILL signal number. It usually means the container process was forcibly killed by the Linux kernel's OOM killer because it exceeded its memory limit. Look for Reason: OOMKilled in kubectl describe pod to confirm. The fix is to increase the memory limit or address a memory leak in the application.
Q: How long does Kubernetes wait between CrashLoopBackOff restarts?
A: The backoff follows an exponential progression: 10s, 20s, 40s, 80s, 160s, then caps at 300s (5 minutes). The backoff resets to the start of this sequence if the container runs successfully for at least 10 minutes before crashing again. You can observe the current backoff timer in the Events section of kubectl describe pod.
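The doubling-and-cap behaviour is easy to model — this is a sketch of the documented progression, not the kubelet's actual implementation:

```shell
#!/bin/sh
# Model the kubelet's crash backoff: double from 10s, cap at 300s.
delay=10
sequence=""
for crash in 1 2 3 4 5 6 7; do
  sequence="$sequence ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
echo "backoff delays:$sequence"   # 10s 20s 40s 80s 160s 300s 300s
```

Note how quickly the cap is reached: by the sixth crash every further restart attempt is five minutes apart, which is why alerting on early restarts matters.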
Q: Can a CrashLoopBackOff pod be fixed without redeploying?
A: It depends on the cause. If the crash is caused by a missing secret or a bad env var that exists outside the pod spec (e.g., a ConfigMap value), fixing the external resource and then running kubectl rollout restart deployment/<name> is sufficient. If the crash is in the pod spec itself (wrong image, bad probe path, wrong memory limit), you must update the deployment spec, which triggers a new pod rollout. You cannot edit a running pod's spec in place for managed fields.
Q: What is the difference between a liveness probe failure and a readiness probe failure causing a restart?
A: Only a liveness probe failure causes Kubernetes to kill and restart the container. A readiness probe failure only removes the pod from the load balancer's endpoint list — the container keeps running, it just stops receiving traffic. This means readiness probe failures alone do not produce CrashLoopBackOff. If you see CrashLoopBackOff with probe-related events, the liveness probe is the culprit.
Q: How do I temporarily stop CrashLoopBackOff to debug the container?
A: Override the container's command to prevent the application from starting, giving you a shell to investigate: kubectl debug -it <pod-name> --copy-to=debug-pod --container=api -- /bin/sh. Alternatively, scale the deployment to zero (kubectl scale deployment/api-deployment --replicas=0), then create a standalone debug pod using the same image and environment variables to reproduce the crash interactively.
Q: How do I tell if CrashLoopBackOff is caused by the application or by Kubernetes infrastructure?
A: Run the container image locally using Docker or Podman with the same environment variables: docker run --env-file=.env registry.solvethenetwork.com/api:v1.2.1. If it runs cleanly locally, the issue is infrastructure-side — a missing secret, a NetworkPolicy blocking a dependency, a misconfigured probe, or a resource limit. If it crashes locally with the same error, the issue is in the application code or image itself.
Q: What Prometheus alerts should I set up to catch CrashLoopBackOff early?
A: Two key alerts. First, increase(kube_pod_container_status_restarts_total[15m]) > 3 fires as soon as a container has restarted 3 times in 15 minutes, before the 5-minute backoff kicks in. Second, kube_pod_status_phase{phase="Failed"} > 0 catches pods that have transitioned to the Failed phase. Both give your team time to intervene before the situation becomes a full outage.
Q: Can a CrashLoopBackOff be caused by a node-level issue rather than the application?
A: Yes. Node-level disk pressure, memory pressure, or a corrupted container runtime (containerd, CRI-O) can cause containers to fail to start or to be killed immediately. Check node conditions: kubectl describe node sw-infrarunbook-01 | grep -A 10 Conditions. If you see MemoryPressure=True, DiskPressure=True, or PIDPressure=True, the node itself is the problem and pods across the node may be affected — not just yours.
Q: Does CrashLoopBackOff affect all replicas or just one?
A: If the root cause is configuration-based (bad env var, missing secret, misconfigured probe), all replicas of the deployment will enter CrashLoopBackOff since they all share the same pod spec. If the crash is caused by a transient node-level issue or a specific data condition that only affects one pod, you may see only one or a subset of replicas crashing while others remain healthy. The pattern of which pods are affected is itself a diagnostic clue.
Q: How do I roll back a deployment that is stuck in CrashLoopBackOff?
A: Use the built-in rollout undo command: kubectl rollout undo deployment/api-deployment -n production. Kubernetes will immediately start rolling out the previous ReplicaSet. To roll back to a specific revision, first list history with kubectl rollout history deployment/api-deployment, then target it: kubectl rollout undo deployment/api-deployment --to-revision=3. This is why immutable image tags and recorded rollout changes are essential — without them, rollback history is difficult to interpret.
