InfraRunBook

    Kubernetes Pod CrashLoopBackOff

    Kubernetes
    Published: Apr 4, 2026
    Updated: Apr 4, 2026

    A complete infrastructure runbook for diagnosing and resolving Kubernetes CrashLoopBackOff errors, covering every major root cause with real CLI commands, error output, and step-by-step fixes.


    Symptoms

    When a pod enters CrashLoopBackOff, Kubernetes is signalling that the container started, crashed, and is being restarted repeatedly — with each restart delayed exponentially (10s, 20s, 40s, 80s, up to a maximum of 5 minutes). The pod never stabilises. You will see it clearly when listing pods:

    kubectl get pods -n production
    
    NAME                            READY   STATUS             RESTARTS   AGE
    api-deployment-7d4b9c-xkp2z     0/1     CrashLoopBackOff   8          14m
    worker-deployment-5f8bb-r9t4k   0/1     CrashLoopBackOff   12         22m

    Additional symptoms you will observe:

    • The RESTARTS counter climbs steadily over time.
    • kubectl describe pod shows Back-off restarting failed container in the Events section.
    • Logs are either empty, cut short, or show a clear application-level error just before the process exits.
    • The pod may briefly flash Running before falling back into the crash loop.
    • Dependent workloads that rely on this pod begin failing their own health checks.

    The exit code in the pod's lastState.terminated field is a key diagnostic signal: exit code 1 typically means the application itself returned an error; exit code 137 (128 + 9) means the container was killed with SIGKILL, almost always OOMKilled; exit code 143 (128 + 15) means the process exited after receiving SIGTERM. This runbook covers every major cause systematically.
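
    The 128 + signal convention behind those codes can be captured in a small helper — a sketch for triage scripts, not a kubectl feature:

```shell
# Decode a container exit code into a human-readable cause.
# Codes above 128 mean "killed by signal (code - 128)".
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case $sig in
      9)  echo "killed by SIGKILL (signal 9) - usually OOMKilled" ;;
      15) echo "terminated by SIGTERM (signal 15)" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  elif [ "$code" -ne 0 ]; then
    echo "application exited with error code $code"
  else
    echo "clean exit"
  fi
}

decode_exit 137   # killed by SIGKILL (signal 9) - usually OOMKilled
decode_exit 1     # application exited with error code 1
```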


    Root Cause 1: Bad Config or Environment Variable

    Why It Happens

    Applications commonly read critical parameters — database URIs, port numbers, feature flags, log levels — from environment variables at startup. If a required variable is absent, misspelled in the manifest, or holds an invalid value (a string where an integer is expected, a malformed URL, an unsupported flag value), the application process exits immediately with a non-zero exit code. Kubernetes treats any non-zero exit as a container failure and schedules a restart, producing the crash loop.
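
    On the application side, the cleanest defence is an entrypoint that validates its configuration up front and exits with an explicit message rather than a bare stack trace. A minimal sketch — the variable names and checks are illustrative, not taken from any manifest in this runbook:

```shell
#!/bin/sh
# entrypoint-check.sh - validate required configuration before starting
# the real server, so the crash log states exactly what is wrong.
check_env() {
  # fail if unset or empty
  if [ -z "$DATABASE_URL" ]; then
    echo "FATAL: DATABASE_URL environment variable is not set" >&2
    return 1
  fi
  # reject non-numeric port values
  case $DATABASE_PORT in
    ''|*[!0-9]*)
      echo "FATAL: invalid value for DATABASE_PORT: '$DATABASE_PORT' (expected integer)" >&2
      return 1 ;;
  esac
}

# Demo values - in a real pod these come from the deployment spec
DATABASE_URL='postgres://10.0.1.45:5432/app'
DATABASE_PORT='5432'
check_env && echo "config ok"
# exec ./server would follow here once validation passes
```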

    How to Identify It

    Fetch logs from the previous (already-crashed) container instance — the current container may not have produced output yet:

    kubectl logs api-deployment-7d4b9c-xkp2z --previous -n production

    A misconfigured environment variable typically produces output similar to:

    FATAL: invalid value for DATABASE_PORT: "abc" (expected integer)
    exit status 1

    or a Go-style panic:

    panic: DATABASE_URL environment variable is not set
    goroutine 1 [running]:
    main.mustGetEnv(...)
            /app/main.go:22 +0x5a
    exit status 2

    List all environment variables currently injected into the running (or last-run) container:

    kubectl exec api-deployment-7d4b9c-xkp2z -n production -- env | sort

    Compare that against what the deployment manifest declares:

    kubectl get deployment api-deployment -n production -o yaml | grep -A 60 'env:'
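
    A quick way to spot what is missing is to diff the variable names the manifest should provide against what the container actually received. A sketch using comm, with inline sample data standing in for the real kubectl output:

```shell
# expected.txt: one variable name per line, taken from the manifest
# actual.txt:   `kubectl exec <pod> -- env` output, reduced to names only
printf '%s\n' DATABASE_URL DATABASE_PORT LOG_LEVEL | sort > expected.txt
printf '%s\n' DATABASE_PORT=5432 LOG_LEVEL=info PATH=/usr/bin | cut -d= -f1 | sort > actual.txt

# comm -23 prints lines only in the first file: variables the pod is missing
comm -23 expected.txt actual.txt
# DATABASE_URL
```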

    How to Fix It

    Correct the bad value directly on the deployment:

    kubectl set env deployment/api-deployment DATABASE_PORT=5432 -n production

    For a more durable fix, edit the manifest and apply it:

    kubectl edit deployment api-deployment -n production

    Then watch the rollout complete cleanly:

    kubectl rollout status deployment/api-deployment -n production
    
    Waiting for deployment "api-deployment" rollout to finish: 1 old replicas are pending termination...
    deployment "api-deployment" successfully rolled out

    Root Cause 2: Missing Kubernetes Secret

    Why It Happens

    Pods referencing a Secret via envFrom, env.valueFrom.secretKeyRef, or a volume mount will fail to start if the secret does not exist in the same namespace. The container never actually launches — the kubelet marks it as failed during setup, typically surfacing as CreateContainerConfigError before the pod settles into the backoff loop. Similarly, if the secret object exists but a specific key referenced by secretKeyRef is absent from that secret's data map, the pod fails the same way without ever running application code.

    How to Identify It

    Describe the pod and scan the Events block at the bottom:

    kubectl describe pod api-deployment-7d4b9c-xkp2z -n production

    The event section will show one of these messages:

    Events:
      Type     Reason     Age               From      Message
      ----     ------     ----              ----      -------
      Warning  Failed     3s (x4 over 45s)  kubelet   Error: secret "api-db-credentials" not found

    Or, for a key that is missing inside an existing secret:

      Warning  Failed     5s  kubelet  Error: couldn't find key DB_PASSWORD in Secret production/api-db-credentials

    List all secrets in the namespace to confirm what actually exists:

    kubectl get secrets -n production
    
    NAME                  TYPE     DATA   AGE
    api-tls-cert          Opaque   2      30d
    registry-pull-secret  Opaque   1      45d

    Note that api-db-credentials is absent entirely. If the secret exists, inspect its keys without exposing values:

    kubectl get secret api-db-credentials -n production \
      -o jsonpath='{.data}' | python3 -c \
      "import sys, json; print(list(json.load(sys.stdin).keys()))"
    ['DB_USER', 'DB_HOST']

    This confirms DB_PASSWORD is missing from the secret's data map.

    How to Fix It

    Create the missing secret in the correct namespace:

    kubectl create secret generic api-db-credentials \
      --from-literal=DB_USER='infrarunbook-admin' \
      --from-literal=DB_PASSWORD='s3cur3P@ss!' \
      --from-literal=DB_HOST='10.0.1.45' \
      -n production

    If the secret exists but is missing a key, patch it:

    kubectl patch secret api-db-credentials -n production \
      --type='json' \
      -p='[{"op":"add","path":"/data/DB_PASSWORD","value":"'$(echo -n 's3cur3P@ss!' | base64)'"}]'
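
    A detail worth checking when a patch like this misbehaves: values under data must be base64-encoded with no trailing newline, which is why the command above uses echo -n. The round trip is easy to verify locally:

```shell
# Encode the secret value exactly as it must appear under .data
# (echo -n matters: without it a trailing newline gets encoded too)
encoded=$(echo -n 's3cur3P@ss!' | base64)

# Round-trip: decoding must return the original value byte-for-byte
decoded=$(echo "$encoded" | base64 -d)
[ "$decoded" = 's3cur3P@ss!' ] && echo "round-trip ok"
```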

    Force the deployment to pick up the updated secret:

    kubectl rollout restart deployment/api-deployment -n production
    kubectl rollout status deployment/api-deployment -n production

    Root Cause 3: OOMKilled (Out of Memory)

    Why It Happens

    Every container in Kubernetes runs under a cgroup-enforced memory ceiling defined by the resources.limits.memory field. When the container's memory consumption crosses that ceiling — whether due to a memory leak, a burst in workload, or a limit that was never sized correctly — the Linux kernel's OOM killer terminates the container process with SIGKILL (signal 9). Kubernetes records this as an OOMKilled exit reason and exit code 137 (128 + 9). If the application consistently needs more memory than the limit allows, the pod will be killed and restarted in a continuous loop.

    How to Identify It

    Describe the pod and look for the Last State block under the container status:

    kubectl describe pod api-deployment-7d4b9c-xkp2z -n production
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Fri, 04 Apr 2026 08:12:01 +0000
          Finished:     Fri, 04 Apr 2026 08:12:44 +0000
        Ready:          False
        Restart Count:  9

    Confirm live memory consumption versus the configured limit:

    kubectl top pod api-deployment-7d4b9c-xkp2z --containers -n production
    
    POD                           NAME   CPU(cores)   MEMORY(bytes)
    api-deployment-7d4b9c-xkp2z  api    38m          503Mi

    Then inspect the spec limits:

    kubectl get pod api-deployment-7d4b9c-xkp2z -n production \
      -o jsonpath='{.spec.containers[0].resources}'
    
    {"limits":{"memory":"512Mi"},"requests":{"memory":"256Mi"}}

    The pod is consuming 503Mi against a 512Mi hard limit — any brief spike will trigger the kill.
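
    The margin can be quantified with shell arithmetic — a quick sketch using the numbers above:

```shell
# Values reported above: 503Mi in use against a 512Mi limit
usage_mi=503
limit_mi=512

pct=$(( usage_mi * 100 / limit_mi ))
headroom_mi=$(( limit_mi - usage_mi ))
echo "utilisation: ${pct}%  headroom: ${headroom_mi}Mi"
# utilisation: 98%  headroom: 9Mi
```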

    How to Fix It

    Increase the memory limit on the deployment:

    kubectl set resources deployment api-deployment \
      --limits=memory=1Gi \
      --requests=memory=512Mi \
      -n production

    If the OOM is caused by a genuine application memory leak rather than an undersized limit, you need to fix the code. In the meantime, consider enabling a Vertical Pod Autoscaler in recommendation mode to get data-driven sizing suggestions:

    kubectl get vpa api-deployment-vpa -n production -o yaml | grep -A 10 'recommendation'

    Long-term, instrument the application with heap profiling tooling (pprof for Go, jmap for JVM workloads, memory_profiler for Python) and analyse allocations under realistic load before setting final limits.


    Root Cause 4: Dependency Service Unavailable

    Why It Happens

    Many applications perform a blocking connection check during their startup sequence — connecting to PostgreSQL, Redis, RabbitMQ, or a downstream gRPC service before declaring themselves ready. If the dependency is unreachable due to a wrong hostname, a missing DNS record, a firewall rule, a NetworkPolicy, or because the dependency itself is crashing, the application's startup routine fails and the process exits. In microservices deployments, this is especially common when multiple services are deployed simultaneously and the dependency simply hasn't started yet — a classic race condition that results in both services entering a crash loop.

    How to Identify It

    Pull the previous container's logs to see what the application was trying to connect to:

    kubectl logs api-deployment-7d4b9c-xkp2z --previous -n production
    Error: unable to connect to PostgreSQL at postgres-service.production.svc.cluster.local:5432
    dial tcp 10.0.1.45:5432: connect: connection refused
    failed to initialise application, exiting

    Check the state of the dependency itself:

    kubectl get pods -n production -l app=postgres
    
    NAME           READY   STATUS             RESTARTS   AGE
    postgres-0     0/1     CrashLoopBackOff   5          8m

    The dependency is itself crashing — confirming the root cause. If the dependency appears healthy, test DNS resolution and TCP reachability from within the cluster using a debug pod:

    kubectl run debug --image=busybox:1.35 --restart=Never --rm -it -n production -- \
      nslookup postgres-service.production.svc.cluster.local
    
    Server:    10.96.0.10
    Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
    
    Name:      postgres-service.production.svc.cluster.local
    Address 1: 10.0.1.45
    kubectl run debug --image=busybox:1.35 --restart=Never --rm -it -n production -- \
      nc -zv 10.0.1.45 5432
    
    10.0.1.45 (10.0.1.45:5432) open

    Also check for NetworkPolicies that may be blocking traffic:

    kubectl get networkpolicies -n production
    kubectl describe networkpolicy allow-api-to-postgres -n production

    How to Fix It

    Resolve the dependency's own crash first, then prevent the race condition from recurring by adding an init container that gates startup on the dependency being ready:

    initContainers:
    - name: wait-for-postgres
      image: busybox:1.35
      command:
        - sh
        - -c
        - |
          until nc -z postgres-service.production.svc.cluster.local 5432; do
            echo "waiting for postgres at postgres-service.production.svc.cluster.local:5432..."
            sleep 3
          done
          echo "postgres is ready"

    This init container runs to completion before the main application container starts, eliminating the startup race condition entirely regardless of deployment ordering.
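
    If the image lacks nc, bash can make the same TCP check through its /dev/tcp pseudo-device, with a hard deadline so a permanently absent dependency eventually fails the init container instead of blocking forever. A sketch — the host, port, and timeout values are illustrative:

```shell
#!/bin/bash
# Wait for host:port to accept TCP connections, with a hard deadline.
# Returns 0 once the port opens, 1 if the deadline passes first.
wait_for_tcp() {
  host=$1; port=$2
  deadline=$(( $(date +%s) + $3 ))
  # the subshell opens (and closes) a probe connection via /dev/tcp
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $host:$port" >&2
      return 1
    fi
    sleep 1
  done
}

# Example: a port nothing listens on fails once the 2s deadline passes
wait_for_tcp 127.0.0.1 1 2 2>/dev/null || echo "dependency unreachable"
```

Swapping the nc loop for this function keeps the gate working on images that ship bash but no netcat; in a shell without /dev/tcp the connection attempt simply fails each round and the deadline still applies.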


    Root Cause 5: Liveness Probe Misconfiguration

    Why It Happens

    A liveness probe failure causes Kubernetes to kill and restart the container. A misconfigured liveness probe — one pointing at a non-existent URL path, using the wrong port, or with an initialDelaySeconds that is shorter than the application's actual startup time — will kill a perfectly healthy container before it has finished initialising. This produces a crash loop that is indistinguishable from an application crash unless you look carefully at the Events section. Note that a readiness probe failure alone does not cause a restart — it only removes the pod from Service endpoints — but when both liveness and readiness probes share a misconfigured endpoint, both fail simultaneously and the liveness kill drives the restart loop.

    How to Identify It

    Describe the pod and inspect the Events section closely:

    kubectl describe pod api-deployment-7d4b9c-xkp2z -n production
        Liveness:   http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
        Readiness:  http-get http://:8080/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    
    Events:
      Type     Reason    Age                From      Message
      ----     ------    ----               ----      -------
      Warning  Unhealthy 18s (x3 over 38s)  kubelet   Liveness probe failed: HTTP probe failed with statuscode: 404
      Warning  Killing   18s                kubelet   Container api failed liveness probe, will be restarted

    The 404 shows the probe is hitting a path that does not exist. Verify the correct health endpoint directly inside the container:

    kubectl exec api-deployment-7d4b9c-xkp2z -n production -- \
      wget -qO- http://localhost:8080/health
    
    {"status":"ok","uptime":42}

    The actual endpoint is /health, not /healthz. Also check how long the application takes to start listening:

    kubectl logs api-deployment-7d4b9c-xkp2z -n production | grep -i 'listen\|ready\|started\|binding'
    
    2026-04-04T08:15:33Z INFO  server listening on :8080 after 22s startup

    The application takes 22 seconds to start, but initialDelaySeconds is only 5 seconds — the liveness probe fires and fails 17 seconds before the server is ready.

    How to Fix It

    Correct the probe path and give the application enough headroom to initialise. A well-tuned probe configuration:

    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 40
      periodSeconds: 15
      timeoutSeconds: 5
      failureThreshold: 3
    
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30
      periodSeconds: 5

    The startupProbe is particularly valuable here — it gives the application up to 150 seconds (30 × 5s) to start before the liveness probe takes over, preventing premature kills during slow initialisation. Apply and verify:

    kubectl apply -f deployment.yaml
    kubectl rollout status deployment/api-deployment -n production
    
    deployment "api-deployment" successfully rolled out
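
    The timing fields above reduce to simple arithmetic, which is worth rechecking whenever probes are retuned — a sketch using the values from this configuration:

```shell
# startupProbe: total initialisation time allowed before the pod is killed
startup_budget=$(( 30 * 5 ))    # failureThreshold x periodSeconds
# livenessProbe: consecutive-failure window needed to trigger a restart
kill_window=$(( 15 * 3 ))       # periodSeconds x failureThreshold
echo "startup budget: ${startup_budget}s"
echo "liveness kill window: ${kill_window}s"
```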

    Root Cause 6: Application Code Crash or Missing Runtime Dependency

    Why It Happens

    The container image itself may contain a bug, a missing Python module, a missing shared library, or an incorrect entrypoint script. The application process exits the moment it starts — often before producing meaningful logs — with a non-zero exit code. This is common after image rebuilds where a dependency was accidentally removed or after a bad merge.

    How to Identify It

    kubectl logs api-deployment-7d4b9c-xkp2z --previous -n production
    
    Traceback (most recent call last):
      File "/app/main.py", line 14, in <module>
        from utils.auth import verify_token
    ModuleNotFoundError: No module named 'utils.auth'

    Or for a missing shared library:

    /app/server: error while loading shared libraries: libssl.so.3: cannot open shared object file: No such file or directory

    Get the exact exit code from the terminated state:

    kubectl get pod api-deployment-7d4b9c-xkp2z -n production \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
    
    1

    How to Fix It

    Fix the application code or Dockerfile, rebuild the image, push it to your registry, and update the deployment:

    kubectl set image deployment/api-deployment \
      api=registry.solvethenetwork.com/api:v1.2.2 \
      -n production
    kubectl rollout status deployment/api-deployment -n production

    To roll back to the last known-good image while the fix is in progress:

    kubectl rollout undo deployment/api-deployment -n production

    Root Cause 7: Init Container Failure

    Why It Happens

    Init containers execute sequentially before the main container and must exit with code 0. Common uses include running database migrations, fetching secrets from Vault, rendering config files, or seeding a volume. If an init container fails — because the migration SQL is broken, the secret store is unreachable, or a pre-flight assertion fails — Kubernetes retries it according to the pod's restartPolicy, producing Init:CrashLoopBackOff in the status column.

    How to Identify It

    kubectl get pods -n production
    
    NAME                            READY   STATUS                  RESTARTS   AGE
    api-deployment-7d4b9c-xkp2z     0/1     Init:CrashLoopBackOff   4          6m

    Fetch logs from the init container by name using the -c flag:

    kubectl logs api-deployment-7d4b9c-xkp2z -c init-migrate --previous -n production
    
    running migration 004_add_sessions_table.sql
    ERROR: relation "users" does not exist
    LINE 1: ALTER TABLE users ADD COLUMN last_seen TIMESTAMP;
    migration failed — exiting with code 1

    How to Fix It

    Resolve the underlying issue the init container is reporting. In this case, a previous migration was never applied, so the schema is in an inconsistent state. After fixing the database state and correcting the migration files:

    kubectl rollout restart deployment/api-deployment -n production
    kubectl rollout status deployment/api-deployment -n production

    Prevention

    Preventing CrashLoopBackOff requires discipline across your entire delivery pipeline — from manifest authoring to production monitoring.

    • Lint manifests in CI/CD: Run kubeval, kubeconform, or kube-score against every manifest change in your pipeline. Catch missing secretKeyRef targets and invalid field values before they ever reach the cluster.
    • Use init containers for dependency readiness: Never assume a dependency is up. Gate every application container behind an init container with a TCP or HTTP readiness check and an exponential-backoff retry loop.
    • Set realistic resource limits: Run load tests in staging with kubectl top monitoring before committing to production limits. Use VPA recommendations as a baseline, then add a 30–50% headroom buffer.
    • Tune probes carefully: Always set initialDelaySeconds to at least 120% of your observed worst-case startup time. Use a startupProbe for slow-starting applications. Test probe endpoints explicitly in staging under load.
    • Manage secrets lifecycle with GitOps: Use External Secrets Operator, Vault Agent Injector, or Sealed Secrets to ensure secrets are provisioned before workloads are deployed. Never rely on manual kubectl create secret steps in production.
    • Alert on restart counts before they spiral: Add a Prometheus alert: increase(kube_pod_container_status_restarts_total[15m]) > 3 so your team is notified well before a pod enters the 5-minute backoff phase.
    • Use immutable image tags: Never deploy :latest. Tag images with the full Git commit SHA so every running pod can be traced back to an exact code state and rolled back deterministically.
    • Enforce namespace-level LimitRanges: Apply LimitRange objects to namespaces to set sensible default requests and limits, preventing pods from being scheduled with no resource constraints at all.
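
    The restart-count alert from the list above can be expressed as a PrometheusRule. A sketch that assumes the Prometheus Operator and kube-state-metrics are installed; the rule name, namespace, and labels are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crashloop-early-warning
  namespace: monitoring
spec:
  groups:
  - name: pod-restarts
    rules:
    - alert: PodRestartingTooOften
      # fires once a container restarts more than 3 times in 15 minutes
      expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15m"
```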

    Frequently Asked Questions

    Q: What is the difference between CrashLoopBackOff and Error status in Kubernetes?

    A: Error is a transient state shown immediately after a container exits with a non-zero code. CrashLoopBackOff appears after Kubernetes has attempted several restarts and has begun applying an exponential backoff delay between attempts. CrashLoopBackOff means the crash is persistent and recurring — it is not a different type of failure, just a later stage of the same failure being rate-limited by the kubelet.

    Q: How do I get logs from a container that is in CrashLoopBackOff if it crashes too fast?

    A: Use the --previous flag to retrieve logs from the last terminated container instance rather than the currently running (or starting) one: kubectl logs <pod-name> --previous -n <namespace>. If the application exits before writing any logs, check the exit code via kubectl describe pod and look at the Last State: Terminated: Reason field for an OS-level signal name like OOMKilled.

    Q: What does exit code 137 mean in a Kubernetes pod?

    A: Exit code 137 equals 128 + 9, where 9 is the SIGKILL signal number. It means the container process was forcibly killed by the Linux kernel's OOM killer because it exceeded its memory limit. Look for Reason: OOMKilled in kubectl describe pod to confirm. The fix is to increase the memory limit or address a memory leak in the application.

    Q: How long does Kubernetes wait between CrashLoopBackOff restarts?

    A: The backoff follows an exponential progression: 10s, 20s, 40s, 80s, 160s, then caps at 300s (5 minutes). The backoff resets to the start of the sequence if the container runs successfully for at least 10 minutes between failures. You can observe the current backoff timer in the Events section of kubectl describe pod.

    Q: Can a CrashLoopBackOff pod be fixed without redeploying?

    A: It depends on the cause. If the crash is caused by a missing secret or a bad value that lives outside the pod spec (e.g., a ConfigMap value), fixing the external resource and then running kubectl rollout restart deployment/<name> is sufficient. If the crash is in the pod spec itself (wrong image, bad probe path, wrong memory limit), you must update the deployment spec, which triggers a new pod rollout — most fields of a running pod's spec cannot be edited in place.

    Q: What is the difference between a liveness probe failure and a readiness probe failure causing a restart?

    A: Only a liveness probe failure causes Kubernetes to kill and restart the container. A readiness probe failure only removes the pod from the load balancer's endpoint list — the container keeps running, it just stops receiving traffic. This means readiness probe failures alone do not produce CrashLoopBackOff. If you see CrashLoopBackOff with probe-related events, the liveness probe is the culprit.

    Q: How do I temporarily stop CrashLoopBackOff to debug the container?

    A: Override the container's command to prevent the application from starting, giving you a shell to investigate: kubectl debug -it <pod-name> --copy-to=debug-pod --container=api -- /bin/sh. Alternatively, scale the deployment to zero (kubectl scale deployment/api-deployment --replicas=0), then create a standalone debug pod using the same image and environment variables to reproduce the crash interactively.

    Q: How do I tell if CrashLoopBackOff is caused by the application or by Kubernetes infrastructure?

    A: Run the container image locally using Docker or Podman with the same environment variables: docker run --env-file=.env registry.solvethenetwork.com/api:v1.2.1. If it runs cleanly locally, the issue is infrastructure-side — a missing secret, a NetworkPolicy blocking a dependency, a misconfigured probe, or a resource limit. If it crashes locally with the same error, the issue is in the application code or image itself.

    Q: What Prometheus alerts should I set up to catch CrashLoopBackOff early?

    A: Two key alerts. First, increase(kube_pod_container_status_restarts_total[15m]) > 3 fires as soon as a container has restarted 3 times in 15 minutes, before the 5-minute backoff kicks in. Second, kube_pod_status_phase{phase="Failed"} > 0 catches pods that have transitioned to the Failed phase. Both give your team time to intervene before the situation becomes a full outage.

    Q: Can a CrashLoopBackOff be caused by a node-level issue rather than the application?

    A: Yes. Node-level disk pressure, memory pressure, or a corrupted container runtime (containerd, CRI-O) can cause containers to fail to start or to be killed immediately. Check node conditions: kubectl describe node sw-infrarunbook-01 | grep -A 10 Conditions. If you see MemoryPressure=True, DiskPressure=True, or PIDPressure=True, the node itself is the problem and pods across the node may be affected — not just yours.

    Q: Does CrashLoopBackOff affect all replicas or just one?

    A: If the root cause is configuration-based (bad env var, missing secret, misconfigured probe), all replicas of the deployment will enter CrashLoopBackOff since they all share the same pod spec. If the crash is caused by a transient node-level issue or a specific data condition that only affects one pod, you may see only one or a subset of replicas crashing while others remain healthy. The pattern of which pods are affected is itself a diagnostic clue.

    Q: How do I roll back a deployment that is stuck in CrashLoopBackOff?

    A: Use the built-in rollout undo command: kubectl rollout undo deployment/api-deployment -n production. Kubernetes will immediately start rolling out the previous ReplicaSet. To roll back to a specific revision, first list history with kubectl rollout history deployment/api-deployment, then target it: kubectl rollout undo deployment/api-deployment --to-revision=3. This is why immutable image tags and recorded rollout changes are essential — without them, rollback history is difficult to interpret.
