InfraRunBook

    Envoy Retry Policy Not Working

    Envoy
    Published: Apr 17, 2026
    Updated: Apr 17, 2026

    Envoy retry policies fail silently in ways that are hard to diagnose. This guide walks through every common root cause — wrong retry conditions, exhausted budgets, per-try timeouts, mismatched status codes, and missing host selection plugins — with real commands and fixes.


    Symptoms

    You've wired up Envoy's retry policy — num_retries, retry_on, maybe a per_try_timeout — and your upstream keeps returning errors you expect to be retried. But clients are getting those errors immediately. No retries. The upstream_rq_retry counter on the admin API is flat. You tail the access log and watch 503s sailing straight through to the downstream client without a single retry attempt.

    Or maybe retries are happening, but they're not helping. The same upstream host keeps failing. Clients still see elevated error rates despite a retry config that should be absorbing the failures. The counters are moving, but success rates aren't improving.

    This is one of the more frustrating classes of Envoy problems because the failure is silent. Envoy doesn't log "I decided not to retry this" — it just doesn't. You have to go hunting through stats, access logs, and config dumps to piece together what's actually happening. This guide covers every root cause I've run into in production, with the specific commands that surface each one.


    Root Cause 1: Retry Condition Wrong

    The retry_on field is the gatekeeper for all retry logic in Envoy. If the condition you've specified doesn't match what's actually happening at the transport or protocol level, Envoy won't retry. Full stop. I've seen this trip up engineers who copy a config from a gRPC service and drop it on an HTTP/1.1 route without adjusting the retry conditions, or who assume a narrow condition like reset covers clean 5xx responses when it doesn't.

    The most common mismatch is using reset when you're actually getting a clean HTTP response with a 5xx status, or using retriable-status-codes when the upstream is terminating the connection before sending any HTTP response at all. These are different failure modes, and Envoy distinguishes them precisely.

    To identify this, start by checking the response flags in your Envoy access logs. The response flags field tells you exactly how the request failed:

    kubectl exec -n solvethenetwork deploy/envoy-proxy -- \
      cat /var/log/envoy/access.log | tail -100 | awk '{print $9, $10, $NF}'

    Match the response flags you see to the correct retry_on value. The flags that matter most are UF (upstream connection failure before headers — use connect-failure), UC (upstream connection terminated mid-stream — use reset), UH (no healthy upstream hosts — pair retriable-status-codes with 503), and URX (retry limit exceeded — retries are actually working but being exhausted). If you're seeing URX, your retry conditions are correct but you're hitting a budget problem, which is covered next.
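    If you want a tally of flags rather than eyeballing individual lines, the flags column can be counted directly. This is a sketch that assumes the default access log format, where %RESPONSE_FLAGS% is the sixth whitespace-separated field; adjust the column for a custom format. The sample lines are illustrative.

```shell
# Count occurrences of each response flag. Assumes the default access
# log format: [START_TIME] "METHOD PATH PROTOCOL" CODE FLAGS ...
# so %RESPONSE_FLAGS% lands in awk field $6.
tally_flags() {
  awk '{print $6}' | sort | uniq -c | sort -rn
}

# Demo on three illustrative log lines:
sample='[2026-04-17T10:00:01Z] "GET /api/v1 HTTP/1.1" 503 UF 0 91 5 - "-"
[2026-04-17T10:00:02Z] "GET /api/v1 HTTP/1.1" 503 UF 0 91 5 - "-"
[2026-04-17T10:00:03Z] "GET /api/v2 HTTP/1.1" 503 UC 0 91 5 - "-"'
echo "$sample" | tally_flags   # UF appears twice, UC once
```

    In production you'd pipe the tail of the real access log into the same function instead of the inline sample.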

    Once you know your actual failure mode, configure retry_on to match it explicitly:

    retry_policy:
      retry_on: "connect-failure,reset,5xx"
      num_retries: 3
      per_try_timeout: 2s

    Don't blindly stuff every condition into retry_on, but do make sure the conditions you list actually correspond to the failures you're observing. The Envoy documentation lists every valid value and they're not all intuitive — gateway-error covers 502, 503, and 504 specifically, while 5xx covers any 5xx response code and also fires when the upstream doesn't respond at all. These overlap in ways that matter depending on your upstream error distribution.


    Root Cause 2: Max Retries Too Low

    num_retries defaults to 1 in Envoy. That's one retry attempt in addition to the original request. Under any real load with a degraded upstream, a single retry gets exhausted almost immediately and stops being meaningful. But there's a subtler and more dangerous version of this problem that operates at the cluster level.

    Envoy enforces a retry budget through its circuit breaker settings. The default max_retries threshold in the cluster circuit breaker is 3. This is not the per-request retry count — it's the maximum number of active concurrent retry attempts across all connections to the entire cluster. When that threshold is hit, Envoy stops issuing new retries and increments the upstream_rq_retry_overflow counter instead. Your route-level num_retries: 5 setting becomes irrelevant the moment the circuit breaker trips.

    Check whether this is happening right now:

    curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_retry"

    You should see output like this:

    cluster.api_backend.upstream_rq_retry: 8241
    cluster.api_backend.upstream_rq_retry_success: 7903
    cluster.api_backend.upstream_rq_retry_overflow: 4712
    cluster.api_backend.upstream_rq_retry_limit_exceeded: 338

    If upstream_rq_retry_overflow is a non-trivial number and climbing, the cluster circuit breaker is throttling your retries. The fix is to raise the cluster-level max_retries threshold to something realistic for your traffic volume:

    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_retries: 100
          max_pending_requests: 1024
          max_requests: 2048
          max_connections: 512

    The right value for max_retries depends on your concurrency. A rough starting point: take your expected peak concurrent requests, multiply by your num_retries value, and use that as the ceiling. Under high concurrency, the default of 3 is useless. That said, don't just set it to an arbitrarily large number — retries under load can amplify pressure on a struggling upstream and trigger a retry storm. Be deliberate about the ceiling you choose.
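    The sizing rule reduces to one multiplication. The peak concurrency figure below is an illustrative assumption, not a recommendation; substitute your own measured peak.

```shell
# Rough ceiling for the cluster-level max_retries circuit breaker:
# expected peak concurrent requests times the route-level num_retries.
peak_concurrent=400   # example value; measure your own peak
num_retries=3
retry_budget_ceiling=$(( peak_concurrent * num_retries ))
echo "$retry_budget_ceiling"   # 1200
```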


    Root Cause 3: Per-Try Timeout Too Low

    The per_try_timeout is the maximum time Envoy will wait for each individual attempt, original request and retries alike. If your upstream is slow but eventually succeeds, a per-try timeout that's too short kills every attempt before it can succeed, burning through your retry budget while accomplishing nothing.

    Here's the exact failure scenario. You configure a 30-second route timeout, a 500ms per-try timeout, and three retries. Your upstream is under load and responding in 650ms. Every attempt — original and all three retries — times out at 500ms. Envoy dutifully retries, exhausts the budget, and delivers a 504 to the client. The upstream would have responded successfully 150ms later, but you never let it breathe.

    Identify this by correlating retry and timeout counters:

    curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_(retry|timeout)"
    # Problematic output — retry and timeout counts tracking together:
    cluster.api_backend.upstream_rq_retry: 5103
    cluster.api_backend.upstream_rq_retry_success: 41
    cluster.api_backend.upstream_rq_timeout: 4977
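    A quick way to quantify how futile those retries are, using the counters from the sample above:

```shell
# Retry success rate from the snapshot above: 41 successes out of
# 5103 attempts, as an integer percentage.
retries=5103
successes=41
success_pct=$(( successes * 100 / retries ))
echo "${success_pct}%"   # 0%
```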

    When retry counts and timeout counts are nearly equal and retry success is near zero, per-try timeout is almost certainly the culprit. Follow up by checking your actual upstream response time distribution:

    curl -s http://sw-infrarunbook-01:9901/stats | grep "upstream_rq_time"

    Compare the p95 and p99 latencies against your configured per_try_timeout. Your per-try timeout must comfortably exceed the p95 upstream response time under normal conditions. If your p99 is 800ms, set per_try_timeout to at least 1.5 seconds. Then make sure your overall route timeout accounts for the worst case:

    retry_policy:
      retry_on: "5xx,reset,connect-failure"
      num_retries: 3
      per_try_timeout: 2s
    
    # Route-level timeout must be >= per_try_timeout * (num_retries + 1) + buffer
    # In this case: 2s * 4 = 8s minimum, set to 12s for headroom
    route:
      timeout: 12s

    A route timeout that fires before your retry budget is exhausted is a misconfiguration. The client gets a 504, your retries were wasted, and the upstream never got a fair chance. Calculate the worst-case duration and set the route timeout accordingly.
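    The comment in the config spells out the formula; as plain arithmetic, with the same numbers:

```shell
# Worst-case retry chain duration: per_try_timeout * (num_retries + 1),
# then add headroom for backoff between attempts.
per_try_s=2
num_retries=3
min_route_timeout=$(( per_try_s * (num_retries + 1) ))
echo "${min_route_timeout}s minimum"               # 8s minimum
route_timeout=$(( min_route_timeout + 4 ))         # headroom
echo "${route_timeout}s route timeout"             # 12s route timeout
```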


    Root Cause 4: Retry on Wrong Status Code

    This is distinct from getting the retry condition wrong at the transport level. When you use retriable-status-codes in retry_on, Envoy also needs a populated retriable_status_codes list specifying which HTTP status codes should trigger a retry. Without the list, retriable-status-codes is a no-op — Envoy accepts it, applies it, and never retries anything because no codes are defined.

    In my experience, this happens most often when copying config snippets. The retry_on value gets copied, but the corresponding retriable_status_codes block is missing. The config passes validation silently:

    # Broken — retriable-status-codes with no codes specified
    retry_policy:
      retry_on: "retriable-status-codes"
      num_retries: 3

    Envoy doesn't warn you. It simply never matches any status code because the list is empty. The corrected config:

    retry_policy:
      retry_on: "retriable-status-codes"
      num_retries: 3
      retriable_status_codes:
        - 503
        - 500
        - 429

    The related mistake is assuming 5xx in retry_on covers 429. It doesn't. The 5xx condition matches 5xx response codes only, and 429 is a 4xx. If your upstream returns 429 under rate limiting and you want Envoy to retry those requests, you need retriable-status-codes with 429 explicitly listed.

    Always confirm what status codes your upstream is actually returning before writing the config. Pull the distribution from the cluster stats:

    curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_[0-9]{3}:"
    
    # Sample output:
    # cluster.api_backend.upstream_rq_200: 189043
    # cluster.api_backend.upstream_rq_429: 5821
    # cluster.api_backend.upstream_rq_503: 344
    # cluster.api_backend.upstream_rq_504: 87

    Once you see 429s in that output, you know you need retriable-status-codes with 429 in the list. Don't configure retry policy based on what you think the upstream returns — verify it first.
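    A small pipeline can pull the candidate codes straight out of a stats snapshot like the one above. The stats text here repeats the sample output inline so the sketch is self-contained; in practice you'd pipe the live curl output in instead.

```shell
# Extract every 4xx/5xx upstream_rq_NNN counter name as a candidate
# for the retriable_status_codes list. Sample data from the output above.
stats='cluster.api_backend.upstream_rq_200: 189043
cluster.api_backend.upstream_rq_429: 5821
cluster.api_backend.upstream_rq_503: 344
cluster.api_backend.upstream_rq_504: 87'
codes=$(echo "$stats" | grep -oE 'upstream_rq_[45][0-9]{2}' \
  | sed 's/upstream_rq_//' | sort -u)
echo "$codes"   # 429, 503, 504, one per line; 200 is excluded
```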


    Root Cause 5: Host Selection Retry Plugin Missing

    This is the most insidious of all retry problems because retries are technically happening — they're just not helping. Without a host selection retry plugin configured, Envoy's load balancer may select the exact same failed upstream host for each retry attempt. If a host is throwing 503s because it's unhealthy, retrying to that same host three times accomplishes nothing except wasting time and burning your route timeout.

    The previous_hosts retry plugin instructs Envoy to exclude hosts that were already attempted during the current request's retry chain. It sounds like it should be default behavior. It isn't. You have to explicitly configure it, and most teams don't know it exists until they've been burned by it.

    To identify whether retries are landing on the same host, you need access log entries that include the %UPSTREAM_HOST% field. If you have that enabled, look for patterns like this:

    # Count requests per upstream host; replace $11 with whichever field
    # carries %UPSTREAM_HOST% in your access log format
    awk '{print $11}' /var/log/envoy/access.log | sort | uniq -c | sort -rn
    
    # A broken retry distribution looks like:
    # 6203  10.10.1.45:8080
    # 17    10.10.1.46:8080
    # 11    10.10.1.47:8080

    One host absorbing nearly all retry traffic while the others get almost none is a strong signal that the load balancer is repeatedly selecting the same host. You can also detect this indirectly: if your upstream_rq_retry counter is climbing but upstream_rq_retry_success is near zero, and one specific upstream host has a persistently high error rate in your monitoring, retries are probably piling onto that broken instance.
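    You can put a number on that skew. Using the sample distribution above, the hottest host's share of retry traffic works out to:

```shell
# Percentage of retry traffic absorbed by the hottest host, using the
# broken distribution above. Anything near 100 means the load balancer
# keeps re-selecting the same host.
top=6203
total=$(( 6203 + 17 + 11 ))
skew_pct=$(( top * 100 / total ))
echo "${skew_pct}%"   # 99%
```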

    The fix is to add the previous_hosts predicate to your retry policy and set host_selection_retry_max_attempts high enough that the load balancer actually gets to pick a different host:

    retry_policy:
      retry_on: "5xx,reset,connect-failure"
      num_retries: 3
      per_try_timeout: 2s
      host_selection_retry_max_attempts: 10
      retry_host_predicate:
        - name: envoy.retry_host_predicates.previous_hosts

    Set host_selection_retry_max_attempts to at least twice your num_retries value. This controls how many times Envoy will spin the load balancer wheel trying to find a host it hasn't tried yet. If you have three upstream hosts and host_selection_retry_max_attempts is 1, you're giving Envoy only one shot at avoiding the previous host — which may not be enough if the load balancer naturally tends toward that host. Give it room to maneuver.

    Envoy also ships an omit_canary_hosts predicate that prevents retries from landing on canary-tagged endpoints, and an omit_host_metadata predicate for metadata-based host exclusion. In most production setups, previous_hosts is the one you need. Make it part of your standard retry policy template.


    Root Cause 6: x-envoy-retry-on Header Override

    Envoy respects the x-envoy-retry-on request header from downstream clients by default. If a downstream service, sidecar, or API gateway is sending this header — set to an empty value, off, or a more restrictive condition than your route config — it overrides your retry policy entirely.

    This shows up in service mesh deployments where multiple Envoy proxies are in the call path, or when a load balancer earlier in the call path is adding Envoy-specific headers before the request reaches your proxy. Check whether incoming requests carry this header by enabling debug-level header logging or using a test request:

    curl -v http://sw-infrarunbook-01:8080/api/health 2>&1 | grep -i "x-envoy"
    
    # If you see something like:
    # > x-envoy-retry-on: off
    # ...that's your problem

    You can also temporarily enable debug logging on the Envoy admin API and trigger a request to see exactly what headers arrive at the route level:

    curl -s -X POST "http://sw-infrarunbook-01:9901/logging?filter=debug"

    The cleanest fix is to strip the override header at the ingress point using Envoy's request_headers_to_remove directive on the route or virtual host:

    request_headers_to_remove:
      - x-envoy-retry-on
      - x-envoy-max-retries
      - x-envoy-retry-grpc-on

    This forces all retry decisions through your static route configuration, regardless of what the downstream client sends. If you legitimately need to allow per-request retry tuning from trusted internal services, scope the header trust to specific virtual hosts rather than allowing it globally.


    Prevention

    Getting Envoy retry policies right the first time requires understanding several interacting configuration layers. Keeping them working requires ongoing observability. Here are a few practices that prevent these problems from recurring in production.

    Always instrument the core retry metrics as part of your cluster dashboards. The four counters you care about are upstream_rq_retry, upstream_rq_retry_success, upstream_rq_retry_overflow, and upstream_rq_retry_limit_exceeded. Alert on retry_overflow trending upward — that's the circuit breaker throttling your retries. Alert on retry_success being low relative to upstream_rq_retry — that often signals the host selection problem or a per-try timeout issue.

    curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_retry"
    
    # Healthy baseline:
    # cluster.api_backend.upstream_rq_retry: 203
    # cluster.api_backend.upstream_rq_retry_success: 191
    # cluster.api_backend.upstream_rq_retry_overflow: 0
    # cluster.api_backend.upstream_rq_retry_limit_exceeded: 12
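    As a sketch of the overflow alert, using the unhealthy counters from Root Cause 2 (the 5% threshold here is an illustrative starting point, not a standard):

```shell
# Treat overflow as a fraction of total retry attempts and alert when
# it exceeds a threshold. Counters from the Root Cause 2 snapshot.
retry=8241
overflow=4712
overflow_pct=$(( overflow * 100 / retry ))
echo "${overflow_pct}%"   # 57%
if [ "$overflow_pct" -gt 5 ]; then
  echo "ALERT: circuit breaker is throttling retries"
fi
```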

    Use the admin API's /config_dump endpoint to verify your retry policy is actually applied at the route level you intend. It's easy to apply a retry policy to a virtual host when it should be on a specific route, or to misconfigure the route match so the retry policy never triggers. Verify the rendered config, not just the YAML you wrote:

    curl -s http://sw-infrarunbook-01:9901/config_dump?resource=dynamic_route_configs | \
      python3 -m json.tool | grep -A 20 '"retry_policy"'

    Load-test your retry configuration before it matters. A retry policy that looks correct under light traffic fails silently under load when the circuit breaker retry budget gets exhausted. Run controlled fault injection — return artificial 503s from your upstream using Envoy's fault injection filter, then load test at realistic concurrency and watch whether retries are actually helping. This is the only way to catch the max_retries circuit breaker problem before it hits production.

    Establish a standard retry policy template for your organization and require deviation from it to be explicit. Something like this covers the common cases and avoids most of the pitfalls covered in this guide:

    retry_policy:
      retry_on: "connect-failure,reset,5xx"
      num_retries: 3
      per_try_timeout: 3s
      host_selection_retry_max_attempts: 10
      retry_host_predicate:
        - name: envoy.retry_host_predicates.previous_hosts
    
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_retries: 100

    This template includes the host selection predicate so retries land on different hosts, sets a sensible per-try timeout, and raises the circuit breaker threshold above the default. Review your upstream p99 latency before accepting the 3-second per-try timeout — that's a starting point, not a universal answer. Pair it with a route timeout of at least 15 seconds to give the full retry chain room to complete.

    Finally, strip x-envoy-retry-on and x-envoy-max-retries headers at ingress by default unless you have a specific, audited reason to honor them from downstream clients. Allowing downstream clients to override retry policy is a footgun that's easy to forget exists until a misconfigured upstream starts sending x-envoy-retry-on: off and your SLO starts degrading.

    Frequently Asked Questions

    Why does Envoy show upstream_rq_retry incrementing but errors still reach clients?

    Retries are happening but failing every time. The most likely causes are per_try_timeout too short (every attempt times out before the upstream can respond), the host selection retry plugin missing (retries go to the same failed host), or the retry condition not matching the actual failure type so Envoy retries but never hits a success path. Check upstream_rq_retry_success versus upstream_rq_retry — if the success count is near zero, one of these causes is responsible.

    What is the difference between retry_on: 5xx and retry_on: retriable-status-codes in Envoy?

    The 5xx condition retries on any 5xx response code automatically, and also fires when the upstream fails to respond at all. The retriable-status-codes condition lets you specify arbitrary HTTP status codes via the retriable_status_codes list — useful for retrying 429 responses or other non-5xx codes. Without a populated retriable_status_codes list, the retriable-status-codes condition is a no-op.

    How do I stop Envoy retries from going to the same failed upstream host?

    Add the envoy.retry_host_predicates.previous_hosts predicate to your retry_policy, and set host_selection_retry_max_attempts to at least twice your num_retries value. This tells Envoy to exclude previously attempted hosts when the load balancer selects a host for each retry attempt.

    Why does Envoy stop retrying even though num_retries is set to 5?

    The cluster-level circuit breaker has a max_retries threshold (default: 3) that caps the total number of active concurrent retry attempts across all connections to the cluster. When this threshold is exceeded, Envoy stops issuing retries and increments upstream_rq_retry_overflow. Increase max_retries in the cluster circuit_breakers configuration to match your actual traffic volume.

    How should I set per_try_timeout relative to num_retries and the route timeout?

    Your per_try_timeout should exceed the p95 upstream response latency. Your overall route timeout should be at least per_try_timeout multiplied by (num_retries + 1), plus a buffer. For example, with per_try_timeout of 2s and num_retries of 3, set the route timeout to at least 10–12 seconds. A route timeout that fires before all retries complete means clients see 504 errors even when retries would have succeeded.
