Symptoms
You've wired up Envoy's retry policy — num_retries, retry_on, maybe a per_try_timeout — and your upstream keeps returning errors you expect to be retried. But clients are getting those errors immediately. No retries. The upstream_rq_retry counter on the admin API is flat. You tail the access log and watch 503s sailing straight through to the downstream client without a single retry attempt.
Or maybe retries are happening, but they're not helping. The same upstream host keeps failing. Clients still see elevated error rates despite a retry config that should be absorbing the failures. The counters are moving, but success rates aren't improving.
This is one of the more frustrating classes of Envoy problems because the failure is silent. Envoy doesn't log "I decided not to retry this" — it just doesn't. You have to go hunting through stats, access logs, and config dumps to piece together what's actually happening. This guide covers every root cause I've run into in production, with the specific commands that surface each one.
Root Cause 1: Retry Condition Wrong
The retry_on field is the gatekeeper for all retry logic in Envoy. If the condition you've specified doesn't match what's actually happening at the transport or protocol level, Envoy won't retry. Full stop. I've seen this trip up engineers who copy a config from a gRPC service and drop it on an HTTP/1.1 route without adjusting the retry conditions, or who assume 5xx covers connection-level failures when it doesn't.
The most common mismatch is using reset when you're actually getting a clean HTTP response with a 5xx body, or using 5xx when the upstream is terminating the connection before sending any HTTP response at all. These are different failure modes, and Envoy distinguishes them precisely.
To identify this, start by checking the response flags in your Envoy access logs. The response flags field tells you exactly how the request failed:
kubectl exec -n solvethenetwork deploy/envoy-proxy -- \
cat /var/log/envoy/access.log | tail -100 | awk '{print $9, $10, $NF}'
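Response flags are easier to interpret in aggregate than line by line. A small helper for tallying them, assuming the response-flags field is the 9th whitespace-separated column as in the command above (adjust the index for your own log format):

```shell
# Tally response-flag frequencies from access log lines fed on stdin.
# $9 is an assumption about your log format; change it if the flags
# field sits elsewhere.
count_flags() {
  awk '{print $9}' | sort | uniq -c | sort -rn
}
```

Run it as `tail -1000 /var/log/envoy/access.log | count_flags` and the dominant flag (UF, UC, UH, URX) tells you which retry_on condition you actually need.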
Match the response flags you see to the correct retry_on value. The flags that matter most are UF (upstream connection failure before headers — use connect-failure), UC (upstream connection terminated mid-stream — use reset), UH (no healthy upstream hosts — pair retriable-status-codes with 503), and URX (retry limit exceeded — retries are actually working but being exhausted). If you're seeing URX, your retry conditions are correct but you're hitting a budget problem, which is covered next.
Once you know your actual failure mode, configure retry_on to match it explicitly:
retry_policy:
retry_on: "connect-failure,reset,5xx"
num_retries: 3
per_try_timeout: 2s
Don't blindly stuff every condition into retry_on, but do make sure the conditions you list actually correspond to the failures you're observing. The Envoy documentation lists every valid value and they're not all intuitive — gateway-error covers 502, 503, and 504 specifically, while 5xx covers any 5xx response code. One is a subset of the other, and which you want depends on your upstream error distribution.
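If you want retries only for errors that suggest the upstream never produced a real response — while letting application-generated 500s pass through to the client — gateway-error is the narrower choice. A sketch (cluster and timeout values are illustrative):

```yaml
# Retry connection failures, resets, and gateway-style errors (502/503/504),
# but not application 500s.
retry_policy:
  retry_on: "connect-failure,reset,gateway-error"
  num_retries: 3
  per_try_timeout: 2s
```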
Root Cause 2: Max Retries Too Low
num_retries defaults to 1 in Envoy. That's one retry attempt in addition to the original request. Under any real load with a degraded upstream, a single retry gets exhausted almost immediately and stops being meaningful. But there's a subtler and more dangerous version of this problem that operates at the cluster level.
Envoy enforces a retry budget through its circuit breaker settings. The default max_retries threshold in the cluster circuit breaker is 3. This is not the per-request retry count — it's the maximum number of active concurrent retry attempts across all connections to the entire cluster. When that threshold is hit, Envoy stops issuing new retries and increments the upstream_rq_retry_overflow counter instead. Your route-level num_retries: 5 setting becomes irrelevant the moment the circuit breaker trips.
Check whether this is happening right now:
curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_retry"
You should see output like this:
cluster.api_backend.upstream_rq_retry: 8241
cluster.api_backend.upstream_rq_retry_success: 7903
cluster.api_backend.upstream_rq_retry_overflow: 4712
cluster.api_backend.upstream_rq_retry_limit_exceeded: 338
If upstream_rq_retry_overflow is a non-trivial number and climbing, the cluster circuit breaker is throttling your retries. The fix is to raise the cluster-level max_retries threshold to something realistic for your traffic volume:
circuit_breakers:
thresholds:
- priority: DEFAULT
max_retries: 100
max_pending_requests: 1024
max_requests: 2048
max_connections: 512
The right value for max_retries depends on your concurrency. A rough starting point: take your expected peak concurrent requests, multiply by your num_retries value, and use that as the ceiling. Under high concurrency, the default of 3 is useless. That said, don't just set it to an arbitrarily large number — retries under load can amplify pressure on a struggling upstream and trigger a retry storm. Be deliberate about the ceiling you choose.
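The arithmetic is trivial but worth writing down next to the config. A sketch with hypothetical numbers; substitute your own traffic profile:

```shell
# Back-of-envelope ceiling for the cluster max_retries circuit breaker.
# Both inputs are assumptions about your traffic, not universal values.
peak_concurrent_requests=500   # expected peak concurrent requests to the cluster
num_retries=3                  # per-route retry count
echo $(( peak_concurrent_requests * num_retries ))   # prints 1500
```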
Root Cause 3: Per-Try Timeout Too Low
The per_try_timeout is the maximum time Envoy will wait for each individual attempt, including every retry. If your upstream is slow but eventually succeeds, a per-try timeout that's too short kills every attempt before it can succeed, burning through your retry budget while accomplishing nothing.
Here's the exact failure scenario. You configure a 30-second route timeout, a 500ms per-try timeout, and three retries. Your upstream is under load and responding in 650ms. Every attempt — original and all three retries — times out at 500ms. Envoy dutifully retries, exhausts the budget, and delivers a 504 to the client. The upstream would have responded successfully 150ms later, but you never let it breathe.
Identify this by correlating retry and timeout counters:
curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_(retry|timeout)"
# Problematic output — retry and timeout counts tracking together:
cluster.api_backend.upstream_rq_retry: 5103
cluster.api_backend.upstream_rq_retry_success: 41
cluster.api_backend.upstream_rq_timeout: 4977
When retry counts and timeout counts are nearly equal and retry success is near zero, per-try timeout is almost certainly the culprit. Follow up by checking your actual upstream response time distribution:
curl -s http://sw-infrarunbook-01:9901/stats | grep "upstream_rq_time"
Compare the p95 and p99 latencies against your configured per_try_timeout. Your per-try timeout must comfortably exceed the p95 upstream response time under normal conditions. If your p99 is 800ms, set per_try_timeout to at least 1.5 seconds. Then make sure your overall route timeout accounts for the worst case:
retry_policy:
retry_on: "5xx,reset,connect-failure"
num_retries: 3
per_try_timeout: 2s
# Route-level timeout must be >= per_try_timeout * (num_retries + 1) + buffer
# In this case: 2s * 4 = 8s minimum, set to 12s for headroom
route:
timeout: 12s
A route timeout that fires before your retry budget is exhausted is a misconfiguration. The client gets a 504, your retries were wasted, and the upstream never got a fair chance. Calculate the worst-case duration and set the route timeout accordingly.
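The worst-case calculation is simple enough to script into a config review check. A sketch using the values from the config above:

```shell
# Minimum route timeout: every attempt (the original plus each retry)
# can run up to per_try_timeout before Envoy gives up on it.
per_try_timeout_s=2
num_retries=3
min_route_timeout=$(( per_try_timeout_s * (num_retries + 1) ))
echo "${min_route_timeout}s minimum"   # prints "8s minimum"
```

Add headroom on top of that minimum for retry back-off delays before settling on the route timeout.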
Root Cause 4: Retry on Wrong Status Code
This is distinct from getting the retry condition wrong at the transport level. When you use retriable-status-codes in retry_on, Envoy also needs a populated retriable_status_codes list specifying which HTTP status codes should trigger a retry. Without the list, retriable-status-codes is a no-op — Envoy accepts it, applies it, and never retries anything because no codes are defined.
In my experience, this happens most often when copying config snippets. The retry_on value gets copied, but the corresponding retriable_status_codes block is missing. The config passes validation silently:
# Broken — retriable-status-codes with no codes specified
retry_policy:
retry_on: "retriable-status-codes"
num_retries: 3
Envoy doesn't warn you. It simply never matches any status code because the list is empty. The corrected config:
retry_policy:
retry_on: "retriable-status-codes"
num_retries: 3
retriable_status_codes:
- 503
- 500
- 429
The related mistake is assuming 5xx in retry_on covers 429. It doesn't. The 5xx condition matches 5xx status codes only. If your upstream returns 429 under rate limiting and you want Envoy to retry those requests, you need retriable-status-codes with 429 explicitly listed.
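The two conditions compose, so you don't have to choose between them. A sketch that retries all 5xx responses and additionally retries 429s:

```yaml
# 5xx handles server errors; retriable-status-codes adds 429 on top.
retry_policy:
  retry_on: "5xx,retriable-status-codes"
  retriable_status_codes:
  - 429
  num_retries: 3
```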
Always confirm what status codes your upstream is actually returning before writing the config. Pull the distribution from the cluster stats:
curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_[0-9]{3}:"
# Sample output:
# cluster.api_backend.upstream_rq_200: 189043
# cluster.api_backend.upstream_rq_429: 5821
# cluster.api_backend.upstream_rq_503: 344
# cluster.api_backend.upstream_rq_504: 87
Once you see 429s in that output, you know you need retriable-status-codes with 429 in the list. Don't configure retry policy based on what you think the upstream returns — verify it first.
Root Cause 5: Host Selection Retry Plugin Missing
This is the most insidious of all retry problems because retries are technically happening — they're just not helping. Without a host selection retry plugin configured, Envoy's load balancer may select the exact same failed upstream host for each retry attempt. If a host is throwing 503s because it's unhealthy, retrying to that same host three times accomplishes nothing except wasting time and burning your route timeout.
The previous_hosts retry plugin instructs Envoy to exclude hosts that were already attempted during the current request's retry chain. It sounds like it should be default behavior. It isn't. You have to explicitly configure it, and most teams don't know it exists until they've been burned by it.
To identify whether retries are landing on the same host, you need access log entries that include the %UPSTREAM_HOST% field. If you have that enabled, look for patterns like this:
grep "upstream_rq_retry" /var/log/envoy/access.log | \
awk '{print $11}' | sort | uniq -c | sort -rn
# Replace $11 with the field index of %UPSTREAM_HOST% in your log format
# A broken retry distribution looks like:
# 6203 10.10.1.45:8080
# 17 10.10.1.46:8080
# 11 10.10.1.47:8080
One host absorbing nearly all retry traffic while the others get almost none is a strong signal that the load balancer is repeatedly selecting the same host. You can also detect this indirectly: if your upstream_rq_retry counter is climbing but upstream_rq_retry_success is near zero, and one specific upstream host has a persistently high error rate in your monitoring, retries are probably piling onto that broken instance.
The fix is to add the previous_hosts predicate to your retry policy and set host_selection_retry_max_attempts high enough that the load balancer actually gets to pick a different host:
retry_policy:
retry_on: "5xx,reset,connect-failure"
num_retries: 3
per_try_timeout: 2s
host_selection_retry_max_attempts: 10
retry_host_predicate:
- name: envoy.retry_host_predicates.previous_hosts
Set host_selection_retry_max_attempts to at least twice your num_retries value. This controls how many times Envoy will spin the load balancer wheel trying to find a host it hasn't tried yet. If you have three upstream hosts and host_selection_retry_max_attempts is 1, you're giving Envoy only one shot at avoiding the previous host — which may not be enough if the load balancer naturally tends toward that host. Give it room to maneuver.
Envoy also ships an omit_canary_hosts predicate that prevents retries from landing on canary-tagged endpoints, and an omit_host_metadata predicate for metadata-based host exclusion. In most production setups, previous_hosts is the one you need. Make it part of your standard retry policy template.
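The predicates can be stacked if you do run canaries. A sketch combining previous_hosts with omit_canary_hosts, so retries avoid both already-tried hosts and canary-tagged endpoints:

```yaml
retry_policy:
  retry_on: "5xx,reset,connect-failure"
  num_retries: 3
  host_selection_retry_max_attempts: 10
  retry_host_predicate:
  - name: envoy.retry_host_predicates.previous_hosts
  - name: envoy.retry_host_predicates.omit_canary_hosts
```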
Root Cause 6: x-envoy-retry-on Header Override
Envoy respects the x-envoy-retry-on request header from downstream clients by default. If a downstream service, sidecar, or API gateway is sending this header — set to an empty value, off, or a more restrictive condition than your route config — it overrides your retry policy entirely.
This shows up in service mesh deployments where multiple Envoy proxies are in the call path, or when an upstream load balancer is adding Envoy-specific headers before the request reaches your proxy. Check whether incoming requests carry this header by enabling debug-level header logging or using a test request:
curl -v http://sw-infrarunbook-01:8080/api/health 2>&1 | grep -i "x-envoy"
# If you see something like:
# > x-envoy-retry-on: off
# ...that's your problem
You can also temporarily enable debug logging on the Envoy admin API and trigger a request to see exactly what headers arrive at the route level:
curl -s -X POST "http://sw-infrarunbook-01:9901/logging?filter=debug"
The cleanest fix is to strip the override header at the ingress point using Envoy's request_headers_to_remove directive on the route or virtual host:
request_headers_to_remove:
- x-envoy-retry-on
- x-envoy-max-retries
- x-envoy-retry-grpc-on
This forces all retry decisions through your static route configuration, regardless of what the downstream client sends. If you legitimately need to allow per-request retry tuning from trusted internal services, scope the header trust to specific virtual hosts rather than allowing it globally.
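Scoping that trust can be done per virtual host. A trimmed, illustrative sketch (names and domains are made up, and routes are elided): the public host strips the headers, while an internal host reachable only by trusted services leaves them honored:

```yaml
virtual_hosts:
- name: public
  domains: ["api.example.com"]
  request_headers_to_remove:
  - x-envoy-retry-on
  - x-envoy-max-retries
  - x-envoy-retry-grpc-on
  # routes elided
- name: internal
  domains: ["api.internal"]
  # no header stripping: trusted callers may tune retries per request
  # routes elided
```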
Prevention
Getting Envoy retry policies right the first time requires understanding several interacting configuration layers. Keeping them working requires ongoing observability. A few practices that prevent these problems from recurring in production.
Always instrument the core retry metrics as part of your cluster dashboards. The four counters you care about are upstream_rq_retry, upstream_rq_retry_success, upstream_rq_retry_overflow, and upstream_rq_retry_limit_exceeded. Alert on retry_overflow trending upward — that's the circuit breaker throttling your retries. Alert on retry_success being low relative to upstream_rq_retry — that often signals the host selection problem or a per-try timeout issue.
curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_retry"
# Healthy baseline:
# cluster.api_backend.upstream_rq_retry: 203
# cluster.api_backend.upstream_rq_retry_success: 191
# cluster.api_backend.upstream_rq_retry_overflow: 0
# cluster.api_backend.upstream_rq_retry_limit_exceeded: 12
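The success-ratio check can be mechanized. A sketch that computes upstream_rq_retry_success / upstream_rq_retry from /stats lines fed on stdin (the ~0.9 threshold in the comment is a rule of thumb, not an Envoy constant):

```shell
# Compute the retry success ratio from admin /stats output on stdin.
# A ratio well below ~0.9 warrants a look at host selection predicates
# and the per-try timeout.
retry_success_ratio() {
  awk -F': ' '
    /upstream_rq_retry:/         { r = $2 }
    /upstream_rq_retry_success:/ { s = $2 }
    END { if (r > 0) printf "%.2f\n", s / r }'
}
```

Usage: `curl -s http://sw-infrarunbook-01:9901/stats | grep upstream_rq_retry | retry_success_ratio`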
Use the admin API's /config_dump endpoint to verify your retry policy is actually applied at the route level you intend. It's easy to apply a retry policy to a virtual host when it should be on a specific route, or to misconfigure the route match so the retry policy never triggers. Verify the rendered config, not just the YAML you wrote:
curl -s http://sw-infrarunbook-01:9901/config_dump?resource=dynamic_route_config | \
python3 -m json.tool | grep -A 20 '"retry_policy"'
Load-test your retry configuration before it matters. A retry policy that looks correct under light traffic fails silently under load when the circuit breaker retry budget gets exhausted. Run controlled fault injection — return artificial 503s from your upstream using Envoy's fault injection filter, then load test at realistic concurrency and watch whether retries are actually helping. This is the only way to catch the max_retries circuit breaker problem before it hits production.
Establish a standard retry policy template for your organization and require deviation from it to be explicit. Something like this covers the common cases and avoids most of the pitfalls covered in this guide:
retry_policy:
retry_on: "connect-failure,reset,5xx"
num_retries: 3
per_try_timeout: 3s
host_selection_retry_max_attempts: 10
retry_host_predicate:
- name: envoy.retry_host_predicates.previous_hosts
circuit_breakers:
thresholds:
- priority: DEFAULT
max_retries: 100
This template includes the host selection predicate so retries land on different hosts, sets a sensible per-try timeout, and raises the circuit breaker threshold above the default. Review your upstream p99 latency before accepting the 3-second per-try timeout — that's a starting point, not a universal answer. Pair it with a route timeout of at least 15 seconds to give the full retry chain room to complete.
Finally, strip x-envoy-retry-on and x-envoy-max-retries headers at ingress by default unless you have a specific, audited reason to honor them from downstream clients. Allowing downstream clients to override retry policy is a footgun that's easy to forget exists until a misconfigured upstream starts sending x-envoy-retry-on: off and your SLO starts degrading.
