Symptoms
You've wired up Envoy's retry policy — num_retries, retry_on, maybe a per_try_timeout — and your upstream keeps returning errors you expect to be retried. But clients are getting those errors immediately. No retries. The upstream_rq_retry counter on the admin API is flat. You tail the access log and watch 503s sailing straight through to the downstream client without a single retry attempt.
Or maybe retries are happening, but they're not helping. The same upstream host keeps failing. Clients still see elevated error rates despite a retry config that should be absorbing the failures. The counters are moving, but success rates aren't improving.
This is one of the more frustrating classes of Envoy problems because the failure is silent. Envoy doesn't log "I decided not to retry this" — it just doesn't. You have to go hunting through stats, access logs, and config dumps to piece together what's actually happening. This guide covers every root cause I've run into in production, with the specific commands that surface each one.
Root Cause 1: Retry Condition Wrong
The retry_on field is the gatekeeper for all retry logic in Envoy. If the condition you've specified doesn't match what's actually happening at the transport or protocol level, Envoy won't retry. Full stop. I've seen this trip up engineers who copy a config from a gRPC service and drop it on an HTTP/1.1 route without adjusting the retry conditions, or who assume 5xx covers connection-level failures when it doesn't.
The most common mismatch is using reset when you're actually getting a clean HTTP response with a 5xx body, or using 5xx when the upstream is terminating the connection before sending any HTTP response at all. These are different failure modes, and Envoy distinguishes them precisely.
To identify this, start by checking the response flags in your Envoy access logs. The response flags field tells you exactly how the request failed:
kubectl exec -n solvethenetwork deploy/envoy-proxy -- \
cat /var/log/envoy/access.log | tail -100 | awk '{print $9, $10, $NF}'
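Response flags are easier to interpret in aggregate than line by line. A small helper for tallying them, assuming the response-flags field is the 9th whitespace-separated column as in the command above (adjust the index for your own log format):

```shell
# Tally response-flag frequencies from access log lines fed on stdin.
# $9 is an assumption about your log format; change it if the flags
# field sits elsewhere.
count_flags() {
  awk '{print $9}' | sort | uniq -c | sort -rn
}
```

Run it as `tail -1000 /var/log/envoy/access.log | count_flags` and the dominant flag (UF, UC, UH, URX) tells you which retry_on condition you actually need.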
Match the response flags you see to the correct retry_on value. The flags that matter most are UF (upstream connection failure before headers — use connect-failure), UC (upstream connection terminated mid-stream — use reset), UH (no healthy upstream hosts — pair retriable-status-codes with 503), and URX (retry limit exceeded — retries are actually working but being exhausted). If you're seeing URX, your retry conditions are correct but you're hitting a budget problem, which is covered next.
Once you know your actual failure mode, configure retry_on to match it explicitly:
retry_policy:
retry_on: "connect-failure,reset,5xx"
num_retries: 3
per_try_timeout: 2s
Don't blindly stuff every condition into retry_on, but do make sure the conditions you list actually correspond to the failures you're observing. The Envoy documentation lists every valid value and they're not all intuitive — gateway-error covers 502, 503, and 504 specifically, while 5xx covers any 5xx response code. One is a subset of the other, and which you want depends on your upstream error distribution.
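If you want retries only for errors that suggest the upstream never produced a real response — while letting application-generated 500s pass through to the client — gateway-error is the narrower choice. A sketch (cluster and timeout values are illustrative):

```yaml
# Retry connection failures, resets, and gateway-style errors (502/503/504),
# but not application 500s.
retry_policy:
  retry_on: "connect-failure,reset,gateway-error"
  num_retries: 3
  per_try_timeout: 2s
```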
Root Cause 2: Max Retries Too Low
num_retries defaults to 1 in Envoy. That's one retry attempt in addition to the original request. Under any real load with a degraded upstream, a single retry gets exhausted almost immediately and stops being meaningful. But there's a subtler and more dangerous version of this problem that operates at the cluster level.
Envoy enforces a retry budget through its circuit breaker settings. The default max_retries threshold in the cluster circuit breaker is 3. This is not the per-request retry count — it's the maximum number of active concurrent retry attempts across all connections to the entire cluster. When that threshold is hit, Envoy stops issuing new retries and increments the upstream_rq_retry_overflow counter instead. Your route-level num_retries: 5 setting becomes irrelevant the moment the circuit breaker trips.
Check whether this is happening right now:
curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_retry"
You should see output like this:
cluster.api_backend.upstream_rq_retry: 8241
cluster.api_backend.upstream_rq_retry_success: 7903
cluster.api_backend.upstream_rq_retry_overflow: 4712
cluster.api_backend.upstream_rq_retry_limit_exceeded: 338
If upstream_rq_retry_overflow is a non-trivial number and climbing, the cluster circuit breaker is throttling your retries. The fix is to raise the cluster-level max_retries threshold to something realistic for your traffic volume:
circuit_breakers:
thresholds:
- priority: DEFAULT
max_retries: 100
max_pending_requests: 1024
max_requests: 2048
max_connections: 512
The right value for max_retries depends on your concurrency. A rough starting point: take your expected peak concurrent requests, multiply by your num_retries value, and use that as the ceiling. Under high concurrency, the default of 3 is useless. That said, don't just set it to an arbitrarily large number — retries under load can amplify pressure on a struggling upstream and trigger a retry storm. Be deliberate about the ceiling you choose.
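The arithmetic is trivial but worth writing down next to the config. A sketch with hypothetical numbers; substitute your own traffic profile:

```shell
# Back-of-envelope ceiling for the cluster max_retries circuit breaker.
# Both inputs are assumptions about your traffic, not universal values.
peak_concurrent_requests=500   # expected peak concurrent requests to the cluster
num_retries=3                  # per-route retry count
echo $(( peak_concurrent_requests * num_retries ))   # prints 1500
```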
Root Cause 3: Per-Try Timeout Too Low
The per_try_timeout is the maximum time Envoy will wait for each individual attempt, including every retry. If your upstream is slow but eventually succeeds, a per-try timeout that's too short kills every attempt before it can succeed, burning through your retry budget while accomplishing nothing.
Here's the exact failure scenario. You configure a 30-second route timeout, a 500ms per-try timeout, and three retries. Your upstream is under load and responding in 650ms. Every attempt — original and all three retries — times out at 500ms. Envoy dutifully retries, exhausts the budget, and delivers a 504 to the client. The upstream would have responded successfully 150ms later, but you never let it breathe.
Identify this by correlating retry and timeout counters:
curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_(retry|timeout)"
# Problematic output — retry and timeout counts tracking together:
cluster.api_backend.upstream_rq_retry: 5103
cluster.api_backend.upstream_rq_retry_success: 41
cluster.api_backend.upstream_rq_timeout: 4977
When retry counts and timeout counts are nearly equal and retry success is near zero, per-try timeout is almost certainly the culprit. Follow up by checking your actual upstream response time distribution:
curl -s http://sw-infrarunbook-01:9901/stats | grep "upstream_rq_time"
Compare the p95 and p99 latencies against your configured per_try_timeout. Your per-try timeout must comfortably exceed the p95 upstream response time under normal conditions. If your p99 is 800ms, set per_try_timeout to at least 1.5 seconds. Then make sure your overall route timeout accounts for the worst case:
retry_policy:
retry_on: "5xx,reset,connect-failure"
num_retries: 3
per_try_timeout: 2s
# Route-level timeout must be >= per_try_timeout * (num_retries + 1) + buffer
# In this case: 2s * 4 = 8s minimum, set to 12s for headroom
route:
timeout: 12s
A route timeout that fires before your retry budget is exhausted is a misconfiguration. The client gets a 504, your retries were wasted, and the upstream never got a fair chance. Calculate the worst-case duration and set the route timeout accordingly.
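The worst-case calculation is simple enough to script into a config review check. A sketch using the values from the config above:

```shell
# Minimum route timeout: every attempt (the original plus each retry)
# can run up to per_try_timeout before Envoy gives up on it.
per_try_timeout_s=2
num_retries=3
min_route_timeout=$(( per_try_timeout_s * (num_retries + 1) ))
echo "${min_route_timeout}s minimum"   # prints "8s minimum"
```

Add headroom on top of that minimum for retry back-off delays before settling on the route timeout.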
Root Cause 4: Retry on Wrong Status Code
This is distinct from getting the retry condition wrong at the transport level. When you use retriable-status-codes in retry_on, Envoy also needs a populated retriable_status_codes list specifying which HTTP status codes should trigger a retry. Without the list, retriable-status-codes is a no-op — Envoy accepts it, applies it, and never retries anything because no codes are defined.
In my experience, this happens most often when copying config snippets. The retry_on value gets copied, but the corresponding retriable_status_codes block is missing. The config passes validation silently:
# Broken — retriable-status-codes with no codes specified
retry_policy:
retry_on: "retriable-status-codes"
num_retries: 3
Envoy doesn't warn you. It simply never matches any status code because the list is empty. The corrected config:
retry_policy:
retry_on: "retriable-status-codes"
num_retries: 3
retriable_status_codes:
- 503
- 500
- 429
The related mistake is assuming 5xx in retry_on covers 429. It doesn't. The 5xx condition matches 5xx status codes only. If your upstream returns 429 under rate limiting and you want Envoy to retry those requests, you need retriable-status-codes with 429 explicitly listed.
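The two conditions compose, so you don't have to choose between them. A sketch that retries all 5xx responses and additionally retries 429s:

```yaml
# 5xx handles server errors; retriable-status-codes adds 429 on top.
retry_policy:
  retry_on: "5xx,retriable-status-codes"
  retriable_status_codes:
  - 429
  num_retries: 3
```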
Always confirm what status codes your upstream is actually returning before writing the config. Pull the distribution from the cluster stats:
curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_[0-9]{3}:"
# Sample output:
# cluster.api_backend.upstream_rq_200: 189043
# cluster.api_backend.upstream_rq_429: 5821
# cluster.api_backend.upstream_rq_503: 344
# cluster.api_backend.upstream_rq_504: 87
Once you see 429s in that output, you know you need retriable-status-codes with 429 in the list. Don't configure retry policy based on what you think the upstream returns — verify it first.
Root Cause 5: Host Selection Retry Plugin Missing
This is the most insidious of all retry problems because retries are technically happening — they're just not helping. Without a host selection retry plugin configured, Envoy's load balancer may select the exact same failed upstream host for each retry attempt. If a host is throwing 503s because it's unhealthy, retrying to that same host three times accomplishes nothing except wasting time and burning your route timeout.
The previous_hosts retry plugin instructs Envoy to exclude hosts that were already attempted during the current request's retry chain. It sounds like it should be default behavior. It isn't. You have to explicitly configure it, and most teams don't know it exists until they've been burned by it.
To identify whether retries are landing on the same host, you need access log entries that include the %UPSTREAM_HOST% field. If you have that enabled, look for patterns like this:
grep "upstream_rq_retry" /var/log/envoy/access.log | \
awk '{print $11}' | sort | uniq -c | sort -rn
# Replace $11 with the field index of %UPSTREAM_HOST% in your log format
# A broken retry distribution looks like:
# 6203 10.10.1.45:8080
# 17 10.10.1.46:8080
# 11 10.10.1.47:8080
One host absorbing nearly all retry traffic while the others get almost none is a strong signal that the load balancer is repeatedly selecting the same host. You can also detect this indirectly: if your upstream_rq_retry counter is climbing but upstream_rq_retry_success is near zero, and one specific upstream host has a persistently high error rate in your monitoring, retries are probably piling onto that broken instance.
The fix is to add the previous_hosts predicate to your retry policy and set host_selection_retry_max_attempts high enough that the load balancer actually gets to pick a different host:
retry_policy:
retry_on: "5xx,reset,connect-failure"
num_retries: 3
per_try_timeout: 2s
host_selection_retry_max_attempts: 10
retry_host_predicate:
- name: envoy.retry_host_predicates.previous_hosts
Set host_selection_retry_max_attempts to at least twice your num_retries value. This controls how many times Envoy will spin the load balancer wheel trying to find a host it hasn't tried yet. If you have three upstream hosts and host_selection_retry_max_attempts is 1, you're giving Envoy only one shot at avoiding the previous host — which may not be enough if the load balancer naturally tends toward that host. Give it room to maneuver.
Envoy also ships an omit_canary_hosts predicate that prevents retries from landing on canary-tagged endpoints, and an omit_host_metadata predicate for metadata-based host exclusion. In most production setups, previous_hosts is the one you need. Make it part of your standard retry policy template.
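The predicates can be stacked if you do run canaries. A sketch combining previous_hosts with omit_canary_hosts, so retries avoid both already-tried hosts and canary-tagged endpoints:

```yaml
retry_policy:
  retry_on: "5xx,reset,connect-failure"
  num_retries: 3
  host_selection_retry_max_attempts: 10
  retry_host_predicate:
  - name: envoy.retry_host_predicates.previous_hosts
  - name: envoy.retry_host_predicates.omit_canary_hosts
```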
Root Cause 6: x-envoy-retry-on Header Override
Envoy respects the x-envoy-retry-on request header from downstream clients by default. If a downstream service, sidecar, or API gateway is sending this header — set to an empty value, off, or a more restrictive condition than your route config — it overrides your retry policy entirely.
This shows up in service mesh deployments where multiple Envoy proxies are in the call path, or when an upstream load balancer is adding Envoy-specific headers before the request reaches your proxy. Check whether incoming requests carry this header by enabling debug-level header logging or using a test request:
curl -v http://sw-infrarunbook-01:8080/api/health 2>&1 | grep -i "x-envoy"
# If you see something like:
# > x-envoy-retry-on: off
# ...that's your problem
You can also temporarily enable debug logging on the Envoy admin API and trigger a request to see exactly what headers arrive at the route level:
curl -s -X POST "http://sw-infrarunbook-01:9901/logging?filter=debug"
The cleanest fix is to strip the override header at the ingress point using Envoy's request_headers_to_remove directive on the route or virtual host:
request_headers_to_remove:
- x-envoy-retry-on
- x-envoy-max-retries
- x-envoy-retry-grpc-on
This forces all retry decisions through your static route configuration, regardless of what the downstream client sends. If you legitimately need to allow per-request retry tuning from trusted internal services, scope the header trust to specific virtual hosts rather than allowing it globally.
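Scoping that trust can be done per virtual host. A trimmed, illustrative sketch (names and domains are made up, and routes are elided): the public host strips the headers, while an internal host reachable only by trusted services leaves them honored:

```yaml
virtual_hosts:
- name: public
  domains: ["api.example.com"]
  request_headers_to_remove:
  - x-envoy-retry-on
  - x-envoy-max-retries
  - x-envoy-retry-grpc-on
  # routes elided
- name: internal
  domains: ["api.internal"]
  # no header stripping: trusted callers may tune retries per request
  # routes elided
```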
Prevention
Getting Envoy retry policies right the first time requires understanding several interacting configuration layers. Keeping them working requires ongoing observability. A few practices that prevent these problems from recurring in production.
Always instrument the core retry metrics as part of your cluster dashboards. The four counters you care about are upstream_rq_retry, upstream_rq_retry_success, upstream_rq_retry_overflow, and upstream_rq_retry_limit_exceeded. Alert on retry_overflow trending upward — that's the circuit breaker throttling your retries. Alert on retry_success being low relative to upstream_rq_retry — that often signals the host selection problem or a per-try timeout issue.
curl -s http://sw-infrarunbook-01:9901/stats | grep -E "upstream_rq_retry"
# Healthy baseline:
# cluster.api_backend.upstream_rq_retry: 203
# cluster.api_backend.upstream_rq_retry_success: 191
# cluster.api_backend.upstream_rq_retry_overflow: 0
# cluster.api_backend.upstream_rq_retry_limit_exceeded: 12
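The success-ratio check can be mechanized. A sketch that computes upstream_rq_retry_success / upstream_rq_retry from /stats lines fed on stdin (the ~0.9 threshold in the comment is a rule of thumb, not an Envoy constant):

```shell
# Compute the retry success ratio from admin /stats output on stdin.
# A ratio well below ~0.9 warrants a look at host selection predicates
# and the per-try timeout.
retry_success_ratio() {
  awk -F': ' '
    /upstream_rq_retry:/         { r = $2 }
    /upstream_rq_retry_success:/ { s = $2 }
    END { if (r > 0) printf "%.2f\n", s / r }'
}
```

Usage: `curl -s http://sw-infrarunbook-01:9901/stats | grep upstream_rq_retry | retry_success_ratio`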
Use the admin API's /config_dump endpoint to verify your retry policy is actually applied at the route level you intend. It's easy to apply a retry policy to a virtual host when it should be on a specific route, or to misconfigure the route match so the retry policy never triggers. Verify the rendered config, not just the YAML you wrote:
curl -s http://sw-infrarunbook-01:9901/config_dump?resource=dynamic_route_config | \
python3 -m json.tool | grep -A 20 '"retry_policy"'
Load-test your retry configuration before it matters. A retry policy that looks correct under light traffic fails silently under load when the circuit breaker retry budget gets exhausted. Run controlled fault injection — return artificial 503s from your upstream using Envoy's fault injection filter, then load test at realistic concurrency and watch whether retries are actually helping. This is the only way to catch the max_retries circuit breaker problem before it hits production.
Establish a standard retry policy template for your organization and require deviation from it to be explicit. Something like this covers the common cases and avoids most of the pitfalls covered in this guide:
retry_policy:
retry_on: "connect-failure,reset,5xx"
num_retries: 3
per_try_timeout: 3s
host_selection_retry_max_attempts: 10
retry_host_predicate:
- name: envoy.retry_host_predicates.previous_hosts
circuit_breakers:
thresholds:
- priority: DEFAULT
max_retries: 100
This template includes the host selection predicate so retries land on different hosts, sets a sensible per-try timeout, and raises the circuit breaker threshold above the default. Review your upstream p99 latency before accepting the 3-second per-try timeout — that's a starting point, not a universal answer. Pair it with a route timeout of at least 15 seconds to give the full retry chain room to complete.
Finally, strip x-envoy-retry-on and x-envoy-max-retries headers at ingress by default unless you have a specific, audited reason to honor them from downstream clients. Allowing downstream clients to override retry policy is a footgun that's easy to forget exists until a misconfigured upstream starts sending x-envoy-retry-on: off and your SLO starts degrading.
