Symptoms
When Envoy's circuit breaker trips, the most visible symptom is a stream of HTTP 503 responses hitting your clients almost instantly — no connection delay, no timeout wait, just immediate rejection. Unlike a normal upstream failure where you'd see latency climb first, circuit breaker-triggered 503s arrive fast because Envoy is intentionally refusing the request before it ever touches the upstream. The response will carry the header x-envoy-overloaded: true, which is your clearest indicator that this is a circuit breaker event and not an upstream crash.
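A quick way to confirm from the client side is to inspect the response headers directly. The listener address and path below are illustrative; substitute any route that passes through the affected cluster:
curl -sS -o /dev/null -D - http://192.168.1.10:8080/api/v2/process
HTTP/1.1 503 Service Unavailable
x-envoy-overloaded: true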
On the Envoy admin interface (default port 9901), you'll see specific stat counters ticking upward. The ones that matter are upstream_cx_overflow, upstream_rq_pending_overflow, and upstream_rq_retry_overflow. In my experience, these counters sit at zero for long stretches and then spike suddenly — often correlating with a traffic burst, a slow deploy, or an upstream that started responding slower than usual.
Other symptoms to watch for:
- Downstream services reporting connection refused or immediate 503 errors with no matching upstream log entries
- Envoy access logs showing the UO response flag, meaning upstream overflow
- Observability dashboards showing a sudden vertical line on overflow counters
- The circuit breaker gauge circuit_breakers.default.cx_open returning a value of 1, meaning the breaker is actively open
The tricky part is that a correctly functioning circuit breaker looks identical from the outside to a misconfigured one. The stat counters alone won't tell you which situation you're in — you need to cross-reference them with upstream health, latency, and your threshold configuration to figure out whether Envoy is protecting a degraded upstream or unnecessarily rejecting perfectly healthy traffic.
Root Cause 1: Max Connections Exceeded
Envoy maintains a per-cluster limit on active TCP connections to upstream hosts. This is controlled by the max_connections field in the circuit breaker thresholds block. The default is 1024, which sounds generous until you realize that in high-throughput environments a single service mesh proxy can easily sustain thousands of concurrent connections across multiple upstreams.
This limit applies to the total number of active connections across all hosts in the cluster, not per host. So if you have three upstream endpoints each handling 400 active connections, you're already at 1200 — above the default 1024 limit — and Envoy will start rejecting new connection attempts while all three hosts are perfectly healthy.
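For orientation, the thresholds live directly on the cluster definition. A minimal sketch of where they sit (the cluster name, discovery type, and endpoint are illustrative):
clusters:
- name: api_service
  type: STRICT_DNS
  connect_timeout: 1s
  lb_policy: ROUND_ROBIN
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 1024
  load_assignment:
    cluster_name: api_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 10.0.1.50
              port_value: 8080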
How to Identify It
Pull the overflow stats from the admin endpoint and filter for connection-related counters:
curl -s http://192.168.1.10:9901/stats | grep -E "upstream_cx_overflow|upstream_cx_active|cx_open"
You're looking for output like this:
cluster.api_service.circuit_breakers.default.cx_open: 1
cluster.api_service.upstream_cx_active: 1024
cluster.api_service.upstream_cx_overflow: 847
The cx_open: 1 confirms the connection circuit breaker is open. The upstream_cx_overflow counter tells you how many connection attempts have been rejected since Envoy started. Compare upstream_cx_active against your configured threshold — if they match, that's your bottleneck. Verify the configured limit with a config dump:
curl -s http://192.168.1.10:9901/config_dump | python3 -m json.tool | grep -A 20 "circuit_breakers"
"circuit_breakers": {
"thresholds": [
{
"priority": "DEFAULT",
"max_connections": 1024,
"max_pending_requests": 1024,
"max_requests": 1024,
"max_retries": 3
}
]
}
How to Fix It
Increase the max_connections threshold in your cluster configuration. The right value depends on your actual traffic profile. I usually recommend pulling 95th-percentile active connection counts over a representative traffic window and setting the threshold to 2–3x that value as headroom.
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 4096
    max_pending_requests: 2048
    max_requests: 4096
    max_retries: 50
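If Envoy stats already land in Prometheus, the 95th-percentile baseline mentioned above can be pulled with a windowed query instead of eyeballing dashboards. A sketch, assuming the standard metric and label names from Envoy's Prometheus export and a seven-day window:
quantile_over_time(0.95, envoy_cluster_upstream_cx_active{envoy_cluster_name="api_service"}[7d])
Multiply the result by two to three and round up to get a threshold with reasonable headroom.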
After applying the config update, watch the overflow counter in real time to confirm it stops climbing:
watch -n 2 "curl -s http://192.168.1.10:9901/stats | grep upstream_cx_overflow"
If the counter keeps rising even after raising the threshold, don't just keep increasing it blindly. That's masking a deeper problem. Investigate whether the upstream is accumulating slow or stuck connections that should be timing out instead of staying active.
Root Cause 2: Max Pending Requests Hit
When all connections in the pool are in use, new requests don't get dropped immediately — they queue as pending requests. The max_pending_requests circuit breaker limit controls how deep that queue can grow. Once it hits the limit, Envoy starts rejecting incoming requests with 503. This is a subtly different failure mode than max connections: active connections might be well within range, but requests are stacking up behind them faster than they can be processed.
I've seen this happen most often when upstream latency spikes — not enough to trigger health checks, but enough to push response times from 20ms to 200ms. That 10x latency increase means the same number of connections can handle only a tenth of the previous throughput, and the pending queue fills up fast.
How to Identify It
curl -s http://192.168.1.10:9901/stats | grep -E "upstream_rq_pending_overflow|upstream_rq_pending_active|rq_pending_open"
cluster.api_service.circuit_breakers.default.rq_pending_open: 1
cluster.api_service.upstream_rq_pending_active: 1024
cluster.api_service.upstream_rq_pending_overflow: 3291
The rq_pending_open: 1 flag and a climbing upstream_rq_pending_overflow counter are the tell. Cross-reference with the upstream request time histogram to see whether latency is the driver:
curl -s http://192.168.1.10:9901/stats | grep "upstream_rq_time"
cluster.api_service.upstream_rq_time: P0(nan,1) P25(nan,8) P50(nan,45) P75(nan,182) P99(nan,847) P99.9(nan,1204)
A P99 of 847ms when your normal baseline is under 50ms tells you the upstream is the underlying cause, not the queue depth setting.
How to Fix It
In the short term you can raise max_pending_requests to absorb the queue spike. But this is treating symptoms. If upstream latency is elevated, increasing the pending limit without addressing latency just delays the overflow by a few seconds. The correct approach is both: raise the limit while simultaneously investigating upstream health.
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 4096
    max_pending_requests: 4096
    max_requests: 4096
    max_retries: 50
Pair this with timeouts on the cluster so stuck requests fail fast and drain from the queue rather than accumulating indefinitely. A tight connect timeout plus a cap on total stream duration (max_stream_duration limits how long any single request can hold a connection) does this:
connect_timeout: 0.5s
typed_extension_protocol_options:
  envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
    "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
    common_http_protocol_options:
      idle_timeout: 60s
      max_stream_duration: 5s
Root Cause 3: Max Retries Exceeded
This one catches people off guard. Envoy's circuit breaker has a dedicated limit for concurrent retries — the max_retries threshold controls how many retry attempts can be in flight simultaneously across the entire cluster. The default value is 3. That's extremely conservative for anything handling real production traffic.
A retry spike can happen entirely independently of connection or pending request counts. If your retry policy is configured to retry on 5xx responses and the upstream starts throwing intermittent errors, retries multiply your effective request load. When concurrent retries exceed max_retries, Envoy stops retrying and returns the 503 directly to the client. From the outside it looks like every other circuit breaker trip.
How to Identify It
curl -s http://192.168.1.10:9901/stats | grep -E "upstream_rq_retry|retry_overflow"
cluster.api_service.upstream_rq_retry: 4829
cluster.api_service.upstream_rq_retry_overflow: 1203
cluster.api_service.upstream_rq_retry_success: 2871
cluster.api_service.upstream_rq_retry_limit_exceeded: 1405
The upstream_rq_retry_overflow counter is the smoking gun. Also check your access logs for the URX response flag, which means upstream retry limit exceeded:
tail -f /var/log/envoy/access.log | grep " URX "
[2026-04-12T09:14:22.341Z] "POST /api/v2/process HTTP/1.1" 503 URX 0 91 12 - "-" "python-requests/2.28.0" "a3f8b1c2" "10.0.1.50:8080"
How to Fix It
First, evaluate whether your retry policy is actually appropriate. Retrying on 5xx when the upstream is consistently failing just amplifies load and makes the situation worse. Consider tightening retry conditions to specific retriable status codes — 503 and 504 are reasonable candidates; 500 is often not retriable because it represents an application-level error that won't change on retry.
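A route-level retry policy narrowed to those codes might look like the following sketch (the numeric values are illustrative, not a recommendation for this specific cluster):
retry_policy:
  retry_on: retriable-status-codes
  retriable_status_codes: [503, 504]
  num_retries: 2
  per_try_timeout: 1s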
If retries are appropriate and you need more headroom, raise max_retries:
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 4096
    max_pending_requests: 4096
    max_requests: 4096
    max_retries: 100
Better yet, replace the fixed retry count with a retry budget. A retry budget scales allowed retries proportionally to active request volume, which is almost always what you actually want in production:
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    retry_budget:
      budget_percent:
        value: 20.0
      min_retry_concurrency: 3
This limits retries to 20% of active requests with a minimum floor of 3. Far more sensible than a hard limit of 3 for a service handling several hundred RPS.
Root Cause 4: Threshold Too Low for Traffic
Sometimes nothing is wrong with the upstream. The circuit breaker trips purely because the configured limits don't match actual production traffic volumes. This is extremely common when a config was copied from a lower-traffic staging environment, or when traffic has grown organically since the config was last reviewed.
The default Envoy circuit breaker thresholds — 1024 connections, 1024 pending requests, 1024 active requests, 3 retries — are deliberately conservative. They're meant to prevent runaway resource exhaustion, not to serve as permanent production settings for a high-throughput service. Services handling more than a few hundred RPS almost always need these numbers tuned upward.
How to Identify It
The key diagnostic is comparing overflow events against upstream health. If overflows are happening but the upstream hosts are healthy and responding quickly, you have a threshold misconfiguration — not an upstream problem.
curl -s http://192.168.1.10:9901/clusters | grep -E "health_flags|cx_active"
api_service::10.0.1.50:8080::health_flags::healthy
api_service::10.0.1.51:8080::health_flags::healthy
api_service::10.0.1.52:8080::health_flags::healthy
api_service::10.0.1.50:8080::cx_active::341
api_service::10.0.1.51:8080::cx_active::339
api_service::10.0.1.52:8080::cx_active::344
Three healthy hosts, each with roughly 340 active connections, totaling ~1024 — exactly at the default limit. Now check whether upstream latency is normal:
curl -s http://192.168.1.10:9901/stats | grep -E "upstream_cx_overflow|upstream_rq_time"
cluster.api_service.upstream_cx_overflow: 12483
cluster.api_service.upstream_rq_time: P0(nan,1) P25(nan,3) P50(nan,6) P75(nan,11) P99(nan,28)
P99 at 28ms with 12,000+ overflow events is the clearest sign your thresholds are undersized for your load. The upstream is healthy — Envoy just needs permission to use more connections.
How to Fix It
Baseline your actual traffic before setting new limits. Look at peak active connection counts and request concurrency over a representative time window that includes your daily and weekly traffic peaks:
curl -s http://192.168.1.10:9901/stats | grep -E "cx_active|rq_active|rq_pending_active"
cluster.api_service.upstream_cx_active: 1024
cluster.api_service.upstream_rq_active: 876
cluster.api_service.upstream_rq_pending_active: 203
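A stats snapshot only shows the current instant. If the counters are exported to Prometheus, a windowed query captures the actual peak over a longer period (a sketch using the standard metric names from Envoy's Prometheus endpoint):
max_over_time(envoy_cluster_upstream_cx_active{envoy_cluster_name="api_service"}[14d])
max_over_time(envoy_cluster_upstream_rq_active{envoy_cluster_name="api_service"}[14d])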
If peak active connections hit 1024, setting max_connections to 3000 gives you roughly 3x headroom for spikes. Apply the same reasoning to pending requests and active requests:
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 3000
    max_pending_requests: 3000
    max_requests: 3000
    max_retries: 100
After applying the update, verify the overflow counters stop incrementing:
watch -n 1 "curl -s http://192.168.1.10:9901/stats | grep overflow"
Root Cause 5: Upstream Service Degraded
This is the scenario where the circuit breaker is working exactly as designed. When an upstream slows down — due to a bad deploy, a database query regression, external dependency latency, or resource exhaustion — responses take longer. Slower responses mean connections stay active longer. Active connections accumulate. Eventually you hit the max_connections or max_pending_requests threshold and the breaker trips.
The circuit breaker here is doing you a favor. It's preventing a slow upstream from exhausting all of Envoy's connection resources and dragging every other service sharing that proxy into the same degraded state. But you still need to fix the upstream.
How to Identify It
Upstream latency is the key differentiator between a threshold problem and an upstream problem. Pull the upstream request time histogram:
curl -s http://192.168.1.10:9901/stats | grep "upstream_rq_time"
cluster.api_service.upstream_rq_time: P0(nan,2) P25(nan,120) P50(nan,480) P75(nan,1890) P99(nan,8430)
P50 at 480ms and P99 at 8.4 seconds is not a threshold issue — that upstream is struggling. Also check for host-level degradation:
curl -s http://192.168.1.10:9901/clusters | grep -E "health_flags|rq_error"
api_service::10.0.1.50:8080::health_flags::healthy
api_service::10.0.1.51:8080::health_flags::/failed_active_hc
api_service::10.0.1.52:8080::health_flags::healthy
api_service::10.0.1.50:8080::rq_error::0
api_service::10.0.1.51:8080::rq_error::1847
api_service::10.0.1.52:8080::rq_error::3
One host is failing active health checks and generating 1847 request errors. That host is dragging the cluster down and contributing disproportionately to connection accumulation. Also check whether outlier detection has already kicked in:
curl -s http://192.168.1.10:9901/stats | grep -E "outlier_detection|ejections"
cluster.api_service.outlier_detection.ejections_active: 1
cluster.api_service.outlier_detection.ejections_consecutive_5xx: 3
cluster.api_service.outlier_detection.ejections_total: 3
How to Fix It
The circuit breaker is a band-aid here — restoring upstream health is the real fix. In the immediate term, if a specific host is clearly bad, take it out of rotation. Envoy's admin API doesn't remove individual upstream hosts; host membership comes from service discovery, so pull the bad host from the cluster's endpoint source (EDS, DNS record, or static config), or make it fail its health checks so Envoy stops sending it traffic.
Then investigate the actual upstream issue on that host. Check application logs, look for OOM kills, slow database queries, or thread pool saturation. Once the host is healthy, re-add it to the rotation and monitor its error rate before trusting it with full traffic.
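What that investigation looks like varies by service, but a first pass on the bad host usually covers the basics (the service unit name here is a placeholder for whatever runs on 10.0.1.51):
dmesg -T | grep -i "out of memory"
systemctl status api-service
journalctl -u api-service --since "30 minutes ago" | grep -iE "error|timeout"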
Proactively, configure outlier detection to automatically eject degraded hosts before they can accumulate enough stuck connections to trip the circuit breaker:
outlier_detection:
  consecutive_5xx: 5
  interval: 10s
  base_ejection_time: 30s
  max_ejection_percent: 50
  consecutive_gateway_failure: 5
  enforcing_consecutive_5xx: 100
  enforcing_success_rate: 100
  success_rate_minimum_hosts: 3
  success_rate_request_volume: 100
With this config, a host that returns 5 consecutive 5xx errors gets ejected for at least 30 seconds, and the 10-second interval controls how often Envoy runs its ejection analysis sweep. Envoy brings the host back automatically after the ejection period and monitors for sustained recovery. This is the difference between a circuit breaker event that lasts 45 seconds and one that lasts 10 minutes.
Prevention
The most reliable prevention strategy is treating circuit breaker thresholds as a first-class operational concern — not a set-it-and-forget-it default. Every service has different traffic characteristics, and the built-in defaults are intentionally conservative.
Start by establishing baselines. Wire Envoy stats into your metrics pipeline — Prometheus and StatsD are both natively supported. Track upstream_cx_active, upstream_rq_active, and upstream_rq_pending_active as time series. Set your thresholds based on observed peaks plus meaningful headroom, and revisit them whenever traffic patterns change significantly or after major deployments.
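Getting those series into Prometheus only takes a scrape job pointed at the admin port's native export (the job name is arbitrary):
scrape_configs:
- job_name: envoy-admin
  metrics_path: /stats/prometheus
  static_configs:
  - targets: ["192.168.1.10:9901"]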
Enable outlier detection on every cluster that has more than one upstream host. This gives you automatic host ejection before a degraded instance can cause enough connection accumulation to trip the circuit breaker. Pair outlier detection with a health check interval short enough to detect degradation quickly — 10 seconds is a reasonable default for most services.
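A matching cluster health check at that interval, as a sketch (the probe path is an assumption about the upstream):
health_checks:
- timeout: 2s
  interval: 10s
  unhealthy_threshold: 3
  healthy_threshold: 2
  http_health_check:
    path: /healthz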
Use retry budgets instead of fixed retry counts for any cluster handling real traffic. A fixed max_retries of 3 trips almost immediately under heavy load, yet is far more headroom than a quiet service ever needs. Retry budgets scale proportionally with request volume, which is what you actually want.
Alert on overflow counters, not just 503 rates. By the time clients are seeing elevated error rates, the circuit breaker has been tripping for a while. Alert early:
# Prometheus alerting rules for Envoy circuit breaker events
- alert: EnvoyCircuitBreakerConnectionOverflow
  expr: rate(envoy_cluster_upstream_cx_overflow[5m]) > 1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Envoy circuit breaker overflows on {{ $labels.envoy_cluster_name }}"
    description: "Check thresholds and upstream health on sw-infrarunbook-01"

- alert: EnvoyUpstreamPendingOverflow
  expr: rate(envoy_cluster_upstream_rq_pending_overflow[5m]) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Pending request overflow on {{ $labels.envoy_cluster_name }}"
    description: "Upstream latency or capacity issue detected — review stats at http://192.168.1.10:9901"
Circuit breaker events in production are rarely a simple "raise the limit" fix or purely an "upstream is broken" problem. They're almost always a signal that something has changed — traffic grew, latency increased, or a host degraded. The stats are there. You just need to read them in context instead of reaching for a config change as the first response.
