The Two Safety Nets You Need to Understand
If you've spent any time with Envoy, you've probably seen circuit_breakers and outlier_detection sitting in cluster configs and wondered what exactly each one does, and more importantly, when you need one versus the other. I've reviewed a lot of production Envoy configs that misconfigure these, conflate the two concepts entirely, or just copy an example from a blog post without understanding the tradeoffs. This article is going to fix that.
Here's the short version: circuit breaking limits how many resources Envoy will consume when connecting to an upstream cluster. Outlier detection watches individual hosts in that cluster and ejects the sick ones from rotation. They're complementary. You want both, and you want to understand what each one is actually doing at runtime.
Circuit Breaking: Hard Limits on Resource Consumption
Envoy's circuit breaker doesn't work like a traditional circuit breaker you'd know from Hystrix or resilience4j. There's no open/half-open/closed state machine that trips based on error rate over a sliding time window. Instead, it's a set of hard concurrency limits. When you exceed those limits, Envoy starts failing new requests immediately — no waiting, no queueing, just an instant 503 with a specific overflow counter incremented.
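To make the contrast concrete, here's a minimal Python sketch of a fail-fast concurrency limit in the style Envoy uses. This is illustrative pseudocode, not Envoy source: there's no state machine to trip or reset, just a counter checked on every admission.

```python
# Minimal sketch (illustrative, not Envoy code): a fail-fast concurrency
# limit. Contrast with Hystrix-style breakers, which track error rates
# over a window and transition between open/half-open/closed states.
class ConcurrencyLimit:
    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.active = 0
        self.overflow = 0  # analogous to an *_overflow stat counter

    def try_acquire(self):
        # Admit iff under the limit; otherwise fail immediately.
        # No queueing, no waiting, no time window.
        if self.active >= self.max_requests:
            self.overflow += 1
            return False
        self.active += 1
        return True

    def release(self):
        self.active -= 1

limit = ConcurrencyLimit(max_requests=2)
results = [limit.try_acquire() for _ in range(3)]
# first two admitted, third rejected immediately
```

The caller sees the rejection instantly, which is the whole point: backpressure propagates in microseconds instead of accumulating in queues.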
Circuit breakers are configured at the cluster level and organized by priority. Envoy has two priority levels: DEFAULT and HIGH. Most traffic runs at DEFAULT. You can route specific traffic to HIGH priority using route-level priority settings, and HIGH has its own independent threshold buckets with whatever limits you configure. The point is that critical traffic (say, synchronous user-facing requests) can have looser circuit breaker limits than background batch traffic.
There are four thresholds that actually matter in practice:
- max_connections — The maximum number of active TCP connections Envoy will maintain to all upstream hosts in this cluster combined. Once you hit this, new connection attempts overflow.
- max_pending_requests — The maximum number of HTTP requests queued waiting for a connection pool slot. Once you hit this, new requests overflow immediately rather than queuing further.
- max_requests — The maximum number of requests currently in-flight across all connections to the cluster. This is the threshold you tune most carefully for HTTP/2, where a single connection multiplexes many streams.
- max_retries — The maximum number of concurrent retries happening across all connections. This is specifically to prevent a retry storm from turning a partial outage into a total one.
The defaults are 1024 for max_connections, max_pending_requests, and max_requests (max_retries defaults to 3), and 1024 is almost always far too high for real services. In my experience, I've seen max_requests sitting at 1024 on internal API services that handle maybe 30 concurrent requests at peak. That means Envoy would faithfully hammer a struggling upstream with over a thousand concurrent requests before circuit breaking even activates, which is exactly the kind of behavior that turns a degraded backend into a dead one.
clusters:
- name: payment_service
  connect_timeout: 0.5s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 100
      max_pending_requests: 50
      max_requests: 200
      max_retries: 3
      track_remaining: true
    - priority: HIGH
      max_connections: 500
      max_pending_requests: 200
      max_requests: 1000
      max_retries: 10
  load_assignment:
    cluster_name: payment_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 10.0.1.50
              port_value: 8080
When a threshold is exceeded, Envoy increments a specific stat counter. For connections it's upstream_cx_overflow, for retries it's upstream_rq_retry_overflow, and for requests it's upstream_rq_pending_overflow, which covers both the pending queue and the max_requests limit (request circuit breaking for HTTP/2 and above lands in this counter too; there is no separate upstream_rq_overflow stat). These are your signal that the circuit breaker is actually working. If you never see these counters increment, your thresholds are too high relative to your actual traffic, or your upstream is genuinely healthy and you don't have a problem. If they're ticking constantly, your thresholds are too tight or your upstream is genuinely overwhelmed and needs attention.
The track_remaining: true flag on any threshold tier enables remaining-capacity gauges under the circuit_breakers stat tree — circuit_breakers.default.remaining_rq, remaining_pending, remaining_cx, and remaining_retries — so you can see in real time how close you are to the thresholds without waiting for overflow events. It adds slight runtime overhead, but it's invaluable during capacity planning and incident response.
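If you scrape the admin endpoint, headroom is easy to extract from those gauges. A sketch in Python; the gauge naming follows Envoy's circuit_breakers stat tree, but the cluster name and values here are made up for illustration:

```python
# Sketch: read remaining-capacity gauges out of admin /stats text.
# Stat names follow Envoy's circuit_breakers gauge naming; the
# payment_service values below are hypothetical.
stats_text = """\
cluster.payment_service.circuit_breakers.default.remaining_rq: 38
cluster.payment_service.circuit_breakers.default.remaining_pending: 50
"""

def remaining(text, cluster, gauge):
    # Match the fully qualified gauge name and return its value.
    prefix = f"cluster.{cluster}.circuit_breakers.default.{gauge}:"
    for line in text.splitlines():
        if line.startswith(prefix):
            return int(line.split(":")[1])
    return None

rq_left = remaining(stats_text, "payment_service", "remaining_rq")
# With max_requests: 200, 38 remaining means 162 requests in flight.
```

Alerting on remaining capacity dropping toward zero gives you warning before the first overflow ever fires.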
Outlier Detection: Ejecting Sick Hosts from the Pool
Outlier detection is a fundamentally different mechanism. It's passive health checking — Envoy observes the responses coming back from each individual upstream host and makes ejection decisions based on that observed behavior. It doesn't limit concurrency. It removes misbehaving hosts from the load balancing rotation so healthy traffic stops getting routed to them.
There are four detection algorithms, and you can enable any combination:
Consecutive 5xx ejection is the simplest. If a host returns N consecutive 5xx responses, it gets ejected. The threshold is consecutive_5xx and defaults to 5. Gateway errors (connection refused, TCP reset, hard timeout) are tracked separately under consecutive_gateway_failure, also defaulting to 5. Note that gateway-failure ejection is off by default: enforcing_consecutive_gateway_failure defaults to 0, so you have to raise it (typically to 100) for the check to do anything. In practice, I'd also set the gateway failure threshold lower than the 5xx one, because a host that's refusing TCP connections is in significantly worse shape than one returning application-level errors. Three consecutive gateway failures is usually enough signal.
Success rate ejection is more nuanced. Envoy tracks each host's success rate over a configurable interval and compares it against the mean success rate across all hosts in the cluster. Hosts whose success rate falls more than some number of standard deviations below the mean get ejected. The standard deviation factor is success_rate_stdev_factor, which is expressed in thousandths — the default of 1900 means 1.9 standard deviations, roughly the 97th percentile. This algorithm only activates if there are at least success_rate_minimum_hosts hosts with at least success_rate_request_volume requests during the interval. Below those minimums, Envoy doesn't have enough statistical confidence to act.
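The ejection rule can be written down in a few lines. This is a simplified model of the algorithm as described above, not Envoy's actual implementation, and the per-host success rates are hypothetical:

```python
import statistics

# Simplified model of success-rate outlier ejection: eject any host
# whose success rate is more than (stdev_factor / 1000) standard
# deviations below the cluster mean.
def outliers(success_rates, stdev_factor=1900):
    rates = list(success_rates.values())
    mean = statistics.mean(rates)
    stdev = statistics.pstdev(rates)  # population stdev over the cluster
    threshold = mean - (stdev_factor / 1000.0) * stdev
    return {host for host, sr in success_rates.items() if sr < threshold}

# Four healthy hosts and one clearly sick one (hypothetical data).
hosts = {"10.0.2.10": 0.99, "10.0.2.11": 0.98, "10.0.2.12": 0.97,
         "10.0.2.13": 0.99, "10.0.2.14": 0.60}
ejected = outliers(hosts)  # only the 60% host falls below the threshold
```

Note the weakness this model makes visible: the sick host drags both the mean and the stdev, so if several hosts degrade together the threshold chases them downward and nothing gets ejected. That's exactly the gap failure percentage ejection closes.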
Failure percentage ejection is straightforward — eject any host whose failure percentage exceeds failure_percentage_threshold. It uses the same minimum hosts and request volume gating. This is useful in scenarios where all hosts in the cluster are struggling, but one is clearly much worse than the rest and the success rate algorithm isn't firing because the cluster mean is already low.
clusters:
- name: payment_service
  outlier_detection:
    # Consecutive error thresholds
    consecutive_5xx: 5
    consecutive_gateway_failure: 3
    enforcing_consecutive_5xx: 100
    enforcing_consecutive_gateway_failure: 100
    # Detection sweep interval
    interval: 10s
    # Ejection timing with backoff
    base_ejection_time: 30s
    max_ejection_time: 300s
    max_ejection_percent: 25
    # Success rate algorithm
    success_rate_minimum_hosts: 5
    success_rate_request_volume: 100
    success_rate_stdev_factor: 1900
    enforcing_success_rate: 100
    # Failure percentage algorithm
    failure_percentage_threshold: 85
    failure_percentage_minimum_hosts: 5
    failure_percentage_request_volume: 50
    enforcing_failure_percentage: 100
The ejection mechanics are worth understanding in detail. When a host is ejected, it's removed from load balancing for base_ejection_time multiplied by the ejection count — where ejection count is how many times this specific host has been ejected since the cluster was initialized. First ejection: 30 seconds out. Second ejection: 60 seconds out. Third: 90 seconds. This keeps increasing up to max_ejection_time, which caps the backoff. After the ejection period expires, the host re-enters the pool on the next detection interval sweep.
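The backoff schedule above is just a linear ramp with a cap, which is easy to sanity-check for your chosen base and max values (a simplified model of the timing described above):

```python
# Simplified model of outlier ejection backoff: duration grows linearly
# with the host's ejection count and is capped at max_ejection_time.
def ejection_seconds(ejection_count, base=30, cap=300):
    return min(base * ejection_count, cap)

# With base 30s and cap 300s, a repeatedly flapping host sits out
# 30s, 60s, 90s, ... and flattens at 300s from the 10th ejection on.
durations = [ejection_seconds(n) for n in range(1, 13)]
```

Running this for your own base_ejection_time and max_ejection_time tells you the worst-case time a recovered host spends out of rotation, which matters when you're deciding how aggressive the cap should be.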
The max_ejection_percent parameter is a safety valve. Envoy won't eject more than this percentage of the cluster's hosts simultaneously. If you have 10 hosts and max_ejection_percent is 20, at most 2 can be ejected at once. This prevents aggressive ejection from leaving you with no upstream hosts when an entire cluster starts having problems. Don't set this too low on small clusters: with 5 hosts and 10%, the percentage alone permits 0 ejections, since you can't eject half a host. Recent Envoy versions will still eject at least one host regardless of the percentage, but don't lean on that floor; size the percentage for the cluster you actually have.
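The small-cluster arithmetic is worth working through explicitly. A sketch; the at_least_one flag models the behavior of recent Envoy versions, which (per the max_ejection_percent documentation) eject at least one host regardless of the percentage:

```python
# Sketch: how many hosts max_ejection_percent actually permits ejecting.
# at_least_one=True models recent Envoy behavior (always allow one
# ejection); older releases floored straight to zero on small clusters.
def max_ejectable(num_hosts, max_ejection_percent, at_least_one=True):
    allowed = num_hosts * max_ejection_percent // 100  # integer floor
    if at_least_one:
        allowed = max(allowed, 1)
    return allowed

tiny = max_ejectable(5, 10, at_least_one=False)   # 0: nothing ejectable
sized = max_ejectable(5, 40, at_least_one=False)  # 2: room for real faults
```

A quick loop over your actual cluster sizes catches these floor-to-zero configs before they ship.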
Why This Matters: The Cascading Failure Problem
Without circuit breaking, a slow or overwhelmed upstream will back up your request queues, exhaust connection pools, and eventually cause proxy memory exhaustion or file descriptor starvation. I've seen this kill a proxy cluster that was fine by its own CPU and memory metrics — it was sitting on 40,000 half-open connections to a backend that had stopped processing them. The proxy looked healthy right up until it didn't.
Without outlier detection, you'll keep sending traffic to hosts that are actively failing. If your cluster has 10 hosts and 2 of them got a bad deployment, 20% of your requests will fail consistently — for every user, until you manually intervene or the bad pods restart. With outlier detection tuned correctly, those two hosts get ejected within one detection interval and traffic redistributes to the healthy eight. Your error rate drops from 20% to near zero without any human involved.
Together, these mechanisms implement the resilience patterns that used to require application-level libraries. Envoy handles them at the proxy layer, which means your application code doesn't need to know about upstream health, every service in your mesh benefits automatically, and the behavior is consistent across languages and runtimes.
Observing Both Mechanisms in Practice
Envoy's admin interface at port 9901 exposes everything you need. On a proxy running at sw-infrarunbook-01 with admin bound to 10.0.1.10, you can query circuit breaker overflow counters and ejection stats for any cluster:
# Query stats for the payment_service cluster
curl -s http://10.0.1.10:9901/stats?filter=payment_service | grep -E "(overflow|ejection)"
# Sample output during a degraded upstream event
cluster.payment_service.upstream_cx_overflow: 0
cluster.payment_service.upstream_rq_pending_overflow: 59
cluster.payment_service.upstream_rq_retry_overflow: 5
cluster.payment_service.outlier_detection.ejections_active: 2
cluster.payment_service.outlier_detection.ejections_consecutive_5xx: 11
cluster.payment_service.outlier_detection.ejections_success_rate: 0
cluster.payment_service.outlier_detection.ejections_total: 14
Reading this: 59 requests have been rejected by request circuit breaking (max_pending_requests and max_requests overflows both land in upstream_rq_pending_overflow, so the counter alone doesn't tell you which limit tripped), 5 retries were refused by the retry circuit breaker, and there are currently 2 hosts ejected (14 total ejection events historically, 11 from consecutive 5xx). This paints a clear picture: there's request-level pressure on the cluster and at least two hosts have been repeatedly misbehaving.
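If you scrape these stats into monitoring, a small parser is all it takes to turn the text into alertable values. A sketch in Python; the stat names follow Envoy's naming and the values are illustrative:

```python
# Sketch: parse admin /stats text into a dict for alerting.
# Stat names follow Envoy's cluster stat naming; values are illustrative.
sample = """\
cluster.payment_service.upstream_cx_overflow: 0
cluster.payment_service.upstream_rq_pending_overflow: 59
cluster.payment_service.upstream_rq_retry_overflow: 5
cluster.payment_service.outlier_detection.ejections_active: 2
cluster.payment_service.outlier_detection.ejections_total: 14
"""

def parse_stats(text):
    out = {}
    for line in text.strip().splitlines():
        name, _, value = line.rpartition(": ")
        out[name] = int(value)
    return out

stats = parse_stats(sample)
# Any nonzero overflow counter means a circuit breaker is actively firing.
overflowing = any(v > 0 for k, v in stats.items() if "overflow" in k)
```

Counters are cumulative since process start, so in practice you alert on their rate of change rather than the absolute value.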
You can also get per-host health status from the clusters endpoint:
# Get detailed cluster and endpoint state
curl -s http://10.0.1.10:9901/clusters | grep -A2 payment_service
# Ejected hosts will show health_flags containing /failed_outlier_check
# Example line for an ejected host:
# payment_service::10.0.1.52:8080::health_flags::/failed_outlier_check
A Complete Production Configuration
Here's a full cluster config combining both mechanisms for a latency-sensitive API backend — the kind of thing I'd deploy for a service handling synchronous user traffic on solvethenetwork.com infrastructure:
static_resources:
  clusters:
  - name: api_backend
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST
    # HTTP/2 to the upstream (the cluster-level http2_protocol_options
    # field is deprecated in favor of typed_extension_protocol_options)
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    # Circuit breaker: hard resource ceilings
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 150
        max_pending_requests: 75
        max_requests: 300
        max_retries: 5
        track_remaining: true
      - priority: HIGH
        max_connections: 300
        max_pending_requests: 150
        max_requests: 600
        max_retries: 15
    # Outlier detection: passive host health tracking
    outlier_detection:
      consecutive_5xx: 5
      consecutive_gateway_failure: 3
      enforcing_consecutive_5xx: 100
      enforcing_consecutive_gateway_failure: 100
      interval: 10s
      base_ejection_time: 30s
      max_ejection_time: 300s
      max_ejection_percent: 25
      success_rate_minimum_hosts: 3
      success_rate_request_volume: 100
      success_rate_stdev_factor: 1900
      enforcing_success_rate: 100
      failure_percentage_threshold: 80
      failure_percentage_minimum_hosts: 3
      failure_percentage_request_volume: 50
      enforcing_failure_percentage: 100
    # Active health checks complement outlier detection for idle clusters
    health_checks:
    - timeout: 1s
      interval: 5s
      unhealthy_threshold: 2
      healthy_threshold: 1
      http_health_check:
        path: /healthz
    load_assignment:
      cluster_name: api_backend
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 10.0.2.10
                port_value: 8080
        - endpoint:
            address:
              socket_address:
                address: 10.0.2.11
                port_value: 8080
        - endpoint:
            address:
              socket_address:
                address: 10.0.2.12
                port_value: 8080
Common Misconceptions
The first one I keep seeing: circuit breaking will prevent cascading failures. It limits the blast radius — it stops you from drowning an upstream in more traffic than it can process. But it won't make requests succeed. You still need outlier detection, sensible retry policies, timeout budgets, and fallback strategies as part of a complete resilience picture. A circuit breaker tripping constantly is a symptom that something needs fixing, not a solution in itself.
Second: outlier detection replaces active health checks. It doesn't. Active health checks probe hosts periodically with synthetic requests regardless of real traffic. Outlier detection only sees real production requests — if traffic to a cluster is zero or very low, outlier detection never fires. Active health checks catch problems in idle clusters, catch issues during rolling deployments before real traffic reaches a new instance, and can check deeper endpoint health (like database connectivity) that a 5xx counter wouldn't reveal. Use both in combination.
Third: max_ejection_percent protects you from accidentally ejecting all hosts. Only partly, and the arithmetic bites on small clusters. With 3 hosts and max_ejection_percent set to 10, 10% of 3 is 0.3, which floors to 0. Recent Envoy versions will still eject at least one host regardless of the percentage, but older releases would eject nothing, and either way the setting may not permit the number of simultaneous ejections you actually expect. Always verify the math against your real cluster sizes. For clusters under 10 hosts, 30–50% is usually appropriate.
Fourth: high thresholds are the safe, conservative choice. This is backwards. High circuit breaker thresholds mean Envoy will faithfully queue and dispatch hundreds or thousands of requests to a struggling upstream before backing off. That turns a minor backend degradation into a full outage as you pile requests onto a half-working service. Tight, realistic thresholds fail fast and let the caller decide what to do — retry to another service instance, serve from cache, return a degraded response, or shed load gracefully.
Tuning Guidance
Start by understanding your service's actual concurrency profile. Pull your 99th percentile active request count from a normal traffic period. Set max_requests to roughly 2–3x that value. This gives you meaningful headroom for traffic spikes while still protecting the upstream from being avalanched during an incident.
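The sizing rule above is trivial to encode, which makes it easy to apply uniformly across a fleet. A sketch; the concurrency figures are hypothetical:

```python
# Sketch of the sizing rule above: max_requests at roughly 2-3x the
# observed p99 concurrent request count. Input numbers are hypothetical.
def suggest_max_requests(p99_concurrency, headroom=2.5):
    # Headroom multiplier between 2x (tight) and 3x (generous).
    return int(p99_concurrency * headroom)

suggestion = suggest_max_requests(120)  # 300 with the default 2.5x
```

The useful part isn't the multiplication; it's that committing to a formula forces you to actually measure p99 concurrency per service instead of inheriting the 1024 default.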
For base_ejection_time, think about how long it realistically takes your service to recover from transient issues. If your Go service restarts in 2 seconds, 30 seconds of ejection is conservative. If your JVM service takes 20 seconds to warm up and reach nominal throughput, a 10-second ejection window will just keep re-ejecting the same host every time it comes back. Match the ejection duration to your actual recovery time.
Set consecutive_gateway_failure lower than consecutive_5xx. A host returning 500s is experiencing application-level errors. A host refusing TCP connections or resetting them is in a fundamentally different and usually worse state. Three consecutive gateway failures should be enough to pull a host without waiting for five.
In my experience, the teams that get the most value from outlier detection treat ejections_total and ejections_active as first-class signals in their monitoring dashboards, not just passive counters. If ejections are climbing steadily on a cluster, something is wrong with those hosts — noisy neighbor, bad deployment, hardware degradation, or upstream dependency flapping. The ejections are a symptom worth investigating, not merely proof that the feature is working.
Circuit breaking and outlier detection are two of the most operationally valuable primitives Envoy gives you. Neither requires changing a line of application code. Get them configured correctly and tuned to your actual traffic patterns, and your proxy layer becomes an active participant in keeping your services healthy — rather than a passive conduit that faithfully delivers traffic into a collapsing cluster.
