InfraRunBook

    Envoy Circuit Breaker Triggering

    Envoy
    Published: Apr 12, 2026
    Updated: Apr 12, 2026

    Diagnose and fix Envoy circuit breaker triggers — covering max connections exceeded, pending request overflow, retry overflow, and upstream degradation with real CLI commands and stat output.


    Symptoms

    When Envoy's circuit breaker trips, the most visible symptom is a stream of HTTP 503 responses hitting your clients almost instantly — no connection delay, no timeout wait, just immediate rejection. Unlike a normal upstream failure where you'd see latency climb first, circuit breaker-triggered 503s arrive fast because Envoy is intentionally refusing the request before it ever touches the upstream. The response will carry the header x-envoy-overloaded: true, which is your clearest indicator that this is a circuit breaker event and not an upstream crash.

    On the Envoy admin interface (default port 9901), you'll see specific stat counters ticking upward. The ones that matter are upstream_cx_overflow, upstream_rq_pending_overflow, and upstream_rq_retry_overflow. In my experience, these counters sit at zero for long stretches and then spike suddenly — often correlating with a traffic burst, a slow deploy, or an upstream that started responding slower than usual.

    Other symptoms to watch for:

    • Downstream services reporting connection refused or immediate 503 errors with no matching upstream log entries
    • Envoy access logs showing the UO response flag, meaning upstream overflow
    • Observability dashboards showing a sudden vertical line on overflow counters
    • The circuit breaker gauge circuit_breakers.default.cx_open returning a value of 1, meaning the breaker is actively open

    The tricky part is that a correctly functioning circuit breaker looks identical from the outside to a misconfigured one. The stat counters alone won't tell you which situation you're in — you need to cross-reference them with upstream health, latency, and your threshold configuration to figure out whether Envoy is protecting a degraded upstream or unnecessarily rejecting perfectly healthy traffic.
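    That cross-referencing starts with knowing which limit tripped. A minimal sketch of the mapping, assuming you've already fetched the /stats text from the admin endpoint (the counter names are the ones above; the fetch itself is left out):

```python
import re

# Map each overflow counter to the circuit breaker limit it points at.
# Counter names are Envoy's cluster stats; cluster name is captured from the prefix.
BREAKER_HINTS = {
    "upstream_cx_overflow": "max_connections",
    "upstream_rq_pending_overflow": "max_pending_requests",
    "upstream_rq_retry_overflow": "max_retries",
}

def tripped_breakers(stats_text: str) -> dict:
    """Return non-zero overflow counters, keyed by (cluster, limit name)."""
    tripped = {}
    for line in stats_text.splitlines():
        m = re.match(r"cluster\.(\w+)\.(\w+): (\d+)", line.strip())
        if not m:
            continue
        cluster, stat, value = m.group(1), m.group(2), int(m.group(3))
        if stat in BREAKER_HINTS and value > 0:
            tripped[(cluster, BREAKER_HINTS[stat])] = value
    return tripped

sample = """\
cluster.api_service.upstream_cx_overflow: 847
cluster.api_service.upstream_rq_pending_overflow: 0
cluster.api_service.upstream_rq_retry_overflow: 12
"""
print(tripped_breakers(sample))
# {('api_service', 'max_connections'): 847, ('api_service', 'max_retries'): 12}
```

    This only tells you which threshold is saturating, not whether the threshold or the upstream is at fault — that still requires the latency and health checks described in the sections below.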


    Root Cause 1: Max Connections Exceeded

    Envoy maintains a per-cluster limit on active TCP connections to upstream hosts. This is controlled by the max_connections field in the circuit breaker thresholds block. The default is 1024, which sounds generous until you realize that in high-throughput environments a single service mesh proxy can easily sustain thousands of concurrent connections across multiple upstreams.

    This limit applies to the total number of active connections across all hosts in the cluster, not per host. So if you have three upstream endpoints each handling 400 active connections, you're already at 1200 — above the default 1024 limit — and Envoy will start rejecting new connection attempts while all three hosts are perfectly healthy.
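    A quick way to check this in practice is to sum the per-host cx_active gauges from the /clusters admin output and compare the total against the limit. A sketch, using the host-line format that endpoint emits (values here are illustrative):

```python
def cluster_cx_totals(clusters_text: str) -> dict:
    """Sum per-host cx_active gauges into a per-cluster total."""
    totals = {}
    for line in clusters_text.splitlines():
        # Format: <cluster>::<host:port>::cx_active::<count>
        parts = line.strip().split("::")
        if len(parts) == 4 and parts[2] == "cx_active":
            totals[parts[0]] = totals.get(parts[0], 0) + int(parts[3])
    return totals

sample = """\
api_service::10.0.1.50:8080::cx_active::400
api_service::10.0.1.51:8080::cx_active::400
api_service::10.0.1.52:8080::cx_active::400
"""
totals = cluster_cx_totals(sample)
print(totals["api_service"])          # 1200
print(totals["api_service"] > 1024)   # True: over the default cluster-wide limit
```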

    How to Identify It

    Pull the overflow stats from the admin endpoint and filter for connection-related counters:

    curl -s http://192.168.1.10:9901/stats | grep -E "upstream_cx_overflow|upstream_cx_active|cx_open"

    You're looking for output like this:

    cluster.api_service.circuit_breakers.default.cx_open: 1
    cluster.api_service.upstream_cx_active: 1024
    cluster.api_service.upstream_cx_overflow: 847

    The cx_open: 1 gauge confirms the connection circuit breaker is open. The upstream_cx_overflow counter tells you how many connection attempts have been rejected since Envoy started. Compare upstream_cx_active against your configured threshold — if they match, that's your bottleneck. Verify the configured limit with a config dump:

    curl -s http://192.168.1.10:9901/config_dump | python3 -m json.tool | grep -A 20 "circuit_breakers"
    "circuit_breakers": {
      "thresholds": [
        {
          "priority": "DEFAULT",
          "max_connections": 1024,
          "max_pending_requests": 1024,
          "max_requests": 1024,
          "max_retries": 3
        }
      ]
    }

    How to Fix It

    Increase the max_connections threshold in your cluster configuration. The right value depends on your actual traffic profile. I usually recommend pulling 95th-percentile active connection counts over a representative traffic window and setting the threshold to 2–3x that value as headroom.

    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 4096
        max_pending_requests: 2048
        max_requests: 4096
        max_retries: 50
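    The sizing rule above — P95 of observed active connections, times 2–3x headroom — can be sketched as a small helper. The sample values are illustrative, not from a real capture:

```python
import math

def sized_limit(samples, percentile=0.95, headroom=3.0, floor=1024):
    """Pick a circuit breaker limit from observed active-connection samples:
    take the given percentile, multiply by the headroom factor, and never
    drop below the Envoy default of 1024."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    return max(floor, int(ordered[idx] * headroom))

# Hypothetical per-minute peaks of upstream_cx_active over a busy window
samples = [880, 910, 1350, 990, 1020, 1410, 1190, 1280, 1500, 1330]
print(sized_limit(samples))  # 4500
```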

    After applying the config update, watch the overflow counter in real time to confirm it stops climbing:

    watch -n 2 "curl -s http://192.168.1.10:9901/stats | grep upstream_cx_overflow"

    If the counter keeps rising even after raising the threshold, don't just keep increasing it blindly. That's masking a deeper problem. Investigate whether the upstream is accumulating slow or stuck connections that should be timing out instead of staying active.


    Root Cause 2: Max Pending Requests Hit

    When all connections in the pool are in use, new requests don't get dropped immediately — they queue as pending requests. The max_pending_requests circuit breaker limit controls how deep that queue can grow. Once it hits the limit, Envoy starts rejecting incoming requests with 503. This is a subtly different failure mode than max connections: active connections might be well within range, but requests are stacking up behind them faster than they can be processed.

    I've seen this happen most often when upstream latency spikes — not enough to trigger health checks, but enough to push response times from 20ms to 200ms. That 10x latency increase means the same number of connections can handle only a tenth of the previous throughput, and the pending queue fills up fast.
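    The arithmetic behind that is Little's law: in-flight requests equal arrival rate times latency, so a 10x latency increase needs 10x the concurrency to sustain the same throughput. A quick worked example with illustrative numbers:

```python
def connections_needed(rps: float, latency_s: float) -> float:
    """Little's law: concurrent in-flight requests = arrival rate x latency."""
    return rps * latency_s

# Same 5000 RPS offered load, latency shifting from 20ms to 200ms
print(connections_needed(5000, 0.020))  # ~100 in flight at 20ms
print(connections_needed(5000, 0.200))  # ~1000 in flight at 200ms -- queue territory
```

    At 20ms the pool barely notices; at 200ms the same traffic wants ten times the concurrency, and everything beyond the connection limit lands in the pending queue.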

    How to Identify It

    curl -s http://192.168.1.10:9901/stats | grep -E "upstream_rq_pending_overflow|upstream_rq_pending_active|rq_open"
    cluster.api_service.circuit_breakers.default.rq_open: 1
    cluster.api_service.upstream_rq_pending_active: 1024
    cluster.api_service.upstream_rq_pending_overflow: 3291

    The rq_open: 1 flag and a climbing upstream_rq_pending_overflow counter are the tell. Cross-reference with the upstream request time histogram to see whether latency is the driver:

    curl -s http://192.168.1.10:9901/stats | grep "upstream_rq_time"
    cluster.api_service.upstream_rq_time: P0(nan,1) P25(nan,8) P50(nan,45) P75(nan,182) P99(nan,847) P99.9(nan,1204)

    A P99 of 847ms when your normal baseline is under 50ms tells you the upstream is the underlying cause, not the queue depth setting.
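    Envoy prints histograms as percentile(interval,cumulative) pairs, which are awkward to eyeball in a pile of stats. A small parser for pulling one percentile out of that line, matching the output format shown above:

```python
import re

def percentile(stat_line: str, pct: str = "P99") -> float:
    """Extract the cumulative value for one percentile from an Envoy
    histogram line like 'P99(nan,847)'. The first number in each pair is
    the interval window, the second is cumulative since start."""
    m = re.search(rf"{pct}\(([^,]+),([^)]+)\)", stat_line)
    if not m:
        raise ValueError(f"{pct} not found in stat line")
    return float(m.group(2))

line = "cluster.api_service.upstream_rq_time: P0(nan,1) P50(nan,45) P99(nan,847)"
print(percentile(line))         # 847.0
print(percentile(line, "P50"))  # 45.0
```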

    How to Fix It

    In the short term you can raise max_pending_requests to absorb the queue spike. But this is treating symptoms. If upstream latency is elevated, increasing the pending limit without addressing latency just delays the overflow by a few seconds. The correct approach is both: raise the limit while simultaneously investigating upstream health.

    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 4096
        max_pending_requests: 4096
        max_requests: 4096
        max_retries: 50

    Pair this with cluster-level timeouts so stuck requests fail fast and drain from the queue rather than accumulating indefinitely — idle_timeout reaps idle connections, and max_stream_duration bounds how long any single request can stay in flight:

    connect_timeout: 0.5s
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        common_http_protocol_options:
          idle_timeout: 60s
          max_stream_duration: 5s

    Root Cause 3: Max Retries Exceeded

    This one catches people off guard. Envoy's circuit breaker has a dedicated limit for concurrent retries — the max_retries threshold controls how many retry attempts can be in flight simultaneously across the entire cluster. The default value is 3. That's extremely conservative for anything handling real production traffic.

    A retry spike can happen entirely independently of connection or pending request counts. If your retry policy is configured to retry on 5xx responses and the upstream starts throwing intermittent errors, retries multiply your effective request load. When concurrent retries exceed max_retries, Envoy stops retrying and returns the 503 directly to the client. From the outside it looks like every other circuit breaker trip.
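    The multiplication is easy to quantify. With a per-attempt failure probability f and up to n total attempts, each client request generates 1 + f + f² + … expected upstream attempts. A sketch with illustrative numbers:

```python
def expected_attempts(failure_rate: float, max_attempts: int) -> float:
    """Expected upstream attempts per client request when each failed
    attempt (probability failure_rate) triggers another try, up to
    max_attempts total. Geometric series: 1 + f + f^2 + ..."""
    return sum(failure_rate ** k for k in range(max_attempts))

# A 30% intermittent 5xx rate with up to 3 attempts (1 try + 2 retries)
print(round(expected_attempts(0.30, 3), 3))  # 1.39 -> ~39% extra upstream load
```

    A 30% failure rate quietly turning into 39% extra load is exactly the kind of amplification that fills the concurrent-retry pool and trips a max_retries of 3.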

    How to Identify It

    curl -s http://192.168.1.10:9901/stats | grep -E "upstream_rq_retry|retry_overflow"
    cluster.api_service.upstream_rq_retry: 4829
    cluster.api_service.upstream_rq_retry_overflow: 1203
    cluster.api_service.upstream_rq_retry_success: 2871
    cluster.api_service.upstream_rq_retry_limit_exceeded: 1405

    The upstream_rq_retry_overflow counter is the smoking gun. Also check your access logs for the URX response flag, which means the upstream retry limit was exceeded:

    tail -f /var/log/envoy/access.log | grep " URX "
    [2026-04-12T09:14:22.341Z] "POST /api/v2/process HTTP/1.1" 503 URX 0 91 12 - "-" "python-requests/2.28.0" "a3f8b1c2" "10.0.1.50:8080"

    How to Fix It

    First, evaluate whether your retry policy is actually appropriate. Retrying on 5xx when the upstream is consistently failing just amplifies load and makes the situation worse. Consider tightening retry conditions to specific retriable status codes — 503 and 504 are reasonable candidates; 500 is often not retriable because it represents an application-level error that won't change on retry.
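    A route-level retry policy restricted to retriable status codes might look like this — a sketch, with field names following Envoy's route RetryPolicy and illustrative values:

```yaml
retry_policy:
  retry_on: retriable-status-codes,reset
  retriable_status_codes:
  - 503
  - 504
  num_retries: 2
  per_try_timeout: 1s
```

    The per_try_timeout keeps a single slow retry from holding a slot in the concurrent-retry pool for the full request timeout.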

    If retries are appropriate and you need more headroom, raise max_retries:

    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 4096
        max_pending_requests: 4096
        max_requests: 4096
        max_retries: 100

    Better yet, replace the fixed retry count with a retry budget. A retry budget scales allowed retries proportionally to active request volume, which is almost always what you actually want in production:

    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        retry_budget:
          budget_percent:
            value: 20.0
          min_retry_concurrency: 3

    This limits retries to 20% of active requests with a minimum floor of 3. Far more sensible than a hard limit of 3 for a service handling several hundred RPS.
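    The effective ceiling under a budget is max(min_retry_concurrency, budget_percent × active requests), which makes the difference from a fixed limit easy to see:

```python
def retry_ceiling(active_requests: int, budget_percent: float = 20.0,
                  min_retry_concurrency: int = 3) -> int:
    """Concurrent retries allowed under an Envoy-style retry budget:
    a percentage of active requests, with a fixed floor for quiet periods."""
    return max(min_retry_concurrency, int(active_requests * budget_percent / 100))

print(retry_ceiling(10))   # 3   -> the floor dominates on a quiet service
print(retry_ceiling(500))  # 100 -> the budget scales with load
```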


    Root Cause 4: Threshold Too Low for Traffic

    Sometimes nothing is wrong with the upstream. The circuit breaker trips purely because the configured limits don't match actual production traffic volumes. This is extremely common when a config was copied from a lower-traffic staging environment, or when traffic has grown organically since the config was last reviewed.

    The default Envoy circuit breaker thresholds — 1024 connections, 1024 pending requests, 1024 active requests, 3 retries — are deliberately conservative. They're meant to prevent runaway resource exhaustion, not to serve as permanent production settings for a high-throughput service. Services handling more than a few hundred RPS almost always need these numbers tuned upward.

    How to Identify It

    The key diagnostic is comparing overflow events against upstream health. If overflows are happening but the upstream hosts are healthy and responding quickly, you have a threshold misconfiguration — not an upstream problem.

    curl -s http://192.168.1.10:9901/clusters | grep -E "health_flags|cx_active"
    api_service::10.0.1.50:8080::health_flags::healthy
    api_service::10.0.1.51:8080::health_flags::healthy
    api_service::10.0.1.52:8080::health_flags::healthy
    api_service::10.0.1.50:8080::cx_active::341
    api_service::10.0.1.51:8080::cx_active::339
    api_service::10.0.1.52:8080::cx_active::344

    Three healthy hosts, each with roughly 340 active connections, totaling ~1024 — exactly at the default limit. Now check whether upstream latency is normal:

    curl -s http://192.168.1.10:9901/stats | grep -E "upstream_cx_overflow|upstream_rq_time"
    cluster.api_service.upstream_cx_overflow: 12483
    cluster.api_service.upstream_rq_time: P0(nan,1) P25(nan,3) P50(nan,6) P75(nan,11) P99(nan,28)

    P99 at 28ms with 12,000+ overflow events is the clearest sign your thresholds are undersized for your load. The upstream is healthy — Envoy just needs permission to use more connections.
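    The decision rule in this section boils down to: overflows present, hosts healthy, latency at baseline means undersized thresholds; anything else points at the upstream. A sketch of that triage (the 2x latency cutoff is an illustrative choice, not an Envoy constant):

```python
def diagnose(overflow_count: int, all_hosts_healthy: bool,
             p99_ms: float, baseline_p99_ms: float) -> str:
    """Classify a circuit breaker trip: threshold misconfiguration vs
    genuine upstream degradation. Latency counts as 'normal' while P99
    stays within 2x the baseline."""
    if overflow_count == 0:
        return "no breaker activity"
    if all_hosts_healthy and p99_ms <= 2 * baseline_p99_ms:
        return "threshold too low for traffic"
    return "upstream degraded"

print(diagnose(12483, True, 28, 25))   # threshold too low for traffic
print(diagnose(3291, False, 847, 45))  # upstream degraded
```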

    How to Fix It

    Baseline your actual traffic before setting new limits. Look at peak active connection counts and request concurrency over a representative time window that includes your daily and weekly traffic peaks:

    curl -s http://192.168.1.10:9901/stats | grep -E "cx_active|rq_active|rq_pending_active"
    cluster.api_service.upstream_cx_active: 1024
    cluster.api_service.upstream_rq_active: 876
    cluster.api_service.upstream_rq_pending_active: 203

    If peak active connections hit 1024, setting max_connections to 3000 gives you roughly 3x headroom for spikes. Apply the same reasoning to pending requests and active requests:

    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 3000
        max_pending_requests: 3000
        max_requests: 3000
        max_retries: 100

    After applying the update, verify the overflow counters stop incrementing:

    watch -n 1 "curl -s http://192.168.1.10:9901/stats | grep overflow"

    Root Cause 5: Upstream Service Degraded

    This is the scenario where the circuit breaker is working exactly as designed. When an upstream slows down — due to a bad deploy, a database query regression, external dependency latency, or resource exhaustion — responses take longer. Slower responses mean connections stay active longer. Active connections accumulate. Eventually you hit the max_connections or max_pending_requests threshold and the breaker trips.

    The circuit breaker here is doing you a favor. It's preventing a slow upstream from exhausting all of Envoy's connection resources and dragging every other service sharing that proxy into the same degraded state. But you still need to fix the upstream.

    How to Identify It

    Upstream latency is the key differentiator between a threshold problem and an upstream problem. Pull the upstream request time histogram:

    curl -s http://192.168.1.10:9901/stats | grep "upstream_rq_time"
    cluster.api_service.upstream_rq_time: P0(nan,2) P25(nan,120) P50(nan,480) P75(nan,1890) P99(nan,8430)

    P50 at 480ms and P99 at 8.4 seconds is not a threshold issue — that upstream is struggling. Also check for host-level degradation:

    curl -s http://192.168.1.10:9901/clusters | grep -E "health_flags|rq_error"
    api_service::10.0.1.50:8080::health_flags::healthy
    api_service::10.0.1.51:8080::health_flags::/failed_active_hc
    api_service::10.0.1.52:8080::health_flags::healthy
    api_service::10.0.1.50:8080::rq_error: 0
    api_service::10.0.1.51:8080::rq_error: 1847
    api_service::10.0.1.52:8080::rq_error: 3

    One host is failing active health checks and generating 1847 request errors. That host is dragging the cluster down and contributing disproportionately to connection accumulation. Also check whether outlier detection has already kicked in:

    curl -s http://192.168.1.10:9901/stats | grep -E "outlier_detection|ejections"
    cluster.api_service.outlier_detection.ejections_active: 1
    cluster.api_service.outlier_detection.ejections_consecutive_5xx: 3
    cluster.api_service.outlier_detection.ejections_total: 3

    How to Fix It

    The circuit breaker is a band-aid here — restoring upstream health is the real fix. In the immediate term, if a specific host is clearly bad, take it out of rotation. Envoy's admin interface doesn't expose per-host removal, so drain it through whatever populates the cluster's membership: remove it from your service discovery source (EDS, DNS, or the registry feeding Envoy), or fail its health check endpoint so active health checking stops routing to it.

    Then investigate the actual upstream issue on that host. Check application logs, look for OOM kills, slow database queries, or thread pool saturation. Once the host is healthy, re-add it to the rotation and monitor its error rate before trusting it with full traffic.

    Proactively, configure outlier detection to automatically eject degraded hosts before they can accumulate enough stuck connections to trip the circuit breaker:

    outlier_detection:
      consecutive_5xx: 5
      interval: 10s
      base_ejection_time: 30s
      max_ejection_percent: 50
      consecutive_gateway_failure: 5
      enforcing_consecutive_5xx: 100
      enforcing_success_rate: 100
      success_rate_minimum_hosts: 3
      success_rate_request_volume: 100

    With this config, a host returning 5 consecutive 5xx errors gets ejected for at least 30 seconds (the 10s interval is the sweep period for Envoy's ejection analysis, not a window on the consecutive count). Envoy brings it back automatically after the ejection period and monitors for sustained recovery. This is the difference between a circuit breaker event that lasts 45 seconds and one that lasts 10 minutes.
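    One detail worth knowing: Envoy scales the ejection period by the number of times a host has been ejected, so repeat offenders stay out progressively longer, capped by max_ejection_time (300s by default). A sketch of that back-off:

```python
def ejection_seconds(base_s: int, ejection_count: int, max_s: int = 300) -> int:
    """Envoy-style outlier ejection back-off: base_ejection_time multiplied
    by the host's ejection count, capped at max_ejection_time."""
    return min(max_s, base_s * ejection_count)

# base_ejection_time: 30s, as in the config above
print([ejection_seconds(30, n) for n in range(1, 6)])  # [30, 60, 90, 120, 150]
```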


    Prevention

    The most reliable prevention strategy is treating circuit breaker thresholds as a first-class operational concern — not a set-it-and-forget-it default. Every service has different traffic characteristics, and the built-in defaults are intentionally conservative.

    Start by establishing baselines. Wire Envoy stats into your metrics pipeline — Prometheus and StatsD are both natively supported. Track upstream_cx_active, upstream_rq_active, and upstream_rq_pending_active as time series. Set your thresholds based on observed peaks plus meaningful headroom, and revisit them whenever traffic patterns change significantly or after major deployments.

    Enable outlier detection on every cluster that has more than one upstream host. This gives you automatic host ejection before a degraded instance can cause enough connection accumulation to trip the circuit breaker. Pair outlier detection with a health check interval short enough to detect degradation quickly — 10 seconds is a reasonable default for most services.

    Use retry budgets instead of fixed retry counts for any cluster handling real traffic. A fixed max_retries of 3 trips immediately under load or is meaningless for a quiet service. Retry budgets scale proportionally with request volume, which is what you actually want.

    Alert on overflow counters, not just 503 rates. By the time clients are seeing elevated error rates, the circuit breaker has been tripping for a while. Alert early:

    # Prometheus alerting rules for Envoy circuit breaker events
    - alert: EnvoyCircuitBreakerConnectionOverflow
      expr: rate(envoy_cluster_upstream_cx_overflow[5m]) > 1
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Envoy circuit breaker overflows on {{ $labels.envoy_cluster_name }}"
        description: "Check thresholds and upstream health on sw-infrarunbook-01"
    
    - alert: EnvoyUpstreamPendingOverflow
      expr: rate(envoy_cluster_upstream_rq_pending_overflow[5m]) > 5
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Pending request overflow on {{ $labels.envoy_cluster_name }}"
        description: "Upstream latency or capacity issue detected — review stats at http://192.168.1.10:9901"

    Circuit breaker events in production are rarely a simple "raise the limit" fix or purely an "upstream is broken" problem. They're almost always a signal that something has changed — traffic grew, latency increased, or a host degraded. The stats are there. You just need to read them in context instead of reaching for a config change as the first response.

    Frequently Asked Questions

    How do I tell if Envoy's circuit breaker is open right now?

    Check the circuit breaker gauge via the admin endpoint: curl -s http://192.168.1.10:9901/stats | grep cx_open. A value of 1 means the connection circuit breaker is open. You can also check rq_open for the request circuit breaker. Cross-reference with upstream_cx_overflow or upstream_rq_pending_overflow to confirm it's actively tripping.

    What does the x-envoy-overloaded response header mean?

    It means Envoy rejected the request due to a circuit breaker limit rather than forwarding it upstream. The upstream never saw the request. This is Envoy's intentional load shedding behavior to prevent resource exhaustion. You'll see it alongside HTTP 503 responses and incrementing overflow stat counters.

    What is the difference between max_connections and max_pending_requests?

    max_connections limits how many active TCP connections Envoy can hold open to all upstream hosts in a cluster simultaneously. max_pending_requests limits how many requests can queue up waiting for a connection to become available. You can hit the pending limit without hitting the connection limit — this typically happens when upstream latency spikes and existing connections slow down.

    Why does Envoy have a separate max_retries circuit breaker?

    Retries are a force multiplier on upstream load. If many requests are failing simultaneously and all trigger retries, the retry traffic can overwhelm an already-degraded upstream. The max_retries circuit breaker caps the number of concurrent retry attempts across the cluster, preventing retry storms from making a bad situation worse.

    Should I just set Envoy circuit breaker thresholds very high to avoid trips?

    No. Circuit breakers are a safety net — removing them by setting extremely high limits means a degraded upstream can exhaust all of Envoy's connection resources and take down adjacent services sharing the same proxy. The right approach is to baseline your normal traffic peaks and set thresholds at 2-3x that value, not to disable the protection entirely.
