InfraRunBook

    Envoy xDS API and Dynamic Configuration Explained

    Envoy
    Published: Apr 8, 2026
    Updated: Apr 8, 2026

    A deep-dive into Envoy's xDS API — how the discovery services work, why ADS ordering matters, and how to build a custom control plane for real production use.


    What Is the xDS API?

    If you've spent time configuring Envoy, you've probably started with a YAML or JSON static config file and reloaded the proxy every time something changed. That works fine in development. In production — where upstream endpoints change every few seconds, TLS certificates rotate on a schedule, and traffic weights shift during deployments — reloading a proxy for every config change is a non-starter. The xDS API is Envoy's answer to that problem.

    xDS stands for "x Discovery Service," where x is a placeholder for any of several resource types. It's a family of gRPC (and optionally REST) APIs that allow an external control plane to stream configuration to one or more running Envoy instances in real time. When an endpoint becomes unhealthy, the control plane pushes an updated endpoint list over the xDS stream, and Envoy reconfigures itself — no restart, no downtime, no file change on disk.

    Envoy was created at Lyft around 2015, and the xDS API was intentionally made generic enough that multiple control planes could implement it — and that's exactly what happened. Today Istio, Consul Connect, AWS App Mesh, and dozens of custom implementations all speak xDS. The current version is xDS v3 — if you're starting fresh, use v3. v2 is deprecated and the control plane ecosystem has moved on.

    The Discovery Services and What They Control

    There are six core services in the xDS family, each responsible for a different layer of Envoy's configuration model. Understanding what each one manages — and how they depend on each other — is essential before you try to implement or debug a control plane.

    LDS (Listener Discovery Service) manages Envoy's listeners: the combination of address, port, and filter chains that define how Envoy accepts connections. A listener config tells Envoy to bind 0.0.0.0:8080 and apply a specific HTTP connection manager filter, for example.

    RDS (Route Discovery Service) manages the virtual hosts and route rules that live inside an HTTP connection manager filter. Rather than embedding route configuration directly in a listener, you reference a named route config that gets streamed via RDS. This is what lets you do traffic shifting and path-based routing without touching the listener definition at all.

    CDS (Cluster Discovery Service) manages Envoy's cluster definitions — the named upstream services that Envoy knows how to connect to. A cluster defines the protocol, circuit breaker settings, health check configuration, and how to discover the actual endpoints (either statically or via EDS).

    EDS (Endpoint Discovery Service) manages the individual endpoints within a cluster. When a new pod comes up at 10.20.30.45:8080 and should start receiving traffic, EDS is how that information gets communicated to Envoy. This is typically the highest-churn data in most production systems.

    SDS (Secret Discovery Service) manages TLS certificates and private keys. Instead of loading certificates from disk and reloading Envoy when they rotate, SDS lets Vault or cert-manager push updated credentials over a secure gRPC stream. In my experience, this is one of the most underused services — I've seen teams set up automated cert rotation and then still manually reload Envoy because they simply weren't aware SDS existed.

    ADS (Aggregated Discovery Service) isn't a separate resource type. It's a transport layer that multiplexes all of the above over a single bidirectional gRPC stream. More on why this matters in a moment.
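
    To make the LDS/RDS split concrete, here's a sketch of an HTTP connection manager filter that pulls its route table over RDS instead of embedding it inline (the route config name local_routes is illustrative):

```yaml
# Listener filter fragment (sketch): the route table is fetched over RDS,
# so routes can change without the listener being re-created.
name: envoy.filters.network.http_connection_manager
typed_config:
  "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
  stat_prefix: ingress_http
  rds:
    route_config_name: local_routes   # illustrative name
    config_source:
      resource_api_version: V3
      ads: {}
  http_filters:
    - name: envoy.filters.http.router
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```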

    How the Discovery Protocol Actually Works

    Each xDS interaction follows a request-response pattern over a gRPC stream. The Envoy node sends a DiscoveryRequest specifying the resource type it wants (using a type URL like type.googleapis.com/envoy.config.cluster.v3.Cluster), and the control plane responds with a DiscoveryResponse containing the full list of resources of that type, a version string, and a nonce.

    Envoy then ACKs or NACKs the response. An ACK sends back the new version and nonce, signaling that Envoy applied the config successfully. A NACK sends back the last successfully applied version together with the nonce of the rejected response, plus error detail, indicating the new config was rejected — usually due to a validation error. A well-behaved control plane should not re-send a NACKed resource without making a change first.
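
    On the wire, a NACK is just another DiscoveryRequest. A trimmed sketch, with node metadata omitted and an illustrative error message:

```yaml
# NACK DiscoveryRequest (sketch): Envoy echoes the last-good version with
# the nonce of the rejected response and explains the failure.
version_info: "v4"        # last version Envoy successfully applied
type_url: type.googleapis.com/envoy.config.cluster.v3.Cluster
response_nonce: "n7"      # nonce of the rejected DiscoveryResponse
error_detail:
  code: 3                 # gRPC INVALID_ARGUMENT
  message: "cluster validation error: ..."   # illustrative
```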

    There are two protocol variants: State of the World (SotW) and Incremental (Delta xDS). In SotW mode, every DiscoveryResponse contains the complete list of resources of that type — even if only one endpoint changed, you send all endpoints. In Delta mode, you only send what changed. For EDS in a large cluster with thousands of endpoints across hundreds of services, Delta xDS makes a significant difference in bandwidth and CPU overhead. That said, Delta is more complex to implement correctly on the control plane side, so most teams start with SotW and migrate later if scale demands it.
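
    The difference shows up in the response shape. A Delta response carries only the changed resources plus explicit removals — a sketch with illustrative names and versions:

```yaml
# DeltaDiscoveryResponse (sketch): one updated endpoint assignment and one
# explicit removal. SotW would instead resend every assignment.
system_version_info: "v13"
type_url: type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment
resources:
  - name: api-service
    version: "13"
    resource:
      "@type": type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment
      cluster_name: api-service
      # endpoints elided
removed_resources:
  - retired-service
nonce: "8"
```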

    Static vs. Dynamic Configuration: The Bootstrap File

    Every Envoy process starts with a bootstrap configuration file, and this file is always static. It defines the node identity, the admin interface, and crucially, where to find the xDS control plane. Once Envoy connects to the control plane, everything else can be dynamic.

    Here's a minimal static bootstrap that defines a listener and cluster directly in the config file — no control plane involved:

    static_resources:
      listeners:
        - name: listener_0
          address:
            socket_address:
              address: 0.0.0.0
              port_value: 10000
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    stat_prefix: ingress_http
                    route_config:
                      virtual_hosts:
                        - name: backend
                          domains: ["*"]
                          routes:
                            - match: { prefix: "/" }
                              route: { cluster: upstream_service }
                    http_filters:
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
      clusters:
        - name: upstream_service
          connect_timeout: 0.25s
          type: STATIC
          load_assignment:
            cluster_name: upstream_service
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: 10.20.30.10
                          port_value: 8080

    Compare that to a dynamic bootstrap that delegates everything to a control plane running at 10.10.10.5:18000:

    node:
      id: envoy-node-01
      cluster: solvethenetwork-proxy-cluster
    
    dynamic_resources:
      ads_config:
        api_type: GRPC
        transport_api_version: V3
        grpc_services:
          - envoy_grpc:
              cluster_name: xds_control_plane
      cds_config:
        resource_api_version: V3
        ads: {}
      lds_config:
        resource_api_version: V3
        ads: {}
    
    static_resources:
      clusters:
        - name: xds_control_plane
          connect_timeout: 1s
          type: STATIC
          typed_extension_protocol_options:
            envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
              "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
              explicit_http_config:
                http2_protocol_options: {}
          load_assignment:
            cluster_name: xds_control_plane
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: 10.10.10.5
                          port_value: 18000
    
    admin:
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 9901

    Notice that the xds_control_plane cluster is still static — it has to be, because Envoy needs to know where to connect before it can receive any dynamic configuration. The admin interface is bound to loopback only, which is the right default for production. Don't expose the admin interface on a routable address unless you have a very good reason.

    Why ADS Ordering Matters

    Here's a subtle problem that bites teams when they use individual xDS streams — one gRPC connection per resource type — instead of ADS. Suppose your control plane sends a new CDS update that defines a cluster named backend-v2, and then immediately sends an LDS update that routes traffic to backend-v2. If those updates travel over separate streams, there's no guarantee Envoy processes them in order. Envoy might apply the LDS update first, try to route traffic to backend-v2, fail because that cluster doesn't exist yet, and either log a warning or black-hole traffic for a brief window.

    ADS solves this because all resource updates flow over a single ordered gRPC stream. The control plane can sequence CDS before LDS, and Envoy will process them in that order. If you're building anything beyond a trivial control plane, use ADS. The xDS spec calls out the recommended make-before-break ordering explicitly: CDS updates first, then EDS updates for those clusters, then LDS, then RDS for any newly added listeners — and only then remove stale clusters and endpoints. Always send cluster and endpoint updates before the listeners and routes that reference them.

    Real-World Example: Weighted Traffic Shifting with EDS

    One of the most practical uses of dynamic xDS in production is blue/green and canary deployments. Rather than configuring traffic weights in a separate load balancer layer, you can express them directly in the EDS response using locality load balancing weights. Here's what that looks like:

    cluster_name: api-service
    endpoints:
      - locality:
          region: us-east-1
          zone: us-east-1a
        load_balancing_weight: 80
        lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: 10.20.1.10
                  port_value: 8080
            load_balancing_weight: 100
            health_status: HEALTHY
          - endpoint:
              address:
                socket_address:
                  address: 10.20.1.11
                  port_value: 8080
            load_balancing_weight: 100
            health_status: HEALTHY
      - locality:
          region: us-east-1
          zone: us-east-1b
        load_balancing_weight: 20
        lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: 10.20.2.10
                  port_value: 8080
            load_balancing_weight: 100
            health_status: HEALTHY

    The first locality group (stable version) receives 80% of traffic; the second (canary) receives 20%. Note that the cluster must enable locality-weighted load balancing (common_lb_config: locality_weighted_lb_config: {}) for these weights to take effect. Your control plane can programmatically adjust the weights as the canary proves itself — shifting from 80/20 to 50/50 to 0/100 over time. The entire shift happens live: Envoy instances pick up the EDS update and immediately start redistributing load. No deploy, no reload, no maintenance window.
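
    The effective per-endpoint share is the locality weight (normalized across localities) multiplied by the endpoint weight (normalized within that locality). A quick sketch of the arithmetic for the assignment above:

```go
package main

import "fmt"

// Computes effective traffic share per endpoint for the EDS example above:
// share = (locality weight / sum of locality weights)
//       * (endpoint weight / sum of endpoint weights in that locality)
func main() {
	localities := []struct {
		name      string
		weight    float64
		endpoints []float64
	}{
		{"us-east-1a (stable)", 80, []float64{100, 100}},
		{"us-east-1b (canary)", 20, []float64{100}},
	}

	var totalLocality float64
	for _, l := range localities {
		totalLocality += l.weight
	}

	for _, l := range localities {
		var totalEndpoint float64
		for _, w := range l.endpoints {
			totalEndpoint += w
		}
		for i, w := range l.endpoints {
			share := (l.weight / totalLocality) * (w / totalEndpoint)
			fmt.Printf("%s endpoint %d: %.0f%% of traffic\n", l.name, i, share*100)
		}
	}
}
```

    Each stable endpoint ends up with 40% of traffic and the canary endpoint with 20%, matching the 80/20 locality split.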

    I've used this pattern to migrate services across availability zones with zero downtime. The alternative was updating HAProxy backends and reloading — slower, more error-prone, and it required human coordination between the deploy and the config change.

    Building a Custom xDS Control Plane

    If you need a control plane tightly integrated with your own infrastructure state — a custom service registry, an internal IPAM system, a home-grown health check framework — you don't have to bolt on Istio. The go-control-plane library from the Envoy project gives you the gRPC server scaffolding and the snapshot cache you need to get started quickly.

    Here's the minimal Go structure for a working xDS server using go-control-plane v3:

    package main
    
    import (
        "context"
        "log"
        "net"
    
        discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
        "github.com/envoyproxy/go-control-plane/pkg/cache/types"
        cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
        resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
        serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
        "google.golang.org/grpc"
    )
    
    func main() {
        // Snapshots are keyed by node ID (IDHash); "true" enables ADS mode.
        cache := cachev3.NewSnapshotCache(true, cachev3.IDHash{}, nil)
    
        // makeCluster/makeEndpoint/makeListener build the v3 protobuf resources.
        snapshot, err := cachev3.NewSnapshot("v1",
            map[resourcev3.Type][]types.Resource{
                resourcev3.ClusterType:  {makeCluster("api-service", "10.20.30.10", 8080)},
                resourcev3.EndpointType: {makeEndpoint("api-service", "10.20.30.10", 8080)},
                resourcev3.ListenerType: {makeListener("listener_0", 10000, "api-service")},
            },
        )
        if err != nil {
            log.Fatalf("inconsistent snapshot: %v", err)
        }
    
        if err := cache.SetSnapshot(context.Background(), "envoy-node-01", snapshot); err != nil {
            log.Fatalf("set snapshot: %v", err)
        }
    
        grpcServer := grpc.NewServer()
        xdsServer := serverv3.NewServer(context.Background(), cache, nil)
        discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, xdsServer)
    
        lis, err := net.Listen("tcp", "10.10.10.5:18000")
        if err != nil {
            log.Fatalf("listen: %v", err)
        }
        log.Fatal(grpcServer.Serve(lis))
    }

    The snapshot cache is the key abstraction here. You call SetSnapshot whenever your infrastructure state changes — a new pod registers, a health check fails, a certificate rotates — and the cache handles diffing the state and pushing the right updates to each connected Envoy node. In production you'd replace the hardcoded addresses with lookups from your service registry or the Kubernetes API, but the structure stays exactly the same. I've seen teams run a few hundred Envoy proxies with a control plane under 300 lines of Go that reads from etcd. It works reliably and is far easier to debug than a full Istio installation when something goes wrong at 3am.
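
    The update pattern is simple enough to sketch with the standard library alone. This is not go-control-plane's API — just the shape of the pattern it implements: every state change produces a new immutable snapshot under a bumped version string, keyed by node:

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// Snapshot is a minimal stand-in for a per-node configuration snapshot.
type Snapshot struct {
	Version   string
	Endpoints []string
}

// Cache mimics the snapshot-cache pattern: immutable snapshots per node,
// each under a monotonically increasing version string.
type Cache struct {
	mu      sync.Mutex
	version int
	byNode  map[string]Snapshot
}

func NewCache() *Cache { return &Cache{byNode: map[string]Snapshot{}} }

// SetEndpoints installs a new snapshot for a node; in a real control plane
// this is the point where the diff would be pushed to the connected Envoy.
func (c *Cache) SetEndpoints(node string, eps []string) Snapshot {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.version++
	snap := Snapshot{Version: "v" + strconv.Itoa(c.version), Endpoints: eps}
	c.byNode[node] = snap
	return snap
}

func main() {
	cache := NewCache()
	cache.SetEndpoints("envoy-node-01", []string{"10.20.30.10:8080"})

	// A new pod registers: rebuild the full snapshot under a new version.
	snap := cache.SetEndpoints("envoy-node-01",
		[]string{"10.20.30.10:8080", "10.20.30.11:8080"})
	fmt.Println(snap.Version, len(snap.Endpoints)) // prints: v2 2
}
```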

    Observability and Debugging xDS

    When xDS config isn't being applied as expected, start with Envoy's admin interface. The /config_dump endpoint at http://127.0.0.1:9901/config_dump gives you the full current state of all dynamic and static resources — including which resources were last updated, their version strings, and any pending changes. The /clusters endpoint shows current endpoint health status per cluster, which is invaluable when EDS updates aren't landing the way you expect.

    For control plane debugging, enable xDS logging on the Envoy side with --component-log-level config:debug. You'll see every DiscoveryRequest and DiscoveryResponse, including the type URL, version string, nonce, and whether Envoy ACKed or NACKed. It's verbose, but it's indispensable when a config update isn't applying and you can't figure out why.

    Common Misconceptions About xDS

    "xDS is only for service meshes." Not true. xDS is Envoy's config API, full stop. You can use it for any Envoy deployment — standalone edge proxies, API gateways, database proxies, sidecars. If you're running Envoy in front of a PostgreSQL cluster and want endpoint failover without restarts, xDS is exactly the right tool, service mesh or not.

    "You need Istio to use dynamic configuration." Also not true. Istio is one control plane implementation, and a heavyweight one at that. If Istio's operational complexity doesn't fit your environment, you can run go-control-plane directly, use Consul's built-in xDS server, or write a custom control plane. The xDS API is an open spec — that's the whole point.

    "xDS updates are atomic across all proxies." This is a dangerous assumption. Each Envoy proxy maintains its own connection to the control plane and processes updates independently. During any config push, you'll have a window where different proxies are running different configurations. Design your systems to tolerate this. Add new clusters and listeners before removing old ones — never simultaneously. The safest approach is always additive first, then removal after convergence.

    "SotW and Delta xDS are interchangeable at the protocol level." They're not, and mixing them up on the control plane side causes subtle bugs. In SotW mode, an empty resource list in a DiscoveryResponse means "delete everything of this type." In Delta mode, deletions are explicit. If you're migrating a control plane from SotW to Delta, test that edge case carefully — I've seen it silently remove all listeners on a control plane restart, which made for a very eventful incident review.
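
    The dangerous SotW edge case is easy to show. This response, for listeners, means "delete every listener" — not "nothing changed":

```yaml
# SotW DiscoveryResponse with an empty resource list (sketch). A control
# plane restarting with a cold cache can emit this by accident and wipe
# every listener on every connected proxy.
version_info: "v1"
type_url: type.googleapis.com/envoy.config.listener.v3.Listener
resources: []
nonce: "1"
```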

    "The xDS stream is stateless." It isn't. The control plane is expected to track what each Envoy node has acknowledged and respond appropriately. The version strings and nonces in DiscoveryRequest and DiscoveryResponse are how both sides maintain stream state. A control plane that ignores nonces will either spam unnecessary updates or fail to converge correctly after a NACK arrives — and that failure mode tends to be silent until something goes badly wrong under load.

    Dynamic configuration via xDS is what separates Envoy from a sophisticated-but-static proxy. Once you internalize the resource model and the ordering constraints, building reliable control planes becomes straightforward work — and the operational leverage you get compounds quickly as your infrastructure scales.

    Frequently Asked Questions

    What is the difference between xDS v2 and xDS v3?

    xDS v3 introduced breaking changes to the protobuf type URLs, field names, and package namespaces compared to v2. v2 is officially deprecated and most control plane libraries, including go-control-plane, have dropped or are dropping v2 support. If you're starting a new deployment, use v3 exclusively. Migrating from v2 to v3 requires updating both the control plane and the Envoy bootstrap config to reference v3 type URLs.

    Can Envoy use xDS without a full service mesh like Istio?

    Yes. xDS is an open API specification that any control plane can implement. You can run go-control-plane, Consul's xDS server, or a custom control plane without deploying Istio at all. Istio is one control plane implementation — not a requirement for using dynamic xDS configuration.

    What happens to Envoy if the xDS control plane goes down?

    Envoy continues operating with the last successfully applied configuration. It will periodically retry connecting to the control plane and resume receiving updates when the connection is restored. No traffic interruption occurs as long as the cached configuration remains valid. This is by design — Envoy is intended to be resilient to control plane unavailability.

    What is the recommended ordering for pushing xDS resource types?

    The xDS spec recommends a make-before-break order: CDS first, then EDS for the affected clusters, then LDS, then RDS for newly added listeners, with stale resources removed last. This ensures that clusters and their endpoints exist before listeners reference them, and that route configs are present before listeners activate them. ADS makes this ordering practical to guarantee, because all updates flow over a single gRPC stream — which is why ADS is strongly preferred over separate per-resource streams.

    How does Envoy signal that an xDS config update failed validation?

    Envoy sends a NACK — a DiscoveryRequest that includes the previous (last-good) version string, the nonce of the rejected response, and an error_detail field describing what failed validation. The control plane should detect the NACK, log the error detail, and not resend the same broken resource until a corrected version is available.
