What Is the xDS API?
If you've spent time configuring Envoy, you've probably started with a YAML or JSON static config file and reloaded the proxy every time something changed. That works fine in development. In production — where upstream endpoints change every few seconds, TLS certificates rotate on a schedule, and traffic weights shift during deployments — reloading a proxy for every config change is a non-starter. The xDS API is Envoy's answer to that problem.
xDS stands for "x Discovery Service," where x is a placeholder for any of several resource types. It's a family of gRPC (and optionally REST) APIs that allow an external control plane to stream configuration to one or more running Envoy instances in real time. When an endpoint becomes unhealthy, the control plane pushes an updated endpoint list over the xDS stream, and Envoy reconfigures itself — no restart, no downtime, no file change on disk.
The API originated with Envoy itself, which Lyft built starting around 2015 and open-sourced in 2016. It was intentionally made generic enough that multiple control planes could implement it, and that's exactly what happened. Today Istio, Consul Connect, AWS App Mesh, and dozens of custom implementations all speak xDS. The current version is xDS v3 — if you're starting fresh, use v3. v2 is deprecated and the control plane ecosystem has moved on.
The Discovery Services and What They Control
There are six core services in the xDS family, each responsible for a different layer of Envoy's configuration model. Understanding what each one manages — and how they depend on each other — is essential before you try to implement or debug a control plane.
LDS (Listener Discovery Service) manages Envoy's listeners: the combination of address, port, and filter chains that define how Envoy accepts connections. A listener config tells Envoy to bind 0.0.0.0:8080 and apply a specific HTTP connection manager filter, for example.
RDS (Route Discovery Service) manages the virtual hosts and route rules that live inside an HTTP connection manager filter. Rather than embedding route configuration directly in a listener, you reference a named route config that gets streamed via RDS. This is what lets you do traffic shifting and path-based routing without touching the listener definition at all.
CDS (Cluster Discovery Service) manages Envoy's cluster definitions — the named upstream services that Envoy knows how to connect to. A cluster defines the protocol, circuit breaker settings, health check configuration, and how to discover the actual endpoints (either statically or via EDS).
EDS (Endpoint Discovery Service) manages the individual endpoints within a cluster. When a new pod comes up at 10.20.30.45:8080 and should start receiving traffic, EDS is how that information gets communicated to Envoy. This is typically the highest-churn data in most production systems.
SDS (Secret Discovery Service) manages TLS certificates and private keys. Instead of loading certificates from disk and reloading Envoy when they rotate, SDS lets Vault or cert-manager push updated credentials over a secure gRPC stream. In my experience, this is one of the most underused services — I've seen teams set up automated cert rotation and then still manually reload Envoy because they simply weren't aware SDS existed.
ADS (Aggregated Discovery Service) isn't a separate resource type. It's a transport layer that multiplexes all of the above over a single bidirectional gRPC stream. More on why this matters in a moment.
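To make those dependencies concrete, here are two fragments (resource names are hypothetical) showing how the layers reference each other by name: a listener's HTTP connection manager pointing at a route configuration served over RDS, and a cluster declaring that its endpoints come from EDS:

```yaml
# Inside an LDS listener's HTTP connection manager:
rds:
  route_config_name: app-routes      # resolved via RDS, not embedded inline
  config_source:
    resource_api_version: V3
    ads: {}

# A CDS cluster whose endpoints are discovered via EDS:
name: api-service
type: EDS
eds_cluster_config:
  eds_config:
    resource_api_version: V3
    ads: {}
```

The pattern is the same at every layer: the outer resource carries only a name and a config source, and Envoy subscribes to the inner resource separately.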
How the Discovery Protocol Actually Works
Each xDS interaction follows a request-response pattern over a gRPC stream. The Envoy node sends a DiscoveryRequest specifying the resource type it wants (using a type URL like `type.googleapis.com/envoy.config.cluster.v3.Cluster`), and the control plane responds with a DiscoveryResponse containing the full list of resources of that type, a version string, and a nonce.
Envoy then ACKs or NACKs the response. An ACK echoes the response's version and nonce, signaling that Envoy applied the config successfully. A NACK echoes the nonce of the rejected response but the previously applied version, along with an error detail — the new config was rejected, usually due to a validation error. A well-behaved control plane should not re-send a NACKed resource without making a change first.
There are two protocol variants: State of the World (SotW) and Incremental (Delta xDS). In SotW mode, every DiscoveryResponse contains the complete list of resources of that type — even if only one endpoint changed, you send all endpoints. In Delta mode, you only send what changed. For EDS in a large cluster with thousands of endpoints across hundreds of services, Delta xDS makes a significant difference in bandwidth and CPU overhead. That said, Delta is more complex to implement correctly on the control plane side, so most teams start with SotW and migrate later if scale demands it.
Static vs. Dynamic Configuration: The Bootstrap File
Every Envoy process starts with a bootstrap configuration file, and this file is always static. It defines the node identity, the admin interface, and crucially, where to find the xDS control plane. Once Envoy connects to the control plane, everything else can be dynamic.
Here's a minimal static bootstrap that defines a listener and cluster directly in the config file — no control plane involved:
```yaml
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - name: backend
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: upstream_service }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: upstream_service
    connect_timeout: 0.25s
    type: STATIC
    load_assignment:
      cluster_name: upstream_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 10.20.30.10
                port_value: 8080
```
Compare that to a dynamic bootstrap that delegates everything to a control plane running at 10.10.10.5:18000:
```yaml
node:
  id: envoy-node-01
  cluster: solvethenetwork-proxy-cluster
dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: xds_control_plane
  cds_config:
    resource_api_version: V3
    ads: {}
  lds_config:
    resource_api_version: V3
    ads: {}
static_resources:
  clusters:
  - name: xds_control_plane
    connect_timeout: 1s
    type: STATIC
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: xds_control_plane
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 10.10.10.5
                port_value: 18000
admin:
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9901
```
Notice that the `xds_control_plane` cluster is still static — it has to be, because Envoy needs to know where to connect before it can receive any dynamic configuration. The admin interface is bound to loopback only, which is the right default for production. Don't expose the admin interface on a routable address unless you have a very good reason.
Why ADS Ordering Matters
Here's a subtle problem that bites teams when they use individual xDS streams — one gRPC connection per resource type — instead of ADS. Suppose your control plane sends a CDS update that defines a new cluster named `backend-v2`, and then immediately sends an LDS update that routes traffic to `backend-v2`. If those updates travel over separate streams, there's no guarantee Envoy processes them in order. Envoy might apply the LDS update first, try to route traffic to `backend-v2`, fail because that cluster doesn't exist yet, and either log a warning or black-hole traffic for a brief window.
ADS solves this because all resource updates flow over a single ordered gRPC stream. The control plane can sequence CDS before LDS, and Envoy will process them in that order. If you're building anything beyond a trivial control plane, use ADS. The xDS spec calls out the recommended push ordering explicitly: CDS first, then the EDS updates for those clusters, then LDS, then the RDS updates for the new listeners — with stale resources removed only after nothing references them. Always send cluster and endpoint updates before the listeners that reference them.
Real-World Example: Weighted Traffic Shifting with EDS
One of the most practical uses of dynamic xDS in production is blue/green and canary deployments. Rather than configuring traffic weights in a separate load balancer layer, you can express them directly in the EDS response using locality load balancing weights. Here's what that looks like:
```yaml
cluster_name: api-service
endpoints:
- locality:
    region: us-east-1
    zone: us-east-1a
  load_balancing_weight: 80
  lb_endpoints:
  - endpoint:
      address:
        socket_address:
          address: 10.20.1.10
          port_value: 8080
    load_balancing_weight: 100
    health_status: HEALTHY
  - endpoint:
      address:
        socket_address:
          address: 10.20.1.11
          port_value: 8080
    load_balancing_weight: 100
    health_status: HEALTHY
- locality:
    region: us-east-1
    zone: us-east-1b
  load_balancing_weight: 20
  lb_endpoints:
  - endpoint:
      address:
        socket_address:
          address: 10.20.2.10
          port_value: 8080
    load_balancing_weight: 100
    health_status: HEALTHY
```
The first locality group (stable version) gets 80% of traffic; the second (canary) gets 20%. Note that Envoy only honors these locality weights when the cluster enables locality-weighted load balancing (`common_lb_config.locality_weighted_lb_config`). Your control plane can programmatically adjust the weights as the canary proves itself — shifting from 80/20 to 50/50 to 0/100 over time. The entire shift happens live. Envoy instances pick up the EDS update and immediately start redistributing load. No deploy, no reload, no maintenance window.
I've used this pattern to migrate services across availability zones with zero downtime. The alternative was updating HAProxy backends and reloading — slower, more error-prone, and it required human coordination between the deploy and the config change.
Building a Custom xDS Control Plane
If you need a control plane tightly integrated with your own infrastructure state — a custom service registry, an internal IPAM system, a home-grown health check framework — you don't have to bolt on Istio. The `go-control-plane` library from the Envoy project gives you the gRPC server scaffolding and the snapshot cache you need to get started quickly.
Here's the minimal Go structure for a working xDS server using go-control-plane v3:
```go
package main

import (
	"context"
	"log"
	"net"

	discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
)

func main() {
	// ADS-enabled snapshot cache, keyed by node ID.
	cache := cachev3.NewSnapshotCache(true, cachev3.IDHash{}, nil)

	// makeCluster, makeEndpoint, and makeListener build the corresponding
	// v3 resource protos; their implementations are elided here.
	snapshot, err := cachev3.NewSnapshot("v1",
		map[resourcev3.Type][]types.Resource{
			resourcev3.ClusterType:  {makeCluster("api-service", "10.20.30.10", 8080)},
			resourcev3.EndpointType: {makeEndpoint("api-service", "10.20.30.10", 8080)},
			resourcev3.ListenerType: {makeListener("listener_0", 10000, "api-service")},
		},
	)
	if err != nil {
		log.Fatal(err)
	}
	if err := cache.SetSnapshot(context.Background(), "envoy-node-01", snapshot); err != nil {
		log.Fatal(err)
	}

	grpcServer := grpc.NewServer()
	xdsServer := serverv3.NewServer(context.Background(), cache, nil)
	discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, xdsServer)

	lis, err := net.Listen("tcp", "10.10.10.5:18000")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(grpcServer.Serve(lis))
}
```
The snapshot cache is the key abstraction here. You call `SetSnapshot` whenever your infrastructure state changes — a new pod registers, a health check fails, a certificate rotates — and the cache handles diffing the state and pushing the right updates to each connected Envoy node. In production you'd replace the hardcoded addresses with lookups from your service registry or the Kubernetes API, but the structure stays exactly the same. I've seen teams run a few hundred Envoy proxies with a control plane under 300 lines of Go that reads from etcd. It works reliably and is far easier to debug than a full Istio installation when something goes wrong at 3am.
Observability and Debugging xDS
When xDS config isn't being applied as expected, start with Envoy's admin interface. The `/config_dump` endpoint at `http://127.0.0.1:9901/config_dump` gives you the full current state of all dynamic and static resources — including which resources were last updated, their version strings, and any pending changes. The `/clusters` endpoint shows current endpoint health status per cluster, which is invaluable when EDS updates aren't landing the way you expect.
For control plane debugging, enable xDS logging on the Envoy side with `--component-log-level config:debug`. You'll see every DiscoveryRequest and DiscoveryResponse, including the type URL, version string, nonce, and whether Envoy ACKed or NACKed. It's verbose, but it's indispensable when a config update isn't applying and you can't figure out why.
Common Misconceptions About xDS
"xDS is only for service meshes." Not true. xDS is Envoy's config API, full stop. You can use it for any Envoy deployment — standalone edge proxies, API gateways, database proxies, sidecars. If you're running Envoy in front of a PostgreSQL cluster and want endpoint failover without restarts, xDS is exactly the right tool, service mesh or not.
"You need Istio to use dynamic configuration." Also not true. Istio is one control plane implementation, and a heavyweight one at that. If Istio's operational complexity doesn't fit your environment, you can run go-control-plane directly, use Consul's built-in xDS server, or write a custom control plane. The xDS API is an open spec — that's the whole point.
"xDS updates are atomic across all proxies." This is a dangerous assumption. Each Envoy proxy maintains its own connection to the control plane and processes updates independently. During any config push, you'll have a window where different proxies are running different configurations. Design your systems to tolerate this. Add new clusters and listeners before removing old ones — never simultaneously. The safest approach is always additive first, then removal after convergence.
"SotW and Delta xDS are interchangeable at the protocol level." They're not, and mixing them up on the control plane side causes subtle bugs. In SotW mode, for wildcard subscriptions like LDS and CDS, an empty resource list in a DiscoveryResponse means "delete everything of this type." In Delta mode, deletions are explicit. If you're migrating a control plane from SotW to Delta, test that edge case carefully — I've seen it silently remove all listeners on a control plane restart, which made for a very eventful incident review.
"The xDS stream is stateless." It isn't. The control plane is expected to track what each Envoy node has acknowledged and respond appropriately. The version strings and nonces in DiscoveryRequest and DiscoveryResponse are how both sides maintain stream state. A control plane that ignores nonces will either spam unnecessary updates or fail to converge correctly after a NACK arrives — and that failure mode tends to be silent until something goes badly wrong under load.
Dynamic configuration via xDS is what separates Envoy from a sophisticated-but-static proxy. Once you internalize the resource model and the ordering constraints, building reliable control planes becomes straightforward work — and the operational leverage you get compounds quickly as your infrastructure scales.
