Symptoms
You've deployed Envoy as your edge proxy or service mesh sidecar, everything looks fine in the config, the process is running — and then nothing works. TLS connections are failing and you're staring at a wall of cryptic output. In my experience, Envoy TLS failures tend to manifest in a few recognizable patterns, and knowing which pattern you're in cuts the debugging time dramatically.
On the client side, you'll typically see errors like these:
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to 10.10.1.45:8443
curl: (60) SSL certificate problem: unable to get local issuer certificate
curl: (56) OpenSSL SSL_read: error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate
Inside Envoy's access logs or admin interface, you might see upstream connection failures, TLS handshake timeouts, or listeners that are bound but silently dropping connections. That last scenario is arguably more frustrating to debug than a loud crash. A 503 Service Unavailable with the UF,URX upstream flag combination is a classic tell that something is wrong at the transport layer.
With debug logging enabled (--log-level debug), Envoy will surface errors like:
tls: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
tls: TLS error: 268435612:SSL routines:OPENSSL_internal:NO_SHARED_CIPHER
tls: handshake error: connection error: desc = "transport: authentication handshake failed"
Let's work through the most common root causes one by one.
Root Cause 1: Certificate Not Loaded
This one sounds obvious but it catches people constantly. Envoy has accepted your configuration, the process is running, the listener is up — but the certificate was never actually loaded into the TLS context. This happens most often in two scenarios: you're using the xDS SDS API and the secret hasn't been delivered yet, or you're using static config and the file path is wrong or unreadable by the Envoy process user.
With static config, Envoy reads the certificate at startup. If the path is wrong, the error surfaces immediately and the process either fails to start or — if the listener is otherwise structurally valid — the TLS filter chain quietly has no certificate. With SDS, timing matters: if your secret management service hasn't pushed the secret yet, Envoy will sit waiting. Connections will fail until the secret arrives, and that window can be longer than you'd expect during a fresh deployment.
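For orientation, the SDS wiring in a downstream TLS context looks roughly like this — a sketch, not a config from this environment; the secret name tls_sds and the sds_server cluster are placeholders that must match what your SDS server actually serves:

```yaml
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      tls_certificate_sds_secret_configs:
      - name: tls_sds              # must match the secret name the SDS server pushes
        sds_config:
          resource_api_version: V3
          api_config_source:
            api_type: GRPC
            transport_api_version: V3
            grpc_services:
            - envoy_grpc:
                cluster_name: sds_server   # assumed name of your SDS cluster
```

Until a secret with that exact name arrives over this channel, the listener has no certificate to serve.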
To identify this, check Envoy's live config via the admin interface on sw-infrarunbook-01:
curl -s http://10.10.1.45:9901/config_dump | python3 -m json.tool | grep -A 20 "tls_context"
For the SDS case, check the active secrets section directly:
curl -s http://10.10.1.45:9901/config_dump?resource=dynamic_active_secrets
If the output is empty or missing your expected secret name, the certificate has not been delivered. Check the SDS stats endpoint next:
curl -s http://10.10.1.45:9901/stats | grep sds
sds.tls_certs.update_success: 0
sds.tls_certs.update_failure: 1
sds.tls_certs.update_rejected: 0
An incrementing update_failure counter is a dead giveaway. For static configs, verify the file path and permissions — Envoy typically runs as a non-root user and can't read files owned by root with mode 600:
ls -la /etc/envoy/certs/solvethenetwork.com.crt
# -rw------- 1 root root 2891 Apr 10 14:22 /etc/envoy/certs/solvethenetwork.com.crt
ps aux | grep envoy
# envoy 1234 0.0 0.1 ...
# Fix ownership so the envoy process user can read it
chown envoy:envoy /etc/envoy/certs/solvethenetwork.com.crt
chmod 640 /etc/envoy/certs/solvethenetwork.com.crt
For SDS delivery failures, ensure your SDS server is reachable from the Envoy pod or process and has the secret available to push. For static config path issues, correct the path in your Envoy YAML and hot-reload or restart the process.
Root Cause 2: Wrong Listener Filter Chain
Envoy's listener model is powerful, but the filter chain matching logic trips people up more than almost anything else. A listener can have multiple filter chains, each with distinct match criteria — SNI, destination port, transport protocol, and so on. If an incoming TLS connection doesn't match the intended filter chain, it may fall through to a non-TLS chain, get rejected entirely, or silently terminate.
This is especially common when a listener handles both TLS and plaintext traffic, or when you've added SNI-based routing. The filter chain match happens before the TLS handshake completes, using the SNI value from the TLS ClientHello. If the SNI the client sends doesn't match any filter chain's server_names list, that chain won't be selected — and if there's no catch-all chain, the connection is dropped.
Identify this by inspecting your listener config dump:
curl -s http://10.10.1.45:9901/config_dump | python3 -m json.tool | grep -A 60 '"listeners"'
Then test connectivity with and without an explicit SNI to isolate the mismatch:
# Without SNI — will fail if SNI matching is required
curl -v --cacert /etc/envoy/certs/ca-bundle.crt https://10.10.1.45:8443/
# With SNI matching what your filter chain expects
curl -v --cacert /etc/envoy/certs/ca-bundle.crt \
--resolve api.solvethenetwork.com:8443:10.10.1.45 \
https://api.solvethenetwork.com:8443/health
Envoy's debug logs will show which filter chain was selected — or that none matched:
[debug][filter] [source/server/listener_impl.cc:282] new connection accepted
[debug][filter] no filter chain found for connection
That second line is the smoking gun. The fix is to ensure your filter chain match criteria cover the SNI your clients are sending. If you want a catch-all, include a filter chain with an empty filter_chain_match block — it acts as a default for any connection that doesn't match a more specific chain:
filter_chains:
- filter_chain_match:
    server_names:
    - "api.solvethenetwork.com"
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
      common_tls_context:
        tls_certificates:
        - certificate_chain:
            filename: /etc/envoy/certs/solvethenetwork.com.crt
          private_key:
            filename: /etc/envoy/certs/solvethenetwork.com.key
- filter_chain_match: {}  # catch-all for unmatched or absent SNI
  filters:
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: ingress_tcp
      cluster: fallback_cluster
Root Cause 3: mTLS Client Certificate Not Provided
Mutual TLS requires both sides to present certificates. When you enable require_client_certificate: true in Envoy's downstream TLS context, any client that doesn't send a certificate will be rejected at the handshake stage. This is frequently the issue during rolling deployments where new pods require mTLS but some clients haven't yet received their certificates, or where a client application hasn't been configured to present one at all.
I've seen this bite teams repeatedly during service mesh migrations. The new Envoy sidecar is configured, mTLS is required, and then half the calls from legacy clients start failing. The error on the client side is:
curl: (35) error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate
On the Envoy side with debug logging:
[debug][tls] TLS error: 268436487:SSL routines:OPENSSL_internal:PEER_DID_NOT_RETURN_A_CERTIFICATE
Confirm that mTLS is required by checking the downstream TLS context in the running config:
curl -s http://10.10.1.45:9901/config_dump | python3 -m json.tool | grep -B2 -A5 "require_client_certificate"
# Output:
"require_client_certificate": true,
"validation_context": {
    "trusted_ca": {
        "filename": "/etc/envoy/certs/ca-bundle.crt"
    }
}
Verify that providing a client certificate resolves it:
# Fails — no client cert sent
curl -v --cacert /etc/envoy/certs/ca-bundle.crt \
https://api.solvethenetwork.com:8443/health
# Works — client cert and key provided
curl -v \
--cacert /etc/envoy/certs/ca-bundle.crt \
--cert /etc/envoy/certs/client.crt \
--key /etc/envoy/certs/client.key \
https://api.solvethenetwork.com:8443/health
The fix depends on your deployment model. In a service mesh environment, ensure the client workload's sidecar has been injected and its SPIFFE certificate has been issued by the mesh CA. For manual configurations, distribute the correct client cert and key material to each client. If you need a transitional mode — mTLS not yet universally deployed — set require_client_certificate: false. Envoy will still validate client certs when they are provided, but won't require them, letting you migrate incrementally without a hard cutover.
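Putting the pieces together, a downstream TLS context requiring client certificates looks roughly like this — a sketch reusing the cert paths from earlier in this section; flip require_client_certificate to false for the transitional mode:

```yaml
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    require_client_certificate: true   # false = validate if presented, don't require
    common_tls_context:
      tls_certificates:
      - certificate_chain:
          filename: /etc/envoy/certs/solvethenetwork.com.crt
        private_key:
          filename: /etc/envoy/certs/solvethenetwork.com.key
      validation_context:
        trusted_ca:
          filename: /etc/envoy/certs/ca-bundle.crt
```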
Root Cause 4: CA Bundle Wrong
A misconfigured or incomplete CA bundle is responsible for more Envoy TLS failures than most engineers expect. It shows up in two directions: Envoy can't verify the upstream server's certificate because the upstream's issuing CA isn't in the bundle, or downstream clients can't verify Envoy's certificate because the issuing CA isn't in what they trust.
The subtler version — and the one that really wastes time — is an incomplete chain. Your server might present only its leaf certificate without the intermediate CA. Or your bundle might contain the root but not the intermediate. OpenSSL is strict: if it can't build a complete chain to a trusted root, it rejects the connection regardless of how valid the leaf certificate looks on its own.
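You can reproduce the incomplete-chain failure in isolation with a throwaway three-tier PKI — everything below uses temp files and made-up names, nothing from the hosts in this runbook:

```shell
set -e
d=$(mktemp -d)
printf "basicConstraints=CA:TRUE\n" > "$d/ca.ext"

# Throwaway root CA
openssl req -x509 -newkey rsa:2048 -nodes -days 2 \
  -keyout "$d/root.key" -out "$d/root.crt" -subj "/CN=Demo Root CA" 2>/dev/null

# Intermediate CA signed by the root
openssl req -newkey rsa:2048 -nodes \
  -keyout "$d/int.key" -out "$d/int.csr" -subj "/CN=Demo Intermediate CA" 2>/dev/null
openssl x509 -req -days 2 -in "$d/int.csr" -CA "$d/root.crt" -CAkey "$d/root.key" \
  -CAcreateserial -extfile "$d/ca.ext" -out "$d/int.crt" 2>/dev/null

# Leaf signed by the intermediate
openssl req -newkey rsa:2048 -nodes \
  -keyout "$d/leaf.key" -out "$d/leaf.csr" -subj "/CN=leaf.example.test" 2>/dev/null
openssl x509 -req -days 2 -in "$d/leaf.csr" -CA "$d/int.crt" -CAkey "$d/int.key" \
  -CAcreateserial -out "$d/leaf.crt" 2>/dev/null

# Root alone cannot verify the leaf: the intermediate is missing
openssl verify -CAfile "$d/root.crt" "$d/leaf.crt" 2>/dev/null || echo "fails without intermediate"

# Bundle containing the intermediate verifies
cat "$d/int.crt" "$d/root.crt" > "$d/bundle.crt"
openssl verify -CAfile "$d/bundle.crt" "$d/leaf.crt"
```

The leaf is perfectly valid in both cases; only the verifier's view of the chain changes.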
Verify the certificate chain your upstream is actually presenting:
openssl s_client -connect 10.10.1.55:443 -showcerts </dev/null 2>/dev/null | \
awk '/^Certificate chain/,/^---/'
# You want to see every link, leaf up to the intermediate:
Certificate chain
 0 s:CN=api-backend.solvethenetwork.com
   i:CN=SolveNetwork Intermediate CA G2
# That intermediate CA must be in your trust bundle
Check what's actually in your CA bundle file:
openssl crl2pkcs7 -nocrl -certfile /etc/envoy/certs/ca-bundle.crt | \
openssl pkcs7 -print_certs -noout | grep subject
# All intermediates in the upstream's chain must appear here
Confirm the validation context in Envoy's running cluster config is pointing to the correct bundle:
curl -s http://10.10.1.45:9901/config_dump | python3 -m json.tool | grep -A 10 "validation_context"
The fix is to rebuild your CA bundle to include all necessary intermediates in the correct order — intermediates first, root last:
cat /etc/envoy/certs/intermediate-ca-g2.crt \
/etc/envoy/certs/root-ca.crt \
> /etc/envoy/certs/ca-bundle.crt
# Verify the bundle
openssl verify -CAfile /etc/envoy/certs/ca-bundle.crt \
/etc/envoy/certs/solvethenetwork.com.crt
# Expected: /etc/envoy/certs/solvethenetwork.com.crt: OK
After updating the bundle, get Envoy to actually pick it up. Envoy does not re-read static certificate files on a signal — with plain filename references you need a restart (or hot restart) for the change to take effect, while filesystem-based SDS with a watched directory picks up rotations automatically. If you're using SDS over the network, push the updated bundle through your secret management system and confirm the SDS update success counter increments.
Root Cause 5: ALPN Mismatch
Application-Layer Protocol Negotiation sits at the intersection of TLS and HTTP version negotiation. During the TLS handshake, client and server exchange lists of supported application protocols — typically h2 for HTTP/2 and http/1.1 for HTTP/1.1. If they can't agree on a protocol, the handshake either fails outright or the connection falls back in unexpected ways that don't look like TLS problems at first glance.
Envoy is opinionated about ALPN. When you configure an HTTP/2 upstream cluster using http2_protocol_options, Envoy advertises only h2 to that upstream. If the upstream server doesn't support HTTP/2, you'll get a negotiation failure. The reverse is equally problematic — if your downstream clients negotiate HTTP/2 but your Envoy listener isn't configured to handle it, connections establish but then reset immediately on the first real request.
The error often looks like a protocol error rather than a TLS error, which makes it easy to chase the wrong thing:
[debug][http2] [source/common/http/http2/codec_impl.cc] invalid frame type 0x4 (SETTINGS) received
[debug][connection] closing data_to_write=0 type=1
[warning][connection] [source/common/network/connection_impl.cc:763] remote close
Use openssl's s_client to inspect what ALPN protocols the upstream actually advertises:
# Test what protocols the upstream supports
openssl s_client -connect 10.10.1.55:443 -alpn h2,http/1.1 </dev/null 2>&1 | grep -i alpn
# Successful h2 negotiation:
ALPN protocol: h2
# Server doesn't support h2:
No ALPN negotiated
# Or server only supports http/1.1:
ALPN protocol: http/1.1
Check how your cluster is configured in Envoy's running config:
curl -s http://10.10.1.45:9901/config_dump | python3 -m json.tool | grep -B5 -A10 "http2_protocol_options"
If http2_protocol_options: {} is set in the cluster definition, Envoy will exclusively use HTTP/2 to that upstream and won't fall back to HTTP/1.1. If the upstream doesn't support h2, remove the http2_protocol_options block entirely or switch to http_protocol_options to force HTTP/1.1. On newer Envoy versions, the HttpProtocolOptions extension's auto_config is a cleaner option that lets the ALPN negotiation determine the protocol automatically.
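On newer Envoys, protocol auto-detection is configured through the HttpProtocolOptions extension — a sketch with the cluster name as a placeholder; note that auto_config only works when the upstream transport socket is TLS, since it relies on the negotiated ALPN value:

```yaml
clusters:
- name: api_backend   # placeholder cluster name
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      auto_config:
        http_protocol_options: {}    # settings used if http/1.1 is negotiated
        http2_protocol_options: {}   # settings used if h2 is negotiated
```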
On the listener side, make sure your DownstreamTlsContext advertises both protocols so clients with either preference can connect:
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      alpn_protocols:
      - h2
      - http/1.1
Order matters here. Listing h2 first signals preference for HTTP/2, while keeping http/1.1 as a fallback ensures clients that don't support h2 can still connect without errors.
Root Cause 6: Certificate SAN Mismatch
Modern TLS clients validate that the server certificate's Subject Alternative Names match the hostname being connected to. Common Name matching has been deprecated for years and is no longer honored by compliant clients. If your certificate has the wrong SANs — or no SANs at all — the connection fails even if the cert is signed by a trusted CA, is not expired, and is otherwise completely valid.
This trips people up during internal service deployments where certificates were generated with only a CN and no SAN extension, which was fine years ago but isn't acceptable to current OpenSSL, Go's TLS stack, or any modern browser.
Check the SANs on your certificate:
openssl x509 -in /etc/envoy/certs/solvethenetwork.com.crt -noout -text | \
grep -A3 "Subject Alternative Name"
# Good output:
X509v3 Subject Alternative Name:
DNS:api.solvethenetwork.com, DNS:*.solvethenetwork.com
# Bad — no SAN extension at all:
# (nothing returned — CN-only cert, will fail validation)
If you're connecting to an IP address directly — common in internal infrastructure — the certificate must include an IP SAN, not just a DNS SAN:
openssl x509 -in /etc/envoy/certs/internal-svc.crt -noout -text | \
grep -A5 "Subject Alternative Name"
# Must include: IP Address:10.10.1.45 for direct-IP connections to work
There's no workaround here — you can't disable SAN validation in compliant clients. The fix is to reissue the certificate with the correct SANs. If you're running your own PKI (Vault PKI, cfssl, or similar), update your certificate template to include both DNS and IP SANs as appropriate, then reissue and rotate.
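To sanity-check the template change before wiring it into your PKI, you can generate a throwaway self-signed cert with both SAN types and inspect it — the name and IP below are illustrative, and -addext needs OpenSSL 1.1.1 or later:

```shell
set -e
d=$(mktemp -d)

# Self-signed cert carrying both a DNS SAN and an IP SAN
openssl req -x509 -newkey rsa:2048 -nodes -days 2 \
  -keyout "$d/svc.key" -out "$d/svc.crt" \
  -subj "/CN=internal-svc.example.test" \
  -addext "subjectAltName=DNS:internal-svc.example.test,IP:10.0.0.10" 2>/dev/null

# Inspect the SAN extension: both entries should be present
openssl x509 -in "$d/svc.crt" -noout -ext subjectAltName
```

The same subjectAltName line is what belongs in your CA's signing template or CSR extensions.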
Root Cause 7: Expired or Not-Yet-Valid Certificate
Certificates have a validity window bounded by notBefore and notAfter. An expired cert is usually caught quickly because the error is explicit. The not-yet-valid scenario is trickier — it happens when system clocks are out of sync between hosts, and the cert looks fine when you inspect it on the issuing machine but fails on the receiving host whose clock is behind.
Check validity on sw-infrarunbook-01 and compare to the actual system time:
openssl x509 -in /etc/envoy/certs/solvethenetwork.com.crt -noout -dates
notBefore=Mar 1 00:00:00 2026 GMT
notAfter=Mar 1 00:00:00 2027 GMT
# Compare to current UTC time
date -u
# Tue Apr 15 09:12:34 UTC 2026
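openssl x509 -checkend gives you a scriptable version of this comparison — exit status 0 means the cert is still valid N seconds from now. A self-contained sketch with a throwaway 2-day cert:

```shell
set -e
d=$(mktemp -d)

# Throwaway cert valid for 2 days
openssl req -x509 -newkey rsa:2048 -nodes -days 2 \
  -keyout "$d/t.key" -out "$d/t.crt" -subj "/CN=demo.example.test" 2>/dev/null

# Still valid one hour from now? (exit 0 = yes)
openssl x509 -in "$d/t.crt" -noout -checkend 3600 && echo "valid for the next hour"

# Does it survive a 30-day horizon? (exit 1 = renewal needed)
openssl x509 -in "$d/t.crt" -noout -checkend $((30 * 86400)) || echo "renew within 30 days"
```

Point the same two checks at your real cert path and you have the core of an expiry probe.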
Check NTP synchronization status if clock skew is suspected:
timedatectl status | grep -E "NTP|synchronized"
# Healthy output:
System clock synchronized: yes
NTP service: active
chronyc tracking | grep "System time"
# System time : 0.000012483 seconds fast of NTP time
For certificate lifecycle management, automate rotation — use cert-manager in Kubernetes, Vault PKI with short-lived certs and automatic renewal, or ACME. Don't rely on human process to catch expiry. A 30-day expiry alert in your monitoring system is a reasonable minimum; 14-day and 7-day alerts give you time to act before an outage.
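A minimal days-until-expiry check you could wire into cron or a monitoring textfile exporter might look like this — the 90-day throwaway cert and 30-day threshold are illustrative, and date -d assumes GNU coreutils:

```shell
set -e
d=$(mktemp -d)

# Stand-in for the real cert: 90 days of validity
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -keyout "$d/t.key" -out "$d/t.crt" -subj "/CN=demo.example.test" 2>/dev/null

# Days remaining until notAfter
end=$(openssl x509 -in "$d/t.crt" -noout -enddate | cut -d= -f2)
days=$(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
echo "days until expiry: $days"

# Alert threshold: 30 days, matching the monitoring guidance above
if [ "$days" -lt 30 ]; then
  echo "ALERT: certificate expires in $days days"
fi
```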
Prevention
Debugging TLS issues in Envoy is much less painful when you've invested in the right tooling upfront. A few practices consistently pay off.
Instrument certificate expiry. Envoy exposes a gauge for the soonest-expiring certificate it has loaded. Scrape it with Prometheus and alert at 30 and 14 days before expiry. The stat to watch is server.days_until_first_cert_expiring — set a threshold alert so you're never surprised:
curl -s http://10.10.1.45:9901/stats | grep days_until_first_cert_expiring
server.days_until_first_cert_expiring: 347
Use the admin endpoint as your first debugging tool, not your logs. The /config_dump endpoint gives you the effective running configuration — what Envoy is actually using, not what you think you deployed. The /clusters endpoint shows upstream health and SSL handshake counters. The /listeners endpoint confirms filter chain layout. Make these three endpoints the first stop in any TLS debugging runbook.
Test mTLS paths explicitly in CI. A connectivity smoke test that doesn't send a client certificate won't catch mTLS regressions. Write your integration tests to exercise the full handshake — cert and key included — so a misconfigured sidecar or missing secret gets caught in the pipeline rather than in production.
Version-control your CA bundles. When a CA rolls over or an intermediate is added, you want a clear change history and the ability to roll back. Treat CA bundles as artifacts with a lifecycle — store them in a secrets manager or in version-controlled Kubernetes Secrets, not as manually copied files on a host.
Test ALPN explicitly when adding new clusters. A quick openssl s_client -alpn h2,http/1.1 -connect 10.10.1.55:443 against any new upstream tells you immediately what protocol it supports. This takes thirty seconds and saves you from subtle HTTP/2 failures that don't obviously look like TLS problems until you've spent an hour chasing connection resets.
TLS in Envoy is reliable once configured correctly. The handshake either works or fails with enough information to tell you why — if you know where to look. The admin interface, the stats endpoint, and a copy of openssl on the box will get you to root cause on any of these failures in minutes rather than hours.
