Symptoms
You've been staring at dashboards for the last hour and something is clearly wrong. Requests that used to complete in 40ms are now sitting at 800ms, 1.2 seconds, sometimes timing out entirely. The backends themselves look fine when you curl them directly. Traefik is in the critical path and it's the prime suspect.
Common symptoms that bring engineers to this runbook include: response times spiking in Prometheus under the traefik_service_request_duration_seconds histogram, upstream services reporting normal latency while end users see slowness, connection errors appearing in browser devtools as TTFB (Time to First Byte) balloons, and Traefik logs filling with Gateway Timeout or Service Unavailable messages. Sometimes it's intermittent — every fifth request gets hammered. Sometimes it's all requests, all the time. Either way, let's go through this methodically.
Root Cause 1: Backend Is Actually Slow
I know this sounds obvious, but you'd be surprised how often engineers spend two hours tuning Traefik when the problem is a database query that regressed after a deploy. Traefik is a transparent proxy — if your backend is slow, Traefik will faithfully proxy that slowness back to the client. The latency Traefik adds itself is typically sub-millisecond under normal conditions.
Why it happens: A backend service degrades due to a code regression, a slow query, GC pressure, resource contention, or an upstream dependency (like a third-party API or a cache miss storm). Because Traefik sits in front, users blame the proxy.
How to identify it: Check the Traefik access log. By default it's JSON and includes both the request duration as seen by Traefik and the upstream address it forwarded to.
$ docker logs traefik 2>&1 | grep '"RouterName"' | jq '{upstream: .ServiceAddr, duration: .Duration, status: .DownstreamStatus}' | tail -20
{"upstream": "10.0.1.15:8080", "duration": 1243000000, "status": 200}
{"upstream": "10.0.1.15:8080", "duration": 1198000000, "status": 200}
{"upstream": "10.0.1.16:8080", "duration": 45000000, "status": 200}
{"upstream": "10.0.1.15:8080", "duration": 1307000000, "status": 200}
The pattern is damning — 10.0.1.15 is consistently around 1.2 seconds while 10.0.1.16 is healthy. Bypass Traefik and hit the backend directly:
$ curl -o /dev/null -s -w "Total: %{time_total}s\nTTFB: %{time_starttransfer}s\n" http://10.0.1.15:8080/api/health
Total: 1.243812s
TTFB: 1.243501s
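At scale, eyeballing individual log lines doesn't cut it. As a rough sketch, assuming the JSON access-log fields shown above (ServiceAddr for the upstream, Duration in nanoseconds), you can aggregate per-upstream latency in a few lines of Python:

```python
import json
from collections import defaultdict

def slow_upstreams(log_lines, threshold_ms=500):
    """Group Traefik access-log entries by upstream address and return
    the mean latency (ms) of any upstream above threshold_ms.
    Assumes the JSON access-log format shown in the jq example."""
    durations = defaultdict(list)
    for line in log_lines:
        entry = json.loads(line)
        durations[entry["ServiceAddr"]].append(entry["Duration"] / 1e6)  # ns -> ms
    return {
        addr: round(sum(ms) / len(ms), 1)
        for addr, ms in durations.items()
        if sum(ms) / len(ms) > threshold_ms
    }

# Sample entries matching the log excerpt above
logs = [
    '{"ServiceAddr": "10.0.1.15:8080", "Duration": 1243000000}',
    '{"ServiceAddr": "10.0.1.16:8080", "Duration": 45000000}',
    '{"ServiceAddr": "10.0.1.15:8080", "Duration": 1198000000}',
]
print(slow_upstreams(logs))  # only 10.0.1.15 exceeds the threshold
```

Feed it the whole day's log and the slow upstream falls out immediately, along with how slow it actually is.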
How to fix it: Temporarily remove the slow backend from the load balancer rotation while you investigate it. With Traefik's label-based routing in Docker or Kubernetes, you can remove the label or drain the pod. For a static backend defined in a file provider, comment it out and reload:
# traefik/dynamic/services.yml — remove the slow server
http:
  services:
    my-app:
      loadBalancer:
        servers:
          # - url: "http://10.0.1.15:8080"   # DRAINED — high latency
          - url: "http://10.0.1.16:8080"
Traefik picks up file provider changes without restart. Confirm with curl http://sw-infrarunbook-01:8080/api/http/services against the Traefik API.
Root Cause 2: Connection Pool Exhausted
This one bites production environments hard, especially after a traffic surge. Traefik maintains a pool of idle connections to each backend. When that pool is exhausted, new requests have to wait for a connection to become available — or Traefik has to open a fresh TCP connection, which adds round-trip latency before a single byte of your application payload moves.
Why it happens: The default Traefik transport settings are conservative. By default, maxIdleConnsPerHost is 200 and maxIdleConns (global) is also bounded. Under high concurrency — say, a burst of 500 concurrent users — you can exhaust available idle connections, forcing Traefik to either queue requests or establish new connections repeatedly. New TCP connections under TLS are especially expensive (see Root Cause 4).
How to identify it: Look at Traefik's internal metrics. If you have the Prometheus integration enabled:
$ curl -s http://sw-infrarunbook-01:8082/metrics | grep -E 'traefik_service_open_connections|go_goroutines'
traefik_service_open_connections{entrypoint="websecure",method="GET",service="my-app@docker"} 198
go_goroutines 3847
A connection count sitting at or near the configured maximum is your smoking gun. You can also correlate this with ss on the Traefik host:
$ ss -s
Total: 8921
TCP: 4102 (estab 3894, closed 198, orphaned 0, timewait 198)
Transport Total IP IPv6
RAW 0 0 0
UDP 12 8 4
TCP 3904 3901 3
INET 3916 3909 7
FRAG 0 0 0
Nearly 4000 established TCP connections from a single Traefik instance points to pool saturation. TIME_WAIT sockets piling up confirm that connections are being torn down and re-established constantly.
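Before tuning, sanity-check whether the pool is actually undersized. Little's law (concurrent connections = request rate × mean latency) gives a quick estimate of steady-state demand. A small illustrative calculation, with made-up traffic numbers:

```python
import math

def required_pool_size(requests_per_sec, mean_latency_sec, headroom=1.5):
    """Estimate steady-state concurrent connections via Little's law
    (L = lambda * W), with a headroom multiplier for bursts."""
    return math.ceil(requests_per_sec * mean_latency_sec * headroom)

# Hypothetical: 2000 req/s at 120 ms mean backend latency
print(required_pool_size(2000, 0.120))  # 360 connections needed
```

If the result comfortably exceeds maxIdleConnsPerHost, the pool is provably too small for your traffic and raising it is justified rather than cargo-culted.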
How to fix it: Tune the ServersTransport configuration. This can be done globally or per-service:
# traefik/dynamic/transports.yml
http:
  serversTransports:
    high-concurrency:
      maxIdleConnsPerHost: 1000
      forwardingTimeouts:
        dialTimeout: "5s"
        responseHeaderTimeout: "30s"
        idleConnTimeout: "90s"
  services:
    my-app:
      loadBalancer:
        serversTransport: high-concurrency
        servers:
          - url: "http://10.0.1.16:8080"
Also check the kernel's ephemeral port range and socket backlog on the host. If you're burning through ports:
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 60999
# Expand it
$ sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Persist in /etc/sysctl.d/99-traefik.conf
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
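The arithmetic behind the port-range tweak is worth internalizing: each closed connection parks its ephemeral port in TIME_WAIT (roughly 60 seconds by default), which caps the sustainable new-connection rate to a single backend. A quick back-of-the-envelope helper, assuming the default TIME_WAIT duration:

```python
def max_conn_rate(port_range=(32768, 60999), time_wait_sec=60):
    """Upper bound on new outbound connections/sec to one backend
    address before ephemeral ports run out: each closed connection
    holds its port in TIME_WAIT for ~time_wait_sec."""
    ports = port_range[1] - port_range[0] + 1
    return ports / time_wait_sec

print(round(max_conn_rate()))                # default range: ~471 conn/s ceiling
print(round(max_conn_rate((1024, 65535))))   # expanded range: ~1075 conn/s
```

This also shows why connection reuse beats port-range expansion: a healthy keep-alive pool makes the ceiling irrelevant because you stop opening new connections at all.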
Root Cause 3: Health Checks Consuming Bandwidth and Slots
Traefik's health check feature is genuinely useful — it automatically removes unhealthy backends from rotation. But poorly configured health checks can actually cause the latency they're meant to protect against. In my experience, this is one of the most overlooked causes of sporadic high latency in Traefik deployments.
Why it happens: If health check interval is set very low (say, every 2 seconds) across many backends, Traefik generates a constant stream of HTTP requests to your services. These health check requests compete for connections in the same pool used by real traffic. Worse, if your health check endpoint is heavyweight — hitting a database, running a deep dependency check — those probes themselves add load to the backend, which then slows down real requests.
A second failure mode: if the health check endpoint is on a large payload path or triggers side effects, it generates unnecessary bandwidth and CPU on the backend side, starving real requests of resources.
How to identify it: Check your backend access logs and filter for the health check user-agent:
$ grep 'Traefik' /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head
3600 10.0.0.2 # This is Traefik — 1 probe/sec from a 1s interval
842 203.0.113.44
521 198.51.100.7
3600 requests per hour from Traefik's health checker means a 1-second interval. Compare that against actual backend request rates — if health checks represent more than 5-10% of total requests, you have a problem. Also look at what the health endpoint actually does:
$ time curl -s http://10.0.1.16:8080/healthz -o /dev/null
real 0m0.312s
user 0m0.004s
sys 0m0.004s
A 312ms health check endpoint is a disaster. It should be under 10ms — ideally a static response that checks nothing but liveness.
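To decide whether probes are a meaningful share of load, compare the probe rate against real traffic. A quick illustrative calculation (the backend counts and request rates here are hypothetical):

```python
def probe_overhead(interval_sec, backends, real_rps):
    """Fraction of total backend traffic that is health-check probes,
    assuming one Traefik instance probing every backend once per interval."""
    probe_rps = backends / interval_sec
    return probe_rps / (probe_rps + real_rps)

# 1s interval, 20 backends, 100 real req/s: probes are a sixth of all traffic
print(f"{probe_overhead(1, 20, 100):.1%}")
# A 30s interval shrinks that to well under 1%
print(f"{probe_overhead(30, 20, 100):.2%}")
```

Anything above the 5-10% line mentioned earlier is a clear signal to lengthen the interval before touching anything else.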
How to fix it: First, slow down the interval and make the health check lightweight:
# traefik/dynamic/services.yml
http:
  services:
    my-app:
      loadBalancer:
        healthCheck:
          path: /ping          # Lightweight endpoint, not /health/deep
          interval: "30s"      # Not 1s or 5s
          timeout: "3s"
          headers:
            X-Health-Check: "traefik"
        servers:
          - url: "http://10.0.1.16:8080"
On the application side, ensure /ping returns a static 200 OK with no database calls. It should just confirm the process is alive. If you need a deep dependency check, put it on a separate endpoint that ops can call manually — don't let Traefik hammer it every few seconds.
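What a liveness-only endpoint looks like depends on your stack, but the shape is always the same: static response, zero dependencies. A minimal sketch using Python's standard library (the handler and port here are illustrative, not from the original setup):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class PingHandler(BaseHTTPRequestHandler):
    """Liveness-only endpoint: static 200, no database, no dependencies."""
    def do_GET(self):
        if self.path == "/ping":
            body = b"OK"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep probe noise out of the logs
        pass

# Bind to an ephemeral port and exercise the endpoint once
server = HTTPServer(("127.0.0.1", 0), PingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
resp = urllib.request.urlopen(f"http://127.0.0.1:{port}/ping")
body_out = resp.read()
print(resp.status, body_out)
server.shutdown()
```

The whole handler does no I/O beyond writing two bytes, so even a 1-second probe interval would cost the backend essentially nothing.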
Root Cause 4: TLS Handshake Overhead
TLS adds latency. That's the deal you make for encryption. But the overhead should be a one-time cost amortized over many requests via TLS session resumption and connection reuse. When either of those mechanisms breaks down, every request pays the full handshake tax — and that's often 50-200ms per request.
Why it happens: TLS handshake overhead becomes a problem when connections aren't being reused. This happens when
keep-aliveis disabled between Traefik and backends, when backends close connections aggressively, when session tickets aren't configured, or when the backend itself has mismatched TLS settings that force a full handshake each time. It also happens when Traefik terminates TLS from clients and then re-encrypts to backends (passthrough vs. termination confusion).
How to identify it: Use openssl s_client and watch for session reuse:
$ openssl s_client -connect solvethenetwork.com:443 -reconnect 2>&1 | grep -E 'Reused|New|Session-ID'
depth=2 C = US, O = Internet Security Research Group, CN = ISRG Root X1
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Session-ID: 3A9F2B...
---
Reused, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384 # Good — session resumed
Reused, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384 # Bad — forced new handshake
Intermittent New handshakes where you'd expect Reused indicate session resumption is failing. You can also measure handshake time directly:
$ curl -o /dev/null -s -w "%{time_appconnect} appconnect\n%{time_starttransfer} TTFB\n%{time_total} total\n" https://solvethenetwork.com/api/data
0.187423 appconnect
0.234891 TTFB
0.235012 total
When time_appconnect is 187ms, nearly all your latency is in the TLS handshake, not the application. Healthy values for a local network should be under 5ms.
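The impact of broken session reuse is easy to quantify: only the requests that miss resumption pay the full handshake. A small illustrative calculation using the 187ms handshake measured above (the reuse ratios are hypothetical):

```python
def amortized_tls_ms(handshake_ms, reuse_ratio):
    """Average per-request TLS cost when only (1 - reuse_ratio)
    of requests pay a full handshake."""
    return handshake_ms * (1 - reuse_ratio)

print(amortized_tls_ms(187, 0.0))   # no reuse: every request pays 187 ms
print(amortized_tls_ms(187, 0.99))  # healthy reuse: under 2 ms on average
```

This is why fixing resumption usually beats shaving milliseconds off the handshake itself: going from 0% to 99% reuse is a roughly hundredfold reduction in amortized cost.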
How to fix it: If Traefik is terminating TLS from clients, ensure it's using HTTP (not HTTPS) to communicate with backends on the private network — there's no reason to double-encrypt on a trusted internal segment:
# traefik/dynamic/services.yml — use plain HTTP to backend on RFC 1918
http:
  services:
    my-app:
      loadBalancer:
        servers:
          - url: "http://10.0.1.16:8080"   # HTTP, not HTTPS, inside the cluster
If you must use TLS to backends (compliance requirement), enable session caching and tune cipher suites in Traefik's TLS options:
# traefik/dynamic/tls.yml
tls:
  options:
    default:
      minVersion: VersionTLS12
      cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
      sniStrict: true
Also verify your certificate chain is complete. An incomplete chain forces clients to do additional round trips to fetch intermediate certificates, which adds latency that appears in Traefik's observed request times even though it's client-side:
$ openssl s_client -connect solvethenetwork.com:443 2>&1 | grep -A3 'Certificate chain'
Certificate chain
0 s:CN = solvethenetwork.com
i:C = US, O = Let's Encrypt, CN = R11
1 s:C = US, O = Let's Encrypt, CN = R11
i:C = US, O = Internet Security Research Group, CN = ISRG Root X1
Two levels in the chain is correct for Let's Encrypt. If you see only level 0, your chain is incomplete.
Root Cause 5: DNS Resolution Delay
Traefik resolves backend hostnames at startup and when services are updated — but depending on your configuration and provider type, it may re-resolve on each request or at intervals. DNS failures or slow responses translate directly into request latency, and this one is notoriously difficult to spot because DNS issues are intermittent by nature.
Why it happens: In Docker environments, Traefik often uses container names or service names as backend addresses. The Docker DNS resolver (127.0.0.11) handles these. Under heavy load or when containers are cycling, DNS responses can be slow or return stale records. In Kubernetes, CoreDNS becomes the chokepoint. In bare-metal or VM setups, if backends are specified by hostname rather than IP, every connection attempt that misses the OS DNS cache triggers a resolver query.
How to identify it: Capture DNS query latency on the Traefik host:
$ dig @127.0.0.11 my-app-service +stats 2>&1 | grep 'Query time'
;; Query time: 287 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Sat Apr 19 14:23:41 UTC 2026
287ms for a DNS query is catastrophic. Even 20ms is bad when you're trying to serve requests in under 100ms total. Run it several times and watch for variance:
$ for i in $(seq 1 10); do dig @127.0.0.11 my-app-service +short +stats 2>&1 | grep 'Query time'; done
;; Query time: 3 msec
;; Query time: 2 msec
;; Query time: 289 msec
;; Query time: 3 msec
;; Query time: 4 msec
;; Query time: 301 msec
;; Query time: 2 msec
;; Query time: 3 msec
;; Query time: 278 msec
;; Query time: 2 msec
Every third or fourth query spikes to ~290ms. That's a resolver problem, not a backend problem. You can also use strace on the Traefik process to catch DNS syscalls directly, though that's invasive in production. A lighter approach is to enable Traefik's debug logging temporarily and grep for resolver activity:
$ docker logs traefik 2>&1 | grep -i 'lookup\|resolv\|dial' | tail -20
time="2026-04-19T14:23:39Z" level=debug msg="Dialing" network=tcp address="my-app-service:8080"
time="2026-04-19T14:23:39Z" level=debug msg="Resolved" address="10.0.1.16:8080" duration=287ms
time="2026-04-19T14:23:42Z" level=debug msg="Dialing" network=tcp address="my-app-service:8080"
time="2026-04-19T14:23:42Z" level=debug msg="Resolved" address="10.0.1.16:8080" duration=3ms
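If you want to quantify the pattern rather than eyeball it, split the samples into a baseline cluster and a spike cluster. A small sketch using the query times from the dig loop above:

```python
import statistics

def dns_spikes(query_times_ms, threshold_ms=50):
    """Split DNS query times into baseline vs spikes; returns
    (spike_fraction, baseline_median_ms, spike_median_ms)."""
    spikes = [t for t in query_times_ms if t > threshold_ms]
    base = [t for t in query_times_ms if t <= threshold_ms]
    return (len(spikes) / len(query_times_ms),
            statistics.median(base),
            statistics.median(spikes) if spikes else 0)

samples = [3, 2, 289, 3, 4, 301, 2, 3, 278, 2]  # from the loop above
print(dns_spikes(samples))  # 30% of queries spike, baseline 3 ms, spikes ~289 ms
```

A clean bimodal split like this (tight baseline, tight spike cluster) points at a resolver timeout or a struggling upstream server rather than general network jitter, which would smear the distribution instead.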
How to fix it: The most reliable fix is to use IP addresses instead of hostnames for backends in your Traefik configuration. On stable infrastructure, backends don't change IP often enough to justify DNS overhead on every connection:
# Before — hostname-based (subject to DNS latency)
http:
  services:
    my-app:
      loadBalancer:
        servers:
          - url: "http://my-app-service:8080"

# After — IP-based (no DNS lookup)
http:
  services:
    my-app:
      loadBalancer:
        servers:
          - url: "http://10.0.1.16:8080"
If you're running in Kubernetes and can't hardcode IPs, fix the DNS resolver itself. Check CoreDNS pod health and resource limits — CoreDNS getting CPU-throttled is a classic source of intermittent slow lookups:
$ kubectl -n kube-system top pods -l k8s-app=kube-dns
NAME CPU(cores) MEMORY(bytes)
coredns-5d78c9869d-4kbf9 98m 28Mi
coredns-5d78c9869d-7xvnp 97m 27Mi
# If CPU limit is 100m and usage is 98m, you're throttling — increase the limit
$ kubectl -n kube-system edit deployment coredns
Also configure ndots and search domains carefully. A high ndots value (Kubernetes defaults to 5) means every short hostname triggers multiple failed DNS lookups before the resolver tries the absolute name:
# In your Traefik pod spec, reduce ndots to cut DNS overhead
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
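Why ndots matters so much: with ndots: 5, a name like api.example.com (only two dots) is first tried against every search domain before being queried as-is, and each attempt is typically an A plus AAAA pair on dual-stack hosts. A rough illustrative count (the search domains here are hypothetical):

```python
def lookups_for(name, ndots, search_domains):
    """Approximate number of DNS queries the resolver issues for `name`.
    With fewer than `ndots` dots, every search domain is tried (and
    fails) before the absolute name; each try is an A + AAAA pair."""
    tries = 1 if name.count(".") >= ndots else len(search_domains) + 1
    return tries * 2  # A + AAAA

search = ["ns.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "example.internal"]
print(lookups_for("api.example.com", ndots=5, search_domains=search))  # 10 queries
print(lookups_for("api.example.com", ndots=2, search_domains=search))  # 2 queries
```

A 5x reduction in query volume per external lookup, for a one-line pod spec change, is about as cheap as DNS wins get.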
Root Cause 6: Middleware Chain Overhead
Every middleware you attach to a router runs synchronously in the request path. Rate limiters, authentication middleware, compression, and header manipulation all add time. A middleware that calls an external service — like an auth provider — effectively adds a remote API call to every request.
How to identify it: Comment out middleware chains one at a time and measure latency with each removed. A faster approach is to create a test router with no middleware and route a subset of requests through it:
# Diagnostic route with zero middleware — compare latency against production route
http:
  routers:
    my-app-baseline:
      rule: "Host(`sw-infrarunbook-01.solvethenetwork.com`) && PathPrefix(`/debug-baseline`)"
      service: my-app
      # No middlewares attached
    my-app-prod:
      rule: "Host(`sw-infrarunbook-01.solvethenetwork.com`)"
      service: my-app
      middlewares:
        - rate-limit
        - auth-forward
        - compress
$ curl -o /dev/null -s -w "%{time_total}\n" https://sw-infrarunbook-01.solvethenetwork.com/debug-baseline/api/data
0.048000
$ curl -o /dev/null -s -w "%{time_total}\n" https://sw-infrarunbook-01.solvethenetwork.com/api/data
0.743000
695ms of middleware overhead. Now remove middlewares one at a time to find the culprit. ForwardAuth calling a slow external validator is the usual suspect.
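Once you have timings for progressively longer chains, attribution is simple subtraction. A sketch with hypothetical per-chain measurements consistent with the numbers above:

```python
def middleware_overhead(timings_ms):
    """Given total latency measured for progressively longer middleware
    chains (baseline first), attribute per-middleware overhead as the
    delta from the previous measurement."""
    names, totals = zip(*timings_ms)
    return {names[i]: totals[i] - totals[i - 1] for i in range(1, len(totals))}

# Hypothetical: re-run the curl timing as each middleware is added back
measurements = [("baseline", 48), ("rate-limit", 53),
                ("auth-forward", 731), ("compress", 743)]
print(middleware_overhead(measurements))  # auth-forward dominates the 695 ms gap
```

In this (made-up but typical) breakdown, rate limiting and compression cost single-digit milliseconds while the ForwardAuth hop eats nearly everything, which is exactly the pattern you'd expect when the auth validator is remote.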
Root Cause 7: Resource Starvation on the Traefik Host
Traefik is a Go application — it handles concurrency well, but it still needs CPU, memory, and file descriptors. A host under memory pressure will trigger Go's GC more frequently, causing stop-the-world pauses that show up as request latency spikes. CPU saturation causes request queuing. File descriptor exhaustion prevents new connections from opening.
How to identify it:
# Check file descriptor usage
$ cat /proc/$(pgrep traefik)/limits | grep 'open files'
Max open files 65536 65536 files
$ ls /proc/$(pgrep traefik)/fd | wc -l
64891
# At 64891 out of 65536 — nearly exhausted
# Check GC pressure via Traefik metrics
$ curl -s http://sw-infrarunbook-01:8082/metrics | grep go_gc
go_gc_duration_seconds{quantile="0.5"} 0.000312
go_gc_duration_seconds{quantile="0.99"} 0.089341 # 89ms GC pause at p99!
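A simple utilization check makes the fd numbers actionable; past roughly 90% you should treat it as an incident in progress. A small helper using the counts observed above (the 90% threshold is a rule of thumb, not a Traefik default):

```python
def fd_headroom(open_fds, max_fds, warn_at=0.9):
    """Report file-descriptor utilization; beyond warn_at, new
    connections are at risk of failing with EMFILE."""
    ratio = open_fds / max_fds
    return ratio, ratio >= warn_at

# Counts from the /proc inspection above
ratio, critical = fd_headroom(64891, 65536)
print(f"{ratio:.1%} used, critical={critical}")  # 99.0% used, critical=True
```

Wiring this into a periodic check (or alerting on the process_open_fds Prometheus metric) catches exhaustion before Traefik starts refusing connections.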
How to fix it: Raise the file descriptor limit in your systemd unit or Docker run arguments, and give Traefik enough headroom:
# /etc/systemd/system/traefik.service
[Service]
LimitNOFILE=1048576
LimitNPROC=65536
For GC pressure, set GOGC higher (reduces GC frequency at the cost of more memory) or increase the container memory limit so Go's runtime doesn't GC as aggressively.
Prevention
Most of these issues are preventable with the right monitoring and baseline configuration. Set up Prometheus scraping of Traefik's /metrics endpoint from day one and create alerts on traefik_service_request_duration_seconds p95 and p99. Alert before users notice, not after.
Standardize on IP-based backend addressing wherever possible to eliminate DNS as a variable. When you first deploy a service, run a five-minute load test and capture a baseline latency profile — that baseline becomes your regression detector when things change later.
For TLS, let Traefik manage certificates via Let's Encrypt with the ACME provider and use HTTP to communicate with internal backends. Double-encryption inside a trusted network is a latency tax with no security dividend on most architectures.
Keep health check intervals sane. I'd set a floor of 15-30 seconds for most services. The additional seconds it takes to detect a failed backend are almost always acceptable compared to the constant probe overhead of aggressive health checking.
Finally, load test your middleware chains before going to production. ForwardAuth in particular deserves its own benchmark — point it at a mock auth service that returns immediately and compare against production auth latency. If your auth provider ever has a bad day, every single request through Traefik will pay that cost.
Document your ServersTransport configuration alongside your Traefik deployment. The default values are conservative and will surprise you during a traffic event if you haven't tuned them. Treat connection pool sizing like you'd treat database connection pool sizing — it matters, and the defaults aren't right for production traffic.
