InfraRunBook

    Traefik High Latency Troubleshooting

    Traefik
    Published: Apr 19, 2026
    Updated: Apr 19, 2026

    A senior engineer's guide to diagnosing and fixing high latency in Traefik reverse proxy deployments, covering backend slowness, connection pool exhaustion, TLS overhead, DNS delays, and more.


    Symptoms

    You've been staring at dashboards for the last hour and something is clearly wrong. Requests that used to complete in 40ms are now sitting at 800ms, 1.2 seconds, sometimes timing out entirely. The backends themselves look fine when you curl them directly. Traefik is in the critical path and it's the prime suspect.

    Common symptoms that bring engineers to this runbook include: response times spiking in Prometheus under the traefik_service_request_duration_seconds histogram, upstream services reporting normal latency while end users see slowness, connection errors appearing in browser devtools as TTFB (Time to First Byte) balloons, and Traefik logs filling with Gateway Timeout or Service Unavailable messages. Sometimes it's intermittent: every fifth request gets hammered. Sometimes it's all requests, all the time. Either way, let's go through this methodically.
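    If Prometheus is already scraping Traefik, a PromQL query along these lines is a useful first look. This is a sketch: it assumes Traefik's standard request-duration histogram metric and a Prometheus instance reachable at prometheus:9090 (both are assumptions, adjust to your setup).

```shell
# Hypothetical p95 query over Traefik's request-duration histogram,
# broken out per service, over a 5-minute rate window.
q='histogram_quantile(0.95, sum by (service, le) (rate(traefik_service_request_duration_seconds_bucket[5m])))'

# To run it as an instant query against an assumed Prometheus endpoint:
#   curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=$q"
echo "$q"
```

    Graph this per service and the slow upstream usually jumps out before you touch a single log file.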

    Root Cause 1: Backend Is Actually Slow

    I know this sounds obvious, but you'd be surprised how often engineers spend two hours tuning Traefik when the problem is a database query that regressed after a deploy. Traefik is a transparent proxy — if your backend is slow, Traefik will faithfully proxy that slowness back to the client. The latency Traefik adds itself is typically sub-millisecond under normal conditions.

    Why it happens: A backend service degrades due to a code regression, a slow query, GC pressure, resource contention, or an upstream dependency (like a third-party API or a cache miss storm). Because Traefik sits in front, users blame the proxy.

    How to identify it: Check the Traefik access log. By default it's JSON and includes both the request duration as seen by Traefik and the upstream address it forwarded to.

    $ docker logs traefik 2>&1 | grep '"RouterName"' | jq '{upstream: .ServiceAddr, duration: .Duration, status: .DownstreamStatus}' | tail -20
    
    {"upstream": "10.0.1.15:8080", "duration": 1243000000, "status": 200}
    {"upstream": "10.0.1.15:8080", "duration": 1198000000, "status": 200}
    {"upstream": "10.0.1.16:8080", "duration": 45000000,  "status": 200}
    {"upstream": "10.0.1.15:8080", "duration": 1307000000, "status": 200}

    Traefik logs Duration in nanoseconds, so these are ~1.2-second responses. The pattern is damning: 10.0.1.15 is consistently around 1.2 seconds while 10.0.1.16 is healthy. Bypass Traefik and hit the backend directly:

    $ curl -o /dev/null -s -w "Total: %{time_total}s\nTTFB: %{time_starttransfer}s\n" http://10.0.1.15:8080/api/health
    
    Total: 1.243812s
    TTFB: 1.243501s
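    To see whether one upstream drags the numbers across a larger log sample, a quick awk pass over the access log works. This is a sketch against hypothetical sample lines written to /tmp (in production, feed it your real access log); it keys on the ServiceAddr field and treats Duration as nanoseconds.

```shell
# Hypothetical sample of Traefik JSON access-log lines; Duration is nanoseconds.
cat > /tmp/traefik_access_sample.jsonl <<'EOF'
{"ServiceAddr": "10.0.1.15:8080", "Duration": 1243000000, "DownstreamStatus": 200}
{"ServiceAddr": "10.0.1.16:8080", "Duration": 45000000, "DownstreamStatus": 200}
{"ServiceAddr": "10.0.1.15:8080", "Duration": 1198000000, "DownstreamStatus": 200}
EOF

# Average duration per upstream, converted from nanoseconds to milliseconds.
awk -F'"' '{
  addr = $4                               # value of the ServiceAddr field
  split($0, p, /"Duration": /)            # naive field grab, fine for this sketch
  dur = p[2] + 0
  sum[addr] += dur; cnt[addr]++
}
END { for (a in sum) printf "%s avg %d ms\n", a, sum[a] / cnt[a] / 1e6 }' /tmp/traefik_access_sample.jsonl
```

    A per-upstream average like this makes the 10.0.1.15-versus-10.0.1.16 split obvious at a glance, even in a noisy log.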

    How to fix it: Temporarily remove the slow backend from the load balancer rotation while you investigate it. With Traefik's label-based routing in Docker or Kubernetes, you can remove the label or drain the pod. For a static backend defined in a file provider, comment it out and reload:

    # traefik/dynamic/services.yml — remove the slow server
    http:
      services:
        my-app:
          loadBalancer:
            servers:
              # - url: "http://10.0.1.15:8080"  # DRAINED — high latency
              - url: "http://10.0.1.16:8080"

    Traefik picks up file provider changes without a restart. Confirm the new state with curl http://sw-infrarunbook-01:8080/api/http/services against the Traefik API.

    Root Cause 2: Connection Pool Exhausted

    This one bites production environments hard, especially after a traffic surge. Traefik maintains a pool of idle connections to each backend. When that pool is exhausted, new requests have to wait for a connection to become available — or Traefik has to open a fresh TCP connection, which adds round-trip latency before a single byte of your application payload moves.

    Why it happens: The default Traefik transport settings are conservative. Out of the box, maxIdleConnsPerHost is 200, and the global idle-connection count is also bounded. Under high concurrency (say, a burst of 500 concurrent users) you can exhaust available idle connections, forcing Traefik to either queue requests or establish new connections repeatedly. New TCP connections under TLS are especially expensive (see Root Cause 4).

    How to identify it: Look at Traefik's internal metrics. If you have the Prometheus integration enabled:

    $ curl -s http://sw-infrarunbook-01:8082/metrics | grep -E 'traefik_service_open_connections|go_goroutines'
    
    traefik_service_open_connections{entrypoint="websecure",method="GET",service="my-app@docker"} 198
    go_goroutines 3847

    A connection count sitting at or near the configured maximum is your smoking gun. You can also correlate this with socket statistics (ss) on the Traefik host:

    $ ss -s
    Total: 8921
    TCP:   4102 (estab 3894, closed 198, orphaned 0, timewait 198)
    
    Transport Total     IP        IPv6
    RAW       0         0         0
    UDP       12        8         4
    TCP       3904      3901      3
    INET      3916      3909      7
    FRAG      0         0         0

    Nearly 4000 established TCP connections from a single Traefik instance points to pool saturation. TIME_WAIT sockets piling up confirm that connections are being torn down and re-established constantly.
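    To see exactly which upstream the churn concentrates on, count socket states per peer. This sketch parses an inlined sample snapshot; on a live host you'd pipe ss -tan straight into the awk instead.

```shell
# Hypothetical ss -tan snapshot; replace with live output in production.
cat > /tmp/ss_sample.txt <<'EOF'
State     Recv-Q Send-Q Local Address:Port   Peer Address:Port
ESTAB     0      0      10.0.0.2:51234       10.0.1.15:8080
TIME-WAIT 0      0      10.0.0.2:51240       10.0.1.15:8080
TIME-WAIT 0      0      10.0.0.2:51241       10.0.1.15:8080
ESTAB     0      0      10.0.0.2:51250       10.0.1.16:8080
EOF

# Count connection states per backend peer. A churned pool shows TIME-WAIT
# stacking up against one upstream while ESTAB stays flat.
awk 'NR > 1 { state[$5 " " $1]++ }
END { for (k in state) print state[k], k }' /tmp/ss_sample.txt | sort -rn
```

    The upstream with the tallest TIME-WAIT pile is where connections are being torn down and re-established instead of reused.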

    How to fix it: Tune the ServersTransport configuration. This can be done globally or per-service:

    # traefik/dynamic/transports.yml
    http:
      serversTransports:
        high-concurrency:
          maxIdleConnsPerHost: 1000
          forwardingTimeouts:
            dialTimeout: "5s"
            responseHeaderTimeout: "30s"
            idleConnTimeout: "90s"
    
      services:
        my-app:
          loadBalancer:
            serversTransport: high-concurrency
            servers:
              - url: "http://10.0.1.16:8080"
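    If the same service is registered through the Docker provider instead of the file provider, the transport can be attached with a label. This compose fragment is a sketch (service and image names are hypothetical); the @file suffix references a transport defined through the file provider, as in the dynamic configuration above.

```yaml
# docker-compose.yml (sketch) — attach the tuned transport to a Docker-provided service
services:
  my-app:
    image: my-app:latest    # hypothetical image
    labels:
      - "traefik.http.services.my-app.loadbalancer.serverstransport=high-concurrency@file"
```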

    Also check the kernel's ephemeral port range and socket backlog on the host. If you're burning through ports:

    $ sysctl net.ipv4.ip_local_port_range
    net.ipv4.ip_local_port_range = 32768	60999
    
    # Expand it
    $ sysctl -w net.ipv4.ip_local_port_range="1024 65535"
    
    # Persist in /etc/sysctl.d/99-traefik.conf
    net.ipv4.ip_local_port_range = 1024 65535
    net.ipv4.tcp_tw_reuse = 1

    Root Cause 3: Health Checks Consuming Bandwidth and Slots

    Traefik's health check feature is genuinely useful — it automatically removes unhealthy backends from rotation. But poorly configured health checks can actually cause the latency they're meant to protect against. In my experience, this is one of the most overlooked causes of sporadic high latency in Traefik deployments.

    Why it happens: If health check interval is set very low (say, every 2 seconds) across many backends, Traefik generates a constant stream of HTTP requests to your services. These health check requests compete for connections in the same pool used by real traffic. Worse, if your health check endpoint is heavyweight — hitting a database, running a deep dependency check — those probes themselves add load to the backend, which then slows down real requests.

    A second failure mode: if the health check endpoint is on a large payload path or triggers side effects, it generates unnecessary bandwidth and CPU on the backend side, starving real requests of resources.
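    The arithmetic behind probe load is worth doing explicitly: probes per hour scale as backends times 3600 divided by the interval in seconds. A quick sketch with hypothetical numbers:

```shell
# Hypothetical fleet: 12 backends probed every 2 seconds by one Traefik instance.
backends=12
interval_s=2

# probes/hour = backends * (seconds per hour / interval)
probes_per_hour=$(( backends * 3600 / interval_s ))
echo "$probes_per_hour health probes per hour"
```

    That's 21600 probes an hour from a single Traefik instance; multiply by your replica count and it's easy to see how aggressive intervals become a meaningful fraction of backend traffic.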

    How to identify it: Check your backend access logs and filter for the health check user-agent:

    $ grep 'Traefik' /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head
    
      3600 10.0.0.2  # This is Traefik — 1 probe/sec from a 1s interval
       842 203.0.113.44
       521 198.51.100.7

    3600 requests per hour from Traefik's health checker means a 1-second interval. Compare that against actual backend request rates — if health checks represent more than 5-10% of total requests, you have a problem. Also look at what the health endpoint actually does:

    $ time curl -s http://10.0.1.16:8080/healthz -o /dev/null
    
    real    0m0.312s
    user    0m0.004s
    sys     0m0.004s

    A 312ms health check endpoint is a disaster. It should be under 10ms — ideally a static response that checks nothing but liveness.

    How to fix it: First, slow down the interval and make the health check lightweight:

    # traefik/dynamic/services.yml
    http:
      services:
        my-app:
          loadBalancer:
            healthCheck:
              path: /ping          # Lightweight endpoint, not /health/deep
              interval: "30s"      # Not 1s or 5s
              timeout: "3s"
              headers:
                X-Health-Check: "traefik"
            servers:
              - url: "http://10.0.1.16:8080"

    On the application side, ensure /ping returns a static 200 OK with no database calls. It should just confirm the process is alive. If you need a deep dependency check, put it on a separate endpoint that ops can call manually; don't let Traefik hammer it every few seconds.

    Root Cause 4: TLS Handshake Overhead

    TLS adds latency. That's the deal you make for encryption. But the overhead should be a one-time cost amortized over many requests via TLS session resumption and connection reuse. When either of those mechanisms breaks down, every request pays the full handshake tax — and that's often 50-200ms per request.

    Why it happens: TLS handshake overhead becomes a problem when connections aren't being reused. This happens when keep-alive is disabled between Traefik and backends, when backends close connections aggressively, when session tickets aren't configured, or when the backend itself has mismatched TLS settings that force a full handshake each time. It also happens when Traefik terminates TLS from clients and then re-encrypts to backends (passthrough vs. termination confusion).

    How to identify it: Use openssl s_client and watch for session reuse:

    $ openssl s_client -connect solvethenetwork.com:443 -reconnect 2>&1 | grep -E 'Reused|New|Session-ID'
    
    depth=2 C = US, O = Internet Security Research Group, CN = ISRG Root X1
    ---
    New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
    Session-ID: 3A9F2B...
    ---
    Reused, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384   # Good — session resumed
    Reused, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
    New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384       # Bad — forced new handshake

    Intermittent New handshakes where you'd expect Reused indicate session resumption is failing. You can also measure handshake time directly:

    $ curl -o /dev/null -s -w "%{time_appconnect} appconnect\n%{time_starttransfer} TTFB\n%{time_total} total\n" https://solvethenetwork.com/api/data
    
    0.187423 appconnect
    0.234891 TTFB
    0.235012 total

    When time_appconnect is 187ms, nearly all your latency is in the TLS handshake, not the application. Healthy values for a local network should be under 5ms.

    How to fix it: If Traefik is terminating TLS from clients, ensure it's using HTTP (not HTTPS) to communicate with backends on the private network — there's no reason to double-encrypt on a trusted internal segment:

    # traefik/dynamic/services.yml — use plain HTTP to backend on RFC 1918
    http:
      services:
        my-app:
          loadBalancer:
            servers:
              - url: "http://10.0.1.16:8080"   # HTTP, not HTTPS, inside the cluster

    If you must use TLS to backends (compliance requirement), enable session caching and tune cipher suites in Traefik's TLS options:

    # traefik/dynamic/tls.yml
    tls:
      options:
        default:
          minVersion: VersionTLS12
          cipherSuites:
            - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
            - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
            - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
          sniStrict: true

    Also verify your certificate chain is complete. An incomplete chain forces clients to do additional round trips to fetch intermediate certificates, which adds latency that appears in Traefik's observed request times even though it's client-side:

    $ openssl s_client -connect solvethenetwork.com:443 2>&1 | grep -A3 'Certificate chain'
    
    Certificate chain
     0 s:CN = solvethenetwork.com
       i:C = US, O = Let's Encrypt, CN = R11
     1 s:C = US, O = Let's Encrypt, CN = R11
       i:C = US, O = Internet Security Research Group, CN = ISRG Root X1

    Two levels in the chain is correct for Let's Encrypt. If you see only level 0, your chain is incomplete.

    Root Cause 5: DNS Resolution Delay

    Traefik resolves backend hostnames at startup and when services are updated — but depending on your configuration and provider type, it may re-resolve on each request or at intervals. DNS failures or slow responses translate directly into request latency, and this one is notoriously difficult to spot because DNS issues are intermittent by nature.

    Why it happens: In Docker environments, Traefik often uses container names or service names as backend addresses. The Docker DNS resolver (127.0.0.11) handles these. Under heavy load or when containers are cycling, DNS responses can be slow or return stale records. In Kubernetes, CoreDNS becomes the chokepoint. In bare-metal or VM setups, if backends are specified by hostname rather than IP, every connection attempt that misses the OS DNS cache triggers a resolver query.

    How to identify it: Capture DNS query latency on the Traefik host:

    $ dig @127.0.0.11 my-app-service +stats 2>&1 | grep 'Query time'
    
    ;; Query time: 287 msec
    ;; SERVER: 127.0.0.11#53(127.0.0.11)
    ;; WHEN: Sat Apr 19 14:23:41 UTC 2026

    287ms for a DNS query is catastrophic. Even 20ms is bad when you're trying to serve requests in under 100ms total. Run it several times and watch for variance:

    $ for i in $(seq 1 10); do dig @127.0.0.11 my-app-service +short +stats 2>&1 | grep 'Query time'; done
    
    ;; Query time: 3 msec
    ;; Query time: 2 msec
    ;; Query time: 289 msec
    ;; Query time: 3 msec
    ;; Query time: 4 msec
    ;; Query time: 301 msec
    ;; Query time: 2 msec
    ;; Query time: 3 msec
    ;; Query time: 278 msec
    ;; Query time: 2 msec

    Every third or fourth query spikes to ~290ms. That's a resolver problem, not a backend problem. You can also use strace on the Traefik process to catch DNS syscalls directly, though that's invasive in production. A lighter approach is to enable Traefik's debug logging temporarily and grep for resolver activity:

    $ docker logs traefik 2>&1 | grep -i 'lookup\|resolv\|dial' | tail -20
    
    time="2026-04-19T14:23:39Z" level=debug msg="Dialing" network=tcp address="my-app-service:8080"
    time="2026-04-19T14:23:39Z" level=debug msg="Resolved" address="10.0.1.16:8080" duration=287ms
    time="2026-04-19T14:23:42Z" level=debug msg="Dialing" network=tcp address="my-app-service:8080"
    time="2026-04-19T14:23:42Z" level=debug msg="Resolved" address="10.0.1.16:8080" duration=3ms
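    To quantify spikes from a loop like the one above, pipe the dig output through a small awk summary. Sample timings are inlined here so the sketch is self-contained; in practice you'd pipe the loop's output straight in.

```shell
# Hypothetical dig timing lines captured from a resolver-test loop.
cat > /tmp/dig_times.txt <<'EOF'
;; Query time: 3 msec
;; Query time: 289 msec
;; Query time: 2 msec
;; Query time: 301 msec
;; Query time: 3 msec
EOF

# Average and worst-case query time. A large avg/max gap means the resolver
# stalls intermittently rather than being uniformly slow.
awk '{ sum += $4; if ($4 > max) max = $4 }
END { printf "avg %d msec, max %d msec\n", sum / NR, max }' /tmp/dig_times.txt
```

    A healthy local resolver should show both numbers in the low single-digit milliseconds; a max two orders of magnitude above the average is the intermittent-stall signature.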

    How to fix it: The most reliable fix is to use IP addresses instead of hostnames for backends in your Traefik configuration. On stable infrastructure, backends don't change IP often enough to justify DNS overhead on every connection:

    # Before — hostname-based (subject to DNS latency)
    http:
      services:
        my-app:
          loadBalancer:
            servers:
              - url: "http://my-app-service:8080"
    
    # After — IP-based (no DNS lookup)
    http:
      services:
        my-app:
          loadBalancer:
            servers:
              - url: "http://10.0.1.16:8080"

    If you're running in Kubernetes and can't hardcode IPs, fix the DNS resolver itself. Check CoreDNS pod health and resource limits — CoreDNS getting CPU-throttled is a classic source of intermittent slow lookups:

    $ kubectl -n kube-system top pods -l k8s-app=kube-dns
    
    NAME                       CPU(cores)   MEMORY(bytes)
    coredns-5d78c9869d-4kbf9   98m          28Mi
    coredns-5d78c9869d-7xvnp   97m          27Mi
    
    # If CPU limit is 100m and usage is 98m, you're throttling — increase the limit
    $ kubectl -n kube-system edit deployment coredns

    Also configure ndots and search domains carefully. A high ndots value (Kubernetes defaults to 5) means every short hostname triggers multiple failed DNS lookups before the resolver tries the absolute name:

    # In your Traefik pod spec, reduce ndots to cut DNS overhead
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"

    Root Cause 6: Middleware Chain Overhead

    Every middleware you attach to a router runs synchronously in the request path. Rate limiters, authentication middleware, compression, and header manipulation all add time. A middleware that calls an external service, like an auth provider, effectively adds a remote API call's latency to every request.

    How to identify it: Comment out middleware chains one at a time and measure latency with each removed. A faster approach is to create a test router with no middleware and route a subset of requests through it:

    # Diagnostic route with zero middleware — compare latency against production route
    http:
      routers:
        my-app-baseline:
          rule: "Host(`sw-infrarunbook-01.solvethenetwork.com`) && PathPrefix(`/debug-baseline`)"
          service: my-app
          # No middlewares attached
    
        my-app-prod:
          rule: "Host(`sw-infrarunbook-01.solvethenetwork.com`)"
          service: my-app
          middlewares:
            - rate-limit
            - auth-forward
            - compress
    $ curl -o /dev/null -s -w "%{time_total}\n" https://sw-infrarunbook-01.solvethenetwork.com/debug-baseline/api/data
    0.048
    
    $ curl -o /dev/null -s -w "%{time_total}\n" https://sw-infrarunbook-01.solvethenetwork.com/api/data
    0.743

    695ms of middleware overhead. Now remove middlewares one at a time to find the culprit. ForwardAuth calling a slow external validator is the usual suspect.

    Root Cause 7: Resource Starvation on the Traefik Host

    Traefik is a Go application — it handles concurrency well, but it still needs CPU, memory, and file descriptors. A host under memory pressure will trigger Go's GC more frequently, causing stop-the-world pauses that show up as request latency spikes. CPU saturation causes request queuing. File descriptor exhaustion prevents new connections from opening.

    How to identify it:

    # Check file descriptor usage
    $ cat /proc/$(pgrep traefik)/limits | grep 'open files'
    Max open files            65536                65536                files
    
    $ ls /proc/$(pgrep traefik)/fd | wc -l
    64891
    
    # At 64891 out of 65536 — nearly exhausted
    
    # Check GC pressure via Traefik metrics
    $ curl -s http://sw-infrarunbook-01:8082/metrics | grep go_gc
    go_gc_duration_seconds{quantile="0.5"} 0.000312
    go_gc_duration_seconds{quantile="1"} 0.089341      # 89ms worst-case GC pause!

    How to fix it: Raise the file descriptor limit in your systemd unit or Docker run arguments, and give Traefik enough headroom:

    # /etc/systemd/system/traefik.service
    [Service]
    LimitNOFILE=1048576
    LimitNPROC=65536

    For GC pressure, set GOGC higher (reduces GC frequency at the cost of more memory) or increase the container memory limit so Go's runtime doesn't GC as aggressively.

    Prevention

    Most of these issues are preventable with the right monitoring and baseline configuration. Set up Prometheus scraping of Traefik's /metrics endpoint from day one and create alerts on traefik_service_request_duration_seconds p95 and p99. Alert before users notice, not after.

    Standardize on IP-based backend addressing wherever possible to eliminate DNS as a variable. When you first deploy a service, run a five-minute load test and capture a baseline latency profile — that baseline becomes your regression detector when things change later.

    For TLS, let Traefik manage certificates via Let's Encrypt with the ACME provider and use HTTP to communicate with internal backends. Double-encryption inside a trusted network is a latency tax with no security dividend on most architectures.

    Keep health check intervals sane. I'd set a floor of 15-30 seconds for most services. The additional seconds it takes to detect a failed backend are almost always acceptable compared to the constant probe overhead of aggressive health checking.

    Finally, load test your middleware chains before going to production. ForwardAuth in particular deserves its own benchmark — point it at a mock auth service that returns immediately and compare against production auth latency. If your auth provider ever has a bad day, every single request through Traefik will pay that cost.

    Document your ServersTransport configuration alongside your Traefik deployment. The default values are conservative and will surprise you during a traffic event if you haven't tuned them. Treat connection pool sizing like you'd treat database connection pool sizing — it matters, and the defaults aren't right for production traffic.

    Frequently Asked Questions

    How do I tell if latency is coming from Traefik itself or from the backend service?

    Check Traefik's access log for the Duration field, then curl the backend directly using its RFC 1918 IP and compare response times. If direct backend calls are fast and Traefik's logged duration is high, the proxy is the issue. If both are slow, the backend is the culprit regardless of what's in front of it.

    What is a safe health check interval for Traefik in production?

    15 to 30 seconds is a reasonable floor for most services. Shorter intervals like 1-5 seconds compete with real traffic for connection pool slots and add backend load. The trade-off is a slightly longer window to detect a failed backend, which is almost always acceptable.

    Should Traefik use HTTPS when proxying to backends inside the same private network?

    Generally no. If Traefik terminates TLS from external clients and your backends are on a trusted RFC 1918 network, using plain HTTP to the backend eliminates TLS handshake overhead on every connection. Double-encrypting on a private network adds latency with no meaningful security benefit in most architectures.

    How do I check if Traefik's connection pool is exhausted?

    Scrape the Prometheus metrics endpoint at /metrics and check traefik_service_open_connections. If it's at or near your configured maxIdleConnsPerHost limit, you're pool-exhausted. You can also run ss -s on the Traefik host to see total TCP connection counts and TIME_WAIT socket accumulation.

    Can middleware like ForwardAuth cause high latency in Traefik?

    Absolutely. ForwardAuth makes a synchronous HTTP call to an external auth service on every request. If that auth service is slow or has any latency spike, every proxied request pays that cost. Benchmark your auth endpoint independently and set a tight timeout in the ForwardAuth middleware configuration so slow auth fails fast rather than queuing indefinitely.
