InfraRunBook

    Nginx Keepalive Connection Issues

    Nginx
    Published: Apr 20, 2026
    Updated: Apr 20, 2026

    Diagnose and fix Nginx keepalive connection problems including misconfigured keepalive_timeout, missing upstream keepalive pools, HTTP/1.0 backend issues, and connection churn causing 502 errors and TIME_WAIT accumulation.


    Symptoms

    You're seeing intermittent 502 Bad Gateway errors under load. The error log is spitting out lines like

    connect() failed (111: Connection refused) while connecting to upstream

    even though the upstream service is clearly running and healthy. Response times are spiking unpredictably. ss -s shows a growing pile of connections in TIME_WAIT state. Your monitoring shows TCP connection establishment rates far exceeding what the actual request rate would suggest — you're serving 500 requests per second but creating 490 new TCP connections every second to do it.
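    One quick way to quantify that mismatch on the box itself is to watch the kernel's TCP accept counter. This is a sketch using /proc/net/snmp (Linux-specific; passive_opens is our own helper name): if the new-connection count over a short window is anywhere near your request count for the same window, keepalive reuse is broken.

```shell
#!/bin/sh
# Diff the kernel's PassiveOpens counter over a short window to estimate how
# many brand-new TCP connections this host accepted in that interval.
# (Linux-only: reads /proc/net/snmp.)
passive_opens() {
    awk '/^Tcp:/ {
        # The first Tcp: line holds column headers, the second holds values.
        for (i = 1; i <= NF; i++) if ($i == "PassiveOpens") col = i
        if (col && $2 ~ /^[0-9]+$/) print $col   # only the numeric values line
    }' /proc/net/snmp
}

before=$(passive_opens)
sleep 2
after=$(passive_opens)
echo "new TCP connections accepted in 2s: $((after - before))"
```

    Compare that figure against the request rate from your access logs or stub_status; with working keepalive it should be a small fraction of it.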

    These are the classic fingerprints of keepalive misconfiguration. Nginx is supposed to be reusing connections — both toward clients and toward upstream backends — but something is breaking that reuse. New TCP handshakes are happening when they shouldn't, and under high load that overhead compounds into real latency and errors. In my experience, this category of problem is responsible for a disproportionate amount of "the server is slow but we can't figure out why" tickets, because the app itself is fine and the symptoms are diffuse.

    Let's work through every meaningful cause, with commands you can run right now to confirm which one you're dealing with.


    Root Cause 1: keepalive_requests Too Low

    Why It Happens

    Nginx has a directive called keepalive_requests that caps how many requests a single keepalive connection can serve before Nginx forcibly closes it. The historical default was 100 — raised to 1000 in Nginx 1.19.10 — and on older deployments you'll still see the old default in effect. Once a connection reaches that limit, Nginx sends a Connection: close header and tears down the TCP session. The client has to perform a full handshake for the next request. At high traffic volumes, this generates a constant churn of new connections even when everything else looks correct.
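    Back-of-envelope arithmetic makes the impact concrete. Assuming a steady 500 req/s (the rate from the symptoms above), the cap alone forces a floor of request_rate / keepalive_requests new connections per second:

```shell
# Minimum new-connection rate forced purely by the keepalive_requests cap.
# rps is an assumed steady request rate; timeouts and client behavior can
# only push real churn higher than this floor.
rps=500

conn_rate() { awk -v rps="$1" -v cap="$2" 'BEGIN { printf "%.2f", rps / cap }'; }

old_rate=$(conn_rate "$rps" 100)     # historical default cap
new_rate=$(conn_rate "$rps" 10000)   # tuned cap
echo "keepalive_requests 100   -> $old_rate new conns/s"
echo "keepalive_requests 10000 -> $new_rate new conns/s"
```

    With these numbers, raising the cap drops the forced churn from 5 new connections per second to one every 20 seconds.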

    This is one of the most commonly overlooked tunables I've seen. Engineers deploy Nginx, don't touch keepalive_requests, and then wonder why TIME_WAIT is in the thousands when they're only serving a few hundred concurrent users.

    How to Identify It

    Check the effective value first:

    nginx -T | grep keepalive_requests

    No output means the compiled default is in effect. Then watch the connection state breakdown:

    ss -s

    You'll see output like this on an affected server:

    Total: 4821
    TCP:   3104 (estab 312, closed 2791, orphaned 14, timewait 2768)
    
    Transport Total     IP        IPv6
    *         0         -         -
    RAW       0         0         0
    UDP       6         4         2
    TCP       313       310       3
    INET      319       314       5
    FRAG      0         0         0

    A large timewait count relative to established connections is a strong signal. Cross-reference with the stub_status module output:

    curl -s http://10.0.1.10/nginx_status
    Active connections: 318
    server accepts handled requests
     4821823 4821823 24109115
    Reading: 0 Writing: 12 Waiting: 306

    The ratio of requests to handled connections tells the story. Here it's about 5:1, which means on average each connection served only five requests before being closed — far below what healthy keepalive reuse looks like. Ideally this ratio should be much higher, in the hundreds or thousands.
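    You can compute that ratio in one line rather than eyeballing it. A sketch fed with the sample output above (in practice, pipe the curl output straight in):

```shell
# Requests per handled connection from stub_status output. The heredoc holds
# the sample numbers from this article; replace it with
#   curl -s http://10.0.1.10/nginx_status
ratio=$(awk 'prev { printf "%.1f", $3 / $2; exit }
             /accepts handled requests/ { prev = 1 }' <<'EOF'
Active connections: 318
server accepts handled requests
 4821823 4821823 24109115
Reading: 0 Writing: 12 Waiting: 306
EOF
)
echo "requests per handled connection: $ratio"
```

    For this sample it prints 5.0; anything close to 1.0 means essentially no reuse at all.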

    How to Fix It

    Set keepalive_requests to a significantly higher value in your http block:

    http {
        keepalive_requests 10000;
        keepalive_timeout  65s;
    }

    Reload Nginx and watch the TIME_WAIT count drop within minutes. For high-traffic deployments serving millions of requests per hour, values of 100000 are perfectly reasonable. The connection teardown itself is cheap — it's the new three-way handshake on the client side that costs you.


    Root Cause 2: keepalive_timeout Misconfigured

    Why It Happens

    The keepalive_timeout directive controls how long Nginx holds an idle keepalive connection open waiting for the next request. It accepts two parameters: the server-side close timeout and an optional second value that populates the Keep-Alive: timeout=N header sent to clients.

    Two distinct failure modes exist. The first: timeout set too low — say, 5 or 10 seconds — when clients are making requests further apart than that interval. A dashboard polling every 30 seconds, a mobile app retrying after a user interaction, a service making infrequent API calls. The keepalive connection expires before the next request arrives, forcing a fresh TCP handshake every time. The second failure mode is a timeout mismatch with an adjacent proxy layer — an AWS ALB, HAProxy, or another reverse proxy with its own idle connection timeout. Whichever side has the shorter timeout closes the connection while the other side still believes it's reusable, and the next request sent over that connection fails.

    How to Identify It

    For the too-low case, capture real client traffic and measure the gap between requests on persistent connections:

    tcpdump -i eth0 -n 'host 10.0.1.10 and port 443' -w /tmp/capture.pcap

    Open the capture in Wireshark and filter by TCP stream. Look at the delta timestamps between requests on the same stream. If those gaps regularly exceed your configured keepalive_timeout, you're closing connections before they can be reused.

    For the upstream timeout mismatch, this specific error in your Nginx log is the giveaway:

    2026/04/20 09:14:22 [error] 12847#12847: *3841092 upstream prematurely closed connection
    while reading response header from upstream, client: 10.0.1.55,
    server: solvethenetwork.com, request: "GET /api/data HTTP/1.1",
    upstream: "http://10.0.2.20:8080/api/data"

    "Upstream prematurely closed connection" means a keepalive connection was reused but the upstream had already closed its end in the interim.
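    To see whether those errors cluster (after deploys, after idle periods) rather than spread evenly, bucket them per minute. A sketch with sample lines standing in for /var/log/nginx/error.log:

```shell
# Count "prematurely closed" upstream errors per minute. Swap the printf
# sample lines for: grep 'prematurely closed' /var/log/nginx/error.log
buckets=$(printf '%s\n' \
  '2026/04/20 09:14:22 [error] 12847#12847: *3841092 upstream prematurely closed connection' \
  '2026/04/20 09:14:59 [error] 12847#12847: *3841311 upstream prematurely closed connection' \
  '2026/04/20 09:17:03 [error] 12847#12847: *3842770 upstream prematurely closed connection' |
  awk '/prematurely closed/ {
      split($2, t, ":")                     # HH:MM:SS -> minute bucket
      count[$1 " " t[1] ":" t[2]]++
  }
  END { for (m in count) print m, count[m] }' | sort)
echo "$buckets"
```

    Errors concentrated right after quiet periods point at an idle-timeout mismatch rather than a backend crash.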

    How to Fix It

    Match the timeout to your actual traffic pattern. For most applications a value around 65 seconds works well:

    http {
        keepalive_timeout 65s;
        keepalive_requests 10000;
    }

    The direction of the mismatch determines the fix. If a load balancer sits in front of Nginx (an AWS ALB forwarding to Nginx, with its default 60-second idle timeout), set Nginx's client-facing keepalive_timeout higher than the LB's idle timeout, so the LB always times out first and never tries to reuse a connection Nginx has just closed:

    http {
        keepalive_timeout 75s;
    }

    If instead Nginx is proxying to a layer with its own 60-second idle timeout, make sure Nginx stops reusing those upstream connections first, either with keepalive_timeout in the upstream block on Nginx 1.15.3+ or by raising the other side's timeout as covered under Root Cause 5.

    You can also explicitly control what you advertise to clients with the optional second parameter:

    # server closes idle connections after 65s
    # but tells clients to expect 60s
    keepalive_timeout 65s 60s;

    Root Cause 3: Upstream Keepalive Not Enabled

    Why It Happens

    This one catches a lot of engineers off guard. By default, Nginx does not maintain a connection pool to upstream servers. Every proxied request results in a brand new TCP connection to the backend — connection opens, request goes through, connection closes. No reuse. If you're reverse-proxying to an application server on the same host or in the same datacenter segment, you're paying full TCP handshake overhead on every single request for no reason. At high request rates you'll see TIME_WAIT sockets piling up on the upstream side, and you'll eventually hit port exhaustion or upstream connection queue limits.

    The keepalive directive in the upstream block was specifically designed to fix this, but it has to be explicitly enabled — it doesn't activate automatically just because you define an upstream group.

    How to Identify It

    Inspect your upstream configuration for the presence of the keepalive directive:

    nginx -T | grep -A 20 'upstream '

    If your output looks like this, you're not using upstream keepalive:

    upstream backend_pool {
        server 10.0.2.20:8080;
        server 10.0.2.21:8080;
    }

    Confirm by watching the connection state on the backend port from the Nginx host in real time:

    watch -n1 "ss -n state time-wait '( dport = :8080 or sport = :8080 )' | wc -l"

    If that number climbs steadily with traffic and never stabilizes, you're not reusing connections to the backend. Every request is leaving a TIME_WAIT socket behind and creating a new one.

    How to Fix It

    Add the keepalive directive to the upstream block and configure the proxy location correctly. Both changes are required:

    upstream backend_pool {
        server 10.0.2.20:8080;
        server 10.0.2.21:8080;
        keepalive 32;
    }
    
    server {
        listen 443 ssl;
        server_name solvethenetwork.com;
    
        location /api/ {
            proxy_pass http://backend_pool;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header Host $host;
        }
    }

    The keepalive 32 value sets the maximum number of idle keepalive connections each Nginx worker process will hold open to this upstream group. Setting proxy_http_version 1.1 is mandatory — HTTP/1.0 doesn't support persistent connections properly, and Nginx defaults to 1.0 when talking to upstreams. Clearing the Connection header with proxy_set_header Connection "" is equally important: by default Nginx sends Connection: close on every request it proxies, telling the backend to tear the connection down after responding and leaving your keepalive pool unused.


    Root Cause 4: HTTP/1.0 from Upstream

    Why It Happens

    As noted above, Nginx defaults to HTTP/1.0 when communicating with upstream backends. HTTP/1.0 treats every connection as non-persistent by default — keepalive was a non-standard extension bolted on afterward and isn't reliably supported across all implementations. If you've added the upstream keepalive pool but forgotten to set proxy_http_version 1.1, Nginx is sending HTTP/1.0 requests and the backend is responding with Connection: close, terminating the connection after every request. Your carefully configured connection pool sits empty and unused.

    I've also seen the reverse: backends that speak HTTP/1.0 only, regardless of what Nginx sends. Older Java servlet containers, legacy CGI backends, some embedded HTTP servers, and certain ancient internal microservices fall into this category. They won't maintain keepalive connections no matter what you do on the Nginx side.

    How to Identify It

    Capture the actual wire traffic between Nginx and the backend on sw-infrarunbook-01:

    tcpdump -i any -n -A 'host 10.0.2.20 and port 8080' 2>/dev/null | grep -E "^(GET|POST|PUT|DELETE|HTTP|Connection)"

    If Nginx is sending HTTP/1.0, you'll see:

    GET /api/users HTTP/1.0
    Host: 10.0.2.20

    If the backend itself only speaks HTTP/1.0, you'll see responses like:

    HTTP/1.0 200 OK
    Connection: close
    Content-Type: application/json

    You can also probe the backend directly to check its protocol support:

    curl -v --http1.1 http://10.0.2.20:8080/api/health 2>&1 | grep -iE "^[<>] (http|connection)"

    A healthy HTTP/1.1 backend responds like this:

    < HTTP/1.1 200 OK
    < Connection: keep-alive

    A broken one responds like this:

    < HTTP/1.0 200 OK
    < Connection: close

    How to Fix It

    On the Nginx side, always be explicit about the protocol version in every proxy location block:

    location / {
        proxy_pass http://backend_pool;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    If the backend genuinely only supports HTTP/1.0 and you can't change it — legacy third-party service, vendor appliance, old internal app that nobody wants to touch — keepalive to that specific upstream isn't going to happen. In that case, shift focus to minimizing the cost of new connections: expand the local port range and let the kernel reuse sockets in TIME_WAIT for new outgoing connections, which is exactly what Nginx-to-upstream traffic is:

    sysctl -w net.ipv4.tcp_tw_reuse=1
    sysctl -w net.ipv4.tcp_fin_timeout=15
    sysctl -w net.ipv4.ip_local_port_range="1024 65535"

    Make those persistent in /etc/sysctl.d/99-nginx-tuning.conf on sw-infrarunbook-01:

    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_fin_timeout = 15
    net.ipv4.ip_local_port_range = 1024 65535

    Then load the file:

    sysctl -p /etc/sysctl.d/99-nginx-tuning.conf

    Root Cause 5: Backend Closing the Connection Early

    Why It Happens

    Even with everything correct on the Nginx side, the backend application can unilaterally close connections before Nginx expects it. This happens because the backend has its own keepalive idle timeout shorter than Nginx's upstream connection pool idle time, the backend hits an internal connection limit and starts terminating connections, or the backend application has a bug where it sends Connection: close on specific response types.

    The dangerous scenario is a race condition: Nginx has a connection sitting idle in its upstream keepalive pool, Nginx thinks it's valid, and the backend has already closed its end. The next request Nginx routes over that connection gets an immediate RST or EOF, which propagates as a 502 to the client. You can't eliminate this race entirely — it's inherent to the keepalive model — but you can make it transparent to users.

    How to Identify It

    This specific error signature in the Nginx error log is the tell:

    2026/04/20 11:32:07 [error] 12847#12847: *4920183 recv() failed (104: Connection reset by peer)
    while reading response header from upstream, client: 10.0.1.88,
    server: solvethenetwork.com, request: "POST /api/submit HTTP/1.1",
    upstream: "http://10.0.2.20:8080/api/submit",
    host: "solvethenetwork.com"

    Error 104 (ECONNRESET) on upstream reads, from a backend that's otherwise healthy when you hit it directly, means Nginx reused a connection the backend had already closed. Check the backend's own keepalive configuration. For a Node.js backend running on 10.0.2.20:

    ssh infrarunbook-admin@10.0.2.20 'node -e "const h = require(\"http\"); const s = h.createServer(); console.log(s.keepAliveTimeout);"'

    The Node.js default keepAliveTimeout was 5000ms (5 seconds) in older versions — drastically shorter than any reasonable Nginx upstream keepalive idle time, causing this race condition constantly under even modest load.

    How to Fix It

    Two complementary fixes. First, configure Nginx to transparently retry on upstream errors so the race doesn't surface as a client-visible failure:

    location /api/ {
        proxy_pass http://backend_pool;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_next_upstream error timeout invalid_header http_500 http_502;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 5s;
    }

    Second — and more importantly — align the backend's keepalive timeout so it's always higher than Nginx's upstream idle time. For a Node.js backend, set this explicitly in your server initialization code:

    const server = app.listen(8080);
    server.keepAliveTimeout = 75000;  // 75 seconds in ms
    server.headersTimeout = 80000;    // must exceed keepAliveTimeout

    For a Gunicorn backend, add to its configuration on sw-infrarunbook-01:

    keepalive = 75

    The rule is simple: the backend's keepalive timeout must always be higher than whatever idle time could elapse in Nginx's upstream connection pool. Older Nginx versions don't have a built-in per-pool idle timeout (keepalive_timeout in the upstream block was added in Nginx 1.15.3), so making the backend more permissive is the reliable cross-version fix.
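    On Nginx 1.15.3 or newer you can additionally bound the pool's idle time on the Nginx side. A sketch reusing this article's backend pool; the 30s figure is an assumption, chosen to sit comfortably below the 75-second backend timeouts shown above:

```nginx
upstream backend_pool {
    server 10.0.2.20:8080;
    server 10.0.2.21:8080;
    keepalive 32;
    # Close pooled connections idle for more than 30s, well before the
    # backend's own keepalive timeout can close them from the other end.
    keepalive_timeout 30s;
}
```

    With both knobs in place, the race window only opens if the backend's timeout drops below this value.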


    Root Cause 6: Worker Connection Limits and File Descriptor Exhaustion

    Why It Happens

    Keepalive connections hold file descriptors open. Each Nginx worker can handle at most worker_connections simultaneous connections, and that budget covers both client-facing and upstream sockets — a proxied request consumes two slots, not one. On top of that, each worker holds up to the upstream keepalive value in idle pooled connections per upstream group. On a busy server with multiple worker processes and generous keepalive settings, you can exhaust the allowed file descriptors faster than you'd expect, and Nginx will start refusing connections or logging errors and dropping requests entirely.
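    Rough arithmetic, with every number an assumed example, shows how quickly the budget shrinks once you account for both sides of each proxied request:

```shell
# Per-worker capacity estimate for a proxying Nginx. All inputs are assumed
# example values — substitute your own configuration.
capacity=$(awk 'BEGIN {
    worker_connections = 4096   # events { worker_connections }
    idle_upstream_pool = 64     # upstream keepalive value
    misc_overhead      = 50     # listen sockets, log files, etc. (guess)
    # each in-flight proxied request needs a client slot and an upstream slot
    printf "%d", (worker_connections - idle_upstream_pool - misc_overhead) / 2
}')
echo "max concurrent proxied requests per worker: ~$capacity"
```

    With those numbers a worker tops out around 1991 in-flight proxied requests, well below the nominal 4096.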

    How to Identify It

    Check the current file descriptor limit for the running Nginx master process:

    cat /proc/$(cat /var/run/nginx.pid)/limits | grep "open files"
    Limit                     Soft Limit           Hard Limit           Units
    Max open files            1024                 1024                 files

    A soft limit of 1024 is a classic misconfiguration. Cross-reference with your configured worker_connections:

    nginx -T | grep worker_connections

    And look for this specific alert in the error log:

    2026/04/20 14:22:11 [alert] 12847#12847: *5020012 1024 worker_connections are not enough
    while connecting to upstream

    That's a hard confirmation. Nginx is hitting the wall.

    How to Fix It

    Increase worker_connections in nginx.conf and set worker_processes to match your CPU count:

    worker_processes auto;
    
    events {
        worker_connections 4096;
        use epoll;
        multi_accept on;
    }

    Then raise the system file descriptor limit for the Nginx service. Create a systemd override on sw-infrarunbook-01 as infrarunbook-admin:

    mkdir -p /etc/systemd/system/nginx.service.d
    cat > /etc/systemd/system/nginx.service.d/limits.conf <<EOF
    [Service]
    LimitNOFILE=65536
    EOF
    systemctl daemon-reload
    systemctl reload nginx

    Verify the new limit took effect:

    cat /proc/$(cat /var/run/nginx.pid)/limits | grep "open files"
    Limit                     Soft Limit           Hard Limit           Units
    Max open files            65536                65536                files

    Prevention

    Prevention is about getting the configuration right once across all layers and then monitoring the indicators that tell you when something has drifted. Start with a solid baseline that explicitly handles keepalive at every layer rather than relying on defaults. Here's a production-ready template for sw-infrarunbook-01:

    worker_processes auto;
    
    events {
        worker_connections 4096;
        use epoll;
        multi_accept on;
    }
    
    http {
        keepalive_timeout 55s;
        keepalive_requests 10000;
    
        upstream backend_pool {
            server 10.0.2.20:8080;
            server 10.0.2.21:8080;
            keepalive 64;
            keepalive_requests 1000;
            keepalive_timeout 60s;
        }
    
        server {
            listen 443 ssl;
            server_name solvethenetwork.com;
    
            location / {
                proxy_pass http://backend_pool;
                proxy_http_version 1.1;
                proxy_set_header Connection "";
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_next_upstream error timeout invalid_header;
                proxy_next_upstream_tries 2;
                proxy_next_upstream_timeout 5s;
            }
        }
    }

    For ongoing monitoring, track the TIME_WAIT count continuously. Wire it into your metrics stack or add a cron on sw-infrarunbook-01 as infrarunbook-admin to log it:

    * * * * * infrarunbook-admin echo "$(date -Is) $(ss -s | grep -oP 'timewait \K[0-9]+')" >> /var/log/nginx/timewait_count.log

    Watch the stub_status ratio of requests to accepted connections over time. If that ratio approaches 1:1, keepalive is broken somewhere in your stack. A healthy ratio — requests significantly outnumbering accepted connections — means connections are being reused as intended.

    When you upgrade Nginx, re-read the release notes for default changes and run nginx -T to review the effective configuration, not just the files you edited. The keepalive_requests default change in 1.19.10 is a good example: if you had explicit configs below the new default, they still override it, and you might assume you got a free improvement when you didn't.

    Finally, make backend keepalive alignment a checklist item for every new service deployment. Before any backend goes behind Nginx in production: confirm it speaks HTTP/1.1, confirm its keepalive timeout is higher than Nginx's upstream pool idle time, and confirm it doesn't send Connection: close on normal 200 responses. Run the curl -v probe against it directly. These three checks catch the majority of issues covered in this article before they ever reach production traffic.
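    Those checks are easy to script into a CI gate. A minimal sketch: check_keepalive_headers is our own helper, it only inspects header text you capture with curl -sI or the curl -v probe, and its matching is case-sensitive (real headers can vary in case, which this simple version doesn't handle):

```shell
#!/bin/sh
# Fail a deployment check if a backend's response headers indicate it won't
# hold keepalive connections. Header matching here is case-sensitive.
check_keepalive_headers() {
    case "$1" in
        *"HTTP/1.1"*) ;;                     # must answer with HTTP/1.1
        *) echo "FAIL: backend did not answer with HTTP/1.1"; return 1 ;;
    esac
    case "$1" in
        *"Connection: close"*) echo "FAIL: backend sent Connection: close"; return 1 ;;
    esac
    echo "OK: backend looks keepalive-friendly"
}

# Example runs against the two responses shown earlier in this article:
good=$(check_keepalive_headers 'HTTP/1.1 200 OK
Connection: keep-alive')
bad=$(check_keepalive_headers 'HTTP/1.0 200 OK
Connection: close')
echo "$good"
echo "$bad"
```

    Wire it into the deploy pipeline with the real backend's headers and the check runs itself on every release.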

    Frequently Asked Questions

    Why does Nginx show '502 Bad Gateway' with 'recv() failed (104: Connection reset by peer)' from upstream?

    Error 104 (ECONNRESET) from an upstream during response header reading typically means Nginx reused a keepalive connection from its pool that the backend had already closed. The backend's keepalive timeout is shorter than Nginx's upstream idle time. Fix it by raising the backend's keepalive timeout above Nginx's and adding proxy_next_upstream error to retry transparently on the next upstream server.

    How do I enable upstream keepalive in Nginx?

    Add the keepalive directive to your upstream block (e.g., keepalive 32;), then in your proxy location add proxy_http_version 1.1; and proxy_set_header Connection ""; — all three are required. Without proxy_http_version 1.1, Nginx uses HTTP/1.0 which doesn't support persistent connections. Without clearing the Connection header, client headers can override your keepalive setting.

    What should I set keepalive_requests to in Nginx?

    The default changed from 100 to 1000 in Nginx 1.19.10. For high-traffic sites, set it to 10000 or higher. A low value forces frequent connection teardowns that generate TIME_WAIT sockets and slow down new connection establishment. Monitor your requests-to-accepted-connections ratio in nginx_status to verify keepalive reuse is working as expected.

    How do I know if Nginx keepalive connections are actually working?

    Check the stub_status output at /nginx_status and compare the 'requests' count to the 'handled' connections count. A healthy keepalive configuration shows requests significantly outnumbering handled connections — a ratio of 10:1 or higher. Also watch 'ss -s' for TIME_WAIT count; a high TIME_WAIT relative to established connections indicates connections are not being reused.

    Why does setting keepalive_timeout too high cause problems?

    A very high or unlimited keepalive_timeout holds file descriptors open indefinitely, and under slow-client conditions this can exhaust your worker_connections limit. Alignment with adjacent proxy layers matters more than the absolute value: a load balancer in front of Nginx (an AWS ALB defaults to 60 seconds) should always time out before Nginx does, so keep Nginx's client-facing keepalive_timeout above the LB's idle timeout; on the upstream side, Nginx should stop reusing pooled connections before the backend's own keepalive timeout expires, or you'll see 'upstream prematurely closed connection' errors.
