Symptoms
You're seeing intermittent 502 Bad Gateway errors under load. The error log is spitting out lines like `connect() failed (111: Connection refused) while connecting to upstream` even though the upstream service is clearly running and healthy. Response times are spiking unpredictably. `ss -s` shows a growing pile of connections in `TIME_WAIT` state. Your monitoring shows TCP connection establishment rates far exceeding what the actual request rate would suggest — you're serving 500 requests per second but creating 490 new TCP connections every second to do it.
These are the classic fingerprints of keepalive misconfiguration. Nginx is supposed to be reusing connections — both toward clients and toward upstream backends — but something is breaking that reuse. New TCP handshakes are happening when they shouldn't, and under high load that overhead compounds into real latency and errors. In my experience, this category of problem is responsible for a disproportionate amount of "the server is slow but we can't figure out why" tickets, because the app itself is fine and the symptoms are diffuse.
Let's work through every meaningful cause, with commands you can run right now to confirm which one you're dealing with.
Root Cause 1: keepalive_requests Too Low
Why It Happens
Nginx has a directive called `keepalive_requests` that caps how many requests a single keepalive connection can serve before Nginx forcibly closes it. The historical default was 100 — raised to 1000 in Nginx 1.19.10 — and on older deployments you'll still see the old default in effect. Once a connection reaches that limit, Nginx sends a `Connection: close` header and tears down the TCP session. The client has to perform a full handshake for the next request. At high traffic volumes, this generates a constant churn of new connections even when everything else looks correct.
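The arithmetic behind the churn is worth seeing. A quick sketch, assuming the idealized best case where every connection survives until the cap kicks in (the traffic figure is hypothetical):

```shell
# How many new TCP connections does a keepalive_requests cap force?
# connections = ceil(requests / cap)
requests_per_hour=1800000
cap_old=100       # pre-1.19.10 default
cap_new=10000     # a tuned value

conns_old=$(( (requests_per_hour + cap_old - 1) / cap_old ))
conns_new=$(( (requests_per_hour + cap_new - 1) / cap_new ))
echo "cap=100:   $conns_old new connections/hour"
echo "cap=10000: $conns_new new connections/hour"
```

Even in this best case, the old cap forces two orders of magnitude more handshakes than a tuned value; real traffic with timeouts and client churn will be worse.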
This is one of the most commonly overlooked tunables I've seen. Engineers deploy Nginx, don't touch `keepalive_requests`, and then wonder why `TIME_WAIT` is in the thousands when they're only serving a few hundred concurrent users.
How to Identify It
Check the effective value first:
nginx -T | grep keepalive_requests
No output means the compiled default is in effect. Then watch the connection state breakdown:
ss -s
You'll see output like this on an affected server:
Total: 4821
TCP: 3104 (estab 312, closed 2791, orphaned 14, timewait 2768)
Transport Total IP IPv6
* 0 - -
RAW 0 0 0
UDP 6 4 2
TCP 313 310 3
INET 319 314 5
FRAG 0 0 0
A large `timewait` count relative to established connections is a strong signal. Cross-reference with the stub_status module output:
curl -s http://10.0.1.10/nginx_status
Active connections: 318
server accepts handled requests
4821823 4821823 24109115
Reading: 0 Writing: 12 Waiting: 306
The ratio of requests to handled connections tells the story. Here it's about 5:1, meaning each connection served an average of five requests before being closed. Ideally this ratio should be much higher, in the hundreds or thousands; a low ratio means connections are being torn down early, whether by a low `keepalive_requests` cap, a short timeout, or clients disconnecting.
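You can compute that ratio directly from a stub_status snapshot. A small sketch, using the sample figures from the output above:

```shell
# Parse the counters line of stub_status output ("accepts handled requests")
# and print requests-per-connection, the keepalive reuse indicator.
status='Active connections: 318
server accepts handled requests
 4821823 4821823 24109115
Reading: 0 Writing: 12 Waiting: 306'

# Line 3 holds the counters: $2 = handled connections, $3 = total requests.
ratio=$(printf '%s\n' "$status" | awk 'NR==3 { printf "%.1f", $3 / $2 }')
echo "requests per connection: $ratio"
```

In a live check you'd replace the hardcoded sample with `curl -s http://10.0.1.10/nginx_status`.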
How to Fix It
Set `keepalive_requests` to a significantly higher value in your `http` block:
http {
keepalive_requests 10000;
keepalive_timeout 65s;
}
Reload Nginx and watch the `TIME_WAIT` count drop within minutes. For high-traffic deployments serving millions of requests per hour, values of 100000 are perfectly reasonable. The connection teardown itself is cheap — it's the new three-way handshake (plus TLS negotiation, for HTTPS) on the client side that costs you.
Root Cause 2: keepalive_timeout Misconfigured
Why It Happens
The `keepalive_timeout` directive controls how long Nginx holds an idle keepalive connection open waiting for the next request. It accepts two parameters: the server-side close timeout and an optional second value that populates the `Keep-Alive: timeout=N` header sent to clients.
Two distinct failure modes exist. The first: timeout set too low — say, 5 or 10 seconds — when clients are making requests further apart than that interval. A dashboard polling every 30 seconds, a mobile app retrying after a user interaction, a service making infrequent API calls. The keepalive connection expires before the next request arrives, forcing a fresh TCP handshake every time. The second failure mode is a timeout mismatch with another proxy layer — an AWS ALB or HAProxy in front of Nginx, or an intermediate proxy between Nginx and the backend. Whichever side times out its idle connections first can close one at the exact moment the other side reuses it, and that request dies mid-flight.
How to Identify It
For the too-low case, capture real client traffic and measure the gap between requests on persistent connections:
tcpdump -i eth0 -n 'host 10.0.1.10 and port 443' -w /tmp/capture.pcap
Open the capture in Wireshark and filter by TCP stream. Look at the delta timestamps between requests on the same stream. If those gaps regularly exceed your configured `keepalive_timeout`, you're closing connections before they can be reused.
For the upstream timeout mismatch, this specific error in your Nginx log is the giveaway:
2026/04/20 09:14:22 [error] 12847#12847: *3841092 upstream prematurely closed connection
while reading response header from upstream, client: 10.0.1.55,
server: solvethenetwork.com, request: "GET /api/data HTTP/1.1",
upstream: "http://10.0.2.20:8080/api/data"
"Upstream prematurely closed connection" means a keepalive connection was reused but the upstream had already closed its end in the interim.
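To see whether one backend is closing early more than the others, tally these errors per upstream address. A sketch that works on sample lines here; against a real server you'd feed it the error log (the path `/var/log/nginx/error.log` is an assumption):

```shell
# Count "upstream prematurely closed" errors per upstream address.
# The log lines below are samples standing in for the real error log.
log='2026/04/20 09:14:22 [error] 12847#12847: *1 upstream prematurely closed connection while reading response header from upstream, upstream: "http://10.0.2.20:8080/api/data"
2026/04/20 09:14:25 [error] 12847#12847: *2 upstream prematurely closed connection while reading response header from upstream, upstream: "http://10.0.2.20:8080/api/data"
2026/04/20 09:14:29 [error] 12847#12847: *3 upstream prematurely closed connection while reading response header from upstream, upstream: "http://10.0.2.21:8080/api/data"'

counts=$(printf '%s\n' "$log" \
  | grep 'upstream prematurely closed' \
  | grep -o 'upstream: "[^"]*"' \
  | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"
```

A heavy skew toward one address points at that backend's configuration rather than a global Nginx problem.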
How to Fix It
Match the timeout to your actual traffic pattern. For most applications a value around 65 seconds works well:
http {
keepalive_timeout 65s;
keepalive_requests 10000;
}
If you're sitting behind an AWS ALB or similar load balancer, set Nginx's `keepalive_timeout` higher than the load balancer's idle timeout — 75 seconds against the ALB's default 60 — so the LB always closes idle connections before Nginx does. If Nginx times out first, it can close a connection at the exact moment the LB forwards a new request over it, which surfaces as a 502 at the LB:
http {
keepalive_timeout 75s;
}
You can also explicitly control what you advertise to clients with the optional second parameter:
# server closes idle connections after 65s
# but tells clients to expect 60s
keepalive_timeout 65s 60s;
Root Cause 3: Upstream Keepalive Not Enabled
Why It Happens
This one catches a lot of engineers off guard. By default, Nginx does not maintain a connection pool to upstream servers. Every proxied request results in a brand new TCP connection to the backend — connection opens, request goes through, connection closes. No reuse. If you're reverse-proxying to an application server on the same host or in the same datacenter segment, you're paying full TCP handshake overhead on every single request for no reason. At high request rates you'll see TIME_WAIT sockets piling up on the upstream side, and you'll eventually hit port exhaustion or upstream connection queue limits.
The `keepalive` directive in the upstream block was specifically designed to fix this, but it has to be explicitly enabled — it doesn't activate automatically just because you define an upstream group.
How to Identify It
Inspect your upstream configuration for the presence of the `keepalive` directive:
nginx -T | grep -A 20 'upstream '
If your output looks like this, you're not using upstream keepalive:
upstream backend_pool {
server 10.0.2.20:8080;
server 10.0.2.21:8080;
}
Confirm by watching the connection state on the backend port from the Nginx host in real time:
watch -n1 "ss -n state time-wait '( dport = :8080 or sport = :8080 )' | wc -l"
If that number climbs steadily with traffic and never stabilizes, you're not reusing connections to the backend. Every request is leaving a TIME_WAIT socket behind and creating a new one.
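You can estimate how quickly this leads to port exhaustion: each new connection to a single backend ip:port ties up one ephemeral source port for the duration of TIME_WAIT. A back-of-the-envelope sketch, assuming the Linux default port range and a hypothetical churn rate of 500 new connections per second:

```shell
# Estimate the sustainable new-connection rate before ephemeral port exhaustion.
# With N usable ports and TIME_WAIT holding each for 60 seconds, the steady-state
# ceiling is N / 60 new connections per second per backend ip:port.
low=32768; high=60999          # Linux default net.ipv4.ip_local_port_range
timewait_secs=60               # default TIME_WAIT hold time
new_conns_per_sec=500          # hypothetical churn rate from your monitoring

ports=$(( high - low + 1 ))
ceiling=$(( ports / timewait_secs ))
echo "usable ports: $ports, sustainable ceiling: $ceiling conn/s"
if [ "$new_conns_per_sec" -gt "$ceiling" ]; then
  echo "over the ceiling: expect port exhaustion"
fi
```

With the defaults the ceiling works out to roughly 470 connections per second per backend address, which a busy proxy without upstream keepalive can blow through easily.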
How to Fix It
Add the `keepalive` directive to the upstream block and configure the proxy location correctly. Both changes are required:
upstream backend_pool {
server 10.0.2.20:8080;
server 10.0.2.21:8080;
keepalive 32;
}
server {
listen 443 ssl;
server_name solvethenetwork.com;
location /api/ {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
}
}
The `keepalive 32` value sets the maximum number of idle keepalive connections Nginx will hold open per worker process to this upstream group. Setting `proxy_http_version 1.1` is mandatory — HTTP/1.0 doesn't support persistent connections properly, and Nginx defaults to 1.0 when talking to upstreams. Clearing the Connection header with `proxy_set_header Connection ""` is equally important: without it, Nginx sends `Connection: close` to the upstream by default, overriding your keepalive configuration entirely.
Root Cause 4: HTTP/1.0 from Upstream
Why It Happens
As noted above, Nginx defaults to HTTP/1.0 when communicating with upstream backends. HTTP/1.0 treats every connection as non-persistent by default — keepalive was a non-standard extension bolted on afterward and isn't reliably supported across all implementations. If you've added the upstream `keepalive` pool but forgotten to set `proxy_http_version 1.1`, Nginx is sending HTTP/1.0 requests and the backend is responding with `Connection: close`, terminating the connection after every request. Your carefully configured connection pool sits empty and unused.
I've also seen the reverse: backends that speak HTTP/1.0 only, regardless of what Nginx sends. Older Java servlet containers, legacy CGI backends, some embedded HTTP servers, and certain ancient internal microservices fall into this category. They won't maintain keepalive connections no matter what you do on the Nginx side.
How to Identify It
Capture the actual wire traffic between Nginx and the backend on sw-infrarunbook-01:
tcpdump -i any -n -A 'host 10.0.2.20 and port 8080' 2>/dev/null | grep -E "^(GET|POST|PUT|DELETE|HTTP|Connection)"
If Nginx is sending HTTP/1.0, you'll see:
GET /api/users HTTP/1.0
Host: 10.0.2.20
If the backend itself only speaks HTTP/1.0, you'll see responses like:
HTTP/1.0 200 OK
Connection: close
Content-Type: application/json
You can also probe the backend directly to check its protocol support:
curl -v --http1.1 http://10.0.2.20:8080/api/health 2>&1 | grep -iE "^[<>] (http|connection)"
A healthy HTTP/1.1 backend responds like this:
< HTTP/1.1 200 OK
< Connection: keep-alive
A broken one responds like this:
< HTTP/1.0 200 OK
< Connection: close
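The classification logic from those two probes can be scripted for a deployment checklist. A sketch that parses a captured response header block (the sample is the broken HTTP/1.0 response shown above; in practice you'd capture it with the `curl -v` probe):

```shell
# Classify a backend's keepalive capability from its response header block.
resp='HTTP/1.0 200 OK
Connection: close
Content-Type: application/json'

# Protocol version is the first token of the status line.
proto=$(printf '%s\n' "$resp" | head -n 1 | awk '{print $1}')
# Connection header value, lowercased; empty if the header is absent.
conn=$(printf '%s\n' "$resp" | grep -i '^Connection:' | awk '{print tolower($2)}')

if [ "$proto" = "HTTP/1.1" ] && [ "$conn" != "close" ]; then
  verdict="keepalive-capable"
else
  verdict="no-keepalive"
fi
echo "$proto / ${conn:-none} -> $verdict"
```

Anything other than `keepalive-capable` means the Nginx-side pool for that backend will sit unused.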
How to Fix It
On the Nginx side, always be explicit about the protocol version in every proxy location block:
location / {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
If the backend genuinely only supports HTTP/1.0 and you can't change it — legacy third-party service, vendor appliance, old internal app that nobody wants to touch — keepalive to that specific upstream isn't going to happen. In that case, shift focus to minimizing the cost of new connections. Expand the local port range and let the kernel reuse TIME_WAIT sockets for new outbound connections (`tcp_tw_reuse` only applies to connections the host initiates, which is exactly the Nginx-to-backend path):
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
Make those persistent in `/etc/sysctl.d/99-nginx-tuning.conf` on sw-infrarunbook-01:
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.ip_local_port_range = 1024 65535
sysctl -p /etc/sysctl.d/99-nginx-tuning.conf
Root Cause 5: Backend Closing the Connection Early
Why It Happens
Even with everything correct on the Nginx side, the backend application can unilaterally close connections before Nginx expects it. This happens because the backend has its own keepalive idle timeout shorter than Nginx's upstream connection pool idle time, the backend hits an internal connection limit and starts terminating connections, or the backend application has a bug where it sends `Connection: close` on specific response types.
The dangerous scenario is a race condition: Nginx has a connection sitting idle in its upstream keepalive pool, Nginx thinks it's valid, and the backend has already closed its end. The next request Nginx routes over that connection gets an immediate RST or EOF, which propagates as a 502 to the client. You can't eliminate this race entirely — it's inherent to the keepalive model — but you can make it transparent to users.
How to Identify It
This specific error signature in the Nginx error log is the tell:
2026/04/20 11:32:07 [error] 12847#12847: *4920183 recv() failed (104: Connection reset by peer)
while reading response header from upstream, client: 10.0.1.88,
server: solvethenetwork.com, request: "POST /api/submit HTTP/1.1",
upstream: "http://10.0.2.20:8080/api/submit",
host: "solvethenetwork.com"
Error 104 (ECONNRESET) on upstream reads, from a backend that's otherwise healthy when you hit it directly, means Nginx reused a connection the backend had already closed. Check the backend's own keepalive configuration. For a Node.js backend running on 10.0.2.20:
ssh infrarunbook-admin@10.0.2.20 'node -e "const h = require(\"http\"); const s = h.createServer(); console.log(s.keepAliveTimeout);"'
The Node.js default `keepAliveTimeout` was 5000 ms (5 seconds) in older versions — drastically shorter than any reasonable Nginx upstream keepalive idle time, causing this race condition constantly under even modest load.
How to Fix It
Two complementary fixes. First, configure Nginx to transparently retry on upstream errors so the race doesn't surface as a client-visible failure:
location /api/ {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_next_upstream error timeout invalid_header http_500 http_502;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 5s;
}
Second — and more importantly — align the backend's keepalive timeout so it's always higher than Nginx's upstream idle time. For a Node.js backend, set this explicitly in your server initialization code:
const server = app.listen(8080);
server.keepAliveTimeout = 75000; // 75 seconds in ms
server.headersTimeout = 80000; // must exceed keepAliveTimeout
For a Gunicorn backend, add to its configuration on sw-infrarunbook-01:
keepalive = 75
The rule is simple: the backend's keepalive timeout must always be higher than whatever idle time could elapse in Nginx's upstream connection pool. Nginx doesn't have a built-in per-pool idle timeout in older versions (`keepalive_timeout` in the upstream block arrived in Nginx 1.15.3), so making the backend more permissive is the reliable cross-version fix.
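That rule is simple enough to script into a deployment check. A sketch comparing the two values (the numbers are placeholders for whatever your Nginx and backend actually use):

```shell
# Flag keepalive timeout misalignment: the backend must hold idle connections
# longer than Nginx's upstream pool does, with a safety margin on top.
nginx_upstream_idle_secs=60    # keepalive_timeout in the upstream block
backend_keepalive_secs=5       # e.g. the old Node.js keepAliveTimeout default
margin_secs=10                 # margin for scheduling and in-flight requests

required=$(( nginx_upstream_idle_secs + margin_secs ))
if [ "$backend_keepalive_secs" -lt "$required" ]; then
  aligned=no
  echo "MISALIGNED: backend closes idle connections after ${backend_keepalive_secs}s," \
       "but Nginx may hold them idle for up to ${nginx_upstream_idle_secs}s"
else
  aligned=yes
  echo "aligned: backend outlasts the Nginx pool by a safe margin"
fi
```

With the sample values this flags the exact Node.js race described above.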
Root Cause 6: Worker Connection Limits and File Descriptor Exhaustion
Why It Happens
Keepalive connections hold file descriptors open. Each Nginx worker can handle at most `worker_connections` simultaneous connections — both client-facing and upstream. For proxied traffic that roughly doubles the count: every in-flight request holds a client connection plus an upstream connection, and the idle upstream keepalive pool adds more on top. On a busy server with multiple worker processes and generous keepalive settings, you can exhaust the allowed file descriptors faster than you'd expect, and Nginx will start refusing to maintain keepalive connections, or log errors and drop requests entirely.
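A rough per-worker budget check makes the headroom concrete. A sketch, where the client concurrency figure is a hypothetical peak and the other values match this article's configuration:

```shell
# Rough per-worker connection budget for proxied traffic: each in-flight
# request holds a client connection plus an upstream connection, and the
# idle upstream keepalive pool adds its own file descriptors on top.
worker_connections=1024        # events { worker_connections } value
concurrent_clients=400         # hypothetical peak concurrent clients per worker
upstream_pool=32               # keepalive value in the upstream block

needed=$(( concurrent_clients * 2 + upstream_pool ))
echo "estimated need: $needed of $worker_connections per worker"
if [ "$needed" -ge "$worker_connections" ]; then
  echo "over budget: raise worker_connections and LimitNOFILE"
fi
```

Here 832 of 1024 are already spoken for at peak, which is uncomfortably close; a traffic spike or a larger pool tips it over.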
How to Identify It
Check the current file descriptor limit for the running Nginx master process:
cat /proc/$(cat /var/run/nginx.pid)/limits | grep "open files"
Limit Soft Limit Hard Limit Units
Max open files 1024 1024 files
A soft limit of 1024 is a classic misconfiguration. Cross-reference with your configured worker_connections:
nginx -T | grep worker_connections
And look for this specific alert in the error log:
2026/04/20 14:22:11 [alert] 12847#12847: *5020012 1024 worker_connections are not enough
while connecting to upstream
That's a hard confirmation. Nginx is hitting the wall.
How to Fix It
Increase `worker_connections` in nginx.conf and set `worker_processes` to match your CPU count:
worker_processes auto;
events {
worker_connections 4096;
use epoll;
multi_accept on;
}
Then raise the system file descriptor limit for the Nginx service. Create a systemd override on sw-infrarunbook-01 as infrarunbook-admin:
mkdir -p /etc/systemd/system/nginx.service.d
cat > /etc/systemd/system/nginx.service.d/limits.conf <<EOF
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl reload nginx
Verify the new limit took effect:
cat /proc/$(cat /var/run/nginx.pid)/limits | grep "open files"
Limit Soft Limit Hard Limit Units
Max open files 65536 65536 files
Prevention
Prevention is about getting the configuration right once across all layers and then monitoring the indicators that tell you when something has drifted. Start with a solid baseline that explicitly handles keepalive at every layer rather than relying on defaults. Here's a production-ready template for sw-infrarunbook-01:
worker_processes auto;
events {
worker_connections 4096;
use epoll;
multi_accept on;
}
http {
keepalive_timeout 65s;
keepalive_requests 10000;
upstream backend_pool {
server 10.0.2.20:8080;
server 10.0.2.21:8080;
keepalive 64;
keepalive_requests 1000;
keepalive_timeout 60s;
}
server {
listen 443 ssl;
server_name solvethenetwork.com;
location / {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_next_upstream error timeout invalid_header;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 5s;
}
}
}
For ongoing monitoring, track the TIME_WAIT count continuously. Wire it into your metrics stack or add an entry in /etc/cron.d on sw-infrarunbook-01 to log it (the user field makes this an /etc/cron.d-style line; the timestamp comes from date because a bare % is special inside crontab entries):
* * * * * infrarunbook-admin sh -c 'echo "$(date -Is) $(ss -s | grep -o "timewait [0-9]*")" >> /var/log/nginx/timewait_count.log'
Watch the stub_status ratio of requests to accepted connections over time. If that ratio approaches 1:1, keepalive is broken somewhere in your stack. A healthy ratio — requests significantly outnumbering accepted connections — means connections are being reused as intended.
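That ratio check is most useful on deltas between two samples, which catches a regression even when the lifetime counters still look healthy. A sketch with two hypothetical snapshots taken a minute apart:

```shell
# Requests-per-new-connection over an interval, from two stub_status samples
# of the "accepts handled requests" counters (values are hypothetical).
accepts1=4821823;  requests1=24109115    # first sample
accepts2=4822003;  requests2=24139115    # second sample, 60s later

d_accepts=$(( accepts2 - accepts1 ))     # new connections in the interval
d_requests=$(( requests2 - requests1 ))  # requests served in the interval
ratio=$(awk -v a="$d_accepts" -v r="$d_requests" 'BEGIN { printf "%.0f", r / a }')
echo "interval reuse ratio: ${ratio}:1"
```

Alert when this interval ratio falls toward 1:1; a healthy keepalive setup keeps it in the hundreds.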
When you upgrade Nginx, re-read the release notes for default changes and run `nginx -T` to review the effective configuration, not just the files you edited. The `keepalive_requests` default change in 1.19.10 is a good example: an explicit value below the new default still overrides it, so you might assume you got a free improvement when you didn't.
Finally, make backend keepalive alignment a checklist item for every new service deployment. Before any backend goes behind Nginx in production: confirm it speaks HTTP/1.1, confirm its keepalive timeout is higher than Nginx's upstream pool idle time, and confirm it doesn't send `Connection: close` on normal 200 responses. Run the `curl -v` probe against it directly. These three checks catch the majority of issues covered in this article before they ever reach production traffic.
