Symptoms
You're seeing intermittent 502 Bad Gateway errors under load. The error log is spitting out lines like `connect() failed (111: Connection refused) while connecting to upstream` even though the upstream service is clearly running and healthy. Response times are spiking unpredictably. `ss -s` shows a growing pile of connections in `TIME_WAIT` state. Your monitoring shows TCP connection establishment rates far exceeding what the actual request rate would suggest — you're serving 500 requests per second but creating 490 new TCP connections every second to do it.
These are the classic fingerprints of keepalive misconfiguration. Nginx is supposed to be reusing connections — both toward clients and toward upstream backends — but something is breaking that reuse. New TCP handshakes are happening when they shouldn't, and under high load that overhead compounds into real latency and errors. In my experience, this category of problem is responsible for a disproportionate amount of "the server is slow but we can't figure out why" tickets, because the app itself is fine and the symptoms are diffuse.
Let's work through every meaningful cause, with commands you can run right now to confirm which one you're dealing with.
Root Cause 1: keepalive_requests Too Low
Why It Happens
Nginx has a directive called `keepalive_requests` that caps how many requests a single keepalive connection can serve before Nginx forcibly closes it. The historical default was 100 — raised to 1000 in Nginx 1.19.10 — and on older deployments you'll still see the old default in effect. Once a connection reaches that limit, Nginx sends a `Connection: close` header and tears down the TCP session. The client has to perform a full handshake for the next request. At high traffic volumes, this generates a constant churn of new connections even when everything else looks correct.
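The arithmetic behind the churn is worth seeing. A quick sketch, assuming the idealized best case where every connection survives until the cap kicks in (the traffic figure is hypothetical):

```shell
# How many new TCP connections does a keepalive_requests cap force?
# connections = ceil(requests / cap)
requests_per_hour=1800000
cap_old=100       # pre-1.19.10 default
cap_new=10000     # a tuned value

conns_old=$(( (requests_per_hour + cap_old - 1) / cap_old ))
conns_new=$(( (requests_per_hour + cap_new - 1) / cap_new ))
echo "cap=100:   $conns_old new connections/hour"
echo "cap=10000: $conns_new new connections/hour"
```

Even in this best case, the old cap forces two orders of magnitude more handshakes than a tuned value; real traffic with timeouts and client churn will be worse.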
This is one of the most commonly overlooked tunables I've seen. Engineers deploy Nginx, don't touch `keepalive_requests`, and then wonder why `TIME_WAIT` is in the thousands when they're only serving a few hundred concurrent users.
How to Identify It
Check the effective value first:
nginx -T | grep keepalive_requests
No output means the compiled default is in effect. Then watch the connection state breakdown:
ss -s
You'll see output like this on an affected server:
Total: 4821
TCP: 3104 (estab 312, closed 2791, orphaned 14, timewait 2768)
Transport Total IP IPv6
* 0 - -
RAW 0 0 0
UDP 6 4 2
TCP 313 310 3
INET 319 314 5
FRAG 0 0 0
A large `timewait` count relative to established connections is a strong signal. Cross-reference with the stub_status module output:
curl -s http://10.0.1.10/nginx_status
Active connections: 318
server accepts handled requests
4821823 4821823 24109115
Reading: 0 Writing: 12 Waiting: 306
The ratio of requests to handled connections tells the story. Here it's about 5:1, meaning each connection served an average of five requests before being closed. Ideally this ratio should be much higher, in the hundreds or thousands; a low ratio means connections are being torn down early, whether by a low `keepalive_requests` cap, a short timeout, or clients disconnecting.
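You can compute that ratio directly from a stub_status snapshot. A small sketch, using the sample figures from the output above:

```shell
# Parse the counters line of stub_status output ("accepts handled requests")
# and print requests-per-connection, the keepalive reuse indicator.
status='Active connections: 318
server accepts handled requests
 4821823 4821823 24109115
Reading: 0 Writing: 12 Waiting: 306'

# Line 3 holds the counters: $2 = handled connections, $3 = total requests.
ratio=$(printf '%s\n' "$status" | awk 'NR==3 { printf "%.1f", $3 / $2 }')
echo "requests per connection: $ratio"
```

In a live check you'd replace the hardcoded sample with `curl -s http://10.0.1.10/nginx_status`.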
How to Fix It
Set `keepalive_requests` to a significantly higher value in your `http` block:
http {
keepalive_requests 10000;
keepalive_timeout 65s;
}
Reload Nginx and watch the `TIME_WAIT` count drop within minutes. For high-traffic deployments serving millions of requests per hour, values of 100000 are perfectly reasonable. The connection teardown itself is cheap — it's the new three-way handshake (plus TLS negotiation, for HTTPS) on the client side that costs you.
Root Cause 2: keepalive_timeout Misconfigured
Why It Happens
The `keepalive_timeout` directive controls how long Nginx holds an idle keepalive connection open waiting for the next request. It accepts two parameters: the server-side close timeout and an optional second value that populates the `Keep-Alive: timeout=N` header sent to clients.
Two distinct failure modes exist. The first: timeout set too low — say, 5 or 10 seconds — when clients are making requests further apart than that interval. A dashboard polling every 30 seconds, a mobile app retrying after a user interaction, a service making infrequent API calls. The keepalive connection expires before the next request arrives, forcing a fresh TCP handshake every time. The second failure mode is a timeout mismatch with another proxy layer — an AWS ALB or HAProxy in front of Nginx, or an intermediate proxy between Nginx and the backend. Whichever side times out its idle connections first can close one at the exact moment the other side reuses it, and that request dies mid-flight.
How to Identify It
For the too-low case, capture real client traffic and measure the gap between requests on persistent connections:
tcpdump -i eth0 -n 'host 10.0.1.10 and port 443' -w /tmp/capture.pcap
Open the capture in Wireshark and filter by TCP stream. Look at the delta timestamps between requests on the same stream. If those gaps regularly exceed your configured `keepalive_timeout`, you're closing connections before they can be reused.
For the upstream timeout mismatch, this specific error in your Nginx log is the giveaway:
2026/04/20 09:14:22 [error] 12847#12847: *3841092 upstream prematurely closed connection
while reading response header from upstream, client: 10.0.1.55,
server: solvethenetwork.com, request: "GET /api/data HTTP/1.1",
upstream: "http://10.0.2.20:8080/api/data"
"Upstream prematurely closed connection" means a keepalive connection was reused but the upstream had already closed its end in the interim.
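To see whether one backend is closing early more than the others, tally these errors per upstream address. A sketch that works on sample lines here; against a real server you'd feed it the error log (the path `/var/log/nginx/error.log` is an assumption):

```shell
# Count "upstream prematurely closed" errors per upstream address.
# The log lines below are samples standing in for the real error log.
log='2026/04/20 09:14:22 [error] 12847#12847: *1 upstream prematurely closed connection while reading response header from upstream, upstream: "http://10.0.2.20:8080/api/data"
2026/04/20 09:14:25 [error] 12847#12847: *2 upstream prematurely closed connection while reading response header from upstream, upstream: "http://10.0.2.20:8080/api/data"
2026/04/20 09:14:29 [error] 12847#12847: *3 upstream prematurely closed connection while reading response header from upstream, upstream: "http://10.0.2.21:8080/api/data"'

counts=$(printf '%s\n' "$log" \
  | grep 'upstream prematurely closed' \
  | grep -o 'upstream: "[^"]*"' \
  | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"
```

A heavy skew toward one address points at that backend's configuration rather than a global Nginx problem.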
How to Fix It
Match the timeout to your actual traffic pattern. For most applications a value around 65 seconds works well:
http {
keepalive_timeout 65s;
keepalive_requests 10000;
}
If you're sitting behind an AWS ALB or similar load balancer, set Nginx's `keepalive_timeout` higher than the load balancer's idle timeout — 75 seconds against the ALB's default 60 — so the LB always closes idle connections before Nginx does. If Nginx times out first, it can close a connection at the exact moment the LB forwards a new request over it, which surfaces as a 502 at the LB:
http {
keepalive_timeout 75s;
}
You can also explicitly control what you advertise to clients with the optional second parameter:
# server closes idle connections after 65s
# but tells clients to expect 60s
keepalive_timeout 65s 60s;
Root Cause 3: Upstream Keepalive Not Enabled
Why It Happens
This one catches a lot of engineers off guard. By default, Nginx does not maintain a connection pool to upstream servers. Every proxied request results in a brand new TCP connection to the backend — connection opens, request goes through, connection closes. No reuse. If you're reverse-proxying to an application server on the same host or in the same datacenter segment, you're paying full TCP handshake overhead on every single request for no reason. At high request rates you'll see TIME_WAIT sockets piling up on the upstream side, and you'll eventually hit port exhaustion or upstream connection queue limits.
The `keepalive` directive in the upstream block was specifically designed to fix this, but it has to be explicitly enabled — it doesn't activate automatically just because you define an upstream group.
How to Identify It
Inspect your upstream configuration for the presence of the `keepalive` directive:
nginx -T | grep -A 20 'upstream '
If your output looks like this, you're not using upstream keepalive:
upstream backend_pool {
server 10.0.2.20:8080;
server 10.0.2.21:8080;
}
Confirm by watching the connection state on the backend port from the Nginx host in real time:
watch -n1 "ss -n state time-wait '( dport = :8080 or sport = :8080 )' | wc -l"
If that number climbs steadily with traffic and never stabilizes, you're not reusing connections to the backend. Every request is leaving a TIME_WAIT socket behind and creating a new one.
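You can estimate how quickly this leads to port exhaustion: each new connection to a single backend ip:port ties up one ephemeral source port for the duration of TIME_WAIT. A back-of-the-envelope sketch, assuming the Linux default port range and a hypothetical churn rate of 500 new connections per second:

```shell
# Estimate the sustainable new-connection rate before ephemeral port exhaustion.
# With N usable ports and TIME_WAIT holding each for 60 seconds, the steady-state
# ceiling is N / 60 new connections per second per backend ip:port.
low=32768; high=60999          # Linux default net.ipv4.ip_local_port_range
timewait_secs=60               # default TIME_WAIT hold time
new_conns_per_sec=500          # hypothetical churn rate from your monitoring

ports=$(( high - low + 1 ))
ceiling=$(( ports / timewait_secs ))
echo "usable ports: $ports, sustainable ceiling: $ceiling conn/s"
if [ "$new_conns_per_sec" -gt "$ceiling" ]; then
  echo "over the ceiling: expect port exhaustion"
fi
```

With the defaults the ceiling works out to roughly 470 connections per second per backend address, which a busy proxy without upstream keepalive can blow through easily.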
How to Fix It
Add the `keepalive` directive to the upstream block and configure the proxy location correctly. Both changes are required:
upstream backend_pool {
server 10.0.2.20:8080;
server 10.0.2.21:8080;
keepalive 32;
}
server {
listen 443 ssl;
server_name solvethenetwork.com;
location /api/ {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
}
}
The `keepalive 32` value sets the maximum number of idle keepalive connections Nginx will hold open per worker process to this upstream group. Setting `proxy_http_version 1.1` is mandatory — HTTP/1.0 doesn't support persistent connections properly, and Nginx defaults to 1.0 when talking to upstreams. Clearing the Connection header with `proxy_set_header Connection ""` is equally important: without it, Nginx sends `Connection: close` to the upstream by default, overriding your keepalive configuration entirely.
Root Cause 4: HTTP/1.0 from Upstream
Why It Happens
As noted above, Nginx defaults to HTTP/1.0 when communicating with upstream backends. HTTP/1.0 treats every connection as non-persistent by default — keepalive was a non-standard extension bolted on afterward and isn't reliably supported across all implementations. If you've added the upstream `keepalive` pool but forgotten to set `proxy_http_version 1.1`, Nginx is sending HTTP/1.0 requests and the backend is responding with `Connection: close`, terminating the connection after every request. Your carefully configured connection pool sits empty and unused.
I've also seen the reverse: backends that speak HTTP/1.0 only, regardless of what Nginx sends. Older Java servlet containers, legacy CGI backends, some embedded HTTP servers, and certain ancient internal microservices fall into this category. They won't maintain keepalive connections no matter what you do on the Nginx side.
How to Identify It
Capture the actual wire traffic between Nginx and the backend on sw-infrarunbook-01:
tcpdump -i any -n -A 'host 10.0.2.20 and port 8080' 2>/dev/null | grep -E "^(GET|POST|PUT|DELETE|HTTP|Connection)"
If Nginx is sending HTTP/1.0, you'll see:
GET /api/users HTTP/1.0
Host: 10.0.2.20
If the backend itself only speaks HTTP/1.0, you'll see responses like:
HTTP/1.0 200 OK
Connection: close
Content-Type: application/json
You can also probe the backend directly to check its protocol support:
curl -v --http1.1 http://10.0.2.20:8080/api/health 2>&1 | grep -iE "^[<>] (http|connection)"
A healthy HTTP/1.1 backend responds like this:
< HTTP/1.1 200 OK
< Connection: keep-alive
A broken one responds like this:
< HTTP/1.0 200 OK
< Connection: close
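The classification logic from those two probes can be scripted for a deployment checklist. A sketch that parses a captured response header block (the sample is the broken HTTP/1.0 response shown above; in practice you'd capture it with the `curl -v` probe):

```shell
# Classify a backend's keepalive capability from its response header block.
resp='HTTP/1.0 200 OK
Connection: close
Content-Type: application/json'

# Protocol version is the first token of the status line.
proto=$(printf '%s\n' "$resp" | head -n 1 | awk '{print $1}')
# Connection header value, lowercased; empty if the header is absent.
conn=$(printf '%s\n' "$resp" | grep -i '^Connection:' | awk '{print tolower($2)}')

if [ "$proto" = "HTTP/1.1" ] && [ "$conn" != "close" ]; then
  verdict="keepalive-capable"
else
  verdict="no-keepalive"
fi
echo "$proto / ${conn:-none} -> $verdict"
```

Anything other than `keepalive-capable` means the Nginx-side pool for that backend will sit unused.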
How to Fix It
On the Nginx side, always be explicit about the protocol version in every proxy location block:
location / {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
If the backend genuinely only supports HTTP/1.0 and you can't change it — legacy third-party service, vendor appliance, old internal app that nobody wants to touch — keepalive to that specific upstream isn't going to happen. In that case, shift focus to minimizing the cost of new connections. Expand the local port range and let the kernel reuse TIME_WAIT sockets for new outbound connections (`tcp_tw_reuse` only applies to connections the host initiates, which is exactly the Nginx-to-backend path):
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
Make those persistent in `/etc/sysctl.d/99-nginx-tuning.conf` on sw-infrarunbook-01:
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.ip_local_port_range = 1024 65535
sysctl -p /etc/sysctl.d/99-nginx-tuning.conf
Root Cause 5: Backend Closing the Connection Early
Why It Happens
Even with everything correct on the Nginx side, the backend application can unilaterally close connections before Nginx expects it. This happens because the backend has its own keepalive idle timeout shorter than Nginx's upstream connection pool idle time, the backend hits an internal connection limit and starts terminating connections, or the backend application has a bug where it sends `Connection: close` on specific response types.
The dangerous scenario is a race condition: Nginx has a connection sitting idle in its upstream keepalive pool, Nginx thinks it's valid, and the backend has already closed its end. The next request Nginx routes over that connection gets an immediate RST or EOF, which propagates as a 502 to the client. You can't eliminate this race entirely — it's inherent to the keepalive model — but you can make it transparent to users.
How to Identify It
This specific error signature in the Nginx error log is the tell:
2026/04/20 11:32:07 [error] 12847#12847: *4920183 recv() failed (104: Connection reset by peer)
while reading response header from upstream, client: 10.0.1.88,
server: solvethenetwork.com, request: "POST /api/submit HTTP/1.1",
upstream: "http://10.0.2.20:8080/api/submit",
host: "solvethenetwork.com"
Error 104 (ECONNRESET) on upstream reads, from a backend that's otherwise healthy when you hit it directly, means Nginx reused a connection the backend had already closed. Check the backend's own keepalive configuration. For a Node.js backend running on 10.0.2.20:
ssh infrarunbook-admin@10.0.2.20 'node -e "const h = require(\"http\"); const s = h.createServer(); console.log(s.keepAliveTimeout);"'
The Node.js default `keepAliveTimeout` was 5000 ms (5 seconds) in older versions — drastically shorter than any reasonable Nginx upstream keepalive idle time, causing this race condition constantly under even modest load.
How to Fix It
Two complementary fixes. First, configure Nginx to transparently retry on upstream errors so the race doesn't surface as a client-visible failure:
location /api/ {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_next_upstream error timeout invalid_header http_500 http_502;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 5s;
}
Second — and more importantly — align the backend's keepalive timeout so it's always higher than Nginx's upstream idle time. For a Node.js backend, set this explicitly in your server initialization code:
const server = app.listen(8080);
server.keepAliveTimeout = 75000; // 75 seconds in ms
server.headersTimeout = 80000; // must exceed keepAliveTimeout
For a Gunicorn backend, add to its configuration on sw-infrarunbook-01:
keepalive = 75
The rule is simple: the backend's keepalive timeout must always be higher than whatever idle time could elapse in Nginx's upstream connection pool. Nginx doesn't have a built-in per-pool idle timeout in older versions (`keepalive_timeout` in the upstream block arrived in Nginx 1.15.3), so making the backend more permissive is the reliable cross-version fix.
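That rule is simple enough to script into a deployment check. A sketch comparing the two values (the numbers are placeholders for whatever your Nginx and backend actually use):

```shell
# Flag keepalive timeout misalignment: the backend must hold idle connections
# longer than Nginx's upstream pool does, with a safety margin on top.
nginx_upstream_idle_secs=60    # keepalive_timeout in the upstream block
backend_keepalive_secs=5       # e.g. the old Node.js keepAliveTimeout default
margin_secs=10                 # margin for scheduling and in-flight requests

required=$(( nginx_upstream_idle_secs + margin_secs ))
if [ "$backend_keepalive_secs" -lt "$required" ]; then
  aligned=no
  echo "MISALIGNED: backend closes idle connections after ${backend_keepalive_secs}s," \
       "but Nginx may hold them idle for up to ${nginx_upstream_idle_secs}s"
else
  aligned=yes
  echo "aligned: backend outlasts the Nginx pool by a safe margin"
fi
```

With the sample values this flags the exact Node.js race described above.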
Root Cause 6: Worker Connection Limits and File Descriptor Exhaustion
Why It Happens
Keepalive connections hold file descriptors open. Each Nginx worker can handle at most `worker_connections` simultaneous connections — both client-facing and upstream. For proxied traffic that roughly doubles the count: every in-flight request holds a client connection plus an upstream connection, and the idle upstream keepalive pool adds more on top. On a busy server with multiple worker processes and generous keepalive settings, you can exhaust the allowed file descriptors faster than you'd expect, and Nginx will start refusing to maintain keepalive connections, or log errors and drop requests entirely.
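A rough per-worker budget check makes the headroom concrete. A sketch, where the client concurrency figure is a hypothetical peak and the other values match this article's configuration:

```shell
# Rough per-worker connection budget for proxied traffic: each in-flight
# request holds a client connection plus an upstream connection, and the
# idle upstream keepalive pool adds its own file descriptors on top.
worker_connections=1024        # events { worker_connections } value
concurrent_clients=400         # hypothetical peak concurrent clients per worker
upstream_pool=32               # keepalive value in the upstream block

needed=$(( concurrent_clients * 2 + upstream_pool ))
echo "estimated need: $needed of $worker_connections per worker"
if [ "$needed" -ge "$worker_connections" ]; then
  echo "over budget: raise worker_connections and LimitNOFILE"
fi
```

Here 832 of 1024 are already spoken for at peak, which is uncomfortably close; a traffic spike or a larger pool tips it over.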
How to Identify It
Check the current file descriptor limit for the running Nginx master process:
cat /proc/$(cat /var/run/nginx.pid)/limits | grep "open files"
Limit Soft Limit Hard Limit Units
Max open files 1024 1024 files
A soft limit of 1024 is a classic misconfiguration. Cross-reference with your configured worker_connections:
nginx -T | grep worker_connections
And look for this specific alert in the error log:
2026/04/20 14:22:11 [alert] 12847#12847: *5020012 1024 worker_connections are not enough
while connecting to upstream
That's a hard confirmation. Nginx is hitting the wall.
How to Fix It
Increase `worker_connections` in nginx.conf and set `worker_processes` to match your CPU count:
worker_processes auto;
events {
worker_connections 4096;
use epoll;
multi_accept on;
}
Then raise the system file descriptor limit for the Nginx service. Create a systemd override on sw-infrarunbook-01 as infrarunbook-admin:
mkdir -p /etc/systemd/system/nginx.service.d
cat > /etc/systemd/system/nginx.service.d/limits.conf <<EOF
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl reload nginx
Verify the new limit took effect:
cat /proc/$(cat /var/run/nginx.pid)/limits | grep "open files"
Limit Soft Limit Hard Limit Units
Max open files 65536 65536 files
Prevention
Prevention is about getting the configuration right once across all layers and then monitoring the indicators that tell you when something has drifted. Start with a solid baseline that explicitly handles keepalive at every layer rather than relying on defaults. Here's a production-ready template for sw-infrarunbook-01:
worker_processes auto;
events {
worker_connections 4096;
use epoll;
multi_accept on;
}
http {
keepalive_timeout 65s;
keepalive_requests 10000;
upstream backend_pool {
server 10.0.2.20:8080;
server 10.0.2.21:8080;
keepalive 64;
keepalive_requests 1000;
keepalive_timeout 60s;
}
server {
listen 443 ssl;
server_name solvethenetwork.com;
location / {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_next_upstream error timeout invalid_header;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 5s;
}
}
}
For ongoing monitoring, track the TIME_WAIT count continuously. Wire it into your metrics stack or add an entry in /etc/cron.d on sw-infrarunbook-01 to log it (the user field makes this an /etc/cron.d-style line; the timestamp comes from date because a bare % is special inside crontab entries):
* * * * * infrarunbook-admin sh -c 'echo "$(date -Is) $(ss -s | grep -o "timewait [0-9]*")" >> /var/log/nginx/timewait_count.log'
Watch the stub_status ratio of requests to accepted connections over time. If that ratio approaches 1:1, keepalive is broken somewhere in your stack. A healthy ratio — requests significantly outnumbering accepted connections — means connections are being reused as intended.
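That ratio check is most useful on deltas between two samples, which catches a regression even when the lifetime counters still look healthy. A sketch with two hypothetical snapshots taken a minute apart:

```shell
# Requests-per-new-connection over an interval, from two stub_status samples
# of the "accepts handled requests" counters (values are hypothetical).
accepts1=4821823;  requests1=24109115    # first sample
accepts2=4822003;  requests2=24139115    # second sample, 60s later

d_accepts=$(( accepts2 - accepts1 ))     # new connections in the interval
d_requests=$(( requests2 - requests1 ))  # requests served in the interval
ratio=$(awk -v a="$d_accepts" -v r="$d_requests" 'BEGIN { printf "%.0f", r / a }')
echo "interval reuse ratio: ${ratio}:1"
```

Alert when this interval ratio falls toward 1:1; a healthy keepalive setup keeps it in the hundreds.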
When you upgrade Nginx, re-read the release notes for default changes and run `nginx -T` to review the effective configuration, not just the files you edited. The `keepalive_requests` default change in 1.19.10 is a good example: an explicit value below the new default still overrides it, so you might assume you got a free improvement when you didn't.
Finally, make backend keepalive alignment a checklist item for every new service deployment. Before any backend goes behind Nginx in production: confirm it speaks HTTP/1.1, confirm its keepalive timeout is higher than Nginx's upstream pool idle time, and confirm it doesn't send `Connection: close` on normal 200 responses. Run the `curl -v` probe against it directly. These three checks catch the majority of issues covered in this article before they ever reach production traffic.
