What a 502 Actually Means
The 502 Bad Gateway error from Nginx is a proxy error, not a server error in the traditional sense. Nginx itself is alive and accepting connections — it got your request, tried to forward it to the upstream backend, and either got back nothing or got back something it couldn't make sense of. That's the whole story. Nginx is telling the client: "I did my job, the thing behind me didn't."
I've watched engineers waste twenty minutes restarting Nginx on a 502. It fixes nothing, because Nginx isn't the problem. The upstream is broken, unreachable, overloaded, or misconfigured. That's where you need to look.
Read the Error Log Before Touching Anything
The very first thing you do is read the Nginx error log. Not the access log — that'll just show you a wall of 502 responses. The error log tells you why.
tail -n 100 /var/log/nginx/error.log
If your server block defines a custom error_log path, check there instead. Look for lines flagged [error]. The upstream field in those lines is the critical piece. You're going to see one of a handful of errno values, and each one points at a different failure mode.
2024/08/14 09:23:41 [error] 3821#3821: *4821 connect() failed (111: Connection refused) while connecting to upstream, client: 203.0.113.55, server: solvethenetwork.com, request: "GET /api/status HTTP/1.1", upstream: "http://127.0.0.1:8080/api/status", host: "solvethenetwork.com"
Error 111 is ECONNREFUSED. Nothing is listening on that port. Full stop. Don't overthink it — go check if your upstream process is running.
2024/08/14 09:31:07 [error] 3821#3821: *5103 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 203.0.113.55, server: solvethenetwork.com, upstream: "http://10.0.1.15:8080/api/slow-query", host: "solvethenetwork.com"
Error 110 is ETIMEDOUT. Something is listening, it accepted the connection, but it never sent a response header before the timeout fired. That's a slow backend problem — think a database query taking forever, GC pauses, or a thread pool that's completely saturated.
2024/08/14 09:45:22 [error] 3821#3821: *5891 recv() failed (104: Connection reset by peer) while reading response header from upstream
Error 104 is ECONNRESET. The upstream accepted the connection and then killed it mid-conversation. This usually signals a crash, a premature socket close, or an upstream in the middle of a graceful shutdown that's still receiving traffic.
Upstream Not Running: The Most Common Cause
In my experience, the single most common 502 cause is simply that the upstream process isn't running. It crashed, someone stopped it, or it never came back up after a deployment. Start here every time.
ss -tlnp | grep 8080
If that returns nothing, your upstream isn't listening. For a Node.js application on port 8080, check its service status directly:
systemctl status node-app.service
Or if it's not managed by systemd:
ps aux | grep node
Once you've confirmed the process is dead, check its own logs before restarting it blindly. There's a reason it crashed, and restarting without understanding why means it'll crash again. Start the service, watch it come up, then tail the Nginx error log to confirm 502s stop.
systemctl start node-app.service
journalctl -u node-app.service -f
PHP-FPM: The Classic 502 Source
If you're running PHP behind Nginx — WordPress, Laravel, any PHP application — then PHP-FPM is almost certainly involved in your 502. FPM either isn't running, ran out of workers, or the socket and port configuration doesn't match what Nginx is trying to connect to.
First, check if PHP-FPM is running:
systemctl status php8.2-fpm.service
Then verify the listener matches what Nginx expects. Your Nginx config might say:
fastcgi_pass 127.0.0.1:9000;
But your PHP-FPM pool config (typically under /etc/php/8.2/fpm/pool.d/www.conf) might be configured to listen on a Unix socket:
listen = /run/php/php8.2-fpm.sock
That mismatch alone causes a 502. Check both sides and make sure they agree. If FPM is using a socket, Nginx should use:
fastcgi_pass unix:/run/php/php8.2-fpm.sock;
The other PHP-FPM failure mode I see regularly is worker exhaustion. If you have a high-traffic site and FPM's pm.max_children is set too low, all workers get occupied and new requests queue until Nginx times out waiting. Check the FPM log:
tail -n 50 /var/log/php8.2-fpm.log
If you see a line like "server reached pm.max_children setting (5), consider raising it", that's your problem. Increase pm.max_children to a value that reflects your server's available memory and reload FPM. A rough rule of thumb: divide available RAM by your average PHP process memory footprint, then take 80% of that number to leave headroom.
; /etc/php/8.2/fpm/pool.d/www.conf
pm = dynamic
pm.max_children = 25
pm.start_servers = 5
pm.min_spare_servers = 3
pm.max_spare_servers = 10
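The rule of thumb above reduces to one line of arithmetic. A sketch with hypothetical numbers — 4096 MB of RAM left for PHP and roughly 60 MB per worker; measure your own footprint before trusting either figure:

```shell
# Hypothetical inputs: 4096 MB available to PHP-FPM, ~60 MB per worker.
# Measure the real per-worker footprint with something like:
#   ps -o rss= -C php-fpm8.2 | awk '{sum+=$1; n++} END {print sum/n/1024}'
avail_mb=4096
per_worker_mb=60
# Take 80% of available RAM, divided by the per-worker footprint:
max_children=$(( avail_mb * 80 / 100 / per_worker_mb ))
echo "pm.max_children = $max_children"
```

With these inputs the result is 54, which would go straight into the pool config above.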
Upstream Timeouts: When the Backend Is Just Too Slow
If your error log shows ETIMEDOUT and the upstream is definitely running, the backend is taking longer to respond than Nginx's timeout allows. The default proxy_read_timeout is 60 seconds — which sounds like plenty, but report generation endpoints, heavy aggregation queries, and external API calls can blow past that easily.
You have two options: fix the slow backend (always the correct long-term answer) or increase the timeout for that specific location. Here's how to do it selectively without touching global defaults:
location /api/reports {
    proxy_pass http://10.0.1.15:8080;
    proxy_connect_timeout 10s;
    proxy_send_timeout 120s;
    proxy_read_timeout 120s;
}
Don't crank timeouts up globally. That masks problems and lets slow requests pile up, eventually exhausting your worker pool and causing a broader outage. Increase timeouts only for endpoints where the slowness is expected and acceptable.
A quick clarification on what these directives actually control: proxy_connect_timeout is how long Nginx waits to establish the TCP connection to the upstream. proxy_read_timeout is the gap between successive read operations on the response — not the total response time. proxy_send_timeout covers the send side. In practice, proxy_read_timeout is the one you'll tune most often for slow backends.
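To confirm the slowness belongs to the backend and not the proxy path, time the upstream directly with curl. A sketch — the throwaway local server on port 8099 is a stand-in so the commands run anywhere; in production you'd target the upstream address from the error log instead:

```shell
# Stand-in upstream so this is runnable anywhere; in production, target the
# address from the error log (e.g. http://10.0.1.15:8080/api/slow-query).
python3 -m http.server 8099 --bind 127.0.0.1 >/dev/null 2>&1 &
srv=$!
sleep 1
# connect = TCP handshake time, ttfb = time to first response byte.
# A ttfb creeping toward proxy_read_timeout means the backend is the bottleneck.
curl -s -o /dev/null \
  -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  http://127.0.0.1:8099/
kill "$srv"
```

If ttfb is large while connect is near zero, the problem is the application, not the network path.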
Buffer Size Problems Causing 502s
This one is subtle and I've been bitten by it more than once. If your upstream sends a very large response header — common with applications that set numerous cookies, put JWT tokens in response headers, or emit verbose debug headers — Nginx may fail to buffer it and return a 502.
The error in the log will say:
upstream sent too big header while reading response header from upstream
The fix is to increase the proxy buffer sizes in your server or location block:
proxy_buffer_size 16k;
proxy_buffers 8 16k;
proxy_busy_buffers_size 32k;
For FastCGI upstreams like PHP-FPM, the equivalent directives are:
fastcgi_buffer_size 16k;
fastcgi_buffers 8 16k;
fastcgi_busy_buffers_size 32k;
The default proxy_buffer_size is one memory page — typically 4k or 8k depending on your platform. A JWT token in a response header, or a Set-Cookie header with multiple long session values, can easily exceed that. Bumping to 16k resolves the majority of cases I've seen in production environments.
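Two quick checks, as a sketch: what a page actually is on your platform, and whether a given header payload would fit in a 4k buffer. The run of x's below is a synthetic stand-in for a long JWT:

```shell
# The default proxy_buffer_size is one memory page; see what that is here:
getconf PAGESIZE

# Simulate a response header block carrying a ~5000-byte session cookie:
printf 'HTTP/1.1 200 OK\r\nSet-Cookie: session=%s\r\n\r\n' \
  "$(head -c 5000 /dev/zero | tr '\0' 'x')" > /tmp/headers.txt
wc -c < /tmp/headers.txt   # comfortably over 4096, so a 4k buffer would overflow
```

In the real case you'd capture the actual headers first with curl -s -D headers.txt -o /dev/null against the upstream, then count bytes the same way.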
Unix Socket Permission Issues
If you're using Unix sockets instead of TCP ports — which is the right call for same-host communication since it avoids the loopback overhead — permissions are a trap that catches a lot of people. Nginx runs as www-data or nginx depending on your distribution, and it needs read and write access to the socket file.
ls -la /run/php/php8.2-fpm.sock
srw-rw---- 1 www-data www-data 0 Aug 14 09:00 /run/php/php8.2-fpm.sock
If the socket is owned by a different user and doesn't grant write access to the Nginx user, you'll get a 502 with a permission denied error in the log. Fix it in the FPM pool config:
; /etc/php/8.2/fpm/pool.d/www.conf
listen.owner = www-data
listen.group = www-data
listen.mode = 0660
Restart FPM and the socket gets recreated on startup with the permissions you've defined. The same principle applies to any upstream that uses a Unix socket — Gunicorn, uWSGI, whatever. The Nginx worker process user must be able to write to that socket.
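A small helper makes the comparison mechanical. sock_perms is a name invented here, not a standard tool, and it's demonstrated on a scratch file since the real socket path varies; point it at /run/php/php8.2-fpm.sock on an actual host:

```shell
# Print owner, group, and mode for a path (GNU stat syntax, i.e. Linux):
sock_perms() { stat -c '%U:%G mode=%a' "$1"; }

# Demonstration on a scratch file using the 0660 mode from the pool config.
# Compare the printed owner:group against the user your Nginx workers run as.
scratch=$(mktemp)
chmod 0660 "$scratch"
sock_perms "$scratch"
rm -f "$scratch"
```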
SSL Between Nginx and the Upstream
If Nginx is proxying to an HTTPS upstream — common in multi-tier architectures where internal traffic is also encrypted — a certificate validation failure causes a 502. This happens more than it should, usually when a self-signed certificate is used internally and Nginx rejects it because it can't verify the chain.
Your proxy block might look like this:
location / {
    proxy_pass https://10.0.1.20:8443;
    proxy_ssl_verify on;
    proxy_ssl_trusted_certificate /etc/nginx/certs/internal-ca.crt;
}
If proxy_ssl_verify is on and the upstream cert doesn't validate against that CA bundle, Nginx returns a 502 and logs something like:
SSL_do_handshake() failed (SSL: error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed) while SSL handshaking to upstream
You either need to provide the correct CA certificate via proxy_ssl_trusted_certificate, or — in a genuinely trusted internal network where you understand the risk — disable verification with proxy_ssl_verify off. I don't love that option, but internal PKI is often a mess in practice and you need to balance security with operational reality.
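You can rehearse the same chain validation by hand with openssl. This sketch mints a throwaway CA and leaf certificate so it runs anywhere; against the real upstream you'd capture the served certificate with openssl s_client -connect 10.0.1.20:8443 -showcerts instead of generating one:

```shell
# Throwaway internal CA (stand-in for /etc/nginx/certs/internal-ca.crt):
openssl req -x509 -newkey rsa:2048 -nodes -subj '/CN=internal-ca' \
  -keyout /tmp/ca.key -out /tmp/ca.crt -days 1 2>/dev/null
# Leaf cert for the upstream, signed by that CA:
openssl req -newkey rsa:2048 -nodes -subj '/CN=10.0.1.20' \
  -keyout /tmp/leaf.key -out /tmp/leaf.csr 2>/dev/null
openssl x509 -req -in /tmp/leaf.csr -CA /tmp/ca.crt -CAkey /tmp/ca.key \
  -CAcreateserial -out /tmp/leaf.crt -days 1 2>/dev/null
# The check proxy_ssl_verify performs, in miniature:
openssl verify -CAfile /tmp/ca.crt /tmp/leaf.crt
```

If openssl verify prints OK for your real leaf and CA files, Nginx's verification should pass too; if it doesn't, you've found the mismatch without touching the proxy config.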
Also double-check that the upstream is actually serving TLS. If you point proxy_pass at an HTTPS address but the backend is serving plain HTTP, you'll get an immediate 502 because the TLS handshake fails against a non-TLS listener. Confirm with a direct curl from the server:
curl -vk https://10.0.1.20:8443/health
Upstream Group Exhaustion
If you're using an upstream block with multiple backend servers, Nginx marks individual backends as unavailable when they fail beyond a threshold. The max_fails and fail_timeout parameters control this passive health checking:
upstream app_backend {
    server 10.0.1.21:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.22:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.23:8080 max_fails=3 fail_timeout=30s;
}
If all three backends get marked as failed simultaneously — say, a shared database goes down and every app instance starts returning errors — Nginx has no healthy upstream and returns 502 for every incoming request. The error log will show:
no live upstreams while connecting to upstream
The fix here is resolving the underlying dependency failure — the database, the auth service, whatever caused the cascade. Once fail_timeout expires, Nginx will probe the marked-down servers again automatically. It's not a permanent blacklist. If you need a fallback during full upstream outages, add a backup server that returns a maintenance response:
upstream app_backend {
    server 10.0.1.21:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.22:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.23:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.99:8080 backup;
}
The Systematic Debug Sequence on sw-infrarunbook-01
When I'm debugging a 502 in production, this is the exact sequence I run through on a host like sw-infrarunbook-01. It moves from observation to isolation to confirmation without jumping to conclusions.
# 1. What upstream address is Nginx trying to reach?
grep -Ei 'proxy_pass|fastcgi_pass|uwsgi_pass' /etc/nginx/sites-enabled/solvethenetwork.com
# 2. Is anything listening on that address and port?
ss -tlnp | grep 8080
# 3. What does the error log say right now?
tail -f /var/log/nginx/error.log
# 4. Is the upstream service up and healthy?
systemctl status app.service
# 5. Can you reach the upstream directly, bypassing Nginx?
curl -v http://127.0.0.1:8080/health
# 6. What does the upstream service's own log say?
journalctl -u app.service --since "10 minutes ago"
Step 5 is the one engineers most often skip. Curling the upstream directly from the same host Nginx runs on removes Nginx from the equation entirely. If curl http://127.0.0.1:8080/health returns a 200, Nginx should be able to reach it too — and you need to look at Nginx config or permissions instead. If it hangs, refuses, or errors, that's your upstream problem confirmed without Nginx being involved at all.
Config Validation and Graceful Reload
If your fix involved changing Nginx configuration — adjusting timeouts, buffer sizes, proxy addresses, upstream blocks — always test the config before applying it:
nginx -t
If it comes back with syntax is ok and test is successful, reload gracefully:
systemctl reload nginx
A reload is graceful — existing connections complete normally while new connections pick up the updated configuration. A full restart drops active connections. Use reload unless you specifically need a clean process restart.
After reloading, watch the error log to confirm 502s stop appearing:
tail -f /var/log/nginx/error.log | grep -v " info "
If 502s stop and your upstream health checks pass, you're done. If they continue, you haven't found the actual root cause yet — go back to the error log, look more carefully at the upstream address and the errno, and work through the list again.
Instrumenting for Intermittent 502s
Intermittent 502s — ones that appear occasionally under load rather than constantly — are harder to diagnose. Common causes include upstream worker exhaustion during traffic bursts, memory pressure causing the backend to slow or OOM-crash, and connection pool limits being hit at the database or external API layer.
For these situations, I add upstream timing fields to Nginx's access log format so there's data to correlate against:
log_format upstream_timing '$remote_addr - $remote_user [$time_local] '
                           '"$request" $status $body_bytes_sent '
                           'upstream=$upstream_addr '
                           'upstream_status=$upstream_status '
                           'upstream_rt=$upstream_response_time '
                           'request_rt=$request_time';
access_log /var/log/nginx/access.log upstream_timing;
With that in place, you can grep for 502 responses and see exactly which upstream node returned them and how long the request ran before failing:
awk '$9 == 502 {print $0}' /var/log/nginx/access.log | tail -30
Patterns jump out quickly. If all 502s come from 10.0.1.22:8080 specifically, you've got a bad node. If they cluster around specific times of day, you've got a traffic-driven exhaustion problem. If the upstream_rt field consistently shows values just under your timeout threshold right before the 502, the backend is genuinely too slow and needs optimization, not just a longer timeout.
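The bad-node pattern is easy to pull out with one awk pass over that log format. A sketch with three synthetic lines standing in for the real access log:

```shell
# Synthetic sample of the upstream_timing format (stand-in for the real log):
cat > /tmp/access_sample.log <<'EOF'
203.0.113.55 - - [14/Aug/2024:09:23:41 +0000] "GET /api/status HTTP/1.1" 502 150 upstream=10.0.1.22:8080 upstream_status=502 upstream_rt=0.001 request_rt=0.002
203.0.113.55 - - [14/Aug/2024:09:23:44 +0000] "GET /api/status HTTP/1.1" 200 512 upstream=10.0.1.21:8080 upstream_status=200 upstream_rt=0.120 request_rt=0.121
203.0.113.60 - - [14/Aug/2024:09:24:01 +0000] "GET /api/users HTTP/1.1" 502 150 upstream=10.0.1.22:8080 upstream_status=502 upstream_rt=0.001 request_rt=0.002
EOF
# $9 is $status and $11 is the upstream=host:port field in this format:
awk '$9 == 502 { split($11, kv, "="); n[kv[2]]++ }
     END { for (u in n) print n[u], u }' /tmp/access_sample.log
# → 2 10.0.1.22:8080
```

In this sample every 502 came from 10.0.1.22:8080 — exactly the bad-node signature described above.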
A 502 Bad Gateway is always solvable. The error log gives you the errno, the errno tells you the failure class, and from there it's methodical elimination. Read before you restart, isolate before you guess, and fix the actual cause rather than the symptom.
