Symptoms
A 504 Gateway Timeout is one of the most disruptive errors an Nginx reverse proxy can return. When it strikes, users see a browser page reading "504 Gateway Time-out" or an API client receives HTTP 504 with a minimal response body. In Nginx's
error.log, the tell-tale signature looks like this:
2026/04/05 14:32:17 [error] 12483#12483: *8421 upstream timed out (110: Connection timed out)
while reading response header from upstream, client: 10.10.10.42,
server: solvethenetwork.com, request: "POST /api/v2/reports HTTP/1.1",
upstream: "http://10.10.20.15:8080/api/v2/reports",
host: "solvethenetwork.com"

Common observable symptoms include:
- Browsers display a white page with "504 Gateway Time-out" after 60 seconds of spinning
- API clients receive HTTP 504 with empty or minimal response bodies
- Nginx access.log shows status code 504 and a request time near or exceeding the configured timeout value
- Upstream application logs show requests that never completed or were abandoned mid-flight
- Monitoring dashboards spike on 5xx error rates while upstream CPU may appear normal
- Only certain routes or large payload sizes trigger the timeout, pointing to specific backend operations
Root Cause 1: Upstream Application Too Slow
Why It Happens
Nginx acts as a reverse proxy and waits a finite amount of time for the upstream server to start sending a response. If the backend application — whether Node.js, Python/uWSGI, PHP-FPM, or a Java service — takes longer than the configured timeout to return even the first byte of a response header, Nginx closes the connection and issues a 504. This is the most common root cause and is typically triggered by expensive operations such as large file processing, report generation, or bulk data exports that legitimately require more time than the default 60-second window allows.
How to Identify It
Check Nginx's $request_time and $upstream_response_time in the access log. First, confirm your log format captures these fields:
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" rt=$request_time urt=$upstream_response_time';

Then grep for 504s and examine the upstream response time:

grep ' 504 ' /var/log/nginx/access.log | awk '{print $NF, $(NF-1)}' | sort -n | tail -20

Sample output showing upstream response time clustering at the 60-second timeout boundary:
urt=60.002 rt=60.005
urt=60.001 rt=60.004
urt=59.998 rt=60.001

This pattern — upstream response time consistently equal to the timeout value — confirms the upstream is not responding in time, not that it is erroring out.
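The boundary check above can be scripted for alerting. A minimal sketch — the urt= field name comes from the log_format shown earlier, and the 60-second default timeout is an assumption — that flags upstream times pinned at the proxy_read_timeout value:

```python
def timed_out(urt_field, read_timeout=60.0, tolerance=0.5):
    """True when an urt=... field sits at the proxy_read_timeout boundary,
    meaning Nginx gave up waiting rather than the upstream erroring out."""
    value = float(urt_field.split("=", 1)[1])
    return abs(value - read_timeout) <= tolerance
```

Fed the sample output above, `timed_out("urt=60.002")` and `timed_out("urt=59.998")` both flag, while a healthy `urt=3.412` does not.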
How to Fix It
If the upstream genuinely needs more time for legitimate operations, increase proxy_read_timeout for that specific location rather than globally:
location /api/v2/reports {
proxy_pass http://backend_pool;
proxy_read_timeout 300s; # allow up to 5 minutes for report generation
proxy_connect_timeout 10s;
proxy_send_timeout 60s;
}

For long-running jobs, the better architectural fix is to move processing to an async pattern — accept the request, return HTTP 202 Accepted with a job ID, and let the client poll a lightweight status endpoint. This decouples user-facing latency from processing time entirely.
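The accept-and-poll pattern can be sketched framework-agnostically. This is an illustrative in-memory version — the function names, status URL shape, and report logic are hypothetical, and a production service would back this with a real job queue rather than an in-process executor:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)
_jobs = {}  # job_id -> Future; a real service would use persistent storage


def generate_report(params):
    # Stand-in for the slow work that used to block the HTTP request.
    return {"rows": sum(range(params["n"]))}


def submit_report(params):
    """Accept immediately: hand the work to a background worker and
    return HTTP 202 Accepted plus a job ID the client can poll."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = _executor.submit(generate_report, params)
    return 202, {"job_id": job_id, "status_url": f"/api/v2/reports/{job_id}"}


def poll_report(job_id):
    """Lightweight status endpoint: never blocks on the slow work."""
    future = _jobs[job_id]
    if future.done():
        return 200, {"state": "done", "result": future.result()}
    return 200, {"state": "pending"}
```

The key property is that neither endpoint ever holds an HTTP connection open for the duration of the work, so no proxy timeout can fire against it.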
Root Cause 2: proxy_read_timeout Set Too Low
Why It Happens
proxy_read_timeout defines how long Nginx waits between successive read operations from the upstream — not the total response time for the entire transaction. The Nginx default is 60 seconds. In many installations this default is never revisited, meaning any upstream operation legitimately exceeding one minute will always produce a 504, even when the server is functioning correctly. This is especially problematic after application deployments that introduce heavier processing, or when a database grows to the point where previously fast queries now take longer.
How to Identify It
Verify the currently active timeout values using
nginx -Tto dump the full compiled configuration including all included files:
nginx -T 2>/dev/null | grep -E 'proxy_read_timeout|proxy_connect_timeout|proxy_send_timeout'Expected output showing defaults that may be too conservative for your workload:
proxy_read_timeout 60s;
proxy_connect_timeout 60s;
proxy_send_timeout 60s;

Cross-reference with the upstream application's own timeout configuration. For a uWSGI backend on sw-infrarunbook-01:

grep -E 'harakiri|socket-timeout' /etc/uwsgi/apps-enabled/solvethenetwork.ini

harakiri = 120
socket-timeout = 30

Here the application allows 120 seconds of processing before self-terminating, but Nginx cuts the connection after 60 seconds — a clear mismatch that will produce 504s for all requests taking 60–120 seconds.
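This class of mismatch is mechanical to check. A small sketch, using the values from this example, that walks a timeout chain from innermost to outermost and reports any layer that fires before the one inside it:

```python
def find_mismatches(chain):
    """chain is ordered innermost -> outermost, e.g. app server, then Nginx,
    then load balancer. Each outer timeout must exceed the one inside it,
    or the outer layer abandons the inner one mid-processing."""
    problems = []
    for (inner, t_in), (outer, t_out) in zip(chain, chain[1:]):
        if t_out <= t_in:
            problems.append(f"{outer} ({t_out}s) fires before {inner} ({t_in}s)")
    return problems
```

With harakiri at 120 and proxy_read_timeout at 60, the checker reports the inversion; raise the Nginx value above 120 and it reports nothing.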
How to Fix It
Align Nginx timeouts with the upstream application's timeout chain, ensuring Nginx always waits slightly longer than the backend's own maximum processing time so the application can return a meaningful 500 error rather than being abandoned mid-processing:
# /etc/nginx/conf.d/proxy_timeouts.conf
proxy_connect_timeout 10s;
proxy_send_timeout 120s;
proxy_read_timeout 130s;   # backend harakiri=120, plus 10s buffer

After editing, validate and reload without downtime:

nginx -t && systemctl reload nginx

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

Apply granular per-location timeouts to avoid globally relaxing timeouts for fast endpoints that should fail quickly:
location /api/v2/export {
proxy_pass http://backend_pool;
proxy_read_timeout 300s;
}
location /api/ {
proxy_pass http://backend_pool;
proxy_read_timeout 30s; # tight timeout for standard API calls
}

Root Cause 3: Backend Processing Bottleneck
Why It Happens
The upstream application itself becomes the bottleneck when it exhausts its worker pool, runs out of file descriptors, hits memory limits, or suffers from application-level contention such as thread locks or queue saturation. Under these conditions the backend accepts the TCP connection from Nginx — so proxy_connect_timeout passes without issue — but never processes the request within the read timeout window. This is distinct from the upstream being slow on a single request: the backend is overwhelmed and cannot service new requests promptly, regardless of their individual complexity.
How to Identify It
On sw-infrarunbook-01, check the upstream application process state. For a PHP-FPM pool, query the status endpoint:
curl -s 'http://10.10.20.15:9000/status?full' | head -40

pool: www
process manager: dynamic
start time: 05/Apr/2026:12:00:01 +0000
accepted conn: 142381
listen queue: 47
max listen queue: 128
listen queue len: 128
idle processes: 0
active processes: 20
total processes: 20
max active processes: 20
max children reached: 1

The listen queue of 47, combined with idle processes: 0 and max children reached: 1, confirms the pool is fully saturated. All workers are busy and new requests queue up behind them. When the queue overflows its backlog limit, Nginx gets no response in time and 504s are the result.
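The same saturation read can be automated for monitoring. A sketch — the field names follow the plain-text PHP-FPM status format shown above — that parses the status output and flags a saturated pool:

```python
def parse_fpm_status(text):
    """Parse the plain-text PHP-FPM status page into a dict of strings."""
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        stats[key.strip()] = value.strip()
    return stats


def pool_saturated(stats):
    """Saturated pool: no idle workers left and requests already queueing."""
    return int(stats["idle processes"]) == 0 and int(stats["listen queue"]) > 0
```

Wired to a cron job or exporter, this turns "eyeball the status page during an incident" into an alert that fires before the 504s do.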
For a Node.js or Java service, check active established connections to the upstream port:
ss -tnp state ESTABLISHED '( dport = :8080 or sport = :8080 )' | wc -l

487

How to Fix It
Increase the PHP-FPM worker pool size. Each worker uses roughly 30–60 MB of RAM, so tune to available memory:
# /etc/php/8.2/fpm/pool.d/www.conf
pm = dynamic
pm.max_children = 50
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 20
pm.max_requests = 500

For Node.js, leverage the cluster module via PM2 to spawn workers matching CPU core count:
pm2 start /srv/solvethenetwork/api/server.js -i max --name api-workers

Configure Nginx upstream keepalive connections to reduce TCP handshake overhead between Nginx and the backend pool, increasing effective throughput:
upstream backend_pool {
server 10.10.20.15:8080;
server 10.10.20.16:8080;
keepalive 64;
}
location /api/ {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
}

Root Cause 4: Slow Database Queries
Why It Happens
The majority of production 504 incidents trace back to the database layer. A query that ran in milliseconds when a table held 10,000 rows becomes catastrophically slow at 50 million rows without proper indexing. Missing indexes, N+1 query patterns, table-level locks, autovacuum running during peak traffic in PostgreSQL, or full table scans on unindexed filter columns all translate to the backend thread blocking on a database response. The application itself is not CPU-bound — it is I/O-bound, waiting for the database, which in turn causes the backend to miss Nginx's proxy_read_timeout.
How to Identify It
Enable the PostgreSQL slow query log on sw-infrarunbook-01 to capture queries exceeding one second:
# /etc/postgresql/15/main/postgresql.conf
log_min_duration_statement = 1000 # log queries taking > 1 second
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '

systemctl reload postgresql

Watch the log during a 504 incident in real time:
tail -f /var/log/postgresql/postgresql-15-main.log | grep -E 'duration|ERROR'

2026-04-05 14:32:01 UTC [8821]: [3-1] user=infrarunbook-admin,db=solvethenetwork_prod,
app=uwsgi,client=10.10.20.15 LOG: duration: 58432.741 ms
statement: SELECT * FROM audit_events ae
JOIN users u ON u.id = ae.user_id
WHERE ae.created_at > '2026-01-01'
ORDER BY ae.created_at DESC;

At 58 seconds for a single query, this is the direct cause of the 504. Use EXPLAIN ANALYZE to expose the execution plan:
psql -U infrarunbook-admin -d solvethenetwork_prod -c "
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM audit_events ae
JOIN users u ON u.id = ae.user_id
WHERE ae.created_at > '2026-01-01'
ORDER BY ae.created_at DESC
LIMIT 1000;"

Seq Scan on audit_events ae (cost=0.00..982341.22 rows=4812341 width=312)
(actual time=0.042..57891.340 rows=4812341 loops=1)
Filter: (created_at > '2026-01-01 00:00:00+00'::timestamptz)
Rows Removed by Filter: 123456
Buffers: shared hit=12 read=481234
Planning Time: 2.341 ms
Execution Time: 58021.482 ms

A sequential scan across 4.8 million rows with 481,234 disk block reads is the bottleneck. No index exists on created_at.
How to Fix It
Add the missing index concurrently to avoid a full table lock during index creation on a live system:
psql -U infrarunbook-admin -d solvethenetwork_prod -c "
CREATE INDEX CONCURRENTLY idx_audit_events_created_at
ON audit_events (created_at DESC);"

Re-run EXPLAIN ANALYZE to confirm the plan switched from a sequential scan to an index scan:
Index Scan using idx_audit_events_created_at on audit_events ae
(cost=0.56..15234.22 rows=1000 width=312)
(actual time=0.182..14.841 rows=1000 loops=1)
Execution Time: 15.291 ms

Query time dropped from 58 seconds to 15 milliseconds. For MySQL/MariaDB environments, enable and analyze the slow query log similarly:
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';

mysqldumpslow -s t -t 10 /var/log/mysql/slow.log

Root Cause 5: Network Latency Between Nginx and Upstream
Why It Happens
Even within an internal RFC 1918 network, network latency can trigger 504 errors. Causes include misconfigured MTU settings leading to packet fragmentation and retransmission, saturated switch uplinks, spanning-tree topology recalculations, NIC duplex mismatches causing half-duplex collisions, or firewall stateful connection table exhaustion dropping mid-session packets. When packets are dropped or significantly delayed at the network layer between Nginx on sw-infrarunbook-01 and the upstream at 10.10.20.15, the TCP stream stalls and Nginx's read timeout fires before the backend's response arrives — even when the backend itself processed the request quickly.
How to Identify It
Begin with latency and packet loss measurement between the Nginx host and the upstream during a degraded period:
ping -c 100 -i 0.2 10.10.20.15

--- 10.10.20.15 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 19837ms
rtt min/avg/max/mdev = 0.182/4.832/48.291/9.341 ms

An average of 4.8 ms on a LAN (expected: sub-1 ms) and a max spike to 48 ms indicate serious intermittent network issues. Use mtr for per-hop path analysis to isolate the faulty segment:
mtr --report --report-cycles 60 10.10.20.15

Host             Loss%  Snt  Last  Avg  Best  Wrst  StDev
1. 10.10.20.1 0.0% 60 0.3 0.4 0.2 1.1 0.2
2. 10.10.10.254 3.3% 60 1.2 22.1 0.8 48.3 14.2
3. 10.10.20.15    3.3%  60   1.4   22.4  0.9   48.8  14.3

The 3.3% packet loss appearing at hop 2 (core switch 10.10.10.254) is the smoking gun — a misbehaving or saturated switch in the path. Check interface statistics on sw-infrarunbook-01 for hardware-level symptoms:
ip -s link show eth0

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
RX: bytes packets errors dropped overrun mcast
18432984123 14234123 0 2341 0 0
TX: bytes packets errors dropped carrier collisions
9823412312 8234123 0 0 0 0

The RX dropped counter of 2,341 indicates receive buffer overflow — packets arriving faster than the kernel can process them. Also test for MTU fragmentation issues:

ping -M do -s 1472 10.10.20.15

ping: local error: message too long, mtu=1450

The path MTU is 1,450 bytes but the interface is configured for 1,500 — causing fragmentation or drops for large response payloads.
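The -s 1472 figure is not arbitrary: an IPv4 ping carries 20 bytes of IP header plus 8 bytes of ICMP header, so a 1,472-byte payload probes exactly a 1,500-byte MTU. A small sketch of the arithmetic:

```python
ICMP_HEADER = 8   # bytes of ICMP echo header
IPV4_HEADER = 20  # bytes of IPv4 header without options


def ping_payload_for_mtu(mtu):
    """Largest `ping -M do -s N` payload that fits an unfragmented IPv4 packet."""
    return mtu - ICMP_HEADER - IPV4_HEADER


def fits_path(payload, path_mtu):
    """Would a don't-fragment probe of this payload survive the path MTU?"""
    return payload + ICMP_HEADER + IPV4_HEADER <= path_mtu
```

For a 1,450-byte path MTU, the largest safe payload is 1,422 bytes — which is why the 1,472-byte probe above fails with "message too long".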
How to Fix It
Correct the MTU mismatch by aligning the interface to the path MTU:
ip link set eth0 mtu 1450
# Persist with MTUBytes=1450 in the [Link] section of /etc/systemd/network/10-eth0.network

Increase the NIC receive ring buffer to handle traffic bursts without dropping packets:
ethtool -G eth0 rx 4096
# Persist via /etc/udev/rules.d/60-net-buffers.rules

For the core switch packet loss, escalate to the network team with the full mtr report. As an Nginx-level mitigation during the network repair, configure upstream retry logic so transient packet drops trigger a retry to a second upstream node rather than immediately returning 504 to the client:
upstream backend_pool {
server 10.10.20.15:8080;
server 10.10.20.16:8080;
}
location /api/ {
proxy_pass http://backend_pool;
proxy_next_upstream error timeout http_502 http_503;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 10s;
}

Root Cause 6: Upstream Worker Thread Exhaustion
Why It Happens
Every upstream application framework has a hard limit on concurrent workers or threads. When all workers are occupied — processing slow requests, waiting on database queries, or blocked on downstream service calls — new incoming requests from Nginx are placed in a listen queue. If Nginx's proxy_read_timeout expires before a worker becomes free to service the queued request, a 504 results. This is distinct from individual request slowness: each request would be fast if a worker were available, but none are.
How to Identify It
For a Spring Boot application exposing the Actuator endpoint, check the active thread count against the pool maximum:
curl -s http://10.10.20.15:8081/actuator/metrics/executor.active | python3 -m json.tool

{
"name": "executor.active",
"measurements": [
{
"statistic": "VALUE",
"value": 200.0
}
]
}

200 active threads against a pool maximum of 200 means the pool is at 100% utilization. Correlate with Nginx's stub_status to confirm connection queuing on the Nginx side:

curl -s http://127.0.0.1/nginx_status

Active connections: 847
server accepts handled requests
234123 234123 891234
Reading: 0 Writing: 483 Waiting: 364High Writing count (483) combined with a saturated thread pool upstream confirms requests are piling up waiting for workers.
How to Fix It
Increase the Tomcat thread pool in Spring Boot's application.properties:
server.tomcat.threads.max=400
server.tomcat.threads.min-spare=20
server.tomcat.accept-count=200

Add horizontal scaling by deploying a second backend node and using least-connection load balancing so Nginx distributes to the least-busy upstream:
upstream backend_pool {
least_conn;
server 10.10.20.15:8080 weight=1;
server 10.10.20.16:8080 weight=1;
keepalive 32;
}

Root Cause 7: File Descriptor Exhaustion on the Nginx Process
Why It Happens
Linux imposes per-process and system-wide limits on open file descriptors. Each active connection — both from clients to Nginx and from Nginx to each upstream — consumes one file descriptor. When an Nginx worker process hits its nofile limit, it cannot open new sockets to upstream servers. New requests fail to establish upstream connections and time out with 504. This typically appears suddenly during traffic spikes on systems where the default limit of 1,024 was never raised.
How to Identify It
cat /proc/$(pgrep -f 'nginx: worker' | head -1)/limits | grep 'open files'

Max open files            1024                 1024                 files

ls /proc/$(pgrep -f 'nginx: worker' | head -1)/fd | wc -l

1021

At 1,021 of 1,024, the process is three file descriptors from exhaustion. Nginx's error log confirms the problem:
2026/04/05 14:33:01 [alert] 12483#12483: *8500 socket() failed (24: Too many open files)
while connecting to upstream, client: 10.10.10.42, server: solvethenetwork.com

How to Fix It
Raise the file descriptor limit in the Nginx configuration and system limits simultaneously:
# /etc/nginx/nginx.conf
worker_rlimit_nofile 65535;
events {
worker_connections 16384;
use epoll;
multi_accept on;
}

# /etc/security/limits.d/nginx.conf
nginx soft nofile 65535
nginx hard nofile 65535

nginx -t && systemctl reload nginx

Prevention
Reactive troubleshooting is costly in engineer time and user trust. The following practices eliminate most 504 incidents before they surface.
- Set layered timeouts consistently. Every layer — load balancer, Nginx, application server, database connection pool — should have its timeout set so that inner layers always timeout before outer layers. Nginx should wait slightly longer than the application's own timeout so the backend can return a meaningful 500 error rather than being abandoned silently with a 504.
- Monitor $upstream_response_time as a latency metric. Ship this field from Nginx access logs to your observability stack. Alert when the 95th percentile exceeds 50% of your configured proxy_read_timeout — this gives a runway to investigate before 504s begin occurring at scale.
- Enable slow query logs permanently. Set log_min_duration_statement = 500 in PostgreSQL and long_query_time = 0.5 in MySQL. The performance overhead is negligible and the visibility is invaluable. Review weekly using pgBadger or pt-query-digest.
- Capacity-plan worker pools proactively. Calculate maximum workers as floor(available_RAM / per_worker_RAM). Monitor the ratio of idle to total workers. Alert and scale before saturation exceeds 80% to preserve headroom for traffic bursts.
- Configure upstream health checks. Mark backends as unavailable before Nginx wastes requests on a degraded host. Using nginx_upstream_check_module:
upstream backend_pool {
server 10.10.20.15:8080;
server 10.10.20.16:8080;
check interval=3000 rise=2 fall=3 timeout=2000 type=http;
check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
check_http_expect_alive http_2xx;
}

- Move long-running operations to async job queues. Any operation that could exceed 10 seconds should be processed by a background worker (Celery, Sidekiq, BullMQ) with a polling or webhook completion mechanism. This decouples user-facing HTTP latency from backend processing time entirely.
- Tune Linux networking parameters. Ensure net.core.somaxconn and net.ipv4.tcp_max_syn_backlog are sized for peak traffic to prevent connection queue drops at the kernel level:
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sysctl -w net.ipv4.tcp_fin_timeout=15
# Persist in /etc/sysctl.d/99-nginx-tuning.conf

- Implement circuit breakers in the application layer. Libraries like Resilience4j (Java) or pybreaker (Python) prevent cascade failures where one slow downstream dependency causes all threads to pile up and time out simultaneously, multiplying the blast radius of a single slow service.
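The capacity-planning rule from the list above — floor(available_RAM / per_worker_RAM) plus an 80% saturation alert — can be sketched as a pair of helpers (the RAM figures in the examples are illustrative):

```python
def max_workers(available_ram_mb, per_worker_mb, reserve_mb=0):
    """floor(available_RAM / per_worker_RAM), optionally holding back RAM
    for the OS and page cache."""
    return max((available_ram_mb - reserve_mb) // per_worker_mb, 0)


def needs_scaling(active, total, threshold=0.8):
    """Alert once worker saturation exceeds the 80% headroom threshold."""
    return total > 0 and active / total > threshold
```

For example, a host with 4 GB available and 60 MB per worker supports 68 workers, or 51 if 1 GB is reserved for the OS; 17 of 20 workers busy (85%) crosses the 80% alert line.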
Frequently Asked Questions
Q: What is the difference between a 502 and a 504 in Nginx?
A: A 502 Bad Gateway means Nginx received an invalid or unexpected response from the upstream — the upstream responded, but what it sent was not a valid HTTP response. A 504 Gateway Timeout means Nginx received no response at all within the configured timeout window. In practice: 502 typically points to an application crash, misconfiguration, or upstream returning garbage data; 504 points to the upstream being too slow or unreachable.
Q: How do I tell if the 504 is Nginx's timeout or the upstream's own timeout?
A: Check the exact elapsed time in the Nginx access log field $upstream_response_time. If it equals your configured proxy_read_timeout value to within a fraction of a second (e.g., exactly 60.00 seconds), Nginx fired its timeout. If the upstream responds with its own error (a 500 or 503 with an error body), that appears as a different status code. A 504 in Nginx's access log always means Nginx was waiting and gave up — the upstream never responded.
Q: Does increasing proxy_read_timeout actually fix the root problem?
A: No. Increasing the timeout suppresses the 504 symptom and prevents users from seeing the error, but it does not address why the upstream is slow. It is a valid short-term measure while you investigate, but the real fix must address the root cause — missing database indexes, insufficient worker capacity, or unoptimized application code. A timeout increase without a root cause fix will eventually fail again when load increases further.
Q: My 504s only happen under high traffic load. What should I check first?
A: Load-dependent 504s almost always indicate worker pool exhaustion or database lock contention. Check the upstream's active worker count at peak time, the database's pg_stat_activity (PostgreSQL) or SHOW PROCESSLIST (MySQL) for blocked or long-running queries, and Nginx's stub_status for high Writing counts. Also verify the upstream listen backlog is not overflowing using the PHP-FPM status endpoint or equivalent.
Q: How do I check Nginx's effective timeout configuration without restarting it?
A: Run nginx -T 2>/dev/null | grep timeout to dump the compiled configuration including all included files, showing exactly what values are active right now. This is safe and non-disruptive. You can also run nginx -t to validate syntax without a full dump.
Q: Can a 504 be caused by DNS resolution failure for the upstream?
A: Yes. If Nginx uses a hostname in the proxy_pass directive with a resolver directive configured, a slow or failing DNS resolution consumes time before the TCP connection even begins. If DNS resolution is slow enough, the combined time exceeds proxy_read_timeout and produces a 504. Always use stable RFC 1918 IP addresses for upstream servers in production, or configure a local caching resolver to minimize DNS latency.
Q: Why does my upstream show only 20% CPU during a 504 incident — shouldn't it be maxed out?
A: Low CPU during 504s is the classic signature of I/O-bound blocking — typically database wait or network I/O to a downstream service. Application threads are idle, blocked waiting for a query result or an external API response. They consume no CPU but they occupy worker slots, preventing new requests from being processed. Check pg_stat_activity for long-running queries and iostat -x 1 for disk saturation rather than looking at CPU utilization.
Q: How do I safely test timeout changes on a production Nginx without dropping connections?
A: Use nginx -t to validate the configuration first, then systemctl reload nginx (or nginx -s reload). This performs a graceful reload — existing connections continue on old worker processes while new connections pick up the updated configuration immediately. No connections are dropped and no downtime occurs. Always validate on a staging server first and keep a known-good configuration backup ready for rollback.
Q: Should I set proxy_read_timeout globally or per location block?
A: Per location block is strongly preferred. A globally relaxed timeout means even simple health-check or quick-lookup endpoints will wait for the full timeout value before failing, degrading user experience during partial outages and masking problems. Reserve long timeouts only for locations with legitimate long-running operations such as report exports or file processing, and keep tight timeouts on all standard API and page routes.
Q: How can I alert on 504 errors proactively before users report them?
A: Configure your log shipper (Filebeat, Promtail, or Fluentd) to parse the Nginx access log and export a nginx_http_504_total counter metric. Set a rate alert at more than five 504s per minute above your normal baseline. More powerfully, set a leading-indicator alert on the 95th percentile of upstream_response_time exceeding 70–80% of your proxy_read_timeout value — this fires before 504s begin and gives you time to act.
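The leading-indicator threshold is straightforward to compute from shipped upstream_response_time samples. A sketch using a nearest-rank p95 — the 60-second proxy_read_timeout and the 80% ratio are assumptions matching this article's examples:

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; dependency-free and adequate for alerting."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]


def latency_alert(urt_samples, read_timeout=60.0, ratio=0.8):
    """Fire before 504s do: p95 upstream time above a fraction of proxy_read_timeout."""
    return percentile(urt_samples, 95) > read_timeout * ratio
```

With a 60-second timeout and an 80% ratio, the alert fires once the p95 of sampled upstream times crosses 48 seconds — while requests are still completing successfully.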
Q: Can Nginx serve cached responses during a 504 condition to protect users?
A: Yes, for cacheable GET responses. Configure proxy_cache with the proxy_cache_use_stale error timeout updating directive to serve stale cached content when the upstream times out or errors. This is a resilience pattern that protects the user experience during brief upstream degradation while the root cause is addressed. It only works for previously cached, cacheable responses and is not a substitute for fixing the underlying performance problem.
Q: What is the difference between proxy_connect_timeout and proxy_read_timeout?
A: proxy_connect_timeout controls how long Nginx waits to establish the TCP connection to the upstream — the three-way handshake. This should be short (5–15 seconds), because TCP failing to connect means the upstream host is unreachable or the port is not listening, not that it is slow. proxy_read_timeout controls how long Nginx waits for response data after the connection is established. Nearly all 504 errors come from proxy_read_timeout expiring. A 504 with very short elapsed time (under 15 seconds) may point to proxy_connect_timeout firing due to a firewall or routing issue.
