Symptoms
A 504 Gateway Timeout is one of the most disruptive errors an Nginx reverse proxy can return. When it strikes, users see a browser page reading "504 Gateway Time-out" or an API client receives HTTP 504 with a minimal response body. In Nginx's
error.log, the tell-tale signature looks like this:
2026/04/05 14:32:17 [error] 12483#12483: *8421 upstream timed out (110: Connection timed out)
while reading response header from upstream, client: 10.10.10.42,
server: solvethenetwork.com, request: "POST /api/v2/reports HTTP/1.1",
upstream: "http://10.10.20.15:8080/api/v2/reports",
host: "solvethenetwork.com"

Common observable symptoms include:
- Browsers display a white page with "504 Gateway Time-out" after 60 seconds of spinning
- API clients receive HTTP 504 with empty or minimal response bodies
- Nginx access.log shows status code 504 and a request time near or exceeding the configured timeout value
- Upstream application logs show requests that never completed or were abandoned mid-flight
- Monitoring dashboards spike on 5xx error rates while upstream CPU may appear normal
- Only certain routes or large payload sizes trigger the timeout, pointing to specific backend operations
Root Cause 1: Upstream Application Too Slow
Why It Happens
Nginx acts as a reverse proxy and waits a finite amount of time for the upstream server to start sending a response. If the backend application — whether Node.js, Python/uWSGI, PHP-FPM, or a Java service — takes longer than the configured timeout to return even the first byte of a response header, Nginx closes the connection and issues a 504. This is the most common root cause and is typically triggered by expensive operations such as large file processing, report generation, or bulk data exports that legitimately require more time than the default 60-second window allows.
How to Identify It
Check Nginx's $request_time and $upstream_response_time in the access log. First, confirm your log format captures these fields:
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" rt=$request_time urt=$upstream_response_time';

Then grep for 504s and examine the upstream response time:

grep ' 504 ' /var/log/nginx/access.log | awk '{print $NF, $(NF-1)}' | sort -n | tail -20

Sample output showing upstream response time clustering at the 60-second timeout boundary:
urt=60.002 rt=60.005
urt=60.001 rt=60.004
urt=59.998 rt=60.001

This pattern — upstream response time consistently equal to the timeout value — confirms the upstream is not responding in time, not that it is erroring out.
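The boundary check above can be scripted for alerting. A minimal sketch — the urt= field name comes from the log_format shown earlier, and the 60-second default timeout is an assumption — that flags upstream times pinned at the proxy_read_timeout value:

```python
def timed_out(urt_field, read_timeout=60.0, tolerance=0.5):
    """True when an urt=... field sits at the proxy_read_timeout boundary,
    meaning Nginx gave up waiting rather than the upstream erroring out."""
    value = float(urt_field.split("=", 1)[1])
    return abs(value - read_timeout) <= tolerance
```

Fed the sample output above, `timed_out("urt=60.002")` and `timed_out("urt=59.998")` both flag, while a healthy `urt=3.412` does not.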
How to Fix It
If the upstream genuinely needs more time for legitimate operations, increase proxy_read_timeout for that specific location rather than globally:
location /api/v2/reports {
proxy_pass http://backend_pool;
proxy_read_timeout 300s; # allow up to 5 minutes for report generation
proxy_connect_timeout 10s;
proxy_send_timeout 60s;
}

For long-running jobs, the better architectural fix is to move processing to an async pattern — accept the request, return HTTP 202 Accepted with a job ID, and let the client poll a lightweight status endpoint. This decouples user-facing latency from processing time entirely.
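The accept-and-poll pattern can be sketched framework-agnostically. This is an illustrative in-memory version — the function names, status URL shape, and report logic are hypothetical, and a production service would back this with a real job queue rather than an in-process executor:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)
_jobs = {}  # job_id -> Future; a real service would use persistent storage


def generate_report(params):
    # Stand-in for the slow work that used to block the HTTP request.
    return {"rows": sum(range(params["n"]))}


def submit_report(params):
    """Accept immediately: hand the work to a background worker and
    return HTTP 202 Accepted plus a job ID the client can poll."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = _executor.submit(generate_report, params)
    return 202, {"job_id": job_id, "status_url": f"/api/v2/reports/{job_id}"}


def poll_report(job_id):
    """Lightweight status endpoint: never blocks on the slow work."""
    future = _jobs[job_id]
    if future.done():
        return 200, {"state": "done", "result": future.result()}
    return 200, {"state": "pending"}
```

The key property is that neither endpoint ever holds an HTTP connection open for the duration of the work, so no proxy timeout can fire against it.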
Root Cause 2: proxy_read_timeout Set Too Low
Why It Happens
proxy_read_timeout defines how long Nginx waits between successive read operations from the upstream — not the total response time for the entire transaction. The Nginx default is 60 seconds. In many installations this default is never revisited, meaning any upstream operation legitimately exceeding one minute will always produce a 504, even when the server is functioning correctly. This is especially problematic after application deployments that introduce heavier processing, or when a database grows to the point where previously fast queries now take longer.
How to Identify It
Verify the currently active timeout values using
nginx -Tto dump the full compiled configuration including all included files:
nginx -T 2>/dev/null | grep -E 'proxy_read_timeout|proxy_connect_timeout|proxy_send_timeout'Expected output showing defaults that may be too conservative for your workload:
proxy_read_timeout 60s;
proxy_connect_timeout 60s;
proxy_send_timeout 60s;

Cross-reference with the upstream application's own timeout configuration. For a uWSGI backend on sw-infrarunbook-01:

grep -E 'harakiri|socket-timeout' /etc/uwsgi/apps-enabled/solvethenetwork.ini

harakiri = 120
socket-timeout = 30

Here the application allows 120 seconds of processing before self-terminating, but Nginx cuts the connection after 60 seconds — a clear mismatch that will produce 504s for all requests taking 60–120 seconds.
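This class of mismatch is mechanical to check. A small sketch, using the values from this example, that walks a timeout chain from innermost to outermost and reports any layer that fires before the one inside it:

```python
def find_mismatches(chain):
    """chain is ordered innermost -> outermost, e.g. app server, then Nginx,
    then load balancer. Each outer timeout must exceed the one inside it,
    or the outer layer abandons the inner one mid-processing."""
    problems = []
    for (inner, t_in), (outer, t_out) in zip(chain, chain[1:]):
        if t_out <= t_in:
            problems.append(f"{outer} ({t_out}s) fires before {inner} ({t_in}s)")
    return problems
```

With harakiri at 120 and proxy_read_timeout at 60, the checker reports the inversion; raise the Nginx value above 120 and it reports nothing.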
How to Fix It
Align Nginx timeouts with the upstream application's timeout chain, ensuring Nginx always waits slightly longer than the backend's own maximum processing time so the application can return a meaningful 500 error rather than being abandoned mid-processing:
# /etc/nginx/conf.d/proxy_timeouts.conf
proxy_connect_timeout 10s;
proxy_send_timeout 120s;
proxy_read_timeout 130s;   # backend harakiri=120, plus 10s buffer

After editing, validate and reload without downtime:

nginx -t && systemctl reload nginx

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

Apply granular per-location timeouts to avoid globally relaxing timeouts for fast endpoints that should fail quickly:
location /api/v2/export {
proxy_pass http://backend_pool;
proxy_read_timeout 300s;
}
location /api/ {
proxy_pass http://backend_pool;
proxy_read_timeout 30s; # tight timeout for standard API calls
}

Root Cause 3: Backend Processing Bottleneck
Why It Happens
The upstream application itself becomes the bottleneck when it exhausts its worker pool, runs out of file descriptors, hits memory limits, or suffers from application-level contention such as thread locks or queue saturation. Under these conditions the backend accepts the TCP connection from Nginx — so proxy_connect_timeout passes without issue — but never processes the request within the read timeout window. This is distinct from the upstream being slow on a single request: the backend is overwhelmed and cannot service new requests promptly, regardless of their individual complexity.
How to Identify It
On sw-infrarunbook-01, check the upstream application process state. For a PHP-FPM pool, query the status endpoint:
curl -s 'http://10.10.20.15:9000/status?full' | head -40

pool: www
process manager: dynamic
start time: 05/Apr/2026:12:00:01 +0000
accepted conn: 142381
listen queue: 47
max listen queue: 128
listen queue len: 128
idle processes: 0
active processes: 20
total processes: 20
max active processes: 20
max children reached: 1

The listen queue of 47, combined with idle processes: 0 and max children reached: 1, confirms the pool is fully saturated. All workers are busy and new requests queue up behind them. When the queue overflows its backlog limit, Nginx gets no response in time and 504s are the result.
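The same saturation read can be automated for monitoring. A sketch — the field names follow the plain-text PHP-FPM status format shown above — that parses the status output and flags a saturated pool:

```python
def parse_fpm_status(text):
    """Parse the plain-text PHP-FPM status page into a dict of strings."""
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        stats[key.strip()] = value.strip()
    return stats


def pool_saturated(stats):
    """Saturated pool: no idle workers left and requests already queueing."""
    return int(stats["idle processes"]) == 0 and int(stats["listen queue"]) > 0
```

Wired to a cron job or exporter, this turns "eyeball the status page during an incident" into an alert that fires before the 504s do.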
For a Node.js or Java service, check active established connections to the upstream port:
ss -tnp state ESTABLISHED '( dport = :8080 or sport = :8080 )' | wc -l

487

How to Fix It
Increase the PHP-FPM worker pool size. Each worker uses roughly 30–60 MB of RAM, so tune to available memory:
# /etc/php/8.2/fpm/pool.d/www.conf
pm = dynamic
pm.max_children = 50
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 20
pm.max_requests = 500

For Node.js, leverage the cluster module via PM2 to spawn workers matching CPU core count:
pm2 start /srv/solvethenetwork/api/server.js -i max --name api-workers

Configure Nginx upstream keepalive connections to reduce TCP handshake overhead between Nginx and the backend pool, increasing effective throughput:
upstream backend_pool {
server 10.10.20.15:8080;
server 10.10.20.16:8080;
keepalive 64;
}
location /api/ {
proxy_pass http://backend_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
}

Root Cause 4: Slow Database Queries
Why It Happens
The majority of production 504 incidents trace back to the database layer. A query that ran in milliseconds when a table held 10,000 rows becomes catastrophically slow at 50 million rows without proper indexing. Missing indexes, N+1 query patterns, table-level locks, autovacuum running during peak traffic in PostgreSQL, or full table scans on unindexed filter columns all translate to the backend thread blocking on a database response. The application itself is not CPU-bound — it is I/O-bound, waiting for the database, which in turn causes the backend to miss Nginx's proxy_read_timeout.
How to Identify It
Enable the PostgreSQL slow query log on sw-infrarunbook-01 to capture queries exceeding one second:
# /etc/postgresql/15/main/postgresql.conf
log_min_duration_statement = 1000 # log queries taking > 1 second
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '

systemctl reload postgresql

Watch the log during a 504 incident in real time:
tail -f /var/log/postgresql/postgresql-15-main.log | grep -E 'duration|ERROR'

2026-04-05 14:32:01 UTC [8821]: [3-1] user=infrarunbook-admin,db=solvethenetwork_prod,
app=uwsgi,client=10.10.20.15 LOG: duration: 58432.741 ms
statement: SELECT * FROM audit_events ae
JOIN users u ON u.id = ae.user_id
WHERE ae.created_at > '2026-01-01'
ORDER BY ae.created_at DESC;

At 58 seconds for a single query, this is the direct cause of the 504. Use EXPLAIN ANALYZE to expose the execution plan:
psql -U infrarunbook-admin -d solvethenetwork_prod -c "
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM audit_events ae
JOIN users u ON u.id = ae.user_id
WHERE ae.created_at > '2026-01-01'
ORDER BY ae.created_at DESC
LIMIT 1000;"

Seq Scan on audit_events ae (cost=0.00..982341.22 rows=4812341 width=312)
(actual time=0.042..57891.340 rows=4812341 loops=1)
Filter: (created_at > '2026-01-01 00:00:00+00'::timestamptz)
Rows Removed by Filter: 123456
Buffers: shared hit=12 read=481234
Planning Time: 2.341 ms
Execution Time: 58021.482 ms

A sequential scan across 4.8 million rows with 481,234 disk block reads is the bottleneck. No index exists on created_at.
How to Fix It
Add the missing index concurrently to avoid a full table lock during index creation on a live system:
psql -U infrarunbook-admin -d solvethenetwork_prod -c "
CREATE INDEX CONCURRENTLY idx_audit_events_created_at
ON audit_events (created_at DESC);"

Re-run EXPLAIN ANALYZE to confirm the plan switched from a sequential scan to an index scan:
Index Scan using idx_audit_events_created_at on audit_events ae
(cost=0.56..15234.22 rows=1000 width=312)
(actual time=0.182..14.841 rows=1000 loops=1)
Execution Time: 15.291 ms

Query time dropped from 58 seconds to 15 milliseconds. For MySQL/MariaDB environments, enable and analyze the slow query log similarly:
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';

mysqldumpslow -s t -t 10 /var/log/mysql/slow.log

Root Cause 5: Network Latency Between Nginx and Upstream
Why It Happens
Even within an internal RFC 1918 network, network latency can trigger 504 errors. Causes include misconfigured MTU settings leading to packet fragmentation and retransmission, saturated switch uplinks, spanning-tree topology recalculations, NIC duplex mismatches causing half-duplex collisions, or firewall stateful connection table exhaustion dropping mid-session packets. When packets are dropped or significantly delayed at the network layer between Nginx on sw-infrarunbook-01 and the upstream at 10.10.20.15, the TCP stream stalls and Nginx's read timeout fires before the backend's response arrives — even when the backend itself processed the request quickly.
How to Identify It
Begin with latency and packet loss measurement between the Nginx host and the upstream during a degraded period:
ping -c 100 -i 0.2 10.10.20.15

--- 10.10.20.15 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 19837ms
rtt min/avg/max/mdev = 0.182/4.832/48.291/9.341 ms

An average of 4.8 ms on a LAN (expected: sub-1 ms) and a max spike to 48 ms indicate serious intermittent network issues. Use mtr for per-hop path analysis to isolate the faulty segment:
mtr --report --report-cycles 60 10.10.20.15

Host             Loss%  Snt  Last  Avg  Best  Wrst  StDev
1. 10.10.20.1 0.0% 60 0.3 0.4 0.2 1.1 0.2
2. 10.10.10.254 3.3% 60 1.2 22.1 0.8 48.3 14.2
3. 10.10.20.15    3.3%  60   1.4   22.4  0.9   48.8  14.3

The 3.3% packet loss appearing at hop 2 (core switch 10.10.10.254) is the smoking gun — a misbehaving or saturated switch in the path. Check interface statistics on sw-infrarunbook-01 for hardware-level symptoms:
ip -s link show eth0

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
RX: bytes packets errors dropped overrun mcast
18432984123 14234123 0 2341 0 0
TX: bytes packets errors dropped carrier collisions
9823412312 8234123 0 0 0 0

The RX dropped counter of 2,341 indicates receive buffer overflow — packets arriving faster than the kernel can process them. Also test for MTU fragmentation issues:

ping -M do -s 1472 10.10.20.15

ping: local error: message too long, mtu=1450

The path MTU is 1,450 bytes but the interface is configured for 1,500 — causing fragmentation or drops for large response payloads.
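The -s 1472 figure is not arbitrary: an IPv4 ping carries 20 bytes of IP header plus 8 bytes of ICMP header, so a 1,472-byte payload probes exactly a 1,500-byte MTU. A small sketch of the arithmetic:

```python
ICMP_HEADER = 8   # bytes of ICMP echo header
IPV4_HEADER = 20  # bytes of IPv4 header without options


def ping_payload_for_mtu(mtu):
    """Largest `ping -M do -s N` payload that fits an unfragmented IPv4 packet."""
    return mtu - ICMP_HEADER - IPV4_HEADER


def fits_path(payload, path_mtu):
    """Would a don't-fragment probe of this payload survive the path MTU?"""
    return payload + ICMP_HEADER + IPV4_HEADER <= path_mtu
```

For a 1,450-byte path MTU, the largest safe payload is 1,422 bytes — which is why the 1,472-byte probe above fails with "message too long".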
How to Fix It
Correct the MTU mismatch by aligning the interface to the path MTU:
ip link set eth0 mtu 1450
# Persist with MTUBytes=1450 in the [Link] section of /etc/systemd/network/10-eth0.network

Increase the NIC receive ring buffer to handle traffic bursts without dropping packets:
ethtool -G eth0 rx 4096
# Persist via /etc/udev/rules.d/60-net-buffers.rules

For the core switch packet loss, escalate to the network team with the full mtr report. As an Nginx-level mitigation during the network repair, configure upstream retry logic so transient packet drops trigger a retry to a second upstream node rather than immediately returning 504 to the client:
upstream backend_pool {
server 10.10.20.15:8080;
server 10.10.20.16:8080;
}
location /api/ {
proxy_pass http://backend_pool;
proxy_next_upstream error timeout http_502 http_503;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 10s;
}

Root Cause 6: Upstream Worker Thread Exhaustion
Why It Happens
Every upstream application framework has a hard limit on concurrent workers or threads. When all workers are occupied — processing slow requests, waiting on database queries, or blocked on downstream service calls — new incoming requests from Nginx are placed in a listen queue. If Nginx's proxy_read_timeout expires before a worker becomes free to service the queued request, a 504 results. This is distinct from individual request slowness: each request would be fast if a worker were available, but none are.
How to Identify It
For a Spring Boot application exposing the Actuator endpoint, check the active thread count against the pool maximum:
curl -s http://10.10.20.15:8081/actuator/metrics/executor.active | python3 -m json.tool

{
"name": "executor.active",
"measurements": [
{
"statistic": "VALUE",
"value": 200.0
}
]
}

200 active threads against a pool maximum of 200 means the pool is at 100% utilization. Correlate with Nginx's stub_status to confirm connection queuing on the Nginx side:

curl -s http://127.0.0.1/nginx_status

Active connections: 847
server accepts handled requests
234123 234123 891234
Reading: 0 Writing: 483 Waiting: 364High Writing count (483) combined with a saturated thread pool upstream confirms requests are piling up waiting for workers.
How to Fix It
Increase the Tomcat thread pool in Spring Boot's application.properties:
server.tomcat.threads.max=400
server.tomcat.threads.min-spare=20
server.tomcat.accept-count=200

Add horizontal scaling by deploying a second backend node and using least-connection load balancing so Nginx distributes to the least-busy upstream:
upstream backend_pool {
least_conn;
server 10.10.20.15:8080 weight=1;
server 10.10.20.16:8080 weight=1;
keepalive 32;
}

Root Cause 7: File Descriptor Exhaustion on the Nginx Process
Why It Happens
Linux imposes per-process and system-wide limits on open file descriptors. Each active connection — both from clients to Nginx and from Nginx to each upstream — consumes one file descriptor. When an Nginx worker process hits its nofile limit, it cannot open new sockets to upstream servers. New requests fail to establish upstream connections and time out with 504. This typically appears suddenly during traffic spikes on systems where the default limit of 1,024 was never raised.
How to Identify It
cat /proc/$(pgrep -f 'nginx: worker' | head -1)/limits | grep 'open files'

Max open files            1024                 1024                 files

ls /proc/$(pgrep -f 'nginx: worker' | head -1)/fd | wc -l

1021

At 1,021 of 1,024, the process is three file descriptors from exhaustion. Nginx's error log confirms the problem:
2026/04/05 14:33:01 [alert] 12483#12483: *8500 socket() failed (24: Too many open files)
while connecting to upstream, client: 10.10.10.42, server: solvethenetwork.com

How to Fix It
Raise the file descriptor limit in the Nginx configuration and system limits simultaneously:
# /etc/nginx/nginx.conf
worker_rlimit_nofile 65535;
events {
worker_connections 16384;
use epoll;
multi_accept on;
}

# /etc/security/limits.d/nginx.conf
nginx soft nofile 65535
nginx hard nofile 65535

nginx -t && systemctl reload nginx

Prevention
Reactive troubleshooting is costly in engineer time and user trust. The following practices eliminate most 504 incidents before they surface.
- Set layered timeouts consistently. Every layer — load balancer, Nginx, application server, database connection pool — should have its timeout set so that inner layers always timeout before outer layers. Nginx should wait slightly longer than the application's own timeout so the backend can return a meaningful 500 error rather than being abandoned silently with a 504.
- Monitor $upstream_response_time as a latency metric. Ship this field from Nginx access logs to your observability stack. Alert when the 95th percentile exceeds 50% of your configured proxy_read_timeout — this gives a runway to investigate before 504s begin occurring at scale.
- Enable slow query logs permanently. Set log_min_duration_statement = 500 in PostgreSQL and long_query_time = 0.5 in MySQL. The performance overhead is negligible and the visibility is invaluable. Review weekly using pgBadger or pt-query-digest.
- Capacity-plan worker pools proactively. Calculate maximum workers as floor(available_RAM / per_worker_RAM). Monitor the ratio of idle to total workers. Alert and scale before saturation exceeds 80% to preserve headroom for traffic bursts.
- Configure upstream health checks. Mark backends as unavailable before Nginx wastes requests on a degraded host. Using nginx_upstream_check_module:
upstream backend_pool {
server 10.10.20.15:8080;
server 10.10.20.16:8080;
check interval=3000 rise=2 fall=3 timeout=2000 type=http;
check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
check_http_expect_alive http_2xx;
}

- Move long-running operations to async job queues. Any operation that could exceed 10 seconds should be processed by a background worker (Celery, Sidekiq, BullMQ) with a polling or webhook completion mechanism. This decouples user-facing HTTP latency from backend processing time entirely.
- Tune Linux networking parameters. Ensure net.core.somaxconn and net.ipv4.tcp_max_syn_backlog are sized for peak traffic to prevent connection queue drops at the kernel level:
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sysctl -w net.ipv4.tcp_fin_timeout=15
# Persist in /etc/sysctl.d/99-nginx-tuning.conf

- Implement circuit breakers in the application layer. Libraries like Resilience4j (Java) or pybreaker (Python) prevent cascade failures where one slow downstream dependency causes all threads to pile up and time out simultaneously, multiplying the blast radius of a single slow service.
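The capacity-planning rule from the list above — floor(available_RAM / per_worker_RAM) plus an 80% saturation alert — can be sketched as a pair of helpers (the RAM figures in the examples are illustrative):

```python
def max_workers(available_ram_mb, per_worker_mb, reserve_mb=0):
    """floor(available_RAM / per_worker_RAM), optionally holding back RAM
    for the OS and page cache."""
    return max((available_ram_mb - reserve_mb) // per_worker_mb, 0)


def needs_scaling(active, total, threshold=0.8):
    """Alert once worker saturation exceeds the 80% headroom threshold."""
    return total > 0 and active / total > threshold
```

For example, a host with 4 GB available and 60 MB per worker supports 68 workers, or 51 if 1 GB is reserved for the OS; 17 of 20 workers busy (85%) crosses the 80% alert line.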
Frequently Asked Questions
Q: What is the difference between a 502 and a 504 in Nginx?
A: A 502 Bad Gateway means Nginx received an invalid or unexpected response from the upstream — the upstream responded, but what it sent was not a valid HTTP response. A 504 Gateway Timeout means Nginx received no response at all within the configured timeout window. In practice: 502 typically points to an application crash, misconfiguration, or upstream returning garbage data; 504 points to the upstream being too slow or unreachable.
Q: How do I tell if the 504 is Nginx's timeout or the upstream's own timeout?
A: Check the exact elapsed time in the Nginx access log field $upstream_response_time. If it equals your configured proxy_read_timeout value to within a fraction of a second (e.g., exactly 60.00 seconds), Nginx fired its timeout. If the upstream responds with its own error (a 500 or 503 with an error body), that appears as a different status code. A 504 in Nginx's access log always means Nginx was waiting and gave up — the upstream never responded.
Q: Does increasing proxy_read_timeout actually fix the root problem?
A: No. Increasing the timeout suppresses the 504 symptom and prevents users from seeing the error, but it does not address why the upstream is slow. It is a valid short-term measure while you investigate, but the real fix must address the root cause — missing database indexes, insufficient worker capacity, or unoptimized application code. A timeout increase without a root cause fix will eventually fail again when load increases further.
Q: My 504s only happen under high traffic load. What should I check first?
A: Load-dependent 504s almost always indicate worker pool exhaustion or database lock contention. Check the upstream's active worker count at peak time, the database's pg_stat_activity (PostgreSQL) or SHOW PROCESSLIST (MySQL) for blocked or long-running queries, and Nginx's stub_status for high Writing counts. Also verify the upstream listen backlog is not overflowing using the PHP-FPM status endpoint or equivalent.
Q: How do I check Nginx's effective timeout configuration without restarting it?
A: Run nginx -T 2>/dev/null | grep timeout to dump the compiled configuration including all included files, showing exactly what values are active right now. This is safe and non-disruptive. You can also run nginx -t to validate syntax without a full dump.
Q: Can a 504 be caused by DNS resolution failure for the upstream?
A: Yes. If Nginx uses a hostname in the proxy_pass directive with a resolver directive configured, a slow or failing DNS resolution consumes time before the TCP connection even begins. If DNS resolution is slow enough, the combined time exceeds proxy_read_timeout and produces a 504. Always use stable RFC 1918 IP addresses for upstream servers in production, or configure a local caching resolver to minimize DNS latency.
Q: Why does my upstream show only 20% CPU during a 504 incident — shouldn't it be maxed out?
A: Low CPU during 504s is the classic signature of I/O-bound blocking — typically database wait or network I/O to a downstream service. Application threads are idle, blocked waiting for a query result or an external API response. They consume no CPU but they occupy worker slots, preventing new requests from being processed. Check pg_stat_activity for long-running queries and iostat -x 1 for disk saturation rather than looking at CPU utilization.
Q: How do I safely test timeout changes on a production Nginx without dropping connections?
A: Use nginx -t to validate the configuration first, then systemctl reload nginx (or nginx -s reload). This performs a graceful reload — existing connections continue on old worker processes while new connections pick up the updated configuration immediately. No connections are dropped and no downtime occurs. Always validate on a staging server first and keep a known-good configuration backup ready for rollback.
Q: Should I set proxy_read_timeout globally or per location block?
A: Per location block is strongly preferred. A globally relaxed timeout means even simple health-check or quick-lookup endpoints will wait for the full timeout value before failing, degrading user experience during partial outages and masking problems. Reserve long timeouts only for locations with legitimate long-running operations such as report exports or file processing, and keep tight timeouts on all standard API and page routes.
Q: How can I alert on 504 errors proactively before users report them?
A: Configure your log shipper (Filebeat, Promtail, or Fluentd) to parse the Nginx access log and export a nginx_http_504_total counter metric. Set a rate alert at more than five 504s per minute above your normal baseline. More powerfully, set a leading-indicator alert on the 95th percentile of upstream_response_time exceeding 70–80% of your proxy_read_timeout value — this fires before 504s begin and gives you time to act.
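The leading-indicator threshold is straightforward to compute from shipped upstream_response_time samples. A sketch using a nearest-rank p95 — the 60-second proxy_read_timeout and the 80% ratio are assumptions matching this article's examples:

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; dependency-free and adequate for alerting."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]


def latency_alert(urt_samples, read_timeout=60.0, ratio=0.8):
    """Fire before 504s do: p95 upstream time above a fraction of proxy_read_timeout."""
    return percentile(urt_samples, 95) > read_timeout * ratio
```

With a 60-second timeout and an 80% ratio, the alert fires once the p95 of sampled upstream times crosses 48 seconds — while requests are still completing successfully.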
Q: Can Nginx serve cached responses during a 504 condition to protect users?
A: Yes, for cacheable GET responses. Configure proxy_cache with the proxy_cache_use_stale error timeout updating directive to serve stale cached content when the upstream times out or errors. This is a resilience pattern that protects the user experience during brief upstream degradation while the root cause is addressed. It only works for previously cached, cacheable responses and is not a substitute for fixing the underlying performance problem.
Q: What is the difference between proxy_connect_timeout and proxy_read_timeout?
A: proxy_connect_timeout controls how long Nginx waits to establish the TCP connection to the upstream — the three-way handshake. This should be short (5–15 seconds), because TCP failing to connect means the upstream host is unreachable or the port is not listening, not that it is slow. proxy_read_timeout controls how long Nginx waits for response data after the connection is established. Nearly all 504 errors come from proxy_read_timeout expiring. A 504 with very short elapsed time (under 15 seconds) may point to proxy_connect_timeout firing due to a firewall or routing issue.
