InfraRunBook

    Nginx 504 Gateway Timeout Troubleshooting

    Nginx
    Published: Apr 4, 2026
    Updated: Apr 4, 2026

    A complete guide to diagnosing and fixing Nginx 504 Gateway Timeout errors, covering upstream slowness, misconfigured timeouts, backend bottlenecks, slow database queries, and network latency with real CLI commands and config examples.


    Symptoms

    A 504 Gateway Timeout is one of the most disruptive errors an Nginx reverse proxy can return. When it strikes, users see a browser page reading "504 Gateway Time-out" or an API client receives HTTP 504 with a minimal response body. In Nginx's error.log, the tell-tale signature looks like this:

    2026/04/05 14:32:17 [error] 12483#12483: *8421 upstream timed out (110: Connection timed out)
    while reading response header from upstream, client: 10.10.10.42,
    server: solvethenetwork.com, request: "POST /api/v2/reports HTTP/1.1",
    upstream: "http://10.10.20.15:8080/api/v2/reports",
    host: "solvethenetwork.com"

    Common observable symptoms include:

    • Browsers display a white page with "504 Gateway Time-out" after 60 seconds of spinning
    • API clients receive HTTP 504 with empty or minimal response bodies
    • Nginx access.log shows status code 504 and a request time near or exceeding the configured timeout value (see the quick triage check after this list)
    • Upstream application logs show requests that never completed or were abandoned mid-flight
    • Monitoring dashboards spike on 5xx error rates while upstream CPU may appear normal
    • Only certain routes or large payload sizes trigger the timeout, pointing to specific backend operations
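    Before digging into a specific root cause, a quick triage from the Nginx host shows how widespread the problem is. A minimal sketch, assuming the default log paths and a log format that keeps the request URI in the seventh field:

    # How many 504s have been served to clients in the current access log
    grep -c ' 504 ' /var/log/nginx/access.log

    # How many upstream read timeouts Nginx itself has recorded
    grep -c 'upstream timed out' /var/log/nginx/error.log

    # Which routes are affected most often (field 7 is the request URI)
    grep ' 504 ' /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head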

    Root Cause 1: Upstream Application Too Slow

    Why It Happens

    Nginx acts as a reverse proxy and waits a finite amount of time for the upstream server to start sending a response. If the backend application — whether Node.js, Python/uWSGI, PHP-FPM, or a Java service — takes longer than the configured timeout to return even the first byte of a response header, Nginx closes the connection and issues a 504. This is the most common root cause and is typically triggered by expensive operations such as large file processing, report generation, or bulk data exports that legitimately require more time than the default 60-second window allows.

    How to Identify It

    Check Nginx's $request_time and $upstream_response_time in the access log. First, confirm your log format captures these fields:

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" rt=$request_time urt=$upstream_response_time';

    Then grep for 504s and examine the upstream response time:

    grep ' 504 ' /var/log/nginx/access.log | awk '{print $NF, $(NF-1)}' | sort -t= -k2 -n | tail -20

    Sample output showing upstream response time clustering at the 60-second timeout boundary:

    urt=60.002 rt=60.005
    urt=60.001 rt=60.004
    urt=59.998 rt=60.001

    This pattern — upstream response time consistently equal to the timeout value — confirms the upstream is not responding in time, not that it is erroring out.

    How to Fix It

    If the upstream genuinely needs more time for legitimate operations, increase proxy_read_timeout for that specific location rather than globally:

    location /api/v2/reports {
        proxy_pass         http://backend_pool;
        proxy_read_timeout 300s;   # allow up to 5 minutes for report generation
        proxy_connect_timeout 10s;
        proxy_send_timeout 60s;
    }

    For long-running jobs, the better architectural fix is to move processing to an async pattern — accept the request, return HTTP 202 Accepted with a job ID, and let the client poll a lightweight status endpoint. This decouples user-facing latency from processing time entirely.
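    A rough sketch of that flow from the client's perspective; the endpoint paths, payload file, and response fields here are illustrative, not part of any real API:

    # 1. Submit the long-running job; the backend answers immediately with 202 and a job ID
    curl -si -X POST https://solvethenetwork.com/api/v2/reports -d @report_params.json
    #    -> HTTP/1.1 202 Accepted, body: {"job_id": "<id>"}

    # 2. Poll a lightweight status endpoint until the job completes
    curl -s https://solvethenetwork.com/api/v2/reports/<id>/status
    #    -> {"status": "running"} ... later {"status": "done", "result_url": "..."}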


    Root Cause 2: proxy_read_timeout Set Too Low

    Why It Happens

    proxy_read_timeout defines how long Nginx waits between successive read operations from the upstream — not the total response time for the entire transaction. The Nginx default is 60 seconds. In many installations this default is never revisited, meaning any upstream operation legitimately exceeding one minute will always produce a 504, even when the server is functioning correctly. This is especially problematic after application deployments that introduce heavier processing, or when a database grows to the point where previously fast queries now take longer.

    How to Identify It

    Verify the currently active timeout values using nginx -T to dump the full compiled configuration including all included files:

    nginx -T 2>/dev/null | grep -E 'proxy_read_timeout|proxy_connect_timeout|proxy_send_timeout'

    Expected output showing defaults that may be too conservative for your workload:

    proxy_read_timeout 60s;
    proxy_connect_timeout 60s;
    proxy_send_timeout 60s;

    Cross-reference with the upstream application's own timeout configuration. For a uWSGI backend on sw-infrarunbook-01:

    grep -E 'harakiri|socket-timeout' /etc/uwsgi/apps-enabled/solvethenetwork.ini
    harakiri = 120
    socket-timeout = 30

    Here the application allows 120 seconds of processing before self-terminating, but Nginx cuts the connection after 60 seconds — a clear mismatch that will produce 504s for all requests taking 60–120 seconds.
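    You can reproduce the mismatch window directly from the Nginx host; a minimal check, assuming a route and payload you know take between 60 and 120 seconds to process:

    # A request in the 60-120 second window returns 504 with time_total pinned near 60s,
    # even though uWSGI would have finished it within its own harakiri limit
    curl -o /dev/null -sS -w 'code=%{http_code} total=%{time_total}s\n' \
         -X POST https://solvethenetwork.com/api/v2/reports -d @report_params.json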

    How to Fix It

    Align Nginx timeouts with the upstream application's timeout chain, ensuring Nginx always waits slightly longer than the backend's own maximum processing time so the application can return a meaningful 500 error rather than being abandoned mid-processing:

    # /etc/nginx/conf.d/proxy_timeouts.conf
    proxy_connect_timeout  10s;
    proxy_send_timeout     120s;
    proxy_read_timeout     130s;   # backend harakiri=120, plus 10s buffer

    After editing, validate and reload without downtime:

    nginx -t && systemctl reload nginx
    nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
    nginx: configuration file /etc/nginx/nginx.conf test is successful

    Apply granular per-location timeouts to avoid globally relaxing timeouts for fast endpoints that should fail quickly:

    location /api/v2/export {
        proxy_pass         http://backend_pool;
        proxy_read_timeout 300s;
    }
    
    location /api/ {
        proxy_pass         http://backend_pool;
        proxy_read_timeout 30s;   # tight timeout for standard API calls
    }

    Root Cause 3: Backend Processing Bottleneck

    Why It Happens

    The upstream application itself becomes the bottleneck when it exhausts its worker pool, runs out of file descriptors, hits memory limits, or suffers from application-level contention such as thread locks or queue saturation. Under these conditions the backend accepts the TCP connection from Nginx — so proxy_connect_timeout passes without issue — but never processes the request within the read timeout window. This is distinct from the upstream being slow on a single request. It means the backend is overwhelmed and cannot service new requests promptly regardless of their individual complexity.
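    From the proxy's side this state has a recognizable shape: the TCP connect succeeds almost instantly while the response never arrives. A quick probe from the Nginx host (the path is illustrative):

    # connect= stays in the low milliseconds, but curl hits --max-time (exit code 28,
    # code=000) because no worker ever picks the request up
    curl -o /dev/null -sS --connect-timeout 5 --max-time 70 \
         -w 'connect=%{time_connect}s start_transfer=%{time_starttransfer}s code=%{http_code}\n' \
         http://10.10.20.15:8080/api/v2/reports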

    How to Identify It

    On sw-infrarunbook-01, check the upstream application process state. For a PHP-FPM pool, query the status endpoint (this requires pm.status_path to be enabled in the pool configuration):

    curl -s 'http://10.10.20.15:9000/status?full' | head -40
    pool:                 www
    process manager:      dynamic
    start time:           05/Apr/2026:12:00:01 +0000
    accepted conn:        142381
    listen queue:         47
    max listen queue:     128
    listen queue len:     128
    idle processes:       0
    active processes:     20
    total processes:      20
    max active processes: 20
    max children reached: 1

    A listen queue of 47 combined with idle processes: 0 and max children reached: 1 confirms the pool is fully saturated. All workers are busy and new requests queue up behind them. When the queue overflows its backlog limit, Nginx gets no response in time and 504s are the result.

    For a Node.js or Java service, check active established connections to the upstream port:

    ss -tnp state established '( dport = :8080 or sport = :8080 )' | wc -l
    487

    Several hundred established connections piled up against a backend whose worker or thread pool is a fraction of that size is the same saturation signal: connections are accepted, then sit waiting for a free worker.

    How to Fix It

    Increase the PHP-FPM worker pool size. Each worker uses roughly 30–60 MB of RAM, so tune to available memory:

    # /etc/php/8.2/fpm/pool.d/www.conf
    pm = dynamic
    pm.max_children      = 50
    pm.start_servers     = 10
    pm.min_spare_servers = 5
    pm.max_spare_servers = 20
    pm.max_requests      = 500
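    To ground pm.max_children in measured numbers rather than the 30-60 MB rule of thumb, check the pool's actual per-worker memory on the host; a sketch, assuming the worker processes are named php-fpm8.2 (the name varies by PHP version):

    # Average resident memory per PHP-FPM worker, in MB
    ps -C php-fpm8.2 -o rss= | awk '{sum += $1; n++} END {if (n) printf "%.0f MB per worker\n", sum/n/1024}'

    # Memory the kernel reports as available, in MB; divide by the figure above for a ceiling
    free -m | awk '/^Mem:/ {print $7 " MB available"}'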

    For Node.js, leverage the cluster module via PM2 to spawn workers matching CPU core count:

    pm2 start /srv/solvethenetwork/api/server.js -i max --name api-workers

    Configure Nginx upstream keepalive connections to reduce TCP handshake overhead between Nginx and the backend pool, increasing effective throughput:

    upstream backend_pool {
        server 10.10.20.15:8080;
        server 10.10.20.16:8080;
        keepalive 64;
    }
    
    location /api/ {
        proxy_pass              http://backend_pool;
        proxy_http_version      1.1;
        proxy_set_header        Connection "";
    }

    Root Cause 4: Slow Database Queries

    Why It Happens

    A large share of production 504 incidents trace back to the database layer. A query that ran in milliseconds when a table held 10,000 rows becomes catastrophically slow at 50 million rows without proper indexing. Missing indexes, N+1 query patterns, table-level locks, autovacuum running during peak traffic in PostgreSQL, or full table scans on unindexed filter columns all translate to the backend thread blocking on a database response. The application itself is not CPU-bound — it is I/O-bound, waiting for the database, which in turn causes the backend to miss Nginx's proxy_read_timeout.

    How to Identify It

    Enable the PostgreSQL slow query log on sw-infrarunbook-01 to capture queries exceeding one second:

    # /etc/postgresql/15/main/postgresql.conf
    log_min_duration_statement = 1000   # log queries taking > 1 second
    log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
    systemctl reload postgresql

    Watch the log during a 504 incident in real time:

    tail -f /var/log/postgresql/postgresql-15-main.log | grep -E 'duration|ERROR'
    2026-04-05 14:32:01 UTC [8821]: [3-1] user=infrarunbook-admin,db=solvethenetwork_prod,
    app=uwsgi,client=10.10.20.15 LOG:  duration: 58432.741 ms
    statement: SELECT * FROM audit_events ae
               JOIN users u ON u.id = ae.user_id
               WHERE ae.created_at > '2026-01-01'
               ORDER BY ae.created_at DESC;

    At 58 seconds for a single query, this is the direct cause of the 504. Use EXPLAIN ANALYZE to expose the execution plan:

    psql -U infrarunbook-admin -d solvethenetwork_prod -c "
    EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
    SELECT * FROM audit_events ae
    JOIN users u ON u.id = ae.user_id
    WHERE ae.created_at > '2026-01-01'
    ORDER BY ae.created_at DESC
    LIMIT 1000;"
    Seq Scan on audit_events ae  (cost=0.00..982341.22 rows=4812341 width=312)
                                  (actual time=0.042..57891.340 rows=4812341 loops=1)
      Filter: (created_at > '2026-01-01 00:00:00+00'::timestamptz)
      Rows Removed by Filter: 123456
      Buffers: shared hit=12 read=481234
    Planning Time: 2.341 ms
    Execution Time: 58021.482 ms

    A sequential scan across 4.8 million rows with 481,234 disk block reads is the bottleneck. No index exists on created_at.

    How to Fix It

    Add the missing index concurrently to avoid a full table lock during index creation on a live system:

    psql -U infrarunbook-admin -d solvethenetwork_prod -c "
    CREATE INDEX CONCURRENTLY idx_audit_events_created_at
    ON audit_events (created_at DESC);"
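    On PostgreSQL 12 and later you can watch the concurrent build from a second session through the pg_stat_progress_create_index view; a minimal check:

    psql -U infrarunbook-admin -d solvethenetwork_prod -c "
    SELECT phase, blocks_done, blocks_total
    FROM pg_stat_progress_create_index;"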

    Re-run EXPLAIN ANALYZE to confirm the plan switched from a sequential scan to an index scan:

    Index Scan using idx_audit_events_created_at on audit_events ae
      (cost=0.56..15234.22 rows=1000 width=312)
      (actual time=0.182..14.841 rows=1000 loops=1)
    Execution Time: 15.291 ms

    Query time dropped from 58 seconds to 15 milliseconds. For MySQL/MariaDB environments, enable and analyze the slow query log similarly:

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;
    SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
    mysqldumpslow -s t -t 10 /var/log/mysql/slow.log

    Root Cause 5: Network Latency Between Nginx and Upstream

    Why It Happens

    Even within an internal RFC 1918 network, network latency can trigger 504 errors. Causes include misconfigured MTU settings leading to packet fragmentation and retransmission, saturated switch uplinks, spanning-tree topology recalculations, NIC duplex mismatches causing half-duplex collisions, or firewall stateful connection table exhaustion dropping mid-session packets. When packets are dropped or significantly delayed at the network layer between Nginx on sw-infrarunbook-01 and the upstream at 10.10.20.15, the TCP stream stalls and Nginx's read timeout fires before the backend's response arrives — even when the backend itself processed the request quickly.

    How to Identify It

    Begin with latency and packet loss measurement between the Nginx host and the upstream during a degraded period:

    ping -c 100 -i 0.2 10.10.20.15
    --- 10.10.20.15 ping statistics ---
    100 packets transmitted, 100 received, 0% packet loss, time 19837ms
    rtt min/avg/max/mdev = 0.182/4.832/48.291/9.341 ms

    An average of 4.8 ms on a LAN (expected: sub-1 ms) and a max spike to 48 ms indicate serious intermittent network issues. Use mtr for per-hop path analysis to isolate the faulty segment:

    mtr --report --report-cycles 60 10.10.20.15
    Host                     Loss%   Snt   Last   Avg  Best  Wrst StDev
    1. 10.10.20.1             0.0%    60    0.3   0.4   0.2   1.1   0.2
    2. 10.10.10.254           3.3%    60    1.2  22.1   0.8  48.3  14.2
    3. 10.10.20.15            3.3%    60    1.4  22.4   0.9  48.8  14.3

    The 3.3% packet loss appearing at hop 2 (core switch 10.10.10.254) and persisting through to the final hop is the smoking gun. Loss at an intermediate hop alone can simply be ICMP rate-limiting on that device's control plane, but loss that carries through to the destination is real, pointing to a misbehaving or saturated switch in the path. Check interface statistics on sw-infrarunbook-01 for hardware-level symptoms:

    ip -s link show eth0
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
        RX: bytes      packets    errors  dropped  overrun  mcast
        18432984123    14234123   0       2341     0        0
        TX: bytes      packets    errors  dropped  carrier  collisions
        9823412312     8234123    0       0        0        0

    The RX dropped counter of 2,341 indicates receive buffer overflow — packets arriving faster than the kernel can process them. Also test for MTU fragmentation issues:

    ping -M do -s 1472 10.10.20.15
    ping: local error: message too long, mtu=1450

    The path MTU is 1,450 bytes but the interface is configured for 1,500 — causing fragmentation or drops for large response payloads.

    How to Fix It

    Correct the MTU mismatch by aligning the interface to the path MTU:

    ip link set eth0 mtu 1450
    # Persist by setting MTUBytes=1450 in the [Link] section of /etc/systemd/network/10-eth0.network

    Increase the NIC receive ring buffer to handle traffic bursts without dropping packets:

    ethtool -G eth0 rx 4096
    # Persist via /etc/udev/rules.d/60-net-buffers.rules

    For the core switch packet loss, escalate to the network team with the full mtr report. As an Nginx-level mitigation during the network repair, configure upstream retry logic so transient packet drops trigger a retry to a second upstream node rather than immediately returning 504 to the client (note that since Nginx 1.9.13, non-idempotent methods such as POST are not retried once the request has been sent to an upstream unless non_idempotent is added to proxy_next_upstream):

    upstream backend_pool {
        server 10.10.20.15:8080;
        server 10.10.20.16:8080;
    }
    
    location /api/ {
        proxy_pass                  http://backend_pool;
        proxy_next_upstream         error timeout http_502 http_503;
        proxy_next_upstream_tries   2;
        proxy_next_upstream_timeout 10s;
    }

    Root Cause 6: Upstream Worker Thread Exhaustion

    Why It Happens

    Every upstream application framework has a hard limit on concurrent workers or threads. When all workers are occupied — processing slow requests, waiting on database queries, or blocked on downstream service calls — new incoming requests from Nginx are placed in a listen queue. If Nginx's proxy_read_timeout expires before a worker becomes free to service the queued request, a 504 results. This is distinct from individual request slowness: each request would be fast if a worker were available, but none are.

    How to Identify It

    For a Spring Boot application exposing the Actuator endpoint, check the active thread count against the pool maximum:

    curl -s http://10.10.20.15:8081/actuator/metrics/executor.active | python3 -m json.tool
    {
      "name": "executor.active",
      "measurements": [
        {
          "statistic": "VALUE",
          "value": 200.0
        }
      ]
    }
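    If Tomcat's own server metrics are exposed through the same Actuator (this assumes Micrometer's built-in Tomcat binder is active, which is not the case in every setup), the configured ceiling can be read next to the busy count:

    curl -s http://10.10.20.15:8081/actuator/metrics/tomcat.threads.config.max | python3 -m json.tool
    curl -s http://10.10.20.15:8081/actuator/metrics/tomcat.threads.busy | python3 -m json.tool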

    200 active threads against a pool maximum of 200 means the pool is at 100% utilization. Correlate with Nginx stub_status to confirm connection queuing on the Nginx side:

    curl -s http://127.0.0.1/nginx_status
    Active connections: 847
    server accepts handled requests
     234123 234123 891234
    Reading: 0 Writing: 483 Waiting: 364

    High Writing count (483) combined with a saturated thread pool upstream confirms requests are piling up waiting for workers.

    How to Fix It

    Increase the Tomcat thread pool in Spring Boot's application.properties:

    server.tomcat.threads.max=400
    server.tomcat.threads.min-spare=20
    server.tomcat.accept-count=200

    Add horizontal scaling by deploying a second backend node and using least-connection load balancing so Nginx distributes to the least-busy upstream:

    upstream backend_pool {
        least_conn;
        server 10.10.20.15:8080 weight=1;
        server 10.10.20.16:8080 weight=1;
        keepalive 32;
    }

    Root Cause 7: File Descriptor Exhaustion on the Nginx Process

    Why It Happens

    Linux imposes per-process and system-wide limits on open file descriptors. Each active connection — both from clients to Nginx and from Nginx to each upstream — consumes one file descriptor. When an Nginx worker process hits its nofile limit, it cannot open new sockets to upstream servers. New requests fail to establish upstream connections and time out with 504. This typically appears suddenly during traffic spikes on systems where the default limit of 1,024 was never raised.

    How to Identify It

    cat /proc/$(pgrep -f 'nginx: worker' | head -1)/limits | grep 'open files'
    Max open files            1024                 1024                 files
    ls /proc/$(pgrep -f 'nginx: worker' | head -1)/fd | wc -l
    1021

    At 1,021 of 1,024, the process is three file descriptors from exhaustion. Nginx's error log confirms the problem:

    2026/04/05 14:33:01 [alert] 12483#12483: *8500 socket() failed (24: Too many open files)
    while connecting to upstream, client: 10.10.10.42, server: solvethenetwork.com

    How to Fix It

    Raise the file descriptor limit in the Nginx configuration and system limits simultaneously:

    # /etc/nginx/nginx.conf
    worker_rlimit_nofile 65535;
    events {
        worker_connections 16384;
        use epoll;
        multi_accept on;
    }
    # /etc/security/limits.d/nginx.conf (applies to PAM login sessions; for a
    # systemd-managed Nginx, also set LimitNOFILE=65535 in a unit drop-in)
    nginx soft nofile 65535
    nginx hard nofile 65535
    nginx -t && systemctl reload nginx
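    After the reload, confirm the raised limit actually applied to the re-spawned worker processes:

    cat /proc/$(pgrep -f 'nginx: worker' | head -1)/limits | grep 'open files'
    # both the soft and hard limit should now read 65535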

    Prevention

    Reactive troubleshooting is costly in engineer time and user trust. The following practices eliminate most 504 incidents before they surface.

    • Set layered timeouts consistently. Every layer — load balancer, Nginx, application server, database connection pool — should have its timeout set so that inner layers always time out before outer layers. Nginx should wait slightly longer than the application's own timeout so the backend can return a meaningful 500 error rather than being abandoned silently with a 504.
    • Monitor $upstream_response_time as a latency metric. Ship this field from Nginx access logs to your observability stack. Alert when the 95th percentile exceeds 50% of your configured proxy_read_timeout — this gives a runway to investigate before 504s begin occurring at scale. (A quick way to compute this percentile straight from the log appears after this list.)
    • Enable slow query logs permanently. Set log_min_duration_statement = 500 in PostgreSQL and long_query_time = 0.5 in MySQL. The performance overhead is negligible and the visibility is invaluable. Review weekly using pgBadger or pt-query-digest.
    • Capacity-plan worker pools proactively. Calculate maximum workers as floor(available_RAM / per_worker_RAM). Monitor the ratio of idle to total workers. Alert and scale before saturation exceeds 80% to preserve headroom for traffic bursts.
    • Configure upstream health checks. Mark backends as unavailable before Nginx wastes requests on a degraded host. Using the third-party nginx_upstream_check_module (stock open-source Nginx only performs passive checks):
    upstream backend_pool {
        server 10.10.20.15:8080;
        server 10.10.20.16:8080;
        check interval=3000 rise=2 fall=3 timeout=2000 type=http;
        check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
        check_http_expect_alive http_2xx;
    }
    • Move long-running operations to async job queues. Any operation that could exceed 10 seconds should be processed by a background worker (Celery, Sidekiq, BullMQ) with a polling or webhook completion mechanism. This decouples user-facing HTTP latency from backend processing time entirely.
    • Tune Linux networking parameters. Ensure net.core.somaxconn and net.ipv4.tcp_max_syn_backlog are sized for peak traffic to prevent connection queue drops at the kernel level:
    sysctl -w net.core.somaxconn=65535
    sysctl -w net.ipv4.tcp_max_syn_backlog=65535
    sysctl -w net.ipv4.tcp_fin_timeout=15
    # Persist in /etc/sysctl.d/99-nginx-tuning.conf
    • Implement circuit breakers in the application layer. Libraries like Resilience4j (Java) or pybreaker (Python) prevent cascade failures where one slow downstream dependency causes all threads to pile up and time out simultaneously, multiplying the blast radius of a single slow service.
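    For the $upstream_response_time percentile alert recommended above, a rough offline approximation computed straight from the access log (assuming the log_format shown earlier, which appends the urt= field) looks like this:

    # Extract upstream response times, sort them, and print the value at the 95th percentile
    grep -oE 'urt=[0-9.]+' /var/log/nginx/access.log | cut -d= -f2 | sort -n \
      | awk '{v[NR] = $1} END {if (NR) print "p95 upstream_response_time:", v[int(NR * 0.95)], "s"}'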

    Frequently Asked Questions

    Q: What is the difference between a 502 and a 504 in Nginx?

    A: A 502 Bad Gateway means Nginx received an invalid or unexpected response from the upstream — the upstream responded, but what it sent was not a valid HTTP response. A 504 Gateway Timeout means Nginx received no response at all within the configured timeout window. In practice: 502 typically points to an application crash, misconfiguration, or upstream returning garbage data; 504 points to the upstream being too slow or unreachable.

    Q: How do I tell if the 504 is Nginx's timeout or the upstream's own timeout?

    A: Check the exact elapsed time in the Nginx access log field $upstream_response_time. If it equals your configured proxy_read_timeout value to within a fraction of a second (e.g., exactly 60.00 seconds), Nginx fired its timeout. If the upstream responds with its own error (a 500 or 503 with an error body), that appears as a different status code. A 504 in Nginx's access log always means Nginx was waiting and gave up — the upstream never responded.

    Q: Does increasing proxy_read_timeout actually fix the root problem?

    A: No. Increasing the timeout suppresses the 504 symptom and prevents users from seeing the error, but it does not address why the upstream is slow. It is a valid short-term measure while you investigate, but the real fix must address the root cause — missing database indexes, insufficient worker capacity, or unoptimized application code. A timeout increase without a root cause fix will eventually fail again when load increases further.

    Q: My 504s only happen under high traffic load. What should I check first?

    A: Load-dependent 504s almost always indicate worker pool exhaustion or database lock contention. Check the upstream's active worker count at peak time, the database's pg_stat_activity (PostgreSQL) or SHOW PROCESSLIST (MySQL) for blocked or long-running queries, and Nginx's stub_status for high Writing counts. Also verify the upstream listen backlog is not overflowing using the PHP-FPM status endpoint or equivalent.
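    A starting-point query for the PostgreSQL check, as a sketch (adjust the credentials and thresholds to your environment):

    psql -U infrarunbook-admin -d solvethenetwork_prod -c "
    SELECT pid, now() - query_start AS runtime, state, wait_event_type, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC
    LIMIT 10;"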

    Q: How do I check Nginx's effective timeout configuration without restarting it?

    A: Run nginx -T 2>/dev/null | grep timeout to dump the compiled configuration including all included files, showing exactly what values are active right now. This is safe and non-disruptive. You can also run nginx -t to validate syntax without a full dump.

    Q: Can a 504 be caused by DNS resolution failure for the upstream?

    A: Yes. If Nginx uses a hostname in the proxy_pass directive with a resolver directive configured, a slow or failing DNS resolution consumes time before the TCP connection even begins. If DNS resolution is slow enough, the combined time exceeds proxy_read_timeout and produces a 504. Always use stable RFC 1918 IP addresses for upstream servers in production, or configure a local caching resolver to minimize DNS latency.

    Q: Why does my upstream show only 20% CPU during a 504 incident — shouldn't it be maxed out?

    A: Low CPU during 504s is the classic signature of I/O-bound blocking — typically database wait or network I/O to a downstream service. Application threads are idle, blocked waiting for a query result or an external API response. They consume no CPU but they occupy worker slots, preventing new requests from being processed. Check pg_stat_activity for long-running queries and iostat -x 1 for disk saturation rather than looking at CPU utilization.

    Q: How do I safely test timeout changes on a production Nginx without dropping connections?

    A: Use nginx -t to validate the configuration first, then systemctl reload nginx (or nginx -s reload). This performs a graceful reload — existing connections continue on old worker processes while new connections pick up the updated configuration immediately. No connections are dropped and no downtime occurs. Always validate on a staging server first and keep a known-good configuration backup ready for rollback.

    Q: Should I set proxy_read_timeout globally or per location block?

    A: Per location block is strongly preferred. A globally relaxed timeout means even simple health-check or quick-lookup endpoints will wait for the full timeout value before failing, degrading user experience during partial outages and masking problems. Reserve long timeouts only for locations with legitimate long-running operations such as report exports or file processing, and keep tight timeouts on all standard API and page routes.

    Q: How can I alert on 504 errors proactively before users report them?

    A: Configure your log shipper (Filebeat, Promtail, or Fluentd) to parse the Nginx access log and export a nginx_http_504_total counter metric. Set a rate alert at more than five 504s per minute above your normal baseline. More powerfully, set a leading-indicator alert on 95th percentile upstream_response_time exceeding 70–80% of your proxy_read_timeout value — this fires before 504s begin and gives you time to act.

    Q: Can Nginx serve cached responses during a 504 condition to protect users?

    A: Yes, for cacheable GET responses. Configure proxy_cache with the proxy_cache_use_stale error timeout updating directive to serve stale cached content when the upstream times out or errors. This is a resilience pattern that protects the user experience during brief upstream degradation while the root cause is addressed. It only works for previously cached, cacheable responses and is not a substitute for fixing the underlying performance problem.

    Q: What is the difference between proxy_connect_timeout and proxy_read_timeout?

    A: proxy_connect_timeout controls how long Nginx waits to establish the TCP connection to the upstream — the three-way handshake. This should be short (5–15 seconds) because TCP failing to connect means the upstream host is unreachable or the port is not listening, not that it is slow. proxy_read_timeout controls how long Nginx waits for response data after the connection is established. Nearly all 504 errors come from proxy_read_timeout expiring. A 504 with very short elapsed time (under 15 seconds) may point to a proxy_connect_timeout firing due to a firewall or routing issue.
