InfraRunBook

    Nginx Worker Process Crashing

    Nginx
    Published: Apr 15, 2026
    Updated: Apr 15, 2026

    Nginx worker processes can crash for a range of reasons from kernel-level segfaults to Lua panics and OOM kills. This guide walks through every major root cause with real commands, log signatures, and fixes.


    Symptoms

    You log into sw-infrarunbook-01 expecting a routine morning and instead you find Nginx half-dead. Some workers are gone, requests are intermittently 502-ing, and the master process keeps spinning up replacements that promptly die again. The access log trickles. The error log is screaming. systemd has already restarted the service twice this hour.

    The specific signals that something is wrong with a worker — not just a slow backend — tend to look like this:

    • 502 Bad Gateway errors appearing in bursts, often aligned with traffic spikes or specific request patterns
    • connect() failed (111: Connection refused) messages in /var/log/nginx/error.log
    • The master process still running but with fewer worker processes than expected — ps aux | grep nginx shows two workers where you normally see eight
    • Kernel logs in dmesg or /var/log/syslog showing segfault entries attributed to nginx
    • Monitoring alerts showing a spike in HTTP 5xx errors followed by a partial recovery — that recovery is the master restarting workers

    Before you start digging, pull these three things immediately:

    sudo tail -n 200 /var/log/nginx/error.log
    sudo dmesg | grep -i nginx | tail -40
    sudo journalctl -u nginx --since "1 hour ago" --no-pager

    What you find in those three outputs will usually point you at the right root cause within two minutes. Let's walk through each one.
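    That first classification pass can be scripted. The sketch below maps the signal number in "exited on signal N" error-log lines to the root causes covered in this guide; the log path is the stock Debian/Ubuntu location, so adjust it for your distro.

```shell
# Rough first-pass classifier for worker deaths, keyed on the signal number
# in "exited on signal N" lines. The mapping follows this guide's sections.
classify_signal() {
  case "$1" in
    11) echo "SIGSEGV: likely a module segfault (Root Cause 1)" ;;
    6)  echo "SIGABRT: likely a Lua panic (Root Cause 4)" ;;
    9)  echo "SIGKILL: check dmesg for the OOM killer (Root Cause 3)" ;;
    *)  echo "signal $1: check dmesg and journalctl" ;;
  esac
}

# Default Debian/Ubuntu path; adjust for your layout.
log=/var/log/nginx/error.log
if [ -r "$log" ]; then
  sig=$(grep -o 'exited on signal [0-9]*' "$log" | tail -1 | awk '{print $4}')
  if [ -n "$sig" ]; then
    classify_signal "$sig"
  fi
fi
```

    This is only a triage hint, not a diagnosis — the sections below confirm each case against dmesg and journald.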


    Root Cause 1: Segfault in an Nginx Module

    This is the crash mode that tends to cause the most panic because the error looks catastrophic. A worker process is killed by signal 11 (SIGSEGV) — a segmentation fault — meaning it tried to access memory it wasn't supposed to touch. In my experience, the most common trigger is a third-party module compiled against a different version of Nginx than what's actually running, or a module with a known memory corruption bug that only surfaces under specific request patterns.

    The kernel is usually the first to tell you. Check dmesg:

    sudo dmesg | grep nginx
    [1234567.890123] nginx[28341]: segfault at 7f3b2c000010 ip 00007f3b2c000010 sp 00007ffde9b3f8f0 error 14 in nginx[400000+89000]

    The error 14 here is a page fault flagged as a user-mode access violation. The ip (instruction pointer) value falling inside the nginx binary range is a strong hint that a module — not the Nginx core itself — is misbehaving. The error log confirms the worker death:

    2026/04/15 08:42:17 [alert] 28340#28340: worker process 28341 exited on signal 11
    2026/04/15 08:42:17 [notice] 28340#28340: start worker process 28350

    To identify which module is responsible, enable a core dump. Add to your systemd service override (sudo systemctl edit nginx):

    [Service]
    LimitCORE=infinity
    WorkingDirectory=/tmp

    Then set the core pattern and restart:

    echo "/tmp/core.%e.%p" | sudo tee /proc/sys/kernel/core_pattern
    sudo systemctl daemon-reload && sudo systemctl restart nginx

    Once you catch a core dump, run it through gdb:

    sudo gdb /usr/sbin/nginx /tmp/core.nginx.28341
    (gdb) bt full

    The backtrace will show exactly which function call chain led to the crash. If you see a frame from ngx_http_lua_module, ngx_http_modsecurity, or another third-party module, that's your culprit. Cross-reference your loaded modules with nginx -V 2>&1 | tr ' ' '\n' | grep module.

    The fix depends on the module. For a version mismatch, recompile the module against the exact Nginx version on sw-infrarunbook-01. For known bugs, check the module's issue tracker — there's usually a patch or a newer release. If you can't fix it immediately, disable the module directive in your config and reload:

    sudo nginx -t && sudo nginx -s reload

    Root Cause 2: File Descriptor Limit Exhaustion

    Nginx opens a file descriptor for every active connection, every upstream connection it proxies, and every file it serves directly. When you hit the OS limit, workers can't accept new connections. Depending on the code path that hits EMFILE, the worker may log the error and keep running, or it may die — either way, your service is broken.

    The error log signature is distinctive:

    2026/04/15 09:15:44 [crit] 28350#28350: *88921 open() "/var/www/solvethenetwork.com/assets/logo.png" failed (24: Too many open files), client: 10.10.1.45, server: solvethenetwork.com

    Error 24 is EMFILE. You can confirm the limit at the process level before things become critical:

    NGINX_PID=$(cat /run/nginx.pid)
    cat /proc/$NGINX_PID/limits | grep "open files"
    # Max open files            1024                 1024                 files
    
    ls /proc/$NGINX_PID/fd | wc -l
    # 1019
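    Those two checks can be combined into something alertable by computing utilization as a percentage. A sketch — the 80% threshold is an arbitrary choice for illustration, not an Nginx default:

```shell
# Integer percentage of FD utilization: used descriptors vs. the soft limit.
fd_pct() {
  echo $(( $1 * 100 / $2 ))
}

if [ -r /run/nginx.pid ]; then
  pid=$(cat /run/nginx.pid)
  # Field 4 of the "Max open files" row is the soft limit.
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  pct=$(fd_pct "$used" "$limit")
  echo "nginx ($pid): $used/$limit open files (${pct}%)"
  if [ "$pct" -ge 80 ]; then
    echo "WARNING: raise worker_rlimit_nofile before this reaches 100%"
  fi
fi
```

    Run it from cron or your metrics agent; with the 1019/1024 numbers above it would report 99% and fire the warning.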

    If the current FD count is close to the limit, you're about to hit the wall. The fix is a two-part operation. First, raise the limit in /etc/nginx/nginx.conf:

    worker_processes auto;
    worker_rlimit_nofile 65535;
    
    events {
        worker_connections 10240;
        use epoll;
        multi_accept on;
    }

    The worker_rlimit_nofile directive tells the master process to set the file descriptor limit when it forks each worker. But this only works if the master itself has permission to raise to that value, so you also need to set system-wide limits. For systemd-managed Nginx:

    sudo systemctl edit nginx
    [Service]
    LimitNOFILE=65535

    For non-systemd systems, add to /etc/security/limits.conf:

    www-data    soft    nofile    65535
    www-data    hard    nofile    65535

    Reload and verify:

    sudo systemctl daemon-reload && sudo systemctl restart nginx
    cat /proc/$(cat /run/nginx.pid)/limits | grep "open files"
    # Max open files            65535                65535                files

    Root Cause 3: Out of Memory (OOM Killer)

    The Linux OOM killer is a blunt instrument. When the system runs out of memory, the kernel scores every process by memory usage and kills the one with the highest score. Nginx workers, which can be handling thousands of connections each carrying proxy buffers, often accumulate enough RSS to look like excellent candidates. The workers don't crash in the traditional sense — they're killed from the outside by the kernel.

    This is the easiest root cause to confirm. The OOM killer always leaves a trace in the kernel ring buffer:

    sudo dmesg | grep -i "oom\|killed process" | tail -20
    [2345678.901234] Out of memory: Kill process 28350 (nginx) score 312 or sacrifice child
    [2345678.901235] Killed process 28350 (nginx) total-vm:512048kB, anon-rss:487312kB, file-rss:0kB, shmem-rss:0kB

    It also shows up in journald:

    sudo journalctl -k --since "2 hours ago" | grep -i "oom\|nginx"
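    If you ship these kernel lines to an alerting pipeline, a small parser can pull out the victim PID and its resident memory at kill time. This is a sketch assuming the dmesg line format shown above:

```shell
# Extract pid, process name, and anonymous RSS from a kernel
# "Killed process" line, using the format dmesg printed above.
parse_oom_line() {
  echo "$1" | sed -n \
    's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/pid=\1 comm=\2 rss_kb=\3/p'
}

parse_oom_line "[2345678.901235] Killed process 28350 (nginx) total-vm:512048kB, anon-rss:487312kB, file-rss:0kB, shmem-rss:0kB"
# → pid=28350 comm=nginx rss_kb=487312
```

    Feeding the rss_kb value into a metric gives you a record of how large the worker actually was at the moment the kernel killed it.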

    Once confirmed, figure out whether Nginx is the source of the memory pressure or the victim of something else on the box. Common Nginx-side causes include oversized proxy buffer directives that allocate per-request:

    # Before — over-allocated
    proxy_buffer_size          128k;
    proxy_buffers              64 128k;
    
    # After — conservative defaults that still handle most workloads
    proxy_buffer_size          4k;
    proxy_buffers              8 4k;
    proxy_busy_buffers_size    8k;
    proxy_temp_file_write_size 8k;

    If the memory pressure is system-wide and Nginx is a victim, you have two levers. First, protect the master process from the OOM killer by lowering its oom_score_adj. Add to your systemd service override:

    [Service]
    OOMScoreAdjust=-500

    A value of -500 makes the kernel significantly less likely to kill the Nginx master. This doesn't solve the underlying memory problem, but it keeps the master alive to restart workers while you investigate. In my experience, the actual culprit in an OOM scenario is usually a PHP-FPM pool that caught a traffic spike and spawned uncapped child processes, or a Java application that never released its heap. Use ps aux --sort=-%mem | head -20 right after an OOM event to see who was consuming the most memory.


    Root Cause 4: Lua Module Error

    If you're running ngx_http_lua_module — which ships with OpenResty and can be added to standalone Nginx — you have a powerful scripting layer that can also silently destroy workers under the right conditions. Lua errors at certain lifecycle hooks, particularly in init_worker_by_lua_block or within cosocket operations that violate phase constraints, can cause a worker to terminate abnormally via SIGABRT rather than simply returning a 500 to the client.

    The error log pattern for a Lua panic is hard to miss:

    2026/04/15 10:03:22 [error] 28360#28360: *102341 lua entry thread aborted: runtime error: /etc/nginx/lua/auth.lua:47: attempt to index a nil value (global 'redis')
    stack traceback:
            [C]: in ?
            /etc/nginx/lua/auth.lua:47: in function 'verify_token'
            /etc/nginx/lua/auth.lua:89: in function <auth.lua:85>
    2026/04/15 10:03:22 [alert] 28340#28340: worker process 28360 exited on signal 6

    Signal 6 is SIGABRT — the process called abort(). That's what the Lua module does when a Lua panic unwinds past the C boundary with no recovery point.

    The most common scenario I've hit is a Lua script trying to use a cosocket or shared dictionary during init_worker_by_lua_block before the event loop is fully ready. Something like this will crash intermittently:

    init_worker_by_lua_block {
        local redis = require "resty.redis"
        local r = redis:new()
        -- Cosockets are not available in all init_worker contexts
        local ok, err = r:connect("10.10.1.10", 6379)
        if not ok then
            ngx.log(ngx.ERR, "redis connect failed: ", err)
        end
    }

    The fix is to defer initialization using a timer, which fires after the event loop is running:

    init_worker_by_lua_block {
        local handler
        handler = function(premature)
            if premature then return end
            local redis = require "resty.redis"
            local r = redis:new()
            local ok, err = r:connect("10.10.1.10", 6379)
            if not ok then
                ngx.log(ngx.ERR, "redis connect failed: ", err)
                return
            end
            -- store handle in ngx.shared dict or upvalue
        end
        ngx.timer.at(0, handler)
    }

    Another Lua crash pattern is unhandled errors in content_by_lua_block propagating past the C boundary. Always wrap top-level Lua handlers in pcall:

    content_by_lua_block {
        local ok, err = pcall(function()
            -- your actual logic here
        end)
        if not ok then
            ngx.log(ngx.ERR, "handler error: ", err)
            ngx.exit(500)
        end
    }

    If you're debugging a Lua crash and the stack trace is truncated, temporarily enable this in a dev reload — and only in dev:

    lua_code_cache off;

    Never leave lua_code_cache off in production. It disables bytecode caching and can itself destabilize workers under load. Use it to iterate on a fix, then re-enable it before any production deployment.


    Root Cause 5: Config Error Not Caught at Start

    This one is subtle and genuinely frustrating. nginx -t passes. The service starts cleanly. Workers run for minutes or hours. And then one dies because it tried to evaluate a configuration directive that's syntactically valid but semantically broken at runtime, when it encounters specific real-world input.

    The classic example is a map block with a PCRE regex that compiles fine but hits catastrophic backtracking when matched against a crafted URI. Another common variant is a geo or geoip2 directive pointing at a database file that exists at startup but gets replaced by an automated update script mid-operation, leaving the worker holding a stale descriptor to a deleted inode.

    The error log for this type of crash is often sparse. You get the signal but minimal context:

    2026/04/15 11:22:45 [alert] 28340#28340: worker process 28370 exited on signal 11
    2026/04/15 11:22:45 [alert] 28340#28340: worker process 28371 exited on signal 11

    The crash correlates with requests matching a specific pattern — long URLs, certain user-agent strings, requests that route through a particular location block. Check your access log timestamps against the worker death timestamps to find the pattern.
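    That timestamp correlation can be done mechanically: for each worker-death second in the error log, count which request paths the access log recorded at that same second. A sketch assuming the stock error-log timestamp format and the combined access-log format:

```shell
# For every "exited on signal" timestamp (HH:MM:SS) in the error log ($1),
# print the most frequent request paths logged at that second in the
# access log ($2). Field 2 of the error log is the time; field 7 of the
# combined access-log format is the request path.
correlate_crashes() {
  grep 'exited on signal' "$1" | awk '{print $2}' | sort -u |
  while read -r t; do
    echo "== requests at $t =="
    grep ":$t " "$2" | awk '{print $7}' | sort | uniq -c | sort -rn | head -5
  done
}

# On a live host:
# correlate_crashes /var/log/nginx/error.log /var/log/nginx/access.log
```

    If one path or path family dominates every crash second, you have found the trigger pattern and can reproduce it in staging.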

    For regex-related crashes, enable PCRE JIT compilation in nginx.conf:

    pcre_jit on;

    JIT-compiled PCRE is faster and significantly more resistant to catastrophic backtracking. Also audit any complex regex in map or location blocks. This is a ReDoS trap:

    map $uri $some_var {
        ~^/(a+)+$    "matched";
        default      "none";
    }

    Simplify the regex or rewrite the matching logic entirely. Use pcre2test or an online ReDoS checker to validate any regex before it goes into a map block on a production host.

    For the stale GeoIP or include-file scenario, the fix is an atomic update pattern: copy the new file to a temporary name on the same filesystem as the destination (rename is only atomic within a single filesystem, so a temp file in /tmp won't do), rename it over the production path, then send Nginx a reload signal to re-open all files with fresh descriptors:

    cp /tmp/GeoLite2-City-new.mmdb /etc/nginx/geoip/GeoLite2-City.mmdb.tmp
    mv /etc/nginx/geoip/GeoLite2-City.mmdb.tmp /etc/nginx/geoip/GeoLite2-City.mmdb
    nginx -s reload

    An SSL certificate rotation that doesn't trigger a reload is another variant of this bug. The worker doesn't crash immediately — it crashes when it tries to re-initialize the SSL context for a new connection after the old cert file has been replaced. Check your certificate renewal hooks:

    ls -la /etc/letsencrypt/renewal-hooks/deploy/
    cat /etc/letsencrypt/renewal-hooks/deploy/nginx-reload.sh

    That file should exist and contain:

    #!/bin/sh
    nginx -s reload

    If it doesn't, create it and make it executable with chmod +x.
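    Creating the hook can be scripted. The function below takes the hook directory as a parameter so it can be exercised outside /etc; on a real host you would run the same steps with sudo against /etc/letsencrypt/renewal-hooks/deploy:

```shell
# Write an executable nginx-reload deploy hook into the given directory.
# Certbot runs every executable in renewal-hooks/deploy/ after each
# successful renewal, so the two-line script is all the hook needs.
create_reload_hook() {
  dir=$1
  mkdir -p "$dir"
  printf '#!/bin/sh\nnginx -s reload\n' > "$dir/nginx-reload.sh"
  chmod +x "$dir/nginx-reload.sh"
  echo "$dir/nginx-reload.sh"
}

# On a live host (requires root):
# create_reload_hook /etc/letsencrypt/renewal-hooks/deploy
```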


    Root Cause 6: Upstream Timeout Storm

    This isn't a crash in the traditional sense, but it produces identical symptoms — workers disappearing and 502s flooding in. When upstreams become slow or start timing out, connections pile up inside each worker waiting for backend responses. The worker connection pool saturates. Under extreme backpressure combined with very large proxy_read_timeout values, memory grows as connections accumulate, and you can trigger the OOM condition described above as a secondary effect.

    The error log makes the cause obvious:

    2026/04/15 12:05:33 [error] 28372#28372: *115432 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.10.1.88, server: solvethenetwork.com, request: "GET /api/users HTTP/1.1", upstream: "http://10.10.2.10:8080/api/users"

    The fix is conservative timeout values combined with fast failure and failover:

    upstream api_backend {
        server 10.10.2.10:8080;
        server 10.10.2.11:8080 backup;
        keepalive 32;
    }
    
    location /api/ {
        proxy_pass             http://api_backend;
        proxy_connect_timeout  5s;
        proxy_send_timeout     10s;
        proxy_read_timeout     30s;
        proxy_next_upstream    error timeout http_502 http_503;
        proxy_next_upstream_tries 2;
    }

    Short timeouts mean connections fail fast and return to the pool rather than sitting blocked for 60 or 120 seconds per request. With proxy_next_upstream configured, Nginx will also automatically retry failed requests against the backup upstream before returning a 502 to the client.


    Prevention

    Once you've fought through a worker crash incident, you don't want to do it again. Here's what I keep in place on production Nginx hosts to stay ahead of these failures.

    Enable stub_status and monitor it actively. The built-in stub_status module exposes active connections, accepted connections, handled requests, and reading/writing/waiting counts. If active connections are climbing toward your worker_connections limit, you'll know before workers start dying. Wire it to Prometheus with nginx-prometheus-exporter or poll it with a simple curl from your monitoring system.

    location /nginx_status {
        stub_status;
        allow 127.0.0.1;
        allow 10.10.0.0/16;
        deny all;
    }
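    Parsing stub_status output is trivial, which is why it wires into any monitoring system so easily. A sketch using a sample of the fixed format stub_status emits — on a live host the input would come from curl -s against the /nginx_status location above:

```shell
# Extract the active-connection count from stub_status output.
# stub_status's first line is always "Active connections: N".
parse_active() {
  awk '/^Active connections:/ {print $3}'
}

# Sample of the format; live input: curl -s http://127.0.0.1/nginx_status
sample='Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106'

printf '%s\n' "$sample" | parse_active
```

    Compare that number against worker_processes * worker_connections in your monitoring system and alert well before it approaches the product.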

    Alert on specific error log strings. Push /var/log/nginx/error.log to your log aggregator and create alerts on exited on signal, Too many open files, lua entry thread aborted, and Out of memory. These fire minutes before your users start complaining and before the 502s trigger your HTTP error rate alert.

    Run nginx -T in CI on every config change. The -T flag dumps the full resolved configuration including all included files. Running it as a CI step catches syntax errors, undefined variables, and duplicate directives before they ever reach production. It won't catch every runtime issue, but it eliminates the obvious ones that slip through during a rushed manual deploy.

    sudo nginx -T 2>&1 | grep -i "warn\|error"

    Pin module versions in your build scripts. If you compile Nginx from source with third-party modules, lock module source versions to your Nginx release in the build script and store that script in version control. A module-to-core version mismatch is the single most common cause of unexpected segfaults I've seen in production Nginx deployments.

    Set resource limits from the start, not when you hit a wall. Set worker_rlimit_nofile and the systemd LimitNOFILE to at least 65535 on any server expecting more than light traffic. Add OOMScoreAdjust=-500 to your service unit. These are one-line config changes with zero performance downside and they prevent entire categories of incidents.

    Wrap all Lua handlers in pcall. If you're writing Lua for Nginx, treat every top-level handler as if it will receive malformed, adversarial input. Wrap everything in pcall. Log errors with enough context to debug them. Never let an unhandled Lua exception reach the C boundary — the Lua module will SIGABRT the worker and you'll be back reading this article.

    Test your file rotation procedures explicitly. If you rotate any file Nginx references at runtime — GeoIP databases, TLS certificates, include files — test the rotation procedure in staging with a loaded Nginx instance. The safe pattern on Linux is always: write to a temp path on the same filesystem, atomically rename over the production path, send nginx -s reload. Anything else is a race condition waiting to produce an incident at 2am.

    Worker crashes are rarely random. There's always a root cause, always a log entry somewhere, and almost always a systematic fix that prevents recurrence. The key is knowing where to look first and what each error pattern means at the kernel and application level. Keep your error logs accessible, your resource limits sane, and your Lua handlers wrapped — and you'll spend a lot less time reading dmesg at 2am.

    Frequently Asked Questions

    How do I tell if an Nginx worker was killed by the OOM killer versus crashing on its own?

    Check dmesg with 'sudo dmesg | grep -i "oom\|killed process"'. The OOM killer always leaves an explicit log entry like 'Killed process 28350 (nginx)' with memory stats. A self-inflicted crash via segfault or SIGABRT will show 'exited on signal 11' or 'exited on signal 6' in the Nginx error log without a corresponding OOM entry in dmesg.

    Why does nginx -t pass but a worker still crashes at runtime?

    nginx -t validates configuration syntax and basic semantic correctness at parse time, but it can't simulate runtime behavior. A PCRE regex that triggers catastrophic backtracking on a specific URI, a GeoIP database file replaced mid-operation, or an SSL certificate file deleted after startup will all pass nginx -t but cause worker crashes or errors in production. Always combine nginx -t with monitoring and log alerting to catch runtime configuration issues.

    What is the safest way to debug a Lua module crash in Nginx without taking down production?

    Add pcall wrappers around your Lua handler logic immediately to prevent unhandled errors from aborting workers. Then use a staging or canary instance to reproduce the crash with 'lua_code_cache off' enabled, which disables bytecode caching and makes Lua errors more verbose. Capture a core dump on the staging host using LimitCORE=infinity in the systemd unit and analyze the backtrace with gdb. Never run lua_code_cache off in production.

    How many file descriptors should I set for an Nginx server handling high traffic?

    A common starting point is worker_rlimit_nofile set to 65535 in nginx.conf with a matching LimitNOFILE=65535 in the systemd unit. For very high-traffic servers, values of 131072 or higher are reasonable. The key formula is: each active connection needs roughly one FD, each upstream connection needs one FD, and each open file being served needs one FD. Monitor /proc/<nginx-pid>/fd count versus your limit using your metrics stack and raise the limit before you hit 80% utilization.
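    The formula in that answer turns into a back-of-envelope budget easily. A sketch — the 80% proxied figure below is an illustrative assumption, not a measured number, and files served from disk add further descriptors on top:

```shell
# Minimum FD budget: one descriptor per client connection plus one per
# proxied upstream connection. $1 = peak concurrent clients,
# $2 = percent of requests that are proxied to an upstream.
fd_budget() {
  echo $(( $1 + $1 * $2 / 100 ))
}

echo "10000 clients, 80% proxied -> $(fd_budget 10000 80) FDs minimum"
```

    Round the result up generously; the headroom costs nothing and the shortfall costs an incident.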

    Can the Nginx master process crash, or is it only workers that crash?

    Under normal conditions, the Nginx master process is very stable — it doesn't handle requests directly and has minimal code paths that could trigger a crash. However, the OOM killer can kill the master if OOMScoreAdjust is not set and system memory is completely exhausted. A bug in a module's init phase executed by the master during startup can also crash it. In practice, almost every production Nginx crash involves workers rather than the master.
