InfraRunBook

    Nginx Worker Process Crashing

    Nginx
    Published: Apr 15, 2026
    Updated: Apr 15, 2026

    Nginx worker processes can crash for a range of reasons from kernel-level segfaults to Lua panics and OOM kills. This guide walks through every major root cause with real commands, log signatures, and fixes.


    Symptoms

    You log into sw-infrarunbook-01 expecting a routine morning and instead you find Nginx half-dead. Some workers are gone, requests are intermittently 502-ing, and the master process keeps spinning up replacements that promptly die again. The access log trickles. The error log is screaming. systemd has already restarted the service twice this hour.

    The specific signals that something is wrong with a worker — not just a slow backend — tend to look like this:

    • 502 Bad Gateway errors appearing in bursts, often aligned with traffic spikes or specific request patterns
    • connect() failed (111: Connection refused) messages in /var/log/nginx/error.log
    • The master process still running but with fewer worker processes than expected — ps aux | grep nginx shows two workers where you normally see eight
    • Kernel logs in dmesg or /var/log/syslog showing segfault entries attributed to nginx
    • Monitoring alerts showing a spike in HTTP 5xx errors followed by a partial recovery — that recovery is the master restarting workers

    Before you start digging, pull these three things immediately:

    sudo tail -n 200 /var/log/nginx/error.log
    sudo dmesg | grep -i nginx | tail -40
    sudo journalctl -u nginx --since "1 hour ago" --no-pager

    What you find in those three outputs will usually point you at the right root cause within two minutes. Let's walk through each one.
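    That first classification pass can be scripted. The sketch below maps the signal number in "exited on signal N" error-log lines to the root causes covered in this guide; the log path is the stock Debian/Ubuntu location, so adjust it for your distro.

```shell
# Rough first-pass classifier for worker deaths, keyed on the signal number
# in "exited on signal N" lines. The mapping follows this guide's sections.
classify_signal() {
  case "$1" in
    11) echo "SIGSEGV: likely a module segfault (Root Cause 1)" ;;
    6)  echo "SIGABRT: likely a Lua panic (Root Cause 4)" ;;
    9)  echo "SIGKILL: check dmesg for the OOM killer (Root Cause 3)" ;;
    *)  echo "signal $1: check dmesg and journalctl" ;;
  esac
}

# Default Debian/Ubuntu path; adjust for your layout.
log=/var/log/nginx/error.log
if [ -r "$log" ]; then
  sig=$(grep -o 'exited on signal [0-9]*' "$log" | tail -1 | awk '{print $4}')
  if [ -n "$sig" ]; then
    classify_signal "$sig"
  fi
fi
```

    This is only a triage hint, not a diagnosis — the sections below confirm each case against dmesg and journald.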


    Root Cause 1: Segfault in an Nginx Module

    This is the crash mode that tends to cause the most panic because the error looks catastrophic. A worker process is killed by signal 11 (SIGSEGV) — a segmentation fault — meaning it tried to access memory it wasn't supposed to touch. In my experience, the most common trigger is a third-party module compiled against a different version of Nginx than what's actually running, or a module with a known memory corruption bug that only surfaces under specific request patterns.

    The kernel is usually the first to tell you. Check dmesg:

    sudo dmesg | grep nginx
    [1234567.890123] nginx[28341]: segfault at 7f3b2c000010 ip 00007f3b2c000010 sp 00007ffde9b3f8f0 error 14 in nginx[400000+89000]

    The error 14 here is a page fault flagged as a user-mode access violation. The ip (instruction pointer) value falling inside the nginx binary range is a strong hint that a module — not the Nginx core itself — is misbehaving. The error log confirms the worker death:

    2026/04/15 08:42:17 [alert] 28340#28340: worker process 28341 exited on signal 11
    2026/04/15 08:42:17 [notice] 28340#28340: start worker process 28350

    To identify which module is responsible, enable a core dump. Add to your systemd service override (sudo systemctl edit nginx):

    [Service]
    LimitCORE=infinity
    WorkingDirectory=/tmp

    Then set the core pattern and restart:

    echo "/tmp/core.%e.%p" | sudo tee /proc/sys/kernel/core_pattern
    sudo systemctl daemon-reload && sudo systemctl restart nginx

    Once you catch a core dump, run it through gdb:

    sudo gdb /usr/sbin/nginx /tmp/core.nginx.28341
    (gdb) bt full

    The backtrace will show exactly which function call chain led to the crash. If you see a frame from ngx_http_lua_module, ngx_http_modsecurity, or another third-party module, that's your culprit. Cross-reference your loaded modules with nginx -V 2>&1 | tr ' ' '\n' | grep module.

    The fix depends on the module. For a version mismatch, recompile the module against the exact Nginx version on sw-infrarunbook-01. For known bugs, check the module's issue tracker — there's usually a patch or a newer release. If you can't fix it immediately, disable the module directive in your config and reload:

    sudo nginx -t && sudo nginx -s reload

    Root Cause 2: File Descriptor Limit Exhaustion

    Nginx opens a file descriptor for every active connection, every upstream connection it proxies, and every file it serves directly. When you hit the OS limit, workers can't accept new connections. Depending on the code path that hits EMFILE, the worker may log the error and keep running, or it may die — either way, your service is broken.

    The error log signature is distinctive:

    2026/04/15 09:15:44 [crit] 28350#28350: *88921 open() "/var/www/solvethenetwork.com/assets/logo.png" failed (24: Too many open files), client: 10.10.1.45, server: solvethenetwork.com

    Error 24 is EMFILE. You can confirm the limit at the process level before things become critical:

    NGINX_PID=$(cat /run/nginx.pid)
    cat /proc/$NGINX_PID/limits | grep "open files"
    # Max open files            1024                 1024                 files
    
    ls /proc/$NGINX_PID/fd | wc -l
    # 1019
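    Those two checks can be combined into something alertable by computing utilization as a percentage. A sketch — the 80% threshold is an arbitrary choice for illustration, not an Nginx default:

```shell
# Integer percentage of FD utilization: used descriptors vs. the soft limit.
fd_pct() {
  echo $(( $1 * 100 / $2 ))
}

if [ -r /run/nginx.pid ]; then
  pid=$(cat /run/nginx.pid)
  # Field 4 of the "Max open files" row is the soft limit.
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  pct=$(fd_pct "$used" "$limit")
  echo "nginx ($pid): $used/$limit open files (${pct}%)"
  if [ "$pct" -ge 80 ]; then
    echo "WARNING: raise worker_rlimit_nofile before this reaches 100%"
  fi
fi
```

    Run it from cron or your metrics agent; with the 1019/1024 numbers above it would report 99% and fire the warning.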

    If the current FD count is close to the limit, you're about to hit the wall. The fix is a two-part operation. First, raise the limit in /etc/nginx/nginx.conf:

    worker_processes auto;
    worker_rlimit_nofile 65535;
    
    events {
        worker_connections 10240;
        use epoll;
        multi_accept on;
    }

    The worker_rlimit_nofile directive tells the master process to set the file descriptor limit when it forks each worker. But this only works if the master itself has permission to raise to that value, so you also need to set system-wide limits. For systemd-managed Nginx:

    sudo systemctl edit nginx
    [Service]
    LimitNOFILE=65535

    For non-systemd systems, add to /etc/security/limits.conf:

    www-data    soft    nofile    65535
    www-data    hard    nofile    65535

    Reload and verify:

    sudo systemctl daemon-reload && sudo systemctl restart nginx
    cat /proc/$(cat /run/nginx.pid)/limits | grep "open files"
    # Max open files            65535                65535                files

    Root Cause 3: Out of Memory (OOM Killer)

    The Linux OOM killer is a blunt instrument. When the system runs out of memory, the kernel scores every process by memory usage and kills the one with the highest score. Nginx workers, which can be handling thousands of connections each carrying proxy buffers, often accumulate enough RSS to look like excellent candidates. The workers don't crash in the traditional sense — they're killed from the outside by the kernel.

    This is the easiest root cause to confirm. The OOM killer always leaves a trace in the kernel ring buffer:

    sudo dmesg | grep -i "oom\|killed process" | tail -20
    [2345678.901234] Out of memory: Kill process 28350 (nginx) score 312 or sacrifice child
    [2345678.901235] Killed process 28350 (nginx) total-vm:512048kB, anon-rss:487312kB, file-rss:0kB, shmem-rss:0kB

    It also shows up in journald:

    sudo journalctl -k --since "2 hours ago" | grep -i "oom\|nginx"
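    If you ship these kernel lines to an alerting pipeline, a small parser can pull out the victim PID and its resident memory at kill time. This is a sketch assuming the dmesg line format shown above:

```shell
# Extract pid, process name, and anonymous RSS from a kernel
# "Killed process" line, using the format dmesg printed above.
parse_oom_line() {
  echo "$1" | sed -n \
    's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/pid=\1 comm=\2 rss_kb=\3/p'
}

parse_oom_line "[2345678.901235] Killed process 28350 (nginx) total-vm:512048kB, anon-rss:487312kB, file-rss:0kB, shmem-rss:0kB"
# → pid=28350 comm=nginx rss_kb=487312
```

    Feeding the rss_kb value into a metric gives you a record of how large the worker actually was at the moment the kernel killed it.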

    Once confirmed, figure out whether Nginx is the source of the memory pressure or the victim of something else on the box. Common Nginx-side causes include oversized proxy buffer directives that allocate per-request:

    # Before — over-allocated
    proxy_buffer_size          128k;
    proxy_buffers              64 128k;
    
    # After — conservative defaults that still handle most workloads
    proxy_buffer_size          4k;
    proxy_buffers              8 4k;
    proxy_busy_buffers_size    8k;
    proxy_temp_file_write_size 8k;

    If the memory pressure is system-wide and Nginx is a victim, you have two levers. First, protect the master process from the OOM killer by lowering its oom_score_adj. Add to your systemd service override:

    [Service]
    OOMScoreAdjust=-500

    A value of -500 makes the kernel significantly less likely to kill the Nginx master. This doesn't solve the underlying memory problem, but it keeps the master alive to restart workers while you investigate. In my experience, the actual culprit in an OOM scenario is usually a PHP-FPM pool that caught a traffic spike and spawned uncapped child processes, or a Java application that never released its heap. Use ps aux --sort=-%mem | head -20 right after an OOM event to see who was consuming the most memory.


    Root Cause 4: Lua Module Error

    If you're running ngx_http_lua_module — which ships with OpenResty and can be added to standalone Nginx — you have a powerful scripting layer that can also silently destroy workers under the right conditions. Lua errors at certain lifecycle hooks, particularly in init_worker_by_lua_block or within cosocket operations that violate phase constraints, can cause a worker to terminate abnormally via SIGABRT rather than simply returning a 500 to the client.

    The error log pattern for a Lua panic is hard to miss:

    2026/04/15 10:03:22 [error] 28360#28360: *102341 lua entry thread aborted: runtime error: /etc/nginx/lua/auth.lua:47: attempt to index a nil value (global 'redis')
    stack traceback:
            [C]: in ?
            /etc/nginx/lua/auth.lua:47: in function 'verify_token'
            /etc/nginx/lua/auth.lua:89: in function <auth.lua:85>
    2026/04/15 10:03:22 [alert] 28340#28340: worker process 28360 exited on signal 6

    Signal 6 is SIGABRT — the process called abort(). That's what the Lua module does when a Lua panic unwinds past the C boundary with no recovery point.

    The most common scenario I've hit is a Lua script trying to use a cosocket or shared dictionary during init_worker_by_lua_block before the event loop is fully ready. Something like this will crash intermittently:

    init_worker_by_lua_block {
        local redis = require "resty.redis"
        local r = redis:new()
        -- Cosockets are not available in all init_worker contexts
        local ok, err = r:connect("10.10.1.10", 6379)
        if not ok then
            ngx.log(ngx.ERR, "redis connect failed: ", err)
        end
    }

    The fix is to defer initialization using a timer, which fires after the event loop is running:

    init_worker_by_lua_block {
        local handler
        handler = function(premature)
            if premature then return end
            local redis = require "resty.redis"
            local r = redis:new()
            local ok, err = r:connect("10.10.1.10", 6379)
            if not ok then
                ngx.log(ngx.ERR, "redis connect failed: ", err)
                return
            end
            -- store handle in ngx.shared dict or upvalue
        end
        ngx.timer.at(0, handler)
    }

    Another Lua crash pattern is unhandled errors in content_by_lua_block propagating past the C boundary. Always wrap top-level Lua handlers in pcall:

    content_by_lua_block {
        local ok, err = pcall(function()
            -- your actual logic here
        end)
        if not ok then
            ngx.log(ngx.ERR, "handler error: ", err)
            ngx.exit(500)
        end
    }

    If you're debugging a Lua crash and the stack trace is truncated, temporarily enable this in a dev reload — and only in dev:

    lua_code_cache off;

    Never leave lua_code_cache off in production. It disables bytecode caching and can itself destabilize workers under load. Use it to iterate on a fix, then re-enable it before any production deployment.


    Root Cause 5: Config Error Not Caught at Start

    This one is subtle and genuinely frustrating. nginx -t passes. The service starts cleanly. Workers run for minutes or hours. And then one dies because it tried to evaluate a configuration directive that's syntactically valid but semantically broken at runtime, when it encounters specific real-world input.

    The classic example is a map block with a PCRE regex that compiles fine but hits catastrophic backtracking when matched against a crafted URI. Another common variant is a geo or geoip2 directive pointing at a database file that exists at startup but gets replaced by an automated update script mid-operation, leaving the worker holding a stale descriptor to a deleted inode.

    The error log for this type of crash is often sparse. You get the signal but minimal context:

    2026/04/15 11:22:45 [alert] 28340#28340: worker process 28370 exited on signal 11
    2026/04/15 11:22:45 [alert] 28340#28340: worker process 28371 exited on signal 11

    The crash correlates with requests matching a specific pattern — long URLs, certain user-agent strings, requests that route through a particular location block. Check your access log timestamps against the worker death timestamps to find the pattern.
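    That timestamp correlation can be done mechanically: for each worker-death second in the error log, count which request paths the access log recorded at that same second. A sketch assuming the stock error-log timestamp format and the combined access-log format:

```shell
# For every "exited on signal" timestamp (HH:MM:SS) in the error log ($1),
# print the most frequent request paths logged at that second in the
# access log ($2). Field 2 of the error log is the time; field 7 of the
# combined access-log format is the request path.
correlate_crashes() {
  grep 'exited on signal' "$1" | awk '{print $2}' | sort -u |
  while read -r t; do
    echo "== requests at $t =="
    grep ":$t " "$2" | awk '{print $7}' | sort | uniq -c | sort -rn | head -5
  done
}

# On a live host:
# correlate_crashes /var/log/nginx/error.log /var/log/nginx/access.log
```

    If one path or path family dominates every crash second, you have found the trigger pattern and can reproduce it in staging.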

    For regex-related crashes, enable PCRE JIT compilation in nginx.conf:

    pcre_jit on;

    JIT-compiled PCRE is faster and significantly more resistant to catastrophic backtracking. Also audit any complex regex in map or location blocks. This is a ReDoS trap:

    map $uri $some_var {
        ~^/(a+)+$    "matched";
        default      "none";
    }

    Simplify the regex or rewrite the matching logic entirely. Use pcre2test or an online ReDoS checker to validate any regex before it goes into a map block on a production host.

    For the stale GeoIP or include-file scenario, the fix is an atomic update pattern: copy the new file to a temporary name on the same filesystem as the destination (rename is only atomic within a single filesystem, so a temp file in /tmp won't do), rename it over the production path, then send Nginx a reload signal to re-open all files with fresh descriptors:

    cp /tmp/GeoLite2-City-new.mmdb /etc/nginx/geoip/GeoLite2-City.mmdb.tmp
    mv /etc/nginx/geoip/GeoLite2-City.mmdb.tmp /etc/nginx/geoip/GeoLite2-City.mmdb
    nginx -s reload

    An SSL certificate rotation that doesn't trigger a reload is another variant of this bug. The worker doesn't crash immediately — it crashes when it tries to re-initialize the SSL context for a new connection after the old cert file has been replaced. Check your certificate renewal hooks:

    ls -la /etc/letsencrypt/renewal-hooks/deploy/
    cat /etc/letsencrypt/renewal-hooks/deploy/nginx-reload.sh

    That file should exist and contain:

    #!/bin/sh
    nginx -s reload

    If it doesn't, create it and make it executable with chmod +x.
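    Creating the hook can be scripted. The function below takes the hook directory as a parameter so it can be exercised outside /etc; on a real host you would run the same steps with sudo against /etc/letsencrypt/renewal-hooks/deploy:

```shell
# Write an executable nginx-reload deploy hook into the given directory.
# Certbot runs every executable in renewal-hooks/deploy/ after each
# successful renewal, so the two-line script is all the hook needs.
create_reload_hook() {
  dir=$1
  mkdir -p "$dir"
  printf '#!/bin/sh\nnginx -s reload\n' > "$dir/nginx-reload.sh"
  chmod +x "$dir/nginx-reload.sh"
  echo "$dir/nginx-reload.sh"
}

# On a live host (requires root):
# create_reload_hook /etc/letsencrypt/renewal-hooks/deploy
```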


    Root Cause 6: Upstream Timeout Storm

    This isn't a crash in the traditional sense, but it produces identical symptoms — workers disappearing and 502s flooding in. When upstreams become slow or start timing out, connections pile up inside each worker waiting for backend responses. The worker connection pool saturates. Under extreme backpressure combined with very large proxy_read_timeout values, memory grows as connections accumulate, and you can trigger the OOM condition described above as a secondary effect.

    The error log makes the cause obvious:

    2026/04/15 12:05:33 [error] 28372#28372: *115432 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.10.1.88, server: solvethenetwork.com, request: "GET /api/users HTTP/1.1", upstream: "http://10.10.2.10:8080/api/users"

    The fix is conservative timeout values combined with fast failure and failover:

    upstream api_backend {
        server 10.10.2.10:8080;
        server 10.10.2.11:8080 backup;
        keepalive 32;
    }
    
    location /api/ {
        proxy_pass             http://api_backend;
        proxy_connect_timeout  5s;
        proxy_send_timeout     10s;
        proxy_read_timeout     30s;
        proxy_next_upstream    error timeout http_502 http_503;
        proxy_next_upstream_tries 2;
    }

    Short timeouts mean connections fail fast and return to the pool rather than sitting blocked for 60 or 120 seconds per request. With proxy_next_upstream configured, Nginx will also automatically retry failed requests against the backup upstream before returning a 502 to the client.


    Prevention

    Once you've fought through a worker crash incident, you don't want to do it again. Here's what I keep in place on production Nginx hosts to stay ahead of these failures.

    Enable stub_status and monitor it actively. The built-in stub_status module exposes active connections, accepted connections, handled requests, and reading/writing/waiting counts. If active connections are climbing toward your worker_connections limit, you'll know before workers start dying. Wire it to Prometheus with nginx-prometheus-exporter or poll it with a simple curl from your monitoring system.

    location /nginx_status {
        stub_status;
        allow 127.0.0.1;
        allow 10.10.0.0/16;
        deny all;
    }
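    Parsing stub_status output is trivial, which is why it wires into any monitoring system so easily. A sketch using a sample of the fixed format stub_status emits — on a live host the input would come from curl -s against the /nginx_status location above:

```shell
# Extract the active-connection count from stub_status output.
# stub_status's first line is always "Active connections: N".
parse_active() {
  awk '/^Active connections:/ {print $3}'
}

# Sample of the format; live input: curl -s http://127.0.0.1/nginx_status
sample='Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106'

printf '%s\n' "$sample" | parse_active
```

    Compare that number against worker_processes * worker_connections in your monitoring system and alert well before it approaches the product.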

    Alert on specific error log strings. Push /var/log/nginx/error.log to your log aggregator and create alerts on exited on signal, Too many open files, lua entry thread aborted, and Out of memory. These fire minutes before your users start complaining and before the 502s trigger your HTTP error rate alert.

    Run nginx -T in CI on every config change. The -T flag dumps the full resolved configuration including all included files. Running it as a CI step catches syntax errors, undefined variables, and duplicate directives before they ever reach production. It won't catch every runtime issue, but it eliminates the obvious ones that slip through during a rushed manual deploy.

    sudo nginx -T 2>&1 | grep -i "warn\|error"

    Pin module versions in your build scripts. If you compile Nginx from source with third-party modules, lock module source versions to your Nginx release in the build script and store that script in version control. A module-to-core version mismatch is the single most common cause of unexpected segfaults I've seen in production Nginx deployments.

    Set resource limits from the start, not when you hit a wall. Set worker_rlimit_nofile and the systemd LimitNOFILE to at least 65535 on any server expecting more than light traffic. Add OOMScoreAdjust=-500 to your service unit. These are one-line config changes with zero performance downside and they prevent entire categories of incidents.

    Wrap all Lua handlers in pcall. If you're writing Lua for Nginx, treat every top-level handler as if it will receive malformed, adversarial input. Wrap everything in pcall. Log errors with enough context to debug them. Never let an unhandled Lua exception reach the C boundary — the Lua module will SIGABRT the worker and you'll be back reading this article.

    Test your file rotation procedures explicitly. If you rotate any file Nginx references at runtime — GeoIP databases, TLS certificates, include files — test the rotation procedure in staging with a loaded Nginx instance. The safe pattern on Linux is always: write to a temp path on the same filesystem, atomically rename over the production path, send nginx -s reload. Anything else is a race condition waiting to produce an incident at 2am.

    Worker crashes are rarely random. There's always a root cause, always a log entry somewhere, and almost always a systematic fix that prevents recurrence. The key is knowing where to look first and what each error pattern means at the kernel and application level. Keep your error logs accessible, your resource limits sane, and your Lua handlers wrapped — and you'll spend a lot less time reading dmesg at 2am.

    Frequently Asked Questions

    How do I tell if an Nginx worker was killed by the OOM killer versus crashing on its own?

    Check dmesg with 'sudo dmesg | grep -i "oom\|killed process"'. The OOM killer always leaves an explicit log entry like 'Killed process 28350 (nginx)' with memory stats. A self-inflicted crash via segfault or SIGABRT will show 'exited on signal 11' or 'exited on signal 6' in the Nginx error log without a corresponding OOM entry in dmesg.

    Why does nginx -t pass but a worker still crashes at runtime?

    nginx -t validates configuration syntax and basic semantic correctness at parse time, but it can't simulate runtime behavior. A PCRE regex that triggers catastrophic backtracking on a specific URI, a GeoIP database file replaced mid-operation, or an SSL certificate file deleted after startup will all pass nginx -t but cause worker crashes or errors in production. Always combine nginx -t with monitoring and log alerting to catch runtime configuration issues.

    What is the safest way to debug a Lua module crash in Nginx without taking down production?

    Add pcall wrappers around your Lua handler logic immediately to prevent unhandled errors from aborting workers. Then use a staging or canary instance to reproduce the crash with 'lua_code_cache off' enabled, which disables bytecode caching and makes Lua errors more verbose. Capture a core dump on the staging host using LimitCORE=infinity in the systemd unit and analyze the backtrace with gdb. Never run lua_code_cache off in production.

    How many file descriptors should I set for an Nginx server handling high traffic?

    A common starting point is worker_rlimit_nofile set to 65535 in nginx.conf with a matching LimitNOFILE=65535 in the systemd unit. For very high-traffic servers, values of 131072 or higher are reasonable. The key formula is: each active connection needs roughly one FD, each upstream connection needs one FD, and each open file being served needs one FD. Monitor /proc/<nginx-pid>/fd count versus your limit using your metrics stack and raise the limit before you hit 80% utilization.
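    The formula in that answer turns into a back-of-envelope budget easily. A sketch — the 80% proxied figure below is an illustrative assumption, not a measured number, and files served from disk add further descriptors on top:

```shell
# Minimum FD budget: one descriptor per client connection plus one per
# proxied upstream connection. $1 = peak concurrent clients,
# $2 = percent of requests that are proxied to an upstream.
fd_budget() {
  echo $(( $1 + $1 * $2 / 100 ))
}

echo "10000 clients, 80% proxied -> $(fd_budget 10000 80) FDs minimum"
```

    Round the result up generously; the headroom costs nothing and the shortfall costs an incident.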

    Can the Nginx master process crash, or is it only workers that crash?

    Under normal conditions, the Nginx master process is very stable — it doesn't handle requests directly and has minimal code paths that could trigger a crash. However, the OOM killer can kill the master if OOMScoreAdjust is not set and system memory is completely exhausted. A bug in a module's init phase executed by the master during startup can also crash it. In practice, almost every production Nginx crash involves workers rather than the master.
