Symptoms
HAProxy's reload mechanism exists for a reason: pick up new configuration without dropping a single active connection. But in practice, I've seen this go wrong in more ways than I can count. Teams who are confident that a reload is "safe" find themselves staring at a spike of 502s in their application logs, frantic messages from users, and a monitoring dashboard that looks like someone pulled a cable.
The symptoms that typically surface after a HAProxy reload gone wrong include:
- A sharp spike in 502 or 503 errors in application logs immediately after the reload window
- TCP RST packets captured in tcpdump during the reload event
- Active WebSocket connections or long-running HTTP streams abruptly terminated
- HAProxy failing to come back up at all, leaving the old process orphaned and the new config never loaded
- Upstream monitoring health checks failing for 5–30 seconds post-reload
- Client-side errors like Connection reset by peer or zero-byte HTTP responses
What makes this particularly frustrating is that some of these issues are intermittent — they only show up under load, or only when a long-lived connection happens to be active at the exact reload moment. This guide covers every root cause I've encountered, how to diagnose each one definitively, and how to fix it for good.
Root Cause 1: Reload Not Graceful
The most common cause, and the one that surprises people most, is that the reload wasn't actually graceful to begin with. HAProxy supports a hitless reload via the -sf (softly finish) flag, which tells the old process to stop accepting new connections and drain existing ones while the new process takes over. If you're not using this mechanism correctly, you don't get a graceful reload — you get a hard stop-start with a full gap in listener availability in the middle.
Why it happens: Homegrown deployment scripts often just invoke haproxy -f /etc/haproxy/haproxy.cfg without passing the old PID at all. Others use kill -HUP directly, which triggers a config reload in some builds but not a proper graceful worker handoff. A very common mistake is running systemctl restart haproxy instead of systemctl reload haproxy. Restart is a full kill-and-start. Reload invokes whatever is in the systemd ExecReload directive — and if that directive is wrong or missing, you still don't get a graceful reload.
How to identify it: Check the systemd unit definition to see exactly what ExecReload does:
systemctl cat haproxy | grep -A5 ExecReload
A bad ExecReload looks like this:
ExecReload=/bin/kill -HUP $MAINPID
A plain HUP to the main PID doesn't guarantee a graceful handoff. You can also confirm the behavior directly by watching the process table during a reload:
watch -n0.5 'ps aux | grep haproxy'
During a proper graceful reload, you'll briefly see two haproxy processes — the old one draining and the new one accepting. If you only ever see one at a time, it's a hard stop-start and you are dropping connections in the gap between them.
How to fix it: Update ExecReload to pass the old PID to the new invocation. For classic process mode:
ExecReload=/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf $MAINPID
For HAProxy 1.8 and newer running in master-worker mode, the correct signal is USR2 sent to the master process:
ExecReload=/bin/kill -USR2 $MAINPID
In master-worker mode, the master receives USR2, spawns a new worker with the updated config, and gracefully drains the old worker. The master process itself never exits. This is the architecture you want for any production deployment.
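On a packaged install, a systemd drop-in override is cleaner than editing the shipped unit file directly. A sketch, assuming master-worker mode, the stock unit name haproxy.service, and Debian-style binary and config paths (adjust for your distro):

```ini
# /etc/systemd/system/haproxy.service.d/reload.conf
# Hypothetical drop-in: clear the packaged ExecReload, validate the config,
# then signal the master. systemd runs multiple ExecReload lines in order.
[Service]
ExecReload=
ExecReload=/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -c -q
ExecReload=/bin/kill -USR2 $MAINPID
```

Run systemctl daemon-reload to pick up the drop-in. Putting the -c config check first means a broken config fails the reload before the master is ever signaled, so the running workers keep serving the old config.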
Root Cause 2: Old Process Not Dying
Even when the reload is genuinely graceful, the old process can linger far longer than expected. This isn't always a problem on its own — lingering means the old process is draining active connections, which is the intended behavior. But it becomes a serious problem when the drain takes hours, when file descriptor limits are hit across two coexisting processes, or when the old process gets completely stuck and never exits.
Why it happens: The old HAProxy process stays alive until every one of its connections closes. If you're proxying WebSocket connections, long HTTP uploads, or keep-alive connections with very permissive timeouts, you'll have an old HAProxy process sitting around long after the reload. Both the old and new processes share the system's file descriptor pool. If ulimit -n is tight and you have many concurrent connections, the new process will eventually start refusing connections because it can't open new file descriptors.
In my experience, the most common trigger is an omitted or overly generous timeout tunnel in the config. Some HAProxy versions default this to no timeout, meaning a WebSocket connection will be held open until the client disconnects — potentially hours or days later.
How to identify it:
ps aux | grep haproxy
If multiple haproxy processes appear with meaningfully different start times — hours after your last reload — the old one hasn't drained. To see what it's holding:
ss -tnp | grep haproxy
lsof -p <old_pid> | grep -c ESTABLISHED
If the old process is still answering its stats socket, you can query it directly:
echo "show info" | socat stdio /run/haproxy/admin.sock
Look at the CurrConns field. If it's non-zero and hasn't moved in an hour, you have stuck sessions that will never drain on their own.
How to fix it: Set appropriate drain timeouts in your defaults section. For WebSocket-heavy proxies, be deliberate about timeout tunnel. For standard HTTPS frontends, 30–60 seconds on timeout client is almost always sufficient. Add hard-stop-after in the global section to enforce an absolute deadline:
global
hard-stop-after 30s
With this directive, the old process will hard-close any remaining connections after 30 seconds regardless of their state. You get predictable, bounded reload behavior that fits into deployment pipelines cleanly.
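As a sketch of what bounded drain timeouts look like in practice (the values here are illustrative starting points, not universal recommendations):

```text
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    timeout http-keep-alive 10s
    # Explicit cap on WebSocket/tunnel lifetime so drains can't run for days.
    timeout tunnel  1h
```

With hard-stop-after 30s in global, the old process exits within 30 seconds no matter what; these timeouts simply let most drains complete naturally well before the hard cutoff ever fires.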
Root Cause 3: Socket File Conflict
HAProxy's stats socket is the communication channel that the reload mechanism depends on. If that socket file is stale, missing, or owned by the wrong user, the new process can't bind it — and the reload fails hard, sometimes without a clear error visible to the operator.
Why it happens: When HAProxy crashes or is killed with SIGKILL, it doesn't get a chance to clean up the socket file. The next startup attempt tries to bind the same path and gets EADDRINUSE even though nothing is listening on it. This is also common after a server reboot when the socket lives in a tmpfs directory like /run/haproxy/ that wasn't recreated, or when a package upgrade recreates the directory with wrong ownership. In non-master-worker setups, this is a persistent footgun that I've seen bite teams repeatedly.
How to identify it: Run a config check, reload, then examine the journal:
haproxy -f /etc/haproxy/haproxy.cfg -c
journalctl -u haproxy -n 50 --no-pager
A socket file conflict produces output like this:
[ALERT] 106/143021 (12345) : Starting proxy stats: cannot bind socket [/run/haproxy/admin.sock]
[ALERT] 106/143021 (12345) : [/usr/sbin/haproxy.main()] Some protocols failed to start their listeners! Exiting.
Check the socket file directly:
ls -la /run/haproxy/admin.sock
file /run/haproxy/admin.sock
If it shows as a regular file rather than a Unix domain socket, or if socat gets Connection refused when trying to query it, it's stale.
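Those checks can be rolled into a small script that classifies the socket state in one shot. A sketch: the function name is mine, and the path is the default used throughout this guide.

```shell
#!/bin/bash
# check_sock PATH — classify a stats socket path as live, stale, or missing.
check_sock() {
  local sock="${1:-/run/haproxy/admin.sock}"
  if [ ! -e "$sock" ]; then
    echo "missing"            # directory likely not recreated after reboot
  elif [ ! -S "$sock" ]; then
    echo "stale-file"         # a regular file in the way; remove it
  elif echo "show info" | socat -T2 stdio "$sock" >/dev/null 2>&1; then
    echo "live"               # socket exists and HAProxy answered
  else
    echo "stale-socket"       # socket node exists but nothing is listening
  fi
}

check_sock /run/haproxy/admin.sock
```

Wiring this into a deploy pre-flight (or an ExecStartPre cleanup step) lets automation remove a stale-file result before HAProxy ever attempts the bind.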
How to fix it: For an immediate fix, remove the stale socket and restart:
rm -f /run/haproxy/admin.sock
systemctl start haproxy
For a durable fix, configure systemd's RuntimeDirectory to manage the socket directory lifecycle automatically. Add this to the [Service] section of the unit file:
[Service]
RuntimeDirectory=haproxy
RuntimeDirectoryMode=0755
Systemd will create /run/haproxy/ fresh on each service start and clean it up on stop, eliminating stale socket files entirely. Reference the socket in haproxy.cfg with explicit permissions:
global
stats socket /run/haproxy/admin.sock mode 660 level admin user haproxy group haproxy
stats timeout 30s
Root Cause 4: Binding Issue During Reload
The new HAProxy process that starts during a reload needs to bind its frontend ports fresh. If anything is in the way — a port conflict with the still-running old process, a VIP address not yet assigned to the interface, or a permission error — the new process fails to start. Depending on how the reload is invoked, you either fall back to the old config silently or end up with no proxy running at all.
Why it happens: In classic mode, the new HAProxy process tries to bind the same ports that the old process is still holding open for its drain period. Without SO_REUSEPORT that's a hard conflict: two processes can't bind the same port, so the new process loses and the reload fails. On Linux kernels older than 3.9, or on HAProxy builds compiled without USE_REUSEPORT=1, you will hit this every time.
A second scenario I've seen frequently in Keepalived HA setups: a new frontend is added that binds to a VIP address — say 192.168.10.25 — that Keepalived manages. If the current node is in BACKUP state and that VIP isn't assigned to the interface yet, the bind call fails. The timing can make this look sporadic, appearing only on the standby node during certain failover states.
How to identify it: The config validation will pass but the reload itself will fail:
haproxy -f /etc/haproxy/haproxy.cfg -c && systemctl reload haproxy
journalctl -u haproxy --since "1 minute ago" --no-pager
Look for lines like:
[ALERT] haproxy: Starting frontend https_front: cannot bind socket (192.168.10.25:443)
[ALERT] haproxy: [/usr/sbin/haproxy.main()] Some protocols failed to start their listeners! Exiting.
Check what holds the port and whether the IP is assigned:
ss -tlnp | grep ':443'
ip addr show | grep 192.168.10.25
Check whether your build supports SO_REUSEPORT:
haproxy -vv | grep REUSEPORT
How to fix it: For port conflict issues, switch to master-worker mode with expose-fd listeners. The master process holds the listening socket file descriptors permanently and passes them to new workers at reload time — bypassing the bind conflict entirely because the new worker never needs to call bind() on its own:
global
master-worker
expose-fd listeners
For Keepalived VIP binding issues, use the freebind parameter on the relevant frontend bind line. This sets IP_FREEBIND on the socket, allowing HAProxy to bind to addresses not yet assigned to the interface:
frontend https_front
bind 192.168.10.25:443 ssl crt /etc/ssl/solvethenetwork.com.pem freebind
default_backend web_servers
With freebind, the reload succeeds on both the MASTER and BACKUP node regardless of which one currently holds the VIP. This is critical for symmetric configurations where both nodes run HAProxy.
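There's also a system-wide equivalent if you'd rather not touch every bind line: the net.ipv4.ip_nonlocal_bind sysctl lets any process on the host bind addresses not present on an interface. A sketch (the filename is illustrative):

```text
# /etc/sysctl.d/90-nonlocal-bind.conf
# Host-wide equivalent of per-socket freebind: any process may bind
# non-local addresses.
net.ipv4.ip_nonlocal_bind = 1
```

Apply with sysctl --system. Prefer the per-socket freebind parameter unless other daemons on the node need the same behavior, since this knob affects every process on the host.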
Root Cause 5: Client Connections Reset
This is the most insidious root cause because HAProxy can do everything correctly — graceful reload flag passed, clean socket handoff, proper drain configured — and clients still see connection resets. The issue isn't a misconfiguration in the traditional sense. It's a fundamental timing window inherent to how the reload process hands off the listener.
Why it happens: Even in a graceful reload, there's a brief moment where the old listener is stopped and the new listener is starting. In classic mode without file descriptor inheritance, this window is real — typically a few milliseconds but measurable. Any TCP SYN packet that arrives during this window gets no response or receives an RST from the kernel because no process has the port open. For clients making new connections at exactly the wrong millisecond, the handshake fails outright.
HTTP/1.1 keep-alive connections compound the problem significantly. A client reusing an existing keep-alive connection from the old process might send a new request just as that process begins draining. HAProxy closes the connection, the client gets an RST mid-request, and unless the client implements transparent retry on a fresh connection (which not all do), you see an application-level error. Even clients that retry will typically log the error first.
How to identify it: Capture traffic on the frontend interface during a controlled reload:
tcpdump -i eth0 -w /tmp/haproxy-reload-$(date +%s).pcap port 443 &
TCPDUMP_PID=$!
systemctl reload haproxy
sleep 5
kill $TCPDUMP_PID
Then analyze the capture for RST packets:
tcpdump -r /tmp/haproxy-reload-*.pcap 'tcp[tcpflags] & tcp-rst != 0' | head -30
A tight burst of RST packets correlated with the reload timestamp is definitive. Cross-reference with application error logs:
grep -i "connection reset\|errno 104\|broken pipe" /var/log/app/error.log | grep "$(date +%H:%M)"
How to fix it: The definitive fix is master-worker mode with expose-fd listeners. In this architecture, the master process holds the listening file descriptors for the entire lifetime of the haproxy service. Workers are spawned and drained, but the master never drops the port. From the kernel and the network's perspective, the listening socket is continuous with zero gap:
global
master-worker
expose-fd listeners
hard-stop-after 30s
Trigger reloads with USR2 to the master PID:
kill -USR2 $(cat /run/haproxy.pid)
You can verify that fd inheritance is working correctly by comparing the socket inodes between the master and worker processes:
ls -la /proc/$(pgrep -f 'haproxy.*master')/fd | grep socket
ls -la /proc/$(pgrep -f 'haproxy.*worker')/fd | grep socket
The same socket inode numbers should appear in both process fd tables, confirming the worker inherited the listener rather than creating a new one.
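The two fd-table listings can be wrapped in a small helper that prints only the inodes common to both processes, so a non-empty result directly confirms inheritance. A sketch; the function name is mine, and the commented pgrep patterns are assumptions about how your master and worker processes are titled:

```shell
#!/bin/bash
# shared_sockets PID_A PID_B
# Print socket inodes present in both processes' fd tables.
shared_sockets() {
  comm -12 \
    <(ls -l "/proc/$1/fd" 2>/dev/null | grep -o 'socket:\[[0-9]*\]' | sort -u) \
    <(ls -l "/proc/$2/fd" 2>/dev/null | grep -o 'socket:\[[0-9]*\]' | sort -u)
}

# Example usage (patterns are assumptions, adjust to your process titles):
# shared_sockets "$(pgrep -of haproxy)" "$(pgrep -nf haproxy)"
```

If the output is empty during a reload, the worker opened its own listeners instead of inheriting them, which means expose-fd listeners isn't actually in effect.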
Root Cause 6: Stats Socket Timing Race in Automation
Deployment automation that queries the stats socket immediately after a reload — a health check loop running show servers state to wait for backends to come up, or a script checking show info to confirm the new process is ready — can fail in ways that look like HAProxy itself is broken when really it's a race between socket recreation and the script's poll timing.
Why it happens: The new HAProxy process creates a fresh stats socket after startup. If your script polls that socket path within milliseconds of reload completing, it may connect to the old socket just as the old process closes it, or find the new socket file not yet created. The socat command gets a Connection refused, which causes scripts to fail or report HAProxy as down when it's actually healthy. This also surfaces as a permissions issue when the socket is recreated with different ownership than the script expects.
How to identify it:
echo "show info" | socat stdio /run/haproxy/admin.sock
Immediately after a reload, this might return:
socat[5678] E connect(3, AF=1 "/run/haproxy/admin.sock", 27): Connection refused
Check socket ownership and mode after a reload to spot permission drift:
stat /run/haproxy/admin.sock
How to fix it: Add a readiness loop in your deployment automation rather than assuming the socket is immediately available:
for i in $(seq 1 20); do
echo "show info" | socat stdio /run/haproxy/admin.sock 2>/dev/null | grep -q "^Name:" && break
sleep 0.5
done
Set explicit socket permissions in haproxy.cfg so they're consistent across every start and reload:
global
stats socket /run/haproxy/admin.sock mode 660 level admin user haproxy group haproxy
stats timeout 30s
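A reusable variant of that readiness loop returns a failure code on timeout, so the pipeline stops loudly instead of racing ahead. The function name is illustrative; the probe command in the usage comment matches the socket path used throughout this guide:

```shell
#!/bin/bash
# wait_for_ready PROBE_CMD [DEADLINE_SECONDS]
# Poll PROBE_CMD every 0.5s; return 0 once it succeeds, 1 after the deadline.
wait_for_ready() {
  local probe="$1" deadline="${2:-10}" tries=0
  until eval "$probe" >/dev/null 2>&1; do
    tries=$((tries + 1))
    # Two polls per second, so deadline*2 attempts total.
    [ "$tries" -ge $((deadline * 2)) ] && return 1
    sleep 0.5
  done
  return 0
}

# Usage in a deploy script:
# wait_for_ready 'echo "show info" | socat stdio /run/haproxy/admin.sock | grep -q "^Name:"' 10 \
#   || { echo "stats socket never came back after reload" >&2; exit 1; }
```

Returning a real exit code matters: a bare loop that simply breaks on success will also fall through silently on failure, which is exactly how automation ends up reporting a healthy HAProxy that isn't.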
Prevention
After working through all of these causes, a clear pattern emerges. The vast majority of HAProxy reload connection drops trace back to not running in master-worker mode with proper file descriptor inheritance. If you're on HAProxy 1.8 or newer — and you should be — there's no good reason not to use it.
Here's the global section I use as a baseline for production deployments on sw-infrarunbook-01:
global
master-worker
expose-fd listeners
hard-stop-after 30s
log /dev/log local2
stats socket /run/haproxy/admin.sock mode 660 level admin user haproxy group haproxy
stats timeout 30s
maxconn 50000
ulimit-n 102400
Pair this with a systemd unit that sets RuntimeDirectory=haproxy and uses kill -USR2 $MAINPID for ExecReload. That combination eliminates socket file conflicts, binding races, and client RST windows all at once.
Beyond configuration, build a reload regression test into your deployment pipeline. Send a continuous stream of requests through the frontend during the reload window and assert that the error rate stays at zero. Tools like hey work well for this:
hey -z 30s -c 10 -q 50 https://solvethenetwork.com/health &
HEY_PID=$!
sleep 5
systemctl reload haproxy
wait $HEY_PID
The output shows you exactly how many non-2xx responses occurred during the test. If it's zero, your reload is genuinely hitless. Anything above zero means one of the root causes in this guide is still present.
Ship HAProxy logs to syslog and retain them. The log /dev/log local2 directive plus a local rsyslog config pointing local2 to a dedicated file is the minimum viable setup. When a reload drop happens at 3 AM, those timestamped log lines showing exactly which process exited and what connections were active are the difference between a 5-minute postmortem and a 3-hour guessing session. Add a structured log-format if you're feeding into a log aggregation platform — correlating HAProxy reload events with application-level errors becomes dramatically easier when both sides share a common timestamp and request ID.
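As a sketch of that structured step, using standard HAProxy log-format variables (the field selection is an assumption; adapt it to your aggregation platform's schema):

```text
defaults
    log global
    option httplog
    # %TR/%Tw/%Tc/%Tr/%Ta: request/queue/connect/response/total timers.
    # %ts: two-character session termination state, which tells you which
    # side closed the connection and during which phase.
    log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %ts"
```

If you propagate a request ID from the application side, append %ID (which requires a unique-id-format directive) so reload-window RSTs can be joined against application error logs by ID rather than by timestamp alone.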
