Symptoms
HAProxy's reload mechanism exists for a reason: pick up new configuration without dropping a single active connection. But in practice, I've seen this go wrong in more ways than I can count. Teams who are confident that a reload is "safe" find themselves staring at a spike of 502s in their application logs, frantic messages from users, and a monitoring dashboard that looks like someone pulled a cable.
The symptoms that typically surface after a HAProxy reload gone wrong include:
- A sharp spike in 502 or 503 errors in application logs immediately after the reload window
- TCP RST packets captured in tcpdump during the reload event
- Active WebSocket connections or long-running HTTP streams abruptly terminated
- HAProxy failing to come back up at all, leaving the old process orphaned and the new config never loaded
- Upstream monitoring health checks failing for 5–30 seconds post-reload
- Client-side errors like Connection reset by peer or zero-byte HTTP responses
What makes this particularly frustrating is that some of these issues are intermittent — they only show up under load, or only when a long-lived connection happens to be active at the exact reload moment. This guide covers every root cause I've encountered, how to diagnose each one definitively, and how to fix it for good.
Root Cause 1: Reload Not Graceful
The most common cause, and the one that surprises people most, is that the reload wasn't actually graceful to begin with. HAProxy supports a hitless reload via the -sf (softly finish) flag, which tells the old process to stop accepting new connections and drain existing ones while the new process takes over. If you're not using this mechanism correctly, you don't get a graceful reload — you get a hard stop-start with a full gap in listener availability in the middle.
Why it happens: Homegrown deployment scripts often just invoke haproxy -f /etc/haproxy/haproxy.cfg without passing the old PID at all. Others use kill -HUP directly, which triggers a config reload in some builds but not a proper graceful worker handoff. A very common mistake is running systemctl restart haproxy instead of systemctl reload haproxy. Restart is a full kill-and-start. Reload invokes whatever is in the systemd ExecReload directive — and if that directive is wrong or missing, you still don't get a graceful reload.
How to identify it: Check the systemd unit definition to see exactly what ExecReload does:
systemctl cat haproxy | grep -A5 ExecReload
A bad ExecReload looks like this:
ExecReload=/bin/kill -HUP $MAINPID
A plain HUP to the main PID doesn't guarantee a graceful handoff. You can also confirm the behavior directly by watching the process table during a reload:
watch -n0.5 'ps aux | grep haproxy'
During a proper graceful reload, you'll briefly see two haproxy processes — the old one draining and the new one accepting. If you only ever see one at a time, it's a hard stop-start and you are dropping connections in the gap between them.
How to fix it: Update ExecReload to pass the old PID to the new invocation. For classic process mode:
ExecReload=/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf $MAINPID
For HAProxy 1.8 and newer running in master-worker mode, the correct signal is USR2 sent to the master process:
ExecReload=/bin/kill -USR2 $MAINPID
In master-worker mode, the master receives USR2, spawns a new worker with the updated config, and gracefully drains the old worker. The master process itself never exits. This is the architecture you want for any production deployment.
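On a packaged install, a systemd drop-in override is cleaner than editing the shipped unit file directly. A sketch, assuming master-worker mode, the stock unit name haproxy.service, and Debian-style binary and config paths (adjust for your distro):

```ini
# /etc/systemd/system/haproxy.service.d/reload.conf
# Hypothetical drop-in: clear the packaged ExecReload, validate the config,
# then signal the master. systemd runs multiple ExecReload lines in order.
[Service]
ExecReload=
ExecReload=/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -c -q
ExecReload=/bin/kill -USR2 $MAINPID
```

Run systemctl daemon-reload to pick up the drop-in. Putting the -c config check first means a broken config fails the reload before the master is ever signaled, so the running workers keep serving the old config.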
Root Cause 2: Old Process Not Dying
Even when the reload is genuinely graceful, the old process can linger far longer than expected. This isn't always a problem on its own — lingering means the old process is draining active connections, which is the intended behavior. But it becomes a serious problem when the drain takes hours, when file descriptor limits are hit across two coexisting processes, or when the old process gets completely stuck and never exits.
Why it happens: The old HAProxy process stays alive until every one of its connections closes. If you're proxying WebSocket connections, long HTTP uploads, or keep-alive connections with very permissive timeouts, you'll have an old HAProxy process sitting around long after the reload. Both the old and new processes share the system's file descriptor pool. If ulimit -n is tight and you have many concurrent connections, the new process will eventually start refusing connections because it can't open new file descriptors.
In my experience, the most common trigger is an omitted or overly generous timeout tunnel in the config. Some HAProxy versions default this to no timeout, meaning a WebSocket connection will be held open until the client disconnects — potentially hours or days later.
How to identify it:
ps aux | grep haproxy
If multiple haproxy processes appear with meaningfully different start times — hours after your last reload — the old one hasn't drained. To see what it's holding:
ss -tnp | grep haproxy
lsof -p <old_pid> | grep -c ESTABLISHED
If the old process is still answering its stats socket, you can query it directly:
echo "show info" | socat stdio /run/haproxy/admin.sock
Look at the CurrConns field. If it's non-zero and hasn't moved in an hour, you have stuck sessions that will never drain on their own.
How to fix it: Set appropriate drain timeouts in your defaults section. For WebSocket-heavy proxies, be deliberate about timeout tunnel. For standard HTTPS frontends, 30–60 seconds on timeout client is almost always sufficient. Add hard-stop-after in the global section to enforce an absolute deadline:
global
hard-stop-after 30s
With this directive, the old process will hard-close any remaining connections after 30 seconds regardless of their state. You get predictable, bounded reload behavior that fits into deployment pipelines cleanly.
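As a sketch of what bounded drain timeouts look like in practice (the values here are illustrative starting points, not universal recommendations):

```text
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    timeout http-keep-alive 10s
    # Explicit cap on WebSocket/tunnel lifetime so drains can't run for days.
    timeout tunnel  1h
```

With hard-stop-after 30s in global, the old process exits within 30 seconds no matter what; these timeouts simply let most drains complete naturally well before the hard cutoff ever fires.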
Root Cause 3: Socket File Conflict
HAProxy's stats socket is the communication channel that the reload mechanism depends on. If that socket file is stale, missing, or owned by the wrong user, the new process can't bind it — and the reload fails hard, sometimes without a clear error visible to the operator.
Why it happens: When HAProxy crashes or is killed with SIGKILL, it doesn't get a chance to clean up the socket file. The next startup attempt tries to bind the same path and gets EADDRINUSE even though nothing is listening on it. This is also common after a server reboot when the socket lives in a tmpfs directory like /run/haproxy/ that wasn't recreated, or when a package upgrade recreates the directory with wrong ownership. In non-master-worker setups, this is a persistent footgun that I've seen bite teams repeatedly.
How to identify it: Run a config check, reload, then examine the journal:
haproxy -f /etc/haproxy/haproxy.cfg -c
journalctl -u haproxy -n 50 --no-pager
A socket file conflict produces output like this:
[ALERT] 106/143021 (12345) : Starting proxy stats: cannot bind socket [/run/haproxy/admin.sock]
[ALERT] 106/143021 (12345) : [/usr/sbin/haproxy.main()] Some protocols failed to start their listeners! Exiting.
Check the socket file directly:
ls -la /run/haproxy/admin.sock
file /run/haproxy/admin.sock
If it shows as a regular file rather than a Unix domain socket, or if socat gets Connection refused when trying to query it, it's stale.
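Those checks can be rolled into a small script that classifies the socket state in one shot. A sketch: the function name is mine, and the path is the default used throughout this guide.

```shell
#!/bin/bash
# check_sock PATH — classify a stats socket path as live, stale, or missing.
check_sock() {
  local sock="${1:-/run/haproxy/admin.sock}"
  if [ ! -e "$sock" ]; then
    echo "missing"            # directory likely not recreated after reboot
  elif [ ! -S "$sock" ]; then
    echo "stale-file"         # a regular file in the way; remove it
  elif echo "show info" | socat -T2 stdio "$sock" >/dev/null 2>&1; then
    echo "live"               # socket exists and HAProxy answered
  else
    echo "stale-socket"       # socket node exists but nothing is listening
  fi
}

check_sock /run/haproxy/admin.sock
```

Wiring this into a deploy pre-flight (or an ExecStartPre cleanup step) lets automation remove a stale-file result before HAProxy ever attempts the bind.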
How to fix it: For an immediate fix, remove the stale socket and restart:
rm -f /run/haproxy/admin.sock
systemctl start haproxy
For a durable fix, configure systemd's RuntimeDirectory to manage the socket directory lifecycle automatically. Add this to the [Service] section of the unit file:
[Service]
RuntimeDirectory=haproxy
RuntimeDirectoryMode=0755
Systemd will create /run/haproxy/ fresh on each service start and clean it up on stop, eliminating stale socket files entirely. Reference the socket in haproxy.cfg with explicit permissions:
global
stats socket /run/haproxy/admin.sock mode 660 level admin user haproxy group haproxy
stats timeout 30s
Root Cause 4: Binding Issue During Reload
The new HAProxy process that starts during a reload needs to bind its frontend ports fresh. If anything is in the way — a port conflict with the still-running old process, a VIP address not yet assigned to the interface, or a permission error — the new process fails to start. Depending on how the reload is invoked, you either fall back to the old config silently or end up with no proxy running at all.
Why it happens: In classic mode, the new HAProxy process tries to bind the same ports that the old process is still holding open for its drain period. Without SO_REUSEPORT that's a hard conflict: two processes can't bind the same port, so the new process loses and the reload fails. On Linux kernels older than 3.9, or on HAProxy builds compiled without USE_REUSEPORT=1, you will hit this every time.
A second scenario I've seen frequently in Keepalived HA setups: a new frontend is added that binds to a VIP address — say 192.168.10.25 — that Keepalived manages. If the current node is in BACKUP state and that VIP isn't assigned to the interface yet, the bind call fails. The timing can make this look sporadic, appearing only on the standby node during certain failover states.
How to identify it: The config validation will pass but the reload itself will fail:
haproxy -f /etc/haproxy/haproxy.cfg -c && systemctl reload haproxy
journalctl -u haproxy --since "1 minute ago" --no-pager
Look for lines like:
[ALERT] haproxy: Starting frontend https_front: cannot bind socket (192.168.10.25:443)
[ALERT] haproxy: [/usr/sbin/haproxy.main()] Some protocols failed to start their listeners! Exiting.
Check what holds the port and whether the IP is assigned:
ss -tlnp | grep ':443'
ip addr show | grep 192.168.10.25
Check whether your build supports SO_REUSEPORT:
haproxy -vv | grep REUSEPORT
How to fix it: For port conflict issues, switch to master-worker mode with expose-fd listeners. The master process holds the listening socket file descriptors permanently and passes them to new workers at reload time — bypassing the bind conflict entirely because the new worker never needs to call bind() on its own:
global
master-worker
expose-fd listeners
For Keepalived VIP binding issues, use the freebind parameter on the relevant frontend bind line. This sets IP_FREEBIND on the socket, allowing HAProxy to bind to addresses not yet assigned to the interface:
frontend https_front
bind 192.168.10.25:443 ssl crt /etc/ssl/solvethenetwork.com.pem freebind
default_backend web_servers
With freebind, the reload succeeds on both the MASTER and BACKUP node regardless of which one currently holds the VIP. This is critical for symmetric configurations where both nodes run HAProxy.
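There's also a system-wide equivalent if you'd rather not touch every bind line: the net.ipv4.ip_nonlocal_bind sysctl lets any process on the host bind addresses not present on an interface. A sketch (the filename is illustrative):

```text
# /etc/sysctl.d/90-nonlocal-bind.conf
# Host-wide equivalent of per-socket freebind: any process may bind
# non-local addresses.
net.ipv4.ip_nonlocal_bind = 1
```

Apply with sysctl --system. Prefer the per-socket freebind parameter unless other daemons on the node need the same behavior, since this knob affects every process on the host.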
Root Cause 5: Client Connections Reset
This is the most insidious root cause because HAProxy can do everything correctly — graceful reload flag passed, clean socket handoff, proper drain configured — and clients still see connection resets. The issue isn't a misconfiguration in the traditional sense. It's a fundamental timing window inherent to how the reload process hands off the listener.
Why it happens: Even in a graceful reload, there's a brief moment where the old listener is stopped and the new listener is starting. In classic mode without file descriptor inheritance, this window is real — typically a few milliseconds but measurable. Any TCP SYN packet that arrives during this window gets no response or receives an RST from the kernel because no process has the port open. For clients making new connections at exactly the wrong millisecond, the handshake fails outright.
HTTP/1.1 keep-alive connections compound the problem significantly. A client reusing an existing keep-alive connection from the old process might send a new request just as that process begins draining. HAProxy closes the connection, the client gets an RST mid-request, and unless the client implements transparent retry on a fresh connection (which not all do), you see an application-level error. Even clients that retry will typically log the error first.
How to identify it: Capture traffic on the frontend interface during a controlled reload:
tcpdump -i eth0 -w /tmp/haproxy-reload-$(date +%s).pcap port 443 &
TCPDUMP_PID=$!
systemctl reload haproxy
sleep 5
kill $TCPDUMP_PID
Then analyze the capture for RST packets:
tcpdump -r /tmp/haproxy-reload-*.pcap 'tcp[tcpflags] & tcp-rst != 0' | head -30
A tight burst of RST packets correlated with the reload timestamp is definitive. Cross-reference with application error logs:
grep -i "connection reset\|errno 104\|broken pipe" /var/log/app/error.log | grep "$(date +%H:%M)"
How to fix it: The definitive fix is master-worker mode with expose-fd listeners. In this architecture, the master process holds the listening file descriptors for the entire lifetime of the haproxy service. Workers are spawned and drained, but the master never drops the port. From the kernel and the network's perspective, the listening socket is continuous with zero gap:
global
master-worker
expose-fd listeners
hard-stop-after 30s
Trigger reloads with USR2 to the master PID:
kill -USR2 $(cat /run/haproxy.pid)
You can verify that fd inheritance is working correctly by comparing the socket inodes between the master and worker processes:
ls -la /proc/$(pgrep -f 'haproxy.*master')/fd | grep socket
ls -la /proc/$(pgrep -f 'haproxy.*worker')/fd | grep socket
The same socket inode numbers should appear in both process fd tables, confirming the worker inherited the listener rather than creating a new one.
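The two fd-table listings can be wrapped in a small helper that prints only the inodes common to both processes, so a non-empty result directly confirms inheritance. A sketch; the function name is mine, and the commented pgrep patterns are assumptions about how your master and worker processes are titled:

```shell
#!/bin/bash
# shared_sockets PID_A PID_B
# Print socket inodes present in both processes' fd tables.
shared_sockets() {
  comm -12 \
    <(ls -l "/proc/$1/fd" 2>/dev/null | grep -o 'socket:\[[0-9]*\]' | sort -u) \
    <(ls -l "/proc/$2/fd" 2>/dev/null | grep -o 'socket:\[[0-9]*\]' | sort -u)
}

# Example usage (patterns are assumptions, adjust to your process titles):
# shared_sockets "$(pgrep -of haproxy)" "$(pgrep -nf haproxy)"
```

If the output is empty during a reload, the worker opened its own listeners instead of inheriting them, which means expose-fd listeners isn't actually in effect.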
Root Cause 6: Stats Socket Timing Race in Automation
Deployment automation that queries the stats socket immediately after a reload — a health check loop running show servers state to wait for backends to come up, or a script checking show info to confirm the new process is ready — can fail in ways that look like HAProxy itself is broken when really it's a race between socket recreation and the script's poll timing.
Why it happens: The new HAProxy process creates a fresh stats socket after startup. If your script polls that socket path within milliseconds of reload completing, it may connect to the old socket just as the old process closes it, or find the new socket file not yet created. The socat command gets a Connection refused, which causes scripts to fail or report HAProxy as down when it's actually healthy. This also surfaces as a permissions issue when the socket is recreated with different ownership than the script expects.
How to identify it:
echo "show info" | socat stdio /run/haproxy/admin.sock
Immediately after a reload, this might return:
socat[5678] E connect(3, AF=1 "/run/haproxy/admin.sock", 27): Connection refused
Check socket ownership and mode after a reload to spot permission drift:
stat /run/haproxy/admin.sock
How to fix it: Add a readiness loop in your deployment automation rather than assuming the socket is immediately available:
for i in $(seq 1 20); do
echo "show info" | socat stdio /run/haproxy/admin.sock 2>/dev/null | grep -q "^Name:" && break
sleep 0.5
done
Set explicit socket permissions in haproxy.cfg so they're consistent across every start and reload:
global
stats socket /run/haproxy/admin.sock mode 660 level admin user haproxy group haproxy
stats timeout 30s
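A reusable variant of that readiness loop returns a failure code on timeout, so the pipeline stops loudly instead of racing ahead. The function name is illustrative; the probe command in the usage comment matches the socket path used throughout this guide:

```shell
#!/bin/bash
# wait_for_ready PROBE_CMD [DEADLINE_SECONDS]
# Poll PROBE_CMD every 0.5s; return 0 once it succeeds, 1 after the deadline.
wait_for_ready() {
  local probe="$1" deadline="${2:-10}" tries=0
  until eval "$probe" >/dev/null 2>&1; do
    tries=$((tries + 1))
    # Two polls per second, so deadline*2 attempts total.
    [ "$tries" -ge $((deadline * 2)) ] && return 1
    sleep 0.5
  done
  return 0
}

# Usage in a deploy script:
# wait_for_ready 'echo "show info" | socat stdio /run/haproxy/admin.sock | grep -q "^Name:"' 10 \
#   || { echo "stats socket never came back after reload" >&2; exit 1; }
```

Returning a real exit code matters: a bare loop that simply breaks on success will also fall through silently on failure, which is exactly how automation ends up reporting a healthy HAProxy that isn't.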
Prevention
After working through all of these causes, a clear pattern emerges. The vast majority of HAProxy reload connection drops trace back to not running in master-worker mode with proper file descriptor inheritance. If you're on HAProxy 1.8 or newer — and you should be — there's no good reason not to use it.
Here's the global section I use as a baseline for production deployments on sw-infrarunbook-01:
global
master-worker
expose-fd listeners
hard-stop-after 30s
log /dev/log local2
stats socket /run/haproxy/admin.sock mode 660 level admin user haproxy group haproxy
stats timeout 30s
maxconn 50000
ulimit-n 102400
Pair this with a systemd unit that sets RuntimeDirectory=haproxy and uses kill -USR2 $MAINPID for ExecReload. That combination eliminates socket file conflicts, binding races, and client RST windows all at once.
Beyond configuration, build a reload regression test into your deployment pipeline. Send a continuous stream of requests through the frontend during the reload window and assert that the error rate stays at zero. Tools like hey work well for this:
hey -z 30s -c 10 -q 50 https://solvethenetwork.com/health &
HEY_PID=$!
sleep 5
systemctl reload haproxy
wait $HEY_PID
The output shows you exactly how many non-2xx responses occurred during the test. If it's zero, your reload is genuinely hitless. Anything above zero means one of the root causes in this guide is still present.
Ship HAProxy logs to syslog and retain them. The log /dev/log local2 directive plus a local rsyslog config pointing local2 to a dedicated file is the minimum viable setup. When a reload drop happens at 3 AM, those timestamped log lines showing exactly which process exited and what connections were active are the difference between a 5-minute postmortem and a 3-hour guessing session. Add a structured log-format if you're feeding into a log aggregation platform — correlating HAProxy reload events with application-level errors becomes dramatically easier when both sides share a common timestamp and request ID.
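As a sketch of that structured step, using standard HAProxy log-format variables (the field selection is an assumption; adapt it to your aggregation platform's schema):

```text
defaults
    log global
    option httplog
    # %TR/%Tw/%Tc/%Tr/%Ta: request/queue/connect/response/total timers.
    # %ts: two-character session termination state, which tells you which
    # side closed the connection and during which phase.
    log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %ts"
```

If you propagate a request ID from the application side, append %ID (which requires a unique-id-format directive) so reload-window RSTs can be joined against application error logs by ID rather than by timestamp alone.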
