InfraRunBook

    HAProxy Reload Causing Connection Drops

    HAProxy
    Published: Apr 15, 2026
    Updated: Apr 15, 2026

    HAProxy reloads should be seamless, but misconfigured graceful reload, socket file conflicts, binding errors, and TCP resets can all cause dropped connections. This guide covers every root cause with real diagnosis commands and fixes.


    Symptoms

    HAProxy's reload mechanism exists for a reason: pick up new configuration without dropping a single active connection. But in practice, I've seen this go wrong in more ways than I can count. Teams who are confident that a reload is "safe" find themselves staring at a spike of 502s in their application logs, frantic messages from users, and a monitoring dashboard that looks like someone pulled a cable.

    The symptoms that typically surface after a HAProxy reload gone wrong include:

    • A sharp spike in 502 or 503 errors in application logs immediately after the reload window
    • TCP RST packets captured in tcpdump during the reload event
    • Active WebSocket connections or long-running HTTP streams abruptly terminated
    • HAProxy failing to come back up at all, leaving the old process orphaned and the new config never loaded
    • Upstream monitoring health checks failing for 5–30 seconds post-reload
    • Client-side errors like Connection reset by peer or zero-byte HTTP responses

    What makes this particularly frustrating is that some of these issues are intermittent — they only show up under load, or only when a long-lived connection happens to be active at the exact reload moment. This guide covers every root cause I've encountered, how to diagnose each one definitively, and how to fix it for good.


    Root Cause 1: Reload Not Graceful

    The most common cause, and the one that surprises people most, is that the reload wasn't actually graceful to begin with. HAProxy supports a hitless reload via the -sf (soft finish) flag, which tells the old process to stop accepting new connections and drain existing ones while the new process takes over. If you're not using this mechanism correctly, you don't get a graceful reload — you get a hard stop-start with a full gap in listener availability in the middle.

    Why it happens: Homegrown deployment scripts often just invoke haproxy -f /etc/haproxy/haproxy.cfg without passing the old PID at all. Others use kill -HUP directly, which triggers a config reload in some builds but not a proper graceful worker handoff. A very common mistake is running systemctl restart haproxy instead of systemctl reload haproxy. Restart is a full kill-and-start. Reload invokes whatever is in the systemd ExecReload directive — and if that directive is wrong or missing, you still don't get a graceful reload.

    How to identify it: Check the systemd unit definition to see exactly what ExecReload does:

    systemctl cat haproxy | grep -A5 ExecReload

    A bad ExecReload looks like this:

    ExecReload=/bin/kill -HUP $MAINPID

    A plain HUP to the main PID doesn't guarantee a graceful handoff. You can also confirm the behavior directly by watching the process table during a reload:

    watch -n0.5 'ps aux | grep haproxy'

    During a proper graceful reload, you'll briefly see two haproxy processes — the old one draining and the new one accepting. If you only ever see one at a time, it's a hard stop-start and you are dropping connections in the gap between them.

    How to fix it: Update ExecReload to pass the old PID to the new invocation. For classic process mode:

    ExecReload=/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf $MAINPID

    For HAProxy 1.8 and newer running in master-worker mode, the correct signal is USR2 sent to the master process:

    ExecReload=/bin/kill -USR2 $MAINPID

    In master-worker mode, the master receives USR2, spawns a new worker with the updated config, and gracefully drains the old worker. The master process itself never exits. This is the architecture you want for any production deployment.
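    A low-friction way to apply the ExecReload fix is a systemd drop-in rather than editing the packaged unit file. A sketch, assuming the usual drop-in location (running systemctl edit haproxy creates this file for you):

```
# /etc/systemd/system/haproxy.service.d/reload.conf
[Service]
ExecReload=
ExecReload=/bin/kill -USR2 $MAINPID
```

    The empty ExecReload= line clears the directive inherited from the packaged unit before setting the new one. Run systemctl daemon-reload afterwards so systemd picks up the drop-in.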


    Root Cause 2: Old Process Not Dying

    Even when the reload is genuinely graceful, the old process can linger far longer than expected. This isn't always a problem on its own — lingering means the old process is draining active connections, which is the intended behavior. But it becomes a serious problem when the drain takes hours, when file descriptor limits are hit across two coexisting processes, or when the old process gets completely stuck and never exits.

    Why it happens: The old HAProxy process stays alive until every one of its connections closes. If you're proxying WebSocket connections, long HTTP uploads, or keep-alive connections with very permissive timeouts, you'll have an old HAProxy process sitting around long after the reload. Two coexisting processes also means roughly double the file descriptor usage against system-wide limits, and each process is separately bound by ulimit -n. If that limit is tight and you have many concurrent connections, the new process will eventually start refusing connections because it can't open new file descriptors.

    In my experience, the most common trigger is an omitted or overly generous timeout tunnel in the config. Some HAProxy versions default this to no timeout, meaning a WebSocket connection will be held open until the client disconnects — potentially hours or days later.

    How to identify it:

    ps aux | grep haproxy

    If multiple haproxy processes appear with meaningfully different start times — hours after your last reload — the old one hasn't drained. To see what it's holding:

    ss -tnp | grep haproxy
    lsof -p <old_pid> | grep -c ESTABLISHED

    If the old process is still answering its stats socket, you can query it directly:

    echo "show info" | socat stdio /run/haproxy/admin.sock

    Look at the CurrConns field. If it's non-zero and hasn't moved in an hour, you have stuck sessions that will never drain on their own.
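    For automation, the CurrConns check can be scripted. A minimal sketch that parses the show info output; the sample dump below is illustrative, and in practice you would pipe the real socat output into the awk line:

```shell
# Sample `show info` output, embedded for illustration. The real thing
# comes from: echo "show info" | socat stdio /run/haproxy/admin.sock
show_info='Name: HAProxy
Version: 2.8.3
CurrConns: 42
PipesUsed: 0'

# `show info` is one "Key: value" pair per line, so split on ": "
curr=$(printf '%s\n' "$show_info" | awk -F': ' '/^CurrConns:/ {print $2}')
echo "current connections: $curr"
```

    Record this value on a timer; a non-zero count that never decreases is your signal to intervene.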

    How to fix it: Set appropriate drain timeouts in your defaults section. For WebSocket-heavy proxies, be deliberate about timeout tunnel. For standard HTTPS frontends, 30–60 seconds on timeout client is almost always sufficient. Add hard-stop-after in the global section to enforce an absolute deadline:

    global
        hard-stop-after 30s

    With this directive, the old process will hard-close any remaining connections after 30 seconds regardless of their state. You get predictable, bounded reload behavior that fits into deployment pipelines cleanly.
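    As a reference point, a defaults block along these lines bounds the drain. The values are illustrative, not recommendations; tune timeout tunnel to your longest legitimate WebSocket session:

```
defaults
    timeout client  60s
    timeout server  60s
    timeout tunnel  1h
```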


    Root Cause 3: Sock File Conflict

    HAProxy's stats socket is the communication channel that the reload mechanism depends on. If that socket file is stale, missing, or owned by the wrong user, the new process can't bind it — and the reload fails hard, sometimes without a clear error visible to the operator.

    Why it happens: When HAProxy crashes or is killed with SIGKILL, it doesn't get a chance to clean up the socket file. The next startup attempt tries to bind the same path and gets EADDRINUSE even though nothing is listening on it. This is also common after a server reboot when the socket lives in a tmpfs directory like /run/haproxy/ that wasn't recreated, or when a package upgrade recreates the directory with the wrong ownership. In non-master-worker setups, this is a persistent footgun that I've seen bite teams repeatedly.

    How to identify it: Run a config check, reload, then examine the journal:

    haproxy -f /etc/haproxy/haproxy.cfg -c
    journalctl -u haproxy -n 50 --no-pager

    A sock file conflict produces output like this:

    [ALERT] 106/143021 (12345) : Starting proxy stats: cannot bind socket [/run/haproxy/admin.sock]
    [ALERT] 106/143021 (12345) : [/usr/sbin/haproxy.main()] Some protocols failed to start their listeners! Exiting.

    Check the socket file directly:

    ls -la /run/haproxy/admin.sock
    file /run/haproxy/admin.sock

    If it shows as a regular file rather than a Unix domain socket, or if socat gets a connection refused when trying to query it, it's stale.
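    That classification can go into a pre-flight check in your deployment scripts, so the stale-file case is caught before a reload is even attempted. A sketch; a complete check would also probe a "present" socket with socat, since an existing socket can still be dead:

```shell
# Classify the stats socket path: missing, stale (exists but is not a
# Unix domain socket), or present.
check_sock() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ ! -S "$1" ]; then
        echo "stale"
    else
        echo "present"
    fi
}

check_sock /run/haproxy/admin.sock
```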

    How to fix it: For an immediate fix, remove the stale socket and restart:

    rm -f /run/haproxy/admin.sock
    systemctl start haproxy

    For a durable fix, configure systemd's RuntimeDirectory to manage the socket directory lifecycle automatically. Add this to the [Service] section of the unit file:

    [Service]
    RuntimeDirectory=haproxy
    RuntimeDirectoryMode=0755

    Systemd will create /run/haproxy/ fresh on each service start and clean it up on stop, eliminating stale socket files entirely. Reference the socket in haproxy.cfg with explicit permissions:

    global
        stats socket /run/haproxy/admin.sock mode 660 level admin user haproxy group haproxy
        stats timeout 30s

    Root Cause 4: Binding Issue During Reload

    The new HAProxy process that starts during a reload needs to bind its frontend ports fresh. If anything is in the way — a port conflict with the still-running old process, a VIP address not yet assigned to the interface, or a permission error — the new process fails to start. Depending on how the reload is invoked, you either fall back to the old config silently or end up with no proxy running at all.

    Why it happens: In classic mode, the new HAProxy process tries to bind the same ports that the old process is still holding open for its drain period. Without SO_REUSEPORT, this is a hard conflict — two processes can't bind the same port — so the new process loses and the reload fails. On Linux kernels older than 3.9 or on HAProxy builds compiled without USE_REUSEPORT=1, you will hit this every time.

    A second scenario I've seen frequently in Keepalived HA setups: a new frontend is added that binds to a VIP address — say 192.168.10.25 — that Keepalived manages. If the current node is in BACKUP state and that VIP isn't assigned to the interface yet, the bind call fails. The timing can make this look sporadic, appearing only on the standby node during certain failover states.

    How to identify it: The config validation will pass but the reload itself will fail:

    haproxy -f /etc/haproxy/haproxy.cfg -c && systemctl reload haproxy
    journalctl -u haproxy --since "1 minute ago" --no-pager

    Look for lines like:

    [ALERT] haproxy: Starting frontend https_front: cannot bind socket (192.168.10.25:443)
    [ALERT] haproxy: [/usr/sbin/haproxy.main()] Some protocols failed to start their listeners! Exiting.

    Check what holds the port and whether the IP is assigned:

    ss -tlnp | grep ':443'
    ip addr show | grep 192.168.10.25

    Check whether your build supports SO_REUSEPORT:

    haproxy -vv | grep REUSEPORT

    How to fix it: For port conflict issues, switch to master-worker mode with expose-fd listeners. The master process holds the listening socket file descriptors permanently and passes them to new workers at reload time — bypassing the bind conflict entirely because the new worker never needs to call bind() on its own:

    global
        master-worker
        expose-fd listeners

    For Keepalived VIP binding issues, use the freebind parameter on the relevant frontend bind line. This sets IP_FREEBIND on the socket, allowing HAProxy to bind to addresses not yet assigned to the interface:

    frontend https_front
        bind 192.168.10.25:443 ssl crt /etc/ssl/solvethenetwork.com.pem freebind
        default_backend web_servers

    With freebind, the reload succeeds on both the MASTER and BACKUP node regardless of which one currently holds the VIP. This is critical for symmetric configurations where both nodes run HAProxy.
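    An alternative to per-bind freebind is the system-wide net.ipv4.ip_nonlocal_bind sysctl, which lets any process bind non-local addresses. It has a much broader blast radius than freebind, so treat it as the fallback; the file name below is an assumption, as any file under /etc/sysctl.d/ works:

```
# /etc/sysctl.d/90-haproxy-vip.conf
net.ipv4.ip_nonlocal_bind = 1
```

    Apply it immediately with sysctl -w net.ipv4.ip_nonlocal_bind=1; the file makes it survive reboots.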


    Root Cause 5: Client Connections Reset

    This is the most insidious root cause because HAProxy can do everything correctly — graceful reload flag passed, clean socket handoff, proper drain configured — and clients still see connection resets. The issue isn't a misconfiguration in the traditional sense. It's a fundamental timing window inherent to how the reload process hands off the listener.

    Why it happens: Even in a graceful reload, there's a brief moment where the old listener is stopped and the new listener is starting. In classic mode without file descriptor inheritance, this window is real — typically a few milliseconds but measurable. Any TCP SYN packet that arrives during this window gets no response or receives an RST from the kernel because no process has the port open. For clients making new connections at exactly the wrong millisecond, the handshake fails outright.

    HTTP/1.1 keep-alive connections compound the problem significantly. A client reusing an existing keep-alive connection from the old process might send a new request just as that process begins draining. HAProxy closes the connection, the client gets an RST mid-request, and unless the client implements transparent retry on a fresh connection (which not all do), you see an application-level error. Even clients that retry will typically log the error first.

    How to identify it: Capture traffic on the frontend interface during a controlled reload:

    tcpdump -i eth0 -w /tmp/haproxy-reload-$(date +%s).pcap port 443 &
    TCPDUMP_PID=$!
    systemctl reload haproxy
    sleep 5
    kill $TCPDUMP_PID

    Then analyze the capture for RST packets:

    tcpdump -r /tmp/haproxy-reload-*.pcap 'tcp[tcpflags] & tcp-rst != 0' | head -30

    A tight burst of RST packets correlated with the reload timestamp is definitive. Cross-reference with application error logs:

    grep -i "connection reset\|errno 104\|broken pipe" /var/log/app/error.log | grep "$(date +%H:%M)"

    How to fix it: The definitive fix is master-worker mode with expose-fd listeners. In this architecture, the master process holds the listening file descriptors for the entire lifetime of the haproxy service. Workers are spawned and drained, but the master never drops the port. From the kernel and the network's perspective, the listening socket is continuous with zero gap:

    global
        master-worker
        expose-fd listeners
        hard-stop-after 30s

    Trigger reloads with USR2 to the master PID:

    kill -USR2 $(cat /run/haproxy.pid)

    You can verify that fd inheritance is working correctly by comparing the socket inodes between the master and worker processes:

    ls -la /proc/$(pgrep -f 'haproxy.*master')/fd | grep socket
    ls -la /proc/$(pgrep -f 'haproxy.*worker')/fd | grep socket

    The same socket inode numbers should appear in both process fd tables, confirming the worker inherited the listener rather than creating a new one.
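    If you want that inode comparison in script form, here's a sketch that extracts and diffs the two socket inode sets. The fd listings are embedded samples for illustration; in practice substitute the real ls -l /proc/<pid>/fd output:

```shell
# Pull the socket inode IDs out of an `ls -l /proc/<pid>/fd` listing
extract_inodes() {
    grep -o 'socket:\[[0-9]*\]' | grep -oE '[0-9]+' | sort -u
}

# Illustrative fd listings; replace with real /proc/<pid>/fd output
master_fds='lrwx------ 1 root root 64 Apr 15 10:00 4 -> socket:[123456]
lrwx------ 1 root root 64 Apr 15 10:00 5 -> socket:[123457]'
worker_fds='lrwx------ 1 root root 64 Apr 15 10:01 4 -> socket:[123456]
lrwx------ 1 root root 64 Apr 15 10:01 5 -> socket:[123457]'

m=$(printf '%s\n' "$master_fds" | extract_inodes)
w=$(printf '%s\n' "$worker_fds" | extract_inodes)

# Identical sets mean every worker listener was inherited, not rebound
[ "$m" = "$w" ] && echo "listeners inherited"
```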


    Root Cause 6: Stats Socket Timing Race in Automation

    Deployment automation that queries the stats socket immediately after a reload — a health check loop running show servers state to wait for backends to come up, or a script checking show info to confirm the new process is ready — can fail in ways that look like HAProxy itself is broken, when really it's a race between socket recreation and the script's poll timing.

    Why it happens: The new HAProxy process creates a fresh stats socket after startup. If your script polls that socket path within milliseconds of reload completing, it may connect to the old socket just as the old process closes it, or find the new socket file not yet created. The socat command gets a Connection refused, which causes scripts to fail or report HAProxy as down when it's actually healthy. This also surfaces as a permissions issue when the socket is recreated with different ownership than the script expects.

    How to identify it:

    echo "show info" | socat stdio /run/haproxy/admin.sock

    Immediately after a reload, this might return:

    socat[5678] E connect(3, AF=1 "/run/haproxy/admin.sock", 27): Connection refused

    Check socket ownership and mode after a reload to spot permission drift:

    stat /run/haproxy/admin.sock

    How to fix it: Add a readiness loop in your deployment automation rather than assuming the socket is immediately available:

    for i in $(seq 1 20); do
        echo "show info" | socat stdio /run/haproxy/admin.sock 2>/dev/null | grep -q "^Name:" && break
        [ "$i" -eq 20 ] && { echo "stats socket not ready after 10s" >&2; exit 1; }
        sleep 0.5
    done

    Set explicit socket permissions in haproxy.cfg so they're consistent across every start and reload:

    global
        stats socket /run/haproxy/admin.sock mode 660 level admin user haproxy group haproxy
        stats timeout 30s

    Prevention

    After working through all of these causes, a clear pattern emerges. The vast majority of HAProxy reload connection drops trace back to not running in master-worker mode with proper file descriptor inheritance. If you're on HAProxy 1.8 or newer — and you should be — there's no good reason not to use it.

    Here's the global section I use as a baseline for production deployments on sw-infrarunbook-01:

    global
        master-worker
        expose-fd listeners
        hard-stop-after 30s
        log /dev/log local2
        stats socket /run/haproxy/admin.sock mode 660 level admin user haproxy group haproxy
        stats timeout 30s
        maxconn 50000
        ulimit-n 102400

    Pair this with a systemd unit that sets RuntimeDirectory=haproxy and uses kill -USR2 $MAINPID for ExecReload. That combination eliminates sock file conflicts, binding races, and client RST windows all at once.
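    Put together, the relevant unit stanza looks roughly like this. The paths and the -Ws flag (master-worker with systemd integration) follow the common packaged layout; verify against your distribution's own unit with systemctl cat haproxy:

```
[Service]
Type=notify
RuntimeDirectory=haproxy
ExecStart=/usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
ExecReload=/usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -c -q
ExecReload=/bin/kill -USR2 $MAINPID
```

    The first ExecReload validates the config file, so a broken config never reaches the running master; only if validation passes does the USR2 signal trigger the worker handoff.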

    Beyond configuration, build a reload regression test into your deployment pipeline. Send a continuous stream of requests through the frontend during the reload window and assert that the error rate stays at zero. Tools like hey work well for this:

    hey -z 30s -c 10 -q 50 https://solvethenetwork.com/health &
    HEY_PID=$!
    sleep 5
    systemctl reload haproxy
    wait $HEY_PID

    The output shows you exactly how many non-2xx responses occurred during the test. If it's zero, your reload is genuinely hitless. Anything above zero means one of the root causes in this guide is still present.
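    To turn that check into an automated pipeline gate, you can parse hey's "Status code distribution" section and fail on any non-2xx count. A sketch with a sample of the output embedded; in practice, capture the real hey output to a file and feed that in instead:

```shell
# Sample of hey's summary section, embedded for illustration
hey_output='Status code distribution:
  [200] 14873 responses
  [502] 3 responses'

# Extract the bracketed three-digit codes, then count any outside 2xx
bad=$(printf '%s\n' "$hey_output" | grep -oE '\[[0-9]{3}\]' | tr -d '[]' | grep -cv '^2')

echo "non-2xx status codes seen: $bad"
[ "$bad" -eq 0 ] || echo "reload was not hitless"
```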

    Ship HAProxy logs to syslog and retain them. The log /dev/log local2 directive plus a local rsyslog config pointing local2 to a dedicated file is the minimum viable setup. When a reload drop happens at 3 AM, those timestamped log lines showing exactly which process exited and what connections were active are the difference between a 5-minute postmortem and a 3-hour guessing session. Add a structured log-format if you're feeding into a log aggregation platform — correlating HAProxy reload events with application-level errors becomes dramatically easier when both sides share a common timestamp and request ID.

    Frequently Asked Questions

    What is the difference between HAProxy reload and restart?

    A restart kills the HAProxy process and starts a fresh one, creating a full gap where no connections are accepted. A reload (using -sf or USR2 in master-worker mode) starts a new process while the old one drains active connections, allowing zero-downtime configuration changes when configured correctly.

    How do I enable HAProxy master-worker mode?

    Add 'master-worker' and 'expose-fd listeners' to the global section of haproxy.cfg, then update your systemd ExecReload to use 'kill -USR2 $MAINPID'. The master process will hold listening sockets permanently and pass them to new workers on each reload.

    What does hard-stop-after do in HAProxy?

    hard-stop-after sets an absolute deadline for the old process to drain after a graceful reload. For example, 'hard-stop-after 30s' means the old process will force-close any remaining connections after 30 seconds, giving you predictable bounded reload behavior instead of an indefinitely lingering process.

    Why does HAProxy fail to bind a socket after reload?

    In classic mode without expose-fd listeners, the new process tries to bind ports that the old process still holds. Without SO_REUSEPORT support, this causes EADDRINUSE. Switching to master-worker mode with expose-fd listeners eliminates this entirely because the master holds the port and passes the fd to the new worker.

    How do I fix a stale HAProxy stats socket file?

    Remove the stale socket with 'rm -f /run/haproxy/admin.sock' and restart HAProxy. For a permanent fix, configure RuntimeDirectory=haproxy in the systemd unit's [Service] section — systemd will manage the directory lifecycle and clean up stale sockets automatically on each service start.

    How can I verify a HAProxy reload was truly hitless?

    Run a continuous load test with a tool like hey against your frontend, trigger a reload in the middle of the test, then check the output for non-2xx responses. You can also use tcpdump to capture traffic during the reload and look for TCP RST packets correlated with the reload timestamp.
