Symptoms
Backend flapping in HAProxy is one of those problems that announces itself loudly. Your monitoring fires an alert — server down — and then, before you've even opened a terminal, it fires again saying the server is back up. This pattern repeats every few seconds or minutes, and the noise is relentless. The HAProxy stats page shows a server cycling between green and red. Your application error rate spikes in short bursts. Users report intermittent failures where a page reload fixes it briefly but then it breaks again moments later.
In the HAProxy log, the signature looks like this:
[WARNING] 106/143021 (1) : Health check for server app_backend/app01 failed, reason: Layer4 timeout, check duration: 2001ms, status: 2/3 DOWN.
[WARNING] 106/143025 (1) : Health check for server app_backend/app01 succeeded, reason: Layer4 check passed, check duration: 4ms, status: 3/3 UP.
[WARNING] 106/143029 (1) : Health check for server app_backend/app01 failed, reason: Layer4 timeout, check duration: 2001ms, status: 2/3 DOWN.
[WARNING] 106/143033 (1) : Health check for server app_backend/app01 succeeded, reason: Layer4 check passed, check duration: 3ms, status: 3/3 UP.
That alternating pattern, repeating like clockwork, is the tell. A genuinely dead backend stays DOWN. A one-time blip produces a single warning and then silence. Flapping is cyclical and persistent, and it's almost always exposing a real underlying condition — an overloaded server, a misconfigured check, or a network path with intermittent problems. Let's work through the causes systematically.
Root Cause 1: Health Check Too Aggressive
This is the first thing I reach for when I see flapping on a server that appears otherwise healthy. If your health check interval is very short — say, every 500ms or 1 second — and your fall threshold is set to 1 or 2, HAProxy will mark a backend DOWN after just one or two missed checks. Any transient delay, even a brief CPU spike or garbage collection pause on the backend, is enough to trip the threshold.
The relevant config parameters are inter, rise, and fall. A common misconfiguration looks like this:
backend app_backend
option httpchk GET /health
server app01 10.10.1.11:8080 check inter 500ms fall 1 rise 1
With fall 1, a single failed check marks the server DOWN. With inter 500ms, checks fire twice per second. With rise 1, a single passing check immediately brings it back UP. The result is a server that flaps on any transient delay — even a 600ms GC pause will cause a full down/up cycle. I have seen this misconfiguration copied from a development environment into production where the backend workload is completely different.
To identify this, look at the timestamps in your HAProxy logs. If the down/up cycles track almost exactly with your check interval, the check cadence itself is the trigger. You can also inspect the live server state via the admin socket:
infrarunbook-admin@sw-infrarunbook-01:~$ echo "show servers state app_backend" | socat stdio /var/run/haproxy/admin.sock
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state
2 app_backend 1 app01 10.10.1.11 2 0 100 100 3 6 3 2 6 0
2 app_backend 2 app02 10.10.1.12 2 0 100 100 41 6 3 3 6 0
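One way to confirm the cadence link is to compute the gap in seconds between consecutive state-change lines and compare it against your inter value. A small sketch (the gaps helper is illustrative, with sample lines inlined and same-day timestamps assumed; in practice, pipe in the matching lines from /var/log/haproxy.log):

```shell
#!/bin/sh
# Illustrative helper: print seconds between consecutive HAProxy
# state-change lines, using the day/HHMMSS timestamp in field 2.
# (Assumes all lines fall on the same day.)
gaps() {
  awk '{
    split($2, t, "/")   # t[2] holds HHMMSS
    s = substr(t[2],1,2)*3600 + substr(t[2],3,2)*60 + substr(t[2],5,2)
    if (NR > 1) print s - prev
    prev = s
  }'
}

# Sample input; in practice: grep app01 /var/log/haproxy.log | gaps
gaps <<'EOF'
[WARNING] 106/143021 (1) : Health check for server app_backend/app01 failed, reason: Layer4 timeout, check duration: 2001ms, status: 2/3 DOWN.
[WARNING] 106/143025 (1) : Health check for server app_backend/app01 succeeded, reason: Layer4 check passed, check duration: 4ms, status: 3/3 UP.
[WARNING] 106/143029 (1) : Health check for server app_backend/app01 failed, reason: Layer4 timeout, check duration: 2001ms, status: 2/3 DOWN.
EOF
```

Here the output is 4 and 4: gaps that march in lockstep with your check interval point at check cadence, not backend health.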
The fix is to tune the parameters to tolerate brief blips without declaring failure. A more production-appropriate configuration:
backend app_backend
option httpchk GET /health
server app01 10.10.1.11:8080 check inter 3000ms fastinter 1000ms downinter 2000ms fall 3 rise 2
Now the server needs to fail three consecutive checks before being marked DOWN, and only two consecutive successes to come back UP. The fastinter directive speeds up checks when a server is already in a transitioning state, and downinter controls the interval while it remains DOWN. This is a much more resilient baseline that absorbs transient delays without causing false state changes.
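As a sanity check when tuning these knobs, write out the approximate worst-case transition times. The arithmetic below assumes fastinter takes over once a server starts transitioning; verify the exact behavior against your HAProxy version's documentation:

```
# For: check inter 3000ms fastinter 1000ms downinter 2000ms fall 3 rise 2
#   time to mark DOWN ~= inter + (fall - 1) * fastinter    = 3s + 2*1s = 5s
#   time to mark UP   ~= downinter + (rise - 1) * fastinter = 2s + 1s  = 3s
# plus up to one check timeout per failed probe.
```

If the DOWN-detection estimate is longer than your tolerance for sending traffic to a dead backend, tighten fastinter rather than fall.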
Root Cause 2: Health Check Timeout Too Low
Closely related to aggressive check frequency, but distinct. Even with a sane inter value, if your timeout check is set lower than the time the backend occasionally takes to respond, checks will time out and register as failures — even when the backend is functionally healthy and processing real traffic without issue.
In my experience, this most commonly appears after a configuration change where someone tightened global timeouts for performance reasons without thinking about the health check path. Health check timing is entangled with those globals: when timeout check is set, HAProxy uses the smaller of timeout connect and inter for the connect phase of a check, with timeout check as the read timeout; when it is not set, inter alone bounds the entire check. Connect timeouts are often very aggressive — 500ms or less — because you want fast failure detection for real user connections, and that aggressiveness bleeds straight into the health check path.
The giveaway in the logs is that the failure reason says "Layer4 timeout" or "Layer7 timeout" consistently, and the check duration matches your timeout value exactly:
[WARNING] : Health check for server app_backend/app01 failed, reason: Layer4 timeout, check duration: 500ms, status: 2/3 DOWN.
[WARNING] : Health check for server app_backend/app01 succeeded, reason: Layer4 check passed, check duration: 11ms, status: 3/3 UP.
[WARNING] : Health check for server app_backend/app01 failed, reason: Layer4 timeout, check duration: 500ms, status: 2/3 DOWN.
The check duration being exactly 500ms — not 501, not 499, exactly 500 — is the giveaway that you're hitting a hard timeout ceiling. When the backend is not under pressure, it responds in 11ms. Occasionally, when it's briefly busy, it takes longer, hits the 500ms ceiling, and HAProxy registers a failure. The fix is to explicitly set a generous check timeout separate from your connect timeout:
defaults
timeout connect 500ms
timeout client 30s
timeout server 30s
timeout check 3s
Setting timeout check to 3 seconds gives your backends plenty of breathing room during occasional slow responses without affecting the latency characteristics of normal traffic. These two timeouts serve different purposes — timeout connect is about how long to wait to establish a connection to a backend under normal operation, while timeout check is about how long a health probe can take before being declared a failure. Tune them independently.
Root Cause 3: Backend CPU High Causing Slow Responses
This one is a genuine backend problem, not a configuration problem — but it manifests as flapping because an overloaded server takes too long to respond to health checks during periods of CPU saturation. The server is alive and eventually responding, but it's so busy that health check probes time out intermittently while the backend struggles to service real requests too.
The pattern here is that flapping correlates with load events. You'll see it during peak traffic windows, during batch job execution, or after a deployment that introduced a CPU-intensive regression. If you SSH to the backend while it's flapping, you'll typically see something like:
infrarunbook-admin@sw-infrarunbook-01:~$ ssh 10.10.1.11 "top -bn1 | head -5"
top - 14:32:17 up 12 days, 3:21, 1 user, load average: 14.82, 12.41, 9.73
Tasks: 312 total, 4 running, 308 sleeping, 0 stopped, 0 zombie
%Cpu(s): 96.3 us, 2.1 sy, 0.0 ni, 1.4 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 15872.0 total, 312.4 free, 14891.2 used, 668.4 buff/cache
MiB Swap: 4096.0 total, 3841.2 free, 254.8 used. 692.4 avail Mem
A load average of 14.82 on a host that probably has 4 or 8 cores — that's your problem. The application is consuming nearly all available CPU, so when HAProxy sends a health check request the TCP stack accepts the connection, but no application worker is free to generate a response within the check timeout window.
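To turn that gut read into a number, compare the 1-minute load average against the backend's core count. A Linux-specific sketch (the 1.5x threshold is a judgment call, not a standard):

```shell
#!/bin/sh
# Rough saturation check: a 1-minute load average well above the core
# count means runnable work is queuing for CPU.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)

# awk handles the float comparison that plain sh cannot
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit ((l > c * 1.5) ? 0 : 1) }'
then
  echo "load ${load1} on ${cores} cores: CPU-saturated, investigate the backend"
else
  echo "load ${load1} on ${cores} cores: tolerable"
fi
```

Run it on the backend itself (over SSH from the HAProxy host works fine) during a flapping window.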
Cross-reference with HAProxy logs to confirm the timing correlation:
infrarunbook-admin@sw-infrarunbook-01:~$ grep "app01" /var/log/haproxy.log | grep -E "(DOWN|UP)" | awk '{print $1, $2, $NF}' | tail -20
If the DOWN events align with periods of high load as shown in your monitoring history, you've confirmed the cause. The fix is two-pronged. Short term: raise timeout check so that slow responses during CPU pressure don't register as immediate failures. Longer term: fix the CPU issue — profile the application, look for runaway threads, check for missing database indexes causing expensive queries, or add capacity. HAProxy configuration can mask the symptom temporarily, but a server running at 96% CPU is not healthy and will eventually fail in other ways that no amount of health check tuning will hide.
While you're investigating, you can reduce traffic to the struggling backend without removing it entirely:
infrarunbook-admin@sw-infrarunbook-01:~$ echo "set server app_backend/app01 weight 10" | socat stdio /var/run/haproxy/admin.sock
This drops app01's relative weight from the default of 100 to 10, sending it roughly 9% of traffic while the other backends absorb the majority of the load. This buys you time to diagnose without taking the backend fully out of rotation.
Root Cause 4: Network Jitter
Network jitter — irregular latency spikes on the path between HAProxy and the backend — is a particularly sneaky cause of flapping because the backends themselves are completely healthy. The problem lives in the network fabric: a congested switch, a flapping uplink, an oversubscribed trunk, or a faulty cable or SFP causing intermittent packet loss or retransmissions.
The tell that separates network jitter from a backend problem is that multiple backends flap simultaneously, or flapping events correlate with overall network congestion rather than anything backend-specific. If app01 and app02 both transition to DOWN at the same timestamp and recover together moments later, the cause is almost certainly the network path they share, not individual backend problems.
Start by checking raw round-trip latency from the HAProxy host to the backend:
infrarunbook-admin@sw-infrarunbook-01:~$ ping -i 0.2 -c 100 10.10.1.11
PING 10.10.1.11 (10.10.1.11) 56(84) bytes of data.
64 bytes from 10.10.1.11: icmp_seq=1 ttl=64 time=0.412 ms
64 bytes from 10.10.1.11: icmp_seq=2 ttl=64 time=0.389 ms
64 bytes from 10.10.1.11: icmp_seq=47 ttl=64 time=84.312 ms
64 bytes from 10.10.1.11: icmp_seq=48 ttl=64 time=0.401 ms
--- 10.10.1.11 ping statistics ---
100 packets transmitted, 99 received, 1% packet loss, time 19839ms
rtt min/avg/max/mdev = 0.389/1.241/84.312/8.331 ms
An rtt mdev of 8.331ms with a max of 84ms is dramatic jitter. On a healthy local LAN you'd expect sub-millisecond mdev. That single 84ms spike in a 20-second window is enough to cause a health check timeout failure. Get path-level detail with mtr:
infrarunbook-admin@sw-infrarunbook-01:~$ mtr --report --report-cycles 60 10.10.1.11
Start: 2026-04-17T14:35:00+0000
HOST: sw-infrarunbook-01 Loss% Snt Last Avg Best Wrst StDev
1.|-- 10.10.0.1 0.0% 60 0.4 0.5 0.3 1.1 0.2
2.|-- 10.10.1.1 3.3% 60 0.9 12.4 0.7 84.1 18.7
3.|-- 10.10.1.11 3.3% 60 1.1 12.8 0.8 85.2 18.9
The loss and jitter originate at hop 2 — the device at 10.10.1.1. That's the switch or router connecting the backend subnet. That's where you need to look: check switch port error counters, examine interface utilization graphs, inspect SFP transceiver diagnostics if applicable, and look for duplex mismatches on the affected ports.
From the HAProxy side, while you're chasing the network issue, increase your check thresholds to tolerate jitter bursts without triggering state changes:
backend app_backend
option httpchk GET /healthz
timeout check 3s
server app01 10.10.1.11:8080 check inter 5000ms fall 4 rise 3
server app02 10.10.1.12:8080 check inter 5000ms fall 4 rise 3
With fall 4 and inter 5000ms, the backend needs to fail four consecutive 5-second checks — 20 seconds of sustained failure — before HAProxy marks it DOWN. That's long enough to absorb most jitter bursts without causing false flapping, while still catching a genuinely dead backend within a reasonable window.
Root Cause 5: Intermittent TCP RST from Backend
Sometimes the backend isn't timing out — it's actively resetting the connection. HAProxy logs a Layer4 connection problem rather than a timeout, and the failure is near-instant rather than taking as long as your timeout value.
[WARNING] : Health check for server app_backend/app01 failed, reason: Layer4 connection problem, info: "Connection reset by peer", check duration: 1ms, status: 2/3 DOWN.
A 1ms check duration with a connection reset is not a timeout — it's an immediate rejection. The TCP handshake is being aborted before it completes. This happens for several reasons: a stateful firewall or iptables rule issuing RSTs to connections it considers invalid or rate-limited; a backend application that's bound to a port but not accepting new connections during a restart cycle; a TCP connection-tracking issue causing the backend's networking stack to RST connections it thinks are stale; or a half-open listening socket that exists but isn't being serviced.
Capture it directly from the HAProxy host to confirm:
infrarunbook-admin@sw-infrarunbook-01:~$ tcpdump -i eth0 -nn host 10.10.1.11 and port 8080 -c 50
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:41:02.112334 IP 10.10.0.5.54312 > 10.10.1.11.8080: Flags [S], seq 1234567890, win 64240, options [mss 1460,sackOK,TS val 1234567 ecr 0,nop,wscale 7], length 0
14:41:02.112801 IP 10.10.1.11.8080 > 10.10.0.5.54312: Flags [R.], seq 0, ack 1234567891, win 0, length 0
That Flags [R.] in the response is the RST-ACK. HAProxy sends a SYN, and the backend immediately responds with a reset. The connection never completes, and the health check fails in under a millisecond.
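Before blaming a firewall, confirm on the backend itself whether anything is actually listening on the port and whether the accept queue is backing up. A quick check (run locally on the backend; port is this example's):

```shell
# Show listening sockets on 8080 with their accept-queue depth.
# Recv-Q approaching Send-Q (the backlog limit) means the app isn't
# calling accept() fast enough; no output at all means nothing is
# listening on that port.
ss -ltn 'sport = :8080'
```

A socket that is present but with a full accept queue matches the "bound but not accepting" failure mode described above.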
If the RST is coming from a firewall or iptables rule, audit the rules on the backend host:
infrarunbook-admin@sw-infrarunbook-01:~$ ssh 10.10.1.11 "iptables -L INPUT -n -v --line-numbers"
Chain INPUT (policy ACCEPT)
num pkts bytes target prot opt in out source destination
1 8421 992K ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED
2 143 8580 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 limit: avg 10/min burst 5
3 57 3420 REJECT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 reject-with tcp-reset
There it is — rule 2 accepts at most 10 new connections to port 8080 per minute (burst of 5), and anything over that budget falls through to rule 3, which answers with a TCP reset. (A plain DROP would produce timeouts instead; an RST means something is actively rejecting.) HAProxy's health checks open a new TCP connection every few seconds, and together with real traffic they easily blow through that limit. The fix is to exempt the HAProxy host from the limit entirely, for example by inserting an ACCEPT rule for its source IP ahead of the limited rule: iptables -I INPUT 2 -p tcp -s 10.10.0.5 --dport 8080 -j ACCEPT (10.10.0.5 being the HAProxy host in this capture).
If the RST is coming from an application restart cycle — the service briefly drops its listening socket during a rolling restart — the fix is a proper readiness endpoint that only becomes active once the application is fully initialized. HAProxy's option httpchk with a readiness-specific path is far more reliable than a raw TCP check for application-level health:
backend app_backend
option httpchk GET /ready
http-check expect status 200
Your application's /ready endpoint should return 200 only after database connections are established, caches are warm, and the service is genuinely ready to handle requests. During a restart it returns 503 or simply doesn't accept connections yet, which HAProxy handles gracefully without the RST-induced confusion.
Root Cause 6: Misconfigured Rise/Fall Asymmetry
Even with a sane check interval and a reasonable timeout, an asymmetric rise/fall configuration can cause subtle but persistent flapping. The most problematic pattern is a high fall with a very low rise — specifically rise 1. With this configuration, the server needs multiple consecutive failures to go DOWN but only a single success to come back UP — even if that success is immediately followed by more failures.
The sequence plays out like this: the server fails three checks in a row, gets marked DOWN, then succeeds once and immediately returns to UP — even though the underlying issue hasn't resolved. It fails again, goes DOWN again, succeeds once, bounces back UP. Traffic is being sent to this backend during every brief UP window, causing errors for users each time. The server is barely functional but never stays DOWN long enough for anyone to properly investigate.
Require at least two or three consecutive successes before restoring a backend to active rotation:
server app01 10.10.1.11:8080 check inter 3000ms fall 3 rise 3
A symmetric threshold is more forgiving of transient issues in both directions and ensures that a backend demonstrates sustained health before receiving production traffic again. Don't be afraid to set rise higher than fall for critical backends where the cost of sending traffic to a recovering-but-not-ready server is high.
Root Cause 7: Application-Level Failures During Deploys
Flapping that occurs in predictable short windows — right after a deployment completes — is usually caused by health check endpoints returning errors during application startup. The process is running and accepting TCP connections (which would pass a Layer4 check), but the application isn't ready: database connection pools aren't established, internal caches aren't populated, required background threads haven't started. HAProxy's HTTP health check sees the 500 or the connection drop and marks the backend DOWN.
[WARNING] : Health check for server app_backend/app01 failed, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 23ms, status: 2/3 DOWN.
The backend comes UP after initialization completes, then DOWN again if another instance is cycling through the same startup sequence. In a rolling deploy across several backends, this can look exactly like random flapping.
The cleanest solution is explicit drain/ready control via the admin socket as part of your deploy process. Put the backend into maintenance mode before the deploy touches it, and restore it only after the new process passes readiness checks:
infrarunbook-admin@sw-infrarunbook-01:~$ echo "set server app_backend/app01 state maint" | socat stdio /var/run/haproxy/admin.sock
# run deployment on app01...
# wait for application readiness check to pass...
infrarunbook-admin@sw-infrarunbook-01:~$ curl -sf http://10.10.1.11:8080/ready && echo "app01 ready"
app01 ready
infrarunbook-admin@sw-infrarunbook-01:~$ echo "set server app_backend/app01 state ready" | socat stdio /var/run/haproxy/admin.sock
This explicitly removes the server from rotation during the deploy window without triggering any health check logic, then restores it cleanly. No flapping, no user-visible errors, no confusing alerts. I consider this pattern mandatory for any zero-downtime deployment strategy with HAProxy.
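Scripted, the re-enable gate might look like the sketch below. The URL, socket path, and retry budget are assumptions to adapt to your environment:

```shell
#!/bin/sh
# wait_ready URL [TRIES]: poll a readiness endpoint until it returns 2xx,
# sleeping 2s between attempts; return 0 on ready, 1 on timeout.
wait_ready() {
  url=$1; tries=${2:-30}; i=0
  while [ "$i" -lt "$tries" ]; do
    curl -sf --connect-timeout 2 --max-time 5 -o /dev/null "$url" && return 0
    i=$((i + 1)); sleep 2
  done
  return 1
}

# Gate the re-enable step on readiness (use a larger budget, ~30 tries,
# in a real deploy; 3 here keeps the example quick).
if wait_ready "http://10.10.1.11:8080/ready" 3; then
  echo "set server app_backend/app01 state ready" \
    | socat stdio /var/run/haproxy/admin.sock
else
  echo "app01 never became ready; leaving it in maint" >&2
fi
```

Because the server stays in maint until the gate passes, a slow or failed startup produces a loud deploy failure instead of a flapping backend.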
Prevention
Flapping is a symptom. Every instance of it is telling you something real about your infrastructure — either your configuration doesn't match your backend's actual behavior, or there's a genuine problem in the backend or network that needs addressing. The goal of prevention isn't to hide flapping; it's to ensure your health check configuration is calibrated well enough that when you do see flapping, it means something.
Start with a production-grade baseline configuration that you apply consistently across all backends:
defaults
timeout connect 1s
timeout client 30s
timeout server 30s
timeout check 3s
backend app_backend
option httpchk GET /ready
http-check expect status 200
server app01 10.10.1.11:8080 check inter 5000ms fastinter 1500ms downinter 3000ms fall 3 rise 3
server app02 10.10.1.12:8080 check inter 5000ms fastinter 1500ms downinter 3000ms fall 3 rise 3
Enable the stats socket so you can always inspect live state without log-diving:
global
stats socket /var/run/haproxy/admin.sock mode 660 level admin expose-fd listeners
stats timeout 30s
frontend stats_http
bind 127.0.0.1:8404
stats enable
stats uri /haproxy-stats
stats refresh 5s
stats auth infrarunbook-admin:use-a-real-password-here
stats show-legends
stats show-node
Set up monitoring that alerts on the rate of DOWN/UP transitions, not just current state. A backend that's technically UP at any given polling moment but has transitioned DOWN fifteen times in the past hour is in trouble — state-based alerting misses this completely. Forward HAProxy logs to a log aggregation stack and write a rule that counts DOWN transitions per server per 10-minute window and fires when that count exceeds a threshold.
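If you don't have a log pipeline yet, the counting logic itself is small. A sketch (the "is DOWN" message format and the log path are assumptions; adapt the pattern to the transition lines your HAProxy version actually emits):

```shell
#!/bin/sh
# flap_count THRESHOLD: read HAProxy log lines on stdin, count "is DOWN"
# transitions per backend/server, and print those at or above THRESHOLD.
flap_count() {
  awk -v th="${1:-5}" '
    match($0, /[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+ is DOWN/) {
      s = substr($0, RSTART, RLENGTH)
      sub(/ is DOWN/, "", s)
      n[s]++
    }
    END { for (s in n) if (n[s] >= th) print s, n[s] }'
}

# Usage: tail -n 5000 /var/log/haproxy.log | flap_count 5
```

Wire the same counting into whatever scheduler or alerting hook you have; anything it prints is a server that is flapping, not merely down.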
For catching network-layer issues before they cause flapping, run continuous latency monitoring between your HAProxy host and each backend. A simple cron entry that logs ping statistics gives you historical data to correlate against future incidents:
# /etc/cron.d/haproxy-latency-monitor
* * * * * infrarunbook-admin ping -c 10 -q 10.10.1.11 >> /var/log/haproxy-latency-app01.log 2>&1
* * * * * infrarunbook-admin ping -c 10 -q 10.10.1.12 >> /var/log/haproxy-latency-app02.log 2>&1
For production environments, a tool like Smokeping gives you a historical latency graph with jitter and loss visualization that makes network path problems immediately obvious.
Finally, document your rise/fall thresholds and their rationale in your runbook. When someone is woken up at 3am by flapping alerts, they need to know whether "server DOWN for 15 seconds then UP" is an expected behavior during the check hysteresis window or a genuine emergency requiring escalation. That context, written down, prevents unnecessary panic and allows operators to make informed decisions quickly. Get the health check tuning right once, and you'll actually trust your alerts when they fire — because they'll be telling you something real.
