Symptoms
When HAProxy marks one or more backend servers as DOWN, the effects are immediate and often visible to end users. Understanding what to look for is the first step toward a fast resolution. Common symptoms include:
- Clients receive HTTP 503 Service Unavailable responses, particularly when all backend pool members are DOWN
- The HAProxy statistics page shows affected servers highlighted in red with a DOWN status label
- Monitoring and alerting systems fire notifications for backend availability or upstream connection failures
- HAProxy logs in /var/log/haproxy.log contain entries referencing Layer4 or Layer7 check failures
- Traffic concentrates onto a reduced set of healthy servers, potentially overloading them and triggering cascading failures
- Application logs on upstream services report connection errors, ECONNREFUSED, or upstream timeout messages
A typical log entry on sw-infrarunbook-01 when a backend is marked DOWN looks like this:
Apr 4 09:14:22 sw-infrarunbook-01 haproxy[12345]: Server app_backend/web01 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 1ms, 0 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
On the stats page accessible at http://192.168.10.100:8080/haproxy?stats, the affected backend row appears as:
web01 192.168.10.20:80 DOWN 0/3 L4CON - - -
The 0/3 column shows zero successful checks out of the last three attempts, and L4CON identifies a failure at the TCP connection layer. Different failure codes point to different root causes, which the following sections address in full.
Root Cause 1: Health Check Misconfigured
Why It Happens
HAProxy uses health checks to determine whether a backend server is capable of handling traffic. When the health check configuration does not match what the backend actually serves — wrong URI path, unexpected HTTP response code, or wrong check type — HAProxy will repeatedly fail checks and eventually mark the server as DOWN, even when the backend is perfectly healthy and serving real traffic without issue.
Common misconfiguration scenarios include:
- Using a plain TCP check (check) when the application expects an HTTP probe with a specific Host header
- Sending an HTTP health check to the wrong URI path (e.g., /health returns 404 instead of 200)
- Expecting a specific HTTP status code such as 200 when the server legitimately returns 204 or 302
- Setting rise and fall thresholds too aggressively, causing a server to be marked DOWN on the first transient slowdown
How to Identify It
Inspect the backend stanza in your HAProxy configuration:
grep -A 20 'backend app_backend' /etc/haproxy/haproxy.cfg
backend app_backend
balance roundrobin
option httpchk GET /status
http-check expect status 200
server web01 192.168.10.20:80 check inter 2s fall 3 rise 2
Now verify what the backend actually returns at that path:
curl -v http://192.168.10.20:80/status
* Connected to 192.168.10.20 (192.168.10.20) port 80
> GET /status HTTP/1.1
> Host: 192.168.10.20
< HTTP/1.1 204 No Content
< Connection: keep-alive
The server returns 204 No Content but HAProxy is configured to expect 200 OK. Every health check fails, and after three consecutive failures HAProxy marks the server DOWN.
How to Fix It
Update the http-check expect directive to match the real response code. Using a regex range is more resilient to minor application changes:
backend app_backend
balance roundrobin
option httpchk GET /status
http-check expect rstatus (2|3)[0-9][0-9]
server web01 192.168.10.20:80 check inter 2s fall 3 rise 2
If the application has no dedicated health endpoint, use a HEAD request to the root path with a wide status code range:
backend app_backend
balance roundrobin
option httpchk HEAD / HTTP/1.1\r\nHost:\ 192.168.10.20
http-check expect status 200-404
server web01 192.168.10.20:80 check inter 2s fall 3 rise 2
Validate the configuration and reload HAProxy without dropping live connections:
haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy
Tail the log to confirm the server recovers:
tail -f /var/log/haproxy.log | grep 'web01'
Apr 4 09:22:10 sw-infrarunbook-01 haproxy[12345]: Server app_backend/web01 is UP, reason: Layer7 check passed, code: 204, check duration: 3ms, 1 active and 0 backup servers online.
Root Cause 2: Port Mismatch
Why It Happens
A port mismatch occurs when HAProxy is configured to connect to a backend server on a port that is not actively listening. This is common after application deployments where a service migrates from one port to another (for example, from port 80 to port 8080 after containerization), or when a new team member deploys a service on a non-standard port without updating the HAProxy configuration. HAProxy receives a TCP RST from the kernel or a Connection refused error and immediately fails the health check.
How to Identify It
The log entry is unmistakable — the check duration is near zero because the TCP connection is rejected instantly:
grep 'web01' /var/log/haproxy.log | tail -5
Apr 4 10:05:33 sw-infrarunbook-01 haproxy[12345]: Server app_backend/web01 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms
Confirm what the backend server is actually listening on:
ssh infrarunbook-admin@192.168.10.20 "ss -tlnp | grep -E 'State|80|443|8080|8443'"
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 511 0.0.0.0:8080 0.0.0.0:* users:(("node",pid=4421,fd=22))
The application is now listening on port 8080 but HAProxy is configured for port 80. A quick test from sw-infrarunbook-01 confirms the mismatch:
nc -zv 192.168.10.20 80
nc: connect to 192.168.10.20 port 80 (tcp) failed: Connection refused
nc -zv 192.168.10.20 8080
Connection to 192.168.10.20 8080 port [tcp/*] succeeded!
How to Fix It
Update the server line in the HAProxy backend stanza to use the correct port:
# Before
server web01 192.168.10.20:80 check inter 2s fall 3 rise 2
# After
server web01 192.168.10.20:8080 check inter 2s fall 3 rise 2
Validate and reload:
haproxy -c -f /etc/haproxy/haproxy.cfg
Configuration file is valid
systemctl reload haproxy
Root Cause 3: SSL Mismatch
Why It Happens
SSL/TLS mismatches between HAProxy and a backend are a subtle but frequent cause of servers being marked DOWN. Unlike a port mismatch that fails instantly at Layer 4, SSL failures manifest at Layer 6 during the TLS handshake. This can occur when:
- HAProxy is configured with ssl on the server line but the backend does not accept TLS at all
- The backend uses a self-signed or internally-signed certificate and HAProxy is set to verify required against the system CA bundle
- There is a protocol version incompatibility (e.g., the backend enforces TLSv1.3 but HAProxy negotiates TLSv1.1)
- The SNI hostname HAProxy sends does not match the certificate CN or Subject Alternative Name on the backend
- HAProxy omits ssl on the server line but the backend only accepts HTTPS connections
How to Identify It
SSL failures appear in logs as Layer6 errors:
Apr 4 11:30:12 sw-infrarunbook-01 haproxy[12345]: Server app_backend/web01 is DOWN, reason: Layer6 invalid response, info: "SSL handshake failure", check duration: 12ms
For certificate verification failures specifically:
Apr 4 11:30:14 sw-infrarunbook-01 haproxy[12345]: Server app_backend/web01 is DOWN, reason: Layer6 invalid response, info: "SSL certificate verification failure", check duration: 15ms
Test the TLS connection manually from sw-infrarunbook-01 using the same parameters HAProxy would use:
openssl s_client -connect 192.168.10.20:443 -servername api.solvethenetwork.com
CONNECTED(00000003)
depth=0 CN = api.solvethenetwork.com
verify error:num=18:self signed certificate
verify return:1
---
Certificate chain
0 s:CN = api.solvethenetwork.com
i:CN = api.solvethenetwork.com
---
SSL handshake has read 1338 bytes and written 444 bytes
Now review the HAProxy server line:
grep 'server web01' /etc/haproxy/haproxy.cfg
server web01 192.168.10.20:443 check ssl verify required ca-file /etc/ssl/certs/ca-certificates.crt inter 2s fall 3 rise 2
The backend presents a self-signed certificate. Because verify required checks against the system CA bundle (which does not include the self-signed cert), every check fails at the SSL layer.
How to Fix It
For internal backends using self-signed certificates on a trusted private network, disable certificate verification. For production setups, provide the specific internal CA certificate instead:
# Option 1: Disable verification (acceptable for internal private networks only)
server web01 192.168.10.20:443 check ssl verify none inter 2s fall 3 rise 2
# Option 2: Provide internal CA cert
server web01 192.168.10.20:443 check ssl verify required ca-file /etc/haproxy/certs/internal-ca.crt inter 2s fall 3 rise 2
If the backend does not accept SSL at all and the server line incorrectly includes ssl, simply remove it:
# Before
server web01 192.168.10.20:80 check ssl verify none inter 2s fall 3 rise 2
# After
server web01 192.168.10.20:80 check inter 2s fall 3 rise 2
Validate and reload:
haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy
Root Cause 4: Timeout Too Low
Why It Happens
Every HAProxy health check is governed by a timeout. If the backend does not respond within that window, HAProxy treats the check as a failure and increments the fall counter. This is especially problematic for:
- Backends under heavy load that respond slowly to health check probes
- Application health endpoints that perform database connectivity queries, cache warm-up, or dependency checks as part of the response
- Virtualized or containerized backends that experience CPU steal, I/O contention, or garbage collection pauses
- Cold-start scenarios where an application has just been (re)started and its connection pools are not yet warmed up
How to Identify It
The key indicator in the log is that the check duration equals the configured timeout value exactly:
Apr 4 12:15:44 sw-infrarunbook-01 haproxy[12345]: Server app_backend/web01 is DOWN, reason: Layer4 timeout, check duration: 2000ms
A check duration of 2000ms that exactly matches your configured timeout confirms the check hit the ceiling rather than receiving a genuine failure response. Inspect the current timeout settings:
grep -E 'timeout|inter|fall|rise' /etc/haproxy/haproxy.cfg
timeout connect 2s
timeout client 30s
timeout server 30s
timeout check 2s
Now measure the actual backend health endpoint response time under realistic load:
time curl -s -o /dev/null -w "%{http_code}" http://192.168.10.20:80/health
200
real 0m3.421s
The backend legitimately returns 200 but takes 3.4 seconds. The 2-second check timeout causes every probe to time out before the response arrives.
How to Fix It
Increase timeout check to accommodate the backend's real response time. Set it to at least 2x the 99th percentile response time of your health endpoint under load:
defaults
timeout connect 5s
timeout client 30s
timeout server 30s
timeout check 8s
To override only for a specific backend without changing global defaults:
backend app_backend
timeout check 8s
option httpchk GET /health
http-check expect status 200
server web01 192.168.10.20:80 check inter 10s fall 3 rise 2
Simultaneously, work to reduce the health endpoint's response time. If it queries a database for connectivity verification, ensure the query uses a connection pool and a short statement timeout. Ideally a health endpoint should respond in under 100ms. Reload after changes:
systemctl reload haproxy
Root Cause 5: Backend Crash
Why It Happens
Sometimes HAProxy marks a server DOWN because it genuinely is down. The application process may have exited due to an unhandled exception, a failed dependency check at startup, an out-of-memory (OOM) kill, a deployment rollout failure, or a kernel panic. In these cases, HAProxy is performing exactly as designed — protecting users from routing traffic to a dead server.
How to Identify It
SSH to the backend and check the service status immediately:
ssh infrarunbook-admin@192.168.10.20 "systemctl status app-service"
● app-service.service - Application Service
Loaded: loaded (/etc/systemd/system/app-service.service; enabled)
Active: failed (Result: exit-code) since Fri 2026-04-04 12:30:00 UTC; 5min ago
Process: 9812 ExecStart=/usr/bin/node /opt/app/server.js (code=exited, status=1/FAILURE)
Main PID: 9812 (code=exited, status=1/FAILURE)
Review the application logs for the crash root cause:
ssh infrarunbook-admin@192.168.10.20 "journalctl -u app-service --since '30 minutes ago' | tail -30"
Apr 04 12:29:58 192.168.10.20 node[9812]: Error: connect ECONNREFUSED 192.168.10.50:5432
Apr 04 12:29:58 192.168.10.20 node[9812]: Unhandled promise rejection -- exiting
Apr 04 12:30:00 192.168.10.20 systemd[1]: app-service.service: Main process exited, code=exited, status=1/FAILURE
Check for OOM kills, which often leave no application-level trace:
ssh infrarunbook-admin@192.168.10.20 "dmesg | grep -i 'oom\|killed process' | tail -10"
[185432.123456] Out of memory: Killed process 9812 (node) total-vm:1524432kB, anon-rss:1020212kB, file-rss:4kB, shmem-rss:0kB
How to Fix It
The resolution depends entirely on the crash cause. For a database connection failure (ECONNREFUSED to 192.168.10.50:5432), restore connectivity to the PostgreSQL instance and restart:
ssh infrarunbook-admin@192.168.10.20 "systemctl restart app-service"
● app-service.service - Application Service
Active: active (running) since Fri 2026-04-04 12:45:00 UTC; 3s ago
Main PID: 10234 (node)
For OOM kills, increase the host memory, reduce the application heap limit, or add a memory limit with an automatic restart policy in the service unit:
# /etc/systemd/system/app-service.service
[Service]
MemoryMax=1G
Restart=on-failure
RestartSec=5s
Once the application is back up, HAProxy detects recovery on the next successful health check cycle and automatically marks the server UP:
Apr 4 12:45:10 sw-infrarunbook-01 haproxy[12345]: Server app_backend/web01 is UP, reason: Layer7 check passed, code: 200, check duration: 4ms, 1 active and 0 backup servers online.
Root Cause 6: Firewall Blocking Health Check Traffic
Why It Happens
HAProxy sends health check packets originating from the load balancer's IP address to the backend's IP and port. If a host-based firewall on the backend, an intermediate security group, or a network ACL drops packets from the HAProxy source IP, the health check fails even though the application is running normally and may even be serving real user traffic via a different path. This produces a situation where the application is healthy but HAProxy cannot determine that.
How to Identify It
The log entry looks identical to a genuine connection refusal. The distinguishing test is that the application works from the backend itself but not from the HAProxy host. Test from sw-infrarunbook-01:
nc -zv 192.168.10.20 80
nc: connect to 192.168.10.20 port 80 (tcp) failed: No route to host
Compare with a test from the backend host itself:
ssh infrarunbook-admin@192.168.10.20 "curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:80/health"
200
The discrepancy confirms a network or firewall issue rather than an application problem. Inspect iptables rules on the backend:
ssh infrarunbook-admin@192.168.10.20 "iptables -L INPUT -n -v --line-numbers | grep -E '80|DROP|REJECT'"
5 0 0 REJECT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 reject-with icmp-host-prohibited
A blanket REJECT rule on port 80 is blocking health check probes from sw-infrarunbook-01 (192.168.10.100); rejecting the TCP SYN with icmp-host-prohibited is what produces the "No route to host" error seen above.
How to Fix It
Insert an ACCEPT rule for health check traffic from the HAProxy source IP before the REJECT rule:
ssh infrarunbook-admin@192.168.10.20 "iptables -I INPUT 1 -s 192.168.10.100 -p tcp --dport 80 -j ACCEPT"
Persist the rule across reboots:
ssh infrarunbook-admin@192.168.10.20 "iptables-save > /etc/iptables/rules.v4"
Root Cause 7: Backend Resource Exhaustion
Why It Happens
A backend server that is alive but saturated will fail to accept new TCP connections even though the process is running. This can occur when the application has exhausted its open file descriptor limit (since every socket consumes a file descriptor), the OS listen backlog queue is full, or the application's thread pool or connection pool is completely occupied with long-running requests. HAProxy sees the connection refusal or timeout and marks the server DOWN. This is self-reinforcing: marking one server DOWN pushes more load onto the remaining servers, potentially triggering further failures.
How to Identify It
Check the application process's file descriptor usage on the backend:
ssh infrarunbook-admin@192.168.10.20 "cat /proc/\$(pgrep -f server.js)/limits | grep 'open files'"
Max open files 1024 1024 files
ssh infrarunbook-admin@192.168.10.20 "ls /proc/\$(pgrep -f server.js)/fd | wc -l"
1021
The process is at 1021 out of a maximum of 1024 open file descriptors — three away from saturation. Any new connection attempt (including health check probes) will be refused. Check system-wide socket stats too:
ssh infrarunbook-admin@192.168.10.20 "ss -s"
Total: 4096 (kernel 4100)
TCP: 4090 (estab 3980, closed 110, orphaned 0, synrecv 0, timewait 98/0), ports 0
Transport Total IP IPv6
* 4100 - -
RAW 0 0 0
UDP 4 4 0
TCP 3980 3980 0
INET 3984 3984 0
FRAG 0 0 0
How to Fix It
Increase the open file descriptor limit for the application service unit:
# /etc/systemd/system/app-service.service
[Service]
LimitNOFILE=65535
ssh infrarunbook-admin@192.168.10.20 "systemctl daemon-reload && systemctl restart app-service"
On the HAProxy side, use maxconn per server to cap the connection count before the backend reaches saturation, giving you a controlled queue rather than an abrupt refusal:
server web01 192.168.10.20:80 check inter 2s fall 3 rise 2 maxconn 200
Prevention
Preventing unexpected backend server downtime in HAProxy requires proactive configuration discipline, robust monitoring, and codified operational procedures.
- Use application-level health checks: Always prefer option httpchk with a dedicated health endpoint over raw TCP checks. The endpoint should verify the application, its database connections, and any critical dependencies simultaneously. It must respond in under 200ms under normal load.
- Set appropriate timeouts based on measurement: Baseline your backend response times under realistic load and set timeout check to at least 2x the 99th percentile response time of your health endpoint. Do not rely on defaults.
- Tune rise and fall thresholds carefully: Use fall 3 to require three consecutive failures before marking a server DOWN, avoiding false positives from transient network glitches. Use rise 2 to require two consecutive successes before marking a server back UP, preventing flapping.
- Enable the admin socket: Configure the UNIX socket interface so you can enable, disable, and drain servers dynamically without a configuration reload:
stats socket /var/run/haproxy/admin.sock mode 660 level admin expose-fd listeners
- Ship logs to a centralized system and alert: Forward /var/log/haproxy.log to your log aggregation platform. Alert immediately on any is DOWN event so engineers investigate before all backends fail.
- Version-control haproxy.cfg: Store /etc/haproxy/haproxy.cfg in a git repository. Run haproxy -c -f /etc/haproxy/haproxy.cfg in your CI/CD pipeline on every proposed change to catch misconfigurations before they reach production.
- Document port and TLS configuration per service: Maintain a service registry that records each backend's expected port, protocol, and certificate authority. Update the HAProxy configuration as an explicit step in every backend deployment runbook.
- Cap per-server connections with maxconn: Use the maxconn directive per server to apply back-pressure before resource exhaustion causes health checks to fail. A queued request is better than a refused connection.
- Test health checks independently and regularly: Schedule periodic tests from sw-infrarunbook-01 using curl and openssl s_client against each backend, verifying the health check path is reachable and returns the expected response independent of real user traffic paths.
- Use slowstart for freshly recovered servers: Add slowstart 30s to server definitions so that a server returning to the UP state receives gradually increasing traffic rather than being flooded at once, reducing the risk of it being overwhelmed and marked DOWN again.
Frequently Asked Questions
Q: What does "L4CON" mean in the HAProxy stats page?
A: L4CON stands for Layer 4 Connection problem. It means HAProxy attempted a TCP connection to the backend server and received either a connection refused (RST) or a no-route-to-host error. The server's port is either not listening or is blocked by a firewall. This is distinct from L4TOUT (timeout at Layer 4) and L7STS (Layer 7 HTTP status failure).
Q: How do I manually re-enable a server that HAProxy has marked DOWN?
A: Use the HAProxy admin socket. If the server was administratively disabled (MAINT or DRAIN state), running echo "set server app_backend/web01 state ready" | socat stdio /var/run/haproxy/admin.sock returns it to normal operation. If it is DOWN because health checks are failing, you can force its health state with echo "set server app_backend/web01 health up" | socat stdio /var/run/haproxy/admin.sock, but if the underlying issue is not fixed, HAProxy will mark it DOWN again after the next failed check cycle.
Q: What is the difference between fall and rise thresholds in HAProxy?
A: The fall threshold is the number of consecutive failed health checks required before HAProxy marks a server as DOWN. The rise threshold is the number of consecutive successful checks required before a DOWN server is marked back UP. Setting fall 3 rise 2 means the server must fail three times in a row to go DOWN, and must succeed twice in a row to come back UP. This prevents flapping caused by transient network issues.
Q: Can I check which backend servers are currently DOWN from the command line without using the stats web page?
A: Yes. Query the admin socket directly: echo "show servers state" | socat stdio /var/run/haproxy/admin.sock. This returns a table with each server's current state, check result, and timing data. You can also parse the HAProxy stats CSV via echo "show stat" | socat stdio /var/run/haproxy/admin.sock | cut -d',' -f1,2,18,19 to extract backend/server names and their status fields.
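The CSV can also be filtered with awk to list only real servers and their status. A sketch with a trimmed sample of the CSV inlined; in production, pipe from the admin socket instead:

```shell
# Filter "show stat" CSV down to backend/server name plus status (field 18).
# A trimmed sample is inlined so the sketch runs anywhere; in production,
# replace the heredoc with:
#   echo "show stat" | socat stdio /var/run/haproxy/admin.sock
cat <<'EOF' | awk -F',' 'NR > 1 && $2 != "FRONTEND" && $2 != "BACKEND" { print $1 "/" $2, $18 }'
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight
app_backend,web01,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1
app_backend,web02,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1
app_backend,BACKEND,0,0,0,0,200,0,0,0,0,0,,0,0,0,0,DOWN,1
EOF
# Prints:
# app_backend/web01 DOWN
# app_backend/web02 UP
```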
Q: What is the default health check behavior if I do not explicitly configure one?
A: If you add check to a server line without any option httpchk or similar directive, HAProxy performs a basic TCP connect check: it opens a TCP connection to the server's IP and port, and if the TCP handshake completes, the check passes. It does not send any data or validate the HTTP response. This means the check passes even if the application inside the TCP listener has crashed, as long as something is accepting connections on that port.
Q: Why does HAProxy mark a server DOWN even though I can ping it successfully?
A: ICMP ping (Layer 3) and HAProxy health checks (TCP Layer 4 or HTTP Layer 7) are completely independent. A server can respond to pings while its application process is crashed, its listening port is firewalled, or its file descriptors are exhausted. HAProxy does not use ICMP at any point. A successful ping tells you only that the host's network interface is alive and routing is working — not that the application is healthy.
Q: How do I configure HAProxy to use a backup server when the primary backend is DOWN?
A: Add the backup keyword to the server line you want to act as a standby: server web01-backup 192.168.10.21:80 check inter 2s fall 3 rise 2 backup. HAProxy will only send traffic to backup servers when all non-backup servers in the pool are marked DOWN. The backup server participates in normal health checks but receives no traffic while primary servers are UP.
Q: Can HAProxy drain active sessions from a server before marking it DOWN during a planned maintenance?
A: Yes. Use the admin socket to put the server in DRAIN state:
echo "set server app_backend/web01 state drain" | socat stdio /var/run/haproxy/admin.sock. In DRAIN mode, the server stops receiving new sessions but continues serving existing ones until they close naturally. Once traffic drops to zero, you can safely take the server offline. This is preferable to an abrupt DOWN state during maintenance windows.
Q: How do I debug health checks in real time to see exactly what HAProxy is sending and receiving?
A: Increase the HAProxy log verbosity by setting option log-health-checks in the backend stanza. This logs every individual health check result, both successes and failures, to syslog. You can also use tcpdump on sw-infrarunbook-01 to capture the raw health check traffic: tcpdump -i any -nn -A host 192.168.10.20 and port 80. This shows exactly what HAProxy sends and what the backend responds with at the packet level.
Q: What is the slowstart option and when should I use it?
A: The slowstart option causes HAProxy to gradually increase the weight (and therefore the traffic share) of a server transitioning from DOWN to UP over a specified time period. For example, server web01 192.168.10.20:80 check slowstart 60s ramps the server from 0% to 100% of its configured weight over 60 seconds. Use it for backends that require a warm-up period, such as JVM applications building JIT-compiled code caches or services pre-loading data into memory, to avoid immediately overwhelming them and triggering a rapid DOWN/UP cycle.
Q: How do I ensure HAProxy health check configuration changes do not cause a service interruption when reloading?
A: Use systemctl reload haproxy rather than restart. A reload spawns a new HAProxy process with the new configuration to accept new connections, while the old process continues serving existing sessions until they complete, then exits gracefully. Existing sessions are preserved. Always validate the configuration with haproxy -c -f /etc/haproxy/haproxy.cfg before reloading to ensure no syntax errors are introduced.
Q: Can I configure different health check parameters for individual servers within the same backend pool?
A: Yes. While backend-level directives like option httpchk and timeout check apply to all servers in the pool, per-server attributes such as inter (check interval), fall, rise, maxconn, and weight can be set individually on each server line. This allows you to, for example, check a high-capacity server more frequently or apply a stricter fall threshold to a server known to be less stable.
