Symptoms
When AWS Elastic Load Balancer health checks begin failing, the impact is immediate and visible across multiple layers of your AWS environment. Recognizing the exact combination of signals helps you skip the noise and move directly to root cause identification.
- Target group instances show status unhealthy in the AWS Console under EC2 > Target Groups > Targets
- CloudWatch metric UnHealthyHostCount is non-zero while HealthyHostCount trends toward zero
- End users receive HTTP 502 Bad Gateway or 503 Service Unavailable responses from the ALB DNS endpoint
- ELB access logs stored in S3 show target_status_code as "-" (no response) or unexpected 4xx/5xx values
- Auto Scaling groups repeatedly launch and terminate instances because newly registered targets never pass the health check and are never considered healthy
- CloudWatch Alarms enter ALARM state on HTTPCode_ELB_5XX_Count or TargetConnectionErrorCount
- Deployments stall in CodeDeploy or ECS because the replacement task or instance fails to transition to healthy within the wait timeout
The moment you observe any of these signals, run the following command to retrieve the raw health state and embedded reason codes from AWS:
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
    --region us-east-1

Example output showing an unhealthy target:
{
"TargetHealthDescriptions": [
{
"Target": {
"Id": "i-0a1b2c3d4e5f67890",
"Port": 8080
},
"HealthCheckPort": "8080",
"TargetHealth": {
"State": "unhealthy",
"Reason": "Target.FailedHealthChecks",
"Description": "Health checks failed with these codes: [404]"
}
}
]
}

The Reason and Description fields are your primary pivot points. Each section below maps a failure pattern to its root cause, a definitive identification method, and a concrete fix.
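When a target group contains many targets, it is faster to group the describe-target-health JSON by reason code than to read entries one by one. A minimal sketch in Python (the function name and output shape are illustrative, not an AWS API):

```python
import json
from collections import defaultdict

def triage_targets(payload: str) -> dict:
    """Group non-healthy targets by their TargetHealth.Reason code.

    `payload` is the raw JSON printed by
    `aws elbv2 describe-target-health`."""
    by_reason = defaultdict(list)
    for desc in json.loads(payload)["TargetHealthDescriptions"]:
        health = desc["TargetHealth"]
        if health.get("State") != "healthy":
            by_reason[health.get("Reason", "unknown")].append(
                (desc["Target"]["Id"], desc["Target"]["Port"])
            )
    return dict(by_reason)
```

Feeding the CLI output through this yields, for example, every instance stuck on Target.FailedHealthChecks in one list, which tells you immediately whether the whole fleet or a single target is affected.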
Root Cause 1: Wrong Health Check Path
Why It Happens
The default health check path for an Application Load Balancer target group is /. Many applications do not return 200 on the root path: they may redirect to a login screen (returning 301), return a status code the matcher does not accept, or simply return 404 because the application is mounted under a sub-path such as /api/v1 or /app. When the ELB receives any status code outside the configured success range, it counts the check as failed, and once the unhealthy threshold of consecutive failures is reached, the target is marked unhealthy. This is the single most common cause of health check failures after a new deployment or when migrating an existing application to a load balancer for the first time.
How to Identify
Check the current health check configuration for the target group:
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetGroups[*].{Path:HealthCheckPath,Code:Matcher.HttpCode,Port:HealthCheckPort}' \
    --output table

Sample output:
--------------------------------------
| DescribeTargetGroups |
+-------+-----------+----------------+
| Code | Path | Port |
+-------+-----------+----------------+
| 200 | / | traffic-port |
+-------+-----------+----------------+

Now verify what the application actually returns on that path. From a bastion host or another EC2 instance in the same VPC, curl the instance's private IP directly:
curl -v http://10.0.1.45:8080/
# Output:
< HTTP/1.1 301 Moved Permanently
< Location: https://10.0.1.45:8080/dashboard

A 301 is not in the success matcher, so every health check fails. Identify the correct endpoint by asking the application team or inspecting the app routes:
curl -sv http://10.0.1.45:8080/health
< HTTP/1.1 200 OK
{"status":"UP"}

How to Fix
Update the health check path to the endpoint that reliably returns 200:
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--health-check-path /health \
    --matcher HttpCode=200

After the update, the ELB resumes health checks immediately. With a 30-second interval and a healthy threshold of 3, targets transition to healthy roughly 90 seconds after the fix if the endpoint is responding correctly.
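The arithmetic behind that estimate is worth keeping handy when tuning these values mid-incident; a tiny illustrative helper (the function name is hypothetical):

```python
def seconds_until_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Worst-case wait after a fix lands: up to `interval_s` seconds
    until the next probe, then `healthy_threshold` consecutive passes
    spaced `interval_s` apart -- bounded by interval * threshold."""
    return interval_s * healthy_threshold
```

With a 30-second interval and a threshold of 3 this gives 90 seconds; dropping to a 10-second interval and a threshold of 2 cuts it to 20 seconds.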
Root Cause 2: Security Group Blocking Health Check Port
Why It Happens
Application Load Balancers send health check probes from the ELB nodes, which reside inside your VPC in the subnets where the load balancer is deployed. The source IP of these probes is the private IP of the ALB node, not a public address. If the security group attached to your EC2 instances does not have an inbound rule permitting traffic from the ALB's security group on the health check port, the TCP connection is silently dropped before the HTTP layer is reached. The ELB records this as a connection timeout (Target.Timeout) or a failed check with no HTTP code (Target.FailedHealthChecks). This is a common pitfall when security groups are tightened mid-incident or when IaC templates overwrite prior manual rules.
How to Identify
Retrieve the security group IDs for both the load balancer and the target instances:
# Get ALB security groups
aws elbv2 describe-load-balancers \
--names app-alb-prod \
--query 'LoadBalancers[*].SecurityGroups' \
--output text
# Output: sg-0123456789abcdef0
# Get instance security groups
aws ec2 describe-instances \
--instance-ids i-0a1b2c3d4e5f67890 \
--query 'Reservations[*].Instances[*].SecurityGroups' \
    --output table

Now inspect the inbound rules of the instance's security group:
aws ec2 describe-security-groups \
--group-ids sg-0fedcba987654321 \
--query 'SecurityGroups[*].IpPermissions' \
    --output json

If no rule permits TCP port 8080 from sg-0123456789abcdef0 (the ALB security group), health check traffic is being dropped. Confirm via VPC Flow Logs with a REJECT filter:
aws logs filter-log-events \
--log-group-name /vpc/flowlogs/prod \
--filter-pattern '[version, account, eni, src, dst="10.0.1.45", srcport, dstport="8080", protocol="6", packets, bytes, start, end, action="REJECT", status]' \
--start-time 1700000000000How to Fix
Add an inbound rule permitting the health check port from the ALB's security group:
aws ec2 authorize-security-group-ingress \
--group-id sg-0fedcba987654321 \
--protocol tcp \
--port 8080 \
    --source-group sg-0123456789abcdef0

Security group changes take effect immediately; no instance reboot is required. The next scheduled health check will succeed if the application is otherwise healthy.
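If you want to verify the rule programmatically, for example by piping describe-security-groups output into a script before and after the change, the check can be sketched as below. The function name is illustrative, and the logic assumes the standard IpPermissions JSON shape:

```python
def allows_ingress(ip_permissions: list, port: int, source_sg: str) -> bool:
    """True if any rule in an EC2 IpPermissions list permits TCP `port`
    from the security group `source_sg`."""
    for perm in ip_permissions:
        if perm.get("IpProtocol") not in ("tcp", "-1"):
            continue  # ignore udp/icmp-only rules
        # Protocol "-1" means all traffic; FromPort/ToPort are absent then
        from_p = perm.get("FromPort", 0)
        to_p = perm.get("ToPort", 65535)
        if perm.get("IpProtocol") == "-1" or from_p <= port <= to_p:
            for pair in perm.get("UserIdGroupPairs", []):
                if pair.get("GroupId") == source_sg:
                    return True
    return False
```

Running this against the instance security group with the ALB's group ID and the health check port gives a yes/no answer you can assert in CI or an AWS Config-style audit.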
Root Cause 3: App Not Responding on Health Endpoint
Why It Happens
Even when the path is correct and the security group is open, the application process itself may be unable to respond. Common sub-causes include: the process crashed and was not restarted; the application is mid-startup and the health handler is not yet registered; the service is bound to 127.0.0.1 instead of 0.0.0.0, making it unreachable from the ELB node; or the health endpoint performs a live dependency check (database query, cache ping) that blocks indefinitely when a downstream service is unavailable. In ECS environments, container port mappings may be misconfigured so the container port is not exposed on the host network interface that the ALB targets.
How to Identify
SSH to the target instance (or exec into the container) and verify the process is running and bound to the correct network interface:
ss -tlnp | grep 8080
# Expected — listening on all interfaces:
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 128 0.0.0.0:8080 0.0.0.0:* users:(("java",pid=1234,fd=42))
# Problematic — loopback only:
LISTEN 0      128    127.0.0.1:8080    0.0.0.0:*    users:(("java",pid=1234,fd=42))

Confirm the endpoint responds locally and measure latency:
curl -sv http://127.0.0.1:8080/health
# If this hangs or times out, inspect app logs:
journalctl -u app-service -n 200 --no-pager | grep -i 'error\|warn\|exception'
tail -n 200 /var/log/app/application.log

Look for messages like Connection refused to 10.0.2.10:5432 or OutOfMemoryError that indicate the health handler is blocking on a dependency or the process is resource-starved.
How to Fix
- Process not running: Restart the service (systemctl restart app-service) and investigate the root cause; check dmesg | grep -i oom for OOM kills
- Bound to loopback: Update the application bind address to 0.0.0.0 in the configuration file and redeploy
- Health endpoint blocks on dependencies: Refactor the health endpoint to return 200 immediately, confirming only that the process is alive. Move dependency validation to a separate deep-health endpoint that is never targeted by the ELB
- ECS port mapping error: Verify the task definition's portMappings block maps containerPort 8080 to hostPort 8080 (or 0 for dynamic mapping), then redeploy the service
The ELB health check endpoint should be shallow — confirm the process is alive and able to accept connections, nothing more.
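As a concrete illustration of a shallow endpoint, here is a minimal sketch using Python's standard-library HTTP server: it binds 0.0.0.0, answers /health with a static 200, and performs no dependency checks. Your framework's idiom will differ, but the shape is the same:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Shallow check: the fact that we can answer at all is the signal
            body = json.dumps({"status": "UP"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep 30-second health probes out of the access log

def serve(port: int = 8080):
    # Bind 0.0.0.0 so the ELB node can reach the port, not just loopback
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Note that the handler touches no database and no cache, so its response time is bounded by the process being alive and scheduled, which is exactly what the ELB should be measuring.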
Root Cause 4: SSL Certificate Mismatch
Why It Happens
When the target group health check protocol is set to HTTPS, the ALB establishes a TLS handshake with the backend instance before sending the HTTP health check request. If the certificate served by the instance has expired, is self-signed with an untrusted CA, or does not match the expected hostname, the TLS negotiation phase fails and the health check is recorded as a connection error with no HTTP status code. This scenario commonly appears when: end-to-end encryption is enabled using self-signed certificates on backend instances; a certificate renewal was applied to some instances in a fleet but not all; or the health check protocol was recently changed from HTTP to HTTPS without updating the instance certificate configuration.
How to Identify
Check the current health check protocol:
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetGroups[*].{Protocol:HealthCheckProtocol,Path:HealthCheckPath}' \
--output table
# Output: HTTPS / /health

Attempt a TLS connection directly to the instance from within the VPC:
openssl s_client -connect 10.0.1.45:8443 -servername sw-infrarunbook-01.solvethenetwork.com
# Failure output:
verify error:num=18:self signed certificate
# or:
verify error:num=10:certificate has expired
subject=CN = sw-infrarunbook-01.internal
issuer=CN = sw-infrarunbook-01.internal

Check certificate validity dates explicitly:
echo | openssl s_client -connect 10.0.1.45:8443 2>/dev/null | openssl x509 -noout -dates
# notBefore=Jan 1 00:00:00 2023 GMT
# notAfter=Jan  1 00:00:00 2024 GMT  <-- expired!

How to Fix
Option A — Confirm ALB certificate verification behavior. By default, Application Load Balancers do NOT validate the certificate chain of backend targets for HTTPS health checks. They establish TLS but accept self-signed and expired certificates. If health checks are still failing with HTTPS, the likely issue is that the target is not completing the TLS handshake at all (process not listening, wrong port). Confirm the protocol is actually HTTPS on the target:
# Verify the app is actually serving TLS on the health check port
nmap -p 8443 --script ssl-cert 10.0.1.45
# If the port shows 'closed' or no ssl-cert output, the app is not TLS-enabled on that port

Option B — Renew or replace the certificate on the instance. If the TLS handshake itself fails (connection reset, certificate parse error), deploy a valid certificate:
# Renew via certbot on sw-infrarunbook-01
certbot renew --nginx --non-interactive
# Verify the renewed certificate
openssl x509 \
-in /etc/letsencrypt/live/sw-infrarunbook-01.solvethenetwork.com/fullchain.pem \
-noout -dates -subject
# subject=CN = sw-infrarunbook-01.solvethenetwork.com
# notAfter=Jul  5 00:00:00 2026 GMT

Root Cause 5: Health Check Timeout Too Low
Why It Happens
The health check timeout defines how many seconds the ELB waits for a complete HTTP response before declaring the check failed. The default is 5 seconds for ALB and 10 seconds for NLB. If the application experiences JVM garbage collection pauses, performs a database query inside the health handler, has a slow initialization path, or is under high load that delays response times, the health endpoint may consistently exceed the configured timeout. The ELB interprets this as a failure, marks the target unhealthy, and stops sending traffic to it — even though the application would otherwise serve production requests successfully. This creates a situation where the load balancer is removing healthy instances from rotation, compounding load on the remaining targets and potentially triggering a cascading failure.
How to Identify
Retrieve the current timeout and interval configuration:
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetGroups[*].{Timeout:HealthCheckTimeoutSeconds,Interval:HealthCheckIntervalSeconds,Threshold:UnhealthyThresholdCount}' \
--output table
# Output:
# +----------+-----------+---------+
# | Interval | Threshold | Timeout |
# +----------+-----------+---------+
# | 30 | 3 | 5 |
# +----------+-----------+---------+

Measure actual health endpoint response times across multiple samples from within the VPC:
for i in {1..20}; do
curl -o /dev/null -s -w "%{time_total}\n" http://10.0.1.45:8080/health
done
# Output:
0.312
0.289
6.102 <-- exceeds 5s timeout!
0.301
7.891 <-- GC pause causing delay
0.295
0.310

If the p95 or p99 response time consistently exceeds the configured timeout, this is your root cause. Correlate the timing of health check failures in CloudWatch with GC logs or slow-query logs on the instance.
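Turning those samples into a verdict is a one-liner worth scripting; the sketch below uses a nearest-rank p95 against the configured timeout (the helper names are illustrative):

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of response times."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

def exceeds_timeout(samples: list[float], timeout_s: float) -> bool:
    """True if the slow tail of the samples would fail the health check."""
    return p95(samples) >= timeout_s
```

Feeding in the curl timings above with the 5-second timeout flags the GC-pause outliers immediately, while a healthy endpoint with all samples around 0.3 s passes.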
How to Fix
Increase the timeout to provide headroom above the observed worst-case response time. The timeout must always be less than the interval:
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--health-check-timeout-seconds 15 \
    --health-check-interval-seconds 30

This resolves the immediate health check failure, but increasing the timeout is a mitigation, not a fix. The correct long-term solution is to ensure the health endpoint returns immediately without performing expensive operations. A well-designed health exchange should look like this:
GET /health HTTP/1.1
Host: sw-infrarunbook-01.solvethenetwork.com
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 33

{"status":"UP","version":"1.4.2"}

Root Cause 6: HTTP Response Code Mismatch
Why It Happens
The ALB health check success condition is defined by the matcher, a configurable set of HTTP status codes. The default is 200 only. If your application's health endpoint legitimately returns 204 No Content or 202 Accepted, or if a reverse proxy in front of the app returns a 301 redirect that was not accounted for, the ELB marks the check as failed. This frequently surfaces after a deployment where the health endpoint behavior changed, or when a new framework version alters the default response code for empty-body responses.
How to Identify
# Check what the endpoint actually returns
curl -o /dev/null -s -w "%{http_code}\n" http://10.0.1.45:8080/health
# Output: 204
# Check the current matcher
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetGroups[*].Matcher'
# Output: [{"HttpCode": "200"}]

The application returns 204 but the matcher only accepts 200, so the health check fails on every probe.
How to Fix
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
    --matcher HttpCode=200-299

Alternatively, keep the matcher strict at 200 and update the application to return 200 from the health endpoint. Standardizing on 200 across all services keeps the ELB configuration consistent and easier to audit.
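The matcher accepts single codes, comma-separated lists, and ranges (for example 200,204 or 200-299). A small sketch of how such a matcher string evaluates against a status code, handy for a pre-deployment check that the app's health response will actually pass:

```python
def matches(matcher: str, code: int) -> bool:
    """Evaluate an ALB-style matcher string ('200', '200,204',
    '200-299') against an HTTP status code."""
    for part in matcher.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            if int(lo) <= code <= int(hi):
                return True
        elif int(part) == code:
            return True
    return False
```

With this you can assert in CI that the code returned by curl against /health is accepted by the matcher value pinned in your IaC, catching exactly the 204-versus-200 mismatch described above.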
Root Cause 7: Target Registered on Wrong Port
Why It Happens
When targets are registered manually via the console, CLI, or through IaC templates with a hardcoded port value, it is easy to register the instance on the wrong port — for example, port 80 while the application listens on 8080. The ELB will attempt the health check on the registered port, receive a connection-refused error, and mark the target unhealthy. This sub-cause is particularly common in environments where the application port is managed as a variable in a deployment pipeline and the ELB target group registration was not updated to match.
How to Identify
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}'
# Output:
# [{"ID":"i-0a1b2c3d4e5f67890","Port":80,"State":"unhealthy","Reason":"Target.FailedHealthChecks"}]
# Verify what the instance is actually listening on
ss -tlnp | grep LISTEN
# Shows: 0.0.0.0:8080 -- app is on 8080, not 80

How to Fix
# Deregister the target on the wrong port
aws elbv2 deregister-targets \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--targets Id=i-0a1b2c3d4e5f67890,Port=80
# Register on the correct port
aws elbv2 register-targets \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
    --targets Id=i-0a1b2c3d4e5f67890,Port=8080

Prevention
Health check failures are almost always preventable with disciplined configuration standards and deployment safeguards. Applying the following controls reduces the probability of recurring outages and accelerates diagnosis when they do occur.
- Implement a dedicated /health endpoint in every application. The endpoint should return 200 within milliseconds without querying databases, caches, or external services. It should only confirm the process is alive and ready to accept traffic. Move dependency checks to a separate /health/deep endpoint that is never targeted by the ELB.
- Codify all health check configuration in IaC. Use Terraform, CloudFormation, or CDK to define health check path, protocol, healthy threshold, unhealthy threshold, timeout, interval, and matcher as explicit, version-controlled values. Never configure health checks manually in the console where they can drift silently.
- Lock security group rules in IaC and enforce with AWS Config. Create a custom AWS Config rule that verifies the load balancer's security group ID appears as an inbound source on the target security group for the health check port. Alert on any deviation.
- Set CloudWatch alarms on UnHealthyHostCount > 0 with a 1-minute evaluation period and notify the on-call engineer immediately. Do not wait for end users to report 502 errors before you know targets are unhealthy.
- Validate health checks during every deployment. In CodeDeploy blue/green deployments, register one replacement instance and verify it reaches healthy status before continuing the rollout. In ECS rolling deployments, set the minimum healthy percent and health check grace period to allow adequate initialization time.
- Monitor SSL certificate expiry proactively. Use ACM managed renewal for public certificates. For private or imported certificates, create an EventBridge rule that fires on ACM expiry notifications 30 days in advance and routes alerts to the ops team.
- Document the health check path in the service runbook. Every service should have a one-line entry that states the health check URL, expected response code, and approximate response time. This lets any on-call engineer verify the endpoint in seconds during an incident.
- Test the health endpoint independently of the ELB in your CI/CD pipeline. A curl check against the /health path as part of the integration test suite catches path changes before they reach production.
Frequently Asked Questions
Q: How long does it take for an ELB target to become healthy after the root cause is resolved?
A: Recovery time depends on your health check interval and healthy threshold. With a 30-second interval and a healthy threshold of 3, it takes a minimum of 90 seconds from the first successful check for a target to reach the healthy state. During an active incident, temporarily reduce the interval to 10 seconds and the healthy threshold to 2, cutting recovery time to 20 seconds. Restore the original values once the incident is closed.
Q: Can I trigger an ELB health check manually without waiting for the interval?
A: No. AWS does not expose an API to trigger on-demand health checks; you must wait for the next scheduled probe. If faster feedback is needed during troubleshooting, temporarily reduce the health check interval, observe the next few check results in the console or via describe-target-health, then restore the original interval.
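The wait-and-observe loop is easy to automate. A sketch with the AWS call injected as a plain function, so the polling logic stays testable; in practice fetch_state would wrap describe-target-health via the CLI or boto3:

```python
import time
from typing import Callable

def wait_until_healthy(fetch_state: Callable[[str], str],
                       target_id: str,
                       interval_s: float = 10.0,
                       max_checks: int = 30) -> bool:
    """Poll `fetch_state(target_id)` (expected to return the
    TargetHealth.State string) until it reports 'healthy'.
    Returns False if `max_checks` probes pass without success."""
    for _ in range(max_checks):
        if fetch_state(target_id) == "healthy":
            return True
        time.sleep(interval_s)
    return False
```

Matching interval_s to the target group's health check interval means each poll observes at most one new probe result, which keeps API traffic minimal.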
Q: Why does my instance show healthy in the EC2 console but unhealthy in the target group?
A: EC2 instance health checks and ELB target group health checks are entirely separate mechanisms. EC2 status checks verify hypervisor-level and OS-level availability (system reachability and instance reachability). ELB target group health checks verify that the application process is responding correctly on a specific port and path with an acceptable HTTP status code. An instance can pass both EC2 status checks while the application process inside it is crashed, bound to the wrong address, or returning errors.
Q: What is the difference between HealthCheckProtocol HTTP and HTTPS in a target group?
A: With HTTP, the ALB sends a plain TCP-based HTTP request to the target. With HTTPS, the ALB first negotiates a TLS session and then sends the HTTP request over the encrypted channel. By default, ALB does not validate the target's certificate chain (it accepts self-signed and expired certificates), but the TLS handshake itself must complete successfully. If the target process is not configured to serve TLS on the health check port, the HTTPS health check will fail at the connection level before any HTTP exchange occurs.
Q: How do I diagnose health check failures without SSH access to the instance?
A: Enable VPC Flow Logs on the target subnet and filter for REJECT actions on the health check port to identify security group blocks. Check the ELB access logs in S3; the target_status_code field reveals exactly what HTTP code the target returned. Use the describe-target-health CLI command; the Description field often contains the specific HTTP status code or connection error. For ECS workloads, use CloudWatch Container Insights or AWS Systems Manager Session Manager to access container logs without opening SSH.
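ALB access log entries are space-separated, with target_status_code as the tenth field (field order assumed from the published ALB access log format: type, time, elb, client:port, target:port, three processing times, elb_status_code, target_status_code, ...); a value of "-" means the target never produced a response. A minimal extractor:

```python
def target_status_code(log_line: str) -> str:
    """Pull the target_status_code field (index 9) out of one ALB
    access log entry. '-' means no response from the target."""
    return log_line.split(" ")[9]
```

Piping the S3 log files through this and counting the distinct values tells you in seconds whether targets are returning 404s, timing out ('-'), or not being reached at all.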
Q: Can an NLB health check fail for different reasons than an ALB health check?
A: Yes. NLB TCP health checks only verify that the target accepts a TCP connection; they do not evaluate an HTTP response. A service that accepts and immediately closes a TCP connection still passes a TCP health check. HTTP health checks on NLBs behave similarly to ALB checks. Also note that older NLBs have no security group of their own (NLB security group support arrived in 2023 and can only be attached at creation) and pass client source IPs through to the targets. On such NLBs, the instance security group must allow inbound traffic from the full client IP range (and the NLB subnet CIDRs for health checks specifically), not just from a load balancer security group ID.
Q: My ECS task health check passes but the ALB target health check still fails. Why?
A: ECS container health checks (defined in the task definition's healthCheck block) run inside the container network namespace and are evaluated by the ECS agent independently of the ALB. The ALB health check probes the target's network interface from outside. Confirm that the task definition's portMappings block correctly maps the container port to a host port, and verify the ECS service's security group allows inbound traffic from the ALB security group on that mapped port. Also check that the health check grace period in the ECS service definition is long enough for the application to initialize before the first ALB probe counts against it.
Q: How do I configure ALB health checks correctly in Terraform to avoid drift?
A: Use the health_check block inside aws_lb_target_group and explicitly set every field rather than relying on AWS defaults.
resource "aws_lb_target_group" "app" {
name = "app-tg"
port = 8080
protocol = "HTTP"
vpc_id = aws_vpc.main.id
health_check {
path = "/health"
protocol = "HTTP"
healthy_threshold = 2
unhealthy_threshold = 3
timeout = 10
interval = 30
matcher = "200"
}
}

Pinning every attribute prevents unexpected state changes when AWS updates default values and eliminates the drift that occurs when engineers modify settings in the console.
Q: Which CloudWatch metrics should I monitor to detect ELB health check problems before they cause an outage?
A: Monitor UnHealthyHostCount (alert when greater than 0), HealthyHostCount (alert when it drops below the minimum required capacity for your service), HTTPCode_Target_5XX_Count, and TargetConnectionErrorCount. For NLBs, also watch TCP_ELB_Reset_Count. Set evaluation periods to 1–2 minutes so you catch transient failures before they cause sustained degradation and before autoscaling terminates all unhealthy instances.
Q: How do I temporarily exclude an instance from health checks during maintenance without removing it from the target group?
A: Deregister the target before performing maintenance. In-flight connections drain within the configured deregistration delay (default 300 seconds, configurable down to 0 for emergency maintenance). After maintenance is complete, re-register the instance and monitor the health check status until it transitions to healthy before considering the maintenance window closed.
# Deregister before maintenance
aws elbv2 deregister-targets \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--targets Id=i-0a1b2c3d4e5f67890,Port=8080
# Re-register after maintenance
aws elbv2 register-targets \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
    --targets Id=i-0a1b2c3d4e5f67890,Port=8080

Q: Can weighted routing help reduce blast radius while I debug a health check failure on a subset of targets?
A: ALB does not support per-target weight within a single target group. However, you can configure weighted target groups in an ALB listener rule to direct a percentage of traffic to a secondary target group. For example, route 90% of traffic to a known-good target group while routing 10% to the group under investigation. This limits user impact while you diagnose the failure in a live traffic environment. Use the AWS console or CLI to set the weight attribute on each target group action in the listener rule.
