Symptoms
When AWS Elastic Load Balancer health checks begin failing, the impact is immediate and visible across multiple layers of your AWS environment. Recognizing the exact combination of signals helps you skip the noise and move directly to root cause identification.
- Target group instances show status unhealthy in the AWS Console under EC2 > Target Groups > Targets
- CloudWatch metric UnHealthyHostCount is non-zero while HealthyHostCount trends toward zero
- End users receive HTTP 502 Bad Gateway or 503 Service Unavailable responses from the ALB DNS endpoint
- ELB access logs stored in S3 show target_status_code as "-" (no response) or unexpected 4xx/5xx values
- Auto Scaling groups repeatedly launch and terminate instances because newly registered targets never pass the health check and are never considered healthy
- CloudWatch Alarms enter ALARM state on HTTPCode_ELB_5XX_Count or TargetConnectionErrorCount
- Deployments stall in CodeDeploy or ECS because the replacement task or instance fails to transition to healthy within the wait timeout
The moment you observe any of these signals, run the following command to retrieve the raw health state and embedded reason codes from AWS:
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
    --region us-east-1

Example output showing an unhealthy target:
{
"TargetHealthDescriptions": [
{
"Target": {
"Id": "i-0a1b2c3d4e5f67890",
"Port": 8080
},
"HealthCheckPort": "8080",
"TargetHealth": {
"State": "unhealthy",
"Reason": "Target.FailedHealthChecks",
"Description": "Health checks failed with these codes: [404]"
}
}
]
}

The Reason and Description fields are your primary pivot points. Each section below maps a failure pattern to its root cause, a definitive identification method, and a concrete fix.
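When a target group contains many targets, it is faster to group the describe-target-health JSON by reason code than to read entries one by one. A minimal sketch in Python (the function name and output shape are illustrative, not an AWS API):

```python
import json
from collections import defaultdict

def triage_targets(payload: str) -> dict:
    """Group non-healthy targets by their TargetHealth.Reason code.

    `payload` is the raw JSON printed by
    `aws elbv2 describe-target-health`."""
    by_reason = defaultdict(list)
    for desc in json.loads(payload)["TargetHealthDescriptions"]:
        health = desc["TargetHealth"]
        if health.get("State") != "healthy":
            by_reason[health.get("Reason", "unknown")].append(
                (desc["Target"]["Id"], desc["Target"]["Port"])
            )
    return dict(by_reason)
```

Feeding the CLI output through this yields, for example, every instance stuck on Target.FailedHealthChecks in one list, which tells you immediately whether the whole fleet or a single target is affected.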
Root Cause 1: Wrong Health Check Path
Why It Happens
The default health check path for an Application Load Balancer target group is /. Many applications do not return 200 on the root path: they may redirect to a login screen (returning 301), return a status code the matcher does not accept, or simply return 404 because the application is mounted under a sub-path such as /api/v1 or /app. When the ELB receives any status code outside the configured success range, it counts the check as failed, and once the unhealthy threshold of consecutive failures is reached, the target is marked unhealthy. This is the single most common cause of health check failures after a new deployment or when migrating an existing application to a load balancer for the first time.
How to Identify
Check the current health check configuration for the target group:
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetGroups[*].{Path:HealthCheckPath,Code:Matcher.HttpCode,Port:HealthCheckPort}' \
    --output table

Sample output:
--------------------------------------
| DescribeTargetGroups |
+-------+-----------+----------------+
| Code | Path | Port |
+-------+-----------+----------------+
| 200 | / | traffic-port |
+-------+-----------+----------------+

Now verify what the application actually returns on that path. From a bastion host or another EC2 instance in the same VPC, curl the instance's private IP directly:
curl -v http://10.0.1.45:8080/
# Output:
< HTTP/1.1 301 Moved Permanently
< Location: https://10.0.1.45:8080/dashboard

A 301 is not in the success matcher, so every health check fails. Identify the correct endpoint by asking the application team or inspecting the app routes:
curl -sv http://10.0.1.45:8080/health
< HTTP/1.1 200 OK
{"status":"UP"}

How to Fix
Update the health check path to the endpoint that reliably returns 200:
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--health-check-path /health \
    --matcher HttpCode=200

After the update, the ELB resumes health checks immediately. With a 30-second interval and a healthy threshold of 3, targets transition to healthy roughly 90 seconds after the fix if the endpoint is responding correctly.
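The arithmetic behind that estimate is worth keeping handy when tuning these values mid-incident; a tiny illustrative helper (the function name is hypothetical):

```python
def seconds_until_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Worst-case wait after a fix lands: up to `interval_s` seconds
    until the next probe, then `healthy_threshold` consecutive passes
    spaced `interval_s` apart -- bounded by interval * threshold."""
    return interval_s * healthy_threshold
```

With a 30-second interval and a threshold of 3 this gives 90 seconds; dropping to a 10-second interval and a threshold of 2 cuts it to 20 seconds.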
Root Cause 2: Security Group Blocking Health Check Port
Why It Happens
Application Load Balancers send health check probes from the ELB nodes, which reside inside your VPC in the subnets where the load balancer is deployed. The source IP of these probes is the private IP of the ALB node, not a public address. If the security group attached to your EC2 instances does not have an inbound rule permitting traffic from the ALB's security group on the health check port, the TCP connection is silently dropped before the HTTP layer is reached. The ELB records this as a connection timeout (Target.Timeout) or a failed check with no HTTP code (Target.FailedHealthChecks). This is a common pitfall when security groups are tightened mid-incident or when IaC templates overwrite prior manual rules.
How to Identify
Retrieve the security group IDs for both the load balancer and the target instances:
# Get ALB security groups
aws elbv2 describe-load-balancers \
--names app-alb-prod \
--query 'LoadBalancers[*].SecurityGroups' \
--output text
# Output: sg-0123456789abcdef0
# Get instance security groups
aws ec2 describe-instances \
--instance-ids i-0a1b2c3d4e5f67890 \
--query 'Reservations[*].Instances[*].SecurityGroups' \
    --output table

Now inspect the inbound rules of the instance's security group:
aws ec2 describe-security-groups \
--group-ids sg-0fedcba987654321 \
--query 'SecurityGroups[*].IpPermissions' \
    --output json

If no rule permits TCP port 8080 from sg-0123456789abcdef0 (the ALB security group), health check traffic is being dropped. Confirm via VPC Flow Logs with a REJECT filter:
aws logs filter-log-events \
--log-group-name /vpc/flowlogs/prod \
--filter-pattern '[version, account, eni, src, dst="10.0.1.45", srcport, dstport="8080", protocol="6", packets, bytes, start, end, action="REJECT", status]' \
--start-time 1700000000000How to Fix
Add an inbound rule permitting the health check port from the ALB's security group:
aws ec2 authorize-security-group-ingress \
--group-id sg-0fedcba987654321 \
--protocol tcp \
--port 8080 \
    --source-group sg-0123456789abcdef0

Security group changes take effect immediately; no instance reboot is required. The next scheduled health check will succeed if the application is otherwise healthy.
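If you want to verify the rule programmatically, for example by piping describe-security-groups output into a script before and after the change, the check can be sketched as below. The function name is illustrative, and the logic assumes the standard IpPermissions JSON shape:

```python
def allows_ingress(ip_permissions: list, port: int, source_sg: str) -> bool:
    """True if any rule in an EC2 IpPermissions list permits TCP `port`
    from the security group `source_sg`."""
    for perm in ip_permissions:
        if perm.get("IpProtocol") not in ("tcp", "-1"):
            continue  # ignore udp/icmp-only rules
        # Protocol "-1" means all traffic; FromPort/ToPort are absent then
        from_p = perm.get("FromPort", 0)
        to_p = perm.get("ToPort", 65535)
        if perm.get("IpProtocol") == "-1" or from_p <= port <= to_p:
            for pair in perm.get("UserIdGroupPairs", []):
                if pair.get("GroupId") == source_sg:
                    return True
    return False
```

Running this against the instance security group with the ALB's group ID and the health check port gives a yes/no answer you can assert in CI or an AWS Config-style audit.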
Root Cause 3: App Not Responding on Health Endpoint
Why It Happens
Even when the path is correct and the security group is open, the application process itself may be unable to respond. Common sub-causes include: the process crashed and was not restarted; the application is mid-startup and the health handler is not yet registered; the service is bound to 127.0.0.1 instead of 0.0.0.0, making it unreachable from the ELB node; or the health endpoint performs a live dependency check (database query, cache ping) that blocks indefinitely when a downstream service is unavailable. In ECS environments, container port mappings may be misconfigured so the container port is not exposed on the host network interface that the ALB targets.
How to Identify
SSH to the target instance (or exec into the container) and verify the process is running and bound to the correct network interface:
ss -tlnp | grep 8080
# Expected — listening on all interfaces:
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 128 0.0.0.0:8080 0.0.0.0:* users:(("java",pid=1234,fd=42))
# Problematic — loopback only:
LISTEN 0      128    127.0.0.1:8080    0.0.0.0:*    users:(("java",pid=1234,fd=42))

Confirm the endpoint responds locally and measure latency:
curl -sv http://127.0.0.1:8080/health
# If this hangs or times out, inspect app logs:
journalctl -u app-service -n 200 --no-pager | grep -i 'error\|warn\|exception'
tail -n 200 /var/log/app/application.log

Look for messages like Connection refused to 10.0.2.10:5432 or OutOfMemoryError that indicate the health handler is blocking on a dependency or the process is resource-starved.
How to Fix
- Process not running: Restart the service (systemctl restart app-service) and investigate the root cause; check dmesg | grep -i oom for OOM kills
- Bound to loopback: Update the application bind address to 0.0.0.0 in the configuration file and redeploy
- Health endpoint blocks on dependencies: Refactor the health endpoint to return 200 immediately, confirming only that the process is alive. Move dependency validation to a separate deep-health endpoint that is never targeted by the ELB
- ECS port mapping error: Verify the task definition's portMappings block maps containerPort 8080 to hostPort 8080 (or 0 for dynamic mapping), then redeploy the service
The ELB health check endpoint should be shallow — confirm the process is alive and able to accept connections, nothing more.
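As a concrete illustration of a shallow endpoint, here is a minimal sketch using Python's standard-library HTTP server: it binds 0.0.0.0, answers /health with a static 200, and performs no dependency checks. Your framework's idiom will differ, but the shape is the same:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Shallow check: the fact that we can answer at all is the signal
            body = json.dumps({"status": "UP"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep 30-second health probes out of the access log

def serve(port: int = 8080):
    # Bind 0.0.0.0 so the ELB node can reach the port, not just loopback
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Note that the handler touches no database and no cache, so its response time is bounded by the process being alive and scheduled, which is exactly what the ELB should be measuring.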
Root Cause 4: SSL Certificate Mismatch
Why It Happens
When the target group health check protocol is set to HTTPS, the ALB establishes a TLS handshake with the backend instance before sending the HTTP health check request. If the certificate served by the instance has expired, is self-signed with an untrusted CA, or does not match the expected hostname, the TLS negotiation phase fails and the health check is recorded as a connection error with no HTTP status code. This scenario commonly appears when: end-to-end encryption is enabled using self-signed certificates on backend instances; a certificate renewal was applied to some instances in a fleet but not all; or the health check protocol was recently changed from HTTP to HTTPS without updating the instance certificate configuration.
How to Identify
Check the current health check protocol:
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetGroups[*].{Protocol:HealthCheckProtocol,Path:HealthCheckPath}' \
--output table
# Output: HTTPS / /health

Attempt a TLS connection directly to the instance from within the VPC:
openssl s_client -connect 10.0.1.45:8443 -servername sw-infrarunbook-01.solvethenetwork.com
# Failure output:
verify error:num=18:self signed certificate
# or:
verify error:num=10:certificate has expired
subject=CN = sw-infrarunbook-01.internal
issuer=CN = sw-infrarunbook-01.internal

Check certificate validity dates explicitly:
echo | openssl s_client -connect 10.0.1.45:8443 2>/dev/null | openssl x509 -noout -dates
# notBefore=Jan 1 00:00:00 2023 GMT
# notAfter=Jan  1 00:00:00 2024 GMT  <-- expired!

How to Fix
Option A — Confirm ALB certificate verification behavior. By default, Application Load Balancers do NOT validate the certificate chain of backend targets for HTTPS health checks. They establish TLS but accept self-signed and expired certificates. If health checks are still failing with HTTPS, the likely issue is that the target is not completing the TLS handshake at all (process not listening, wrong port). Confirm the protocol is actually HTTPS on the target:
# Verify the app is actually serving TLS on the health check port
nmap -p 8443 --script ssl-cert 10.0.1.45
# If the port shows 'closed' or no ssl-cert output, the app is not TLS-enabled on that port

Option B — Renew or replace the certificate on the instance. If the TLS handshake itself fails (connection reset, certificate parse error), deploy a valid certificate:
# Renew via certbot on sw-infrarunbook-01
certbot renew --nginx --non-interactive
# Verify the renewed certificate
openssl x509 \
-in /etc/letsencrypt/live/sw-infrarunbook-01.solvethenetwork.com/fullchain.pem \
-noout -dates -subject
# subject=CN = sw-infrarunbook-01.solvethenetwork.com
# notAfter=Jul  5 00:00:00 2026 GMT

Root Cause 5: Health Check Timeout Too Low
Why It Happens
The health check timeout defines how many seconds the ELB waits for a complete HTTP response before declaring the check failed. The default is 5 seconds for ALB and 10 seconds for NLB. If the application experiences JVM garbage collection pauses, performs a database query inside the health handler, has a slow initialization path, or is under high load that delays response times, the health endpoint may consistently exceed the configured timeout. The ELB interprets this as a failure, marks the target unhealthy, and stops sending traffic to it — even though the application would otherwise serve production requests successfully. This creates a situation where the load balancer is removing healthy instances from rotation, compounding load on the remaining targets and potentially triggering a cascading failure.
How to Identify
Retrieve the current timeout and interval configuration:
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetGroups[*].{Timeout:HealthCheckTimeoutSeconds,Interval:HealthCheckIntervalSeconds,Threshold:UnhealthyThresholdCount}' \
--output table
# Output:
# +----------+-----------+---------+
# | Interval | Threshold | Timeout |
# +----------+-----------+---------+
# | 30 | 3 | 5 |
# +----------+-----------+---------+

Measure actual health endpoint response times across multiple samples from within the VPC:
for i in {1..20}; do
curl -o /dev/null -s -w "%{time_total}\n" http://10.0.1.45:8080/health
done
# Output:
0.312
0.289
6.102 <-- exceeds 5s timeout!
0.301
7.891 <-- GC pause causing delay
0.295
0.310

If the p95 or p99 response time consistently exceeds the configured timeout, this is your root cause. Correlate the timing of health check failures in CloudWatch with GC logs or slow-query logs on the instance.
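Turning those samples into a verdict is a one-liner worth scripting; the sketch below uses a nearest-rank p95 against the configured timeout (the helper names are illustrative):

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of response times."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

def exceeds_timeout(samples: list[float], timeout_s: float) -> bool:
    """True if the slow tail of the samples would fail the health check."""
    return p95(samples) >= timeout_s
```

Feeding in the curl timings above with the 5-second timeout flags the GC-pause outliers immediately, while a healthy endpoint with all samples around 0.3 s passes.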
How to Fix
Increase the timeout to provide headroom above the observed worst-case response time. The timeout must always be less than the interval:
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--health-check-timeout-seconds 15 \
    --health-check-interval-seconds 30

This resolves the immediate health check failure, but increasing the timeout is a mitigation, not a fix. The correct long-term solution is to ensure the health endpoint returns immediately without performing expensive operations. A well-designed health exchange should look like this:
GET /health HTTP/1.1
Host: sw-infrarunbook-01.solvethenetwork.com
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 33

{"status":"UP","version":"1.4.2"}

Root Cause 6: HTTP Response Code Mismatch
Why It Happens
The ALB health check success condition is defined by the matcher, a configurable set of HTTP status codes. The default is 200 only. If your application's health endpoint legitimately returns 204 No Content or 202 Accepted, or if a reverse proxy in front of the app returns a 301 redirect that was not accounted for, the ELB marks the check as failed. This frequently surfaces after a deployment where the health endpoint behavior changed, or when a new framework version alters the default response code for empty-body responses.
How to Identify
# Check what the endpoint actually returns
curl -o /dev/null -s -w "%{http_code}\n" http://10.0.1.45:8080/health
# Output: 204
# Check the current matcher
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetGroups[*].Matcher'
# Output: [{"HttpCode": "200"}]

The application returns 204 but the matcher only accepts 200, so the health check fails on every probe.
How to Fix
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
    --matcher HttpCode=200-299

Alternatively, keep the matcher strict at 200 and update the application to return 200 from the health endpoint. Standardizing on 200 across all services keeps the ELB configuration consistent and easier to audit.
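The matcher accepts single codes, comma-separated lists, and ranges (for example 200,204 or 200-299). A small sketch of how such a matcher string evaluates against a status code, handy for a pre-deployment check that the app's health response will actually pass:

```python
def matches(matcher: str, code: int) -> bool:
    """Evaluate an ALB-style matcher string ('200', '200,204',
    '200-299') against an HTTP status code."""
    for part in matcher.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            if int(lo) <= code <= int(hi):
                return True
        elif int(part) == code:
            return True
    return False
```

With this you can assert in CI that the code returned by curl against /health is accepted by the matcher value pinned in your IaC, catching exactly the 204-versus-200 mismatch described above.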
Root Cause 7: Target Registered on Wrong Port
Why It Happens
When targets are registered manually via the console, CLI, or through IaC templates with a hardcoded port value, it is easy to register the instance on the wrong port — for example, port 80 while the application listens on 8080. The ELB will attempt the health check on the registered port, receive a connection-refused error, and mark the target unhealthy. This sub-cause is particularly common in environments where the application port is managed as a variable in a deployment pipeline and the ELB target group registration was not updated to match.
How to Identify
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}'
# Output:
# [{"ID":"i-0a1b2c3d4e5f67890","Port":80,"State":"unhealthy","Reason":"Target.FailedHealthChecks"}]
# Verify what the instance is actually listening on
ss -tlnp | grep LISTEN
# Shows: 0.0.0.0:8080 -- app is on 8080, not 80

How to Fix
# Deregister the target on the wrong port
aws elbv2 deregister-targets \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--targets Id=i-0a1b2c3d4e5f67890,Port=80
# Register on the correct port
aws elbv2 register-targets \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
    --targets Id=i-0a1b2c3d4e5f67890,Port=8080

Prevention
Health check failures are almost always preventable with disciplined configuration standards and deployment safeguards. Applying the following controls reduces the probability of recurring outages and accelerates diagnosis when they do occur.
- Implement a dedicated /health endpoint in every application. The endpoint should return 200 within milliseconds without querying databases, caches, or external services. It should only confirm the process is alive and ready to accept traffic. Move dependency checks to a separate /health/deep endpoint that is never targeted by the ELB.
- Codify all health check configuration in IaC. Use Terraform, CloudFormation, or CDK to define health check path, protocol, healthy threshold, unhealthy threshold, timeout, interval, and matcher as explicit, version-controlled values. Never configure health checks manually in the console where they can drift silently.
- Lock security group rules in IaC and enforce with AWS Config. Create a custom AWS Config rule that verifies the load balancer's security group ID appears as an inbound source on the target security group for the health check port. Alert on any deviation.
- Set CloudWatch alarms on UnHealthyHostCount > 0 with a 1-minute evaluation period and notify the on-call engineer immediately. Do not wait for end users to report 502 errors before you know targets are unhealthy.
- Validate health checks during every deployment. In CodeDeploy blue/green deployments, register one replacement instance and verify it reaches healthy status before continuing the rollout. In ECS rolling deployments, set the minimum healthy percent and health check grace period to allow adequate initialization time.
- Monitor SSL certificate expiry proactively. Use ACM managed renewal for public certificates. For private or imported certificates, create an EventBridge rule that fires on ACM expiry notifications 30 days in advance and routes alerts to the ops team.
- Document the health check path in the service runbook. Every service should have a one-line entry that states the health check URL, expected response code, and approximate response time. This lets any on-call engineer verify the endpoint in seconds during an incident.
- Test the health endpoint independently of the ELB in your CI/CD pipeline. A curl check against the /health path as part of the integration test suite catches path changes before they reach production.
Frequently Asked Questions
Q: How long does it take for an ELB target to become healthy after the root cause is resolved?
A: Recovery time depends on your health check interval and healthy threshold. With a 30-second interval and a healthy threshold of 3, it takes a minimum of 90 seconds from the first successful check for a target to reach the healthy state. During an active incident, temporarily reduce the interval to 10 seconds and the healthy threshold to 2, cutting recovery time to 20 seconds. Restore the original values once the incident is closed.
Q: Can I trigger an ELB health check manually without waiting for the interval?
A: No. AWS does not expose an API to trigger on-demand health checks; you must wait for the next scheduled probe. If faster feedback is needed during troubleshooting, temporarily reduce the health check interval, observe the next few check results in the console or via describe-target-health, then restore the original interval.
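The wait-and-observe loop is easy to automate. A sketch with the AWS call injected as a plain function, so the polling logic stays testable; in practice fetch_state would wrap describe-target-health via the CLI or boto3:

```python
import time
from typing import Callable

def wait_until_healthy(fetch_state: Callable[[str], str],
                       target_id: str,
                       interval_s: float = 10.0,
                       max_checks: int = 30) -> bool:
    """Poll `fetch_state(target_id)` (expected to return the
    TargetHealth.State string) until it reports 'healthy'.
    Returns False if `max_checks` probes pass without success."""
    for _ in range(max_checks):
        if fetch_state(target_id) == "healthy":
            return True
        time.sleep(interval_s)
    return False
```

Matching interval_s to the target group's health check interval means each poll observes at most one new probe result, which keeps API traffic minimal.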
Q: Why does my instance show healthy in the EC2 console but unhealthy in the target group?
A: EC2 instance health checks and ELB target group health checks are entirely separate mechanisms. EC2 status checks verify hypervisor-level and OS-level availability (system reachability and instance reachability). ELB target group health checks verify that the application process is responding correctly on a specific port and path with an acceptable HTTP status code. An instance can pass both EC2 status checks while the application process inside it is crashed, bound to the wrong address, or returning errors.
Q: What is the difference between HealthCheckProtocol HTTP and HTTPS in a target group?
A: With HTTP, the ALB sends a plain TCP-based HTTP request to the target. With HTTPS, the ALB first negotiates a TLS session and then sends the HTTP request over the encrypted channel. By default, ALB does not validate the target's certificate chain (it accepts self-signed and expired certificates), but the TLS handshake itself must complete successfully. If the target process is not configured to serve TLS on the health check port, the HTTPS health check will fail at the connection level before any HTTP exchange occurs.
Q: How do I diagnose health check failures without SSH access to the instance?
A: Enable VPC Flow Logs on the target subnet and filter for REJECT actions on the health check port to identify security group blocks. Check the ELB access logs in S3; the target_status_code field reveals exactly what HTTP code the target returned. Use the describe-target-health CLI command; the Description field often contains the specific HTTP status code or connection error. For ECS workloads, use CloudWatch Container Insights or AWS Systems Manager Session Manager to access container logs without opening SSH.
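ALB access log entries are space-separated, with target_status_code as the tenth field (field order assumed from the published ALB access log format: type, time, elb, client:port, target:port, three processing times, elb_status_code, target_status_code, ...); a value of "-" means the target never produced a response. A minimal extractor:

```python
def target_status_code(log_line: str) -> str:
    """Pull the target_status_code field (index 9) out of one ALB
    access log entry. '-' means no response from the target."""
    return log_line.split(" ")[9]
```

Piping the S3 log files through this and counting the distinct values tells you in seconds whether targets are returning 404s, timing out ('-'), or not being reached at all.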
Q: Can an NLB health check fail for different reasons than an ALB health check?
A: Yes. NLB TCP health checks only verify that the target accepts a TCP connection; they do not evaluate an HTTP response. A service that accepts and immediately closes a TCP connection still passes a TCP health check. HTTP health checks on NLBs behave similarly to ALB checks. Also note that older NLBs have no security group of their own (NLB security group support arrived in 2023 and can only be attached at creation) and pass client source IPs through to the targets. On such NLBs, the instance security group must allow inbound traffic from the full client IP range (and the NLB subnet CIDRs for health checks specifically), not just from a load balancer security group ID.
Q: My ECS task health check passes but the ALB target health check still fails. Why?
A: ECS container health checks (defined in the task definition's healthCheck block) run inside the container network namespace and are evaluated by the ECS agent independently of the ALB. The ALB health check probes the target's network interface from outside. Confirm that the task definition's portMappings block correctly maps the container port to a host port, and verify the ECS service's security group allows inbound traffic from the ALB security group on that mapped port. Also check that the health check grace period in the ECS service definition is long enough for the application to initialize before the first ALB probe counts against it.
Q: How do I configure ALB health checks correctly in Terraform to avoid drift?
A: Use the health_check block inside aws_lb_target_group and explicitly set every field rather than relying on AWS defaults.
resource "aws_lb_target_group" "app" {
name = "app-tg"
port = 8080
protocol = "HTTP"
vpc_id = aws_vpc.main.id
health_check {
path = "/health"
protocol = "HTTP"
healthy_threshold = 2
unhealthy_threshold = 3
timeout = 10
interval = 30
matcher = "200"
}
}

Pinning every attribute prevents unexpected state changes when AWS updates default values and eliminates the drift that occurs when engineers modify settings in the console.
Q: Which CloudWatch metrics should I monitor to detect ELB health check problems before they cause an outage?
A: Monitor UnHealthyHostCount (alert when greater than 0), HealthyHostCount (alert when it drops below the minimum required capacity for your service), HTTPCode_Target_5XX_Count, and TargetConnectionErrorCount. For NLBs, also watch TCP_ELB_Reset_Count. Set evaluation periods to 1–2 minutes so you catch transient failures before they cause sustained degradation and before autoscaling terminates all unhealthy instances.
Q: How do I temporarily exclude an instance from health checks during maintenance without removing it from the target group?
A: Deregister the target before performing maintenance. In-flight connections drain within the configured deregistration delay (default 300 seconds, configurable down to 0 for emergency maintenance). After maintenance is complete, re-register the instance and monitor the health check status until it transitions to healthy before considering the maintenance window closed.
# Deregister before maintenance
aws elbv2 deregister-targets \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
--targets Id=i-0a1b2c3d4e5f67890,Port=8080
# Re-register after maintenance
aws elbv2 register-targets \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
    --targets Id=i-0a1b2c3d4e5f67890,Port=8080

Q: Can weighted routing help reduce blast radius while I debug a health check failure on a subset of targets?
A: ALB does not support per-target weight within a single target group. However, you can configure weighted target groups in an ALB listener rule to direct a percentage of traffic to a secondary target group. For example, route 90% of traffic to a known-good target group while routing 10% to the group under investigation. This limits user impact while you diagnose the failure in a live traffic environment. Use the AWS console or CLI to set the weight attribute on each target group action in the listener rule.
