InfraRunBook

    AWS ELB Health Check Failing

    Cloud
    Published: Apr 5, 2026
    Updated: Apr 5, 2026

    Diagnose and fix AWS ELB health check failures with step-by-step root cause analysis covering wrong paths, security groups, SSL mismatches, and timeout issues. Includes real CLI commands and before/after outputs.

    AWS ELB Health Check Failing

    Symptoms

    When AWS Elastic Load Balancer health checks begin failing, the impact is immediate and visible across multiple layers of your AWS environment. Recognizing the exact combination of signals helps you skip the noise and move directly to root cause identification.

    • Target group instances show status unhealthy in the AWS Console under EC2 > Target Groups > Targets
    • CloudWatch metric UnHealthyHostCount is non-zero while HealthyHostCount trends toward zero
    • End users receive HTTP 502 Bad Gateway or 503 Service Unavailable responses from the ALB DNS endpoint
    • ELB access logs stored in S3 show target_status_code as - (no response) or unexpected 4xx/5xx values
    • Autoscaling groups repeatedly launch and terminate instances because newly registered targets never pass the health check and are never considered healthy
    • CloudWatch Alarms enter ALARM state on HTTPCode_ELB_5XX_Count or TargetConnectionErrorCount
    • Deployments stall in CodeDeploy or ECS because the replacement task or instance fails to transition to healthy within the wait timeout

    The moment you observe any of these signals, run the following command to retrieve the raw health state and embedded reason codes from AWS:

    aws elbv2 describe-target-health \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --region us-east-1

    Example output showing an unhealthy target:

    {
        "TargetHealthDescriptions": [
            {
                "Target": {
                    "Id": "i-0a1b2c3d4e5f67890",
                    "Port": 8080
                },
                "HealthCheckPort": "8080",
                "TargetHealth": {
                    "State": "unhealthy",
                    "Reason": "Target.FailedHealthChecks",
                    "Description": "Health checks failed with these codes: [404]"
                }
            }
        ]
    }

    The Reason and Description fields are your primary pivot points. Each section below maps a failure pattern to its root cause, a definitive identification method, and a concrete fix.
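    When the target group contains many targets, a quick local filter over a saved copy of the output keeps only those pivot fields in view. A minimal sketch using grep (the file name is an assumption, and the heredoc stands in for real describe-target-health output; with jq available, a jq filter is the cleaner option):

```shell
# Save the raw health output once, then pull out just the pivot fields.
# The heredoc below mirrors the describe-target-health JSON shown above.
cat > target-health.json <<'EOF'
{"TargetHealthDescriptions":[
 {"Target":{"Id":"i-0a1b2c3d4e5f67890","Port":8080},
  "TargetHealth":{"State":"unhealthy","Reason":"Target.FailedHealthChecks",
                  "Description":"Health checks failed with these codes: [404]"}}]}
EOF
# Extract target ID, state, reason code, and description for fast triage.
grep -oE '"(Id|State|Reason|Description)": *"[^"]*"' target-health.json
```

    In a live incident, replace the heredoc with a redirect from the actual CLI call and the same grep applies unchanged.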


    Root Cause 1: Wrong Health Check Path

    Why It Happens

    The default health check path for an Application Load Balancer target group is /. Many applications do not return a 200 on the root path: they may redirect to a login screen (returning 301), serve a page whose status code the matcher rejects, or simply return 404 because the application is mounted under a sub-path such as /api/v1 or /app. Any status code outside the configured success range counts as a failed probe, and once the unhealthy threshold of consecutive failures is reached the target is marked unhealthy. This is the single most common cause of health check failures after a new deployment or when migrating an existing application to a load balancer for the first time.

    How to Identify

    Check the current health check configuration for the target group:

    aws elbv2 describe-target-groups \
      --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --query 'TargetGroups[*].{Path:HealthCheckPath,Code:Matcher.HttpCode,Port:HealthCheckPort}' \
      --output table

    Sample output:

    --------------------------------------
    |        DescribeTargetGroups        |
    +-------+-----------+----------------+
    | Code  |   Path    |      Port      |
    +-------+-----------+----------------+
    |  200  |     /     | traffic-port   |
    +-------+-----------+----------------+

    Now verify what the application actually returns on that path. From a bastion or another EC2 instance in the same VPC subnet, curl the instance's private IP directly:

    curl -v http://10.0.1.45:8080/
    # Output:
    < HTTP/1.1 301 Moved Permanently
    < Location: https://10.0.1.45:8080/dashboard

    A 301 is not in the success matcher, so every health check fails. Identify the correct endpoint by asking the application team or inspecting the app routes:

    curl -sv http://10.0.1.45:8080/health
    < HTTP/1.1 200 OK
    {"status":"UP"}

    How to Fix

    Update the health check path to the endpoint that reliably returns 200:

    aws elbv2 modify-target-group \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --health-check-path /health \
      --matcher HttpCode=200

    After the update, the ELB resumes health checks immediately. With default settings (30-second interval, 3 healthy threshold), targets transition to healthy within roughly 90 seconds if the endpoint is responding correctly.
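    The 90-second figure follows directly from the configuration and can be sanity-checked with shell arithmetic (values are the defaults quoted above):

```shell
# Time from the first passing probe to the healthy state:
# HealthyThresholdCount consecutive successes, one per interval.
interval=30     # HealthCheckIntervalSeconds (default)
threshold=3     # HealthyThresholdCount (default)
echo "$(( interval * threshold )) seconds until the target is marked healthy"
```

    The same arithmetic explains why tightening the interval and threshold during an incident shortens recovery proportionally.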


    Root Cause 2: Security Group Blocking Health Check Port

    Why It Happens

    Application Load Balancers send health check probes from the ELB nodes, which reside inside your VPC in the subnets where the load balancer is deployed. The source IP of these probes is the private IP of the ALB node, not a public address. If the security group attached to your EC2 instances does not have an inbound rule permitting traffic from the ALB's security group on the health check port, the TCP connection is silently dropped before the HTTP layer is reached. The ELB records this as a connection timeout (Target.Timeout) or a refused connection (Target.FailedHealthChecks with no HTTP code). This is a common pitfall when security groups are tightened mid-incident or when IaC templates are applied that overwrite prior manual rules.

    How to Identify

    Retrieve the security group IDs for both the load balancer and the target instances:

    # Get ALB security groups
    aws elbv2 describe-load-balancers \
      --names app-alb-prod \
      --query 'LoadBalancers[*].SecurityGroups' \
      --output text
    # Output: sg-0123456789abcdef0
    
    # Get instance security groups
    aws ec2 describe-instances \
      --instance-ids i-0a1b2c3d4e5f67890 \
      --query 'Reservations[*].Instances[*].SecurityGroups' \
      --output table

    Now inspect the inbound rules of the instance's security group:

    aws ec2 describe-security-groups \
      --group-ids sg-0fedcba987654321 \
      --query 'SecurityGroups[*].IpPermissions' \
      --output json

    If no rule permits TCP port 8080 from sg-0123456789abcdef0 (the ALB security group), health check traffic is being dropped. Confirm via VPC Flow Logs with a REJECT filter:

    aws logs filter-log-events \
      --log-group-name /vpc/flowlogs/prod \
      --filter-pattern '[version, account, eni, src, dst="10.0.1.45", srcport, dstport="8080", protocol="6", packets, bytes, start, end, action="REJECT", status]' \
      --start-time 1700000000000

    How to Fix

    Add an inbound rule permitting the health check port from the ALB's security group:

    aws ec2 authorize-security-group-ingress \
      --group-id sg-0fedcba987654321 \
      --protocol tcp \
      --port 8080 \
      --source-group sg-0123456789abcdef0

    Security group changes take effect immediately — no instance reboot required. The next scheduled health check will succeed if the application is otherwise healthy.


    Root Cause 3: App Not Responding on Health Endpoint

    Why It Happens

    Even when the path is correct and the security group is open, the application process itself may be unable to respond. Common sub-causes include: the process crashed and was not restarted; the application is mid-startup and the health handler is not yet registered; the service is bound to 127.0.0.1 instead of 0.0.0.0, making it unreachable from the ELB node; or the health endpoint performs a live dependency check (database query, cache ping) that blocks indefinitely when a downstream service is unavailable. In ECS environments, container port mappings may be misconfigured so the container port is not exposed on the host network interface that the ALB targets.

    How to Identify

    SSH to the target instance (or exec into the container) and verify the process is running and bound to the correct network interface:

    ss -tlnp | grep 8080
    # Expected — listening on all interfaces:
    State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
    LISTEN  0       128     0.0.0.0:8080        0.0.0.0:*          users:(("java",pid=1234,fd=42))
    
    # Problematic — loopback only:
    LISTEN  0       128     127.0.0.1:8080      0.0.0.0:*          users:(("java",pid=1234,fd=42))

    Confirm the endpoint responds locally and measure latency:

    curl -sv http://127.0.0.1:8080/health
    # If this hangs or times out, inspect app logs:
    journalctl -u app-service -n 200 --no-pager | grep -i 'error\|warn\|exception'
    tail -n 200 /var/log/app/application.log

    Look for messages like Connection refused to 10.0.2.10:5432 or OutOfMemoryError; the former indicates the health handler is blocking on an unreachable dependency, the latter that the process itself is being killed.

    How to Fix

    • Process not running: restart the service (systemctl restart app-service), then check dmesg | grep -i oom for OOM kills and investigate the root cause
    • Bound to loopback: update the application bind address to 0.0.0.0 in the configuration file and redeploy
    • Health endpoint blocks on dependencies: refactor the health endpoint to return 200 immediately, confirming only that the process is alive. Move dependency validation to a separate deep-health endpoint that is never targeted by the ELB
    • ECS port mapping error: verify the task definition portMappings block maps containerPort 8080 to hostPort 8080 (or 0 for dynamic mapping), then redeploy the service

    The ELB health check endpoint should be shallow — confirm the process is alive and able to accept connections, nothing more.


    Root Cause 4: SSL Certificate Mismatch

    Why It Happens

    When the target group health check protocol is set to HTTPS, the ALB establishes a TLS handshake with the backend instance before sending the HTTP health check request. If the certificate served by the instance has expired, is self-signed with an untrusted CA, or does not match the expected hostname, the TLS negotiation phase fails and the health check is recorded as a connection error with no HTTP status code. This scenario commonly appears when: end-to-end encryption is enabled using self-signed certificates on backend instances; a certificate renewal was applied to some instances in a fleet but not all; or the health check protocol was recently changed from HTTP to HTTPS without updating the instance certificate configuration.

    How to Identify

    Check the current health check protocol:

    aws elbv2 describe-target-groups \
      --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --query 'TargetGroups[*].{Protocol:HealthCheckProtocol,Path:HealthCheckPath}' \
      --output table
    # Output: HTTPS / /health

    Attempt a TLS connection directly to the instance from within the VPC:

    openssl s_client -connect 10.0.1.45:8443 -servername sw-infrarunbook-01.solvethenetwork.com
    # Failure output:
    verify error:num=18:self signed certificate
    # or:
    verify error:num=10:certificate has expired
    
    subject=CN = sw-infrarunbook-01.internal
    issuer=CN = sw-infrarunbook-01.internal

    Check certificate validity dates explicitly:

    echo | openssl s_client -connect 10.0.1.45:8443 2>/dev/null | openssl x509 -noout -dates
    # notBefore=Jan  1 00:00:00 2023 GMT
    # notAfter=Jan  1 00:00:00 2024 GMT  <-- expired!
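    To catch expiry before the handshake starts failing, convert the notAfter date into days-until-expiry. A sketch using GNU date with the expired example date above (on macOS/BSD, date -d needs the -j -f form instead):

```shell
# Days until certificate expiry; a negative number means already expired.
not_after="Jan  1 00:00:00 2024 GMT"   # from `openssl x509 -noout -dates`
expiry_epoch=$(date -d "$not_after" +%s)
now_epoch=$(date +%s)
echo "$(( (expiry_epoch - now_epoch) / 86400 )) days until expiry"
```

    Wiring this into a daily cron or monitoring check gives the 30-day warning the Prevention section recommends.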

    How to Fix

    Option A — Confirm ALB certificate verification behavior. By default, Application Load Balancers do NOT validate the certificate chain of backend targets for HTTPS health checks. They establish TLS but accept self-signed and expired certificates. If health checks are still failing with HTTPS, the likely issue is that the target is not completing the TLS handshake at all (process not listening, wrong port). Confirm the protocol is actually HTTPS on the target:

    # Verify the app is actually serving TLS on the health check port
    nmap -p 8443 --script ssl-cert 10.0.1.45
    # If port shows 'closed' or no ssl-cert output, the app is not TLS-enabled on that port

    Option B — Renew or replace the certificate on the instance. If the TLS handshake itself fails (connection reset, certificate parse error), deploy a valid certificate:

    # Renew via certbot on sw-infrarunbook-01
    certbot renew --nginx --non-interactive
    
    # Verify the renewed certificate
    openssl x509 \
      -in /etc/letsencrypt/live/sw-infrarunbook-01.solvethenetwork.com/fullchain.pem \
      -noout -dates -subject
    # subject=CN = sw-infrarunbook-01.solvethenetwork.com
    # notAfter=Jul  5 00:00:00 2026 GMT

    Root Cause 5: Health Check Timeout Too Low

    Why It Happens

    The health check timeout defines how many seconds the ELB waits for a complete HTTP response before declaring the check failed. The default is 5 seconds for ALB and 10 seconds for NLB. If the application experiences JVM garbage collection pauses, performs a database query inside the health handler, has a slow initialization path, or is under high load that delays response times, the health endpoint may consistently exceed the configured timeout. The ELB interprets this as a failure, marks the target unhealthy, and stops sending traffic to it — even though the application would otherwise serve production requests successfully. This creates a situation where the load balancer is removing healthy instances from rotation, compounding load on the remaining targets and potentially triggering a cascading failure.

    How to Identify

    Retrieve the current timeout and interval configuration:

    aws elbv2 describe-target-groups \
      --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --query 'TargetGroups[*].{Timeout:HealthCheckTimeoutSeconds,Interval:HealthCheckIntervalSeconds,Threshold:UnhealthyThresholdCount}' \
      --output table
    # Output:
    # +----------+-----------+---------+
    # | Interval | Threshold | Timeout |
    # +----------+-----------+---------+
    # |    30    |     3     |    5    |
    # +----------+-----------+---------+

    Measure actual health endpoint response times across multiple samples from within the VPC:

    for i in {1..20}; do
      curl -o /dev/null -s -w "%{time_total}\n" http://10.0.1.45:8080/health
    done
    # Output:
    0.312
    0.289
    6.102   <-- exceeds 5s timeout!
    0.301
    7.891   <-- GC pause causing delay
    0.295
    0.310

    If the p95 or p99 response time consistently exceeds the configured timeout, this is your root cause. Correlate the timing of health check failures in CloudWatch with GC logs or slow-query logs on the instance.
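    The sampled numbers can be reduced to a worst-case figure and compared against the timeout locally. A rough sketch (sample values copied from the loop output above):

```shell
# Find the slowest sample and compare it against the configured timeout.
samples="0.312 0.289 6.102 0.301 7.891 0.295 0.310"
timeout_s=5
worst=$(printf '%s\n' $samples | sort -n | tail -n 1)
# awk handles the floating-point comparison that [ ] cannot.
awk -v w="$worst" -v t="$timeout_s" 'BEGIN { exit !(w > t) }' \
  && echo "worst case ${worst}s exceeds the ${timeout_s}s timeout" \
  || echo "worst case ${worst}s is within the ${timeout_s}s timeout"
```

    For production analysis, compute p95/p99 over a larger sample rather than the single maximum, which one outlier can dominate.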

    How to Fix

    Increase the timeout to provide headroom above the observed worst-case response time. The timeout must always be less than the interval:

    aws elbv2 modify-target-group \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --health-check-timeout-seconds 15 \
      --health-check-interval-seconds 30

    This resolves the immediate health check failure, but increasing the timeout is a mitigation, not a fix. The correct long-term solution is to ensure the health endpoint returns immediately without performing expensive operations. A well-designed health endpoint should look like this:

    GET /health HTTP/1.1
    Host: sw-infrarunbook-01.solvethenetwork.com
    
    HTTP/1.1 200 OK
    Content-Type: application/json
    Content-Length: 33
    
    {"status":"UP","version":"1.4.2"}

    Root Cause 6: HTTP Response Code Mismatch

    Why It Happens

    The ALB health check success condition is defined by the matcher, a configurable set of HTTP status codes. The default is 200 only. If your application's health endpoint legitimately returns 204 No Content or 202 Accepted, or if a reverse proxy in front of the app returns a 301 redirect that was not accounted for, the ELB marks the check as failed. This frequently surfaces after a deployment where the health endpoint behavior was changed, or when a new framework version alters the default response code for empty-body responses.

    How to Identify

    # Check what the endpoint actually returns
    curl -o /dev/null -s -w "%{http_code}\n" http://10.0.1.45:8080/health
    # Output: 204
    
    # Check the current matcher
    aws elbv2 describe-target-groups \
      --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --query 'TargetGroups[*].Matcher'
    # Output: {"HttpCode": "200"}

    The application returns 204 but the matcher only accepts 200 — the health check fails on every probe.

    How to Fix

    aws elbv2 modify-target-group \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --matcher HttpCode=200-299

    Alternatively, keep the matcher strict at 200 and update the application to return 200 from the health endpoint. Standardizing on 200 across all services makes the ELB configuration consistent and easier to audit.
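    The matcher evaluation is a simple range check on the status code, which a few lines of shell can model. This makes clear why 204 fails against the default matcher but passes against 200-299 (a local sketch, not the actual ELB implementation):

```shell
# Approximate the matcher evaluation for a single probe result.
# usage: matches <observed_code> <range_min> <range_max>
matches() {
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}
matches 204 200 200 && echo "204 vs matcher 200: pass"     || echo "204 vs matcher 200: fail"
matches 204 200 299 && echo "204 vs matcher 200-299: pass" || echo "204 vs matcher 200-299: fail"
```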


    Root Cause 7: Target Registered on Wrong Port

    Why It Happens

    When targets are registered manually via the console, CLI, or through IaC templates with a hardcoded port value, it is easy to register the instance on the wrong port — for example, port 80 while the application listens on 8080. The ELB will attempt the health check on the registered port, receive a connection-refused error, and mark the target unhealthy. This sub-cause is particularly common in environments where the application port is managed as a variable in a deployment pipeline and the ELB target group registration was not updated to match.

    How to Identify

    aws elbv2 describe-target-health \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}'
    # Output:
    # [{"ID":"i-0a1b2c3d4e5f67890","Port":80,"State":"unhealthy","Reason":"Target.FailedHealthChecks"}]
    
    # Verify what the instance is actually listening on
    ss -tlnp | grep LISTEN
    # Shows: 0.0.0.0:8080 -- app is on 8080, not 80

    How to Fix

    # Deregister the target on the wrong port
    aws elbv2 deregister-targets \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --targets Id=i-0a1b2c3d4e5f67890,Port=80
    
    # Register on the correct port
    aws elbv2 register-targets \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --targets Id=i-0a1b2c3d4e5f67890,Port=8080

    Prevention

    Health check failures are almost always preventable with disciplined configuration standards and deployment safeguards. Applying the following controls reduces the probability of recurring outages and accelerates diagnosis when they do occur.

    • Implement a dedicated /health endpoint in every application. The endpoint should return 200 within milliseconds without querying databases, caches, or external services. It should only confirm the process is alive and ready to accept traffic. Move dependency checks to a separate /health/deep endpoint that is never targeted by the ELB.
    • Codify all health check configuration in IaC. Use Terraform, CloudFormation, or CDK to define health check path, protocol, healthy threshold, unhealthy threshold, timeout, interval, and matcher as explicit, version-controlled values. Never configure health checks manually in the console where they can drift silently.
    • Lock security group rules in IaC and enforce with AWS Config. Create a custom AWS Config rule that verifies the load balancer's security group ID appears as an inbound source on the target security group for the health check port. Alert on any deviation.
    • Set CloudWatch alarms on UnHealthyHostCount > 0 with a 1-minute evaluation period and notify the on-call engineer immediately. Do not wait for end users to report 502 errors before you know targets are unhealthy.
    • Validate health checks during every deployment. In CodeDeploy blue/green deployments, register one replacement instance and verify it reaches healthy status before continuing the rollout. In ECS rolling deployments, set the minimum healthy percent and health check grace period to allow adequate initialization time.
    • Monitor SSL certificate expiry proactively. Use ACM managed renewal for public certificates. For private or imported certificates, create an EventBridge rule that fires on ACM expiry notifications 30 days in advance and routes alerts to the ops team.
    • Document the health check path in the service runbook. Every service should have a one-line entry that states the health check URL, expected response code, and approximate response time. This lets any on-call engineer verify the endpoint in seconds during an incident.
    • Test the health endpoint independently of the ELB in your CI/CD pipeline. A curl check against the /health path as part of the integration test suite catches path changes before they reach production.
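    The last point can be wired into a pipeline as a small gate. A sketch in which check_health receives the status code that curl -w '%{http_code}' reports (the function name and APP_URL variable are illustrative):

```shell
# Fail a CI stage when the health endpoint stops returning 200.
check_health() {
  if [ "$1" = "200" ]; then
    echo "health endpoint OK"
    return 0
  fi
  echo "health check failed: got HTTP $1" >&2
  return 1
}
# In the pipeline, feed it the live status code, e.g.:
#   check_health "$(curl -o /dev/null -s -w '%{http_code}' "$APP_URL/health")"
check_health 200
```

    Returning nonzero is enough for most CI systems to abort the rollout before the broken path reaches the target group.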

    Frequently Asked Questions

    Q: How long does it take for an ELB target to become healthy after the root cause is resolved?

    A: Recovery time depends on your health check interval and healthy threshold. With the default settings (30-second interval, healthy threshold of 3), it takes a minimum of 90 seconds from the first successful check for a target to reach the healthy state. During an active incident, temporarily reduce the interval to 10 seconds and the healthy threshold to 2, cutting recovery time to 20 seconds. Restore default values once the incident is closed.

    Q: Can I trigger an ELB health check manually without waiting for the interval?

    A: No. AWS does not expose an API to trigger on-demand health checks. You must wait for the next scheduled probe. If faster feedback is needed during troubleshooting, reduce the health check interval to 10 seconds temporarily, observe the next few check results in the console or via describe-target-health, then restore the original interval.

    Q: Why does my instance show healthy in the EC2 console but unhealthy in the target group?

    A: EC2 instance health checks and ELB target group health checks are entirely separate mechanisms. EC2 status checks verify hypervisor-level and OS-level availability (system reachability and instance reachability). ELB target group health checks verify that the application process is responding correctly on a specific port and path with an acceptable HTTP status code. An instance can pass both EC2 status checks while the application process inside it is crashed, bound to the wrong address, or returning errors.

    Q: What is the difference between HealthCheckProtocol HTTP and HTTPS in a target group?

    A: With HTTP, the ALB sends a plain TCP-based HTTP request to the target. With HTTPS, the ALB first negotiates a TLS session and then sends the HTTP request over the encrypted channel. By default, ALB does not validate the target's certificate chain (it accepts self-signed and expired certificates), but the TLS handshake itself must complete successfully. If the target process is not configured to serve TLS on the health check port, the HTTPS health check will fail at the connection level before any HTTP exchange occurs.

    Q: How do I diagnose health check failures without SSH access to the instance?

    A: Enable VPC Flow Logs on the target subnet and filter for traffic on the health check port with REJECT actions to identify security group blocks. Check the ELB access logs in S3; the target_status_code field reveals exactly what HTTP code the target returned. Use the describe-target-health CLI command; the Description field often contains the specific HTTP status code or connection error. For ECS workloads, use CloudWatch Container Insights or AWS Systems Manager Session Manager to access container logs without opening SSH.

    Q: Can an NLB health check fail for different reasons than an ALB health check?

    A: Yes. NLB TCP health checks only verify that the target accepts a TCP connection — they do not evaluate an HTTP response. A service that accepts and immediately closes a TCP connection passes a TCP health check. HTTP health checks on NLBs behave similarly to ALB checks. Critically, NLBs do not have security groups — they pass client source IPs directly to the targets. Your instance security group must therefore allow inbound traffic from the full client IP range (or the NLB subnet CIDRs for health checks specifically), not just from a load balancer security group ID.

    Q: My ECS task health check passes but the ALB target health check still fails. Why?

    A: ECS container health checks (defined in the task definition's healthCheck block) run inside the container network namespace and are evaluated by the ECS agent independently of the ALB. The ALB health check probes the host-level ENI. Confirm that the task definition's portMappings block correctly maps the container port to a host port, and verify the ECS service's security group allows inbound from the ALB security group on that mapped port. Also check that the health check grace period in the ECS service definition is sufficient for the application to initialize before the first ALB probe.

    Q: How do I configure ALB health checks correctly in Terraform to avoid drift?

    A: Use the health_check block inside aws_lb_target_group and explicitly set every field rather than relying on AWS defaults.

    resource "aws_lb_target_group" "app" {
      name     = "app-tg"
      port     = 8080
      protocol = "HTTP"
      vpc_id   = aws_vpc.main.id
    
      health_check {
        path                = "/health"
        protocol            = "HTTP"
        healthy_threshold   = 2
        unhealthy_threshold = 3
        timeout             = 10
        interval            = 30
        matcher             = "200"
      }
    }

    Pinning every attribute prevents unexpected state changes when AWS updates default values and eliminates the drift that occurs when engineers modify settings in the console.

    Q: Which CloudWatch metrics should I monitor to detect ELB health check problems before they cause an outage?

    A: Monitor UnHealthyHostCount (alert when greater than 0), HealthyHostCount (alert when it drops below the minimum required capacity for your service), HTTPCode_Target_5XX_Count, and TargetConnectionErrorCount. For NLBs, also watch TCP_ELB_Reset_Count. Set evaluation periods to 1–2 minutes so you catch transient failures before they cause sustained degradation and before autoscaling terminates all unhealthy instances.

    Q: How do I temporarily exclude an instance from health checks during maintenance without removing it from the target group?

    A: Deregister the target before performing maintenance. In-flight connections drain within the configured deregistration delay (default 300 seconds, configurable down to 0 for emergency maintenance). After maintenance is complete, re-register the instance and monitor the health check status until it transitions to healthy before considering the maintenance window closed.

    # Deregister before maintenance
    aws elbv2 deregister-targets \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --targets Id=i-0a1b2c3d4e5f67890,Port=8080
    
    # Re-register after maintenance
    aws elbv2 register-targets \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456 \
      --targets Id=i-0a1b2c3d4e5f67890,Port=8080

    Q: Can weighted routing help reduce blast radius while I debug a health check failure on a subset of targets?

    A: ALB does not support per-target weight within a single target group. However, you can configure weighted target groups in an ALB listener rule to direct a percentage of traffic to a secondary target group. For example, route 90% of traffic to a known-good target group while routing 10% to the group under investigation. This limits user impact while you diagnose the failure in a live traffic environment. Use the AWS console or CLI to set the weight attribute on each target group action in the listener rule.
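    The weighted forward action is expressed as a JSON actions document passed to modify-rule. A sketch of the shape, using the article's example ARN for the primary group (the app-tg-debug ARN is hypothetical; weights here split traffic 90/10):

```shell
# Write the weighted forward action document; apply it with:
#   aws elbv2 modify-rule --rule-arn <rule-arn> --actions file://actions.json
cat > actions.json <<'EOF'
[{"Type": "forward",
  "ForwardConfig": {"TargetGroups": [
    {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/abc123def456",
     "Weight": 90},
    {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg-debug/0123456789abcdef",
     "Weight": 10}]}}]
EOF
# Sanity-check the document locally before applying it.
python3 -m json.tool actions.json > /dev/null && echo "actions.json is well-formed"
```

    Shifting the weights back to 100/0 after diagnosis restores normal routing without touching target registrations.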
