InfraRunBook

    AWS Auto Scaling Not Triggering

    Cloud
    Published: Apr 16, 2026
    Updated: Apr 16, 2026

    Diagnose and fix AWS Auto Scaling groups that refuse to scale out, covering CloudWatch alarm failures, policy misconfigurations, capacity limits, launch template errors, and health check issues with real CLI commands.


    Symptoms

    You're watching your application crawl under load. CPU is pinned at 90%, response times are climbing, and users are complaining — but your Auto Scaling Group is sitting there doing absolutely nothing. No new instances. No scale-out events in the activity history. Just silence while your application burns.

    The signs are usually unmistakable once you know what to look for:

    • The ASG desired capacity hasn't changed despite sustained high load
    • CloudWatch metrics show CPU or request count well above your alarm threshold
    • The EC2 Auto Scaling activity history shows no recent scaling events
    • Your load balancer is returning elevated error rates or climbing response latency
    • No scaling notifications have fired on your SNS topic
    • A manual increase of desired capacity works fine — but the automatic trigger never fires

    The tricky part is that Auto Scaling failures are almost always silent. The service doesn't page you. It doesn't log a user-visible error. It just doesn't act, and you find out when a human notices the application struggling. This runbook walks through every realistic cause, how to diagnose it with real CLI output, and how to fix it under pressure.


    Root Cause 1: CloudWatch Alarm Not Firing

    The entire Auto Scaling chain starts with a CloudWatch alarm. If that alarm never transitions to the ALARM state, nothing downstream will trigger — full stop. This single point of failure catches a lot of engineers off-guard because the alarm and the scaling policy can look completely correct in the console, but the alarm is stuck in INSUFFICIENT_DATA or stubbornly sitting at OK when it shouldn't be.

    The most common reason this happens is that the metric simply isn't being published. Basic EC2 monitoring only emits metrics every 5 minutes. A short CPU spike can start and end without ever producing enough datapoints to cross a threshold configured for 1-minute evaluation periods. A misconfigured metric namespace, wrong dimension keys, or a stale custom metric publisher will also produce INSUFFICIENT_DATA indefinitely — the alarm has no way to tell you it's looking at the wrong thing.
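    To see how a short spike vanishes under basic monitoring, here's a small simulation (plain Python, no AWS calls) that aggregates a per-second CPU timeline into per-period datapoints the way CloudWatch would for an alarm on the Average statistic:

```python
def sample(cpu_timeline, period_s):
    # cpu_timeline: CPU% per second; CloudWatch stores one aggregated
    # datapoint per period (here: the average over each period).
    return [
        sum(cpu_timeline[i:i + period_s]) / period_s
        for i in range(0, len(cpu_timeline), period_s)
    ]

# 10 minutes of telemetry: idle at 20%, with a 3-minute spike to 95%.
timeline = [20] * 180 + [95] * 180 + [20] * 240

# Basic monitoring (5-minute periods) averages the spike away entirely:
assert max(sample(timeline, 300)) < 80
# Detailed monitoring (1-minute periods) sees three breaching datapoints:
assert sum(1 for v in sample(timeline, 60) if v > 80) == 3
```

    The same load profile either breaches an 80% threshold three times or never, depending purely on the sampling period — which is exactly why enabling detailed monitoring is step one.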

    Start by checking the alarm state directly:

    aws cloudwatch describe-alarms \
      --alarm-names "my-asg-high-cpu" \
      --query 'MetricAlarms[*].{Name:AlarmName,State:StateValue,Reason:StateReason}'

    If the alarm is broken, you'll see something like this:

    [
        {
            "Name": "my-asg-high-cpu",
            "State": "INSUFFICIENT_DATA",
            "Reason": "Unchecked: Initial alarm creation"
        }
    ]

    Next, verify the metric actually has data. For a standard ASG CPU alarm:

    aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 \
      --metric-name CPUUtilization \
      --dimensions Name=AutoScalingGroupName,Value=my-asg \
      --start-time 2024-01-15T12:00:00Z \
      --end-time 2024-01-15T13:00:00Z \
      --period 60 \
      --statistics Average

    If the Datapoints array comes back empty, EC2 isn't publishing that metric for your ASG. Enable detailed monitoring on your instances to get 1-minute resolution:

    aws ec2 monitor-instances --instance-ids i-0abc123def456789a

    For custom application metrics, check that the namespace and dimension names in your alarm exactly match what your application is publishing. A single mismatched dimension key — even a capitalization difference between AutoScalingGroupName and autoscalinggroupname — will result in INSUFFICIENT_DATA forever. Also verify your alarm's DatapointsToAlarm setting isn't requiring more breaching datapoints per evaluation window than your workload realistically produces.
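    The DatapointsToAlarm interaction is easier to reason about with the evaluation rule written out. A minimal sketch of CloudWatch's "M out of N" logic, assuming an alarm on the Average statistic with a greater-than threshold:

```python
def alarm_should_fire(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    # CloudWatch evaluates the last N (evaluation_periods) datapoints and
    # fires when at least M (datapoints_to_alarm) of them breach the threshold.
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for v in window if v > threshold)
    return breaching >= datapoints_to_alarm

# A brief spike produces only one breaching 1-minute datapoint in the
# evaluation window, so a "3 out of 3" alarm never fires:
spike = [40, 45, 92, 50, 48]
assert not alarm_should_fire(spike, threshold=80, evaluation_periods=3, datapoints_to_alarm=3)

# Sustained load produces enough breaching datapoints:
sustained = [40, 85, 91, 94]
assert alarm_should_fire(sustained, threshold=80, evaluation_periods=3, datapoints_to_alarm=3)
```

    If your alarm demands three breaches but realistic load only ever produces one or two inside the window, the alarm is configured to never fire.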


    Root Cause 2: Scaling Policy Misconfiguration

    Even if your alarm fires perfectly, a broken scaling policy will silently absorb the signal and do nothing. Scaling policy issues are particularly sneaky because the ASG activity log often won't record a failure — it simply won't record anything at all.

    In my experience, the most common policy mistake involves step scaling: the step adjustments don't cover the actual metric range. You configure a step for 0–10 above the threshold and another for 10–20 above, but when your metric shoots 40 points above the threshold, it falls into a gap where no adjustment is defined. Zero action. The other failure mode I see frequently is using PercentChangeInCapacity with a small percentage on a small ASG. If you have 2 instances and set a 10% adjustment, 10% of 2 is 0.2 instances. Auto Scaling rounds values between 0 and 1 up to a single instance, so every scale-out adds exactly one instance no matter how severe the breach is. Set MinAdjustmentMagnitude if each scale-out needs to add a meaningful amount of capacity.
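    You can sanity-check a step configuration offline. This sketch mirrors the documented step-matching rule: the breach amount (metric value minus alarm threshold) must fall inside some step's bounds, or no action is taken at all:

```python
def resolve_step_adjustment(breach, steps):
    # steps: list of (lower_bound, upper_bound, adjustment) tuples relative
    # to the alarm threshold; upper_bound=None means unbounded (the catch-all).
    for lower, upper, adjustment in steps:
        if breach >= lower and (upper is None or breach < upper):
            return adjustment
    return None  # no step matched: Auto Scaling does nothing

# Steps cover 0-10 and 10-20 above the threshold, but nothing beyond 20.
gappy = [(0, 10, 1), (10, 20, 2)]
assert resolve_step_adjustment(5, gappy) == 1
assert resolve_step_adjustment(40, gappy) is None  # severe breach, zero action

# A final step with no upper bound always catches the worst breaches.
safe = [(0, 10, 1), (10, 20, 2), (20, None, 4)]
assert resolve_step_adjustment(40, safe) == 4
```

    The gappy configuration scales for mild breaches and does nothing for severe ones — the worst possible inversion during an incident.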

    Inspect your policy configuration in full:

    aws autoscaling describe-policies \
      --auto-scaling-group-name my-asg \
      --query 'ScalingPolicies[*].{Name:PolicyName,Type:PolicyType,AdjType:AdjustmentType,Steps:StepAdjustments}'

    For a correctly configured step scaling policy, the output should show no gaps and no upper bound on the final step:

    [
        {
            "Name": "scale-out-cpu",
            "Type": "StepScaling",
            "AdjType": "ChangeInCapacity",
            "Steps": [
                {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 10.0, "ScalingAdjustment": 1},
                {"MetricIntervalLowerBound": 10.0, "MetricIntervalUpperBound": 25.0, "ScalingAdjustment": 2},
                {"MetricIntervalLowerBound": 25.0, "ScalingAdjustment": 4}
            ]
        }
    ]

    That last step with no MetricIntervalUpperBound is the catch-all. If your last step has an upper bound, any metric value above it triggers no action at all. To fix a misconfigured step scaling policy, recreate it:

    aws autoscaling put-scaling-policy \
      --auto-scaling-group-name my-asg \
      --policy-name scale-out-cpu \
      --policy-type StepScaling \
      --adjustment-type ChangeInCapacity \
      --metric-aggregation-type Average \
      --estimated-instance-warmup 90 \
      --step-adjustments \
        MetricIntervalLowerBound=0,MetricIntervalUpperBound=10,ScalingAdjustment=1 \
        MetricIntervalLowerBound=10,MetricIntervalUpperBound=25,ScalingAdjustment=2 \
        MetricIntervalLowerBound=25,ScalingAdjustment=4

    For target tracking policies, the failure mode is a target value that's simply set too high. If your target CPU is 80% and your fleet regularly operates at 75%, the policy won't scale until you're well past comfortable territory. Set targets based on actual load testing data, not intuition.
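    The target value math is worth making concrete. Target tracking behaves roughly like proportional control for metrics that fall as capacity grows; this is a simplification of the real algorithm, which also suppresses scale-in aggressively:

```python
import math

def target_tracking_desired(current_capacity, metric_value, target_value):
    # Rough model: capacity is scaled by the ratio of the observed metric
    # to the target, rounded up so scale-out errs toward more capacity.
    return math.ceil(current_capacity * (metric_value / target_value))

# With a target of 80%, a fleet of 10 running at 75% CPU triggers nothing:
assert target_tracking_desired(10, 75, 80) == 10
# A 60% target would already have added headroom at the same load:
assert target_tracking_desired(10, 75, 60) == 13
```

    Seen this way, "target too high" isn't a bug — it's the policy faithfully concluding that 75% observed against an 80% target needs zero extra instances.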


    Root Cause 3: Min/Max Capacity Limits Reached

    This is the single most common root cause I encounter in production environments. The ASG has already hit its MaxSize, so even when the alarm fires and the policy executes correctly, the group literally cannot add more instances. The system is working exactly as configured — it's just hitting a ceiling you set and forgot about.

    There's a less obvious variant: AWS service quotas. Every AWS account has a per-region quota on running On-Demand instances (measured in vCPUs for the standard instance families). If you've hit that quota, new instance launches fail silently from the ASG perspective. Similarly, if your VPC subnet has exhausted its available private IPs in the relevant availability zones, launches will fail without an obvious connection to capacity limits.

    Check the current capacity situation first:

    aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names my-asg \
      --query 'AutoScalingGroups[*].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity,InService:Instances[?LifecycleState==`InService`] | length(@)}'

    [
        {
            "Min": 2,
            "Max": 10,
            "Desired": 10,
            "InService": 10
        }
    ]

    When Desired equals Max, you've hit the ceiling. Confirm by looking at the activity log — you'll often see a scale-out attempt logged as successfully updating capacity from N to N, which is how AWS records a no-op against the max boundary. Check your EC2 instance quota as well:

    aws service-quotas get-service-quota \
      --service-code ec2 \
      --quota-code L-1216C47A \
      --query 'Quota.Value'

    The fix for hitting the MaxSize ceiling is to increase it to whatever your architecture and cost controls can support:

    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name my-asg \
      --max-size 25

    If you need a quota increase, submit it through Service Quotas. Requests for standard instance families are typically processed within a few hours:

    aws service-quotas request-service-quota-increase \
      --service-code ec2 \
      --quota-code L-1216C47A \
      --desired-value 500
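    A quick way to reason about which ceiling you're hitting is to compute the headroom explicitly. This sketch treats the quota in instance terms for simplicity; the real L-1216C47A quota is measured in vCPUs, so convert before comparing:

```python
def scale_out_headroom(desired, max_size, running_in_region, instance_quota=None):
    # Headroom is limited first by the ASG's own MaxSize, then (if known)
    # by the account-level regional quota shared across ALL your instances.
    headroom = max_size - desired
    if instance_quota is not None:
        headroom = min(headroom, instance_quota - running_in_region)
    return max(headroom, 0)

# Desired == Max: the alarm can fire all it wants, nothing will launch.
assert scale_out_headroom(desired=10, max_size=10, running_in_region=10) == 0
# Room in the ASG, but the account quota is the real constraint:
assert scale_out_headroom(desired=10, max_size=25, running_in_region=98,
                          instance_quota=100) == 2
```

    The second case is the sneaky one: the ASG configuration looks fine, and the limit lives in a completely different service.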

    Don't set MaxSize arbitrarily high just to avoid the problem. Set it to a number that reflects your actual capacity planning and cost tolerance. An ASG with MaxSize set to 1000 and no spend alerting is a future incident waiting for a trigger. Revisit your max values whenever your traffic patterns change significantly — a limit set during early growth stages will become a surprise constraint during a traffic event if you never update it.


    Root Cause 4: Launch Template or Configuration Error

    When Auto Scaling tries to launch a new instance and that launch fails, the ASG records a failed activity, backs off, and tries again later. Meanwhile, your desired capacity stays the same, your alarm stays in ALARM, and nothing is actually helping your load problem. From the outside it looks like scaling isn't triggering. In reality it's triggering and failing immediately — a critical distinction because the fix is completely different.

    This almost always happens right after an infrastructure change. Someone updated the launch template with an AMI ID that doesn't exist in this region. A security group referenced in the template got deleted. An IAM instance profile was renamed without updating the launch template. A subnet was removed and the ASG's availability zone list wasn't updated to match. These changes break instance launches instantly and completely, with no automated rollback.

    The activity log is your first stop:

    aws autoscaling describe-scaling-activities \
      --auto-scaling-group-name my-asg \
      --max-items 10 \
      --query 'Activities[*].{Status:StatusCode,Time:StartTime,Message:StatusMessage}'

    A failed launch looks like this:

    [
        {
            "Status": "Failed",
            "Time": "2024-01-15T14:23:11Z",
            "Message": "The image id '[ami-0deadbeef12345678]' does not exist (Service: AmazonEC2; Status Code: 400; Error Code: InvalidAMIID.NotFound)"
        }
    ]

    That error message tells you exactly what's broken. Now inspect the current launch template to confirm what it's referencing:

    aws ec2 describe-launch-template-versions \
      --launch-template-name my-app-lt \
      --versions '$Latest' \
      --query 'LaunchTemplateVersions[*].LaunchTemplateData.{AMI:ImageId,Type:InstanceType,SGs:SecurityGroupIds,Profile:IamInstanceProfile}'

    Validate the AMI exists in this specific region:

    aws ec2 describe-images \
      --image-ids ami-0deadbeef12345678 \
      --query 'Images[*].{State:State,Name:Name}'

    An empty array means the AMI is either gone or was never available in this region — a common issue when teams copy launch templates across regions without updating the AMI ID. Create a new template version with a valid AMI and point your ASG at it:

    aws ec2 create-launch-template-version \
      --launch-template-name my-app-lt \
      --source-version '$Latest' \
      --launch-template-data '{"ImageId":"ami-0newvalidimage12345"}'
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name my-asg \
      --launch-template LaunchTemplateName=my-app-lt,Version='$Latest'

    After fixing the template, immediately test by manually bumping desired capacity by one and watching the activity log for a successful launch before you're back under pressure.


    Root Cause 5: Health Check Grace Period Too Long

    The health check grace period is a buffer window after a new instance launches, during which Auto Scaling won't terminate that instance for failing health checks. The intent is reasonable — your application needs time to start up before health checks mean anything. But this setting has side effects that make scaling appear broken when it's actually working incorrectly in a different way.

    The first effect: with ELB health checks enabled, instances won't receive production traffic until they pass the load balancer's health check. If your grace period is much longer than your actual startup time, the ELB can clear the instance long before Auto Scaling's grace period expires. The instance is healthy and serving traffic, but the grace period is still counting down. This is mostly benign but delays accurate metric reporting.

    The more disruptive effect — and the one that looks exactly like scaling not working — is when the grace period is shorter than your actual application startup time. Auto Scaling evaluates health before the app is ready, marks the instance unhealthy, terminates it, and launches a replacement. You end up in a thrash loop where new instances constantly launch and die without ever serving traffic. Your CloudWatch metrics stay elevated, the alarm stays in ALARM, and nothing improves. You might even see your instance count briefly tick up before falling back down, which looks like partial scaling that immediately reverses.

    Check the current grace period and health check type:

    aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names my-asg \
      --query 'AutoScalingGroups[*].{GracePeriod:HealthCheckGracePeriod,HealthType:HealthCheckType}'

    To confirm a thrash loop, look for rapid launch-then-terminate sequences in the activity log:

    aws autoscaling describe-scaling-activities \
      --auto-scaling-group-name my-asg \
      --max-items 30 \
      --query 'Activities[*].{Status:StatusCode,Description:Description,Time:StartTime}'

    Repeated Launching entries followed shortly by Terminating entries with a health check failure reason confirm the loop. Set the grace period to match your actual measured startup time — not a guess, not a worst-case from years ago:

    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name my-asg \
      --health-check-grace-period 120
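    If you're scripting this check, the pattern to look for is a launch and a termination of the same instance only minutes apart, repeated. A heuristic sketch over activity-log tuples of (timestamp, event, instance ID); the field names are illustrative, not the raw API shape:

```python
from datetime import datetime, timedelta

def looks_like_thrash(activities, window=timedelta(minutes=5)):
    # Count instances terminated within `window` of their own launch;
    # more than one such pair strongly suggests a grace-period thrash loop.
    launch_times = {}
    short_lived = 0
    for ts, event, instance in sorted(activities):
        if event == "Launching":
            launch_times[instance] = ts
        elif event == "Terminating" and instance in launch_times:
            if ts - launch_times[instance] <= window:
                short_lived += 1
    return short_lived >= 2

t0 = datetime(2024, 1, 15, 14, 0)
thrash = [
    (t0, "Launching", "i-aaa"),
    (t0 + timedelta(minutes=3), "Terminating", "i-aaa"),
    (t0 + timedelta(minutes=4), "Launching", "i-bbb"),
    (t0 + timedelta(minutes=7), "Terminating", "i-bbb"),
]
assert looks_like_thrash(thrash)
assert not looks_like_thrash([(t0, "Launching", "i-ccc")])
```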

    For applications with variable startup times, lifecycle hooks give you precise control. The instance holds in Pending:Wait until your bootstrap script signals completion:

    aws autoscaling put-lifecycle-hook \
      --auto-scaling-group-name my-asg \
      --lifecycle-hook-name app-ready \
      --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
      --heartbeat-timeout 180 \
      --default-result CONTINUE

    Your instance bootstrap script signals when the application is actually ready to serve traffic. Fetch the instance ID through IMDSv2, since instances that enforce it reject tokenless metadata requests:

    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
    aws autoscaling complete-lifecycle-action \
      --auto-scaling-group-name my-asg \
      --lifecycle-hook-name app-ready \
      --instance-id "$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)" \
      --lifecycle-action-result CONTINUE

    Root Cause 6: Cooldown Period Blocking Subsequent Scale-Outs

    Auto Scaling's cooldown mechanism exists to prevent runaway scaling. After any scale-out or scale-in event, the ASG enters a cooldown period during which additional scaling actions won't fire. The default is 300 seconds — five full minutes. In a fast-moving incident where load doubles every two minutes, this cooldown means you might execute one scale-out event while your capacity need grows far beyond what that single event addresses.

    This is most painful with simple scaling policies, which respect the ASG default cooldown strictly. Step scaling and target tracking policies have their own instance warm-up concept that's a bit more nuanced, but you can still run into effective blocking delays if these values aren't tuned.
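    The cooldown rule itself is simple to state in code. After a scaling activity, the clock must fully run out before a simple scaling policy may act again:

```python
from datetime import datetime, timedelta

def scaling_allowed(now, last_scaling_event, cooldown_seconds=300):
    # Simple scaling refuses to act until the full cooldown window since
    # the last scaling activity has elapsed, regardless of alarm state.
    return now - last_scaling_event >= timedelta(seconds=cooldown_seconds)

last = datetime(2024, 1, 15, 14, 0, 0)
# Two minutes into a 300-second cooldown, a second scale-out is suppressed
# no matter how hard the alarm is breaching:
assert not scaling_allowed(datetime(2024, 1, 15, 14, 2, 0), last)
# Only after the window expires can the next action fire:
assert scaling_allowed(datetime(2024, 1, 15, 14, 5, 1), last)
```

    In a load profile that doubles every two minutes, those five suppressed minutes are the whole incident.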

    Look at your activity log timestamps to identify the pattern:

    aws autoscaling describe-scaling-activities \
      --auto-scaling-group-name my-asg \
      --max-items 10 \
      --query 'Activities[*].{Status:StatusCode,Start:StartTime,Cause:Cause}'

    If you see a successful scale-out at 14:00:00 and the next event only at 14:05:15, you hit the default 300-second cooldown. Check what the ASG default is set to:

    aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names my-asg \
      --query 'AutoScalingGroups[*].DefaultCooldown'

    For most modern workloads, drop the effective cooldown to something closer to your instance warm-up time and attach it directly to the policy rather than relying on the ASG default:

    aws autoscaling put-scaling-policy \
      --auto-scaling-group-name my-asg \
      --policy-name scale-out-cpu \
      --policy-type StepScaling \
      --adjustment-type ChangeInCapacity \
      --estimated-instance-warmup 90 \
      --step-adjustments \
        MetricIntervalLowerBound=0,MetricIntervalUpperBound=15,ScalingAdjustment=2 \
        MetricIntervalLowerBound=15,ScalingAdjustment=4

    In an active incident where you need capacity right now, bypass the cooldown entirely with a manual desired capacity override:

    aws autoscaling set-desired-capacity \
      --auto-scaling-group-name my-asg \
      --desired-capacity 20 \
      --no-honor-cooldown

    This is a valid emergency lever. Use it when you need to get ahead of the problem manually and let the automatic policies catch up once things stabilize.


    Root Cause 7: Instance Warm-Up Distorting Target Tracking Metrics

    Target tracking scaling policies exclude instances that are still in their configured warm-up period from aggregate metric calculations. The reasoning is sound — a half-started instance shouldn't drag down your average CPU reading and fool the policy into thinking you have more headroom than you do. But when the EstimatedInstanceWarmup value is set far higher than your actual startup time, healthy and fully operational instances get excluded from metric calculations for many minutes after they're already serving traffic.

    The result is a distorted view: your CPU average stays artificially high because contributing instances are excluded, the policy believes it needs more capacity, and it continues scaling when it doesn't need to — or conversely, keeps scaling when what you really need is for the existing new instances to be counted so the policy can see that load is being absorbed. Either way, the behavior looks wrong and the actual scaling logic is working on bad input data.
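    Here's the distortion in miniature: two hot instances plus two freshly launched ones that are already serving traffic. Whether the new pair counts toward the average completely changes what the policy sees:

```python
def tracked_average(cpu_by_instance, warming_up):
    # Target tracking aggregates the metric only over instances that have
    # finished their estimated warm-up; 'warming' instances are excluded.
    counted = {i: v for i, v in cpu_by_instance.items() if i not in warming_up}
    return sum(counted.values()) / len(counted)

cpu = {"i-old1": 85.0, "i-old2": 88.0, "i-new1": 40.0, "i-new2": 35.0}

# Oversized warm-up: the new, already-serving instances are excluded and
# the policy still sees a hot fleet:
assert tracked_average(cpu, warming_up={"i-new1", "i-new2"}) == 86.5
# Once counted, the same fleet averages to a much calmer number:
assert tracked_average(cpu, warming_up=set()) == 62.0
```

    An 86.5% reading keeps the policy scaling; a 62% reading against a 60% target is essentially at equilibrium — same fleet, same load, different warm-up setting.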

    Check what your target tracking policy has configured:

    aws autoscaling describe-policies \
      --auto-scaling-group-name my-asg \
      --policy-types TargetTrackingScaling \
      --query 'ScalingPolicies[*].{Name:PolicyName,Target:TargetTrackingConfiguration.TargetValue,WarmUp:TargetTrackingConfiguration.EstimatedInstanceWarmup}'

    If EstimatedInstanceWarmup is 600 and your application starts in 90 seconds, you're excluding healthy, in-service instances from metric calculations for an extra 510 seconds per scale-out event. Update the policy with a value grounded in measured startup time:

    aws autoscaling put-scaling-policy \
      --auto-scaling-group-name my-asg \
      --policy-name target-cpu-60 \
      --policy-type TargetTrackingScaling \
      --target-tracking-configuration '{
        "PredefinedMetricSpecification": {
          "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
        "EstimatedInstanceWarmup": 120
      }'

    Prevention

    The most important thing you can do is test your scaling before you need it. Set up a load test — k6, wrk, or even stress-ng on an instance — and verify that your ASG actually scales out in response to real load. Do this in staging. Do it again after every significant infrastructure change. A scaling policy that hasn't been tested is a policy you can't trust during an incident.

    Enable scaling activity notifications so you find out about failures the moment they happen, not when an engineer notices the application struggling:

    aws autoscaling put-notification-configuration \
      --auto-scaling-group-name my-asg \
      --topic-arn arn:aws:sns:us-east-1:123456789012:asg-alerts \
      --notification-types \
        autoscaling:EC2_INSTANCE_LAUNCH_ERROR \
        autoscaling:EC2_INSTANCE_TERMINATE_ERROR \
        autoscaling:EC2_INSTANCE_LAUNCH \
        autoscaling:EC2_INSTANCE_TERMINATE

    Create a CloudWatch alarm that watches for divergence between desired and in-service capacity. When these two values differ for more than a few minutes, something is actively wrong — either launches are failing or the policy isn't executing. A persistent gap is always worth an alert. The group metrics involved aren't published by default, so enable metrics collection first, then alarm on the difference with metric math:

    aws autoscaling enable-metrics-collection \
      --auto-scaling-group-name my-asg \
      --granularity 1Minute \
      --metrics GroupDesiredCapacity GroupInServiceInstances
    aws cloudwatch put-metric-alarm \
      --alarm-name "asg-desired-vs-inservice-gap" \
      --alarm-description "Desired capacity exceeds in-service count for too long" \
      --metrics '[
        {"Id":"gap","Expression":"desired - inservice","Label":"CapacityGap"},
        {"Id":"desired","ReturnData":false,"MetricStat":{"Stat":"Average","Period":300,"Metric":{"Namespace":"AWS/AutoScaling","MetricName":"GroupDesiredCapacity","Dimensions":[{"Name":"AutoScalingGroupName","Value":"my-asg"}]}}},
        {"Id":"inservice","ReturnData":false,"MetricStat":{"Stat":"Average","Period":300,"Metric":{"Namespace":"AWS/AutoScaling","MetricName":"GroupInServiceInstances","Dimensions":[{"Name":"AutoScalingGroupName","Value":"my-asg"}]}}}
      ]' \
      --evaluation-periods 3 \
      --threshold 1 \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

    Define all your ASG configurations, launch templates, scaling policies, and CloudWatch alarms in infrastructure-as-code. Drift between what's in Terraform or CloudFormation and what's actually deployed in the account is how a working scaling configuration silently becomes a broken one. When a manual change breaks scaling at 2am, the person investigating needs to know what the intended configuration was — and that information should be in version control, not in someone's memory.

    Review your MaxSize values as part of your regular capacity planning cycle. An ASG ceiling set during early growth stages will eventually become the thing that limits you during a traffic event, and you won't find out until you're already in the incident. Set it deliberately, attach spend alerts so an unexpected scale-out doesn't produce a surprise bill, and raise it proactively rather than reactively.

    Finally, keep your launch templates validated. Any automated pipeline that builds and publishes AMIs should also update the launch template and verify that a test instance can launch successfully. Catching a broken AMI reference in CI is dramatically better than catching it during a scale-out event under production load.

    Frequently Asked Questions

    Why is my Auto Scaling Group alarm in INSUFFICIENT_DATA state?

    INSUFFICIENT_DATA means the alarm has no metric datapoints to evaluate. This usually happens because EC2 basic monitoring only publishes every 5 minutes, your metric namespace or dimensions don't match what EC2 is publishing, or a custom metric publisher has stopped sending data. Enable detailed monitoring on your instances and verify the namespace and dimensions match exactly using aws cloudwatch get-metric-statistics.

    My CloudWatch alarm is in ALARM state but no new instances are launching — what do I check?

    Start with the scaling activity log using aws autoscaling describe-scaling-activities. Look for Failed status entries which indicate launch errors (bad AMI, deleted security group, IAM profile issues). If there are no activity entries at all, check your scaling policy for step coverage gaps or a MaxSize ceiling. If you see successful activities but no new instances, you've likely hit the max capacity limit.

    How do I find out if my ASG hit its maximum capacity limit?

    Run aws autoscaling describe-auto-scaling-groups with a query for MinSize, MaxSize, and DesiredCapacity. If DesiredCapacity equals MaxSize, you've hit the ceiling. You can also check aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A to see if your account-level EC2 instance quota is the constraint.

    What causes the health check grace period to make scaling appear broken?

    If the grace period is shorter than your application startup time, Auto Scaling marks new instances unhealthy before they're ready, terminates them, and immediately launches replacements. This creates a thrash loop where instances constantly launch and die without ever serving traffic. The fix is setting the grace period to slightly longer than your measured application startup time, or using lifecycle hooks for precise readiness signaling.

    How do I bypass Auto Scaling cooldowns during an active incident?

    Use aws autoscaling set-desired-capacity with the --no-honor-cooldown flag to immediately set a specific capacity target without waiting for the cooldown period to expire. This is the right emergency lever when you need to get ahead of a capacity problem while automatic policies catch up. After the incident, review your cooldown and warm-up settings to reduce the chance of needing to do this again.

    Can a step scaling policy silently do nothing even when the alarm fires?

    Yes. If your step adjustments don't cover the actual metric range — for example you have steps for 0-10 and 10-20 above the threshold but your metric breaches by 35 — the value falls in an uncovered range and no scaling action executes. The fix is ensuring your last step has no upper bound so it acts as a catch-all for any breach magnitude above a certain level.
