## Symptoms
You're watching your application crawl under load. CPU is pinned at 90%, response times are climbing, and users are complaining — but your Auto Scaling Group is sitting there doing absolutely nothing. No new instances. No scale-out events in the activity history. Just silence while your application burns.
The signs are usually unmistakable once you know what to look for:
- The ASG desired capacity hasn't changed despite sustained high load
- CloudWatch metrics show CPU or request count well above your alarm threshold
- The EC2 Auto Scaling activity history shows no recent scaling events
- Your load balancer is returning elevated error rates or climbing response latency
- No scaling notifications have fired on your SNS topic
- A manual increase of desired capacity works fine — but the automatic trigger never fires
The tricky part is that Auto Scaling failures are almost always silent. The service doesn't page you. It doesn't log a user-visible error. It just doesn't act, and you find out when a human notices the application struggling. This runbook walks through every realistic cause, how to diagnose it with real CLI output, and how to fix it under pressure.
## Root Cause 1: CloudWatch Alarm Not Firing
The entire Auto Scaling chain starts with a CloudWatch alarm. If that alarm never transitions to the `ALARM` state, nothing downstream will trigger — full stop. This single point of failure catches a lot of engineers off-guard because the alarm and the scaling policy can look completely correct in the console, but the alarm is stuck in `INSUFFICIENT_DATA` or stubbornly sitting at `OK` when it shouldn't be.
The most common reason this happens is that the metric simply isn't being published. Basic EC2 monitoring only emits metrics every 5 minutes. A short CPU spike can start and end without ever producing enough datapoints to cross a threshold configured for 1-minute evaluation periods. A misconfigured metric namespace, wrong dimension keys, or a stale custom metric publisher will also produce `INSUFFICIENT_DATA` indefinitely — the alarm has no way to tell you it's looking at the wrong thing.
Start by checking the alarm state directly:
```shell
aws cloudwatch describe-alarms \
  --alarm-names "my-asg-high-cpu" \
  --query 'MetricAlarms[*].{Name:AlarmName,State:StateValue,Reason:StateReason}'
```
If the alarm is broken, you'll see something like this:
```json
[
  {
    "Name": "my-asg-high-cpu",
    "State": "INSUFFICIENT_DATA",
    "Reason": "Unchecked: Initial alarm creation"
  }
]
```
Next, verify the metric actually has data. For a standard ASG CPU alarm:
```shell
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=AutoScalingGroupName,Value=my-asg \
  --start-time 2024-01-15T12:00:00Z \
  --end-time 2024-01-15T13:00:00Z \
  --period 60 \
  --statistics Average
```
If the `Datapoints` array comes back empty, EC2 isn't publishing that metric for your ASG. Enable detailed monitoring on your instances to get 1-minute resolution:
```shell
aws ec2 monitor-instances --instance-ids i-0abc123def456789a
```
For custom application metrics, check that the namespace and dimension names in your alarm exactly match what your application is publishing. A single mismatched dimension key — even a capitalization difference between `AutoScalingGroupName` and `autoscalinggroupname` — will result in `INSUFFICIENT_DATA` forever. Also verify your alarm's `DatapointsToAlarm` setting: CloudWatch fires when M of the last N datapoints breach, so if M is set higher than the number of breaching datapoints a realistic spike actually produces within the evaluation window, the alarm will never transition.
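That M-out-of-N evaluation is simple enough to model against a realistic spike shape, which makes for a quick sanity check on your alarm settings. A minimal Python sketch — the function name is mine, not a CloudWatch API:

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Illustrative model of CloudWatch's M-out-of-N evaluation: the alarm
    fires when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints breach the threshold."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for d in window if d > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# A two-minute spike never satisfies a 3-of-3 alarm on 1-minute periods...
print(alarm_state([50, 95, 96, 40], threshold=80,
                  evaluation_periods=3, datapoints_to_alarm=3))  # OK
# ...but a 2-of-3 configuration catches it:
print(alarm_state([50, 95, 96, 40], threshold=80,
                  evaluation_periods=3, datapoints_to_alarm=2))  # ALARM
```

Feed in a trace shaped like your real load spikes; if the model never returns `ALARM`, neither will CloudWatch.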
## Root Cause 2: Scaling Policy Misconfiguration
Even if your alarm fires perfectly, a broken scaling policy will silently absorb the signal and do nothing. Scaling policy issues are particularly sneaky because the ASG activity log often won't record a failure — it simply won't record anything at all.
In my experience, the most common policy mistake involves step scaling: the step adjustments don't cover the actual metric range. You configure a step for 0–10 above the threshold and another for 10–20 above, but when your metric shoots 40 points above the threshold, it falls into a gap where no adjustment is defined. Zero action. The other failure mode I see frequently is using `PercentChangeInCapacity` with a small percentage on a small ASG. If you have 2 instances and set a 10% adjustment, 10% of 2 is 0.2 instances. Auto Scaling rounds results between 0 and 1 up to a single instance, and rounds larger results toward zero — 12% of 10 is 1.2, which still launches only one instance — so percentage policies on small groups scale far more slowly than the percentage suggests. Set `MinAdjustmentMagnitude` if you need to guarantee a minimum step size.
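The documented rounding rules for percentage adjustments are easy to model, and modeling them shows why small groups barely move. An illustrative Python sketch — the helper name is mine, not an AWS API:

```python
import math

def percent_change_adjustment(current_capacity, percent, min_adjustment_magnitude=None):
    """Illustrative model of PercentChangeInCapacity rounding: results
    between 0 and 1 round to 1, between -1 and 0 round to -1, and
    anything else is truncated toward zero."""
    raw = current_capacity * percent / 100.0
    if 0 < raw < 1:
        adjustment = 1
    elif -1 < raw < 0:
        adjustment = -1
    else:
        adjustment = math.trunc(raw)  # 1.2 -> 1, -6.6 -> -6
    # MinAdjustmentMagnitude enforces a floor on the step size
    if min_adjustment_magnitude and abs(adjustment) < min_adjustment_magnitude:
        adjustment = min_adjustment_magnitude if adjustment >= 0 else -min_adjustment_magnitude
    return adjustment

print(percent_change_adjustment(2, 10))                               # 1
print(percent_change_adjustment(10, 12))                              # 1
print(percent_change_adjustment(10, 12, min_adjustment_magnitude=2))  # 2
```

A 12% policy on a 10-instance group adds one instance per event; if your load doubles, that cadence will never catch up without a `MinAdjustmentMagnitude` floor.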
Inspect your policy configuration in full:
```shell
aws autoscaling describe-policies \
  --auto-scaling-group-name my-asg \
  --query 'ScalingPolicies[*].{Name:PolicyName,Type:PolicyType,AdjType:AdjustmentType,Steps:StepAdjustments}'
```
For a correctly configured step scaling policy, the output should show no gaps and no upper bound on the final step:
```json
[
  {
    "Name": "scale-out-cpu",
    "Type": "StepScaling",
    "AdjType": "ChangeInCapacity",
    "Steps": [
      {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 10.0, "ScalingAdjustment": 1},
      {"MetricIntervalLowerBound": 10.0, "MetricIntervalUpperBound": 25.0, "ScalingAdjustment": 2},
      {"MetricIntervalLowerBound": 25.0, "ScalingAdjustment": 4}
    ]
  }
]
```
That last step with no `MetricIntervalUpperBound` is the catch-all. If your last step has an upper bound, any metric value above it triggers no action at all. To fix a misconfigured step scaling policy, recreate it:
```shell
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name scale-out-cpu \
  --policy-type StepScaling \
  --adjustment-type ChangeInCapacity \
  --metric-aggregation-type Average \
  --estimated-instance-warmup 90 \
  --step-adjustments \
    MetricIntervalLowerBound=0,MetricIntervalUpperBound=10,ScalingAdjustment=1 \
    MetricIntervalLowerBound=10,MetricIntervalUpperBound=25,ScalingAdjustment=2 \
    MetricIntervalLowerBound=25,ScalingAdjustment=4
```
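A quick way to audit existing policies for both failure modes — gaps between steps and a bounded final step — is to run the `StepAdjustments` array from `describe-policies` through a small checker. An illustrative sketch, not an AWS tool:

```python
def validate_steps(steps):
    """Flag gaps and a bounded final step in a scale-out StepScaling
    policy (bounds are offsets above the alarm threshold)."""
    issues = []
    ordered = sorted(steps, key=lambda s: s["MetricIntervalLowerBound"])
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.get("MetricIntervalUpperBound") != cur["MetricIntervalLowerBound"]:
            issues.append(f"gap between {prev.get('MetricIntervalUpperBound')} "
                          f"and {cur['MetricIntervalLowerBound']}")
    if "MetricIntervalUpperBound" in ordered[-1]:
        issues.append("final step has an upper bound: large breaches trigger no action")
    return issues

steps = [
    {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 10, "ScalingAdjustment": 1},
    {"MetricIntervalLowerBound": 20, "MetricIntervalUpperBound": 30, "ScalingAdjustment": 2},
]
print(validate_steps(steps))  # reports the 10-20 gap and the bounded last step
```

Pipe the `Steps` output from the `describe-policies` query above into this and an uncovered range shows up immediately, instead of during an incident.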
For target tracking policies, the failure mode is a target value that's simply set too high. If your target CPU is 80% and your fleet regularly operates at 75%, the policy won't scale until you're well past comfortable territory. Set targets based on actual load testing data, not intuition.
## Root Cause 3: Min/Max Capacity Limits Reached
This is the single most common root cause I encounter in production environments. The ASG has already hit its `MaxSize`, so even when the alarm fires and the policy executes correctly, the group literally cannot add more instances. The system is working exactly as configured — it's just hitting a ceiling you set and forgot about.
There's a less obvious variant: AWS service quotas. Every AWS account has a per-region limit on running EC2 instances. If you've hit that quota, new instance launches fail silently from the ASG perspective. Similarly, if your VPC subnet has exhausted its available private IPs in the relevant availability zones, launches will fail without an obvious connection to capacity limits.
Check the current capacity situation first:
```shell
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-asg \
  --query 'AutoScalingGroups[*].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity,InService:Instances[?LifecycleState==`InService`] | length(@)}'
```
```json
[
  {
    "Min": 2,
    "Max": 10,
    "Desired": 10,
    "InService": 10
  }
]
```
When `Desired` equals `Max`, you've hit the ceiling. Confirm by looking at the activity log — you'll often see a scale-out attempt logged as successfully updating capacity from N to N, which is how AWS records a no-op against the max boundary. Check your EC2 instance quota as well:
```shell
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --query 'Quota.Value'
```
The fix for hitting the MaxSize ceiling is to increase it to whatever your architecture and cost controls can support:
```shell
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --max-size 25
```
If you need a quota increase, submit through Service Quotas. Standard instance types typically process within a few hours:
```shell
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 500
```
Don't set `MaxSize` arbitrarily high just to avoid the problem. Set it to a number that reflects your actual capacity planning and cost tolerance. An ASG with `MaxSize` set to 1000 and no spend alerting is a future incident waiting for a trigger. Revisit your max values whenever your traffic patterns change significantly — a limit set during early growth stages will become a surprise constraint during a traffic event if you never update it.
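When triaging, it helps to check all three ceilings at once — ASG max, account quota, and launches that are failing outright. A small illustrative helper sketching the decision logic (not an AWS API):

```python
def scale_out_headroom(desired, max_size, in_service, account_running, account_quota):
    """Report which ceiling blocks the next scale-out, given the ASG's
    configured limits and the account-level EC2 instance quota."""
    blockers = []
    if desired >= max_size:
        blockers.append("ASG MaxSize reached")
    if account_running >= account_quota:
        blockers.append("EC2 account quota reached")
    if in_service < desired:
        blockers.append("launches failing or in progress: in-service below desired")
    return blockers or ["headroom available"]

# Values from describe-auto-scaling-groups and get-service-quota above:
print(scale_out_headroom(desired=10, max_size=10, in_service=10,
                         account_running=42, account_quota=64))
```

Each blocker maps to a different fix — raise `MaxSize`, request a quota increase, or jump to Root Cause 4 for failed launches — so knowing which one applies first saves a diagnostic loop.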
## Root Cause 4: Launch Template or Configuration Error
When Auto Scaling tries to launch a new instance and that launch fails, the ASG records a failed activity, backs off, and tries again later. Meanwhile, your desired capacity stays the same, your alarm stays in `ALARM`, and nothing is actually helping your load problem. From the outside it looks like scaling isn't triggering. In reality it's triggering and failing immediately — a critical distinction because the fix is completely different.
This almost always happens right after an infrastructure change. Someone updated the launch template with an AMI ID that doesn't exist in this region. A security group referenced in the template got deleted. An IAM instance profile was renamed without updating the launch template. A subnet was removed and the ASG's availability zone list wasn't updated to match. These changes break instance launches instantly and completely, with no automated rollback.
The activity log is your first stop:
```shell
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-asg \
  --max-items 10 \
  --query 'Activities[*].{Status:StatusCode,Time:StartTime,Message:StatusMessage}'
```
A failed launch looks like this:
```json
[
  {
    "Status": "Failed",
    "Time": "2024-01-15T14:23:11Z",
    "Message": "The image id '[ami-0deadbeef12345678]' does not exist (Service: AmazonEC2; Status Code: 400; Error Code: InvalidAMIID.NotFound)"
  }
]
```
That error message tells you exactly what's broken. Now inspect the current launch template to confirm what it's referencing:
```shell
aws ec2 describe-launch-template-versions \
  --launch-template-name my-app-lt \
  --versions '$Latest' \
  --query 'LaunchTemplateVersions[*].LaunchTemplateData.{AMI:ImageId,Type:InstanceType,SGs:SecurityGroupIds,Profile:IamInstanceProfile}'
```
Validate the AMI exists in this specific region:
```shell
aws ec2 describe-images \
  --image-ids ami-0deadbeef12345678 \
  --query 'Images[*].{State:State,Name:Name}'
```
An empty array means the AMI is either gone or was never available in this region — a common issue when teams copy launch templates across regions without updating the AMI ID. Create a new template version with a valid AMI and point your ASG at it:
```shell
aws ec2 create-launch-template-version \
  --launch-template-name my-app-lt \
  --source-version '$Latest' \
  --launch-template-data '{"ImageId":"ami-0newvalidimage12345"}'

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --launch-template LaunchTemplateName=my-app-lt,Version='$Latest'
```
After fixing the template, immediately test by manually bumping desired capacity by one and watching the activity log for a successful launch before you're back under pressure.
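The `StatusMessage` strings in the activity log follow predictable patterns, so a small lookup table can speed up triage. The error codes below are standard EC2 API codes, but the mapping and regex patterns are illustrative, not exhaustive:

```python
import re

# Illustrative mapping of common launch-failure error codes to a likely fix.
FAILURE_PATTERNS = {
    r"InvalidAMIID\.NotFound":       "AMI missing in this region: update the launch template",
    r"InvalidGroup\.NotFound":       "security group deleted: fix the template reference",
    r"InsufficientInstanceCapacity": "AZ out of capacity: add instance types or AZs",
    r"VcpuLimitExceeded":            "EC2 vCPU quota reached: request an increase",
}

def classify_launch_failure(status_message):
    """Map an ASG activity StatusMessage to a probable remediation."""
    for pattern, advice in FAILURE_PATTERNS.items():
        if re.search(pattern, status_message):
            return advice
    return "unrecognized failure: read the full StatusMessage"

msg = ("The image id '[ami-0deadbeef12345678]' does not exist "
       "(Service: AmazonEC2; Status Code: 400; Error Code: InvalidAMIID.NotFound)")
print(classify_launch_failure(msg))
```

Wiring something like this into the SNS notification handler from the Prevention section means the page that wakes you up already names the probable fix.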
## Root Cause 5: Health Check Grace Period Too Long
The health check grace period is a buffer window after a new instance launches, during which Auto Scaling won't terminate that instance for failing health checks. The intent is reasonable — your application needs time to start up before health checks mean anything. But this setting has side effects that make scaling appear broken when it's actually working incorrectly in a different way.
The first effect: with ELB health checks enabled, instances won't receive production traffic until they pass the load balancer's health check. If your grace period is much longer than your actual startup time, the ELB can clear the instance long before Auto Scaling's grace period expires. The instance is healthy and serving traffic, but the grace period is still counting down. This is mostly benign but delays accurate metric reporting.
The more disruptive effect — and the one that looks exactly like scaling not working — is when the grace period is shorter than your actual application startup time. Auto Scaling evaluates health before the app is ready, marks the instance unhealthy, terminates it, and launches a replacement. You end up in a thrash loop where new instances constantly launch and die without ever serving traffic. Your CloudWatch metrics stay elevated, the alarm stays in `ALARM`, and nothing improves. You might even see your instance count briefly tick up before falling back down, which looks like partial scaling that immediately reverses.
Check the current grace period and health check type:
```shell
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-asg \
  --query 'AutoScalingGroups[*].{GracePeriod:HealthCheckGracePeriod,HealthType:HealthCheckType}'
```
To confirm a thrash loop, look for rapid launch-then-terminate sequences in the activity log:
```shell
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-asg \
  --max-items 30 \
  --query 'Activities[*].{Status:StatusCode,Description:Description,Time:StartTime}'
```
Repeated `Launching` entries followed shortly by `Terminating` entries with a health check failure reason confirm the loop. Set the grace period to match your actual measured startup time — not a guess, not a worst-case from years ago:
```shell
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --health-check-grace-period 120
```
For applications with variable startup times, lifecycle hooks give you precise control. The instance holds in `Pending:Wait` until your bootstrap script signals completion:
```shell
aws autoscaling put-lifecycle-hook \
  --auto-scaling-group-name my-asg \
  --lifecycle-hook-name app-ready \
  --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
  --heartbeat-timeout 180 \
  --default-result CONTINUE
```
Your instance bootstrap script signals when the application is actually ready to serve traffic:
```shell
# Fetch the instance ID via IMDSv2 (token-based; IMDSv1 is often disabled)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

aws autoscaling complete-lifecycle-action \
  --auto-scaling-group-name my-asg \
  --lifecycle-hook-name app-ready \
  --instance-id "$INSTANCE_ID" \
  --lifecycle-action-result CONTINUE
```
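Rather than eyeballing timestamps for the launch-then-terminate pattern, you can confirm a thrash loop programmatically. A sketch that assumes activity records reduced to a time and description (the thresholds are mine, tune to taste):

```python
from datetime import datetime, timedelta

def detect_thrash(activities, window_minutes=10, min_cycles=3):
    """Count launches and terminations inside a recent window of the
    activity history and flag a probable grace-period thrash loop."""
    launches = terminations = 0
    cutoff = max(a["time"] for a in activities) - timedelta(minutes=window_minutes)
    for a in activities:
        if a["time"] < cutoff:
            continue
        if a["description"].startswith("Launching"):
            launches += 1
        elif a["description"].startswith("Terminating"):
            terminations += 1
    return launches >= min_cycles and terminations >= min_cycles

acts = [{"time": datetime(2024, 1, 15, 14, m), "description": d}
        for m, d in [(0, "Launching a new EC2 instance"), (2, "Terminating EC2 instance"),
                     (3, "Launching a new EC2 instance"), (5, "Terminating EC2 instance"),
                     (6, "Launching a new EC2 instance"), (8, "Terminating EC2 instance")]]
print(detect_thrash(acts))  # True
```

Three full launch/terminate cycles in ten minutes is almost never legitimate scale-in; it's instances dying before the grace period lets them prove themselves healthy.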
## Root Cause 6: Cooldown Period Blocking Subsequent Scale-Outs
Auto Scaling's cooldown mechanism exists to prevent runaway scaling. After any scale-out or scale-in event, the ASG enters a cooldown period during which additional scaling actions won't fire. The default is 300 seconds — five full minutes. In a fast-moving incident where load doubles every two minutes, this cooldown means you might execute one scale-out event while your capacity need grows far beyond what that single event addresses.
This is most painful with simple scaling policies, which respect the ASG default cooldown strictly. Step scaling and target tracking policies have their own instance warm-up concept that's a bit more nuanced, but you can still run into effective blocking delays if these values aren't tuned.
Look at your activity log timestamps to identify the pattern:
```shell
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-asg \
  --max-items 10 \
  --query 'Activities[*].{Status:StatusCode,Start:StartTime,Cause:Cause}'
```
If you see a successful scale-out at 14:00:00 and the next event only at 14:05:15, you hit the default 300-second cooldown. Check what the ASG default is set to:
```shell
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-asg \
  --query 'AutoScalingGroups[*].DefaultCooldown'
```
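The cooldown arithmetic is trivial, but it's easy to misjudge mid-incident; a minimal sketch of the check (helper name is mine):

```python
from datetime import datetime, timedelta

def cooldown_blocks(last_scaling_time, now, cooldown_seconds=300):
    """Return the remaining cooldown if a simple scaling action would
    still be suppressed at `now`, else None."""
    remaining = timedelta(seconds=cooldown_seconds) - (now - last_scaling_time)
    return remaining if remaining > timedelta(0) else None

last = datetime(2024, 1, 15, 14, 0, 0)
print(cooldown_blocks(last, datetime(2024, 1, 15, 14, 3, 20)))  # 0:01:40 remaining
print(cooldown_blocks(last, datetime(2024, 1, 15, 14, 6, 0)))   # None
```

If load doubles faster than that remaining window closes, the cooldown is your bottleneck, not the policy.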
For most modern workloads, drop the effective cooldown to something closer to your instance warm-up time and attach it directly to the policy rather than relying on the ASG default:
```shell
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name scale-out-cpu \
  --policy-type StepScaling \
  --adjustment-type ChangeInCapacity \
  --estimated-instance-warmup 90 \
  --step-adjustments \
    MetricIntervalLowerBound=0,MetricIntervalUpperBound=15,ScalingAdjustment=2 \
    MetricIntervalLowerBound=15,ScalingAdjustment=4
```
In an active incident where you need capacity right now, bypass the cooldown entirely with a manual desired capacity override:
```shell
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 20 \
  --no-honor-cooldown
```
This is a valid emergency lever. Use it when you need to get ahead of the problem manually and let the automatic policies catch up once things stabilize.
## Root Cause 7: Instance Warm-Up Distorting Target Tracking Metrics
Target tracking scaling policies exclude instances that are still in their configured warm-up period from aggregate metric calculations. The reasoning is sound — a half-started instance shouldn't drag down your average CPU reading and fool the policy into thinking you have more headroom than you do. But when the `EstimatedInstanceWarmup` value is set far higher than your actual startup time, healthy and fully operational instances get excluded from metric calculations for many minutes after they're already serving traffic.
The result is a distorted view: the CPU average stays artificially high because the very instances absorbing the new load are excluded from it, so the policy keeps scaling past the point where it should have stopped. The scaling logic itself is working correctly — it's just working on bad input data, and from the outside the behavior looks wrong.
Check what your target tracking policy has configured:
```shell
aws autoscaling describe-policies \
  --auto-scaling-group-name my-asg \
  --policy-types TargetTrackingScaling \
  --query 'ScalingPolicies[*].{Name:PolicyName,Target:TargetTrackingConfiguration.TargetValue,WarmUp:EstimatedInstanceWarmup}'
```
If `EstimatedInstanceWarmup` is 600 and your application starts in 90 seconds, you're excluding healthy, in-service instances from metric calculations for an extra 510 seconds per scale-out event. Update the policy with a value grounded in measured startup time:
```shell
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name target-cpu-60 \
  --policy-type TargetTrackingScaling \
  --estimated-instance-warmup 120 \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "TargetValue": 60.0
  }'
```
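To see how warm-up exclusion distorts the aggregate, model the average the policy actually sees. An illustrative sketch, not the AWS implementation:

```python
def asg_average_cpu(cpus, warming_up):
    """Average CPU over only the instances past their estimated warm-up;
    `warming_up` marks instances the policy excludes."""
    counted = [c for c, w in zip(cpus, warming_up) if not w]
    return sum(counted) / len(counted)

cpus = [85, 85, 85, 20, 20]              # two new instances already absorbing load
fresh = [False, False, False, True, True]

print(asg_average_cpu(cpus, [False] * 5))  # 59.0 with everyone counted
print(asg_average_cpu(cpus, fresh))        # 85.0 while warm-up excludes the new ones
```

Against a 60% target, the first number would stop further scale-out and the second would keep it going — same fleet, same load, different `EstimatedInstanceWarmup`.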
## Prevention
The most important thing you can do is test your scaling before you need it. Set up a load test — k6, wrk, or even `stress-ng` on an instance — and verify that your ASG actually scales out in response to real load. Do this in staging. Do it again after every significant infrastructure change. A scaling policy that hasn't been tested is a policy you can't trust during an incident.
Enable scaling activity notifications so you find out about failures the moment they happen, not when an engineer notices the application struggling:
```shell
aws autoscaling put-notification-configuration \
  --auto-scaling-group-name my-asg \
  --topic-arn arn:aws:sns:us-east-1:123456789012:asg-alerts \
  --notification-types \
    autoscaling:EC2_INSTANCE_LAUNCH_ERROR \
    autoscaling:EC2_INSTANCE_TERMINATE_ERROR \
    autoscaling:EC2_INSTANCE_LAUNCH \
    autoscaling:EC2_INSTANCE_TERMINATE
```
Create a CloudWatch alarm that watches for divergence between desired and in-service capacity. When these two values differ for more than a few minutes, something is actively wrong — either launches are failing or the policy isn't executing. A persistent gap is always worth an alert:
```shell
# Requires ASG group metrics to be enabled first:
#   aws autoscaling enable-metrics-collection \
#     --auto-scaling-group-name my-asg --granularity "1Minute"
aws cloudwatch put-metric-alarm \
  --alarm-name "asg-desired-vs-inservice-gap" \
  --alarm-description "Desired capacity exceeds in-service count for too long" \
  --evaluation-periods 3 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --metrics '[
    {"Id":"gap","Expression":"desired - inservice","Label":"CapacityGap","ReturnData":true},
    {"Id":"desired","ReturnData":false,"MetricStat":{"Stat":"Average","Period":300,
      "Metric":{"Namespace":"AWS/AutoScaling","MetricName":"GroupDesiredCapacity",
      "Dimensions":[{"Name":"AutoScalingGroupName","Value":"my-asg"}]}}},
    {"Id":"inservice","ReturnData":false,"MetricStat":{"Stat":"Average","Period":300,
      "Metric":{"Namespace":"AWS/AutoScaling","MetricName":"GroupInServiceInstances",
      "Dimensions":[{"Name":"AutoScalingGroupName","Value":"my-asg"}]}}}
  ]'
```
Define all your ASG configurations, launch templates, scaling policies, and CloudWatch alarms in infrastructure-as-code. Drift between what's in Terraform or CloudFormation and what's actually deployed in the account is how a working scaling configuration silently becomes a broken one. When a manual change breaks scaling at 2am, the person investigating needs to know what the intended configuration was — and that information should be in version control, not in someone's memory.
Review your `MaxSize` values as part of your regular capacity planning cycle. An ASG ceiling set during early growth stages will eventually become the thing that limits you during a traffic event, and you won't find out until you're already in the incident. Set it deliberately, attach spend alerts so an unexpected scale-out doesn't produce a surprise bill, and raise it proactively rather than reactively.
Finally, keep your launch templates validated. Any automated pipeline that builds and publishes AMIs should also update the launch template and verify that a test instance can launch successfully. Catching a broken AMI reference in CI is dramatically better than catching it during a scale-out event under production load.
