Symptoms
You set up a CloudWatch alarm, wired it to an SNS topic, and waited. Nothing happened. The metric you're watching is clearly spiking — the EC2 instance on sw-infrarunbook-01 is sitting at 95% CPU — but the alarm stays green or shows INSUFFICIENT_DATA forever. No email, no PagerDuty page, no Lambda trigger. Just silence.
This is one of the most frustrating AWS debugging experiences because the console gives you almost no feedback when an alarm fails to fire. In my experience, you'll typically encounter one of these four symptoms:
- The alarm state is permanently OK even during obvious resource stress
- The alarm shows INSUFFICIENT_DATA and never transitions out of it
- The alarm transitions to ALARM, but no notification arrives
- The alarm was previously working, then suddenly stopped triggering after an IAM or infrastructure change
Each points to a different root cause. Let's go through them systematically, starting with the most common.
Root Cause 1: The Metric Is Not Being Published
This is the first thing you should rule out. If no metric data is arriving in CloudWatch, the alarm has nothing to evaluate and can never fire.
Why does it happen? For EC2 instances, basic monitoring only pushes a limited set of metrics (CPUUtilization, NetworkIn, NetworkOut, DiskReadOps) every 5 minutes. Memory usage, disk space, and custom application metrics are not sent by default; they require the CloudWatch agent to be installed and actively running. If the agent isn't running, or if its configuration doesn't reference the metric you're alarming on, no data flows. The equivalent problem hits ECS tasks whose container logs never reach CloudWatch because the awslogs log driver isn't configured, and Lambda functions where custom metrics only appear if the function code makes explicit PutMetricData calls.
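For reference, a minimal CloudWatch agent configuration that would publish memory and disk metrics might look like the sketch below. The `metrics_collected`, `measurement`, and `resources` keys follow the agent's standard config schema; the specific measurements shown are assumptions you should adapt to your own setup. If a metric isn't listed here, no alarm configuration can conjure data for it.

```json
{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"]
      }
    }
  }
}
```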
How to identify it: start by checking whether any data points exist for the metric. Don't trust the CloudWatch console graph's default time window — it sometimes renders no data as a flat line, making it look like the metric exists when it doesn't.
```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --start-time 2026-04-18T10:00:00Z \
  --end-time 2026-04-18T11:00:00Z \
  --period 300 \
  --statistics Average \
  --region us-east-1
```
If the response comes back with an empty `Datapoints` array, the metric simply isn't being published:
```json
{
    "Label": "CPUUtilization",
    "Datapoints": []
}
```
For custom metrics, check whether the CloudWatch agent is running:
```bash
sudo systemctl status amazon-cloudwatch-agent
```
```
● amazon-cloudwatch-agent.service - Amazon CloudWatch Agent
   Loaded: loaded (/usr/lib/systemd/system/amazon-cloudwatch-agent.service; enabled)
   Active: inactive (dead)
```
That dead status explains everything. Restart it with the correct config:
```bash
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json

sudo systemctl start amazon-cloudwatch-agent
sudo systemctl status amazon-cloudwatch-agent
```
After restarting, wait a full 5 minutes and re-run the get-metric-statistics call. If data points appear, you've found your culprit.
Root Cause 2: Wrong Namespace
CloudWatch organizes metrics into namespaces, and the namespace must match exactly. It's case-sensitive and there are no fuzzy matches. I've seen engineers alarm on `AWS/ec2` when the correct namespace is `AWS/EC2`, or use `CWAgent` when the custom namespace in the agent config is actually `custom/app-metrics`. This is especially treacherous when you're writing alarms via CloudFormation or Terraform and copy a namespace string from the wrong source.
How to identify it: list the actual namespaces that currently have data in your account:
```bash
aws cloudwatch list-metrics \
  --region us-east-1 \
  --query "Metrics[].Namespace" \
  --output text | tr '\t' '\n' | sort -u
```
```
AWS/EC2
AWS/ECS
AWS/Lambda
AWS/RDS
AWS/S3
CWAgent
custom/app-metrics
```
Then check what namespace your alarm is actually using:
```bash
aws cloudwatch describe-alarms \
  --alarm-names "high-cpu-sw-infrarunbook-01" \
  --query "MetricAlarms[0].{Namespace:Namespace,MetricName:MetricName,Threshold:Threshold}"
```
```json
{
    "Namespace": "AWS/ec2",
    "MetricName": "CPUUtilization",
    "Threshold": 80.0
}
```
There it is: `AWS/ec2` instead of `AWS/EC2`. Fix it by reissuing the put-metric-alarm with the corrected namespace:
```bash
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-sw-infrarunbook-01" \
  --alarm-description "CPU above 80% on sw-infrarunbook-01" \
  --namespace "AWS/EC2" \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --region us-east-1
```
Root Cause 3: Threshold Configured Incorrectly
Sometimes the alarm is technically working — metric data is flowing, the namespace is right — but the threshold is set in a way that it will never actually be breached. This sounds obvious, but it's surprisingly easy to get wrong with metrics that have unintuitive units or comparison directions.
The classic example: you want to alarm when disk space drops below 20% free. You set the threshold to 20 and use GreaterThanOrEqualToThreshold. The alarm never fires even when disk is at 5% free. Why? Because you're checking if the metric is greater than or equal to 20, but disk_free_percent at 5% is not >= 20. You needed LessThanOrEqualToThreshold with a value of 20 — or, better, alarm on disk_used_percent with GreaterThanOrEqualToThreshold at 80.
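You can sanity-check a comparison direction locally before touching the alarm. The sketch below mirrors the misconfiguration with plain shell arithmetic; the values 5 and 20 are the hypothetical free-space percentage and threshold from the example above.

```shell
free_pct=5      # hypothetical disk_free_percent reading during an incident
threshold=20

# GreaterThanOrEqualToThreshold on a "free percent" metric: never fires as space runs out
[ "$free_pct" -ge "$threshold" ] && echo "ALARM" || echo "OK"    # prints OK

# LessThanOrEqualToThreshold is the direction that matches the intent
[ "$free_pct" -le "$threshold" ] && echo "ALARM" || echo "OK"    # prints ALARM
```

Thirty seconds of this kind of check when writing the alarm is cheaper than discovering the inverted operator during a full disk at 2am.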
Another common mistake is unit confusion. NetworkIn on EC2 is reported in bytes. Setting a threshold of 1000 expecting 1000 MB means you'll alarm at essentially zero network traffic. The correct threshold for 1 GB is 1073741824.
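When in doubt, compute byte thresholds instead of typing magic numbers. A tiny helper (hypothetical name, not part of any AWS tooling) keeps the intent visible in review:

```shell
# Convert GiB to the raw bytes CloudWatch expects for byte-unit metrics like NetworkIn
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 1    # prints 1073741824
```

The computed value can then be pasted into (or substituted directly into) the `--threshold` argument.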
How to identify it: pull actual data points and compare them to your threshold side by side:
```bash
aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
               Name=path,Value=/ \
               Name=device,Value=xvda1 \
               Name=fstype,Value=ext4 \
  --start-time 2026-04-18T08:00:00Z \
  --end-time 2026-04-18T11:00:00Z \
  --period 300 \
  --statistics Average \
  --region us-east-1
```
```json
{
    "Label": "disk_used_percent",
    "Datapoints": [
        {"Timestamp": "2026-04-18T09:00:00Z", "Average": 87.3, "Unit": "Percent"},
        {"Timestamp": "2026-04-18T09:05:00Z", "Average": 88.1, "Unit": "Percent"},
        {"Timestamp": "2026-04-18T09:10:00Z", "Average": 89.0, "Unit": "Percent"}
    ]
}
```
Now check the alarm's operator and threshold:
```bash
aws cloudwatch describe-alarms \
  --alarm-names "disk-space-alarm-sw-infrarunbook-01" \
  --query "MetricAlarms[0].{Operator:ComparisonOperator,Threshold:Threshold}"
```
```json
{
    "Operator": "GreaterThanOrEqualToThreshold",
    "Threshold": 90.0
}
```
Disk used is at 89%, threshold is 90% with >=. The alarm won't fire until it crosses 90%. If your intent was to alert at 85%, update accordingly and also reduce EvaluationPeriods if you want faster response:
```bash
aws cloudwatch put-metric-alarm \
  --alarm-name "disk-space-alarm-sw-infrarunbook-01" \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
               Name=path,Value=/ \
               Name=device,Value=xvda1 \
               Name=fstype,Value=ext4 \
  --statistic Average \
  --period 300 \
  --threshold 85 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --region us-east-1
```
Root Cause 4: Alarm Stuck in INSUFFICIENT_DATA State
INSUFFICIENT_DATA is where every alarm starts. The problem is when it never moves out of this state even after you expect data to be flowing. CloudWatch puts an alarm here when it doesn't have enough data points within the evaluation window to make a determination.
The most frequent cause is a mismatch between the alarm's period and the metric's reporting interval. If your metric only reports every 5 minutes and your alarm period is set to 60 seconds, CloudWatch sees empty windows for most evaluations and stays stuck. Setting EvaluationPeriods to a high number (say, 10) compounds this — you'd need 10 consecutive non-empty windows before the alarm can even transition to OK.
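The arithmetic makes the failure mode concrete. A quick sketch, using the values assumed in the example above:

```shell
period=60            # alarm evaluation period in seconds
eval_periods=10      # consecutive periods the alarm evaluates
metric_interval=300  # basic EC2 monitoring publishes every 5 minutes

window=$(( period * eval_periods ))          # 600-second evaluation window
max_points=$(( window / metric_interval ))   # at most 2 of the 10 periods contain data

echo "window=${window}s, at most ${max_points} of ${eval_periods} periods have data"
```

Eight of the ten evaluation windows are guaranteed to be empty, so the alarm can never collect enough data points to leave INSUFFICIENT_DATA.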
I've also hit this when the monitored instance was stopped or terminated. A stopped EC2 instance publishes no metrics, so an alarm watching it by InstanceId will degrade back to INSUFFICIENT_DATA. This is especially sneaky in auto-scaling groups where instances are frequently replaced.
How to identify it:
```bash
aws cloudwatch describe-alarms \
  --alarm-names "high-cpu-sw-infrarunbook-01" \
  --query "MetricAlarms[0].{State:StateValue,Reason:StateReason,Period:Period,EvalPeriods:EvaluationPeriods,DatapointsToAlarm:DatapointsToAlarm}"
```
```json
{
    "State": "INSUFFICIENT_DATA",
    "Reason": "Insufficient Data: 10 datapoints were unknown.",
    "Period": 60,
    "EvalPeriods": 10,
    "DatapointsToAlarm": 10
}
```
Period is 60 seconds but basic EC2 monitoring publishes every 300 seconds. Fix it by either enabling detailed monitoring on the instance or updating the alarm period to match the metric's reporting interval:
```bash
# Enable detailed monitoring (publishes every 60 seconds)
aws ec2 monitor-instances \
  --instance-ids i-0abc123def456789 \
  --region us-east-1

# OR update the alarm to a 300-second period
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-sw-infrarunbook-01" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --region us-east-1
```
Note the `--treat-missing-data notBreaching` flag. By default, missing data is treated as missing, which directly contributes to the INSUFFICIENT_DATA state. In most operational alarms you'll want either `notBreaching` (missing data doesn't trigger the alarm, useful for bursty metrics) or `breaching` (missing data fires the alarm, useful when loss of telemetry itself is the incident). Pick deliberately; don't leave it at the default.
Root Cause 5: Missing IAM Permission for the Metric
This is the subtlest one on this list. The CloudWatch agent running on your instance needs permission to call PutMetricData, and if the instance's IAM role doesn't grant it, custom metrics never reach CloudWatch. No error appears in the AWS console — the agent just silently fails to push data.
This also appears in multi-account setups where you're centralizing alarms in one account while watching metrics in another. Cross-account metric sharing must be explicitly configured, and it's easy to overlook during initial setup or miss during an IAM policy tightening exercise.
How to identify it: start by checking what policies are attached to the instance role:
```bash
aws iam list-attached-role-policies \
  --role-name sw-infrarunbook-01-instance-role \
  --query "AttachedPolicies[].PolicyName"
```
```json
[
    "AmazonSSMManagedInstanceCore"
]
```
No CloudWatch policy. The agent has no permission to push metrics. Confirm it by checking the agent's own logs:
```bash
sudo tail -100 /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
```
```
2026-04-18T09:15:32Z E! WriteToCloudWatch failed, err: AccessDeniedException:
User: arn:aws:sts::123456789012:assumed-role/sw-infrarunbook-01-instance-role/i-0abc123def456789
is not authorized to perform: cloudwatch:PutMetricData
on resource: arn:aws:cloudwatch:us-east-1:123456789012:*
```
That's the smoking gun. Fix it by attaching the AWS-managed CloudWatch agent policy:
```bash
aws iam attach-role-policy \
  --role-name sw-infrarunbook-01-instance-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
```
If you need a least-privilege policy instead of the managed one, the minimum required permissions are:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:ListMetrics"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": ["ec2:DescribeTags"],
            "Resource": "*"
        }
    ]
}
```
After attaching the policy, the IAM change takes effect immediately for new API calls; no instance reboot is required. Restart the CloudWatch agent to flush its error state and watch the log for successful writes.
Root Cause 6: Incorrect or Missing Dimensions
Dimensions are the key-value pairs that identify a specific resource within a namespace. Get them wrong and your alarm watches nothing — or silently watches a different resource. EC2 metrics use InstanceId. ECS metrics use ClusterName and ServiceName together. RDS uses DBInstanceIdentifier. The CloudWatch agent emits metrics with dimensions based on its config file, which may not match what you typed in the alarm definition.
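Those extra dimensions usually come from the agent config itself. The `append_dimensions` block below uses the agent's standard config keys and built-in `${aws:...}` placeholders; it is a sketch of the mechanism that stamps InstanceId, InstanceType, and ImageId onto every metric the agent emits, and your actual config may differ:

```json
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}",
      "ImageId": "${aws:ImageId}"
    }
  }
}
```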
The safest way to get the exact dimension set is to query CloudWatch directly for what it actually received:
```bash
aws cloudwatch list-metrics \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --region us-east-1
```
```json
{
    "Metrics": [
        {
            "Namespace": "CWAgent",
            "MetricName": "mem_used_percent",
            "Dimensions": [
                {"Name": "InstanceId", "Value": "i-0abc123def456789"},
                {"Name": "InstanceType", "Value": "t3.medium"},
                {"Name": "ImageId", "Value": "ami-0abcdef1234567890"}
            ]
        }
    ]
}
```
If your alarm was configured with only InstanceId, it won't match this metric. CloudWatch matches an alarm's dimension set against the metric's dimensions exactly; a different set of dimensions identifies a different (and in this case nonexistent) metric. Update the alarm to use every dimension returned by list-metrics for that metric.
Root Cause 7: SNS Topic Misconfiguration
The alarm might be firing correctly — genuinely transitioning to ALARM state — but you never receive the notification because the downstream SNS delivery is broken. This is a different class of problem but easy to conflate with the alarm not triggering.
Check the alarm's action ARN and verify the topic has a confirmed subscription:
```bash
aws sns list-subscriptions-by-topic \
  --topic-arn arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --region us-east-1
```
```json
{
    "Subscriptions": [
        {
            "SubscriptionArn": "PendingConfirmation",
            "Owner": "123456789012",
            "Protocol": "email",
            "Endpoint": "infrarunbook-admin@solvethenetwork.com",
            "TopicArn": "arn:aws:sns:us-east-1:123456789012:ops-alerts"
        }
    ]
}
```
`PendingConfirmation` means the subscription was never confirmed. The confirmation email to infrarunbook-admin@solvethenetwork.com arrived, nobody clicked the link, and SNS refuses to deliver to unconfirmed endpoints. Re-send the confirmation and make sure someone actually clicks it:
```bash
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --protocol email \
  --notification-endpoint infrarunbook-admin@solvethenetwork.com \
  --region us-east-1
```
Also confirm that the SNS topic's access policy explicitly allows CloudWatch to publish to it. Fetch the policy and look for a statement granting cloudwatch.amazonaws.com the sns:Publish action:
```bash
aws sns get-topic-attributes \
  --topic-arn arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --query "Attributes.Policy" \
  --output text | python3 -m json.tool
```
If that principal is missing from the policy, CloudWatch's publish calls will be rejected, and from your side the failure is silent: the alarm transitions to ALARM but no notification ever arrives.
Prevention
The best time to verify an alarm works is immediately after you create it, not during an incident. After creating any CloudWatch alarm, force a state transition with set-alarm-state and confirm the notification arrives:
```bash
aws cloudwatch set-alarm-state \
  --alarm-name "high-cpu-sw-infrarunbook-01" \
  --state-value ALARM \
  --state-reason "Manual test to verify SNS delivery" \
  --region us-east-1
```
If the notification doesn't arrive within two minutes, something in the pipeline is broken. Fix it now.
Use infrastructure as code for all alarm definitions. When alarms live in Terraform or CloudFormation, the namespace, dimensions, and threshold are code-reviewed, version-controlled, and reproducible. Ad-hoc console-created alarms are where typos and forgotten dimensions live.
Run a periodic audit for alarms stuck in INSUFFICIENT_DATA. Add this to a weekly operational review or pipe it into a monitoring script:
```bash
aws cloudwatch describe-alarms \
  --state-value INSUFFICIENT_DATA \
  --region us-east-1 \
  --query "MetricAlarms[].{Name:AlarmName,Reason:StateReason}" \
  --output table
```
Any alarm sitting in INSUFFICIENT_DATA for more than 30 minutes after creation deserves investigation. In most cases it means the metric isn't flowing or the dimensions don't match anything real in the account.
Attach the CloudWatchAgentServerPolicy to every EC2 instance role by default. Make it part of your base AMI or your launch template. Don't let IAM permission gaps silently kill your observability — you won't know they're missing until an incident when you need the data most.
Finally, document your alarm naming conventions and the exact metric namespaces used in each environment. When infrarunbook-admin needs to create a new alarm quickly, they shouldn't have to guess whether the namespace is CWAgent, custom/app-metrics, or AWS/EC2. A one-page runbook with example put-metric-alarm commands for each tier of your infrastructure pays for itself the first time it gets used at 2am.
