InfraRunBook

    AWS CloudWatch Alarm Not Triggering

    Cloud
    Published: Apr 17, 2026
    Updated: Apr 17, 2026

    Step-by-step troubleshooting guide for AWS CloudWatch alarms that refuse to fire, covering missing metrics, namespace mismatches, IAM permission gaps, and more.


    Symptoms

    You set up a CloudWatch alarm, wired it to an SNS topic, and waited. Nothing happened. The metric you're watching is clearly spiking — the EC2 instance on sw-infrarunbook-01 is sitting at 95% CPU — but the alarm stays green or shows INSUFFICIENT_DATA forever. No email, no PagerDuty page, no Lambda trigger. Just silence.

    This is one of the most frustrating AWS debugging experiences because the console gives you almost no feedback when an alarm fails to fire. In my experience, you'll typically encounter one of these four symptoms:

    • The alarm state is permanently OK even during obvious resource stress
    • The alarm shows INSUFFICIENT_DATA and never transitions out of it
    • The alarm transitions to ALARM, but no notification arrives
    • The alarm was previously working, then suddenly stopped triggering after an IAM or infrastructure change

    Each points to a different root cause. Let's go through them systematically, starting with the most common.


    Root Cause 1: The Metric Is Not Being Published

    This is the first thing you should rule out. If no metric data is arriving in CloudWatch, the alarm has nothing to evaluate and can never fire.

    Why does it happen? For EC2 instances, basic monitoring only pushes a limited set of metrics (CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, and a few others) every 5 minutes. Memory usage, disk space, and custom application metrics are not sent by default; they require the CloudWatch agent to be installed and actively running. If the agent isn't running, or if its configuration doesn't reference the metric you're alarming on, no data flows. The same problem hits ECS clusters where Container Insights isn't enabled, and Lambda functions where custom metrics require explicit PutMetricData calls from within the function code.

    How to identify it: start by checking whether any data points exist for the metric. Don't trust the CloudWatch console graph's default time window — it sometimes renders no data as a flat line, making it look like the metric exists when it doesn't.

    aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 \
      --metric-name CPUUtilization \
      --dimensions Name=InstanceId,Value=i-0abc123def456789 \
      --start-time 2026-04-18T10:00:00Z \
      --end-time 2026-04-18T11:00:00Z \
      --period 300 \
      --statistics Average \
      --region us-east-1

    If the response comes back with an empty Datapoints array, the metric simply isn't being published:

    {
        "Label": "CPUUtilization",
        "Datapoints": []
    }

    For custom metrics, check whether the CloudWatch agent is running:

    sudo systemctl status amazon-cloudwatch-agent
    ● amazon-cloudwatch-agent.service - Amazon CloudWatch Agent
       Loaded: loaded (/usr/lib/systemd/system/amazon-cloudwatch-agent.service; enabled)
       Active: inactive (dead)

    That dead status explains everything. Restart it with the correct config:

    sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
      -a fetch-config \
      -m ec2 \
      -s \
      -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
    
    sudo systemctl start amazon-cloudwatch-agent
    sudo systemctl status amazon-cloudwatch-agent

    After restarting, wait a full 5 minutes and re-run the get-metric-statistics call. If data points appear, you've found your culprit.


    Root Cause 2: Wrong Namespace

    CloudWatch organizes metrics into namespaces, and the namespace must match exactly: it's case-sensitive, and there are no fuzzy matches. I've seen engineers alarm on AWS/ec2 when the correct namespace is AWS/EC2, or use CWAgent when the custom namespace in the agent config is actually custom/app-metrics. This is especially treacherous when you're writing alarms via CloudFormation or Terraform and copy a namespace string from the wrong source.

    How to identify it: list the actual namespaces that currently have data in your account:

    aws cloudwatch list-metrics \
      --region us-east-1 \
      --query "Metrics[].Namespace" \
      --output text | tr '\t' '\n' | sort -u
    AWS/EC2
    AWS/ECS
    AWS/Lambda
    AWS/RDS
    AWS/S3
    CWAgent
    custom/app-metrics

    Then check what namespace your alarm is actually using:

    aws cloudwatch describe-alarms \
      --alarm-names "high-cpu-sw-infrarunbook-01" \
      --query "MetricAlarms[0].{Namespace:Namespace,MetricName:MetricName,Threshold:Threshold}"
    {
        "Namespace": "AWS/ec2",
        "MetricName": "CPUUtilization",
        "Threshold": 80.0
    }

    There it is: AWS/ec2 instead of AWS/EC2. Fix it by reissuing the put-metric-alarm with the corrected namespace:

    aws cloudwatch put-metric-alarm \
      --alarm-name "high-cpu-sw-infrarunbook-01" \
      --alarm-description "CPU above 80% on sw-infrarunbook-01" \
      --namespace "AWS/EC2" \
      --metric-name CPUUtilization \
      --dimensions Name=InstanceId,Value=i-0abc123def456789 \
      --statistic Average \
      --period 300 \
      --threshold 80 \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --evaluation-periods 2 \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
      --region us-east-1

    Root Cause 3: Threshold Configured Incorrectly

    Sometimes the alarm is technically working — metric data is flowing, the namespace is right — but the threshold is set in a way that it will never actually be breached. This sounds obvious, but it's surprisingly easy to get wrong with metrics that have unintuitive units or comparison directions.

    The classic example: you want to alarm when disk space drops below 20% free. You set the threshold to 20 and use GreaterThanOrEqualToThreshold. The alarm never fires even when disk is at 5% free. Why? Because you're checking if the metric is greater than or equal to 20, but disk_free_percent at 5% is not >= 20. You needed LessThanOrEqualToThreshold with a value of 20 — or, better, alarm on disk_used_percent with GreaterThanOrEqualToThreshold at 80.

    Another common mistake is unit confusion. NetworkIn on EC2 is reported in bytes. Setting a threshold of 1000 expecting 1000 MB means you'll alarm at essentially zero network traffic. The correct threshold for 1 GB is 1073741824.
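    Both mistakes reduce to a few lines of logic. The sketch below (plain Python, not AWS code; the breaches helper is my own name) mirrors the comparison semantics CloudWatch applies to each datapoint, and shows why the wrong operator or the wrong unit produces an alarm that can never fire:

```python
# Comparison semantics CloudWatch applies when evaluating one datapoint
# against an alarm threshold (teaching sketch, not an AWS API).
OPERATORS = {
    "GreaterThanOrEqualToThreshold": lambda v, t: v >= t,
    "GreaterThanThreshold":          lambda v, t: v > t,
    "LessThanOrEqualToThreshold":    lambda v, t: v <= t,
    "LessThanThreshold":             lambda v, t: v < t,
}

def breaches(value, operator, threshold):
    """True when the datapoint breaches under the given operator."""
    return OPERATORS[operator](value, threshold)

# Disk at 5% free with the wrong operator: never breaches.
assert breaches(5.0, "GreaterThanOrEqualToThreshold", 20) is False
# Same value with the correct operator: breaches as intended.
assert breaches(5.0, "LessThanOrEqualToThreshold", 20) is True

# Unit confusion: NetworkIn is reported in bytes, so the threshold
# for 1 GB must be expressed in bytes, not megabytes.
GIB = 1024 ** 3
assert GIB == 1073741824
```

    Running the assertions against real data from get-metric-statistics is a cheap sanity check before you commit a threshold.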

    How to identify it: pull actual data points and compare them to your threshold side by side:

    aws cloudwatch get-metric-statistics \
      --namespace CWAgent \
      --metric-name disk_used_percent \
      --dimensions Name=InstanceId,Value=i-0abc123def456789 \
                    Name=path,Value=/ \
                    Name=device,Value=xvda1 \
                    Name=fstype,Value=ext4 \
      --start-time 2026-04-18T08:00:00Z \
      --end-time 2026-04-18T11:00:00Z \
      --period 300 \
      --statistics Average \
      --region us-east-1
    {
        "Label": "disk_used_percent",
        "Datapoints": [
            {"Timestamp": "2026-04-18T09:00:00Z", "Average": 87.3, "Unit": "Percent"},
            {"Timestamp": "2026-04-18T09:05:00Z", "Average": 88.1, "Unit": "Percent"},
            {"Timestamp": "2026-04-18T09:10:00Z", "Average": 89.0, "Unit": "Percent"}
        ]
    }

    Now check the alarm's operator and threshold:

    aws cloudwatch describe-alarms \
      --alarm-names "disk-space-alarm-sw-infrarunbook-01" \
      --query "MetricAlarms[0].{Operator:ComparisonOperator,Threshold:Threshold}"
    {
        "Operator": "GreaterThanOrEqualToThreshold",
        "Threshold": 90.0
    }

    Disk used is at 89%, threshold is 90% with >=. The alarm won't fire until it crosses 90%. If your intent was to alert at 85%, update accordingly and also reduce EvaluationPeriods if you want faster response:

    aws cloudwatch put-metric-alarm \
      --alarm-name "disk-space-alarm-sw-infrarunbook-01" \
      --namespace CWAgent \
      --metric-name disk_used_percent \
      --dimensions Name=InstanceId,Value=i-0abc123def456789 \
                    Name=path,Value=/ \
                    Name=device,Value=xvda1 \
                    Name=fstype,Value=ext4 \
      --statistic Average \
      --period 300 \
      --threshold 85 \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --evaluation-periods 2 \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
      --region us-east-1

    Root Cause 4: Alarm Stuck in INSUFFICIENT_DATA State

    INSUFFICIENT_DATA is where every alarm starts. The problem is when it never moves out of this state even after you expect data to be flowing. CloudWatch puts an alarm here when it doesn't have enough data points within the evaluation window to make a determination.

    The most frequent cause is a mismatch between the alarm's period and the metric's reporting interval. If your metric only reports every 5 minutes and your alarm period is set to 60 seconds, CloudWatch sees empty windows for most evaluations and stays stuck. Setting EvaluationPeriods to a high number (say, 10) compounds this — you'd need 10 consecutive non-empty windows before the alarm can even transition to OK.
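    The mismatch is plain arithmetic, and worth checking before you create the alarm. A minimal sketch (the helper name is my own) of how many datapoints land in one evaluation window:

```python
def datapoints_per_period(alarm_period_s, metric_interval_s):
    """How many datapoints land in one evaluation window.
    Below 1.0, most windows are empty and the alarm tends to sit
    in INSUFFICIENT_DATA (sketch of the arithmetic, not AWS code)."""
    return alarm_period_s / metric_interval_s

# Basic EC2 monitoring publishes every 300 seconds.
assert datapoints_per_period(60, 300) == 0.2   # 4 of 5 windows empty
assert datapoints_per_period(300, 300) == 1.0  # every window has data
```

    As a rule of thumb, keep the alarm period at least as long as the metric's reporting interval.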

    I've also hit this when the monitored instance was stopped or terminated. A stopped EC2 instance publishes no metrics, so an alarm watching it by InstanceId will degrade back to INSUFFICIENT_DATA. This is especially sneaky in auto-scaling groups where instances are frequently replaced.

    How to identify it:

    aws cloudwatch describe-alarms \
      --alarm-names "high-cpu-sw-infrarunbook-01" \
      --query "MetricAlarms[0].{State:StateValue,Reason:StateReason,Period:Period,EvalPeriods:EvaluationPeriods,DatapointsToAlarm:DatapointsToAlarm}"
    {
        "State": "INSUFFICIENT_DATA",
        "Reason": "Insufficient Data: 10 datapoints were unknown.",
        "Period": 60,
        "EvalPeriods": 10,
        "DatapointsToAlarm": 10
    }

    Period is 60 seconds but basic EC2 monitoring publishes every 300 seconds. Fix it by either enabling detailed monitoring on the instance or updating the alarm period to match the metric's reporting interval:

    # Enable detailed monitoring (publishes every 60 seconds)
    aws ec2 monitor-instances \
      --instance-ids i-0abc123def456789 \
      --region us-east-1
    
    # OR update the alarm to a 300-second period
    aws cloudwatch put-metric-alarm \
      --alarm-name "high-cpu-sw-infrarunbook-01" \
      --namespace AWS/EC2 \
      --metric-name CPUUtilization \
      --dimensions Name=InstanceId,Value=i-0abc123def456789 \
      --statistic Average \
      --period 300 \
      --threshold 80 \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --evaluation-periods 2 \
      --datapoints-to-alarm 2 \
      --treat-missing-data notBreaching \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
      --region us-east-1

    Note the --treat-missing-data notBreaching flag. By default, missing data is treated as missing, which directly contributes to the INSUFFICIENT_DATA state. For most operational alarms you'll want either notBreaching (missing data doesn't trigger the alarm, useful for bursty metrics) or breaching (missing data fires the alarm, useful when loss of telemetry itself is the incident). Pick deliberately; don't leave it at the default.


    Root Cause 5: Missing IAM Permission for the Metric

    This is the subtlest one on this list. The CloudWatch agent running on your instance needs permission to call PutMetricData, and if the instance's IAM role doesn't grant it, custom metrics never reach CloudWatch. No error appears in the AWS console — the agent just silently fails to push data.

    This also appears in multi-account setups where you're centralizing alarms in one account while watching metrics in another. Cross-account metric sharing must be explicitly configured, and it's easy to overlook during initial setup or miss during an IAM policy tightening exercise.

    How to identify it: start by checking what policies are attached to the instance role:

    aws iam list-attached-role-policies \
      --role-name sw-infrarunbook-01-instance-role \
      --query "AttachedPolicies[].PolicyName"
    [
        "AmazonSSMManagedInstanceCore"
    ]

    No CloudWatch policy. The agent has no permission to push metrics. Confirm it by checking the agent's own logs:

    sudo tail -100 /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
    2026-04-18T09:15:32Z E! WriteToCloudWatch failed, err: AccessDeniedException:
      User: arn:aws:sts::123456789012:assumed-role/sw-infrarunbook-01-instance-role/i-0abc123def456789
      is not authorized to perform: cloudwatch:PutMetricData
      on resource: arn:aws:cloudwatch:us-east-1:123456789012:*

    That's the smoking gun. Fix it by attaching the AWS-managed CloudWatch agent policy:

    aws iam attach-role-policy \
      --role-name sw-infrarunbook-01-instance-role \
      --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

    If you need a least-privilege policy instead of the managed one, the minimum required permissions are:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:PutMetricData",
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:ListMetrics"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeTags"],
          "Resource": "*"
        }
      ]
    }
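    Before rolling a hand-written policy out, a quick automated check that it actually grants everything the agent needs can catch a typo'd action name. A small sketch (the helper name and the REQUIRED set are my own; the policy JSON is the one above):

```python
import json

# The two actions the agent cannot function without, per the section above.
REQUIRED = {"cloudwatch:PutMetricData", "ec2:DescribeTags"}

def granted_actions(policy_doc):
    """Collect every Action granted by Allow statements."""
    actions = set()
    for stmt in policy_doc.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        act = stmt.get("Action", [])
        actions.update([act] if isinstance(act, str) else act)
    return actions

policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["cloudwatch:PutMetricData",
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:ListMetrics"],
     "Resource": "*"},
    {"Effect": "Allow", "Action": ["ec2:DescribeTags"], "Resource": "*"}
  ]
}
""")

missing = REQUIRED - granted_actions(policy)
assert not missing  # every minimum agent permission is present
```

    Wiring this into CI for your IAM Terraform module turns a silent telemetry outage into a failed pull request.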

    After attaching the policy, the IAM change takes effect immediately for new API calls — no instance reboot required. Restart the CloudWatch agent to flush its error state and watch the log for successful writes.


    Root Cause 6: Incorrect or Missing Dimensions

    Dimensions are the key-value pairs that identify a specific resource within a namespace. Get them wrong and your alarm watches nothing — or silently watches a different resource. EC2 metrics use InstanceId. ECS metrics use ClusterName and ServiceName together. RDS uses DBInstanceIdentifier. The CloudWatch agent emits metrics with dimensions based on its config file, which may not match what you typed in the alarm definition.

    The safest way to get the exact dimension set is to query CloudWatch directly for what it actually received:

    aws cloudwatch list-metrics \
      --namespace CWAgent \
      --metric-name mem_used_percent \
      --region us-east-1
    {
        "Metrics": [
            {
                "Namespace": "CWAgent",
                "MetricName": "mem_used_percent",
                "Dimensions": [
                    {"Name": "InstanceId",   "Value": "i-0abc123def456789"},
                    {"Name": "InstanceType", "Value": "t3.medium"},
                    {"Name": "ImageId",      "Value": "ami-0abcdef1234567890"}
                ]
            }
        ]
    }

    If your alarm was configured with only InstanceId and is missing InstanceType and ImageId, it won't match this metric. CloudWatch requires the alarm's dimensions to match the metric's dimension set exactly; a partial match silently matches nothing. Update the alarm to use every dimension returned by list-metrics for that metric.
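    The matching rule is simple to express: compare the two dimension sets as sets of name-value pairs. A sketch (the helper name is my own) using the dimensions from the list-metrics output above:

```python
def dimensions_match(alarm_dims, metric_dims):
    """A metric alarm only matches a metric when the dimension sets are
    identical; order doesn't matter, but every pair must be present
    (sketch of the matching rule)."""
    as_set = lambda dims: {(d["Name"], d["Value"]) for d in dims}
    return as_set(alarm_dims) == as_set(metric_dims)

metric = [
    {"Name": "InstanceId",   "Value": "i-0abc123def456789"},
    {"Name": "InstanceType", "Value": "t3.medium"},
    {"Name": "ImageId",      "Value": "ami-0abcdef1234567890"},
]
# An alarm carrying only InstanceId does NOT match this metric.
assert dimensions_match(metric[:1], metric) is False
# Every dimension from list-metrics, in any order, does.
assert dimensions_match(list(reversed(metric)), metric) is True
```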


    Root Cause 7: SNS Topic Misconfiguration

    The alarm might be firing correctly — genuinely transitioning to ALARM state — but you never receive the notification because the downstream SNS delivery is broken. This is a different class of problem but easy to conflate with the alarm not triggering.

    Check the alarm's action ARN and verify the topic has a confirmed subscription:

    aws sns list-subscriptions-by-topic \
      --topic-arn arn:aws:sns:us-east-1:123456789012:ops-alerts \
      --region us-east-1
    {
        "Subscriptions": [
            {
                "SubscriptionArn": "PendingConfirmation",
                "Owner": "123456789012",
                "Protocol": "email",
                "Endpoint": "infrarunbook-admin@solvethenetwork.com",
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:ops-alerts"
            }
        ]
    }

    PendingConfirmation means the subscription was never confirmed. The confirmation email to infrarunbook-admin@solvethenetwork.com arrived, nobody clicked the link, and SNS refuses to deliver to unconfirmed endpoints. Re-send the confirmation and make sure someone actually clicks it:

    aws sns subscribe \
      --topic-arn arn:aws:sns:us-east-1:123456789012:ops-alerts \
      --protocol email \
      --notification-endpoint infrarunbook-admin@solvethenetwork.com \
      --region us-east-1

    Also confirm that the SNS topic's access policy explicitly allows CloudWatch to publish to it. Fetch the policy and look for a statement granting cloudwatch.amazonaws.com the sns:Publish action:

    aws sns get-topic-attributes \
      --topic-arn arn:aws:sns:us-east-1:123456789012:ops-alerts \
      --query "Attributes.Policy" \
      --output text | python3 -m json.tool

    If that principal is missing from the policy, CloudWatch's publish calls will be silently dropped.
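    Eyeballing a topic policy is error-prone, so the check can be scripted. This is a simplified sketch (the helper name is my own; it ignores Condition blocks and wildcard principals) that looks for an Allow statement granting the CloudWatch service principal sns:Publish on the topic:

```python
import json

def cloudwatch_can_publish(policy_doc, topic_arn):
    """True if some Allow statement lets cloudwatch.amazonaws.com call
    sns:Publish on topic_arn (simplified: no Condition handling)."""
    for stmt in policy_doc.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        services = [services] if isinstance(services, str) else services
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = stmt.get("Resource", [])
        resources = [resources] if isinstance(resources, str) else resources
        if ("cloudwatch.amazonaws.com" in services
                and any(a.lower() == "sns:publish" or a == "sns:*" for a in actions)
                and (topic_arn in resources or "*" in resources)):
            return True
    return False

policy = json.loads("""
{
  "Statement": [
    {"Effect": "Allow",
     "Principal": {"Service": "cloudwatch.amazonaws.com"},
     "Action": "SNS:Publish",
     "Resource": "arn:aws:sns:us-east-1:123456789012:ops-alerts"}
  ]
}
""")
assert cloudwatch_can_publish(
    policy, "arn:aws:sns:us-east-1:123456789012:ops-alerts") is True
```

    Feed it the JSON from get-topic-attributes; a False result means the alarm fires but the notification dies at the topic.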


    Prevention

    The best time to verify an alarm works is immediately after you create it, not during an incident. After creating any CloudWatch alarm, force a state transition with set-alarm-state and confirm the notification arrives:

    aws cloudwatch set-alarm-state \
      --alarm-name "high-cpu-sw-infrarunbook-01" \
      --state-value ALARM \
      --state-reason "Manual test to verify SNS delivery" \
      --region us-east-1

    If the notification doesn't arrive within two minutes, something in the pipeline is broken. Fix it now.

    Use infrastructure as code for all alarm definitions. When alarms live in Terraform or CloudFormation, the namespace, dimensions, and threshold are code-reviewed, version-controlled, and reproducible. Ad-hoc console-created alarms are where typos and forgotten dimensions live.

    Run a periodic audit for alarms stuck in INSUFFICIENT_DATA. Add this to a weekly operational review or pipe it into a monitoring script:

    aws cloudwatch describe-alarms \
      --state-value INSUFFICIENT_DATA \
      --region us-east-1 \
      --query "MetricAlarms[].{Name:AlarmName,Reason:StateReason}" \
      --output table

    Any alarm sitting in INSUFFICIENT_DATA for more than 30 minutes after creation deserves investigation. In most cases it means the metric isn't flowing or the dimensions don't match anything real in the account.
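    If you pipe describe-alarms output into a script, the 30-minute rule is a one-line filter. A sketch over stub data (the field names mirror describe-alarms' MetricAlarms entries; the helper name is my own):

```python
from datetime import datetime, timedelta, timezone

def stale_insufficient_data(alarms, now, max_age=timedelta(minutes=30)):
    """Names of alarms sitting in INSUFFICIENT_DATA longer than max_age.
    Input dicts mirror describe-alarms' MetricAlarms entries."""
    return [
        a["AlarmName"] for a in alarms
        if a["StateValue"] == "INSUFFICIENT_DATA"
        and now - a["StateUpdatedTimestamp"] > max_age
    ]

now = datetime(2026, 4, 18, 12, 0, tzinfo=timezone.utc)
alarms = [
    {"AlarmName": "fresh", "StateValue": "INSUFFICIENT_DATA",
     "StateUpdatedTimestamp": now - timedelta(minutes=5)},
    {"AlarmName": "stale", "StateValue": "INSUFFICIENT_DATA",
     "StateUpdatedTimestamp": now - timedelta(hours=2)},
    {"AlarmName": "ok", "StateValue": "OK",
     "StateUpdatedTimestamp": now - timedelta(hours=8)},
]
assert stale_insufficient_data(alarms, now) == ["stale"]
```

    Anything the filter returns is either a metric that stopped flowing or a dimension set that matches nothing real.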

    Attach the CloudWatchAgentServerPolicy to every EC2 instance role by default. Make it part of your base AMI or your launch template. Don't let IAM permission gaps silently kill your observability — you won't know they're missing until an incident when you need the data most.

    Finally, document your alarm naming conventions and the exact metric namespaces used in each environment. When infrarunbook-admin needs to create a new alarm quickly, they shouldn't have to guess whether the namespace is CWAgent, custom/app-metrics, or AWS/EC2. A one-page runbook with example put-metric-alarm commands for each tier of your infrastructure pays for itself the first time it gets used at 2am.

    Frequently Asked Questions

    Why is my CloudWatch alarm stuck in INSUFFICIENT_DATA permanently?

    The most common reason is a mismatch between the alarm's period and the metric's reporting interval. For example, if your alarm period is 60 seconds but the metric only publishes every 300 seconds (basic EC2 monitoring), CloudWatch sees mostly empty windows and cannot evaluate the condition. Either enable detailed monitoring on the instance or update the alarm period to 300 seconds to match. Also check that the monitored resource is still running — a stopped or terminated instance publishes no data.

    How do I verify that a CloudWatch metric is actually being published?

    Run aws cloudwatch get-metric-statistics with an explicit start-time and end-time covering the last hour and check whether the Datapoints array is empty. If it is, the metric is not reaching CloudWatch. Check whether the CloudWatch agent is running (systemctl status amazon-cloudwatch-agent), review the agent logs at /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log for errors, and confirm the instance IAM role includes cloudwatch:PutMetricData.

    What IAM permissions does the CloudWatch agent need on an EC2 instance?

    At minimum, the instance role needs cloudwatch:PutMetricData, cloudwatch:GetMetricStatistics, cloudwatch:ListMetrics, and ec2:DescribeTags. The easiest way to grant these is by attaching the AWS-managed policy arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy to the instance's IAM role. The change takes effect immediately without requiring a reboot.

    How do I test a CloudWatch alarm without waiting for the real metric to breach the threshold?

    Use aws cloudwatch set-alarm-state to manually force a transition: aws cloudwatch set-alarm-state --alarm-name your-alarm-name --state-value ALARM --state-reason 'Manual test' --region us-east-1. This triggers the alarm actions immediately. If no notification arrives within two minutes, investigate the SNS topic configuration, subscription confirmation status, and the topic's access policy.

    Can a CloudWatch alarm fire without sending an SNS notification?

    Yes. The alarm can transition to ALARM state while the notification fails silently for several reasons: the SNS subscription is in PendingConfirmation status (the endpoint email was never confirmed), the SNS topic's resource policy doesn't allow cloudwatch.amazonaws.com to call sns:Publish, or the SNS topic itself was deleted after the alarm was created. Check subscription status with aws sns list-subscriptions-by-topic and verify the topic's access policy allows CloudWatch as a principal.
