InfraRunBook

    Alertmanager Not Sending Notifications

    Monitoring
    Published: Apr 11, 2026
    Updated: Apr 11, 2026

    A practical troubleshooting guide for when Alertmanager stops sending notifications, covering routing misconfigs, SMTP auth failures, accidental silences, inhibition rules, and receiver errors with real CLI commands.


    Symptoms

    You're staring at your Prometheus dashboard. Alerts are clearly firing — you can see them in the Prometheus UI under /alerts, state: FIRING. But your phone isn't ringing, Slack is quiet, and your inbox is empty. Alertmanager is running, the process is healthy, and yet notifications just aren't going out.

    This is one of the most frustrating situations in monitoring because everything looks fine on the surface. Here's what you're likely observing:

    • Alerts are visible in Prometheus in the FIRING state
    • Alertmanager's web UI shows alerts in the Alerts tab
    • No obvious errors in systemd or Docker logs for the alertmanager process
    • No notifications delivered to email, Slack, PagerDuty, or any other channel
    • The /api/v2/alerts endpoint returns active alerts but nothing is dispatched

    In my experience, this problem almost always comes down to one of six core issues: the routing tree isn't matching your alerts to any receiver, the receiver itself is misconfigured, SMTP authentication is broken, someone set a silence that's suppressing everything, an inhibition rule is swallowing alerts before they ever go out, or aggressive grouping timers are delaying notifications for so long they appear lost. Let's work through each one methodically.


    Root Cause 1: Routing Configuration Is Wrong

    Why It Happens

    Alertmanager dispatches alerts through a tree of route blocks. Each route can match on label matchers, and if no child route matches, the alert falls back to the parent and ultimately the default receiver. The most common mistake I see is a route that looks correct but silently fails to match — because of a label name typo, a missing matcher, or a regex that doesn't actually cover the labels your alerts carry.

    There's another pattern that catches people off guard: the continue flag. By default, once a route matches, Alertmanager stops walking the tree. If your specific team's route is defined after a broad catch-all, the catch-all consumes the alert first. Your team's receiver never sees it.
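    If the catch-all genuinely needs to come first, the continue flag lets a matched alert keep walking the tree instead of stopping. A minimal sketch, assuming a broad catch-all defined ahead of the team route (the catch-all-slack receiver name is a placeholder):

```yaml
route:
  receiver: null-receiver
  routes:
    # Broad catch-all defined first; continue: true lets matching
    # alerts keep walking the tree instead of stopping here
    - matchers:
        - severity=~"warning|critical"
      receiver: catch-all-slack      # hypothetical receiver name
      continue: true
    - matchers:
        - team="infra"
      receiver: pagerduty-infra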

    How to Identify It

    The best tool here is the amtool routing test. You POST a set of labels against the routing tree and see exactly which route and receiver gets selected:

    amtool --alertmanager.url=http://192.168.10.50:9093 config routes test \
      alertname="HighCPU" severity="critical" team="infra"

    Example output when routing is broken:

    Routing tree:
    .
    └── default-route  receiver: null-receiver  continue: false
        Matched: true
        └── team-infra  receiver: pagerduty-infra  continue: false
            Matched: false
            Reason: label "team" not found in alert labels

    The alert matched the default catch-all route but not the specific team route. The label was actually called squad in Prometheus after a relabeling change nobody communicated to the oncall team. The routing config expected team, found nothing, and fell through to null-receiver.

    You can also confirm the actual labels your alerts carry directly from Prometheus:

    curl -s http://192.168.10.51:9090/api/v1/alerts | python3 -m json.tool | grep -A 20 "HighCPU"

    How to Fix It

    Compare the labels on the firing alert against what your route matcher expects. Update alertmanager.yml to match the actual label names:

    route:
      receiver: null-receiver
      routes:
        - matchers:
            - alertname="HighCPU"
            - squad="infra"   # was: team="infra"
          receiver: pagerduty-infra
          continue: false

    After editing, reload Alertmanager without a full restart:

    curl -X POST http://192.168.10.50:9093/-/reload

    Run the route test again to confirm the right receiver is now selected before you call the incident resolved.


    Root Cause 2: Receiver Misconfigured

    Why It Happens

    Your routing tree correctly selects a receiver, but the receiver definition itself has a bug. A wrong webhook URL, an outdated API token, an incorrect Slack channel name, a PagerDuty integration key that was rotated without updating the config — any of these will cause Alertmanager to attempt delivery, fail, retry, and eventually log an error that's easy to miss if nobody's watching.

    How to Identify It

    Start with a config validation pass:

    amtool check-config /etc/alertmanager/alertmanager.yml

    If the config parses cleanly, dig into the logs for dispatch errors:

    journalctl -u alertmanager -n 300 --no-pager | grep -i "error\|warn\|notify\|dispatch"

    A misconfigured Slack receiver produces output like this:

    level=error ts=2026-04-12T08:14:33.221Z caller=notify.go:732 component=dispatcher
      receiver=slack-infra integration=slack[0]
      msg="Notify attempt failed, will retry later"
      attempts=3
      err="unexpected response status 404: channel_not_found"

    A dead webhook endpoint looks like this:

    level=warn ts=2026-04-12T08:22:11.409Z caller=notify.go:712 component=dispatcher
      receiver=ops-webhook integration=webhook[0]
      msg="Notify attempt failed, will retry later"
      attempts=1
      err="Post \"http://192.168.10.75:8080/hooks/alert\": dial tcp 192.168.10.75:8080: connect: connection refused"

    How to Fix It

    Test your receiver endpoint directly from the Alertmanager host before touching the config. For a Slack webhook:

    curl -s -X POST -H 'Content-type: application/json' \
      --data '{"text":"test from infrarunbook-admin on sw-infrarunbook-01"}' \
      https://hooks.slack.com/services/YOUR/ACTUAL/WEBHOOK

    For a webhook target, verify basic connectivity:

    curl -v http://192.168.10.75:8080/hooks/alert

    Once you've confirmed the endpoint accepts requests, update the receiver definition in alertmanager.yml with the corrected values and reload. Don't forget to verify by running amtool config routes test after the reload to make sure routing still resolves correctly.
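    When the receiver is a generic webhook, it can also help to replay a realistic payload against it before trusting the fix. The sketch below builds a minimal Alertmanager-style webhook body (the labels are placeholders), validates the JSON locally, and leaves the actual POST to the endpoint from this article's examples as a commented step:

```shell
# Minimal Alertmanager-style webhook payload (placeholder labels)
cat > /tmp/test-payload.json <<'EOF'
{
  "version": "4",
  "status": "firing",
  "alerts": [
    { "labels": { "alertname": "HighCPU", "severity": "critical" } }
  ]
}
EOF

# Validate the JSON locally before sending it anywhere
python3 -m json.tool /tmp/test-payload.json

# Then replay it against the suspect endpoint:
# curl -s -X POST -H 'Content-Type: application/json' \
#   --data @/tmp/test-payload.json http://192.168.10.75:8080/hooks/alert
```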


    Root Cause 3: SMTP Authentication Failure

    Why It Happens

    Email notifications through Alertmanager rely on SMTP, and SMTP auth failures are far more common than they should be. You rotate a service account password, migrate to a new mail relay, enable 2FA on a Google Workspace account without creating an app-specific password — and suddenly email alerts go completely dark. Alertmanager keeps trying to authenticate and failing. The errors pile up in the log, but if nobody's watching them, you're flying blind.

    I've also seen this triggered by a TLS requirement mismatch. The SMTP server enforces STARTTLS, but the config has require_tls: false, so Alertmanager never upgrades the connection, the server refuses to proceed, and the session drops before auth even begins. The log entry just shows EOF, which is unhelpfully vague.

    How to Identify It

    Filter the logs specifically for SMTP-related errors:

    journalctl -u alertmanager --since "2 hours ago" | grep -i "smtp\|email\|auth\|535\|534\|EOF"

    You'll see something like:

    level=error ts=2026-04-12T07:55:02.813Z caller=email.go:151
      msg="Error sending email"
      err="535 5.7.8 Authentication credentials invalid"

    Or the more cryptic TLS failure variant:

    level=error ts=2026-04-12T07:55:02.813Z caller=notify.go:732 component=dispatcher
      receiver=email-oncall integration=email[0]
      msg="Notify attempt failed, will retry later"
      attempts=5
      err="dial+send: EOF"

    Test the SMTP connection directly from sw-infrarunbook-01 using swaks, which gives you granular feedback on exactly where the handshake breaks:

    swaks --server mail.solvethenetwork.com:587 \
      --tls \
      --auth LOGIN \
      --auth-user infrarunbook-admin@solvethenetwork.com \
      --auth-password "YourPasswordHere" \
      --to oncall@solvethenetwork.com \
      --from alertmanager@solvethenetwork.com \
      --h-Subject "SMTP connectivity test" \
      --body "Testing SMTP auth from sw-infrarunbook-01"

    If swaks fails with a 535 error but you're confident the password is right, the issue is usually the auth mechanism. Try switching from LOGIN to PLAIN, or check whether the account needs an app-specific password.
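    If you want to see exactly what AUTH PLAIN transmits, you can build the credential string yourself: it's the base64 encoding of NUL + username + NUL + password, sent as a single line. A sketch using the placeholder credentials from above, which you could then paste into a raw session opened with openssl s_client:

```shell
# AUTH PLAIN sends base64("\0username\0password") in a single line
AUTH_USER='infrarunbook-admin@solvethenetwork.com'
AUTH_PASS='YourPasswordHere'
printf '\0%s\0%s' "$AUTH_USER" "$AUTH_PASS" | base64

# Paste the output after "AUTH PLAIN " in a manual session:
# openssl s_client -starttls smtp -connect mail.solvethenetwork.com:587 -quiet
```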

    How to Fix It

    Update your email receiver block in alertmanager.yml with the correct credentials and TLS configuration:

    receivers:
      - name: email-oncall
        email_configs:
          - to: oncall@solvethenetwork.com
            from: alertmanager@solvethenetwork.com
            smarthost: mail.solvethenetwork.com:587
            auth_username: infrarunbook-admin@solvethenetwork.com
            auth_password: "app-specific-password-here"
            require_tls: true
            tls_config:
              insecure_skip_verify: false

    If you're routing through an internal mail relay that trusts the RFC 1918 source range (like 192.168.10.0/24), you may be able to skip authentication entirely — confirm with your mail admin. For Google Workspace accounts with 2FA enabled, an app-specific password is mandatory; the regular login password will always be rejected by the SMTP endpoint.
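    In the trusted-relay case, the email receiver simplifies considerably. A sketch, with a hypothetical relay hostname:

```yaml
receivers:
  - name: email-oncall
    email_configs:
      - to: oncall@solvethenetwork.com
        from: alertmanager@solvethenetwork.com
        # Hypothetical internal relay; no auth_* keys needed when the
        # relay trusts the source subnet (confirm with your mail admin)
        smarthost: relay.solvethenetwork.com:25
        require_tls: false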

    Reload after the change and watch the logs actively for 2-3 minutes to confirm the next dispatch attempt succeeds.


    Root Cause 4: Silence Accidentally Set

    Why It Happens

    Silences are Alertmanager's mechanism for suppressing notifications during planned maintenance or acknowledged incidents. They're easy to create, easy to forget, and unlike most config changes, they live in Alertmanager's state rather than in your config file — so they survive reloads and restarts.

    An engineer silences a noisy alert at 2 AM during an incident, sets the duration to eight hours, goes to sleep, and never follows up. Or a deployment script creates a silence via the API to prevent alert storms during a rollout, but the cleanup step fails and the silence is never expired. Either way, the result is the same: alerts fire in Prometheus, reach Alertmanager, and get silently swallowed before any notification goes out.

    How to Identify It

    List all active silences:

    amtool --alertmanager.url=http://192.168.10.50:9093 silence query

    Example output revealing the culprit:

    ID                                    Matchers              Ends At                  Created By            Comment
    4f2e1c3a-91bb-4d22-b8c3-7a5e9d0f1234  alertname=~".+"       2026-06-30 00:00:00 UTC  infrarunbook-admin    maintenance window - delete when done

    That matcher alertname=~".+" matches every single alert. This silence was created during a maintenance window months ago with an end date set far into the future. Everything has been suppressed ever since.

    You can also query the raw API if you want JSON output for scripting purposes:

    curl -s http://192.168.10.50:9093/api/v2/silences | \
      python3 -m json.tool | grep -E "state|matchers|createdBy|endsAt"

    The Alertmanager web UI at http://192.168.10.50:9093/#/silences is often the fastest visual check — overly broad silences stand out immediately when you can see the matcher strings laid out in a table.

    How to Fix It

    Expire the silence immediately by its ID:

    amtool --alertmanager.url=http://192.168.10.50:9093 silence expire 4f2e1c3a-91bb-4d22-b8c3-7a5e9d0f1234

    Confirm the silence is gone and no others remain:

    amtool --alertmanager.url=http://192.168.10.50:9093 silence query

    Once the silence is removed, Alertmanager will re-evaluate pending alerts on its next dispatch cycle. If those alerts are still firing in Prometheus, notifications will go out within the group_wait window — typically 30 seconds to a few minutes depending on your config.


    Root Cause 5: Inhibition Rule Matching Too Broadly

    Why It Happens

    Inhibition rules let you suppress lower-severity alerts when a higher-severity alert for the same system is already firing. The intent is sound — don't page someone about a service being slow if you're already paging them about the host being completely unreachable. The implementation detail that bites people is the equal field.

    The equal field controls which labels must match between the source alert (the inhibiting one) and the target alert (the one being suppressed). When equal is empty or omitted, Alertmanager doesn't require any label correlation at all. A critical alert on one service suppresses warning alerts on every other service. That's almost never what was intended, but it's what happens.

    How to Identify It

    Confirm that alerts are reaching Alertmanager but being marked as inhibited rather than dispatched:

    curl -s "http://192.168.10.50:9093/api/v2/alerts?inhibited=true&active=true" | \
      python3 -m json.tool | grep -E "alertname|inhibited|status"

    Or with amtool:

    amtool --alertmanager.url=http://192.168.10.50:9093 alert query --inhibited

    Then look at the inhibit_rules section of your config:

    inhibit_rules:
      - source_matchers:
          - severity="critical"
        target_matchers:
          - severity=~"warning|info"
        equal: []

    Empty equal list. Any critical alert anywhere in the stack will suppress all warnings and info alerts system-wide. This is the inhibition rule equivalent of alertname=~".+" in a silence.

    How to Fix It

    Add meaningful labels to the equal field so inhibition only applies when both alerts share the same scope:

    inhibit_rules:
      - source_matchers:
          - severity="critical"
        target_matchers:
          - severity=~"warning|info"
        equal:
          - cluster
          - service

    Now the inhibition only fires when both alerts carry the same cluster and service labels. A critical alert on the auth service won't suppress warnings on the API gateway. Reload Alertmanager and verify that the previously inhibited alerts are now dispatched:

    curl -X POST http://192.168.10.50:9093/-/reload
    amtool --alertmanager.url=http://192.168.10.50:9093 alert query


    Root Cause 6: Group Wait and Interval Delaying Notifications

    Why It Happens

    Sometimes notifications aren't missing — they're just late. Very late. Alertmanager batches alerts into groups before dispatching them, and the group_wait and group_interval settings control how long it waits before sending. I've seen production configs with a group_wait of 10 or even 15 minutes, which means you won't hear anything for the first quarter-hour of an incident. That's not a bug in Alertmanager, but it is a configuration problem.

    How to Identify It

    Check your current timing configuration:

    amtool --alertmanager.url=http://192.168.10.50:9093 config show | grep -E "group_wait|group_interval|repeat_interval"

    If you see values like this, you've found your problem:

    group_wait: 10m
    group_interval: 15m
    repeat_interval: 4h
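    To see why those values translate into a quarter-hour blind spot, add up the stages an alert passes through before the first notification. The rule's for duration and the evaluation interval below are assumptions for illustration; only group_wait comes from the config above:

```shell
for_duration=300    # 'for: 5m' in the alert rule (assumption)
eval_interval=30    # Prometheus evaluation_interval (assumption)
group_wait=600      # group_wait: 10m from the config above

# Worst-case seconds from threshold breach to first notification
total=$(( for_duration + eval_interval + group_wait ))
echo "worst case: $(( total / 60 )) minutes"   # prints: worst case: 15 minutes
```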

    How to Fix It

    Tune the timing to something appropriate for your operational tempo. These are sane starting defaults for most infrastructure teams:

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
      receiver: default-receiver

    The 30-second group_wait gives Prometheus enough time to resolve flapping alerts before anyone gets paged, without creating a multi-minute blind spot on real incidents.


    Prevention

    Most of what's described above is avoidable with a few practices baked into your standard workflow.

    Run amtool check-config as a required CI step on every pull request that touches alertmanager.yml. Combine it with a routing test for your most critical label combinations so you catch routing regressions before they reach production:

    amtool check-config /etc/alertmanager/alertmanager.yml && \
    amtool config routes test alertname="HighCPU" severity="critical" cluster="prod" \
      --config.file /etc/alertmanager/alertmanager.yml

    Set up a dead man's switch. This is a Prometheus alert that always fires — its entire job is to produce a constant stream of notifications. If you ever stop receiving that notification, something in the pipeline between Prometheus and your phone is broken. Many default Prometheus rule sets ship a Watchdog alert for exactly this purpose. Route it to a dedicated low-noise Slack channel and treat any gap in that signal as a P1.
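    A minimal Prometheus rule for such a dead man's switch might look like the sketch below; the group name and annotation text are placeholders, and kube-prometheus ships a similar rule under the same Watchdog name:

```yaml
groups:
  - name: meta-alerts            # hypothetical group name
    rules:
      - alert: Watchdog
        expr: vector(1)          # always evaluates to a firing series
        labels:
          severity: none
        annotations:
          summary: "Heartbeat alert; if this stops arriving, the notification pipeline is broken"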

    Audit silences weekly. It takes 30 seconds. Add it as a standing item in your oncall handoff. Silences should be short-lived by nature — if you're suppressing an alert for days, the right fix is usually to either resolve the underlying condition or adjust the alerting threshold, not maintain an indefinite silence.
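    That weekly audit is easy to script. The sketch below flags catch-all matchers in a silences dump; it runs against an embedded sample here, but in practice you'd fetch the real list with curl -s http://192.168.10.50:9093/api/v2/silences instead:

```shell
# Sample silences dump (in production: curl -s .../api/v2/silences > /tmp/silences.json)
cat > /tmp/silences.json <<'EOF'
[
  { "id": "4f2e1c3a-91bb-4d22-b8c3-7a5e9d0f1234",
    "matchers": [ { "name": "alertname", "value": ".+", "isRegex": true } ],
    "status": { "state": "active" } }
]
EOF

# A bare ".+" regex matcher silences everything -- flag it
if grep -qF '"value": ".+"' /tmp/silences.json; then
  echo "WARNING: catch-all silence found"
fi
```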

    Comment every inhibition rule in your config. Document what it's designed to suppress and why, what label scope it operates on, and when it was added. Three months from now, when you're debugging why warnings aren't going out at 3 AM, you'll thank yourself for the 30 seconds it took to write that comment.

    Finally, scrape Alertmanager's own metrics from Prometheus and alert on them. The alertmanager_notifications_failed_total counter is invaluable:

    alertmanager_notifications_failed_total{integration="email"} > 0

    When the notification pipeline breaks, this metric tells you before your users do. Don't monitor everything in your stack except the monitoring system itself — that's the one gap that always comes back to bite you at the worst possible moment.
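    One caveat on the bare comparison above: the counter only resets when the process restarts, so a single historical failure keeps the condition true forever. Rating it over a window is usually what you want; a sketch where the rule name and 15-minute window are assumptions:

```yaml
groups:
  - name: meta-alerts               # hypothetical group name
    rules:
      - alert: AlertmanagerNotificationsFailing
        # increase() over a window fires only on recent failures,
        # not on any failure since process start
        expr: increase(alertmanager_notifications_failed_total[15m]) > 0
        for: 5m
        labels:
          severity: critical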

    Frequently Asked Questions

    How do I test if my Alertmanager routing is working without waiting for a real alert to fire?

    Use amtool to simulate label sets against your routing tree: run `amtool --alertmanager.url=http://192.168.10.50:9093 config routes test alertname="YourAlert" severity="critical"`. It will show exactly which route matches and which receiver gets selected, without sending any actual notifications.

    Alertmanager shows alerts in the UI but they are marked as suppressed. How do I find out why?

    Query the API with the suppressed flag: `curl -s 'http://192.168.10.50:9093/api/v2/alerts?silenced=true&inhibited=true' | python3 -m json.tool`. Check the status object in each alert — it will indicate whether the alert is silenced or inhibited, which tells you whether to look at your silences list or your inhibit_rules config.

    How can I reload Alertmanager config without restarting the process?

    Send a POST request to the reload endpoint: `curl -X POST http://192.168.10.50:9093/-/reload`. This reloads the config file and resets the routing tree without dropping existing alert state. Always run amtool check-config first to validate the file before reloading.

    What is the difference between a silence and an inhibition rule in Alertmanager?

    A silence is a manually created, time-bounded suppression that mutes specific alerts matching label matchers. An inhibition rule is a persistent config-driven rule that automatically suppresses lower-severity alerts when a correlated higher-severity alert is already firing. Silences are managed at runtime through the API or UI; inhibition rules live in alertmanager.yml and require a config reload to change.

    Alertmanager is not sending email but Slack notifications work fine. What should I check first?

    Start with SMTP authentication. Run swaks from the Alertmanager host against your mail relay to test the connection directly. Check the Alertmanager logs for 535 auth errors or EOF messages, which usually indicate a bad password, a missing app-specific password for accounts with 2FA enabled, or a TLS negotiation failure caused by require_tls being set incorrectly in your email receiver config.
