Symptoms
You're staring at your Prometheus dashboard. Alerts are clearly firing — you can see them in the Prometheus UI under /alerts, state: FIRING. But your phone isn't ringing, Slack is quiet, and your inbox is empty. Alertmanager is running, the process is healthy, and yet notifications just aren't going out.
This is one of the most frustrating situations in monitoring because everything looks fine on the surface. Here's what you're likely observing:
- Alerts are visible in Prometheus in the FIRING state
- Alertmanager's web UI shows alerts in the Alerts tab
- No obvious errors in systemd or Docker logs for the alertmanager process
- No notifications delivered to email, Slack, PagerDuty, or any other channel
- The /api/v2/alerts endpoint returns active alerts but nothing is dispatched
In my experience, this problem almost always comes down to one of six core issues: the routing tree isn't matching your alerts to any receiver, the receiver itself is misconfigured, SMTP authentication is broken, someone set a silence that's suppressing everything, an inhibition rule is swallowing alerts before they ever go out, or grouping timers are delaying notifications so long they look lost. Let's work through each one methodically.
Root Cause 1: Routing Configuration Is Wrong
Why It Happens
Alertmanager dispatches alerts through a tree of route blocks. Each route can match on label matchers, and if no child route matches, the alert falls back to the parent and ultimately the default receiver. The most common mistake I see is a route that looks correct but silently fails to match — because of a label name typo, a missing matcher, or a regex that doesn't actually cover the labels your alerts carry.
There's another pattern that catches people off guard: the continue flag. By default, once a route matches, Alertmanager stops walking the tree. If a broad catch-all route is defined before your specific team's route, the catch-all consumes the alert first and your team's receiver never sees it.
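To make the continue pitfall concrete, here is a sketch of a routing tree where a broad catch-all sits above a team-specific route. The receiver names are placeholders, not taken from this runbook's config:

```yaml
# Hypothetical routing tree: the broad catch-all matches first, but
# continue: true tells Alertmanager to keep walking the tree, so the
# team-specific route below it still receives the alert.
route:
  receiver: default-receiver
  routes:
    - matchers:
        - severity=~"warning|critical"
      receiver: slack-catchall   # broad route, defined first
      continue: true             # without this, the walk stops here
    - matchers:
        - team="infra"
      receiver: pagerduty-infra  # still reachable thanks to continue
```

Without continue: true on the first route, every warning and critical alert would terminate at slack-catchall and the pagerduty-infra receiver would be dead code.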
How to Identify It
The best tool here is the amtool routing test. You POST a set of labels against the routing tree and see exactly which route and receiver gets selected:
amtool --alertmanager.url=http://192.168.10.50:9093 config routes test \
  alertname="HighCPU" severity="critical" team="infra"

Example output when routing is broken:
Routing tree:
.
└── default-route  receiver: null-receiver  continue: false
    Matched: true
    └── team-infra  receiver: pagerduty-infra  continue: false
        Matched: false
        Reason: label "team" not found in alert labels

The alert matched the default catch-all route but not the specific team route. The label was actually called squad in Prometheus after a relabeling change nobody communicated to the oncall team. The routing config expected team, found nothing, and fell through to null-receiver.
You can also confirm the actual labels your alerts carry directly from Prometheus:
curl -s http://192.168.10.51:9090/api/v1/alerts | python3 -m json.tool | grep -A 20 "HighCPU"

How to Fix It
Compare the labels on the firing alert against what your route matcher expects. Update alertmanager.yml to match the actual label names:
route:
  receiver: null-receiver
  routes:
    - matchers:
        - alertname="HighCPU"
        - squad="infra"   # was: team="infra"
      receiver: pagerduty-infra
      continue: false

After editing, reload Alertmanager without a full restart:
curl -X POST http://192.168.10.50:9093/-/reload

Run the route test again to confirm the right receiver is now selected before you call the incident resolved.
Root Cause 2: Receiver Misconfigured
Why It Happens
Your routing tree correctly selects a receiver, but the receiver definition itself has a bug. A wrong webhook URL, an outdated API token, an incorrect Slack channel name, a PagerDuty integration key that was rotated without updating the config — any of these will cause Alertmanager to attempt delivery, fail, retry, and eventually log an error that's easy to miss if nobody's watching.
How to Identify It
Start with a config validation pass:
amtool check-config /etc/alertmanager/alertmanager.yml

If the config parses cleanly, dig into the logs for dispatch errors:
journalctl -u alertmanager -n 300 --no-pager | grep -i "error\|warn\|notify\|dispatch"

A misconfigured Slack receiver produces output like this:
level=error ts=2026-04-12T08:14:33.221Z caller=notify.go:732 component=dispatcher
receiver=slack-infra integration=slack[0]
msg="Notify attempt failed, will retry later"
attempts=3
err="unexpected response status 404: channel_not_found"

A dead webhook endpoint looks like this:
level=warn ts=2026-04-12T08:22:11.409Z caller=notify.go:712 component=dispatcher
receiver=ops-webhook integration=webhook[0]
msg="Notify attempt failed, will retry later"
attempts=1
err="Post \"http://192.168.10.75:8080/hooks/alert\": dial tcp 192.168.10.75:8080: connect: connection refused"

How to Fix It
Test your receiver endpoint directly from the Alertmanager host before touching the config. For a Slack webhook:
curl -s -X POST -H 'Content-type: application/json' \
  --data '{"text":"test from infrarunbook-admin on sw-infrarunbook-01"}' \
  https://hooks.slack.com/services/YOUR/ACTUAL/WEBHOOK

For a webhook target, verify basic connectivity:
curl -v http://192.168.10.75:8080/hooks/alert

Once you've confirmed the endpoint accepts requests, update the receiver definition in alertmanager.yml with the corrected values and reload. Don't forget to verify by running amtool config routes test after the reload to make sure routing still resolves correctly.
Root Cause 3: SMTP Authentication Failure
Why It Happens
Email notifications through Alertmanager rely on SMTP, and SMTP auth failures are far more common than they should be. You rotate a service account password, migrate to a new mail relay, enable 2FA on a Google Workspace account without creating an app-specific password — and suddenly email alerts go completely dark. Alertmanager keeps trying to authenticate and failing. The errors pile up in the log, but if nobody's watching them, you're flying blind.
I've also seen this triggered by a mismatch between TLS requirements. The SMTP server enforces STARTTLS but the config has require_tls: false, so the TLS negotiation fails and the connection drops before auth even begins. The log entry just shows EOF, which is unhelpfully vague.
How to Identify It
Filter the logs specifically for SMTP-related errors:
journalctl -u alertmanager --since "2 hours ago" | grep -i "smtp\|email\|auth\|535\|534\|EOF"

You'll see something like:
level=error ts=2026-04-12T07:55:02.813Z caller=email.go:151
msg="Error sending email"
err="535 5.7.8 Authentication credentials invalid"

Or the more cryptic TLS failure variant:
level=error ts=2026-04-12T07:55:02.813Z caller=notify.go:732 component=dispatcher
receiver=email-oncall integration=email[0]
msg="Notify attempt failed, will retry later"
attempts=5
err="dial+send: EOF"

Test the SMTP connection directly from sw-infrarunbook-01 using swaks, which gives you granular feedback on exactly where the handshake breaks:
swaks --server mail.solvethenetwork.com:587 \
  --tls \
  --auth LOGIN \
  --auth-user infrarunbook-admin@solvethenetwork.com \
  --auth-password "YourPasswordHere" \
  --to oncall@solvethenetwork.com \
  --from alertmanager@solvethenetwork.com \
  --h-Subject "SMTP connectivity test" \
  --body "Testing SMTP auth from sw-infrarunbook-01"

If swaks fails with a 535 error but you're confident the password is right, the issue is usually the auth mechanism. Try switching from LOGIN to PLAIN, or check whether the account needs an app-specific password.
How to Fix It
Update your email receiver block in alertmanager.yml with the correct credentials and TLS configuration:
receivers:
  - name: email-oncall
    email_configs:
      - to: oncall@solvethenetwork.com
        from: alertmanager@solvethenetwork.com
        smarthost: mail.solvethenetwork.com:587
        auth_username: infrarunbook-admin@solvethenetwork.com
        auth_password: "app-specific-password-here"
        require_tls: true
        tls_config:
          insecure_skip_verify: false

If you're routing through an internal mail relay that trusts the RFC 1918 source range (like 192.168.10.0/24), you may be able to skip authentication entirely — confirm with your mail admin. For Google Workspace accounts with 2FA enabled, an app-specific password is mandatory; the regular login password will always be rejected by the SMTP endpoint.
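If you do take the internal-relay route, the email_configs block simplifies considerably. A sketch, assuming a hypothetical relay host on the trusted subnet — the hostname and the no-auth policy are assumptions to confirm with your mail admin:

```yaml
# Hypothetical: relay.solvethenetwork.com is assumed to accept
# unauthenticated mail from 192.168.10.0/24 on port 25.
receivers:
  - name: email-oncall
    email_configs:
      - to: oncall@solvethenetwork.com
        from: alertmanager@solvethenetwork.com
        smarthost: relay.solvethenetwork.com:25
        require_tls: false   # plain SMTP inside the trusted network only
```

The tradeoff is obvious: no credentials to rotate, but delivery now depends on the relay's IP allowlist staying accurate.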
Reload after the change and watch the logs actively for 2-3 minutes to confirm the next dispatch attempt succeeds.
Root Cause 4: Silence Accidentally Set
Why It Happens
Silences are Alertmanager's mechanism for suppressing notifications during planned maintenance or acknowledged incidents. They're easy to create, easy to forget, and unlike most config changes, they live in Alertmanager's state rather than in your config file — so they survive reloads and restarts.
An engineer silences a noisy alert at 2 AM during an incident, sets the duration to eight hours, goes to sleep, and never follows up. Or a deployment script creates a silence via the API to prevent alert storms during a rollout, but the cleanup step fails and the silence is never expired. Either way, the result is the same: alerts fire in Prometheus, reach Alertmanager, and get silently swallowed before any notification goes out.
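When a deploy script does create silences, bake the expiry and ownership into the creation step itself so a failed cleanup can't leave an immortal silence behind. A sketch, shown as a dry run that prints the amtool invocation rather than executing it (drop the echo to run it for real; the alert name and duration are illustrative):

```shell
# Dry-run sketch: print the amtool command a deploy script might use.
# A short explicit --duration means the silence expires on its own even
# if the rollout's cleanup step never runs.
DURATION="2h"
COMMENT="rollout silence - owner: infrarunbook-admin - auto-expires"

echo amtool --alertmanager.url=http://192.168.10.50:9093 silence add \
  alertname="HighCPU" \
  --duration="$DURATION" \
  --comment="$COMMENT"
```

An owner and purpose in the comment field make the weekly silence audit described below much faster.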
How to Identify It
List all active silences:
amtool --alertmanager.url=http://192.168.10.50:9093 silence query

Example output revealing the culprit:
ID                                    Matchers         Ends At                  Created By          Comment
4f2e1c3a-91bb-4d22-b8c3-7a5e9d0f1234  alertname=~".+"  2026-06-30 00:00:00 UTC  infrarunbook-admin  maintenance window - delete when done

That matcher alertname=~".+" matches every single alert. This silence was created during a maintenance window months ago with an end date set far into the future. Everything has been suppressed ever since.
You can also query the raw API if you want JSON output for scripting purposes:
curl -s http://192.168.10.50:9093/api/v2/silences | \
  python3 -m json.tool | grep -E "state|matchers|createdBy|endsAt"

The Alertmanager web UI at http://192.168.10.50:9093/#/silences is often the fastest visual check — overly broad silences stand out immediately when you can see the matcher strings laid out in a table.
How to Fix It
Expire the silence immediately by its ID:
amtool --alertmanager.url=http://192.168.10.50:9093 silence expire 4f2e1c3a-91bb-4d22-b8c3-7a5e9d0f1234

Confirm the silence is gone and no others remain:
amtool --alertmanager.url=http://192.168.10.50:9093 silence query

Once the silence is removed, Alertmanager will re-evaluate pending alerts on its next dispatch cycle. If those alerts are still firing in Prometheus, notifications will go out within the group_wait window — typically 30 seconds to a few minutes depending on your config.
Root Cause 5: Inhibition Rule Matching Too Broadly
Why It Happens
Inhibition rules let you suppress lower-severity alerts when a higher-severity alert for the same system is already firing. The intent is sound — don't page someone about a service being slow if you're already paging them about the host being completely unreachable. The implementation detail that bites people is the equal field.
The equal field controls which labels must match between the source alert (the inhibiting one) and the target alert (the one being suppressed). When equal is empty or omitted, Alertmanager doesn't require any label correlation at all. A critical alert on one service suppresses warning alerts on every other service. That's almost never what was intended, but it's what happens.
How to Identify It
Confirm that alerts are reaching Alertmanager but being marked as inhibited rather than dispatched:
curl -s "http://192.168.10.50:9093/api/v2/alerts?inhibited=true&active=true" | \
  python3 -m json.tool | grep -E "alertname|inhibited|status"

Or with amtool:
amtool --alertmanager.url=http://192.168.10.50:9093 alert query --inhibited

Then look at the inhibit_rules section of your config:
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity=~"warning|info"
    equal: []

Empty equal list. Any critical alert anywhere in the stack will suppress all warnings and info alerts system-wide. This is the inhibition rule equivalent of alertname=~".+" in a silence.
How to Fix It
Add meaningful labels to the equal field so inhibition only applies when both alerts share the same scope:
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity=~"warning|info"
    equal:
      - cluster
      - service

Now the inhibition only fires when both alerts carry the same cluster and service labels. A critical alert on the auth service won't suppress warnings on the API gateway. Reload Alertmanager and verify that the previously inhibited alerts are now dispatched:
curl -X POST http://192.168.10.50:9093/-/reload
amtool --alertmanager.url=http://192.168.10.50:9093 alert query

Root Cause 6: Group Wait and Interval Delaying Notifications
Why It Happens
Sometimes notifications aren't missing — they're just late. Very late. Alertmanager batches alerts into groups before dispatching them, and the group_wait and group_interval settings control how long it waits before sending. I've seen production configs with a group_wait of 10 or even 15 minutes, which means you won't hear anything for the first quarter-hour of an incident. That's not a bug in Alertmanager, but it is a configuration problem.
How to Identify It
Check your current timing configuration:
amtool --alertmanager.url=http://192.168.10.50:9093 config show | grep -E "group_wait|group_interval|repeat_interval"

If you see values like this, you've found your problem:
group_wait: 10m
group_interval: 15m
repeat_interval: 4h

How to Fix It
Tune the timing to something appropriate for your operational tempo. These are sane starting defaults for most infrastructure teams:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: default-receiver

The 30-second group_wait gives Alertmanager enough time to batch related alerts into a single notification, without creating a multi-minute blind spot on real incidents.
Prevention
Most of what's described above is avoidable with a few practices baked into your standard workflow.
Run amtool check-config as a required CI step on every pull request that touches alertmanager.yml. Combine it with a routing test for your most critical label combinations so you catch routing regressions before they reach production:
amtool check-config /etc/alertmanager/alertmanager.yml && \
  amtool config routes test alertname="HighCPU" severity="critical" cluster="prod" \
    --config.file /etc/alertmanager/alertmanager.yml

Set up a dead man's switch. This is a Prometheus alert that always fires — its entire job is to produce a constant stream of notifications. If you ever stop receiving that notification, something in the pipeline between Prometheus and your phone is broken. Many default Prometheus rule sets ship a Watchdog alert for exactly this purpose. Route it to a dedicated low-noise Slack channel and treat any gap in that signal as a P1.
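A minimal version of such a rule on the Prometheus side looks like this (the kube-prometheus mixin ships an equivalent Watchdog alert, so check whether you already have one before adding your own; group name and annotation text here are placeholders):

```yaml
# Dead man's switch: vector(1) is always true, so this alert fires
# forever by design. Silence in this signal means a broken pipeline.
groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat alert; if its notifications stop, the alerting pipeline is broken"
```

Route the Watchdog to its own receiver so its constant firing never mingles with real pages.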
Audit silences weekly. It takes 30 seconds. Add it as a standing item in your oncall handoff. Silences should be short-lived by nature — if you're suppressing an alert for days, the right fix is usually to either resolve the underlying condition or adjust the alerting threshold, not maintain an indefinite silence.
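The weekly audit can be scripted. This sketch flags active silences whose end date is more than a week out; the here-doc sample stands in for the real output of curl -s http://192.168.10.50:9093/api/v2/silences, and the silence ID, dates, and one-week threshold are all illustrative:

```shell
# Write a sample silences payload (in practice, curl the API instead).
cat <<'EOF' > /tmp/silences.json
[{"id": "4f2e1c3a", "status": {"state": "active"},
  "endsAt": "2026-06-30T00:00:00+00:00",
  "comment": "maintenance window - delete when done"}]
EOF

# Flag any active silence that still has more than 7 days to live.
python3 - <<'EOF'
import datetime, json

now = datetime.datetime(2026, 4, 12, tzinfo=datetime.timezone.utc)  # assumed audit date
for s in json.load(open("/tmp/silences.json")):
    if s["status"]["state"] != "active":
        continue
    ends = datetime.datetime.fromisoformat(s["endsAt"])
    days_left = (ends - now).days
    flag = "LONG-LIVED" if days_left > 7 else "ok"
    print(f'{s["id"]}  ends in {days_left}d  [{flag}]  {s["comment"]}')
EOF
```

Anything flagged LONG-LIVED is a candidate for amtool silence expire or for a proper fix to the underlying alert.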
Comment every inhibition rule in your config. Document what it's designed to suppress and why, what label scope it operates on, and when it was added. Three months from now, when you're debugging why warnings aren't going out at 3 AM, you'll thank yourself for the 30 seconds it took to write that comment.
Finally, scrape Alertmanager's own metrics from Prometheus and alert on them. The alertmanager_notifications_failed_total counter is invaluable:
alertmanager_notifications_failed_total{integration="email"} > 0

When the notification pipeline breaks, this metric tells you before your users do. Don't monitor everything in your stack except the monitoring system itself — that's the one gap that always comes back to bite you at the worst possible moment.
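To turn that counter into a durable alerting rule, wrap it in rate() so it fires on ongoing failures rather than staying true forever after a single historical one. A sketch; the group name, for duration, and annotation text are placeholders to adapt:

```yaml
groups:
  - name: meta-monitoring
    rules:
      - alert: AlertmanagerNotificationsFailing
        # rate() over 5m catches active failures; a raw "> 0" on the
        # counter would keep firing forever after one old failure.
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager failing to deliver via {{ $labels.integration }}"
```

Pair this rule with the dead man's switch: the Watchdog catches a fully dead pipeline, while this catches a pipeline that is up but failing to deliver.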
