Symptoms
When Prometheus starts struggling with high cardinality, it doesn't fail cleanly. It degrades slowly, and by the time someone notices, the situation is already bad. I've been called into more than a few incidents that started with "Grafana is slow" and ended with a Prometheus process eating 40GB of RAM on sw-infrarunbook-01 while on-call engineers scrambled to figure out what changed over the weekend.
Here's what the degradation actually looks like. The Prometheus UI becomes sluggish or completely unresponsive. Queries that returned in milliseconds now spin for 30 seconds before throwing an error. The prometheus_tsdb_head_series metric — your single most important cardinality health indicator — shows numbers in the millions instead of the tens of thousands you're used to. Grafana dashboards fail with red panels, and when you dig into the query inspector, you see context deadline exceeded errors stacking up.
In more severe cases, Prometheus itself gets OOM-killed by the kernel. You'll find entries like this in your system journal on sw-infrarunbook-01:
Apr 16 03:14:22 sw-infrarunbook-01 kernel: Out of memory: Kill process 4821 (prometheus) score 892 or sacrifice child
Apr 16 03:14:22 sw-infrarunbook-01 kernel: Killed process 4821 (prometheus) total-vm:42389504kB, anon-rss:38912048kB
After a restart things look fine briefly — then memory climbs back up and you're back where you started. That cycle is the cardinality problem in its most brutal form. The process restarts cleanly but immediately begins rebuilding the same head series from the WAL and incoming scrapes, hitting the same ceiling within hours.
Root Cause 1: Label Values Are Unbounded
This is the most common cause of cardinality explosions, and it's almost always introduced by someone who had genuinely good intentions. Prometheus stores each unique combination of label values as a separate time series. The total cardinality of your metric set is the product of all unique label values across every label dimension. If even one label contains unbounded values — user IDs, request IDs, session tokens, IP addresses, full file paths, raw SQL queries — your series count multiplies out of control fast.
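A back-of-the-envelope check makes the multiplication concrete. This sketch uses made-up label counts, not numbers from any real deployment:

```python
from math import prod

def estimated_series(label_value_counts):
    """Worst-case series count for one metric: the product of the number
    of unique values each label can take."""
    return prod(label_value_counts.values())

# Hypothetical label sets for http_requests_total
bounded = {"method": 5, "path": 40, "status": 8}           # normalized paths
unbounded = {"method": 5, "path": 1_000_000, "status": 8}  # raw user IDs

print(estimated_series(bounded))    # 1600
print(estimated_series(unbounded))  # 40000000
```

One unbounded label turns 1,600 cheap series into 40 million, which is exactly how a single well-intentioned label takes down an instance.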
Here's why it happens. A developer instruments their HTTP service and adds a path label to track request latency per endpoint. Sounds reasonable. But they pull the raw path from the request, which includes dynamic path parameters:
http_requests_total{method="GET", path="/api/users/12345", status="200"}
http_requests_total{method="GET", path="/api/users/67890", status="200"}
http_requests_total{method="GET", path="/api/users/99999", status="200"}
Every unique user ID embedded in the path label creates a new time series. With a million users, that's a million time series for a single metric. I've seen this exact pattern with IP addresses in client_ip labels, with full SQL query strings in query labels, and with trace IDs embedded in metric labels by engineers who confused metrics with distributed traces.
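One way to keep the path label bounded is to normalize it in the application before it reaches the client library. A minimal sketch with hypothetical route patterns; a real service should prefer its router's route template over post-hoc regexes like these:

```python
import re

# Hypothetical normalization rules. A real service should take the route
# template from its router instead of maintaining a regex list by hand.
_NORMALIZE_RULES = [
    (re.compile(r"^/api/users/\d+$"), "/api/users/:id"),
    (re.compile(r"^/api/orders/\d+$"), "/api/orders/:id"),
]

def normalize_path(raw_path: str) -> str:
    """Collapse dynamic path segments so the 'path' label stays bounded."""
    for pattern, template in _NORMALIZE_RULES:
        if pattern.match(raw_path):
            return template
    return raw_path

print(normalize_path("/api/users/12345"))  # /api/users/:id
print(normalize_path("/healthz"))          # /healthz, unchanged
```

Fixing it here is strictly better than relabeling at scrape time, because the bad label values never leave the application.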
To identify which labels are causing the explosion, query the Prometheus TSDB status API:
curl -s http://10.10.1.50:9090/api/v1/status/tsdb | \
jq '.data.seriesCountByMetricName | sort_by(.value) | reverse | .[0:10]'
The output will look something like this:
[
{ "name": "http_requests_total", "value": 847293 },
{ "name": "http_request_duration_seconds_bucket", "value": 762564 },
{ "name": "node_cpu_seconds_total", "value": 4800 },
{ "name": "up", "value": 320 }
]
When one metric has 847,000 series and everything else has a few thousand, you've found your offender. Drill into the specific label causing the bloat:
curl -s 'http://10.10.1.50:9090/api/v1/label/path/values' | jq '.data | length'
If that returns 847293, the path label is unbounded and that's your culprit.
The fix depends on where the label is set. Ideally you fix the application instrumentation to normalize path parameters — /api/users/:id instead of /api/users/12345. If you can't ship that fix immediately, use Prometheus metric relabeling to rewrite offending label values at scrape time:
scrape_configs:
- job_name: 'app-backend'
static_configs:
- targets: ['10.10.1.80:8080']
metric_relabel_configs:
- source_labels: [path]
regex: '^/api/users/[0-9]+'
target_label: path
replacement: '/api/users/:id'
- source_labels: [path]
regex: '^/api/orders/[0-9]+'
target_label: path
replacement: '/api/orders/:id'
Apply the relabeling, reload Prometheus (a SIGHUP, or a POST to /-/reload if the lifecycle API is enabled, picks up metric_relabel_configs without a restart), and watch prometheus_tsdb_head_series drop. Relabeling only affects new samples: the bloated series already in the head stay active until they stop receiving samples and get compacted out, typically within a few hours, while the historical blocks on disk age out with your retention window. You won't see an immediate cliff, but the growth will stop.
Root Cause 2: Too Many Time Series Overall
Even without a single unbounded label, you can hit the cardinality wall by having too many targets, too many metrics per target, or both. This is a systems engineering problem, not a labeling mistake. The general rule of thumb I use: a well-resourced Prometheus instance handles around 1 to 2 million active series comfortably. Beyond that, you're fighting physics.
The problem compounds when teams add exporters without any governance. Every new node_exporter, blackbox_exporter, mysqld_exporter, or custom application exporter adds hundreds or thousands of series. Multiply that across 300 nodes and 50 services and you've quietly crossed the threshold without any single decision feeling obviously wrong. Nobody added 2 million series in one go — they accumulated over months of legitimate growth.
Check your current total series count first:
curl -s 'http://10.10.1.50:9090/api/v1/query?query=prometheus_tsdb_head_series' | \
jq '.data.result[0].value[1]'
Then look at which label value pairs contribute most to overall cardinality:
curl -s http://10.10.1.50:9090/api/v1/status/tsdb | \
jq '.data.seriesCountByLabelValuePair | sort_by(.value) | reverse | .[0:20]'
You'll often find that a handful of combinations — instance values for your highest-metric-count exporters — dominate the list. This tells you where pruning will have the most impact.
The fix has two layers. First, drop metrics you're not actually querying. node_exporter alone exposes over a thousand metrics by default; most teams only use a fraction of them. Use metric relabeling to drop entire families:
metric_relabel_configs:
- source_labels: [__name__]
regex: 'go_gc_.*|go_memstats_.*|go_goroutines|go_threads'
action: drop
- source_labels: [__name__]
regex: 'node_nf_conntrack_.*|node_sockstat_.*'
action: drop
Before applying drops broadly, verify nobody is querying those metrics. A quick search through your Grafana dashboards and alerting rules for any reference to the metric family will confirm it's safe to drop.
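That dashboard check is easy to script. A sketch assuming dashboards are exported as JSON files under a hypothetical /var/lib/grafana/dashboards directory; real Grafana dashboard JSON can nest panels inside rows, so treat this as a starting point rather than a complete audit:

```python
import json
import re
from pathlib import Path

def referenced_in_dashboard(dashboard_json: str, drop_pattern: str) -> bool:
    """True if any top-level panel expression matches the drop regex."""
    pattern = re.compile(drop_pattern)
    dashboard = json.loads(dashboard_json)
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            if pattern.search(target.get("expr", "")):
                return True
    return False

if __name__ == "__main__":
    # Hypothetical export directory; adjust to wherever your dashboards live.
    drop_regex = r"go_gc_.*|go_memstats_.*|go_goroutines|go_threads"
    for path in Path("/var/lib/grafana/dashboards").glob("*.json"):
        if referenced_in_dashboard(path.read_text(), drop_regex):
            print(f"{path.name}: still references a metric slated for dropping")
```

Run the same pattern list you plan to put in metric_relabel_configs; any hit means the drop needs a conversation first.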
Second, if growth is structural rather than incidental, you need to shard your Prometheus topology. Deploy multiple Prometheus instances, each scraping a subset of targets, and use Thanos or Grafana Mimir to provide a federated query layer across all shards. This is the architectural answer to growth that relabeling alone can't solve.
Root Cause 3: Query Timeout
High cardinality doesn't just eat memory — it makes queries slow. Very slow. When a query has to scan 800,000 series to evaluate a selector like http_requests_total{job="api"}, it doesn't matter how fast your disk is. The fan-out is simply too wide, and the query engine has to touch every matching series before it can return a result.
The symptom is unambiguous. You run a query in the Prometheus UI or Grafana and see:
Error: context deadline exceeded
Through the HTTP API the error response looks like:
{
"status": "error",
"errorType": "timeout",
"error": "query timed out in expression evaluation"
}
Prometheus has a default query timeout of two minutes (--query.timeout=2m). When queries hit this consistently, the queries themselves are too expensive — usually because the series they're scanning are too numerous or the time range is too wide.
Enable query logging to identify your worst offenders. Add this flag to your Prometheus startup command on sw-infrarunbook-01:
--query.log-file=/var/log/prometheus/query.log
Then mine the log for the slowest queries. Each entry is a JSON object, so jq handles it more reliably than grep and awk (the field names below match the query log format in current Prometheus releases; spot-check one line of your own log first):
jq -r 'select(.stats.timings.evalTotalTime > 1) | "\(.stats.timings.evalTotalTime)s \(.params.query)"' \
/var/log/prometheus/query.log | sort -rn | head -20
You can also monitor query performance through Prometheus's own metrics:
prometheus_engine_query_duration_seconds{slice="inner_eval", quantile="0.9"}
If the 90th percentile of inner_eval duration is consistently above 5 seconds, your queries are expensive enough that alerting rules are likely evaluating with significant lag, or failing to complete at all.
The fix combines cardinality reduction with query optimization. Use specific label matchers to reduce the scan scope on every query. Instead of:
rate(http_requests_total[5m])
Use:
rate(http_requests_total{job="api", env="production"}[5m])
The additional label matchers act as pre-filters that let the TSDB skip entire series before the rate calculation even begins. Also avoid regex matchers on high-cardinality labels wherever possible — {path=~"/api/.*"} forces the engine to test the regex against every stored value of the path label, which compounds the cost on already-bloated metrics.
Root Cause 4: Memory Pressure on Prometheus
Prometheus is fundamentally an in-memory database with a disk-backed WAL. Every active time series keeps its recent data in the head chunk in RAM. When you have 5 million active series, that's an enormous number of head chunks sitting in memory simultaneously, and Prometheus doesn't offer a "low memory mode" that trades RAM for slower query performance. It's all-or-nothing on the memory front.
In my experience, memory growth is often the first symptom teams notice — before they understand cardinality is the root cause. The Prometheus process that used to run comfortably at 4GB starts touching 12GB, then 20GB, then gets OOM-killed. After the restart it climbs again. The team increases the memory limit. It climbs past that too.
Check current memory usage with a PromQL query:
process_resident_memory_bytes{job="prometheus"}
Or directly from sw-infrarunbook-01:
ps -o pid,rss,comm -p "$(pgrep -d, prometheus)" | awk 'NR>1 {printf "PID %s: %.1f GB\n", $1, $2/1048576}'
Watch the growth rate over time to distinguish a one-time increase from a continuous leak:
rate(process_resident_memory_bytes{job="prometheus"}[1h])
If this rate is consistently positive and growing, new series are being created faster than old ones expire. That's a cardinality leak — something in your environment is producing novel label combinations continuously.
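One way to quantify the leak is to fit a trend line to prometheus_tsdb_head_series samples over several hours. A sketch using a plain least-squares slope, assuming you've already exported the samples as (unix_timestamp, series_count) pairs:

```python
def series_growth_per_hour(samples):
    """Least-squares slope of (unix_timestamp, head_series) pairs,
    scaled to series per hour. Sustained positive slope = cardinality leak."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return (cov / var) * 3600

# Hypothetical hourly readings: 1.2M series growing by ~50k every hour
samples = [(i * 3600, 1_200_000 + i * 50_000) for i in range(6)]
print(series_growth_per_hour(samples))
```

A steady 50,000 new series per hour means you hit a 2 million series ceiling in under a day, which turns a vague "memory keeps climbing" into a concrete deadline.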
For immediate relief, apply a sample_limit to scrape jobs to prevent a single bad target from flooding the head:
scrape_configs:
- job_name: 'app-backend'
sample_limit: 15000
static_configs:
- targets: ['10.10.1.80:8080']
- job_name: 'node-exporters'
sample_limit: 2000
static_configs:
- targets: ['10.10.1.81:9100', '10.10.1.82:9100', '10.10.1.83:9100']
When a target exceeds its sample_limit, Prometheus fails the entire scrape: up drops to 0 for that target and prometheus_target_scrapes_exceeded_sample_limit_total increments. The resulting gap in that target's data is your signal that the limit was hit. It's a circuit breaker, not a solution, but it prevents one misbehaving application from taking down your entire monitoring stack while you work the proper fix.
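Picking the limit itself is simple arithmetic: take the worst recent scrape size for the job and add headroom. A sketch; the 1.5x headroom factor and the rounding are judgment calls, not Prometheus defaults:

```python
import math

def suggest_sample_limit(observed_sample_counts, headroom=1.5):
    """sample_limit with headroom above the worst observed scrape,
    rounded up to the nearest thousand so configs stay readable."""
    worst = max(observed_sample_counts)
    return math.ceil(worst * headroom / 1000) * 1000

# Hypothetical scrape_samples_scraped values from recent app-backend scrapes
print(suggest_sample_limit([8200, 9100, 9800]))  # 15000
```

The headroom keeps legitimate growth from tripping the breaker, while a cardinality explosion (which typically multiplies samples, not adds a few percent) still gets caught.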
From Prometheus 2.45+ you can also enforce global guardrails in prometheus.yml directly:
global:
scrape_interval: 15s
label_limit: 30
label_name_length_limit: 200
label_value_length_limit: 500
The label_limit setting is underused but valuable. Legitimate applications rarely need more than 15-20 labels on a metric. Setting a hard limit of 30 catches badly instrumented exporters before they can inject dozens of custom labels that multiply your cardinality.
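Before enforcing a label_limit, it's worth checking how close your exporters already are. This sketch counts labels per series straight from a scrape's text exposition; it's deliberately rough and ignores escaping edge cases in label values:

```python
import re

# Matches one sample line of the text exposition format and captures the
# label block, e.g.  http_requests_total{method="GET",path="/x"} 3
_SERIES_RE = re.compile(r'^[a-zA-Z_:][a-zA-Z0-9_:]*\{([^}]*)\}\s')

def max_label_count(exposition_text: str) -> int:
    """Largest number of labels on any single series in a scrape body.
    Rough: counts '=' signs, so it miscounts values that contain '='."""
    worst = 0
    for line in exposition_text.splitlines():
        match = _SERIES_RE.match(line)
        if match:
            worst = max(worst, match.group(1).count("="))
    return worst

body = (
    'up{job="api",instance="10.10.1.80:8080"} 1\n'
    'http_requests_total{method="GET",path="/x",status="200"} 3\n'
)
print(max_label_count(body))  # 3
```

Feed it the output of curl against each exporter's /metrics endpoint; anything approaching your planned limit deserves a look before you turn enforcement on.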
Root Cause 5: Recording Rules Needed
This one is slightly different from the others because it's not about the number of series stored — it's about the cost of evaluating them repeatedly. Even with a manageable series count, if your alerting rules and Grafana dashboards are running heavy aggregations over wide time windows on every evaluation cycle, you'll starve the Prometheus query engine. The symptoms look identical to cardinality problems: slow dashboards, timeouts, high CPU on sw-infrarunbook-01 — but the series count might look completely reasonable.
Recording rules pre-compute expensive query expressions and store the results as new, lightweight time series. The rule runs once per evaluation interval, writes a single output series, and every downstream consumer — dashboards, alerts, other rules — queries that pre-computed series instead of re-running the full aggregation on every load. Query time drops from seconds to milliseconds. Prometheus CPU drops. Dashboards become snappy again.
Identify whether you need recording rules by checking rule evaluation duration:
prometheus_rule_evaluation_duration_seconds{quantile="0.9"}
Also check for rule evaluation failures, which indicate rules are timing out silently:
rate(prometheus_rule_evaluation_failures_total[5m])
A non-zero failure rate here means your alerting rules may not be firing when they should be. That's not just a performance problem — it's a reliability problem for your entire alerting infrastructure.
Here's a concrete before/after example. Suppose multiple Grafana dashboards and two alerting rules all evaluate this query independently:
sum(rate(http_requests_total[5m])) by (service, status_code)
With 800,000 http_requests_total series spread across 200 services, this aggregation hits every raw series on every evaluation. Convert it to a recording rule:
groups:
- name: http_aggregations
interval: 60s
rules:
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (service, status_code)
Now your dashboards and alerts query job:http_requests:rate5m — a pre-computed result with maybe 600 series (200 services times 3 status code categories) instead of aggregating across 800,000 raw series every 15 seconds. The level:metric:operations naming convention is the community standard and makes recording rules self-documenting at a glance.
Apply this pattern to any query that appears in both dashboards and alerting rules. If multiple teams query the same expensive expression, that's exactly the case recording rules were designed for.
Prevention
Preventing cardinality problems is mostly a governance and early detection problem. Once you're in a full cardinality crisis with 8 million active series, your options are painful: drop data, restart Prometheus and lose head chunks, or fight through degraded performance while incrementally fixing labels. None of those are fun conversations to have at 2 AM. The smart move is never getting there.
Set up cardinality and memory alerts on your Prometheus-about-Prometheus metrics. These should fire well before things become critical:
groups:
- name: prometheus_health
rules:
- alert: PrometheusHighCardinality
expr: prometheus_tsdb_head_series > 1500000
for: 10m
labels:
severity: warning
annotations:
summary: "Prometheus series count is high on {{ $labels.instance }}"
description: "Active series: {{ $value }}. Investigate label cardinality before this becomes critical."
- alert: PrometheusCriticalCardinality
expr: prometheus_tsdb_head_series > 3000000
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus cardinality critical on {{ $labels.instance }}"
- alert: PrometheusMemoryHigh
expr: process_resident_memory_bytes{job="prometheus"} > 8589934592
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus memory exceeds 8GB on {{ $labels.instance }}"
Make cardinality review part of your deployment process for any service that ships Prometheus instrumentation. When a developer opens a PR adding new metrics, the review question is simple: what are the possible values for each label, and is that set bounded? If the answer is "users can put anything there" or "it depends on request parameters," that's a red flag that needs to be resolved before the code ships.
Run the TSDB status endpoint as part of your regular operational review and snapshot the output for comparison:
curl -s http://10.10.1.50:9090/api/v1/status/tsdb | \
jq '.data' | tee /var/log/prometheus/tsdb-status-$(date +%F).json
Compare that output week over week. A sudden jump in seriesCountByMetricName for a specific metric almost always correlates with a deployment that introduced bad instrumentation. Catching it a week after deploy is much easier than catching it three months later when the bloated series have multiplied through every dashboard and alert that references them.
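That comparison is easy to automate. A sketch that diffs two saved snapshots and flags any metric whose series count grew by more than an arbitrary factor (2x here, which is a judgment call, not a standard threshold):

```python
import json

def cardinality_jumps(old_snapshot: str, new_snapshot: str, factor: float = 2.0):
    """Metrics whose head series count grew by more than `factor` between
    two saved /api/v1/status/tsdb snapshots (the .data object as JSON)."""
    def by_name(raw):
        data = json.loads(raw)
        return {e["name"]: e["value"] for e in data.get("seriesCountByMetricName", [])}
    old, new = by_name(old_snapshot), by_name(new_snapshot)
    return {
        name: (old.get(name, 0), count)
        for name, count in new.items()
        if count > factor * max(old.get(name, 0), 1)
    }

# Hypothetical last-week vs this-week snapshot contents
old = '{"seriesCountByMetricName": [{"name": "http_requests_total", "value": 40000}]}'
new = '{"seriesCountByMetricName": [{"name": "http_requests_total", "value": 847293}]}'
print(cardinality_jumps(old, new))  # {'http_requests_total': (40000, 847293)}
```

Run it against the dated files from the tee command above and any flagged metric points you straight at the deployment window that introduced the bad label.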
Finally, treat sample_limit on scrape jobs and recording rules for expensive queries as standard practice from day one — not as remediation steps you reach for after an incident. Build both into your team's monitoring standards and you'll sidestep most of the query timeout and memory pressure problems before they ever escalate into a page.
