InfraRunBook

    Prometheus High Cardinality Issues

    Monitoring
    Published: Apr 16, 2026
    Updated: Apr 16, 2026

    High cardinality in Prometheus causes memory exhaustion, query timeouts, and cascading monitoring failures. This runbook works through the most common root causes with real commands and actionable fixes.


    Symptoms

    When Prometheus starts struggling with high cardinality, it doesn't fail cleanly. It degrades slowly, and by the time someone notices, the situation is already bad. I've been called into more than a few incidents that started with "Grafana is slow" and ended with a Prometheus process eating 40GB of RAM on sw-infrarunbook-01 while on-call engineers scrambled to figure out what changed over the weekend.

    Here's what the degradation actually looks like. The Prometheus UI becomes sluggish or completely unresponsive. Queries that returned in milliseconds now spin for 30 seconds before throwing an error. The prometheus_tsdb_head_series metric — your single most important cardinality health indicator — shows numbers in the millions instead of the tens of thousands you're used to. Grafana dashboards fail with red panels, and when you dig into the query inspector, you see "context deadline exceeded" errors stacking up.

    In more severe cases, Prometheus itself gets OOM-killed by the kernel. You'll find entries like this in your system journal on sw-infrarunbook-01:

    Apr 16 03:14:22 sw-infrarunbook-01 kernel: Out of memory: Kill process 4821 (prometheus) score 892 or sacrifice child
    Apr 16 03:14:22 sw-infrarunbook-01 kernel: Killed process 4821 (prometheus) total-vm:42389504kB, anon-rss:38912048kB

    After a restart things look fine briefly — then memory climbs back up and you're back where you started. That cycle is the cardinality problem in its most brutal form. The process restarts cleanly but immediately begins rebuilding the same head series from the WAL and incoming scrapes, hitting the same ceiling within hours.


    Root Cause 1: Label Values Are Unbounded

    This is the most common cause of cardinality explosions, and it's almost always introduced by someone who had genuinely good intentions. Prometheus stores each unique combination of label values as a separate time series. The total cardinality of your metric set is the product of all unique label values across every label dimension. If even one label contains unbounded values — user IDs, request IDs, session tokens, IP addresses, full file paths, raw SQL queries — your series count multiplies out of control fast.
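    The multiplication is worth making concrete. A quick sketch with illustrative numbers — none of these counts come from a real system:

```shell
# Worst-case series count for one metric is the product of the distinct
# values per label. The counts below are illustrative assumptions.
methods=5          # GET, POST, PUT, DELETE, PATCH
statuses=8         # 200, 201, 204, 301, 400, 404, 500, 503
paths=1000000      # unbounded: one path per user ID
echo $(( methods * statuses * paths ))   # 40000000
```

    Five methods and eight status codes are harmless on their own; it's the one unbounded dimension that turns 40 series into 40 million.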

    Here's why it happens. A developer instruments their HTTP service and adds a path label to track request latency per endpoint. Sounds reasonable. But they pull the raw path from the request, which includes dynamic path parameters:
    http_requests_total{method="GET", path="/api/users/12345", status="200"}
    http_requests_total{method="GET", path="/api/users/67890", status="200"}
    http_requests_total{method="GET", path="/api/users/99999", status="200"}

    Every unique user ID embedded in the path label creates a new time series. With a million users, that's a million time series for a single metric. I've seen this exact pattern with IP addresses in client_ip labels, with full SQL query strings in query labels, and with trace IDs embedded in metric labels by engineers who confused metrics with distributed traces.

    To identify which labels are causing the explosion, query the Prometheus TSDB status API:

    curl -s http://10.10.1.50:9090/api/v1/status/tsdb | \
      jq '.data.seriesCountByMetricName | sort_by(.value) | reverse | .[0:10]'

    The output will look something like this:

    [
      { "name": "http_requests_total", "value": 847293 },
      { "name": "http_request_duration_seconds_bucket", "value": 762564 },
      { "name": "node_cpu_seconds_total", "value": 4800 },
      { "name": "up", "value": 320 }
    ]

    When one metric has 847,000 series and everything else has a few thousand, you've found your offender. Drill into the specific label causing the bloat:

    curl -s 'http://10.10.1.50:9090/api/v1/label/path/values' | jq '.data | length'

    If that returns 847293, the path label is unbounded and that's your culprit.

    The fix depends on where the label is set. Ideally you fix the application instrumentation to normalize path parameters — /api/users/:id instead of /api/users/12345. If you can't ship that fix immediately, use Prometheus metric relabeling to rewrite offending label values at scrape time:

    scrape_configs:
      - job_name: 'app-backend'
        static_configs:
          - targets: ['10.10.1.80:8080']
        metric_relabel_configs:
          # Relabel regexes are fully anchored, so no ^ or $ is needed;
          # the optional group preserves any trailing sub-path.
          - source_labels: [path]
            regex: '/api/users/[0-9]+(/.*)?'
            target_label: path
            replacement: '/api/users/:id$1'
          - source_labels: [path]
            regex: '/api/orders/[0-9]+(/.*)?'
            target_label: path
            replacement: '/api/orders/:id$1'

    Apply the relabeling, reload Prometheus (SIGHUP, or POST to /-/reload if the lifecycle API is enabled), and watch prometheus_tsdb_head_series fall. The bloated series stop receiving samples immediately, go stale, and are dropped from the head at the next head compaction a couple of hours later — you won't see an instant cliff, but the growth stops right away.
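    The application-side fix can be as small as one substitution applied before the value ever reaches a label. A minimal sketch of the idea — the helper name and regex are illustrative, not from the original service, and a real implementation should normalize against the router's route table instead:

```shell
# Collapse purely numeric path segments into a ':id' placeholder.
# Rough illustration only: this regex won't catch UUIDs or slugs.
normalize_path() {
  printf '%s\n' "$1" | sed -E 's@/[0-9]+@/:id@g'
}

normalize_path "/api/users/12345"       # -> /api/users/:id
normalize_path "/api/orders/99/items"   # -> /api/orders/:id/items
```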


    Root Cause 2: Too Many Time Series Overall

    Even without a single unbounded label, you can hit the cardinality wall by having too many targets, too many metrics per target, or both. This is a systems engineering problem, not a labeling mistake. The general rule of thumb I use: a well-resourced Prometheus instance handles around 1 to 2 million active series comfortably. Beyond that, you're fighting physics.

    The problem compounds when teams add exporters without any governance. Every new node_exporter, blackbox_exporter, mysqld_exporter, or custom application exporter adds hundreds or thousands of series. Multiply that across 300 nodes and 50 services and you've quietly crossed the threshold without any single decision feeling obviously wrong. Nobody added 2 million series in one go — they accumulated over months of legitimate growth.
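    A back-of-envelope budget makes the creep visible. The per-target numbers below are assumptions in a typical range, not measurements:

```shell
# Rough series budget: fleet size times series per exporter.
nodes=300;    series_per_node=1500      # node_exporter, default collectors
services=50;  series_per_service=8000   # app metrics, histograms included
total=$(( nodes * series_per_node + services * series_per_service ))
echo "estimated active series: $total"  # 450000 + 400000 = 850000
```

    Neither term looks alarming on its own, but the sum is already most of the way to the 1 million lower bound of that comfortable range.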

    Check your current total series count first:

    curl -s 'http://10.10.1.50:9090/api/v1/query?query=prometheus_tsdb_head_series' | \
      jq '.data.result[0].value[1]'

    Then look at which label value pairs contribute most to overall cardinality:

    curl -s http://10.10.1.50:9090/api/v1/status/tsdb | \
      jq '.data.seriesCountByLabelValuePair | sort_by(.value) | reverse | .[0:20]'

    You'll often find that a handful of combinations — instance values for your highest-metric-count exporters — dominate the list. This tells you where pruning will have the most impact.

    The fix has two layers. First, drop metrics you're not actually querying. node_exporter alone exposes over a thousand metrics by default; most teams only use a fraction of them. Use metric relabeling to drop entire families:

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*|go_goroutines|go_threads'
        action: drop
      - source_labels: [__name__]
        regex: 'node_nf_conntrack_.*|node_sockstat_.*'
        action: drop

    Before applying drops broadly, verify nobody is querying those metrics. A quick search through your Grafana dashboards and alerting rules for any reference to the metric family will confirm it's safe to drop.
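    That search can be scripted. A sketch, assuming dashboards are exported as JSON on disk and rules live alongside the Prometheus config — both paths are assumptions to adjust for your setup:

```shell
# Count files that reference a metric family before dropping it.
# A result of 0 for every family means the drop is safe.
refs_to() {
  family="$1"; shift
  grep -rl "$family" "$@" 2>/dev/null | wc -l
}

refs_to 'node_sockstat_' /var/lib/grafana/dashboards /etc/prometheus/rules
refs_to 'go_memstats_'   /var/lib/grafana/dashboards /etc/prometheus/rules
```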

    Second, if growth is structural rather than incidental, you need to shard your Prometheus topology. Deploy multiple Prometheus instances, each scraping a subset of targets, and use Thanos or Grafana Mimir to provide a federated query layer across all shards. This is the architectural answer to growth that relabeling alone can't solve.


    Root Cause 3: Query Timeout

    High cardinality doesn't just eat memory — it makes queries slow. Very slow. When a query has to scan 800,000 series to evaluate a selector like http_requests_total{job="api"}, it doesn't matter how fast your disk is. The fan-out is simply too wide, and the query engine has to touch every matching series before it can return a result.

    The symptom is unambiguous. You run a query in the Prometheus UI or Grafana and see:

    Error: context deadline exceeded

    Through the HTTP API the error response looks like:

    {
      "status": "error",
      "errorType": "timeout",
      "error": "query timed out in expression evaluation"
    }

    Prometheus has a default query timeout of two minutes (--query.timeout=2m). When queries hit this consistently, the queries themselves are too expensive — usually because the series they're scanning are too numerous or the time range is too wide.

    Enable query logging to identify your worst offenders. Add this flag to your Prometheus startup command on sw-infrarunbook-01:

    --query.log-file=/var/log/prometheus/query.log

    Then mine the log for the slowest queries. Each entry is a JSON object, so use jq rather than brittle field splitting:

    jq -r '"\(.stats.timings.evalTotalTime) \(.params.query)"' /var/log/prometheus/query.log | \
      sort -rn | head -20

    You can also monitor query performance through Prometheus's own metrics:

    prometheus_engine_query_duration_seconds{slice="inner_eval", quantile="0.9"}

    If the 90th percentile of inner_eval duration is consistently above 5 seconds, your queries are deeply expensive and your alerting rules are likely misfiring or evaluating with significant lag.

    The fix combines cardinality reduction with query optimization. Use specific label matchers to reduce the scan scope on every query. Instead of:

    rate(http_requests_total[5m])

    Use:

    rate(http_requests_total{job="api", env="production"}[5m])

    The additional label matchers act as pre-filters that let the TSDB skip entire series before the rate calculation even begins. Also avoid regex matchers on high-cardinality labels wherever possible — {path=~"/api/.*"} forces a full scan of every series label to evaluate the regex, which compounds the cost on already-bloated metrics.


    Root Cause 4: Memory Pressure on Prometheus

    Prometheus is fundamentally an in-memory database with a disk-backed WAL. Every active time series keeps its recent data in the head chunk in RAM. When you have 5 million active series, that's an enormous number of head chunks sitting in memory simultaneously, and Prometheus doesn't offer a "low memory mode" that trades RAM for slower query performance. It's all-or-nothing on the memory front.

    In my experience, memory growth is often the first symptom teams notice — before they understand cardinality is the root cause. The Prometheus process that used to run comfortably at 4GB starts touching 12GB, then 20GB, then gets OOM-killed. After the restart it climbs again. The team increases the memory limit. It climbs past that too.

    Check current memory usage with a PromQL query:

    process_resident_memory_bytes{job="prometheus"}

    Or directly from sw-infrarunbook-01:

    ps -o pid,rss,comm -p "$(pgrep -d, prometheus)" | awk 'NR>1 {printf "PID %s: %.1f GB\n", $1, $2/1048576}'

    Watch the growth over time to distinguish a one-time step from a continuous leak — use deriv(), not rate(), because resident memory is a gauge, not a counter:

    deriv(process_resident_memory_bytes{job="prometheus"}[1h])

    If this is consistently positive, new series are being created faster than old ones expire. That's a cardinality leak — something in your environment is producing novel label combinations continuously. The head's own counters confirm it: prometheus_tsdb_head_series_created_total persistently outpacing prometheus_tsdb_head_series_removed_total is the leak signature.

    For immediate relief, apply a sample_limit to scrape jobs to prevent a single bad target from flooding the head:

    scrape_configs:
      - job_name: 'app-backend'
        sample_limit: 15000
        static_configs:
          - targets: ['10.10.1.80:8080']
      - job_name: 'node-exporters'
        sample_limit: 2000
        static_configs:
          - targets: ['10.10.1.81:9100', '10.10.1.82:9100', '10.10.1.83:9100']

    When a target exceeds its sample_limit, Prometheus fails the entire scrape — up drops to 0 for that target and the prometheus_target_scrapes_exceeded_sample_limit_total counter increments. You'll see a gap in that target's data, which is your signal that the limit was hit. It's a circuit breaker, not a solution, but it prevents one misbehaving application from taking down your entire monitoring stack while you work the proper fix.
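    To pick a sample_limit that won't trip on day one, count what each target actually exposes today and add headroom. A sketch — in the Prometheus exposition format, every non-comment, non-blank line is one sample:

```shell
# Count samples in a Prometheus exposition payload read from stdin.
# Usage (target address from the example job above):
#   curl -s http://10.10.1.80:8080/metrics | count_samples
count_samples() {
  grep -c -v -e '^#' -e '^$'
}
```

    Run it against each job's targets, take the largest count, and set sample_limit perhaps 50% above that.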

    From Prometheus 2.45+ you can also enforce global guardrails in prometheus.yml directly:

    global:
      scrape_interval: 15s
      label_limit: 30
      label_name_length_limit: 200
      label_value_length_limit: 500

    The label_limit setting is underused but valuable. Legitimate applications rarely need more than 15-20 labels on a metric. Setting a hard limit of 30 catches badly instrumented exporters before they can inject dozens of custom labels that multiply your cardinality.


    Root Cause 5: Recording Rules Needed

    This one is slightly different from the others because it's not about the number of series stored — it's about the cost of evaluating them repeatedly. Even with a manageable series count, if your alerting rules and Grafana dashboards are running heavy aggregations over wide time windows on every evaluation cycle, you'll starve the Prometheus query engine. The symptoms look identical to cardinality problems: slow dashboards, timeouts, high CPU on sw-infrarunbook-01 — but the series count might look completely reasonable.

    Recording rules pre-compute expensive query expressions and store the results as new, lightweight time series. The rule runs once per evaluation interval, writes a single output series, and every downstream consumer — dashboards, alerts, other rules — queries that pre-computed series instead of re-running the full aggregation on every load. Query time drops from seconds to milliseconds. Prometheus CPU drops. Dashboards become snappy again.

    Identify whether you need recording rules by checking rule evaluation duration:

    prometheus_rule_evaluation_duration_seconds{quantile="0.9"}

    Also check for rule evaluation failures, which indicate rules are timing out silently:

    rate(prometheus_rule_evaluation_failures_total[5m])

    A non-zero failure rate here means your alerting rules may not be firing when they should be. That's not just a performance problem — it's a reliability problem for your entire alerting infrastructure.

    Here's a concrete before/after example. Suppose multiple Grafana dashboards and two alerting rules all evaluate this query independently:

    sum(rate(http_requests_total[5m])) by (service, status_code)

    With 800,000 http_requests_total series spread across 200 services, this aggregation hits every raw series on every evaluation. Convert it to a recording rule:

    groups:
      - name: http_aggregations
        interval: 60s
        rules:
          - record: job:http_requests:rate5m
            expr: sum(rate(http_requests_total[5m])) by (service, status_code)

    Now your dashboards and alerts query job:http_requests:rate5m — a pre-computed series with maybe 600 rows representing 200 services times 3 status code categories — instead of aggregating across 800,000 raw series every 15 seconds. The naming convention level:metric:operations is the community standard and makes recording rules self-documenting at a glance.

    Apply this pattern to any query that appears in both dashboards and alerting rules. If multiple teams query the same expensive expression, that's exactly the case recording rules were designed for.
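    Recording rules are also the easiest Prometheus objects to unit test with promtool. A sketch of a test file for the rule above — the rule file path is an assumption, and the input series is synthetic (a counter climbing 60 per minute evaluates to a rate of 1 per second):

```yaml
# rules-test.yml -- run with: promtool test rules rules-test.yml
rule_files:
  - http_aggregations.yml        # assumed path to the rule group above
evaluation_interval: 60s
tests:
  - interval: 60s
    input_series:
      - series: 'http_requests_total{service="checkout", status_code="200"}'
        values: '0+60x10'        # 0, 60, 120, ... one sample per minute
    promql_expr_test:
      - expr: job:http_requests:rate5m
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_requests:rate5m{service="checkout", status_code="200"}'
            value: 1
```

    A failing test here catches a renamed label or a broken aggregation before the rule ships, rather than when a dashboard goes blank.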


    Prevention

    Preventing cardinality problems is mostly a governance and early detection problem. Once you're in a full cardinality crisis with 8 million active series, your options are painful: drop data, restart Prometheus and lose head chunks, or fight through degraded performance while incrementally fixing labels. None of those are fun conversations to have at 2 AM. The smart move is never getting there.

    Set up cardinality and memory alerts on your Prometheus-about-Prometheus metrics. These should fire well before things become critical:

    groups:
      - name: prometheus_health
        rules:
          - alert: PrometheusHighCardinality
            expr: prometheus_tsdb_head_series > 1500000
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Prometheus series count is high on {{ $labels.instance }}"
              description: "Active series: {{ $value }}. Investigate label cardinality before this becomes critical."
    
          - alert: PrometheusCriticalCardinality
            expr: prometheus_tsdb_head_series > 3000000
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Prometheus cardinality critical on {{ $labels.instance }}"
    
          - alert: PrometheusMemoryHigh
            expr: process_resident_memory_bytes{job="prometheus"} > 8589934592
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Prometheus memory exceeds 8GB on {{ $labels.instance }}"

    Make cardinality review part of your deployment process for any service that ships Prometheus instrumentation. When a developer opens a PR adding new metrics, the review question is simple: what are the possible values for each label, and is that set bounded? If the answer is "users can put anything there" or "it depends on request parameters," that's a red flag that needs to be resolved before the code ships.

    Run the TSDB status endpoint as part of your regular operational review and snapshot the output for comparison:

    curl -s http://10.10.1.50:9090/api/v1/status/tsdb | \
      jq '.data' | tee /var/log/prometheus/tsdb-status-$(date +%F).json

    Compare that output week over week. A sudden jump in seriesCountByMetricName for a specific metric almost always correlates with a deployment that introduced bad instrumentation. Catching it a week after deploy is much easier than catching it three months later when the bloated series have multiplied through every dashboard and alert that references them.
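    The comparison itself is scriptable. A sketch, assuming two snapshots saved by the tee command above (the filenames in the usage comment are illustrative):

```shell
# Report metrics whose series count grew more than 20% between snapshots.
# Usage: compare_snapshots tsdb-status-2026-04-09.json tsdb-status-2026-04-16.json
compare_snapshots() {
  jq -rn --slurpfile old "$1" --slurpfile new "$2" '
    ($old[0].seriesCountByMetricName | map({(.name): .value}) | add) as $prev
    | $new[0].seriesCountByMetricName[]
    | ($prev[.name] // 0) as $p
    | select($p > 0 and .value > $p * 1.2)
    | "\(.name): \($p) -> \(.value)"'
}
```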

    Finally, treat sample_limit on scrape jobs and recording rules for expensive queries as standard practice from day one — not as remediation steps you reach for after an incident. Build both into your team's monitoring standards and you'll sidestep most of the query timeout and memory pressure problems before they ever escalate into a page.

    Frequently Asked Questions

    What is high cardinality in Prometheus and why does it matter?

    High cardinality in Prometheus means having too many unique combinations of label values, which causes Prometheus to create and track an excessive number of time series. Each unique label set is stored as a separate series in memory, so high cardinality directly drives up RAM usage, slows down queries, and can ultimately crash Prometheus via OOM kill.

    How do I find which metric is causing my Prometheus cardinality explosion?

    Use the TSDB status API: curl -s http://10.10.1.50:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | sort_by(.value) | reverse | .[0:10]'. This returns the top metrics by series count. If one metric has hundreds of thousands of series while others have a few hundred, that metric's labels are almost certainly unbounded.

    How many active time series can Prometheus handle before performance degrades?

    On well-resourced hardware, Prometheus handles 1 to 2 million active series comfortably. Beyond that, memory pressure increases significantly and query latency degrades. The exact limit depends on your hardware, query patterns, and scrape intervals — but 1.5 million is a reasonable warning threshold to alert on.

    What is a recording rule in Prometheus and when should I use one?

    A recording rule pre-computes an expensive PromQL expression on a schedule and stores the result as a new, lightweight time series. You should use recording rules whenever an aggregation query is evaluated repeatedly by dashboards or alerting rules, especially when the underlying metric has high cardinality. They reduce query time from seconds to milliseconds for frequently-accessed aggregations.

    Can I stop a single bad scrape target from overwhelming Prometheus with too many series?

    Yes. Set sample_limit in the scrape job configuration for that target. When the scraped metric count exceeds the limit, Prometheus drops the entire scrape for that interval rather than ingesting the excess series. This acts as a circuit breaker to protect the rest of your monitoring stack while you fix the underlying instrumentation problem.
