InfraRunBook

    Prometheus Architecture and Data Model Explained

    Monitoring
    Published: Apr 8, 2026
    Updated: Apr 8, 2026

    A senior engineer's breakdown of how Prometheus is structured, how its pull-based scrape model works, and why the label-based data model is both its greatest strength and most common operational pitfall.


    What Prometheus Actually Is (And What It Isn't)

    Prometheus is a pull-based, open-source monitoring system and time series database. It scrapes metrics from HTTP endpoints your applications and infrastructure expose, stores those metrics locally in its own TSDB, and gives you PromQL — a functional query language — to aggregate, transform, and alert on that data. It was born at SoundCloud around 2012 and graduated from the CNCF in 2018. At this point it's the de facto standard for metrics collection in cloud-native environments, and increasingly outside of them too.

    What it isn't: a log aggregator, a tracing system, or a long-term analytics store. I've seen teams try to shove everything into Prometheus — structured log data, distributed traces, multi-year retention. It buckles under that load every time. Prometheus is laser-focused on time series metrics. Keep it that way and it'll serve you well. Start treating it like a general-purpose data store and you'll spend your next on-call rotation wondering why your monitoring system is the thing that's down.

    The Architecture: How the Pieces Fit Together

    The Prometheus server is the brain. It handles metric collection, storage, query execution, and rule evaluation. But it doesn't stand alone — a production Prometheus deployment involves several cooperating components, each with a clear responsibility boundary.

    The Prometheus Server

    The server runs a retrieval loop — the scrape cycle. Every scrape_interval seconds (15s by default), it sends HTTP GET requests to every configured target, reads the exposition format response, parses the metrics, and writes them to the local TSDB. That's it. The simplicity is deceptive. The scrape loop is what gives Prometheus its "pull" model, and that design choice has real implications for network architecture, firewall rules, and reliability that we'll get into shortly.

    The local TSDB stores data in a block-based structure on disk. Incoming data lands in an in-memory head block that gets compacted and written to disk as immutable two-hour blocks. Those blocks are then merged in the background into larger blocks covering longer time ranges. This design means Prometheus handles brief network blips gracefully — it doesn't care if your network was flaky for 20 seconds, it'll just scrape on the next interval and carry on.
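
    You can see this block structure directly on disk. Here's a sketch of what a Prometheus data directory typically looks like (the ULID block names are illustrative, not real):

    data/
    ├── 01HV3QXXXXXXXXXXXXXXXXXXXX/   # immutable block (ULID-named)
    │   ├── chunks/                   # compressed sample data
    │   ├── index                     # inverted index over label values
    │   ├── meta.json                 # block time range, compaction level
    │   └── tombstones                # pending deletions
    ├── chunks_head/                  # memory-mapped head chunks
    └── wal/                          # write-ahead log for crash recovery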

    Exporters: The Metric Bridges

    Most of your infrastructure doesn't natively expose Prometheus metrics. That's where exporters come in. An exporter is a process that sits alongside a system (or connects to it via API or socket), translates its internal state into Prometheus exposition format, and serves it on an HTTP endpoint — almost always at /metrics.

    The node_exporter is the canonical example. Deploy it on every host you want to monitor, and it exposes hundreds of metrics about CPU, memory, disk I/O, network interfaces, filesystems, and more. Point Prometheus at port 9100 on sw-infrarunbook-01.solvethenetwork.com and you've got full host telemetry with no application code changes required.

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets:
              - '10.10.1.15:9100'
              - '10.10.1.16:9100'
              - '10.10.1.17:9100'

    The exporter ecosystem is enormous. There's blackbox_exporter for probing HTTP, TCP, ICMP endpoints from the outside in; mysqld_exporter for MySQL replication lag and query performance; redis_exporter; snmp_exporter for network gear that speaks SNMP. If a system emits operational data in any form, there's almost certainly an exporter that can bridge it to Prometheus.

    Pushgateway: The Exception, Not the Rule

    The Pushgateway exists for one specific use case: short-lived jobs that finish before Prometheus would scrape them. A nightly database backup script that runs for 45 seconds and exits is the textbook example. If Prometheus's scrape interval is 60 seconds, it'll never catch that job's metrics in flight. Instead, the job pushes its final metrics to the Pushgateway at completion, and Prometheus scrapes the Pushgateway on its normal schedule.
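
    The push itself is just a plain HTTP request in exposition format. A sketch of that backup script's final step (the Pushgateway address is an assumption; 9091 is its default port):

    # Hypothetical host 10.10.1.22 running the Pushgateway
    echo "backup_duration_seconds 45.2" | curl --data-binary @- \
      http://10.10.1.22:9091/metrics/job/nightly_backup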

    Don't use the Pushgateway as a general-purpose ingest endpoint for services that are always running. I've seen this mistake more times than I'd like. Teams route all their application metrics through Pushgateway because they think the pull model requires too many firewall holes, or because setting up a /metrics endpoint felt complicated. The result is that Prometheus loses its automatic health signal — Pushgateway will happily serve stale metrics from a service that died three hours ago, and your up metric will keep showing 1 because Pushgateway itself is still alive.
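
    One partial mitigation if you're stuck with it: the Pushgateway exposes a push_time_seconds metric for every pushed group, so you can at least alert when pushes stop arriving. A sketch (the one-hour threshold is an arbitrary choice):

    # Alert if any pushed metric group hasn't been updated in over an hour
    time() - push_time_seconds > 3600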

    Service Discovery

    Static target configurations work fine in small, stable environments. Once you're operating across dynamic infrastructure — Kubernetes pods spinning up and down, autoscaling groups, ephemeral CI runners — you need service discovery. Prometheus has first-class SD integrations with Kubernetes, EC2, Consul, DNS, Azure, GCE, Nomad, and more.

    In Kubernetes environments the Kubernetes SD configuration lets Prometheus discover pods, services, nodes, and endpoints automatically. You combine this with relabeling rules to filter and reshape the discovered targets into exactly the scrape configuration you need.

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
            action: replace
            separator: '/'
            target_label: pod

    Alertmanager

    Prometheus evaluates alerting rules on a configurable interval (default 1 minute). When a rule's expression evaluates to a non-empty result for longer than its for duration, Prometheus fires an alert and ships it to Alertmanager. Alertmanager then handles grouping related alerts together, deduplicating repeated fires, applying silences, and routing to notification channels — PagerDuty, Slack, email, OpsGenie, webhooks.

    The separation here is deliberate and important. Prometheus knows about metrics and time. Alertmanager knows about human notification workflows. They each do one thing and do it well. Don't try to route alerts directly from Prometheus to Slack — route them through Alertmanager so you get grouping and deduplication and the ability to silence alerts during maintenance windows.

    The Data Model: Where the Real Power Lives

    Every piece of data in Prometheus is a time series. A time series is uniquely identified by a metric name and a set of key-value pairs called labels. That combination — name plus label set — defines a stream of timestamped float64 values called samples. The exposition format makes this concrete:

    <metric_name>{<label_name>=<label_value>, ...} <value> [<timestamp>]

    A real example scraped from node_exporter running on sw-infrarunbook-01.solvethenetwork.com:

    node_cpu_seconds_total{cpu="0",mode="idle",instance="10.10.1.15:9100",job="node"} 12345.67
    node_cpu_seconds_total{cpu="0",mode="system",instance="10.10.1.15:9100",job="node"} 234.56
    node_cpu_seconds_total{cpu="1",mode="idle",instance="10.10.1.15:9100",job="node"} 12300.00
    node_cpu_seconds_total{cpu="1",mode="system",instance="10.10.1.15:9100",job="node"} 245.10

    Each line is a distinct time series. They share the same metric name but differ in their label sets. That's the entire model. Its power comes from PromQL's ability to aggregate and transform across those label dimensions — you can sum CPU seconds across all modes on a single host, or sum across all hosts in a job, or compute a ratio between two label values, all with a single query expression.
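
    For example, the aggregations just described look like this in PromQL (the 5-minute range window is an arbitrary choice):

    # Per-host CPU busy rate, summed across all non-idle modes
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

    # Total idle CPU seconds/second across the whole 'node' job
    sum(rate(node_cpu_seconds_total{mode="idle",job="node"}[5m]))

    # Ratio: fraction of CPU time spent in system mode, per host
    sum by (instance) (rate(node_cpu_seconds_total{mode="system"}[5m]))
      / sum by (instance) (rate(node_cpu_seconds_total[5m]))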

    The Four Metric Types

    Prometheus defines four metric types in its client libraries. These types only matter at instrumentation time — once data hits the TSDB, it's all just float64 samples. But the types shape how you should reason about and query each metric.

    Counter is a monotonically increasing value that only goes up (or resets to zero on process restart). Total HTTP requests, bytes transmitted, errors encountered — anything you'd compute a rate over. Counters are the most commonly misused type in my experience. I've seen engineers use a gauge to track "total requests processed" and then wonder why their graphs show nonsensical dips. Use counters for cumulative totals, always, and use rate() or increase() in PromQL to extract meaningful rates from them.

    # HELP http_requests_total Total HTTP requests received
    # TYPE http_requests_total counter
    http_requests_total{method="GET",status="200"} 1027443
    http_requests_total{method="POST",status="200"} 34892
    http_requests_total{method="GET",status="500"} 127
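
    Querying those counters with rate() might look like this (the 5-minute window is an arbitrary choice):

    # Per-second request rate over the last 5 minutes, by status
    sum by (status) (rate(http_requests_total[5m]))

    # Error ratio: 5xx responses as a fraction of all requests
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))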

    Gauge is a value that can go up or down — current memory usage, queue depth, number of active goroutines, temperature. You read gauges directly; their instantaneous value is meaningful without transformation.

    Histogram is where instrumentation gets sophisticated. A histogram samples observations (request durations, payload sizes) and counts them into predefined buckets. You also get the total count and the sum of all observed values automatically. Histograms are the correct tool for latency and size distributions — anything where you need percentile estimates.

    # HELP http_request_duration_seconds HTTP request latency
    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_bucket{le="0.005"} 24054
    http_request_duration_seconds_bucket{le="0.01"} 33444
    http_request_duration_seconds_bucket{le="0.025"} 100392
    http_request_duration_seconds_bucket{le="0.05"} 129389
    http_request_duration_seconds_bucket{le="0.1"} 133988
    http_request_duration_seconds_bucket{le="+Inf"} 144320
    http_request_duration_seconds_sum 53423.147
    http_request_duration_seconds_count 144320

    The histogram_quantile() PromQL function estimates percentiles from those bucket boundaries. Accuracy depends entirely on your bucket configuration — if your actual latencies don't fall within your bucket ranges, your p99 estimates will be misleading. Getting histogram buckets right requires knowing your latency distribution in advance, which makes it more art than science the first time around.

    Summary computes quantiles client-side on a sliding time window. It sounds appealing but has a fundamental limitation: you cannot aggregate summaries across instances. If you have four replicas each computing their own p99, there's no correct way to combine them into a service-level p99. For most production use cases, histograms are the right choice. Summaries are useful when you need very accurate quantiles, have a single instance, and your latency distribution varies enough that predefined histogram buckets won't serve you well.
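
    To make the counter and histogram semantics concrete, here is a minimal stdlib-only Python sketch of how a client library maintains this state internally. This is illustrative only — it is not the official prometheus_client API, and the names are invented:

```python
import bisect

class MiniCounter:
    """Monotonic counter, as a client library would maintain it."""
    def __init__(self, name):
        self.name, self.value = name, 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

    def expose(self):
        return f"{self.name} {self.value}"

class MiniHistogram:
    """Histogram with cumulative 'le' buckets, Prometheus-style."""
    def __init__(self, name, buckets=(0.005, 0.01, 0.025, 0.05, 0.1)):
        self.name = name
        self.uppers = list(buckets)
        self.counts = [0] * (len(self.uppers) + 1)  # last slot is +Inf
        self.total, self.count = 0.0, 0

    def observe(self, value):
        # Find the first bucket whose upper bound is >= value;
        # cumulative counts are computed at exposition time.
        i = bisect.bisect_left(self.uppers, value)
        self.counts[i] += 1
        self.total += value
        self.count += 1

    def expose(self):
        lines, cumulative = [], 0
        for upper, c in zip(self.uppers + ["+Inf"], self.counts):
            cumulative += c
            lines.append(f'{self.name}_bucket{{le="{upper}"}} {cumulative}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {self.count}")
        return "\n".join(lines)

h = MiniHistogram("http_request_duration_seconds")
for v in (0.003, 0.004, 0.02, 0.07, 0.3):
    h.observe(v)
print(h.expose())
```

    Note how each bucket count is cumulative: every observation is included in every bucket whose upper bound it falls under, which is exactly why histogram_quantile() needs the le label to interpolate percentiles.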

    Labels and the Cardinality Problem

    Labels are Prometheus's most powerful feature and the source of most production problems I've encountered with it. Every unique combination of label values creates a distinct time series in the TSDB. That's cardinality. If you add a label with 10,000 possible values to a metric family that already has 50 time series, you now have 500,000 time series. At production scale, high cardinality destroys Prometheus's memory footprint and can crash the server outright.

    The rule is simple and non-negotiable: never use unbounded values as label values. User IDs, request IDs, email addresses, raw URLs, session tokens, trace IDs — all of these will cause cardinality explosions. I've watched a single mislabeled metric take down a Prometheus instance that was otherwise handling over a million active time series without complaint. The server's memory usage went from 4GB to OOM in under an hour after a bad deploy added a user_id label.

    # Good label design — bounded, enumerable values
    http_requests_total{method="GET", status="200", service="api-gateway"} 48291
    
    # Cardinality disaster waiting to happen
    http_requests_total{method="GET", status="200", user_id="a7f3c891-4d2e-..."} 1

    Good label values are: HTTP methods (a small, fixed set), status code classes (2xx, 3xx, 4xx, 5xx), region names, environment names, service names, queue names. Bounded, enumerable, and meaningful for aggregation. If a label can take more than a few hundred distinct values, question hard whether it should be a label at all.
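
    To audit cardinality on a running server, a couple of standard PromQL queries are useful:

    # Ten metric names with the most active time series
    topk(10, count by (__name__) ({__name__=~".+"}))

    # Series count for one suspect metric family
    count(http_requests_total)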

    Why This Architecture Matters in Practice

    The pull model is a deliberate design choice with operational consequences that aren't immediately obvious. In a push-based monitoring system, a misbehaving target that starts emitting millions of metrics per second can overwhelm the central collector. With Prometheus's pull model, the server controls ingestion rate completely — targets cannot push more data than the scrape interval allows, regardless of how badly they misbehave.

    The pull model also gives you automatic instance health checking for free. If Prometheus can't reach a target — because it's down, because a firewall rule changed, because the process crashed — the scrape fails and the up metric for that target drops to 0. You don't need a separate heartbeat mechanism or health check system. The scrape itself is the heartbeat.

    up{job="node", instance="10.10.1.15:9100"} 1
    up{job="node", instance="10.10.1.16:9100"} 0
    up{job="node", instance="10.10.1.17:9100"} 1

    The local TSDB design means Prometheus has zero external dependencies for its core function. No Kafka, no Cassandra, no distributed coordination. This is a feature, not a limitation. A monitoring system that requires a healthy distributed infrastructure to operate is poorly suited to alerting you when your distributed infrastructure has problems. Prometheus running on a single VM with local disk will keep scraping and alerting even when everything else is on fire.

    For teams that need longer retention or high-availability setups, Prometheus supports remote_write — streaming samples to a long-term backend like Thanos, Cortex, or VictoriaMetrics as they're ingested. This is the standard pattern for getting both the operational simplicity of Prometheus and the durability of distributed storage. Prometheus handles real-time scraping and alerting; the remote backend handles historical querying and HA.
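
    A minimal remote_write sketch — the endpoint URL is a placeholder assumption; the queue_config fields shown are real Prometheus options:

    remote_write:
      - url: 'http://10.10.1.30:9009/api/v1/push'
        queue_config:
          capacity: 10000
          max_shards: 50
          max_samples_per_send: 2000
        write_relabel_configs:
          # Drop noisy runtime metrics before they leave the server
          - source_labels: [__name__]
            regex: 'go_gc_.*'
            action: drop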

    A Real-World Configuration at solvethenetwork.com

    Here's what a practical Prometheus configuration looks like for a small production infrastructure stack, pulling together everything covered above:

    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        datacenter: 'dc1'
        environment: 'production'
    
    rule_files:
      - '/etc/prometheus/rules/*.yml'
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - '10.10.1.20:9093'
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      - job_name: 'node'
        static_configs:
          - targets:
              - 'sw-infrarunbook-01.solvethenetwork.com:9100'
              - '10.10.1.15:9100'
              - '10.10.1.16:9100'
    
      - job_name: 'blackbox-http'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - 'https://solvethenetwork.com'
              - 'https://api.solvethenetwork.com/health'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: '10.10.1.21:9115'

    And the alerting rules that go with it:

    groups:
      - name: infrastructure
        interval: 1m
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} unreachable"
              description: "{{ $labels.instance }} (job: {{ $labels.job }}) has failed scrapes for over 2 minutes."
    
          - alert: HighMemoryPressure
            expr: |
              (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Memory pressure on {{ $labels.instance }}"
              description: "Memory utilization is {{ $value | printf \"%.1f\" }}% on {{ $labels.instance }}"
    
          - alert: DiskFillingSoon
            expr: |
              predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Disk filling up on {{ $labels.instance }}"
              description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} predicted to run out of space within 4 hours."

    Common Misconceptions Worth Addressing Directly

    The first one I hear constantly: "Prometheus isn't reliable because it only stores data locally and has no replication." This fundamentally misunderstands the design intent. Prometheus is meant to be deployed close to the things it monitors — per-cluster, per-datacenter, per-environment. Its local storage is intentional. For HA and long-term retention you layer Thanos or another remote storage adapter on top. Treating the lack of built-in replication as a flaw is like complaining that a screwdriver can't drive nails.

    Second misconception: labels give you unlimited observability flexibility so you should use as many as possible. Technically true, operationally disastrous at scale. Labels are not free. Every label dimension multiplies your active time series count. Design your label schema before you start instrumenting, think carefully about which dimensions you'll actually query along, and be conservative. Adding labels later is easy. Removing a high-cardinality label from a production metric requires a migration that your on-call engineers won't thank you for.

    Third: the Pushgateway is a good way to handle services that are behind NAT or firewalls. It's not — you should fix your network access instead, or use a pull-based reverse proxy pattern. Using the Pushgateway for long-running services means losing the automatic up health signal and potentially serving stale metrics indefinitely when a service dies.

    Fourth, and this one is subtle: Prometheus metric types affect how data is stored and queried. They don't. Counter, gauge, histogram, summary — these are hints for client libraries and human readers. The TSDB stores float64 samples. PromQL doesn't enforce type semantics at query time. You can call rate() on a gauge. Whether the result is meaningful is on you. The type system is documentation, not enforcement.

    Fifth: Prometheus metrics are inaccurate because you can miss events between scrapes. There's a kernel of truth here — if something spikes and recovers within a single 15-second scrape interval you won't see it. But for the operational signals that matter — CPU utilization, memory pressure, request error rates, queue depth — 15-second resolution is more than adequate. The rate() and increase() functions account for the scrape interval in their calculations. If you need sub-second event detection, Prometheus isn't the right tool. But most infrastructure observability questions don't need that resolution.


    Prometheus's architecture is built around a handful of strong, deliberate opinions: pull over push, local storage first, simple text-based exposition format, labels as the primary organizational primitive. Once you internalize those opinions, everything else — the exporters, service discovery, Alertmanager, remote storage integrations — follows logically from them.

    The data model is the foundation that makes PromQL so expressive. Understanding label cardinality is what separates teams that run Prometheus smoothly from teams that fight it constantly. Start with the server, a couple of node_exporters, and a handful of scrape targets. Write your first rate() query. See how the label dimensions let you slice the same metric family across jobs, instances, and whatever dimensions matter to your infrastructure. That's when the model stops being abstract and starts being genuinely useful.

    Frequently Asked Questions

    What is the difference between a Counter and a Gauge in Prometheus?

    A Counter only ever increases (or resets to zero on restart) and is used for cumulative totals like total HTTP requests or bytes sent — you compute rates from counters using rate() or increase(). A Gauge represents a value that can go up or down at any time, like current memory usage or active connection count, and is read directly without needing a rate function.

    Why does Prometheus use a pull model instead of push?

    The pull model gives Prometheus control over ingestion rate, preventing misbehaving targets from overwhelming the server. It also provides automatic health checking — if a scrape fails, Prometheus sets the up metric to 0 for that target without needing a separate heartbeat mechanism. Push-based systems require targets to know the monitoring endpoint and can mask failures if a dying process simply stops pushing.

    What causes high cardinality in Prometheus and how do I prevent it?

    High cardinality occurs when a label has many possible values, multiplying the number of active time series. Common culprits are labels containing user IDs, request IDs, email addresses, raw URLs, or session tokens. Prevent it by restricting label values to bounded, enumerable sets — HTTP methods, status code classes, region names, service names. Audit new instrumentation for cardinality before it reaches production.

    When should I use a Histogram versus a Summary in Prometheus?

    Use a Histogram in almost all cases. Histograms let you aggregate percentile estimates across multiple instances using histogram_quantile() in PromQL, which is essential for service-level latency metrics. Summaries compute quantiles client-side on a sliding window and cannot be aggregated across instances — if you have multiple replicas, there's no correct way to combine their summary quantiles into a service-level p99.

    Can Prometheus handle long-term metric storage on its own?

    Prometheus's local TSDB is not designed for multi-year retention — default retention is 15 days and storage is not replicated. For long-term storage and high availability, use remote_write to ship samples to a dedicated backend like Thanos, Cortex, or VictoriaMetrics. These systems store data in object storage (S3, GCS) and provide global query views across multiple Prometheus instances.
