InfraRunBook

    Prometheus Architecture and Data Model Explained

    Monitoring
    Published: Apr 8, 2026
    Updated: Apr 8, 2026

    A senior engineer's breakdown of how Prometheus is structured, how its pull-based scrape model works, and why the label-based data model is both its greatest strength and most common operational pitfall.


    What Prometheus Actually Is (And What It Isn't)

    Prometheus is a pull-based, open-source monitoring system and time series database. It scrapes metrics from HTTP endpoints your applications and infrastructure expose, stores those metrics locally in its own TSDB, and gives you PromQL — a functional query language — to aggregate, transform, and alert on that data. It was born at SoundCloud around 2012 and graduated from the CNCF in 2018. At this point it's the de facto standard for metrics collection in cloud-native environments, and increasingly outside of them too.

    What it isn't: a log aggregator, a tracing system, or a long-term analytics store. I've seen teams try to shove everything into Prometheus — structured log data, distributed traces, multi-year retention. It buckles under that load every time. Prometheus is laser-focused on time series metrics. Keep it that way and it'll serve you well. Start treating it like a general-purpose data store and you'll spend your next on-call rotation wondering why your monitoring system is the thing that's down.

    The Architecture: How the Pieces Fit Together

    The Prometheus server is the brain. It handles metric collection, storage, query execution, and rule evaluation. But it doesn't stand alone — a production Prometheus deployment involves several cooperating components, each with a clear responsibility boundary.

    The Prometheus Server

    The server runs a retrieval loop — the scrape cycle. Every scrape_interval seconds (15s by default), it sends HTTP GET requests to every configured target, reads the exposition format response, parses the metrics, and writes them to the local TSDB. That's it. The simplicity is deceptive. The scrape loop is what gives Prometheus its "pull" model, and that design choice has real implications for network architecture, firewall rules, and reliability that we'll get into shortly.

    The local TSDB stores data in a block-based structure on disk. Incoming data lands in an in-memory head block that gets compacted and written to disk as immutable two-hour blocks. Those blocks are then merged in the background into larger blocks covering longer time ranges. This design means Prometheus handles brief network blips gracefully — it doesn't care if your network was flaky for 20 seconds, it'll just scrape on the next interval and carry on.
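
    You can see this block structure directly on disk. Here's a sketch of what a Prometheus data directory typically looks like (the ULID block names are illustrative, not real):

    data/
    ├── 01HV3QXXXXXXXXXXXXXXXXXXXX/   # immutable block (ULID-named)
    │   ├── chunks/                   # compressed sample data
    │   ├── index                     # inverted index over label values
    │   ├── meta.json                 # block time range, compaction level
    │   └── tombstones                # pending deletions
    ├── chunks_head/                  # memory-mapped head chunks
    └── wal/                          # write-ahead log for crash recovery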

    Exporters: The Metric Bridges

    Most of your infrastructure doesn't natively expose Prometheus metrics. That's where exporters come in. An exporter is a process that sits alongside a system (or connects to it via API or socket), translates its internal state into Prometheus exposition format, and serves it on an HTTP endpoint — almost always at /metrics.

    The node_exporter is the canonical example. Deploy it on every host you want to monitor, and it exposes hundreds of metrics about CPU, memory, disk I/O, network interfaces, filesystems, and more. Point Prometheus at port 9100 on sw-infrarunbook-01.solvethenetwork.com and you've got full host telemetry with no application code changes required.

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets:
              - '10.10.1.15:9100'
              - '10.10.1.16:9100'
              - '10.10.1.17:9100'

    The exporter ecosystem is enormous. There's blackbox_exporter for probing HTTP, TCP, ICMP endpoints from the outside in; mysqld_exporter for MySQL replication lag and query performance; redis_exporter; snmp_exporter for network gear that speaks SNMP. If a system emits operational data in any form, there's almost certainly an exporter that can bridge it to Prometheus.

    Pushgateway: The Exception, Not the Rule

    The Pushgateway exists for one specific use case: short-lived jobs that finish before Prometheus would scrape them. A nightly database backup script that runs for 45 seconds and exits is the textbook example. If Prometheus's scrape interval is 60 seconds, it'll never catch that job's metrics in flight. Instead, the job pushes its final metrics to the Pushgateway at completion, and Prometheus scrapes the Pushgateway on its normal schedule.
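
    The push itself is just a plain HTTP request in exposition format. A sketch of that backup script's final step (the Pushgateway address is an assumption; 9091 is its default port):

    # Hypothetical host 10.10.1.22 running the Pushgateway
    echo "backup_duration_seconds 45.2" | curl --data-binary @- \
      http://10.10.1.22:9091/metrics/job/nightly_backup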

    Don't use the Pushgateway as a general-purpose ingest endpoint for services that are always running. I've seen this mistake more times than I'd like. Teams route all their application metrics through Pushgateway because they think the pull model requires too many firewall holes, or because setting up a /metrics endpoint felt complicated. The result is that Prometheus loses its automatic health signal — Pushgateway will happily serve stale metrics from a service that died three hours ago, and your up metric will keep showing 1 because Pushgateway itself is still alive.
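
    One partial mitigation if you're stuck with it: the Pushgateway exposes a push_time_seconds metric for every pushed group, so you can at least alert when pushes stop arriving. A sketch (the one-hour threshold is an arbitrary choice):

    # Alert if any pushed metric group hasn't been updated in over an hour
    time() - push_time_seconds > 3600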

    Service Discovery

    Static target configurations work fine in small, stable environments. Once you're operating across dynamic infrastructure — Kubernetes pods spinning up and down, autoscaling groups, ephemeral CI runners — you need service discovery. Prometheus has first-class SD integrations with Kubernetes, EC2, Consul, DNS, Azure, GCE, Nomad, and more.

    In Kubernetes environments the Kubernetes SD configuration lets Prometheus discover pods, services, nodes, and endpoints automatically. You combine this with relabeling rules to filter and reshape the discovered targets into exactly the scrape configuration you need.

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
            action: replace
            separator: '/'
            target_label: pod

    Alertmanager

    Prometheus evaluates alerting rules on a configurable interval (default 1 minute). When a rule's expression evaluates to a non-empty result for longer than its for duration, Prometheus fires an alert and ships it to Alertmanager. Alertmanager then handles grouping related alerts together, deduplicating repeated fires, applying silences, and routing to notification channels — PagerDuty, Slack, email, OpsGenie, webhooks.

    The separation here is deliberate and important. Prometheus knows about metrics and time. Alertmanager knows about human notification workflows. They each do one thing and do it well. Don't try to route alerts directly from Prometheus to Slack — route them through Alertmanager so you get grouping and deduplication and the ability to silence alerts during maintenance windows.

    The Data Model: Where the Real Power Lives

    Every piece of data in Prometheus is a time series. A time series is uniquely identified by a metric name and a set of key-value pairs called labels. That combination — name plus label set — defines a stream of timestamped float64 values called samples. The exposition format makes this concrete:

    <metric_name>{<label_name>=<label_value>, ...} <value> [<timestamp>]

    A real example scraped from node_exporter running on sw-infrarunbook-01.solvethenetwork.com:

    node_cpu_seconds_total{cpu="0",mode="idle",instance="10.10.1.15:9100",job="node"} 12345.67
    node_cpu_seconds_total{cpu="0",mode="system",instance="10.10.1.15:9100",job="node"} 234.56
    node_cpu_seconds_total{cpu="1",mode="idle",instance="10.10.1.15:9100",job="node"} 12300.00
    node_cpu_seconds_total{cpu="1",mode="system",instance="10.10.1.15:9100",job="node"} 245.10

    Each line is a distinct time series. They share the same metric name but differ in their label sets. That's the entire model. Its power comes from PromQL's ability to aggregate and transform across those label dimensions — you can sum CPU seconds across all modes on a single host, or sum across all hosts in a job, or compute a ratio between two label values, all with a single query expression.
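
    For example, the aggregations just described look like this in PromQL (the 5-minute range window is an arbitrary choice):

    # Per-host CPU busy rate, summed across all non-idle modes
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

    # Total idle CPU seconds/second across the whole 'node' job
    sum(rate(node_cpu_seconds_total{mode="idle",job="node"}[5m]))

    # Ratio: fraction of CPU time spent in system mode, per host
    sum by (instance) (rate(node_cpu_seconds_total{mode="system"}[5m]))
      / sum by (instance) (rate(node_cpu_seconds_total[5m]))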

    The Four Metric Types

    Prometheus defines four metric types in its client libraries. These types only matter at instrumentation time — once data hits the TSDB, it's all just float64 samples. But the types shape how you should reason about and query each metric.

    Counter is a monotonically increasing value that only goes up (or resets to zero on process restart). Total HTTP requests, bytes transmitted, errors encountered — anything you'd compute a rate over. Counters are the most commonly misused type in my experience. I've seen engineers use a gauge to track "total requests processed" and then wonder why their graphs show nonsensical dips. Use counters for cumulative totals, always, and use rate() or increase() in PromQL to extract meaningful rates from them.

    # HELP http_requests_total Total HTTP requests received
    # TYPE http_requests_total counter
    http_requests_total{method="GET",status="200"} 1027443
    http_requests_total{method="POST",status="200"} 34892
    http_requests_total{method="GET",status="500"} 127
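
    Querying those counters with rate() might look like this (the 5-minute window is an arbitrary choice):

    # Per-second request rate over the last 5 minutes, by status
    sum by (status) (rate(http_requests_total[5m]))

    # Error ratio: 5xx responses as a fraction of all requests
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))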

    Gauge is a value that can go up or down — current memory usage, queue depth, number of active goroutines, temperature. You read gauges directly; their instantaneous value is meaningful without transformation.

    Histogram is where instrumentation gets sophisticated. A histogram samples observations (request durations, payload sizes) and counts them into predefined buckets. You also get the total count and the sum of all observed values automatically. Histograms are the correct tool for latency and size distributions — anything where you need percentile estimates.

    # HELP http_request_duration_seconds HTTP request latency
    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_bucket{le="0.005"} 24054
    http_request_duration_seconds_bucket{le="0.01"} 33444
    http_request_duration_seconds_bucket{le="0.025"} 100392
    http_request_duration_seconds_bucket{le="0.05"} 129389
    http_request_duration_seconds_bucket{le="0.1"} 133988
    http_request_duration_seconds_bucket{le="+Inf"} 144320
    http_request_duration_seconds_sum 53423.147
    http_request_duration_seconds_count 144320

    The histogram_quantile() PromQL function estimates percentiles from those bucket boundaries. Accuracy depends entirely on your bucket configuration — if your actual latencies don't fall within your bucket ranges, your p99 estimates will be misleading. Getting histogram buckets right requires knowing your latency distribution in advance, which makes it more art than science the first time around.

    Summary computes quantiles client-side on a sliding time window. It sounds appealing but has a fundamental limitation: you cannot aggregate summaries across instances. If you have four replicas each computing their own p99, there's no correct way to combine them into a service-level p99. For most production use cases, histograms are the right choice. Summaries are useful when you need very accurate quantiles, have a single instance, and your latency distribution varies enough that predefined histogram buckets won't serve you well.
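
    To make the counter and histogram semantics concrete, here is a minimal stdlib-only Python sketch of how a client library maintains this state internally. This is illustrative only — it is not the official prometheus_client API, and the names are invented:

```python
import bisect

class MiniCounter:
    """Monotonic counter, as a client library would maintain it."""
    def __init__(self, name):
        self.name, self.value = name, 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

    def expose(self):
        return f"{self.name} {self.value}"

class MiniHistogram:
    """Histogram with cumulative 'le' buckets, Prometheus-style."""
    def __init__(self, name, buckets=(0.005, 0.01, 0.025, 0.05, 0.1)):
        self.name = name
        self.uppers = list(buckets)
        self.counts = [0] * (len(self.uppers) + 1)  # last slot is +Inf
        self.total, self.count = 0.0, 0

    def observe(self, value):
        # Find the first bucket whose upper bound is >= value;
        # cumulative counts are computed at exposition time.
        i = bisect.bisect_left(self.uppers, value)
        self.counts[i] += 1
        self.total += value
        self.count += 1

    def expose(self):
        lines, cumulative = [], 0
        for upper, c in zip(self.uppers + ["+Inf"], self.counts):
            cumulative += c
            lines.append(f'{self.name}_bucket{{le="{upper}"}} {cumulative}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {self.count}")
        return "\n".join(lines)

h = MiniHistogram("http_request_duration_seconds")
for v in (0.003, 0.004, 0.02, 0.07, 0.3):
    h.observe(v)
print(h.expose())
```

    Note how each bucket count is cumulative: every observation is included in every bucket whose upper bound it falls under, which is exactly why histogram_quantile() needs the le label to interpolate percentiles.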

    Labels and the Cardinality Problem

    Labels are Prometheus's most powerful feature and the source of most production problems I've encountered with it. Every unique combination of label values creates a distinct time series in the TSDB. That's cardinality. If you add a label with 10,000 possible values to a metric family that already has 50 time series, you now have 500,000 time series. At production scale, high cardinality destroys Prometheus's memory footprint and can crash the server outright.

    The rule is simple and non-negotiable: never use unbounded values as label values. User IDs, request IDs, email addresses, raw URLs, session tokens, trace IDs — all of these will cause cardinality explosions. I've watched a single mislabeled metric take down a Prometheus instance that was otherwise handling over a million active time series without complaint. The server's memory usage went from 4GB to OOM in under an hour after a bad deploy added a user_id label.

    # Good label design — bounded, enumerable values
    http_requests_total{method="GET", status="200", service="api-gateway"} 48291
    
    # Cardinality disaster waiting to happen
    http_requests_total{method="GET", status="200", user_id="a7f3c891-4d2e-..."} 1

    Good label values are: HTTP methods (a small, fixed set), status code classes (2xx, 3xx, 4xx, 5xx), region names, environment names, service names, queue names. Bounded, enumerable, and meaningful for aggregation. If a label can take more than a few hundred distinct values, question hard whether it should be a label at all.
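
    To audit cardinality on a running server, a couple of standard PromQL queries are useful:

    # Ten metric names with the most active time series
    topk(10, count by (__name__) ({__name__=~".+"}))

    # Series count for one suspect metric family
    count(http_requests_total)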

    Why This Architecture Matters in Practice

    The pull model is a deliberate design choice with operational consequences that aren't immediately obvious. In a push-based monitoring system, a misbehaving target that starts emitting millions of metrics per second can overwhelm the central collector. With Prometheus's pull model, the server controls ingestion rate completely — targets cannot push more data than the scrape interval allows, regardless of how badly they misbehave.

    The pull model also gives you automatic instance health checking for free. If Prometheus can't reach a target — because it's down, because a firewall rule changed, because the process crashed — the scrape fails and the up metric for that target drops to 0. You don't need a separate heartbeat mechanism or health check system. The scrape itself is the heartbeat.

    up{job="node", instance="10.10.1.15:9100"} 1
    up{job="node", instance="10.10.1.16:9100"} 0
    up{job="node", instance="10.10.1.17:9100"} 1

    The local TSDB design means Prometheus has zero external dependencies for its core function. No Kafka, no Cassandra, no distributed coordination. This is a feature, not a limitation. A monitoring system that requires a healthy distributed infrastructure to operate is poorly suited to alerting you when your distributed infrastructure has problems. Prometheus running on a single VM with local disk will keep scraping and alerting even when everything else is on fire.

    For teams that need longer retention or high-availability setups, Prometheus supports remote_write — streaming samples to a long-term backend like Thanos, Cortex, or VictoriaMetrics as they're ingested. This is the standard pattern for getting both the operational simplicity of Prometheus and the durability of distributed storage. Prometheus handles real-time scraping and alerting; the remote backend handles historical querying and HA.
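
    A minimal remote_write sketch — the endpoint URL is a placeholder assumption; the queue_config fields shown are real Prometheus options:

    remote_write:
      - url: 'http://10.10.1.30:9009/api/v1/push'
        queue_config:
          capacity: 10000
          max_shards: 50
          max_samples_per_send: 2000
        write_relabel_configs:
          # Drop noisy runtime metrics before they leave the server
          - source_labels: [__name__]
            regex: 'go_gc_.*'
            action: drop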

    A Real-World Configuration at solvethenetwork.com

    Here's what a practical Prometheus configuration looks like for a small production infrastructure stack, pulling together everything covered above:

    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        datacenter: 'dc1'
        environment: 'production'
    
    rule_files:
      - '/etc/prometheus/rules/*.yml'
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - '10.10.1.20:9093'
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      - job_name: 'node'
        static_configs:
          - targets:
              - 'sw-infrarunbook-01.solvethenetwork.com:9100'
              - '10.10.1.15:9100'
              - '10.10.1.16:9100'
    
      - job_name: 'blackbox-http'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - 'https://solvethenetwork.com'
              - 'https://api.solvethenetwork.com/health'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: '10.10.1.21:9115'

    And the alerting rules that go with it:

    groups:
      - name: infrastructure
        interval: 1m
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} unreachable"
              description: "{{ $labels.instance }} (job: {{ $labels.job }}) has failed scrapes for over 2 minutes."
    
          - alert: HighMemoryPressure
            expr: |
              (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Memory pressure on {{ $labels.instance }}"
              description: "Memory utilization is {{ $value | printf \"%.1f\" }}% on {{ $labels.instance }}"
    
          - alert: DiskFillingSoon
            expr: |
              predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Disk filling up on {{ $labels.instance }}"
              description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} predicted to run out of space within 4 hours."

    Common Misconceptions Worth Addressing Directly

    The first one I hear constantly: "Prometheus isn't reliable because it only stores data locally and has no replication." This fundamentally misunderstands the design intent. Prometheus is meant to be deployed close to the things it monitors — per-cluster, per-datacenter, per-environment. Its local storage is intentional. For HA and long-term retention you layer Thanos or another remote storage adapter on top. Treating the lack of built-in replication as a flaw is like complaining that a screwdriver can't drive nails.

    Second misconception: labels give you unlimited observability flexibility so you should use as many as possible. Technically true, operationally disastrous at scale. Labels are not free. Every label dimension multiplies your active time series count. Design your label schema before you start instrumenting, think carefully about which dimensions you'll actually query along, and be conservative. Adding labels later is easy. Removing a high-cardinality label from a production metric requires a migration that your on-call engineers won't thank you for.

    Third: the Pushgateway is a good way to handle services that are behind NAT or firewalls. It's not — you should fix your network access instead, or use a pull-based reverse proxy pattern. Using the Pushgateway for long-running services means losing the automatic up health signal and potentially serving stale metrics indefinitely when a service dies.

    Fourth, and this one is subtle: Prometheus metric types affect how data is stored and queried. They don't. Counter, gauge, histogram, summary — these are hints for client libraries and human readers. The TSDB stores float64 samples. PromQL doesn't enforce type semantics at query time. You can call rate() on a gauge. Whether the result is meaningful is on you. The type system is documentation, not enforcement.

    Fifth: Prometheus metrics are inaccurate because you can miss events between scrapes. There's a kernel of truth here — if something spikes and recovers within a single 15-second scrape interval you won't see it. But for the operational signals that matter — CPU utilization, memory pressure, request error rates, queue depth — 15-second resolution is more than adequate. The rate() and increase() functions account for the scrape interval in their calculations. If you need sub-second event detection, Prometheus isn't the right tool. But most infrastructure observability questions don't need that resolution.


    Prometheus's architecture is built around a handful of strong, deliberate opinions: pull over push, local storage first, simple text-based exposition format, labels as the primary organizational primitive. Once you internalize those opinions, everything else — the exporters, service discovery, Alertmanager, remote storage integrations — follows logically from them.

    The data model is the foundation that makes PromQL so expressive. Understanding label cardinality is what separates teams that run Prometheus smoothly from teams that fight it constantly. Start with the server, a couple of node_exporters, and a handful of scrape targets. Write your first rate() query. See how the label dimensions let you slice the same metric family across jobs, instances, and whatever dimensions matter to your infrastructure. That's when the model stops being abstract and starts being genuinely useful.

    Frequently Asked Questions

    What is the difference between a Counter and a Gauge in Prometheus?

    A Counter only ever increases (or resets to zero on restart) and is used for cumulative totals like total HTTP requests or bytes sent — you compute rates from counters using rate() or increase(). A Gauge represents a value that can go up or down at any time, like current memory usage or active connection count, and is read directly without needing a rate function.

    Why does Prometheus use a pull model instead of push?

    The pull model gives Prometheus control over ingestion rate, preventing misbehaving targets from overwhelming the server. It also provides automatic health checking — if a scrape fails, Prometheus sets the up metric to 0 for that target without needing a separate heartbeat mechanism. Push-based systems require targets to know the monitoring endpoint and can mask failures if a dying process simply stops pushing.

    What causes high cardinality in Prometheus and how do I prevent it?

    High cardinality occurs when a label has many possible values, multiplying the number of active time series. Common culprits are labels containing user IDs, request IDs, email addresses, raw URLs, or session tokens. Prevent it by restricting label values to bounded, enumerable sets — HTTP methods, status code classes, region names, service names. Audit new instrumentation for cardinality before it reaches production.

    When should I use a Histogram versus a Summary in Prometheus?

    Use a Histogram in almost all cases. Histograms let you aggregate percentile estimates across multiple instances using histogram_quantile() in PromQL, which is essential for service-level latency metrics. Summaries compute quantiles client-side on a sliding window and cannot be aggregated across instances — if you have multiple replicas, there's no correct way to combine their summary quantiles into a service-level p99.

    Can Prometheus handle long-term metric storage on its own?

    Prometheus's local TSDB is not designed for multi-year retention — default retention is 15 days and storage is not replicated. For long-term storage and high availability, use remote_write to ship samples to a dedicated backend like Thanos, Cortex, or VictoriaMetrics. These systems store data in object storage (S3, GCS) and provide global query views across multiple Prometheus instances.
