InfraRunBook

    Prometheus and Grafana Stack Setup Guide

    Monitoring
    Published: Apr 19, 2026
    Updated: Apr 19, 2026

    A practical, production-focused guide to deploying Prometheus and Grafana for infrastructure monitoring, covering installation, configuration, alerting rules, and the mistakes that trip up even experienced engineers.


    Prerequisites

    Before you touch a single config file, let's make sure the environment is ready. I've seen people skip this part and spend two hours debugging what turned out to be a firewall rule. Don't be that person.

    You'll need a Linux host — I'm using Ubuntu 22.04 LTS on sw-infrarunbook-01 (192.168.10.50) throughout this guide. The server should have at minimum 2 vCPUs and 4 GB of RAM for a comfortable single-node setup that scrapes a handful of targets. If you're pulling metrics from 50+ nodes, plan for more. You'll also need sudo access under the infrarunbook-admin account and the following ports accessible on the host:

    • 9090 — Prometheus web UI and API
    • 3000 — Grafana web UI
    • 9100 — Node Exporter metrics endpoint (on each monitored host)
    • 9093 — Alertmanager (optional but recommended)

    Make sure wget, tar, and systemd are available. On a fresh Ubuntu install they always are, but I've hit stripped-down container images where this wasn't the case. Also confirm your system clock is synchronized — Prometheus stores time-series data, and clock drift will make your graphs look like abstract art.

    sudo timedatectl set-ntp true
    timedatectl status

    If NTP is active and the clock looks sane, you're ready to proceed.


    Step-by-Step Setup

    Step 1: Install Prometheus

    Prometheus doesn't ship in Ubuntu's default apt repositories at a version I'd trust for production. Pull it directly from the official release page. At time of writing, 2.51.x is the stable branch.

    cd /tmp
    wget https://github.com/prometheus/prometheus/releases/download/v2.51.2/prometheus-2.51.2.linux-amd64.tar.gz
    tar xvf prometheus-2.51.2.linux-amd64.tar.gz
    cd prometheus-2.51.2.linux-amd64

    Now create the directory structure and system user. Running Prometheus as root is a bad idea — create a dedicated service account:

    sudo useradd --no-create-home --shell /bin/false prometheus
    sudo mkdir /etc/prometheus
    sudo mkdir /var/lib/prometheus
    sudo chown prometheus:prometheus /var/lib/prometheus

    Move the binaries and bundled consoles into place:

    sudo cp prometheus /usr/local/bin/
    sudo cp promtool /usr/local/bin/
    sudo chown prometheus:prometheus /usr/local/bin/prometheus
    sudo chown prometheus:prometheus /usr/local/bin/promtool
    sudo cp -r consoles /etc/prometheus
    sudo cp -r console_libraries /etc/prometheus
    sudo chown -R prometheus:prometheus /etc/prometheus

    Step 2: Write the Prometheus Configuration

    Drop a base configuration at /etc/prometheus/prometheus.yml. I'll build the full production example in the next section, but here's the minimal version to get Prometheus scraping itself and a node exporter:

    sudo nano /etc/prometheus/prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        monitor: 'infrarunbook-monitor'
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      - job_name: 'node'
        static_configs:
          - targets: ['192.168.10.50:9100']
    sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

    Step 3: Create the Systemd Unit

    A systemd service keeps Prometheus running across reboots and gives you clean start/stop/restart semantics. Create the unit file:

    sudo nano /etc/systemd/system/prometheus.service
    [Unit]
    Description=Prometheus Monitoring System
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/bin/prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/var/lib/prometheus \
      --storage.tsdb.retention.time=30d \
      --web.listen-address=0.0.0.0:9090 \
      --web.enable-lifecycle
    
    Restart=on-failure
    RestartSec=5
    
    [Install]
    WantedBy=multi-user.target

    The --web.enable-lifecycle flag is one I always include — it lets you send a POST to /-/reload to hot-reload the config without restarting the process. Extremely useful when you're adding scrape targets in the middle of the day and don't want a gap in your metric collection.

    sudo systemctl daemon-reload
    sudo systemctl enable prometheus
    sudo systemctl start prometheus
    sudo systemctl status prometheus

    Step 4: Install Node Exporter

    Node Exporter exposes host-level metrics — CPU, memory, disk, network, the works. Install it on every host you want to monitor. For this guide, we're putting it on sw-infrarunbook-01 itself.

    cd /tmp
    wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
    tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
    sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
    sudo useradd --no-create-home --shell /bin/false node_exporter
    sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
    sudo nano /etc/systemd/system/node_exporter.service
    [Unit]
    Description=Prometheus Node Exporter
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=node_exporter
    Group=node_exporter
    Type=simple
    ExecStart=/usr/local/bin/node_exporter \
      --web.listen-address=0.0.0.0:9100 \
      --collector.systemd \
      --collector.processes
    
    Restart=on-failure
    RestartSec=5
    
    [Install]
    WantedBy=multi-user.target
    sudo systemctl daemon-reload
    sudo systemctl enable node_exporter
    sudo systemctl start node_exporter

    Step 5: Install Grafana

    Grafana has an official apt repository, which makes upgrades trivial. Note that apt-key is deprecated on current Debian and Ubuntu releases, so put the signing key in a dedicated keyring and reference it with signed-by:

    sudo apt-get install -y apt-transport-https software-properties-common wget
    sudo mkdir -p /etc/apt/keyrings
    wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
    echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
    sudo apt-get update
    sudo apt-get install grafana -y
    sudo systemctl enable grafana-server
    sudo systemctl start grafana-server

    Grafana starts on port 3000 by default. The initial credentials are admin / admin — change this immediately on first login. I've audited environments where teams left default credentials in place on internal Grafana instances. Don't do it.


    Full Configuration Example

    Here's a production-quality prometheus.yml that covers self-scraping, node exporter across multiple hosts, blackbox probing, and wires up alerting and recording rules. This is closer to what I'd actually deploy than the trimmed-down quickstart version above.

    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 15s
      external_labels:
        datacenter: 'dc-east'
        monitor: 'sw-infrarunbook-01'
        environment: 'production'
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - '192.168.10.51:9093'
    
    # Load alerting rules and recording rules
    rule_files:
      - '/etc/prometheus/rules/node_alerts.yml'
      - '/etc/prometheus/rules/recording_rules.yml'
    
    scrape_configs:
      # Prometheus self-monitoring
      - job_name: 'prometheus'
        scrape_interval: 10s
        static_configs:
          - targets: ['localhost:9090']
            labels:
              host: 'sw-infrarunbook-01'
              role: 'monitoring'
    
      # Node Exporter -- infrastructure hosts
      - job_name: 'node'
        scrape_interval: 15s
        static_configs:
          - targets:
              - '192.168.10.50:9100'
              - '192.168.10.51:9100'
              - '192.168.10.52:9100'
            labels:
              datacenter: 'dc-east'
              env: 'production'
    
      # Blackbox Exporter -- HTTP endpoint probing
      - job_name: 'blackbox-http'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - 'https://portal.solvethenetwork.com'
              - 'https://api.solvethenetwork.com/health'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: '192.168.10.53:9115'
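    The blackbox relabeling above confuses almost everyone the first time. Here's a minimal Python sketch of the three steps Prometheus applies to each configured target (my own simplification of the relabel pipeline, not Prometheus source):

```python
# Simulate the three relabel_configs steps for the blackbox-http job.
# Each scrape target starts life as a label set whose __address__ is
# the configured target string.

def relabel(target_url, exporter_address):
    labels = {"__address__": target_url}
    # 1. Copy __address__ into the probe's ?target= query parameter.
    labels["__param_target"] = labels["__address__"]
    # 2. Copy the probed URL into the user-visible 'instance' label.
    labels["instance"] = labels["__param_target"]
    # 3. Point the actual scrape at the blackbox exporter itself.
    labels["__address__"] = exporter_address
    return labels

result = relabel("https://portal.solvethenetwork.com", "192.168.10.53:9115")
print(result["__address__"])   # the scrape goes to the exporter
print(result["instance"])      # but the metrics are labeled with the real URL
```

    The net effect: Prometheus scrapes the exporter at 192.168.10.53:9115, while the instance label still identifies which external URL was probed.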

    For alerting rules, create the directory and drop in a rules file. These three cover the most common failure scenarios I see in infrastructure environments:

    sudo mkdir -p /etc/prometheus/rules
    sudo nano /etc/prometheus/rules/node_alerts.yml
    groups:
      - name: node_alerts
        interval: 1m
        rules:
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
            for: 5m
            labels:
              severity: warning
              team: infra
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage exceeded 85% on {{ $labels.instance }} for 5 minutes"
    
          - alert: DiskSpaceLow
            expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes) * 100 < 15
            for: 10m
            labels:
              severity: critical
              team: infra
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Filesystem {{ $labels.mountpoint }} is below 15% free on {{ $labels.instance }}"
    
          - alert: NodeDown
            expr: up{job="node"} == 0
            for: 2m
            labels:
              severity: critical
              team: infra
            annotations:
              summary: "Node exporter unreachable: {{ $labels.instance }}"
              description: "{{ $labels.instance }} has been unreachable for more than 2 minutes"

    Recording rules pre-compute expensive queries and store them as new time series. This is something I see skipped constantly, and then teams wonder why their Grafana dashboards with 30-day range queries take 20 seconds to render. Define them early, before performance becomes a problem:

    sudo nano /etc/prometheus/rules/recording_rules.yml
    groups:
      - name: node_recording
        interval: 1m
        rules:
          - record: job:node_cpu_usage:avg
            expr: 100 - (avg by(job, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    
          - record: job:node_memory_used_percent:avg
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    
          - record: job:node_filesystem_used_percent:avg
            expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100
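    What a recording rule buys you, conceptually: the avg by(job, instance) aggregation runs once per evaluation interval and is stored as a new series, so dashboards read one precomputed series per host instead of re-aggregating every per-CPU series on each panel refresh. A rough Python analogue, with hypothetical sample data:

```python
from collections import defaultdict

# Per-CPU idle fractions, as rate(node_cpu_seconds_total{mode="idle"}[5m])
# would return them: one series per (job, instance, cpu). Values invented.
samples = [
    ({"job": "node", "instance": "192.168.10.50:9100", "cpu": "0"}, 0.90),
    ({"job": "node", "instance": "192.168.10.50:9100", "cpu": "1"}, 0.80),
    ({"job": "node", "instance": "192.168.10.51:9100", "cpu": "0"}, 0.60),
]

# avg by(job, instance): group on those labels, average, drop the cpu label.
groups = defaultdict(list)
for labels, value in samples:
    groups[(labels["job"], labels["instance"])].append(value)

# 100 - avg * 100, stored under the record name job:node_cpu_usage:avg
recorded = {key: 100 - (sum(v) / len(v)) * 100 for key, v in groups.items()}
for (job, instance), usage in sorted(recorded.items()):
    print(f"{instance}: {usage:.1f}% CPU")
```

    Grafana then queries job:node_cpu_usage:avg directly, which is cheap no matter how many CPUs each host has.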

    After modifying any configuration, always validate it before reloading. This is non-negotiable:

    promtool check config /etc/prometheus/prometheus.yml
    promtool check rules /etc/prometheus/rules/node_alerts.yml
    promtool check rules /etc/prometheus/rules/recording_rules.yml

    If promtool comes back clean, hot-reload the config without touching the process:

    curl -s -X POST http://localhost:9090/-/reload

    Now for the Grafana side. After logging in and changing the default password, add Prometheus as a data source. You can do this via the UI, but the provisioning system is the right approach for reproducible environments — your data source config lives in version control, not locked inside a SQLite database that you'll lose the next time you rebuild the host.

    sudo mkdir -p /etc/grafana/provisioning/datasources
    sudo nano /etc/grafana/provisioning/datasources/prometheus.yml
    apiVersion: 1
    
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://192.168.10.50:9090
        isDefault: true
        editable: false
        jsonData:
          httpMethod: POST
          timeInterval: '15s'
          queryTimeout: '60s'
    sudo systemctl restart grafana-server

    Verification Steps

    At this point the stack should be running. Let's verify each component methodically rather than just hoping the dashboards look right.

    First, confirm Prometheus is up and its targets are healthy:

    curl -s http://localhost:9090/-/healthy
    # Expected: Prometheus Server is Healthy.
    
    curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep 'health'
    # All targets should report "health": "up"

    Navigate to http://192.168.10.50:9090/targets in your browser. You want to see every target in the UP state. If anything shows as DOWN, click it — Prometheus shows the last error message, which is usually either a connection refused (exporter isn't running) or a context deadline exceeded (firewall blocking the scrape port).

    Test a basic PromQL query to confirm node metrics are flowing:

    curl -s 'http://localhost:9090/api/v1/query?query=node_memory_MemTotal_bytes' | python3 -m json.tool

    You should see a result with a numeric value for each scraped host. If the result set is empty, either node_exporter isn't running or Prometheus can't reach it on port 9100.
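    If you're scripting these checks, it helps to know the response shape. The query API returns JSON like the sample below; here's a minimal parser (the body is a representative hand-written example, not live output, and the value is invented):

```python
import json

# Representative /api/v1/query response for an instant vector,
# following the documented Prometheus HTTP API structure.
body = '''
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"__name__": "node_memory_MemTotal_bytes",
                   "instance": "192.168.10.50:9100", "job": "node"},
        "value": [1713500000.0, "4123456789"]
      }
    ]
  }
}
'''

resp = json.loads(body)
assert resp["status"] == "success"
for series in resp["data"]["result"]:
    instance = series["metric"]["instance"]
    timestamp, value = series["value"]   # note: the value arrives as a string
    print(f"{instance}: {float(value) / 1024**3:.2f} GiB total memory")
```

    The gotcha worth remembering: sample values are JSON strings, not numbers, so convert with float() before doing arithmetic on them.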

    For Grafana, hit the health endpoint:

    curl -s http://localhost:3000/api/health
    # Expected JSON with "database": "ok" and a version string

    Then verify the provisioned data source through the API:

    curl -s -u infrarunbook-admin:yourpassword \
      http://localhost:3000/api/datasources/name/Prometheus | python3 -m json.tool

    Import the Node Exporter Full dashboard (ID 1860 from grafana.com) as a quick sanity check. Go to Dashboards → Import → enter 1860 → select your Prometheus data source. If CPU, memory, and disk graphs render with actual data, your stack is functional end-to-end. That's the green light.


    Common Mistakes

    I've deployed this stack in a lot of environments, and the same problems come up time and again. Here's what to watch for.

    Retention set too short or left at the default. Prometheus defaults to 15 days of retention. That's usually not enough to do meaningful capacity planning or incident retrospectives. Set --storage.tsdb.retention.time=90d in your systemd unit, but make sure your storage can handle it. A single node with moderate cardinality at 15-second scrape intervals will consume roughly 1–2 GB per monitored host per month. Do the math before you run out of disk at 2 AM.
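    Doing that math is simple enough to script. A back-of-the-envelope sizing sketch, using the commonly cited figure of roughly 1 to 2 bytes per sample after TSDB compression; every input here is an assumption you should replace with measurements from your own environment (prometheus_tsdb_head_samples_appended_total gives you the real ingestion rate):

```python
# Rough TSDB disk sizing. All inputs are assumptions for illustration.
hosts = 10
series_per_host = 1500     # ballpark node_exporter series count per host
scrape_interval_s = 15
bytes_per_sample = 2       # pessimistic end of the compressed range
retention_days = 90

samples_per_day = hosts * series_per_host * (86_400 / scrape_interval_s)
total_bytes = samples_per_day * bytes_per_sample * retention_days
print(f"~{total_bytes / 1024**3:.1f} GiB for {retention_days}d retention")
```

    For these numbers that works out to roughly 15 GiB, comfortably within the article's 1–2 GB per host per month estimate. Leave generous headroom: WAL segments and compaction temporarily use extra space on top of this.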

    Scraping localhost instead of the actual host IP in multi-interface setups. If your monitoring host has multiple NICs — say, a management interface at 192.168.10.50 and a data-plane interface at 10.20.30.40 — binding node_exporter to 0.0.0.0 usually works, but I've seen firewall policies that block cross-interface traffic in ways that only manifest when scraping from a remote Prometheus. Always test with curl http://192.168.10.50:9100/metrics from a different host, not just from the server itself.

    Not setting external labels. Once you have more than one Prometheus instance — say, one per datacenter — you'll want to federate or use remote_write to aggregate them. Without external_labels set from the start, you can't tell which instance scraped which metrics in the aggregated view. Add them on day one. Retrofitting them later is annoying.

    Skipping promtool validation before reloads. A YAML syntax error in prometheus.yml will cause the reload to fail silently — the process keeps running with the old config. Running promtool check config before every reload takes three seconds and saves you from head-scratching sessions where a config change isn't taking effect. Make it a habit.

    Grafana data source using localhost instead of the actual Prometheus address. When Grafana and Prometheus live on the same host, localhost:9090 works fine. But if you containerize Grafana later, localhost inside the container resolves to the container itself, not the host. Using the actual IP (192.168.10.50:9090) in your provisioning file works in both cases and avoids a confusing migration later.

    High-cardinality labels. This one is subtle but critical. In my experience, the single most common cause of Prometheus exhausting memory is someone adding a high-cardinality label — a user ID, a session token, a UUID — to a metric. Each unique label combination creates a distinct time series. Add 100,000 unique user IDs to a request counter and you've just created 100,000 time series for that one metric. The Prometheus data model documentation is worth reading before you expose any custom instrumentation. Enforce label governance early.
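    The multiplication is worth seeing concretely. Each unique combination of label values is a separate time series in the TSDB, so cardinality compounds across labels (the label values below are invented for illustration):

```python
from itertools import product

# A request counter with sane, bounded labels: method x status x handler.
methods = ["GET", "POST"]
statuses = ["200", "404", "500"]
handlers = ["/api", "/login"]
sane = set(product(methods, statuses, handlers))
print(len(sane))      # 12 series: perfectly fine

# The same counter with a user_id label added (say, 100k active users):
user_ids = range(100_000)
exploded = len(sane) * len(user_ids)
print(exploded)       # 1200000 series from one metric
```

    Twelve series becomes 1.2 million from a single label. Keep label values bounded and known in advance; anything unbounded (IDs, tokens, raw URLs) belongs in logs, not metric labels.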

    Missing --collector.systemd on node_exporter. If you want to monitor systemd unit states from Grafana — whether nginx is up, whether a backup job completed — you need to explicitly enable the systemd collector. It's disabled by default because it requires D-Bus access. Add --collector.systemd to the node_exporter ExecStart line and restart the service.

    One thing that catches people off guard: Prometheus's scrape_timeout must always be less than or equal to scrape_interval. A 10-second timeout on a 15-second interval is fine. A 20-second timeout on a 15-second interval will cause a config validation error on reload. promtool catches this before it bites you in production — another reason to always validate first.
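    If you generate Prometheus configs from templates or a CI pipeline, this constraint is cheap to enforce before the file ever reaches promtool. A tiny guard in Python (my own restatement of the rule, not Prometheus code):

```python
def check_scrape_timing(scrape_interval_s, scrape_timeout_s):
    """Reject timings promtool would reject: timeout must not exceed interval."""
    if scrape_timeout_s > scrape_interval_s:
        raise ValueError(
            f"scrape_timeout ({scrape_timeout_s}s) exceeds "
            f"scrape_interval ({scrape_interval_s}s)"
        )

check_scrape_timing(15, 10)    # fine: matches the global config above
try:
    check_scrape_timing(15, 20)
except ValueError as e:
    print(f"rejected: {e}")
```

    Failing fast in your own tooling gives a clearer error at template time than a rejected reload at deploy time.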

    Once your stack is stable, the natural next steps are wiring up Alertmanager for notification routing, enabling remote_write to push long-term metrics to a Thanos or Mimir backend, and building dashboards tailored to your actual services rather than relying entirely on community imports. But with Prometheus and Grafana healthy and scraping real data from real hosts, you've got the foundation that everything else builds on.

    Frequently Asked Questions

    How do I add a new scrape target to Prometheus without restarting the service?

    Add the target to your prometheus.yml under the appropriate scrape_configs block, validate with promtool check config, then send a POST to http://localhost:9090/-/reload. This requires the --web.enable-lifecycle flag in your Prometheus systemd unit's ExecStart line.

    What is the default data retention period in Prometheus?

    Prometheus retains data for 15 days by default. For production environments you'll typically want 30–90 days. Set this with --storage.tsdb.retention.time=90d in your systemd unit and make sure your disk has room — estimate roughly 1–2 GB per monitored host per month at 15-second scrape intervals.

    Why are my Prometheus targets showing as DOWN?

    The most common causes are: the exporter process isn't running on the target host, a firewall rule is blocking the scrape port (9100 for node_exporter), or the target IP and port in prometheus.yml are incorrect. The /targets page in the Prometheus UI shows the last error message for each DOWN target — start there.

    Can Grafana and Prometheus run on the same server?

    Yes, and for small environments it's common. Use the actual host IP (not localhost) in your Grafana data source URL from the start so the configuration still works if you later move Grafana to a container or a separate host.

    What are recording rules and when should I use them?

    Recording rules pre-compute expensive or frequently-queried PromQL expressions and store the results as new time series. Use them when dashboard queries over long time ranges are slow. They're defined in a YAML rules file, referenced under rule_files in prometheus.yml, and reloaded on config reload without a restart.
