Prerequisites
Before you touch a single config file, let's make sure the environment is ready. I've seen people skip this part and spend two hours debugging what turned out to be a firewall rule. Don't be that person.
You'll need a Linux host — I'm using Ubuntu 22.04 LTS on sw-infrarunbook-01 (192.168.10.50) throughout this guide. The server should have at minimum 2 vCPUs and 4 GB of RAM for a comfortable single-node setup that scrapes a handful of targets. If you're pulling metrics from 50+ nodes, plan for more. You'll also need sudo access under the infrarunbook-admin account and the following ports accessible on the host:
- 9090 — Prometheus web UI and API
- 3000 — Grafana web UI
- 9100 — Node Exporter metrics endpoint (on each monitored host)
- 9093 — Alertmanager (optional but recommended)
Make sure wget, tar, and systemd are available. On a fresh Ubuntu install they always are, but I've hit stripped-down container images where this wasn't the case. Also confirm your system clock is synchronized — Prometheus stores time-series data, and clock drift will make your graphs look like abstract art.
sudo timedatectl set-ntp true
timedatectl status
If NTP is active and the clock looks sane, you're ready to proceed.
Step-by-Step Setup
Step 1: Install Prometheus
Prometheus doesn't ship in Ubuntu's default apt repositories at a version I'd trust for production. Pull it directly from the official release page. At time of writing, 2.51.x is the stable branch.
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.51.2/prometheus-2.51.2.linux-amd64.tar.gz
tar xvf prometheus-2.51.2.linux-amd64.tar.gz
cd prometheus-2.51.2.linux-amd64
Now create the directory structure and system user. Running Prometheus as root is a bad idea — create a dedicated service account:
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
Move the binaries and bundled consoles into place:
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
sudo cp -r consoles /etc/prometheus
sudo cp -r console_libraries /etc/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus
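Before writing any configuration, it's worth a quick sanity check that both binaries landed on the PATH. This loop is my own addition, not part of the official install steps; adjust it if you installed somewhere other than /usr/local/bin:

```shell
# Confirm the copied binaries are on PATH and report a version.
for bin in prometheus promtool; do
  if command -v "$bin" >/dev/null 2>&1; then
    echo "$bin: $("$bin" --version 2>&1 | head -n1)"
  else
    echo "$bin: not found on PATH" >&2
  fi
done
```

If either line reports "not found", fix that before continuing; nothing later in this guide will work without it.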
Step 2: Write the Prometheus Configuration
Drop a base configuration at /etc/prometheus/prometheus.yml. I'll build the full production example in the next section, but here's the minimal version to get Prometheus scraping itself and a node exporter:
sudo nano /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'infrarunbook-monitor'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['192.168.10.50:9100']
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
Step 3: Create the Systemd Unit
A systemd service keeps Prometheus running across reboots and gives you clean start/stop/restart semantics. Create the unit file:
sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=30d \
--web.listen-address=0.0.0.0:9090 \
--web.enable-lifecycle
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
The --web.enable-lifecycle flag is one I always include — it lets you send a POST to /-/reload to hot-reload the config without restarting the process. Extremely useful when you're adding scrape targets in the middle of the day and don't want to leave a gap in your metric collection.
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
Step 4: Install Node Exporter
Node Exporter exposes host-level metrics — CPU, memory, disk, network, the works. Install it on every host you want to monitor. For this guide, we're putting it on sw-infrarunbook-01 itself.
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--web.listen-address=0.0.0.0:9100 \
--collector.systemd \
--collector.processes
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
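Before moving on, confirm the exporter actually serves metrics. A small helper like this makes the check repeatable across hosts — the function name and the 2-second timeout are my own conventions, not anything from the Node Exporter project:

```shell
# Succeeds and prints a note if the target serves node_* metrics.
# Sketch only: check_exporter and the 2s timeout are my choices.
check_exporter() {
  if curl -sf --max-time 2 "http://$1:9100/metrics" | grep -q '^node_cpu_seconds_total'; then
    echo "exporter on $1 looks healthy"
  else
    echo "exporter on $1 is not serving node metrics" >&2
    return 1
  fi
}

# Example: check the host we just set up.
# check_exporter 192.168.10.50
```

Run it from a remote host as well as locally — that catches firewall problems the local check misses.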
Step 5: Install Grafana
Grafana has an official apt repository, which makes upgrades trivial and is the right way to install it on Debian-based systems. Note that apt-key is deprecated on Ubuntu 22.04, so register the signing key as a keyring instead:
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install grafana -y
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
Grafana starts on port 3000 by default. The initial credentials are admin / admin — change this immediately on first login. I've audited environments where teams left default credentials in place on internal Grafana instances. Don't do it.
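You can rotate the password in the UI, but for scripted setups grafana-cli can do it non-interactively. The wrapper function below is my own convenience sketch; only the grafana-cli subcommand itself comes from Grafana's tooling:

```shell
# Rotate the Grafana admin password via the bundled CLI.
# The function name is mine; the subcommand is real grafana-cli.
reset_grafana_admin() {
  sudo grafana-cli admin reset-admin-password "$1"
}

# Usage (the password here is a placeholder -- pick your own):
# reset_grafana_admin 'a-long-unique-password'
```

This is handy in provisioning scripts, where nobody is around to click through the first-login prompt.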
Full Configuration Example
Here's a production-quality prometheus.yml that covers self-scraping, node exporter across multiple hosts, and blackbox probing, and wires up alerting and recording rules. This is closer to what I'd actually deploy than the trimmed-down quickstart version above.
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    datacenter: 'dc-east'
    monitor: 'sw-infrarunbook-01'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '192.168.10.51:9093'

# Load alerting rules and recording rules
rule_files:
  - '/etc/prometheus/rules/node_alerts.yml'
  - '/etc/prometheus/rules/recording_rules.yml'

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          host: 'sw-infrarunbook-01'
          role: 'monitoring'

  # Node Exporter -- infrastructure hosts
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets:
          - '192.168.10.50:9100'
          - '192.168.10.51:9100'
          - '192.168.10.52:9100'
        labels:
          datacenter: 'dc-east'
          env: 'production'

  # Blackbox Exporter -- HTTP endpoint probing
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - 'https://portal.solvethenetwork.com'
          - 'https://api.solvethenetwork.com/health'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: '192.168.10.53:9115'
For alerting rules, create the directory and drop in a rules file. These three cover the most common failure scenarios I see in infrastructure environments:
sudo mkdir -p /etc/prometheus/rules
sudo nano /etc/prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage exceeded 85% on {{ $labels.instance }} for 5 minutes"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is below 15% free on {{ $labels.instance }}"

      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "Node exporter unreachable: {{ $labels.instance }}"
          description: "{{ $labels.instance }} has been unreachable for more than 2 minutes"
Recording rules pre-compute expensive queries and store them as new time series. This is something I see skipped constantly, and then teams wonder why their Grafana dashboards with 30-day range queries take 20 seconds to render. Define them early, before performance becomes a problem:
sudo nano /etc/prometheus/rules/recording_rules.yml
groups:
  - name: node_recording
    interval: 1m
    rules:
      - record: job:node_cpu_usage:avg
        expr: 100 - (avg by(job, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: job:node_memory_used_percent:avg
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
      - record: job:node_filesystem_used_percent:avg
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100
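The payoff shows up in dashboard queries. Instead of recomputing the CPU expression over the full time range on every panel refresh, point the panel at the recorded series:

```promql
# Before: recomputed on every dashboard refresh
100 - (avg by(job, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# After: a cheap lookup of the pre-computed series
job:node_cpu_usage:avg
```

Both expressions return the same values; the recorded one just shifts the computation to scrape time, once per evaluation interval.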
After modifying any configuration, always validate it before reloading. This is non-negotiable:
promtool check config /etc/prometheus/prometheus.yml
promtool check rules /etc/prometheus/rules/node_alerts.yml
promtool check rules /etc/prometheus/rules/recording_rules.yml
If promtool comes back clean, hot-reload the config without touching the process:
curl -s -X POST http://localhost:9090/-/reload
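To make the validate-then-reload habit hard to skip, the two steps can be fused into one helper. This is my own sketch, assuming the paths set up earlier in this guide:

```shell
# Reload only if the config validates; a bad config never
# reaches the lifecycle endpoint.
safe_reload() {
  if promtool check config /etc/prometheus/prometheus.yml; then
    curl -s -X POST http://localhost:9090/-/reload
  else
    echo "validation failed -- reload skipped" >&2
    return 1
  fi
}

# safe_reload
```

Drop it in your shell profile on the monitoring host and there's one less way to fumble a midday config change.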
Now for the Grafana side. After logging in and changing the default password, add Prometheus as a data source. You can do this via the UI, but the provisioning system is the right approach for reproducible environments — your data source config lives in version control, not locked inside a SQLite database that you'll lose the next time you rebuild the host.
sudo mkdir -p /etc/grafana/provisioning/datasources
sudo nano /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.10.50:9090
    isDefault: true
    editable: false
    jsonData:
      httpMethod: POST
      timeInterval: '15s'
      queryTimeout: '60s'
sudo systemctl restart grafana-server
Verification Steps
At this point the stack should be running. Let's verify each component methodically rather than just hoping the dashboards look right.
First, confirm Prometheus is up and its targets are healthy:
curl -s http://localhost:9090/-/healthy
# Expected: Prometheus Server is Healthy.
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep 'health'
# All targets should report "health": "up"
Navigate to http://192.168.10.50:9090/targets in your browser. You want to see every target in the UP state. If anything shows as DOWN, click it — Prometheus shows the last error message, which is usually either a connection refused (exporter isn't running) or a context deadline exceeded (firewall blocking the scrape port).
Test a basic PromQL query to confirm node metrics are flowing:
curl -s 'http://localhost:9090/api/v1/query?query=node_memory_MemTotal_bytes' | python3 -m json.tool
You should see a result with a numeric value for each scraped host. If the result set is empty, either node_exporter isn't running or Prometheus can't reach it on port 9100.
For Grafana, hit the health endpoint:
curl -s http://localhost:3000/api/health
# Expected JSON with "database": "ok" and a version string
Then verify the provisioned data source through the API:
curl -s -u admin:yourpassword \
  http://localhost:3000/api/datasources/name/Prometheus | python3 -m json.tool
Import the Node Exporter Full dashboard (ID 1860 from grafana.com) as a quick sanity check. Go to Dashboards → Import → enter 1860 → select your Prometheus data source. If CPU, memory, and disk graphs render with actual data, your stack is functional end-to-end. That's the green light.
Common Mistakes
I've deployed this stack in a lot of environments, and the same problems come up time and again. Here's what to watch for.
Retention set too short or left at the default. Prometheus defaults to 15 days of retention. That's usually not enough to do meaningful capacity planning or incident retrospectives. Set --storage.tsdb.retention.time=90d in your systemd unit, but make sure your storage can handle it. A single node with moderate cardinality at 15-second scrape intervals will consume roughly 1–2 GB per monitored host per month. Do the math before you run out of disk at 2 AM.
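To make "do the math" concrete, here's a back-of-the-envelope sizing script. Every input is an assumption you should replace with your own numbers; the 2-bytes-per-sample figure is a conservative estimate for Prometheus's compressed on-disk samples:

```shell
# Rough TSDB sizing: hosts x series x samples/day x bytes/sample x days.
# All inputs below are assumptions -- substitute your real numbers.
hosts=10
series_per_host=1500      # typical node_exporter cardinality (assumption)
scrape_interval=15        # seconds
bytes_per_sample=2        # conservative compressed size per sample
retention_days=90

samples_per_day=$(( 86400 / scrape_interval ))
total_bytes=$(( hosts * series_per_host * samples_per_day * bytes_per_sample * retention_days ))
echo "Estimated TSDB size: $(( total_bytes / 1024 / 1024 / 1024 )) GiB"
# Prints: Estimated TSDB size: 14 GiB
```

Compare the result against the free space on the volume backing /var/lib/prometheus, and leave generous headroom for compaction and WAL churn.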
Scraping localhost instead of the actual host IP in multi-interface setups. If your monitoring host has multiple NICs — say, a management interface at 192.168.10.50 and a data-plane interface at 10.20.30.40 — binding node_exporter to 0.0.0.0 usually works, but I've seen firewall policies that block cross-interface traffic in ways that only manifest when scraping from a remote Prometheus. Always test with curl http://192.168.10.50:9100/metrics from a different host, not just from the server itself.
Not setting external labels. Once you have more than one Prometheus instance — say, one per datacenter — you'll want to federate or use remote_write to aggregate them. Without external_labels set from the start, you can't tell which instance scraped which metrics in the aggregated view. Add them on day one. Retrofitting them later is annoying.
Skipping promtool validation before reloads. A YAML syntax error in prometheus.yml will cause the reload to fail silently — the process keeps running with the old config. Running promtool check config before every reload takes three seconds and saves you from head-scratching sessions where a config change isn't taking effect. Make it a habit.
Grafana data source using localhost instead of the actual Prometheus address. When Grafana and Prometheus live on the same host, localhost:9090 works fine. But if you containerize Grafana later, localhost inside the container resolves to the container itself, not the host. Using the actual IP (192.168.10.50:9090) in your provisioning file works in both cases and avoids a confusing migration later.
High-cardinality labels. This one is subtle but critical. In my experience, the single most common cause of Prometheus exhausting memory is someone adding a high-cardinality label — a user ID, a session token, a UUID — to a metric. Each unique label combination creates a distinct time series. Add 100,000 unique user IDs to a request counter and you've just created 100,000 time series for that one metric. The Prometheus data model documentation is worth reading before you expose any custom instrumentation. Enforce label governance early.
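A quick audit catches runaway cardinality before it becomes an outage. The first query below is a standard cardinality check you can run in the Prometheus expression browser; in the second, the http_requests_total metric and user_id label are hypothetical examples, stand-ins for whatever custom metric you suspect:

```promql
# Top 10 metric names by number of series:
topk(10, count by (__name__)({__name__=~".+"}))

# Series count contributed by one label on one metric
# (metric and label names here are hypothetical):
count(count by (user_id) (http_requests_total))
```

If the second number is in the tens of thousands for a single metric, you've found your memory problem.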
Missing --collector.systemd on node_exporter. If you want to monitor systemd unit states from Grafana — whether nginx is up, whether a backup job completed — you need to explicitly enable the systemd collector. It's disabled by default because it requires D-Bus access. Add --collector.systemd to the node_exporter ExecStart line and restart the service (the unit file in Step 4 already includes it).
One thing that catches people off guard: Prometheus's scrape_timeout must always be less than or equal to scrape_interval. A 10-second timeout on a 15-second interval is fine. A 20-second timeout on a 15-second interval will cause a config validation error on reload. promtool catches this before it bites you in production — another reason to always validate first.
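In config terms, the constraint looks like this (the invalid line is commented out so the fragment stays loadable):

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s    # valid: timeout <= interval
  # scrape_timeout: 20s  # invalid: promtool rejects timeout > interval
```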
Once your stack is stable, the natural next steps are wiring up Alertmanager for notification routing, enabling remote_write to push long-term metrics to a Thanos or Mimir backend, and building dashboards tailored to your actual services rather than relying entirely on community imports. But with Prometheus and Grafana healthy and scraping real data from real hosts, you've got the foundation that everything else builds on.
