InfraRunBook

    Prometheus and Grafana Stack Setup Guide

    Monitoring
    Published: Apr 19, 2026
    Updated: Apr 19, 2026

    A practical, production-focused guide to deploying Prometheus and Grafana for infrastructure monitoring, covering installation, configuration, alerting rules, and the mistakes that trip up even experienced engineers.


    Prerequisites

    Before you touch a single config file, let's make sure the environment is ready. I've seen people skip this part and spend two hours debugging what turned out to be a firewall rule. Don't be that person.

    You'll need a Linux host — I'm using Ubuntu 22.04 LTS on sw-infrarunbook-01 (192.168.10.50) throughout this guide. The server should have at minimum 2 vCPUs and 4 GB of RAM for a comfortable single-node setup that scrapes a handful of targets. If you're pulling metrics from 50+ nodes, plan for more. You'll also need sudo access under the infrarunbook-admin account and the following ports accessible on the host:

    • 9090 — Prometheus web UI and API
    • 3000 — Grafana web UI
    • 9100 — Node Exporter metrics endpoint (on each monitored host)
    • 9093 — Alertmanager (optional but recommended)

    Make sure wget, tar, and systemd are available. On a fresh Ubuntu install they always are, but I've hit stripped-down container images where this wasn't the case. Also confirm your system clock is synchronized — Prometheus stores time-series data, and clock drift will make your graphs look like abstract art.

    sudo timedatectl set-ntp true
    timedatectl status

    If NTP is active and the clock looks sane, you're ready to proceed.


    Step-by-Step Setup

    Step 1: Install Prometheus

    Prometheus doesn't ship in Ubuntu's default apt repositories at a version I'd trust for production. Pull it directly from the official release page. At time of writing, 2.51.x is the stable branch.

    cd /tmp
    wget https://github.com/prometheus/prometheus/releases/download/v2.51.2/prometheus-2.51.2.linux-amd64.tar.gz
    tar xvf prometheus-2.51.2.linux-amd64.tar.gz
    cd prometheus-2.51.2.linux-amd64

    Now create the directory structure and system user. Running Prometheus as root is a bad idea — create a dedicated service account:

    sudo useradd --no-create-home --shell /bin/false prometheus
    sudo mkdir /etc/prometheus
    sudo mkdir /var/lib/prometheus
    sudo chown prometheus:prometheus /var/lib/prometheus

    Move the binaries and bundled consoles into place:

    sudo cp prometheus /usr/local/bin/
    sudo cp promtool /usr/local/bin/
    sudo chown prometheus:prometheus /usr/local/bin/prometheus
    sudo chown prometheus:prometheus /usr/local/bin/promtool
    sudo cp -r consoles /etc/prometheus
    sudo cp -r console_libraries /etc/prometheus
    sudo chown -R prometheus:prometheus /etc/prometheus

    Step 2: Write the Prometheus Configuration

    Drop a base configuration at /etc/prometheus/prometheus.yml. I'll build the full production example in the next section, but here's the minimal version to get Prometheus scraping itself and a node exporter:

    sudo nano /etc/prometheus/prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        monitor: 'infrarunbook-monitor'
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      - job_name: 'node'
        static_configs:
          - targets: ['192.168.10.50:9100']
    sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

    Step 3: Create the Systemd Unit

    A systemd service keeps Prometheus running across reboots and gives you clean start/stop/restart semantics. Create the unit file:

    sudo nano /etc/systemd/system/prometheus.service
    [Unit]
    Description=Prometheus Monitoring System
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/bin/prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/var/lib/prometheus \
      --storage.tsdb.retention.time=30d \
      --web.listen-address=0.0.0.0:9090 \
      --web.enable-lifecycle
    
    Restart=on-failure
    RestartSec=5
    
    [Install]
    WantedBy=multi-user.target

    The --web.enable-lifecycle flag is one I always include — it lets you send a POST to /-/reload to hot-reload the config without restarting the process. Extremely useful when you're adding scrape targets in the middle of the day and don't want a gap in your metric collection.

    sudo systemctl daemon-reload
    sudo systemctl enable prometheus
    sudo systemctl start prometheus
    sudo systemctl status prometheus

    Step 4: Install Node Exporter

    Node Exporter exposes host-level metrics — CPU, memory, disk, network, the works. Install it on every host you want to monitor. For this guide, we're putting it on sw-infrarunbook-01 itself.

    cd /tmp
    wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
    tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
    sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
    sudo useradd --no-create-home --shell /bin/false node_exporter
    sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
    sudo nano /etc/systemd/system/node_exporter.service
    [Unit]
    Description=Prometheus Node Exporter
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=node_exporter
    Group=node_exporter
    Type=simple
    ExecStart=/usr/local/bin/node_exporter \
      --web.listen-address=0.0.0.0:9100 \
      --collector.systemd \
      --collector.processes
    
    Restart=on-failure
    RestartSec=5
    
    [Install]
    WantedBy=multi-user.target
    sudo systemctl daemon-reload
    sudo systemctl enable node_exporter
    sudo systemctl start node_exporter

    Step 5: Install Grafana

    Grafana has an official apt repository, which makes upgrades trivial. Note that apt-key is deprecated on current Debian and Ubuntu releases, so put the signing key in a dedicated keyring and reference it with signed-by:

    sudo apt-get install -y apt-transport-https software-properties-common wget
    sudo mkdir -p /etc/apt/keyrings
    wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
    echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
    sudo apt-get update
    sudo apt-get install grafana -y
    sudo systemctl enable grafana-server
    sudo systemctl start grafana-server

    Grafana starts on port 3000 by default. The initial credentials are admin / admin — change this immediately on first login. I've audited environments where teams left default credentials in place on internal Grafana instances. Don't do it.


    Full Configuration Example

    Here's a production-quality prometheus.yml that covers self-scraping, node exporter across multiple hosts, blackbox probing, and wires up alerting and recording rules. This is closer to what I'd actually deploy than the trimmed-down quickstart version above.

    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 15s
      external_labels:
        datacenter: 'dc-east'
        monitor: 'sw-infrarunbook-01'
        environment: 'production'
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - '192.168.10.51:9093'
    
    # Load alerting rules and recording rules
    rule_files:
      - '/etc/prometheus/rules/node_alerts.yml'
      - '/etc/prometheus/rules/recording_rules.yml'
    
    scrape_configs:
      # Prometheus self-monitoring
      - job_name: 'prometheus'
        scrape_interval: 10s
        static_configs:
          - targets: ['localhost:9090']
            labels:
              host: 'sw-infrarunbook-01'
              role: 'monitoring'
    
      # Node Exporter -- infrastructure hosts
      - job_name: 'node'
        scrape_interval: 15s
        static_configs:
          - targets:
              - '192.168.10.50:9100'
              - '192.168.10.51:9100'
              - '192.168.10.52:9100'
            labels:
              datacenter: 'dc-east'
              env: 'production'
    
      # Blackbox Exporter -- HTTP endpoint probing
      - job_name: 'blackbox-http'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - 'https://portal.solvethenetwork.com'
              - 'https://api.solvethenetwork.com/health'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: '192.168.10.53:9115'
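    The blackbox relabeling above confuses almost everyone the first time. Here's a minimal Python sketch of the three steps Prometheus applies to each configured target (my own simplification of the relabel pipeline, not Prometheus source):

```python
# Simulate the three relabel_configs steps for the blackbox-http job.
# Each scrape target starts life as a label set whose __address__ is
# the configured target string.

def relabel(target_url, exporter_address):
    labels = {"__address__": target_url}
    # 1. Copy __address__ into the probe's ?target= query parameter.
    labels["__param_target"] = labels["__address__"]
    # 2. Copy the probed URL into the user-visible 'instance' label.
    labels["instance"] = labels["__param_target"]
    # 3. Point the actual scrape at the blackbox exporter itself.
    labels["__address__"] = exporter_address
    return labels

result = relabel("https://portal.solvethenetwork.com", "192.168.10.53:9115")
print(result["__address__"])   # the scrape goes to the exporter
print(result["instance"])      # but the metrics are labeled with the real URL
```

    The net effect: Prometheus scrapes the exporter at 192.168.10.53:9115, while the instance label still identifies which external URL was probed.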

    For alerting rules, create the directory and drop in a rules file. These three cover the most common failure scenarios I see in infrastructure environments:

    sudo mkdir -p /etc/prometheus/rules
    sudo nano /etc/prometheus/rules/node_alerts.yml
    groups:
      - name: node_alerts
        interval: 1m
        rules:
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
            for: 5m
            labels:
              severity: warning
              team: infra
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage exceeded 85% on {{ $labels.instance }} for 5 minutes"
    
          - alert: DiskSpaceLow
            expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes) * 100 < 15
            for: 10m
            labels:
              severity: critical
              team: infra
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Filesystem {{ $labels.mountpoint }} is below 15% free on {{ $labels.instance }}"
    
          - alert: NodeDown
            expr: up{job="node"} == 0
            for: 2m
            labels:
              severity: critical
              team: infra
            annotations:
              summary: "Node exporter unreachable: {{ $labels.instance }}"
              description: "{{ $labels.instance }} has been unreachable for more than 2 minutes"

    Recording rules pre-compute expensive queries and store them as new time series. This is something I see skipped constantly, and then teams wonder why their Grafana dashboards with 30-day range queries take 20 seconds to render. Define them early, before performance becomes a problem:

    sudo nano /etc/prometheus/rules/recording_rules.yml
    groups:
      - name: node_recording
        interval: 1m
        rules:
          - record: job:node_cpu_usage:avg
            expr: 100 - (avg by(job, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    
          - record: job:node_memory_used_percent:avg
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    
          - record: job:node_filesystem_used_percent:avg
            expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100
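    What a recording rule buys you, conceptually: the avg by(job, instance) aggregation runs once per evaluation interval and is stored as a new series, so dashboards read one precomputed series per host instead of re-aggregating every per-CPU series on each panel refresh. A rough Python analogue, with hypothetical sample data:

```python
from collections import defaultdict

# Per-CPU idle fractions, as rate(node_cpu_seconds_total{mode="idle"}[5m])
# would return them: one series per (job, instance, cpu). Values invented.
samples = [
    ({"job": "node", "instance": "192.168.10.50:9100", "cpu": "0"}, 0.90),
    ({"job": "node", "instance": "192.168.10.50:9100", "cpu": "1"}, 0.80),
    ({"job": "node", "instance": "192.168.10.51:9100", "cpu": "0"}, 0.60),
]

# avg by(job, instance): group on those labels, average, drop the cpu label.
groups = defaultdict(list)
for labels, value in samples:
    groups[(labels["job"], labels["instance"])].append(value)

# 100 - avg * 100, stored under the record name job:node_cpu_usage:avg
recorded = {key: 100 - (sum(v) / len(v)) * 100 for key, v in groups.items()}
for (job, instance), usage in sorted(recorded.items()):
    print(f"{instance}: {usage:.1f}% CPU")
```

    Grafana then queries job:node_cpu_usage:avg directly, which is cheap no matter how many CPUs each host has.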

    After modifying any configuration, always validate it before reloading. This is non-negotiable:

    promtool check config /etc/prometheus/prometheus.yml
    promtool check rules /etc/prometheus/rules/node_alerts.yml
    promtool check rules /etc/prometheus/rules/recording_rules.yml

    If promtool comes back clean, hot-reload the config without touching the process:

    curl -s -X POST http://localhost:9090/-/reload

    Now for the Grafana side. After logging in and changing the default password, add Prometheus as a data source. You can do this via the UI, but the provisioning system is the right approach for reproducible environments — your data source config lives in version control, not locked inside a SQLite database that you'll lose the next time you rebuild the host.

    sudo mkdir -p /etc/grafana/provisioning/datasources
    sudo nano /etc/grafana/provisioning/datasources/prometheus.yml
    apiVersion: 1
    
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://192.168.10.50:9090
        isDefault: true
        editable: false
        jsonData:
          httpMethod: POST
          timeInterval: '15s'
          queryTimeout: '60s'
    sudo systemctl restart grafana-server

    Verification Steps

    At this point the stack should be running. Let's verify each component methodically rather than just hoping the dashboards look right.

    First, confirm Prometheus is up and its targets are healthy:

    curl -s http://localhost:9090/-/healthy
    # Expected: Prometheus Server is Healthy.
    
    curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep 'health'
    # All targets should report "health": "up"

    Navigate to http://192.168.10.50:9090/targets in your browser. You want to see every target in the UP state. If anything shows as DOWN, click it — Prometheus shows the last error message, which is usually either a connection refused (exporter isn't running) or a context deadline exceeded (firewall blocking the scrape port).

    Test a basic PromQL query to confirm node metrics are flowing:

    curl -s 'http://localhost:9090/api/v1/query?query=node_memory_MemTotal_bytes' | python3 -m json.tool

    You should see a result with a numeric value for each scraped host. If the result set is empty, either node_exporter isn't running or Prometheus can't reach it on port 9100.
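    If you're scripting these checks, it helps to know the response shape. The query API returns JSON like the sample below; here's a minimal parser (the body is a representative hand-written example, not live output, and the value is invented):

```python
import json

# Representative /api/v1/query response for an instant vector,
# following the documented Prometheus HTTP API structure.
body = '''
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"__name__": "node_memory_MemTotal_bytes",
                   "instance": "192.168.10.50:9100", "job": "node"},
        "value": [1713500000.0, "4123456789"]
      }
    ]
  }
}
'''

resp = json.loads(body)
assert resp["status"] == "success"
for series in resp["data"]["result"]:
    instance = series["metric"]["instance"]
    timestamp, value = series["value"]   # note: the value arrives as a string
    print(f"{instance}: {float(value) / 1024**3:.2f} GiB total memory")
```

    The gotcha worth remembering: sample values are JSON strings, not numbers, so convert with float() before doing arithmetic on them.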

    For Grafana, hit the health endpoint:

    curl -s http://localhost:3000/api/health
    # Expected JSON with "database": "ok" and a version string

    Then verify the provisioned data source through the API:

    curl -s -u infrarunbook-admin:yourpassword \
      http://localhost:3000/api/datasources/name/Prometheus | python3 -m json.tool

    Import the Node Exporter Full dashboard (ID 1860 from grafana.com) as a quick sanity check. Go to Dashboards → Import → enter 1860 → select your Prometheus data source. If CPU, memory, and disk graphs render with actual data, your stack is functional end-to-end. That's the green light.


    Common Mistakes

    I've deployed this stack in a lot of environments, and the same problems come up time and again. Here's what to watch for.

    Retention set too short or left at the default. Prometheus defaults to 15 days of retention. That's usually not enough to do meaningful capacity planning or incident retrospectives. Set --storage.tsdb.retention.time=90d in your systemd unit, but make sure your storage can handle it. A single node with moderate cardinality at 15-second scrape intervals will consume roughly 1–2 GB per monitored host per month. Do the math before you run out of disk at 2 AM.
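    Doing that math is simple enough to script. A back-of-the-envelope sizing sketch, using the commonly cited figure of roughly 1 to 2 bytes per sample after TSDB compression; every input here is an assumption you should replace with measurements from your own environment (prometheus_tsdb_head_samples_appended_total gives you the real ingestion rate):

```python
# Rough TSDB disk sizing. All inputs are assumptions for illustration.
hosts = 10
series_per_host = 1500     # ballpark node_exporter series count per host
scrape_interval_s = 15
bytes_per_sample = 2       # pessimistic end of the compressed range
retention_days = 90

samples_per_day = hosts * series_per_host * (86_400 / scrape_interval_s)
total_bytes = samples_per_day * bytes_per_sample * retention_days
print(f"~{total_bytes / 1024**3:.1f} GiB for {retention_days}d retention")
```

    For these numbers that works out to roughly 15 GiB, comfortably within the article's 1–2 GB per host per month estimate. Leave generous headroom: WAL segments and compaction temporarily use extra space on top of this.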

    Scraping localhost instead of the actual host IP in multi-interface setups. If your monitoring host has multiple NICs — say, a management interface at 192.168.10.50 and a data-plane interface at 10.20.30.40 — binding node_exporter to 0.0.0.0 usually works, but I've seen firewall policies that block cross-interface traffic in ways that only manifest when scraping from a remote Prometheus. Always test with curl http://192.168.10.50:9100/metrics from a different host, not just from the server itself.

    Not setting external labels. Once you have more than one Prometheus instance — say, one per datacenter — you'll want to federate or use remote_write to aggregate them. Without external_labels set from the start, you can't tell which instance scraped which metrics in the aggregated view. Add them on day one. Retrofitting them later is annoying.

    Skipping promtool validation before reloads. A YAML syntax error in prometheus.yml will cause the reload to fail silently — the process keeps running with the old config. Running promtool check config before every reload takes three seconds and saves you from head-scratching sessions where a config change isn't taking effect. Make it a habit.

    Grafana data source using localhost instead of the actual Prometheus address. When Grafana and Prometheus live on the same host, localhost:9090 works fine. But if you containerize Grafana later, localhost inside the container resolves to the container itself, not the host. Using the actual IP (192.168.10.50:9090) in your provisioning file works in both cases and avoids a confusing migration later.

    High-cardinality labels. This one is subtle but critical. In my experience, the single most common cause of Prometheus exhausting memory is someone adding a high-cardinality label — a user ID, a session token, a UUID — to a metric. Each unique label combination creates a distinct time series. Add 100,000 unique user IDs to a request counter and you've just created 100,000 time series for that one metric. The Prometheus data model documentation is worth reading before you expose any custom instrumentation. Enforce label governance early.
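    The multiplication is worth seeing concretely. Each unique combination of label values is a separate time series in the TSDB, so cardinality compounds across labels (the label values below are invented for illustration):

```python
from itertools import product

# A request counter with sane, bounded labels: method x status x handler.
methods = ["GET", "POST"]
statuses = ["200", "404", "500"]
handlers = ["/api", "/login"]
sane = set(product(methods, statuses, handlers))
print(len(sane))      # 12 series: perfectly fine

# The same counter with a user_id label added (say, 100k active users):
user_ids = range(100_000)
exploded = len(sane) * len(user_ids)
print(exploded)       # 1200000 series from one metric
```

    Twelve series becomes 1.2 million from a single label. Keep label values bounded and known in advance; anything unbounded (IDs, tokens, raw URLs) belongs in logs, not metric labels.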

    Missing --collector.systemd on node_exporter. If you want to monitor systemd unit states from Grafana — whether nginx is up, whether a backup job completed — you need to explicitly enable the systemd collector. It's disabled by default because it requires D-Bus access. Add --collector.systemd to the node_exporter ExecStart line and restart the service.

    One thing that catches people off guard: Prometheus's scrape_timeout must always be less than or equal to scrape_interval. A 10-second timeout on a 15-second interval is fine. A 20-second timeout on a 15-second interval will cause a config validation error on reload. promtool catches this before it bites you in production — another reason to always validate first.
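    If you generate Prometheus configs from templates or a CI pipeline, this constraint is cheap to enforce before the file ever reaches promtool. A tiny guard in Python (my own restatement of the rule, not Prometheus code):

```python
def check_scrape_timing(scrape_interval_s, scrape_timeout_s):
    """Reject timings promtool would reject: timeout must not exceed interval."""
    if scrape_timeout_s > scrape_interval_s:
        raise ValueError(
            f"scrape_timeout ({scrape_timeout_s}s) exceeds "
            f"scrape_interval ({scrape_interval_s}s)"
        )

check_scrape_timing(15, 10)    # fine: matches the global config above
try:
    check_scrape_timing(15, 20)
except ValueError as e:
    print(f"rejected: {e}")
```

    Failing fast in your own tooling gives a clearer error at template time than a rejected reload at deploy time.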

    Once your stack is stable, the natural next steps are wiring up Alertmanager for notification routing, enabling remote_write to push long-term metrics to a Thanos or Mimir backend, and building dashboards tailored to your actual services rather than relying entirely on community imports. But with Prometheus and Grafana healthy and scraping real data from real hosts, you've got the foundation that everything else builds on.

    Frequently Asked Questions

    How do I add a new scrape target to Prometheus without restarting the service?

    Add the target to your prometheus.yml under the appropriate scrape_configs block, validate with promtool check config, then send a POST to http://localhost:9090/-/reload. This requires the --web.enable-lifecycle flag in your Prometheus systemd unit's ExecStart line.

    What is the default data retention period in Prometheus?

    Prometheus retains data for 15 days by default. For production environments you'll typically want 30–90 days. Set this with --storage.tsdb.retention.time=90d in your systemd unit and make sure your disk has room — estimate roughly 1–2 GB per monitored host per month at 15-second scrape intervals.

    Why are my Prometheus targets showing as DOWN?

    The most common causes are: the exporter process isn't running on the target host, a firewall rule is blocking the scrape port (9100 for node_exporter), or the target IP and port in prometheus.yml are incorrect. The /targets page in the Prometheus UI shows the last error message for each DOWN target — start there.

    Can Grafana and Prometheus run on the same server?

    Yes, and for small environments it's common. Use the actual host IP (not localhost) in your Grafana data source URL from the start so the configuration still works if you later move Grafana to a container or a separate host.

    What are recording rules and when should I use them?

    Recording rules pre-compute expensive or frequently-queried PromQL expressions and store the results as new time series. Use them when dashboard queries over long time ranges are slow. They're defined in a YAML rules file, referenced under rule_files in prometheus.yml, and reloaded on config reload without a restart.
