Symptoms
You open Grafana, pull up the node dashboard, and something's wrong. Half the panels are blank. Or you're checking Prometheus directly and notice a host that should be reporting is simply gone from `up{job="node"}`. The alert fires at 2 AM — NodeDown — and now you're digging through logs trying to figure out why metrics stopped flowing.
Before diving in, here's what "node exporter metrics missing" typically looks like in practice:
- The Prometheus Targets page shows a host in DOWN state with a scrape error message
- `up{instance="192.168.10.45:9100"}` returns 0 or produces no data at all
- Grafana panels show "No data" for one or more specific nodes while others report fine
- Individual metrics like `node_cpu_seconds_total` or `node_memory_MemAvailable_bytes` are absent from query results
- Prometheus scrape errors read something like `context deadline exceeded`, `connection refused`, or `no such host`
This guide walks through every root cause I've run into in production — from the obvious "the exporter died" to the sneaky label mismatch that once took me the better part of an hour to track down. Let's get into it.
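Before working through the root causes one by one, two quick queries in the Prometheus expression browser narrow things down considerably. The `job="node"` label here matches the scrape config used throughout this guide; substitute your own job name if it differs:

```promql
# Which hosts in the node job failed their most recent scrape?
up{job="node"} == 0

# Returns 1 only when no series for the metric exists anywhere in the job —
# useful for telling "one host is down" apart from "the metric is gone entirely"
absent(node_cpu_seconds_total{job="node"})
```

If the first query returns results, start with Root Causes 1 through 3 (reachability). If the target is UP but a metric is absent, skip ahead to Root Causes 4 and 5.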
Root Cause 1: Node Exporter Is Not Running
This is the most common cause, and it's also the easiest to overlook because it feels too obvious to check first. The exporter process crashed, was never started after a reboot, or was stopped during maintenance and nobody restarted it. In my experience, the reboot scenario is especially common after a kernel update — the service is marked enabled but something in the startup sequence fails and nobody notices for hours.
How to Identify It
SSH into the target host and check the service state directly:
ssh infrarunbook-admin@192.168.10.45
systemctl status node_exporter
If the exporter has crashed, you'll see output along these lines:
● node_exporter.service - Node Exporter
Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sat 2026-04-19 03:12:44 UTC; 2h 14min ago
Process: 3842 ExecStart=/usr/local/bin/node_exporter (code=exited, status=1/FAILURE)
Main PID: 3842 (code=exited, status=1/FAILURE)
Confirm nothing is listening on port 9100:
ss -tlnp | grep 9100
No output means the port is dead. Then pull the journal to understand why it failed — don't skip this step, because the fix depends on the failure reason:
journalctl -u node_exporter -n 50 --no-pager
How to Fix It
Start the service and make sure it's enabled for future reboots:
systemctl start node_exporter
systemctl enable node_exporter
systemctl status node_exporter
After it starts, confirm the metrics endpoint is responding:
curl -s http://192.168.10.45:9100/metrics | head -20
A healthy exporter returns output starting with something like this:
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.9351e-05
go_gc_duration_seconds{quantile="0.25"} 6.7598e-05
go_gc_duration_seconds{quantile="0.5"} 9.1254e-05
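To script that verification rather than eyeball it, here's a minimal sketch. The sample text is a hypothetical stand-in for live `curl -s http://192.168.10.45:9100/metrics` output:

```shell
# Sample scrape output standing in for the live curl response
metrics='# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_memory_MemAvailable_bytes 8.1e+09'

# Check that a metric you depend on is actually present in the scrape
if printf '%s\n' "$metrics" | grep -q '^node_cpu_seconds_total'; then
  echo "node_cpu_seconds_total present"
else
  echo "node_cpu_seconds_total missing"
fi
```

The same grep against the real endpoint makes a serviceable post-restart smoke test.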
If the service keeps crashing, the journal will tell you why. Usual suspects: a missing or moved binary, a permissions error on a path a collector tries to read, or a startup flag that was valid in an older version of node_exporter but got removed in a recent upgrade.
Root Cause 2: Firewall Blocking Port 9100
The exporter is running. The metrics endpoint responds locally. But Prometheus still can't collect anything. Firewall rules are the silent killer here. A teammate tightened iptables during a security hardening sprint and dropped the rule for 9100. Or a cloud security group got updated and nobody noticed node exporter traffic wasn't exempted. I've also seen this happen when a host gets migrated to a new subnet and the firewall policy for that VLAN didn't include a carve-out for monitoring traffic.
How to Identify It
From the Prometheus server, try to reach the exporter directly and watch for a timeout rather than an immediate connection refused:
curl -v --connect-timeout 5 http://192.168.10.45:9100/metrics
A firewall-blocked connection looks like this — the packet is dropped, not rejected:
* Trying 192.168.10.45:9100...
* Connection timeout after 5001ms
* Closing connection 0
curl: (28) Connection timed out after 5001 milliseconds
Compare that to connection refused, which means the port is reachable but nothing is listening. A timeout means the packet never gets a response. On the target host, check the active rules:
iptables -L INPUT -n -v | grep 9100
Or if the host uses nftables:
nft list ruleset | grep 9100
For hosts running firewalld:
firewall-cmd --list-all
If port 9100 isn't listed under `ports:` or in an explicit allow rule, that's your problem.
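The refused-versus-timeout distinction can be scripted using curl's exit codes: 7 means the connection was actively refused (port reachable, nothing listening), 28 means it timed out (packets likely dropped by a firewall). A sketch, demonstrated here against a localhost port that has no listener; in practice you'd pass the exporter's address:

```shell
# Classify a scrape-target failure by curl exit code
probe() {
  curl -s --connect-timeout 5 -o /dev/null "http://$1/metrics"
  case $? in
    0)  echo "$1: reachable" ;;
    7)  echo "$1: connection refused (port open in firewall, no listener)" ;;
    28) echo "$1: timed out (packets likely dropped by a firewall)" ;;
    *)  echo "$1: other failure" ;;
  esac
}

# Demo target: a localhost port with nothing listening on it
probe 127.0.0.1:9
```

Run `probe 192.168.10.45:9100` from the Prometheus host to classify the failure in one step.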
How to Fix It
For iptables, add a rule scoped to the Prometheus server IP — don't open it to the world:
iptables -A INPUT -s 192.168.10.10 -p tcp --dport 9100 -j ACCEPT
iptables-save > /etc/iptables/rules.v4
For firewalld:
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.10.10" port port="9100" protocol="tcp" accept'
firewall-cmd --reload
After applying the rule, rerun the curl from the Prometheus host to confirm connectivity is restored before calling it done.
Root Cause 3: Prometheus Scrape Config Is Wrong
The exporter is running, the port is open, and Prometheus still isn't collecting. Now you need to look at `prometheus.yml` itself. A typo in a hostname, an incorrect port number, a missing job entry, a stale IP — any of these will silently leave a host unmonitored. This is more common than it sounds, especially in environments where the Prometheus config is edited by hand rather than generated from a service catalog.
How to Identify It
Go to the Prometheus UI at `http://192.168.10.10:9090/targets`. Look for the host in question. If it doesn't appear at all, it's not in any scrape config. If it appears but shows DOWN, click the error — it'll say something like:
Get "http://192.168.10.45:9100/metrics": dial tcp 192.168.10.45:9100: connect: connection refused
Or for a DNS resolution failure:
Get "http://sw-infrarunbook-01.solvethenetwork.com:9100/metrics": dial tcp: lookup sw-infrarunbook-01.solvethenetwork.com: no such host
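For the `no such host` case, confirm the name resolves from the Prometheus host before touching the config. A sketch, demonstrated with localhost so it runs anywhere; substitute the actual target hostname:

```shell
# Verify the scrape target's hostname resolves from this host.
# Demo uses localhost; in practice use the target name, e.g.
# host=sw-infrarunbook-01.solvethenetwork.com
host=localhost

if getent hosts "$host" > /dev/null; then
  echo "$host resolves"
else
  echo "$host does not resolve -- matches the 'no such host' scrape error"
fi
```

If the name doesn't resolve, the fix belongs in DNS (or the target should be listed by IP), not in Prometheus.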
Check your scrape config directly:
cat /etc/prometheus/prometheus.yml
A correct static node exporter job looks like this:
scrape_configs:
- job_name: 'node'
static_configs:
- targets:
- '192.168.10.45:9100'
- '192.168.10.46:9100'
- '192.168.10.47:9100'
Common mistakes: the port is listed as `9190` instead of `9100`, the IP has a transposed digit, the host was added to the wrong job block, or the config was edited but Prometheus was never told to reload it. To force Prometheus to re-read its config (this endpoint only works if Prometheus was started with `--web.enable-lifecycle`):
curl -X POST http://192.168.10.10:9090/-/reload
Or check the last reload timestamp in the Prometheus UI under Status > Runtime & Build Information to confirm which config Prometheus is actually running.
How to Fix It
Correct the target entry, validate the file, then reload — always in that order:
promtool check config /etc/prometheus/prometheus.yml
systemctl reload prometheus
Running `promtool check config` before reloading is non-negotiable. It catches syntax errors before they cause Prometheus to reject the entire config and fall back to whatever it had before — or worse, fail to start after a restart.
Root Cause 4: Collector Disabled
Node exporter ships with a large set of collectors — cpu, memory, disk, network, filesystem, and many more. By default, most are enabled, but some aren't. And in some environments I've worked in, teams explicitly disable collectors to reduce metric cardinality or shorten scrape time. If you're looking for a specific metric and it simply isn't there, the collector that exposes it might be disabled — either intentionally or by accident when someone copied a startup config from a different environment.
How to Identify It
Check how node_exporter is being launched and what flags are passed to it:
ps aux | grep node_exporter
Or inspect the systemd unit file directly:
cat /etc/systemd/system/node_exporter.service
You might find something like this:
[Service]
ExecStart=/usr/local/bin/node_exporter \
--no-collector.wifi \
--no-collector.nfs \
--no-collector.xfs \
--no-collector.cpu
That last flag — `--no-collector.cpu` — explains why `node_cpu_seconds_total` is completely absent. You can also hit the metrics endpoint directly and search for the specific metric:
curl -s http://192.168.10.45:9100/metrics | grep node_cpu_seconds_total
No output confirms the collector is off. To see all currently enabled collectors and their scrape status, look at the `node_scrape_collector_success` metric:
curl -s http://192.168.10.45:9100/metrics | grep node_scrape_collector_success
node_scrape_collector_success{collector="arp"} 1
node_scrape_collector_success{collector="bcache"} 1
node_scrape_collector_success{collector="bonding"} 1
node_scrape_collector_success{collector="conntrack"} 1
node_scrape_collector_success{collector="diskstats"} 1
node_scrape_collector_success{collector="filesystem"} 1
node_scrape_collector_success{collector="meminfo"} 1
If `cpu` doesn't appear in that list, it was disabled at launch.
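The unit-file inspection can be scripted too. A minimal sketch: the here-doc sample stands in for the real `/etc/systemd/system/node_exporter.service`, and the extraction assumes the single-ExecStart layout shown above:

```shell
# Sample unit file standing in for /etc/systemd/system/node_exporter.service
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Service]
ExecStart=/usr/local/bin/node_exporter \
  --no-collector.wifi \
  --no-collector.cpu
EOF

# Pull the binary path from the ExecStart line
bin=$(awk -F= '/^ExecStart=/{print $2}' "$unit" | awk '{print $1}')
# List every collector explicitly disabled by flag
disabled=$(grep -o -- '--no-collector\.[a-z]*' "$unit")

echo "binary: $bin"
echo "disabled collectors:"
echo "$disabled"
rm -f "$unit"
```

Anything in the disabled list explains the corresponding missing metric family.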
How to Fix It
Edit the service file to remove the `--no-collector.cpu` flag (or whichever collector you need to restore), then reload systemd and restart the service:
systemctl daemon-reload
systemctl restart node_exporter
If you need to enable a collector that's off by default, use the `--collector.<name>` flag. For example, to enable the perf collector:
ExecStart=/usr/local/bin/node_exporter --collector.perf
Be deliberate about which collectors you run. Enabling everything in a high-cardinality environment can meaningfully increase scrape duration and Prometheus memory usage. If you're disabling collectors intentionally, document it with a comment in the unit file — undocumented omissions are indistinguishable from bugs.
Root Cause 5: Label Mismatch
This one is subtle and easy to miss because the data is actually being scraped — it's in Prometheus — but your query returns nothing, or Grafana shows blank panels. The reason is a label mismatch: your query is filtering on a label value that doesn't match what's actually attached to the metric in storage.
In my experience, this surfaces most often after someone changes how targets are discovered — switching from static configs to file-based or Consul service discovery, renaming a job, or modifying relabeling rules. The metric is right there in Prometheus. It's just labeled differently than what your dashboard expects.
How to Identify It
Run a broad query in Prometheus without any label filters to see what labels are actually attached to the metric you're looking for:
node_cpu_seconds_total
Look at what comes back. You might see:
node_cpu_seconds_total{cpu="0",instance="192.168.10.45:9100",job="linux_nodes",mode="idle"} 12345.67
Now check your Grafana panel query — it might be filtering on `job="node"` when the actual job label is `job="linux_nodes"`. That single mismatch means zero results, even though all the data is right there. Also check the relabeling rules in your prometheus.yml, which can silently transform or drop labels during scraping:
scrape_configs:
- job_name: 'linux_nodes'
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '(.+):9100'
replacement: '$1'
That rule strips the port from the instance label. If your dashboard is querying `instance="192.168.10.45:9100"`, it won't match — the stored label is `instance="192.168.10.45"`. Use the Prometheus UI's Labels explorer to inspect exactly what's stored for a given metric and target.
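A couple of grouping queries make the stored label values explicit, so you're reading them rather than guessing. The metric names here match the examples above:

```promql
# Which job label values actually carry this metric?
count by (job) (node_cpu_seconds_total)

# What format does the stored instance label use (with or without port)?
count by (instance) (up)
```

Compare the returned label values character for character against what the dashboard query filters on.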
How to Fix It
You have two options: update your queries and dashboards to match the actual label values, or adjust your relabeling rules to produce the labels your queries expect. In most cases, updating the queries is faster and less risky. In Grafana, go into the panel editor, locate the label filter, and correct it to match what Prometheus actually stores.
For Prometheus alerting rules that rely on specific labels, update them in your rules files and reload:
promtool check rules /etc/prometheus/rules/*.yml
curl -X POST http://192.168.10.10:9090/-/reload
If the mismatch is widespread — say, you renamed a job and now thirty dashboards are broken — you can temporarily add a metric_relabel_config to add a backward-compatible label alias while you migrate. Don't leave that workaround in place indefinitely, though. Fix the root definition and update the consumers systematically.
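As a sketch of that temporary alias, here's one hypothetical shape, assuming a job that was renamed from `node` to `linux_nodes`. The `job_alias` label name is an invention for this example; with no `source_labels`, the relabel rule unconditionally stamps the replacement value onto every scraped sample:

```yaml
scrape_configs:
  - job_name: 'linux_nodes'
    metric_relabel_configs:
      # Temporary: stamp every sample with the old job name so consumers
      # can filter on job_alias="node" while they migrate. Remove once done.
      - target_label: job_alias
        replacement: 'node'
    static_configs:
      - targets:
          - '192.168.10.45:9100'
```

Dashboards being migrated can match on the alias label in the interim; delete the rule once every consumer queries the new job name.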
Root Cause 6: Node Exporter Bound to Loopback Only
The exporter is running and healthy, but it's only listening on `127.0.0.1` — so nothing outside the host can reach it. I've seen this happen when someone copies a startup script from an old tutorial that explicitly binds to loopback for "security," not realizing it breaks remote scraping entirely. It's also a common artifact of installing node_exporter from a distribution package rather than the official release, where the default unit file may bind to localhost.
How to Identify It
Check what address node_exporter is actually listening on:
ss -tlnp | grep 9100
If you see loopback only, the exporter is unreachable from outside the host:
LISTEN 0 128 127.0.0.1:9100 0.0.0.0:* users:(("node_exporter",pid=4521,fd=3))
What you want instead — binding to all interfaces:
LISTEN 0 128 0.0.0.0:9100 0.0.0.0:* users:(("node_exporter",pid=4521,fd=3))
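Classifying that bind address is easy to script. A sketch, with the sample line standing in for the live `ss -tlnp | grep 9100` output shown above:

```shell
# Sample ss output line standing in for the live command's output
line='LISTEN 0 128 127.0.0.1:9100 0.0.0.0:* users:(("node_exporter",pid=4521,fd=3))'

# Field 4 of ss output is the local address:port the socket is bound to
addr=$(printf '%s\n' "$line" | awk '{print $4}')

case $addr in
  127.0.0.1:*|"[::1]:"*) echo "loopback only: $addr" ;;
  *)                     echo "externally reachable: $addr" ;;
esac
```

Pipe the real ss output through the same awk/case logic to flag loopback-only exporters across a fleet.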
How to Fix It
The flag controlling the bind address is `--web.listen-address`. Check the service file and update it:
ExecStart=/usr/local/bin/node_exporter --web.listen-address="0.0.0.0:9100"
Reload systemd and restart the service, then confirm the binding changed:
systemctl daemon-reload
systemctl restart node_exporter
ss -tlnp | grep 9100
If your security policy doesn't allow binding to all interfaces, bind to the specific interface that Prometheus uses to reach this host — just not loopback. A host with IP `192.168.10.45` on its primary interface would use `--web.listen-address="192.168.10.45:9100"`.
Root Cause 7: TLS or Authentication Misconfiguration
Hardened environments often add TLS and basic authentication to node exporter. When the exporter is configured to require these but Prometheus isn't configured with matching credentials or a trusted CA, scrapes fail — either with an explicit HTTP error or a cryptic connection error that doesn't immediately suggest an auth problem.
How to Identify It
On the Prometheus Targets page, check the error column. A missing or wrong credential shows up as:
server returned HTTP status 401 Unauthorized
If TLS is required but Prometheus is still using plain HTTP, you'll often see:
Get "http://192.168.10.45:9100/metrics": EOF
Check whether the exporter has a web config file that enables TLS or auth:
cat /etc/node_exporter/web-config.yml
tls_server_config:
cert_file: /etc/ssl/node_exporter/node_exporter.crt
key_file: /etc/ssl/node_exporter/node_exporter.key
basic_auth_users:
prometheus: $2y$10$X5jVqN8...
If this file exists and is being passed to node_exporter via `--web.config.file`, then Prometheus must be configured to match.
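For reference, the unit excerpt wiring that file in might look like this (path taken from the example above; adapt to your layout):

```ini
[Service]
ExecStart=/usr/local/bin/node_exporter --web.config.file=/etc/node_exporter/web-config.yml
```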
How to Fix It
Update your Prometheus scrape job to include the TLS configuration and credentials:
scrape_configs:
- job_name: 'node'
scheme: https
tls_config:
ca_file: /etc/prometheus/certs/ca.crt
insecure_skip_verify: false
basic_auth:
username: prometheus
password_file: /etc/prometheus/node_exporter_password
static_configs:
- targets:
- '192.168.10.45:9100'
Store the password in a file with restricted permissions, not inline in the config:
chmod 600 /etc/prometheus/node_exporter_password
chown prometheus:prometheus /etc/prometheus/node_exporter_password
Validate and reload Prometheus, then confirm the target transitions to UP on the Targets page.
Prevention
Most of these issues are preventable with upfront work. Here's what I keep in place on every environment I manage.
Start with an alerting rule for the `up` metric. If node exporter stops reporting, you want to know within minutes — not when someone opens a dashboard and notices blank panels:
groups:
- name: node_exporter
rules:
- alert: NodeExporterDown
expr: up{job="node"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Node exporter down on {{ $labels.instance }}"
description: "Node exporter on {{ $labels.instance }} has been unreachable for more than 2 minutes."
Use configuration management — Ansible, Puppet, Salt — to enforce that the node_exporter service is running and enabled. Don't rely on humans to remember to start it after a reboot. A simple Ansible handler with `state: started` and `enabled: yes` handles the whole class of "forgot to restart after kernel update" incidents.
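As a sketch, the corresponding Ansible task (using the standard `ansible.builtin.systemd` module; fit it into your own role layout):

```yaml
- name: Ensure node_exporter is running and enabled
  ansible.builtin.systemd:
    name: node_exporter
    state: started
    enabled: yes
```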
Add firewall rule management to the same configuration management role that deploys node exporter. The rule and the service should live together. If you install the exporter, you open the port. If you remove the exporter, the rule goes with it. Treating them as separate tasks is how you end up with an exporter that starts but can't be scraped.
Validate your prometheus.yml on every change with `promtool check config` before applying it. If you manage your Prometheus config in a git repo — and you should — run this check in CI. Catching a port typo in a target address before it hits production is far better than hunting it down while an alert is firing.
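A minimal CI step might look like this — GitHub Actions syntax shown as one hypothetical example, with repo paths assumed; the promtool invocations are the part that matters:

```yaml
- name: Validate Prometheus config and rules
  run: |
    promtool check config prometheus.yml
    promtool check rules rules/*.yml
```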
Document disabled collectors in the unit file. A one-line comment explaining why `--no-collector.xfs` is set takes ten seconds to write and saves the next engineer (including future you) from assuming it's a bug. Undocumented intentional omissions are indistinguishable from accidents.
Finally, do a periodic audit of your label structure whenever you modify scrape configs or relabeling rules. Grep your alerting rules and Grafana dashboards for any label values that might be affected by the change. Label drift is one of those problems that compounds silently — dashboards can appear functional while actually showing stale or incomplete data for weeks until someone notices.
Node exporter is stable, mature software. It rarely breaks on its own. When metrics go missing, something in the surrounding environment changed. Work outward from the basics: is the process running, can Prometheus reach it, is Prometheus configured to look for it, and do the labels on what gets scraped match what your queries expect. That covers the overwhelming majority of cases you'll encounter.
