Symptoms
You'll know Prometheus is out of disk space before your monitoring tells you — because your monitoring just stopped working. The most common first sign is Grafana going dark for recent time windows while older data still renders fine. Alerts stop firing. The Prometheus UI at sw-infrarunbook-01:9090 might still show all targets green while the TSDB is silently refusing to write anything new.
Pull the logs immediately:
journalctl -u prometheus -f
You'll likely see something like this:
level=error ts=2026-04-18T03:42:17.283Z caller=head.go:592 component=tsdb msg="create chunk" err="no space left on device"
level=error ts=2026-04-18T03:42:17.285Z caller=compact.go:519 component=tsdb msg="compaction failed" err="no space left on device"
level=warn ts=2026-04-18T03:42:18.001Z caller=scrape.go:1378 component="scrape manager" scrapeErr="context deadline exceeded"
Confirm the disk state:
df -h /var/lib/prometheus
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 500G 500G 0 100% /var/lib/prometheus
You have an incident. Let's find out why.
Root Cause 1: Retention Period Too Long
Why it happens: The Prometheus default retention is 15 days. Most teams hit trouble not because they set something extreme, but because they bumped retention to 90 or 180 days to cover compliance or analytics needs, and then watched their scrape targets grow without revisiting storage allocation. Prometheus stores data in two-hour blocks. The longer the retention window, the more blocks accumulate — and every new exporter or service you add multiplies the cost of each extra day.
How to identify it: Check what retention is actually configured:
ps aux | grep prometheus | grep -o 'retention[^ ]*'
# or inspect the systemd unit
systemctl cat prometheus | grep retention
Sample output that should make you wince:
/usr/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/data \
--storage.tsdb.retention.time=180d
Then see how storage is distributed across blocks:
du -sh /var/lib/prometheus/data/*/
ls -lth /var/lib/prometheus/data/ | head -30
If Prometheus is still partially functional, query storage metrics directly:
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=prometheus_tsdb_storage_blocks_bytes' | jq .
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=prometheus_tsdb_lowest_timestamp_seconds' | jq .
How to fix it: The fastest short-term fix is cutting the retention period. Prometheus deletes old blocks during the next compaction cycle after a restart:
# Edit your systemd unit or startup script
# Change: --storage.tsdb.retention.time=180d
# To: --storage.tsdb.retention.time=30d
systemctl daemon-reload
systemctl restart prometheus
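After the restart, it's worth confirming the new flag actually took effect. Prometheus serves its runtime flags at /api/v1/status/flags; the sketch below runs the same parsing you'd apply to the live endpoint against a canned sample response, so it works offline (the host and the "30d" value are illustrative):

```shell
# Live usage would be:
#   curl -s http://10.0.1.10:9090/api/v1/status/flags
# A trimmed sample response stands in here so the parsing can be exercised offline.
echo '{"status":"success","data":{"storage.tsdb.retention.time":"30d"}}' |
  python3 -c 'import json, sys
flags = json.load(sys.stdin)["data"]
print("retention.time =", flags["storage.tsdb.retention.time"])'
```

If the printed value still shows the old retention, the unit file you edited is not the one systemd is actually running.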
If you need immediate disk recovery and can't wait for a compaction cycle, stop Prometheus and manually remove the oldest TSDB blocks. Each block is a directory named with a ULID. Don't touch the wal or chunks_head directories:
systemctl stop prometheus
# ls -lt sorts newest first, so the oldest blocks land at the bottom (bare ULIDs also sort chronologically)
ls -lt /var/lib/prometheus/data/ | grep -v wal | grep -v chunks_head | tail -20
# Delete the oldest blocks
rm -rf /var/lib/prometheus/data/01HXYZ0000000000000000001
rm -rf /var/lib/prometheus/data/01HXYZ0000000000000000002
systemctl start prometheus
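Eyeballing ls output under incident pressure is error-prone. A safer sketch selects blocks by the maxTime field (milliseconds since epoch) in each block's meta.json. The directory names, timestamps, and cutoff below are made up for the demo; in real use point DATA_DIR at /var/lib/prometheus/data with Prometheus stopped, and swap the echo for rm -rf only once you trust the selection:

```shell
# Select blocks whose newest sample (meta.json maxTime) predates a cutoff.
# Demo runs against a throwaway directory created on the spot.
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/01OLDBLOCK" "$DATA_DIR/01NEWBLOCK" "$DATA_DIR/wal"
echo '{"minTime": 1744800000000, "maxTime": 1744807200000}' > "$DATA_DIR/01OLDBLOCK/meta.json"
echo '{"minTime": 1745020800000, "maxTime": 1745028000000}' > "$DATA_DIR/01NEWBLOCK/meta.json"
cutoff_ms=1745000000000
for dir in "$DATA_DIR"/*/; do
  [ -f "$dir/meta.json" ] || continue              # skips wal/ and chunks_head/
  max=$(sed -n 's/.*"maxTime": *\([0-9]*\).*/\1/p' "$dir/meta.json")
  if [ "$max" -lt "$cutoff_ms" ]; then
    echo "would delete: $dir"                      # replace echo with rm -rf when confident
  fi
done
```

Selecting on maxTime rather than directory mtime also avoids surprises when blocks were rewritten by compaction after their data was ingested.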
For the long term, calculate your average daily ingest volume, size your disk to cover the desired retention window plus a 25% buffer, and, critically, add --storage.tsdb.retention.size as a hard cap so a future spike can't breach it again.
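A back-of-envelope sizing sketch. The helper function and every input below are illustrative assumptions (compressed samples typically land around 1-2 bytes each); measure your real ingest rate with the PromQL query rate(prometheus_tsdb_head_samples_appended_total[1h]) before committing to a number:

```shell
# Rough TSDB capacity estimate: samples/sec * bytes/sample * retention, plus 25% headroom.
estimate_gib() {  # $1 = samples/sec  $2 = bytes/sample  $3 = retention in days
  awk -v sps="$1" -v bps="$2" -v d="$3" 'BEGIN {
    # 86400 seconds per day; 1.25 adds the 25% headroom buffer
    printf "%.0f", sps * bps * d * 86400 * 1.25 / (1024 ^ 3)
  }'
}
echo "30d at 100k samples/s, 2 B/sample: $(estimate_gib 100000 2 30) GiB"
```

The estimate deliberately ignores the WAL and head chunks, which add several more gigabytes on top; round up, not down.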
Root Cause 2: Too Many Series Stored (Cardinality Explosion)
Why it happens: Every unique combination of metric name and label values is a separate time series. When labels carry high-cardinality values — user IDs, session tokens, request trace IDs, or names of ephemeral Kubernetes pods — you can end up with millions of active series. In my experience, a single misbehaving application instrumented by a developer who didn't read the Prometheus best practices docs can push two million new series in under an hour just by embedding a UUID in a label. Each of those series occupies space in the TSDB head block and eventually gets written to disk.
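The multiplication is what makes this deadly: the active series count is the product of the value counts of every label on a metric. A quick sketch with hypothetical numbers for a single histogram:

```shell
# Series count is multiplicative across label values. Hypothetical figures:
# 12 histogram buckets x 50 endpoints x 4 HTTP methods x 10k customer IDs.
awk 'BEGIN {
  buckets = 12; endpoints = 50; methods = 4; customer_ids = 10000
  printf "%d active series from a single metric\n", buckets * endpoints * methods * customer_ids
}'
```

One unbounded label on one histogram is enough to dwarf the rest of the instance combined.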
How to identify it: Start with the built-in gauge:
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=prometheus_tsdb_head_series' | \
jq '.data.result[0].value[1]'
A well-managed Prometheus instance with moderate infrastructure coverage sits around 100k–500k series. At 5M+ you have a cardinality crisis. Find the guilty metrics:
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=topk(20, count by (__name__)({__name__=~".+"}))' | \
jq '.data.result[] | {metric: .metric.__name__, count: .value[1]}'
Output that shows a problem:
{
  "metric": "http_request_duration_seconds_bucket",
  "count": "4823917"
}
{
  "metric": "grpc_server_handling_seconds_bucket",
  "count": "2104883"
}
Drill into a specific metric to find which label is exploding:
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=count by (customer_id)(http_request_duration_seconds_bucket)' | \
jq '.data.result | length'
For a comprehensive cardinality report, use the analyze tool against a data block:
promtool tsdb analyze /var/lib/prometheus/data
Block ID: 01HXYZ1234567890ABCDEFGH
Duration: 2h0m0s
Series: 4823917
Label names: 47
Postings (unique label pairs): 9847231
Top 5 label names with most values:
customer_id: 1923847
request_id: 1847234
trace_id: 892341
pod: 72341
instance: 4821
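Prometheus also serves these cardinality statistics over HTTP at /api/v1/status/tsdb (series counts by metric name, label value counts by label name). The sketch below runs the parsing against an inlined, trimmed sample payload so it works offline; against the live server you'd replace the echo with curl:

```shell
# Live usage would be:
#   curl -s http://10.0.1.10:9090/api/v1/status/tsdb
# A trimmed sample payload stands in here so the parsing runs offline.
echo '{"status":"success","data":{"seriesCountByMetricName":[
  {"name":"http_request_duration_seconds_bucket","value":4823917},
  {"name":"up","value":482}]}}' |
  python3 -c 'import json, sys
for entry in json.load(sys.stdin)["data"]["seriesCountByMetricName"][:5]:
    print(entry["name"], entry["value"])'
```

This endpoint is cheaper than the topk query above because Prometheus precomputes the stats rather than scanning every series at query time.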
How to fix it: Drop the offending labels at the scrape level using metric_relabel_configs. This strips them before they ever touch the TSDB:
# prometheus.yml
scrape_configs:
  - job_name: 'app-api'
    static_configs:
      - targets: ['10.0.1.20:8080']
    metric_relabel_configs:
      - regex: 'customer_id|request_id|trace_id'
        action: labeldrop
Reload Prometheus without a restart:
curl -X POST http://10.0.1.10:9090/-/reload
New scrapes will no longer produce series with those labels. Existing series will age out according to your retention policy — you won't recover disk space immediately, but the growth stops. Pair this with a cardinality alert so you catch the next offender before it fills the disk again.
Root Cause 3: Compaction Not Running
Why it happens: Prometheus writes samples to the WAL first, then flushes in-memory chunks to two-hour blocks on disk. Compaction is the background process that merges those small blocks into larger ones (6h, 24h) and applies compression. When compaction stalls or fails, you accumulate dozens of small, uncompressed two-hour blocks instead of a handful of efficient large ones. The same data ends up consuming three to five times more disk space than it should. And once disk pressure increases, compaction can fail specifically because it doesn't have room to write the merged output — a particularly nasty chicken-and-egg situation.
How to identify it: Check the failure counter:
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=prometheus_tsdb_compactions_failed_total' | jq .
{
  "status": "success",
  "data": {
    "result": [
      {
        "metric": {},
        "value": [1745030537, "14"]
      }
    ]
  }
}
Fourteen failed compactions is serious. Now count the total block directories — a healthy 15-day retention should have somewhere between 15 and 40 blocks, not hundreds:
ls /var/lib/prometheus/data/ | grep -v wal | grep -v chunks_head | wc -l
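A raw count doesn't tell you which blocks are stuck at the smallest size. A sketch that tallies blocks by the compaction level recorded in each block's meta.json (assuming the standard meta.json layout; freshly flushed two-hour blocks are level 1, merged blocks carry higher levels):

```shell
# Tally TSDB blocks by compaction level. A healthy data dir shows a mix of
# levels; dozens of level-1 blocks piling up means merging has stalled.
DATA_DIR=${DATA_DIR:-/var/lib/prometheus/data}
for dir in "$DATA_DIR"/*/; do
  [ -f "$dir/meta.json" ] || continue   # skips wal/ and chunks_head/
  sed -n 's/.*"level": *\([0-9]*\).*/\1/p' "$dir/meta.json"
done | sort -n | uniq -c
```

The left column is the block count, the right the compaction level; a long bar at level 1 confirms the failure counter's story.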
Check the logs for the actual compaction error:
journalctl -u prometheus --since "2 hours ago" | grep -i compact
level=error ts=2026-04-18T01:15:33.891Z caller=compact.go:519 component=tsdb \
msg="compaction failed" err="write /var/lib/prometheus/data/tmp-compact/chunks/000001: no space left on device"
How to fix it: If the failure is disk-space related, you need to free up enough room for compaction to write its output. Stop Prometheus first, remove a few of the oldest two-hour blocks, then restart:
systemctl stop prometheus
# Identify the oldest blocks (ls -t lists newest first, so the oldest are at the bottom)
ls -lth /var/lib/prometheus/data/ | grep -v wal | grep -v chunks_head | tail -10
# Remove the oldest several
rm -rf /var/lib/prometheus/data/01HXYZ0000000000000000001
rm -rf /var/lib/prometheus/data/01HXYZ0000000000000000002
rm -rf /var/lib/prometheus/data/01HXYZ0000000000000000003
systemctl start prometheus
Watch the logs to confirm compaction resumes successfully:
journalctl -u prometheus -f | grep compact
level=info ts=2026-04-18T04:02:11.773Z caller=compact.go:442 component=tsdb \
msg="compact blocks" count=12 mint=1745000000000 maxt=1745043200000 \
ulid=01HXYZ9999999999999999999 duration=47.293s
If you see that message and the block count starts dropping, compaction is healthy again. Keep watching for another 15–20 minutes to make sure it doesn't fail again on the next cycle.
Root Cause 4: Remote Write Filling Disk
Why it happens: When remote_write is configured, Prometheus ships all ingested samples to a remote endpoint — Thanos, Cortex, Mimir, VictoriaMetrics, or a custom receiver — in addition to storing them locally. The remote write subsystem uses its own WAL buffer to hold samples that haven't been successfully acknowledged yet. If the remote endpoint becomes slow, unreachable, or starts rejecting samples, that buffer grows without bound. I have seen this scenario push WAL directories past 80GB on a host with a 500GB volume, where local TSDB retention was only seven days and completely healthy. The WAL just kept accumulating undelivered samples.
How to identify it: Start by checking the WAL size directly:
du -sh /var/lib/prometheus/data/wal/
# Healthy: a few hundred MB to low single-digit GB
# Sick: 20GB, 80GB, 200GB
du -sh /var/lib/prometheus/data/*/
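A single du reading can't distinguish a WAL that is merely large from one that is actively growing. A sketch that samples the size twice and reports the growth rate; the helper function is hypothetical, not a Prometheus command:

```shell
# Report WAL growth in MiB per minute over a sampling window.
wal_growth_mib_per_min() {  # $1 = wal directory  $2 = sample interval in seconds
  before=$(du -sk "$1" | awk '{print $1}')
  sleep "$2"
  after=$(du -sk "$1" | awk '{print $1}')
  awk -v a="$before" -v b="$after" -v t="$2" \
    'BEGIN { printf "%.1f\n", (b - a) / 1024 / (t / 60) }'
}
# e.g.: wal_growth_mib_per_min /var/lib/prometheus/data/wal 60
```

Steady positive numbers that never fall back to zero are the signature of a stuck remote write queue.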
Then look at the remote write queue metrics:
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=prometheus_remote_storage_samples_pending' | jq .
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=prometheus_remote_storage_failed_samples_total' | jq .
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds' | jq .
If pending samples are in the millions and the lag metric shows the queue is hours or days behind real time, you have a stuck remote write pipeline. Check the logs:
journalctl -u prometheus --since "3 hours ago" | grep -i "remote"
level=warn ts=2026-04-18T02:14:55.002Z caller=queue_manager.go:892 component=remote \
remote_name=thanos url=http://10.0.2.50:10908/api/v1/receive \
msg="Could not store data" err="context deadline exceeded"
How to fix it: First, restore the remote endpoint. Verify it's reachable:
curl -v http://10.0.2.50:10908/api/v1/receive
ip route get 10.0.2.50
ss -tn dst 10.0.2.50
Once the endpoint is healthy, Prometheus will drain the queue automatically. To prevent unbounded WAL growth in the future, add queue limits to your remote_write block:
# prometheus.yml
remote_write:
  - url: "http://10.0.2.50:10908/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 10
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
If the remote endpoint is permanently gone and you want to stop the queue from consuming disk immediately, remove the remote_write block and reload. Prometheus will drop the unsent buffer and WAL growth will stop. The data that was queued but never sent is lost, but at this point the alternative is losing everything when the disk fills completely.
Root Cause 5: TSDB Corruption
Why it happens: TSDB corruption follows unclean shutdowns — an OOM kill, a kernel panic, a hard power loss, or a kill -9. When Prometheus is stopped mid-write, the WAL can contain partially written segments and on-disk blocks can be left in an inconsistent state. Prometheus has a built-in repair mechanism that runs at startup and handles most cases gracefully, but in some situations it partially opens the database and continues writing into a fragmented layout. Corrupted blocks don't compact properly, can't be removed by the normal retention process, and tend to accumulate silently over time — causing disk usage to grow in ways that don't match the expected retention math.
How to identify it: Look for repair and corruption messages from the last startup:
journalctl -u prometheus | grep -i "corrupt\|repair\|invalid\|error opening"
level=warn ts=2026-04-18T00:01:22.441Z caller=repair.go:56 component=tsdb \
msg="Found healthy block" mint=1744876800000 maxt=1744884000000
level=error ts=2026-04-18T00:01:22.552Z caller=db.go:887 component=tsdb \
msg="Unexpected error opening DB" err="unexpected non-sequence files in directory"
level=info ts=2026-04-18T00:01:22.553Z caller=repair.go:88 component=tsdb \
msg="Deleting broken block" id=01HXYZ1111111111111111111
Run the analyze tool to inspect block metadata:
promtool tsdb analyze /var/lib/prometheus/data
Block ID: 01HXYZ1111111111111111111
Duration: 0s <-- zero duration: corrupted
Series: 0
Chunks: 0
You can also scan block metadata files directly to spot invalid JSON:
for dir in /var/lib/prometheus/data/*/; do
  if [ -f "$dir/meta.json" ]; then
    python3 -m json.tool "$dir/meta.json" > /dev/null 2>&1 || \
      echo "CORRUPTED BLOCK: $dir"
  fi
done
How to fix it: Stop Prometheus and let it run its automatic repair on restart — this resolves most corruption cases:
systemctl stop prometheus
systemctl start prometheus
journalctl -u prometheus -f | grep -i "repair\|corrupt\|delet"
If automatic repair isn't sufficient, use promtool tsdb repair directly. Always back up first if you have space anywhere:
systemctl stop prometheus
# Back up if capacity allows
rsync -av /var/lib/prometheus/data/ /mnt/backup/prometheus-data/
# Run the repair tool
promtool tsdb repair /var/lib/prometheus/data
systemctl start prometheus
After restart, re-run promtool tsdb analyze and verify no zero-duration or zero-series blocks remain. Any that do can be safely deleted while Prometheus is stopped — they contain nothing useful.
Root Cause 6: WAL Growing Without Remote Write
Why it happens: Even with no remote write configured, the WAL can grow unexpectedly. Prometheus writes all raw samples to the WAL first and checkpoints it periodically as data is flushed to head chunks. If scrape volume surges — a new high-frequency job, a suddenly expanded target list — the WAL can grow faster than it's being checkpointed. WAL corruption after an unclean shutdown can also leave orphaned segments that never get cleaned up, silently consuming gigabytes over weeks.
How to identify it: Check WAL size and segment count:
du -sh /var/lib/prometheus/data/wal/
ls /var/lib/prometheus/data/wal/ | wc -l
# A normal WAL has a handful of numbered segments (00000001, 00000002...)
# If you're seeing hundreds, something is wrong
curl -s http://10.0.1.10:9090/api/v1/query \
--data-urlencode 'query=prometheus_tsdb_wal_corruptions_total' | jq .
How to fix it: If the WAL is large due to a temporary scrape volume surge, reduce your scrape targets or extend scrape intervals and wait for Prometheus to checkpoint normally. If corruption is involved, a clean restart usually resolves it. In cases where old checkpoint directories are accumulating, stopping Prometheus and clearing them forces a clean WAL state on next start:
systemctl stop prometheus
ls /var/lib/prometheus/data/wal/
# Keep the highest-numbered checkpoint.XXXXXXXX; Prometheus replays from it.
# Only older, superseded checkpoints are leftovers that are safe to remove, e.g.:
rm -rf /var/lib/prometheus/data/wal/checkpoint.00000001
systemctl start prometheus
Prevention
The best Prometheus disk-full incident is one that pages you at 80% capacity instead of 100%. Wire up an alert on your own disk usage via node_exporter, which runs independently of Prometheus's TSDB health:
- alert: PrometheusDiskRunningFull
  expr: |
    (
      node_filesystem_avail_bytes{mountpoint="/var/lib/prometheus"} /
      node_filesystem_size_bytes{mountpoint="/var/lib/prometheus"}
    ) < 0.20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus disk at {{ $value | humanizePercentage }} free on {{ $labels.instance }}"
    description: "Projected to fill within hours — investigate retention, cardinality, and compaction."
Add a cardinality alert so you catch series explosions before they become a disk crisis:
- alert: PrometheusHighCardinality
  expr: prometheus_tsdb_head_series > 2000000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus series count too high: {{ $value | humanize }}"
    description: "High cardinality often precedes disk exhaustion. Find and drop the offending labels."
Use both retention flags together. Time-based retention is fine for normal operation, but size-based retention is your insurance policy against unexpected ingest spikes:
/usr/bin/prometheus \
--storage.tsdb.path=/var/lib/prometheus/data \
--storage.tsdb.retention.time=30d \
--storage.tsdb.retention.size=400GB
Prometheus enforces whichever limit is hit first, so size-based retention will kick in automatically if a cardinality event or remote write queue backup causes storage to spike before the time limit would normally clean it up.
Run promtool tsdb analyze against your production data on a monthly basis. It takes seconds, outputs a block-by-block summary, and lets you spot creeping cardinality growth before it becomes an emergency. Pair this with a Grafana dashboard that tracks prometheus_tsdb_head_series, prometheus_tsdb_compactions_failed_total, and prometheus_remote_storage_samples_pending over time — all three trend toward problems weeks before they cause an outage.
Finally, if you're running remote write, make sure your receiver infrastructure has its own alerting. Prometheus has no way to alert you that its remote write endpoint is down — it just keeps buffering. A silent receiver failure is one of the more insidious paths to a full WAL. Treat remote write queue lag as a first-class metric and page on it.
