Symptoms
You open Grafana, switch to the Explore view, select your Loki datasource, and run a query that normally returns thousands of lines — and you get nothing. No entries. Not even an error, just silence. Or maybe you are getting an error: `context deadline exceeded` or `no org id` staring back at you. Either way, logs that should be flowing into Loki aren't arriving, and you need to figure out why quickly.
The most common symptoms when Loki ingestion breaks down:
- Grafana Explore returns empty results for queries that previously worked fine
- Log-based alerts stop firing even though the underlying condition is actively occurring
- Promtail logs fill with repeated error messages like `429 Too Many Requests` or `connection refused`
- The Loki `/metrics` endpoint shows `loki_distributor_lines_received_total` has stopped incrementing
- `logcli` returns zero results for streams you know are active
- Application teams report that their structured logs vanished from dashboards with no obvious explanation
The tricky thing about Loki ingestion failures is that they're often silent. Promtail keeps running, applications keep writing logs, but nothing makes it through the pipeline. Everything looks healthy on the surface until you actually query for data. This article walks through the most common culprits, how to identify each one precisely, and exactly how to fix them.
Root Cause 1: Promtail Not Running
This is the obvious one, but it bites people more often than they'd like to admit. Promtail is the agent that scrapes log files and ships them to Loki. If it's not running, nothing gets ingested — full stop. In my experience, this happens most frequently right after a system reboot that didn't restore the service, after a package update that silently replaced the unit file, or after a config change that introduced a syntax error and prevented the service from starting cleanly.
To check whether Promtail is actually running on `sw-infrarunbook-01`:
systemctl status promtail.service
If it's failed or inactive, you'll see something like:
● promtail.service - Promtail log shipping agent
Loaded: loaded (/etc/systemd/system/promtail.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2026-04-16 09:12:44 UTC; 3min 22s ago
Process: 4821 ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yaml (code=exited, status=1/FAILURE)
Main PID: 4821 (code=exited, status=1/FAILURE)
Pull the actual error from the journal to understand why it failed:
journalctl -u promtail.service -n 50 --no-pager
A config syntax error produces output like this in the journal:
Apr 16 09:12:44 sw-infrarunbook-01 promtail[4821]: level=error ts=2026-04-16T09:12:44.312Z caller=main.go:67 msg="error creating promtail" err="invalid config: yaml: line 34: mapping values are not allowed in this context"
Always validate your config before attempting a restart — saves you from bouncing the service only to have it fail again immediately:
promtail -config.file=/etc/promtail/config.yaml -check-syntax
Once the config passes validation, enable and start the service:
systemctl enable promtail.service
systemctl start promtail.service
systemctl status promtail.service
If Promtail keeps crashing in a loop even with a valid config, check file permissions on the log paths it's trying to tail. The user Promtail runs as must have read access to every file it's watching. A quick sanity check:
stat /var/log/syslog
ls -la /var/log/app/*.log
If Promtail runs as the `promtail` system user and the log files are owned by `root` with mode `640`, it can't read them. Depending on the version, it may fail silently rather than throwing an obvious permission denied error. Either add `promtail` to the appropriate group, or adjust the ACLs on the log directory.
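Either fix can be applied without loosening permissions for everyone else. A sketch, assuming the logs live under `/var/log/app` and that your distribution uses the `adm` group for log access — both are assumptions, so check the actual owner and group with `stat` first:

```shell
# Option 1: add the promtail user to the group that owns the logs
# ("adm" is an assumption -- verify with: stat /var/log/app/*.log)
usermod -aG adm promtail

# Option 2: a POSIX ACL scoped to the promtail user only
setfacl -R -m u:promtail:rX /var/log/app
# Default ACL so files created later inherit the grant
setfacl -d -m u:promtail:rX /var/log/app

# Promtail only picks up new group membership after a restart
systemctl restart promtail.service
```

The ACL route is generally safer on shared hosts, because it grants exactly one user read access rather than widening an entire group.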
Root Cause 2: Wrong Loki Endpoint
Promtail is running fine, but it's pointed at the wrong address. This is surprisingly common in environments where Loki was recently migrated to a new server, placed behind a reverse proxy, or switched from HTTP to HTTPS. The agent happily attempts to ship logs into the void and, depending on the error, may not complain loudly enough for anyone to notice until there's a gap in the data.
The Loki push endpoint in Promtail's config lives under the `clients` block:
clients:
- url: http://192.168.10.25:3100/loki/api/v1/push
Common mistakes I've seen: using port `3100` when Loki is actually behind nginx on port `80`, omitting the `/loki/api/v1/push` path suffix and pointing at the root URL, pointing at a dev Loki instance instead of production, or using an IP address that changed after a VM migration or re-IP event. The best way to catch this quickly is to look at Promtail's own logs for connection errors:
journalctl -u promtail.service --since "10 minutes ago" | grep -iE "error|failed|refused|404|429"
A wrong endpoint will produce output like:
level=warn ts=2026-04-16T09:45:11.803Z caller=client.go:349 component=client host=192.168.10.99:3100 msg="error sending batch, will retry" status=0 err="Post \"http://192.168.10.99:3100/loki/api/v1/push\": dial tcp 192.168.10.99:3100: connect: connection refused"
Or if the host resolves but the path is wrong:
level=warn ts=2026-04-16T09:45:11.803Z caller=client.go:349 component=client host=192.168.10.25:3100 msg="error sending batch, will retry" status=404 err="server returned HTTP status 404 Not Found"
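Because these warn lines share a common shape, a small classifier can turn a journal tail into a first-pass diagnosis. This is a sketch, not part of Promtail — the patterns simply match the error strings shown throughout this article:

```shell
# Map "error sending batch" lines to a likely root cause
classify() {
  awk '
    /connection refused/             { print "likely cause: wrong host/port, or Loki is down"; next }
    /status=404/                     { print "likely cause: wrong push path or reverse-proxy prefix"; next }
    /status=429/                     { print "likely cause: ingestion rate limit hit"; next }
    /timestamp too old|out of order/ { print "likely cause: out-of-order entries rejected"; next }
  '
}

# Demonstrate on a sample line; in practice, pipe the journal in:
#   journalctl -u promtail.service --since "10 minutes ago" | classify
printf '%s\n' 'msg="error sending batch, will retry" status=0 err="dial tcp 192.168.10.99:3100: connect: connection refused"' | classify
```

Each later root cause in this article maps to one of these branches, so the first match usually tells you which section to jump to.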
Verify Loki is actually reachable at the expected address from the Promtail host before editing any config:
curl -v http://192.168.10.25:3100/ready
If Loki is running properly, you'll get back a `200 OK` with the body `ready`. If you're getting a 404 specifically on the push endpoint, also check whether you're running Loki behind a reverse proxy with a path prefix. Some configurations serve Loki at `/loki/` on the proxy, which means the full push URL becomes `http://192.168.10.25:80/loki/loki/api/v1/push` — yes, with `loki` doubled. It looks wrong, but it's correct in that scenario.
Fix the URL in `/etc/promtail/config.yaml`, then restart Promtail and tail the logs to confirm entries are flowing:
systemctl restart promtail.service
journalctl -u promtail.service -f
You should see lines like `msg="successfully sent batch"` appearing within seconds if the endpoint is now correct and Loki is reachable.
Root Cause 3: Label Mismatch
Loki's data model is fundamentally label-based. Every log stream is uniquely identified by a set of labels, and those labels define the stream's identity permanently once created. When Promtail ships a batch with a label set that doesn't match the schema of an existing stream — or when a pipeline stage produces unexpected label combinations — you'll end up with rejected entries, duplicate streams, or query results that don't match what you expect.
One common scenario: you rename a label in your Promtail pipeline stages partway through the day, say changing `job` to `app`. Loki already has an open stream for that log source with the old label schema. The new batches arrive with a different label fingerprint, and instead of appending cleanly to the existing stream, Loki creates a new one. Old queries using `{job="webserver"}` stop returning new data because the stream has effectively diverged.
Another classic problem is high-cardinality labels — using a user ID, request UUID, pod IP, or any other value that changes per request as a Loki label. This explodes the number of streams Loki has to manage, eventually triggering limits and causing ingestion to stall entirely.
To see what labels Promtail is actually attaching, enable debug logging temporarily:
promtail -config.file=/etc/promtail/config.yaml -log.level=debug 2>&1 | grep -i "labels\|stream"
You can also query Loki directly to see the full label set it has indexed:
curl http://192.168.10.25:3100/loki/api/v1/labels | jq .
{
"status": "success",
"data": [
"app",
"env",
"host",
"job"
]
}
To detect a cardinality problem, check the number of unique values for a label that should be low-cardinality:
curl "http://192.168.10.25:3100/loki/api/v1/label/host/values" | jq '.data | length'
If that returns hundreds or thousands for a label that should have a handful, you've found the problem. Fix it in the Promtail pipeline stage by dropping the high-cardinality label and keeping that value inside the log line body as a structured field instead:
pipeline_stages:
- json:
expressions:
request_id: request_id
level: level
- labels:
app:
env:
level:
# request_id stays in the log body — never make it a label
After fixing the label config, restart Promtail. Cleaning up existing high-cardinality streams in Loki requires either using the Loki admin delete API or waiting for the retention period to expire and compact them away.
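To make the cardinality check systematic rather than one label at a time, you can sweep every label Loki reports and flag the suspicious ones. A sketch, assuming the Loki address used throughout this article, that `jq` is installed, and a threshold of 50 unique values (tune that to your schema):

```shell
# Count the entries in a Loki label-values API response
count_values() { jq '.data | length'; }

LOKI=http://192.168.10.25:3100
THRESHOLD=50

# Walk every label Loki knows about and flag any with too many values
for label in $(curl -s --max-time 5 "$LOKI/loki/api/v1/labels" | jq -r '.data[]'); do
  n=$(curl -s --max-time 5 "$LOKI/loki/api/v1/label/$label/values" | count_values)
  if [ "$n" -gt "$THRESHOLD" ]; then
    echo "SUSPECT: label '$label' has $n unique values"
  fi
done
```

Run it before and after the pipeline fix: the suspect labels should stop accumulating new values once Promtail no longer attaches them.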
Root Cause 4: Out of Order Entries
This one trips up engineers who are shipping logs from multiple sources, replaying archived log data, or dealing with hosts that have drifted clocks. Within a single stream, Loki expects timestamps to move forward: depending on version and configuration, entries that arrive out of order or older than the configured acceptance window are rejected outright.
Why does this happen in practice? NTP drift between hosts producing logs, log files being re-read from the beginning by Promtail after a crash or restart that wiped the position file, log aggregators that buffer and reorder entries before forwarding, and manual imports of archived log data are all common triggers. The error message in Promtail's logs is descriptive when this is the cause:
level=warn ts=2026-04-16T10:03:22.194Z caller=client.go:349 component=client host=192.168.10.25:3100 msg="error sending batch, will retry" status=400 err="rpc error: code = Code(400) desc = entry for stream '{app=\"webserver\", env=\"prod\", host=\"sw-infrarunbook-01\"}' has timestamp too old: 2026-04-16T09:55:01Z, oldest acceptable timestamp is 2026-04-16T10:03:00Z"
HTTP status `400` combined with the phrase "timestamp too old" or "entry out of order" makes this unambiguous. Note that unlike rate limit errors (which return 429 and trigger retries), out-of-order rejections are permanent — those log entries are lost unless you can re-deliver them after adjusting Loki's tolerance window.
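To pick a sensible tolerance window, it helps to know how far behind the rejected entries actually are. A small sketch (GNU `date` assumed) that pulls both timestamps out of a rejection message and prints the gap in seconds:

```shell
# Extract "timestamp too old: X, oldest acceptable timestamp is Y" pairs
# and print Y - X in seconds
gap_seconds() {
  sed -n 's/.*timestamp too old: \([^,]*\), oldest acceptable timestamp is \([^"]*\).*/\1 \2/p' |
  while read -r rejected acceptable; do
    echo $(( $(date -d "$acceptable" +%s) - $(date -d "$rejected" +%s) ))
  done
}

# The sample rejection from above is 479 seconds behind the window:
echo 'err="... has timestamp too old: 2026-04-16T09:55:01Z, oldest acceptable timestamp is 2026-04-16T10:03:00Z"' | gap_seconds
```

Pipe the journal through it to see the worst-case gap across many rejections, then set the acceptance window just wide enough to cover it rather than reaching straight for a week.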
On the Loki server side, you can widen the acceptance window by adjusting `reject_old_samples_max_age` in `limits_config`:
limits_config:
reject_old_samples: true
reject_old_samples_max_age: 168h
Be careful — setting this too loosely can mask genuine clock problems and make future debugging harder. Use the narrowest window that fixes the actual symptom.
On the Promtail side, make sure the position file is on a persistent path so it survives restarts without re-reading log files from the beginning:
positions:
filename: /var/lib/promtail/positions.yaml
If the positions file gets deleted — which happens during certain upgrade procedures or aggressive `/var/lib` cleanups — Promtail starts reading every configured log file from offset zero on next start. For active high-volume logs, this floods Loki with entries that are days old, arriving far out of order relative to what's already been ingested. Restore the positions file from backup if you have one, or add `tail_from_end: true` to your scrape configs to start from the current tail on first read rather than the beginning.
For NTP drift issues, verify clock sync on all log-producing hosts:
timedatectl status
chronyc tracking
If the offset is more than a second or two, fix NTP synchronization first. Everything downstream of a drifted clock becomes unreliable.
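The interesting number in `chronyc tracking` output is the "System time" offset. A sketch that parses it and flags anything over one second — the field layout assumed here is chrony's standard `System time : X seconds fast/slow of NTP time` line:

```shell
# Flag a drifted clock from `chronyc tracking` output
flag_drift() {
  awk '/^System time/ {
    offset = $4 + 0   # magnitude in seconds; the direction (fast/slow) is in $6
    if (offset > 1) print "DRIFT: clock is off by " offset " seconds -- fix NTP first"
    else            print "OK: offset " offset "s"
  }'
}

chronyc tracking | flag_drift
```

Dropped into a fleet-wide check (Ansible ad-hoc, SSH loop), this finds the one drifted host producing out-of-order entries before you start tuning Loki's limits around it.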
Root Cause 5: Rate Limit Hit
Loki ships with ingestion rate limits enabled by default, and they're conservative by design. Under steady-state operations you won't hit them. The moment you deploy a newly verbose microservice, start tailing a debug log that's producing ten thousand lines per second, kick off a batch job that dumps gigabytes of application output, or add ten new hosts all pointing at the same Loki instance simultaneously — you'll slam into the limits and start dropping entries.
The error in Promtail's logs when rate limits are hit is hard to miss:
level=warn ts=2026-04-16T11:22:05.441Z caller=client.go:349 component=client host=192.168.10.25:3100 msg="error sending batch, will retry" status=429 err="rpc error: code = Code(429) desc = ingestion rate limit (4194304 bytes/s) exceeded while adding 65536 bytes for user 'fake', reduce log volume or contact your Loki administrator"
Status `429` with the phrase "ingestion rate limit exceeded" is the key. Unlike 400 errors for out-of-order entries, Promtail will retry 429 responses with backoff — so in some cases the logs will eventually make it through if the rate normalizes. But if the elevated rate persists, Promtail's internal batch buffer fills up, retries accumulate, and you start losing entries permanently.
Check the current rate limit configuration on the Loki server:
grep -A 10 'limits_config' /etc/loki/config.yaml
limits_config:
ingestion_rate_mb: 4
ingestion_burst_size_mb: 6
per_stream_rate_limit: 3MB
per_stream_rate_limit_burst: 15MB
The default `ingestion_rate_mb` of 4 MB/s is per tenant in multi-tenant mode, or global in single-tenant mode. For anything beyond a handful of lightly active services, that ceiling is low. Raise it to match your actual volume with some headroom:
limits_config:
ingestion_rate_mb: 32
ingestion_burst_size_mb: 64
per_stream_rate_limit: 10MB
per_stream_rate_limit_burst: 30MB
Restart Loki after changing its config and confirm it came up cleanly:
systemctl restart loki.service
journalctl -u loki.service -n 20 --no-pager
Raising limits is the fast fix, but you should also find out why the rate spiked in the first place. Identify which stream is responsible by checking Loki's metrics endpoint:
curl -s http://192.168.10.25:3100/metrics | grep loki_distributor_bytes_received_total
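To see how close you actually are to the ceiling, sample the byte counter twice and convert the delta to MB/s, which is directly comparable to `ingestion_rate_mb`. A sketch assuming the same Loki address; the ten-second window is arbitrary:

```shell
LOKI=http://192.168.10.25:3100

# Sum the distributor byte counter across all labels/tenants
sample_bytes() {
  curl -s --max-time 5 "$LOKI/metrics" |
    awk '/^loki_distributor_bytes_received_total/ { sum += $NF } END { print sum + 0 }'
}

a=$(sample_bytes)
sleep 10
b=$(sample_bytes)

# Counter delta over the window, converted to MB/s
awk -v a="$a" -v b="$b" 'BEGIN { printf "ingest rate: %.2f MB/s\n", (b - a) / 10 / 1048576 }'
```

If the printed rate sits at or above your configured `ingestion_rate_mb`, raising the limit alone only buys headroom — the volume itself still needs investigating.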
If one specific app or host label is generating the overwhelming bulk of the volume, consider adding a Promtail pipeline stage to drop unnecessary log levels before they ever reach Loki. This is cleaner than just opening up the rate limit ceiling:
pipeline_stages:
- match:
selector: '{app="verbose-service"}'
stages:
- drop:
expression: ".*level=debug.*"
- drop:
expression: ".*level=trace.*"
Drop noisy log levels at the agent, not at query time. Query-time filtering doesn't reduce your ingestion cost — the data is already in Loki eating storage and bandwidth.
Root Cause 6: Filesystem Full on the Loki Host
Simple and brutal. If the disk where Loki stores its chunks and index fills up, ingestion stops. Depending on your Loki version, the error messages can be surprisingly unhelpful — sometimes appearing as generic internal server errors rather than clearly pointing to disk exhaustion. This one tends to cause confusion because the root cause is completely outside the Loki configuration itself.
Check disk usage on the Loki storage path:
df -h /var/loki
du -sh /var/loki/chunks /var/loki/index /var/loki/boltdb-shipper-active
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 50G 49G 512M 99% /var/loki
If you're at 99%, that's your problem. Short-term: check whether retention is configured and actually deleting old data. Long-term: expand the volume or move Loki's storage backend to object storage so you're not constrained by local disk capacity.
Verify Loki's retention configuration:
grep -A 5 'table_manager\|compactor\|retention' /etc/loki/config.yaml
compactor:
working_directory: /var/loki/compactor
shared_store: filesystem
retention_enabled: true
limits_config:
retention_period: 336h
If `retention_enabled` is missing or set to `false`, Loki never cleans up old chunks. Enable it, set a retention period appropriate for your needs (336 hours — 14 days — is a sensible default for most environments), and restart Loki. The compactor will begin cleaning up expired chunks on its next scheduled run. Don't expect immediate disk reclamation; give it an hour and re-check.
Root Cause 7: Authentication and Tenant ID Errors
If your Loki deployment has multi-tenancy enabled (the `auth_enabled: true` setting), every push request must include an `X-Scope-OrgID` header. Promtail must be configured to send it. Without it, Loki rejects the push with a 401 or a generic error. I've seen this catch teams off guard when they stand up a new Loki instance and copy a working Promtail config from a single-tenant environment where the header was never required.
The Promtail config for tenant ID lives under the client block:
clients:
- url: http://192.168.10.25:3100/loki/api/v1/push
tenant_id: infrarunbook-admin
If you see errors like `no org id` in Loki's logs or Promtail gets back a `401 Unauthorized`, add the `tenant_id` field matching the org ID configured in Loki's tenant configuration. Alternatively, if you're intentionally running single-tenant and don't need auth, set `auth_enabled: false` in Loki's config and restart.
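A quick way to confirm the tenant setup end to end is to push one synthetic entry with the header set explicitly, bypassing Promtail entirely. This sketch uses Loki's standard push API; the tenant ID matches the example config above, and the timestamp must be current nanoseconds:

```shell
now=$(date +%s%N)   # Loki push timestamps are unix nanoseconds (GNU date)
curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 \
  -H 'Content-Type: application/json' \
  -H 'X-Scope-OrgID: infrarunbook-admin' \
  -X POST 'http://192.168.10.25:3100/loki/api/v1/push' \
  -d "{\"streams\":[{\"stream\":{\"job\":\"tenant-smoke-test\"},\"values\":[[\"$now\",\"tenant smoke test\"]]}]}"
# 204 = accepted; 401 = tenant/auth still wrong; 400 = malformed payload
```

If this manual push returns 204 but Promtail still fails, the problem is in Promtail's `tenant_id` config rather than on the Loki side.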
Prevention
Most Loki ingestion failures are detectable before they become visible outages, provided you have the right observability in place on the pipeline itself. The single most valuable metric to watch is `loki_distributor_lines_received_total`. Set an alert: if this counter stops incrementing for any stream that should be continuously active, something is wrong. You want to know in two minutes, not two hours.
For rate limit exposure, alert on the ratio of dropped entries in Promtail. The metric `promtail_dropped_entries_total` should be zero in steady state. A non-zero and rising value means either rate limits are being hit, the endpoint is unreachable, or out-of-order rejections are accumulating. Any of those conditions warrants immediate investigation.
Keep your Promtail position file on a persistent, backed-up volume — never on a tmpfs, a path that gets wiped on reboot, or anywhere that aggressive cleanup scripts might touch. The position file is the only thing standing between you and Promtail flooding Loki with re-read historical data after every restart. Treat it as important operational state.
Run `promtail -config.file=/etc/promtail/config.yaml -check-syntax` as an explicit validation step in your configuration management pipeline — Ansible, Chef, Puppet, whatever you use. A config file with a syntax error will kill Promtail on the next restart, which kills log ingestion for every service on that host. Adding this check takes thirty seconds and eliminates an entire class of outage.
Enforce label discipline across your fleet. Establish a fixed schema — `host`, `job`, `app`, `env` — and document it clearly so every team onboarding a new service knows exactly which fields are labels and which stay in the log body. Every high-cardinality value (request IDs, user IDs, trace IDs, session tokens) belongs inside the structured log line, not as a Loki label. Violating this will cause stream explosion, which leads directly to rate limit failures and query performance degradation that's expensive to reverse.
Finally, build synthetic end-to-end ingestion monitoring. A simple cron job on `sw-infrarunbook-01` that writes a known, timestamped sentinel entry to a file Promtail is watching, then queries Loki for that entry via `logcli` and alerts if it's not found within two minutes, gives you real end-to-end pipeline visibility. If that check goes red, you know immediately that ingestion is broken — and you have a clear starting point for every diagnostic step in this article.
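That cron check might look something like this sketch. The sentinel path, the `job="sentinel"` label, and the Loki address are all assumptions — Promtail must be configured to scrape the sentinel file under that label for the check to prove anything:

```shell
# Write a unique token, give the pipeline two minutes, then ask Loki for it
LOKI=http://192.168.10.25:3100
token="sentinel-$(date +%s)"

echo "ts=$(date -Is) token=$token" >> /var/log/sentinel/heartbeat.log
sleep 120

if logcli --addr="$LOKI" query '{job="sentinel"}' --since=5m --quiet | grep -q "$token"; then
  echo "ingestion pipeline OK"
else
  echo "ingestion pipeline BROKEN: token $token never reached Loki" >&2
  exit 1   # non-zero exit lets cron or your alerting wrapper page on failure
fi
```

Because the token passes through the entire chain — file, Promtail scrape, push, distributor, ingester, query path — a green result rules out every root cause in this article at once.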
