Symptoms
You open Grafana, switch to the Explore view, select your Loki datasource, and run a query that normally returns thousands of lines — and you get nothing. No entries. Not even an error, just silence. Or maybe you are getting an error: `context deadline exceeded` or `no org id` staring back at you. Either way, logs that should be flowing into Loki aren't arriving, and you need to figure out why quickly.
The most common symptoms when Loki ingestion breaks down:
- Grafana Explore returns empty results for queries that previously worked fine
- Log-based alerts stop firing even though the underlying condition is actively occurring
- Promtail logs fill with repeated error messages like `429 Too Many Requests` or `connection refused`
- The Loki `/metrics` endpoint shows `loki_distributor_lines_received_total` has stopped incrementing
- `logcli` returns zero results for streams you know are active
- Application teams report that their structured logs vanished from dashboards with no obvious explanation
The tricky thing about Loki ingestion failures is that they're often silent. Promtail keeps running, applications keep writing logs, but nothing makes it through the pipeline. Everything looks healthy on the surface until you actually query for data. This article walks through the most common culprits, how to identify each one precisely, and exactly how to fix them.
Root Cause 1: Promtail Not Running
This is the obvious one, but it bites people more often than they'd like to admit. Promtail is the agent that scrapes log files and ships them to Loki. If it's not running, nothing gets ingested — full stop. In my experience, this happens most frequently right after a system reboot that didn't restore the service, after a package update that silently replaced the unit file, or after a config change that introduced a syntax error and prevented the service from starting cleanly.
To check whether Promtail is actually running on `sw-infrarunbook-01`:
systemctl status promtail.service
If it's failed or inactive, you'll see something like:
● promtail.service - Promtail log shipping agent
Loaded: loaded (/etc/systemd/system/promtail.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2026-04-16 09:12:44 UTC; 3min 22s ago
Process: 4821 ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yaml (code=exited, status=1/FAILURE)
Main PID: 4821 (code=exited, status=1/FAILURE)
Pull the actual error from the journal to understand why it failed:
journalctl -u promtail.service -n 50 --no-pager
A config syntax error produces output like this in the journal:
Apr 16 09:12:44 sw-infrarunbook-01 promtail[4821]: level=error ts=2026-04-16T09:12:44.312Z caller=main.go:67 msg="error creating promtail" err="invalid config: yaml: line 34: mapping values are not allowed in this context"
Always validate your config before attempting a restart — saves you from bouncing the service only to have it fail again immediately:
promtail -config.file=/etc/promtail/config.yaml -check-syntax
Once the config passes validation, enable and start the service:
systemctl enable promtail.service
systemctl start promtail.service
systemctl status promtail.service
If Promtail keeps crashing in a loop even with a valid config, check file permissions on the log paths it's trying to tail. The user Promtail runs as must have read access to every file it's watching. A quick sanity check:
stat /var/log/syslog
ls -la /var/log/app/*.log
If Promtail runs as the `promtail` system user and the log files are owned by `root` with mode `640`, it can't read them. Depending on the version, it may fail silently rather than throwing an obvious permission denied error. Either add `promtail` to the appropriate group, or adjust the ACLs on the log directory.
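Either fix can be applied without loosening permissions for everyone else. A sketch, assuming the logs live under `/var/log/app` and that your distribution uses the `adm` group for log access — both are assumptions, so check the actual owner and group with `stat` first:

```shell
# Option 1: add the promtail user to the group that owns the logs
# ("adm" is an assumption -- verify with: stat /var/log/app/*.log)
usermod -aG adm promtail

# Option 2: a POSIX ACL scoped to the promtail user only
setfacl -R -m u:promtail:rX /var/log/app
# Default ACL so files created later inherit the grant
setfacl -d -m u:promtail:rX /var/log/app

# Promtail only picks up new group membership after a restart
systemctl restart promtail.service
```

The ACL route is generally safer on shared hosts, because it grants exactly one user read access rather than widening an entire group.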
Root Cause 2: Wrong Loki Endpoint
Promtail is running fine, but it's pointed at the wrong address. This is surprisingly common in environments where Loki was recently migrated to a new server, placed behind a reverse proxy, or switched from HTTP to HTTPS. The agent happily attempts to ship logs into the void and, depending on the error, may not complain loudly enough for anyone to notice until there's a gap in the data.
The Loki push endpoint in Promtail's config lives under the `clients` block:
clients:
- url: http://192.168.10.25:3100/loki/api/v1/push
Common mistakes I've seen: using port `3100` when Loki is actually behind nginx on port `80`, omitting the `/loki/api/v1/push` path suffix and pointing at the root URL, pointing at a dev Loki instance instead of production, or using an IP address that changed after a VM migration or re-IP event. The best way to catch this quickly is to look at Promtail's own logs for connection errors:
journalctl -u promtail.service --since "10 minutes ago" | grep -iE "error|failed|refused|404|429"
A wrong endpoint will produce output like:
level=warn ts=2026-04-16T09:45:11.803Z caller=client.go:349 component=client host=192.168.10.99:3100 msg="error sending batch, will retry" status=0 err="Post \"http://192.168.10.99:3100/loki/api/v1/push\": dial tcp 192.168.10.99:3100: connect: connection refused"
Or if the host resolves but the path is wrong:
level=warn ts=2026-04-16T09:45:11.803Z caller=client.go:349 component=client host=192.168.10.25:3100 msg="error sending batch, will retry" status=404 err="server returned HTTP status 404 Not Found"
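Because these warn lines share a common shape, a small classifier can turn a journal tail into a first-pass diagnosis. This is a sketch, not part of Promtail — the patterns simply match the error strings shown throughout this article:

```shell
# Map "error sending batch" lines to a likely root cause
classify() {
  awk '
    /connection refused/             { print "likely cause: wrong host/port, or Loki is down"; next }
    /status=404/                     { print "likely cause: wrong push path or reverse-proxy prefix"; next }
    /status=429/                     { print "likely cause: ingestion rate limit hit"; next }
    /timestamp too old|out of order/ { print "likely cause: out-of-order entries rejected"; next }
  '
}

# Demonstrate on a sample line; in practice, pipe the journal in:
#   journalctl -u promtail.service --since "10 minutes ago" | classify
printf '%s\n' 'msg="error sending batch, will retry" status=0 err="dial tcp 192.168.10.99:3100: connect: connection refused"' | classify
```

Each later root cause in this article maps to one of these branches, so the first match usually tells you which section to jump to.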
Verify Loki is actually reachable at the expected address from the Promtail host before editing any config:
curl -v http://192.168.10.25:3100/ready
If Loki is running properly, you'll get back a `200 OK` with the body `ready`. If you're getting a 404 specifically on the push endpoint, also check whether you're running Loki behind a reverse proxy with a path prefix. Some configurations serve Loki at `/loki/` on the proxy, which means the full push URL becomes `http://192.168.10.25:80/loki/loki/api/v1/push` — yes, with `loki` doubled. It looks wrong, but it's correct in that scenario.
Fix the URL in `/etc/promtail/config.yaml`, then restart Promtail and tail the logs to confirm entries are flowing:
systemctl restart promtail.service
journalctl -u promtail.service -f
You should see lines like `msg="successfully sent batch"` appearing within seconds if the endpoint is now correct and Loki is reachable.
Root Cause 3: Label Mismatch
Loki's data model is fundamentally label-based. Every log stream is uniquely identified by a set of labels, and those labels define the stream's identity permanently once created. When Promtail ships a batch with a label set that doesn't match the schema of an existing stream — or when a pipeline stage produces unexpected label combinations — you'll end up with rejected entries, duplicate streams, or query results that don't match what you expect.
One common scenario: you rename a label in your Promtail pipeline stages partway through the day, say changing `job` to `app`. Loki already has an open stream for that log source with the old label schema. The new batches arrive with a different label fingerprint, and instead of appending cleanly to the existing stream, Loki creates a new one. Old queries using `{job="webserver"}` stop returning new data because the stream has effectively diverged.
Another classic problem is high-cardinality labels — using a user ID, request UUID, pod IP, or any other value that changes per request as a Loki label. This explodes the number of streams Loki has to manage, eventually triggering limits and causing ingestion to stall entirely.
To see what labels Promtail is actually attaching, enable debug logging temporarily:
promtail -config.file=/etc/promtail/config.yaml -log.level=debug 2>&1 | grep -i "labels\|stream"
You can also query Loki directly to see the full label set it has indexed:
curl http://192.168.10.25:3100/loki/api/v1/labels | jq .
{
"status": "success",
"data": [
"app",
"env",
"host",
"job"
]
}
To detect a cardinality problem, check the number of unique values for a label that should be low-cardinality:
curl "http://192.168.10.25:3100/loki/api/v1/label/host/values" | jq '.data | length'
If that returns hundreds or thousands for a label that should have a handful, you've found the problem. Fix it in the Promtail pipeline stage by dropping the high-cardinality label and keeping that value inside the log line body as a structured field instead:
pipeline_stages:
- json:
expressions:
request_id: request_id
level: level
- labels:
app:
env:
level:
# request_id stays in the log body — never make it a label
After fixing the label config, restart Promtail. Cleaning up existing high-cardinality streams in Loki requires either using the Loki admin delete API or waiting for the retention period to expire and compact them away.
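To make the cardinality check systematic rather than one label at a time, you can sweep every label Loki reports and flag the suspicious ones. A sketch, assuming the Loki address used throughout this article, that `jq` is installed, and a threshold of 50 unique values (tune that to your schema):

```shell
# Count the entries in a Loki label-values API response
count_values() { jq '.data | length'; }

LOKI=http://192.168.10.25:3100
THRESHOLD=50

# Walk every label Loki knows about and flag any with too many values
for label in $(curl -s --max-time 5 "$LOKI/loki/api/v1/labels" | jq -r '.data[]'); do
  n=$(curl -s --max-time 5 "$LOKI/loki/api/v1/label/$label/values" | count_values)
  if [ "$n" -gt "$THRESHOLD" ]; then
    echo "SUSPECT: label '$label' has $n unique values"
  fi
done
```

Run it before and after the pipeline fix: the suspect labels should stop accumulating new values once Promtail no longer attaches them.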
Root Cause 4: Out of Order Entries
This one trips up engineers who are shipping logs from multiple sources, replaying archived log data, or dealing with hosts that have drifted clocks. Within a single stream, Loki expects timestamps to move forward: depending on version and configuration, entries that arrive out of order or older than the configured acceptance window are rejected outright.
Why does this happen in practice? NTP drift between hosts producing logs, log files being re-read from the beginning by Promtail after a crash or restart that wiped the position file, log aggregators that buffer and reorder entries before forwarding, and manual imports of archived log data are all common triggers. The error message in Promtail's logs is descriptive when this is the cause:
level=warn ts=2026-04-16T10:03:22.194Z caller=client.go:349 component=client host=192.168.10.25:3100 msg="error sending batch, will retry" status=400 err="rpc error: code = Code(400) desc = entry for stream '{app=\"webserver\", env=\"prod\", host=\"sw-infrarunbook-01\"}' has timestamp too old: 2026-04-16T09:55:01Z, oldest acceptable timestamp is 2026-04-16T10:03:00Z"
HTTP status `400` combined with the phrase "timestamp too old" or "entry out of order" makes this unambiguous. Note that unlike rate limit errors (which return 429 and trigger retries), out-of-order rejections are permanent — those log entries are lost unless you can re-deliver them after adjusting Loki's tolerance window.
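To pick a sensible tolerance window, it helps to know how far behind the rejected entries actually are. A small sketch (GNU `date` assumed) that pulls both timestamps out of a rejection message and prints the gap in seconds:

```shell
# Extract "timestamp too old: X, oldest acceptable timestamp is Y" pairs
# and print Y - X in seconds
gap_seconds() {
  sed -n 's/.*timestamp too old: \([^,]*\), oldest acceptable timestamp is \([^"]*\).*/\1 \2/p' |
  while read -r rejected acceptable; do
    echo $(( $(date -d "$acceptable" +%s) - $(date -d "$rejected" +%s) ))
  done
}

# The sample rejection from above is 479 seconds behind the window:
echo 'err="... has timestamp too old: 2026-04-16T09:55:01Z, oldest acceptable timestamp is 2026-04-16T10:03:00Z"' | gap_seconds
```

Pipe the journal through it to see the worst-case gap across many rejections, then set the acceptance window just wide enough to cover it rather than reaching straight for a week.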
On the Loki server side, you can widen the acceptance window by adjusting `reject_old_samples_max_age` in `limits_config`:
limits_config:
reject_old_samples: true
reject_old_samples_max_age: 168h
Be careful — setting this too loosely can mask genuine clock problems and make future debugging harder. Use the narrowest window that fixes the actual symptom.
On the Promtail side, make sure the position file is on a persistent path so it survives restarts without re-reading log files from the beginning:
positions:
filename: /var/lib/promtail/positions.yaml
If the positions file gets deleted — which happens during certain upgrade procedures or aggressive `/var/lib` cleanups — Promtail starts reading every configured log file from offset zero on next start. For active high-volume logs, this floods Loki with entries that are days old, arriving far out of order relative to what's already been ingested. Restore the positions file from backup if you have one, or add `tail_from_end: true` to your scrape configs to start from the current tail on first read rather than the beginning.
For NTP drift issues, verify clock sync on all log-producing hosts:
timedatectl status
chronyc tracking
If the offset is more than a second or two, fix NTP synchronization first. Everything downstream of a drifted clock becomes unreliable.
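The interesting number in `chronyc tracking` output is the "System time" offset. A sketch that parses it and flags anything over one second — the field layout assumed here is chrony's standard `System time : X seconds fast/slow of NTP time` line:

```shell
# Flag a drifted clock from `chronyc tracking` output
flag_drift() {
  awk '/^System time/ {
    offset = $4 + 0   # magnitude in seconds; the direction (fast/slow) is in $6
    if (offset > 1) print "DRIFT: clock is off by " offset " seconds -- fix NTP first"
    else            print "OK: offset " offset "s"
  }'
}

chronyc tracking | flag_drift
```

Dropped into a fleet-wide check (Ansible ad-hoc, SSH loop), this finds the one drifted host producing out-of-order entries before you start tuning Loki's limits around it.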
Root Cause 5: Rate Limit Hit
Loki ships with ingestion rate limits enabled by default, and they're conservative by design. Under steady-state operations you won't hit them. The moment you deploy a newly verbose microservice, start tailing a debug log that's producing ten thousand lines per second, kick off a batch job that dumps gigabytes of application output, or add ten new hosts all pointing at the same Loki instance simultaneously — you'll slam into the limits and start dropping entries.
The error in Promtail's logs when rate limits are hit is hard to miss:
level=warn ts=2026-04-16T11:22:05.441Z caller=client.go:349 component=client host=192.168.10.25:3100 msg="error sending batch, will retry" status=429 err="rpc error: code = Code(429) desc = ingestion rate limit (4194304 bytes/s) exceeded while adding 65536 bytes for user 'fake', reduce log volume or contact your Loki administrator"
Status `429` with the phrase "ingestion rate limit exceeded" is the key. Unlike 400 errors for out-of-order entries, Promtail will retry 429 responses with backoff — so in some cases the logs will eventually make it through if the rate normalizes. But if the elevated rate persists, Promtail's internal batch buffer fills up, retries accumulate, and you start losing entries permanently.
Check the current rate limit configuration on the Loki server:
grep -A 10 'limits_config' /etc/loki/config.yaml
limits_config:
ingestion_rate_mb: 4
ingestion_burst_size_mb: 6
per_stream_rate_limit: 3MB
per_stream_rate_limit_burst: 15MB
The default `ingestion_rate_mb` of 4 MB/s is per tenant in multi-tenant mode, or global in single-tenant mode. For anything beyond a handful of lightly active services, that ceiling is low. Raise it to match your actual volume with some headroom:
limits_config:
ingestion_rate_mb: 32
ingestion_burst_size_mb: 64
per_stream_rate_limit: 10MB
per_stream_rate_limit_burst: 30MB
Restart Loki after changing its config and confirm it came up cleanly:
systemctl restart loki.service
journalctl -u loki.service -n 20 --no-pager
Raising limits is the fast fix, but you should also find out why the rate spiked in the first place. Identify which stream is responsible by checking Loki's metrics endpoint:
curl -s http://192.168.10.25:3100/metrics | grep loki_distributor_bytes_received_total
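To see how close you actually are to the ceiling, sample the byte counter twice and convert the delta to MB/s, which is directly comparable to `ingestion_rate_mb`. A sketch assuming the same Loki address; the ten-second window is arbitrary:

```shell
LOKI=http://192.168.10.25:3100

# Sum the distributor byte counter across all labels/tenants
sample_bytes() {
  curl -s --max-time 5 "$LOKI/metrics" |
    awk '/^loki_distributor_bytes_received_total/ { sum += $NF } END { print sum + 0 }'
}

a=$(sample_bytes)
sleep 10
b=$(sample_bytes)

# Counter delta over the window, converted to MB/s
awk -v a="$a" -v b="$b" 'BEGIN { printf "ingest rate: %.2f MB/s\n", (b - a) / 10 / 1048576 }'
```

If the printed rate sits at or above your configured `ingestion_rate_mb`, raising the limit alone only buys headroom — the volume itself still needs investigating.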
If one specific app or host label is generating the overwhelming bulk of the volume, consider adding a Promtail pipeline stage to drop unnecessary log levels before they ever reach Loki. This is cleaner than just opening up the rate limit ceiling:
pipeline_stages:
- match:
selector: '{app="verbose-service"}'
stages:
- drop:
expression: ".*level=debug.*"
- drop:
expression: ".*level=trace.*"
Drop noisy log levels at the agent, not at query time. Query-time filtering doesn't reduce your ingestion cost — the data is already in Loki eating storage and bandwidth.
Root Cause 6: Filesystem Full on the Loki Host
Simple and brutal. If the disk where Loki stores its chunks and index fills up, ingestion stops. Depending on your Loki version, the error messages can be surprisingly unhelpful — sometimes appearing as generic internal server errors rather than clearly pointing to disk exhaustion. This one tends to cause confusion because the root cause is completely outside the Loki configuration itself.
Check disk usage on the Loki storage path:
df -h /var/loki
du -sh /var/loki/chunks /var/loki/index /var/loki/boltdb-shipper-active
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 50G 49G 512M 99% /var/loki
If you're at 99%, that's your problem. Short-term: check whether retention is configured and actually deleting old data. Long-term: expand the volume or move Loki's storage backend to object storage so you're not constrained by local disk capacity.
Verify Loki's retention configuration:
grep -A 5 'table_manager\|compactor\|retention' /etc/loki/config.yaml
compactor:
working_directory: /var/loki/compactor
shared_store: filesystem
retention_enabled: true
limits_config:
retention_period: 336h
If `retention_enabled` is missing or set to `false`, Loki never cleans up old chunks. Enable it, set a retention period appropriate for your needs (336 hours — 14 days — is a sensible default for most environments), and restart Loki. The compactor will begin cleaning up expired chunks on its next scheduled run. Don't expect immediate disk reclamation; give it an hour and re-check.
Root Cause 7: Authentication and Tenant ID Errors
If your Loki deployment has multi-tenancy enabled (the `auth_enabled: true` setting), every push request must include an `X-Scope-OrgID` header. Promtail must be configured to send it. Without it, Loki rejects the push with a 401 or a generic error. I've seen this catch teams off guard when they stand up a new Loki instance and copy a working Promtail config from a single-tenant environment where the header was never required.
The Promtail config for tenant ID lives under the client block:
clients:
- url: http://192.168.10.25:3100/loki/api/v1/push
tenant_id: infrarunbook-admin
If you see errors like `no org id` in Loki's logs or Promtail gets back a `401 Unauthorized`, add the `tenant_id` field matching the org ID configured in Loki's tenant configuration. Alternatively, if you're intentionally running single-tenant and don't need auth, set `auth_enabled: false` in Loki's config and restart.
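A quick way to confirm the tenant setup end to end is to push one synthetic entry with the header set explicitly, bypassing Promtail entirely. This sketch uses Loki's standard push API; the tenant ID matches the example config above, and the timestamp must be current nanoseconds:

```shell
now=$(date +%s%N)   # Loki push timestamps are unix nanoseconds (GNU date)
curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 \
  -H 'Content-Type: application/json' \
  -H 'X-Scope-OrgID: infrarunbook-admin' \
  -X POST 'http://192.168.10.25:3100/loki/api/v1/push' \
  -d "{\"streams\":[{\"stream\":{\"job\":\"tenant-smoke-test\"},\"values\":[[\"$now\",\"tenant smoke test\"]]}]}"
# 204 = accepted; 401 = tenant/auth still wrong; 400 = malformed payload
```

If this manual push returns 204 but Promtail still fails, the problem is in Promtail's `tenant_id` config rather than on the Loki side.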
Prevention
Most Loki ingestion failures are detectable before they become visible outages, provided you have the right observability in place on the pipeline itself. The single most valuable metric to watch is `loki_distributor_lines_received_total`. Set an alert: if this counter stops incrementing for any stream that should be continuously active, something is wrong. You want to know in two minutes, not two hours.
For rate limit exposure, alert on the ratio of dropped entries in Promtail. The metric `promtail_dropped_entries_total` should be zero in steady state. A non-zero and rising value means either rate limits are being hit, the endpoint is unreachable, or out-of-order rejections are accumulating. Any of those conditions warrants immediate investigation.
Keep your Promtail position file on a persistent, backed-up volume — never on a tmpfs, a path that gets wiped on reboot, or anywhere that aggressive cleanup scripts might touch. The position file is the only thing standing between you and Promtail flooding Loki with re-read historical data after every restart. Treat it as important operational state.
Run `promtail -config.file=/etc/promtail/config.yaml -check-syntax` as an explicit validation step in your configuration management pipeline — Ansible, Chef, Puppet, whatever you use. A config file with a syntax error will kill Promtail on the next restart, which kills log ingestion for every service on that host. Adding this check takes thirty seconds and eliminates an entire class of outage.
Enforce label discipline across your fleet. Establish a fixed schema — `host`, `job`, `app`, `env` — and document it clearly so every team onboarding a new service knows exactly which fields are labels and which stay in the log body. Every high-cardinality value (request IDs, user IDs, trace IDs, session tokens) belongs inside the structured log line, not as a Loki label. Violating this will cause stream explosion, which leads directly to rate limit failures and query performance degradation that's expensive to reverse.
Finally, build synthetic end-to-end ingestion monitoring. A simple cron job on `sw-infrarunbook-01` that writes a known, timestamped sentinel entry to a file Promtail is watching, then queries Loki for that entry via `logcli` and alerts if it's not found within two minutes, gives you real end-to-end pipeline visibility. If that check goes red, you know immediately that ingestion is broken — and you have a clear starting point for every diagnostic step in this article.
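That cron check might look something like this sketch. The sentinel path, the `job="sentinel"` label, and the Loki address are all assumptions — Promtail must be configured to scrape the sentinel file under that label for the check to prove anything:

```shell
# Write a unique token, give the pipeline two minutes, then ask Loki for it
LOKI=http://192.168.10.25:3100
token="sentinel-$(date +%s)"

echo "ts=$(date -Is) token=$token" >> /var/log/sentinel/heartbeat.log
sleep 120

if logcli --addr="$LOKI" query '{job="sentinel"}' --since=5m --quiet | grep -q "$token"; then
  echo "ingestion pipeline OK"
else
  echo "ingestion pipeline BROKEN: token $token never reached Loki" >&2
  exit 1   # non-zero exit lets cron or your alerting wrapper page on failure
fi
```

Because the token passes through the entire chain — file, Promtail scrape, push, distributor, ingester, query path — a green result rules out every root cause in this article at once.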
