Symptoms
You've deployed Fluent Bit across your fleet and pointed it at a downstream destination — Elasticsearch, Loki, Splunk, or a remote syslog receiver — but logs aren't showing up. The service is running,
systemctl status fluent-bit reports active, but the destination is silent. Maybe forwarding worked fine last week, and now there's a gap in your dashboards starting at some arbitrary hour nobody can explain.
Here's what the failure surface usually looks like. The Fluent Bit process is alive but
output.errors on the internal metrics endpoint keeps climbing. Or the process is running and metrics look completely flat — zero records in, zero records out — which is its own kind of broken. You tail
/var/log/fluent-bit.log and see a wall of retry messages, or worse, nothing at all. Some sources forward fine while others are silently dropped. Running
fluent-bit -c /etc/fluent-bit/fluent-bit.conf --dry-run exits clean, which suggests the config is correct — but that's a trap. Dry-run validates syntax, not runtime behavior.
The root causes below cover the vast majority of forwarding failures I've seen in production. Work through them systematically rather than tweaking random knobs.
Root Cause 1: Output Plugin Misconfigured
This is the most common cause of logs-not-forwarding bugs, and it's embarrassing how often it catches experienced engineers. The output plugin block has a typo, a wrong port, or a parameter that silently gets ignored because it's not a recognized key for that plugin.
Why it happens: Fluent Bit's configuration parser is permissive. Unknown keys don't cause a startup failure — they're silently dropped. So if you write the wrong port number, reference an Elasticsearch 7 parameter against an Elasticsearch 8 endpoint, or misname an option, Fluent Bit starts fine and then either connects to the wrong place or sends malformed payloads the destination rejects without making noise about it.
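To make that concrete, here is a hypothetical block with a deliberately misspelled key. Fluent Bit starts cleanly and simply ignores the unknown key:

```
[OUTPUT]
Name es
Match *
# Typo: 'Hosts' is not a recognized key for the es plugin, so Fluent Bit
# drops it silently and falls back to the default Host of 127.0.0.1.
Hosts 10.10.1.45
Port 9200
```

No warning, no error — the plugin just connects to the wrong address.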
How to identify it. First, enable debug logging and pipe it somewhere you can read:
fluent-bit -c /etc/fluent-bit/fluent-bit.conf -vv 2>&1 | tee /tmp/fb-debug.log
Look for lines like these in the output:
[2024/10/14 03:22:17] [ warn] [output:es:es.0] could not connect to 10.10.1.45:9200
[2024/10/14 03:22:17] [error] [output:es:es.0] HTTP status=400 URI=/fluent-bit/_doc, response:
{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}]}}
A 400 from Elasticsearch almost always means the index mapping is wrong or you're sending a field Elasticsearch doesn't expect. An HTTP 404 usually means the
Index parameter points to a non-existent index or the path is wrong. Connection refused means the host or port is wrong. A common misconfigured block looks like this:
[OUTPUT]
Name es
Match *
Host 10.10.1.45
Port 9300 # Wrong: 9300 is the transport port, not the HTTP port
Index fluent-bit
Type _doc
How to fix it. Cross-reference the Fluent Bit docs for exact plugin parameter names — they're case-insensitive but spelling matters. For Elasticsearch the HTTP port is 9200. Use
curl to independently verify the endpoint is reachable from sw-infrarunbook-01 before blaming Fluent Bit:
curl -v http://10.10.1.45:9200/_cluster/health
A corrected output block for Elasticsearch 8.x looks like this:
[OUTPUT]
Name es
Match *
Host 10.10.1.45
Port 9200
Index fluent-bit
Suppress_Type_Name On
The
Suppress_Type_Name flag is required for Elasticsearch 8.x — without it you'll get 400 errors on every write because type mappings were removed. If you're on ES 7.x, drop that flag.
Root Cause 2: TLS Certificate Not Trusted
If your log destination uses HTTPS and the certificate is self-signed, issued by an internal CA, or expired, Fluent Bit will refuse the connection. In my experience this is the second most common cause of silent forwarding failure, and it's particularly nasty because the error messages aren't always obvious — depending on retry configuration, Fluent Bit may just keep retrying in the background without prominently surfacing why.
Why it happens: Fluent Bit validates server certificates against the system trust store. If your internal CA cert isn't in that store — or if the cert has expired — the TLS handshake fails and records queue up until the buffer fills and they start dropping. The service keeps running. Nothing crashes. You just stop getting logs.
How to identify it. Run with debug verbosity and filter for TLS-related output:
fluent-bit -c /etc/fluent-bit/fluent-bit.conf -vv 2>&1 | grep -i 'tls\|ssl\|cert'
The lines you're looking for:
[2024/10/14 04:11:02] [error] [tls] error: certificate verify failed
[2024/10/14 04:11:02] [error] [output:http:http.0] could not connect to 10.10.1.50:9243
Independently verify the certificate chain from sw-infrarunbook-01 before touching Fluent Bit config:
openssl s_client -connect 10.10.1.50:9243 -CAfile /etc/ssl/certs/ca-bundle.crt 2>&1 | head -30
If you see
Verify return code: 21 (unable to verify the first certificate) or
certificate has expired, you've confirmed the problem is the cert chain, not Fluent Bit itself.
How to fix it. The right approach is to add your internal CA to the system trust store and reference it in your Fluent Bit output block:
# Add the internal CA to the system trust store
cp /etc/pki/internal-ca.crt /usr/local/share/ca-certificates/internal-ca.crt
update-ca-certificates
# Reference it explicitly in the output plugin
[OUTPUT]
Name http
Match *
Host 10.10.1.50
Port 9243
tls On
tls.verify On
tls.ca_file /usr/local/share/ca-certificates/internal-ca.crt
The wrong approach — and I see it in production configs more often than I'd like — is setting
tls.verify Off. That disables certificate validation entirely. You're encrypting the channel but not authenticating the server, which defeats a core purpose of TLS. Don't do this in production. Fix the cert trust instead.
If the certificate is expired, get it renewed. There's no workaround for an expired cert that isn't a security regression. Set a calendar reminder to check cert expiry 30 days out on every log forwarding endpoint — catching this before Fluent Bit does is considerably less stressful.
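One way to script that reminder is openssl x509 -checkend, which exits non-zero once a certificate is inside the given expiry window. This is a sketch assuming a POSIX shell with openssl available; the host and port in the usage comment are the example values from this runbook:

```shell
# cert_expiring_soon: read a PEM certificate on stdin; return 0 (true)
# if it expires within DAYS days (default 30), non-zero otherwise.
cert_expiring_soon() {
    days="${1:-30}"
    # openssl x509 -checkend exits 0 while the cert remains valid that
    # many seconds out, so invert it to mean "expiring soon".
    ! openssl x509 -noout -checkend "$((days * 86400))" >/dev/null
}

# Usage against a live endpoint (10.10.1.50:9243 is this runbook's example):
#   openssl s_client -connect 10.10.1.50:9243 </dev/null 2>/dev/null \
#       | cert_expiring_soon 30 && echo "WARNING: cert expires within 30 days"
```

Run it from cron against every forwarding endpoint and you get the 30-day warning without relying on anyone's calendar.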
Root Cause 3: Buffer Full
Fluent Bit uses in-memory and optional filesystem buffers to absorb bursts and handle backpressure. When the downstream destination is slow, down, or rejecting records, Fluent Bit retries and those records sit in the buffer. If the buffer fills up, new incoming records are dropped. The service keeps running, nothing crashes, but your logs stop forwarding and nobody gets paged about it.
Why it happens: the default
Mem_Buf_Limit for most inputs is around 5MB. In a high-volume environment or during a destination outage, that fills up in minutes. Once the limit is hit, Fluent Bit emits a warning and pauses the input — new records written to the log file are no longer ingested. The file data isn't lost, but Fluent Bit falls behind, and if the pause lasts long enough for files to rotate away, you miss those records entirely.
How to identify it. Watch Fluent Bit logs for this specific pattern:
[2024/10/14 05:43:10] [ warn] [input] pausing tail.0 (mem buf overlimit)
[2024/10/14 05:43:10] [ warn] [input] resume tail.0 (mem buf overlimit)
The rapid pausing and resuming cycle is the telltale sign of buffer pressure. Also check the metrics endpoint if you've enabled it:
curl -s http://127.0.0.1:2020/api/v1/metrics/v2 | python3 -m json.tool
{
"output": {
"es.0": {
"proc_records": 14203,
"proc_bytes": 8921043,
"errors": 892,
"retries": 1204,
"retries_failed": 156,
"dropped_records": 156
}
}
}
Any non-zero
dropped_records value means you've already lost data. This is a production incident, not just a config tweak.
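If you want to catch this automatically, here is a minimal sketch that assumes the metrics JSON has the shape shown above and that python3 is available on the host. It reads the metrics payload on stdin and exits non-zero when any output reports drops:

```shell
# check_drops: read the Fluent Bit metrics JSON on stdin; print any outputs
# reporting dropped records and return non-zero if there are any.
check_drops() {
    python3 -c '
import json, sys

metrics = json.load(sys.stdin)
bad = {name: out["dropped_records"]
       for name, out in metrics.get("output", {}).items()
       if out.get("dropped_records", 0) > 0}
if bad:
    print("dropped records detected:", bad)
    sys.exit(1)
print("no dropped records")
'
}

# Usage against a live agent:
#   curl -s http://127.0.0.1:2020/api/v1/metrics/v2 | check_drops \
#       || echo "ALERT: records dropped"
```

Wire the non-zero exit into whatever already pages you; the point is that a full buffer stops being invisible.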
How to fix it. Increase the
Mem_Buf_Limit and switch to filesystem buffering so records survive both restarts and destination outages:
[SERVICE]
storage.path /var/lib/fluent-bit/buffer
storage.sync normal
storage.checksum off
storage.max_chunks_up 128
[INPUT]
Name tail
Path /var/log/app/*.log
Tag app.*
Mem_Buf_Limit 50MB
storage.type filesystem
Also tune the retry behavior in your output plugin. Setting
Retry_Limit to
False tells Fluent Bit to retry indefinitely rather than dropping records after a fixed number of attempts — appropriate when you'd rather deliver late than lose data once the destination recovers:
[OUTPUT]
Name es
Match *
Host 10.10.1.45
Port 9200
Retry_Limit False
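If you do enable indefinite retries, it is worth capping how large the filesystem buffer for that output can grow. With the filesystem storage from the service config above in place, the storage.total_limit_size output option bounds the on-disk chunks queued for that output (oldest chunks are discarded once the limit is hit); the 2G below is an arbitrary example, so size it to your own log volume and disk headroom:

```
[OUTPUT]
Name es
Match *
Host 10.10.1.45
Port 9200
Retry_Limit False
# Cap on-disk buffered chunks for this output during a long outage.
storage.total_limit_size 2G
```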
A full buffer is always a symptom, not the root cause. Something downstream is slow or broken, and Fluent Bit is absorbing the pressure. Fix the destination issue, then tune the buffer to give yourself headroom the next time a downstream hiccup happens.
Root Cause 4: Parser Not Matching Log Format
Parsers are the part of Fluent Bit configs that most people set up once and forget — until logs start arriving at Elasticsearch as a single unparsed blob, filter rules stop working, or structured fields you expect to query on don't exist. When a parser doesn't match, Fluent Bit typically passes the raw line through as a single
log field. Records are still forwarded, but they're useless for anything beyond archival.
Why it happens: log formats change. An application update switches the timestamp from ISO 8601 to syslog format, and your regex parser silently stops matching. Someone adds multiline Java stack traces and the tail plugin splits each line into a hundred individual events. Or the parser was written against a test log that didn't represent the full range of real output — missing fields, extra spaces, different HTTP methods — and only fails on production traffic.
How to identify it. The fastest diagnostic is adding a temporary stdout output plugin to see what Fluent Bit is actually producing:
[OUTPUT]
Name stdout
Match *
Format json_lines
Run Fluent Bit and watch the output. If you see records like this, the parser is not matching:
{"date":1728872537.0,"log":"Oct 14 03:22:17 sw-infrarunbook-01 nginx: 10.10.1.10 - infrarunbook-admin [14/Oct/2024:03:22:17 +0000] GET /api/health HTTP/1.1 200 45"}
The entire log line is dumped into a single
log string instead of discrete fields like
remote_addr,
status, and
body_bytes_sent. You can also test a parser directly from the command line:
echo '10.10.1.10 - infrarunbook-admin [14/Oct/2024:03:22:17 +0000] GET /health HTTP/1.1 200 45' \
| fluent-bit -R /etc/fluent-bit/parsers.conf -i stdin -p parser=nginx -o stdout -p format=json_lines
If the parser matches, you'll get structured JSON fields back. If it doesn't, you'll get the raw input passed through unchanged.
How to fix it. Validate your regex against actual log samples before deploying. A common nginx combined log parser that works against real-world output:
[PARSER]
Name nginx
Format regex
Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)] (?<method>\S+) (?<path>\S+) \S+ (?<code>[^ ]*) (?<size>[^ ]*)
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
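One quick way to vet that regex against a real sample line without a full Fluent Bit run is python3's re module. Note this is only an approximation of Fluent Bit's Onigmo engine, and python spells named groups (?P<name>...) rather than (?<name>...):

```shell
# Feed the sample line through the same pattern as the [PARSER] block above,
# translated to python named-group syntax, and print the captured fields.
python3 - <<'EOF'
import re

pattern = (r'^(?P<remote>[^ ]*) (?P<host>[^ ]*) (?P<user>[^ ]*) '
           r'\[(?P<time>[^\]]*)\] (?P<method>\S+) (?P<path>\S+) \S+ '
           r'(?P<code>[^ ]*) (?P<size>[^ ]*)')

line = ('10.10.1.10 - infrarunbook-admin [14/Oct/2024:03:22:17 +0000] '
        'GET /health HTTP/1.1 200 45')

match = re.match(pattern, line)
assert match, "parser regex did not match the sample line"
print(match.groupdict())
EOF
```

If the assert fires, iterate on the pattern here before touching parsers.conf.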
For multiline logs like Java stack traces, use the multiline parser feature rather than trying to reassemble them in a filter:
[MULTILINE_PARSER]
name java_multiline
type regex
flush_timeout 1000
rule "start_state" "/^\d{4}-\d{2}-\d{2}/" "java_after_ts"
rule "java_after_ts" "/^(\s+at|\s+\.{3}\s+\d+)/" "java_after_ts"
[INPUT]
Name tail
Path /var/log/app/service.log
Tag app.service
multiline.parser java_multiline
After any parser change, run the command-line test utility against a representative sample of real log data — not just a clean happy-path example you wrote yourself — before rolling it to production.
Root Cause 5: Permission Denied Reading Log File
Fluent Bit runs as a non-root user in most production deployments, and log files written by applications often have restrictive permissions. If Fluent Bit can't read a file, it silently skips it. No crash, no retry, no error in the destination — just nothing.
Why it happens: the tail input plugin tries to open log files at startup and whenever new files match the configured glob. If the Fluent Bit process user doesn't have read permission on the file, or execute permission on a parent directory, the file is simply not watched. This is particularly common when log files are owned by application-specific users like
www-data,
postgres, or
nginx, or when log directories use restrictive permission bits that block traversal by other users.
How to identify it. Enable debug logging and filter for permission and open errors:
fluent-bit -c /etc/fluent-bit/fluent-bit.conf -vv 2>&1 | grep -iE 'permission|denied|open|inode'
You're looking for output like this:
[2024/10/14 06:01:33] [ warn] [input:tail:tail.0] error opening file /var/log/app/service.log: Permission denied
Find what user Fluent Bit runs as, then manually test access as that user:
# Check the systemd service unit
grep -i user /lib/systemd/system/fluent-bit.service
# Test access as the fluent-bit service user
sudo -u fluent ls -la /var/log/app/
sudo -u fluent cat /var/log/app/service.log
Don't just check the file — check the entire directory path. A file can be world-readable but completely unreachable if a parent directory doesn't allow execute for the Fluent Bit user. Use
namei to trace the full permission chain:
namei -l /var/log/app/service.log
f: /var/log/app/service.log
drwxr-xr-x root root /
drwxr-xr-x root root var
drwxr-x--- root syslog log <-- Fluent Bit user cannot enter this directory
drwxr-x--- syslog syslog app
-rw-r----- syslog syslog service.log
That
drwxr-x--- on
/var/log is the culprit — it only allows the
syslog group to traverse, so any user not in that group is blocked before even reaching the file.
How to fix it. The cleanest solution is adding the Fluent Bit service user to the group that owns the log directory and files:
# If Fluent Bit runs as user 'fluent' and logs are owned by group 'syslog'
usermod -aG syslog fluent
systemctl restart fluent-bit
# Verify the fix
sudo -u fluent ls -la /var/log/app/
For application logs owned by a dedicated app user where changing group ownership isn't practical, use POSIX ACLs. The default ACL ensures new log files created by log rotation are automatically accessible too:
# Grant read + directory traversal to the fluent-bit user
setfacl -R -m u:fluent:rX /var/log/app/
# Apply the same ACL to all future files and directories created under /var/log/app/
setfacl -R -d -m u:fluent:rX /var/log/app/
# Verify
getfacl /var/log/app/service.log
Avoid making log files world-readable (
chmod o+r) as a shortcut — logs often contain sensitive data. Group membership or ACLs give you the access you need without oversharing.
Root Cause 6: Tag and Match Misconfiguration
Every record in Fluent Bit carries a tag, and output plugins use
Match patterns to decide which records they process. If your tags and match patterns don't align, records are ingested and parsed correctly but never forwarded — they fall into the void at the routing stage and are silently discarded.
In debug output, you'll see input metrics incrementing while output metrics stay completely flat. The quickest diagnostic is adding a temporary stdout output with
Match * — if records appear there but not in your real output, the match pattern is the problem. Then check what tags are actually being generated:
fluent-bit -c /etc/fluent-bit/fluent-bit.conf -vv 2>&1 | grep engine
[2024/10/14 06:15:22] [debug] [engine] flush chunk with tag 'app.service.production' to output 'es.0'
If your output has
Match app.*, that tag matches fine. If it has
Match application.* or
Match app.service without a wildcard that covers
app.service.production, those records are dropped. Fix the match pattern or adjust the tag in your input block to align with your existing output rules.
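A minimal aligned pair looks like this (hypothetical paths and tag prefix, mirroring the examples above). The trailing wildcard on the Match covers every tag the tail input generates under the app. prefix:

```
[INPUT]
Name tail
Path /var/log/app/*.log
# tail expands the * per watched file, so every record still carries
# the app. prefix that the output matches on.
Tag app.*

[OUTPUT]
Name es
Match app.*
Host 10.10.1.45
Port 9200
```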
Root Cause 7: Network Connectivity or Firewall Blocking
Sometimes the problem has nothing to do with Fluent Bit configuration. The destination is unreachable because a firewall rule changed, a security group was tightened, or a DNS record wasn't updated after a migration. Fluent Bit retries endlessly and logs connection errors, but if nobody is watching, it looks like a Fluent Bit problem rather than a network problem.
Test directly from sw-infrarunbook-01 to confirm connectivity before spending more time in the Fluent Bit config:
# TCP-level connectivity check
nc -zv 10.10.1.45 9200
# DNS resolution if using a hostname
dig +short logs.solvethenetwork.com
# Full HTTP round-trip
curl -v --max-time 5 http://10.10.1.45:9200/_cluster/health
A timeout or connection refused from
curl confirms the issue is network-level. Check iptables or nftables rules on both the source and destination, review any intermediate firewall policy, and verify routing from the source subnet to the destination IP. Don't fix Fluent Bit config when the real problem is a firewall ticket that needs to be filed.
Prevention
Most of these failures are preventable with a few operational habits that don't require much upfront investment.
Enable the built-in HTTP server and scrape its metrics with Prometheus or your preferred monitoring stack. A single endpoint at
http://127.0.0.1:2020/api/v1/metrics/v2 gives you visibility into dropped records, retry counts, and buffer pressure before users notice a gap in their dashboards. Alert on
dropped_records > 0 and
retries_failed > 0— these are production incidents masquerading as background noise.
[SERVICE]
HTTP_Server On
HTTP_Listen 127.0.0.1
HTTP_Port 2020
Log_Level info
Always run configuration changes through a staging deployment first. Use the stdout plugin to verify that records are being parsed and tagged correctly before touching your production output destination. Treat parser changes the same way you'd treat a schema migration — test against a representative sample of real log data, not just a clean example you wrote to match the parser.
Use filesystem buffering in every production deployment. The default in-memory buffer disappears on restart and fills up fast under load. Filesystem buffering survives restarts, provides much more headroom during destination outages, and gives you a path to recover records that would otherwise be lost. The storage path needs to be on a filesystem with adequate free space — monitor it the same way you'd monitor any other critical volume.
Document the service account Fluent Bit runs as, and make log file permissions part of your application deployment and log rotation checklists. Every time a new log path is added or a
logrotate config is changed, someone needs to verify Fluent Bit can still read the files. Using default ACLs on log directories —
setfacl -d — means new files automatically inherit the right permissions, which eliminates the entire class of post-rotation permission failures.
Keep your TLS certificate inventory current. If Fluent Bit forwards to an HTTPS endpoint signed by an internal CA, make sure CA cert rotation is part of your certificate lifecycle process. A cert expiring at 2 AM on a Sunday is not when you want to discover Fluent Bit has been silently dropping logs for six hours. Set alerts at 30 and 7 days before expiry on every log forwarding endpoint, and test the full chain with
openssl s_client after any cert rotation, not just the endpoint health check.
Finally, treat Fluent Bit's own log file as a first-class operational signal. Route it through your monitoring stack. If the log goes silent or the error rate climbs, you want to know about it the same way you'd know about any other service degradation — before your users do.
