Symptoms
You've deployed Fluent Bit across your fleet and pointed it at a downstream destination — Elasticsearch, Loki, Splunk, or a remote syslog receiver — but logs aren't showing up. The service is running,
systemctl status fluent-bit reports active, but the destination is silent. Maybe forwarding worked fine last week, and now there's a gap in your dashboards starting at some arbitrary hour nobody can explain.
Here's what the failure surface usually looks like. The Fluent Bit process is alive but
output.errors on the internal metrics endpoint keeps climbing. Or the process is running and metrics look completely flat — zero records in, zero records out — which is its own kind of broken. You tail
/var/log/fluent-bit.log and see a wall of retry messages, or worse, nothing at all. Some sources forward fine while others are silently dropped. Running
fluent-bit -c /etc/fluent-bit/fluent-bit.conf --dry-run exits clean, which suggests the config is correct — but that's a trap. Dry-run validates syntax, not runtime behavior.
The root causes below cover the vast majority of forwarding failures I've seen in production. Work through them systematically rather than tweaking random knobs.
Root Cause 1: Output Plugin Misconfigured
This is the most common cause of logs-not-forwarding bugs, and it's embarrassing how often it catches experienced engineers. The output plugin block has a typo, a wrong port, or a parameter that silently gets ignored because it's not a recognized key for that plugin.
Why it happens: Fluent Bit's configuration parser is permissive. Unknown keys don't cause a startup failure — they're silently dropped. So if you write the wrong port number, reference an Elasticsearch 7 parameter against an Elasticsearch 8 endpoint, or misname an option, Fluent Bit starts fine and then either connects to the wrong place or sends malformed payloads the destination rejects without making noise about it.
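To make that concrete, here is a hypothetical block with a deliberately misspelled key. Fluent Bit starts cleanly and simply ignores the unknown key:

```
[OUTPUT]
Name es
Match *
# Typo: 'Hosts' is not a recognized key for the es plugin, so Fluent Bit
# drops it silently and falls back to the default Host of 127.0.0.1.
Hosts 10.10.1.45
Port 9200
```

No warning, no error — the plugin just connects to the wrong address.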
How to identify it. First, enable debug logging and pipe it somewhere you can read:
fluent-bit -c /etc/fluent-bit/fluent-bit.conf -vv 2>&1 | tee /tmp/fb-debug.log
Look for lines like these in the output:
[2024/10/14 03:22:17] [ warn] [output:es:es.0] could not connect to 10.10.1.45:9200
[2024/10/14 03:22:17] [error] [output:es:es.0] HTTP status=400 URI=/fluent-bit/_doc, response:
{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}]}}
A 400 from Elasticsearch almost always means the index mapping is wrong or you're sending a field Elasticsearch doesn't expect. An HTTP 404 usually means the
Index parameter points to a non-existent index or the path is wrong. Connection refused means the host or port is wrong. A common misconfigured block looks like this:
[OUTPUT]
Name es
Match *
Host 10.10.1.45
Port 9300 # Wrong: 9300 is the transport port, not the HTTP port
Index fluent-bit
Type _doc
How to fix it. Cross-reference the Fluent Bit docs for exact plugin parameter names — they're case-insensitive but spelling matters. For Elasticsearch the HTTP port is 9200. Use
curl to independently verify the endpoint is reachable from sw-infrarunbook-01 before blaming Fluent Bit:
curl -v http://10.10.1.45:9200/_cluster/health
A corrected output block for Elasticsearch 8.x looks like this:
[OUTPUT]
Name es
Match *
Host 10.10.1.45
Port 9200
Index fluent-bit
Suppress_Type_Name On
The
Suppress_Type_Name flag is required for Elasticsearch 8.x — without it you'll get 400 errors on every write because type mappings were removed. If you're on ES 7.x, drop that flag.
Root Cause 2: TLS Certificate Not Trusted
If your log destination uses HTTPS and the certificate is self-signed, issued by an internal CA, or expired, Fluent Bit will refuse the connection. In my experience this is the second most common cause of silent forwarding failure, and it's particularly nasty because the error messages aren't always obvious — depending on retry configuration, Fluent Bit may just keep retrying in the background without prominently surfacing why.
Why it happens: Fluent Bit validates server certificates against the system trust store. If your internal CA cert isn't in that store — or if the cert has expired — the TLS handshake fails and records queue up until the buffer fills and they start dropping. The service keeps running. Nothing crashes. You just stop getting logs.
How to identify it. Run with debug verbosity and filter for TLS-related output:
fluent-bit -c /etc/fluent-bit/fluent-bit.conf -vv 2>&1 | grep -i 'tls\|ssl\|cert'
The lines you're looking for:
[2024/10/14 04:11:02] [error] [tls] error: certificate verify failed
[2024/10/14 04:11:02] [error] [output:http:http.0] could not connect to 10.10.1.50:9243
Independently verify the certificate chain from sw-infrarunbook-01 before touching Fluent Bit config:
openssl s_client -connect 10.10.1.50:9243 -CAfile /etc/ssl/certs/ca-bundle.crt 2>&1 | head -30
If you see
Verify return code: 21 (unable to verify the first certificate) or
certificate has expired, you've confirmed the problem is the cert chain, not Fluent Bit itself.
How to fix it. The right approach is to add your internal CA to the system trust store and reference it in your Fluent Bit output block:
# Add the internal CA to the system trust store
cp /etc/pki/internal-ca.crt /usr/local/share/ca-certificates/internal-ca.crt
update-ca-certificates
# Reference it explicitly in the output plugin
[OUTPUT]
Name http
Match *
Host 10.10.1.50
Port 9243
tls On
tls.verify On
tls.ca_file /usr/local/share/ca-certificates/internal-ca.crt
The wrong approach — and I see it in production configs more often than I'd like — is setting
tls.verify Off. That disables certificate validation entirely. You're encrypting the channel but not authenticating the server, which defeats a core purpose of TLS. Don't do this in production. Fix the cert trust instead.
If the certificate is expired, get it renewed. There's no workaround for an expired cert that isn't a security regression. Set a calendar reminder to check cert expiry 30 days out on every log forwarding endpoint — catching this before Fluent Bit does is considerably less stressful.
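One way to script that reminder is openssl x509 -checkend, which exits non-zero once a certificate is inside the given expiry window. This is a sketch assuming a POSIX shell with openssl available; the host and port in the usage comment are the example values from this runbook:

```shell
# cert_expiring_soon: read a PEM certificate on stdin; return 0 (true)
# if it expires within DAYS days (default 30), non-zero otherwise.
cert_expiring_soon() {
    days="${1:-30}"
    # openssl x509 -checkend exits 0 while the cert remains valid that
    # many seconds out, so invert it to mean "expiring soon".
    ! openssl x509 -noout -checkend "$((days * 86400))" >/dev/null
}

# Usage against a live endpoint (10.10.1.50:9243 is this runbook's example):
#   openssl s_client -connect 10.10.1.50:9243 </dev/null 2>/dev/null \
#       | cert_expiring_soon 30 && echo "WARNING: cert expires within 30 days"
```

Run it from cron against every forwarding endpoint and you get the 30-day warning without relying on anyone's calendar.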
Root Cause 3: Buffer Full
Fluent Bit uses in-memory and optional filesystem buffers to absorb bursts and handle backpressure. When the downstream destination is slow, down, or rejecting records, Fluent Bit retries and those records sit in the buffer. If the buffer fills up, new incoming records are dropped. The service keeps running, nothing crashes, but your logs stop forwarding and nobody gets paged about it.
Why it happens: the default
Mem_Buf_Limit for most inputs is around 5MB. In a high-volume environment or during a destination outage, that fills up in minutes. Once the limit is hit, Fluent Bit emits a warning and pauses the input — new records written to the log file are no longer ingested. The file data isn't lost, but Fluent Bit falls behind, and if the pause lasts long enough for files to rotate away, you miss those records entirely.
How to identify it. Watch Fluent Bit logs for this specific pattern:
[2024/10/14 05:43:10] [ warn] [input] pausing tail.0 (mem buf overlimit)
[2024/10/14 05:43:10] [ warn] [input] resume tail.0 (mem buf overlimit)
The rapid pausing and resuming cycle is the telltale sign of buffer pressure. Also check the metrics endpoint if you've enabled it:
curl -s http://127.0.0.1:2020/api/v1/metrics/v2 | python3 -m json.tool
{
"output": {
"es.0": {
"proc_records": 14203,
"proc_bytes": 8921043,
"errors": 892,
"retries": 1204,
"retries_failed": 156,
"dropped_records": 156
}
}
}
Any non-zero
dropped_records value means you've already lost data. This is a production incident, not just a config tweak.
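If you want to catch this automatically, here is a minimal sketch that assumes the metrics JSON has the shape shown above and that python3 is available on the host. It reads the metrics payload on stdin and exits non-zero when any output reports drops:

```shell
# check_drops: read the Fluent Bit metrics JSON on stdin; print any outputs
# reporting dropped records and return non-zero if there are any.
check_drops() {
    python3 -c '
import json, sys

metrics = json.load(sys.stdin)
bad = {name: out["dropped_records"]
       for name, out in metrics.get("output", {}).items()
       if out.get("dropped_records", 0) > 0}
if bad:
    print("dropped records detected:", bad)
    sys.exit(1)
print("no dropped records")
'
}

# Usage against a live agent:
#   curl -s http://127.0.0.1:2020/api/v1/metrics/v2 | check_drops \
#       || echo "ALERT: records dropped"
```

Wire the non-zero exit into whatever already pages you; the point is that a full buffer stops being invisible.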
How to fix it. Increase the
Mem_Buf_Limit and switch to filesystem buffering so records survive both restarts and destination outages:
[SERVICE]
storage.path /var/lib/fluent-bit/buffer
storage.sync normal
storage.checksum off
storage.max_chunks_up 128
[INPUT]
Name tail
Path /var/log/app/*.log
Tag app.*
Mem_Buf_Limit 50MB
storage.type filesystem
Also tune the retry behavior in your output plugin. Setting
Retry_Limit to
False tells Fluent Bit to retry indefinitely rather than dropping records after a fixed number of attempts — appropriate when you'd rather deliver late than lose data once the destination recovers:
[OUTPUT]
Name es
Match *
Host 10.10.1.45
Port 9200
Retry_Limit False
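If you do enable indefinite retries, it is worth capping how large the filesystem buffer for that output can grow. With the filesystem storage from the service config above in place, the storage.total_limit_size output option bounds the on-disk chunks queued for that output (oldest chunks are discarded once the limit is hit); the 2G below is an arbitrary example, so size it to your own log volume and disk headroom:

```
[OUTPUT]
Name es
Match *
Host 10.10.1.45
Port 9200
Retry_Limit False
# Cap on-disk buffered chunks for this output during a long outage.
storage.total_limit_size 2G
```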
A full buffer is always a symptom, not the root cause. Something downstream is slow or broken, and Fluent Bit is absorbing the pressure. Fix the destination issue, then tune the buffer to give yourself headroom the next time a downstream hiccup happens.
Root Cause 4: Parser Not Matching Log Format
Parsers are the part of Fluent Bit configs that most people set up once and forget — until logs start arriving at Elasticsearch as a single unparsed blob, filter rules stop working, or structured fields you expect to query on don't exist. When a parser doesn't match, Fluent Bit typically passes the raw line through as a single
log field. Records are still forwarded, but they're useless for anything beyond archival.
Why it happens: log formats change. An application update switches the timestamp from ISO 8601 to syslog format, and your regex parser silently stops matching. Someone adds multiline Java stack traces and the tail plugin splits each line into a hundred individual events. Or the parser was written against a test log that didn't represent the full range of real output — missing fields, extra spaces, different HTTP methods — and only fails on production traffic.
How to identify it. The fastest diagnostic is adding a temporary stdout output plugin to see what Fluent Bit is actually producing:
[OUTPUT]
Name stdout
Match *
Format json_lines
Run Fluent Bit and watch the output. If you see records like this, the parser is not matching:
{"date":1728872537.0,"log":"Oct 14 03:22:17 sw-infrarunbook-01 nginx: 10.10.1.10 - infrarunbook-admin [14/Oct/2024:03:22:17 +0000] GET /api/health HTTP/1.1 200 45"}
The entire log line is dumped into a single
log string instead of discrete fields like
remote_addr,
status, and
body_bytes_sent. You can also test a parser directly from the command line:
echo '10.10.1.10 - infrarunbook-admin [14/Oct/2024:03:22:17 +0000] GET /health HTTP/1.1 200 45' \
| fluent-bit -R /etc/fluent-bit/parsers.conf -i stdin -p parser=nginx -o stdout -p format=json_lines
If the parser matches, you'll get structured JSON fields back. If it doesn't, you'll get the raw input passed through unchanged.
How to fix it. Validate your regex against actual log samples before deploying. A common nginx combined log parser that works against real-world output:
[PARSER]
Name nginx
Format regex
Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)] (?<method>\S+) (?<path>\S+) \S+ (?<code>[^ ]*) (?<size>[^ ]*)
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
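One quick way to vet that regex against a real sample line without a full Fluent Bit run is python3's re module. Note this is only an approximation of Fluent Bit's Onigmo engine, and python spells named groups (?P<name>...) rather than (?<name>...):

```shell
# Feed the sample line through the same pattern as the [PARSER] block above,
# translated to python named-group syntax, and print the captured fields.
python3 - <<'EOF'
import re

pattern = (r'^(?P<remote>[^ ]*) (?P<host>[^ ]*) (?P<user>[^ ]*) '
           r'\[(?P<time>[^\]]*)\] (?P<method>\S+) (?P<path>\S+) \S+ '
           r'(?P<code>[^ ]*) (?P<size>[^ ]*)')

line = ('10.10.1.10 - infrarunbook-admin [14/Oct/2024:03:22:17 +0000] '
        'GET /health HTTP/1.1 200 45')

match = re.match(pattern, line)
assert match, "parser regex did not match the sample line"
print(match.groupdict())
EOF
```

If the assert fires, iterate on the pattern here before touching parsers.conf.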
For multiline logs like Java stack traces, use the multiline parser feature rather than trying to reassemble them in a filter:
[MULTILINE_PARSER]
name java_multiline
type regex
flush_timeout 1000
rule "start_state" "/^\d{4}-\d{2}-\d{2}/" "java_after_ts"
rule "java_after_ts" "/^(\s+at|\s+\.{3}\s+\d+)/" "java_after_ts"
[INPUT]
Name tail
Path /var/log/app/service.log
Tag app.service
multiline.parser java_multiline
After any parser change, run the command-line test utility against a representative sample of real log data — not just a clean happy-path example you wrote yourself — before rolling it to production.
Root Cause 5: Permission Denied Reading Log File
Fluent Bit runs as a non-root user in most production deployments, and log files written by applications often have restrictive permissions. If Fluent Bit can't read a file, it silently skips it. No crash, no retry, no error in the destination — just nothing.
Why it happens: the tail input plugin tries to open log files at startup and whenever new files match the configured glob. If the Fluent Bit process user doesn't have read permission on the file, or execute permission on a parent directory, the file is simply not watched. This is particularly common when log files are owned by application-specific users like
www-data,
postgres, or
nginx, or when log directories use restrictive permission bits that block traversal by other users.
How to identify it. Enable debug logging and filter for permission and open errors:
fluent-bit -c /etc/fluent-bit/fluent-bit.conf -vv 2>&1 | grep -iE 'permission|denied|open|inode'
You're looking for output like this:
[2024/10/14 06:01:33] [ warn] [input:tail:tail.0] error opening file /var/log/app/service.log: Permission denied
Find what user Fluent Bit runs as, then manually test access as that user:
# Check the systemd service unit
grep -i user /lib/systemd/system/fluent-bit.service
# Test access as the fluent-bit service user
sudo -u fluent ls -la /var/log/app/
sudo -u fluent cat /var/log/app/service.log
Don't just check the file — check the entire directory path. A file can be world-readable but completely unreachable if a parent directory doesn't allow execute for the Fluent Bit user. Use
namei to trace the full permission chain:
namei -l /var/log/app/service.log
f: /var/log/app/service.log
drwxr-xr-x root root /
drwxr-xr-x root root var
drwxr-x--- root syslog log <-- Fluent Bit user cannot enter this directory
drwxr-x--- syslog syslog app
-rw-r----- syslog syslog service.log
That
drwxr-x--- on
/var/log is the culprit — it only allows the
syslog group to traverse, so any user not in that group is blocked before even reaching the file.
How to fix it. The cleanest solution is adding the Fluent Bit service user to the group that owns the log directory and files:
# If Fluent Bit runs as user 'fluent' and logs are owned by group 'syslog'
usermod -aG syslog fluent
systemctl restart fluent-bit
# Verify the fix
sudo -u fluent ls -la /var/log/app/
For application logs owned by a dedicated app user where changing group ownership isn't practical, use POSIX ACLs. The default ACL ensures new log files created by log rotation are automatically accessible too:
# Grant read + directory traversal to the fluent-bit user
setfacl -R -m u:fluent:rX /var/log/app/
# Apply the same ACL to all future files and directories created under /var/log/app/
setfacl -R -d -m u:fluent:rX /var/log/app/
# Verify
getfacl /var/log/app/service.log
Avoid making log files world-readable (
chmod o+r) as a shortcut — logs often contain sensitive data. Group membership or ACLs give you the access you need without oversharing.
Root Cause 6: Tag and Match Misconfiguration
Every record in Fluent Bit carries a tag, and output plugins use
Match patterns to decide which records they process. If your tags and match patterns don't align, records are ingested and parsed correctly but never forwarded — they fall into the void at the routing stage and are silently discarded.
In debug output, you'll see input metrics incrementing while output metrics stay completely flat. The quickest diagnostic is adding a temporary stdout output with
Match * — if records appear there but not in your real output, the match pattern is the problem. Then check what tags are actually being generated:
fluent-bit -c /etc/fluent-bit/fluent-bit.conf -vv 2>&1 | grep engine
[2024/10/14 06:15:22] [debug] [engine] flush chunk with tag 'app.service.production' to output 'es.0'
If your output has
Match app.*, that tag matches fine. If it has
Match application.* or
Match app.service without a wildcard that covers
app.service.production, those records are dropped. Fix the match pattern or adjust the tag in your input block to align with your existing output rules.
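A minimal aligned pair looks like this (hypothetical paths and tag prefix, mirroring the examples above). The trailing wildcard on the Match covers every tag the tail input generates under the app. prefix:

```
[INPUT]
Name tail
Path /var/log/app/*.log
# tail expands the * per watched file, so every record still carries
# the app. prefix that the output matches on.
Tag app.*

[OUTPUT]
Name es
Match app.*
Host 10.10.1.45
Port 9200
```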
Root Cause 7: Network Connectivity or Firewall Blocking
Sometimes the problem has nothing to do with Fluent Bit configuration. The destination is unreachable because a firewall rule changed, a security group was tightened, or a DNS record wasn't updated after a migration. Fluent Bit retries endlessly and logs connection errors, but if nobody is watching, it looks like a Fluent Bit problem rather than a network problem.
Test directly from sw-infrarunbook-01 to confirm connectivity before spending more time in the Fluent Bit config:
# TCP-level connectivity check
nc -zv 10.10.1.45 9200
# DNS resolution if using a hostname
dig +short logs.solvethenetwork.com
# Full HTTP round-trip
curl -v --max-time 5 http://10.10.1.45:9200/_cluster/health
A timeout or connection refused from
curl confirms the issue is network-level. Check iptables or nftables rules on both the source and destination, review any intermediate firewall policy, and verify routing from the source subnet to the destination IP. Don't fix Fluent Bit config when the real problem is a firewall ticket that needs to be filed.
Prevention
Most of these failures are preventable with a few operational habits that don't require much upfront investment.
Enable the built-in HTTP server and scrape its metrics with Prometheus or your preferred monitoring stack. A single endpoint at
http://127.0.0.1:2020/api/v1/metrics/v2 gives you visibility into dropped records, retry counts, and buffer pressure before users notice a gap in their dashboards. Alert on
dropped_records > 0 and
retries_failed > 0— these are production incidents masquerading as background noise.
[SERVICE]
HTTP_Server On
HTTP_Listen 127.0.0.1
HTTP_Port 2020
Log_Level info
Always run configuration changes through a staging deployment first. Use the stdout plugin to verify that records are being parsed and tagged correctly before touching your production output destination. Treat parser changes the same way you'd treat a schema migration — test against a representative sample of real log data, not just a clean example you wrote to match the parser.
Use filesystem buffering in every production deployment. The default in-memory buffer disappears on restart and fills up fast under load. Filesystem buffering survives restarts, provides much more headroom during destination outages, and gives you a path to recover records that would otherwise be lost. The storage path needs to be on a filesystem with adequate free space — monitor it the same way you'd monitor any other critical volume.
Document the service account Fluent Bit runs as, and make log file permissions part of your application deployment and log rotation checklists. Every time a new log path is added or a
logrotate config is changed, someone needs to verify Fluent Bit can still read the files. Using default ACLs on log directories —
setfacl -d — means new files automatically inherit the right permissions, which eliminates the entire class of post-rotation permission failures.
Keep your TLS certificate inventory current. If Fluent Bit forwards to an HTTPS endpoint signed by an internal CA, make sure CA cert rotation is part of your certificate lifecycle process. A cert expiring at 2 AM on a Sunday is not when you want to discover Fluent Bit has been silently dropping logs for six hours. Set alerts at 30 and 7 days before expiry on every log forwarding endpoint, and test the full chain with
openssl s_client after any cert rotation, not just the endpoint health check.
Finally, treat Fluent Bit's own log file as a first-class operational signal. Route it through your monitoring stack. If the log goes silent or the error rate climbs, you want to know about it the same way you'd know about any other service degradation — before your users do.
