Symptoms
When Fluentd is misbehaving, the signs range from obvious to maddeningly subtle. The obvious case: the process exits, systemd throws it into a restart loop, and nothing flows to your destination — Elasticsearch, Loki, S3, or whatever you're shipping to. The subtle case is worse: Fluentd is running, your uptime monitoring is green, but log volume in Kibana has quietly dropped 40% and nobody noticed until an on-call engineer asked why a log-based alert never fired on an event that happened three hours ago.
Here's what you'll typically see when Fluentd is crashing or losing logs:
- The td-agent or fluentd systemd unit shows failed status or loops through activating → active → failed
- /var/log/td-agent/td-agent.log is flooded with [warn] or [error] entries referencing buffer limits, connection failures, or plugin exceptions
- Buffer files under /var/log/td-agent/buffer/ keep growing — sometimes into the gigabytes — without ever being flushed
- Memory on the Fluentd host climbs steadily until the OOM killer fires
- The fluentd_output_status_retry_count Prometheus counter increments continuously
- Specific log sources stop appearing in your aggregator while others continue normally, pointing to a per-plugin failure rather than a global one
None of these symptoms are mutually exclusive. In my experience, a single root cause — say, a slow Elasticsearch cluster — can trigger buffer overflow, which then triggers memory pressure, which eventually causes an OOM crash. You need to find the original cause, not just the loudest symptom.
Root Cause 1: Buffer Overflow
Why It Happens
Fluentd uses a buffer system to absorb bursts of log traffic before flushing records downstream. Each output plugin has a buffer — either in-memory or file-backed — with a defined capacity. When that capacity is reached and the output can't drain fast enough, Fluentd has to make a decision: block, drop, or throw an error. The default overflow_action, throw_exception, raises an error back to the emitting input and logs a warning; depending on how the input handles it, those records can be lost, and the drop_oldest_chunk action discards data outright. Either way, logs disappear silently unless you're watching carefully.
Buffer overflow usually happens when log volume spikes unexpectedly — a deployment that triggers verbose debug logging, a runaway process emitting thousands of lines per second, or a destination that slows down and lets the buffer fill from the back while the front keeps receiving.
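A quick back-of-envelope check helps when you suspect this: divide the configured buffer capacity by the net fill rate (ingest minus drain). The figures below are illustrative assumptions, not measurements:

```shell
# Illustrative sizing check: how long until a buffer overflows at a
# given net fill rate? Numbers are assumptions for the example.
total_limit_mb=512        # configured total_limit_size
net_fill_mb_per_min=20    # incoming volume minus what the output drains
echo "minutes to overflow: $(( total_limit_mb / net_fill_mb_per_min ))"
# minutes to overflow: 25
```

If that number is shorter than the time it takes your team to respond to a destination slowdown, the buffer is undersized.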
How to Identify It
Check the Fluentd log for messages like these:
2026-04-12 03:14:22 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:drop
2026-04-12 03:14:22 +0000 [warn]: #0 buffer is full, chunk dropped tag="app.production" size=524288
Check buffer file sizes on disk:
du -sh /var/log/td-agent/buffer/*
# Output showing a full buffer:
512M /var/log/td-agent/buffer/app.production
512M /var/log/td-agent/buffer/system.syslog
If you have the Prometheus plugin enabled, query the buffer queue length directly:
curl -s http://localhost:24231/metrics | grep fluentd_output_status_buffer_queue_length
# fluentd_output_status_buffer_queue_length{plugin_id="output_es",type="elasticsearch"} 512
A queue at or near its configured queue_limit_length (default 512) confirms the overflow.
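You can automate that check. This sketch flags any output whose queue length sits at 80% or more of an assumed queue_limit_length of 512; the embedded sample stands in for the live curl output above:

```shell
# Sample scrape output; in practice pipe from:
#   curl -s http://localhost:24231/metrics
metrics='fluentd_output_status_buffer_queue_length{plugin_id="output_es"} 490
fluentd_output_status_buffer_queue_length{plugin_id="output_s3"} 12'
# Flag queues at >= 80% of an assumed queue_limit_length of 512
echo "$metrics" | awk '$2 >= 512 * 0.8 { print $1, "near limit:", $2 }'
```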
How to Fix It
Increase the buffer limits and change the overflow behavior in your config:
<match app.**>
@type elasticsearch
host 10.10.1.50
port 9200
<buffer tag>
@type file
path /var/log/td-agent/buffer/app
total_limit_size 2GB
chunk_limit_size 32MB
flush_interval 5s
retry_max_times 20
overflow_action block
</buffer>
</match>
The key change here is overflow_action block. Instead of silently dropping records, Fluentd will pause ingestion and apply backpressure upstream. Blocking is almost always preferable to data loss. After editing, validate and reload:
fluentd --dry-run -c /etc/td-agent/td-agent.conf
systemctl reload td-agent
If you need to clear stuck or unflushable buffer files, stop the service, remove the buffer directory, and restart. Understand that this means losing whatever was buffered but not yet sent — only do it when you've confirmed the data is unrecoverable or you accept the loss.
systemctl stop td-agent
rm -rf /var/log/td-agent/buffer/
systemctl start td-agent
Root Cause 2: Plugin Error
Why It Happens
Fluentd's plugin ecosystem is one of its greatest strengths — and a frequent source of crashes. Plugins are Ruby code, and Ruby code can raise exceptions. When a plugin's write or process method throws an unhandled exception — a nil dereference, an unexpected API response format, a gem dependency version mismatch — the worker thread crashes, taking down log delivery for every tag that worker was handling.
I've seen this most often with fluent-plugin-elasticsearch when Elasticsearch changes its API response shape across major versions, and with custom parser plugins that don't handle malformed log lines gracefully. One bad log line can take down the entire output if the plugin isn't written defensively.
How to Identify It
Plugin errors show up in the Fluentd log as Ruby stack traces:
2026-04-12 04:02:11 +0000 [error]: #0 unexpected error error_class=NoMethodError error="undefined method `[]' for nil:NilClass"
/usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-elasticsearch-5.2.4/lib/fluent/plugin/out_elasticsearch.rb:812:in `client_usable?'
/usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-elasticsearch-5.2.4/lib/fluent/plugin/out_elasticsearch.rb:791:in `write'
/usr/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1203:in `try_flush'
Filter for non-buffer errors and look at the tail:
grep -E '\[error\]|\[warn\]' /var/log/td-agent/td-agent.log | grep -v buffer | tail -50
Check which plugin version is installed:
td-agent-gem list | grep fluent-plugin
# fluent-plugin-elasticsearch (5.2.4)
# fluent-plugin-prometheus (2.1.0)
# fluent-plugin-rewrite-tag-filter (2.4.0)
How to Fix It
Update the offending plugin to its latest stable version:
td-agent-gem update fluent-plugin-elasticsearch
# Verify the new version:
td-agent-gem list | grep fluent-plugin-elasticsearch
# fluent-plugin-elasticsearch (5.4.3)
If the problem is malformed log lines crashing a parser, use Fluentd's built-in error routing to redirect bad records instead of letting them crash the plugin:
<filter app.**>
@type parser
key_name log
reserve_data true
emit_invalid_record_to_error true
<parse>
@type json
</parse>
</filter>
<label @ERROR>
<match **>
@type file
path /var/log/td-agent/parse_errors/
</match>
</label>
Setting emit_invalid_record_to_error true routes malformed records to Fluentd's @ERROR label instead of raising an exception, where a separate output can capture them for inspection. After making changes, reload:
systemctl reload td-agent
Root Cause 3: Memory Limit Hit
Why It Happens
Fluentd is a Ruby process, and Ruby's garbage collector isn't always aggressive about returning memory to the OS. If you're running Fluentd inside a container with a memory limit, under systemd with MemoryMax set, or on a host where other services are competing for RAM, the process will eventually hit its ceiling. When that happens, the Linux OOM killer terminates it — no warning, no graceful shutdown, no buffer flush. Whatever logs were held in memory at the time of termination are gone.
This is especially common when using in-memory buffers (@type memory), large chunk sizes, or filters that expand record size significantly — like adding full Kubernetes metadata to every log line.
How to Identify It
Check the kernel ring buffer for OOM kill events:
dmesg | grep -i 'oom\|killed process' | tail -20
# [428743.112345] Out of memory: Kill process 18432 (ruby) score 892 or sacrifice child
# [428743.113001] Killed process 18432 (ruby) total-vm:4194304kB, anon-rss:3145728kB, file-rss:8192kB
Confirm via journald that it was the td-agent service that was killed:
journalctl -u td-agent --since '1 hour ago' | grep -E 'Killed|OOM|memory'
# Apr 12 04:17:32 sw-infrarunbook-01 systemd[1]: td-agent.service: Main process exited, code=killed, status=9/KILL
Check current RSS usage of the running process:
ps aux | grep fluentd
# infrarunbook-admin 18901 45.2 12.8 4200000 2097152 ? Sl 03:00 8:42 /usr/bin/ruby /usr/sbin/fluentd
awk '/VmRSS/{print $2/1024 " MB"}' /proc/18901/status
# 2048 MB
How to Fix It
The most impactful change is switching from in-memory buffers to file-backed buffers, which don't hold data in the Ruby heap:
<buffer tag>
@type file # Was: @type memory
path /var/log/td-agent/buffer/app
chunk_limit_size 8MB
total_limit_size 512MB
</buffer>
If you're running under systemd, give Fluentd a realistic memory ceiling and disable swap to prevent slow death-by-swap instead of a clean OOM kill:
# /etc/systemd/system/td-agent.service.d/memory.conf
[Service]
MemoryMax=1G
MemorySwapMax=0
systemctl daemon-reload
systemctl restart td-agent
For Kubernetes environments, use a hostPath volume for the buffer directory so buffer data survives container restarts. Set resource requests and limits based on actual observed usage, not guesses. You can also nudge Ruby's GC to run more aggressively:
# /etc/default/td-agent (Debian/Ubuntu) or /etc/sysconfig/td-agent (RHEL)
RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9
RUBY_GC_MALLOC_LIMIT=4000000
Root Cause 4: Network Timeout to Destination
Why It Happens
Fluentd forwards logs to a downstream destination — Elasticsearch, an S3-compatible store, a remote syslog endpoint, another Fluentd aggregator. If that destination becomes slow or unreachable, the output plugin starts timing out on each flush attempt. Fluentd will retry with exponential backoff. Meanwhile, new logs keep arriving. The buffer fills from the back while flushes stall on the front. Eventually you hit buffer overflow on top of the timeout problem, and now you're dealing with both simultaneously.
Network timeouts are particularly insidious because Fluentd may look healthy — the process is up, the config is valid — but it's quietly accumulating unflushable chunks or silently discarding records after retry exhaustion.
How to Identify It
The Fluentd log will show connection refused or timeout errors with escalating retry intervals:
2026-04-12 05:30:01 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=30 chunk="5b2a1c..." error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Can not reach Elasticsearch cluster"
2026-04-12 05:30:31 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=60 chunk="5b2a1c..."
2026-04-12 05:31:31 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=120 chunk="5b2a1c..."
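The widening next_retry_seconds values follow Fluentd's exponential backoff: the wait roughly doubles per attempt until it reaches retry_max_interval. This sketch reproduces that schedule, assuming a 1s initial wait and a 300s cap (both vary with configuration):

```shell
w=1            # assumed initial retry_wait of 1s
schedule=""
for i in 1 2 3 4 5 6 7 8 9 10; do
  if [ "$w" -gt 300 ]; then w=300; fi   # assumed retry_max_interval of 300s
  schedule="$schedule $w"
  w=$(( w * 2 ))                        # backoff doubles each attempt
done
echo "retry waits (s):$schedule"
# retry waits (s): 1 2 4 8 16 32 64 128 256 300
```

Summing the schedule tells you how long a chunk has been stuck: by retry 10 in this sketch, Fluentd has already been failing for roughly 13 minutes.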
Confirm the destination is unreachable from the Fluentd host itself:
curl -v http://10.10.1.50:9200/_cluster/health
# * connect to 10.10.1.50 port 9200 failed: Connection refused
nc -zv 10.10.1.50 9200
# nc: connect to 10.10.1.50 port 9200 (tcp) failed: Connection refused
Check the retry counter via Prometheus metrics to confirm how long the backoff has been running:
curl -s http://localhost:24231/metrics | grep retry
# fluentd_output_status_retry_count{plugin_id="output_es",type="elasticsearch"} 47
# fluentd_output_status_retry_wait{plugin_id="output_es",type="elasticsearch"} 120.0
A retry count climbing with a long
retry_waitconfirms you're in a timeout and backoff loop.
How to Fix It
Restoring connectivity to the destination is outside Fluentd's control — investigate the downstream service. But you can configure Fluentd to hold data reliably through extended outages instead of giving up and discarding chunks:
<match app.**>
@type elasticsearch
host 10.10.1.50
port 9200
request_timeout 30s
<buffer tag>
@type file
path /var/log/td-agent/buffer/app
total_limit_size 2GB
flush_interval 10s
retry_type exponential_backoff
retry_max_interval 300s
retry_forever true
</buffer>
</match>
Setting retry_forever true means Fluentd will never discard buffered chunks due to retry exhaustion — it keeps trying until the destination comes back. Pair this with a generous total_limit_size so the buffer can absorb data through a multi-hour outage.
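Sizing that limit is simple arithmetic: multiply your sustained ingest rate by the longest outage you want to ride out. The figures here are assumptions for illustration:

```shell
ingest_mb_per_hour=600   # assumption: sustained log ingest volume
outage_hours=4           # assumption: longest outage to survive
echo "minimum total_limit_size: $(( ingest_mb_per_hour * outage_hours ))MB"
# minimum total_limit_size: 2400MB
```

Add headroom on top of that figure — outages rarely end on schedule, and log volume often spikes during incidents.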
For multi-destination resilience, use the copy output plugin with a local file fallback:
<match app.**>
@type copy
<store>
@type elasticsearch
host 10.10.1.50
port 9200
<buffer tag>
@type file
path /var/log/td-agent/buffer/es_primary
retry_forever true
total_limit_size 1GB
</buffer>
</store>
<store ignore_error>
@type file
path /var/log/td-agent/fallback/app
</store>
</match>
The ignore_error flag on the secondary store means a failure there won't affect the primary output. You can replay from the fallback file manually once the primary destination recovers.
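With the default out_file formatting, each fallback line carries the time, tag, and JSON record separated by tabs. A sketch of extracting just the JSON payload for replay — the sample line is illustrative, and any re-send tooling (fluent-cat, for example) depends on your setup, so check your actual format settings first:

```shell
# Sample fallback line in the assumed default format: time<TAB>tag<TAB>json
printf '2026-04-12T05:30:01+00:00\tapp.production\t{"msg":"hello"}\n' \
  | awk -F'\t' '{ print $3 }'
# {"msg":"hello"}
```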
Root Cause 5: Config Parse Error
Why It Happens
This is the most common failure mode after a config change, and also one of the most disruptive. After editing a config file — adding a new filter, updating a destination address, rotating a credential — Fluentd must be reloaded or restarted to pick up the change. If the config has a syntax error, Fluentd refuses to start and all log collection stops cold. The config DSL is not forgiving: a typo in a plugin name, a mismatched tag, an @include pointing to a file that doesn't exist, or a missing closing directive will all cause a parse failure and a hard exit.
How to Identify It
When Fluentd fails to start due to a config error, the message appears in the journal:
journalctl -u td-agent -n 30
# Apr 12 06:00:01 sw-infrarunbook-01 fluentd[24401]: 2026-04-12 06:00:01 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="Unknown output plugin 'elasticsaerch'. Run 'gem search -rd fluent-plugin' to find plugins"
# Apr 12 06:00:01 sw-infrarunbook-01 systemd[1]: td-agent.service: Main process exited, code=exited, status=1/FAILURE
A malformed match block looks like this:
2026-04-12 06:05:12 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="'match' section requires argument, in section match"
Always validate the config before applying it. Fluentd has a built-in dry-run mode that catches parse errors without affecting the running service:
fluentd --dry-run -c /etc/td-agent/td-agent.conf
# On success:
# 2026-04-12 06:10:01 +0000 [info]: reading config file path="/etc/td-agent/td-agent.conf"
# 2026-04-12 06:10:01 +0000 [info]: starting fluentd-1.16.3 pid=24510 ruby="2.7.8"
# 2026-04-12 06:10:02 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.4.3'
# 2026-04-12 06:10:02 +0000 [info]: dry run finished
# On failure:
# 2026-04-12 06:10:01 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="..."
How to Fix It
Read the error message carefully — Fluentd's config errors are usually specific about what went wrong and where. The most common issues fall into a few categories.
Typo in plugin name: Check your spelling and verify the plugin is actually installed.
td-agent-gem list | grep fluent-plugin-elasticsearch
# Install if missing:
td-agent-gem install fluent-plugin-elasticsearch
Mismatched open and close tags: Every <source>, <match>, <filter>, and <buffer> block needs a closing tag. Count them — nested directives like <buffer> are usually indented, so match leading whitespace too:
grep -cE '^[[:space:]]*<[a-z]' /etc/td-agent/td-agent.conf
grep -cE '^[[:space:]]*</' /etc/td-agent/td-agent.conf
# These numbers must match
Missing included files: If you use @include directives, confirm the referenced files exist:
grep '@include' /etc/td-agent/td-agent.conf
# @include /etc/td-agent/conf.d/*.conf
ls /etc/td-agent/conf.d/
Once fixed, validate and reload atomically:
fluentd --dry-run -c /etc/td-agent/td-agent.conf && systemctl reload td-agent
Root Cause 6: Corrupted Buffer Chunks
Why It Happens
Fluentd stores buffered data in chunk files — binary files encoded in MessagePack format. If a chunk file becomes corrupted due to an ungraceful shutdown, a disk write error, or a Fluentd version upgrade that changed the chunk format, Fluentd will fail when it tries to read and flush that chunk. The typical symptom is a crash loop: Fluentd starts, tries to resume the corrupted chunk, raises an exception, and exits. Systemd restarts it, and the cycle repeats.
How to Identify It
grep -i 'chunk\|broken\|corrupt' /var/log/td-agent/td-agent.log | grep -i error
# 2026-04-12 07:00:03 +0000 [error]: #0 found broken chunk file during resume. Deleted: path="/var/log/td-agent/buffer/app/5b2a1c3d4e5f.log"
# 2026-04-12 07:00:03 +0000 [error]: #0 failed to resume buffer error_class=Fluent::Plugin::Buffer::FileChunk::FileChunkError error="chunk is broken"
How to Fix It
Stop the service and remove the identified broken chunk files. Fluentd logs the exact path of chunks it considers broken, so target those specifically:
systemctl stop td-agent
rm /var/log/td-agent/buffer/app/5b2a1c3d4e5f.log
rm /var/log/td-agent/buffer/app/5b2a1c3d4e5f.log.meta
systemctl start td-agent
If the corruption is widespread, clear the entire buffer directory as described in the buffer overflow section. To prevent this from causing crash loops in the future, enable automatic broken-chunk skipping (available in Fluentd 1.14+):
<buffer tag>
@type file
path /var/log/td-agent/buffer/app
skip_broken_chunks true
</buffer>
Prevention
Preventing Fluentd crashes and log loss comes down to good configuration hygiene and real visibility into what the process is doing. A few practices that have consistently saved production deployments:
Always validate configs before deploying. Make fluentd --dry-run -c /etc/td-agent/td-agent.conf a mandatory gate in your config deployment pipeline. If you're using Ansible or Chef to manage configs, add a handler that runs the dry-run check before any reload fires. A failed dry-run should block the deployment.
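One way to structure that gate is sketched below. The deploy_config function is hypothetical scaffolding: in a real pipeline the validator command is fluentd --dry-run against the staged config, and the "promoted" branch would move the staged file into place and reload td-agent.

```shell
# Hypothetical deploy gate: promote a staged config only if validation
# passes; otherwise leave the running config untouched.
deploy_config() {
  if "$@"; then          # in production: fluentd --dry-run -c <staged conf>
    echo "promoted"      # in production: mv staged conf into place, reload
  else
    echo "blocked"
  fi
}
deploy_config true     # stands in for a passing dry run
deploy_config false    # stands in for a failing dry run
```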
Use file-backed buffers everywhere. Memory buffers are faster but they're gone the moment the process dies. File buffers survive restarts and give you a much better chance of not losing data during a crash. The performance difference is negligible for most log volumes.
Set overflow_action block. The alternatives are worse: throw_exception (the default) raises errors that inputs may not handle cleanly, and drop_oldest_chunk silently discards data. Blocking applies backpressure upstream — your sources slow down or buffer locally, which is recoverable. Dropped log records are not.
Monitor the right metrics. If you're not running fluent-plugin-prometheus, install it now. The three metrics that matter most are fluentd_output_status_retry_count, fluentd_output_status_buffer_queue_length, and fluentd_output_status_emit_records. Alert on retry_count increasing continuously and buffer_queue_length approaching its configured limit. A sustained drop in emit_records when log sources are active is a reliable early signal that something is wrong upstream.
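If you wire these metrics into Prometheus alerting, rules along these lines are a reasonable starting point — the thresholds are assumptions to tune against your own queue_limit_length and normal traffic levels:

```yaml
# Illustrative Prometheus alerting rules; thresholds are assumptions.
groups:
  - name: fluentd
    rules:
      - alert: FluentdRetriesClimbing
        expr: rate(fluentd_output_status_retry_count[5m]) > 0
        for: 10m
      - alert: FluentdBufferQueueNearLimit
        expr: fluentd_output_status_buffer_queue_length > 400
        for: 5m
```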
Pin plugin versions in production. Plugins are deployed via gems, and an uncontrolled td-agent-gem update can pull in a breaking change. Maintain a version list and only update after testing against staging traffic that mirrors production log patterns and volume.
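A minimal drift check might look like this — the pinned manifest and the sample "installed" output are assumptions; in practice the installed list would come from td-agent-gem list:

```shell
# Pinned versions you expect in production (assumed manifest)
pinned='fluent-plugin-elasticsearch 5.4.3
fluent-plugin-prometheus 2.1.0'
# Sample data; in practice derive from: td-agent-gem list
installed='fluent-plugin-elasticsearch 5.4.3
fluent-plugin-prometheus 2.2.0'
if [ "$pinned" = "$installed" ]; then
  echo "plugin versions match pins"
else
  echo "plugin version drift detected"
fi
```

Run it in CI or as a nightly cron, and treat any drift as a deploy blocker until the change has been tested.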
Configure systemd restart behavior sensibly. By default, systemd stops restarting a service after too many failures in a short window — which means a transient issue can leave Fluentd permanently stopped. Configure the limits to allow recovery from transient failures while alerting on persistent ones:
# /etc/systemd/system/td-agent.service.d/restart.conf
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=10s
Set retry_forever true for critical outputs. If your destination goes down for a few hours, you want Fluentd to keep the data buffered and retry indefinitely rather than giving up and discarding chunks after a fixed retry count. Pair this with a large total_limit_size so the buffer can hold data through a realistic outage window.
Fluentd is reliable when it's configured deliberately. Most of the crashes and silent data loss events I've debugged in production traced back to one of these root causes — and almost all of them were preventable with better buffer configuration, a working monitoring setup, and a config validation step wired into the deployment process.
