Symptoms
When Fluentd is misbehaving, the signs range from obvious to maddeningly subtle. The obvious case: the process exits, systemd throws it into a restart loop, and nothing flows to your destination — Elasticsearch, Loki, S3, or whatever you're shipping to. The subtle case is worse: Fluentd is running, your uptime monitoring is green, but log volume in Kibana has quietly dropped 40% and nobody noticed until an on-call engineer asked why a log-based alert never fired on an event that happened three hours ago.
Here's what you'll typically see when Fluentd is crashing or losing logs:
- The td-agent or fluentd systemd unit shows failed status or loops through activating → active → failed
- /var/log/td-agent/td-agent.log is flooded with [warn] or [error] entries referencing buffer limits, connection failures, or plugin exceptions
- Buffer files under /var/log/td-agent/buffer/ keep growing — sometimes into the gigabytes — without ever being flushed
- Memory on the Fluentd host climbs steadily until the OOM killer fires
- The fluentd_output_status_retry_count Prometheus counter increments continuously
- Specific log sources stop appearing in your aggregator while others continue normally, pointing to a per-plugin failure rather than a global one
None of these symptoms are mutually exclusive. In my experience, a single root cause — say, a slow Elasticsearch cluster — can trigger buffer overflow, which then triggers memory pressure, which eventually causes an OOM crash. You need to find the original cause, not just the loudest symptom.
Root Cause 1: Buffer Overflow
Why It Happens
Fluentd uses a buffer system to absorb bursts of log traffic before flushing records downstream. Each output plugin has a buffer — either in-memory or file-backed — with a defined capacity. When that capacity is reached and the output can't drain fast enough, Fluentd has to make a decision: block, drop, or throw an error. The default overflow_action, throw_exception, raises an error back to the emitting input and logs a warning; depending on how the input handles it, those records can be lost, and the drop_oldest_chunk action discards data outright. Either way, logs disappear silently unless you're watching carefully.
Buffer overflow usually happens when log volume spikes unexpectedly — a deployment that triggers verbose debug logging, a runaway process emitting thousands of lines per second, or a destination that slows down and lets the buffer fill from the back while the front keeps receiving.
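A quick back-of-envelope check helps when you suspect this: divide the configured buffer capacity by the net fill rate (ingest minus drain). The figures below are illustrative assumptions, not measurements:

```shell
# Illustrative sizing check: how long until a buffer overflows at a
# given net fill rate? Numbers are assumptions for the example.
total_limit_mb=512        # configured total_limit_size
net_fill_mb_per_min=20    # incoming volume minus what the output drains
echo "minutes to overflow: $(( total_limit_mb / net_fill_mb_per_min ))"
# minutes to overflow: 25
```

If that number is shorter than the time it takes your team to respond to a destination slowdown, the buffer is undersized.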
How to Identify It
Check the Fluentd log for messages like these:
2026-04-12 03:14:22 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:drop
2026-04-12 03:14:22 +0000 [warn]: #0 buffer is full, chunk dropped tag="app.production" size=524288
Check buffer file sizes on disk:
du -sh /var/log/td-agent/buffer/*
# Output showing a full buffer:
512M /var/log/td-agent/buffer/app.production
512M /var/log/td-agent/buffer/system.syslog
If you have the Prometheus plugin enabled, query the buffer queue length directly:
curl -s http://localhost:24231/metrics | grep fluentd_output_status_buffer_queue_length
# fluentd_output_status_buffer_queue_length{plugin_id="output_es",type="elasticsearch"} 512
A queue at or near its configured queue_limit_length (default 512) confirms the overflow.
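You can automate that check. This sketch flags any output whose queue length sits at 80% or more of an assumed queue_limit_length of 512; the embedded sample stands in for the live curl output above:

```shell
# Sample scrape output; in practice pipe from:
#   curl -s http://localhost:24231/metrics
metrics='fluentd_output_status_buffer_queue_length{plugin_id="output_es"} 490
fluentd_output_status_buffer_queue_length{plugin_id="output_s3"} 12'
# Flag queues at >= 80% of an assumed queue_limit_length of 512
echo "$metrics" | awk '$2 >= 512 * 0.8 { print $1, "near limit:", $2 }'
```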
How to Fix It
Increase the buffer limits and change the overflow behavior in your config:
<match app.**>
@type elasticsearch
host 10.10.1.50
port 9200
<buffer tag>
@type file
path /var/log/td-agent/buffer/app
total_limit_size 2GB
chunk_limit_size 32MB
flush_interval 5s
retry_max_times 20
overflow_action block
</buffer>
</match>
The key change here is overflow_action block. Instead of silently dropping records, Fluentd will pause ingestion and apply backpressure upstream. Blocking is almost always preferable to data loss. After editing, validate and reload:
fluentd --dry-run -c /etc/td-agent/td-agent.conf
systemctl reload td-agent
If you need to clear stuck or unflushable buffer files, stop the service, remove the buffer directory, and restart. Understand that this means losing whatever was buffered but not yet sent — only do it when you've confirmed the data is unrecoverable or you accept the loss.
systemctl stop td-agent
rm -rf /var/log/td-agent/buffer/
systemctl start td-agent
Root Cause 2: Plugin Error
Why It Happens
Fluentd's plugin ecosystem is one of its greatest strengths — and a frequent source of crashes. Plugins are Ruby code, and Ruby code can raise exceptions. When a plugin's write or process method throws an unhandled exception — a nil dereference, an unexpected API response format, a gem dependency version mismatch — the worker thread crashes, taking down log delivery for every tag that worker was handling.
I've seen this most often with fluent-plugin-elasticsearch when Elasticsearch changes its API response shape across major versions, and with custom parser plugins that don't handle malformed log lines gracefully. One bad log line can take down the entire output if the plugin isn't written defensively.
How to Identify It
Plugin errors show up in the Fluentd log as Ruby stack traces:
2026-04-12 04:02:11 +0000 [error]: #0 unexpected error error_class=NoMethodError error="undefined method `[]' for nil:NilClass"
/usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-elasticsearch-5.2.4/lib/fluent/plugin/out_elasticsearch.rb:812:in `client_usable?'
/usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-elasticsearch-5.2.4/lib/fluent/plugin/out_elasticsearch.rb:791:in `write'
/usr/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1203:in `try_flush'
Filter for non-buffer errors and look at the tail:
grep -E '\[error\]|\[warn\]' /var/log/td-agent/td-agent.log | grep -v buffer | tail -50
Check which plugin version is installed:
td-agent-gem list | grep fluent-plugin
# fluent-plugin-elasticsearch (5.2.4)
# fluent-plugin-prometheus (2.1.0)
# fluent-plugin-rewrite-tag-filter (2.4.0)
How to Fix It
Update the offending plugin to its latest stable version:
td-agent-gem update fluent-plugin-elasticsearch
# Verify the new version:
td-agent-gem list | grep fluent-plugin-elasticsearch
# fluent-plugin-elasticsearch (5.4.3)
If the problem is malformed log lines crashing a parser, use Fluentd's built-in error routing to redirect bad records instead of letting them crash the plugin:
<filter app.**>
@type parser
key_name log
reserve_data true
emit_invalid_record_to_error true
<parse>
@type json
</parse>
</filter>
<label @ERROR>
<match **>
@type file
path /var/log/td-agent/parse_errors/
</match>
</label>
Setting emit_invalid_record_to_error true routes malformed records to Fluentd's @ERROR label instead of raising an exception, where a separate output can capture them for inspection. After making changes, reload:
systemctl reload td-agent
Root Cause 3: Memory Limit Hit
Why It Happens
Fluentd is a Ruby process, and Ruby's garbage collector isn't always aggressive about returning memory to the OS. If you're running Fluentd inside a container with a memory limit, under systemd with MemoryMax set, or on a host where other services are competing for RAM, the process will eventually hit its ceiling. When that happens, the Linux OOM killer terminates it — no warning, no graceful shutdown, no buffer flush. Whatever logs were held in memory at the time of termination are gone.
This is especially common when using in-memory buffers (@type memory), large chunk sizes, or filters that expand record size significantly — like adding full Kubernetes metadata to every log line.
How to Identify It
Check the kernel ring buffer for OOM kill events:
dmesg | grep -i 'oom\|killed process' | tail -20
# [428743.112345] Out of memory: Kill process 18432 (ruby) score 892 or sacrifice child
# [428743.113001] Killed process 18432 (ruby) total-vm:4194304kB, anon-rss:3145728kB, file-rss:8192kB
Confirm via journald that it was the td-agent service that was killed:
journalctl -u td-agent --since '1 hour ago' | grep -E 'Killed|OOM|memory'
# Apr 12 04:17:32 sw-infrarunbook-01 systemd[1]: td-agent.service: Main process exited, code=killed, status=9/KILL
Check current RSS usage of the running process:
ps aux | grep fluentd
# infrarunbook-admin 18901 45.2 12.8 4200000 2097152 ? Sl 03:00 8:42 /usr/bin/ruby /usr/sbin/fluentd
awk '/VmRSS/{print $2/1024 " MB"}' /proc/18901/status
# 2048 MB
How to Fix It
The most impactful change is switching from in-memory buffers to file-backed buffers, which don't hold data in the Ruby heap:
<buffer tag>
@type file # Was: @type memory
path /var/log/td-agent/buffer/app
chunk_limit_size 8MB
total_limit_size 512MB
</buffer>
If you're running under systemd, give Fluentd a realistic memory ceiling and disable swap to prevent slow death-by-swap instead of a clean OOM kill:
# /etc/systemd/system/td-agent.service.d/memory.conf
[Service]
MemoryMax=1G
MemorySwapMax=0
systemctl daemon-reload
systemctl restart td-agent
For Kubernetes environments, use a hostPath volume for the buffer directory so buffer data survives container restarts. Set resource requests and limits based on actual observed usage, not guesses. You can also nudge Ruby's GC to run more aggressively:
# /etc/default/td-agent (Debian/Ubuntu) or /etc/sysconfig/td-agent (RHEL)
RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9
RUBY_GC_MALLOC_LIMIT=4000000
Root Cause 4: Network Timeout to Destination
Why It Happens
Fluentd forwards logs to a downstream destination — Elasticsearch, an S3-compatible store, a remote syslog endpoint, another Fluentd aggregator. If that destination becomes slow or unreachable, the output plugin starts timing out on each flush attempt. Fluentd will retry with exponential backoff. Meanwhile, new logs keep arriving. The buffer fills from the back while flushes stall on the front. Eventually you hit buffer overflow on top of the timeout problem, and now you're dealing with both simultaneously.
Network timeouts are particularly insidious because Fluentd may look healthy — the process is up, the config is valid — but it's quietly accumulating unflushable chunks or silently discarding records after retry exhaustion.
How to Identify It
The Fluentd log will show connection refused or timeout errors with escalating retry intervals:
2026-04-12 05:30:01 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=30 chunk="5b2a1c..." error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Can not reach Elasticsearch cluster"
2026-04-12 05:30:31 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=60 chunk="5b2a1c..."
2026-04-12 05:31:31 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=120 chunk="5b2a1c..."
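The widening next_retry_seconds values follow Fluentd's exponential backoff: the wait roughly doubles per attempt until it reaches retry_max_interval. This sketch reproduces that schedule, assuming a 1s initial wait and a 300s cap (both vary with configuration):

```shell
w=1            # assumed initial retry_wait of 1s
schedule=""
for i in 1 2 3 4 5 6 7 8 9 10; do
  if [ "$w" -gt 300 ]; then w=300; fi   # assumed retry_max_interval of 300s
  schedule="$schedule $w"
  w=$(( w * 2 ))                        # backoff doubles each attempt
done
echo "retry waits (s):$schedule"
# retry waits (s): 1 2 4 8 16 32 64 128 256 300
```

Summing the schedule tells you how long a chunk has been stuck: by retry 10 in this sketch, Fluentd has already been failing for roughly 13 minutes.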
Confirm the destination is unreachable from the Fluentd host itself:
curl -v http://10.10.1.50:9200/_cluster/health
# * connect to 10.10.1.50 port 9200 failed: Connection refused
nc -zv 10.10.1.50 9200
# nc: connect to 10.10.1.50 port 9200 (tcp) failed: Connection refused
Check the retry counter via Prometheus metrics to confirm how long the backoff has been running:
curl -s http://localhost:24231/metrics | grep retry
# fluentd_output_status_retry_count{plugin_id="output_es",type="elasticsearch"} 47
# fluentd_output_status_retry_wait{plugin_id="output_es",type="elasticsearch"} 120.0
A retry count climbing with a long
retry_waitconfirms you're in a timeout and backoff loop.
How to Fix It
Restoring connectivity to the destination is outside Fluentd's control — investigate the downstream service. But you can configure Fluentd to hold data reliably through extended outages instead of giving up and discarding chunks:
<match app.**>
@type elasticsearch
host 10.10.1.50
port 9200
request_timeout 30s
<buffer tag>
@type file
path /var/log/td-agent/buffer/app
total_limit_size 2GB
flush_interval 10s
retry_type exponential_backoff
retry_max_interval 300s
retry_forever true
</buffer>
</match>
Setting retry_forever true means Fluentd will never discard buffered chunks due to retry exhaustion — it keeps trying until the destination comes back. Pair this with a generous total_limit_size so the buffer can absorb data through a multi-hour outage.
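Sizing that limit is simple arithmetic: multiply your sustained ingest rate by the longest outage you want to ride out. The figures here are assumptions for illustration:

```shell
ingest_mb_per_hour=600   # assumption: sustained log ingest volume
outage_hours=4           # assumption: longest outage to survive
echo "minimum total_limit_size: $(( ingest_mb_per_hour * outage_hours ))MB"
# minimum total_limit_size: 2400MB
```

Add headroom on top of that figure — outages rarely end on schedule, and log volume often spikes during incidents.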
For multi-destination resilience, use the copy output plugin with a local file fallback:
<match app.**>
@type copy
<store>
@type elasticsearch
host 10.10.1.50
port 9200
<buffer tag>
@type file
path /var/log/td-agent/buffer/es_primary
retry_forever true
total_limit_size 1GB
</buffer>
</store>
<store ignore_error>
@type file
path /var/log/td-agent/fallback/app
</store>
</match>
The ignore_error flag on the secondary store means a failure there won't affect the primary output. You can replay from the fallback file manually once the primary destination recovers.
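With the default out_file formatting, each fallback line carries the time, tag, and JSON record separated by tabs. A sketch of extracting just the JSON payload for replay — the sample line is illustrative, and any re-send tooling (fluent-cat, for example) depends on your setup, so check your actual format settings first:

```shell
# Sample fallback line in the assumed default format: time<TAB>tag<TAB>json
printf '2026-04-12T05:30:01+00:00\tapp.production\t{"msg":"hello"}\n' \
  | awk -F'\t' '{ print $3 }'
# {"msg":"hello"}
```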
Root Cause 5: Config Parse Error
Why It Happens
This is the most common failure mode after a config change, and also one of the most disruptive. After editing a config file — adding a new filter, updating a destination address, rotating a credential — Fluentd must be reloaded or restarted to pick up the change. If the config has a syntax error, Fluentd refuses to start and all log collection stops cold. The config DSL is not forgiving: a typo in a plugin name, a mismatched tag, an @include pointing to a file that doesn't exist, or a missing closing directive will all cause a parse failure and a hard exit.
How to Identify It
When Fluentd fails to start due to a config error, the message appears in the journal:
journalctl -u td-agent -n 30
# Apr 12 06:00:01 sw-infrarunbook-01 fluentd[24401]: 2026-04-12 06:00:01 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="Unknown output plugin 'elasticsaerch'. Run 'gem search -rd fluent-plugin' to find plugins"
# Apr 12 06:00:01 sw-infrarunbook-01 systemd[1]: td-agent.service: Main process exited, code=exited, status=1/FAILURE
A malformed match block looks like this:
2026-04-12 06:05:12 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="'match' section requires argument, in section match"
Always validate the config before applying it. Fluentd has a built-in dry-run mode that catches parse errors without affecting the running service:
fluentd --dry-run -c /etc/td-agent/td-agent.conf
# On success:
# 2026-04-12 06:10:01 +0000 [info]: reading config file path="/etc/td-agent/td-agent.conf"
# 2026-04-12 06:10:01 +0000 [info]: starting fluentd-1.16.3 pid=24510 ruby="2.7.8"
# 2026-04-12 06:10:02 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.4.3'
# 2026-04-12 06:10:02 +0000 [info]: dry run finished
# On failure:
# 2026-04-12 06:10:01 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="..."
How to Fix It
Read the error message carefully — Fluentd's config errors are usually specific about what went wrong and where. The most common issues fall into a few categories.
Typo in plugin name: Check your spelling and verify the plugin is actually installed.
td-agent-gem list | grep fluent-plugin-elasticsearch
# Install if missing:
td-agent-gem install fluent-plugin-elasticsearch
Mismatched open and close tags: Every <source>, <match>, <filter>, and <buffer> block needs a closing tag. Count them — nested directives like <buffer> are usually indented, so match leading whitespace too:
grep -cE '^[[:space:]]*<[a-z]' /etc/td-agent/td-agent.conf
grep -cE '^[[:space:]]*</' /etc/td-agent/td-agent.conf
# These numbers must match
Missing included files: If you use @include directives, confirm the referenced files exist:
grep '@include' /etc/td-agent/td-agent.conf
# @include /etc/td-agent/conf.d/*.conf
ls /etc/td-agent/conf.d/
Once fixed, validate and reload atomically:
fluentd --dry-run -c /etc/td-agent/td-agent.conf && systemctl reload td-agent
Root Cause 6: Corrupted Buffer Chunks
Why It Happens
Fluentd stores buffered data in chunk files — binary files encoded in MessagePack format. If a chunk file becomes corrupted due to an ungraceful shutdown, a disk write error, or a Fluentd version upgrade that changed the chunk format, Fluentd will fail when it tries to read and flush that chunk. The typical symptom is a crash loop: Fluentd starts, tries to resume the corrupted chunk, raises an exception, and exits. Systemd restarts it, and the cycle repeats.
How to Identify It
grep -i 'chunk\|broken\|corrupt' /var/log/td-agent/td-agent.log | grep -i error
# 2026-04-12 07:00:03 +0000 [error]: #0 found broken chunk file during resume. Deleted: path="/var/log/td-agent/buffer/app/5b2a1c3d4e5f.log"
# 2026-04-12 07:00:03 +0000 [error]: #0 failed to resume buffer error_class=Fluent::Plugin::Buffer::FileChunk::FileChunkError error="chunk is broken"
How to Fix It
Stop the service and remove the identified broken chunk files. Fluentd logs the exact path of chunks it considers broken, so target those specifically:
systemctl stop td-agent
rm /var/log/td-agent/buffer/app/5b2a1c3d4e5f.log
rm /var/log/td-agent/buffer/app/5b2a1c3d4e5f.log.meta
systemctl start td-agent
If the corruption is widespread, clear the entire buffer directory as described in the buffer overflow section. To prevent this from causing crash loops in the future, enable automatic broken-chunk skipping (available in Fluentd 1.14+):
<buffer tag>
@type file
path /var/log/td-agent/buffer/app
skip_broken_chunks true
</buffer>
Prevention
Preventing Fluentd crashes and log loss comes down to good configuration hygiene and real visibility into what the process is doing. A few practices that have consistently saved production deployments:
Always validate configs before deploying. Make fluentd --dry-run -c /etc/td-agent/td-agent.conf a mandatory gate in your config deployment pipeline. If you're using Ansible or Chef to manage configs, add a handler that runs the dry-run check before any reload fires. A failed dry-run should block the deployment.
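One way to structure that gate is sketched below. The deploy_config function is hypothetical scaffolding: in a real pipeline the validator command is fluentd --dry-run against the staged config, and the "promoted" branch would move the staged file into place and reload td-agent.

```shell
# Hypothetical deploy gate: promote a staged config only if validation
# passes; otherwise leave the running config untouched.
deploy_config() {
  if "$@"; then          # in production: fluentd --dry-run -c <staged conf>
    echo "promoted"      # in production: mv staged conf into place, reload
  else
    echo "blocked"
  fi
}
deploy_config true     # stands in for a passing dry run
deploy_config false    # stands in for a failing dry run
```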
Use file-backed buffers everywhere. Memory buffers are faster but they're gone the moment the process dies. File buffers survive restarts and give you a much better chance of not losing data during a crash. The performance difference is negligible for most log volumes.
Set overflow_action block. The alternatives are worse: throw_exception (the default) raises errors that inputs may not handle cleanly, and drop_oldest_chunk silently discards data. Blocking applies backpressure upstream — your sources slow down or buffer locally, which is recoverable. Dropped log records are not.
Monitor the right metrics. If you're not running fluent-plugin-prometheus, install it now. The three metrics that matter most are fluentd_output_status_retry_count, fluentd_output_status_buffer_queue_length, and fluentd_output_status_emit_records. Alert on retry_count increasing continuously and buffer_queue_length approaching its configured limit. A sustained drop in emit_records when log sources are active is a reliable early signal that something is wrong upstream.
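If you wire these metrics into Prometheus alerting, rules along these lines are a reasonable starting point — the thresholds are assumptions to tune against your own queue_limit_length and normal traffic levels:

```yaml
# Illustrative Prometheus alerting rules; thresholds are assumptions.
groups:
  - name: fluentd
    rules:
      - alert: FluentdRetriesClimbing
        expr: rate(fluentd_output_status_retry_count[5m]) > 0
        for: 10m
      - alert: FluentdBufferQueueNearLimit
        expr: fluentd_output_status_buffer_queue_length > 400
        for: 5m
```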
Pin plugin versions in production. Plugins are deployed via gems, and an uncontrolled td-agent-gem update can pull in a breaking change. Maintain a version list and only update after testing against staging traffic that mirrors production log patterns and volume.
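A minimal drift check might look like this — the pinned manifest and the sample "installed" output are assumptions; in practice the installed list would come from td-agent-gem list:

```shell
# Pinned versions you expect in production (assumed manifest)
pinned='fluent-plugin-elasticsearch 5.4.3
fluent-plugin-prometheus 2.1.0'
# Sample data; in practice derive from: td-agent-gem list
installed='fluent-plugin-elasticsearch 5.4.3
fluent-plugin-prometheus 2.2.0'
if [ "$pinned" = "$installed" ]; then
  echo "plugin versions match pins"
else
  echo "plugin version drift detected"
fi
```

Run it in CI or as a nightly cron, and treat any drift as a deploy blocker until the change has been tested.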
Configure systemd restart behavior sensibly. By default, systemd stops restarting a service after too many failures in a short window — which means a transient issue can leave Fluentd permanently stopped. Configure the limits to allow recovery from transient failures while alerting on persistent ones:
# /etc/systemd/system/td-agent.service.d/restart.conf
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=10s
Set retry_forever true for critical outputs. If your destination goes down for a few hours, you want Fluentd to keep the data buffered and retry indefinitely rather than giving up and discarding chunks after a fixed retry count. Pair this with a large total_limit_size so the buffer can hold data through a realistic outage window.
Fluentd is reliable when it's configured deliberately. Most of the crashes and silent data loss events I've debugged in production traced back to one of these root causes — and almost all of them were preventable with better buffer configuration, a working monitoring setup, and a config validation step wired into the deployment process.
