InfraRunBook

    Fluentd Crashing or Losing Logs

    Logging
    Published: Apr 11, 2026
    Updated: Apr 11, 2026

    Fluentd crashes and silent log loss are often caused by buffer overflow, plugin exceptions, OOM kills, network timeouts, or config errors. This guide covers how to diagnose and fix each one.


    Symptoms

    When Fluentd is misbehaving, the signs range from obvious to maddeningly subtle. The obvious case: the process exits, systemd throws it into a restart loop, and nothing flows to your destination — Elasticsearch, Loki, S3, or whatever you're shipping to. The subtle case is worse: Fluentd is running, your uptime monitoring is green, but log volume in Kibana has quietly dropped 40% and nobody noticed until an on-call engineer asked why a log-based alert never fired on an event that happened three hours ago.

    Here's what you'll typically see when Fluentd is crashing or losing logs:

    • The td-agent or fluentd systemd unit shows failed status or loops through activating → active → failed
    • /var/log/td-agent/td-agent.log is flooded with [warn] or [error] entries referencing buffer limits, connection failures, or plugin exceptions
    • Buffer files under /var/log/td-agent/buffer/ keep growing — sometimes into the gigabytes — without ever being flushed
    • Memory on the Fluentd host climbs steadily until the OOM killer fires
    • The fluentd_output_status_retry_count Prometheus counter increments continuously
    • Specific log sources stop appearing in your aggregator while others continue normally, pointing to a per-plugin failure rather than a global one

    None of these symptoms are mutually exclusive. In my experience, a single root cause — say, a slow Elasticsearch cluster — can trigger buffer overflow, which then triggers memory pressure, which eventually causes an OOM crash. You need to find the original cause, not just the loudest symptom.


    Root Cause 1: Buffer Overflow

    Why It Happens

    Fluentd uses a buffer system to absorb bursts of log traffic before flushing records downstream. Each output plugin has a buffer — either in-memory or file-backed — with a defined capacity. When that capacity is reached and the output can't drain fast enough, Fluentd has to make a decision: block, drop, or throw an error. The default overflow_action is throw_exception, which raises a BufferOverflowError back toward the input; depending on the input plugin, the records in flight can simply be lost, with nothing but warnings in the log. Logs disappear silently unless you're watching carefully.

    Buffer overflow usually happens when log volume spikes unexpectedly — a deployment that triggers verbose debug logging, a runaway process emitting thousands of lines per second, or a destination that slows down and lets the buffer fill from the back while the front keeps receiving.

    How to Identify It

    Check the Fluentd log for messages like these:

    2026-04-12 03:14:22 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:drop_oldest_chunk
    2026-04-12 03:14:22 +0000 [warn]: #0 buffer is full, chunk dropped tag="app.production" size=524288
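    If you want to quantify the damage rather than eyeball it, the warn lines above can be tallied. A minimal Python sketch, assuming the message shape shown above (adjust the regex to your actual log format):

```python
import re

# Rough tally of buffer-drop warnings from a td-agent.log stream.
# The regex assumes the "chunk dropped tag=... size=..." message shape
# shown above; adjust it to match your actual log lines.
DROP_RE = re.compile(r'\[warn\]: #\d+ buffer is full, chunk dropped '
                     r'tag="(?P<tag>[^"]+)" size=(?P<size>\d+)')

def tally_drops(lines):
    """Return {tag: total_bytes_dropped} for matching warn lines."""
    totals = {}
    for line in lines:
        m = DROP_RE.search(line)
        if m:
            tag = m.group("tag")
            totals[tag] = totals.get(tag, 0) + int(m.group("size"))
    return totals

sample = [
    '2026-04-12 03:14:22 +0000 [warn]: #0 buffer is full, chunk dropped tag="app.production" size=524288',
    '2026-04-12 03:14:23 +0000 [warn]: #0 buffer is full, chunk dropped tag="app.production" size=262144',
    '2026-04-12 03:14:24 +0000 [info]: #0 flushing buffer',
]
print(tally_drops(sample))  # {'app.production': 786432}
```

    Run against the real log file, a per-tag byte total like this tells you which sources are losing data and roughly how much.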

    Check buffer file sizes on disk:

    du -sh /var/log/td-agent/buffer/*
    # Output showing a full buffer:
    512M    /var/log/td-agent/buffer/app.production
    512M    /var/log/td-agent/buffer/system.syslog

    If you have the Prometheus plugin enabled, query the buffer queue length directly:

    curl -s http://localhost:24231/metrics | grep fluentd_output_status_buffer_queue_length
    # fluentd_output_status_buffer_queue_length{plugin_id="output_es",type="elasticsearch"} 512

    A queue at or near its configured queue_limit_length (default 512) confirms the overflow.

    How to Fix It

    Increase the buffer limits and change the overflow behavior in your config:

    <match app.**>
      @type elasticsearch
      host 10.10.1.50
      port 9200
    
      <buffer tag>
        @type file
        path /var/log/td-agent/buffer/app
        total_limit_size 2GB
        chunk_limit_size 32MB
        flush_interval 5s
        retry_max_times 20
        overflow_action block
      </buffer>
    </match>

    The key change here is overflow_action block. Instead of erroring out or dropping records when the buffer fills, Fluentd pauses ingestion and applies backpressure upstream. Blocking is almost always preferable to data loss. After editing, validate and reload:

    fluentd --dry-run -c /etc/td-agent/td-agent.conf
    systemctl reload td-agent

    If you need to clear stuck or unflushable buffer files, stop the service, remove the buffer directory, and restart. Understand that this means losing whatever was buffered but not yet sent — only do it when you've confirmed the data is unrecoverable or you accept the loss.

    systemctl stop td-agent
    rm -rf /var/log/td-agent/buffer/
    systemctl start td-agent

    Root Cause 2: Plugin Error

    Why It Happens

    Fluentd's plugin ecosystem is one of its greatest strengths — and a frequent source of crashes. Plugins are Ruby code, and Ruby code can raise exceptions. When a plugin's write or process method throws an unhandled exception — a nil dereference, an unexpected API response format, a gem dependency version mismatch — the worker thread crashes, taking down log delivery for every tag that worker was handling.

    I've seen this most often with fluent-plugin-elasticsearch when Elasticsearch changes its API response shape across major versions, and with custom parser plugins that don't handle malformed log lines gracefully. One bad log line can take down the entire output if the plugin isn't written defensively.

    How to Identify It

    Plugin errors show up in the Fluentd log as Ruby stack traces:

    2026-04-12 04:02:11 +0000 [error]: #0 unexpected error error_class=NoMethodError error="undefined method `[]' for nil:NilClass"
      /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-elasticsearch-5.2.4/lib/fluent/plugin/out_elasticsearch.rb:812:in `client_usable?'
      /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-elasticsearch-5.2.4/lib/fluent/plugin/out_elasticsearch.rb:791:in `write'
      /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1203:in `try_flush'

    Filter for non-buffer errors and look at the tail:

    grep -E '\[error\]|\[warn\]' /var/log/td-agent/td-agent.log | grep -v buffer | tail -50

    Check which plugin version is installed:

    td-agent-gem list | grep fluent-plugin
    # fluent-plugin-elasticsearch (5.2.4)
    # fluent-plugin-prometheus (2.1.0)
    # fluent-plugin-rewrite-tag-filter (2.4.0)

    How to Fix It

    Update the offending plugin to its latest stable version:

    td-agent-gem update fluent-plugin-elasticsearch
    # Verify the new version:
    td-agent-gem list | grep fluent-plugin-elasticsearch
    # fluent-plugin-elasticsearch (5.4.3)

    If the problem is malformed log lines crashing a parser, use Fluentd's built-in error routing to redirect bad records instead of letting them crash the plugin:

    <filter app.**>
      @type parser
      key_name log
      reserve_data true
      emit_invalid_record_to_error true
    
      <parse>
        @type json
      </parse>
    </filter>
    
    <label @ERROR>
      <match **>
        @type file
        path /var/log/td-agent/parse_errors/
      </match>
    </label>

    Setting emit_invalid_record_to_error true routes malformed records to the error stream (the @ERROR label) instead of raising an exception. After making changes, reload:

    systemctl reload td-agent

    Root Cause 3: Memory Limit Hit

    Why It Happens

    Fluentd is a Ruby process, and Ruby's garbage collector isn't always aggressive about returning memory to the OS. If you're running Fluentd inside a container with a memory limit, under systemd with MemoryMax set, or on a host where other services are competing for RAM, the process will eventually hit its ceiling. When that happens, the Linux OOM killer terminates it — no warning, no graceful shutdown, no buffer flush. Whatever logs were held in memory at the time of termination are gone.

    This is especially common when using in-memory buffers (@type memory), large chunk sizes, or filters that expand record size significantly — like adding full Kubernetes metadata to every log line.

    How to Identify It

    Check the kernel ring buffer for OOM kill events:

    dmesg | grep -i 'oom\|killed process' | tail -20
    # [428743.112345] Out of memory: Kill process 18432 (ruby) score 892 or sacrifice child
    # [428743.113001] Killed process 18432 (ruby) total-vm:4194304kB, anon-rss:3145728kB, file-rss:8192kB

    Confirm via journald that it was the td-agent service that was killed:

    journalctl -u td-agent --since '1 hour ago' | grep -E 'Killed|OOM|memory'
    # Apr 12 04:17:32 sw-infrarunbook-01 systemd[1]: td-agent.service: Main process exited, code=killed, status=9/KILL

    Check current RSS usage of the running process:

    ps aux | grep fluentd
    # infrarunbook-admin  18901  45.2 12.8 4200000 2097152 ?  Sl   03:00   8:42 /usr/bin/ruby /usr/sbin/fluentd
    
    awk '/VmRSS/{print $2/1024 " MB"}' /proc/18901/status
    # 2048 MB

    How to Fix It

    The most impactful change is switching from in-memory buffers to file-backed buffers, which don't hold data in the Ruby heap:

    <buffer tag>
      @type file          # Was: @type memory
      path /var/log/td-agent/buffer/app
      chunk_limit_size 8MB
      total_limit_size 512MB
    </buffer>

    If you're running under systemd, give Fluentd a realistic memory ceiling and disable swap to prevent slow death-by-swap instead of a clean OOM kill:

    # /etc/systemd/system/td-agent.service.d/memory.conf
    [Service]
    MemoryMax=1G
    MemorySwapMax=0

    # Apply the drop-in:
    systemctl daemon-reload
    systemctl restart td-agent

    For Kubernetes environments, use a hostPath volume for the buffer directory so buffer data survives container restarts. Set resource requests and limits based on actual observed usage, not guesses. You can also nudge Ruby's GC to run more aggressively:

    # /etc/default/td-agent (Debian/Ubuntu) or /etc/sysconfig/td-agent (RHEL)
    RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9
    RUBY_GC_MALLOC_LIMIT=4000000

    Root Cause 4: Network Timeout to Destination

    Why It Happens

    Fluentd forwards logs to a downstream destination — Elasticsearch, an S3-compatible store, a remote syslog endpoint, another Fluentd aggregator. If that destination becomes slow or unreachable, the output plugin starts timing out on each flush attempt. Fluentd will retry with exponential backoff. Meanwhile, new logs keep arriving. The buffer fills from the back while flushes stall on the front. Eventually you hit buffer overflow on top of the timeout problem, and now you're dealing with both simultaneously.

    Network timeouts are particularly insidious because Fluentd may look healthy — the process is up, the config is valid — but it's quietly accumulating unflushable chunks or silently discarding records after retry exhaustion.

    How to Identify It

    The Fluentd log will show connection refused or timeout errors with escalating retry intervals:

    2026-04-12 05:30:01 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=30 chunk="5b2a1c..." error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Can not reach Elasticsearch cluster"
    2026-04-12 05:30:31 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=60 chunk="5b2a1c..."
    2026-04-12 05:31:31 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=120 chunk="5b2a1c..."
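    The escalating next_retry_seconds values follow Fluentd's exponential backoff: each retry roughly doubles the wait, capped at retry_max_interval. A sketch of that schedule (the 30s starting wait and 300s cap are illustrative, not defaults):

```python
# Sketch of the exponential backoff schedule behind the log lines above:
# each retry waits roughly retry_wait * base**n seconds, capped at
# retry_max_interval. The starting wait and cap here are illustrative.
def backoff_schedule(retry_wait, base, max_interval, retries):
    return [min(retry_wait * base ** n, max_interval) for n in range(retries)]

# With a 30s initial wait, doubling, and a 300s cap:
print(backoff_schedule(30, 2, 300, 6))  # [30, 60, 120, 240, 300, 300]
```

    Once the cap is reached, Fluentd retries at a fixed interval, which is why a long outage shows a steady stream of identical retry warnings.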

    Confirm the destination is unreachable from the Fluentd host itself:

    curl -v http://10.10.1.50:9200/_cluster/health
    # * connect to 10.10.1.50 port 9200 failed: Connection refused
    
    nc -zv 10.10.1.50 9200
    # nc: connect to 10.10.1.50 port 9200 (tcp) failed: Connection refused

    Check the retry counter via Prometheus metrics to confirm how long the backoff has been running:

    curl -s http://localhost:24231/metrics | grep retry
    # fluentd_output_status_retry_count{plugin_id="output_es",type="elasticsearch"} 47
    # fluentd_output_status_retry_wait{plugin_id="output_es",type="elasticsearch"} 120.0

    A retry count climbing with a long retry_wait confirms you're in a timeout and backoff loop.

    How to Fix It

    Restoring connectivity to the destination is outside Fluentd's control — investigate the downstream service. But you can configure Fluentd to hold data reliably through extended outages instead of giving up and discarding chunks:

    <match app.**>
      @type elasticsearch
      host 10.10.1.50
      port 9200
      request_timeout 30s
    
      <buffer tag>
        @type file
        path /var/log/td-agent/buffer/app
        total_limit_size 2GB
        flush_interval 10s
        retry_type exponential_backoff
        retry_max_interval 300s
        retry_forever true
      </buffer>
    </match>

    Setting retry_forever true means Fluentd will never discard buffered chunks due to retry exhaustion — it keeps trying until the destination comes back. Pair this with a generous total_limit_size so the buffer can absorb data through a multi-hour outage.
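    Sizing total_limit_size is simple arithmetic: sustained ingest rate multiplied by the outage window you want to survive. A hedged sketch with illustrative numbers:

```python
# How big does total_limit_size need to be to ride out an outage?
# bytes_needed = ingest rate * outage duration. Numbers are illustrative;
# measure your real sustained rate before sizing.
def buffer_size_gb(mb_per_sec, outage_hours):
    return mb_per_sec * 3600 * outage_hours / 1024

# 0.5 MB/s of logs through a 4-hour Elasticsearch outage:
print(round(buffer_size_gb(0.5, 4), 2))  # 7.03 (GB)
```

    At that rate, the 2GB in the config above survives roughly an hour; size for your realistic worst-case outage, not the average one.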

    For multi-destination resilience, use the copy output plugin with a local file fallback:

    <match app.**>
      @type copy
    
      <store>
        @type elasticsearch
        host 10.10.1.50
        port 9200
        <buffer tag>
          @type file
          path /var/log/td-agent/buffer/es_primary
          retry_forever true
          total_limit_size 1GB
        </buffer>
      </store>
    
      <store ignore_error>
        @type file
        path /var/log/td-agent/fallback/app
      </store>
    </match>

    The ignore_error flag on the secondary store means a failure there won't affect the primary output. You can replay from the fallback file manually once the primary destination recovers.
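    For the manual replay, note that the file output's default out_file format writes one event per line as tab-separated time, tag, and JSON record. A sketch that parses those lines back into events, assuming that default format (verify against your actual fallback files before replaying):

```python
import json

# Hedged sketch: the file output's default "out_file" format writes one
# event per line as tab-separated time, tag, and JSON record. This parses
# those lines back into events for manual replay (e.g. feeding them to
# fluent-cat or a bulk indexing request).
def parse_fallback_line(line):
    timestamp, tag, payload = line.rstrip("\n").split("\t", 2)
    return timestamp, tag, json.loads(payload)

line = '2026-04-12T05:30:01+00:00\tapp.production\t{"level":"info","msg":"checkout ok"}'
ts, tag, record = parse_fallback_line(line)
print(tag, record["msg"])  # app.production checkout ok
```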


    Root Cause 5: Config Parse Error

    Why It Happens

    This is the most common cause after any config change, and also one of the most disruptive. After editing a config file — adding a new filter, updating a destination address, rotating a credential — Fluentd must be reloaded or restarted to pick up the change. If the config has a syntax error, Fluentd refuses to start and all log collection stops cold. The config DSL is not forgiving: a typo in a plugin name, a mismatched tag, an @include pointing to a file that doesn't exist, or a missing closing directive will all cause a parse failure and a hard exit.

    How to Identify It

    When Fluentd fails to start due to a config error, the message appears in the journal:

    journalctl -u td-agent -n 30
    # Apr 12 06:00:01 sw-infrarunbook-01 fluentd[24401]: 2026-04-12 06:00:01 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="Unknown output plugin 'elasticsaerch'. Run 'gem search -rd fluent-plugin' to find plugins"
    # Apr 12 06:00:01 sw-infrarunbook-01 systemd[1]: td-agent.service: Main process exited, code=exited, status=1/FAILURE

    A malformed match block looks like this:

    2026-04-12 06:05:12 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="'match' section requires argument, in section match"

    Always validate the config before applying it. Fluentd has a built-in dry-run mode that catches parse errors without affecting the running service:

    fluentd --dry-run -c /etc/td-agent/td-agent.conf
    
    # On success:
    # 2026-04-12 06:10:01 +0000 [info]: reading config file path="/etc/td-agent/td-agent.conf"
    # 2026-04-12 06:10:01 +0000 [info]: starting fluentd-1.16.3 pid=24510 ruby="2.7.8"
    # 2026-04-12 06:10:02 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.4.3'
    # 2026-04-12 06:10:02 +0000 [info]: dry run finished
    
    # On failure:
    # 2026-04-12 06:10:01 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="..."

    How to Fix It

    Read the error message carefully — Fluentd's config errors are usually specific about what went wrong and where. The most common issues fall into a few categories.

    Typo in plugin name: Check your spelling and verify the plugin is actually installed.

    td-agent-gem list | grep fluent-plugin-elasticsearch
    
    # Install if missing:
    td-agent-gem install fluent-plugin-elasticsearch

    Mismatched open and close tags: Every <source>, <match>, <filter>, and <buffer> block needs a closing tag. Count them:

    grep -c '^<[a-z]' /etc/td-agent/td-agent.conf
    grep -c '^</[a-z]' /etc/td-agent/td-agent.conf
    # These numbers must match
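    The grep counts only prove the totals match; they won't catch a <match> closed by a stray </filter>. A small Python sketch that walks the config and checks that each close matches the most recent open (a rough check, not a full Fluentd config parser):

```python
import re

# Walks a Fluentd config, pushing each opening directive and popping on
# each close, so it catches mismatched names and unclosed blocks, not
# just unequal totals. A rough sketch, not a full config parser.
OPEN_RE = re.compile(r'^\s*<(\w+)[^/>]*>\s*$')
CLOSE_RE = re.compile(r'^\s*</(\w+)>\s*$')

def check_balance(config_text):
    stack = []
    for lineno, line in enumerate(config_text.splitlines(), 1):
        m = OPEN_RE.match(line)
        if m:
            stack.append((m.group(1), lineno))
            continue
        m = CLOSE_RE.match(line)
        if m:
            if not stack or stack[-1][0] != m.group(1):
                return f"line {lineno}: unexpected </{m.group(1)}>"
            stack.pop()
    if stack:
        name, lineno = stack[-1]
        return f"line {lineno}: <{name}> never closed"
    return "balanced"

bad = "<match app.**>\n  @type stdout\n"  # missing </match>
print(check_balance(bad))  # line 1: <match> never closed
```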

    Missing included files: If you use @include directives, confirm the referenced files exist:

    grep '@include' /etc/td-agent/td-agent.conf
    # @include /etc/td-agent/conf.d/*.conf
    ls /etc/td-agent/conf.d/

    Once fixed, validate and reload atomically:

    fluentd --dry-run -c /etc/td-agent/td-agent.conf && systemctl reload td-agent

    Root Cause 6: Corrupted Buffer Chunks

    Why It Happens

    Fluentd stores buffered data in chunk files — binary files encoded in MessagePack format. If a chunk file becomes corrupted due to an ungraceful shutdown, a disk write error, or a Fluentd version upgrade that changed the chunk format, Fluentd will fail when it tries to read and flush that chunk. The typical symptom is a crash loop: Fluentd starts, tries to resume the corrupted chunk, raises an exception, and exits. Systemd restarts it, and the cycle repeats.

    How to Identify It

    grep -i 'chunk\|broken\|corrupt' /var/log/td-agent/td-agent.log | grep -i error
    # 2026-04-12 07:00:03 +0000 [error]: #0 found broken chunk file during resume. Deleted: path="/var/log/td-agent/buffer/app/5b2a1c3d4e5f.log"
    # 2026-04-12 07:00:03 +0000 [error]: #0 failed to resume buffer error_class=Fluent::Plugin::Buffer::FileChunk::FileChunkError error="chunk is broken"

    How to Fix It

    Stop the service and remove the identified broken chunk files. Fluentd logs the exact path of chunks it considers broken, so target those specifically:

    systemctl stop td-agent
    rm /var/log/td-agent/buffer/app/5b2a1c3d4e5f.log
    rm /var/log/td-agent/buffer/app/5b2a1c3d4e5f.log.meta
    systemctl start td-agent

    If the corruption is widespread, clear the entire buffer directory as described in the buffer overflow section. Recent Fluentd releases also soften this failure mode on their own: chunks found broken during resume are logged and removed or backed up automatically (that's the "found broken chunk file during resume. Deleted" message shown above), so a persistent crash loop usually points at a chunk that passes the resume check but fails later during flush; those are the ones to delete by hand.

    Prevention

    Preventing Fluentd crashes and log loss comes down to good configuration hygiene and real visibility into what the process is doing. A few practices that have consistently saved production deployments:

    Always validate configs before deploying. Make fluentd --dry-run -c /etc/td-agent/td-agent.conf a mandatory gate in your config deployment pipeline. If you're using Ansible or Chef to manage configs, add a handler that runs the dry-run check before any reload fires. A failed dry-run should block the deployment.

    Use file-backed buffers everywhere. Memory buffers are faster but they're gone the moment the process dies. File buffers survive restarts and give you a much better chance of not losing data during a crash. The performance difference is negligible for most log volumes.

    Set overflow_action block. The default throw_exception raises errors toward the input, and the drop_oldest_chunk alternative silently discards data; either way, records can be lost. Blocking applies backpressure upstream — your sources slow down or buffer locally, which is recoverable. Dropped log records are not.

    Monitor the right metrics. If you're not running fluent-plugin-prometheus, install it now. The three metrics that matter most are fluentd_output_status_retry_count, fluentd_output_status_buffer_queue_length, and fluentd_output_status_emit_records. Alert on retry_count increasing continuously and buffer_queue_length approaching its configured limit. A sustained drop in emit_records when log sources are active is a reliable early signal that something is wrong upstream.
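    To wire those metrics into an alert, you only need the Prometheus text format and the queue-vs-limit ratio. A sketch assuming illustrative metric samples and an assumed configured limit of 512 chunks:

```python
import re

# Minimal parse of the Prometheus text exposition format for the metrics
# discussed above, plus the alert condition: queue length near its limit.
# The sample values and the 512-chunk limit are assumptions.
METRIC_RE = re.compile(r'^(\w+)\{[^}]*\}\s+([\d.]+)$')

def parse_metrics(text):
    out = {}
    for line in text.strip().splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            out[m.group(1)] = float(m.group(2))
    return out

sample = """
fluentd_output_status_retry_count{plugin_id="output_es"} 47
fluentd_output_status_buffer_queue_length{plugin_id="output_es"} 490
fluentd_output_status_emit_records{plugin_id="output_es"} 1048576
"""
metrics = parse_metrics(sample)
queue_limit = 512  # assumed configured limit
alert = metrics["fluentd_output_status_buffer_queue_length"] / queue_limit > 0.9
print(alert)  # True
```

    In practice you'd express the same ratio as a PromQL alert rule rather than polling from a script; the threshold logic is identical.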

    Pin plugin versions in production. Plugins are deployed via gems, and an uncontrolled td-agent-gem update can pull in a breaking change. Maintain a version list and only update after testing against staging traffic that mirrors production log patterns and volume.

    Configure systemd restart behavior sensibly. By default, systemd stops restarting a service after too many failures in a short window — which means a transient issue can leave Fluentd permanently stopped. Configure the limits to allow recovery from transient failures while alerting on persistent ones:

    # /etc/systemd/system/td-agent.service.d/restart.conf
    [Unit]
    StartLimitIntervalSec=300
    StartLimitBurst=5

    [Service]
    Restart=on-failure
    RestartSec=10s

    Set retry_forever true for critical outputs. If your destination goes down for a few hours, you want Fluentd to keep the data buffered and retry indefinitely rather than giving up and discarding chunks after a fixed retry count. Pair this with a large total_limit_size so the buffer can hold data through a realistic outage window.

    Fluentd is reliable when it's configured deliberately. Most of the crashes and silent data loss events I've debugged in production traced back to one of these root causes — and almost all of them were preventable with better buffer configuration, a working monitoring setup, and a config validation step wired into the deployment process.

    Frequently Asked Questions

    How do I check if Fluentd is silently dropping logs?

    Run `grep 'overflow\|chunk dropped\|buffer is full' /var/log/td-agent/td-agent.log` and check the `fluentd_output_status_retry_count` Prometheus metric. A climbing retry count combined with buffer queue length near the configured limit means Fluentd is discarding records. Also compare log ingestion rates at your destination against expected volume from your sources.

    What is the safest overflow_action setting for Fluentd buffers?

    Use `overflow_action block`. The default `throw_exception` raises errors toward the input, and the `drop_oldest_chunk` alternative silently discards data when the buffer is full. Blocking applies backpressure to upstream log sources, which is recoverable. Lost log records are not. Combine it with a large `total_limit_size` so the buffer has room to absorb bursts before blocking kicks in.

    How do I stop Fluentd from crashing when it encounters malformed log lines?

    In your filter block, set `emit_invalid_record_to_error true` and `reserve_data true` on the parser plugin. Then add a `<label @ERROR>` block to route bad records to a file instead of letting them raise an exception. This prevents a single malformed line from crashing the worker thread handling that tag.

    What causes Fluentd to enter a restart loop immediately after starting?

    The two most common causes are a config parse error (check `journalctl -u td-agent -n 30` for a ConfigError message) and a corrupted buffer chunk that raises an exception during buffer resume. For the latter, stop the service, identify the broken chunk files from the log output, delete them and their .meta counterparts, then restart.

    How can I make Fluentd survive a long destination outage without losing logs?

    Set `retry_forever true` and use a file-backed buffer with a large `total_limit_size` (e.g., 2GB or more). This keeps Fluentd retrying indefinitely while storing incoming data on disk. When the destination recovers, Fluentd will automatically flush the accumulated buffer. Combine this with a secondary fallback output using the `copy` plugin for extra resilience.
