What the ELK Stack Is
The ELK Stack — Elasticsearch, Logstash, and Kibana — has become the de facto standard for centralized log management in production infrastructure. I've deployed ELK in environments ranging from small 10-node clusters to multi-datacenter setups ingesting hundreds of gigabytes of logs per day, and the architecture concepts remain consistent regardless of scale. Once you understand how data actually moves through the system, tuning and troubleshooting become intuitive rather than guesswork.
ELK is three distinct tools that complement each other. Elasticsearch is the distributed search and analytics engine. It stores your log data as JSON documents in indices and provides full-text search, aggregations, and near-real-time query capabilities. Logstash is the data processing pipeline — it ingests data from multiple sources, transforms and enriches it, then ships it to a destination, almost always Elasticsearch. Kibana is the visualization layer. It sits on top of Elasticsearch and provides dashboards, search interfaces, alerting, and the Discover view for ad-hoc log exploration.
The modern stack has evolved well beyond just those three components. Elastic introduced Beats — lightweight data shippers — which transformed the data collection layer entirely. Filebeat, Metricbeat, Packetbeat, and Winlogbeat run on endpoints and ship data either directly to Elasticsearch or to Logstash for processing first. The full ecosystem is now officially called the Elastic Stack to reflect this expansion. When engineers say ELK, they still generally mean Elasticsearch as the core with Logstash handling ingestion and Kibana providing the UI — but in practice, most production deployments add Beats at the edge and increasingly adopt Elastic Agent, the unified successor to standalone Beats, as the collection layer.
How the Data Flow Works
Understanding the data flow is where things get interesting — and where misconfigured pipelines cause headaches. At the highest level, data moves through five stages: Source → Collector → Processor → Storage → Visualization. Each stage has distinct responsibilities, and understanding those boundaries is what separates engineers who can actually debug a broken pipeline from those who just restart services and hope.
Stage 1: Log Sources
Your sources are everything generating data: application servers, network devices, firewalls, databases, container orchestrators. On sw-infrarunbook-01 (172.16.10.50), you might have an Nginx access log, a custom application log, and systemd journal entries all needing to be collected and centralized. Each source has its own format quirks — structured JSON from modern apps, space-delimited fields from Nginx, syslog RFC 5424 from network gear. The pipeline needs to handle all of them.
Stage 2: Collection with Filebeat
Filebeat runs on the source host. It tails log files, tracks its position using a registry file so it doesn't re-read events after a restart, and ships events forward. Filebeat is deliberately lightweight — it's designed to do minimal processing and leave the heavy lifting to Logstash or Elasticsearch ingest pipelines downstream.
A typical Filebeat configuration for sw-infrarunbook-01:
```yaml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log
    tags: ["nginx", "web"]

  - type: log
    enabled: true
    paths:
      - /var/log/app/solvethenetwork-api.log
    tags: ["api", "application"]
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after

output.logstash:
  hosts: ["172.16.10.10:5044"]
```
The multiline configuration is critical for application logs where stack traces span multiple lines. Without it, each line of a Java exception becomes a separate event in Elasticsearch — a nightmare to correlate after the fact. The pattern anchors on a timestamp at the start of a new line; any line that doesn't begin with a timestamp gets appended to the previous event as part of the same log entry. Get this wrong and your exceptions show up as dozens of disconnected fragments.
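The negate/match semantics are easier to see in a quick sketch. Here's a minimal Python illustration of the grouping rule that `multiline.negate: true` with `multiline.match: after` expresses — any line that doesn't start with a timestamp joins the previous event. The log lines are invented, and this is an illustration of the rule, not Filebeat's actual implementation:

```python
import re

# Same anchor as the Filebeat config: a new event starts at a YYYY-MM-DD timestamp
PATTERN = re.compile(r'^\d{4}-\d{2}-\d{2}')

def group_multiline(lines):
    events = []
    for line in lines:
        if PATTERN.match(line) or not events:
            events.append(line)           # timestamp found: start a new event
        else:
            events[-1] += "\n" + line     # continuation line: append to previous event
    return events

raw = [
    "2026-04-08 14:02:11 ERROR Unhandled exception",
    "java.lang.NullPointerException: null",
    "    at com.example.Handler.process(Handler.java:42)",
    "2026-04-08 14:02:12 INFO Request completed",
]
events = group_multiline(raw)
# The three stack-trace lines collapse into one event: four raw lines, two events.
```

Invert `negate` or the pattern and the grouping flips, which is exactly how exceptions end up shredded into fragments.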
Stage 3: Processing with Logstash
Logstash is the transformation engine. It runs a pipeline with three sections: input, filter, and output. The input accepts data from Beats on port 5044. The filter section is where you parse raw log strings into structured fields, enrich events with additional context, and route different log types through different processing logic. The output ships the finished event to its destination.
A Logstash pipeline processing logs from sw-infrarunbook-01 at 172.16.10.50:
```
input {
  beats {
    port => 5044
  }
}

filter {
  if "nginx" in [tags] {
    grok {
      match => {
        "message" => '%{IPORHOST:client_ip} - %{DATA:user} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:response_code} %{NUMBER:bytes}'
      }
    }
    date {
      match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
      target => "@timestamp"
    }
    geoip {
      source => "client_ip"
    }
    mutate {
      convert => {
        "response_code" => "integer"
        "bytes" => "integer"
      }
      remove_field => ["timestamp"]
    }
  }

  if "api" in [tags] {
    json {
      source => "message"
    }
    if [level] == "ERROR" {
      mutate {
        add_tag => ["alert_candidate"]
      }
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://172.16.10.20:9200", "http://172.16.10.21:9200", "http://172.16.10.22:9200"]
    index => "logs-%{[tags][0]}-%{+YYYY.MM.dd}"
    user => "infrarunbook-admin"
    password => "${ES_PASSWORD}"
  }
}
```
The `grok` filter is the workhorse for unstructured log parsing. It uses named capture groups against regex patterns — `%{IPORHOST:client_ip}` extracts the source IP and assigns it to the `client_ip` field. The `date` filter parses the Nginx timestamp and sets it as the canonical `@timestamp`, which Elasticsearch uses for all time-based queries and index lifecycle management. Without this step, your events land with the ingest timestamp rather than the actual log timestamp — which means your time-range queries during any backfill or incident replay are completely wrong.
The output section lists all three Elasticsearch data nodes directly. Logstash will round-robin across them, which distributes indexing load and means the pipeline keeps working even if one node is briefly unavailable during a rolling restart.
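To make the grok step concrete, here's a plain-regex approximation of the pattern above applied to one invented Nginx access-log line. Grok's `%{IPORHOST}`, `%{HTTPDATE}`, and friends are pre-built regex fragments; this sketch inlines simplified equivalents, so it's an illustration of the mechanics rather than grok's exact behavior:

```python
import re

# Simplified stand-ins for the grok building blocks used in the pipeline
NGINX_RE = re.compile(
    r'(?P<client_ip>\S+) - (?P<user>\S*) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<response_code>\d+) (?P<bytes>\d+)'
)

line = '203.0.113.7 - alice [08/Apr/2026:14:02:11 +0000] "GET /api/v1/status HTTP/1.1" 200 512'
event = NGINX_RE.match(line).groupdict()
# event["response_code"] is still the string "200" here — the mutate/convert
# step in the pipeline is what turns it into an integer before indexing.
```

The named groups map one-to-one onto the fields the Logstash filter produces, which is why a failed grok match (a `_grokparsefailure` tag) usually means the log format drifted from the pattern.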
Stage 4: Elasticsearch Indexing
When an event arrives at Elasticsearch, it gets indexed into a shard. Elasticsearch distributes shards across nodes in the cluster, providing both redundancy through replicas and query parallelism across primaries. The daily index pattern — `logs-nginx-2026.04.08` — is intentional design, not just convention. Daily indices let you delete old data cleanly through Index Lifecycle Management without running expensive delete-by-query operations against live indices.
In any serious deployment, the Elasticsearch cluster separates node roles. Master nodes manage cluster state — index creation, shard allocation, node membership — without handling data or search. Data nodes store shards and execute search queries. Coordinating nodes receive incoming search requests, fan them out to the relevant data nodes, and merge results before returning them to the client. Ingest nodes run Elasticsearch's built-in ingest pipelines, which provide an alternative to Logstash for lighter processing workloads.
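Role separation is expressed through the `node.roles` list in each node's elasticsearch.yml. As a sketch, four fragments — one per node type, with invented hostnames:

```yaml
# Dedicated master-eligible node: cluster state only, no shards, no searches
node.name: es-master-01
node.roles: [master]

# Data node: stores shards and executes queries
node.name: es-data-01
node.roles: [data]

# Coordinating-only node: the empty list — fans out requests, merges results
node.name: es-coord-01
node.roles: []

# Ingest node: runs built-in ingest pipelines
node.name: es-ingest-01
node.roles: [ingest]
```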
A three-node cluster at 172.16.10.20 through 172.16.10.22 handles both master election and data storage in smaller deployments. For production you want at least three dedicated master-eligible nodes to avoid split-brain. The elasticsearch.yml configuration for the first node:
```yaml
cluster.name: infrarunbook-prod
node.name: es-node-01
node.roles: [master, data]
network.host: 172.16.10.20
discovery.seed_hosts:
  - 172.16.10.20
  - 172.16.10.21
  - 172.16.10.22
cluster.initial_master_nodes:
  - es-node-01
  - es-node-02
  - es-node-03
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
```
The `cluster.initial_master_nodes` setting is used only on the very first cluster bootstrap. Once the cluster is formed, you remove it. Leaving it in place on subsequent restarts can cause issues — it's a one-time bootstrapping directive, not a persistent configuration.
Stage 5: Visualization with Kibana
Kibana connects to Elasticsearch and provides the interface your team actually uses. The Discover tab lets you search logs in near-real-time using KQL or Lucene syntax. Dashboards aggregate data into visualizations — time series charts for request rates, pie charts for HTTP response code distributions, data tables for the top error messages by frequency. Kibana also runs the alerting engine, which can fire notifications when log patterns match threshold conditions.
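A couple of illustrative KQL queries against the fields the Logstash pipeline above produces (`host.name` is the standard host field Beats attaches to events; the hostname value here is just an example):

```
tags: "nginx" and response_code >= 500
tags: "alert_candidate" and not host.name: "sw-infrarunbook-01"
```

The first surfaces server errors across every web host at once; the second finds the API errors the pipeline flagged, excluding one host.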
Kibana listens on port 5601 and authenticates against Elasticsearch using the security realm you've configured. In production, you'd run Nginx or another reverse proxy in front of Kibana to handle TLS termination and restrict access to the internal network range — 172.16.0.0/12 in a typical RFC 1918 layout — so the raw Kibana port is never exposed directly.
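A minimal sketch of that front-end, assuming Nginx with invented certificate paths and hostname:

```nginx
server {
    listen 443 ssl;
    server_name kibana.solvethenetwork.com;

    ssl_certificate     /etc/nginx/tls/kibana.crt;
    ssl_certificate_key /etc/nginx/tls/kibana.key;

    # Internal networks only — the raw 5601 port is never exposed
    allow 172.16.0.0/12;
    deny  all;

    location / {
        proxy_pass http://127.0.0.1:5601;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
```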
Why It Matters
Centralized logging isn't a convenience — it's an operational necessity at any scale beyond a handful of servers. When an incident hits at 2 AM and you're correlating events across sw-infrarunbook-01, three application servers, a load balancer, and a database cluster, hunting through individual log files over SSH is not a strategy. It's a delay.
I've watched this play out more than once: an engineer spends 45 minutes during an active outage SSH-hopping between servers grepping log files, only to discover the root cause was in a service they didn't think to check first. With ELK, that investigation collapses to a single Kibana query — search for the error pattern across all hosts simultaneously, correlate on timestamps, trace the causal chain. What was 45 minutes of frantic grepping becomes a two-minute investigation.
Beyond incident response, ELK enables proactive detection. You can build alerting rules in Kibana that fire when error rates spike above a baseline, when a specific authentication failure pattern appears, or when log volume from a host drops unexpectedly. That last one is counterintuitive but important — a host that stops sending logs is often more concerning than a host sending error logs. Silence can mean a broken pipeline, a crashed logging agent, or a compromised system deliberately suppressing its traces.
For compliance and audit requirements, centralized log storage with automated retention solves a significant operational problem. Index Lifecycle Management lets you define that hot-tier indices get fast NVMe storage for 7 days, warm-tier gets cheaper spinning disk for 30 days, then data gets deleted or snapshotted to object storage. The whole lifecycle is automated — no manual cleanup scripts, no fragile cron jobs, no surprise disk-full alerts on log servers.
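An ILM policy implementing that schedule might look like the following sketch — ages follow the prose (hot for 7 days, warm for 30, then delete), while movement between storage hardware classes is left to data-tier allocation defaults, and the snapshot option is omitted:

```
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "delete": {
        "min_age": "37d",
        "actions": { "delete": {} }
      }
    }
  }
}
```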
Real-World Architecture: solvethenetwork.com Infrastructure
Here's how these pieces assemble into a working production deployment for the solvethenetwork.com infrastructure stack.
The edge layer runs Filebeat on every server in the 172.16.10.0/24 application subnet. Syslog from network equipment — switches, firewalls, routers — ships via UDP syslog to a dedicated Logstash syslog listener on 172.16.10.10 port 5140. Kubernetes pods write to stdout and stderr, and Filebeat with autodiscover picks those up directly from the container runtime log files on each worker node, automatically tagging events with the pod name, namespace, and container image.
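That syslog listener can be a small pipeline of its own. A minimal sketch of the input side — note the stock `syslog` input parses RFC 3164, so strict RFC 5424 senders may need a plain `udp` input plus a grok filter instead:

```
input {
  syslog {
    port => 5140
    tags => ["syslog", "network"]
  }
}
```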
Logstash runs on 172.16.10.10 and 172.16.10.11 behind a Layer 4 load balancer. Two Logstash nodes mean rolling pipeline updates without dropping events — Filebeat detects when a Logstash node is unavailable and backs off, retrying against the other node. Each Logstash instance runs with 4GB JVM heap and uses separate named pipelines for different log types, so a burst of syslog events doesn't starve the application log pipeline of worker threads.
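Named pipelines are defined in pipelines.yml. A sketch of the separation described above — config paths and worker counts are assumptions, not measured values:

```yaml
- pipeline.id: syslog
  path.config: "/etc/logstash/conf.d/syslog.conf"
  pipeline.workers: 2
  queue.type: persisted

- pipeline.id: applogs
  path.config: "/etc/logstash/conf.d/applogs.conf"
  pipeline.workers: 4
  queue.type: persisted
```

With separate pipelines, each gets its own worker pool and persistent queue, so backpressure in one log type doesn't stall the others.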
Index templates define field mappings before any data arrives. This isn't optional. Without explicit mappings, Elasticsearch's dynamic mapping will infer types — and it will get them wrong in ways that cause query failures later. A field mapped as `long` in one daily index and `keyword` in another breaks aggregations across the full index pattern. Set the mappings once in the template and every new index inherits them:
```
PUT _index_template/logs-nginx
{
  "index_patterns": ["logs-nginx-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs-nginx"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "client_ip": { "type": "ip" },
        "response_code": { "type": "integer" },
        "bytes": { "type": "long" },
        "request": { "type": "keyword" },
        "message": { "type": "text" }
      }
    }
  }
}
```
The ILM policy attached to these indices automatically moves data through hot, warm, and delete phases based on age and index size. Mapping `client_ip` as type `ip` rather than `keyword` unlocks IP range queries in Kibana — you can search for all traffic from a CIDR block without string matching gymnastics. Small mapping decisions like this have outsized impact on the query capabilities you have during an investigation.
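For example, with the `ip` mapping in place, a CIDR term query works directly — the index pattern and the CIDR block here are illustrative:

```
GET logs-nginx-*/_search
{
  "query": {
    "term": { "client_ip": "203.0.113.0/24" }
  }
}
```

Against a `keyword` field, the same search would require wildcard string matching, which is both slower and wrong at octet boundaries.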
Common Misconceptions
The first one I run into constantly: Logstash is required. It's not. You can ship Filebeat directly to Elasticsearch and use ingest pipelines for field parsing and enrichment. Ingest pipelines run inside Elasticsearch nodes and support grok, date parsing, geoip enrichment, and field manipulation — much of what Logstash does. For simpler pipelines or resource-constrained environments, eliminating Logstash reduces the operational surface area significantly. Logstash earns its keep when you need complex conditional routing, multiple outputs simultaneously, or transformation logic that exceeds what ingest pipelines cleanly support.
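As a point of comparison, the Nginx branch of the Logstash filter above translates almost one-to-one into an ingest pipeline. A hedged sketch — the pipeline name is invented, and Filebeat would reference it via `pipeline: nginx-access` in its Elasticsearch output:

```
PUT _ingest/pipeline/nginx-access
{
  "description": "Parse Nginx access logs without Logstash (illustrative)",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IPORHOST:client_ip} - %{DATA:user} \\[%{HTTPDATE:timestamp}\\] \"%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}\" %{NUMBER:response_code:int} %{NUMBER:bytes:int}"]
      }
    },
    { "date": { "field": "timestamp", "formats": ["dd/MMM/yyyy:HH:mm:ss Z"] } },
    { "geoip": { "field": "client_ip" } },
    { "remove": { "field": "timestamp" } }
  ]
}
```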
Second: ELK replaces your metrics stack. It doesn't, and conflating the two creates real problems. ELK is built for log events — discrete, timestamped text records. Prometheus and its ecosystem are built for time series metrics with high-cardinality labels and efficient range queries over numeric data. Metricbeat can ship system metrics to Elasticsearch and Kibana can render them, but the query patterns and storage characteristics differ fundamentally. I've seen teams try to centralize everything in Elasticsearch and hit cardinality scaling walls that a Prometheus deployment would handle without noticing. Use the right tool for each job: ELK for logs, Prometheus for metrics, distributed tracing for request traces.
Third: scaling Elasticsearch is just adding nodes. More nodes help, but the bottlenecks are often elsewhere. Too many small shards are worse than fewer appropriately sized ones — each shard is a Lucene index instance with its own file handles, memory overhead, and coordination cost. The practical target is shard sizes between 10GB and 50GB. Over-sharded clusters spend more CPU on coordination than on actual search. In my experience, clusters with 10,000 shards often perform worse than a well-configured three-node cluster with 300 shards. If you're hitting performance problems, audit your shard count and sizes before ordering more hardware.
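The audit itself is two `_cat` API calls — check where shard sizes actually fall relative to that 10GB–50GB target before ordering anything:

```
GET _cat/shards?v&h=index,shard,prirep,store,node&s=store:desc
GET _cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc
```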
Fourth: Logstash is always the ingestion bottleneck. It might be, but just as often the constraint is Elasticsearch's indexing throughput. Logstash scales horizontally without much complexity. Elasticsearch indexing throughput depends on shard count, refresh interval, and underlying storage I/O. If you're seeing indexing lag back up in Logstash's persistent queue, look at Elasticsearch node metrics first. Increasing `index.refresh_interval` from the default 1 second to 5 or 30 seconds for write-heavy indices reduces I/O significantly — new documents won't be searchable until the next refresh fires, but for bulk log ingestion that tradeoff is almost always acceptable.
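The refresh interval is a dynamic setting, so it can be changed on live indices without a restart — a sketch, assuming the index pattern from earlier:

```
PUT logs-nginx-*/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```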
Fifth — and this one has real security consequences: security is optional for internal clusters. This thinking is how Elasticsearch clusters end up accessible to anyone on the network segment, exposing log data that frequently contains API keys embedded in error messages, authentication tokens, internal hostnames, and user identifiers. X-Pack security has been included in Elasticsearch's free basic tier since version 7.1. It provides TLS on both the transport and HTTP layers, native realm authentication, and role-based access control. Enable it from day one when you bootstrap the cluster. Retrofitting TLS and authentication into a running production cluster — while keeping it available — is a painful rolling-restart exercise that's entirely avoidable.
The ELK Stack is mature, opinionated, and genuinely well-suited to the centralized logging problem. The architecture isn't magic — it's a well-designed pipeline from collection through storage to visualization, with each component doing a distinct job. Understanding each stage lets you troubleshoot when events go missing, tune performance when ingestion lags, and design indices that actually support the queries you'll need at 2 AM. Get the fundamentals right and the stack will carry serious production workloads without fighting you.
