Symptoms
You've just pushed data into Elasticsearch — maybe through Logstash, Filebeat, or a direct API call — and nothing shows up. The index simply doesn't exist. Running

GET _cat/indices?v

returns either a blank list or completely omits the index name you expected. Kibana's Discover tab greets you with "No results found." Your application logs are filling up with connection errors or 400/403 responses from the Elasticsearch REST API.
The specific error messages vary depending on the root cause, but here are the most common ones you'll encounter in the wild:
- HTTP 400: mapper_parsing_exception when an incoming document contradicts the index mapping
- HTTP 403: security_exception with action [indices:admin/create] is unauthorized
- HTTP 400: illegal_argument_exception around field types or alias configuration
- No error at all — the write silently fails because Logstash swallowed the exception and kept retrying into a dead end
Before chasing any specific cause, get a baseline snapshot of your cluster state with these two commands:
GET _cluster/health?pretty
GET _cat/indices?v&h=health,status,index,pri,rep,docs.count,store.size

If cluster health is red, index creation will fail universally. Yellow means some replicas are unassigned — that usually won't block new index creation, but it's worth noting before you proceed. Now let's walk through every root cause I've seen in production environments.
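A minimal sketch of how to act on that baseline, assuming you've already parsed the JSON from GET _cluster/health; the sample payload below is illustrative, not from a real cluster:

```python
# Map the cluster health status to what it means for index creation.
def triage(health: dict) -> str:
    status = health.get("status")
    if status == "red":
        return "red: a primary shard is unassigned; index creation will fail"
    if status == "yellow":
        return "yellow: replicas unassigned; new indices should still be created"
    return "green: look for a cause other than cluster health"

sample = {"status": "yellow", "unassigned_shards": 3}
print(triage(sample))  # yellow: replicas unassigned; new indices should still be created
```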
Root Cause 1: Index Template Missing
Why It Happens
Elasticsearch uses index templates to define settings and mappings for indices that match a given name pattern. When you rely on a template to configure shard counts, replica counts, or field mappings — and that template doesn't exist — Elasticsearch either falls back to system defaults (which may not match your expectations) or, when action.auto_create_index is disabled cluster-wide, refuses to create the index entirely. In my experience, this bites teams hardest right after a cluster migration or snapshot restore where templates weren't included in the restore scope. Someone restores the data but forgets include_global_state: true, and the templates vanish silently.
How to Identify It
Check which templates exist and whether any match your target index name pattern:
GET _index_template?pretty
GET _index_template/logs-*

If the second call returns a 404 or an empty result set, your template is gone. You can also probe the legacy template API, which some older pipelines still rely on:
GET _template?pretty

If the template exists but the index still isn't being created, verify that the template's index_patterns field actually matches your index name. I've seen cases where someone changed the index naming convention from logs-app-2024.01.01 to app-logs-2024.01.01 and forgot that the template pattern only covered logs-*. The template is present, the pattern is wrong, and the index gets no settings applied — or doesn't get created at all.
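A quick local sanity check that an index name is actually covered by a template's index_patterns. Elasticsearch template patterns here use only the * wildcard, which Python's fnmatch handles the same way; the names are the article's examples:

```python
from fnmatch import fnmatch

def pattern_matches(index: str, patterns: list[str]) -> bool:
    # True if any template pattern covers this index name.
    return any(fnmatch(index, p) for p in patterns)

print(pattern_matches("logs-app-2024.01.01", ["logs-*"]))   # True
print(pattern_matches("app-logs-2024.01.01", ["logs-*"]))   # False
```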
Simulate template application without actually creating the index to confirm what settings it would receive:
POST _index_template/_simulate_index/logs-app-2024.01.01

How to Fix It
Re-create the template. Here's a minimal working example for a logs index template:
PUT _index_template/logs-template
{
"index_patterns": ["logs-*"],
"priority": 100,
"template": {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"@timestamp": { "type": "date" },
"message": { "type": "text" },
"level": { "type": "keyword" },
"host": { "type": "keyword" }
}
}
}
}

After creating the template, trigger a manual index creation to confirm it applies cleanly:
PUT logs-test-verify-001
GET logs-test-verify-001/_settings
DELETE logs-test-verify-001

If you're running a snapshot-restore workflow, make sure templates are included going forward. The key flag is include_global_state:
POST _snapshot/my_backup/snapshot_1/_restore
{
"include_global_state": true
}

Store your templates in version control and apply them via a bootstrap script during cluster provisioning. Treat them like infrastructure code — because they are.
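A sketch of such a bootstrap script, assuming templates live as JSON files under an elasticsearch/templates/ directory (one file per template, file stem used as the template name) and that the cluster URL is a local placeholder:

```python
import json
import pathlib
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

def template_requests(directory: str) -> list[tuple[str, str]]:
    """Map each JSON file to the _index_template PUT path it should hit."""
    return [
        (f"/_index_template/{path.stem}", path.read_text())
        for path in sorted(pathlib.Path(directory).glob("*.json"))
    ]

def apply_templates(directory: str) -> None:
    for url_path, body in template_requests(directory):
        json.loads(body)  # fail fast on invalid JSON before sending it
        req = urllib.request.Request(
            ES_URL + url_path,
            data=body.encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)

# apply_templates("elasticsearch/templates")  # run against a live cluster
```

Wire this into your CI/CD pipeline so every deployment re-asserts the templates.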
Root Cause 2: Mapping Conflict
Why It Happens
Elasticsearch is strict about field types once a mapping is established. If you try to create an index — or write data to an auto-created index — and the inferred or declared field type conflicts with an existing template mapping, you'll get a mapper_parsing_exception and the index creation will be rejected. This happens most often when a field that's been mapped as integer in the template suddenly receives a string value like N/A, or when a field mapped as keyword receives a nested object. Dynamic mapping makes this worse because what worked on your first document might fail on your hundredth when the data shape changes.
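A toy illustration of the failure mode: check documents against the declared field types locally before indexing. This mimics the conflict, not Elasticsearch's actual validation (which also coerces some values); field names follow the article's example:

```python
# Per-type predicates for the two mapping types used in this article.
CHECKS = {
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "keyword": lambda v: isinstance(v, str),
}

def find_conflicts(doc: dict, mapping: dict) -> list[str]:
    # Report every field whose value contradicts its declared type.
    return [
        f"field [{field}] of type [{ftype}] rejects {doc[field]!r}"
        for field, ftype in mapping.items()
        if field in doc and ftype in CHECKS and not CHECKS[ftype](doc[field])
    ]

mapping = {"response_code": "integer", "level": "keyword"}
print(find_conflicts({"response_code": 200, "level": "info"}, mapping))   # []
print(find_conflicts({"response_code": "N/A", "level": "info"}, mapping))
# ["field [response_code] of type [integer] rejects 'N/A'"]
```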
How to Identify It
The error from Elasticsearch is usually explicit about which field is the problem:
{
"error": {
"root_cause": [
{
"type": "mapper_parsing_exception",
"reason": "failed to parse field [response_code] of type [integer] in document with id '1'. Preview of field's value: 'N/A'"
}
],
"type": "mapper_parsing_exception",
"reason": "failed to parse field [response_code] of type [integer] in document with id '1'. Preview of field's value: 'N/A'"
},
"status": 400
}

To see the current effective mapping that a new index would inherit from a template, use the simulate API and compare it against what your data actually looks like:
POST _index_template/_simulate_index/logs-app-2024.01.01
GET logs-existing-index/_mapping

How to Fix It
The fix depends on which side of the conflict is wrong. If your data is correct and the template mapping is too strict, update the template to use a compatible type. Switching a numeric field to keyword is a common and safe resolution when the field contains mixed content:
PUT _index_template/logs-template
{
"index_patterns": ["logs-*"],
"priority": 100,
"template": {
"mappings": {
"properties": {
"response_code": { "type": "keyword" }
}
}
}
}

Keep in mind that PUT _index_template replaces the entire template, so include the full settings and mappings, not just the changed field. Note also that you can't change the mapping on an existing index after the fact — you'd need to reindex the data into a new index with the corrected mapping:
POST _reindex
{
"source": { "index": "logs-app-old" },
"dest": { "index": "logs-app-corrected" }
}

If the data itself is malformed upstream, fix it at the source rather than hacking around it in Elasticsearch. In Logstash, use a mutate filter to cast or sanitize the field. In Filebeat, use a processor. Pushing bad data and relying on the index to tolerate it is a path to more pain later.
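What the upstream fix looks like, sketched in Python (in Logstash you'd express the same idea with a mutate or ruby filter): coerce response_code to an int, and move unparseable values into a separate field so the original value survives without breaking the integer mapping. The fallback field name is hypothetical:

```python
def sanitize(event: dict) -> dict:
    raw = event.get("response_code")
    try:
        event["response_code"] = int(raw)
    except (TypeError, ValueError):
        # Preserve the bad value under a hypothetical keyword field.
        event.pop("response_code", None)
        event["response_code_raw"] = str(raw)
    return event

print(sanitize({"response_code": "200"}))  # {'response_code': 200}
print(sanitize({"response_code": "N/A"}))  # {'response_code_raw': 'N/A'}
```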
Root Cause 3: Disk Watermark Reached
Why It Happens
Elasticsearch has built-in disk-based shard allocation thresholds that kick in automatically. When any node crosses the high watermark (default: 90% disk used), Elasticsearch stops allocating new shards to that node. When the flood stage watermark is hit (default: 95%), it goes further and enforces a read-only index block on all indices assigned to that node. At that point, new indices can't be created and existing ones can't be written to. The cluster isn't broken — it's protecting your data from corruption due to a full disk — but the effect from the application side looks like a total write outage.
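The tiered behavior can be summarized as a lookup. The low watermark default (85%) is not mentioned above but is the documented default alongside the 90% and 95% tiers; all three are configurable via the cluster.routing.allocation.disk.watermark.* settings:

```python
def disk_effect(used_percent: float) -> str:
    # Default thresholds; override if your cluster settings differ.
    if used_percent >= 95:
        return "flood stage: indices with shards on this node become read-only"
    if used_percent >= 90:
        return "high: shards are moved off this node, none newly allocated here"
    if used_percent >= 85:
        return "low: new replica shards are not allocated to this node"
    return "ok: below all watermarks"

print(disk_effect(96))  # flood stage: indices with shards on this node become read-only
print(disk_effect(40))  # ok: below all watermarks
```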
How to Identify It
This one's fast to diagnose. Check disk usage across all nodes:
GET _cat/allocation?v&h=node,disk.used,disk.avail,disk.percent,shards

Example output showing a node over the threshold:
node disk.used disk.avail disk.percent shards
sw-infrarunbook-01 450gb 50gb 90% 120
es-data-02 200gb 300gb 40% 80

Also check whether any index has been placed into read-only mode by the flood stage watermark trigger:
GET logs-app-2024.01.01/_settings
# You'll see this in the response if it's been locked:
# "index.blocks.read_only_allow_delete": "true"

To see what watermark thresholds are currently in effect:
GET _cluster/settings?include_defaults=true&filter_path=**.watermark

How to Fix It
First, actually free up disk space — delete old indices, move data to cheaper storage, or expand the disk. Running

GET _cat/indices?v&s=store.size:desc

will show you the largest indices to target. Once disk usage is back below the high watermark, remove the read-only block:
# Remove from a specific index
PUT logs-app-2024.01.01/_settings
{
"index.blocks.read_only_allow_delete": null
}
# Or remove from all indices at once
PUT _all/_settings
{
"index.blocks.read_only_allow_delete": null
}

As a temporary measure during an active incident — not a permanent fix — you can raise the watermark thresholds to give yourself breathing room while you sort out the disk situation:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "92%",
"cluster.routing.allocation.disk.watermark.high": "95%",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
}
Use transient here deliberately — transient settings don't survive a cluster restart, which is exactly the behavior you want for an emergency override. Reset them once the disk situation is resolved.
Root Cause 4: ILM Policy Error
Why It Happens
Index Lifecycle Management is powerful when it works. When it doesn't, it can silently block index creation in ways that are genuinely confusing to debug. The most common failure mode is a broken rollover configuration: your data stream or rollover alias is attached to an ILM policy, but either the policy doesn't exist, the rollover alias is misconfigured, or the ILM step has entered an error state mid-cycle. When ILM can't roll over the current write index to a new one, writes stall. New documents have nowhere to land, and the index that should have been created for the current time period simply never gets made.
How to Identify It
Start by checking the overall ILM status and the specific policy:
GET _ilm/status
GET _ilm/policy/logs-policy

Then use the explain API to see what ILM is actually doing with your index:
GET logs-app-000001/_ilm/explain

A stuck ILM phase will show step: ERROR in the response, along with a failed_step and a step_info block explaining the cause:
{
"indices": {
"logs-app-000001": {
"index": "logs-app-000001",
"managed": true,
"policy": "logs-policy",
"phase": "hot",
"action": "rollover",
"step": "ERROR",
"failed_step": "check-rollover-ready",
"step_info": {
"type": "illegal_argument_exception",
"reason": "index.lifecycle.rollover_alias [logs-app] does not point to index [logs-app-000001]"
}
}
}
}

How to Fix It
If the rollover alias is missing or misconfigured, re-create it with the correct is_write_index flag:
POST _aliases
{
"actions": [
{
"add": {
"index": "logs-app-000001",
"alias": "logs-app",
"is_write_index": true
}
}
]
}

If the ILM policy itself has a configuration problem, update it with corrected parameters:
PUT _ilm/policy/logs-policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "7d"
}
}
},
"delete": {
"min_age": "30d",
"actions": {
"delete": {}
}
}
}
}
}

After fixing the underlying cause, retry the failed step — this moves the stuck index out of its ERROR state and back onto the normal lifecycle path:

POST logs-app-000001/_ilm/retry

ILM runs on a polling interval (default: 10 minutes, set by indices.lifecycle.poll_interval), so the retried step may take a few minutes to execute. Also confirm that ILM itself is running; if it was stopped at some point, resume it:
POST _ilm/start

Root Cause 5: Insufficient Permissions
Why It Happens
If you're running Elasticsearch with security features enabled — and you should be, especially in any environment that touches real data — every API call requires authentication and authorization. A service account that has read privileges on an index pattern won't be able to create indices under it. This shows up constantly in environments where the Elasticsearch security configuration was recently tightened, or where a new ingestion pipeline was deployed without verifying that its service user actually has the create_index privilege. It also appears after role changes: someone updates the role definition to remove a privilege and forgets that three pipelines depend on it.
How to Identify It
The error is usually direct and explicit:
{
"error": {
"root_cause": [
{
"type": "security_exception",
"reason": "action [indices:admin/create] is unauthorized for user [infrarunbook-admin] with roles [logs-read-only] on indices [logs-app-2024.01.01], this action is granted by the index privileges [create_index,manage,all]"
}
],
    "type": "security_exception",
    "reason": "action [indices:admin/create] is unauthorized for user [infrarunbook-admin] with roles [logs-read-only] on indices [logs-app-2024.01.01], this action is granted by the index privileges [create_index,manage,all]"
  },
  "status": 403
}

Verify what roles a user currently has and what those roles actually grant:
GET _security/user/infrarunbook-admin
GET _security/role/logs-read-only

You can also use the has-privileges API to test permissions directly without guessing:
POST _security/user/infrarunbook-admin/_has_privileges
{
"index": [
{
"names": ["logs-*"],
"privileges": ["create_index", "write", "manage"]
}
]
}

The response will tell you exactly which privileges the user has and which they're missing, privilege by privilege.
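A small sketch of turning that response into an actionable list. The response shape follows the security API (per-pattern, per-privilege booleans); the sample payload is abbreviated and illustrative:

```python
def missing_privileges(resp: dict) -> list[str]:
    # Collect every privilege the _has_privileges response marks as denied.
    return [
        f"{pattern}: {priv}"
        for pattern, privs in resp.get("index", {}).items()
        for priv, granted in privs.items()
        if not granted
    ]

sample = {
    "has_all_requested": False,
    "index": {"logs-*": {"create_index": False, "write": True, "manage": False}},
}
print(missing_privileges(sample))  # ['logs-*: create_index', 'logs-*: manage']
```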
How to Fix It
Create or update a role with the correct index privileges, then assign it to the user or service account:
PUT _security/role/logs-writer
{
"indices": [
{
"names": ["logs-*"],
"privileges": ["create_index", "write", "manage", "read"]
}
]
}
PUT _security/user/infrarunbook-admin
{
"roles": ["logs-writer"]
}If you're using API keys instead of user accounts — which is the recommended pattern for Filebeat, Logstash, and other pipeline tools — generate a key with explicit role descriptors scoped to the minimum required access:
POST _security/api_key
{
"name": "logstash-ingest-key",
"role_descriptors": {
"logs-writer": {
"indices": [
{
"names": ["logs-*"],
"privileges": ["create_index", "write", "manage", "read"]
}
]
}
}
}

Store the returned API key in your secrets manager and configure it in your ingestion pipeline. Don't use the elastic superuser for ingestion pipelines — it's an operational anti-pattern and a serious security risk. If that credential is ever leaked, your entire cluster is exposed.
Root Cause 6: Cluster Health Is Red
Why It Happens
A red cluster health means at least one primary shard is unassigned. Elasticsearch won't create new indices when it can't guarantee shard placement across the cluster. This is often a cascading failure — a data node goes down, its primary shards become unassigned, and suddenly nothing new can be written anywhere. The cluster itself is in a degraded state and refuses to take on more work until the shard allocation problem is resolved.
How to Identify It
GET _cluster/health?pretty
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state

Look for shards in UNASSIGNED state with a reason of NODE_LEFT or ALLOCATION_FAILED. The cluster allocation explain API gives the most detailed answer on why a specific shard won't assign:
GET _cluster/allocation/explain
{
"index": "logs-app-000001",
"shard": 0,
"primary": true
}

How to Fix It
Bring the missing node back online if it's recoverable. If the node is permanently lost and you need to force-allocate the primary shard, be aware this is a data-loss operation — you're telling Elasticsearch to treat a shard as empty rather than wait for the original copy:
POST _cluster/reroute
{
"commands": [
{
"allocate_empty_primary": {
"index": "logs-app-000001",
"shard": 0,
"node": "sw-infrarunbook-01",
"accept_data_loss": true
}
}
]
}

Only run this command when you have confirmed the original node is gone for good and you've exhausted recovery options. For logging data the risk is usually acceptable; for transactional data it is not.
Prevention
Most of these failures are entirely preventable with a few operational habits baked into your workflow.
Monitor disk usage proactively. Set alerts when any node crosses 75% disk utilization — by the time you're debugging at 90%, you're already in crisis mode. A Prometheus alert using the Elasticsearch exporter catches this early:
- alert: ElasticsearchDiskHigh
  expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes < 0.25
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Elasticsearch node {{ $labels.node }} disk above 75%"

Keep your index templates and ILM policies in version control. Treat them like code, because they are. Use a CI/CD pipeline to push templates to your cluster on every deployment, so a cluster restore or rebuild never leaves you without the templates your pipelines expect. A simple bootstrap script that does a PUT _index_template/ for each file in your elasticsearch/templates/ directory takes thirty minutes to write and saves hours of debugging.
Use dedicated service accounts for each ingestion pipeline, scoped to the minimum required index patterns and privileges. This makes permission failures trivially traceable — if the filebeat-ingest-key is getting 403s, you know exactly which role to check. Don't reuse the same credentials across multiple pipelines.
Set up ILM monitoring. The ILM explain API can be polled programmatically to detect stuck lifecycle steps before they cascade into write failures. A cron job that queries GET */_ilm/explain and alerts on any index with step: ERROR will catch rollover failures long before they manifest as missing indices.
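The core of that cron job is a few lines: scan the parsed explain response for managed indices stuck in an ERROR step. The sample mirrors the explain output shown earlier in the article:

```python
def stuck_indices(explain: dict) -> list[str]:
    # Return the names of managed indices whose current ILM step is ERROR.
    return [
        name
        for name, info in explain.get("indices", {}).items()
        if info.get("managed") and info.get("step") == "ERROR"
    ]

sample = {"indices": {"logs-app-000001": {"managed": True, "step": "ERROR"}}}
print(stuck_indices(sample))  # ['logs-app-000001']
```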
Test your index creation path in staging after any cluster configuration change — security policy update, template modification, ILM policy change, node resize. A quick integration test that writes a document through your normal ingestion path and verifies the index was created takes minutes to run and catches the majority of these failures before they reach production.
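A skeleton of that staging check, assuming the index name is date-based (logs-app-YYYY.MM.dd, as in the article's examples) and the cluster URL is a placeholder; a HEAD request on the index name succeeds only if the index exists:

```python
import datetime
import urllib.error
import urllib.request

def expected_index(prefix: str, day: datetime.date) -> str:
    # Build the date-suffixed index name the pipeline should have created.
    return f"{prefix}-{day:%Y.%m.%d}"

def index_exists(base_url: str, name: str) -> bool:
    req = urllib.request.Request(f"{base_url}/{name}", method="HEAD")
    try:
        urllib.request.urlopen(req, timeout=5)
        return True
    except urllib.error.HTTPError:
        return False

# Example (against a live staging cluster):
# index_exists("http://staging-es:9200", expected_index("logs-app", datetime.date.today()))
```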
Finally, audit your auto_create_index setting and make it intentional:

GET _cluster/settings?include_defaults=true&filter_path=**.auto_create_index

In most production environments, either disable it entirely or constrain it to specific patterns. Allowing unrestricted auto-creation can mask template misconfigurations — your index gets created but with wrong settings — and you don't find out until you run a query that returns garbage data or hits a mapping exception three weeks later.
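To constrain auto-creation rather than disable it outright, a persistent cluster setting works. A minimal sketch that allows auto-creation only for logs-* indices; adapt the pattern list to your own naming:

```
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": "logs-*"
  }
}
```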
