NL blog elasticsearch 1

Written by Hanumantha Reddy

| Mar 19, 2026

7 min read

More Logs. Longer Retention. Faster Search. This Is What Elasticsearch’s LogsDB Actually Does.

Every time your log volume grows, your ability to search it effectively is under pressure. Slower queries, shorter retention windows, and incomplete incident histories are not Elasticsearch problems — they are symptoms of storing log data the wrong way. LogsDB is Elasticsearch’s native answer: a purpose-built index mode that stores logs the way logs should be stored, so you can search more of them, further back in time, and faster.

bulb
Version requirement:
LogsDB is available from Elasticsearch 8.17+ and fully production-ready in Elasticsearch 9.0+. Verify your version before evaluating this feature.

The Real Problem Is Not Storage — It’s Search Coverage

When log volumes grow, the first casualty is not disk space — it’s search depth. Teams start shortening retention windows to keep clusters manageable. What was 90 days became 60, then 30. At the point an incident needs investigation, the logs that would answer the question are already gone.

The underlying cause is structural. Elasticsearch’s default indexing model stores every document as a full raw _source JSON blob on disk, on top of an inverted index for search and doc values for sorting and aggregations. For log data — which is append-only, time-series, and queried against a small subset of fields — the _source blob is the problem. It is a verbatim copy of every document, stored at full size, for every log line ever written. At scale, it consumes the majority of your index storage.

The clusters we work with regularly show _source storage as the single largest contributor to log index size — not the inverted index, not doc values. This is the inefficiency that drives the retention trade-off, and it is entirely addressable.

bulb
THE CORE QUESTION
If you could store the same logs in half the space — without losing a single document or changing how search works — how many more weeks of log history could you keep in your cluster?

What LogsDB Is — and Why It Changes the Equation

LogsDB is a native index mode in Elasticsearch, designed specifically for log workloads. It is not a plugin, a third-party tool, or a workaround — it is how Elastic intends logs to be stored when storage efficiency and search performance matter. Activating it takes one line of configuration.

The mechanism is precise. LogsDB drops the raw _source blob entirely and replaces it with a synthetic source. When a document’s source is requested, Elasticsearch reconstructs it on-the-fly from doc values, which are stored in a columnar format. Columnar storage groups the same field together across all documents.

For log data, where fields like hostnames, log levels, service names, and status codes repeat across millions of documents, the compression ratios are dramatically better than storing individual JSON blobs per document.

On top of this, LogsDB automatically sorts log events by timestamp and host at write time. This further tightens compression and accelerates the time-range queries that dominate log search — the exact queries your team runs when something breaks.

Search behaviour is completely unchanged. The inverted index that powers full-text search remains intact. Queries run identically. What changes is how much log data fits in your cluster, and therefore how far back those queries can reach.

bulb
DATA INTEGRITY
LogsDB does not drop or modify document content. The same documents go in, the same documents come out, and search results are identical. Only the internal storage layout changes.
bulb
PRODUCTION CONSIDERATION
Because LogsDB relies on doc values for source reconstruction, any field that is not explicitly mapped cannot be reconstructed via a synthetic source. Dynamic unmapped fields are at risk of being lost on source fetch. Before enabling LogsDB, audit your log schema and ensure all fields are explicitly mapped.

LogsDB changes how document _id values are handled — and this affects pipelines that rely on custom IDs.

In standard index mode, Elasticsearch stores and respects whatever _id value your ingestion pipeline assigns to each document. This is commonly used for deduplication (re-indexing the same document with the same _id is idempotent) and for upsert workflows where a known _id is required to update an existing document.

LogsDB does not support custom _id values by default. Because _id is not stored in doc values, it cannot be reconstructed via synthetic source — so LogsDB disables user-provided _ids and generates a synthetic identifier internally instead. Attempts to index a document with an explicit _id into a LogsDB index will either be silently overridden or rejected, depending on configuration.

What this means in practice:

  • Deduplication pipelines that rely on sending the same _id twice to avoid duplicate log entries will not behave as expected under LogsDB. Duplicate suppression logic needs to move upstream — into your ingestion pipeline (Logstash, Elastic Agent, or a custom processor) — before documents reach the index.
  • Upsert workflows (index or update operations using a known _id) are not compatible with LogsDB. If your pipeline updates existing log documents rather than purely appending new ones, LogsDB is not the right index mode for that data stream.
  • Pure append pipelines — which represent the overwhelming majority of log workloads — are completely unaffected. If your logs flow in and are never updated or deduplicated by _id, you will not encounter this limitation at all.

For most log infrastructure, this is a non-issue. Logs are append-only by nature, and synthetic _id generation has no observable impact on search, aggregation, or source retrieval. But if your pipeline was built with _id-based deduplication as a reliability mechanism — a common pattern when dealing with at-least-once delivery guarantees from Kafka or similar systems — audit that logic before enabling LogsDB. Move the deduplication gate to your ingestion layer, and LogsDB will behave exactly as expected downstream.

Enabling LogsDB

One setting in your index template. No pipeline changes, no schema migration, no reingestion of existing data.

This applies only to new indices matching the pattern. Existing indices are unaffected. You can roll it out on new data streams immediately while historical data stays on standard mode — no disruption, no downtime.


// PUT _index_template/logsdb-template
{
"index_patterns": ["logs-*"],
"data_stream": {},
"template": {
"settings": {
"index.mode": "logsdb" // This is all it takes
}
}
}

The Numbers: Same Logs, Half the Space

To put a precise figure on the storage impact, we ingested the same log dataset into two indices on the same cluster — identical shard count (1), identical replica configuration (0 replicas), identical documents. The only variable: the index mode.

Both indices were queried with GET _cat/indices/test*?v&s=store.size after ingestion:

elasticsearch blog img1


Screenshot from 2026 03 17 16 29 27

807.7 MB on a standard index. 355.4 MB on LogsDB. The same 1,453,992 documents, verified with _count on both indices. All shards healthy (total: 1, successful: 1, skipped: 0, failed: 0):

elasticsearch blog img2

elasticsearch blog img3

Not a single document dropped. Not a single shard failed. The 56% storage reduction comes entirely from eliminating the raw _source blob — with zero impact on document count, zero impact on shard health, zero impact on search.

What That Storage Saving Actually Buys You

56% less storage is not a cost line item — it is search coverage. Here is what it means in practice:

At 10 TB of log data, LogsDB brings that to approximately 4.4 TB on the same hardware. That is 5.6 TB of reclaimed capacity — which translates directly into extended retention. A team running 30-day retention today could move to 60+ days on the same cluster, with the same infrastructure spend. A team already at 90 days could double their history without adding a single node.

For incident investigation, this matters enormously. The difference between 30-day and 60-day retention is often the difference between finding the root cause and not. Log data that was being deleted to manage disk costs becomes searchable history.

For teams on managed Elasticsearch services where storage is billed per GB, the cost reduction is immediate and measurable. But the more important outcome is what you keep: more logs in the cluster means more to search.

bulb
RETENTION AS A SEARCH STRATEGY
Every day of log history you can afford to keep is another day of data available for trend analysis, security investigations, compliance queries, and incident post-mortems. LogsDB does not just save storage — it extends the searchable window of your infrastructure’s history.

Rollout risk is low. The change is scoped to new indices, requires no downtime, and leaves existing data untouched. You can adopt it on a single data stream today and measure the impact before rolling it further.

When LogsDB Is — and Is Not — the Right Choice

WORKLOAD LOGSDB? WHY
Application logs YES Append-only, time-series — ideal fit for LogsDB’s architecture
Infrastructure & server logs YES High ingest volume amplifies the storage savings
Container & Kubernetes logs YES Structured, repetitive fields compress well under columnar storage
Security / SIEM logs YES Long retention requirements make the storage reduction compound over time
Metrics & time-series NO Use Elasticsearch’s dedicated metrics index mode instead
Heavy aggregation workloads NO Synthetic source reconstruction adds overhead when aggregations dominate
Pipelines with unmapped dynamic fields NO Unmapped fields cannot be reconstructed via synthetic source — data loss risk
Distributed traces NO Trace access patterns differ fundamentally from log workloads

Our Take

LogsDB solves the right problem. The _source blob is the largest single contributor to log index size, and also the structure least likely to be needed at query time for typical log workloads. Replacing it with columnar reconstruction via synthetic source is architecturally sound — the compression gains are real, the search behaviour is unchanged, and the deployment path is minimal.

The 56% reduction we observed on 1,453,992 documents — 807.7 MB down to 355.4 MB, confirmed with identical _count responses on both indices and clean shard health across both — is consistent with Elastic’s documented range of 60–65%. Variance is driven by field cardinality and repetition patterns in your specific log schema. High-cardinality dynamic fields will see less benefit and carry risk around unmapped field reconstruction; well-structured pipelines with explicit schema mapping will see results at or above what we observed.

For any team managing Elasticsearch log infrastructure on 8.17+, the calculus is straightforward: audit your schema, map your fields explicitly, and enable LogsDB on new data streams. The storage you save is not an end in itself — it is the means to keep more logs, search deeper into your history, and get better answers when they matter most.


Go to Top