
Written by Hanumantha Reddy

| Feb 13, 2026

Building and Operating Observability at Banking Scale with Elasticsearch

TL;DR

  • High-volume banking systems generate massive data, but the real challenge is detecting abnormal behaviour early.
  • Elasticsearch forms the core observability platform, handling ingestion, storage, search, and dashboards at scale.
  • As alerting requirements evolved, behaviour comparison and aggregation became more important than simple thresholds.
  • A lightweight Python decision layer was intentionally introduced to keep complex alert logic isolated from the core Elasticsearch cluster.
  • This approach reduced detection delays, manual monitoring effort, and alert noise, while keeping the system operationally manageable.

Introduction

When operating large banking and payment systems, observability quickly becomes an operational challenge rather than a tooling one.

These systems run continuously and generate a very large volume of logs and metrics every day. Elasticsearch plays a central role in handling this scale by reliably ingesting, storing, and making data searchable across the platform.

However, as transaction volume grows and operational expectations increase, simply having dashboards and searchable logs is not enough. The real challenge lies in detecting abnormal system behaviour early and giving operations teams the confidence to act quickly.

This blog shares practical lessons from building and operating an Elasticsearch-based observability setup at banking scale, focusing on alerting design, behaviour comparison, and operational decision-making.

The real problem was timing, not visibility

In high-volume banking environments, failures rarely appear as clean outages.

More often, issues show up as:

  • Gradual drops in successful transactions
  • Sudden absence of logs from specific APIs
  • Behaviour changes that only become visible when compared across days

In these situations, data continues to flow into Elasticsearch and dashboards remain accessible. From a surface-level view, systems look healthy. The risk is not lack of visibility, but delay in recognising that something is drifting away from normal behaviour.

Over time, it became clear that observability at this scale is about reducing detection delay, not just collecting more data.

Overall observability architecture

At a high level, Elasticsearch acts as the central observability platform.

Data enters the system from multiple sources:

  • Filebeat collects application and transaction logs
  • Metricbeat collects system-level metrics such as CPU and memory
  • Heartbeat monitors service availability

All incoming data passes through Logstash, where logs are filtered, enriched, and normalised. In practice, raw logs often lack sufficient context, so additional fields are derived during processing. This includes mapping identifiers to meaningful application or destination names so teams can reason about issues more easily.
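
The enrichment step described above can be sketched in plain Python. The identifier values and mapped names below are purely illustrative assumptions (the real mappings live in Logstash filters), but they show the shape of the derivation: raw IDs in, human-readable fields out.

```python
# Sketch of log enrichment: derive readable names from raw identifiers
# so teams can reason about issues by application, not by ID.
# All mapping values here are hypothetical, not the production mapping.

APP_NAMES = {
    "app-101": "Payments Gateway",
    "app-102": "Card Switch",
}

DEST_NAMES = {
    "dst-9": "Core Banking",
}

def enrich(event: dict) -> dict:
    """Return a copy of the event with derived name fields added."""
    enriched = dict(event)
    enriched["app_name"] = APP_NAMES.get(event.get("app_id"), "unknown")
    enriched["destination_name"] = DEST_NAMES.get(event.get("dest_id"), "unknown")
    return enriched

print(enrich({"app_id": "app-101", "dest_id": "dst-9", "msg": "txn ok"}))
```

In the actual pipeline the equivalent logic runs inside Logstash filters before indexing, so the derived fields are already present when the data reaches Elasticsearch.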

The processed data is then indexed into Elasticsearch, which serves as the single source of truth for observability.

Elasticsearch setup in practice

The Elasticsearch platform was already well established and handled scale reliably.

The setup included:

  • On-premises deployment
  • Elasticsearch version 8.9
  • 6 Elasticsearch data nodes
  • 4 Logstash and Kibana nodes
  • Logs and metrics ingested via Filebeat and Metricbeat
  • Log processing and enrichment handled in Logstash
  • Kibana secured using Azure AD authentication
  • Raw production logs retained for 7 days

This setup supported:

  • High-volume ingestion without loss
  • Fast search and analysis
  • Dashboards used by monitoring, operations, development, and network teams
  • Centralised visibility across applications and infrastructure

Elasticsearch proved reliable as a core platform for observability at scale.

Why alerting needed to evolve

While dashboards and search worked well, alerting requirements became more complex as the system matured.

Native Elasticsearch alerting works very effectively for:

  • Threshold-based alerts
  • No-data alerts
  • Simple time-window evaluations

However, operational needs extended beyond this. Detecting real issues often required:

  • Frequent checks at short intervals
  • Comparing current behaviour with previous days
  • Aggregating data at multiple levels
  • Avoiding duplicate alerts for the same underlying issue

At this scale, pushing all comparison-heavy logic directly into alerting rules increased complexity and made it harder to reason about system behaviour under load.

The challenge was not that Elasticsearch could not support alerting, but that clarity, failure isolation, and operational control became more important than consolidating all logic into a single layer.

An intentional separation of data and decision logic

Elasticsearch remained the core data platform.

To keep the cluster focused on ingestion, search, and aggregation, a lightweight Python-based decision layer was introduced alongside Elasticsearch. This was an intentional architectural choice, not a workaround.

The goal was to:

  • Keep complex decision logic isolated
  • Reduce pressure on the core cluster
  • Make alert behaviour easier to understand and evolve

In this model:

  • Elasticsearch stores and serves all observability data
  • Python evaluates conditions and determines when alerts should fire

The Python layer:

  • Runs scheduled checks at 1-, 5-, and 10-minute intervals
  • Fetches required data from Elasticsearch
  • Applies multi-step aggregation
  • Compares current behaviour with historical behaviour
  • Adds derived fields where required
  • Triggers alerts based on evaluated conditions
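
The comparison step above can be sketched as follows. This is a minimal illustration, not the production code: fetching from Elasticsearch is stubbed out, and the 30% drop threshold is an assumed value.

```python
# Sketch of one decision-layer check: compare the current window's
# transaction count against the same window on a previous day, and
# flag a significant drop. Threshold and numbers are illustrative.

def drop_ratio(current: int, baseline: int) -> float:
    """Fractional drop versus baseline; 0.0 means no drop."""
    if baseline == 0:
        return 0.0  # no historical baseline to compare against
    return max(0.0, (baseline - current) / baseline)

def should_alert(current: int, baseline: int, max_drop: float = 0.3) -> bool:
    """Fire when the current window has dropped more than max_drop."""
    return drop_ratio(current, baseline) > max_drop

# Example: 400 transactions now vs 1,000 in the same window yesterday.
print(should_alert(400, 1000))  # a 60% drop exceeds the 30% threshold
```

In the real system, `current` and `baseline` come from Elasticsearch aggregation queries over the relevant time windows, and the result feeds into alert routing rather than a print statement.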

Aggregated data is written back into Elasticsearch at three granularities:

  • 5 minutes
  • 1 hour
  • 1 day
This approach supports longer-term analysis while keeping raw data retention under control.

Alert types implemented

Alert rules were built based on real production behaviour, including:

  • No-data alerts when expected logs stop arriving
  • Threshold-based alerts for sudden spikes or drops
  • Ratio-based alerts for abnormal success or failure rates
  • Comparison-based alerts such as:
    • Today versus yesterday
    • Today versus day before yesterday

The system runs hundreds of alert rules and generates thousands of alerts per day.
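
A ratio-based rule from the list above can be sketched like this. The 95% success-rate floor is an assumed example value, not the production threshold.

```python
# Sketch of a ratio-based alert on transaction success rate within a
# window. The floor value is illustrative, not a production setting.

def success_rate(success: int, failure: int) -> float:
    """Success rate in [0, 1]; an empty window counts as healthy."""
    total = success + failure
    return 1.0 if total == 0 else success / total

def ratio_alert(success: int, failure: int, floor: float = 0.95) -> bool:
    """Fire when the success rate drops below the configured floor."""
    return success_rate(success, failure) < floor

# Example: 930 successes vs 70 failures is a 93% success rate.
print(ratio_alert(930, 70))
```

Comparison-based rules (today versus yesterday, today versus the day before) follow the same pattern, except the baseline comes from a historical query instead of a fixed floor.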

Integration with operational workflows

Alerts are tightly integrated with operational systems to ensure actionability.

Key aspects include:

  • Alerts sent via email where required
  • Automatic incident creation and updates in ServiceNow
  • Repeated alerts updating the same incident rather than creating new ones
  • Incident updates recorded as work notes

This design reduces alert fatigue and helps teams focus on resolving issues instead of managing noise.

How teams use the platform

The observability platform is shared across teams:

  • Monitoring
  • Operations
  • Development
  • Network

All teams access Kibana through Azure AD-secured login, ensuring controlled access while maintaining a shared view of system behaviour.

Elasticsearch acts as a common observability layer, reducing silos during investigation and incident response.

Operational impact

After stabilising this setup, the impact was clear:

  • Reduced manual dashboard monitoring
  • Faster detection of transaction-related issues
  • Fewer high-severity escalations
  • Better focus on investigation and resolution

This setup does not eliminate all risk. What it does is make abnormal behaviour visible earlier and easier to reason about, which is often the most important factor in high-volume banking systems.

Key learnings

Some important lessons from operating this system at scale:

  • Observability is about reducing detection delay, not just collecting data
  • Behaviour comparison across time is critical in banking environments
  • Elasticsearch scales reliably as a central observability platform
  • Separating data handling from decision logic improves operational clarity
  • Alert quality matters more than alert quantity
  • Designing for human operators is as important as designing for systems

How This Experience Shaped My Thinking

This experience shaped how I approach observability work.

I start with real operational problems, use Elasticsearch as the foundation, and extend it carefully where required to support complex decision-making. At banking scale, observability is about understanding system behaviour under continuous load, not just monitoring system health.

When designed with this mindset, Elasticsearch becomes a powerful platform for building reliable, scalable, and enterprise-grade observability solutions.

