
Written by Hanumantha Reddy

| Feb 13, 2026

Building and Operating Observability at Banking Scale with Elasticsearch

TL;DR

  • High-volume banking systems generate massive data, but the real challenge is detecting abnormal behaviour early.
  • Elasticsearch forms the core observability platform, handling ingestion, storage, search, and dashboards at scale.
  • As alerting requirements evolved, behaviour comparison and aggregation became more important than simple thresholds.
  • A lightweight Python decision layer was intentionally introduced to keep complex alert logic isolated from the core Elasticsearch cluster.
  • This approach reduced detection delays, manual monitoring effort, and alert noise, while keeping the system operationally manageable.

Introduction

When operating large banking and payment systems, observability quickly becomes an operational challenge rather than a tooling one.

These systems run continuously and generate a very large volume of logs and metrics every day. Elasticsearch plays a central role in handling this scale by reliably ingesting, storing, and making data searchable across the platform.

However, as transaction volume grows and operational expectations increase, simply having dashboards and searchable logs is not enough. The real challenge lies in detecting abnormal system behaviour early and giving operations teams the confidence to act quickly.

This blog shares practical lessons from building and operating an Elasticsearch-based observability setup at banking scale, focusing on alerting design, behaviour comparison, and operational decision-making.

The real problem was timing, not visibility

In high-volume banking environments, failures rarely appear as clean outages.

More often, issues show up as:

  • Gradual drops in successful transactions
  • Sudden absence of logs from specific APIs
  • Behaviour changes that only become visible when compared across days

In these situations, data continues to flow into Elasticsearch and dashboards remain accessible. From a surface-level view, systems look healthy. The risk is not lack of visibility, but delay in recognising that something is drifting away from normal behaviour.

Over time, it became clear that observability at this scale is about reducing detection delay, not just collecting more data.

Overall observability architecture

At a high level, Elasticsearch acts as the central observability platform.

Data enters the system from multiple sources:

  • Filebeat collects application and transaction logs
  • Metricbeat collects system-level metrics such as CPU and memory
  • Heartbeat monitors service availability

All incoming data passes through Logstash, where logs are filtered, enriched, and normalised. In practice, raw logs often lack sufficient context, so additional fields are derived during processing. This includes mapping identifiers to meaningful application or destination names so teams can reason about issues more easily.
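
The enrichment step described above can be sketched in plain Python. The identifier values and mapped names below are purely illustrative assumptions (the real mappings live in Logstash filters), but they show the shape of the derivation: raw IDs in, human-readable fields out.

```python
# Sketch of log enrichment: derive readable names from raw identifiers
# so teams can reason about issues by application, not by ID.
# All mapping values here are hypothetical, not the production mapping.

APP_NAMES = {
    "app-101": "Payments Gateway",
    "app-102": "Card Switch",
}

DEST_NAMES = {
    "dst-9": "Core Banking",
}

def enrich(event: dict) -> dict:
    """Return a copy of the event with derived name fields added."""
    enriched = dict(event)
    enriched["app_name"] = APP_NAMES.get(event.get("app_id"), "unknown")
    enriched["destination_name"] = DEST_NAMES.get(event.get("dest_id"), "unknown")
    return enriched

print(enrich({"app_id": "app-101", "dest_id": "dst-9", "msg": "txn ok"}))
```

In the actual pipeline the equivalent logic runs inside Logstash filters before indexing, so the derived fields are already present when the data reaches Elasticsearch.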

The processed data is then indexed into Elasticsearch, which serves as the single source of truth for observability.

Elasticsearch setup in practice

The Elasticsearch platform was already well established and handled scale reliably.

The setup included:

  • On-premises deployment
  • Elasticsearch version 8.9
  • 6 Elasticsearch data nodes
  • 4 Logstash and Kibana nodes
  • Logs and metrics ingested via Filebeat and Metricbeat
  • Log processing and enrichment handled in Logstash
  • Kibana secured using Azure AD authentication
  • Raw production logs retained for 7 days

This setup supported:

  • High-volume ingestion without loss
  • Fast search and analysis
  • Dashboards used by monitoring, operations, development, and network teams
  • Centralised visibility across applications and infrastructure

Elasticsearch proved reliable as a core platform for observability at scale.

Why alerting needed to evolve

While dashboards and search worked well, alerting requirements became more complex as the system matured.

Native Elasticsearch alerting works very effectively for:

  • Threshold-based alerts
  • No-data alerts
  • Simple time-window evaluations

However, operational needs extended beyond this. Detecting real issues often required:

  • Frequent checks at short intervals
  • Comparing current behaviour with previous days
  • Aggregating data at multiple levels
  • Avoiding duplicate alerts for the same underlying issue

At this scale, pushing all comparison-heavy logic directly into alerting rules increased complexity and made it harder to reason about system behaviour under load.

The challenge was not that Elasticsearch could not support alerting, but that clarity, failure isolation, and operational control became more important than consolidating all logic into a single layer.

An intentional separation of data and decision logic

Elasticsearch remained the core data platform.

To keep the cluster focused on ingestion, search, and aggregation, a lightweight Python-based decision layer was introduced alongside Elasticsearch. This was an intentional architectural choice, not a workaround.

The goal was to:

  • Keep complex decision logic isolated
  • Reduce pressure on the core cluster
  • Make alert behaviour easier to understand and evolve

In this model:

  • Elasticsearch stores and serves all observability data
  • Python evaluates conditions and determines when alerts should fire

The Python layer:

  • Runs scheduled checks at 1-, 5-, and 10-minute intervals
  • Fetches required data from Elasticsearch
  • Applies multi-step aggregation
  • Compares current behaviour with historical behaviour
  • Adds derived fields where required
  • Triggers alerts based on evaluated conditions
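
The comparison step above can be sketched as follows. This is a minimal illustration, not the production code: fetching from Elasticsearch is stubbed out, and the 30% drop threshold is an assumed value.

```python
# Sketch of one decision-layer check: compare the current window's
# transaction count against the same window on a previous day, and
# flag a significant drop. Threshold and numbers are illustrative.

def drop_ratio(current: int, baseline: int) -> float:
    """Fractional drop versus baseline; 0.0 means no drop."""
    if baseline == 0:
        return 0.0  # no historical baseline to compare against
    return max(0.0, (baseline - current) / baseline)

def should_alert(current: int, baseline: int, max_drop: float = 0.3) -> bool:
    """Fire when the current window has dropped more than max_drop."""
    return drop_ratio(current, baseline) > max_drop

# Example: 400 transactions now vs 1,000 in the same window yesterday.
print(should_alert(400, 1000))  # a 60% drop exceeds the 30% threshold
```

In the real system, `current` and `baseline` come from Elasticsearch aggregation queries over the relevant time windows, and the result feeds into alert routing rather than a print statement.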

Aggregated data is written back into Elasticsearch at three granularities:

  • 5 minutes
  • 1 hour
  • 1 day
This approach supports longer-term analysis while keeping raw data retention under control.

Alert types implemented

Alert rules were built based on real production behaviour, including:

  • No-data alerts when expected logs stop arriving
  • Threshold-based alerts for sudden spikes or drops
  • Ratio-based alerts for abnormal success or failure rates
  • Comparison-based alerts such as:
    • Today versus yesterday
    • Today versus day before yesterday

The system runs hundreds of alert rules and generates thousands of alerts per day.
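
A ratio-based rule from the list above can be sketched like this. The 95% success-rate floor is an assumed example value, not the production threshold.

```python
# Sketch of a ratio-based alert on transaction success rate within a
# window. The floor value is illustrative, not a production setting.

def success_rate(success: int, failure: int) -> float:
    """Success rate in [0, 1]; an empty window counts as healthy."""
    total = success + failure
    return 1.0 if total == 0 else success / total

def ratio_alert(success: int, failure: int, floor: float = 0.95) -> bool:
    """Fire when the success rate drops below the configured floor."""
    return success_rate(success, failure) < floor

# Example: 930 successes vs 70 failures is a 93% success rate.
print(ratio_alert(930, 70))
```

Comparison-based rules (today versus yesterday, today versus the day before) follow the same pattern, except the baseline comes from a historical query instead of a fixed floor.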

Integration with operational workflows

Alerts are tightly integrated with operational systems to ensure actionability.

Key aspects include:

  • Alerts sent via email where required
  • Automatic incident creation and updates in ServiceNow
  • Repeated alerts updating the same incident rather than creating new ones
  • Incident updates recorded as work notes

This design reduces alert fatigue and helps teams focus on resolving issues instead of managing noise.

How teams use the platform

The observability platform is shared across teams:

  • Monitoring
  • Operations
  • Development
  • Network

All teams access Kibana through Azure AD-secured login, ensuring controlled access while maintaining a shared view of system behaviour.

Elasticsearch acts as a common observability layer, reducing silos during investigation and incident response.

Operational impact

After stabilising this setup, the impact was clear:

  • Reduced manual dashboard monitoring
  • Faster detection of transaction-related issues
  • Fewer high-severity escalations
  • Better focus on investigation and resolution

This setup does not eliminate all risk. What it does is make abnormal behaviour visible earlier and easier to reason about, which is often the most important factor in high-volume banking systems.

Key learnings

Some important lessons from operating this system at scale:

  • Observability is about reducing detection delay, not just collecting data
  • Behaviour comparison across time is critical in banking environments
  • Elasticsearch scales reliably as a central observability platform
  • Separating data handling from decision logic improves operational clarity
  • Alert quality matters more than alert quantity
  • Designing for human operators is as important as designing for systems

How This Experience Shaped My Thinking

This experience shaped how I approach observability work.

I start with real operational problems, use Elasticsearch as the foundation, and extend it carefully where required to support complex decision-making. At banking scale, observability is about understanding system behaviour under continuous load, not just monitoring system health.

When designed with this mindset, Elasticsearch becomes a powerful platform for building reliable, scalable, and enterprise-grade observability solutions.

