When I work on large banking and payment systems, one thing becomes very clear. Data is never the problem. Understanding that data in time is.
Every transaction creates logs. Every system component generates metrics. When systems run 24/7, this data keeps coming without pause. The real challenge is not collecting logs, but figuring out what is going wrong quickly enough to take action.
This is the situation I faced while working on observability for high-volume banking systems.
Where monitoring started to struggle
The systems I worked on process transactions continuously. Log ingestion alone was around 300–400 GB per day, coming from core banking applications, transaction systems, and infrastructure.
From the outside, things looked fine. Logs were centralized. Dashboards were available. Search worked properly.
But on the operations side, problems were visible:
- Manual monitoring took a lot of time
- Simple alerts worked only for basic issues
- Patterns across time were hard to catch quickly
- Some critical issues were detected late
Over time, I realized that the main risk was not system failure itself, but delay in detection.
Why banking systems are different
Banking systems do not fail in a clean way.
Sometimes a short spike in transaction declines is more serious than a complete outage. Sometimes missing logs are a sign of failure, not just logging issues. Very often, what matters is how today compares with yesterday, not just a fixed threshold.
To detect such issues, I had to compare behavior across multiple days. In many cases, this meant scanning more than 1 TB of data for comparison-based checks.
At that scale, traditional alerting methods become difficult to manage.
Deciding what really needed to be solved
Before changing anything, I was clear about the goals:
- Detect issues within minutes
- Reduce manual dashboard checking
- Catch issues that appear only when comparing data across days
- Avoid too many duplicate alerts
- Integrate alerts with incident tracking systems
The focus was not on adding more alerts, but on making alerts useful.
How the setup looks in practice
Elasticsearch was already being used for log search and analysis and was handling ingestion well.
The setup included:
- 6 Elasticsearch data nodes, running on-premises
- 4 Kibana nodes used by the monitoring team
- 7-day retention for raw production logs
This setup worked well for search and dashboards, but the alerting logic grew more complex as requirements increased.
That is where Python came in.
Why Python was added
Elastic’s native alerting works well for simple cases, but the logic required in this environment was more complex.
I needed to:
- Check every log, not sampled data
- Run checks at short intervals
- Compare current data with previous days
- Apply multiple levels of aggregation
- Avoid creating multiple incidents for the same issue
Handling all this directly inside Elasticsearch was difficult. Using Python made the logic easier to manage and extend.
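To make the comparison idea concrete, the core pattern looks roughly like the sketch below. This is a minimal illustration, assuming the elasticsearch-py 8.x client; the index name, field names, window size, and threshold are placeholders, not the production values.

```python
from datetime import datetime, timedelta, timezone

from elasticsearch import Elasticsearch

# Illustrative connection and index pattern, not the production values.
es = Elasticsearch("http://localhost:9200")
INDEX = "txn-logs-*"


def count_declines(start, end):
    """Count declined transactions in a time window."""
    resp = es.count(
        index=INDEX,
        query={
            "bool": {
                "filter": [
                    {"term": {"txn.status": "DECLINED"}},
                    {"range": {"@timestamp": {
                        "gte": start.isoformat(), "lt": end.isoformat()}}},
                ]
            }
        },
    )
    return resp["count"]


def day_over_day_check(window_minutes=10, ratio_threshold=1.5):
    """Compare the last N minutes with the same window yesterday."""
    now = datetime.now(timezone.utc)
    today = count_declines(now - timedelta(minutes=window_minutes), now)
    yesterday = count_declines(
        now - timedelta(days=1, minutes=window_minutes),
        now - timedelta(days=1),
    )
    # Alert only when today's declines clearly exceed yesterday's baseline.
    should_alert = yesterday > 0 and today / yesterday > ratio_threshold
    return {"alert": should_alert, "today": today, "yesterday": yesterday}
```

The same shape covers the "today versus two days earlier" variant by shifting the baseline window further back.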
What the Python automation does
The Python layer works alongside Elasticsearch.
It does the following:
- Runs checks at 1, 5, and 10 minute intervals
- Fetches data from Elasticsearch and applies multi-step aggregation
- Creates aggregated indexes at 5-minute, 1-hour, and 1-day levels
- Stores aggregated data for longer-term analysis
- Triggers alerts and generates reports
This helped separate data processing from decision logic.
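The aggregation step can be pictured as a small roll-up job: query the raw index with a date histogram, then write the bucket counts into an aggregated index. Again a hedged sketch with illustrative index and field names, assuming elasticsearch-py 8.x.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")


def rollup_5m(source_index="txn-logs-*", target_index="txn-agg-5m"):
    """Bucket recent raw logs into 5-minute counts per status and store them."""
    resp = es.search(
        index=source_index,
        size=0,
        query={"range": {"@timestamp": {"gte": "now-1h"}}},
        aggs={
            "per_5m": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": "5m"},
                "aggs": {"per_status": {"terms": {"field": "txn.status"}}},
            }
        },
    )

    actions = []
    for bucket in resp["aggregations"]["per_5m"]["buckets"]:
        for status in bucket["per_status"]["buckets"]:
            actions.append({
                "_index": target_index,
                # Deterministic id, so re-running the job overwrites instead of duplicating.
                "_id": f'{bucket["key_as_string"]}-{status["key"]}',
                "_source": {
                    "@timestamp": bucket["key_as_string"],
                    "status": status["key"],
                    "count": status["doc_count"],
                },
            })
    helpers.bulk(es, actions)
```

The 1-hour and 1-day levels can follow the same pattern with a larger interval.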
Types of alerts that were implemented
Alerts were designed based on real production issues.
They include:
- Threshold-based alerts for spikes
- No-data alerts when logs stop coming
- Ratio-based alerts for complete or abnormal declines
- Comparison-based alerts, such as today versus yesterday and today versus two days earlier
Overall, the system runs 200+ alert rules and triggers around 2,000–3,000 alerts per day.
To reduce noise, repeated alerts for the same issue update the same incident instead of creating new ones.
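To illustrate the deduplication idea: each alert gets a stable key derived from the rule and the affected component, and a repeated firing updates the existing incident record instead of opening a new one. A minimal sketch, assuming elasticsearch-py 8.x and an illustrative incidents index; the real setup forwards to the incident management tool.

```python
import hashlib
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INCIDENT_INDEX = "alert-incidents"  # illustrative name


def raise_or_update_incident(rule_name, component, severity, details):
    """Create one incident per (rule, component); repeated firings update the same document."""
    # Stable key: the same rule firing for the same component maps to the same incident.
    dedup_key = hashlib.sha1(f"{rule_name}|{component}".encode()).hexdigest()
    now = datetime.now(timezone.utc).isoformat()

    es.update(
        index=INCIDENT_INDEX,
        id=dedup_key,
        doc={
            "rule": rule_name,
            "component": component,
            "severity": severity,
            "details": details,
            "last_seen": now,
        },
        doc_as_upsert=True,  # first firing creates the incident, later ones update it
    )
```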
Reporting and integration with operations
Alerts are integrated with the incident management system so teams can track and act on them.
In addition:
- Some alerts are sent by email
- 20–30 daily reports are generated
- 10+ weekly reports are shared
- 3–5 hourly reports provide frequent updates
Reports are shared as dashboard links, PDF, Excel, or HTML, based on team needs.
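As a small illustration of the reporting side (a sketch only: it assumes pandas with openpyxl for Excel output and a daily aggregated index like the one above; real reports carry more context than a flat table):

```python
import pandas as pd
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def daily_status_report(index="txn-agg-1d"):
    """Turn yesterday's aggregated counts into Excel and HTML files."""
    resp = es.search(
        index=index,
        size=1000,
        query={"range": {"@timestamp": {"gte": "now-1d/d", "lt": "now/d"}}},
    )
    rows = [hit["_source"] for hit in resp["hits"]["hits"]]
    df = pd.DataFrame(rows)

    # One file per format; teams pick whichever suits them.
    df.to_excel("daily_status_report.xlsx", index=False)
    df.to_html("daily_status_report.html", index=False)
    return df
```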
Aggregation is also used to control data growth.
Instead of storing every occurrence of a similar log event as a separate document, such events are grouped and counted. This reduces index size while still preserving useful trends.
This made it possible to retain meaningful data without storing all raw logs for long periods.
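The grouping itself can be as simple as a terms aggregation over a normalized message signature, with one counted document written back per group. A hedged sketch; the `message.signature` field and the index names are assumptions about how logs might be normalized at ingest, not the actual mapping.

```python
from datetime import date, timedelta

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")


def compact_similar_logs(source_index="app-logs-*", target_index="app-logs-compact"):
    """Group yesterday's similar log lines by signature and store one counted doc per group."""
    day = (date.today() - timedelta(days=1)).isoformat()
    resp = es.search(
        index=source_index,
        size=0,
        query={"range": {"@timestamp": {"gte": "now-1d/d", "lt": "now/d"}}},
        aggs={
            # 'message.signature' is an assumed keyword field: the log line with
            # variable parts (ids, amounts, timestamps) stripped at ingest time.
            "by_signature": {"terms": {"field": "message.signature", "size": 10000}}
        },
    )

    actions = (
        {
            "_index": target_index,
            "_id": f"{day}-{b['key']}",  # deterministic id, so re-runs stay idempotent
            "_source": {"day": day, "signature": b["key"], "count": b["doc_count"]},
        }
        for b in resp["aggregations"]["by_signature"]["buckets"]
    )
    helpers.bulk(es, actions)
```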
What changed after this setup
After this system was in place, the impact was clear:
- 33% reduction in manual monitoring effort
- 25% reduction in P1 escalations
- Critical alerts delivered within 1 minute
- Faster identification of transaction-related issues
Operations teams no longer needed to constantly watch dashboards. They could focus on fixing real problems.
Final thoughts
This experience shaped how I approach observability work. I start from real operational problems, not tools or dashboards. I design for scale and comparison because most issues do not show up as simple threshold breaches. I use tools where they fit best and measure success by reduced effort and faster response.
At Ashnik, this is how we build observability solutions for enterprises across industries. We focus on how systems behave in production and design setups that help teams detect issues early and act with confidence.
