Log Aggregation Platform Using Elastic Stack

Storage Optimization and Search Optimization for a BFSI Enterprise

  • Industry: BFSI / Banking
  • Technology: Elasticsearch, Azure Blob Storage
  • Engagement: Log Aggregation + Regulatory Compliance Archival

THE CUSTOMER

A BFSI enterprise running a net banking and payments platform generating 700–800 GB of application logs daily across 35+ microservices.

THE CHALLENGE

A 700–800 GB daily log volume, an 11-year regulatory retention mandate, storage constraints, and a zero-downtime requirement – all at once.

THE SOLUTION

A fully architected log management platform with microservice-level indexing, dual-node rolling availability, and automated Azure Blob archival.

THE RESULT

1-second query response across 35+ indexes, 11-year regulatory compliance achieved, zero downtime during maintenance, and a fully automated archival pipeline.

700–800 GB

Application logs per day

1 sec

Query response per microservice

11 Years

Regulatory retention mandate met

0

Minutes of downtime

CUSTOMER OVERVIEW

A BFSI enterprise operating a high-volume net banking and payments platform engaged Ashnik to design and deliver a centralised log management infrastructure capable of handling the scale, compliance, and availability demands of a live payment environment. The environment comprised 35+ microservices running across 10 Tomcat servers on a Kubernetes platform – collectively generating 700–800 GB of application logs every single day.

The platform needed to address three problems at once: ingest and structure logs at this scale, meet the Reserve Bank of India’s 11-year retention mandate through a verifiable archival strategy, and maintain availability through patching and upgrade cycles. Ashnik designed and delivered the architecture as a single integrated platform addressing all three.

THE CHALLENGE

Storage Constraints
On-premise storage could not absorb years of accumulation at this rate. A tiered approach – hot on-premise, cold on cloud – was the only viable path.
Log Volume at Scale
700–800 GB generated daily across 35+ microservices on 10 Tomcat servers. Without structure, search and incident investigation become unworkable.
Regulatory Compliance
The regulator mandates 11-year retention of application logs. The archive needed to be verifiable and audit-ready at any point, not merely stored.
Zero Downtime Mandate
A live payment environment cannot tolerate gaps in log visibility. Patching, upgrades, and maintenance had to happen without taking the platform offline.
Safe Archival
Automation was necessary, but deletion without confirmation was not acceptable. In a compliance environment, unrecoverable data loss is a regulatory failure.
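
To put the storage constraint in numbers, the short calculation below estimates the raw volume that accumulates over the retention window. It is a back-of-envelope sketch only: it assumes a flat 700–800 GB per day, 365-day years, decimal units, and no compression or growth, none of which the engagement figures specify.

```python
GB_PER_TB = 1000  # decimal units (assumption)

def retained_volume_tb(daily_gb: float, years: int, days_per_year: int = 365) -> float:
    """Raw log volume accumulated over the retention window, in TB."""
    return daily_gb * days_per_year * years / GB_PER_TB

for daily_gb in (700, 800):
    total_tb = retained_volume_tb(daily_gb, years=11)
    print(f"{daily_gb} GB/day over 11 years is roughly {total_tb / 1000:.2f} PB raw")
```

Under these assumptions the window holds roughly 2.8 to 3.2 PB of raw logs, which is the scale that makes an on-premise-only approach impractical.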

ASHNIK’S APPROACH

01
Distributed Log Collection via Filebeat
Five separate Filebeat instances were deployed – one per logical server grouping – rather than a single centralised agent. Each instance manages its own state independently, isolating failure risk at the source. A single agent at this ingest volume would create a shared point of backpressure failure. Logs are forwarded securely over TLS on port 5044, with the required firewall rules whitelisted.

02
Microservice-Level Index Routing in Logstash
Logstash pipelines route logs into dedicated Elasticsearch indexes for each of the 35+ microservices. Filters – regex, grok, mutate – structure and enrich every event at ingest. UTC-to-IST timestamp normalisation is applied at this layer, ensuring the operations team sees correctly localised timestamps across every dashboard and query – without manual conversion during incident investigation.
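
The routing and timestamp steps can be illustrated outside Logstash with a small Python sketch. This is not the pipeline configuration used in the engagement: the `service` field, the `logs-<service>-<date>` index naming scheme, and the `timestamp_ist` field are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30), name="IST")

def to_ist(utc_timestamp: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp and convert it to IST (UTC+05:30)."""
    parsed = datetime.fromisoformat(utc_timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Treat timestamps without an offset as UTC (assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(IST)

def route(event: dict) -> tuple:
    """Return (index_name, enriched_event) for one log event."""
    local_time = to_ist(event["@timestamp"])
    # Hypothetical naming scheme: one index per microservice per IST day.
    index = f"logs-{event['service'].lower()}-{local_time.strftime('%Y.%m.%d')}"
    enriched = {**event, "timestamp_ist": local_time.isoformat()}
    return index, enriched

index, enriched = route({
    "service": "Payments-API",
    "@timestamp": "2024-01-15T20:45:00Z",
    "message": "transfer accepted",
})
print(index)
print(enriched["timestamp_ist"])
```

Note that an event logged late in the UTC evening lands in the next IST day, which is why the conversion happens before the index name is derived in this sketch.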

03
4-Node Elasticsearch Cluster – Sized for Peak Ingest
A four-node Elasticsearch cluster with deliberately separated node roles – master, data, and ingest – following Elastic’s production best practice of preventing heavy indexing operations from destabilising cluster management. Mixing roles at this ingest rate risks GC pressure on the master node, which affects the entire cluster. Each node carries 5.8 TB of storage, sized to sustain peak ingest while delivering 1-second query response times across all microservice indexes.
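
As a rough check on what this sizing implies for the hot tier, the sketch below estimates how many days of logs fit on-premise. Every parameter beyond the node count, the 5.8 TB figure, and the daily volume is an assumption: it treats all four nodes as holding data (the role separation described above may mean fewer do), takes on-disk index size as equal to raw log size, and leaves the number of copies and the usable disk fraction as inputs.

```python
def hot_tier_days(
    daily_gb: float,
    data_nodes: int = 4,          # assumption: every node stores index data
    tb_per_node: float = 5.8,
    copies: int = 1,              # 1 = primaries only, 2 = one replica (assumption)
    usable_fraction: float = 1.0, # share of disk kept available for indexes (assumption)
) -> float:
    """Days of logs the on-premise tier can hold before data must be archived."""
    usable_gb = data_nodes * tb_per_node * 1000 * usable_fraction
    return usable_gb / (daily_gb * copies)

for daily_gb in (700, 800):
    print(
        f"{daily_gb} GB/day: "
        f"{hot_tier_days(daily_gb):.1f} days without replicas, "
        f"{hot_tier_days(daily_gb, copies=2):.1f} days with one replica"
    )
```

With these inputs the hot tier holds on the order of two to five weeks of logs, which is consistent with a design that moves older data to cloud storage on a regular schedule.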

04
Rolling Dual-Node Architecture for Zero Downtime
Two Kibana nodes and two Logstash nodes are deployed in a rolling configuration. One node of each pair stays active during any maintenance window, giving continuous log visibility with no interruption to operations.

05
Azure Blob Archival for 11-Year Regulatory Compliance
Weekly snapshots are staged on NFS, compressed, and uploaded to Azure Blob Storage. Daily snapshots were considered but ruled out – at this ingest rate, snapshots compete with live indexing for disk I/O, and daily frequency would degrade cluster performance without proportional compliance benefit. Crucially, Elasticsearch snapshots after the first are incremental – only changed segments are written – so weekly frequency does not mean weekly full re-snapshots of the entire dataset. Azure credentials are configured and access is whitelisted end-to-end.
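
A weekly snapshot of this kind is typically triggered through Elasticsearch's create-snapshot endpoint (`PUT _snapshot/<repository>/<snapshot>`). The sketch below only builds such a request and does not send it. The cluster address, the repository name `nfs_weekly`, the `weekly-<year>-w<week>` naming scheme, and the `logs-*` index pattern are hypothetical; the engagement's actual scripts are not shown in this case study.

```python
import json
from datetime import date
from urllib.request import Request

ES_URL = "http://localhost:9200"  # hypothetical cluster address
REPOSITORY = "nfs_weekly"         # hypothetical shared-filesystem repository on the NFS mount

def weekly_snapshot_name(day: date) -> str:
    """Name the snapshot after the ISO year and week (snapshot names must be lowercase)."""
    iso = day.isocalendar()
    return f"weekly-{iso[0]}-w{iso[1]:02d}"

def build_snapshot_request(day: date, indices: str = "*") -> Request:
    """Build (but do not send) a create-snapshot request for the given week."""
    body = {
        "indices": indices,
        "ignore_unavailable": True,
        "include_global_state": False,
    }
    return Request(
        url=f"{ES_URL}/_snapshot/{REPOSITORY}/{weekly_snapshot_name(day)}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

request = build_snapshot_request(date(2024, 1, 15), indices="logs-*")
print(request.get_method(), request.full_url)
```

Because each snapshot in a repository reuses segments already stored by earlier snapshots, repeating this call weekly against the same repository writes only what has changed since the previous run.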

06
Fully Automated Archival Pipeline
Snapshot, compression, and cloud upload automated via shell scripts and cron jobs – no manual effort in the regular cycle.
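
The engagement implemented this pipeline with shell scripts and cron jobs. Purely as an illustration of the compression stage, the Python sketch below archives a staged snapshot directory and records a SHA-256 checksum that a later verification step could use. The function names and the checksum step are assumptions, not part of the delivered scripts, and the upload to Azure Blob is deliberately left out.

```python
import hashlib
import tarfile
from pathlib import Path
from typing import Tuple

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are never loaded fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def compress_snapshot_dir(snapshot_dir: Path, out_dir: Path) -> Tuple[Path, str]:
    """Create <name>.tar.gz from a staged snapshot directory and return (archive, sha256)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{snapshot_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the snapshot directory name, not the full NFS path.
        tar.add(snapshot_dir, arcname=snapshot_dir.name)
    return archive, sha256_of(archive)
```

A scheduler such as cron would call a script like this once per weekly cycle, after the snapshot completes and before the upload step.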

07
Manual Verification Gate Before Deletion
No local file is deleted until its presence in Azure Blob is manually confirmed. In a compliance environment where data recovery is not an option, this checkpoint is a design requirement – not an operational inconvenience.
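
One way to make such a gate hard to bypass is to have the cleanup step refuse to delete anything that is not on an operator-confirmed list. The sketch below assumes the operator records, for each archive confirmed in Azure Blob, its file name and SHA-256 checksum; that record format and the checksum comparison are illustrative assumptions, since the case study states only that confirmation is manual.

```python
import hashlib
from pathlib import Path
from typing import Mapping

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are never loaded fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def delete_if_confirmed(local_file: Path, confirmed: Mapping[str, str]) -> bool:
    """Delete local_file only when an operator-confirmed entry matches its name and checksum.

    `confirmed` maps archive file names to the SHA-256 values the operator
    verified against the uploaded copy. Returns True if the file was deleted.
    """
    expected = confirmed.get(local_file.name)
    if expected is None:
        return False  # never confirmed: keep the local copy
    if sha256_of(local_file) != expected:
        return False  # confirmed name, but contents differ: keep and investigate
    local_file.unlink()
    return True
```

The default outcome is to keep the file, so a missing or mismatched confirmation can delay cleanup but cannot cause data loss.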

BEFORE & AFTER

METRIC | BEFORE | AFTER
Log Retention | No structured long-term archival | 11-year regulatory-compliant retention on Azure Blob
Storage Approach | On-premise only, hitting capacity limits | Tiered – hot on-premise, cold on Azure Blob
Query Response Time | Unstructured search across full dataset | 1 second per microservice index
Microservice Visibility | No index-level separation | 35+ dedicated indexes, full traceability
Archival Process | Manual, ad hoc | Fully automated – snapshot, compress, upload
Downtime During Maintenance | At risk | Zero – rolling dual-node architecture
Data Deletion Safety | No verification step | Manual confirmation gate before any deletion

Outcome

Query Performance
1-second response times sustained across 35+ microservice indexes at 700–800 GB daily ingest.
Data Integrity Preserved
Manual verification gate ensures no local file is deleted before confirmed cloud archival.
Full Traceability
Structured Kibana dashboards across all microservices enable fast incident investigation and transaction-level log search.
Storage Constraint Resolved
Active logs tiered on-premise, archival data moved to cloud – without burdening existing infrastructure.
Regulatory Compliance Met
11-year transaction log retention achieved through scalable, cloud-backed archival on Azure Blob Storage.
Operational Efficiency
End-to-end archival pipeline automated – ongoing operational overhead kept minimal despite the scale.
Zero Downtime
Rolling dual-node architecture for Kibana and Logstash ensured continuous availability across all maintenance cycles.

Conclusion

Ashnik’s approach to this engagement was not to deploy a stack – it was to architect a platform that could carry the weight of a BFSI compliance obligation across an 11-year window, at 700–800 GB of daily ingest, without a single point of failure. Every decision – role-separated Elasticsearch nodes, independent Filebeat instances, a manual verification gate before deletion – reflects a deliberate engineering choice, not a default configuration.