In large-scale payment ecosystems, infrastructure upgrades are never routine. These platforms run 24×7, process critical financial transactions, and have near-zero tolerance for disruption.
This engagement builds upon the earlier transformation journey documented here:
From Chaos to Control: How a Payment Solution Company Transformed Log Management with Elastic Stack
At that stage:
- Ingest rate: ~20,000 events per second
- Log sources: 2 Load Balancers
- Retention: 10 days
Today, the platform operates at an entirely different scale:
- Ingest rate: 100,000 events per second
- Log sources: 30 Load Balancers
- Retention requirement: 30 days
This step change in scale demanded a fundamental infrastructure upgrade, including complete physical server replacement, executed on a Kubernetes 1.29 cluster without any downtime.
The Starting Architecture and Constraints
Elasticsearch was deployed on Kubernetes v1.29, using local persistent volumes, across:
- 6 physical servers
- 21 Elasticsearch pods
However, the cluster had become resource-constrained:
- 2 nodes had ~11 TB disk, but limited CPU and RAM
- 4 nodes had insufficient disk capacity
- Disk watermarks were frequently approached (a quick check is sketched after this list)
- Retention was capped at 10 days
- Allocating additional shards was becoming difficult
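Pressures like these are visible directly from Elasticsearch's APIs. A minimal check sketch, assuming curl access to the cluster; the localhost:9200 endpoint is a placeholder, and a secured cluster would also need credentials and TLS:

```bash
# Disk usage and shard count per data node, worst first.
curl -s "http://localhost:9200/_cat/allocation?v&s=disk.percent:desc"

# Effective disk watermark settings (defaults: low 85%, high 90%, flood_stage 95%).
curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&pretty" | grep -A 5 '"watermark"'
```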
Because the setup relied on local persistent storage:
- Stateful pods could not be live-migrated
- Persistent volumes could not be detached and moved to another machine (local PVs are pinned to one node, as the manifest sketch after this list shows)
- In-place disk changes were risky
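To make that pinning concrete: a local PersistentVolume carries a mandatory nodeAffinity that ties it to one physical server, so its data cannot follow a pod elsewhere. A minimal illustrative manifest; every name, size, and path below is a placeholder, not this cluster's actual configuration:

```bash
# Illustrative only: local PVs require nodeAffinity, which pins the volume
# (and any pod bound to it) to a single physical server.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: es-data-local-pv                 # placeholder name
spec:
  capacity:
    storage: 10Ti
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/es-data             # placeholder path on the host
  nodeAffinity:                          # mandatory for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["worker-old-1"]   # the volume exists on this node only
EOF
```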
Downtime was not an option.
Why Disk Enhancement Became Mandatory
Disk enhancement was not merely a storage upgrade; it was a necessity imposed by the new scale.
Several factors made it unavoidable:
- Ingest growth
  - From 20K events/sec → 100,000 events/sec
  - From 2 Load Balancers → 30 Load Balancers
  - This is a 5x ingest increase and a 15x increase in log sources.
- Retention increase
  - From 10 days → 30 days
  - Retention growth alone required a 3x storage expansion (a rough sizing sketch follows this list).
- Cluster stability
  - High disk watermark thresholds were frequently approached
  - Shard allocation restrictions were appearing
  - Rebalancing operations slowed down
- I/O pressure
  - Higher ingest combined with larger shard sizes increased disk throughput requirements.
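For a rough sense of the storage this implies, here is a back-of-envelope sizing sketch. The average event size and replica count are illustrative assumptions, not measured figures from this platform; real on-disk usage also depends on mappings, compression, and indexing overhead:

```bash
# Back-of-envelope storage sizing (EVENT_BYTES and REPLICAS are assumptions).
EPS=100000            # events per second
EVENT_BYTES=500       # assumed average indexed event size, in bytes
RETENTION_DAYS=30
REPLICAS=1            # one replica copy per primary shard

DAILY_GB=$(( EPS * EVENT_BYTES * 86400 / 1024 / 1024 / 1024 ))
TOTAL_GB=$(( DAILY_GB * RETENTION_DAYS * (1 + REPLICAS) ))
echo "~${DAILY_GB} GB/day of primaries, ~${TOTAL_GB} GB on disk at ${RETENTION_DAYS}-day retention"
```

Whatever the exact event size, tripling retention while quintupling ingest multiplies the required disk footprint roughly fifteen-fold.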
Without disk enhancement:
- Shards would stop allocating
- Retention extension would fail
- Cluster performance would degrade
- Operational risk would increase
This was a structural capacity expansion — not optional scaling.
Why We Did NOT Just Add Disks to Old Servers
A common question was:
Why not simply attach additional disks instead of replacing servers?
After a detailed hardware assessment, the reasons were clear:
- Servers were aging and nearing hardware lifecycle limits
- Backplane expansion slots were exhausted
- RAID/controller limitations prevented further safe scaling
- CPU and RAM were already undersized relative to modern ingest needs
- Firmware and hardware support windows were closing
Adding disks to aging hardware would have:
- Increased failure risk
- Created CPU-to-disk imbalance
- Prolonged dependency on outdated infrastructure
- Increased operational complexity
Instead, the strategic decision was taken to replace physical servers entirely with modern, balanced configurations.
This ensured:
- Better CPU, RAM, and disk balance
- Improved I/O performance
- Long-term scalability
- Sustainable 30-day retention
- Headroom for future ingest growth beyond 100K EPS
Strategy: Horizontal Expansion Instead of Risky Modifications
Because Elasticsearch was running on Kubernetes 1.29 with local PVs, modifying nodes directly was high risk.
Kubernetes cannot live-migrate StatefulSets.
Local persistent volumes cannot be moved across physical machines.
Therefore, we adopted a horizontal expansion approach:
- Add new physical servers
- Join them to the Kubernetes cluster
- Migrate Elasticsearch data gradually
- Decommission old servers safely
This minimized operational risk.
Application-Level Migration Using Elasticsearch
Instead of treating this as a hardware migration, we treated it as an application-aware migration.
Using Elasticsearch shard allocation controls:
cluster.routing.allocation.exclude._name
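Applied as a dynamic cluster setting, the exclusion is a single API call. A sketch with placeholder endpoint and node name (a secured cluster would also need credentials and TLS):

```bash
# Exclude an old data node from shard allocation; Elasticsearch then
# relocates its shards onto the remaining eligible nodes.
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "persistent": {
      "cluster.routing.allocation.exclude._name": "es-data-old-1"
    }
  }'
```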
We:
- Prevented new shards from allocating to old nodes
- Gradually evacuated existing shards
- Monitored relocation progress
- Maintained GREEN cluster status
Only after complete shard evacuation did we proceed to infrastructure-level operations.
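Relocation progress can be watched through the same APIs; a minimal check, reusing the placeholder node name:

```bash
# Shards still resident on the excluded node (empty output = fully drained).
curl -s "http://localhost:9200/_cat/shards?v" | grep es-data-old-1

# Cluster status and in-flight relocations; green with zero relocating
# shards is the signal to proceed to infrastructure work.
curl -s "http://localhost:9200/_cluster/health?filter_path=status,relocating_shards"
```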
Coordinating with Kubernetes (v1.29)
After shard evacuation:
- Node was cordoned using kubectl cordon
- Elasticsearch pod and PVC were deleted
- Elasticsearch Operator recreated pods on new physical servers
- Allocation exclusion was removed
- Cluster rebalanced safely
This process was repeated node-by-node.
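Condensed into commands, one node's cycle looked roughly like the sketch below. Pod, PVC, namespace, and node names are placeholders, and the exact PVC name depends on the Operator's volume claim template:

```bash
# One node's replacement cycle (all names are placeholders).
kubectl cordon worker-old-1                        # keep new pods off the old server

# Delete the PVC first (it stays Terminating while the pod still uses it),
# then the pod; the Operator reschedules it onto the new hardware.
kubectl delete pvc elasticsearch-data-es-data-2 -n elastic --wait=false
kubectl delete pod es-data-2 -n elastic

# Once the replacement pod is healthy, lift the allocation exclusion so
# shards can rebalance onto the new server.
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.exclude._name": null}}'
```

Working one server at a time meant only one node's shards were ever in flight, which is what kept the cluster GREEN throughout.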
- No downtime.
- No ingestion interruption.
- No traffic impact.
- No service degradation.
The Ashnik team executed this entire physical replacement seamlessly.
What This Phase Achieved
After completion:
- All aging physical servers were replaced
- Disk capacity significantly increased
- Retention extended from 10 days to 30 days
- Ingest stabilized at 100,000 events/sec
- Log sources scaled from 2 LBs to 30 LBs
- Disk watermarks normalized
- Cluster rebalancing improved
- Zero downtime maintained throughout
Why This Approach Matters
This migration demonstrated:
- Stateful workload upgrades require application-aware orchestration
- Hardware lifecycle planning is critical at scale
- Retention growth is capacity engineering, not just configuration
- Kubernetes limitations must be addressed with intelligent design
Most importantly, it showed that:
Even large-scale physical server replacement can be executed without downtime when done methodically.
Summary
From:
- 20K EPS → 100K EPS
- 2 Load Balancers → 30 Load Balancers
- 10 days retention → 30 days retention
The platform has evolved dramatically.
And instead of patching aging infrastructure, the team made the right strategic decision: modernize the hardware foundation itself.
Large systems change safely not through disruption, but through controlled, phased, application-aware execution.
That is the approach we followed — and the system continues to operate stronger than ever.