On large payment platforms, infrastructure changes are never simple. These systems run continuously, handle high transaction volumes, and have very little tolerance for downtime.
This was the situation when we started work on a large national payments platform that needed to replace aging physical servers and extend Elasticsearch data retention. The requirement was clear: servers had to be replaced without impacting live traffic, and retention had to move from 10 days to 30 days.
This write-up covers how I approached and executed this phase of the work, given the constraints of the existing setup.
The starting point and constraints
At the beginning, Elasticsearch was running on Kubernetes using local persistent storage. The cluster consisted of 6 physical servers hosting 21 Elasticsearch pods.
Two major constraints shaped the approach:
- Kubernetes does not support live migration of stateful pods
- Local persistent volumes cannot be easily moved or detached
There was also a clear resource imbalance in the cluster.
- 2 nodes had around 11 TB of storage each, but limited CPU and memory
- 4 nodes had limited storage, which restricted shard movement and prevented retention from being increased
Because of this, retention was capped at 10 days, and launching additional pods was becoming difficult. At the same time, downtime was not an option.
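For context, this kind of imbalance is easy to confirm from the _cat/allocation API, which reports shard counts and disk usage per data node. Here is a minimal sketch using the official Python client; the endpoint is a placeholder, not the platform's actual address:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint; in the real cluster this would be the in-cluster service URL.
es = Elasticsearch("https://elasticsearch.example.internal:9200")

# _cat/allocation lists shards, disk used, and disk available per data node,
# which makes storage imbalances easy to spot at a glance.
for row in es.cat.allocation(format="json", bytes="gb"):
    print(row["node"], row["shards"], row["disk.used"], row["disk.avail"])
```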
Why we chose horizontal expansion
Given the use of local persistent storage, modifying or draining existing nodes directly carried high risk. Removing disks or forcefully relocating pods could easily impact production traffic.
Kubernetes alone could not solve this problem, since it cannot live-migrate stateful workloads.
Instead of changing what was already running, I decided to take a horizontal expansion approach. The plan was to add new physical servers with better-balanced resources and gradually move Elasticsearch data onto them in a controlled manner.
Application-aware migration using Elasticsearch
The key to this approach was treating the migration as an application-level operation rather than an infrastructure-level one.
On the Elasticsearch side, I used the persistent cluster.routing.allocation.exclude setting to evacuate data from a node. This ensured that:
- No new shards were allocated to the node
- Existing shards were gradually moved to other nodes
Once shard movement was complete, the node no longer held Elasticsearch data and could be safely handled at the infrastructure level.
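In practice, draining a node this way comes down to a single persistent cluster settings update. The sketch below shows the idea with the 8.x Python client; the node name and endpoint are placeholders rather than the platform's real values:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://elasticsearch.example.internal:9200")

# Exclude the node (placeholder name) from shard allocation. Elasticsearch
# stops placing new shards there and starts relocating existing ones away.
# "persistent" keeps the setting across cluster restarts.
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.exclude._name": "es-data-node-3"
})

# Relocation progress can be followed via cluster health:
# relocating_shards drops to 0 once the node has been emptied.
health = es.cluster.health()
print(health["relocating_shards"], health["status"])
```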
Coordinating with Kubernetes for pod movement
After draining shards from Elasticsearch, the next step was at the Kubernetes layer.
I cordoned the node to prevent new pods from being scheduled on it. Once it was cordoned, the Elasticsearch pod and its persistent volume claim (PVC) were deleted.
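On this platform these were plain kubectl operations; purely for illustration, the same sequence with the Kubernetes Python client might look like this (node, pod, PVC, and namespace names are made up, not the actual ones):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Cordon the node: marking it unschedulable is exactly what `kubectl cordon` does.
core.patch_node("worker-node-3", {"spec": {"unschedulable": True}})

# Delete the Elasticsearch pod and its PVC so the operator can recreate them
# on another node. Names and namespace are placeholders.
core.delete_namespaced_pod("es-data-2", namespace="elastic-system")
core.delete_namespaced_persistent_volume_claim(
    "elasticsearch-data-es-data-2", namespace="elastic-system"
)
```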
At this point, the Elasticsearch operator's reconciliation logic recreated the pod on another node that had sufficient CPU, memory, and storage. After the new pod was healthy, shard allocation exclusions were removed, allowing Elasticsearch to rebalance data across the cluster.
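Clearing the exclusion and waiting for the cluster to settle can be sketched the same way (same placeholder endpoint as above; setting the value to null removes the exclusion):

```python
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("https://elasticsearch.example.internal:9200")

# Setting the exclusion to null removes it, so shards can be placed on the
# node's replacement again and the cluster can rebalance.
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.exclude._name": None
})

# Wait until rebalancing finishes and the cluster is green before moving on
# to the next node.
while True:
    health = es.cluster.health()
    if health["status"] == "green" and health["relocating_shards"] == 0:
        break
    time.sleep(30)
```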
This process was repeated node by node, moving data from lower-capacity servers to the newly added ones.
Throughout this process, live traffic continued without interruption.
What this phase achieved
By the end of this phase:
- Older physical servers were replaced with new ones
- Elasticsearch pods were redistributed across better-balanced nodes
- Data retention was extended from 10 days to 30 days
- All changes were completed without planned downtime
This phase demonstrated that even with local persistent storage and Kubernetes limitations, zero-downtime server replacement is possible when Elasticsearch shard allocation is used correctly.
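This write-up does not go into how retention itself is enforced. If index lifecycle management (ILM) is in place, as is common for log indices, extending retention from 10 to 30 days amounts to raising the delete phase age. The sketch below is hypothetical; the policy name and phase layout are assumptions, not the platform's actual configuration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://elasticsearch.example.internal:9200")

# Hypothetical ILM policy: delete indices 30 days after rollover instead of 10.
# The policy name and phase settings are illustrative only.
es.ilm.put_lifecycle(
    name="logs-retention",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    },
)
```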
Why this approach matters
This work was not only about replacing servers. It established a safe pattern for making infrastructure changes in a constrained environment.
It also created a foundation for future activities, including onboarding additional log sources, expanding across data centers, and planning longer-term retention strategies.
Just as importantly, it highlighted the need for careful sizing and capacity planning as the platform continues to evolve.
Final thoughts
This phase reinforced a simple principle for me: large systems change safely through careful, phased execution, not through big disruptive moves.
At Ashnik, we often work inside these kinds of constraints. The focus is on understanding how the system is running today and making changes that keep it stable while opening up room to grow.
That mindset is what guided this phase of the work and continues to shape how we approach similar enterprise platforms.