On large payment platforms, infrastructure changes are never simple. These systems run continuously, handle high transaction volumes, and have very little tolerance for downtime.
This was the situation when we started work on a large national payments platform that needed to replace aging physical servers and extend Elasticsearch data retention. The requirement was clear: servers had to be replaced without impacting live traffic, and retention had to move from 10 days to 30 days.
This write-up covers how I approached and executed this phase of the work, given the constraints of the existing setup.
The starting point and constraints
At the beginning, Elasticsearch was running on Kubernetes using local persistent storage. The cluster consisted of 6 physical servers hosting 21 Elasticsearch pods.
Two major constraints shaped the approach:
- Kubernetes does not support live migration of stateful pods
- Local persistent volumes cannot be easily moved or detached
There was also a clear resource imbalance in the cluster.
- 2 nodes had around 11 TB of storage each, but limited CPU and memory
- 4 nodes had limited storage, which restricted shard movement and prevented retention from being increased
Because of this, retention was capped at 10 days, and launching additional pods was becoming difficult. At the same time, downtime was not an option.
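For context, this kind of imbalance is easy to confirm from the _cat/allocation API, which reports shard counts and disk usage per data node. Here is a minimal sketch using the official Python client; the endpoint is a placeholder, not the platform's actual address:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint; in the real cluster this would be the in-cluster service URL.
es = Elasticsearch("https://elasticsearch.example.internal:9200")

# _cat/allocation lists shards, disk used, and disk available per data node,
# which makes storage imbalances easy to spot at a glance.
for row in es.cat.allocation(format="json", bytes="gb"):
    print(row["node"], row["shards"], row["disk.used"], row["disk.avail"])
```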
Why we chose horizontal expansion
Given the use of local persistent storage, modifying or draining existing nodes directly carried high risk. Removing disks or forcefully relocating pods could easily impact production traffic.
Kubernetes alone could not solve this problem, since it cannot live-migrate stateful workloads.
Instead of changing what was already running, I decided to take a horizontal expansion approach. The plan was to add new physical servers with better-balanced resources and gradually move Elasticsearch data onto them in a controlled manner.
Application-aware migration using Elasticsearch
The key to this approach was treating the migration as an application-level operation rather than an infrastructure-level one.
On the Elasticsearch side, I used the persistent cluster.routing.allocation.exclude setting to evacuate data from a node. This ensured that:
- No new shards were allocated to the node
- Existing shards were gradually moved to other nodes
Once shard movement was complete, the node no longer held Elasticsearch data and could be safely handled at the infrastructure level.
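In practice, draining a node this way comes down to a single persistent cluster settings update. The sketch below shows the idea with the 8.x Python client; the node name and endpoint are placeholders rather than the platform's real values:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://elasticsearch.example.internal:9200")

# Exclude the node (placeholder name) from shard allocation. Elasticsearch
# stops placing new shards there and starts relocating existing ones away.
# "persistent" keeps the setting across cluster restarts.
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.exclude._name": "es-data-node-3"
})

# Relocation progress can be followed via cluster health:
# relocating_shards drops to 0 once the node has been emptied.
health = es.cluster.health()
print(health["relocating_shards"], health["status"])
```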
Coordinating with Kubernetes for pod movement
After draining shards from Elasticsearch, the next step was at the Kubernetes layer.
I cordoned the node to prevent new pods from being scheduled on it. Once it was cordoned, the Elasticsearch pod and its persistent volume claim (PVC) were deleted.
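On this platform these were plain kubectl operations; purely for illustration, the same sequence with the Kubernetes Python client might look like this (node, pod, PVC, and namespace names are made up, not the actual ones):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Cordon the node: marking it unschedulable is exactly what `kubectl cordon` does.
core.patch_node("worker-node-3", {"spec": {"unschedulable": True}})

# Delete the Elasticsearch pod and its PVC so the operator can recreate them
# on another node. Names and namespace are placeholders.
core.delete_namespaced_pod("es-data-2", namespace="elastic-system")
core.delete_namespaced_persistent_volume_claim(
    "elasticsearch-data-es-data-2", namespace="elastic-system"
)
```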
At this point, the Elasticsearch operator's reconciliation logic recreated the pod on another node that had sufficient CPU, memory, and storage. After the new pod was healthy, shard allocation exclusions were removed, allowing Elasticsearch to rebalance data across the cluster.
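Clearing the exclusion and waiting for the cluster to settle can be sketched the same way (same placeholder endpoint as above; setting the value to null removes the exclusion):

```python
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("https://elasticsearch.example.internal:9200")

# Setting the exclusion to null removes it, so shards can be placed on the
# node's replacement again and the cluster can rebalance.
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.exclude._name": None
})

# Wait until rebalancing finishes and the cluster is green before moving on
# to the next node.
while True:
    health = es.cluster.health()
    if health["status"] == "green" and health["relocating_shards"] == 0:
        break
    time.sleep(30)
```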
This process was repeated node by node, moving data from lower-capacity servers to the newly added ones.
Throughout this process, live traffic continued without interruption.
What this phase achieved
By the end of this phase:
- Older physical servers were replaced with new ones
- Elasticsearch pods were redistributed across better-balanced nodes
- Data retention was extended from 10 days to 30 days
- All changes were completed without planned downtime
This phase demonstrated that even with local persistent storage and Kubernetes limitations, zero-downtime server replacement is possible when Elasticsearch shard allocation is used correctly.
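This write-up does not go into how retention itself is enforced. If index lifecycle management (ILM) is in place, as is common for log indices, extending retention from 10 to 30 days amounts to raising the delete phase age. The sketch below is hypothetical; the policy name and phase layout are assumptions, not the platform's actual configuration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://elasticsearch.example.internal:9200")

# Hypothetical ILM policy: delete indices 30 days after rollover instead of 10.
# The policy name and phase settings are illustrative only.
es.ilm.put_lifecycle(
    name="logs-retention",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    },
)
```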
Why this approach matters
This work was not only about replacing servers. It established a safe pattern for making infrastructure changes in a constrained environment.
It also created a foundation for future activities, including onboarding additional log sources, expanding across data centers, and planning longer-term retention strategies.
Just as importantly, it highlighted the need for careful sizing and capacity planning as the platform continues to evolve.
Final thoughts
This phase reinforced a simple principle for me: large systems change safely through careful, phased execution, not through big disruptive moves.
At Ashnik, we often work inside these kinds of constraints. The focus is on understanding how the system is running today and making changes that keep it stable while opening up room to grow.
That mindset is what guided this phase of the work and continues to shape how we approach similar enterprise platforms.