
Written by Sudheer Kumar | Feb 14, 2026 | 3 min read

Executing Zero-Downtime Server Replacement in a Large Payments Platform

In large-scale payment ecosystems, infrastructure upgrades are never routine. These platforms run 24×7, process critical financial transactions, and have near-zero tolerance for disruption.

This engagement builds upon the earlier transformation journey documented here:

From Chaos to Control: How a Payment Solution Company Transformed Log Management with Elastic Stack

At that stage:

  • Ingest rate: ~20,000 events per second
  • Log sources: 2 Load Balancers
  • Retention: 10 days

Today, the platform operates at an entirely different scale:

  • Ingest rate: 100,000 events per second
  • Log sources: 30 Load Balancers
  • Retention requirement: 30 days

This fivefold ingest growth, combined with a tripled retention requirement, demanded a fundamental infrastructure upgrade, including complete physical server replacement, executed on a Kubernetes 1.29 cluster without any downtime.

The Starting Architecture and Constraints

Elasticsearch was deployed on Kubernetes v1.29, using local persistent volumes, across:

  • 6 physical servers
  • 21 Elasticsearch pods

However, the cluster had become resource-constrained:

  • 2 nodes had ~11 TB disk, but limited CPU and RAM
  • 4 nodes had insufficient disk capacity
  • Disk watermarks were frequently approached
  • Retention was capped at 10 days
  • Increasing shard allocation was becoming difficult
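
Watermark pressure of this kind can be verified from the cluster itself. As a quick check, assuming direct access to the Elasticsearch HTTP endpoint (shown here on localhost:9200, with any authentication flags omitted):

```
# Per-node disk usage versus the allocation watermarks
curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.used,disk.total'

# Default watermark behaviour (tunable via _cluster/settings):
#   low   85% - the node stops receiving new shards
#   high  90% - Elasticsearch starts relocating shards off the node
#   flood 95% - indices with a shard on the node become read-only
```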

Because the setup relied on local persistent storage:

  • Stateful pods could not be live-migrated
  • Persistent volumes could not be detached and moved
  • In-place disk changes were risky
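
The reason is visible on the PersistentVolume object itself. As a minimal illustration (the PV name here is hypothetical), a local PV carries a required nodeAffinity term that hard-pins it to one hostname:

```
# Print the node pinning of a local PV (name hypothetical)
kubectl get pv es-data-pv-0 -o jsonpath='{.spec.nodeAffinity.required}'
# Output contains a nodeSelectorTerm on kubernetes.io/hostname, meaning
# the volume, and any pod bound to it, can only ever run on that machine.
```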

Downtime was not an option.

Why Disk Enhancement Became Mandatory

Disk enhancement was not merely a storage upgrade — it was a scale necessity.

Several factors made it unavoidable:

  1. Ingest Growth
    • From 20K events/sec → 100,000 events/sec
    • From 2 Load Balancers → 30 Load Balancers

    This is a 5x ingest increase and a 15x increase in log sources.

  2. Retention Increase
    • From 10 days → 30 days

    Retention growth alone required a 3x storage expansion (a rough capacity estimate follows this list).

  3. Cluster Stability
    • High disk watermark thresholds were being approached frequently
    • Shard allocation restrictions were appearing
    • Rebalancing operations slowed down
  4. I/O Pressure

    Higher ingest combined with larger shard sizes increased disk throughput requirements.

    Without disk enhancement:

    • Shards would stop allocating
    • Retention extension would fail
    • Cluster performance would degrade
    • Operational risk would increase

    This was a structural capacity expansion — not optional scaling.
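
To put rough numbers behind that statement, here is a back-of-envelope sizing sketch. The ~1 KiB average event size is an assumption chosen for illustration, not a measured figure from this platform:

```
# Illustrative capacity arithmetic (assumed 1 KiB average event size)
EPS=100000            # events per second at the new scale
EVENT_BYTES=1024      # assumed average event size
DAYS=30               # retention requirement
TIB=$(( 1024 ** 4 ))
PER_DAY=$(( EPS * EVENT_BYTES * 86400 ))
echo "~$(( PER_DAY / TIB )) TiB/day raw; ~$(( PER_DAY * DAYS / TIB )) TiB over ${DAYS} days"
# Prints roughly 8 TiB/day and ~240 TiB over 30 days, before replica
# copies (which add to it) and index compression (which reduces it).
```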

Why We Did NOT Just Add Disks to Old Servers

A common question was:

Why not simply attach additional disks instead of replacing servers?

After a detailed hardware assessment, the reasons were clear:

  • Servers were aging and nearing hardware lifecycle limits
  • Backplane expansion slots were exhausted
  • RAID/controller limitations prevented further safe scaling
  • CPU and RAM were already undersized relative to modern ingest needs
  • Firmware and hardware support windows were closing

Adding disks to aging hardware would have:

  • Increased failure risk
  • Created CPU-to-disk imbalance
  • Prolonged dependency on outdated infrastructure
  • Increased operational complexity

Instead, the strategic decision was taken to replace physical servers entirely with modern, balanced configurations.

This ensured:

  • Better CPU, RAM, and disk balance
  • Improved I/O performance
  • Long-term scalability
  • Sustainable 30-day retention
  • Headroom for future ingest growth beyond 100,000 EPS

Strategy: Horizontal Expansion Instead of Risky Modifications

Because Elasticsearch was running on Kubernetes 1.29 with local PVs, modifying nodes directly was high risk.

Kubernetes cannot live-migrate StatefulSets.
Local persistent volumes cannot be moved across physical machines.

Therefore, we adopted a horizontal expansion approach:

  1. Add new physical servers
  2. Join them to the Kubernetes cluster (sketched below)
  3. Migrate Elasticsearch data gradually
  4. Decommission old servers safely

This minimized operational risk.
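
For step 2, joining the new servers is routine cluster administration. As a sketch, assuming a kubeadm-managed cluster (the endpoint, token, and hash are placeholders):

```
# On each new physical server: join it to the existing cluster
kubeadm join cp.example.internal:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>

# From the control plane: confirm the new workers are Ready
kubectl get nodes -o wide
```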

Application-Level Migration Using Elasticsearch

Instead of treating this as a hardware migration, we treated it as an application-aware migration.

Using Elasticsearch shard allocation controls:

cluster.routing.allocation.exclude._name

We:

  • Prevented new shards from allocating to old nodes
  • Gradually evacuated existing shards
  • Monitored relocation progress
  • Maintained GREEN cluster status
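
In practice, that exclusion and its monitoring look roughly like this (the node name pattern is hypothetical; security flags omitted):

```
# Stop allocating shards to the nodes being retired and drain them
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent":{"cluster.routing.allocation.exclude._name":"es-data-old-*"}}'

# Watch relocation progress and overall cluster health
curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true'
curl -s 'http://localhost:9200/_cluster/health?pretty'

# A node is fully evacuated once _cat/shards lists nothing on it
curl -s 'http://localhost:9200/_cat/shards' | grep es-data-old-1 || echo evacuated
```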

Only after complete shard evacuation did we proceed to infrastructure-level operations.

Coordinating with Kubernetes (v1.29)

After shard evacuation:

  1. Node was cordoned using kubectl cordon
  2. Elasticsearch pod and PVC were deleted
  3. Elasticsearch Operator recreated pods on new physical servers
  4. Allocation exclusion was removed
  5. Cluster rebalanced safely

This process was repeated node-by-node.
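
In command form, one cycle of that loop looked roughly like the following. The node, pod, PVC, and namespace names are hypothetical stand-ins, patterned on an Operator-managed (ECK-style) deployment:

```
# Run only after the node holds zero shards (verified via _cat/shards)
kubectl cordon worker-old-01                         # keep new pods off the old node
kubectl delete pvc elasticsearch-data-logs-es-data-3 -n elastic-system  # marked for deletion
kubectl delete pod logs-es-data-3 -n elastic-system  # pod gone -> PVC finalizes and is removed
kubectl get pods -n elastic-system -o wide           # Operator recreates pod (and PVC) on a new server

# Once the pod is healthy on new hardware, lift the exclusion
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent":{"cluster.routing.allocation.exclude._name":null}}'
```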

  • No downtime.
  • No ingestion interruption.
  • No traffic impact.
  • No service degradation.

The Ashnik team executed this entire physical replacement seamlessly.

What This Phase Achieved

After completion:

  • All aging physical servers were replaced
  • Disk capacity significantly increased
  • Retention extended from 10 days to 30 days
  • Ingest stabilized at 100,000 events/sec
  • Log sources scaled from 2 LBs to 30 LBs
  • Disk watermarks normalized
  • Cluster rebalancing improved
  • Zero downtime was maintained throughout

Why This Approach Matters

This migration demonstrated:

  • Stateful workload upgrades require application-aware orchestration
  • Hardware lifecycle planning is critical at scale
  • Retention growth is capacity engineering, not just configuration
  • Kubernetes limitations must be addressed with intelligent design

Most importantly, it showed that:

Even large-scale physical server replacement can be executed without downtime when done methodically.

Summary

From:

  • 20K EPS → 100,000 EPS
  • 2 Load Balancers → 30 Load Balancers
  • 10 days retention → 30 days retention

The platform has evolved dramatically.

And instead of patching aging infrastructure, the right strategic decision was made — modernize the hardware foundation itself.

Large systems change safely not through disruption, but through controlled, phased, application-aware execution.

That is the approach we followed — and the system continues to operate stronger than ever.

