Zero-Downtime Server Replacement at Scale — Upgrading a High-Volume Payments Log Infrastructure

  • Industry : Payments
  • Technology : Elastic Stack, Kubernetes
  • Engagement : Infrastructure Migration

THE CUSTOMER

A leading payment solution company processing millions of daily transactions on a 24×7 Elastic Stack log platform.

THE CHALLENGE

A 5x ingest growth and a 3x retention requirement left the aging physical servers critically undersized, with no safe option to add disks.

THE SOLUTION

Complete physical server replacement on a live Kubernetes 1.29 cluster using horizontal expansion and application-aware shard migration.

THE RESULT

Ingest scaled to 100,000 EPS, retention extended to 30 days, and 30 load balancers onboarded, with zero downtime throughout.

100,000

Events per second post-migration

30 days

Log retention achieved

30

Load balancers onboarded, up from 2

0

Minutes of downtime

CUSTOMER OVERVIEW

This engagement builds on an earlier transformation where Ashnik helped the same payment solution company overhaul log management using the Elastic Stack. At the close of that phase, the platform ingested ~20,000 events per second from 2 load balancers with a 10-day retention window.

Since then, ingest had grown to 100,000 events per second across 30 load balancers, and the business required 30-day log retention. The existing infrastructure, 6 physical servers running 21 Elasticsearch pods with uneven disk, CPU, and RAM configurations, could no longer sustain this load. A full hardware replacement was unavoidable. The constraint: the platform could not go down.

THE CHALLENGE

Capacity Exhaustion
2 nodes had ~11 TB disk but limited CPU and RAM; 4 nodes had insufficient disk capacity.
Watermark Breaches
The cluster frequently approached Elasticsearch's high disk watermark, triggering shard allocation restrictions and slowing rebalancing (a quick check for this is sketched after this list).
Retention Gap
Extending from 10 to 30 days required 3x storage expansion — impossible on the existing hardware.
Non-Migratable Workloads
The cluster used local persistent volumes; StatefulSet pods could not be live-migrated and PVCs could not be detached or moved.
Hardware Lifecycle
Aging servers had exhausted backplane slots, hit RAID/controller scaling limits, and were approaching firmware end-of-life — making disk additions structurally unsafe.
I/O Pressure
Higher ingest volumes combined with larger shard sizes significantly increased disk throughput requirements, pushing the existing storage subsystem beyond safe operating limits.
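Where disk pressure is suspected, both per-node usage and the watermark thresholds in effect can be read straight from the Elasticsearch API. A minimal sketch, assuming curl access to the cluster (the es-host:9200 endpoint is a placeholder):

    # Per-node disk usage and shard counts
    curl -s "https://es-host:9200/_cat/allocation?v"

    # Watermark thresholds currently in effect (defaults included)
    curl -s "https://es-host:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep watermark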

ASHNIK’S APPROACH

Rather than treating this as a hardware swap, Ashnik designed an application-aware migration strategy that worked within Kubernetes and Elasticsearch constraints, not around them.

01
Full Server Replacement over Disk Addition
A hardware assessment confirmed aging servers had exhausted expansion slots, undersized CPUs, and closing firmware support windows. Adding disks would increase failure risk and create CPU-to-disk imbalance.

02
Horizontal Expansion on Kubernetes 1.29
New servers were added to the existing cluster instead of modifying live nodes, sidestepping the fact that pods backed by local persistent volumes cannot be live-migrated.
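In practice this meant joining the new servers as additional worker nodes and making them schedulable for Elasticsearch before touching any old node. A rough sketch, where the node name and label are assumptions standing in for whatever selector the Elasticsearch pods actually use:

    # Confirm the new servers have joined and are Ready
    kubectl get nodes

    # Label a new server so Elasticsearch data pods can schedule onto it
    # (label key/value are hypothetical)
    kubectl label node worker-new-01 workload=elasticsearch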

03
Application-Aware Shard Evacuation
Elasticsearch allocation filtering was used to keep new shards off the old nodes and to evacuate the shards already on them.
cluster.routing.allocation.exclude._name
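The exclusion filter is applied through the cluster settings API; Elasticsearch then drains shards off the named nodes while they continue serving traffic. A minimal sketch, with the endpoint and node names as placeholders:

    curl -X PUT "https://es-host:9200/_cluster/settings" \
      -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "cluster.routing.allocation.exclude._name": "es-data-old-0,es-data-old-1"
      }
    }'

Evacuation progress can be followed with the _cat/shards API until no shards remain on the excluded nodes.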

04
Coordinated Kubernetes Decommissioning
Each old node was cordoned, its pods were recreated on the new servers, and the cluster was rebalanced node by node.
kubectl cordon • Elasticsearch Operator
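Node by node, the sequence looks roughly like this sketch; names are placeholders, and the exact pod recreation step depends on how the Elasticsearch Operator is configured:

    # Stop new pods from landing on the server being retired
    kubectl cordon worker-old-01

    # Confirm the Elasticsearch node on that server holds no shards
    # (es-data-old-0 is a hypothetical Elasticsearch node name)
    curl -s "https://es-host:9200/_cat/shards?h=index,shard,prirep,node" | grep es-data-old-0

    # After the operator recreates the pod on a new server, verify health
    # before moving to the next node
    curl -s "https://es-host:9200/_cluster/health?pretty"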

05
Performance Validation
Confirmed stable ingest at 100,000 EPS post-migration, with healthy shard rebalancing and the full 30-day retention in place.
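The validation itself can lean on the same APIs. A minimal sketch of the post-migration checks, endpoint placeholder as before:

    # Cluster should report green, with no relocating or unassigned shards
    curl -s "https://es-host:9200/_cluster/health?pretty"

    # Per-node indexing throughput, to confirm sustained ingest
    curl -s "https://es-host:9200/_nodes/stats/indices/indexing?pretty"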

BEFORE & AFTER

METRIC                  BEFORE                           AFTER
Ingest Rate             ~20,000 events/sec               100,000 events/sec
Log Sources             2 load balancers                 30 load balancers
Log Retention           10 days                          30 days
Disk Watermark Status   Frequently breached              Normalized with headroom
Physical Servers        Aging, expansion-exhausted       Modern, balanced configuration
Shard Allocation        Restricted due to disk pressure  Unrestricted, healthy rebalancing
Service Downtime        Not acceptable                   Zero; no ingestion interruption

Outcome

5x ingest capacity
Handling 100,000 events/sec across 30 load balancers without degradation.
3x longer retention
30-day log availability for compliance, investigation, and business intelligence.
Stable cluster health
Disk watermarks normalized and shard rebalancing fully restored.
Future-ready headroom
Balanced CPU, RAM, and disk on all nodes, with capacity beyond 100,000 EPS.

Conclusion

By treating a physical server replacement as an application-aware migration, Ashnik demonstrated that even the most constrained stateful infrastructure can be upgraded without disruption. The key was working with Elasticsearch’s shard allocation controls and Kubernetes’ pod lifecycle — not around them.

For a payments platform where downtime is not an option, this methodical, phased approach delivered a complete infrastructure modernization while the system continued processing transactions at full capacity. Large systems change safely not through disruption, but through controlled, phased, application-aware execution.