Zero-Downtime Server Replacement at Scale — Upgrading a High-Volume Payments Log Infrastructure

  • Industry : Payments
  • Technology : Elastic Stack, Kubernetes
  • Engagement : Infrastructure Migration

THE CUSTOMER

A leading payment solution company processing millions of daily transactions on a 24×7 Elastic Stack log platform.

THE CHALLENGE

A 5x ingest growth and a 3x retention requirement left the aging physical servers critically undersized, with no safe option to add disks.

THE SOLUTION

Complete physical server replacement on a live Kubernetes 1.29 cluster using horizontal expansion and application-aware shard migration.

THE RESULT

Ingest scaled to 100,000 EPS, retention extended to 30 days, and 30 load balancers onboarded, with zero downtime throughout.

100,000

Events per second post-migration

30 days

Log retention achieved

30

Load balancers onboarded, up from 2

0

Minutes of downtime

CUSTOMER OVERVIEW

This engagement builds on an earlier transformation where Ashnik helped the same payment solution company overhaul log management using the Elastic Stack. At the close of that phase, the platform ingested ~20,000 events per second from 2 load balancers with a 10-day retention window.

Since then, ingest had grown to 100,000 events per second across 30 load balancers, and the business required 30-day log retention. The existing infrastructure, 6 physical servers running 21 Elasticsearch pods with uneven disk, CPU, and RAM configurations, could no longer sustain this load. A full hardware replacement was unavoidable. The constraint: the platform could not go down.

THE CHALLENGE

Capacity Exhaustion
2 nodes had ~11 TB disk but limited CPU and RAM; 4 nodes had insufficient disk capacity.
Watermark Breaches
The cluster frequently approached Elasticsearch's high disk watermark, triggering shard allocation restrictions and slowing rebalancing (a quick check for this is sketched after this list).
Retention Gap
Extending from 10 to 30 days required 3x storage expansion — impossible on the existing hardware.
Non-Migratable Workloads
The cluster used local persistent volumes; StatefulSet pods could not be live-migrated and PVCs could not be detached or moved.
Hardware Lifecycle
Aging servers had exhausted backplane slots, hit RAID/controller scaling limits, and were approaching firmware end-of-life — making disk additions structurally unsafe.
I/O Pressure
Higher ingest volumes combined with larger shard sizes significantly increased disk throughput requirements, pushing the existing storage subsystem beyond safe operating limits.
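Where disk pressure is suspected, both per-node usage and the watermark thresholds in effect can be read straight from the Elasticsearch API. A minimal sketch, assuming curl access to the cluster (the es-host:9200 endpoint is a placeholder):

    # Per-node disk usage and shard counts
    curl -s "https://es-host:9200/_cat/allocation?v"

    # Watermark thresholds currently in effect (defaults included)
    curl -s "https://es-host:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep watermark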

ASHNIK’S APPROACH

Rather than treating this as a hardware swap, Ashnik designed an application-aware migration strategy that worked within Kubernetes and Elasticsearch constraints, not around them.

01
Full Server Replacement over Disk Addition
A hardware assessment confirmed aging servers had exhausted expansion slots, undersized CPUs, and closing firmware support windows. Adding disks would increase failure risk and create CPU-to-disk imbalance.

02
Horizontal Expansion on Kubernetes 1.29
New servers were added to the existing cluster instead of modifying live nodes, sidestepping the fact that pods backed by local persistent volumes cannot be live-migrated.
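In practice this meant joining the new servers as additional worker nodes and making them schedulable for Elasticsearch before touching any old node. A rough sketch, where the node name and label are assumptions standing in for whatever selector the Elasticsearch pods actually use:

    # Confirm the new servers have joined and are Ready
    kubectl get nodes

    # Label a new server so Elasticsearch data pods can schedule onto it
    # (label key/value are hypothetical)
    kubectl label node worker-new-01 workload=elasticsearch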

03
Application-Aware Shard Evacuation
Elasticsearch allocation filtering was used to keep new shards off the old nodes and to evacuate the shards already on them.
cluster.routing.allocation.exclude._name
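The exclusion filter is applied through the cluster settings API; Elasticsearch then drains shards off the named nodes while they continue serving traffic. A minimal sketch, with the endpoint and node names as placeholders:

    curl -X PUT "https://es-host:9200/_cluster/settings" \
      -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "cluster.routing.allocation.exclude._name": "es-data-old-0,es-data-old-1"
      }
    }'

Evacuation progress can be followed with the _cat/shards API until no shards remain on the excluded nodes.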

04
Coordinated Kubernetes Decommissioning
Each old node was cordoned, its pods were recreated on the new servers, and the cluster was rebalanced node by node.
kubectl cordon • Elasticsearch Operator
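Node by node, the sequence looks roughly like this sketch; names are placeholders, and the exact pod recreation step depends on how the Elasticsearch Operator is configured:

    # Stop new pods from landing on the server being retired
    kubectl cordon worker-old-01

    # Confirm the Elasticsearch node on that server holds no shards
    # (es-data-old-0 is a hypothetical Elasticsearch node name)
    curl -s "https://es-host:9200/_cat/shards?h=index,shard,prirep,node" | grep es-data-old-0

    # After the operator recreates the pod on a new server, verify health
    # before moving to the next node
    curl -s "https://es-host:9200/_cluster/health?pretty"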

05
Performance Validation
Confirmed stable ingest at 100,000 EPS post-migration, with healthy shard rebalancing and the full 30-day retention in place.
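The validation itself can lean on the same APIs. A minimal sketch of the post-migration checks, endpoint placeholder as before:

    # Cluster should report green, with no relocating or unassigned shards
    curl -s "https://es-host:9200/_cluster/health?pretty"

    # Per-node indexing throughput, to confirm sustained ingest
    curl -s "https://es-host:9200/_nodes/stats/indices/indexing?pretty"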

BEFORE & AFTER

METRIC                  BEFORE                           AFTER
Ingest Rate             ~20,000 events/sec               100,000 events/sec
Log Sources             2 load balancers                 30 load balancers
Log Retention           10 days                          30 days
Disk Watermark Status   Frequently breached              Normalized with headroom
Physical Servers        Aging, expansion-exhausted       Modern, balanced configuration
Shard Allocation        Restricted due to disk pressure  Unrestricted, healthy rebalancing
Service Downtime        Not acceptable                   Zero; no ingestion interruption

Outcome

5x ingest capacity
Handling 100,000 events/sec across 30 load balancers without degradation.
3x longer retention
30-day log availability for compliance, investigation, and business intelligence.
Stable cluster health
Disk watermarks normalized and shard rebalancing fully restored.
Future-ready headroom
Balanced CPU, RAM, and disk on all nodes, with capacity beyond 100,000 EPS.

Conclusion

By treating a physical server replacement as an application-aware migration, Ashnik demonstrated that even the most constrained stateful infrastructure can be upgraded without disruption. The key was working with Elasticsearch’s shard allocation controls and Kubernetes’ pod lifecycle — not around them.

For a payments platform where downtime is not an option, this methodical, phased approach delivered a complete infrastructure modernization while the system continued processing transactions at full capacity. Large systems change safely not through disruption, but through controlled, phased, application-aware execution.