
Written by Sudheer Kumar | Feb 14, 2026 | 3 min read

Executing Zero-Downtime Server Replacement in a Large Payments Platform

In large-scale payment ecosystems, infrastructure upgrades are never routine. These platforms run 24×7, process critical financial transactions, and have near-zero tolerance for disruption.

This engagement builds upon the earlier transformation journey documented here:

From Chaos to Control: How a Payment Solution Company Transformed Log Management with Elastic Stack

At that stage:

  • Ingest rate: ~20,000 events per second
  • Log sources: 2 Load Balancers
  • Retention: 10 days

Today, the platform operates at an entirely different scale:

  • Ingest rate: 100,000 events per second
  • Log sources: 30 Load Balancers
  • Retention requirement: 30 days

This fivefold ingest growth, combined with a tripled retention requirement, demanded a fundamental infrastructure upgrade, including complete physical server replacement, executed on a Kubernetes 1.29 cluster without any downtime.

The Starting Architecture and Constraints

Elasticsearch was deployed on Kubernetes v1.29, using local persistent volumes, across:

  • 6 physical servers
  • 21 Elasticsearch pods

However, the cluster had become resource-constrained:

  • 2 nodes had ~11 TB disk, but limited CPU and RAM
  • 4 nodes had insufficient disk capacity
  • Disk watermarks were frequently approached
  • Retention was capped at 10 days
  • Increasing shard allocation was becoming difficult
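
Watermark pressure of this kind can be verified from the cluster itself. As a quick check, assuming direct access to the Elasticsearch HTTP endpoint (shown here on localhost:9200, with any authentication flags omitted):

```
# Per-node disk usage versus the allocation watermarks
curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.used,disk.total'

# Default watermark behaviour (tunable via _cluster/settings):
#   low   85% - the node stops receiving new shards
#   high  90% - Elasticsearch starts relocating shards off the node
#   flood 95% - indices with a shard on the node become read-only
```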

Because the setup relied on local persistent storage:

  • Stateful pods could not be live-migrated
  • Persistent volumes could not be detached and moved
  • In-place disk changes were risky
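
The reason is visible on the PersistentVolume object itself. As a minimal illustration (the PV name here is hypothetical), a local PV carries a required nodeAffinity term that hard-pins it to one hostname:

```
# Print the node pinning of a local PV (name hypothetical)
kubectl get pv es-data-pv-0 -o jsonpath='{.spec.nodeAffinity.required}'
# Output contains a nodeSelectorTerm on kubernetes.io/hostname, meaning
# the volume, and any pod bound to it, can only ever run on that machine.
```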

Downtime was not an option.

Why Disk Enhancement Became Mandatory

Disk enhancement was not merely a storage upgrade — it was a scale necessity.

Several factors made it unavoidable:

  1. Ingest Growth
    • From 20K events/sec → 100,000 events/sec
    • From 2 Load Balancers → 30 Load Balancers

    This is a 5x ingest increase and a 15x increase in log sources.

  2. Retention Increase
    • From 10 days → 30 days

    Retention growth alone required a 3x storage expansion (a rough capacity estimate follows this list).

  3. Cluster Stability
    • High disk watermark thresholds were being approached frequently
    • Shard allocation restrictions were appearing
    • Rebalancing operations slowed down
  4. I/O Pressure

    Higher ingest combined with larger shard sizes increased disk throughput requirements.

    Without disk enhancement:

    • Shards would stop allocating
    • Retention extension would fail
    • Cluster performance would degrade
    • Operational risk would increase

    This was a structural capacity expansion — not optional scaling.
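
To put rough numbers behind that statement, here is a back-of-envelope sizing sketch. The ~1 KiB average event size is an assumption chosen for illustration, not a measured figure from this platform:

```
# Illustrative capacity arithmetic (assumed 1 KiB average event size)
EPS=100000            # events per second at the new scale
EVENT_BYTES=1024      # assumed average event size
DAYS=30               # retention requirement
TIB=$(( 1024 ** 4 ))
PER_DAY=$(( EPS * EVENT_BYTES * 86400 ))
echo "~$(( PER_DAY / TIB )) TiB/day raw; ~$(( PER_DAY * DAYS / TIB )) TiB over ${DAYS} days"
# Prints roughly 8 TiB/day and ~240 TiB over 30 days, before replica
# copies (which add to it) and index compression (which reduces it).
```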

Why We Did NOT Just Add Disks to Old Servers

A common question was:

Why not simply attach additional disks instead of replacing servers?

After a detailed hardware assessment, the reasons were clear:

  • Servers were aging and nearing hardware lifecycle limits
  • Backplane expansion slots were exhausted
  • RAID/controller limitations prevented further safe scaling
  • CPU and RAM were already undersized relative to modern ingest needs
  • Firmware and hardware support windows were closing

Adding disks to aging hardware would have:

  • Increased failure risk
  • Created CPU-to-disk imbalance
  • Prolonged dependency on outdated infrastructure
  • Increased operational complexity

Instead, the strategic decision was taken to replace physical servers entirely with modern, balanced configurations.

This ensured:

  • Better CPU, RAM, and disk balance
  • Improved I/O performance
  • Long-term scalability
  • Sustainable 30-day retention
  • Headroom for future ingest growth beyond 100,000 EPS

Strategy: Horizontal Expansion Instead of Risky Modifications

Because Elasticsearch was running on Kubernetes 1.29 with local PVs, modifying nodes directly was high risk.

Kubernetes cannot live-migrate StatefulSets.
Local persistent volumes cannot be moved across physical machines.

Therefore, we adopted a horizontal expansion approach:

  1. Add new physical servers
  2. Join them to the Kubernetes cluster (sketched below)
  3. Migrate Elasticsearch data gradually
  4. Decommission old servers safely

This minimized operational risk.
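
For step 2, joining the new servers is routine cluster administration. As a sketch, assuming a kubeadm-managed cluster (the endpoint, token, and hash are placeholders):

```
# On each new physical server: join it to the existing cluster
kubeadm join cp.example.internal:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>

# From the control plane: confirm the new workers are Ready
kubectl get nodes -o wide
```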

Application-Level Migration Using Elasticsearch

Instead of treating this as a hardware migration, we treated it as an application-aware migration.

Using Elasticsearch shard allocation controls:

cluster.routing.allocation.exclude._name

We:

  • Prevented new shards from allocating to old nodes
  • Gradually evacuated existing shards
  • Monitored relocation progress
  • Maintained GREEN cluster status
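
In practice, that exclusion and its monitoring look roughly like this (the node name pattern is hypothetical; security flags omitted):

```
# Stop allocating shards to the nodes being retired and drain them
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent":{"cluster.routing.allocation.exclude._name":"es-data-old-*"}}'

# Watch relocation progress and overall cluster health
curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true'
curl -s 'http://localhost:9200/_cluster/health?pretty'

# A node is fully evacuated once _cat/shards lists nothing on it
curl -s 'http://localhost:9200/_cat/shards' | grep es-data-old-1 || echo evacuated
```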

Only after complete shard evacuation did we proceed to infrastructure-level operations.

Coordinating with Kubernetes (v1.29)

After shard evacuation:

  1. Node was cordoned using kubectl cordon
  2. Elasticsearch pod and PVC were deleted
  3. Elasticsearch Operator recreated pods on new physical servers
  4. Allocation exclusion was removed
  5. Cluster rebalanced safely

This process was repeated node-by-node.
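
In command form, one cycle of that loop looked roughly like the following. The node, pod, PVC, and namespace names are hypothetical stand-ins, patterned on an Operator-managed (ECK-style) deployment:

```
# Run only after the node holds zero shards (verified via _cat/shards)
kubectl cordon worker-old-01                         # keep new pods off the old node
kubectl delete pvc elasticsearch-data-logs-es-data-3 -n elastic-system  # marked for deletion
kubectl delete pod logs-es-data-3 -n elastic-system  # pod gone -> PVC finalizes and is removed
kubectl get pods -n elastic-system -o wide           # Operator recreates pod (and PVC) on a new server

# Once the pod is healthy on new hardware, lift the exclusion
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent":{"cluster.routing.allocation.exclude._name":null}}'
```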

  • No downtime.
  • No ingestion interruption.
  • No traffic impact.
  • No service degradation.

The Ashnik team executed this entire physical replacement seamlessly.

What This Phase Achieved

After completion:

  • All aging physical servers were replaced
  • Disk capacity significantly increased
  • Retention extended from 10 days to 30 days
  • Ingest stabilized at 100,000 events/sec
  • Log sources scaled from 2 LBs to 30 LBs
  • Disk watermarks normalized
  • Cluster rebalancing improved
  • Zero downtime was maintained throughout

Why This Approach Matters

This migration demonstrated:

  • Stateful workload upgrades require application-aware orchestration
  • Hardware lifecycle planning is critical at scale
  • Retention growth is capacity engineering, not just configuration
  • Kubernetes limitations must be addressed with intelligent design

Most importantly, it showed that:

Even large-scale physical server replacement can be executed without downtime when done methodically.

Summary

From:

  • 20K EPS → 100,000 EPS
  • 2 Load Balancers → 30 Load Balancers
  • 10 days retention → 30 days retention

The platform has evolved dramatically.

And instead of patching aging infrastructure, the right strategic decision was made — modernize the hardware foundation itself.

Large systems change safely not through disruption, but through controlled, phased, application-aware execution.

That is the approach we followed — and the system continues to operate stronger than ever.

