Ashnik Delivers A Highly Scalable Logging And Monitoring Platform For A Fortune 500 Company


A leading fintech solutions provider had set up a monitoring mechanism for the IT infrastructure that hosts its core-banking-solution-as-a-service for multiple banks. After the initial onboarding, the customer faced scalability and response-time challenges. This case study describes how the Ashnik consulting team helped the customer set up a scalable, reliable and high-performance platform, and how Ashnik’s operational services improved the service the customer delivers to its own customers.

The Background

Our customer is a Fortune 500 company and a leading global provider of financial services and solutions. In this case, the customer provides entire banking operations as a service, through its core banking application, to multiple banks in India. Each bank’s business, reputation and customer relationships depend on the 24×7 availability of this banking solution, which the customer hosts and manages.

Imagine making an online transaction as an individual: you know that your hard-earned money has been debited, but you do not receive an SMS or email alert. Naturally there is panic, and the bank’s reputation is at stake. Here we are talking about a scenario in which millions of customers might not receive information in real time if the infrastructure is not ready.

Moreover, when the customer hosts the solution and manages the operations of not just one bank but several, imagine the kind of infrastructure it has to provide and the SLAs it has to adhere to.

Setup before Ashnik came into the picture

For the smooth operation of the core banking application, it is crucial to monitor its technology components on a 24×7 basis. The customer had set up a log-capturing mechanism to monitor the performance of key components: web servers, CPUs, server uptime, applications, databases, etc. These logs provide crucial insights into the uptime and performance of the overall system.

Key challenges

As a core-banking-as-a-service provider, the customer had a complex IT infrastructure with large volumes of logs (infrastructure logs, application logs, database logs, etc.) coming in from multiple sources. It was important to aggregate these logs and provide real-time analytics and alerts to the monitoring team to ensure smooth operations of multiple banks 24×7. Although log data helps in discovering vital trends and patterns, it first has to be parsed and analyzed thoroughly before alerts can be raised to the team.

Although the customer had set up a monitoring mechanism in the early days, it no longer gave timely insights as the number of banks increased. The customer found it very difficult to get a complete view of all the infrastructure in one place and had to consolidate results from each bank’s monitoring reports manually. The setup could not meet the real-time needs of the crucial and sensitive operations of its banking customers.

Initial Groundwork

The Ashnik team carefully studied the network topology and found that the customer was using single-node deployments of the community version of Elasticsearch for monitoring, each with its own KPIs, in a distributed manner; i.e., it was hosting one single-node ELK server for each bank. As the number of banks increased, this setup became a bottleneck.

Diagram 1 – Existing/Previous architecture

  • In the existing architecture, there were six standalone servers running the community version of the ELK stack. This setup was used for monitoring and managing NGINX services along with some other services. (The ELK stack and Beats version was 5.x.)
  • Metricbeat was used for monitoring resources at the target (NGINX and other services), as well as CPU, memory, disk I/O, etc.
  • Heartbeat was mainly used for monitoring the up or down status of services
  • Filebeat was used for more detailed monitoring, such as response codes, web page details, etc.
  • These logs reached Elasticsearch through the Logstash engine
  • The logs were analyzed and shown as dashboards and reports on individual Kibana UIs
  • There were five alerts set on the dashboard, mainly for events such as CPU spikes and service up/down. These alerts were set through a Python script (Elastic alerts)
  • These alerts were integrated with email and a Slack channel
  • At that point, all six ELK nodes together generated approximately 300 GB of data. The customer had a data retention policy of 3 to 7 days

The above architecture was mainly used for document management, online services and core banking applications.

However, despite having an ELK stack, their model had quite a few limitations, listed below:

  • No High Availability (HA) and Disaster Recovery (DR) for ELK
  • No product support SLA for ELK
  • Performance issues while processing large volumes of ELK data
  • No advanced level business alert mechanisms
  • Transaction reports had to be sent to their customers and top management every few hours, which was a manual and tedious task for the team
  • No consolidated view of monitoring data for all the banks, which was spread across different dashboards
  • ELK needed to be integrated with email and their ticketing platform, ServiceNow, for a single view of issue progress

Besides these, there were other specific challenges:

  • Bulk rejections occurring on writes for all Elasticsearch nodes
  • Bulk rejections occurring on alerts for all Elasticsearch nodes
  • A large number of document mismatches caused by Elasticsearch rejections

The Solution: Ashnik’s approach

In the existing setup, there were six different Elasticsearch indexes, each with its corresponding Logstash configuration and Kibana setup, showing six different sets of dashboards and reports for the corresponding Elasticsearch instances.

In the new approach, we proposed a single Elasticsearch index for all the Beats nodes, with the data stashed through a single Logstash. With all the data coming through a single index, reports and dashboards can then be built in Kibana in an aggregate fashion.
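To illustrate why a single index simplifies aggregate reporting, here is a minimal sketch of an Elasticsearch search body, written as a Python dictionary, that counts recent events per bank in one request. It assumes each event carries a hypothetical `bank_id` keyword field alongside the usual `@timestamp` field; the actual field names and mappings used in this deployment are not described in the case study.

```python
import json


def per_bank_event_counts_query(minutes: int = 15) -> dict:
    """Build a search body that counts recent events per bank.

    `bank_id` is a hypothetical field name chosen for illustration;
    the real mapping is not described in this case study.
    """
    return {
        "size": 0,  # only aggregation results are needed, not the hits
        "query": {
            # restrict to events from the last `minutes` minutes
            "range": {"@timestamp": {"gte": f"now-{minutes}m"}}
        },
        "aggs": {
            "events_per_bank": {
                # one bucket per distinct bank identifier
                "terms": {"field": "bank_id", "size": 50}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(per_bank_event_counts_query(), indent=2))
```

With six separate indexes on six separate servers, the same view would need six requests and a manual merge of the results; with one index, a single terms aggregation returns all banks side by side.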

The new architecture consists of the Platinum version of the ELK stack with X-Pack, on the latest GA version, 6.x. It is a three-node ELK cluster with the required Beats. X-Pack comes with features such as alerts, Elastic Stack monitoring, security, graphs and job scheduling (6.x). It also integrates with email, Slack and external utilities such as ServiceNow. In addition, dashboards and reports can now be exported to PDF/CSV (6.x).

Along with the master/data node, there are two other nodes for data ingestion. These nodes also hold shards and replicas, which provides HA in case of a failure.

Ashnik came up with a scalable architecture using a three-node ELK stack as the main technology, which can be scaled very easily based on the source load. The architecture is described below along with the solution.

Diagram 2: New architecture

Description:

  • Filebeat for monitoring logs for latency and errors
  • Heartbeat for monitoring ports and services
  • Logstash for providing more insights on the data
  • Elasticsearch for storing the data for real time monitoring and also for storing history data
  • Kibana visualization to track the status of all the systems and reporting
  • Watcher/alerts in case of severity

Based on careful observation, the Ashnik team custom-tuned Elasticsearch parameters to suit the given workload. Based on its understanding of how frequently data was ingested from the sources into Elasticsearch, the Ashnik team proposed the enterprise version of the ELK platform, with HA and replication, as a centralized ELK platform. This allowed us to build and deliver a consolidated ELK cluster for all the banks in a secure manner, with advanced alerts and scheduled reporting configured.

Furthermore, to ensure smooth and efficient functioning we took the following steps:

  • Added a coordinating node and routed all the Beats output to this dedicated coordinating node
  • Increased the heap size in jvm.options to 28 GB
  • Increased the thread pool and queue size to allow a larger number of requests
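The link between queue size and the bulk rejections described earlier can be sketched with a toy model. The simulation below is not Elasticsearch's implementation; it is a simplified fixed-size queue drained by a worker pool, and the arrival pattern, worker capacity and queue sizes are made-up numbers chosen only to show the effect.

```python
def simulate_bulk_queue(arrivals, workers, queue_size):
    """Toy model of a fixed-size request queue drained by a worker pool.

    arrivals   -- number of bulk requests arriving in each time step
    workers    -- requests the pool can complete per time step
    queue_size -- maximum number of requests allowed to wait
    Returns (accepted, rejected) totals.
    """
    queued = 0
    accepted = rejected = 0
    for incoming in arrivals:
        # Workers first drain what is already waiting.
        queued = max(0, queued - workers)
        # New requests are accepted only while there is room in the queue.
        room = queue_size - queued
        taken = min(incoming, room)
        queued += taken
        accepted += taken
        rejected += incoming - taken
    return accepted, rejected


if __name__ == "__main__":
    bursty_load = [10, 50, 10, 50, 10]  # hypothetical bursty traffic
    print(simulate_bulk_queue(bursty_load, workers=20, queue_size=20))  # (70, 60)
    print(simulate_bulk_queue(bursty_load, workers=20, queue_size=60))  # (120, 10)
```

In this model a larger queue absorbs short bursts, so far fewer requests are rejected. It does not add processing capacity, however: under sustained overload the queue still fills up, which is why queue tuning is usually combined with measures such as the dedicated coordinating node mentioned above.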

The above configuration helped the customer immensely. The customer can now identify performance- and infrastructure-related insights such as ‘Top Hosts’ by memory, a CPU usage gauge and the number of unique users on its websites. The customer can even identify whether transactions have been successful, all through the various dashboards.

Also, with the integration of watchers, timely alerts are sent, which helps the customer identify and pre-empt potential disasters.

Capabilities Redefined

With the new architecture, the customer has seen a marked improvement in the performance of its workload. Various logs, such as NGINX logs and Oracle logs, which were not captured earlier, are now neatly captured and provide real-time information.

Total number of documents ingested per day: approximately 42 million
Amount of data ingested per day: approximately 50 GB
Number of alerts: approximately 30 alerts created
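As a rough sense of scale, the daily figures above can be converted into average rates. The short calculation below assumes 1 GB = 10^9 bytes and an even spread of traffic over 24 hours, which real banking traffic will not follow; peak rates can be considerably higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60

docs_per_day = 42_000_000    # approximate figure quoted above
bytes_per_day = 50 * 10**9   # 50 GB, assuming 1 GB = 10**9 bytes

# Average number of documents indexed per second over a full day.
avg_docs_per_second = docs_per_day / SECONDS_PER_DAY
# Average size of one ingested document.
avg_bytes_per_doc = bytes_per_day / docs_per_day

print(f"average ingest rate: {avg_docs_per_second:.0f} documents/sec")  # 486
print(f"average document size: {avg_bytes_per_doc:.0f} bytes")          # 1190
```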

Some examples of alerts are:

  • An alert sends a warning or critical message based on the number of IMPS pending requests. For example, if the number of IMPS pending requests in a day is more than four and less than ten, it sends a warning message; if it is more than ten, it sends a critical message.
  • An alert is sent daily based on the number of OTP SMS pending requests. An OTP is sent as an SMS whenever a customer makes a transaction online, so this is a very important aspect of customer service: it determines whether the bank’s customer waits for another OTP or cancels the transaction.
  • An alert based on error rate spikes, sent if the response code falls within a certain range (between 499 and 599)
  • An alert based on latency spikes
  • An alert if an IMPS request times out
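The threshold logic behind the first and third alerts can be sketched in plain Python. This is only an illustration of the decision rules, not the Watcher configuration used in the deployment. Two points are assumptions: the text does not say what happens at exactly ten pending requests (treated as a warning here), and "between 499 and 599" is read as the 5xx range, 500 to 599 inclusive. The 5% spike threshold is also hypothetical.

```python
from typing import Iterable, Optional


def imps_pending_severity(pending: int) -> Optional[str]:
    """Map a daily count of pending IMPS requests to an alert level.

    Thresholds follow the text: more than four -> warning,
    more than ten -> critical. A count of exactly ten is not covered
    by the text; treating it as a warning is an assumption.
    """
    if pending > 10:
        return "critical"
    if pending > 4:
        return "warning"
    return None  # no alert


def error_rate(status_codes: Iterable[int]) -> float:
    """Fraction of responses whose status code is in the 5xx range.

    Reading "between 499 and 599" as 500..599 inclusive is an assumption.
    """
    codes = list(status_codes)
    if not codes:
        return 0.0
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return errors / len(codes)


def error_rate_spike(status_codes: Iterable[int], threshold: float = 0.05) -> bool:
    """True when the error rate exceeds a hypothetical 5% threshold."""
    return error_rate(status_codes) > threshold


if __name__ == "__main__":
    print(imps_pending_severity(7))             # warning
    print(imps_pending_severity(12))            # critical
    print(error_rate_spike([200, 200, 503]))    # True
```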

To help navigate the system, Ashnik simplified the visualization process.

Some examples of visualizations are:

  • Credit Info Dashboard
  • Internet Banking Dashboard
  • Profile UPI Latency & Performance Dashboard

The following chart highlights how the new platform handles a very high volume of data ingestion:

Cumulative statistics (approximately) are in this range:

Cluster: 3 nodes of Elasticsearch
Total data size: ~600 GB
Total retention period: 3 days
Number of primary indices: ~300
Number of documents: ~600 M
Indexing rate: 6,000 documents/sec
Search rate: 500 documents/sec
Total number of visualizations: 1,000+
Total number of alerts: 200+
Additional functionality: use of Timelion visualizations
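A short retention period such as the three days listed above is typically enforced by deleting daily indices once they age out. The sketch below shows one way to select expired indices by name. It assumes daily indices named like `logstash-2019.05.01`, the daily pattern commonly used with Logstash; the index names in this deployment are not given, and the dates are purely illustrative. The function only returns names and does not call Elasticsearch.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, today, retention_days=3, prefix="logstash-"):
    """Return daily indices whose date suffix is older than the retention window.

    Indices dated on or after (today - retention_days) are kept.
    The 'logstash-YYYY.MM.DD' naming is an assumption for illustration.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # ignore indices that do not follow the pattern
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # skip names without a parseable date suffix
        if day < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    names = [
        "logstash-2019.05.01",
        "logstash-2019.05.02",
        "logstash-2019.05.03",
        "logstash-2019.05.04",
        ".kibana",
    ]
    print(expired_indices(names, today=date(2019, 5, 5)))
    # ['logstash-2019.05.01']
```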

Business Benefits

  • Offer its core-banking-as-a-service to more banks at lower cost
  • Offer better SLAs to its customers i.e. the banks
  • Able to integrate more banking channels than before
  • Achieved higher levels of efficiency, flexibility and scalability to support new business initiatives
  • Smooth integration of various services such as ServiceNow, email and Slack channels
  • Processing more than 2x the data volume on commodity hardware
  • Disaster recovery and business continuity for mission-critical applications

To summarize, Ashnik has delivered a highly scalable, real-time monitoring and alerting platform for a highly business-critical workload. This platform enables the customer to scale out its services at a very affordable cost, add new customers rapidly and offer better SLAs for highly business-critical workloads.
