
Robust Big Data Log Analytics platform in open source

Written by Sandeep Khuperkar | May 12, 2017


Log analysis has become critical for every business seeking to improve the operational performance of its IT and business processes. Organizations today generate massive amounts of data across their operations, driving radical and unprecedented growth in log files. With data volumes reaching terabytes and beyond, performing effective log analytics with traditional software has become a major challenge. As business activities and transactions continue to grow exponentially, it is increasingly difficult to store, process, and analyze log data efficiently and cost-effectively. This is driving many organizations to seriously consider building a Big Data log analytics platform with built-in log search capabilities.

In that context, Hadoop's HDFS is well known for storing and processing huge amounts of data, and it exposes an HTTP REST interface, WebHDFS. Elasticsearch is great for document indexing and powerful full-text search. Many organizations are adopting this combination for search analytics, placing Hadoop as a data lake in front of Elasticsearch and employing the ES-Hadoop connector to load data reliably into the Elastic cluster.
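
To make the WebHDFS interface concrete, here is a minimal Python sketch that writes a log file into HDFS over HTTP. The namenode host, port, path, and user are illustrative assumptions, not values from this article; file creation in WebHDFS is a two-step operation in which the namenode redirects the client to a datanode.

    # Sketch: create a file in HDFS via the WebHDFS REST API.
    # Host, port, path, and user are placeholders.
    import requests

    NAMENODE = "http://namenode.example.com:50070"
    PATH = "/data/logs/app.log"

    # Step 1: ask the namenode to create the file. It replies with a
    # 307 redirect whose Location header points at a datanode.
    resp = requests.put(
        NAMENODE + "/webhdfs/v1" + PATH + "?op=CREATE&user.name=hdfs&overwrite=true",
        allow_redirects=False,
    )
    datanode_url = resp.headers["Location"]

    # Step 2: send the actual file content to that datanode.
    with open("app.log", "rb") as f:
        requests.put(datanode_url, data=f)
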
[Figure: high-level architecture of the log analytics platform, from HTTP sources through Logstash and HDFS to Elasticsearch and Kibana]

The figure above shows the high-level architecture: Logstash receives data over HTTP and streams it into HDFS using the WebHDFS REST API. The ES-Hadoop connector then loads the data into Elasticsearch, from where it reaches Kibana, the data visualization and reporting engine. This proposed architecture supports both kinds of analytics use case: capturing data for routine, repeatable tasks and handling streaming data. Persisting streaming data is especially critical because it cannot be reproduced; if it is lost in flight, it is gone forever. Storing the data in Hadoop means it can be replicated into Elasticsearch for analytics whenever required, and Hadoop also functions effectively as the system of record for the logs.
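
The first hop of that pipeline can be expressed as a short Logstash configuration, using the http input and the webhdfs output plugin (logstash-output-webhdfs). This is a minimal sketch; the port, host name, user, and path pattern are assumptions for illustration.

    # Sketch of a Logstash pipeline: accept events over HTTP and
    # write them into HDFS through the WebHDFS REST API.
    input {
      http {
        port => 8080                     # illustrative listener port
      }
    }
    output {
      webhdfs {
        host => "namenode.example.com"   # HDFS namenode (placeholder)
        port => 50070                    # default WebHDFS port
        user => "hdfs"                   # HDFS user to write as
        path => "/data/logs/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
        codec => json_lines              # one JSON event per line
      }
    }
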
This architecture helps you:

  • Gather and store raw log files in Hadoop from various systems across the organization (which may amount to hundreds of GB per day)
  • Load them into the log analytics stack for query, search indexing, and visualization (see the loading sketch after this list)
  • Augment this log data with other transactional data to perform large-scale analysis
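
The loading step can be sketched with the ES-Hadoop connector from PySpark, assuming the elasticsearch-hadoop JAR is on the Spark classpath; the HDFS path, Elasticsearch node, and index name below are illustrative placeholders.

    # Sketch: bulk-load log data from HDFS into Elasticsearch with the
    # ES-Hadoop connector. Paths, hosts, and index names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-to-es").getOrCreate()

    # Read the JSON log lines that Logstash wrote into HDFS.
    logs = spark.read.json("hdfs:///data/logs/dt=2017-05-12/*.log")

    # Write them to an Elasticsearch index via ES-Hadoop.
    (logs.write
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "es-node.example.com:9200")
         .save("logs-2017-05-12/events"))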

While a wide range of log analysis tools is available, Elastic's ELK stack (Elasticsearch, Logstash, and Kibana) gets serious consideration from many organizations because its three components integrate seamlessly to provide a highly effective and efficient log analytics platform.

  • Elasticsearch: indexing of log data and access through fast full-text search (see the query sketch after this list)
  • Logstash: collection, parsing, and shipping of logs
  • Kibana: reporting and visualization through a browser interface
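
Once logs are indexed, searching them is a one-call affair. The following sketch uses the official Elasticsearch Python client; the host, index, and field names are assumptions for illustration.

    # Sketch: search the indexed logs for error messages using the
    # official Elasticsearch Python client. Host, index, and field
    # names are placeholders.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://es-node.example.com:9200"])

    results = es.search(
        index="logs-2017-05-12",
        body={
            "query": {"match": {"message": "error"}},
            "size": 10,
        },
    )

    for hit in results["hits"]["hits"]:
        print(hit["_source"]["message"])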

Together, Elasticsearch, Logstash, Kibana, and Hadoop can help you build a robust open source log analytics platform for real-time data analysis and visualization.
Sandeep Khuperkar | Director and CTO, Ashnik


Sandeep is the Director and CTO at Ashnik. He brings more than 21 years of industry experience, most of it at Red Hat and IBM India, with 14+ years spent in open source and in building open source and Linux business models. He serves on the advisory board of JJM College of Engineering (Electronics Dept.) and as a visiting lecturer at several engineering colleges, working to enable them on open source technologies. He is an author, enthusiast, and community moderator at Opensource.com, and a member of the Open Source Initiative, the Linux Foundation, and the Open Source Consortium of India.


