No Comments

Building a scalable architecture for Machine Data Analytics

Ayandeep Das | Technical Specialist - ETL, Ashnik
Mumbai, 13 Feb 2019

by , , No Comments


In this article, I am sharing an architecture on how to get insights from machine data and further analyse it. This architecture can act as an analytical platform for a scalable solution to analyse machine data.

This article is divided into 2 parts. Part 1 describes the real-time processing of data using Spark and Elasticsearch. Part 2 describes the batch processing of data using Flume and HDFS.

Problem Statement

In the world of machine data, raw data does not provide any insights, hence it is important to set up a platform that would provide the ability to capture, manipulate and analyse the data to get meaningful insights. When it comes to data from sensors there are many parameters involved such as temperature, pressure, humidity etc. This article proposes an architecture to read any kind of machine data which is in a binary form and convert it into readable data. This data can be stored in variety of formats like JSON/XML/CSV or any other format based on requirement.

I’ve covered the basic architecture for data collection, data processing and data ingestion using the Big Data Platform along with Elasticsearch that can store and analyse real-time data with very little delay. This platform can be used to read any type of data from sensors and then converting that data into a readable format and then analysing it. Business users can use this solution as a data engineering platform for processing machine data both in real time and batch mode.

In the below example, we are using Raspberry Pi to generate sample machine data, converting it into JSON format and then utilising the data for real-time and batch processing.

Probable Use Cases

  • Monitoring Telecom devices: Real-time data from various telecom devices can be monitored using this architecture. The most important is search monitoring with various keywords which is done very prominently using Elastic search and Kibana. Real-time data from mobiles and phones can be captured and processed using Spark parallel processing and Elastic search and Kibana for monitoring. Predictive Analysis is another important criteria for telecom industries where they analyse the amount of data usage by customers. This can be done by the batch processing part of this architecture.
  • Traffic Monitoring: Monitoring and controlling traffic in real-time is one of the key challenges faced by customers. This architecture helps user not only to monitor traffic in real-time but also helps to control. Here Spark acts as a major factor for processing of data in real-time. It helps to filter by locations giving insights about the traffic and also send user alerts using an app on the traffic. Batch processing can be used for analysing the slow and fast traffic and giving users recommendation on the traffic every day and giving them insights on the traffic in the specific area.
  • Automated Car Parking
  • Log Monitoring System in Banks, Retails industry

System Architecture



Generating Machine Data using Raspberry Pi

Raspberry Pi is used to read sensor data and convert it into readable data. Here we are not using any actual sensor data, but we are using Raspberry pi’s in-built sensor data APIs to generate the data and then converting into JSON data using Java API.

Hosting the Raspberry Pie to generate data

Here we are writing a python code in the raspberry pi using Flask Restful API

from flask import Flask

from flask_restful import Resource, Api

import grovepi

app = Flask(__name__)

api = Api(app)

sensor = 7

class TempHum(Resource):

def get(self):

[temp,hum] = grovepi.dht(sensor,0)

return {‘temperature’ : temp,

‘humidity’ : hum }

api.add_resource(TempHum, ‘/’)

if __name__ = “__main__”:’′, port=80, debug=True)


Then run the python file, the raspberry pi will now act as a server. Type:

hostname -I

to get the IP address for the raspberry pi. Then whenever someone sends a GET request to this IP address, it will return a temperature and humidity reading in JSON format.

{ “humidity” : 44, “temperature” : 25 }

JAVA API to convert the binary data to JSON Data

Here we have used java code for getting the readings in Eclipse every 5 seconds. A JSON parser is used to parse the result return from the raspberry pi.

public class Monitor {

public static HttpURLConnection con;

public static URL url;

public static String url_string = “”;


public static void main(String[] args) {


List<Double> temperature = new ArrayList<>();

List<Double> humidity = new ArrayList<>();


for (int i = 0; i < 10; i++) {

try {

url = new URL(url_string);

con = (HttpURLConnection) url.openConnection();



InputStream is = con.getInputStream();

BufferedReader br = new BufferedReader(

new InputStreamReader(is));

StringBuilder sb = new StringBuilder();

String line;

while ((line = br.readLine()) != null) {



JSONObject obj = new JSONObject(sb.toString());

JSONParser parser = new JSONParser();

double[] readings = parser.parseReading(obj);





} catch (MalformedURLException e) {



Multi Broker Kafka Layer


Kafka acts as a message broker which consumes the data from the Java API and feeds them to Spark for real-time analysis. It also feeds data to Flume for storing raw data in HDFS for Batch Processing.

Kafka runs continuously to collect and ingest data at an interval of 5 seconds.

Source Data Collection:

  • Humidity Data: This is the data generated and contains information regarding the humidity of the weather
  • Temperature Data: This is the data generated and contains information regarding the temperature variations of the machine


  • Producer 1: Will send humidity data to the Kafka layer
  • Producer 2: Will send temperature data to Kafka layer

Kafka Cluster:

Contains 2 topics to carry the information of humidity data and temperature data respectively.

Multi-Broker Cluster:

Here we created 4 brokers that mean 4 instances of Kafka running in 4 ports which acts as a load balancer.

Consumer Group

  • We created 2 consumer groups, 1 for humidity and 1 for temperature data.
  • The consumer then feeds data continuously to Spark engine for Real-time processing of data
  • The consumer also feeds raw data to the flume which ultimately stores data in HDFS

Real-Time Processing and Monitoring of Data

Spark -> Real Time Processing of Data using Spark on Java


JSON data from Kafka is stored in Spark in the form of Resilient Distributed Dataset (RDD).

The data in RDD is split into chunks based on a key. RDDs are highly resilient and are easily recoverable from any kind of issues such as server down as each chunk are replicated across multiple executor nodes, so that even if one executor node fails the other is readily available to provide data. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

Now we get data in the form of JSON. This JSON data is again joined with Postgressql data to get the master data. We are using a PostgreSQL connector in Java to set up a JDBC connection and then extracting the master data from Postgres. This data is now joined in the Java API with the JSON data to get the valid filtered data. The output will be filtered and restructured JSON. Please note that Spark engine stores data in memory and there is no actual storing of data. However, you can store data using the File Write API of Java if required. Since this is real-time streaming of data, we are not storing any data.

Elasticsearch-> Data stored in the form of indexes

Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene. We can send data in the form of JSON to elastic and elastic will store it in the form of indexes.

Here we have used the default structure of indexing in elastic with the default number of replicas and shards.

We have used Elasticsearch-Hadoop Spark support in Java through EsSpark package.

We have used a 3-node elastic cluster. The architecture for integrating Spark with Elasticsearch and Kibana is shown below:


Kibana-> Real-Time Monitoring of data with Kibana and ES

Kibana is a visualizing tool which helps us to analyse elastic data. Here we have used Kibana because it blends very well with Elastic to analyse the real-time data. We have created dashboards in Kibana to analyse the temperature and humidity data.

Finally, to conclude, the above architecture gives an overview of the functionality of each component. I will write in detail about batch processing of machine data in the next part of my article. Watch this space!


  • Ayandeep is a subject matter expert in Pentaho & Big Data Solutions. He has over 8 years of experience in designing and developing solutions on technologies like Pentaho, ETL, Big Data, Oracle, PLSQL, Core Java, Spark, Kafka. He is also an Oracle Certified Java Developer. Being a technology enthusiast, he loves writing blogs on new and upcoming technologies.

More From Ayandeep Das | Technical Specialist - ETL, Ashnik :