
Batch Processing of Machine Data

Ayandeep Das | Technical Specialist - ETL, Ashnik
Mumbai, 15 May 2019



This is in continuation of my original article ‘Building a scalable architecture for Machine Data Analytics’. In my previous article I elaborated on how data from machines can be converted into a readable format like JSON, then processed in real time and analysed using Kibana. In this article I am going to share insights on the batch processing of data from the machines.

Before that, a quick recap from my previous article:

System Architecture


Generating Machine Data using Raspberry Pi

Raspberry Pi is used to read sensor data and convert it into a readable format. Here we did not use an actual physical sensor; instead, we used Raspberry Pi’s built-in sensor data APIs to generate the data and then converted it into JSON using a Java API.
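As a rough sketch of this conversion step (the class and field names below are illustrative assumptions, not the actual API we used), a simulated reading can be serialized into a JSON string with plain Java:

```java
// Minimal sketch: turn a simulated sensor reading into a JSON string.
// SensorReading and its fields are illustrative names, not the real API.
import java.time.Instant;
import java.util.Locale;

public class SensorReading {
    final String machineId;
    final double temperature;  // degrees Celsius
    final double humidity;     // relative humidity, percent
    final Instant timestamp;

    SensorReading(String machineId, double temperature, double humidity, Instant timestamp) {
        this.machineId = machineId;
        this.temperature = temperature;
        this.humidity = humidity;
        this.timestamp = timestamp;
    }

    // Build the JSON by hand to keep the sketch dependency-free;
    // a real pipeline would use a JSON library such as Jackson or Gson.
    String toJson() {
        return String.format(Locale.ROOT,
            "{\"machineId\":\"%s\",\"temperature\":%.2f,\"humidity\":%.2f,\"timestamp\":\"%s\"}",
            machineId, temperature, humidity, timestamp);
    }

    public static void main(String[] args) {
        SensorReading r = new SensorReading("rpi-01", 41.50, 62.30,
                Instant.parse("2019-05-15T10:00:00Z"));
        System.out.println(r.toJson());
    }
}
```

In practice one such JSON document is produced per reading and handed to the Kafka producers described below.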

Multi Broker Kafka Layer


Kafka here acts as a message broker which consumes the data from the Java API and feeds it to Spark for real-time analysis. It also feeds data to Flume, which stores the raw data in HDFS for batch processing.

Kafka runs continuously, collecting and ingesting data at 5-second intervals.

Source Data Collection:

  • Humidity Data: generated data containing information about the ambient humidity
  • Temperature Data: generated data containing information about the temperature variations of the machine


  • Producer 1: sends humidity data to the Kafka layer
  • Producer 2: sends temperature data to the Kafka layer

Kafka Cluster:

Contains 2 topics, one carrying the humidity data and one carrying the temperature data.

Multi-Broker Cluster:

Here we created 4 brokers, i.e. 4 instances of Kafka running on 4 different ports, which together act as a load balancer.

Consumer Group

  • We created 2 consumer groups, 1 for humidity data and 1 for temperature data.
  • The consumers feed data continuously to the Spark engine for real-time processing.
  • The consumers also feed the raw data to Flume, which ultimately stores it in HDFS.

Coming back to Part 2, the concluding part of my original article ‘Building a scalable architecture for Machine Data Analytics’. Here I am going to elaborate on batch processing of the data, where we store the raw data in HDFS using Flume and then analyse it further using Weka.

Part 2 – Batch Processing of Data


Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.



Here we have configured 2 flume agents as we need to get data from Kafka from 2 different topics.

Source: The source will be the Kafka topic for humidity and temperature data respectively

Channel: The channel will be hdfs-channel-1, a memory channel

Sink: The sink will be the destination, i.e. HDFS

The details of the configuration are as below:

Key                                          Value
flume1.sources                               kafka-source-1
flume1.channels                              hdfs-channel-1
flume1.sinks                                 hdfs
flume1.sources.kafka-source-1.type           org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.topic          topic1
flume1.sources.kafka-source-1.batchSize      10
flume1.sources.kafka-source-1.channels       hdfs-channel-1
flume1.channels.hdfs-channel-1.type          memory
flume1.sinks.hdfs.channel                    hdfs-channel-1
flume1.sinks.hdfs.type                       hdfs
flume1.sinks.hdfs.fileType                   DataStream
flume1.sinks.hdfs.fileSuffix                 .avro
flume1.sinks.hdfs.path                       {Path to HDFS location}
flume1.sinks.hdfs.writeFormat                Text
flume1.sinks.hdfs.serializer                 avro_event
flume1.sinks.hdfs.serializer.compressionCodec snappy


HDFS (Hadoop Distributed File System) is used for storing huge amounts of data in partitions. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

Here we are using HDFS for raw storage of the machine data. Data is stored in partitions based on date. This data can be used further for analysis, monitoring, or predictive analysis.
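The date-based partition layout can be sketched as one directory per day under a base path (the base path and directory pattern here are illustrative assumptions, not the exact layout from our cluster):

```java
// Sketch of the date-based partition layout used for raw storage in HDFS.
// The base path and one-directory-per-day pattern are illustrative assumptions.
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionPath {
    static String partitionFor(String basePath, LocalDate date) {
        // One directory per day, e.g. /data/machine/raw/2019-05-15
        return basePath + "/" + date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("/data/machine/raw", LocalDate.of(2019, 5, 15)));
        // prints /data/machine/raw/2019-05-15
    }
}
```

With the Flume HDFS sink, the same effect is achieved by putting date escape sequences in the sink path, so each day's raw data lands in its own directory and can be scanned selectively by batch jobs.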


I have introduced Weka here in order to bring in the basics of predictive analysis.

What is Predictive Analysis?

Predictive analytics is a form of advanced analytics that uses both new and historical data to forecast activity, behaviour and trends. It involves applying statistical analysis techniques, analytical queries and automated machine learning algorithms to data sets to create predictive models that place a numerical value — or score — on the likelihood of a particular event happening.

Here we have done predictive analysis on the temperature data, monitoring when the machine was undergoing extreme temperature rises and falls, and then predicting future temperature rises using Weka. This helps in predicting machine failure.

What is Weka?

Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.
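Weka itself is driven through its GUI or Java API; as a simplified stand-in for the idea behind the temperature forecast (not Weka's actual time-series forecasting package, and with made-up readings), a least-squares linear trend can be fitted to recent temperatures and extrapolated:

```java
// Simplified illustration of trend-based temperature forecasting:
// fit y = a + b*t by least squares and extrapolate forward in time.
// This is a stand-in for the idea only, not Weka's forecasting package.
public class TrendForecast {
    // Returns {intercept a, slope b} for points (t[i], y[i]).
    static double[] fitLine(double[] t, double[] y) {
        int n = t.length;
        double st = 0, sy = 0, stt = 0, sty = 0;
        for (int i = 0; i < n; i++) {
            st += t[i]; sy += y[i];
            stt += t[i] * t[i]; sty += t[i] * y[i];
        }
        double b = (n * sty - st * sy) / (n * stt - st * st);
        double a = (sy - b * st) / n;
        return new double[] {a, b};
    }

    public static void main(String[] args) {
        // Temperature readings taken at 5-second intervals (illustrative values).
        double[] t = {0, 5, 10, 15, 20};
        double[] y = {40.0, 42.1, 44.0, 46.2, 48.1};
        double[] ab = fitLine(t, y);
        double predicted = ab[0] + ab[1] * 30;  // forecast 10 seconds ahead
        System.out.printf("predicted temperature at t=30s: %.1f%n", predicted);
        // A sustained upward trend crossing a safe threshold would flag a likely failure.
    }
}
```

A real model would use Weka's classifiers or forecasters trained on the historical data stored in HDFS, but the principle is the same: learn the trend from past readings and score how likely the machine is to overheat.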

Below is an example of a predictive analysis using Weka.

In the time series plot shown above (red and blue lines), the blue line indicates the actual temperature of the machine over time. The red line shows the prediction on the temperature, indicating that this is a failing machine, as the temperature has risen very high.



This article concludes the explanation of how machine data can be read and used fruitfully to bring out business insights. Keeping the architecture constant, it can handle any kind of data, whether in binary or unstructured format.

A consolidated key takeaway from both these articles is how this data can be utilized in both real-time and batch processing. Real-time processing, as the name suggests, helps a business user get insights and solutions on a real-time basis. For example, if a car uses a sensor to park in a parking lot, it emits sensor data from which the required parking space is calculated, and an alert is automatically given on whether parking is possible. Batch processing, on the other hand, tilts towards data science, where we analyse the data and try to draw more insights from it through predictive analysis using various algorithms.

To conclude: “Data is everywhere, but it is useful only when you can manoeuvre it to produce insights out of it.”


  • Ayandeep is a Technical Specialist – ETL at Ashnik, Mumbai. He is instrumental in growing Ashnik’s business through his technical engagements and is a Subject Matter Expert in Pentaho & Big Data Solutions. He has over 8 years of experience in designing and developing solutions on technologies like Pentaho, ETL, Big Data, Oracle, PLSQL, Core Java, Spark, Kafka.
