
Batch Processing of Machine Data

Ayandeep Das | Technical Specialist - ETL, Ashnik
Mumbai, 15 May 2019



This is in continuation of my original article ‘Building a scalable architecture for Machine Data Analytics’. In my previous article I elaborated on how data from machines can be converted into a readable format like JSON, then processed in real time and analysed using Kibana. In this article I am going to share insights on the batch processing of data from the machines.

Before that, a quick recap from my previous article:

System Architecture


Generating Machine Data using Raspberry Pi

Raspberry Pi is used to read sensor data and convert it into a readable format. Here we did not use an actual physical sensor; instead, we used Raspberry Pi’s built-in sensor data APIs to generate the data and then converted it into JSON using a Java API.
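As a rough sketch of this conversion step (the class and field names below are illustrative assumptions, not the actual API we used), a simulated reading can be serialized into a JSON string with plain Java:

```java
// Minimal sketch: turn a simulated sensor reading into a JSON string.
// SensorReading and its fields are illustrative names, not the real API.
import java.time.Instant;
import java.util.Locale;

public class SensorReading {
    final String machineId;
    final double temperature;  // degrees Celsius
    final double humidity;     // relative humidity, percent
    final Instant timestamp;

    SensorReading(String machineId, double temperature, double humidity, Instant timestamp) {
        this.machineId = machineId;
        this.temperature = temperature;
        this.humidity = humidity;
        this.timestamp = timestamp;
    }

    // Build the JSON by hand to keep the sketch dependency-free;
    // a real pipeline would use a JSON library such as Jackson or Gson.
    String toJson() {
        return String.format(Locale.ROOT,
            "{\"machineId\":\"%s\",\"temperature\":%.2f,\"humidity\":%.2f,\"timestamp\":\"%s\"}",
            machineId, temperature, humidity, timestamp);
    }

    public static void main(String[] args) {
        SensorReading r = new SensorReading("rpi-01", 41.50, 62.30,
                Instant.parse("2019-05-15T10:00:00Z"));
        System.out.println(r.toJson());
    }
}
```

In practice one such JSON document is produced per reading and handed to the Kafka producers described below.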

Multi Broker Kafka Layer


Kafka here acts as a message broker which consumes the data from the Java API and feeds it to Spark for real-time analysis. It also feeds data to Flume, which stores the raw data in HDFS for batch processing.

Kafka runs continuously, collecting and ingesting data at 5-second intervals.

Source Data Collection:

  • Humidity Data: generated data containing information about the ambient humidity
  • Temperature Data: generated data containing information about the temperature variations of the machine


  • Producer 1: sends humidity data to the Kafka layer
  • Producer 2: sends temperature data to the Kafka layer

Kafka Cluster:

Contains 2 topics, one carrying the humidity data and one carrying the temperature data.

Multi-Broker Cluster:

Here we created 4 brokers, i.e. 4 instances of Kafka running on 4 different ports, which together act as a load balancer.

Consumer Group

  • We created 2 consumer groups, 1 for humidity data and 1 for temperature data.
  • The consumers feed data continuously to the Spark engine for real-time processing.
  • The consumers also feed the raw data to Flume, which ultimately stores it in HDFS.

Coming back to Part 2, the concluding part of my original article ‘Building a scalable architecture for Machine Data Analytics’. Here I am going to elaborate on batch processing of the data, where we store the raw data in HDFS using Flume and then analyse it further using Weka.

Part 2 – Batch Processing of Data


Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.



Here we have configured 2 flume agents as we need to get data from Kafka from 2 different topics.

Source: The source will be the Kafka topic for humidity and temperature data respectively

Channel: The channel will be hdfs-channel-1, a memory channel

Sink: The sink will be the destination, i.e. HDFS

The details of the configuration are as below:

Key                                          Value
flume1.sources                               kafka-source-1
flume1.channels                              hdfs-channel-1
flume1.sinks                                 hdfs
flume1.sources.kafka-source-1.type           org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.topic          topic1
flume1.sources.kafka-source-1.batchSize      10
flume1.sources.kafka-source-1.channels       hdfs-channel-1
flume1.channels.hdfs-channel-1.type          memory
flume1.sinks.hdfs.channel                    hdfs-channel-1
flume1.sinks.hdfs.type                       hdfs
flume1.sinks.hdfs.fileType                   DataStream
flume1.sinks.hdfs.fileSuffix                 .avro
flume1.sinks.hdfs.path                       {Path to HDFS location}
flume1.sinks.hdfs.writeFormat                Text
flume1.sinks.hdfs.serializer                 avro_event
flume1.sinks.hdfs.serializer.compressionCodec snappy


HDFS (Hadoop Distributed File System) is used for storing huge amounts of data in partitions. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

Here we are using HDFS for raw storage of the machine data. Data is stored in partitions based on date. This data can be used further for analysis, monitoring, or predictive analysis.
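The date-based partition layout can be sketched as one directory per day under a base path (the base path and directory pattern here are illustrative assumptions, not the exact layout from our cluster):

```java
// Sketch of the date-based partition layout used for raw storage in HDFS.
// The base path and one-directory-per-day pattern are illustrative assumptions.
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionPath {
    static String partitionFor(String basePath, LocalDate date) {
        // One directory per day, e.g. /data/machine/raw/2019-05-15
        return basePath + "/" + date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("/data/machine/raw", LocalDate.of(2019, 5, 15)));
        // prints /data/machine/raw/2019-05-15
    }
}
```

With the Flume HDFS sink, the same effect is achieved by putting date escape sequences in the sink path, so each day's raw data lands in its own directory and can be scanned selectively by batch jobs.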


I have introduced Weka here in order to bring in the basics of predictive analysis.

What is Predictive Analysis?

Predictive analytics is a form of advanced analytics that uses both new and historical data to forecast activity, behaviour and trends. It involves applying statistical analysis techniques, analytical queries and automated machine learning algorithms to data sets to create predictive models that place a numerical value — or score — on the likelihood of a particular event happening.

Here we have done predictive analysis on the temperature data, monitoring when the machine was undergoing extreme temperature rises and falls, and then predicting future temperature rises using Weka. This helps in predicting machine failure.

What is Weka?

Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.
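Weka itself is driven through its GUI or Java API; as a simplified stand-in for the idea behind the temperature forecast (not Weka's actual time-series forecasting package, and with made-up readings), a least-squares linear trend can be fitted to recent temperatures and extrapolated:

```java
// Simplified illustration of trend-based temperature forecasting:
// fit y = a + b*t by least squares and extrapolate forward in time.
// This is a stand-in for the idea only, not Weka's forecasting package.
public class TrendForecast {
    // Returns {intercept a, slope b} for points (t[i], y[i]).
    static double[] fitLine(double[] t, double[] y) {
        int n = t.length;
        double st = 0, sy = 0, stt = 0, sty = 0;
        for (int i = 0; i < n; i++) {
            st += t[i]; sy += y[i];
            stt += t[i] * t[i]; sty += t[i] * y[i];
        }
        double b = (n * sty - st * sy) / (n * stt - st * st);
        double a = (sy - b * st) / n;
        return new double[] {a, b};
    }

    public static void main(String[] args) {
        // Temperature readings taken at 5-second intervals (illustrative values).
        double[] t = {0, 5, 10, 15, 20};
        double[] y = {40.0, 42.1, 44.0, 46.2, 48.1};
        double[] ab = fitLine(t, y);
        double predicted = ab[0] + ab[1] * 30;  // forecast 10 seconds ahead
        System.out.printf("predicted temperature at t=30s: %.1f%n", predicted);
        // A sustained upward trend crossing a safe threshold would flag a likely failure.
    }
}
```

A real model would use Weka's classifiers or forecasters trained on the historical data stored in HDFS, but the principle is the same: learn the trend from past readings and score how likely the machine is to overheat.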

Below is an example of a predictive analysis using Weka.

In the time series plot shown above (red and blue lines), the blue line indicates the actual temperature of the machine over time. The red line shows the prediction on the temperature, indicating that this is a failing machine, as the temperature has risen very high.



This article concludes the explanation of how machine data can be read and used fruitfully to bring out business insights. Keeping the architecture constant, it can handle any kind of data, whether in binary or unstructured format.

A consolidated key takeaway from both these articles is how this data can be utilized in both real-time and batch processing. Real-time processing, as the name suggests, helps a business user get insights and solutions on a real-time basis. For example, if a car uses a sensor to park in a parking lot, it emits sensor data from which the required parking space is calculated, and an alert is automatically given on whether parking is possible. Batch processing, on the other hand, tilts towards data science, where we analyse the data and try to draw more insights from it through predictive analysis using various algorithms.

To conclude: “Data is everywhere, but it is useful only when you can manoeuvre it to produce insights out of it.”


  • Ayandeep is a Technical Specialist – ETL at Ashnik, Mumbai. He is instrumental in growing Ashnik’s business through his technical engagements and is a Subject Matter Expert in Pentaho & Big Data Solutions. He has over 8 years of experience in designing and developing solutions on technologies like Pentaho, ETL, Big Data, Oracle, PLSQL, Core Java, Spark, Kafka.
