Turning Your Data Lake Into A Streamlined Data Refinery

Written by Wael Elrifai

Database Platform | Feb 15, 2016

2 MIN READ

It’s been over five years since Pentaho’s CTO, James Dixon, coined the now-ubiquitous term “data lake” in his blog. His metaphor contrasted bottled water, which is cleansed and packaged for easy consumption, with the natural state of the water source: unstructured, uncleansed, and unadulterated. The data lake represents the entire universe of available data before any transformation has been applied to it.
In a data lake, data isn’t forced into existing structures or given undue context, either of which could compromise its utility to your business. You can store data at low cost and process it at scale. I won’t test your patience by further extending the metaphor; suffice it to say that James Dixon did not intend this idea to be an end in itself, but only a means to an end. The data lake and all its associated hardware, software, and skills are key elements of any agile business.
The world has changed, but it’s still recognizable
Everything still begins with the synthesis and analysis of data.
So let’s start by asking, “What kind of business challenges are data challenges?” The short and unsurprising answer is that they’re all data challenges. There may also be matters of operational change, business process, compliance, and so on, but it all still begins with three things: knowing where you are today (Enterprise Data Warehouse, Data Mart), knowing the state of the world around you (externally available data such as social media), and sourcing the ever-richer set of data within your organisation (IoT). With those in hand you can make predictions, improve your operations, and design new products for your consumers.
Towards the Streamlined Data Refinery
One technique that we’ve been pioneering at Pentaho is the Streamlined Data Refinery, a way of providing real-time, scalable exploration capabilities on any dataset without requiring end users to write complex code or requiring you to design your data integration processes in advance.
It’s a way of turning your data lake into a structured set of information ready for examination by users not fluent in Perl or Java. Think of it as automated cleansing and bottling of the entire lake without the development overhead of your Enterprise Data Warehouse.
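To make the "cleansing and bottling" idea a little more concrete, here is a minimal sketch in plain Python. It is not Pentaho's implementation or API; the sample records, the `refine` function, and the field names are all hypothetical, chosen only to illustrate raw, inconsistent records being cleansed and projected onto the fields a user asks for, on demand rather than through a schema designed up front.

```python
# Illustrative only: a toy "refinery" step, not Pentaho code.
# RAW_EVENTS, refine, and all field names are hypothetical.

# Raw "lake" records: schemaless, inconsistent, uncleansed.
RAW_EVENTS = [
    {"customer": " Alice ", "amount": "19.90", "channel": "web"},
    {"customer": "bob", "amount": "5", "channel": "IoT", "sensor_id": 7},
    {"customer": None, "amount": "n/a", "channel": "web"},  # unusable row
]


def refine(records, fields):
    """Cleanse raw records and keep only the requested fields.

    `fields` must be a subset of: customer, amount, channel.
    Rows with a missing customer or a non-numeric amount are dropped.
    """
    rows = []
    for rec in records:
        # Normalise the customer name; treat None as empty.
        name = (rec.get("customer") or "").strip().title()
        # Parse the amount; skip rows where it is missing or not a number.
        try:
            amount = float(rec.get("amount"))
        except (TypeError, ValueError):
            continue
        if not name:
            continue
        clean = {
            "customer": name,
            "amount": amount,
            "channel": str(rec.get("channel", "")).lower(),
        }
        # Project onto the fields the user asked to explore.
        rows.append({f: clean[f] for f in fields})
    return rows


if __name__ == "__main__":
    for row in refine(RAW_EVENTS, ["customer", "amount"]):
        print(row)
```

The point of the sketch is the shape of the process: the user states which fields they want to explore, and the cleansing and structuring happen behind the scenes at request time.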
The Streamlined Data Refinery also has some distinct advantages over other approaches to data lake analytics:

A highly interactive and high-performance user experience for exploration
An intuitive, guided interface that can be extended to large production user bases
An architected and governed process for on-demand data integration behind the scenes

Is this only for Big Data?
Of course not. This is for any data designed to be explored without the overhead of designing an Enterprise Data Warehouse.
Maybe you need to build big data capabilities. Maybe your business just isn’t yet mature enough to manage complex data science. That does not mean that you can afford to ignore the realities taking shape around you. You might consider a small, low-risk project, like enabling a new Hadoop cluster in the cloud and using it as a processing platform for data exploration.
With the Pentaho Streamlined Data Refinery, your time to value is shortened, costs are dramatically diminished, the skills gained are invaluable, and the capabilities to gain new insights are potentially transformative to your business.
If you’ll excuse the hyperbole: if you’re not looking at all of your data, you’re not looking at all of your business.

Wael Elrifai | Director of Enterprise Solutions & Big Data Guru, Pentaho
