Turning Your Data Lake into a Streamlined Data Refinery
Wael Elrifai | Director of Enterprise Solutions & Big Data Guru, Pentaho
It’s been over five years since Pentaho’s CTO, James Dixon, coined the now-ubiquitous term “data lake” in his blog. His metaphor contrasted bottled water, which is cleansed and packaged for easy consumption, with the natural state of the water source – unstructured, uncleansed, and unadulterated. The data lake represents the entire universe of available data before any transformation has been applied to it.
Data in the lake isn’t forced into existing structures or given undue context, either of which could compromise its utility to your business. You can store data at low cost and you can process it at scale. I won’t test your patience by further extending the metaphor; suffice it to say that James Dixon did not intend for this idea to be an end, but only a means to an end. The data lake and all its associated hardware, software, and skills are key elements of any agile business.
The world has changed, but it’s still recognizable
Everything still begins with the synthesis and analysis of data.
So let’s start by asking, “What kind of business challenges are data challenges?” The short and unsurprising answer is that they’re all data challenges. There may also be matters of operational change, business process, compliance, and so on, but it all still begins with knowing where you are today (Enterprise Data Warehouse, Data Mart), knowing the state of the world around you (externally available data such as social media), and sourcing the ever-richer set of data within your organisation (IoT), so that you can make predictions, improve your operations, and design new products for your consumers.
Towards the Streamlined Data Refinery
One technique that we’ve been pioneering at Pentaho is the Streamlined Data Refinery, a way of providing real-time, scalable exploration capabilities on any dataset without requiring end users to write complex code or requiring you to design your data integration processes in advance.
It’s a way of turning your data lake into a structured set of information ready for examination by users not fluent in Perl or Java. Think of it as automated cleansing and bottling of the entire lake without the development overhead of your Enterprise Data Warehouse.
The Streamlined Data Refinery also has some distinct advantages over other approaches to data lake analytics:
- A highly interactive and high-performance user experience for exploration
- An intuitive, guided interface that can be extended to large production user bases
- An architected and governed process for on-demand data integration behind the scenes
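To make the idea of on-demand refinement concrete, here is a minimal sketch in plain Python. It is not Pentaho’s implementation and uses no Pentaho API; the records in `RAW_LAKE` and the `cleanse` and `refine` helpers are hypothetical names invented for illustration. It only shows the general pattern: raw records stay untouched in the lake, and a cleansed, structured subset is built when a user asks for specific fields.

```python
from datetime import datetime

# Hypothetical raw "lake" records: inconsistent keys, types, and formats.
RAW_LAKE = [
    {"Customer": " Alice ", "amount": "120.50", "ts": "2015-03-01"},
    {"customer": "BOB", "Amount": 80, "ts": "2015-03-02"},
    {"customer": "alice", "amount": "n/a", "ts": "2015-03-04"},
    {"customer": "Carol", "amount": "42", "ts": "not a date"},
]


def cleanse(record):
    """Normalise one raw record; return None if it cannot be repaired."""
    row = {key.lower(): value for key, value in record.items()}
    try:
        return {
            "customer": str(row["customer"]).strip().title(),
            "amount": float(row["amount"]),
            "date": datetime.strptime(row["ts"], "%Y-%m-%d").date(),
        }
    except (KeyError, ValueError):
        return None


def refine(raw_records, fields, predicate=lambda row: True):
    """Build a structured, query-ready subset only when a user asks for it."""
    cleansed = (cleanse(record) for record in raw_records)
    return [
        {field: row[field] for field in fields}
        for row in cleansed
        if row is not None and predicate(row)
    ]


# A user "request": customer and amount for purchases above 50.
result = refine(RAW_LAKE, ["customer", "amount"], lambda row: row["amount"] > 50)
print(result)
```

In this toy version the user’s request is just a list of fields and a filter; in a governed process, the cleansing rules would be defined once by the data team rather than by each end user.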
Is this only for Big Data?
Of course not. This is for any data you want to explore without the overhead of designing an Enterprise Data Warehouse.
Maybe you still need to build big data capabilities. Maybe your business just isn’t mature enough yet to manage complex data science. That does not mean that you can afford to ignore the realities taking shape around you. You might consider a small, low-risk project like enabling a new Hadoop cluster in the cloud and using it as a processing platform for data exploration.
With the Pentaho Streamlined Data Refinery, your time to value is shortened, costs are dramatically diminished, the skills gained are invaluable, and the capabilities to gain new insights are potentially transformative to your business.
If you’ll excuse the hyperbole, if you’re not looking at all of your data, you’re not looking at all of your business.