Filling The Data Lake With Hadoop And Pentaho


Data Pipeline and Analytics | Sep 14, 2016



A blueprint for big data success – What is the “Filling the Data Lake” blueprint?

The blueprint for filling the data lake refers to a modern data onboarding process for ingesting big data into Hadoop data lakes that is flexible, scalable, and repeatable. It streamlines data ingestion from a wide variety of data sources and business users, reduces dependence on hard-coded data movement procedures, and simplifies regular data movement at scale into the data lake.
The “Filling the Data Lake” blueprint provides developers with a roadmap to easily scale data ingestion processes and automate every step of the data pipeline, while simultaneously improving operational efficiency and lowering costs.
“Developers and data analysts need the ability to create one process that can support many different data sources by detecting metadata on the fly and using it to dynamically generate instructions that drive transformation logic in an automated fashion,” says Chuck Yarbrough, Senior Director of Solutions Marketing at Pentaho.
Within the Pentaho platform, this process is referred to as metadata injection. It helps organizations accelerate productivity and reduce risk in complex data onboarding projects by dynamically scaling out from one template to hundreds of actual transformations.
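To make the "one template to hundreds of transformations" idea concrete, here is a minimal sketch in plain Python. It is not Pentaho's API: the `TEMPLATE` structure, the `inject_metadata` function, the step names, and the `/landing` and `/datalake/raw` paths are all hypothetical, chosen only to illustrate the pattern of filling a single generic template with per-source metadata.

```python
from copy import deepcopy

# Hypothetical template: one generic "read file, write to Hadoop" transformation
# whose source-specific details stay empty until metadata is injected.
TEMPLATE = {
    "name": None,
    "steps": [
        {"type": "file_input", "path": None, "fields": []},
        {"type": "hadoop_output", "target": None},
    ],
}


def inject_metadata(template, source):
    """Return a concrete transformation built from the template and one source."""
    transformation = deepcopy(template)  # never mutate the shared template
    transformation["name"] = f"ingest_{source['name']}"
    reader, writer = transformation["steps"]
    reader["path"] = source["path"]
    reader["fields"] = list(source["fields"])
    writer["target"] = f"/datalake/raw/{source['name']}"  # assumed target layout
    return transformation


# Illustrative source descriptions; in practice these would be detected, not typed in.
sources = [
    {"name": "trades", "path": "/landing/trades.csv", "fields": ["trade_id", "amount"]},
    {"name": "customers", "path": "/landing/customers.csv", "fields": ["customer_id", "name"]},
]

transformations = [inject_metadata(TEMPLATE, s) for s in sources]
for t in transformations:
    print(t["name"], "->", t["steps"][1]["target"])
```

Adding a new source in this sketch means adding one entry to `sources`, not designing another transformation by hand.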

Why use this blueprint for big data success?

Today’s data onboarding projects involve managing an ever-changing array of data sources, establishing repeatable processes at scale, and maintaining control and governance. Whether an organization is implementing an ongoing process for ingesting hundreds of data sources into Hadoop or enabling business users to upload diverse data without IT assistance, onboarding projects tend to create major obstacles, such as repetitive manual design, time-consuming development, manual error risks, and the monopolization of IT resources.
Simplify the data ingestion process of disparate file sources into Hadoop
It’s easy enough to hard-code ingestion jobs to feed one or two data sources into Hadoop, but once you have a successful proof of concept, every business unit will want to get their data in – creating headaches if you’re manually hard-coding different transformations for each source. Pentaho’s unique metadata injection capability allows one transformation to become many, boosting productivity and reducing development time. The “instructions” derived from field names, types, lengths, and other metadata can dynamically generate the actual transformations, drastically reducing time spent designing transformations.
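As a rough illustration of deriving those "instructions" from a file, the sketch below reads the header and a sample of rows from delimited text and reports each field's name, inferred type, and maximum length. It uses only the Python standard library and assumes a simple comma-separated layout; the `infer_type` and `detect_metadata` helpers and the type names are hypothetical, and real detection logic would need to handle dates, encodings, and malformed rows.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty sample value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"  # nothing to infer from, so fall back to the safest type
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                cast(v)
            return type_name
        except ValueError:
            continue
    return "string"


def detect_metadata(text, sample_rows=100):
    """Read the header plus a sample of rows and describe each field."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    metadata = []
    for name in reader.fieldnames or []:
        values = [(row[name] or "") for row in rows]  # short rows yield None
        metadata.append({
            "name": name,
            "type": infer_type(values),
            "length": max((len(v) for v in values), default=0),
        })
    return metadata


# Made-up sample data, used only to exercise the function.
sample = "trade_id,amount,desk\n1001,250.75,FX\n1002,99,Rates\n"
for field in detect_metadata(sample):
    print(field)
```

The resulting list of field descriptions is the kind of input a template-driven transformation could consume in place of hand-written field definitions.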
Reduce complexity and costs, while ensuring accuracy of data ingestion
Pentaho has accumulated crucial knowledge and best practices by working with several customers to facilitate enterprise-grade Hadoop data onboarding projects. As such, the Filling the Data Lake blueprint is fairly prescriptive in terms of the data types involved and the expected business benefits. This blueprint:

Streamlines data ingestion from thousands of disparate files or database tables into Hadoop
Simplifies regular data movement at scale into Hadoop in the Avro format
Reduces dependence on hard-coded data ingestion procedures
Minimizes risk of manual errors by decreasing dependence on hard-coded data ingestion procedures
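
Since the blueprint lands data in Avro, one step that lends itself to automation is generating the Avro schema from detected field metadata instead of writing it by hand. The sketch below builds an Avro record schema as a plain dictionary and serializes it to JSON. The `to_avro_schema` function, the `AVRO_TYPES` mapping, and the sample fields are assumptions for illustration; the source does not describe how Pentaho produces its schemas.

```python
import json

# Assumed mapping from generic field types to Avro primitive types.
AVRO_TYPES = {"integer": "long", "number": "double", "string": "string"}


def to_avro_schema(record_name, fields):
    """Build an Avro record schema (as a dict) from a list of field descriptions."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [
            # A union with "null" listed first makes the field optional,
            # with null as its default value.
            {
                "name": f["name"],
                "type": ["null", AVRO_TYPES[f["type"]]],
                "default": None,
            }
            for f in fields
        ],
    }


# Illustrative field metadata, in the shape a detection step might produce.
fields = [
    {"name": "trade_id", "type": "integer"},
    {"name": "amount", "type": "number"},
    {"name": "desk", "type": "string"},
]

schema = to_avro_schema("Trade", fields)
print(json.dumps(schema, indent=2))
```

Making every field nullable is a deliberately conservative choice for raw ingestion, where source files may have missing values; a stricter schema could drop the union for fields known to be required.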

How Metadata Injection works

Here’s a high-level example of how the metadata injection process may look within a large financial services organization. This company uses metadata injection to move thousands of data sources into Hadoop using a streamlined, dynamic integration process.

– Ben Hopkins | Senior Product Manager, Pentaho


