
Hadoop Doesn’t Have To Be Hard

Chuck Yarbrough | Director of Solutions, Pentaho
Singapore, 11 Apr 2016


From ingestion to analytics, Pentaho’s Chuck Yarbrough discusses why Hadoop is hard, and how to make it easier, ahead of Strata + Hadoop World 2016 in San Jose.

Let’s face it, Hadoop is hard. Gartner predicts, “Through 2018, 70% of Hadoop deployments will fail to meet cost savings and revenue generation objectives due to skills and integration challenges.”

This statement should give most IT groups pause, but equally concerning is that many organizations are still struggling to determine how to deliver value from Hadoop in the first place. To combat these challenges, organizations must develop a clear plan for their Hadoop projects that addresses the end-to-end delivery of integrated, governed data, as well as business analytics, so the data becomes a strategic asset for the business. To maximize the ROI on their Hadoop investment, data professionals need to consider how each phase of the analytics pipeline adds value and supports overall business goals, from raw data to end-user analytics.


1. Empower a broad base of your team members to integrate and process Hadoop data.

Teams face many different options and approaches to Hadoop data integration and transformation. Hand coding is the default, but it restricts the process to programmers. Some integration tools allow teams to design transformations without programming, but they generate code that must be tuned and maintained, re-introducing the skills challenge and making for a steep learning curve. While legacy ETL vendors also support Hadoop, they take a “black-box” data integration approach that sacrifices process transparency and does not align with open source innovation, the fundamental building block of the Hadoop ecosystem. Organizations should demand a fast and minimally invasive process for Hadoop data integration, as well as tools that fully encapsulate transformation logic without creating a code management burden. The data integration approach should also be accessible to the broadest base of relevant users: programmers, ETL developers, data analysts, data scientists, and Hadoop administrators. That accessibility depends on the portability and flexibility of the underlying data integration engine.

2. Establish a modern data onboarding process that is flexible and scalable.

A major challenge in today’s world of big data is getting data into the data lake in a simple, automated way. Many organizations use Python or another language to code their way through these processes. The problem is that with disparate sources of data numbering in the thousands, coding scripts for each source is time-consuming and extremely difficult to manage and maintain. Developers need the ability to create one process that can support many different data sources by detecting metadata on the fly and using it to dynamically generate instructions that drive transformation logic in an automated fashion. At Pentaho, we call this process “metadata injection.” With this capability, developers can parameterize ingestion processes and automate every step of the data pipeline. During my session at Strata + Hadoop World San Jose 2016 this week, I’ll outline a modern data onboarding process that is more than just data connectivity or movement. It includes managing a changing array of data sources, establishing repeatable processes at scale, and maintaining control and governance along the way.
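To make the idea concrete, here is a minimal Python sketch of a single ingestion routine that detects metadata at run time instead of relying on a hand-written script per source. It is a generic illustration of the pattern, not Pentaho's implementation of metadata injection; the function names (`infer_type`, `detect_metadata`, `ingest`) and the sample data are assumptions made up for this example, and it assumes small delimited text sources with a header row.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest of int, float, str that fits every non-empty value."""
    for caster in (int, float):
        try:
            for v in values:
                if v not in ("", None):
                    caster(v)
            return caster
        except ValueError:
            continue
    return str


def detect_metadata(rows):
    """Build {column name: type} from the rows themselves (hypothetical helper)."""
    return {name: infer_type([row[name] for row in rows]) for name in rows[0]}


def ingest(text):
    """Load any headered CSV text; the detected metadata drives the conversion."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return []
    metadata = detect_metadata(rows)
    return [
        {
            name: (cast(row[name]) if row[name] not in ("", None) else None)
            for name, cast in metadata.items()
        }
        for row in rows
    ]


# Two differently shaped sources handled by the same routine.
orders = "id,amount,city\n1,9.5,Austin\n2,,Boston\n"
sensors = "sensor,reading\na1,3\nb2,4\n"
print(ingest(orders))
print(ingest(sensors))
```

The point of the sketch is that nothing in `ingest` names a column or a type: adding a new source means passing in new data, not writing new code. A production process would also need to handle schema drift, malformed rows, and governance metadata, which this example leaves out.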

3. Determine how to deliver governed analytic insights for large production user bases.

You can do all the data engineering and preparation in the world, but if there isn’t a strategy for big data analytics delivery, the project will underwhelm. This may be the most important consideration in terms of value delivery in the analytics pipeline — it’s where data can translate to better customer experiences and increased revenue.

While tools like Hive and Impala are very useful for initial data exploration, they may not provide the right degree of interactivity and ease of use for a broader business audience. At the same time, the wrong query at the wrong time can potentially strain cluster resources, interfering with the completion of other processing tasks. In a big data environment, enterprises need to both provide fast analytics access to Hadoop data and ensure a secure, governed process for delivering and analyzing the data.

One technique that we’ve been pioneering at Pentaho is the Streamlined Data Refinery, a way of providing real-time, scalable exploration capabilities on any dataset without requiring end users to write complex code or requiring you to design your data integration processes in advance. It’s a way of turning your data lake into a structured set of information ready for examination by users not fluent in Perl or Java… think of it as automated cleansing and bottling of the entire lake without the development overhead of your Enterprise Data Warehouse.

To conclude, each phase of the pipeline (from ingestion to analytics), when approached with a value-oriented attitude, presents particular technology challenges that link back to the project’s ROI. While Hadoop projects are complex by definition, when organizations can solve these crucial challenges they begin to unlock the potential for operational efficiency gains, cost savings, and new revenue generation.

Chuck Yarbrough | Director of Solutions at Pentaho, a Hitachi Group Company