Operationalize Spark And Big Data With Pentaho’s Newest Enhancements

Data Pipeline and Analytics | Oct 17, 2016

Over the last 18 months or so, we at Pentaho have witnessed the hype train around Spark crank into full gear.
The huge interest in Spark is of course justified. As a data processing engine, Spark can scream because it leverages in-memory computing. Spark is flexible – able to handle workloads like streaming, machine learning, and batch processing in a single application. Finally, Spark is developer-friendly – equipped to work with popular programming languages and simpler than traditional Hadoop MapReduce.
Many organizations are playing around with Spark, and many are establishing use cases and proofs of concept. However, there is a difference between huge interest and huge results in production. Like Hadoop 1.0 and MapReduce before it, Spark has reached a point where it needs a little help to reach its true potential as part of enterprise big data architectures.
This is a key reason why Pentaho is introducing its latest round of big data product enhancements to Pentaho Data Integration (PDI) – in order to help organizations drive value faster in big data environments, crossing the chasm between pilot projects and big data ROI.

SQL ON SPARK

Leveraging SQL on Spark is a popular emerging technique, providing a way for data analysts – not data engineers – to do fast joins and correlations that help answer analytic questions. This isn’t surprising – teams want to leverage existing skill sets to get value out of Spark without having to hire new Scala or Python programmers.
In our upcoming release, PDI users can access SQL on Spark as a data source (supported for Cloudera and Hortonworks), making it easier for ETL developers and big data analysts to query Spark data and integrate it with other data for preparation and analytics in Pentaho’s visual environment. This is a big step toward operationalizing Spark in the context of existing enterprise big data architectures.
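
To make this concrete, here is a minimal sketch of the kind of join an analyst might run with Spark SQL. The table and column names (orders, customers, customer_id, region, amount) are hypothetical, and the sketch calls PySpark directly rather than going through PDI's data source connection. It assumes pyspark is installed and that both tables are already registered in the Hive metastore.

```python
def region_revenue_query(orders_table: str = "orders",
                         customers_table: str = "customers") -> str:
    """Build a plain SQL join; all table and column names are hypothetical."""
    return (
        "SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS revenue\n"
        f"FROM {orders_table} o\n"
        f"JOIN {customers_table} c ON o.customer_id = c.customer_id\n"
        "GROUP BY c.region\n"
        "ORDER BY revenue DESC"
    )


def run_on_spark(query: str):
    """Execute the query with Spark SQL and return the rows to the driver.

    pyspark is imported lazily so the query builder above can be used
    (and inspected) on a machine without Spark installed.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("sql-on-spark-sketch")
        .enableHiveSupport()  # assumption: tables live in the Hive metastore
        .getOrCreate()
    )
    return spark.sql(query).collect()


if __name__ == "__main__":
    # Print the SQL only; call run_on_spark(...) on a cluster with the tables.
    print(region_revenue_query())
```

The point of the sketch is that the analyst only writes the SQL; no Scala or Python data processing logic is needed beyond submitting the query.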

EXPANDED SPARK ORCHESTRATION

As noted above, coders really like Spark. However, like any popular developer technology, Spark is more valuable when paired with tools that can make it easier to manage in production.
To meet this need, we are expanding PDI’s ability to visually orchestrate Spark applications. Now, users will be able to coordinate and schedule Spark Streaming, Spark SQL, Spark ML, and Spark MLlib applications as part of PDI jobs. We have also added Python as a supported programming language for Spark orchestration. Taken together, these enhancements will help enterprises manage their Spark application workflows along with existing PDI transformations and processes.
For example, big data engineers can use Pentaho to orchestrate a fraud detection workflow that ingests transaction data to Hadoop, trains an existing Spark ML model on the data to help predict fraudulent transactions, applies that Spark model to new data, and then routes the results downstream for reporting. A sample of such a process is depicted in the PDI job below.
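
Outside PDI's visual designer, the shape of such a workflow can be approximated in a few lines of Python around the standard spark-submit command line. This is only an illustrative sketch, not how PDI implements orchestration: the script names are hypothetical, every stage is modeled as a PySpark application for simplicity, and a YARN cluster is assumed.

```python
import shlex
import subprocess
from typing import Callable, Dict, List, Optional, Tuple


def spark_submit_command(app: str,
                         master: str = "yarn",
                         deploy_mode: str = "cluster",
                         conf: Optional[Dict[str, str]] = None,
                         app_args: Optional[List[str]] = None) -> List[str]:
    """Build the argument list for one spark-submit invocation."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    cmd += app_args or []
    return cmd


# Hypothetical scripts, one per stage of the fraud detection example.
FRAUD_WORKFLOW: List[Tuple[str, str]] = [
    ("ingest", "ingest_transactions.py"),
    ("train", "train_fraud_model.py"),
    ("score", "score_new_transactions.py"),
    ("publish", "publish_results.py"),
]


def run_workflow(steps: List[Tuple[str, str]],
                 run: Callable[[List[str]], int]) -> List[str]:
    """Run the steps in order and stop at the first non-zero exit code.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, script in steps:
        if run(spark_submit_command(script)) != 0:
            break
        completed.append(name)
    return completed


def run_with_subprocess(cmd: List[str]) -> int:
    """Runner that actually launches spark-submit (requires Spark on PATH)."""
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    done = run_workflow(FRAUD_WORKFLOW, run_with_subprocess)
    print("completed steps:", done)
```

Passing the runner in as a parameter keeps the sequencing logic separate from process launching, so the ordering and stop-on-failure behavior can be checked without a cluster. A visual orchestration tool adds scheduling, logging, and integration with other transformations on top of this basic pattern.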

HADOOP SECURITY UPDATES

Without adequate data security, few if any big data projects reach production, let alone their ROI potential. As such, we’ve also extended our compatibility with key Hadoop Security frameworks. Updates include:

PDI Integration with Cloudera Sentry to control access to specific data within Hadoop according to business rules
Expanded Kerberos compatibility that facilitates secure multi-user cluster authentication via PDI, enabling more granular control and auditing of which users are accessing the cluster through PDI

FURTHER ENHANCEMENTS

And there’s more! Pentaho is introducing the following additional data integration feature updates:

Over 30 new PDI steps have been enabled for metadata injection, including several inputs and operations related to Hadoop, NoSQL, and analytic databases. This helps organizations drive further productivity in data onboarding use cases, which translates to huge time and cost savings. In one instance, a Pentaho customer estimated that every transformation automated through metadata injection saved approximately $1000 in manual development costs. When you are talking about hundreds or thousands of data sources, the savings really add up!
Support for Kafka step plug-ins to facilitate big data messaging use cases in PDI (for customers with Enterprise Support). Kafka is becoming a popular technology for near-real-time data processing use cases, especially those related to the Internet of Things (IoT).
Support for Avro and Parquet step plug-ins, expanding the Hadoop formats you can leverage with PDI. These are recommended output formats in our Filling the Data Lake design pattern for automating the ingestion of many data sources into Hadoop.
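
The idea behind metadata injection can be illustrated outside PDI with a plain-Python analogy: one generic template does the work, and everything specific to a data source arrives as metadata at run time. This is not PDI's API; the metadata layout, field names, and function names below are made up for illustration.

```python
import csv
import io
from typing import Callable, Dict, List

# Type names a metadata record may use, mapped to Python converters.
CASTS: Dict[str, Callable[[str], object]] = {
    "string": str,
    "integer": int,
    "number": float,
}


def run_template(source_text: str, metadata: dict) -> List[dict]:
    """Generic 'read, rename, cast' flow driven entirely by metadata."""
    reader = csv.DictReader(io.StringIO(source_text),
                            delimiter=metadata["delimiter"])
    rows = []
    for record in reader:
        rows.append({
            field["target"]: CASTS[field["type"]](record[field["source"]])
            for field in metadata["fields"]
        })
    return rows


# Two hypothetical sources with different layouts, handled by one template.
SOURCES = [
    ("id;amt\n7;12.5\n", {
        "delimiter": ";",
        "fields": [
            {"source": "id", "target": "customer_id", "type": "integer"},
            {"source": "amt", "target": "amount", "type": "number"},
        ],
    }),
    ("cust,total\n9,3.0\n", {
        "delimiter": ",",
        "fields": [
            {"source": "cust", "target": "customer_id", "type": "integer"},
            {"source": "total", "target": "amount", "type": "number"},
        ],
    }),
]

if __name__ == "__main__":
    for text, meta in SOURCES:
        print(run_template(text, meta))
```

Onboarding another source in this analogy means adding a metadata record rather than writing a new transformation, which is where the per-transformation savings described above would come from.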

Taken together, these enhancements to PDI help big data projects deliver value faster and future-proof them in an ever-evolving data landscape. We can’t wait to see how customers are able to cement their big data ROI with the latest expansions to Pentaho’s powerful platform.
Ben Hopkins | Senior Product Marketing Manager, Pentaho
