
Operationalize Spark and Big Data with Pentaho’s Newest Enhancements

Ben Hopkins | Senior Product Marketing Manager, Pentaho
Singapore, 17 Oct 2016



Over the last 18 months or so, we at Pentaho have witnessed the hype train around Spark crank into full gear.

The huge interest in Spark is of course justified.  As a data processing engine, Spark can scream because it leverages in-memory computing.  Spark is flexible – able to handle workloads like streaming, machine learning, and batch processing in a single application.  Finally, Spark is developer-friendly – equipped to work with popular programming languages and simpler than traditional Hadoop MapReduce.

Many organizations are playing around with Spark, and many are establishing use cases and proofs of concept.  However, there is a difference between huge interest and huge results in production.  Like Hadoop 1.0 and MapReduce before it, Spark has reached a point where it needs a little help to reach its true potential as a part of enterprise big data architectures.

This is a key reason why Pentaho is introducing its latest round of big data product enhancements to Pentaho Data Integration (PDI) – in order to help organizations drive value faster in big data environments, crossing the chasm between pilot projects and big data ROI.


Leveraging SQL on Spark is a popular emerging technique, providing a way for data analysts – not data engineers – to do fast joins and correlations that help answer analytic questions.  This isn’t surprising – teams want to leverage existing skill sets to get value out of Spark without having to hire new Scala or Python programmers.

In our upcoming release, PDI users can access SQL on Spark as a data source (supported for Cloudera and Hortonworks), making it easier for ETL developers and big data analysts to query Spark data and integrate it with other data for preparation and analytics in Pentaho’s visual environment.  This is a big step toward operationalizing Spark in the context of existing enterprise big data architectures.
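To give a feel for the kind of join an analyst might run, here is a minimal sketch. The table names, column names, and data are invented for illustration, and an in-memory SQLite database stands in for the Spark SQL endpoint purely so the example runs anywhere; it is not how PDI connects to Spark.

```python
import sqlite3

# Hypothetical tables and data, invented for illustration only.
# SQLite is a stand-in for a Spark SQL endpoint so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'APAC'), (2, 'EMEA'), (3, 'APAC');
    INSERT INTO orders VALUES (10, 1, 120.0), (11, 2, 80.0), (12, 3, 45.5), (13, 1, 30.0);
""")

# A plain SQL join and aggregation: the sort of question an analyst can
# answer without writing Scala or Python against Spark's programming APIs.
REVENUE_BY_REGION = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY revenue DESC
"""

def revenue_by_region(connection):
    """Run the join and return (region, order_count, revenue) rows."""
    return connection.execute(REVENUE_BY_REGION).fetchall()

rows = revenue_by_region(conn)
for region, order_count, revenue in rows:
    print(region, order_count, revenue)
```

The point of the sketch is only that the analyst's work is expressed as SQL; the engine behind the query could be swapped without changing the skill set required.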


As noted above, coders really like Spark.  However, like any popular developer technology, Spark is more valuable when paired with tools that can make it easier to manage in production.

To meet this need, we are expanding PDI’s ability to visually orchestrate Spark applications.  Now, users will be able to coordinate and schedule Spark Streaming, Spark SQL, Spark ML, and Spark MLlib applications as part of PDI jobs.  We have also added Python as a supported programming language for Spark orchestration.  Taken together, these enhancements will help enterprises manage their Spark application workflows along with existing PDI transformations and processes.
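Outside of a visual tool, Spark applications are typically launched with the `spark-submit` command. As a rough illustration of what coordinating several of them involves, the sketch below assembles `spark-submit` command lines and runs them in order, stopping at the first failure. This is not PDI's implementation; the helper functions, file names, class name, and paths are all hypothetical.

```python
import shlex

def spark_submit_command(app, master="yarn", deploy_mode="cluster",
                         main_class=None, conf=None, app_args=()):
    """Build the argument list for a spark-submit invocation."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if main_class:
        # --class is needed for JVM (Scala/Java) applications, not Python ones.
        cmd += ["--class", main_class]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    cmd += list(app_args)
    return cmd

def run_in_order(commands, runner):
    """Run each command with `runner`; stop at the first non-zero exit code."""
    for cmd in commands:
        code = runner(cmd)
        if code != 0:
            return code
    return 0

def dry_run(cmd):
    """Print the command instead of executing it, and report success."""
    print(shlex.join(cmd))
    return 0

# Hypothetical applications: one JVM job and one Python job.
jvm_job = spark_submit_command(
    "model-training.jar",
    main_class="com.example.TrainModel",
)
python_job = spark_submit_command(
    "score_transactions.py",
    conf={"spark.executor.memory": "4g"},
    app_args=["--input", "/data/transactions"],
)

exit_code = run_in_order([jvm_job, python_job], dry_run)
```

Passing the runner in as a parameter keeps the sequencing logic separate from execution, so `dry_run` could be replaced with a function that calls `subprocess.run(cmd).returncode` on a machine where Spark is installed.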

For example, big data engineers can use Pentaho to orchestrate a fraud detection workflow that ingests transaction data to Hadoop, trains an existing Spark ML model on the data to help predict fraudulent transactions, applies that Spark model to new data, and then routes the results downstream for reporting.  A sample of such a process is depicted in the PDI job below.


Without adequate data security, few if any big data projects reach production, let alone their ROI potential.  As such, we’ve also extended our compatibility with key Hadoop Security frameworks.  Updates include:

  • PDI Integration with Cloudera Sentry to control access to specific data within Hadoop according to business rules
  • Expanded Kerberos compatibility that facilitates secure multi-user cluster authentication via PDI, enabling more granular control and auditing of which users are accessing the cluster through PDI


And there’s more!  Pentaho is introducing the following additional data integration feature updates:

  • Over 30 new PDI steps have been enabled for metadata injection, including several inputs and operations related to Hadoop, NoSQL, and analytic databases.  This helps organizations drive further productivity in data onboarding use cases, which translates to huge time and cost savings.  In one instance, a Pentaho customer estimated that every transformation automated through metadata injection saved approximately $1000 in manual development costs.  When you are talking about hundreds or thousands of data sources, the savings really add up!
  • Support for Kafka step plug-ins to facilitate big data messaging use cases in PDI (for customers with Enterprise Support).  Kafka is becoming a popular technology to help facilitate near real time data processing use cases, especially related to the Internet of Things (IoT).
  • Support for Avro and Parquet step plug-ins, expanding the Hadoop formats you can leverage with PDI.  These are recommended output formats in our Filling the Data Lake design pattern for automating the ingestion of many data sources into Hadoop.
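The metadata injection item in the list above can be pictured as one template that is filled in once per data source, rather than one hand-built transformation per source. The sketch below illustrates only that idea. The dictionary layout, field names, and paths are hypothetical and do not reflect PDI's actual file format or API; the per-transformation saving is the customer estimate quoted above.

```python
from copy import deepcopy

# Hypothetical template: the structure is fixed, the source-specific
# details (None values) are supplied at run time.
TEMPLATE = {
    "input": {"type": "csv", "path": None, "fields": None},
    "output": {"type": "parquet", "path": None},
}

def inject(template, source):
    """Return a concrete transformation definition for one data source."""
    transformation = deepcopy(template)  # leave the shared template untouched
    transformation["input"]["path"] = source["path"]
    transformation["input"]["fields"] = list(source["fields"])
    transformation["output"]["path"] = f"/lake/{source['name']}"
    return transformation

# Hypothetical source descriptions; in practice there may be hundreds.
SOURCES = [
    {"name": "crm", "path": "/incoming/crm.csv", "fields": ["id", "email"]},
    {"name": "billing", "path": "/incoming/billing.csv", "fields": ["id", "amount"]},
]

transformations = [inject(TEMPLATE, source) for source in SOURCES]

# Approximate saving per automated transformation, in USD, as estimated
# by the customer mentioned in the list above.
ESTIMATED_SAVING_PER_TRANSFORMATION = 1000
estimated_saving = ESTIMATED_SAVING_PER_TRANSFORMATION * len(transformations)
print(f"{len(transformations)} transformations, about ${estimated_saving} saved")
```

With the same estimate, 500 data sources would come to roughly $500,000, which is the scale at which the savings start to add up.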

Taken together, these enhancements to PDI help big data projects deliver value faster and future-proof them in an ever-evolving data landscape. We can’t wait to see how customers are able to cement their big data ROI with the latest expansions to Pentaho’s powerful platform.


