Pentaho and Hadoop Framework Fundamentals

Training Course

This course introduces you to big data concepts using the Hadoop framework of technologies and Pentaho products. Building upon Pentaho Data Integration Fundamentals, you will learn how Pentaho works with the following Hadoop framework technologies:

  • HDFS
  • Sqoop
  • Pig
  • Oozie
  • MapReduce
  • YARN
  • Hive
  • Impala
  • HBase
  • Flume

This course focuses heavily on labs, giving you practical, hands-on experience with the topics covered in each section.



Id: DI2000
Level: Advanced
Audience: Data Analyst
Delivery Method: Instructor-led online, Private on-site
Duration: 2 Day(s)
Cost: $1,350.00 USD
Credits: 2
Category: Pentaho Data Integration



Upcoming Classes


Instructor-led online training

Location: Online
Dates: Nov 29 – Nov 30, 2016

Class dates in bold are guaranteed to run!

Course Benefits

  • Improve productivity by giving your data integration team the skills they need to use Pentaho Data Integration with Hadoop data sources
  • Implement the Streamlined Data Refinery big data blueprint using PDI and Hadoop
  • Benefit from interactive, hands-on training materials that significantly improve skill development and maximize retention

Skills Achieved

At the completion of this course, you should be able to:

  • Use Hadoop technologies from the native command line and with Pentaho Data Integration
  • Employ data ingestion and processing best practices
  • Use Pentaho Interactive Reporting and Analyzer to report from Impala

This course is for experienced Pentaho Data Integration users who want to learn how PDI works with a wide variety of Hadoop Framework technologies. The content of this course is advanced and very technical.

DI1000 Pentaho Data Integration Fundamentals is required prior to taking this course. Basic PDI functional knowledge is used throughout this course.

BA1000 Business Analytics User Console and BA2000 Business Analytics Report Designer, or equivalent field experience, are required.

Some basic knowledge of the Linux operating system is required.

Prior exposure to Hadoop concepts is not required but is beneficial.

Students attending classroom courses in the United States are provided with a PC to use during class. Students attending courses outside the US should contact the Authorized Training Provider regarding PC requirements for Pentaho courses.

In general, if your training provider requires you to bring a PC to class, it must meet the following requirements. You can also verify your system against the Compatibility Matrix: List of Supported Products topic in the Pentaho Documentation site.

  • Windows XP or Windows 7 desktop operating system (for Macintosh support, please contact your Customer Success Manager)
  • RAM: at least 4GB
  • Hard drive space: at least 2GB for the software, and more for solution and content files
  • Processor: dual-core AMD64 or Intel EM64T
  • USB port

Online courses require a broadband Internet connection, a modern Web browser (such as Microsoft Internet Explorer or Mozilla Firefox), and the ability to connect to GoToTraining. Online courses use Pentaho’s cloud-based exercise environment: students are provided access to a virtual machine used to complete the exercises.

For online courses, students are provided with a secured, electronic course manual. Printed manuals are not provided for online courses. When an electronic manual is provided, students are encouraged to print the exercise book before class begins, though this is not required.

Students attending this course on-site should contact their Customer Success Manager for hardware and software requirements. You can also email us at for more information regarding on-site training requirements.

Day 1

Module 1: Introduction to Pentaho and Big Data

  Lesson 1: Big Data Architectures

  Lesson 2: Streamlined Data Refinery Use Case

  Lesson 3: Overview of Pentaho Tools

      Exercise 1: Exploring the Environment

Module 2: HDFS

  Lesson 1: Basics of HDFS

  • Understanding How HDFS Reads/Writes Data
  • Data Replication and Fault-Tolerance

  Lesson 2: HDFS Best Practices

  • File Sizes
  • File Types
  • Compression

  Lesson 3: Pentaho Data Integration with HDFS

  • Import/Export Data Between Local File System and HDFS
  • File Management Steps
  • PDI/HDFS Best Practices

Module 3: Data Ingestion

  Lesson 1: Sqoop

  • Sqoop Basics
  • PDI and Sqoop

      Exercise 3: Import/Export Data between a DB and Hadoop Using Sqoop

  Lesson 2: Flume

  • Flume Basics
  • PDI and Flume

  Lesson 3: Data Ingestion Best Practices

  • PDI vs. Sqoop vs. Flume
  • Aggregating Smaller Files into Bigger Files
  • Best Ways to Store Non-Splittable Files Like XML and JSON
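The "aggregating smaller files into bigger files" practice above addresses a well-known HDFS limitation: every file costs a NameNode metadata entry and at least one block, so millions of tiny files degrade the cluster. As a rough illustration (not course material), the sketch below merges small local files into one large, line-delimited file before ingestion; the file names and contents are made up for the example.

```python
import os
import tempfile

def aggregate_files(paths, out_path, record_sep=b"\n"):
    """Concatenate many small files into one large, record-delimited file."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                data = src.read()
                out.write(data)
                if not data.endswith(record_sep):
                    out.write(record_sep)  # keep records line-delimited

# Demo with throwaway files (hypothetical names/contents)
tmp = tempfile.mkdtemp()
small = []
for i in range(3):
    p = os.path.join(tmp, f"part-{i}.txt")
    with open(p, "w") as f:
        f.write(f"record {i}")
    small.append(p)

merged = os.path.join(tmp, "merged.txt")
aggregate_files(small, merged)
```

In a real pipeline, a PDI job or a tool such as Sqoop or Flume would perform this consolidation before (or while) writing to HDFS; the point is simply that fewer, larger files are cheaper for the NameNode to track.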

Module 4: Data Processing

  Lesson 1: MapReduce Concepts

  • Mapper
  • Reducer
  • Combiner
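To make the three roles above concrete before Hadoop itself is introduced, here is a minimal single-process sketch of the word-count pattern. In Hadoop (or Pentaho MapReduce) the framework distributes these phases across a cluster; here they run sequentially, and the sample input lines are invented for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def combiner(pairs):
    """Pre-aggregate pairs locally to reduce shuffle traffic (an optimization)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return counts.items()

def reducer(word, counts):
    """Sum all counts emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    # Map + combine, per input "split"
    intermediate = []
    for line in lines:
        intermediate.extend(combiner(mapper(line)))
    # Shuffle/sort: group intermediate pairs by key
    intermediate.sort(key=itemgetter(0))
    # Reduce each key group to a final count
    return dict(
        reducer(word, (n for _, n in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    )

result = run_job(["big data big wins", "data beats opinion"])
```

The combiner is optional and must be associative and commutative (summing is both), since the framework may apply it zero or more times per mapper.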

  Lesson 2: MR1 Architecture

  • Driver
  • Job Tracker
  • Task Tracker
  • Shuffle/Sort
  • Partitioner

  Lesson 3: YARN / MR2

  • YARN Basics
  • MR2 Architecture

  Lesson 4: Using PDI to Write MapReduce

  • Developing MR Using PDI
  • MR1 vs. YARN/MR2 in PDI

  Lesson 5: PDI/MR Best Practices

  • The Do's and Don'ts of Writing PDI and MapReduce Apps
  • Using Hadoop's Distributed Cache
  • Compression

      Exercise 4: Using PDI to Develop PMR on MR2/YARN

  Lesson 6: PDI/Carte on YARN

  • Basics of Carte on YARN

      Exercise 5: Build a Transformation that Runs on YARN

  Lesson 7: Pig

  • Pig Basics
  • PDI and Pig

      Exercise 6: Run the Pig Application and Execute a Pig Script Using PDI

Day 2

Module 5: Job Orchestration

  Lesson 1: Oozie Basics

  Lesson 2: PDI Job Orchestration Features

  Lesson 3: PDI and Oozie

      Exercise 7: Run the Oozie Application and Execute an Oozie Script from PDI

Module 6: Hadoop and SQL

  Lesson 1: Traditional Hive

  Lesson 2: Hive/TEZ

  Lesson 3: Impala

  Lesson 4: Using PDI with Hive and Impala

  Lesson 5: SQL/Hadoop/PDI Best Practices

      Exercise 8: Working with Impala using the Command Line, HUE, and PDI

Module 7: HBase

  Lesson 1: HBase Basics

  Lesson 2: PDI and HBase

  Lesson 3: PDI/HBase Best Practices

      Exercise 9: Working with HBase and PDI

Module 8: Reporting on Big Data

  Lesson 1: Using Pentaho Report Designer with Hadoop

  Lesson 2: Using Pentaho Analyzer with Hadoop

  Lesson 3: Best Practices for Reporting on Data in Hadoop

      Exercise 10: Create a PRD and Analyzer Report Using Data in Hadoop

Module 9: Additional Pentaho and Big Data Technologies

  Lesson 1: Flume and PDI

  Lesson 2: Storm and PDI

  Lesson 3: Kafka and PDI

Onsite Training

For groups of six or more


Public Training


Classes marked with Confirmed are guaranteed to run. Sign up now while there is still space available!


Pentaho and Hadoop Framework Fundamentals Ratings

Averaged from 50 responses.

Categories rated: Training Organized, Training Objectives, Training Expectations, Training Curriculum, Training Labs, and Training Overall.