Pentaho and Hadoop Framework Fundamentals

Training Course

This course introduces big data concepts using the Hadoop framework of technologies and Pentaho products. Building on Pentaho Data Integration Fundamentals, you will learn how Pentaho works with the following Hadoop Framework technologies:

  • HDFS
  • Sqoop
  • Pig
  • Oozie
  • MapReduce
  • YARN
  • Hive
  • Impala
  • HBase
  • Flume
  • Spark

This course focuses heavily on labs to allow you practical hands-on application of the topics covered in each section.

Id: DI2000
Level: Advanced
Audience: Data Analyst
Delivery Method: Instructor-led online, Private on-site
Duration: 2 Day(s)
Cost: $1,350.00 USD
Credits: 2
Category: Pentaho Data Integration


Upcoming Classes


Instructor-led online training

Location        Dates
Online - APAC   Apr 1 – Apr 2, 2019
Online - EMEA   Apr 29 – Apr 30, 2019

Course Benefits

  • Improve productivity by giving your data integration team the skills they need to use Pentaho Data Integration with Hadoop data sources
  • Interactive, hands-on training materials significantly improve skill development and maximize retention

Skills Achieved

At the completion of this course, you should be able to:

  • Use Hadoop technologies from the native command line and with Pentaho Data Integration
  • Employ data ingestion and processing best practices

This course is for experienced Pentaho Data Integration users who want to learn how PDI works with a wide variety of Hadoop Framework technologies. The content of this course is advanced and highly technical.

DI1000 Pentaho Data Integration Fundamentals is required prior to taking this course. Basic PDI functional knowledge is used throughout this course.

Some basic knowledge of the Linux operating system is required.

Prior exposure to Hadoop concepts is not required but is beneficial.

Students attending classroom courses in the United States are provided with a PC to use during class. Students attending courses outside the US should contact the Authorized Training Provider regarding PC requirements for Pentaho courses.

In general, if your training provider requires you to bring a PC to class, it must meet the following requirements. You can also verify your system against the Compatibility Matrix: List of Supported Products topic in the Pentaho Documentation site.

  • Windows XP or Windows 7 desktop operating system (for Macintosh support, please contact your Customer Success Manager)
  • RAM: at least 4GB
  • Hard drive space: at least 2GB for the software, and more for solution and content files
  • Processor: dual-core AMD64 or Intel EM64T
  • USB port

Online courses require a broadband Internet connection, a modern Web browser (such as Microsoft Internet Explorer or Mozilla Firefox), and the ability to connect to GoToTraining. Online courses use Pentaho’s cloud-based exercise environment; students are provided access to a virtual machine used to complete the exercises.

For online courses, students are provided with a secured, electronic course manual. Printed manuals are not provided for online courses. When an electronic manual is provided, students are encouraged to print the exercise book before class begins, though this is not required.

Students attending this course on-site should contact their Customer Success Manager for hardware and software requirements. You can also email us for more information regarding on-site training requirements.

Day 1

Module 1: Course Agenda and Structure


Module 2: Introduction to Pentaho and Big Data

      Exercise 1: Using the Virtual Exercise Environment

Module 3: Big Data Solutions Architectures

  Lesson 1: Batch Processing Architecture

  Lesson 2: Real-Time and Stream Processing Architecture

  Lesson 3: Mixed Batch and Real-Time Processing Architecture

Module 4: Hadoop and HDFS

  Lesson 1: Basics of HDFS

  Lesson 2: Working with HDFS in PDI

      Exercise 2: Reading and Writing Data with PDI and HDFS

  Lesson 3: HDFS and PDI Best Practices

Module 5: Hadoop Data Ingestion Tools

  Lesson 1: Apache Flume

  Lesson 2: Apache Sqoop

  Lesson 3: Ingestion Best Practices

Module 6: Data Processing in Hadoop using MapReduce

  Lesson 1: Understanding Hadoop MapReduce

  Lesson 2: MapReduce with Pentaho Data Integration

      Exercise 3: Using Pentaho MapReduce

  Lesson 3: MapReduce Best Practices

Module 7: Data Processing in Hadoop using Carte/YARN

  Lesson 1: YARN Architecture

  Lesson 2: MapReduce2 on YARN

  Lesson 3: PDI/Carte on YARN

Day 2

Module 8: Data Processing with Pig

  Lesson 1: Pig Basics

  Lesson 2: Using Pig in Data Integration

Module 9: Job Orchestration with PDI and Oozie

  Lesson 1: Oozie Basics

  Lesson 2: Oozie with PDI

Module 10: Overview of SQL on Hadoop - Best Practices

  Lesson 1: Hive Basics

  Lesson 2: Impala Basics

  Lesson 3: Using Hive / Impala with PDI

      Exercise 4: Working with Hive and Impala

  Lesson 4: Hive Best Practices

Module 11: Overview of HBase

  Lesson 1: HBase Basics

  Lesson 2: HBase with PDI

  Lesson 3: Using HBase with PDI MapReduce

      Exercise 5: Working with HBase

  Lesson 4: HBase and PDI Best Practices

Module 12: Overview of Spark

  Lesson 1: Spark Basics

  Lesson 2: Spark SQL

  Lesson 3: Spark Streams

  Lesson 4: Spark MLlib and SparkR

  Lesson 5: Spark GraphX

  Lesson 6: Spark with PDI

Module 13: Reporting on Big Data

  Lesson 1: Pentaho Report Designer with Hadoop

  Lesson 2: Analyzer with Hadoop

Module 14: (Optional) PDI with Amazon Hadoop

Onsite Training

For groups of six or more


What Our Clients Are Saying

Our trainer made the session as interactive as he could and explained the concepts from a practical implementation standpoint. It was an excellent two-day session and we are thankful to him.

- Cognizant Technology Solutions Corporation

I've enjoyed this course a lot. As a veteran PDI developer, and given that our company is moving into Big Data, this course was a perfect match for our needs.

- Wirecard Technologies GmbH

Pentaho and Hadoop Framework Fundamentals Ratings

Averaged from 119 responses.

Categories rated: Training Organized, Training Objectives, Training Expectations, Training Curriculum, Training Labs, and Training Overall.