Pentaho and Hadoop Framework Fundamentals
This course introduces big data concepts using the Hadoop framework of technologies and Pentaho products. Building upon Pentaho Data Integration Fundamentals, you will learn how Pentaho works with Hadoop framework technologies such as HDFS, MapReduce, YARN, Pig, Oozie, Hive, Impala, HBase, and Spark.
This course focuses heavily on labs to allow you practical hands-on application of the topics covered in each section.
- Improve productivity by giving your data integration team the skills they need to use Pentaho Data Integration with Hadoop data sources
- Develop skills and maximize retention through interactive, hands-on training materials
At the completion of this course, you should be able to:
- Use Hadoop technologies from the native command line and with Pentaho Data Integration
- Employ data ingestion and processing best practices
DI1000 Pentaho Data Integration Fundamentals is a prerequisite for this course; basic PDI functional knowledge is assumed throughout.
Some basic knowledge of the Linux operating system is required.
Prior exposure to Hadoop concepts is not required but is beneficial.
Students attending classroom courses in the United States are provided with a PC to use during class. Students attending courses outside the US should contact the Authorized Training Provider regarding PC requirements for Pentaho courses.
In general, if your training provider requires you to bring a PC to class, it must meet the following requirements. You can also verify your system against the Compatibility Matrix: List of Supported Products topic on the Pentaho Documentation site.
- Windows XP or Windows 7 desktop operating system (for Macintosh support, please contact your Customer Success Manager)
- RAM: at least 4GB
- Hard drive space: at least 2GB for the software, plus additional space for solution and content files
- Processor: dual-core AMD64 or Intel EM64T
- USB port
Online courses require a broadband Internet connection, the use of a modern Web browser (such as Microsoft Internet Explorer or Mozilla Firefox), and the ability to connect to GoToTraining. For more information on GoToTraining requirements, see http://www.gotomeeting.com/online/training. Online courses use Pentaho’s cloud-based exercise environment. Students are provided access to a virtual machine used to complete the exercises.
For online courses, students are provided with a secure electronic course manual; printed manuals are not provided. When an electronic manual is provided, students are encouraged, though not required, to print the exercise book before class begins.
Students attending this course on-site should contact their Customer Success Manager for hardware and software requirements. You can also email us at email@example.com for more information regarding on-site training requirements.
Module 1: Course Agenda and Structure
Module 2: Introduction to Pentaho and Big Data
Exercise 1: Using the Virtual Exercise Environment
Module 3: Big Data Solutions Architectures
Lesson 1: Batch Processing Architecture
Lesson 2: Real-Time and Stream Processing Architecture
Lesson 3: Mixed Batch and Real-Time Processing Architecture
Module 4: Hadoop and HDFS
Lesson 1: Basics of HDFS
Lesson 2: Working with HDFS in PDI
Exercise 2: Reading and Writing Data with PDI and HDFS
Lesson 3: HDFS and PDI Best Practices
Module 5: Hadoop Data Ingestion Tools
Lesson 1: Apache Flume
Lesson 2: Apache Sqoop
Lesson 3: Ingestion Best Practices
Module 6: Data Processing in Hadoop using MapReduce
Lesson 1: Understanding Hadoop MapReduce
Lesson 2: MapReduce with Pentaho Data Integration
Exercise 3: Using Pentaho MapReduce
Lesson 3: MapReduce Best Practices
Module 7: Data Processing in Hadoop using Carte/YARN
Lesson 1: YARN Architecture
Lesson 2: MapReduce2 on YARN
Lesson 3: PDI/Carte on YARN
Module 8: Data Processing with Pig
Lesson 1: Pig Basics
Lesson 2: Using Pig in Data Integration
Module 9: Job Orchestration with PDI and Oozie
Lesson 1: Oozie Basics
Lesson 2: Oozie with PDI
Module 10: Overview of SQL on Hadoop - Best Practices
Lesson 1: Hive Basics
Lesson 2: Impala Basics
Lesson 3: Using Hive/Impala with PDI
Exercise 4: Working with Hive and Impala
Lesson 4: Hive Best Practices
Module 11: Overview of HBase
Lesson 1: HBase Basics
Lesson 2: HBase with PDI
Lesson 3: Using HBase with PDI MapReduce
Exercise 5: Working with HBase
Lesson 4: HBase and PDI Best Practices
Module 12: Overview of Spark
Lesson 1: Spark Basics
Lesson 2: Spark SQL
Lesson 3: Spark Streaming
Lesson 4: Spark MLlib and SparkR
Lesson 5: Spark GraphX
Lesson 6: Spark with PDI
Module 13: Reporting on Big Data
Lesson 1: Pentaho Report Designer with Hadoop
Lesson 2: Analyzer with Hadoop