Who Should Attend
This course is best suited to developers, engineers, and architects who want to use Hadoop and related tools to solve real-world problems.
Course Objectives
Skills learned in this course include:
- Creating a data set with Kite SDK
- Developing custom Flume components for data ingestion
- Managing a multi-stage workflow with Oozie
- Analyzing data with Crunch
- Writing user-defined functions for Hive and Impala
- Indexing data with Cloudera Search
Agenda
1 - Application Architecture
- Scenario Explanation
- Understanding the Development Environment
- Identifying and Collecting Input Data
- Selecting Tools for Data Processing and Analysis
- Presenting Results to the User
2 - Defining and Using Data Sets
- Metadata Management
- What is Apache Avro?
- Avro Schemas
- Avro Schema Evolution
- Selecting a File Format
- Performance Considerations
3 - Using the Kite SDK Data Module
- What is the Kite SDK?
- Fundamental Data Module Concepts
- Creating New Data Sets Using the Kite SDK
- Loading, Accessing, and Deleting a Data Set
4 - Importing Relational Data with Apache Sqoop
- What is Apache Sqoop?
- Basic Imports
- Limiting Results
- Improving Sqoop's Performance
- Sqoop 2
5 - Capturing Data with Apache Flume
- What is Apache Flume?
- Basic Flume Architecture
- Flume Sources
- Flume Sinks
- Flume Configuration
- Logging Application Events to Hadoop
6 - Developing Custom Flume Components
- Flume Data Flow and Common Extension Points
- Custom Flume Sources
- Developing a Flume Pollable Source
- Developing a Flume Event-Driven Source
- Custom Flume Interceptors
- Developing a Header-Modifying Flume Interceptor
- Developing a Filtering Flume Interceptor
- Writing Avro Objects with a Custom Flume Interceptor
7 - Managing Workflows with Apache Oozie
- The Need for Workflow Management
- What is Apache Oozie?
- Defining an Oozie Workflow
- Validation, Packaging, and Deployment
- Running and Tracking Workflows Using the CLI
- Hue UI for Oozie
8 - Processing Data Pipelines with Apache Crunch
- What is Apache Crunch?
- Understanding the Crunch Pipeline
- Comparing Crunch to Java MapReduce
- Working with Crunch Projects
- Reading and Writing Data in Crunch
- Data Collection API
- Functions
- Utility Classes in the Crunch API
9 - Working with Tables in Apache Hive
- What is Apache Hive?
- Accessing Hive
- Basic Query Syntax
- Creating and Populating Hive Tables
- How Hive Reads Data
- Using the RegexSerDe in Hive
10 - Developing User-Defined Functions
- What are User-Defined Functions?
- Implementing a User-Defined Function
- Deploying Custom Libraries in Hive
- Registering a User-Defined Function in Hive
11 - Executing Interactive Queries with Impala
- What is Impala?
- Comparing Hive to Impala
- Running Queries in Impala
- Support for User-Defined Functions
- Data and Metadata Management
12 - Understanding Cloudera Search
- What is Cloudera Search?
- Search Architecture
- Supported Document Formats
13 - Indexing Data with Cloudera Search
- Collection and Schema Management
- Morphlines
- Indexing Data in Batch Mode
- Indexing Data in Near Real Time
14 - Presenting Results to Users
- Solr Query Syntax
- Building a Search UI with Hue
- Accessing Impala through JDBC
- Powering a Custom Web Application with Impala and Search