Cloudera Developer Training Slides
201611
Introduction
Chapter 1
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-3
Trademark Information
§ The names and logos of Apache products mentioned in Cloudera training courses,
including those listed below, are trademarks of the Apache Software Foundation
– Apache Accumulo
– Apache Avro
– Apache Bigtop
– Apache Crunch
– Apache Flume
– Apache Hadoop
– Apache HBase
– Apache HCatalog
– Apache Hive
– Apache Impala (incubating)
– Apache Kafka
– Apache Kudu
– Apache Lucene
– Apache Mahout
– Apache Oozie
– Apache Parquet
– Apache Pig
– Apache Sentry
– Apache Solr
– Apache Spark
– Apache Sqoop
– Apache Tika
– Apache Whirr
– Apache ZooKeeper
§ All other product names, logos, and brands cited herein are the property of their
respective owners
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-4
Chapter Topics
Introduction
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-5
Course Objectives
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-6
Chapter Topics
Introduction
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-7
About Cloudera (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-8
About Cloudera (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-9
CDH
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-10
Cloudera Express
§ Cloudera Express
– Completely free to download and use
§ The best way to get started with Hadoop
§ Includes CDH
§ Includes Cloudera Manager
– End-to-end administration for Hadoop
– Deploy, manage, and monitor your cluster
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-11
Cloudera Enterprise
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-12
Chapter Topics
Introduction
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-13
Logistics
Your instructor will give you details on how to access the course materials
and exercise instructions for the class
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-14
Chapter Topics
Introduction
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-15
Introductions
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-16
Introduction to Apache Hadoop and
the Hadoop Ecosystem
Chapter 2
Introduction to Apache Hadoop and the Hadoop Ecosystem
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-4
What Is Apache Hadoop?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-5
Common Hadoop Use Cases
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-6
Distributed Processing with Hadoop
(Diagram: processing on a Hadoop cluster is performed by engines such as Apache Spark and MapReduce)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-7
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-8
Data Ingest and Storage
§ Hadoop typically ingests data from many sources and in many formats
– Traditional data management systems such as databases
– Logs and other machine generated data (event data)
– Imported files
(Diagram: data from many sources is ingested into storage layers such as HDFS, HBase, and Kudu)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-9
Data Storage: HDFS and Apache HBase
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-10
Data Storage: Apache Kudu
§ Apache Kudu
– Distributed columnar (key-value) storage for structured data
– Supports random access and updating data (unlike HDFS)
– Faster sequential reads than HBase to support SQL-based analytics
– Works directly on native file system; is not built on HDFS
– Integrates with Spark, MapReduce, and Apache Impala
– Created at Cloudera, donated to Apache Software Foundation
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-11
Data Ingest Tools (1)
§ HDFS
– Direct file transfer
§ Apache Sqoop
– High-speed import to HDFS from relational databases (and vice versa)
– Supports many data storage systems
– Examples: Netezza, MongoDB, MySQL, Teradata, and Oracle
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-12
Data Ingest Tools (2)
§ Apache Flume
– Distributed service for ingesting streaming data
– Ideally suited for event data from multiple systems (for example, log files)
§ Apache Kafka
– A high-throughput, scalable messaging system
– Distributed, reliable publish-subscribe system
– Integrates with Flume and Spark Streaming
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-13
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-14
Apache Spark: An Engine for Large-Scale Data Processing
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-15
Hadoop MapReduce: The Original Hadoop Processing Engine
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-16
Apache Pig: Scripting for MapReduce
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-17
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-18
Apache Impala (Incubating): High-Performance SQL
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-19
Apache Hive: SQL on MapReduce or Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-20
Cloudera Search: A Platform for Data Exploration
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-21
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-22
Apache Oozie: Workflow Management
§ Oozie
– Workflow engine for Hadoop jobs
– Defines dependencies between jobs
§ The Oozie server submits the jobs to the cluster in the correct sequence
(Diagram: an example workflow. Start the workflow, then check whether the web server logs are in HDFS. If not, send an e-mail to the administrator. If so, import the sales data with Sqoop, process the data with Spark, and, if today is Sunday, generate the weekly reports with Hive. Then end the workflow.)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-23
Hue: The UI for Hadoop
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-24
Apache Sentry: Hadoop Security
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-25
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-26
Introduction to the Hands-On Exercises
(Loudacre Mobile logo)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-27
Scenario Explanation
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-28
Introduction to Exercises: Classroom Virtual Machine
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-29
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-30
Essential Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-31
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-32
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-33
Hands-on Exercise: Query Hadoop Data with Apache Impala
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-34
Apache Hadoop File Storage
Chapter 3
Apache Hadoop File Storage
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-4
Hadoop Cluster Terminology
(Diagram: a Hadoop cluster is made up of many worker nodes)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-5
Cluster Components
(Diagram: the cluster provides resource management and storage)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-6
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-7
HDFS Basic Concepts (1)
(Diagram: HDFS provides a file system layer on top of disk storage)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-8
HDFS Basic Concepts (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-9
How Files Are Stored
§ Data files are split into blocks (default 128MB), which are distributed at load time
§ Each block is replicated on multiple data nodes (default 3x)
§ The NameNode stores metadata: information about files and blocks
(Diagram: a very large data file is split into Blocks 1 through 4; each block is stored on three data nodes, and the NameNode holds the metadata)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-10
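The numbers above can be sanity-checked with a little shell arithmetic. The sketch below assumes a hypothetical 500 MB file together with the default 128 MB block size and 3x replication factor mentioned on this slide; note that the last block of a file is usually smaller than the block size, since HDFS does not pad partial blocks.

```shell
# How many blocks does a 500 MB file occupy, and how much raw
# cluster storage does it consume after replication?
FILE_MB=500        # hypothetical file size
BLOCK_MB=128       # HDFS default block size
REPLICATION=3      # HDFS default replication factor

# Ceiling division: a partial final block still counts as a block
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
RAW_MB=$(( FILE_MB * REPLICATION ))

echo "blocks=$BLOCKS raw_storage_mb=$RAW_MB"   # blocks=4 raw_storage_mb=1500
```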
Example: Storing and Retrieving Files (1)
(Diagram: two local files, /logs/031515.log and /logs/042316.log, are to be stored on a Hadoop cluster with Nodes A through E)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-11
Example: Storing and Retrieving Files (2)
(Diagram: the files' blocks, numbered 1 through 5, are distributed across Nodes A through E, with each block replicated three times)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-12
Example: Storing and Retrieving Files (3)
(Diagram: a client asks the NameNode which blocks make up /logs/042316.log)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-13
Example: Storing and Retrieving Files (4)
(Diagram: the NameNode responds with the block locations, and the client reads the blocks directly from the data nodes)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-14
HDFS NameNode Availability
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-16
Options for Accessing HDFS
§ From the command line
– $ hdfs dfs
§ In Spark
– By URI, for example: hdfs://nnhost:port/file…
§ Other programs
– Java API
– Used by Hadoop tools such as MapReduce, Impala, Hue, Sqoop, Flume
– RESTful interface
(Diagram: a client uses put and get to move files into and out of the HDFS cluster)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-17
HDFS Command Line Examples (1)
§ Copy file foo.txt from local disk to the user’s directory in HDFS
$ hdfs dfs -put foo.txt
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-18
HDFS Command Line Examples (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-19
HDFS Command Line Examples (3)
§ Delete a file
$ hdfs dfs -rm foo.txt
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-20
The Hue HDFS File Browser
§ The File Browser in Hue lets you view and manage your HDFS directories
and files
– Create, move, rename, modify, upload, download, and delete
directories and files
– View file contents
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-21
HDFS Recommendations
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-22
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-23
Hadoop Data Storage Formats
§ Hadoop and the tools in the Hadoop ecosystem use several different file
formats to store data
§ The most common are
– Text
– SequenceFiles
– Apache Avro data format
– Apache Parquet
§ Which format to use depends on your use case and which tools you use
§ You can also define custom formats
§ HDFS considers files to be simply a sequence of bytes
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-24
Hadoop File Formats: Text Files
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-25
Hadoop File Formats: SequenceFiles
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-26
Hadoop File Formats: Apache Avro Data Files
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-27
Inspecting Avro Data Files with Avro Tools
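Because Avro data files are binary, they are normally inspected with the avro-tools utility. A minimal sketch, assuming avro-tools is installed and file.avro is a hypothetical Avro data file:

```shell
# Print the schema embedded in the file
$ avro-tools getschema file.avro

# Dump the records as JSON (piped through head to limit output)
$ avro-tools tojson file.avro | head
```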
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-28
Columnar Formats
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-29
Hadoop File Formats: Apache Parquet Files
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-30
Inspecting Parquet Files with Parquet Tools
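Parquet files are likewise binary and are usually inspected with the parquet-tools utility. A minimal sketch, assuming parquet-tools is installed and file.parquet is a hypothetical Parquet file:

```shell
# Show the file's schema
$ parquet-tools schema file.parquet

# Show the first few records
$ parquet-tools head -n 5 file.parquet
```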
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-31
Data Format Summary
                        Text   SequenceFile   Avro   Parquet
Binary format                  ✓              ✓      ✓
Embedded schema                               ✓      ✓
Columnar organization                                ✓
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-32
Data Compression
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-33
Compression Codecs
(Diagram: comparison of compression codecs, including LZ4, trading off compression ratio against speed)
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-35
Essential Points
§ The Hadoop Distributed File System (HDFS) is the main storage layer for
Hadoop
§ HDFS chunks data into blocks and distributes them across the cluster when
data is stored
§ HDFS clusters are managed by a single NameNode running on a master
node
§ Access HDFS using Hue, the hdfs command, or the HDFS API
§ The Hadoop ecosystem supports several different file formats
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-36
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-37
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-38
Hands-On Exercise: Access HDFS with the Command Line and
Hue
§ In this exercise, you will
– Create a /loudacre base directory for course exercises
– Practice uploading and viewing data files
§ Please refer to the Hands-On Exercise Manual for instructions
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-39
Data Processing on an Apache Hadoop
Cluster
Chapter 4
Data Processing on an Apache Hadoop Cluster
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-3
Chapter Topics
§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-4
What Is YARN?
§ YARN = Yet Another Resource Negotiator
§ YARN is the Hadoop processing layer that contains
– A resource manager
– A job scheduler
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-5
YARN Daemons
§ ResourceManager (RM)
– Runs on master node
– Global resource scheduler
– Arbitrates system resources between competing applications
– Has a pluggable scheduler to support different algorithms (such as Capacity or Fair Scheduler)
§ NodeManager (NM)
– Runs on worker nodes
– Communicates with RM
– Manages node resources
– Launches containers
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-6
Running Applications on YARN
§ Containers
– Containers allocate a certain amount of resources (memory, CPU cores) on a worker node
– Applications run in one or more containers
– Clients request containers from the RM
§ ApplicationMaster (AM)
– One per application
– Framework/application specific
– Runs in a container
– Requests more containers to run application tasks
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-7
Running an Application on YARN (1)
(Diagram: a cluster of worker nodes, each running a NodeManager and a DataNode)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-8
Running an Application on YARN (2)
(Diagram: a client submits an application to the cluster)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-9
Running an Application on YARN (3)
(Diagram: the ApplicationMaster sends a resource request to the ResourceManager, for example 1 x Node1/1GB/1 core and 1 x Node2/1GB/1 core)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-10
Running an Application on YARN (4)
(Diagram: the ResourceManager allocates containers on the requested worker nodes)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-11
Running an Application on YARN (5)
(Diagram: application tasks run inside the allocated containers)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-12
Running an Application on YARN (6)
(Diagram: the application completes and its containers are released)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-13
Chapter Topics
§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-14
Working with YARN
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-15
The Hue Job Browser
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-16
The YARN Web UI
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-17
ResourceManager UI: Nodes
Cluster Overview
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-18
ResourceManager UI: Applications
Cluster Overview
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-19
ResourceManager UI: Application Detail
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-20
History Server
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-21
YARN Command Line (1)
$ yarn <command>
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-22
YARN Command Line (2)
$ yarn -help
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-23
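A few commonly used yarn subcommands, as a hedged sketch (the application ID below is a made-up placeholder; substitute an ID reported by yarn application -list):

```shell
# List running applications
$ yarn application -list

# Show the status of one application (placeholder ID)
$ yarn application -status application_1234567890123_0001

# Fetch the aggregated logs for a finished application
$ yarn logs -applicationId application_1234567890123_0001
```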
Cloudera Manager
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-24
Chapter Topics
§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-25
Essential Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-26
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-27
Chapter Topics
§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-28
Hands-On Exercise: Run a YARN Job
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-29
Importing Relational Data with Apache
Sqoop
Chapter 5
Importing Relational Data with Apache Sqoop
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-4
What Is Apache Sqoop?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-5
How Does Sqoop Work?
Hadoop Cluster
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-6
Basic Syntax
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-7
Exploring a Database with Sqoop
§ This command will list all tables in the loudacre database in MySQL
$ sqoop list-tables \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser \
--password pw
§ This command uses eval to run a SQL query and display the results
$ sqoop eval \
--query "SELECT * FROM my_table LIMIT 5" \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser \
--password pw
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-8
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-9
Overview of the Import Process
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-10
Importing an Entire Database with Sqoop
§ Import all tables from the database
$ sqoop import-all-tables \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw
§ Use --warehouse-dir to specify a different base directory
$ sqoop import-all-tables \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--warehouse-dir /loudacre
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-11
Importing a Single Table with Sqoop
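A minimal sketch of a single-table import, reusing the connection options shown elsewhere in this chapter; the accounts table name matches the example directory /user/training/accounts mentioned later in the chapter:

```shell
$ sqoop import \
  --connect jdbc:mysql://dbhost/loudacre \
  --username dbuser --password pw \
  --table accounts
```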
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-12
Importing Partial Tables with Sqoop
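Sqoop can also import a subset of a table. The --columns and --where options below are standard Sqoop import options; the specific column list and row filter are illustrative assumptions:

```shell
$ sqoop import \
  --connect jdbc:mysql://dbhost/loudacre \
  --username dbuser --password pw \
  --table accounts \
  --columns "acct_num,first_name,last_name" \
  --where "state = 'CA'"
```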
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-13
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-14
Specifying a File Location
§ By default, Sqoop stores the data in the user’s HDFS home directory
– In a subdirectory corresponding to the table name
– For example /user/training/accounts
§ This example specifies an alternate location
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-15
Specifying an Alternate Delimiter
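By default Sqoop writes comma-delimited text. The --fields-terminated-by option is Sqoop's standard way to choose another delimiter; the tab delimiter here is an illustrative choice:

```shell
$ sqoop import \
  --connect jdbc:mysql://dbhost/loudacre \
  --username dbuser --password pw \
  --table accounts \
  --fields-terminated-by "\t"
```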
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-16
Using Compression with Sqoop
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-17
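A sketch of enabling compression on import; -z (or --compress) enables the default gzip codec, and --compression-codec selects another, such as Snappy:

```
$ sqoop import --table accounts \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
```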
Storing Data in Other Data Formats
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-18
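Sqoop can store imports in formats other than delimited text; --as-parquetfile, --as-avrodatafile, and --as-sequencefile select the file format (table name illustrative):

```
$ sqoop import --table accounts \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--as-parquetfile
```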
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-19
Exporting Data from Hadoop to RDBMS with Sqoop
$ sqoop export \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--export-dir /loudacre/recommender_output \
--update-mode allowinsert \
--table product_recommendations
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-20
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-21
Essential Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-22
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-23
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-24
Hands-On Exercise: Import Data from MySQL Using Apache Sqoop
§ In this exercise, you will
– Use Sqoop to import customer account data from an RDBMS to HDFS
§ Please refer to the Hands-On Exercise Manual for instructions
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-25
Apache Spark Basics
Chapter 6
Apache Spark Basics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-4
What Is Apache Spark?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-5
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-6
Spark Shell
Language: Python
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.8 (default, …)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

Language: Scala
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Spark context available as sc (master = …)
SQL context available as sqlContext.
scala>
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-7
Spark Context
Language: Python
>>> sc.appName
u'PySparkShell'

Language: Scala
…
Spark context available as sc (master = …)
SQL context available as sqlContext.
scala> sc.appName
res0: String = Spark shell
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-8
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-9
RDD (Resilient Distributed Dataset)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-10
Creating an RDD
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-11
Example: A File-Based RDD
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-12
RDD Operations
§ Two broad types of RDD operations
– Actions return a value
– Transformations define a new RDD based on the current one
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-13
RDD Operations: Actions
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-14
RDD Operations: Transformations
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-15
Example: map and filter Transformations
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python
> mydata.map(lambda line: line.upper())
> mydata.filter(lambda line: line.startswith("I"))

Language: Scala
> mydata.map(line => line.toUpperCase())
> mydata.filter(line => line.startsWith("I"))
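The per-element map and filter logic on purplecow.txt can be checked locally in plain Python, with no Spark needed (list comprehensions stand in for the distributed operations):

```python
lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

# map: apply a function to every element (upper-case each line)
upper = [line.upper() for line in lines]

# filter: keep only elements for which the predicate is true
filtered = [line for line in upper if line.startswith("I")]

print(filtered)
```

Three of the four lines survive the filter, matching the output shown on the lazy-execution slides.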
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-16
Lazy Execution (1)
§ Data in RDDs is not processed until an action is performed

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
>
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-17
Lazy Execution (2)
§ Data in RDDs is not processed until an action is performed

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-18
Lazy Execution (3)
§ Data in RDDs is not processed until an action is performed

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD: mydata_uc
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-19
Lazy Execution (4)
§ Data in RDDs is not processed until an action is performed

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD: mydata_filt
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-20
Lazy Execution (5)
§ Data in RDDs is not processed until an action is performed

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD: mydata_filt
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-21
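Lazy execution behaves much like Python generators: building the pipeline computes nothing, and work happens only when a result is demanded. A local analogy (not Spark itself, just an illustration of the evaluation model):

```python
log = []

def read_lines():
    """Stand-in for reading a file: records each read in `log`."""
    for line in ["I've never seen a purple cow.", "I never hope to see one;"]:
        log.append("read")
        yield line

# Building the pipeline (like defining transformations) processes no data
pipeline = (line.upper() for line in read_lines())
assert log == []  # nothing has been read yet

# Pulling a value (like calling an action) triggers the processing
first = next(pipeline)
```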
Chaining Transformations (Scala)
Language: Scala
> val mydata_filt = sc.textFile("purplecow.txt").
  map(line => line.toUpperCase()).
  filter(line => line.startsWith("I"))

is equivalent to

> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-22
Chaining Transformations (Python)
Language: Python
> mydata_filt = sc.textFile("purplecow.txt") \
    .map(lambda line: line.upper()) \
    .filter(lambda line: line.startswith("I"))

is exactly equivalent to

> mydata = sc.textFile("purplecow.txt")
> mydata_uc = mydata.map(lambda line: line.upper())
> mydata_filt = mydata_uc.filter(lambda line: line.startswith("I"))
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-23
RDD Lineage and toDebugString (Scala)
§ Spark maintains each RDD’s lineage—the previous RDDs on which it depends
§ toDebugString shows the lineage of an RDD

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata_filt = sc.textFile("purplecow.txt").
  map(line => line.toUpperCase()).
  filter(line => line.startsWith("I"))
> mydata_filt.toDebugString
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-24
RDD Lineage and toDebugString (Python)
> mydata_filt.toDebugString()
(1) PythonRDD[8] at RDD at …
 |  purplecow.txt MappedRDD[7] at textFile at …[]
 |  purplecow.txt HadoopRDD[6] at textFile at …[]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-25
Pipelining (1)
§ When possible, Spark will perform sequences of transformations by element so no data is stored

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.take(2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-26
Pipelining (2)
§ When possible, Spark will perform sequences of transformations by element so no data is stored

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.take(2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-27
Pipelining (3)
§ When possible, Spark will perform sequences of transformations by element so no data is stored

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.take(2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-28
Pipelining (4)
§ When possible, Spark will perform sequences of transformations by element so no data is stored

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.take(2)
I'VE NEVER SEEN A PURPLE COW.
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-29
Pipelining (5)
§ When possible, Spark will perform sequences of transformations by element so no data is stored

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.take(2)
I'VE NEVER SEEN A PURPLE COW.
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-30
Pipelining (6)
§ When possible, Spark will perform sequences of transformations by element so no data is stored

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.take(2)
I'VE NEVER SEEN A PURPLE COW.
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-31
Pipelining (7)
§ When possible, Spark will perform sequences of transformations by element so no data is stored

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.take(2)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-32
Pipelining (8)
§ When possible, Spark will perform sequences of transformations by element so no data is stored

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.take(2)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-33
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-34
Functional Programming in Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-35
Passing Functions as Parameters
RDD {
map(fn(x)) {
foreach record in rdd
emit fn(record)
}
}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-36
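The pseudocode above describes map as a higher-order function: it takes a function as a parameter and applies it to each record. A minimal local sketch of that idea in plain Python:

```python
def my_map(fn, records):
    """foreach record in records, emit fn(record) (pseudocode made concrete)."""
    result = []
    for record in records:
        result.append(fn(record))
    return result

# Pass an anonymous function as the parameter, just as with RDD.map
doubled = my_map(lambda x: x * 2, [1, 2, 3])
```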
Example: Passing Named Functions
Language: Python
> def toUpper(s):
      return s.upper()
> mydata = sc.textFile("purplecow.txt")
> mydata.map(toUpper).take(2)
Language: Scala
> def toUpper(s: String): String =
{ s.toUpperCase }
> val mydata = sc.textFile("purplecow.txt")
> mydata.map(toUpper).take(2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-37
Anonymous Functions
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-38
Example: Passing Anonymous Functions
§ Python:
> mydata.map(lambda line: line.upper()).take(2)
§ Scala:
> mydata.map(line => line.toUpperCase()).take(2)
OR
> mydata.map(_.toUpperCase()).take(2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-39
Example: Java
Language: Java 7
...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc =
lines.map(new Function<String,String>() {
@Override
public String call(String s) {
return (s.toUpperCase());
}
});
...
Language: Java 8
...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
line -> line.toUpperCase());
...
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-40
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-41
Essential Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-42
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-43
Introduction to Spark Exercises: Choose Your Language
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-44
Hands-On Exercise: Explore RDDs Using the Spark Shell
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-45
Working with RDDs
Chapter 7
Working with RDDs
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-3
Chapter Topics
§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-4
RDDs
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-5
Creating RDDs from Collections
§ Useful when
– Testing
– Generating data programmatically
– Integrating
– Learning
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-6
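The example for this slide was lost in conversion; a collection-based RDD is created with SparkContext.parallelize (shown here in pyspark, with illustrative list contents):

```
> mydata = ["Alice", "Carlos", "Frank", "Barbara"]
> myrdd = sc.parallelize(mydata)
> myrdd.take(2)
['Alice', 'Carlos']
```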
Creating RDDs from Text Files (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-7
Creating RDDs from Text Files (2)
File contents:
I've never seen a purple cow.\n
I never hope to see one;\n
But I can tell you, anyhow,\n
I'd rather see than be one.\n

RDD elements:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-8
Input and Output Formats (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-9
Input and Output Formats (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-10
Using wholeTextFiles (1)
§ sc.wholeTextFiles(directory)
– Maps the entire contents of each file in a directory to a single RDD element
– Works only for small files (each element must fit in memory)

File: file2.json
{
"firstName":"Barney",
"lastName":"Rubble",
"userid":"234"
}
(file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"} )
(file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"} )
(file3.json,… )
(file4.json,… )
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-11
Using wholeTextFiles (2)
Language: Python
> import json
> myrdd1 = sc.wholeTextFiles(mydir)
> myrdd2 = myrdd1 \
.map(lambda (fname,s): json.loads(s))
> for record in myrdd2.take(2):
>     print record.get("firstName",None)

Output:
Fred
Barney
Language: Scala
> import scala.util.parsing.json.JSON
> val myrdd1 = sc.wholeTextFiles(mydir)
> val myrdd2 = myrdd1.
map(pair => JSON.parseFull(pair._2).get.
asInstanceOf[Map[String,String]])
> for (record <- myrdd2.take(2))
println(record.getOrElse("firstName",null))
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-12
Chapter Topics
§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-13
Some Other General RDD Operations
§ Single-RDD Transformations
– flatMap maps one element in the base RDD to multiple elements
– distinct filters out duplicates
– sortBy uses the provided function to sort
§ Multi-RDD Transformations
– intersection creates a new RDD with all elements in both original RDDs
– union adds all elements of two RDDs into a single new RDD
– zip pairs each element of the first RDD with the corresponding element of the second
– subtract removes the elements in the second RDD from the first RDD
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-14
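The semantics of these multi-RDD transformations can be sketched locally with plain Python lists, using the same city data as the examples that follow (note that, as in Spark, union keeps duplicates):

```python
rdd1 = ["Chicago", "Boston", "Paris", "San Francisco", "Tokyo"]
rdd2 = ["San Francisco", "Boston", "Amsterdam", "Mumbai", "McMurdo Station"]

# union: all elements of both collections, duplicates kept
union = rdd1 + rdd2

# intersection: elements present in both
intersection = [city for city in rdd1 if city in rdd2]

# subtract: elements of rdd1 that are not in rdd2
subtract = [city for city in rdd1 if city not in rdd2]

# zip: pair corresponding elements of the two collections
zipped = list(zip(rdd1, rdd2))
```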
Example: flatMap and distinct
> sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \ Language: Python
.distinct()
> sc.textFile(file).
flatMap(line => line.split(' ')). Language: Scala
distinct()
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-15
Examples: Multi-RDD Transformations (1)
rdd1: Chicago, Boston, Paris, San Francisco, Tokyo
rdd2: San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station

rdd1.subtract(rdd2):
Chicago
Paris
Tokyo

rdd1.zip(rdd2):
(Chicago,San Francisco)
(Boston,Boston)
(Paris,Amsterdam)
(San Francisco,Mumbai)
(Tokyo,McMurdo Station)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-16
Examples: Multi-RDD Transformations (2)
rdd1: Chicago, Boston, Paris, San Francisco, Tokyo
rdd2: San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station

rdd1.union(rdd2):
Chicago
Boston
Paris
San Francisco
Tokyo
San Francisco
Boston
Amsterdam
Mumbai
McMurdo Station

rdd1.intersection(rdd2):
San Francisco
Boston
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-17
Some Other General RDD Operations
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-18
Chapter Topics
§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-19
Essential Points
§ RDDs can be created from files, parallelized data in memory, or other RDDs
§ sc.textFile reads newline-delimited text, one line per RDD element
§ sc.wholeTextFiles reads multiple files, one file per RDD element
§ Generic RDDs can consist of any type of data
§ Generic RDDs provide a wide range of transformation operations
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-20
Chapter Topics
§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-21
Hands-On Exercise: Process Data Files with Apache Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-22
Aggregating Data with Pair RDDs
Chapter 8
Aggregating Data with Pair RDDs
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-4
Pair RDDs
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-5
Creating Pair RDDs
§ The first step in most workflows is to get the data into key/value form
– What should the RDD be keyed on?
– What is the value?
§ Commonly used functions to create pair RDDs
– map
– flatMap / flatMapValues
– keyBy
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-6
Example: A Simple Pair RDD
Language: Scala
> val users = sc.textFile(file).
  map(line => line.split('\t')).
  map(fields => (fields(0),fields(1)))

File contents:
user001\tFred Flintstone
user090\tBugs Bunny
user111\tHarry Potter
…

Result:
(user001,Fred Flintstone)
(user090,Bugs Bunny)
(user111,Harry Potter)
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-7
Example: Keying Web Logs by User ID
Language: Python
> sc.textFile(logfile) \
.keyBy(lambda line: line.split(' ')[2])
Language: Scala
> sc.textFile(logfile).
keyBy(line => line.split(' ')(2))
The user ID is the third space-separated field:
56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-8
Question 1: Pairs with Complex Values
Input:
00210 43.005895 -71.013202
00211 43.005895 -71.013202
00212 43.005895 -71.013202
00213 43.005895 -71.013202
00214 43.005895 -71.013202
…

?

(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-9
Answer 1: Pairs with Complex Values
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],(fields[1],fields[2])))
Language: Scala
> sc.textFile(file).
map(line => line.split('\t')).
map(fields => (fields(0),(fields(1),fields(2))))
Input:
00210 43.005895 -71.013202
01014 42.170731 -72.604842
01062 42.324232 -72.67915
01263 42.3929 -73.228483
…

Result:
(00210,(43.005895,-71.013202))
(01014,(42.170731,-72.604842))
(01062,(42.324232,-72.67915))
(01263,(42.3929,-73.228483))
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-10
Question 2: Mapping Single Rows to Multiple Pairs (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-11
Question 2: Mapping Single Rows to Multiple Pairs (2)
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
(00001,(sku010,sku933,sku022))
(00002,(sku912,sku331))
(00003,(sku888,sku022,sku010,sku594))
(00004,(sku411))
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-12
Answer 2: Mapping Single Rows to Multiple Pairs (1)
Language: Python
> sc.textFile(file)
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-13
Answer 2: Mapping Single Rows to Multiple Pairs (2)
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t'))
Input:
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

Result:
[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]

Note that split returns 2-element arrays, not pairs/tuples
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-14
Answer 2: Mapping Single Rows to Multiple Pairs (3)
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))
Input:
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

Previous step:
[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]

Result:
(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)

Map array elements to tuples to produce a pair RDD
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-15
Answer 2: Mapping Single Rows to Multiple Pairs (4)
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1])) \
.flatMapValues(lambda skus: skus.split(':'))

Input:
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

Result:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-16
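The flatMapValues step can be checked locally: it applies the function to each value and emits one (key, value) pair per result, repeating the key. A plain-Python sketch:

```python
pairs = [("00001", "sku010:sku933:sku022"), ("00002", "sku912:sku331")]

# flatMapValues: split each value on ':' and emit one pair per sku,
# keeping the original key for every emitted pair
flat = [(key, sku) for (key, value) in pairs for sku in value.split(":")]
```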
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-17
Map-Reduce
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-18
Map-Reduce in Spark
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-19
Map-Reduce Example: Word Count
Input Data:
the cat sat on the mat
the aardvark sat on the sofa

?

Result:
on 2
sofa 1
mat 1
aardvark 1
the 4
cat 1
sat 2
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-20
Example: Word Count (1)
Language: Python
> counts = sc.textFile(file)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-21
Example: Word Count (2)
Language: Python
> counts = sc.textFile(file) \
.flatMap(lambda line: line.split(' '))
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-22
Example: Word Count (3)
Language: Python
> counts = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word,1)) Key-
Value
Pairs
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-23
Example: Word Count (4)
Language: Python
> counts = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word,1)) \
.reduceByKey(lambda v1,v2: v1+v2)
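The pipeline above can be simulated in plain Python, without Spark, to check what each step produces; `word_count` and the sample lines below are illustrative, not part of the Spark API:

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split every line into words, producing one flat list
    words = [w for line in lines for w in line.split(' ')]
    # map: pair each word with the count 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the 1s for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

data = ["the cat sat on the mat",
        "the aardvark sat on the sofa"]
result = word_count(data)
# result == {'the': 4, 'cat': 1, 'sat': 2, 'on': 2,
#            'mat': 1, 'aardvark': 1, 'sofa': 1}
```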
reduceByKey (1)
Language: Python
counts = sc.textFile(file) \
    .flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word,1)) \
    .reduceByKey(lambda v1,v2: v1+v2)

§ The function passed to reduceByKey combines two values for the same key
– The function must be binary

[Diagram: the four (the,1) pairs are combined stepwise: (the,1) and (the,1) give (the,2), then (the,3), then (the,4); the other keys reduce to (cat,1), (sat,2), (on,2), (mat,1), (sofa,1), (aardvark,1)]
reduceByKey (2)
Language: Python
counts = sc.textFile(file) \
    .flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word,1)) \
    .reduceByKey(lambda v1,v2: v1+v2)

§ The function might be called in any order, therefore it must be
– Commutative: x+y = y+x
– Associative: (x+y)+z = x+(y+z)

[Diagram: the four (the,1) pairs can also be combined as two pairs of (the,2) and then (the,4); the result is the same regardless of the order of combination]
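Why commutativity and associativity matter can be checked in plain Python. The `reduce_by_key` helper below is a sketch of the idea, not the Spark API: because `+` has both properties, feeding the pairs in a different order still gives the same totals.

```python
from functools import reduce

def reduce_by_key(f, pairs):
    # group the values by key, then fold each group with f
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return {k: reduce(f, vs) for k, vs in groups.items()}

pairs = [("the", 1), ("cat", 1), ("the", 1),
         ("the", 1), ("sat", 1), ("the", 1)]
add = lambda v1, v2: v1 + v2

forward = reduce_by_key(add, pairs)
backward = reduce_by_key(add, list(reversed(pairs)))
# both orders give {'the': 4, 'cat': 1, 'sat': 1}
```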
Word Count Recap (The Scala Version)
Why Do We Care about Counting Words?
Pair RDD Operations
Example: Pair RDD Operations
Pair RDD:
(00004,sku411)
(00003,sku888)
(00001,sku010)
(00003,sku022)
(00001,sku933)
(00003,sku010)
(00001,sku022)
(00003,sku594)
(00002,sku912)
(00002,sku331)
…

After groupByKey:
(00002,[sku912,sku331])
(00001,[sku022,sku010,sku933])
(00003,[sku888,sku022,sku010,sku594])
(00004,[sku411])
Example: Joining by Key
(Casablanca,($3.7M,1942))
(Star Wars,($775M,1977))
(Annie Hall,($38M,1977))
(Argo,($232M,2012))
…
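What joining by key does with the movie data above can be sketched in plain Python (the `join` helper below is illustrative, not the Spark API): for each key present in both datasets, it emits the key with the pair of values.

```python
def join(left, right):
    # inner join of two key-value lists by key
    rmap = {}
    for k, v in right:
        rmap.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in rmap.get(k, [])]

gross = [("Casablanca", "$3.7M"), ("Star Wars", "$775M"), ("Annie Hall", "$38M")]
years = [("Casablanca", 1942), ("Star Wars", 1977), ("Annie Hall", 1977)]
joined = join(gross, years)
# [('Casablanca', ('$3.7M', 1942)),
#  ('Star Wars', ('$775M', 1977)),
#  ('Annie Hall', ('$38M', 1977))]
```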
Other Pair Operations
A Common Join Pattern
Example: Join Web Log with Knowledge Base Documents (1)
weblogs
56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …
221.78.60.155 - 45402 "GET /titanic_4000_sales.html HTTP/1.0" …
65.187.255.81 - 14242 "GET /KBDOC-00107.html HTTP/1.0" …
…
Example: Join Web Log with Knowledge Base Documents (2)
§ Steps
1. Map separate datasets into key-value pair RDDs
a. Map web log requests to (docid,userid)
b. Map KB Doc index to (docid,title)
2. Join by key: docid
3. Map joined data into the desired format: (userid,title)
4. Further processing: group titles by User ID
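The four steps can be walked through end to end in plain Python on the sample records shown on the following slides; the sample data and helper names below are illustrative:

```python
import re

def get_request_doc(s):
    # step 1a helper: pull the KB document id out of a request line
    m = re.search(r'KBDOC-[0-9]*', s)
    return m.group() if m else None

logs = ['56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0"',
        '203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0"']
kb = [("KBDOC-00157", "Ronin Novelty Note 3 - Back up files"),
      ("KBDOC-00230", "Sorrento F33L - Transfer Contacts")]

# step 1a: map requests to (docid, userid); the user id is the third field
kbreqs = [(get_request_doc(l), l.split(' ')[2]) for l in logs]
# steps 2 and 3: join on docid, keep (userid, title)
titles = dict(kb)
titlereqs = [(userid, titles[docid]) for docid, userid in kbreqs if docid in titles]
# step 4: group titles by user id
byuser = {}
for userid, title in titlereqs:
    byuser.setdefault(userid, []).append(title)
# byuser['99788'] == ['Ronin Novelty Note 3 - Back up files']
```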
Step 1a: Map Web Log Requests to (docid,userid)
Language: Python
> import re
> def getRequestDoc(s):
return re.search(r'KBDOC-[0-9]*',s).group()
Step 1b: Map KB Index to (docid,title)
Language: Python
> kblist = sc.textFile(kblistfile) \
.map(lambda line: line.split(':')) \
.map(lambda fields: (fields[0],fields[1]))
kblist
(KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00050,Titanic 1000 - Transfer Contacts)
(KBDOC-00107,MeeToo 5.0 - Transfer Contacts)
…
Step 2: Join by Key docid
Language: Python
> titlereqs = kbreqs.join(kblist)
kbreqs:
(KBDOC-00157,99788)
(KBDOC-00230,25254)
(KBDOC-00107,14242)
…

kblist:
(KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00050,Titanic 1000 - Transfer Contacts)
(KBDOC-00107,MeeToo 5.0 - Transfer Contacts)
…
Step 3: Map Result to Desired Format (userid,title)
Language: Python
> titlereqs = kbreqs.join(kblist) \
.map(lambda (docid,(userid,title)): (userid,title))
Step 4: Continue Processing—Group Titles by User ID
Language: Python
> titlereqs = kbreqs.join(kblist) \
.map(lambda (docid,(userid,title)): (userid,title)) \
.groupByKey()
Example Output
Language: Python
> for (userid,titles) in titlereqs.take(10):
print 'user id: ',userid
for title in titles: print '\t',title
Aside: Anonymous Function Parameters
§ Python and Scala pattern matching can help improve code readability
Language: Python
> map(lambda (docid,(userid,title)): (userid,title))
Language: Scala
> map(pair => (pair._2._1,pair._2._2))
OR
Language: Scala
> map{case (docid,(userid,title)) => (userid,title)}
(KBDOC-00157,(99788,…title…)) → (99788,…title…)
(KBDOC-00230,(25254,…title…)) → (25254,…title…)
(KBDOC-00107,(14242,…title…)) → (14242,…title…)
…
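One caveat when moving this code forward: `lambda (docid,(userid,title)): …` uses tuple parameter unpacking, which works in Python 2 but was removed in Python 3 (PEP 3113). A Python 3-compatible version takes a single argument and unpacks inside the body:

```python
# Python 2 only:  lambda (docid,(userid,title)): (userid,title)
# Python 3: take one pair and index into it instead
extract = lambda pair: (pair[1][0], pair[1][1])

row = ("KBDOC-00157", ("99788", "Ronin Novelty Note 3 - Back up files"))
# extract(row) == ("99788", "Ronin Novelty Note 3 - Back up files")
```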
Essential Points
§ Pair RDDs are a special form of RDD consisting of key-value pairs (tuples)
§ Spark provides several operations for working with pair RDDs
§ Map-reduce is a generic programming model for distributed processing
– Spark implements map-reduce with pair RDDs
– Hadoop MapReduce and other implementations are limited to a single
map and single reduce phase per job
– Spark allows flexible chaining of map and reduce operations
– Spark provides operations to easily perform common map-reduce
algorithms like joining, sorting, and grouping
Hands-On Exercise: Use Pair RDDs to Join Two Datasets
Writing and Running Apache Spark
Applications
Chapter 9
Spark Shell vs. Spark Applications
The Spark Context
Python Example: Word Count
import sys
from pyspark import SparkContext
if __name__ == "__main__":
if len(sys.argv) < 2:
print >> sys.stderr, "Usage: WordCount.py <file>"
exit(-1)
sc = SparkContext()
counts = sc.textFile(sys.argv[1]) \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word,1)) \
.reduceByKey(lambda v1,v2: v1+v2)
sc.stop()
Scala Example: Word Count
import org.apache.spark.SparkContext
object WordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: WordCount <file>")
System.exit(1)
    }

    // create the context and run the word count (mirrors the Python version)
    val sc = new SparkContext()
    val counts = sc.textFile(args(0)).
      flatMap(line => line.split(' ')).
      map(word => (word,1)).
      reduceByKey((v1,v2) => v1+v2)

    sc.stop()
}
}
Building a Spark Application: Scala or Java
§ Scala or Java Spark applications must be compiled and assembled into JAR
files
– JAR file will be passed to worker nodes
§ Apache Maven is a popular build tool
– For specific setting recommendations, see the Spark Programming
Guide
§ Build details will differ depending on
– Version of Hadoop (HDFS)
– Deployment platform (YARN, Mesos, Spark Standalone)
§ Consider using an Integrated Development Environment (IDE)
– IntelliJ or Eclipse are two popular examples
– Can run Spark locally in a debugger
Running a Spark Application
Spark Application Cluster Options
Supported Cluster Resource Managers
§ Hadoop YARN
– Included in CDH
– Most common for production sites
– Allows sharing cluster resources with other applications
§ Spark Standalone
– Included with Spark
– Easy to install and run
– Limited configurability and scalability
– No security support
– Useful for learning, testing, development, or small systems
§ Apache Mesos
– First platform supported by Spark
– Not supported by Cloudera
How Spark Runs on YARN: Client Mode (1)
[Diagram: a YARN cluster with a Resource Manager, a Name Node, and Nodes A-D, each running a NodeManager and a DataNode; Application Master 1 and two Executors run in containers on the worker nodes]
How Spark Runs on YARN: Client Mode (3)
[Diagram: in client mode the Driver Program (Spark Context) runs on the client machine; a second application runs alongside the first with its own Application Master (Application Master 2) and its own Executors on the worker nodes]
How Spark Runs on YARN: Cluster Mode (1)
[Diagram: in cluster mode, spark-submit sends the application to the Resource Manager; the Driver Program (Spark Context) runs inside the Application Master on a worker node, with Executors in containers on Nodes A-D]
Running a Spark Application Locally
Language: Python
$ spark-submit --master 'local[3]' \
WordCount.py fileURL
Language: Scala/Java
$ spark-submit --master 'local[3]' \
  --class WordCount MyJarFile.jar fileURL
Running a Spark Application on a Cluster
Language: Python
$ spark-submit --master yarn-cluster \
WordCount.py fileURL
Language: Scala/Java
$ spark-submit --master yarn-cluster \
  --class WordCount MyJarFile.jar fileURL
Starting the Spark Shell on a Cluster
Language: Python
$ pyspark --master yarn
Language: Scala
$ spark-shell --master yarn
Options when Submitting a Spark Application to a Cluster
Dynamic Resource Allocation (1)
Dynamic allocation allows a Spark application to add or release executors as needed.

[Diagram: the Driver Program (Spark Context) runs with Application Master 1; the application starts with Executors on Nodes A and B]
Dynamic Resource Allocation (2)
[Diagram: as the workload grows, additional Executors are allocated on Nodes C and D]
Dynamic Resource Allocation (4)
The Spark Application Web UI
Accessing the Spark UI
Viewing Spark Job History (1)
Viewing Spark Job History (2)
Essential Points
Building and Running Scala Applications in the
Hands-On Exercises
§ Basic Apache Maven projects are provided in the exercise directory
– stubs: starter Scala file, do exercises here
– solution: final exercise solution
Hands-On Exercise: Write and Run an Apache Spark Application
Configuring Apache Spark Applications
Chapter 10
Spark Application Configuration
Spark Application Configuration Options
Declarative Configuration Options
§ spark-submit script
– Examples:
– spark-submit --driver-memory 500M
– spark-submit --conf spark.executor.cores=4
§ Properties file
– Tab- or space-separated list of properties and values
– Load with spark-submit --properties-file filename
– Example:
spark.master yarn-cluster
spark.local.dir /tmp
spark.ui.port
§ Site defaults properties file
– SPARK_HOME/conf/spark-defaults.conf
– Template file provided
Setting Configuration Properties Programmatically
SparkConf Example (Python)
import sys
from pyspark import SparkContext
from pyspark import SparkConf
if __name__ == "__main__":
if len(sys.argv) < 2:
print >> sys.stderr, "Usage: WordCount <file>"
exit(-1)
sconf = SparkConf() \
.setAppName("Word Count") \
.set("spark.ui.port","4141")
sc = SparkContext(conf=sconf)
counts = sc.textFile(sys.argv[1]) \
.flatMap(lambda line: line.split()) \
.map(lambda w: (w,1)) \
.reduceByKey(lambda v1,v2: v1+v2)
SparkConf Example (Scala)
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object WordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: WordCount <file>")
System.exit(1)
    }

    // configure and create the context (mirrors the Python version)
    val sconf = new SparkConf().
      setAppName("Word Count").
      set("spark.ui.port","4141")
    val sc = new SparkContext(sconf)
    val counts = sc.textFile(args(0)).
      flatMap(line => line.split(' ')).
      map(w => (w,1)).
      reduceByKey((v1,v2) => v1+v2)
  }
}
Viewing Spark Properties
Spark Logging
Spark Log Files (1)
Spark Log Files (2)
Configuring Spark Logging (1)
§ Logging levels can be set for the cluster, for individual applications, or even
for specific components or subsystems
§ Default for machine: SPARK_HOME/conf/log4j.properties*
– Start by copying log4j.properties.template
log4j.properties.template (the default for all Spark applications):

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
…
# Default override for the Spark shell (Scala)
log4j.logger.org.apache.spark.repl.Main=WARN
…

* Located in /usr/lib/spark/conf on the course VM
Configuring Spark Logging (2)
Configuring Spark Logging (3)
Language: Python/Scala
> sc.setLogLevel("ERROR")
Essential Points
Hands-On Exercise: Configure an Apache Spark Application
Parallel Processing in Apache Spark
Chapter 11
Review of Spark on YARN (1)
$ spark-submit \
  --master yarn-client \
  --class MyClass \
  MyApp.jar

[Diagram: the command is submitted from a client to the Resource Manager; the cluster has a Name Node and several worker nodes]
Review of Spark on YARN (2)
$ spark-submit \
  --master yarn-client \
  --class MyClass \
  MyApp.jar

[Diagram: the Driver Program (Spark Context) starts on the client; the Resource Manager allocates containers on the worker nodes]
Review of Spark on YARN (3)
$ spark-submit \
  --master yarn-client \
  --class MyClass \
  MyApp.jar

[Diagram: an Executor starts inside each container on the worker nodes]
RDDs on a Cluster
[Diagram: each Executor holds RDD partitions in memory, such as partition rdd_1_2]
File Partitioning: Single Files
§ Partitions from single files
– Partitions based on size
– You can optionally specify a minimum number of partitions:
textFile(file, minPartitions)

Language: Python/Scala
sc.textFile("myfile",3)

[Diagram: the file is split into three partitions, distributed across executors]
File Partitioning: Multiple Files
§ sc.textFile("mydir/*")
– Each file becomes (at least) one partition
§ sc.wholeTextFiles("mydir")
– For many small files
– Creates a key-value PairRDD
– key = file name
– value = file contents
Operating on Partitions
Example: foreachPartition
Language: Python
def printFirstLine(i): print next(i)

myrdd = …
myrdd.foreachPartition(lambda i: printFirstLine(i))
Language: Scala
def printFirstLine(iter: Iterator[Any]) = {
println(iter.next)
}
val myrdd = …
myrdd.foreachPartition(printFirstLine)
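The partition-at-a-time behavior can be imitated in plain Python by treating an RDD as a list of partitions; the `foreach_partition` helper below is a sketch, not the Spark API. The function is called once per partition and receives an iterator over that partition's elements.

```python
def foreach_partition(partitions, f):
    # call f once per partition, passing an iterator over its elements
    for part in partitions:
        f(iter(part))

first_lines = []
def print_first_line(it):
    # consume only the first element, like the slide's printFirstLine
    first_lines.append(next(it))

foreach_partition([["line1", "line2"], ["line3", "line4"]], print_first_line)
# first_lines == ["line1", "line3"]
```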
HDFS and Data Locality (1)
[Diagram: a four-node cluster, Nodes A-D]
HDFS and Data Locality (2)
[Diagram: HDFS file mydata is stored as Block 1 on Node A, Block 2 on Node B, and Block 3 on Node C]
HDFS and Data Locality (3)
sc.textFile("hdfs://…mydata").collect()
[Diagram: the Driver Program (Spark Context) starts an Executor on each of Nodes A-C, the nodes holding Blocks 1-3 of HDFS file mydata]
HDFS and Data Locality (4)
[Diagram: Executors run on the nodes holding the HDFS blocks]
HDFS and Data Locality (5)
[Diagram: the driver sends a task to each Executor; each task reads the HDFS block stored on its own node into an RDD partition]
HDFS and Data Locality (6)
[Diagram: each Executor now holds its partition of the RDD, built from the local HDFS block]
Parallel Operations on Partitions
Example: Average Word Length by Letter (1)
Language: Python
> avglens = sc.textFile(file)
[Diagram: the RDD is read from HDFS file mydata]
Example: Average Word Length by Letter (2)
Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' '))
Example: Average Word Length by Letter (3)
Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word)))
Example: Average Word Length by Letter (4)
Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word))) \
.groupByKey()
[Diagram: groupByKey adds a fourth RDD, grouping the word lengths by first letter]
Example: Average Word Length by Letter (5)
Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))
[Diagram: the final map produces an RDD of (letter, average length) pairs at the end of the chain]
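The pipeline above can be checked without a cluster. The plain-Python sketch below (not the course code; the sample lines are invented for illustration) applies the same split/pair/group/average steps to an in-memory list of lines:

```python
# Plain-Python sketch of the average-word-length-by-letter pipeline.
# In Spark the same steps are flatMap, map, groupByKey, and map.
lines = ["the quick brown fox", "ten tall trees"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split(' ')]

# map: pair each word's first letter with its length
pairs = [(w[0], len(w)) for w in words]

# groupByKey: collect the lengths for each starting letter
groups = {}
for letter, length in pairs:
    groups.setdefault(letter, []).append(length)

# map: average the lengths for each letter
avglens = {k: sum(v) / float(len(v)) for k, v in groups.items()}
print(avglens)
```

Each word starting with "t" here has lengths 3, 3, 4, 5, so the "t" entry averages to 3.75.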
Chapter Topics
Stages
§ Operations that can run on the same partition are executed in stages
§ Tasks within a stage are pipelined together
§ Developers should be aware of stages to improve performance
Spark Execution: Stages (1)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
> avglens.saveAsTextFile("avglen-output")
[Diagram: the lineage is divided at the shuffle into Stage 0 (textFile, flatMap, map — three pipelined RDDs) and Stage 1 (groupByKey, map)]
Spark Execution: Stages (2)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
> avglens.saveAsTextFile("avglen-output")
[Diagram: Stage 0 runs as Tasks 1-4, one per input partition; Stage 1 runs as Tasks 5-6]
Spark Execution: Stages (3)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
> avglens.saveAsTextFile("avglen-output")
[Diagram: Stage 0 (Tasks 1-4) and Stage 1 (Tasks 5-6) as the job executes]
Spark Execution: Stages (4)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
> avglens.saveAsTextFile("avglen-output")
[Diagram: Stage 0 (Tasks 1-4) and Stage 1 (Tasks 5-6) after execution completes]
Summary of Spark Terminology
[Diagram: a job is divided into stages at shuffle boundaries; each stage is executed as a set of tasks, one per RDD partition]
How Spark Calculates Stages
Controlling the Level of Parallelism
§ The spark.default.parallelism property sets the default number of partitions used by shuffle operations such as reduceByKey, for example:
spark.default.parallelism 10
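As a hedged illustration of one way to set this (the property name is standard; the application file name myapp.py is a placeholder), the value can be passed at submit time:

```shell
# Set the default number of partitions for shuffle operations
# (value 10 matches the slide; myapp.py is a placeholder name)
spark-submit \
  --conf spark.default.parallelism=10 \
  myapp.py
```

Many pair-RDD operations, such as reduceByKey, also accept an explicit numPartitions argument that overrides this default for a single operation.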
Viewing Stages in the Spark Application UI (1)
Viewing Stages in the Spark Application UI (2)
Viewing Stages in the Spark Application UI (3)
Viewing the Stages Using toDebugString (Scala)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
> avglens.toDebugString
Indents in the output indicate stages (shuffle boundaries)
Viewing the Stages Using toDebugString (Python)
Language: Python
> avglens = sc.textFile(myfile) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))
> print avglens.toDebugString()

Indents in the output indicate stages (shuffle boundaries)
Spark Task Execution (1)
[Diagram: the driver's SparkContext sends the Stage 0 tasks to the executors; each task reads the HDFS block local to its node]
Spark Task Execution (2)
[Diagram sequence: the Stage 0 tasks run in the executors holding HDFS Blocks 1-3, each writing shuffle data on its own node; the Stage 1 tasks (Tasks 5 and 6) then fetch that shuffle data and write the result partitions to output files part-00000 and part-00001]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-48
Spark Task Execution (Alternate Ending)
[Diagram: alternatively, the Stage 1 tasks (Tasks 5 and 6) return their results to the driver program rather than writing output files, as with an action such as collect]
Chapter Topics
Essential Points
Chapter Topics
Hands-On Exercise: View Jobs and Stages in the Spark
Application UI
§ In this exercise, you will
– Use the Spark Application UI to view how jobs, stages, and tasks are executed
§ Please refer to the Hands-On Exercise Manual for instructions
RDD Persistence
Chapter 12
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
RDD Persistence
Chapter Topics
RDD Persistence
§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD
Lineage Example (1)
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ Each transformation operation creates a new child RDD

Language: Python
Lineage Example (2)
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ Each transformation operation creates a new child RDD

RDD[1] (mydata)

Language: Python
> mydata = sc.textFile("purplecow.txt")
Lineage Example (3)
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ Each transformation operation creates a new child RDD

RDD[1] (mydata)

Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))

RDD[2]
RDD[3] (myrdd)
Lineage Example (4)
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ Spark keeps track of the parent RDD for each new RDD
§ Child RDDs depend on their parents

RDD[1] (mydata)

Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))

RDD[2]
RDD[3] (myrdd)
Lineage Example (5)
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ Action operations execute the parent transformations

RDD[1] (mydata)
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))
> myrdd.count()
3

RDD[2]
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

RDD[3] (myrdd)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
Lineage Example (6)
§ Each action re-executes the lineage transformations starting with the base RDD

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[3] (myrdd)

Lineage Example (7)

§ Each action re-executes the lineage transformations starting with the base RDD

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[3] (myrdd)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
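The re-execution behavior can be mimicked in plain Python (a hedged sketch, not Spark code): each "RDD" below is a function that recomputes its result from its parent every time an action runs, just as Spark re-runs the lineage when nothing is persisted.

```python
# Plain-Python sketch of lineage: each "RDD" is a thunk that
# recomputes from its parent on every action.
poem = ["I've never seen a purple cow.",
        "I never hope to see one;",
        "But I can tell you, anyhow,",
        "I'd rather see than be one."]

recomputations = []  # records how often the base data is re-read

def mydata():
    recomputations.append(1)      # simulate re-reading the file
    return list(poem)

def myrdd():
    # map(upper) then filter(startswith('I')), as in the slides
    return [s for s in (line.upper() for line in mydata())
            if s.startswith('I')]

count = len(myrdd())     # first action: runs the whole lineage
again = len(myrdd())     # second action: runs it all again
print(count, len(recomputations))
```

Both actions return 3, and the base data is read twice, once per action.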
Chapter Topics
RDD Persistence
§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD
RDD Persistence (1)
§ Persisting an RDD saves the data (in memory, by default)

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
RDD Persistence (2)
§ Persisting an RDD saves the data (in memory, by default)

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[2] (myrdd1)
RDD Persistence (3)
§ Persisting an RDD saves the data (in memory, by default)

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
RDD Persistence (4)
§ Persisting an RDD saves the data (in memory, by default)

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[3] (myrdd2)
RDD Persistence (5)
§ Persisting an RDD saves the data (in memory, by default)

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[3] (myrdd2)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
RDD Persistence (6)
§ Subsequent operations use saved data

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
RDD Persistence (7)
§ Subsequent operations use saved data

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
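The effect of persisting can be illustrated in plain Python (a sketch, not the Spark API): a tiny "RDD" class that recomputes on every action unless persist() has saved the result. Note one simplification: in real Spark, persist() is lazy and the data is saved when the next action runs; here it materializes eagerly for brevity.

```python
# Plain-Python sketch of persistence: without persist() every
# action recomputes the lineage; after persist() the saved
# result is reused.
class TinyRDD(object):
    def __init__(self, compute):
        self._compute = compute     # lineage: how to build the data
        self._saved = None          # persisted copy, if any
        self.computations = 0       # how often the lineage actually ran

    def persist(self):
        # eager for simplicity; real Spark persist() is lazy
        self.computations += 1
        self._saved = self._compute()
        return self

    def collect(self):              # stand-in for any action
        if self._saved is not None:
            return self._saved
        self.computations += 1
        return self._compute()

rdd = TinyRDD(lambda: [s.upper() for s in ["a", "b"]])
rdd.collect(); rdd.collect()        # two actions: two computations
unpersisted_runs = rdd.computations

rdd.persist()
rdd.collect(); rdd.collect()        # actions now reuse the saved data
persisted_runs = rdd.computations   # only one more run, for persist()
```

Before persisting, every action triggers a recomputation; afterwards, only persist() itself computes and both actions read the saved copy.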
Memory Persistence
Chapter Topics
RDD Persistence
§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD
Persistence and Fault-Tolerance
Distributed Persistence
[Diagram: persisted RDD partitions (such as rdd_1_1) are stored in executor memory across the worker nodes]
RDD Fault-Tolerance (1)
[Diagram: the executor on Node B fails, losing its in-memory partition; Node A still holds partition rdd_1_0]
RDD Fault-Tolerance (2)
§ The driver starts a new task to recompute the partition on a different node
§ Lineage is preserved, data is never lost
[Diagram: a new task on the Node C executor recomputes the lost partition rdd_1_1 from its lineage]
Persistence Levels
Persistence Levels: Storage Location
Language: Scala
> import org.apache.spark.storage.StorageLevel
> myrdd.persist(StorageLevel.DISK_ONLY)
Persistence Levels: Memory Format
Persistence Levels: Partition Replication
Default Persistence Levels
myrdd.cache()
is equivalent to
myrdd.persist()
Disk Persistence
[Diagram: with disk persistence, tasks write partitions rdd_0_0 and rdd_0_1 to local disk on their worker nodes]
Disk Persistence with Replication (1)
[Diagram: with replication, partition rdd_0_1 is written to disk on both Node B and Node C]
Disk Persistence with Replication (2)
[Diagram: if Node B becomes unavailable, a task on Node C uses the replica of rdd_0_1]
When and Where to Persist
Changing Persistence Options
Chapter Topics
RDD Persistence
§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD
Essential Points
Chapter Topics
RDD Persistence
§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD
Hands-On Exercises: Persist an RDD
Common Patterns in Apache Spark
Data Processing
Chapter 13
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
Common Patterns in Apache Spark Data Processing
Chapter Topics
Common Spark Use Cases (1)
Common Spark Use Cases (2)
§ Risk analysis
– “How likely is this borrower to pay back a loan?”
§ Recommendations
– “Which products will this customer enjoy?”
§ Predictions
– “How can we prevent service outages instead of simply reacting to
them?”
§ Classification
– “How can we tell which mail is spam and which is legitimate?”
Spark Examples
Example: PageRank
§ PageRank gives web pages a ranking score based on links from other pages
– Higher scores given for more links, and links from other high ranking
pages
§ PageRank is a classic example of big data analysis (like word count)
– Lots of data: Needs an algorithm that is distributable and scalable
– Iterative: The more iterations, the better the answer
PageRank Algorithm (1)
[Diagram: four linked pages (Page 1 through Page 4), each starting with rank 1.0]
PageRank Algorithm (2)
[Diagram: the four pages with rank 1.0, showing the links along which rank contributions flow]
PageRank Algorithm (3)
[Diagram: after Iteration 1, ranks are Page 1: 1.85, Page 2: 0.58, Page 3: 1.0, Page 4: 0.58]
PageRank Algorithm (4)
[Diagram: after Iteration 2, ranks are Page 1: 1.31, Page 2: 0.39, Page 3: 1.7, Page 4: 0.57]
PageRank Algorithm (5)
[Diagram: after Iteration 10 (final), ranks are Page 1: 1.43, Page 2: 0.46, Page 3: 1.38, Page 4: 0.73]
PageRank in Spark: Neighbor Contribution Function
Language: Python
def computeContribs(neighbors, rank):
for neighbor in neighbors: yield(neighbor, rank/len(neighbors))
[Diagram: a page with rank 1.0 distributing rank/len(neighbors) to each of its neighbor pages]
PageRank in Spark: Example Data
[Diagram: the example link graph: page1 links to page3; page2 links to page1; page3 links to page1 and page4; page4 links to page1 and page2]
PageRank in Spark: Pairs of Page Links
File contents:
page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

Language: Python
def computeContribs(neighbors, rank):…

links = sc.textFile(file)\
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct()

(page1,page3)
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)
PageRank in Spark: Page Links Grouped by Source Page
File contents:
page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

Language: Python
def computeContribs(neighbors, rank):…

links = sc.textFile(file)\
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct()\
.groupByKey()

links
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])
PageRank in Spark: Persisting the Link Pair RDD
File contents:
page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

Language: Python
def computeContribs(neighbors, rank):…

links = sc.textFile(file)\
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct()\
.groupByKey()\
.persist()

links
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])
PageRank in Spark: Set Initial Ranks
PageRank in Spark: First Iteration (1)
Language: Python
def computeContribs(neighbors, rank):…

links = …

links: (page4, [page2,page1]), (page2, [page1]), …
ranks: (page4, 1.0), (page2, 1.0), …
PageRank in Spark: First Iteration (2)
Language: Python
def computeContribs(neighbors, rank):…

links = …

links: (page4, [page2,page1]), (page2, [page1]), …
ranks: (page4, 1.0), (page2, 1.0), …

contribs
(page2,0.5)
(page1,0.5)
(page1,1.0)
(page1,0.5)
(page4,0.5)
(page3,1.0)
PageRank in Spark: First Iteration (3)
contribs
(page2,0.5)
(page1,0.5)
(page1,1.0)
(page1,0.5)
(page4,0.5)
(page3,1.0)

Language: Python
def computeContribs(neighbors, rank):…

links = …
ranks = …

for x in xrange(10):
    contribs=links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks=contribs\
        .reduceByKey(lambda v1,v2: v1+v2)

(page4,0.5)
(page2,0.5)
(page3,1.0)
(page1,2.0)
PageRank in Spark: First Iteration (4)
contribs
(page2,0.5)
(page1,0.5)
(page1,1.0)
(page1,0.5)
(page4,0.5)
(page3,1.0)

Language: Python
def computeContribs(neighbors, rank):…

links = …
ranks = …

for x in xrange(10):
    contribs=links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks=contribs\
        .reduceByKey(lambda v1,v2: v1+v2)\
        .map(lambda (page,contrib): \
            (page,contrib * 0.85 + 0.15))

ranks
(page4,.58)
(page2,.58)
(page3,1.0)
(page1,1.85)
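One iteration of this update can be checked in plain Python against the slide's numbers (a hedged re-expression of the logic, not the course code; the link data comes from the example file):

```python
# Plain-Python check of one PageRank iteration from the example.
links = {"page1": ["page3"],
         "page2": ["page1"],
         "page3": ["page1", "page4"],
         "page4": ["page1", "page2"]}
ranks = {page: 1.0 for page in links}

# each page sends rank/len(neighbors) to every neighbor
contribs = {}
for page, neighbors in links.items():
    for neighbor in neighbors:
        contribs[neighbor] = contribs.get(neighbor, 0.0) + \
            ranks[page] / len(neighbors)

# damping: new rank = 0.85 * received contributions + 0.15
ranks = {page: 0.85 * contribs.get(page, 0.0) + 0.15
         for page in links}
print(ranks)
```

This reproduces the first-iteration values on the slide: page1 becomes 1.85, page3 stays 1.0, and page2 and page4 become 0.575 (shown rounded as .58).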
PageRank in Spark: Second Iteration
Language: Python
def computeContribs(neighbors, rank):…

links = …

links: (page4, [page2,page1]), (page2, [page1]), …
ranks: (page4,0.58), (page2,0.58), …
Checkpointing (1)
§ Maintaining RDD lineage provides resilience but can also cause problems
when the lineage gets very long
– For example: iterative algorithms, streaming
§ Recovery can be very expensive
§ Potential stack overflow

[Diagram: the lineage chain grows with each iteration, from Iter1 through Iter100]

Language: Python
myrdd = …initial-value…
for x in xrange(100):
    …
    myrdd = myrdd.transform(…)
myrdd.saveAsTextFile(dir)
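The growing-lineage problem and the checkpointing fix can be mimicked in plain Python (a hedged sketch; real Spark checkpointing uses sc.setCheckpointDir() and rdd.checkpoint() to write to reliable storage). Each iteration wraps the previous "RDD" in another thunk, and "checkpointing" replaces the whole chain with its materialized result:

```python
# Plain-Python sketch: lineage as nested thunks, checkpointing
# as materializing the data and cutting the chain.
def base():
    return list(range(4))

rdd, depth = base, 0
for i in range(50):
    # each transformation wraps its parent, growing the lineage
    rdd = (lambda parent: (lambda: [x + 1 for x in parent()]))(rdd)
    depth += 1
    if depth == 25:
        # "checkpoint": materialize once, then restart the chain
        # from the saved data instead of the 25-deep lineage
        data = rdd()
        rdd = (lambda saved: (lambda: saved))(data)

result = rdd()
print(result)   # each element advanced by all 50 iterations
```

Without the checkpoint step, every action would walk all 50 nested calls; with very long chains this is exactly the recursion-depth / stack-overflow risk described above.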
Checkpointing (2)
Chapter Topics
Fundamentals of Computer Programming
#!/usr/bin/env python
import sys
What is Machine Learning?
Types of Machine Learning
What is Collaborative Filtering?
Applications Involving Collaborative Filtering
What is Clustering?
[Diagram: customers plotted along attributes such as store vs. online purchasing and brand status]
Unsupervised Learning (1)
Unsupervised Learning (2)
§ Once the model has been created, you can use it to assign groups
Applications Involving Clustering
§ Market segmentation
– Group similar customers in order to target them effectively
§ Finding related news articles
– Google News
§ Epidemiological studies
– Identifying a “cancer cluster” and finding a root cause
§ Computer vision (groups of pixels that cohere into objects)
– Related pixels clustered to recognize faces or license plates
What is Classification?
[Diagram: scatter plot of weight (lb.) versus height (in.), with points labeled by class such as Cat]
Supervised Learning (1)
Supervised Learning (2)
§ Once the model has been trained, you can make predictions
– This will take new (previously unseen) data as input
– The new data will not have labels
Applications Involving Classification
§ Spam filtering
– Train using a set of spam and non-spam messages
– System will eventually learn to detect unwanted email
§ Oncology
– Train using images of benign and malignant tumors
– System will eventually learn to identify cancer
§ Risk analysis
– Train using financial records of customers who do or don't default
– System will eventually learn to identify high-risk customers
Relationship of Algorithms and Data Volume (1)
Relationship of Algorithms and Data Volume (2)
[Figure: test accuracy (0.70–0.95, y-axis) versus training-set size in millions of words (0.1–1000, log-scale x-axis) for four learners — Memory-Based, Winnow, Perceptron, and Naive Bayes; accuracy for every algorithm climbs as data volume grows.]
Machine Learning Challenges
Spark MLlib and Spark ML
k-means Clustering
§ k-means clustering
– A common iterative algorithm used in graph analysis and machine
learning
– You will implement a simplified version in the Hands-On Exercises
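The loop at the heart of that exercise can be sketched in plain Python (no Spark needed; the function names, the toy data, and the naive first-k initialization are ours, not the exercise's — real implementations typically pick random starting centers):

```python
import math

def closest_center(point, centers):
    # Index of the center nearest to this point (Euclidean distance)
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

def kmeans(points, k, iterations=10):
    # Naive initialization: the first k points
    centers = list(points[:k])
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster
        centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(points, k=2))
```

On this toy data the two centers converge to the means of the two obvious groups after a couple of iterations.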
Clustering (1)–(2)
[Figures: example data points, then the same points grouped into clusters.]
Example: k-means Clustering (1)–(9)
[Figure series: step-by-step walkthrough of the k-means iterations.]
Example: Approximate k-means Clustering
Essential Points
Hands-On Exercise: Implement an Iterative Algorithm with
Apache Spark
§ In this exercise, you will
– Implement k-means in Spark in order to identify clustered location data
points from Loudacre device status logs
– Find the geographic centers of device activity
§ Please refer to the Hands-On Exercise Manual for instructions
DataFrames and Apache Spark SQL
Chapter 14
What is Spark SQL?
SQL Context
Creating a SQL Context
Language: Scala
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
DataFrames
Creating DataFrames
Creating a DataFrame from a Data Source
Example: Creating a DataFrame from a JSON File
Language: Python
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
peopleDF = sqlContext.read.json("people.json")
Language: Scala
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val peopleDF = sqlContext.read.json("people.json")
Example: Creating a DataFrame from a Hive/Impala Table
Language: Python
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
customerDF = sqlContext.read.table("customers")
Language: Scala
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val customerDF = sqlContext.read.table("customers")
Table: customers
cust_id  name  country
Loading from a Data Source Manually
sqlContext.read.
format("jdbc").
option("url","jdbc:mysql://localhost/loudacre").
option("dbtable","accounts").
option("user","training").
option("password","training").
load()
Data Sources
DataFrame Basic Operations (1)
§ Basic operations deal with DataFrame metadata (rather than its data)
§ Some examples
– schema returns a schema object describing the data
– printSchema displays the schema as a visual tree
– cache / persist persists the DataFrame to disk or memory
– columns returns an array containing the names of the columns
– dtypes returns an array of (column name,type) pairs
– explain prints debug information about the DataFrame to the
console
§ Most of these examples will be demonstrated in the next several slides
DataFrame Basic Operations (2)
Language: Scala
> val peopleDF = sqlContext.read.json("people.json")
> peopleDF.dtypes.foreach(println)
(age,LongType)
(name,StringType)
(pcode,StringType)
Working with Data in a DataFrame
DataFrame Actions
DataFrame Queries (1)
DataFrame Queries (2)
DataFrame Query Strings (1)
DataFrame Query Strings (2)
§ Example: where
peopleDF.
where("age > 21")
Querying DataFrames using Columns (1)
Querying DataFrames using Columns (2)
Querying DataFrames using Columns (3)
Language: Scala
peopleDF.select(peopleDF("name"),peopleDF("age")+10)
Querying DataFrames using Columns (4)
Joining DataFrames (1)
Joining DataFrames (2)
Joining DataFrames (3)
SQL Queries (1)
Table: customers
cust_id name country
SQL Queries (2)
Language: Python/Scala
peopleDF.registerTempTable("people")
sqlContext.
sql("""SELECT * FROM people WHERE name LIKE "A%" """)
SQL Queries (3)
§ You can query directly from Parquet or JSON files without needing to
create a DataFrame or register a temporary table
Language: Python/Scala
sqlContext.
sql("""SELECT * FROM json.`/user/training/people.json` WHERE
name LIKE "A%" """)
File: people.json
{"name":"Alice", "pcode":"94304"}
{"name":"Brayden", "age":30, "pcode":"94304"}
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46}
{"name":"Étienne", "pcode":"94104"}
Other Query Functions
Saving DataFrames
Language: Python/Scala
peopleDF.write.saveAsTable("people")
Options for Saving DataFrames
DataFrames and RDDs (1)
peopleRDD = peopleDF.rdd

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

peopleRDD:
Row[null,Alice,94304]
Row[30,Brayden,94304]
Row[19,Carla,10036]
Row[46,Diana,null]
Row[null,Étienne,94104]
DataFrames and RDDs (2)
§ Row RDDs have all the standard Spark actions and transformations
– Actions: collect, take, count, and so on
– Transformations: map, flatMap, filter, and so on
§ Row RDDs can be transformed into pair RDDs to use map-reduce methods
§ DataFrames also provide convenience methods (such as map, flatMap, and foreach) for converting to RDDs
Working with Row Objects
§ The syntax for extracting data from Row objects depends on language
§ Python
– Column names are object attributes
– row.age returns age column value from row
§ Scala
– Use Array-like syntax to return values with type Any
– row(n) returns element in the nth column
– row.fieldIndex("age") returns the index of the age column
– Use methods to get correctly typed values
– row.getAs[Long]("age")
– Use type-specific get methods to return typed values
– row.getString(n) returns nth column as a string
– row.getInt(n) returns nth column as an integer
– And so on
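Spark is not needed to see the two access styles side by side: a Python Row behaves much like a named tuple. A minimal sketch (the Row class below is a stand-in built from collections.namedtuple, not pyspark's class):

```python
from collections import namedtuple

# Stand-in for a Spark SQL Row: fields are both attributes and positions
Row = namedtuple("Row", ["age", "name", "pcode"])

row = Row(age=30, name="Brayden", pcode="94304")

by_attr = row.age                 # Python style: column name as attribute
by_index = row[0]                 # positional access, like Scala's row(n)
idx = row._fields.index("age")    # like Scala's row.fieldIndex("age")
```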
Example: Extracting Data from Row Objects
§ Extract data from Row objects
Language: Python
peopleRDD = peopleDF \
  .map(lambda row: (row.pcode, row.name))
peopleByPCode = peopleRDD \
  .groupByKey()
Language: Scala
val peopleRDD = peopleDF.
  map(row =>
    (row(row.fieldIndex("pcode")),
     row(row.fieldIndex("name"))))
val peopleByPCode = peopleRDD.
  groupByKey()

Data flow:
Row[null,Alice,94304], Row[30,Brayden,94304], Row[19,Carla,10036], Row[46,Diana,null], Row[null,Étienne,94104]
→ pairs: (94304,Alice), (94304,Brayden), (10036,Carla), (null,Diana), (94104,Étienne)
→ grouped: (null,[Diana]), (94304,[Alice,Brayden]), (10036,[Carla]), (94104,[Étienne])
Converting RDDs to DataFrames
Comparing Impala to Spark SQL
Comparing Spark SQL with Hive on Spark
§ Spark SQL
– Provides the DataFrame API to allow structured data
processing in a Spark application
– Programmers can mix SQL with procedural processing
§ Hive on Spark
– Hive provides a SQL abstraction layer over MapReduce or
Spark
– Allows non-programmers to analyze data using familiar
SQL
– Hive on Spark replaces MapReduce as the engine
underlying Hive
– Does not affect the user experience of Hive
– Except queries run many times faster!
What’s Coming in Spark 2.x?
Spark Datasets
§ Word count using RDDs
Language: Scala
val countsRDD = sc.textFile(filename).
  flatMap(line => line.split(" ")).
  map(word => (word,1)).
  reduceByKey((v1,v2) => v1+v2)
§ Word count using Datasets
Language: Scala
val countsDS = sqlContext.read.text(filename).as[String].
  flatMap(line => line.split(" ")).
  groupBy(word => word).
  count()
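For reference, the same flatMap → map → reduce flow in plain Python (no Spark; the input lines are made up, and Counter stands in for the reduce step):

```python
from collections import Counter

lines = ["the cat sat", "the cat ran"]   # stand-in for textFile(filename)

# flatMap: every line becomes several words
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey (or groupBy + count): tally occurrences per word
counts = Counter(words)
```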
Essential Points
§ Spark SQL is a Spark API for handling structured and semi-structured data
§ Entry point is a SQL context
§ DataFrames are the key unit of data
– DataFrames are based on an underlying RDD of Row objects
– DataFrames query methods return new DataFrames; similar to RDD
transformations
– The full Spark API can be used with Spark SQL data by accessing the
underlying RDD
§ Spark SQL is not a replacement for a database, or a specialized SQL engine
like Impala
– Spark SQL is most useful for ETL or incorporating structured data into a
Spark application
Hands-On Exercise: Use Apache Spark SQL for ETL
Message Processing with Apache Kafka
Chapter 15
What Is Apache Kafka?
Characteristics of Kafka
§ Scalable
– Kafka is a distributed system that supports multiple nodes
§ Fault-tolerant
– Data is persisted to disk and can be replicated throughout the cluster
§ High throughput
– Each broker can process hundreds of thousands of messages per second *
§ Low latency
– Data is delivered in a fraction of a second
§ Flexible
– Decouples the production of data from its consumption
Key Terminology
§ Message
– A single data record passed by Kafka
§ Topic
– A named log or feed of messages within Kafka
§ Producer
– A program that writes messages to Kafka
§ Consumer
– A program that reads messages from Kafka
Example: High-Level Architecture
Messages (1)
Messages (2)
§ Kafka retains all messages for a defined time period and/or total size
– Administrators can specify retention on global or per-topic basis
– Kafka will retain messages regardless of whether they were read
– Kafka discards messages automatically after the retention period or
total size is exceeded (whichever limit is reached first)
– Default retention is one week
– Retention can reasonably be one year or longer
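Per-topic retention is controlled through topic-level configuration. A hedged example (host name, topic name, and the seven-day value are illustrative; check `kafka-configs --help` for the exact flags in your Kafka version):

```shell
$ kafka-configs --alter \
    --zookeeper zkhost1:2181 \
    --entity-type topics \
    --entity-name device_status \
    --add-config retention.ms=604800000   # 7 days, in milliseconds
```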
Topics
Producers
Consumers
Producers and Consumers
Scaling Kafka
Topic Partitioning
§ One or more consumers can form a consumer group that works together to consume the messages in a topic
§ Each partition is consumed by only one member of a consumer group
§ Message ordering is preserved per partition, but not across the topic
[Figure: topic click-tracking split into partitions 0–3 on a Kafka cluster; the partitions are divided between Consumer 1 and Consumer 2 of one group.]
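The routing rules behind the figures can be sketched in plain Python (a simplification: real Kafka producers hash keys with murmur2, and consumer assignment is negotiated by the group coordinator, not computed like this):

```python
NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Producer side: keyed messages go to hash(key) mod partition count,
    # so every message with the same key lands in the same partition
    return sum(key.encode()) % NUM_PARTITIONS  # toy hash, not murmur2

def assign(partitions, consumers):
    # Consumer side: each partition is consumed by exactly one member
    # of a consumer group, spread round-robin here
    return {p: consumers[i % len(consumers)]
            for i, p in enumerate(partitions)}

assignment = assign([0, 1, 2, 3], ["consumer-1", "consumer-2"])
```

With two consumers and four partitions, each consumer handles two partitions; ordering holds within each partition but not across the topic.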
Increasing Consumer Throughput
[Figure: topic click-tracking, partitions 0–3, with four consumers C1–C4 in one group — one partition each, for maximum throughput.]
Multiple Consumer Groups
[Figure: topic click-tracking, partitions 0–3, consumed independently by two consumer groups (C1–C4 and C5–C6).]
Publish and Subscribe to Topic
[Figure: topic click-tracking, partitions 0–3, with several consumer groups (C1–C7) subscribed to the same published messages.]
Kafka Clusters
Apache ZooKeeper
Kafka Brokers
Topic Replication
Messages Are Replicated
Creating Topics from the Command Line
$ kafka-topics --create \
--zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
--replication-factor 3 \
--partitions 5 \
--topic device_status
Displaying Topics from the Command Line
$ kafka-topics --list \
--zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181
$ kafka-topics --help
Running a Producer from the Command Line (1)
$ kafka-console-producer \
--broker-list brokerhost1:9092,brokerhost2:9092 \
--topic device_status
Running a Producer from the Command Line (2)
§ You may see a few log messages in the terminal after the producer starts
§ The producer will then accept input in the terminal window
– Each line you type will be a message sent to the topic
§ Until you have configured a consumer for this topic, you will see no other
output from Kafka
Writing File Contents to Topics Using the Command Line
§ Using UNIX pipes or redirection, you can read input from files
– The data can then be sent to a topic using the command line producer
§ This example shows how to read input from a file named alerts.txt
– Each line in this file becomes a separate message in the topic
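The command itself did not survive in this copy of the slide; a typical invocation (broker host names assumed, matching the earlier producer example) redirects the file into the console producer:

```shell
$ kafka-console-producer \
    --broker-list brokerhost1:9092,brokerhost2:9092 \
    --topic device_status < alerts.txt
```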
Running a Consumer from the Command Line
$ kafka-console-consumer \
--zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
--topic device_status \
--from-beginning
Essential Points
Hands-On Exercise: Produce and Consume Apache Kafka
Messages
§ In this exercise, you will
– Use Kafka’s command line utilities to create a new topic, publish
messages to the topic with a producer, and read messages from the
topic with a consumer
§ Please refer to the Hands-On Exercise Manual for instructions
Capturing Data with Apache Flume
Chapter 16
What Is Apache Flume?
Flume’s Design Goals: Reliability
Flume’s Design Goals: Scalability
§ Scalability
– The ability to increase system performance linearly—or better—by
adding more resources to the system
– Flume scales horizontally
– As load increases, add more agents to the machine, and add more
machines to the system
Flume’s Design Goals: Extensibility
§ Extensibility
– The ability to add new functionality to a system
§ Flume can be extended by adding sources and sinks to existing storage
layers or data platforms
– Flume includes sources that can read data from files, syslog, and
standard output from any Linux process
– Flume includes sinks that can write to files on the local filesystem, HDFS,
Kudu, HBase, and so on
– Developers can write their own sources or sinks
Common Flume Data Sources
[Figure: common Flume data sources — log files, sensor data, and status updates — feeding into a Hadoop cluster.]
Large-Scale Deployment Example
Flume Events
[Figure: a Flume event — header fields (for example, timestamp: 1395256884) plus a body carrying the payload.]
Components in Flume’s Architecture
§ Source
– Receives events from the external actor that generates them
§ Sink
– Sends an event to its destination
§ Channel
– Buffers events from the source until they are drained by the sink
§ Agent
– Configures and hosts the source, channel, and sink
– A Java process that runs in a JVM
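These four components are wired together in an agent's properties file. A minimal hedged sketch (the agent and component names are our own, and the values are illustrative; the syntax follows the standard Flume configuration format):

```properties
# Name the components hosted by agent 'agent1'
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: accept data written to a TCP port (netcat source)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444

# Channel: buffer events in memory until the sink drains them
agent1.channels.ch1.type = memory

# Sink: write events to a directory in HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logs

# Connect the source and sink to the channel
# (note: 'channels' is plural for sources, singular for sinks)
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
```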
Flume Data Flow
Flume Agent
syslog Server
Hadoop Cluster
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-14
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-15
Notable Built-In Flume Sources
§ Syslog
– Captures messages from UNIX syslog daemon over the network
§ Netcat
– Captures any data written to a socket on an arbitrary TCP port
§ Exec
– Executes a UNIX program and reads events from standard output *
§ Spooldir
– Extracts events from files appearing in a specified (local) directory
§ HTTP Source
– Retrieves events from HTTP requests
§ Kafka
– Retrieves events by consuming messages from a Kafka topic
* Asynchronous sources do not guarantee that events will be delivered
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-16
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-17
Some Interesting Built-In Flume Sinks
§ Null
– Discards all events (Flume equivalent of /dev/null)
§ Logger
– Logs event to INFO level using SLF4J*
§ IRC
– Sends event to a specified Internet Relay Chat channel
§ HDFS
– Writes event to a file in the specified directory in HDFS
§ Kafka
– Sends event as a message to a Kafka topic
§ HBaseSink
– Stores event in HBase
*SLF4J: Simple Logging Façade for Java
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-18
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-19
Built-In Flume Channels
§ Memory
– Stores events in the machine’s RAM
– Extremely fast, but not reliable (memory is volatile)
§ File
– Stores events on the machine’s local disk
– Slower than RAM, but more reliable (data is written to disk)
§ Kafka
– Uses Kafka as a scalable, reliable, and highly available channel between
any source and sink type
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-20
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-21
Flume Agent Configuration File
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-22
Example: Configuring Flume Components (1)
syslog Server
Hadoop Cluster
src1 ch1 sink1
agent1
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-23
Example: Configuring Flume Components (2)
agent1.sources = src1
agent1.sinks = sink1
agent1.channels = ch1
agent1.channels.ch1.type = memory
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/flume/incoming
agent1.sources.src1.channels = ch1        (connects source and channel)
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logdata
agent1.sinks.sink1.channel = ch1          (connects sink and channel)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-24
Aside: HDFS Sink Configuration
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logdata/%y-%m-%d
agent1.sinks.sink1.hdfs.codeC = snappy
agent1.sinks.sink1.channel = ch1
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-25
Starting a Flume Agent
$ flume-ng agent \
--conf /etc/flume-ng/conf \
--conf-file /path/to/flume.conf \
--name agent1 \
-Dflume.root.logger=INFO,console
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-26
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-27
Essential Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-28
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-29
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-30
Hands-On Exercise: Collect Web Server Logs with Apache Flume
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-31
Integrating Apache Flume and Apache
Kafka
Chapter 17
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-2
Integrating Apache Flume and Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-3
Chapter Topics
§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-4
Should I Use Kafka or Flume?
§ Both Flume and Kafka are widely used for data ingest
– Although these tools differ, their functionality has some overlap
– Some use cases could be implemented with either Flume or Kafka
§ How do you determine which is a better choice for your use case?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-5
Characteristics of Flume
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-6
Characteristics of Kafka
Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-7
Flafka = Flume + Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-8
Chapter Topics
§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-9
Using a Flume Kafka Sink as a Producer
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-10
Using a Flume Kafka Source as a Consumer
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-11
Using a Flume Kafka Channel
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-12
Using a Kafka Channel as a Consumer (Sourceless Channel)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-13
Chapter Topics
§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-14
Configuring Flume with a Kafka Source
§ The table below describes some key properties of the Kafka source
Name Description
type org.apache.flume.source.kafka.KafkaSource
zookeeperConnect ZooKeeper connection string (example: zkhost:2181)
topic Name of Kafka topic from which messages will be read
groupId Unique ID to use for the consumer group (default: flume)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-15
Example: Configuring Flume with a Kafka Source (1)
§ This is the Flume configuration for the example on the previous slide
– It defines a source for reading messages from a Kafka topic
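The configuration file itself is not reproduced in these materials; the sketch below assembles one from the Kafka source properties in the table on the previous slide. The agent name agent1, channel name ch1, and topic name weblogs are illustrative assumptions, not values from the course:

```
# Sketch only: agent1, ch1, and the topic 'weblogs' are hypothetical names.
# Property names come from the Kafka source table on the previous slide.
agent1.sources = kafkasource
agent1.channels = ch1
agent1.sources.kafkasource.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafkasource.zookeeperConnect = zkhost:2181
agent1.sources.kafkasource.topic = weblogs
agent1.sources.kafkasource.groupId = flume
agent1.sources.kafkasource.channels = ch1
```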
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-16
Example: Configuring Flume with a Kafka Source (2)
§ The remaining portion of the file configures the channel and sink
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-17
Configuring Flume with a Kafka Sink
§ The table below describes some key properties of the Kafka sink
Name Description
type Must be set to org.apache.flume.sink.kafka.KafkaSink
brokerList Comma-separated list of brokers (format host:port) to contact
topic The topic in Kafka to which the messages will be published
batchSize How many messages to process in one batch
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-18
Example: Configuring Flume with a Kafka Sink (1)
§ This is the Flume configuration for the example on the previous slide
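The configuration file itself is not reproduced in these materials; the sketch below assembles one from the Kafka sink properties in the table on the previous slide. The agent name agent1, channel name ch1, topic weblogs, and batch size 20 are illustrative assumptions:

```
# Sketch only: agent1, ch1, 'weblogs', and batchSize=20 are hypothetical values.
# Property names come from the Kafka sink table on the previous slide.
agent1.sinks = kafkasink
agent1.channels = ch1
agent1.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafkasink.brokerList = brokerhost:9092
agent1.sinks.kafkasink.topic = weblogs
agent1.sinks.kafkasink.batchSize = 20
agent1.sinks.kafkasink.channel = ch1
```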
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-19
Example: Configuring Flume with a Kafka Sink (2)
§ The remaining portion of the configuration file sets up the Kafka sink
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-20
Configuring Flume with a Kafka Channel
§ The table below describes some key properties of the Kafka channel
Name Description
type org.apache.flume.channel.kafka.KafkaChannel
zookeeperConnect ZooKeeper connection string (example: zkhost:2181)
brokerList Comma-separated list of brokers (format host:port) to
contact
topic Name of Kafka topic from which messages will be read
(optional, default=flume-channel)
parseAsFlumeEvent Set to false for sourceless configuration (optional,
default=true)
readSmallestOffset Set to true to read from the beginning of the Kafka topic
(optional, default=false)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-21
Example: Configuring a Sourceless Kafka Channel (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-22
Example: Configuring a Sourceless Kafka Channel (2)
§ This is the Flume configuration for the example on the previous slide
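The configuration file itself is not reproduced in these materials; the sketch below assembles a sourceless Kafka channel from the properties in the table shown earlier. The agent name agent1, channel name kchannel, sink name sink1, and topic weblogs are illustrative assumptions. For a sourceless configuration, parseAsFlumeEvent must be set to false so the channel reads raw Kafka messages rather than serialized Flume events:

```
# Sketch only: agent1, kchannel, sink1, and 'weblogs' are hypothetical names.
# Property names come from the Kafka channel table shown earlier.
agent1.channels = kchannel
agent1.channels.kchannel.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.kchannel.zookeeperConnect = zkhost:2181
agent1.channels.kchannel.brokerList = brokerhost:9092
agent1.channels.kchannel.topic = weblogs
agent1.channels.kchannel.parseAsFlumeEvent = false
agent1.sinks = sink1
agent1.sinks.sink1.channel = kchannel
```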
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-23
Chapter Topics
§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-24
Essential Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-25
Bibliography
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-26
Chapter Topics
§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume
to Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-27
Hands-On Exercise: Send Web Server Log Messages from Apache
Flume to Apache Kafka
§ In this exercise, you will
– Configure a Flume agent using a Kafka sink to produce Kafka messages
from data that was received by a Flume source
§ Please refer to the Hands-On Exercise Manual for instructions
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-28
Apache Spark Streaming: Introduction
to DStreams
Chapter 18
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-2
Apache Spark Streaming: Introduction to DStreams
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-3
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-4
What Is Spark Streaming?
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-5
Why Spark Streaming?
§ Many big data applications need to process large data streams in real time,
such as
– Continuous ETL
– Website monitoring
– Fraud detection
– Ad monetization
– Social media analysis
– Financial market trends
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-6
Spark Streaming Features
§ Second-scale latencies
§ Scalability and efficient fault tolerance
§ “Once and only once” processing
§ Integrates batch and real-time processing
§ Easy to develop
– Uses Spark’s high-level API
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-7
Spark Streaming Overview
…1001101001000111000011100010…
Spark Streaming
DStream—RDDs (batches of n seconds)
Spark
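The batching model above can be illustrated without a cluster. In the plain-Python sketch below, batch_stream is a made-up helper (not a Spark API) standing in for how Spark Streaming chops the incoming stream into one small RDD per batch interval:

```python
# Plain-Python sketch of the DStream model: a continuous stream is divided
# into a sequence of small batches (stand-ins for RDDs), one per interval.
def batch_stream(events, batch_size):
    """Group a flat list of events into fixed-size batches."""
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

requests = ["req1", "req2", "req3", "req4", "req5"]
dstream = batch_stream(requests, 2)
# Each batch is then processed independently, like one RDD in a DStream.
print(dstream)  # -> [['req1', 'req2'], ['req3', 'req4'], ['req5']]
```

In real Spark Streaming the batches are defined by time, not by count; the count-based split here only mimics the shape of the result.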
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-8
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-9
Example: Streaming Request Count (Scala Overview)
Language: Scala
object StreamingRequestCount {
userreqs.print()
ssc.start()
ssc.awaitTermination()
}
}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-10
Example: Configuring StreamingContext
Language: Scala
object StreamingRequestCount {
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-11
Streaming Example: Creating a DStream
Language: Scala
object StreamingRequestCount {
ssc.start()
ssc.awaitTermination()
}
}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-12
Streaming Example: DStream Transformations
Language: Scala
object StreamingRequestCount {
userreqs.print()
§ DStream operations are applied to each batch RDD in the stream
§ Similar to RDD operations—filter, map, reduceByKey, join, and so on
ssc.start()
ssc.awaitTermination()
}
}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-13
Streaming Example: DStream Result Output
Language: Scala
object StreamingRequestCount {
userreqs.print()
ssc.start()
§ Print out the first 10 elements of each RDD
ssc.awaitTermination()
}
}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-14
Streaming Example: Starting the Streams
Language: Scala
object StreamingRequestCount {
ssc.start()
ssc.awaitTermination()
}
}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-15
Streaming Example: Python versus Scala
Language: Python
if __name__ == "__main__":
sc = SparkContext()
ssc = StreamingContext(sc,2)
userreqs.pprint()    (Scala: userreqs.print())
ssc.start()
ssc.awaitTermination()
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-16
Streaming Example: Streaming Request Count (Recap)
Language: Scala
object StreamingRequestCount {
userreqs.print()
ssc.start()
ssc.awaitTermination()
}
}
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-17
DStreams
Time
t0 t1 t2 t3
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-18
Streaming Example Output (1)
-------------------------------------------
Time: 1401219545000 ms        (starts 2 seconds after ssc.start, time interval t1)
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-19
Streaming Example Output (2)
-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
------------------------------------------- t2: 2 seconds later…
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)
...
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-20
Streaming Example Output (3)
-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
-------------------------------------------
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)
...
------------------------------------------- t3: 2 seconds later…
Time: 1401219549000 ms
-------------------------------------------
(44390,2)
(48712,2)
(165,2)
(465,2)
(120,2)
...
(continues until termination…)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-21
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-22
DStream Data Sources
§ DStreams are defined for a given input stream (such as a Unix socket)
– Created by the Streaming context
ssc.socketTextStream(hostname, port)
– Similar to how RDDs are created by the Spark context
§ Out-of-the-box data sources
– Network
– Sockets
– Services such as Flume, Akka Actors, Kafka, ZeroMQ, or Twitter
– Files
– Monitors an HDFS directory for new content
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-23
DStream Operations
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-24
DStream Transformations (1)
Language: Python
distinctDS =
myDS.transform(lambda rdd: rdd.distinct())
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-25
DStream Transformations (2)
reqcounts = userreqs.reduceByKey((x,y) => x+y)

reqcounts   (user002,5)   (user710,9)   (user002,1)
            (user033,1)   (user022,4)   (user808,8)
            (user912,2)   (user001,4)   (user018,2)
            …             …             …
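The same per-batch aggregation can be shown in plain Python. Here reduce_by_key is a made-up stand-in for RDD.reduceByKey (not a Spark API), applied to a single batch of (userID, 1) pairs:

```python
def reduce_by_key(pairs, f):
    """Plain-Python stand-in for RDD.reduceByKey on one batch: merge all
    values that share a key using the function f."""
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return acc

# One batch of (userID, 1) pairs, as produced by the preceding map step
batch = [("user002", 1), ("user033", 1), ("user002", 1)]
print(reduce_by_key(batch, lambda x, y: x + y))
# -> {'user002': 2, 'user033': 1}
```

In a DStream, this reduction runs once per batch RDD; counts are not carried across batches unless a state operation is used (covered in a later chapter).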
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-26
DStream Output Operations
§ Console output
– print (Scala) / pprint (Python) prints out the first 10 elements of
each RDD
– Optionally pass an integer to print another number of elements
§ File output
– saveAsTextFiles saves data as text
– saveAsObjectFiles saves as serialized object files (SequenceFiles)
§ Executing other functions
– foreachRDD(function) performs a function on each RDD in the
DStream
– Function input parameters
– The RDD on which to perform the function
– The time stamp of the RDD (optional)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-27
Saving DStream Results as Files
Language: Scala
val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((v1,v2) => v1+v2)
userreqs.print()
userreqs.saveAsTextFiles("…/outdir/reqcounts")
(Diagram: each batch RDD in the DStream, containing pairs such as (user002,5), (user710,9), and so on, is written out as a set of text files under …/outdir/reqcounts, one output directory per batch interval.)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-28
Scala Example: Find Top Users (1)
…
val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((v1,v2) => v1+v2)
userreqs.saveAsTextFiles(path)
// Transform each RDD: swap userID/count, sort by count
val sortedreqs = userreqs
.map(pair => pair.swap)
.transform(rdd => rdd.sortByKey(false))
sortedreqs.foreachRDD((rdd,time) => {
println("Top users @ " + time)
rdd.take(5).foreach(
pair => printf("User: %s (%s)\n",pair._2, pair._1))
}
)
ssc.start()
ssc.awaitTermination()
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-29
Scala Example: Find Top Users (2)
…
val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((v1,v2) => v1+v2)
userreqs.saveAsTextFiles(path)
val sortedreqs = userreqs
.map(pair => pair.swap)
.transform(rdd => rdd.sortByKey(false))
sortedreqs.foreachRDD((rdd,time) => {
println("Top users @ " + time)
rdd.take(5).foreach(
pair => printf("User: %s (%s)\n",pair._2, pair._1))
}
)
ssc.start()
ssc.awaitTermination()
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-30
Python Example: Find Top Users (1)
def printTop5(r,t):
print "Top users @",t
for count,user in r.take(5):
print "User:",user,"("+str(count)+")"
…
userreqs = mystream \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
userreqs.saveAsTextFiles("streamreq/reqcounts")
# Transform each RDD: swap userID/count, sort by count
sortedreqs = userreqs \
.map(lambda (k,v): (v,k)) \
.transform(lambda rdd: rdd.sortByKey(False))
ssc.start()
ssc.awaitTermination()
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-31
Python Example: Find Top Users (2)
def printTop5(r,t):
print "Top users @",t
for count,user in r.take(5):
print "User:",user,"("+str(count)+")"
…
userreqs = mystream \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
userreqs.saveAsTextFiles("streamreq/reqcounts")
sortedreqs = userreqs \
.map(lambda (k,v): (v,k)) \
.transform(lambda rdd: rdd.sortByKey(False))
# Print out the top 5 users as "User: userID (count)"
sortedreqs.foreachRDD(lambda time, rdd: printTop5(rdd, time))
ssc.start()
ssc.awaitTermination()
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-32
Example: Find Top Users—Output (1)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-33
Example: Find Top Users—Output (2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-34
Example: Find Top Users—Output (3)
Continues until termination…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-35
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-36
Building and Running Spark Streaming Applications
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-37
Using Spark Streaming with Spark Shell
or
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-38
The Spark Streaming Application UI
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-39
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-40
Essential Points
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-41
Chapter Topics
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-42
Hands-On Exercise: Write an Apache Spark Streaming
Application
§ In this exercise, you will
– Write a Spark Streaming application to process web log data
§ Please refer to the Hands-On Exercise Manual for instructions
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-43
Apache Spark Streaming: Processing
Multiple Batches
Chapter 19
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-2
Apache Spark Streaming: Processing Multiple Batches
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-3
Chapter Topics
§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-4
Multi-Batch DStream Operations
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-5
Chapter Topics
§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-6
Time Slicing
§ DStream.slice(fromTime, toTime)
– Returns a collection of batch RDDs based on data from the stream
§ StreamingContext.remember(duration)
– By default, input data is automatically cleared when no RDD’s lineage
depends on it
– slice will return no data for time periods whose data has already been
cleared
– Use remember to keep data around longer
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-7
Chapter Topics
§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-8
State DStreams (1)
Requests         t1: (user001,5) (user102,1) (user009,2)
Total Requests   t1: (user001,5) (user102,1) (user009,2)
(State)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-9
State DStreams (2)
Requests         t1: (user001,5) (user102,1) (user009,2)
                 t2: (user001,4) (user012,2) (user921,5)
Total Requests   t1: (user001,5) (user102,1) (user009,2)
(State)          t2: (user001,9) (user102,1) (user009,2) (user012,2) (user921,5)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-10
State DStreams (3)
Requests         t1: (user001,5) (user102,1) (user009,2)
                 t2: (user001,4) (user012,2) (user921,5)
                 t3: (user102,7) (user012,3) (user660,4)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-11
Python Example: Total User Request Count (1)
Language: Python
…
userreqs = logs \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
…
# Set checkpoint directory to enable checkpointing.
# Required to prevent infinite lineages.
ssc.checkpoint("checkpoints")
…
totalUserreqs = userreqs \
.updateStateByKey(lambda newCounts, state: \
updateCount(newCounts, state))
totalUserreqs.pprint()
ssc.start()
ssc.awaitTermination()
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-12
Python Example: Total User Request Count (2)
Language: Python
…
userreqs = logs \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
…
ssc.checkpoint("checkpoints")
…
# Compute a state DStream based on the previous states, updated
# with the values from the current batch of request counts.
totalUserreqs = userreqs \
.updateStateByKey(lambda newCounts, state: \
updateCount(newCounts, state))
totalUserreqs.pprint()
next slide…
ssc.start()
ssc.awaitTermination()
…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-13
Python Example: Total User Request Count—Update Function
§ Example at t2
– user001: updateCount([4],5) → 9
– user012: updateCount([2],None) → 2
– user921: updateCount([5],None) → 5

Requests         t1: (user001,5) (user102,1) (user009,2)
                 t2: (user001,4) (user012,2) (user921,5)
Total Requests   t1: (user001,5) (user102,1) (user009,2)
(State)          t2: (user001,9) (user102,1) (user009,2) (user012,2) (user921,5)
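The updateCount function is called on these slides but its definition is not shown; the following is a minimal Python sketch consistent with the example values above (Spark passes state=None the first time a key appears in the stream):

```python
def updateCount(newCounts, state):
    # newCounts: list of new per-batch counts for this key
    # state: previous running total for this key, or None if the key
    #        has not been seen before (assumed behavior, per the slides)
    return sum(newCounts) + (state if state is not None else 0)

print(updateCount([4], 5))     # user001 at t2 -> 9
print(updateCount([2], None))  # user012 at t2 -> 2
```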
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-14
Scala Example: Total User Request Count (1)
Language: Scala
…
val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)
…
ssc.checkpoint("checkpoints")
Set checkpoint directory to enable checkpointing. Required to prevent infinite lineages.
val totalUserreqs = userreqs.updateStateByKey(updateCount)
totalUserreqs.print()
ssc.start()
ssc.awaitTermination()
…
Scala Example: Total User Request Count (2)
Language: Scala
…
val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)
…
ssc.checkpoint("checkpoints")
val totalUserreqs = userreqs.updateStateByKey(updateCount)
Compute a state DStream based on the previous states, updated with the values from the current batch of request counts. (updateCount: next slide…)
totalUserreqs.print()
ssc.start()
ssc.awaitTermination()
…
Scala Example: Total User Request Count—Update Function
Requests:
t1: (user001,5) (user102,1) (user009,2)
t2: (user001,4) (user012,2) (user921,5)
Total Requests (State):
after t1: (user001,5) (user102,1) (user009,2)
after t2: (user001,9) (user102,1) (user009,2) (user012,2) (user921,5)
§ Example at t2
– user001: updateCount([4],Some(5)) → 9
– user012: updateCount([2],None) → 2
– user921: updateCount([5],None) → 5
Example: Maintaining State—Output
-------------------------------------------
Time: 1401219545000 ms (t1)
-------------------------------------------
(user001,5)
(user102,1)
(user009,2)
-------------------------------------------
Time: 1401219547000 ms (t2)
-------------------------------------------
(user001,9)
(user102,1)
(user009,2)
(user012,2)
(user921,5)
-------------------------------------------
Time: 1401219549000 ms (t3)
-------------------------------------------
(user001,9)
(user102,8)
(user009,2)
(user012,5)
(user921,5)
(user660,4)
-------------------------------------------
Time: 1401219551000 ms
-------------------------------------------
…
Chapter Topics
§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming
Sliding Window Operations (1)
§ Regular DStream operations execute once for each RDD, based on the StreamingContext (SSC) batch duration
§ “Window” operations span multiple RDDs over a given duration
– For example, reduceByKeyAndWindow and countByWindow
[Diagram: a regular DStream produces one RDD per batch; a window DStream combines the RDDs that fall within the window duration.]
Sliding Window Operations (2)
[Diagram: with a batch size of Seconds(2), reduceByKeyAndWindow(fn, Seconds(12)) combines the last six batches of the regular DStream into each RDD of the window DStream.]
Sliding Window Operations (3)
§ You can specify a different slide duration (must be a multiple of the SSC
duration)
[Diagram: with a batch size of Seconds(2), reduceByKeyAndWindow(fn, Seconds(12), Seconds(4)) produces a window RDD every four seconds, each covering the last twelve seconds of the regular DStream.]
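The windowing behavior in the diagrams above can be simulated in plain Python. This sketch treats each batch as a list of pairs and expresses the window and slide durations as batch counts (the function and parameter names are illustrative, not Spark API):

```python
from collections import Counter

def reduce_by_key_and_window(batches, window_batches, slide_batches):
    # batches: one list of (key, value) pairs per batch interval.
    # window_batches / slide_batches: window and slide durations expressed
    # as numbers of batches (e.g. Seconds(12) and Seconds(4) with a
    # 2-second batch size give window_batches=6, slide_batches=2).
    windows = []
    for end in range(slide_batches, len(batches) + 1, slide_batches):
        counts = Counter()
        for batch in batches[max(0, end - window_batches):end]:
            for k, v in batch:
                counts[k] += v  # reduce within the window by key
        windows.append(dict(counts))
    return windows

batches = [[("a", 1)], [("a", 1), ("b", 1)], [("b", 1)], [("a", 1)]]
print(reduce_by_key_and_window(batches, window_batches=3, slide_batches=2))
# [{'a': 2, 'b': 1}, {'a': 2, 'b': 2}]
```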
Scala Example: Count and Sort User Requests by Window (1)
Language: Scala
…
val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)
…
val reqcountsByWindow = logs.
map(line => (line.split(' ')(2),1)).
reduceByKeyAndWindow((v1: Int, v2: Int) => v1+v2,
Minutes(5),Seconds(30))
Every 30 seconds, count requests by user over the last five minutes.
val topreqsByWindow = reqcountsByWindow.
  map(pair => pair.swap).
  transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()
ssc.start()
ssc.awaitTermination()
…
Scala Example: Count and Sort User Requests by Window (2)
Language: Scala
…
val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)
…
val reqcountsByWindow = logs.
map(line => (line.split(' ')(2),1)).
  reduceByKeyAndWindow((v1: Int, v2: Int) => v1+v2,
    Minutes(5),Seconds(30))
val topreqsByWindow = reqcountsByWindow.
  map(pair => pair.swap).
  transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()
Sort and print the top users for every RDD (every 30 seconds).
ssc.start()
ssc.awaitTermination()
…
Python Example: Count and Sort User Requests by Window (1)
Language: Python
…
sc = SparkContext()
ssc = StreamingContext(sc, 2)
logs = ssc.socketTextStream(hostname, port)
…
reqcountsByWindow = logs. \
map(lambda line: (line.split(' ')[2],1)).\
reduceByKeyAndWindow(lambda v1,v2: v1+v2,5*60,30)
Every 30 seconds, count requests by user over the last five minutes.
topreqsByWindow = reqcountsByWindow. \
    map(lambda (k,v): (v,k)). \
    transform(lambda rdd: rdd.sortByKey(False))
topreqsByWindow.map(lambda (k,v): (v,k)).pprint()
ssc.start()
ssc.awaitTermination()
…
Python Example: Count and Sort User Requests by Window (2)
Language: Python
…
sc = SparkContext()
ssc = StreamingContext(sc, 2)
logs = ssc.socketTextStream(hostname, port)
…
reqcountsByWindow = logs. \
    map(lambda line: (line.split(' ')[2],1)). \
    reduceByKeyAndWindow(lambda v1,v2: v1+v2,5*60,30)
topreqsByWindow = reqcountsByWindow. \
    map(lambda (k,v): (v,k)). \
    transform(lambda rdd: rdd.sortByKey(False))
topreqsByWindow.map(lambda (k,v): (v,k)).pprint()
Sort and print the top users for every RDD (every 30 seconds).
ssc.start()
ssc.awaitTermination()
…
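The swap/sort/swap pattern used in both window examples above is simply a way of sorting key/value pairs by value. In plain Python the same idea looks like this (an illustrative helper, not Spark API):

```python
def top_requests(counts):
    # counts: {user: request count} for one window.
    swapped = [(v, k) for k, v in counts.items()]  # map(pair => pair.swap)
    swapped.sort(reverse=True)                     # sortByKey(false): descending
    return [(k, v) for v, k in swapped]            # swap back before printing

print(top_requests({"user001": 3, "user102": 8, "user009": 5}))
# [('user102', 8), ('user009', 5), ('user001', 3)]
```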
Chapter Topics
§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming
Essential Points
§ You can get a “slice” of data from a stream based on absolute start and
end times
– For example, all data received between midnight October 1, 2016 and
midnight October 2, 2016
§ You can update state based on prior state
– For example, total requests by user
§ You can perform operations on “windows” of data
– For example, number of logins in the last hour
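The time-slicing point above can be illustrated with a plain-Python sketch that filters batches by absolute timestamps (the inclusive endpoints here are an assumption of this sketch, not a statement about the Spark API):

```python
def slice_stream(batches, from_ms, to_ms):
    # batches: list of (timestamp_ms, data) pairs; mimics the idea behind
    # slicing a DStream between two absolute times. Both endpoints are
    # treated as inclusive in this sketch.
    return [data for ts, data in batches if from_ms <= ts <= to_ms]

batches = [(1000, "b1"), (3000, "b2"), (5000, "b3")]
print(slice_stream(batches, 2000, 5000))  # ['b2', 'b3']
```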
Chapter Topics
§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark
Streaming
Hands-On Exercise: Process Multiple Batches With Apache Spark
Streaming
§ In this exercise, you will
– Extend an Apache Spark Streaming application to perform multi-batch
analysis on web log data
§ Please refer to the Hands-On Exercise Manual for instructions
Apache Spark Streaming: Data Sources
Chapter 20
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
Apache Spark Streaming: Data Sources
Chapter Topics
Spark Streaming Data Sources
Receiver-Based Data Sources
[Diagram: a receiver running in one executor reads from the network data source and stores incoming data as RDD partitions (rdd_0_0, rdd_0_1) across executors; the DStream is a sequence of RDDs, one per batch interval (t1, t2, t3, t4).]
Receiver-Based Replication
[Diagram: the receiver stores each RDD partition on two executors — rdd_0_0 and rdd_0_1 are each replicated to a second executor.]
Receiver-Based Fault Tolerance
[Diagram: if the executor running the receiver fails, the receiver restarts on another executor; the replicated partitions (rdd_0_0, rdd_0_1) remain available on the surviving executors.]
Managing Incoming Data
Chapter Topics
Overview: Spark Streaming with Flume
Overview: Spark Streaming with Kafka
Receiver-Based Kafka Integration (1)
§ Receiver-based
– Streams (receivers) are configured with a Kafka topic and a partition in
that topic
– To protect from data loss, enable write ahead logs (introduced in Spark
1.2)
– Scala and Java support added in Spark 1.1
– Python support added in Spark 1.3
Receiver-Based Kafka Integration (2)
§ Receiver-based
– Kafka supports partitioning of message topics for scalability
– Receiver-based streaming allows multiple receivers, each configured for
individual topic partitions
[Diagram: a receiver configured for topic partition 1 reads from the Kafka brokers and stores the incoming data as RDD partitions (rdd_1_0, rdd_1_1) across executors.]
Kafka Direct Integration (1)
Kafka Direct Integration (2)
[Diagram: with direct integration, each executor reads its own topic partition from the Kafka brokers — topic partition 0 maps to rdd_0_0 and topic partition 1 to rdd_0_1, with no separate receiver.]
Chapter Topics
Scala Example: Direct Kafka Integration (1)
import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
object StreamingRequestCount {
Scala Example: Direct Kafka Integration (2)
userreqs.print()
ssc.start()
ssc.awaitTermination()
}
}
Python Example: Direct Kafka Integration (1)
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
sc = SparkContext()
ssc = StreamingContext(sc,2)
Python Example: Direct Kafka Integration (2)
kafkaStream = KafkaUtils. \
createDirectStream(ssc, ["mytopic"], \
{"metadata.broker.list": "broker1:port,broker2:port"})
logs = kafkaStream.map(lambda (key,value): value)  # extract message values; direct streams yield (key, message) pairs
userreqs = logs \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
userreqs.pprint()
ssc.start()
ssc.awaitTermination()
Chapter Topics
Essential Points
Chapter Topics
Hands-On Exercise: Process Apache Kafka Messages with Apache
Spark Streaming
§ In this exercise, you will
– Write a Spark Streaming application to process web logs using a direct
Kafka data source
§ Please refer to the Hands-On Exercise Manual for instructions
Conclusion
Chapter 21
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
Course Objectives
Which Course to Take Next?
Cloudera offers a range of training courses for you and your team
§ For developers
– Designing and Building Big Data Applications
– Cloudera Training for Apache HBase
§ For system administrators
– Cloudera Administrator Training for Apache Hadoop
§ For data analysts and data scientists
– Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
– Data Science at Scale using Spark and Hadoop
§ For architects, managers, CIOs, and CTOs
– Cloudera Essentials for Apache Hadoop