Cloudera Developer Training Slides


Developer Training for Apache Spark and Hadoop

201611
Introduction
Chapter 1
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-3
Trademark Information

§ The names and logos of Apache products mentioned in Cloudera training courses,
including those listed below, are trademarks of the Apache Software Foundation
– Apache Accumulo
– Apache Avro
– Apache Bigtop
– Apache Crunch
– Apache Flume
– Apache Hadoop
– Apache HBase
– Apache HCatalog
– Apache Hive
– Apache Impala (incubating)
– Apache Kafka
– Apache Kudu
– Apache Lucene
– Apache Mahout
– Apache Oozie
– Apache Parquet
– Apache Pig
– Apache Sentry
– Apache Solr
– Apache Spark
– Apache Sqoop
– Apache Tika
– Apache Whirr
– Apache ZooKeeper

§ All other product names, logos, and brands cited herein are the property of their
respective owners

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-4
Chapter Topics

Introduction

§ About This Course


§ About Cloudera
§ Course Logistics
§ Introductions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-5
Course Objectives

During this course, you will learn


§ How the Apache Hadoop ecosystem fits in with the data processing
lifecycle
§ How data is distributed, stored, and processed in a Hadoop cluster
§ How to write, configure, and deploy Apache Spark applications on a
Hadoop cluster
§ How to use the Spark shell and Spark applications to explore, process, and
analyze distributed data
§ How to process and query structured data using Spark SQL
§ How to use Spark Streaming to process a live data stream
§ How to use Apache Flume and Apache Kafka to ingest data for Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-6
Chapter Topics

Introduction

§ About This Course


§ About Cloudera
§ Course Logistics
§ Introductions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-7
About Cloudera (1)

§ The leader in Apache Hadoop-based software and services


§ Founded by Hadoop experts from Facebook, Yahoo, Google, and Oracle
§ Provides support, consulting, training, and certification for Hadoop users
§ Staff includes committers to virtually all Hadoop projects
§ Many authors of industry standard books on Apache Hadoop projects

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-8
About Cloudera (2)

§ Our customers include many key users of Hadoop


§ We offer several public training courses, such as
– Cloudera Administrator Training for Apache Hadoop
– Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
– Designing and Building Big Data Applications
– Data Science at Scale using Spark and Hadoop
– Cloudera Search Training
– Cloudera Training for Apache HBase
§ On-site and customized training is also available

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-9
CDH

CDH (Cloudera’s Distribution including Apache Hadoop)


§ 100% open source, enterprise-ready distribution of Hadoop and related projects
§ The most complete, tested, and widely deployed distribution of Hadoop
§ Integrates all the key Hadoop ecosystem projects
§ Available as RPMs and Ubuntu, Debian, or SuSE packages, or as a tarball

[Diagram: the CDH stack. Process/Analyze/Serve: Batch (Spark, Hive, Pig, MapReduce), Stream (Spark), SQL (Impala), Search (Solr). Unified Services: Resource Management (YARN), Security (Sentry, RecordService). Store: Filesystem (HDFS), Relational (Kudu), NoSQL (HBase). Integrate: Batch (Sqoop), Real-Time (Kafka, Flume)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-10
Cloudera Express

§ Cloudera Express
– Completely free to
download and use
§ The best way to get started
with Hadoop
§ Includes CDH
§ Includes Cloudera Manager
– End-to-end
administration for
Hadoop
– Deploy, manage, and
monitor your cluster

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-11
Cloudera Enterprise

§ Subscription product including CDH and Cloudera Manager


§ Provides advanced features, such as
– Operational and utilization reporting
– Configuration history and rollbacks
– Rolling updates & service restarts
– External authentication (LDAP/SAML)
– Automated backup and disaster recovery
§ Specific editions offer additional capabilities, such as
– Governance and data management (Cloudera Navigator)
– Active data optimization (Cloudera Navigator Optimizer)
– Comprehensive encryption (Cloudera Navigator Encrypt)
– Key management (Cloudera Navigator Key Trustee)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-12
Chapter Topics

Introduction

§ About This Course


§ About Cloudera
§ Course Logistics
§ Introductions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-13
Logistics

§ Class start and finish times


§ Lunch
§ Breaks
§ Restrooms
§ Wi-Fi access
§ Virtual machines

Your instructor will give you details on how to access the course materials
and exercise instructions for the class

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-14
Chapter Topics

Introduction

§ About This Course


§ About Cloudera
§ Course Logistics
§ Introductions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-15
Introductions

§ About your instructor


§ About you
– Where do you work? What do you do there?
– How much Hadoop and Spark experience do you have?
– What do you expect to gain from this course?
– Which language do you plan to use in the course: Python or Scala?

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-16
Introduction to Apache Hadoop and
the Hadoop Ecosystem
Chapter 2
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-2
Introduction to Apache Hadoop and the Hadoop Ecosystem

In this chapter you will learn


§ What Apache Hadoop is and what kind of use cases it is best suited for
§ How the major components of the Hadoop ecosystem fit together
§ How to get started with the Hands-On Exercises in this course

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-3
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-4
What Is Apache Hadoop?

§ Scalable and economical data storage, processing, and analysis
– Distributed and fault-tolerant
– Harnesses the power of industry standard hardware
§ Heavily inspired by technical documents published by Google

[Diagram: the Hadoop ecosystem stack. Process/Analyze/Serve: Batch (Spark, Hive, Pig, MapReduce), Stream (Spark), SQL (Impala), Search (Solr). Unified Services: Resource Management (YARN), Security (Sentry, RecordService). Store: Filesystem (HDFS), Relational (Kudu), NoSQL (HBase). Integrate: Batch (Sqoop), Real-Time (Kafka, Flume)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-5
Common Hadoop Use Cases

§ Extract, Transform, and Load (ETL)
§ Data analysis
§ Text mining
§ Index building
§ Graph creation and analysis
§ Pattern recognition
§ Data storage
§ Collaborative filtering
§ Prediction models
§ Sentiment analysis
§ Risk assessment

§ What do these workloads have in common? Nature of the data…


– Volume
– Velocity
– Variety

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-6
Distributed Processing with Hadoop

A Hadoop cluster combines three layers:
• Processing: Apache Spark, MapReduce
• Resource Management: YARN, Apache Mesos, Spark Standalone
• Storage: HDFS, Amazon S3, Apache Kudu

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-7
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-8
Data Ingest and Storage

§ Hadoop typically ingests data from many sources and in many formats
– Traditional data management systems such as databases
– Logs and other machine generated data (event data)
– Imported files

[Diagram: data from many data sources is ingested into Hadoop data storage: HDFS, HBase, and Kudu]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-9
Data Storage: HDFS and Apache HBase

§ Hadoop Distributed File System (HDFS)


– HDFS is the main storage layer for Hadoop
– Provides inexpensive reliable storage for massive
amounts of data on industry-standard hardware
– Data is distributed when stored
§ Apache HBase: The Hadoop Database
– A NoSQL distributed database built on HDFS
– Scales to support very large amounts of data
and high throughput
– A table can have thousands of columns
– Covered in Cloudera Training for Apache HBase

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-10
Data Storage: Apache Kudu

§ Apache Kudu
– Distributed columnar (key-value) storage for structured data
– Supports random access and updating data (unlike HDFS)
– Faster sequential reads than HBase to support SQL-based analytics
– Works directly on native file system; is not built on HDFS
– Integrates with Spark, MapReduce, and Apache Impala
– Created at Cloudera, donated to Apache Software Foundation

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-11
Data Ingest Tools (1)

§ HDFS
– Direct file transfer
§ Apache Sqoop
– High speed import to HDFS from relational database (and vice versa)
– Supports many data storage systems
– Examples: Netezza, MongoDB, MySQL, Teradata, and Oracle

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-12
Data Ingest Tools (2)

§ Apache Flume
– Distributed service for ingesting streaming data
– Ideally suited for event data from multiple systems
– For example, log files
§ Apache Kafka
– A high throughput, scalable messaging system
– Distributed, reliable publish-subscribe system
– Integrates with Flume and Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-13
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-14
Apache Spark: An Engine for Large-Scale Data Processing

§ Spark is a large-scale data processing engine


– General purpose
– Runs on Hadoop clusters and processes data in HDFS
§ Supports a wide range of workloads
– Machine learning
– Business intelligence
– Streaming
– Batch processing
– Querying structured data
§ This course uses Spark for data processing
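
As a preview of the style of code covered later in the course, the minimal PySpark sketch below counts error lines in a log file stored in HDFS. The SparkContext (sc) is assumed to come from the Spark shell, and the file path is purely illustrative.

# Minimal PySpark sketch (illustrative only).
# Assumes the Spark shell provides a SparkContext named sc; the path is hypothetical.
logs = sc.textFile("/loudacre/weblogs/access.log")    # RDD of lines read from HDFS
errors = logs.filter(lambda line: "ERROR" in line)    # keep only lines containing ERROR
print(errors.count())                                 # trigger the computation and print the total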

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-15
Hadoop MapReduce: The Original Hadoop Processing Engine

§ Hadoop MapReduce is the original Hadoop


framework for processing big data
– Primarily Java-based
§ Based on the MapReduce programming model
§ The core Hadoop processing engine before Spark was introduced
§ Still the dominant technology
– But losing ground to Spark fast
§ Many existing tools are still built using MapReduce code
§ Has extensive and mature fault tolerance built into the framework

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-16
Apache Pig: Scripting for MapReduce

§ Apache Pig builds on Hadoop to offer high-level data processing


– An alternative to writing low-level MapReduce code
– Especially good at joining and transforming data
§ The Pig interpreter runs on the client machine
– Turns Pig Latin scripts into MapReduce or Spark jobs
– Submits those jobs to a Hadoop cluster

people = LOAD '/user/training/customers' AS (cust_id, name);


orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-17
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-18
Apache Impala (Incubating): High-Performance SQL

§ Impala is a high-performance SQL engine


– Runs on Hadoop clusters
– Data stored in HDFS files, or in HBase or Kudu tables
– Inspired by Google’s Dremel project
– Very low latency—measured in milliseconds
– Ideal for interactive analysis
§ Impala supports a dialect of SQL (Impala SQL)
– Data in HDFS modeled as database tables
§ Impala was developed by Cloudera
– Donated to the Apache Software Foundation, where it is incubating
– 100% open source, released under the Apache software license

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-19
Apache Hive: SQL on MapReduce or Spark

§ Hive is an abstraction layer on top of Hadoop


– Hive uses a SQL-like language called HiveQL
– Similar to Impala SQL
– Useful for data processing and ETL
– Impala is preferred for interactive analytics
§ Hive executes queries using MapReduce or Spark

SELECT zipcode, SUM(cost) AS total


FROM customers
JOIN orders
ON (customers.cust_id = orders.cust_id)
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-20
Cloudera Search: A Platform for Data Exploration

§ Interactive full-text search for data in a Hadoop cluster


§ Allows non-technical users to access your data
– Nearly everyone can use a search engine
§ Cloudera Search enhances Apache Solr
– Integrates Apache Solr with HDFS, MapReduce,
HBase, and Flume
– Supports file formats widely used with Hadoop
– Dynamic web-based dashboard interface with Hue
§ Cloudera Search is 100% open source

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-21
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-22
Apache Oozie: Workflow Management

§ Oozie
– Workflow engine for Hadoop jobs
– Defines dependencies between jobs
§ The Oozie server submits the jobs to the cluster in the correct sequence

[Diagram: an example Oozie workflow. Start workflow; check whether the web server logs are in HDFS (import sales data with Sqoop, or send e-mail to the administrator); check whether today is Sunday (generate weekly reports with Hive, or process data with Spark); end workflow]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-23
Hue: The UI for Hadoop

§ Hue = Hadoop User Experience


§ Hue provides a web front-end to Hadoop
– Upload and browse data in HDFS
– Query tables in Impala and Hive
– Run Spark jobs, Pig jobs, and Oozie workflows
– Build an interactive Cloudera Search dashboard
– And much more
§ Makes Hadoop easier to use
§ Created by Cloudera
– 100% open source
– Released under Apache license

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-24
Apache Sentry: Hadoop Security

§ Sentry provides fine-grained access control


(authorization) to various Hadoop ecosystem
components
– Impala
– Hive
– Cloudera Search
– HDFS
§ In conjunction with Kerberos authentication, Sentry
authorization provides an overall cluster security
solution
§ Created by Cloudera
– Donated to Apache Software Foundation
– Now an open-source Apache project

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-25
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-26
Introduction to the Hands-On Exercises

§ The best way to learn is to do!


§ Most topics in this course have Hands-On Exercises to practice the skills
you have learned in the course
§ The exercises are based on a hypothetical scenario
– However, the concepts apply to nearly any organization
§ Loudacre Mobile is a (fictional) fast-growing wireless carrier
– Provides mobile service to customers throughout western USA


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-27
Scenario Explanation

§ Loudacre needs to migrate their existing infrastructure to Hadoop


– The size and velocity of their data has exceeded their ability to process
and analyze their data
§ Loudacre data sources
– MySQL database: customer account data (name, address, phone
numbers, and devices)
– Apache web server logs from Customer Service site
– HTML files: Knowledge Base articles
– XML files: Device activation records
– Real-time device status logs
– Base station files: Cell tower locations

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-28
Introduction to Exercises: Classroom Virtual Machine

§ Your virtual machine


– Log in as user training (password training)
– Home directory is /home/training (often referenced as ~)
– Pre-installed and configured with
– Spark and CDH (Cloudera’s Distribution, including Apache Hadoop)
– Various tools including Firefox, gedit, Emacs, Eclipse, and Apache
Maven
§ Training materials: ~/training_materials/devsh folder on the VM
– examples: all the example code in this course
– exercises: starter files, scripts and solutions for the Hands-On
Exercises
– scripts: course setup scripts
§ Course data: ~/training_materials/data

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-29
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-30
Essential Points

§ Hadoop is a framework for distributed storage and processing


§ Core Hadoop includes HDFS for storage and YARN for cluster resource
management
§ The Hadoop ecosystem includes many components for
– Ingesting data (Flume, Sqoop, Kafka)
– Storing data (HDFS, HBase, Kudu)
– Processing data (Spark, Hadoop MapReduce, Pig)
– Modeling data as tables for SQL access (Impala, Hive)
– Exploring data (Hue, Search)
– Protecting Data (Sentry)
§ Hands-On Exercises let you practice and refine your Hadoop skills

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-31
Bibliography

The following offer more information on topics discussed in this chapter


§ Hadoop: The Definitive Guide (published by O’Reilly)
– http://tiny.cloudera.com/hadooptdg
§ Cloudera Essentials for Apache Hadoop—free online training
– http://tiny.cloudera.com/esscourse

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-32
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-33
Hands-on Exercise: Query Hadoop Data with Apache Impala

§ In this exercise, you will


– Use the Hue Impala Query Editor to explore data in a Hadoop cluster
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-34
Apache Hadoop File Storage
Chapter 3
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-2
Apache Hadoop File Storage

In this chapter you will learn


§ How the Apache Hadoop Distributed File System (HDFS) stores data across
a cluster
§ How to use HDFS using the Hue File Browser or the hdfs command
§ What the major supported file storage formats in Hadoop are, and how to
choose which to use

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-3
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-4
Hadoop Cluster Terminology

§ A cluster is a group of computers working together


– Provides data storage, data processing, and resource management
§ A node is an individual computer in the cluster
– Master nodes manage distribution of work and data to worker nodes
§ A daemon is a program running on a node
– Each performs different functions in the cluster

[Diagram: a cluster with master nodes and multiple worker nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-5
Cluster Components

§ Three main components of a cluster


§ Work together to provide distributed data processing
§ The first topic is the Storage component
– HDFS
[Diagram: the three cluster components: Storage, Processing, and Resource Management]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-6
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-7
HDFS Basic Concepts (1)

§ HDFS is a file system written in Java


– Based on Google File System
§ Sits on top of a native file system
– Such as ext3, ext4, or xfs
§ Provides redundant storage for massive amounts of data
– Using readily-available, industry-standard computers

[Diagram: HDFS layered on top of the native OS file system and disk storage]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-8
HDFS Basic Concepts (2)

§ HDFS performs best with a “modest” number of large files


– Millions, rather than billions, of files
– Each file typically 100MB or more
§ Files in HDFS are “write once”
– No random writes to files are allowed
§ HDFS is optimized for large, streaming reads of files
– Rather than random reads

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-9
How Files Are Stored

§ Data files are split into blocks (default 128MB) which are distributed at
load time
§ Each block is replicated on multiple data nodes (default 3x)
§ NameNode stores metadata
– Information about files and blocks

[Diagram: a very large data file is split into blocks 1 through 4; each block is replicated on several data nodes, and the NameNode holds the metadata about files and blocks]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-10
Example: Storing and Retrieving Files (1)

[Diagram: two local files, /logs/031515.log and /logs/042316.log, to be stored on a Hadoop cluster of five nodes (A through E)]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-11
Example: Storing and Retrieving Files (2)

[Diagram: the files are split into blocks and replicated across Nodes A through E. The NameNode metadata records /logs/031515.log: B1, B2, B3 and /logs/042316.log: B4, B5, along with each block's locations (B1: A,B,D; B2: B,D,E; B3: A,B,C; B4: A,B,E; B5: C,E,D)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-12
Example: Storing and Retrieving Files (3)

[Diagram: a client asks the NameNode for /logs/042316.log; the NameNode returns the file's block list (B4, B5) and their locations]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-13
Example: Storing and Retrieving Files (4)

[Diagram: using the block locations returned by the NameNode, the client reads blocks B4 and B5 directly from the data nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-14
HDFS NameNode Availability

§ The NameNode daemon must be running at all times


– If the NameNode stops, the cluster becomes inaccessible

§ HDFS is typically set up for High Availability
– Two NameNodes: Active and Standby
– Standby NameNode takes over automatically if the Active NameNode fails
§ Small clusters may use classic mode
– One NameNode
– One “helper” node called the Secondary NameNode
– Bookkeeping, not backup
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-15
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-16
Options for Accessing HDFS
§ From the command line
$ hdfs dfs
§ In Spark
– By URI—for example: hdfs://nnhost:port/file…
§ Other programs
– Java API
– Used by Hadoop tools such as MapReduce, Impala, Hue, Sqoop, Flume
– RESTful interface

[Diagram: a client uses put and get to transfer files to and from the HDFS cluster]
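
For instance, a Spark program can name a file by its full HDFS URI as shown above. This short PySpark sketch assumes a running SparkContext (sc) and a hypothetical NameNode host, port, and path.

# Read a file by its full HDFS URI (host, port, and path are hypothetical).
data = sc.textFile("hdfs://nnhost:8020/user/training/foo.txt")
print(data.first())    # show the first line to confirm the read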

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-17
HDFS Command Line Examples (1)

§ Copy file foo.txt from local disk to the user’s directory in HDFS

$ hdfs dfs -put foo.txt foo.txt

– This will copy the file to /user/username/foo.txt


§ Get a directory listing of the user’s home directory in HDFS

$ hdfs dfs -ls

§ Get a directory listing of the HDFS root directory

$ hdfs dfs -ls /

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-18
HDFS Command Line Examples (2)

§ Display the contents of the HDFS file /user/fred/bar.txt

$ hdfs dfs -cat /user/fred/bar.txt

§ Copy that file to the local disk, named as baz.txt

$ hdfs dfs -get /user/fred/bar.txt baz.txt

§ Create a directory called input under the user’s home directory

$ hdfs dfs -mkdir input

Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-19
HDFS Command Line Examples (3)

§ Delete a file

$ hdfs dfs -rm input_old/file1

§ Delete a set of files using a wildcard

$ hdfs dfs -rm input_old/*

§ Delete the directory input_old and all its contents

$ hdfs dfs -rm -r input_old

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-20
The Hue HDFS File Browser

§ The File Browser in Hue lets you view and manage your HDFS directories
and files
– Create, move, rename, modify, upload, download, and delete
directories and files
– View file contents

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-21
HDFS Recommendations

§ HDFS is a repository for your data


– Structure and organize carefully!
§ Best practices include
– Define a standard directory structure
– Include separate locations for staging data
§ Example organization
– /user/…—data and configuration belonging only to a single user
– /etl—work in progress in Extract/Transform/Load stage
– /tmp—temporary generated data shared between users
– /data—data sets that are processed and available across the
organization for analysis
– /app—non-data files such as configuration, JAR files, SQL files

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-22
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-23
Hadoop Data Storage Formats

§ Hadoop and the tools in the Hadoop ecosystem use several different file
formats to store data
§ The most common are
– Text
– SequenceFiles
– Apache Avro data format
– Apache Parquet
§ Which formats to use depend on your use case and which tools you use
§ You can also define custom formats
§ HDFS considers files to be simply a sequence of bytes

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-24
Hadoop File Formats: Text Files

§ Text files are the most basic file type in Hadoop


– Can be read or written from virtually any programming language
– Comma- and tab-delimited files are compatible with many applications
§ Text files are human readable since all data is in string format
– Useful when debugging
§ Text files are inefficient at scale
– Representing numeric values as strings wastes storage space
– Difficult to represent binary data such as images
– Often resort to techniques such as Base64 encoding
– Conversion to/from native types adds performance penalty
§ Verdict: Good interoperability, but poor performance
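
To illustrate the interoperability point, a comma-delimited text file can be parsed with ordinary string handling. The PySpark sketch below assumes a SparkContext (sc) and a hypothetical file of name,salary,city records.

# Parse a hypothetical comma-delimited file into lists of string fields.
lines = sc.textFile("/loudacre/people.csv")
fields = lines.map(lambda line: line.split(","))
# Numeric values arrive as strings and must be converted explicitly
# (the conversion penalty mentioned above).
salaries = fields.map(lambda f: int(f[1]))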

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-25
Hadoop File Formats: SequenceFiles

§ SequenceFiles store key-value pairs in a binary container format


– Less verbose and more efficient than text files
– Capable of storing binary data such as images
– Format is Java-specific and tightly coupled to Hadoop
§ Verdict: Better performance, but poor interoperability
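
Within the Hadoop ecosystem, however, SequenceFiles remain easy to read. For example, Spark can load one directly as key-value pairs; this sketch assumes a SparkContext (sc) and a hypothetical file path.

# Read a hypothetical SequenceFile as an RDD of (key, value) pairs.
pairs = sc.sequenceFile("/loudacre/mydata_seqfile")
print(pairs.take(2))    # show the first two records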

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-26
Hadoop File Formats: Apache Avro Data Files

§ Efficient storage due to optimized binary encoding


§ Widely supported throughout the Hadoop ecosystem
– Can also be used outside of Hadoop
§ Ideal for long-term storage of important data
– Many languages can read and write Avro files
– Embeds schema in the file, so will always be readable
– In JSON format and not Java-specific
– Schema evolution can accommodate changes
§ Verdict: Excellent interoperability and performance
– Best choice for general-purpose storage in Hadoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-27
Inspecting Avro Data Files with Avro Tools

§ Avro data files are an efficient way to store data


– However, the binary format makes debugging difficult
§ Use the avro-tools command to work with binary files
– Allows you to read the schema or data for an Avro file

$ avro-tools tojson mydatafile.avro


{"name":"Alice","salary": 56500,"city":"Anaheim"}
{"name":"Bob","salary": 51400,"city":"Bellevue"}

$ avro-tools getschema mydatafile.avro


{
"type" : "record",
"name" : "DeviceData",
"namespace" : "com.loudacre.data",
(remainder of schema omitted)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-28
Columnar Formats

§ Hadoop also supports a few columnar formats


– These organize data storage by column, rather than by row
– Very efficient when selecting only a subset of a table’s columns

id  name   city       occupation
1   Alice  Palo Alto  Accountant
2   Bob    Sunnyvale  Accountant
3   Bob    Palo Alto  Dentist
4   Bob    Palo Alto  Manager
5   Carol  Palo Alto  Manager
6   David  Sunnyvale  Mechanic

[Diagram: the same table stored two ways: traditional row-based formats lay the data out row by row, while columnar formats lay it out column by column]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-29
Hadoop File Formats: Apache Parquet Files

§ Parquet is a columnar format developed by Cloudera and Twitter


– Supported in Spark, MapReduce, Hive, Pig, Impala, and others
– Schema metadata is embedded in the file (like Avro)
§ Uses advanced optimizations described in Google’s Dremel paper
– Reduces storage space
– Increases performance
§ Most efficient when adding many records at once
– Some optimizations rely on identifying repeated patterns
§ Verdict: Excellent interoperability and performance
– Best choice for column-based access patterns
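
As an illustration of Parquet's Spark support, the sketch below reads a Parquet file into a DataFrame and writes a filtered copy back out. It assumes a SQLContext named sqlContext (as provided by the Spark shell in Spark 1.x) and hypothetical paths and column names.

# Read a Parquet file into a DataFrame and write a filtered subset back out.
people = sqlContext.read.parquet("/loudacre/people_parquet")
managers = people.filter(people.occupation == "Manager")    # hypothetical column
managers.write.parquet("/loudacre/managers_parquet")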

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-30
Inspecting Parquet Files with Parquet Tools

§ Parquet files are binary files


– Binary format makes debugging difficult
§ Use parquet-tools command to work with binary files
– Allows you to read the schema or data for a Parquet file

$ parquet-tools head mydatafile.parquet



name = Alice
salary = 56500

$ parquet-tools schema mydatafile.parquet



optional binary name (UTF8);
optional int32 salary;

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-31
Data Format Summary

Feature                     Text   SequenceFiles   Avro   Parquet
Supported by many tools      ✓                      ✓       ✓
Good performance at scale            ✓              ✓       ✓
Binary format                        ✓              ✓       ✓
Embedded schema                                     ✓       ✓
Columnar organization                                       ✓

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-32
Data Compression

§ Each file format may also support compression


– This reduces amount of disk space required to store data
§ Compression is a tradeoff between CPU time and bandwidth/storage
space
– Aggressive algorithms take a long time, but save more space
– Less aggressive algorithms save less space but are much faster
§ Can significantly improve performance
– Many Hadoop jobs are I/O-bound
– Using compression allows you to handle more data per I/O operation
– Compression can also improve the performance of network transfers

[Chart: physical area on disk required to store a given amount of data, compressed versus uncompressed]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-33
Compression Codecs

§ The implementation of a compression algorithm is known as a codec


– Short for compressor/decompressor
§ Many codecs are commonly used with Hadoop
– Each has different performance characteristics
– Not all Hadoop tools are compatible with all codecs
§ Overall, BZip2 saves the most space
– But LZ4 and Snappy are much faster
– Impala supports Snappy but not LZ4
§ For “hot” data, speed matters most
– Better to compress by 40% in one second than by 80% in 10 seconds

[Chart: time required versus space saved. BZip2 and GZip save the most space but take the longest; LZO, Snappy, and LZ4 are much faster but save less space]
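
As one example of how a codec is selected in practice, Spark can compress saved output by naming a Hadoop codec class. The sketch below assumes a SparkContext (sc) and hypothetical input and output paths, and uses Snappy purely as an illustration.

# Save an RDD as Snappy-compressed text files (paths are hypothetical).
data = sc.textFile("/loudacre/weblogs")
data.saveAsTextFile("/loudacre/weblogs_snappy",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec")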

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-34
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-35
Essential Points

§ The Hadoop Distributed File System (HDFS) is the main storage layer for
Hadoop
§ HDFS chunks data into blocks and distributes them across the cluster when
data is stored
§ HDFS clusters are managed by a single NameNode running on a master
node
§ Access HDFS using Hue, the hdfs command, or the HDFS API
§ The Hadoop ecosystem supports several different file formats

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-36
Bibliography

The following offer more information on topics discussed in this chapter


§ HDFS User Guide
– http://tiny.cloudera.com/hdfsuser
§ Hue website
– http://gethue.com/

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-37
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-38
Hands-On Exercise: Access HDFS with the Command Line and
Hue
§ In this exercise, you will
– Create a /loudacre base directory for course exercises
– Practice uploading and viewing data files
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-39
Data Processing on an Apache Hadoop
Cluster
Chapter 4
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-2
Data Processing on an Apache Hadoop Cluster

In this chapter you will learn


§ How Hadoop YARN provides cluster resource management for distributed
data processing
§ How to use Hue, the YARN web UI, or the yarn command to monitor your
cluster

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-3
Chapter Topics

Data Processing on an Apache


Hadoop Cluster

§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-4
What Is YARN?
§ YARN = Yet Another Resource Negotiator
§ YARN is the Hadoop processing layer that contains
– A resource manager
– A job scheduler

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-5
YARN Daemons

§ ResourceManager (RM)
– Runs on master node
– Global resource scheduler
– Arbitrates system resources between competing applications
– Has a pluggable scheduler to support different algorithms (such as Capacity or Fair Scheduler)

§ NodeManager (NM)
– Runs on worker nodes
– Communicates with RM
– Manages node resources
– Launches containers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-6
Running Applications on YARN

§ Containers
– Containers allocate a certain amount of resources (memory, CPU cores) on a worker node
– Applications run in one or more containers
– Clients request containers from RM

§ ApplicationMaster (AM)
– One per application
– Framework/application specific
– Runs in a container
– Requests more containers to run application tasks

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-7
Running an Application on YARN (1)

[Diagram: a YARN cluster at rest. Each worker node runs a NodeManager and a DataNode; the ResourceManager and NameNode run on master nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-8
Running an Application on YARN (2)

[Diagram: a client submits my-hadoop-app; the ResourceManager launches an ApplicationMaster in a container on one of the worker nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-9
Running an Application on YARN (3)

[Diagram: the ApplicationMaster sends a resource request to the ResourceManager, for example 1 x Node1/1GB/1 core and 1 x Node2/1GB/1 core]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-10
Running an Application on YARN (4)

[Diagram: the ResourceManager grants the request: “Here are your containers”]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-11
Running an Application on YARN (5)

[Diagram: the application’s tasks run in the allocated containers on the worker nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-12
Running an Application on YARN (6)

[Diagram: when the application finishes, the ApplicationMaster tells the ResourceManager “I’m done!”]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-13
Chapter Topics

Data Processing on an Apache


Hadoop Cluster

§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-14
Working with YARN

§ Developers need to be able to


– Submit jobs (applications) to run on the YARN cluster
– Monitor and manage jobs
§ There are three major YARN tools for developers
– The Hue Job Browser
– The YARN web UI
– The YARN command line
§ YARN administrators can use Cloudera Manager
– May also be helpful for developers
– Included in Cloudera Express and Cloudera Enterprise

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-15
The Hue Job Browser

§ The Hue Job Browser allows you to


– Monitor the status of a job
– View the logs
– Kill a running job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-16
The YARN Web UI

§ The ResourceManager UI is the main entry point


– Runs on the RM host on port 8088 by default
§ Provides more detailed view than Hue
§ Does not provide any control or configuration

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-17
ResourceManager UI: Nodes

[Screenshot: the ResourceManager UI Nodes page, showing a cluster overview, a list of all nodes in the cluster, and a link to each NodeManager UI]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-18
ResourceManager UI: Applications

[Screenshot: the ResourceManager UI Applications page, showing a cluster overview, a list of running and recent applications, and links to application details (next slide)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-19
ResourceManager UI: Application Detail

[Screenshot: the application detail page, with a link to the ApplicationMaster UI (which depends on the specific framework) and a link to view aggregated log files (optional)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-20
History Server

§ YARN does not keep track of job history
§ Spark and MapReduce each provide a history server
– Archives jobs’ metrics and metadata
– Can be accessed through the history server UI or Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-21
YARN Command Line (1)

§ Command to configure and view information about the YARN cluster

$ yarn <command>

§ Most YARN commands are for administrators rather than developers


§ Some helpful commands for developers
– List running applications
$ yarn application -list

– Kill a running application


$ yarn application -kill <app-id>

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-22
YARN Command Line (2)

– View the logs of the specified application

$ yarn logs -applicationId <app-id>

– View the full list of command options

$ yarn -help

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-23
Cloudera Manager

§ Cloudera Manager provides a greater ability to monitor and configure a cluster from a single location
– Covered in Cloudera Administrator Training for Apache Hadoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-24
Chapter Topics

Data Processing on an Apache


Hadoop Cluster

§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-25
Essential Points

§ YARN manages resources in a Hadoop cluster and schedules jobs


§ YARN works with HDFS to run tasks where the data is stored
§ Worker nodes run NodeManager daemons, managed by a
ResourceManager on a master node
§ Monitor jobs using Hue, the YARN web UI, or the yarn command

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-26
Bibliography

The following offer more information on topics discussed in this chapter


§ Hadoop Application Architectures: Designing Real-World Big Data
Applications (published by O’Reilly)
– http://tiny.cloudera.com/archbook
§ YARN documentation
– http://tiny.cloudera.com/yarndocs
§ Cloudera Engineering Blog YARN articles
– http://tiny.cloudera.com/yarnblog

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-27
Chapter Topics

Data Processing on an Apache


Hadoop Cluster

§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-28
Hands-On Exercise: Run a YARN Job

§ In this exercise, you will


– Use the YARN web UI to view your YARN cluster “at rest”
– Submit an application to run on the cluster
– Monitor the job using both the YARN UI and Hue
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-29
Importing Relational Data with Apache
Sqoop
Chapter 5
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-2
Importing Relational Data with Apache Sqoop

In this chapter you will learn


§ How to import tables from an RDBMS into your Hadoop cluster
§ How to change the delimiter and file format of imported tables
§ How to control which tables, columns, and rows are imported

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-3
Chapter Topics

Importing Relational Data with


Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-4
What Is Apache Sqoop?

§ Open source Apache project originally developed by Cloudera


– The name is a contraction of “SQL-to-Hadoop”
§ Sqoop exchanges data between a database and HDFS
– Can import all tables, a single table, or a partial table into HDFS
– Data can be imported in a variety of formats
– Sqoop can also export data from HDFS to a database

[Diagram: Sqoop moves data between a database and a Hadoop cluster]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-5
How Does Sqoop Work?

§ Sqoop is a client-side application that imports data using Hadoop MapReduce
§ A basic import involves three steps orchestrated by Sqoop
– Examine table details
– Create and submit job to cluster
– Fetch records from table and write this data to HDFS
[Diagram: the user runs Sqoop, which (1) examines the table on the database server, (2) creates and submits a job to the Hadoop cluster, and (3) the job fetches records from the table and writes them to HDFS]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-6
Basic Syntax

§ Sqoop is a command-line utility with several subcommands, called tools


– There are tools for import, export, listing database contents, and more
– Run sqoop help to see a list of all tools
– Run sqoop help tool-name for help on using a specific tool
§ Basic syntax of a Sqoop invocation

$ sqoop tool-name [tool-options]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-7
Exploring a Database with Sqoop

§ This command will list all tables in the loudacre database in MySQL

$ sqoop list-tables \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser \
--password pw

§ You can perform database queries using the eval tool

$ sqoop eval \
--query "SELECT * FROM my_table LIMIT 5" \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser \
--password pw

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-8
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-9
Overview of the Import Process

§ Imports are performed using Hadoop MapReduce jobs


§ Sqoop begins by examining the table to be imported
– Determines the primary key, if possible
– Runs a boundary query to determine the range of values in the column used to split the data (the primary key by default)
– Divides that range among the tasks (mappers)
– Uses this to configure the tasks so that they have roughly equal loads
§ Sqoop also generates a Java source file for each table being imported
– It compiles and uses this during the import process
– The file remains after import, but can be safely deleted

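For example, you can influence how Sqoop splits the work by choosing the split column and the number of map tasks with the --split-by and --num-mappers options. This is a minimal sketch using the same connection options as the other examples in this chapter; the column name acct_num is hypothetical:

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --split-by acct_num \
    --num-mappers 8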
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-10
Importing an Entire Database with Sqoop

§ The import-all-tables tool imports an entire database


– Stored as comma-delimited files
– Default base location is your HDFS home directory
– Data will be in subdirectories corresponding to name of each table

$ sqoop import-all-tables \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw

§ Use the --warehouse-dir option to specify a different base directory

$ sqoop import-all-tables \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--warehouse-dir /loudacre

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-11
Importing a Single Table with Sqoop

§ The import tool imports a single table


§ This example imports the accounts table
– It stores the data in HDFS as comma-delimited fields

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-12
Importing Partial Tables with Sqoop

§ Import only specified columns from accounts table

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--columns "id,first_name,last_name,state"

§ Import only matching rows from accounts table

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--where "state='CA'"

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-13
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-14
Specifying a File Location

§ By default, Sqoop stores the data in the user’s HDFS home directory
– In a subdirectory corresponding to the table name
– For example /user/training/accounts
§ This example specifies an alternate location

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--target-dir /loudacre/customer_accounts

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-15
Specifying an Alternate Delimiter

§ By default, Sqoop generates text files with comma-delimited fields


§ This example writes tab-delimited fields instead

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--fields-terminated-by "\t"

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-16
Using Compression with Sqoop

§ Sqoop supports storing data in a compressed file


– Use the --compression-codec flag

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--compression-codec \
org.apache.hadoop.io.compress.SnappyCodec

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-17
Storing Data in Other Data Formats

§ By default, Sqoop stores data in text format files


§ Sqoop supports importing data as Parquet or Avro files

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--as-parquetfile

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--as-avrodatafile

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-18
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-19
Exporting Data from Hadoop to RDBMS with Sqoop

§ Sqoop's import tool pulls records from an RDBMS into HDFS


§ It is sometimes necessary to push data in HDFS back to an RDBMS
– Good solution when you must do batch processing on large data sets
– Export results to a relational database for access by other systems
§ Sqoop supports this via the export tool
– The RDBMS table must already exist prior to export

$ sqoop export \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--export-dir /loudacre/recommender_output \
--update-mode allowinsert \
--table product_recommendations

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-20
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-21
Essential Points

§ Sqoop exchanges data between a database and a Hadoop cluster


– Provides subcommands (tools) for importing, exporting, and more
§ Tables are imported using MapReduce jobs
– These are written as comma-delimited text by default
– You can specify alternate delimiters or file formats
§ Sqoop provides many options to control imports
– You can select only certain columns or limit rows

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-22
Bibliography

The following offer more information on topics discussed in this chapter


§ Sqoop User Guide
– http://tiny.cloudera.com/sqoopuser
§ Apache Sqoop Cookbook (published by O’Reilly)
– http://tiny.cloudera.com/sqoopcookbook

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-23
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-24
Hands-On Exercise: Import Data from MySQL Using Apache
Sqoop
§ In this exercise, you will
– Use Sqoop to import customer account data from an RDBMS to HDFS
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-25
Apache Spark Basics
Chapter 6
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-2
Apache Spark Basics

In this chapter you will learn


§ How to start the Spark shell
§ How to use the Spark context to access Spark functionality
§ Key concepts of Resilient Distributed Datasets (RDDs)
– What are they?
– How do you create them?
– What operations can you perform with them?
§ How Spark uses the principles of functional programming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-3
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-4
What Is Apache Spark?

§ Apache Spark is a fast and general engine for large-scale


data processing
§ Written in Scala
– Functional programming language that runs in a JVM
§ Spark shell
– Interactive—for learning or data exploration
– Python or Scala
§ Spark applications
– For large scale data processing
– Python, Scala, or Java

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-5
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-6
Spark Shell

§ The Spark shell provides interactive data exploration (REPL)

Python shell: pyspark

$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.8 (default, …)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

Scala shell: spark-shell

$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Spark context available as sc (master = …)
SQL context available as sqlContext.

scala>

REPL: Read/Evaluate/Print Loop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-7
Spark Context

§ Every Spark application requires a Spark context


– The main entry point to the Spark API
§ The Spark shell provides a preconfigured Spark context called sc
Language: Python
Using Python version 2.7.8 (default, Apr 19 2016 07:37:49)
SparkContext available as sc, HiveContext available as sqlContext.

>>> sc.appName
u'PySparkShell'

Language: Scala

Spark context available as sc (master = …)
SQL context available as sqlContext.

scala> sc.appName
res0: String = Spark shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-8
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-9
RDD (Resilient Distributed Dataset)

§ RDD (Resilient Distributed Dataset)


– Resilient: If data in memory is lost, it can be recreated
– Distributed: Processed across the cluster
– Dataset: Initial data can come from a source such as a file, or it can be
created programmatically
§ RDDs are the fundamental unit of data in Spark
§ Most Spark programming consists of performing operations on RDDs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-10
Creating an RDD

§ Three ways to create an RDD


– From a file or set of files
– From data in memory
– From another RDD

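A minimal sketch in the Python shell illustrating all three approaches (the file and variable names are hypothetical):

Language: Python
> rdd_from_file = sc.textFile("purplecow.txt")       # from a file
> rdd_from_mem  = sc.parallelize([1, 2, 3, 4])       # from data in memory
> rdd_from_rdd  = rdd_from_mem.map(lambda x: x * 2)  # from another RDD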
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-11
Example: A File-Based RDD

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
…
15/01/29 06:20:37 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 151.4 KB, free 296.8 MB)
> mydata.count()
…
15/01/29 06:27:37 INFO spark.SparkContext: Job finished: take at <stdin>:1, took 0.160482078 s
4

[RDD: mydata contains one element per line of the file]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-12
RDD Operations

§ Two broad types of RDD operations
– Actions return values (RDD → value)
– Transformations define a new RDD based on the current one(s) (base RDD → new RDD)
§ Which type of operation is count()?

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-13
RDD Operations: Actions

§ Some common actions
– count() returns the number of elements
– take(n) returns an array of the first n elements
– collect() returns an array of all elements
– saveAsTextFile(dir) saves to text file(s)

Language: Python
> mydata = sc.textFile("purplecow.txt")
> mydata.count()
4
> for line in mydata.take(2):
    print line
I've never seen a purple cow.
I never hope to see one;

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> mydata.count()
4
> for (line <- mydata.take(2))
    println(line)
I've never seen a purple cow.
I never hope to see one;

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-14
RDD Operations: Transformations

§ Transformations create a new RDD from


Base RDD New RDD
an existing one
§ RDDs are immutable
– Data in an RDD is never changed
– Transform in sequence to modify the
data as needed
§ Two common transformations
– map(function)creates a new RDD by performing a function on
each record in the base RDD
– filter(function)creates a new RDD by including or excluding
each record in the base RDD according to a Boolean function

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-15
Example: map and filter Transformations
Starting data:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

map (Python: map(lambda line: line.upper()) / Scala: map(line => line.toUpperCase))

I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

filter (Python: filter(lambda line: line.startswith('I')) / Scala: filter(line => line.startsWith("I")))

I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-16
Lazy Execution
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ Data in RDDs is not processed until an action is performed

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.count()
3

[Animation across slides 06-17 to 06-21: each statement only defines an RDD (mydata, mydata_uc, mydata_filt); no data is read or transformed until count() triggers execution, at which point the file lines are loaded, uppercased, and filtered]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-17 to 06-21
Chaining Transformations (Scala)

§ Transformations may be chained together

> val mydata = sc.textFile("purplecow.txt")


> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.count()
3

is equivalent to

> sc.textFile("purplecow.txt").map(line => line.toUpperCase()).


filter(line => line.startsWith("I")).count()
3

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-22
Chaining Transformations (Python)

§ Same example in Python

> mydata = sc.textFile("purplecow.txt")


> mydata_uc = mydata.map(lambda s: s.upper())
> mydata_filt = mydata_uc.filter(lambda s: s.startswith('I'))
> mydata_filt.count()
3

is exactly equivalent to

> sc.textFile("purplecow.txt").map(lambda line: line.upper()) \


.filter(lambda line: line.startswith('I')).count()
3

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-23
RDD Lineage and toDebugString (Scala)
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ Spark maintains each RDD’s lineage—the previous RDDs on which it depends
§ Use toDebugString to view the lineage of an RDD

> val mydata_filt =
    sc.textFile("purplecow.txt").
    map(line => line.toUpperCase()).
    filter(line => line.startsWith("I"))
> mydata_filt.toDebugString
(2) FilteredRDD[7] at filter …
 |  MappedRDD[6] at map …
 |  purplecow.txt MappedRDD[5] …
 |  purplecow.txt HadoopRDD[4] …

[Diagram: lineage chain RDD[5] → RDD[6] → RDD[7]]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-24
RDD Lineage and toDebugString (Python)

§ toDebugString output is not displayed as nicely in Python

> mydata_filt.toDebugString()
(1) PythonRDD[8] at RDD at …\n | purplecow.txt MappedRDD[7] at textFile
at …[]\n | purplecow.txt HadoopRDD[6] at textFile at …[]

§ Use print for prettier output

> print mydata_filt.toDebugString()


(1) PythonRDD[8] at RDD at …
| purplecow.txt MappedRDD[7] at textFile at …
| purplecow.txt HadoopRDD[6] at textFile at …

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-25
Pipelining
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ When possible, Spark performs sequences of transformations element by element, so no intermediate data is stored

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line =>
    line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line
    => line.startsWith("I"))
> mydata_filt.take(2)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;

[Animation across slides 06-26 to 06-33: each line is read, uppercased, and filtered one element at a time; processing stops as soon as take(2) has collected two matching results]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-26 to 06-33
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-34
Functional Programming in Spark

§ Spark depends heavily on the concepts of functional programming


– Functions are the fundamental unit of programming
– Functions have input and output only
– No state or side effects
§ Key concepts
– Passing functions as input to other functions
– Anonymous functions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-35
Passing Functions as Parameters

§ Many RDD operations take functions as parameters


§ Pseudocode for the RDD map operation
– Applies function fn to each record in the RDD

RDD {
map(fn(x)) {
foreach record in rdd
emit fn(record)
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-36
Example: Passing Named Functions

Language: Python
> def toUpper(s):
    return s.upper()
> mydata = sc.textFile("purplecow.txt")
> mydata.map(toUpper).take(2)

Language: Scala
> def toUpper(s: String): String =
{ s.toUpperCase }
> val mydata = sc.textFile("purplecow.txt")
> mydata.map(toUpper).take(2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-37
Anonymous Functions

§ Functions defined inline without an identifier


– Best for short, one-off functions
§ Supported in many programming languages
– Python: lambda x: ...
– Scala: x => ...
– Java 8: x -> ...

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-38
Example: Passing Anonymous Functions

§ Python:

> mydata.map(lambda line: line.upper()).take(2)

§ Scala:

> mydata.map(line => line.toUpperCase()).take(2)

OR

> mydata.map(_.toUpperCase()).take(2)

Scala allows anonymous


parameters using underscore (_)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-39
Example: Java

Language: Java 7
...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc =
    lines.map(new Function<String,String>() {
@Override
public String call(String s) {
return (s.toUpperCase());
}
});
...

Language: Java 8
...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
line -> line.toUpperCase());
...

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-40
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-41
Essential Points

§ Spark can be used interactively via the Spark shell


– Python or Scala
§ RDDs (Resilient Distributed Datasets) are a key concept in Spark
§ RDD Operations
– Transformations create a new RDD based on an existing one
– Actions return a value from an RDD
§ Lazy Execution
– Transformations are not executed until required by an action
§ Spark uses functional programming
– Passing functions as parameters
– Anonymous functions in supported languages (Python, Scala, Java 8)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-42
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-43
Introduction to Spark Exercises: Choose Your Language

§ Your choice: Python or Scala


– For the Spark-based exercises in this course, you may choose to work
with either Python or Scala
§ Solution and example files
– .pyspark: Python commands that can be copied into the PySpark
shell
– .scalaspark: Scala commands that can be copied into the Scala
Spark shell
– .py: Complete Python Spark applications
– .scala: Complete Scala Spark applications

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-44
Hands-On Exercise: Explore RDDs Using the Spark Shell

§ In this exercise, you will


– View the Spark documentation
– Start the Spark shell
– Follow the instructions for either the Python or Scala shell
– Use RDDs to transform a dataset
– Explore Loudacre Web log files
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-45
Working with RDDs
Chapter 7
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-2
Working with RDDs

In this chapter you will learn


§ How RDDs are created from files or data in memory
§ How to handle file formats with multi-line records
§ How to use some additional operations on RDDs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-3
Chapter Topics

Working with RDDs

§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-4
RDDs

§ RDDs can hold any serializable type of element


– Primitive types such as integers, characters, and booleans
– Sequence types such as strings, lists, arrays, tuples, and dicts (including
nested data types)
– Scala/Java Objects (if serializable)
– Mixed types
§ Some RDDs are specialized and have additional functionality
– Pair RDDs
– RDDs consisting of key-value pairs
– Double RDDs
– RDDs consisting of numeric data

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-5
Creating RDDs from Collections

§ You can create RDDs from collections instead of files


– sc.parallelize(collection)
Language: Python
> myData = ["Alice","Carlos","Frank","Barbara"]
> myRdd = sc.parallelize(myData)
> myRdd.take(2)
['Alice', 'Carlos']

§ Useful when
– Testing
– Generating data programmatically
– Integrating
– Learning

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-6
Creating RDDs from Text Files (1)

§ For file-based RDDs, use SparkContext.textFile


– Accepts a single file, a directory of files, a wildcard list of files, or a
comma-separated list of files
– Examples
– sc.textFile("myfile.txt")
– sc.textFile("mydata/")
– sc.textFile("mydata/*.log")
– sc.textFile("myfile1.txt,myfile2.txt")
– Each line in each file is a separate record in the RDD
§ Files are referenced by absolute or relative URI
– Absolute URI:
– file:/home/training/myfile.txt
– hdfs://nnhost/loudacre/myfile.txt
– Relative URI (uses default file system): myfile.txt

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-7
Creating RDDs from Text Files (2)

§ textFile maps each line in a file to a separate RDD element

File contents:
I've never seen a purple cow.\n
I never hope to see one;\n
But I can tell you, anyhow,\n
I'd rather see than be one.\n

RDD elements:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ textFile only works with newline-terminated text files


§ What about other formats?

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-8
Input and Output Formats (1)

§ Spark uses Hadoop InputFormat and OutputFormat Java classes


– Some examples from core Hadoop
– TextInputFormat / TextOutputFormat (newline-
terminated text files)
– SequenceInputFormat / SequenceOutputFormat
– FixedLengthInputFormat
– Many implementations available in additional libraries
– Such as AvroKeyInputFormat / AvroKeyOutputFormat in
the Avro library
– You may also implement custom InputFormat and OutputFormat
types

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-9
Input and Output Formats (2)

§ Specify any input format using sc.hadoopFile


– or newAPIHadoopFile for New API classes
§ Specify any output format using rdd.saveAsHadoopFile
– or saveAsNewAPIHadoopFile for New API classes
§ textFile and saveAsTextFile are convenience functions
– textFile just calls hadoopFile specifying TextInputFormat
– saveAsTextFile calls saveAsHadoopFile specifying
TextOutputFormat

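As a rough sketch (not part of the original course materials), the PySpark shell can read a newline-terminated text file through the generic Hadoop-file API; the path mydata/ is hypothetical:

Language: Python
> rdd = sc.newAPIHadoopFile("mydata/",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
> rdd.values().take(2)   # elements are (byte offset, line) pairs; values() keeps just the lines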
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-10
Using wholeTextFiles (1)

§ sc.textFile maps each line in a file to a separate RDD element
– What about files with a multi-line input format, such as XML or JSON?
§ sc.wholeTextFiles(directory)
– Maps the entire contents of each file in a directory to a single RDD element
– Works only for small files (each element must fit in memory)

file1.json
{
  "firstName":"Fred",
  "lastName":"Flintstone",
  "userid":"123"
}

file2.json
{
  "firstName":"Barney",
  "lastName":"Rubble",
  "userid":"234"
}

Resulting pair RDD:
(file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"})
(file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"})
(file3.json,…)
(file4.json,…)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-11
Using wholeTextFiles (2)

Language: Python
> import json
> myrdd1 = sc.wholeTextFiles(mydir)
> myrdd2 = myrdd1 \
.map(lambda (fname,s): json.loads(s))
> for record in myrdd2.take(2): Output:
> print record.get("firstName",None) Fred
Barney
Language: Scala
> import scala.util.parsing.json.JSON
> val myrdd1 = sc.wholeTextFiles(mydir)
> val myrdd2 = myrdd1.
map(pair => JSON.parseFull(pair._2).get.
asInstanceOf[Map[String,String]])
> for (record <- myrdd2.take(2))
println(record.getOrElse("firstName",null))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-12
Chapter Topics

Working with RDDs

§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-13
Some Other General RDD Operations

§ Single-RDD Transformations
– flatMap maps one element in the base RDD to multiple elements
– distinct filters out duplicates
– sortBy uses the provided function to sort
§ Multi-RDD Transformations
– intersection creates a new RDD with all elements in both original
RDDs
– union adds all elements of two RDDs into a single new RDD
– zip pairs each element of the first RDD with the corresponding
element of the second
– subtract removes the elements in the second RDD from the first RDD

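A minimal Python sketch of the multi-RDD transformations, using small in-memory RDDs (the data is made up for illustration; output order may vary):

Language: Python
> rdd1 = sc.parallelize(["a", "b", "c"])
> rdd2 = sc.parallelize(["c", "d"])
> rdd1.union(rdd2).collect()          # ['a', 'b', 'c', 'c', 'd']
> rdd1.intersection(rdd2).collect()   # ['c']
> rdd1.subtract(rdd2).collect()       # ['a', 'b']
> sc.parallelize([1, 2], 1).zip(sc.parallelize(["x", "y"], 1)).collect()
[(1, 'x'), (2, 'y')]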
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-14
Example: flatMap and distinct

> sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \ Language: Python
.distinct()

> sc.textFile(file).
flatMap(line => line.split(' ')). Language: Scala
distinct()

Input:
the cat sat on the mat
the aardvark sat on the sofa

After flatMap:
the, cat, sat, on, the, mat, the, aardvark, sat, on, the, sofa

After distinct:
on, mat, sofa, aardvark, cat, the, sat

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-15
Examples: Multi-RDD Transformations (1)

rdd1: Chicago, Boston, Paris, San Francisco, Tokyo
rdd2: San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station

rdd1.subtract(rdd2)
Tokyo
Paris
Chicago

rdd1.zip(rdd2)
(Chicago,San Francisco)
(Boston,Boston)
(Paris,Amsterdam)
(San Francisco,Mumbai)
(Tokyo,McMurdo Station)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-16
Examples: Multi-RDD Transformations (2)

rdd1: Chicago, Boston, Paris, San Francisco, Tokyo
rdd2: San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station

rdd1.union(rdd2)
Chicago
Boston
Paris
San Francisco
Tokyo
San Francisco
Boston
Amsterdam
Mumbai
McMurdo Station

rdd1.intersection(rdd2)
Boston
San Francisco

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-17
Some Other General RDD Operations

§ Other RDD operations


– first returns the first element of the RDD
– foreach applies a function to each element in an RDD
– top(n) returns the largest n elements using natural ordering
§ Sampling operations
– sample creates a new RDD with a sampling of elements
– takeSample returns an array of sampled elements
§ Double RDD operations
– Statistical functions, such as mean, sum, variance, and stdev
– Documented in API for
org.apache.spark.rdd.DoubleRDDFunctions

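A short Python sketch of a few of these operations on a small in-memory RDD (the numbers are made up; results shown as comments):

Language: Python
> nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
> nums.first()               # 1.0
> nums.top(2)                # [4.0, 3.0]
> nums.takeSample(False, 2)  # two randomly chosen elements
> nums.mean()                # 2.5
> nums.stdev()               # 1.118…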
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-18
Chapter Topics

Working with RDDs

§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-19
Essential Points

§ RDDs can be created from files, parallelized data in memory, or other RDDs
§ sc.textFile reads newline-delimited text, one line per RDD element
§ sc.wholeTextFiles reads multiple files, one file per RDD element
§ Generic RDDs can consist of any type of data
§ Generic RDDs provide a wide range of transformation operations

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-20
Chapter Topics

Working with RDDs

§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-21
Hands-On Exercise: Process Data Files with Apache Spark

§ In this exercise, you will


– Process a set of XML files using wholeTextFiles
– Reformat a dataset to standardize format (bonus)
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-22
Aggregating Data with Pair RDDs
Chapter 8
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-2
Aggregating Data with Pair RDDs

In this chapter you will learn


§ How to create pair RDDs of key-value pairs from generic RDDs
§ Special operations available on pair RDDs
§ How map-reduce algorithms are implemented in Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-3
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-4
Pair RDDs

§ Pair RDDs are a special form of RDD Pair RDD


– Each element must be a key-value pair (a (key1,value1)
two-element tuple) (key2,value2)
– Keys and values can be any type (key3,value3)
§ Why? …
– Use with map-reduce algorithms
– Many additional functions are available for
common data processing needs
– Such as sorting, joining, grouping, and counting

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-5
Creating Pair RDDs

§ The first step in most workflows is to get the data into key/value form
– What should the RDD be keyed on?
– What is the value?
§ Commonly used functions to create pair RDDs
– map
– flatMap / flatMapValues
– keyBy

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-6
Example: A Simple Pair RDD

§ Example: Create a pair RDD from a tab-separated file


Language: Python
> users = sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))

Language: Scala
> val users = sc.textFile(file).
    map(line => line.split('\t')).
    map(fields => (fields(0),fields(1)))

Input:
user001\tFred Flintstone
user090\tBugs Bunny
user111\tHarry Potter
…

Resulting pair RDD:
(user001,Fred Flintstone)
(user090,Bugs Bunny)
(user111,Harry Potter)
…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-7
Example: Keying Web Logs by User ID
Language: Python
> sc.textFile(logfile) \
.keyBy(lambda line: line.split(' ')[2])

Language: Scala
> sc.textFile(logfile).
keyBy(line => line.split(' ')(2))
User ID
56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …

(99788,56.38.234.188 - 99788 "GET /KBDOC-00157.html…)


(99788,56.38.234.188 - 99788 "GET /theme.css…)
(25254,203.146.17.59 - 25254 "GET /KBDOC-00230.html…)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-8
Question 1: Pairs with Complex Values

§ How would you do this?


– Input: a tab-delimited list of postal codes with latitude and longitude
– Output: postal code (key) and lat/long pair (value)

Input:
00210   43.005895   -71.013202
00211   43.005895   -71.013202
00212   43.005895   -71.013202
00213   43.005895   -71.013202
00214   43.005895   -71.013202

Output:
(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
(00214,(43.005895,-71.013202))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-9
Answer 1: Pairs with Complex Values
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],(fields[1],fields[2])))

Language: Scala
> sc.textFile(file).
map(line => line.split('\t')).
map(fields => (fields(0),(fields(1),fields(2))))

Input:
00210   43.005895   -71.013202
01014   42.170731   -72.604842
01062   42.324232   -72.67915
01263   42.3929     -73.228483

Output:
(00210,(43.005895,-71.013202))
(01014,(42.170731,-72.604842))
(01062,(42.324232,-72.67915))
(01263,(42.3929,-73.228483))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-10
Question 2: Mapping Single Rows to Multiple Pairs (1)

§ How would you do this?


– Input: order numbers with a list of SKUs in the order
– Output: order (key) and sku (value)

Input Data:
00001   sku010:sku933:sku022
00002   sku912:sku331
00003   sku888:sku022:sku010:sku594
00004   sku411

Pair RDD:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-11
Question 2: Mapping Single Rows to Multiple Pairs (2)

§ Hint: map alone won’t work

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

(00001,(sku010,sku933,sku022))
(00002,(sku912,sku331))
(00003,(sku888,sku022,sku010,sku594))
(00004,(sku411))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-12
Answer 2: Mapping Single Rows to Multiple Pairs (1)
Language: Python
> sc.textFile(file)

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-13
Answer 2: Mapping Single Rows to Multiple Pairs (2)
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t'))

00001   sku010:sku933:sku022
00002   sku912:sku331
00003   sku888:sku022:sku010:sku594
00004   sku411

[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]

Note that split returns 2-element arrays, not pairs/tuples

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-14
Answer 2: Mapping Single Rows to Multiple Pairs (3)
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))

00001   sku010:sku933:sku022
00002   sku912:sku331
00003   sku888:sku022:sku010:sku594
00004   sku411

[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]

(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)

Map array elements to tuples to produce a pair RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-15
Answer 2: Mapping Single Rows to Multiple Pairs (4)
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))
.flatMapValues(lambda skus: skus.split(':'))

00001   sku010:sku933:sku022
00002   sku912:sku331
00003   sku888:sku022:sku010:sku594
00004   sku411

(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)

(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-16
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-17
Map-Reduce

§ Map-reduce is a common programming model


– Easily applicable to distributed processing of large data sets
§ Hadoop MapReduce is the major implementation
– Somewhat limited
– Each job has one map phase, one reduce phase
– Job output is saved to files
§ Spark implements map-reduce with much greater flexibility
– Map and reduce functions can be interspersed
– Results can be stored in memory
– Operations can easily be chained

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-18
Map-Reduce in Spark

§ Map-reduce in Spark works on pair RDDs


§ Map phase
– Operates on one record at a time
– “Maps” each record to zero or more new records
– Examples: map, flatMap, filter, keyBy
§ Reduce phase
– Works on map output
– Consolidates multiple records
– Examples: reduceByKey, sortByKey, mean

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-19
Map-Reduce Example: Word Count
Input Data:
the cat sat on the mat
the aardvark sat on the sofa

Result:
on        2
sofa      1
mat       1
aardvark  1
the       4
cat       1
sat       2

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-20
Example: Word Count

Language: Python
> counts = sc.textFile(file) \
    .flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word,1)) \
    .reduceByKey(lambda v1,v2: v1+v2)

Input (lines):
the cat sat on the mat
the aardvark sat on the sofa

After flatMap (words):
the, cat, sat, on, the, mat, the, aardvark, sat, on, the, sofa

After map (key-value pairs):
(the,1), (cat,1), (sat,1), (on,1), (the,1), (mat,1), (the,1), (aardvark,1), …

After reduceByKey (counts):
(on,2), (sofa,1), (mat,1), (aardvark,1), (the,4), (cat,1), (sat,2)

[Slides 08-21 to 08-24 build this pipeline one transformation at a time]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-21 to 08-24
reduceByKey

Language: Python
counts = sc.textFile(file) \
    .flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word,1)) \
    .reduceByKey(lambda v1,v2: v1+v2)

§ The function passed to reduceByKey combines two values at a time for the same key
– The function must be binary (take two arguments)
§ The function might be called in any order, therefore it must be
– Commutative: x+y = y+x
– Associative: (x+y)+z = x+(y+z)

[Diagram: the four (the,1) pairs are combined two at a time, for example (the,2) then (the,3) then (the,4), or (the,2) and (the,2) then (the,4); either order yields (the,4)]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-25 to 08-26
Word Count Recap (The Scala Version)

> val counts = sc.textFile(file).


flatMap(line => line.split(' ')).
map(word => (word,1)).
reduceByKey((v1,v2) => v1+v2)

OR

> val counts = sc.textFile(file).


flatMap(_.split(' ')).
map((_,1)).
reduceByKey(_+_)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-27
Why Do We Care about Counting Words?

§ Word count is challenging over massive amounts of data


– Using a single compute node would be too time-consuming
§ Statistics are often simple aggregate functions
– Distributive in nature
– For example: max, min, sum, and count
§ Map-reduce breaks complex tasks down into smaller elements which can
be executed in parallel
§ Many common tasks are very similar to word count
– Such as log file analysis
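§ For example, counting requests per IP address follows the same pattern (a sketch, assuming sc, a variable logfile pointing at the log data, and a log format whose first space-separated field is the IP address, as in the course weblogs)

Language: Python
> ipcounts = sc.textFile(logfile) \
      .map(lambda line: (line.split(' ')[0], 1)) \
      .reduceByKey(lambda v1,v2: v1+v2)
> for pair in ipcounts.take(5): print pair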

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-28
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-29
Pair RDD Operations

§ In addition to map and reduceByKey operations, Spark has several


operations specific to pair RDDs
§ Examples
– countByKey returns a map with the count of occurrences of each key
– groupByKey groups all the values for each key in an RDD
– sortByKey sorts in ascending or descending order
– join returns an RDD containing all pairs with matching keys from two
RDDs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-30
Example: Pair RDD Operations

(00004,sku411)
(00003,sku888)
(00001,sku010)
(00003,sku022)
(00001,sku933)
(00003,sku010)
(00001,sku022)
(00003,sku594)
(00002,sku912)
(00002,sku912)
(00002,sku331)

(00003,sku888)

(00002,[sku912,sku331])
(00001,[sku022,sku010,sku933])
(00003,[sku888,sku022,sku010,sku594])
(00004,[sku411])
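§ A sketch of how such output can be produced (assuming the (order ID, SKU) pairs above are already in a pair RDD named orders)

Language: Python
> grouped = orders.groupByKey()
> # each value is an iterable of SKUs, for example (00004,[sku411])
> orders.countByKey()         # map of each order ID to its number of pairs
> orders.sortByKey().take(3)  # pairs sorted by order ID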

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-31
Example: Joining by Key

> movies = moviegross.join(movieyear)

RDD: moviegross RDD: movieyear


(Casablanca,$3.7M) (Casablanca,1942)
(Star Wars,$775M) (Star Wars,1977)
(Annie Hall,$38M) (Annie Hall,1977)
(Argo,$232M) (Argo,2012)
… …

(Casablanca,($3.7M,1942))
(Star Wars,($775M,1977))
(Annie Hall,($38M,1977))
(Argo,($232M,2012))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-32
Other Pair Operations

§ Some other pair operations


– keys returns an RDD of just the keys, without the values
– values returns an RDD of just the values, without keys
– lookup(key) returns the value(s) for a key
– leftOuterJoin, rightOuterJoin , fullOuterJoin join two
RDDs, including keys defined in the left, right or either RDD respectively
– mapValues, flatMapValues execute a function on just the values,
keeping the key the same
§ See the PairRDDFunctions class Scaladoc for a full list
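§ A few of these operations in a minimal sketch (assuming sc; the data is illustrative)

Language: Python
> movieyear = sc.parallelize([("Casablanca",1942), ("Argo",2012)])
> moviegross = sc.parallelize([("Casablanca","$3.7M")])
> movieyear.keys().collect()                     # ['Casablanca', 'Argo']
> movieyear.mapValues(lambda y: y + 1).collect() # [('Casablanca', 1943), ('Argo', 2013)]
> movieyear.leftOuterJoin(moviegross).collect()
# [('Casablanca', (1942, '$3.7M')), ('Argo', (2012, None))] -- order may vary
> movieyear.lookup("Argo")                       # [2012]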

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-33
A Common Join Pattern

§ A common programming pattern


1. Map separate datasets into key-value pair RDDs
2. Join by key
3. Map joined data into the desired format
4. Save, display, or continue processing…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-34
Example: Join Web Log with Knowledge Base Documents (1)

weblogs
56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …
221.78.60.155 - 45402 "GET /titanic_4000_sales.html HTTP/1.0" …
65.187.255.81 - 14242 "GET /KBDOC-00107.html HTTP/1.0" …

User ID Requested File


join
kblist
KBDOC-00157:Ronin Novelty Note 3 - Back up files
KBDOC-00230:Sorrento F33L - Transfer Contacts
KBDOC-00050:Titanic 1000 - Transfer Contacts
KBDOC-00107:MeeToo 5.0 - Transfer Contacts
KBDOC-00300:iFruit 5A - overheats

Document ID Document Title

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-35
Example: Join Web Log with Knowledge Base Documents (2)

§ Steps
1. Map separate datasets into key-value pair RDDs
a. Map web log requests to (docid,userid)
b. Map KB Doc index to (docid,title)
2. Join by key: docid
3. Map joined data into the desired format: (userid,title)
4. Further processing: group titles by User ID

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-36
Step 1a: Map Web Log Requests to (docid,userid)

Language: Python
> import re
> def getRequestDoc(s):
return re.search(r'KBDOC-[0-9]*',s).group()

> kbreqs = sc.textFile(logfile) \


.filter(lambda line: 'KBDOC-' in line) \
.map(lambda line: (getRequestDoc(line),line.split(' ')[2])) \
.distinct()

56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …


56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …
221.78.60.155 - 45402 "GET kbreqs …
/titanic_4000_sales.html HTTP/1.0"
65.187.255.81 - 14242 "GET /KBDOC-00107.html HTTP/1.0" …
… (KBDOC-00157,99788)
(KBDOC-00203,25254)
(KBDOC-00107,14242)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-37
Step 1b: Map KB Index to (docid,title)

Language: Python
> kblist = sc.textFile(kblistfile) \
.map(lambda line: line.split(':')) \
.map(lambda fields: (fields[0],fields[1]))

KBDOC-00157:Ronin Novelty Note 3 - Back up files


KBDOC-00230:Sorrento F33L - Transfer Contacts
KBDOC-00050:Titanic 1000 - Transfer Contacts
KBDOC-00107:MeeToo 5.0 - Transfer Contacts
KBDOC-00206:iFruit 5A - overheats

kblist
(KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00050,Titanic 1000 - Transfer Contacts)
(KBDOC-00107,MeeToo 5.0 - Transfer Contacts)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-38
Step 2: Join by Key docid

Language: Python
> titlereqs = kbreqs.join(kblist)

kbreqs kblist
(KBDOC-00157,99788) (KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,25254) (KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00107,14242) (KBDOC-00050,Titanic 1000 - Transfer Contacts)
… (KBDOC-00107,MeeToo 5.0 - Transfer Contacts)

(KBDOC-00157,(99788,Ronin Novelty Note 3 - Back up files))


(KBDOC-00230,(25254,Sorrento F33L - Transfer Contacts))
(KBDOC-00107,(14242,MeeToo 5.0 - Transfer Contacts))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-39
Step 3: Map Result to Desired Format (userid,title)

Language: Python
> titlereqs = kbreqs.join(kblist) \
.map(lambda (docid,(userid,title)): (userid,title))

(KBDOC-00157,(99788,Ronin Novelty Note 3 - Back up


files))
(KBDOC-00230,(25254,Sorrento F33L - Transfer Contacts))
(KBDOC-00107,(14242,MeeToo 5.0 - Transfer Contacts))

(99788,Ronin Novelty Note 3 - Back up files)


(25254,Sorrento F33L - Transfer Contacts)
(14242,MeeToo 5.0 - Transfer Contacts)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-40
Step 4: Continue Processing—Group Titles by User ID

Language: Python
> titlereqs = kbreqs.join(kblist) \
.map(lambda (docid,(userid,title)): (userid,title)) \
.groupByKey()

(99788,Ronin Novelty Note 3 - Back up files)


(25254,Sorrento F33L - Transfer Contacts)
(14242,MeeToo 5.0 - Transfer Contacts)

(99788,[Ronin Novelty Note 3 - Back up files,


Ronin S3 - overheating])
(25254,[Sorrento F33L - Transfer Contacts])

(14242,[MeeToo 5.0 - Transfer Contacts,
        MeeToo 5.1 - Back up files,
        iFruit 1 - Back up files,
        MeeToo 3.1 - Transfer Contacts])

Note: values are grouped into Iterables

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-41
Example Output

Language: Python
> for (userid,titles) in titlereqs.take(10):
print 'user id: ',userid
for title in titles: print '\t',title

user id: 99788
	Ronin Novelty Note 3 - Back up files
	Ronin S3 - overheating
user id: 25254
	Sorrento F33L - Transfer Contacts
user id: 14242
	MeeToo 5.0 - Transfer Contacts
	MeeToo 5.1 - Back up files
	iFruit 1 - Back up files
	MeeToo 3.1 - Transfer Contacts
…

(Printed from the grouped pairs on the previous slide, such as
(99788,[Ronin Novelty Note 3 - Back up files, Ronin S3 - overheating]))


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-42
Aside: Anonymous Function Parameters

§ Python and Scala pattern matching can help improve code readability
Language: Python
> map(lambda (docid,(userid,title)): (userid,title))

Language: Scala
> map(pair => (pair._2._1,pair._2._2))

OR
Language: Scala
> map{case (docid,(userid,title)) => (userid,title)}

(KBDOC-00157,(99788,…title…)) (99788,…title…)
(KBDOC-00230,(25254,…title…)) (25254,…title…)
(KBDOC-00107,(14242,…title…)) (14242,…title…)
… …
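§ Note: tuple unpacking in lambda parameters, as shown above, works in the course's Python 2 environment but was removed in Python 3; an index-based form works in both

Language: Python
> map(lambda pair: (pair[1][0], pair[1][1]))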

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-43
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-44
Essential Points

§ Pair RDDs are a special form of RDD consisting of key-value pairs (tuples)
§ Spark provides several operations for working with pair RDDs
§ Map-reduce is a generic programming model for distributed processing
– Spark implements map-reduce with pair RDDs
– Hadoop MapReduce and other implementations are limited to a single
map and single reduce phase per job
– Spark allows flexible chaining of map and reduce operations
– Spark provides operations to easily perform common map-reduce
algorithms like joining, sorting, and grouping

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-45
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-46
Hands-On Exercise: Use Pair RDDs to Join Two Datasets

§ In this exercise, you will


– Continue exploring web server log files using key-value pair RDDs
– Join log data with user account data
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-47
Writing and Running Apache Spark
Applications
Chapter 9
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-2
Writing and Running Apache Spark Applications

In this chapter you will learn


§ How to write a Spark application
§ How to run a Spark application or the Spark shell on a YARN cluster
§ How to access and use the Spark application web UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-3
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-4
Spark Shell vs. Spark Applications

§ The Spark shell allows interactive exploration and manipulation of data


– REPL using Python or Scala
§ Spark applications run as independent programs
– Python, Scala, or Java
– For jobs such as ETL processing, streaming, and so on

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-5
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-6
The Spark Context

§ Every Spark program needs a SparkContext object


– The interactive shell creates one for you
§ In your own Spark application you create your own SparkContext
object
– Named sc by convention
– Call sc.stop when program terminates

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-7
Python Example: Word Count

import sys
from pyspark import SparkContext

if __name__ == "__main__":
if len(sys.argv) < 2:
print >> sys.stderr, "Usage: WordCount.py <file>"
exit(-1)

sc = SparkContext()

counts = sc.textFile(sys.argv[1]) \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word,1)) \
.reduceByKey(lambda v1,v2: v1+v2)

for pair in counts.take(5): print pair

sc.stop()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-8
Scala Example: Word Count

import org.apache.spark.SparkContext

object WordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: WordCount <file>")
System.exit(1)
}

val sc = new SparkContext()

val counts = sc.textFile(args(0)).


flatMap(line => line.split("\\W")).
map(word => (word,1)).reduceByKey(_ + _)
counts.take(5).foreach(println)

sc.stop()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-9
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-10
Building a Spark Application: Scala or Java

§ Scala or Java Spark applications must be compiled and assembled into JAR
files
– JAR file will be passed to worker nodes
§ Apache Maven is a popular build tool
– For specific setting recommendations, see the Spark Programming
Guide
§ Build details will differ depending on
– Version of Hadoop (HDFS)
– Deployment platform (YARN, Mesos, Spark Standalone)
§ Consider using an Integrated Development Environment (IDE)
– IntelliJ or Eclipse are two popular examples
– Can run Spark locally in a debugger

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-11
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-12
Running a Spark Application

§ The easiest way to run a Spark application is using the spark-submit


script

Python
$ spark-submit WordCount.py fileURL

Scala/Java
$ spark-submit --class WordCount \
    MyJarFile.jar fileURL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-13
Spark Application Cluster Options

§ Spark can run


– Locally
– No distributed processing
– Locally with multiple worker threads
– On a cluster
§ Local mode is useful for development and testing
§ Production use is almost always on a cluster

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-14
Supported Cluster Resource Managers

§ Hadoop YARN
– Included in CDH
– Most common for production sites
– Allows sharing cluster resources with other applications
§ Spark Standalone
– Included with Spark
– Easy to install and run
– Limited configurability and scalability
– No security support
– Useful for learning, testing, development, or small systems
§ Apache Mesos
– First platform supported by Spark
– Not supported by Cloudera

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-15
How Spark Runs on YARN: Client Mode (1)
Node A
NodeManager DataNode

Driver Program Executor


Spark
Context Node B
NodeManager DataNode

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application
Master 1

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-16
How Spark Runs on YARN: Client Mode (2)
Node A
NodeManager DataNode

Driver Program Executor


Spark
Context Node B
NodeManager DataNode

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application
Master 1

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-17
How Spark Runs on YARN: Client Mode (3)
Node A
NodeManager DataNode

Driver Program Executor


Spark
Context Node B
NodeManager DataNode

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application Executor
Master 1

Node D
Driver Program NodeManager DataNode
Spark
Application
Context Executor
Master 2

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-18
How Spark Runs on YARN: Client Mode (4)
Node A
NodeManager DataNode

Driver Program Executor


Spark
Context Node B
NodeManager DataNode

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application Executor
Master 1

Node D
Driver Program NodeManager DataNode
Spark
Application
Context Executor
Master 2

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-19
How Spark Runs on YARN: Cluster Mode (1)
Node A
NodeManager DataNode

Executor

Node B
NodeManager DataNode
submit

Executor Executor
Name
Resource Node C Node
Manager NodeManager
Application Master DataNode
Driver Program
Spark Context

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-20
How Spark Runs on YARN: Cluster Mode (2)
Node A
NodeManager DataNode

Executor

Node B
NodeManager DataNode
submit

Executor Executor
Name
Resource Node C Node
Manager NodeManager
Application Master DataNode
Driver Program
Spark Context

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-21
Running a Spark Application Locally

§ Use spark-submit --master to specify cluster option


– Local options
– local[*] runs locally with as many threads as cores (default)
– local[n] runs locally with n threads
– local runs locally with a single thread

Language: Python
$ spark-submit --master 'local[3]' \
WordCount.py fileURL

Language: Scala/Java
$ spark-submit --master 'local[3]' --class \
WordCount MyJarFile.jar fileURL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-22
Running a Spark Application on a Cluster

§ Use spark-submit --master to specify cluster option


– Cluster options
– yarn-client
– yarn-cluster
– spark://masternode:port (Spark Standalone)
– mesos://masternode:port (Mesos)

Language: Python
$ spark-submit --master yarn-cluster \
WordCount.py fileURL

Language: Scala/Java
$ spark-submit --master yarn-cluster --class \
WordCount MyJarFile.jar fileURL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-23
Starting the Spark Shell on a Cluster

§ The Spark shell can also be run on a cluster


§ pyspark and spark-shell both have a --master option
– yarn (client mode only)
– Spark or Mesos cluster manager URL
– local[*] runs with as many threads as cores (default)
– local[n] runs locally with n worker threads
– local runs locally without distributed processing

Language: Python
$ pyspark --master yarn

Language: Scala
$ spark-shell --master yarn

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-24
Options when Submitting a Spark Application to a Cluster

§ Some other spark-submit options for clusters


• --jars: Additional JAR files (Scala and Java only)
• --py-files: Additional Python files (Python only)
• --driver-java-options: Parameters to pass to the driver JVM
• --executor-memory: Memory per executor (for example: 1000m,
2g) (Default: 1g)
• --packages: Maven coordinates of an external library to include
§ Plus several YARN-specific options
• --num-executors: Number of executors to start
• --executor-cores: Number cores to allocate for each executor
• --queue: YARN queue to submit the application to
§ Show all available options
• --help

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-25
Dynamic Resource Allocation (1)
Node A
NodeManager DataNode Dynamic allocation
Driver Program Executor allows a Spark
Spark
application to add or
Context Node B
release executors as
NodeManager DataNode
needed.

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application
Master 1

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-26
Dynamic Resource Allocation (2)
Node A
NodeManager DataNode Dynamic allocation
Driver Program Executor allows a Spark
Spark
application to add or
Context Node B
release executors as
NodeManager DataNode
needed.

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application Executor
Master 1

Node D
NodeManager DataNode

Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-27
Dynamic Resource Allocation (3)
Node A
NodeManager DataNode Dynamic allocation
Driver Program Executor allows a Spark
Spark
application to add or
Context Node B
release executors as
NodeManager DataNode
needed.

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application Executor
Master 1

Node D
NodeManager DataNode

Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-28
Dynamic Resource Allocation (4)

§ Dynamic allocation in YARN is enabled by default starting in CDH 5.5


– Enabled at a site level in YARN, not application level
– Can be disabled for an individual application
– Specify the --num-executors flag when using spark-submit
– Or set the property spark.dynamicAllocation.enabled to false
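§ For example, to opt a single application out programmatically, the property can be set on its SparkConf (a sketch; the application name and executor count are illustrative)

Language: Python
from pyspark import SparkConf, SparkContext

sconf = SparkConf() \
    .setAppName("My App") \
    .set("spark.dynamicAllocation.enabled", "false") \
    .set("spark.executor.instances", "4")   # request a fixed number of executors
sc = SparkContext(conf=sconf)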

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-29
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-30
The Spark Application Web UI

The Spark UI lets you monitor running jobs, and view statistics and configuration

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-31
Accessing the Spark UI

§ The web UI is run by the Spark driver


– When running locally: http://localhost:4040
– When running on a cluster, access via the YARN UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-32
Viewing Spark Job History (1)

§ Viewing Spark Job History


– Spark UI is only available while the application is running
– Use Spark History Server to view metrics for a completed application
– Optional Spark component
§ Accessing the History Server
– For local jobs, access by URL
– Example: localhost:18080
– For YARN Jobs, click History link in YARN UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-33
Viewing Spark Job History (2)

§ Spark History Server

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-34
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-35
Essential Points

§ Use the Spark shell for interactive data exploration


§ Write a Spark application to run independently
§ Spark applications require a SparkContext object
§ Use the spark-submit script to run Spark applications
§ Spark is designed to run on a cluster
– Most large production sites deploy on YARN (included in CDH)
§ The resource manager distributes tasks to individual worker nodes in the
cluster
– Tasks run in executors—JVMs running on worker nodes

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-36
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-37
Building and Running Scala Applications in the
Hands-On Exercises
§ Basic Apache Maven projects are provided in the exercise directory
– stubs: starter Scala file, do exercises here
– solution: final exercise solution

$ mvn package

$ spark-submit \
    --class stubs.CountJPGs \
    target/countjpgs-1.0.jar \
    weblogs.*

Project Directory Structure

+countjpgs
  -pom.xml
  +src
    +main
      +scala
        +solution
          -CountJPGs.scala
        +stubs
          -CountJPGs.scala
  +target
    -countjpgs-1.0.jar

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-38
Hands-On Exercise: Write and Run an Apache Spark Application

§ In this exercise, you will


– Write a Spark application to count JPG requests in a web server log
– If you use Scala, compile and package the application in a JAR file
– Run the application locally to test
– Submit the application to run on the YARN cluster
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-39
Configuring Apache Spark Applications
Chapter 10
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-2
Configuring Apache Spark Applications

In this chapter you will learn


§ How to configure Spark application properties, either programmatically or
declaratively
§ How to set logging levels for a Spark application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-3
Chapter Topics

Configuring Apache Spark


Applications

§ Configuring Apache Spark Properties


§ Logging
§ Essential Points
§ Hands-On Exercise: Configure an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-4
Spark Application Configuration

§ Spark provides numerous properties for configuring your application


§ Some example properties
– spark.master
– spark.app.name
– spark.local.dir: Where to store local files such as shuffle output
(default /tmp)
– spark.ui.port: Port to run the Spark Application UI (default 4040)
– spark.executor.memory: How much memory to allocate to each
Executor (default 1g)
– spark.driver.memory: How much memory to allocate to the
driver in client mode (default 1g)
– And many more...
– See Spark Configuration page in the Spark documentation for more
details

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-5
Spark Application Configuration Options

§ Spark applications can be configured


– Declaratively, using spark-submit options or properties files
– Programmatically, within your application code

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-6
Declarative Configuration Options

§ spark-submit script
– Examples:
– spark-submit --driver-memory 500M
– spark-submit --conf spark.executor.cores=4
§ Properties file
– Tab- or space-separated list of properties and values
– Load with spark-submit --properties-file filename
– Example:
spark.master yarn-cluster
spark.local.dir /tmp
spark.ui.port 4141
§ Site defaults properties file
– SPARK_HOME/conf/spark-defaults.conf
– Template file provided

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-7
Setting Configuration Properties Programmatically

§ Spark configuration settings are part of the Spark context


§ Configure using a SparkConf object
§ Some example set functions
– setAppName(name)
– setMaster(master)
– set(property-name, value)
§ set functions return a SparkConf object to support chaining

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-8
SparkConf Example (Python)

import sys
from pyspark import SparkContext
from pyspark import SparkConf

if __name__ == "__main__":
if len(sys.argv) < 2:
print >> sys.stderr, "Usage: WordCount <file>"
exit(-1)

sconf = SparkConf() \
.setAppName("Word Count") \
.set("spark.ui.port","4141")
sc = SparkContext(conf=sconf)

counts = sc.textFile(sys.argv[1]) \
.flatMap(lambda line: line.split()) \
.map(lambda w: (w,1)) \
.reduceByKey(lambda v1,v2: v1+v2)

for pair in counts.take(5): print pair


sc.stop()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-9
SparkConf Example (Scala)

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object WordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: WordCount <file>")
System.exit(1)
}

val sconf = new SparkConf()


.setAppName("Word Count")
.set("spark.ui.port","4141")
val sc = new SparkContext(sconf)

val counts = sc.textFile(args(0)).


flatMap(line => line.split("\\W")).
map(word => (word,1)).
reduceByKey(_ + _)
counts.take(5).foreach(println)
sc.stop()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-10
Viewing Spark Properties

§ You can view the Spark property settings in the Spark Application UI
– Environment tab

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-11
Chapter Topics

Configuring Apache Spark


Applications

§ Configuring Apache Spark Properties


§ Logging
§ Essential Points
§ Hands-On Exercise: Configure an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-12
Spark Logging

§ Spark uses Apache Log4j for logging


– Allows for controlling logging at runtime using a properties file
– Enable or disable logging, set logging levels, select output
destination
– For more info see http://logging.apache.org/log4j/1.2/
§ Log4j provides several logging levels
– TRACE
– DEBUG
– INFO
– WARN
– ERROR
– FATAL
– OFF

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-13
Spark Log Files (1)

§ Log file locations depend on your cluster management platform


§ YARN
– If log aggregation is off, logs are stored locally on each worker node
– If log aggregation is on, logs are stored in HDFS
– Default /var/log/hadoop-yarn
– Access with yarn logs command or YARN Resource Manager UI

$ yarn application -list

Application-Id Application-Name Application-Type…


application_1441395433148_0003 Spark shell SPARK …
application_1441395433148_0001 myapp.jar MAPREDUCE …

$ yarn logs -applicationId <appid>


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-14
Spark Log Files (2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-15
Configuring Spark Logging (1)

§ Logging levels can be set for the cluster, for individual applications, or even
for specific components or subsystems
§ Default for machine: SPARK_HOME/conf/log4j.properties*
– Start by copying log4j.properties.template
log4j.properties.template

# Set everything to be logged to the console
# (default for all Spark applications)
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err

# Default override for the Spark shell (Scala)
log4j.logger.org.apache.spark.repl.Main=WARN
* Located in /usr/lib/spark/conf on course VM
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-16
Configuring Spark Logging (2)

§ Copy log4j.properties.template to log4j.properties


– May require administrator privileges
§ Modify the rootCategory or repl.Main settings

log4j.properties

# Set everything to be logged to the console
# (default for all Spark applications)
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err

# Default override for the Spark shell (Scala)
log4j.logger.org.apache.spark.repl.Main=DEBUG

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-17
Configuring Spark Logging (3)

§ Logging in the Spark shell can be configured interactively


– The setLogLevel method sets the logging level temporarily
– Added in Spark 1.4

Language: Python/Scala

> sc.setLogLevel("ERROR")

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-18
Chapter Topics

Configuring Apache Spark


Applications

§ Configuring Apache Spark Properties


§ Logging
§ Essential Points
§ Hands-On Exercise: Configure an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-19
Essential Points

§ Spark configuration parameters can be set declaratively using the


spark-submit script or a properties file, or set programmatically using
a SparkConf object
§ Spark uses Log4j for logging
– Configure using a log4j.properties file or sc.setLogLevel

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-20
Chapter Topics

Configuring Apache Spark


Applications

§ Configuring Apache Spark Properties


§ Logging
§ Essential Points
§ Hands-On Exercise: Configure an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-21
Hands-On Exercise: Configure an Apache Spark Application

§ In this exercise, you will


– Set properties using spark-submit
– Set properties in a properties file
– Change the logging levels in a log4j.properties file
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-22
Parallel Processing in Apache Spark
Chapter 11
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-2
Parallel Processing in Apache Spark

In this chapter you will learn


§ How RDDs are distributed across a cluster
§ How Apache Spark partitions file-based RDDs
§ How Spark executes RDD operations in parallel
§ How to control parallelization through partitioning
§ How to view and monitor tasks and stages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-3
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-4
Review of Spark on YARN (1)
Worker Nodes

$ spark-submit \
--master yarn-client \
--class MyClass \
MyApp.jar

Resource Name
Manager Node

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-5
Review of Spark on YARN (2)
$ spark-submit \
    --master yarn-client \
    --class MyClass \
    MyApp.jar

[Diagram: the Driver Program and its Spark Context run on the client; YARN has allocated containers on the worker nodes; Resource Manager and Name Node shown]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-6
Review of Spark on YARN (3)
$ spark-submit \
    --master yarn-client \
    --class MyClass \
    MyApp.jar

[Diagram: the Driver Program and its Spark Context run on the client; an Executor runs in a container on each worker node; Resource Manager and Name Node shown]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-7
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-8
RDDs on a Cluster

§ Resilient Distributed Datasets
– Data is partitioned across worker nodes
§ Partitioning is done automatically by Spark
– Optionally, you can control how many partitions are created

[Diagram: RDD 1 is split into partitions rdd_1_0, rdd_1_1, and rdd_1_2, one per executor]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-9
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-10
File Partitioning: Single Files
Language: Python/Scala
sc.textFile("myfile",3)

§ Partitions from single files
– Partitions based on size
– You can optionally specify a minimum number of partitions
  textFile(file, minPartitions)
– Default is two when running on a cluster
– Default is one when running locally with a single thread
– More partitions = more parallelization

[Diagram: the data in myfile is split across three RDD partitions, one per executor]
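§ You can check how a file was actually split with getNumPartitions (a sketch, assuming sc and a file named myfile)

Language: Python
> rdd = sc.textFile("myfile", 3)   # request at least 3 partitions
> rdd.getNumPartitions()           # 3 or more, depending on file size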

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-11
File Partitioning: Multiple Files
§ sc.textFile("mydir/*")
– Each file becomes (at least) one partition
– File-based operations can be done per-partition, for example parsing XML
§ sc.wholeTextFiles("mydir")
– For many small files
– Creates a key-value PairRDD
  – key = file name
  – value = file contents

[Diagram: each file (file1, file2, …) loads into its own RDD partition on an executor]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-12
Operating on Partitions

§ Most RDD operations work on each element of an RDD


§ A few work on each partition
– foreachPartition calls a function for each partition
– mapPartitions creates a new RDD by executing a function on each
partition in the current RDD
– mapPartitionsWithIndex works the same as mapPartitions
but includes index of the partition
§ Functions for partition operations take iterators

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-13
Example: foreachPartition

§ Example: Print out the first line of each partition


Language: Python
def printFirstLine(iter):
print iter.next()

myrdd = …
myrdd.foreachPartition(lambda i: printFirstLine(i))

Language: Scala
def printFirstLine(iter: Iterator[Any]) = {
println(iter.next)
}

val myrdd = …
myrdd.foreachPartition(printFirstLine)
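§ A similar sketch using mapPartitionsWithIndex to count the records in each partition (assuming myrdd as above)

Language: Python
def countInPartition(index, iter):
    # yield one (partition index, record count) pair per partition
    yield (index, sum(1 for _ in iter))

counts = myrdd.mapPartitionsWithIndex(countInPartition)
print counts.collect()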

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-14
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-15
HDFS and Data Locality (1)

Node A

Node B

Node C

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-16
HDFS and Data Locality (2)

$ hdfs dfs -put mydata

HDFS:
Node A mydata
HDFS
Block 1

Node B
HDFS
Block 2

Node C
HDFS
Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-17
HDFS and Data Locality (3)

sc.textFile("hdfs://…mydata").collect()

HDFS:
Driver Program Node A mydata
Executor HDFS
Spark
Block 1
Context

Node B
Executor HDFS
Block 2

Node C
Executor HDFS
Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-18
HDFS and Data Locality (4)

sc.textFile("hdfs://…mydata").collect()

By default, Spark partitions file-based RDDs by block. Each block loads into a single partition.
HDFS:
RDD
Driver Program Node A mydata
Executor HDFS
Spark
Block 1
Context

Node B
Executor HDFS
Block 2

Node C
Executor HDFS
Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-19
HDFS and Data Locality (5)

sc.textFile("hdfs://…mydata").collect()

An action triggers execution: tasks on executors load data from blocks into partitions.

HDFS:
RDD
Driver Program Node A mydata
Executor HDFS
Spark task Block 1
Context

Node B
Executor HDFS
task Block 2

Node C
Executor HDFS
task Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-20
HDFS and Data Locality (6)

sc.textFile("hdfs://…mydata").collect()

Data is distributed across executors until an action returns a value to the driver.

HDFS:
RDD
Driver Program Node A mydata
Executor HDFS
Spark
Block 1
Context

Node B
Executor HDFS
Block 2

Node C
Executor HDFS
Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-21
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-22
Parallel Operations on Partitions

§ RDD operations are executed in parallel on each partition


– When possible, tasks execute on the worker nodes where the data is stored
§ Some operations preserve partitioning
– Such as map, flatMap, or filter
§ Some operations repartition
– Such as reduceByKey, sortByKey, join, or groupByKey

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-23
Example: Average Word Length by Letter (1)

Language: Python
> avglens = sc.textFile(file)

RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-24
Example: Average Word Length by Letter (2)

Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' '))

RDD RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-25
Example: Average Word Length by Letter (3)

Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word)))

RDD RDD RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-26
Example: Average Word Length by Letter (4)

Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word))) \
.groupByKey()

RDD RDD RDD


RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-27
Example: Average Word Length by Letter (5)

Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))

RDD RDD RDD


RDD RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-28
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-29
Stages

§ Operations that can run on the same partition are executed in stages
§ Tasks within a stage are pipelined together
§ Developers should be aware of stages to improve performance

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-30
Spark Execution: Stages (1)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

Stage 0 Stage 1
RDD RDD RDD
RDD RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-31
Spark Execution: Stages (2)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

Stage 0 Stage 1

Task 1
Task 5
Task 2
Task 3 Task 6

Task 4

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-32
Spark Execution: Stages (3)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

Stage 0 Stage 1

Task 1
Task 5
Task 2
Task 3 Task 6

Task 4

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-33
Spark Execution: Stages (4)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

Stage 0 Stage 1

Task 1
Task 5
Task 2
Task 3 Task 6

Task 4

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-34
Summary of Spark Terminology

§ Job—a set of tasks executed as a result of an action


§ Stage—a set of tasks in a job that can be executed in parallel
§ Task—an individual unit of work sent to one executor
§ Application—the set of jobs managed by a single driver

Job Stage
Task
RDD RDD RDD
RDD RDD

Stage
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-35
How Spark Calculates Stages

§ Spark constructs a DAG (Directed Acyclic Graph) of RDD dependencies


§ Narrow dependencies
– Each partition in the child RDD depends on just one partition of the parent RDD
– No shuffle required between executors
– Can be collapsed into a single stage
– Examples: map, filter, and union
§ Wide (or shuffle) dependencies
– Child partitions depend on multiple partitions in the parent RDD
– Defines a new stage
– Examples: reduceByKey, join, and groupByKey

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-36
Controlling the Level of Parallelism

§ Wide operations (such as reduceByKey) partition resulting RDDs


– More partitions = more parallel tasks
– Cluster will be under-utilized if there are too few partitions
§ You can control how many partitions
– Optional numPartitions parameter in function call

> words.reduceByKey(lambda v1, v2: v1 + v2, 15)

– Configure the spark.default.parallelism property

spark.default.parallelism 10

– If you do not specify either, the default is the number of partitions of the parent RDD
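§ A quick way to confirm the effect (a sketch, assuming a pair RDD named words as above)

Language: Python
> counts = words.reduceByKey(lambda v1,v2: v1+v2, 15)
> counts.getNumPartitions()   # 15
> sc.defaultParallelism       # reflects spark.default.parallelism when it is set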

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-37
Viewing Stages in the Spark Application UI (1)

§ You can view jobs and stages in the Spark Application UI

Jobs are identified by the action that triggered the job execution

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-38
Viewing Stages in the Spark Application UI (2)

§ Select the job to view execution stages

Stages are identified by the last operation
Number of tasks = number of partitions
Data shuffled between stages is shown

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-39
Viewing Stages in the Spark Application UI (3)

§ Click DAG Visualization for an


interactive map of stages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-40
Viewing the Stages Using toDebugString (Scala)

Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.toDebugString

(2) MappedRDD[5] at map at …                } Stage 1
 |  ShuffledRDD[4] at groupByKey at …       }
 +-(4) MappedRDD[3] at map at …             } Stage 0
    |  FlatMappedRDD[2] at flatMap at …     }
    |  myfile MappedRDD[1] at textFile at … }
    |  myfile HadoopRDD[0] at textFile at … }

Indents indicate stages (shuffle boundaries)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-41
Viewing the Stages Using toDebugString (Python)
Language: Python
> avglens = sc.textFile(myfile) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))

> print avglens.toDebugString()


(2) PythonRDD[13] at RDD at …               } Stage 1
 |  MappedRDD[12] at values at …            }
 |  ShuffledRDD[11] at partitionBy at …     }
 +-(4) PairwiseRDD[10] at groupByKey at …   } Stage 0
    |  PythonRDD[9] at groupByKey at …      }
    |  myfile MappedRDD[7] at textFile at … }
    |  myfile HadoopRDD[6] at textFile at … }

Indents indicate stages (shuffle boundaries)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-42
Spark Task Execution (1)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: on the client, the driver program (Spark Context) holds the Stage 0 tasks (Tasks 1-4) and Stage 1 tasks (Tasks 5-6); executors on HDFS Nodes A-D each hold one block of the input file (Blocks 1-4)]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-43
Spark Task Execution (2)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: the driver sends the Stage 0 tasks to the executors: Task 1 to Node A (Block 1), Task 2 to Node B (Block 2), Task 3 to Node C (Block 3), Task 4 to Node D (Block 4); the Stage 1 tasks (Tasks 5-6) remain queued at the driver]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-44
Spark Task Execution (3)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: each Stage 0 task processes its local block and writes shuffle data on its node; the Stage 1 tasks (Tasks 5-6) still wait at the driver]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-45
Spark Task Execution (4)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 0 is complete; the shuffle data remains on each node and the Stage 1 tasks (Tasks 5-6) are still queued at the driver]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-46
Spark Task Execution (5)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: the driver sends the Stage 1 tasks to executors, Task 5 to Node B and Task 6 to Node C, where they read the shuffle data produced by Stage 0]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-47
Spark Task Execution (6)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: Task 5 writes part-00000 and Task 6 writes part-00001 to the avglen-output directory in HDFS]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-48
Spark Task Execution (Alternate Ending)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.collect()

[Diagram: with collect(), Tasks 5 and 6 return their results to the driver program instead of writing output files to HDFS]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-49
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-50
Essential Points

§ RDDs are processed in the memory of Spark executor JVMs


§ Data is split into partitions—each partition in a separate executor
§ RDD operations are executed on partitions in parallel
§ Operations that depend on the same partition are pipelined together in
stages
– Examples: map and filter
§ Operations that depend on multiple partitions are executed in separate
stages
– Examples: join and reduceByKey

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-51
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-52
Hands-On Exercise: View Jobs and Stages in the Spark
Application UI
§ In this exercise, you will
– Use the Spark Application UI to view how jobs, stages, and tasks are
executed in a job
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-53
RDD Persistence
Chapter 12
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-2
RDD Persistence

In this chapter you will learn


§ How Apache Spark uses an RDD’s lineage in operations
§ How to persist RDDs to improve performance

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-3
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-4
Lineage Example (1)
File: purplecow.txt
§ Each transformation operation I've never seen a purple cow.
creates a new child RDD I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-5
Lineage Example (2)
File: purplecow.txt
§ Each transformation operation I've never seen a purple cow.
creates a new child RDD I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1] (mydata)
Language: Python
> mydata = sc.textFile("purplecow.txt")

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-6
Lineage Example (3)
File: purplecow.txt
§ Each transformation operation I've never seen a purple cow.
creates a new child RDD I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1] (mydata)
Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))

RDD[2]

RDD[3] (myrdd)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-7
Lineage Example (4)
File: purplecow.txt
§ Spark keeps track of the parent RDD I've never seen a purple cow.
for each new RDD I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
§ Child RDDs depend on their parents
RDD[1] (mydata)
Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))

RDD[2]

RDD[3] (myrdd)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-8
Lineage Example (5)
§ Action operations execute the parent transformations

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))
> myrdd.count()
3

RDD[1] (mydata): the four lines of purplecow.txt
RDD[2]: the four lines converted to upper case
RDD[3] (myrdd): the three upper-case lines that start with 'I'

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-9
Lineage Example (6)
File: purplecow.txt
§ Each action re-executes the lineage I've never seen a purple cow.
transformations starting with the I never hope to see one;
But I can tell you, anyhow,
base I'd rather see than be one.

– By default RDD[1] (mydata)


Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))
> myrdd.count()
3 RDD[2]
> myrdd.count()

RDD[3] (myrdd)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-10
Lineage Example (7)
File: purplecow.txt
§ Each action re-executes the lineage I've never seen a purple cow.
transformations starting with the I never hope to see one;
But I can tell you, anyhow,
base I'd rather see than be one.

– By default RDD[1] (mydata)


Language: Python I've never seen a purple cow.
> mydata = sc.textFile("purplecow.txt") I never hope to see one;
> myrdd = mydata.map(lambda s: s.upper())\ But I can tell you, anyhow,
.filter(lambda s:s.startswith('I')) I'd rather see than be one.
> myrdd.count()
3 RDD[2]
I'VE NEVER SEEN A PURPLE COW.
> myrdd.count()
3 I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

RDD[3] (myrdd)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-11
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-12
RDD Persistence (1)
File: purplecow.txt
§ Persisting an RDD saves the data (in I've never seen a purple cow.
memory, by default) I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-13
RDD Persistence (2)
File: purplecow.txt
§ Persisting an RDD saves the data (in I've never seen a purple cow.
memory, by default) I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())

RDD[2] (myrdd1)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-14
RDD Persistence (3)
File: purplecow.txt
§ Persisting an RDD saves the data (in I've never seen a purple cow.
memory, by default) I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())
> myrdd1.persist()
RDD[2] (myrdd1)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-15
RDD Persistence (4)
File: purplecow.txt
§ Persisting an RDD saves the data (in I've never seen a purple cow.
memory, by default) I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda \ RDD[2] (myrdd1)
s:s.startswith('I'))

RDD[3] (myrdd2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-16
RDD Persistence (5)
§ Persisting an RDD saves the data (in memory, by default)

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3

RDD[1] (mydata): the four lines of purplecow.txt
RDD[2] (myrdd1, persisted): the four lines converted to upper case
RDD[3] (myrdd2): the three upper-case lines that start with 'I'

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-17
RDD Persistence (6)
File: purplecow.txt
§ Subsequent operations use saved I've never seen a purple cow.
data I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda \ RDD[2] (myrdd1)
I'VE NEVER SEEN A PURPLE COW.
s:s.startswith('I'))
I NEVER HOPE TO SEE ONE;
> myrdd2.count()
BUT I CAN TELL YOU, ANYHOW,
3
I'D RATHER SEE THAN BE ONE.
> myrdd2.count()
RDD[3] (myrdd2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-18
RDD Persistence (7)
File: purplecow.txt
§ Subsequent operations use saved I've never seen a purple cow.
data I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda \ RDD[2] (myrdd1)
I'VE NEVER SEEN A PURPLE COW.
s:s.startswith('I'))
I NEVER HOPE TO SEE ONE;
> myrdd2.count()
BUT I CAN TELL YOU, ANYHOW,
3
I'D RATHER SEE THAN BE ONE.
> myrdd2.count()
3 RDD[3] (myrdd2)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-19
Memory Persistence

§ In-memory persistence is a suggestion to Spark


– If not enough memory is available, persisted partitions will be cleared
from memory
– Least recently used partitions cleared first
– Transformations will be re-executed using the lineage when needed

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-20
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-21
Persistence and Fault-Tolerance

§ RDD = Resilient Distributed Dataset


– Resiliency is a product of tracking lineage
– RDDs can always be recomputed from their base if needed

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-22
Distributed Persistence

§ RDD partitions are distributed across a cluster


§ By default, partitions are persisted in memory in Executor JVMs
RDD
Node A
Driver Executor
task rdd_1_0

Node B
Executor
task rdd_1_1

Node C
Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-23
RDD Fault-Tolerance (1)

§ What happens if a partition persisted in memory becomes unavailable?

RDD
Node A
Driver Executor
task rdd_1_0

Node B

Node C
Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-24
RDD Fault-Tolerance (2)

§ The driver starts a new task to recompute the partition on a different node
§ Lineage is preserved, data is never lost
RDD
Node A
Driver Executor
task rdd_1_0

Node B

Node C
Executor
task rdd_1_1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-25
Persistence Levels

§ By default, the persist method stores data in memory only


§ The persist method offers other options called storage levels
§ Storage levels let you control
– Storage location (memory or disk)
– Format in memory
– Partition replication

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-26
Persistence Levels: Storage Location

§ Storage location—where is the data stored?


– MEMORY_ONLY: Store data in memory if it fits
– MEMORY_AND_DISK: Store partitions on disk if they do not fit in
memory
– Called spilling
– DISK_ONLY: Store all partitions on disk
Language: Python
> from pyspark import StorageLevel
> myrdd.persist(StorageLevel.DISK_ONLY)

Language: Scala
> import org.apache.spark.storage.StorageLevel
> myrdd.persist(StorageLevel.DISK_ONLY)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-27
Persistence Levels: Memory Format

§ Serialization—you can choose to serialize the data in memory


– MEMORY_ONLY_SER and MEMORY_AND_DISK_SER
– Much more space efficient
– Less time efficient
– If using Java or Scala, choose a fast serialization library such as Kryo
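A brief hedged example (the RDD name is a placeholder). Note that PySpark already stores RDD data in serialized form, so the _SER levels mainly matter for Scala and Java applications:

Language: Python
from pyspark import StorageLevel

# Keep partitions in memory as serialized objects (no spilling to disk)
myrdd.persist(StorageLevel.MEMORY_ONLY_SER)

# For Scala/Java applications, Kryo is enabled with the configuration property
# spark.serializer=org.apache.spark.serializer.KryoSerializer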

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-28
Persistence Levels: Partition Replication

§ Replication—store partitions on two nodes


– DISK_ONLY_2
– MEMORY_AND_DISK_2
– MEMORY_ONLY_2
– MEMORY_AND_DISK_SER_2
– MEMORY_ONLY_SER_2
– You can also define custom storage levels
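A hedged sketch (the RDD names are placeholders) of a replicated level and a custom storage level built with the StorageLevel constructor:

Language: Python
from pyspark import StorageLevel

# Keep two in-memory copies of each partition, on different nodes
myrdd.persist(StorageLevel.MEMORY_ONLY_2)

# Custom level: useDisk, useMemory, useOffHeap, deserialized, replication
customLevel = StorageLevel(True, True, False, False, 2)
otherrdd.persist(customLevel)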

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-29
Default Persistence Levels

§ The storageLevel parameter for the persist() operation is optional
– If no storage level is specified, the default value depends on the language
– Scala default: MEMORY_ONLY
– Python default: MEMORY_ONLY_SER
§ cache() is a synonym for persist() with no storage level specified

myrdd.cache()

is equivalent to

myrdd.persist()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-30
Disk Persistence

§ Disk-persisted partitions are stored in local files

RDD
Client Node A
Driver Executor
task rdd_0_0

Node B
Executor
rdd_0_1
task rdd_0_1

Node C
Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-31
Disk Persistence with Replication (1)

§ Persistence replication makes recomputation less likely to be necessary

RDD
Client Node A
Driver Executor
task rdd_0_0

Node B
Executor
rdd_0_1
task rdd_0_1

Node C
Executor
rdd_0_1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-32
Disk Persistence with Replication (2)

§ Replicated data on disk will be used to recreate the partition if possible


– Will be recomputed if the data is unavailable
– For example, when the node is down
RDD
Client Node A
Driver Executor
task rdd_0_0

Node B

Node C
Executor
task rdd_0_1 rdd_1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-33
When and Where to Persist

§ When should you persist a dataset?


– When a dataset is likely to be re-used
– Such as in iterative algorithms and machine learning
§ How to choose a persistence level
– Memory only—choose when possible, best performance
– Save space by saving as serialized objects in memory if necessary
– Disk—choose when recomputation is more expensive than disk read
– Such as with expensive functions or filtering large datasets
– Replication—choose when recomputation is more expensive than
memory

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-34
Changing Persistence Options

§ To stop persisting and remove from memory and disk


– rdd.unpersist()
§ To change an RDD to a different persistence level
– Unpersist first
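For example (a minimal sketch; myrdd is a placeholder):

Language: Python
from pyspark import StorageLevel

myrdd.unpersist()                       # drop the current storage level
myrdd.persist(StorageLevel.DISK_ONLY)   # re-persist at the new level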

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-35
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-36
Essential Points

§ Spark keeps track of each RDD’s lineage


– Provides fault tolerance
§ By default, every RDD operation executes the entire lineage
§ If an RDD will be used multiple times, persist it to avoid re-computation
§ Persistence options
– Location—memory only, memory and disk, disk only
– Format—in-memory data can be serialized to save memory (but at the
cost of performance)
– Replication—saves data on multiple locations in case a node goes down,
for job recovery without recomputation

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-37
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-38
Hands-On Exercises: Persist an RDD

§ In this exercise, you will


– Persist an RDD before reusing it
– Use the Spark Application UI to see how an RDD is persisted
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-39
Common Patterns in Apache Spark
Data Processing
Chapter 13
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-2
Common Patterns in Apache Spark Data Processing

In this chapter you will learn


§ At what kinds of processing and analysis Apache Spark is best
§ How to implement an iterative algorithm in Spark
§ What major features and benefits are provided by Spark’s machine
learning libraries

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-3
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-4
Common Spark Use Cases (1)

§ Spark is especially useful when working with any combination of:


– Large amounts of data
– Distributed storage
– Intensive computations
– Distributed computing
– Iterative algorithms
– In-memory processing and pipelining

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-5
Common Spark Use Cases (2)

§ Risk analysis
– “How likely is this borrower to pay back a loan?”
§ Recommendations
– “Which products will this customer enjoy?”
§ Predictions
– “How can we prevent service outages instead of simply reacting to
them?”
§ Classification
– “How can we tell which mail is spam and which is legitimate?”

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-6
Spark Examples

§ Spark includes many example programs that demonstrate some common


Spark programming patterns and algorithms
– k-means
– Logistic regression
– Calculating pi
– Alternating least squares (ALS)
– Querying Apache web logs
– Processing Twitter feeds
§ Examples
– SPARK_HOME/lib*
– spark-examples.jar: Java and Scala examples
– python.tar.gz: Pyspark examples

*SPARK_HOME is /usr/lib/spark on the course VM


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-7
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-8
Example: PageRank

§ PageRank gives web pages a ranking score based on links from other pages
– Higher scores given for more links, and links from other high ranking
pages
§ PageRank is a classic example of big data analysis (like word count)
– Lots of data: Needs an algorithm that is distributable and scalable
– Iterative: The more iterations, the better the answer

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-9
PageRank Algorithm (1)

1. Start each page with a rank of 1.0

Page 1
1.0

Page 2 Page 3
1.0 1.0
Page 4
1.0

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-10
PageRank Algorithm (2)

1. Start each page with a rank of 1.0


2. On each iteration:
a. Each page contributes to its neighbors its own rank divided by the
number of its neighbors: contribp = rankp / neighborsp

Page 1
1.0

Page 2 Page 3
1.0 1.0
Page 4
1.0

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-11
PageRank Algorithm (3)

1. Start each page with a rank of 1.0


2. On each iteration:
a. Each page contributes to its neighbors its own rank divided by the
number of its neighbors: contribp = rankp / neighborsp
b. Set each page’s new rank based on the sum of its neighbors’ contributions:
new_rank = Σcontrib * 0.85 + 0.15

Page 1 Iteration 1
1.85

Page 2 Page 3
0.58 1.0
Page 4
0.58

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-12
PageRank Algorithm (4)

1. Start each page with a rank of 1.0


2. On each iteration:
a. Each page contributes to its neighbors its own rank divided by the
number of its neighbors: contribp = rankp / neighborsp
b. Set each page’s new rank based on the sum of its neighbors’ contributions:
new_rank = Σcontrib * 0.85 + 0.15
3. Each iteration incrementally improves the page ranking

Page 1 Iteration 2
1.31

Page 2 Page 3
0.39 1.7
Page 4
0.57

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-13
PageRank Algorithm (5)

1. Start each page with a rank of 1.0


2. On each iteration:
a. Each page contributes to its neighbors its own rank divided by the
number of its neighbors: contribp = rankp / neighborsp
b. Set each page’s new rank based on the sum of its neighbors’ contributions:
new_rank = Σcontrib * 0.85 + 0.15
3. Each iteration incrementally improves the page ranking

Page 1 Iteration 10
1.43 (Final)
Page 2 Page 3
0.46 1.38
Page 4
0.73

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-14
PageRank in Spark: Neighbor Contribution Function

Language: Python
def computeContribs(neighbors, rank):
for neighbor in neighbors: yield(neighbor, rank/len(neighbors))

Example: with neighbors = [page1, page2] and rank = 1.0, the function yields (page1, 0.5) and (page2, 0.5)

[Diagram: Page 4 (rank 1.0) links to Page 1 and Page 2, contributing 0.5 to each]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-15
PageRank in Spark: Example Data

Data format: source-page destination-page (one link per line)

page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

[Diagram: the link graph among Page 1, Page 2, Page 3, and Page 4]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-16
PageRank in Spark: Pairs of Page Links

page1 page3
Language: Python page2 page1
def computeContribs(neighbors, rank):… page4 page1
page3 page1
page4 page2
links = sc.textFile(file)\ page3 page4
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct() (page1,page3)
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-17
PageRank in Spark: Page Links Grouped by Source Page

page1 page3
Language: Python page2 page1
def computeContribs(neighbors, rank):… page4 page1
page3 page1
page4 page2
links = sc.textFile(file)\ page3 page4
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct()\ (page1,page3)
.groupByKey()
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)

links
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-18
PageRank in Spark: Persisting the Link Pair RDD

page1 page3
Language: Python page2 page1
def computeContribs(neighbors, rank):… page4 page1
page3 page1
page4 page2
links = sc.textFile(file)\ page3 page4
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct()\ (page1,page3)
.groupByKey()\
(page2,page1)
.persist()
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)

links
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-19
PageRank in Spark: Set Initial Ranks

Language: Python links


def computeContribs(neighbors, rank):… (page4, [page2,page1])
(page2, [page1])
links = sc.textFile(file)\ (page3, [page1,page4])
.map(lambda line: line.split())\
(page1, [page3])
.map(lambda pages: (pages[0],pages[1]))\
.distinct()\
.groupByKey()\
ranks
.persist()
(page4, 1.0)

ranks=links.map(lambda (page,neighbors): (page,1.0)) (page2, 1.0)


(page3, 1.0)
(page1, 1.0)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-20
PageRank in Spark: First Iteration (1)

Language: Python
def computeContribs(neighbors, rank):… links ranks
(page4, [page2,page1]) (page4, 1.0)
links = …
(page2, [page1]) (page2, 1.0)

ranks = … (page3, [page1,page4]) (page3, 1.0)


(page1, [page3]) (page1, 1.0)
for x in xrange(10):
contribs=links\
.join(ranks) (page4, ([page2,page1], 1.0))
(page2, ([page1], 1.0))
(page3, ([page1,page4], 1.0))
(page1, ([page3], 1.0))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-21
PageRank in Spark: First Iteration (2)

Language: Python
def computeContribs(neighbors, rank):… links ranks
(page4, [page2,page1]) (page4, 1.0)
links = …
(page2, [page1]) (page2, 1.0)

ranks = … (page3, [page1,page4]) (page3, 1.0)


(page1, [page3]) (page1, 1.0)
for x in xrange(10):
contribs=links\
.join(ranks)\ (page4, ([page2,page1], 1.0))
.flatMap(lambda (page,(neighbors,rank)): \ (page2, ([page1], 1.0))
computeContribs(neighbors,rank))
(page3, ([page1,page4], 1.0))
(page1, ([page3], 1.0))

contribs
(page2,0.5)
(page1,0.5)
(page1,1.0)
(page1,0.5)
(page4,0.5)
(page3,1.0)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-22
PageRank in Spark: First Iteration (3)

contribs
Language: Python (page2,0.5)
def computeContribs(neighbors, rank):… (page1,0.5)
(page1,1.0)
links = …
(page1,0.5)

ranks = … (page4,0.5)
(page3,1.0)
for x in xrange(10):
contribs=links\
.join(ranks)\ (page4,0.5)
.flatMap(lambda (page,(neighbors,rank)): \
(page2,0.5)
computeContribs(neighbors,rank))
ranks=contribs\ (page3,1.0)
.reduceByKey(lambda v1,v2: v1+v2) (page1,2.0)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-23
PageRank in Spark: First Iteration (4)

contribs
Language: Python (page2,0.5)
def computeContribs(neighbors, rank):… (page1,0.5)
(page1,1.0)
links = …
(page1,0.5)

ranks = … (page4,0.5)
(page3,1.0)
for x in xrange(10):
contribs=links\
.join(ranks)\ (page4,0.5)
.flatMap(lambda (page,(neighbors,rank)): \
(page2,0.5)
computeContribs(neighbors,rank))
ranks=contribs\ (page3,1.0)
.reduceByKey(lambda v1,v2: v1+v2)\ (page1,2.0)
.map(lambda (page,contrib): \
(page,contrib * 0.85 + 0.15)) ranks
(page4,.58)
(page2,.58)
(page3,1.0)
(page1,1.85)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-24
PageRank in Spark: Second Iteration

Language: Python
def computeContribs(neighbors, rank):… links ranks
(page4, [page2,page1]) (page4,0.58)
links = …
(page2, [page1]) (page2,0.58)

ranks = … (page3, [page1,page4]) (page3,1.0)


(page1, [page3]) (page1,1.85)
for x in xrange(10):
contribs=links\
.join(ranks)\
.flatMap(lambda (page,(neighbors,rank)): \
computeContribs(neighbors,rank))

ranks=contribs\
.reduceByKey(lambda v1,v2: v1+v2)\
.map(lambda (page,contrib): \ ranks
(page,contrib * 0.85 + 0.15)) (page4,0.57)
(page2,0.21)
for rank in ranks.collect(): print rank
(page3,1.0)
(page1,0.77)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-25
Checkpointing (1)

§ Maintaining RDD lineage provides resilience but can also cause problems
when the lineage gets very long
– For example: iterative algorithms, streaming data…
§ Recovery can be very expensive
§ Potential stack overflow

Language: Python
myrdd = …initial-value…
for x in xrange(100):
    myrdd = myrdd.transform(…)
myrdd.saveAsTextFile(dir)

[Diagram: the lineage chain grows with every iteration (Iter1, Iter2, Iter3, Iter4, … Iter100), each adding more RDDs to recompute]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-26
Checkpointing (2)

§ Checkpointing saves the data to HDFS


– Provides fault-tolerant storage across nodes
§ Lineage is not saved
§ Must be checkpointed before any actions on the RDD

Language: Python
sc.setCheckpointDir(directory)
myrdd = …initial-value…
for x in xrange(100):
    myrdd = myrdd.transform(…)
    if x % 3 == 0:
        myrdd.checkpoint()
        myrdd.count()
myrdd.saveAsTextFile(dir)

[Diagram: every few iterations the RDD data is checkpointed to HDFS, truncating the lineage chain (Iter3, Iter4, … Iter100)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-27
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-28
Fundamentals of Computer Programming

§ First consider how a typical program works


– Hardcoded conditional logic
– Predefined reactions when those conditions are met

#!/usr/bin/env python

import sys

for line in sys.stdin:
    if "Make MONEY Fa$t At Home!!!" in line:
        print "This message is likely spam"

    if "Happy Birthday from Aunt Betty" in line:
        print "This message is probably OK"

§ The programmer must consider all possibilities at design time


§ An alternative technique is to have computers learn what to do

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-29
What is Machine Learning?

§ Machine learning is a field within artificial intelligence (AI)


– AI: “The science and engineering of making intelligent machines”
§ Machine learning focuses on automated knowledge acquisition
– Primarily through the design and implementation of algorithms
– These algorithms require empirical data as input
§ Machine learning algorithms “learn” from data and often produce a
predictive model as their output
– Model can then be used to make predictions as new data arrives
§ For example, consider a predictive model based on credit card customers
– Build model with data about customers who did/did not default on debt
– Model can then be used to predict whether new customers will default

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-30
Types of Machine Learning

§ Three established categories of machine learning techniques:


– Collaborative filtering (recommendations)
– Clustering
– Classification

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-31
What is Collaborative Filtering?

§ Collaborative filtering is a technique for making recommendations


§ Helps users find items of relevance
– Among a potentially vast number of choices
– Based on comparison of preferences between users
– Preferences can be either explicit (stated) or implicit (observed)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-32
Applications Involving Collaborative Filtering

§ Collaborative filtering is domain agnostic


§ Can use the same algorithm to recommend practically anything
– Movies (Netflix, Amazon Instant Video)
– Television (TiVO Suggestions)
– Music (several popular music download and streaming services)
§ Amazon uses CF to recommend a variety of products

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-33
What is Clustering?

§ Clustering algorithms discover structure in collections of data


– Where no formal structure previously existed
§ They discover which clusters (“groupings”) naturally occur in data
– By examining various properties of the input data
§ Clustering is often used for exploratory analysis
– Divide huge amount of data into smaller groups
– Can then tune analysis for each group

[Diagram: example clusters plotted along dimensions such as Price, Brand status, and Store vs. Online]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-34
Unsupervised Learning (1)

§ Clustering is an example of unsupervised learning


– Begin with a data set that has no apparent label
– Use an algorithm to discover structure in the data

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-35
Unsupervised Learning (2)

§ Once the model has been created, you can use it to assign groups

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-36
Applications Involving Clustering

§ Market segmentation
– Group similar customers in order to target them effectively
§ Finding related news articles
– Google News
§ Epidemiological studies
– Identifying a “cancer cluster” and finding a root cause
§ Computer vision (groups of pixels that cohere into objects)
– Related pixels clustered to recognize faces or license plates

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-37
What is Classification?

§ Classification is a form of supervised learning


– This requires training with data that has known labels
– A classifier can then label new data based on what it learned in training
§ This example depicts how a classifier might identify animals
– In this case, it learned to distinguish between these two classes of animals based on height and weight

[Scatter plot: Weight (lb.) vs. Height (in.), with dog and cat data points forming two separable groups]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-38
Supervised Learning (1)

§ Classification is an example of supervised learning


– Begin with a data set that includes the value to be predicted (the label)
– Use an algorithm to train a predictive model using the data-label pairs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-39
Supervised Learning (2)

§ Once the model has been trained, you can make predictions
– This will take new (previously unseen) data as input
– The new data will not have labels

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-40
Applications Involving Classification

§ Spam filtering
– Train using a set of spam and non-spam messages
– System will eventually learn to detect unwanted email
§ Oncology
– Train using images of benign and malignant tumors
– System will eventually learn to identify cancer
§ Risk Analysis
– Train using financial records of customers who do/don’t default
– System will eventually learn to identify high-risk customers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-41
Relationship of Algorithms and Data Volume (1)

§ There are many algorithms for each type of machine learning


– There’s no overall “best” algorithm
– Each algorithm has advantages and limitations
§ Algorithm choice is often related to data volume
– Some scale better than others
§ Most algorithms offer better results as volume increases
– Best approach = simple algorithm + lots of data
§ Spark is an excellent platform for machine learning over large data sets
– Resilient, iterative, parallel computations over distributed data sets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-42
Relationship of Algorithms and Data Volume (2)

“It’s not who has the best algorithms that wins.


It’s who has the most data.” [Banko and Brill, 2001]
[Chart: Test Accuracy (0.70 to 1.00) vs. Millions of Words (0.1 to 1000) for four algorithms (Memory-Based, Winnow, Perceptron, Naive Bayes), all improving as training data grows]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-43
Machine Learning Challenges

§ Highly computation-intensive and iterative


§ Many traditional numerical processing systems do not scale to very large
datasets
– For example, MATLAB

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-44
Spark MLlib and Spark ML

§ Spark MLlib is a Spark machine learning library


– Makes practical machine learning scalable and easy
– Includes many common machine learning algorithms
– Includes base data types for efficient calculations at scale
– Supports scalable statistics and data transformations
§ Spark ML is a new higher-level API for machine learning pipelines
– Built on top of Spark’s DataFrames API
– Simple and clean interface for running a series of complex tasks
– Supports most functionality included in Spark MLlib
§ Spark MLlib and ML support a variety of machine learning algorithms
– Such as ALS (alternating least squares), k-means, linear regression,
logistic regression, gradient descent
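A hedged illustration of the MLlib API (the file name, k, and the sample point are placeholders; this is not part of the course exercises):

Language: Python
from pyspark.mllib.clustering import KMeans

# Each input line holds comma-separated numeric features, for example "37.4,-121.9"
points = sc.textFile("points.txt") \
           .map(lambda line: [float(x) for x in line.split(',')])

model = KMeans.train(points, k=5, maxIterations=10)
print model.clusterCenters            # the k cluster centers found
print model.predict([37.4, -121.9])   # cluster index for a new point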

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-45
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-46
k-means Clustering

§ k-means clustering
– A common iterative algorithm used in graph analysis and machine
learning
– You will implement a simplified version in the Hands-On Exercises

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-47
Clustering (1)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-48
Clustering (2)

Goal: Find “clusters” of data points

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-49
Example: k-means Clustering (1)

1. Choose k random points as


starting centers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-50
Example: k-means Clustering (2)

1. Choose k random points as


starting centers
2. Find all points closest to each
center

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-51
Example: k-means Clustering (3)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-52
Example: k-means Clustering (4)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-53
Example: k-means Clustering (5)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-54
Example: k-means Clustering (6)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-55
Example: k-means Clustering (7)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-56
Example: k-means Clustering (8)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-57
Example: k-means Clustering (9)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

5. Done!

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-58
Example: Approximate k-means Clustering

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed by
more than c, iterate again

5. Close enough!
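A hedged sketch of the iteration structure only (the helper function, file name, and constants are illustrative; this is not the exercise solution):

Language: Python
def closest(p, centers):
    # index of the center nearest to point p = (x, y)
    return min(range(len(centers)),
               key=lambda i: (p[0]-centers[i][0])**2 + (p[1]-centers[i][1])**2)

k = 5
points = sc.textFile("points.txt") \
           .map(lambda line: tuple(map(float, line.split(',')))) \
           .persist()

centers = points.takeSample(False, k)             # 1. k random starting centers
for step in xrange(10):
    # 2. assign each point to its closest center
    assigned = points.map(lambda p: (closest(p, centers), (p, 1)))
    # 3. compute the mean (new center) of each cluster
    sums = assigned.reduceByKey(
        lambda (p1, n1), (p2, n2): ((p1[0]+p2[0], p1[1]+p2[1]), n1+n2))
    newCenters = sums.map(
        lambda (i, (s, n)): (i, (s[0]/n, s[1]/n))).collectAsMap()
    centers = [newCenters.get(i, centers[i]) for i in range(k)]
    # 4. a complete implementation would stop once no center moves more than c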

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-59
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-60
Essential Points

§ Spark is especially suited to big data problems that require iteration


– In-memory persistence makes this very efficient
§ Common in many types of analysis
– For example, common algorithms such as PageRank and k-means
§ Spark is especially well-suited for implementing machine learning
§ Spark includes MLlib and ML
– Specialized libraries to implement many common machine learning
functions
– Efficient, scalable functions for machine learning (for example: logistic
regression, k-means)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-61
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-62
Hands-On Exercise: Implement an Iterative Algorithm with
Apache Spark
§ In this exercise, you will
– Implement k-means in Spark in order to identify clustered location data
points from Loudacre device status logs
– Find the geographic centers of device activity
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-63
DataFrames and Apache Spark SQL
Chapter 14
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-2
DataFrames and Apache Spark SQL

In this chapter you will learn


§ What Spark SQL is
§ What features the DataFrame API provides
§ How to create a SQLContext
§ How to load existing data into a DataFrame
§ How to query data in a DataFrame
§ How to convert from DataFrames to pair RDDs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-3
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-4
What is Spark SQL?

§ What is Spark SQL?


– Spark module for structured data processing
– Replaces Shark (a prior Spark module, now deprecated)
– Built on top of core Spark
§ What does Spark SQL provide?
– The DataFrame API—a library for working with data as tables
– Defines DataFrames containing rows and columns
– DataFrames are the focus of this chapter!
– Catalyst Optimizer—an extensible optimization framework
– A SQL engine and command line interface

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-5
SQL Context

§ The main Spark SQL entry point is a SQL context object


– Requires a SparkContext object
– The SQL context in Spark SQL is similar to Spark context in core Spark
§ There are two implementations
– SQLContext
– Basic implementation
– HiveContext
– Reads and writes Hive/HCatalog tables directly
– Supports full HiveQL language
– Requires the Spark application be linked with Hive libraries
– Cloudera recommends using HiveContext

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-6
Creating a SQL Context

§ The Spark shell creates a HiveContext instance automatically


– Access it using the sqlContext variable
– You will need to create one yourself when writing a Spark application
– Having multiple SQL context objects is allowed
§ A SQL context object is created based on the Spark context
Language: Python
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

Language: Scala
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-7
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-8
DataFrames

§ DataFrames are the main abstraction in Spark SQL


– Analogous to RDDs in core Spark
– A distributed collection of structured data organized into named
columns
– Built on a base RDD containing Row objects

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-9
Creating DataFrames

§ DataFrames can be created


– From an existing structured data source
– Such as a Hive table, Parquet file, or JSON file
– From an existing RDD
– By performing an operation or query on another DataFrame
– By programmatically defining a schema

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-10
Creating a DataFrame from a Data Source

§ sqlContext.read returns a DataFrameReader object


§ DataFrameReader provides the functionality to load data into a
DataFrame
§ Convenience functions
– json(filename)
– parquet(filename)
– orc(filename)
– table(hive-tablename)
– jdbc(url,table,options)
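
§ For example, a brief sketch of the convenience functions above (the Parquet path below is a placeholder, not part of the course data; sqlContext is the HiveContext created earlier in this chapter)

Language: Python
# Placeholder path and existing "customers" table used for illustration
accountsDF = sqlContext.read.parquet("/loudacre/accounts_parquet")
customersDF = sqlContext.read.table("customers")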

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-11
Example: Creating a DataFrame from a JSON File
Language: Python
sqlContext = HiveContext(sc)
peopleDF = sqlContext.read.json("people.json")

Language: Scala
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val peopleDF = sqlContext.read.json("people.json")

File: people.json
{"name":"Alice", "pcode":"94304"}
{"name":"Brayden", "age":30, "pcode":"94304"}
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46}
{"name":"Étienne", "pcode":"94104"}

Resulting DataFrame (peopleDF):
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-12
Example: Creating a DataFrame from a Hive/Impala Table
Language: Python
sqlContext = HiveContext(sc)
customerDF = sqlContext.read.table("customers")

Language: Scala
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val customerDF = sqlContext.read.table("customers")

Table: customers (the resulting customerDF DataFrame contains the same rows)
cust_id  name    country
001      Ani     us
002      Bob     ca
003      Carlos  mx
…        …       …

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-13
Loading from a Data Source Manually

§ You can specify settings for the DataFrameReader


– format: Specify a data source type
– option: A key/value setting for the underlying data source
– schema: Specify a schema instead of inferring from the data source
§ Then call the generic base function load
sqlContext.read.
format("com.databricks.spark.avro").
load("/loudacre/accounts_avro")

sqlContext.read.
format("jdbc").
option("url","jdbc:mysql://localhost/loudacre").
option("dbtable","accounts").
option("user","training").
option("password","training").
load()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-14
Data Sources

§ Spark SQL 1.6 built-in data source types


– table
– json
– parquet
– jdbc
– orc
§ You can also use third party data source libraries, such as
– Avro (included in CDH)
– HBase
– CSV
– MySQL
– and more being added all the time
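
§ As an illustration only, loading a CSV file with the third-party spark-csv package might look like the sketch below (assumes the com.databricks spark-csv package is available; the path is a placeholder)

Language: Python
# Hedged sketch: package availability and the input path are assumptions
accountsDF = (sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")        # first line contains column names
    .option("inferSchema", "true")   # infer column types from the data
    .load("/loudacre/accounts_csv"))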

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-15
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-16
DataFrame Basic Operations (1)

§ Basic operations deal with DataFrame metadata (rather than its data)
§ Some examples
– schema returns a schema object describing the data
– printSchema displays the schema as a visual tree
– cache / persist persists the DataFrame to disk or memory
– columns returns an array containing the names of the columns
– dtypes returns an array of (column name,type) pairs
– explain prints debug information about the DataFrame to the
console
§ Most of these examples will be demonstrated in the next several slides
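
§ A quick preview, using the peopleDF DataFrame created from people.json earlier in this chapter

Language: Python
peopleDF.printSchema()       # displays the schema as a visual tree
print peopleDF.columns       # list of column names: ['age', 'name', 'pcode']
peopleDF.persist()           # marks the DataFrame for caching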

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-17
DataFrame Basic Operations (2)

§ Example: Displaying column data types using dtypes


Language: Python
> peopleDF = sqlContext.read.json("people.json")
> for item in peopleDF.dtypes: print item
('age', 'bigint')
('name', 'string')
('pcode', 'string')

Language: Scala
> val peopleDF = sqlContext.read.json("people.json")
> peopleDF.dtypes.foreach(println)
(age,LongType)
(name,StringType)
(pcode,StringType)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-18
Working with Data in a DataFrame

§ Queries—create a new DataFrame


– DataFrames are immutable
– Queries are analogous to RDD transformations
§ Actions—return data to the driver
– Actions trigger execution of the lazily defined queries
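
§ A minimal sketch of this behavior, using the peopleDF DataFrame introduced earlier

Language: Python
# Defining a query creates a new DataFrame but executes nothing yet
adultNamesDF = peopleDF.where("age > 21").select("name")
# The action triggers execution of the whole query
adultNamesDF.show()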

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-19
DataFrame Actions

§ Some DataFrame actions


– collect returns all rows as an array of Row objects
– take(n) returns the first n rows as an array of Row objects
– count returns the number of rows
– show(n) displays the first n rows (default=20)

Language: Python
> peopleDF.count()
5L
> peopleDF.show(3)
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036

Language: Scala
> peopleDF.count()
res7: Long = 5
> peopleDF.show(3)
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-20
DataFrame Queries (1)

§ DataFrame query methods return new DataFrames


– Queries can be chained like transformations
§ Some query methods
– distinct returns a new DataFrame with distinct elements of this DF
– join joins this DataFrame with a second DataFrame
– Variants for inner, outer, left, and right joins
– limit returns a new DataFrame with the first n rows of this DF
– select returns a new DataFrame with data from one or more columns
of the base DataFrame
– where returns a new DataFrame with rows meeting specified query
criteria (alias for filter)
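
§ For example, query methods can be chained, with each step returning a new DataFrame

Language: Python
# Distinct postal codes of people over 21, limited to the first 10 rows
pcodesDF = peopleDF.where("age > 21").select("pcode").distinct().limit(10)
pcodesDF.show()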

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-21
DataFrame Queries (2)

§ Example: A basic query with limit


Language: Scala/Python
> peopleDF.limit(3).show()

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Output of show (the first 3 rows only):
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-22
DataFrame Query Strings (1)

§ Some query operations take strings containing simple query expressions
– Such as select and where
§ Example: select

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Selecting the age column:
age
null
30
19
46
null

Selecting the name and age columns:
name     age
Alice    null
Brayden  30
Carla    19
Diana    46
Étienne  null

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-23
DataFrame Query Strings (2)

§ Example: where

peopleDF.where("age > 21")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Result:
age   name     pcode
30    Brayden  94304
46    Diana    null

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-24
Querying DataFrames using Columns (1)

§ Some DataFrame queries take one or more columns or column expressions


– Required for more sophisticated operations
§ Some examples
– select
– sort
– join
– where

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-25
Querying DataFrames using Columns (2)

§ Columns can be referenced in multiple ways


§ Python
ageDF = peopleDF.select(peopleDF['age'])
ageDF = peopleDF.select(peopleDF.age)

§ Scala
val ageDF = peopleDF.select(peopleDF("age"))
val ageDF = peopleDF.select($"age")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

ageDF:
age
null
30
19
46
null

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-26
Querying DataFrames using Columns (3)

§ Column references can also be column expressions


Language: Python
peopleDF.select(peopleDF['name'],peopleDF['age']+10)

Language: Scala
peopleDF.select(peopleDF("name"),peopleDF("age")+10)

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Result:
name     age+10
Alice    null
Brayden  40
Carla    29
Diana    56
Étienne  null

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-27
Querying DataFrames using Columns (4)

§ Example: Sorting by columns (descending)


Note: .asc and .desc are column expression methods used with sort

Language: Python
peopleDF.sort(peopleDF['age'].desc())

Language: Scala
peopleDF.sort(peopleDF("age").desc)

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Result (sorted by age, descending):
age   name     pcode
46    Diana    null
30    Brayden  94304
19    Carla    10036
null  Alice    94304
null  Étienne  94104

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-28
Joining DataFrames (1)

§ A basic inner join when the join column is present in both DataFrames


Language: Python/Scala
peopleDF.join(pcodesDF, "pcode")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

pcodesDF:
pcode  city           state
10036  New York       NY
87501  Santa Fe       NM
94304  Palo Alto      CA
94104  San Francisco  CA

Result (inner join on pcode):
pcode  age   name     city           state
94304  null  Alice    Palo Alto      CA
94304  30    Brayden  Palo Alto      CA
10036  19    Carla    New York       NY
94104  null  Étienne  San Francisco  CA

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-29
Joining DataFrames (2)

§ Specify type of join as inner (default), outer, left_outer,


right_outer, or leftsemi
Language: Python
peopleDF.join(pcodesDF, "pcode", "left_outer")

Language: Scala
peopleDF.join(pcodesDF, Array("pcode"), "left_outer")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

pcodesDF:
pcode  city           state
10036  New York       NY
87501  Santa Fe       NM
94304  Palo Alto      CA
94104  San Francisco  CA

Result (left outer join on pcode):
pcode  age   name     city           state
94304  null  Alice    Palo Alto      CA
94304  30    Brayden  Palo Alto      CA
10036  19    Carla    New York       NY
null   46    Diana    null           null
94104  null  Étienne  San Francisco  CA

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-30
Joining DataFrames (3)

§ Use a column expression when column names are different


Language: Python
peopleDF.join(zcodesDF, peopleDF.pcode == zcodesDF.zip)

Language: Scala
peopleDF.join(zcodesDF, $"pcode" === $"zip")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

zcodesDF:
zip    city           state
10036  New York       NY
87501  Santa Fe       NM
94304  Palo Alto      CA
94104  San Francisco  CA

Result (inner join on pcode == zip):
pcode  age   name     zip    city           state
94304  null  Alice    94304  Palo Alto      CA
94304  30    Brayden  94304  Palo Alto      CA
10036  19    Carla    10036  New York       NY
94104  null  Étienne  94104  San Francisco  CA

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-31
SQL Queries (1)

§ When using HiveContext, you can query Hive/Impala tables using


HiveQL
– Returns a DataFrame
Language: Python/Scala
sqlContext.
sql("""SELECT * FROM customers WHERE name LIKE "A%" """)

Table: customers
cust_id  name    country
001      Ani     us
002      Bob     ca
003      Carlos  mx
…        …       …

Result:
cust_id  name  country
001      Ani   us

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-32
SQL Queries (2)

§ You can also perform some SQL queries with a DataFrame


– First, register the DataFrame as a “table” with the SQL context

Language: Python/Scala
peopleDF.registerTempTable("people")
sqlContext.
sql("""SELECT * FROM people WHERE name LIKE "A%" """)

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Result:
age   name   pcode
null  Alice  94304

Note: This feature does not depend on Hive or Impala, or on a database

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-33
SQL Queries (3)

§ You can query directly from Parquet or JSON files without needing to
create a DataFrame or register a temporary table
Language: Python/Scala
sqlContext.
sql("""SELECT * FROM json.`/user/training/people.json` WHERE
name LIKE "A%" """)

File: people.json
{"name":"Alice", "pcode":"94304"}
{"name":"Brayden", "age":30, "pcode":"94304"}
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46}
{"name":"Étienne", "pcode":"94104"}

Result:
age   name   pcode
null  Alice  94304

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-34
Other Query Functions

§ DataFrames provide many other data manipulation and query functions


such as
– Aggregation such as groupBy, orderBy, and agg
– Multi-dataset operations such as join, unionAll, and intersect
– Statistics such as avg, sampleBy, corr, and cov
– Multi-variable functions rollup and cube
– Window-based analysis functions
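
§ A brief sketch of groupBy-based aggregation, continuing with the peopleDF example

Language: Python
# Count the number of people per postal code, largest groups first
pcodeCountsDF = peopleDF.groupBy("pcode").count()
pcodeCountsDF.orderBy("count", ascending=False).show()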

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-35
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-36
Saving DataFrames

§ Data in DataFrames can be saved to a data source


§ Use DataFrame.write to create a DataFrameWriter
§ DataFrameWriter provides convenience functions to externally save
the data represented by a DataFrame
– jdbc inserts into a new or existing table in a database
– json saves as a JSON file
– parquet saves as a Parquet file
– orc saves as an ORC file
– text saves as a text file (string data in a single column only)
– saveAsTable saves as a Hive/Impala table (HiveContext only)

Language: Python/Scala
peopleDF.write.saveAsTable("people")

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-37
Options for Saving DataFrames

§ DataFrameWriter option methods


– format specifies a data source type
– mode determines the behavior if file or table already exists:
overwrite, append, ignore or error (default is error)
– partitionBy stores data in partitioned directories in the form
column=value (as with Hive/Impala partitioning)
– options specifies properties for the target data source
– save is the generic base function to write the data
Language: Python/Scala
peopleDF.write.
format("parquet").
mode("append").
partitionBy("age").
saveAsTable("people")

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-38
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-39
DataFrames and RDDs (1)

§ DataFrames are built on RDDs


– Base RDDs contain Row objects
– Use rdd to get the underlying RDD

peopleRDD = peopleDF.rdd

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

peopleRDD:
Row[null,Alice,94304]
Row[30,Brayden,94304]
Row[19,Carla,10036]
Row[46,Diana,null]
Row[null,Étienne,94104]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-40
DataFrames and RDDs (2)

§ Row RDDs have all the standard Spark actions and transformations
– Actions: collect, take, count, and so on
– Transformations: map, flatMap, filter, and so on
§ Row RDDs can be transformed into pair RDDs to use map-reduce methods
§ DataFrames also provide convenience methods (such as map, flatMap,
and foreach) for working with the underlying RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-41
Working with Row Objects

§ The syntax for extracting data from Row objects depends on language
§ Python
– Column names are object attributes
– row.age returns age column value from row
§ Scala
– Use Array-like syntax to return values with type Any
– row(n) returns element in the nth column
– row.fieldIndex("age") returns the index of the age column
– Use methods to get correctly typed values
– row.getAs[Long]("age")
– Use type-specific get methods to return typed values
– row.getString(n) returns nth column as a string
– row.getInt(n) returns nth column as an integer
– And so on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-42
Example: Extracting Data from Row Objects
§ Extract data from Row objects

Language: Python
peopleRDD = peopleDF \
    .map(lambda row:(row.pcode,row.name))
peopleByPCode = peopleRDD \
    .groupByKey()

Language: Scala
val peopleRDD = peopleDF.
    map(row =>
        (row(row.fieldIndex("pcode")),
         row(row.fieldIndex("name"))))
val peopleByPCode = peopleRDD.
    groupByKey()

Input (RDD of Row objects):
Row[null,Alice,94304]
Row[30,Brayden,94304]
Row[19,Carla,10036]
Row[46,Diana,null]
Row[null,Étienne,94104]

After map (pair RDD of (pcode,name)):
(94304,Alice)
(94304,Brayden)
(10036,Carla)
(null,Diana)
(94104,Étienne)

After groupByKey:
(null,[Diana])
(94304,[Alice,Brayden])
(10036,[Carla])
(94104,[Étienne])

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-43
Converting RDDs to DataFrames

§ You can also create a DF from an RDD using createDataFrame


Language: Python
from pyspark.sql.types import *
schema = StructType([StructField("age",IntegerType(),True),
StructField("name",StringType(),True),
StructField("pcode",StringType(),True)])
myrdd = sc.parallelize([(40,"Abram","01601"),
(16,"Lucia","87501")])
mydf = sqlContext.createDataFrame(myrdd,schema)

import org.apache.spark.sql.types._ Language: Scala


import org.apache.spark.sql.Row
val schema = StructType(Array(
StructField("age", IntegerType, true),
StructField("name", StringType, true),
StructField("pcode", StringType, true)))
val rowrdd = sc.parallelize(Array(Row(40,"Abram","01601"),
Row(16,"Lucia","87501")))
val mydf = sqlContext.createDataFrame(rowrdd,schema)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-44
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-45
Comparing Impala to Spark SQL

§ Spark SQL is built on Spark, a general purpose processing engine


– Provides convenient SQL-like access to structured data in a Spark
application
§ Impala is a specialized SQL engine
– Much better performance for querying
– Much more mature than Spark SQL
– Robust security using Sentry
§ Impala is better for
– Interactive queries
– Data analysis
§ Use Spark SQL for
– ETL
– Access to structured data required by a Spark application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-46
Comparing Spark SQL with Hive on Spark

§ Spark SQL
– Provides the DataFrame API to allow structured data
processing in a Spark application
– Programmers can mix SQL with procedural processing
§ Hive on Spark
– Hive provides a SQL abstraction layer over MapReduce or
Spark
– Allows non-programmers to analyze data using familiar
SQL
– Hive on Spark replaces MapReduce as the engine
underlying Hive
– Does not affect the user experience of Hive
– Except queries run many times faster!

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-47
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-48
What’s Coming in Spark 2.x?

§ Spark 2.0 is the next major release of Spark


§ Several significant changes related to Spark SQL, including
– SparkSession replaces SQLContext and HiveContext
– Support for ANSI-SQL as well as HiveQL
– Support for subqueries
– Support for Datasets
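
§ As a preview only (this course environment uses Spark 1.6), creating the unified SparkSession entry point in Spark 2.x looks roughly like the sketch below; the application name is a placeholder

Language: Python
# Spark 2.x preview; not part of the Spark 1.6 course environment
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .enableHiveSupport() \
    .getOrCreate()

peopleDF = spark.read.json("people.json")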

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-49
Spark Datasets

§ Datasets are an alternative to RDDs for structured data


– A strongly-typed collection of objects, mapped to a relational schema
– Unified with the DataFrame API—DFs are Datasets of Row objects
– Use the Spark Catalyst optimizer as DFs do for better performance

Word count using RDDs
Language: Scala
val countsRDD = sc.textFile(filename).
    flatMap(line => line.split(" ")).
    map(word => (word,1)).
    reduceByKey((v1,v2) => v1+v2)

Word count using Datasets
Language: Scala
val countsDS = sqlContext.read.text(filename).as[String].
    flatMap(line => line.split(" ")).
    groupBy(word => word).
    count()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-50
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-51
Essential Points

§ Spark SQL is a Spark API for handling structured and semi-structured data
§ Entry point is a SQL context
§ DataFrames are the key unit of data
– DataFrames are based on an underlying RDD of Row objects
– DataFrames query methods return new DataFrames; similar to RDD
transformations
– The full Spark API can be used with Spark SQL data by accessing the
underlying RDD
§ Spark SQL is not a replacement for a database, or a specialized SQL engine
like Impala
– Spark SQL is most useful for ETL or incorporating structured data into a
Spark application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-52
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-53
Hands-On Exercise: Use Apache Spark SQL for ETL

§ In this exercise, you will


– Import data from MySQL using Sqoop
– Use Spark SQL to normalize the data
– Save the data to Parquet format
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-54
Message Processing with Apache Kafka
Chapter 15
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-2
Message Processing with Apache Kafka

In this chapter you will learn


§ What Apache Kafka is and what advantages it offers
§ About the high-level architecture of Kafka
§ How to create topics, publish messages, and read messages from the
command line

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-3
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-4
What Is Apache Kafka?

§ Apache Kafka is a distributed commit log service


– Widely used for data ingest
– Conceptually similar to a publish-subscribe messaging system
– Offers scalability, performance, reliability, and flexibility
§ Originally created at LinkedIn, now an open source Apache project
– Donated to the Apache Software Foundation in 2011
– Graduated from the Apache Incubator in 2012
– Supported by Cloudera for production use with CDH in 2015

Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-5
Characteristics of Kafka

§ Scalable
– Kafka is a distributed system that supports multiple nodes
§ Fault-tolerant
– Data is persisted to disk and can be replicated throughout the cluster
§ High throughput
– Each broker can process hundreds of thousands of messages per second *
§ Low latency
– Data is delivered in a fraction of a second
§ Flexible
– Decouples the production of data from its consumption

* Using modest hardware, with messages of a typical size


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-6
Kafka Use Cases

§ Kafka is used for a variety of use cases, such as


– Log aggregation
– Messaging
– Web site activity tracking
– Stream processing
– Event sourcing

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-7
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-8
Key Terminology

§ Message
– A single data record passed by Kafka
§ Topic
– A named log or feed of messages within Kafka
§ Producer
– A program that writes messages to Kafka
§ Consumer
– A program that reads messages from Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-9
Example: High-Level Architecture

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-10
Messages (1)

§ Messages in Kafka are variable-size byte arrays


– Represent arbitrary user-defined content
– Use any format your application requires
– Common formats include free-form text, JSON, and Avro
§ There is no explicit limit on message size
– Optimal performance at a few KB per message
– Practical limit of 1MB per message

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-11
Messages (2)

§ Kafka retains all messages for a defined time period and/or total size
– Administrators can specify retention on global or per-topic basis
– Kafka will retain messages regardless of whether they were read
– Kafka discards messages automatically after the retention period or
total size is exceeded (whichever limit is reached first)
– Default retention is one week
– Retention can reasonably be one year or longer

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-12
Topics

§ There is no explicit limit on the number of topics


– However, Kafka works better with a few large topics than many small
ones
§ A topic can be created explicitly or simply by publishing to the topic
– This behavior is configurable
– Cloudera recommends that administrators disable auto-creation of
topics to avoid accidental creation of large numbers of topics

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-13
Producers

§ Producers publish messages to Kafka topics


– They communicate with Kafka, not a consumer
– Kafka persists messages to disk on receipt

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-14
Consumers

§ A consumer reads messages that were published to Kafka topics


– They communicate with Kafka, not any producer
§ Consumer actions do not affect other consumers
– For example, having one consumer display the messages in a topic as
they are published does not change what is consumed by other
consumers
§ They can come and go without impact on the cluster or other consumers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-15
Producers and Consumers

§ Tools available as part of Kafka


– Command-line producer and consumer tools
– Client (producer and consumer) Java APIs
§ A growing number of other APIs are available from third parties
– Client libraries in many languages including Python, PHP, C/C++, Go,
.NET, and Ruby
§ Integrations with other tools and projects include
– Apache Flume
– Apache Spark
– Amazon AWS
– syslog
§ Kafka also has a large and growing ecosystem
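
§ As an illustration of a third-party client library (not part of the course environment), a minimal sketch with the Python kafka-python package might look like this; the broker address and topic name follow the examples later in this chapter

Language: Python
# Hedged sketch using the third-party kafka-python library;
# broker address and topic name are assumptions for illustration
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="brokerhost1:9092")
producer.send("device_status", b"device 42 is online")   # messages are byte arrays
producer.flush()

consumer = KafkaConsumer("device_status",
                         bootstrap_servers="brokerhost1:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print message.value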

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-16
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-17
Scaling Kafka

§ Scalability is one of the key benefits of Kafka


§ Two features let you scale Kafka for performance
– Topic partitions
– Consumer groups

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-18
Topic Partitioning

§ Kafka divides each topic into some number of partitions *


– Topic partitioning improves scalability and throughput
§ A topic partition is an ordered and immutable sequence of messages
– New messages are appended to the partition as they are received
– Each message is assigned a unique sequential ID known as an offset

* Note that this is unrelated to partitioning in HDFS or Spark


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-19
Consumer Groups

§ One or more consumers can form their own consumer group that work
together to consume the messages in a topic
§ Each partition is consumed by only one member of a consumer group
§ Message ordering is preserved per partition, but not across the topic

Diagram: A Kafka cluster hosts the topic "click-tracking" with partitions 0, 3, 1, and 2; consumers 1 and 2, members of consumer group "click-processing", divide the partitions between them.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-20
Increasing Consumer Throughput

§ Additional consumers can be added to scale consumer group processing


§ Consumer instances that belong to the same consumer group can be in
separate processes or on separate machines

Diagram: The topic "click-tracking" with partitions 0, 3, 1, and 2; consumers C1–C4 in consumer group "click-processing" each consume one partition.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-21
Multiple Consumer Groups

§ Each message published to a topic is delivered to one consumer instance


within each subscribing consumer group
§ Kafka scales to large numbers of consumer groups and consumers

Diagram: The topic "click-tracking" with partitions 0, 3, 1, and 2; consumers C1–C4 form Consumer Group A and consumers C5–C6 form Consumer Group B; each group independently receives every message.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-22
Publish and Subscribe to Topic

§ Kafka functions like a traditional queue when all consumer instances


belong to the same consumer group
– In this case, a given message is received by one consumer
§ Kafka functions like traditional publish-subscribe when each consumer
instance belongs to a different consumer group
– In this case, all messages are broadcast to all consumer groups

Diagram: The topic "click-tracking" with partitions 0, 3, 1, and 2; consumers C1–C7 are spread across Consumer Groups A, B, and C, so each message is delivered once to each group.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-23
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-24
Kafka Clusters

§ A Kafka cluster consists of one or more brokers—servers running the Kafka


broker daemon
§ Kafka depends on the Apache ZooKeeper service for coordination

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-25
Apache ZooKeeper

§ Apache ZooKeeper is a coordination service for distributed applications


§ Kafka depends on the ZooKeeper service for coordination
– Typically running three or five ZooKeeper instances
§ Kafka uses ZooKeeper to keep track of brokers running in the cluster
§ Kafka uses ZooKeeper to detect the addition or removal of consumers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-26
Kafka Brokers

§ Brokers are the fundamental daemons that make up a Kafka cluster


§ A broker fully stores a topic partition on disk, with caching in memory
§ A single broker can reasonably host 1000 topic partitions
§ One broker is elected controller of the cluster (for assignment of topic
partitions to brokers, and so on)
§ Each broker daemon runs in its own JVM
– A single machine can run multiple broker daemons

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-27
Topic Replication

§ At topic creation, a topic can be set with a replication count


– Doing so is recommended, as it provides fault tolerance
§ Each broker can act as a leader for some topic partitions and a follower for
others
– Followers passively replicate the leader
– If the leader fails, a follower will automatically become the new leader

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-28
Messages Are Replicated

§ Configure the producer with a list of one or more brokers


– The producer asks the first available broker for the leader of the desired
topic partition
§ The producer then sends the message to the leader
– The leader writes the message to its local log
– Each follower then writes the message to its own log
– After acknowledgements from followers, the message is committed

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-29
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-30
Creating Topics from the Command Line

§ Kafka includes a convenient set of command line tools


– These are helpful for exploring and experimentation
§ The kafka-topics command offers a simple way to create Kafka topics
– Provide the topic name of your choice, such as device_status
– You must also specify the ZooKeeper connection string for your cluster

$ kafka-topics --create \
--zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
--replication-factor 3 \
--partitions 5 \
--topic device_status

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-31
Displaying Topics from the Command Line

§ Use the --list option to list all topics

$ kafka-topics --list \
--zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181

§ Use the --help option to list all kafka-topics options

$ kafka-topics --help

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-32
Running a Producer from the Command Line (1)

§ You can run a producer using the kafka-console-producer tool


§ Specify one or more brokers in the --broker-list option
– Each broker consists of a hostname, a colon, and a port number
– If specifying multiple brokers, separate them with commas
§ You must also provide the name of the topic

$ kafka-console-producer \
--broker-list brokerhost1:9092,brokerhost2:9092 \
--topic device_status

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-33
Running a Producer from the Command Line (2)

§ You may see a few log messages in the terminal after the producer starts
§ The producer will then accept input in the terminal window
– Each line you type will be a message sent to the topic
§ Until you have configured a consumer for this topic, you will see no other
output from Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-34
Writing File Contents to Topics Using the Command Line

§ Using UNIX pipes or redirection, you can read input from files
– The data can then be sent to a topic using the command line producer
§ This example shows how to read input from a file named alerts.txt
– Each line in this file becomes a separate message in the topic

$ cat alerts.txt | kafka-console-producer \


--broker-list brokerhost1:9092,brokerhost2:9092 \
--topic device_status

§ This technique can be an easy way to integrate with existing programs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-35
Running a Consumer from the Command Line

§ You can run a consumer with the kafka-console-consumer tool


§ This requires the ZooKeeper connection string for your cluster
– Unlike starting a producer, which requires a list of brokers
§ The command also requires a topic name
§ Use --from-beginning to read all available messages
– Otherwise, it reads only new messages

$ kafka-console-consumer \
--zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
--topic device_status \
--from-beginning

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-36
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-37
Essential Points

§ Producers publish messages to categories called topics


§ Messages in a topic are read by consumers
§ Topics are divided into partitions for performance and scalability
– These partitions are replicated for fault tolerance
§ Consumer groups work together to consume the messages in a topic
§ Nodes running the Kafka service are called brokers
§ Kafka includes command-line tools for managing topics, and for starting
producers and consumers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-38
Bibliography

The following offer more information on topics discussed in this chapter


§ The Apache Kafka web site
– http://kafka.apache.org/
§ Real-Time Fraud Detection Architecture
– http://tiny.cloudera.com/kmc01a
§ Kafka Reference Architecture
– http://tiny.cloudera.com/kmc01b
§ The Log: What Every Software Engineer Should Know…
– http://tiny.cloudera.com/kmc01c

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-39
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-40
Hands-On Exercise: Produce and Consume Apache Kafka
Messages
§ In this exercise, you will
– Use Kafka’s command line utilities to create a new topic, publish
messages to the topic with a producer, and read messages from the
topic with a consumer
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-41
Capturing Data with Apache Flume
Chapter 16
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-2
Capturing Data with Apache Flume

In this chapter you will learn


§ What are the main architectural components of Apache Flume
§ How these components are configured
§ How to launch a Flume agent

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-3
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-4
What Is Apache Flume?

§ Apache Flume is a high-performance system for data collection


– Name derives from original use case of near-real time log data ingestion
– Now widely used for collection of any streaming event data
– Supports aggregating data from many sources into HDFS
§ Originally developed by Cloudera
– Donated to Apache Software Foundation in 2011
– Became a top-level Apache project in 2012
– Flume OG (Old Generation) gave way to Flume NG (Next Generation)
§ Benefits of Flume
– Horizontally-scalable
– Extensible
– Reliable

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-5
Flume’s Design Goals: Reliability

§ Channels provide Flume’s reliability


§ Examples
– Memory channel: Fault intolerant, data will be lost if power is lost
– Disk-based channel: Fault tolerant
– Kafka channel: Fault tolerant
§ Data transfer between agents and channels is transactional
– A failed data transfer to a downstream agent rolls back and retries
§ You can configure multiple agents with the same task
– For example, two agents doing the job of one “collector”—if one agent
fails then upstream agents would fail over

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-6
Flume’s Design Goals: Scalability

§ Scalability
– The ability to increase system performance linearly—or better—by
adding more resources to the system
– Flume scales horizontally
– As load increases, add more agents to the machine, and add more
machines to the system

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-7
Flume’s Design Goals: Extensibility

§ Extensibility
– The ability to add new functionality to a system
§ Flume can be extended by adding sources and sinks to existing storage
layers or data platforms
– Flume includes sources that can read data from files, syslog, and
standard output from any Linux process
– Flume includes sinks that can write to files on the local filesystem, HDFS,
Kudu, HBase, and so on
– Developers can write their own sources or sinks

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-8
Common Flume Data Sources

§ Common sources of data collected into a Hadoop cluster with Flume include
– Log files
– Sensor data
– Status updates
– UNIX syslog
– Network sockets
– Social media posts
– Program output

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-9
Large-Scale Deployment Example

§ Flume collects data using configurable agents


– Agents can receive data from many sources, including other agents
– Large-scale deployments use multiple tiers for scalability and reliability
– Flume supports inspection and modification of in-flight data

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-10
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-11
Flume Events

§ An event is the fundamental unit of data in Flume


– Consists of a body (payload) and a collection of headers (metadata)
§ Headers consist of name-value pairs
– Headers are mainly used for directing output

Anatomy of a Flume Event

Headers:
timestamp: 1395256884
hostname: webserver05.loudacre.com
datacenter: palo-alto

Body:
192.168.5.150 - pablo [11/May/2014:23:56:27 -0800] "GET /KBDOC-0U812.html HTTP/1.1" 200 6747 "http://www.loudacre.com/kb/search?q=batteries" "Mozilla/5.0 (ACME v3.14159)"

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-12
Components in Flume’s Architecture

§ Source
– Receives events from the external actor that generates them
§ Sink
– Sends an event to its destination
§ Channel
– Buffers events from the source until they are drained by the sink
§ Agent
– Configures and hosts the source, channel, and sink
– A Java process that runs in a JVM

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-13
Flume Data Flow

§ This diagram illustrates how syslog data might be captured to HDFS


1. Server running a syslog daemon logs a message
2. Flume agent configured with syslog source retrieves event
3. Source pushes event to the channel, where it is buffered in memory
4. Sink pulls data from the channel and writes it to HDFS

Diagram: A syslog server sends a syslog message to the Flume agent's source (syslog), which pushes the event into the channel (memory); the sink (HDFS) drains the channel and writes the data to a file in HDFS on the Hadoop cluster.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-14
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-15
Notable Built-In Flume Sources

§ Syslog
– Captures messages from UNIX syslog daemon over the network
§ Netcat
– Captures any data written to a socket on an arbitrary TCP port
§ Exec
– Executes a UNIX program and reads events from standard output *
§ Spooldir
– Extracts events from files appearing in a specified (local) directory
§ HTTP Source
– Retrieves events from HTTP requests
§ Kafka
– Retrieves events by consuming messages from a Kafka topic
* Asynchronous sources do not guarantee that events will be delivered
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-16
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-17
Some Interesting Built-In Flume Sinks

§ Null
– Discards all events (Flume equivalent of /dev/null)
§ Logger
– Logs event to INFO level using SLF4J*
§ IRC
– Sends event to a specified Internet Relay Chat channel
§ HDFS
– Writes event to a file in the specified directory in HDFS
§ Kafka
– Sends event as a message to a Kafka topic
§ HBaseSink
– Stores event in HBase
*SLF4J: Simple Logging Façade for Java
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-18
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-19
Built-In Flume Channels

§ Memory
– Stores events in the machine’s RAM
– Extremely fast, but not reliable (memory is volatile)
§ File
– Stores events on the machine’s local disk
– Slower than RAM, but more reliable (data is written to disk)
§ Kafka
– Uses Kafka as a scalable, reliable, and highly available channel between
any source and sink type

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-20
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-21
Flume Agent Configuration File

§ Configure Flume agents through a Java properties file


– You can configure multiple agents in a single file
§ The configuration file uses hierarchical references
– Assign each component a user-defined ID
– Use that ID in the names of additional properties

# Define sources, sinks, and channel for agent named 'agent1'


agent1.sources = mysource
agent1.sinks = mysink
agent1.channels = mychannel

# Sets a property "foo" for the source associated with agent1


agent1.sources.mysource.foo = bar

# Sets a property "baz" for the sink associated with agent1


agent1.sinks.mysink.baz = bat

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-22
Example: Configuring Flume Components (1)

§ Example: Configure a Flume agent to collect data from remote spool directories and save to HDFS

[Diagram: agent1 consists of a spooldir source (src1) reading from /var/flume/incoming, a memory channel (ch1), and an HDFS sink (sink1) writing to /loudacre/logdata in HDFS on the Hadoop cluster]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-23
Example: Configuring Flume Components (2)

agent1.sources = src1
agent1.sinks = sink1
agent1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/flume/incoming
agent1.sources.src1.channels = ch1        # connects source and channel

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logdata
agent1.sinks.sink1.channel = ch1          # connects sink and channel

§ Properties vary by component type (source, channel, and sink)


– Properties also vary by subtype (such as netcat source, syslog source)
– See the Flume user guide for full details on configuration

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-24
Aside: HDFS Sink Configuration

§ Path may contain patterns based on event headers, such as timestamp


§ The HDFS sink writes uncompressed SequenceFiles by default
– Specifying a codec will enable compression

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logdata/%y-%m-%d
agent1.sinks.sink1.hdfs.codeC = snappy
agent1.sinks.sink1.channel = ch1

§ Setting fileType parameter to DataStream writes raw data


– Can also specify a file extension, if desired
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logdata/%y-%m-%d
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.fileSuffix = .txt
agent1.sinks.sink1.channel = ch1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-25
Starting a Flume Agent

§ Typical command line invocation


– The --name argument must match the agent’s name in the
configuration file
– Setting root logger as shown will display log messages in the terminal

$ flume-ng agent \
--conf /etc/flume-ng/conf \
--conf-file /path/to/flume.conf \
--name agent1 \
-Dflume.root.logger=INFO,console

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-26
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-27
Essential Points

§ Apache Flume is a high-performance system for data collection


– Scalable, extensible, and reliable
§ A Flume agent manages the sources, channels, and sinks
– Sources retrieve event data from its origin
– Channels buffer events between the source and sink
– Sinks send the event to its destination
§ The Flume agent is configured using a properties file
– Give each component a user-defined ID
– Use this ID to define properties of that component

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-28
Bibliography

The following offer more information on topics discussed in this chapter


§ Flume User Guide
– http://tiny.cloudera.com/adcc06a

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-29
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-30
Hands-On Exercise: Collect Web Server Logs with Apache Flume

§ In this exercise, you will


– Run an Apache Flume agent to ingest web server log data into HDFS
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-31
Integrating Apache Flume and Apache
Kafka
Chapter 17
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-2
Integrating Apache Flume and Apache Kafka

In this chapter you will learn


§ What to consider when choosing between Apache Flume and Apache
Kafka for a use case
§ How Flume and Kafka can work together
§ How to configure a Kafka channel, sink, or source in Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-3
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-4
Should I Use Kafka or Flume?

§ Both Flume and Kafka are widely used for data ingest
– Although these tools differ, their functionality has some overlap
– Some use cases could be implemented with either Flume or Kafka
§ How do you determine which is a better choice for your use case?

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-5
Characteristics of Flume

§ Flume is efficient at moving data from a single source into Hadoop


– Offers sinks that write to HDFS, an HBase or Kudu table, or a Solr index
– Easily configured to support common scenarios, without writing code
– Can also process and transform data during the ingest process

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-6
Characteristics of Kafka

§ Kafka is a publish-subscribe messaging system


– Offers more flexibility than Flume for connecting multiple systems
– Provides better durability and fault tolerance than Flume
– Often requires writing code for producers and/or consumers
– Has no direct support for processing messages or loading into Hadoop

Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-7
Flafka = Flume + Kafka

§ Both systems have strengths and limitations


§ You do not necessarily have to choose between them
– You can use both when implementing your use case
§ Flafka is the informal name for Flume-Kafka integration
– It uses a Flume agent to receive messages from or send messages to
Kafka
§ It is implemented as a Kafka source, channel, and sink for Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-8
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-9
Using a Flume Kafka Sink as a Producer

§ By using a Kafka sink, Flume can publish messages to a topic


§ In this example, an application uses Flume to publish application events
– The application sends data to the Flume source when events occur
– The event data is buffered in the channel until it is taken by the sink
– The Kafka sink publishes messages to a specified topic
– Any Kafka consumer can then read messages from the topic for
application events

[Diagram: an application sends messages to the Flume agent's netcat source; events flow through a memory channel to a Kafka sink, which publishes them to a topic on the Kafka cluster's brokers, where a Kafka consumer reads them]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-10
Using a Flume Kafka Source as a Consumer

§ By using a Kafka source, Flume can read messages from a topic


– It can then write them to your destination of choice using a Flume sink
§ In this example, the producer sends messages to Kafka
– The Flume agent uses a Kafka source, which acts as a consumer
– The Kafka source reads messages in a specified topic
– The message data is buffered in the channel until it is taken by the sink
– The sink then writes the data into HDFS

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-11
Using a Flume Kafka Channel

§ A Kafka channel can be used with any Flume source or sink


– Provides a scalable, reliable, high-availability channel
§ In this example, a Kafka channel buffers events
– The application sends event data to the Flume source
– The channel publishes event data as messages on a Kafka topic
– The sink receives event data and stores it to HDFS

[Diagram: an application sends event data to the Flume agent's netcat source; a Kafka channel buffers the events as messages on a Kafka topic; an HDFS sink stores the data in the Hadoop cluster]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-12
Using a Kafka Channel as a Consumer (Sourceless Channel)

§ Kafka channels can also be used without a source


– It can then write events to your destination of choice using a Flume sink
§ In this example, the Producer sends messages to Kafka brokers
– The Flume agent uses a Kafka channel, which acts as a consumer
– The Kafka channel reads messages in a specified topic
– Channel passes messages to the sink, which writes the data into HDFS

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-13
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-14
Configuring Flume with a Kafka Source

§ The table below describes some key properties of the Kafka source

Name Description
type org.apache.flume.source.kafka.KafkaSource
zookeeperConnect ZooKeeper connection string (example: zkhost:2181)
topic Name of Kafka topic from which messages will be read
groupId Unique ID to use for the consumer group (default: flume)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-15
Example: Configuring Flume with a Kafka Source (1)

§ This is the Flume configuration for the example on the previous slide
– It defines a source for reading messages from a Kafka topic

# Define names for the source, channel, and sink


agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# Define a Kafka source that reads from the calls_placed topic


# The "type" property line wraps around due to its long value
agent1.sources.source1.type =
org.apache.flume.source.kafka.KafkaSource
agent1.sources.source1.zookeeperConnect = localhost:2181
agent1.sources.source1.topic = calls_placed
agent1.sources.source1.channels = channel1

Note: file continues on next slide

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-16
Example: Configuring Flume with a Kafka Source (2)

§ The remaining portion of the file configures the channel and sink

# Define the properties of our channel


agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 1000

# Define the sink that writes call data to HDFS


agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/training/calls_placed
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.fileSuffix = .csv
agent1.sinks.sink1.channel = channel1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-17
Configuring Flume with a Kafka Sink

§ The table below describes some key properties of the Kafka sink

Name Description
type Must be set to org.apache.flume.sink.kafka.KafkaSink
brokerList Comma-separated list of brokers (format host:port) to contact
topic The topic in Kafka to which the messages will be published
batchSize How many messages to process in one batch

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-18
Example: Configuring Flume with a Kafka Sink (1)

§ This is the Flume configuration for the example on the previous slide

# Define names for the source, channel, and sink


agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# Define the properties of the source, which receives event data


agent1.sources.source1.type = netcat
agent1.sources.source1.bind = localhost
agent1.sources.source1.port = 12345
agent1.sources.source1.channels = channel1

# Define the properties of the channel


agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 1000

Note: file continues on next slide

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-19
Example: Configuring Flume with a Kafka Sink (2)

§ The remaining portion of the configuration file sets up the Kafka sink

# Define the Kafka sink, which publishes to the app_event topic


agent1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.sink1.topic = app_events
agent1.sinks.sink1.brokerList = localhost:9092
agent1.sinks.sink1.batchSize = 20
agent1.sinks.sink1.channel = channel1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-20
Configuring Flume with a Kafka Channel

§ The table below describes some key properties of the Kafka channel

Name Description
type org.apache.flume.channel.kafka.KafkaChannel
zookeeperConnect ZooKeeper connection string (example: zkhost:2181)
brokerList Comma-separated list of brokers (format host:port) to contact
topic Name of Kafka topic from which messages will be read (optional, default=flume-channel)
parseAsFlumeEvent Set to false for sourceless configuration (optional, default=true)
readSmallestOffset Set to true to read from the beginning of the Kafka topic (optional, default=false)
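§ Putting these properties together, the following is a minimal sketch (not a tested recipe) of a Kafka channel placed between a netcat source and an HDFS sink; the agent and component names, topic name, HDFS path, and host:port values are illustrative assumptions only

# Sketch: Kafka channel between a netcat source and an HDFS sink
# (names, topic, path, and addresses below are assumptions; adjust for your environment)
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 12345
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.ch1.brokerList = localhost:9092
agent1.channels.ch1.zookeeperConnect = localhost:2181
agent1.channels.ch1.topic = flume-buffer

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/events
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1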

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-21
Example: Configuring a Sourceless Kafka Channel (1)

§ This is the Flume configuration for the sourceless Kafka channel example described earlier

# Define names for the channel and sink (a sourceless channel needs no source)

agent1.channels = channel1
agent1.sinks = sink1

Note: file continues on next slide

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-22
Example: Configuring a Sourceless Kafka Channel (2)

§ This is the Flume configuration for the example on the previous slide

# Define the properties of the Kafka channel


# which reads from the calls_placed topic
agent1.channels.channel1.type =
org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.channel1.topic = calls_placed
agent1.channels.channel1.brokerList = localhost:9092
agent1.channels.channel1.zookeeperConnect = localhost:2181
agent1.channels.channel1.parseAsFlumeEvent = false

# Define the sink that writes data to HDFS


agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path = /user/training/calls_placed
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.fileSuffix = .csv
agent1.sinks.sink1.channel = channel1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-23
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-24
Essential Points

§ Flume and Kafka are distinct systems with different designs


– You must weigh the advantages and disadvantages of each when
selecting the best tool for your use case
§ Flume and Kafka can be combined
– Flafka is the informal name for Flume components integrated with Kafka
– You can read messages from a topic using a Kafka channel or Kafka
source
– You can publish messages to a topic using a Kafka sink
– A Kafka channel provides a reliable, high-availability alternative to a
memory or file channel

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-25
Bibliography

The following offer more information on topics discussed in this chapter


§ Cloudera documentation on using Flume with Kafka
– http://tiny.cloudera.com/flafkadoc
§ Flafka: Apache Flume Meets Apache Kafka for Event Processing
– http://tiny.cloudera.com/kmc02a
§ Designing Fraud-Detection Architecture That Works Like Your Brain Does
– http://tiny.cloudera.com/kmc02b

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-26
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume
to Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-27
Hands-On Exercise: Send Web Server Log Messages from Apache
Flume to Apache Kafka
§ In this exercise, you will
– Configure a Flume agent using a Kafka sink to produce Kafka messages
from data that was received by a Flume source
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-28
Apache Spark Streaming: Introduction
to DStreams
Chapter 18
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-2
Apache Spark Streaming: Introduction to DStreams

In this chapter you will learn


§ The features and typical use cases for Apache Spark Streaming
§ How to write Spark Streaming applications
§ How to create and operate on file and socket-based DStreams

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-3
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-4
What Is Spark Streaming?

§ An extension of core Spark


§ Provides real-time processing of stream data
§ Versions 1.3 and later support Java, Scala, and Python
– Prior versions did not support Python

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-5
Why Spark Streaming?

§ Many big data applications need to process large data streams in real time,
such as
– Continuous ETL
– Website monitoring
– Fraud detection
– Ad monetization
– Social media analysis
– Financial market trends

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-6
Spark Streaming Features

§ Second-scale latencies
§ Scalability and efficient fault tolerance
§ “Once and only once” processing
§ Integrates batch and real-time processing
§ Easy to develop
– Uses Spark’s high-level API

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-7
Spark Streaming Overview

§ Divide up data stream into batches of n seconds


– Called a DStream (Discretized Stream)
§ Process each batch in Spark as an RDD
§ Return results of RDD operations in batches
[Diagram: a live data stream is divided by Spark Streaming into a DStream (RDDs containing batches of n seconds of data), which are then processed by Spark]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-8
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-9
Example: Streaming Request Count (Scala Overview)
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-10
Example: Configuring StreamingContext
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ A StreamingContext is the main entry point for Spark Streaming apps
§ Equivalent to SparkContext in core Spark
§ Configured with the same parameters as a SparkContext, plus a batch duration—an instance of Milliseconds, Seconds, or Minutes
§ Named ssc by convention

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-11
Streaming Example: Creating a DStream
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ Get a DStream (“Discretized Stream”) from a streaming data source, for example, text from a socket

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-12
Streaming Example: DStream Transformations
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ DStream operations are applied to each batch RDD in the stream
§ Similar to RDD operations—filter, map, reduce, join, and so on

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-13
Streaming Example: DStream Result Output
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ Print out the first 10 elements of each RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-14
Streaming Example: Starting the Streams
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ start: starts the execution of all DStreams
§ awaitTermination: waits for all background threads to complete before ending the main thread

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-15
Streaming Example: Python versus Scala
Language: Python
if __name__ == "__main__":

    sc = SparkContext()
    ssc = StreamingContext(sc,2)

    mystream = ssc.socketTextStream(hostname, port)
    userreqs = mystream \
        .map(lambda line: (line.split(' ')[2],1)) \
        .reduceByKey(lambda v1,v2: v1+v2)

    userreqs.pprint()

    ssc.start()
    ssc.awaitTermination()

§ Differences from Scala: the batch duration is given as a number of seconds rather than a Duration (Scala: val ssc = new StreamingContext(sc,Seconds(2))), and results are printed with pprint() rather than print()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-16
Streaming Example: Streaming Request Count (Recap)
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-17
DStreams

§ A DStream is a sequence of RDDs representing a data stream

[Diagram: live data arriving over time is divided at intervals t1, t2, t3, … into batches; each batch becomes an RDD (RDD @ t1, RDD @ t2, RDD @ t3, …), and the sequence of these RDDs is the DStream]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-18
Streaming Example Output (1)

-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...

§ Output starts 2 seconds after ssc.start (time interval t1)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-19
Streaming Example Output (2)

-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
-------------------------------------------
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)
...

§ t2: 2 seconds later…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-20
Streaming Example Output (3)

-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
-------------------------------------------
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)
...
-------------------------------------------
Time: 1401219549000 ms
-------------------------------------------
(44390,2)
(48712,2)
(165,2)
(465,2)
(120,2)
...

§ t3: 2 seconds later… continues until termination

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-21
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-22
DStream Data Sources

§ DStreams are defined for a given input stream (such as a Unix socket)
– Created by the Streaming context
ssc.socketTextStream(hostname, port)
– Similar to how RDDs are created by the Spark context
§ Out-of-the-box data sources
– Network
– Sockets
– Services such as Flume, Akka Actors, Kafka, ZeroMQ, or Twitter
– Files
– Monitors an HDFS directory for new content (see the sketch below)
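§ As a minimal sketch of the file-based case (assuming ssc is a StreamingContext as in the earlier examples, and that the directory shown is an illustrative path only), textFileStream returns a DStream of the lines of any new text files that appear in the monitored directory

Language: Scala
// Sketch: monitor an HDFS directory for new text files
// The path /loudacre/incoming is an assumption, not a required location
val fileLines = ssc.textFileStream("/loudacre/incoming")
fileLines.print()   // print the first 10 lines of each batch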

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-23
DStream Operations

§ DStream operations are applied to every RDD in the stream


– Executed once per duration
§ Two types of DStream operations
– Transformations
– Create a new DStream from an existing one
– Output operations
– Write data (for example, to a file system, database, or console)
– Similar to RDD actions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-24
DStream Transformations (1)

§ Many RDD transformations are also available on DStreams


– Regular transformations such as map, flatMap, filter
– Pair transformations such as reduceByKey, groupByKey, join
§ What if you want to do something else?
– transform(function)
– Creates a new DStream by executing function on RDDs in the
current DStream
Language: Scala
val distinctDS =
myDS.transform(rdd => rdd.distinct())

Language: Python
distinctDS =
myDS.transform(lambda rdd: rdd.distinct())

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-25
DStream Transformations (2)

Language: Scala

[Diagram: each batch RDD of the logs DStream contains lines of log data]

userreqs = logs.map(line => (line.split(' ')(2),1))

[Each batch RDD of the userreqs DStream contains (userID,1) pairs, such as (user002,1), (user011,1), (user012,1), …]

reqcounts = userreqs.reduceByKey((x,y) => x+y)

[Each batch RDD of the reqcounts DStream contains (userID,count) pairs, such as (user002,5), (user710,9), (user022,4), …]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-26
DStream Output Operations

§ Console output
– print (Scala) / pprint (Python) prints out the first 10 elements of
each RDD
– Optionally pass an integer to print another number of elements
§ File output
– saveAsTextFiles saves data as text
– saveAsObjectFiles saves as serialized object files (SequenceFiles)
§ Executing other functions
– foreachRDD(function) performs a function on each RDD in the DStream (see the sketch below)
– Function input parameters
– The RDD on which to perform the function
– The time stamp of the RDD (optional)
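§ For example, a minimal sketch (assuming userreqs is the pair DStream from the earlier example) showing foreachRDD with the optional time stamp parameter

Language: Scala
// Sketch: report how many (userID,count) records each batch RDD contains
userreqs.foreachRDD((rdd, time) => {
  println(s"Batch at $time contained ${rdd.count()} records")
})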

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-27
Saving DStream Results as Files
Language: Scala
val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((v1,v2) => v1+v2)
userreqs.print()
userreqs.saveAsTextFiles("…/outdir/reqcounts")

[Diagram: each batch RDD of the userreqs DStream, containing (userID,count) pairs such as (user002,5), (user710,9), (user808,8), is saved to its own HDFS directory named reqcounts-<timestamp>/, each containing part files (part-00000, …) with the pairs for that batch]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-28
Scala Example: Find Top Users (1)


val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((v1,v2) => v1+v2)
userreqs.saveAsTextFiles(path)

val sortedreqs = userreqs


.map(pair => pair.swap)
.transform(rdd => rdd.sortByKey(false))

sortedreqs.foreachRDD((rdd,time) => {
println("Top users @ " + time)
rdd.take(5).foreach(
pair => printf("User: %s (%s)\n",pair._2, pair._1))
}
)

ssc.start()
ssc.awaitTermination()

§ Transform each RDD: swap userID/count, sort by count

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-29
Scala Example: Find Top Users (2)


val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((v1,v2) => v1+v2)
userreqs.saveAsTextFiles(path)

val sortedreqs = userreqs


.map(pair => pair.swap)
.transform(rdd => rdd.sortByKey(false))

sortedreqs.foreachRDD((rdd,time) => {
println("Top users @ " + time)
rdd.take(5).foreach(
pair => printf("User: %s (%s)\n",pair._2, pair._1))
}
)

ssc.start()
ssc.awaitTermination()

§ Print out the top 5 users as “User: userID (count)”

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-30
Python Example: Find Top Users (1)

def printTop5(r,t):
    print "Top users @",t
    for count,user in r.take(5):
        print "User:",user,"("+str(count)+")"

userreqs = mystream \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
userreqs.saveAsTextFiles("streamreq/reqcounts")

sortedreqs=userreqs \
.map(lambda (k,v): (v,k)) \
.transform(lambda rdd: rdd.sortByKey(False))

sortedreqs.foreachRDD(lambda time,rdd: printTop5(rdd,time))

ssc.start()
ssc.awaitTermination()

§ Transform each RDD: swap userID/count, sort by count

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-31
Python Example: Find Top Users (2)

def printTop5(r,t):
    print "Top users @",t
    for count,user in r.take(5):
        print "User:",user,"("+str(count)+")"

userreqs = mystream \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
userreqs.saveAsTextFiles("streamreq/reqcounts")

sortedreqs=userreqs \
.map(lambda (k,v): (v,k)) \
.transform(lambda rdd: rdd.sortByKey(False))

sortedreqs.foreachRDD(lambda time,rdd: printTop5(rdd,time))

ssc.start()
ssc.awaitTermination()

§ Print out the top 5 users as “User: userID (count)”

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-32
Example: Find Top Users—Output (1)

Top users @ 1401219545000 ms
User: 16261 (8)
User: 22232 (7)
User: 66652 (4)
User: 21205 (2)
User: 24358 (2)

§ t1 (2 seconds after ssc.start)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-33
Example: Find Top Users—Output (2)

Top users @ 1401219545000 ms
User: 16261 (8)
User: 22232 (7)
User: 66652 (4)
User: 21205 (2)
User: 24358 (2)
Top users @ 1401219547000 ms
User: 53667 (4)
User: 35600 (4)
User: 62 (2)
User: 165 (2)
User: 40 (2)

§ t2 (2 seconds later)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-34
Example: Find Top Users—Output (3)

Top users @ 1401219545000 ms
User: 16261 (8)
User: 22232 (7)
User: 66652 (4)
User: 21205 (2)
User: 24358 (2)
Top users @ 1401219547000 ms
User: 53667 (4)
User: 35600 (4)
User: 62 (2)
User: 165 (2)
User: 40 (2)
Top users @ 1401219549000 ms
User: 31 (12)
User: 6734 (10)
User: 14986 (10)
User: 72760 (2)
User: 65335 (2)
Top users @ 1401219551000 ms

§ t3 (2 seconds later), continues until termination…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-35
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-36
Building and Running Spark Streaming Applications

§ Building Spark Streaming applications


– Link with the main Spark Streaming library (included with Spark)
– Link with additional Spark Streaming libraries if necessary, for example, Kafka, Flume, Twitter (see the sketch below)
§ Running Spark Streaming applications
– Use at least two threads if running locally
– Adding operations after the Streaming context has been started is
unsupported
– Stopping and restarting the Streaming context is unsupported
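§ As a hedged sketch of what the linking step might look like in an sbt build file; the artifact names and versions below are assumptions and must be matched to your Spark and Scala versions

// build.sbt (sketch): versions are assumptions, match them to your cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.6.0" % "provided",
  // only needed if you use the Kafka or Flume integrations
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0",
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.0"
)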

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-37
Using Spark Streaming with Spark Shell

§ Spark Streaming is designed for batch applications, not interactive use


§ The Spark shell can be used for limited testing
– Not intended for production use!
– Be sure to run the shell on a cluster with at least 2 cores, or locally with
at least 2 threads

$ spark-shell --master yarn

$ pyspark --master yarn

or

$ spark-shell --master 'local[2]'

$ pyspark --master 'local[2]'

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-38
The Spark Streaming Application UI

The Streaming tab in the Spark App UI provides basic metrics about the application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-39
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-40
Essential Points

§ Spark Streaming is an extension of core Spark to process real-time


streaming data
§ DStreams are discretized streams of streaming data, batched into RDDs by
time intervals
– Operations applied to DStreams are applied to each RDD
– Transformations produce new DStreams by applying a function to each
RDD in the base DStream

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-41
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-42
Hands-On Exercise: Write an Apache Spark Streaming
Application
§ In this exercise, you will
– Write a Spark Streaming application to process web log data
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-43
Apache Spark Streaming: Processing
Multiple Batches
Chapter 19
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-2
Apache Spark Streaming: Processing Multiple Batches

In this chapter you will learn


§ How to return data from a specific time period in a DStream
§ How to perform analysis using sliding window operations on a DStream
§ How to maintain state values across all time periods in a DStream

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-3
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-4
Multi-Batch DStream Operations

§ DStreams consist of a series of “batches” of data


– Each batch is an RDD
§ Basic DStream operations analyze each batch individually
§ Advanced operations allow you to analyze data collected across batches
– Slice: allows you to operate on a collection of batches
– State: allows you to perform cumulative operations
– Windows: allows you to aggregate data across a sliding time period

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-5
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-6
Time Slicing

§ DStream.slice(fromTime, toTime)
– Returns a collection of batch RDDs based on data from the stream
§ StreamingContext.remember(duration)
– By default, input data is automatically cleared when no RDD’s lineage depends on it
– slice will return no data for time periods whose data has already been cleared
– Use remember to keep data around longer (see the sketch below)
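§ A minimal sketch of how these calls fit together, assuming ssc and mystream are defined as in the earlier examples, the relevant org.apache.spark.streaming classes (Time, Minutes) are imported, and the millisecond timestamps are purely illustrative

Language: Scala
// Sketch: keep the last 10 minutes of batch RDDs so they remain available to slice
ssc.remember(Minutes(10))

// Later, after ssc.start(), retrieve the batch RDDs for a given time range
val from = Time(1401219545000L)
val to   = Time(1401219549000L)
val batchRDDs = mystream.slice(from, to)      // Seq[RDD[String]]
val totalLines = batchRDDs.map(_.count()).sum
println("Lines received in range: " + totalLines)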

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-7
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-8
State DStreams (1)

§ Use the updateStateByKey function to create a state DStream


§ Example: Total request count by User ID

                      t1
Requests              (user001,5)
                      (user102,1)
                      (user009,2)

Total Requests        (user001,5)
(State)               (user102,1)
                      (user009,2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-9
State DStreams (2)

§ Use the updateStateByKey function to create a state DStream


§ Example: Total request count by User ID

                      t1               t2
Requests              (user001,5)      (user001,4)
                      (user102,1)      (user012,2)
                      (user009,2)      (user921,5)

Total Requests        (user001,5)      (user001,9)
(State)               (user102,1)      (user102,1)
                      (user009,2)      (user009,2)
                                       (user012,2)
                                       (user921,5)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-10
State DStreams (3)

§ Use the updateStateByKey function to create a state DStream


§ Example: Total request count by User ID

                      t1               t2               t3
Requests              (user001,5)      (user001,4)      (user102,7)
                      (user102,1)      (user012,2)      (user012,3)
                      (user009,2)      (user921,5)      (user660,4)

Total Requests        (user001,5)      (user001,9)      (user001,9)
(State)               (user102,1)      (user102,1)      (user102,8)
                      (user009,2)      (user009,2)      (user009,2)
                                       (user012,2)      (user012,5)
                                       (user921,5)      (user921,5)
                                                        (user660,4)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-11
Python Example: Total User Request Count (1)
Language: Python

userreqs = logs \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)

ssc.checkpoint("checkpoints")

totalUserreqs = userreqs \
.updateStateByKey(lambda newCounts, state: \
updateCount(newCounts, state))
totalUserreqs.pprint()

ssc.start()
ssc.awaitTermination()

§ Set checkpoint directory to enable checkpointing. Required to prevent infinite lineages.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-12
Python Example: Total User Request Count (2)
Language: Python

userreqs = logs \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)

ssc.checkpoint("checkpoints")

totalUserreqs = userreqs \
.updateStateByKey(lambda newCounts, state: \
updateCount(newCounts, state))
totalUserreqs.pprint()

ssc.start()
ssc.awaitTermination()

§ Compute a state DStream based on the previous states, updated with the values from the current batch of request counts
§ The updateCount function is shown on the next slide…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-13
Python Example: Total User Request Count—Update Function

Language: Python
def updateCount(newCounts, state):
    if state == None: return sum(newCounts)
    else: return state + sum(newCounts)

§ The function receives the new values for a key and its current state (or None), and returns the new state

§ Example at t2
– user001: updateCount([4],5) → 9
– user012: updateCount([2],None) → 2
– user921: updateCount([5],None) → 5

                      t1               t2
Requests              (user001,5)      (user001,4)
                      (user102,1)      (user012,2)
                      (user009,2)      (user921,5)

Total Requests        (user001,5)      (user001,9)
(State)               (user102,1)      (user102,1)
                      (user009,2)      (user009,2)
                                       (user012,2)
                                       (user921,5)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-14
Scala Example: Total User Request Count (1)
Language: Scala

val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

ssc.checkpoint("checkpoints")
val totalUserreqs = userreqs.updateStateByKey(updateCount)
totalUserreqs.print()

ssc.start()
ssc.awaitTermination()

§ Set checkpoint directory to enable checkpointing. Required to prevent infinite lineages.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-15
Scala Example: Total User Request Count (2)
Language: Scala

val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

ssc.checkpoint("checkpoints")
val totalUserreqs = userreqs.updateStateByKey(updateCount)
totalUserreqs.print()

ssc.start()
ssc.awaitTermination()

§ Compute a state DStream based on the previous states, updated with the values from the current batch of request counts
§ The updateCount function is shown on the next slide…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-16
Scala Example: Total User Request Count—Update Function

Language: Scala
def updateCount = (newCounts: Seq[Int], state: Option[Int]) => {
  val newCount = newCounts.foldLeft(0)(_ + _)
  val previousCount = state.getOrElse(0)
  Some(newCount + previousCount)
}

§ The function receives the new values for a key and its current state (or None), and returns the new state

§ Example at t2
– user001: updateCount([4],Some(5)) → 9
– user012: updateCount([2],None) → 2
– user921: updateCount([5],None) → 5

                      t1               t2
Requests              (user001,5)      (user001,4)
                      (user102,1)      (user012,2)
                      (user009,2)      (user921,5)

Total Requests        (user001,5)      (user001,9)
(State)               (user102,1)      (user102,1)
                      (user009,2)      (user009,2)
                                       (user012,2)
                                       (user921,5)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-17
Example: Maintaining State—Output

-------------------------------------------
Time: 1401219545000 ms    (t1)
-------------------------------------------
(user001,5)
(user102,1)
(user009,2)
-------------------------------------------
Time: 1401219547000 ms    (t2)
-------------------------------------------
(user001,9)
(user102,1)
(user009,2)
(user012,2)
(user921,5)
-------------------------------------------
Time: 1401219549000 ms    (t3)
-------------------------------------------
(user001,9)
(user102,8)
(user009,2)
(user012,5)
(user921,5)
(user660,4)
-------------------------------------------
Time: 1401219551000 ms
-------------------------------------------

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-18
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-19
Sliding Window Operations (1)

§ Regular DStream operations execute for each RDD based on SSC duration
§ “Window” operations span RDDs over a given duration
– For example, reduceByKeyAndWindow, countByWindow

[Diagram: each box in the regular DStream represents one SSC batch duration (such as 2 seconds); reduceByKeyAndWindow(fn, window-duration) produces a window DStream in which each RDD covers a whole window duration of batches]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-20
Sliding Window Operations (2)

§ By default, window operations will execute with an “interval” the same as the SSC duration
– For a two-second batch duration, the window will “slide” every two seconds

[Diagram: with a batch size of Seconds(2), reduceByKeyAndWindow(fn, Seconds(12)) produces a window DStream whose RDDs each cover the last 12 seconds of batches, sliding forward every 2 seconds]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-21
Sliding Window Operations (3)

§ You can specify a different slide duration (must be a multiple of the SSC duration)

[Diagram: with a batch size of Seconds(2), reduceByKeyAndWindow(fn, Seconds(12), Seconds(4)) produces a window DStream whose RDDs each cover the last 12 seconds of batches, sliding forward every 4 seconds]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-22
Scala Example: Count and Sort User Requests by Window (1)
Language: Scala

val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)

val reqcountsByWindow = logs.
map(line => (line.split(' ')(2),1)).
reduceByKeyAndWindow((v1: Int, v2: Int) => v1+v2,
Minutes(5),Seconds(30))

val topreqsByWindow=reqcountsByWindow.
map(pair => pair.swap).
transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()

ssc.start()
ssc.awaitTermination()

§ Every 30 seconds, count requests by user over the last five minutes

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-23
Scala Example: Count and Sort User Requests by Window (2)
Language: Scala

val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)

val reqcountsByWindow = logs.
map(line => (line.split(' ')(2),1)).
reduceByKeyAndWindow((v1: Int, v2: Int) => v1+v2,
Minutes(5),Seconds(30))

val topreqsByWindow=reqcountsByWindow.
map(pair => pair.swap).
transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()

ssc.start()
ssc.awaitTermination()

§ Sort and print the top users for every RDD (every 30 seconds)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-24
Python Example: Count and Sort User Requests by Window (1)
Language: Python

sc = SparkContext()
ssc = StreamingContext(sc, 2)
logs = ssc.socketTextStream(hostname, port)

reqcountsByWindow = logs. \
map(lambda line: (line.split(' ')[2],1)).\
reduceByKeyAndWindow(lambda v1,v2: v1+v2, None, 5*60, 30)

topreqsByWindow=reqcountsByWindow. \
map(lambda (k,v): (v,k)). \
transform(lambda rdd: rdd.sortByKey(False))
topreqsByWindow.map(lambda (k,v): (v,k)).pprint()

ssc.start()
ssc.awaitTermination()

§ Every 30 seconds, count requests by user over the last five minutes

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-25
Python Example: Count and Sort User Requests by Window (2)
Language: Python

sc = SparkContext()
ssc = StreamingContext(sc, 2)
logs = ssc.socketTextStream(hostname, port)

reqcountsByWindow = logs. \
map(lambda line: (line.split(' ')[2],1)).\
reduceByKeyAndWindow(lambda v1,v2: v1+v2, None, 5*60, 30)

topreqsByWindow=reqcountsByWindow. \
map(lambda (k,v):(v,k)). \
transform(lambda rdd: rdd.sortByKey(False))
topreqsByWindow.map(lambda (k,v):(v,k)).pprint()

ssc.start()
ssc.awaitTermination()

§ Sort and print the top users for every RDD (every 30 seconds)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-26
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-27
Essential Points

§ You can get a “slice” of data from a stream based on absolute start and
end times
– For example, all data received between midnight October 1, 2016 and
midnight October 2, 2016
§ You can update state based on prior state
– For example, total requests by user
§ You can perform operations on “windows” of data
– For example, number of logins in the last hour

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-28
Chapter Topics

Apache Spark Streaming: Processing Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-29
Hands-On Exercise: Process Multiple Batches With Apache Spark
Streaming
§ In this exercise, you will
– Extend an Apache Spark Streaming application to perform multi-batch
analysis on web log data
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-30
Apache Spark Streaming: Data Sources
Chapter 20
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-2
Apache Spark Streaming: Data Sources

In this chapter you will learn


§ How data sources are integrated with Spark Streaming
§ How receiver-based integration differs from direct integration
§ How Apache Flume and Apache Kafka are integrated with Spark Streaming
§ How to use direct Kafka integration to create a DStream

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-3
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-4
Spark Streaming Data Sources

§ Basic data sources


– Network socket
– Text file
§ Advanced data sources
– Kafka
– Flume
– Twitter
– ZeroMQ
– Kinesis
– MQTT
– and more coming in the future…
§ To use advanced data sources, download (if necessary) and link to the
required library
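For example, with an sbt build the Kafka and Flume integration libraries
might be linked as shown below (artifact names follow the Spark 1.x
convention; the version shown is an assumption and should match your Spark
version).
Language: Scala

// Hypothetical build.sbt fragment; the version number is a placeholder
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0",
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.0"
)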

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-5
Receiver-Based Data Sources

§ Most data sources are based on receivers


– Network data is received on a worker node
– Receiver distributes data (RDDs) to the cluster as partitions

[Diagram: a DStream is a sequence of RDDs, one per batch interval (t1, t2, t3,
t4); a receiver running in one executor reads from the network data source and
distributes each RDD's partitions (rdd_0_0, rdd_0_1) across the executors]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-6
Receiver-Based Replication

§ Spark Streaming RDD replication is enabled by default


– Data is copied to another node as it is received (see the sketch below)

[Diagram: the executor running the receiver holds partitions rdd_0_0 and
rdd_0_1; each partition is also replicated to another executor as it is
received from the network data source]
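The replication comes from the receiver's storage level: most receiver-based
sources default to a replicated ("_2") storage level, and it can also be set
explicitly. A minimal sketch, assuming ssc is an existing StreamingContext
and hostname and port are placeholders.
Language: Scala

import org.apache.spark.storage.StorageLevel

// Sketch only: `ssc`, `hostname`, and `port` are assumed to exist already.
// The "_2" suffix means each received block is stored on two executors,
// which provides the replication illustrated above.
val logs = ssc.socketTextStream(hostname, port,
  StorageLevel.MEMORY_AND_DISK_SER_2)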

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-7
Receiver-Based Fault Tolerance

§ If the receiver fails, Spark will restart it on a different executor


– Potential for brief loss of incoming data (see the write-ahead log sketch below)

[Diagram: the executor hosting the receiver fails; Spark restarts the receiver
on a different executor, and the replicated partitions (rdd_0_0, rdd_0_1) on
the surviving executors remain available]
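One common way to reduce that risk is to enable the receiver write-ahead log.
This is a sketch under the assumption that a reliable checkpoint directory
(such as one on HDFS) is available; the path shown is a placeholder.
Language: Scala

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: with the write-ahead log enabled, received data is journaled
// to the checkpoint directory before it is acknowledged, so data that has
// been received but not yet processed is not lost if the receiver fails.
val conf = new SparkConf().
  set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs:///user/training/checkpoints")  // placeholder path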

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-8
Managing Incoming Data

§ As data arrives, Spark Streaming queues jobs to process it


§ Data must be processed fast enough that the job queue does not grow
– Manage by setting spark.streaming.backpressure.enabled = true (see the
  sketch below)
§ Monitor scheduling delay and processing time in Spark UI
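For example (a sketch, not taken from the course code), backpressure is
enabled through SparkConf before the StreamingContext is created.
Language: Scala

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: with backpressure enabled, Spark Streaming dynamically limits
// the rate at which receivers deliver data, so the job queue does not grow
// without bound.
val conf = new SparkConf().
  set("spark.streaming.backpressure.enabled", "true")
val ssc = new StreamingContext(conf, Seconds(2))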

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-9
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-10
Overview: Spark Streaming with Flume

§ Two approaches to using Flume


– Push-based
– Pull-based
§ Push-based
– One Spark worker must run a network receiver on a specified node
– Configure Flume with an Avro sink to send to that receiver
§ Pull-based
– Uses a custom Spark sink for Flume (in the org.apache.spark.streaming.flume.sink package)
– Strong reliability and fault tolerance guarantees
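The two approaches correspond to two different FlumeUtils calls. This is a
minimal sketch, assuming the spark-streaming-flume library is linked and
that host and port are placeholders for wherever the receiver (push) or the
custom sink (pull) runs.
Language: Scala

import org.apache.spark.streaming.flume.FlumeUtils

// Push-based: Spark runs an Avro receiver on host:port, and Flume is
// configured with an Avro sink that sends events to it
val pushedEvents = FlumeUtils.createStream(ssc, host, port)

// Pull-based: Flume writes to the custom SparkSink on host:port, and Spark
// polls that sink for events
val pulledEvents = FlumeUtils.createPollingStream(ssc, host, port)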

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-11
Overview: Spark Streaming with Kafka

§ Apache Kafka is a fast, scalable, distributed publish-subscribe messaging
system that provides
– Durability by persisting data to disk
– Fault tolerance through replication
§ Two approaches to Spark Streaming with Kafka
– Receiver-based
– Direct (receiverless)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-12
Receiver-Based Kafka Integration (1)

§ Receiver-based
– Streams (receivers) are configured with a Kafka topic and a partition in
that topic
– To protect from data loss, enable write ahead logs (introduced in Spark
1.2)
– Scala and Java support added in Spark 1.1
– Python support added in Spark 1.3
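A minimal sketch of the receiver-based approach (the ZooKeeper quorum,
consumer group, and topic map are placeholder values); compare it with the
direct approach shown later in this chapter.
Language: Scala

import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch only: receiver-based integration connects through ZooKeeper;
// "zkhost:2181", "mygroup", and the topic map are placeholder values
val kafkaStream = KafkaUtils.createStream(ssc,
  "zkhost:2181",        // ZooKeeper quorum
  "mygroup",            // consumer group ID
  Map("mytopic" -> 1))  // topic name -> number of receiver threads

// Messages arrive as (key, value) pairs; keep just the value
val lines = kafkaStream.map(pair => pair._2)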

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-13
Receiver-Based Kafka Integration (2)

§ Receiver-based
– Kafka supports partitioning of message topics for scalability
– Receiver-based streaming allows multiple receivers, each configured for
individual topic partitions

[Diagram: two receivers, each assigned one Kafka topic partition (partition 0
and partition 1), produce separate DStreams (RDD 0 and RDD 1); the resulting
RDD partitions (rdd_0_0, rdd_1_0, rdd_1_1) are distributed across executors,
with the data coming from the Kafka brokers]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-14
Kafka Direct Integration (1)

§ Direct (also called receiverless)


– Support for efficient, zero-data-loss delivery
– Support for exactly-once semantics
– Introduced in Spark 1.3 (Scala and Java only)
– Python support in Spark 1.5

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-15
Kafka Direct Integration (2)

§ Direct (also called receiverless)


– Consumes messages in parallel
– Automatically assigns each topic partition to an RDD partition

[Diagram: with the direct approach there are no receivers; each Kafka topic
partition (partition 0, partition 1) maps to one RDD partition (rdd_0_0,
rdd_0_1), and the executors consume directly from the Kafka brokers]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-16
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-17
Scala Example: Direct Kafka Integration (1)

import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder

object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-18
Scala Example: Direct Kafka Integration (2)

val kafkaStream = KafkaUtils.
  createDirectStream[String,String,StringDecoder,StringDecoder](ssc,
    Map("metadata.broker.list" -> "broker1:port,broker2:port"),
    Set("mytopic"))

val logs = kafkaStream.map(pair => pair._2)

val userreqs = logs.
  map(line => (line.split(' ')(2),1)).
  reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-19
Python Example: Direct Kafka Integration (1)

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":

    sc = SparkContext()
    ssc = StreamingContext(sc,2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-20
Python Example: Direct Kafka Integration (2)

    kafkaStream = KafkaUtils.createDirectStream(ssc, ["mytopic"],
        {"metadata.broker.list": "broker1:port,broker2:port"})

    logs = kafkaStream.map(lambda (key,value): value)

    userreqs = logs \
        .map(lambda line: (line.split(' ')[2],1)) \
        .reduceByKey(lambda v1,v2: v1+v2)

    userreqs.pprint()

    ssc.start()
    ssc.awaitTermination()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-21
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-22
Essential Points

§ Spark Streaming integrates with a number of data sources


§ Most use a receiver-based integration
– Flume, for example
§ Kafka can be integrated using a receiver-based or a direct (receiverless)
approach
– The direct approach provides strong reliability guarantees more efficiently

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-23
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-24
Hands-On Exercise: Process Apache Kafka Messages with Apache
Spark Streaming
§ In this exercise, you will
– Write a Spark Streaming application to process web logs using a direct
Kafka data source
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-25
Conclusion
Chapter 21
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 21-2
Course Objectives

During this course, you have learned


§ How the Apache Hadoop ecosystem fits in with the data processing
lifecycle
§ How data is distributed, stored, and processed in a Hadoop cluster
§ How to write, configure, and deploy Apache Spark applications on a
Hadoop cluster
§ How to use the Spark shell and Spark applications to explore, process, and
analyze distributed data
§ How to process and query structured data using Spark SQL
§ How to use Spark Streaming to process a live data stream
§ How to use Apache Flume and Apache Kafka to ingest data for Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 21-3
Which Course to Take Next?

Cloudera offers a range of training courses for you and your team
§ For developers
– Designing and Building Big Data Applications
– Cloudera Training for Apache HBase
§ For system administrators
– Cloudera Administrator Training for Apache Hadoop
§ For data analysts and data scientists
– Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
– Data Science at Scale using Spark and Hadoop
§ For architects, managers, CIOs, and CTOs
– Cloudera Essentials for Apache Hadoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 21-4
