Cloudera Developer Training Slides


Developer Training for Apache Spark and Hadoop

201611
Introduction
Chapter 1
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-3
Trademark Information

§ The names and logos of Apache products mentioned in Cloudera training courses,
including those listed below, are trademarks of the Apache Software Foundation
– Apache Accumulo
– Apache Avro
– Apache Bigtop
– Apache Crunch
– Apache Flume
– Apache Hadoop
– Apache HBase
– Apache HCatalog
– Apache Hive
– Apache Impala (incubating)
– Apache Kafka
– Apache Kudu
– Apache Lucene
– Apache Mahout
– Apache Oozie
– Apache Parquet
– Apache Pig
– Apache Sentry
– Apache Solr
– Apache Spark
– Apache Sqoop
– Apache Tika
– Apache Whirr
– Apache ZooKeeper

§ All other product names, logos, and brands cited herein are the property of their
respective owners

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-4
Chapter Topics

Introduction

§ About This Course


§ About Cloudera
§ Course Logistics
§ Introductions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-5
Course Objectives

During this course, you will learn


§ How the Apache Hadoop ecosystem fits in with the data processing
lifecycle
§ How data is distributed, stored, and processed in a Hadoop cluster
§ How to write, configure, and deploy Apache Spark applications on a
Hadoop cluster
§ How to use the Spark shell and Spark applications to explore, process, and
analyze distributed data
§ How to process and query structured data using Spark SQL
§ How to use Spark Streaming to process a live data stream
§ How to use Apache Flume and Apache Kafka to ingest data for Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-6
Chapter Topics

Introduction

§ About This Course


§ About Cloudera
§ Course Logistics
§ Introductions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-7
About Cloudera (1)

§ The leader in Apache Hadoop-based software and services


§ Founded by Hadoop experts from Facebook, Yahoo, Google, and Oracle
§ Provides support, consulting, training, and certification for Hadoop users
§ Staff includes committers to virtually all Hadoop projects
§ Many authors of industry standard books on Apache Hadoop projects

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-8
About Cloudera (2)

§ Our customers include many key users of Hadoop


§ We offer several public training courses, such as
– Cloudera Administrator Training for Apache Hadoop
– Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
– Designing and Building Big Data Applications
– Data Science at Scale using Spark and Hadoop
– Cloudera Search Training
– Cloudera Training for Apache HBase
§ On-site and customized training is also available

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-9
CDH

CDH (Cloudera’s Distribution including Apache Hadoop)


§ 100% open source, enterprise-ready distribution of Hadoop and related projects
§ The most complete, tested, and widely deployed distribution of Hadoop
§ Integrates all the key Hadoop ecosystem projects
§ Available as RPMs and Ubuntu, Debian, or SuSE packages, or as a tarball

[Diagram: the CDH stack. Process/Analyze/Serve: Batch (Spark, Hive, Pig, MapReduce), Stream (Spark), SQL (Impala), Search (Solr). Unified Services: Resource Management (YARN), Security (Sentry, RecordService). Store: Filesystem (HDFS), Relational (Kudu), NoSQL (HBase). Integrate: Batch (Sqoop), Real-Time (Kafka, Flume)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-10
Cloudera Express

§ Cloudera Express
– Completely free to
download and use
§ The best way to get started
with Hadoop
§ Includes CDH
§ Includes Cloudera Manager
– End-to-end
administration for
Hadoop
– Deploy, manage, and
monitor your cluster

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-11
Cloudera Enterprise

§ Subscription product including CDH and Cloudera Manager


§ Provides advanced features, such as
– Operational and utilization reporting
– Configuration history and rollbacks
– Rolling updates & service restarts
– External authentication (LDAP/SAML)
– Automated backup and disaster recovery
§ Specific editions offer additional capabilities, such as
– Governance and data management (Cloudera Navigator)
– Active data optimization (Cloudera Navigator Optimizer)
– Comprehensive encryption (Cloudera Navigator Encrypt)
– Key management (Cloudera Navigator Key Trustee)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-12
Chapter Topics

Introduction

§ About This Course


§ About Cloudera
§ Course Logistics
§ Introductions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-13
Logistics

§ Class start and finish times


§ Lunch
§ Breaks
§ Restrooms
§ Wi-Fi access
§ Virtual machines

Your instructor will give you details on how to access the course materials
and exercise instructions for the class

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-14
Chapter Topics

Introduction

§ About This Course


§ About Cloudera
§ Course Logistics
§ Introductions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-15
Introductions

§ About your instructor


§ About you
– Where do you work? What do you do there?
– How much Hadoop and Spark experience do you have?
– What do you expect to gain from this course?
– Which language do you plan to use in the course: Python or Scala?

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-16
Introduction to Apache Hadoop and
the Hadoop Ecosystem
Chapter 2
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-2
Introduction to Apache Hadoop and the Hadoop Ecosystem

In this chapter you will learn


§ What Apache Hadoop is and what kind of use cases it is best suited for
§ How the major components of the Hadoop ecosystem fit together
§ How to get started with the Hands-On Exercises in this course

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-3
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-4
What Is Apache Hadoop?

§ Scalable and economical data storage, processing, and analysis
– Distributed and fault-tolerant
– Harnesses the power of industry standard hardware
§ Heavily inspired by technical documents published by Google

[Diagram: the Hadoop ecosystem stack. Process/Analyze/Serve: Batch (Spark, Hive, Pig, MapReduce), Stream (Spark), SQL (Impala), Search (Solr). Unified Services: Resource Management (YARN), Security (Sentry, RecordService). Store: Filesystem (HDFS), Relational (Kudu), NoSQL (HBase). Integrate: Batch (Sqoop), Real-Time (Kafka, Flume)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-5
Common Hadoop Use Cases

§ Extract, Transform, and Load (ETL)
§ Data analysis
§ Text mining
§ Index building
§ Graph creation and analysis
§ Pattern recognition
§ Data storage
§ Collaborative filtering
§ Prediction models
§ Sentiment analysis
§ Risk assessment

§ What do these workloads have in common? Nature of the data…


– Volume
– Velocity
– Variety

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-6
Distributed Processing with Hadoop

A Hadoop cluster combines three layers:
• Processing: Apache Spark, MapReduce
• Resource Management: YARN, Apache Mesos, Spark Standalone
• Storage: HDFS, Amazon S3, Apache Kudu

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-7
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-8
Data Ingest and Storage

§ Hadoop typically ingests data from many sources and in many formats
– Traditional data management systems such as databases
– Logs and other machine generated data (event data)
– Imported files

[Diagram: data from many data sources is ingested into Hadoop data storage: HDFS, HBase, and Kudu]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-9
Data Storage: HDFS and Apache HBase

§ Hadoop Distributed File System (HDFS)


– HDFS is the main storage layer for Hadoop
– Provides inexpensive reliable storage for massive
amounts of data on industry-standard hardware
– Data is distributed when stored
§ Apache HBase: The Hadoop Database
– A NoSQL distributed database built on HDFS
– Scales to support very large amounts of data
and high throughput
– A table can have thousands of columns
– Covered in Cloudera Training for Apache HBase

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-10
Data Storage: Apache Kudu

§ Apache Kudu
– Distributed columnar (key-value) storage for structured data
– Supports random access and updating data (unlike HDFS)
– Faster sequential reads than HBase to support SQL-based analytics
– Works directly on native file system; is not built on HDFS
– Integrates with Spark, MapReduce, and Apache Impala
– Created at Cloudera, donated to Apache Software Foundation

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-11
Data Ingest Tools (1)

§ HDFS
– Direct file transfer
§ Apache Sqoop
– High speed import to HDFS from relational database (and vice versa)
– Supports many data storage systems
– Examples: Netezza, MongoDB, MySQL, Teradata, and Oracle

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-12
Data Ingest Tools (2)

§ Apache Flume
– Distributed service for ingesting streaming data
– Ideally suited for event data from multiple systems
– For example, log files
§ Apache Kafka
– A high throughput, scalable messaging system
– Distributed, reliable publish-subscribe system
– Integrates with Flume and Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-13
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-14
Apache Spark: An Engine for Large-Scale Data Processing

§ Spark is a large-scale data processing engine


– General purpose
– Runs on Hadoop clusters and processes data in HDFS
§ Supports a wide range of workloads
– Machine learning
– Business intelligence
– Streaming
– Batch processing
– Querying structured data
§ This course uses Spark for data processing
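
As a preview of the style of code covered later in the course, the minimal PySpark sketch below counts error lines in a log file stored in HDFS. The SparkContext (sc) is assumed to come from the Spark shell, and the file path is purely illustrative.

# Minimal PySpark sketch (illustrative only).
# Assumes the Spark shell provides a SparkContext named sc; the path is hypothetical.
logs = sc.textFile("/loudacre/weblogs/access.log")    # RDD of lines read from HDFS
errors = logs.filter(lambda line: "ERROR" in line)    # keep only lines containing ERROR
print(errors.count())                                 # trigger the computation and print the total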

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-15
Hadoop MapReduce: The Original Hadoop Processing Engine

§ Hadoop MapReduce is the original Hadoop


framework for processing big data
– Primarily Java-based
§ Based on the MapReduce programming model
§ The core Hadoop processing engine before Spark was introduced
§ Still the dominant technology
– But losing ground to Spark fast
§ Many existing tools are still built using MapReduce code
§ Has extensive and mature fault tolerance built into the framework

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-16
Apache Pig: Scripting for MapReduce

§ Apache Pig builds on Hadoop to offer high-level data processing


– An alternative to writing low-level MapReduce code
– Especially good at joining and transforming data
§ The Pig interpreter runs on the client machine
– Turns Pig Latin scripts into MapReduce or Spark jobs
– Submits those jobs to a Hadoop cluster

people = LOAD '/user/training/customers' AS (cust_id, name);


orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-17
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-18
Apache Impala (Incubating): High-Performance SQL

§ Impala is a high-performance SQL engine


– Runs on Hadoop clusters
– Data stored in HDFS files, or in HBase or Kudu tables
– Inspired by Google’s Dremel project
– Very low latency—measured in milliseconds
– Ideal for interactive analysis
§ Impala supports a dialect of SQL (Impala SQL)
– Data in HDFS modeled as database tables
§ Impala was developed by Cloudera
– Donated to the Apache Software Foundation, where it is incubating
– 100% open source, released under the Apache software license

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-19
Apache Hive: SQL on MapReduce or Spark

§ Hive is an abstraction layer on top of Hadoop


– Hive uses a SQL-like language called HiveQL
– Similar to Impala SQL
– Useful for data processing and ETL
– Impala is preferred for interactive analytics
§ Hive executes queries using MapReduce or Spark

SELECT zipcode, SUM(cost) AS total


FROM customers
JOIN orders
ON (customers.cust_id = orders.cust_id)
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-20
Cloudera Search: A Platform for Data Exploration

§ Interactive full-text search for data in a Hadoop cluster


§ Allows non-technical users to access your data
– Nearly everyone can use a search engine
§ Cloudera Search enhances Apache Solr
– Integrates Apache Solr with HDFS, MapReduce,
HBase, and Flume
– Supports file formats widely used with Hadoop
– Dynamic web-based dashboard interface with Hue
§ Cloudera Search is 100% open source

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-21
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-22
Apache Oozie: Workflow Management

§ Oozie
– Workflow engine for Hadoop jobs
– Defines dependencies between jobs
§ The Oozie server submits the jobs to the cluster in the correct sequence

[Diagram: an example Oozie workflow. Start workflow; check whether the web server logs are in HDFS (import sales data with Sqoop, or send e-mail to the administrator); check whether today is Sunday (generate weekly reports with Hive, or process data with Spark); end workflow]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-23
Hue: The UI for Hadoop

§ Hue = Hadoop User Experience


§ Hue provides a web front-end to Hadoop
– Upload and browse data in HDFS
– Query tables in Impala and Hive
– Run Spark jobs, Pig jobs, and Oozie workflows
– Build an interactive Cloudera Search dashboard
– And much more
§ Makes Hadoop easier to use
§ Created by Cloudera
– 100% open source
– Released under Apache license

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-24
Apache Sentry: Hadoop Security

§ Sentry provides fine-grained access control


(authorization) to various Hadoop ecosystem
components
– Impala
– Hive
– Cloudera Search
– HDFS
§ In conjunction with Kerberos authentication, Sentry
authorization provides an overall cluster security
solution
§ Created by Cloudera
– Donated to Apache Software Foundation
– Now an open-source Apache project

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-25
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-26
Introduction to the Hands-On Exercises

§ The best way to learn is to do!


§ Most topics in this course have Hands-On Exercises to practice the skills
you have learned in the course
§ The exercises are based on a hypothetical scenario
– However, the concepts apply to nearly any organization
§ Loudacre Mobile is a (fictional) fast-growing wireless carrier
– Provides mobile service to customers throughout western USA


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-27
Scenario Explanation

§ Loudacre needs to migrate their existing infrastructure to Hadoop


– The size and velocity of their data has exceeded their ability to process
and analyze their data
§ Loudacre data sources
– MySQL database: customer account data (name, address, phone
numbers, and devices)
– Apache web server logs from Customer Service site
– HTML files: Knowledge Base articles
– XML files: Device activation records
– Real-time device status logs
– Base station files: Cell tower locations

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-28
Introduction to Exercises: Classroom Virtual Machine

§ Your virtual machine


– Log in as user training (password training)
– Home directory is /home/training (often referenced as ~)
– Pre-installed and configured with
– Spark and CDH (Cloudera’s Distribution, including Apache Hadoop)
– Various tools including Firefox, gedit, Emacs, Eclipse, and Apache
Maven
§ Training materials: ~/training_materials/devsh folder on the VM
– examples: all the example code in this course
– exercises: starter files, scripts and solutions for the Hands-On
Exercises
– scripts: course setup scripts
§ Course data: ~/training_materials/data

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-29
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-30
Essential Points

§ Hadoop is a framework for distributed storage and processing


§ Core Hadoop includes HDFS for storage and YARN for cluster resource
management
§ The Hadoop ecosystem includes many components for
– Ingesting data (Flume, Sqoop, Kafka)
– Storing data (HDFS, HBase, Kudu)
– Processing data (Spark, Hadoop MapReduce, Pig)
– Modeling data as tables for SQL access (Impala, Hive)
– Exploring data (Hue, Search)
– Protecting Data (Sentry)
§ Hands-On Exercises let you practice and refine your Hadoop skills

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-31
Bibliography

The following offer more information on topics discussed in this chapter


§ Hadoop: The Definitive Guide (published by O’Reilly)
– http://tiny.cloudera.com/hadooptdg
§ Cloudera Essentials for Apache Hadoop—free online training
– http://tiny.cloudera.com/esscourse

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-32
Chapter Topics

Introduction to Apache Hadoop and


the Hadoop Ecosystem

§ Apache Hadoop Overview


§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to the Hands-On Exercises
§ Essential Points
§ Hands-On Exercise: Query Hadoop Data with Apache Impala

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-33
Hands-on Exercise: Query Hadoop Data with Apache Impala

§ In this exercise, you will


– Use the Hue Impala Query Editor to explore data in a Hadoop cluster
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 02-34
Apache Hadoop File Storage
Chapter 3
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-2
Apache Hadoop File Storage

In this chapter you will learn


§ How the Apache Hadoop Distributed File System (HDFS) stores data across
a cluster
§ How to use HDFS using the Hue File Browser or the hdfs command
§ What the major supported file storage formats in Hadoop are, and how to
choose which to use

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-3
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-4
Hadoop Cluster Terminology

§ A cluster is a group of computers working together


– Provides data storage, data processing, and resource management
§ A node is an individual computer in the cluster
– Master nodes manage distribution of work and data to worker nodes
§ A daemon is a program running on a node
– Each performs different functions in the cluster

[Diagram: a cluster with master nodes and multiple worker nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-5
Cluster Components

§ Three main components of a cluster


§ Work together to provide distributed data processing
§ The first topic is the Storage component
– HDFS
[Diagram: the three cluster components: Storage, Processing, and Resource Management]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-6
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-7
HDFS Basic Concepts (1)

§ HDFS is a file system written in Java


– Based on Google File System
§ Sits on top of a native file system
– Such as ext3, ext4, or xfs
§ Provides redundant storage for massive amounts of data
– Using readily-available, industry-standard computers

[Diagram: HDFS layered on top of the native OS file system and disk storage]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-8
HDFS Basic Concepts (2)

§ HDFS performs best with a “modest” number of large files


– Millions, rather than billions, of files
– Each file typically 100MB or more
§ Files in HDFS are “write once”
– No random writes to files are allowed
§ HDFS is optimized for large, streaming reads of files
– Rather than random reads

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-9
How Files Are Stored

§ Data files are split into blocks (default 128MB) which are distributed at
load time
§ Each block is replicated on multiple data nodes (default 3x)
§ NameNode stores metadata
– Information about files and blocks

[Diagram: a very large data file is split into blocks 1 through 4; each block is replicated on several data nodes, and the NameNode holds the metadata about files and blocks]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-10
Example: Storing and Retrieving Files (1)

[Diagram: two local files, /logs/031515.log and /logs/042316.log, to be stored on a Hadoop cluster of five nodes (A through E)]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-11
Example: Storing and Retrieving Files (2)

[Diagram: the files are split into blocks and replicated across Nodes A through E. The NameNode metadata records /logs/031515.log: B1, B2, B3 and /logs/042316.log: B4, B5, along with each block's locations (B1: A,B,D; B2: B,D,E; B3: A,B,C; B4: A,B,E; B5: C,E,D)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-12
Example: Storing and Retrieving Files (3)

[Diagram: a client asks the NameNode for /logs/042316.log; the NameNode returns the file's block list (B4, B5) and their locations]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-13
Example: Storing and Retrieving Files (4)

[Diagram: using the block locations returned by the NameNode, the client reads blocks B4 and B5 directly from the data nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-14
HDFS NameNode Availability

§ The NameNode daemon must be running at all times


– If the NameNode stops, the cluster becomes inaccessible

§ HDFS is typically set up for High Availability
– Two NameNodes: Active and Standby
– Standby NameNode takes over automatically if the Active NameNode fails
§ Small clusters may use classic mode
– One NameNode
– One “helper” node called the Secondary NameNode
– Bookkeeping, not backup
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-15
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-16
Options for Accessing HDFS
§ From the command line
$ hdfs dfs
§ In Spark
– By URI—for example: hdfs://nnhost:port/file…
§ Other programs
– Java API
– Used by Hadoop tools such as MapReduce, Impala, Hue, Sqoop, Flume
– RESTful interface

[Diagram: a client uses put and get to transfer files to and from the HDFS cluster]
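
For instance, a Spark program can name a file by its full HDFS URI as shown above. This short PySpark sketch assumes a running SparkContext (sc) and a hypothetical NameNode host, port, and path.

# Read a file by its full HDFS URI (host, port, and path are hypothetical).
data = sc.textFile("hdfs://nnhost:8020/user/training/foo.txt")
print(data.first())    # show the first line to confirm the read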

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-17
HDFS Command Line Examples (1)

§ Copy file foo.txt from local disk to the user’s directory in HDFS

$ hdfs dfs -put foo.txt foo.txt

– This will copy the file to /user/username/foo.txt


§ Get a directory listing of the user’s home directory in HDFS

$ hdfs dfs -ls

§ Get a directory listing of the HDFS root directory

$ hdfs dfs -ls /

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-18
HDFS Command Line Examples (2)

§ Display the contents of the HDFS file /user/fred/bar.txt

$ hdfs dfs -cat /user/fred/bar.txt

§ Copy that file to the local disk, named as baz.txt

$ hdfs dfs -get /user/fred/bar.txt baz.txt

§ Create a directory called input under the user’s home directory

$ hdfs dfs -mkdir input

Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-19
HDFS Command Line Examples (3)

§ Delete a file

$ hdfs dfs -rm input_old/file1

§ Delete a set of files using a wildcard

$ hdfs dfs -rm input_old/*

§ Delete the directory input_old and all its contents

$ hdfs dfs -rm -r input_old

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-20
The Hue HDFS File Browser

§ The File Browser in Hue lets you view and manage your HDFS directories
and files
– Create, move, rename, modify, upload, download, and delete
directories and files
– View file contents

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-21
HDFS Recommendations

§ HDFS is a repository for your data


– Structure and organize carefully!
§ Best practices include
– Define a standard directory structure
– Include separate locations for staging data
§ Example organization
– /user/…—data and configuration belonging only to a single user
– /etl—work in progress in Extract/Transform/Load stage
– /tmp—temporary generated data shared between users
– /data—data sets that are processed and available across the
organization for analysis
– /app—non-data files such as configuration, JAR files, SQL files

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-22
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-23
Hadoop Data Storage Formats

§ Hadoop and the tools in the Hadoop ecosystem use several different file
formats to store data
§ The most common are
– Text
– SequenceFiles
– Apache Avro data format
– Apache Parquet
§ Which formats to use depend on your use case and which tools you use
§ You can also define custom formats
§ HDFS considers files to be simply a sequence of bytes

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-24
Hadoop File Formats: Text Files

§ Text files are the most basic file type in Hadoop


– Can be read or written from virtually any programming language
– Comma- and tab-delimited files are compatible with many applications
§ Text files are human readable since all data is in string format
– Useful when debugging
§ Text files are inefficient at scale
– Representing numeric values as strings wastes storage space
– Difficult to represent binary data such as images
– Often resort to techniques such as Base64 encoding
– Conversion to/from native types adds performance penalty
§ Verdict: Good interoperability, but poor performance
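
To illustrate the interoperability point, a comma-delimited text file can be parsed with ordinary string handling. The PySpark sketch below assumes a SparkContext (sc) and a hypothetical file of name,salary,city records.

# Parse a hypothetical comma-delimited file into lists of string fields.
lines = sc.textFile("/loudacre/people.csv")
fields = lines.map(lambda line: line.split(","))
# Numeric values arrive as strings and must be converted explicitly
# (the conversion penalty mentioned above).
salaries = fields.map(lambda f: int(f[1]))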

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-25
Hadoop File Formats: SequenceFiles

§ SequenceFiles store key-value pairs in a binary container format


– Less verbose and more efficient than text files
– Capable of storing binary data such as images
– Format is Java-specific and tightly coupled to Hadoop
§ Verdict: Better performance, but poor interoperability
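
Within the Hadoop ecosystem, however, SequenceFiles remain easy to read. For example, Spark can load one directly as key-value pairs; this sketch assumes a SparkContext (sc) and a hypothetical file path.

# Read a hypothetical SequenceFile as an RDD of (key, value) pairs.
pairs = sc.sequenceFile("/loudacre/mydata_seqfile")
print(pairs.take(2))    # show the first two records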

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-26
Hadoop File Formats: Apache Avro Data Files

§ Efficient storage due to optimized binary encoding


§ Widely supported throughout the Hadoop ecosystem
– Can also be used outside of Hadoop
§ Ideal for long-term storage of important data
– Many languages can read and write Avro files
– Embeds schema in the file, so will always be readable
– In JSON format and not Java-specific
– Schema evolution can accommodate changes
§ Verdict: Excellent interoperability and performance
– Best choice for general-purpose storage in Hadoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-27
Inspecting Avro Data Files with Avro Tools

§ Avro data files are an efficient way to store data


– However, the binary format makes debugging difficult
§ Use the avro-tools command to work with binary files
– Allows you to read the schema or data for an Avro file

$ avro-tools tojson mydatafile.avro


{"name":"Alice","salary": 56500,"city":"Anaheim"}
{"name":"Bob","salary": 51400,"city":"Bellevue"}

$ avro-tools getschema mydatafile.avro


{
"type" : "record",
"name" : "DeviceData",
"namespace" : "com.loudacre.data",
(remainder of schema omitted)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-28
Columnar Formats

§ Hadoop also supports a few columnar formats


– These organize data storage by column, rather than by row
– Very efficient when selecting only a subset of a table’s columns

id  name   city       occupation
1   Alice  Palo Alto  Accountant
2   Bob    Sunnyvale  Accountant
3   Bob    Palo Alto  Dentist
4   Bob    Palo Alto  Manager
5   Carol  Palo Alto  Manager
6   David  Sunnyvale  Mechanic

[Diagram: the same table stored two ways: traditional row-based formats lay the data out row by row, while columnar formats lay it out column by column]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-29
Hadoop File Formats: Apache Parquet Files

§ Parquet is a columnar format developed by Cloudera and Twitter


– Supported in Spark, MapReduce, Hive, Pig, Impala, and others
– Schema metadata is embedded in the file (like Avro)
§ Uses advanced optimizations described in Google’s Dremel paper
– Reduces storage space
– Increases performance
§ Most efficient when adding many records at once
– Some optimizations rely on identifying repeated patterns
§ Verdict: Excellent interoperability and performance
– Best choice for column-based access patterns
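
As an illustration of Parquet's Spark support, the sketch below reads a Parquet file into a DataFrame and writes a filtered copy back out. It assumes a SQLContext named sqlContext (as provided by the Spark shell in Spark 1.x) and hypothetical paths and column names.

# Read a Parquet file into a DataFrame and write a filtered subset back out.
people = sqlContext.read.parquet("/loudacre/people_parquet")
managers = people.filter(people.occupation == "Manager")    # hypothetical column
managers.write.parquet("/loudacre/managers_parquet")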

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-30
Inspecting Parquet Files with Parquet Tools

§ Parquet files are binary files


– Binary format makes debugging difficult
§ Use parquet-tools command to work with binary files
– Allows you to read the schema or data for a Parquet file

$ parquet-tools head mydatafile.parquet



name = Alice
salary = 56500

$ parquet-tools schema mydatafile.parquet



optional binary name (UTF8);
optional int32 salary;

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-31
Data Format Summary

Feature                     Text   SequenceFiles   Avro   Parquet
Supported by many tools      ✓                      ✓       ✓
Good performance at scale            ✓              ✓       ✓
Binary format                        ✓              ✓       ✓
Embedded schema                                     ✓       ✓
Columnar organization                                       ✓

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-32
Data Compression

§ Each file format may also support compression


– This reduces amount of disk space required to store data
§ Compression is a tradeoff between CPU time and bandwidth/storage
space
– Aggressive algorithms take a long time, but save more space
– Less aggressive algorithms save less space but are much faster
§ Can significantly improve performance
– Many Hadoop jobs are I/O-bound
– Using compression allows you to handle more data per I/O operation
– Compression can also improve the performance of network transfers

[Chart: physical area on disk required to store a given amount of data, compressed versus uncompressed]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-33
Compression Codecs

§ The implementation of a compression algorithm is known as a codec


– Short for compressor/decompressor
§ Many codecs are commonly used with Hadoop
– Each has different performance characteristics
– Not all Hadoop tools are compatible with all codecs
§ Overall, BZip2 saves the most space
– But LZ4 and Snappy are much faster
– Impala supports Snappy but not LZ4
§ For “hot” data, speed matters most
– Better to compress by 40% in one second than by 80% in 10 seconds

[Chart: time required versus space saved. BZip2 and GZip save the most space but take the longest; LZO, Snappy, and LZ4 are much faster but save less space]
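
As one example of how a codec is selected in practice, Spark can compress saved output by naming a Hadoop codec class. The sketch below assumes a SparkContext (sc) and hypothetical input and output paths, and uses Snappy purely as an illustration.

# Save an RDD as Snappy-compressed text files (paths are hypothetical).
data = sc.textFile("/loudacre/weblogs")
data.saveAsTextFile("/loudacre/weblogs_snappy",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec")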

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-34
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-35
Essential Points

§ The Hadoop Distributed File System (HDFS) is the main storage layer for
Hadoop
§ HDFS chunks data into blocks and distributes them across the cluster when
data is stored
§ HDFS clusters are managed by a single NameNode running on a master
node
§ Access HDFS using Hue, the hdfs command, or the HDFS API
§ The Hadoop ecosystem supports several different file formats

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-36
Bibliography

The following offer more information on topics discussed in this chapter


§ HDFS User Guide
– http://tiny.cloudera.com/hdfsuser
§ Hue website
– http://gethue.com/

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-37
Chapter Topics

Apache Hadoop File Storage

§ Apache Hadoop Cluster Components


§ HDFS Architecture
§ Using HDFS
§ Apache Hadoop File Formats
§ Essential Points
§ Hands-On Exercises: Access HDFS with the Command Line and Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-38
Hands-On Exercise: Access HDFS with the Command Line and
Hue
§ In this exercise, you will
– Create a /loudacre base directory for course exercises
– Practice uploading and viewing data files
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 03-39
Data Processing on an Apache Hadoop
Cluster
Chapter 4
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-2
Data Processing on an Apache Hadoop Cluster

In this chapter you will learn


§ How Hadoop YARN provides cluster resource management for distributed
data processing
§ How to use Hue, the YARN web UI, or the yarn command to monitor your
cluster

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-3
Chapter Topics

Data Processing on an Apache


Hadoop Cluster

§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-4
What Is YARN?
§ YARN = Yet Another Resource Negotiator
§ YARN is the Hadoop processing layer that contains
– A resource manager
– A job scheduler

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-5
YARN Daemons

§ ResourceManager (RM)
– Runs on master node
– Global resource scheduler
– Arbitrates system resources between competing applications
– Has a pluggable scheduler to support different algorithms (such as Capacity or Fair Scheduler)

§ NodeManager (NM)
– Runs on worker nodes
– Communicates with RM
– Manages node resources
– Launches containers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-6
Running Applications on YARN

§ Containers
– Containers allocate a certain amount of resources (memory, CPU cores) on a worker node
– Applications run in one or more containers
– Clients request containers from RM

§ ApplicationMaster (AM)
– One per application
– Framework/application specific
– Runs in a container
– Requests more containers to run application tasks

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-7
Running an Application on YARN (1)

[Diagram: a YARN cluster at rest. Each worker node runs a NodeManager and a DataNode; the ResourceManager and NameNode run on master nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-8
Running an Application on YARN (2)

[Diagram: a client submits my-hadoop-app; the ResourceManager launches an ApplicationMaster in a container on one of the worker nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-9
Running an Application on YARN (3)

[Diagram: the ApplicationMaster sends a resource request to the ResourceManager, for example 1 x Node1/1GB/1 core and 1 x Node2/1GB/1 core]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-10
Running an Application on YARN (4)

[Diagram: the ResourceManager grants the request: “Here are your containers”]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-11
Running an Application on YARN (5)

[Diagram: the application’s tasks run in the allocated containers on the worker nodes]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-12
Running an Application on YARN (6)

[Diagram: when the application finishes, the ApplicationMaster tells the ResourceManager “I’m done!”]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-13
Chapter Topics

Data Processing on an Apache


Hadoop Cluster

§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-14
Working with YARN

§ Developers need to be able to


– Submit jobs (applications) to run on the YARN cluster
– Monitor and manage jobs
§ There are three major YARN tools for developers
– The Hue Job Browser
– The YARN web UI
– The YARN command line
§ YARN administrators can use Cloudera Manager
– May also be helpful for developers
– Included in Cloudera Express and Cloudera Enterprise

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-15
The Hue Job Browser

§ The Hue Job Browser allows you to


– Monitor the status of a job
– View the logs
– Kill a running job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-16
The YARN Web UI

§ The ResourceManager UI is the main entry point


– Runs on the RM host on port 8088 by default
§ Provides more detailed view than Hue
§ Does not provide any control or configuration

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-17
ResourceManager UI: Nodes

[Screenshot: the ResourceManager UI Nodes page, showing a cluster overview, a list of all nodes in the cluster, and a link to each NodeManager UI]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-18
ResourceManager UI: Applications

[Screenshot: the ResourceManager UI Applications page, showing a cluster overview, a list of running and recent applications, and links to application details (next slide)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-19
ResourceManager UI: Application Detail

[Screenshot: the application detail page, with a link to the ApplicationMaster UI (which depends on the specific framework) and a link to view aggregated log files (optional)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-20
History Server

§ YARN does not keep track of job history
§ Spark and MapReduce each provide a history server
– Archives jobs’ metrics and metadata
– Can be accessed through the history server UI or Hue

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-21
YARN Command Line (1)

§ Command to configure and view information about the YARN cluster

$ yarn <command>

§ Most YARN commands are for administrators rather than developers


§ Some helpful commands for developers
– List running applications
$ yarn application -list

– Kill a running application


$ yarn application -kill <app-id>

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-22
YARN Command Line (2)

– View the logs of the specified application

$ yarn logs -applicationId <app-id>

– View the full list of command options

$ yarn -help

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-23
Cloudera Manager

§ Cloudera Manager provides a greater ability to monitor and configure a cluster from a single location
– Covered in Cloudera Administrator Training for Apache Hadoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-24
Chapter Topics

Data Processing on an Apache


Hadoop Cluster

§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-25
Essential Points

§ YARN manages resources in a Hadoop cluster and schedules jobs


§ YARN works with HDFS to run tasks where the data is stored
§ Worker nodes run NodeManager daemons, managed by a
ResourceManager on a master node
§ Monitor jobs using Hue, the YARN web UI, or the yarn command

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-26
Bibliography

The following offer more information on topics discussed in this chapter


§ Hadoop Application Architectures: Designing Real-World Big Data
Applications (published by O’Reilly)
– http://tiny.cloudera.com/archbook
§ YARN documentation
– http://tiny.cloudera.com/yarndocs
§ Cloudera Engineering Blog YARN articles
– http://tiny.cloudera.com/yarnblog

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-27
Chapter Topics

Data Processing on an Apache


Hadoop Cluster

§ YARN Architecture
§ Working With YARN
§ Essential Points
§ Hands-On Exercises: Run a YARN Job

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-28
Hands-On Exercise: Run a YARN Job

§ In this exercise, you will


– Use the YARN web UI to view your YARN cluster “at rest”
– Submit an application to run on the cluster
– Monitor the job using both the YARN UI and Hue
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 04-29
Importing Relational Data with Apache
Sqoop
Chapter 5
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-2
Importing Relational Data with Apache Sqoop

In this chapter you will learn


§ How to import tables from an RDBMS into your Hadoop cluster
§ How to change the delimiter and file format of imported tables
§ How to control which tables, columns, and rows are imported

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-3
Chapter Topics

Importing Relational Data with


Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-4
What Is Apache Sqoop?

§ Open source Apache project originally developed by Cloudera


– The name is a contraction of “SQL-to-Hadoop”
§ Sqoop exchanges data between a database and HDFS
– Can import all tables, a single table, or a partial table into HDFS
– Data can be imported in a variety of formats
– Sqoop can also export data from HDFS to a database

[Diagram: Sqoop moves data between a database and a Hadoop cluster]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-5
How Does Sqoop Work?

§ Sqoop is a client-side application that imports data using Hadoop MapReduce
§ A basic import involves three steps orchestrated by Sqoop
– Examine table details
– Create and submit job to cluster
– Fetch records from table and write this data to HDFS
[Diagram: the user runs Sqoop, which (1) examines the table on the database server, (2) creates and submits a job to the Hadoop cluster, and (3) the job fetches records from the table and writes them to HDFS]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-6
Basic Syntax

§ Sqoop is a command-line utility with several subcommands, called tools


– There are tools for import, export, listing database contents, and more
– Run sqoop help to see a list of all tools
– Run sqoop help tool-name for help on using a specific tool
§ Basic syntax of a Sqoop invocation

$ sqoop tool-name [tool-options]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-7
Exploring a Database with Sqoop

§ This command will list all tables in the loudacre database in MySQL

$ sqoop list-tables \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser \
--password pw

§ You can perform database queries using the eval tool

$ sqoop eval \
--query "SELECT * FROM my_table LIMIT 5" \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser \
--password pw

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-8
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-9
Overview of the Import Process

§ Imports are performed using Hadoop MapReduce jobs


§ Sqoop begins by examining the table to be imported
– Determines the primary key, if possible
– Runs a boundary query to determine the range of values in the column used to split the data (the primary key by default)
– Divides that range among the tasks (mappers)
– Uses this to configure the tasks so that they have roughly equal loads
§ Sqoop also generates a Java source file for each table being imported
– It compiles and uses this during the import process
– The file remains after import, but can be safely deleted

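For example, you can influence how Sqoop splits the work by choosing the split column and the number of map tasks with the --split-by and --num-mappers options. This is a minimal sketch using the same connection options as the other examples in this chapter; the column name acct_num is hypothetical:

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --split-by acct_num \
    --num-mappers 8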
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-10
Importing an Entire Database with Sqoop

§ The import-all-tables tool imports an entire database


– Stored as comma-delimited files
– Default base location is your HDFS home directory
– Data will be in subdirectories corresponding to name of each table

$ sqoop import-all-tables \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw

§ Use the --warehouse-dir option to specify a different base directory

$ sqoop import-all-tables \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--warehouse-dir /loudacre

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-11
Importing a Single Table with Sqoop

§ The import tool imports a single table


§ This example imports the accounts table
– It stores the data in HDFS as comma-delimited fields

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-12
Importing Partial Tables with Sqoop

§ Import only specified columns from accounts table

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--columns "id,first_name,last_name,state"

§ Import only matching rows from accounts table

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--where "state='CA'"

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-13
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-14
Specifying a File Location

§ By default, Sqoop stores the data in the user’s HDFS home directory
– In a subdirectory corresponding to the table name
– For example /user/training/accounts
§ This example specifies an alternate location

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--target-dir /loudacre/customer_accounts

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-15
Specifying an Alternate Delimiter

§ By default, Sqoop generates text files with comma-delimited fields


§ This example writes tab-delimited fields instead

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--fields-terminated-by "\t"

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-16
Using Compression with Sqoop

§ Sqoop supports storing data in a compressed file


– Use the --compression-codec flag

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--compression-codec \
org.apache.hadoop.io.compress.SnappyCodec

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-17
Storing Data in Other Data Formats

§ By default, Sqoop stores data in text format files


§ Sqoop supports importing data as Parquet or Avro files

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--as-parquetfile

$ sqoop import --table accounts \


--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--as-avrodatafile

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-18
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-19
Exporting Data from Hadoop to RDBMS with Sqoop

§ Sqoop's import tool pulls records from an RDBMS into HDFS


§ It is sometimes necessary to push data in HDFS back to an RDBMS
– Good solution when you must do batch processing on large data sets
– Export results to a relational database for access by other systems
§ Sqoop supports this via the export tool
– The RDBMS table must already exist prior to export

$ sqoop export \
--connect jdbc:mysql://dbhost/loudacre \
--username dbuser --password pw \
--export-dir /loudacre/recommender_output \
--update-mode allowinsert \
--table product_recommendations

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-20
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-21
Essential Points

§ Sqoop exchanges data between a database and a Hadoop cluster


– Provides subcommands (tools) for importing, exporting, and more
§ Tables are imported using MapReduce jobs
– These are written as comma-delimited text by default
– You can specify alternate delimiters or file formats
§ Sqoop provides many options to control imports
– You can select only certain columns or limit rows

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-22
Bibliography

The following offer more information on topics discussed in this chapter


§ Sqoop User Guide
– http://tiny.cloudera.com/sqoopuser
§ Apache Sqoop Cookbook (published by O’Reilly)
– http://tiny.cloudera.com/sqoopcookbook

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-23
Chapter Topics

Importing Relational Data with Apache Sqoop

§ Apache Sqoop Overview


§ Importing Data
§ Import File Options
§ Exporting Data
§ Essential Points
§ Hands-On Exercise: Import Data from MySQL Using Apache Sqoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-24
Hands-On Exercise: Import Data from MySQL Using Apache
Sqoop
§ In this exercise, you will
– Use Sqoop to import customer account data from an RDBMS to HDFS
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 05-25
Apache Spark Basics
Chapter 6
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-2
Apache Spark Basics

In this chapter you will learn


§ How to start the Spark shell
§ How to use the Spark context to access Spark functionality
§ Key concepts of Resilient Distributed Datasets (RDDs)
– What are they?
– How do you create them?
– What operations can you perform with them?
§ How Spark uses the principles of functional programming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-3
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-4
What Is Apache Spark?

§ Apache Spark is a fast and general engine for large-scale


data processing
§ Written in Scala
– Functional programming language that runs in a JVM
§ Spark shell
– Interactive—for learning or data exploration
– Python or Scala
§ Spark applications
– For large scale data processing
– Python, Scala, or Java

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-5
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-6
Spark Shell

§ The Spark shell provides interactive data exploration (REPL)

Python shell: pyspark

$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.8 (default, …)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

Scala shell: spark-shell

$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Spark context available as sc (master = …)
SQL context available as sqlContext.

scala>

REPL: Read/Evaluate/Print Loop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-7
Spark Context

§ Every Spark application requires a Spark context


– The main entry point to the Spark API
§ The Spark shell provides a preconfigured Spark context called sc
Language: Python
Using Python version 2.7.8 (default, Apr 19 2016 07:37:49)
SparkContext available as sc, HiveContext available as sqlContext.

>>> sc.appName
u'PySparkShell'

Language: Scala

Spark context available as sc (master = …)
SQL context available as sqlContext.

scala> sc.appName
res0: String = Spark shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-8
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-9
RDD (Resilient Distributed Dataset)

§ RDD (Resilient Distributed Dataset)


– Resilient: If data in memory is lost, it can be recreated
– Distributed: Processed across the cluster
– Dataset: Initial data can come from a source such as a file, or it can be
created programmatically
§ RDDs are the fundamental unit of data in Spark
§ Most Spark programming consists of performing operations on RDDs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-10
Creating an RDD

§ Three ways to create an RDD


– From a file or set of files
– From data in memory
– From another RDD

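A minimal sketch in the Python shell illustrating all three approaches (the file and variable names are hypothetical):

Language: Python
> rdd_from_file = sc.textFile("purplecow.txt")       # from a file
> rdd_from_mem  = sc.parallelize([1, 2, 3, 4])       # from data in memory
> rdd_from_rdd  = rdd_from_mem.map(lambda x: x * 2)  # from another RDD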
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-11
Example: A File-Based RDD

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
…
15/01/29 06:20:37 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 151.4 KB, free 296.8 MB)
> mydata.count()
…
15/01/29 06:27:37 INFO spark.SparkContext: Job finished: take at <stdin>:1, took 0.160482078 s
4

[RDD: mydata contains one element per line of the file]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-12
RDD Operations

§ Two broad types of RDD operations
– Actions return values (RDD → value)
– Transformations define a new RDD based on the current one(s) (base RDD → new RDD)
§ Which type of operation is count()?

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-13
RDD Operations: Actions

§ Some common actions
– count() returns the number of elements
– take(n) returns an array of the first n elements
– collect() returns an array of all elements
– saveAsTextFile(dir) saves to text file(s)

Language: Python
> mydata = sc.textFile("purplecow.txt")
> mydata.count()
4
> for line in mydata.take(2):
    print line
I've never seen a purple cow.
I never hope to see one;

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> mydata.count()
4
> for (line <- mydata.take(2))
    println(line)
I've never seen a purple cow.
I never hope to see one;

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-14
RDD Operations: Transformations

§ Transformations create a new RDD from


Base RDD New RDD
an existing one
§ RDDs are immutable
– Data in an RDD is never changed
– Transform in sequence to modify the
data as needed
§ Two common transformations
– map(function)creates a new RDD by performing a function on
each record in the base RDD
– filter(function)creates a new RDD by including or excluding
each record in the base RDD according to a Boolean function

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-15
Example: map and filter Transformations
Starting data:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

map (Python: map(lambda line: line.upper()) / Scala: map(line => line.toUpperCase))

I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

filter (Python: filter(lambda line: line.startswith('I')) / Scala: filter(line => line.startsWith("I")))

I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-16
Lazy Execution
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ Data in RDDs is not processed until an action is performed

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.count()
3

[Animation across slides 06-17 to 06-21: each statement only defines an RDD (mydata, mydata_uc, mydata_filt); no data is read or transformed until count() triggers execution, at which point the file lines are loaded, uppercased, and filtered]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-17 to 06-21
Chaining Transformations (Scala)

§ Transformations may be chained together

> val mydata = sc.textFile("purplecow.txt")


> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.count()
3

is equivalent to

> sc.textFile("purplecow.txt").map(line => line.toUpperCase()).


filter(line => line.startsWith("I")).count()
3

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-22
Chaining Transformations (Python)

§ Same example in Python

> mydata = sc.textFile("purplecow.txt")


> mydata_uc = mydata.map(lambda s: s.upper())
> mydata_filt = mydata_uc.filter(lambda s: s.startswith('I'))
> mydata_filt.count()
3

is exactly equivalent to

> sc.textFile("purplecow.txt").map(lambda line: line.upper()) \


.filter(lambda line: line.startswith('I')).count()
3

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-23
RDD Lineage and toDebugString (Scala)
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ Spark maintains each RDD’s lineage—the previous RDDs on which it depends
§ Use toDebugString to view the lineage of an RDD

> val mydata_filt =
    sc.textFile("purplecow.txt").
    map(line => line.toUpperCase()).
    filter(line => line.startsWith("I"))
> mydata_filt.toDebugString
(2) FilteredRDD[7] at filter …
 |  MappedRDD[6] at map …
 |  purplecow.txt MappedRDD[5] …
 |  purplecow.txt HadoopRDD[4] …

[Diagram: lineage chain RDD[5] → RDD[6] → RDD[7]]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-24
RDD Lineage and toDebugString (Python)

§ toDebugString output is not displayed as nicely in Python

> mydata_filt.toDebugString()
(1) PythonRDD[8] at RDD at …\n | purplecow.txt MappedRDD[7] at textFile
at …[]\n | purplecow.txt HadoopRDD[6] at textFile at …[]

§ Use print for prettier output

> print mydata_filt.toDebugString()


(1) PythonRDD[8] at RDD at …
| purplecow.txt MappedRDD[7] at textFile at …
| purplecow.txt HadoopRDD[6] at textFile at …

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-25
Pipelining
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ When possible, Spark performs sequences of transformations element by element, so no intermediate data is stored

Language: Scala
> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line =>
    line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line
    => line.startsWith("I"))
> mydata_filt.take(2)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;

[Animation across slides 06-26 to 06-33: each line is read, uppercased, and filtered one element at a time; processing stops as soon as take(2) has collected two matching results]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-26 to 06-33
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-34
Functional Programming in Spark

§ Spark depends heavily on the concepts of functional programming


– Functions are the fundamental unit of programming
– Functions have input and output only
– No state or side effects
§ Key concepts
– Passing functions as input to other functions
– Anonymous functions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-35
Passing Functions as Parameters

§ Many RDD operations take functions as parameters


§ Pseudocode for the RDD map operation
– Applies function fn to each record in the RDD

RDD {
map(fn(x)) {
foreach record in rdd
emit fn(record)
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-36
Example: Passing Named Functions

Language: Python
> def toUpper(s):
    return s.upper()
> mydata = sc.textFile("purplecow.txt")
> mydata.map(toUpper).take(2)

Language: Scala
> def toUpper(s: String): String =
{ s.toUpperCase }
> val mydata = sc.textFile("purplecow.txt")
> mydata.map(toUpper).take(2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-37
Anonymous Functions

§ Functions defined inline without an identifier


– Best for short, one-off functions
§ Supported in many programming languages
– Python: lambda x: ...
– Scala: x => ...
– Java 8: x -> ...

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-38
Example: Passing Anonymous Functions

§ Python:

> mydata.map(lambda line: line.upper()).take(2)

§ Scala:

> mydata.map(line => line.toUpperCase()).take(2)

OR

> mydata.map(_.toUpperCase()).take(2)

Scala allows anonymous


parameters using underscore (_)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-39
Example: Java

Language: Java 7
...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc =
    lines.map(new Function<String,String>() {
@Override
public String call(String s) {
return (s.toUpperCase());
}
});
...

Language: Java 8
...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
line -> line.toUpperCase());
...

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-40
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-41
Essential Points

§ Spark can be used interactively via the Spark shell


– Python or Scala
§ RDDs (Resilient Distributed Datasets) are a key concept in Spark
§ RDD Operations
– Transformations create a new RDD based on an existing one
– Actions return a value from an RDD
§ Lazy Execution
– Transformations are not executed until required by an action
§ Spark uses functional programming
– Passing functions as parameters
– Anonymous functions in supported languages (Python, Scala, Java 8)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-42
Chapter Topics

Apache Spark Basics

§ What is Apache Spark?


§ Using the Spark Shell
§ RDDs (Resilient Distributed Datasets)
§ Functional Programming in Spark
§ Essential Points
§ Hands-On Exercise: Explore RDDs Using the Spark Shell

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-43
Introduction to Spark Exercises: Choose Your Language

§ Your choice: Python or Scala


– For the Spark-based exercises in this course, you may choose to work
with either Python or Scala
§ Solution and example files
– .pyspark: Python commands that can be copied into the PySpark
shell
– .scalaspark: Scala commands that can be copied into the Scala
Spark shell
– .py: Complete Python Spark applications
– .scala: Complete Scala Spark applications

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-44
Hands-On Exercise: Explore RDDs Using the Spark Shell

§ In this exercise, you will


– View the Spark documentation
– Start the Spark shell
– Follow the instructions for either the Python or Scala shell
– Use RDDs to transform a dataset
– Explore Loudacre Web log files
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 06-45
Working with RDDs
Chapter 7
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-2
Working with RDDs

In this chapter you will learn


§ How RDDs are created from files or data in memory
§ How to handle file formats with multi-line records
§ How to use some additional operations on RDDs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-3
Chapter Topics

Working with RDDs

§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-4
RDDs

§ RDDs can hold any serializable type of element


– Primitive types such as integers, characters, and booleans
– Sequence types such as strings, lists, arrays, tuples, and dicts (including
nested data types)
– Scala/Java Objects (if serializable)
– Mixed types
§ Some RDDs are specialized and have additional functionality
– Pair RDDs
– RDDs consisting of key-value pairs
– Double RDDs
– RDDs consisting of numeric data

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-5
Creating RDDs from Collections

§ You can create RDDs from collections instead of files


– sc.parallelize(collection)
Language: Python
> myData = ["Alice","Carlos","Frank","Barbara"]
> myRdd = sc.parallelize(myData)
> myRdd.take(2)
['Alice', 'Carlos']

§ Useful when
– Testing
– Generating data programmatically
– Integrating
– Learning

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-6
Creating RDDs from Text Files (1)

§ For file-based RDDs, use SparkContext.textFile


– Accepts a single file, a directory of files, a wildcard list of files, or a
comma-separated list of files
– Examples
– sc.textFile("myfile.txt")
– sc.textFile("mydata/")
– sc.textFile("mydata/*.log")
– sc.textFile("myfile1.txt,myfile2.txt")
– Each line in each file is a separate record in the RDD
§ Files are referenced by absolute or relative URI
– Absolute URI:
– file:/home/training/myfile.txt
– hdfs://nnhost/loudacre/myfile.txt
– Relative URI (uses default file system): myfile.txt

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-7
Creating RDDs from Text Files (2)

§ textFile maps each line in a file to a separate RDD element

File contents:
I've never seen a purple cow.\n
I never hope to see one;\n
But I can tell you, anyhow,\n
I'd rather see than be one.\n

RDD elements:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

§ textFile only works with newline-terminated text files


§ What about other formats?

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-8
Input and Output Formats (1)

§ Spark uses Hadoop InputFormat and OutputFormat Java classes


– Some examples from core Hadoop
– TextInputFormat / TextOutputFormat (newline-
terminated text files)
– SequenceInputFormat / SequenceOutputFormat
– FixedLengthInputFormat
– Many implementations available in additional libraries
– Such as AvroKeyInputFormat / AvroKeyOutputFormat in
the Avro library
– You may also implement custom InputFormat and OutputFormat
types

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-9
Input and Output Formats (2)

§ Specify any input format using sc.hadoopFile


– or newAPIHadoopFile for New API classes
§ Specify any output format using rdd.saveAsHadoopFile
– or saveAsNewAPIHadoopFile for New API classes
§ textFile and saveAsTextFile are convenience functions
– textFile just calls hadoopFile specifying TextInputFormat
– saveAsTextFile calls saveAsHadoopFile specifying
TextOutputFormat

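As a rough sketch (not part of the original course materials), the PySpark shell can read a newline-terminated text file through the generic Hadoop-file API; the path mydata/ is hypothetical:

Language: Python
> rdd = sc.newAPIHadoopFile("mydata/",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
> rdd.values().take(2)   # elements are (byte offset, line) pairs; values() keeps just the lines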
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-10
Using wholeTextFiles (1)

§ sc.textFile maps each line in a file to a separate RDD element
– What about files with a multi-line input format, such as XML or JSON?
§ sc.wholeTextFiles(directory)
– Maps the entire contents of each file in a directory to a single RDD element
– Works only for small files (each element must fit in memory)

file1.json
{
  "firstName":"Fred",
  "lastName":"Flintstone",
  "userid":"123"
}

file2.json
{
  "firstName":"Barney",
  "lastName":"Rubble",
  "userid":"234"
}

Resulting pair RDD:
(file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"})
(file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"})
(file3.json,…)
(file4.json,…)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-11
Using wholeTextFiles (2)

Language: Python
> import json
> myrdd1 = sc.wholeTextFiles(mydir)
> myrdd2 = myrdd1 \
.map(lambda (fname,s): json.loads(s))
> for record in myrdd2.take(2): Output:
> print record.get("firstName",None) Fred
Barney
Language: Scala
> import scala.util.parsing.json.JSON
> val myrdd1 = sc.wholeTextFiles(mydir)
> val myrdd2 = myrdd1.
map(pair => JSON.parseFull(pair._2).get.
asInstanceOf[Map[String,String]])
> for (record <- myrdd2.take(2))
println(record.getOrElse("firstName",null))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-12
Chapter Topics

Working with RDDs

§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-13
Some Other General RDD Operations

§ Single-RDD Transformations
– flatMap maps one element in the base RDD to multiple elements
– distinct filters out duplicates
– sortBy uses the provided function to sort
§ Multi-RDD Transformations
– intersection creates a new RDD with all elements in both original
RDDs
– union adds all elements of two RDDs into a single new RDD
– zip pairs each element of the first RDD with the corresponding
element of the second
– subtract removes the elements in the second RDD from the first RDD

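A minimal Python sketch of the multi-RDD transformations, using small in-memory RDDs (the data is made up for illustration; output order may vary):

Language: Python
> rdd1 = sc.parallelize(["a", "b", "c"])
> rdd2 = sc.parallelize(["c", "d"])
> rdd1.union(rdd2).collect()          # ['a', 'b', 'c', 'c', 'd']
> rdd1.intersection(rdd2).collect()   # ['c']
> rdd1.subtract(rdd2).collect()       # ['a', 'b']
> sc.parallelize([1, 2], 1).zip(sc.parallelize(["x", "y"], 1)).collect()
[(1, 'x'), (2, 'y')]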
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-14
Example: flatMap and distinct

> sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \ Language: Python
.distinct()

> sc.textFile(file).
flatMap(line => line.split(' ')). Language: Scala
distinct()

Input:
the cat sat on the mat
the aardvark sat on the sofa

After flatMap:
the, cat, sat, on, the, mat, the, aardvark, sat, on, the, sofa

After distinct:
on, mat, sofa, aardvark, cat, the, sat

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-15
Examples: Multi-RDD Transformations (1)

rdd1: Chicago, Boston, Paris, San Francisco, Tokyo
rdd2: San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station

rdd1.subtract(rdd2)
Tokyo
Paris
Chicago

rdd1.zip(rdd2)
(Chicago,San Francisco)
(Boston,Boston)
(Paris,Amsterdam)
(San Francisco,Mumbai)
(Tokyo,McMurdo Station)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-16
Examples: Multi-RDD Transformations (2)

rdd1: Chicago, Boston, Paris, San Francisco, Tokyo
rdd2: San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station

rdd1.union(rdd2)
Chicago
Boston
Paris
San Francisco
Tokyo
San Francisco
Boston
Amsterdam
Mumbai
McMurdo Station

rdd1.intersection(rdd2)
Boston
San Francisco

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-17
Some Other General RDD Operations

§ Other RDD operations


– first returns the first element of the RDD
– foreach applies a function to each element in an RDD
– top(n) returns the largest n elements using natural ordering
§ Sampling operations
– sample creates a new RDD with a sampling of elements
– takeSample returns an array of sampled elements
§ Double RDD operations
– Statistical functions, such as mean, sum, variance, and stdev
– Documented in API for
org.apache.spark.rdd.DoubleRDDFunctions

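A short Python sketch of a few of these operations on a small in-memory RDD (the numbers are made up; results shown as comments):

Language: Python
> nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
> nums.first()               # 1.0
> nums.top(2)                # [4.0, 3.0]
> nums.takeSample(False, 2)  # two randomly chosen elements
> nums.mean()                # 2.5
> nums.stdev()               # 1.118…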
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-18
Chapter Topics

Working with RDDs

§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-19
Essential Points

§ RDDs can be created from files, parallelized data in memory, or other RDDs
§ sc.textFile reads newline-delimited text, one line per RDD element
§ sc.wholeTextFiles reads multiple files, one file per RDD element
§ Generic RDDs can consist of any type of data
§ Generic RDDs provide a wide range of transformation operations

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-20
Chapter Topics

Working with RDDs

§ Creating RDDs
§ Other General RDD Operations
§ Essential Points
§ Hands-On Exercise: Process Data Files with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-21
Hands-On Exercise: Process Data Files with Apache Spark

§ In this exercise, you will


– Process a set of XML files using wholeTextFiles
– Reformat a dataset to standardize format (bonus)
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 07-22
Aggregating Data with Pair RDDs
Chapter 8
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-2
Aggregating Data with Pair RDDs

In this chapter you will learn


§ How to create pair RDDs of key-value pairs from generic RDDs
§ Special operations available on pair RDDs
§ How map-reduce algorithms are implemented in Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-3
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-4
Pair RDDs

§ Pair RDDs are a special form of RDD Pair RDD


– Each element must be a key-value pair (a (key1,value1)
two-element tuple) (key2,value2)
– Keys and values can be any type (key3,value3)
§ Why? …
– Use with map-reduce algorithms
– Many additional functions are available for
common data processing needs
– Such as sorting, joining, grouping, and counting

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-5
Creating Pair RDDs

§ The first step in most workflows is to get the data into key/value form
– What should the RDD be keyed on?
– What is the value?
§ Commonly used functions to create pair RDDs
– map
– flatMap / flatMapValues
– keyBy

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-6
Example: A Simple Pair RDD

§ Example: Create a pair RDD from a tab-separated file


Language: Python
> users = sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))

Language: Scala
> val users = sc.textFile(file).
    map(line => line.split('\t')).
    map(fields => (fields(0),fields(1)))

Input:
user001\tFred Flintstone
user090\tBugs Bunny
user111\tHarry Potter
…

Resulting pair RDD:
(user001,Fred Flintstone)
(user090,Bugs Bunny)
(user111,Harry Potter)
…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-7
Example: Keying Web Logs by User ID
Language: Python
> sc.textFile(logfile) \
.keyBy(lambda line: line.split(' ')[2])

Language: Scala
> sc.textFile(logfile).
keyBy(line => line.split(' ')(2))
User ID
56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …

(99788,56.38.234.188 - 99788 "GET /KBDOC-00157.html…)


(99788,56.38.234.188 - 99788 "GET /theme.css…)
(25254,203.146.17.59 - 25254 "GET /KBDOC-00230.html…)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-8
Question 1: Pairs with Complex Values

§ How would you do this?


– Input: a tab-delimited list of postal codes with latitude and longitude
– Output: postal code (key) and lat/long pair (value)

Input:
00210   43.005895   -71.013202
00211   43.005895   -71.013202
00212   43.005895   -71.013202
00213   43.005895   -71.013202
00214   43.005895   -71.013202

Output:
(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
(00214,(43.005895,-71.013202))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-9
Answer 1: Pairs with Complex Values
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],(fields[1],fields[2])))

Language: Scala
> sc.textFile(file).
map(line => line.split('\t')).
map(fields => (fields(0),(fields(1),fields(2))))

Input:
00210   43.005895   -71.013202
01014   42.170731   -72.604842
01062   42.324232   -72.67915
01263   42.3929     -73.228483

Output:
(00210,(43.005895,-71.013202))
(01014,(42.170731,-72.604842))
(01062,(42.324232,-72.67915))
(01263,(42.3929,-73.228483))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-10
Question 2: Mapping Single Rows to Multiple Pairs (1)

§ How would you do this?


– Input: order numbers with a list of SKUs in the order
– Output: order (key) and sku (value)

Input Data:
00001   sku010:sku933:sku022
00002   sku912:sku331
00003   sku888:sku022:sku010:sku594
00004   sku411

Pair RDD:
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-11
Question 2: Mapping Single Rows to Multiple Pairs (2)

§ Hint: map alone won’t work

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

(00001,(sku010,sku933,sku022))
(00002,(sku912,sku331))
(00003,(sku888,sku022,sku010,sku594))
(00004,(sku411))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-12
Answer 2: Mapping Single Rows to Multiple Pairs (1)
Language: Python
> sc.textFile(file)

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-13
Answer 2: Mapping Single Rows to Multiple Pairs (2)
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t'))

00001   sku010:sku933:sku022
00002   sku912:sku331
00003   sku888:sku022:sku010:sku594
00004   sku411

[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]

Note that split returns 2-element arrays, not pairs/tuples

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-14
Answer 2: Mapping Single Rows to Multiple Pairs (3)
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))

00001   sku010:sku933:sku022
00002   sku912:sku331
00003   sku888:sku022:sku010:sku594
00004   sku411

[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]

(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)

Map array elements to tuples to produce a pair RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-15
Answer 2: Mapping Single Rows to Multiple Pairs (4)
Language: Python
> sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))
.flatMapValues(lambda skus: skus.split(':'))

00001   sku010:sku933:sku022
00002   sku912:sku331
00003   sku888:sku022:sku010:sku594
00004   sku411

(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)

(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-16
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-17
Map-Reduce

§ Map-reduce is a common programming model


– Easily applicable to distributed processing of large data sets
§ Hadoop MapReduce is the major implementation
– Somewhat limited
– Each job has one map phase, one reduce phase
– Job output is saved to files
§ Spark implements map-reduce with much greater flexibility
– Map and reduce functions can be interspersed
– Results can be stored in memory
– Operations can easily be chained

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-18
Map-Reduce in Spark

§ Map-reduce in Spark works on pair RDDs


§ Map phase
– Operates on one record at a time
– “Maps” each record to zero or more new records
– Examples: map, flatMap, filter, keyBy
§ Reduce phase
– Works on map output
– Consolidates multiple records
– Examples: reduceByKey, sortByKey, mean

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-19
Map-Reduce Example: Word Count
Input Data:
the cat sat on the mat
the aardvark sat on the sofa

Result:
on        2
sofa      1
mat       1
aardvark  1
the       4
cat       1
sat       2

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-20
Example: Word Count

Language: Python
> counts = sc.textFile(file) \
    .flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word,1)) \
    .reduceByKey(lambda v1,v2: v1+v2)

Input (lines):
the cat sat on the mat
the aardvark sat on the sofa

After flatMap (words):
the, cat, sat, on, the, mat, the, aardvark, sat, on, the, sofa

After map (key-value pairs):
(the,1), (cat,1), (sat,1), (on,1), (the,1), (mat,1), (the,1), (aardvark,1), …

After reduceByKey (counts):
(on,2), (sofa,1), (mat,1), (aardvark,1), (the,4), (cat,1), (sat,2)

[Slides 08-21 to 08-24 build this pipeline one transformation at a time]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-21 to 08-24
reduceByKey

Language: Python
counts = sc.textFile(file) \
    .flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word,1)) \
    .reduceByKey(lambda v1,v2: v1+v2)

§ The function passed to reduceByKey combines two values at a time for the same key
– The function must be binary (take two arguments)
§ The function might be called in any order, therefore it must be
– Commutative: x+y = y+x
– Associative: (x+y)+z = x+(y+z)

[Diagram: the four (the,1) pairs are combined two at a time, for example (the,2) then (the,3) then (the,4), or (the,2) and (the,2) then (the,4); either order yields (the,4)]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-25 to 08-26
Word Count Recap (The Scala Version)

> val counts = sc.textFile(file).


flatMap(line => line.split(' ')).
map(word => (word,1)).
reduceByKey((v1,v2) => v1+v2)

OR

> val counts = sc.textFile(file).


flatMap(_.split(' ')).
map((_,1)).
reduceByKey(_+_)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-27
Why Do We Care about Counting Words?

§ Word count is challenging over massive amounts of data


– Using a single compute node would be too time-consuming
§ Statistics are often simple aggregate functions
– Distributive in nature
– For example: max, min, sum, and count
§ Map-reduce breaks complex tasks down into smaller elements which can
be executed in parallel
§ Many common tasks are very similar to word count
– Such as log file analysis
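§ For example, counting requests per IP address follows the same pattern (a sketch, assuming sc, a variable logfile pointing at the log data, and a log format whose first space-separated field is the IP address, as in the course weblogs)

Language: Python
> ipcounts = sc.textFile(logfile) \
      .map(lambda line: (line.split(' ')[0], 1)) \
      .reduceByKey(lambda v1,v2: v1+v2)
> for pair in ipcounts.take(5): print pair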

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-28
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-29
Pair RDD Operations

§ In addition to map and reduceByKey operations, Spark has several


operations specific to pair RDDs
§ Examples
– countByKey returns a map with the count of occurrences of each key
– groupByKey groups all the values for each key in an RDD
– sortByKey sorts in ascending or descending order
– join returns an RDD containing all pairs with matching keys from two
RDDs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-30
Example: Pair RDD Operations

(00004,sku411)
(00003,sku888)
(00001,sku010)
(00003,sku022)
(00001,sku933)
(00003,sku010)
(00001,sku022)
(00003,sku594)
(00002,sku912)
(00002,sku912)
(00002,sku331)

(00003,sku888)

(00002,[sku912,sku331])
(00001,[sku022,sku010,sku933])
(00003,[sku888,sku022,sku010,sku594])
(00004,[sku411])
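§ A sketch of how such output can be produced (assuming the (order ID, SKU) pairs above are already in a pair RDD named orders)

Language: Python
> grouped = orders.groupByKey()
> # each value is an iterable of SKUs, for example (00004,[sku411])
> orders.countByKey()         # map of each order ID to its number of pairs
> orders.sortByKey().take(3)  # pairs sorted by order ID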

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-31
Example: Joining by Key

> movies = moviegross.join(movieyear)

RDD: moviegross RDD: movieyear


(Casablanca,$3.7M) (Casablanca,1942)
(Star Wars,$775M) (Star Wars,1977)
(Annie Hall,$38M) (Annie Hall,1977)
(Argo,$232M) (Argo,2012)
… …

(Casablanca,($3.7M,1942))
(Star Wars,($775M,1977))
(Annie Hall,($38M,1977))
(Argo,($232M,2012))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-32
Other Pair Operations

§ Some other pair operations


– keys returns an RDD of just the keys, without the values
– values returns an RDD of just the values, without keys
– lookup(key) returns the value(s) for a key
– leftOuterJoin, rightOuterJoin , fullOuterJoin join two
RDDs, including keys defined in the left, right or either RDD respectively
– mapValues, flatMapValues execute a function on just the values,
keeping the key the same
§ See the PairRDDFunctions class Scaladoc for a full list
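§ A few of these operations in a minimal sketch (assuming sc; the data is illustrative)

Language: Python
> movieyear = sc.parallelize([("Casablanca",1942), ("Argo",2012)])
> moviegross = sc.parallelize([("Casablanca","$3.7M")])
> movieyear.keys().collect()                     # ['Casablanca', 'Argo']
> movieyear.mapValues(lambda y: y + 1).collect() # [('Casablanca', 1943), ('Argo', 2013)]
> movieyear.leftOuterJoin(moviegross).collect()
# [('Casablanca', (1942, '$3.7M')), ('Argo', (2012, None))] -- order may vary
> movieyear.lookup("Argo")                       # [2012]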

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-33
A Common Join Pattern

§ A common programming pattern


1. Map separate datasets into key-value pair RDDs
2. Join by key
3. Map joined data into the desired format
4. Save, display, or continue processing…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-34
Example: Join Web Log with Knowledge Base Documents (1)

weblogs
56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …
56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …
221.78.60.155 - 45402 "GET /titanic_4000_sales.html HTTP/1.0" …
65.187.255.81 - 14242 "GET /KBDOC-00107.html HTTP/1.0" …

User ID Requested File


join
kblist
KBDOC-00157:Ronin Novelty Note 3 - Back up files
KBDOC-00230:Sorrento F33L - Transfer Contacts
KBDOC-00050:Titanic 1000 - Transfer Contacts
KBDOC-00107:MeeToo 5.0 - Transfer Contacts
KBDOC-00300:iFruit 5A - overheats

Document ID Document Title

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-35
Example: Join Web Log with Knowledge Base Documents (2)

§ Steps
1. Map separate datasets into key-value pair RDDs
a. Map web log requests to (docid,userid)
b. Map KB Doc index to (docid,title)
2. Join by key: docid
3. Map joined data into the desired format: (userid,title)
4. Further processing: group titles by User ID

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-36
Step 1a: Map Web Log Requests to (docid,userid)

Language: Python
> import re
> def getRequestDoc(s):
return re.search(r'KBDOC-[0-9]*',s).group()

> kbreqs = sc.textFile(logfile) \


.filter(lambda line: 'KBDOC-' in line) \
.map(lambda line: (getRequestDoc(line),line.split(' ')[2])) \
.distinct()

56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0" …


56.38.234.188 - 99788 "GET /theme.css HTTP/1.0" …
203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0" …
221.78.60.155 - 45402 "GET kbreqs …
/titanic_4000_sales.html HTTP/1.0"
65.187.255.81 - 14242 "GET /KBDOC-00107.html HTTP/1.0" …
… (KBDOC-00157,99788)
(KBDOC-00203,25254)
(KBDOC-00107,14242)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-37
Step 1b: Map KB Index to (docid,title)

Language: Python
> kblist = sc.textFile(kblistfile) \
.map(lambda line: line.split(':')) \
.map(lambda fields: (fields[0],fields[1]))

KBDOC-00157:Ronin Novelty Note 3 - Back up files


KBDOC-00230:Sorrento F33L - Transfer Contacts
KBDOC-00050:Titanic 1000 - Transfer Contacts
KBDOC-00107:MeeToo 5.0 - Transfer Contacts
KBDOC-00206:iFruit 5A - overheats

kblist
(KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00050,Titanic 1000 - Transfer Contacts)
(KBDOC-00107,MeeToo 5.0 - Transfer Contacts)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-38
Step 2: Join by Key docid

Language: Python
> titlereqs = kbreqs.join(kblist)

kbreqs kblist
(KBDOC-00157,99788) (KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,25254) (KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00107,14242) (KBDOC-00050,Titanic 1000 - Transfer Contacts)
… (KBDOC-00107,MeeToo 5.0 - Transfer Contacts)

(KBDOC-00157,(99788,Ronin Novelty Note 3 - Back up files))


(KBDOC-00230,(25254,Sorrento F33L - Transfer Contacts))
(KBDOC-00107,(14242,MeeToo 5.0 - Transfer Contacts))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-39
Step 3: Map Result to Desired Format (userid,title)

Language: Python
> titlereqs = kbreqs.join(kblist) \
.map(lambda (docid,(userid,title)): (userid,title))

(KBDOC-00157,(99788,Ronin Novelty Note 3 - Back up


files))
(KBDOC-00230,(25254,Sorrento F33L - Transfer Contacts))
(KBDOC-00107,(14242,MeeToo 5.0 - Transfer Contacts))

(99788,Ronin Novelty Note 3 - Back up files)


(25254,Sorrento F33L - Transfer Contacts)
(14242,MeeToo 5.0 - Transfer Contacts)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-40
Step 4: Continue Processing—Group Titles by User ID

Language: Python
> titlereqs = kbreqs.join(kblist) \
.map(lambda (docid,(userid,title)): (userid,title)) \
.groupByKey()

(99788,Ronin Novelty Note 3 - Back up files)


(25254,Sorrento F33L - Transfer Contacts)
(14242,MeeToo 5.0 - Transfer Contacts)

(99788,[Ronin Novelty Note 3 - Back up files,


Ronin S3 - overheating])
(25254,[Sorrento F33L - Transfer Contacts])

(14242,[MeeToo 5.0 - Transfer Contacts,
        MeeToo 5.1 - Back up files,
        iFruit 1 - Back up files,
        MeeToo 3.1 - Transfer Contacts])

Note: values are grouped into Iterables

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-41
Example Output

Language: Python
> for (userid,titles) in titlereqs.take(10):
print 'user id: ',userid
for title in titles: print '\t',title

user id: 99788
	Ronin Novelty Note 3 - Back up files
	Ronin S3 - overheating
user id: 25254
	Sorrento F33L - Transfer Contacts
user id: 14242
	MeeToo 5.0 - Transfer Contacts
	MeeToo 5.1 - Back up files
	iFruit 1 - Back up files
	MeeToo 3.1 - Transfer Contacts
…

(Printed from the grouped pairs on the previous slide, such as
(99788,[Ronin Novelty Note 3 - Back up files, Ronin S3 - overheating]))


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-42
Aside: Anonymous Function Parameters

§ Python and Scala pattern matching can help improve code readability
Language: Python
> map(lambda (docid,(userid,title)): (userid,title))

Language: Scala
> map(pair => (pair._2._1,pair._2._2))

OR
Language: Scala
> map{case (docid,(userid,title)) => (userid,title)}

(KBDOC-00157,(99788,…title…)) (99788,…title…)
(KBDOC-00230,(25254,…title…)) (25254,…title…)
(KBDOC-00107,(14242,…title…)) (14242,…title…)
… …
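§ Note: tuple unpacking in lambda parameters, as shown above, works in the course's Python 2 environment but was removed in Python 3; an index-based form works in both

Language: Python
> map(lambda pair: (pair[1][0], pair[1][1]))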

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-43
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-44
Essential Points

§ Pair RDDs are a special form of RDD consisting of key-value pairs (tuples)
§ Spark provides several operations for working with pair RDDs
§ Map-reduce is a generic programming model for distributed processing
– Spark implements map-reduce with pair RDDs
– Hadoop MapReduce and other implementations are limited to a single
map and single reduce phase per job
– Spark allows flexible chaining of map and reduce operations
– Spark provides operations to easily perform common map-reduce
algorithms like joining, sorting, and grouping

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-45
Chapter Topics

Aggregating Data with Pair RDDs

§ Key-Value Pair RDDs


§ Map-Reduce
§ Other Pair RDD Operations
§ Essential Points
§ Hands-On Exercise: Use Pair RDDs to Join Two Datasets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-46
Hands-On Exercise: Use Pair RDDs to Join Two Datasets

§ In this exercise, you will


– Continue exploring web server log files using key-value pair RDDs
– Join log data with user account data
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 08-47
Writing and Running Apache Spark
Applications
Chapter 9
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-2
Writing and Running Apache Spark Applications

In this chapter you will learn


§ How to write a Spark application
§ How to run a Spark application or the Spark shell on a YARN cluster
§ How to access and use the Spark application web UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-3
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-4
Spark Shell vs. Spark Applications

§ The Spark shell allows interactive exploration and manipulation of data


– REPL using Python or Scala
§ Spark applications run as independent programs
– Python, Scala, or Java
– For jobs such as ETL processing, streaming, and so on

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-5
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-6
The Spark Context

§ Every Spark program needs a SparkContext object


– The interactive shell creates one for you
§ In your own Spark application you create your own SparkContext
object
– Named sc by convention
– Call sc.stop when program terminates

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-7
Python Example: Word Count

import sys
from pyspark import SparkContext

if __name__ == "__main__":
if len(sys.argv) < 2:
print >> sys.stderr, "Usage: WordCount.py <file>"
exit(-1)

sc = SparkContext()

counts = sc.textFile(sys.argv[1]) \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word,1)) \
.reduceByKey(lambda v1,v2: v1+v2)

for pair in counts.take(5): print pair

sc.stop()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-8
Scala Example: Word Count

import org.apache.spark.SparkContext

object WordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: WordCount <file>")
System.exit(1)
}

val sc = new SparkContext()

val counts = sc.textFile(args(0)).


flatMap(line => line.split("\\W")).
map(word => (word,1)).reduceByKey(_ + _)
counts.take(5).foreach(println)

sc.stop()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-9
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-10
Building a Spark Application: Scala or Java

§ Scala or Java Spark applications must be compiled and assembled into JAR
files
– JAR file will be passed to worker nodes
§ Apache Maven is a popular build tool
– For specific setting recommendations, see the Spark Programming
Guide
§ Build details will differ depending on
– Version of Hadoop (HDFS)
– Deployment platform (YARN, Mesos, Spark Standalone)
§ Consider using an Integrated Development Environment (IDE)
– IntelliJ or Eclipse are two popular examples
– Can run Spark locally in a debugger

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-11
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-12
Running a Spark Application

§ The easiest way to run a Spark application is using the spark-submit


script

Python
$ spark-submit WordCount.py fileURL

Scala/Java
$ spark-submit --class WordCount \
    MyJarFile.jar fileURL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-13
Spark Application Cluster Options

§ Spark can run


– Locally
– No distributed processing
– Locally with multiple worker threads
– On a cluster
§ Local mode is useful for development and testing
§ Production use is almost always on a cluster

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-14
Supported Cluster Resource Managers

§ Hadoop YARN
– Included in CDH
– Most common for production sites
– Allows sharing cluster resources with other applications
§ Spark Standalone
– Included with Spark
– Easy to install and run
– Limited configurability and scalability
– No security support
– Useful for learning, testing, development, or small systems
§ Apache Mesos
– First platform supported by Spark
– Not supported by Cloudera

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-15
How Spark Runs on YARN: Client Mode (1)
Node A
NodeManager DataNode

Driver Program Executor


Spark
Context Node B
NodeManager DataNode

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application
Master 1

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-16
How Spark Runs on YARN: Client Mode (2)
Node A
NodeManager DataNode

Driver Program Executor


Spark
Context Node B
NodeManager DataNode

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application
Master 1

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-17
How Spark Runs on YARN: Client Mode (3)
Node A
NodeManager DataNode

Driver Program Executor


Spark
Context Node B
NodeManager DataNode

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application Executor
Master 1

Node D
Driver Program NodeManager DataNode
Spark
Application
Context Executor
Master 2

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-18
How Spark Runs on YARN: Client Mode (4)
Node A
NodeManager DataNode

Driver Program Executor


Spark
Context Node B
NodeManager DataNode

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application Executor
Master 1

Node D
Driver Program NodeManager DataNode
Spark
Application
Context Executor
Master 2

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-19
How Spark Runs on YARN: Cluster Mode (1)
Node A
NodeManager DataNode

Executor

Node B
NodeManager DataNode
submit

Executor Executor
Name
Resource Node C Node
Manager NodeManager
Application Master DataNode
Driver Program
Spark Context

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-20
How Spark Runs on YARN: Cluster Mode (2)
Node A
NodeManager DataNode

Executor

Node B
NodeManager DataNode
submit

Executor Executor
Name
Resource Node C Node
Manager NodeManager
Application Master DataNode
Driver Program
Spark Context

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-21
Running a Spark Application Locally

§ Use spark-submit --master to specify cluster option


– Local options
– local[*] runs locally with as many threads as cores (default)
– local[n] runs locally with n threads
– local runs locally with a single thread

Language: Python
$ spark-submit --master 'local[3]' \
WordCount.py fileURL

Language: Scala/Java
$ spark-submit --master 'local[3]' --class \
WordCount MyJarFile.jar fileURL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-22
Running a Spark Application on a Cluster

§ Use spark-submit --master to specify cluster option


– Cluster options
– yarn-client
– yarn-cluster
– spark://masternode:port (Spark Standalone)
– mesos://masternode:port (Mesos)

Language: Python
$ spark-submit --master yarn-cluster \
WordCount.py fileURL

Language: Scala/Java
$ spark-submit --master yarn-cluster --class \
WordCount MyJarFile.jar fileURL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-23
Starting the Spark Shell on a Cluster

§ The Spark shell can also be run on a cluster


§ pyspark and spark-shell both have a --master option
– yarn (client mode only)
– Spark or Mesos cluster manager URL
– local[*] runs with as many threads as cores (default)
– local[n] runs locally with n worker threads
– local runs locally without distributed processing

Language: Python
$ pyspark --master yarn

Language: Scala
$ spark-shell --master yarn

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-24
Options when Submitting a Spark Application to a Cluster

§ Some other spark-submit options for clusters


• --jars: Additional JAR files (Scala and Java only)
• --py-files: Additional Python files (Python only)
• --driver-java-options: Parameters to pass to the driver JVM
• --executor-memory: Memory per executor (for example: 1000m,
2g) (Default: 1g)
• --packages: Maven coordinates of an external library to include
§ Plus several YARN-specific options
• --num-executors: Number of executors to start
• --executor-cores: Number cores to allocate for each executor
• --queue: YARN queue to submit the application to
§ Show all available options
• --help

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-25
Dynamic Resource Allocation (1)
Node A
NodeManager DataNode Dynamic allocation
Driver Program Executor allows a Spark
Spark
application to add or
Context Node B
release executors as
NodeManager DataNode
needed.

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application
Master 1

Node D
NodeManager DataNode

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-26
Dynamic Resource Allocation (2)
Node A
NodeManager DataNode Dynamic allocation
Driver Program Executor allows a Spark
Spark
application to add or
Context Node B
release executors as
NodeManager DataNode
needed.

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application Executor
Master 1

Node D
NodeManager DataNode

Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-27
Dynamic Resource Allocation (3)
Node A
NodeManager DataNode Dynamic allocation
Driver Program Executor allows a Spark
Spark
application to add or
Context Node B
release executors as
NodeManager DataNode
needed.

Executor Executor
Name
Resource Node C Node
Manager NodeManager DataNode

Application Executor
Master 1

Node D
NodeManager DataNode

Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-28
Dynamic Resource Allocation (4)

§ Dynamic allocation in YARN is enabled by default starting in CDH 5.5


– Enabled at a site level in YARN, not application level
– Can be disabled for an individual application
– Specify the --num-executors flag when using spark-submit
– Or set the property spark.dynamicAllocation.enabled to false
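§ For example, to opt a single application out programmatically, the property can be set on its SparkConf (a sketch; the application name and executor count are illustrative)

Language: Python
from pyspark import SparkConf, SparkContext

sconf = SparkConf() \
    .setAppName("My App") \
    .set("spark.dynamicAllocation.enabled", "false") \
    .set("spark.executor.instances", "4")   # request a fixed number of executors
sc = SparkContext(conf=sconf)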

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-29
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-30
The Spark Application Web UI

The Spark UI lets you monitor running jobs, and view statistics and configuration

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-31
Accessing the Spark UI

§ The web UI is run by the Spark driver


– When running locally: http://localhost:4040
– When running on a cluster, access via the YARN UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-32
Viewing Spark Job History (1)

§ Viewing Spark Job History


– Spark UI is only available while the application is running
– Use Spark History Server to view metrics for a completed application
– Optional Spark component
§ Accessing the History Server
– For local jobs, access by URL
– Example: localhost:18080
– For YARN Jobs, click History link in YARN UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-33
Viewing Spark Job History (2)

§ Spark History Server

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-34
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-35
Essential Points

§ Use the Spark shell for interactive data exploration


§ Write a Spark application to run independently
§ Spark applications require a SparkContext object
§ Use the spark-submit script to run Spark applications
§ Spark is designed to run on a cluster
– Most large production sites deploy on YARN (included in CDH)
§ The resource manager distributes tasks to individual worker nodes in the
cluster
– Tasks run in executors—JVMs running on worker nodes

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-36
Chapter Topics

Writing and Running Apache Spark


Applications

§ Apache Spark Applications vs. Spark Shell


§ Creating the Spark Context
§ Building an Apache Spark Application (Scala and Java)
§ Running an Apache Spark Application
§ The Spark Application Web UI
§ Essential Points
§ Hands-On Exercise: Write and Run an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-37
Building and Running Scala Applications in the
Hands-On Exercises
§ Basic Apache Maven projects are provided in the exercise directory
– stubs: starter Scala file, do exercises here
– solution: final exercise solution

$ mvn package

$ spark-submit \
    --class stubs.CountJPGs \
    target/countjpgs-1.0.jar \
    weblogs.*

Project Directory Structure

+countjpgs
  -pom.xml
  +src
    +main
      +scala
        +solution
          -CountJPGs.scala
        +stubs
          -CountJPGs.scala
  +target
    -countjpgs-1.0.jar

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-38
Hands-On Exercise: Write and Run an Apache Spark Application

§ In this exercise, you will


– Write a Spark application to count JPG requests in a web server log
– If you use Scala, compile and package the application in a JAR file
– Run the application locally to test
– Submit the application to run on the YARN cluster
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 09-39
Configuring Apache Spark Applications
Chapter 10
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-2
Configuring Apache Spark Applications

In this chapter you will learn


§ How to configure Spark application properties, either programmatically or
declaratively
§ How to set logging levels for a Spark application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-3
Chapter Topics

Configuring Apache Spark


Applications

§ Configuring Apache Spark Properties


§ Logging
§ Essential Points
§ Hands-On Exercise: Configure an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-4
Spark Application Configuration

§ Spark provides numerous properties for configuring your application


§ Some example properties
– spark.master
– spark.app.name
– spark.local.dir: Where to store local files such as shuffle output
(default /tmp)
– spark.ui.port: Port to run the Spark Application UI (default 4040)
– spark.executor.memory: How much memory to allocate to each
Executor (default 1g)
– spark.driver.memory: How much memory to allocate to the
driver in client mode (default 1g)
– And many more...
– See Spark Configuration page in the Spark documentation for more
details

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-5
Spark Application Configuration Options

§ Spark applications can be configured


– Declaratively, using spark-submit options or properties files
– Programmatically, within your application code

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-6
Declarative Configuration Options

§ spark-submit script
– Examples:
– spark-submit --driver-memory 500M
– spark-submit --conf spark.executor.cores=4
§ Properties file
– Tab- or space-separated list of properties and values
– Load with spark-submit --properties-file filename
– Example:
spark.master yarn-cluster
spark.local.dir /tmp
spark.ui.port 4141
§ Site defaults properties file
– SPARK_HOME/conf/spark-defaults.conf
– Template file provided

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-7
Setting Configuration Properties Programmatically

§ Spark configuration settings are part of the Spark context


§ Configure using a SparkConf object
§ Some example set functions
– setAppName(name)
– setMaster(master)
– set(property-name, value)
§ set functions return a SparkConf object to support chaining

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-8
SparkConf Example (Python)

import sys
from pyspark import SparkContext
from pyspark import SparkConf

if __name__ == "__main__":
if len(sys.argv) < 2:
print >> sys.stderr, "Usage: WordCount <file>"
exit(-1)

sconf = SparkConf() \
.setAppName("Word Count") \
.set("spark.ui.port","4141")
sc = SparkContext(conf=sconf)

counts = sc.textFile(sys.argv[1]) \
.flatMap(lambda line: line.split()) \
.map(lambda w: (w,1)) \
.reduceByKey(lambda v1,v2: v1+v2)

for pair in counts.take(5): print pair


sc.stop()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-9
SparkConf Example (Scala)

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object WordCount {
def main(args: Array[String]) {
if (args.length < 1) {
System.err.println("Usage: WordCount <file>")
System.exit(1)
}

val sconf = new SparkConf()


.setAppName("Word Count")
.set("spark.ui.port","4141")
val sc = new SparkContext(sconf)

val counts = sc.textFile(args(0)).


flatMap(line => line.split("\\W")).
map(word => (word,1)).
reduceByKey(_ + _)
counts.take(5).foreach(println)
sc.stop()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-10
Viewing Spark Properties

§ You can view the Spark property settings in the Spark Application UI
– Environment tab

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-11
Chapter Topics

Configuring Apache Spark


Applications

§ Configuring Apache Spark Properties


§ Logging
§ Essential Points
§ Hands-On Exercise: Configure an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-12
Spark Logging

§ Spark uses Apache Log4j for logging


– Allows for controlling logging at runtime using a properties file
– Enable or disable logging, set logging levels, select output
destination
– For more info see http://logging.apache.org/log4j/1.2/
§ Log4j provides several logging levels
– TRACE
– DEBUG
– INFO
– WARN
– ERROR
– FATAL
– OFF

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-13
Spark Log Files (1)

§ Log file locations depend on your cluster management platform


§ YARN
– If log aggregation is off, logs are stored locally on each worker node
– If log aggregation is on, logs are stored in HDFS
– Default /var/log/hadoop-yarn
– Access with yarn logs command or YARN Resource Manager UI

$ yarn application -list

Application-Id Application-Name Application-Type…


application_1441395433148_0003 Spark shell SPARK …
application_1441395433148_0001 myapp.jar MAPREDUCE …

$ yarn logs -applicationId <appid>


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-14
Spark Log Files (2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-15
Configuring Spark Logging (1)

§ Logging levels can be set for the cluster, for individual applications, or even
for specific components or subsystems
§ Default for machine: SPARK_HOME/conf/log4j.properties*
– Start by copying log4j.properties.template
log4j.properties.template

# Set everything to be logged to the console
# (default for all Spark applications)
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err

# Default override for the Spark shell (Scala)
log4j.logger.org.apache.spark.repl.Main=WARN
* Located in /usr/lib/spark/conf on course VM
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-16
Configuring Spark Logging (2)

§ Copy log4j.properties.template to log4j.properties


– May require administrator privileges
§ Modify the rootCategory or repl.Main settings

log4j.properties

# Set everything to be logged to the console
# (default for all Spark applications)
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err

# Default override for the Spark shell (Scala)
log4j.logger.org.apache.spark.repl.Main=DEBUG

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-17
Configuring Spark Logging (3)

§ Logging in the Spark shell can be configured interactively


– The setLogLevel method sets the logging level temporarily
– Added in Spark 1.4

Language: Python/Scala

> sc.setLogLevel("ERROR")

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-18
Chapter Topics

Configuring Apache Spark


Applications

§ Configuring Apache Spark Properties


§ Logging
§ Essential Points
§ Hands-On Exercise: Configure an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-19
Essential Points

§ Spark configuration parameters can be set declaratively using the


spark-submit script or a properties file, or set programmatically using
a SparkConf object
§ Spark uses Log4j for logging
– Configure using a log4j.properties file or sc.setLogLevel

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-20
Chapter Topics

Configuring Apache Spark


Applications

§ Configuring Apache Spark Properties


§ Logging
§ Essential Points
§ Hands-On Exercise: Configure an Apache Spark Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-21
Hands-On Exercise: Configure an Apache Spark Application

§ In this exercise, you will


– Set properties using spark-submit
– Set properties in a properties file
– Change the logging levels in a log4j.properties file
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 10-22
Parallel Processing in Apache Spark
Chapter 11
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-2
Parallel Processing in Apache Spark

In this chapter you will learn


§ How RDDs are distributed across a cluster
§ How Apache Spark partitions file-based RDDs
§ How Spark executes RDD operations in parallel
§ How to control parallelization through partitioning
§ How to view and monitor tasks and stages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-3
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-4
Review of Spark on YARN (1)
Worker Nodes

$ spark-submit \
--master yarn-client \
--class MyClass \
MyApp.jar

Resource Name
Manager Node

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-5
Review of Spark on YARN (2)
$ spark-submit \
    --master yarn-client \
    --class MyClass \
    MyApp.jar

[Diagram: the Driver Program and its Spark Context run on the client; YARN has allocated containers on the worker nodes; Resource Manager and Name Node shown]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-6
Review of Spark on YARN (3)
$ spark-submit \
    --master yarn-client \
    --class MyClass \
    MyApp.jar

[Diagram: the Driver Program and its Spark Context run on the client; an Executor runs in a container on each worker node; Resource Manager and Name Node shown]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-7
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-8
RDDs on a Cluster

§ Resilient Distributed Datasets
– Data is partitioned across worker nodes
§ Partitioning is done automatically by Spark
– Optionally, you can control how many partitions are created

[Diagram: RDD 1 is split into partitions rdd_1_0, rdd_1_1, and rdd_1_2, one per executor]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-9
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-10
File Partitioning: Single Files
Language: Python/Scala
sc.textFile("myfile",3)

§ Partitions from single files
– Partitions based on size
– You can optionally specify a minimum number of partitions
  textFile(file, minPartitions)
– Default is two when running on a cluster
– Default is one when running locally with a single thread
– More partitions = more parallelization

[Diagram: the data in myfile is split across three RDD partitions, one per executor]
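§ You can check how a file was actually split with getNumPartitions (a sketch, assuming sc and a file named myfile)

Language: Python
> rdd = sc.textFile("myfile", 3)   # request at least 3 partitions
> rdd.getNumPartitions()           # 3 or more, depending on file size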

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-11
File Partitioning: Multiple Files
§ sc.textFile("mydir/*")
– Each file becomes (at least) one partition
– File-based operations can be done per-partition, for example parsing XML
§ sc.wholeTextFiles("mydir")
– For many small files
– Creates a key-value PairRDD
  – key = file name
  – value = file contents

[Diagram: each file (file1, file2, …) loads into its own RDD partition on an executor]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-12
Operating on Partitions

§ Most RDD operations work on each element of an RDD


§ A few work on each partition
– foreachPartition calls a function for each partition
– mapPartitions creates a new RDD by executing a function on each
partition in the current RDD
– mapPartitionsWithIndex works the same as mapPartitions
but includes index of the partition
§ Functions for partition operations take iterators

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-13
Example: foreachPartition

§ Example: Print out the first line of each partition


Language: Python
def printFirstLine(iter):
print iter.next()

myrdd = …
myrdd.foreachPartition(lambda i: printFirstLine(i))

Language: Scala
def printFirstLine(iter: Iterator[Any]) = {
println(iter.next)
}

val myrdd = …
myrdd.foreachPartition(printFirstLine)
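§ A similar sketch using mapPartitionsWithIndex to count the records in each partition (assuming myrdd as above)

Language: Python
def countInPartition(index, iter):
    # yield one (partition index, record count) pair per partition
    yield (index, sum(1 for _ in iter))

counts = myrdd.mapPartitionsWithIndex(countInPartition)
print counts.collect()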

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-14
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-15
HDFS and Data Locality (1)

Node A

Node B

Node C

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-16
HDFS and Data Locality (2)

$ hdfs dfs -put mydata

HDFS:
Node A mydata
HDFS
Block 1

Node B
HDFS
Block 2

Node C
HDFS
Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-17
HDFS and Data Locality (3)

sc.textFile("hdfs://…mydata").collect()

HDFS:
Driver Program Node A mydata
Executor HDFS
Spark
Block 1
Context

Node B
Executor HDFS
Block 2

Node C
Executor HDFS
Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-18
HDFS and Data Locality (4)

sc.textFile("hdfs://…mydata").collect()

By default, Spark partitions file-based RDDs by block. Each block loads into a single partition.
HDFS:
RDD
Driver Program Node A mydata
Executor HDFS
Spark
Block 1
Context

Node B
Executor HDFS
Block 2

Node C
Executor HDFS
Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-19
HDFS and Data Locality (5)

sc.textFile("hdfs://…mydata").collect()

An action triggers execution: tasks on executors load data from blocks into partitions.

HDFS:
RDD
Driver Program Node A mydata
Executor HDFS
Spark task Block 1
Context

Node B
Executor HDFS
task Block 2

Node C
Executor HDFS
task Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-20
HDFS and Data Locality (6)

sc.textFile("hdfs://…mydata").collect()

Data is distributed across executors until an action returns a value to the driver.

HDFS:
RDD
Driver Program Node A mydata
Executor HDFS
Spark
Block 1
Context

Node B
Executor HDFS
Block 2

Node C
Executor HDFS
Block 3

Node D

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-21
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-22
Parallel Operations on Partitions

§ RDD operations are executed in parallel on each partition


– When possible, tasks execute on the worker nodes where the data is stored
§ Some operations preserve partitioning
– Such as map, flatMap, or filter
§ Some operations repartition
– Such as reduceByKey, sortByKey, join, or groupByKey

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-23
Example: Average Word Length by Letter (1)

Language: Python
> avglens = sc.textFile(file)

RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-24
Example: Average Word Length by Letter (2)

Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' '))

RDD RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-25
Example: Average Word Length by Letter (3)

Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word)))

RDD RDD RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-26
Example: Average Word Length by Letter (4)

Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word))) \
.groupByKey()

RDD RDD RDD


RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-27
Example: Average Word Length by Letter (5)

Language: Python
> avglens = sc.textFile(file) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))

RDD RDD RDD


RDD RDD

HDFS:
mydata

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-28
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-29
Stages

§ Operations that can run on the same partition are executed in stages
§ Tasks within a stage are pipelined together
§ Developers should be aware of stages to improve performance

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-30
Spark Execution: Stages (1)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

Stage 0 Stage 1
RDD RDD RDD
RDD RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-31
Spark Execution: Stages (2)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

Stage 0 Stage 1

Task 1
Task 5
Task 2
Task 3 Task 6

Task 4

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-32
Spark Execution: Stages (3)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

Stage 0 Stage 1

Task 1
Task 5
Task 2
Task 3 Task 6

Task 4

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-33
Spark Execution: Stages (4)
Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

Stage 0 Stage 1

Task 1
Task 5
Task 2
Task 3 Task 6

Task 4

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-34
Summary of Spark Terminology

§ Job—a set of tasks executed as a result of an action


§ Stage—a set of tasks in a job that can be executed in parallel
§ Task—an individual unit of work sent to one executor
§ Application—the set of jobs managed by a single driver

Job Stage
Task
RDD RDD RDD
RDD RDD

Stage
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-35
How Spark Calculates Stages

§ Spark constructs a DAG (Directed Acyclic Graph) of RDD dependencies


§ Narrow dependencies
– Each partition in the child RDD depends on just one partition of the parent RDD
– No shuffle required between executors
– Can be collapsed into a single stage
– Examples: map, filter, and union
§ Wide (or shuffle) dependencies
– Child partitions depend on multiple partitions in the parent RDD
– Defines a new stage
– Examples: reduceByKey, join, and groupByKey

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-36
Controlling the Level of Parallelism

§ Wide operations (such as reduceByKey) partition resulting RDDs


– More partitions = more parallel tasks
– Cluster will be under-utilized if there are too few partitions
§ You can control how many partitions
– Optional numPartitions parameter in function call

> words.reduceByKey(lambda v1, v2: v1 + v2, 15)

– Configure the spark.default.parallelism property

spark.default.parallelism 10

– If you do not specify either, the default is the number of partitions of the parent RDD
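§ A quick way to confirm the effect (a sketch, assuming a pair RDD named words as above)

Language: Python
> counts = words.reduceByKey(lambda v1,v2: v1+v2, 15)
> counts.getNumPartitions()   # 15
> sc.defaultParallelism       # reflects spark.default.parallelism when it is set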

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-37
Viewing Stages in the Spark Application UI (1)

§ You can view jobs and stages in the Spark Application UI

Jobs are identified by the action that triggered the job execution

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-38
Viewing Stages in the Spark Application UI (2)

§ Select the job to view execution stages

Stages are identified by the last operation
Number of tasks = number of partitions
Data shuffled between stages is shown

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-39
Viewing Stages in the Spark Application UI (3)

§ Click DAG Visualization for an


interactive map of stages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-40
Viewing the Stages Using toDebugString (Scala)

Language: Scala
> val avglens = sc.textFile(myfile).
flatMap(line => line.split(' ')).
map(word => (word(0),word.length)).
groupByKey().
map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))

> avglens.toDebugString

(2) MappedRDD[5] at map at …                } Stage 1
 |  ShuffledRDD[4] at groupByKey at …       }
 +-(4) MappedRDD[3] at map at …             } Stage 0
    |  FlatMappedRDD[2] at flatMap at …     }
    |  myfile MappedRDD[1] at textFile at … }
    |  myfile HadoopRDD[0] at textFile at … }

Indents indicate stages (shuffle boundaries)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-41
Viewing the Stages Using toDebugString (Python)
Language: Python
> avglens = sc.textFile(myfile) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word[0],len(word))) \
.groupByKey() \
.map(lambda (k, values): \
(k, sum(values)/len(values)))

> print avglens.toDebugString()


(2) PythonRDD[13] at RDD at …               } Stage 1
 |  MappedRDD[12] at values at …            }
 |  ShuffledRDD[11] at partitionBy at …     }
 +-(4) PairwiseRDD[10] at groupByKey at …   } Stage 0
    |  PythonRDD[9] at groupByKey at …      }
    |  myfile MappedRDD[7] at textFile at … }
    |  myfile HadoopRDD[6] at textFile at … }

Indents indicate stages (shuffle boundaries)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-42
Spark Task Execution (1)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: on the client, the driver program (Spark Context) holds the Stage 0 tasks (Tasks 1-4) and Stage 1 tasks (Tasks 5-6); executors on HDFS Nodes A-D each hold one block of the input file (Blocks 1-4)]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-43
Spark Task Execution (2)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: the driver sends the Stage 0 tasks to the executors: Task 1 to Node A (Block 1), Task 2 to Node B (Block 2), Task 3 to Node C (Block 3), Task 4 to Node D (Block 4); the Stage 1 tasks (Tasks 5-6) remain queued at the driver]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-44
Spark Task Execution (3)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: each Stage 0 task processes its local block and writes shuffle data on its node; the Stage 1 tasks (Tasks 5-6) still wait at the driver]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-45
Spark Task Execution (4)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 0 is complete; the shuffle data remains on each node and the Stage 1 tasks (Tasks 5-6) are still queued at the driver]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-46
Spark Task Execution (5)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: the driver sends the Stage 1 tasks to executors, Task 5 to Node B and Task 6 to Node C, where they read the shuffle data produced by Stage 0]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-47
Spark Task Execution (6)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.saveAsTextFile("avglen-output")

[Diagram: Task 5 writes part-00000 and Task 6 writes part-00001 to the avglen-output directory in HDFS]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-48
Spark Task Execution (Alternate Ending)

val avglens = sc.textFile(myfile).
  flatMap(line => line.split(' ')).
  map(word => (word(0),word.length)).
  groupByKey().
  map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
avglens.collect()

[Diagram: with collect(), Tasks 5 and 6 return their results to the driver program instead of writing output files to HDFS]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-49
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-50
Essential Points

§ RDDs are processed in the memory of Spark executor JVMs


§ Data is split into partitions—each partition in a separate executor
§ RDD operations are executed on partitions in parallel
§ Operations that depend on the same partition are pipelined together in
stages
– Examples: map and filter
§ Operations that depend on multiple partitions are executed in separate
stages
– Examples: join and reduceByKey

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-51
Chapter Topics

Parallel Processing in Apache Spark

§ Review: Apache Spark on a Cluster


§ RDD Partitions
§ Partitioning of File-Based RDDs
§ HDFS and Data Locality
§ Executing Parallel Operations
§ Stages and Tasks
§ Essential Points
§ Hands-On Exercise: View Jobs and Stages in the Spark Application UI

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-52
Hands-On Exercise: View Jobs and Stages in the Spark
Application UI
§ In this exercise, you will
– Use the Spark Application UI to view how jobs, stages, and tasks are
executed in a job
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 11-53
RDD Persistence
Chapter 12
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-2
RDD Persistence

In this chapter you will learn


§ How Apache Spark uses an RDD’s lineage in operations
§ How to persist RDDs to improve performance

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-3
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-4
Lineage Example (1)
File: purplecow.txt
§ Each transformation operation I've never seen a purple cow.
creates a new child RDD I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-5
Lineage Example (2)
File: purplecow.txt
§ Each transformation operation I've never seen a purple cow.
creates a new child RDD I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1] (mydata)
Language: Python
> mydata = sc.textFile("purplecow.txt")

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-6
Lineage Example (3)
File: purplecow.txt
§ Each transformation operation I've never seen a purple cow.
creates a new child RDD I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD[1] (mydata)
Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))

RDD[2]

RDD[3] (myrdd)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-7
Lineage Example (4)
File: purplecow.txt
§ Spark keeps track of the parent RDD I've never seen a purple cow.
for each new RDD I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
§ Child RDDs depend on their parents
RDD[1] (mydata)
Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))

RDD[2]

RDD[3] (myrdd)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-8
Lineage Example (5)
§ Action operations execute the parent transformations

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))
> myrdd.count()
3

RDD[1] (mydata): the four lines of purplecow.txt
RDD[2]: the four lines converted to upper case
RDD[3] (myrdd): the three upper-case lines that start with 'I'

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-9
Lineage Example (6)
File: purplecow.txt
§ Each action re-executes the lineage I've never seen a purple cow.
transformations starting with the I never hope to see one;
But I can tell you, anyhow,
base I'd rather see than be one.

– By default RDD[1] (mydata)


Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))
> myrdd.count()
3 RDD[2]
> myrdd.count()

RDD[3] (myrdd)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-10
Lineage Example (7)
File: purplecow.txt
§ Each action re-executes the lineage I've never seen a purple cow.
transformations starting with the I never hope to see one;
But I can tell you, anyhow,
base I'd rather see than be one.

– By default RDD[1] (mydata)


Language: Python I've never seen a purple cow.
> mydata = sc.textFile("purplecow.txt") I never hope to see one;
> myrdd = mydata.map(lambda s: s.upper())\ But I can tell you, anyhow,
.filter(lambda s:s.startswith('I')) I'd rather see than be one.
> myrdd.count()
3 RDD[2]
I'VE NEVER SEEN A PURPLE COW.
> myrdd.count()
3 I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

RDD[3] (myrdd)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-11
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-12
RDD Persistence (1)
File: purplecow.txt
§ Persisting an RDD saves the data (in I've never seen a purple cow.
memory, by default) I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-13
RDD Persistence (2)
File: purplecow.txt
§ Persisting an RDD saves the data (in I've never seen a purple cow.
memory, by default) I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())

RDD[2] (myrdd1)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-14
RDD Persistence (3)
File: purplecow.txt
§ Persisting an RDD saves the data (in I've never seen a purple cow.
memory, by default) I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())
> myrdd1.persist()
RDD[2] (myrdd1)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-15
RDD Persistence (4)
File: purplecow.txt
§ Persisting an RDD saves the data (in I've never seen a purple cow.
memory, by default) I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda \ RDD[2] (myrdd1)
s:s.startswith('I'))

RDD[3] (myrdd2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-16
RDD Persistence (5)
§ Persisting an RDD saves the data (in memory, by default)

File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3

RDD[1] (mydata): the four lines of purplecow.txt
RDD[2] (myrdd1, persisted): the four lines converted to upper case
RDD[3] (myrdd2): the three upper-case lines that start with 'I'

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-17
RDD Persistence (6)
File: purplecow.txt
§ Subsequent operations use saved I've never seen a purple cow.
data I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda \ RDD[2] (myrdd1)
I'VE NEVER SEEN A PURPLE COW.
s:s.startswith('I'))
I NEVER HOPE TO SEE ONE;
> myrdd2.count()
BUT I CAN TELL YOU, ANYHOW,
3
I'D RATHER SEE THAN BE ONE.
> myrdd2.count()
RDD[3] (myrdd2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-18
RDD Persistence (7)
File: purplecow.txt
§ Subsequent operations use saved I've never seen a purple cow.
data I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Language: Python RDD[1] (mydata)


> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s:
s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda \ RDD[2] (myrdd1)
I'VE NEVER SEEN A PURPLE COW.
s:s.startswith('I'))
I NEVER HOPE TO SEE ONE;
> myrdd2.count()
BUT I CAN TELL YOU, ANYHOW,
3
I'D RATHER SEE THAN BE ONE.
> myrdd2.count()
3 RDD[3] (myrdd2)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-19
Memory Persistence

§ In-memory persistence is a suggestion to Spark


– If not enough memory is available, persisted partitions will be cleared
from memory
– Least recently used partitions cleared first
– Transformations will be re-executed using the lineage when needed

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-20
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-21
Persistence and Fault-Tolerance

§ RDD = Resilient Distributed Dataset


– Resiliency is a product of tracking lineage
– RDDs can always be recomputed from their base if needed

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-22
Distributed Persistence

§ RDD partitions are distributed across a cluster


§ By default, partitions are persisted in memory in Executor JVMs
RDD
Node A
Driver Executor
task rdd_1_0

Node B
Executor
task rdd_1_1

Node C
Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-23
RDD Fault-Tolerance (1)

§ What happens if a partition persisted in memory becomes unavailable?

RDD
Node A
Driver Executor
task rdd_1_0

Node B

Node C
Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-24
RDD Fault-Tolerance (2)

§ The driver starts a new task to recompute the partition on a different node
§ Lineage is preserved, data is never lost
RDD
Node A
Driver Executor
task rdd_1_0

Node B

Node C
Executor
task rdd_1_1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-25
Persistence Levels

§ By default, the persist method stores data in memory only


§ The persist method offers other options called storage levels
§ Storage levels let you control
– Storage location (memory or disk)
– Format in memory
– Partition replication

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-26
Persistence Levels: Storage Location

§ Storage location—where is the data stored?


– MEMORY_ONLY: Store data in memory if it fits
– MEMORY_AND_DISK: Store partitions on disk if they do not fit in
memory
– Called spilling
– DISK_ONLY: Store all partitions on disk
Language: Python
> from pyspark import StorageLevel
> myrdd.persist(StorageLevel.DISK_ONLY)

Language: Scala
> import org.apache.spark.storage.StorageLevel
> myrdd.persist(StorageLevel.DISK_ONLY)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-27
Persistence Levels: Memory Format

§ Serialization—you can choose to serialize the data in memory


– MEMORY_ONLY_SER and MEMORY_AND_DISK_SER
– Much more space efficient
– Less time efficient
– If using Java or Scala, choose a fast serialization library such as Kryo
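A brief hedged example (the RDD name is a placeholder). Note that PySpark already stores RDD data in serialized form, so the _SER levels mainly matter for Scala and Java applications:

Language: Python
from pyspark import StorageLevel

# Keep partitions in memory as serialized objects (no spilling to disk)
myrdd.persist(StorageLevel.MEMORY_ONLY_SER)

# For Scala/Java applications, Kryo is enabled with the configuration property
# spark.serializer=org.apache.spark.serializer.KryoSerializer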

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-28
Persistence Levels: Partition Replication

§ Replication—store partitions on two nodes


– DISK_ONLY_2
– MEMORY_AND_DISK_2
– MEMORY_ONLY_2
– MEMORY_AND_DISK_SER_2
– MEMORY_ONLY_SER_2
– You can also define custom storage levels
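A hedged sketch (the RDD names are placeholders) of a replicated level and a custom storage level built with the StorageLevel constructor:

Language: Python
from pyspark import StorageLevel

# Keep two in-memory copies of each partition, on different nodes
myrdd.persist(StorageLevel.MEMORY_ONLY_2)

# Custom level: useDisk, useMemory, useOffHeap, deserialized, replication
customLevel = StorageLevel(True, True, False, False, 2)
otherrdd.persist(customLevel)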

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-29
Default Persistence Levels

§ The storageLevel parameter for the persist() operation is optional
– If no storage level is specified, the default value depends on the language
– Scala default: MEMORY_ONLY
– Python default: MEMORY_ONLY_SER
§ cache() is a synonym for persist() with no storage level specified

myrdd.cache()

is equivalent to

myrdd.persist()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-30
Disk Persistence

§ Disk-persisted partitions are stored in local files

RDD
Client Node A
Driver Executor
task rdd_0_0

Node B
Executor
rdd_0_1
task rdd_0_1

Node C
Executor

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-31
Disk Persistence with Replication (1)

§ Persistence replication makes recomputation less likely to be necessary

RDD
Client Node A
Driver Executor
task rdd_0_0

Node B
Executor
rdd_0_1
task rdd_0_1

Node C
Executor
rdd_0_1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-32
Disk Persistence with Replication (2)

§ Replicated data on disk will be used to recreate the partition if possible


– Will be recomputed if the data is unavailable
– For example, when the node is down
RDD
Client Node A
Driver Executor
task rdd_0_0

Node B

Node C
Executor
task rdd_0_1 rdd_1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-33
When and Where to Persist

§ When should you persist a dataset?


– When a dataset is likely to be re-used
– Such as in iterative algorithms and machine learning
§ How to choose a persistence level
– Memory only—choose when possible, best performance
– Save space by saving as serialized objects in memory if necessary
– Disk—choose when recomputation is more expensive than disk read
– Such as with expensive functions or filtering large datasets
– Replication—choose when recomputation is more expensive than
memory

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-34
Changing Persistence Options

§ To stop persisting and remove from memory and disk


– rdd.unpersist()
§ To change an RDD to a different persistence level
– Unpersist first
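For example (a minimal sketch; myrdd is a placeholder):

Language: Python
from pyspark import StorageLevel

myrdd.unpersist()                       # drop the current storage level
myrdd.persist(StorageLevel.DISK_ONLY)   # re-persist at the new level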

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-35
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-36
Essential Points

§ Spark keeps track of each RDD’s lineage


– Provides fault tolerance
§ By default, every RDD operation executes the entire lineage
§ If an RDD will be used multiple times, persist it to avoid re-computation
§ Persistence options
– Location—memory only, memory and disk, disk only
– Format—in-memory data can be serialized to save memory (but at the
cost of performance)
– Replication—saves data on multiple locations in case a node goes down,
for job recovery without recomputation

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-37
Chapter Topics

RDD Persistence

§ RDD Lineage
§ RDD Persistence Overview
§ Distributed Persistence
§ Essential Points
§ Hands-On Exercise: Persist an RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-38
Hands-On Exercises: Persist an RDD

§ In this exercise, you will


– Persist an RDD before reusing it
– Use the Spark Application UI to see how an RDD is persisted
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 12-39
Common Patterns in Apache Spark
Data Processing
Chapter 13
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-2
Common Patterns in Apache Spark Data Processing

In this chapter you will learn


§ At what kinds of processing and analysis Apache Spark is best
§ How to implement an iterative algorithm in Spark
§ What major features and benefits are provided by Spark’s machine
learning libraries

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-3
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-4
Common Spark Use Cases (1)

§ Spark is especially useful when working with any combination of:


– Large amounts of data
– Distributed storage
– Intensive computations
– Distributed computing
– Iterative algorithms
– In-memory processing and pipelining

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-5
Common Spark Use Cases (2)

§ Risk analysis
– “How likely is this borrower to pay back a loan?”
§ Recommendations
– “Which products will this customer enjoy?”
§ Predictions
– “How can we prevent service outages instead of simply reacting to
them?”
§ Classification
– “How can we tell which mail is spam and which is legitimate?”

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-6
Spark Examples

§ Spark includes many example programs that demonstrate some common


Spark programming patterns and algorithms
– k-means
– Logistic regression
– Calculating pi
– Alternating least squares (ALS)
– Querying Apache web logs
– Processing Twitter feeds
§ Examples
– SPARK_HOME/lib*
– spark-examples.jar: Java and Scala examples
– python.tar.gz: Pyspark examples

*SPARK_HOME is /usr/lib/spark on the course VM


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-7
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-8
Example: PageRank

§ PageRank gives web pages a ranking score based on links from other pages
– Higher scores given for more links, and links from other high ranking
pages
§ PageRank is a classic example of big data analysis (like word count)
– Lots of data: Needs an algorithm that is distributable and scalable
– Iterative: The more iterations, the better the answer

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-9
PageRank Algorithm (1)

1. Start each page with a rank of 1.0

Page 1
1.0

Page 2 Page 3
1.0 1.0
Page 4
1.0

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-10
PageRank Algorithm (2)

1. Start each page with a rank of 1.0


2. On each iteration:
a. Each page contributes to its neighbors its own rank divided by the
number of its neighbors: contribp = rankp / neighborsp

Page 1
1.0

Page 2 Page 3
1.0 1.0
Page 4
1.0

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-11
PageRank Algorithm (3)

1. Start each page with a rank of 1.0


2. On each iteration:
a. Each page contributes to its neighbors its own rank divided by the
number of its neighbors: contribp = rankp / neighborsp
b. Set each page’s new rank based on the sum of its neighbors’ contributions:
new_rank = Σcontrib * 0.85 + 0.15

Page 1 Iteration 1
1.85

Page 2 Page 3
0.58 1.0
Page 4
0.58

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-12
PageRank Algorithm (4)

1. Start each page with a rank of 1.0


2. On each iteration:
a. Each page contributes to its neighbors its own rank divided by the
number of its neighbors: contribp = rankp / neighborsp
b. Set each page’s new rank based on the sum of its neighbors’ contributions:
new_rank = Σcontrib * 0.85 + 0.15
3. Each iteration incrementally improves the page ranking

Page 1 Iteration 2
1.31

Page 2 Page 3
0.39 1.7
Page 4
0.57

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-13
PageRank Algorithm (5)

1. Start each page with a rank of 1.0


2. On each iteration:
a. Each page contributes to its neighbors its own rank divided by the
number of its neighbors: contribp = rankp / neighborsp
b. Set each page’s new rank based on the sum of its neighbors’ contributions:
new_rank = Σcontrib * 0.85 + 0.15
3. Each iteration incrementally improves the page ranking

Page 1 Iteration 10
1.43 (Final)
Page 2 Page 3
0.46 1.38
Page 4
0.73

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-14
PageRank in Spark: Neighbor Contribution Function

Language: Python
def computeContribs(neighbors, rank):
for neighbor in neighbors: yield(neighbor, rank/len(neighbors))

Example: with neighbors = [page1, page2] and rank = 1.0, the function yields (page1, 0.5) and (page2, 0.5)

[Diagram: Page 4 (rank 1.0) links to Page 1 and Page 2, contributing 0.5 to each]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-15
PageRank in Spark: Example Data

Data format: source-page destination-page (one link per line)

page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

[Diagram: the link graph among Page 1, Page 2, Page 3, and Page 4]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-16
PageRank in Spark: Pairs of Page Links

page1 page3
Language: Python page2 page1
def computeContribs(neighbors, rank):… page4 page1
page3 page1
page4 page2
links = sc.textFile(file)\ page3 page4
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct() (page1,page3)
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-17
PageRank in Spark: Page Links Grouped by Source Page

page1 page3
Language: Python page2 page1
def computeContribs(neighbors, rank):… page4 page1
page3 page1
page4 page2
links = sc.textFile(file)\ page3 page4
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct()\ (page1,page3)
.groupByKey()
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)

links
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-18
PageRank in Spark: Persisting the Link Pair RDD

page1 page3
Language: Python page2 page1
def computeContribs(neighbors, rank):… page4 page1
page3 page1
page4 page2
links = sc.textFile(file)\ page3 page4
.map(lambda line: line.split())\
.map(lambda pages: (pages[0],pages[1]))\
.distinct()\ (page1,page3)
.groupByKey()\
(page2,page1)
.persist()
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)

links
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-19
PageRank in Spark: Set Initial Ranks

Language: Python links


def computeContribs(neighbors, rank):… (page4, [page2,page1])
(page2, [page1])
links = sc.textFile(file)\ (page3, [page1,page4])
.map(lambda line: line.split())\
(page1, [page3])
.map(lambda pages: (pages[0],pages[1]))\
.distinct()\
.groupByKey()\
ranks
.persist()
(page4, 1.0)

ranks=links.map(lambda (page,neighbors): (page,1.0)) (page2, 1.0)


(page3, 1.0)
(page1, 1.0)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-20
PageRank in Spark: First Iteration (1)

Language: Python
def computeContribs(neighbors, rank):… links ranks
(page4, [page2,page1]) (page4, 1.0)
links = …
(page2, [page1]) (page2, 1.0)

ranks = … (page3, [page1,page4]) (page3, 1.0)


(page1, [page3]) (page1, 1.0)
for x in xrange(10):
contribs=links\
.join(ranks) (page4, ([page2,page1], 1.0))
(page2, ([page1], 1.0))
(page3, ([page1,page4], 1.0))
(page1, ([page3], 1.0))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-21
PageRank in Spark: First Iteration (2)

Language: Python
def computeContribs(neighbors, rank):… links ranks
(page4, [page2,page1]) (page4, 1.0)
links = …
(page2, [page1]) (page2, 1.0)

ranks = … (page3, [page1,page4]) (page3, 1.0)


(page1, [page3]) (page1, 1.0)
for x in xrange(10):
contribs=links\
.join(ranks)\ (page4, ([page2,page1], 1.0))
.flatMap(lambda (page,(neighbors,rank)): \ (page2, ([page1], 1.0))
computeContribs(neighbors,rank))
(page3, ([page1,page4], 1.0))
(page1, ([page3], 1.0))

contribs
(page2,0.5)
(page1,0.5)
(page1,1.0)
(page1,0.5)
(page4,0.5)
(page3,1.0)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-22
PageRank in Spark: First Iteration (3)

contribs
Language: Python (page2,0.5)
def computeContribs(neighbors, rank):… (page1,0.5)
(page1,1.0)
links = …
(page1,0.5)

ranks = … (page4,0.5)
(page3,1.0)
for x in xrange(10):
contribs=links\
.join(ranks)\ (page4,0.5)
.flatMap(lambda (page,(neighbors,rank)): \
(page2,0.5)
computeContribs(neighbors,rank))
ranks=contribs\ (page3,1.0)
.reduceByKey(lambda v1,v2: v1+v2) (page1,2.0)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-23
PageRank in Spark: First Iteration (4)

contribs
Language: Python (page2,0.5)
def computeContribs(neighbors, rank):… (page1,0.5)
(page1,1.0)
links = …
(page1,0.5)

ranks = … (page4,0.5)
(page3,1.0)
for x in xrange(10):
contribs=links\
.join(ranks)\ (page4,0.5)
.flatMap(lambda (page,(neighbors,rank)): \
(page2,0.5)
computeContribs(neighbors,rank))
ranks=contribs\ (page3,1.0)
.reduceByKey(lambda v1,v2: v1+v2)\ (page1,2.0)
.map(lambda (page,contrib): \
(page,contrib * 0.85 + 0.15)) ranks
(page4,.58)
(page2,.58)
(page3,1.0)
(page1,1.85)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-24
PageRank in Spark: Second Iteration

Language: Python
def computeContribs(neighbors, rank):… links ranks
(page4, [page2,page1]) (page4,0.58)
links = …
(page2, [page1]) (page2,0.58)

ranks = … (page3, [page1,page4]) (page3,1.0)


(page1, [page3]) (page1,1.85)
for x in xrange(10):
contribs=links\
.join(ranks)\
.flatMap(lambda (page,(neighbors,rank)): \
computeContribs(neighbors,rank))

ranks=contribs\
.reduceByKey(lambda v1,v2: v1+v2)\
.map(lambda (page,contrib): \ ranks
(page,contrib * 0.85 + 0.15)) (page4,0.57)
(page2,0.21)
for rank in ranks.collect(): print rank
(page3,1.0)
(page1,0.77)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-25
Checkpointing (1)

§ Maintaining RDD lineage provides resilience but can also cause problems
when the lineage gets very long
– For example: iterative algorithms, streaming data…
§ Recovery can be very expensive
§ Potential stack overflow

Language: Python
myrdd = …initial-value…
for x in xrange(100):
    myrdd = myrdd.transform(…)
myrdd.saveAsTextFile(dir)

[Diagram: the lineage chain grows with every iteration (Iter1, Iter2, Iter3, Iter4, … Iter100), each adding more RDDs to recompute]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-26
Checkpointing (2)

§ Checkpointing saves the data to HDFS


– Provides fault-tolerant storage across nodes
§ Lineage is not saved
§ Must be checkpointed before any actions on the RDD

Language: Python
sc.setCheckpointDir(directory)
myrdd = …initial-value…
for x in xrange(100):
    myrdd = myrdd.transform(…)
    if x % 3 == 0:
        myrdd.checkpoint()
        myrdd.count()
myrdd.saveAsTextFile(dir)

[Diagram: every few iterations the RDD data is checkpointed to HDFS, truncating the lineage chain (Iter3, Iter4, … Iter100)]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-27
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-28
Fundamentals of Computer Programming

§ First consider how a typical program works


– Hardcoded conditional logic
– Predefined reactions when those conditions are met

#!/usr/bin/env python

import sys

for line in sys.stdin:
    if "Make MONEY Fa$t At Home!!!" in line:
        print "This message is likely spam"

    if "Happy Birthday from Aunt Betty" in line:
        print "This message is probably OK"

§ The programmer must consider all possibilities at design time


§ An alternative technique is to have computers learn what to do

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-29
What is Machine Learning?

§ Machine learning is a field within artificial intelligence (AI)


– AI: “The science and engineering of making intelligent machines”
§ Machine learning focuses on automated knowledge acquisition
– Primarily through the design and implementation of algorithms
– These algorithms require empirical data as input
§ Machine learning algorithms “learn” from data and often produce a
predictive model as their output
– Model can then be used to make predictions as new data arrives
§ For example, consider a predictive model based on credit card customers
– Build model with data about customers who did/did not default on debt
– Model can then be used to predict whether new customers will default

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-30
Types of Machine Learning

§ Three established categories of machine learning techniques:


– Collaborative filtering (recommendations)
– Clustering
– Classification

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-31
What is Collaborative Filtering?

§ Collaborative filtering is a technique for making recommendations


§ Helps users find items of relevance
– Among a potentially vast number of choices
– Based on comparison of preferences between users
– Preferences can be either explicit (stated) or implicit (observed)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-32
Applications Involving Collaborative Filtering

§ Collaborative filtering is domain agnostic


§ Can use the same algorithm to recommend practically anything
– Movies (Netflix, Amazon Instant Video)
– Television (TiVO Suggestions)
– Music (several popular music download and streaming services)
§ Amazon uses CF to recommend a variety of products

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-33
What is Clustering?

§ Clustering algorithms discover structure in collections of data


– Where no formal structure previously existed
§ They discover which clusters (“groupings”) naturally occur in data
– By examining various properties of the input data
§ Clustering is often used for exploratory analysis
– Divide huge amount of data into smaller groups
– Can then tune analysis for each group

[Diagram: example clusters plotted along dimensions such as Price, Brand status, and Store vs. Online]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-34
Unsupervised Learning (1)

§ Clustering is an example of unsupervised learning


– Begin with a data set that has no apparent label
– Use an algorithm to discover structure in the data

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-35
Unsupervised Learning (2)

§ Once the model has been created, you can use it to assign groups

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-36
Applications Involving Clustering

§ Market segmentation
– Group similar customers in order to target them effectively
§ Finding related news articles
– Google News
§ Epidemiological studies
– Identifying a “cancer cluster” and finding a root cause
§ Computer vision (groups of pixels that cohere into objects)
– Related pixels clustered to recognize faces or license plates

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-37
What is Classification?

§ Classification is a form of supervised learning


– This requires training with data that has known labels
– A classifier can then label new data based on what it learned in training
§ This example depicts how a classifier might identify animals
– In this case, it learned to distinguish between these two classes of animals based on height and weight

[Scatter plot: Weight (lb.) vs. Height (in.), with dog and cat data points forming two separable groups]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-38
Supervised Learning (1)

§ Classification is an example of supervised learning


– Begin with a data set that includes the value to be predicted (the label)
– Use an algorithm to train a predictive model using the data-label pairs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-39
Supervised Learning (2)

§ Once the model has been trained, you can make predictions
– This will take new (previously unseen) data as input
– The new data will not have labels

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-40
Applications Involving Classification

§ Spam filtering
– Train using a set of spam and non-spam messages
– System will eventually learn to detect unwanted email
§ Oncology
– Train using images of benign and malignant tumors
– System will eventually learn to identify cancer
§ Risk Analysis
– Train using financial records of customers who do/don’t default
– System will eventually learn to identify high-risk customers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-41
Relationship of Algorithms and Data Volume (1)

§ There are many algorithms for each type of machine learning


– There’s no overall “best” algorithm
– Each algorithm has advantages and limitations
§ Algorithm choice is often related to data volume
– Some scale better than others
§ Most algorithms offer better results as volume increases
– Best approach = simple algorithm + lots of data
§ Spark is an excellent platform for machine learning over large data sets
– Resilient, iterative, parallel computations over distributed data sets

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-42
Relationship of Algorithms and Data Volume (2)

“It’s not who has the best algorithms that wins.


It’s who has the most data.” [Banko and Brill, 2001]
[Chart: Test Accuracy (0.70 to 1.00) vs. Millions of Words (0.1 to 1000) for four algorithms (Memory-Based, Winnow, Perceptron, Naive Bayes), all improving as training data grows]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-43
Machine Learning Challenges

§ Highly computation-intensive and iterative


§ Many traditional numerical processing systems do not scale to very large
datasets
– For example, MATLAB

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-44
Spark MLlib and Spark ML

§ Spark MLlib is a Spark machine learning library


– Makes practical machine learning scalable and easy
– Includes many common machine learning algorithms
– Includes base data types for efficient calculations at scale
– Supports scalable statistics and data transformations
§ Spark ML is a new higher-level API for machine learning pipelines
– Built on top of Spark’s DataFrames API
– Simple and clean interface for running a series of complex tasks
– Supports most functionality included in Spark MLlib
§ Spark MLlib and ML support a variety of machine learning algorithms
– Such as ALS (alternating least squares), k-means, linear regression,
logistic regression, gradient descent
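A hedged illustration of the MLlib API (the file name, k, and the sample point are placeholders; this is not part of the course exercises):

Language: Python
from pyspark.mllib.clustering import KMeans

# Each input line holds comma-separated numeric features, for example "37.4,-121.9"
points = sc.textFile("points.txt") \
           .map(lambda line: [float(x) for x in line.split(',')])

model = KMeans.train(points, k=5, maxIterations=10)
print model.clusterCenters            # the k cluster centers found
print model.predict([37.4, -121.9])   # cluster index for a new point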

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-45
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-46
k-means Clustering

§ k-means clustering
– A common iterative algorithm used in graph analysis and machine
learning
– You will implement a simplified version in the Hands-On Exercises

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-47
Clustering (1)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-48
Clustering (2)

Goal: Find “clusters” of data points

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-49
Example: k-means Clustering (1)

1. Choose k random points as


starting centers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-50
Example: k-means Clustering (2)

1. Choose k random points as


starting centers
2. Find all points closest to each
center

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-51
Example: k-means Clustering (3)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-52
Example: k-means Clustering (4)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-53
Example: k-means Clustering (5)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-54
Example: k-means Clustering (6)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-55
Example: k-means Clustering (7)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-56
Example: k-means Clustering (8)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-57
Example: k-means Clustering (9)

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed, iterate
again

5. Done!

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-58
Example: Approximate k-means Clustering

1. Choose k random points as


starting centers
2. Find all points closest to each
center
3. Find the center (mean) of each
cluster
4. If the centers changed by
more than c, iterate again

5. Close enough!
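A hedged sketch of the iteration structure only (the helper function, file name, and constants are illustrative; this is not the exercise solution):

Language: Python
def closest(p, centers):
    # index of the center nearest to point p = (x, y)
    return min(range(len(centers)),
               key=lambda i: (p[0]-centers[i][0])**2 + (p[1]-centers[i][1])**2)

k = 5
points = sc.textFile("points.txt") \
           .map(lambda line: tuple(map(float, line.split(',')))) \
           .persist()

centers = points.takeSample(False, k)             # 1. k random starting centers
for step in xrange(10):
    # 2. assign each point to its closest center
    assigned = points.map(lambda p: (closest(p, centers), (p, 1)))
    # 3. compute the mean (new center) of each cluster
    sums = assigned.reduceByKey(
        lambda (p1, n1), (p2, n2): ((p1[0]+p2[0], p1[1]+p2[1]), n1+n2))
    newCenters = sums.map(
        lambda (i, (s, n)): (i, (s[0]/n, s[1]/n))).collectAsMap()
    centers = [newCenters.get(i, centers[i]) for i in range(k)]
    # 4. a complete implementation would stop once no center moves more than c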

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-59
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-60
Essential Points

§ Spark is especially suited to big data problems that require iteration


– In-memory persistence makes this very efficient
§ Common in many types of analysis
– For example, common algorithms such as PageRank and k-means
§ Spark is especially well-suited for implementing machine learning
§ Spark includes MLlib and ML
– Specialized libraries to implement many common machine learning
functions
– Efficient, scalable functions for machine learning (for example: logistic
regression, k-means)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-61
Chapter Topics

Common Patterns in Apache Spark


Data Processing

§ Common Apache Spark Use Cases


§ Iterative Algorithms in Apache Spark
§ Machine Learning
§ Example: k-means
§ Essential Points
§ Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-62
Hands-On Exercise: Implement an Iterative Algorithm with
Apache Spark
§ In this exercise, you will
– Implement k-means in Spark in order to identify clustered location data
points from Loudacre device status logs
– Find the geographic centers of device activity
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-63
DataFrames and Apache Spark SQL
Chapter 14
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-2
DataFrames and Apache Spark SQL

In this chapter you will learn


§ What Spark SQL is
§ What features the DataFrame API provides
§ How to create a SQLContext
§ How to load existing data into a DataFrame
§ How to query data in a DataFrame
§ How to convert from DataFrames to pair RDDs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-3
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-4
What is Spark SQL?

§ What is Spark SQL?


– Spark module for structured data processing
– Replaces Shark (a prior Spark module, now deprecated)
– Built on top of core Spark
§ What does Spark SQL provide?
– The DataFrame API—a library for working with data as tables
– Defines DataFrames containing rows and columns
– DataFrames are the focus of this chapter!
– Catalyst Optimizer—an extensible optimization framework
– A SQL engine and command line interface

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-5
SQL Context

§ The main Spark SQL entry point is a SQL context object


– Requires a SparkContext object
– The SQL context in Spark SQL is similar to Spark context in core Spark
§ There are two implementations
– SQLContext
– Basic implementation
– HiveContext
– Reads and writes Hive/HCatalog tables directly
– Supports full HiveQL language
– Requires the Spark application be linked with Hive libraries
– Cloudera recommends using HiveContext

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-6
Creating a SQL Context

§ The Spark shell creates a HiveContext instance automatically


– Access it using the sqlContext variable
– You will need to create one yourself when writing a Spark application
– Having multiple SQL context objects is allowed
§ A SQL context object is created based on the Spark context
Language: Python
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

Language: Scala
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-7
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-8
DataFrames

§ DataFrames are the main abstraction in Spark SQL


– Analogous to RDDs in core Spark
– A distributed collection of structured data organized into named
columns
– Built on a base RDD containing Row objects

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-9
Creating DataFrames

§ DataFrames can be created


– From an existing structured data source
– Such as a Hive table, Parquet file, or JSON file
– From an existing RDD
– By performing an operation or query on another DataFrame
– By programmatically defining a schema

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-10
Creating a DataFrame from a Data Source

§ sqlContext.read returns a DataFrameReader object


§ DataFrameReader provides the functionality to load data into a
DataFrame
§ Convenience functions
– json(filename)
– parquet(filename)
– orc(filename)
– table(hive-tablename)
– jdbc(url,table,options)
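
§ For example, a brief sketch of the convenience functions above (the Parquet path below is a placeholder, not part of the course data; sqlContext is the HiveContext created earlier in this chapter)

Language: Python
# Placeholder path and existing "customers" table used for illustration
accountsDF = sqlContext.read.parquet("/loudacre/accounts_parquet")
customersDF = sqlContext.read.table("customers")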

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-11
Example: Creating a DataFrame from a JSON File
Language: Python
sqlContext = HiveContext(sc)
peopleDF = sqlContext.read.json("people.json")

Language: Scala
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val peopleDF = sqlContext.read.json("people.json")

File: people.json
{"name":"Alice", "pcode":"94304"}
{"name":"Brayden", "age":30, "pcode":"94304"}
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46}
{"name":"Étienne", "pcode":"94104"}

Resulting DataFrame (peopleDF):
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-12
Example: Creating a DataFrame from a Hive/Impala Table
Language: Python
sqlContext = HiveContext(sc)
customerDF = sqlContext.read.table("customers")

Language: Scala
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val customerDF = sqlContext.read.table("customers")

Table: customers (the resulting customerDF DataFrame contains the same rows)
cust_id  name    country
001      Ani     us
002      Bob     ca
003      Carlos  mx
…        …       …

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-13
Loading from a Data Source Manually

§ You can specify settings for the DataFrameReader


– format: Specify a data source type
– option: A key/value setting for the underlying data source
– schema: Specify a schema instead of inferring from the data source
§ Then call the generic base function load
sqlContext.read.
format("com.databricks.spark.avro").
load("/loudacre/accounts_avro")

sqlContext.read.
format("jdbc").
option("url","jdbc:mysql://localhost/loudacre").
option("dbtable","accounts").
option("user","training").
option("password","training").
load()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-14
Data Sources

§ Spark SQL 1.6 built-in data source types


– table
– json
– parquet
– jdbc
– orc
§ You can also use third party data source libraries, such as
– Avro (included in CDH)
– HBase
– CSV
– MySQL
– and more being added all the time
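
§ As an illustration only, loading a CSV file with the third-party spark-csv package might look like the sketch below (assumes the com.databricks spark-csv package is available; the path is a placeholder)

Language: Python
# Hedged sketch: package availability and the input path are assumptions
accountsDF = (sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")        # first line contains column names
    .option("inferSchema", "true")   # infer column types from the data
    .load("/loudacre/accounts_csv"))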

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-15
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-16
DataFrame Basic Operations (1)

§ Basic operations deal with DataFrame metadata (rather than its data)
§ Some examples
– schema returns a schema object describing the data
– printSchema displays the schema as a visual tree
– cache / persist persists the DataFrame to disk or memory
– columns returns an array containing the names of the columns
– dtypes returns an array of (column name,type) pairs
– explain prints debug information about the DataFrame to the
console
§ Most of these examples will be demonstrated in the next several slides
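
§ A quick preview, using the peopleDF DataFrame created from people.json earlier in this chapter

Language: Python
peopleDF.printSchema()       # displays the schema as a visual tree
print peopleDF.columns       # list of column names: ['age', 'name', 'pcode']
peopleDF.persist()           # marks the DataFrame for caching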

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-17
DataFrame Basic Operations (2)

§ Example: Displaying column data types using dtypes


Language: Python
> peopleDF = sqlContext.read.json("people.json")
> for item in peopleDF.dtypes: print item
('age', 'bigint')
('name', 'string')
('pcode', 'string')

Language: Scala
> val peopleDF = sqlContext.read.json("people.json")
> peopleDF.dtypes.foreach(println)
(age,LongType)
(name,StringType)
(pcode,StringType)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-18
Working with Data in a DataFrame

§ Queries—create a new DataFrame


– DataFrames are immutable
– Queries are analogous to RDD transformations
§ Actions—return data to the driver
– Actions trigger execution of the lazily defined queries
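
§ A minimal sketch of this behavior, using the peopleDF DataFrame introduced earlier

Language: Python
# Defining a query creates a new DataFrame but executes nothing yet
adultNamesDF = peopleDF.where("age > 21").select("name")
# The action triggers execution of the whole query
adultNamesDF.show()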

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-19
DataFrame Actions

§ Some DataFrame actions


– collect returns all rows as an array of Row objects
– take(n) returns the first n rows as an array of Row objects
– count returns the number of rows
– show(n) displays the first n rows (default=20)

Language: Python
> peopleDF.count()
5L
> peopleDF.show(3)
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036

Language: Scala
> peopleDF.count()
res7: Long = 5
> peopleDF.show(3)
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-20
DataFrame Queries (1)

§ DataFrame query methods return new DataFrames


– Queries can be chained like transformations
§ Some query methods
– distinct returns a new DataFrame with distinct elements of this DF
– join joins this DataFrame with a second DataFrame
– Variants for inner, outer, left, and right joins
– limit returns a new DataFrame with the first n rows of this DF
– select returns a new DataFrame with data from one or more columns
of the base DataFrame
– where returns a new DataFrame with rows meeting specified query
criteria (alias for filter)
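
§ For example, query methods can be chained, with each step returning a new DataFrame

Language: Python
# Distinct postal codes of people over 21, limited to the first 10 rows
pcodesDF = peopleDF.where("age > 21").select("pcode").distinct().limit(10)
pcodesDF.show()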

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-21
DataFrame Queries (2)

§ Example: A basic query with limit


Language: Scala/Python
> peopleDF.limit(3).show()

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Output of show (the first 3 rows only):
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-22
DataFrame Query Strings (1)

§ Some query operations take strings containing simple query expressions
– Such as select and where
§ Example: select

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Selecting the age column:
age
null
30
19
46
null

Selecting the name and age columns:
name     age
Alice    null
Brayden  30
Carla    19
Diana    46
Étienne  null

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-23
DataFrame Query Strings (2)

§ Example: where

peopleDF.where("age > 21")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Result:
age   name     pcode
30    Brayden  94304
46    Diana    null

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-24
Querying DataFrames using Columns (1)

§ Some DataFrame queries take one or more columns or column expressions


– Required for more sophisticated operations
§ Some examples
– select
– sort
– join
– where

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-25
Querying DataFrames using Columns (2)

§ Columns can be referenced in multiple ways


§ Python
ageDF = peopleDF.select(peopleDF['age'])
ageDF = peopleDF.select(peopleDF.age)

§ Scala
val ageDF = peopleDF.select(peopleDF("age"))
val ageDF = peopleDF.select($"age")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

ageDF:
age
null
30
19
46
null

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-26
Querying DataFrames using Columns (3)

§ Column references can also be column expressions


Language: Python
peopleDF.select(peopleDF['name'],peopleDF['age']+10)

Language: Scala
peopleDF.select(peopleDF("name"),peopleDF("age")+10)

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Result:
name     age+10
Alice    null
Brayden  40
Carla    29
Diana    56
Étienne  null

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-27
Querying DataFrames using Columns (4)

§ Example: Sorting by columns (descending)


Note: .asc and .desc are column expression methods used with sort

Language: Python
peopleDF.sort(peopleDF['age'].desc())

Language: Scala
peopleDF.sort(peopleDF("age").desc)

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Result (sorted by age, descending):
age   name     pcode
46    Diana    null
30    Brayden  94304
19    Carla    10036
null  Alice    94304
null  Étienne  94104

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-28
Joining DataFrames (1)

§ A basic inner join when the join column is present in both DataFrames


Language: Python/Scala
peopleDF.join(pcodesDF, "pcode")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

pcodesDF:
pcode  city           state
10036  New York       NY
87501  Santa Fe       NM
94304  Palo Alto      CA
94104  San Francisco  CA

Result (inner join on pcode):
pcode  age   name     city           state
94304  null  Alice    Palo Alto      CA
94304  30    Brayden  Palo Alto      CA
10036  19    Carla    New York       NY
94104  null  Étienne  San Francisco  CA

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-29
Joining DataFrames (2)

§ Specify type of join as inner (default), outer, left_outer,


right_outer, or leftsemi
Language: Python
peopleDF.join(pcodesDF, "pcode", "left_outer")

Language: Scala
peopleDF.join(pcodesDF, Array("pcode"), "left_outer")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

pcodesDF:
pcode  city           state
10036  New York       NY
87501  Santa Fe       NM
94304  Palo Alto      CA
94104  San Francisco  CA

Result (left outer join on pcode):
pcode  age   name     city           state
94304  null  Alice    Palo Alto      CA
94304  30    Brayden  Palo Alto      CA
10036  19    Carla    New York       NY
null   46    Diana    null           null
94104  null  Étienne  San Francisco  CA

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-30
Joining DataFrames (3)

§ Use a column expression when column names are different


Language: Python
peopleDF.join(zcodesDF, peopleDF.pcode == zcodesDF.zip)

Language: Scala
peopleDF.join(zcodesDF, $"pcode" === $"zip")

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

zcodesDF:
zip    city           state
10036  New York       NY
87501  Santa Fe       NM
94304  Palo Alto      CA
94104  San Francisco  CA

Result (inner join on pcode == zip):
pcode  age   name     zip    city           state
94304  null  Alice    94304  Palo Alto      CA
94304  30    Brayden  94304  Palo Alto      CA
10036  19    Carla    10036  New York       NY
94104  null  Étienne  94104  San Francisco  CA

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-31
SQL Queries (1)

§ When using HiveContext, you can query Hive/Impala tables using


HiveQL
– Returns a DataFrame
Language: Python/Scala
sqlContext.
sql("""SELECT * FROM customers WHERE name LIKE "A%" """)

Table: customers
cust_id  name    country
001      Ani     us
002      Bob     ca
003      Carlos  mx
…        …       …

Result:
cust_id  name  country
001      Ani   us

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-32
SQL Queries (2)

§ You can also perform some SQL queries with a DataFrame


– First, register the DataFrame as a “table” with the SQL context

Language: Python/Scala
peopleDF.registerTempTable("people")
sqlContext.
sql("""SELECT * FROM people WHERE name LIKE "A%" """)

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

Result:
age   name   pcode
null  Alice  94304

Note: This feature does not depend on Hive or Impala, or on a database

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-33
SQL Queries (3)

§ You can query directly from Parquet or JSON files without needing to
create a DataFrame or register a temporary table
Language: Python/Scala
sqlContext.
sql("""SELECT * FROM json.`/user/training/people.json` WHERE
name LIKE "A%" """)

File: people.json
{"name":"Alice", "pcode":"94304"}
{"name":"Brayden", "age":30, "pcode":"94304"}
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46}
{"name":"Étienne", "pcode":"94104"}

Result:
age   name   pcode
null  Alice  94304

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-34
Other Query Functions

§ DataFrames provide many other data manipulation and query functions


such as
– Aggregation such as groupBy, orderBy, and agg
– Multi-dataset operations such as join, unionAll, and intersect
– Statistics such as avg, sampleBy, corr, and cov
– Multi-variable functions rollup and cube
– Window-based analysis functions
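
§ A brief sketch of groupBy-based aggregation, continuing with the peopleDF example

Language: Python
# Count the number of people per postal code, largest groups first
pcodeCountsDF = peopleDF.groupBy("pcode").count()
pcodeCountsDF.orderBy("count", ascending=False).show()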

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-35
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-36
Saving DataFrames

§ Data in DataFrames can be saved to a data source


§ Use DataFrame.write to create a DataFrameWriter
§ DataFrameWriter provides convenience functions to externally save
the data represented by a DataFrame
– jdbc inserts into a new or existing table in a database
– json saves as a JSON file
– parquet saves as a Parquet file
– orc saves as an ORC file
– text saves as a text file (string data in a single column only)
– saveAsTable saves as a Hive/Impala table (HiveContext only)

Language: Python/Scala
peopleDF.write.saveAsTable("people")

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-37
Options for Saving DataFrames

§ DataFrameWriter option methods


– format specifies a data source type
– mode determines the behavior if file or table already exists:
overwrite, append, ignore or error (default is error)
– partitionBy stores data in partitioned directories in the form
column=value (as with Hive/Impala partitioning)
– options specifies properties for the target data source
– save is the generic base function to write the data
Language: Python/Scala
peopleDF.write.
format("parquet").
mode("append").
partitionBy("age").
saveAsTable("people")

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-38
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-39
DataFrames and RDDs (1)

§ DataFrames are built on RDDs


– Base RDDs contain Row objects
– Use rdd to get the underlying RDD

peopleRDD = peopleDF.rdd

peopleDF:
age   name     pcode
null  Alice    94304
30    Brayden  94304
19    Carla    10036
46    Diana    null
null  Étienne  94104

peopleRDD:
Row[null,Alice,94304]
Row[30,Brayden,94304]
Row[19,Carla,10036]
Row[46,Diana,null]
Row[null,Étienne,94104]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-40
DataFrames and RDDs (2)

§ Row RDDs have all the standard Spark actions and transformations
– Actions: collect, take, count, and so on
– Transformations: map, flatMap, filter, and so on
§ Row RDDs can be transformed into pair RDDs to use map-reduce methods
§ DataFrames also provide convenience methods (such as map, flatMap,
and foreach) for working with the underlying RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-41
Working with Row Objects

§ The syntax for extracting data from Row objects depends on language
§ Python
– Column names are object attributes
– row.age returns age column value from row
§ Scala
– Use Array-like syntax to return values with type Any
– row(n) returns element in the nth column
– row.fieldIndex("age") returns the index of the age column
– Use methods to get correctly typed values
– row.getAs[Long]("age")
– Use type-specific get methods to return typed values
– row.getString(n) returns nth column as a string
– row.getInt(n) returns nth column as an integer
– And so on
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-42
Example: Extracting Data from Row Objects
§ Extract data from Row objects

Language: Python
peopleRDD = peopleDF \
    .map(lambda row:(row.pcode,row.name))
peopleByPCode = peopleRDD \
    .groupByKey()

Language: Scala
val peopleRDD = peopleDF.
    map(row =>
        (row(row.fieldIndex("pcode")),
         row(row.fieldIndex("name"))))
val peopleByPCode = peopleRDD.
    groupByKey()

Input (RDD of Row objects):
Row[null,Alice,94304]
Row[30,Brayden,94304]
Row[19,Carla,10036]
Row[46,Diana,null]
Row[null,Étienne,94104]

After map (pair RDD of (pcode,name)):
(94304,Alice)
(94304,Brayden)
(10036,Carla)
(null,Diana)
(94104,Étienne)

After groupByKey:
(null,[Diana])
(94304,[Alice,Brayden])
(10036,[Carla])
(94104,[Étienne])

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-43
Converting RDDs to DataFrames

§ You can also create a DF from an RDD using createDataFrame


Language: Python
from pyspark.sql.types import *
schema = StructType([StructField("age",IntegerType(),True),
StructField("name",StringType(),True),
StructField("pcode",StringType(),True)])
myrdd = sc.parallelize([(40,"Abram","01601"),
(16,"Lucia","87501")])
mydf = sqlContext.createDataFrame(myrdd,schema)

import org.apache.spark.sql.types._ Language: Scala


import org.apache.spark.sql.Row
val schema = StructType(Array(
StructField("age", IntegerType, true),
StructField("name", StringType, true),
StructField("pcode", StringType, true)))
val rowrdd = sc.parallelize(Array(Row(40,"Abram","01601"),
Row(16,"Lucia","87501")))
val mydf = sqlContext.createDataFrame(rowrdd,schema)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-44
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-45
Comparing Impala to Spark SQL

§ Spark SQL is built on Spark, a general purpose processing engine


– Provides convenient SQL-like access to structured data in a Spark
application
§ Impala is a specialized SQL engine
– Much better performance for querying
– Much more mature than Spark SQL
– Robust security using Sentry
§ Impala is better for
– Interactive queries
– Data analysis
§ Use Spark SQL for
– ETL
– Access to structured data required by a Spark application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-46
Comparing Spark SQL with Hive on Spark

§ Spark SQL
– Provides the DataFrame API to allow structured data
processing in a Spark application
– Programmers can mix SQL with procedural processing
§ Hive on Spark
– Hive provides a SQL abstraction layer over MapReduce or
Spark
– Allows non-programmers to analyze data using familiar
SQL
– Hive on Spark replaces MapReduce as the engine
underlying Hive
– Does not affect the user experience of Hive
– Except queries run many times faster!

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-47
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-48
What’s Coming in Spark 2.x?

§ Spark 2.0 is the next major release of Spark


§ Several significant changes related to Spark SQL, including
– SparkSession replaces SQLContext and HiveContext
– Support for ANSI-SQL as well as HiveQL
– Support for subqueries
– Support for Datasets
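
§ As a preview only (this course environment uses Spark 1.6), creating the unified SparkSession entry point in Spark 2.x looks roughly like the sketch below; the application name is a placeholder

Language: Python
# Spark 2.x preview; not part of the Spark 1.6 course environment
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .enableHiveSupport() \
    .getOrCreate()

peopleDF = spark.read.json("people.json")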

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-49
Spark Datasets

§ Datasets are an alternative to RDDs for structured data


– A strongly-typed collection of objects, mapped to a relational schema
– Unified with the DataFrame API—DFs are Datasets of Row objects
– Use the Spark Catalyst optimizer as DFs do for better performance

Word count using RDDs
Language: Scala
val countsRDD = sc.textFile(filename).
    flatMap(line => line.split(" ")).
    map(word => (word,1)).
    reduceByKey((v1,v2) => v1+v2)

Word count using Datasets
Language: Scala
val countsDS = sqlContext.read.text(filename).as[String].
    flatMap(line => line.split(" ")).
    groupBy(word => word).
    count()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-50
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-51
Essential Points

§ Spark SQL is a Spark API for handling structured and semi-structured data
§ Entry point is a SQL context
§ DataFrames are the key unit of data
– DataFrames are based on an underlying RDD of Row objects
– DataFrames query methods return new DataFrames; similar to RDD
transformations
– The full Spark API can be used with Spark SQL data by accessing the
underlying RDD
§ Spark SQL is not a replacement for a database, or a specialized SQL engine
like Impala
– Spark SQL is most useful for ETL or incorporating structured data into a
Spark application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-52
Chapter Topics

DataFrames and Apache Spark SQL

§ Apache Spark SQL and the SQL Context


§ Creating DataFrames
§ Transforming and Querying DataFrames
§ Saving DataFrames
§ DataFrames and RDDs
§ Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
§ Apache Spark SQL in Spark 2.x
§ Essential Points
§ Hands-On Exercise: Use Apache Spark SQL for ETL

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-53
Hands-On Exercise: Use Apache Spark SQL for ETL

§ In this exercise, you will


– Import data from MySQL using Sqoop
– Use Spark SQL to normalize the data
– Save the data to Parquet format
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 14-54
Message Processing with Apache Kafka
Chapter 15
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-2
Message Processing with Apache Kafka

In this chapter you will learn


§ What Apache Kafka is and what advantages it offers
§ About the high-level architecture of Kafka
§ How to create topics, publish messages, and read messages from the
command line

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-3
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-4
What Is Apache Kafka?

§ Apache Kafka is a distributed commit log service


– Widely used for data ingest
– Conceptually similar to a publish-subscribe messaging system
– Offers scalability, performance, reliability, and flexibility
§ Originally created at LinkedIn, now an open source Apache project
– Donated to the Apache Software Foundation in 2011
– Graduated from the Apache Incubator in 2012
– Supported by Cloudera for production use with CDH in 2015

Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-5
Characteristics of Kafka

§ Scalable
– Kafka is a distributed system that supports multiple nodes
§ Fault-tolerant
– Data is persisted to disk and can be replicated throughout the cluster
§ High throughput
– Each broker can process hundreds of thousands of messages per second *
§ Low latency
– Data is delivered in a fraction of a second
§ Flexible
– Decouples the production of data from its consumption

* Using modest hardware, with messages of a typical size


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-6
Kafka Use Cases

§ Kafka is used for a variety of use cases, such as


– Log aggregation
– Messaging
– Web site activity tracking
– Stream processing
– Event sourcing

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-7
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-8
Key Terminology

§ Message
– A single data record passed by Kafka
§ Topic
– A named log or feed of messages within Kafka
§ Producer
– A program that writes messages to Kafka
§ Consumer
– A program that reads messages from Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-9
Example: High-Level Architecture

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-10
Messages (1)

§ Messages in Kafka are variable-size byte arrays


– Represent arbitrary user-defined content
– Use any format your application requires
– Common formats include free-form text, JSON, and Avro
§ There is no explicit limit on message size
– Optimal performance at a few KB per message
– Practical limit of 1MB per message

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-11
Messages (2)

§ Kafka retains all messages for a defined time period and/or total size
– Administrators can specify retention on global or per-topic basis
– Kafka will retain messages regardless of whether they were read
– Kafka discards messages automatically after the retention period or
total size is exceeded (whichever limit is reached first)
– Default retention is one week
– Retention can reasonably be one year or longer

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-12
Topics

§ There is no explicit limit on the number of topics


– However, Kafka works better with a few large topics than many small
ones
§ A topic can be created explicitly or simply by publishing to the topic
– This behavior is configurable
– Cloudera recommends that administrators disable auto-creation of
topics to avoid accidental creation of large numbers of topics

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-13
Producers

§ Producers publish messages to Kafka topics


– They communicate with Kafka, not a consumer
– Kafka persists messages to disk on receipt

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-14
Consumers

§ A consumer reads messages that were published to Kafka topics


– They communicate with Kafka, not any producer
§ Consumer actions do not affect other consumers
– For example, having one consumer display the messages in a topic as
they are published does not change what is consumed by other
consumers
§ They can come and go without impact on the cluster or other consumers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-15
Producers and Consumers

§ Tools available as part of Kafka


– Command-line producer and consumer tools
– Client (producer and consumer) Java APIs
§ A growing number of other APIs are available from third parties
– Client libraries in many languages including Python, PHP, C/C++, Go,
.NET, and Ruby
§ Integrations with other tools and projects include
– Apache Flume
– Apache Spark
– Amazon AWS
– syslog
§ Kafka also has a large and growing ecosystem
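
§ As an illustration of a third-party client library (not part of the course environment), a minimal sketch with the Python kafka-python package might look like this; the broker address and topic name follow the examples later in this chapter

Language: Python
# Hedged sketch using the third-party kafka-python library;
# broker address and topic name are assumptions for illustration
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="brokerhost1:9092")
producer.send("device_status", b"device 42 is online")   # messages are byte arrays
producer.flush()

consumer = KafkaConsumer("device_status",
                         bootstrap_servers="brokerhost1:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print message.value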

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-16
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-17
Scaling Kafka

§ Scalability is one of the key benefits of Kafka


§ Two features let you scale Kafka for performance
– Topic partitions
– Consumer groups

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-18
Topic Partitioning

§ Kafka divides each topic into some number of partitions *


– Topic partitioning improves scalability and throughput
§ A topic partition is an ordered and immutable sequence of messages
– New messages are appended to the partition as they are received
– Each message is assigned a unique sequential ID known as an offset

* Note that this is unrelated to partitioning in HDFS or Spark


© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-19
Consumer Groups

§ One or more consumers can form their own consumer group that work
together to consume the messages in a topic
§ Each partition is consumed by only one member of a consumer group
§ Message ordering is preserved per partition, but not across the topic

Diagram: A Kafka cluster hosts the topic "click-tracking" with partitions 0, 3, 1, and 2; consumers 1 and 2, members of consumer group "click-processing", divide the partitions between them.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-20
Increasing Consumer Throughput

§ Additional consumers can be added to scale consumer group processing


§ Consumer instances that belong to the same consumer group can be in
separate processes or on separate machines

Diagram: The topic "click-tracking" with partitions 0, 3, 1, and 2; consumers C1–C4 in consumer group "click-processing" each consume one partition.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-21
Multiple Consumer Groups

§ Each message published to a topic is delivered to one consumer instance


within each subscribing consumer group
§ Kafka scales to large numbers of consumer groups and consumers

Diagram: The topic "click-tracking" with partitions 0, 3, 1, and 2; consumers C1–C4 form Consumer Group A and consumers C5–C6 form Consumer Group B; each group independently receives every message.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-22
Publish and Subscribe to Topic

§ Kafka functions like a traditional queue when all consumer instances


belong to the same consumer group
– In this case, a given message is received by one consumer
§ Kafka functions like traditional publish-subscribe when each consumer
instance belongs to a different consumer group
– In this case, all messages are broadcast to all consumer groups

Diagram: The topic "click-tracking" with partitions 0, 3, 1, and 2; consumers C1–C7 are spread across Consumer Groups A, B, and C, so each message is delivered once to each group.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-23
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-24
Kafka Clusters

§ A Kafka cluster consists of one or more brokers—servers running the Kafka


broker daemon
§ Kafka depends on the Apache ZooKeeper service for coordination

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-25
Apache ZooKeeper

§ Apache ZooKeeper is a coordination service for distributed applications


§ Kafka depends on the ZooKeeper service for coordination
– Typically running three or five ZooKeeper instances
§ Kafka uses ZooKeeper to keep track of brokers running in the cluster
§ Kafka uses ZooKeeper to detect the addition or removal of consumers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-26
Kafka Brokers

§ Brokers are the fundamental daemons that make up a Kafka cluster


§ A broker fully stores a topic partition on disk, with caching in memory
§ A single broker can reasonably host 1000 topic partitions
§ One broker is elected controller of the cluster (for assignment of topic
partitions to brokers, and so on)
§ Each broker daemon runs in its own JVM
– A single machine can run multiple broker daemons

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-27
Topic Replication

§ At topic creation, a topic can be set with a replication count


– Doing so is recommended, as it provides fault tolerance
§ Each broker can act as a leader for some topic partitions and a follower for
others
– Followers passively replicate the leader
– If the leader fails, a follower will automatically become the new leader

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-28
Messages Are Replicated

§ Configure the producer with a list of one or more brokers


– The producer asks the first available broker for the leader of the desired
topic partition
§ The producer then sends the message to the leader
– The leader writes the message to its local log
– Each follower then writes the message to its own log
– After acknowledgements from followers, the message is committed

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-29
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-30
Creating Topics from the Command Line

§ Kafka includes a convenient set of command line tools


– These are helpful for exploring and experimentation
§ The kafka-topics command offers a simple way to create Kafka topics
– Provide the topic name of your choice, such as device_status
– You must also specify the ZooKeeper connection string for your cluster

$ kafka-topics --create \
--zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
--replication-factor 3 \
--partitions 5 \
--topic device_status

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-31
Displaying Topics from the Command Line

§ Use the --list option to list all topics

$ kafka-topics --list \
--zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181

§ Use the --help option to list all kafka-topics options

$ kafka-topics --help

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-32
Running a Producer from the Command Line (1)

§ You can run a producer using the kafka-console-producer tool


§ Specify one or more brokers in the --broker-list option
– Each broker consists of a hostname, a colon, and a port number
– If specifying multiple brokers, separate them with commas
§ You must also provide the name of the topic

$ kafka-console-producer \
--broker-list brokerhost1:9092,brokerhost2:9092 \
--topic device_status

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-33
Running a Producer from the Command Line (2)

§ You may see a few log messages in the terminal after the producer starts
§ The producer will then accept input in the terminal window
– Each line you type will be a message sent to the topic
§ Until you have configured a consumer for this topic, you will see no other
output from Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-34
Writing File Contents to Topics Using the Command Line

§ Using UNIX pipes or redirection, you can read input from files
– The data can then be sent to a topic using the command line producer
§ This example shows how to read input from a file named alerts.txt
– Each line in this file becomes a separate message in the topic

$ cat alerts.txt | kafka-console-producer \


--broker-list brokerhost1:9092,brokerhost2:9092 \
--topic device_status

§ This technique can be an easy way to integrate with existing programs

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-35
Running a Consumer from the Command Line

§ You can run a consumer with the kafka-console-consumer tool


§ This requires the ZooKeeper connection string for your cluster
– Unlike starting a producer, which requires a list of brokers
§ The command also requires a topic name
§ Use --from-beginning to read all available messages
– Otherwise, it reads only new messages

$ kafka-console-consumer \
--zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
--topic device_status \
--from-beginning

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-36
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-37
Essential Points

§ Producers publish messages to categories called topics


§ Messages in a topic are read by consumers
§ Topics are divided into partitions for performance and scalability
– These partitions are replicated for fault tolerance
§ Consumer groups work together to consume the messages in a topic
§ Nodes running the Kafka service are called brokers
§ Kafka includes command-line tools for managing topics, and for starting
producers and consumers

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-38
Bibliography

The following offer more information on topics discussed in this chapter


§ The Apache Kafka web site
– http://kafka.apache.org/
§ Real-Time Fraud Detection Architecture
– http://tiny.cloudera.com/kmc01a
§ Kafka Reference Architecture
– http://tiny.cloudera.com/kmc01b
§ The Log: What Every Software Engineer Should Know…
– http://tiny.cloudera.com/kmc01c

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-39
Chapter Topics

Message Processing with Apache Kafka

§ What Is Apache Kafka?


§ Apache Kafka Overview
§ Scaling Apache Kafka
§ Apache Kafka Cluster Architecture
§ Apache Kafka Command Line Tools
§ Essential Points
§ Hands-On Exercise: Produce and Consume Apache Kafka Messages

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-40
Hands-On Exercise: Produce and Consume Apache Kafka
Messages
§ In this exercise, you will
– Use Kafka’s command line utilities to create a new topic, publish
messages to the topic with a producer, and read messages from the
topic with a consumer
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 15-41
Capturing Data with Apache Flume
Chapter 16
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-2
Capturing Data with Apache Flume

In this chapter you will learn


§ What are the main architectural components of Apache Flume
§ How these components are configured
§ How to launch a Flume agent

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-3
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-4
What Is Apache Flume?

§ Apache Flume is a high-performance system for data collection


– Name derives from original use case of near-real time log data ingestion
– Now widely used for collection of any streaming event data
– Supports aggregating data from many sources into HDFS
§ Originally developed by Cloudera
– Donated to Apache Software Foundation in 2011
– Became a top-level Apache project in 2012
– Flume OG (Old Generation) gave way to Flume NG (Next Generation)
§ Benefits of Flume
– Horizontally-scalable
– Extensible
– Reliable

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-5
Flume’s Design Goals: Reliability

§ Channels provide Flume’s reliability


§ Examples
– Memory channel: Fault intolerant, data will be lost if power is lost
– Disk-based channel: Fault tolerant
– Kafka channel: Fault tolerant
§ Data transfer between agents and channels is transactional
– A failed data transfer to a downstream agent rolls back and retries
§ You can configure multiple agents with the same task
– For example, two agents doing the job of one “collector”—if one agent
fails then upstream agents would fail over

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-6
Flume’s Design Goals: Scalability

§ Scalability
– The ability to increase system performance linearly—or better—by
adding more resources to the system
– Flume scales horizontally
– As load increases, add more agents to the machine, and add more
machines to the system

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-7
Flume’s Design Goals: Extensibility

§ Extensibility
– The ability to add new functionality to a system
§ Flume can be extended by adding sources and sinks to existing storage
layers or data platforms
– Flume includes sources that can read data from files, syslog, and
standard output from any Linux process
– Flume includes sinks that can write to files on the local filesystem, HDFS,
Kudu, HBase, and so on
– Developers can write their own sources or sinks

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-8
Common Flume Data Sources

§ Common sources of data collected into a Hadoop cluster with Flume include
– Log files
– Sensor data
– Status updates
– UNIX syslog
– Network sockets
– Social media posts
– Program output

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-9
Large-Scale Deployment Example

§ Flume collects data using configurable agents


– Agents can receive data from many sources, including other agents
– Large-scale deployments use multiple tiers for scalability and reliability
– Flume supports inspection and modification of in-flight data

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-10
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-11
Flume Events

§ An event is the fundamental unit of data in Flume


– Consists of a body (payload) and a collection of headers (metadata)
§ Headers consist of name-value pairs
– Headers are mainly used for directing output

Anatomy of a Flume Event

Headers:
timestamp: 1395256884
hostname: webserver05.loudacre.com
datacenter: palo-alto

Body:
192.168.5.150 - pablo [11/May/2014:23:56:27 -0800] "GET /KBDOC-0U812.html HTTP/1.1" 200 6747 "http://www.loudacre.com/kb/search?q=batteries" "Mozilla/5.0 (ACME v3.14159)"

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-12
Components in Flume’s Architecture

§ Source
– Receives events from the external actor that generates them
§ Sink
– Sends an event to its destination
§ Channel
– Buffers events from the source until they are drained by the sink
§ Agent
– Configures and hosts the source, channel, and sink
– A Java process that runs in a JVM

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-13
Flume Data Flow

§ This diagram illustrates how syslog data might be captured to HDFS


1. Server running a syslog daemon logs a message
2. Flume agent configured with syslog source retrieves event
3. Source pushes event to the channel, where it is buffered in memory
4. Sink pulls data from the channel and writes it to HDFS

Diagram: A syslog server sends a syslog message to the Flume agent's source (syslog), which pushes the event into the channel (memory); the sink (HDFS) drains the channel and writes the data to a file in HDFS on the Hadoop cluster.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-14
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-15
Notable Built-In Flume Sources

§ Syslog
– Captures messages from UNIX syslog daemon over the network
§ Netcat
– Captures any data written to a socket on an arbitrary TCP port
§ Exec
– Executes a UNIX program and reads events from standard output *
§ Spooldir
– Extracts events from files appearing in a specified (local) directory
§ HTTP Source
– Retrieves events from HTTP requests
§ Kafka
– Retrieves events by consuming messages from a Kafka topic
* Asynchronous sources do not guarantee that events will be delivered
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-16
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-17
Some Interesting Built-In Flume Sinks

§ Null
– Discards all events (Flume equivalent of /dev/null)
§ Logger
– Logs event to INFO level using SLF4J*
§ IRC
– Sends event to a specified Internet Relay Chat channel
§ HDFS
– Writes event to a file in the specified directory in HDFS
§ Kafka
– Sends event as a message to a Kafka topic
§ HBaseSink
– Stores event in HBase
*SLF4J: Simple Logging Façade for Java
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-18
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-19
Built-In Flume Channels

§ Memory
– Stores events in the machine’s RAM
– Extremely fast, but not reliable (memory is volatile)
§ File
– Stores events on the machine’s local disk
– Slower than RAM, but more reliable (data is written to disk)
§ Kafka
– Uses Kafka as a scalable, reliable, and highly available channel between
any source and sink type

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-20
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-21
Flume Agent Configuration File

§ Configure Flume agents through a Java properties file


– You can configure multiple agents in a single file
§ The configuration file uses hierarchical references
– Assign each component a user-defined ID
– Use that ID in the names of additional properties

# Define sources, sinks, and channel for agent named 'agent1'


agent1.sources = mysource
agent1.sinks = mysink
agent1.channels = mychannel

# Sets a property "foo" for the source associated with agent1


agent1.sources.mysource.foo = bar

# Sets a property "baz" for the sink associated with agent1


agent1.sinks.mysink.baz = bat

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-22
Example: Configuring Flume Components (1)

§ Example: Configure a Flume agent to collect data from remote spool directories and save to HDFS

[Diagram: agent1 consists of a spooldir source (src1) reading from /var/flume/incoming, a memory channel (ch1), and an HDFS sink (sink1) writing to /loudacre/logdata in HDFS on the Hadoop cluster]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-23
Example: Configuring Flume Components (2)

agent1.sources = src1
agent1.sinks = sink1
agent1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/flume/incoming
agent1.sources.src1.channels = ch1        # connects source and channel

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logdata
agent1.sinks.sink1.channel = ch1          # connects sink and channel

§ Properties vary by component type (source, channel, and sink)


– Properties also vary by subtype (such as netcat source, syslog source)
– See the Flume user guide for full details on configuration

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-24
Aside: HDFS Sink Configuration

§ Path may contain patterns based on event headers, such as timestamp


§ The HDFS sink writes uncompressed SequenceFiles by default
– Specifying a codec will enable compression

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logdata/%y-%m-%d
agent1.sinks.sink1.hdfs.codeC = snappy
agent1.sinks.sink1.channel = ch1

§ Setting fileType parameter to DataStream writes raw data


– Can also specify a file extension, if desired
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/logdata/%y-%m-%d
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.fileSuffix = .txt
agent1.sinks.sink1.channel = ch1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-25
Starting a Flume Agent

§ Typical command line invocation


– The --name argument must match the agent’s name in the
configuration file
– Setting root logger as shown will display log messages in the terminal

$ flume-ng agent \
--conf /etc/flume-ng/conf \
--conf-file /path/to/flume.conf \
--name agent1 \
-Dflume.root.logger=INFO,console

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-26
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-27
Essential Points

§ Apache Flume is a high-performance system for data collection


– Scalable, extensible, and reliable
§ A Flume agent manages the sources, channels, and sinks
– Sources retrieve event data from its origin
– Channels buffer events between the source and sink
– Sinks send the event to its destination
§ The Flume agent is configured using a properties file
– Give each component a user-defined ID
– Use this ID to define properties of that component

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-28
Bibliography

The following offer more information on topics discussed in this chapter


§ Flume User Guide
– http://tiny.cloudera.com/adcc06a

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-29
Chapter Topics

Capturing Data with Apache Flume

§ What Is Apache Flume?


§ Basic Architecture
§ Sources
§ Sinks
§ Channels
§ Configuration
§ Essential Points
§ Hands-On Exercise: Collect Web Server Logs with Apache Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-30
Hands-On Exercise: Collect Web Server Logs with Apache Flume

§ In this exercise, you will


– Run an Apache Flume agent to ingest web server log data into HDFS
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 16-31
Integrating Apache Flume and Apache
Kafka
Chapter 17
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-2
Integrating Apache Flume and Apache Kafka

In this chapter you will learn


§ What to consider when choosing between Apache Flume and Apache
Kafka for a use case
§ How Flume and Kafka can work together
§ How to configure a Kafka channel, sink, or source in Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-3
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-4
Should I Use Kafka or Flume?

§ Both Flume and Kafka are widely used for data ingest
– Although these tools differ, their functionality has some overlap
– Some use cases could be implemented with either Flume or Kafka
§ How do you determine which is a better choice for your use case?

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-5
Characteristics of Flume

§ Flume is efficient at moving data from a single source into Hadoop


– Offers sinks that write to HDFS, an HBase or Kudu table, or a Solr index
– Easily configured to support common scenarios, without writing code
– Can also process and transform data during the ingest process

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-6
Characteristics of Kafka

§ Kafka is a publish-subscribe messaging system


– Offers more flexibility than Flume for connecting multiple systems
– Provides better durability and fault tolerance than Flume
– Often requires writing code for producers and/or consumers
– Has no direct support for processing messages or loading into Hadoop

Apache Kafka
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-7
Flafka = Flume + Kafka

§ Both systems have strengths and limitations


§ You do not necessarily have to choose between them
– You can use both when implementing your use case
§ Flafka is the informal name for Flume-Kafka integration
– It uses a Flume agent to receive messages from or send messages to
Kafka
§ It is implemented as a Kafka source, channel, and sink for Flume

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-8
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-9
Using a Flume Kafka Sink as a Producer

§ By using a Kafka sink, Flume can publish messages to a topic


§ In this example, an application uses Flume to publish application events
– The application sends data to the Flume source when events occur
– The event data is buffered in the channel until it is taken by the sink
– The Kafka sink publishes messages to a specified topic
– Any Kafka consumer can then read messages from the topic for
application events

[Diagram: an application sends messages to the Flume agent's netcat source; events flow through a memory channel to a Kafka sink, which publishes them to a topic on the Kafka cluster's brokers, where a Kafka consumer reads them]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-10
Using a Flume Kafka Source as a Consumer

§ By using a Kafka source, Flume can read messages from a topic


– It can then write them to your destination of choice using a Flume sink
§ In this example, the producer sends messages to Kafka
– The Flume agent uses a Kafka source, which acts as a consumer
– The Kafka source reads messages in a specified topic
– The message data is buffered in the channel until it is taken by the sink
– The sink then writes the data into HDFS

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-11
Using a Flume Kafka Channel

§ A Kafka channel can be used with any Flume source or sink


– Provides a scalable, reliable, high-availability channel
§ In this example, a Kafka channel buffers events
– The application sends event data to the Flume source
– The channel publishes event data as messages on a Kafka topic
– The sink receives event data and stores it to HDFS

[Diagram: an application sends event data to the Flume agent's netcat source; a Kafka channel buffers the events as messages on a Kafka topic; an HDFS sink stores the data in the Hadoop cluster]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-12
Using a Kafka Channel as a Consumer (Sourceless Channel)

§ Kafka channels can also be used without a source


– It can then write events to your destination of choice using a Flume sink
§ In this example, the Producer sends messages to Kafka brokers
– The Flume agent uses a Kafka channel, which acts as a consumer
– The Kafka channel reads messages in a specified topic
– Channel passes messages to the sink, which writes the data into HDFS

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-13
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-14
Configuring Flume with a Kafka Source

§ The table below describes some key properties of the Kafka source

Name Description
type org.apache.flume.source.kafka.KafkaSource
zookeeperConnect ZooKeeper connection string (example: zkhost:2181)
topic Name of Kafka topic from which messages will be read
groupId Unique ID to use for the consumer group (default: flume)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-15
Example: Configuring Flume with a Kafka Source (1)

§ This is the Flume configuration for the example on the previous slide
– It defines a source for reading messages from a Kafka topic

# Define names for the source, channel, and sink


agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# Define a Kafka source that reads from the calls_placed topic


# The "type" property line wraps around due to its long value
agent1.sources.source1.type =
org.apache.flume.source.kafka.KafkaSource
agent1.sources.source1.zookeeperConnect = localhost:2181
agent1.sources.source1.topic = calls_placed
agent1.sources.source1.channels = channel1

Note: file continues on next slide

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-16
Example: Configuring Flume with a Kafka Source (2)

§ The remaining portion of the file configures the channel and sink

# Define the properties of our channel


agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 1000

# Define the sink that writes call data to HDFS


agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/training/calls_placed
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.fileSuffix = .csv
agent1.sinks.sink1.channel = channel1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-17
Configuring Flume with a Kafka Sink

§ The table below describes some key properties of the Kafka sink

Name Description
type Must be set to org.apache.flume.sink.kafka.KafkaSink
brokerList Comma-separated list of brokers (format host:port) to contact
topic The topic in Kafka to which the messages will be published
batchSize How many messages to process in one batch

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-18
Example: Configuring Flume with a Kafka Sink (1)

§ This is the Flume configuration for the example on the previous slide

# Define names for the source, channel, and sink


agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# Define the properties of the source, which receives event data


agent1.sources.source1.type = netcat
agent1.sources.source1.bind = localhost
agent1.sources.source1.port = 12345
agent1.sources.source1.channels = channel1

# Define the properties of the channel


agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 1000

Note: file continues on next slide

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-19
Example: Configuring Flume with a Kafka Sink (2)

§ The remaining portion of the configuration file sets up the Kafka sink

# Define the Kafka sink, which publishes to the app_event topic


agent1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.sink1.topic = app_events
agent1.sinks.sink1.brokerList = localhost:9092
agent1.sinks.sink1.batchSize = 20
agent1.sinks.sink1.channel = channel1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-20
Configuring Flume with a Kafka Channel

§ The table below describes some key properties of the Kafka channel

Name Description
type org.apache.flume.channel.kafka.KafkaChannel
zookeeperConnect ZooKeeper connection string (example: zkhost:2181)
brokerList Comma-separated list of brokers (format host:port) to contact
topic Name of Kafka topic from which messages will be read (optional, default=flume-channel)
parseAsFlumeEvent Set to false for sourceless configuration (optional, default=true)
readSmallestOffset Set to true to read from the beginning of the Kafka topic (optional, default=false)
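§ Putting these properties together, the following is a minimal sketch (not a tested recipe) of a Kafka channel placed between a netcat source and an HDFS sink; the agent and component names, topic name, HDFS path, and host:port values are illustrative assumptions only

# Sketch: Kafka channel between a netcat source and an HDFS sink
# (names, topic, path, and addresses below are assumptions; adjust for your environment)
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 12345
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.ch1.brokerList = localhost:9092
agent1.channels.ch1.zookeeperConnect = localhost:2181
agent1.channels.ch1.topic = flume-buffer

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /loudacre/events
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1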

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-21
Example: Configuring a Sourceless Kafka Channel (1)

§ This is the Flume configuration for the sourceless Kafka channel example described earlier

# Define names for the channel and sink (a sourceless channel needs no source)

agent1.channels = channel1
agent1.sinks = sink1

Note: file continues on next slide

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-22
Example: Configuring a Sourceless Kafka Channel (2)

§ This is the Flume configuration for the example on the previous slide

# Define the properties of the Kafka channel


# which reads from the calls_placed topic
agent1.channels.channel1.type =
org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.channel1.topic = calls_placed
agent1.channels.channel1.brokerList = localhost:9092
agent1.channels.channel1.zookeeperConnect = localhost:2181
agent1.channels.channel1.parseAsFlumeEvent = false

# Define the sink that writes data to HDFS


agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path = /user/training/calls_placed
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.fileSuffix = .csv
agent1.sinks.sink1.channel = channel1

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-23
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume to
Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-24
Essential Points

§ Flume and Kafka are distinct systems with different designs


– You must weigh the advantages and disadvantages of each when
selecting the best tool for your use case
§ Flume and Kafka can be combined
– Flafka is the informal name for Flume components integrated with Kafka
– You can read messages from a topic using a Kafka channel or Kafka
source
– You can publish messages to a topic using a Kafka sink
– A Kafka channel provides a reliable, high-availability alternative to a
memory or file channel

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-25
Bibliography

The following offer more information on topics discussed in this chapter


§ Cloudera documentation on using Flume with Kafka
– http://tiny.cloudera.com/flafkadoc
§ Flafka: Apache Flume Meets Apache Kafka for Event Processing
– http://tiny.cloudera.com/kmc02a
§ Designing Fraud-Detection Architecture That Works Like Your Brain Does
– http://tiny.cloudera.com/kmc02b

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-26
Chapter Topics

Integrating Apache Flume and


Apache Kafka

§ Overview
§ Use Cases
§ Configuration
§ Essential Points
§ Hands-On Exercise: Send Web Server Log Messages from Apache Flume
to Apache Kafka

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-27
Hands-On Exercise: Send Web Server Log Messages from Apache
Flume to Apache Kafka
§ In this exercise, you will
– Configure a Flume agent using a Kafka sink to produce Kafka messages
from data that was received by a Flume source
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-28
Apache Spark Streaming: Introduction
to DStreams
Chapter 18
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-2
Apache Spark Streaming: Introduction to DStreams

In this chapter you will learn


§ The features and typical use cases for Apache Spark Streaming
§ How to write Spark Streaming applications
§ How to create and operate on file and socket-based DStreams

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-3
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-4
What Is Spark Streaming?

§ An extension of core Spark


§ Provides real-time processing of stream data
§ Versions 1.3 and later support Java, Scala, and Python
– Prior versions did not support Python

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-5
Why Spark Streaming?

§ Many big data applications need to process large data streams in real time,
such as
– Continuous ETL
– Website monitoring
– Fraud detection
– Ad monetization
– Social media analysis
– Financial market trends

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-6
Spark Streaming Features

§ Second-scale latencies
§ Scalability and efficient fault tolerance
§ “Once and only once” processing
§ Integrates batch and real-time processing
§ Easy to develop
– Uses Spark’s high-level API

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-7
Spark Streaming Overview

§ Divide up data stream into batches of n seconds


– Called a DStream (Discretized Stream)
§ Process each batch in Spark as an RDD
§ Return results of RDD operations in batches
[Diagram: a live data stream is divided by Spark Streaming into a DStream (RDDs containing batches of n seconds of data), which are then processed by Spark]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-8
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-9
Example: Streaming Request Count (Scala Overview)
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-10
Example: Configuring StreamingContext
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ A StreamingContext is the main entry point for Spark Streaming apps
§ Equivalent to SparkContext in core Spark
§ Configured with the same parameters as a SparkContext, plus a batch duration—an instance of Milliseconds, Seconds, or Minutes
§ Named ssc by convention

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-11
Streaming Example: Creating a DStream
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ Get a DStream (“Discretized Stream”) from a streaming data source, for example, text from a socket

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-12
Streaming Example: DStream Transformations
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ DStream operations are applied to each batch RDD in the stream
§ Similar to RDD operations—filter, map, reduce, join, and so on

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-13
Streaming Example: DStream Result Output
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ Print out the first 10 elements of each RDD

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-14
Streaming Example: Starting the Streams
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

§ start: starts the execution of all DStreams
§ awaitTermination: waits for all background threads to complete before ending the main thread

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-15
Streaming Example: Python versus Scala
Language: Python
if __name__ == "__main__":

    sc = SparkContext()
    ssc = StreamingContext(sc,2)

    mystream = ssc.socketTextStream(hostname, port)
    userreqs = mystream \
        .map(lambda line: (line.split(' ')[2],1)) \
        .reduceByKey(lambda v1,v2: v1+v2)

    userreqs.pprint()

    ssc.start()
    ssc.awaitTermination()

§ Differences from Scala: the batch duration is given as a number of seconds rather than a Duration (Scala: val ssc = new StreamingContext(sc,Seconds(2))), and results are printed with pprint() rather than print()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-16
Streaming Example: Streaming Request Count (Recap)
Language: Scala
object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))
val mystream = ssc.socketTextStream(hostname, port)
val userreqs = mystream
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-17
DStreams

§ A DStream is a sequence of RDDs representing a data stream

[Diagram: live data arriving over time is divided at intervals t1, t2, t3, … into batches; each batch becomes an RDD (RDD @ t1, RDD @ t2, RDD @ t3, …), and the sequence of these RDDs is the DStream]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-18
Streaming Example Output (1)

-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...

§ Output starts 2 seconds after ssc.start (time interval t1)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-19
Streaming Example Output (2)

-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
-------------------------------------------
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)
...

§ t2: 2 seconds later…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-20
Streaming Example Output (3)

-------------------------------------------
Time: 1401219545000 ms
-------------------------------------------
(23713,2)
(53,2)
(24433,2)
(127,2)
(93,2)
...
-------------------------------------------
Time: 1401219547000 ms
-------------------------------------------
(42400,2)
(24996,2)
(97464,2)
(161,2)
(6011,2)
...
-------------------------------------------
Time: 1401219549000 ms
-------------------------------------------
(44390,2)
(48712,2)
(165,2)
(465,2)
(120,2)
...

§ t3: 2 seconds later… continues until termination

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-21
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-22
DStream Data Sources

§ DStreams are defined for a given input stream (such as a Unix socket)
– Created by the Streaming context
ssc.socketTextStream(hostname, port)
– Similar to how RDDs are created by the Spark context
§ Out-of-the-box data sources
– Network
– Sockets
– Services such as Flume, Akka Actors, Kafka, ZeroMQ, or Twitter
– Files
– Monitors an HDFS directory for new content (see the sketch below)
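§ As a minimal sketch of the file-based case (assuming ssc is a StreamingContext as in the earlier examples, and that the directory shown is an illustrative path only), textFileStream returns a DStream of the lines of any new text files that appear in the monitored directory

Language: Scala
// Sketch: monitor an HDFS directory for new text files
// The path /loudacre/incoming is an assumption, not a required location
val fileLines = ssc.textFileStream("/loudacre/incoming")
fileLines.print()   // print the first 10 lines of each batch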

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-23
DStream Operations

§ DStream operations are applied to every RDD in the stream


– Executed once per duration
§ Two types of DStream operations
– Transformations
– Create a new DStream from an existing one
– Output operations
– Write data (for example, to a file system, database, or console)
– Similar to RDD actions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-24
DStream Transformations (1)

§ Many RDD transformations are also available on DStreams


– Regular transformations such as map, flatMap, filter
– Pair transformations such as reduceByKey, groupByKey, join
§ What if you want to do something else?
– transform(function)
– Creates a new DStream by executing function on RDDs in the
current DStream
Language: Scala
val distinctDS =
myDS.transform(rdd => rdd.distinct())

Language: Python
distinctDS =
myDS.transform(lambda rdd: rdd.distinct())

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-25
DStream Transformations (2)

Language: Scala

[Diagram: each batch RDD of the logs DStream contains lines of log data]

userreqs = logs.map(line => (line.split(' ')(2),1))

[Each batch RDD of the userreqs DStream contains (userID,1) pairs, such as (user002,1), (user011,1), (user012,1), …]

reqcounts = userreqs.reduceByKey((x,y) => x+y)

[Each batch RDD of the reqcounts DStream contains (userID,count) pairs, such as (user002,5), (user710,9), (user022,4), …]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-26
DStream Output Operations

§ Console output
– print (Scala) / pprint (Python) prints out the first 10 elements of
each RDD
– Optionally pass an integer to print another number of elements
§ File output
– saveAsTextFiles saves data as text
– saveAsObjectFiles saves as serialized object files (SequenceFiles)
§ Executing other functions
– foreachRDD(function) performs a function on each RDD in the DStream (see the sketch below)
– Function input parameters
– The RDD on which to perform the function
– The time stamp of the RDD (optional)
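§ For example, a minimal sketch (assuming userreqs is the pair DStream from the earlier example) showing foreachRDD with the optional time stamp parameter

Language: Scala
// Sketch: report how many (userID,count) records each batch RDD contains
userreqs.foreachRDD((rdd, time) => {
  println(s"Batch at $time contained ${rdd.count()} records")
})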

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-27
Saving DStream Results as Files
Language: Scala
val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((v1,v2) => v1+v2)
userreqs.print()
userreqs.saveAsTextFiles("…/outdir/reqcounts")

[Diagram: each batch RDD of the userreqs DStream, containing (userID,count) pairs such as (user002,5), (user710,9), (user808,8), is saved to its own HDFS directory named reqcounts-<timestamp>/, each containing part files (part-00000, …) with the pairs for that batch]
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-28
Scala Example: Find Top Users (1)


val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((v1,v2) => v1+v2)
userreqs.saveAsTextFiles(path)

val sortedreqs = userreqs


.map(pair => pair.swap)
.transform(rdd => rdd.sortByKey(false))

sortedreqs.foreachRDD((rdd,time) => {
println("Top users @ " + time)
rdd.take(5).foreach(
pair => printf("User: %s (%s)\n",pair._2, pair._1))
}
)

ssc.start()
ssc.awaitTermination()

§ Transform each RDD: swap userID/count, sort by count

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-29
Scala Example: Find Top Users (2)


val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((v1,v2) => v1+v2)
userreqs.saveAsTextFiles(path)

val sortedreqs = userreqs


.map(pair => pair.swap)
.transform(rdd => rdd.sortByKey(false))

sortedreqs.foreachRDD((rdd,time) => {
println("Top users @ " + time)
rdd.take(5).foreach(
pair => printf("User: %s (%s)\n",pair._2, pair._1))
}
)

ssc.start()
ssc.awaitTermination()

§ Print out the top 5 users as “User: userID (count)”

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-30
Python Example: Find Top Users (1)

def printTop5(r,t):
    print "Top users @",t
    for count,user in r.take(5):
        print "User:",user,"("+str(count)+")"

userreqs = mystream \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
userreqs.saveAsTextFiles("streamreq/reqcounts")

sortedreqs=userreqs \
.map(lambda (k,v): (v,k)) \
.transform(lambda rdd: rdd.sortByKey(False))

sortedreqs.foreachRDD(lambda time,rdd: printTop5(rdd,time))

ssc.start()
ssc.awaitTermination()

§ Transform each RDD: swap userID/count, sort by count

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-31
Python Example: Find Top Users (2)

def printTop5(r,t):
    print "Top users @",t
    for count,user in r.take(5):
        print "User:",user,"("+str(count)+")"

userreqs = mystream \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
userreqs.saveAsTextFiles("streamreq/reqcounts")

sortedreqs=userreqs \
.map(lambda (k,v): (v,k)) \
.transform(lambda rdd: rdd.sortByKey(False))

sortedreqs.foreachRDD(lambda time,rdd: printTop5(rdd,time))

ssc.start()
ssc.awaitTermination()

§ Print out the top 5 users as “User: userID (count)”

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-32
Example: Find Top Users—Output (1)

Top users @ 1401219545000 ms
User: 16261 (8)
User: 22232 (7)
User: 66652 (4)
User: 21205 (2)
User: 24358 (2)

§ t1 (2 seconds after ssc.start)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-33
Example: Find Top Users—Output (2)

Top users @ 1401219545000 ms
User: 16261 (8)
User: 22232 (7)
User: 66652 (4)
User: 21205 (2)
User: 24358 (2)
Top users @ 1401219547000 ms
User: 53667 (4)
User: 35600 (4)
User: 62 (2)
User: 165 (2)
User: 40 (2)

§ t2 (2 seconds later)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-34
Example: Find Top Users—Output (3)

Top users @ 1401219545000 ms
User: 16261 (8)
User: 22232 (7)
User: 66652 (4)
User: 21205 (2)
User: 24358 (2)
Top users @ 1401219547000 ms
User: 53667 (4)
User: 35600 (4)
User: 62 (2)
User: 165 (2)
User: 40 (2)
Top users @ 1401219549000 ms
User: 31 (12)
User: 6734 (10)
User: 14986 (10)
User: 72760 (2)
User: 65335 (2)
Top users @ 1401219551000 ms

§ t3 (2 seconds later), continues until termination…
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-35
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-36
Building and Running Spark Streaming Applications

§ Building Spark Streaming applications


– Link with the main Spark Streaming library (included with Spark)
– Link with additional Spark Streaming libraries if necessary, for example, Kafka, Flume, Twitter (see the sketch below)
§ Running Spark Streaming applications
– Use at least two threads if running locally
– Adding operations after the Streaming context has been started is
unsupported
– Stopping and restarting the Streaming context is unsupported
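§ As a hedged sketch of what the linking step might look like in an sbt build file; the artifact names and versions below are assumptions and must be matched to your Spark and Scala versions

// build.sbt (sketch): versions are assumptions, match them to your cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.6.0" % "provided",
  // only needed if you use the Kafka or Flume integrations
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0",
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.0"
)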

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-37
Using Spark Streaming with Spark Shell

§ Spark Streaming is designed for batch applications, not interactive use


§ The Spark shell can be used for limited testing
– Not intended for production use!
– Be sure to run the shell on a cluster with at least 2 cores, or locally with
at least 2 threads

$ spark-shell --master yarn

$ pyspark --master yarn

or

$ spark-shell --master 'local[2]'

$ pyspark --master 'local[2]'

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-38
The Spark Streaming Application UI

The Streaming tab in the Spark App UI provides basic metrics about the application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-39
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-40
Essential Points

§ Spark Streaming is an extension of core Spark to process real-time


streaming data
§ DStreams are discretized streams of streaming data, batched into RDDs by
time intervals
– Operations applied to DStreams are applied to each RDD
– Transformations produce new DStreams by applying a function to each
RDD in the base DStream

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-41
Chapter Topics

Apache Spark Streaming:


Introduction to DStreams

§ Apache Spark Streaming Overview


§ Example: Streaming Request Count
§ DStreams
§ Developing Streaming Applications
§ Essential Points
§ Hands-On Exercise: Write an Apache Spark Streaming Application

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-42
Hands-On Exercise: Write an Apache Spark Streaming
Application
§ In this exercise, you will
– Write a Spark Streaming application to process web log data
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 18-43
Apache Spark Streaming: Processing
Multiple Batches
Chapter 19
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-2
Apache Spark Streaming: Processing Multiple Batches

In this chapter you will learn


§ How to return data from a specific time period in a DStream
§ How to perform analysis using sliding window operations on a DStream
§ How to maintain state values across all time periods in a DStream

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-3
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-4
Multi-Batch DStream Operations

§ DStreams consist of a series of “batches” of data


– Each batch is an RDD
§ Basic DStream operations analyze each batch individually
§ Advanced operations allow you to analyze data collected across batches
– Slice: allows you to operate on a collection of batches
– State: allows you to perform cumulative operations
– Windows: allows you to aggregate data across a sliding time period

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-5
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-6
Time Slicing

§ DStream.slice(fromTime, toTime)
– Returns a collection of batch RDDs based on data from the stream
§ StreamingContext.remember(duration)
– By default, input data is automatically cleared when no RDD’s lineage depends on it
– slice will return no data for time periods whose data has already been cleared
– Use remember to keep data around longer (see the sketch below)
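§ A minimal sketch of how these calls fit together, assuming ssc and mystream are defined as in the earlier examples, the relevant org.apache.spark.streaming classes (Time, Minutes) are imported, and the millisecond timestamps are purely illustrative

Language: Scala
// Sketch: keep the last 10 minutes of batch RDDs so they remain available to slice
ssc.remember(Minutes(10))

// Later, after ssc.start(), retrieve the batch RDDs for a given time range
val from = Time(1401219545000L)
val to   = Time(1401219549000L)
val batchRDDs = mystream.slice(from, to)      // Seq[RDD[String]]
val totalLines = batchRDDs.map(_.count()).sum
println("Lines received in range: " + totalLines)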

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-7
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-8
State DStreams (1)

§ Use the updateStateByKey function to create a state DStream


§ Example: Total request count by User ID

                      t1
Requests              (user001,5)
                      (user102,1)
                      (user009,2)

Total Requests        (user001,5)
(State)               (user102,1)
                      (user009,2)
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-9
State DStreams (2)

§ Use the updateStateByKey function to create a state DStream


§ Example: Total request count by User ID

                      t1               t2
Requests              (user001,5)      (user001,4)
                      (user102,1)      (user012,2)
                      (user009,2)      (user921,5)

Total Requests        (user001,5)      (user001,9)
(State)               (user102,1)      (user102,1)
                      (user009,2)      (user009,2)
                                       (user012,2)
                                       (user921,5)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-10
State DStreams (3)

§ Use the updateStateByKey function to create a state DStream


§ Example: Total request count by User ID

                      t1               t2               t3
Requests              (user001,5)      (user001,4)      (user102,7)
                      (user102,1)      (user012,2)      (user012,3)
                      (user009,2)      (user921,5)      (user660,4)

Total Requests        (user001,5)      (user001,9)      (user001,9)
(State)               (user102,1)      (user102,1)      (user102,8)
                      (user009,2)      (user009,2)      (user009,2)
                                       (user012,2)      (user012,5)
                                       (user921,5)      (user921,5)
                                                        (user660,4)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-11
Python Example: Total User Request Count (1)
Language: Python

userreqs = logs \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)

ssc.checkpoint("checkpoints")

totalUserreqs = userreqs \
.updateStateByKey(lambda newCounts, state: \
updateCount(newCounts, state))
totalUserreqs.pprint()

ssc.start()
ssc.awaitTermination()

§ Set checkpoint directory to enable checkpointing. Required to prevent infinite lineages.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-12
Python Example: Total User Request Count (2)
Language: Python

userreqs = logs \
.map(lambda line: (line.split(' ')[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)

ssc.checkpoint("checkpoints")

totalUserreqs = userreqs \
.updateStateByKey(lambda newCounts, state: \
updateCount(newCounts, state))
totalUserreqs.pprint()

ssc.start()
ssc.awaitTermination()

§ Compute a state DStream based on the previous states, updated with the values from the current batch of request counts
§ The updateCount function is shown on the next slide…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-13
Python Example: Total User Request Count—Update Function

Language: Python
def updateCount(newCounts, state):
    if state == None: return sum(newCounts)
    else: return state + sum(newCounts)

§ The function receives the new values for a key and its current state (or None), and returns the new state

§ Example at t2
– user001: updateCount([4],5) → 9
– user012: updateCount([2],None) → 2
– user921: updateCount([5],None) → 5

                      t1               t2
Requests              (user001,5)      (user001,4)
                      (user102,1)      (user012,2)
                      (user009,2)      (user921,5)

Total Requests        (user001,5)      (user001,9)
(State)               (user102,1)      (user102,1)
                      (user009,2)      (user009,2)
                                       (user012,2)
                                       (user921,5)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-14
Scala Example: Total User Request Count (1)
Language: Scala

val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

ssc.checkpoint("checkpoints")
val totalUserreqs = userreqs.updateStateByKey(updateCount)
totalUserreqs.print()

ssc.start()
ssc.awaitTermination()

§ Set checkpoint directory to enable checkpointing. Required to prevent infinite lineages.

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-15
Scala Example: Total User Request Count (2)
Language: Scala

val userreqs = logs
.map(line => (line.split(' ')(2),1))
.reduceByKey((x,y) => x+y)

ssc.checkpoint("checkpoints")
val totalUserreqs = userreqs.updateStateByKey(updateCount)
totalUserreqs.print()

ssc.start()
ssc.awaitTermination()

§ Compute a state DStream based on the previous states, updated with the values from the current batch of request counts
§ The updateCount function is shown on the next slide…

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-16
Scala Example: Total User Request Count—Update Function

Language: Scala
def updateCount = (newCounts: Seq[Int], state: Option[Int]) => {
  val newCount = newCounts.foldLeft(0)(_ + _)
  val previousCount = state.getOrElse(0)
  Some(newCount + previousCount)
}

§ The function receives the new values for a key and its current state (or None), and returns the new state

§ Example at t2
– user001: updateCount([4],Some(5)) → 9
– user012: updateCount([2],None) → 2
– user921: updateCount([5],None) → 5

                      t1               t2
Requests              (user001,5)      (user001,4)
                      (user102,1)      (user012,2)
                      (user009,2)      (user921,5)

Total Requests        (user001,5)      (user001,9)
(State)               (user102,1)      (user102,1)
                      (user009,2)      (user009,2)
                                       (user012,2)
                                       (user921,5)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-17
Example: Maintaining State—Output

-------------------------------------------
Time: 1401219545000 ms    (t1)
-------------------------------------------
(user001,5)
(user102,1)
(user009,2)
-------------------------------------------
Time: 1401219547000 ms    (t2)
-------------------------------------------
(user001,9)
(user102,1)
(user009,2)
(user012,2)
(user921,5)
-------------------------------------------
Time: 1401219549000 ms    (t3)
-------------------------------------------
(user001,9)
(user102,8)
(user009,2)
(user012,5)
(user921,5)
(user660,4)
-------------------------------------------
Time: 1401219551000 ms
-------------------------------------------

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-18
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-19
Sliding Window Operations (1)

§ Regular DStream operations execute for each RDD based on SSC duration
§ “Window” operations span RDDs over a given duration
– For example, reduceByKeyAndWindow, countByWindow

[Diagram: each box in the regular DStream represents one SSC batch duration (such as 2 seconds); reduceByKeyAndWindow(fn, window-duration) produces a window DStream in which each RDD covers a whole window duration of batches]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-20
Sliding Window Operations (2)

§ By default, window operations will execute with an “interval” the same as the SSC duration
– For a two-second batch duration, the window will “slide” every two seconds

[Diagram: with a batch size of Seconds(2), reduceByKeyAndWindow(fn, Seconds(12)) produces a window DStream whose RDDs each cover the last 12 seconds of batches, sliding forward every 2 seconds]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-21
Sliding Window Operations (3)

§ You can specify a different slide duration (must be a multiple of the SSC duration)

[Diagram: with a batch size of Seconds(2), reduceByKeyAndWindow(fn, Seconds(12), Seconds(4)) produces a window DStream whose RDDs each cover the last 12 seconds of batches, sliding forward every 4 seconds]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-22
Scala Example: Count and Sort User Requests by Window (1)
Language: Scala

val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)

val reqcountsByWindow = logs.
map(line => (line.split(' ')(2),1)).
reduceByKeyAndWindow((v1: Int, v2: Int) => v1+v2,
Minutes(5),Seconds(30))

val topreqsByWindow=reqcountsByWindow.
map(pair => pair.swap).
transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()

ssc.start()
ssc.awaitTermination()

§ Every 30 seconds, count requests by user over the last five minutes

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-23
Scala Example: Count and Sort User Requests by Window (2)
Language: Scala

val ssc = new StreamingContext(new SparkConf(), Seconds(2))
val logs = ssc.socketTextStream(hostname, port)

val reqcountsByWindow = logs.
map(line => (line.split(' ')(2),1)).
reduceByKeyAndWindow((v1: Int, v2: Int) => v1+v2,
Minutes(5),Seconds(30))

val topreqsByWindow=reqcountsByWindow.
map(pair => pair.swap).
transform(rdd => rdd.sortByKey(false))
topreqsByWindow.map(pair => pair.swap).print()

ssc.start()
ssc.awaitTermination()

§ Sort and print the top users for every RDD (every 30 seconds)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-24
Python Example: Count and Sort User Requests by Window (1)
Language: Python

sc = SparkContext()
ssc = StreamingContext(sc, 2)
logs = ssc.socketTextStream(hostname, port)

reqcountsByWindow = logs. \
map(lambda line: (line.split(' ')[2],1)).\
reduceByKeyAndWindow(lambda v1,v2: v1+v2, None, 5*60, 30)

topreqsByWindow=reqcountsByWindow. \
map(lambda (k,v): (v,k)). \
transform(lambda rdd: rdd.sortByKey(False))
topreqsByWindow.map(lambda (k,v): (v,k)).pprint()

ssc.start()
ssc.awaitTermination()

§ Every 30 seconds, count requests by user over the last five minutes

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-25
Python Example: Count and Sort User Requests by Window (2)
Language: Python

sc = SparkContext()
ssc = StreamingContext(sc, 2)
logs = ssc.socketTextStream(hostname, port)

reqcountsByWindow = logs. \
map(lambda line: (line.split(' ')[2],1)).\
reduceByKeyAndWindow(lambda v1,v2: v1+v2, None, 5*60, 30)

topreqsByWindow=reqcountsByWindow. \
map(lambda (k,v):(v,k)). \
transform(lambda rdd: rdd.sortByKey(False))
topreqsByWindow.map(lambda (k,v):(v,k)).pprint()

ssc.start()
ssc.awaitTermination()

§ Sort and print the top users for every RDD (every 30 seconds)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-26
Chapter Topics

Apache Spark Streaming: Processing


Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-27
Essential Points

§ You can get a “slice” of data from a stream based on absolute start and
end times
– For example, all data received between midnight October 1, 2016 and
midnight October 2, 2016
§ You can update state based on prior state
– For example, total requests by user
§ You can perform operations on “windows” of data
– For example, number of logins in the last hour

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-28
Chapter Topics

Apache Spark Streaming: Processing Multiple Batches

§ Multi-Batch Operations
§ Time Slicing
§ State Operations
§ Sliding Window Operations
§ Essential Points
§ Hands-On Exercise: Process Multiple Batches with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-29
Hands-On Exercise: Process Multiple Batches With Apache Spark
Streaming
§ In this exercise, you will
– Extend an Apache Spark Streaming application to perform multi-batch
analysis on web log data
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 19-30
Apache Spark Streaming: Data Sources
Chapter 20
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-2
Apache Spark Streaming: Data Sources

In this chapter you will learn


§ How data sources are integrated with Spark Streaming
§ How receiver-based integration differs from direct integration
§ How Apache Flume and Apache Kafka are integrated with Spark Streaming
§ How to use direct Kafka integration to create a DStream

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-3
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-4
Spark Streaming Data Sources

§ Basic data sources


– Network socket
– Text file
§ Advanced data sources
– Kafka
– Flume
– Twitter
– ZeroMQ
– Kinesis
– MQTT
– and more coming in the future…
§ To use advanced data sources, download (if necessary) and link to the
required library
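For example, with an sbt build the Kafka and Flume integration libraries
might be linked as shown below (artifact names follow the Spark 1.x
convention; the version shown is an assumption and should match your Spark
version).
Language: Scala

// Hypothetical build.sbt fragment; the version number is a placeholder
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0",
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.0"
)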

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-5
Receiver-Based Data Sources

§ Most data sources are based on receivers


– Network data is received on a worker node
– Receiver distributes data (RDDs) to the cluster as partitions

[Diagram: a DStream is a sequence of RDDs, one per batch interval (t1, t2, t3,
t4); a receiver running in one executor reads from the network data source and
distributes each RDD's partitions (rdd_0_0, rdd_0_1) across the executors]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-6
Receiver-Based Replication

§ Spark Streaming RDD replication is enabled by default


– Data is copied to another node as it is received (see the sketch below)

[Diagram: the executor running the receiver holds partitions rdd_0_0 and
rdd_0_1; each partition is also replicated to another executor as it is
received from the network data source]
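The replication comes from the receiver's storage level: most receiver-based
sources default to a replicated ("_2") storage level, and it can also be set
explicitly. A minimal sketch, assuming ssc is an existing StreamingContext
and hostname and port are placeholders.
Language: Scala

import org.apache.spark.storage.StorageLevel

// Sketch only: `ssc`, `hostname`, and `port` are assumed to exist already.
// The "_2" suffix means each received block is stored on two executors,
// which provides the replication illustrated above.
val logs = ssc.socketTextStream(hostname, port,
  StorageLevel.MEMORY_AND_DISK_SER_2)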

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-7
Receiver-Based Fault Tolerance

§ If the receiver fails, Spark will restart it on a different executor


– Potential for brief loss of incoming data (see the write-ahead log sketch below)

[Diagram: the executor hosting the receiver fails; Spark restarts the receiver
on a different executor, and the replicated partitions (rdd_0_0, rdd_0_1) on
the surviving executors remain available]
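One common way to reduce that risk is to enable the receiver write-ahead log.
This is a sketch under the assumption that a reliable checkpoint directory
(such as one on HDFS) is available; the path shown is a placeholder.
Language: Scala

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: with the write-ahead log enabled, received data is journaled
// to the checkpoint directory before it is acknowledged, so data that has
// been received but not yet processed is not lost if the receiver fails.
val conf = new SparkConf().
  set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs:///user/training/checkpoints")  // placeholder path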

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-8
Managing Incoming Data

§ As data arrives, Spark Streaming queues jobs to process it


§ Data must be processed fast enough that the job queue does not grow
– Manage by setting spark.streaming.backpressure.enabled = true (see the
  sketch below)
§ Monitor scheduling delay and processing time in Spark UI
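For example (a sketch, not taken from the course code), backpressure is
enabled through SparkConf before the StreamingContext is created.
Language: Scala

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: with backpressure enabled, Spark Streaming dynamically limits
// the rate at which receivers deliver data, so the job queue does not grow
// without bound.
val conf = new SparkConf().
  set("spark.streaming.backpressure.enabled", "true")
val ssc = new StreamingContext(conf, Seconds(2))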

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-9
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-10
Overview: Spark Streaming with Flume

§ Two approaches to using Flume


– Push-based
– Pull-based
§ Push-based
– One Spark worker must run a network receiver on a specified node
– Configure Flume with an Avro sink to send to that receiver
§ Pull-based
– Uses a custom Spark sink for Flume (in the org.apache.spark.streaming.flume.sink package)
– Strong reliability and fault tolerance guarantees
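The two approaches correspond to two different FlumeUtils calls. This is a
minimal sketch, assuming the spark-streaming-flume library is linked and
that host and port are placeholders for wherever the receiver (push) or the
custom sink (pull) runs.
Language: Scala

import org.apache.spark.streaming.flume.FlumeUtils

// Push-based: Spark runs an Avro receiver on host:port, and Flume is
// configured with an Avro sink that sends events to it
val pushedEvents = FlumeUtils.createStream(ssc, host, port)

// Pull-based: Flume writes to the custom SparkSink on host:port, and Spark
// polls that sink for events
val pulledEvents = FlumeUtils.createPollingStream(ssc, host, port)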

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-11
Overview: Spark Streaming with Kafka

§ Apache Kafka is a fast, scalable, distributed publish-subscribe messaging
system that provides
– Durability by persisting data to disk
– Fault tolerance through replication
§ Two approaches to Spark Streaming with Kafka
– Receiver-based
– Direct (receiverless)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-12
Receiver-Based Kafka Integration (1)

§ Receiver-based
– Streams (receivers) are configured with a Kafka topic and a partition in
that topic
– To protect from data loss, enable write ahead logs (introduced in Spark
1.2)
– Scala and Java support added in Spark 1.1
– Python support added in Spark 1.3
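A minimal sketch of the receiver-based approach (the ZooKeeper quorum,
consumer group, and topic map are placeholder values); compare it with the
direct approach shown later in this chapter.
Language: Scala

import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch only: receiver-based integration connects through ZooKeeper;
// "zkhost:2181", "mygroup", and the topic map are placeholder values
val kafkaStream = KafkaUtils.createStream(ssc,
  "zkhost:2181",        // ZooKeeper quorum
  "mygroup",            // consumer group ID
  Map("mytopic" -> 1))  // topic name -> number of receiver threads

// Messages arrive as (key, value) pairs; keep just the value
val lines = kafkaStream.map(pair => pair._2)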

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-13
Receiver-Based Kafka Integration (2)

§ Receiver-based
– Kafka supports partitioning of message topics for scalability
– Receiver-based streaming allows multiple receivers, each configured for
individual topic partitions

[Diagram: two receivers, each assigned one Kafka topic partition (partition 0
and partition 1), produce separate DStreams (RDD 0 and RDD 1); the resulting
RDD partitions (rdd_0_0, rdd_1_0, rdd_1_1) are distributed across executors,
with the data coming from the Kafka brokers]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-14
Kafka Direct Integration (1)

§ Direct (also called receiverless)


– Support for efficient, zero-data-loss delivery
– Support for exactly-once semantics
– Introduced in Spark 1.3 (Scala and Java only)
– Python support in Spark 1.5

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-15
Kafka Direct Integration (2)

§ Direct (also called receiverless)


– Consumes messages in parallel
– Automatically assigns each topic partition to an RDD partition

[Diagram: with the direct approach there are no receivers; each Kafka topic
partition (partition 0, partition 1) maps to one RDD partition (rdd_0_0,
rdd_0_1), and the executors consume directly from the Kafka brokers]

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-16
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-17
Scala Example: Direct Kafka Integration (1)

import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder

object StreamingRequestCount {

def main(args: Array[String]) {


val sc = new SparkContext()
val ssc = new StreamingContext(sc,Seconds(2))

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-18
Scala Example: Direct Kafka Integration (2)

val kafkaStream = KafkaUtils.
  createDirectStream[String,String,StringDecoder,StringDecoder](ssc,
    Map("metadata.broker.list" -> "broker1:port,broker2:port"),
    Set("mytopic"))

val logs = kafkaStream.map(pair => pair._2)

val userreqs = logs.
  map(line => (line.split(' ')(2),1)).
  reduceByKey((x,y) => x+y)

userreqs.print()

ssc.start()
ssc.awaitTermination()
}
}

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-19
Python Example: Direct Kafka Integration (1)

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":

    sc = SparkContext()
    ssc = StreamingContext(sc,2)

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-20
Python Example: Direct Kafka Integration (2)

    kafkaStream = KafkaUtils.createDirectStream(ssc, ["mytopic"],
        {"metadata.broker.list": "broker1:port,broker2:port"})

    logs = kafkaStream.map(lambda (key,value): value)

    userreqs = logs \
        .map(lambda line: (line.split(' ')[2],1)) \
        .reduceByKey(lambda v1,v2: v1+v2)

    userreqs.pprint()

    ssc.start()
    ssc.awaitTermination()

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-21
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-22
Essential Points

§ Spark Streaming integrates with a number of data sources


§ Most use a receiver-based integration
– Flume, for example
§ Kafka can be integrated using a receiver-based or a direct (receiverless)
approach
– The direct approach provides strong reliability guarantees more efficiently

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-23
Chapter Topics

Apache Spark Streaming: Data Sources

§ Streaming Data Source Overview


§ Flume and Kafka Data Sources
§ Example: Using a Kafka Direct Data Source
§ Essential Points
§ Hands-On Exercise: Process Apache Kafka Messages with Apache Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-24
Hands-On Exercise: Process Apache Kafka Messages with Apache
Spark Streaming
§ In this exercise, you will
– Write a Spark Streaming application to process web logs using a direct
Kafka data source
§ Please refer to the Hands-On Exercise Manual for instructions

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 20-25
Conclusion
Chapter 21
Course Chapters
§ Introduction
§ Introduction to Apache Hadoop and the Hadoop Ecosystem
§ Apache Hadoop File Storage
§ Data Processing on an Apache Hadoop Cluster
§ Importing Relational Data with Apache Sqoop
§ Apache Spark Basics
§ Working with RDDs
§ Aggregating Data with Pair RDDs
§ Writing and Running Apache Spark Applications
§ Configuring Apache Spark Applications
§ Parallel Processing in Apache Spark
§ RDD Persistence
§ Common Patterns in Apache Spark Data Processing
§ DataFrames and Apache Spark SQL
§ Message Processing with Apache Kafka
§ Capturing Data with Apache Flume
§ Integrating Apache Flume and Apache Kafka
§ Apache Spark Streaming: Introduction to DStreams
§ Apache Spark Streaming: Processing Multiple Batches
§ Apache Spark Streaming: Data Sources
§ Conclusion
© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 21-2
Course Objectives

During this course, you have learned


§ How the Apache Hadoop ecosystem fits in with the data processing
lifecycle
§ How data is distributed, stored, and processed in a Hadoop cluster
§ How to write, configure, and deploy Apache Spark applications on a
Hadoop cluster
§ How to use the Spark shell and Spark applications to explore, process, and
analyze distributed data
§ How to process and query structured data using Spark SQL
§ How to use Spark Streaming to process a live data stream
§ How to use Apache Flume and Apache Kafka to ingest data for Spark
Streaming

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 21-3
Which Course to Take Next?

Cloudera offers a range of training courses for you and your team
§ For developers
– Designing and Building Big Data Applications
– Cloudera Training for Apache HBase
§ For system administrators
– Cloudera Administrator Training for Apache Hadoop
§ For data analysts and data scientists
– Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
– Data Science at Scale using Spark and Hadoop
§ For architects, managers, CIOs, and CTOs
– Cloudera Essentials for Apache Hadoop

© Copyright 2010-2016 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 21-4
