Spark
§ Written in Scala
– Functional programming language that runs in a JVM
§ Spark shell
– Interactive—for learning or data exploration
– Python or Scala
§ Spark applications
– For large-scale data processing
Spark Shell
§ The Spark shell provides interactive data exploration (REPL)
Spark Context
§ Every Spark application requires a Spark context
– The main entry point to the Spark API
> myData = ["Alice","Carlos","Frank","Barbara"]
> myRdd = sc.parallelize(myData)
> myRdd.take(2)
['Alice', 'Carlos']
Creating RDDs from Text Files (1)
§ For file-based RDDs, use SparkContext.textFile
– Accepts a single file, a directory of files, a wildcard list of files, or a comma-separated list of files
§ Examples
– sc.textFile("myfile.txt")
– sc.textFile("mydata/")
– sc.textFile("mydata/*.log")
– sc.textFile("myfile1.txt,myfile2.txt")
– Each line in each file is a separate record in the RDD
§ Files are referenced by absolute or relative URI
– Absolute URI:
– file:/home/training/myfile.txt
– hdfs://nnhost/loudacre/myfile.txt
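A minimal PySpark sketch of reading text files into an RDD; assumes the shell (where sc is already defined) and a hypothetical path:

logs = sc.textFile("hdfs://nnhost/loudacre/*.log")   # wildcard: one RDD over all matched files
logs.count()    # number of lines (records) across all files
logs.first()    # the first line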
Examples: Multi-RDD Transformations (1)
Examples: Multi-RDD Transformations (2)
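The original example slides are not reproduced here; a hedged PySpark sketch of typical multi-RDD transformations (data is made up):

rdd1 = sc.parallelize(["Chicago", "Boston", "Paris"])
rdd2 = sc.parallelize(["Paris", "Berlin", "Tokyo"])
rdd1.union(rdd2).collect()          # all elements of both RDDs (duplicates kept)
rdd1.intersection(rdd2).collect()   # elements present in both: ['Paris']
rdd1.subtract(rdd2).collect()       # elements only in rdd1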
Some Other General RDD Operations
§ Map phase
– Operates on one record at a time
– “Maps” each record to zero or more new records
– Examples: map, flatMap, filter, keyBy
§ Reduce phase
– Works on map output
– Consolidates multiple records
– Examples: reduceByKey, sortByKey, mean (see the sketch below)
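A short PySpark sketch of the map and reduce phases listed above (data is made up):

lines  = sc.parallelize(["alice 42", "carlos 7", "alice 13"])
words  = lines.flatMap(lambda line: line.split(" "))           # one record in, zero or more out
alices = lines.filter(lambda line: line.startswith("alice"))   # keep or drop records
byName = lines.keyBy(lambda line: line.split(" ")[0])          # (key, record) pairs
counts = byName.mapValues(lambda v: 1).reduceByKey(lambda a, b: a + b)   # consolidate by key
counts.collect()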
Example: Word Count
reduceByKey
§ The function passed to reduceByKey combines two values for the same key
– The function must be binary (takes two arguments)
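A sketch of the word count pattern in PySpark, illustrating the binary function passed to reduceByKey (the input path is hypothetical):

counts = sc.textFile("hdfs://nnhost/loudacre/frostroad.txt") \
           .flatMap(lambda line: line.split(" ")) \
           .map(lambda word: (word, 1)) \
           .reduceByKey(lambda v1, v2: v1 + v2)   # binary: combines two counts for the same word
counts.take(5)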
Pair RDD Operations
§ In addition to map and reduceByKey operations, Spark has several operations specific to pair RDDs
§ Examples
– countByKey returns a map with the count of occurrences of each key
– groupByKey groups all the values for each key in an RDD
– sortByKey sorts in ascending or descending order
– join returns an RDD containing all pairs with matching keys from two RDDs
Example: Pair RDD Operations
Example: Joining by Key
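A hedged PySpark sketch of these pair-RDD operations and a join by key (data is made up):

orders = sc.parallelize([("alice", "ipad"), ("carlos", "iphone"), ("alice", "macbook")])
cities = sc.parallelize([("alice", "Chicago"), ("carlos", "Boston")])
orders.countByKey()                            # {'alice': 2, 'carlos': 1}
orders.groupByKey().mapValues(list).collect()  # all values grouped per key
orders.sortByKey().collect()                   # ascending by default
orders.join(cities).collect()                  # [('alice', ('ipad', 'Chicago')), ...]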
Other Pair Operations
§ Some other pair operations
§ The Spark context in an application
– Named sc by convention
– Call sc.stop when the program terminates
Example: Word Count
Building a Spark Application: Scala
§ Scala or Java Spark applications must be compiled and assembled into JAR files
– The JAR file will be passed to worker nodes
§ Apache Maven is a popular build tool
– For specific setting recommendations, see the Spark Programming Guide
§ Build details will differ depending on
– Version of Hadoop (HDFS)
– Deployment platform (YARN, Mesos, Spark Standalone)
§ Consider using an Integrated Development Environment (IDE)
– IntelliJ or Eclipse are two popular examples
– Can run Spark locally in a debugger
Running a Spark Application
§ The easiest way to run a Spark application is using the spark-submit script
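A typical invocation, sketched with a hypothetical script and input path (the right master setting depends on your cluster):

$ spark-submit --master yarn-client WordCount.py hdfs://nnhost/loudacre/frostroad.txt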
Spark Application Cluster Options
– Locally
– No distributed processing
– Locally with multiple worker threads
– On a cluster
§ Narrow dependencies
– Each partition in the child RDD depends
on just one partition of the parent RDD
– No shuffle required between executors
– Can be collapsed into a single stage
– Examples: map, filter, and union
§ Wide (or shuffle) dependencies
– Child partitions depend on multiple
partitions in the parent RDD
– Defines a new stage
– Examples: reduceByKey, join
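One way to see these stage boundaries is RDD.toDebugString, sketched here on made-up data; the indented groups in its output correspond to stages:

pairs  = sc.parallelize(["a", "b", "a", "c"]).map(lambda w: (w, 1))   # narrow: no shuffle
counts = pairs.reduceByKey(lambda a, b: a + b)                        # wide: shuffle boundary
print(counts.toDebugString())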
Controlling the Level of Parallelism
§ Wide operations (such as reduceByKey) partition resulting RDDs
– More partitions = more parallel tasks
– Cluster will be under-utilized if there are too few partitions
§ You can control how many partitions are created
– Optional numPartitions parameter in the function call (see the sketch below)
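A sketch of setting numPartitions on a wide operation (file path hypothetical):

words  = sc.textFile("hdfs://nnhost/loudacre/frostroad.txt").flatMap(lambda l: l.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b, numPartitions=15)
counts.getNumPartitions()   # 15 partitions, so up to 15 parallel tasks in that stage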
Viewing Stages in the Spark Application UI
RDD Persistence
Lineage Example (1)
§ Each transformation operation creates a new child RDD
Lineage Example (2)
Lineage Example (3)
Lineage Example (4)
§ Spark keeps track of the parent RDD for each new RDD
§ To stop persisting an RDD and remove it from memory and disk
– rdd.unpersist()
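A brief persistence sketch in PySpark (path hypothetical):

logs   = sc.textFile("hdfs://nnhost/loudacre/weblogs/")
errors = logs.filter(lambda line: "ERROR" in line)
errors.persist()    # or errors.cache()
errors.count()      # first action computes and caches the partitions
errors.take(10)     # reuses the cached data instead of re-reading the files
errors.unpersist()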
§ Convenience functions for reading common data formats
– json(filename)
– parquet(filename)
– orc(filename)
– table(hive-tablename)
– jdbc(url, table, options)
Example: Creating a DataFrame from a JSON File
Example: Creating a DataFrame from a Hive/Impala Table
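A hedged sketch of both examples, assuming the sqlContext provided by the shell and hypothetical file/table names:

peopleDF   = sqlContext.read.json("hdfs://nnhost/loudacre/people.json")
accountsDF = sqlContext.read.table("accounts")   # Hive/Impala table (requires HiveContext)
peopleDF.printSchema()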
Loading from a Data Source Manually
§ You can specify settings for the DataFrameReader
– format: Specify a data source type
– option: A key/value setting for the underlying data source
– schema: Specify a schema instead of inferring it from the data source
§ Then call the generic base function load (see the sketch below)
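A sketch of manual loading; the CSV source and its options come from the third-party spark-csv package, so treat those names as an assumption:

accountsDF = (sqlContext.read
    .format("com.databricks.spark.csv")    # data source type
    .option("header", "true")              # key/value setting for the source
    .option("inferSchema", "true")
    .load("hdfs://nnhost/loudacre/accounts.csv"))   # generic base function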
Data Sources
§ Spark SQL 1.6 built-in data source types
– table
– json
– parquet
– jdbc
– orc
§ You can also use third-party data source libraries, such as
– Avro (included in CDH)
– HBase
– CSV
– MySQL
– and more being added all the time
DataFrame Basic Operations
§ Basic operations deal with DataFrame metadata (rather than its data)
§ Some examples
– schema returns a schema object describing the data
– printSchema displays the schema as a visual tree
– cache / persist persists the DataFrame to disk or memory
– columns returns an array containing the names of the columns
– dtypes returns an array of (column name, type) pairs
– explain prints debug information about the DataFrame to the console
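A quick sketch of these metadata operations (assumes the peopleDF created earlier):

peopleDF.printSchema()   # schema as a visual tree
peopleDF.columns         # column names
peopleDF.dtypes          # (column name, type) pairs
peopleDF.cache()         # persist for reuse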
DataFrame Actions
§ Some DataFrame actions
– collect returns all rows as an array of Row objects
– take(n) returns the first n rows as an array of Row objects
– count returns the number of rows
– show(n) displays the first n rows (default = 20)
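And the corresponding actions, sketched on the same DataFrame:

peopleDF.count()    # number of rows
peopleDF.take(3)    # first 3 rows as Row objects
peopleDF.show(5)    # display the first 5 rows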
DataFrame Queries
§ DataFrame query methods return new DataFrames
– Queries can be chained like transformations
§ Some query methods
– distinct returns a new DataFrame with the distinct elements of this DataFrame
– join joins this DataFrame with a second DataFrame
– Variants for inner, outer, left, and right joins
– limit returns a new DataFrame with the first n rows of this DataFrame
– select returns a new DataFrame with data from one or more columns of the base DataFrame
– where returns a new DataFrame with rows meeting specified query criteria (alias for filter)
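A sketch of chained query methods (column names are hypothetical):

adultsDF = (peopleDF
    .select("name", "age")
    .where(peopleDF["age"] >= 18)
    .distinct()
    .limit(10))
adultsDF.show()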
DataFrame Query Strings
Querying DataFrames using Columns
§ Columns can be referenced in multiple ways
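For example, these three references to the same column are equivalent (assumes peopleDF has an age column):

peopleDF.select(peopleDF["age"])
peopleDF.select(peopleDF.age)
peopleDF.select("age")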
Joining DataFrames
§ A basic inner join when the join column is present in both DataFrames
Joining DataFrames
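A join sketch with hypothetical DataFrames and column names:

peopleDF.join(accountsDF, "name")   # basic inner join on a column present in both

peopleDF.join(accountsDF,
              peopleDF["name"] == accountsDF["acct_name"],   # explicit join expression
              "left_outer")                                  # join type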
SQL Queries
§ When using HiveContext, you can query Hive/Impala tables using HiveQL
– Returns a DataFrame
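A hedged PySpark sketch, with a hypothetical table and query:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
topDF = sqlContext.sql("SELECT name, total FROM accounts WHERE total > 100 ORDER BY total DESC")
topDF.show(5)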
Saving DataFrames
§ Data in DataFrames can be saved to a data source
§ Use DataFrame.write to create a DataFrameWriter
§ DataFrameWriter provides convenience functions to externally save the data represented by a DataFrame
– jdbc inserts into a new or existing table in a database
– json saves as a JSON file
– parquet saves as a Parquet file
– orc saves as an ORC file
– text saves as a text file (string data in a single column only)
– saveAsTable saves as a Hive/Impala table (HiveContext only)
Options for Saving DataFrames
§ DataFrameWriter option methods
– format specifies a data source type
– mode determines the behavior if the file or table already exists: overwrite, append, ignore, or error (default is error)
– partitionBy stores data in partitioned directories in the form column=value (as with Hive/Impala partitioning)
– options specifies properties for the target data source
– save is the generic base function to write the data (see the sketch below)
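A sketch combining a convenience function with the generic save path (paths and columns are hypothetical):

peopleDF.write.parquet("hdfs://nnhost/loudacre/people_parquet")   # convenience function

(peopleDF.write
    .format("parquet")
    .mode("append")          # what to do if the target already exists
    .partitionBy("age")      # column=value directory layout
    .save("hdfs://nnhost/loudacre/people_by_age"))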
DataFrames and RDDs
§ DataFrames are built on RDDs
– Base RDDs contain Row objects
– Use rdd to get the underlying RDD
DataFrames and RDDs
§ Row RDDs have all the standard Spark actions and transformations
– Actions: collect, take, count, and so on
– Transformations: map, flatMap, filter, and so on
§ Easy to develop
– Uses Spark’s high-level API
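For example (assumes peopleDF with a name column):

rowsRDD  = peopleDF.rdd                        # RDD of Row objects
namesRDD = rowsRDD.map(lambda row: row.name)   # ordinary RDD transformation
namesRDD.take(5)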
Spark Streaming Overview
§ Divide up data stream into batches of n seconds
– Called a DStream (Discretized Stream)
§ Process each batch in Spark as an RDD
§ Return results of RDD operations in batches
Example: Streaming Request Count
DStreams
§ A DStream is a sequence of RDDs representing a data stream
Streaming Example Output (1)
DStream Data Sources
§ DStreams are defined for a given input stream (such as a Unix socket)
– Created by the Streaming context
ssc.socketTextStream(hostname, port)
– Similar to how RDDs are created by the Spark context
§ Out-of-the-box data sources
– Network
– Sockets
– Services such as Flume, Akka Actors, Kafka, ZeroMQ, or Twitter
– Files
– Monitors an HDFS directory for new content
DStream Operations
– Transformations
– Create a new DStream from an existing one
– Output operations
– Write data (for example, to a file system, database, or console)
– Similar to RDD actions
DStream Transformations
§ Many RDD transformations are also available on DStreams
– Regular transformations such as map, flatMap, filter
– Pair transformations such as reduceByKey, groupByKey, join
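A hedged PySpark Streaming sketch of a per-batch word count from a socket (host, port, and batch size are made up):

from pyspark.streaming import StreamingContext

ssc   = StreamingContext(sc, 2)                    # 2-second batches
lines = ssc.socketTextStream("localhost", 9999)    # DStream from a network socket

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()          # output operation: print each batch's results to the console

ssc.start()
ssc.awaitTermination()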
Accumulators
§ Read the accumulated value in the driver with accum.value
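A minimal accumulator sketch; tasks add to it, and only the driver reads accum.value:

accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value   # 10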
Broadcast Variable
• A broadcast variable is a read-only variable cached once on each executor and shared among the tasks running there.
• It cannot be modified by the executors.
• Broadcast variables improve performance by shipping one copy of a local dataset to each executor, instead of copying it to every task that needs it.
Broadcast Variable
// with Broadcast
val pwsB = sc.broadcast(pws)
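The same idea sketched in PySpark, with a made-up lookup set:

pws  = set(["spark", "hadoop", "hive"])     # local dataset
pwsB = sc.broadcast(pws)                    # shipped once per executor

words = sc.parallelize(["spark", "pig", "hive"])
words.filter(lambda w: w in pwsB.value).collect()   # tasks read the broadcast copy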
spark-submit Resource Options
§ Flags that control executor resources
– --executor-memory: memory per executor
– --num-executors: number of executors (YARN)
– --executor-cores: cores per executor
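For example (values and script name are hypothetical):

$ spark-submit --master yarn-client \
    --num-executors 4 \
    --executor-cores 2 \
    --executor-memory 2g \
    WordCount.py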