Spark
§ Written in Scala
– Functional programming language that runs in a JVM
§ Spark shell
– Interactive—for learning or data exploration
– Python or Scala
§ Spark applications
– For large-scale data processing
Spark Shell
§ The Spark shell provides interactive data exploration (REPL)
Spark Context
§ Every Spark application requires a Spark context
– The main entry point to the Spark API
> myData = ["Alice","Carlos","Frank","Barbara"]
> myRdd = sc.parallelize(myData)
> myRdd.take(2)
['Alice', 'Carlos']
Creating RDDs from Text Files (1)
§ For file-based RDDs, use SparkContext.textFile
– Accepts a single file, a directory of files, a wildcard list of files, or a comma-separated list of files
§ Examples
– sc.textFile("myfile.txt")
– sc.textFile("mydata/")
– sc.textFile("mydata/*.log")
– sc.textFile("myfile1.txt,myfile2.txt")
– Each line in each file is a separate record in the RDD
§ Files are referenced by absolute or relative URI
– Absolute URI:
– file:/home/training/myfile.txt
– hdfs://nnhost/loudacre/myfile.txt
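A minimal PySpark sketch of reading text files into an RDD; assumes the shell (where sc is already defined) and a hypothetical path:

logs = sc.textFile("hdfs://nnhost/loudacre/*.log")   # wildcard: one RDD over all matched files
logs.count()    # number of lines (records) across all files
logs.first()    # the first line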
Examples: Multi-RDD Transformations (1)
Examples: Multi-RDD Transformations (2)
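The original example slides are not reproduced here; a hedged PySpark sketch of typical multi-RDD transformations (data is made up):

rdd1 = sc.parallelize(["Chicago", "Boston", "Paris"])
rdd2 = sc.parallelize(["Paris", "Berlin", "Tokyo"])
rdd1.union(rdd2).collect()          # all elements of both RDDs (duplicates kept)
rdd1.intersection(rdd2).collect()   # elements present in both: ['Paris']
rdd1.subtract(rdd2).collect()       # elements only in rdd1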
Some Other General RDD Operations
§ Map phase
– Operates on one record at a time
– “Maps” each record to zero or more new records
– Examples: map, flatMap, filter, keyBy
§ Reduce phase
– Works on map output
– Consolidates multiple records
– Examples: reduceByKey, sortByKey, mean (see the sketch below)
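A short PySpark sketch of the map and reduce phases listed above (data is made up):

lines  = sc.parallelize(["alice 42", "carlos 7", "alice 13"])
words  = lines.flatMap(lambda line: line.split(" "))           # one record in, zero or more out
alices = lines.filter(lambda line: line.startswith("alice"))   # keep or drop records
byName = lines.keyBy(lambda line: line.split(" ")[0])          # (key, record) pairs
counts = byName.mapValues(lambda v: 1).reduceByKey(lambda a, b: a + b)   # consolidate by key
counts.collect()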
Example: Word Count
reduceByKey
§ The function passed to reduceByKey combines two values for the same key
– The function must be binary (takes two arguments)
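A sketch of the word count pattern in PySpark, illustrating the binary function passed to reduceByKey (the input path is hypothetical):

counts = sc.textFile("hdfs://nnhost/loudacre/frostroad.txt") \
           .flatMap(lambda line: line.split(" ")) \
           .map(lambda word: (word, 1)) \
           .reduceByKey(lambda v1, v2: v1 + v2)   # binary: combines two counts for the same word
counts.take(5)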
Pair RDD Operations
§ In addition to map and reduceByKey operations, Spark has several operations specific to pair RDDs
§ Examples
– countByKey returns a map with the count of occurrences of each key
– groupByKey groups all the values for each key in an RDD
– sortByKey sorts in ascending or descending order
– join returns an RDD containing all pairs with matching keys from two RDDs
Example: Pair RDD Operations
Example: Joining by Key
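A hedged PySpark sketch of these pair-RDD operations and a join by key (data is made up):

orders = sc.parallelize([("alice", "ipad"), ("carlos", "iphone"), ("alice", "macbook")])
cities = sc.parallelize([("alice", "Chicago"), ("carlos", "Boston")])
orders.countByKey()                            # {'alice': 2, 'carlos': 1}
orders.groupByKey().mapValues(list).collect()  # all values grouped per key
orders.sortByKey().collect()                   # ascending by default
orders.join(cities).collect()                  # [('alice', ('ipad', 'Chicago')), ...]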
Other Pair Operations
§ Some other pair operations
§ The Spark context in an application
– Named sc by convention
– Call sc.stop when the program terminates
Example: Word Count
Building a Spark Application: Scala
§ Scala or Java Spark applications must be compiled and assembled into JAR files
– The JAR file will be passed to worker nodes
§ Apache Maven is a popular build tool
– For specific setting recommendations, see the Spark Programming Guide
§ Build details will differ depending on
– Version of Hadoop (HDFS)
– Deployment platform (YARN, Mesos, Spark Standalone)
§ Consider using an Integrated Development Environment (IDE)
– IntelliJ or Eclipse are two popular examples
– Can run Spark locally in a debugger
Running a Spark Application
§ The easiest way to run a Spark application is using the spark-submit script
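A typical invocation, sketched with a hypothetical script and input path (the right master setting depends on your cluster):

$ spark-submit --master yarn-client WordCount.py hdfs://nnhost/loudacre/frostroad.txt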
Spark Application Cluster Options
– Locally
– No distributed processing
– Locally with multiple worker threads
– On a cluster
§ Narrow dependencies
– Each partition in the child RDD depends
on just one partition of the parent RDD
– No shuffle required between executors
– Can be collapsed into a single stage
– Examples: map, filter, and union
§ Wide (or shuffle) dependencies
– Child partitions depend on multiple
partitions in the parent RDD
– Defines a new stage
– Examples: reduceByKey, join
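One way to see these stage boundaries is RDD.toDebugString, sketched here on made-up data; the indented groups in its output correspond to stages:

pairs  = sc.parallelize(["a", "b", "a", "c"]).map(lambda w: (w, 1))   # narrow: no shuffle
counts = pairs.reduceByKey(lambda a, b: a + b)                        # wide: shuffle boundary
print(counts.toDebugString())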
Controlling the Level of Parallelism
§ Wide operations (such as reduceByKey) partition resulting RDDs
– More partitions = more parallel tasks
– Cluster will be under-utilized if there are too few partitions
§ You can control how many partitions are created
– Optional numPartitions parameter in the function call (see the sketch below)
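A sketch of setting numPartitions on a wide operation (file path hypothetical):

words  = sc.textFile("hdfs://nnhost/loudacre/frostroad.txt").flatMap(lambda l: l.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b, numPartitions=15)
counts.getNumPartitions()   # 15 partitions, so up to 15 parallel tasks in that stage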
Viewing Stages in the Spark Application UI
RDD Persistence
Lineage Example (1)
§ Each transformation operation creates a new child RDD
Lineage Example (2)
Lineage Example (3)
Lineage Example (4)
§ Spark keeps track of the parent RDD for each new RDD
§ To stop persisting an RDD and remove it from memory and disk
– rdd.unpersist()
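A brief persistence sketch in PySpark (path hypothetical):

logs   = sc.textFile("hdfs://nnhost/loudacre/weblogs/")
errors = logs.filter(lambda line: "ERROR" in line)
errors.persist()    # or errors.cache()
errors.count()      # first action computes and caches the partitions
errors.take(10)     # reuses the cached data instead of re-reading the files
errors.unpersist()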
§ Convenience functions for reading common data formats
– json(filename)
– parquet(filename)
– orc(filename)
– table(hive-tablename)
– jdbc(url, table, options)
Example: Creating a DataFrame from a JSON File
Example: Creating a DataFrame from a Hive/Impala Table
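A hedged sketch of both examples, assuming the sqlContext provided by the shell and hypothetical file/table names:

peopleDF   = sqlContext.read.json("hdfs://nnhost/loudacre/people.json")
accountsDF = sqlContext.read.table("accounts")   # Hive/Impala table (requires HiveContext)
peopleDF.printSchema()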
Loading from a Data Source Manually
§ You can specify settings for the DataFrameReader
– format: Specify a data source type
– option: A key/value setting for the underlying data source
– schema: Specify a schema instead of inferring it from the data source
§ Then call the generic base function load (see the sketch below)
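A sketch of manual loading; the CSV source and its options come from the third-party spark-csv package, so treat those names as an assumption:

accountsDF = (sqlContext.read
    .format("com.databricks.spark.csv")    # data source type
    .option("header", "true")              # key/value setting for the source
    .option("inferSchema", "true")
    .load("hdfs://nnhost/loudacre/accounts.csv"))   # generic base function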
Data Sources
§ Spark SQL 1.6 built-in data source types
– table
– json
– parquet
– jdbc
– orc
§ You can also use third-party data source libraries, such as
– Avro (included in CDH)
– HBase
– CSV
– MySQL
– and more being added all the time
DataFrame Basic Operations
§ Basic operations deal with DataFrame metadata (rather than its data)
§ Some examples
– schema returns a schema object describing the data
– printSchema displays the schema as a visual tree
– cache / persist persists the DataFrame to disk or memory
– columns returns an array containing the names of the columns
– dtypes returns an array of (column name, type) pairs
– explain prints debug information about the DataFrame to the console
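A quick sketch of these metadata operations (assumes the peopleDF created earlier):

peopleDF.printSchema()   # schema as a visual tree
peopleDF.columns         # column names
peopleDF.dtypes          # (column name, type) pairs
peopleDF.cache()         # persist for reuse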
DataFrame Actions
§ Some DataFrame actions
– collect returns all rows as an array of Row objects
– take(n) returns the first n rows as an array of Row objects
– count returns the number of rows
– show(n) displays the first n rows (default = 20)
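And the corresponding actions, sketched on the same DataFrame:

peopleDF.count()    # number of rows
peopleDF.take(3)    # first 3 rows as Row objects
peopleDF.show(5)    # display the first 5 rows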
DataFrame Queries
§ DataFrame query methods return new DataFrames
– Queries can be chained like transformations
§ Some query methods
– distinct returns a new DataFrame with the distinct elements of this DataFrame
– join joins this DataFrame with a second DataFrame
– Variants for inner, outer, left, and right joins
– limit returns a new DataFrame with the first n rows of this DataFrame
– select returns a new DataFrame with data from one or more columns of the base DataFrame
– where returns a new DataFrame with rows meeting specified query criteria (alias for filter)
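A sketch of chained query methods (column names are hypothetical):

adultsDF = (peopleDF
    .select("name", "age")
    .where(peopleDF["age"] >= 18)
    .distinct()
    .limit(10))
adultsDF.show()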
DataFrame Query Strings
Querying DataFrames using Columns
§ Columns can be referenced in multiple ways
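For example, these three references to the same column are equivalent (assumes peopleDF has an age column):

peopleDF.select(peopleDF["age"])
peopleDF.select(peopleDF.age)
peopleDF.select("age")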
Joining DataFrames
§ A basic inner join when the join column is present in both DataFrames
Joining DataFrames
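A join sketch with hypothetical DataFrames and column names:

peopleDF.join(accountsDF, "name")   # basic inner join on a column present in both

peopleDF.join(accountsDF,
              peopleDF["name"] == accountsDF["acct_name"],   # explicit join expression
              "left_outer")                                  # join type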
SQL Queries
§ When using HiveContext, you can query Hive/Impala tables using HiveQL
– Returns a DataFrame
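A hedged PySpark sketch, with a hypothetical table and query:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
topDF = sqlContext.sql("SELECT name, total FROM accounts WHERE total > 100 ORDER BY total DESC")
topDF.show(5)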
Saving DataFrames
§ Data in DataFrames can be saved to a data source
§ Use DataFrame.write to create a DataFrameWriter
§ DataFrameWriter provides convenience functions to externally save the data represented by a DataFrame
– jdbc inserts into a new or existing table in a database
– json saves as a JSON file
– parquet saves as a Parquet file
– orc saves as an ORC file
– text saves as a text file (string data in a single column only)
– saveAsTable saves as a Hive/Impala table (HiveContext only)
Options for Saving DataFrames
§ DataFrameWriter option methods
– format specifies a data source type
– mode determines the behavior if the file or table already exists: overwrite, append, ignore, or error (default is error)
– partitionBy stores data in partitioned directories in the form column=value (as with Hive/Impala partitioning)
– options specifies properties for the target data source
– save is the generic base function to write the data (see the sketch below)
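A sketch combining a convenience function with the generic save path (paths and columns are hypothetical):

peopleDF.write.parquet("hdfs://nnhost/loudacre/people_parquet")   # convenience function

(peopleDF.write
    .format("parquet")
    .mode("append")          # what to do if the target already exists
    .partitionBy("age")      # column=value directory layout
    .save("hdfs://nnhost/loudacre/people_by_age"))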
DataFrames and RDDs
§ DataFrames are built on RDDs
– Base RDDs contain Row objects
– Use rdd to get the underlying RDD
DataFrames and RDDs
§ Row RDDs have all the standard Spark actions and transformations
– Actions: collect, take, count, and so on
– Transformations: map, flatMap, filter, and so on
§ Easy to develop
– Uses Spark’s high-level API
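For example (assumes peopleDF with a name column):

rowsRDD  = peopleDF.rdd                        # RDD of Row objects
namesRDD = rowsRDD.map(lambda row: row.name)   # ordinary RDD transformation
namesRDD.take(5)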
Spark Streaming Overview
§ Divide up data stream into batches of n seconds
– Called a DStream (Discretized Stream)
§ Process each batch in Spark as an RDD
§ Return results of RDD operations in batches
Example: Streaming Request Count
DStreams
§ A DStream is a sequence of RDDs representing a data stream
Streaming Example Output (1)
DStream Data Sources
§ DStreams are defined for a given input stream (such as a Unix socket)
– Created by the Streaming context
ssc.socketTextStream(hostname, port)
– Similar to how RDDs are created by the Spark context
§ Out-of-the-box data sources
– Network
– Sockets
– Services such as Flume, Akka Actors, Kafka, ZeroMQ, or Twitter
– Files
– Monitors an HDFS directory for new content
DStream Operations
– Transformations
– Create a new DStream from an existing one
– Output operations
– Write data (for example, to a file system, database, or console)
– Similar to RDD actions
DStream Transformations
§ Many RDD transformations are also available on DStreams
– Regular transformations such as map, flatMap, filter
– Pair transformations such as reduceByKey, groupByKey, join
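A hedged PySpark Streaming sketch of a per-batch word count from a socket (host, port, and batch size are made up):

from pyspark.streaming import StreamingContext

ssc   = StreamingContext(sc, 2)                    # 2-second batches
lines = ssc.socketTextStream("localhost", 9999)    # DStream from a network socket

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()          # output operation: print each batch's results to the console

ssc.start()
ssc.awaitTermination()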
Accumulators
§ Read the accumulated value in the driver with accum.value
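A minimal accumulator sketch; tasks add to it, and only the driver reads accum.value:

accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value   # 10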
Broadcast Variable
• A broadcast variable is a read-only variable cached once on each executor and shared among the tasks running there.
• It cannot be modified by the executors.
• Broadcast variables improve performance by shipping one copy of a local dataset to each executor, instead of copying it to every task that needs it.
Broadcast Variable
// with Broadcast
val pwsB = sc.broadcast(pws)
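The same idea sketched in PySpark, with a made-up lookup set:

pws  = set(["spark", "hadoop", "hive"])     # local dataset
pwsB = sc.broadcast(pws)                    # shipped once per executor

words = sc.parallelize(["spark", "pig", "hive"])
words.filter(lambda w: w in pwsB.value).collect()   # tasks read the broadcast copy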
spark-submit Resource Options
§ Flags that control executor resources
– --executor-memory: memory per executor
– --num-executors: number of executors (YARN)
– --executor-cores: cores per executor
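For example (values and script name are hypothetical):

$ spark-submit --master yarn-client \
    --num-executors 4 \
    --executor-cores 2 \
    --executor-memory 2g \
    WordCount.py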