Introduction To Big Data With Apache Spark: UC Berkeley
This Lecture
Programming Spark
Resilient Distributed Datasets (RDDs)
Creating an RDD
Spark Transformations and Actions
Spark Programming Model
Python Spark (pySpark)
• We are using the Python programming interface to
Spark (pySpark)
• pySpark provides an easy-to-use programming
abstraction and parallel runtime:
» “Here’s an operation, run it on all of the data”
• http://spark.apache.org/docs/latest/api/python/index.html
Creating an RDD
• Create RDDs from Python collections (lists)
>>> data = [1, 2, 3, 4, 5]
>>> data
[1, 2, 3, 4, 5]
>>> rDD = sc.parallelize(data, 4)
>>> rDD
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229
• No computation occurs with sc.parallelize(); Spark only records how to create the RDD with four partitions
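• To confirm how the data was split, a minimal sketch (getNumPartitions() and glom() are standard pySpark RDD methods; the split shown is what parallelize produces for this list):
>>> rDD.getNumPartitions()  # number of partitions recorded for this RDD
4
>>> rDD.glom().collect()    # glom() groups elements by partition; collect() is an action, so this does trigger computation
[[1], [2], [3], [4, 5]]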
Creating RDDs
• From HDFS, text files, Hypertable, Amazon S3, Apache HBase,
SequenceFiles, any other Hadoop InputFormat, and directory or
glob wildcard: /data/201404*
>>> distFile = sc.textFile("...", 4)
>>> distFile
MappedRDD[2] at textFile at NativeMethodAccessorImpl.java:-2
Creating an RDD from a File
distFile = sc.textFile("...", 4)
• RDD distributed in 4 partitions
• Elements are lines of input
• Lazy evaluation means
no execution happens now
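• A minimal sketch of forcing evaluation (assuming "..." is replaced with a real file path):
distFile = sc.textFile("...", 4)  # lazy: only the lineage is recorded
firstLines = distFile.take(2)     # take() is an action, so the file is read now
print firstLines                  # a Python list containing the first two lines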
Spark Transformations
• Create new datasets from an existing one
• Use lazy evaluation: results not computed right away –
instead Spark remembers set of transformations applied
to base dataset
» Spark optimizes the required calculations
» Spark recovers from failures and slow workers
Transformation: Description
map(func): return a new distributed dataset formed by passing each element of the source through a function func
filter(func): return a new dataset formed by selecting those elements of the source on which func returns true
distinct([numTasks]): return a new dataset that contains the distinct elements of the source dataset
flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
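• A short sketch of these four transformations in pySpark (assuming the SparkContext sc from earlier):
>>> rdd = sc.parallelize([1, 2, 2, 3, 4])
>>> rdd.map(lambda x: x * 2)            # [2, 4, 4, 6, 8] once an action runs
>>> rdd.filter(lambda x: x % 2 == 0)    # [2, 2, 4]
>>> rdd.distinct()                      # [1, 2, 3, 4] (order may vary)
>>> rdd.flatMap(lambda x: [x, x * 10])  # [1, 10, 2, 20, 2, 20, 3, 30, 4, 40]
>>> # Each call above only returns a new RDD; nothing is computed yet
>>> rdd.map(lambda x: x * 2).collect()  # collect() is an action, so Spark runs the map now
[2, 4, 4, 6, 8]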
Review: Python lambda Functions
• Small anonymous functions (not bound to a name)
lambda a, b: a + b
» returns the sum of its two arguments
>>> rdd = sc.parallelize([5,3,1,2])
>>> rdd.takeOrdered(3, lambda s: -1 * s)
Value: [5,3,2] # as list
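• The same call with a named function instead of a lambda (a minimal sketch; negate is an illustrative name):
>>> def negate(s):
...     return -1 * s
>>> rdd.takeOrdered(3, negate)
Value: [5,3,2] # as list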
Spark Programming Model
lines = sc.textFile("...", 4)
print lines.count()
[Diagram: the lines RDD is split into 4 partitions across workers]
count() causes Spark to:
• read data
• sum within partitions
• combine sums in driver
Spark Programming Model
lines = sc.textFile("...", 4)
comments = lines.filter(isComment)
print lines.count(), comments.count()
[Diagram: the lines and comments RDDs partitioned across workers]
Spark recomputes lines:
• read data (again)
• sum within partitions
• combine sums in driver
Caching RDDs
lines = sc.textFile("...", 4)
lines.cache() # save, don't recompute!
comments = lines.filter(isComment)
print lines.count(), comments.count()
[Diagram: the lines partitions are cached in RAM on each worker; comments is computed from the cached partitions, so lines is not re-read]
Spark Program Lifecycle
1. Create RDDs from external data or parallelize a
collection in your driver program
2. Lazily transform them into new RDDs
3. cache() some RDDs for reuse
4. Perform actions to execute parallel
computation and produce results
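• A minimal end-to-end sketch of the four steps, reusing the isComment function and the "..." path placeholder from the slides above:
lines = sc.textFile("...", 4)        # 1. create an RDD from external data
comments = lines.filter(isComment)   # 2. lazily transform it into a new RDD
comments.cache()                     # 3. cache it for reuse
print comments.count()               # 4. an action triggers the parallel computation
print comments.count()               # the second action reuses the cached partitions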
Spark Key-Value RDDs
• Similar to MapReduce, Spark supports Key-Value pairs
• Each element of a Pair RDD is a pair tuple
>>> rdd = sc.parallelize([(1, 2), (3, 4)])
RDD: [(1, 2), (3, 4)]
Some Key-Value Transformations
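• A short sketch of three common pair-RDD transformations (reduceByKey, sortByKey, and groupByKey are standard pySpark operations; the data is illustrative):
>>> rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
>>> rdd.reduceByKey(lambda a, b: a + b)  # combine values with the same key: [(1, 2), (3, 10)]
>>> rdd.sortByKey()                      # return an RDD sorted by key: [(1, 2), (3, 4), (3, 6)]
>>> rdd.groupByKey()                     # group values with the same key: [(1, [2]), (3, [4, 6])]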
Consider These Use Cases
• Iterative or single jobs with large global variables
» Sending large read-only lookup table to workers
» Sending large feature vector in a ML algorithm to workers
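• A minimal sketch of the first use case (lookup is a hypothetical read-only table defined in the driver); because the lambda's closure captures lookup, Spark ships a copy of the table with every task:
lookup = {"a": 1, "b": 2}                # imagine this dict is very large
rdd = sc.parallelize(["a", "b", "a"])
resolved = rdd.map(lambda k: lookup[k])  # the closure captures lookup, so it is re-sent to workers with each task
print resolved.collect()                 # [1, 2, 1]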