SPARK


INF4101-Big Data et NoSQL

5. SPARK

Dr Mouhim Sanaa
SPARK VS HADOOP

• Apache Spark is a computing platform designed to be fast, general-purpose, and easy to use:

Speed
• In-memory computations
• Faster than MapReduce for complex applications on disk
Generality
• Batch applications
• Iterative algorithms
• Interactive queries and streaming
Ease of use
• APIs for Scala, Python, Java
• Libraries for SQL, machine learning, streaming, and graph processing
• Runs on Hadoop clusters or standalone
SPARK VS HADOOP

• Data is increasing in volume, velocity, and variety

• The need for faster analytics results becomes increasingly important

• Big Data analytics is of two types:

 • Batch analytics
 • Real-time analytics

• Hadoop implements batch processing on Big Data; it cannot serve real-time use cases.
SPARK FEATURES

Spark is not intended to replace Hadoop; it can be regarded as an extension to it.

MapReduce and Spark are used together, where MapReduce is used for batch processing and Spark for real-time processing.
SPARK FEATURES

• 100x faster than MapReduce for large-scale data processing

• Provides powerful caching and disk persistence capabilities

• Can be programmed in Scala, Java, Python or R

• Can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
SPARK ECOSYSTEM

Spark Core is the base engine for large-scale parallel and distributed data processing.

It is responsible for:

• Memory management and fault recovery

• Scheduling, distributing and monitoring jobs on a cluster

• Interacting with storage systems
SPARK ARCHITECTURE
RESILIENT DISTRIBUTED DATASET RDD
• An RDD is a distributed collection of elements that is parallelized across the cluster

   • Resilient: fault tolerant, capable of rebuilding data on failure
   • Distributed: data is distributed among the multiple nodes in a cluster
   • Dataset: a collection of partitioned data with values

• Immutable
• Lazy evaluation
• In-memory computation
• Three methods for creating an RDD (see the sketch below):
   Parallelizing an existing collection
   Referencing a dataset
   Transformation from an existing RDD
• Two types of RDD operations:
   Transformations
   Actions
• Datasets can come from any storage supported by Hadoop:
   HDFS, Cassandra, HBase, ...
• File types supported:
   Text files, Sequence files, Hadoop InputFormat
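
A minimal spark-shell sketch of the three creation methods (the file name data.txt is a placeholder):

// 1. Parallelizing an existing collection
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Referencing a dataset in external storage (placeholder path)
val lines = sc.textFile("data.txt")

// 3. Transformation from an existing RDD
val doubled = numbers.map(n => n * 2)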
RDD OPERATIONS

Two types of RDD operations

Transformations
• Create a DAG
• Lazy evaluation
• No return value

Actions
• Perform the transformations and the action that follows
• Return a value
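
A small sketch of this laziness, assuming a placeholder file data.txt: the map transformation only records the lineage (DAG); nothing runs until the count action is invoked.

val lines = sc.textFile("data.txt")      // transformation: nothing is read yet
val lengths = lines.map(s => s.length)   // transformation: only extends the DAG
val n = lengths.count()                  // action: triggers the actual computation and returns a value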
SPARK SHELL

• The Spark shell provides a simple way to learn Spark's API.
• It is also a powerful tool to analyze data interactively.
• The shell is available in either Scala, which runs on the Java VM, or Python.

Scala:
• To launch the Scala shell: spark-shell

• To read a text file: scala> val textfile = sc.textFile("file.txt")

Python:
• To launch the Python shell: pyspark

• To read a text file: >>> textfile = sc.textFile("file.txt")
RESILIENT DISTRIBUTED DATASET RDD

RDD operations: Basics

Loading a file:            val lines = sc.textFile("monText.txt")

Applying a transformation: val lineLengths = lines.map(s => s.length)

Invoking an action:        val totalLength = lineLengths.reduce((a, b) => a + b)


RESILIENT DISTRIBUTED DATASET RDD
RDD operations: actions

Action method                        Method usage and description

collect(): Array[T]                  Returns the complete dataset as an Array.

count(): Long                        Returns the count of elements in the dataset.

first(): T                           Returns the first element in the dataset.

foreach(f: (T) ⇒ Unit): Unit         Iterates over all elements in the dataset, applying function f to each element.

max()(implicit ord: Ordering[T]): T  Returns the maximum value from the dataset.

reduce(f: (T, T) ⇒ T): T             Reduces the elements of the dataset using the specified binary operator.
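
A short spark-shell sketch exercising these actions on a small parallelized dataset (the values are illustrative):

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

nums.collect()                 // Array(5, 1, 4, 2, 3)
nums.count()                   // 5
nums.first()                   // 5
nums.foreach(x => println(x))  // prints each element (order not guaranteed)
nums.max()                     // 5
nums.reduce((a, b) => a + b)   // 15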
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: Transformations
Transformation method  Method usage and description

cache()                Caches the RDD.

filter()               Returns a new RDD after applying a filter function to the source dataset.

flatMap()              Returns a flattened map: if you have a dataset of arrays, it converts each element of an array into a row. In other words, it returns 0 or more items in the output for each element of the dataset.

map()                  Applies a transformation function to the dataset and returns the same number of elements in a distributed dataset.

reduceByKey()          Merges the values for each key using the specified function.

sortByKey()            Used to sort RDD elements by key.
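
A spark-shell sketch combining these transformations into a small word-count style pipeline (the input strings are made up for illustration):

val lines = sc.parallelize(Seq("spark is fast", "spark is general"))

val words   = lines.flatMap(line => line.split(" "))  // flatMap: one word per output element
val pairs   = words.map(word => (word, 1))            // map: same number of elements, now (key, value) pairs
val counts  = pairs.reduceByKey((a, b) => a + b)      // reduceByKey: merge values per key
val sorted  = counts.sortByKey()                      // sortByKey: order results by word
val nonStop = sorted.filter(p => p._1 != "is")        // filter: keep only elements matching the predicate

nonStop.cache()                                       // cache: keep the result in memory for reuse
nonStop.collect().foreach(println)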


RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using parallelize()

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))

Create a Spark RDD using textFile() or wholeTextFiles()


Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods we can also read all files from a directory, or files matching a specific pattern.

• textFile() – Reads single or multiple text or CSV files and returns a single RDD[String].

• wholeTextFiles() – Reads single or multiple files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the
file.
RESILIENT DISTRIBUTED DATASET RDD

Create a Spark RDD using textFile() or wholeTextFiles()

val rdd = spark.sparkContext.textFile("C:/tmp/files/*")


rdd.foreach(f=>{ println(f)})

This example reads all files from a directory, creates a single RDD and prints the contents
of the RDD.
RESILIENT DISTRIBUTED DATASET RDD

Create a Spark RDD using textFile() or wholeTextFiles()

val rddWhole = sc.wholeTextFiles("C:/tmp/files/*")


rddWhole.foreach(f=>{ println(f._1+"=>"+f._2) })

where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the file.
RESILIENT DISTRIBUTED DATASET RDD
map

Returns a new distributed dataset formed by passing each element of the source through a
function.

val lengths = test.map(s => s.length)

val pairs = test.map(s => (s, s.length))


RESILIENT DISTRIBUTED DATASET RDD
flatMap

The Spark flatMap() transformation applies a function to every element of the RDD and flattens the results into a new RDD.

The returned RDD can have the same number of elements or more. This is one of the major differences between flatMap() and map(): the map() transformation always returns the same number of elements as the input.

val rdd1 = rdd.flatMap(f=>f.split(" "))
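
A short sketch showing the difference in element counts, with made-up input lines:

val rdd = sc.parallelize(Seq("hello world", "big data"))

rdd.map(f => f.split(" ")).count()       // 2  (one array per input line)
rdd.flatMap(f => f.split(" ")).count()   // 4  (arrays flattened into individual words)
rdd.flatMap(f => f.split(" ")).collect() // Array(hello, world, big, data)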


RESILIENT DISTRIBUTED DATASET RDD
reduce

The Spark RDD reduce() action aggregates the elements of a dataset; it can be used to compute the min, max, or total of the elements.

val listRdd = sc.parallelize(List(1,2,3,4,5,3,2))


println("output min using binary : "+listRdd.reduce(_ min _))
println("output max using binary : "+listRdd.reduce(_ max _))
println("output sum using binary : "+listRdd.reduce(_ + _))
RESILIENT DISTRIBUTED DATASET RDD
reduceByKey

Spark RDD reduceByKey() transformation is used to merge the values of each key using an
associative reduce function.

val rdd2=rdd.reduceByKey(_ + _)
rdd2.foreach(println)
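
A self-contained sketch, assuming a small pair RDD of (word, count) tuples built with parallelize():

val rdd = sc.parallelize(Seq(("spark", 1), ("hadoop", 1), ("spark", 1)))

val rdd2 = rdd.reduceByKey(_ + _)   // merges values per key: ("spark", 2), ("hadoop", 1)
rdd2.foreach(println)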
RESILIENT DISTRIBUTED DATASET RDD
Filter transformation

Spark RDD filter is an operation that creates a new RDD by selecting the elements
from the input RDD that satisfy a given predicate (or condition).
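
A minimal sketch, with made-up numbers, keeping only the even elements:

val numbers = sc.parallelize(1 to 10)

val evens = numbers.filter(n => n % 2 == 0)  // keep elements satisfying the predicate
evens.collect()                              // Array(2, 4, 6, 8, 10)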
RDD ACTIONS
collect() – Returns the complete dataset as an Array.

count() – Returns the count of elements in the dataset.
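
A tiny sketch of these two actions:

val letters = sc.parallelize(Seq("a", "b", "c"))

letters.collect()   // Array(a, b, c)
letters.count()     // 3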


SPARK CACHE

• Apache Spark provides an important feature to cache intermediate data, which gives a significant performance improvement when running multiple queries on the same data.

When to cache

• The rule of thumb for caching is to identify the DataFrame that you will be reusing in your Spark application and cache it.
SPARK CACHE

Benefits of caching a DataFrame

• Reading data from the source (hdfs:// or s3://) is time consuming. So after you read data from the source and apply all the common operations, cache it if you are going to reuse the data.

• By caching you create a checkpoint in your Spark application, and if any task fails further down the execution of the application, your application will be able to recompute the lost RDD partitions from the cache.

• If you don't have enough memory, data will be cached on the local disk of the executor, which will still be faster than reading from the source.

• If you can only cache a fraction of the data, it will also improve performance; the rest of the data can be recomputed by Spark, and that is what the "resilient" in RDD means.
SPARK CACHE

Different storage levels for caching data

We can use different storage levels when caching data:

• DISK_ONLY: Persists data on disk only, in serialized format.

• MEMORY_ONLY: Persists data in memory only, in deserialized format.

• MEMORY_ONLY_SER: The same as MEMORY_ONLY, the difference being that it stores the RDD as serialized objects in JVM memory.

• MEMORY_ONLY_2: The same as MEMORY_ONLY, but replicates each partition to two cluster nodes.

• MEMORY_AND_DISK: Persists data in memory; if enough memory is not available, evicted blocks are stored on disk.

• MEMORY_AND_DISK_SER: The same as MEMORY_AND_DISK, the difference being that it serializes the DataFrame objects.
SPARK CACHE

• The cache() method by default saves data to memory (MEMORY_ONLY).

• The persist() method is used to store data at a user-defined storage level.

import org.apache.spark.storage.StorageLevel

val dfCache = df.cache()

val dfPersist = df.persist(StorageLevel.MEMORY_ONLY)
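
A short usage sketch (df stands for any DataFrame you have already built; the names are illustrative):

val dfCached = df.cache()   // marks the DataFrame for caching (MEMORY_ONLY by default)
dfCached.count()            // the first action materializes the cache
dfCached.show(5)            // subsequent actions read from the cache instead of the source
dfCached.unpersist()        // release the cached blocks when no longer needed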
SPARK MAVEN APPLICATION

<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.13</artifactId>
    <version>3.2.1</version>
  </dependency>

  <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
SPARK MAVEN APPLICATION

In production: create the wc.jar file

spark-submit \
  --class sparkwordcount \
  --master local \
  wc.jar test.txt resultat
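
A minimal sketch of what the sparkwordcount driver class might look like (the logic and names are assumptions; only the object name needs to match the --class argument above):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical word-count driver matching the spark-submit command above.
object sparkwordcount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparkwordcount")
    val sc = new SparkContext(conf)

    val lines  = sc.textFile(args(0))          // input file, e.g. test.txt
    val counts = lines.flatMap(_.split(" "))   // split lines into words
                      .map(word => (word, 1))  // pair each word with 1
                      .reduceByKey(_ + _)      // sum the counts per word

    counts.saveAsTextFile(args(1))             // output directory, e.g. resultat
    sc.stop()
  }
}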
