SPARK


INF4101-Big Data et NoSQL

5. SPARK

Dr Mouhim Sanaa
SPARK VS HADOOP

• Apache Spark is a computing platform designed to be fast, general-purpose, and easy to use:

Speed
• In-memory computations
• Faster than MapReduce for complex applications on disk
Generality
• Batch applications
• Iterative algorithms
• Interactive queries and streaming
Ease of use
• APIs for Scala, Python, Java
• Libraries for SQL, machine learning, streaming, and graph processing
• Runs on Hadoop clusters or standalone
SPARK VS HADOOP

• Data is increasing in volume, velocity, and variety

• The need for faster analytics results becomes increasingly important

• Big Data analytics is of two types:

 • Batch analytics
 • Real-time analytics

• Hadoop implements batch processing on Big Data; it cannot serve real-time use cases.
SPARK FEATURES

Spark is not intended to replace Hadoop; it can be regarded as an extension to it.

MapReduce and Spark are used together, where MapReduce is used for batch processing and Spark for real-time processing.
SPARK FEATURES

• 100x faster than MapReduce for large-scale data processing

• Provides powerful caching and disk persistence capabilities

• Can be programmed in Scala, Java, Python or R

• Can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
SPARK ECOSYSTEM

Spark Core is the base engine for large-scale parallel and distributed data processing.

It is responsible for:

• Memory management and fault recovery

• Scheduling, distributing and monitoring jobs on a cluster

• Interacting with storage systems
SPARK ARCHITECTURE
RESILIENT DISTRIBUTED DATASET RDD
• An RDD is a distributed collection of elements that is parallelized across the cluster

   • Resilient: fault tolerant, capable of rebuilding data on failure
   • Distributed: data is distributed among the multiple nodes in a cluster
   • Dataset: a collection of partitioned data with values

• Immutable
• Lazy evaluation
• In-memory computation
• Three methods for creating an RDD (see the sketch below):
   Parallelizing an existing collection
   Referencing a dataset
   Transformation from an existing RDD
• Two types of RDD operations:
   Transformations
   Actions
• Datasets can come from any storage supported by Hadoop:
   HDFS, Cassandra, HBase, ...
• File types supported:
   Text files, Sequence files, Hadoop InputFormat
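
A minimal spark-shell sketch of the three creation methods (the file name data.txt is a placeholder):

// 1. Parallelizing an existing collection
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Referencing a dataset in external storage (placeholder path)
val lines = sc.textFile("data.txt")

// 3. Transformation from an existing RDD
val doubled = numbers.map(n => n * 2)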
RDD OPERATIONS

Two types of RDD operations

Transformations
• Create a DAG
• Lazy evaluation
• No return value

Actions
• Perform the transformations and the action that follows
• Return a value
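
A small sketch of this laziness, assuming a placeholder file data.txt: the map transformation only records the lineage (DAG); nothing runs until the count action is invoked.

val lines = sc.textFile("data.txt")      // transformation: nothing is read yet
val lengths = lines.map(s => s.length)   // transformation: only extends the DAG
val n = lengths.count()                  // action: triggers the actual computation and returns a value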
SPARK SHELL

• The Spark shell provides a simple way to learn Spark's API.
• It is also a powerful tool to analyze data interactively.
• The shell is available in either Scala, which runs on the Java VM, or Python.

Scala:
• To launch the Scala shell: spark-shell

• To read a text file: scala> val textfile = sc.textFile("file.txt")

Python:
• To launch the Python shell: pyspark

• To read a text file: >>> textfile = sc.textFile("file.txt")
RESILIENT DISTRIBUTED DATASET RDD

RDD operations: Basics

Loading a file:            val lines = sc.textFile("monText.txt")

Applying a transformation: val lineLengths = lines.map(s => s.length)

Invoking an action:        val totalLength = lineLengths.reduce((a, b) => a + b)


RESILIENT DISTRIBUTED DATASET RDD
RDD operations: actions

Action method                        Method usage and description

collect(): Array[T]                  Returns the complete dataset as an Array.

count(): Long                        Returns the count of elements in the dataset.

first(): T                           Returns the first element in the dataset.

foreach(f: (T) ⇒ Unit): Unit         Iterates over all elements in the dataset, applying function f to each element.

max()(implicit ord: Ordering[T]): T  Returns the maximum value from the dataset.

reduce(f: (T, T) ⇒ T): T             Reduces the elements of the dataset using the specified binary operator.
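
A short spark-shell sketch exercising these actions on a small parallelized dataset (the values are illustrative):

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

nums.collect()                 // Array(5, 1, 4, 2, 3)
nums.count()                   // 5
nums.first()                   // 5
nums.foreach(x => println(x))  // prints each element (order not guaranteed)
nums.max()                     // 5
nums.reduce((a, b) => a + b)   // 15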
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: Transformations
Transformation method  Method usage and description

cache()                Caches the RDD.

filter()               Returns a new RDD after applying a filter function to the source dataset.

flatMap()              Returns a flattened map: if you have a dataset of arrays, it converts each element of an array into a row. In other words, it returns 0 or more items in the output for each element of the dataset.

map()                  Applies a transformation function to the dataset and returns the same number of elements in a distributed dataset.

reduceByKey()          Merges the values for each key using the specified function.

sortByKey()            Used to sort RDD elements by key.
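
A spark-shell sketch combining these transformations into a small word-count style pipeline (the input strings are made up for illustration):

val lines = sc.parallelize(Seq("spark is fast", "spark is general"))

val words   = lines.flatMap(line => line.split(" "))  // flatMap: one word per output element
val pairs   = words.map(word => (word, 1))            // map: same number of elements, now (key, value) pairs
val counts  = pairs.reduceByKey((a, b) => a + b)      // reduceByKey: merge values per key
val sorted  = counts.sortByKey()                      // sortByKey: order results by word
val nonStop = sorted.filter(p => p._1 != "is")        // filter: keep only elements matching the predicate

nonStop.cache()                                       // cache: keep the result in memory for reuse
nonStop.collect().foreach(println)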


RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using parallelize()

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))

Create a Spark RDD using textFile() or wholeTextFiles()


Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods we can also read all files from a directory, or files matching a specific pattern.

• textFile() – Reads single or multiple text or CSV files and returns a single RDD[String].

• wholeTextFiles() – Reads single or multiple files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the
file.
RESILIENT DISTRIBUTED DATASET RDD

Create a Spark RDD using textFile() or wholeTextFiles()

val rdd = spark.sparkContext.textFile("C:/tmp/files/*")


rdd.foreach(f=>{ println(f)})

This example reads all files from a directory, creates a single RDD and prints the contents
of the RDD.
RESILIENT DISTRIBUTED DATASET RDD

Create a Spark RDD using textFile() or wholeTextFiles()

val rddWhole = sc.wholeTextFiles("C:/tmp/files/*")


rddWhole.foreach(f=>{ println(f._1+"=>"+f._2) })

where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the file.
RESILIENT DISTRIBUTED DATASET RDD
map

Returns a new distributed dataset formed by passing each element of the source through a
function.

val lengths = test.map(s => s.length)

val pairs = test.map(s => (s, s.length))


RESILIENT DISTRIBUTED DATASET RDD
flatMap

The Spark flatMap() transformation applies a function to every element of the RDD and flattens the results into a new RDD.

The returned RDD can have the same number of elements or more. This is one of the major differences between flatMap() and map(): the map() transformation always returns the same number of elements as the input.

val rdd1 = rdd.flatMap(f=>f.split(" "))
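
A short sketch showing the difference in element counts, with made-up input lines:

val rdd = sc.parallelize(Seq("hello world", "big data"))

rdd.map(f => f.split(" ")).count()       // 2  (one array per input line)
rdd.flatMap(f => f.split(" ")).count()   // 4  (arrays flattened into individual words)
rdd.flatMap(f => f.split(" ")).collect() // Array(hello, world, big, data)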


RESILIENT DISTRIBUTED DATASET RDD
reduce

The Spark RDD reduce() action aggregates the elements of a dataset; it can be used to compute the min, max, or total of the elements.

val listRdd = sc.parallelize(List(1,2,3,4,5,3,2))


println("output min using binary : "+listRdd.reduce(_ min _))
println("output max using binary : "+listRdd.reduce(_ max _))
println("output sum using binary : "+listRdd.reduce(_ + _))
RESILIENT DISTRIBUTED DATASET RDD
reduceByKey

Spark RDD reduceByKey() transformation is used to merge the values of each key using an
associative reduce function.

val rdd2=rdd.reduceByKey(_ + _)
rdd2.foreach(println)
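
A self-contained sketch, assuming a small pair RDD of (word, count) tuples built with parallelize():

val rdd = sc.parallelize(Seq(("spark", 1), ("hadoop", 1), ("spark", 1)))

val rdd2 = rdd.reduceByKey(_ + _)   // merges values per key: ("spark", 2), ("hadoop", 1)
rdd2.foreach(println)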
RESILIENT DISTRIBUTED DATASET RDD
Filter transformation

Spark RDD filter is an operation that creates a new RDD by selecting the elements
from the input RDD that satisfy a given predicate (or condition).
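
A minimal sketch, with made-up numbers, keeping only the even elements:

val numbers = sc.parallelize(1 to 10)

val evens = numbers.filter(n => n % 2 == 0)  // keep elements satisfying the predicate
evens.collect()                              // Array(2, 4, 6, 8, 10)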
RDD ACTIONS
collect() – Returns the complete dataset as an Array.

count() – Returns the count of elements in the dataset.
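
A tiny sketch of these two actions:

val letters = sc.parallelize(Seq("a", "b", "c"))

letters.collect()   // Array(a, b, c)
letters.count()     // 3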


SPARK CACHE

• Apache Spark provides an important feature to cache intermediate data, which gives a significant performance improvement when running multiple queries on the same data.

When to cache

• The rule of thumb for caching is to identify the DataFrame that you will be reusing in your Spark application and cache it.
SPARK CACHE

Benefits of caching a DataFrame

• Reading data from the source (hdfs:// or s3://) is time consuming. So after you read data from the source and apply all the common operations, cache it if you are going to reuse the data.

• By caching you create a checkpoint in your Spark application, and if any task fails further down the execution of the application, your application will be able to recompute the lost RDD partitions from the cache.

• If you don't have enough memory, data will be cached on the local disk of the executor, which will still be faster than reading from the source.

• If you can only cache a fraction of the data, it will also improve performance; the rest of the data can be recomputed by Spark, and that is what the "resilient" in RDD means.
SPARK CACHE

Different storage levels for caching data

We can use different storage levels when caching data:

• DISK_ONLY: Persists data on disk only, in serialized format.

• MEMORY_ONLY: Persists data in memory only, in deserialized format.

• MEMORY_ONLY_SER: The same as MEMORY_ONLY, the difference being that it stores the RDD as serialized objects in JVM memory.

• MEMORY_ONLY_2: The same as MEMORY_ONLY, but replicates each partition to two cluster nodes.

• MEMORY_AND_DISK: Persists data in memory; if enough memory is not available, evicted blocks are stored on disk.

• MEMORY_AND_DISK_SER: The same as MEMORY_AND_DISK, the difference being that it serializes the DataFrame objects.
SPARK CACHE

• The cache() method by default saves data to memory (MEMORY_ONLY).

• The persist() method is used to store data at a user-defined storage level.

import org.apache.spark.storage.StorageLevel

val dfCache = df.cache()

val dfPersist = df.persist(StorageLevel.MEMORY_ONLY)
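
A short usage sketch (df stands for any DataFrame you have already built; the names are illustrative):

val dfCached = df.cache()   // marks the DataFrame for caching (MEMORY_ONLY by default)
dfCached.count()            // the first action materializes the cache
dfCached.show(5)            // subsequent actions read from the cache instead of the source
dfCached.unpersist()        // release the cached blocks when no longer needed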
SPARK MAVEN APPLICATION

<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.13</artifactId>
    <version>3.2.1</version>
  </dependency>

  <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
SPARK MAVEN APPLICATION

In production: create the wc.jar file

spark-submit \
  --class sparkwordcount \
  --master local \
  wc.jar test.txt resultat
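
A minimal sketch of what the sparkwordcount driver class might look like (the logic and names are assumptions; only the object name needs to match the --class argument above):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical word-count driver matching the spark-submit command above.
object sparkwordcount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparkwordcount")
    val sc = new SparkContext(conf)

    val lines  = sc.textFile(args(0))          // input file, e.g. test.txt
    val counts = lines.flatMap(_.split(" "))   // split lines into words
                      .map(word => (word, 1))  // pair each word with 1
                      .reduceByKey(_ + _)      // sum the counts per word

    counts.saveAsTextFile(args(1))             // output directory, e.g. resultat
    sc.stop()
  }
}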
