5. SPARK
Dr Mouhim Sanaa
SPARK VS HADOOP
Speed
• In-memory computations
• Faster than MapReduce for complex applications on disk
Generality
• Batch applications
• Iterative algorithms
• Interactive queries and streaming
Ease of use
• APIs for Scala, Python, and Java
• Libraries for SQL, machine learning, streaming, and graph processing
• Runs on Hadoop clusters or standalone
SPARK VS HADOOP
• Hadoop implements batch processing on Big Data; it cannot serve real-time use cases.
SPARK FEATURES
MapReduce and Spark are often used together: MapReduce handles batch processing while Spark handles real-time processing.
SPARK FEATURES
• 100x faster than MapReduce for large-scale data processing
• Provides powerful caching and disk persistence capabilities
• Can be programmed in Scala, Java, Python or R
• Can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
SPARK ECOSYSTEM
It is responsible for:
Transformations
• Creates a DAG
• Lazy evaluations
• No return value
Actions
• Performs the transformations and the action that follows
• Returns a value
RDD OPERATIONS
Transformations
Actions
SPARK SHELL
Scala:
• To launch the Scala shell: spark-shell
Python:
• To launch the Python shell : pyspark
reduceByKey() merges the values for each key with the function specified.
• textFile() – Reads single or multiple text/CSV files and returns a single Spark RDD[String]
This example reads all files from a directory, creates a single RDD and prints the contents
of the RDD.
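The example itself is not reproduced on the slide; a minimal sketch of the idea, assuming spark-shell (so sc, the SparkContext, already exists) and an illustrative directory path:

val rdd = sc.textFile("data/files")       // reads every file in the directory into one RDD[String]
rdd.collect().foreach(println)            // prints each line of the RDD on the driver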
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using parallelize()
wholeTextFiles() returns a pair RDD where the first value (_1) in each tuple is a file name and the second value (_2) is the content of the file.
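A minimal sketch of both approaches, assuming spark-shell (sc already exists); the sample values and the directory path are illustrative only:

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))            // RDD built from a local collection
val files = sc.wholeTextFiles("data/files")              // RDD of (fileName, fileContent) tuples
files.foreach { case (name, content) => println(name) }  // prints each file name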
RESILIENT DISTRIBUTED DATASET RDD
map
Returns a new distributed dataset formed by passing each element of the source through a
function.
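A minimal sketch in spark-shell (the sample numbers are illustrative):

val numbers = sc.parallelize(Seq(1, 2, 3, 4))
val doubled = numbers.map(_ * 2)              // one output element per input element
doubled.collect().foreach(println)            // 2, 4, 6, 8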
Spark flatMap() transformation applies the function to every element, flattens the results, and returns a new RDD.
The returned RDD can have the same number of elements or more. This is one of
the major differences between flatMap() and map(): the map() transformation always
returns the same number of elements as the input.
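A minimal spark-shell sketch contrasting the two (sample sentences are illustrative):

val lines = sc.parallelize(Seq("hello spark", "hello world"))
val arrays = lines.map(_.split(" "))          // map keeps 2 elements (one array per input line)
val words  = lines.flatMap(_.split(" "))      // flatMap flattens them: 4 words
words.collect().foreach(println)              // hello, spark, hello, world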
The Spark RDD reduce() action aggregates the elements of a dataset, for example to calculate the min, max, or total.
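A minimal spark-shell sketch (the sample numbers are illustrative):

val nums  = sc.parallelize(Seq(5, 1, 9, 3))
val total = nums.reduce(_ + _)                           // 18
val max   = nums.reduce((a, b) => if (a > b) a else b)   // 9
val min   = nums.reduce((a, b) => if (a < b) a else b)   // 1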
Spark RDD reduceByKey() transformation is used to merge the values of each key using an
associative reduce function.
val rdd = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))   // hypothetical key/value pairs
val rdd2 = rdd.reduceByKey(_ + _)                              // sums the values per key
rdd2.foreach(println)                                          // (a,2), (b,1)
RESILIENT DISTRIBUTED DATASET RDD
Filter transformation
Spark RDD filter is an operation that creates a new RDD by selecting the elements
from the input RDD that satisfy a given predicate (or condition).
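A minimal spark-shell sketch (the predicate and values are illustrative):

val nums = sc.parallelize(1 to 10)
val even = nums.filter(_ % 2 == 0)            // keeps only elements satisfying the predicate
even.collect().foreach(println)               // 2, 4, 6, 8, 10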
RESILIENT DISTRIBUTED DATASET RDD
count
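A minimal spark-shell sketch of the count() action (sample data is illustrative):

val rdd = sc.parallelize(Seq("a", "b", "c"))
println(rdd.count())                          // 3 : number of elements in the RDD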
SPARK CACHE
• Apache Spark provides an important feature to cache intermediate data, which gives a
significant performance improvement when running multiple queries on the same data.
When to cache
• The rule of thumb for caching is to identify the DataFrame that you will be reusing in
your Spark application and cache it.
SPARK CACHE
• Reading data from a source (hdfs:// or s3://) is time consuming. So after you read data
from the source and apply all the common operations, cache it if you are going to reuse the
data.
• By caching you create a checkpoint in your Spark application; if any task fails further down
the execution, your application will be able to recompute the lost RDD partitions from the
cache.
• If you don't have enough memory, data will be cached on the local disk of the executor,
which will still be faster than reading from the source.
• If you can only cache a fraction of the data, that will also improve performance; the rest of
the data can be recomputed by Spark, and that is what "resilient" in RDD means.
SPARK CACHE
import org.apache.spark.storage.StorageLevel   // needed for the storage level constant
val dfPersist = df.persist(StorageLevel.MEMORY_ONLY)
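A slightly fuller sketch of the same idea (the source path, variable names, and the ERROR filter are assumptions; the StorageLevel import above is assumed to be in scope):

val logs = spark.read.textFile("hdfs:///data/logs")       // hypothetical source path
val cached = logs.persist(StorageLevel.MEMORY_ONLY)
println(cached.count())                                    // first action materializes the cache
println(cached.filter(_.contains("ERROR")).count())        // reuses the cached data instead of re-reading
cached.unpersist()                                         // release the memory when done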
SPARK MAVEN APPLICATION
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.13</artifactId>
    <version>3.2.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
SPARK MAVEN APPLICATION
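The word-count application referenced by the spark-submit command below is not reproduced in the slides; a minimal sketch of what such a sparkwordcount object could look like (all of this code is an assumption, not the original application):

object sparkwordcount {
  def main(args: Array[String]): Unit = {
    val conf = new org.apache.spark.SparkConf().setAppName("sparkwordcount")
    val sc = new org.apache.spark.SparkContext(conf)
    // args(0) = input file (e.g. test.txt), args(1) = output directory (e.g. resultat)
    val counts = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}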
SPARK MAVEN APPLICATION
spark-submit \
  --class sparkwordcount \
  --master local \
  Wc.jar test.txt resultat