PySpark+Slides v1
Spark
Apache Spark
Prior to 2.0, to create a SparkContext we first need to build a SparkConf object that contains information about
the application. (This is the old way of doing it.)
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
Prior to 2.0: separate SparkContext, SQLContext and HiveContext instances/objects.
2.0 and onwards: a single SparkSession object (it still exposes the old classes, but they are not recommended for use).
spark = SparkSession \
.builder \
.master('yarn') \
.appName("Python Spark SQL basic example") \
.getOrCreate()
How to Run :
1. Organize the folders and create a Python file under the bin folder.
2. Write the above code in the .py file.
3. Execute the file using spark-submit command.
spark2-submit \
/devl/example1/src/main/python/bin/basic.py
spark-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
--driver-memory <value>g \
--executor-memory <value>g \
--executor-cores <number of cores> \
--jars <comma separated dependencies> \
--packages <package name> \
--py-files \
<application> <application args>
[Diagram: Spark JVM processes — the Client JVM hosts the Driver (Spark Application and Spark Context in its JVM heap); the cluster runs Executor JVMs, each with a JVM heap and task slots (T = occupied task slot, some slots unoccupied).]
--conf spark.yarn.appMasterEnv.HDFS_PATH="practice/retail_db/orders"
We can set environment variables like this when Spark is running on YARN.
https://spark.apache.org/docs/latest/running-on-yarn.html#configuration
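A minimal sketch (variable name taken from the example above; it assumes the driver runs where the variable was set, e.g. cluster mode) of how the application could pick up that value:
import os
hdfs_path = os.environ.get('HDFS_PATH', 'practice/retail_db/orders')   # the fallback path is just an assumption
ord = sc.textFile(hdfs_path)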
--executor-cores : Number of CPU cores to use for each executor process.
--py-files: Use --py-files to add .py and .zip files. Files specified with --py-files are uploaded to the cluster before
the application runs.
Ex - --py-files file1.py,file2.py,file3.zip
***Correction - I said 100MB. It is 200MB. The file would be divided into 2 partitions: 128MB and 72MB.
Transformations
Create RDD
sc.parallelize
RDD
Actions
Result
[Diagram: lazy evaluation — an RDD is created (1), transformations such as filter (2) and group by (3) build new RDDs, and only when an action such as count (4) is called does execution start; each partition is processed by a task inside an executor on a worker node.]
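A minimal sketch of this lazy behaviour (toy data, not from the slides): transformations only build the plan, and the action triggers execution.
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)      # create an RDD (nothing runs yet)
evens = rdd.filter(lambda x: x % 2 == 0)         # transformation - still lazy
print(evens.count())                             # action - execution starts only here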
Single-Node
[Diagram: data in Stage 1 (in RAM) passes through Transformation 1 and Transformation 2 to become data in Stage 2 (in RAM).]
Opaque Computation & Opaque Data
The compute function is used to do computations on partitions. It maps each partition to an iterator over the data stored within the RDD.
Given a particular partition, Spark creates an iterator and executes that piece of code on the partition, distributes the work
across the cluster and merges the results back; that is the fundamental programming model of an RDD.
The RDD does not know what the function is doing, or anything about the data. It just serializes this piece of code, sends it over to the executors
and lets them execute it.
For example, if you are trying to join, filter or project, Spark cannot automatically optimize it.
The data stored in the RDD is also opaque to Spark, so Spark cannot prune data that the query does not
need.
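A minimal sketch of this per-partition iterator model (toy data, assumed): the supplied function receives an iterator over one partition and returns an iterator, and Spark merges the per-partition results.
rdd = sc.parallelize(range(10), 2)

def sum_partition(it):
    yield sum(it)                    # one value per partition

print(rdd.mapPartitions(sum_partition).collect())   # e.g. [10, 35]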
Other RDDs
rdd1 = rdd.map(lambda x: x[1])
Existing DataFrame
df=spark.createDataFrame(data=(('robert',35),('Mike',45)),schema=('name','age'))
new_rdd= df.rdd
RDD operating in
Parallel
Transformations
Actions
Results
Transformations
Row Level: map
Joining: join
Key Agg: reduceByKey
Sorting: sortByKey
Set: union
Sampling: sample
Pipe: pipe
Partitions: coalesce
Actions
Display: take, takeSample, takeOrdered, first, collect
Total Agg: reduce, count
File Extraction: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile
foreach: foreach
ord = sc.textFile('practice/retail_db/orders')
ordItems = sc.textFile('practice/retail_db/order_items')
### PS: Create key-value pairs with key as Order id and values as whole records.
ordMap = ord.map(lambda x : (x.split(',')[0],x))
### PS: Applied user defined function to convert status into lowercase.
def lowerCase(str):
    return str.lower()
ord.map(lambda x : lowerCase(x.split(',')[3])).first()
flatMap : flatMap(f, preservesPartitioning=False)
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a sequence rather than a
single item). The number of output records can therefore be larger or smaller than the number of input records.
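A small sketch (toy data) contrasting map and flatMap:
lines = sc.parallelize(["a,b,c", "d,e"])
print(lines.map(lambda l: l.split(',')).collect())      # [['a', 'b', 'c'], ['d', 'e']] - 2 records
print(lines.flatMap(lambda l: l.split(',')).collect())  # ['a', 'b', 'c', 'd', 'e'] - 5 records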
PS: Print all the orders which are closed or Complete and ordered in the year 2013.
ord = sc.textFile('practice/retail_db/orders')
filteredOrd = ord.filter(lambda x : (x.split(',')[3] in ("CLOSED","COMPLETE")) and (x.split(',')[1].split('-')[0] == '2013'))
RDD2
ordMap = ord.map(lambda x : (x.split(',')[0],x.split(',')[2]))
ordItemsMap = ordItems.map(lambda x : (x.split(',')[1],x.split(',')[4]))
findSubtotalForCust = ordMap.join(ordItemsMap)
findSubtotalForCust.map(lambda x : x[1][0]+','+x[1][1]).first()
findSubtotalForCust.map(lambda x : str(x[1][0])+','+str(x[1][1])).first()
Ex –
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
xy = x.cogroup(y)
for i,j in list(xy.take(5)) : print(i + ' ' + str([list(v) for v in j]))
### For a given order id 10, find the maximum subtotal among its order items.
ordItems.filter(lambda x : int(x.split(',')[1]) == 10).map(lambda x : x.split(',')[4]).reduce(lambda a,b : a if (float(a) > float(b)) else b)
ordItems.filter(lambda x : int(x.split(',')[1]) == 10).map(lambda x : x.split(',')[4]).reduce(max)
Combiner:
• It computes intermediate values within each partition so that less data needs to be shuffled.
toDebugString returns “A description of this RDD and its recursive dependencies for debugging.” It
includes possible shuffles.
Ex-
from operator import add
rdd = sc.parallelize((("a", 1), ("b", 1), ("a", 1)))
sorted(rdd.reduceByKey(add).collect())
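For example, printing the lineage of the reduceByKey result above (a sketch; toDebugString returns bytes in Python 3, hence the decode):
counts = rdd.reduceByKey(add)
print(counts.toDebugString().decode())   # the indentation change marks the shuffle introduced by reduceByKey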
[Diagram: reduceByKey with a combiner — Partition 1 holds (2, 199.99), (2, 250.00), (4, 49.98), (4, 299.95) and is combined locally into (2, 449.99), (4, 349.93); Partition 2 holds (2, 129.99), (4, 150.0), (4, 199.92) and is combined locally into (2, 129.99), (4, 349.92); after the shuffle the final results are (2, 579.98) and (4, 699.85).]
ordItems=sc.parallelize([
(2,"Joseph",200), (2,"Jimmy",250), (2,"Tina",130), (4,"Jimmy",50), (4,"Tina",300),
(4,"Joseph",150), (4,"Ram",200), (7,"Tina",200), (7,"Joseph",300), (7,"Jimmy",80)],2)
#Initialize Accumulator
# Zero Value: Zero value in our case will be 0 as we are finding Maximum Marks
zero_val=0
ordItems=sc.parallelize([
(2,"Joseph",200), (2,"Jimmy",250), (2,"Tina",130), (4,"Jimmy",50), (4,"Tina",300),
(4,"Joseph",150), (4,"Ram",200), (7,"Tina",200), (7,"Joseph",300), (7,"Jimmy",80)],2)
#Initialize Accumulator
# Zero Value: Zero value in our case will be 0 as we are finding Maximum Marks
zero_val=('',0)
ordItems=sc.parallelize([
(2,"Joseph",200), (2,"Jimmy",250), (2,"Tina",130), (4,"Jimmy",50), (4,"Tina",300),
(4,"Joseph",150), (4,"Ram",200), (7,"Tina",200), (7,"Joseph",300), (7,"Jimmy",80)],2)
#Initialize Accumulator
# Zero Value: Zero value in our case will be 0 as we are finding Maximum Marks
zero_val=(0,0)
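A sketch (not verbatim from the slides) of how the zero value ('', 0) could be used with aggregateByKey on the ordItems data above to find the student with the maximum marks per order id:
pairs = ordItems.map(lambda rec: (rec[0], (rec[1], rec[2])))     # (order_id, (name, marks))

def seq_op(acc, val):        # runs within a partition; acc and val are (name, marks)
    return acc if acc[1] >= val[1] else val

def comb_op(acc1, acc2):     # merges the per-partition results
    return acc1 if acc1[1] >= acc2[1] else acc2

maxPerOrder = pairs.aggregateByKey(('', 0), seq_op, comb_op)
print(maxPerOrder.collect())   # e.g. [(2, ('Jimmy', 250)), (4, ('Tina', 300)), (7, ('Joseph', 300))]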
[Diagram: the data split across Partition 1 and Partition 2 — find the max revenue and the count.]
ord = sc.textFile('practice/retail_db/orders')
Global Ranking
• takeOrdered or top
Ranking Per Group:
• Getting ranking per group is a bit complex but important to know.
• Per-key or Per group ranking can be achieved using
• groupByKey with flatMap
• Python Knowledge like sorted function, list etc
Global Ranking
Using sortByKey and take:
Ex – Top five products with highest prices.
prod = sc.textFile('practice/retail_db/products')
prodPair = prod.map(lambda x : (float(x.split(',')[4]),x))
prod = prod.filter(lambda x : x.split(',')[4] != '')   # drop the record(s) with an empty price
prodPair = prod.map(lambda x : (float(x.split(',')[4]),x))
top5Products = prodPair.sortByKey(False).take(5)
Ex –
Top 2 Products with highest Prices per Category.
prod = sc.textFile('practice/retail_db/products')
prodF = prod.filter(lambda x : (int(x.split(',')[1]) in [2,3,4]) and (int(x.split(',')[0]) in [1,2,3,4,5,25,26,27,28,29,49,50,51,52,53]))
prodGroupBy = prodF.map(lambda line : ( int(line.split(',')[1]), line)).groupByKey()
first = prodGroupBy.first()
sorted(first[1],key = lambda x : float(x.split(',')[4]),reverse=True)
top2ProductsByPrice = prodGroupBy.flatMap(lambda x: sorted(x[1], key=lambda k: float(k.split(",")[4]), reverse=True)[:2])
rdd = sc.parallelize(range(100), 4)
rdd.sample(seed=10,fraction=0.1,withReplacement=False).collect()
rdd.takeSample(seed=10,num=10,withReplacement=True)
File: test_data — Size: 670 MB, Record Count: 8m, No. of Blocks: 6, Partitions: 6.
After applying a filter — Size: 192B, Record Count: 32, Partitions: 6 (unchanged).
### Testing
rdd = sc.textFile('/user/test/test_data')
rdd.getNumPartitions()
#Apply a filter.
rdd1 = rdd.filter(lambda x : int(x.split(',')[0]) == 1)
rdd1.getNumPartitions()
Repartition
repartition(numPartitions):
• Returns a new RDD that has exactly numPartitions partitions.
• Creates almost equal-sized partitions.
• Can increase or decrease the level of parallelism.
• Spark performs better with equal-sized partitions. If you need further processing of huge data, it is preferred to have equal-sized partitions,
so consider using repartition.
• Internally, this uses a shuffle to redistribute data from all partitions, which makes it a very expensive operation. So avoid it if not required.
• If you are decreasing the number of partitions, consider using coalesce instead, which moves less data across partitions.
Ex –
ord = sc.textFile('practice/retail_db/orders')
ord.glom().map(len).collect()
ord = ord.repartition(5)
ord.glom().map(len).collect()
repartition(2)
[Diagram: lots of shuffling in repartition — five input partitions P1(100), P2(50), P3(350), P4(400), P5(100) are shuffled into two output partitions P6(500) and P7(500).]
Ex -
rdd = sc.parallelize(((9, ('a','z')), (3, ('x','f')), (6, ('j','b')), (4, ('a','b')), (8, ('s','b')), (1, ('a','b'))),2)
rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x : x % 2, True)
rdd2.glom().collect()
coalesce(2)
[Diagram: no shuffle in coalesce — partitions P1(100) and P2(300) are merged locally into P1(400), and P3(5) and P4(3) into P3(8).]
Ex – Find the number of customers who placed orders in the July or August months. Store the output in HDFS as a text format.
Create 5 files. The compression format should be bzip2.
ord = sc.textFile('practice/retail_db/orders')
julyOrd = ord.filter(lambda x : str(x.split(',')[1].split('-')[1]) == '07')
augOrd = ord.filter(lambda x : str(x.split(',')[1].split('-')[1]) == '08')
julyAugOrders = julyOrd.union(augOrd).distinct()
julyAugOrders.coalesce(5).saveAsTextFile('practice/retail_db/dump/julyAugOrders',
compressionCodecClass='org.apache.hadoop.io.compress.BZip2Codec')
[Diagram: Spark application architecture — the Driver JVM runs the Spark Context; the operator DAG of RDDs (RDD1, RDD2, ...) is turned into tasks and launched via the Task Scheduler; the Cluster Manager allocates Worker Nodes, and each Worker Node runs an Executor with a cache and tasks. The numbers (1)-(9) correspond to the steps described below.]
Follows a master-slave architecture.
The client submits the user application code to the Driver. (1)
A JVM is created on the driver. (2)
The Spark Context is created in the JVM of the driver program. Only one active SparkContext per JVM. (3)
The driver implicitly converts the user code into a logical DAG (Directed Acyclic Graph) using the DAG scheduler. (4)
• The DAG Scheduler performs optimizations such as pipelining transformations, and then converts the logical DAG
into a physical execution plan with many stages.
• After creating the physical execution plan, it creates the physical execution units, called tasks, under each stage.
The stages are passed on to the Task Scheduler, which launches the tasks through the cluster manager. (5)
Now the driver, via the Spark Context, talks to the cluster manager and negotiates resources. It requests worker nodes and
executors in the cluster. (6) (7)
• Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these
communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other
applications (e.g. Mesos/YARN).
• The Cluster Manager allocates resources and instructs the executors to execute the job.
• It also tracks the submitted jobs and reports the status of each job back to the driver.
The driver sends the application code and dependencies (defined by JAR or Python files passed to the SparkContext) to the executors. (8)
Finally, the driver also sends the tasks to the executors to run.
Job resources which we pass as part of the execution can be cached at the Worker Nodes. One of those job
resources can be our code itself.
All the executors register themselves with the driver, so that the driver has a complete view of the executors.
The executors now start executing the tasks that are assigned by the driver program.
While the application is running, the driver program monitors the set of executors that run. The driver also schedules future
tasks based on data placement.
After execution, the result is returned to the Spark Context. (9)
Cluster Manager Types:
The system currently supports several cluster managers:
• Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
• Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
• Hadoop YARN – the resource manager in Hadoop 2.
• Kubernetes – an open-source system for automating deployment, scaling, and management of containerized
applications.
Flow:
1. The client submits the Spark application. The driver instantiates a SparkContext.
2. The driver talks to the cluster manager (YARN) and negotiates resources.
3. The YARN Resource Manager searches for a Node Manager which will, in turn, launch an ApplicationMaster for the
specific job in a container.
4. The ApplicationMaster registers itself with the Resource Manager.
5. The ApplicationMaster negotiates containers for executors from the ResourceManager. It can request more
resources from the RM.
6. The ApplicationMaster notifies the Node Managers to launch the containers and executors. The executors then execute
the tasks.
7. The driver communicates with the executors to coordinate the processing of the tasks of an application.
8. Once the tasks are complete, the ApplicationMaster un-registers with the Resource Manager.
[Diagram: Spark on YARN — the Driver (1) talks to the Resource Manager (2); the RM asks a Node Manager (3) to launch a container holding the Application Master (4); the Application Master negotiates executor containers from the RM (5) and asks Node Managers to launch them (6); the Driver then coordinates the tasks running inside the executor containers (7), and the Application Master un-registers on completion (8).]
Node Manager:
• Runs on all Worker Nodes.
• Launches and monitors the containers which are assigned by the RM.
• Responsible for the execution of the tasks on each data node.
Containers:
• A set of resources (RAM, CPU cores, etc.) on a single node; they are scheduled by the RM and monitored
by the NM.
Application Master:
• An individual ApplicationMaster is assigned to each job by the RM.
• Its chief responsibility is to negotiate resources from the RM. It works with the Node Managers to monitor
and execute the tasks.
JVM Processes
[Diagram: the cluster runs Executor JVMs, each with a JVM heap and task slots (occupied and unoccupied), while the Client JVM hosts the Spark Context in its own JVM heap.]
[Diagram: a narrow dependency — each partition of RDD1 (Partition 2, Partition 3, ...) maps to exactly one partition of RDD2.]
Wide Transformations:
• This type of transformation will have input partitions contributing to many output partitions.
• Each Wide Transformations creates a new stage.
• Slow compared to Narrow.
• Data shuffle.
• Ex- groupByKey(), aggregateByKey(), join, distinct(), repartition() etc.
[Diagram: a wide transformation — partitions of the input contribute to many output partitions; e.g. parallelize → aggregateByKey and parallelize → join each introduce a shuffle.]
toDebugString:
• Displays the description of this RDD and its recursive dependencies for debugging.
[Diagram: the Spark Context builds a DAG of RDDs (rdd1 → rdd2 → rdd3 → rdd4) split into Stage 1 and Stage 2.]
Word Count Program:
text_file = sc.textFile('practice/retail_db/word')
wordCounts = text_file.flatMap(lambda line: line.split(",")) \
    .filter(lambda x : x.isdigit() == False) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
top3Words = wordCounts.takeOrdered(3, lambda k: -float(k[1]))
[Diagram: the lineage for the word count — textFile creates a HadoopRDD; flatMap, filter and map are narrow transformations (NT), each producing a mapPartitionsRDD in Stage 1; reduceByKey produces a shuffleRDD, which starts Stage 2; takeOrdered is the action (A). The DAG Scheduler converts this into a physical execution plan (Stage 1 and Stage 2, each with partitions p1, p2), and the Task Scheduler launches the tasks.]
Default
Storage
unpersist():
• Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
• Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used
(LRU) fashion.
• If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist()
method.
rdd.persist()
rdd.is_cached
rdd.unpersist()
rdd.is_cached
--Broadcast a Dictionary
days={"sun": "Sunday", "mon" : "Monday", "tue":"Tuesday"}
bcDays = spark.sparkContext.broadcast(days)
bcDays.value
bcDays.value['sun']
--Broadcast a list
numbers = (1,2,3)
broadcastNumbers=spark.sparkContext.broadcast(numbers)
broadcastNumbers.value
broadcastNumbers.value[0]
Solution:
days={"sun": "Sunday", "mon" : "Monday", "tue":"Tuesday"}
bcDays = spark.sparkContext.broadcast(days)
data = (("James","Smith","USA","mon"),
("Michael","Rose","USA","tue"),
("Robert","Williams","USA","sun"),
("Maria","Jones","USA","tue")
)
rdd = spark.sparkContext.parallelize(data)
def days_convert(day):
    return bcDays.value[day]
counter = 0
def f1(x):
global counter
counter += 1
rdd = spark.sparkContext.parallelize((1,2,3))
rdd.foreach(f1)
print(counter)   ### Still 0 when read on the driver
The counter variable will not be added to or changed, because when Spark ships this code to every executor the variable becomes local to
that executor. So the variable is updated on that executor but is not sent back to the driver. To avoid this problem, we need an
accumulator. All the updates to an accumulator variable on every executor are sent back to the driver.
accum=sc.accumulator(0)
rdd=spark.sparkContext.parallelize((1,2,3,4,5))
rdd.foreach(lambda x:accum.add(x))
print(accum.value) #Accessed by driver
foreach() is an action which is applied to each element of the rdd and then adding each element to accum variable.
rdd.foreach() is executed on workers and accum.value is called from driver.
Catalyst Optimizer and Tungsten Execution Engine
[Diagram: user programs (a SQL query or a DataFrame, i.e. a series of transformations) are represented as trees; Catalyst turns the parsed query plan into an optimized query plan, and Tungsten turns that plan into RDDs for execution.]
Optimizer: An optimizer can automatically find out the most efficient plan to execute a query.
Catalyst Optimizer:
• Spark SQL is designed with Catalyst Optimizer which is based on functional programming of Scala.
• Responsible to improve the performance of user programs (SQL Query/DataFrame APIs).
• It converts a query plan into an optimized query plan.
• Two main purposes:
Add new optimization techniques to solve big data problems.
Allow developers and the Spark community to implement and extend the optimizer with new features.
• Offers both rule-based and cost-based optimization (Spark 2.0).
Rule-based: how to execute the query based on a set of defined rules.
Cost-based: generates multiple execution plans and chooses the lowest-cost plan.
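A small sketch (toy DataFrame, assumed) of how to look at the plans Catalyst produces; explain(True) prints the parsed, analyzed and optimized logical plans along with the selected physical plan:
df = spark.range(1000).selectExpr("id", "id % 10 as key")
df.where("key = 1").groupBy("key").count().explain(True)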
[Diagram: identifying expressions in a query plan tree — Project, Filter, Scan operators.]
Reference: databricks.com summit
Volcano Iterator Model:
• Spark 1.6
• Each operator in the query plan implements an iterator interface: it takes records from the operator below it,
does some processing and optionally outputs records to the operator above it.
• Advantages:
Each operator is independent of the others, so it is easy to introduce a new operator without
worrying about how it interacts with all the other operators.
No need to worry about the operators before or after it.
• Disadvantages:
Too many virtual function calls. Since we are agnostic to the operator below, we have no
idea where the input data is coming from.
Extensive memory access: each operator has to write the intermediate row that it is trying to send
to the upstream operator into memory, so there is a memory read/write bottleneck.
Can't take advantage of modern CPU features like SIMD, pipelining, prefetching etc.
Query plan operators:
Aggregate (~ count)
Project (no column to project)
Filter (~ filter condition)
Scan (~ for loop)
Equivalent hand-written loop:
var count = 0
for (order_cust_id in orders) {
  if (order_cust_id == 1000) {
    count += 1
  }
}
from time import time
from pyspark.sql.functions import sum   # use Spark's sum, not Python's built-in

def benchmark(version):
    start = time()
    spark.range(1000 * 1000 * 1000).select(sum("id")).show()
    end = time()
    elapsed = end - start
    print(elapsed)
Spark 1.6:
spark.conf.set("spark.sql.codegen.wholeStage",False)
benchmark('1.6')
Total Time: 10.4 secs
Spark 2.0:
spark.conf.set("spark.sql.codegen.wholeStage",True)
benchmark('2.0')
Total Time: 0.4 secs
Ex –
spark.conf.set("spark.sql.codegen.wholeStage",True)
spark.range(1000).filter("id > 100").selectExpr("sum(id)").explain()
[Diagram: Catalyst phases — a SQL query or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces the Logical Plan; Logical Optimization produces the Optimized Logical Plan; Physical Planning produces several Physical Plans; the Cost Model selects one Physical Plan; Code Generation turns it into RDDs.]
DataSource API:
Used to read and store structured and semi-structured data into Spark SQL.
DataSource API then fetches the data which is then converted into a DataFrame API.
DataFrame API:
Equivalent to a relational table in SQL to perform SQL Operations.
Distributed collection of data organized into named Columns.
Data is stored in partitions.
Catalyst Optimizer:
It converts a query plan into optimized query plan.
Tungsten:
Takes the optimized query plan from Catalyst, generates code and executes it in the cluster in a
distributed fashion.
DataFrame Fundamentals
[Diagram: a dataset ("Robert",31; "Alicia",25; "Deja",19; "Manoj",31) plus a schema ('name','age') form a DataFrame:]
Name | Age
Robert | 31
Alicia | 25
Deja | 19
Manoj | 31
Hands-on 1
[Diagram: a 300MB DataFrame is split into 128MB partitions; each partition is processed by a task (applying a col%2 == 0 filter) inside the executors on Worker Node 1 and Worker Node 2.]
[Diagram: lazy evaluation for DataFrames — a DataFrame is created (1), transformations such as filter (2) and group by (3) produce new DataFrames, and only when an action such as count (4) is called does execution start.]
Example – Find 10 sample records containing the string "Robert" in a 1TB file.
Single-Node
[Diagram: data in Stage 1 passes through Transformation 1 and Transformation 2 to become data in Stage 2.]
[Diagram: the same pipeline distributed — the Stage 1 data is spread across Node 1, Node 2, Node 3, ...; the transformations run on each node to produce the Stage 2 data; the transformation recipe (the plan) is what gets stored.]
Use DataFrames consistently across all Spark libraries.
[Diagram: Spark SQL, Spark Streaming and MLlib (Machine Learning) all sit on top of the DataFrame APIs.]
DataFrame Organization of Data
StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
• Data can be stored on disk, in memory, in off-heap memory, or any combination of these.
• Off-heap memory is a segment of memory that lies outside the JVM but is used by the JVM for certain use cases. Off-heap memory
can also be used by Spark explicitly to store serialized DataFrames and RDDs.
• Data can be stored serialized or deserialized. Serialization is a way to convert a Java object in memory into a series of bytes;
deserialization is the process of bringing those bytes back into memory as an object. Whenever we talk about a 'deserialized'
RDD/DF we are always referring to RDDs/DFs in memory.
• Use the replicated storage levels if you want fast fault recovery.
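A minimal sketch (path reused from earlier examples) of picking a storage level explicitly:
from pyspark import StorageLevel

rdd = sc.textFile('practice/retail_db/orders')
rdd.persist(StorageLevel.MEMORY_AND_DISK)                # keep in memory, spill to disk when it does not fit
# A custom level: StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
replicated = StorageLevel(True, True, False, False, 2)   # serialized, replicated twice for faster fault recovery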
Default
Storage
Prior to 2.0: separate SparkContext, SQLContext and HiveContext instances/objects.
2.0 and onwards: a single SparkSession object (it still exposes the old classes, but they are not recommended for use).
spark = SparkSession \
.builder \
.master('yarn') \
.appName("Python Spark SQL basic example") \
.getOrCreate()
How to Run :
1. Organize the folders and create a Python file under the bin folder.
2. Write the above code in the .py file.
3. Execute the file using spark-submit command.
spark2-submit \
/devl/example1/src/main/python/bin/basic.py
spark-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
--driver-memory <value>g \
--executor-memory <value>g \
--executor-cores <number of cores> \
--jars <comma separated dependencies> \
--packages <package name> \
--py-files \
<application> <application args>
--conf spark.yarn.appMasterEnv.HDFS_PATH="practice/retail_db/orders"
We can set environment variables like this when Spark is running on YARN.
https://spark.apache.org/docs/latest/running-on-yarn.html#configuration
--executor-cores : Number of CPU cores to use for each executor process.
--py-files: Use --py-files to add .py and .zip files. Files specified with --py-files are uploaded to the cluster before
the application runs.
Ex - --py-files file1.py,file2.py,file3.zip
import pandas as pd
data = (('tom', 10), ('nick', 15), ('juli', 14))
df_pandas = pd.DataFrame(data,columns=('Name','Age'))
df = spark.createDataFrame(data=df_pandas)
Source: www.databricks.com
Ex-1:
lst1 = (('Robert',35),('James',25))
lst2 = (('Robert',101),('James',102))
df_emp = spark.createDataFrame(data=lst1,schema=('EmpName','Age'))
df_emp.createOrReplaceTempView("emp")
df_dept = spark.createDataFrame(data=lst2,schema=('EmpName','DeptNo'))
df_dept.createOrReplaceTempView("dept")
df_joined = spark.sql(""" select e.EmpName, e.Age, d.DeptNo from emp e join dept d on e.EmpName = d.EmpName """)
createOrReplaceTempView("table1") //Creates the view in the current database and valid for only one session.
Ex-1:
lst1 = (('Robert',35),('James',25))
df_emp = spark.createDataFrame(data=lst1,schema=('EmpName','Age'))
df_emp.createOrReplaceTempView("emp")
df_op = spark.table("emp")
sorted(df_op.collect()) == sorted(df_emp.collect())
Read a JDBC:
Ex-4:
df=spark.read.load('practice/retail_db/testSpace.txt',format='csv',sep=',',ignoreLeadingWhiteSpace=True,ignoreTr
ailingWhiteSpace=True)
df = spark.read.load('practice/retail_db/orders',format='text')
CSV, JSON and AVRO are Row-based File formats. Sample data in CSV:
(Sequence file is also row based)
ID,FIRST_NAME,AGE
1, Matthew, 19
2, Joe,25
In general, column-oriented formats work well when queries access only a small number of columns in the
table. Conversely, row oriented formats are appropriate when a large number of columns of a single row are
needed for processing at the same time.
Spark Session : read – orc/parquet
In general, column-oriented formats work well when queries access only a small number of columns in the
table. Conversely, row oriented formats are appropriate when a large number of columns of a single row are
needed for processing at the same time.
Column-oriented formats need more memory for reading and writing, since they have to buffer a row split
in memory, rather than just a single row. Also, it’s not usually possible to control when writes occur (via
flush or sync operations), so column-oriented formats are not suited to streaming writes, as the current file
cannot be recovered if the writer process fails. On the other hand, row-oriented formats like sequence files
and Avro datafiles can be read up to the last sync point after a writer failure. It is for this reason that Flume
uses row-oriented formats.
df = spark.read.load('practice/retail_db/orders',format='avro')
Ex-1: Table
df = spark.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:@xxxx-xxx-xxxx:1521/xxx") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", "ORDERS") \
    .option("user", "someUser") \
    .option("password", "somePsw") \
    .load()
df = spark.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:@xxxx-xxx-xxxx:1521/xxx") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", "(SELECT * FROM T_EMP WHERE ID=1) query") \
    .option("user", "someUser") \
    .option("password", "xxx") \
    .load()
df = spark.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:@xxxx-xxx-xxxx:1521/xxx") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", "ORDERS") \
    .option("partitionColumn", "ORDER_ID") \
    .option("lowerBound", "500") \
    .option("upperBound", "1000") \
    .option("numPartitions", "5") \
    .option("user", "someUser") \
    .option("password", "somePassword") \
    .load()
df = spark.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:@xxxx-xxx-xxxx:1521/xxx") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", "(select t1.*, cast(ROWNUM as number(5)) as num_rows from (select * from orders) t1) oracle_table1") \
    .option("partitionColumn", "num_rows") \
    .option("lowerBound", "500") \
    .option("upperBound", "1000") \
    .option("numPartitions", "10") \
    .option("user", "someUser") \
    .option("password", "somePassword") \
    .load()
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType,IntegerType

@udf(returnType=StringType())
def initCap(str):
    finalStr = ""
    ar = str.split(" ")
    for word in ar:
        finalStr = finalStr + word[0:1].upper() + word[1:len(word)] + " "
    return finalStr.strip()
DataFrame:
df.select(df.emp_name, initCap(df.emp_name)).show()
Spark Sql:
spark.udf.register("initcap1", initCap)
spark.sql(""" select emp_name, initcap1(emp_name) from default.emp """).show()
def convertCap(str):
    finalStr = ""
    ar = str.split(" ")
    for word in ar:
        finalStr = finalStr + word[0:1].upper() + word[1:len(word)] + " "
    return finalStr.strip()
Spark Sql:
spark.udf.register("initcap", convertCap)
spark.sql(""" select emp_name, initcap(emp_name) from default.emp """).show()
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType,IntegerType

@udf(returnType=StringType())
def initCap(str):
    finalStr = ""
    ar = str.split(" ")
    for word in ar:
        finalStr = finalStr + word[0:1].upper() + word[1:len(word)] + " "
    return finalStr.strip()
spark.udf.register("initcap1", initCap)
spark.sql(""" select emp_name, initcap1(emp_name) from default.emp """).show()
String Types
StringType() Character String Values
VarcharType(length) Variant of StringType with Length limitation
CharType(length) Variant of VarcharType with Fixed Length
Boolean Types
BooleanType () Boolean Values (True or False. Also can have Null Values)
Binary Type
BinaryType() Byte Sequence Values
Supported Data Types:
Date Types
TimestampType() year, month, day, hour, minute, second, time zone
DateType () year, month, day
Complex Type
ArrayType (elementType,containsNull)
MapType (keyType, valueType,valueContainsNull)
StructType (fields)
Ex-1
schema = StructType((
StructField("name",StringType(),True),
StructField("id", IntegerType(),True),
))
data=(("James",1),
("Robert",2),
("Maria",3)
)
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)
schema = StructType((
StructField('name', StringType(), True),
StructField('properties', MapType(StringType(),StringType()),True)
))
d = ( ('James',{'hair':'black','eye':'brown'}),
('Michael',{'hair':'brown','eye':None}),
('Robert',{'hair':'red','eye':'black'})
)
df_map= spark.createDataFrame(data=d, schema = schema)
df_map.printSchema()
df_map.show(truncate=False)
df_map.select(df_map.properties).show(truncate=False)
df_map.select(df_map.properties['eye']).show(truncate=False)
schema = StructType((
StructField('name', StringType(), True),
StructField('mobileNumbers', ArrayType(IntegerType()),True)
))
d = ( ('James',(123,456,789)),
('Michael',(234,456,678)),
('Robert',(168,89,190))
)
df_arr = spark.createDataFrame(data=d, schema = schema)
df_arr.printSchema()
df_arr.show(truncate=False)
df_arr.select(df_arr.mobileNumbers[1]).show()
For Ex-
IntegerType int, integer
StringType string
BooleanType boolean
Name | Age
Robert | 31
Alicia | 25
Deja | 19
Manoj | 31
1. Row Object:
from pyspark.sql import Row
lst=(Row(name="Alice",age=11), Row(name="Robert",age=35),Row(name="James",age=33))
rdd = sc.parallelize(lst)
for i in rdd.collect(): print (str(i.age) + ' ' + i.name)
df = spark.createDataFrame(lst)
Person = Row("name", "age")
lst=(Person("Alice",11), Person("Robert",35), Person("James",33))
rdd=sc.parallelize(lst)
for i in rdd.collect() : print(i.name)
df = spark.createDataFrame(lst)
1. Select a Column
df.order_id or df["order_id"]
ord.select(col("*")).show() #from pyspark.sql.functions import col
3. Order a Column
asc()
asc_nulls_first()
asc_nulls_last()
desc()
desc_nulls_first()
desc_nulls_last()
Ex - ord.orderBy(ord.order_status.asc()).select(ord.order_status).distinct().show()
Column
4. cast() : Convert the type of a column. astype() is an alias of cast().
PS: Convert order_id column from Integer Type to String Type.
ord.select(ord.order_id.cast("string"))
5. between():
PS: Print all the orders between 10 and 20.
ord[ord.order_id.between(10,20)].show()
ord.where(ord.order_id.between(10,20)).show()
6. contains(), startswith,endswith(),like(),rlike()
PS: Print all the orders with Status CLOSED.
ord.where(ord.order_status.contains('CLOSED')).show()
isNull(), isNotNull()
9. substr
PS: Find Number of completed orders in the year 2013.
ord.where((ord.order_date.substr(1,4).contains('2013')) & (ord.order_status.contains('CLOSED') )).count()
ord.where((ord.order_date.substr(1,4) == '2013') & (ord.order_status == 'CLOSED' )).count()
11. getItem(): An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
Ex- df = spark.createDataFrame([([1, 2], {"key": "value"})], ["lst", "dict"])
df.select(df.lst.getItem(0), df.dict.getItem("key")).show()
Column
12. when(), otherwise() :
Kind of if-else statements in SQL. Using this, we can check multiple conditions in sequence and returns a value when
the first condition is met.
Ex –
from pyspark.sql.functions import when
ord.select(ord.order_status,
when(ord.order_status == 'PENDING_PAYMENT', 'PP')
.when(ord.order_status == 'CLOSED', 'CL')
.when(ord.order_status == 'COMPLETE', 'CO')
.when(ord.order_status == 'PROCESSING', 'PR')
.otherwise(ord.order_status).alias("order_status2")).show(10)
data=(('Robert',35,40,40),('Robert',35,40,40),('Ram',31,33,29),('Ram',31,33,91))
emp = spark.createDataFrame(data=data,schema=('name','score1','score2','score3'))
• selectExpr(*expr)
This is a variant of select that accepts SQL expressions.
ord.selectExpr('substring(order_date,1,10) as order_month').show()
ord.select(substring(ord.order_date,1,4).alias('order_year')).show()
If we want to use any functions available in SQL but not in Spark Built-in functions, then we can use selectExpr.
stack(n, expr1, ..., exprk) - Separates expr1, ..., exprk into n rows.
df.selectExpr("stack(3,1,2,3,4,5,6)").show()
• withColumnRenamed(existingCol, newCol)
Rename Existing Column.
ord.withColumnRenamed('order_id','order_id1').show()
• dropDuplicates(subset=None)
Drop duplicate rows.
Optionally can consider only subset of columns.
emp.dropDuplicates().show()
emp.dropDuplicates(("name","score1","score2")).show()
data=(('a',1),('d',4),('c',3),('b',2),('e',5))
df = spark.createDataFrame(data=data,schema='col1 string,col2 int')
• sortWithinPartitions:
At times, we may not want to sort globally, but within a group. In that case we can use sortWithinPartitions.
df.sortWithinPartitions(df.col1.asc(),df.col2.asc()).show()
• unionByName():
The difference between this function and :func:`union` is that this function
resolves columns by name (not by position)
df1 = spark.createDataFrame(data=(('a',1),('b',2)),schema=('col1 string,col2 int'))
df2 = spark.createDataFrame(data=((2,'b'),(3,'c')),schema=('col2 int,col1 string'))
df1.union(df2).show()
df1.unionByName(df2).show()
• crossJoin(self, other)
• self Join
df1 = spark.createDataFrame(data=((1,'Robert',2),(2,'Ria',3),(3,'James',5)),schema='empid int,empname
string,managerid int')
df1.alias("emp1").join(df1.alias("emp2"),col("emp1.managerid") ==
col("emp2.empid"),'inner').select(col("emp1.empid"),col("emp1.empname"),col("emp2.empid").alias("managerid"),c
ol("emp2.empname").alias("managaer_name")).show()
Use of col(): Sometimes we need to use the column name which is the alias of a withColumn. In that case we need to
refer the column name as col(column _name).
from pyspark.sql.functions import col
Join APIs
• Multi Column Join
df1 = spark.createDataFrame(data=((1,101,'Robert'),(2,102,'Ria'),(3,103,'James')),schema='empid int,deptid
int,empname string')
df2 = spark.createDataFrame(data=((2,102,'USA'),(4,104,'India')),schema='empid int,deptid int,country string')
df1.join(df2,(df1.empid == df2.empid) & (df1.deptid == df2.deptid)).show()
avg(),mean()
count()
min()
max()
sum()
agg() For multiple aggregations at once
pivot()
apply()
data = (("James","Sales","NY",9000,34),
("Alicia","Sales","NY",8600,56),
("Robert","Sales","CA",8100,30),
("Lisa","Finance","CA",9000,24),
("Deja","Finance","CA",9900,40),
("Sugie","Finance","NY",8300,36),
("Ram","Finance","NY",7900,53),
("Kyle","Marketing","CA",8000,25),
("Reid","Marketing","NY",9100,50)
)
schema=("empname","dept","state","salary","age")
df = spark.createDataFrame(data=data,schema=schema)
df.groupBy(df.dept)
<pyspark.sql.group.GroupedData object at 0x7f68eaead690>
GroupBy API
Using avg(),sum(),min(),max(),count(),agg()
Ex-
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.createDataFrame(((1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)),("id", "v"))
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP) # doctest: +SKIP
def normalize(pdf):
v = pdf.v
return pdf.assign(v=(v - v.mean()) / v.std())
df.groupby("id").apply(normalize).show()
• To perform a window function, we will have to partition the data using Window.partitionBy.
• Lets see 3 types of Window Functions:
Ranking
Analytical
Aggregate
Aggregate Function
Analytical Function
Ex -
df.select(df.dept,df.salary) \
.withColumn("lag_prev_sal",lag("salary",1,0).over(spec)) \
.withColumn("lead_next_sal",lead("salary",1,0).over(spec)) \
.show()
Ex – (Aggregate functions)
spec = Window.partitionBy("dept")
df.select(df.dept,df.salary) \
    .withColumn("sum_sal",sum("salary").over(spec)) \
    .withColumn("max_sal",max("salary").over(spec)) \
    .withColumn("min_sal",min("salary").over(spec)) \
    .withColumn("avg_sal",avg("salary").over(spec)) \
    .withColumn("count_sal",count("salary").over(spec)) \
    .show()
Ex – (Analytical functions)
spec = Window.partitionBy("dept").orderBy("salary")
df.select(df.dept,df.salary) \
    .withColumn("first_sal",first("salary").over(spec)) \
    .withColumn("last_sal",last("salary").over(spec)) \
    .show()
spec=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
spec=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rangeBetween(Window.currentRow, Window.unboundedFollowing)
spec=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rangeBetween(Window.currentRow,500)
df.select(df.dept,df.salary).withColumn("sum_sal",sum("salary").over(spec1)).show()
Window Functions
• rowsBetween:
Takes two argument (start,end) to define frame boundaries.
Default: unboundedPreceding and unboundedFollowing.
Both `start` and `end` are relative from the current row. For example, "0" means "current row", while "-1" means one
off before the current row, and "5" means the five off after the current row.
Recommend to use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, and ``Window.currentRow``
to specify special boundary values, rather than using integral values directly.
spec1=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
spec1=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rowsBetween(Window.currentRow, Window.unboundedFollowing)
spec1=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rowsBetween(Window.currentRow, 2)
df.select(df.dept,df.salary).withColumn("sum_sal",sum("salary").over(spec1)).show()
Window Functions
rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
dept salary sum_sal
Finance 7900 43300
Finance 8200 43300
Finance 8300 43300
Finance 9000 43300
Finance 9900 43300
rangeBetween(Window.currentRow, Window.unboundedFollowing)
rowsBetween(Window.currentRow, Window.unboundedFollowing)
rowsBetween(Window.currentRow, 2)
--df1 DataFrame
df1 = spark.range(10)
monotonically_increasing_id():
• A column that generates monotonically increasing 64-bit integers.
• The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
• Use Case: Create a Primary Key/Unique column.
df.withColumn('id',monotonically_increasing_id()).show()
lit():
• It creates a static column with value is provided.
df.withColumn('col',lit(10)).show()
• Can also be used to concat columns.
df.select(concat('salary',lit('|'),'age').alias('value')).show()
spark_partition_id():
• Generates a column with partitions ids.
Ex-
df1 = spark.range(10)
df1.repartition(5)
df1=df1.repartition(5)
df1.select("id",spark_partition_id()).show()
randn(seed):
• Generates a column with independent and identically distributed (i.i.d.) samples from the standard
normal distribution.
sha2(col,numBits):
• Used for Encryption.
• Returns the hex string result of SHA-2 family.
• numBits : 0, 224, 256, 384, 512
df.select(df.age,sha2(df.age.cast('string'),224)).show(truncate=False)
hash(*cols):
• Any type of column or combination of columns.
• Calculates the hash code of given columns and return result as int column.
• May be used for Encryption.
df.select(df.age,hash(df.age)).show(truncate=False)
md5(cols):
• Calculates the MD5 digest and returns the value as a 32 character hex string.
df.select(df.age,md5(df.age.cast('string'))).show(truncate=False)
DataFrame APIs : String
Functions
ltrim(col),rtrim(col),trim(col):
lpad(col,len,pad), rpad(col,len,pad): Pad the string column to width `len` with `pad`.
concat_ws(sep, *cols): Concatenates multiple input string columns together into a single string column, using the
given separator.
ord.withColumn('IDStatus',concat_ws('-',ord.order_id,ord.order_status)).show()
substring(str, pos, len): Substring starts at `pos` and is of length `len` when str is String type.
ord.withColumn('orderYear',substring(ord.order_date,1,4)).show()
substring_index(str, delim, count):Returns the substring from string str before count occurrences of the delimiter
delim. If count is positive, everything the left of the final delimiter (counting from left) is returned. If count is
negative, every to the right of the final delimiter (counting from the right) is returned. substring_index performs a
case-sensitive match when searching for delim.
ord.withColumn('sub',substring_index(ord.order_date,'-',1)).show()
instr(str, substr): Locate the position of the first occurrence of substr column in the given string.
Returns null if either of the arguments are null.
ord.withColumn('instr',instr(ord.order_status,'LO')).show()
translate(srcCol, matching, replace): Translate any character in the `srcCol` by a character in `matching`.
df = spark.createDataFrame([('translate',)], ['col'])
df.select(translate(df.col, "rnlt", "123")).show()
ord_new = ord.withColumn('new_order_date',date_add(ord.order_date,50))
current_timestamp():
next_day(date, dayOfWeek):
Returns the first date which is later than the value of the date column.
dayOfWeek - "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun".
ord.select(ord.order_date,next_day(ord.order_date,'Fri')).distinct().show()
last_day(date):
Returns the last day of the month which the given date belongs to.
ord.select(ord.order_date,last_day(ord.order_date)).distinct().show()
dayofweek(col):
dayofmonth(col):
dayofyear(col):
weekofyear(col):
date_format(date, format):
Converts a date/timestamp/string to a value of string in the format specified by the date.
ord.withColumn('new_order_date',date_format(ord.order_date,'yyyy/MM/dd')).show(5)
from_utc_timestamp(timestamp, tz):
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp
which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the
given time zone.
to_timestamp(col, format=None):
df.select(df.t,to_timestamp(df.t).alias('dt')).show()
### df DataFrame
df = spark.sql("SELECT array(struct(1, 'a'), struct(2, 'b')) as data")
element_at(col,extraction):
• Returns element of array at given index in extraction if col is array.
• Returns value for the given key in extraction if col is map.
emp1.select(emp1.FirstName,element_at(emp1.Languages,2),element_at(emp1.properties,'eye')).show()
struct(*cols) :
• Create a new struct column.
emp_new = emp1.select(struct(emp1.FirstName, emp1.LastName))
array_max(col), array_min(col):
• Returns maximum or minimum values of an array column.
emp3.select(array_max("score_arr1")).show()
array_distinct(col):
• Returns distinct values of an array column.
emp3.select(array_distinct("score_arr1")).show()
array_repeat(col,count):
• Repeated count times.
emp3.select(array_repeat("score_arr1",3)).show(truncate=False)
slice(col,start,length)
• Returns an array containing all the elements in `col` from index `start` for length `length`.
• col is Array Type.
emp1.select(emp1.Languages,slice(emp1.Languages,3,1)).show()
array_remove(col,element):
• Remove all elements that equal to element from the given array.
emp3.select(array_remove("score_arr1",20)).show()
array_sort(col):
• Sorts the input array in ascending order.
• The elements of the input array must be orderable.
• Null elements will be placed at the end of the returned array.
emp3.select("score_arr1",array_sort("score_arr1")).show()
sort_array(col,asc=True):
• Sorts the input array in ascending or descending order according to the natural ordering of the array elements.
• Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in
descending order.
emp3.select(emp3.score_arr1,sort_array(emp3.score_arr1,asc=False)).show()
array_contains(col, value):
• Returns null if the array is null, true if the array contains the given value, and false otherwise.
emp3.select(array_contains(emp3.score_arr1,55)).show()
array_except(col1, col2):
• Returns an array of the elements in col1 but not in col2 without duplicates.
emp3.select(array_except(emp3.score_arr1,emp3.score_arr2)).show(truncate=False)
array_intersect(col1, col2):
• Returns an array of the elements in the intersection of col1 and col2 without duplicates.
emp3.select(array_intersect(emp3.score_arr1,emp3.score_arr2)).show(truncate=False)
arrays_zip(*cols):
• Merge arrays. First element of array1 will be merged with first element of array 2 and so on.
emp3.select(arrays_zip(emp3.score_arr1,emp3.score_arr2)).show(truncate=False)
shuffle(col):
• Random shuffle an array.
emp3.select(emp3.score_arr1,shuffle(emp3.score_arr1)).show()
map_from_entries(col):
• col is array of paired structs.
• Function returns a map created from the given array of entries.
df.select(map_from_entries("data").alias("map")).show()
map_from_arrays(col1,col2):
• Creates a new map from two arrays.
emp3.select(map_from_arrays(emp3.score_arr1,emp3.score_arr2)).printSchema()
map_keys():
• Returns an unordered array containing the keys of the map.
emp1.select(map_keys(emp1.properties)).show()
map_values():
• Returns an unordered array containing the values of the map.
emp1.select(map_values(emp1.properties)).show()
map_concat(*cols):
• Returns the union of all the given maps.
emp1.select(map_concat(emp1.properties,emp1.properties)).show(truncate=False)
sequence(start, stop, step =1):
• Generate a sequence of integers from `start` to `stop`, incrementing by `step`.
• If `step` is not set, incrementing by 1 if `start` is less than or equal to `stop`, otherwise -1.
emp2.select(emp2.score1, emp2.score2,sequence(emp2.score1,emp2.score2).alias('new_col')).show(truncate=False)
Ex- data=(('Alice',80,10),('Bob',None,5),('Tom',50,50),(None,None,None),('Robert',30,35))
schema='name string, age int, height int'
df = spark.createDataFrame(data,schema)
df.na.drop().show()
covar_pop(col1, col2):
Return population covariance.
df.select(covar_pop(df.salary,df.age)).show()
covar_samp(col1, col2):
Return sample covariance.
data =[('Alicia',['Java','Scala'],{'hair':'black','eye':'brown'}),
('Robert',['Spark','Java',None],{'hair':'brown','eye':None}),
('Mike',['CSharp',''],{'hair':'red','eye':''}),
('John',None,None),
('Jeff',['1','2'],{})]
schema = ('empName','Languages','properties')
emp = spark.createDataFrame(data=data,schema=schema)
emp.select(emp.empName,explode(emp.Languages)).show()
emp.select(emp.empName,explode(emp.properties)).show()
explode_outer(col): In explode nulls are ignored, but in explode_outer nulls are reported.
emp.select(emp.empName,explode_outer(emp.Languages)).show()
Ex-
data =[('Alicia',[['Java'],['Scala'],['Python']]),
('Robert',[[None],['Java'],['Hadoop']])]
schema = ('empName','ArrayofArray')
emp = spark.createDataFrame(data=data,schema=schema)
emp.select(emp.empName,flatten(emp.ArrayofArray)).show()
Ex-
data =[('Alicia',[[1],[2]]),
('Robert',[None,[1]])]
schema = ('empName','ArrayofArray')
emp = spark.createDataFrame(data=data,schema=schema)
emp.select(emp.empName,flatten(emp.ArrayofArray)).show()
format_number(col,d): Formats the number X to a format to d decimal places with HALF_EVEN round mode, and returns the
result as a string.
ordItems.select(ordItems.subtotal,format_number(ordItems.subtotal,4)).show()
format_string(format, *cols):
Formats the arguments in printf-style and returns the result as a string column.
df.select(format_string('%d %s', df.a, df.b).alias('v')).show()
data=[(1, """{"Zipcode":85016,"ZipCodeType":"STANDARD","City":"Phoenix","State":"AZ"}""")]
df_struct=spark.createDataFrame(data,("id","value"))
--Array Type Ex
schema=ArrayType(IntegerType())
df_arr_new = df_arr.withColumn('arr_column',from_json(df_arr.value,schema))
df_arr_new.printSchema()
--Struct Type Ex
schema=StructType([StructField("Zipcode",IntegerType()),StructField("ZipCodeType",StringType()),StructField("city",StringType()),StructField("state",StringType())])
df_struct_new = df_struct.withColumn('struct_column',from_json(df_struct.value,schema))
df_struct_new.printSchema()
schema_of_json(json_string):
Use schema_of_json() to create schema string from JSON string column. Json String Column Schema
schemaOfStr=spark.range(1) \
.select(schema_of_json(lit("""{"id":101, "name":"Robert","City":"Phoenix","State":"AZ"}"""))) \
.collect()[0][0]
get_json_object(col,path):
Used to extract the JSON string based on path from the JSON column.
df_map.select(col("id"),get_json_object(col("value"),"$.ZipCodeType").alias("ZipCodeType")) \
.show(truncate=False)
last(col, ignorenulls=False):
greatest(*cols):
Returns the greatest value of the list of column names, skipping null values.
df.select(greatest(df.salary,df.age)).show()
least(*cols):
Returns the least value of the list of column names, skipping null values.
skewness(col):
Returns the skewness of the values in a group.
df.select(skewness(df.salary)).show()
ascii(col):
• Computes the numeric value of the first character of the string column.
df.select(ascii(lit('a'))).show()
bin(col):
• Returns the string representation of the binary value of the given column.
df.select(df.phone,bin(df.phone)).show()
expr()
File: test_data.csv — Size: 2.6GB, Record Count: 128m, Partitions: 21.
After applying a filter — Size: ~1MB, Record Count: 64k, Partitions: 21 (unchanged).
### Testing
df_new = spark.read.load('/user/test/test_data.csv',format='csv',schema=('col1 int, col2 int, col3 int'))
df_new.rdd.getNumPartitions()
--- Apply Filter
df_filter = df_new.where(df_new.col1 < 501)
df_filter.rdd.getNumPartitions()
Ex-df_new.rdd.glom().map(len).collect()
df_filter.rdd.glom().map(len).collect()
df_filter = df_filter.repartition(5)
df_filter.rdd.glom().map(len).collect()
Ex-
data=(('Ram',30),('Raj',25),('James',30),('Joann',25),('Kyle',25),('Robert',30),('Reid',35),('Sam',35))
df = spark.createDataFrame(data=data,schema=('name','age'))
df = df.repartition('age')
df.rdd.getNumPartitions()
spark.conf.get("spark.sql.shuffle.partitions")
df = df.repartition(4,'age')
df.rdd.getNumPartitions()
Repartition
repartition(2)
[Diagram: shuffle in repartition — partitions P1..P5 are shuffled into P6 and P7.]
df_filter.rdd.glom().map(len).collect()
df_filter = df.coalesce(5)
df_filter.rdd.glom().map(len).collect()
coalesce(2)
[Diagram: no shuffle in coalesce — P1 and P2 are merged into P1 (= P1 + P2), and P3 and P4 into P3 (= P3 + P4).]
csv Params:
• path
• mode : default 'error' or 'errorIfExists' : throw an exception.
'append' : append contents to existing data.
'overwrite' : overwrite existing data.
'ignore' : ignore the operation if data exists.
• compression (none, bzip2, gzip, lz4, snappy and deflate)
• sep (default ',')
• header (True or False, default – False)
• dateFormat (default – 'yyyy-MM-dd')
• timestampFormat (default – 'yyyy-MM-dd'T'HH:mm:ss.SSSXXX')
• ignoreLeadingWhiteSpace (default – True)
• ignoreTrailingWhiteSpace (default – True)
• And more …
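A short sketch (using the ord DataFrame from earlier examples; the output path is assumed) applying some of the csv parameters listed above:
ord.write.csv('practice/dump/retail_db/orderCsv',
              mode='overwrite',
              sep='|',
              header=True,
              compression='gzip',
              dateFormat='yyyy-MM-dd')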
Text Params:
• path
• compression
• linesep (default – ‘\n’)
PS: Convert the orders CSV file to a tab-delimited text file and create 10 output files.
from pyspark.sql.functions import concat_ws
ordText = ord.select(concat_ws('\t',ord.order_id,ord.order_date,ord.order_customer_id,ord.order_status).alias('col1'))
ordText.repartition(10).write.save('practice/dump/retail_db/orderText',format='text')
PS: Convert the Order csv file to parquet file. Create one file for each order status category.
ord.write.save('practice/dump/retail_db/orderParquet',format='parquet',mode='overwrite',partitionBy="order_status")
PS: Convert the Order csv file to orc file. Create one file for each order status category.
ord.write.save('practice/dump/retail_db/orderOrc',format='orc', partitionBy="order_status")
PS: Convert the Order csv file to json file. File should be bzip2 compressed and total 1 output file.
ord.coalesce(1).write.save('practice/dump/retail_db/orderJson',format='json', compression='bzip2')
Ex-
ord.write.insertInto('db2.orders')
Ex-
ord.write.saveAsTable(name='db2_orders.order_test', format='orc')
Ex-
ord.write.saveAsTable(name='db2_orders.order_test', format='orc', mode='overwrite')
Ex-
ord.write.saveAsTable('db2_orders.orders1',partitionBy='order_status',format='orc',mode='overwrite',compression='none')
Ex -2 (Using createTableColumnTypes )
ord.select('order_id').write.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("createTableColumnTypes","order_id char(10)" ) \
.option("dbtable","new_orders" ) \
.option("user", someUser) \
.option("password", somePassword) \
.mode("overwrite") \
.save()
ord.select('order_id').write.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("createTableColumnTypes","order_id char(10)" ) \
.option("dbtable","new_orders" ) \
.option("user", someUser) \
.option("password", somePassword) \
.option("batchsize",5000) \
.mode("overwrite") \
.save()
ord.select('order_id').write.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("createTableColumnTypes","order_id char(10)" ) \
.option("dbtable","new_orders" ) \
.option("user", someUser) \
.option("password", somePassword) \
.option("queryTimeout",1) \
.mode("overwrite") \
.save()
[Diagram: broadcast hash join — the smaller Table B is broadcast to every executor (using a BitTorrent-like peer-to-peer protocol), so each partition of Table A (Partition 1, Partition 2, ..., Partition n) performs a local hash join against its own copy of Table B, and the per-partition results form the new table.]
ordItemsDF=spark.read.load('practice/retail_db/order_items',sep=',',format='csv',schema=('order_item_id int,\
order_item_order_id int,order_item_product_id int,quantity tinyint,subtotal float,price float'))
joined.explain()
Ex –
largeDF = spark.range(1,1000000000)
joined.explain()
Hash Join
[Diagram: shuffle hash join — both DataFrames A and B are shuffled on the join key so that the same keys land in the same partition on both sides; within each partition the smaller side is hashed into buckets and the join is performed bucket by bucket, producing the new table.]
Ex –
df1 = spark.range(1,10000000000)
df2 = spark.range(1,10000000)
spark.conf.set("spark.sql.join.preferSortMergeJoin", False)
joined = df1.join(df2,"id")
joined.explain()
joined = df1.hint('shuffle_hash').join(df2,"id")
[Diagram: sort-merge join — Table A and Table B are shuffled on the join key, each partition is sorted, and the corresponding sorted partitions of the two tables are merged to produce the new table.]
Notes:
Only supported for '=' joins.
The join keys need to be sortable.
Supported for all join types (inner, left, right etc).
joined = df1.hint('merge').join(df2,"id")
When more than one hint is specified and the hints are applicable:
• If it is an '=' join:
Broadcast hint : pick broadcast hash join
Merge hint : pick sort-merge join
Shuffle_Hash hint : pick shuffle hash join
Shuffle_replicate_nl hint : pick Cartesian product join
• If it is not an '=' join:
Broadcast hint : pick broadcast nested loop join.
Shuffle_replicate_nl : pick Cartesian product join if the join type is inner-like.
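A small sketch (using df1 and df2 from the earlier example) showing how a hint shows up in the physical plan:
joined = df1.join(df2.hint("broadcast"), "id")
joined.explain()            # should show BroadcastHashJoin
joined = df1.hint("merge").join(df2, "id")
joined.explain()            # should show SortMergeJoin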
Spark-submit Options:
--driver-memory : Memory for driver (e.g. 1000M, 2G) (Default: 1024M)
• Driver Memory is the amount of memory to use for driver process, i.e. the process running the main() function of the application
and where SparkContext is instantiated.
--driver-cores :
• Number of cores used by the driver, only in cluster mode (Default: 1).
• Generally, not required unless you want to perform some local computations in parallel.
All Properties:
https://spark.apache.org/docs/latest/configuration.html
[Diagram: the Driver hosts the Spark Application and the Spark Context.]
[Diagram: the cluster runs Executor JVMs, each with a JVM heap and task slots (occupied and unoccupied); the Client JVM hosts the Driver (Spark Application and Spark Context in its JVM heap).]
[Diagram: a YARN container holds an Executor JVM with its tasks (T T T) plus off-heap memory, which is disabled by default.]
Problems (the diagram shows 160 executors, each with 4GB memory and one core):
• With only one executor per core, we will not be able to take advantage of running multiple tasks in the same JVM.
• There is ~10% overhead for each JVM. With 160 executors, 160 JVM processes would be created, causing a lot of unnecessary overhead.
• Shared variables (Broadcast, Accumulator) will be copied 160 times.
• Not leaving enough memory for the YARN daemon processes and the Application Master.
• Not enough memory for the executors.
NOT GOOD
spark2-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key<=<value> \
--driver-memory 2G \
--executor-memory 64G \
--executor-cores 5 \
--num-executors 10 \
--jars <comma separated dependencies> \
--packages <package name> \
--py-files \
<application> <application args>
spark.default.parallelism :
• Only applicable to RDD.
• Default value set to the number of all cores on all nodes in a cluster.
• RDD wide transformations like reduceByKey(), groupByKey() and join() trigger data shuffling. Prior to using these
operations, use the code below to set the desired number of partitions for shuffle operations. Change the value accordingly.
spark.conf.set("spark.default.parallelism",150)
spark.conf.get("spark.default.parallelism")
Ex –
pyspark2 --master yarn --conf spark.default.parallelism=150
rdd = sc.parallelize(range(1000))
rdd.getNumPartitions()
150
[Diagram: Spark on YARN — the Driver talks to the Resource Manager; one Node Manager runs a container with the Application Master, while other Node Managers run containers with Executors and their tasks.]
Flow:
1. The client submits the Spark application. The driver instantiates a SparkContext.
2. The driver talks to the cluster manager (YARN) and negotiates resources.
3. The YARN Resource Manager searches for a Node Manager which will, in turn, launch an ApplicationMaster for the
specific job in a container.
4. The ApplicationMaster registers itself with the Resource Manager.
5. The ApplicationMaster negotiates containers for executors from the ResourceManager. It can request more
resources from the RM.
6. The ApplicationMaster notifies the Node Managers to launch the containers and executors. The executors then execute
the tasks.
7. The driver communicates with the executors to coordinate the processing of the tasks of an application.
8. Once the tasks are complete, the ApplicationMaster un-registers with the Resource Manager.
[Diagram: executor memory layout]
Heap (spark.executor.memory):
  Reserved Memory (reserved_system_memory_bytes = 300MB, ~7%)
  User Memory (~23%) — others (non-Spark data structures)
  Usable Memory:
    Execution Memory = Usable Memory * spark.memory.fraction * (1 – spark.memory.storageFraction)
    Storage Memory = Usable Memory * spark.memory.fraction * spark.memory.storageFraction
    There is no hard boundary between Execution and Storage memory.
Overhead (spark.executor.memoryOverhead): 10% of executor memory or a minimum of 384MB — daemon processes and other non-Spark overheads.
Off-Heap (spark.memory.offHeap.size): disabled by default.
• Off-heap Memory:
We had covered this in the RDD Persistence.
Off-Heap Memory is a segment of memory that lies outside the JVM, but is used by the JVM for certain use cases. Off-
heap memory can also be used by Spark explicitly to store serialized DataFrames and RDDs.
Spark may use off-heap memory for data-intensive applications.
The user can also persist data in off-heap memory using the persist method.
Off-heap storage is not managed by the JVM's GC mechanism and so must be explicitly handled by the
application.
It is disabled by default.
spark.memory.offHeap.enabled (Default False)
spark.memory.offHeap.size Can be set after enabling it.
• Overhead Memory:
Default - 10% of executor memory with a minimum of 384MB.
Can be set by property:
spark.executor.memoryOverhead
It basically covers expenses like VM overheads, interned strings, other native overheads, etc.
Example (spark.executor.memory = 4096 MB):
Usable Memory = 4096 – 300 (reserved) = 3796 MB
Execution Memory = Usable Memory * spark.memory.fraction * (1 – spark.memory.storageFraction) = 3796 * 0.75 * 0.5 = 1423.5 MB
Storage Memory = Usable Memory * spark.memory.fraction * spark.memory.storageFraction = 3796 * 0.75 * 0.5 = 1423.5 MB
Overhead = spark.executor.memoryOverhead = 10% of executor memory or a minimum of 384MB = 409.6 MB
Off-Heap = spark.memory.offHeap.size (disabled by default)
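A plain-Python sketch (not a Spark API) reproducing the arithmetic above for an executor launched with spark.executor.memory = 4g; note that spark.memory.fraction defaults to 0.6 in recent Spark versions, the 0.75 here matches the slide.
executor_memory_mb = 4 * 1024                      # spark.executor.memory
reserved_mb = 300                                  # reserved system memory
usable_mb = executor_memory_mb - reserved_mb       # 3796 MB
memory_fraction = 0.75                             # spark.memory.fraction (as used on the slide)
storage_fraction = 0.5                             # spark.memory.storageFraction
execution_mb = usable_mb * memory_fraction * (1 - storage_fraction)   # 1423.5 MB
storage_mb = usable_mb * memory_fraction * storage_fraction           # 1423.5 MB
overhead_mb = max(0.10 * executor_memory_mb, 384)                     # 409.6 MB
print(execution_mb, storage_mb, overhead_mb)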