My Pyspark Practice Notes
Introduction
PySpark, the Python library for Apache Spark, has gained immense popularity among data
engineers and data scientists due to its simplicity and power in handling big data tasks.
This blog post will provide a comprehensive understanding of the PySpark entry point, the
SparkSession. We’ll explore the concepts, features, and the use of SparkSession to set up a
PySpark application effectively.
What is SparkSession?
SparkSession is the entry point for any PySpark application, introduced in Spark 2.0 as a unified
API to replace the need for separate SparkContext, SQLContext, and HiveContext.
The SparkSession is responsible for coordinating various Spark functionalities and provides a
simple way to interact with structured and semi-structured data, such as reading and writing
data from various formats, executing SQL queries, and utilizing built-in functions for data
manipulation.
Simplified API: SparkSession unifies the APIs of SparkContext, SQLContext, and HiveContext,
making it easier for developers to interact with Spark’s core features without switching
between multiple contexts.
Configuration management: You can easily configure a SparkSession by setting various options,
such as the application name, the master URL, and other configurations.
Access to Spark ecosystem: SparkSession allows you to interact with the broader Spark
ecosystem, such as DataFrames, Datasets, and MLlib, enabling you to build powerful data
processing pipelines.
Improved code readability: By encapsulating multiple Spark contexts, SparkSession helps you
write cleaner and more maintainable code.
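To make the single-entry-point idea concrete, here is a minimal sketch (the application name and sample rows are placeholders) showing one SparkSession handling DataFrame creation, SQL, and access to the underlying SparkContext:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnifiedAPIDemo").getOrCreate()

# One session handles DataFrame creation, SQL, and file I/O
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.createOrReplaceTempView("users")
spark.sql("SELECT name FROM users WHERE id = 1").show()

# The old contexts are still reachable when legacy code needs them
print(spark.sparkContext.appName)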
Creating a SparkSession
To create a SparkSession, we first need to import the necessary PySpark modules and classes.
Here’s a simple example:
from pyspark.sql import SparkSession

# The application name below is illustrative
spark = SparkSession.builder \
    .appName("MyPySparkApp") \
    .master("local[*]") \
    .getOrCreate()
In this example, we import the SparkSession class from the pyspark.sql module and use the
builder method to configure the application name and master URL. The getOrCreate() method
is then used to either get the existing SparkSession or create a new one if none exists.
The SparkSession.builder object provides various functions to configure the SparkSession before
creating it. Some of the important functions are:
appName(name): Sets the application name, which will be displayed in the Spark web user
interface.
master(masterURL): Sets the URL of the cluster manager (like YARN, Mesos, or standalone) that
Spark will connect to. You can also set it to “local” or “local[N]” (where N is the number of
cores) for running Spark locally.
config(key, value): Sets a configuration property with the specified key and value. You can use
this method multiple times to set multiple configuration properties.
config(conf): Sets the Spark configuration object (SparkConf) to be used for building the
SparkSession.
getOrCreate(): Retrieves an existing SparkSession or, if there is none, creates a new one based
on the options set via the builder.
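The config(conf) variant is not demonstrated elsewhere in these notes; a minimal sketch, assuming illustrative application name and property values:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Build a SparkConf up front, then hand it to the builder via config(conf=...)
conf = SparkConf() \
    .setAppName("ConfDrivenApp") \
    .setMaster("local[2]") \
    .set("spark.sql.shuffle.partitions", "8")

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.conf.get("spark.sql.shuffle.partitions"))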
In PySpark, you can technically create multiple SparkSession instances, but it is not
recommended. The standard practice is to use a single SparkSession per application.
SparkSession is designed to be a singleton, which means that only one instance should be active
in the application at any given time. If you need an isolated session (for example, with its own
temporary views and SQL configuration), you can derive one from the existing session instead:

spark2 = spark.newSession()
print(spark2)
Configuring a SparkSession
You can configure a SparkSession with various settings, such as the number of executor cores,
executor memory, driver memory, and more. Here's an example:
spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()
In this example, we’ve added three additional configurations for executor memory, executor
cores, and driver memory using the config() method.
Accessing SparkSession Components
# Access the underlying SparkContext
spark_context = spark.sparkContext

# Access the legacy SQLContext (internal attribute, available in older PySpark versions;
# in modern code the SparkSession itself is used for SQL)
sql_context = spark._wrapped

# A DataFrame can be written out directly through its DataFrameWriter, e.g.
# (data_frame is assumed to already exist):
data_frame.write.parquet("path/to/output/parquet-file")

With SparkSession, you can also execute SQL queries directly on your data. Here's an example:

# Register the DataFrame as a temporary view and query it with SQL
data_frame.createOrReplaceTempView("my_table")
result = spark.sql("SELECT * FROM my_table")
result.show()
spark = SparkSession.builder \
    .appName("Counting") \
    .master("local[*]") \
    .getOrCreate()

# Sample one-column DataFrame of words (the original word list was not captured in these notes)
words = spark.createDataFrame([("spark",), ("python",), ("spark",)], ["Word"])

# Perform the word count
word_count = words.groupBy("Word").count()
word_count.show()

spark.stop()
Note -
explode() returns a new row for each element of an array (or each key/value pair of a map) column.
If the array is empty or null, explode() simply drops that row and moves on to the next one (use
explode_outer() if you want to keep such rows as nulls). It is typically called inside the select()
method. Similarly, other array functions such as array_min() can be used inside select() to return
a single value per array.
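A minimal sketch of explode() versus explode_outer() (the column names and sample rows are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, explode_outer, array_min

spark = SparkSession.builder.appName("ExplodeDemo").getOrCreate()

df = spark.createDataFrame(
    [("a", [1, 2, 3]), ("b", []), ("c", None)],
    "id string, values array<int>",
)

# explode() drops the rows whose array is empty or null
df.select("id", explode("values").alias("value")).show()

# explode_outer() keeps them, emitting null for the exploded value
df.select("id", explode_outer("values").alias("value")).show()

# array_min() returns a single value per array
df.select("id", array_min("values").alias("min_value")).show()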
Read and Write files using PySpark – Multiple ways to Read and Write data using PySpark
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .getOrCreate()

# Sample data matching the output below
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()
+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
+-----+---+
csv_file = "path/to/your/csv/file.csv"
df_csv = spark.read.csv(csv_file, header=True, inferSchema=True)
output_path = "path/to/output/csv/file.csv"
df_csv.write.csv(output_path, header=True, mode="overwrite")
json_file = "path/to/your/json/file.json"
df_json = spark.read.json(json_file)
output_path = "path/to/output/json/file.json"
df_json.write.json(output_path, mode="overwrite")
To read a Parquet file using PySpark, you can use the read.parquet() method:
parquet_file = "path/to/your/parquet/file.parquet"
df_parquet = spark.read.parquet(parquet_file)
To write the data back to a Parquet file, use the write.parquet() method:
output_path = "path/to/output/parquet/file.parquet"
df_parquet.write.parquet(output_path, mode="overwrite")
Creating a SQL Table in PySpark
# Sample data (rows and query are illustrative; the originals were not captured in these notes)
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.createOrReplaceTempView("People")

query = "SELECT name, age FROM People WHERE age > 30"
result_df = spark.sql(query)
result_df.show()
import pandas as pd

# Sample pandas DataFrame (rows are illustrative; only the column names survive in these notes)
pandasData = pd.DataFrame({
    "name": ["Alice", "Bob", "Cathy"],
    "age": [30, 45, 29],
    "city": ["New York", "San Francisco", "Chicago"],
})

# Creating the PySpark DataFrame from pandas
sparkdf = spark.createDataFrame(pandasData)
sparkdf.printSchema()
sparkdf.show()
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)

+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
The show() function is a method available for DataFrames in PySpark. It is used to display
the contents of a DataFrame in a tabular format, making it easier to visualize and understand
the data.
Syntax:
DataFrame.show(n=20, truncate=True, vertical=False)
Parameters:
n: The number of rows to display. The default value is 20.
truncate: If set to True, the column content will be truncated if it is too long. The default value
is True.
vertical: If set to True, the output will be displayed vertically. The default value is False.
Example 1-
import findspark
findspark.init()

from pyspark.sql import SparkSession

# The application name is illustrative; the data matches the output below
spark = SparkSession.builder \
    .appName("show() Example") \
    .getOrCreate()

data = [("Alice", 34), ("Bob", 45), ("Charlie", 29), ("David", 31)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 29|
|  David| 31|
+-------+---+
df.show(2)
+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
| Bob| 45|
+-----+---+
df.show(truncate = False)
+-------+---+
|Name   |Age|
+-------+---+
|Alice  |34 |
|Bob    |45 |
|Charlie|29 |
|David  |31 |
+-------+---+
df.show(vertical=True)
-RECORD 0-------
Name | Alice
Age | 34
-RECORD 1-------
Name | Bob
Age | 45
-RECORD 2-------
Name | Charlie
Age | 29
-RECORD 3-------
Name | David
Age | 31
Sample rows from sales_data.csv (the header row was not captured in these notes; the column names
below are assumed):

SaleID,ProductID,Quantity,Price,Date
1,101,3,100,2023-01-01
4,103,5,50,2023-01-04

csv_file = "sales_data.csv"
df = spark.read.option("header", "true").option("inferSchema", "true").csv(csv_file)
df.createOrReplaceTempView("sales_data")
query = """
SELECT ProductID,
       SUM(Quantity * Price) AS TotalRevenue  -- aggregation reconstructed from the output below
FROM sales_data
GROUP BY ProductID
"""
result = spark.sql(query)
result.show()
+---------+------------+
|ProductID|TotalRevenue|
+---------+------------+
|      101|         500|
|      102|         200|
|      103|         250|
+---------+------------+
query = """
SELECT ProductID,
       SUM(Quantity * Price) AS TotalRevenue
FROM sales_data
GROUP BY ProductID
ORDER BY TotalRevenue
LIMIT 2
"""
resultDF = spark.sql(query)
resultDF.show()
+---------+------------+
|ProductID|TotalRevenue|
+---------+------------+
|      101|         500|
|      102|         200|
+---------+------------+
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
import databricks.koalas as ks

spark = SparkSession.builder \
    .appName("Koalas Example") \
    .getOrCreate()

sales_data = ks.read_csv("sales_data.csv")

# Calculate the average revenue per unit sold and add it as a new column
sales_data['Avg_Revenue_per_Unit'] = sales_data['Revenue'] / sales_data['Unit_Sold']

# Summary statistics per product (reconstruction; the original aggregation step was not captured)
summary_stats = sales_data.groupby("ProductID").agg({"Revenue": "sum", "Unit_Sold": "sum"}).reset_index()
sorted_summary_stats = summary_stats.sort_values(by="Revenue", ascending=False)

sorted_summary_stats.to_csv("summary_stats.csv", index=False)
import findspark
findspark.init()

from pyspark.sql import SparkSession

# The application name is illustrative; the data matches the output tables below
spark = SparkSession.builder \
    .appName("Select Columns Example") \
    .master("local[*]") \
    .getOrCreate()

data = [("Alice", 34, "Female"), ("Bob", 45, "Male"),
        ("Charlie", 28, "Male"), ("Diana", 39, "Female")]
columns = ["Name", "Age", "Gender"]

df = spark.createDataFrame(data, columns)
df.show()

+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 34|Female|
|    Bob| 45|  Male|
|Charlie| 28|  Male|
|  Diana| 39|Female|
+-------+---+------+
# Select columns by name
selectdf1 = df.select("Name", "Age")
selectdf1.show()

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 28|
|  Diana| 39|
+-------+---+
# Select columns using col() expressions (reconstruction; one common alternative to plain strings)
from pyspark.sql.functions import col
selectdf2 = df.select(col("Name"), col("Age"))
selectdf2.show()

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 28|
|  Diana| 39|
+-------+---+
# Select columns using DataFrame indexing
name_df = df["Name"]          # this is a Column object, not a DataFrame
selected_df3 = df.select(df["Name"], df["Age"])
selected_df3.show()

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 28|
|  Diana| 39|
+-------+---+
Select Columns using index
column_indices = [0, 2]
# Map the indices to column names (this step was missing from the notes)
selected_columns = [df.columns[i] for i in column_indices]
selected_df4 = df.select(selected_columns)
selected_df4.show()

+-------+------+
|   Name|Gender|
+-------+------+
|  Alice|Female|
|    Bob|  Male|
|Charlie|  Male|
|  Diana|Female|
+-------+------+
To select specific columns while adding or removing columns, you can use the
'withColumn' function to add a new column and the 'drop' function to remove a
column.

# Add a derived column and drop one (reconstruction consistent with the headers below)
from pyspark.sql.functions import col
selectDF = df.withColumn("IsAdult", col("Age") >= 18).drop("Gender")
selectDF.show()

+-------+---+-------+
|   Name|Age|IsAdult|
+-------+---+-------+

You can also select columns using the 'selectExpr' function. This is useful when you want to
perform operations on columns while selecting them.

selectDF = df.selectExpr("Name", "Age", "Age >= 18 AS IsAdult")
selectDF.show()

+-------+---+-------+
|   Name|Age|IsAdult|
+-------+---+-------+
PySpark withColumn
Syntax
DataFrame.withColumn(colName, col)
where:
colName: The name of the new or existing column you want to add, replace, or update.
col: A Column expression that produces the values for the new or updated column.
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# The application name is illustrative
spark = SparkSession.builder \
    .appName("withColumn Example") \
    .getOrCreate()

# Create a DataFrame (rows match the output below)
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
+---+-------+---+
1. Renaming a column.

# Copy "age" into a new "years" column, then drop the original
df = df.withColumn("years", col("age"))
df = df.drop("age")
df.show()

+---+-------+-----+
| id|   name|years|
+---+-------+-----+
|  1|  Alice|   25|
|  2|    Bob|   30|
|  3|Charlie|   35|
+---+-------+-----+
2. Adding a derived column with expr().

from pyspark.sql.functions import expr

# Derive a "months" column from "years" (the exact expression was not captured; reconstructed from the header below)
df = df.withColumn("months", expr("years * 12"))
df.show()

+---+-------+-----+------+
| id|   name|years|months|
+---+-------+-----+------+

3. Changing a column's data type with cast().

from pyspark.sql.types import StringType

df = df.withColumn("id", col("id").cast(StringType()))
df.show()

+---+-------+-----+------+
| id|   name|years|months|
+---+-------+-----+------+
4. Adding a conditional column with when().

from pyspark.sql.functions import when

data = [(1, "Alice", 25, 45000), (2, "Bob", 30, 55000), (3, "Charlie", 35, 60000)]
columns = ["id", "name", "age", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

+---+-------+---+------+
| id|   name|age|salary|
+---+-------+---+------+

# Add a conditional "tax" column with when() (the exact rule was not captured in these notes)
df = df.withColumn("tax", when(col("salary") > 50000, col("salary") * 0.10)
                          .otherwise(col("salary") * 0.05))
df.show()

+---+-------+---+------+------+
| id|   name|age|salary|   tax|
+---+-------+---+------+------+
5. Using a User Defined Function (UDF) with withColumn.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Age thresholds reconstructed; the original cut-offs were not captured in these notes
def age_group(age):
    if age < 30:
        return "Young"
    elif age < 50:
        return "MiddleAged"
    else:
        return "Old"

age_group_udf = udf(age_group, StringType())

df = df.withColumn("age_group", age_group_udf(col("age")))
df.show()

+---+-------+---+------+------+-----------+
| id|   name|age|salary|   tax|  age_group|
+---+-------+---+------+------+-----------+
# Creating the new net_salary column by subtracting the tax column from salary
df = df.withColumn("net_salary", col("salary") - col("tax"))
df.show()

+---+-------+---+------+------+-----------+----------+
| id|   name|age|salary|   tax|  age_group|net_salary|
+---+-------+---+------+------+-----------+----------+
We will use the "concat_ws" function, which allows us to concatenate multiple columns
with a specified delimiter.

from pyspark.sql.functions import concat_ws

# Combine name and age_group into one column (delimiter reconstructed from the header below)
df = df.withColumn("name_age_group", concat_ws(" - ", col("name"), col("age_group")))
df.show()

+---+-------+---+------+------+-----------+----------+--------------------+
| id|   name|age|salary|   tax|  age_group|net_salary|      name_age_group|
+---+-------+---+------+------+-----------+----------+--------------------+
PySpark Drop Columns

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Drop Columns Example") \
    .getOrCreate()

# Sample data (ages and genders other than Alice's are illustrative; only fragments survive in these notes)
data = [("Alice", 30, "New York", "Female"),
        ("Bob", 35, "San Francisco", "Male"),
        ("Cathy", 28, "Los Angeles", "Female"),
        ("David", 40, "Chicago", "Male")]
columns = ["name", "age", "city", "gender"]

df = spark.createDataFrame(data, columns)
df.show()

+-----+---+-------------+------+
| name|age|         city|gender|
+-----+---+-------------+------+
The drop() function can be used to remove a single column from a DataFrame.
Syntax:
DataFrame.drop(*cols)

df = df.drop("gender")
df.show()

+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 35|San Francisco|
|Cathy| 28|  Los Angeles|
|David| 40|      Chicago|
+-----+---+-------------+
The drop() function can also remove multiple columns from a DataFrame. Simply pass the column
names, or unpack a list of names, to the function.

df = df.drop("age", "gender")
df.show()

dropping_column_names = ["age", "gender"]
df = df.drop(*dropping_column_names)
df.show()

+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
if "gender" in df.columns:
df = df.drop("gender")
df.show()
+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 35|San Francisco|
|Cathy| 28|  Los Angeles|
|David| 40|      Chicago|
+-----+---+-------------+
You can also use the drop() function in combination with a regular expression (regex) pattern to
drop all columns matching the pattern.

import re

regex_pattern = "gender|age"
# Collect the matching column names, then drop them (this step was missing from the notes)
columns_to_drop = [c for c in df.columns if re.match(regex_pattern, c)]
df = df.drop(*columns_to_drop)
df.show()

+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
PySpark Rename Columns
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Rename Columns Example") \
    .getOrCreate()

# Sample data (rows are illustrative; only the column names survive in these notes)
data = [("Alice", 30, "New York"),
        ("Bob", 35, "San Francisco"),
        ("Cathy", 28, "Los Angeles")]

sample_df = spark.createDataFrame(data, ["name", "age", "city"])
sample_df.show()

+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
# 1. Rename a single column with withColumnRenamed()
renamed_df = sample_df.withColumnRenamed("age", "user_age")
renamed_df.show()

+-----+--------+-------------+
| name|user_age|         city|
+-----+--------+-------------+

# 2. Rename a column while selecting, using selectExpr() (reconstruction; the original variant was not captured)
renamed_df = sample_df.selectExpr("name", "age AS user_age", "city")
renamed_df.show()

+-----+--------+-------------+
| name|user_age|         city|
+-----+--------+-------------+

# 3. Rename all columns at once with toDF()
renamed_df = sample_df.toDF("user_name", "user_age", "user_city")
renamed_df.show()

+---------+--------+-------------+
|user_name|user_age|    user_city|
+---------+--------+-------------+

# 4. Chain withColumnRenamed() calls to rename several columns
renamed_df = sample_df.withColumnRenamed("name", "user_name") \
    .withColumnRenamed("age", "user_age") \
    .withColumnRenamed("city", "user_city")
renamed_df.show()

+---------+--------+-------------+
|user_name|user_age|    user_city|
+---------+--------+-------------+
PySpark Filter vs Where – Filter Rows from a PySpark DataFrame
Apache PySpark is the Python API for Apache Spark, a popular open-source distributed data
processing engine. Spark itself offers high-level APIs in Python, Scala, Java, and R for handling
large-scale data processing tasks.
One of the most common tasks when working with PySpark DataFrames is filtering rows based
on certain conditions. In this blog post, we’ll discuss different ways to filter rows in PySpark
DataFrames, along with code examples for each method.
Different ways to filter rows in PySpark DataFrames
Before we dive into filtering rows, let’s quickly review some basics of PySpark DataFrames. To
work with PySpark DataFrames, we first need to import the necessary modules and create a
SparkSession
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Filter Rows Example") \
    .getOrCreate()

# Sample data matching the output below
data = [
    (1, "Alice", 30),
    (2, "Bob", 25),
    (3, "Charlie", 35),
    (4, "David", 28),
]
columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)
df.show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 30|
| 2| Bob| 25|
| 3|Charlie| 35|
| 4| David| 28|
+---+-------+---+
The filter function is one of the most straightforward ways to filter rows in a PySpark
DataFrame. It takes a boolean expression as an argument and returns a new DataFrame
containing only the rows that satisfy the condition.
# Keep only the rows where age is at least 30 (condition reconstructed from the output below)
filtered_df = df.filter(df.age >= 30)
filtered_df.show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 30|
| 3|Charlie| 35|
+---+-------+---+
The where function is an alias for the ‘filter’ function and can be used interchangeably. It also
takes a boolean expression as an argument and returns a new DataFrame containing only the
rows that satisfy the condition.
filtered_df = df.where(df.age >= 30)
filtered_df.show()

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  3|Charlie| 35|
+---+-------+---+
You can also register the DataFrame as a temporary view and filter it with a SQL WHERE clause:

df.createOrReplaceTempView("people")
# The exact condition was not captured in these notes; this one reproduces the output below
filtered_df = spark.sql("SELECT * FROM people WHERE age < 28")
filtered_df.show()

+---+----+---+
| id|name|age|
+---+----+---+
|  2| Bob| 25|
+---+----+---+
combine multiple filter conditions using the ‘&’ (and), ‘|’ (or), and ‘~’ (not) operators. Make
sure to use parentheses to separate different conditions, as it helps maintain the correct order
of operations.
# Combine conditions with & and ~ (the exact conditions were not captured; these reproduce the output)
filtered_df = df.filter((df.age >= 28) & ~(df.name == "David"))
filtered_df.show()

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  3|Charlie| 35|
+---+-------+---+
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, asc, desc

# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySpark orderBy() and sort() Example") \
    .getOrCreate()

# Sample data (only Bob's row survives in these notes; the other rows are illustrative)
data = [
    ("Alice", 34, "New York"),
    ("Bob", 28, "San Francisco"),
    ("Charlie", 31, "Los Angeles"),
    ("David", 26, "Chicago"),
]
columns = ["Name", "Age", "City"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)
df.show()

+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
orderBy() function
The orderBy() function in PySpark is used to sort a DataFrame based on one or more
columns. It takes one or more columns as arguments and returns a new DataFrame sorted by
the specified columns.
Syntax:
DataFrame.orderBy(*cols, ascending=True)
The sort() function is an alias of orderBy() and has the same functionality. The syntax and
parameters are identical to orderBy().
Syntax:
DataFrame.sort(*cols, ascending=True)
There is no functional difference between orderBy() and sort() in PySpark. The sort() function is
simply an alias for orderBy().
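The ascending parameter shown in both signatures accepts a single boolean or a list of booleans (one per column); a quick sketch using the df created above:

# Descending sort on a single column
df.orderBy("Age", ascending=False).show()

# Mixed directions: Age descending, then City ascending
df.sort(["Age", "City"], ascending=[False, True]).show()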
sorted_by_age = df.orderBy("Age")
sorted_by_age.show()
Name|Age| City|
+-------+---+-------------+
# Sort by multiple columns using orderBy()
sorted_by_age_and_city = df.orderBy("Age", "City")
sorted_by_age_and_city.show()
Name|Age| City|
+-------+---+-------------+
sorted_by_age = df.sort("Age")
sorted_by_age.show()
Name|Age| City|
+-------+---+-------------+
# Sort by multiple columns using sort()
sorted_by_age_and_city = df.sort("Age", "City")
sorted_by_age_and_city.show()
Name|Age| City|
+-------+---+-------------+
sorted_by_age_desc_expr = df.orderBy(desc("Age"))
sorted_by_age_desc_expr.show()
Name|Age| City|
+-------+---+-------------+
|    Bob| 28|San Francisco|
sorted_by_city_asc_expr = df.sort(asc("City"))
sorted_by_city_asc_expr.show()
Name|Age| City|
+-------+---+-------------+
# Sample data containing NULLs (rows are illustrative; the original data was not captured)
data_with_nulls = [
    ("Alice", 34, "New York"),
    ("Bob", None, "San Francisco"),
    ("Charlie", 31, None),
]
df_nulls = spark.createDataFrame(data_with_nulls, ["Name", "Age", "City"])

# Sort the DataFrame with NULL values in Age column (NULLs appear last)
sorted_with_nulls = df_nulls.orderBy(col("Age").asc_nulls_last())
sorted_with_nulls.show()

# Sort the DataFrame with NULL values in City column (NULLs appear first)
sorted_with_nulls_alt = df_nulls.orderBy(col("City").asc_nulls_first())
sorted_with_nulls_alt.show()
# Create a custom sorting column (the city_order list is reconstructed; the original was not captured)
from pyspark.sql.functions import when

city_order = ["New York", "San Francisco", "Los Angeles", "Chicago"]
custom_sort_col = when(col("City") == city_order[0], 0) \
    .when(col("City") == city_order[1], 1) \
    .when(col("City") == city_order[2], 2) \
    .when(col("City") == city_order[3], 3) \
    .otherwise(4)

sorted_by_custom_order = df.orderBy(custom_sort_col)
sorted_by_custom_order.show()

+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
PySpark GroupBy()
PySpark GroupBy is a powerful operation that allows you to perform aggregations on your data.
It groups the rows of a DataFrame based on one or more columns and then applies an
aggregation function to each group. Common aggregation functions include sum, count, mean,
min, and max.
Syntax:
DataFrame.groupBy("column_name").agg(aggregation_function)
Commonly used aggregation functions include sum(), count(), avg(), min(), and max(), all available
from pyspark.sql.functions.

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, col

spark = SparkSession.builder \
    .appName("GroupBy Example") \
    .getOrCreate()

# Sample sales data (reconstructed so that the aggregated outputs below match;
# only the last row survives in these notes)
data = [
    (1001, "Laptop", "Electronics", 1, 1700, "2023-01-01"),
    (1002, "Mouse", "Electronics", 2, 30, "2023-01-02"),
    (1003, "Laptop", "Electronics", 1, 500, "2023-01-03"),
    (1004, "Mouse", "Electronics", 3, 50, "2023-01-04"),
    (1005, "Smartphone", "Electronics", 1, 700, "2023-01-05"),
]
columns = ["OrderID", "Product", "Category", "Quantity", "Price", "OrderDate"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()

+-------+----------+-----------+--------+-----+----------+
|OrderID|   Product|   Category|Quantity|Price| OrderDate|
+-------+----------+-----------+--------+-----+----------+
|   1001|    Laptop|Electronics|       1| 1700|2023-01-01|
|   1002|     Mouse|Electronics|       2|   30|2023-01-02|
|   1003|    Laptop|Electronics|       1|  500|2023-01-03|
|   1004|     Mouse|Electronics|       3|   50|2023-01-04|
|   1005|Smartphone|Electronics|       1|  700|2023-01-05|
+-------+----------+-----------+--------+-----+----------+
result = df.groupBy("Product").agg(sum("Price").alias("Total_Sales"))
# Show results
result.show()
+----------+-----------+
|   Product|Total_Sales|
+----------+-----------+
|    Laptop|       2200|
|     Mouse|         80|
|Smartphone|        700|
+----------+-----------+
# GroupBy multiple columns and aggregate
result = df.groupBy("Product", "Category") \
    .agg(sum("Price").alias("Total_Sales"))
# Show results
result.show()
+----------+-----------+-----------+
|   Product|   Category|Total_Sales|
+----------+-----------+-----------+
|    Laptop|Electronics|       2200|
|     Mouse|Electronics|         80|
|Smartphone|Electronics|        700|
+----------+-----------+-----------+
result = df.groupBy("Product") \
.agg(sum("Price").alias("Total_Sales"),
sum("Quantity").alias("Total_Quantity"))
# Show results
result.show()
+----------+-----------+--------------+
|   Product|Total_Sales|Total_Quantity|
+----------+-----------+--------------+
|    Laptop|       2200|             2|
|     Mouse|         80|             5|
|Smartphone|        700|             1|
+----------+-----------+--------------+
You can use a combination of groupBy() and where() to perform a groupBy operation with a
condition on the aggregated result (used after agg(), where() plays the role of the SQL HAVING clause).
result = df.groupBy("Product") \
.agg(avg("Price").alias("Total_Sales"),
sum("Quantity").alias("Total_Quantity")) \
.where(col("Total_Quantity") >= 2)
# Show results
result.show()
+-------+-----------+--------------+
|Product|Total_Sales|Total_Quantity|
+-------+-----------+--------------+
| Laptop|     1100.0|             2|
|  Mouse|       40.0|             5|
+-------+-----------+--------------+
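The other aggregation functions mentioned earlier (count, min, max) follow the same pattern; a quick sketch against the same df, assuming the reconstructed sales data above (the aliases are illustrative):

from pyspark.sql.functions import count, min, max

df.groupBy("Product").agg(
    count("*").alias("Num_Orders"),    # number of rows per product
    min("Price").alias("Min_Price"),   # cheapest sale
    max("Price").alias("Max_Price"),   # most expensive sale
).show()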
You can also use a pandas UDF as a custom (grouped) aggregation function:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

# Grouped-aggregate pandas UDF computing the median (the function name and grouping are reconstructed)
@pandas_udf(FloatType())
def median_price(column: pd.Series) -> float:
    return float(column.median())

result = df.groupBy("Category").agg(median_price("Price").alias("Median_Price"))
result.show()

+-----------+------------+
|   Category|Median_Price|
+-----------+------------+
|Electronics|       500.0|
+-----------+------------+
PySpark Joins
Types of Joins
Inner Join
Left Join
Right Join
Full Outer Join
Left Semi Join
Left Anti Join
Cross Join
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("PySpark Joins Example") \
    .getOrCreate()

# Sample DataFrames (values reconstructed from the join outputs below)
df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])

# Perform inner join: keeps only the ids present in both DataFrames
result = df1.join(df2, on="id", how="inner")

# Show result
result.show()

+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
+---+------+------+
# Perform full outer join: keeps all ids from both DataFrames
result = df1.join(df2, on="id", how="outer")

# Show result
result.show()

+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
|  3|     C|  null|
|  4|  null|     Z|
+---+------+------+
# Perform left join: keeps all rows from the left DataFrame
result = df1.join(df2, on="id", how="left")

# Show result
result.show()

+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  3|     C|  null|
|  2|     B|     Y|
+---+------+------+
# Perform right join: keeps all rows from the right DataFrame
result = df1.join(df2, on="id", how="right")

# Show result
result.show()

+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
|  4|  null|     Z|
+---+------+------+
A left semi join returns only the columns from the left dataframe for the rows with matching
keys in both dataframes. It is similar to an inner join but only returns the columns from the left
dataframe.
# Perform left semi join
result = df1.join(df2, on="id", how="left_semi")

# Show result
result.show()

+---+------+
| id|value1|
+---+------+
|  1|     A|
|  2|     B|
+---+------+
A left anti join returns the rows from the left dataframe that do not have matching keys in the
right dataframe. It is the opposite of a left semi join.
# Perform left anti join
result = df1.join(df2, on="id", how="left_anti")

# Show result
result.show()

+---+------+
| id|value1|
+---+------+
|  3|     C|
+---+------+
Cross Join
A cross join, also known as a cartesian join, returns the cartesian product of both dataframes. It
combines each row from the left dataframe with each row from the right dataframe.
result = df1.crossJoin(df2)
# Show result
result.show()
+---+------+---+------+
| id|value1| id|value2|
+---+------+---+------+
| 1| A| 1| X|
| 1| A| 2| Y|
| 1| A| 4| Z|
| 2| B| 1| X|
| 2| B| 2| Y|
| 2| B| 4| Z|
| 3| C| 1| X|
| 3| C| 2| Y|
|  3|     C|  4|     Z|
+---+------+---+------+
PySpark Union
PySpark Union is an operation that allows you to combine two or more DataFrames
with the same schema, creating a single DataFrame containing all rows from the input
DataFrames.
It’s important to note that the Union operation doesn’t eliminate duplicate rows, so you may
need to use the distinct() function afterward if you want to remove duplicates.
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder \
    .appName("PySpark Union Example") \
    .getOrCreate()

schema = StructType([
    StructField("product", StringType(), True),
    StructField("price", IntegerType(), True),
    StructField("quantity", IntegerType(), True),
])

# Sample DataFrames (rows reconstructed from the union output below)
df_A = spark.createDataFrame([("apple", 3, 5), ("banana", 1, 10), ("orange", 2, 8)], schema)
df_B = spark.createDataFrame([("apple", 3, 5), ("banana", 1, 15), ("grape", 4, 6)], schema)

df_union = df_A.union(df_B)
df_union.show()
+-------+-----+--------+
|product|price|quantity|
+-------+-----+--------+
| apple| 3| 5|
| banana| 1| 10|
| orange| 2| 8|
| apple| 3| 5|
| banana| 1| 15|
|  grape|    4|       6|
+-------+-----+--------+
As noted above, union() does not eliminate duplicate rows; to remove them, chain distinct()
after the union:
df_union_dist = df_A.union(df_B).distinct()
df_union_dist.show()
+-------+-----+--------+
|product|price|quantity|
+-------+-----+--------+
|  apple|    3|       5|
| banana|    1|      10|
| orange|    2|       8|
| banana|    1|      15|
|  grape|    4|       6|
+-------+-----+--------+
# A third DataFrame (rows reconstructed from the output below)
df_C = spark.createDataFrame([("apple", 3, 10), ("banana", 1, 20), ("grape", 4, 10), ("orange", 2, 7)], schema)

df_union_all = df_A.union(df_B).union(df_C)
df_union_all.show()
+-------+-----+--------+
|product|price|quantity|
+-------+-----+--------+
| apple| 3| 5|
| banana| 1| 10|
| orange| 2| 8|
| apple| 3| 5|
| banana| 1| 15|
| grape| 4| 6|
| apple| 3| 10|
| banana| 1| 20|
| grape| 4| 10|
| orange|    2|       7|
+-------+-----+--------+