Nityanand Singh – PySpark Practice Notes

Contents:

Spark Session (SparkSession entry point, configuration, components, SQL)
Read and Write files using PySpark
DataFrame show()
Run SQL Queries with PySpark
PySpark Pandas API
Select columns in PySpark DataFrame
PySpark withColumn
PySpark Drop Columns
PySpark Rename Columns
PySpark Filter vs Where
PySpark orderBy() and sort()
PySpark GroupBy()
PySpark Joins
PySpark Union

What is SparkSession – PySpark Entry Point, Dive into SparkSession


Introduction

PySpark, the Python library for Apache Spark, has gained immense popularity among data
engineers and data scientists due to its simplicity and power in handling big data tasks.

This blog post will provide a comprehensive understanding of the PySpark entry point, the
SparkSession. We’ll explore the concepts, features, and the use of SparkSession to set up a
PySpark application effectively.

What is SparkSession?

SparkSession is the entry point for any PySpark application, introduced in Spark 2.0 as a unified
API to replace the need for separate SparkContext, SQLContext, and HiveContext.

The SparkSession is responsible for coordinating various Spark functionalities and provides a
simple way to interact with structured and semi-structured data, such as reading and writing
data from various formats, executing SQL queries, and utilizing built-in functions for data
manipulation.

SparkSession offers several benefits that make it an essential component of PySpark applications:

Simplified API: SparkSession unifies the APIs of SparkContext, SQLContext, and HiveContext,
making it easier for developers to interact with Spark’s core features without switching
between multiple contexts.

Configuration management: You can easily configure a SparkSession by setting various options,
such as the application name, the master URL, and other configurations.

Access to Spark ecosystem: SparkSession allows you to interact with the broader Spark
ecosystem, such as DataFrames, Datasets, and MLlib, enabling you to build powerful data
processing pipelines.

Improved code readability: By encapsulating multiple Spark contexts, SparkSession helps you
write cleaner and more maintainable code.

Creating a SparkSession

To create a SparkSession, we first need to import the necessary PySpark modules and classes.
Here’s a simple example:

from pyspark.sql import SparkSession

spark = SparkSession.builder \

.appName("My PySpark Application") \

.master("local[*]") \

.getOrCreate()

In this example, we import the SparkSession class from the pyspark.sql module and use the
builder method to configure the application name and master URL. The getOrCreate() method
is then used to either get the existing SparkSession or create a new one if none exists.

The SparkSession.builder object provides various functions to configure the SparkSession before
creating it. Some of the important functions are:

appName(name): Sets the application name, which will be displayed in the Spark web user
interface.

master(masterURL): Sets the URL of the cluster manager (like YARN, Mesos, or standalone) that
Spark will connect to. You can also set it to “local” or “local[N]” (where N is the number of
cores) for running Spark locally.

config(key, value): Sets a configuration property with the specified key and value. You can use
this method multiple times to set multiple configuration properties.

config(conf): Sets the Spark configuration object (SparkConf) to be used for building the
SparkSession.

enableHiveSupport(): Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions (UDFs).

getOrCreate(): Retrieves an existing SparkSession or, if there is none, creates a new one based on the options set via the builder.
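For example, a builder configured from a SparkConf object with Hive support enabled could look like this (a minimal sketch; the application name and the setting shown are just placeholders):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Build a SparkConf and hand it to the builder via config(conf=...)
conf = SparkConf() \
    .setAppName("Hive Enabled App") \
    .set("spark.sql.shuffle.partitions", "8")

spark = SparkSession.builder \
    .config(conf=conf) \
    .enableHiveSupport() \
    .getOrCreate()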

How many PySpark sessions can be created?

In PySpark, you can technically create multiple SparkSession instances, but it is not
recommended. The standard practice is to use a single SparkSession per application.

SparkSession is designed to be a singleton, which means that only one instance should be active in the application at any given time.

# newSession() creates an additional session that shares the same SparkContext
spark2 = spark.newSession()

print(spark2)
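To see the singleton behaviour, a small sketch (assuming the spark and spark2 sessions created above) can compare the objects:

# getOrCreate() hands back the already-active session
spark3 = SparkSession.builder.getOrCreate()
print(spark3 is spark)    # True - same singleton instance

# newSession() returns a different session object...
print(spark2 is spark)    # False

# ...but both sessions share the same underlying SparkContext
print(spark2.sparkContext is spark.sparkContext)    # True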

Configuring a SparkSession

You can configure a SparkSession with various settings, such as the number of executor cores, executor memory, driver memory, and more. Here’s an example:

spark = SparkSession.builder \

.appName("My PySpark Application") \

.master("local[*]") \

.config("spark.executor.memory", "4g") \

.config("spark.executor.cores", "4") \

.config("spark.driver.memory", "2g") \

.getOrCreate()

In this example, we’ve added three additional configurations for executor memory, executor
cores, and driver memory using the config() method.

Accessing SparkSession Components

# Access SparkContext

spark_context = spark.sparkContext

# Access the legacy SQLContext (spark._wrapped is an internal attribute and was removed in newer Spark releases)

sql_context = spark._wrapped

# Read a CSV file

data_frame = spark.read.csv("path/to/your/csv-file", header=True, inferSchema=True)

# Write the data to a Parquet file

data_frame.write.parquet("path/to/output/parquet-file")
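A few other components hang off the session itself (a small sketch; the option key shown is only an illustration):

# Runtime configuration for the session
spark.conf.set("spark.sql.shuffle.partitions", "8")
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Catalog of databases, tables and registered functions
print(spark.catalog.listDatabases())

# Spark version string
print(spark.version)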

Executing SQL Queries with SparkSession

With SparkSession, you can also execute SQL queries directly on your data. Here’s an example:

# Register a DataFrame as a temporary table

data_frame.createOrReplaceTempView("my_table")

# Execute an SQL query on the temporary table

result = spark.sql("SELECT * FROM my_table WHERE age > 30")

# Display the result

result.show()

Example: Word Count with SparkSession

from pyspark.sql import SparkSession

from pyspark.sql.functions import split, explode, col

# creating a spark session

spark = SparkSession.builder \
    .appName("Counting") \
    .master("local[*]") \
    .getOrCreate()

# read the input file

data = spark.read.text("path/to/input.txt")

# Split each line into words (read.text puts each line in a column named "value")
words = data.select(explode(split(col("value"), " ")).alias("Word"))

# Perform the word count
word_count = words.groupBy("Word").count()

# Display the results

word_count.show()

spark.stop()

Note -

explode() is a PySpark function that works over array (and map) columns and is used for the analysis of nested column data.

explode() converts an array column into rows, returning a new row for each element of the array. If the array is empty or null, it produces no rows for that record (explode_outer() keeps such records with a null value). For an array-of-arrays column, flatten() can first merge the nested arrays into a single array. These functions are applied inside select(), alongside other array functions such as array_min().
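A short illustration of explode() on an array column (a minimal sketch with made-up data; the application name is just a placeholder):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("Explode Example").getOrCreate()

df = spark.createDataFrame(
    [("Alice", ["python", "sql"]), ("Bob", [])],
    "name: string, skills: array<string>")

# One output row per array element; Bob's empty array produces no rows
df.select("name", explode("skills").alias("skill")).show()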

Read and Write files using PySpark – Multiple ways to Read and Write data using PySpark

pip install findspark

pip install pyspark

import findspark

findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \

.appName("My name is ") \

.master("local[*]") \

.getOrCreate()

Creating a Pyspark dataFrame

data = [("Alice", 34), ("Bob" , 45), ("Cathy", 29)]

columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

df.show()

+-----+---+

| Name|Age|

+-----+---+

|Alice| 34|

| Bob| 45|

|Cathy| 29|

csv_file = "path/to/your/csv/file.csv"

df_csv = spark.read.csv(csv_file, header=True, inferSchema=True)

output_path = "path/to/output/csv/file.csv"

df_csv.write.csv(output_path, header=True, mode="overwrite")

json_file = "path/to/your/json/file.json"

df_json = spark.read.json(json_file)

output_path = "path/to/output/json/file.json"

df_json.write.json(output_path, mode="overwrite")

Reading and Writing Parquet Files

To read a Parquet file using PySpark, you can use the read.parquet() method:

parquet_file = "path/to/your/parquet/file.parquet"

df_parquet = spark.read.parquet(parquet_file)

To write the data back to a Parquet file, use the write.parquet() method:

output_path = "path/to/output/parquet/file.parquet"

df_parquet.write.parquet(output_path, mode="overwrite")

Creating a SQL Table in PySpark

data = [

{"Name":"Alice", "Age":30, "City":"New York"},

{"Name":"Bob", "Age":25, "City":"San Francisco"},

{"Name":"Charlie", "Age":35, "City":"Los Angeles"}]

df = spark.createDataFrame(data)

df.createOrReplaceTempView("People")

query = "select * from People where Age > 30"

result_df = spark.sql(query)

result_df.show()

Convert Pandas Dataframe to PySpark Dataframe

import pandas as pd

data = [

{"Name":"Alice", "Age":30, "City":"New York"},

{"Name":"Bob", "Age":25, "City":"San Francisco"},

{"Name":"Charlie", "Age":35, "City":"Los Angeles"}]

pandasData = pd.DataFrame(data, columns=["Name", "Age", "City"])
# creating the pyspark dataframe from pandas

sparkdf = spark.createDataFrame(pandasData)

sparkdf.printSchema()

sparkdf.show()

root

|-- name: string (nullable = true)

|-- age: long (nullable = true)

|-- city: string (nullable = true)

+-------+---+-------------+

| name|age| city|

+-------+---+-------------+

| Alice| 30| New York|

| Bob| 25|San Francisco|

|Charlie| 35| Los Angeles|

A DataFrame is a distributed collection of data organized into named columns.


It is conceptually equivalent to a table in a relational database or a data frame in Python (pandas), but optimized for large-scale, distributed processing. You can think of a DataFrame as a spreadsheet with rows and columns.
The show() function is a method available for DataFrames in PySpark. It is used to display
the contents of a DataFrame in a tabular format, making it easier to visualize and understand
the data.

Syntax

DataFrame.show(n=20, truncate=True, vertical=False)

Parameters:

n: The number of rows to display. The default value is 20.

truncate: If set to True, the column content will be truncated if it is too long. The default value
is True.

vertical: If set to True, the output will be displayed vertically. The default value is False.

Example 1-

import findspark

findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("My PySpark Application") \
    .getOrCreate()

data = [("Alice", 34), ("Bob", 45), ("Charlie", 29), ("David", 31)]

columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

df.show()

+-------+---+

| Name|Age|

+-------+---+

| Alice| 34|

| Bob| 45|

|Charlie| 29|

| David| 31|

Display a specific number of rows, e.g. 2:

df.show(2)

+-----+---+

| Name|Age|

+-----+---+

|Alice| 34|

| Bob| 45|

+-----+---+

only showing top 2 rows

Display Contents Without Truncation

df.show(truncate = False)

+-------+---+
|Name   |Age|
+-------+---+
|Alice  |34 |
|Bob    |45 |
|Charlie|29 |
|David  |31 |
+-------+---+

Display Contents Vertically

df.show(vertical=True)

-RECORD 0-------

Name | Alice

Age | 34

-RECORD 1-------

Name | Bob

Age | 45

-RECORD 2-------

Name | Charlie

Age | 29

-RECORD 3-------

Name | David

Age | 31

Run SQL Queries with PySpark

OrderID  ProductID  Quantity  Price  OrderDate
1        101        3         100    2023-01-01
2        102        1         200    2023-01-02
3        101        2         100    2023-01-03
4        103        5         50     2023-01-04

The data is stored in a file named sales_data.csv.

csv_file = "sales_data.csv"

df = spark.read.option("header", "true").option("inferSchema", "true").csv(csv_file)

# creating a temporary view

df.createOrReplaceTempView("sales_data")

# Calculating the total Revenue of each product

query = """
select ProductID,
       SUM(Quantity * Price) AS TotalRevenue
from sales_data
group by ProductID
"""

result = spark.sql(query)

result.show()

+---------+------------+

|ProductID|TotalRevenue|

+---------+------------+

| 101| 500|

| 102| 200|

| 103| 250|

+---------+------------+

To find the top 2 products with the highest revenue:

query = """
select ProductID,
       SUM(Quantity * Price) AS TotalRevenue
from sales_data
group by ProductID
order by TotalRevenue desc
limit 2
"""

resultDF = spark.sql(query)

resultDF.show()

+---------+------------+
|ProductID|TotalRevenue|
+---------+------------+
|      101|         500|
|      103|         250|
+---------+------------+

PySpark Pandas API

import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
import pyspark.pandas as ps  # pandas API on Spark (successor to the databricks.koalas package)

# creating the SparkSession

spark = SparkSession.builder \
    .appName("PySpark Pandas API Example") \
    .getOrCreate()

Reading the file

sales_data = ps.read_csv("sales_data.csv")

Calculate the average revenue per unit sold and add it as a new column

sales_data['Avg_Revenue_per_Unit'] = sales_data['Revenue'] / sales_data['Units_Sold']

Computing summary statistics

summary_stats = sales_data.groupby(['Store_ID', 'Product_ID']).agg(

{'Revenue': 'sum', 'Units_Sold': 'sum'}).reset_index()

Sorting the results

sorted_summary_stats = summary_stats.sort_values(

by=['Store_ID', 'Revenue'], ascending=[True, False])

Exporting the results

sorted_summary_stats.to_csv("summary_stats.csv", index=False)

spark.stop()  # stop the current session

Select columns in PySpark dataframe

import findspark

findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("My PySpark Application") \
    .master("local[*]") \
    .getOrCreate()

data = [("Alice", 34, "Female"), ("Bob", 45, "Male"), ("Charlie", 28, "Male"),
        ("Diana", 39, "Female")]

columns = ["Name", "Age", "Gender"]

df = spark.createDataFrame(data, columns)

df.show()

Name|Age|Gender|

+-------+---+------+

| Alice| 34|Female|

| Bob| 45| Male|

|Charlie| 28| Male|

| Diana| 39|Female|

+-------+---+------+

# Select columns using column names

selectdf1 = df.select("Name", "Age")

selectdf1.show()

| Name|Age|

+-------+---+

| Alice| 34|

| Bob| 45|

|Charlie| 28|

| Diana| 39|

# select column using the col functions

selectdf2 = df.select(col("Name"), col("Age"))
selectdf2.show()

Name|Age|

+-------+---+

| Alice| 34|

| Bob| 45|

|Charlie| 28|

| Diana| 39

Selecting Columns using the ‘[ ]’ Operator


# Select a single column using the '[]' operator (this returns a Column object, not a DataFrame)

name_df = df["Name"]

# Select multiple columns using the '[]' operator

selected_df3 = df.select(df["Name"], df["Age"])

selected_df3.show()

Name|Age|

+-------+---+

| Alice| 34|

| Bob| 45|

|Charlie| 28|

| Diana| 39|

Select Columns using index

# Define the column indices you want to select

column_indices = [0, 2]

# Extract column names based on indices

selected_columns = [df.columns[i] for i in column_indices]

# Select columns using extracted column names

selected_df4 = df.select(selected_columns)

# Show the result DataFrame

selected_df4.show()

Name|Gender|

+-------+------+

| Alice|Female|

| Bob| Male|

|Charlie| Male|

| Diana|Female|

Selecting Columns using the ‘withColumn’ and ‘drop’ Functions.

To select specific columns while adding or removing columns, you can use the ‘withColumn’ function to add a new column and the ‘drop’ function to remove a column.

# Add a new column 'IsAdult' and remove the 'Gender' column

selectDF = df.withColumn("IsAdult", col("Age") > 18).drop("Gender")

selectDF.show()

Name|Age|IsAdult|

+-------+---+-------+

| Alice| 34| true|

| Bob| 45| true|

|Charlie| 28| true|

| Diana| 39| true|

Selecting Columns using SQL Expressions

You can also select columns using the ‘selectExpr’ function. This is useful when you want to perform operations on columns while selecting them.

selectDF = df.selectExpr("Name", "Age", "Age >= 18 AS IsAdult")

selectDF.show()

Name|Age|IsAdult|

+-------+---+-------+

| Alice| 34| true|

| Bob| 45| true|

|Charlie| 28| true|

| Diana| 39| true|

PySpark withColumn

The “withColumn” function in PySpark allows you to add, replace, or update columns in a DataFrame. It is a DataFrame transformation operation, meaning it returns a new DataFrame with the specified changes, without altering the original DataFrame.

The “withColumn” function is particularly useful when we need to perform column-based operations like renaming, changing the data type, or applying a function to the values in a column.

Syntax

DataFrame.withColumn(colName, col)

where:

DataFrame: The original PySpark DataFrame you want to manipulate.

colName: The name of the new or existing column you want to add, replace, or update.

col: The new expression or value for the specified column.

import findspark

findspark.init()

from pyspark.sql import SparkSession

from pyspark.sql.functions import col

# initialize the SparkSession

spark = SparkSession.builder \
    .appName("My PySpark Application") \
    .getOrCreate()

data = [(1, "Alice", 25),
        (2, "Bob", 30),
        (3, "Charlie", 35)]

columns = ["id", "name", "age"]

# create a Dataframe

df = spark.createDataFrame(data, columns)

df.show()

id| name|age|

+---+-------+---+

| 1| Alice| 25|

| 2| Bob| 30|

| 3|Charlie| 35|

1. Renaming a column

Use "withColumn" to copy the "age" column into a new "years" column, then drop the original:

# copy the age column into a new years column
df = df.withColumn("years", col("age"))

# drop the old age column
df = df.drop("age")

df.show()

id| name|years|

+---+-------+-----+

| 1| Alice| 25|

| 2| Bob| 30|

| 3|Charlie| 35|

+---+-------+-----

2. Applying a function to a column

Converting the age from years to months

26
from pyspark.sql.functions import expr

df = df.withColumn("months", expr("years * 12"))

df.show()

id| name|years|months|

+---+-------+-----+------+

| 1| Alice| 25| 300|

| 2| Bob| 30| 360|

| 3|Charlie| 35| 420|

3. Change the column data type

from pyspark.sql.types import StringType

# change the id to string

df = df.withColumn("id", col("id").cast(StringType()))

df.show()

id| name|years|months|

+---+-------+-----+------+

| 1| Alice| 25| 300|

| 2| Bob| 30| 360|

| 3|Charlie| 35| 420

4. Conditional column update with "withColumn".

from pyspark.sql.functions import when

data = [(1, "Alice", 25, 45000), (2, "Bob", 30, 55000), (3, "Charlie", 35, 60000)]

columns = ["id", "name", "age", "salary"]

df = spark.createDataFrame(data, columns)

df.show()

# add the "tax" column based on the "salary".

df = df.withColumn("tax", when(col("salary") >= 50000, col("salary") * 0.1).otherwise(col("salary") * 0.05))

df.show()

id| name|age|salary|

+---+-------+---+------+

| 1| Alice| 25| 45000|

| 2| Bob| 30| 55000|

| 3|Charlie| 35| 60000|

+---+-------+---+------+

+---+-------+---+------+------+

| id| name|age|salary| tax|

+---+-------+---+------+------+

| 1| Alice| 25| 45000|2250.0|

| 2| Bob| 30| 55000|5500.0|

| 3|Charlie| 35| 60000|6000.0|

5. Using a User Defined Function (UDF) with withColumn

from pyspark.sql.functions import udf

from pyspark.sql.types import StringType

def age_group(age):

if age < 30:

return "Young"

elif age < 45:

return "Middle-aged"

else:

return "Old"

# Register the UDF

age_group_udf = udf(age_group, StringType())

# add the new columnn age_group based on the age

df = df.withColumn("age_group", age_group_udf(col("age")))

df.show()

id| name|age|salary| tax| age_group|

+---+-------+---+------+------+-----------+

| 1| Alice| 25| 45000|2250.0| Young|

| 2| Bob| 30| 55000|5500.0|Middle-aged|

| 3|Charlie| 35| 60000|6000.0|Middle-aged|

# create a new net_salary column by subtracting the tax column from salary

from pyspark.sql.functions import round

df = df.withColumn("net_salary", round(col("salary") - col("tax"), 2))

df.show()

id| name|age|salary| tax| age_group|net_salary|

+---+-------+---+------+------+-----------+----------+

| 1| Alice| 25| 45000|2250.0| Young| 42750.0|

| 2| Bob| 30| 55000|5500.0|Middle-aged| 49500.0|

| 3|Charlie| 35| 60000|6000.0|Middle-aged| 54000.0|

# combining multiple columns into one column: name and age_group -> name_age_group

We will use the “concat_ws” function, which allows us to concatenate multiple columns
with a specified delimiter.

from pyspark.sql.functions import concat_ws

# combining the column

df = df.withColumn("name_age_group", concat_ws(" - ", col("name"), col("age_group")))

df.show()

id| name|age|salary| tax| age_group|net_salary| name_age_group|

+---+-------+---+------+------+-----------+----------+--------------------+

| 1| Alice| 25| 45000|2250.0| Young| 42750.0| Alice - Young|

| 2| Bob| 30| 55000|5500.0|Middle-aged| 49500.0| Bob - Middle-aged|

| 3|Charlie| 35| 60000|6000.0|Middle-aged| 54000.0|Charlie - Middle-...|

PySpark Drop Columns

The drop() function removes columns from a DataFrame.

Creating the dataframe

import findspark

findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("My PySpark Application") \
    .getOrCreate()

data = [("Alice", 30, "New York", "F"),

("Bob", 28, "San Francisco", "M"),

("Cathy", 29, "Los Angeles", "F"),

("David", 32, "Chicago", "M")]
columns= ["name", "age", "city", "gender"]

df = spark.createDataFrame(data, columns)

df.show()

name|age| city|gender|

+-----+---+-------------+------+

|Alice| 30| New York| F|

| Bob| 28|San Francisco| M|

|Cathy| 29| Los Angeles| F|

|David| 32| Chicago| M|

Different ways to drop columns in PySpark DataFrame

Dropping a Single Column

The drop() function can be used to remove a single column from a DataFrame:

df = df.drop("gender")

df.show()

+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 28|San Francisco|
|Cathy| 29|  Los Angeles|
|David| 32|      Chicago|
+-----+---+-------------+

Dropping Multiple Columns

The drop() function can also remove multiple columns from a DataFrame: simply pass several column names (or an unpacked list of names) to the function.

df = df.drop("age", "gender")

df.show()

or, using a list of column names:

dropping_column_name = ["age" , "gender"]

df = df.drop(* dropping_column_name)

df.show()

name| city|

+-----+-------------+

|Alice| New York|

| Bob|San Francisco|

|Cathy| Los Angeles|

|David| Chicago

Dropping Columns Conditionally

if "gender" in df.columns:

df = df.drop("gender")

df.show()

name|age| city|

+-----+---+-------------+

|Alice| 30| New York|

| Bob| 28|San Francisco|

|Cathy| 29| Los Angeles|

|David| 32| Chicago|

Dropping Columns Using Regex Pattern

You can also use select() together with a regular expression (regex) pattern to drop multiple columns whose names match the pattern.

from pyspark.sql.functions import col

import re

regex_pattern = "gender|age"

df = df.select([col(c) for c in df.columns if not re.match(regex_pattern, c)])

df.show()

+-----+-------------+

| name| city|

+-----+-------------+

|Alice| New York|

| Bob|San Francisco|

|Cathy| Los Angeles|

|David| Chicago|

PySpark Rename Columns
import findspark

findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark Rename Columns").getOrCreate()

from pyspark.sql import Row

data = [Row(name="Alice", age=25, city="New York"),

Row(name="Bob", age=30, city="San Francisco"),

Row(name="Cathy", age=35, city="Los Angeles")]

sample_df = spark.createDataFrame(data)

sample_df.show()

name|age| city|

+-----+---+-------------+

|Alice| 25| New York|

| Bob| 30|San Francisco|

|Cathy| 35| Los Angeles|

+-----+---+-------------+

Different ways to rename columns in a PySpark DataFrame

Renaming Columns Using ‘withColumnRenamed’

The ‘withColumnRenamed’ method is a simple way to rename a single column in a DataFrame

renamed_df = sample_df.withColumnRenamed("age", "user_age")

renamed_df.show()

name|user_age| city|

+-----+--------+-------------+

|Alice| 25| New York|

| Bob| 30|San Francisco|

|Cathy| 35| Los Angeles

Renaming Columns Using ‘select’ and ‘alias’

from pyspark.sql.functions import col

renamed_df = sample_df.select(col("name"), col("age").alias("user_age"), col("city"))

renamed_df.show()

name|user_age| city|

+-----+--------+-------------+

|Alice| 25| New York|

| Bob| 30|San Francisco|

|Cathy| 35| Los Angeles

Renaming Columns Using ‘toDF’

renamed_df = sample_df.toDF("user_name", "user_age", "user_city")

renamed_df.show()

user_name|user_age| user_city|

+---------+--------+-------------+

| Alice| 25| New York|

| Bob| 30|San Francisco|

| Cathy| 35| Los Angeles

Renaming Multiple Columns

renamed_df = sample_df.withColumnRenamed("name", "user_name") \

.withColumnRenamed("age", "user_age") \

.withColumnRenamed("city", "user_city")

renamed_df.show()

user_name|user_age| user_city|

+---------+--------+-------------+

| Alice| 25| New York|

| Bob| 30|San Francisco|

| Cathy| 35| Los Angeles
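When there are many columns to rename, the same chained renaming can be driven from a mapping (a small sketch, assuming the sample_df used above; the dictionary is just an illustration):

rename_map = {"name": "user_name", "age": "user_age", "city": "user_city"}

renamed_df = sample_df
for old_name, new_name in rename_map.items():
    renamed_df = renamed_df.withColumnRenamed(old_name, new_name)

renamed_df.show()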

PySpark Filter vs Where


Apache PySpark is a popular open-source distributed data processing engine built on top of the
Apache Spark framework. It provides a high-level API for handling large-scale data processing
tasks in Python, Scala, and Java.

One of the most common tasks when working with PySpark DataFrames is filtering rows based
on certain conditions. In this blog post, we’ll discuss different ways to filter rows in PySpark
DataFrames, along with code examples for each method.

Different ways to filter rows in PySpark DataFrames

1. Filtering Rows Using ‘filter’ Function

2. Filtering Rows Using ‘where’ Function

3. Filtering Rows Using SQL Queries

4. Combining Multiple Filter Conditions

Before we dive into filtering rows, let’s quickly review some basics of PySpark DataFrames. To
work with PySpark DataFrames, we first need to import the necessary modules and create a
SparkSession

import findspark

findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \

.appName("Filtering Rows in PySpark DataFrames") \

.getOrCreate()

Next, let’s create a simple DataFrame to use in our examples

from pyspark.sql import Row

data = [

Row(id=1, name="Alice", age=30),

Row(id=2, name="Bob", age=25),

Row(id=3, name="Charlie", age=35),

Row(id=4, name="David", age=28)

]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)

df.show()

+---+-------+---+

| id| name|age|

+---+-------+---+

| 1| Alice| 30|

| 2| Bob| 25|

| 3|Charlie| 35|

| 4| David| 28|

+---+-------+---+

1. Filtering Rows Using ‘filter’ Function

The filter function is one of the most straightforward ways to filter rows in a PySpark
DataFrame. It takes a boolean expression as an argument and returns a new DataFrame
containing only the rows that satisfy the condition.

Example: Filter rows with age of 30 or more

filtered_df = df.filter(df.age >= 30)

filtered_df.show()

+---+-------+---+

| id| name|age|

+---+-------+---+

| 1| Alice| 30|

| 3|Charlie| 35|

+---+-------+---+

2. Filtering Rows Using ‘where’ Function

The where function is an alias for the ‘filter’ function and can be used interchangeably. It also
takes a boolean expression as an argument and returns a new DataFrame containing only the
rows that satisfy the condition.

Example: Filter rows where the name is “Alice” or “Charlie”:

filtered_df = df.where(df.name.isin(["Alice", "Charlie"]))

filtered_df.show()

id| name|age|

+---+-------+---+

| 1| Alice| 30|

| 3|Charlie| 35|

3. Filtering Rows Using SQL Queries

Example: Filter rows with age less than or equal to 25

df.createOrReplaceTempView("people")

filtered_df = spark.sql("SELECT * FROM people WHERE age <= 25")

filtered_df.show()

id|name|age|

+---+----+---+

| 2| Bob| 25|

4. Combining Multiple Filter Conditions

You can combine multiple filter conditions using the ‘&’ (and), ‘|’ (or), and ‘~’ (not) operators. Make sure to use parentheses to separate different conditions, as it helps maintain the correct order of operations.

filtered_df = df.filter((df.age > 25) & (df.name != "David"))

filtered_df.show()

id| name|age|

+---+-------+---+

| 1| Alice| 30|

| 3|Charlie| 35
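For completeness, the other two operators work the same way (a small sketch using the same df as above):

# '|' (or): rows where age is under 26 OR the name is "Charlie"
df.filter((df.age < 26) | (df.name == "Charlie")).show()

# '~' (not): rows whose name is NOT in the given list
df.filter(~df.name.isin(["Alice", "Bob"])).show()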

PySpark orderBy() and sort()

import findspark

findspark.init()

from pyspark.sql import SparkSession

# Create a SparkSession

spark = SparkSession.builder \

.appName("PySpark orderBy() and sort() Example") \

.getOrCreate()

# Sample data

data = [

("Alice", 30, "New York"),

("Bob", 28, "San Francisco"),

("Charlie", 34, "Los Angeles"),

("Diana", 29, "Chicago")]

# Create a DataFrame

columns = ["Name", "Age", "City"]

df = spark.createDataFrame(data, columns)

df.show()

Name|Age| City|

+-------+---+-------------+

| Alice| 30| New York|

| Bob| 28|San Francisco|

|Charlie| 34| Los Angeles|

| Diana| 29| Chicago

orderBy() function

The orderBy() function in PySpark is used to sort a DataFrame based on one or more columns. It takes one or more columns as arguments and returns a new DataFrame sorted by the specified columns.

Syntax:

DataFrame.orderBy(*cols, ascending=True)

The sort() Function

The sort() function is an alias of orderBy() and has the same functionality. The syntax and
parameters are identical to orderBy().

Syntax:

DataFrame.sort(*cols, ascending=True)

Difference between orderBy() and sort()

There is no functional difference between orderBy() and sort() in PySpark. The sort() function is simply an alias for orderBy().

sorted_by_age = df.orderBy("Age")

sorted_by_age.show()

Name|Age| City|

+-------+---+-------------+

| Bob| 28|San Francisco|

| Diana| 29| Chicago|

| Alice| 30| New York|

|Charlie| 34| Los Angeles|

# Sort by multiple columns using orderBy()

sorted_by_age_and_city = df.orderBy(["Age", "City"], ascending=[True, False])

sorted_by_age_and_city.show()

Name|Age| City|

+-------+---+-------------+

| Bob| 28|San Francisco|

| Diana| 29| Chicago|

| Alice| 30| New York|

|Charlie| 34| Los Angeles|

# Sort the DataFrame using sort()

sorted_by_age = df.sort("Age")

sorted_by_age.show()

Name|Age| City|

+-------+---+-------------+

| Bob| 28|San Francisco|

| Diana| 29| Chicago|

| Alice| 30| New York|

|Charlie| 34| Los Angeles

# Sort by multiple columns using sort()

sorted_by_age_and_city = df.sort(["Age", "City"], ascending=[True, False])

sorted_by_age_and_city.show()

Name|Age| City|

+-------+---+-------------+

| Bob| 28|San Francisco|

| Diana| 29| Chicago|

| Alice| 30| New York|

|Charlie| 34| Los Angeles|

Sorting a DataFrame using column expressions

from pyspark.sql.functions import desc, asc

# Sort the DataFrame by age in descending order using column expressions

sorted_by_age_desc_expr = df.orderBy(desc("Age"))

sorted_by_age_desc_expr.show()

Name|Age| City|

+-------+---+-------------+

|Charlie| 34| Los Angeles|

| Alice| 30| New York|

| Diana| 29| Chicago|

| Bob| 28|San Francisco

# Sort the DataFrame by city in ascending order using column expressions

sorted_by_city_asc_expr = df.sort(asc("City"))

sorted_by_city_asc_expr.show()

Name|Age| City|

+-------+---+-------------+

| Diana| 29| Chicago|

|Charlie| 34| Los Angeles|

| Alice| 30| New York|

| Bob| 28|San Francisco|

Sorting a DataFrame with NULL values

data_with_nulls = [

("Alice", None, "New York"),

("Bob", 28, None),

("Charlie", 34, "Los Angeles"),

("Diana", 29, "Chicago")]

# Create a DataFrame with NULL values

df_with_nulls = spark.createDataFrame(data_with_nulls, columns)

# Sort the DataFrame by Age with NULLs last. orderBy() has no nulls_last argument,
# so use the asc_nulls_last() ordering function instead
from pyspark.sql.functions import asc_nulls_last

sorted_with_nulls = df_with_nulls.orderBy(asc_nulls_last("Age"))

sorted_with_nulls.show()

+-------+----+-----------+
|   Name| Age|       City|
+-------+----+-----------+
|    Bob|  28|       null|
|  Diana|  29|    Chicago|
|Charlie|  34|Los Angeles|
|  Alice|null|   New York|
+-------+----+-----------+

# Sort the DataFrame by City with NULLs first (the default for ascending order;
# asc_nulls_first("City") would make it explicit)

sorted_with_nulls_alt = df_with_nulls.sort("City", ascending=True)

sorted_with_nulls_alt.show()

Name| Age| City|

+-------+----+-----------+

| Bob| 28| null|

| Diana| 29| Chicago|

|Charlie| 34|Los Angeles|

| Alice|null| New York

Sorting a DataFrame using a custom sorting order

from pyspark.sql.functions import col, when

# Define a custom sorting order for cities

city_order = ["New York", "Los Angeles", "Chicago", "San Francisco"]

# Create a custom sorting column

custom_sort_col = when(col("City") == city_order[0], 0) \

.when(col("City") == city_order[1], 1) \

.when(col("City") == city_order[2], 2) \

.when(col("City") == city_order[3], 3) \

.otherwise(4)

# Sort the DataFrame using the custom sorting order

sorted_by_custom_order = df.orderBy(custom_sort_col)

sorted_by_custom_order.show()

Name|Age| City|

+-------+---+-------------+

| Alice| 30| New York|

|Charlie| 34| Los Angeles|

| Diana| 29| Chicago|

| Bob| 28|San Francisco

PySpark GroupBy()

PySpark GroupBy is a powerful operation that allows you to perform aggregations on your data.
It groups the rows of a DataFrame based on one or more columns and then applies an
aggregation function to each group. Common aggregation functions include sum, count, mean,
min, and max.

Here’s a general structure of a GroupBy operation:

Syntax :

dataFrame.groupBy(“column_name”).agg(aggregation_function)

aggregation functions

count() – return the number of rows for each group

max() – returns the maximum of values for each group

min() – returns the minimum of values for each group

sum() – returns the total for values for each group

avg() – returns the average for values for each group

import findspark

findspark.init()

from pyspark.sql import SparkSession

from pyspark.sql.functions import *

# Create a Spark session

spark = SparkSession.builder \

.appName("PySpark GroupBy Example") \

.getOrCreate()

# Sample data

data = [("1001", "Laptop", "Electronics", 1, 1000, "2023-01-01"),

("1002", "Mouse", "Electronics", 2, 50, "2023-01-02"),

("1003", "Laptop", "Electronics", 1, 1200, "2023-01-03"),

("1004", "Mouse", "Electronics", 3, 30, "2023-01-04"),

("1005", "Smartphone", "Electronics", 1, 700, "2023-01-05")]

# Create DataFrame

columns = ["OrderID", "Product", "Category", "Quantity", "Price", "Date"]

df = spark.createDataFrame(data, columns)

df.show()

OrderID| Product| Category|Quantity|Price| Date|

+-------+----------+-----------+--------+-----+----------+

| 1001| Laptop|Electronics| 1| 1000|2023-01-01|

| 1002| Mouse|Electronics| 2| 50|2023-01-02|

| 1003| Laptop|Electronics| 1| 1200|2023-01-03|

| 1004| Mouse|Electronics| 3| 30|2023-01-04|

| 1005|Smartphone|Electronics| 1| 700|2023-01-05

GroupBy operation on single column

# GroupBy and aggregate

result = df.groupBy("Product").agg(sum("Price").alias("Total_Sales"))

# Show results

result.show()

Product|Total_Sales|

+----------+-----------+

| Laptop| 2200|

| Mouse| 80|

|Smartphone| 700

GroupBy operation on Multiple Columns

# GroupBy and aggregate

result = df.groupBy(["Product", "Category"]) \

.agg(sum("Price").alias("Total_Sales"))

# Show results

result.show()

Product| Category|Total_Sales|

+----------+-----------+-----------+

| Laptop|Electronics| 2200|

| Mouse|Electronics| 80|

|Smartphone|Electronics| 700

GroupBy operation on Multiple Aggregations

# GroupBy and aggregate

result = df.groupBy("Product") \

.agg(sum("Price").alias("Total_Sales"),

sum("Quantity").alias("Total_Quantity"))

# Show results

result.show()

Product|Total_Sales|Total_Quantity|

+----------+-----------+--------------+

| Laptop| 2200| 2|

| Mouse| 80| 5|

|Smartphone| 700| 1|
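The other aggregation functions from the list above can be combined in the same agg() call. A short sketch on the same df (the expected values in the comment are worked out by hand from the sample data):

# Number of orders plus min and max price per product
result = df.groupBy("Product") \
    .agg(count("OrderID").alias("Orders"),
         min("Price").alias("Min_Price"),
         max("Price").alias("Max_Price"))

# Laptop: 2 orders, 1000-1200; Mouse: 2 orders, 30-50; Smartphone: 1 order, 700
result.show()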

Filter Aggregated data using where condition

You can use a combination of where() (which is equivalent to the SQL WHERE clause) and groupBy() to perform a groupBy operation with a specific condition.

# GroupBy and aggregate using where condition

result = df.groupBy("Product") \

.agg(avg("Price").alias("Total_Sales"),

sum("Quantity").alias("Total_Quantity")) \

.where(col("Total_Quantity") >= 2)

# Show results

result.show()

Product|Total_Sales|Total_Quantity|

+-------+-----------+--------------+

| Laptop| 1100.0| 2|

| Mouse| 40.0| 5

Custom Aggregation Functions

import pandas as pd

from pyspark.sql.types import FloatType

from pyspark.sql.functions import pandas_udf

@pandas_udf(FloatType())
def median(column: pd.Series) -> float:
    return float(column.median())

# Apply the custom aggregation per Category
result = df.groupBy("Category").agg(median("Price").alias("Median_Price"))

result.show()

+-----------+------------+
|   Category|Median_Price|
+-----------+------------+
|Electronics|       700.0|
+-----------+------------+

PySpark Joins

Type of Joins

Inner Join

Outer (Full) Join

Left Join

Right Join

Left Semi Join

Left Anti Join

Cross Join

import findspark

findspark.init()

from pyspark.sql import SparkSession

# Initialize Spark Session

spark = SparkSession.builder.master("local").appName("PySpark Join Types").getOrCreate()

# Create sample dataframes

df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])

df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])

# Perform inner join

result = df1.join(df2, on="id", how="inner")

# Show result

result.show()

id|value1|value2|

+---+------+------+

| 1| A| X|

| 2| B| Y|

# Perform outer join

result = df1.join(df2, on="id", how="outer")

# Show result

result.show()

id|value1|value2|

+---+------+------+

| 1| A| X|

| 2| B| Y|

| 3| C| null|

| 4| null| Z

# Perform left join

result = df1.join(df2, on="id", how="left")

# Show result

result.show()

id|value1|value2|

+---+------+------+

| 1| A| X|

| 3| C| null|

| 2| B| Y|

# Perform right join

result = df1.join(df2, on="id", how="right")

# Show result

result.show()

+---+------+------+

| id|value1|value2|

+---+------+------+

| 1| A| X|

| 2| B| Y|

| 4| null| Z

Left Semi Join

A left semi join returns only the columns from the left dataframe for the rows with matching
keys in both dataframes. It is similar to an inner join but only returns the columns from the left
dataframe.

# Perform left semi join

result = df1.join(df2, on="id", how="left_semi")

# Show result

result.show()

+---+------+

| id|value1|

+---+------+

| 1| A|

| 2| B|

Left Anti Join

A left anti join returns the rows from the left dataframe that do not have matching keys in the
right dataframe. It is the opposite of a left semi join.

# Perform left anti join

result = df1.join(df2, on="id", how="left_anti")

# Show result

result.show()

+---+------+

| id|value1|

+---+------+

| 3| C|

Cross Join

A cross join, also known as a cartesian join, returns the cartesian product of both dataframes. It
combines each row from the left dataframe with each row from the right dataframe.

# Perform cross join

result = df1.crossJoin(df2)

# Show result

result.show()

+---+------+---+------+

| id|value1| id|value2|

+---+------+---+------+

| 1| A| 1| X|

| 1| A| 2| Y|

| 1| A| 4| Z|

| 2| B| 1| X|

| 2| B| 2| Y|

| 2| B| 4| Z|

| 3| C| 1| X|

| 3| C| 2| Y|

| 3| C| 4| Z|
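When the join keys are given as a column expression instead of on="id", both id columns are kept in the result (a small sketch reusing df1 and df2 from above):

# Inner join written with an explicit join condition
result = df1.join(df2, df1.id == df2.id, "inner")

# Drop the duplicated id column coming from df2
result = result.drop(df2.id)

result.show()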

PySpark Union

PySpark Union is an operation that allows you to combine two or more DataFrames
with the same schema, creating a single DataFrame containing all rows from the input
DataFrames.

It’s important to note that the Union operation doesn’t eliminate duplicate rows, so you may
need to use the distinct() function afterward if you want to remove duplicates.

import findspark

findspark.init()

from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a Spark session

spark = SparkSession.builder.appName("PySpark Union Example").getOrCreate()

# Define the schema

schema = StructType([

StructField("product", StringType(), True),

StructField("price", IntegerType(), True),

StructField("quantity", IntegerType(), True)

])

# Create DataFrame for region A

data_A = [("apple", 3, 5), ("banana", 1, 10), ("orange", 2, 8)]

df_A = spark.createDataFrame(data_A, schema=schema)

# Create DataFrame for region B

data_B = [("apple", 3, 5), ("banana", 1, 15), ("grape", 4, 6)]

df_B = spark.createDataFrame(data_B, schema=schema)

# Create DataFrame for region C

data_C = [("apple", 3, 10), ("banana", 1, 20), ("grape", 4, 10), ("orange", 2, 7)]

df_C = spark.createDataFrame(data_C, schema=schema)

# Perform the Union operation on two DataFrames

df_union = df_A.union(df_B)

# Show the results

df_union.show()

+-------+-----+--------+

|product|price|quantity|

+-------+-----+--------+

| apple| 3| 5|

| banana| 1| 10|

| orange| 2| 8|

| apple| 3| 5|

| banana| 1| 15|

| grape| 4| 6|

Union without Duplicates

It’s important to note that the Union operation doesn’t eliminate duplicate rows, so you may
need to use the distinct() function afterward if you want to remove duplicates.

# Perform the Union operation on two DataFrames

df_union_dist = df_A.union(df_B).distinct()

# Show the results

df_union_dist.show()

product|price|quantity|

+-------+-----+--------+

| apple| 3| 5|

| banana| 1| 10|

| orange| 2| 8|

| banana| 1| 15|

| grape| 4| 6|

Union Multiple DataFrames

# Perform the Union operation on multiple DataFrames

df_union_all = df_A.union(df_B).union(df_C)

# Show the results

df_union_all.show()

product|price|quantity|

+-------+-----+--------+

| apple| 3| 5|

| banana| 1| 10|

| orange| 2| 8|

| apple| 3| 5|

| banana| 1| 15|

| grape| 4| 6|

| apple| 3| 10|

| banana| 1| 20|

| grape| 4| 10|

| orange| 2| 7
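union() matches columns purely by position, so the DataFrames must share the same schema in the same order. When only the column order differs, unionByName() matches columns by name instead (a small sketch; df_D and its row are made up for illustration):

# Same columns as df_A, but in a different order
data_D = [(2, 4, "mango")]
df_D = spark.createDataFrame(data_D, ["price", "quantity", "product"])

# unionByName() lines the columns up by name rather than by position
df_by_name = df_A.unionByName(df_D)

df_by_name.show()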
