
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

Table of Contents

Usage of Google Colab

Install Apache Spark on Windows

Apache Spark Architecture

Power BI

1. Write a Program to create a SparkSession and Read Data from a CSV File

2. Write a program to group records of the supermarket sales data (Kaggle dataset) by Gender

3. Write a program to create a Spark Session and display DataFrame of employee.json

4. Perform Various Operations with SparkSQL

5. Create a New Data Pipeline with Apache Spark

6. Run SQL Queries on the Data in a Parquet Table

7. Develop Parquet Table to a Platform Data Container

8. Change Data in an Existing Delta Lake Table

9. Create a New Ingestion Pipeline with Apache Spark

10. Run SQL Queries on the Data in a NoSQL Table


USAGE OF GOOGLE COLAB

Google Colab is a notebook environment for coding that runs entirely in your web browser. It lets you write and run code, especially for data science and machine learning, without setting up anything on your own computer. You can share your work with others, and it is free to use. It also gives you access to powerful machines and GPUs (graphics processing units) that speed up complex calculations.

Google Colab, or "Colaboratory", is a cloud-based machine learning platform that you can use to write and execute Python code, process data, and create visualizations. It is free for most tasks, with paid tiers for more demanding workloads, which makes it a convenient environment for the Spark exercises in this manual.

To install PySpark in Google Colab:

1. Open Google Colab

• Go to Google Colab.

• Create a new notebook by clicking on "File" > "New notebook".

2. Install PySpark Using Pip

• In a new code cell, use the pip command to install PySpark.

The ! at the beginning of the command tells Colab to execute it as a shell command rather than Python code.

!pip install pyspark

pip is the package manager for Python that installs and manages Python packages, install is the command to install a new package, and pyspark is the name of the package you are installing, which is the Python API for Apache Spark.
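
Once the install finishes, a quick way to confirm PySpark is usable (a minimal sketch, not part of the original steps) is to create a SparkSession and print its version; the exact version you see depends on what pip installed.

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession just to confirm the installation works
spark = SparkSession.builder.appName("Install Check").getOrCreate()
print("Spark version:", spark.version)

spark.stop()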


Install Apache Spark on Windows

Installing Apache Spark on Windows 10 may seem complicated to novice users, but this simple tutorial will
have you up and running. If you already have Java 8 and Python 3 installed, you can skip the first two steps.

Step 1: Install Java 8

Apache Spark requires Java 8. You can check to see if Java is installed using the command prompt.

Open the command line by clicking Start > type cmd > click Command Prompt.

Type the following command in the command prompt:


java -version
If Java is installed, it will respond with its version information. Your version may be different; the second digit is the Java version, in this case Java 8.

If you don’t have Java installed:

1. Open a browser window, and navigate to https://java.com/en/download/.

2. Click the Java Download button and save the file to a location of your choice.

3. Once the download finishes, double-click the file to install Java.


Step 2: Install Python

1. To install Python, navigate to https://www.python.org/ in your web browser.

2. Mouse over the Download menu option and click Python 3.8.3 (the latest version at the time of writing).

3. Once the download finishes, run the file.

4. Near the bottom of the first setup dialog box, check the box Add Python 3.8 to PATH. Leave the other box checked.

5. Next, click Customize installation.

6. You can leave all boxes checked at this step, or you can uncheck the options you do not want.


7. Click Next.

8. Select the box Install for all users and leave other boxes as they are.

9. Under Customize install location, click Browse and navigate to the C drive. Add a new folder and name it Python.

10. Select that folder and click OK.

11. Click Install, and let the installation complete.

12. When the installation completes, click the Disable path length limit option at the bottom and then click Close.

13. If you have a command prompt open, restart it. Verify the installation by checking the version of Python:

python --version

The output should print Python 3.8.3.

Step 3: Download Apache Spark

1. Open a browser and navigate to https://spark.apache.org/downloads.html.

2. Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview
version.


• In our case, in the Choose a Spark release drop-down menu, select 2.4.5 (Feb 05 2020).


• In the second drop-down Choose a package type, leave the selection Pre-built for Apache Hadoop
2.7.

3. Click the spark-2.4.5-bin-hadoop2.7.tgz link.

4. A page with a list of mirrors loads where you can see different servers to download from. Pick any from the
list and save the file to your Downloads folder.

Step 4: Verify Spark Software File

1. Verify the integrity of your download by checking the checksum of the file. This ensures you are working
with unaltered, uncorrupted software.

2. Navigate back to the Spark Download page and open the Checksum link, preferably in a new tab.

3. Next, open a command line and enter the following command:

certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512

4. Change the username to your username. The system displays a long alphanumeric code,
along with the message Certutil: -hashfile completed successfully.


5. Compare the code to the one you opened in a new browser tab. If they match, your
download file is uncorrupted.

Step 5: Install Apache Spark

Installing Apache Spark involves extracting the downloaded file to the desired location.

1. Create a new folder named Spark in the root of your C: drive. From a command line, enter the following:

cd \
mkdir Spark

2. In Explorer, locate the Spark file you downloaded.

3. Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).

4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the necessary files inside.

Step 6: Add winutils.exe File

Download the winutils.exe file for the underlying Hadoop version for the Spark installation you downloaded.

1. Navigate to this URL https://github.com/cdarlint/winutils and inside the bin folder, locate winutils.exe, and
click it.

2. Find the Download button on the right side to download the file.

3. Now, create the folders C:\hadoop and C:\hadoop\bin using Windows Explorer or the Command Prompt.

4. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.

Step 7: Configure Environment Variables

Configuring environment variables in Windows adds the Spark and Hadoop locations to your system PATH. It allows you to run the Spark shell directly from a command prompt window. A small verification sketch follows the numbered steps below.

1. Click Start and type environment.


2. Select the result labeled Edit the system environment variables.

3. A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then
click New in the next window.

4. For Variable Name type SPARK_HOME.

5. For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you changed the folder
path, use that one instead.

6. In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid
deleting any entries already on the list.

7. You should see a box with entries on the left. On the right, click New.


8. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid possible issues with the path.


9. Repeat this process for Hadoop and Java.


10. For Hadoop, the variable name is HADOOP_HOME and for the value use the path of the folder you created earlier: C:\hadoop. Add C:\hadoop\bin to the Path variable field, but we recommend using %HADOOP_HOME%\bin.

11. For Java, the variable name is JAVA_HOME and for the value use the path to your Java JDK directory (in our case it’s C:\Program Files\Java\jdk1.8.0_251).

12. Click OK to close all open windows.
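
As a quick sanity check (a minimal sketch, not part of the original tutorial), you can confirm from a fresh command prompt that the three variables are visible to programs, for example with a short Python script:

import os

# Print the environment variables configured above; "<not set>" means the variable
# is not visible to this process (reopen the command prompt after editing it).
for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    print(name, "=", os.environ.get(name, "<not set>"))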

Step 8: Launch Spark

1. Open a new command-prompt window by right-clicking Command Prompt and selecting Run as administrator.

2. To start Spark, enter:

C:\Spark\spark-2.4.5-bin-hadoop2.7\bin\spark-shell

The system should display several lines indicating the status of the application.

3. If you set the environment path correctly, you can simply type spark-shell to launch it. You may get a Java pop-up. Select Allow access to continue.

Finally, the Spark logo appears, and the prompt displays the Scala shell.


4. Open a web browser and navigate to http://localhost:4040/.

5. You can replace localhost with the name of your system.

6. You should see an Apache Spark shell Web UI. The example below shows the Executors page.

7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.


Apache Spark Architecture

Apache Spark is known for its fast processing capabilities compared to MapReduce. It is fast because it processes data in memory (RAM), which is much quicker than working from disk. Apache Spark offers a wide range of capabilities, allowing users to perform multiple operations such as creating data pipelines, integrating data from various sources, running machine learning models, working with graphs, executing SQL queries, and more.

Spark's architecture is an open-source, framework-based design that processes large amounts of unstructured, semi-structured, and structured data for analytics, and it is regarded as an alternative to the Hadoop MapReduce architecture for big data processing. The RDD and the DAG, Spark's data storage and processing abstractions, are used to store and process data, respectively. The architecture consists of four components: the Spark driver, executors, the cluster manager, and worker nodes. It uses Datasets and DataFrames as the fundamental data storage mechanism to optimise the Spark process and big data computation.

Architecture of Apache Spark:

Apache Spark has a master-slave architecture consisting of a driver program and a cluster of worker nodes.


Spark applications run in clusters, where each worker node contains executors responsible for executing tasks
in parallel and storing data in memory across multiple machines, enhancing speed and fault tolerance. Spark’s
cluster manager (e.g., YARN, Mesos, or Kubernetes) allocates resources and coordinates between driver and
workers. Additionally, Spark uses the Resilient Distributed Dataset (RDD) abstraction, which ensures fault
tolerance by tracking data transformations, enabling the system to recompute lost data without full reruns.

When the Driver Program in the Apache Spark architecture executes, it calls the real program of an application and creates a SparkContext, which contains all of the basic functions. The Spark Driver also includes several other components, including a DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, all of which are responsible for translating user-written code into jobs that are actually executed on the cluster. The Cluster Manager manages the execution of the various jobs in the cluster, and the Spark Driver works in conjunction with it to control their execution; the Cluster Manager allocates the resources for each job. Once a job has been broken down into smaller tasks, which are then distributed to worker nodes, the Spark Driver controls their execution. Many worker nodes can be used to process an RDD created in the SparkContext, and the results can also be cached. The SparkContext receives task information from the Cluster Manager and enqueues it on worker nodes, and the executors are in charge of carrying out these tasks. The lifespan of executors is the same as that of the Spark application. We can increase the number of workers if we want to improve the performance of the system; in this way, we can divide jobs into smaller, more coherent parts.
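
To make the driver/executor flow concrete, here is a minimal PySpark sketch (an illustration added to this manual, not part of the original text): the driver builds the SparkSession, a transformation is only recorded as lineage, and work is distributed to the executors when an action such as count() is triggered.

from pyspark.sql import SparkSession

# The driver program creates the SparkSession (and, underneath it, the SparkContext)
spark = SparkSession.builder.appName("ArchitectureDemo").getOrCreate()

# Transformations are lazy: these lines only record lineage, nothing runs yet
numbers = spark.range(0, 1_000_000)          # a distributed dataset of ids
evens = numbers.filter(numbers.id % 2 == 0)  # transformation (planned, not executed)

# An action triggers the DAG scheduler to split the job into tasks,
# which the cluster manager schedules onto executors
print("Even numbers:", evens.count())

spark.stop()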


Apache Spark Features

Apache Spark is a popular open-source cluster computing framework that was created to accelerate data processing applications; it enables applications to run faster by utilising in-memory cluster computing.

A cluster is a collection of nodes that communicate with each other and share data. Because of implicit data
parallelism and fault tolerance, Spark may be applied to a wide range of sequential and interactive
processing demands.
1. Speed: Spark performs up to 100 times faster than MapReduce for processing large amounts of data. It is also able to divide the data into chunks in a controlled way.
2. Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities (see the short sketch after this list).
3. Deployment: Spark can be deployed through Mesos, Hadoop via YARN, or its own cluster manager.
4. Real-Time: Because of its in-memory processing, it offers real-time computation and low latency.
5. Polyglot: Spark supports Java, Scala, Python, and R, and you can write Spark code in any of these languages. Spark also provides command-line shells in Scala and Python.
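
As a small illustration of the caching feature above (a sketch added for this manual, not from the original text), cache() keeps a DataFrame in memory after the first action so that repeated actions avoid recomputing its lineage:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingDemo").getOrCreate()
df = spark.range(0, 500_000)

# cache() marks the DataFrame to be kept in memory after the first action
df.cache()
print(df.count())   # first action: computes the data and fills the cache
print(df.count())   # second action: served from the in-memory cache

spark.stop()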


POWER BI

Power BI is a Microsoft business analytics tool that enables users to visualize data and
share insights through interactive reports and dashboards. It connects to various data sources,
allowing for data transformation and modeling using Power Query and DAX. With a wide
range of visualization options, Power BI facilitates data-driven decision-making and
collaboration, making it easy to share reports within organizations and access them via mobile
devices.

Power BI can integrate with Apache Spark in an ETL (Extract, Transform, Load) pipeline
to leverage Spark's distributed data processing power and Power BI’s data visualization
capabilities. Here's how it works:

1. Extract: Data is collected from various sources (e.g., databases, cloud storage, or
APIs) into Apache Spark. Spark can handle large-scale, diverse datasets efficiently through its
parallel processing framework.

2. Transform: In Spark, data is cleaned, transformed, and processed using operations like
filtering, aggregation, and joining across distributed datasets. This is where Spark shines,
handling both batch and real-time streaming data.

3. Load into Power BI: After transformation, the processed data can be loaded into Power BI for visualization (a small sketch follows this list). Power BI connects to Spark via:

• Spark ODBC/JDBC: Power BI can connect to Spark using ODBC/JDBC drivers to query and fetch data directly from the Spark cluster.

• Azure Synapse or Databricks: If Spark is running on Azure Synapse or Azure Databricks, Power BI can natively integrate and query the data from these services.

4. Visualization and Analysis: Power BI pulls the processed data from Spark, allowing users to
build interactive dashboards, perform real-time analytics, and visualize trends.
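
As a rough illustration of the Extract/Transform/Load steps above (a hedged sketch, not an official Power BI connector example), Spark can write its transformed output to a file that Power BI Desktop can import through Get Data; the path and column names below are assumptions based on the supermarket sales dataset used later in this manual.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum

spark = SparkSession.builder.appName("ETL for Power BI").getOrCreate()

# Extract + Transform in Spark
df = spark.read.csv("/content/supermarket_sales.csv", header=True, inferSchema=True)
summary = df.groupBy("Gender").agg(spark_sum(col("Quantity") * col("Unit price")).alias("TotalSales"))

# Load: write a single CSV file that Power BI can import (Get Data > Text/CSV)
summary.coalesce(1).write.mode("overwrite").csv("/content/powerbi_export", header=True)

spark.stop()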


1. Write a Program to create a SparkSession and Read Data from a CSV File

File Prerequisites: the CSV file to be read (here /content/supermarket_sales.csv) must be available in the Colab session.

Explanation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Read CSV").getOrCreate()
df = spark.read.csv("/content/supermarket_sales.csv", header=True, inferSchema=True)
df.show()

from pyspark.sql import SparkSession


This line imports the SparkSession class from the pyspark.sql module. SparkSession is the entry point to programming with Spark SQL. It allows you to create DataFrames and execute SQL queries.

spark = SparkSession.builder.appName("Read CSV").getOrCreate()


SparkSession.builder starts the process of creating a new Spark session.
.appName("Read CSV") sets the name of the Spark application to "Read CSV". This name will be displayed in
the Spark web UI, which can be helpful for debugging and monitoring.
.getOrCreate() method either retrieves an existing Spark session or creates a new one if none exists. This is
useful to ensure that you don't accidentally create multiple Spark sessions in the same application.


df = spark.read.csv("/content/supermarket_sales.csv",header=True, inferSchema=True)

spark.read.csv is a method of the SparkSession object to read a CSV file.


"/content/supermarket_sales.csv" is the path to the CSV file you want to read. You need to replace this with the
actual path to your CSV file.

header=True tells Spark to treat the first row of the file as column names, and the inferSchema=True parameter tells Spark to automatically infer the data types of the columns based on the values in the CSV file.

df.show()
It will display the content of the DataFrame. This is a method of the DataFrame class that prints the first 20
rows of the DataFrame to the console.

OUTPUT:
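
Once the file is loaded, a few further inspection calls can be useful (a sketch; the column name assumes the Kaggle supermarket sales file used in the later programs):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Read CSV").getOrCreate()
df = spark.read.csv("/content/supermarket_sales.csv", header=True, inferSchema=True)

# A few ways to inspect the loaded DataFrame beyond df.show()
df.printSchema()                        # inferred column names and types
print("Rows:", df.count())              # total number of records
df.select("Gender").distinct().show()   # distinct values in one column

spark.stop()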


2. Write a program to group records of the supermarket sales data (Kaggle dataset) by Gender

Prerequisites:
Before running the code, ensure you have PySpark installed. You can install it using:
pip install pyspark

PySpark Program to Group Data by Gender :

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.appName("Supermarket Sales Analysis").getOrCreate()

# Read the Kaggle supermarket sales dataset (same file as in Program 1)
df = spark.read.csv("/content/supermarket_sales.csv", header=True, inferSchema=True)
df.printSchema()
df.show()

# Add a TotalSales column and aggregate it per gender
df_with_total_sales = df.withColumn("TotalSales", col("Quantity") * col("Unit price"))
result = df_with_total_sales.groupBy("Gender").agg(sum("TotalSales").alias("TotalSales"))
result.show()
spark.stop()

OUTPUT:


3. Write a program to create a Spark Session and display DataFrame of employee.json

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("Employee Data Analysis").getOrCreate()
df = spark.read.json("/content/employee.json")
df.printSchema()
df.show()
spark.stop() #To stop the Spark session and release any resources used. It’s a good practice to stop the session
once all operations are complete.

OUTPUT:
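
The program above expects /content/employee.json to exist in the Colab session. If you do not have the file, here is a hypothetical way to create a small sample before running the program; spark.read.json expects one JSON object per line, and these field names are only an assumption, not the contents of the actual lab file.

# Hypothetical sample data; replace with your real employee.json if you have it
sample_lines = [
    '{"name": "Michael", "salary": 3000}',
    '{"name": "Andy", "salary": 4500}',
    '{"name": "Justin", "salary": 3500}',
]

with open("/content/employee.json", "w") as f:
    f.write("\n".join(sample_lines))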


4. Perform Various Operations with SparkSQL

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("SparkSQL Operations").getOrCreate()
df = spark.read.csv("/content/supermarket_sales.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("data_view")
result1 = spark.sql("SELECT * FROM data_view LIMIT 10")
result2 = spark.sql("SELECT `Invoice ID`, COUNT(*) as count FROM data_view GROUP BY `Invoice ID`")
result1.show()
result2.show()
spark.stop()

OUTPUT:
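
The temporary view supports any Spark SQL you need. A couple of further example queries (a hedged sketch; the column names are taken from the supermarket sales dataset as used elsewhere in this manual):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQL Operations").getOrCreate()
df = spark.read.csv("/content/supermarket_sales.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("data_view")

# Average product rating per gender
spark.sql("SELECT Gender, AVG(Rating) AS avg_rating FROM data_view GROUP BY Gender").show()

# Invoices with larger quantities, highest first
spark.sql("SELECT `Invoice ID`, Quantity FROM data_view WHERE Quantity > 5 ORDER BY Quantity DESC").show(10)

spark.stop()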


5. Create a New Data Pipeline with Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Data Pipeline").getOrCreate()
df = spark.read.csv("/content/supermarket_sales.csv", header=True, inferSchema=True)
transformed_df = df.withColumn("NewColumn", df["Unit price"] * 2)
transformed_df.write.mode("overwrite").parquet("/content/output.parquet")
spark.stop()

OUTPUT:

from google.colab import files

files.download('/content/output.parquet')


OUTPUT:


6. Run SQL Queries on the Data in a Parquet Table

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("Parquet SQL Queries").getOrCreate()
df = spark.read.parquet("/content/output.parquet")
df.createOrReplaceTempView("parquet_view")
result = spark.sql("SELECT `Invoice ID`, SUM(`Rating`) as total FROM parquet_view GROUP BY `Invoice ID`")
result.show()
spark.stop()

OUTPUT:


7. Develop Parquet Table to a Platform Data Container


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Parquet to Platform").getOrCreate()
df = spark.read.parquet("/content/output.parquet")
df.write.mode("overwrite").parquet("path/to/your/platform/container")
df.show()
spark.stop()

OUTPUT :


8. Change Data in an Existing Delta Lake Table

# Install Java
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
# Install compatible versions of PySpark and Delta Lake
!pip install pyspark==3.1.3 delta-spark==1.0.0

OUTPUT:

# Set up environment variables


import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'
os.environ['SPARK_HOME'] = '/usr/local/lib/python3.10/dist-packages/pyspark'

# Initialize Spark session with Delta support


from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()


# Create sample data


data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Define the path for the Delta table


delta_table_path = "/content/delta_table"

# Write the DataFrame to a Delta table


df.write.format("delta").mode("overwrite").save(delta_table_path)

# Read the contents of the Delta table


df = spark.read.format("delta").load(delta_table_path)
df.show()

OUTPUT:

# Create new records to add


new_data = [("David", 30), ("Eva", 22)]
new_df = spark.createDataFrame(new_data, columns)
# Append the new DataFrame to the existing Delta table
new_df.write.format("delta").mode("append").save(delta_table_path)
# Show updated data
updated_df = spark.read.format("delta").load(delta_table_path)
print("Updated Data with New Records:")
updated_df.show()


OUTPUT :
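
The steps above append new rows; to change existing rows in the Delta table (which is what this exercise's title refers to), the Delta Lake Python API also supports in-place updates. A hedged sketch using the DeltaTable class from the delta-spark package installed above; the condition and new value are illustrative, and the spark session and delta_table_path from the program above are reused.

from delta.tables import DeltaTable

# Load the existing Delta table by path (uses spark and delta_table_path defined above)
delta_table = DeltaTable.forPath(spark, delta_table_path)

# Update rows that match a condition (illustrative values)
delta_table.update(
    condition="Name = 'Alice'",   # SQL-style predicate selecting the rows to change
    set={"Age": "35"}             # new value, given as a SQL expression string
)

# Confirm the change
spark.read.format("delta").load(delta_table_path).show()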


9. Create a New Ingestion Pipeline with Apache Spark

# Install PySpark
!pip install pyspark

OUTPUT :

# Import necessary libraries


from pyspark.sql import SparkSession

# Create a Spark session


spark = SparkSession.builder \
.appName("Ingestion Pipeline") \
.getOrCreate()

# Sample data
data = [
(1, "Alice", 34),
(2, "Bob", 45),
(3, "Cathy", 29),
]

# Define schema
columns = ["id", "name", "age"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)
df.show()

# Filter data where age is greater than 30


filtered_df = df.filter(df.age > 30)


# Select specific columns


result_df = filtered_df.select("id", "name")
result_df.show()
result_df.write.mode("overwrite").csv("output.csv", header=True)
spark.stop()

OUTPUT :


10. Run SQL Queries on the Data in a NoSQL Table

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("NoSQL SQL Queries").config("spark.cassandra.connection.host", "127.0.0.1").getOrCreate()
df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="your_table", keyspace="your_keyspace") \
    .load()
df.createOrReplaceTempView("nosql_view")
result = spark.sql("SELECT column1, COUNT(*) as count FROM nosql_view GROUP BY column1")
result.show()
spark.stop()
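
Note that reading from Cassandra this way requires the Spark Cassandra connector on the classpath, which is not bundled with PySpark. A hedged sketch of pulling it in when building the session; the connector coordinate shown is an assumption and must match your Spark and Scala versions.

from pyspark.sql import SparkSession

# The spark.jars.packages coordinate below is an assumption; pick the
# spark-cassandra-connector release that matches your Spark/Scala build.
spark = SparkSession.builder \
    .appName("NoSQL SQL Queries") \
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.2.0") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()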

Explanation:
Create a Spark Session:
spark = SparkSession.builder \
    .appName("NoSQL SQL Queries") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()
Initializes a Spark session with Cassandra configuration.

Load Data from Cassandra:


df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="your_table", keyspace="your_keyspace") \
    .load()
Reads data from a Cassandra table into a DataFrame.

Register as SQL View:


df.createOrReplaceTempView("nosql_view")
Registers the DataFrame as a temporary SQL view.

Perform SQL Queries:


result = spark.sql("SELECT column1, COUNT(*) as count FROM nosql_view GROUP BY column1")
Executes an SQL query on the NoSQL view.

ETL SPARK 33
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

Show Results:
result.show()
Displays the results of the SQL query.

Stop the Spark Session:


spark.stop() stops the Spark session.
