0% found this document useful (0 votes)

6 views

m etl spark

Uploaded by

bjxvenrypfgkqdmndz

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

6 views

m etl spark

Uploaded by

bjxvenrypfgkqdmndz

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 34

21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

Table of Contents

Usage of google colab 2

Install Apache Spark on Windows 3 - 11

Apache Spark Architecture 12 - 14

Power Bi 15

1. Write a Program to create a SparkSession and Read Data from a CSV File 16 - 17

2. Write a program to group record of Supermarket’s sales data of Kaggle Dataset by Gender 18 - 19

3. Write a program to create a Spark Session and display DataFrame of employee.json 20

4. Perform Various Operations with SparkSQL 21 - 22

5. 5.Create a New Data Pipeline with Apache Spark 23 - 24

6. 6.Run SQL Queries on the Data in a Parquet Table 25

7. 7. Develop Parquet Table to a Platform Data Container 26

8. 8. Change Data in an Existing Delta Lake Table 27 - 29

9. 9.Create a New Ingestion Pipeline with Apache Spark 30 - 31

10. 10. Run SQL Queries on the Data in NoSQL Table 32 - 33

ETL SPARK 1
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

USAGE OF GOOGLE COLAB

Google Colab is like a fancy notebook for coding that you can use right in your web browser. It helps people
write and run code, especially for data science and machine learning, without needing to set up anything on
their own computer. You can share your work with others, and it’s free to use. Plus, it gives you access to
powerful computers and GPUs (graphics processing units) that make running complex calculations faster.

Google Colab, or "Colaboratory", is a cloud-based machine learning platform that you can use to write and
execute Python code, process data, and create visualizations. It's free for most tasks, and you can also pay for
more demanding needs. Here are some reasons why you might want to use Google Colab.

To install PySpark in Google Colab :

1. Open Google Colab

• Go to Google Colab.

• Create a new notebook by clicking on "File" > "New notebook".

2. Install PySpark Using Pip

• In a new code cell, you will use the pip command to install PySpark.

The ! at the beginning of the command tells Colab to execute this as a shell command, not Python code.

pip install pyspark

pip is a package manager for Python that installs and manages Python packages. install is the
command to install a new package. pyspark is the name of the package you're installing,
which is the Python API for Apache Spark. .

ETL SPARK 2
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

Install Apache Spark on Windows

Installing Apache Spark on Windows 10 may seem complicated to novice users, but this simple tutorial will
have you up and running. If you already have Java 8 and Python 3 installed, you can skip the first two steps.

Step 1: Install Java 8

Apache Spark requires Java 8. You can check to see if Java is installed using the command prompt.

Open the command line by clicking Start > type cmd > click Command Prompt.

Type the following command in the command prompt:

java-version
If Java is installed, it will respond with the following output:

Your version may be different. The second digit is the Java version – in this case, Java 8.

If you don’t have Java installed:

1. Open a browser window, and navigate to https://java.com/en/download/.

2. Click the Java Download button and save the file to a location of your choice.

3. Once the download finishes double-click the file to install Java.

ETL SPARK 3
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

Step 2: Install Python

1. To install the Python package manager, navigate to https://www.python.org/ in your web browser.

2. Mouse over the Download menu option and click Python 3.8.3. 3.8.3 is the latest version at the time of
writing the article.

3. Once the download finishes, run the file.

4. Near the bottom of the first setup dialog box, check off Add Python 3.8 to PATH. Leave the other
box checked.

5. Next, click Customize installation.

6. You can leave all boxes checked at this step, or you can uncheck the options you do not want.

ETL SPARK 4
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

7. Click Next.

8. Select the box Install for all users and leave other boxes as they are.

9.Under Customize install location, click Browse and navigate to the C drive. Add a new folder and name it
Python.

10.Select that folder and click OK.

11.Click Install, and let the installation complete.

12.When the installation completes, click the Disable path length limit option at the bottom and then click
Close.

13.If you have a command prompt open, restart it. Verify the installation by checking the version of Python:
python --version

The output should print Python

3.8.3. Step 3: Download Apache

Spark

1. Open a browser and navigate to https://spark.apache.org/downloads.html.

2. Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview
version.

ETL SPARK 5
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

• In our case, in Choose a Spark release drop-down menu select 2.4.5 (Feb 05 2020).

ETL SPARK 6
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

• In the second drop-down Choose a package type, leave the selection Pre-built for Apache Hadoop
2.7.

3. Click the spark-2.4.5-bin-hadoop2.7.tgz link.

4. A page with a list of mirrors loads where you can see different servers to download from. Pick any from the
list and save the file to your Downloads folder.

Step 4: Verify Spark Software File

1. Verify the integrity of your download by checking the checksum of the file. This ensures you are working
with unaltered, uncorrupted software.

2. Navigate back to the Spark Download page and open the Checksum link, preferably in a new tab.

3. Next, open a command line and enter the following command:

certutil -hashfile c:\users\username\Downloads\spark-3.5.3-bin-hadoop3.tgz SHA512

4. Change the username to your username. The system displays a long alphanumeric code,
along with the message Certutil: -hashfile completed successfully.

ETL SPARK 7
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

5. Compare the code to the one you opened in a new browser tab. If they match, your
download file is uncorrupted.

Step 5: Install Apache Spark

Installing Apache Spark involves extracting the downloaded file to the desired location.

1. Create a new folder namesd Spark in the root of your C: drive. From a command line, enter the following:
cd \ mkdir Spark

2. In Explorer, locate the Spark file you downloaded.

3. Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).

4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the necessary files inside.

Step 6: Add winutils.exe File

Download the winutils.exe file for the underlying Hadoop version for the Spark installation you downloaded.

1. Navigate to this URL https://github.com/cdarlint/winutils and inside the bin folder, locate winutils.exe, and
click it.

2. Find the Download button on the right side to download the file.

3. Now, create new folders Hadoop and bin on C: using Windows Explorer or the Command Prompt.

4. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.

Step 7: Configure Environment Variables

Configuring environment variables in Windows adds the Spark and Hadoop locations to your system PATH.
It allows you to run the Spark shell directly from a command prompt window.

1. Click Start and type environment.

ETL SPARK 8
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

2. Select the result labeled Edit the system environment variables.

3. A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then
click New in the next window.

4. For Variable Name type SPARK_HOME.

5. For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you changed the folder
path, use that one instead.

6. In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid
deleting any entries already on the list.

7. You should see a box with entries on the left. On the right, click New.

ETL SPARK 9
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

8. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-2.4.5-bin-
hadoop2.7\bin.
We recommend using %SPARK_HOME%\bin to avoid possible issues with the path.

Spark.

The system should display several lines indicating the status of the applicati

9. Repeat this process for Hadoop and Java.

ETL SPARK 10
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

10. For Hadoop, the variable name is HADOOP_HOME and for the value use the path of the folder you
created earlier: C:\hadoop. Add C:\hadoop\bin to the Path variable field, but we recommend
using
%HADOOP_HOME%\bin.
11. For Java, the variable name is JAVA_HOME and for the value use the path to your Java JDK directory (in
our case it’s C:\Program Files\Java\jdk1.8.0_251).

12. Click OK to close all open windows.

Step 8: Launch Spark

1. Open a new command-prompt window using the right-click and Run as administrator:

2. To start Spark, enter

C:\Spark\spark-2.4.5-bin-hadoop2.7\bin\spark-shell

3. If you set the environment path correctly, you can type spark-shell to launch on. You may get a Java pop-
up. Select Allow access to continue.

Finally, the Spark logo appears, and the prompt displays the Scala shell.

ETL SPARK 11
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

4., Open a web browser and navigate to http://localhost:4040/.

5. You can replace localhost with the name of your system.

6. You should see an Apache Spark shell Web UI. The example below shows the Executors page.

7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.

ETL SPARK 12
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

Apache Spark Architecture

Apache Spark is known for its fast processing capabilities compared to MapReduce. It is fast because it runs
on Memory(RAM) which helps it to process data more quickly than on Disk. Apache Spark offers a wide
range of capabilities, allowing users to perform multiple operations such as creating data pipelines, integrating
data from various sources, running machine learning models, working with graphs, executing SQL queries,
and more.

Spark Architecture, an open-source, framework-based component that processes a large amount of

unstructured, semi-structured, and structured data for analytics, is utilised in Apache Spark. Apart from
Hadoop and map-reduce architectures for big data processing, Apache Spark’s architecture is regarded as an
alternative. The RDD and DAG, Spark’s data storage and processing framework, are utilised to store and
process data, respectively. Spark architecture consists of four components, including the spark driver,
executors, cluster administrators, and worker nodes. It uses the Dataset and data frames as the fundamental
data storage mechanism to optimise the Spark process and big data computation.

Architecture of Apache Spark:

Apache Spark has a master-slave architecture consisting of a driver program and cluster of worker nodes.

ETL SPARK 13
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

Spark applications run in clusters, where each worker node contains executors responsible for executing tasks
in parallel and storing data in memory across multiple machines, enhancing speed and fault tolerance. Spark’s
cluster manager (e.g., YARN, Mesos, or Kubernetes) allocates resources and coordinates between driver and
workers. Additionally, Spark uses the Resilient Distributed Dataset (RDD) abstraction, which ensures fault
tolerance by tracking data transformations, enabling the system to recompute lost data without full reruns.

When the Driver Program in the Apache Spark architecture executes, it calls the real program of an application
and creates a SparkContext. SparkContext contains all of the basic functions. The Spark Driver includes
several other components, including a DAG Scheduler, Task Scheduler, Backend Scheduler, and Block
Manager, all of which are responsible for translating user-written code into jobs that are actually executed on
the cluster.The Cluster Manager manages the execution of various jobs in the cluster. Spark Driver works in
conjunction with the Cluster Manager to control the execution of various other jobs. The cluster Manager does
the task of allocating resources for the job. Once the job has been broken down into smaller jobs, which are
then distributed to worker nodes, Spark driver will control the execution. Many worker nodes can be used to
process an RDD created in the SparkContext, and the results can also be cached.The Spark Context receives
task information from the Cluster Manager and enqueues it on worker nodes.The executor is in charge of
carrying out these duties. The lifespan of executors is the same as that of the Spark Application. We can
increase the number of workers if we want to improve the performance of the system. In this way, we can
divide jobs into more coherent parts.

ETL SPARK 14
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

Apache Spark Features

Apache Spark, a popular cluster computing framework, was created in order to accelerate data processing
applications. Spark, which enables applications to run faster by utilising in-memory cluster computing, is a
popular open source framework.

A cluster is a collection of nodes that communicate with each other and share data. Because of implicit data
parallelism and fault tolerance, Spark may be applied to a wide range of sequential and interactive
processing demands.
1. Speed: Spark performs up to 100 times faster than MapReduce for processing large amounts of data. It is
also able to divide the data into chunks in a controlled way.
2. Powerful Caching: Powerful caching and disk persistence capabilities are offered by a simple
programming layer.
3. Deployment: Mesos, Hadoop via YARN, or Spark’s own cluster manager can all be used to deploy it.
4. Real-Time: Because of its in-memory processing, it offers real-time computation and low latency.
5. Polyglot: In addition to Java, Scala, Python, and R, Spark also supports all four of these languages. You
can write Spark code in any one of these languages. Spark also provides a command-line interface in Scala
and Python.

ETL SPARK 15
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

POWER BI

Power BI is a Microsoft business analytics tool that enables users to visualize data and
share insights through interactive reports and dashboards. It connects to various data sources,
allowing for data transformation and modeling using Power Query and DAX. With a wide
range of visualization options, Power BI facilitates data-driven decision-making and
collaboration, makingit easy to share reports within organizations and access them via mobile
devices.

Power BI can integrate with Apache Spark in an ETL (Extract, Transform, Load) pipeline
to leverage Spark's distributed data processing power and Power BI’s data visualization
capabilities. Here's how it works:

1. Extract: Data is collected from various sources (e.g., databases, cloud storage, or
APIs) into Apache Spark. Spark can handle large-scale, diverse datasets efficiently through its
parallel processing framework.

2. Transform: In Spark, data is cleaned, transformed, and processed using operations like
filtering, aggregation, and joining across distributed datasets. This is where Spark shines,
handling both batch and real-time streaming data.

3. Load into Power BI: After transformation, the processed data can be loaded into Power
BI for visualization. Power BI connects to Spark via:
 Spark ODBC/JDBC: Power BI can connect to Spark using ODBC/JDBC drivers to
query and fetch data directly from the Spark cluster.

 Azure Synapse or Databricks: If Spark is running on Azure Synapse or Azure

Databricks, Power BI can natively integrate and query the data from these services.

4. Visualization and Analysis: Power BI pulls the processed data from Spark, allowing users to
build interactive dashboards, perform real-time analytics, and visualize trends.

ETL SPARK 16
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

1. Write a Program to create a SparkSession and Read Data from a CSV File

File Prerequisites:

OUTPUT:

Explanation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Read CSV").getOrCreate()
df.show()

from pyspark.sql import SparkSession

This line imports the SparkSession class from the pyspark.sql module. SparkSession is the entry point to
programming with Spark SQL. It allows you to create DataFrame and execute SQL queries.

spark = SparkSession.builder.appName("Read CSV").getOrCreate()

SparkSession.builder starts the process of creating a new Spark session.
.appName("Read CSV") sets the name of the Spark application to "Read CSV". This name will be displayed in
the Spark web UI, which can be helpful for debugging and monitoring.
.getOrCreate() method either retrieves an existing Spark session or creates a new one if none exists. This is
useful to ensure that you don't accidentally create multiple Spark sessions in the same application.

ETL SPARK 17
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

df = spark.read.csv("/content/supermarket_sales.csv",header=True, inferSchema=True)

spark.read.csv is a method of the SparkSession object to read a CSV file.

"/content/supermarket_sales.csv" is the path to the CSV file you want to read. You need to replace this with the
actual path to your CSV file.

inferSchema=True parameter tells Spark to automatically infer the data types of the columns based on the
values in the CSV file.

df.show()
It will display the content of the DataFrame. This is a method of the DataFrame class that prints the first 20
rows of the DataFrame to the console.

OUTPUT:

ETL SPARK 18
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

2. Write a program to group record of Supermarket’s sales data of Kaggle Dataset by

Gender

Prerequisites:
Before running the code, ensure you have PySpark installed. You can install it using:
pip install pyspark

PySpark Program to Group Data by Gender :

from pyspark.sql import SparkSession

from pyspark.sql.functions import col, sum
spark = SparkSession.builder.appName("Supermarket Sales Analysis").getOrCreate()
df.printSchema()
df.show()
df_with_total_sales = df.withColumn("TotalSales", col("Quantity") * col("Unit price"))
result = df_with_total_sales.groupBy("Gender").agg(sum("TotalSales").alias("TotalSales"))
result.show()
spark.stop()

OUTPUT:

ETL SPARK 19
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

ETL SPARK 20
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

3. Write a program to create a Spark Session and display DataFrame of employee.json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Employee Data Analysis").getOrCreate()
df = spark.read.json("/content/employee.json")
df.printSchema()
df.show()
spark.stop() #To stop the Spark session and release any resources used. It’s a good practice to stop the session
once all operations are complete.

OUTPUT:

ETL SPARK 21
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

4. Perform Various Operations with SparkSQL

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(“SparkSQL Operations”).getOrCreate()
df = spark.read.csv(“ /content/supermarket_sales.csv”, header=True, inferSchema=True)
df.createOrReplaceTempView(“data_view”)
result1 = spark.sql(“SELECT * FROM data_view LIMIT 10”)
result2 = spark.sql(“SELECT `Invoice ID`, COUNT(*) as count FROM data_view GROUP BY
`Invoice ID`”)
result1.show()
result2.show()
spark.stop()

output:

ETL SPARK 22
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

ETL SPARK 23
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

5.Create a New Data Pipeline with Apache Spark

from google.colab import files

files.download('/content/output.parquet')

OUTPUT:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(“Data Pipeline”).getOrCreate()
df = spark.read.csv(“/content/supermarket_sales .csv", header=True, inferSchema=True)
transformed_df= df.withColumn(“NewColumn”, df[“Unit price”] * 2)
transformed_ df.write.mode (“overwrite”). Parquet(“/content/output.parquet”)
spark.stop()

ETL SPARK 24
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

OUTPUT:

ETL SPARK 25
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

6.Run SQL Queries on the Data in a Parquet Table

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(“Parquet SQL Queries”).getOrCreate()
df = spark.read.parquet(“/content/output.parquet”)
df.createOrReplaceTempView(“parquet_view”)
result = spark.sql("SELECT `Invoice ID`, SUM(`Rating`) as total FROM parquet_view
GROUP BY `Invoice ID`")
result.show()
spark.stop()

OUTPUT:

ETL SPARK 26
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

7. Develop Parquet Table to a Platform Data Container

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Parquet to Platform").getOrCreate()
df = spark.read.parquet("/content/output.parquet")
df.write.mode("overwrite").parquet("path/to/your/platform/container")
df.show()
spark.stop()

OUTPUT :

ETL SPARK 27
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

8. Change Data in an Existing Delta Lake Table

# Install Java
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
# Install compatible versions of PySpark and Delta Lake
!pip install pyspark==3.1.3 delta-spark==1.0.0

Output :

# Set up environment variables

import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'
os.environ['SPARK_HOME'] =
'/usr/local/lib/python3.10/dist-packages/pyspark'

# Initialize Spark session with Delta support

from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("DeltaLakeExample") \
.config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0") \
.config("spark.sql.extensions",
"delta.sql.DeltaSparkSessionExtensions") \
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()

ETL SPARK 28
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

# Create sample data

data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Define the path for the Delta table

delta_table_path = "/content/delta_table"

# Write the DataFrame to a Delta table

df.write.format("delta").mode("overwrite").save(delta_table_path)

#read contents of data table

df = spark.read.format("delta").load(delta_table_path)
df.show()

output :

# Create new records to add

new_data = [("David", 30), ("Eva", 22)]
new_df = spark.createDataFrame(new_data, columns)
# Append the new DataFrame to the existing Delta table
new_df.write.format("delta").mode("append").save(delta_table_path)
# Show updated data
updated_df = spark.read.format("delta").load(delta_table_path)
print("Updated Data with New Records:")
updated_df.show()

ETL SPARK 29
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

OUTPUT :

ETL SPARK 30
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

9.Create a New Ingestion Pipeline with Apache Spark

# Install PySpark
!pip install pyspark

OUTPUT :

# Import necessary libraries

from pyspark.sql import SparkSession

# Create a Spark session

spark = SparkSession.builder \
.appName("Ingestion Pipeline") \
.getOrCreate()

# Sample data
data = [
(1, "Alice", 34),
(2, "Bob", 45),
(3, "Cathy", 29),
]

# Define schema
columns = ["id", "name", "age"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)
df.show()

# Filter data where age is greater than 30

filtered_df = df.filter(df.age > 30)

ETL SPARK 31
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

# Select specific columns

result_df = filtered_df.select("id", "name")
result_df.show()
result_df.write.mode("overwrite").csv("output.csv",
header=True) spark.stop()

OUTPUT :

ETL SPARK 32
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

10. Run SQL Queries on the Data in NoSQL Table

from pyspark.sql import SparkSession

spark=SparkSession.builder.appName(“NoSQL SQL
Queries”).config(“spark.cassandra.connection.host”,”127.0.0.1”).getOrCreate()
df = spark.read.format(“org.apache.spark.sql.cassandra” \
.options(table=”your_table”, keyspace= “your_keyspace”) \
.load()
df.createOrReplaceTempView(“nosql_view”)
result = spark.sql(“SELECT column1, COUNT(*) as count FROM nosql_view GROUP BY column1”)
result.show()
spark.stop()

Explanation:
Create a Spark Session:
spark = SparkSession.builder \
.appName(“NoSQL SQL Queries”) \
. config(“spark.cassandra.connection.host”,”127.0.0.1”).getOrCreate().
Initializes a Spark session with Cassandra configuration.

Load Data from Cassandra:

df = spark.read.format(“org.apache.spark.sql.cassandra” \
.options(table=”your_table”, keyspace= “your_keyspace”) \
.load()
Reads data from a Cassandra table into a DataFrame.

Register as SQL View:

df.createOrReplaceTempView(“nosql_view”)
Registers the DataFrame as a temporary SQL view

Perform SQL Queries:

result = spark.sql(“SELECT column1, COUNT(*) as count FROM nosql_view GROUP BY column1”)
Executes an SQL query on the NoSQL view.

ETL SPARK 33
21K61A0586 SASI INSTITUTE OF TECHNOLOGY & ENGINEERING

Show Results:
result.show()
Displays the results of the SQL query.

Stop the Spark Session:

spark.stop() will stops the Spark session.

ETL SPARK 34

Install Spark On Windows 10-MacOS
No ratings yet
Install Spark On Windows 10-MacOS
23 pages
Machine Learning: Hands-On for Developers and Technical Professionals
From Everand
Machine Learning: Hands-On for Developers and Technical Professionals
Jason Bell
No ratings yet
Fast Data Processing with Spark 2 - Third Edition
From Everand
Fast Data Processing with Spark 2 - Third Edition
Krishna Sankar
No ratings yet
The Little Book of Sitecore® Tips: Volume 1
From Everand
The Little Book of Sitecore® Tips: Volume 1
Neil P Shack
No ratings yet
(DUMP) HCIA Cloud Computing Latest
No ratings yet
(DUMP) HCIA Cloud Computing Latest
198 pages
Oracle RAC Administration
100% (2)
Oracle RAC Administration
26 pages
Installation Et Configuration de Spark
No ratings yet
Installation Et Configuration de Spark
14 pages
Introduction to Spark
No ratings yet
Introduction to Spark
7 pages
Spark Interview Questions PDF 2
No ratings yet
Spark Interview Questions PDF 2
19 pages
Spark Installation Guide
No ratings yet
Spark Installation Guide
6 pages
Setting Up SPARK On Local Environment
No ratings yet
Setting Up SPARK On Local Environment
19 pages
Install Pyspark On Windows, Mac & Linux - DataCamp - 1
No ratings yet
Install Pyspark On Windows, Mac & Linux - DataCamp - 1
18 pages
Spark Interview Questions PDF 2
No ratings yet
Spark Interview Questions PDF 2
19 pages
How To Install Apache Spark On Windows 10
No ratings yet
How To Install Apache Spark On Windows 10
19 pages
steps
No ratings yet
steps
5 pages
Installing Spark
No ratings yet
Installing Spark
20 pages
Apache Spark
No ratings yet
Apache Spark
8 pages
Install+Apache+Spark+in+a+Standalone+Mode+on+Windows
No ratings yet
Install+Apache+Spark+in+a+Standalone+Mode+on+Windows
11 pages
Installing Spark On A Windows PC: Ukdataservice - Ac.uk
No ratings yet
Installing Spark On A Windows PC: Ukdataservice - Ac.uk
15 pages
Spark Installation On Windows Machine
No ratings yet
Spark Installation On Windows Machine
2 pages
Spark: Big Data Cluster Computing in Production
From Everand
Spark: Big Data Cluster Computing in Production
Ilya Ganelin
No ratings yet
DevOps. How to build pipelines with Jenkins, Docker container, AWS ECS, JDK 11, git and maven 3?
From Everand
DevOps. How to build pipelines with Jenkins, Docker container, AWS ECS, JDK 11, git and maven 3?
John Edward Cooper Berg
No ratings yet
Apache Spark Installation
No ratings yet
Apache Spark Installation
4 pages
Practical Play Framework: Focus on what is really important
From Everand
Practical Play Framework: Focus on what is really important
Alberto Souza
No ratings yet
Unit 4 Spark Updated
No ratings yet
Unit 4 Spark Updated
86 pages
Softwares
No ratings yet
Softwares
3 pages
Spark Overview: Security
No ratings yet
Spark Overview: Security
4 pages
Installing Apache Spark and Scala: Windows
No ratings yet
Installing Apache Spark and Scala: Windows
3 pages
Spark Python Install
No ratings yet
Spark Python Install
3 pages
A concise guide to PHP MySQL and Apache
From Everand
A concise guide to PHP MySQL and Apache
alasdair gilchrist
4/5 (2)
Big+Data+with+Apache+Spark+3+and+Python+From+Zero+to+Expert
No ratings yet
Big+Data+with+Apache+Spark+3+and+Python+From+Zero+to+Expert
28 pages
Integration of Python With Hadoop and Spark
No ratings yet
Integration of Python With Hadoop and Spark
10 pages
How to Deploy Any Web Application to the Apple App Store
From Everand
How to Deploy Any Web Application to the Apple App Store
Michael D Callaghan
No ratings yet
Step 1: Verifying Java Installation: Download Scala
No ratings yet
Step 1: Verifying Java Installation: Download Scala
3 pages
TensorFlow Developer Certificate Exam Practice Tests 2024 Made Easy
From Everand
TensorFlow Developer Certificate Exam Practice Tests 2024 Made Easy
Mr Troy
No ratings yet
Installation Guide For Spark
No ratings yet
Installation Guide For Spark
4 pages
Installation Guide For Spark
No ratings yet
Installation Guide For Spark
4 pages
PowerShell Pro: Advanced Strategies and Best Practices for Harnessing the Power of PowerShell in Enterprise Environments
From Everand
PowerShell Pro: Advanced Strategies and Best Practices for Harnessing the Power of PowerShell in Enterprise Environments
Ryan Campbell
No ratings yet
Spark With Python Notes
No ratings yet
Spark With Python Notes
206 pages
Part-B Assignment No. 3
No ratings yet
Part-B Assignment No. 3
5 pages
7 Steps For A Developer To Learn Apache Spark
No ratings yet
7 Steps For A Developer To Learn Apache Spark
30 pages
06-Apache Spark
No ratings yet
06-Apache Spark
75 pages
JAVASCRIPT FRONT END PROGRAMMING: Crafting Dynamic and Interactive User Interfaces with JavaScript (2024 Guide for Beginners)
From Everand
JAVASCRIPT FRONT END PROGRAMMING: Crafting Dynamic and Interactive User Interfaces with JavaScript (2024 Guide for Beginners)
DAISY JOHNSTON
No ratings yet
7 Steps For A Developer To Learn Apache Spark
No ratings yet
7 Steps For A Developer To Learn Apache Spark
30 pages
DEV3600SlideGuide PDF
No ratings yet
DEV3600SlideGuide PDF
555 pages
Rewwww
No ratings yet
Rewwww
12 pages
reStructuredText for Sphinx
From Everand
reStructuredText for Sphinx
Vimalkumar Velayudhan
No ratings yet
Apache Spark Engine
100% (1)
Apache Spark Engine
82 pages
Object-Oriented Programming Made Simple: A Practical Guide with Java Examples
From Everand
Object-Oriented Programming Made Simple: A Practical Guide with Java Examples
William E. Clark
No ratings yet
The Mac Terminal Reference and Scripting Primer
From Everand
The Mac Terminal Reference and Scripting Primer
Jay Docherty
4.5/5 (3)
The Definitive Guide to Getting Started with OpenCart 2.x
From Everand
The Definitive Guide to Getting Started with OpenCart 2.x
iSenseLabs
No ratings yet
Learn SQL using MySQL in One Day and Learn It Well: SQL for beginners with Hands-on Project
From Everand
Learn SQL using MySQL in One Day and Learn It Well: SQL for beginners with Hands-on Project
Jamie Chan
No ratings yet
Docker Tutorial for Beginners: Learn Programming, Containers, Data Structures, Software Engineering, and Coding
From Everand
Docker Tutorial for Beginners: Learn Programming, Containers, Data Structures, Software Engineering, and Coding
Andrew Lee
3/5 (2)
BIG DATA ANALYTICS Lab Manual
No ratings yet
BIG DATA ANALYTICS Lab Manual
51 pages
Ewwww
No ratings yet
Ewwww
12 pages
Kubernetes: Build and Deploy Modern Applications in a Scalable Infrastructure. The Complete Guide to the Most Modern Scalable Software Infrastructure.: Docker & Kubernetes, #2
From Everand
Kubernetes: Build and Deploy Modern Applications in a Scalable Infrastructure. The Complete Guide to the Most Modern Scalable Software Infrastructure.: Docker & Kubernetes, #2
Jordan Lioy
No ratings yet
Oracle APEX Reporting Tips & Tricks
From Everand
Oracle APEX Reporting Tips & Tricks
George Bara
2/5 (1)
Extending Docker
From Everand
Extending Docker
Russ McKendrick
5/5 (1)
Mastering Nikto: A Comprehensive Guide to Web Vulnerability Scanning: Security Books
From Everand
Mastering Nikto: A Comprehensive Guide to Web Vulnerability Scanning: Security Books
Erwin Dirks
No ratings yet
DT-1. Familiarization With AIML Platforms
No ratings yet
DT-1. Familiarization With AIML Platforms
25 pages
645a183e12b85
No ratings yet
645a183e12b85
5 pages
Iouu
No ratings yet
Iouu
12 pages
Sakhr - Chaib - Paper On Data Mining
No ratings yet
Sakhr - Chaib - Paper On Data Mining
3 pages
Sap Suse Linux
No ratings yet
Sap Suse Linux
136 pages
Openstack Reference Architecture PDF
No ratings yet
Openstack Reference Architecture PDF
59 pages
Review Paper On Big Data Management and Cloud Computing.
No ratings yet
Review Paper On Big Data Management and Cloud Computing.
10 pages
Docu58234 PDF
No ratings yet
Docu58234 PDF
14 pages
Kubernaties
No ratings yet
Kubernaties
36 pages
SAP HANA Workload on Azure
No ratings yet
SAP HANA Workload on Azure
1,330 pages
IIHT Oracle DBA
No ratings yet
IIHT Oracle DBA
14 pages
Cluster Computing by Pritam Bhansali
100% (1)
Cluster Computing by Pritam Bhansali
17 pages
IBM TS7700 Virtual Tape Library Education For Technical Sales Level 3 Quiz
No ratings yet
IBM TS7700 Virtual Tape Library Education For Technical Sales Level 3 Quiz
8 pages
Hinted Handoff - System Design
No ratings yet
Hinted Handoff - System Design
8 pages
Managing Raw Disks in AIX To Use With Oracle ASM (Lkdev, Rendev) (Doc ID 1445870.1)
No ratings yet
Managing Raw Disks in AIX To Use With Oracle ASM (Lkdev, Rendev) (Doc ID 1445870.1)
11 pages
Docu53954 - Avamar 7.1 For Oracle User Guide
No ratings yet
Docu53954 - Avamar 7.1 For Oracle User Guide
152 pages
INFOCOM14
No ratings yet
INFOCOM14
6 pages
Akkaatoz: An Architect'S Guide To Designing, Building, and Running Reactive Systems
No ratings yet
Akkaatoz: An Architect'S Guide To Designing, Building, and Running Reactive Systems
35 pages
HPE StoreVirtual VSA Software-C04111621
No ratings yet
HPE StoreVirtual VSA Software-C04111621
13 pages
Practice Questions Vlore
No ratings yet
Practice Questions Vlore
23 pages
Red Hat Openshift Container Storage 4.6
No ratings yet
Red Hat Openshift Container Storage 4.6
39 pages
RHGS QCT Config Size Guide Technology Detail
No ratings yet
RHGS QCT Config Size Guide Technology Detail
34 pages
AKS cluster
No ratings yet
AKS cluster
7 pages
File Share Witness Configuration
No ratings yet
File Share Witness Configuration
20 pages
ANSYS Remote Solve Manager Tutorials 1234 R150
No ratings yet
ANSYS Remote Solve Manager Tutorials 1234 R150
40 pages
Virtualization Structures:Tools and Mechanisms
No ratings yet
Virtualization Structures:Tools and Mechanisms
8 pages
Clustered Data ONTAP® 8.2
No ratings yet
Clustered Data ONTAP® 8.2
323 pages
Icluster Userguide v81 PDF
No ratings yet
Icluster Userguide v81 PDF
629 pages
Big Data Unit 1 Notes
No ratings yet
Big Data Unit 1 Notes
36 pages
PerconaXtraDBCluster-8 0 20-11 3
No ratings yet
PerconaXtraDBCluster-8 0 20-11 3
245 pages
Illumio Core For Kubernetes and OpenShift 21.2.1
No ratings yet
Illumio Core For Kubernetes and OpenShift 21.2.1
103 pages