ETL SPARK
Table of Contents
Power BI
1. Write a Program to create a SparkSession and Read Data from a CSV File
2. Write a program to group records of the Supermarket’s sales data (Kaggle dataset) by Gender
Google Colab is like a fancy notebook for coding that you can use right in your web browser. It helps people
write and run code, especially for data science and machine learning, without needing to set up anything on
their own computer. You can share your work with others, and it’s free to use. Plus, it gives you access to
powerful computers and GPUs (graphics processing units) that make running complex calculations faster.
Google Colab, or "Colaboratory", is a cloud-based notebook environment in which you can write and execute Python code, process data, and create visualizations. It is free for most tasks, and you can pay for more demanding needs. To set up PySpark in Colab:
• Go to Google Colab.
• In a new code cell, you will use the pip command to install PySpark.
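The command to run in the cell is:
!pip install pyspark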
The ! at the beginning of the command tells Colab to execute this as a shell command, not Python code.
pip is a package manager for Python that installs and manages Python packages. install is the
command to install a new package. pyspark is the name of the package you're installing,
which is the Python API for Apache Spark.
Installing Apache Spark on Windows 10 may seem complicated to novice users, but this simple tutorial will
have you up and running. If you already have Java 8 and Python 3 installed, you can skip the first two steps.
Apache Spark requires Java 8. You can check to see if Java is installed using the command prompt.
Open the command line by clicking Start > type cmd > click Command Prompt.
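In the Command Prompt, check the installed version with:
java -version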
Your version may be different. The second digit is the Java version – in this case, Java 8.
2. Click the Java Download button and save the file to a location of your choice.
1. To install the Python package manager, navigate to https://www.python.org/ in your web browser.
2. Mouse over the Download menu option and click Python 3.8.3 (the latest version at the time of writing).
4. Near the bottom of the first setup dialog box, check off Add Python 3.8 to PATH. Leave the other
box checked.
6. You can leave all boxes checked at this step, or you can uncheck the options you do not want.
7. Click Next.
8. Select the box Install for all users and leave other boxes as they are.
9. Under Customize install location, click Browse and navigate to the C drive. Add a new folder and name it Python.
12. When the installation completes, click the Disable path length limit option at the bottom and then click Close.
13. If you have a command prompt open, restart it. Verify the installation by checking the version of Python:
python --version
Spark
2. Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview
version.
• In our case, in the Choose a Spark release drop-down menu, select 2.4.5 (Feb 05 2020).
• In the second drop-down Choose a package type, leave the selection Pre-built for Apache Hadoop
2.7.
4. A page with a list of mirrors loads where you can see different servers to download from. Pick any from the
list and save the file to your Downloads folder.
1. Verify the integrity of your download by checking the checksum of the file. This ensures you are working
with unaltered, uncorrupted software.
2. Navigate back to the Spark Download page and open the Checksum link, preferably in a new tab.
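3. Open a command prompt and run certutil against the downloaded file. This is the standard Windows checksum command; the path below assumes the file was saved to your Downloads folder and uses SHA-512, the digest published on the Spark site:
certutil -hashfile C:\Users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512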
4. Change the username to your username. The system displays a long alphanumeric code,
along with the message Certutil: -hashfile completed successfully.
5. Compare the code to the one you opened in a new browser tab. If they match, your downloaded file is uncorrupted.
Installing Apache Spark involves extracting the downloaded file to the desired location.
1. Create a new folder named Spark in the root of your C: drive. From a command line, enter the following:
cd \
mkdir Spark
3. Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).
4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the necessary files inside.
Download the winutils.exe file for the underlying Hadoop version for the Spark installation you downloaded.
1. Navigate to https://github.com/cdarlint/winutils, open the folder for your Hadoop version, locate winutils.exe inside its bin folder, and click it.
2. Find the Download button on the right side to download the file.
3. Now, create a hadoop folder with a bin subfolder on C: (C:\hadoop\bin) using Windows Explorer or the Command Prompt, and place winutils.exe inside it.
Configuring environment variables in Windows adds the Spark and Hadoop locations to your system PATH.
It allows you to run the Spark shell directly from a command prompt window.
3. A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then
click New in the next window.
4. For Variable Name, type SPARK_HOME.
5. For Variable Value, type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you changed the folder path, use that one instead.
6. In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid
deleting any entries already on the list.
7. You should see a box with entries on the left. On the right, click New.
8. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-2.4.5-bin-
hadoop2.7\bin.
We recommend using %SPARK_HOME%\bin to avoid possible issues with the path.
When you later launch Spark, the system should display several lines indicating the status of the application.
10. For Hadoop, the variable name is HADOOP_HOME and for the value use the path of the folder you created earlier: C:\hadoop. Add C:\hadoop\bin to the Path variable field, but we recommend using %HADOOP_HOME%\bin instead.
11. For Java, the variable name is JAVA_HOME and for the value use the path to your Java JDK directory (in
our case it’s C:\Program Files\Java\jdk1.8.0_251).
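Alternatively, the same three variables can be set from a Command Prompt; a sketch, assuming the paths used above (the Path entries are still best added through the dialog described in the previous steps):
setx SPARK_HOME "C:\Spark\spark-2.4.5-bin-hadoop2.7"
setx HADOOP_HOME "C:\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_251"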
1. Open a new command-prompt window using right-click and Run as administrator, then run:
C:\Spark\spark-2.4.5-bin-hadoop2.7\bin\spark-shell
3. If you set the environment path correctly, you can type spark-shell to launch it. You may get a Java pop-up; select Allow access to continue.
Finally, the Spark logo appears, and the prompt displays the Scala shell.
6. You should see the Apache Spark shell Web UI (by default at http://localhost:4040). The example below shows the Executors page.
7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.
Apache Spark is known for its fast processing capabilities compared to MapReduce. It is fast because it operates in memory (RAM), which helps it process data more quickly than working from disk. Apache Spark offers a wide
range of capabilities, allowing users to perform multiple operations such as creating data pipelines, integrating
data from various sources, running machine learning models, working with graphs, executing SQL queries,
and more.
Spark applications run in clusters, where each worker node contains executors responsible for executing tasks
in parallel and storing data in memory across multiple machines, enhancing speed and fault tolerance. Spark’s
cluster manager (e.g., YARN, Mesos, or Kubernetes) allocates resources and coordinates between driver and
workers. Additionally, Spark uses the Resilient Distributed Dataset (RDD) abstraction, which ensures fault
tolerance by tracking data transformations, enabling the system to recompute lost data without full reruns.
When the Driver Program in the Apache Spark architecture executes, it calls the real program of an application and creates a SparkContext, which contains all of the basic functions. The Spark Driver includes several other components, including a DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, all of which are responsible for translating user-written code into jobs that are actually executed on the cluster. The Cluster Manager manages the execution of the various jobs in the cluster and allocates the resources for each job; the Spark Driver works in conjunction with the Cluster Manager to control their execution. Once a job has been broken down into smaller tasks, which are then distributed to worker nodes, the Spark Driver controls the execution. Many worker nodes can be used to process an RDD created in the SparkContext, and the results can also be cached. The SparkContext receives task information from the Cluster Manager and enqueues it on worker nodes. The executors are in charge of carrying out these tasks, and their lifespan is the same as that of the Spark application. We can increase the number of workers if we want to improve the performance of the system; in this way, we can divide jobs into more coherent parts.
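For reference, a minimal PySpark snippet (a sketch; the application name and data are illustrative) showing the driver-side entry points described above:

from pyspark.sql import SparkSession

# Building a SparkSession starts the driver program; the SparkContext it
# exposes coordinates with the cluster manager.
spark = SparkSession.builder \
    .appName("ArchitectureDemo") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext

# An RDD created through the SparkContext is partitioned across workers;
# cache() keeps the partitions in executor memory for reuse.
rdd = sc.parallelize(range(1, 1001), numSlices=4).cache()
print(rdd.sum())

spark.stop()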
Apache Spark is a popular open-source cluster computing framework created to accelerate data processing applications; it enables applications to run faster by utilising in-memory cluster computing.
A cluster is a collection of nodes that communicate with each other and share data. Because of implicit data parallelism and fault tolerance, Spark can be applied to a wide range of batch and interactive processing demands.
1. Speed: Spark performs up to 100 times faster than MapReduce for processing large amounts of data. It is
also able to divide the data into chunks in a controlled way.
2. Powerful Caching: Powerful caching and disk persistence capabilities are offered by a simple
programming layer.
3. Deployment: Mesos, Hadoop via YARN, or Spark’s own cluster manager can all be used to deploy it.
4. Real-Time: Because of its in-memory processing, it offers real-time computation and low latency.
5. Polyglot: Spark supports Java, Scala, Python, and R, and you can write Spark code in any of these languages. Spark also provides interactive shells in Scala and Python.
POWER BI
Power BI is a Microsoft business analytics tool that enables users to visualize data and
share insights through interactive reports and dashboards. It connects to various data sources,
allowing for data transformation and modeling using Power Query and DAX. With a wide
range of visualization options, Power BI facilitates data-driven decision-making and
collaboration, making it easy to share reports within organizations and access them via mobile
devices.
Power BI can integrate with Apache Spark in an ETL (Extract, Transform, Load) pipeline
to leverage Spark's distributed data processing power and Power BI’s data visualization
capabilities. Here's how it works:
1. Extract: Data is collected from various sources (e.g., databases, cloud storage, or
APIs) into Apache Spark. Spark can handle large-scale, diverse datasets efficiently through its
parallel processing framework.
2. Transform: In Spark, data is cleaned, transformed, and processed using operations like
filtering, aggregation, and joining across distributed datasets. This is where Spark shines,
handling both batch and real-time streaming data.
3. Load into Power BI: After transformation, the processed data can be loaded into Power
BI for visualization. Power BI connects to Spark via:
Spark ODBC/JDBC: Power BI can connect to Spark using ODBC/JDBC drivers to
query and fetch data directly from the Spark cluster.
4. Visualization and Analysis: Power BI pulls the processed data from Spark, allowing users to
build interactive dashboards, perform real-time analytics, and visualize trends.
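A minimal PySpark sketch of such a pipeline (illustrative only: the file paths and column names are assumptions, and Power BI would then read the written output or query the cluster over ODBC/JDBC):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ETLForPowerBI").getOrCreate()

# Extract: read raw data from a source (here, a CSV file)
raw = spark.read.csv("/data/supermarket_sales.csv", header=True, inferSchema=True)

# Transform: filter and aggregate the distributed records
summary = (raw.filter(F.col("Total") > 0)
              .groupBy("Branch", "Gender")
              .agg(F.round(F.sum("Total"), 2).alias("total_sales")))

# Load: write the result where Power BI can pick it up (e.g., a CSV folder or a warehouse table)
summary.write.mode("overwrite").csv("/data/powerbi/sales_summary", header=True)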
1. Write a Program to create a SparkSession and Read Data from a CSV File
File Prerequisites: PySpark must be installed, and the supermarket_sales.csv dataset (from Kaggle) uploaded to the Colab session (e.g., under /content).
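The program itself appeared as a screenshot in the original; a minimal sketch of what it contains, based on the lines explained below:

from pyspark.sql import SparkSession

# Create a SparkSession, the entry point for DataFrame and SQL functionality
spark = SparkSession.builder.appName("ReadCSV").getOrCreate()

# Read the CSV file, treating the first row as a header and inferring column types
df = spark.read.csv("/content/supermarket_sales.csv", header=True, inferSchema=True)

# Display the first 20 rows
df.show()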
OUTPUT:
Explanation:
df = spark.read.csv("/content/supermarket_sales.csv",header=True, inferSchema=True)
The header=True option treats the first row of the file as column names, and inferSchema=True tells Spark to automatically infer the data types of the columns based on the values in the CSV file.
df.show()
It will display the content of the DataFrame. This is a method of the DataFrame class that prints the first 20
rows of the DataFrame to the console.
OUTPUT:
2. Write a program to group records of the Supermarket’s sales data (Kaggle dataset) by Gender

Prerequisites:
Before running the code, ensure you have PySpark installed. You can install it using:
pip install pyspark
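The program itself was included as a screenshot; a minimal sketch of the grouping logic, assuming the supermarket_sales.csv file used earlier and its Gender and Total columns:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesByGender").getOrCreate()

# Read the Kaggle supermarket sales dataset
df = spark.read.csv("/content/supermarket_sales.csv", header=True, inferSchema=True)

# Group the records by Gender and aggregate
grouped = df.groupBy("Gender").agg(
    F.count("*").alias("transactions"),
    F.round(F.sum("Total"), 2).alias("total_sales"),
)
grouped.show()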
OUTPUT:
# Install Java
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
# Install compatible versions of PySpark and Delta Lake
!pip install pyspark==3.1.3 delta-spark==1.0.0
OUTPUT:
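The session setup and sample data were part of the original screenshots; a sketch of the standard Delta Lake session configuration for these versions, together with illustrative data for the step below:

import pyspark
from delta import configure_spark_with_delta_pip

# Configure a SparkSession with the Delta Lake extensions
builder = (pyspark.sql.SparkSession.builder.appName("DeltaLakeDemo")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Sample data and column names (illustrative; the original values were in a screenshot)
data = [(1, "Alice", 34), (2, "Bob", 45), (3, "Cathy", 29)]
columns = ["id", "name", "age"]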
# Create a DataFrame
df = spark.createDataFrame(data, columns)
OUTPUT:
# Install PySpark
!pip install pyspark
OUTPUT :
# Create a SparkSession (assumed; the original screenshot likely included this step)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()

# Sample data
data = [
    (1, "Alice", 34),
    (2, "Bob", 45),
    (3, "Cathy", 29),
]
# Define schema
columns = ["id", "name", "age"]
# Create DataFrame
df = spark.createDataFrame(data, schema=columns)
df.show()
OUTPUT :
Explanation:
Create a Spark Session:
spark = SparkSession.builder \
    .appName("NoSQL SQL Queries") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()
Initializes a Spark session with Cassandra configuration.
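The steps between the session setup and the final result.show() were shown as screenshots; a minimal sketch of reading a Cassandra table and querying it with Spark SQL (the keyspace and table names are assumptions, and the spark-cassandra-connector package must be available to the session):

# Read a Cassandra table into a DataFrame via the spark-cassandra-connector
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="demo", table="users")
      .load())

# Register the DataFrame as a temporary view and run a SQL query against it
df.createOrReplaceTempView("users")
result = spark.sql("SELECT name, age FROM users WHERE age > 30")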
Show Results:
result.show()
Displays the results of the SQL query.