Data Engineering


A

Practical Training Report


on
Data Engineering
Submitted in partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
In
Computer Science & Engineering

Coordinator: Mr. Noman Khan, Asst. Professor
Submitted By: Navaratan Jangid, 20EGJCS070

Department of Computer Science and Engineering


GLOBAL INSTITUTE OF TECHNOLOGY
JAIPUR (RAJASTHAN)-302022
SESSION: 2023-24
Certificate
Acknowledgement
I take this opportunity to express my deep sense of gratitude to my coordinator Mr. Noman
Khan, Assistant Professor, Department of Computer Science and Engineering,
Information Technology and Artificial Intelligence and Data Science, Global Institute of
Technology, Jaipur, for his valuable guidance and cooperation throughout the Practical
Training work. He provided constant encouragement and unceasing enthusiasm at every stage
of the Practical Training work. I am grateful to our respected Principal, Dr. I. C. Sharma,
Global Institute of Technology, for guiding us during the Practical Training period. I express
my indebtedness to Mr. Pradeep Jha, Head of the Department of Computer Science and
Engineering, Information Technology and Artificial Intelligence and Data Science,
Global Institute of Technology, Jaipur, for providing me ample support during my Practical
Training period. Without their support and timely guidance, the completion of my Practical
Training would have seemed a far-fetched dream. In this respect, I find myself lucky to
have mentors of such great potential.

Place: GIT, Jaipur

Navaratan Jangid
20EGJCS070
B.Tech. VII Semester, IV Year, CSE
Abstract

The objective of practical training is to learn about industry practice first-hand and to become
familiar with the working style of a technical professional, so as to adjust easily to an industrial
environment. This report deals with the tools of data engineering, how they relate to one another,
and their general operating principles. Data engineering is a multi-disciplinary field that comprises
learning, statistics, databases, visualisation, optimisation, and information theory. It is slightly
younger than its sibling, data science. Data engineering is a set of operations aimed at creating
interfaces and mechanisms for the flow and access of information. It takes dedicated specialists,
data engineers, to maintain data so that it remains available and usable by others. In short, data
engineers set up and operate the organisation's data infrastructure, preparing it for further
analysis by data analysts and scientists. The first type of data engineering is SQL-focused: the
data is stored and worked on primarily in relational databases, and all of the data processing is
done with SQL or a SQL-based language. Data engineering is a broad field in which a data
engineer transforms data into a useful format for analysis. This report provides a brief
introduction to data engineering.
Table of Contents
Chapter-1
Introduction
    What is Data Engineering?
    The data engineer role
    Data engineer responsibilities
    Data engineer vs. data scientist
Chapter-2
SQL
    What is SQL?
    Why SQL?
    History of SQL
    Process of SQL
    SQL vs No-SQL
    Advantages of SQL
    Disadvantages of SQL
    SQL Commands
    What is SQL Server?
    SQL Server Basic
    SQL Server Views
    Advantages of views
    Managing views in SQL Server
    SQL Server Indexes
    Stored Procedure
    Using SQL constraints
Chapter-3
Azure Data Factory
    How does it work?
    Create a data factory by using the Azure portal
    Create a data factory
    Advanced creation in the Azure portal
    Pipelines and activities in Azure Data Factory and Azure Synapse Analytics
    Creating a pipeline with UI
    Linked services in Azure Data Factory and Azure Synapse Analytics
    Linked service with UI: Azure Data Factory
    Create linked services
    Setting up ADF
    Integration Runtime
    Linked Service
    Data Set
    Source and Sink
    Simple project on ADF
References

Chapter-1
Introduction
What is Data Engineering?
A data engineer is an IT worker whose primary job is to prepare data for analytical or
operational uses. These software engineers are typically responsible for building data
pipelines to bring together information from different source systems. They integrate,
consolidate and cleanse data and structure it for use in analytics applications. They aim to
make data easily accessible and to optimise their organisation's big data ecosystem.
The amount of data an engineer works with varies with the organisation, particularly with
respect to its size. The bigger the company, the more complex the analytics architecture,
and the more data the engineer will be responsible for. Certain industries are more data-
intensive, including healthcare, retail and financial services.
Data engineers work in conjunction with data science teams, improving data transparency
and enabling businesses to make more trustworthy business decisions.

The data engineer role


Data engineers focus on collecting and preparing data for use by data scientists and
analysts. They take on three main roles as follows:
Generalists. Data engineers with a general focus typically work on small teams, doing end-
to-end data collection, intake and processing. They may have more skill than most data
engineers, but less knowledge of systems architecture. A data scientist looking to become
a data engineer would fit well into the generalist role.
A project a generalist data engineer might undertake for a small, metro-area food delivery
service would be to create a dashboard that displays the number of deliveries made each
day for the past month and forecasts the delivery volume for the following month.
Pipeline-centric engineers. These data engineers typically work on a midsize data analytics
team and more complicated data science projects across distributed systems. Midsize and
large companies are more likely to need this role.
A regional food delivery company might undertake a pipeline-centric project to create a
tool for data scientists and analysts to search metadata for information about deliveries.
They might look at distance driven and drive time required for deliveries in the past month,
then use that data in a predictive algorithm to see what it means for the company's future
business.
Database-centric engineers. These data engineers are tasked with implementing,
maintaining and populating analytics databases. This role typically exists at larger
companies where data is distributed across several databases. The engineers work with
pipelines, tune databases for efficient analysis and create table schemas using extract,
transform, load (ETL) methods. ETL is a process in which data is copied from several
sources into a single destination system.
A database-centric project at a large, multistate or national food delivery service would be
to design an analytics database. In addition to creating the database, the data engineer would
write the code to get data from where it's collected in the main application database into the
analytics database.

Data engineer responsibilities


Data engineers often work as part of an analytics team alongside data scientists. The
engineers provide data in usable formats to the data scientists who run queries and
algorithms against the information for predictive analytics, machine learning and data
mining applications. Data engineers also deliver aggregated data to business executives and
analysts and other end users so they can analyze it and apply the results to improving
business operations.
Data engineers deal with both structured and unstructured data. Structured data is
information that can be organized into a formatted repository like a database. Unstructured
data -- such as text, images, audio and video files -- doesn't conform to conventional data
models. Data engineers must understand different approaches to data architecture and
applications to handle both data types. A variety of big data technologies, such as open
source data ingestion and processing frameworks, are also part of the data engineer's toolkit.

Data engineer vs. data scientist


Data engineers and data scientists work together. The data engineers prepare and organize
the data that companies have in databases and other formats. They also build data pipelines
that make data available to the data scientists. The data scientists use all that data for
analytics and other projects that improve business operations and outcomes.
Data scientists and data engineers differ in their skillsets and focus. Data engineers don't
necessarily have a specific focus; they tend to be competent in several areas and well-
rounded in their knowledge and skills. By contrast, data scientists often have specialized
areas of focus. They are concerned with more exploratory data analysis. Data scientists
tackle new, big-picture problems, while data engineers put the pieces in place to make that
possible.


Chapter- 2
SQL
What is SQL?
SQL is short for Structured Query Language, and it is pronounced as S-Q-L or
sometimes as See-Quell. This database language is mainly designed for maintaining the
data in relational database management systems. It is a special tool used by data
professionals for handling structured data (data which is stored in the form of tables). It is
also designed for stream processing in relational data stream management systems (RDSMS).
You can easily create and manipulate the database, access and modify the table rows and
columns, etc. This query language became an ANSI standard in 1986 and an ISO
standard in 1987.
If you want to get a job in the field of data science, then it is the most important query
language to learn. Big enterprises like Facebook, Instagram, and LinkedIn use SQL for
storing the data in the back-end.

Why SQL?
Nowadays, SQL is widely used in data science and analytics. Following are the reasons
which explain why it is widely used:
• The basic use of SQL for data professionals and SQL users is to insert, update, and delete
the data from the relational database.
• SQL allows the data professionals and users to retrieve the data from the relational
database management systems.
• It also helps them to describe the structured data.
• It allows SQL users to create, drop, and manipulate the database and its tables.
• It also helps in creating the view, stored procedure, and functions in the relational
database.
• It allows you to define the data and modify that stored data in the relational database.
• It also allows SQL users to set the permissions or constraints on table columns, views,
and stored procedures.
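As a simple illustration of these operations, the following statements work on a hypothetical employees table (the table and column names are only examples):
CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    emp_name VARCHAR(50),
    salary DECIMAL(10, 2)
);
-- insert, update, query, and delete data
INSERT INTO employees (emp_id, emp_name, salary) VALUES (1, 'Asha', 55000.00);
UPDATE employees SET salary = 60000.00 WHERE emp_id = 1;
SELECT emp_id, emp_name, salary FROM employees WHERE salary > 50000;
DELETE FROM employees WHERE emp_id = 1;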

History of SQL
"A Relational Model of Data for Large Shared Data Banks" was a paper which was
published by the great computer scientist "E.F. Codd" in 1970.
The IBM researchers Raymond Boyce and Donald Chamberlin originally developed the
SEQUEL (Structured English Query Language) after learning from the paper given by E.F.
Codd. They both developed the SQL at the San Jose Research laboratory of IBM
Corporation in 1970.
At the end of the 1970s, relational software Inc. developed their own first SQL using the
concepts of E.F. Codd, Raymond Boyce, and Donald Chamberlin. This SQL was totally
based on RDBMS. Relational Software Inc., which is now known as Oracle Corporation,
introduced the Oracle V2 in June 1979, which is the first implementation of SQL language.
This Oracle V2 version operates on VAX computers.

Process of SQL
When we execute an SQL command on any relational database management
system, the system automatically finds the best routine to carry out our request, and
the SQL engine determines how to interpret that particular command.
Structured Query Language contains the following four components in its process:
o Query Dispatcher
o Optimization Engines
o Classic Query Engine
o SQL Query Engine, etc.
A classic query engine allows data professionals and users to handle non-SQL queries.

Figure 1: Architecture of SQL

SQL vs No-SQL

1. SQL is a relational database management system, while No-SQL is a non-relational or distributed database management system.
2. The query language used in SQL database systems is the structured query language, while No-SQL database systems use non-declarative query languages.
3. The schema of SQL databases is predefined, fixed, and static, while the schema of No-SQL databases is dynamic and suited to unstructured data.
4. SQL databases are vertically scalable, while No-SQL databases are horizontally scalable.
5. SQL databases store data in the form of tables, i.e., rows and columns, while No-SQL databases store data in the form of documents, key-value pairs, and graphs.
6. SQL follows the ACID model, while No-SQL follows the BASE model.
7. Complex queries are easily managed in SQL databases, while No-SQL databases cannot handle complex queries.
8. SQL databases are not the best choice for storing hierarchical data, while No-SQL databases are a good option for storing hierarchical data.
9. All SQL databases require object-relational mapping, while many No-SQL databases do not require object-relational mapping.
10. Gauges, CircleCI, and Hootsuite are among the top enterprises using SQL databases, while Airbnb, Uber, and Kickstarter are among the top enterprises using No-SQL databases.
11. SQLite, MS SQL Server, Oracle, PostgreSQL, and MySQL are examples of SQL database systems, while Redis, MongoDB, HBase, BigTable, CouchDB, and Cassandra are examples of No-SQL database systems.

Table 1: SQL vs No-SQL

Advantages of SQL
SQL provides various advantages which make it more popular in the field of data science.
It is a perfect query language which allows data professionals and users to communicate
with the database. Following are the best advantages or benefits of Structured Query
Language:
1. No programming needed
SQL does not require a large number of coding lines for managing the database systems.
We can easily access and maintain the database by using simple SQL syntactical rules.
These simple rules make the SQL user-friendly.
2. High-Speed Query Processing
A large amount of data can be accessed quickly and efficiently from the database by using
SQL queries. Insertion, deletion, and update operations on data are also performed in
less time.
3. Standardized Language
SQL follows the long-established standards of ISO and ANSI, which offer a uniform
platform across the globe to all its users.
4. Portability
The structured query language can be easily used in desktop computers, laptops, tablets,
and even smartphones. It can also be used with other applications according to the user's
requirements.
5. Interactive language
We can easily learn and understand the SQL language. We can also use this language for
communicating with the database because it is a simple query language. This language is
also used for receiving the answers to complex queries in a few seconds.
6. More than one Data View
The SQL language also helps in making the multiple views of the database structure for
the different database users.

Disadvantages of SQL
With the advantages of SQL, it also has some disadvantages, which are as follows:
1. Cost
The operation cost of some SQL versions is high. That's why some programmers cannot
use the Structured Query Language.


2. Interface is Complex
Another big disadvantage is that the interface of Structured Query Language can be
difficult, which makes it hard for some SQL users to use and manage.
3. Partial Database control
The business rules are hidden. So, the data professionals and users who are using this
query language cannot have full database control.

SQL Commands
• SQL commands are instructions. They are used to communicate with the database and to
perform specific tasks, functions, and queries on data.
• SQL can perform various tasks such as creating a table, adding data to tables, dropping a
table, modifying a table, and setting permissions for users.
• Types of SQL Commands
There are five types of SQL commands: DDL, DML, DCL, TCL, and DQL.

Figure 2: Types of SQL command


Data Definition Language (DDL)
• DDL changes the structure of the table, for example creating a table, deleting a table,
altering a table, etc.
• All the commands of DDL are auto-committed, which means they permanently save all
the changes in the database.
Here are some commands that come under DDL:
• CREATE
• ALTER
• DROP
• TRUNCATE
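For example, the following sketch shows each DDL command on a hypothetical students table; remember that these changes are auto-committed:
-- CREATE: define a new table
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(50)
);
-- ALTER: change the table structure by adding a column
ALTER TABLE students ADD email VARCHAR(100);
-- TRUNCATE: remove all rows but keep the table definition
TRUNCATE TABLE students;
-- DROP: remove the table itself
DROP TABLE students;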
Data Manipulation Language
• DML commands are used to modify the database. They are responsible for all forms of
changes in the database.
• DML commands are not auto-committed, which means they do not permanently save
the changes on their own. They can be rolled back.
Here are some commands that come under DML:
• INSERT
• UPDATE
• DELETE
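A minimal sketch of the DML commands, assuming a students table like the one above exists:
-- INSERT: add a new row
INSERT INTO students (student_id, student_name) VALUES (1, 'Ravi');
-- UPDATE: change existing values
UPDATE students SET student_name = 'Ravi Kumar' WHERE student_id = 1;
-- DELETE: remove rows
DELETE FROM students WHERE student_id = 1;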


Data Control Language


DCL commands are used to grant and take back authority from any database user.
Here are some commands that come under DCL:
• Grant
• Revoke
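For instance, assuming a hypothetical database user named report_user, read access to the students table could be granted and then taken back as follows:
-- GRANT: give the user permission to read the table
GRANT SELECT ON students TO report_user;
-- REVOKE: take that permission back
REVOKE SELECT ON students FROM report_user;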
Transaction Control Language
TCL commands can be used only with DML commands such as INSERT, DELETE and
UPDATE.
DDL operations are automatically committed in the database, which is why TCL commands
cannot be used while creating tables or dropping them.
Here are some commands that come under TCL:
• COMMIT
• ROLLBACK
• SAVEPOINT
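A minimal sketch of how these commands wrap DML statements is shown below; the savepoint syntax varies slightly between database systems (SQL Server, for example, uses SAVE TRANSACTION and ROLLBACK TRANSACTION savepoint_name):
BEGIN TRANSACTION;
INSERT INTO students (student_id, student_name) VALUES (2, 'Meena');
-- mark a point that can be rolled back to
SAVEPOINT before_update;
UPDATE students SET student_name = 'Meena S' WHERE student_id = 2;
-- undo everything after the savepoint
ROLLBACK TO SAVEPOINT before_update;
-- make the remaining changes permanent
COMMIT;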
Data Query Language
DQL is used to fetch the data from the database.
It uses only one command:
SELECT

What is SQL Server?


SQL Server is a relational database management system, or RDBMS, developed and
marketed by Microsoft.
Similar to other RDBMS software, SQL Server is built on top of SQL, a standard
programming language for interacting with relational databases. SQL Server is tied to
Transact-SQL, or T-SQL, Microsoft's implementation of SQL that adds a set of
proprietary programming constructs.
SQL Server worked exclusively in the Windows environment for more than 20 years. In
2016, Microsoft made it available on Linux, and SQL Server 2017, which runs on both
Windows and Linux, became generally available in October 2017.
SQL Server consists of two main components:
 Database Engine
 SQLOS
1. Database Engine
The core component of SQL Server is the Database Engine. The Database Engine
consists of a relational engine that processes queries and a storage engine that manages
database files, pages, indexes, etc. Database objects such as stored procedures, views,
and triggers are also created and executed by the Database Engine.
2. Relational Engine
The Relational Engine contains the components that determine the best way to execute a
query. The relational engine is also known as the query processor.
The relational engine requests data from the storage engine based on the input query and
processes the results.
Some tasks of the relational engine include query processing, memory management,
thread and task management, buffer management, and distributed query processing.
3. Storage Engine
The storage engine is in charge of the storage and retrieval of data from storage systems
such as disks and SANs.
4. SQLOS
Under the relational engine and storage engine is the SQL Server Operating System, or
SQLOS.


SQLOS provides many operating system services such as memory and I/O management.
Other services include exception handling and synchronization services.
5. SQL Server Services and Tools
Microsoft provides both data management and business intelligence (BI) tools and
services together with SQL Server.
For data management, SQL Server includes SQL Server Integration Services (SSIS), SQL
Server Data Quality Services, and SQL Server Master Data Services. To develop
databases, SQL Server provides SQL Server Data Tools; and to manage, deploy, and
monitor databases, SQL Server has SQL Server Management Studio (SSMS).
For data analysis, SQL Server offers SQL Server Analysis Services (SSAS). SQL Server
Reporting Services (SSRS) provides reports and visualization of data. The Machine
Learning Services technology first appeared in SQL Server 2016 as R Services and was
later renamed.
6. SQL Server Editions
SQL Server has four primary editions that have different bundled services and tools. Two
editions are available free of charge:
SQL Server Developer edition for use in database development and testing.
SQL Server Express edition for small databases with up to 10 GB of disk storage
capacity.
For larger and more critical applications, SQL Server offers the Enterprise edition that
includes all of SQL Server's features.
SQL Server Standard edition has a partial feature set of the Enterprise edition and limits
on the number of processor cores and the amount of memory that can be
configured.
For detailed information on the editions, check out the available SQL Server 2019
editions.
This section gives a brief overview of SQL Server, including its architecture,
services, tools, and editions.

Figure 3: SQL Server Architecture



SQL Server Basic


1. Querying Data
The SELECT statement is used to select data from a database. The data returned is stored
in a result table, called the result-set.
2. Sorting data
• Order By– sort the result set based on values in a specified list of columns
3. Limiting rows
• OFFSET FETCH – limit the number of rows returned by a query.
• SELECT TOP– limit the number of rows or percentage of rows returned in a
query’s result set.
4. Filtering data
• DISTINCT – select distinct values in one or more columns of a table.
• WHERE– filter rows in the output of a query based on one or more conditions.
• AND – combine two Boolean expressions and return true if all expressions are
true.
• OR– combine two Boolean expressions and return true if either of conditions is
true.
• IN – check whether a value matches any value in a list or a subquery.
• BETWEEN – test if a value is between a range of values.
• LIKE – check if a character string matches a specified pattern.
• COLUMN AND TABLE ALIASES – show you how to use column aliases to
change the heading of the query output and table alias to improve the readability
of a query.
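Putting several of these clauses together, a query against a hypothetical products table (with product_name, brand_id, and list_price columns) might filter, sort, and limit rows like this:
SELECT TOP 10
    product_name AS name,          -- column alias
    list_price
FROM products p                    -- table alias
WHERE list_price BETWEEN 100 AND 500
    AND brand_id IN (1, 2, 3)
    AND product_name LIKE 'A%'
ORDER BY list_price DESC;          -- sort the filtered rows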
5. Joining tables
• JOINS – give you a brief overview of joins types in SQL Server including inner
join, left join, right join and full outer join.
• INNER JOIN – select rows from a table that have matching rows in another table.
• LEFT JOIN – return all rows from the left table and matching rows from the right
table. In case the right table does not have the matching rows, use null values for
the column values from the right table.
• RIGHT JOIN – learn a reversed version of the left join.
• FULL OUTER JOIN – return matching rows from both left and right tables, and
rows from each side if no matching rows exist.
• CROSS JOIN – join multiple unrelated tables and create Cartesian products of
rows in the joined tables.
• SELF JOIN – show you how to use the self-join to query hierarchical data and
compare rows within the same table.
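For example, assuming the production.products and production.brands sample tables used later in the views section, an inner join and a left join look like this:
-- INNER JOIN: only products that have a matching brand
SELECT p.product_name, b.brand_name
FROM production.products p
INNER JOIN production.brands b ON b.brand_id = p.brand_id;

-- LEFT JOIN: all products, with NULL brand_name where no match exists
SELECT p.product_name, b.brand_name
FROM production.products p
LEFT JOIN production.brands b ON b.brand_id = p.brand_id;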
6. Grouping data
• GROUP BY– group the query result based on the values in a specified list of
column expressions.
• HAVING – specify a search condition for a group or an aggregate.
• GROUPING SETS – generates multiple grouping sets.
• CUBE – generate grouping sets with all combinations of the dimension columns.
• ROLL UP – generate grouping sets with an assumption of the hierarchy between
input columns.
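A short sketch of GROUP BY and HAVING over the same sample products table:
-- count products and average price per brand,
-- keeping only brands with more than 10 products
SELECT
    brand_id,
    COUNT(*) AS product_count,
    AVG(list_price) AS avg_price
FROM production.products
GROUP BY brand_id
HAVING COUNT(*) > 10;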
7. Subquery
This section deals with the subquery which is a query nested within another statement
such as SELECT, INSERT, UPDATE or DELETE statement.
• SUB QUERY – explain the subquery concept and show you how to use various
subquery type to select data.
• CORRELATED SUB QUERY – introduce you to the correlated subquery concept.
• EXISTS – test for the existence of rows returned by a subquery.
• ANY – compare a value with a single-column set of values returned by a
subquery and return TRUE if the value matches any value in the set.
• ALL – compare a value with a single-column set of values returned by a subquery
and return TRUE if the value matches all values in the set.
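As an illustration, a correlated subquery with EXISTS might find the brands that have at least one product priced above 1000 (sample tables again):
SELECT b.brand_name
FROM production.brands b
WHERE EXISTS (
    SELECT 1
    FROM production.products p
    WHERE p.brand_id = b.brand_id
        AND p.list_price > 1000
);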
8. Set Operators
This section walks you through of using the set operators including union, intersect, and
except to combine multiple result sets from the input queries.
• UNION – combine the result sets of two or more queries into a single result set.
• INTERSECT – return the intersection of the result sets of two or more queries.
• EXCEPT – find the difference between the two result sets of two input queries.
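A minimal sketch of the three set operators, assuming two hypothetical tables (current_staff and former_staff) with the same column layout:
SELECT emp_name FROM current_staff
UNION          -- rows that appear in either table, duplicates removed
SELECT emp_name FROM former_staff;

SELECT emp_name FROM current_staff
INTERSECT      -- rows that appear in both tables
SELECT emp_name FROM former_staff;

SELECT emp_name FROM current_staff
EXCEPT         -- rows in the first table but not in the second
SELECT emp_name FROM former_staff;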
9. Common Table Expression (CTE)
• CTE – use common table expressions to make complex queries more readable.
• RECURSIVE CTE – query hierarchical data using recursive CTE.
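For example, the grouping query shown earlier can be wrapped in a CTE so that the final query is easier to read:
WITH brand_stats AS (
    SELECT brand_id, COUNT(*) AS product_count
    FROM production.products
    GROUP BY brand_id
)
SELECT brand_id, product_count
FROM brand_stats
WHERE product_count > 10;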
10.PIVOT
• PIVOT – convert rows to columns
11. Modifying data
In this section, you will learn how to change the contents of tables in the SQL Server
database. The SQL commands for modifying data such as insert, delete, and update are
referred to as data manipulation language (DML).
• INSERT – insert a row into a table
• INSERT MULTIPLE ROWS – insert multiple rows into a table using a single
INSERT statement
• INSERT INTO SELECT – insert data into a table from the result of a query.
• UPDATE – change the existing values in a table.
• UPDATE JOIN – update values in a table based on values from another table
using JOIN clauses.
• DELETE – delete one or more rows of a table.
• MERGE – walk you through the steps of performing a mixture of insertion,
update, and deletion using a single statement.
• TRANSACTION – show you how to start a transaction explicitly using the
BEGIN TRANSACTION, COMMIT, and ROLLBACK statements
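Two of the less obvious statements, INSERT INTO ... SELECT and UPDATE with a JOIN, might look like this (product_archive and price_changes are hypothetical tables):
-- copy expensive products into an archive table
INSERT INTO product_archive (product_id, product_name, list_price)
SELECT product_id, product_name, list_price
FROM production.products
WHERE list_price > 1000;

-- update prices using values taken from another table
UPDATE p
SET p.list_price = n.new_price
FROM production.products p
INNER JOIN price_changes n ON n.product_id = p.product_id;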
12. Data definition
This section shows you how to manage the most important database objects including
databases and tables.
• CREATE DATABASE – show you how to create a new database in a SQL Server
instance using the CREATE DATABASE statement and SQL Server
Management Studio.
• DROP DATABASE – learn how to delete existing databases.
• CREATE SCHEMA – describe how to create a new schema in a database.
• ALTER SCHEMA – show how to transfer a securable from a schema to another
within the same database.
• DROP SCHEMA – learn how to delete a schema from a database.
• CREATE TABLE – walk you through the steps of creating a new table in a
specific schema of a database.
• IDENTITY COLUMN – learn how to use the IDENTITY property to create the
identity column for a table.
• SEQUENCE – describe how to generate a sequence of numeric values based on a
specification.
• ALTER TABLE ADD COLUMN – show you how to add one or more columns to
an existing table.
• ALTER TABLE ALTER COLUMN – show you how to change the definition of
existing columns in a table.
• ALTER TABLE DROP COLUMN – learn how to drop one or more columns
from a table.
• COMPUTED COLUMNS – show you how to use computed columns to reuse the
calculation logic in multiple queries.
• DROP TABLE – show you how to delete tables from the database.
• TRUNCATE TABLE – delete all data from a table faster and more efficiently.
• SELECT INTO – learn how to create a table and insert data from a query into it.
• RENAME TABLE – walk you through the process of renaming a table to a new
one.
• TEMPORARY TABLE – introduce you to temporary tables for storing
intermediate data temporarily in stored procedures or a database session.
• SYNONYM – explain synonyms and show you how to create synonyms
for database objects.
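Bringing several of these pieces together, a table with an IDENTITY column and a computed column could be created in a new schema as follows (the hr schema and column names are only illustrative):
CREATE SCHEMA hr;
GO
CREATE TABLE hr.employees (
    emp_id INT IDENTITY(1, 1) PRIMARY KEY,       -- auto-generated key
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    full_name AS (first_name + ' ' + last_name)  -- computed column
);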
13. SQL Server Data Types
• SQL SERVER DATA TYPES – give you an overview of the built-in SQL Server
data types.
• BIT – store bit data i.e., 0, 1, or NULL in the database with the BIT data type.
• INT – learn about various integer types in SQL server including BIGINT, INT,
SMALLINT, and TINYINT.
• DECIMAL – show you how to store exact numeric values in the database by
using DECIMAL or NUMERIC data type.
• CHAR – learn how to store fixed-length, non-Unicode character string in the
database.
• NCHAR – show you how to store fixed-length, Unicode character strings and
explain the differences between CHAR and NCHAR data types
• VARCHAR – store variable-length, non-Unicode string data in the database.
• NVARCHAR – learn how to store variable-length, Unicode string data in a table
and understand the main differences between VARCHAR and NVARCHAR.
• DATETIME2 – illustrate how to store both date and time data in a database.
• DATE – discuss the date data type and how to store the dates in the table.
• TIME – show you how to store time data in the database by using the TIME data
type.
• DATETIMEOFFSET – show you how to manipulate datetime with the time zone.
• GUID – learn about the GUID and how to use the NEWID() function to generate
GUID values.
14. Constraints
• PRIMARY KEY – introduce you to the primary key concept and show you how to
use the primary key constraint to manage a primary key of a table.
• FOREIGN KEY – introduce you to the foreign key concept and show you how to use
the FOREIGN KEY constraint to enforce the link of data in two tables.
• NOT NULL CONSTRAINT – show you how to ensure a column not to accept
NULL.
• UNIQUE CONSTRAINT – ensure that data contained in a column, or a group of
columns, is unique among rows in a table.
• CHECK CONSTRAINT – walk you through the process of adding logic for
checking data before storing them in tables.
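The NOT NULL and CHECK constraints, which are not covered by the syntax examples at the end of this chapter, might be written like this on a hypothetical orders table:
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL,            -- a value is required
    quantity INT CHECK (quantity > 0),   -- reject non-positive values
    order_date DATE NOT NULL
);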
15. Expressions
• CASE – add if-else logic to SQL queries by using simple and searched CASE
expressions.
• COALESCE – handle NULL values effectively using the COALESCE expression.
• NULL IF – return NULL if the two arguments are equal; otherwise, return the
first argument.
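A brief sketch of the three expressions over the sample products table (discount_price is an assumed column):
SELECT
    product_name,
    -- CASE: simple if-else logic
    CASE
        WHEN list_price >= 1000 THEN 'Premium'
        ELSE 'Standard'
    END AS price_band,
    -- COALESCE: first non-NULL value
    COALESCE(discount_price, list_price) AS effective_price,
    -- NULLIF: NULL when the two arguments are equal
    NULLIF(list_price, 0) AS non_zero_price
FROM production.products;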

SQL Server Views


Summary: in this tutorial, you will learn about views and how to manage views such as
creating a new view, removing a view, and updating data of the underlying tables through
a view.
When you use the SELECT statement to query data from one or more tables, you get a
result set.
For example, the following statement returns the product name, brand, and list price of all
products from the products and brands tables:
SELECT
    product_name,
    brand_name,
    list_price
FROM
    production.products p
INNER JOIN production.brands b
    ON b.brand_id = p.brand_id;

Next time, if you want to get the same result set, you can save this query into a text file,
open it, and execute it again.
SQL Server provides a better way to save this query in the database catalog through a
view.
A view is a named query stored in the database catalog that allows you to refer to it later.
So the query above can be stored as a view using the CREATE VIEW statement as
follows:
CREATE VIEW sales.product_info
AS
SELECT
    product_name,
    brand_name,
    list_price
FROM
    production.products p
INNER JOIN production.brands b
    ON b.brand_id = p.brand_id;

Later, you can reference the view in a SELECT statement like a table, as follows:
SELECT * FROM sales.product_info;

When receiving this query, SQL Server executes the following query:
SELECT
    *
FROM (
    SELECT
        product_name,
        brand_name,
        list_price
    FROM
        production.products p
    INNER JOIN production.brands b
        ON b.brand_id = p.brand_id
) AS product_info;

By definition, views do not store data except for indexed views.


A view may consist of columns from multiple tables using joins or just a subset of
columns of a single table. This makes views useful for abstracting or hiding complex
queries.
The following picture illustrates a view that includes columns from multiple tables:

Advantages of views
Generally speaking, views provide the following advantages:
Security
You can restrict users from accessing a table directly and instead allow them to access a
subset of its data via views.
For example, you can allow users to access the customer name, phone, and email via a view
but restrict them from accessing bank account details and other sensitive information.
Simplicity
A relational database may have many tables with complex relationships e.g., one-to-one
and one-to-many that make it difficult to navigate.
However, you can simplify the complex queries with joins and conditions using a set of
views.
Consistency
Sometimes, you need to write a complex formula or logic in every query.
To make it consistent, you can hide the complex queries logic and calculations in views.
Once views are defined, you can reference the logic from the views rather than rewriting
it in separate queries.


Managing views in SQL Server


• Creating a new view – show you how to create a new view in a SQL Server
database.
• Renaming a view – learn how to rename a view using the SQL Server
Management Studio (SSMS) or Transact-SQL command.
• Listing views – discuss the various ways to list all views in a SQL Server
database.
• Getting view information – how to get information about a view.
• Removing a view – guide you how to use the DROP VIEW statement to remove
one or more views from the database.
• Creating an indexed view – show you how to create an indexed view against
tables that have infrequent data modification to optimize the performance of the
view.
SQL Server Indexes

Indexes are special data structures associated with tables or views that help speed up
queries. SQL Server provides two types of indexes: clustered indexes and non-clustered
indexes.
In this section, you will learn everything you need to know about indexes to come up with
a good index strategy and optimise your queries.
• Clustered Indexes – introduction to clustered indexes and learn how to create
clustered indexes for tables.
• Non Clustered Indexes – learn how to create non-clustered indexes using
the CREATE INDEX statement.
• Rename indexes – replace the current index name with the new name using
sp_rename stored procedure and SQL Server Management Studio.
• Disable indexes – show you how to disable indexes of a table to make the indexes
ineffective.
• Enable indexes – learn various statements to enable one or all indexes on a table.
• Unique indexes – enforce the uniqueness of values in one or more columns.
• Drop indexes – describe how to drop indexes from one or more tables.
• Indexes with included columns – describe how to add non-key columns to a non-
clustered index to improve the speed of queries.
• Filtered indexes – create an index on a portion of rows in a table.
• Indexes on computed columns – walk you through how to simulate function-based
indexes using the indexes on computed columns.
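For example, a non-clustered index, a unique index, and a filtered index on the sample products table could be created as follows:
-- non-clustered index to speed up lookups by brand
CREATE INDEX ix_products_brand_id
    ON production.products (brand_id);
-- unique index to enforce unique product names
CREATE UNIQUE INDEX ux_products_name
    ON production.products (product_name);
-- filtered index covering only expensive products
CREATE INDEX ix_products_expensive
    ON production.products (list_price)
    WHERE list_price > 1000;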

Stored Procedure
A stored procedure is a set of SQL statements with an assigned name that can be shared
and reused by multiple programs.
Syntax:
CREATE PROCEDURE procedure_name
    @variable AS datatype = value
AS
    -- Comments
    SELECT * FROM t;
GO

Using SQL constraints


Primary key: set c1 and c2 as the primary key
Syntax:
CREATE TABLE t(
    c1 INT,
    c2 INT,
    c3 VARCHAR,
    PRIMARY KEY (c1, c2)
);
Foreign key: set the c2 column as a foreign key
Syntax:
CREATE TABLE t1(
    c1 INT PRIMARY KEY,
    c2 INT,
    FOREIGN KEY (c2) REFERENCES t2(c2)
);
Unique: make the values in c2 and c3 unique
Syntax:
CREATE TABLE t(
    c1 INT,
    c2 INT,
    c3 INT,
    UNIQUE(c2, c3)
);


Chapter-3
Azure Data Factory

In the world of big data, raw, unorganised data is often stored in relational, non-relational,
and other storage systems. However, on its own, raw data doesn't have the proper context
or meaning to provide meaningful insights to analysts, data scientists, or business decision
makers.
Big data requires a service that can orchestrate and operationalise processes to refine these
enormous stores of raw data into actionable business insights. Azure Data Factory is a
managed cloud service that's built for these complex hybrid extract-transform-load (ETL),
extract-load-transform (ELT), and data integration projects.
Usage scenarios:
For example, imagine a gaming company that collects petabytes of game logs that are
produced by games in the cloud. The company wants to analyse these logs to gain insights
into customer preferences, demographics, and usage behaviour. It also wants to identify up-
sell and cross-sell opportunities, develop compelling new features, drive business growth,
and provide a better experience to its customers.
To analyze these logs, the company needs to use reference data such as customer
information, game information, and marketing campaign information that is in an on-
premises data store. The company wants to utilise this data from the on-premises data store,
combining it with additional log data that it has in a cloud data store.
To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud
(Azure HDInsight), and publish the transformed data into a cloud data warehouse such as
Azure Synapse Analytics to easily build a report on top of it. They want to automate this
workflow, and monitor and manage it on a daily schedule. They also want to execute it
when files land in a blob store container.
Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based
ETL and data integration service that allows you to create data-driven workflows for
orchestrating data movement and transforming data at scale. Using Azure Data Factory,
you can create and schedule data-driven workflows (called pipelines) that can ingest data
from disparate data stores. You can build complex ETL processes that transform data
visually with data flows or by using compute services such as Azure HDInsight Hadoop,
Azure Databricks, and Azure SQL Database.
Additionally, you can publish your transformed data to data stores such as Azure Synapse
Analytics for business intelligence (BI) applications to consume. Ultimately, through Azure
Data Factory, raw data can be organised into meaningful data stores and data lakes for
better business decisions.

How does it work?


Data Factory contains a series of interconnected systems that provide a complete end-to-
end platform for data engineers.


This visual guide provides a detailed overview of the complete Data Factory architecture:

1. Connect and collect


Enterprises have data of various types that are located in disparate sources on-premises, in
the cloud, structured, unstructured, and semi-structured, all arriving at different intervals
and speeds.
The first step in building an information production system is to connect to all the required
sources of data and processing, such as software-as-a-service (SaaS) services, databases,
file shares, and FTP web services. The next step is to move the data as needed to a
centralized location for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write
custom services to integrate these data sources and processing. It's expensive and hard to
integrate and maintain such systems. In addition, they often lack the enterprise-grade
monitoring, alerting, and the controls that a fully managed service can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both
on-premises and cloud source data stores to a centralized data store in the cloud for
further analysis. For example, you can collect data in Azure Data Lake Storage and
transform the data later by using an Azure Data Lake Analytics compute service. You can
also collect data in Azure Blob storage and transform it later by using an Azure HDInsight
Hadoop cluster.
2. Transform and enrich
After data is present in a centralized data store in the cloud, process or transform the
collected data by using ADF mapping data flows. Data flows enable data engineers to build
and maintain data transformation graphs that execute on Spark without needing to
understand Spark clusters or Spark programming.
If you prefer to code transformations by hand, ADF supports external activities for
executing your transformations on compute services such as HDInsight Hadoop, Spark,
Data Lake Analytics, and Machine Learning.
3. CI/CD and publish
Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and
GitHub. This allows you to incrementally develop and deliver your ETL processes before
publishing the finished product. After the raw data has been refined into a business-ready
consumable form, load the data into Azure Data Warehouse, Azure SQL Database, Azure
Cosmos DB, or whichever analytics engine your business users can point to from their
business intelligence tools.
4. Monitor
After you have successfully built and deployed your data integration pipeline, providing
business value from refined data, monitor the scheduled activities and pipelines for success
and failure rates. Azure Data Factory has built-in support for pipeline monitoring via Azure
Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.
5. Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data
factories). Azure Data Factory is composed of the following key components:
• Pipelines
• Activities
• Datasets
• Linked services
• Data Flows
• Integration Runtimes
These components work together to provide the platform on which you can compose data-
driven workflows with steps to move and transform data.
6. Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of
activities that performs a unit of work. Together, the activities in a pipeline perform a task.
For example, a pipeline can contain a group of activities that ingests data from an Azure
blob, and then runs a Hive query on an HDInsight cluster to partition the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead
of managing each one individually. The activities in a pipeline can be chained together to
operate sequentially, or they can operate independently in parallel.
7. Mapping data flows
Create and manage graphs of data transformation logic that you can use to transform any-
sized data. You can build up a reusable library of data transformation routines and execute
those processes in a scaled-out manner from your ADF pipelines. Data Factory will execute
your logic on a Spark cluster that spins up and spins down when you need it. You won't
ever have to manage or maintain clusters.
8. Activity
Activities represent a processing step in a pipeline. For example, you might use a copy
activity to copy data from one data store to another data store. Similarly, you might use a
Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or
analyze your data. Data Factory supports three types of activities: data movement activities,
data transformation activities, and control activities.
9. Datasets
Datasets represent data structures within the data stores, which simply point to or reference
the data you want to use in your activities as inputs or outputs.
10. Linked services
Linked services are much like connection strings, which define the connection information
that's needed for Data Factory to connect to external resources. Think of it this way: a linked
service defines the connection to the data source, and a dataset represents the structure of
the data. For example, an Azure Storage-linked service specifies a connection string to
connect to the Azure Storage account. Additionally, an Azure blob dataset specifies the
blob container and the folder that contains the data.
11. Linked services are used for two purposes in Data Factory:
• To represent a data store that includes, but isn't limited to, a SQL Server database,
Oracle database, file share, or Azure blob storage account. For a list of supported data
stores, see the copy activity article.
• To represent a compute resource that can host the execution of an activity. For
example, the HDInsightHive activity runs on an HDInsight Hadoop cluster. For a list of
transformation activities and supported compute environments, see the transform data
article.


12. Integration Runtime


In Data Factory, an activity defines the action to be performed. A linked service defines a
target data store or a compute service. An integration runtime provides the bridge between
the activity and the linked services. It is referenced by the linked service or activity, and
provides the compute environment where the activity either runs or gets dispatched
from. This way, the activity can be performed in the region closest to the target
data store or compute service, in the most performant way, while meeting security and
compliance needs.
13. Triggers
Triggers represent the unit of processing that determines when a pipeline execution needs
to be kicked off. There are different types of triggers for different types of events.
14. Pipeline runs
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically
instantiated by passing the arguments to the parameters that are defined in pipelines. The
arguments can be passed manually or within the trigger definition.
15. Parameters
Parameters are key-value pairs of read-only configuration. Parameters are defined in the
pipeline. The arguments for the defined parameters are passed during execution from the
run context that was created by a trigger or a pipeline that was executed manually. Activities
within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can
reference datasets and can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information
to either a data store or a compute environment. It is also a reusable/referenceable entity.
16. Control flow
Control flow is an orchestration of pipeline activities that includes chaining activities in a
sequence, branching, defining parameters at the pipeline level, and passing arguments
while invoking the pipeline on-demand or from a trigger. It also includes custom-state
passing and looping containers, that is, For-each iterators.
17. Variables
Variables can be used inside of pipelines to store temporary values and can also be used in
conjunction with parameters to enable passing values between pipelines, data flows, and
other activities.

Create a data factory by using the Azure portal


Prerequisites: Azure subscription

If you don't have an Azure subscription, create a free account before you begin.

Azure roles

To learn about the Azure role requirements to create a data factory, refer to Azure Roles
requirements.

Create a data factory


A quick creation experience is provided in the Azure Data Factory Studio to enable users to
create a data factory within seconds. More advanced creation options are available in the
Azure portal.

1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory
UI is supported only in Microsoft Edge and Google Chrome web browsers.
2. Go to the Azure Data Factory Studio and choose the Create a new data factory
radio button.
3. You can use the default values to create directly, or enter a unique name and choose
a preferred location and subscription to use when creating the new data factory.
4. After creation, you can directly enter the homepage of the Azure Data Factory
Studio.


Advanced creation in the Azure portal


Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is
supported only in Microsoft Edge and Google Chrome web browsers.
1. Go to the Azure portal data factories page.

2. After landing on the data factories page of the Azure portal, click Create.
3. For Resource Group, take one of the following steps:
a. Select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a new resource group.
To learn about resource groups, see Use resource groups to manage your Azure
resources.
4. For Region, select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data
Factory metadata will be stored. The associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can
run in other regions.
5. For Name, enter ADFTutorialDataFactory.
The name of the Azure data factory must be globally unique. If you see an error that the
name is not available, change the name of the data factory (for example,
<yourname>ADFTutorialDataFactory) and try creating again. For naming rules for Data
Factory artifacts, see the Data Factory - naming rules article.


6. For Version, select V2.


7. Select Next: Git configuration, and then select the Configure Git later check box.
8. Select Review + create, and select Create after the validation is passed. After the
creation is complete, select Go to resource to navigate to the Data Factory page.
9. Select Launch Studio to open Azure Data Factory Studio to start the Azure Data
Factory user interface (UI) application on a separate browser tab.

Pipelines and activities in Azure Data Factory and Azure Synapse


Analytics
A Data Factory or Synapse Workspace can have one or more pipelines. A pipeline is a
logical grouping of activities that together perform a task. For example, a pipeline could
contain a set of activities that ingest and clean log data, and then kick off a mapping
dataflow to analyze the log data. The pipeline allows you to manage the activities as a
set instead of each one individually. You deploy and schedule the pipeline instead of
the activities independently.
The activities in a pipeline define actions to perform on your data. For example, you may
use a copy activity to copy data from SQL Server to an Azure Blob Storage. Then, use a
data flow activity or a Databricks Notebook activity to process and transform data from
the blob storage to an Azure Synapse Analytics pool on top of which business
intelligence reporting solutions are built.
Azure Data Factory and Azure Synapse Analytics have three groupings of activities: data
movement activities, data transformation activities, and control activities. An activity can
take zero or more input datasets and produce one or more output datasets. The following
diagram shows the relationship between pipeline, activity, and dataset:

An input dataset represents the input for an activity in the pipeline, and an output dataset
represents the output for the activity. Datasets identify data within different data stores,
such as tables, files, folders, and documents. After you create a dataset, you can use it
with activities in a pipeline. For example, a dataset can be an input/output dataset of a
Copy Activity or an HDInsightHive Activity. For more information about datasets, see
Datasets in Azure Data Factory article.


Data movement activities


Copy Activity in Data Factory copies data from a source data store to a sink data store.
Data Factory supports the data stores listed in the table in this section. Data from any
source can be written to any sink.
Control flow activities
The following control flow activities are supported:
• Append Variable – Add a value to an existing array variable.
• Execute Pipeline – The Execute Pipeline activity allows a Data Factory or Synapse pipeline to invoke another pipeline.
• Filter – Apply a filter expression to an input array.
• ForEach – The ForEach activity defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and executes the specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages.
• Get Metadata – The GetMetadata activity can be used to retrieve metadata of any data in a Data Factory or Synapse pipeline.
• If Condition – The If Condition can be used to branch based on a condition that evaluates to true or false. The If Condition activity provides the same functionality that an if statement provides in programming languages. It evaluates a set of activities when the condition evaluates to true and another set of activities when the condition evaluates to false.
• Lookup – The Lookup activity can be used to read or look up a record, table name, or value from any external source. This output can further be referenced by succeeding activities.
• Set Variable – Set the value of an existing variable.
• Until – Implements a Do-Until loop that is similar to the Do-Until looping structure in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity.
• Validation – Ensure a pipeline only continues execution if a reference dataset exists, meets a specified criteria, or a timeout has been reached.
• Wait – When you use a Wait activity in a pipeline, the pipeline waits for the specified time before continuing with execution of subsequent activities.
• Web – The Web activity can be used to call a custom REST endpoint from a pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
• Webhook – Using the Webhook activity, call an endpoint, and pass a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity.

Creating a pipeline with UI


To create a new pipeline, navigate to the Author tab in Data Factory Studio (represented
by the pencil icon), then click the plus sign and choose Pipeline from the menu, and
Pipeline again from the submenu.

Data factory will display the pipeline editor where you can find:
1. All activities that can be used within the pipeline.
2. The pipeline editor canvas, where activities will appear when added to the
pipeline.
3. The pipeline configurations pane, including parameters, variables, general
settings, and output.
4. The pipeline properties pane, where the pipeline name, optional description, and
annotations can be configured. This pane also shows any items related to the pipeline
within the data factory.


Note the following points about a sample pipeline whose only activity is an HDInsight Hive transformation:

In the activities section, there is only one activity, whose type is set to HDInsightHive.
The Hive script file, partitionweblogs.hql, is stored in the Azure Storage account (specified by the scriptLinkedService, called AzureStorageLinkedService), in the script folder of the adfgetstarted container.
The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).
The typeProperties section is different for each transformation activity. To learn about the type properties supported for a transformation activity, see that activity's entry under Data transformation activities.
For a complete walkthrough of creating this pipeline, see Tutorial: transform data using
Spark.
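
For orientation, a pipeline matching the notes above might look roughly like the following sketch. The structure follows the points just listed; the pipeline name, the HDInsight linked service name, and the storage account placeholder are assumptions for illustration:

    {
        "name": "TransformWebLogsPipeline",
        "properties": {
            "activities": [
                {
                    "name": "RunPartitionWebLogsScript",
                    "type": "HDInsightHive",
                    "linkedServiceName": {
                        "referenceName": "HDInsightLinkedService",
                        "type": "LinkedServiceReference"
                    },
                    "typeProperties": {
                        "scriptPath": "adfgetstarted/script/partitionweblogs.hql",
                        "scriptLinkedService": {
                            "referenceName": "AzureStorageLinkedService",
                            "type": "LinkedServiceReference"
                        },
                        "defines": {
                            "inputtable": "wasb://adfgetstarted@<storageaccount>.blob.core.windows.net/inputdata",
                            "partitionedtable": "wasb://adfgetstarted@<storageaccount>.blob.core.windows.net/partitioneddata"
                        }
                    }
                }
            ]
        }
    }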
Multiple activities in a pipeline
The sample pipelines shown so far have only one activity in them. You can have more
than one activity in a pipeline. If you have multiple activities in a pipeline and subsequent
activities are not dependent on previous activities, the activities may run in parallel.
You can chain two activities by using activity dependency, which defines how subsequent
activities depend on previous activities, determining the condition whether to continue
executing the next task. An activity can depend on one or more previous activities with
different dependency conditions.
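
In JSON, that dependency is expressed with a dependsOn property on the downstream activity. A hypothetical fragment follows (activity names invented; the dependency condition could also be Failed, Skipped, or Completed):

    {
        "name": "PauseAfterCopy",
        "type": "Wait",
        "dependsOn": [
            {
                "activity": "CopyFromSqlToBlob",
                "dependencyConditions": [ "Succeeded" ]
            }
        ],
        "typeProperties": { "waitTimeInSeconds": 30 }
    }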
Scheduling pipelines
Pipelines are scheduled by triggers. There are different types of triggers (Scheduler
trigger, which allows pipelines to be triggered on a wall-clock schedule, as well as the
manual trigger, which triggers pipelines on-demand). For more information about
triggers, see pipeline execution and triggers article.
To have your trigger kick off a pipeline run, you must include a pipeline reference of the
particular pipeline in the trigger definition. Pipelines & triggers have an n-m relationship.
Multiple triggers can kick off a single pipeline, and the same trigger can kick off multiple
pipelines. Once the trigger is defined, you must start the trigger to have it start triggering
the pipeline. For more information about triggers, see pipeline execution and triggers
article.
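
As an indicative sketch, a schedule trigger that runs a pipeline once a day might be defined roughly as follows; the trigger name, start time, and referenced pipeline are placeholders rather than values from any real workspace:

    {
        "name": "DailyTrigger",
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {
                "recurrence": {
                    "frequency": "Day",
                    "interval": 1,
                    "startTime": "2024-01-01T00:00:00Z",
                    "timeZone": "UTC"
                }
            },
            "pipelines": [
                {
                    "pipelineReference": {
                        "referenceName": "CopyLogsPipeline",
                        "type": "PipelineReference"
                    }
                }
            ]
        }
    }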
Linked services in Azure Data Factory and Azure Synapse
Analytics
Azure Data Factory and Azure Synapse Analytics can have one or more pipelines. A
pipeline is a logical grouping of activities that together perform a task. The activities in a
pipeline define actions to perform on your data. For example, you might use a copy


activity to copy data from SQL Server to Azure Blob storage. Then, you might use a Hive
activity that runs a Hive script on an Azure HDInsight cluster to process data from Blob
storage to produce output data. Finally, you might use a second copy activity to copy the
output data to Azure Synapse Analytics, on top of which business intelligence (BI)
reporting solutions are built. For more information about pipelines and activities, see
Pipelines and activities.
Now, a dataset is a named view of data that simply points to or references the data you want
to use in your activities as inputs and outputs.
Before you create a dataset, you must create a linked service to link your data store to the
Data Factory or Synapse Workspace. Linked services are much like connection strings,
which define the connection information needed for the service to connect to external
resources. Think of it this way: the dataset represents the structure of the data within the
linked data stores, and the linked service defines the connection to the data source. For
example, an Azure Storage linked service links a storage account to the service. An Azure
Blob dataset represents the blob container and the folder within that Azure Storage
account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL Database, you create
two linked services: Azure Storage and Azure SQL Database. Then, create two datasets:
Azure Blob dataset (which refers to the Azure Storage linked service) and Azure SQL
Table dataset (which refers to the Azure SQL Database linked service). The Azure
Storage and Azure SQL Database linked services contain connection strings that the
service uses at runtime to connect to your Azure Storage and Azure SQL Database,
respectively. The Azure Blob dataset specifies the blob container and blob folder that
contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies the
SQL table in your SQL Database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and
linked service in the service:

Linked service with UI: Azure Data Factory


To create a new linked service in Azure Data Factory Studio, select the Manage tab and
then linked services, where you can see any existing linked services you defined. Select
New to create a new linked service.


After selecting New to create a new linked service you will be able to choose any of the
supported connectors and configure its details accordingly. Thereafter you can use the
linked service in any pipelines you create.
Linked service JSON:

    {
        "name": "<name of the linked service>",
        "properties": {
            "type": "<type of the linked service>",
            "typeProperties": {
                "<type-specific properties>"
            },
            "connectVia": {
                "referenceName": "<name of the integration runtime>",
                "type": "IntegrationRuntimeReference"
            }
        }
    }

The following properties appear in the JSON above:

name: Name of the linked service. See naming rules. (Required)
type: Type of the linked service. For example: AzureBlobStorage (data store) or AzureBatch (compute). See the description of typeProperties. (Required)
typeProperties: The type properties are different for each data store or compute. (Required)
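
As a concrete but hypothetical example, an Azure Blob Storage linked service that authenticates with a connection string might look roughly like this; the service name and the placeholder values are assumptions for illustration:

    {
        "name": "AzureStorageLinkedService",
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {
                "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
            }
        }
    }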

Create linked services


Linked services can be created in the Azure Data Factory UX via the management hub,
and from within any activities, datasets, or data flows that reference them.
You can create linked services by using one of these tools or SDKs: .NET API,
PowerShell, REST API, Azure Resource Manager Template, and Azure portal.
When creating a linked service, the user needs appropriate authorization to the designated
service. If sufficient access is not granted, the user will not be able to see the available
resources and will need to use the manual entry option.
Data store linked services
You can find the list of supported data stores in the connector overview article. Click a
data store to learn the supported connection properties.
Compute linked services
See the supported compute environments article for details about the different compute
environments you can connect to from your service, as well as the available
configurations.

Setting up ADF
ADF is a data pipeline orchestrator and ETL tool that is part of the Microsoft Azure cloud
ecosystem. ADF can pull data from the outside world (FTP, Amazon S3, Oracle, and many
more), transform it, filter it, enhance it, and move it along to another destination. In my
work for a health-data project we are using ADF to drive our data flow from raw ingestion
to polished analysis that is ready to display.
There are many good resources for learning ADF, including an introduction and a
quickstart. When I was starting out with ADF, however, I did not find a clear explanation of
the basic underlying concepts it is built upon. This article is an attempt to fill that gap.
Getting ADF to do real work for you involves the following layers of technology, listed
from the highest level of abstraction that you interact with down to the software closest to
the data.
 Pipeline, the graphical user interface where you place widgets and draw data
paths
 Activity, a graphical widget that does something to your data
 Source and Sink, the parts of an activity that specify where data is coming from
and going to
 Data Set, an explicitly defined set of data that ADF can operate on
 Linked Service, the connection information that allows ADF to access a
specific outside data resource
 Integration Runtime, a glue/gateway layer that lets ADF talk to software
outside of itself
Understanding the purpose of each layer and how it contributes to an overall ADF solution
is key to using the tool well. I find it easiest to understand ADF by considering the layers
in reverse order, starting at the bottom near the data.

Integration Runtime
An integration runtime provides the gateway between ADF and the actual data or compute
resources you need. If you are using ADF to marshal native Azure resources, such as an
Azure Data Lake or Databricks, then ADF knows how to talk to those resources. Just use
the built-in integration runtime and don’t think about it — no set up or configuration
required.
But suppose you want ADF to operate on data that is stored on an Oracle Database server
under your desk, or computers and data within your company’s private network. In these
cases you must set up the gateway with a self-hosted integration runtime.
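
A linked service that should go through a self-hosted integration runtime simply references that runtime by name via connectVia. The sketch below is hypothetical; the runtime name and the Oracle connection string format are illustrative assumptions only:

    {
        "name": "OnPremOracleLinkedService",
        "properties": {
            "type": "Oracle",
            "typeProperties": {
                "connectionString": "Host=<server>;Port=1521;Sid=<sid>;User Id=<user>;Password=<password>"
            },
            "connectVia": {
                "referenceName": "SelfHostedRuntime",
                "type": "IntegrationRuntimeReference"
            }
        }
    }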
Linked Service
A linked service tells ADF how to see the particular data or computers you want to operate
on. To access a specific Azure storage account, you create a linked service for it and include
access credentials. To read/write another storage account, you create another linked service.
To allow ADF to operate on an Azure SQL database, your linked service will state the
Azure subscription, server name, database name, and credentials.


Data Set
A data set makes a linked service more specific; it describes the folder you are using within
a storage container, or the table within a database, etc.
The data set in this screenshot points to one directory in one container in one Azure storage
account. (The container and directory names are set in the Parameters tab.) Note how the
data set references a linked service. Note also that this data set specifies that the data is
zipped, which allows ADF to automatically unzip the data as you read it.
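
Expressed as JSON, a dataset along those lines might look roughly like the sketch below, assuming a Binary dataset over Azure Blob Storage with ZipDeflate compression and parameterized container and directory names. All names here are invented, and exact property names can differ by dataset type:

    {
        "name": "ZippedInboundFiles",
        "properties": {
            "type": "Binary",
            "linkedServiceName": {
                "referenceName": "AzureStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "containerName": { "type": "string" },
                "directoryName": { "type": "string" }
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": { "value": "@dataset().containerName", "type": "Expression" },
                    "folderPath": { "value": "@dataset().directoryName", "type": "Expression" }
                },
                "compression": { "type": "ZipDeflate" }
            }
        }
    }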


Source and Sink


A source and a sink are, as their names imply, places data comes from and goes to. Sources
and sinks are built on data sets. ADF is mostly concerned with moving data from one place
to another, often with some kind of transformation along the way, so it needs to know where
to move the data.
It is important to understand that there is a mushy distinction between data sets and
sources/sinks. A data set defines a particular collection of data, but a source or sink can
redefine the collection. For example, suppose DataSet1 is defined as the folder
/Vehicles/GM/Trucks/. When a source uses DataSet1, it can take that collection as-is (the
default), or narrow the set to /Vehicles/GM/Trucks/Silverado/ or expand it to /Vehicles/.
There is artful design involved in the trade-offs between data set scope and source/sink
scope. My practice is to define data sets somewhat broadly (thereby reducing the number
of them), and then allow sources and sinks to narrow down what each needs in a particular
situation.
This source uses the zipped data set shown above, narrows it, and makes sure to select only
files actually named *.zip (otherwise the unzipping will fail).
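
In a Copy activity, that kind of narrowing typically shows up as store settings on the source, layered on top of the dataset. A hypothetical fragment (folder path and setting names are indicative only, not taken from a real pipeline):

    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true,
            "wildcardFolderPath": "Vehicles/GM/Trucks",
            "wildcardFileName": "*.zip"
        }
    }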


Activity
Activities are the GUI widgets within Data Factory that do specific kinds of data movement
or transformation. There is a CopyData activity to move data, a ForEach activity to loop
over a file list, a Filter activity that chooses a subset of files, etc. Most activities have a
source and a sink.
Pipeline
An ADF pipeline is the top-level concept that you work with most directly. Pipelines are
composed of activities and data flow arrows. You program ADF by creating pipelines. You
get work done by running pipelines, either manually or via automatic triggers. You look at
the results of your work by monitoring pipeline execution.

This pipeline takes inbound data from an initial Data Lake folder, moves it to cold archive
storage, gets a list of the files, loops over each file, copies those files to an unzipped working
folder, then applies an additional filter by file type.

Simple project on ADF


Section 1: Create Azure Data Factory
First things first. Let's start by creating our Azure Data Factory resource.
First step, log into the portal and click the Create a resource button.

Figure 1a: Create a resource


Next, select the Integration option and then click Data Factory.


Figure 1b: Select Data Factory


From here, we fill out the pertinent information. Name, Subscription, Resource Group,
and Location are all required fields. (If you do not have a resource group set up, click the
Create new option prior to this step.)
By default, the Enable GIT option is selected. For this tutorial, we'll uncheck it. If you
prefer saving everything you build in a repository, you can configure that option.
Click Create and let the magic happen.


Figure 1c: New data factory - Click Create


Next, you'll be redirected to the deployment screen that shows the progress of your Azure
Data Factory resource being created.
After a minute or so, you should see Your deployment is complete.
Boom! You've finished the first step. Your Azure Data Factory resource setup is complete.
From here, click the Go to resource button.


Figure 1d: Your deployment is complete - Click Go to resource


Section 2: Create Azure Data Factory Pipeline
Now that we have our Azure Data Factory resource set up, you should see something that
looks like the image below. This is a high-level look at our resource. You can see metrics
about CPU and memory and get a quick glance at how things are running.
In order to create our first Azure Data Factory (ADF) pipeline, we need to click the Author
& Monitor option.


Figure 2a: ADF Resource - Click Author & Monitor


From this point, we should see an entirely new browser window open and see that our data
factory is ready for use. Here you have a few options. Today, we are going to focus on the
first option; click Create pipeline.

Figure 2b: Create pipeline


I know what you are thinking: Wowzers, that is a lot of options. But don't be discouraged.
We will start with something simple that everyone has had to do at one point or another
when doing data integration.
First, let's rename our pipeline and call it first-pipeline.


Figure 2c: Rename pipeline


Next, expand the option under the Activities column labeled Move & transform. Now,
drag the item labeled Copy data onto the middle section that is blank.
After doing this, we should see our Copy data activity show up and the options panel
expand from the bottom of the screen. This is where we are going to configure the options
for our copy data task.
Go ahead and rename our task that is labeled Copy data1 to Copy Characters.


Figure 2d: Copy data task


Section 3: Setup Source
To begin, click the tab labeled Source. Here, we will configure our API endpoint. Since
we have not set up any source datasets, we won't see anything in the drop-down.
Click the + New button.


Figure 3a: Source dataset


After the previous step, you should now see the New dataset panel show up on the right
side. As you can see there are tons of options and more being added regularly.
Scroll down until you see the option for REST. Select it and click Continue.


Figure 3b: New dataset - Select REST, Click Continue


Now we should see our Source dataset is updated and labeled RestResource1. This
screen allows you to configure some additional options around our new dataset.
To set up and configure our connection information for the dataset, click the Open button.


Figure 3c: RestResource1 - Click Open


This will open another tab next to our pipeline tab that allows us to configure
our REST API dataset. At this point, you could rename your dataset to something more meaningful,
but for now, we'll leave it as is.
Click the Connection tab. As you can see, we don't have a linked service setup.
Click the + New button.


Figure 3d: RestResource1 - Connection Tab, Click + New


Notice the side panel which allows for setup of all the information we need to connect to
our REST API. For this tutorial, we will use the Star Wars API aka SWAPI.
Let's name our service something useful like SWAPI Service. Then, we'll set up our Base
URL and point it to our API, which is: https://swapi.dev/api/people.
Since our API doesn't require any authentication, we select the type Anonymous.
Note: Best practice is to always provide some type of authentication.
Finally, click the Test connection button to ensure it is working and then click
the Create button.
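
Behind the UI, the linked service created here corresponds roughly to a JSON definition like the sketch below; the name and URL match the tutorial, while the overall shape is indicative of the REST connector rather than an exact export:

    {
        "name": "SWAPI Service",
        "properties": {
            "type": "RestService",
            "typeProperties": {
                "url": "https://swapi.dev/api/people",
                "enableServerCertificateValidation": true,
                "authenticationType": "Anonymous"
            }
        }
    }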


Figure 3e: New linked service - Click Create


We should now see our RestResource1 tab is updated with our new connection
information.
Now, let's go back to our pipeline, click the first tab labeled first-pipeline.


Figure 3f: SWAPI Service - Click first-pipeline tab


Now that we have setup and configured our source, we can preview what our data is going
to look like.
Click Preview data.


Figure 3g: Source configured - Click Preview data


You'll see a preview dialog popup that shows we have successfully connected to our API
and what our JSON response will look like.
Close the dialog and let's move onto the next steps.


Figure 3h: Preview data - Close dialog


Section 4: Setup Sink (Target)
Now that we've completed the data source setup, we need to set up and configure where
we're going to store the data. For this tutorial, we assume you already have an Azure Storage
resource set up and ready for use.
In ADF, our target is called a sink.

Figure 4a: Sink dataset - Click + New button


These next steps should look very familiar. We'll begin the process, just like before, but
this time we'll configure where we want our data stored.
Scroll down until we see the option Azure Table Storage. Select it and click Continue.


Figure 4b: New dataset - Select Azure Table Storage, Click Continue


Now we should see that our Sink dataset is updated and labeled AzureTable1.
To set up and configure our table storage information, we need to click the Open button.


Figure 4c: AzureTable1 - Click Open


Once again, this will open up another tab and allow us to configure our Azure Table Storage
connection information.
Click the Connection tab, then click the + New button.

Figure 4d: AzureTable1 - Connection tab, Click + New button


We should see another side panel that allows us to configure our table storage connection
information.
Let's name our service AzureTableStorage.
Next, we select our Storage account name for use. If you didn't set up one previously,
you'll need to do that first.
Finally, click the Test connection button to ensure it is working and then click
the Create button.

Figure 4e: New linked service - Click Create


We should now see our AzureTable1 tab is updated with our new connection information.
Lastly, we select the table we want to store our data in. Select the table named Characters
(or whatever you named your table).
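
For reference, the sink dataset we just configured corresponds roughly to a definition like this sketch (the names follow the tutorial; the shape is indicative rather than an exact export):

    {
        "name": "AzureTable1",
        "properties": {
            "type": "AzureTable",
            "linkedServiceName": {
                "referenceName": "AzureTableStorage",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "tableName": "Characters"
            }
        }
    }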
For the final part of this section, we'll go back to our pipeline. Click the first tab labeled
first-pipeline.


Figure 4f: AzureTable1 - Click first-pipeline tab


Our sink dataset is configured and ready for use.
For this tutorial, we'll leave everything else as-is. In a typical scenario, this is where you
would configure how to handle duplicate data, partition keys, etc.


Figure 4g: Sink configured


Section 5: Setup Mappings
Now that our source and target (sink) are set up, we'll map our API data to the table where
we'll store this data. Data mappings are easy thanks to the ability to discover and import
our API schema.
To begin, click the tab labeled Mapping, then click the Import schemas button.


Figure 5a: Mapping - Click Import schemas


After a few seconds, we should see the format of our API response appear.
Behind the scenes, ADF makes an HTTP request to our API and formats the response into
a table layout for us to configure our mappings.


Figure 5b: Mapping - Import schemas result


Notice the response layout includes properties such as count, next, previous and results.
ADF intuitively knows that the results property is an array.
Next, we'll set our Collection reference to the array named results. To do this, either select
it from the drop down list or check the box under the collection reference column.
Then we setup the Column names for the items we're mapping for import.
For this tutorial, we'll map the fields: name, height, mass and homeworld.
Uncheck the Include column for everything else.
Finally, type in the names we want our columns to be called (name, height, mass,
homeworld).
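
Under the hood, this mapping ends up as a translator section on the Copy activity. A rough, hypothetical sketch for the fields chosen above:

    "translator": {
        "type": "TabularTranslator",
        "collectionReference": "$['results']",
        "mappings": [
            { "source": { "path": "['name']" }, "sink": { "name": "name" } },
            { "source": { "path": "['height']" }, "sink": { "name": "height" } },
            { "source": { "path": "['mass']" }, "sink": { "name": "mass" } },
            { "source": { "path": "['homeworld']" }, "sink": { "name": "homeworld" } }
        ]
    }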


Figure 5c: Mapping - Include fields: name, height, mass, homeworld


Section 6: Validate, Publish & Test
Now that we have completed setting up our source, sink, and mappings, we are ready to
validate, publish, and test.
Validate
Let's validate everything before we move on to publishing.
Click the Validate all button. A successful factory validation looks like Figure 6a.


Figure 6a: Validated - Success


Publish
Before trying out our new pipeline, we need to publish our changes.
Click the Publish all button. A side panel will appear that shows us all the changes to be
published.
Click Publish.

Figure 6b: Publish all - Click Publish


Test
Finally, we test that our new pipeline works. For this we'll use the debug feature.
Click the Debug button.
This will show the output tab and allow us to see the status of our pipeline run.

Figure 6c: Test - Click Debug


Once you see that the pipeline has run successfully, you can dig deeper into the details of the
run by hovering over the Copy Characters name.
The first icon shows the input details and the second icon shows the output details. You'll find
the overall details of the pipeline run in the third icon.
Click the Eyeglasses icon.

Figure 6d: Pipeline run - Click Eyeglasses icon


Figure 6e: Details - Close dialog

References
1. Microsoft Learn
2. Wikipedia
3. Google
4. Databricks
