Azure Data Factory


Azure Data Factory V2

What is ADF?
• SSIS in the cloud
• Allows access to on-premises data in SQL Server and cloud data in Azure storage
• It's visual (the JSON-encoded assets are still there, but that's largely hidden from
the user)
• It has broad connectivity across data sources and destinations, of both Microsoft
and non-Microsoft pedigrees
• It's modern, able to use Hadoop (including MapReduce, Pig and Hive) and Spark
to process the data or use its own simple activity construct to copy data
• It doesn't cut ties with the past; in fact, it serves as a cloud-based environment
for running packages designed with on-premises SSIS
ADF vs SSIS
• Both Azure Data Factory and SQL Server Integration Services are built to move
data between disparate sources, and they do have some overlapping capabilities.
• Installation
• Scheduling capabilities
• Workload
• ETL or ELT
• JSON scripts vs No coding
• SSIS has a wider range of supported data sources and destinations
• SSIS has mature built-in error handling; ADF's error handling is more limited
• Tagging and tracking the data in ADF
Terminologies
• Pipelines
• Activities
• Integration Runtime
• Datasets
• Linked Services
• Scheduling Pipeline
Datasets
• A dataset is a named view of data that simply points to or references the
data you want to use in your activities as inputs and outputs.
• Datasets identify data within different data stores, such as tables,
files, folders, and documents.
• There are many different types of datasets, depending on the data
store you use, e.g., Azure Blob storage or Azure SQL Database (see the sketch below).
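• As a minimal sketch, a dataset definition is a small JSON document; the dataset name, linked service name, folder, and file shown here are hypothetical:

  {
    "name": "BlobInputDataset",
    "properties": {
      "type": "AzureBlob",
      "linkedServiceName": {
        "referenceName": "AzureStorageLinkedService",
        "type": "LinkedServiceReference"
      },
      "typeProperties": {
        "folderPath": "input",
        "fileName": "orders.csv",
        "format": { "type": "TextFormat" }
      }
    }
  }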
Activities
• The activities in a pipeline define actions to perform on your data. For
example, you may use a copy activity to copy data from an on-
premises SQL Server database to Azure Blob storage.
• The pipeline allows you to manage the activities as a set instead of
each one individually. For example, you can deploy and schedule the
pipeline, instead of the activities independently.
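• As a rough sketch, a pipeline is itself a JSON document whose activities array holds the set of activities that are deployed and scheduled together (names are hypothetical and activity bodies are elided):

  {
    "name": "OrdersPipeline",
    "properties": {
      "activities": [
        { "name": "CopyOrders", "type": "Copy" },
        { "name": "UpsertOrders", "type": "SqlServerStoredProcedure" }
      ],
      "parameters": {
        "paramValue": { "type": "Bool", "defaultValue": true }
      }
    }
  }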
Types of Activities
• If
• For each
• Copy Data
• Stored Procedure
• Execute Pipeline
• Lookup
• Azure function
If Condition
• If Condition evaluates a boolean expression.
Depending on the expression result (true or false),
the pipeline invokes the appropriate set
of activities.
• For instance, you can provide the following
expression as the condition (using pipeline
parameters): @bool(pipeline().parameters.paramValue).
When you run the pipeline you define the parameter
value, and the appropriate branch of activities
executes depending on that value (see the sketch below).
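• A minimal sketch of an If Condition activity using that expression (the activity name is hypothetical and the branch contents are elided):

  {
    "name": "CheckParamValue",
    "type": "IfCondition",
    "typeProperties": {
      "expression": {
        "value": "@bool(pipeline().parameters.paramValue)",
        "type": "Expression"
      },
      "ifTrueActivities": [ ],
      "ifFalseActivities": [ ]
    }
  }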
For each
• As with the If Condition activity, the For Each activity
concept is similar to the foreach loop found in programming languages.
• Imagine a situation where you have to copy files to
multiple locations within Blob storage. To achieve this,
just use a For Each loop whose parameter is a
collection of the desired destination folder paths (see the sketch below).
• When you have a lot of items to iterate over, you can speed
up the execution by setting the isSequential property to
false. This switches the loop to parallel execution;
by default up to 20 iterations run concurrently.
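• A minimal sketch of such a For Each activity, assuming a hypothetical array parameter destinationFolders (the inner copy activity is elided):

  {
    "name": "CopyToEachFolder",
    "type": "ForEach",
    "typeProperties": {
      "items": {
        "value": "@pipeline().parameters.destinationFolders",
        "type": "Expression"
      },
      "isSequential": false,
      "batchCount": 20,
      "activities": [ ]
    }
  }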
Copy Activity
• In Azure Data Factory, you can use Copy Activity to copy data
among data stores located on-premises and in the cloud. After the
data is copied, it can be further transformed and analyzed. You
can also use Copy Activity to publish transformation and analysis
results for business intelligence (BI) and application consumption.
Actions:
• Reads data from a source data store.
• Performs serialization/deserialization,
compression/decompression, column mapping, etc. It does these
operations based on the configurations of the input dataset,
output dataset, and Copy Activity.
• Writes data to the sink/destination data store.
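• A minimal sketch of a copy activity wiring a source dataset to a sink dataset (the dataset names and the reader query are hypothetical):

  {
    "name": "CopySqlToBlob",
    "type": "Copy",
    "inputs": [
      { "referenceName": "SqlServerInputDataset", "type": "DatasetReference" }
    ],
    "outputs": [
      { "referenceName": "BlobOutputDataset", "type": "DatasetReference" }
    ],
    "typeProperties": {
      "source": { "type": "SqlSource", "sqlReaderQuery": "SELECT * FROM dbo.Orders" },
      "sink": { "type": "BlobSink" }
    }
  }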
Stored Procedure
• You can use the Stored Procedure Activity to invoke a
stored procedure in one of the following data stores in
your enterprise or on an Azure virtual machine (VM):
• Azure SQL Database
• Azure SQL Data Warehouse
• SQL Server Database
• If you are using SQL Server, install a self-hosted
integration runtime on the same machine that hosts
the database or on a separate machine that has
access to the database (see the sketch below).
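• A minimal sketch of a Stored Procedure activity (the linked service, procedure name, and parameter are hypothetical):

  {
    "name": "UpsertOrders",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
      "referenceName": "AzureSqlLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "storedProcedureName": "dbo.usp_UpsertOrders",
      "storedProcedureParameters": {
        "BatchId": { "value": "42", "type": "Int" }
      }
    }
  }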
Execute Pipeline
• This lets you run another pipeline (child) from an existing
one (parent). This is especially useful when your pipeline
has grown and contains repeatable workflow steps; in that
case it makes sense to put such a piece of workflow in a
separate pipeline and reuse it wherever possible.
• Additionally, you can make this “brick” more
generic by defining the appropriate parameters. You can
pass their values from the parent pipeline to the child pipeline
(see the sketch below).
• One more thing worth mentioning is
the waitOnCompletion property. It defines whether the
parent pipeline run should wait for the child pipeline to
finish executing before continuing.
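• A minimal sketch of an Execute Pipeline activity in the parent, passing one parameter down to a hypothetical ChildPipeline:

  {
    "name": "RunChildPipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
      "pipeline": {
        "referenceName": "ChildPipeline",
        "type": "PipelineReference"
      },
      "waitOnCompletion": true,
      "parameters": {
        "sourceFolder": "@pipeline().parameters.sourceFolder"
      }
    }
  }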
Lookup activity
• This is another interesting activity. You can use it to read values
from an external source as an input for your pipeline, and then
use the output of the Lookup activity in subsequent activities.
• An example scenario: use Lookup to take values from an Azure
SQL Database table as the input collection through which a
For Each loop should iterate. Basically, you can use the Lookup
activity to read configuration for your pipeline (see the sketch below).
• A limitation of the Lookup activity is the number of rows it can
return, which is limited to 5,000 records with a maximum size of
10 MB.
• The output is returned in JSON format.
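• A minimal sketch of a Lookup activity reading configuration rows from a hypothetical dbo.CopyConfig table (the dataset name and query are also hypothetical). With firstRowOnly set to false it returns all rows, and a downstream For Each could iterate over @activity('GetCopyConfig').output.value:

  {
    "name": "GetCopyConfig",
    "type": "Lookup",
    "typeProperties": {
      "source": {
        "type": "AzureSqlSource",
        "sqlReaderQuery": "SELECT FolderPath FROM dbo.CopyConfig"
      },
      "dataset": {
        "referenceName": "ConfigDataset",
        "type": "DatasetReference"
      },
      "firstRowOnly": false
    }
  }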
Azure Function Activity
• Azure Functions is a serverless compute service
that enables you to run code on-demand
without having to explicitly provision or manage
infrastructure.
• Using Azure Functions, you can run a script or
piece of code in response to a variety of events.
• Azure Data Factory (ADF) is a managed data
integration service in Azure that allows you to
iteratively build, orchestrate, and monitor your
Extract Transform Load (ETL) workflows.
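• A minimal sketch of an Azure Function activity (the linked service, function name, and request body are hypothetical):

  {
    "name": "CallProcessOrders",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {
      "referenceName": "AzureFunctionLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "functionName": "ProcessOrders",
      "method": "POST",
      "body": { "source": "ADF" }
    }
  }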
