ETL Power Point Presentation
ETL Power Point Presentation
ETL Power Point Presentation
Warehouse
Chirayu Poundarik
Outline
ETL
Extraction
Transformation
Loading
ETL Overview
Extraction Transformation Loading – ETL
To get data out of the source and load it into the data
warehouse – simply a process of copying data from one
database to other
Data is extracted from an OLTP database, transformed
to match the data warehouse schema and loaded into
the data warehouse database
Many data warehouses also incorporate data from non-
OLTP systems such as text files, legacy systems, and
spreadsheets; such data also requires extraction,
transformation, and loading
When defining ETL for a data warehouse, it is important
to think of ETL as a process, not a physical
implementation
ETL Overview
ETL is often a complex combination of process and
technology that consumes a significant portion of the data
warehouse development efforts and requires the skills of
business analysts, database designers, and application
developers
It is not a one time event as new data is added to the
Data Warehouse periodically – monthly, daily, hourly
Because ETL is an integral, ongoing, and recurring part of
a data warehouse
Automated
Well documented
Easily changeable
ETL Staging Database
ETL operations should be performed on a
relational database server separate from the
source databases and the data warehouse
database
Creates a logical and physical separation
between the source systems and the data
warehouse
Minimizes the impact of the intense periodic ETL
activity on source and data warehouse databases
Extraction
Extraction
Need to have a logical data map before the physical data can
be transformed
The logical data map describes the relationship between the
extreme starting points and the extreme ending points of your
ETL system usually presented in a table or spreadsheet
Target Source Transformation
Table Name Column Name Data Type Table Name Column Name Data Type
The content of the logical data mapping document has been proven to be the critical
element required to efficiently plan ETL processes
The table type gives us our queue for the ordinal position of our data load processes
—first dimensions, then facts.
The primary purpose of this document is to provide the ETL developer with a clear-cut
blueprint of exactly what is expected from the ETL process. This table must depict,
without question, the course of action involved in the transformation process
The transformation can contain anything from the absolute solution to nothing at all.
Most often, the transformation can be expressed in SQL. The SQL may or may not be
the complete statement
The analysis of the source system is
usually broken into two major phases:
The data discovery phase
The anomaly detection phase
Extraction - Data Discovery Phase
Data Discovery Phase
key criterion for the success of the data
warehouse is the cleanliness and
cohesiveness of the data within it
Once you understand what the target
needs to look like, you need to identify and
examine the data sources
Data Discovery Phase
It is up to the ETL team to drill down further into the data
requirements to determine each and every source system, table,
and attribute required to load the data warehouse
Collecting and Documenting Source Systems
Keeping track of source systems
Determining the System of Record - Point of originating of data
Definition of the system-of-record is important because in most
enterprises data is stored redundantly across many different systems.
Enterprises do this to make nonintegrated systems share data. It is very
common that the same piece of data is copied, moved, manipulated,
transformed, altered, cleansed, or made corrupt throughout the
enterprise, resulting in varying versions of the same data
Data Content Analysis - Extraction
Understanding the content of the data is crucial for determining the best
approach for retrieval
- NULL values. An unhandled NULL value can destroy any ETL process.
NULL values pose the biggest risk when they are in foreign key columns.
Joining two or more tables based on a column that contains NULL values
will cause data loss! Remember, in a relational database NULL is not equal
to NULL. That is why those joins fail. Check for NULL values in every
foreign key in the source database. When NULL values are present, you
must outer join the tables
- Dates in nondate fields. Dates are very peculiar elements because they
are the only logical elements that can come in various formats, literally
containing different values and having the exact same meaning. Fortunately,
most database systems support most of the various formats for display
purposes but store them in a single standard format
During the initial load, capturing changes to data content
in the source data is unimportant because you are most
likely extracting the entire data source or a potion of it
from a predetermined point in time.
Yes
Cleaning
Staged Data And Fatal Errors No Loading
Confirming
Loading
Loading Dimensions
Loading Facts
Loading Dimensions
Physically built to have the minimal sets of components
The primary key is a single field containing meaningless
unique integer – Surrogate Keys
The DW owns these keys and never allows any other
entity to assign them
De-normalized flat tables – all attributes in a dimension
must take on a single value in the presence of a
dimension primary key.
Should possess one or more other fields that compose
the natural key of the dimension
The data loading module consists of all the steps
required to administer slowly changing dimensions
(SCD) and write the dimension to disk as a physical
table in the proper dimensional format with correct
primary keys, correct natural keys, and final descriptive
attributes.
Creating and assigning the surrogate keys occur in this
module.
The table is definitely staged, since it is the object to be
loaded into the presentation system of the data
warehouse.
Loading dimensions
When DW receives notification that an
existing row in dimension has changed it
gives out 3 types of responses
Type 1
Type 2
Type 3
Type 1 Dimension
Type 2 Dimension
Type 3 Dimensions
Loading facts
Facts
Fact tables hold the measurements of an
enterprise. The relationship between fact
tables and measurements is extremely
simple. If a measurement exists, it can be
modeled as a fact table row. If a fact table
row exists, it is a measurement
Key Building Process - Facts
When building a fact table, the final ETL step is
converting the natural keys in the new input records into
the correct, contemporary surrogate keys
ETL maintains a special surrogate key lookup table for
each dimension. This table is updated whenever a new
dimension entity is created and whenever a Type 2
change occurs on an existing dimension entity
All of the required lookup tables should be pinned in
memory so that they can be randomly accessed as each
incoming fact record presents its natural keys. This is
one of the reasons for making the lookup tables separate
from the original data warehouse dimension tables.
Key Building Process
Loading Fact Tables
Managing Indexes
Performance Killers at load time
Drop all indexes in pre-load time
Segregate Updates from inserts
Load updates
Rebuild indexes
Managing Partitions
Partitions allow a table (and its indexes) to be physically divided
into minitables for administrative purposes and to improve query
performance
The most common partitioning strategy on fact tables is to
partition the table by the date key. Because the date dimension
is preloaded and static, you know exactly what the surrogate
keys are
Need to partition the fact table on the key that joins to the date
dimension for the optimizer to recognize the constraint.
The ETL team must be advised of any table partitions that need
to be maintained.
Outwitting the Rollback Log
The rollback log, also known as the redo log, is
invaluable in transaction (OLTP) systems. But in a data
warehouse environment where all transactions are
managed by the ETL process, the rollback log is a
superfluous feature that must be dealt with to achieve
optimal load performance. Reasons why the data
warehouse does not need rollback logging are:
All data is entered by a managed process—the ETL system.
Data is loaded in bulk.
Data can easily be reloaded if a load process fails.
Each database management system has different logging
features and manages its rollback log differently
Questions