ETL Introduction


Extract, Transform, Load (ETL)

Slides adapted from Torben Bach Pedersen, Man Lung Yiu and Dimitra Vista
ETL Overview
• The ETL Process
• General ETL issues
 Building dimensions
 Building fact tables
 Extract
 Transformations/cleaning
 Load
• MS Integration Services

• Reading materials
 Kimball ch. 10;
 Jarke ch. 4.1-4.3;
 Supplementary: Designing and Implementing Packages Tutorials
http://msdn.microsoft.com/en-us/library/ms167031.aspx
The ETL Process

• The most underestimated process in DW development


• The most time-consuming process in DW development
 80% of development time is spent on ETL!
• Extract
 Extract relevant data
• Transform
 Transform data to DW format
 Build keys, etc.
 Cleaning of data
• Load
 Load data into DW
 Build aggregates, etc.



ETL In The Architecture

[Figure: the data warehouse architecture, split into an ETL side and a query side. On the ETL side, operational data sources feed a data staging area where extract, transform, and load take place. The staging area feeds the presentation servers: the data warehouse bus of conformed dimensions and facts, data marts with atomic data, and data marts with aggregate-only data. On the query side, query services (access and security, query management, standard reporting, activity monitoring) connect the presentation servers to desktop data access tools: browsing, reporting, and data mining. Metadata spans both sides.]


Designing the staging area

• The staging area is owned by the ETL team


 no indexes, no aggregations, no presentation access, no querying, no
service level agreements
 Transformations/cleaning done here
• Users are not allowed in the staging area for any reason
 staging is a “construction” site
• Reports cannot access data in the staging area
 tables can be added or dropped without notifying the user community
• Only ETL processes can read/write the staging area (ETL
developers must capture table names, update strategies, load
frequency, ETL jobs, expected growth and other details about
the staging area)
• The staging area consists of both RDBMS tables and data files
(DBMS have become better at this)



ETL Construction Process
 Plan
1) Make high-level diagram of source-destination flow
2) Test, choose and implement ETL tool
3) Develop default strategies for common activities, e.g., extraction,
dimension management, etc.
4) Drill down by target table, each column
 Develop one-time historic load process
5) Build and test the historic dimension table loads
6) Build and test the historic fact table loads
 Construction of incremental loads and automation
7) Build and test the dimension table incremental load process
8) Build and test the fact table incremental load process
9) Build and test aggregate table / OLAP (you do this later)
10) Design, construct, and test ETL automation
(Why do we consider dimensions before fact tables?)



High-level diagram
1) Make high-level diagram of source-destination flow
 Mainly used for communication purpose
 One page only, highlight sources and destinations
 Steps: extract, transform, load

[Figure: example source-destination flow. Sources: Raw-Product (spreadsheet) and Raw-Sales (RDBMS). Transformations: check referential integrity, add product type, extract time, aggregate sales per product per day. Destinations: the Product and Time dimensions and the Sales fact table.]



The basic structure of a dimension
• Primary key (PK)
 Meaningless, unique integer
 Also known as a surrogate key
 Joins to Fact Tables
 Is a Foreign Key to Fact Tables
• Natural key (NK)
 Meaningful key extracted from
source systems
 1-to-1 relationship to the PK for
static dimensions
 1-to-many relationship to the PK for
slowly changing dimensions, tracks
history of changes to the dimension
• Descriptive Attributes
 Primarily textual; numbers are legitimate, but not numbers that are
measured quantities
 100 such attributes is normal
 Static or slowly changing only
 Product price -- either fact or
dimension attribute
Generating surrogate keys for Dimensions

• Via triggers in the DBMS


 Read the latest surrogate key, generate the next value, create the
record
 Disadvantages: severe performance bottlenecks
• Via the ETL process: an ETL tool or a third-party
application generates the unique numbers
 A surrogate key counter per dimension
 Maintain consistency of surrogate keys between dev, test and
production
• Using Smart Keys
 Concatenate the natural key of the dimension in the source(s)
with the timestamp of the record in the source or the Data
Warehouse.
 Tempting but wrong
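
A minimal T-SQL sketch of the second approach, assuming a hypothetical product dimension whose IDENTITY column acts as the per-dimension surrogate key counter (all table and column names are illustrative):

-- Dimension table: the DBMS maintains the surrogate key counter.
CREATE TABLE dim_product (
    product_key  INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key (PK)
    product_nk   VARCHAR(20) NOT NULL,          -- natural key from source
    product_name VARCHAR(100),
    product_type VARCHAR(50)
);

-- Assign new surrogate keys only to natural keys not seen before;
-- the natural-key-to-surrogate-key mapping lives in the dimension itself.
INSERT INTO dim_product (product_nk, product_name, product_type)
SELECT s.product_nk, s.product_name, s.product_type
FROM stage_product AS s
WHERE NOT EXISTS (SELECT 1 FROM dim_product AS d
                  WHERE d.product_nk = s.product_nk);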



Building Dimensions

• Static dimension table
 DW key assignment: production keys to DW keys (surrogate) using table
 Combination of data sources: find common key?
 Check one-one and one-many relationships using sorting
• Handling dimension changes
 Described in last lecture
 Find the newest DW key for a given production key
 Table for mapping production keys to DW keys must be maintained
and updated
• Load of dimension changes
 Small dimensions: replace
 Large dimensions: load only changes
 (Compare the Product dimension of FClub vs. the Product dimension
of a supermarket)

Example product table:
Type    Product
Flask   Tuborg
Flask   Carlsberg
Milk    Skim-Milk
Milk    Cacao-Milk
Vand    Sodavand
……      ……



How to check 1 to 1 and 1 to many
• Sorting
• 1-to-many: given a table product, check whether product_sku to
product_model is 1-to-many
SELECT product_sku,
       COUNT(*) AS row_count,
       COUNT(DISTINCT product_model) AS model_count
FROM staging_database.product
GROUP BY product_sku
HAVING COUNT(DISTINCT product_model) > 1
Any row returned is a product_sku associated with more than one product_model.



The grain of a dimension
• The definition of the key of the dimension in business terms, what does the
dimension represent
• Challenge: analyze the source systems so that a particular set of fields in
that source corresponds to the grain of the dimension
• Verify that a given source (file) implements the intended grain
 Nothing should be returned by the query below when run on the source system/file
 If something is returned, the fields A, B and C do not represent the grain
of the dimension

select A, B, C, count(*)
from DimensionTableSource
group by A, B, C
having count(*) > 1



The basic load plan for a dimension
• Simple Case: the dimension is loaded as a lookup table
• Typical Case
 Data cleaning
 Validate the data, apply business rules to make the data consistent, column
validity enforcement, cross-column value checking, row de-duplication
 Data conforming
 Align the content of some or all of the fields in the dimension with fields in
similar or identical dimensions in other parts of the data warehouse
 Fact tables: billing transactions, customer support calls

 IF they use the same dimensions, then the dimensions are conformed

 Data Delivery
 All the steps required to deal with slow-changing dimensions
 Write the dimension to the physical table
 Creating and assigning the surrogate key, making sure the natural key is
correct, etc.
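
Row de-duplication during data cleaning can be sketched in T-SQL with a window function; the staging table and columns below are illustrative assumptions:

-- Keep one row per natural key (the most recent extract wins);
-- delete the remaining duplicates from the staging table.
WITH ranked AS (
    SELECT customer_nk,
           ROW_NUMBER() OVER (PARTITION BY customer_nk
                              ORDER BY extract_date DESC) AS rn
    FROM stage_customer
)
DELETE FROM ranked WHERE rn > 1;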



The basic structure of a fact table

• Every fact table is defined by its grain
 in business terms
 in terms of the dimension foreign
keys and other fields
• A set of foreign keys (FK)
 context for the fact
 Join to Dimension Tables
• Degenerate Dimensions
 Part of the key
 Not a foreign key to a Dimension
table
• Primary Key
 a subset of the FKs
 must be defined in the table
• Fact Attributes
 measurements

Building Fact Tables
• Two types of load
• Initial load of historic data
 ETL for all data up till now
 Done when DW is started the first time, typically triggered by a human
 Very heavy - large data volumes
• Incremental update
 Move only changes since last load
 Done periodically (e.g., month or week) after DW start, automatically
 Less heavy - smaller data volumes
• Dimensions must be updated before facts
 The relevant dimension rows for new facts must be in place
 Special key considerations if initial load must be performed again



Guaranteeing Referential Integrity
1. Check Before Loading
• Check before you add fact
records
• Check before you delete
dimension records
• Best approach
2. Check While Loading
• DBMS enforces RI
• Elegant but typically SLOW
• Exception: Red Brick database
system is capable of loading 100
million records an hour into a
fact table where it is checking
referential integrity on all the
dimensions simultaneously!
3. Check After Loading
• No RI in the DBMS
• Periodic checks for invalid
foreign keys looking for invalid
data
• Ridiculously slow
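
A check of the first kind can be a simple anti-join against each dimension, run before the fact rows are added; table and column names are illustrative:

-- Fact rows whose product key has no matching dimension row;
-- any result signals a referential-integrity violation.
SELECT f.product_key, COUNT(*) AS orphan_rows
FROM stage_fact_sales AS f
LEFT JOIN dim_product AS d ON f.product_key = d.product_key
WHERE d.product_key IS NULL
GROUP BY f.product_key;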

Types of Data Sources

• Non-cooperative sources
 Snapshot sources – provide only full copy of source, e.g., files
 Specific sources – each is different, e.g., legacy systems
 Logged sources – write change log, e.g., DB log
 Queryable sources – provide query interface, e.g., RDBMS
• Cooperative sources
 Replicated sources – publish/subscribe mechanism
 Call back sources – call external code (ETL) when changes occur
 Internal action sources – only internal actions when changes occur
 DB triggers are an example
• Extract strategy depends on the source types



Extract

• Goal: fast extract of relevant data


 Extract from source systems can take a long time
• Types of extracts:
 Extract applications (SQL): co-existence with other applications
 DB unload tools: faster than SQL-based extracts
 e.g., MS SQL Export Wizard, MySQL DB dump
• Extract applications are the only solution in some scenarios
• Too time consuming to ETL all data at each load
 Extraction can take days/weeks
 Drain on the operational systems and DW systems
• Extract/ETL only changes since last load (delta)
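
For a queryable source with an audit timestamp column, a delta extract can be sketched as below; the timestamp of the previous load would be kept in ETL metadata (all names are illustrative):

-- Extract only rows changed since the previous load.
DECLARE @last_load_time DATETIME;
SET @last_load_time = '20080430 23:59:59';  -- in practice, read from ETL metadata

SELECT *
FROM source_db.dbo.sales
WHERE last_modified > @last_load_time;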



Common Transformations

• Data type conversions


 EBCDIC → ASCII/Unicode
 String manipulations
 Date/time format conversions
 E.g., unix time 1201928400 = what time? (see the sketch after this list)
• Normalization/denormalization
 To the desired DW format
 Depending on source format
• Building keys
 Table matches production keys to surrogate DW keys
 Correct handling of history - especially for total reload
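
As a check on the unix-time question above (unix time counts seconds since 1970-01-01 UTC), a T-SQL one-liner:

SELECT DATEADD(second, 1201928400, '19700101') AS utc_time;
-- returns 2008-02-02 05:00:00, i.e., February 2, 2008, 05:00 UTC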



Data Quality

• Data almost never has decent quality


• Data in DW must be:
 Precise
 DW data must match known numbers - or explanation needed
 Complete
 DW has all relevant data and the users know
 Consistent
 No contradictory data: aggregates fit with detail data
 Unique
 The same thing is called the same and has the same key
(customers)
 Timely
 Data is updated ”frequently enough” and the users know when



The high cost of low quality data
• Wrong price data in retail databases may cost American
consumers as much as $2.5 billion in overcharges annually. Data
audits show 4 out of 5 errors in prices are overcharges.
(Information Week Sept 1992)
• The Gartner Group estimated the worldwide cost to modify
software and change databases to fix the Y2K problem at $400-
$600 billion. T. Capers Jones says this estimate is low; it should be
$1.5 trillion. The cost to fix this single pervasive error is one
eighth of the US federal deficit ($8 trillion, Oct 2005).
• Another way to look at it: the 50 most profitable companies in
the world earned a combined $178 billion in profits in 1996. If
the entire profit of these companies were used to fix the problem,
it would only fix about 12% of the problem.
• And MS Excel, in year 2000, still regarded 1900 as a leap year
(which it is not).
Cleaning

• Why cleaning? Garbage In Garbage Out


• BI does not work on “raw” data
 Pre-processing necessary for BI analysis
• Handle inconsistent data formats
 Spellings, codings, …
• Remove unnecessary attributes
 Production keys, comments,…
• Replace codes with text (Why?)
 City name instead of ZIP code, e.g., Aalborg Centrum vs. DK-9000
• Combine data from multiple sources with common key
 E.g., customer data from customer address, customer name, …



Types of cleaning

• Conversion and normalization


 Most common type of cleaning
 Text coding, date formats
 e.g., 3/2/2008 means 3rd February or 2nd March?
• Special-purpose cleaning
 Look-up tables, dictionaries to find valid data, synonyms, abbreviations
 Normalize spellings of names, addresses, etc.
 Remove duplicates, e.g., duplicate customers
• Domain-independent cleaning
 Approximate, “fuzzy” joins on records from different sources
 E.g., two customers are regarded as the same if their respective values match for
most of the attributes (e.g., address, phone number); see the sketch after this list
• Rule-based cleaning
 User-specified rules: if-then style
 Automatic rules: use data mining to find patterns in data
 Guess missing sales person based on customer and item
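
A hedged sketch of a domain-independent duplicate check in T-SQL, using the built-in SOUNDEX-based DIFFERENCE function for approximate name matching plus an exact match on phone number (table, columns and threshold are illustrative):

-- Candidate duplicate customers: names sound alike (DIFFERENCE = 4
-- is the strongest match on a 0-4 scale) and phone numbers agree.
SELECT a.customer_id, b.customer_id, a.name, b.name
FROM stage_customer AS a
JOIN stage_customer AS b
  ON a.customer_id < b.customer_id    -- avoid self- and mirror-pairs
 AND a.phone = b.phone
WHERE DIFFERENCE(a.name, b.name) = 4;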



Cleaning
• Should a “special” value (e.g., 0, -1) be used in your data?
 Why this issue is relevant to query/analysis operations?
• Mark facts with Data Status dimension
 Normal, abnormal, outside bounds, impossible,…
 Facts can be taken in/out of analyses
• Uniform treatment of NULL
 Use explicit NULL value rather than “special” value (0,-1,…)
 Use NULLs only for measure values (estimates instead?)
 Use special dimension key (i.e., surrogate key value) for NULL
dimension values
 E.g., for the time dimension, instead of NULL, use special key values to
represent “Date not known”, “Soon to happen” (see the sketch after this list)
 Avoid problems in joins, since NULL is not equal to NULL
• Mark facts with changed status
 New customer, Customer about to cancel contract, ……
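
A minimal sketch of the special-key convention during a fact load, assuming a reserved row with surrogate key -1 (“Date not known”) already exists in the time dimension (names and the reserved value are illustrative):

-- Map missing dates to the reserved 'Date not known' dimension row
-- instead of storing NULL foreign keys in the fact table.
INSERT INTO fact_sales (time_key, product_key, amount)
SELECT COALESCE(t.time_key, -1),    -- -1 = 'Date not known' row
       p.product_key,
       s.amount
FROM stage_sales AS s
LEFT JOIN dim_time    AS t ON t.calendar_date = s.sale_date
JOIN      dim_product AS p ON p.product_nk    = s.product_nk;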



Improving Data Quality

• Appoint “data quality administrator”


 Responsibility for data quality
 Includes manual inspections and corrections!
• Source-controlled improvements
 The optimal?
• Construct programs that check data quality
 Are totals as expected?
 Do results agree with alternative source?
 Number of NULL values? (see the sketch after this list)
• Do not fix all problems with data quality
 Allow management to see “weird” data in their reports?
 Such data may be meaningful for them? (e.g., fraud detection)
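
Such checking programs can often be plain SQL; a sketch of two of the checks above, with illustrative names:

-- Are totals as expected? Compare per-month totals against a
-- known figure or an alternative source.
SELECT month_key, SUM(amount) AS total_sales
FROM fact_sales
GROUP BY month_key;

-- Number of NULL measure values in the latest load.
SELECT COUNT(*) AS null_amounts
FROM fact_sales
WHERE amount IS NULL;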



Load

• Goal: fast loading into DW


 Loading deltas is much faster than total load
• SQL-based update is slow
 Large overhead (optimization, locking, etc.) for every SQL call
 DB load tools are much faster
• Index on tables slows load a lot
 Drop index and rebuild after load (see the sketch after this list)
 Can be done per index partition
• Parallelization
 Dimensions can be loaded concurrently
 Fact tables can be loaded concurrently
 Partitions can be loaded concurrently
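
Dropping and rebuilding indexes around a bulk load, sketched in SQL Server 2005 T-SQL (index, table, and file names are illustrative):

-- Disable nonclustered indexes before the load ...
ALTER INDEX ix_fact_sales_product ON fact_sales DISABLE;

-- ... load with a bulk tool rather than row-by-row SQL ...
BULK INSERT fact_sales
FROM 'C:\dsa\fact_sales_delta.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

-- ... then rebuild (can also be done per partition).
ALTER INDEX ix_fact_sales_product ON fact_sales REBUILD;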



Load

• Relationships in the data


 Referential integrity and data consistency must be ensured
before loading (Why?)
 Can be done by loader
• Aggregates
 Can be built and loaded at the same time as the detail data
• Load tuning
 Load without log
 Sort load file first
 Make only simple transformations in loader
 Use loader facilities for building aggregates



ETL Tools

• ETL tools from the big vendors


 Oracle Warehouse Builder
 IBM DB2 Warehouse Manager
 Microsoft Integration Services
• Offer much functionality at a reasonable price
 Data modeling
 ETL code generation
 Scheduling DW jobs
• The “best” tool does not exist
 Choose based on your own needs



Vendor ETL tools

• DataStage (IBM)
 big player, high ability to execute, gets good results from Gartner, visionary
• Informatica
 another big player, gets good results from Gartner, visionary with high ability to
execute
 most expensive
• SAS ETL Server
 fast becoming a major player, very positive results from Gartner
 low exposure as an ETL tool (SAS a significant statistical analysis vendor)
• Information Builder's Data Migrator/ETL Manager tool
suite
 part of Enterprise Focus/WebFocus
 not a major player but industrial strength language, data connectors, etc
• Sunopsis
 cheap
 relies on native RDBMS functionality
 CIGNA people exposed to it at conferences liked it


Issues
• Pipes
 Redirect output from one process to input of another process
ls | grep 'a' | sort -r
• Files versus streams/pipes
 Streams/pipes: no disk overhead, fast throughput
 Files: easier restart, often only possibility
• Use ETL tool or write ETL code
 Code: easy start, co-existence with IT infrastructure
 Tool: better productivity on subsequent projects
• Load frequency
 ETL time dependent on data volumes
 Daily load is much faster than monthly
 Applies to all steps in the ETL process



MS Integration Services

• A concrete ETL tool


• Example ETL flow
• Demo



Integration Services (IS)

• Microsoft’s ETL tool


 Part of SQL Server 2005
• Tools
 Import/export wizard - simple transformations
 BI Development Studio - advanced development
 New/Open an Integration Services Project
• Functionality available in several ways
 Through GUI - basic functionality
 Programming - advanced functionality



Packages

• A package is stored as a “dtsx” file


• Encapsulation, reuse
• The central concept in IS
• A package is a collection of:
 Data flow: Sources,
Destinations, Transformations
 Connections
 Control flow: Tasks, Workflows
 Variables
 ……



Package Control Flow
• “Containers” provide
 Structure to packages
 Services to tasks
• Control flow
 Foreach loop container
 Repeat tasks by using an enumerator
 For loop container
 Repeat tasks by testing a condition
 Sequence container
 Groups tasks and containers into
control flows that are subsets of the
package control flow
• Task host container
 Provides services to a single task
Arrows:
green (success)
red (failure)
Tasks

• Workflow Tasks
 Execute package – execute other IS packages, good for structure!
 Execute Process – run external application/batch file
• SQL Server Tasks
 Bulk insert – fast load of data
 Execute SQL – execute any SQL query
• Data Flow – runs data flows
• Data Preparation Tasks
 File System – operations on files
 FTP – up/down-load data
• Scripting Tasks
 Script – execute VB .NET code
(Why are there so many different tasks?)
• Maintenance Tasks – DB maintenance
A Simple IS Case
• Use BI Dev Studio/Import Wizard to
copy FClub tables
• Look at package structure
 Available from mini-project web
page
• Look at package parts
 DROP, CREATE, source,
transformation, destination
• Execute package
 Error messages?
• Steps execute in parallel
 Dependencies can be set up



Event Handlers

• Executables (packages, containers) can raise events
• Event handlers manage the events
• Similar to event handlers in languages such as Java and C#



Data Flow Elements
• Sources
 Makes external data available
 All ODBC/OLE DB data sources:
RDBMS, Excel, Text files, ……
• Transformations
 Update, summarize, cleanse, merge,
distribute
• Destinations
 Write data to specific store
 Create in-memory data set
• Input, Output, Error output



ETL Demo Session
1. Load data into the Product dimension table
   1. Use the SSIS Import and Export Wizard
   2. Construct the DW key for the table by using “IDENTITY”
   3. Copy data to the Product dimension table
2. Load data into the Member dimension table
3. Copy the “raw” Sale fact table (for convenience only)
4. Load data into the NewSale fact table
   1. Join “raw” sales table with other tables to get DW keys for each sales record
   2. Output of the query written into the fact table
5. Create a StartUp package that organizes the 4 packages above

[Figure: Product → CP_Product, Member → CP_Member, Sale → CP_Sale; CP_Member, CP_Product, and CP_Sale are joined to load NewSale.]



ETL Demo Session Today
• Method 1: Copy a source table into the destination table, or
• Sample SQL for creating the Member dimension table:
CREATE TABLE CP_Member (
[dw_member_id] int IDENTITY,
[id] int NOT NULL,
[balance] int NOT NULL
……
)
• Method 2: Write into the destination table the result of a SQL query
• Sample SQL for inserting data into the NewSale fact table:
select
CP_Member.dw_member_id, CP_Product.dw_product_id,
sum(CP_Sale.price) as sales
from
CP_Sale, CP_Member, CP_Product
where
CP_Sale.member_id=CP_Member.id and
CP_Sale.product_id=CP_Product.id
group by
CP_Member.dw_member_id, CP_Product.dw_product_id
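
Wrapped as an INSERT, writing the query result into the fact table could look like this sketch (the NewSale column list is assumed to match the demo schema):

INSERT INTO NewSale (dw_member_id, dw_product_id, sales)
SELECT CP_Member.dw_member_id, CP_Product.dw_product_id,
       SUM(CP_Sale.price) AS sales
FROM CP_Sale, CP_Member, CP_Product
WHERE CP_Sale.member_id = CP_Member.id
  AND CP_Sale.product_id = CP_Product.id
GROUP BY CP_Member.dw_member_id, CP_Product.dw_product_id;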
A Few Hints on ETL Design
• Don’t implement all transformations in one step!
 Build first step and check that result is as expected
 Add second step and execute both, check result (How to check?)
 Add third step ……
• Test SQL statements before putting into IS
• Do one thing at a time
 Copy source data one-by-one to the data staging area (DSA)
 Compute deltas
 Only if doing incremental load
 Handle versions and DW keys
 Versions only if handling slowly changing dimensions
 Implement complex transformations
 Load dimensions
 Load facts



Summary

• The ETL Process


• General ETL issues
 Building dimensions
 Building fact tables
 Extract
 Transformations/cleaning
 Load
• MS Integration Services



ETL Part of Mini Project
• Core:
 Build an ETL flow using MS DTS that can do an initial (first-time) load of
the data warehouse
 Include logic for generating special DW surrogate integer keys for the
tables
 Discuss and implement basic transformations/data cleaning
• Extensions:
 Extend the ETL flow to handle incremental loads, i.e., updates to the
DW, both for dimensions and facts
 Extend the DW design and the ETL logic to handle slowly changing
dimensions of Type 2
 Implement more advanced transformations/data cleaning
 Perform error handling in the ETL flow

