ETL Introduction
• Reading materials
Kimball ch. 10;
Jarke ch. 4.1-4.3;
Supplementary: Designing and Implementing Packages Tutorials
http://msdn.microsoft.com/en-us/library/ms167031.aspx
Aalborg University 2008 - DWDM course 2
The ETL Process
[Figure: the ETL process. Labels: sources Raw-Product (spreadsheet) and Raw-Sales (RDBMS); services Extract, Transform, Load; data service element; presentation servers (data marts with aggregate-only data); desktop data access tools; warehouse browsing, access and security, query management, standard reporting, data mining.]
-- List attribute combinations that occur more than once in the dimension source
select A, B, C, count(*)
from DimensionTableSource
group by A, B, C
having count(*) > 1
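A minimal sketch of running the duplicate check above from Python, assuming an in-memory SQLite database. The sample rows and the helper name find_duplicates are made up for illustration; only the table and column names come from the query.

```python
import sqlite3


def find_duplicates(conn):
    """Return (A, B, C, count) for every combination that occurs more than once."""
    return conn.execute(
        "SELECT A, B, C, COUNT(*) "
        "FROM DimensionTableSource "
        "GROUP BY A, B, C "
        "HAVING COUNT(*) > 1 "
        "ORDER BY A, B, C"
    ).fetchall()


# Hypothetical sample data: one combination appears twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DimensionTableSource (A TEXT, B TEXT, C TEXT)")
conn.executemany(
    "INSERT INTO DimensionTableSource VALUES (?, ?, ?)",
    [("x", "y", "z"), ("x", "y", "z"), ("p", "q", "r")],
)
duplicates = find_duplicates(conn)
print(duplicates)
```

An empty result means every (A, B, C) combination is unique in the source.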
If they use the same dimensions, then the dimensions are conformed
Data Delivery
All the steps required to deal with slowly changing dimensions
Write the dimension to the physical table
Creating and assigning the surrogate key, making sure the natural key is correct, etc.
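The delivery steps can be sketched roughly as follows. This is an illustration under assumptions, not the course's method: it keeps history by adding a new row version when attributes change (often called a type 2 slowly changing dimension), and the names DimRow, Dimension and deliver are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimRow:
    surrogate_key: int
    natural_key: str
    attributes: dict
    valid_from: str
    valid_to: Optional[str] = None  # None marks the current version


class Dimension:
    """In-memory stand-in for the physical dimension table (assumption)."""

    def __init__(self):
        self.rows = []
        self._next_key = 1

    def current(self, natural_key):
        """Return the current version for a natural key, or None."""
        for row in self.rows:
            if row.natural_key == natural_key and row.valid_to is None:
                return row
        return None

    def deliver(self, natural_key, attributes, load_date):
        """Return the surrogate key for a source row, adding a version if needed."""
        existing = self.current(natural_key)
        if existing is not None:
            if existing.attributes == attributes:
                return existing.surrogate_key  # nothing changed
            existing.valid_to = load_date  # close the old version
        row = DimRow(self._next_key, natural_key, dict(attributes), load_date)
        self._next_key += 1
        self.rows.append(row)
        return row.surrogate_key


dim = Dimension()
k1 = dim.deliver("P-100", {"city": "Aalborg"}, "2008-01-01")
k2 = dim.deliver("P-100", {"city": "Aalborg"}, "2008-02-01")  # unchanged
k3 = dim.deliver("P-100", {"city": "Aarhus"}, "2008-03-01")   # changed
print(k1, k2, k3)
```

The surrogate key is generated by the warehouse and is independent of the natural key, so one natural key can map to several row versions.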
Building Fact Tables
• Two types of load
• Initial load of historic data
ETL for all data up till now
Done when the DW is started for the first time, manually
Very heavy - large data volumes
• Incremental update
Move only changes since last load
Done periodically (e.g., month or week) after DW start, automatically
Less heavy - smaller data volumes
• Dimensions must be updated before facts
The relevant dimension rows for new facts must be in place
Special key considerations if initial load must be performed again
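A small sketch of the two load types, assuming the source rows carry a change timestamp. The table names src_sale, dim_product and fact_sale, the changed_at column and the function incremental_load are all hypothetical. The initial load is simply the same routine run with an empty high-water mark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_sale (sale_id INTEGER, product_code TEXT, amount REAL, changed_at TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY AUTOINCREMENT,
                          product_code TEXT UNIQUE);
CREATE TABLE fact_sale (sale_id INTEGER PRIMARY KEY, product_key INTEGER, amount REAL);
""")


def incremental_load(conn, last_load):
    """Load source rows changed after last_load; return the new high-water mark."""
    changed = conn.execute(
        "SELECT sale_id, product_code, amount, changed_at FROM src_sale "
        "WHERE changed_at > ? ORDER BY changed_at",
        (last_load,),
    ).fetchall()
    # Step 1: update the dimension first, so every new fact finds its row.
    for _, code, _, _ in changed:
        conn.execute(
            "INSERT OR IGNORE INTO dim_product (product_code) VALUES (?)", (code,)
        )
    # Step 2: load the facts, replacing natural keys with surrogate keys.
    for sale_id, code, amount, _ in changed:
        (key,) = conn.execute(
            "SELECT product_key FROM dim_product WHERE product_code = ?", (code,)
        ).fetchone()
        conn.execute(
            "INSERT OR REPLACE INTO fact_sale VALUES (?, ?, ?)", (sale_id, key, amount)
        )
    conn.commit()
    return changed[-1][3] if changed else last_load


conn.executemany(
    "INSERT INTO src_sale VALUES (?, ?, ?, ?)",
    [(1, "A", 10.0, "2008-01-05"), (2, "B", 20.0, "2008-01-20")],
)
mark = incremental_load(conn, "")    # initial load: all data up till now
conn.execute("INSERT INTO src_sale VALUES (3, 'A', 5.0, '2008-02-03')")
mark = incremental_load(conn, mark)  # incremental update: only the new row
fact_count = conn.execute("SELECT COUNT(*) FROM fact_sale").fetchone()[0]
dim_count = conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0]
print(mark, fact_count, dim_count)
```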
Types of Data Sources
• Non-cooperative sources
Snapshot sources – provide only a full copy of the source, e.g., files
Specific sources – each is different, e.g., legacy systems
Logged sources – write a change log, e.g., DB log
Queryable sources – provide a query interface, e.g., RDBMS
• Cooperative sources
Replicated sources – publish/subscribe mechanism
Call back sources – call external code (ETL) when changes occur
Internal action sources – only internal actions when changes occur
DB triggers are an example
• Extract strategy depends on the source types
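As one example of how the source type drives the extract strategy: a snapshot source only hands over full copies, so changes have to be derived by comparing two snapshots. The sketch below assumes each snapshot fits in memory as a dict keyed by primary key; the function name snapshot_delta and the sample rows are hypothetical.

```python
def snapshot_delta(old, new):
    """Compare two full snapshots keyed by primary key.

    Returns (inserted, updated, deleted): two dicts of new row values
    and a sorted list of keys that disappeared.
    """
    inserted = {k: v for k, v in new.items() if k not in old}
    updated = {k: v for k, v in new.items() if k in old and old[k] != v}
    deleted = sorted(k for k in old if k not in new)
    return inserted, updated, deleted


# Hypothetical snapshots of a product file on two consecutive days.
yesterday = {1: ("Beer", 10), 2: ("Wine", 25), 3: ("Soda", 5)}
today = {1: ("Beer", 12), 3: ("Soda", 5), 4: ("Juice", 8)}

inserted, updated, deleted = snapshot_delta(yesterday, today)
print(inserted, updated, deleted)
```

Logged, queryable and cooperative sources can deliver the changes directly, which avoids this full comparison.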
• DataStage (IBM)
big player, high ability to execute, gets good results from Gartner, visionary
• Informatica
another big player, gets good results from Gartner, visionary with high ability to execute
most expensive
• SAS ETL Server
fast becoming a major player, very positive results from Gartner
low exposure as an ETL tool (SAS a significant statistical analysis vendor)
• Information Builders' Data Migrator/ETL Manager tool suite
part of Enterprise Focus/WebFocus
not a major player, but industrial-strength language, data connectors, etc.
• Sunopsis
cheap
relies on native RDBMS functionality
CIGNA people exposed to it at conferences liked it
• Workflow Tasks
Execute package – execute other IS packages, good for structure!
Execute Process – run external application/batch file
• SQL Server Tasks
Bulk insert – fast load of data
Execute SQL – execute any SQL query
• Data Flow – runs data flows
• Data Preparation Tasks
File System – operations on files
FTP – upload/download data
• Scripting Tasks
Script – execute VB .NET code
• Maintenance Tasks – DB maintenance
Why are there so many different tasks?
A Simple IS Case
• Use BI Dev Studio/Import Wizard to
copy FClub tables
• Look at package structure
Available from mini-project web page
• Look at package parts
DROP, CREATE, source, transformation, destination
• Execute package
Error messages?
• Steps execute in parallel
Dependencies can be set up
[Figure: package data flows, labels Product / CP_Product and Sale / CP_Sale]
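The interplay between parallel execution and dependencies can be illustrated outside Integration Services. The sketch below is not how IS schedules tasks; it uses Python's standard graphlib module to group steps into waves, where every step in a wave has all its prerequisites done and could run in parallel. The step names are hypothetical, loosely modelled on the DROP, CREATE and copy parts of the package.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group steps into waves; steps within one wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all steps whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Each step maps to the set of steps that must finish before it (hypothetical names).
dependencies = {
    "create CP_Product": {"drop CP_Product"},
    "create CP_Sale": {"drop CP_Sale"},
    "copy Product": {"create CP_Product"},
    "copy Sale": {"create CP_Sale"},
}

waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Without the dependencies, a copy step could start before its target table has been created, which is one plausible source of error messages when a package runs.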