02 - Basic Data Warehousing & Architectures
[Figure: source (OLTP) schema with the tables TransactionLedger, CategoryList, BrandList, ProductList, and SKUList]
[Figure: warehouse data model with SalesFact (NetPrice, QuantitySold), DateDim, ProductDim, and StoreDim]
Facts collected by CMI POS: Quantity Sold, Net Price, Price, and Discount
Dimensional Modeling Components
FACT TABLE
• Primary table which stores the performance measurements of the
business
• The term “fact” refers to a business measure
• Each row in a fact table corresponds to a specific measurement
• Each measurement is taken at the intersection of all the relevant
dimensions (e.g., day, product, and store) – this list of dimensions
defines the “grain” of the fact table
Dimensional Modeling Components
FACT TABLE
• All measurements in a fact table must be at the same grain
• Facts are either additive, semiadditive, or nonadditive – most are
numeric
• Contains two or more foreign keys to dimension tables
• Expresses the many-to-many relationships between dimensions in
dimensional models
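As a rough illustration of the three kinds of facts, the sketch below uses made-up rows at the grain day x product x store. The column names (quantity_sold, on_hand, unit_price) and all values are assumptions for illustration only and are not taken from the CMI schema.

```python
from collections import defaultdict

# Hypothetical fact rows, one per (day, sku, store): the grain of the table.
# Columns: day, sku, store, quantity_sold, on_hand, unit_price
rows = [
    ("2024-01-01", "A", 1, 3, 10, 5.00),
    ("2024-01-01", "B", 1, 2, 7, 8.00),
    ("2024-01-02", "A", 1, 4, 6, 5.00),
    ("2024-01-02", "B", 1, 1, 6, 8.00),
]

# Additive: quantity sold can be summed across every dimension.
total_quantity = sum(r[3] for r in rows)

# Semi-additive: on-hand inventory can be summed across products within
# one day, but adding the levels of different days is meaningless.
on_hand_by_day = defaultdict(int)
for day, _sku, _store, _qty, on_hand, _price in rows:
    on_hand_by_day[day] += on_hand

# Non-additive: unit prices cannot be summed. An average price is
# derived from additive quantities instead (revenue / quantity).
revenue = sum(r[3] * r[5] for r in rows)
average_price = revenue / total_quantity

print(total_quantity, dict(on_hand_by_day), average_price)
```

Storing the additive components (quantity and extended amounts) and deriving ratios at query time is one common way to keep every stored measure summable.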
Dimensional Modeling Components
DESIGN GUIDELINES
• Look at the OLTP schema (or available extracts) to identify
possible measures
• Fields that can be manipulated (sum, avg) to generate useful information
• Sometimes you need to convert a field to make it measurable (e.g., ‘Y’/‘N’ to 1/0)
• Values should be related to the grain of the fact table.
• Determine the lowest grain possible
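The ‘Y’/‘N’ conversion mentioned above can be sketched as follows. The function name flag_to_measure and the "returned" flag are hypothetical; the point is only that a 1/0 value can be summed and averaged like any other measure.

```python
def flag_to_measure(flag: str) -> int:
    """Map a 'Y'/'N' source flag to 1/0 so it can be summed or averaged."""
    normalized = flag.strip().upper()
    if normalized == "Y":
        return 1
    if normalized == "N":
        return 0
    raise ValueError(f"unexpected flag value: {flag!r}")


# Hypothetical extract rows: (order_id, returned_flag)
extract = [(1, "Y"), (2, "N"), (3, "N"), (4, "Y"), (5, "N")]

returned = [flag_to_measure(flag) for _order_id, flag in extract]
return_count = sum(returned)                 # count of 'Y' rows
return_rate = return_count / len(returned)   # share of 'Y' rows

print(return_count, return_rate)
```

Rejecting unexpected values instead of silently mapping them to 0 is a design choice here; a real load process might route such rows to an error table instead.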
The Warehouse Data Model
[Figure: warehouse star schema]
• DateDim: PK Date, CharDate, (Other Date Info...)
• ProductDim: PK SKU, ProductNumber, CategoryCode, BrandID, BrandName
• SalesFact: PK FK SKU, PK FK Date, NetPrice
• StoreDim: PK StoreNumber, StoreName, RegionNumber, (Region Info...), DivisionNumber, DivisionName, (Other Division Info...)
Dimensional Modeling Components
DIMENSION TABLES
• Contain the textual descriptors of the business
• Usually low in cardinality, but very wide (50-100 attributes not
uncommon)
• Dimension attributes used as query constraints, groupings, and
report labels
• The more descriptive the dimension attributes, the better
• Often contain hierarchical relationships (city=>state=>region)
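To show why hierarchical attributes matter for grouping, here is a minimal sketch that rolls the same fact rows up to each level of a city => state => region hierarchy. The store rows, region names, and quantities are invented for illustration.

```python
from collections import defaultdict

# Hypothetical store dimension: store key -> descriptive attributes.
store_dim = {
    1: {"city": "Louisville", "state": "KY", "region": "South"},
    2: {"city": "Lexington", "state": "KY", "region": "South"},
    3: {"city": "Columbus", "state": "OH", "region": "Midwest"},
}

# Hypothetical fact rows: (store_key, quantity_sold)
sales = [(1, 5), (2, 3), (3, 7), (1, 2)]


def roll_up(level: str) -> dict:
    """Sum quantity_sold grouped by one attribute of the store dimension."""
    totals = defaultdict(int)
    for store_key, quantity in sales:
        totals[store_dim[store_key][level]] += quantity
    return dict(totals)


for level in ("city", "state", "region"):
    print(level, roll_up(level))
```

Because every level of the hierarchy is stored on the dimension row, moving between city, state, and region totals only changes the grouping attribute; the fact rows stay untouched.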
Dimensional Modeling Components
DESIGN GUIDELINES
• Look at the OLTP schema (or available extracts) to identify
possible attributes
• Look for hierarchies. They are key to successful reporting.
• Include codes and descriptions. A dimension can carry multiple formats of
descriptions
• Think in terms of reporting. What descriptions do you need to provide
user-friendly reports? Make it easy for the user to create reports
• Too much data is better than too little
• Determine the lowest grain possible.
Dimensional Modeling Components
• Fact Table + Dimension Tables = Dimensional Model (Star
Schema)
• Benefits of dimensional model
• Simplicity
• Easy for business users to understand
• Improved query performance
• Extensibility
• Easily accommodates change (but not that easily!)
Cardinal Merch Preliminary Star Schema
[Figure: preliminary star schema]
• Dim_Product: Product Attributes
• Fact_Sales: PK FK Store_Number, PK FK Date, Item_Sale_Price, Total_Sales, Quantity
• Dim_Store: Store Attributes
• Dim_Date: PK Calendar Date, Date Attributes
Date Dimension
[Figure: the preliminary star schema repeated; Dim_Date has PK Calendar Date and Date Attributes]
Surrogate Keys
• It is highly recommended to use surrogate keys for dimension table keys
• Surrogate keys are simply integers assigned sequentially to a particular
dimension row
• Operational codes (e.g., SKU number) are frequently the natural keys
of the source system and are used to look up the surrogate key to use
• Operational codes are also retained for analysis purposes
• Many dimensions will have a surrogate key of -1 to indicate no value in
the Dimension
Surrogate Keys
• Benefits of surrogate keys:
• Buffer the data warehouse from changes in operational codes
• Can save space due to their small size compared to operational codes
• Allow recording of conditions which do not have an operational code (e.g., “No
Promotion”)
• Allow handling of changes to dimension table attributes (to be discussed later)
• The main disadvantage of using surrogate keys is that it requires some
effort to implement
• ALWAYS, ALWAYS, ALWAYS USE SURROGATE KEYS!
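A minimal sketch of surrogate key handling, assuming an in-memory lookup: sequential integers are assigned per natural key, and -1 stands for "no value in the dimension". The class name SurrogateKeyMap and the sample SKUs are hypothetical; a real warehouse would typically keep this mapping in the dimension table itself.

```python
class SurrogateKeyMap:
    """Assign sequential integer surrogate keys to natural (operational) keys."""

    UNKNOWN_KEY = -1  # surrogate key meaning "no value in the dimension"

    def __init__(self):
        self._keys = {}      # natural key -> surrogate key
        self._next_key = 1   # next integer to hand out

    def assign(self, natural_key):
        """Return the existing surrogate key, or create the next one."""
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next_key
            self._next_key += 1
        return self._keys[natural_key]

    def lookup(self, natural_key):
        """Resolve a natural key during fact loading; unknown or missing -> -1."""
        if natural_key is None:
            return self.UNKNOWN_KEY
        return self._keys.get(natural_key, self.UNKNOWN_KEY)


product_keys = SurrogateKeyMap()
first = product_keys.assign("SKU-100")
second = product_keys.assign("SKU-200")
again = product_keys.assign("SKU-100")   # same natural key, same surrogate key

print(first, second, again, product_keys.lookup(None), product_keys.lookup("SKU-999"))
```

The fact table stores only the integer, so a change to the operational code affects one dimension row and leaves the fact rows unchanged.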
Star Schema with Surrogate Keys
[Figure: star schema with surrogate keys]
• Dim_Product: SKU, Product Attributes
• Fact_Sales: PK FK StoreKey, PK FK DateKey, Item_Sale_Price, Total_Sales, Quantity
• Dim_Store: Store_Number, Store Attributes
• Dim_Date: PK DateKey, Calendar Date, Date Attributes
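The schema above could be sketched in SQLite roughly as follows. The table names and StoreKey/DateKey come from the slide; ProductKey, Category, Region, Year_Month, and all sample rows are assumptions added so that the example can run and be queried.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Product (ProductKey INTEGER PRIMARY KEY, SKU TEXT, Category TEXT);
CREATE TABLE Dim_Store   (StoreKey INTEGER PRIMARY KEY, Store_Number TEXT, Region TEXT);
CREATE TABLE Dim_Date    (DateKey INTEGER PRIMARY KEY, Calendar_Date TEXT, Year_Month TEXT);

-- The fact table's primary key is the combination of its dimension keys.
CREATE TABLE Fact_Sales (
    ProductKey INTEGER REFERENCES Dim_Product(ProductKey),
    StoreKey   INTEGER REFERENCES Dim_Store(StoreKey),
    DateKey    INTEGER REFERENCES Dim_Date(DateKey),
    Item_Sale_Price REAL,
    Quantity        INTEGER,
    Total_Sales     REAL,
    PRIMARY KEY (ProductKey, StoreKey, DateKey)
);

INSERT INTO Dim_Product VALUES (1, 'SKU-001', 'Shirts'), (2, 'SKU-002', 'Hats');
INSERT INTO Dim_Store   VALUES (1, 'S100', 'East'), (2, 'S200', 'West');
INSERT INTO Dim_Date    VALUES (1, '2024-01-15', '2024-01'), (2, '2024-02-15', '2024-02');
INSERT INTO Fact_Sales  VALUES
    (1, 1, 1, 10.0, 2, 20.0),
    (2, 1, 1,  5.0, 1,  5.0),
    (1, 2, 1, 10.0, 3, 30.0),
    (1, 1, 2, 10.0, 1, 10.0);
""")

# A typical star join: constrain and group by dimension attributes,
# aggregate the measures stored in the fact table.
sales_by_region_month = con.execute("""
    SELECT s.Region, d.Year_Month, SUM(f.Total_Sales)
    FROM Fact_Sales AS f
    JOIN Dim_Store AS s ON s.StoreKey = f.StoreKey
    JOIN Dim_Date  AS d ON d.DateKey  = f.DateKey
    GROUP BY s.Region, d.Year_Month
    ORDER BY s.Region, d.Year_Month
""").fetchall()

print(sales_by_region_month)
```

Every query against the star follows the same shape: the fact table in the middle, one join per dimension that the question mentions.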
Star Schema Size Analysis
• Product Dimension
• 150,000 products x 1 KB per row = 150 MB
• Date Dimension
• 3,650 dates (10 years) x 1 KB per row = 3.65 MB
• Store Dimension
• 100 stores x 2 KB per row = 0.2 MB
• Promotion Dimension
• 5,000 promotions x 1 KB per row = 5 MB
• Total Dimensions = 158.85 MB
Star Schema Size Analysis
• Fact Table
• Assume 10,000 transactions per day across all stores
• 10,000 purchases x 3,650 days x 10 products per purchase x 1
promotion per purchase
• 365,000,000 records x 1 KB per record = 365 GB
• Total Size = Fact + Dimensions
• Total Size = 365,000 MB + 158.85 MB ≈ 365.2 GB
• Sizing rule – when calculating size, the size of the dimension
tables can usually be ignored.
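The sizing arithmetic can be checked with a few lines of code. This sketch uses decimal units (1 MB = 1,000 KB, 1 GB = 1,000 MB) and treats the 10,000 transactions as a daily total, since that is what the multiplication on the slide uses; computed exactly, the date dimension comes to 3.65 MB and the dimensions to 158.85 MB in total.

```python
KB_PER_MB = 1_000
MB_PER_GB = 1_000

# Dimension name -> (row count, KB per row), taken from the slide.
dimensions = {
    "product": (150_000, 1),
    "date": (3_650, 1),
    "store": (100, 2),
    "promotion": (5_000, 1),
}
dimension_mb = {
    name: rows * kb_per_row / KB_PER_MB
    for name, (rows, kb_per_row) in dimensions.items()
}
total_dimension_mb = sum(dimension_mb.values())

# Fact table: one row per product on each purchase, 1 KB per row.
transactions_per_day = 10_000
days = 3_650
products_per_transaction = 10
fact_rows = transactions_per_day * days * products_per_transaction
fact_mb = fact_rows * 1 / KB_PER_MB

total_gb = (fact_mb + total_dimension_mb) / MB_PER_GB
dimension_share = total_dimension_mb / (fact_mb + total_dimension_mb)

print(f"dimensions: {total_dimension_mb:.2f} MB")
print(f"fact rows:  {fact_rows:,}")
print(f"total:      {total_gb:.1f} GB")
print(f"dimension share of total: {dimension_share:.4%}")
```

The dimension share comes out well under a tenth of a percent, which is the basis for the sizing rule.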
Data Warehousing Architectures
• There are several architectures for data warehousing: two-tier,
three-tier, and sometimes one-tier.
• One can distinguish among them by dividing the data warehouse
into three parts:
• The data warehouse itself that contains the data and associated
software
• Data acquisition (back-end) software that extracts data from legacy
systems and external sources, consolidates and summarizes them,
and loads them into the data warehouse
• Client (front-end) software that allows users to access and analyze
data from the warehouse
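The three parts can be pictured with a toy pipeline. Everything here is hypothetical (the legacy rows, the sales_summary table, the function names); it only shows where acquisition, the warehouse itself, and client access sit relative to each other.

```python
import sqlite3

# Back end (data acquisition): extract rows from a hypothetical legacy
# source, then consolidate and summarize them before loading.
legacy_rows = [
    {"sku": "A", "store": 1, "qty": 2},
    {"sku": "A", "store": 1, "qty": 3},
    {"sku": "B", "store": 2, "qty": 1},
]


def acquire(rows):
    """Summarize quantity per (sku, store) and return loadable tuples."""
    summary = {}
    for row in rows:
        key = (row["sku"], row["store"])
        summary[key] = summary.get(key, 0) + row["qty"]
    return [(sku, store, qty) for (sku, store), qty in sorted(summary.items())]


# The warehouse itself: the database that holds the loaded data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_summary (sku TEXT, store INTEGER, qty INTEGER)")
warehouse.executemany("INSERT INTO sales_summary VALUES (?, ?, ?)", acquire(legacy_rows))


# Front end (client): a query that lets a user analyze warehouse data.
def quantity_for_sku(sku):
    (total,) = warehouse.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM sales_summary WHERE sku = ?", (sku,)
    ).fetchone()
    return total


print(quantity_for_sku("A"), quantity_for_sku("B"))
```

In a tiered deployment these three pieces may run on one machine or be spread across several, which is one way the architectures differ.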
Generic Warehouse Architecture
[Figure: generic warehouse architecture. Components: Client, Query & Analysis, Warehouse, Metadata, Integrator, ... Activities: Design Phase, Loading, Maintenance, Optimization]
Three-layer Architecture: Conceptual View
• Transformation of real-time data to derived data really requires
two steps
[Figure: three-layer view spanning operational systems and informational systems]
• Derived data: view level (“particular informational needs”)
• Reconciled data: physical implementation of the data warehouse
• Real-time data
Data Warehousing Architectures
DW Architecture: Conceptual View
• Single-layer
[Figure: operational systems and informational systems over a single layer of real-time data]
Data Warehousing Architectures
Issues to consider when deciding which architecture to use:
• Which database management system (DBMS) should be used?
• Will parallel processing and/or partitioning be used?
• Will data migration tools be used to load the data warehouse?
• What tools will be used to support data retrieval and analysis?
Data Warehousing Architectures
Ten factors that potentially affect the architecture selection
decision:
1. Information interdependence between organizational units
2. Upper management’s information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors
Alternative Data Warehouse Architectures
Assignment
Complete the 4 steps of the Dimensional Modeling Process and then
build a draft schema for the scenario below.
You have been asked to design a data warehouse for your online
retail company consisting of numerous websites selling the same
product set.
You need to design a data warehouse that will allow the following
questions to be answered at a minimum:
• What is the average dollar sales per order?
• How many items in the sweater category did we sell in January?
• What is the most popular brand of t-shirt sold this year?
• Which website sold the largest number of items last month?
Create the schema in Visio and submit according to the Submission
Instructions
Questions?