02 - Basic Data Warehousing & Architectures


CS131-8:

Data Warehousing and Data Mining


02 – Basic Data Warehousing & Architectures
Recall: CMI Problem
CMI wants to answer the following question:
How many men statement shirts were sold in the Philippines the
week before the NCAA basketball semifinals in 2016, and what were the
total net sales?
Let’s try to answer this using their transactional OLTP system
• Men statement shirts is a category type
• Philippines is a division
DivisionList: PK DivisionNumber, DivisionName, DivisionDirector
RegionList: PK RegionNumber, RegionName, FK DivisionNumber, RegionManager
StoreList: PK StoreNumber, StoreName, FK RegionNumber, AddressLine1, AddressLine2, City, State, Zip
SaleTransaction: PK TransactionID, FK StoreNumber, TimestampOfSale, RegisterNumber, TotalSales, PaymentMethod
TransactionLedger: PK TransactionID, PK LineNumber, FK SKU, Price, Discount, NetPrice
SKUList: PK SKU, FK ProductNumber, Size, Color
ProductList: PK ProductNumber, FK BrandCode, ProductName, ManufacturerID
BrandList: PK BrandCode, FK CategoryCode, BrandName, BrandManager
CategoryList: PK CategoryCode, CategoryName, CategoryDescription

Figure 10. The Transactional Data Model


The Query
select count(TL.NetPrice) as Total_Items, sum(TL.NetPrice) as Total_Sales
from TransactionLedger TL
inner join SaleTransaction ST on (TL.TransactionID = ST.TransactionID)
inner join StoreList SL on (ST.StoreNumber = SL.StoreNumber)
inner join RegionList RL on (SL.RegionNumber = RL.RegionNumber)
inner join DivisionList DL on (RL.DivisionNumber = DL.DivisionNumber)
inner join SKUList SK on (TL.SKU = SK.SKU)
inner join ProductList PL on (SK.ProductNumber = PL.ProductNumber)
inner join BrandList BL on (PL.BrandCode = BL.BrandCode)
inner join CategoryList CL on (BL.CategoryCode = CL.CategoryCode)
where DL.DivisionName = 'Ireland'
and CL.CategoryName = 'Woman Kilt'
and ST.TimestampOfSale > to_timestamp('2016-03-09:00.00.00')
and ST.TimestampOfSale < to_timestamp('2016-03-16:23.59.59')
Complex and error-prone
Simple Warehouse Data Model
SalesFact: PK FK SKU, PK FK Date, PK FK StoreNumber, NetPrice, QuantitySold
DateDim: PK Date, CharDate, (Other Date Info...)
ProductDim: PK SKU, ProductNumber, (Product Info...), BrandID, BrandName, (Other Brand Info...), CategoryCode, (Category Info...)
StoreDim: PK StoreNumber, StoreName, (Store Info...), RegionNumber, (Region Info...), DivisionNumber, DivisionName, (Other Division Info...)


Figure 11. A Simple Warehouse Data Model


The Query
select sum(SF.QuantitySold) as Total_Items, sum(SF.NetPrice) as Total_Sales
from SalesFact SF
inner join StoreDim SD on (SF.StoreNumber = SD.StoreNumber)
inner join ProductDim PD on (SF.SKU = PD.SKU)
inner join DateDim DD on (SF.Date = DD.Date)
where SD.DivisionName = 'Ireland'
and PD.CategoryName = 'Woman Kilt'
and DD.CharDate >= '2016-03-09'
and DD.CharDate <= '2016-03-16'
Simpler and easier to understand
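The star-schema query can be tried end to end with a small in-memory database. The following is a minimal sketch, assuming SQLite (via Python's sqlite3 module) and a handful of made-up rows. The table and column names follow Figure 11, with CategoryName assumed to be one of the "(Category Info...)" columns; DivisionName is a StoreDim column in that figure, so the division filter is applied there.

```python
import sqlite3

# Hypothetical sample data; only the table/column names come from Figure 11.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DateDim    (Date INTEGER PRIMARY KEY, CharDate TEXT);
CREATE TABLE ProductDim (SKU INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE StoreDim   (StoreNumber INTEGER PRIMARY KEY, DivisionName TEXT);
CREATE TABLE SalesFact  (
    SKU          INTEGER REFERENCES ProductDim(SKU),
    Date         INTEGER REFERENCES DateDim(Date),
    StoreNumber  INTEGER REFERENCES StoreDim(StoreNumber),
    NetPrice     REAL,
    QuantitySold INTEGER,
    PRIMARY KEY (SKU, Date, StoreNumber)
);
INSERT INTO DateDim VALUES (1, '2016-03-08'), (2, '2016-03-10'), (3, '2016-03-15');
INSERT INTO ProductDim VALUES (100, 'Woman Kilt'), (200, 'Scarf');
INSERT INTO StoreDim VALUES (10, 'Ireland'), (20, 'Scotland');
INSERT INTO SalesFact VALUES
    (100, 1, 10,  50.0, 1),  -- outside the date range
    (100, 2, 10, 100.0, 2),  -- matches all filters
    (100, 3, 10, 150.0, 3),  -- matches all filters
    (200, 2, 10,  20.0, 1),  -- wrong category
    (100, 2, 20, 100.0, 2);  -- wrong division
""")

# One fact table joined to three dimensions; filters live on the dimensions.
STAR_QUERY = """
SELECT SUM(SF.QuantitySold) AS Total_Items, SUM(SF.NetPrice) AS Total_Sales
FROM SalesFact SF
INNER JOIN StoreDim   SD ON SF.StoreNumber = SD.StoreNumber
INNER JOIN ProductDim PD ON SF.SKU = PD.SKU
INNER JOIN DateDim    DD ON SF.Date = DD.Date
WHERE SD.DivisionName = ?
  AND PD.CategoryName = ?
  AND DD.CharDate >= ?
  AND DD.CharDate <= ?
"""

total_items, total_sales = conn.execute(
    STAR_QUERY, ("Ireland", "Woman Kilt", "2016-03-09", "2016-03-16")
).fetchone()
print(total_items, total_sales)
```

Only two of the five sample fact rows pass all four filters, which makes it easy to check by hand that the joins and constraints behave as intended.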
Data Warehouse Requirements
• In a data warehouse, the requirements help define the business
goal you will achieve with your design.

• If you don’t have business requirements, don’t build the data
warehouse because it won’t be right.
Cardinal Merch Requirements
Themes >> Critical Success Factors >> Business Questions
• Theme – a central goal you are trying to achieve
  • To grow sales across all market segments and product lines
• Critical Success Factor – a group of data elements that are central to
achieving that goal
  • Analyze sales volume, price and cost trends for PH stores over time
• Business Question – a specific question that can be tied to data to
identify whether the critical success factor is being met
  • How many men statement shirts were sold in the Philippines during the
  NCAA semifinals 2016, and what were the total net sales?
Dimensional Modeling Process
Consists of four main steps:

1. Select the business process to model

2. Declare the grain of the business process

3. Choose the dimensions that apply to each fact table row

4. Identify the facts

Dimensional modeling is part science, and part art…


Cardinal Merch Case Study
1. Select the business process to model:
• Requires an understanding of both business requirements and
available data
• Management wants to better understand customer purchases as
captured by the POS system
• The business process CMI will model is POS retail sales
Cardinal Merch Case Study
2. Declare the grain of the business process:
• Specify exactly what an individual fact table row represents – the
grain conveys the level of detail associated with fact table
measurements
• It is highly recommended to choose the most granular or atomic
information captured by the business process. Why?
The grain for CMI is SKU (Stock Keeping Unit), Date, and Store
(The transaction is a lower level, but at this time we have chosen not
to build the warehouse at this level.)
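To make the chosen grain concrete, here is a minimal Python sketch that rolls transaction-level ledger lines up to one row per SKU, Date, and Store. The sample lines are made up, and the sketch assumes each ledger line represents one item sold.

```python
from collections import defaultdict

# Hypothetical ledger lines: (sku, sale_date, store_number, net_price).
# Assumption: each line is a single item sold.
ledger_lines = [
    ("SKU-1", "2016-03-10", 10, 25.0),
    ("SKU-1", "2016-03-10", 10, 25.0),  # same SKU, date, store: same fact row
    ("SKU-1", "2016-03-10", 20, 25.0),  # different store: new fact row
    ("SKU-2", "2016-03-10", 10, 40.0),  # different SKU: new fact row
]

# One fact row per (SKU, Date, Store), which is the declared grain.
sales_fact = defaultdict(lambda: {"quantity_sold": 0, "net_price": 0.0})
for sku, sale_date, store_number, net_price in ledger_lines:
    row = sales_fact[(sku, sale_date, store_number)]
    row["quantity_sold"] += 1
    row["net_price"] += net_price

for grain_key, measures in sorted(sales_fact.items()):
    print(grain_key, measures)
```

After this roll-up the individual transactions can no longer be recovered from the fact table, which is the trade-off of not building at the lowest level.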
Cardinal Merch Case Study
3. Choose the dimensions:
• Determine the ways the data will be aggregated or filtered.
• Identify the level of hierarchy associated with each part of the grain.

The dimensions for CMI are: Date, Product, Store


Cardinal Merch Case Study
4. Identify the facts
• Determine the measurements that are available at the chosen grain
• Identify any consolidations, calculations or conversions to be done

Facts collected by CMI POS are Quantity sold, Net Price, Price,
Discount
Dimensional Modeling Components
FACT TABLE
• Primary table which stores the performance measurements of the
business
• The term “fact” refers to a business measure
• Each row in a fact table corresponds to a specific measurement
• Each measurement is taken at the intersection of all the relevant
dimensions (e.g., day, product, and store) – this list of dimensions
defines the “grain” of the fact table
Dimensional Modeling Components
FACT TABLE
• All measurements in a fact table must be at the same grain
• Facts are either additive, semiadditive, or nonadditive – most are
numeric
• Contains two or more foreign keys to dimension tables
• Expresses the many-to-many relationships between dimensions in
dimensional models
Dimensional Modeling Components
DESIGN GUIDELINES
• Look at the OLTP schema (or available extracts) to identify
possible measures
• Fields that can be manipulated (sum, avg) to generate useful information
• Sometimes you need to convert a field to make it measurable ('Y'/'N' to 1/0)
• Values should be related to the grain of the fact table.
• Determine the lowest grain possible
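The 'Y'/'N' conversion mentioned above can be sketched in a few lines of Python. The extract rows and the discounted_flag column are hypothetical; the point is only that a flag becomes summable once it is stored as 1/0.

```python
# Hypothetical OLTP extract rows; the column names are illustrative only.
extract_rows = [
    {"sku": "SKU-1", "discounted_flag": "Y"},
    {"sku": "SKU-2", "discounted_flag": "N"},
    {"sku": "SKU-3", "discounted_flag": "Y"},
]


def flag_to_measure(flag: str) -> int:
    """Convert a 'Y'/'N' flag into a 1/0 value that can be summed or averaged."""
    return 1 if flag == "Y" else 0


for row in extract_rows:
    row["discounted_count"] = flag_to_measure(row["discounted_flag"])

# Once numeric, the field behaves like any other measure.
total_discounted = sum(row["discounted_count"] for row in extract_rows)
share_discounted = total_discounted / len(extract_rows)
print(total_discounted, round(share_discounted, 2))
```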
The Warehouse Data Model
SalesFact: PK FK SKU, PK FK Date, PK FK StoreNumber, NetPrice, QuantitySold
DateDim: PK Date, CharDate, (Other Date Info...)
ProductDim: PK SKU, ProductNumber, (Product Info...), CategoryCode, (Category Info...), BrandID, BrandName, (Other Brand Info...)
StoreDim: PK StoreNumber, StoreName, (Store Info...), RegionNumber, (Region Info...), DivisionNumber, DivisionName, (Other Division Info...)
Dimensional Modeling Components
DIMENSION TABLES
• Contain the textual descriptors of the business
• Usually low in cardinality, but very wide (50-100 attributes not
uncommon)
• Dimension attributes used as query constraints, groupings, and
report labels
• The more descriptive the dimension attributes, the better
• Often contain hierarchical relationships (city=>state=>region)
Dimensional Modeling Components
DESIGN GUIDELINES
• Look at the OLTP schema (or available extracts) to identify
possible attributes
• Look for hierarchies. They are key to successful reporting.
• Include codes and descriptions. Can include multiple formats of
descriptions
• Think in terms of reporting. What descriptions do you need to provide
user-friendly reports? You want it to be easy for the user to create reports
• Too much data is better than too little
• Determine the lowest grain possible.
Dimensional Modeling Components
• Fact Table + Dimension Tables = Dimensional Model (Star
Schema)
• Benefits of dimensional model
• Simplicity
• Easy for business users to understand
• Improved query performance
• Extensibility
• Easily accommodates change (but not that easily!)
Cardinal Merch Preliminary Star Schema
Fact_Sales: PK FK SKU, PK FK Store_Number, PK FK Date, Item_Sale_Price, Total_Sales, Quantity
Dim_Product: PK SKU, Product Attributes
Dim_Store: PK Store_Number, Store Attributes
Dim_Date: PK Calendar Date, Date Attributes
Date Dimension
• All data warehouses have a Date/Time dimension
• It is possible to pre-populate the Date dimension
• Relatively small dimension table, e.g., 10 years of days is only about
3650 rows
• Multiple hierarchies exist within the Date dimension
Example attributes:
• Date
• Full Date Description
• Month Number
• Month Name
• Month Short Name
• Day Number in Month
• Day of Week
• Day Number in Year
• Year
• Fiscal Quarter
• Fiscal Year
• Holiday Indicator
• First Day of Quarter Indicator
• Selling Season…
• Etc.
Date Dimension Example

Figure 1. Date Dimension Example.
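Because calendar dates are known in advance, the Date dimension can be generated rather than extracted. The following is a minimal Python sketch that pre-populates ten years of rows with a few of the attributes listed above. The attribute names are illustrative, and the quarter indicator assumes calendar quarters; fiscal attributes, holidays, and selling seasons depend on company-specific rules and are left out.

```python
from datetime import date, timedelta


def build_date_dimension(start: date, end: date) -> list[dict]:
    """Generate one Date dimension row per calendar day from start to end inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date": current.isoformat(),
            "month_number": current.month,
            "month_name": current.strftime("%B"),
            "month_short_name": current.strftime("%b"),
            "day_number_in_month": current.day,
            "day_of_week": current.strftime("%A"),
            "day_number_in_year": current.timetuple().tm_yday,
            "year": current.year,
            # Assumption: calendar quarters starting in Jan, Apr, Jul, Oct.
            "first_day_of_quarter_indicator": (
                current.day == 1 and current.month in (1, 4, 7, 10)
            ),
        })
        current += timedelta(days=1)
    return rows


# Ten calendar years, including the leap days in 2012 and 2016.
date_dim = build_date_dimension(date(2010, 1, 1), date(2019, 12, 31))
print(len(date_dim))
```

The row count lands slightly above 3,650 because of leap days, which is consistent with the "about 3650 rows" estimate.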


Product Dimension
• Recall there are about 60K SKUs
• Product dimension will contain about 150K rows when accounting for
different merchandising schemes across stores and historical products
• Product hierarchy: SKU=>Brand=>Category=>Department
Example attributes:
• SKU Number (Natural Key)
• UPC
• Product Description
• Brand Description
• Category Description
• Department Description
• Color
• Size
• Style
• Closure
• Length
• Package Weight
• Season
Store Dimension
• Represents primary geographic dimension
• Store hierarchies include:
  • Store=>State
  • Store=>District=>Region
• Any number of different geography or sales hierarchies can exist in this
dimension
Example attributes:
• Store Number
• Store Name
• Store Street Address
• Store City
• Store County
• Store State
• Store Zip Code
• Store Manager
• Store District
• Store Region
• Floor Plan Type
• Selling Square Footage
• First Open Date…
Additivity
• If CMI wants to look at gross margin:
• Gross margin = gross profit/sales dollar amount
• Gross profit = sales dollar amount – cost dollar amount

• Should we choose to store gross profit or gross margin as a fact?
Opportunities
• Product A
  • Price is $10
  • Cost is $5
  • Gross Profit is $5 ($10 - $5)
  • Gross Margin is 50% ($5 / $10)
• Product B
  • Price is $100
  • Cost is $90
  • Gross Profit is $10 ($100 - $90)
  • Gross Margin is 10% ($10 / $100)
• Assume we sell one of each product
• What is the gross margin for both products?
• Is the gross margin 60% (10% + 50%)?
Gross Margin Example
• Assume we sell one of each product
• Revenue is additive
• $100 + $10 = $110
• Gross Profit is additive
• $10 + $5 = $15
• Gross Margin is not additive
• Not 60% (10% + 50%)
• It is 13.6% ($10 + $5) / ($100 + $10)
• This value could not be calculated only from the gross margin on each
individual product or transaction
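The arithmetic above can be checked with a short Python sketch that uses the Product A and Product B figures. It contrasts the naive sum of per-product margins with the margin recomputed from the additive facts (revenue and gross profit).

```python
# Price and cost figures for Product A and Product B, one unit of each sold.
products = [
    {"name": "A", "price": 10.0, "cost": 5.0},
    {"name": "B", "price": 100.0, "cost": 90.0},
]

# Additive facts: these can simply be summed across rows.
revenue = sum(p["price"] for p in products)
gross_profit = sum(p["price"] - p["cost"] for p in products)

# Nonadditive: summing per-row margins gives a meaningless number.
naive_margin = sum((p["price"] - p["cost"]) / p["price"] for p in products)

# Correct: recompute the ratio from the summed additive facts.
gross_margin = gross_profit / revenue

print(f"revenue={revenue:.0f} gross_profit={gross_profit:.0f}")
print(f"naive sum of margins={naive_margin:.1%} correct margin={gross_margin:.1%}")
```

This suggests storing gross profit (or its components, sales and cost) as the fact and deriving gross margin at query time.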
Additivity
• A fact is additive if we can sum the fact across all
dimensions and obtain a valid and correct number
• A fact is nonadditive if the summation of the fact across
any dimension results in a meaningless, nonsensical
number
• A fact is semiadditive if it is additive across some
dimensions and nonadditive across other dimensions
Cardinal Merch Case Study
• Assume the chain now switches POS systems and must
renumber its SKUs and store numbers.
• Anyone see a problem with this model?
Fact_Sales: PK FK SKU, PK FK Store_Number, PK FK Date, Item_Sale_Price, Total_Sales, Quantity
Dim_Product: PK SKU, Product Attributes
Dim_Store: PK Store_Number, Store Attributes
Dim_Date: PK Calendar Date, Date Attributes
Surrogate Keys
• It is highly recommended to use surrogate keys for dimension table keys
• Surrogate keys are simply integers assigned sequentially to a particular
dimension row
• Operational codes (e.g., SKU number) are frequently the natural keys in
the source system and are used to look up the surrogate key to use
• Operational codes are also retained for analysis purposes
• Many dimensions will have a surrogate key of -1 to indicate no value in
the Dimension
Surrogate Keys
• Benefits of surrogate keys:
• Buffer the data warehouse from changes in operational codes
• Can save space due to their small size compared to operational codes
• Allow recording of conditions which do not have an operational code (e.g., “No
Promotion”)
• Allow handling of changes to dimension table attributes (to be discussed later)
• The main disadvantage of using surrogate keys is that it requires some
effort to implement
• ALWAYS, ALWAYS, ALWAYS USE SURROGATE KEYS!
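A minimal Python sketch of the idea follows. The SurrogateKeyMap class and its method names are hypothetical, not part of any library; it assigns sequential integers to natural keys, returns -1 for unknown values, and shows that renumbering an operational code leaves the surrogate key (and therefore existing fact rows) untouched.

```python
class SurrogateKeyMap:
    """Hypothetical helper: maps natural keys (e.g., SKU) to sequential integers."""

    UNKNOWN_KEY = -1  # the "no value" dimension row described above

    def __init__(self) -> None:
        self._keys: dict[str, int] = {}
        self._next_key = 1

    def assign(self, natural_key: str) -> int:
        """Return the existing surrogate key, or assign the next integer."""
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next_key
            self._next_key += 1
        return self._keys[natural_key]

    def lookup(self, natural_key: str) -> int:
        """Return the surrogate key, or UNKNOWN_KEY if the code is not known."""
        return self._keys.get(natural_key, self.UNKNOWN_KEY)

    def renumber(self, old_natural_key: str, new_natural_key: str) -> None:
        """Point a new operational code at the same surrogate key."""
        self._keys[new_natural_key] = self._keys.pop(old_natural_key)


product_keys = SurrogateKeyMap()
first_key = product_keys.assign("SKU-1001")
second_key = product_keys.assign("SKU-1002")

# Fact rows store only the surrogate key, never the operational code.
fact_rows = [{"product_key": first_key, "quantity": 3}]

# The POS system is replaced and SKU-1001 becomes NEW-9001.
product_keys.renumber("SKU-1001", "NEW-9001")
print(product_keys.lookup("NEW-9001"), product_keys.lookup("SKU-1001"))
```

In a real warehouse the mapping would live in the dimension table itself, with the operational code kept as an ordinary attribute next to the surrogate key.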
Star Schema with Surrogate Keys
Fact_Sales: PK FK ProductKey, PK FK StoreKey, PK FK DateKey, Item_Sale_Price, Total_Sales, Quantity
Dim_Product: PK ProductKey, SKU, Product Attributes
Dim_Store: PK StoreKey, Store_Number, Store Attributes
Dim_Date: PK DateKey, Calendar Date, Date Attributes
Star Schema Size Analysis
• Product Dimension
• 150,000 products x 1 KB per row = 150 MB
• Date Dimension
• 3,650 dates (10 years) x 1 KB per row = 3.65 MB
• Store Dimension
• 100 stores x 2 KB per row = 0.2 MB
• Promotion Dimension
• 5,000 promotions x 1 KB per row = 5 MB
• Total Dimensions = 158.85 MB
Star Schema Size Analysis
• Fact Table
• Assume 10,000 transactions per day across all stores
• 10,000 purchases x 3,650 days x 10 products per purchase x 1
promotion per purchase
• 365,000,000 records x 1 KB per record = 365 GB
• Total Size = Fact + Dimensions
• Total Size = 365,000 MB + 158.85 MB ≈ 365.2 GB
• Sizing rule – when calculating size, the size of the dimension
tables can usually be ignored.
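The sizing estimate is plain arithmetic and can be reproduced in a few lines of Python. This sketch uses the row counts and row sizes from the slides and assumes decimal units (1 MB = 1,000 KB, 1 GB = 1,000 MB) and that the 10,000 daily purchases are the total across all stores.

```python
KB_PER_MB = 1_000
MB_PER_GB = 1_000

# (row count, KB per row) for each dimension, as given in the slides.
dimensions = {
    "product": (150_000, 1),
    "date": (3_650, 1),
    "store": (100, 2),
    "promotion": (5_000, 1),
}
dimension_mb = sum(rows * kb for rows, kb in dimensions.values()) / KB_PER_MB

# Fact rows: purchases per day x days x products per purchase x promotions.
# Assumption: 10,000 purchases per day is the total across all stores.
fact_rows = 10_000 * 3_650 * 10 * 1
fact_mb = fact_rows * 1 / KB_PER_MB  # 1 KB per fact record

total_gb = (fact_mb + dimension_mb) / MB_PER_GB
dimension_share = dimension_mb / (fact_mb + dimension_mb)

print(f"dimensions: {dimension_mb:.2f} MB")
print(f"fact table: {fact_mb / MB_PER_GB:.0f} GB")
print(f"total: {total_gb:.1f} GB, dimension share: {dimension_share:.3%}")
```

The dimension share comes out well under a tenth of a percent, which is what the sizing rule relies on.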
Data Warehousing Architectures
• There are several architectures for data warehousing: two-tier,
three-tier, and sometimes one tier.
• One can distinguish among them by dividing the data warehouse
into three parts:
• The data warehouse itself that contains the data and associated
software
• Data acquisition (back-end) software that extracts data from legacy
systems and external sources, consolidates and summarizes them,
and loads them into the data warehouse
• Client (front-end) software that allows users to access and analyze
data from the warehouse
Generic Warehouse Architecture
[Figure: generic warehouse architecture. Several Extractor/Monitor components feed an Integrator, which loads the Warehouse and its Metadata; Clients reach the Warehouse through Query & Analysis. Side labels: Design Phase, Loading, Maintenance, Optimization.]
Three-layer Architecture: Conceptual View
• Transformation of real-time data to derived data really requires
two steps
[Figure: three layers between operational systems and informational systems. Real-time data (operational systems); Reconciled Data (physical implementation of the data warehouse); Derived Data (view level, "particular informational needs").]
Data Warehousing Architectures
DW Architecture: Conceptual View
• Single-layer
  • Every data element is stored once only
  • Virtual warehouse
  [Figure: operational systems and informational systems both work on the "real-time data"]
• Two-layer
  • Real-time + derived data
  • Most commonly used approach in the industry today
  [Figure: operational systems work on the real-time data; informational systems work on the derived data]
Data Warehousing Architectures
Issues to consider when deciding which architecture to use:
• Which database management system (DBMS) should be used?
• Will parallel processing and/or partitioning be used?
• Will data migration tools be used to load the data warehouse?
• What tools will be used to support data retrieval and analysis?
Data Warehousing Architectures
Ten factors that potentially affect the architecture selection
decision:
1. Information interdependence between organizational units
2. Upper management’s information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors
Alternative Data Warehouse Architectures
Assignment
Complete the four steps of the Dimensional Modeling Process and then
build a draft schema that answers the questions below.
You have been asked to design a data warehouse for your online
retail company consisting of numerous websites selling the same
product set.
You need to design a data warehouse that will allow the following
questions to be answered at a minimum:
• What is the average dollar sales per order?
• How many items in the sweater category did we sell in January?
• What is the most popular brand of t-shirt sold this year?
• Which website sold the largest number of items last month?
Create the schema in Visio and submit according to the Submission
Instructions
Questions?