2 - Logical and Physical Data Modeling
1
Objectives
Review rules of third normal form database
design.
Provide a “toolkit” of denormalization
techniques for physical database design.
Characterize the tradeoffs in performance
versus space and maintenance costs.
Introduce advanced physical database
design considerations.
2
Topics
Quick review of normalization rules.
Pre-join denormalization.
Column replication/movement.
Pre-aggregation denormalization.
3
A Quick Review of Database 101
First Normal Form: Domains of attributes must include only atomic
(simple, indivisible) values.
4
A Quick Review of Database 101
Users should not have to “decode” attribute values based
on the value of other attributes in the relation.
Recommended Fix: Invest in the analysis work to derive a
domain for the (registration) values that does not have
multiple meanings for the same value and does not contain
redundant values. This will usually require standardization
of values across domains.
5
A Quick Review of Database 101
First Normal Form: Domains of attributes must include only atomic
(simple, indivisible) values.
Typical Violation: Multiple values glued together in a single attribute.
6
A Quick Review of Database 101
Recommended Fix: Separate attribute for each meaningful domain.
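A minimal sketch of the fix, using hypothetical names (none of these columns come from the deck): a glued-together code is replaced by one attribute per meaningful domain.

```sql
-- Before: multiple facts glued into a single attribute, e.g.
--   acct_cd = 'IRA-TX-001'   (product, state, and sequence in one value)

-- After: a separate attribute for each meaningful domain
create table account (
    account_id integer not null primary key,
    product_cd char(3),     -- e.g. 'IRA'
    state_cd   char(2),     -- e.g. 'TX'
    seq_no     integer      -- e.g. 1
);
```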
7
A Quick Review of Database 101
First Normal Form: Domains of attributes must include only atomic
(simple, indivisible) values.
8
A Quick Review of Database 101
9
A Quick Review of Database 101
First Normal Form: Domains of attributes must include only atomic
(simple, indivisible) values.
Typical Violation: Repeating group structures.
10
Getting Rid of Repeating Groups
Recommended Fix: One row for each month of balance figures.
What is the cost?
Assume 10M accounts and 3 years of monthly balance history.
Storage in Denormalized Case = 10M * 3 * 68b = 2.04 GB
Storage in Normalized Case = 10M * 36 * 27b = 9.72 GB
Factor of 4.76 in storage “penalty” for normalized design.
A few thousand dollars in today's disk prices.
Note that this is the worst case for the normalized design, because rows
prior to the open date and subsequent to the close date of the account
would likely not need to be stored, whereas the denormalized design
requires zero entries for those months.
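The two designs can be sketched as follows (illustrative DDL; the column names and types are assumptions, not taken from the deck):

```sql
-- Denormalized: one row per account per year, twelve balance buckets
create table account_history (
    account_id    integer not null,
    snapshot_year char(4) not null,
    jan_bal_amt   decimal(18,2),
    feb_bal_amt   decimal(18,2),
    -- ... mar_bal_amt through nov_bal_amt ...
    dec_bal_amt   decimal(18,2),
    primary key (account_id, snapshot_year)
);

-- Normalized: one row per account per month
create table account_balance (
    account_id integer not null,
    balance_dt date    not null,   -- month-end date
    bal_amt    decimal(18,2),
    primary key (account_id, balance_dt)
);
```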
11
Getting Rid of Repeating Groups
Recommended Fix: One row for each month of balance figures.
Why do I care?
12
Getting Rid of Repeating Groups
Average of the first 12 months of account balance for accounts opened in
1999 using denormalized design:
select sum(case
when account.open_dt between '1999-01-01' and '1999-01-31'
and account_history.snapshot_year = '1999' then
account_history.feb_bal_amt + account_history.mar_bal_amt +
account_history.apr_bal_amt + account_history.may_bal_amt +
account_history.jun_bal_amt + account_history.jul_bal_amt +
account_history.aug_bal_amt + account_history.sep_bal_amt +
account_history.oct_bal_amt + account_history.nov_bal_amt +
account_history.dec_bal_amt
when account.open_dt between '1999-01-01' and '1999-01-31'
and account_history.snapshot_year = '2000' then
account_history.jan_bal_amt
when account.open_dt between '1999-02-01' and '1999-02-28'
and account_history.snapshot_year = '1999' then
account_history.mar_bal_amt + account_history.apr_bal_amt +
account_history.may_bal_amt + account_history.jun_bal_amt +
account_history.jul_bal_amt + account_history.aug_bal_amt +
account_history.sep_bal_amt + account_history.oct_bal_amt +
account_history.nov_bal_amt + account_history.dec_bal_amt
when account.open_dt between '1999-02-01' and '1999-02-28'
and account_history.snapshot_year = '2000' then
account_history.jan_bal_amt + account_history.feb_bal_amt
when . . .
13
Getting Rid of Repeating Groups
when account.open_dt between '1999-11-01' and '1999-11-30'
and account_history.snapshot_year = '1999' then
account_history.dec_bal_amt
when account.open_dt between '1999-11-01' and '1999-11-30'
and account_history.snapshot_year = '2000' then
account_history.jan_bal_amt + account_history.feb_bal_amt +
account_history.mar_bal_amt + account_history.apr_bal_amt +
account_history.may_bal_amt + account_history.jun_bal_amt +
account_history.jul_bal_amt + account_history.aug_bal_amt +
account_history.sep_bal_amt + account_history.oct_bal_amt +
account_history.nov_bal_amt
when account.open_dt between '1999-12-01' and '1999-12-31'
and account_history.snapshot_year = '1999' then
0
when account.open_dt between '1999-12-01' and '1999-12-31'
and account_history.snapshot_year = '2000' then
account_history.jan_bal_amt + account_history.feb_bal_amt +
account_history.mar_bal_amt + account_history.apr_bal_amt +
account_history.may_bal_amt + account_history.jun_bal_amt +
account_history.jul_bal_amt + account_history.aug_bal_amt +
account_history.sep_bal_amt + account_history.oct_bal_amt +
account_history.nov_bal_amt + account_history.dec_bal_amt
end) / (12 * count(distinct account.account_id))
from account
,account_history
where account.account_id = account_history.account_id
and account.open_dt between '1999-01-01' and '1999-12-31'
and account_history.snapshot_year in ('1999', '2000')
;
14
Getting Rid of Repeating Groups
Which piece of code would you rather write and maintain?
How will your front-end tool work with the two choices?
Appending rows to the account_history table each month
will be roughly ten times faster than updating balance
history buckets.
This example holds true for many DSS application
domains...account balance history, store/department sales
history, etc.
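For contrast, the same question (average of the first 12 months of balance for accounts opened in 1999) against a normalized one-row-per-month design collapses to something like the sketch below. Table and column names are assumed, and add_months is Teradata/Oracle-style date arithmetic whose syntax varies by RDBMS:

```sql
select sum(account_balance.bal_amt)
       / (12 * count(distinct account.account_id))
from account
    ,account_balance
where account.account_id = account_balance.account_id
and account.open_dt between '1999-01-01' and '1999-12-31'
and account_balance.balance_dt > account.open_dt
and account_balance.balance_dt <= add_months(account.open_dt, 12)
;
```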
15
A Quick Review of Database 101
Second Normal Form: Must be in first normal form and every non-prime
attribute must be fully functionally dependent on the primary key.
16
A Quick Review of Database 101
Recommended Fix: Split table into its fundamental entities with an
appropriate associative entity to capture entity relationships.
[Diagram: Employee 1:m Employee_x_Project (SSN, Project_Id, Date, Hours) m:1 Project (Project_Id, Project_Nm, …)]
Ensuring Full Functional Dependency on
the Primary Key
Recommended Fix: Split table into its fundamental
entities with an appropriate associative entity to
capture entity relationships.
18
Ensuring Full Functional
Dependency on the Primary Key
What are the savings?
Note: May also want a table that describes the valid set of
projects against which an employee can allocate time.
19
A Quick Review of Database 101
Third Normal Form: Must be in second normal form and every non-
prime attribute is non-transitively dependent on the primary key.
20
A Quick Review of Database 101
21
Ensuring Non-Transitive Dependency on
the Primary Key
Recommended Fix: Split the table into its fundamental entities.
22
Ensuring Non-Transitive Dependency on
the Primary Key
Recommended Fix: Split the table into its fundamental entities.
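A classic instance of the fix (illustrative names, not from the deck): an employee table that also carries the department name makes Dept_Nm transitively dependent on Emp_Id through Dept_Id; splitting yields two fundamental entities.

```sql
-- Employee keeps only attributes directly dependent on its key
create table employee (
    emp_id  integer not null primary key,
    emp_nm  varchar(60),
    dept_id integer not null
);

-- Department attributes move to their own entity
create table department (
    dept_id integer not null primary key,
    dept_nm varchar(60)
);
```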
23
Summary Review of Database 101
24
When is a Little Bit of Sin a Good Thing?
The Goal:
25
Common Forms of Denormalization
Pre-join denormalization.
Column replication or movement.
Pre-aggregation.
26
Considerations in Assessing
Denormalization
Performance implications
Storage implications
Ease-of-use implications
Maintenance implications
27
Pre-join Denormalization
28
Pre-join Denormalization
A simplified retail example...
Before denormalization:
[Diagram: sales 1:m sales_detail (tx_id, sale_id, item_id, …, item_qty, sale$)]
29
Pre-join Denormalization
A simplified retail example...
After denormalization:
30
Pre-join Denormalization
Storage implications...
Assume 1:3 record count ratio between sales header and
detail.
Assume 1 billion sales (3 billion sales detail).
Assume 8 byte sales_id.
Assume 30 byte header and 40 byte detail records.
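Under those assumptions the arithmetic works out roughly as follows (an illustrative reading; the 22-byte figure assumes the 8-byte sales_id is counted only once per denormalized row):

```
Before: 1B x 30 bytes (header) + 3B x 40 bytes (detail)
      = 30 GB + 120 GB = 150 GB
After:  the header's remaining ~22 bytes are repeated on every detail row:
        3B x (40 + 22) bytes = 186 GB
Net storage penalty is roughly 36 GB; retaining the original header
table as well adds its 30 GB back.
```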
31
Pre-join Denormalization
Storage implications...
32
Pre-join Denormalization
Sample Query:
33
Pre-join Denormalization
Before denormalization:
select sum(sales_detail.sale_amt)
from sales
,sales_detail
where sales.sales_id = sales_detail.sales_id
and sales.sales_dt between '2011-11-26' and '2011-12-25'
;
34
Pre-join Denormalization
After denormalization:
select sum(d_sales_detail.sale_amt)
from d_sales_detail
where d_sales_detail.sales_dt between '2011-11-26' and '2011-12-25'
;
35
Pre-join Denormalization
Difference in performance (with no index utilization) depends on
join plans available to RDBMS:
36
Pre-join Denormalization
37
Pre-join Denormalization
Before denormalization:
select count(*)
from sales
where sales.sales_dt between '2011-11-26' and '2011-12-25';
After denormalization:
select count(distinct d_sales_detail.sales_id)
from d_sales_detail
where d_sales_detail.sales_dt between '2011-11-26' and '2011-12-25';
38
Pre-join Denormalization
Performance implications...
Performance penalty for count distinct (forces sort) can
be quite large.
It may be worth the 30 GB of overhead to keep the sales header
records if this is a common query structure, since both
ease of use and performance are enhanced at a modest cost in
storage.
39
Column Replication or Movement
40
Column Replication or Movement
41
Column Replication or Movement
Beware of the results of denormalization:
Assuming a 100-byte record before the denormalization, moving
a 10-byte column in means all scans through the claim line detail
will now take roughly 10% longer than previously.
A significant percentage of queries must get benefit from
access to the denormalized column in order to justify
movement into the claim line table.
Need to quantify both cost and benefit of each
denormalization decision.
42
Column Replication or Movement
May want to replicate columns in order to facilitate co-location of commonly
joined tables.
Before denormalization:
43
Column Replication or Movement
May want to replicate columns in order to facilitate co-location of commonly
joined tables.
After denormalization:
All three tables can be co-located using customer# as primary index to make
the three table join run much more quickly.
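In a shared-nothing RDBMS such as Teradata, co-location follows from declaring the same primary index on all three tables so their rows hash to the same units of parallelism. A sketch with abbreviated, assumed column lists:

```sql
create table customer (
    customer_id integer not null,
    birth_dt    date
) primary index (customer_id);

create table account (
    customer_id     integer not null,   -- replicated down from customer
    account_id      integer not null,
    registration_cd char(3)
) primary index (customer_id);

create table tx (
    customer_id integer not null,       -- replicated down through account
    account_id  integer not null,
    tx_dt       date,
    tx_amt      decimal(18,2)
) primary index (customer_id);
```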
44
Column Replication or Movement
What is the impact of this approach to achieving table
co-location?
• Increases size of transaction table (largest table in the
database) by the size of the customer_id key.
• If customer key changes (consider impact of
individualization), then updates down to transaction
table must be propagated.
• Must include customer_id in join between transaction
table and account table to ensure optimizer recognition
of co-location (even though it is redundant to join on
account_id).
45
Column Replication or Movement
Resultant query example:
select sum(tx.tx_amt)
from customer
,account
,tx
where customer.customer_id = account.customer_id
and account.customer_id = tx.customer_id
and account.account_id = tx.account_id
and customer.birth_dt > '1972-01-01'
and account.registration_cd = 'IRA'
and tx.tx_dt between '2000-01-01' and '2000-04-15'
;
46
Pre-aggregation
47
Pre-aggregation
Typical pre-aggregate summary tables:
Retail: Inventory on hand, sales revenue, cost of goods sold, quantity of goods
sold, etc. by store, item, and week.
Telecommunications: Toll call activity in time slot and destination region buckets
by customer and month.
Financial Services: First DOE, last DOE, first DOI, last DOI, rolling $ and transaction
volume in account type buckets, etc. by household.
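A retail-style weekly summary, for instance, might be declared and loaded along these lines (all names here are illustrative assumptions; sales_detail is presumed to carry or derive a week-ending date):

```sql
create table store_item_week_smry (
    store_id    integer not null,
    item_id     integer not null,
    week_end_dt date    not null,
    sales_amt   decimal(18,2),
    cogs_amt    decimal(18,2),
    sales_qty   integer,
    primary key (store_id, item_id, week_end_dt)
);

insert into store_item_week_smry
select store_id
      ,item_id
      ,week_end_dt
      ,sum(sale_amt)
      ,sum(cogs_amt)
      ,sum(item_qty)
from sales_detail
group by store_id, item_id, week_end_dt;
```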
48
Pre-aggregation
49
Pre-aggregation
Overhead for maintaining aggregates should not be underestimated.
Can choose transactional update strategy or re-build strategy for
maintaining aggregates.
Choice depends on volatility of aggregates and ability to segregate
aggregate records that need to be refreshed based on incoming data.
e.g., customer aggregates vs. weekly POS activity aggregates.
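The two strategies can be sketched as follows (illustrative SQL with assumed table names; update-from syntax varies by RDBMS):

```sql
-- Transactional update: fold incoming increments into existing rows
update store_item_week_smry
set sales_amt = store_item_week_smry.sales_amt + incoming.sales_amt
   ,sales_qty = store_item_week_smry.sales_qty + incoming.sales_qty
from incoming                 -- today's detail, pre-grouped to the same grain
where store_item_week_smry.store_id    = incoming.store_id
and   store_item_week_smry.item_id     = incoming.item_id
and   store_item_week_smry.week_end_dt = incoming.week_end_dt;

-- Re-build: delete and re-derive only the affected slice
delete from store_item_week_smry
where week_end_dt = '2000-04-15';     -- the week being refreshed
insert into store_item_week_smry
select store_id, item_id, week_end_dt
      ,sum(sale_amt), sum(cogs_amt), sum(item_qty)
from sales_detail
where week_end_dt = '2000-04-15'
group by store_id, item_id, week_end_dt;
```

The transactional route suits low-volatility aggregates where affected rows are easy to isolate (e.g., weekly POS activity); the re-build route is often simpler when most rows change anyway (e.g., customer aggregates).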
50
Pre-aggregation
51
Pre-aggregation
52
Bottom Line
In a perfect world of infinitely fast machines and well-designed
end-user access tools, denormalization would never be discussed.
53