0% found this document useful (0 votes)
19 views

Lect 1

This document provides an introduction to data warehousing. It discusses why data warehousing is needed due to the large and growing amounts of data organizations collect. It also notes that data warehousing provides a complete view of an organization's data to support intelligent decision making. The document outlines some key differences between data warehousing and transactional systems, noting that data warehouses integrate historical data from multiple sources for ad-hoc querying by knowledge workers. It also provides a definition of a data warehouse as a complete repository of historical corporate data extracted from transaction systems and available for knowledge workers to access.

Uploaded by

Filmy Studio
Copyright
© © All Rights Reserved
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views

Lect 1

This document provides an introduction to data warehousing. It discusses why data warehousing is needed due to the large and growing amounts of data organizations collect. It also notes that data warehousing provides a complete view of an organization's data to support intelligent decision making. The document outlines some key differences between data warehousing and transactional systems, noting that data warehouses integrate historical data from multiple sources for ad-hoc querying by knowledge workers. It also provides a definition of a data warehouse as a complete repository of historical corporate data extracted from transaction systems and available for knowledge workers to access.

Uploaded by

Filmy Studio
Copyright
© © All Rights Reserved
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 45

Data Warehousing

Lecture-1
Introduction and Background

1
Reference Books
 W. H. Inmon, Building the Data Warehouse
(Second Edition), John Wiley & Sons Inc., NY.
 A. Abdullah, “Data Warehousing for
beginners: Concepts & Issues” (First
Edition).
 Paulraj Ponniah, Data Warehousing
Fundamentals,
John Wiley & Sons Inc., NY.
Additional Material
 Research Papes

2
At The End of the Course
Develop an application for an organization
of your choice.

A case study and coding based approach


to be followed.

You MUST collect the necessary data and


should have a first draft of the project
description approved by the instructor
BEFORE initiating on detailed work.

3
Approach of the course
►Develop an understanding of underlying RDBMS
concepts.
►Applythese concepts to VLDB DSS environments
and understand where and why they break down?
►Exposethe differences between RDBMS and Data
Warehouse in the context of VLDB.
►Provide the basics of DSS tools such as OLAP, Data
Mining and demonstrate their application.

►Demonstrate the application of DSS concepts and


limitations of the OLTP concepts through lab
exercises. 4
Why this course?
► The world is changing (actually changed),
either change or be left behind.

► Missing the opportunities or going in the


wrong direction has prevented us from
growing.

► What is the right direction?


► Harnessing the data, in a knowledge driven
economy.
5
The need

“Drowning in data and starving


for information”
Knowledge is power, Intelligence
is absolute power!

6
The need
$
POWER

INTELLIGENCE

KNOWLEDGE

INFORMATION

DATA

7
Historical overview

1960
Master Files & Reports

1965
Lots of Master files!

1970
Direct Access Memory & DBMS

1975
Online high performance transaction processing 

8
Historical overview

1980
PCs and 4GL Technology (MIS/DSS) 
1985 & 1990 
Extract programs, extract processing,
The legacy system’s web

9
Historical overview: Crisis of Credibility
What is the financial health of our company?


??

 

-10%

+10%



10
Introduction and Background

11
Why a Data Warehouse (DWH)?
► Data recording and storage is growing.

► History is excellent predictor of the future.

► Gives total view of the organization.

► Intelligent
decision-support is required for
decision-making.

12
Reason-1: Why a Data Warehouse?
► Data Sets are growing.

How Much Data is that?


1 MB 220 or 106 bytes Small novel – 31/2 Disk
Paper rims that could fill the back of
1 GB 230 or 109 bytes
a pickup van
50,000 trees chopped and converted
1 TB 240 or 1012 bytes
into paper and printed
Academic research libraries across
2 PB 1 PB = 250 or 1015 bytes
the U.S.
All words ever spoken by human
5 EB 1 EB = 260 or 1018 bytes
beings

13
Reason-1: Why a Data Warehouse?
► Sizeof Data Sets are going up .
► Cost of data storage is coming down .

 The amount of data average business collects


and stores is doubling every year

 Total hardware and software cost to store and


manage 1 Mbyte of data
►1990: ~ $15
►2002: ~ ¢15 (Down 100 times)
►By 2007: < ¢1 (Down 150 times)

14
Reason-1: Why a Data Warehouse?
 A Few Examples
►WalMart: 24 TB
►France Telecom: ~ 100 TB
►CERN: Up to 20 PB by 2006
►Stanford Linear Accelerator Center (SLAC):
500TB

15
Caution!

A Warehouse of Data
is NOT a
Data Warehouse

16
Caution!

Size
is NOT
Everything

17
Reason-2: Why a Data Warehouse?

►Businesses demand Intelligence (BI).


 Complex questions from integrated data.
 “Intelligent Enterprise”

18
Reason-2: Why a Data Warehouse?
DBMS Approach
List of all items that were sold last
month?

List of all items purchased by Tariq


Majeed?

The total sales of the last month


grouped by branch?

How many sales transactions


occurred during the month of
January?
19
Reason-2: Why a Data Warehouse?
Intelligent Enterprise
Which items sell together? Which
items to stock?

Where and how to place the items?


What discounts to offer?

How best to target customers to


increase sales at a branch?

Which customers are most likely to


respond to my next promotional
campaign, and why?
20
Reason-3: Why a Data Warehouse?
►Businesses want much more…

 What happened?
 Why it happened? Stages of
Data
 What will happen? Warehouse

 What is happening?
 What do you want to happen?
21
What is a Data Warehouse?

A complete repository of historical


corporate data extracted from
transaction systems that is
available for ad-hoc access by
knowledge workers.

22
What is a Data Warehouse?
Complete repository
History
Transaction System
Ad-Hoc access
Knowledge workers

23
What is a Data Warehouse?
Transaction System
 Management Information System (MIS)
 Could be typed sheets (NOT transaction system)

Ad-Hoc access
 Dose not have a certain access pattern.
 Queries not known in advance.
 Difficult to write SQL in advance.

Knowledge workers
 Typically NOT IT literate (Executives, Analysts, Managers).
 NOT clerical workers.
 Decision makers. 24
Another View of a DWH

Subject
Oriented

Integrated

Time
Variant

Non
Volatile

25
What is a Data Warehouse ?
It is a blend of many technologies, the basic
concept being:

 Take all data from different operational systems.


 If necessary, add relevant data from industry.

 Transform all data and bring into a uniform format.

 Integrate all data as a single entity.

26
What is a Data Warehouse ? (Cont…)
It is a blend of many technologies, the basic
concept being:

Store data in a format supporting easy access for


decision support.
 Create performance enhancing indices.

 Implement performance enhancement joins.

 Run ad-hoc queries with low selectivity.

27
How is it Different?
► Fundamentally different
Business user
needs info

Answers result
User requests
in more questions
IT people

?
Business user
may get answers
 IT people do
system analysis
and design

IT people
send reports to IT people
business user create reports

28
How is it Different?
► Different patterns of hardware utilization

100%

0%

Operational DWH

Bus Service vs. Train


29
How is it Different?
► Combines operational and historical data.
Don’t do data entry into a DWH, OLTP or ERP are the source
systems.
OLTP systems don’t keep history, cant get balance statement
more than a year old.
 DWH keep historical data, even of bygone customers. Why?
 In the context of bank, want to know why the customer left?

 What were the events that led to his/her leaving? Why?


 Customer retention.

30
How much history?

► Depends on:
 Industry.
 Cost of storing historical data.

 Economic value of historical data.

31
How much history?
► Industries and history
 Telecomm calls are much much more as compared to
bank transactions- 18 months.

 Retailers interested in analyzing yearly seasonal


patterns- 65 weeks.

 Insurance companies want to do actuary analysis, use


the historical data in order to predict risk- 7 years.

32
How much history?

Economic value of data


Vs.
Storage cost

Data Warehouse a
complete repository of data?
33
How is it Different?
► Usually (but not always) periodic or batch
updates rather than real-time.

 The boundary is blurring for active data warehousing.


 For an ATM, if update not in real-time, then lot of real
trouble.
DWH is for strategic decision making based on historical
data. Wont hurt if transactions of last one hour/day are
absent.

34
How is it Different?

 Rate of update depends on:


 volume of data,
 nature of business,
 cost of keeping historical data,
 benefit of keeping historical data.

35
How is it Different?
► Starts with a 6x12 availability requirement ...
but 7x24 usually becomes the goal.
Decision makers typically don’t work 24 hrs a day and 7
days a week. An ATM system does.

 Once decision makers start using the DWH, and start


reaping the benefits, they start liking it…

Start using the DWH more often, till want it available


100% of the time.

36
How is it Different?
► Starts with a 6x12 availability requirement ...
but 7x24 usually becomes the goal.
 For business across the globe, 50% of the world may be
sleeping at any one time, but the businesses are up 100%
of the time.

100% availability not a trivial task, need to take into


account loading strategies, refresh rates etc.

37
How is it Different?
► Does
not follows the traditional development
model
Requirements

 Program

Classical SDLC

 Requirements gathering
 Analysis
 Design
 Programming
 Testing
 Integration
 Implementation
38
How is it Different?
► Does
not follows the traditional development
model
DWH

Program

 Requirements
DWH SDLC (CLDS)

 Implement warehouse
 Integrate data
 Test for biasness
 Program w.r.t data
 Design DSS system
 Analyze results
 Understand requirement 39
Data Warehouse Vs. OLTP

OLTP (On Line Transaction Processing)


Select tx_date, balance from tx_table
Where account_ID = 23876;

40
Data Warehouse Vs. OLTP

DWH
Select balance, age, sal, gender from
customer_table, tx_table
Where age between (30 and 40) and
Education = ‘graduate’ and
CustID.customer_table =
Customer_ID.tx_table;

41
Data Warehouse Vs. OLTP

OLTP DWH
Primary key used Primary key NOT used
No concept of Primary Index Primary index used
Few rows returned Many rows returned
May use a single table Uses multiple tables
High selectivity of query Low selectivity of query
Indexing on primary key Indexing on primary index
(unique) (non-unique)

42
Data Warehouse Vs. OLTP
OLTP: OnLine Transaction Processing (MIS or Database System)

Data Warehouse OLTP


Scope * Application –Neutral * Application specific
* Single source of “truth” * Multiple databases with repetition
* Evolves over time * Off the shelf application
* How to improve business * Runs the business

Data * Historical, detailed data * Operational data


Perspective * Some summary * No summary
* Lightly denormalized * Fully normalized

Queries * Hardly uses PK * Based on PK


* Number of results * Number of results returned in
returned in thousands hundreds

Time factor * Minutes to hours * Sub seconds to seconds


* Typical availability 6x12 * Typical availability 24x7
43
Comparison of Response Times
► On-lineanalytical processing (OLAP) queries must be
executed in a small number of seconds.
 Often requires denormalization and/or sampling.

► Complex query scripts and large list selections can


generally be executed in a small number of minutes.

► Sophisticated clustering algorithms (e.g., data mining)


can generally be executed in a small number of hours
(even for hundreds of thousands of customers).

44
Putting the pieces together

Data Data Warehouse Server OLAP Servers Clients


(Tier 0) (Tier 1) (Tier 2) (Tier 3)
Semistructured


MOLAP
Sources Query/Reporting

www data
Meta
Data 
 Extract
Data Analysis 








IT
Archived
data
Transform
Load
(ETL)
Warehouse
ROLAP 
Data Mining

Business
Users
Users
Operational
Data Bases 

Data sources Data Marts  Tools
Business Users

45

You might also like