Notes DWDM

DATA WAREHOUSE & DATA MINING (ITITE05)
UNIT: 1
Introduction to data warehousing: Overview, Difference between database system and data
warehouse, the compelling need for data warehousing, Data warehouse – The building blocks,
Defining features, data warehouses and data marts, overview of the components, Three-tier
architecture, Metadata in the data warehouse.
Data pre-processing: Data cleaning, data gateway, synchronization of databases, data
transformation and ETL Process, ETL tools, interoperability of data and applications.
DATA WAREHOUSING
 The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a
data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection
of data. This data helps analysts to make informed decisions in an organization (company).
 An operational database undergoes frequent changes on a daily basis on account of the
transactions that take place. Suppose a business executive wants to analyze previous
feedback on a product, a supplier, or a consumer. The executive will have no data available
to analyze, because the previous data has been overwritten by later transactions.
 A data warehouse provides generalized and consolidated data in a multidimensional view.
Along with this generalized and consolidated view of data, a data warehouse also provides
Online Analytical Processing (OLAP) tools. These tools help us analyze data interactively
and effectively in a multidimensional space. This analysis results in data generalization and
data mining.
 Data mining functions such as association, clustering, classification, and prediction can be
integrated with OLAP operations to enhance the interactive mining of knowledge at
multiple levels of abstraction. That is why the data warehouse has become an important
platform for data analysis and OLAP.
Understanding a Data Warehouse
 A data warehouse is a database, which is kept separate from the organization's operational
database.
 There is no frequent updating done in a data warehouse.
 It possesses consolidated historical data, which helps the organization to analyze its
business.
 A data warehouse helps executives to organize, understand, and use their data to take
informed decisions.
 A data warehouse system helps in historical data analysis.
Why is a Data Warehouse Separated from Operational Databases?
A data warehouse is kept separate from operational databases due to the following reasons −
 An operational database is constructed for well-known tasks and workloads such as
searching particular records, indexing, etc. In contrast, data warehouse queries are often
complex and they present a general form of data.
 Operational databases support concurrent processing of multiple transactions. Concurrency
control and recovery mechanisms are required for operational databases to ensure
robustness and consistency of the database.
 An operational database query allows both read and modify operations, while an OLAP query
needs only read-only access to stored data.
 An operational database maintains current data. On the other hand, a data warehouse
maintains historical data.
Data Warehouse Features
 Subject Oriented − A data warehouse is subject oriented because it provides information
around a subject rather than the organization's ongoing operations. These subjects can be
product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the
ongoing operations, rather it focuses on modelling and analysis of data for decision making.
 Integrated − A data warehouse is constructed by integrating data from heterogeneous
sources such as relational databases, flat files, etc. This integration enhances the effective
analysis of data.
 Time Variant − The data collected in a data warehouse is identified with a particular time
period. The data in a data warehouse provides information from the historical point of
view.
 Non-volatile − Non-volatile means the previous data is not erased when new data is added
to it. A data warehouse is kept separate from the operational database, and therefore
frequent changes in the operational database are not reflected in the data warehouse.
Note − A data warehouse does not require transaction processing, recovery, and concurrency
controls, because it is physically stored separately from the operational database.
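The time-variant and non-volatile features can be sketched in a few lines of Python. This is only an illustration under assumed names (warehouse_rows, load_snapshot, and the sample product are all hypothetical): every load appends rows stamped with a date, and nothing already stored is overwritten.

```python
from datetime import date

# Hypothetical append-only store: each row carries the date it was loaded.
warehouse_rows = []

def load_snapshot(rows, as_of):
    """Append rows stamped with their load date; old rows are never overwritten."""
    for row in rows:
        warehouse_rows.append({**row, "as_of": as_of})

# Two loads of the same product at different times.
load_snapshot([{"product": "P1", "price": 10.0}], date(2023, 1, 1))
load_snapshot([{"product": "P1", "price": 12.0}], date(2023, 2, 1))

# Both the old and the new price remain available for historical analysis.
price_history = [(r["as_of"], r["price"])
                 for r in warehouse_rows if r["product"] == "P1"]
```

An operational database would typically keep only the latest price; here the earlier value stays queryable, which is what historical analysis relies on.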
Data Warehouse Applications
A data warehouse helps business executives to organize, analyze, and use their data for decision
making.
Data warehouses are widely used in the following fields −
 Financial services
 Banking services
 Consumer goods
 Retail sectors
 Controlled manufacturing
Types of Data Warehouse applications:
Information processing, analytical processing, and data mining are the three types of data
warehouse applications that are discussed below −
 Information Processing − A data warehouse allows us to process the data stored in it. The data
can be processed by means of querying, basic statistical analysis, and reporting using crosstabs,
tables, charts, or graphs.
 Analytical Processing − A data warehouse supports analytical processing of the information
stored in it. The data can be analyzed by means of basic OLAP operations, including slice-
and-dice, drill down, drill up, and pivoting.
 Data Mining − Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, performing classification and prediction. These
mining results can be presented using the visualization tools.
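Two of the basic OLAP operations named above, slice and drill up (roll-up), can be sketched on a tiny in-memory data set. The fact rows and function names are assumptions made for illustration; a real OLAP server works on far larger cubes.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales_amount).
facts = [
    (2022, "North", "Laptop", 100),
    (2022, "South", "Laptop", 150),
    (2023, "North", "Laptop", 120),
    (2023, "North", "Phone", 80),
]

def slice_by_year(rows, year):
    """Slice: fix one dimension (year) to a single value."""
    return [r for r in rows if r[0] == year]

def roll_up_to_year(rows):
    """Roll-up (drill up): aggregate away the region and product dimensions."""
    totals = defaultdict(int)
    for year, _region, _product, amount in rows:
        totals[year] += amount
    return dict(totals)

sales_2023 = slice_by_year(facts, 2023)
yearly_totals = roll_up_to_year(facts)
```

Drill down is the reverse of roll-up: it returns from the yearly totals to the finer rows they were computed from.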
Sr. No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides summarized and multidimensional view of data. | It provides detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | It is highly flexible. | It provides high performance.
What is Data Warehousing?
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is
constructed by integrating data from multiple heterogeneous sources that support analytical
reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data
cleaning, data integration, and data consolidation.
Using Data Warehouse Information
There are decision support technologies that help utilize the data available in a data warehouse.
These technologies help executives to use the warehouse quickly and effectively. They can gather
data, analyze it, and take decisions based on the information present in the warehouse. The
information gathered in a warehouse can be used in any of the following domains −
 Tuning Production Strategies − The product strategies can be well tuned by repositioning
the products and managing the product portfolios by comparing the sales quarterly or
yearly.
 Customer Analysis − Customer analysis is done by analyzing the customer's buying
preferences, buying time, budget cycles, etc.
 Operations Analysis − Data warehousing also helps in customer relationship management,
and making environmental corrections. The information also allows us to analyze business
operations.
Integrating Heterogeneous Databases
To integrate heterogeneous databases, we have two approaches −
 Query-driven Approach
 Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach was used to
build wrappers and integrators on top of multiple heterogeneous databases. These integrators are
also known as mediators.
Process of Query-Driven Approach

 When a query is issued at the client side, a metadata dictionary translates the query into an
appropriate form for the individual heterogeneous sites involved.
 Now these queries are mapped and sent to the local query processor.
 The results from heterogeneous sites are integrated into a global answer set.
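The three steps above can be sketched as a small mediator. Everything here is hypothetical (the two sources, their field names, and the metadata dictionary); the sketch only shows the idea of translating a global query per site and merging the answers.

```python
# Two hypothetical sources that store the same facts under different field names.
source_a = [{"cust": "Ann", "total": 30}]
source_b = [{"customer_name": "Bob", "amount": 45}]
sources = {"a": source_a, "b": source_b}

# Hypothetical metadata dictionary: global field name -> local field name per site.
metadata = {
    "a": {"customer": "cust", "sales": "total"},
    "b": {"customer": "customer_name", "sales": "amount"},
}

def mediated_query(fields):
    """Translate the global query for each site, run it locally, merge the results."""
    answer = []
    for name, rows in sources.items():
        mapping = metadata[name]          # translation step for this site
        for row in rows:                  # stands in for the local query processor
            answer.append({f: row[mapping[f]] for f in fields})
    return answer                         # the integrated global answer set

global_answer = mediated_query(["customer", "sales"])
```

Note that the translation and merging work is repeated on every query, which is why this approach becomes expensive for frequent queries.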
Disadvantages
 Query-driven approach needs complex integration and filtering processes.
 This approach is very inefficient.
 It is very expensive for frequent queries.
 This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow the
update-driven approach rather than the traditional approach discussed earlier. In the update-driven
approach, the information from multiple heterogeneous sources is integrated in advance and stored
in a warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
 This approach provides high performance.
 The data is copied, processed, integrated, annotated, summarized and restructured in
semantic data store in advance.
 Query processing does not require an interface to process data at local sources.
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities −
 Data Extraction − Involves gathering data from multiple heterogeneous sources.
 Data Cleaning − Involves finding and correcting the errors in data.
 Data Transformation − Involves converting the data from legacy format to warehouse
format.
 Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building
indices and partitions.
 Refreshing − Involves updating from data sources to warehouse.
Note − Data cleaning and data transformation are important steps in improving the quality of data
and data mining results.
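The extraction, cleaning, transformation, and loading functions can be chained as in the sketch below. The records, field names, and warehouse format are assumptions for illustration. As a simplification, cleaning here drops rows with an unparsable amount rather than correcting them.

```python
def extract():
    # Hypothetical raw records gathered from source systems.
    return [
        {"id": "1", "name": " alice ", "amount": "10.5"},
        {"id": "2", "name": "BOB", "amount": "n/a"},    # erroneous amount
        {"id": "3", "name": "carol", "amount": "7"},
    ]

def clean(rows):
    """Find errors: keep only rows whose amount parses as a number."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good

def transform(rows):
    """Convert to the (assumed) warehouse format: typed and normalized values."""
    return [
        {"id": int(r["id"]), "name": r["name"].strip().title(),
         "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, warehouse):
    """Sort the rows and append them to the warehouse table."""
    warehouse.extend(sorted(rows, key=lambda r: r["id"]))

warehouse = []
load(transform(clean(extract())), warehouse)
```

Refreshing would amount to running the same chain again on data that arrived since the last load.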
Data Warehousing - Terminologies

Metadata
Metadata is simply defined as data about data. Data that is used to represent other data is
known as metadata. For example, the index of a book serves as metadata for the contents of the
book. In other words, we can say that metadata is the summarized data that leads us to the
detailed data.
In terms of a data warehouse, we can define metadata as follows −
 Metadata is a road map to the data warehouse.
 Metadata in a data warehouse defines the warehouse objects.
 Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It contains the following
metadata −
 Business metadata − It contains the data ownership information, business definition, and
changing policies.
 Operational metadata − It includes currency of data and data lineage. Currency of data
refers to the data being active, archived, or purged. Lineage of data means the history of the
data migrated and the transformations applied to it.
 Data for mapping from operational environment to data warehouse − This metadata includes
source databases and their contents, data extraction, data partition, cleaning,
transformation rules, data refresh and purging rules.
 The algorithms for summarization − It includes dimension algorithms, data on granularity,
aggregation, summarizing, etc.
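One way to picture a metadata repository is as a nested dictionary with one entry per warehouse object and one section per kind of metadata listed above. The table name, owner, source name, and rules below are all invented for illustration.

```python
# Hypothetical metadata entry for a single warehouse table.
metadata_repository = {
    "sales_fact": {
        "business": {"owner": "Sales department",
                     "definition": "One row per sale"},
        "operational": {"currency": "active",
                        "lineage": ["extracted from POS", "currency converted"]},
        "mapping": {"source": "pos_db.transactions", "refresh": "daily"},
        "summarization": {"granularity": "day", "aggregation": "SUM(amount)"},
    }
}

def locate(table):
    """Act as a directory: report which source feeds a warehouse object."""
    return metadata_repository[table]["mapping"]["source"]

source_of_sales = locate("sales_fact")
```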
Three-Tier Data Warehouse Architecture
Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of the
data warehouse architecture.
 Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It
is a relational database system. We use back-end tools and utilities to feed data into
the bottom tier. These back-end tools and utilities perform the extract, clean, load, and
refresh functions.
 Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in
either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database management
system. The ROLAP maps the operations on multidimensional data to standard
relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
 Top-Tier − This tier is the front-end client layer. This layer holds the query, reporting,
analysis and data mining tools.
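The ROLAP idea of mapping a multidimensional operation onto standard relational operations can be sketched with Python's built-in sqlite3 module. The sales table and the roll_up helper are assumptions for illustration. Here a roll-up along one dimension becomes an ordinary GROUP BY query.

```python
import sqlite3

# In-memory relational store standing in for the bottom-tier database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (2022, "North", 100.0), (2022, "South", 50.0), (2023, "North", 70.0),
])

def roll_up(dimension):
    """Map a multidimensional roll-up onto a relational GROUP BY."""
    if dimension not in ("year", "region"):   # whitelist of known columns
        raise ValueError(dimension)
    sql = (f"SELECT {dimension}, SUM(amount) FROM sales "
           f"GROUP BY {dimension} ORDER BY {dimension}")
    return conn.execute(sql).fetchall()

by_year = roll_up("year")
by_region = roll_up("region")
```

A MOLAP server would instead keep the aggregated cells in a multidimensional array and read them directly.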
The following diagram depicts the three-tier architecture of data warehouse −
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data warehouse
models
 Virtual Warehouse
 Data mart
 Enterprise Warehouse
Virtual Warehouse
A set of views over operational databases is known as a virtual warehouse. It is easy to build a
virtual warehouse. However, a virtual warehouse requires excess capacity on the operational
database servers, because its queries run against them.
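A minimal sketch of a virtual warehouse, using sqlite3 and a hypothetical orders table: the view stores no data of its own, so each query against it is evaluated on the operational table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, qty INTEGER)")
conn.executemany("INSERT INTO orders (product, qty) VALUES (?, ?)",
                 [("pen", 2), ("pen", 3), ("ink", 1)])

# The view is only a stored query over the operational table.
conn.execute("""
    CREATE VIEW product_totals AS
    SELECT product, SUM(qty) AS total_qty FROM orders GROUP BY product
""")

# A new operational transaction is visible through the view immediately.
conn.execute("INSERT INTO orders (product, qty) VALUES ('ink', 4)")
totals = dict(conn.execute("SELECT product, total_qty FROM product_totals"))
```

Because the aggregation runs on the operational server at query time, analytical load competes with transaction processing, which is where the need for excess capacity comes from.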
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific
groups of an organization.
In other words, we can claim that data marts contain data specific to a particular group. For
example, the marketing data mart may contain data related to items, customers, and sales. Data
marts are confined to subjects.
Points to remember about data marts −
 Windows-based or Unix/Linux-based servers are used to implement data marts.
 The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks
rather than months or years.
 The life cycle of a data mart may become complex in the long run if its planning and design
are not organization-wide.
 Data marts are small in size.
 Data marts are customized by department.
 The source of a data mart is a departmentally structured data warehouse.
 Data marts are flexible.
Enterprise Warehouse
 An enterprise warehouse collects all the information and the subjects spanning an entire
organization.
 It provides us enterprise-wide data integration.
 The data is integrated from operational systems and external information providers.
 This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or
beyond.
NHI KRNA

Load Manager
This component performs the operations required for the extract and load process.
The size and complexity of the load manager vary between specific solutions, from one data
warehouse to another.
Load Manager Architecture

The load manager performs the following functions −
 Extract the data from the source system.
 Fast load the extracted data into a temporary data store.
 Perform simple transformations into a structure similar to the one in the data warehouse.
Extract Data from Source

The data is extracted from the operational databases or the external information providers.
Gateways are the application programs that are used to extract data. A gateway is supported by
the underlying DBMS and allows a client program to generate SQL to be executed at a server.
Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of
gateways.
Fast Load
 In order to minimize the total load window, the data needs to be loaded into the warehouse
in the fastest possible time.
 Transformations affect the speed of data processing.
 It is more effective to load the data into a relational database prior to applying
transformations and checks.
 Gateway technology is not suitable here, since gateways tend to perform poorly when
large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After these have been
completed, we are in a position to do the complex checks. Suppose we are loading EPOS sales
transactions; we need to perform the following checks:
 Strip out all the columns that are not required within the warehouse.
 Convert all the values to required data types.
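The two checks above can be sketched as a single function. The column names and types are assumptions about what an EPOS sales record might contain, not taken from any real system.

```python
# Hypothetical target structure: warehouse column name -> required data type.
WAREHOUSE_COLUMNS = {"store_id": int, "sku": str, "quantity": int, "price": float}

def simple_transform(record):
    """Keep only the warehouse columns and cast each value to its required type."""
    return {col: cast(record[col]) for col, cast in WAREHOUSE_COLUMNS.items()}

# A raw record with two columns the warehouse does not need.
raw = {"store_id": "17", "sku": "A-100", "quantity": "3", "price": "4.50",
       "till_operator": "jsmith", "receipt_footer": "Thank you!"}
row = simple_transform(raw)
```

A value that cannot be converted raises an exception in this sketch; a real load manager would more likely divert such records for later inspection.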
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of
third-party system software, C programs, and shell scripts.
The size and complexity of warehouse managers vary between specific solutions.
Warehouse Manager Architecture

A warehouse manager includes the following −
 The controlling process
 Stored procedures or C with SQL
 Backup/Recovery tool
 SQL Scripts
Operations Performed by Warehouse Manager

 Analyzes the data to perform consistency and referential integrity checks.
 Creates indexes, business views, and partition views against the base data.
 Generates new aggregations and updates existing aggregations. Generates normalizations.
 Transforms and merges the source data into the published data warehouse.
 Backs up the data in the data warehouse.
 Archives the data that has reached the end of its captured life.
Note − A warehouse manager also analyzes query profiles to determine which indexes and
aggregations are appropriate.
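Two of these operations, the referential integrity check and the generation of an aggregation, can be sketched as follows. The product dimension and the sales facts are hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical dimension table and fact rows (product_id, quantity).
products = {"P1": "Pen", "P2": "Ink"}
sales = [("P1", 5), ("P2", 2), ("P9", 1), ("P1", 3)]   # P9 has no dimension row

# Referential integrity check: every fact must reference a known product.
orphans = [s for s in sales if s[0] not in products]
valid = [s for s in sales if s[0] in products]

# Generate an aggregation over the valid base data.
qty_by_product = defaultdict(int)
for product_id, qty in valid:
    qty_by_product[product_id] += qty
```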
Query Manager
 Query manager is responsible for directing the queries to the suitable tables.
 By directing the queries to appropriate tables, the speed of querying and response
generation can be increased.
 Query manager is responsible for scheduling the execution of the queries posed by the user.
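Directing a query to a suitable table can be sketched as a lookup over a catalogue of available tables. The table names and the selection rule (prefer the table with the fewest dimensions that still covers the request) are assumptions for illustration, not a description of any particular product.

```python
# Hypothetical catalogue: table name -> set of dimensions the table holds.
TABLES = {
    "sales_by_year": {"year"},
    "sales_by_year_region": {"year", "region"},
    "sales_detail": {"year", "region", "product", "day"},
}

def route(required_dims):
    """Pick the most summarized table that covers the requested dimensions."""
    candidates = [(len(dims), name) for name, dims in TABLES.items()
                  if set(required_dims) <= dims]
    if not candidates:
        raise LookupError("no table can answer this query")
    return min(candidates)[1]

chosen = route(["year", "region"])
chosen_for_year = route(["year"])
```

Answering from a small summary table instead of the detail table is what speeds up querying and response generation.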
Query Manager Architecture
The following screenshot shows the architecture of a query manager. It includes the following:
 Query redirection via C tool or RDBMS
 Stored procedures
 Query management tool
 Query scheduling via C tool or RDBMS
 Query scheduling via third-party software
Detailed Information
Detailed information is not kept online, rather it is aggregated to the next level of detail and then
archived to tape. The detailed information part of data warehouse keeps the detailed information
in the starflake schema. Detailed information is loaded into the data warehouse to supplement the
aggregated data.
The following diagram shows a pictorial impression of where detailed information is stored and
how it is used.
Note − If detailed information is held offline to minimize disk storage, we should make sure that the
data has been extracted, cleaned up, and transformed into starflake schema before it is archived.
