Notes DWDM
UNIT: 1
Introduction to data warehousing: Overview, Difference between database system and data
warehouse, the compelling need for data warehousing, Data warehouse – The building blocks,
Defining features, data warehouses and data marts, overview of the components, Three-tier
architecture, Metadata in the data warehouse.
Data pre-processing: Data cleaning, data gateway, synchronization of databases, data
transformation and ETL Process, ETL tools, interoperability of data and applications.
DATA WAREHOUSING
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a
data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection
of data. This data helps analysts to make informed decisions in an organization (company).
An operational database changes frequently, often daily, because of the transactions that
take place. Suppose a business executive wants to analyze earlier data about a product, a
supplier, or a consumer. The executive will have no such data to analyze, because the
earlier values have been overwritten by subsequent transactions.
A data warehouse provides generalized and consolidated data in a multidimensional view.
Along with this view of the data, a data warehouse also provides Online Analytical
Processing (OLAP) tools. These tools help in the interactive and effective analysis of data
in a multidimensional space. This analysis results in data generalization and data mining.
Data mining functions such as association, clustering, classification, and prediction can be
integrated with OLAP operations to enhance the interactive mining of knowledge at
multiple levels of abstraction. That is why the data warehouse has become an important
platform for data analysis and OLAP.
A data warehouse is a database, which is kept separate from the organization's operational
database.
There is no frequent updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization to analyze its
business.
A data warehouse helps executives to organize, understand, and use their data to take
informed decisions.
A data warehouse system helps in historical data analysis.
Why is a Data Warehouse Separated from Operational Databases?
A data warehouse is kept separate from operational databases due to the following reasons −
An operational database is constructed for well-known tasks and workloads such as
searching particular records, indexing, etc. In contrast, data warehouse queries are often
complex and they present a general form of data.
Operational databases support concurrent processing of multiple transactions. Concurrency
control and recovery mechanisms are required for operational databases to ensure
robustness and consistency of the database.
An operational database query can both read and modify data, while an OLAP query
needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data warehouse
maintains historical data.
A data warehouse helps business executives to organize, analyze, and use their data for decision
making. Data warehouses are widely used in the following fields −
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Types of Data Warehouse applications:
Information processing, analytical processing, and data mining are the three types of data
warehouse applications that are discussed below −
Information Processing − A data warehouse allows the data stored in it to be processed. The
data can be processed by means of querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs.
Analytical Processing − A data warehouse supports analytical processing of the information
stored in it. The data can be analyzed by means of basic OLAP operations, including slice-
and-dice, drill down, drill up, and pivoting.
Data Mining − Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, performing classification and prediction. These
mining results can be presented using the visualization tools.
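The basic OLAP operations named above can be illustrated with a minimal sketch. The sales rows, dimension names, and helper functions below are hypothetical and only stand in for a real OLAP server; the sketch assumes a single numeric measure called sales.

```python
from collections import defaultdict

# Hypothetical fact rows: four dimensions and one measure (sales).
SALES = [
    {"year": 2023, "quarter": "Q1", "region": "North", "product": "Pen", "sales": 100},
    {"year": 2023, "quarter": "Q2", "region": "North", "product": "Pen", "sales": 150},
    {"year": 2023, "quarter": "Q1", "region": "South", "product": "Ink", "sales": 80},
    {"year": 2024, "quarter": "Q1", "region": "North", "product": "Ink", "sales": 120},
    {"year": 2024, "quarter": "Q1", "region": "South", "product": "Pen", "sales": 90},
]

def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dim] == value]

def dice(rows, **allowed):
    """Dice: keep rows whose dimensions fall within the given value sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

def aggregate(rows, dims):
    """Sum sales grouped by dims.

    Grouping by fewer dimensions is a drill up; grouping by more is a drill down.
    """
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["sales"]
    return dict(totals)

def pivot(rows, row_dim, col_dim):
    """Pivot: arrange totals as a row-by-column table over two dimensions."""
    table = defaultdict(dict)
    for (row_value, col_value), total in aggregate(rows, [row_dim, col_dim]).items():
        table[row_value][col_value] = total
    return dict(table)

by_year = aggregate(SALES, ["year"])                      # drill up to year level
by_year_quarter = aggregate(SALES, ["year", "quarter"])   # drill down to quarters
north_only = slice_(SALES, "region", "North")             # slice on region
region_by_product = pivot(SALES, "region", "product")     # pivot two dimensions
```

Here drill up and drill down are the same aggregation run at a coarser or finer set of dimensions, which is one simple way to think about moving between levels of abstraction.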
No.  Data Warehouse (OLAP)                            Operational Database (OLTP)
2    Used by knowledge workers such as executives,    Used by clerks, DBAs, or database
     managers, and analysts.                          professionals.
8    Provides summarized and consolidated data.       Provides primitive and highly detailed data.
9    Provides a summarized and multidimensional       Provides a detailed and flat relational
     view of data.                                    view of data.
11   The number of records accessed is in millions.   The number of records accessed is in tens.
12   The database size is from 100 GB to 100 TB.      The database size is from 100 MB to 100 GB.
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is
constructed by integrating data from multiple heterogeneous sources that support analytical
reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data
cleaning, data integration, and data consolidation.
There are decision support technologies that help utilize the data available in a data warehouse.
These technologies help executives to use the warehouse quickly and effectively. They can gather
data, analyze it, and take decisions based on the information present in the warehouse. The
information gathered in a warehouse can be used in any of the following domains −
Tuning Production Strategies − The product strategies can be well tuned by repositioning
the products and managing the product portfolios by comparing the sales quarterly or
yearly.
Customer Analysis − Customer analysis is done by analyzing the customer's buying
preferences, buying time, budget cycles, etc.
Operations Analysis − Data warehousing also helps in customer relationship management,
and making environmental corrections. The information also allows us to analyze business
operations.
There are two approaches to integrating heterogeneous databases −
Query-driven Approach
Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach was used to
build wrappers and integrators on top of multiple heterogeneous databases. These integrators are
also known as mediators.
Disadvantages
Query-driven approach needs complex integration and filtering processes.
This approach is very inefficient.
It is very expensive for frequent queries.
This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow the
update-driven approach rather than the traditional approach discussed earlier. In the
update-driven approach, the information from multiple heterogeneous sources is integrated in
advance and stored in a warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
This approach provides high performance.
The data is copied, processed, integrated, annotated, summarized, and restructured in
a semantic data store in advance.
Query processing does not require an interface to process data at local sources.
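The difference between the two approaches can be sketched in a few lines. The two sources, their field names, and the function names below are hypothetical; the sketch only shows where the integration work happens, at query time or in advance.

```python
# Two hypothetical heterogeneous sources with different field names.
SOURCE_A = [{"cust": "Asha", "amount": 120}, {"cust": "Ravi", "amount": 80}]
SOURCE_B = [{"customer_name": "Asha", "total": 50}]

def read_source_a():
    """Wrapper: present source A rows as (customer, amount) pairs."""
    return [(r["cust"], r["amount"]) for r in SOURCE_A]

def read_source_b():
    """Wrapper: present source B rows as (customer, amount) pairs."""
    return [(r["customer_name"], r["total"]) for r in SOURCE_B]

WRAPPERS = (read_source_a, read_source_b)

def query_driven_total(customer):
    """Query-driven: the mediator visits every source on every query."""
    total = 0
    for wrapper in WRAPPERS:
        total += sum(amount for name, amount in wrapper() if name == customer)
    return total

def build_warehouse():
    """Update-driven: integrate and summarize all sources once, in advance."""
    warehouse = {}
    for wrapper in WRAPPERS:
        for name, amount in wrapper():
            warehouse[name] = warehouse.get(name, 0) + amount
    return warehouse

WAREHOUSE = build_warehouse()

def update_driven_total(customer):
    """Answer directly from the warehouse; no source is contacted."""
    return WAREHOUSE.get(customer, 0)
```

Both functions return the same totals, but the query-driven version repeats the integration and aggregation for every query, which is why it becomes expensive for frequent queries and for queries that need aggregations.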
The following are the functions of data warehouse tools and utilities −
Data Extraction − Involves gathering data from multiple heterogeneous sources.
Data Cleaning − Involves finding and correcting the errors in data.
Data Transformation − Involves converting the data from legacy format to warehouse
format.
Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building
indices and partitions.
Refreshing − Involves updating from data sources to warehouse.
Note − Data cleaning and data transformation are important steps in improving the quality of data
and data mining results.
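The functions listed above can be sketched as a small pipeline. This is a minimal illustration, not a real ETL tool: the source records, the sales table, and the cleaning rule (drop rows whose amount is not numeric) are assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical legacy records from two sources, with inconsistent formats.
SOURCE_1 = [{"id": "1", "name": " alice ", "amount": "10.50"},
            {"id": "2", "name": "BOB", "amount": "n/a"}]
SOURCE_2 = [{"id": "3", "name": "Carol", "amount": "7"}]

def extract():
    """Data extraction: gather rows from every source."""
    return SOURCE_1 + SOURCE_2

def clean(rows):
    """Data cleaning: drop rows whose amount is not a valid number."""
    good = []
    for r in rows:
        try:
            float(r["amount"])
        except ValueError:
            continue
        good.append(r)
    return good

def transform(rows):
    """Data transformation: convert legacy strings to warehouse types."""
    return [(int(r["id"]), r["name"].strip().title(), float(r["amount"]))
            for r in rows]

def load(conn, rows):
    """Data loading: sort the rows, insert them, and build an index.

    The primary key acts as a simple integrity check. Calling load again
    with fresh rows serves as a basic refresh, since existing ids are replaced.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)")
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", sorted(rows))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_name ON sales(name)")
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
loaded = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
```

The row with amount "n/a" is rejected during cleaning, so only valid, typed rows reach the warehouse table.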
Metadata is simply defined as data about data. Data that is used to represent other data is
known as metadata. For example, the index of a book serves as metadata for the contents of the
book. In other words, we can say that metadata is the summarized data that leads us to the
detailed data.
In terms of a data warehouse, we can define metadata as follows −
Metadata is a road-map to data warehouse.
Metadata in data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It contains the following
metadata −
Business metadata − It contains the data ownership information, business definition, and
changing policies.
Operational metadata − It includes currency of data and data lineage. Currency of data
refers to the data being active, archived, or purged. Lineage of data means history of data
migrated and transformation applied on it.
Data for mapping from operational environment to data warehouse − This metadata includes
source databases and their contents, data extraction, data partition, cleaning,
transformation rules, data refresh and purging rules.
The algorithms for summarization − It includes dimension algorithms, data on granularity,
aggregation, summarizing, etc.
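One way to picture a metadata repository is as a directory keyed by warehouse object, with one record per object grouping the kinds of metadata listed above. The class names, field names, and sample values below are hypothetical; this is a sketch of the idea, not the layout of any particular product.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # Business metadata
    owner: str
    business_definition: str
    # Operational metadata
    currency: str                                   # "active", "archived", or "purged"
    lineage: list = field(default_factory=list)     # history of transformations applied
    # Mapping from the operational environment
    source_database: str = ""
    # Summarization
    granularity: str = ""

class MetadataRepository:
    """Directory that helps tools locate and describe warehouse objects."""

    def __init__(self):
        self._entries = {}

    def register(self, table, meta):
        self._entries[table] = meta

    def locate(self, table):
        return self._entries[table]

    def tables_with_currency(self, currency):
        return sorted(t for t, m in self._entries.items() if m.currency == currency)

repo = MetadataRepository()
repo.register("sales_2023", TableMetadata(
    owner="Sales department",
    business_definition="All completed sales for 2023",
    currency="archived",
    lineage=["extracted from orders", "cleaned", "aggregated by month"],
    source_database="orders_db",
    granularity="month",
))
repo.register("sales_2024", TableMetadata(
    owner="Sales department",
    business_definition="All completed sales for 2024",
    currency="active",
))
```

A decision support tool could then call repo.locate("sales_2023") to find out who owns the table, where its data came from, and at what granularity it is summarized.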
Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of the
data warehouse architecture.
Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It
is a relational database system. We use back-end tools and utilities to feed data into
the bottom tier. These back-end tools and utilities perform the extract, clean, load, and
refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in
either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database management
system. The ROLAP maps the operations on multidimensional data to standard
relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query, reporting,
analysis and data mining tools.
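The ROLAP idea in the middle tier, mapping an operation on multidimensional data to standard relational operations, can be sketched as follows. The sales table, its columns, and the roll_up function are hypothetical, and an in-memory SQLite database stands in for the bottom-tier relational server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (2023, "North", 100), (2023, "South", 80),
    (2024, "North", 120), (2024, "South", 90),
])

# Dimensions that may appear in a query; checked before building SQL text.
ALLOWED_DIMS = {"year", "region"}

def roll_up(conn, dims):
    """Translate a roll-up over dims into a standard GROUP BY query."""
    if not dims or not set(dims) <= ALLOWED_DIMS:
        raise ValueError("unknown or missing dimension")
    cols = ", ".join(dims)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()

totals_by_year = roll_up(conn, ["year"])
totals_by_region = roll_up(conn, ["region"])
```

A MOLAP server would instead keep the data in a multidimensional array structure and answer the same question without generating SQL.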
The following diagram depicts the three-tier architecture of data warehouse −
From the perspective of data warehouse architecture, we have the following data warehouse
models −
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
A set of views over operational databases is known as a virtual warehouse. It is easy to build a
virtual warehouse. Building a virtual warehouse requires excess capacity on operational database
servers.
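A virtual warehouse can be sketched as a database view defined over an operational table. The orders table and the customer_totals view below are hypothetical, and SQLite is used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational table, updated by day-to-day transactions.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "Asha", 120.0), (2, "Ravi", 80.0), (3, "Asha", 50.0)])

# The "virtual warehouse": a summarizing view, with no data copied.
conn.execute("""
    CREATE VIEW customer_totals AS
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
""")

def customer_totals(conn):
    """Query the view; the aggregation runs against the operational table."""
    return conn.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()

before = customer_totals(conn)
conn.execute("INSERT INTO orders VALUES (4, 'Ravi', 20.0)")
after = customer_totals(conn)
```

Because the view is recomputed from the operational table on each query, analytical queries compete with transactions for the same server, which is why excess capacity is needed.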
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific
groups of an organization.
In other words, we can claim that data marts contain data specific to a particular group. For
example, the marketing data mart may contain data related to items, customers, and sales. Data
marts are confined to subjects.
Points to remember about data marts −
Windows-based or Unix/Linux-based servers are used to implement data marts.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks
rather than months or years.
The life cycle of a data mart may be complex in long run, if its planning and design are not
organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an entire
organization.
It provides us enterprise-wide data integration.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or
beyond.
NOT TO BE DONE
Load Manager
This component performs the operations required for the extract and load process.
The size and complexity of the load manager varies between specific solutions from one data
warehouse to another.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse
in the fastest possible time.
Transformations affect the speed of data processing.
It is more effective to load the data into a relational database prior to applying
transformations and checks.
Gateway technology is not suitable, since it tends to perform poorly when
large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After this has been
completed, we are in a position to do the complex checks. Suppose we are loading EPOS sales
transactions; we then need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
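The two simple transformations above can be sketched in a few lines. The column names and target types are hypothetical; a real EPOS feed will differ.

```python
# Columns the warehouse needs and their target types (hypothetical).
REQUIRED = {"store_id": int, "sku": str, "quantity": int, "price": float}

def simple_transform(record):
    """Keep only the required columns and convert each to its target type."""
    return {col: cast(record[col]) for col, cast in REQUIRED.items()}

# A raw EPOS record with all values as strings and two unneeded columns.
raw = {"store_id": "17", "sku": "A-100", "quantity": "3", "price": "4.50",
       "till_operator": "jk", "receipt_footer": "Thank you"}

row = simple_transform(raw)
```

A missing required column raises KeyError and a value that cannot be converted raises ValueError, so bad records surface during the load rather than later.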
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of third-
party system software, C programs, and shell scripts.
The size and complexity of warehouse managers vary between specific solutions.
Query Manager
Query manager is responsible for directing the queries to the suitable tables.
By directing the queries to appropriate tables, the speed of querying and response
generation can be increased.
Query manager is responsible for scheduling the execution of the queries posed by the user.
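Directing a query to a suitable table can be sketched as choosing the smallest table that still contains every column the query needs. The table catalogue, row counts, and route function below are hypothetical; this is one simple routing rule, not a description of any specific query manager.

```python
# Hypothetical catalogue: table name -> (available columns, approximate row count).
TABLES = {
    "sales_detail":   ({"day", "month", "year", "store", "product"}, 10_000_000),
    "sales_by_month": ({"month", "year", "store"}, 50_000),
    "sales_by_year":  ({"year", "store"}, 5_000),
}

def route(columns_needed):
    """Pick the smallest table that contains every column the query needs."""
    needed = set(columns_needed)
    candidates = [(rows, name)
                  for name, (cols, rows) in TABLES.items()
                  if needed <= cols]
    if not candidates:
        raise LookupError(f"no table provides {sorted(needed)}")
    return min(candidates)[1]

yearly_target = route(["year", "store"])
monthly_target = route(["month", "store"])
product_target = route(["product", "year"])
```

A query that only needs yearly totals is sent to the small summary table, and only queries that need product-level detail fall through to the large detail table, which is how routing speeds up response generation.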
Query Manager Architecture
The following screenshot shows the architecture of a query manager. It includes the following:
Detailed Information
Detailed information is not kept online; rather, it is aggregated to the next level of detail and
then archived to tape. The detailed information part of the data warehouse keeps the detailed
information in the starflake schema. Detailed information is loaded into the data warehouse to
supplement the aggregated data.
The following diagram shows a pictorial impression of where detailed information is stored and
how it is used.
Note − If detailed information is held offline to minimize disk storage, we should make sure that the
data has been extracted, cleaned up, and transformed into starflake schema before it is archived.