Unit 2 Data Warehousing and OLAP
OLAP Technology
Introduction
● Data warehouses (DWs) generalize and consolidate data in multidimensional
space.
● The construction of DWs involves data cleaning, data integration, and data
transformation, and can be viewed as an important preprocessing step for data
mining.
● DWs provide online analytical processing (OLAP) tools for the interactive
analysis of multidimensional data of varied granularities.
● DM functionalities can be integrated with OLAP operations to enhance
interactive mining of knowledge at multiple levels of abstraction.
Definition of the DW
Difference between an Operational DB and a DW
Need for using DWs for data analysis
DW: Basic Concepts; DW architecture: Multitiered
Three DW models (Enterprise, Data Mart, Virtual)
Back-end utilities for DW (extraction, transformation, loading)
Metadata repository
● DW provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic
decisions.
● A DW refers to a data repository that is maintained separately from an
organization’s operational databases.
● “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision making
process” - William H. Inmon (Architect of DWs)
● The four keywords—subject-oriented, integrated, time-variant, and
nonvolatile—distinguish DWs from other data repository systems.
Key Features
● Subject-oriented:
○ A DW is organized around major subjects such as customer, supplier, product, and sales.
○ Rather than concentrating on the day-to-day operations and transaction processing of an
organization, a DW focuses on the modeling and analysis of data for decision makers.
● Integrated:
○ A DW is usually constructed by integrating multiple heterogeneous sources.
○ Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, and so on.
● Time-variant:
○ Data are stored to provide information from a historical perspective (e.g., the past 5–10
years).
○ Every key structure in the data warehouse contains, either implicitly or explicitly, a time
element.
● Nonvolatile:
○ A DW is always a physically separate store of data transformed from the application data
found in the operational environment.
○ Due to this separation, a data warehouse does not require transaction processing,
recovery, and concurrency control mechanisms.
○ It usually requires only two operations in data accessing:
■ initial loading of data
■ access of data.
How are organizations using the information from
data warehouses?
Many organizations use this information to support business decision-making
activities, including:
(1) Increasing customer focus, which includes the analysis of customer
buying patterns (such as buying preference, buying time, budget cycles,
and appetites for spending).
(2) Repositioning products and managing product portfolios by comparing
the performance of sales by quarter, by year, and by geographic regions
in order to fine-tune production strategies.
(3) Analyzing operations and looking for sources of profit.
(4) Managing customer relationships, making environmental corrections,
and managing the cost of corporate assets.
Heterogeneous Database Integration
It is highly desirable, yet challenging, to integrate such data and provide easy and efficient
access to it.
Query-driven approach:
● The traditional database approach to heterogeneous database integration is to build
wrappers and integrators (or mediators) on top of multiple, heterogeneous databases.
● When a query is posed to a client site, a metadata dictionary is used to translate the
query into queries appropriate for the individual heterogeneous sites involved.
● These queries are then mapped and sent to local query processors.
● The results returned from the different sites are integrated into a global answer set.
● Disadvantages:
○ This query-driven approach requires complex information filtering and integration processes, and competes
with local sites for processing resources.
○ It is inefficient and potentially expensive for frequent queries, especially queries requiring aggregations.
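The wrapper/mediator flow described above can be sketched in a few lines. This is a minimal illustration, not a real integration framework: the two site schemas, the normalization rules, and all function names are hypothetical.

```python
# Minimal sketch of the query-driven (mediator/wrapper) approach.
# Two heterogeneous sites hold the same kind of sales data under
# different attribute names and encodings (all names are made up).
site_a = [{"cust": "C1", "amt_usd": 100}, {"cust": "C2", "amt_usd": 250}]
site_b = [{"customer_id": "C3", "amount_cents": 9900}]

def wrapper_a(query):
    """Translate the global query for site A; return normalized rows."""
    return [{"customer": r["cust"], "amount": r["amt_usd"]} for r in site_a]

def wrapper_b(query):
    """Translate the global query for site B, converting cents to dollars."""
    return [{"customer": r["customer_id"], "amount": r["amount_cents"] / 100}
            for r in site_b]

def mediator(query):
    """Map the query to each local processor and integrate the
    returned results into a single global answer set."""
    results = []
    for wrapper in (wrapper_a, wrapper_b):
        results.extend(wrapper(query))
    return results

answer = mediator({"select": ["customer", "amount"]})
```

Note how the integration work (renaming `cust` vs. `customer_id`, converting cents to dollars) happens at query time, every time — which is exactly why this approach is expensive for frequent, aggregation-heavy queries.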
Three Data Warehouse Models
• From the architecture point of view, there are three data warehouse
models:
• Enterprise warehouse
• Data mart
• Virtual warehouse
Data Cubes and Cuboids
● The cuboid that holds the lowest level of summarization is called the base
cuboid.
● The 0-D cuboid, which holds the highest level of summarization, is called the
apex cuboid.
● Example: a 3-D (nonbase) cuboid for time, item, and location summarizes
the data for all suppliers.
Schemas for Multidimensional Data Models
○ Star schema
○ Snowflake schema
○ Fact constellation schema
Star Schema
● Each dimension is represented by only one table, and each table contains
a set of attributes.
● The attributes within a dimension table may form either a hierarchy
(total order) or a lattice (partial order).
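A star schema can be sketched concretely with SQLite: one central fact table whose foreign keys point at single-table dimensions. The table and column names below are illustrative, not taken from the text.

```python
import sqlite3

# A minimal star schema: fact_sales at the center, one table per
# dimension (time, item, location). All names here are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY,
                           city TEXT, country TEXT);
CREATE TABLE fact_sales (time_key INTEGER, item_key INTEGER,
                         location_key INTEGER, dollars_sold REAL);
""")
cur.executemany("INSERT INTO dim_time VALUES (?, ?)", [(1, 2023), (2, 2024)])
cur.executemany("INSERT INTO dim_item VALUES (?, ?)",
                [(1, "laptop"), (2, "phone")])
cur.executemany("INSERT INTO dim_location VALUES (?, ?, ?)",
                [(1, "Chicago", "USA")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 1200.0), (2, 1, 1, 800.0), (2, 2, 1, 500.0)])

# A typical star-join query: total dollars sold per item per year.
rows = cur.execute("""
    SELECT t.year, i.item_name, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_time t ON f.time_key = t.time_key
    JOIN dim_item i ON f.item_key = i.item_key
    GROUP BY t.year, i.item_name
    ORDER BY t.year, i.item_name
""").fetchall()
```

Because each dimension is a single table, every query joins the fact table to at most one table per dimension — the defining simplicity of the star schema that the snowflake schema trades away for normalized dimensions.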
Snowflake Schema
Dimensions: The Role of Concept Hierarchies
A concept hierarchy that is a total order for location: street < city <
province_or_state < country.
An interval ($X ... $Y] denotes the range from $X (exclusive) to $Y
(inclusive).
Typical OLAP Operations
Pivot (rotate): A visualization operation that rotates the data axes in view
to provide an alternative data presentation. Ex: the item and location axes
in a 2-D slice are rotated. Other examples include rotating the axes in a
3-D cube, or transforming a 3-D cube into a series of 2-D planes.
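The rotation of a 2-D slice can be sketched with plain dictionaries — a toy illustration, with made-up items and cities, not an OLAP engine:

```python
# A 2-D slice of a sales cube, keyed by (item, location).
# The data values are invented for illustration.
slice_2d = {
    ("phone", "Chicago"): 500,
    ("phone", "Toronto"): 300,
    ("laptop", "Chicago"): 1200,
}

def pivot(cube_slice):
    """Rotate the data axes: each (row, col) key becomes (col, row),
    so location becomes the row axis and item the column axis."""
    return {(col, row): v for (row, col), v in cube_slice.items()}

rotated = pivot(slice_2d)
```

The cell values are untouched; only the presentation axes swap, which is why pivot is classed as a visualization operation rather than an aggregation.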
Data Warehouse Implementation
Create a data cube for AllElectronics sales that contains the following: city,
item, year, and sales in dollars.
● What is the total number of cuboids, or
group-by’s, that can be computed for this
data cube?
● The total number of cuboids, or group-
by's, that can be computed for this data
cube is 2^3 = 8.
● The possible group-by’s are the
following: {(city, item, year), (city, item),
(city, year), (item, year), (city), (item),
(year), ()}, where () means that the
group-by is empty.
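The eight group-by's listed above are exactly the subsets of {city, item, year}, which `itertools.combinations` can enumerate directly:

```python
from itertools import combinations

# Enumerate every possible group-by (cuboid) for the three dimensions
# city, item, and year: one subset per cuboid, 2**3 = 8 in total.
dims = ("city", "item", "year")
cuboids = [subset
           for r in range(len(dims), -1, -1)   # from 3-D down to 0-D
           for subset in combinations(dims, r)]

n = len(cuboids)     # 8, including the empty group-by () — the apex
base = cuboids[0]    # ("city", "item", "year") — the base cuboid
```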
Efficient Data Cube Computation
● The base cuboid contains all three dimensions, city, item, and year.
● It can return the total sales for any combination of the three dimensions.
● The apex cuboid, or 0-D cuboid, refers to the case where the group-by
is empty. It contains the total sum of all sales.
● The base cuboid is the least generalized (most specific) of the cuboids.
The apex cuboid is the most generalized (least specific) of the cuboids,
and is often denoted as all.
● If we start at the apex cuboid and explore downward in the lattice, this
is equivalent to drilling down within the data cube.
● If we start at the base cuboid and explore upward, this is equivalent to
rolling up.
Efficient Data Cube Computation
Based on the syntax of DMQL introduced, the data cube could be defined
as
define cube sales_cube [city, item, year]: sum(sales in dollars)
For a cube with n dimensions, there are a total of 2^n cuboids, including
the base cuboid.
compute cube sales_cube
It would explicitly instruct the system to compute the sales aggregate
cuboids for all of the eight subsets of the set {city, item, year}, including the
empty subset.
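What `compute cube sales_cube` materializes can be sketched as follows: the sum of sales for every one of the eight subsets of {city, item, year}. The sample tuples are invented for illustration; a real system would use far more efficient algorithms than this naive loop.

```python
from itertools import combinations
from collections import defaultdict

# Toy fact table for the sales cube (values are made up).
facts = [
    {"city": "Chicago", "item": "phone",  "year": 2024, "sales": 500.0},
    {"city": "Chicago", "item": "laptop", "year": 2024, "sales": 1200.0},
    {"city": "Toronto", "item": "phone",  "year": 2023, "sales": 300.0},
]
dims = ("city", "item", "year")

# Compute sum(sales) for every group-by, from the empty subset (apex)
# up to the full dimension set (base cuboid).
cube = {}
for r in range(len(dims) + 1):
    for group_by in combinations(dims, r):
        agg = defaultdict(float)
        for row in facts:
            key = tuple(row[d] for d in group_by)
            agg[key] += row["sales"]
        cube[group_by] = dict(agg)

# The apex cuboid (empty group-by) holds the grand total of all sales;
# the base cuboid keeps every (city, item, year) combination distinct.
grand_total = cube[()][()]
base = cube[("city", "item", "year")]
```

Moving from `cube[("city", "item", "year")]` toward `cube[()]` corresponds to rolling up; the opposite direction is drilling down.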
Why Online Analytical Mining?
● High quality of data in data warehouses:
○ A DW constructed by such preprocessing (data cleaning, integration, and transformation) serves as a
valuable source of high-quality data for OLAP as well as for data mining. Notice that data mining may also
serve as a valuable tool for data cleaning and data integration.
● Available information processing infrastructure surrounding data warehouses:
○ Information processing and data analysis infrastructures have been or will be systematically constructed
surrounding DWs, which include accessing, integration, consolidation, and transformation of multiple
heterogeneous databases, ODBC/OLE DB connections, Web-accessing and service facilities, and reporting and
OLAP analysis tools.
○ It is prudent to make the best use of the available infrastructures rather than constructing everything from scratch.
From On-Line Analytical Processing (OLAP) to
On-Line Analytical Mining (OLAM)
● OLAP-based exploratory data analysis:
○ Effective data mining needs exploratory data analysis. A user will often want to traverse
through a database, select portions of relevant data, analyze them at different granularities,
and present knowledge/results in different forms.
○ OLAM provides facilities for data mining on different subsets of data and at different levels of
abstraction, by drilling, pivoting, filtering, dicing, and slicing on a data cube and on some
intermediate data mining results.
○ This, together with data/knowledge visualization tools, will greatly enhance the power and
flexibility of exploratory data mining.
● On-line selection of data mining functions:
○ Often a user may not know what kinds of knowledge they would like to mine. By integrating
OLAP with multiple data mining functions, OLAM provides users with the flexibility to select
desired data mining functions and swap data mining tasks dynamically.
Architecture for OLAM
● An OLAM server performs analytical
mining in data cubes in a similar manner
as an OLAP server performs OLAP.
● The OLAM and OLAP servers both accept
user on-line queries (or commands) via a
graphical user interface API and work with
the data cube in the data analysis via a
cube API.
● A metadata directory is used to guide the
access of the data cube.
● The data cube can be constructed by
accessing and/or integrating multiple
databases via an MDDB API and/or by
filtering a data warehouse via a database
API that may support OLE DB or ODBC
connections.
Architecture for OLAM
● Since an OLAM server may perform
multiple data mining tasks, such as
concept description, association,
classification, prediction, clustering,
time-series analysis, and so on, it
usually consists of multiple integrated
data mining modules and is more
sophisticated than an OLAP server.
Thank You