Unit 2 Data Warehousing and OLAP
OLAP Technology
Introduction
● Data warehouses (DWs) generalize and consolidate data in multidimensional
space.
● The construction of DWs involves data cleaning, data integration, and data
transformation, and can be viewed as an important preprocessing step for data
mining.
● DWs provide online analytical processing (OLAP) tools for the interactive
analysis of multidimensional data of varied granularities.
● DM functionalities can be integrated with OLAP operations to enhance
interactive mining of knowledge at multiple levels of abstraction.
Definition of the DW
Difference between an Operational DB and a DW
Need for using DWs for data analysis
DW: Basic Concepts; DW architecture: Multitiered
Three DW models (Enterprise, Data Mart, Virtual)
Back-end utilities for DW (extraction, transformation, loading)
Metadata repository
● DW provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic
decisions.
● A DW refers to a data repository that is maintained separately from an
organization’s operational databases.
● “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision making
process” - William H. Inmon (Architect of DWs)
● The four keywords—subject-oriented, integrated, time-variant, and
nonvolatile—distinguish DWs from other data repository systems.
Key Features
● Subject-oriented:
○ A DW is organized around major subjects such as customer, supplier, product, and sales.
○ Rather than concentrating on the day-to-day operations and transaction processing of an
organization, a DW focuses on the modeling and analysis of data for decision makers.
● Integrated:
○ A DW is usually constructed by integrating multiple heterogeneous sources.
○ Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, and so on.
● Time-variant:
○ Data are stored to provide information from a historical perspective (e.g., the past 5–10
years).
○ Every key structure in the data warehouse contains, either implicitly or explicitly, a time
element.
● Nonvolatile:
○ A DW is always a physically separate store of data transformed from the application data
found in the operational environment.
○ Due to this separation, a data warehouse does not require transaction processing,
recovery, and concurrency control mechanisms.
○ It usually requires only two operations in data accessing:
■ initial loading of data
■ access of data.
How are organizations using the information from
data warehouses?
Many organizations use this information to support business decision-making
activities, including:
(1) Increasing customer focus, which includes the analysis of customer
buying patterns (such as buying preference, buying time, budget cycles,
and appetites for spending).
(2) Repositioning products and managing product portfolios by comparing
the performance of sales by quarter, by year, and by geographic regions
in order to fine-tune production strategies.
(3) Analyzing operations and looking for sources of profit.
(4) Managing customer relationships, making environmental corrections,
and managing the cost of corporate assets.
Heterogeneous Database Integration
It is highly desirable, yet challenging, to integrate such data and provide easy and efficient
access to it.
Query-driven approach:
● The traditional database approach to heterogeneous database integration is to build
wrappers and integrators (or mediators) on top of multiple, heterogeneous databases.
● When a query is posed to a client site, a metadata dictionary is used to translate the
query into queries appropriate for the individual heterogeneous sites involved.
● These queries are then mapped and sent to local query processors.
● The results returned from the different sites are integrated into a global answer set.
● Disadvantages:
○ This query-driven approach requires complex information filtering and integration processes, and competes
with local sites for processing resources.
○ It is inefficient and potentially expensive for frequent queries, especially queries requiring aggregations.
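The wrapper/mediator flow described above can be sketched in a few lines. This is a minimal illustration, not a real integration framework: the two site schemas, the normalization rules, and all function names are hypothetical.

```python
# Minimal sketch of the query-driven (mediator/wrapper) approach.
# Two heterogeneous sites hold the same kind of sales data under
# different attribute names and encodings (all names are made up).
site_a = [{"cust": "C1", "amt_usd": 100}, {"cust": "C2", "amt_usd": 250}]
site_b = [{"customer_id": "C3", "amount_cents": 9900}]

def wrapper_a(query):
    """Translate the global query for site A; return normalized rows."""
    return [{"customer": r["cust"], "amount": r["amt_usd"]} for r in site_a]

def wrapper_b(query):
    """Translate the global query for site B, converting cents to dollars."""
    return [{"customer": r["customer_id"], "amount": r["amount_cents"] / 100}
            for r in site_b]

def mediator(query):
    """Map the query to each local processor and integrate the
    returned results into a single global answer set."""
    results = []
    for wrapper in (wrapper_a, wrapper_b):
        results.extend(wrapper(query))
    return results

answer = mediator({"select": ["customer", "amount"]})
```

Note how the integration work (renaming `cust` vs. `customer_id`, converting cents to dollars) happens at query time, every time — which is exactly why this approach is expensive for frequent, aggregation-heavy queries.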
Three Data Warehouse Models
• From the architecture point of view, there are three data warehouse
models:
• Enterprise warehouse
• Data mart
• Virtual warehouse
Data Cubes and Cuboids
● The cuboid that holds the lowest level of summarization is called the base
cuboid.
● The 0-D cuboid, which holds the highest level of summarization, is called the
apex cuboid.
● Example: a 3-D (nonbase) cuboid for time, item, and location summarizes
the data for all suppliers.
Schemas for Multidimensional Data Models
○ Star schema
○ Snowflake schema
○ Fact constellation schema
Star Schema
● Each dimension is represented by only one table, and each table contains
a set of attributes.
● The attributes within a dimension table may form either a hierarchy
(total order) or a lattice (partial order).
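A star schema can be sketched concretely with SQLite: one central fact table whose foreign keys point at single-table dimensions. The table and column names below are illustrative, not taken from the text.

```python
import sqlite3

# A minimal star schema: fact_sales at the center, one table per
# dimension (time, item, location). All names here are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY,
                           city TEXT, country TEXT);
CREATE TABLE fact_sales (time_key INTEGER, item_key INTEGER,
                         location_key INTEGER, dollars_sold REAL);
""")
cur.executemany("INSERT INTO dim_time VALUES (?, ?)", [(1, 2023), (2, 2024)])
cur.executemany("INSERT INTO dim_item VALUES (?, ?)",
                [(1, "laptop"), (2, "phone")])
cur.executemany("INSERT INTO dim_location VALUES (?, ?, ?)",
                [(1, "Chicago", "USA")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 1200.0), (2, 1, 1, 800.0), (2, 2, 1, 500.0)])

# A typical star-join query: total dollars sold per item per year.
rows = cur.execute("""
    SELECT t.year, i.item_name, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_time t ON f.time_key = t.time_key
    JOIN dim_item i ON f.item_key = i.item_key
    GROUP BY t.year, i.item_name
    ORDER BY t.year, i.item_name
""").fetchall()
```

Because each dimension is a single table, every query joins the fact table to at most one table per dimension — the defining simplicity of the star schema that the snowflake schema trades away for normalized dimensions.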
Snowflake Schema
Dimensions: The Role of Concept Hierarchies
A concept hierarchy that is a total order for location: street < city <
province_or_state < country.
An interval ($X ... $Y] denotes the range from $X (exclusive) to $Y
(inclusive).
Typical OLAP Operations
Pivot (rotate): A visualization operation that rotates the data axes in view
to provide an alternative data presentation. Ex: the item and location axes
in a 2-D slice are rotated. Other examples include rotating the axes in a
3-D cube, or transforming a 3-D cube into a series of 2-D planes.
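The rotation of a 2-D slice can be sketched with plain dictionaries — a toy illustration, with made-up items and cities, not an OLAP engine:

```python
# A 2-D slice of a sales cube, keyed by (item, location).
# The data values are invented for illustration.
slice_2d = {
    ("phone", "Chicago"): 500,
    ("phone", "Toronto"): 300,
    ("laptop", "Chicago"): 1200,
}

def pivot(cube_slice):
    """Rotate the data axes: each (row, col) key becomes (col, row),
    so location becomes the row axis and item the column axis."""
    return {(col, row): v for (row, col), v in cube_slice.items()}

rotated = pivot(slice_2d)
```

The cell values are untouched; only the presentation axes swap, which is why pivot is classed as a visualization operation rather than an aggregation.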
Data Warehouse Implementation
Create a data cube for AllElectronics sales that contains the following: city,
item, year, and sales in dollars.
● What is the total number of cuboids, or
group-by’s, that can be computed for this
data cube?
● The total number of cuboids, or group-
by's, that can be computed for this data
cube is 2^3 = 8.
● The possible group-by’s are the
following: {(city, item, year), (city, item),
(city, year), (item, year), (city), (item),
(year), ()}, where () means that the
group-by is empty.
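The eight group-by's listed above are exactly the subsets of {city, item, year}, which `itertools.combinations` can enumerate directly:

```python
from itertools import combinations

# Enumerate every possible group-by (cuboid) for the three dimensions
# city, item, and year: one subset per cuboid, 2**3 = 8 in total.
dims = ("city", "item", "year")
cuboids = [subset
           for r in range(len(dims), -1, -1)   # from 3-D down to 0-D
           for subset in combinations(dims, r)]

n = len(cuboids)     # 8, including the empty group-by () — the apex
base = cuboids[0]    # ("city", "item", "year") — the base cuboid
```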
Efficient Data Cube Computation
● The base cuboid contains all three dimensions, city, item, and year.
● It can return the total sales for any combination of the three dimensions.
● The apex cuboid, or 0-D cuboid, refers to the case where the group-by
is empty. It contains the total sum of all sales.
● The base cuboid is the least generalized (most specific) of the cuboids.
The apex cuboid is the most generalized (least specific) of the cuboids,
and is often denoted as all.
● If we start at the apex cuboid and explore downward in the lattice, this
is equivalent to drilling down within the data cube.
● If we start at the base cuboid and explore upward, this is equivalent to
rolling up.
Efficient Data Cube Computation
Based on the syntax of DMQL introduced, the data cube could be defined
as
define cube sales_cube [city, item, year]: sum(sales in dollars)
For a cube with n dimensions, there are a total of 2^n cuboids, including
the base cuboid.
compute cube sales_cube
It would explicitly instruct the system to compute the sales aggregate
cuboids for all of the eight subsets of the set {city, item, year}, including the
empty subset.
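What `compute cube sales_cube` materializes can be sketched as follows: the sum of sales for every one of the eight subsets of {city, item, year}. The sample tuples are invented for illustration; a real system would use far more efficient algorithms than this naive loop.

```python
from itertools import combinations
from collections import defaultdict

# Toy fact table for the sales cube (values are made up).
facts = [
    {"city": "Chicago", "item": "phone",  "year": 2024, "sales": 500.0},
    {"city": "Chicago", "item": "laptop", "year": 2024, "sales": 1200.0},
    {"city": "Toronto", "item": "phone",  "year": 2023, "sales": 300.0},
]
dims = ("city", "item", "year")

# Compute sum(sales) for every group-by, from the empty subset (apex)
# up to the full dimension set (base cuboid).
cube = {}
for r in range(len(dims) + 1):
    for group_by in combinations(dims, r):
        agg = defaultdict(float)
        for row in facts:
            key = tuple(row[d] for d in group_by)
            agg[key] += row["sales"]
        cube[group_by] = dict(agg)

# The apex cuboid (empty group-by) holds the grand total of all sales;
# the base cuboid keeps every (city, item, year) combination distinct.
grand_total = cube[()][()]
base = cube[("city", "item", "year")]
```

Moving from `cube[("city", "item", "year")]` toward `cube[()]` corresponds to rolling up; the opposite direction is drilling down.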
Why Online Analytical Mining?
● High quality of data in data warehouses:
○ A DW constructed by such preprocessing (data cleaning, integration, and transformation) serves as a
valuable source of high-quality data for OLAP as well as for data mining. Notice that data mining may also
serve as a valuable tool for data cleaning and data integration.
● Available information processing infrastructure surrounding data warehouses:
○ Information processing and data analysis infrastructures have been or will be systematically constructed
surrounding DWs, which include accessing, integration, consolidation, and transformation of multiple
heterogeneous databases, ODBC/OLE DB connections, Web-accessing and service facilities, and reporting and
OLAP analysis tools.
○ It is prudent to make the best use of the available infrastructures rather than constructing everything from scratch.
From On-Line Analytical Processing (OLAP) to
On-Line Analytical Mining (OLAM)
● OLAP-based exploratory data analysis:
○ Effective data mining needs exploratory data analysis. A user will often want to traverse
through a database, select portions of relevant data, analyze them at different granularities,
and present knowledge/results in different forms.
○ OLAM provides facilities for data mining on different subsets of data and at different levels of
abstraction, by drilling, pivoting, filtering, dicing, and slicing on a data cube and on some
intermediate data mining results.
○ This, together with data/knowledge visualization tools, will greatly enhance the power and
flexibility of exploratory data mining.
● On-line selection of data mining functions:
○ Often a user may not know what kinds of knowledge they would like to mine. By integrating
OLAP with multiple data mining functions, OLAM provides users with the flexibility to select
desired data mining functions and swap data mining tasks dynamically.
Architecture for OLAM
● An OLAM server performs analytical
mining in data cubes in a similar manner
as an OLAP server performs OLAP.
● The OLAM and OLAP servers both accept
user on-line queries (or commands) via a
graphical user interface API and work with
the data cube in the data analysis via a
cube API.
● A metadata directory is used to guide the
access of the data cube.
● The data cube can be constructed by
accessing and/or integrating multiple
databases via an MDDB API and/or by
filtering a data warehouse via a database
API that may support OLE DB or ODBC
connections.
Architecture for OLAM
● Since an OLAM server may perform
multiple data mining tasks, such as
concept description, association,
classification, prediction, clustering,
time-series analysis, and so on, it
usually consists of multiple integrated
data mining modules and is more
sophisticated than an OLAP server.
Thank You