Lecture 2

The document discusses the design process for data warehouses. It describes taking a top-down, bottom-up, or combined approach. The key steps are choosing a business process to model, determining the grain (level of data), dimensions, measures, and implementing it using a three-tier architecture with data, OLAP, and front-end tiers. It also covers metadata repositories, OLAP operations like roll-ups and drill-downs, and types of OLAP including ROLAP, MOLAP and HOLAP.

Data Warehouse and Data Mining 2023-2024 – Second Lecture

1.13 Data Warehouse Design Process

A data warehouse can be built using a top-down approach, a bottom-up
approach, or a combination of both.

 The top-down approach starts with the overall design and planning. It is
useful in cases where the technology is mature and well known, and
where the business problems that must be solved are clear and well
understood.
 The bottom-up approach starts with experiments and prototypes. This is
useful in the early stage of business modeling and technology
development. It allows an organization to move forward at considerably
less expense and to evaluate the benefits of the technology before
making significant commitments.
 In the combined approach, an organization can exploit the planned and
strategic nature of the top-down approach while retaining the rapid
implementation and opportunistic application of the bottom-up
approach.

The warehouse design process consists of the following steps:

1. Choose a business process to model, for example, orders, invoices,
shipments, inventory, account administration, sales, or the general
ledger. If the business process is organizational and involves multiple
complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be
chosen.
2. Choose the grain of the business process. The grain is the
fundamental, atomic level of data to be represented in the fact table
for this process, for example, individual transactions, individual daily
snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record.
Typical dimensions are time, item, customer, supplier, warehouse,
transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical
measures are numeric additive quantities like dollars sold and units
sold.
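The four design steps above can be sketched as a star schema. The following is a minimal illustration using SQLite; the business process ("sales"), the grain (one row per item per store per day), and all table and column names are hypothetical examples, not a prescribed design.

```python
import sqlite3

# A minimal star schema for a hypothetical "sales" business process.
# Grain: one fact row per item sold, per store, per day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (step 3: time, item, store)
    CREATE TABLE dim_time  (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_item  (item_id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);

    -- Fact table (steps 2 and 4): one record at the chosen grain,
    -- carrying numeric additive measures.
    CREATE TABLE fact_sales (
        time_id  INTEGER REFERENCES dim_time(time_id),
        item_id  INTEGER REFERENCES dim_item(item_id),
        store_id INTEGER REFERENCES dim_store(store_id),
        dollars_sold REAL,
        units_sold   INTEGER
    );
""")
conn.execute("INSERT INTO dim_time VALUES (1, '2024-01-15', '2024-01', 2024)")
conn.execute("INSERT INTO dim_item VALUES (1, 'Widget', 'Acme')")
conn.execute("INSERT INTO dim_store VALUES (1, 'Baghdad', 'Iraq')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 99.5, 10)")

total = conn.execute("SELECT SUM(dollars_sold), SUM(units_sold) FROM fact_sales").fetchone()
print(total)  # (99.5, 10)
```

Because the measures are additive, any roll-up over the dimensions reduces to a SUM over fact rows.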

1.14 A Three-Tier Data Warehouse Architecture

Prepared by Dr. Dunia H. Hameed Page 14



Tier-1:

The bottom tier is a warehouse database server that is almost always a
relational database system. Back-end tools and utilities are used to feed data
into the bottom tier from operational databases or other external sources
(such as customer profile information provided by external consultants).
These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data
warehouse. The data are extracted using application program interfaces
known as gateways. A gateway is supported by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and
OLE DB (Object Linking and Embedding, Database) by Microsoft, and
JDBC (Java Database Connectivity). This tier also contains a metadata
repository, which stores information about the data warehouse and its
contents.
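The back-end extraction, cleaning, and transformation pipeline described above can be sketched in miniature. The two "operational sources", their field names, and all data below are invented purely for illustration.

```python
# A toy back-end ETL pass: extract rows from two hypothetical operational
# sources, clean and transform them into a unified format, then load the
# result into the warehouse store.

source_a = [{"cust": "alice", "amount_usd": "120.50"}]   # source 1 (made up)
source_b = [{"customer_name": "BOB", "amount_cents": 9900}]  # source 2 (made up)

def transform(rows, name_key, to_dollars):
    """Merge similarly-shaped data from different sources into one format."""
    out = []
    for r in rows:
        out.append({
            "customer": r[name_key].strip().title(),  # cleaning: normalize names
            "amount": to_dollars(r),                  # transformation: unify units
        })
    return out

warehouse = []  # the "load" target
warehouse += transform(source_a, "cust", lambda r: float(r["amount_usd"]))
warehouse += transform(source_b, "customer_name", lambda r: r["amount_cents"] / 100)

print(warehouse)
# [{'customer': 'Alice', 'amount': 120.5}, {'customer': 'Bob', 'amount': 99.0}]
```

In a real deployment the extraction step would go through a gateway such as ODBC or JDBC rather than reading in-memory lists.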

Tier-2:

The middle tier is an OLAP server that is typically implemented using either
a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP)
model. A ROLAP model is an extended relational DBMS that maps
operations on multidimensional data to standard relational operations. A
MOLAP model is a special-purpose server that directly implements
multidimensional data and operations.


Tier-3:

The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis,
prediction, and so on).

1.15 Metadata Repository

Metadata are data about data. When used in a data warehouse, metadata are
the data that define warehouse objects. Metadata are created for the data
names and definitions of the given warehouse. Additional metadata are
created and captured for timestamping any extracted data, the source of the
extracted data, and missing fields that have been added by data cleaning or
integration processes.

A metadata repository should contain the following:

 A description of the structure of the data warehouse, which includes
the warehouse schema, views, dimensions, hierarchies, and derived
data definitions, as well as data mart locations and contents.
 Operational metadata, which include data lineage (history of migrated
data and the sequence of transformations applied to it), currency of
data (active, archived, or purged), and monitoring information
(warehouse usage statistics, error reports, and audit trails).
 The algorithms used for summarization, which include measure and
dimension definition algorithms, data on granularity, partitions,
subject areas, aggregation, summarization, and predefined queries and
reports.


 The mapping from the operational environment to the data warehouse,
which includes source databases and their contents, gateway
descriptions, data partitions, data extraction, cleaning, transformation
rules and defaults, data refresh and purging rules, and security (user
authorization and access control).
 Data related to system performance, which include indices and
profiles that improve data access and retrieval performance, in
addition to rules for the timing and scheduling of refresh, update, and
replication cycles.
 Business metadata, which include business terms and definitions, data
ownership information, and charging policies.
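The categories above can be pictured as a single structured record. The sketch below models a metadata repository as a plain dictionary whose keys mirror the bullet list; every value is an invented example, not a real repository format.

```python
# A sketch of what a metadata repository tracks, one key per category above.
# All concrete values (schema names, dates, owners) are hypothetical.
metadata_repo = {
    "structure": {                     # warehouse schema, dimensions, hierarchies
        "schema": "star",
        "dimensions": ["time", "item", "store"],
        "hierarchies": {"time": ["day", "month", "year"]},
    },
    "operational": {                   # lineage, currency, monitoring
        "lineage": ["extracted from orders_db", "cleaned", "loaded 2024-01-15"],
        "currency": "active",          # active, archived, or purged
    },
    "summarization": {                 # granularity, predefined queries
        "granularity": "daily",
        "predefined_queries": ["monthly_sales"],
    },
    "mapping": {                       # operational environment -> warehouse
        "orders_db.cust_id": "dim_customer.customer_id",
    },
    "performance": {                   # indices, refresh scheduling
        "indices": ["fact_sales(time_id)"],
        "refresh_schedule": "nightly",
    },
    "business": {                      # business terms, ownership, charging
        "owner": "sales department",
    },
}

print(sorted(metadata_repo))
```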

1.16 OLAP (Online Analytical Processing)

OLAP is an approach to answering multi-dimensional analytical (MDA)
queries swiftly. OLAP is part of the broader category of business
intelligence, which also encompasses relational databases, report writing,
and data mining. OLAP tools enable users to analyze multidimensional data
interactively from multiple perspectives.

OLAP consists of three basic analytical operations:

 Consolidation (Roll-Up)
 Drill-Down
 Slicing and Dicing


Consolidation involves the aggregation of data that can be accumulated and
computed in one or more dimensions. For example, all sales offices are
rolled up to the sales department or sales division to anticipate sales trends.

The drill-down is a technique that allows users to navigate through the
details. For instance, users can view the sales by individual products that
make up a region's sales.

Slicing and dicing is a feature whereby users can take out (slice) a
specific set of data from the OLAP cube and view (dice) the slices from
different viewpoints.
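The three operations can be demonstrated on a tiny fact set using only the standard library. The cities, products, and unit counts below are invented for illustration.

```python
from collections import defaultdict

# Sales facts at the finest grain: (city, product, units). Hypothetical data.
facts = [
    ("Baghdad", "Widget", 10),
    ("Baghdad", "Gadget", 5),
    ("Basra",   "Widget", 7),
]

# Roll-up (consolidation): aggregate away the product dimension
# to get totals per city.
per_city = defaultdict(int)
for city, product, units in facts:
    per_city[city] += units
print(dict(per_city))            # {'Baghdad': 15, 'Basra': 7}

# Drill-down: navigate back to the detailed rows behind one city's total.
baghdad_detail = [(p, u) for c, p, u in facts if c == "Baghdad"]
print(baghdad_detail)            # [('Widget', 10), ('Gadget', 5)]

# Slice: fix one dimension value (product = 'Widget') across all cities.
widget_slice = [(c, u) for c, p, u in facts if p == "Widget"]
print(widget_slice)              # [('Baghdad', 10), ('Basra', 7)]
```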

1.17 Types of OLAP

1. Relational OLAP (ROLAP):

ROLAP works directly with relational databases. The base data and the
dimension tables are stored as relational tables and new tables are created to
hold the aggregated information. It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational
database to give the appearance of traditional OLAP's slicing and dicing
functionality. In essence, each action of slicing and dicing is equivalent to
adding a "WHERE" clause in the SQL statement. ROLAP tools do not use
pre-calculated data cubes but instead pose the query to the standard
relational database and its tables in order to bring back the data required to
answer the question. ROLAP tools feature the ability to ask any question
because the methodology is not limited by the contents of a cube. ROLAP
also has the ability to drill down to the lowest level of detail in the database.
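The "slicing and dicing is a WHERE clause" point can be made concrete with SQLite standing in for the relational back end. The table, its contents, and the query are illustrative only.

```python
import sqlite3

# ROLAP keeps data in ordinary relational tables; slicing the cube
# is just a WHERE clause on those tables. Hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("North", 2023, 100.0),
    ("North", 2024, 150.0),
    ("South", 2024, 80.0),
])

# Slicing on year = 2024 is equivalent to adding a WHERE clause;
# no pre-calculated cube is consulted.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE year = 2024 GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 150.0), ('South', 80.0)]
```

Because the query always runs against the base tables, any question expressible in SQL can be asked, including drill-down to individual rows.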


2. Multidimensional OLAP (MOLAP):

MOLAP is the 'classic' form of OLAP and is sometimes referred to as just
OLAP.

MOLAP stores data in an optimized multi-dimensional array storage,
rather than in a relational database. It therefore requires the pre-computation
and storage of information in the cube, the operation known as processing.

MOLAP tools generally utilize a pre-calculated data set referred to as a data
cube. The data cube contains all the possible answers to a given range of
questions.

MOLAP tools have a very fast response time and the ability to quickly write
back data into the data set.
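The pre-computed cube idea can be sketched with a plain array: queries become direct index lookups rather than SQL. The dimension orderings and all numbers below are invented for illustration.

```python
# A MOLAP engine pre-computes the cube into array storage. Here the "cube"
# is a nested list indexed by (region, quarter); each cell holds a sales
# total that was pre-aggregated at processing time. Hypothetical figures.
regions = ["North", "South"]
quarters = ["Q1", "Q2"]
cube = [
    [120.0, 130.0],   # North: Q1, Q2
    [ 80.0,  95.0],   # South: Q1, Q2
]

def lookup(region, quarter):
    """Answering a query is a direct array access, not a relational query."""
    return cube[regions.index(region)][quarters.index(quarter)]

print(lookup("South", "Q2"))          # 95.0

# Higher-level aggregates can also be pre-computed once during processing:
totals_by_region = [sum(row) for row in cube]
print(totals_by_region)               # [250.0, 175.0]
```

The fast response time of MOLAP comes precisely from this trade: processing cost and storage up front, constant-time lookups afterwards.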

3. Hybrid OLAP (HOLAP):

There is no clear agreement across the industry as to what constitutes Hybrid
OLAP, except that a database will divide data between relational and
specialized storage. For example, for some vendors, a HOLAP database will
use relational tables to hold the larger quantities of detailed data, and use
specialized storage for at least some aspects of the smaller quantities of
more-aggregate or less-detailed data. HOLAP addresses the shortcomings of
MOLAP and ROLAP by combining the capabilities of both approaches.
HOLAP tools can utilize both pre-calculated cubes and relational data
sources.


1.18 Data Preprocessing

1. Data Integration: It combines data from multiple sources into a coherent
data store, as in data warehousing. These sources may include multiple
databases, data cubes, or flat files.

A data integration system is formally defined as a triple <G, S, M>,

where G: the global schema

S: the set of heterogeneous source schemas

M: the mapping between queries over the source schemas and the global schema
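The <G, S, M> triple can be rendered as a small runnable sketch. The global attributes, the two source schemas, and the mapping below are all hypothetical names chosen only to make the roles of G, S, and M concrete.

```python
# A toy rendering of the <G, S, M> formalization.
G = ["customer_id", "total_spent"]          # global schema

S = {                                       # heterogeneous source schemas
    "db1": ["cust_id", "spend_usd"],
    "db2": ["customer_number", "amount"],
}

M = {                                       # mapping: (source, attribute) -> global attribute
    ("db1", "cust_id"): "customer_id",
    ("db1", "spend_usd"): "total_spent",
    ("db2", "customer_number"): "customer_id",
    ("db2", "amount"): "total_spent",
}

def to_global(source, row):
    """Rewrite one source row into the global schema using M."""
    return {M[(source, key)]: value for key, value in row.items()}

print(to_global("db2", {"customer_number": 7, "amount": 42.0}))
# {'customer_id': 7, 'total_spent': 42.0}
```

The same mapping M is what lets a query phrased against G be answered from either source, which is the crux of the schema-matching issue discussed next.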


2. Issues in Data Integration

1. Schema integration and object matching:

How can the data analyst or the computer be sure that customer id in one
database and customer number in another refer to the same attribute?

2. Redundancy:

An attribute (such as annual revenue, for instance) may be redundant if it
can be derived from another attribute or set of attributes. Inconsistencies in
attribute or dimension naming can also cause redundancies in the resulting
data set.

3. Detection and resolution of data value conflicts:

For the same real-world entity, attribute values from different sources may
differ.

3. Data Transformation:

In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:

 Smoothing, which works to remove noise from the data. Such
techniques include binning, regression, and clustering.
 Aggregation, where summary or aggregation operations are applied to
the data. For example, the daily sales data may be aggregated so as to
compute monthly and annual total amounts. This step is typically
used in constructing a data cube for analysis of the data at multiple
granularities.
 Generalization of the data, where low-level or "primitive" (raw) data
are replaced by higher-level concepts through the use of concept
hierarchies. For example, categorical attributes, like street, can be
generalized to higher-level concepts, like city or country.
 Normalization, where the attribute data are scaled so as to fall within a
small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
 Attribute construction (or feature construction), where new attributes
are constructed and added from the given set of attributes to help the
mining process.
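Two of the transformations above can be shown on a toy attribute: min-max normalization into the range 0.0 to 1.0, and smoothing by bin means (a binning technique). The values are invented for illustration.

```python
# A toy numeric attribute. Values are made up.
values = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]

# Normalization (min-max): scale every value into [0.0, 1.0].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized[0], normalized[-1])   # 0.0 1.0

# Smoothing by bin means: split the (already sorted) data into bins of
# size 2 and replace each value with the mean of its bin.
smoothed = []
for i in range(0, len(values), 2):
    bin_ = values[i:i + 2]
    mean = sum(bin_) / len(bin_)
    smoothed += [mean] * len(bin_)
print(smoothed)                        # [6.0, 6.0, 15.5, 15.5, 32.5, 32.5]
```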

4. Data Reduction

Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the
integrity of the original data. That is, mining on the reduced data set should
be more efficient yet produce the same (or almost the same) analytical
results.

Strategies for data reduction include the following:

 Data cube aggregation, where aggregation operations are applied to
the data in the construction of a data cube.
 Attribute subset selection, where irrelevant, weakly relevant, or
redundant attributes or dimensions may be detected and removed.
 Dimensionality reduction, where encoding mechanisms are used to
reduce the dataset size.


 Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models
(which need store only the model parameters instead of the actual
data) or nonparametric methods such as clustering, sampling, and the
use of histograms.
 Discretization and concept hierarchy generation, where raw data
values for attributes are replaced by ranges or higher conceptual
levels. Data discretization is a form of numerosity reduction that is
very useful for the automatic generation of concept hierarchies.
Discretization and concept hierarchy generation are powerful tools for
data mining, in that they allow the mining of data at multiple levels of
abstraction.
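Two of the reduction strategies above can be shown in a few lines: simple random sampling without replacement (a numerosity-reduction method), and discretization of a raw attribute into higher-level concepts. The age data and the range labels are invented for illustration.

```python
import random

# A toy raw attribute: 60 age values.
ages = list(range(18, 78))

# Sampling: keep a 10% random sample instead of the full data.
random.seed(0)  # fixed seed so the sketch is reproducible
sample = random.sample(ages, k=len(ages) // 10)
print(len(sample))                    # 6

# Discretization: replace raw ages with a small concept hierarchy of ranges.
def discretize(age):
    if age < 30:
        return "young"
    if age < 60:
        return "middle_aged"
    return "senior"

print(discretize(25), discretize(45), discretize(65))
# young middle_aged senior
```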

