R16 4-2 DataMining Notes UNIT-I
R16 Regulation
UNIT-I
Introduction to Data Warehouse:
A data warehouse is a database designed to enable business intelligence
activities: it exists to help users understand and enhance their organization's
performance. It is designed for query and analysis rather than for transaction
processing, and usually contains historical data derived from transaction data,
but can include data from other sources. Data warehouses separate analysis
workload from transaction workload and enable an organization to consolidate
data from several sources.
A data warehouse usually stores many months or years of data to support
historical analysis. The data in a data warehouse is typically loaded through an
extraction, transformation, and loading (ETL) process from multiple data
sources.
In short, a data warehouse is a decision support database that is maintained
separately from the organization's operational database. It supports
information processing by providing a solid platform of consolidated,
historical data for analysis.
A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s
decision-making process.
Data warehouse characteristics:
A common way of introducing data warehousing is to refer to the
characteristics of a data warehouse as set forth by William Inmon:
● Subject Oriented
● Integrated
● Nonvolatile
● Time Variant
Data Warehouse—Subject-Oriented
Data warehouses are designed to help you analyze data. For example, to learn
more about your company's sales data, you can build a data warehouse that
concentrates on sales. Using this data warehouse, you can answer questions
such as "Who was our best customer for this item last year?" or "Who is likely
to be our best customer next year?" This ability to define a data warehouse by
subject matter, sales in this case, makes the data warehouse subject oriented.
■ Organized around major subjects, such as customer, product, sales
■ Focusing on the modeling and analysis of data for decision makers, not
on daily operations or transaction processing
■ Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process
Data Warehouse—Integrated
Integration is closely related to subject orientation. Data warehouses must put
data from disparate sources into a consistent format. They must resolve such
problems as naming conflicts and inconsistencies among units of measure.
■ Constructed by integrating multiple, heterogeneous data sources:
relational databases, flat files, on-line transaction records
■ Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
● E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted to this consistent format.
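The conversion step can be illustrated with a minimal sketch, assuming two hypothetical hotel-price feeds that disagree on field names, currency, and whether tax is included. The field names, the exchange rate, and the tax rate are all made up for illustration.

```python
# Hypothetical source records: the same kind of fact, named and measured differently.
source_a = [{"hotel": "Grand", "price_usd": 100.0, "tax_included": True}]
source_b = [{"hotel_name": "Plaza", "rate_eur": 90.0, "incl_tax": False}]

EUR_TO_USD = 1.10  # assumed fixed exchange rate, illustration only
TAX_RATE = 0.10    # assumed tax rate, illustration only


def from_source_a(rec):
    """Source A is already in USD; remove tax if the price includes it."""
    price = rec["price_usd"]
    if rec["tax_included"]:
        price = price / (1 + TAX_RATE)
    return {"hotel": rec["hotel"], "net_price_usd": round(price, 2)}


def from_source_b(rec):
    """Source B is in EUR and uses different field names."""
    price = rec["rate_eur"] * EUR_TO_USD
    if rec["incl_tax"]:
        price = price / (1 + TAX_RATE)
    return {"hotel": rec["hotel_name"], "net_price_usd": round(price, 2)}


# After conversion every row has the same names, currency, and tax treatment.
warehouse_rows = ([from_source_a(r) for r in source_a]
                  + [from_source_b(r) for r in source_b])
```

Once both feeds are mapped to one format, the warehouse rows can be compared and aggregated directly.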
Data Warehouse—Time Variant
A data warehouse's focus on change over time is what is meant by the term
time variant. In order to discover trends and identify hidden patterns and
relationships in business, analysts need large amounts of data.
■ The time horizon for the data warehouse is significantly longer than that
of operational systems
Operational database: current-value data
Data warehouse data: provides information from a historical
perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse contains an element of time,
explicitly or implicitly, whereas the key of operational data may or may
not contain a time element
Data Warehouse—Nonvolatile
Nonvolatile means that, once entered into the data warehouse, data should not
change. This is logical because the purpose of a data warehouse is to enable
you to analyze what has occurred.
■ A physically separate store of data transformed from the operational
environment
■ Operational update of data does not occur in the data warehouse
environment
Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
● initial loading of data and access of data
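As a rough illustration of the two operations, the sketch below models a warehouse table as a store that can only be loaded and read. The class name and the sample rows are hypothetical; a real warehouse enforces this through its load process, not through a Python class.

```python
class AppendOnlyStore:
    """Toy warehouse table that supports only loading and reading."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading appends new snapshots; rows already stored are never modified.
        self._rows.extend(tuple(r) for r in rows)

    def query(self, predicate):
        # Read-only access: a new list is returned, the stored rows stay untouched.
        return [r for r in self._rows if predicate(r)]


store = AppendOnlyStore()
store.load([("2023-01-01", "acct-1", 500), ("2023-01-02", "acct-1", 450)])
history = store.query(lambda r: r[1] == "acct-1")
```

Both daily values for the account remain available, which is what makes historical analysis possible.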
Why a Separate Data Warehouse?
A data warehouse is kept separate from operational databases for the
following reasons −
An operational database is constructed for well-known tasks and
workloads such as searching particular records, indexing, etc. In
contrast, data warehouse queries are often complex and they present a
general form of data.
Operational databases support concurrent processing of multiple
transactions. Concurrency control and recovery mechanisms are
required for operational databases to ensure robustness and
consistency of the database.
An operational database query allows both read and modify operations,
while an OLAP query needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a
data warehouse maintains historical data.
A data warehouse is a database, which is kept separate from the
organization's operational database.
There is no frequent updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization to
analyze its business.
A data warehouse helps executives to organize, understand, and use
their data to take strategic decisions.
Data warehouse systems help in the integration of a diversity of
application systems.
A data warehouse system helps in consolidated historical data analysis.
DBMS— tuned for OLTP: access methods, indexing, concurrency control,
recovery
Warehouse—tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation
Note: There are more and more systems which perform OLAP analysis directly
on relational databases.
Data Warehouse: A Multi-Tiered Architecture:
Generally, a data warehouse adopts a three-tier architecture. Following are
the three tiers of the data warehouse architecture.
● Bottom Tier − The bottom tier of the architecture is the data warehouse
database server. It is almost always a relational database system. Back-end
tools and utilities feed data into the bottom tier; they perform the
extract, clean, load, and refresh functions.
● Middle Tier − In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational
database management system. The ROLAP maps the operations
on multidimensional data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly
implements the multidimensional data and operations.
● Top-Tier − This tier is the front-end client layer. This layer holds the
query tools and reporting tools, analysis tools and data mining tools.
Fig: Data Warehouse Architecture
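To make the ROLAP idea in the middle tier concrete, the sketch below maps a multidimensional request (total sales by city and quarter) to an ordinary relational GROUP BY, using Python's built-in sqlite3 module. The table, its columns, and the numbers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, quarter TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Toronto", "Q1", "Mobile", 100.0),
        ("Toronto", "Q1", "Modem", 50.0),
        ("Vancouver", "Q1", "Mobile", 70.0),
        ("Toronto", "Q2", "Mobile", 30.0),
    ],
)

# The "city x quarter" view of the cube is just an aggregate query.
rows = conn.execute(
    "SELECT city, quarter, SUM(amount) FROM sales "
    "GROUP BY city, quarter ORDER BY city, quarter"
).fetchall()
```

A MOLAP server would instead keep these aggregates in a multidimensional array and look them up by coordinates.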
Different stakeholders have different views regarding the design of a data
warehouse. Four such views are as follows −
● The top-down view − This view allows the selection of relevant
information needed for a data warehouse.
● The data source view − This view presents the information being
captured, stored, and managed by the operational system.
● The data warehouse view − This view includes the fact tables and
dimension tables. It represents the information stored inside the data
warehouse.
● The business query view − It is the view of the data from the viewpoint
of the end-user.
Data Marts:
Data Warehouse Models:
From the perspective of data warehouse architecture, we have the following
data warehouse models −
● Virtual Warehouse
● Data mart
● Enterprise Warehouse
Virtual Warehouse:
A virtual warehouse is a set of views over operational databases. It is easy
to build a virtual warehouse, but doing so requires excess capacity on
operational database servers.
Data Mart:
Data mart contains a subset of organization-wide data. This subset of data is
valuable to specific groups of an organization.
In other words, we can claim that data marts contain data specific to a
particular group. For example, the marketing data mart may contain data
related to items, customers, and sales. Data marts are confined to subjects.
Points to remember about data marts −
● Windows-based or Unix/Linux-based servers are used to implement data
marts. They are implemented on low-cost servers.
● The life cycle of a data mart may become complex in the long run if its
planning and design are not organization-wide.
● Data marts are small in size.
Enterprise Warehouse:
● An enterprise warehouse collects all the information and the subjects
spanning an entire organization
● It provides us enterprise-wide data integration.
Metadata Repository:
Metadata is simply defined as data about data. The data that is used to
represent other data is known as metadata. For example, the index of a book
serves as a metadata for the contents in the book. In other words, we can say
that metadata is the summarized data that leads us to detailed data. In terms
of data warehouse, we can define metadata as follows.
● Metadata is the road-map to a data warehouse.
● Metadata in a data warehouse defines the warehouse objects.
● Metadata acts as a directory. This directory helps the decision support
system to locate the contents of a data warehouse.
Metadata is the data defining warehouse objects. It stores:
● Description of the structure of the data warehouse
Fig: a lattice of cuboids for the sales data cube
Conceptual Modeling of Data Warehouses:
Data modeling is the process of creating a data model for the data to be stored
in a Database. Data modeling helps in the visual representation of data and
enforces business rules, regulatory compliances, and government policies on
the data.
The basic concepts of dimensional modeling are: facts, dimensions and
measures.
A fact is a collection of related data items, consisting of measures and context
data. It typically represents business items or business transactions.
A dimension is a collection of data that describe one business dimension.
Dimensions determine the contextual background for the facts; they are the
parameters over which we want to perform OLAP.
A measure is a numeric attribute of a fact, representing the performance or
behavior of the business relative to the dimensions.
There are three basic schemas that are used in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema: A fact table in the middle connected to a set of dimension
tables. A diagram of a star schema resembles a star, with a fact table at the
center.
Fig: Star schema example
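A star schema can be sketched in a few lines with Python's built-in sqlite3 module. The table and column names below (fact_sales, dim_item, dim_location, dollars_sold) are illustrative assumptions, not a schema taken from these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL
);
INSERT INTO dim_item VALUES (1, 'Mobile', 'Acme'), (2, 'Modem', 'Acme');
INSERT INTO dim_location VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 40.0), (1, 2, 60.0);
""")

# One join per dimension used: the fact table sits at the center of the star.
totals = conn.execute("""
    SELECT l.country, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_location l ON f.location_key = l.location_key
    GROUP BY l.country
    ORDER BY l.country
""").fetchall()
```

Each dimension is a single denormalized table, so any analysis needs at most one join per dimension.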
Snowflake schema: A refinement of star schema where some dimensional
hierarchy is normalized into a set of smaller dimension tables, forming a shape
similar to snowflake.
In computing, a snowflake schema is a logical arrangement of tables in a
multidimensional database such that the entity relationship diagram resembles
a snowflake shape.
The snowflake schema is represented by centralized fact tables which are
connected to multiple dimensions.
Fig: Snowflake schema example
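The difference from a star schema can be shown with plain dictionaries standing in for tables. In this sketch the location dimension is normalized into a city table and a country table; all keys and values are hypothetical.

```python
# Star-style dimension: the country is repeated on every city row.
dim_location_star = {
    1: {"city": "Toronto", "country": "Canada"},
    2: {"city": "Vancouver", "country": "Canada"},
}

# Snowflake-style: the country level is split out into its own table.
dim_country = {10: {"country": "Canada"}}
dim_city = {
    1: {"city": "Toronto", "country_key": 10},
    2: {"city": "Vancouver", "country_key": 10},
}

fact_sales = [
    {"city_key": 1, "dollars_sold": 100.0},
    {"city_key": 2, "dollars_sold": 70.0},
]


def country_of(city_key):
    """Follow the extra lookup that snowflaking introduces."""
    return dim_country[dim_city[city_key]["country_key"]]["country"]


sales_by_country = {}
for row in fact_sales:
    country = country_of(row["city_key"])
    sales_by_country[country] = sales_by_country.get(country, 0.0) + row["dollars_sold"]
```

Normalizing removes the repeated country value, at the cost of one more lookup (one more join in SQL) per query.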
Fact constellations: Multiple fact tables share dimension tables, viewed
as a collection of stars, therefore called galaxy schema or fact constellation.
A fact constellation is a schema used in online analytical processing, made
up of multiple fact tables sharing dimension tables. It extends the star
schema to applications that need more than one fact table.
Fig: Fact Constellation schema example
A Concept hierarchy in Data Cube:
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts.
Concept hierarchies may also be defined by discretizing or grouping values for
a given dimension or attribute, resulting in a set-grouping hierarchy. A total or
partial order can be defined among groups of values. There may be more than
one concept hierarchy for a given attribute or dimension, based on different
user viewpoints.
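Both kinds of hierarchy can be sketched as simple mappings. The city and state tables and the price thresholds below are illustrative assumptions.

```python
# Schema hierarchy for location: city < province_or_state < country.
city_to_state = {
    "Toronto": "Ontario",
    "Vancouver": "British Columbia",
    "Chicago": "Illinois",
}
state_to_country = {
    "Ontario": "Canada",
    "British Columbia": "Canada",
    "Illinois": "USA",
}


def generalize(city, level):
    """Map a city to the requested level of the location hierarchy."""
    if level == "city":
        return city
    state = city_to_state[city]
    if level == "province_or_state":
        return state
    if level == "country":
        return state_to_country[state]
    raise ValueError(f"unknown level: {level}")


def price_band(price):
    """Set-grouping hierarchy: discretize a numeric attribute into named groups."""
    if price < 100:
        return "inexpensive"
    if price < 500:
        return "moderately_priced"
    return "expensive"
```

A different user could define other bands for the same price attribute, which is why one attribute may have several concept hierarchies.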
OLAP Operations in the Multidimensional Data Model:
In the multidimensional model, data are organized into multiple dimensions,
and each dimension contains multiple levels of abstraction defined by concept
hierarchies. This organization provides users with the flexibility to view data
from different perspectives. A number of OLAP data cube operations exist to
materialize these different views, allowing interactive querying and analysis of
the data at hand. Hence, OLAP provides a user-friendly environment for
interactive data analysis.
Roll-up: The roll-up operation (also called the drill-up operation by some
vendors) performs aggregation on a data cube, either by climbing up a concept
hierarchy for a dimension or by dimension reduction.
Figure shows the result of a roll-up operation performed on the central cube by
climbing up the concept hierarchy for location given in Figure. This hierarchy
was defined as the total order “street < city < province or state < country.” The
roll-up operation shown aggregates the data by ascending the location
hierarchy from the level of city to the level of country. When roll-up is
performed by dimension reduction, one or more dimensions are removed from
the given cube.
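Both forms of roll-up can be sketched on a tiny cube stored as a dictionary of cells. The cell values and the city-to-country mapping are made up.

```python
from collections import defaultdict

# Cube cells keyed by (city, quarter, item) -> units sold (made-up numbers).
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Vancouver", "Q1", "Mobile"): 20,
    ("Chicago", "Q1", "Mobile"): 5,
    ("Toronto", "Q2", "Modem"): 7,
}
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}

# Roll-up by climbing the hierarchy: replace each city by its country and re-aggregate.
by_country = defaultdict(int)
for (city, quarter, item), units in cube.items():
    by_country[(city_to_country[city], quarter, item)] += units

# Roll-up by dimension reduction: drop the item dimension entirely.
without_item = defaultdict(int)
for (city, quarter, item), units in cube.items():
    without_item[(city, quarter)] += units
```

In the first case the cube keeps three dimensions but location is coarser; in the second the cube has only two dimensions left.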
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed
data to more detailed data. Drill-down can be realized by either stepping down
a concept hierarchy for a dimension or introducing additional dimensions.
Figure shows the result of a drill-down operation performed on the central cube
by stepping down a concept hierarchy for time defined as “day < month <
quarter < year.” Drill-down occurs by descending the time hierarchy from the
level of quarter to the more detailed level of month. The resulting data cube
details the total sales per month rather than summarizing them by quarter.
Because a drill-down adds more detail to the given data, it can also be
performed by adding new dimensions to a cube.
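Drill-down needs the more detailed data to be available somewhere. The sketch below assumes month-level base facts (made-up numbers) from which the quarterly summary was derived, and recovers the months behind one quarterly cell.

```python
from collections import defaultdict

# Base facts at the month level (made-up numbers).
monthly_sales = {
    ("Toronto", "Jan"): 4,
    ("Toronto", "Feb"): 3,
    ("Toronto", "Mar"): 5,
    ("Toronto", "Apr"): 6,
}
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Summarized view the analyst starts from: one cell per (city, quarter).
quarterly = defaultdict(int)
for (city, month), units in monthly_sales.items():
    quarterly[(city, month_to_quarter[month])] += units


def drill_down(city, quarter):
    """Return the month-level cells behind one quarter-level cell."""
    return {
        month: units
        for (c, month), units in monthly_sales.items()
        if c == city and month_to_quarter[month] == quarter
    }


q1_detail = drill_down("Toronto", "Q1")
```

The monthly values add back up to the quarterly cell, so roll-up and drill-down are inverse navigations over the same data.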
Slice: The slice operation performs a selection on one dimension of the given
cube, resulting in a sub-cube.
Figure shows a slice operation where the sales data are selected from the
central cube for the dimension time using the criterion time = “Q1”.
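A slice on the time dimension can be sketched as a filter that fixes time to a single value. The cube contents are made up.

```python
# Cube cells keyed by (location, time, item) -> units sold (made-up numbers).
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Toronto", "Q2", "Mobile"): 4,
    ("Vancouver", "Q1", "Modem"): 3,
    ("Chicago", "Q3", "Phone"): 8,
}


def slice_cube(cube, time_value):
    """Fix the time dimension to one value; the result has one fewer dimension."""
    return {
        (location, item): units
        for (location, time, item), units in cube.items()
        if time == time_value
    }


q1 = slice_cube(cube, "Q1")
```

Because time is fixed to “Q1”, the result is a 2-D table over location and item.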
Dice: The dice operation defines a subcube by performing a selection on two
or more dimensions.
Figure shows a dice operation on the central cube based on the following
selection criteria that involve three dimensions: (location = “Toronto” or
“Vancouver”) and (time = “Q1” or “Q2”) and (item = “Mobile” or “Modem”).
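The selection criteria above translate directly into a filter over all three dimensions. The cube contents in this sketch are made up.

```python
# Cube cells keyed by (location, time, item) -> units sold (made-up numbers).
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Toronto", "Q3", "Mobile"): 4,
    ("Vancouver", "Q2", "Modem"): 3,
    ("Chicago", "Q1", "Modem"): 8,
    ("Vancouver", "Q1", "Security"): 6,
}


def dice(cube, locations, times, items):
    """Keep only cells whose coordinates fall inside every selected value set."""
    return {
        key: units
        for key, units in cube.items()
        if key[0] in locations and key[1] in times and key[2] in items
    }


sub_cube = dice(
    cube,
    locations={"Toronto", "Vancouver"},
    times={"Q1", "Q2"},
    items={"Mobile", "Modem"},
)
```

Unlike a slice, the result keeps all three dimensions; each is simply restricted to fewer values.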
Pivot (rotate): Pivot (also called rotate) is a visualization operation that
rotates the data axes in view in order to provide an alternative presentation of
the data.
Figure shows a pivot operation where the item and location axes in a 2-D slice
are rotated. Other examples include rotating the axes in a 3-D cube, or
transforming a 3-D cube into a series of 2-D planes.
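For a 2-D slice, pivoting amounts to swapping rows and columns. A minimal sketch with made-up values:

```python
# 2-D slice: rows are items, columns are locations (made-up numbers).
table = {
    "Mobile": {"Toronto": 10, "Vancouver": 20},
    "Modem": {"Toronto": 7, "Vancouver": 3},
}


def pivot(table):
    """Swap the row and column axes of a nested-dict table."""
    rotated = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


by_location = pivot(table)
```

No value changes; only the presentation does, and pivoting twice returns the original table.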
Types of Facts:
There are three types of facts:
Fully Additive: Additive facts are facts that can be summed up through all of
the dimensions in the fact table.
For example, assume that we are a retailer and we have a fact table with the
following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in
each store on a daily basis. Sales_Amount is the fact. In this
case, Sales_Amount is an additive fact, because you can sum up this fact
along any of the three dimensions present in the fact table -- date, store, and
product.
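The additivity of Sales_Amount can be checked with a small sketch: summing along date, store, or product always gives a meaningful total. The rows are made-up sample data.

```python
from collections import defaultdict

# Rows of (Date, Store, Product, Sales_Amount); made-up sample data.
fact = [
    ("2023-01-01", "S1", "P1", 10.0),
    ("2023-01-01", "S2", "P1", 20.0),
    ("2023-01-02", "S1", "P2", 5.0),
]


def total_by(dim_index):
    """Sum Sales_Amount grouped by one dimension column (0=date, 1=store, 2=product)."""
    totals = defaultdict(float)
    for row in fact:
        totals[row[dim_index]] += row[3]
    return dict(totals)


by_date = total_by(0)
by_store = total_by(1)
by_product = total_by(2)
```

Whichever dimension is used for grouping, the group totals add up to the same grand total.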
Semi-Additive: Semi-additive facts are facts that can be summed up for some
of the dimensions in the fact table, but not the others.
For example, consider a bank with the following fact table:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at
the end of each day, as well as the profit margin for each account for each
day. Current_Balance and Profit_Margin are the facts. Current_Balance is a
semi-additive fact: it makes sense to add it up across all accounts (what's
the total current balance for all accounts in the bank?), but it does not make
sense to add it up through time (adding up all current balances for a given
account for each day of the month does not give us any useful information).
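A short sketch of how a semi-additive fact is usually handled: sum across accounts, but take the latest value (or an average) across time. The balances are made up, and using the closing balance is one common choice, not the only one.

```python
# Rows of (Date, Account, Current_Balance); made-up sample data.
balances = [
    ("2023-01-01", "A", 100.0),
    ("2023-01-01", "B", 50.0),
    ("2023-01-02", "A", 120.0),
    ("2023-01-02", "B", 50.0),
]


def bank_total(date):
    """Summing across accounts on one date is meaningful."""
    return sum(balance for d, _, balance in balances if d == date)


def closing_balance(account):
    """Across time, take the most recent balance instead of a sum."""
    rows = [(d, balance) for d, acct, balance in balances if acct == account]
    return max(rows)[1]  # ISO date strings sort chronologically


total_jan_2 = bank_total("2023-01-02")
latest_a = closing_balance("A")
```

Summing account A over both days would give 220.0, a number that corresponds to no real amount of money.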
Non-Additive: Non-additive facts are facts that cannot be summed up for any
of the dimensions present in the fact table.
For example, consider the same bank fact table:
Date
Account
Current_Balance
Profit_Margin
Profit_Margin is a non-additive fact, because it does not make sense to add it
up at the account level or at the day level.
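The usual way around a non-additive ratio is to store its additive components and recompute the ratio after aggregating. The sketch below assumes hypothetical revenue and profit columns, which are not part of the fact table above.

```python
# Hypothetical additive components behind the margin (not in the table above).
rows = [
    {"account": "A", "revenue": 100.0, "profit": 10.0},
    {"account": "B", "revenue": 300.0, "profit": 90.0},
]

# Per-account margins: 0.10 and 0.30.
margins = [r["profit"] / r["revenue"] for r in rows]

# Adding the margins gives 0.40, which is not the margin of anything.
naive_sum = sum(margins)

# Correct overall margin: aggregate the additive parts first, then divide.
overall_margin = sum(r["profit"] for r in rows) / sum(r["revenue"] for r in rows)
```

Here the overall margin is 100 / 400 = 0.25, which differs from both the sum and the simple average of the per-account margins.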
Information processing:
It supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs. A current trend in data warehouse
information processing is to construct low-cost web-based accessing tools that
are then integrated with web browsers.
Analytical processing:
It supports basic OLAP operations, including slice-and-dice, drill-down,
roll-up, and pivoting. It generally operates on historical data in both
summarized and detailed forms. The major strength of online analytical
processing over information processing is the multidimensional analysis of
data warehouse data.
Data mining:
It supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction, and
presenting the mining results using visualization tools.
Types of OLAP Servers:
We have three types of OLAP servers:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Relational OLAP:
ROLAP servers are placed between a relational back-end server and client
front-end tools. To store and manage warehouse data, ROLAP uses a relational
or extended-relational DBMS.
ROLAP includes the following −
● Implementation of aggregation navigation logic.
Multidimensional OLAP:
MOLAP uses array-based multidimensional storage engines for
multidimensional views of data. With multidimensional data stores, the
storage utilization may be low if the data set is sparse. Therefore, many
MOLAP servers use two levels of data storage representation to handle dense
and sparse data sets.
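The storage problem can be seen in a toy example that holds the same cells once as a full array and once as a coordinate map. This only illustrates the dense-versus-sparse trade-off; it is not how any particular MOLAP server lays out its data, and the values are made up.

```python
locations = ["Toronto", "Vancouver", "Chicago"]
quarters = ["Q1", "Q2", "Q3", "Q4"]

# Dense representation: one slot per (location, quarter) cell, empty or not.
dense = [[0] * len(quarters) for _ in locations]
dense[0][0] = 10  # Toronto, Q1
dense[2][3] = 8   # Chicago, Q4

# Sparse representation: store only the non-empty cells, keyed by coordinates.
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

total_cells = len(locations) * len(quarters)
utilization = len(sparse) / total_cells  # fraction of array slots that hold data
```

With 2 of 12 slots filled, most of the dense array is wasted, while a dense region would be cheaper to keep as an array than as a map of coordinates.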
Hybrid OLAP:
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher
scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow
large volumes of detailed data to be stored in a relational store, while the
aggregations are stored separately in a MOLAP store.