Data Warehousing and Data Mining

Dr Shivani Thapliyal
Data Warehouse
• A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
• Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
• Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.
• Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with that customer.
• Non-volatile: Once data is in the data warehouse, it will not change. Historical data in a data warehouse should never be altered.
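The time-variant and non-volatile properties can be illustrated with a small sketch. It assumes a hypothetical in-memory table named address_history and helper functions invented for this example; a real warehouse would use database tables, but the idea is the same: rows are only appended, never updated.

```python
from datetime import date

# Hypothetical in-memory "warehouse" table: rows are only ever appended.
address_history = []

def record_address(customer_id, address, valid_from):
    """Append a new address row; existing rows are never updated or deleted."""
    address_history.append(
        {"customer_id": customer_id, "address": address, "valid_from": valid_from}
    )

def addresses_for(customer_id):
    """Return every address ever recorded for a customer, oldest first."""
    rows = [r for r in address_history if r["customer_id"] == customer_id]
    return sorted(rows, key=lambda r: r["valid_from"])

def current_address(customer_id):
    """What a transaction system would typically keep: only the latest address."""
    rows = addresses_for(customer_id)
    return rows[-1]["address"] if rows else None

record_address(1, "12 Old Street", date(2020, 1, 15))
record_address(1, "7 New Avenue", date(2023, 6, 1))

full_history = [r["address"] for r in addresses_for(1)]
latest = current_address(1)
```

Here full_history holds both addresses, while latest holds only the newest one, which mirrors the contrast between a warehouse and a transaction system.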
Data Warehouse Architecture
• A data warehouse is a heterogeneous collection of different data sources organized under a unified schema. There are two approaches to constructing a data warehouse: the top-down approach and the bottom-up approach.
1. Top-down approach:
1. External Sources – An external source is a source from which data is collected, irrespective of the type of data. The data can be structured, semi-structured or unstructured.
2. Staging Area – Since the data extracted from the external sources does not follow a particular format, it must be validated before it is loaded into the data warehouse. For this purpose, it is recommended to use an ETL tool.
• E (Extract): Data is extracted from the external data source.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into the data warehouse after transforming it into the standard format.

3. Data Warehouse – After cleansing, the data is stored in the data warehouse as the central repository. It actually stores the metadata, and the actual data gets stored in the data marts. Note that in this top-down approach the data warehouse stores the data in its purest form.
4. Data Marts – A data mart is also a part of the storage component. It stores the information of a particular function of an organisation that is handled by a single authority. There can be as many data marts in an organisation as there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.
5. Data Mining – The practice of analysing the big data present in the data warehouse is data mining. It is used to find the hidden patterns that are present in the database or in the data warehouse with the help of data mining algorithms.
• This approach is defined by Inmon: the data warehouse is a central repository for the complete organisation, and data marts are created from it after the complete data warehouse has been created.
2. Bottom-up approach:
• First, the data is extracted from external sources (the same as in the top-down approach).
• Then, the data goes through the staging area (as explained above) and is loaded into data marts instead of the data warehouse. The data marts are created first and provide reporting capability. Each data mart addresses a single business area.
• These data marts are then integrated into the data warehouse.
• This approach is given by Kimball: data marts are created first and provide a thin view for analysis, and the data warehouse is created after the complete data marts have been created.
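Both approaches pass data through the staging area, so the extract, transform and load steps described there can be sketched as follows. This is only a minimal illustration: the sample data RAW_CSV, its column names and the three function names are assumptions made for this example, not part of any particular ETL tool.

```python
import csv
import io

# Hypothetical raw export from an external source (assumed CSV with messy values).
RAW_CSV = """product,amount,sold_on
 Widget ,10.50,2024-01-05
gadget,n/a,2024-01-06
WIDGET,4.25,2024-01-07
"""

def extract(text):
    """E: read rows from the external source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """T: validate rows and convert them to one standard format."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        clean.append({
            "product": row["product"].strip().lower(),
            "amount": amount,
            "sold_on": row["sold_on"],
        })
    return clean

def load(rows, warehouse):
    """L: append the standardized rows to the warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract(RAW_CSV)), [])
```

The row with the invalid amount is rejected during the transform step, and the two remaining rows reach the warehouse with a single, consistent spelling of the product name.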
Components of Data Warehouse
1. Data Warehouse Database
• The central component of a typical data warehouse architecture is a database that stores all enterprise data and makes it manageable for reporting. This means you need to choose which kind of database you’ll use to store data in your warehouse.
2. Extraction, Transformation, and Loading Tools (ETL)
ETL tools are central components of an enterprise data warehouse design. These tools help extract data from different sources, transform it into a suitable arrangement, and load it into a data warehouse.
3. Metadata
Metadata in a data warehouse is comparable to the data dictionary or the data catalog in a database management system. It helps in constructing, preserving, handling, and making use of the data warehouse.
4. Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to particular selected subjects. A data mart contains data that is specific to that group.
Data Mining
• Data mining refers to extracting or mining knowledge from large amounts of data.
• The term is actually a misnomer. Data mining would have been more appropriately named knowledge mining, which emphasises mining knowledge from large amounts of data.
• The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
Functionalities of Data Mining
1. Data Characterization:
• This refers to the summary of the general characteristics or features of the class under study. For example, to study the characteristics of software products whose sales increased by 15% two years ago, one can collect the data related to such products by running SQL queries.
2. Data Discrimination:
• It compares the general features of the class under study with those of one or more contrasting classes. The output of this process can be represented in many forms, e.g., bar charts, curves and pie charts.
3. Mining Frequent Patterns: Frequent patterns are patterns that occur most commonly in the data. Different kinds of frequent patterns can be observed in a dataset.
• Frequent item set: This refers to a set of items that are regularly seen together, e.g., milk and sugar.
• Frequent subsequence: This refers to a sequence of events that occurs regularly, such as purchasing a phone followed by a back cover.
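Frequent item sets can be found, in the simplest case, by counting how often items appear together. The sketch below counts item pairs only and uses made-up baskets; the function name frequent_pairs and the min_support threshold are assumptions for this illustration, and practical systems use more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "sugar", "butter"},
]

def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(transactions, min_support=3)
```

With these baskets, only the pair milk and sugar reaches the threshold of three transactions.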
4. Association Analysis:
The process involves uncovering the relationships within the data and deciding the rules of association. It is a way of discovering the relationships between various items. For example, it can be used to determine which items are frequently purchased together.
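Association rules are commonly scored with two standard measures, support and confidence, which the slides do not define. The sketch below computes both for the rule "phone -> back cover" on made-up transactions; the function names are chosen for this example only.

```python
# Hypothetical transactions; each set holds the items bought together.
transactions = [
    {"phone", "back cover"},
    {"phone", "back cover", "charger"},
    {"phone"},
    {"charger"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"phone", "back cover"}, transactions)
rule_confidence = confidence({"phone"}, {"back cover"}, transactions)
```

In this data, half of all baskets contain both items, and two of the three baskets with a phone also contain a back cover.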
5. Correlation Analysis:
Correlation is a mathematical technique that can show whether and how strongly pairs of attributes are related to each other. For example, taller people tend to weigh more.
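One common way to measure such a relationship is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand on made-up height and weight values; the slides do not name a specific coefficient, so the choice of Pearson is an assumption.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up heights (cm) and weights (kg), for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 66, 77, 85]

r = pearson(heights, weights)
```

A value of r close to 1 indicates a strong positive relationship, as in this sample; a value close to -1 would indicate a strong negative one.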
Classification of Data Mining Systems
• Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.
Classification according to the kinds of knowledge mined:
• Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities, such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.
Classification according to the kinds of techniques utilized:
• Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique that combines the merits of a few individual approaches.
Classification according to the applications adapted:
• Data mining systems can also be categorized according to the applications they adapt. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on. Different applications often require the integration of application-specific methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks.
Data Warehousing and OLAP
• A data warehouse is a collection of corporate data aggregated from one or several sources. It serves as a business analytical tool that allows analyzing and comparing data in order to solve working issues and improve business processes.
• How does it work?
• Data is extracted into one area from heterogeneous sources
⇓
converted in accordance with the needs of the decision support system
⇓
stored in the warehouse
OLAP tools in data warehouse
• OLAP in a data warehouse can be defined as a computing technology that allows users to query data and analyze it from different perspectives. The technology is a good fit for business analysts who need pre-aggregated and pre-calculated data for fast analysis.
• OLAP supports complex calculations.
• It provides a multidimensional view of data.
There are three major OLAP models in a data warehouse:
• ROLAP or Relational OLAP: a system where users query data from a relational database or from their own local tables. Thus, the number of potential questions is not limited.
• MOLAP or Multidimensional OLAP: a system that stores the data in a multidimensional database. It provides high-speed calculations.
• HOLAP or Hybrid OLAP: a mix of the two systems mentioned above. Pre-computed aggregates and the cube structure are stored in a multidimensional database.
OLAP Operations
Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
1. Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
• In the example, roll-up is performed by climbing up a concept hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• The data is now grouped into countries rather than cities.
• When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
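Roll-up along the location hierarchy can be sketched as a group-and-sum from city level to country level. The sales figures, the city_to_country mapping and the function name below are made up for this illustration.

```python
from collections import defaultdict

# Hypothetical sales facts at city level: (city, quarter, units_sold).
sales = [
    ("Toronto", "Q1", 605),
    ("Vancouver", "Q1", 825),
    ("Chicago", "Q1", 440),
    ("New York", "Q1", 1560),
]

# One step of the location concept hierarchy: city < country.
city_to_country = {
    "Toronto": "Canada",
    "Vancouver": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}

def roll_up_to_country(rows):
    """Aggregate city-level units into country-level totals per quarter."""
    totals = defaultdict(int)
    for city, quarter, units in rows:
        totals[(city_to_country[city], quarter)] += units
    return dict(totals)

by_country = roll_up_to_country(sales)
```

After the roll-up, the four city rows collapse into two country rows, one for Canada and one for the USA.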
2. Drill-down
• Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension.
• In the example, drill-down is performed by stepping down a concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
• When drill-down is performed by introducing a new dimension, one or more dimensions are added to the data cube.
• It navigates from less detailed data to highly detailed data.
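Drill-down from quarter to month can be sketched as going from a summary total back to the detailed rows behind it. This assumes the monthly figures are still stored; the data, the month_to_quarter mapping and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical facts stored at the finest level kept here: (month, units_sold).
monthly_sales = [
    ("Jan", 150), ("Feb", 100), ("Mar", 150),
    ("Apr", 200), ("May", 120), ("Jun", 80),
]

# One step of the time concept hierarchy: month < quarter.
month_to_quarter = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

def quarterly_view(rows):
    """Summary view: total units per quarter."""
    totals = defaultdict(int)
    for month, units in rows:
        totals[month_to_quarter[month]] += units
    return dict(totals)

def drill_down(rows, quarter):
    """Detailed view: the monthly figures behind one quarter's total."""
    return {m: u for m, u in rows if month_to_quarter[m] == quarter}

summary = quarterly_view(monthly_sales)
q1_detail = drill_down(monthly_sales, "Q1")
```

The monthly values in q1_detail add up to the Q1 total in summary, which shows that drill-down only changes the level of detail, not the underlying data.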
3. Slice
• The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
• Here slice is performed for the dimension "time" using the criterion time = "Q1".
• It forms a new sub-cube by fixing that dimension to a single value and keeping the remaining dimensions.
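The slice with time = "Q1" can be sketched on a tiny cube stored as a dictionary. The cube contents and the function name slice_time are assumptions made for this illustration.

```python
# Hypothetical cube stored as a mapping (location, time, item) -> units_sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q2", "Mobile"): 680,
    ("Vancouver", "Q1", "Modem"): 825,
    ("Vancouver", "Q2", "Modem"): 952,
}

def slice_time(cells, quarter):
    """Fix time to one value; the sub-cube keeps only (location, item)."""
    return {
        (location, item): units
        for (location, time, item), units in cells.items()
        if time == quarter
    }

q1_slice = slice_time(cube, "Q1")
```

The resulting sub-cube has one dimension fewer, because time is now fixed to a single value.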
Dice
• Dice selects two or more dimensions from a given cube and provides a new sub-cube.
• The dice operation on the cube based on the following selection criteria involves three dimensions.
• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item = "Mobile" or "Modem")
4. Pivot
• The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of data.
[Diagram: the pivot operation]
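For a two-dimensional view, pivoting amounts to swapping rows and columns. The sketch below uses a made-up table of locations by items; the function name and the numbers are assumptions for this example.

```python
# Hypothetical 2-D view of a cube: rows are locations, columns are items.
locations = ["Toronto", "Vancouver"]
items = ["Mobile", "Modem", "Phone"]
view = [
    [605, 825, 14],   # Toronto
    [400, 512, 31],   # Vancouver
]

def pivot(row_labels, col_labels, table):
    """Swap the two axes: old columns become rows and old rows become columns."""
    rotated = [list(column) for column in zip(*table)]
    return col_labels, row_labels, rotated

new_rows, new_cols, rotated_view = pivot(locations, items, view)
```

The values are unchanged; only the presentation differs, with items now listed down the rows and locations across the columns.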
Data Mining Applications
• Financial/Banking Sector: Banks use data mining to better understand market risks. It is commonly applied to credit ratings and to intelligent anti-fraud systems to analyse transactions, card transactions, purchasing patterns and customer financial data.
• A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product.
• Identify ‘loyal’ customers.
• Extract information related to customers.
• Determine credit card spending by customer groups.
• Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used by both learners and educators. Using EDM, we can perform educational tasks such as:
• Predicting student admission in higher education
• Student profiling
• Predicting student performance
• Evaluating teachers' teaching performance
• Predicting student placement opportunities
• Market Basket Analysis: Market basket analysis is a technique for the careful study of the purchases made by a customer in a supermarket. It identifies the patterns of items that customers frequently purchase together. Companies can use this analysis to promote deals, offers and sales, and data mining techniques help to carry it out.
• Medicine: Data mining enables more accurate diagnostics. Having all of the patient's information, such as medical records, physical examinations, and treatment patterns, allows more effective treatments to be prescribed. It also enables more effective, efficient and cost-effective management of health resources by identifying risks, predicting illnesses in certain segments of the population or forecasting the length of hospital admission.
• Intrusion Detection:
• A network intrusion refers to any unauthorized activity on a digital network.
• Network intrusions often involve stealing valuable network resources.
• Data mining techniques play a vital role in intrusion detection, that is, in finding network attacks and anomalies. These techniques help in selecting and refining useful and relevant information from large data sets. They also help classify relevant data for an intrusion detection system, which raises alarms when network traffic indicates an intrusion into the system.
