DMDW 1 2nd Module
MODULE-2
DATA WAREHOUSE IMPLEMENTATION & DATA MINING
2.1 Introduction
2.2 Efficient Data Cube computation: An overview
2.3 Indexing OLAP Data: Bitmap index and join index
2.4 Efficient processing of OLAP Queries
2.5 OLAP Server Architecture: ROLAP versus MOLAP versus HOLAP
2.6 Introduction: What is data mining
2.7 Challenges, Data Mining Tasks
2.8 Data: Types of Data
2.9 Data Quality
2.10 Data Preprocessing
2.11 Measures of Similarity and Dissimilarity
2.12 Outcome
2.13 Important Questions
2.1 Introduction
Data Warehouse Design Process:
A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.
• The top-down approach starts with the overall design and planning. It is useful in cases where
the technology is mature and well known, and where the business problems that must be solved are clear
and well understood.
• The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move forward at
considerably less expense and to evaluate the benefits of the technology before making significant
commitments.
• In the combined approach, an organization can exploit the planned and strategic nature of the
top-down approach while retaining the rapid implementation and opportunistic application of the bottom-
up approach.
The compute cube Operator and the Curse of Dimensionality
A data cube is a lattice of cuboids. Suppose that you want to create a data cube for AllElectronics sales that contains the following: city, item, year, and sales in dollars. You want to be able to analyze the data, with queries such as the following:
• "Compute the sum of sales, grouping by city and item."
• "Compute the sum of sales, grouping by city."
• "Compute the sum of sales, grouping by item."
What is the total number of cuboids, or group-by’s, that can be computed for this data cube? Taking the three
attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars as the measure, the
total number of cuboids, or group-by’s, that can be computed for this data cube is 2^3 = 8. The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()},
where () means that the group-by is empty (i.e., the dimensions are not grouped). These group-by’s form a
lattice of cuboids for the data cube, as shown in the figure above.
An SQL query containing no group-by (e.g., "compute the sum of total sales") is a zero-dimensional
operation. An SQL query containing one group-by (e.g., "compute the sum of sales,
group by city") is a one-dimensional operation. A cube operator on n dimensions is equivalent to a collection
of group-by statements, one for each subset of the n dimensions. Therefore, the cube operator is the n-
dimensional generalization of the group-by operator. Similar to the SQL syntax, the data cube in Example
could be defined as:
define cube sales cube [city, item, year]: sum(sales in dollars)
For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid. A statement such as
compute cube sales cube would explicitly instruct the system to compute the sales aggregate cuboids for all
eight subsets of the set {city, item, year}, including the empty subset.
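As a sketch of what compute cube produces, the following Python snippet enumerates all 2^n subsets of the dimensions {city, item, year} and aggregates sum(sales) for each, yielding the eight cuboids described above. The fact rows are hypothetical toy data, not from the text:

```python
from itertools import combinations
from collections import defaultdict

# Toy fact rows: (city, item, year, sales_in_dollars) — hypothetical data.
rows = [
    ("Vancouver", "TV", 2023, 400),
    ("Vancouver", "TV", 2024, 500),
    ("Toronto", "Phone", 2024, 300),
]
dims = ("city", "item", "year")

def compute_cube(rows, dims):
    """Aggregate sum(sales) for every subset of the dimensions (2^n cuboids)."""
    cube = {}
    for r in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), r):
            agg = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in subset)
                agg[key] += row[-1]
            cube[tuple(dims[i] for i in subset)] = dict(agg)
    return cube

cube = compute_cube(rows, dims)
print(len(cube))                        # 8 cuboids for 3 dimensions
print(cube[()][()])                     # apex cuboid: total sales = 1200
print(cube[("city",)][("Vancouver",)])  # 900
```

The empty subset corresponds to the apex cuboid (no group-by), and the full subset (city, item, year) to the base cuboid.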
Online analytical processing may need to access different cuboids for different queries. Therefore, it may
seem like a good idea to compute in advance all or at least some of the cuboids in a data cube.
Precomputation leads to fast response time and avoids some redundant computation.
A major challenge related to this precomputation, however, is that the required storage space may explode if
all the cuboids in a data cube are precomputed, especially when the cube has many dimensions. The storage
requirements are even more excessive when many of the dimensions have associated concept hierarchies,
each with multiple levels. This problem is referred to as the curse of dimensionality.
◼ How many cuboids are there in an n-dimensional cube where dimension i has Li levels?

Total number of cuboids T = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)

where Li is the number of levels associated with dimension i. One is added to Li to include the virtual top level, all. (Note that generalizing to all is equivalent to the removal of the dimension.) If there are many cuboids, and these cuboids are large in size, a more reasonable option is partial materialization; that is, to materialize only some of the possible cuboids that can be generated.
◼ Materialize every (cuboid) (full materialization), none (no materialization), or some (partial
materialization)
◼ Selection of which cuboids to materialize - based on size, sharing, access frequency, etc.
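The cuboid-count formula is easy to check numerically. This short Python sketch (the level counts are illustrative) computes the product of (Li + 1) over all dimensions:

```python
from math import prod

def total_cuboids(levels):
    """Total cuboids when dimension i has levels[i] hierarchy levels.
    The +1 accounts for the virtual top level 'all'."""
    return prod(l + 1 for l in levels)

# 10 dimensions with 4 hierarchy levels each: 5^10 cuboids.
print(total_cuboids([4] * 10))   # 9765625
# A plain 3-D cube with no concept hierarchies: 2^3 = 8.
print(total_cuboids([1, 1, 1]))  # 8
```

Even modest hierarchies make full materialization explode, which motivates partial materialization.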
2.3 Indexing OLAP Data: Bitmap Index and Join Index
The join indexing method gained popularity from its use in relational database query processing. Traditional indexing maps the value in a given column to a list of rows having that value. In contrast, join indexing registers the joinable rows of two relations from a relational database.
◼ In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
E.g., fact table Sales and two dimensions, city and product:
A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city.
◼ Join indices can span multiple dimensions
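To make the contrast concrete, here is a minimal Python sketch of a bitmap index and a join index on the city dimension. The fact-table rows are made up for illustration:

```python
from collections import defaultdict

# Hypothetical fact table rows: (rid, city, product, sales).
sales = [
    (0, "Main Street", "TV", 100),
    (1, "Main Street", "Phone", 200),
    (2, "Lakeview", "TV", 150),
]

# Bitmap index on city: one bit-vector per distinct value,
# with bit i set iff row i has that value.
cities = sorted({r[1] for r in sales})
bitmap = {c: [1 if r[1] == c else 0 for r in sales] for c in cities}

# Join index on city: each distinct city value -> list of fact-table R-IDs.
join_index = defaultdict(list)
for rid, city, product, _ in sales:
    join_index[city].append(rid)

print(bitmap["Main Street"])   # [1, 1, 0]
print(join_index["Lakeview"])  # [2]
```

Bit-vectors support fast AND/OR combination of predicates, while the join index avoids recomputing the dimension-to-fact join at query time.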
2.4 Efficient Processing of OLAP Queries
Given materialized views, query processing should proceed as follows:
◼ Determine which operations should be performed on the available cuboids
Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection +
projection
◼ Determine which materialized cuboid(s) should be selected for OLAP op.
Let the query to be processed be on {brand, province_or_state} with the condition "year = 2004", and there are four materialized cuboids available:
2.6 Introduction: What is Data Mining
Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection.
Scientific Viewpoint
Data is collected and stored at enormous speeds (GB/hour). E.g.
– remote sensors on a satellite
– telescopes scanning the skies
– scientific simulations
As an important step toward improving our understanding of the Earth's climate system, NASA has deployed
a series of Earth orbiting satellites that continuously generate global observations of the Land surface,
oceans, and atmosphere.
Techniques developed in data mining can aid Earth scientists in answering questions such as
• "What is the relationship between the frequency and intensity of ecosystem disturbances such as
droughts and hurricanes to global warming?"
• "How are land surface precipitation and temperature affected by ocean surface temperature?"
• "How well can we predict the beginning and end of the growing season for a region?"
Preprocessing
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may
reside in a centralized data repository or be distributed across multiple sites.
• The purpose of preprocessing is to transform the raw input data into an appropriate format for
subsequent analysis.
• Data preprocessing includes fusing data from multiple sources, cleaning data to remove noise
and duplicate observations, and selecting records and features that are relevant to the data mining task at hand.
Post processing:
• Ensures that only valid and useful results are incorporated into the system.
• Allows analysts to explore the data and the data mining results from a variety of viewpoints.
• Testing methods can also be applied during post processing to eliminate spurious data mining results.
Classification: Definition
✓ Given a collection of records (training set)
✓ Each record contains a set of attributes, one of the attributes is the class.
✓ Find a model for class attribute as a function of the values of other attributes.
✓ Goal: previously unseen records should be assigned a class as accurately as possible.
✓ A test set is used to determine the accuracy of the model. Usually, the given data set is divided
into training and test sets, with training set used to build the model and test set used to validate it.
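The train/validate workflow above can be sketched in a few lines of Python. The records and the 1-nearest-neighbour model are purely illustrative stand-ins for a real classifier:

```python
import math

# Toy labeled records: (attribute vector, class) — hypothetical data.
training = [((1.0, 1.0), "buy"), ((1.2, 0.9), "buy"),
            ((5.0, 5.0), "don't buy"), ((4.8, 5.2), "don't buy")]
test_set = [((1.1, 1.0), "buy"), ((5.1, 4.9), "don't buy")]

def predict(x):
    """1-nearest-neighbour model: assign the class of the closest training record."""
    return min(training, key=lambda t: math.dist(x, t[0]))[1]

# Accuracy of the model on the held-out test set.
accuracy = sum(predict(x) == label for x, label in test_set) / len(test_set)
print(accuracy)  # 1.0
```

The training set builds the model (here, simply the stored records); the test set, never seen during training, estimates how accurately previously unseen records will be classified.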
Classification: Application 1
Direct Marketing
Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell phone product.
Approach:
✓ Use the data for a similar product introduced before.
✓ We know which customers decided to buy and which decided otherwise. This {buy, don’t buy}
decision forms the class attribute.
✓ Collect various demographic, lifestyle, and company-interaction related information about all such
customers, such as type of business, where they stay, how much they earn, etc.
✓ Use this information as input attributes to learn a classifier model.
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
✓ Use credit card transactions and the information on the account holder as attributes.
✓ When does a customer buy, what does he buy, how often does he pay on time, etc.
✓ Label past transactions as fraud or fair transactions. This forms the class attribute.
✓ Learn a model for the class of the transactions.
✓ Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
Customer Attrition/Churn:
Goal:
To predict whether a customer is likely to be lost to a competitor.
Approach:
✓ Use detailed records of transactions with each of the past and present customers to find attributes.
✓ How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
✓ Label the customers as loyal or disloyal.
✓ Find a model for loyalty.
Clustering: Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find
clusters such that
✓ Data points in one cluster are more similar to one another.
✓ Data points in separate clusters are less similar to one another.
✓ Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
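As an illustration of clustering with Euclidean distance as the similarity measure, here is a minimal k-means sketch in Python. The points and the choice of k-means are assumptions for illustration, not part of the original notes:

```python
import math
import random

def kmeans(points, k, iters=10, seed=0):
    """Minimal k-means: assign each point to the nearest center (Euclidean
    distance), then move each center to the mean of its cluster."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

# Two well-separated groups of points — the clusters are obvious by eye.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Points within each recovered cluster are closer to one another than to points in the other cluster, matching the definition above.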
Clustering: Application 1
Market Segmentation:
Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be
selected as a market target to be reached with a distinct marketing mix.
Approach:
✓ Collect different attributes of customers based on their geographical and lifestyle related
information.
✓ Find clusters of similar customers.
✓ Measure the clustering quality by observing buying patterns of customers in same cluster vs. those
from different clusters.
Clustering: Application 2
Document Clustering
Goal: To find groups of documents that are similar to each other based on the important terms appearing in
them.
Approach:
✓ To identify frequently occurring terms in each document. Form a similarity measure based on the
frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered
documents.
Association Rule Discovery: Application
✓ Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
✓ Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
✓ Bagels in the antecedent and Potato Chips in the consequent => Can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!
Problems:
Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender. No. This is a simple database query.
(b) Dividing the customers of a company according to their profitability.
No. This is an accounting calculation, followed by the application of a threshold. However, predicting the
profitability of a new customer would be data mining.
(c) Computing the total sales of a company. No. Again, this is simple accounting.
(d) Sorting a student database based on student identification numbers. No. Again, this is a simple
database query.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the dice are fair, this is a probability calculation. If the dice were not fair, and we needed to estimate the probabilities of each outcome from the data, then this is more like the problems considered by data mining. However, in this specific case, solutions to this problem were developed by mathematicians a long time ago, and thus we wouldn’t consider it to be data mining.
(f) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the continuous value of the stock price. This is an
example of the area of data mining known as predictive modeling. We could use regression for this
modeling, although researchers in many fields have developed a wide variety of techniques for predicting
time series.
(g) Monitoring the heart rate of a patient for abnormalities.
Yes. We would build a model of the normal behavior of heart rate and raise an alarm when an unusual heart
behavior occurred. This would involve the area of data mining known as anomaly detection. This could also
be considered as a classification problem if we had examples of both normal and abnormal heart behavior.
(h) Monitoring seismic waves for earthquake activities.
Yes. In this case, we would build a model of different types of seismic wave behavior associated with
earthquake activities and raise an alarm when one of these different types of seismic activity was observed.
This is an example of the area of data mining known as classification.
(i) Extracting the frequencies of a sound wave. No. This is signal processing.
2.8 Data Types: What is Data?
✓ Collection of data objects and their attributes
✓ An attribute is a property or characteristic of an object
o Examples: eye color of a person, temperature, etc.
o Attribute is also known as variable, field, characteristic, or feature
✓ A collection of attributes describes an object
o Object is also known as record, point, case, sample, entity, or instance.
Attribute Values
✓ Attribute values are numbers or symbols assigned to an attribute
✓ Distinction between attributes and attribute values
o Example: height can be measured in feet or meters
✓ Different attributes can be mapped to the same set of values
o Example: Attribute values for ID and age are integers
✓ But properties of attribute values can be different
o ID has no limit but age has a maximum and minimum value
Types of Attributes
There are different types of attributes
✓ Nominal
o Examples: ID numbers, eye color, zip codes
✓ Ordinal
o Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in
{tall, medium, short}
✓ Interval
o Examples: calendar dates, temperatures in Celsius or Fahrenheit.
✓ Ratio
o Examples: temperature in Kelvin, length, time, counts
Discrete Attribute
Has a finite or countably infinite set of values
✓ Examples: zip codes, counts, or the set of words in a collection of documents.
✓ Often represented as integer variables.
✓ Binary attributes (e.g., {male, female}) are a special case of discrete attributes.
Continuous Attribute
Has real numbers as attribute values
✓ Examples: temperature, height, or weight.
✓ Practically, real values can only be measured and represented using a finite number of digits.
✓ Continuous attributes are typically represented as floating-point variables.
Ordered Data
For some types of data, the attributes have relationships that involve order in time or space. Different types of ordered data are:
Sequential Data: Also referred to as temporal data; can be thought of as an extension of record data, where each record has a time associated with it.
Example: A retail transaction data set that also stores the time at which each transaction took place.
Sequence Data: Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters. It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence.
Example: The genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes.
Example: the genetic information of plants and animals can be represented in the form of sequences of
nucleotides that are known as genes.
Time Series Data : Time series data is a special type of sequential data in which each record is a time series,
i.e., a series of measurements taken over time.
Example: A financial data set might contain objects that are time series of the daily prices of various stocks.
Example: Consider a time series of the average monthly temperature for a city during the years 1982 to 1994.
Spatial Data : Some objects have spatial attributes, such as positions or areas, as well as other types of
attributes.
Example: Weather data (precipitation, temperature, pressure) that is collected for a variety of geographical
locations
Missing Values:
✓ Reasons for missing values
• Information is not collected (e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
✓ Handling missing values
• Eliminate Data Objects
• Estimate Missing Values
• Ignore the Missing Value During Analysis
• Replace with all possible values (weighted by their probabilities)
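Two of the handling strategies above (eliminating data objects and estimating missing values) can be sketched in Python; the records are hypothetical:

```python
# Hypothetical records, with None marking a missing age value.
records = [{"age": 25}, {"age": None}, {"age": 35}]

# Strategy 1: eliminate data objects that have missing values.
complete = [r for r in records if r["age"] is not None]

# Strategy 2: estimate (impute) the missing value, here with the attribute mean.
mean_age = sum(r["age"] for r in complete) / len(complete)
imputed = [{"age": r["age"] if r["age"] is not None else mean_age}
           for r in records]

print(len(complete))      # 2
print(imputed[1]["age"])  # 30.0
```

Elimination is simple but discards information; imputation keeps every object at the cost of introducing estimated values.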
Duplicate Data:
✓ Data set may include data objects that are duplicates, or almost duplicates of one another
✓ Major issue when merging data from heterogeneous sources
✓ Examples: Same person with multiple email addresses
✓ Data cleaning: the process of dealing with duplicate data issues.
Aggregation
✓ Combining two or more attributes (or objects) into a single attribute (or object)
Purpose:
o Data reduction: reduce the number of attributes or objects
o Change of scale: cities aggregated into regions, states, countries, etc.
o More "stable" data: aggregated data tends to have less variability.
Sampling
✓ Sampling is the main technique employed for data selection.
✓ It is often used for both the preliminary investigation of the data and the final data analysis.
✓ Statisticians sample because obtaining the entire set of data of interest is too expensive or time
consuming.
✓ Sampling is used in data mining because processing the entire set of data of interest is too
expensive or time consuming.
✓ The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set if the sample is representative. A sample is representative if it has approximately the same properties (of interest) as the original set of data.
Types of Sampling
✓ Simple Random Sampling
There is an equal probability of selecting any particular item
✓ Sampling without replacement
As each item is selected, it is removed from the population
✓ Sampling with replacement
Objects are not removed from the population as they are selected for the sample. In sampling with
replacement, the same object can be picked up more than once
✓ Stratified sampling
Split the data into several partitions; then draw random samples from each partition.
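The four sampling types above can be illustrated with Python's standard random module (the population and the strata are made up):

```python
import random
from collections import defaultdict

random.seed(1)
population = list(range(100))

# Simple random sampling without replacement: each item selected at most once.
no_repl = random.sample(population, 10)

# Sampling with replacement: the same object can be picked more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split the data into partitions, then draw random
# samples from each partition (here, two strata: even and odd numbers).
strata = defaultdict(list)
for x in population:
    strata[x % 2].append(x)
stratified = [v for part in strata.values() for v in random.sample(part, 5)]

print(len(no_repl), len(set(no_repl)))  # 10 10  (no duplicates possible)
print(len(stratified))                  # 10
```

Stratified sampling guarantees that each partition is represented in the sample, which simple random sampling does not.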
Dimensionality Reduction:
Purpose:
o Avoid curse of dimensionality
o Reduce amount of time and memory required by data mining algorithms.
o Allow data to be more easily visualized
o May help to eliminate irrelevant features or reduce noise
Techniques of feature subset selection:
✓ Brute-force approach: try all possible feature subsets as input to the data mining algorithm.
✓ Embedded approaches: feature selection occurs naturally as part of the data mining algorithm.
✓ Filter approaches: features are selected before the data mining algorithm is run.
✓ Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes.
Feature Creation
✓ Create new attributes that can capture the important information in a data set much more efficiently
than the original attributes
✓ Three general methodologies:
▪ Feature Extraction
▪ domain-specific
▪ Mapping Data to New Space
▪ Feature Construction
▪ combining features
Euclidean Distance

dist(x, y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 + ... + (xn − yn)^2 )

where n is the number of dimensions (attributes) and xk and yk are, respectively, the kth attributes (components) of data objects x and y.
Example: Distance Matrix
Four points with coordinates:

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Euclidean distance matrix:

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
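The distance matrix can be reproduced with a few lines of Python, assuming the four example points p1(0, 2), p2(2, 0), p3(3, 1), and p4(5, 1):

```python
import math

# The four points of the example (coordinates x, y).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# Build the full Euclidean distance matrix, rounded to 3 decimal places.
names = sorted(points)
matrix = {a: {b: round(math.dist(points[a], points[b]), 3) for b in names}
          for a in names}

print(matrix["p1"]["p2"])  # 2.828  (sqrt(8))
print(matrix["p3"]["p4"])  # 2.0
```

The matrix is symmetric with zeros on the diagonal, as every distance metric requires.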
Minkowski Distance
Minkowski Distance is a generalization of Euclidean distance:

dist(x, y) = ( |x1 − y1|^r + |x2 − y2|^r + ... + |xn − yn|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes), and xk and yk are, respectively, the kth attributes (components) of data objects x and y.
The following are the three most common examples of Minkowski distances.
1) r = 1. City block (Manhattan, taxicab, L1 norm) distance.
A common example of this is the Hamming distance, which is just the number of bits that are different
between two binary vectors
2) r = 2. Euclidean distance
3) r → ∞. "Supremum" (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.
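A small Python sketch of the three cases; the sample vectors are illustrative:

```python
def minkowski(x, y, r):
    """Minkowski distance; r=1 gives city block, r=2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def supremum(x, y):
    """The r -> infinity limit: the maximum difference over any component."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 2), (2, 5)
print(minkowski(x, y, 1))  # 5.0  (|0-2| + |2-5|)
print(minkowski(x, y, 2))  # sqrt(4 + 9) = sqrt(13) ≈ 3.6056
print(supremum(x, y))      # 3
```

As r grows, the largest component difference increasingly dominates the sum, which is why the limit is the supremum distance.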
Simple Matching Coefficient (SMC) and Jaccard Coefficient
For two binary vectors x and y, let f11 be the number of attribute positions where both x and y are 1, f00 where both are 0, f10 where x is 1 and y is 0, and f01 where x is 0 and y is 1. Then:

SMC = number of matches / number of attributes = (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 1-1 matches / number of non-zero attributes = f11 / (f01 + f10 + f11)

Example: The SMC and Jaccard Similarity Coefficients
Calculate SMC and J for the following two binary vectors.
Cosine Similarity
If x and y are two document vectors, then

cos(x, y) = (x · y) / (||x|| ||y||)

where · indicates the vector dot product and ||x|| is the length of vector x.
Example: Cosine Similarity of Two Document Vectors
Extended Jaccard Coefficient (Tanimoto Coefficient):

EJ(x, y) = (x · y) / (||x||^2 + ||y||^2 − x · y)
Correlation The correlation between two data objects that have binary or continuous variables is a
measure of the linear relationship between the attributes of the objects.
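Both measures can be sketched in plain Python; the vectors below are illustrative:

```python
import math

def cosine(x, y):
    """Cosine similarity: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def correlation(x, y):
    """Pearson correlation: covariance / (std(x) * std(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

print(round(cosine((3, 2, 0, 5), (1, 0, 0, 0)), 4))      # 0.4867
print(round(correlation((1, 2, 3), (2, 4, 6)), 4))       # 1.0  (perfect linear)
```

Cosine ignores vector magnitude (useful for documents of different lengths), while correlation also removes each attribute's mean, measuring only the linear relationship.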
Problems:
1) Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative
(nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation,
so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM. Binary, qualitative, ordinal
(b) Brightness as measured by a light meter. Continuous, quantitative, ratio
(c) Brightness as measured by people’s judgments. Discrete, qualitative, ordinal
(d) Angles as measured in degrees between 0◦ and 360◦. Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Discrete, qualitative, ordinal
(f) Height above sea level. Continuous, quantitative, interval/ratio (depends on whether sea level is
regarded as an arbitrary origin)
(g) Number of patients in a hospital. Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web.) Discrete, qualitative, nominal (ISBN
numbers do have order information, though)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent. Discrete,
qualitative, ordinal
(j) Military rank. Discrete, qualitative, ordinal
(k) Distance from the center of campus. Continuous, quantitative, interval/ratio (depends)
(l) Density of a substance in grams per cubic centimeter. Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in
turn, gives you a number that you can use to claim your coat when you leave.) Discrete, qualitative,
nominal.
2) Compute the Hamming distance and the Jaccard similarity between the following two binary vectors
x = 0101010001
y = 0100011000
Solution:
Hamming distance = number of different bits = 3
Jaccard Similarity = number of 1-1 matches / (number of bits − number of 0-0 matches) = 2 / 5 = 0.4
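The arithmetic above can be verified in Python:

```python
x = "0101010001"
y = "0100011000"

# Hamming distance: number of bit positions where the vectors differ.
hamming = sum(a != b for a, b in zip(x, y))

# Jaccard similarity: 1-1 matches / (positions that are not 0-0 matches).
f11 = sum(a == b == "1" for a, b in zip(x, y))
f00 = sum(a == b == "0" for a, b in zip(x, y))
jaccard = f11 / (len(x) - f00)

print(hamming)  # 3
print(jaccard)  # 0.4
```

Note that Jaccard deliberately ignores 0-0 matches, which is why it suits sparse binary data where shared absences carry little information.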
4) For the following vectors, x and y, calculate the indicated similarity or distance measures.
(a) x=: (1, 1, 1, 1), y : (2,2,2,2) cosine, correlation, Euclidean
(b) x=: (0, 1,0, 1), y : (1,0, 1,0) cosine, correlation, Euclidean, Jaccard
(c) x= (0,- 1,0, 1) , y: (1,0,- 1,0) ) cosine, correlation Euclidean
(d) x = (1,1 ,0,1 ,0,1 ) , y : (1,1 ,1 ,0,0,1 ) ) cosine, correlation ,Jaccard
(e) x = (2, -7,0,2,0, -3) , y : ( -1, 1,- 1,0,0, -1) cosine, correlation
23. Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
(b) Dividing the customers of a company according to their profitability.
(c) Computing the total sales of a company.
(d) Sorting a student database based on student identification numbers.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
(f) Predicting the future stock price of a company using historical records.
(g) Monitoring the heart rate of a patient for abnormalities.
(h) Monitoring seismic waves for earthquake activities.
(i) Extracting the frequencies of a sound wave
24. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative
(nominal or ordinal) or quantitative (interval or ratio). Some cases may have more
than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM.
(b) Brightness as measured by a light meter.
(c) Brightness as measured by people’s judgments.
(d) Angles as measured in degrees between 0◦ and 360◦.
(e) Bronze, Silver, and Gold medals as awarded at the Olympics.
(f) Height above sea level.
(g) Number of patients in a hospital.
(h) ISBN numbers for books. (Look up the format on the Web.)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.
(j) Military rank.
(k) Distance from the center of campus.