Data Mining Lecture Notes

The document provides an introduction to data mining, including: defining data mining as the process of discovering patterns in large data sets; the goals of data mining (extracting information from data, transforming it into an understandable structure, and enabling prediction); the stages of the data mining process (data preprocessing, model building, and pattern evaluation); and the disciplines data mining draws on, including machine learning, pattern recognition, and statistics.

24/01/2021

Lecture 1:
Introduction to Data Mining
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
[1] Chapter 1 in Data Mining: Concepts and Techniques (Third Edition), by
Jiawei Han, Micheline Kamber
[2] Chapter 1 in Data Mining: Practical Machine Learning Tools and
Techniques (Third Edition), by Ian H. Witten, Eibe Frank and Mark A. Hall

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data


What is data mining?

 Example 1: Web usage mining


◆ Given: click streams
◆ Problem: prediction of user behaviour
◆ Data: historical records of users’ click streams and outcomes

 Example 2: cow culling


◆ Given: cows described by 700 features
◆ Problem: selection of cows that should be culled
◆ Data: historical records and farmers’ decisions

What is data mining?

 Extracting
◆ implicit,
◆ previously unknown,
◆ potentially useful

information from data


 Needed: programs that detect patterns and
regularities in the data
 Strong patterns → good predictions
◆ Problem 1: most patterns are not interesting
◆ Problem 2: patterns may be inexact (or spurious)
◆ Problem 3: data may be garbled or missing

2
4
24/01/2021

What Is Data Mining?


 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems

What is data mining?

Definitions:
 DM: The practice of examining large databases in
order to generate new information.
 DM: The process of analyzing data from different
perspectives and summarizing it into useful
information - information that can be used to
increase revenue, cut costs, or both.

3
6
24/01/2021

What is data mining?

Data mining is defined as the process of


discovering patterns in data.
 The process must be automatic or (more usually)
semiautomatic.
 The patterns discovered must be meaningful in
that they lead to some advantage, usually an
economic one.
 The data is invariably presented in substantial
quantities.

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

4
8
24/01/2021

Data Mining in Business Intelligence

[Figure: pyramid of increasing potential to support business decisions, with the typical user at each layer]
 Decision making (end user)
 Data presentation: visualization techniques (business analyst)
 Data mining: information discovery (data analyst)
 Data exploration: statistical summary, querying, and reporting (data analyst)
 Data preprocessing/integration, data warehouses (DBA)
 Data sources: paper, files, Web documents, scientific experiments, database systems

Data Mining Goals

 Extract information from a data set


 Transform it into an understandable structure for further
use
 The ultimate goal of data mining is prediction

5
10
24/01/2021

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

11

Knowledge Discovery (KDD) Process

 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process

[Figure: the KDD pipeline]
Databases → Data Cleaning → Data Integration → Data Warehouse → Selection (task-relevant data) → Data Mining → Pattern Evaluation

Example: A Web Mining Framework

 Web mining usually involves


 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored into knowledge-base

13

13

KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

 Data pre-processing: data integration, normalization, feature selection, dimension reduction, …
 Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
 Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization, …

 This is a view from typical machine learning and statistics communities

Which View Do You Prefer?


 Which view do you prefer?
 KDD vs. ML/Stat. vs. Business Intelligence
 Depending on the data, applications, and your focus
 Data Mining vs. Data Exploration
 Business intelligence view
 Warehouse, data cube, reporting but not much
mining
 Business objects vs. data mining tools
 Supply chain example: mining vs. OLAP vs.
presentation tools
 Data presentation vs. data exploration
15

15

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

8
16
24/01/2021

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of]
 Machine learning
 Pattern recognition
 Statistics
 Visualization
 Applications
 Algorithms
 Database technology
 High-performance computing

Why Confluence of Multiple Disciplines?

 Tremendous amount of data


 Algorithms must be scalable to handle big data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social and information networks
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
18
9
18
24/01/2021

Machine learning techniques


 Algorithms for acquiring structural descriptions
from examples
 Structural descriptions represent patterns
explicitly
◆ Can be used to predict outcome in new situation
◆ Can be used to understand and explain how prediction
is derived
(may be even more important)
 Methods originate from artificial intelligence,
statistics, and research on databases

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 19

19

Structural descriptions
 Example: if-then rules

If tear production rate = reduced


then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft

Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lenses
Young           Myope                    No            Reduced                None
Young           Hypermetrope             No            Normal                 Soft
Pre-presbyopic  Hypermetrope             No            Reduced                None
Presbyopic      Myope                    Yes           Normal                 Hard
…               …                        …             …                      …

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 20


10
20
24/01/2021

Can machines really learn?


 Definitions of “learning” from dictionary:
◆ To get knowledge of by study, experience, or being taught
◆ To become aware by information or from observation
  (difficult to measure)
◆ To commit to memory
◆ To be informed of, ascertain; to receive instruction
  (trivial for computers)

 Operational definition:
Things learn when they change their behavior in a way that makes
them perform better in the future. (Does a slipper learn?)

 Does learning imply intention?


21
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

21

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

11
22
24/01/2021

Knowledge Representation Methods

 Tables
 Data cube
 Linear models
 Trees
 Rules
 Instance-based Representation
 Clusters

23

Knowledge Representation Methods

 Decision table for the weather problem:

Outlook Humidity Play


Sunny High No
Sunny Normal Yes
Overcast High Yes
Overcast Normal Yes
Rainy High No
Rainy Normal No

12
24
24/01/2021

Knowledge Representation Methods


 A sample data cube:

[Figure: 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A., Canada, Mexico, sum); each cell holds a sales total, e.g., the total annual sales of TVs in the U.S.A.]

Knowledge Representation Methods


 A linear regression function for the CPU performance data

PRP = 37.06 + 2.47CACH


13
26
24/01/2021

Knowledge Representation Methods


 Regression tree for the CPU data

27

Knowledge Representation Methods

 If-then Rules

If tear production rate = reduced


then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft

14
28
24/01/2021

Knowledge Representation Methods

 Instance-based representation

29

Knowledge Representation Methods

 Clusters

15
30
24/01/2021

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

31

Applications
 The result of learning—or the learning method itself—
is deployed in practical applications
◆ Processing loan applications
◆ Screening images for oil slicks
◆ Electricity supply forecasting
◆ Diagnosis of machine faults
◆ Marketing and sales
◆ Separating crude oil and natural gas
◆ Reducing banding in rotogravure printing
◆ Finding appropriate technicians for telephone faults
◆ Scientific applications: biology, astronomy, chemistry
◆ Automatic selection of TV programs
◆ Monitoring intensive care patients

16
32
24/01/2021

Processing loan applications (American Express)

 Given: questionnaire with


financial and personal information
 Question: should money be lent?
 Simple statistical method covers 90% of cases
 Borderline cases referred to loan officers
 But: 50% of accepted borderline cases defaulted!
 Solution: reject all borderline cases?
◆ No! Borderline cases are most active customers

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 33

33

Enter machine learning

 1000 training examples of borderline cases


 20 attributes:
◆ age
◆ years with current employer
◆ years at current address
◆ years with the bank
◆ other credit cards possessed,…

 Learned rules: correct on 70% of cases


◆ human experts only 50%
 Rules could be used to explain decisions to
customers
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 34
17
34
24/01/2021

Screening images
 Given: radar satellite images of coastal waters
 Problem: detect oil slicks in those images
 Oil slicks appear as dark regions with changing
size and shape
 Not easy: lookalike dark regions can be caused
by weather conditions (e.g. high wind)
 Expensive process requiring highly trained
personnel

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 35

35

Enter machine learning


 Extract dark regions from normalized image
 Attributes:
◆ size of region
◆ shape, area
◆ intensity
◆ sharpness and jaggedness of boundaries
◆ proximity of other regions
◆ info about background
 Constraints:
◆ Few training examples—oil slicks are rare!
◆ Unbalanced data: most dark regions aren’t slicks
◆ Regions from same image form a batch
◆ Requirement: adjustable false-alarm rate

36
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)
18
36
24/01/2021

Load forecasting
 Electricity supply companies
need forecast of future demand
for power
 Forecasts of min/max load for each hour → significant savings
 Given: manually constructed load model that
assumes “normal” climatic conditions
 Problem: adjust for weather conditions
 Static model consists of:
◆ base load for the year
◆ load periodicity over the year
◆ effect of holidays

37
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

37

Enter machine learning


 Prediction corrected using “most similar” days
 Attributes:
◆ temperature
◆ humidity
◆ wind speed
◆ cloud cover readings
◆ plus difference between actual load and predicted load

 Average difference among three “most similar” days


added to static model
 Linear regression coefficients form attribute weights
in similarity function

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 38


19
38
24/01/2021

Diagnosis of machine faults

 Diagnosis: classical domain


of expert systems
 Given: Fourier analysis of vibrations measured
at various points of a device’s mounting
 Question: which fault is present?
 Preventative maintenance of electromechanical
motors and generators
 Information very noisy
 So far: diagnosis by expert/hand-crafted rules

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 39

39

Enter machine learning

 Available: 600 faults with expert’s diagnosis


 ~300 unsatisfactory, rest used for training
 Attributes augmented by intermediate concepts
that embodied causal domain knowledge
 Expert not satisfied with initial rules because they
did not relate to his domain knowledge
 Further background knowledge resulted in more
complex rules that were satisfactory
 Learned rules outperformed hand-crafted ones

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 40

20
40
24/01/2021

Marketing and sales I

 Companies precisely record massive amounts of


marketing and sales data
 Applications:
◆ Customer loyalty:
identifying customers that are likely to defect by
detecting changes in their behavior
(e.g. banks/phone companies)
◆ Special offers:
identifying profitable customers
(e.g. reliable owners of credit cards that need extra
money during the holiday season)

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 41

41

Marketing and sales II

 Market basket analysis


◆ Association techniques find
groups of items that tend to
occur together in a
transaction
(used to analyze checkout data)
 Historical analysis of purchasing patterns
 Identifying prospective customers
◆ Focusing promotional mailouts
(targeted campaigns are cheaper than mass-marketed
ones)

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 42

21
42
24/01/2021

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

43

The weather problem


 Conditions for playing a certain game

Outlook Temperature Humidity Windy Play


Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes
… … … … …

If outlook = sunny and humidity = high then play = no


If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

44
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)
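To make the rule set concrete, here is a minimal sketch in Python of how these classification rules could be applied to one example; the function name classify_play and the dict-based record format are illustrative choices, not part of the original slides.

```python
def classify_play(record):
    """Apply the weather rules above, in order, to one example.

    `record` is a dict with keys: outlook, humidity, windy.
    Returns the predicted value of the 'play' attribute.
    """
    if record["outlook"] == "sunny" and record["humidity"] == "high":
        return "no"
    if record["outlook"] == "rainy" and record["windy"] is True:
        return "no"
    if record["outlook"] == "overcast":
        return "yes"
    if record["humidity"] == "normal":
        return "yes"
    return "yes"  # default rule: "if none of the above then play = yes"


# Example: the first row of the table (Sunny, Hot, High, False) -> "no"
print(classify_play({"outlook": "sunny", "humidity": "high", "windy": False}))
```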
22
44
24/01/2021

Classification vs. association rules

 Classification rule:
predicts value of a given attribute (the classification of an example)
If outlook = sunny and humidity = high
then play = no

 Association rule:
predicts value of arbitrary attribute (or combination)
If temperature = cool then humidity = normal
If humidity = normal and windy = false
then play = yes
If outlook = sunny and play = no
then humidity = high
If windy = false and play = no
then outlook = sunny and humidity = high

45
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

45

Weather data with mixed attributes

 Some attributes have numeric values

Outlook Temperature Humidity Windy Play


Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes
… … … … …

If outlook = sunny and humidity > 83 then play = no


If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 46


23
46
24/01/2021

Summary

 What is Data Mining?


 What kinds of Data can be mined?
 Which Technologies are used?
 Which kinds of applications are targeted?

47

24
31/01/2021

Lecture 2:
Getting to Know Your Data
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
Chapter 2 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber

1
31/01/2021

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets


 Record

 Relational records

 Data matrix, e.g., numerical matrix, crosstabs

 Document data: text documents: term-frequency vector

 Transaction data

 Graph and network

 World Wide Web

 Social or information networks

 Molecular Structures

 Ordered

 Video data: sequence of images TID Items

 Temporal data: time-series 1 Bread, Coke, Milk


 Sequential Data: transaction sequences 2 Beer, Bread
 Genetic sequence data 3 Beer, Coke, Diaper, Milk
 Spatial, image and multimedia:
4 Beer, Bread, Diaper, Milk
 Spatial data: maps
5 Coke, Diaper, Milk

 Image data:

 Video data: 4

2
31/01/2021

Important Characteristics of Structured Data

 Dimensionality
 Curse of dimensionality
 Sparsity

 Only presence counts


 Resolution

 Patterns depend on the scale

 Distribution
 Centrality and dispersion
5

Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples , examples, instances, data points, objects,
tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns ->attributes.

3
31/01/2021

Attributes

 Attribute (or dimensions, features, variables): a data


field, representing a characteristic or feature of a data
object.
 E.g., customer _ID, name, address
 Types:
 Nominal
 Binary
 Numeric: quantitative
 Interval-scaled
 Ratio-scaled

Attribute Types

 Nominal: categories, states, or “names of things”


 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings

4
31/01/2021

Numeric Attribute Types

 Quantity (integer or real-valued)


 Interval
 Measured on a scale of equal-sized units
 Values have order
• E.g., temperature in C˚or F˚, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of values as being an order of magnitude
larger than the unit of measurement (10 K˚ is twice as high
as 5 K˚).
• e.g., temperature in Kelvin, length, counts, monetary
quantities
9

Discrete vs. Continuous Attributes

 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a collection of documents

 Sometimes, represented as integer variables


 Note: Binary attributes are a special case of discrete
attributes
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight

 Practically,real values can only be measured and


represented using a finite number of digits
 Continuous attributes are typically represented as
floating-point variables
10

10

5
31/01/2021

Activities

 Identify attribute types in datasets

11

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

12

12

6
31/01/2021

Basic Statistical Descriptions of Data


 Motivation
 To
better understand the data: central tendency,
variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities
of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
13

13

Measuring the Central Tendency

 Mean (algebraic measure) (sample vs. population):
   $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample)    $\mu = \frac{\sum x}{N}$ (population)
   Note: n is sample size and N is population size.
 Weighted arithmetic mean:
   $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Trimmed mean: chopping extreme values
 Example:
   Suppose we have the following values for salary (in thousands of dollars),
   shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
   Mean = ?

7
31/01/2021

Measuring the Central Tendency (cont.)

 Median:
   Middle value if odd number of values, or average of the middle two values otherwise
   Estimated by interpolation (for grouped data):
     $median \approx L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width$
   where $L_1$ is the lower boundary of the median interval
 Mode
   Value that occurs most frequently in the data
   Unimodal, bimodal, trimodal
   Empirical formula: $mean - mode \approx 3 \times (mean - median)$

Measuring the Central Tendency (cont.)

 Example:
 Suppose we have the following values for salary (in thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

 Median = ?

 Mode = ?

16
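A small sketch, using only Python's standard library, for checking the mean, median and mode of the salary values above; the variable name salaries is illustrative and not part of the original notes.

```python
from statistics import mean, median, multimode  # multimode needs Python 3.8+

# Salary values (in thousands of dollars), in increasing order, from the example
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean   =", mean(salaries))       # arithmetic mean
print("median =", median(salaries))     # average of the two middle values (even n)
print("mode   =", multimode(salaries))  # all most frequent values (may be more than one)
```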

8
31/01/2021

Symmetric vs. Skewed Data

 Median, mean and mode of symmetric, positively skewed and negatively skewed data

[Figure: three distributions showing the relative positions of mean, median and mode for symmetric, positively skewed, and negatively skewed data]

Measuring the Dispersion of Data

 Quartiles, outliers and boxplots


 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and
plot outliers individually
 Outlier: usually, a value higher/lower than 1.5 x IQR beyond the quartiles.
 Variance and standard deviation (σ)
 Variance: (algebraic, scalable computation)

 Standard deviation σ is the square root of variance σ2

18

18

9
31/01/2021

Boxplot Analysis

 Five-number summary of a distribution


 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended
to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually
19

19

Fig. 2.3. Boxplot for the unit price data for items sold at four
branches of AllElectronics during a given time period.

 For branch 1, Median = $80, Q1 = $60, and Q3 = $100. Notice that two
outlying observations for this branch were plotted individually, as their
values of 175 and 202 are more than 1.5 times the IQR here of 40.

[Figure: boxplots per branch, marking Min, Q1, Median, Q3 and Max]

10
31/01/2021

Measuring the Dispersion of Data

 Example:
 Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110.

 Q1 = ?; Q3 = ?

 IQR = ?

 Variance σ² = ?; Standard deviation σ = ?

21
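A possible numerical check of these dispersion measures with NumPy; note that quartile conventions differ between textbooks and libraries, so np.percentile's default interpolation may not match a hand computation exactly. This sketch is illustrative and not part of the original notes.

```python
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, med, q3 = np.percentile(salaries, [25, 50, 75])
iqr = q3 - q1
print("five-number summary:", salaries.min(), q1, med, q3, salaries.max())
print("IQR      =", iqr)
print("variance =", salaries.var())   # population variance (divide by n)
print("std dev  =", salaries.std())   # population standard deviation
```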

Visualization of Data Dispersion: 3-D Boxplots

22

22

11
31/01/2021

Properties of Normal Distribution Curve

 The normal (distribution) curve


 From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation)
 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it

23

23

Graphic Displays of Basic Statistical Descriptions

 Boxplot: graphic display of five-number summary

 Histogram: x-axis are values, y-axis repres. frequencies

 Quantile plot: each value xi is paired with fi indicating that approximately 100·fi % of the data are ≤ xi

 Quantile-quantile (q-q) plot: graphs the quantiles of one univariant


distribution against the corresponding quantiles of another

 Scatter plot: each pair of values is a pair of coordinates and plotted


as points in the plane

24

24

12
31/01/2021

Histogram Analysis

 Histogram: Graph display of tabulated


frequencies, shown as bars 40
 It shows what proportion of cases fall 35
into each of several categories 30
 Differs from a bar chart in that it is the 25
area of the bar that denotes the 20
value, not the height as in bar charts,
15
a crucial distinction when the
10
categories are not of uniform width
5
 The categories are usually specified as
0
non-overlapping intervals of some 10000 30000 50000 70000 90000
variable. The categories (bars) must
be adjacent
25

25

Quantile Plot
 Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
 Plots quantile information
 For data xi sorted in increasing order, fi indicates that
approximately 100·fi % of the data are below or equal to the value xi

26

26

13
31/01/2021

Quantile-Quantile (Q-Q) Plot

 Graphs the quantiles of one univariate distribution against the


corresponding quantiles of another
 View: Is there is a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile. Unit prices of items sold at Branch 1 tend to be
lower than those at Branch 2.

27

27

Scatter plot

 Provides a first look at bivariate data to see clusters of


points, outliers, etc.
 Each pair of values is treated as a pair of coordinates
and plotted as points in the plane

28

28

14
31/01/2021

Example: Histogram and Scatter

29

Positively and Negatively Correlated Data

 The left half fragment is positively correlated

 The right half is negatively correlated

30

30

15
31/01/2021

Uncorrelated Data

31

31

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

32

32

16
31/01/2021

Similarity and Dissimilarity


 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
 Numerical measure of how different two data objects
are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
 Proximity refers to a similarity or dissimilarity
33

33

Data Matrix and Dissimilarity Matrix

 Data matrix
   n data points with p dimensions; two modes (rows are objects, columns are attributes)

   $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

 Dissimilarity matrix
   n data points, but registers only the distance; a triangular matrix; single mode

   $\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$

17
31/01/2021

Proximity Measure for Nominal Attributes

 Can take 2 or more states, e.g., red, yellow, blue, green


(generalization of a binary attribute)

 Method: Simple matching
   m: # of matches, p: total # of variables
   $d(i,j) = \frac{p - m}{p}$

Proximity Measure for Binary Attributes

 A contingency table for binary data (counting attribute values over objects i and j):

                       Object j
                       1        0
   Object i   1        q        r
              0        s        t

 Distance measure for symmetric binary variables:
   $d(i,j) = \frac{r + s}{q + r + s + t}$
 Distance measure for asymmetric binary variables (matches on 0 are ignored):
   $d(i,j) = \frac{r + s}{q + r + s}$
 Jaccard coefficient (similarity measure for asymmetric binary variables):
   $sim_{Jaccard}(i,j) = \frac{q}{q + r + s}$
 Note: the Jaccard coefficient is the same as “coherence”

18
31/01/2021

Dissimilarity between Binary Variables

 Example (Test-1 through Test-4 are asymmetric attributes):

   Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
   Jack   M        Y       N       P        N        N        N
   Mary   F        Y       N       P        N        P        N
   Jim    M        Y       P       N        N        N        N

 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N be 0
 Distances based only on the asymmetric attributes:
   $d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
   $d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
   $d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
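The same contingency-table computation can be sketched in Python; the helper name asym_binary_dissim is illustrative, and Y/P values are encoded as 1 and N as 0, as on the slide.

```python
def asym_binary_dissim(x, y):
    """Dissimilarity (r + s) / (q + r + s) for asymmetric binary vectors x, y.

    q: both 1; r: x=1, y=0; s: x=0, y=1. Matches on 0 (t) are ignored.
    """
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Asymmetric attributes only: fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```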

Standardizing Numeric Data

 Z-score: $z = \frac{x - \mu}{\sigma}$
   x: raw score to be standardized, μ: mean of the population, σ: standard deviation
   the distance between the raw score and the population mean in units of the standard deviation
   negative when the raw score is below the mean, “+” when above
 An alternative way: calculate the mean absolute deviation
   $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
   where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
 Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
 Using the mean absolute deviation is more robust than using the standard deviation
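A short sketch of both standardizations (the classic z-score and the mean-absolute-deviation variant) in Python with NumPy; the function names and the sample array are illustrative, not from the original notes.

```python
import numpy as np

def zscore(x):
    """Classic z-score: (x - mean) / standard deviation."""
    return (x - x.mean()) / x.std()

def zscore_mad(x):
    """z-score using the mean absolute deviation s_f instead of the std."""
    m = x.mean()
    s = np.mean(np.abs(x - m))
    return (x - m) / s

incomes = np.array([30.0, 36.0, 47.0, 50.0, 52.0, 110.0])
print(zscore(incomes))
print(zscore_mad(incomes))  # less affected by the outlier 110
```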

19
31/01/2021

Example:
Data Matrix and Dissimilarity Matrix

 Data matrix:
   point   attribute1   attribute2
   x1      1            2
   x2      3            5
   x3      2            0
   x4      4            5

 Dissimilarity matrix (with Euclidean distance):
         x1     x2     x3     x4
   x1    0
   x2    3.61   0
   x3    2.24   5.1    0
   x4    4.24   1      5.39   0

[Figure: the four points plotted in the plane]

Distance on Numeric Data: Minkowski Distance

 Minkowski distance: a popular distance measure
   $d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
   where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are
   two p-dimensional data objects, and h is the order
   (the distance so defined is also called the L-h norm)
 Properties
   d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
   d(i, j) = d(j, i) (symmetry)
   d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
 A distance that satisfies these properties is a metric

20
31/01/2021

Special Cases of Minkowski Distance

 h = 1: Manhattan (city block, L1 norm) distance
   E.g., the Hamming distance: the number of bits that are different
   between two binary vectors
   $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

 h = 2: Euclidean (L2 norm) distance
   $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

 h → ∞: “supremum” (Lmax norm, L∞ norm) distance
   This is the maximum difference between any component (attribute) of the vectors

Example: Minkowski Distance

 Data:
   point   attribute 1   attribute 2
   x1      1             2
   x2      3             5
   x3      2             0
   x4      4             5

 Dissimilarity matrices:

 Manhattan (L1)
   L1    x1    x2    x3    x4
   x1    0
   x2    5     0
   x3    3     6     0
   x4    6     1     7     0

 Euclidean (L2)
   L2    x1     x2     x3     x4
   x1    0
   x2    3.61   0
   x3    2.24   5.1    0
   x4    4.24   1      5.39   0

 Supremum (L∞)
   L∞    x1    x2    x3    x4
   x1    0
   x2    3     0
   x3    2     5     0
   x4    3     1     5     0
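These three matrices can be reproduced with SciPy, whose pdist function supports the Manhattan ("cityblock"), Euclidean and Chebyshev (supremum) metrics; a brief illustrative sketch follows.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[1, 2],   # x1
                   [3, 5],   # x2
                   [2, 0],   # x3
                   [4, 5]])  # x4

for name in ("cityblock", "euclidean", "chebyshev"):  # L1, L2, L-infinity
    d = squareform(pdist(points, metric=name))        # full symmetric matrix
    print(name)
    print(np.round(d, 2))
```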

21
31/01/2021

Ordinal Variables

 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
   replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
   map the range of each variable onto [0, 1] by replacing
   the i-th object in the f-th variable by
     $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
   compute the dissimilarity using methods for interval-scaled variables

Example: Ordinal variables

 Dissimilarity matrix:

44

22
31/01/2021

Attributes of Mixed Type

 A database may contain all attribute types
   Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects:
   $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
 f is binary or nominal:
   $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
 f is numeric: use the normalized distance
 f is ordinal:
   compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
   treat $z_{if}$ as numeric

Example: Mixed Type

 Dissimilarity matrix:
 For test-3: ; for the data:

46

23
31/01/2021

Cosine Similarity

 A document can be represented by thousands of attributes, each recording


the frequency of a particular word (such as keywords) or phrase in the
document.

 Other vector objects: gene features in micro-arrays, …


 Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

47

47

Example: Cosine Similarity

 cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
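The same computation as a short NumPy sketch (illustrative, not part of the original notes):

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

# dot product divided by the product of the vector lengths
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))  # ≈ 0.94
```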

24
31/01/2021

Activities

 Do exercises in Text [1]: 2.2.a-f, 2.4.a-b, 2.6.a-c, 2.8.a

49

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

50

50

25
31/01/2021

Summary

 Data attribute types: nominal, binary, ordinal, interval-


scaled, ratio-scaled
 Many types of data sets, e.g., numerical, text, graph,
Web, image.
 Gain insight into the data by:
 Basic statistical data description: central tendency,
dispersion, graphical displays
 Datavisualization: map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing
 Many methods have been developed but still an active
area of research

51

52

January 31, 2021 Data Mining: Concepts and Techniques

52

26
Introduction to Data Mining

Lecture 2 – Activities

1. Identify attribute types: Numeric, Nominal, or Binary, in the following data tables.

Outlook Temperature Humidity Windy Play

Sunny Hot High FALSE Yes


Overcast Mild Normal TRUE Yes
Rainy Cool High FALSE No

Outlook Temperature Humidity Windy Play

Sunny 85.0 70.0 FALSE Yes


Overcast 80.0 80.0 TRUE Yes
Rainy 75.0 85.0 FALSE No

2. Exercises in Text [1]: 2.2.a-f, 2.4.a-b, 2.6.a-c, 2.8.a


2.2. Suppose that the data for analysis includes the attribute age. The age values for the data tuples
are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f ) Show a boxplot of the data.

2.4. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the
following result.

Age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2

Age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

(a) Calculate the mean, median and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.

2.6. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.

1
Introduction to Data Mining

(b) Compute the Manhattan distance between the two objects.


(c) Compute the Minkowski distance between the two objects, using h = 3.

2.8. It is important to define or select similarity measures in data analysis. However, there is no
commonly accepted subjective similarity measure. Results can vary depending on the similarity
measures used. Nonetheless, seemingly different similarity measures may be equivalent after
some transformation.
Suppose we have the following two-dimensional data set:

(a) Consider the data as two-dimensional data points. Given a new data point, x = (1.4, 1.6) as a
query, rank the database points based on similarity with the query using Euclidean distance,
Manhattan distance, supremum distance, and cosine similarity.

2
Introduction to Data Mining

Answers:
2.2.
a. Mean = 809/27 = 30; Median = 25
b. Mode = 25 and 35
c. The midrange (average of the largest and smallest values in the data set) of the data is: (70 +
13)/2 = 41.5
d. Q1 = 20; Q3 = 35
e. 13, 20, 25, 35, 52, and outlier 70
f.

2.4.
a. For age: Mean = 46.44; Median = 51; σ = 12.85
For fat: Mean = 28.78; Median = 30.7; σ = 8.99
b.

3
Introduction to Data Mining

2.6.
a. Euclidean distance = 6.7;
b. Manhattan distance = 11;
c. Minkowski distance = 6.1534 (h=3).

2.8.
a.

The following rankings of the data points based on similarity:


Euclidean distance: x1, x4, x3, x5, x2
Manhattan distance: x1, x4, x3, x5, x2
Supremum distance: x1, x4, x3, x5, x2
Cosine similarity: x1, x3, x4, x2, x5

4
Homework - Session 2
Nguyen Tien Duc ITITIU18029

2.2
a.

b.

c.

d.

e.
f. Box plot:

2.4
a.
b. Box plots:
2.6
a. Euclidean distance:

b. Manhattan distance:

c. Minkowski distance with h = 3:

2.8

Formula 1:
Use for Euclidean (h = 2), Manhattan (h = 1), Supremum (h = infinity)

Formula 2 for cosine similarity:

a. The data points are ranked from higher to lower similarity to the query.
Euclidean distance data points rank:

Data points Distances

x1 0.14

x4 0.22

x3 0.28

x5 0.61

x2 0.67

Manhattan distance data points rank:

Data points Distances

x1 0.19

x4 0.3

x3 0.4

x5 0.7

x2 0.89
Supremum distance data points rank:

Data points Distances

x1 0.1

x4 0.19

x3 0.2

x2-x5 0.6

Cosine similarity data points rank:

Data points Similarities

x1 0.99999

x3 0.99996

x4 0.99903

x2 0.99575

x5 0.96536
2/28/2021

Lecture 3:
Data Preprocessing
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
Chapter 3 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber

Data Preprocessing
 Data Preprocessing: An Overview
2
 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
3

Data Quality: Why Preprocess the Data?

 Measures for data quality: A multidimensional view


 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling,

 Timeliness: timely update?
 Believability: how trustable the data are correct?
 Interpretability: how easily the data can be
understood?

4
2 / 2 8
/ 2 0 2
1

Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies 3

 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
5

Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary 6

6
2 / 2 8
/ 2 0 2
1

Data Cleaning

 Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
4
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 e.g., Occupation = “ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary = “−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age = “42”, Birthday = “03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
7

Incomplete (Missing) Data

 Data is not always available


 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time
of entry
 not register history or changes of the data
 Missing data may need to be inferred

8
2 / 2 8
/ 2 0 2
1

How to Handle Missing Data?

 Ignore the tuple: usually done when class label is missing


(when doing classification)—not effective when the % of
5
missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant: e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based, such as a Bayesian formula or
decision tree (see the sketch below for the mean-based strategies)
9
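A minimal pandas sketch of two of the automatic strategies above (filling with the attribute mean, and with the per-class attribute mean); the column names and the tiny DataFrame are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [30.0, None, 50.0, None],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the attribute mean of samples in the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```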

Noisy Data

 Noise: random error or variance in a measured variable


 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data

10

10
2 / 2 8
/ 2 0 2
1

How to Handle Noisy Data?

 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc. (see the sketch after this list)
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal
with possible outliers)
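A small sketch of equal-frequency binning followed by smoothing by bin means; the sorted price list is an invented example, not taken from the notes.

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # data already sorted

n_bins = 3
bins = np.array_split(prices, n_bins)   # equal-frequency (equal-depth) partition

# smoothing by bin means: every value is replaced by the mean of its bin
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```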

11

11

Data Cleaning as a Process


 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
 Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheels)
12

12
2 / 2 8
/ 2 0 2
1

Data Preprocessing
 Data Preprocessing: An Overview
7
 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
13

13

Data Integration

 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id  B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales, e.g.,
metric vs. British units

14

14
2 / 2 8
/ 2 0 2
1

Handling Redundancy in Data Integration

 Redundant data occur often when integration of


multiple databases
8

 Objectidentification: The same attribute or object


may have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
 Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
15

15

Correlation Analysis (Nominal Data)

 Χ² (chi-square) test:
   $\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
 The larger the Χ2 value, the more likely the variables are related
 The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated

 Both are causally linked to the third variable: population

16

16
2 / 2 8
/ 2 0 2
1

Chi-Square Calculation: An Example

                             Play chess   Not play chess   Sum (row)
 Like science fiction        250 (90)     200 (360)        450
 Not like science fiction    50 (210)     1000 (840)       1050
 Sum (col.)                  300          1200             1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated based on the data distribution in the two categories):
   $\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
 It shows that like_science_fiction and play_chess are correlated in the group
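The same Χ² value can be checked with SciPy; a brief sketch (correction=False gives the uncorrected statistic computed on the slide):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # like science fiction
                     [50, 1000]])   # not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof)   # ~507.93 with 1 degree of freedom
print(expected)              # [[90, 360], [210, 840]]
```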

Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson’s product-moment coefficient):
   $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$

   where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective
   means of A and B, σA and σB are the respective standard deviations of A and B,
   and Σ(aibi) is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do).
   The higher the value, the stronger the correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated
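A quick numerical check of the correlation coefficient with NumPy; the two sample arrays are illustrative (they reuse the stock values from the covariance example a few slides below).

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

r = np.corrcoef(a, b)[0, 1]           # Pearson's r
print(round(r, 3))

# Equivalent to the formula above (population standard deviations):
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) * a.std() * b.std())
print(round(r_manual, 3))
```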

Correlation (viewed as linear relationship)

 Correlation measures the linear relationship between objects
 To compute correlation, we standardize the data objects A and B,
   and then take their dot product:

   $a'_k = (a_k - mean(A)) / std(A)$
   $b'_k = (b_k - mean(B)) / std(B)$
   $correlation(A, B) = A' \cdot B'$

Covariance (Numeric Data)

 Covariance is similar to correlation:
   $Cov(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$
   Correlation coefficient: $r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$
   where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected
   values of A and B, and σA and σB are the respective standard deviations of A and B
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
   expected values
 Negative covariance: If CovA,B < 0, then if A is larger than its expected value, B is likely
   to be smaller than its expected value
 Independence: if A and B are independent, CovA,B = 0, but the converse is not true:
   Some pairs of random variables may have a covariance of 0 but are not independent.
   Only under some additional assumptions (e.g., the data follow multivariate normal
   distributions) does a covariance of 0 imply independence

Covariance: An Example

 The computation can be simplified as:
   $Cov(A,B) = E[A \cdot B] - \bar{A}\,\bar{B}$

 Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10),
(4, 11), (6, 14).

 Question: If the stocks are affected by the same industry trends, will their prices rise or
fall together?

 E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4

 E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6

 Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4

 Thus, A and B rise together since Cov(A, B) > 0.
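The same computation with NumPy; note that np.cov uses an n−1 denominator by default, so bias=True is needed to reproduce the population covariance used above. The sketch is illustrative.

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

cov_simplified = np.mean(A * B) - A.mean() * B.mean()   # E[A·B] - E[A]E[B]
cov_numpy = np.cov(A, B, bias=True)[0, 1]               # population covariance

print(cov_simplified, cov_numpy)  # both 4.0
```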

Data Preprocessing
 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
22

22
2 / 2 8
/ 2 0 2
1

Data Reduction Strategies


 Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)
 Regression and Log-Linear Models
 Histograms, clustering, sampling
 Data cube aggregation
 Data compression
23

23

Data Reduction 1: Dimensionality Reduction

 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

24

24
2 / 2 8
/ 2 0 2
1

Mapping Data to a New Space


 Fourier transform
 Wavelet transform
1 3

Two Sine Waves Two Sine Waves + Noise Frequency

25

25

What Is Wavelet Transform?

 Decomposes a signal into


different frequency subbands
 Applicable to n-dimensional
signals
 Data are transformed to
preserve relative distance
between objects at different
levels of resolution
 Allow natural clusters to
become more distinguishable
 Used for image compression
26

26
2 / 2 8
/ 2 0 2
1

Wavelet Transformation
Haar2 Daubechie4
 Discrete wavelet transform (DWT) for linear signal processing, multi-
resolution analysis 1 4

 Compressed approximation: store only a small fraction of the


strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
 Method:
 Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two set of data of length L/2
 Applies two functions recursively, until reaches the desired length

27

27

Wavelet Decomposition

 Wavelets: A math tool for space-efficient hierarchical


decomposition of functions
 S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to Ŝ = [2¾, −1¼, ½, 0, 0, −1, −1, 0]

 Compression: many small detail coefficients can be replaced by


0’s, and only the significant coefficients are retained

28

28
2 / 2 8
/ 2 0 2
1

Why Wavelet Transform?

 Use hat-shape filters


 Emphasize region where points cluster
1 5

 Suppress weaker information in their boundaries


 Effective removal of outliers
 Insensitive to noise, insensitive to input order
 Multi-resolution
 Detect arbitrary shaped clusters at different scales
 Efficient
 Complexity O(N)
 Only applicable to low dimensional data

29

29

Principal Component Analysis (PCA)


 Find a projection that captures the largest amount of variation in
data
 The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the eigenvectors
of the covariance matrix, and these eigenvectors define the new
space
x2

x1 30

30
2 / 2 8
/ 2 0 2
1

Principal Component Analysis (Steps)


 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range

 Compute k orthonormal (unit) vectors, i.e., principal components


 Each input data (vector) is a linear combination of the k principal
component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
 Works for numeric data only
31

31
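A compact sketch of these steps with scikit-learn; the random data matrix is purely illustrative, and in practice X would hold the normalized numeric attributes of the data set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 data vectors, n = 5 dimensions

X_norm = StandardScaler().fit_transform(X)    # step 1: normalize input data

pca = PCA(n_components=2)                     # keep k = 2 strongest components
X_reduced = pca.fit_transform(X_norm)

print(X_reduced.shape)                        # (100, 2): reduced representation
print(pca.explained_variance_ratio_)          # "significance" of each component
```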

Attribute Subset Selection

 Another way to reduce dimensionality of data


 Redundant attributes
 Duplicate much or all of the information contained in one or more
other attributes
 E.g., purchase price of a product and the amount of sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data mining task at
hand
 E.g., students' ID is often irrelevant to the task of predicting
students' GPA

32

32
2 / 2 8
/ 2 0 2
1

Heuristic Search in Attribute Selection

 There are 2d possible attribute combinations of d


attributes
 Typical heuristic attribute selection methods: 1 7

 Best single attribute under the attribute independence


assumption: choose by significance tests
 Best step-wise feature selection:
 The best single-attribute is picked first
 Then next best attribute condition to the first, ...

 Step-wise attribute elimination:


 Repeatedly eliminate the worst attribute

 Best combined attribute selection and elimination


 Optimal branch and bound:
 Use attribute elimination and backtracking

33

33

Attribute Creation (Feature Generation)

 Create new attributes (features) that can capture the important


information in a data set more effectively than the original ones
 Three general methodologies
 Attribute extraction
 Domain-specific

 Mapping data to new space (see: data reduction)


 E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)

 Attribute construction
 Combining features (see: discriminative frequent patterns in Chapter on “Advanced Classification”)
 Data discretization

34

34
2 / 2 8
/ 2 0 2
1

Data Reduction 2: Numerosity Reduction


 Reduce data volume by choosing alternative, smaller
forms of data representation
 Parametric methods (e.g., regression) 1 8

 Assume the data fits some model, estimate model


parameters, store only the parameters, and discard
the data (except possible outliers)
 Ex.:Log-linear models—obtain value at a point in m-
D space as the product on appropriate marginal
subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …

35

35

Parametric Data Reduction: Regression and


Log-Linear Models

 Linear regression
 Data modeled to fit a straight line
 Often uses the least-square method to fit the line
 Multiple regression
 Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
 Log-linear model
 Approximates discrete multidimensional probability
distributions

36

36
2 / 2 8
/ 2 0 2
1

Regression Analysis

 Regression analysis: A collective name for techniques for the modeling and analysis
of numerical data consisting of values of a dependent variable (also called response
variable or measurement) and of one or more independent variables (aka. explanatory
variables or predictors)
 The parameters are estimated so as to give a "best fit" of the data
 Most commonly the best fit is evaluated by using the least squares method, but other
criteria have also been used
 Used for prediction (including forecasting of time-series data), inference, hypothesis
testing, and modeling of causal relationships

[Figure: a fitted regression line y = x + 1 through the data points, showing the observed value Y1 and the fitted value Y1' at X1]
criteria have also been used

37

37

Regress Analysis and Log-Linear Models


 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
 Using the least squares criterion to the known values of Y1, Y2, …, X1,
X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset of
dimensional combinations
 Useful for dimensionality reduction and data smoothing
38

38
2 / 2 8
/ 2 0 2
1

Histogram Analysis

 Divide data into buckets and store the average (or sum) for each bucket
 Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth): each bucket holds roughly the same number of values

[Figure: histogram over the value range 10,000 to 90,000, showing the frequency of each bucket]

39

39

Clustering

 Partition data set into clusters based on similarity, and


store cluster representation (e.g., centroid and
diameter) only
 Can be very effective if data is clustered but not if data
is “smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms
 Cluster analysis will be studied in depth in Chapter 10
40

40

Sampling
 Sampling: obtaining a small sample s to represent the
whole data set N
 Allow a mining algorithm to run in complexity that is

potentially sub-linear to the size of the data


 Key principle: Choose a representative subset of the data
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified
sampling:
 Note: Sampling may not reduce database I/Os (page at a
time)

41

41

Types of Sampling

 Simple random sampling


 There is an equal probability of selecting any particular
item
 Sampling without replacement
 Once an object is selected, it is removed from the
population
 Sampling with replacement
 A selected object is not removed from the population
 Stratified sampling:
 Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
 Used in conjunction with skewed data
42

42
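To make the sampling schemes above concrete, here is a small illustrative Python sketch using only the standard library (not from the lecture; the toy records are invented):

import random
from collections import defaultdict

def srs_without_replacement(data, s):
    # Simple random sample: each item can be selected at most once (SRSWOR)
    return random.sample(data, s)

def srs_with_replacement(data, s):
    # A selected item stays in the population and may be drawn again (SRSWR)
    return [random.choice(data) for _ in range(s)]

def stratified_sample(records, key, fraction):
    # Partition the data by an attribute and draw approximately the same
    # percentage from every stratum; useful with skewed data
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample

# Example: a skewed class distribution that plain SRS could easily misrepresent
records = [("a", "rare")] * 5 + [("b", "common")] * 95
print(stratified_sample(records, key=lambda r: r[1], fraction=0.1))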

Sampling: With or without Replacement

[Figure: items drawn from the raw data with and without replacement]

43

43

Sampling: Cluster or Stratified Sampling

[Figure: raw data on the left; a cluster/stratified sample drawn from each partition on the right]

44

44

Data Cube Aggregation


 The lowest level of a data cube (base cuboid)
 The aggregated data for an individual entity of interest

 E.g., a customer in a phone calling data warehouse



 Multiple levels of aggregation in data cubes


 Further reduce the size of data to deal with

 Reference appropriate levels


 Use the smallest representation which is enough to solve the
task
 Queries regarding aggregated information should be answered
using data cube, when possible

45

45

Data Reduction 3: Data Compression


 String compression
 There are extensive theories and well-tuned algorithms
 Typically lossless, but only limited manipulation is possible
without expansion
 Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
 Time sequence is not audio
 Typically short and vary slowly with time
 Dimensionality and numerosity reduction may also be
considered as forms of data compression

46

46

Data Compression

[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression reconstructs only an approximation of the original data]

47

47

Data Preprocessing
 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
48

48

Data Transformation
 A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values

 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: Concept hierarchy climbing

49

49

Normalization
 Min-max normalization: to [new_minA, new_maxA]
      v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
 Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0].
      Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation):
      v' = (v − μA) / σA
 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
 Normalization by decimal scaling:
      v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

50

50
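A minimal Python sketch of the three normalization methods (an illustration, not code from the slides); the income example reproduces the numbers above:

import math

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # Min-max normalization to [new_min, new_max]
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Z-score normalization: how many standard deviations v lies from the mean
    return (v - mean) / std

def decimal_scaling(values):
    # Divide by 10^j; j is approximated here by the digit count of the largest |v|
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]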

Discretization

 Three types of attributes


 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic
rank
 Numeric—real numbers, e.g., integer or real numbers
 Discretization: Divide the range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification

51

51

Data Discretization Methods


 Typical methods: All the methods can be applied recursively
 Binning
 Top-down split, unsupervised

 Histogram analysis
 Top-down split, unsupervised

 Clustering analysis (unsupervised, top-down split or bottom-up


merge)
 Decision-tree analysis (supervised, top-down split)

 Correlation (e.g., 2) analysis (unsupervised, bottom-up merge)

52

52

Simple Discretization: Binning

 Equal-width (distance) partitioning


 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the

width of intervals will be: W = (B –A)/N.


 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well

 Equal-depth (frequency) partitioning


 Divides the range into N intervals, each containing approximately
same number of samples
 Good data scaling
 Managing categorical attributes can be tricky

53

53

Binning Methods for Data Smoothing


 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

54

54
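A short sketch (assuming the sorted price list above) of equal-frequency partitioning followed by smoothing by bin means and by bin boundaries:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

def equal_frequency_bins(values, n_bins):
    # Split sorted values into n_bins bins of equal count (assumes an exact division)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value by the (rounded) mean of its bin
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the closer of the two bin boundaries
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]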

Discretization by Classification &


Correlation Analysis
 Classification (e.g., decision tree analysis)

 Supervised: Given class labels, e.g., cancerous vs. benign

 Using entropy to determine split point (discretization point)

 Top-down, recursive split

 Details to be covered in Chapter “Classification”

 Correlation analysis (e.g., Chi-merge: χ2-based discretization)

 Supervised: use class information

 Bottom-up merge: find the best neighboring intervals (those having


similar distributions of classes, i.e., low χ2 values) to merge

 Merge performed recursively, until a predefined stopping condition

55

55

Concept Hierarchy Generation

 Concept hierarchy organizes concepts (i.e., attribute values)


hierarchically and is usually associated with each dimension in a data
warehouse
 Concept hierarchies facilitate drilling and rolling in data warehouses
to view data in multiple granularity
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
 Concept hierarchy can be automatically formed for both numeric and
nominal data—For numeric data, use discretization methods shown

56

56

Concept Hierarchy Generation


for Nominal Data

 Specification of a partial/total ordering of attributes explicitly


at the schema level by users or experts

 street < city < state < country


 Specification of a hierarchy for a set of values by explicit data
grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
57

57

Automatic Concept Hierarchy Generation

 Some hierarchies can be automatically generated


based on the analysis of the number of distinct values
per attribute in the data set
 The attribute with the most distinct values is
placed at the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_ state 365 distinct values

city 3567 distinct values

street 674,339 distinct values

58

58

Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
59

59

Summary

 Data quality: accuracy, completeness, consistency,


timeliness, believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entityidentification problem; Remove redundancies;
Detect inconsistencies
 Data reduction
 Dimensionality reduction; Numerosity reduction; Data
compression
 Data transformation and data discretization
 Normalization; Concept hierarchy generation

60

60

3/10/2021

Lecture 4:
Data mining knowledge
representation
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
Chapter 3 in Data Mining: Practical Machine Learning Tools and Techniques
(Third Edition), by Ian H.Witten, Eibe Frank and Eibe Frank

1
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters
3

Output: representing structural patterns

 Many different ways of representing patterns

 Decision trees, rules, instance-based, …

 Also called “knowledge” representation

 Representation determines inference method

 Understanding the output is the key to understanding the underlying

learning methods

 Different types of output for different learning problems (e.g.

classification, regression, …)
4

2
3/10/2021

Tables
► Simplest way of representing output:
► Use the same format as input!
► Decision table for the weather problem:
Outlook Humidity Play
Sunny High No
Sunny Normal Yes
Overcast High Yes
Overcast Normal Yes
Rainy High No
Rainy Normal No

► Main problem: selecting the right attributes

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

3
3/10/2021

Linear models
 Another simple representation
 Regression model
 Inputs (attribute values) and output are all numeric
 Output is the sum of weighted attribute values
 The trick is to find good values for the weights

A linear regression function for the CPU


performance data

PRP = 37.06 + 2.47CACH


8

4
3/10/2021

Linear models for classification

 Binary classification
 Line separates the two classes
 Decision boundary - defines where the decision changes
from one class value to the other
 Prediction is made by plugging in observed values of
the attributes into the expression
 Predict one class if output ≥ 0, and the other class if
output < 0
 Boundary becomes a high-dimensional plane
(hyperplane) when there are multiple attributes

Separating setosas from versicolors

2.0 – 0.5PETAL-LENGTH – 0.8PETAL-WIDTH = 0

10
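A tiny sketch of using this decision boundary for prediction; the coefficients are the ones shown above, while the two test flowers and the assumption that the positive side corresponds to Iris-setosa are illustrative:

def predict(petal_length, petal_width):
    # Decision boundary from the slide: 2.0 - 0.5*PETAL-LENGTH - 0.8*PETAL-WIDTH = 0
    score = 2.0 - 0.5 * petal_length - 0.8 * petal_width
    return "Iris-setosa" if score >= 0 else "Iris-versicolor"

print(predict(1.4, 0.2))   # score > 0: setosa side of the boundary
print(predict(4.5, 1.5))   # score < 0: versicolor side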

5
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

11

Trees

 “Divide-and-conquer” approach produces tree


 Nodes involve testing a particular attribute
 Usually, attribute value is compared to constant
 Other possibilities:
 Comparing values of two attributes
 Using a function of one or more attributes
 Leaves assign classification, set of classifications, or
probability distribution to instances
 Unknown instance is routed down the tree

12

6
3/10/2021

Nominal and numeric attributes

Nominal:

number of children usually equal to number of values


 attribute won’t get tested more than once
●Other possibility: division into two subsets

Numeric:

test whether value is greater or less than constant


 attribute may get tested several times
●Other possibility: three-way split (or multi-way split)

● Integer: less than, equal to, greater than


● Real: below, within, above

13

Missing values

Does absence of value have some significance?


●Yes  “missing” is a separate value

●No  “missing” must be treated in a special way

Solution A: assign instance to most popular branch


Solution B: split instance into pieces

●Pieces receive weight according to fraction of training instances that


go down each branch
●Classifications from leaf nodes are combined using the weights that

have percolated to them

14

7
3/10/2021

Trees for numeric prediction

 Regression: the process of computing an expression that


predicts a numeric quantity
 Regression tree: “decision tree” where each leaf predicts a
numeric quantity
 Predicted value is average value of training instances that reach the
leaf
 Model tree: “regression tree” with linear regression models
at the leaf nodes
 Linear patches approximate continuous function

15

Linear regression for the CPU data

PRP =
- 56.1
+ 0.049 MYCT
+ 0.015 MMIN
+ 0.006 MMAX
+ 0.630 CACH
- 0.270 CHMIN
+ 1.46 CHMAX

16

8
3/10/2021

Regression tree for the CPU data

17

Model tree for the CPU data

18

9
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

19

Classification rules
●Popular alternative to decision trees
●Antecedent (pre-condition): a series of tests (just like

the tests at the nodes of a decision tree)


●Tests are usually logically ANDed together (but may

also be general logical expressions)


●Consequent (conclusion): classes, set of classes, or

probability distribution assigned by rule


●Individual rules are often logically ORed together

 Conflicts arise if different conclusions apply

20

10
3/10/2021

From trees to rules

●Easy: converting a tree into a set of


rules
 One rule for each leaf:
●Antecedent contains a condition for every node
on the path from the root to the leaf
●Consequent is class assigned by the leaf

●Produces rules that are unambiguous


Doesn’t matter in which order they are
executed
21

From rules to trees

● More difficult: transforming a rule set into a tree


Tree cannot easily express disjunction between rules

● Example: rules which test different attributes


If a and b then x
If c and d then x

●Symmetry needs to be broken


●Corresponding tree contains identical subtrees

( “replicated subtree problem”)

22

11
3/10/2021

A tree for a simple disjunction

23

The exclusive-or problem

If x = 1 and y = 0
then class = a
If x = 0 and y = 1
then class = a
If x = 0 and y = 0
then class = b
If x = 1 and y = 1
then class = b

24

12
3/10/2021

A tree with a replicated subtree

If x = 1 and y = 1
then class = a
If z = 1 and w = 1
then class = a
Otherwise class = b

25

“Nuggets” of knowledge

●Are rules independent pieces of knowledge? (It


seems easy to add a rule to an existing rule base.)
●Problem: ignores how rules are executed

●Two ways of executing a rule set:

Ordered set of rules (“decision list”)


● Order is important for interpretation
Unordered set of rules
●Rules may overlap and lead to different conclusions for the same

instance

26

13
3/10/2021

Interpreting rules

● What if two or more rules conflict?


Give no conclusion at all?
Go with rule that is most popular on training data?

…

● What if no rule applies to a test instance?


Give no conclusion at all?
Go with class that is most frequent in training data?

…

27

Special case: boolean class


●Assumption: if instance does not belong to class
“yes”, it belongs to class “no”
●Trick: only learn rules for class “yes” and use

default rule for “no”


If x = 1 and y = 1 then class = a
If z = 1 and w = 1 then class = a
Otherwise class = b

●Order of rules is not important. No conflicts!


●Rule can be written in disjunctive normal form

28

14
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

29

Association rules

 Association rules…
 … can predict any attribute and combinations of attributes
 … are not intended to be used together as a set
 Problem: immense number of possible associations
 Output needs to be restricted to show only the most predictive
associations  only those with high support and high confidence

30

15
3/10/2021

Support and confidence of a rule


 Support: number of instances predicted correctly
 Confidence: number of correct predictions, as
proportion of all instances that rule applies to
 Example: 4 cool days with normal humidity

If temperature = cool then humidity = normal

 Support= 4, confidence = 100%


 Normally: minimum support and confidence pre-
specified (e.g. 58 rules with support ≥ 2 and
confidence ≥ 95% for weather data)

31
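A sketch of counting support and confidence for a single candidate rule; the four toy instances below are invented (in the weather data the rule above has support 4 and confidence 100%):

def support_confidence(instances, antecedent, consequent):
    # antecedent and consequent are dicts of attribute: value tests
    matches_ante = [x for x in instances
                    if all(x[a] == v for a, v in antecedent.items())]
    matches_both = [x for x in matches_ante
                    if all(x[a] == v for a, v in consequent.items())]
    support = len(matches_both)
    confidence = len(matches_both) / len(matches_ante) if matches_ante else 0.0
    return support, confidence

data = [
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "hot",  "humidity": "high"},
    {"temperature": "cool", "humidity": "normal"},
]
print(support_confidence(data,
                         {"temperature": "cool"},
                         {"humidity": "normal"}))   # (3, 1.0)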

Interpreting association rules

 Interpretation is not obvious:


If windy = false and play = no then outlook = sunny
and humidity = high

is not the same as


If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high

 It means that the following also holds:


If humidity = high and windy = false and play = no
then outlook = sunny

32

16
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters 33

33

Rules with exceptions

 Idea:
allow rules to have exceptions
 Example: rule for iris data
If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor

 New instance:
Sepal Sepal Petal Petal Type
length width length width
5.1 3.5 2.6 0.2 Iris-setosa

 Modified rule:
If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor
EXCEPT if petal-width < 1.0 then Iris-setosa

34

17
3/10/2021

A more complex example

 Exceptions to exceptions to exceptions …

default: Iris-setosa
except if petal-length ≥ 2.45 and petal-length < 5.355
and petal-width < 1.75
then Iris-versicolor
except if petal-length ≥ 4.95 and petal-width < 1.55
then Iris-virginica
else if sepal-length < 4.95 and sepal-width ≥ 2.45
then Iris-virginica
else if petal-length ≥ 3.35
then Iris-virginica
except if petal-length < 4.85 and sepal-length < 5.95
then Iris-versicolor

35

Advantages of using exceptions

 Rules can be updated incrementally


 Easy to incorporate new data
 Easy to incorporate domain knowledge

 People often think in terms of exceptions


 Each conclusion can be considered just in the
context of rules and exceptions that lead to it
 Locality property is important for understanding large
rule sets
 “Normal” rule sets don’t offer this advantage

36

18
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

37

More on exceptions

 Default...except if...then...
is logically equivalent to
if...then...else
(where the else specifies what the default did)
 But:exceptions offer a psychological advantage
 Assumption: defaults and tests early on apply
more widely than exceptions further down
 Exceptions reflect special cases

38

19
3/10/2021

Rules involving relations

●So far: all rules involved comparing an attribute-value


to a constant (e.g. temperature < 45)
●These rules are called “propositional” because they

have the same expressive power as propositional logic


●What if problem involves relationships between

attributes
Can’t be expressed with propositional rules
More expressive representation required

39

The shapes problem

● Target concept: standing up


● Shaded: standing
● Unshaded: lying

40

20
3/10/2021

A propositional solution
Width Height Sides Class
2 4 4 Standing
3 6 4 Standing
4 3 4 Lying
7 8 3 Standing
7 6 3 Lying
2 9 4 Standing
9 1 4 Lying
10 2 3 Lying

If width ≥ 3.5 and height < 7.0
then lying
If height ≥ 3.5 then standing

41

A relational solution

●Comparing attributes with each other


If width > height then lying
If height > width then standing

●Generalizes better to new data


●Standard relations: =, <, >

●But: learning relational rules is costly

●Simple solution: add extra attributes

(e.g. a binary attribute is width < height?)

42

21
3/10/2021

Rules with variables


Using variables and multiple relations:

If height_and_width_of(x,h,w) and h > w


then standing(x)

The top of a tower of blocks is standing:


If height_and_width_of(x,h,w) and h > w


and is_top_of(y,x)
then standing(x)

The whole tower is standing:


If is_top_of(x,z) and
height_and_width_of(z,h,w) and h > w
and is_rest_of(x,y)and standing(y)
then standing(x)
If empty(x) then standing(x)

Recursive definition!

43

Inductive logic programming

●Recursive definition can be seen as logic program


●Techniques for learning logic programs stem from

the area of “inductive logic programming” (ILP)


●But: recursive definitions are hard to learn

Also: few practical problems require recursion


Thus: many ILP techniques are restricted to non-recursive
definitions to make learning easier

44

22
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

45

Instance-based representation

 Simplest form of learning: rote learning


 Training instances are searched for instance that most closely
resembles new instance
 The instances themselves represent the knowledge
 Also called instance-based learning
 Similarity function defines what’s “learned”
 Instance-based learning is lazy learning
 Methods: nearest-neighbor, k-nearest-neighbor, …

46

23
3/10/2021

The distance function

 Simplest case: one numeric attribute


 Distance is the difference between the two attribute values involved
(or a function thereof)
 Several numeric attributes: normally, Euclidean distance is
used and attributes are normalized
 Nominal attributes: distance is set to 1 if values are
different, 0 if they are equal
 Are all attributes equally important?
 Weighting the attributes might be necessary

47
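A sketch of the mixed-attribute distance just described (range-normalized differences for numeric attributes, 0/1 for nominal ones); the attribute names and the ranges passed in are illustrative assumptions:

import math

def distance(x, y, numeric_ranges):
    # numeric_ranges maps a numeric attribute to its (min, max) for normalization;
    # attributes not listed there are treated as nominal (0 if equal, else 1)
    total = 0.0
    for attr in x:
        if attr in numeric_ranges:
            lo, hi = numeric_ranges[attr]
            d = (x[attr] - y[attr]) / (hi - lo)
        else:
            d = 0.0 if x[attr] == y[attr] else 1.0
        total += d * d
    return math.sqrt(total)

a = {"temperature": 64, "humidity": 65, "outlook": "sunny"}
b = {"temperature": 72, "humidity": 95, "outlook": "rainy"}
print(distance(a, b, {"temperature": (64, 85), "humidity": (65, 96)}))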

Learning prototypes

 Only those instances involved in a decision need to be stored


 Noisy instances should be filtered out
 Idea: only use prototypical examples

48

24
3/10/2021

Rectangular generalizations

 Nearest-neighbor rule is used outside rectangles


 Rectangles are rules! (But they can be more
conservative than “normal” rules.)
 Nested rectangles are rules with exceptions

49

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

50

25
3/10/2021

Representing clusters I

[Figure: a simple 2-D representation of disjoint clusters, and a Venn diagram of overlapping clusters]

51

Representing clusters II

Probabilistic Dendrogram
assignment
1 2 3

a 0.4 0.1 0.5


b 0.1 0.8 0.1
c 0.3 0.3 0.4
d 0.1 0.1 0.8
e 0.4 0.2 0.4
f 0.1 0.4 0.5
g 0.7 0.2 0.1
h 0.5 0.4 0.1

NB: dendron is the Greek


word for tree

52

26
3/10/2021

Summary

27
Introduction to Data mining

Lecture 4 – Activities
Preprocessing and Knowledge Representation

1. Given the dataset:

https://drive.google.com/drive/folders/1kkXcdni6SN-2Thp6j2YiV1rMXflSqVkO

userId movieId rating s1 s2 s3 s4 s5 s6 s7 s8


205229 108979 5 1 1 3 4 2 2 5 5
205229 6947 4 1 1 3 4 4 2 5 4
205229 117444 4 1 4 4 2 2 2 4 2
205229 150548 4 2 2 4 2 4 2 4 1
205229 136542 5 1 1 5 1 1 2 5 2
117112 77455 3.5 1 2 2 2 4 4 4 4
144726 1303 4 1 1 ? 4 3 2 5 3
144726 103306 3.5 1 ? 3 2 1 2 4 2
144726 2060 4.5 1 1 4 2 1 1 5 2
144726 135534 3.5 2 1 5 1 1 1 5 2
144726 128542 3.5 1 1 4 3 1 2 5 2
200400 26939 4 1 2 4 2 2 2 4 1
200400 40491 3.5 1 2 4 2 2 2 4 1
125112 104337 3.5 ? ? ? ? ? ? ? ?
125112 162082 4 2 ? 3 5 5 ? 4 4
125112 96966 4.5 ? ? ? ? ? ? ? ?
125112 165551 4.5 1 3 4 3 3 2 5 4
125112 89759 5 ? ? ? ? ? ? ? ?
113031 5046 3.5 1 1 3 1 1 2 4 1
113031 116855 4 1 1 4 1 1 5 5 1

Propose an application using this dataset, e.g., supply chain management, and do the following tasks to
preprocess the data:

1. Handle missing values

2. Find correlated attributes (applying Correlation Analysis to create a correlation matrix), and important
attributes.

1
Introduction to Data mining

2. Describe how to preprocess the following web log data in order to mine web access sequences, and
represent the preprocessed data.

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0

uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0

uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0

uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0

ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:09 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713

uplherc.upl.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0

slppp6.intermind.net - - [01/Aug/1995:00:00:10 -0400] "GET /history/skylab/skylab.html HTTP/1.0" 200 1687

piweba4y.prodigy.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853

slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET /history/skylab/skylab-small.gif HTTP/1.0" 200 9202

slppp6.intermind.net - - [01/Aug/1995:00:00:12 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635

ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:12 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 200 1173

slppp6.intermind.net - - [01/Aug/1995:00:00:13 -0400] "GET /history/apollo/images/apollo-logo.gif HTTP/1.0" 200 3047

uplherc.upl.com - - [01/Aug/1995:00:00:14 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0

References:

- Web log format:

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET


/apache_pb.gif HTTP/1.0" 200 2326

A "-" in a field indicates missing data.

- 127.0.0.1 is the IP address of the client (remote host) which made the request to the server.
- user-identifier is the RFC 1413 identity of the client.
- frank is the userid of the person requesting the document.
- [10/Oct/2000:13:55:36 -0700] is the date, time, and time zone that the request was received,
by default in strftime format %d/%b/%Y:%H:%M:%S %z.
- "GET /apache_pb.gif HTTP/1.0" is the request line from the client. The
method GET, /apache_pb.gif the resource requested, and HTTP/1.0 the HTTP protocol.
- 200 is the HTTP status code returned to the client. 2xx is a successful response, 3xx a
redirection, 4xx a client error, and 5xx a server error.
- 2326 is the size of the object returned to the client, measured in bytes.

- A preprocessed web log data


(http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data)

2
Introduction to Data mining

3. Describe how to represent the following data in a vector space model (document-term matrix).

Analysis models of technical and economic data of mining enterprises based on big data
1 analysis
2 Spatial and Spatio-temporal Data Mining
3 Hair data model: A new data model for Spatio-Temporal data mining
4 A Data Stream Mining System
DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster
5 nodes for spatial data mining algorithms
6 Big data gathering and mining pipelines for CRM using open-source
7 The Research on Safety Monitoring System of Coal Mine Based on Spatial Data Mining
CAKE – Classifying, Associating and Knowledge DiscovEry - An Approach for Distributed Data
8 Mining (DDM) Using PArallel Data Mining Agents (PADMAs)
9 Digital construction of coal mine big data for different platforms based on life cycle
10 Privacy-Preserving Big Data Stream Mining: Opportunities, Challenges, Directions
11 Analysis Methods of Workflow Execution Data Based on Data Mining
12 Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases
13 Comparison of Tools for Data Mining and Retrieval in High Volume Data Stream
14 Adaptive Differentially Private Data Release for Data Sharing and Data Mining
15 Domain Driven Data Mining (D3M)
Notice of Retraction<br>The research of building production-oriented data mart for mine
16 enterprises based on data mining
17 Study on land use of changping district with spatial data mining method
Linked Open Data mining for democratization of big dataData Mining Library for Big Data
18 Processing Platforms: A Case Study-Sparkling Water Platform
19 Towards Data-Oriented Hospital Services: Data Mining-Based Hospital Management
20 Developing an Integrated Time-Series Data Mining Environment for Medical Data Mining

3
Introduction to Data mining

Answers:

1.

userId movieId rating s1 s2 s3 s4 s5 s6 s7 s8


205229 108979 5 1 1 3 4 2 2 5 5
205229 6947 4 1 1 3 4 4 2 5 4
205229 117444 4 1 4 4 2 2 2 4 2
205229 150548 4 2 2 4 2 4 2 4 1
205229 136542 5 1 1 5 1 1 2 5 2
117112 77455 3.5 1 2 2 2 4 4 4 4
144726 1303 4 1 1 3? 4 3 2 5 3
144726 103306 3.5 1 1? 3 2 1 2 4 2
144726 2060 4.5 1 1 4 2 1 1 5 2
144726 135534 3.5 2 1 5 1 1 1 5 2
144726 128542 3.5 1 1 4 3 1 2 5 2
200400 26939 4 1 2 4 2 2 2 4 1
200400 40491 3.5 1 2 4 2 2 2 4 1
125112 104337 3.5 1? 2? 4? 2? 1? 2? 4? 1?
125112 162082 4 2 1? 3 5 5 2? 4 4
125112 96966 4.5 1? 2? 4? 2? 1? 1? 5? 4?
125112 165551 4.5 1 3 4 3 3 2 5 4
125112 89759 5 1? 1? 5? 1? 1? 2? 5? 2?
113031 5046 3.5 1 1 3 1 1 2 4 1
113031 116855 4 1 1 4 1 1 5 5 1

Using Population standard deviation

rating s1 s2 s3 s4 s5 s6 s7 s8
rating 1
s1 -0.17436 1
s2 -0.06547 -0.1131 1
0.137016
s3 0.344 0.0608 1
-0.07058 -0.54144
s4 0.059 0.1346 1
s5 -0.07835 0.420013 0.167015 -0.54634 0.670483 1
-0.00699 0.170926
s6 -0.172 -0.2048 -0.33 -0.13 1
-
s7 0.567964 -0.183 -0.38095 0.360588 0.061467 -0.278 0.12438 1
-
-0.02187 -0.01941 -0.45844 0.668603 0.474526 0.07907 0.361009
s8 0.3814 1

4
Introduction to Data mining

μ(rating) = 4.05, σ(rating) = 0.522
μ(s1) = 1.15, σ(s1) = 0.357
μ(s2) = 1.55, σ(s2) = 0.8
μ(s3) = 3.75, σ(s3) = 0.77
μ(s4) = 2.3, σ(s4) = 1.144
μ(s5) = 2.05, σ(s5) = 1.28
μ(s6) = 2.1, σ(s6) = 0.88
μ(s7) = 4.55, σ(s7) = 0.5
μ(s8) = 2.4, σ(s8) = 1.2806

Using Sample standard deviation

rating s1 s2 s3 s4 s5 s6 s7 s8
rating 1
s1 -0.15 1
-0.11311
s2 -0.072 1
s3 0.326467 0.130165 0.057761 1
s4 0.055651 0.127848 -0.06705 -0.51437 1
s5 -0.07444 0.399013 0.158665 -0.51902 0.636959 1
s6 -0.1638 -0.19457 -0.00664 -0.31375 -0.1214 0.16238 1
- -
s7 0.539566 -0.17381 -0.3619 0.342559 0.058394 0.26407 0.11816 1
-
s8 0.362375 -0.02078 -0.01844 -0.43552 0.635173 0.4508 0.07512 0.342959 1

μ(rating) = 4.05, s(rating) = 0.535576
μ(s1) = 1.15, s(s1) = 0.366348
μ(s2) = 1.55, s(s2) = 0.825578
μ(s3) = 3.75, s(s3) = 0.786398
μ(s4) = 2.3, s(s4) = 1.174286
μ(s5) = 2.05, s(s5) = 1.316894
μ(s6) = 2.1, s(s6) = 0.91191
μ(s7) = 4.55, s(s7) = 0.5
μ(s8) = 2.4, s(s8) = 1.313893

5
Introduction to Data mining

2. Generate WAS

Hint: create a database warehouse consisting of three-dimension tables and one fact table as shown
below:

1 session = 1 hour

Ref: See Lab4.

3.

VSM:

Set a list M of documents, M = null;

Set a list T of vocabularies, T = null;

For each line L

6
Introduction to Data mining

Split L into terms using spaces

Set a list Tm of vocabularies for each doc, Tm = null

For each ti in L

If ti not in T {add ti into T, and count ti}

Else count ti in T;

If ti not in Tm {add ti into Tm, and count ti for this Tm}

Else count ti in Tm;

Add Tm into M

7
3/17/2021

Lecture 5:
Evaluating what’s been
learned
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References: Chapter 5 in Data Mining: Practical Machine Learning Tools and


Techniques (Third Edition), by Ian H.Witten, Eibe Frank and Eibe Frank
1

Evaluating what’s been learned

 Issues: training and testing


 Predicting performance: confidence limits
 Holdout, cross-validation, bootstrap
 Comparing schemes: the t-test
 Predicting probabilities: loss functions
 Cost-sensitive measures
 Evaluating numeric prediction
 The Minimum Description Length principle

1
3/17/2021

Evaluation: the key to success

 How predictive is the model we learned?


 Error on the training data is not a good indicator of
performance on future data
 Otherwise 1-NN would be the optimum classifier!
 Simple solution that can be used if lots of (labeled)
data is available:
 Split data into training and test set
 However: (labeled) data is usually limited
 More sophisticated techniques need to be used

Issues in evaluation

 Statistical reliability of estimated differences in


performance ( significance tests)
 Choice of performance measure:
 Number of correct classifications
 Accuracy of probability estimates
 Error in numeric predictions
 Costs assigned to different types of errors
 Many practical applications involve costs

2
3/17/2021

Training and testing I

 Natural performance measure for classification


problems: error rate
 Success: instance’s class is predicted correctly
 Error: instance’s class is predicted incorrectly
 Error rate: proportion of errors made over the whole
set of instances
 Resubstitution error: error rate obtained from
training data

Training and testing II

 Test set: independent instances that have played no


part in formation of classifier
 Assumption: both training data and test data are
representative samples of the underlying problem
 Test and training data may differ in nature
 Example: classifiers built using customer data from two
different towns A and B
 To estimate performance of classifier from town A in completely new
town, test it on data from B

3
3/17/2021

Making the most of the data

 Once evaluation is complete, all the data can be used


to build the final classifier
 Generally, the larger the training data the better the
classifier
 The larger the test data the more accurate the error
estimate
 Holdout procedure: method of splitting original data
into training and test set
 Dilemma: ideally both training set and test set should be
large!

Predicting performance

 Assume the estimated error rate is 25%. How close is


this to the true error rate?
 Depends on the amount of test data
 Prediction is just like tossing a (biased!) coin
 “Head” is a “success”, “tail” is an “error”
 In statistics, a succession of independent events like
this is called a Bernoulli process
 Statistical theory provides us with confidence intervals for
the true underlying proportion

4
3/17/2021

Confidence intervals

 We can say: p lies within a certain specified interval


with a certain specified confidence
 Example: S=750 successes in N=1000 trials
 Estimated success rate: 75%
 How close is this to true success rate p?
 Answer: with 80% confidence p in [73.2,76.7]
 Another example: S=75 and N=100
 Estimated success rate: 75%
 With 80% confidence p in [69.1,80.1]

Mean and variance

 Mean and variance for a Bernoulli trial:


p, p (1–p)
 Expected success rate f=S/N
 Mean and variance for f : p, p (1–p)/N
 For large enough N, f follows a Normal
distribution
 c% confidence interval [−z ≤ X ≤ z] for random
variable with 0 mean is given by:
      Pr[−z ≤ X ≤ z] = c
 With a symmetric distribution:
      Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]

10

10

5
3/17/2021

Confidence limits
 Confidence limits for the normal distribution with 0 mean and a variance of 1:

      Pr[X ≥ z]    z
      0.1%         3.09
      0.5%         2.58
      1%           2.33
      5%           1.65
      10%          1.28
      20%          0.84
      40%          0.25

 Thus:
      Pr[−1.65 ≤ X ≤ 1.65] = 90%

 To use this we have to reduce our random variable f to


have 0 mean and unit variance

11

11

Transforming f

 Transformed value for f :
      (f − p) / √( p (1 − p) / N )
(i.e. subtract the mean and divide by the standard deviation)
 Resulting equation:
      Pr[ −z ≤ (f − p) / √( p (1 − p) / N ) ≤ z ] = c
 Solving for p :
      p = ( f + z²/2N ± z √( f/N − f²/N + z²/4N² ) ) / ( 1 + z²/N )

12

12

6
3/17/2021

Examples

 f = 75%, N = 1000, c = 80% (so that z = 1.28):


𝑝 ∈ 0.732,0.767

 f = 75%, N = 100, c = 80% (so that z = 1.28):


𝑝 ∈ 0.691,0.801

 Note that normal distribution assumption is only valid for


large N (i.e. N > 100)
 f = 75%, N = 10, c = 80% (so that z = 1.28):
𝑝 ∈ 0.549,0.881

(should be taken with a grain of salt)

13

13
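The two examples can be reproduced with a small sketch that solves the resulting equation for p (the usual quadratic solution); this is an illustration, not code from the lecture:

import math

def confidence_interval(f, n, z):
    # Solve Pr[-z <= (f - p)/sqrt(p(1-p)/N) <= z] = c for p
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# z = 1.28 corresponds to c = 80% confidence
print(confidence_interval(0.75, 1000, 1.28))   # about (0.732, 0.767)
print(confidence_interval(0.75, 100, 1.28))    # about (0.691, 0.801)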

Holdout estimation

 What to do if the amount of data is limited?


 The holdout method reserves a certain amount for
testing and uses the remainder for training
 Usually: one third for testing, the rest for training
 Problem: the samples might not be representative
 Example: class might be missing in the test data
 Advanced version uses stratification
 Ensures that each class is represented with approximately
equal proportions in both subsets

14

14

7
3/17/2021

Repeated holdout method

 Holdout estimate can be made more reliable by


repeating the process with different subsamples
 In each iteration, a certain proportion is randomly selected for
training (possibly with stratification)
 The error rates on the different iterations are averaged to yield
an overall error rate
 This is called the repeated holdout method
 Still not optimum: the different test sets overlap
 Can we prevent overlapping?

15

15

Cross-validation

 Cross-validation avoids overlapping test sets


 First step: split data into k subsets of equal size
 Second step: use each subset in turn for testing, the
remainder for training
 Called k-fold cross-validation
 Often the subsets are stratified before the cross-
validation is performed
 The error estimates are averaged to yield an
overall error estimate

16

16
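A stripped-down sketch of k-fold cross-validation (shuffled but not stratified); the train and test callables and the toy majority-class learner are placeholders for whatever scheme is being evaluated:

import random

def k_fold_cv(instances, k, train, test):
    # Shuffle once, split into k folds, use each fold as the test set in turn
    data = list(instances)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    error_rates = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_set)
        errors = sum(1 for x in test_set if test(model, x) != x["class"])
        error_rates.append(errors / len(test_set))
    # Average the k estimates to yield the overall error estimate
    return sum(error_rates) / k

def train_majority(train_set):
    # Trivial "learner": always predict the most frequent class
    labels = [x["class"] for x in train_set]
    return max(set(labels), key=labels.count)

data = [{"class": "yes"}] * 6 + [{"class": "no"}] * 4
print(k_fold_cv(data, 5, train=train_majority, test=lambda model, x: model))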

8
3/17/2021

More on cross-validation

 Standard method for evaluation: stratified ten-fold


cross-validation
 Why ten?
 Extensive experiments have shown that this is the best choice
to get an accurate estimate
 There is also some theoretical evidence for this
 Stratification reduces the estimate’s variance
 Even better: repeated stratified cross-validation
 E.g. ten-fold cross-validation is repeated ten times and results
are averaged (reduces the variance)

17

17

Leave-One-Out cross-validation

 Leave-One-Out:
a particular form of cross-validation:
 Set number of folds to number of training instances
 I.e., for n training instances, build classifier n times
 Makes best use of the data
 Involves no random subsampling
 Very computationally expensive
 (exception: NN)

18

18

9
3/17/2021

Leave-One-Out-CV and stratification

 Disadvantage of Leave-One-Out-CV:
stratification is not possible
 It guarantees a non-stratified sample because
there is only one instance in the test set!

19

19

The bootstrap

 CV uses sampling without replacement


 The same instance, once selected, cannot be selected
again for a particular training/test set
 The bootstrap uses sampling with replacement to
form the training set
 Sample a dataset of n instances n times with replacement
to form a new dataset of n instances
 Use this data as the training set
 Use the instances from the original
dataset that don’t occur in the new
training set for testing
20

20

10
3/17/2021

The 0.632 bootstrap

 Also called the 0.632 bootstrap


 A particular instance has a probability of 1 – 1/n of not being picked
 Thus its probability of ending up in the test
data is:
      (1 − 1/n)^n ≈ e^(−1) ≈ 0.368

 This means the training data will contain


approximately 63.2% of the instances

21

21

Estimating error with the bootstrap

 The error estimate on the test data will be very


pessimistic
 Trained on just ~63% of the instances
 Therefore, combine it with the resubstitution
error:
𝑒𝑟𝑟 = 0.632 × 𝑒test instances + 0.368 × 𝑒training_instances

 The resubstitution error gets less weight than


the error on the test data
 Repeat process several times with different
replacement samples; average the results

22

22
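A sketch of the 0.632 bootstrap (an illustration; the majority-class model and toy data are invented, and instances are assumed to be distinguishable so the left-out test set can be formed):

import random

def bootstrap_estimate(instances, train, error_rate, repeats=10):
    n = len(instances)
    estimates = []
    for _ in range(repeats):
        # Sample n instances WITH replacement for training (~63.2% unique instances)
        train_set = [random.choice(instances) for _ in range(n)]
        test_set = [x for x in instances if x not in train_set]
        if not test_set:          # extremely unlikely, but keep the sketch safe
            continue
        model = train(train_set)
        e_test = error_rate(model, test_set)
        e_train = error_rate(model, train_set)   # resubstitution error
        estimates.append(0.632 * e_test + 0.368 * e_train)
    return sum(estimates) / len(estimates)

def train(ts):
    labels = [x["class"] for x in ts]
    return max(set(labels), key=labels.count)

def error_rate(model, xs):
    return sum(1 for x in xs if x["class"] != model) / len(xs)

data = [{"class": "yes", "id": i} for i in range(60)] + \
       [{"class": "no", "id": i} for i in range(60, 100)]
print(bootstrap_estimate(data, train, error_rate))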

11
3/17/2021

More on the bootstrap

 Probably the best way of estimating


performance for very small datasets
 However, it has some problems
 Consider the random dataset from above
 A perfect memorizer will achieve
0% resubstitution error and
~50% error on test data
 Bootstrap estimate for this classifier:
𝑒𝑟𝑟 = 0.632 × 50% + 0.368 × 0% = 31.6%

 True expected error: 50%


23

23

Comparing data mining schemes

 Frequent question: which of two learning schemes


performs better?
 Note: this is domain dependent!
 Obvious way: compare 10-fold CV estimates
 Generally sufficient in applications (we don't lose if
the chosen method is not truly better)
 However, what about machine learning research?
 Need to show convincingly that a particular

method works better

24

24

12
3/17/2021

Comparing schemes II
 Want to show that scheme A is better than scheme B in
a particular domain
 For a given amount of training data
 On average, across all possible training sets
 Let's assume we have an infinite amount of data from
the domain:
 Sample infinitely many dataset of specified size
 Obtain cross-validation estimate on each dataset for
each scheme
 Checkif mean accuracy for scheme A is better than
mean accuracy for scheme B

25

25

Predicting probabilities

 Performance measure so far: success rate


 Also called 0-1 loss function:
      Σ_i loss_i, where loss_i = 0 if prediction is correct and 1 if prediction is incorrect

 Most classifiers produces class probabilities


 Depending on the application, we might want to
check the accuracy of the probability estimates
 0-1 loss is not the right thing to use in those cases

26

26

13
3/17/2021

Quadratic loss function


 p1 … pk are probability estimates for an instance

 c is the index of the instance’s actual class

 a1 … ak = 0, except for ac which is 1

 Quadratic loss is: Σ_j (p_j − a_j)² = 1 − 2p_c + Σ_j p_j²

 Want to minimize

 Can show that this is minimized when pj = pj*, the


true probabilities

27

27

Informational loss function

 The informational loss function is –log(pc),


where c is the index of the instance’s actual class
 Number of bits required to communicate the
actual class
 Let p1* … pk* be the true class probabilities
 Then the expected value for the loss function is:
      −p1* log2 p1 − ... − pk* log2 pk

 Justification: minimized when pj = pj*


 Difficulty: zero-frequency problem

28

28

14
3/17/2021

Discussion

 Which loss function to choose?


 Both encourage honesty
 Quadratic loss function takes into account all class
probability estimates for an instance
 Informational loss focuses only on the probability
estimate for the actual class
 Quadratic loss is bounded by 1 + Σ_j p_j² :
it can never exceed 2
 Informational loss can be infinite
 Informational loss is related to MDL principle
[later]

29

29

Counting the cost


 In practice, different types of classification
errors often incur different costs
 Examples:
 Loan decisions
 Oil-slick detection
 Fault diagnosis
 Promotional mailing

30

30

15
3/17/2021

Counting the cost

 The confusion matrix:


Predicted class
Yes No
Actual class Yes True positive False negative
No False positive True negative

There are many other types of cost!


 E.g.: cost of collecting training data

31

31

Aside: the kappa statistic


 Two confusion matrices for a 3-class problem:
actual predictor (left) vs. random predictor (right)

 Number of successes: sum of entries in diagonal (D)


 Kappa statistic:
      κ = (D_observed − D_random) / (D_perfect − D_random)
measures relative improvement over random predictor

32

32

16
3/17/2021

Classification with costs

 Two cost matrices:

 Success rate is replaced by average cost per


prediction
 Costis given by appropriate entry in the cost
matrix

33

33

Cost-sensitive classification

 Can take costs into account when making predictions


 Basic idea: only predict high-cost class when very
confident about prediction
 Given: predicted class probabilities
 Normally we just predict the most likely class
 Here, we should make the prediction that minimizes
the expected cost
 Expected cost: dot product of vector of class
probabilities and appropriate column in cost matrix
 Choose column (class) that minimizes expected cost

34

34
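A sketch of choosing the prediction that minimizes expected cost (the dot product of the class-probability vector with a cost-matrix column); the probabilities and the cost matrix below are made up:

def min_expected_cost_class(probabilities, cost_matrix, classes):
    # cost_matrix[i][j]: cost of predicting classes[j] when the truth is classes[i]
    best_class, best_cost = None, float("inf")
    for j, predicted in enumerate(classes):
        # Expected cost = dot product of class probabilities and the j-th column
        expected = sum(probabilities[i] * cost_matrix[i][j] for i in range(len(classes)))
        if expected < best_cost:
            best_class, best_cost = predicted, expected
    return best_class, best_cost

classes = ["yes", "no"]
costs = [[0, 1],     # true "yes": cost 0 if predicted yes, 1 if predicted no
         [10, 0]]    # true "no": a false positive is ten times more expensive
print(min_expected_cost_class([0.8, 0.2], costs, classes))
# ('no', 0.8): the cheaper prediction, even though "yes" is the more likely class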

17
3/17/2021

Cost-sensitive learning

 So far we haven't taken costs into account at


training time
 Most learning schemes do not perform cost-
sensitive learning
 They generate the same classifier no matter what costs
are assigned to the different classes
 Example: standard decision tree learner
 Simple methods for cost-sensitive learning:
 Resampling of instances according to costs
 Weighting of instances according to costs
 Some schemes can take costs into account by
varying a parameter, e.g. naïve Bayes
35

35

Lift charts
 In practice, costs are rarely known
 Decisions are usually made by comparing possible
scenarios
 Example: promotional mailout to 1,000,000
households
 Mail to all; 0.1% respond (1000)
 Data mining tool identifies subset of 100,000 most
promising, 0.4% of these respond (400)
40% of responses for 10% of cost may pay off
 Identify subset of 400,000 most promising, 0.2% respond
(800)
 A lift chart allows a visual comparison

36

36

18
3/17/2021

Generating a lift chart

 Sort instances according to predicted probability


of being positive:
Predicted probability Actual class
1 0.95 Yes
2 0.93 Yes
3 0.93 No
4 0.88 Yes
… … …

 x axis is sample size


y axis is number of true positives
37

37
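A sketch of computing lift-chart points from (predicted probability, actual class) pairs, following the sorting procedure above; the toy predictions extend the table's first rows with an invented fifth entry:

def lift_chart_points(predictions):
    # predictions: list of (probability_of_yes, actual_class) pairs
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    points, true_positives = [(0, 0)], 0
    for i, (_, actual) in enumerate(ranked, start=1):
        if actual == "yes":
            true_positives += 1
        points.append((i, true_positives))   # (sample size, number of true positives)
    return points

preds = [(0.95, "yes"), (0.93, "yes"), (0.93, "no"), (0.88, "yes"), (0.80, "no")]
print(lift_chart_points(preds))
# [(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 3)]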

A hypothetical lift chart

40% of responses 80% of responses


for 10% of cost for 40% of cost

38

38

19
3/17/2021

ROC curves

 ROC curves are similar to lift charts


 Stands for “receiver operating characteristic”
 Used in signal detection to show tradeoff between
hit rate and false alarm rate over noisy channel
 Differences to lift chart:
 y axis shows percentage of true positives in sample
rather than absolute number
 x axis shows percentage of false positives in
sample rather than sample size

39

39

A sample ROC curve

 Jagged curve—one set of test data


 Smooth curve—use cross-validation

40

40

20
3/17/2021

Cross-validation and ROC curves

 Simple method of getting a ROC curve using


cross-validation:
 Collect probabilities for instances in test folds
 Sort instances according to probabilities
 This method is implemented in WEKA
 However, this is just one possibility
 Another possibility is to generate an ROC curve for
each fold and average them

41

41

ROC curves for two schemes

 For a small, focused sample, use method A


 For a larger one, use method B
 In between, choose between A and B with appropriate probabilities
42

42

21
3/17/2021

The convex hull

 Given two learning schemes we can achieve any


point on the convex hull!
 TP and FP rates for scheme 1: t1 and f1
 TP and FP rates for scheme 2: t2 and f2
 If scheme 1 is used to predict 100 × q % of the
cases and scheme 2 for the rest, then
 TP rate for combined scheme:
q × t1 + (1-q) × t2
 FP rate for combined scheme:
q × f1+(1-q) × f2

43

43

More measures...
 Percentage of retrieved documents that are relevant:
precision=TP/(TP+FP)
 Percentage of relevant documents that are returned:
recall =TP/(TP+FN)
 Precision/recall curves have hyperbolic shape
 Summary measures: average precision at 20%, 50% and 80% recall
(three-point average recall)
 F-measure=(2 × recall × precision)/(recall+precision)
 sensitivity × specificity = (TP / (TP + FN)) × (TN / (FP + TN))
 Area under the ROC curve (AUC):
probability that randomly chosen positive instance is ranked above
randomly chosen negative one

44

44

22
3/17/2021

Summary of some measures

      Plot                     Domain                  Axes          Explanation
      Lift chart               Marketing               TP            TP
                                                       Subset size   (TP+FP)/(TP+FP+TN+FN)
      ROC curve                Communications          TP rate       TP/(TP+FN)
                                                       FP rate       FP/(FP+TN)
      Recall-precision curve   Information retrieval   Recall        TP/(TP+FN)
                                                       Precision     TP/(TP+FP)

45

45

Evaluating numeric prediction

 Same strategies: independent test set, cross-


validation, significance tests, etc.
 Difference: error measures
 Actual target values: a1 a2 …an
 Predicted target values: p1 p2 … pn
 Most popular measure: mean-squared error
      ( (p1 − a1)² + ... + (pn − an)² ) / n

 Easy to manipulate mathematically

46

46

23
3/17/2021

Other measures
The root mean-squared error :
      √( ( (p1 − a1)² + ... + (pn − an)² ) / n )
●The mean absolute error is less sensitive to outliers
than the mean-squared error:
      ( |p1 − a1| + ... + |pn − an| ) / n

●Sometimes relative error values are more appropriate


(e.g. 10% for an error of 50 when predicting 500)

47

47

Improvement on the mean


●How much does the scheme improve on
simply predicting the average?

● The relative squared error is:
      ( (p1 − a1)² + ... + (pn − an)² ) / ( (a1 − ā)² + ... + (an − ā)² )
● The relative absolute error is:
      ( |p1 − a1| + ... + |pn − an| ) / ( |a1 − ā| + ... + |an − ā| )

48

48
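A sketch that computes the numeric-prediction error measures above for parallel lists of actual and predicted values (the four toy values are invented):

import math

def error_measures(actual, predicted):
    n = len(actual)
    mean_a = sum(actual) / n
    sq = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    ab = [abs(p - a) for p, a in zip(predicted, actual)]
    return {
        "mean-squared error": sum(sq) / n,
        "root mean-squared error": math.sqrt(sum(sq) / n),
        "mean absolute error": sum(ab) / n,
        "relative squared error": sum(sq) / sum((a - mean_a) ** 2 for a in actual),
        "relative absolute error": sum(ab) / sum(abs(a - mean_a) for a in actual),
    }

actual = [500, 300, 700, 400]
predicted = [450, 350, 640, 420]
print(error_measures(actual, predicted))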

24
3/17/2021

Correlation coefficient
 Measures the statistical correlation between the
predicted values and the actual values

 Scale independent, between –1 and +1


 Good performance leads to large values!

49

49

Which measure?
 Best to look at all of them
 Often it doesn’t matter
 Example:

A B C D
Root mean-squared error 67.8 91.7 63.3 57.4
Mean absolute error 41.3 38.5 33.4 29.2
Root rel squared error 42.2% 57.2% 39.4% 35.8%
Relative absolute error 43.1% 40.1% 34.8% 30.4%
Correlation coefficient 0.88 0.88 0.89 0.91

●D best
●C second-best
●A, B arguable

50

50

25
3/17/2021

The MDL principle

 MDL stands for minimum description length


 The description length is defined as:
space required to describe a theory
+
space required to describe the theory’s mistakes
 In our case the theory is the classifier and the mistakes
are the errors on the training data
 Aim: we seek a classifier with minimal DL
 MDL principle is a model selection criterion

51

51

Model selection criteria


 Model selection criteria attempt to find a good
compromise between:
 The complexity of a model
 Its prediction accuracy on the training data
 Reasoning: a good model is a simple model that
achieves high accuracy on the given data
 Also known as Occam’s Razor :
the best theory is the smallest one
that describes all the facts

William of Ockham, born in the village of Ockham in Surrey


(England) about 1285, was the most influential philosopher of
the 14th century and a controversial theologian.

52

52

26
3/17/2021

Elegance vs. errors

 Theory 1: very simple, elegant theory that explains


the data almost perfectly
 Theory 2: significantly more complex theory that
reproduces the data without mistakes
 Theory 1 is probably preferable

53

53

MDL and compression

 MDL principle relates to data compression:


 The best theory is the one that compresses the data the
most
 I.e. to compress a dataset we generate a model and then
store the model and its mistakes
 We need to compute
(a) size of the model, and
(b) space needed to encode the errors
 (b) easy: use the informational loss function
 (a) need a method to encode the model

54

54

27
3/17/2021

MDL and Bayes’s theorem

 L[T]=“length” of the theory


 L[E|T]=training set encoded in a certain number of
bits given the theory
 Description length= L[T] + L[E|T]
 Bayes’s theorem gives a posteriori probability of a
theory given the data:
𝑃𝑟[E|T]𝑃𝑟 𝑇
𝑃𝑟[T|E] =
𝑃𝑟 𝐸

 Equivalent to:
−log𝑃𝑟[T|E] = −log𝑃𝑟[E|T] − log𝑃𝑟 𝑇 + log𝑃𝑟 𝐸

constant
55

55

MDL and MAP

 MAP stands for maximum a posteriori probability


 Finding the MAP theory corresponds to finding the
MDL theory
 Difficult bit in applying the MAP principle:
determining the prior probability Pr[T] of the theory
 Corresponds to difficult part in applying the MDL
principle: coding scheme for the theory
 I.e. if we know a priori that a particular theory is
more likely we need fewer bits to encode it

56

56

28
3/17/2021

Discussion of MDL principle

 Advantage: makes full use of the training data when


selecting a model
 Disadvantage 1: appropriate coding scheme/prior
probabilities for theories are crucial
 Disadvantage 2: no guarantee that the MDL theory is the
one which minimizes the expected error
 Note: Occam’s Razor is an axiom!
 Epicurus’s principle of multiple explanations: keep all
theories that are consistent with the data

57

57

29
3/28/2021

Lecture 6:
Data mining algorithms:
Classification
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
Chapter 8 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber

1
3/28/2021

Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection

Supervised vs. Unsupervised Learning


 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data is unknown
 Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data

2
3/28/2021

Prediction Problems: Classification vs. Numeric


Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts
unknown or missing values
 Typical applications
 Credit/loan approval:
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is

Classification—A Two-Step Process


 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the classified
result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set (otherwise overfitting)
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called validation
(test) set

3
3/28/2021

Process (1): Model Construction

Classification
Algorithms
Training
Data

      NAME   RANK             YEARS   TENURED
      Mike   Assistant Prof   3       no
      Mary   Assistant Prof   7       yes
      Bill   Professor        2       yes
      Jim    Associate Prof   7       yes
      Dave   Assistant Prof   6       no
      Anne   Associate Prof   3       no

Classifier (Model):
      IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Process (2): Using the Model in Prediction

Classifier

Testing
Data Unseen Data

(Jeff, Professor, 4)
NA M E RANK YEARS TENURED
Tom Assistant Prof 2 no Tenured?
M erlisa Associate Prof 7 no
G eorge Professor 5 yes
Joseph Assistant Prof 7 yes

4
3/28/2021

Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection

Decision Tree Induction: An Example


 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)

      age     income   student   credit_rating   buys_computer
      <=30    high     no        fair            no
      <=30    high     no        excellent       no
      31…40   high     no        fair            yes
      >40     medium   no        fair            yes
      >40     low      yes       fair            yes
      >40     low      yes       excellent       no
      31…40   low      yes       excellent       yes
      <=30    medium   no        fair            no
      <=30    low      yes       fair            yes
      >40     medium   yes       fair            yes
      <=30    medium   yes       excellent       yes
      31…40   medium   no        excellent       yes
      31…40   high     yes       fair            yes
      >40     medium   no        excellent       no

 Resulting tree: the root tests age?
      age <=30   → test student? (no → no, yes → yes)
      age 31..40 → yes
      age >40    → test credit rating? (excellent → no, fair → yes)
10

10

5
3/28/2021

Algorithm for Decision Tree Induction


 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-
conquer manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they
are discretized in advance)
 Examples are partitioned recursively based on selected
attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further
partitioning – majority voting is employed for classifying
the leaf
 There are no samples left
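A compact sketch of this greedy procedure (plain Python, using information gain as the selection heuristic; the function names and the dict-based tree representation are illustrative, not taken from the lecture):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        n = len(labels)
        partitions = {}
        for row, label in zip(rows, labels):
            partitions.setdefault(row[attr], []).append(label)
        expected = sum(len(part) / n * entropy(part) for part in partitions.values())
        return entropy(labels) - expected

    def build_tree(rows, labels, attrs):
        # All samples for this node belong to the same class -> leaf
        if len(set(labels)) == 1:
            return labels[0]
        # No remaining attributes -> majority voting
        if not attrs:
            return Counter(labels).most_common(1)[0][0]
        # Greedy choice: the attribute with the highest information gain
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        node = {best: {}}
        for value in {row[best] for row in rows}:
            sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
            sub_rows = [r for r, _ in sub]
            sub_labels = [l for _, l in sub]
            node[best][value] = build_tree(sub_rows, sub_labels,
                                           [a for a in attrs if a != best])
        return node

    # rows are dicts such as {'age': '<=30', 'income': 'high', 'student': 'no', 'credit_rating': 'fair'}
    # build_tree(rows, labels, ['age', 'income', 'student', 'credit_rating'])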


Brief Review of Entropy


Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
      Info(D) = - Σ_{i=1..m} pi log2(pi)
 Information needed (after using A to split D into v partitions) to classify D:
      Info_A(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)
 Information gained by branching on attribute A:
      Gain(A) = Info(D) - Info_A(D)



Attribute Selection: Information Gain

 Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
      Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

 Expected information if we split on age:

      age     pi   ni   I(pi, ni)
      <=30    2    3    0.971
      31…40   4    0    0
      >40     3    2    0.971

      Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

      Here (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

 Hence
      Gain(age) = Info(D) - Info_age(D) = 0.246
 Similarly,
      Gain(income) = 0.029
      Gain(student) = 0.151
      Gain(credit_rating) = 0.048

 (All values are computed from the same 14-tuple buys_computer training set shown earlier.)
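These numbers are easy to check with a few lines of Python (an illustrative sketch, not part of the original slides; attribute names follow the buys_computer table):

    import math
    from collections import Counter

    data = [  # (age, income, student, credit_rating, buys_computer)
        ('<=30','high','no','fair','no'), ('<=30','high','no','excellent','no'),
        ('31..40','high','no','fair','yes'), ('>40','medium','no','fair','yes'),
        ('>40','low','yes','fair','yes'), ('>40','low','yes','excellent','no'),
        ('31..40','low','yes','excellent','yes'), ('<=30','medium','no','fair','no'),
        ('<=30','low','yes','fair','yes'), ('>40','medium','yes','fair','yes'),
        ('<=30','medium','yes','excellent','yes'), ('31..40','medium','no','excellent','yes'),
        ('31..40','high','yes','fair','yes'), ('>40','medium','no','excellent','no'),
    ]
    attrs = {'age': 0, 'income': 1, 'student': 2, 'credit_rating': 3}
    labels = [row[-1] for row in data]

    def entropy(ys):
        n = len(ys)
        return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

    def gain(attr):
        col, n = attrs[attr], len(data)
        split = {}
        for row in data:
            split.setdefault(row[col], []).append(row[-1])
        info_a = sum(len(part) / n * entropy(part) for part in split.values())
        return entropy(labels) - info_a

    for a in attrs:
        print(a, round(gain(a), 3))   # age 0.246, income 0.029, student 0.151, credit_rating 0.048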


Computing Information-Gain for


Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point,
and D2 is the set of tuples in D satisfying A > split-point
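A sketch of this split-point search (plain Python; the variable names are illustrative):

    import math
    from collections import Counter

    def entropy(ys):
        n = len(ys)
        return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

    def best_split_point(values, labels):
        """Return the split point of a continuous attribute that minimises Info_A(D)."""
        pairs = sorted(zip(values, labels))
        xs = [v for v, _ in pairs]
        ys = [y for _, y in pairs]
        n = len(ys)
        best, best_info = None, float('inf')
        for i in range(n - 1):
            if xs[i] == xs[i + 1]:
                continue
            split = (xs[i] + xs[i + 1]) / 2          # midpoint between adjacent values
            left, right = ys[:i + 1], ys[i + 1:]     # D1: A <= split, D2: A > split
            info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if info < best_info:
                best, best_info = split, info
        return best

    # e.g. best_split_point([25, 32, 47, 51, 38], ['no', 'yes', 'yes', 'no', 'yes'])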


Gain Ratio for Attribute Selection (C4.5)

 Information gain measure is biased towards attributes


with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
      SplitInfo_A(D) = - Σ_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.

 gain_ratio(income) = 0.029/1.557 = 0.019


 The attribute with the maximum gain ratio is selected as
the splitting attribute
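The example value 1.557 for income comes from its 4 “high”, 6 “medium” and 4 “low” tuples; a quick check (Python sketch, not part of the original slides):

    import math

    def split_info(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts)

    s = split_info([4, 6, 4])                 # income: 4 high, 6 medium, 4 low -> about 1.557
    print(round(s, 3), round(0.029 / s, 3))   # gain_ratio(income) ~ 0.019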


Gini Index (CART, IBM IntelligentMiner)

 If a data set D contains examples from n classes, the gini index gini(D) is defined as
      gini(D) = 1 - Σ_{j=1..n} pj^2
   where pj is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
      gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
 Reduction in impurity:
      Δgini(A) = gini(D) - gini_A(D)
 The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

Computation of Gini Index

 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
      gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}
      gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
 Gini{low,high} is 0.458; Gini{medium,high} is 0.450. Thus, split on the {low,medium} (and {high}) partition since it has the lowest Gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
 Can be modified for categorical attributes
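A small check of these figures (Python sketch; the per-partition class counts are taken from the buys_computer table, and the value for the {low, medium} split is computed here rather than quoted from the slide):

    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    # Whole data set: 9 'yes', 5 'no'
    print(round(gini([9, 5]), 3))                          # 0.459

    # Binary split on income: D1 = {low, medium} (7 yes, 3 no), D2 = {high} (2 yes, 2 no)
    g_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
    print(round(g_split, 3))                               # about 0.443, the lowest income split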


Comparing Attribute Selection Measures

 The three measures, in general, return good


results but
 Information gain:
 biased towards multivalued attributes

 Gain ratio:
 tends to prefer unbalanced splits in which one partition is much
smaller than the others

 Gini index:
 biased to multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized partitions and purity in
both partitions

Other Attribute Selection Measures

 CHAID: a popular decision tree algorithm, measure based on χ2 test for


independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistic: has a close approximation to χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
 The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results; none is significantly superior to the others

Overfitting and Tree Pruning


 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to
noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early; do not split a
node if this would result in the goodness measure falling
below a threshold
 Difficult to choose an appropriate threshold

 Postpruning: Remove branches from a “fully grown” tree—


get a sequence of progressively pruned trees
 Use a set of data different from the training data to decide which is the “best
pruned tree”


Enhancements to Basic Decision Tree Induction

 Allow for continuous-valued attributes


 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication

Classification in Large Databases


 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
 Why is decision tree induction popular?
 relatively faster learning speed (than other
classification methods)
 convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases
 comparable classification accuracy with other methods
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 Builds an AVC-list (attribute, value, class label)


Scalability Framework for RainForest


 Separates the scalability aspects from the criteria that
determine the quality of the tree
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X )
 Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
 AVC-group (of a node n )
 Set of AVC-sets of all predictor attributes at the node n


Rainforest: Training Set and Its AVC Sets

Training examples: the same 14-tuple buys_computer table shown earlier.

AVC-set on Age:
      Age      Buy_Computer: yes   no
      <=30     2                   3
      31..40   4                   0
      >40      3                   2

AVC-set on income:
      income   Buy_Computer: yes   no
      high     2                   2
      medium   4                   2
      low      3                   1

AVC-set on Student:
      student  Buy_Computer: yes   no
      yes      6                   1
      no       3                   4

AVC-set on credit_rating:
      Credit rating   Buy_Computer: yes   no
      fair            6                   2
      excellent       3                   3
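Building an AVC-set is just an aggregation of class-label counts per attribute value; a minimal sketch in plain Python (the list below is the (age, class) projection of the buys_computer table):

    from collections import Counter

    rows = [  # (age, buys_computer) projection of the training set
        ('<=30', 'no'), ('<=30', 'no'), ('31..40', 'yes'), ('>40', 'yes'), ('>40', 'yes'),
        ('>40', 'no'), ('31..40', 'yes'), ('<=30', 'no'), ('<=30', 'yes'), ('>40', 'yes'),
        ('<=30', 'yes'), ('31..40', 'yes'), ('31..40', 'yes'), ('>40', 'no'),
    ]

    avc_age = Counter(rows)   # AVC-set on Age: count per (attribute value, class label) pair
    print(avc_age)
    # e.g. ('<=30','yes'): 2, ('<=30','no'): 3, ('31..40','yes'): 4, ('>40','yes'): 3, ('>40','no'): 2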


BOAT (Bootstrapped Optimistic Algorithm


for Tree Construction)
 Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
 Each subset is used to create a tree, resulting in several
trees
 These trees are examined and used to construct a new tree
T’
 It turns out that T’ is very close to the tree that would be
generated using the whole data set together
 Advantages: requires only two scans of the database; it is an incremental algorithm


Presentation of Classification Results
Visualization of a Decision Tree in SGI/MineSet 3.0
Interactive Visual Mining by Perception-Based Classification (PBC)
(These three slides contain screenshots only.)

Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection


Bayesian Classification: Why?

 A statistical classifier: performs probabilistic prediction, i.e.,


predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured


Bayes’ Theorem: Basics


 Total probability Theorem:
      P(B) = Σ_{i=1..M} P(B|Ai) P(Ai)
 Bayes’ Theorem:
      P(H|X) = P(X|H) P(H) / P(X)
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), (i.e., posteriori probability):
the probability that the hypothesis holds given the observed data
sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given
that the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income


Prediction Based on Bayes’ Theorem

 Given training data X, posteriori probability of a


hypothesis H, P(H|X), follows the Bayes’ theorem

      P(H|X) = P(X|H) P(H) / P(X)
 Informally, this can be viewed as
posteriori = likelihood x prior/evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost


Classification Is to Derive the Maximum Posteriori

 Let D be a training set of tuples and their associated


class labels, and each tuple is represented by an n-D
attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e.,
the maximal P(Ci|X)
 This can be derived from Bayes’ theorem
      P(Ci|X) = P(X|Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only
      P(Ci|X) ∝ P(X|Ci) P(Ci)
   needs to be maximized


Naïve Bayes Classifier

 A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
      P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
 This greatly reduces the computation cost: only counts the class distribution
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
      g(x, μ, σ) = (1 / (√(2π) σ)) × exp( -(x - μ)^2 / (2σ^2) )
   and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)

Naïve Bayes Classifier: Training Dataset

Class:
   C1: buys_computer = ‘yes’
   C2: buys_computer = ‘no’

Data to be classified:
   X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)

Training data (the same buys_computer table as before):

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Naïve Bayes Classifier: An Example

 P(Ci):  P(buys_computer = “yes”) = 9/14 = 0.643
         P(buys_computer = “no”) = 5/14 = 0.357
 Compute P(X|Ci) for each class:
   P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
   P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
   P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
   P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
   P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
   P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
   P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
   P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
   P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
   P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
 P(X|Ci) * P(Ci):
   P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
   P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
 Therefore, X belongs to class “buys_computer = yes”
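These probabilities are easy to reproduce in code; the sketch below (plain Python, not part of the original slides) implements a categorical naïve Bayes classifier over the buys_computer table:

    from collections import Counter

    data = [  # (age, income, student, credit_rating) -> buys_computer
        (('<=30','high','no','fair'),'no'), (('<=30','high','no','excellent'),'no'),
        (('31..40','high','no','fair'),'yes'), (('>40','medium','no','fair'),'yes'),
        (('>40','low','yes','fair'),'yes'), (('>40','low','yes','excellent'),'no'),
        (('31..40','low','yes','excellent'),'yes'), (('<=30','medium','no','fair'),'no'),
        (('<=30','low','yes','fair'),'yes'), (('>40','medium','yes','fair'),'yes'),
        (('<=30','medium','yes','excellent'),'yes'), (('31..40','medium','no','excellent'),'yes'),
        (('31..40','high','yes','fair'),'yes'), (('>40','medium','no','excellent'),'no'),
    ]

    class_count = Counter(label for _, label in data)
    # cond_count[(attribute index, value, class)] = number of matching tuples
    cond_count = Counter((k, x[k], label) for x, label in data for k in range(4))

    def classify(x):
        scores = {}
        for c, nc in class_count.items():
            p = nc / len(data)                      # prior P(Ci)
            for k, v in enumerate(x):
                p *= cond_count[(k, v, c)] / nc     # likelihood P(xk|Ci)
            scores[c] = p
        return scores

    print(classify(('<=30', 'medium', 'yes', 'fair')))
    # {'no': ~0.007, 'yes': ~0.028}  -> predict 'yes', as in the example above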

Avoiding the Zero-Probability Problem


 Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
      P(X|Ci) = Π_{k=1..n} P(xk|Ci)
 Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003

 The “corrected” prob. estimates are close to their


“uncorrected” counterparts
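A one-line illustration of the Laplacian correction on the income counts from this example (Python sketch):

    counts = {'low': 0, 'medium': 990, 'high': 10}      # 1000 tuples in total
    q = len(counts)                                     # number of distinct values
    corrected = {v: (c + 1) / (sum(counts.values()) + q) for v, c in counts.items()}
    print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003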


Naïve Bayes Classifier: Comments


 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore
loss of accuracy
 Practically, dependencies exist among variables
 E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
 Dependencies among these cannot be modeled by Naïve Bayes Classifier
 How to deal with these dependencies? Bayesian Belief
Networks (Text [1]. Chapter 9)


Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection


Using IF-THEN Rules for Classification


 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
 If more than one rule is triggered, we need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that
has the “toughest” requirement (i.e., with the most attribute tests)
 Class-based ordering: decreasing order of prevalence or
misclassification cost per class
 Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts


Rule Extraction from a Decision Tree


 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
 (The slide shows the buys_computer tree again: age? branches to <=30 -> student?, 31..40 -> yes, >40 -> credit rating?)


 Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes

Rule Induction: Sequential Covering Method

 Sequential covering algorithm: Extracts rules directly from training


data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
 Compared with decision-tree induction: a decision tree learns a set of rules
simultaneously


Sequential Covering Algorithm

while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

(The accompanying figure shows the set of positive examples being covered step by step by Rule 1, Rule 2 and Rule 3.)


Rule Generation

 To generate a rule
      while (true)
          find the best predicate p
          if foil-gain(p) > threshold then add p to current rule
          else break

(The figure illustrates the rule growing one predicate at a time: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5, covering more positive and fewer negative examples.)


How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Adding new attributes by adopting a greedy depth-first strategy
 Picks the one that most improves the rule quality
 Rule-Quality measures: consider both coverage and accuracy
 Foil-gain (in FOIL & RIPPER): assesses info_gain by extending the condition
      FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) - log2( pos / (pos + neg) ) )
   favors rules that have high accuracy and cover many positive tuples
 Rule pruning based on an independent set of test tuples
      FOIL_Prune(R) = (pos - neg) / (pos + neg)
   pos/neg are # of positive/negative tuples covered by R.
   If FOIL_Prune is higher for the pruned version of R, prune R
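For concreteness, a tiny helper that evaluates FOIL_Gain for a candidate predicate (Python sketch; the counts in the usage line are hypothetical):

    import math

    def foil_gain(pos, neg, pos_new, neg_new):
        """Gain from extending a rule covering (pos, neg) tuples so that it covers (pos_new, neg_new)."""
        if pos_new == 0:
            return 0.0
        return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                          - math.log2(pos / (pos + neg)))

    # e.g. a rule covering 40 positive / 30 negative tuples, narrowed to 30 positive / 5 negative:
    print(round(foil_gain(40, 30, 30, 5), 3))   # about 17.5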


Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection


Model Evaluation and Selection


 Evaluation metrics: How can we measure accuracy? Other metrics to
consider?
 Use validation test set of class-labeled tuples instead of training set
when assessing accuracy
 Methods for estimating a classifier’s accuracy:
 Holdout method, random subsampling
 Cross-validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 Cost-benefit analysis and ROC Curves


Classifier Evaluation Metrics: Confusion Matrix


Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:


Actual class\Predicted buy_computer buy_computer Total
class = yes = no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

 Given m classes, an entry, CMi,j in a confusion matrix indicates #


of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

      A\P    C    ¬C
      C      TP   FN   | P
      ¬C     FP   TN   | N
             P’   N’   | All

 Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
      Accuracy = (TP + TN)/All
 Error rate: 1 – accuracy, or
      Error rate = (FP + FN)/All
 Class Imbalance Problem:
   One class may be rare, e.g. fraud, or HIV-positive
   Significant majority of the negative class and minority of the positive class
   Sensitivity: True Positive recognition rate
      Sensitivity = TP/P
   Specificity: True Negative recognition rate
      Specificity = TN/N


Classifier Evaluation Metrics: Precision and Recall, and F-measures

 Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
      Precision = TP / (TP + FP)
 Recall: completeness – what % of positive tuples did the classifier label as positive?
      Recall = TP / (TP + FN)
 Perfect score is 1.0
 Inverse relationship between precision & recall
 F measure (F1 or F-score): harmonic mean of precision and recall
      F1 = 2 × Precision × Recall / (Precision + Recall)
 Fß: weighted measure of precision and recall; assigns ß times as much weight to recall as to precision
      Fß = (1 + ß²) × Precision × Recall / (ß² × Precision + Recall)


Classifier Evaluation Metrics: Example

Actual class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.40 (accuracy)

 Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
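The cancer example can be recomputed directly from the confusion-matrix counts (an illustrative Python sketch):

    TP, FN, FP, TN = 90, 210, 140, 9560          # counts from the cancer confusion matrix
    P, N = TP + FN, FP + TN

    sensitivity = TP / P                          # 0.30
    specificity = TN / N                          # ~0.9856
    precision   = TP / (TP + FP)                  # ~0.3913
    recall      = TP / (TP + FN)                  # 0.30
    f1 = 2 * precision * recall / (precision + recall)

    print(round(sensitivity, 4), round(specificity, 4),
          round(precision, 4), round(recall, 4), round(f1, 4))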


Evaluating Classifier Accuracy:


Holdout & Cross-Validation Methods
 Holdout method
 Given data is randomly partitioned into two independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation

 Random sampling: a variation of holdout


 Repeat holdout k times, accuracy = avg. of the accuracies obtained

 Cross-validation (k-fold, where k = 10 is most popular)


 Randomly partition the data into k mutually exclusive
subsets, each approximately equal size
 At i-th iteration, use Di as test set and others as training set
 Leave-one-out: k folds where k = # of tuples, for small sized
data
 *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial
data
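A minimal k-fold cross-validation loop (Python sketch; it uses scikit-learn’s StratifiedKFold, which is one way to obtain the stratified folds mentioned above, and the iris data and Gaussian naïve Bayes model are placeholders):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    accuracies = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        model = GaussianNB().fit(X[train_idx], y[train_idx])    # train on k-1 folds
        accuracies.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))  # test on the held-out fold

    print(sum(accuracies) / len(accuracies))   # estimated accuracy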

Evaluating Classifier Accuracy: Bootstrap

 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e.,each time a tuple is selected, it is equally likely to be
selected again and re-added to the training set
 Several bootstrap methods, and a common one is .632 bootstrap
 A data set with d tuples is sampled d times, with replacement,
resulting in a training set of d samples. The data tuples that did
not make it into the training set end up forming the test set.
About 63.2% of the original data end up in the bootstrap, and the
remaining 36.8% form the test set (since (1 – 1/d)^d ≈ e^-1 = 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the model is
      Acc(M) = Σ_{i=1..k} ( 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set )

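A sketch of one .632 bootstrap round (Python with scikit-learn; the dataset and model are placeholders, and in practice the procedure is repeated k times and the weighted terms are combined as in the formula above):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)
    d = len(X)

    # Sample d tuples with replacement; the untouched tuples form the test set
    train_idx = rng.integers(0, d, size=d)
    test_idx = np.setdiff1d(np.arange(d), train_idx)

    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    acc_test = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    acc_train = accuracy_score(y[train_idx], model.predict(X[train_idx]))

    # One weighted term of the .632 bootstrap estimate
    print(0.632 * acc_test + 0.368 * acc_train)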

Issues Affecting Model Selection

 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules

Summary (I)
 Classification is a form of data analysis that extracts models
describing important data classes.
 Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
 Evaluation metrics include: accuracy, sensitivity, specificity,
precision, recall, F measure, and Fß measure.
 Stratified k-fold cross-validation is recommended for accuracy
estimation. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.


Summary (II)

 There have been numerous comparisons of the different


classification methods; the matter remains a research
topic
 No single method has been found to be superior to all
others for all data sets
 Issues such as accuracy, training time, robustness,
scalability, and interpretability must be considered and
can involve trade-offs, further complicating the quest for
an overall superior method

Introduction to Data Mining

Lecture 6 – Activities

1. Exercises in Text [1] - 8.7

The following table consists of training data from an employee database. The data have been
generalized. For example, “31 . . . 35” for age represents the age range of 31 to 35. For a given
row entry, count represents the number of data tuples having the values for department, status, age,
and salary given in that row.

Let status be the class label attribute.

(a) How would you modify the basic decision tree algorithm to take into consideration the count
of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
(c) Given a data tuple having the values “systems”, “26…30”, and “46–50K” for the attributes
department, age, and salary, respectively, what would a Naïve Bayesian classification of the status
for the tuple be?


Answers

(a) How would you modify the basic decision tree algorithm to take into consideration the count
of each generalized data tuple (i.e., of each row entry)?
The basic decision tree algorithm should be modified as follows to take into consideration the
count of each generalized data tuple.
• The count of each tuple must be integrated into the calculation of the attribute selection measure
(such as information gain).
• Take the count into consideration to determine the most common class among the tuples.
(b) Use your algorithm to construct a decision tree from the given data.
The resulting tree (the tree figure is not reproduced here) is grown by choosing, at each level, the attribute with the highest information gain:

Info(D) = 0.899
Gain(Dept) = 0.0488
Gain(Age) = 0.4247
Gain(Salary) = 0.5375

Salary has the largest gain and is therefore selected as the splitting attribute at the root.
(c) Given a data tuple having the values “systems”, “26…30”, and “46–50K” for the attributes
department, age, and salary, respectively, what would a Naïve Bayesian classification of the status
for the tuple be?
P(X|senior) = 0; in this case the Laplacian correction was not used (one more tuple for each
age-value pair would have to be added to apply it).
P(X|junior) = 23/113 × 49/113 × 23/113 = 0.018. Thus, a Naïve Bayesian classification predicts
“junior”.
