Data Mining Lecture Notes

The document provides an introduction to data mining, including: defining data mining as the process of discovering patterns in large data sets; the goals of data mining (extracting information from data, transforming it into an understandable structure, and enabling prediction); the stages of the data mining process (data preprocessing, model building, and pattern evaluation); and the disciplines data mining draws on, including machine learning, pattern recognition, and statistics.

24/01/2021

Lecture 1:
Introduction to Data Mining
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
[1] Chapter 1 in Data Mining: Concepts and Techniques (Third Edition), by
Jiawei Han, Micheline Kamber
[2] Chapter 1 in Data Mining: Practical Machine Learning Tools and
Techniques (Third Edition), by Ian H. Witten, Eibe Frank and Mark A. Hall

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data


What is data mining?

 Example 1: Web usage mining


◆ Given: click streams
◆ Problem: prediction of user behaviour
◆ Data: historical records of users’ click streams and outcomes

 Example 2: cow culling


◆ Given: cows described by 700 features
◆ Problem: selection of cows that should be culled
◆ Data: historical records and farmers’ decisions

What is data mining?

 Extracting
◆ implicit,
◆ previously unknown,
◆ potentially useful

information from data


 Needed: programs that detect patterns and
regularities in the data
 Strong patterns → good predictions
◆ Problem 1: most patterns are not interesting
◆ Problem 2: patterns may be inexact (or spurious)
◆ Problem 3: data may be garbled or missing

2
4
24/01/2021

What Is Data Mining?


 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems

What is data mining?

Definitions:
 DM: The practice of examining large databases in
order to generate new information.
 DM: The process of analyzing data from different
perspectives and summarizing it into useful
information - information that can be used to
increase revenue, cut costs, or both.

3
6
24/01/2021

What is data mining?

Data mining is defined as the process of


discovering patterns in data.
 The process must be automatic or (more usually)
semiautomatic.
 The patterns discovered must be meaningful in
that they lead to some advantage, usually an
economic one.
 The data is invariably presented in substantial
quantities.

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

4
8
24/01/2021

Data Mining in Business Intelligence

[Figure: pyramid of increasing potential to support business decisions, with the typical user at each layer]
 Decision making (end user)
 Data presentation: visualization techniques (business analyst)
 Data mining: information discovery (data analyst)
 Data exploration: statistical summary, querying, and reporting (data analyst)
 Data preprocessing/integration, data warehouses (DBA)
 Data sources: paper, files, Web documents, scientific experiments, database systems

Data Mining Goals

 Extract information from a data set


 Transform it into an understandable structure for further
use
 The ultimate goal of data mining is prediction

5
10
24/01/2021

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

11

Knowledge Discovery (KDD) Process

 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process

[Figure: the KDD pipeline]
Databases → Data Cleaning → Data Integration → Data Warehouse → Selection (task-relevant data) → Data Mining → Pattern Evaluation

Example: A Web Mining Framework

 Web mining usually involves


 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored into knowledge-base

13

13

KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

 Data pre-processing: data integration, normalization, feature selection, dimension reduction, …
 Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
 Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization, …

 This is a view from typical machine learning and statistics communities

Which View Do You Prefer?


 Which view do you prefer?
 KDD vs. ML/Stat. vs. Business Intelligence
 Depending on the data, applications, and your focus
 Data Mining vs. Data Exploration
 Business intelligence view
 Warehouse, data cube, reporting but not much
mining
 Business objects vs. data mining tools
 Supply chain example: mining vs. OLAP vs.
presentation tools
 Data presentation vs. data exploration
15

15

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

8
16
24/01/2021

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of]
 Machine learning
 Pattern recognition
 Statistics
 Visualization
 Applications
 Algorithms
 Database technology
 High-performance computing

Why Confluence of Multiple Disciplines?

 Tremendous amount of data


 Algorithms must be scalable to handle big data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social and information networks
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
18
9
18
24/01/2021

Machine learning techniques


 Algorithms for acquiring structural descriptions
from examples
 Structural descriptions represent patterns
explicitly
◆ Can be used to predict outcome in new situation
◆ Can be used to understand and explain how prediction
is derived
(may be even more important)
 Methods originate from artificial intelligence,
statistics, and research on databases

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 19

19

Structural descriptions
 Example: if-then rules

If tear production rate = reduced


then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft

Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lenses
Young           Myope                    No            Reduced                None
Young           Hypermetrope             No            Normal                 Soft
Pre-presbyopic  Hypermetrope             No            Reduced                None
Presbyopic      Myope                    Yes           Normal                 Hard
…               …                        …             …                      …

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 20


10
20
24/01/2021

Can machines really learn?


 Definitions of “learning” from dictionary:
◆ To get knowledge of by study, experience, or being taught
◆ To become aware by information or from observation
  (difficult to measure)
◆ To commit to memory
◆ To be informed of, ascertain; to receive instruction
  (trivial for computers)

 Operational definition:
Things learn when they change their behavior in a way that makes
them perform better in the future. (Does a slipper learn?)

 Does learning imply intention?


21
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

21

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

11
22
24/01/2021

Knowledge Representation Methods

 Tables
 Data cube
 Linear models
 Trees
 Rules
 Instance-based Representation
 Clusters

23

Knowledge Representation Methods

 Decision table for the weather problem:

Outlook Humidity Play


Sunny High No
Sunny Normal Yes
Overcast High Yes
Overcast Normal Yes
Rainy High No
Rainy Normal No

12
24
24/01/2021

Knowledge Representation Methods


 A sample data cube:

[Figure: 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A., Canada, Mexico, sum); each cell holds a sales total, e.g., the total annual sales of TVs in the U.S.A.]

Knowledge Representation Methods


 A linear regression function for the CPU performance data

PRP = 37.06 + 2.47CACH


13
26
24/01/2021

Knowledge Representation Methods


 Regression tree for the CPU data

27

Knowledge Representation Methods

 If-then Rules

If tear production rate = reduced


then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft

14
28
24/01/2021

Knowledge Representation Methods

 Instance-based representation

29

Knowledge Representation Methods

 Clusters

15
30
24/01/2021

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

31

Applications
 The result of learning—or the learning method itself—
is deployed in practical applications
◆ Processing loan applications
◆ Screening images for oil slicks
◆ Electricity supply forecasting
◆ Diagnosis of machine faults
◆ Marketing and sales
◆ Separating crude oil and natural gas
◆ Reducing banding in rotogravure printing
◆ Finding appropriate technicians for telephone faults
◆ Scientific applications: biology, astronomy, chemistry
◆ Automatic selection of TV programs
◆ Monitoring intensive care patients

16
32
24/01/2021

Processing loan applications (American Express)

 Given: questionnaire with


financial and personal information
 Question: should money be lent?
 Simple statistical method covers 90% of cases
 Borderline cases referred to loan officers
 But: 50% of accepted borderline cases defaulted!
 Solution: reject all borderline cases?
◆ No! Borderline cases are most active customers

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 33

33

Enter machine learning

 1000 training examples of borderline cases


 20 attributes:
◆ age
◆ years with current employer
◆ years at current address
◆ years with the bank
◆ other credit cards possessed,…

 Learned rules: correct on 70% of cases


◆ human experts only 50%
 Rules could be used to explain decisions to
customers
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 34
17
34
24/01/2021

Screening images
 Given: radar satellite images of coastal waters
 Problem: detect oil slicks in those images
 Oil slicks appear as dark regions with changing
size and shape
 Not easy: lookalike dark regions can be caused
by weather conditions (e.g. high wind)
 Expensive process requiring highly trained
personnel

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 35

35

Enter machine learning


 Extract dark regions from normalized image
 Attributes:
◆ size of region
◆ shape, area
◆ intensity
◆ sharpness and jaggedness of boundaries
◆ proximity of other regions
◆ info about background
 Constraints:
◆ Few training examples—oil slicks are rare!
◆ Unbalanced data: most dark regions aren’t slicks
◆ Regions from same image form a batch
◆ Requirement: adjustable false-alarm rate

36
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)
18
36
24/01/2021

Load forecasting
 Electricity supply companies
need forecast of future demand
for power
 Forecasts of min/max load for each hour → significant savings
 Given: manually constructed load model that
assumes “normal” climatic conditions
 Problem: adjust for weather conditions
 Static model consists of:
◆ base load for the year
◆ load periodicity over the year
◆ effect of holidays

37
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

37

Enter machine learning


 Prediction corrected using “most similar” days
 Attributes:
◆ temperature
◆ humidity
◆ wind speed
◆ cloud cover readings
◆ plus difference between actual load and predicted load

 Average difference among three “most similar” days


added to static model
 Linear regression coefficients form attribute weights
in similarity function

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 38


19
38
24/01/2021

Diagnosis of machine faults

 Diagnosis: classical domain


of expert systems
 Given: Fourier analysis of vibrations measured
at various points of a device’s mounting
 Question: which fault is present?
 Preventative maintenance of electromechanical
motors and generators
 Information very noisy
 So far: diagnosis by expert/hand-crafted rules

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 39

39

Enter machine learning

 Available: 600 faults with expert’s diagnosis


 ~300 unsatisfactory, rest used for training
 Attributes augmented by intermediate concepts
that embodied causal domain knowledge
 Expert not satisfied with initial rules because they
did not relate to his domain knowledge
 Further background knowledge resulted in more
complex rules that were satisfactory
 Learned rules outperformed hand-crafted ones

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 40

20
40
24/01/2021

Marketing and sales I

 Companies precisely record massive amounts of


marketing and sales data
 Applications:
◆ Customer loyalty:
identifying customers that are likely to defect by
detecting changes in their behavior
(e.g. banks/phone companies)
◆ Special offers:
identifying profitable customers
(e.g. reliable owners of credit cards that need extra
money during the holiday season)

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 41

41

Marketing and sales II

 Market basket analysis


◆ Association techniques find
groups of items that tend to
occur together in a
transaction
(used to analyze checkout data)
 Historical analysis of purchasing patterns
 Identifying prospective customers
◆ Focusing promotional mailouts
(targeted campaigns are cheaper than mass-marketed
ones)

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 42

21
42
24/01/2021

Introduction

 What is data mining?


 Data Mining Goals
 Stages of the Data Mining Process
 Data Mining Techniques
 Knowledge Representation Methods
 Applications
 Example: weather data

43

The weather problem


 Conditions for playing a certain game

Outlook Temperature Humidity Windy Play


Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes
… … … … …

If outlook = sunny and humidity = high then play = no


If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

44
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)
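To make the rule set concrete, here is a minimal sketch in Python of how these classification rules could be applied to one example; the function name classify_play and the dict-based record format are illustrative choices, not part of the original slides.

```python
def classify_play(record):
    """Apply the weather rules above, in order, to one example.

    `record` is a dict with keys: outlook, humidity, windy.
    Returns the predicted value of the 'play' attribute.
    """
    if record["outlook"] == "sunny" and record["humidity"] == "high":
        return "no"
    if record["outlook"] == "rainy" and record["windy"] is True:
        return "no"
    if record["outlook"] == "overcast":
        return "yes"
    if record["humidity"] == "normal":
        return "yes"
    return "yes"  # default rule: "if none of the above then play = yes"


# Example: the first row of the table (Sunny, Hot, High, False) -> "no"
print(classify_play({"outlook": "sunny", "humidity": "high", "windy": False}))
```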
22
44
24/01/2021

Classification vs. association rules

 Classification rule:
predicts value of a given attribute (the classification of an example)
If outlook = sunny and humidity = high
then play = no

 Association rule:
predicts value of arbitrary attribute (or combination)
If temperature = cool then humidity = normal
If humidity = normal and windy = false
then play = yes
If outlook = sunny and play = no
then humidity = high
If windy = false and play = no
then outlook = sunny and humidity = high

45
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

45

Weather data with mixed attributes

 Some attributes have numeric values

Outlook Temperature Humidity Windy Play


Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes
… … … … …

If outlook = sunny and humidity > 83 then play = no


If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1) 46


23
46
24/01/2021

Summary

 What is Data Mining?


 What kinds of Data can be mined?
 Which Technologies are used?
 Which kinds of applications are targeted?

47

24
31/01/2021

Lecture 2:
Getting to Know Your Data
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
Chapter 2 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber

1
31/01/2021

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets


 Record

 Relational records

 Data matrix, e.g., numerical matrix, crosstabs

 Document data: text documents: term-frequency vector

 Transaction data

 Graph and network

 World Wide Web

 Social or information networks

 Molecular Structures

 Ordered

 Video data: sequence of images TID Items

 Temporal data: time-series 1 Bread, Coke, Milk


 Sequential Data: transaction sequences 2 Beer, Bread
 Genetic sequence data 3 Beer, Coke, Diaper, Milk
 Spatial, image and multimedia:
4 Beer, Bread, Diaper, Milk
 Spatial data: maps
5 Coke, Diaper, Milk

 Image data:

 Video data: 4

2
31/01/2021

Important Characteristics of Structured Data

 Dimensionality
 Curse of dimensionality
 Sparsity

 Only presence counts


 Resolution

 Patterns depend on the scale

 Distribution
 Centrality and dispersion
5

Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples , examples, instances, data points, objects,
tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns ->attributes.

3
31/01/2021

Attributes

 Attribute (or dimensions, features, variables): a data


field, representing a characteristic or feature of a data
object.
 E.g., customer _ID, name, address
 Types:
 Nominal
 Binary
 Numeric: quantitative
 Interval-scaled
 Ratio-scaled

Attribute Types

 Nominal: categories, states, or “names of things”


 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings

4
31/01/2021

Numeric Attribute Types

 Quantity (integer or real-valued)


 Interval
 Measured on a scale of equal-sized units
 Values have order
• E.g., temperature in C˚or F˚, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of values as being an order of magnitude
larger than the unit of measurement (10 K˚ is twice as high
as 5 K˚).
• e.g., temperature in Kelvin, length, counts, monetary
quantities
9

Discrete vs. Continuous Attributes

 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a collection of documents

 Sometimes, represented as integer variables


 Note: Binary attributes are a special case of discrete
attributes
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight

 Practically,real values can only be measured and


represented using a finite number of digits
 Continuous attributes are typically represented as
floating-point variables
10

10

5
31/01/2021

Activities

 Identify attribute types in datasets

11

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

12

12

6
31/01/2021

Basic Statistical Descriptions of Data


 Motivation
 To
better understand the data: central tendency,
variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities
of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
13

13

Measuring the Central Tendency

 Mean (algebraic measure) (sample vs. population):
   $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample)    $\mu = \frac{\sum x}{N}$ (population)
   Note: n is sample size and N is population size.
 Weighted arithmetic mean:
   $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Trimmed mean: chopping extreme values
 Example:
   Suppose we have the following values for salary (in thousands of dollars),
   shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
   Mean = ?

7
31/01/2021

Measuring the Central Tendency (cont.)

 Median:
   Middle value if odd number of values, or average of the middle two values otherwise
   Estimated by interpolation (for grouped data):
     $median \approx L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width$
   where $L_1$ is the lower boundary of the median interval
 Mode
   Value that occurs most frequently in the data
   Unimodal, bimodal, trimodal
   Empirical formula: $mean - mode \approx 3 \times (mean - median)$

Measuring the Central Tendency (cont.)

 Example:
 Suppose we have the following values for salary (in thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

 Median = ?

 Mode = ?

16
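A small sketch, using only Python's standard library, for checking the mean, median and mode of the salary values above; the variable name salaries is illustrative and not part of the original notes.

```python
from statistics import mean, median, multimode  # multimode needs Python 3.8+

# Salary values (in thousands of dollars), in increasing order, from the example
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean   =", mean(salaries))       # arithmetic mean
print("median =", median(salaries))     # average of the two middle values (even n)
print("mode   =", multimode(salaries))  # all most frequent values (may be more than one)
```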

8
31/01/2021

Symmetric vs. Skewed Data

 Median, mean and mode of symmetric, positively skewed and negatively skewed data

[Figure: three distributions showing the relative positions of mean, median and mode for symmetric, positively skewed, and negatively skewed data]

Measuring the Dispersion of Data

 Quartiles, outliers and boxplots


 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and
plot outliers individually
 Outlier: usually, a value higher/lower than 1.5 x IQR beyond the quartiles.
 Variance and standard deviation (σ)
 Variance: (algebraic, scalable computation)

 Standard deviation σ is the square root of variance σ2

18

18

9
31/01/2021

Boxplot Analysis

 Five-number summary of a distribution


 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended
to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually
19

19

Fig. 2.3. Boxplot for the unit price data for items sold at four
branches of AllElectronics during a given time period.

 For branch 1, Median = $80, Q1 = $60, and Q3 = $100. Notice that two
outlying observations for this branch were plotted individually, as their
values of 175 and 202 are more than 1.5 times the IQR here of 40.

[Figure: boxplots per branch, marking Min, Q1, Median, Q3 and Max]

10
31/01/2021

Measuring the Dispersion of Data

 Example:
 Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110.

 Q1 = ?; Q3 = ?

 IQR = ?

 Variance σ² = ?; Standard deviation σ = ?

21
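A possible numerical check of these dispersion measures with NumPy; note that quartile conventions differ between textbooks and libraries, so np.percentile's default interpolation may not match a hand computation exactly. This sketch is illustrative and not part of the original notes.

```python
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, med, q3 = np.percentile(salaries, [25, 50, 75])
iqr = q3 - q1
print("five-number summary:", salaries.min(), q1, med, q3, salaries.max())
print("IQR      =", iqr)
print("variance =", salaries.var())   # population variance (divide by n)
print("std dev  =", salaries.std())   # population standard deviation
```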

Visualization of Data Dispersion: 3-D Boxplots

22

22

11
31/01/2021

Properties of Normal Distribution Curve

 The normal (distribution) curve


 From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation)
 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it

23

23

Graphic Displays of Basic Statistical Descriptions

 Boxplot: graphic display of five-number summary

 Histogram: x-axis are values, y-axis repres. frequencies

 Quantile plot: each value xi is paired with fi indicating that approximately 100·fi % of the data are ≤ xi

 Quantile-quantile (q-q) plot: graphs the quantiles of one univariant


distribution against the corresponding quantiles of another

 Scatter plot: each pair of values is a pair of coordinates and plotted


as points in the plane

24

24

12
31/01/2021

Histogram Analysis

 Histogram: Graph display of tabulated


frequencies, shown as bars 40
 It shows what proportion of cases fall 35
into each of several categories 30
 Differs from a bar chart in that it is the 25
area of the bar that denotes the 20
value, not the height as in bar charts,
15
a crucial distinction when the
10
categories are not of uniform width
5
 The categories are usually specified as
0
non-overlapping intervals of some 10000 30000 50000 70000 90000
variable. The categories (bars) must
be adjacent
25

25

Quantile Plot
 Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
 Plots quantile information
 For data xi sorted in increasing order, fi indicates that
approximately 100·fi % of the data are below or equal to the value xi

26

26

13
31/01/2021

Quantile-Quantile (Q-Q) Plot

 Graphs the quantiles of one univariate distribution against the


corresponding quantiles of another
 View: Is there is a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile. Unit prices of items sold at Branch 1 tend to be
lower than those at Branch 2.

27

27

Scatter plot

 Provides a first look at bivariate data to see clusters of


points, outliers, etc.
 Each pair of values is treated as a pair of coordinates
and plotted as points in the plane

28

28

14
31/01/2021

Example: Histogram and Scatter

29

Positively and Negatively Correlated Data

 The left half fragment is positively correlated

 The right half is negatively correlated

30

30

15
31/01/2021

Uncorrelated Data

31

31

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

32

32

16
31/01/2021

Similarity and Dissimilarity


 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
 Numerical measure of how different two data objects
are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
 Proximity refers to a similarity or dissimilarity
33

33

Data Matrix and Dissimilarity Matrix

 Data matrix
   n data points with p dimensions; two modes (rows are objects, columns are attributes)

   $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

 Dissimilarity matrix
   n data points, but registers only the distance; a triangular matrix; single mode

   $\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$

17
31/01/2021

Proximity Measure for Nominal Attributes

 Can take 2 or more states, e.g., red, yellow, blue, green


(generalization of a binary attribute)

 Method: Simple matching
   m: # of matches, p: total # of variables
   $d(i,j) = \frac{p - m}{p}$

Proximity Measure for Binary Attributes

 A contingency table for binary data (counting attribute values over objects i and j):

                       Object j
                       1        0
   Object i   1        q        r
              0        s        t

 Distance measure for symmetric binary variables:
   $d(i,j) = \frac{r + s}{q + r + s + t}$
 Distance measure for asymmetric binary variables (matches on 0 are ignored):
   $d(i,j) = \frac{r + s}{q + r + s}$
 Jaccard coefficient (similarity measure for asymmetric binary variables):
   $sim_{Jaccard}(i,j) = \frac{q}{q + r + s}$
 Note: the Jaccard coefficient is the same as “coherence”

18
31/01/2021

Dissimilarity between Binary Variables

 Example (Test-1 through Test-4 are asymmetric attributes):

   Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
   Jack   M        Y       N       P        N        N        N
   Mary   F        Y       N       P        N        P        N
   Jim    M        Y       P       N        N        N        N

 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N be 0
 Distances based only on the asymmetric attributes:
   $d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
   $d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
   $d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
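The same contingency-table computation can be sketched in Python; the helper name asym_binary_dissim is illustrative, and Y/P values are encoded as 1 and N as 0, as on the slide.

```python
def asym_binary_dissim(x, y):
    """Dissimilarity (r + s) / (q + r + s) for asymmetric binary vectors x, y.

    q: both 1; r: x=1, y=0; s: x=0, y=1. Matches on 0 (t) are ignored.
    """
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Asymmetric attributes only: fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```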

Standardizing Numeric Data

 Z-score: $z = \frac{x - \mu}{\sigma}$
   x: raw score to be standardized, μ: mean of the population, σ: standard deviation
   the distance between the raw score and the population mean in units of the standard deviation
   negative when the raw score is below the mean, “+” when above
 An alternative way: calculate the mean absolute deviation
   $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
   where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
 Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
 Using the mean absolute deviation is more robust than using the standard deviation
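A short sketch of both standardizations (the classic z-score and the mean-absolute-deviation variant) in Python with NumPy; the function names and the sample array are illustrative, not from the original notes.

```python
import numpy as np

def zscore(x):
    """Classic z-score: (x - mean) / standard deviation."""
    return (x - x.mean()) / x.std()

def zscore_mad(x):
    """z-score using the mean absolute deviation s_f instead of the std."""
    m = x.mean()
    s = np.mean(np.abs(x - m))
    return (x - m) / s

incomes = np.array([30.0, 36.0, 47.0, 50.0, 52.0, 110.0])
print(zscore(incomes))
print(zscore_mad(incomes))  # less affected by the outlier 110
```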

19
31/01/2021

Example:
Data Matrix and Dissimilarity Matrix

 Data matrix:
   point   attribute1   attribute2
   x1      1            2
   x2      3            5
   x3      2            0
   x4      4            5

 Dissimilarity matrix (with Euclidean distance):
         x1     x2     x3     x4
   x1    0
   x2    3.61   0
   x3    2.24   5.1    0
   x4    4.24   1      5.39   0

[Figure: the four points plotted in the plane]

Distance on Numeric Data: Minkowski Distance

 Minkowski distance: a popular distance measure
   $d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
   where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are
   two p-dimensional data objects, and h is the order
   (the distance so defined is also called the L-h norm)
 Properties
   d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
   d(i, j) = d(j, i) (symmetry)
   d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
 A distance that satisfies these properties is a metric

20
31/01/2021

Special Cases of Minkowski Distance

 h = 1: Manhattan (city block, L1 norm) distance
   E.g., the Hamming distance: the number of bits that are different
   between two binary vectors
   $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

 h = 2: Euclidean (L2 norm) distance
   $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

 h → ∞: “supremum” (Lmax norm, L∞ norm) distance
   This is the maximum difference between any component (attribute) of the vectors

Example: Minkowski Distance

 Data:
   point   attribute 1   attribute 2
   x1      1             2
   x2      3             5
   x3      2             0
   x4      4             5

 Dissimilarity matrices:

 Manhattan (L1)
   L1    x1    x2    x3    x4
   x1    0
   x2    5     0
   x3    3     6     0
   x4    6     1     7     0

 Euclidean (L2)
   L2    x1     x2     x3     x4
   x1    0
   x2    3.61   0
   x3    2.24   5.1    0
   x4    4.24   1      5.39   0

 Supremum (L∞)
   L∞    x1    x2    x3    x4
   x1    0
   x2    3     0
   x3    2     5     0
   x4    3     1     5     0
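These three matrices can be reproduced with SciPy, whose pdist function supports the Manhattan ("cityblock"), Euclidean and Chebyshev (supremum) metrics; a brief illustrative sketch follows.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[1, 2],   # x1
                   [3, 5],   # x2
                   [2, 0],   # x3
                   [4, 5]])  # x4

for name in ("cityblock", "euclidean", "chebyshev"):  # L1, L2, L-infinity
    d = squareform(pdist(points, metric=name))        # full symmetric matrix
    print(name)
    print(np.round(d, 2))
```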

21
31/01/2021

Ordinal Variables

 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
   replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
   map the range of each variable onto [0, 1] by replacing
   the i-th object in the f-th variable by
     $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
   compute the dissimilarity using methods for interval-scaled variables

Example: Ordinal variables

 Dissimilarity matrix:

44

22
31/01/2021

Attributes of Mixed Type

 A database may contain all attribute types
   Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects:
   $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
 f is binary or nominal:
   $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
 f is numeric: use the normalized distance
 f is ordinal:
   compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
   treat $z_{if}$ as numeric

Example: Mixed Type

 Dissimilarity matrix:
 For test-3: ; for the data:

46

23
31/01/2021

Cosine Similarity

 A document can be represented by thousands of attributes, each recording


the frequency of a particular word (such as keywords) or phrase in the
document.

 Other vector objects: gene features in micro-arrays, …


 Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

47

47

Example: Cosine Similarity

 cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
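The same computation as a short NumPy sketch (illustrative, not part of the original notes):

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

# dot product divided by the product of the vector lengths
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))  # ≈ 0.94
```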

24
31/01/2021

Activities

 Do exercises in Text [1]: 2.2.a-f, 2.4.a-b, 2.6.a-c, 2.8.a

49

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

50

50

25
31/01/2021

Summary

 Data attribute types: nominal, binary, ordinal, interval-


scaled, ratio-scaled
 Many types of data sets, e.g., numerical, text, graph,
Web, image.
 Gain insight into the data by:
 Basic statistical data description: central tendency,
dispersion, graphical displays
 Datavisualization: map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing
 Many methods have been developed but still an active
area of research

51

52

January 31, 2021 Data Mining: Concepts and Techniques

52

26
Introduction to Data Mining

Lecture 2 – Activities

1. Identify attribute types: Numeric, Nominal, or Binary, in the following data tables.

Outlook Temperature Humidity Windy Play

Sunny Hot High FALSE Yes


Overcast Mild Normal TRUE Yes
Rainy Cool High FALSE No

Outlook Temperature Humidity Windy Play

Sunny 85.0 70.0 FALSE Yes


Overcast 80.0 80.0 TRUE Yes
Rainy 75.0 85.0 FALSE No

2. Exercises in Text [1]: 2.2.a-f, 2.4.a-b, 2.6.a-c, 2.8.a


2.2. Suppose that the data for analysis includes the attribute age. The age values for the data tuples
are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f ) Show a boxplot of the data.

2.4. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the
following result.

Age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2

Age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

(a) Calculate the mean, median and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.

2.6. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.

1
Introduction to Data Mining

(b) Compute the Manhattan distance between the two objects.


(c) Compute the Minkowski distance between the two objects, using h = 3.

2.8. It is important to define or select similarity measures in data analysis. However, there is no
commonly accepted subjective similarity measure. Results can vary depending on the similarity
measures used. Nonetheless, seemingly different similarity measures may be equivalent after
some transformation.
Suppose we have the following two-dimensional data set:

(a) Consider the data as two-dimensional data points. Given a new data point, x = (1.4, 1.6) as a
query, rank the database points based on similarity with the query using Euclidean distance,
Manhattan distance, supremum distance, and cosine similarity.

2
Introduction to Data Mining

Answers:
2.2.
a. Mean = 809/27 = 30; Median = 25
b. Mode = 25 and 35
c. The midrange (average of the largest and smallest values in the data set) of the data is: (70 +
13)/2 = 41.5
d. Q1 = 20; Q3 = 35
e. 13, 20, 25, 35, 52, and outlier 70
f.

2.4.
a. For age: Mean = 46.44; Median = 51; σ = 12.85
For fat: Mean = 28.78; Median = 30.7; σ = 8.99
b.

3
Introduction to Data Mining

2.6.
a. Euclidean distance = 6.7;
b. Manhattan distance = 11;
c. Minkowski distance = 6.1534 (h=3).

2.8.
a.

The following rankings of the data points based on similarity:


Euclidean distance: x1, x4, x3, x5, x2
Manhattan distance: x1, x4, x3, x5, x2
Supremum distance: x1, x4, x3, x5, x2
Cosine similarity: x1, x3, x4, x2, x5

4
Homework - Session 2
Nguyen Tien Duc ITITIU18029

2.2
a.

b.

c.

d.

e.
f. Box plot:

2.4
a.
b. Box plots:
2.6
a. Euclidean distance:

b. Manhattan distance:

c. Minkowski distance with h = 3:

2.8

Formula 1:
Use for Euclidean (h = 2), Manhattan (h = 1), Supremum (h = infinity)

Formula 2 for cosine similarity:

a. The data points are ranked from higher to lower similarity to the query.
Euclidean distance data points rank:

Data points Distances

x1 0.14

x4 0.22

x3 0.28

x5 0.61

x2 0.67

Manhattan distance data points rank:

Data points Distances

x1 0.19

x4 0.3

x3 0.4

x5 0.7

x2 0.89
Supremum distance data points rank:

Data points Distances

x1 0.1

x4 0.19

x3 0.2

x2-x5 0.6

Cosine similarity data points rank:

Data points Similarities

x1 0.99999

x3 0.99996

x4 0.99903

x2 0.99575

x5 0.96536
2/28/2021

Lecture 3:
Data Preprocessing
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
Chapter 3 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber

Data Preprocessing
 Data Preprocessing: An Overview
2
 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
3

Data Quality: Why Preprocess the Data?

 Measures for data quality: A multidimensional view


 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling,

 Timeliness: timely update?
 Believability: how trustable the data are correct?
 Interpretability: how easily the data can be
understood?

4
2 / 2 8
/ 2 0 2
1

Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies 3

 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
5

Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary 6

6
2 / 2 8
/ 2 0 2
1

Data Cleaning

 Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
4
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 e.g., Occupation = “ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary = “−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age = “42”, Birthday = “03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
7

Incomplete (Missing) Data

 Data is not always available


 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time
of entry
 not register history or changes of the data
 Missing data may need to be inferred

8
2 / 2 8
/ 2 0 2
1

How to Handle Missing Data?

 Ignore the tuple: usually done when class label is missing


(when doing classification)—not effective when the % of
5
missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant: e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based, such as a Bayesian formula or
decision tree (see the sketch below for the mean-based strategies)
9
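A minimal pandas sketch of two of the automatic strategies above (filling with the attribute mean, and with the per-class attribute mean); the column names and the tiny DataFrame are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [30.0, None, 50.0, None],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the attribute mean of samples in the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```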

Noisy Data

 Noise: random error or variance in a measured variable


 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data

10

10
2 / 2 8
/ 2 0 2
1

How to Handle Noisy Data?

 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc. (see the sketch after this list)
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal
with possible outliers)
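A small sketch of equal-frequency binning followed by smoothing by bin means; the sorted price list is an invented example, not taken from the notes.

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # data already sorted

n_bins = 3
bins = np.array_split(prices, n_bins)   # equal-frequency (equal-depth) partition

# smoothing by bin means: every value is replaced by the mean of its bin
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```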

11

11

Data Cleaning as a Process


 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
 Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheels)
12

12
2 / 2 8
/ 2 0 2
1

Data Preprocessing
 Data Preprocessing: An Overview
7
 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
13

13

Data Integration

 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id  B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales, e.g.,
metric vs. British units

14

14
2 / 2 8
/ 2 0 2
1

Handling Redundancy in Data Integration

 Redundant data occur often when integration of


multiple databases
8

 Objectidentification: The same attribute or object


may have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
 Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
15

15

Correlation Analysis (Nominal Data)

 Χ² (chi-square) test:
   $\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
 The larger the Χ2 value, the more likely the variables are related
 The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated

 Both are causally linked to the third variable: population

16

16
2 / 2 8
/ 2 0 2
1

Chi-Square Calculation: An Example

                             Play chess   Not play chess   Sum (row)
 Like science fiction        250 (90)     200 (360)        450
 Not like science fiction    50 (210)     1000 (840)       1050
 Sum (col.)                  300          1200             1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated based on the data distribution in the two categories):
   $\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
 It shows that like_science_fiction and play_chess are correlated in the group
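The same Χ² value can be checked with SciPy; a brief sketch (correction=False gives the uncorrected statistic computed on the slide):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # like science fiction
                     [50, 1000]])   # not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof)   # ~507.93 with 1 degree of freedom
print(expected)              # [[90, 360], [210, 840]]
```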

Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson’s product-moment coefficient):
   $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$

   where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective
   means of A and B, σA and σB are the respective standard deviations of A and B,
   and Σ(aibi) is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do).
   The higher the value, the stronger the correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated
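A quick numerical check of the correlation coefficient with NumPy; the two sample arrays are illustrative (they reuse the stock values from the covariance example a few slides below).

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

r = np.corrcoef(a, b)[0, 1]           # Pearson's r
print(round(r, 3))

# Equivalent to the formula above (population standard deviations):
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) * a.std() * b.std())
print(round(r_manual, 3))
```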

Correlation (viewed as linear relationship)

 Correlation measures the linear relationship between objects
 To compute correlation, we standardize the data objects A and B,
   and then take their dot product:

   $a'_k = (a_k - mean(A)) / std(A)$
   $b'_k = (b_k - mean(B)) / std(B)$
   $correlation(A, B) = A' \cdot B'$

Covariance (Numeric Data)

 Covariance is similar to correlation:
   $Cov(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$
   Correlation coefficient: $r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$
   where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected
   values of A and B, and σA and σB are the respective standard deviations of A and B
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
   expected values
 Negative covariance: If CovA,B < 0, then if A is larger than its expected value, B is likely
   to be smaller than its expected value
 Independence: if A and B are independent, CovA,B = 0, but the converse is not true:
   Some pairs of random variables may have a covariance of 0 but are not independent.
   Only under some additional assumptions (e.g., the data follow multivariate normal
   distributions) does a covariance of 0 imply independence

Covariance: An Example

 The computation can be simplified as:
   $Cov(A,B) = E[A \cdot B] - \bar{A}\,\bar{B}$

 Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10),
(4, 11), (6, 14).

 Question: If the stocks are affected by the same industry trends, will their prices rise or
fall together?

 E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4

 E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6

 Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4

 Thus, A and B rise together since Cov(A, B) > 0.
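The same computation with NumPy; note that np.cov uses an n−1 denominator by default, so bias=True is needed to reproduce the population covariance used above. The sketch is illustrative.

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

cov_simplified = np.mean(A * B) - A.mean() * B.mean()   # E[A·B] - E[A]E[B]
cov_numpy = np.cov(A, B, bias=True)[0, 1]               # population covariance

print(cov_simplified, cov_numpy)  # both 4.0
```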

Data Preprocessing
 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
22

22
2 / 2 8
/ 2 0 2
1

Data Reduction Strategies


 Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)
 Regression and Log-Linear Models
 Histograms, clustering, sampling
 Data cube aggregation
 Data compression
23

23

Data Reduction 1: Dimensionality Reduction

 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

24

24
2 / 2 8
/ 2 0 2
1

Mapping Data to a New Space


 Fourier transform
 Wavelet transform
1 3

Two Sine Waves Two Sine Waves + Noise Frequency

25

25

What Is Wavelet Transform?

 Decomposes a signal into


different frequency subbands
 Applicable to n-dimensional
signals
 Data are transformed to
preserve relative distance
between objects at different
levels of resolution
 Allow natural clusters to
become more distinguishable
 Used for image compression
26

26
2 / 2 8
/ 2 0 2
1

Wavelet Transformation
Haar2 Daubechie4
 Discrete wavelet transform (DWT) for linear signal processing, multi-
resolution analysis 1 4

 Compressed approximation: store only a small fraction of the


strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
 Method:
 Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two set of data of length L/2
 Applies two functions recursively, until reaches the desired length

27

27

Wavelet Decomposition

 Wavelets: A math tool for space-efficient hierarchical


decomposition of functions
 S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to Ŝ = [2¾, −1¼, ½, 0, 0, −1, −1, 0]

 Compression: many small detail coefficients can be replaced by


0’s, and only the significant coefficients are retained

28

28
2 / 2 8
/ 2 0 2
1

Why Wavelet Transform?

 Use hat-shape filters


 Emphasize region where points cluster
1 5

 Suppress weaker information in their boundaries


 Effective removal of outliers
 Insensitive to noise, insensitive to input order
 Multi-resolution
 Detect arbitrary shaped clusters at different scales
 Efficient
 Complexity O(N)
 Only applicable to low dimensional data

29

29

Principal Component Analysis (PCA)


 Find a projection that captures the largest amount of variation in
data
 The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the eigenvectors
of the covariance matrix, and these eigenvectors define the new
space
x2

x1 30

30
2 / 2 8
/ 2 0 2
1

Principal Component Analysis (Steps)


 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range

 Compute k orthonormal (unit) vectors, i.e., principal components


 Each input data (vector) is a linear combination of the k principal
component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
 Works for numeric data only
31

31
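A compact sketch of these steps with scikit-learn; the random data matrix is purely illustrative, and in practice X would hold the normalized numeric attributes of the data set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 data vectors, n = 5 dimensions

X_norm = StandardScaler().fit_transform(X)    # step 1: normalize input data

pca = PCA(n_components=2)                     # keep k = 2 strongest components
X_reduced = pca.fit_transform(X_norm)

print(X_reduced.shape)                        # (100, 2): reduced representation
print(pca.explained_variance_ratio_)          # "significance" of each component
```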

Attribute Subset Selection

 Another way to reduce dimensionality of data


 Redundant attributes
 Duplicate much or all of the information contained in one or more
other attributes
 E.g., purchase price of a product and the amount of sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data mining task at
hand
 E.g., students' ID is often irrelevant to the task of predicting
students' GPA

32

32
2 / 2 8
/ 2 0 2
1

Heuristic Search in Attribute Selection

 There are 2d possible attribute combinations of d


attributes
 Typical heuristic attribute selection methods: 1 7

 Best single attribute under the attribute independence


assumption: choose by significance tests
 Best step-wise feature selection:
 The best single-attribute is picked first
 Then next best attribute condition to the first, ...

 Step-wise attribute elimination:


 Repeatedly eliminate the worst attribute

 Best combined attribute selection and elimination


 Optimal branch and bound:
 Use attribute elimination and backtracking

33

33

Attribute Creation (Feature Generation)

 Create new attributes (features) that can capture the important


information in a data set more effectively than the original ones
 Three general methodologies
 Attribute extraction
 Domain-specific

 Mapping data to new space (see: data reduction)


 E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)

 Attribute construction
 Combining features (see: discriminative frequent patterns in Chapter on “Advanced Classification”)
 Data discretization

34

34
2 / 2 8
/ 2 0 2
1

Data Reduction 2: Numerosity Reduction


 Reduce data volume by choosing alternative, smaller
forms of data representation
 Parametric methods (e.g., regression) 1 8

 Assume the data fits some model, estimate model


parameters, store only the parameters, and discard
the data (except possible outliers)
 Ex.:Log-linear models—obtain value at a point in m-
D space as the product on appropriate marginal
subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …

35

35

Parametric Data Reduction: Regression and


Log-Linear Models

 Linear regression
 Data modeled to fit a straight line
 Often uses the least-square method to fit the line
 Multiple regression
 Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
 Log-linear model
 Approximates discrete multidimensional probability
distributions

36

36
2 / 2 8
/ 2 0 2
1

Regression Analysis

 Regression analysis: A collective name for techniques for the modeling and analysis
of numerical data consisting of values of a dependent variable (also called response
variable or measurement) and of one or more independent variables (aka. explanatory
variables or predictors)
 The parameters are estimated so as to give a "best fit" of the data
 Most commonly the best fit is evaluated by using the least squares method, but other
criteria have also been used
 Used for prediction (including forecasting of time-series data), inference, hypothesis
testing, and modeling of causal relationships

[Figure: a fitted regression line y = x + 1 through the data points, showing the observed value Y1 and the fitted value Y1' at X1]
criteria have also been used

37

37

Regress Analysis and Log-Linear Models


 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
 Using the least squares criterion to the known values of Y1, Y2, …, X1,
X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset of
dimensional combinations
 Useful for dimensionality reduction and data smoothing
38

38
2 / 2 8
/ 2 0 2
1

Histogram Analysis

 Divide data into buckets and store the average (or sum) for each bucket
 Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth): each bucket holds roughly the same number of values

[Figure: histogram over the value range 10,000 to 90,000, showing the frequency of each bucket]

39

39

Clustering

 Partition data set into clusters based on similarity, and


store cluster representation (e.g., centroid and
diameter) only
 Can be very effective if data is clustered but not if data
is “smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms
 Cluster analysis will be studied in depth in Chapter 10
40

40

Sampling
 Sampling: obtaining a small sample s to represent the
whole data set N
 Allow a mining algorithm to run in complexity that is

potentially sub-linear to the size of the data


 Key principle: Choose a representative subset of the data
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified
sampling:
 Note: Sampling may not reduce database I/Os (page at a
time)

41

41

Types of Sampling

 Simple random sampling


 There is an equal probability of selecting any particular
item
 Sampling without replacement
 Once an object is selected, it is removed from the
population
 Sampling with replacement
 A selected object is not removed from the population
 Stratified sampling:
 Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
 Used in conjunction with skewed data
42

42
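To make the sampling schemes above concrete, here is a small illustrative Python sketch using only the standard library (not from the lecture; the toy records are invented):

import random
from collections import defaultdict

def srs_without_replacement(data, s):
    # Simple random sample: each item can be selected at most once (SRSWOR)
    return random.sample(data, s)

def srs_with_replacement(data, s):
    # A selected item stays in the population and may be drawn again (SRSWR)
    return [random.choice(data) for _ in range(s)]

def stratified_sample(records, key, fraction):
    # Partition the data by an attribute and draw approximately the same
    # percentage from every stratum; useful with skewed data
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample

# Example: a skewed class distribution that plain SRS could easily misrepresent
records = [("a", "rare")] * 5 + [("b", "common")] * 95
print(stratified_sample(records, key=lambda r: r[1], fraction=0.1))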

Sampling: With or without Replacement

[Figure: items drawn from the raw data with and without replacement]

43

43

Sampling: Cluster or Stratified Sampling

[Figure: raw data on the left; a cluster/stratified sample drawn from each partition on the right]

44

44

Data Cube Aggregation


 The lowest level of a data cube (base cuboid)
 The aggregated data for an individual entity of interest

 E.g., a customer in a phone calling data warehouse



 Multiple levels of aggregation in data cubes


 Further reduce the size of data to deal with

 Reference appropriate levels


 Use the smallest representation which is enough to solve the
task
 Queries regarding aggregated information should be answered
using data cube, when possible

45

45

Data Reduction 3: Data Compression


 String compression
 There are extensive theories and well-tuned algorithms
 Typically lossless, but only limited manipulation is possible
without expansion
 Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
 Time sequence is not audio
 Typically short and vary slowly with time
 Dimensionality and numerosity reduction may also be
considered as forms of data compression

46

46

Data Compression

[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression reconstructs only an approximation of the original data]

47

47

Data Preprocessing
 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
48

48

Data Transformation
 A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values

 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: Concept hierarchy climbing

49

49

Normalization
 Min-max normalization: to [new_minA, new_maxA]
      v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
 Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0].
      Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation):
      v' = (v − μA) / σA
 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
 Normalization by decimal scaling:
      v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

50

50
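A minimal Python sketch of the three normalization methods (an illustration, not code from the slides); the income example reproduces the numbers above:

import math

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # Min-max normalization to [new_min, new_max]
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Z-score normalization: how many standard deviations v lies from the mean
    return (v - mean) / std

def decimal_scaling(values):
    # Divide by 10^j; j is approximated here by the digit count of the largest |v|
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]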

Discretization

 Three types of attributes


 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic
rank
 Numeric—real numbers, e.g., integer or real numbers
 Discretization: Divide the range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification

51

51

Data Discretization Methods


 Typical methods: All the methods can be applied recursively
 Binning
 Top-down split, unsupervised

 Histogram analysis
 Top-down split, unsupervised

 Clustering analysis (unsupervised, top-down split or bottom-up


merge)
 Decision-tree analysis (supervised, top-down split)

 Correlation (e.g., 2) analysis (unsupervised, bottom-up merge)

52

52

Simple Discretization: Binning

 Equal-width (distance) partitioning


 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the

width of intervals will be: W = (B –A)/N.


 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well

 Equal-depth (frequency) partitioning


 Divides the range into N intervals, each containing approximately
same number of samples
 Good data scaling
 Managing categorical attributes can be tricky

53

53

Binning Methods for Data Smoothing


 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

54

54
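A short sketch (assuming the sorted price list above) of equal-frequency partitioning followed by smoothing by bin means and by bin boundaries:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

def equal_frequency_bins(values, n_bins):
    # Split sorted values into n_bins bins of equal count (assumes an exact division)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value by the (rounded) mean of its bin
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the closer of the two bin boundaries
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]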

Discretization by Classification &


Correlation Analysis
 Classification (e.g., decision tree analysis)

 Supervised: Given class labels, e.g., cancerous vs. benign

 Using entropy to determine split point (discretization point)

 Top-down, recursive split

 Details to be covered in Chapter “Classification”

 Correlation analysis (e.g., Chi-merge: χ2-based discretization)

 Supervised: use class information

 Bottom-up merge: find the best neighboring intervals (those having


similar distributions of classes, i.e., low χ2 values) to merge

 Merge performed recursively, until a predefined stopping condition

55

55

Concept Hierarchy Generation

 Concept hierarchy organizes concepts (i.e., attribute values)


hierarchically and is usually associated with each dimension in a data
warehouse
 Concept hierarchies facilitate drilling and rolling in data warehouses
to view data in multiple granularity
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
 Concept hierarchy can be automatically formed for both numeric and
nominal data—For numeric data, use discretization methods shown

56

56

Concept Hierarchy Generation


for Nominal Data

 Specification of a partial/total ordering of attributes explicitly


at the schema level by users or experts

 street < city < state < country


 Specification of a hierarchy for a set of values by explicit data
grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
57

57

Automatic Concept Hierarchy Generation

 Some hierarchies can be automatically generated


based on the analysis of the number of distinct values
per attribute in the data set
 The attribute with the most distinct values is
placed at the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_ state 365 distinct values

city 3567 distinct values

street 674,339 distinct values

58

58

Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
59

59

Summary

 Data quality: accuracy, completeness, consistency,


timeliness, believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entityidentification problem; Remove redundancies;
Detect inconsistencies
 Data reduction
 Dimensionality reduction; Numerosity reduction; Data
compression
 Data transformation and data discretization
 Normalization; Concept hierarchy generation

60

60

3/10/2021

Lecture 4:
Data mining knowledge
representation
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
Chapter 3 in Data Mining: Practical Machine Learning Tools and Techniques
(Third Edition), by Ian H.Witten, Eibe Frank and Eibe Frank

1
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters
3

Output: representing structural patterns

 Many different ways of representing patterns

 Decision trees, rules, instance-based, …

 Also called “knowledge” representation

 Representation determines inference method

 Understanding the output is the key to understanding the underlying

learning methods

 Different types of output for different learning problems (e.g.

classification, regression, …)
4

2
3/10/2021

Tables
► Simplest way of representing output:
► Use the same format as input!
► Decision table for the weather problem:
Outlook Humidity Play
Sunny High No
Sunny Normal Yes
Overcast High Yes
Overcast Normal Yes
Rainy High No
Rainy Normal No

► Main problem: selecting the right attributes

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

3
3/10/2021

Linear models
 Another simple representation
 Regression model
 Inputs (attribute values) and output are all numeric
 Output is the sum of weighted attribute values
 The trick is to find good values for the weights

A linear regression function for the CPU


performance data

PRP = 37.06 + 2.47CACH


8

4
3/10/2021

Linear models for classification

 Binary classification
 Line separates the two classes
 Decision boundary - defines where the decision changes
from one class value to the other
 Prediction is made by plugging in observed values of
the attributes into the expression
 Predict one class if output ≥ 0, and the other class if
output < 0
 Boundary becomes a high-dimensional plane
(hyperplane) when there are multiple attributes

Separating setosas from versicolors

2.0 – 0.5PETAL-LENGTH – 0.8PETAL-WIDTH = 0

10
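A tiny sketch of using this decision boundary for prediction; the coefficients are the ones shown above, while the two test flowers and the assumption that the positive side corresponds to Iris-setosa are illustrative:

def predict(petal_length, petal_width):
    # Decision boundary from the slide: 2.0 - 0.5*PETAL-LENGTH - 0.8*PETAL-WIDTH = 0
    score = 2.0 - 0.5 * petal_length - 0.8 * petal_width
    return "Iris-setosa" if score >= 0 else "Iris-versicolor"

print(predict(1.4, 0.2))   # score > 0: setosa side of the boundary
print(predict(4.5, 1.5))   # score < 0: versicolor side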

5
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

11

Trees

 “Divide-and-conquer” approach produces tree


 Nodes involve testing a particular attribute
 Usually, attribute value is compared to constant
 Other possibilities:
 Comparing values of two attributes
 Using a function of one or more attributes
 Leaves assign classification, set of classifications, or
probability distribution to instances
 Unknown instance is routed down the tree

12

6
3/10/2021

Nominal and numeric attributes

Nominal:

number of children usually equal to number of values


 attribute won’t get tested more than once
●Other possibility: division into two subsets

Numeric:

test whether value is greater or less than constant


 attribute may get tested several times
●Other possibility: three-way split (or multi-way split)

● Integer: less than, equal to, greater than


● Real: below, within, above

13

Missing values

Does absence of value have some significance?


●Yes  “missing” is a separate value

●No  “missing” must be treated in a special way

Solution A: assign instance to most popular branch


Solution B: split instance into pieces

●Pieces receive weight according to fraction of training instances that


go down each branch
●Classifications from leaf nodes are combined using the weights that

have percolated to them

14

7
3/10/2021

Trees for numeric prediction

 Regression: the process of computing an expression that


predicts a numeric quantity
 Regression tree: “decision tree” where each leaf predicts a
numeric quantity
 Predicted value is average value of training instances that reach the
leaf
 Model tree: “regression tree” with linear regression models
at the leaf nodes
 Linear patches approximate continuous function

15

Linear regression for the CPU data

PRP =
- 56.1
+ 0.049 MYCT
+ 0.015 MMIN
+ 0.006 MMAX
+ 0.630 CACH
- 0.270 CHMIN
+ 1.46 CHMAX

16

8
3/10/2021

Regression tree for the CPU data

17

Model tree for the CPU data

18

9
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

19

Classification rules
●Popular alternative to decision trees
●Antecedent (pre-condition): a series of tests (just like

the tests at the nodes of a decision tree)


●Tests are usually logically ANDed together (but may

also be general logical expressions)


●Consequent (conclusion): classes, set of classes, or

probability distribution assigned by rule


●Individual rules are often logically ORed together

 Conflicts arise if different conclusions apply

20

10
3/10/2021

From trees to rules

●Easy: converting a tree into a set of


rules
 One rule for each leaf:
●Antecedent contains a condition for every node
on the path from the root to the leaf
●Consequent is class assigned by the leaf

●Produces rules that are unambiguous


Doesn’t matter in which order they are
executed
21

From rules to trees

● More difficult: transforming a rule set into a tree


Tree cannot easily express disjunction between rules

● Example: rules which test different attributes


If a and b then x
If c and d then x

●Symmetry needs to be broken


●Corresponding tree contains identical subtrees

( “replicated subtree problem”)

22

11
3/10/2021

A tree for a simple disjunction

23

The exclusive-or problem

If x = 1 and y = 0
then class = a
If x = 0 and y = 1
then class = a
If x = 0 and y = 0
then class = b
If x = 1 and y = 1
then class = b

24

12
3/10/2021

A tree with a replicated subtree

If x = 1 and y = 1
then class = a
If z = 1 and w = 1
then class = a
Otherwise class = b

25

“Nuggets” of knowledge

●Are rules independent pieces of knowledge? (It


seems easy to add a rule to an existing rule base.)
●Problem: ignores how rules are executed

●Two ways of executing a rule set:

Ordered set of rules (“decision list”)


● Order is important for interpretation
Unordered set of rules
●Rules may overlap and lead to different conclusions for the same

instance

26

13
3/10/2021

Interpreting rules

● What if two or more rules conflict?


Give no conclusion at all?
Go with rule that is most popular on training data?

…

● What if no rule applies to a test instance?


Give no conclusion at all?
Go with class that is most frequent in training data?

…

27

Special case: boolean class


●Assumption: if instance does not belong to class
“yes”, it belongs to class “no”
●Trick: only learn rules for class “yes” and use

default rule for “no”


If x = 1 and y = 1 then class = a
If z = 1 and w = 1 then class = a
Otherwise class = b

●Order of rules is not important. No conflicts!


●Rule can be written in disjunctive normal form

28

14
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

29

Association rules

 Association rules…
 … can predict any attribute and combinations of attributes
 … are not intended to be used together as a set
 Problem: immense number of possible associations
 Output needs to be restricted to show only the most predictive
associations  only those with high support and high confidence

30

15
3/10/2021

Support and confidence of a rule


 Support: number of instances predicted correctly
 Confidence: number of correct predictions, as
proportion of all instances that rule applies to
 Example: 4 cool days with normal humidity

If temperature = cool then humidity = normal

 Support= 4, confidence = 100%


 Normally: minimum support and confidence pre-
specified (e.g. 58 rules with support ≥ 2 and
confidence ≥ 95% for weather data)

31
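A sketch of counting support and confidence for a single candidate rule; the four toy instances below are invented (in the weather data the rule above has support 4 and confidence 100%):

def support_confidence(instances, antecedent, consequent):
    # antecedent and consequent are dicts of attribute: value tests
    matches_ante = [x for x in instances
                    if all(x[a] == v for a, v in antecedent.items())]
    matches_both = [x for x in matches_ante
                    if all(x[a] == v for a, v in consequent.items())]
    support = len(matches_both)
    confidence = len(matches_both) / len(matches_ante) if matches_ante else 0.0
    return support, confidence

data = [
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "hot",  "humidity": "high"},
    {"temperature": "cool", "humidity": "normal"},
]
print(support_confidence(data,
                         {"temperature": "cool"},
                         {"humidity": "normal"}))   # (3, 1.0)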

Interpreting association rules

 Interpretation is not obvious:


If windy = false and play = no then outlook = sunny
and humidity = high

is not the same as


If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high

 It means that the following also holds:


If humidity = high and windy = false and play = no
then outlook = sunny

32

16
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters 33

33

Rules with exceptions

 Idea:
allow rules to have exceptions
 Example: rule for iris data
If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor

 New instance:
Sepal Sepal Petal Petal Type
length width length width
5.1 3.5 2.6 0.2 Iris-setosa

 Modified rule:
If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor
EXCEPT if petal-width < 1.0 then Iris-setosa

34

17
3/10/2021

A more complex example

 Exceptions to exceptions to exceptions …

default: Iris-setosa
except if petal-length ≥ 2.45 and petal-length < 5.355
and petal-width < 1.75
then Iris-versicolor
except if petal-length ≥ 4.95 and petal-width < 1.55
then Iris-virginica
else if sepal-length < 4.95 and sepal-width ≥ 2.45
then Iris-virginica
else if petal-length ≥ 3.35
then Iris-virginica
except if petal-length < 4.85 and sepal-length < 5.95
then Iris-versicolor

35

Advantages of using exceptions

 Rules can be updated incrementally


 Easy to incorporate new data
 Easy to incorporate domain knowledge

 People often think in terms of exceptions


 Each conclusion can be considered just in the
context of rules and exceptions that lead to it
 Locality property is important for understanding large
rule sets
 “Normal” rule sets don’t offer this advantage

36

18
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

37

More on exceptions

 Default...except if...then...
is logically equivalent to
if...then...else
(where the else specifies what the default did)
 But:exceptions offer a psychological advantage
 Assumption: defaults and tests early on apply
more widely than exceptions further down
 Exceptions reflect special cases

38

19
3/10/2021

Rules involving relations

●So far: all rules involved comparing an attribute-value


to a constant (e.g. temperature < 45)
●These rules are called “propositional” because they

have the same expressive power as propositional logic


●What if problem involves relationships between

attributes
Can’t be expressed with propositional rules
More expressive representation required

39

The shapes problem

● Target concept: standing up


● Shaded: standing
● Unshaded: lying

40

20
3/10/2021

A propositional solution
Width Height Sides Class
2 4 4 Standing
3 6 4 Standing
4 3 4 Lying
7 8 3 Standing
7 6 3 Lying
2 9 4 Standing
9 1 4 Lying
10 2 3 Lying

If width ≥ 3.5 and height < 7.0
then lying
If height ≥ 3.5 then standing

41

A relational solution

●Comparing attributes with each other


If width > height then lying
If height > width then standing

●Generalizes better to new data


●Standard relations: =, <, >

●But: learning relational rules is costly

●Simple solution: add extra attributes

(e.g. a binary attribute is width < height?)

42

21
3/10/2021

Rules with variables


Using variables and multiple relations:

If height_and_width_of(x,h,w) and h > w


then standing(x)

The top of a tower of blocks is standing:


If height_and_width_of(x,h,w) and h > w


and is_top_of(y,x)
then standing(x)

The whole tower is standing:


If is_top_of(x,z) and
height_and_width_of(z,h,w) and h > w
and is_rest_of(x,y)and standing(y)
then standing(x)
If empty(x) then standing(x)

Recursive definition!

43

Inductive logic programming

●Recursive definition can be seen as logic program


●Techniques for learning logic programs stem from

the area of “inductive logic programming” (ILP)


●But: recursive definitions are hard to learn

Also: few practical problems require recursion


Thus: many ILP techniques are restricted to non-recursive
definitions to make learning easier

44

22
3/10/2021

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

45

Instance-based representation

 Simplest form of learning: rote learning


 Training instances are searched for instance that most closely
resembles new instance
 The instances themselves represent the knowledge
 Also called instance-based learning
 Similarity function defines what’s “learned”
 Instance-based learning is lazy learning
 Methods: nearest-neighbor, k-nearest-neighbor, …

46

23
3/10/2021

The distance function

 Simplest case: one numeric attribute


 Distance is the difference between the two attribute values involved
(or a function thereof)
 Several numeric attributes: normally, Euclidean distance is
used and attributes are normalized
 Nominal attributes: distance is set to 1 if values are
different, 0 if they are equal
 Are all attributes equally important?
 Weighting the attributes might be necessary

47
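A sketch of the mixed-attribute distance just described (range-normalized differences for numeric attributes, 0/1 for nominal ones); the attribute names and the ranges passed in are illustrative assumptions:

import math

def distance(x, y, numeric_ranges):
    # numeric_ranges maps a numeric attribute to its (min, max) for normalization;
    # attributes not listed there are treated as nominal (0 if equal, else 1)
    total = 0.0
    for attr in x:
        if attr in numeric_ranges:
            lo, hi = numeric_ranges[attr]
            d = (x[attr] - y[attr]) / (hi - lo)
        else:
            d = 0.0 if x[attr] == y[attr] else 1.0
        total += d * d
    return math.sqrt(total)

a = {"temperature": 64, "humidity": 65, "outlook": "sunny"}
b = {"temperature": 72, "humidity": 95, "outlook": "rainy"}
print(distance(a, b, {"temperature": (64, 85), "humidity": (65, 96)}))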

Learning prototypes

 Only those instances involved in a decision need to be stored


 Noisy instances should be filtered out
 Idea: only use prototypical examples

48

24
3/10/2021

Rectangular generalizations

 Nearest-neighbor rule is used outside rectangles


 Rectangles are rules! (But they can be more
conservative than “normal” rules.)
 Nested rectangles are rules with exceptions

49

Knowledge representation
 Tables
 Linear models
 Trees
 Rules
 Classification rules
 Association rules
 Rules with exceptions
 More expressive rules
 Instance-based representation
 Clusters

50

25
3/10/2021

Representing clusters I

[Figure: a simple 2-D representation of disjoint clusters, and a Venn diagram of overlapping clusters]

51

Representing clusters II

Probabilistic Dendrogram
assignment
1 2 3

a 0.4 0.1 0.5


b 0.1 0.8 0.1
c 0.3 0.3 0.4
d 0.1 0.1 0.8
e 0.4 0.2 0.4
f 0.1 0.4 0.5
g 0.7 0.2 0.1
h 0.5 0.4 0.1

NB: dendron is the Greek


word for tree

52

26
3/10/2021

Summary

27
Introduction to Data mining

Lecture 4 – Activities
Preprocessing and Knowledge Representation

1. Given the dataset:

https://drive.google.com/drive/folders/1kkXcdni6SN-2Thp6j2YiV1rMXflSqVkO

userId movieId rating s1 s2 s3 s4 s5 s6 s7 s8


205229 108979 5 1 1 3 4 2 2 5 5
205229 6947 4 1 1 3 4 4 2 5 4
205229 117444 4 1 4 4 2 2 2 4 2
205229 150548 4 2 2 4 2 4 2 4 1
205229 136542 5 1 1 5 1 1 2 5 2
117112 77455 3.5 1 2 2 2 4 4 4 4
144726 1303 4 1 1 ? 4 3 2 5 3
144726 103306 3.5 1 ? 3 2 1 2 4 2
144726 2060 4.5 1 1 4 2 1 1 5 2
144726 135534 3.5 2 1 5 1 1 1 5 2
144726 128542 3.5 1 1 4 3 1 2 5 2
200400 26939 4 1 2 4 2 2 2 4 1
200400 40491 3.5 1 2 4 2 2 2 4 1
125112 104337 3.5 ? ? ? ? ? ? ? ?
125112 162082 4 2 ? 3 5 5 ? 4 4
125112 96966 4.5 ? ? ? ? ? ? ? ?
125112 165551 4.5 1 3 4 3 3 2 5 4
125112 89759 5 ? ? ? ? ? ? ? ?
113031 5046 3.5 1 1 3 1 1 2 4 1
113031 116855 4 1 1 4 1 1 5 5 1

Propose an application using this dataset, e.g., supply chain management, and do the following tasks to
preprocess the data:

1. Handle missing values

2. Find correlated attributes (applying Correlation Analysis to create a correlation matrix), and important
attributes.

1
Introduction to Data mining

2. Describe how to preprocess the following web log data in order to mine web access sequences, and
represent the preprocessed data.

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0

uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0

uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0

uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0

ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:09 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713

uplherc.upl.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0

slppp6.intermind.net - - [01/Aug/1995:00:00:10 -0400] "GET /history/skylab/skylab.html HTTP/1.0" 200 1687

piweba4y.prodigy.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853

slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET /history/skylab/skylab-small.gif HTTP/1.0" 200 9202

slppp6.intermind.net - - [01/Aug/1995:00:00:12 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635

ix-esc-ca2-07.ix.netcom.com - - [01/Aug/1995:00:00:12 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 200 1173

slppp6.intermind.net - - [01/Aug/1995:00:00:13 -0400] "GET /history/apollo/images/apollo-logo.gif HTTP/1.0" 200 3047

uplherc.upl.com - - [01/Aug/1995:00:00:14 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0

References:

- Web log format:

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET


/apache_pb.gif HTTP/1.0" 200 2326

A "-" in a field indicates missing data.

- 127.0.0.1 is the IP address of the client (remote host) which made the request to the server.
- user-identifier is the RFC 1413 identity of the client.
- frank is the userid of the person requesting the document.
- [10/Oct/2000:13:55:36 -0700] is the date, time, and time zone that the request was received,
by default in strftime format %d/%b/%Y:%H:%M:%S %z.
- "GET /apache_pb.gif HTTP/1.0" is the request line from the client. The
method GET, /apache_pb.gif the resource requested, and HTTP/1.0 the HTTP protocol.
- 200 is the HTTP status code returned to the client. 2xx is a successful response, 3xx a
redirection, 4xx a client error, and 5xx a server error.
- 2326 is the size of the object returned to the client, measured in bytes.

- A preprocessed web log data


(http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data)

2
Introduction to Data mining

3. Describe how to represent the following data in a vector space model (document-term matrix).

Analysis models of technical and economic data of mining enterprises based on big data
1 analysis
2 Spatial and Spatio-temporal Data Mining
3 Hair data model: A new data model for Spatio-Temporal data mining
4 A Data Stream Mining System
DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster
5 nodes for spatial data mining algorithms
6 Big data gathering and mining pipelines for CRM using open-source
7 The Research on Safety Monitoring System of Coal Mine Based on Spatial Data Mining
CAKE – Classifying, Associating and Knowledge DiscovEry - An Approach for Distributed Data
8 Mining (DDM) Using PArallel Data Mining Agents (PADMAs)
9 Digital construction of coal mine big data for different platforms based on life cycle
10 Privacy-Preserving Big Data Stream Mining: Opportunities, Challenges, Directions
11 Analysis Methods of Workflow Execution Data Based on Data Mining
12 Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases
13 Comparison of Tools for Data Mining and Retrieval in High Volume Data Stream
14 Adaptive Differentially Private Data Release for Data Sharing and Data Mining
15 Domain Driven Data Mining (D3M)
Notice of Retraction<br>The research of building production-oriented data mart for mine
16 enterprises based on data mining
17 Study on land use of changping district with spatial data mining method
Linked Open Data mining for democratization of big dataData Mining Library for Big Data
18 Processing Platforms: A Case Study-Sparkling Water Platform
19 Towards Data-Oriented Hospital Services: Data Mining-Based Hospital Management
20 Developing an Integrated Time-Series Data Mining Environment for Medical Data Mining

3
Introduction to Data mining

Answers:

1.

userId movieId rating s1 s2 s3 s4 s5 s6 s7 s8


205229 108979 5 1 1 3 4 2 2 5 5
205229 6947 4 1 1 3 4 4 2 5 4
205229 117444 4 1 4 4 2 2 2 4 2
205229 150548 4 2 2 4 2 4 2 4 1
205229 136542 5 1 1 5 1 1 2 5 2
117112 77455 3.5 1 2 2 2 4 4 4 4
144726 1303 4 1 1 3? 4 3 2 5 3
144726 103306 3.5 1 1? 3 2 1 2 4 2
144726 2060 4.5 1 1 4 2 1 1 5 2
144726 135534 3.5 2 1 5 1 1 1 5 2
144726 128542 3.5 1 1 4 3 1 2 5 2
200400 26939 4 1 2 4 2 2 2 4 1
200400 40491 3.5 1 2 4 2 2 2 4 1
125112 104337 3.5 1? 2? 4? 2? 1? 2? 4? 1?
125112 162082 4 2 1? 3 5 5 2? 4 4
125112 96966 4.5 1? 2? 4? 2? 1? 1? 5? 4?
125112 165551 4.5 1 3 4 3 3 2 5 4
125112 89759 5 1? 1? 5? 1? 1? 2? 5? 2?
113031 5046 3.5 1 1 3 1 1 2 4 1
113031 116855 4 1 1 4 1 1 5 5 1

Using Population standard deviation

rating s1 s2 s3 s4 s5 s6 s7 s8
rating 1
s1 -0.17436 1
s2 -0.06547 -0.1131 1
0.137016
s3 0.344 0.0608 1
-0.07058 -0.54144
s4 0.059 0.1346 1
s5 -0.07835 0.420013 0.167015 -0.54634 0.670483 1
-0.00699 0.170926
s6 -0.172 -0.2048 -0.33 -0.13 1
-
s7 0.567964 -0.183 -0.38095 0.360588 0.061467 -0.278 0.12438 1
-
-0.02187 -0.01941 -0.45844 0.668603 0.474526 0.07907 0.361009
s8 0.3814 1

4
Introduction to Data mining

μ(rating) = 4.05, σ(rating) = 0.522
μ(s1) = 1.15, σ(s1) = 0.357
μ(s2) = 1.55, σ(s2) = 0.8
μ(s3) = 3.75, σ(s3) = 0.77
μ(s4) = 2.3, σ(s4) = 1.144
μ(s5) = 2.05, σ(s5) = 1.28
μ(s6) = 2.1, σ(s6) = 0.88
μ(s7) = 4.55, σ(s7) = 0.5
μ(s8) = 2.4, σ(s8) = 1.2806

Using Sample standard deviation

rating s1 s2 s3 s4 s5 s6 s7 s8
rating 1
s1 -0.15 1
-0.11311
s2 -0.072 1
s3 0.326467 0.130165 0.057761 1
s4 0.055651 0.127848 -0.06705 -0.51437 1
s5 -0.07444 0.399013 0.158665 -0.51902 0.636959 1
s6 -0.1638 -0.19457 -0.00664 -0.31375 -0.1214 0.16238 1
- -
s7 0.539566 -0.17381 -0.3619 0.342559 0.058394 0.26407 0.11816 1
-
s8 0.362375 -0.02078 -0.01844 -0.43552 0.635173 0.4508 0.07512 0.342959 1

μ(rating) = 4.05, s(rating) = 0.535576
μ(s1) = 1.15, s(s1) = 0.366348
μ(s2) = 1.55, s(s2) = 0.825578
μ(s3) = 3.75, s(s3) = 0.786398
μ(s4) = 2.3, s(s4) = 1.174286
μ(s5) = 2.05, s(s5) = 1.316894
μ(s6) = 2.1, s(s6) = 0.91191
μ(s7) = 4.55, s(s7) = 0.5
μ(s8) = 2.4, s(s8) = 1.313893

5
Introduction to Data mining

2. Generate WAS

Hint: create a database warehouse consisting of three-dimension tables and one fact table as shown
below:

1 session = 1 hour

Ref: See Lab4.

3.

VSM:

Set a list M of documents, M = null;

Set a list T of vocabularies, T = null;

For each line L

6
Introduction to Data mining

Split L into terms using spaces

Set a list Tm of vocabularies for each doc, Tm = null

For each ti in L

If ti not in T {add ti into T, and count ti}

Else count ti in T;

If ti not in Tm {add ti into Tm, and count ti for this Tm}

Else count ti in Tm;

Add Tm into M

7
3/17/2021

Lecture 5:
Evaluating what’s been
learned
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References: Chapter 5 in Data Mining: Practical Machine Learning Tools and


Techniques (Third Edition), by Ian H.Witten, Eibe Frank and Eibe Frank
1

Evaluating what’s been learned

 Issues: training and testing


 Predicting performance: confidence limits
 Holdout, cross-validation, bootstrap
 Comparing schemes: the t-test
 Predicting probabilities: loss functions
 Cost-sensitive measures
 Evaluating numeric prediction
 The Minimum Description Length principle

1
3/17/2021

Evaluation: the key to success

 How predictive is the model we learned?


 Error on the training data is not a good indicator of
performance on future data
 Otherwise 1-NN would be the optimum classifier!
 Simple solution that can be used if lots of (labeled)
data is available:
 Split data into training and test set
 However: (labeled) data is usually limited
 More sophisticated techniques need to be used

Issues in evaluation

 Statistical reliability of estimated differences in


performance ( significance tests)
 Choice of performance measure:
 Number of correct classifications
 Accuracy of probability estimates
 Error in numeric predictions
 Costs assigned to different types of errors
 Many practical applications involve costs

2
3/17/2021

Training and testing I

 Natural performance measure for classification


problems: error rate
 Success: instance’s class is predicted correctly
 Error: instance’s class is predicted incorrectly
 Error rate: proportion of errors made over the whole
set of instances
 Resubstitution error: error rate obtained from
training data

Training and testing II

 Test set: independent instances that have played no


part in formation of classifier
 Assumption: both training data and test data are
representative samples of the underlying problem
 Test and training data may differ in nature
 Example: classifiers built using customer data from two
different towns A and B
 To estimate performance of classifier from town A in completely new
town, test it on data from B

3
3/17/2021

Making the most of the data

 Once evaluation is complete, all the data can be used


to build the final classifier
 Generally, the larger the training data the better the
classifier
 The larger the test data the more accurate the error
estimate
 Holdout procedure: method of splitting original data
into training and test set
 Dilemma: ideally both training set and test set should be
large!

Predicting performance

 Assume the estimated error rate is 25%. How close is


this to the true error rate?
 Depends on the amount of test data
 Prediction is just like tossing a (biased!) coin
 “Head” is a “success”, “tail” is an “error”
 In statistics, a succession of independent events like
this is called a Bernoulli process
 Statistical theory provides us with confidence intervals for
the true underlying proportion

4
3/17/2021

Confidence intervals

 We can say: p lies within a certain specified interval


with a certain specified confidence
 Example: S=750 successes in N=1000 trials
 Estimated success rate: 75%
 How close is this to true success rate p?
 Answer: with 80% confidence p in [73.2,76.7]
 Another example: S=75 and N=100
 Estimated success rate: 75%
 With 80% confidence p in [69.1,80.1]

Mean and variance

 Mean and variance for a Bernoulli trial:


p, p (1–p)
 Expected success rate f=S/N
 Mean and variance for f : p, p (1–p)/N
 For large enough N, f follows a Normal
distribution
 c% confidence interval [−z ≤ X ≤ z] for random
variable with 0 mean is given by:
      Pr[−z ≤ X ≤ z] = c
 With a symmetric distribution:
      Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]

10

10

5
3/17/2021

Confidence limits
 Confidence limits for the normal distribution with 0 mean and a variance of 1:

      Pr[X ≥ z]    z
      0.1%         3.09
      0.5%         2.58
      1%           2.33
      5%           1.65
      10%          1.28
      20%          0.84
      40%          0.25

 Thus:
      Pr[−1.65 ≤ X ≤ 1.65] = 90%

 To use this we have to reduce our random variable f to


have 0 mean and unit variance

11

11

Transforming f

 Transformed value for f :
      (f − p) / √( p (1 − p) / N )
(i.e. subtract the mean and divide by the standard deviation)
 Resulting equation:
      Pr[ −z ≤ (f − p) / √( p (1 − p) / N ) ≤ z ] = c
 Solving for p :
      p = ( f + z²/2N ± z √( f/N − f²/N + z²/4N² ) ) / ( 1 + z²/N )

12

12

6
3/17/2021

Examples

 f = 75%, N = 1000, c = 80% (so that z = 1.28):


𝑝 ∈ 0.732,0.767

 f = 75%, N = 100, c = 80% (so that z = 1.28):


𝑝 ∈ 0.691,0.801

 Note that normal distribution assumption is only valid for


large N (i.e. N > 100)
 f = 75%, N = 10, c = 80% (so that z = 1.28):
𝑝 ∈ 0.549,0.881

(should be taken with a grain of salt)

13

13
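The two examples can be reproduced with a small sketch that solves the resulting equation for p (the usual quadratic solution); this is an illustration, not code from the lecture:

import math

def confidence_interval(f, n, z):
    # Solve Pr[-z <= (f - p)/sqrt(p(1-p)/N) <= z] = c for p
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# z = 1.28 corresponds to c = 80% confidence
print(confidence_interval(0.75, 1000, 1.28))   # about (0.732, 0.767)
print(confidence_interval(0.75, 100, 1.28))    # about (0.691, 0.801)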

Holdout estimation

 What to do if the amount of data is limited?


 The holdout method reserves a certain amount for
testing and uses the remainder for training
 Usually: one third for testing, the rest for training
 Problem: the samples might not be representative
 Example: class might be missing in the test data
 Advanced version uses stratification
 Ensures that each class is represented with approximately
equal proportions in both subsets

14

14

7
3/17/2021

Repeated holdout method

 Holdout estimate can be made more reliable by


repeating the process with different subsamples
 In each iteration, a certain proportion is randomly selected for
training (possibly with stratification)
 The error rates on the different iterations are averaged to yield
an overall error rate
 This is called the repeated holdout method
 Still not optimum: the different test sets overlap
 Can we prevent overlapping?

15

15

Cross-validation

 Cross-validation avoids overlapping test sets


 First step: split data into k subsets of equal size
 Second step: use each subset in turn for testing, the
remainder for training
 Called k-fold cross-validation
 Often the subsets are stratified before the cross-
validation is performed
 The error estimates are averaged to yield an
overall error estimate

16

16
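A stripped-down sketch of k-fold cross-validation (shuffled but not stratified); the train and test callables and the toy majority-class learner are placeholders for whatever scheme is being evaluated:

import random

def k_fold_cv(instances, k, train, test):
    # Shuffle once, split into k folds, use each fold as the test set in turn
    data = list(instances)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    error_rates = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_set)
        errors = sum(1 for x in test_set if test(model, x) != x["class"])
        error_rates.append(errors / len(test_set))
    # Average the k estimates to yield the overall error estimate
    return sum(error_rates) / k

def train_majority(train_set):
    # Trivial "learner": always predict the most frequent class
    labels = [x["class"] for x in train_set]
    return max(set(labels), key=labels.count)

data = [{"class": "yes"}] * 6 + [{"class": "no"}] * 4
print(k_fold_cv(data, 5, train=train_majority, test=lambda model, x: model))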

8
3/17/2021

More on cross-validation

 Standard method for evaluation: stratified ten-fold


cross-validation
 Why ten?
 Extensive experiments have shown that this is the best choice
to get an accurate estimate
 There is also some theoretical evidence for this
 Stratification reduces the estimate’s variance
 Even better: repeated stratified cross-validation
 E.g. ten-fold cross-validation is repeated ten times and results
are averaged (reduces the variance)

17

17

Leave-One-Out cross-validation

 Leave-One-Out:
a particular form of cross-validation:
 Set number of folds to number of training instances
 I.e., for n training instances, build classifier n times
 Makes best use of the data
 Involves no random subsampling
 Very computationally expensive
 (exception: NN)

18

18

9
3/17/2021

Leave-One-Out-CV and stratification

 Disadvantage of Leave-One-Out-CV:
stratification is not possible
 It guarantees a non-stratified sample because
there is only one instance in the test set!

19

19

The bootstrap

 CV uses sampling without replacement


 The same instance, once selected, cannot be selected
again for a particular training/test set
 The bootstrap uses sampling with replacement to
form the training set
 Sample a dataset of n instances n times with replacement
to form a new dataset of n instances
 Use this data as the training set
 Use the instances from the original
dataset that don’t occur in the new
training set for testing
20

20

10
3/17/2021

The 0.632 bootstrap

 Also called the 0.632 bootstrap


 A particular instance has a probability of 1 – 1/n of not being picked
 Thus its probability of ending up in the test
data is:
      (1 − 1/n)^n ≈ e^(−1) ≈ 0.368

 This means the training data will contain


approximately 63.2% of the instances

21

21

Estimating error with the bootstrap

 The error estimate on the test data will be very


pessimistic
 Trained on just ~63% of the instances
 Therefore, combine it with the resubstitution
error:
𝑒𝑟𝑟 = 0.632 × 𝑒test instances + 0.368 × 𝑒training_instances

 The resubstitution error gets less weight than


the error on the test data
 Repeat process several times with different
replacement samples; average the results

22

22
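A sketch of the 0.632 bootstrap (an illustration; the majority-class model and toy data are invented, and instances are assumed to be distinguishable so the left-out test set can be formed):

import random

def bootstrap_estimate(instances, train, error_rate, repeats=10):
    n = len(instances)
    estimates = []
    for _ in range(repeats):
        # Sample n instances WITH replacement for training (~63.2% unique instances)
        train_set = [random.choice(instances) for _ in range(n)]
        test_set = [x for x in instances if x not in train_set]
        if not test_set:          # extremely unlikely, but keep the sketch safe
            continue
        model = train(train_set)
        e_test = error_rate(model, test_set)
        e_train = error_rate(model, train_set)   # resubstitution error
        estimates.append(0.632 * e_test + 0.368 * e_train)
    return sum(estimates) / len(estimates)

def train(ts):
    labels = [x["class"] for x in ts]
    return max(set(labels), key=labels.count)

def error_rate(model, xs):
    return sum(1 for x in xs if x["class"] != model) / len(xs)

data = [{"class": "yes", "id": i} for i in range(60)] + \
       [{"class": "no", "id": i} for i in range(60, 100)]
print(bootstrap_estimate(data, train, error_rate))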

11
3/17/2021

More on the bootstrap

 Probably the best way of estimating


performance for very small datasets
 However, it has some problems
 Consider the random dataset from above
 A perfect memorizer will achieve
0% resubstitution error and
~50% error on test data
 Bootstrap estimate for this classifier:
𝑒𝑟𝑟 = 0.632 × 50% + 0.368 × 0% = 31.6%

 True expected error: 50%


23

23

Comparing data mining schemes

 Frequent question: which of two learning schemes


performs better?
 Note: this is domain dependent!
 Obvious way: compare 10-fold CV estimates
 Generally sufficient in applications (we don't lose if
the chosen method is not truly better)
 However, what about machine learning research?
 Need to show convincingly that a particular

method works better

24

24

12
3/17/2021

Comparing schemes II
 Want to show that scheme A is better than scheme B in
a particular domain
 For a given amount of training data
 On average, across all possible training sets
 Let's assume we have an infinite amount of data from
the domain:
 Sample infinitely many dataset of specified size
 Obtain cross-validation estimate on each dataset for
each scheme
 Checkif mean accuracy for scheme A is better than
mean accuracy for scheme B

25

25

Predicting probabilities

 Performance measure so far: success rate


 Also called 0-1 loss function:
      Σ_i loss_i, where loss_i = 0 if prediction is correct and 1 if prediction is incorrect

 Most classifiers produces class probabilities


 Depending on the application, we might want to
check the accuracy of the probability estimates
 0-1 loss is not the right thing to use in those cases

26

26

13
3/17/2021

Quadratic loss function


 p1 … pk are probability estimates for an instance

 c is the index of the instance’s actual class

 a1 … ak = 0, except for ac which is 1

 Quadratic loss is: Σ_j (p_j − a_j)² = 1 − 2p_c + Σ_j p_j²

 Want to minimize

 Can show that this is minimized when pj = pj*, the


true probabilities

27

27

Informational loss function

 The informational loss function is –log(pc),


where c is the index of the instance’s actual class
 Number of bits required to communicate the
actual class
 Let p1* … pk* be the true class probabilities
 Then the expected value for the loss function is:
      −p1* log2 p1 − ... − pk* log2 pk

 Justification: minimized when pj = pj*


 Difficulty: zero-frequency problem

28

28

14
3/17/2021

Discussion

 Which loss function to choose?


 Both encourage honesty
 Quadratic loss function takes into account all class
probability estimates for an instance
 Informational loss focuses only on the probability
estimate for the actual class
 Quadratic loss is bounded by 1 + Σ_j p_j² :
it can never exceed 2
 Informational loss can be infinite
 Informational loss is related to MDL principle
[later]

29

29

Counting the cost


 In practice, different types of classification
errors often incur different costs
 Examples:
 Loan decisions
 Oil-slick detection
 Fault diagnosis
 Promotional mailing

30

30

15
3/17/2021

Counting the cost

 The confusion matrix:


Predicted class
Yes No
Actual class Yes True positive False negative
No False positive True negative

There are many other types of cost!


 E.g.: cost of collecting training data

31

31

Aside: the kappa statistic


 Two confusion matrices for a 3-class problem:
actual predictor (left) vs. random predictor (right)

 Number of successes: sum of entries in diagonal (D)


 Kappa statistic:
      κ = (D_observed − D_random) / (D_perfect − D_random)
measures relative improvement over random predictor

32

32

16
3/17/2021

Classification with costs

 Two cost matrices:

 Success rate is replaced by average cost per


prediction
 Costis given by appropriate entry in the cost
matrix

33

33

Cost-sensitive classification

 Can take costs into account when making predictions


 Basic idea: only predict high-cost class when very
confident about prediction
 Given: predicted class probabilities
 Normally we just predict the most likely class
 Here, we should make the prediction that minimizes
the expected cost
 Expected cost: dot product of vector of class
probabilities and appropriate column in cost matrix
 Choose column (class) that minimizes expected cost

34

34
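A sketch of choosing the prediction that minimizes expected cost (the dot product of the class-probability vector with a cost-matrix column); the probabilities and the cost matrix below are made up:

def min_expected_cost_class(probabilities, cost_matrix, classes):
    # cost_matrix[i][j]: cost of predicting classes[j] when the truth is classes[i]
    best_class, best_cost = None, float("inf")
    for j, predicted in enumerate(classes):
        # Expected cost = dot product of class probabilities and the j-th column
        expected = sum(probabilities[i] * cost_matrix[i][j] for i in range(len(classes)))
        if expected < best_cost:
            best_class, best_cost = predicted, expected
    return best_class, best_cost

classes = ["yes", "no"]
costs = [[0, 1],     # true "yes": cost 0 if predicted yes, 1 if predicted no
         [10, 0]]    # true "no": a false positive is ten times more expensive
print(min_expected_cost_class([0.8, 0.2], costs, classes))
# ('no', 0.8): the cheaper prediction, even though "yes" is the more likely class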

17
3/17/2021

Cost-sensitive learning

 So far we haven't taken costs into account at


training time
 Most learning schemes do not perform cost-
sensitive learning
 They generate the same classifier no matter what costs
are assigned to the different classes
 Example: standard decision tree learner
 Simple methods for cost-sensitive learning:
 Resampling of instances according to costs
 Weighting of instances according to costs
 Some schemes can take costs into account by
varying a parameter, e.g. naïve Bayes
35

35

Lift charts
 In practice, costs are rarely known
 Decisions are usually made by comparing possible
scenarios
 Example: promotional mailout to 1,000,000
households
 Mail to all; 0.1% respond (1000)
 Data mining tool identifies subset of 100,000 most
promising, 0.4% of these respond (400)
40% of responses for 10% of cost may pay off
 Identify subset of 400,000 most promising, 0.2% respond
(800)
 A lift chart allows a visual comparison

36

36

18
3/17/2021

Generating a lift chart

 Sort instances according to predicted probability


of being positive:
Predicted probability Actual class
1 0.95 Yes
2 0.93 Yes
3 0.93 No
4 0.88 Yes
… … …

 x axis is sample size


y axis is number of true positives
37

37
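A sketch of computing lift-chart points from (predicted probability, actual class) pairs, following the sorting procedure above; the toy predictions extend the table's first rows with an invented fifth entry:

def lift_chart_points(predictions):
    # predictions: list of (probability_of_yes, actual_class) pairs
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    points, true_positives = [(0, 0)], 0
    for i, (_, actual) in enumerate(ranked, start=1):
        if actual == "yes":
            true_positives += 1
        points.append((i, true_positives))   # (sample size, number of true positives)
    return points

preds = [(0.95, "yes"), (0.93, "yes"), (0.93, "no"), (0.88, "yes"), (0.80, "no")]
print(lift_chart_points(preds))
# [(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 3)]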

A hypothetical lift chart

40% of responses 80% of responses


for 10% of cost for 40% of cost

38

38

19
3/17/2021

ROC curves

 ROC curves are similar to lift charts


 Stands for “receiver operating characteristic”
 Used in signal detection to show tradeoff between
hit rate and false alarm rate over noisy channel
 Differences to lift chart:
 y axis shows percentage of true positives in sample
rather than absolute number
 x axis shows percentage of false positives in
sample rather than sample size

39

39

A sample ROC curve

 Jagged curve—one set of test data


 Smooth curve—use cross-validation

40

40

20
3/17/2021

Cross-validation and ROC curves

 Simple method of getting a ROC curve using


cross-validation:
 Collect probabilities for instances in test folds
 Sort instances according to probabilities
 This method is implemented in WEKA
 However, this is just one possibility
 Another possibility is to generate an ROC curve for
each fold and average them

41

41

ROC curves for two schemes

 For a small, focused sample, use method A


 For a larger one, use method B
 In between, choose between A and B with appropriate probabilities
42

42

21
3/17/2021

The convex hull

 Given two learning schemes we can achieve any


point on the convex hull!
 TP and FP rates for scheme 1: t1 and f1
 TP and FP rates for scheme 2: t2 and f2
 If scheme 1 is used to predict 100 × q % of the
cases and scheme 2 for the rest, then
 TP rate for combined scheme:
q × t1 + (1-q) × t2
 FP rate for combined scheme:
q × f1+(1-q) × f2

43

43

More measures...
 Percentage of retrieved documents that are relevant:
precision=TP/(TP+FP)
 Percentage of relevant documents that are returned:
recall =TP/(TP+FN)
 Precision/recall curves have hyperbolic shape
 Summary measures: average precision at 20%, 50% and 80% recall
(three-point average recall)
 F-measure=(2 × recall × precision)/(recall+precision)
 sensitivity × specificity = (TP / (TP + FN)) × (TN / (FP + TN))
 Area under the ROC curve (AUC):
probability that randomly chosen positive instance is ranked above
randomly chosen negative one

44

44

22
3/17/2021

Summary of some measures

      Plot                     Domain                  Axes          Explanation
      Lift chart               Marketing               TP            TP
                                                       Subset size   (TP+FP)/(TP+FP+TN+FN)
      ROC curve                Communications          TP rate       TP/(TP+FN)
                                                       FP rate       FP/(FP+TN)
      Recall-precision curve   Information retrieval   Recall        TP/(TP+FN)
                                                       Precision     TP/(TP+FP)

45

45

Evaluating numeric prediction

 Same strategies: independent test set, cross-


validation, significance tests, etc.
 Difference: error measures
 Actual target values: a1 a2 …an
 Predicted target values: p1 p2 … pn
 Most popular measure: mean-squared error
      ( (p1 − a1)² + ... + (pn − an)² ) / n

 Easy to manipulate mathematically

46

46

23
3/17/2021

Other measures
The root mean-squared error :
      √( ( (p1 − a1)² + ... + (pn − an)² ) / n )
●The mean absolute error is less sensitive to outliers
than the mean-squared error:
      ( |p1 − a1| + ... + |pn − an| ) / n

●Sometimes relative error values are more appropriate


(e.g. 10% for an error of 50 when predicting 500)

47

47

Improvement on the mean


●How much does the scheme improve on
simply predicting the average?

● The relative squared error is:
      ( (p1 − a1)² + ... + (pn − an)² ) / ( (a1 − ā)² + ... + (an − ā)² )
● The relative absolute error is:
      ( |p1 − a1| + ... + |pn − an| ) / ( |a1 − ā| + ... + |an − ā| )

48

48
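A sketch that computes the numeric-prediction error measures above for parallel lists of actual and predicted values (the four toy values are invented):

import math

def error_measures(actual, predicted):
    n = len(actual)
    mean_a = sum(actual) / n
    sq = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    ab = [abs(p - a) for p, a in zip(predicted, actual)]
    return {
        "mean-squared error": sum(sq) / n,
        "root mean-squared error": math.sqrt(sum(sq) / n),
        "mean absolute error": sum(ab) / n,
        "relative squared error": sum(sq) / sum((a - mean_a) ** 2 for a in actual),
        "relative absolute error": sum(ab) / sum(abs(a - mean_a) for a in actual),
    }

actual = [500, 300, 700, 400]
predicted = [450, 350, 640, 420]
print(error_measures(actual, predicted))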

24
3/17/2021

Correlation coefficient
 Measures the statistical correlation between the
predicted values and the actual values

 Scale independent, between –1 and +1


 Good performance leads to large values!

49

49

Which measure?
 Best to look at all of them
 Often it doesn’t matter
 Example:

A B C D
Root mean-squared error 67.8 91.7 63.3 57.4
Mean absolute error 41.3 38.5 33.4 29.2
Root rel squared error 42.2% 57.2% 39.4% 35.8%
Relative absolute error 43.1% 40.1% 34.8% 30.4%
Correlation coefficient 0.88 0.88 0.89 0.91

●D best
●C second-best
●A, B arguable

50

50

25
3/17/2021

The MDL principle

 MDL stands for minimum description length


 The description length is defined as:
space required to describe a theory
+
space required to describe the theory’s mistakes
 In our case the theory is the classifier and the mistakes
are the errors on the training data
 Aim: we seek a classifier with minimal DL
 MDL principle is a model selection criterion

51

51

Model selection criteria


 Model selection criteria attempt to find a good
compromise between:
 The complexity of a model
 Its prediction accuracy on the training data
 Reasoning: a good model is a simple model that
achieves high accuracy on the given data
 Also known as Occam’s Razor :
the best theory is the smallest one
that describes all the facts

William of Ockham, born in the village of Ockham in Surrey


(England) about 1285, was the most influential philosopher of
the 14th century and a controversial theologian.

52

52

26
3/17/2021

Elegance vs. errors

 Theory 1: very simple, elegant theory that explains


the data almost perfectly
 Theory 2: significantly more complex theory that
reproduces the data without mistakes
 Theory 1 is probably preferable

53

53

MDL and compression

 MDL principle relates to data compression:


 The best theory is the one that compresses the data the
most
 I.e. to compress a dataset we generate a model and then
store the model and its mistakes
 We need to compute
(a) size of the model, and
(b) space needed to encode the errors
 (b) easy: use the informational loss function
 (a) need a method to encode the model

54

54

27
3/17/2021

MDL and Bayes’s theorem

 L[T]=“length” of the theory


 L[E|T]=training set encoded in a certain number of
bits given the theory
 Description length= L[T] + L[E|T]
 Bayes’s theorem gives a posteriori probability of a
theory given the data:
𝑃𝑟[E|T]𝑃𝑟 𝑇
𝑃𝑟[T|E] =
𝑃𝑟 𝐸

 Equivalent to:
−log𝑃𝑟[T|E] = −log𝑃𝑟[E|T] − log𝑃𝑟 𝑇 + log𝑃𝑟 𝐸

constant
55

55

MDL and MAP

 MAP stands for maximum a posteriori probability


 Finding the MAP theory corresponds to finding the
MDL theory
 Difficult bit in applying the MAP principle:
determining the prior probability Pr[T] of the theory
 Corresponds to difficult part in applying the MDL
principle: coding scheme for the theory
 I.e. if we know a priori that a particular theory is
more likely we need fewer bits to encode it

56

56

28
3/17/2021

Discussion of MDL principle

 Advantage: makes full use of the training data when


selecting a model
 Disadvantage 1: appropriate coding scheme/prior
probabilities for theories are crucial
 Disadvantage 2: no guarantee that the MDL theory is the
one which minimizes the expected error
 Note: Occam’s Razor is an axiom!
 Epicurus’s principle of multiple explanations: keep all
theories that are consistent with the data

57

57

29
3/28/2021

Lecture 6:
Data mining algorithms:
Classification
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)

References:
Chapter 8 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber

1
3/28/2021

Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection

Supervised vs. Unsupervised Learning


 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data is unknown
 Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data

2
3/28/2021

Prediction Problems: Classification vs. Numeric


Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts
unknown or missing values
 Typical applications
 Credit/loan approval:
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is

Classification—A Two-Step Process


 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the classified
result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set (otherwise overfitting)
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called validation
(test) set

3
3/28/2021

Process (1): Model Construction

Classification
Algorithms
Training
Data

      NAME   RANK             YEARS   TENURED
      Mike   Assistant Prof   3       no
      Mary   Assistant Prof   7       yes
      Bill   Professor        2       yes
      Jim    Associate Prof   7       yes
      Dave   Assistant Prof   6       no
      Anne   Associate Prof   3       no

Classifier (Model):
      IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Process (2): Using the Model in Prediction

Classifier

Testing
Data Unseen Data

(Jeff, Professor, 4)
NA M E RANK YEARS TENURED
Tom Assistant Prof 2 no Tenured?
M erlisa Associate Prof 7 no
G eorge Professor 5 yes
Joseph Assistant Prof 7 yes

4
3/28/2021

Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection

Decision Tree Induction: An Example


 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)

      age     income   student   credit_rating   buys_computer
      <=30    high     no        fair            no
      <=30    high     no        excellent       no
      31…40   high     no        fair            yes
      >40     medium   no        fair            yes
      >40     low      yes       fair            yes
      >40     low      yes       excellent       no
      31…40   low      yes       excellent       yes
      <=30    medium   no        fair            no
      <=30    low      yes       fair            yes
      >40     medium   yes       fair            yes
      <=30    medium   yes       excellent       yes
      31…40   medium   no        excellent       yes
      31…40   high     yes       fair            yes
      >40     medium   no        excellent       no

 Resulting tree: the root tests age?
      age <=30   → test student? (no → no, yes → yes)
      age 31..40 → yes
      age >40    → test credit rating? (excellent → no, fair → yes)
10

10

5
3/28/2021

Algorithm for Decision Tree Induction


 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-
conquer manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they
are discretized in advance)
 Examples are partitioned recursively based on selected
attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further
partitioning – majority voting is employed for classifying
the leaf
 There are no samples left
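A compact sketch of this greedy procedure (plain Python, using information gain as the selection heuristic; the function names and the dict-based tree representation are illustrative, not taken from the lecture):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        n = len(labels)
        partitions = {}
        for row, label in zip(rows, labels):
            partitions.setdefault(row[attr], []).append(label)
        expected = sum(len(part) / n * entropy(part) for part in partitions.values())
        return entropy(labels) - expected

    def build_tree(rows, labels, attrs):
        # All samples for this node belong to the same class -> leaf
        if len(set(labels)) == 1:
            return labels[0]
        # No remaining attributes -> majority voting
        if not attrs:
            return Counter(labels).most_common(1)[0][0]
        # Greedy choice: the attribute with the highest information gain
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        node = {best: {}}
        for value in {row[best] for row in rows}:
            sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
            sub_rows = [r for r, _ in sub]
            sub_labels = [l for _, l in sub]
            node[best][value] = build_tree(sub_rows, sub_labels,
                                           [a for a in attrs if a != best])
        return node

    # rows are dicts such as {'age': '<=30', 'income': 'high', 'student': 'no', 'credit_rating': 'fair'}
    # build_tree(rows, labels, ['age', 'income', 'student', 'credit_rating'])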


Brief Review of Entropy


Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
      Info(D) = - Σ_{i=1..m} pi log2(pi)
 Information needed (after using A to split D into v partitions) to classify D:
      Info_A(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)
 Information gained by branching on attribute A:
      Gain(A) = Info(D) - Info_A(D)



Attribute Selection: Information Gain

 Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
      Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

 Expected information if we split on age:

      age     pi   ni   I(pi, ni)
      <=30    2    3    0.971
      31…40   4    0    0
      >40     3    2    0.971

      Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

      Here (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

 Hence
      Gain(age) = Info(D) - Info_age(D) = 0.246
 Similarly,
      Gain(income) = 0.029
      Gain(student) = 0.151
      Gain(credit_rating) = 0.048

 (All values are computed from the same 14-tuple buys_computer training set shown earlier.)
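These numbers are easy to check with a few lines of Python (an illustrative sketch, not part of the original slides; attribute names follow the buys_computer table):

    import math
    from collections import Counter

    data = [  # (age, income, student, credit_rating, buys_computer)
        ('<=30','high','no','fair','no'), ('<=30','high','no','excellent','no'),
        ('31..40','high','no','fair','yes'), ('>40','medium','no','fair','yes'),
        ('>40','low','yes','fair','yes'), ('>40','low','yes','excellent','no'),
        ('31..40','low','yes','excellent','yes'), ('<=30','medium','no','fair','no'),
        ('<=30','low','yes','fair','yes'), ('>40','medium','yes','fair','yes'),
        ('<=30','medium','yes','excellent','yes'), ('31..40','medium','no','excellent','yes'),
        ('31..40','high','yes','fair','yes'), ('>40','medium','no','excellent','no'),
    ]
    attrs = {'age': 0, 'income': 1, 'student': 2, 'credit_rating': 3}
    labels = [row[-1] for row in data]

    def entropy(ys):
        n = len(ys)
        return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

    def gain(attr):
        col, n = attrs[attr], len(data)
        split = {}
        for row in data:
            split.setdefault(row[col], []).append(row[-1])
        info_a = sum(len(part) / n * entropy(part) for part in split.values())
        return entropy(labels) - info_a

    for a in attrs:
        print(a, round(gain(a), 3))   # age 0.246, income 0.029, student 0.151, credit_rating 0.048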


Computing Information-Gain for


Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point,
and D2 is the set of tuples in D satisfying A > split-point
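A sketch of this split-point search (plain Python; the variable names are illustrative):

    import math
    from collections import Counter

    def entropy(ys):
        n = len(ys)
        return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

    def best_split_point(values, labels):
        """Return the split point of a continuous attribute that minimises Info_A(D)."""
        pairs = sorted(zip(values, labels))
        xs = [v for v, _ in pairs]
        ys = [y for _, y in pairs]
        n = len(ys)
        best, best_info = None, float('inf')
        for i in range(n - 1):
            if xs[i] == xs[i + 1]:
                continue
            split = (xs[i] + xs[i + 1]) / 2          # midpoint between adjacent values
            left, right = ys[:i + 1], ys[i + 1:]     # D1: A <= split, D2: A > split
            info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if info < best_info:
                best, best_info = split, info
        return best

    # e.g. best_split_point([25, 32, 47, 51, 38], ['no', 'yes', 'yes', 'no', 'yes'])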


Gain Ratio for Attribute Selection (C4.5)

 Information gain measure is biased towards attributes


with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
      SplitInfo_A(D) = - Σ_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.

 gain_ratio(income) = 0.029/1.557 = 0.019


 The attribute with the maximum gain ratio is selected as
the splitting attribute
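The example value 1.557 for income comes from its 4 “high”, 6 “medium” and 4 “low” tuples; a quick check (Python sketch, not part of the original slides):

    import math

    def split_info(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts)

    s = split_info([4, 6, 4])                 # income: 4 high, 6 medium, 4 low -> about 1.557
    print(round(s, 3), round(0.029 / s, 3))   # gain_ratio(income) ~ 0.019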


Gini Index (CART, IBM IntelligentMiner)

 If a data set D contains examples from n classes, the gini index gini(D) is defined as
      gini(D) = 1 - Σ_{j=1..n} pj^2
   where pj is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
      gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
 Reduction in impurity:
      Δgini(A) = gini(D) - gini_A(D)
 The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

Computation of Gini Index

 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
      gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}
      gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
 Gini{low,high} is 0.458; Gini{medium,high} is 0.450. Thus, split on the {low,medium} (and {high}) partition since it has the lowest Gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
 Can be modified for categorical attributes
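A small check of these figures (Python sketch; the per-partition class counts are taken from the buys_computer table, and the value for the {low, medium} split is computed here rather than quoted from the slide):

    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    # Whole data set: 9 'yes', 5 'no'
    print(round(gini([9, 5]), 3))                          # 0.459

    # Binary split on income: D1 = {low, medium} (7 yes, 3 no), D2 = {high} (2 yes, 2 no)
    g_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
    print(round(g_split, 3))                               # about 0.443, the lowest income split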


Comparing Attribute Selection Measures

 The three measures, in general, return good


results but
 Information gain:
 biased towards multivalued attributes

 Gain ratio:
 tends to prefer unbalanced splits in which one partition is much
smaller than the others

 Gini index:
 biased to multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized partitions and purity in
both partitions

Other Attribute Selection Measures

 CHAID: a popular decision tree algorithm, measure based on χ2 test for


independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistic: has a close approximation to χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
 The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results; none is significantly superior to the others

Overfitting and Tree Pruning


 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to
noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early; do not split a
node if this would result in the goodness measure falling
below a threshold
 Difficult to choose an appropriate threshold

 Postpruning: Remove branches from a “fully grown” tree—


get a sequence of progressively pruned trees
 Use a set of data different from the training data to decide which is the “best
pruned tree”


Enhancements to Basic Decision Tree Induction

 Allow for continuous-valued attributes


 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication

Classification in Large Databases


 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
 Why is decision tree induction popular?
 relatively faster learning speed (than other
classification methods)
 convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases
 comparable classification accuracy with other methods
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 Builds an AVC-list (attribute, value, class label)


Scalability Framework for RainForest


 Separates the scalability aspects from the criteria that
determine the quality of the tree
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X )
 Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
 AVC-group (of a node n )
 Set of AVC-sets of all predictor attributes at the node n


Rainforest: Training Set and Its AVC Sets

Training examples: the same 14-tuple buys_computer table shown earlier.

AVC-set on Age:
      Age      Buy_Computer: yes   no
      <=30     2                   3
      31..40   4                   0
      >40      3                   2

AVC-set on income:
      income   Buy_Computer: yes   no
      high     2                   2
      medium   4                   2
      low      3                   1

AVC-set on Student:
      student  Buy_Computer: yes   no
      yes      6                   1
      no       3                   4

AVC-set on credit_rating:
      Credit rating   Buy_Computer: yes   no
      fair            6                   2
      excellent       3                   3
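Building an AVC-set is just an aggregation of class-label counts per attribute value; a minimal sketch in plain Python (the list below is the (age, class) projection of the buys_computer table):

    from collections import Counter

    rows = [  # (age, buys_computer) projection of the training set
        ('<=30', 'no'), ('<=30', 'no'), ('31..40', 'yes'), ('>40', 'yes'), ('>40', 'yes'),
        ('>40', 'no'), ('31..40', 'yes'), ('<=30', 'no'), ('<=30', 'yes'), ('>40', 'yes'),
        ('<=30', 'yes'), ('31..40', 'yes'), ('31..40', 'yes'), ('>40', 'no'),
    ]

    avc_age = Counter(rows)   # AVC-set on Age: count per (attribute value, class label) pair
    print(avc_age)
    # e.g. ('<=30','yes'): 2, ('<=30','no'): 3, ('31..40','yes'): 4, ('>40','yes'): 3, ('>40','no'): 2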


BOAT (Bootstrapped Optimistic Algorithm


for Tree Construction)
 Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
 Each subset is used to create a tree, resulting in several
trees
 These trees are examined and used to construct a new tree
T’
 It turns out that T’ is very close to the tree that would be
generated using the whole data set together
 Advantages: requires only two scans of the database; it is an incremental algorithm


Presentation of Classification Results
Visualization of a Decision Tree in SGI/MineSet 3.0
Interactive Visual Mining by Perception-Based Classification (PBC)
(These three slides contain screenshots only.)

Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection


Bayesian Classification: Why?

 A statistical classifier: performs probabilistic prediction, i.e.,


predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured


Bayes’ Theorem: Basics


 Total probability Theorem:
      P(B) = Σ_{i=1..M} P(B|Ai) P(Ai)
 Bayes’ Theorem:
      P(H|X) = P(X|H) P(H) / P(X)
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), (i.e., posteriori probability):
the probability that the hypothesis holds given the observed data
sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given
that the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income


Prediction Based on Bayes’ Theorem

 Given training data X, posteriori probability of a


hypothesis H, P(H|X), follows the Bayes’ theorem

      P(H|X) = P(X|H) P(H) / P(X)
 Informally, this can be viewed as
posteriori = likelihood x prior/evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost


Classification Is to Derive the Maximum Posteriori

 Let D be a training set of tuples and their associated


class labels, and each tuple is represented by an n-D
attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e.,
the maximal P(Ci|X)
 This can be derived from Bayes’ theorem
      P(Ci|X) = P(X|Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only
      P(Ci|X) ∝ P(X|Ci) P(Ci)
   needs to be maximized


Naïve Bayes Classifier

 A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
      P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
 This greatly reduces the computation cost: only counts the class distribution
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
      g(x, μ, σ) = (1 / (√(2π) σ)) × exp( -(x - μ)^2 / (2σ^2) )
   and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)

Naïve Bayes Classifier: Training Dataset

Class:
   C1: buys_computer = ‘yes’
   C2: buys_computer = ‘no’

Data to be classified:
   X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)

Training data (the same buys_computer table as before):

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Naïve Bayes Classifier: An Example

 P(Ci):  P(buys_computer = “yes”) = 9/14 = 0.643
         P(buys_computer = “no”) = 5/14 = 0.357
 Compute P(X|Ci) for each class:
   P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
   P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
   P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
   P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
   P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
   P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
   P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
   P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
   P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
   P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
 P(X|Ci) * P(Ci):
   P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
   P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
 Therefore, X belongs to class “buys_computer = yes”
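These probabilities are easy to reproduce in code; the sketch below (plain Python, not part of the original slides) implements a categorical naïve Bayes classifier over the buys_computer table:

    from collections import Counter

    data = [  # (age, income, student, credit_rating) -> buys_computer
        (('<=30','high','no','fair'),'no'), (('<=30','high','no','excellent'),'no'),
        (('31..40','high','no','fair'),'yes'), (('>40','medium','no','fair'),'yes'),
        (('>40','low','yes','fair'),'yes'), (('>40','low','yes','excellent'),'no'),
        (('31..40','low','yes','excellent'),'yes'), (('<=30','medium','no','fair'),'no'),
        (('<=30','low','yes','fair'),'yes'), (('>40','medium','yes','fair'),'yes'),
        (('<=30','medium','yes','excellent'),'yes'), (('31..40','medium','no','excellent'),'yes'),
        (('31..40','high','yes','fair'),'yes'), (('>40','medium','no','excellent'),'no'),
    ]

    class_count = Counter(label for _, label in data)
    # cond_count[(attribute index, value, class)] = number of matching tuples
    cond_count = Counter((k, x[k], label) for x, label in data for k in range(4))

    def classify(x):
        scores = {}
        for c, nc in class_count.items():
            p = nc / len(data)                      # prior P(Ci)
            for k, v in enumerate(x):
                p *= cond_count[(k, v, c)] / nc     # likelihood P(xk|Ci)
            scores[c] = p
        return scores

    print(classify(('<=30', 'medium', 'yes', 'fair')))
    # {'no': ~0.007, 'yes': ~0.028}  -> predict 'yes', as in the example above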

Avoiding the Zero-Probability Problem


 Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
      P(X|Ci) = Π_{k=1..n} P(xk|Ci)
 Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003

 The “corrected” prob. estimates are close to their


“uncorrected” counterparts
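A one-line illustration of the Laplacian correction on the income counts from this example (Python sketch):

    counts = {'low': 0, 'medium': 990, 'high': 10}      # 1000 tuples in total
    q = len(counts)                                     # number of distinct values
    corrected = {v: (c + 1) / (sum(counts.values()) + q) for v, c in counts.items()}
    print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003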


Naïve Bayes Classifier: Comments


 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore
loss of accuracy
 Practically, dependencies exist among variables
 E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
 Dependencies among these cannot be modeled by Naïve Bayes Classifier
 How to deal with these dependencies? Bayesian Belief
Networks (Text [1]. Chapter 9)


Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection


Using IF-THEN Rules for Classification


 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
 If more than one rule is triggered, we need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that
has the “toughest” requirement (i.e., with the most attribute tests)
 Class-based ordering: decreasing order of prevalence or
misclassification cost per class
 Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts


Rule Extraction from a Decision Tree


 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
 (The slide shows the buys_computer tree again: age? branches to <=30 -> student?, 31..40 -> yes, >40 -> credit rating?)


 Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes

Rule Induction: Sequential Covering Method

 Sequential covering algorithm: Extracts rules directly from training


data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
 Compared with decision-tree induction: a decision tree learns a set of rules
simultaneously


Sequential Covering Algorithm

while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

(The accompanying figure shows the set of positive examples being covered step by step by Rule 1, Rule 2 and Rule 3.)


Rule Generation

 To generate a rule
      while (true)
          find the best predicate p
          if foil-gain(p) > threshold then add p to current rule
          else break

(The figure illustrates the rule growing one predicate at a time: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5, covering more positive and fewer negative examples.)


How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Adding new attributes by adopting a greedy depth-first strategy
 Picks the one that most improves the rule quality
 Rule-Quality measures: consider both coverage and accuracy
 Foil-gain (in FOIL & RIPPER): assesses info_gain by extending the condition
      FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) - log2( pos / (pos + neg) ) )
   favors rules that have high accuracy and cover many positive tuples
 Rule pruning based on an independent set of test tuples
      FOIL_Prune(R) = (pos - neg) / (pos + neg)
   pos/neg are # of positive/negative tuples covered by R.
   If FOIL_Prune is higher for the pruned version of R, prune R
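For concreteness, a tiny helper that evaluates FOIL_Gain for a candidate predicate (Python sketch; the counts in the usage line are hypothetical):

    import math

    def foil_gain(pos, neg, pos_new, neg_new):
        """Gain from extending a rule covering (pos, neg) tuples so that it covers (pos_new, neg_new)."""
        if pos_new == 0:
            return 0.0
        return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                          - math.log2(pos / (pos + neg)))

    # e.g. a rule covering 40 positive / 30 negative tuples, narrowed to 30 positive / 5 negative:
    print(round(foil_gain(40, 30, 30, 5), 3))   # about 17.5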


Data mining algorithms: Classification

 Basic concepts
 Decision tree Induction
 Bayes Classification Methods
 Rule-based Classification
 Model Evaluation and Selection


Model Evaluation and Selection


 Evaluation metrics: How can we measure accuracy? Other metrics to
consider?
 Use validation test set of class-labeled tuples instead of training set
when assessing accuracy
 Methods for estimating a classifier’s accuracy:
 Holdout method, random subsampling
 Cross-validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 Cost-benefit analysis and ROC Curves


Classifier Evaluation Metrics: Confusion Matrix


Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:


Actual class\Predicted buy_computer buy_computer Total
class = yes = no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

 Given m classes, an entry, CMi,j in a confusion matrix indicates #


of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

      A\P    C    ¬C
      C      TP   FN   | P
      ¬C     FP   TN   | N
             P’   N’   | All

 Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
      Accuracy = (TP + TN)/All
 Error rate: 1 – accuracy, or
      Error rate = (FP + FN)/All
 Class Imbalance Problem:
   One class may be rare, e.g. fraud, or HIV-positive
   Significant majority of the negative class and minority of the positive class
   Sensitivity: True Positive recognition rate
      Sensitivity = TP/P
   Specificity: True Negative recognition rate
      Specificity = TN/N


Classifier Evaluation Metrics: Precision and Recall, and F-measures

 Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
      Precision = TP / (TP + FP)
 Recall: completeness – what % of positive tuples did the classifier label as positive?
      Recall = TP / (TP + FN)
 Perfect score is 1.0
 Inverse relationship between precision & recall
 F measure (F1 or F-score): harmonic mean of precision and recall
      F1 = 2 × Precision × Recall / (Precision + Recall)
 Fß: weighted measure of precision and recall; assigns ß times as much weight to recall as to precision
      Fß = (1 + ß²) × Precision × Recall / (ß² × Precision + Recall)


Classifier Evaluation Metrics: Example

Actual class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.40 (accuracy)

 Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
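The cancer example can be recomputed directly from the confusion-matrix counts (an illustrative Python sketch):

    TP, FN, FP, TN = 90, 210, 140, 9560          # counts from the cancer confusion matrix
    P, N = TP + FN, FP + TN

    sensitivity = TP / P                          # 0.30
    specificity = TN / N                          # ~0.9856
    precision   = TP / (TP + FP)                  # ~0.3913
    recall      = TP / (TP + FN)                  # 0.30
    f1 = 2 * precision * recall / (precision + recall)

    print(round(sensitivity, 4), round(specificity, 4),
          round(precision, 4), round(recall, 4), round(f1, 4))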


Evaluating Classifier Accuracy:


Holdout & Cross-Validation Methods
 Holdout method
 Given data is randomly partitioned into two independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation

 Random sampling: a variation of holdout


 Repeat holdout k times, accuracy = avg. of the accuracies obtained

 Cross-validation (k-fold, where k = 10 is most popular)


 Randomly partition the data into k mutually exclusive
subsets, each approximately equal size
 At i-th iteration, use Di as test set and others as training set
 Leave-one-out: k folds where k = # of tuples, for small sized
data
 *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial
data
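A minimal k-fold cross-validation loop (Python sketch; it uses scikit-learn’s StratifiedKFold, which is one way to obtain the stratified folds mentioned above, and the iris data and Gaussian naïve Bayes model are placeholders):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    accuracies = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        model = GaussianNB().fit(X[train_idx], y[train_idx])    # train on k-1 folds
        accuracies.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))  # test on the held-out fold

    print(sum(accuracies) / len(accuracies))   # estimated accuracy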

Evaluating Classifier Accuracy: Bootstrap

 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e.,each time a tuple is selected, it is equally likely to be
selected again and re-added to the training set
 Several bootstrap methods, and a common one is .632 bootstrap
 A data set with d tuples is sampled d times, with replacement,
resulting in a training set of d samples. The data tuples that did
not make it into the training set end up forming the test set.
About 63.2% of the original data end up in the bootstrap, and the
remaining 36.8% form the test set (since (1 – 1/d)^d ≈ e^-1 = 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the model is
      Acc(M) = Σ_{i=1..k} ( 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set )

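A sketch of one .632 bootstrap round (Python with scikit-learn; the dataset and model are placeholders, and in practice the procedure is repeated k times and the weighted terms are combined as in the formula above):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)
    d = len(X)

    # Sample d tuples with replacement; the untouched tuples form the test set
    train_idx = rng.integers(0, d, size=d)
    test_idx = np.setdiff1d(np.arange(d), train_idx)

    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    acc_test = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    acc_train = accuracy_score(y[train_idx], model.predict(X[train_idx]))

    # One weighted term of the .632 bootstrap estimate
    print(0.632 * acc_test + 0.368 * acc_train)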

Issues Affecting Model Selection

 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules

Summary (I)
 Classification is a form of data analysis that extracts models
describing important data classes.
 Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
 Evaluation metrics include: accuracy, sensitivity, specificity,
precision, recall, F measure, and Fß measure.
 Stratified k-fold cross-validation is recommended for accuracy
estimation. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.


Summary (II)

 There have been numerous comparisons of the different


classification methods; the matter remains a research
topic
 No single method has been found to be superior to all
others for all data sets
 Issues such as accuracy, training time, robustness,
scalability, and interpretability must be considered and
can involve trade-offs, further complicating the quest for
an overall superior method

Introduction to Data Mining

Lecture 6 – Activities

1. Exercises in Text [1] - 8.7

The following table consists of training data from an employee database. The data have been
generalized. For example, “31 . . . 35” for age represents the age range of 31 to 35. For a given
row entry, count represents the number of data tuples having the values for department, status, age,
and salary given in that row.

Let status be the class label attribute.

(a) How would you modify the basic decision tree algorithm to take into consideration the count
of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
(c) Given a data tuple having the values “systems”, “26…30”, and “46–50K” for the attributes
department, age, and salary, respectively, what would a Naïve Bayesian classification of the status
for the tuple be?


Answers

(a) How would you modify the basic decision tree algorithm to take into consideration the count
of each generalized data tuple (i.e., of each row entry)?
The basic decision tree algorithm should be modified as follows to take into consideration the
count of each generalized data tuple.
• The count of each tuple must be integrated into the calculation of the attribute selection measure
(such as information gain).
• Take the count into consideration to determine the most common class among the tuples.
(b) Use your algorithm to construct a decision tree from the given data.
The resulting tree (the tree figure is not reproduced here) is grown by choosing, at each level, the attribute with the highest information gain:

Info(D) = 0.899
Gain(Dept) = 0.0488
Gain(Age) = 0.4247
Gain(Salary) = 0.5375

Salary has the largest gain and is therefore selected as the splitting attribute at the root.
(c) Given a data tuple having the values “systems”, “26…30”, and “46–50K” for the attributes
department, age, and salary, respectively, what would a Naïve Bayesian classification of the status
for the tuple be?
P(X|senior) = 0; in this case the Laplacian correction was not used (one more tuple for each
age-value pair would have to be added to apply it).
P(X|junior) = 23/113 × 49/113 × 23/113 = 0.018. Thus, a Naïve Bayesian classification predicts
“junior”.
