Data Mining Lecture Notes
Lecture 1:
Introduction to Data Mining
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)
References:
[1] Chapter 1 in Data Mining: Concepts and Techniques (Third Edition), by
Jiawei Han, Micheline Kamber
[2] Chapter 1 in Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), by Ian H. Witten, Eibe Frank and Mark A. Hall
Introduction
Extracting
◆ implicit,
◆ previously unknown,
◆ potentially useful
information from data
Definitions:
DM: The practice of examining large databases in
order to generate new information.
DM: The process of analyzing data from different
perspectives and summarizing it into useful
information - information that can be used to
increase revenue, cut costs, or both.
Introduction
[Figure: increasing potential to support business decisions, from data exploration (statistical summary, querying, and reporting) up to decision making by the end user.]
Introduction
[Figure: the knowledge discovery process: data cleaning and data integration over databases, then selection of task-relevant data for data mining.]
Introduction
Structural descriptions
Example: if-then rules
Operational definition:
Things learn when they change their behavior in a way that makes them perform better in the future.
(Does a slipper learn?)
Introduction
Tables
Data cube
Linear models
Trees
Rules
Instance-based Representation
Clusters
[Figure: a data cube with a Country dimension (values such as Canada and Mexico) and aggregated sums along each dimension.]
If-then Rules
Instance-based representation
Clusters
Introduction
Applications
The result of learning—or the learning method itself—
is deployed in practical applications
◆ Processing loan applications
◆ Screening images for oil slicks
◆ Electricity supply forecasting
◆ Diagnosis of machine faults
◆ Marketing and sales
◆ Separating crude oil and natural gas
◆ Reducing banding in rotogravure printing
◆ Finding appropriate technicians for telephone faults
◆ Scientific applications: biology, astronomy, chemistry
◆ Automatic selection of TV programs
◆ Monitoring intensive care patients
Screening images
Given: radar satellite images of coastal waters
Problem: detect oil slicks in those images
Oil slicks appear as dark regions with changing
size and shape
Not easy: lookalike dark regions can be caused
by weather conditions (e.g. high wind)
Expensive process requiring highly trained
personnel
Load forecasting
Electricity supply companies
need forecast of future demand
for power
Forecasts of min/max load for each hour can yield significant savings
Given: manually constructed load model that
assumes “normal” climatic conditions
Problem: adjust for weather conditions
Static model consists of:
◆ base load for the year
◆ load periodicity over the year
◆ effect of holidays
Introduction
Classification rule:
predicts value of a given attribute (the classification of an example)
If outlook = sunny and humidity = high
then play = no
Association rule:
predicts value of arbitrary attribute (or combination)
If temperature = cool then humidity = normal
If humidity = normal and windy = false
then play = yes
If outlook = sunny and play = no
then humidity = high
If windy = false and play = no
then outlook = sunny and humidity = high
Summary
Lecture 2:
Getting to Know Your Data
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)
References:
Chapter 2 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber
Summary
Relational records
Transaction data
Molecular Structures
Ordered
Image data
Video data
Dimensionality
Curse of dimensionality
Sparsity
Distribution
Centrality and dispersion
Data Objects
Attributes
Attribute Types
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of documents
Activities
Summary
Mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$
Trimmed mean: chopping extreme values
Example:
Suppose we have the following values for salary (in
thousands of dollars), shown in increasing order: 30,
36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Mean = ?
Empirical formula: mean − mode ≈ 3 × (mean − median)
Example:
Suppose we have the following values for salary (in thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Median = ?
Mode = ?
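These central-tendency questions can be checked quickly with Python's statistics module (Python 3.8+ for multimode); a minimal sketch for the salary list above, in thousands:

```python
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean  :", statistics.mean(salaries))       # 696 / 12 = 58
print("median:", statistics.median(salaries))     # (52 + 56) / 2 = 54
print("modes :", statistics.multimode(salaries))  # [52, 70], the data are bimodal
```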
Boxplot Analysis
Fig. 2.3. Boxplot for the unit price data for items sold at four branches of AllElectronics during a given time period.
Example:
Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110.
Q1 = ?; Q3 = ?
IQR = ?
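As a rough check, the quartiles can also be computed with numpy; different interpolation conventions give slightly different quartile values than the by-hand method used in the textbook, so treat the output only as an approximation of the intended answer:

```python
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

# Default linear interpolation vs. picking the nearest lower data value
# (the "method" keyword needs numpy >= 1.22; older versions use "interpolation").
q1_lin, q3_lin = np.percentile(salaries, [25, 75])
q1_low, q3_low = np.percentile(salaries, [25, 75], method="lower")

print("linear:", q1_lin, q3_lin, "IQR =", q3_lin - q1_lin)
print("lower :", q1_low, q3_low, "IQR =", q3_low - q1_low)
```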
Histogram Analysis
Quantile Plot
Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates that approximately 100·f_i % of the data are below or equal to the value x_i
Scatter plot
Uncorrelated Data
Summary
Data matrix
n data points with p dimensions; two modes:
$$\begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$$
Dissimilarity matrix
n data points, but registers only the distance; a triangular matrix; single mode:
$$\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$
Simple matching for nominal attributes: $d(i,j) = \frac{p - m}{p}$, where m is the number of matching attributes and p is the total number of attributes describing the objects
Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$, where $m_f$ is the mean and $s_f$ the mean absolute deviation of attribute f
Using mean absolute deviation is more robust than using standard deviation
Example:
Data Matrix and Dissimilarity Matrix

Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity Matrix (with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Supremum distance (L∞)
      x1  x2  x3  x4
x1    0
x2    3   0
x3    2   5   0
x4    3   1   5   0
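The Euclidean and supremum matrices shown for x1..x4 (and the Manhattan distance, h = 1) are all special cases of the Minkowski distance; a minimal numpy sketch that reproduces them:

```python
import numpy as np

points = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])  # x1..x4 from the example

def minkowski_matrix(X, h):
    """Pairwise Minkowski distances; h = np.inf gives the supremum (L-infinity) distance."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(h):
        return diff.max(axis=2)
    return (diff ** h).sum(axis=2) ** (1.0 / h)

for h, name in [(1, "Manhattan"), (2, "Euclidean"), (np.inf, "Supremum")]:
    print(name)
    print(np.round(minkowski_matrix(points, h), 2))
```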
Ordinal Variables
Dissimilarity matrix:
Dissimilarity matrix:
For test-3: ; for the data:
Cosine Similarity
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • is the vector dot product and ||d|| is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 = 4.123
cos(d1, d2 ) = 0.94
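The same value can be reproduced with a few lines of numpy, as a sketch of the calculation above:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))  # 0.94
```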
Activities
Summary
Introduction to Data Mining
Lecture 2 – Activities
1. Identify attribute types: Numeric, Nominal, or Binary, in the following data tables.
2.4. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the
following result.
Age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
Age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
(a) Calculate the mean, median and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
2.6. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
2.8. It is important to define or select similarity measures in data analysis. However, there is no
commonly accepted subjective similarity measure. Results can vary depending on the similarity
measures used. Nonetheless, seemingly different similarity measures may be equivalent after
some transformation.
Suppose we have the following two-dimensional data set:
      A1    A2
x1    1.5   1.7
x2    2.0   1.9
x3    1.6   1.8
x4    1.2   1.5
x5    1.5   1.0
(a) Consider the data as two-dimensional data points. Given a new data point, x = (1.4, 1.6) as a
query, rank the database points based on similarity with the query using Euclidean distance,
Manhattan distance, supremum distance, and cosine similarity.
Answers:
2.2.
a. Mean = 809/27 = 30; Median = 25
b. Mode = 25 and 35
c. The midrange (average of the largest and smallest values in the data set) of the data is: (70 +
13)/2 = 41.5
d. Q1 = 20; Q3 = 35
e. 13, 20, 25, 35, 52, and outlier 70
f.
2.4.
a. For age: Mean = 46.44; Median = 51; σ = 12.85
For fat: Mean = 28.78; Median = 30.7; σ = 8.99
b.
2.6.
a. Euclidean distance = 6.7;
b. Manhattan distance = 11;
c. Minkowski distance = 6.1534 (h=3).
2.8.
a.
Homework - Session 2
Nguyen Tien Duc ITITIU18029
2.2
a.
b.
c.
d.
e.
f. Box plot:
2.4
a.
b. Box plots:
2.6
a. Euclidean distance:
b. Manhattan distance:
2.8
Formula 1: Minkowski distance $d(i,j) = \left(\sum_{k=1}^{p} |x_{ik} - x_{jk}|^h\right)^{1/h}$
Use for Euclidean (h = 2), Manhattan (h = 1), Supremum (h = infinity)
a. The data points are ranked from higher to lower similarity to the query.
Euclidean distance data points rank:
x1 0.14
x4 0.22
x3 0.28
x5 0.61
x2 0.67
Manhattan distance data points rank:
x1 0.19
x4 0.3
x3 0.4
x5 0.7
x2 0.89
Supremum distance data points rank:
x1 0.1
x4 0.19
x3 0.2
x2-x5 0.6
Cosine similarity data points rank:
x1 0.99999
x3 0.99996
x4 0.99903
x2 0.99575
x5 0.96536
Lecture 3:
Data preprocessing
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)
References:
Chapter 3 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber
Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation = “ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary = “−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age = “42”, Birthday = “03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Noisy Data
Binning
first sort data and partition into (equal-frequency) bins, then smooth by bin means, bin median, or bin boundaries
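A minimal sketch of equal-frequency binning with smoothing by bin means; the price list here is illustrative only (any sorted numeric attribute would do):

```python
# Equal-frequency (equal-depth) binning, then smoothing by bin means.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(data) // n_bins

bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

print("bins:    ", bins)      # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print("smoothed:", smoothed)  # [[9.0, ...], [22.75, ...], [29.25, ...]]
```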
Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Χ² (chi-square) test:
$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
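As an illustration only, the test can be run with scipy on a small contingency table; the counts below are in the spirit of the textbook's chess / science-fiction example, not data from these notes:

```python
from scipy.stats import chi2_contingency

# Rows: likes science fiction (yes / no); columns: plays chess (yes / no).
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square:", round(chi2, 2), " p-value:", p_value, " dof:", dof)
print("expected counts:\n", expected)
```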
$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{n\,\sigma_A \sigma_B}$
Covariance: $Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}$; correlation coefficient: $r_{A,B} = \frac{Cov(A,B)}{\sigma_A\,\sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B
Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
expected values
Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely
to be smaller than its expected value
Independence: CovA,B = 0 but the converse is not true:
Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Co-Variance: An Example
Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10),
(4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their prices rise or
fall together?
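A quick check of the question, using the covariance convention from the formulas above (denominator n, i.e., E[AB] − E[A]E[B], rather than numpy's default sample covariance):

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])     # prices of stock A over the week
B = np.array([5, 8, 10, 11, 14])  # prices of stock B over the week

cov_ab = np.mean(A * B) - A.mean() * B.mean()   # E[AB] - E[A]E[B]
print("Cov(A, B) =", cov_ab)  # 4.0 > 0, so the two prices tend to rise and fall together
```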
Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
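As one concrete example of these techniques, a minimal PCA sketch with scikit-learn (assuming scikit-learn is available; the data are random and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 tuples with 5 illustrative numeric attributes
X[:, 3] = X[:, 0] + 0.1 * X[:, 1]    # make one attribute nearly redundant

pca = PCA(n_components=2)            # keep the 2 strongest principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)               # (100, 2)
print(pca.explained_variance_ratio_) # fraction of variance captured by each component
```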
Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelets.]
Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis
Wavelet Decomposition
Attribute construction
Combining features (see: discriminative frequent patterns in Chapter on “Advanced Classification”)
Data discretization
Linear regression
Data modeled to fit a straight line
Often uses the least-square method to fit the line
Multiple regression
Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
Log-linear model
Approximates discrete multidimensional probability
distributions
Regression Analysis
Regression analysis: A collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent (response) variable and of one or more independent (explanatory) variables
[Figure: data points (x, y) with a fitted line y = x + 1.]
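A least-squares fit of such a line can be sketched in a few lines of numpy; the sample points below are invented for illustration:

```python
import numpy as np

# Illustrative (x, y) pairs roughly following y = x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 5.9])

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares straight-line fit
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```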
Histogram Analysis
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
[Figure: histogram with buckets 10000, 30000, 50000, 70000, 90000 on the x-axis and counts from 0 to 40 on the y-axis.]
Clustering
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
Types of Sampling
Data Compression
[Figure: data compression, original data vs. its (lossy) approximation.]
Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Transformation
A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
Min-max normalization: to [new_min_A, new_max_A]
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|v'|) < 1
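A small sketch of the three normalizations applied to the income example above (the helper functions are illustrative, not from any particular library):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    j = 0
    while max_abs / 10 ** j >= 1:   # smallest j such that max(|v'|) < 1
        j += 1
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling(73600, 98000))            # 0.736 (j = 5)
```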
Discretization
Histogram analysis
Top-down split, unsupervised
Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Summary
Lecture 4:
Data mining knowledge
representation
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)
References:
Chapter 3 in Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), by Ian H. Witten, Eibe Frank and Mark A. Hall
Knowledge representation
Tables
Linear models
Trees
Rules
Classification rules
Association rules
Rules with exceptions
More expressive rules
Instance-based representation
Clusters
Different ways of representing the output (patterns) produced by learning methods (classification, regression, …)
Tables
► Simplest way of representing output:
► Use the same format as input!
► Decision table for the weather problem:
Outlook Humidity Play
Sunny High No
Sunny Normal Yes
Overcast High Yes
Overcast Normal Yes
Rainy High No
Rainy Normal No
Knowledge representation
Tables
Linear models
Trees
Rules
Classification rules
Association rules
Rules with exceptions
More expressive rules
Instance-based representation
Clusters
Linear models
Another simple representation
Regression model
Inputs (attribute values) and output are all numeric
Output is the sum of weighted attribute values
The trick is to find good values for the weights
Binary classification
Line separates the two classes
Decision boundary - defines where the decision changes
from one class value to the other
Prediction is made by plugging in observed values of
the attributes into the expression
Predict one class if output ≥ 0, and the other class if
output < 0
Boundary becomes a high-dimensional plane
(hyperplane) when there are multiple attributes
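A tiny sketch of this decision rule for two numeric attributes; the weights are arbitrary and only illustrate the sign test:

```python
# Linear decision boundary: predict by the sign of a weighted sum of the attributes.
w0, w1, w2 = -1.0, 0.5, 0.8          # illustrative weights (bias plus one weight per attribute)

def predict(a1, a2):
    output = w0 + w1 * a1 + w2 * a2  # the linear model's output
    return "class A" if output >= 0 else "class B"

print(predict(1.0, 1.0))   # class A (output = 0.3 >= 0)
print(predict(0.5, 0.5))   # class B (output = -0.35 < 0)
```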
Knowledge representation
Tables
Linear models
Trees
Rules
Classification rules
Association rules
Rules with exceptions
More expressive rules
Instance-based representation
Clusters
Trees
Nominal:
●
Numeric:
●
Missing values
PRP =
- 56.1
+ 0.049 MYCT
+ 0.015 MMIN
+ 0.006 MMAX
+ 0.630 CACH
- 0.270 CHMIN
+ 1.46 CHMAX
Knowledge representation
Tables
Linear models
Trees
Rules
Classification rules
Association rules
Rules with exceptions
More expressive rules
Instance-based representation
Clusters
Classification rules
●Popular alternative to decision trees
●Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree)
If x = 1 and y = 0
then class = a
If x = 0 and y = 1
then class = a
If x = 0 and y = 0
then class = b
If x = 1 and y = 1
then class = b
If x = 1 and y = 1
then class = a
If z = 1 and w = 1
then class = a
Otherwise class = b
“Nuggets” of knowledge
Interpreting rules
Knowledge representation
Tables
Linear models
Trees
Rules
Classification rules
Association rules
Rules with exceptions
More expressive rules
Instance-based representation
Clusters
Association rules
Association rules…
… can predict any attribute and combinations of attributes
… are not intended to be used together as a set
Problem: immense number of possible associations
Output needs to be restricted to show only the most predictive
associations: only those with high support and high confidence
Knowledge representation
Tables
Linear models
Trees
Rules
Classification rules
Association rules
Rules with exceptions
More expressive rules
Instance-based representation
Clusters
Idea:
allow rules to have exceptions
Example: rule for iris data
If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor
New instance:
Sepal Sepal Petal Petal Type
length width length width
5.1 3.5 2.6 0.2 Iris-setosa
Modified rule:
If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor
EXCEPT if petal-width < 1.0 then Iris-setosa
default: Iris-setosa
except if petal-length ≥ 2.45 and petal-length < 5.355
and petal-width < 1.75
then Iris-versicolor
except if petal-length ≥ 4.95 and petal-width < 1.55
then Iris-virginica
else if sepal-length < 4.95 and sepal-width ≥ 2.45
then Iris-virginica
else if petal-length ≥ 3.35
then Iris-virginica
except if petal-length < 4.85 and sepal-length < 5.95
then Iris-versicolor
Knowledge representation
Tables
Linear models
Trees
Rules
Classification rules
Association rules
Rules with exceptions
More expressive rules
Instance-based representation
Clusters
More on exceptions
Default...except if...then...
is logically equivalent to
if...then...else
(where the else specifies what the default did)
But: exceptions offer a psychological advantage
Assumption: defaults and tests early on apply
more widely than exceptions further down
Exceptions reflect special cases
Relationships between attributes
Can’t be expressed with propositional rules
More expressive representation required
A propositional solution
Width Height Sides Class
2 4 4 Standing
3 6 4 Standing
4 3 4 Lying
7 8 3 Standing
7 6 3 Lying
2 9 4 Standing
9 1 4 Lying
10 2 3 Lying
A relational solution
If is_top_of(x,z) and
height_and_width_of(z,h,w) and h > w
and is_rest_of(x,y) and standing(y)
then standing(x)
If empty(x) then standing(x)
Recursive definition!
Knowledge representation
Tables
Linear models
Trees
Rules
Classification rules
Association rules
Rules with exceptions
More expressive rules
Instance-based representation
Clusters
Instance-based representation
Learning prototypes
Rectangular generalizations
Knowledge representation
Tables
Linear models
Trees
Rules
Classification rules
Association rules
Rules with exceptions
More expressive rules
Instance-based representation
Clusters
Representing clusters I
Overlapping clusters
Representing clusters II
[Figure: probabilistic assignment of instances to clusters 1, 2, 3 (a table of membership probabilities) and a dendrogram.]
Summary
Introduction to Data mining
Lecture 4 – Activities
Preprocessing and Knowledge Representation
https://drive.google.com/drive/folders/1kkXcdni6SN-2Thp6j2YiV1rMXflSqVkO
Propose an application using this dataset, e.g., supply chain management, and do the following tasks to
preprocess the data:
2. Find correlated attributes (applying Correlation Analysis to create a correlation matrix), and important
attributes.
2. Describe how to preprocess the following web log data in order to mine web access sequences, and
represent the preprocessed data.
References:
- 127.0.0.1 is the IP address of the client (remote host) which made the request to the server.
- user-identifier is the RFC 1413 identity of the client.
- frank is the userid of the person requesting the document.
- [10/Oct/2000:13:55:36 -0700] is the date, time, and time zone that the request was received,
by default in strftime format %d/%b/%Y:%H:%M:%S %z.
- "GET /apache_pb.gif HTTP/1.0" is the request line from the client. The
method GET, /apache_pb.gif the resource requested, and HTTP/1.0 the HTTP protocol.
- 200 is the HTTP status code returned to the client. 2xx is a successful response, 3xx a
redirection, 4xx a client error, and 5xx a server error.
- 2326 is the size of the object returned to the client, measured in bytes.
3. Describe how to represent the following data in a vector space model (document-term matrix).
Analysis models of technical and economic data of mining enterprises based on big data
1 analysis
2 Spatial and Spatio-temporal Data Mining
3 Hair data model: A new data model for Spatio-Temporal data mining
4 A Data Stream Mining System
DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster
5 nodes for spatial data mining algorithms
6 Big data gathering and mining pipelines for CRM using open-source
7 The Research on Safety Monitoring System of Coal Mine Based on Spatial Data Mining
CAKE – Classifying, Associating and Knowledge DiscovEry - An Approach for Distributed Data
8 Mining (DDM) Using PArallel Data Mining Agents (PADMAs)
9 Digital construction of coal mine big data for different platforms based on life cycle
10 Privacy-Preserving Big Data Stream Mining: Opportunities, Challenges, Directions
11 Analysis Methods of Workflow Execution Data Based on Data Mining
12 Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases
13 Comparison of Tools for Data Mining and Retrieval in High Volume Data Stream
14 Adaptive Differentially Private Data Release for Data Sharing and Data Mining
15 Domain Driven Data Mining (D3M)
Notice of Retraction<br>The research of building production-oriented data mart for mine
16 enterprises based on data mining
17 Study on land use of changping district with spatial data mining method
Linked Open Data mining for democratization of big dataData Mining Library for Big Data
18 Processing Platforms: A Case Study-Sparkling Water Platform
19 Towards Data-Oriented Hospital Services: Data Mining-Based Hospital Management
20 Developing an Integrated Time-Series Data Mining Environment for Medical Data Mining
Answers:
1.
         rating     s1        s2        s3        s4        s5        s6        s7        s8
rating   1
s1      -0.17436    1
s2      -0.06547   -0.1131    1
s3       0.344      0.137016  0.0608    1
s4       0.059      0.1346   -0.07058  -0.54144   1
s5      -0.07835    0.420013  0.167015 -0.54634   0.670483  1
s6      -0.172     -0.2048   -0.00699  -0.33     -0.13      0.170926  1
s7       0.567964  -0.183    -0.38095   0.360588  0.061467 -0.278    -0.12438  1
s8       0.3814    -0.02187  -0.01941  -0.45844   0.668603  0.474526 -0.07907   0.361009  1
rating: mean = 4.05, σ = 0.522
s1: mean = 1.15, σ = 0.357
s2: mean = 1.55, σ = 0.8
s3: mean = 3.75, σ = 0.77
s4: mean = 2.3, σ = 1.144
s5: mean = 2.05, σ = 1.28
s6: mean = 2.1, σ = 0.88
s7: mean = 4.55, σ = 0.5
s8: mean = 2.4, σ = 1.2806
         rating     s1        s2        s3        s4        s5        s6        s7        s8
rating   1
s1      -0.15       1
s2      -0.072     -0.11311   1
s3       0.326467   0.130165  0.057761  1
s4       0.055651   0.127848 -0.06705  -0.51437   1
s5      -0.07444    0.399013  0.158665 -0.51902   0.636959  1
s6      -0.1638    -0.19457  -0.00664  -0.31375  -0.1214    0.16238   1
s7       0.539566  -0.17381  -0.3619    0.342559  0.058394 -0.26407  -0.11816  1
s8       0.362375  -0.02078  -0.01844  -0.43552   0.635173  0.4508   -0.07512   0.342959  1
rating: mean = 4.05, σ = 0.535576
s1: mean = 1.15, σ = 0.366348
s2: mean = 1.55, σ = 0.825578
s3: mean = 3.75, σ = 0.786398
s4: mean = 2.3, σ = 1.174286
s5: mean = 2.05, σ = 1.316894
s6: mean = 2.1, σ = 0.91191
s7: mean = 4.55, σ = 0.5
s8: mean = 2.4, σ = 1.313893
2. Generate WAS
Hint: create a data warehouse consisting of three dimension tables and one fact table as shown below:
1 session = 1 hour
3.
VSM:
For each ti in L
Else count ti in T;
Add Tm into M
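The fragment above counts each vocabulary term in every title; a minimal runnable sketch of the same document-term (VSM) construction, assuming scikit-learn is available and using two of the titles as a tiny illustrative corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Spatial and Spatio-temporal Data Mining",
        "A Data Stream Mining System"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)       # document-term matrix: rows = docs, columns = terms

print(vectorizer.get_feature_names_out())  # the vocabulary (scikit-learn >= 1.0)
print(dtm.toarray())                       # term counts per document
```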
Lecture 5:
Evaluating what’s been
learned
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)
Issues in evaluation
Predicting performance
Confidence intervals
Confidence limits
Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X ≥ z]    z
  0.1%         3.09
  0.5%         2.58
  1%           2.33
  5%           1.65
  10%          1.28
  20%          0.84
  40%          0.25

Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
Transforming f
Resulting equation: $\Pr\left[-z \le \frac{f-p}{\sqrt{p(1-p)/N}} \le z\right] = c$
Solving for p: $p = \left(f + \frac{z^2}{2N} \pm z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}\right) \Big/ \left(1 + \frac{z^2}{N}\right)$
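A small sketch that evaluates this interval for an observed success/error rate f over N test instances (z = 1.65 corresponds to c = 90%, per the table above):

```python
import math

def confidence_interval(f, N, z=1.65):
    """Interval for the true proportion p given observed proportion f on N trials."""
    center = f + z * z / (2 * N)
    spread = z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (center - spread) / denom, (center + spread) / denom

print(confidence_interval(0.75, 1000))  # roughly (0.727, 0.772)
```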
Examples
Holdout estimation
Cross-validation
More on cross-validation
Leave-One-Out cross-validation
Leave-One-Out:
a particular form of cross-validation:
Set number of folds to number of training instances
I.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive
(exception: NN)
Disadvantage of Leave-One-Out-CV:
stratification is not possible
It guarantees a non-stratified sample because
there is only one instance in the test set!
The bootstrap
Comparing schemes II
Want to show that scheme A is better than scheme B in
a particular domain
For a given amount of training data
On average, across all possible training sets
Let's assume we have an infinite amount of data from
the domain:
Sample infinitely many datasets of a specified size
Obtain cross-validation estimate on each dataset for
each scheme
Check if mean accuracy for scheme A is better than
mean accuracy for scheme B
Predicting probabilities
Want to minimize
Discussion
Cost-sensitive classification
Cost-sensitive learning
Lift charts
In practice, costs are rarely known
Decisions are usually made by comparing possible
scenarios
Example: promotional mailout to 1,000,000
households
Mail to all; 0.1% respond (1000)
Data mining tool identifies subset of 100,000 most
promising, 0.4% of these respond (400)
40% of responses for 10% of cost may pay off
Identify subset of 400,000 most promising, 0.2% respond
(800)
A lift chart allows a visual comparison
ROC curves
More measures...
Percentage of retrieved documents that are relevant:
precision=TP/(TP+FP)
Percentage of relevant documents that are returned:
recall =TP/(TP+FN)
Precision/recall curves have hyperbolic shape
Summary measures: average precision at 20%, 50% and 80% recall
(three-point average recall)
F-measure=(2 × recall × precision)/(recall+precision)
sensitivity × specificity = (TP / (TP + FN)) × (TN / (FP + TN))
Area under the ROC curve (AUC):
probability that randomly chosen positive instance is ranked above
randomly chosen negative one
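A minimal sketch of these formulas starting from confusion-matrix counts; the TP/FP/FN/TN numbers are made up for illustration:

```python
TP, FP, FN, TN = 60, 20, 15, 105   # illustrative confusion-matrix counts

precision   = TP / (TP + FP)
recall      = TP / (TP + FN)       # also called sensitivity
f_measure   = 2 * recall * precision / (recall + precision)
specificity = TN / (FP + TN)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F-measure={f_measure:.3f} specificity={specificity:.3f}")
```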
Other measures
The root mean-squared error:
$\sqrt{\dfrac{(p_1 - a_1)^2 + \cdots + (p_n - a_n)^2}{n}}$
Correlation coefficient
Measures the statistical correlation between the
predicted values and the actual values
Which measure?
Best to look at all of them
Often it doesn’t matter
Example:
                          A      B      C      D
Root mean-squared error   67.8   91.7   63.3   57.4
Mean absolute error       41.3   38.5   33.4   29.2
Root rel squared error    42.2%  57.2%  39.4%  35.8%
Relative absolute error   43.1%  40.1%  34.8%  30.4%
Correlation coefficient   0.88   0.88   0.89   0.91
●D best
●C second-best
●A, B arguable
Equivalent to: $-\log \Pr[T \mid E] = -\log \Pr[E \mid T] - \log \Pr[T] + \log \Pr[E]$
where the last term, $\log \Pr[E]$, is constant
Lecture 6:
Data mining algorithms:
Classification
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)
References:
Chapter 8 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber
Basic concepts
Decision tree Induction
Bayes Classification Methods
Rule-based Classification
Model Evaluation and Selection
[Figure: the classification process. Training Data and a Classification Algorithm produce a Classifier, which is then applied to Testing Data and to Unseen Data, e.g., (Jeff, Professor, 4): Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Basic concepts
Decision tree Induction
Bayes Classification Methods
Rule-based Classification
Model Evaluation and Selection
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Training tuples (age, income, student, credit_rating, buys_computer):
>40     low     yes  excellent  no
31…40   low     yes  excellent  yes
<=30    medium  no   fair       no
<=30    low     yes  fair       yes
>40     medium  yes  fair       yes
<=30    medium  yes  excellent  yes
31…40   medium  no   excellent  yes
31…40   high    yes  fair       yes
>40     medium  no   excellent  no
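For reference, a small sketch of how such information-gain values are computed from entropy; run on the partial table above it will not reproduce the full-dataset gains quoted on the slide, it only illustrates the calculation:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Expected reduction in entropy from splitting the rows on one attribute."""
    total = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return total - remainder

# Partial training tuples from the table above: (age, income, student, credit_rating)
rows = [(">40", "low", "yes", "excellent"), ("31..40", "low", "yes", "excellent"),
        ("<=30", "medium", "no", "fair"), ("<=30", "low", "yes", "fair"),
        (">40", "medium", "yes", "fair"), ("<=30", "medium", "yes", "excellent"),
        ("31..40", "medium", "no", "excellent"), ("31..40", "high", "yes", "fair"),
        (">40", "medium", "no", "excellent")]
labels = ["no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"]

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(info_gain(rows, i, labels), 3))
```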
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is much
smaller than the others
Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and purity in
both partitions
Basic concepts
Decision tree Induction
Bayes Classification Methods
Rule-based Classification
Model Evaluation and Selection
Bayes' Theorem: $P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} = P(X \mid H)\,P(H)\,/\,P(X)$
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (i.e., posteriori probability):
the probability that the hypothesis holds given the observed data
sample X
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given
that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} = P(X \mid H)\,P(H)\,/\,P(X)$
Informally, this can be viewed as
posteriori = likelihood x prior/evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
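To make the idea concrete, a hedged sketch of a naive Bayesian classifier for categorical attributes; it assumes class-conditional independence (the naive Bayes simplification), and the tiny training set is illustrative only:

```python
from collections import Counter, defaultdict

# Illustrative training tuples: (age, student) -> buys_computer
rows = [("<=30", "no", "no"), ("<=30", "yes", "yes"), ("31..40", "yes", "yes"),
        (">40", "yes", "yes"), (">40", "no", "no"), ("31..40", "no", "yes")]

classes = Counter(r[-1] for r in rows)
cond = defaultdict(Counter)                     # cond[(attribute index, class)][value] = count
for *attrs, label in rows:
    for i, v in enumerate(attrs):
        cond[(i, label)][v] += 1

def predict(x):
    best, best_score = None, -1.0
    for c, n_c in classes.items():
        score = n_c / len(rows)                 # prior P(C)
        for i, v in enumerate(x):
            score *= cond[(i, c)][v] / n_c      # likelihood P(x_i | C), no Laplace smoothing
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(("<=30", "yes")))   # 'yes' with this toy data
```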
Basic concepts
Decision tree Induction
Bayes Classification Methods
Rule-based Classification
Model Evaluation and Selection
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
[Figure: a decision tree splitting on age (<=30, 31..40, >40), then student? (no/yes) and credit rating? (excellent/fair), with yes/no leaves.]
[Figure: sequential covering, positive examples and the regions of examples covered by Rule 1, Rule 2, and Rule 3.]
Rule Generation
To generate a rule:
    while (true)
        find the best predicate p
        if foil-gain(p) > threshold then add p to the current rule
        else break
[Figure: growing a rule over positive vs. negative examples: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5.]
How to Learn-One-Rule?
Start with the most general rule possible: condition = empty
Adding new attributes by adopting a greedy depth-first strategy
Picks the one that most improves the rule quality
Rule-Quality measures: consider both coverage and accuracy
Foil-gain (in FOIL & RIPPER): assesses info_gain by extending the condition
$FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)$
favors rules that have high accuracy and cover many positive tuples
Basic concepts
Decision tree Induction
Bayes Classification Methods
Rule-based Classification
Model Evaluation and Selection
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e.,each time a tuple is selected, it is equally likely to be
selected again and re-added to the training set
Several bootstrap methods, and a common one is .632 bootstrap
A data set with d tuples is sampled d times, with replacement,
resulting in a training set of d samples. The data tuples that did
not make it into the training set end up forming the test set.
About 63.2% of the original data end up in the bootstrap, and the
remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
Repeat the sampling procedure k times; the overall accuracy of the
model: $Acc(M) = \frac{1}{k}\sum_{i=1}^{k}\left(0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set}\right)$
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Summary (I)
Classification is a form of data analysis that extracts models
describing important data classes.
Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
Evaluation metrics include: accuracy, sensitivity, specificity,
precision, recall, F measure, and Fß measure.
Stratified k-fold cross-validation is recommended for accuracy
estimation. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.
Summary (II)
Introduction to Data Mining
Lecture 6 – Activities
The following table consists of training data from an employee database. The data have been
generalized. For example, “31 . . . 35” for age represents the age range of 31 to 35. For a given
row entry, count represents the number of data tuples having the values for department, status, age,
and salary given in that row.
(a) How would you modify the basic decision tree algorithm to take into consideration the count
of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
(c) Given a data tuple having the values “systems”, “26. . . 30”, and “46–50K” for the attributes
department, age, and salary, respectively, what would a Naïve Bayesian classification of the status
for the tuple be?
Answers
(a) How would you modify the basic decision tree algorithm to take into consideration the count
of each generalized data tuple (i.e., of each row entry)?
The basic decision tree algorithm should be modified as follows to take into consideration the
count of each generalized data tuple.
• The count of each tuple must be integrated into the calculation of the attribute selection measure
(such as information gain).
• Take the count into consideration to determine the most common class among the tuples.
(b) Use your algorithm to construct a decision tree from the given data.
The resulting tree is:
Info(D) = 0.899
Gain(Dept) = 0.0488
Gain(Age) = 0.4247
Gain(Salary) = 0.5375
(c) Given a data tuple having the values “systems”, “26. . . 30”, and “46–50K” for the attributes
department, age, and salary, respectively, what would a Naïve Bayesian classification of the status
for the tuple be?
P(X|senior) = 0; in this case the Laplacian correction was not used (one more tuple for each
age-value pair should be added to avoid the zero probability).
P(X|junior) = 23/113 × 49/113 × 23/113 = 0.018. Thus, a Naïve Bayesian classification predicts
“junior”.