Preprocessing - M2
¨ Data cleaning
¤ Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
¨ Data transformation
¤ Normalization and aggregation
¨ Data reduction
¤ Obtains reduced representation in volume but produces the same or similar
analytical results
¨ Data discretization
¤ Part of data reduction but with particular importance, especially for
numerical data
¨ Data integration
¤ Integration of multiple databases, data cubes, or files
Forms of data preprocessing
Data Cleaning
Example:
¨ X7 : 13, 22, 90
¨ X8 : 87
¨ X9-X10 : No cases
¨ X11 : 7
¨ X12 : 90
¨ X13 : No cases
¨ X14 : 77
¨ X15 : 6, 53
¨ X16 : 24
¨ X17 : No cases
¨ X18 : 7, 84
¨ X19 : 22
Multivariate Outlier
¨ Mahalanobis D² is a multidimensional version of a z-score.
¨ df (degrees of freedom for the chi-square test of D²) = number of variables
¨ To understand why a particular case is an outlier, we want to examine the descriptive statistics for each variable.
¨ Multivariate outliers:
¤ Cases 98 and 36
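One way to flag multivariate outliers in R is sketched below. It uses base R's `mahalanobis()` and compares D² against a chi-square cutoff with df equal to the number of variables. The data frame `x`, the planted outlier in row 50, and the 0.999 cutoff probability are all assumptions made for illustration; they are not the data set behind the case numbers above.

```r
# Hypothetical data: 50 cases, 3 numeric variables, one planted outlier.
set.seed(1)
x <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
x[50, ] <- c(8, -8, 8)

m <- as.matrix(x)

# Squared Mahalanobis distance of every case from the variable means.
d2 <- mahalanobis(m, center = colMeans(m), cov = cov(m))

# Chi-square cutoff with df = number of variables (0.999 is an assumed level).
cutoff <- qchisq(0.999, df = ncol(m))

# Case numbers whose D² exceeds the cutoff.
outliers <- which(d2 > cutoff)
print(outliers)
```

Once a case is flagged, `summary(x)` gives the per-variable descriptive statistics needed to see which variables make it unusual.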
Noisy Data
¨ Noise: random error or variance in a measured variable
¨ Incorrect attribute values may be due to
¤ faulty data collection instruments
¤ data entry problems
¤ data transmission problems
¤ technology limitation
¤ inconsistency in naming convention
¨ Other data problems that require data cleaning
¤ duplicate records
¤ incomplete data
¤ inconsistent data
How to Handle Noisy Data?
¨ Binning method:
¤ first sort data and partition into (equi-depth) bins
¤ then smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
¨ Clustering
¤ detect and remove outliers
¨ Combined computer and human inspection
¤ detect suspicious values automatically, then have a human check them
¨ Regression
¤ smooth by fitting the data into regression functions
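Smoothing by regression can be sketched in a few lines of R. The `x` and `y` values here are hypothetical, and a straight-line model is assumed; other regression functions could be fitted in the same way.

```r
# Hypothetical noisy measurements that roughly follow a straight line.
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)

# Fit a simple linear regression of y on x.
fit <- lm(y ~ x)

# The fitted values are the smoothed version of y.
smoothed <- fitted(fit)
print(round(smoothed, 2))
```

Each noisy value is replaced by the point on the fitted line at the same `x`, so the random variation around the trend is removed.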
Simple Discretization Methods: Binning
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
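The price example can be reproduced with a short R sketch. It assumes the number of values divides evenly into the bins, and that a value exactly halfway between the two bin boundaries goes to the lower one; the helper `smooth_boundary` is a hypothetical name introduced here.

```r
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)
n_bins <- 3

# Sort, then assign equal-sized (equi-depth) bins.
sorted <- sort(price)
bin_id <- rep(seq_len(n_bins), each = length(sorted) / n_bins)

# Smoothing by bin means: every value becomes the mean of its bin.
by_mean <- ave(sorted, bin_id, FUN = mean)

# Smoothing by bin boundaries: every value becomes the closer of the
# bin's minimum and maximum (ties go to the minimum, by assumption).
smooth_boundary <- function(v) {
  lo <- min(v)
  hi <- max(v)
  ifelse(v - lo <= hi - v, lo, hi)
}
by_boundary <- ave(sorted, bin_id, FUN = smooth_boundary)

print(round(by_mean))
print(by_boundary)
```

The exact bin means are 9, 22.75, and 29.25, which round to the values 9, 23, and 29 shown above.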
Data Preprocessing
Normalization:
- scaled to fall within a small, specified range
¤ min-max normalization
¤ z-score normalization
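Syntax R (normalize() is not built into base R; a min-max helper is assumed here):
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
YourNormalizedDataSet <- as.data.frame(lapply(YourDataSet, normalize))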
Data Transformation:
Normalization
¨ min-max normalization
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$
¨ z-score normalization
$$v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$$
¨ normalization by decimal scaling
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that max(|v'|) < 1
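The three formulas translate directly into R. The function names below are hypothetical. Two assumptions are worth noting: `sd()` computes the sample standard deviation (n - 1 denominator), and the decimal-scaling loop only searches j = 0, 1, 2, ..., so data already inside (-1, 1) is returned unchanged.

```r
# Min-max normalization to the range [new_min, new_max].
min_max <- function(v, new_min = 0, new_max = 1) {
  (v - min(v)) / (max(v) - min(v)) * (new_max - new_min) + new_min
}

# Z-score normalization (uses the sample standard deviation).
z_score <- function(v) {
  (v - mean(v)) / sd(v)
}

# Decimal scaling: divide by 10^j, with j the smallest non-negative
# integer that brings every absolute value below 1.
decimal_scaling <- function(v) {
  j <- 0
  while (max(abs(v)) / 10^j >= 1) {
    j <- j + 1
  }
  v / 10^j
}

print(min_max(c(10, 20, 30)))
print(z_score(c(10, 20, 30)))
print(decimal_scaling(c(-986, 917)))
```

For the vector (-986, 917), the largest absolute value is 986, so j = 3 and the scaled values are -0.986 and 0.917.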
Data transformation
¨ Feature extraction
¨ Clustering
¨ Sampling
[Figure: example of decision tree induction, with internal nodes testing A4, A1, and A6]
¨ Random samples
¤ Selected using chance or random methods
¨ Systematic samples
¤ Number each subject in the population, then select every
kth subject
¨ Stratified samples
¤ Divide the population into groups according to some
characteristic that is important to the study, then sample
from each group
¨ Cluster samples
¤ Divide the population into sections/clusters, randomly
select some of those clusters, then choose all members of
the selected clusters
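The four sampling schemes can be sketched in base R as follows. The `population` data frame, its `group` column (used both as the stratum and as the cluster label), and the sample sizes are hypothetical choices made only for illustration.

```r
set.seed(42)

# Hypothetical population: 100 subjects in 4 groups of 25.
population <- data.frame(id = 1:100,
                         group = rep(c("A", "B", "C", "D"), each = 25))
n <- 10

# Random sample: n subjects chosen by chance.
random_ids <- sample(population$id, n)

# Systematic sample: random start, then every kth subject.
k <- nrow(population) / n
start <- sample(k, 1)
systematic_ids <- seq(from = start, by = k, length.out = n)

# Stratified sample: draw 3 subjects from every group.
stratified_ids <- unlist(
  lapply(split(population$id, population$group),
         function(ids) sample(ids, 3)),
  use.names = FALSE
)

# Cluster sample: pick 2 whole groups and keep all their members.
chosen_clusters <- sample(unique(population$group), 2)
cluster_ids <- population$id[population$group %in% chosen_clusters]
```

Stratified sampling takes a few members from every group, whereas cluster sampling takes every member from a few groups.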
[Figures: random sampling, systematic sampling, stratified sampling, cluster sampling]
Sampling
[Figure: from the raw data, draw either an SRSWOR (simple random sample without replacement) or an SRSWR (simple random sample with replacement)]
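The difference between sampling without and with replacement comes down to the `replace` argument of base R's `sample()`. The `raw` vector and the sample size of 5 are hypothetical.

```r
set.seed(7)
raw <- 1:20  # hypothetical raw data (record numbers)

# SRSWOR: each record can be selected at most once.
srswor <- sample(raw, 5, replace = FALSE)

# SRSWR: a record goes back into the pool after each draw,
# so the same record may appear more than once.
srswr <- sample(raw, 5, replace = TRUE)

print(srswor)
print(srswr)
```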
Data Preprocessing
¨ Discretization:
☛ divide the range of a continuous attribute into intervals
¤ Some classification algorithms only accept categorical
attributes.
¤ Reduce data size by discretization
¤ Prepare for further analysis
Discretization and Concept hierarchy
¨ Discretization
¤ reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace actual
data values.
¨ Concept hierarchies
¤ reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
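As a small illustration, base R's `cut()` can replace numeric ages with interval labels that double as higher-level concepts. The ages and the cut points (30 and 60) are assumptions chosen for the example; the source does not say where the boundaries lie.

```r
# Hypothetical ages.
age <- c(13, 25, 34, 47, 58, 61, 79)

# Intervals [0, 30), [30, 60), [60, Inf), labelled with the
# higher-level concepts. The break points are assumed.
age_group <- cut(age,
                 breaks = c(0, 30, 60, Inf),
                 labels = c("young", "middle-aged", "senior"),
                 right = FALSE)

print(data.frame(age, age_group))
print(table(age_group))
```

Seven distinct numeric values are reduced to three categorical labels, which is the form that classifiers restricted to categorical attributes need.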
Discretization for numeric data