Preprocessing - M2


2.

PREPROCESSING FOR DATA MINING

Santi Wulan Purnami


Why data preprocessing?
1. Data in the real world is dirty
¤ incomplete (missing values)
¤ noisy (containing errors or outliers)
¤ inconsistent (different coding, different naming, impossible values, or out-of-range values)
2. For quality mining results, quality data is needed
3. Pre-processing is an important step for successful
data mining
Data Preprocessing Tasks
¨ Data cleaning
(fill in missing values, smooth noisy data,
identify/remove outliers, resolve inconsistencies)
¨ Data transformation
(normalization, scaling)
¨ Data reduction (feature selection & feature
extraction, sampling)
(reduce volume of data)
Major Tasks in Data Preprocessing

¨ Data cleaning
¤ Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
¨ Data transformation
¤ Normalization and aggregation
¨ Data reduction
¤ Obtains reduced representation in volume but produces the same or similar
analytical results
¨ Data discretization
¤ Part of data reduction but with particular importance, especially for
numerical data
¨ Data integration
¤ Integration of multiple databases, data cubes, or files
Forms of data preprocessing
Data Cleaning

¨ Data cleaning tasks


¤ Fill in missing values
¤ Identify outliers and smooth out noisy data
¤ Correct inconsistent data
Missing Data

¨ Data is not always available


¤ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data

¨ Missing data may be due to


¤ equipment malfunction
¤ inconsistent with other recorded data and thus deleted
¤ data not entered due to misunderstanding
¤ certain data may not be considered important at the time of entry
¤ history or changes of the data were not registered

¨ Missing data may need to be inferred.


How to Handle Missing Data?

¨ Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably
¨ Fill in the missing value manually: tedious and often infeasible
¨ Use the attribute mean to fill in the missing value
¨ Use the most probable value to fill in the missing value:
inference-based
Imputation
The simplest approach is mean imputation:
¨ Syntax R

¨ x[is.na(x)] <- mean(x, na.rm = TRUE)

Imputation based on modeling
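A minimal sketch of regression-based imputation in R (the data frame dat and the columns x and y are illustrative names, not from the slides): fit a model on the complete cases, then predict the missing entries.

Syntax R:
complete_rows <- !is.na(dat$y)
fit <- lm(y ~ x, data = dat[complete_rows, ])                           # fit on complete cases only
dat$y[!complete_rows] <- predict(fit, newdata = dat[!complete_rows, ])  # impute the predictions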


Identifying outliers
¨ Reference:
Hair, J.F., Anderson, R.E., Tatham, R.L. and Black,
W.C. 2010. Multivariate Data Analysis, Seventh
edition, Prentice Hall International: UK
Outlier detection
¨ Univariate outlier detection
- scatter plot
- box plot
- standardized data

¨ Multivariate outlier detection

See page 75 of Multivariate Data Analysis (Hair et al., 2006)
Univariate outlier: scatter plot
¨ Example : HBAT.SAV
¨ Dependent var : X19
¨ Independent var : X6 – X18
Univariate outlier: box plot
Univariate outlier: standardized data

¨ One way to identify univariate outliers is to convert


all of the scores for a variable to standard scores.

¨ If the sample size is small (80 or fewer cases), a


case is an outlier if its standard score is ±2.5 or
beyond.

¨ If the sample size is larger than 80 cases, a case is


an outlier if its standard score is ±3.0 or beyond
Steps:
1. Convert the values to z-scores
2. Sort the z-scores in descending order

Example (case numbers flagged as univariate outliers for each variable):
¨ X7 : 13, 22, 90
¨ X8 : 87
¨ X9 - X10 : no cases
¨ X11 : 7
¨ X12 : 90
¨ X13 : no cases
¨ X14 : 77
¨ X15 : 6, 53
¨ X16 : 24
¨ X17 : no cases
¨ X18 : 7, 84
¨ X19 : 22
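A minimal R sketch of these steps (assuming the HBAT data has been loaded into a data frame dat and that x7 is one of its columns; the 2.5 cut-off applies to samples of 80 or fewer cases, 3.0 to larger samples):

Syntax R:
z <- scale(dat$x7)             # step 1: convert to z-scores
dat[order(-abs(z)), ]          # step 2: sort cases by |z|, largest first
which(abs(z) > 2.5)            # cases beyond the +/-2.5 threshold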
Multivariate Outlier
¨ Mahalanobis D2 is a multidimensional version of a
z-score.

¨ A case is a multivariate outlier if D2 / df > 2.5,
where df = number of variables

¨ Mahalanobis D2 requires that the variables be metric.
Mahalanobis D2 is computed by Regression (SPSS):
1. Add the independent variables
2. Add an arbitrary dependent variable
3. Add Mahalanobis D2 to the dataset
4. Specify saving the Mahalanobis D2 distance
5. Specify the statistics output needed
   (To understand why a particular case is an outlier, examine the
   descriptive statistics for each variable: click on the Statistics…
   button to request the statistics.)
6. Request descriptive statistics
7. Complete the request for Mahalanobis D2
8. Mahalanobis D2 scores appear in the data editor
Example :

¨ Dependent var : X19


¨ Independent var : X6 – X18
¨ Df = 13

¨ Multivariate outliers: cases 98 and 36
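The same D2 values can be obtained outside SPSS with base R's mahalanobis() function; a minimal sketch, assuming the variables are stored in a data frame dat with columns named x6 through x18:

Syntax R:
X  <- dat[, paste0("x", 6:18)]                 # the 13 metric variables
D2 <- mahalanobis(X, colMeans(X), cov(X))      # squared Mahalanobis distances
which(D2 / ncol(X) > 2.5)                      # cases where D2/df exceeds 2.5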
Noisy Data
¨ Noise: random error or variance in a measured variable
¨ Incorrect attribute values may be due to
¤ faulty data collection instruments
¤ data entry problems
¤ data transmission problems
¤ technology limitation
¤ inconsistency in naming convention
¨ Other data problems which require data cleaning
¤ duplicate records
¤ incomplete data
¤ inconsistent data
How to Handle Noisy Data?
¨ Binning method:
¤ first sort data and partition into (equi-depth) bins
¤ then smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
¨ Clustering
¤ detect and remove outliers
¨ Combined computer and human inspection
¤ detect suspicious values and check by human
¨ Regression
¤ smooth by fitting the data into regression functions
Simple Discretization Methods: Binning

¨ Equal-width (distance) partitioning:


¤ It divides the range into N intervals of equal size: uniform grid
¤ if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B-A)/N.
¤ The most straightforward approach
¤ But outliers may dominate presentation
¤ Skewed data is not handled well.
¨ Equal-depth (frequency) partitioning:
¤ It divides the range into N intervals, each containing approximately the same
number of samples
¤ Good data scaling
¤ Managing categorical attributes can be tricky.
Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
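A minimal R sketch of equal-depth binning and smoothing by bin means, using the price values above:

Syntax R:
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)   # already sorted
bins  <- rep(1:3, each = 4)                                # equi-depth: 4 values per bin
round(ave(price, bins, FUN = mean))                        # 9 9 9 9 23 23 23 23 29 29 29 29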
Data Preprocessing

¨ Why preprocess the data?


¨ Data cleaning
¨ Data integration and transformation
¨ Data reduction
¨ Discretization and concept hierarchy generation
¨ Summary
Data Integration
¨ Data integration:
¤ combines data from multiple sources into a coherent store
¨ Schema integration
¤ integrate metadata from different sources
¤ Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
¨ Detecting and resolving data value conflicts
¤ for the same real world entity, attribute values from different
sources are different
¤ possible reasons: different representations, different scales,
e.g., metric vs. British units
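A minimal R sketch of joining two sources once the entity identification problem has been resolved (the data frames sales_a and sales_b and their key columns are illustrative names):

Syntax R:
# source A calls the customer key "cust_id", source B calls it "cust_num"
combined <- merge(sales_a, sales_b, by.x = "cust_id", by.y = "cust_num")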
Handling Redundant Data
¨ Redundant data often occur when integrating multiple databases
¤ The same attribute may have different names in different
databases. Careful integration of the data from multiple
sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Data Transformation

Normalization:
- scaled to fall within a small, specified range
¤ min-max normalization
¤ z-score normalization

¤ normalization by decimal scaling

Syntax R (normalize is a user-defined min-max function, not a built-in):
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
YourNormalizedDataSet <- as.data.frame(lapply(YourDataSet, normalize))
Data Transformation:
Normalization

¨ min-max normalization
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

¨ z-score normalization
  v' = (v - mean_A) / stand_dev_A

¨ normalization by decimal scaling
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
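A minimal R sketch of the three normalizations as functions of a numeric vector v (new_min and new_max denote the target range):

Syntax R:
min_max   <- function(v, new_min = 0, new_max = 1)
  (v - min(v)) / (max(v) - min(v)) * (new_max - new_min) + new_min
z_score   <- function(v) (v - mean(v)) / sd(v)
dec_scale <- function(v) v / 10^(floor(log10(max(abs(v)))) + 1)   # smallest j with max|v'| < 1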
Data transformation

¨ Normalizing or scaling can improve the classification accuracy
¤ C.W.Hsu, C.C.Chang, and C.J. Lin, “A Practical Guide to Support Vector
Classification”, Department of Computer Science and Information
Engineering, National Taiwan University, last updated, 2008
¤ S. Ali and K.A. Smith Miles, “Improved Support Vector Machine
Generalization Using Normalized Input Space”, Lecture Notes in Artificial
Intelligence- Volume 4304, J.G. Carbonell and J. Siekmann (eds.), Springer
Verlag, Germany, pp. 361-271, 2006
¤ Purnami, S.W. and Abdullah Embong. 2008. Smooth Support Vector
Machine for Breast Cancer Classification, The 4th IMT-GT 2008 Conference
on Mathematics, Statistics, and Their Applications (ICMSA08), Banda Aceh,
Indonesia.
Data Reduction Strategies

¨ Warehouse may store terabytes of data: Complex data


analysis/mining may take a very long time to run on the
complete data set
¨ Data reduction
¤ Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
¨ Data reduction strategies
¤ Dimensionality reduction
¤ Discretization
Dimensionality Reduction
¨ Feature selection (i.e., attribute subset selection):
¤ Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution given
the values of all features
¤ reduces the number of attributes appearing in the patterns, making them easier to understand

¨ Feature extraction
¨ Clustering
¨ Sampling
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

Induced tree (the root splits on A4, its branches split on A1 and A6):
A4?
├─ A1? → Class 1 / Class 2
└─ A6? → Class 1 / Class 2

> Reduced attribute set: {A1, A4, A6}
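A sketch of the same idea in R with the rpart package (not part of the slides; dat, the class column, and attributes A1–A6 are illustrative names): the attributes actually used as splits form the reduced set.

Syntax R:
library(rpart)
fit  <- rpart(class ~ A1 + A2 + A3 + A4 + A5 + A6, data = dat)
used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")   # attributes used in splits
used                                                             # reduced attribute set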


Sampling Techniques
¨ Random samples
¤ Selected using chance or random methods
¨ Systematic samples
¤ Number each subject of the population and select every
kth subject
¨ Stratified samples
¤ Divide the population into groups according to some
characteristic that is important to the study, then sample
from each group
¨ Cluster samples
¤ Divide the population into sections (clusters), randomly
select some of those clusters, and then choose all
members from the selected clusters
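Minimal R sketches of the four schemes (dat, the sample size n, and a grouping column group are illustrative names):

Syntax R:
dat[sample(nrow(dat), n), ]                                 # simple random sample of n rows
k <- floor(nrow(dat) / n); start <- sample(k, 1)
dat[seq(start, nrow(dat), by = k), ]                        # systematic: every k-th row
do.call(rbind, lapply(split(dat, dat$group),
        function(g) g[sample(nrow(g), n), ]))               # stratified: n rows per group
dat[dat$group %in% sample(unique(dat$group), 2), ]          # cluster: 2 whole clusters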
Random sampling (illustration)
Systematic sampling (illustration)
Stratified sampling (illustration)
Cluster sampling (illustration)
Sampling
¨ SRSWOR: simple random sampling without replacement
¨ SRSWR: simple random sampling with replacement
Both draw the sample from the raw data.
Data Preprocessing

¨ Why preprocess the data?


¨ Data cleaning
¨ Data integration and transformation
¨ Data reduction
¨ Discretization and concept hierarchy generation
¨ Summary
Discretization
¨ Three types of attributes:
¤ Nominal — values from an unordered set
¤ Ordinal — values from an ordered set
¤ Continuous — real numbers

¨ Discretization:
¤ divide the range of a continuous attribute into intervals
¤ Some classification algorithms only accept categorical
attributes.
¤ Reduce data size by discretization
¤ Prepare for further analysis
Discretization and Concept Hierarchy

¨ Discretization
¤ reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace actual
data values.
¨ Concept hierarchies
¤ reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
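A minimal R sketch of such a concept hierarchy for age (the cut points 35 and 55 are illustrative, not from the slides):

Syntax R:
age_group <- cut(age, breaks = c(0, 35, 55, Inf),
                 labels = c("young", "middle-aged", "senior"))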
Discretization for numeric data

¨ Binning (see sections before)

¨ Histogram analysis (see sections before)

¨ Clustering analysis (see sections before)


Summary

¨ Data preparation is a big issue for both warehousing and


mining
¨ Data preparation includes
¤ Data cleaning and data integration
¤ Data reduction and feature selection
¤ Discretization
¨ A lot of methods have been developed, but this is still an active
area of research
EXERCISE: IRIS DATA
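A possible starting point in R for the exercise, using the built-in iris data set (the choice of steps is only a suggestion):

Syntax R:
data(iris)
sum(is.na(iris))                          # check for missing values
z <- as.data.frame(scale(iris[, 1:4]))    # z-score normalization of the numeric attributes
boxplot(z)                                # quick univariate outlier check
D2 <- mahalanobis(iris[, 1:4], colMeans(iris[, 1:4]), cov(iris[, 1:4]))
which(D2 / 4 > 2.5)                       # possible multivariate outliers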
