UNIT - 2. Data Science 04.09.18
PREPROCESSING
Introduction
Data Preprocessing: An Overview
2. Major Tasks in Data Preprocessing
We look at the major steps involved in data preprocessing, namely, data cleaning, data
integration, data reduction, and data transformation.
Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies.
You may need to include data from multiple sources in your analysis; this involves integrating
multiple databases, data cubes, or files, i.e., data integration.
Data reduction obtains a reduced representation of the data set that is much smaller
in volume, yet produces the same (or almost the same) analytical results.
Discretization and concept hierarchy generation are powerful tools for data mining in that
they allow data mining at multiple abstraction levels. Normalization, data discretization,
and concept hierarchy generation are forms of data transformation.
Data Cleaning
Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth
out noise while identifying outliers, and correct inconsistencies in the data.
1. Missing Values.
– Ignore the tuple
– Fill in the missing value manually
– Use a global constant to fill in the missing value
– Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value
– Use the attribute mean or median for all samples belonging to the same class as the
given tuple (see the sketch after this list)
– Use the most probable value to fill in the missing value
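As an illustration of the central-tendency strategies above, a minimal pandas sketch (the column names and values are assumed for illustration, not from the slides):

import numpy as np
import pandas as pd

# Hypothetical data with a missing income value
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [30000.0, np.nan, 52000.0, 48000.0],
})

# Fill with a measure of central tendency for the whole attribute (the median)
df["income_filled_median"] = df["income"].fillna(df["income"].median())

# Fill with the mean of all samples belonging to the same class as the given tuple
df["income_filled_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))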
2. Noisy Data
– Noise is a random error or variance in a measured variable.
– Basic smoothing techniques include binning (smoothing by bin means, medians, or boundaries),
regression (fitting the data to a function), and outlier analysis (e.g., detecting outliers by clustering).
3. Data Cleaning as a Process
– Discrepancy detection.
– Discrepancies can be caused by several factors, including poorly designed data entry forms
that have many optional fields, human error in data entry, deliberate errors, and data decay.
– Discrepancies may also arise from inconsistent data representations and inconsistent
use of codes.
– Other sources of discrepancies include errors in instrumentation devices that record data
and system errors
– Errors can also occur when the data are used for purposes other than originally intended
– There may also be inconsistencies due to data integration
There are a number of different commercial tools that can aid in the
discrepancy detection step.
– Data scrubbing tools use simple domain knowledge to detect errors and
make corrections in the data. These tools rely on parsing and fuzzy matching
techniques when cleaning data from multiple sources.
– Data auditing tools find discrepancies by analyzing the data to discover rules
and relationships, and detecting data that violate such conditions.
– These are variants of data mining tools that apply statistical analysis to find
correlations, or clustering to identify outliers.
– Commercial tools can assist in the data transformation step. Data migration
tools allow simple transformations to be specified such as to replace the string
“gender” by “sex.”
– ETL (extraction/transformation/loading) tools allow users to specify
transforms through a graphical user interface (GUI).
– The two-step process of discrepancy detection and data transformation iterates. It is
error-prone and time-consuming. Some transformations may introduce more discrepancies,
and some nested discrepancies may only be detected after others have been fixed.
Data Reduction
Overview of Data Reduction Strategies
– Data reduction strategies include dimensionality reduction (e.g., principal components analysis
and attribute subset selection) and numerosity reduction (e.g., histograms, clustering, and
sampling), which are covered in the following slides.
Principal Components Analysis (PCA)
– Principal Components Analysis is a method of dimensionality reduction.
– Suppose the data to be reduced consist of tuples or data vectors described by n attributes or
dimensions. Principal components analysis (also called the Karhunen-Loeve, or K-L, method)
searches for k n-dimensional orthogonal vectors that can best be used to represent the data,
where k <= n
– The original data are thus projected onto a much smaller space, resulting in dimensionality
reduction
– PCA “combines” the essence of attributes by creating an alternative, smaller set of variables.
– The initial data can then be projected onto this smaller set.
– PCA often reveals relationships that were not previously suspected and thereby allows
interpretations that would not ordinarily result.
Principal Components Analysis (PCA): The basic procedure
1. The input data are normalized, so that each attribute falls within the same
range.
2. PCA computes k orthonormal vectors that provide a basis for the normalized
input data.
3. The principal components are sorted in order of decreasing “significance”
or strength. The principal components essentially serve as a new set of axes
for the data, providing important information about variance.
4. Because the components are sorted in decreasing order of “significance,” the data
size can be reduced by eliminating the weaker components, that is, those with
low variance. Using the strongest principal components, it should be possible
to reconstruct a good approximation of the original data (a small sketch of these steps follows).
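To make these four steps concrete, a minimal NumPy sketch (not taken from the slides; the function name and the use of covariance-matrix eigenvectors are assumptions about one standard way to implement PCA):

import numpy as np

def pca_reduce(X, k):
    # Step 1: normalize each attribute so that all fall within a comparable range
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: compute orthonormal vectors (eigenvectors of the covariance matrix)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    # Step 3: sort components by decreasing variance ("significance")
    order = np.argsort(eigvals)[::-1]
    # Step 4: keep the k strongest components and project the data onto them
    return Z @ eigvecs[:, order[:k]]

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced = pca_reduce(X, k=2)   # 100 tuples now described by 2 principal components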
Attribute Subset Selection (ASS)
– Attribute subset selection reduces the data set size by removing irrelevant
or redundant attributes
– The goal of ASS is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes
– Mining on a reduced set of attributes has an additional benefit: it reduces the number of
attributes appearing in the discovered patterns, helping to make the patterns easier to understand
– Heuristic methods that explore a reduced search space are commonly used for
attribute subset selection.
– The “best” (and “worst”) attributes are typically determined using tests of
statistical significance, which assume that the attributes are independent of one another.
Basic heuristic methods of attribute subset selection include the techniques that follow,
some of which are illustrated in the figure.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set;
at each subsequent iteration, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes and,
at each step, removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The two methods can be combined
so that, at each step, the procedure selects the best attribute and removes the worst from
among the remaining attributes.
4. Decision tree induction: Decision tree algorithms were originally intended for
classification. Decision tree induction constructs a flowchart-like structure
where each internal (nonleaf) node denotes a test on an attribute,
each branch corresponds to an outcome of the test, and each external (leaf)
node denotes a class prediction. At each node, the algorithm chooses the
“best” attribute to partition the data into individual classes.
When decision tree induction is used for attribute subset selection, a tree is
constructed from the given data. All attributes that do not appear in the tree
are assumed to be irrelevant; the set of attributes appearing in the tree form
the reduced subset of attributes, as sketched below.
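A minimal scikit-learn sketch of this idea (the library, the sample data set, and the use of feature_importances_ are assumptions, not from the slides): attributes that never appear in the induced tree receive zero importance and are treated as irrelevant.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes with zero importance never appear in the tree; the rest form the reduced subset
reduced_subset = [i for i, imp in enumerate(tree.feature_importances_) if imp > 0]
print("Reduced attribute subset (column indices):", reduced_subset)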
Histograms
– Histograms use binning to approximate data distributions and are a popular form of
data reduction. A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, referred to as buckets or bins.
– If each bucket represents only a single attribute–value/frequency pair, the buckets
are called singleton buckets.
– Equal-width: In an equal-width histogram, the width of each bucket range is uniform
– Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets
are created so that, roughly, the frequency of each bucket is constant
– Histograms are highly effective at approximating both sparse and dense data, as well
as highly skewed and uniform data. The histograms described before for
single attributes can be extended for multiple attributes.
– Multidimensional histograms can capture dependencies between attributes.
Singleton buckets are useful for storing high-frequency outliers
Histograms - Example
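As an illustration, a minimal pandas sketch contrasting equal-width and equal-frequency buckets (the price values and the choice of four buckets are assumed, not taken from the slides):

import pandas as pd

prices = pd.Series([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 15, 15, 18, 18, 20, 21, 25])

# Equal-width: every bucket range has the same width
equal_width = pd.cut(prices, bins=4)
# Equal-frequency (equal-depth): every bucket holds roughly the same number of values
equal_depth = pd.qcut(prices, q=4)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())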
Clustering
– Clustering techniques consider data tuples as objects. They partition the objects into groups,
or clusters, so that objects within a cluster are similar to one another and dissimilar to
objects in other clusters.
– In data reduction, the cluster representations of the data are used to replace the actual data.
The effectiveness of this technique depends on the data's nature; it is much more effective for
data that can be organized into distinct clusters.
Sampling
– Sampling can be used as a data reduction technique because it allows a large data set, D,
containing N tuples, to be represented by a much smaller random sample (or subset) of the data.
– Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then a
simple random sample (SRS) of s clusters can be obtained, where s < M.
– Stratified sample: If D is divided into mutually disjoint parts called strata, a
stratified sample of D is generated by obtaining an SRS at each stratum
(both kinds of sample are sketched below).
– An advantage of sampling for data reduction is that the cost of obtaining a sample is
proportional to the size of the sample, s, as opposed to N, the data set size.
Hence, sampling complexity is potentially sublinear to the size of the data. Other data
reduction techniques can require at least one complete pass through D.
– When applied to data reduction, sampling is most commonly used to estimate
the answer to an aggregate query.
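A minimal pandas sketch of simple random and stratified sampling (the data frame, the "segment" strata column, and the sample sizes are assumptions for illustration):

import pandas as pd

# Hypothetical data set D; "segment" plays the role of the strata
D = pd.DataFrame({
    "segment": ["young", "young", "middle", "middle", "senior", "senior"],
    "income":  [20, 25, 40, 45, 60, 65],
})

# Simple random sample (SRS) of s = 3 tuples drawn without replacement
srs = D.sample(n=3, random_state=0)

# Stratified sample: an SRS drawn within each stratum (here, 50% per segment)
stratified = D.groupby("segment", group_keys=False).sample(frac=0.5, random_state=0)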
Data Transformation and Data Discretization
Data Transformation Strategies Overview
In data transformation, the data are transformed or consolidated into forms
appropriate for mining.
Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include
binning, regression, and clustering.
2. Attribute construction, where new attributes are constructed and added from
the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the
data. For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically used in constructing a
data cube for data analysis at multiple abstraction levels (a brief aggregation sketch
follows this list).
4. Normalization, where the attribute data are scaled so as to fall within a smaller
range, such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0-10, 11-20) or by conceptual labels (e.g., youth, adult, senior).
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country.
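A brief pandas sketch of aggregation (the dates and sales figures are assumed for illustration): daily sales are rolled up to monthly and annual totals.

import pandas as pd

daily = pd.DataFrame({
    "date":  pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-03", "2019-01-10"]),
    "sales": [120.0, 80.0, 200.0, 150.0],
})

monthly_totals = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
annual_totals = daily.groupby(daily["date"].dt.to_period("Y"))["sales"].sum()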
Data Transformation by Normalization
– The measurement unit used can affect the data analysis
– Expressing an attribute in smaller units will lead to a larger range for that
attribute, and thus tend to give such an attribute greater effect or “weight”
– To help avoid dependence on the choice of measurement units, the data
should be normalized or standardized.
– This involves transforming the data to fall within a smaller or common range
such as [-1, 1] or [0.0, 1.0]
– Normalizing the data attempts to give all attributes an equal weight
– Normalization is particularly useful for classification algorithms involving neural
networks or distance measurements such as nearest-neighbor classification and clustering.
– Min-max normalization performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization
maps a value, v, of A to v' in the range [new_minA, new_maxA] by computing
  v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
– Note that min-max normalization will encounter an “out-of-bounds” error if a future input case for
normalization falls outside of the original data range for A.
Min-max Normalization Example
– Suppose that the minimum and maximum values for the attribute income are Rs.12,000 and
Rs.98,000, respectively, and that income is to be mapped to the range [0.0, 1.0]. By min-max
normalization, a value of Rs.73,600 for income is transformed to
((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716.
z-score normalization
– In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean and standard deviation of A. A value, v, of A is normalized to
v' by computing
  v' = (v - mean(A)) / std(A)
z-score normalization Example
– Suppose that the mean and standard deviation of the values for the attribute income are
Rs.54,000 and Rs.16,000, respectively. With z-score normalization, a value of Rs.73,600 for
income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
Normalization by decimal scaling
– Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A.
The number of decimal points moved depends on the maximum absolute value of A: a value, v, of A is
normalized to v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
– For example, if the values of A range from -986 to 917, the maximum absolute value is 986, so
j = 3 and -986 normalizes to -0.986 (all three normalization methods are sketched below).
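A minimal NumPy sketch of the three normalization methods above (the income values are assumed for illustration):

import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to the range [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# z-score normalization: subtract the mean, divide by the standard deviation
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j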
Discretization by Binning
– Binning is a top-down splitting technique based on a specified number of bins. Attribute values
can be discretized by applying equal-width or equal-frequency binning and then replacing each
bin's values by the bin mean or median, as in smoothing by bin means or smoothing by bin medians.
– Binning does not use class information and is therefore an unsupervised discretization technique.
Discretization by Histogram Analysis
Histogram analysis is an unsupervised discretization technique because it does not
use class information.
– A histogram partitions the values of an attribute, A, into disjoint ranges called
buckets or bins.
– In an equal-width histogram, the values are partitioned into equal-size
partitions or ranges.
– In an equal-frequency histogram, the values are partitioned so that, ideally, each
partition contains the same number of data tuples.
– The histogram analysis algorithm can be applied recursively to each partition in
order to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a prespecified number of concept levels has been
reached (a recursive sketch follows).
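A minimal NumPy sketch of this recursive idea (the function name, the use of equal-width splitting into two parts per level, and the sample values are assumptions): each level splits every interval of the previous level, stopping after a prespecified number of concept levels.

import numpy as np

def equal_width_hierarchy(values, levels, bins=2):
    # Start with a single interval spanning the full value range
    edges = np.array([float(np.min(values)), float(np.max(values))])
    hierarchy = []
    for _ in range(levels):
        # Split every interval of the previous level into `bins` equal-width parts
        edges = np.unique(np.concatenate(
            [np.linspace(a, b, bins + 1) for a, b in zip(edges[:-1], edges[1:])]))
        hierarchy.append(edges.tolist())
    return hierarchy

print(equal_width_hierarchy(np.array([0, 120, 380, 1000]), levels=3))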
Discretization by Cluster, Decision Tree, and Correlation Analyses
– Cluster analysis is a popular data discretization method. A clustering algorithm can be applied
to discretize a numeric attribute, A, by partitioning the values of A into clusters or groups.
Clustering can generate a concept hierarchy for A by following either a top-down splitting
strategy or a bottom-up merging strategy. In the latter, clusters are formed by repeatedly
grouping neighboring clusters in order to form higher-level concepts.
– Techniques to generate decision trees for classification can also be applied to discretization.
Such techniques are supervised in that they make use of class label information. The main idea is
to select split points so that a given resulting partition contains as many
tuples of the same class as possible (a small sketch follows).
– Measures of correlation can likewise be used for discretization; ChiMerge is a chi-square-based
discretization method that merges adjacent intervals with similar class distributions.
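A minimal scikit-learn sketch of decision-tree (supervised) discretization of one numeric attribute (the attribute values, class labels, and the limit of three leaf intervals are assumptions): the internal-node thresholds of the induced tree serve as the split points.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

age = np.array([[22], [25], [31], [38], [45], [52], [61], [67]])   # numeric attribute
label = np.array([0, 0, 0, 1, 1, 1, 2, 2])                         # class information

tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(age, label)

# Leaf nodes are marked with threshold -2; the remaining thresholds are the split points
split_points = sorted(t for t in tree.tree_.threshold if t != -2)
print("Split points:", split_points)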
Concept Hierarchy Generation for Nominal Data
We study four methods for the generation of concept hierarchies for nominal data,
as follows:
– Specification of a partial ordering of attributes explicitly at the schema level by
users or experts
– Specification of a portion of a hierarchy by explicit data grouping
– Specification of a set of attributes, but not of their partial ordering (see the sketch after this list)
– Specification of only a partial set of attributes
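For the third method, one common heuristic (an assumption here, not stated on the slides) is to order the attributes by their number of distinct values, placing the attribute with the fewest distinct values at the top, most general, level. A minimal pandas sketch with made-up location data:

import pandas as pd

df = pd.DataFrame({
    "country": ["IN", "IN", "IN", "IN"],
    "state":   ["TS", "TS", "KA", "KA"],
    "city":    ["Hyderabad", "Hyderabad", "Bengaluru", "Mysuru"],
    "street":  ["S1", "S2", "S3", "S4"],
})

# Fewer distinct values -> higher (more general) level of the concept hierarchy
order = df.nunique().sort_values().index.tolist()
print("Hierarchy, top to bottom:", " -> ".join(order))   # country -> state -> city -> street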
UNIT – 3…