UNIT - 2. Data Science 04.09.18
PREPROCESSING
Introduction
Data Preprocessing: An Overview
2. Major Tasks in Data Preprocessing
We look at the major steps involved in data preprocessing, namely, data cleaning, data
integration, data reduction, and data transformation.
Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies.
You may need to include data from multiple sources in your analysis; this involves integrating
multiple databases, data cubes, or files, i.e., data integration.
Data reduction obtains a reduced representation of the data set that is much smaller
in volume, yet produces the same (or almost the same) analytical results.
Discretization and concept hierarchy generation are powerful tools for data mining in that
they allow data mining at multiple abstraction levels. Normalization, data discretization,
and concept hierarchy generation are forms of data transformation.
Data Cleaning
Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth
out noise while identifying outliers, and correct inconsistencies in the data.
1. Missing Values.
– Ignore the tuple
– Fill in the missing value manually
– Use a global constant to fill in the missing value
– Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value
– Use the attribute mean or median for all samples belonging to the same class as the
given tuple (see the sketch after this list)
– Use the most probable value to fill in the missing value
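As an illustration of the central-tendency strategies above, a minimal pandas sketch (the column names and values are assumed for illustration, not from the slides):

import numpy as np
import pandas as pd

# Hypothetical data with a missing income value
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [30000.0, np.nan, 52000.0, 48000.0],
})

# Fill with a measure of central tendency for the whole attribute (the median)
df["income_filled_median"] = df["income"].fillna(df["income"].median())

# Fill with the mean of all samples belonging to the same class as the given tuple
df["income_filled_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))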
2. Noisy Data
– Noise is a random error or variance in a measured variable.
– Basic smoothing techniques include binning (smoothing by bin means, medians, or boundaries),
regression (fitting the data to a function), and outlier analysis (e.g., detecting outliers by clustering).
3. Data Cleaning as a Process
– Discrepancy detection.
– Discrepancies can be caused by several factors, including poorly designed data entry forms
that have many optional fields, human error in data entry, deliberate errors, and data decay.
– Discrepancies may also arise from inconsistent data representations and inconsistent
use of codes.
– Other sources of discrepancies include errors in instrumentation devices that record data
and system errors
– Errors can also occur when the data are used for purposes other than originally intended
– There may also be inconsistencies due to data integration
There are a number of different commercial tools that can aid in the
discrepancy detection step.
– Data scrubbing tools use simple domain knowledge to detect errors and
make corrections in the data. These tools rely on parsing and fuzzy matching
techniques when cleaning data from multiple sources.
– Data auditing tools find discrepancies by analyzing the data to discover rules
and relationships, and detecting data that violate such conditions.
– These are variants of data mining tools that apply statistical analysis to find
correlations, or clustering to identify outliers.
– Commercial tools can assist in the data transformation step. Data migration
tools allow simple transformations to be specified such as to replace the string
“gender” by “sex.”
– ETL (extraction/transformation/loading) tools allow users to specify
transforms through a graphical user interface (GUI).
– The two-step process of discrepancy detection and data transformation iterates. It is
error-prone and time-consuming. Some transformations may introduce more discrepancies,
and some nested discrepancies may only be detected after others have been fixed.
Data Reduction
Overview of Data Reduction Strategies
– Data reduction strategies include dimensionality reduction (e.g., principal components analysis
and attribute subset selection) and numerosity reduction (e.g., histograms, clustering, and
sampling), which are covered in the following slides.
Principal Components Analysis (PCA)
– Principal Components Analysis is a method of dimensionality reduction.
– Suppose the data to be reduced consist of tuples or data vectors described by n attributes or
dimensions. Principal components analysis (also called the Karhunen-Loeve, or K-L, method)
searches for k n-dimensional orthogonal vectors that can best be used to represent the data,
where k <= n
– The original data are thus projected onto a much smaller space, resulting in dimensionality
reduction
– PCA “combines” the essence of attributes by creating an alternative, smaller set of variables.
– The initial data can then be projected onto this smaller set.
– PCA often reveals relationships that were not previously suspected and thereby allows
interpretations that would not ordinarily result.
Principal Components Analysis (PCA): The basic procedure
1. The input data are normalized, so that each attribute falls within the same
range.
2. PCA computes k orthonormal vectors that provide a basis for the normalized
input data.
3. The principal components are sorted in order of decreasing “significance”
or strength. The principal components essentially serve as a new set of axes
for the data, providing important information about variance.
4. Because the components are sorted in decreasing order of “significance,” the data
size can be reduced by eliminating the weaker components, that is, those with
low variance. Using the strongest principal components, it should be possible
to reconstruct a good approximation of the original data (a small sketch of these steps follows).
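To make these four steps concrete, a minimal NumPy sketch (not taken from the slides; the function name and the use of covariance-matrix eigenvectors are assumptions about one standard way to implement PCA):

import numpy as np

def pca_reduce(X, k):
    # Step 1: normalize each attribute so that all fall within a comparable range
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: compute orthonormal vectors (eigenvectors of the covariance matrix)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    # Step 3: sort components by decreasing variance ("significance")
    order = np.argsort(eigvals)[::-1]
    # Step 4: keep the k strongest components and project the data onto them
    return Z @ eigvecs[:, order[:k]]

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced = pca_reduce(X, k=2)   # 100 tuples now described by 2 principal components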
Attribute Subset Selection (ASS)
– Attribute subset selection reduces the data set size by removing irrelevant
or redundant attributes
– The goal of ASS is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes
– Mining on a reduced set of attributes has an additional benefit: it reduces the number of
attributes appearing in the discovered patterns, helping to make the patterns easier to understand
– Heuristic methods that explore a reduced search space are commonly used for
attribute subset selection.
– The “best” (and “worst”) attributes are typically determined using tests of
statistical significance, which assume that the attributes are independent of one another.
Basic heuristic methods of attribute subset selection include the techniques that follow,
some of which are illustrated in the figure.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set;
at each subsequent iteration, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes and,
at each step, removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The two methods can be combined
so that, at each step, the procedure selects the best attribute and removes the worst from
among the remaining attributes.
4. Decision tree induction: Decision tree algorithms were originally intended for
classification. Decision tree induction constructs a flowchart-like structure
where each internal (nonleaf) node denotes a test on an attribute,
each branch corresponds to an outcome of the test, and each external (leaf)
node denotes a class prediction. At each node, the algorithm chooses the
“best” attribute to partition the data into individual classes.
When decision tree induction is used for attribute subset selection, a tree is
constructed from the given data. All attributes that do not appear in the tree
are assumed to be irrelevant; the set of attributes appearing in the tree form
the reduced subset of attributes, as sketched below.
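A minimal scikit-learn sketch of this idea (the library, the sample data set, and the use of feature_importances_ are assumptions, not from the slides): attributes that never appear in the induced tree receive zero importance and are treated as irrelevant.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes with zero importance never appear in the tree; the rest form the reduced subset
reduced_subset = [i for i, imp in enumerate(tree.feature_importances_) if imp > 0]
print("Reduced attribute subset (column indices):", reduced_subset)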
Histograms
– Histograms use binning to approximate data distributions and are a popular form of
data reduction. A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, referred to as buckets or bins.
– If each bucket represents only a single attribute–value/frequency pair, the buckets
are called singleton buckets.
– Equal-width: In an equal-width histogram, the width of each bucket range is uniform
– Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets
are created so that, roughly, the frequency of each bucket is constant
– Histograms are highly effective at approximating both sparse and dense data, as well
as highly skewed and uniform data. The histograms described before for
single attributes can be extended for multiple attributes.
– Multidimensional histograms can capture dependencies between attributes.
Singleton buckets are useful for storing high-frequency outliers
Histograms - Example
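As an illustration, a minimal pandas sketch contrasting equal-width and equal-frequency buckets (the price values and the choice of four buckets are assumed, not taken from the slides):

import pandas as pd

prices = pd.Series([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 15, 15, 18, 18, 20, 21, 25])

# Equal-width: every bucket range has the same width
equal_width = pd.cut(prices, bins=4)
# Equal-frequency (equal-depth): every bucket holds roughly the same number of values
equal_depth = pd.qcut(prices, q=4)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())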
Clustering
– Clustering techniques consider data tuples as objects. They partition the objects into groups,
or clusters, so that objects within a cluster are similar to one another and dissimilar to
objects in other clusters.
– In data reduction, the cluster representations of the data are used to replace the actual data.
The effectiveness of this technique depends on the data's nature; it is much more effective for
data that can be organized into distinct clusters.
Sampling
– Sampling can be used as a data reduction technique because it allows a large data set, D,
containing N tuples, to be represented by a much smaller random sample (or subset) of the data.
– Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then a
simple random sample (SRS) of s clusters can be obtained, where s < M.
– Stratified sample: If D is divided into mutually disjoint parts called strata, a
stratified sample of D is generated by obtaining an SRS at each stratum
(both kinds of sample are sketched below).
– An advantage of sampling for data reduction is that the cost of obtaining a sample is
proportional to the size of the sample, s, as opposed to N, the data set size.
Hence, sampling complexity is potentially sublinear to the size of the data. Other data
reduction techniques can require at least one complete pass through D.
– When applied to data reduction, sampling is most commonly used to estimate
the answer to an aggregate query.
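A minimal pandas sketch of simple random and stratified sampling (the data frame, the "segment" strata column, and the sample sizes are assumptions for illustration):

import pandas as pd

# Hypothetical data set D; "segment" plays the role of the strata
D = pd.DataFrame({
    "segment": ["young", "young", "middle", "middle", "senior", "senior"],
    "income":  [20, 25, 40, 45, 60, 65],
})

# Simple random sample (SRS) of s = 3 tuples drawn without replacement
srs = D.sample(n=3, random_state=0)

# Stratified sample: an SRS drawn within each stratum (here, 50% per segment)
stratified = D.groupby("segment", group_keys=False).sample(frac=0.5, random_state=0)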
Data Transformation and Data Discretization
Data Transformation Strategies Overview
In data transformation, the data are transformed or consolidated into forms
appropriate for mining.
Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include
binning, regression, and clustering.
2. Attribute construction, where new attributes are constructed and added from
the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the
data. For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically used in constructing a
data cube for data analysis at multiple abstraction levels (a brief aggregation sketch
follows this list).
4. Normalization, where the attribute data are scaled so as to fall within a smaller
range, such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0-10, 11-20) or by conceptual labels (e.g., youth, adult, senior).
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country.
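A brief pandas sketch of aggregation (the dates and sales figures are assumed for illustration): daily sales are rolled up to monthly and annual totals.

import pandas as pd

daily = pd.DataFrame({
    "date":  pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-03", "2019-01-10"]),
    "sales": [120.0, 80.0, 200.0, 150.0],
})

monthly_totals = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
annual_totals = daily.groupby(daily["date"].dt.to_period("Y"))["sales"].sum()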
Data Transformation by Normalization
– The measurement unit used can affect the data analysis
– Expressing an attribute in smaller units will lead to a larger range for that
attribute, and thus tend to give such an attribute greater effect or “weight”
– To help avoid dependence on the choice of measurement units, the data
should be normalized or standardized.
– This involves transforming the data to fall within a smaller or common range
such as [-1, 1] or [0.0, 1.0]
– Normalizing the data attempts to give all attributes an equal weight
– Normalization is particularly useful for classification algorithms involving neural
networks or distance measurements such as nearest-neighbor classification and clustering.
– Min-max normalization performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization
maps a value, v, of A to v' in the range [new_minA, new_maxA] by computing
  v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
– Note that min-max normalization will encounter an “out-of-bounds” error if a future input case for
normalization falls outside of the original data range for A.
Min-max Normalization Example
– Suppose that the minimum and maximum values for the attribute income are Rs.12,000 and
Rs.98,000, respectively, and that income is to be mapped to the range [0.0, 1.0]. By min-max
normalization, a value of Rs.73,600 for income is transformed to
((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716.
z-score normalization
– In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean and standard deviation of A. A value, v, of A is normalized to
v' by computing
  v' = (v - mean(A)) / std(A)
z-score normalization Example
– Suppose that the mean and standard deviation of the values for the attribute income are
Rs.54,000 and Rs.16,000, respectively. With z-score normalization, a value of Rs.73,600 for
income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
Normalization by decimal scaling
– Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A.
The number of decimal points moved depends on the maximum absolute value of A: a value, v, of A is
normalized to v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
– For example, if the values of A range from -986 to 917, the maximum absolute value is 986, so
j = 3 and -986 normalizes to -0.986 (all three normalization methods are sketched below).
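A minimal NumPy sketch of the three normalization methods above (the income values are assumed for illustration):

import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to the range [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# z-score normalization: subtract the mean, divide by the standard deviation
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j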
Discretization by Binning
– Binning is a top-down splitting technique based on a specified number of bins. Attribute values
can be discretized by applying equal-width or equal-frequency binning and then replacing each
bin's values by the bin mean or median, as in smoothing by bin means or smoothing by bin medians.
– Binning does not use class information and is therefore an unsupervised discretization technique.
Discretization by Histogram Analysis
Histogram analysis is an unsupervised discretization technique because it does not
use class information.
– A histogram partitions the values of an attribute, A, into disjoint ranges called
buckets or bins.
– In an equal-width histogram, the values are partitioned into equal-size
partitions or ranges.
– In an equal-frequency histogram, the values are partitioned so that, ideally, each
partition contains the same number of data tuples.
– The histogram analysis algorithm can be applied recursively to each partition in
order to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a prespecified number of concept levels has been
reached (a recursive sketch follows).
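A minimal NumPy sketch of this recursive idea (the function name, the use of equal-width splitting into two parts per level, and the sample values are assumptions): each level splits every interval of the previous level, stopping after a prespecified number of concept levels.

import numpy as np

def equal_width_hierarchy(values, levels, bins=2):
    # Start with a single interval spanning the full value range
    edges = np.array([float(np.min(values)), float(np.max(values))])
    hierarchy = []
    for _ in range(levels):
        # Split every interval of the previous level into `bins` equal-width parts
        edges = np.unique(np.concatenate(
            [np.linspace(a, b, bins + 1) for a, b in zip(edges[:-1], edges[1:])]))
        hierarchy.append(edges.tolist())
    return hierarchy

print(equal_width_hierarchy(np.array([0, 120, 380, 1000]), levels=3))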
Discretization by Cluster, Decision Tree, and Correlation Analyses
– Cluster analysis is a popular data discretization method. A clustering algorithm can be applied
to discretize a numeric attribute, A, by partitioning the values of A into clusters or groups.
Clustering can generate a concept hierarchy for A by following either a top-down splitting
strategy or a bottom-up merging strategy. In the latter, clusters are formed by repeatedly
grouping neighboring clusters in order to form higher-level concepts.
– Techniques to generate decision trees for classification can also be applied to discretization.
Such techniques are supervised in that they make use of class label information. The main idea is
to select split points so that a given resulting partition contains as many
tuples of the same class as possible (a small sketch follows).
– Measures of correlation can likewise be used for discretization; ChiMerge is a chi-square-based
discretization method that merges adjacent intervals with similar class distributions.
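A minimal scikit-learn sketch of decision-tree (supervised) discretization of one numeric attribute (the attribute values, class labels, and the limit of three leaf intervals are assumptions): the internal-node thresholds of the induced tree serve as the split points.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

age = np.array([[22], [25], [31], [38], [45], [52], [61], [67]])   # numeric attribute
label = np.array([0, 0, 0, 1, 1, 1, 2, 2])                         # class information

tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(age, label)

# Leaf nodes are marked with threshold -2; the remaining thresholds are the split points
split_points = sorted(t for t in tree.tree_.threshold if t != -2)
print("Split points:", split_points)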
Concept Hierarchy Generation for Nominal Data
We study four methods for the generation of concept hierarchies for nominal data,
as follows:
– Specification of a partial ordering of attributes explicitly at the schema level by
users or experts
– Specification of a portion of a hierarchy by explicit data grouping
– Specification of a set of attributes, but not of their partial ordering (see the sketch after this list)
– Specification of only a partial set of attributes
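For the third method, one common heuristic (an assumption here, not stated on the slides) is to order the attributes by their number of distinct values, placing the attribute with the fewest distinct values at the top, most general, level. A minimal pandas sketch with made-up location data:

import pandas as pd

df = pd.DataFrame({
    "country": ["IN", "IN", "IN", "IN"],
    "state":   ["TS", "TS", "KA", "KA"],
    "city":    ["Hyderabad", "Hyderabad", "Bengaluru", "Mysuru"],
    "street":  ["S1", "S2", "S3", "S4"],
})

# Fewer distinct values -> higher (more general) level of the concept hierarchy
order = df.nunique().sort_values().index.tolist()
print("Hierarchy, top to bottom:", " -> ".join(order))   # country -> state -> city -> street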
UNIT – 3…