Data Mining - Data Reduction
Estimate the probability of each point in a multidimensional space based on a smaller subset of dimensional combinations
Can construct higher-dimensional spaces from lower-dimensional ones
Both methods can be applied to sparse data
Histograms:
Divide data into buckets and store average (sum) for
each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
V-optimal: the histogram with the least variance
(histogram variance is a weighted sum of the
original values that each bucket represents)
MaxDiff: consider the difference between each pair of
adjacent values; set a bucket boundary between the
pairs having the (No. of buckets - 1) largest
differences
Multi-dimensional histogram
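The one-dimensional partitioning rules above can be sketched in Python. The function names here are illustrative, not from the slides; each rule builds the buckets, and only the per-bucket average is kept as the reduced representation:

```python
def equal_width_buckets(values, n_buckets):
    """Equal-width: every bucket spans the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        # Clamp so the maximum value lands in the last bucket.
        i = min(int((v - lo) / width), n_buckets - 1)
        buckets[i].append(v)
    return buckets

def equal_frequency_buckets(values, n_buckets):
    """Equal-frequency (equal-depth): roughly the same count per bucket."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_buckets:(i + 1) * n // n_buckets]
            for i in range(n_buckets)]

def maxdiff_buckets(values, n_buckets):
    """MaxDiff: cut at the (n_buckets - 1) largest adjacent differences."""
    ordered = sorted(values)
    gaps = sorted(range(len(ordered) - 1),
                  key=lambda i: ordered[i + 1] - ordered[i],
                  reverse=True)[:n_buckets - 1]
    buckets, start = [], 0
    for i in sorted(gaps):
        buckets.append(ordered[start:i + 1])
        start = i + 1
    buckets.append(ordered[start:])
    return buckets

def summarize(buckets):
    """Keep only the per-bucket average, as the slide suggests."""
    return [sum(b) / len(b) for b in buckets if b]
```

For values [1, 1, 5, 5, 5, 8, 9, 10] and two buckets, MaxDiff places the single boundary at the largest gap (between 1 and 5), whereas equal-width splits at the midpoint of the value range.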
Clustering
Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid and
diameter) only
Can be very effective if data is clustered but not if
data is smeared
Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
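A minimal sketch of cluster-based reduction, assuming cluster assignments are already available from some clustering algorithm (names are illustrative): each cluster of points is replaced by just its centroid and diameter.

```python
import math

def cluster_summaries(points, assignments, k):
    """Replace each cluster of 2-D points with (centroid, diameter).

    `points` is a list of (x, y) tuples; `assignments[i]` is the
    cluster id (0..k-1) of points[i].
    """
    summaries = []
    for c in range(k):
        members = [p for p, a in zip(points, assignments) if a == c]
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        # Diameter = maximum pairwise distance within the cluster.
        diameter = max((math.dist(p, q) for p in members for q in members),
                       default=0.0)
        summaries.append(((cx, cy), diameter))
    return summaries
```

With smeared (non-clustered) data the diameters become large and the summaries lose fidelity, matching the caveat above.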
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew
Develop adaptive sampling methods
Stratified sampling
Sampling Techniques:
Simple Random Sample Without Replacement
(SRSWOR)
Simple Random Sample With Replacement
(SRSWR)
Cluster Sample
Stratified Sample
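The two simple random sampling variants can be sketched with the standard library (a fixed seed is used here only for reproducibility):

```python
import random

def srswor(data, s, seed=0):
    """Simple random sample without replacement: s distinct tuples."""
    return random.Random(seed).sample(data, s)

def srswr(data, s, seed=0):
    """Simple random sample with replacement: a tuple may repeat."""
    return random.Random(seed).choices(data, k=s)
```

Under SRSWOR every sampled tuple is distinct; under SRSWR the same tuple can appear more than once in the sample.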
Cluster Sample:
Tuples are grouped into M mutually disjoint clusters
An SRS of m clusters is taken, where m < M
E.g., tuples in a database are retrieved in pages, so
each page can serve as a cluster; apply SRSWOR to
the pages
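Treating each page as a cluster, cluster sampling might be sketched as follows (illustrative names, not from the slides): draw an SRSWOR of pages and keep every tuple on the chosen pages.

```python
import random

def cluster_sample(pages, m, seed=0):
    """SRSWOR over m whole pages; each page acts as one cluster."""
    chosen = random.Random(seed).sample(pages, m)
    # Every tuple on a chosen page enters the sample.
    return [t for page in chosen for t in page]
```

This is cheap in practice because a page is read as a unit anyway, but the sample is only as representative as the pages are.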
Stratified Sample:
Data is divided into mutually disjoint parts called
strata
SRS at each stratum
Representative samples ensured even in the presence
of skewed data
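A sketch of proportional stratified sampling, assuming each record carries a stratum label (the helper names and the round-up-to-one rule for tiny strata are illustrative choices, not from the slides):

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, frac, seed=0):
    """Proportional allocation: an SRS within each stratum.

    `stratum_of` maps a record to its stratum label; `frac` is the
    overall sampling fraction.
    """
    strata = defaultdict(list)
    for r in records:
        strata[stratum_of(r)].append(r)
    rng = random.Random(seed)
    sample = []
    for label, members in strata.items():
        # Take at least one record so rare strata stay represented.
        k = max(1, round(frac * len(members)))
        sample.extend(rng.sample(members, k))
    return sample
```

On skewed data (e.g., 95 "common" records and 5 "rare" ones) a plain SRS of size 10 can easily miss the rare stratum entirely, while the stratified version guarantees it appears.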