Cluster Analysis Introduction (Unit-6)
7. CLUSTER ANALYSIS
7.1 What Is Cluster Analysis?
A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
Cluster Analysis
Cluster analysis is the process of finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters. The process of grouping a set of abstract objects into classes of similar objects is called clustering.
Clustering is an example of unsupervised learning: it is a form of learning by observation, rather than learning by examples.
(Figure: a good clustering minimizes intra-cluster distances and maximizes inter-cluster distances.)
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, they can first be transformed into a dissimilarity matrix before applying such clustering algorithms.
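For illustration, here is a minimal Python sketch (not part of the original notes; it assumes NumPy and Euclidean distance) that turns an n x p data matrix into an n x n dissimilarity matrix:

import numpy as np

def dissimilarity_matrix(X):
    # n x p data matrix (two-mode) -> n x n Euclidean dissimilarity matrix (one-mode)
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]   # pairwise row differences
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[22, 1, 42, 10],
              [20, 0, 36, 8]])
print(dissimilarity_matrix(X))   # off-diagonal entries are 6.708...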
Standardize Data:
To standardize measurements for a variable f with measurements x1f, ..., xnf:
1. Calculate the mean absolute deviation sf:
sf = (1/n)(|x1f − mf| + |x2f − mf| + ... + |xnf − mf|)
where mf is the mean value of f, i.e., mf = (1/n)(x1f + x2f + ... + xnf).
2. Calculate the standardized measurement (z-score):
zif = (xif − mf) / sf
Using the mean absolute deviation is more robust than using the standard deviation, because the deviations from the mean are not squared, so the effect of outliers is reduced. The median absolute deviation is even more robust, but then the effect of outliers disappears completely.
Example: for the measurements 18, 22, 25, 42, 28, 43, 33, 35, 56, 28:
mf = (18 + 22 + 25 + 42 + 28 + 43 + 33 + 35 + 56 + 28) / 10 = 33
sf = (1/10)(|18 − 33| + |22 − 33| + |25 − 33| + |42 − 33| + |28 − 33| + |43 − 33| + |33 − 33| + |35 − 33| + |56 − 33| + |28 − 33|)
   = (1/10)(15 + 11 + 8 + 9 + 5 + 10 + 0 + 2 + 23 + 5) = 8.8
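A short Python sketch (illustrative, assuming NumPy) reproducing this calculation:

import numpy as np

x = np.array([18, 22, 25, 42, 28, 43, 33, 35, 56, 28], dtype=float)
m_f = x.mean()                    # 33.0
s_f = np.abs(x - m_f).mean()      # 8.8, the mean absolute deviation
z = (x - m_f) / s_f               # standardized (z-score) measurements
print(m_f, s_f)
print(z)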
Properties of a distance
d(i,j) >= 0: distance is a non-negative number.
d(i,i) = 0: the distance of an object to itself is 0.
d(i,j) = d(j,i): distance is symmetric.
d(i,j) <= d(i,k) + d(k,j): the triangle inequality.
Example 1: Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) The Euclidean distance (q = 2):
d(i,j) = sqrt((22 − 20)^2 + (1 − 0)^2 + (42 − 36)^2 + (10 − 8)^2) = sqrt(4 + 1 + 36 + 4) = sqrt(45) = 6.708
(b) The Manhattan distance (q = 1):
d(i,j) = |22 − 20| + |1 − 0| + |42 − 36| + |10 − 8| = 2 + 1 + 6 + 2 = 11
(c) The Minkowski distance using q = 3:
d(i,j) = (|22 − 20|^3 + |1 − 0|^3 + |42 − 36|^3 + |10 − 8|^3)^(1/3) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) = 6.15
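The same three distances, as a small Python sketch (illustrative, assuming NumPy):

import numpy as np

i = np.array([22, 1, 42, 10], dtype=float)
j = np.array([20, 0, 36, 8], dtype=float)

def minkowski(a, b, q):
    # Minkowski distance of order q; q = 1 is Manhattan, q = 2 is Euclidean
    return (np.abs(a - b) ** q).sum() ** (1.0 / q)

print(minkowski(i, j, 2))   # 6.708... (Euclidean)
print(minkowski(i, j, 1))   # 11.0 (Manhattan)
print(minkowski(i, j, 3))   # 6.153... (q = 3)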
Example 2: Euclidean distance d = 5.099.
Binary Variables
A contingency table for binary variables (counting, over all variables, how the values of object i and object j agree):

                 object j
               1      0      sum
object i   1   q      r      q + r
           0   s      t      s + t
         sum  q + s  r + t    p

For symmetric binary variables, the dissimilarity is the simple matching coefficient, d(i,j) = (r + s) / (q + r + s + t). For asymmetric binary variables, the dissimilarity is the Jaccard coefficient, d(i,j) = (r + s) / (q + r + s), in which the number of negative matches t is considered unimportant.
Example: a relational table of patients, where name is an object identifier, gender is a symmetric binary attribute, and the remaining attributes (fever, cough, test-1 to test-4) are asymmetric binary attributes:

name   gender  fever  cough  test-1  test-2  test-3  test-4
Jack     M       Y      N      P       N       N       N
Mary     F       Y      N      P       N       P       N
Jim      M       Y      Y      N       N       N       N
...     ...     ...    ...    ...     ...     ...     ...

Let the values Y and P be set to 1, and the value N be set to 0. Suppose that the distance between two patients is computed based only on the asymmetric attributes, using the Jaccard coefficient:
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
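These distances can be checked with a small Python sketch (illustrative; the 0/1 vectors encode the six asymmetric attributes):

jack = [1, 0, 1, 0, 0, 0]   # fever, cough, test-1..test-4, with Y/P -> 1, N -> 0
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

def jaccard_dissim(a, b):
    # q: both 1, r: 1 in a only, s: 1 in b only; d = (r + s) / (q + r + s)
    q = sum(x == 1 and y == 1 for x, y in zip(a, b))
    r = sum(x == 1 and y == 0 for x, y in zip(a, b))
    s = sum(x == 0 and y == 1 for x, y in zip(a, b))
    return (r + s) / (q + r + s)

print(jaccard_dissim(jack, mary))   # 0.333...
print(jaccard_dissim(jack, jim))    # 0.666...
print(jaccard_dissim(jim, mary))    # 0.75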
Ordinal Variables
Dr.A.VEERASWAMY SACET-CSE
UNIT-6 DATAWAREHOUSING AND MINING
Replace xif by its rank rif in {1, ..., Mf}, then map the range of each variable onto [0,1] by replacing the i-th object in the f-th variable by
zif = (rif − 1) / (Mf − 1)
The dissimilarity can then be computed using the methods for interval-scaled variables.
Ratio-Scaled Variables
A ratio-scaled variable makes a positive measurement on a nonlinear (e.g., exponential) scale.
Methods:
Treat them as interval-scaled variables: not a good choice (it can distort the data).
Apply a logarithmic transformation, yif = log(xif), and treat the result as interval-scaled.
Treat xif as continuous ordinal data and treat their ranks as interval-scaled.
Variables of Mixed Types
For objects described by variables of mixed types, the dissimilarity is defined as
d(i,j) = [Σf δij(f) · dij(f)] / [Σf δij(f)]
where the indicator δij(f) = 0 if the measurement xif or xjf is missing, and 1 otherwise.
Computation of dij(f):
If f is interval-based: dij(f) = |xif − xjf| / (maxh xhf − minh xhf).
If f is binary or categorical: dij(f) = 0 if xif = xjf, and 1 otherwise.
If f is ordinal: compute the ranks rif, set zif = (rif − 1) / (Mf − 1), and treat zif as interval-scaled. (These cases are combined in the sketch below.)
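A compact Python sketch of this combination rule (illustrative; the function and parameter names are made up for the example):

def mixed_dissim(xi, xj, types, params):
    # types[f]: 'interval', 'binary', or 'ordinal'
    # params[f]: the range (max - min) for interval f, or Mf for ordinal f
    num = den = 0.0
    for f, (a, b) in enumerate(zip(xi, xj)):
        if a is None or b is None:          # delta = 0: skip missing values
            continue
        if types[f] == 'interval':
            d = abs(a - b) / params[f]
        elif types[f] == 'ordinal':         # a, b are the ranks rif, rjf
            d = abs(a - b) / (params[f] - 1.0)
        else:                               # binary or categorical
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den if den else 0.0

# one binary, one interval (range 100), one ordinal (Mf = 3) variable
print(mixed_dissim([1, 45.0, 3], [0, 60.0, 1],
                   ['binary', 'interval', 'ordinal'], [None, 100.0, 3]))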
For vector objects, similarity can be measured with the Tanimoto coefficient, s(x, y) = (x · y) / (x · x + y · y − x · y).
Partitioning Methods:
Given a database of n objects, a partitioning method classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different.
1) Hierarchical Methods
Agglomerative (bottom-up): at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive (top-down): at each step, split a cluster until each cluster contains a single point (or there are k clusters).
2) Density-Based Methods
Most partitioning methods cluster objects based on the distance between objects; such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes.
Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing a given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
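The core of this idea in a few lines of Python (illustrative; eps and min_pts stand for the radius and the minimum-points threshold named above):

import numpy as np

def is_core(X, idx, eps, min_pts):
    # a point is dense enough if its eps-neighborhood holds >= min_pts points
    d = np.linalg.norm(X - X[idx], axis=1)
    return (d <= eps).sum() >= min_pts   # the count includes the point itself

X = np.array([[0., 0.], [0.2, 0.1], [0.1, 0.3], [5., 5.]])
print(is_core(X, 0, eps=0.5, min_pts=3))   # True: a dense neighborhood
print(is_core(X, 3, eps=0.5, min_pts=3))   # False: an isolated point (noise)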
3) Grid-Based Methods
Grid-based methods quantize the object space into a finite number of cells that form a grid structure.
All of the clustering operations are performed on the grid structure (i.e., on the quantized space).
The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space.
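A toy Python sketch of the quantization step (illustrative; the cell width of 2.5 is an arbitrary choice):

import numpy as np
from collections import Counter

X = np.random.default_rng(0).uniform(0, 10, size=(100, 2))
cells = Counter(map(tuple, np.floor(X / 2.5).astype(int)))   # a 4 x 4 grid
print(cells)   # further clustering operates on these per-cell counts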
Dr.A.VEERASWAMY SACET-CSE
UNIT-6 DATAWAREHOUSING AND MINING
4) Model-Based Methods:
Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model.
The k-Means Method:
Since only one set of clusters is output, the user normally has to input the desired number of clusters, K. Each cluster Ck has nk samples and each sample belongs to exactly one cluster, so that Σk nk = n, where n is the total number of samples.
The square-error for cluster Ck is the sum of the squared Euclidean distances between each sample in Ck and its centroid. This error is also called the within-cluster variation.
The square-error for the entire clustering space containing k clusters is the sum of the within-cluster variations:
E = Σi Σ(p ∈ Ci) |p − mi|^2
where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional).
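A minimal k-means sketch in Python (illustrative, assuming NumPy; not the notes' own pseudocode) that alternates assignment and mean updates and reports the square-error E:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point p to the cluster with the nearest mean mi
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        new = np.array([X[labels == i].mean(0) if (labels == i).any()
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    # E: sum over clusters of squared distances to the cluster mean
    E = sum(((X[labels == i] - centers[i]) ** 2).sum() for i in range(k))
    return labels, centers, E

X = np.array([[1., 1.], [1.5, 2.], [3., 4.],
              [5., 7.], [3.5, 5.], [4.5, 5.], [3.5, 4.5]])
labels, centers, E = kmeans(X, k=2)
print(labels, E)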
Advantages
1. With a large number of variables, k-means may be computationally faster than hierarchical clustering (if k is small).
2. k-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages
It is difficult to compare the quality of the clusters produced.
It is applicable only when the mean of a cluster is defined.
The number of clusters, k, must be specified in advance.
It is unable to handle noisy data and outliers.
It is not suitable for discovering clusters with non-convex shapes.
PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced. It attempts to determine k partitions for n objects. After an initial random selection of k representative objects, the algorithm repeatedly tries to make a better choice of cluster representatives. The final set of representative objects consists of the respective medoids of the clusters, i.e., the most centrally located objects.
Advantages
PAM is very robust to the existence of outliers. The clusters found by this method do not depend on the order in which the objects are examined.
PAM works effectively for small data sets.
Disadvantages
It does not scale well for large data sets.
CLARA algorithm:
Input: a database D of n objects.
Repeat m times:
    Draw a sample S ⊆ D randomly from D.
    Call PAM(S, k) to get k medoids.
    Classify the entire data set D into clusters C1, C2, ..., Ck around those medoids.
    Calculate the quality of the clustering as the average dissimilarity; keep the best clustering found so far.
End
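The same loop as a Python sketch (illustrative; pam and dist are hypothetical stand-ins for a PAM implementation and a dissimilarity function, neither of which is given in the notes):

import random

def clara(D, k, pam, dist, m=5, sample_size=40):
    best_medoids, best_cost = None, float('inf')
    for _ in range(m):
        S = random.sample(D, min(sample_size, len(D)))
        medoids = pam(S, k)      # hypothetical: runs PAM on the sample only
        # classify all of D by its nearest medoid and score the clustering
        avg = sum(min(dist(x, c) for c in medoids) for x in D) / len(D)
        if avg < best_cost:
            best_medoids, best_cost = medoids, avg
    return best_medoids, best_cost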
Basically, hierarchical methods group data into a tree of clusters. There are two basic varieties of hierarchical algorithms: agglomerative and divisive. A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering.
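A naive agglomerative (single-link) sketch in Python (illustrative; real implementations use far more efficient data structures):

import numpy as np

def agglomerative(X, k):
    # start with every object in its own cluster, then repeatedly merge
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: distance between the closest pair of members
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # merge the two closest clusters
    return clusters

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
print(agglomerative(X, 2))   # [[0, 1], [2, 3]]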
BIRCH partitions objects hierarchically using tree structures and then refines the clusters using other clustering methods. It defines a clustering feature and an associated tree structure that summarize a cluster. The tree (CF tree) is a height-balanced tree that stores cluster information. Because BIRCH uses the notion of diameter to control cluster boundaries, it does not perform well if the clusters are not spherical, and it may produce unintended clusters. These structures help the clustering method achieve good speed and scalability in large databases and also make it effective for incremental and dynamic clustering of incoming objects.
BIRCH applies a multiphase clustering technique: a single scan of the data set yields a basic good clustering, and one or more additional scans can (optionally) be used to further improve the quality.
The primary phases are:
Dr.A.VEERASWAMY SACET-CSE
UNIT-6 DATAWAREHOUSING AND MINING
Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can
be viewed as a multilevel compression of the data that tries to preserve the inherent
clustering structure of the data.
Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of
the CF tree, which removes sparse clusters as outliers and groups dense clusters into
larger ones.
Clustering features
A clustering feature (CF) is a summary of the statistics for a given subcluster: from the statistical point of view, the 0th, 1st, and 2nd moments of the subcluster.
It registers crucial measurements for computing clusters and utilizes storage efficiently.
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
A non-leaf node in the tree has descendants or "children".
The non-leaf nodes store the sums of the CFs of their children.
A CF tree has two parameters (see the sketch below):
Branching factor: specifies the maximum number of children.
Threshold: the maximum diameter of sub-clusters stored at the leaf nodes.
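A tiny Python sketch of a clustering feature CF = (N, LS, SS), i.e., the count, linear sum, and square sum of a subcluster (illustrative; the class name is made up and the tree structure itself is omitted):

import numpy as np

class CF:
    # clustering feature: N (0th moment), LS (1st moment), SS (2nd moment)
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), p * p
    def merge(self, other):
        # CFs are additive, which is what makes incremental insertion cheap
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS
    def centroid(self):
        return self.LS / self.N

a, b = CF([1.0, 2.0]), CF([3.0, 4.0])
a.merge(b)
print(a.N, a.centroid())   # 2 [2. 3.]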
Drawback
It handles only numeric data and is sensitive to the order of the data records.
CURE:
CURE creates clusters by sampling the database and shrinks them toward the centre of the cluster by a specified fraction. It is obviously better in runtime but lacking in precision. It is robust against outliers and creates clusters of differing sizes that are not necessarily spherical. CURE fails when clustering categorical data, and it ignores the aggregate interconnectivity of objects in separate clusters.
Steps:
Draw a random sample S of the original objects.
Partition the sample S into a set of partitions and form a cluster for each partition.
Representative points are found by selecting a constant number of points from a cluster and then "shrinking" them toward the centre of the cluster, as in the sketch below.
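The shrinking step in Python (illustrative; alpha is the specified shrinking fraction):

import numpy as np

def shrink(points, alpha=0.3):
    # move each representative point a fraction alpha toward the cluster centre
    centre = points.mean(axis=0)
    return points + alpha * (centre - points)

reps = np.array([[0., 0.], [4., 0.], [2., 3.]])
print(shrink(reps))   # each point moved 30% of the way to the centre [2, 1]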