Session 18-Cluster Analysis


Cluster Analysis

Dr. Rajiv Kumar


IIM Kashipur

Note: Content used in this PPT has been compiled from various sources.
Cluster Analysis - Introduction

 Cluster: A collection of data objects
  similar (or related) to one another within the same group
  dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)

[Figure: scatter plot of the objects on Variable 1 vs. Variable 2, with similar objects grouped into clusters]
A Classification of Clustering Procedures

Clustering Procedures
 Hierarchical
  Agglomerative
   Linkage Methods: Single Linkage, Complete Linkage, Average Linkage
   Variance Methods: Ward's Method
   Centroid Methods
  Divisive
 Nonhierarchical
  Sequential Threshold
  Parallel Threshold
  Optimizing Partitioning
 Other
  Two-Step
Hierarchical clustering
 Hierarchical clustering is characterized by the development of a hierarchy or
tree-like structure. Hierarchical methods can be agglomerative or divisive.
 Agglomerative clustering starts with each object in a separate cluster.
Clusters are formed by grouping objects into bigger and bigger clusters.
This process is continued until all objects are members of a single cluster.
 Divisive clustering starts with all the objects grouped in a single cluster.
Clusters are divided or split until each object is in a separate cluster.
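
As a quick illustration, agglomerative hierarchical clustering is available in base R via hclust(); the two-group toy data below is an assumption for demonstration only.

set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # 10 points around (0, 0)
           matrix(rnorm(20, mean = 4), ncol = 2))   # 10 points around (4, 4)

hc <- hclust(dist(x), method = "complete")   # build the merge tree bottom-up (agglomerative)
plot(hc)                                     # dendrogram: the tree-like structure
cutree(hc, k = 2)                            # cut the tree to obtain 2 clusters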
Nonhierarchical Clustering
 The nonhierarchical clustering methods are frequently referred to as k-means clustering. These
methods include sequential threshold, parallel threshold, and optimizing partitioning.
 In the sequential threshold method, a cluster center is selected and all objects within a prespecified
threshold value from the center are grouped together. Then a new cluster center or seed is selected,
and the process is repeated for the unclustered points. Once an object is clustered with a seed, it is
no longer considered for clustering with subsequent seeds.
 The parallel threshold method operates similarly, except that several cluster centers are selected
simultaneously and objects within the threshold level are grouped with the nearest center.
 The optimizing partitioning method differs from the two threshold procedures in that objects can
later be reassigned to clusters to optimize an overall criterion, such as average within cluster distance
for a given number of clusters.
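
As a rough illustration of the sequential threshold idea, here is a minimal R sketch; the threshold value, the rule of taking the first unclustered point as the next seed, and the toy data are all assumptions for demonstration, not part of the original method description.

sequential_threshold <- function(x, threshold) {
  x <- as.matrix(x)
  cluster <- rep(NA_integer_, nrow(x))
  k <- 0
  while (any(is.na(cluster))) {
    k    <- k + 1
    seed <- which(is.na(cluster))[1]                       # next unclustered point becomes the seed
    d    <- sqrt(rowSums((x - rep(x[seed, ], each = nrow(x)))^2))
    cluster[is.na(cluster) & d <= threshold] <- k          # already-clustered points are not reconsidered
  }
  cluster
}

pts <- rbind(c(1, 1), c(1.5, 2), c(5, 7), c(4.5, 6.5))
sequential_threshold(pts, threshold = 2)   # -> 1 1 2 2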
Quality: What Is Good Clustering?

 A good clustering method will produce high-quality clusters with
  high intra-class similarity: cohesive within clusters
  low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
  the similarity measure used by the method,
  its implementation, and
  its ability to discover some or all of the hidden patterns
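
One common way to put numbers on "cohesive within clusters, distinctive between clusters" is the average silhouette width; a minimal sketch, assuming the cluster package is installed (the synthetic two-group data is illustrative).

library(cluster)   # for silhouette()

set.seed(1)
x  <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2)

sil <- silhouette(km$cluster, dist(x))
mean(sil[, "sil_width"])   # close to 1: cohesive within clusters, well separated between them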
Centroid-based clustering
 Centroid-based clustering: In centroid-based clustering, clusters are represented by a
central vector, which may not necessarily be a member of the data set (e.g. K-means
clustering)
 K-means clustering: k-means clustering aims to partition n observations into k clusters
in which each observation belongs to the cluster with the nearest mean, serving as a
prototype of the cluster.
 K-medoids (PAM algorithm): Instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used, which is the most centrally located
object in the cluster.
 K-medians: A variation of k-means clustering in which, instead of calculating the mean
for each cluster to determine its centroid, one calculates the median.
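
A brief sketch contrasting the two reference points: kmeans() centroids are mean points that need not be observations, while pam() medoids are actual data points (assumes the cluster package is available; the toy data is illustrative).

library(cluster)   # for pam()

set.seed(42)
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 5), ncol = 2))

kmeans(x, centers = 2)$centers   # centroids: mean vectors, not necessarily members of the data set
pam(x, k = 2)$medoids            # medoids: the most centrally located observations in each cluster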
The K-Means Clustering Method
 Given k, the k-means algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the current
partitioning (the centroid is the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest seed point
 Go back to Step 2, stop when the assignment does not change
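
These four steps can be written out directly in R; the function below is an illustrative from-scratch sketch (the name simple_kmeans and its details are our own, not the built-in kmeans()), and it assumes no cluster becomes empty during the iterations.

simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  membership <- sample(rep(1:k, length.out = nrow(x)))    # Step 1: arbitrary partition into k nonempty subsets
  for (i in seq_len(max_iter)) {
    centers <- apply(x, 2, function(col) tapply(col, membership, mean))  # Step 2: centroids of the current partition
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]  # distance of every point to every centroid
    nearest <- max.col(-d)                                # Step 3: assign each object to the nearest centroid
    if (all(nearest == membership)) break                 # Step 4: stop when the assignment does not change
    membership <- nearest
  }
  list(cluster = membership, centers = centers)
}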
An Example of K-Means Clustering

[Figure: with K = 2, the initial data set is arbitrarily partitioned into k groups, the cluster centroids are updated, objects are reassigned to their nearest centroid, and the loop repeats if needed until the assignment is stable.]

 Partition objects into k nonempty subsets
 Repeat
  Compute the centroid (i.e., mean point) of each partition
  Assign each object to the cluster of its nearest centroid
 Until no change
K-Means Cluster: Example

Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
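
The small data set above can be entered in R, and the worked example on the next slides reproduced with the built-in kmeans(), seeding the algorithm with individuals 1 and 4 as in the slides (a sketch; the name subjects is ours).

subjects <- data.frame(A = c(1.0, 1.5, 3.0, 5.0, 3.5, 4.5, 3.5),
                       B = c(1.0, 2.0, 4.0, 7.0, 5.0, 5.0, 4.5))

# Use individuals 1 and 4 as the initial cluster centres, as in the worked example
kmeans(subjects, centers = subjects[c(1, 4), ])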
K-Means Cluster: Example…
Initial Step: Group the data set into two clusters
• Find a sensible initial partition. Let the A & B values of the two individuals furthest
apart (using the Euclidean distance measure) define the initial cluster means,
giving:

          Individual       Mean Vector (Centroid)
Group 1   1 (1.0, 1.0)     (1.0, 1.0)
Group 2   4 (5.0, 7.0)     (5.0, 7.0)
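
Continuing the sketch on the previous slide, the pairwise Euclidean distance matrix confirms that individuals 1 and 4 are the two furthest apart.

round(dist(subjects), 1)   # the largest entry (7.2) is between individuals 1 and 4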
K-Means Cluster: Example…
The remaining individuals are now examined in sequence and allocated to the cluster
to which they are closest, in terms of Euclidean distance to the cluster mean. The
mean vector is recalculated each time a new member is added. This leads to the
following series of steps:

        Cluster 1                               Cluster 2
Step    Individuals   Mean Vector (Centroid)    Individuals   Mean Vector (Centroid)
1       1             (1.0, 1.0)                4             (5.0, 7.0)
2       1, 2          (1.2, 1.5)                4             (5.0, 7.0)
3       1, 2, 3       (1.8, 2.3)                4             (5.0, 7.0)
4       1, 2, 3       (1.8, 2.3)                4, 5          (4.2, 6.0)
5       1, 2, 3       (1.8, 2.3)                4, 5, 6       (4.3, 5.7)
6       1, 2, 3       (1.8, 2.3)                4, 5, 6, 7    (4.1, 5.4)

1 (1.0, 1.0), 2 (1.5, 2.0), 3 (3.0, 4.0), 4 (5.0, 7.0), 5 (3.5, 5.0), 6 (4.5, 5.0), 7 (3.5, 4.5)
K-Means Cluster: Example…
Now the initial partition has changed, and the two clusters at this stage have the
following characteristics:

            Individuals    Mean Vector (Centroid)
Cluster 1   1, 2, 3        (1.8, 2.3)
Cluster 2   4, 5, 6, 7     (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare
each individual’s distance to its own cluster mean and to that of the opposite cluster.

1 (1.0, 1.0), 2 (1.5, 2.0), 3 (3.0, 4.0), 4 (5.0, 7.0), 5 (3.5, 5.0), 6 (4.5, 5.0), 7 (3.5, 4.5)
K-Means Cluster: Example…
Calculating the distance of each individual from the mean (centroid) of Cluster 1 and of
Cluster 2, we find:

Individual   Distance to Mean (Centroid) of Cluster 1   Distance to Mean (Centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1).
In other words, each individual's distance to its own cluster mean should be smaller than the
distance to the other cluster's mean (which is not the case for individual 3)…
K-Means Cluster: Example…
…Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:

            Individuals       Mean Vector (Centroid)
Cluster 1   1, 2              (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7     (3.9, 5.1)

The iterative relocation would now continue from this new partition until no
more relocations occur. However, in this example each individual is now
nearer its own cluster mean than that of the other cluster, and the iteration
stops, choosing the latest partitioning as the final cluster solution.

Note: it is possible that the k-means algorithm will not converge to a final solution. In that case it is
a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
K-Means Clustering: Example

• Iris data: IrisData.xlsx
K-Means Clustering: R Code

library(readxl)                  # for read_excel()

# Read the Iris data and run k-means with k = 3 on the four measurement
# columns (columns 2-5); the first column is left out of the clustering.
df <- read_excel("IrisData.xlsx")
print(df)
kmeans(df[, 2:5], centers = 3)   # note: cluster labels vary across random starts

Output
Clustering vector:
  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [68] 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3
[135] 1 1 1 1 3 1 1 1 3 1 1 1 3 1 1 3
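
A few follow-up inspections of the fitted k-means object are often useful (a sketch; it assumes the first column of IrisData.xlsx holds the species label, which is why it was excluded from the clustering).

km <- kmeans(df[, 2:5], centers = 3)

km$size                      # number of observations in each cluster
km$centers                   # the three cluster centroids
table(km$cluster, df[[1]])   # cross-tabulate clusters against the first (label) column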
Classification Vs. Clustering

Criteria                     Classification                             Clustering
Prior knowledge of classes   Yes                                        No
Use case                     Classify new samples into known classes    Suggest groups based on patterns in the data
Algorithms                   Decision trees, Bayesian classifiers,      K-means, Expectation-Maximization
                             logistic regression, linear
                             discriminant analysis
Data needs                   Labelled samples from a set of classes     Unlabeled samples
Learning type                Supervised                                 Unsupervised
