Chapter 2
Hierarchical clustering
UNSUPERVISED LEARNING IN R
Hank Roark
Senior Data Scientist at Boeing
Hierarchical clustering
Number of clusters is not known ahead of time
Simple example (figures): the same observations grouped into five, four, three, two, and finally one cluster
Hierarchical clustering in R
# Calculates similarity as Euclidean distance
# between observations
dist_matrix <- dist(x)
# Returns hierarchical clustering model
hclust(d = dist_matrix)
Call:
hclust(d = dist_matrix)
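As a minimal end-to-end sketch of the two calls above, the following fits a model on a small synthetic matrix. The data, the seed, and the name `hclust_model` are assumptions for illustration and are not the course's `x`.

```r
# Hypothetical data: 50 observations with 2 features (not the course data)
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)

# Pairwise Euclidean distances between the rows of x
dist_matrix <- dist(x)

# Fit the hierarchical clustering model (hclust defaults to complete linkage)
hclust_model <- hclust(d = dist_matrix)

# The tree is built by merging two clusters at a time,
# so n observations produce n - 1 merges
n_merges <- nrow(hclust_model$merge)
print(hclust_model)
```

No number of clusters is passed to hclust(); that choice is made later, when the tree is cut.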
Selecting number of clusters
Interpreting results
# Create hierarchical cluster model: hclust.out
hclust.out <- hclust(dist(x))
# Inspect the result
summary(hclust.out)
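summary() on an hclust object mostly lists the components the model stores, so it is often more useful to look at those components directly. A minimal sketch, assuming synthetic data in place of the course's `x`:

```r
# Hypothetical data standing in for x
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)
hclust.out <- hclust(dist(x))

# Components stored in the model (merge history, merge heights, plot order, ...)
component_names <- names(hclust.out)
print(component_names)

# Height at which each merge happened, in merge order
merge_heights <- hclust.out$height
print(head(merge_heights))
```

The merge heights are what the dendrogram draws on its vertical axis.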
Dendrogram
Tree-shaped structure used to interpret hierarchical clustering models
Dendrogram plotting in R
# Draw the dendrogram
plot(hclust.out)
# Add a horizontal red line at height 6 to mark a candidate cut
abline(h = 6, col = "red")
Tree "cutting" in R
# Cut by height h
cutree(hclust.out, h = 6)
1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3
3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 2 4 2 4 4
# Cut by number of clusters k
cutree(hclust.out, k = 2)
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
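Cutting by height and cutting by number of clusters are two views of the same tree. The sketch below, on synthetic data (an assumption, not the course's `x`), picks a height that falls between the last two merges that must be undone to get three clusters and checks that both cuts agree. The names `clusters_k`, `clusters_h`, and `cut_height` are hypothetical.

```r
# Hypothetical data standing in for x
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)
hclust.out <- hclust(dist(x))
n <- nrow(x)

# Cut by number of clusters: each observation gets a label in 1..k
clusters_k <- cutree(hclust.out, k = 3)
print(table(clusters_k))

# A height strictly between merge n - 3 and merge n - 2 leaves 3 clusters
cut_height <- mean(hclust.out$height[c(n - 3, n - 2)])
clusters_h <- cutree(hclust.out, h = cut_height)

# Number of clusters at a height: one more than the merges above it
n_clusters_at_height <- 1 + sum(hclust.out$height > cut_height)
```

In a dendrogram plot, `n_clusters_at_height` is the number of vertical branches a horizontal line at `cut_height` would cross.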
Clustering linkage and practical matters
Linking clusters in hierarchical clustering
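How is distance between clusters determined? Four common linkage rules:
Complete: computes pairwise distances between all observations in cluster 1 and cluster 2, and uses the largest
Single: same as complete, but uses the smallest of the pairwise distances
Average: same as complete, but uses the average of the pairwise distances
Centroid: finds the centroid of cluster 1 and the centroid of cluster 2, and uses the distance between the two centroids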
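To make the linkage rules concrete, the sketch below computes the between-cluster distance under complete, single, average, and centroid linkage for two tiny clusters. The points and all variable names are hypothetical, chosen only so the arithmetic is easy to follow.

```r
# Two small hypothetical clusters in 2-D
cluster1 <- rbind(c(0, 0), c(0, 1))
cluster2 <- rbind(c(3, 0), c(4, 1))

# Pairwise Euclidean distances between members of cluster 1 (rows)
# and members of cluster 2 (columns)
pairwise <- as.matrix(dist(rbind(cluster1, cluster2)))[1:2, 3:4]

complete_dist <- max(pairwise)   # largest pairwise distance
single_dist   <- min(pairwise)   # smallest pairwise distance
average_dist  <- mean(pairwise)  # mean of all pairwise distances

# Centroid linkage: distance between the two cluster means
centroid_dist <- sqrt(sum((colMeans(cluster1) - colMeans(cluster2))^2))

print(c(complete = complete_dist, single = single_dist,
        average = average_dist, centroid = centroid_dist))
```

By construction the single distance can never exceed the average, and the average can never exceed the complete distance, since all three summarize the same set of pairwise values.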
Linking methods (figures): complete and average, single, centroid
Linkage in R
# Fitting hierarchical clustering models using different methods
# (d is a distance matrix, for example d <- dist(x))
hclust.complete <- hclust(d, method = "complete")
hclust.average <- hclust(d, method = "average")
hclust.single <- hclust(d, method = "single")
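One way to see how the choice of method matters is to cut each tree into the same number of clusters and compare cluster sizes. This is a sketch on synthetic data; `linkage_methods`, `models`, and `sizes` are hypothetical names, and the choice of k = 3 is arbitrary.

```r
# Hypothetical data and its distance matrix
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)
d <- dist(x)

# Fit one model per linkage method on the same distances
linkage_methods <- c("complete", "average", "single")
models <- lapply(linkage_methods, function(m) hclust(d, method = m))
names(models) <- linkage_methods

# Cut every tree into 3 clusters and tabulate the cluster sizes
sizes <- lapply(models, function(model) table(cutree(model, k = 3)))
print(sizes)
```

Single linkage often produces one large cluster plus a few very small ones, which shows up as a lopsided size table, although the exact sizes depend on the data.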
Practical matters
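Data on different scales can cause undesirable results in clustering methods
Solution is to scale the data so that all features have the same mean and standard deviation
Subtract the mean of a feature from all observations
Divide each feature by its standard deviation
Normalized features have a mean of zero and a standard deviation of one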
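The scaling described above can be written out by hand and compared with R's scale(), which performs the same centering and division by default. The feature names and values below are hypothetical.

```r
# Hypothetical features measured on very different scales
set.seed(1)
x <- cbind(height_cm = rnorm(20, mean = 170, sd = 10),
           income    = rnorm(20, mean = 50000, sd = 15000))

# Step 1: subtract each column's mean from every observation
centered <- sweep(x, 2, colMeans(x), "-")

# Step 2: divide each column by its standard deviation
standardized <- sweep(centered, 2, apply(x, 2, sd), "/")

# scale() with default arguments gives the same values
same_as_scale <- isTRUE(all.equal(standardized, scale(x),
                                  check.attributes = FALSE))
print(same_as_scale)
```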
# Check if scaling is necessary
colMeans(x)
-0.1337828 0.0594019
apply(x, 2, sd)
1.974376 2.112357
# Produce a new matrix whose columns have mean 0 and standard deviation 1
scaled_x <- scale(x)
colMeans(scaled_x)
2.775558e-17 3.330669e-17
apply(scaled_x, 2, sd)
1 1
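To illustrate why unscaled features can distort clustering, the sketch below builds synthetic data in which one feature has a much larger spread than the other. The column names and the factor of 1000 are assumptions chosen to exaggerate the effect.

```r
# Hypothetical data: one feature has 1000 times the spread of the other
set.seed(1)
x <- cbind(small = rnorm(30), large = rnorm(30, sd = 1000))

# Without scaling, distances are driven almost entirely by the large feature
d_raw <- dist(x)
d_large_only <- dist(x[, "large", drop = FALSE])
raw_vs_large <- cor(as.vector(d_raw), as.vector(d_large_only))
print(raw_vs_large)

# Cluster the raw and the scaled data, then cross-tabulate the assignments
hc_raw <- hclust(d_raw)
hc_scaled <- hclust(dist(scale(x)))
comparison <- table(raw = cutree(hc_raw, k = 3),
                    scaled = cutree(hc_scaled, k = 3))
print(comparison)
```

A correlation close to 1 means the small feature contributes almost nothing to the raw distances, so clustering the raw data effectively ignores it.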
Review of hierarchical clustering
Hierarchical clustering review
# Fitting various hierarchical clustering models
hclust.complete <- hclust(d, method = "complete")
hclust.average <- hclust(d, method = "average")
hclust.single <- hclust(d, method = "single")
Review topics (figures): linking methods (complete and average), hierarchical clustering, iterating, dendrogram, how k-means and hierarchical clustering differ
Practical matters
# Scale the data
pokemon.scaled <- scale(pokemon)
    1   2   3
1 242   1   0
2 342   1   0
3 204   9   1
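One way to compare two clusterings of the same scaled data is to cross-tabulate their assignments, which yields a table of the same shape as the one above. The sketch below uses synthetic data as a stand-in for `pokemon`, and the model settings (three clusters, complete linkage, `nstart = 20`) are assumptions rather than values confirmed by the slides.

```r
# Hypothetical stand-in for the pokemon data: two features, different scales
set.seed(1)
data_raw <- cbind(a = c(rnorm(30, mean = 0), rnorm(30, mean = 5)),
                  b = c(rnorm(30, mean = 0, sd = 10),
                        rnorm(30, mean = 50, sd = 10)))

# Scale the data
data_scaled <- scale(data_raw)

# Fit a hierarchical model and a k-means model on the same scaled data
hclust_model <- hclust(dist(data_scaled), method = "complete")
km_model <- kmeans(data_scaled, centers = 3, nstart = 20)

# Cut the tree into 3 clusters and cross-tabulate against k-means
hclust_clusters <- cutree(hclust_model, k = 3)
comparison <- table(kmeans = km_model$cluster, hclust = hclust_clusters)
print(comparison)
```

Cluster labels are arbitrary in both methods, so agreement shows up as each row concentrating its counts in one column, not as counts on the diagonal.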