
UNSUPERVISED LEARNING AND DIMENSIONALITY REDUCTION

DR. M M MANJURUL ISLAM


Contents
1. Unsupervised Learning (Clustering Problem)
2. K-means clustering algorithm
3. Centroid initialization and choosing the number of K
4. Dimensionality Reduction
5. Principal Component Analysis
1. Unsupervised Learning: Introduction
1 Unsupervised Learning (Clustering Problem)

❖ Introduction to unsupervised learning


• Supervised learning example: Classification Problem

[Figure: labeled training examples plotted against features x1 and x2, separated into two classes]

Given a training set: {(x^(1), y^(1)), (x^(2), y^(2)), (x^(3), y^(3)), …, (x^(m), y^(m))}
1 Unsupervised Learning (Clustering Problem)

❖ Introduction to unsupervised learning


• Unsupervised learning example: Clustering Problem

[Figure: unlabeled examples plotted against features x1 and x2, grouped into Cluster 1 and Cluster 2]

Given a training set: {x^(1), x^(2), x^(3), …, x^(m)}
1 Unsupervised Learning (Clustering Problem)

❖ Introduction to unsupervised learning


• Applications of clustering

─ Market segmentation
─ Organize computing clusters
─ Astronomical data analysis


2. K-means Clustering Algorithm
2 K-means clustering algorithm

❖ K-means algorithm
• Inputs:
─ K (number of clusters)
─ Training set {x^(1), x^(2), x^(3), …, x^(m)}, x^(i) ∈ ℝ^n

• Algorithm flow:
Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ ℝ^n
Repeat {
    for i = 1 to m
        c^(i) := index (from 1 to K) of cluster centroid closest to x^(i)
    for k = 1 to K
        μ_k := average (mean) of points assigned to cluster k
}
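As a concrete illustration, a minimal MATLAB sketch of this loop (the function name kmeansSketch, the m-by-n data matrix X, and maxIter are assumptions of this sketch, not the lecture's code; X - mu(k, :) relies on implicit expansion, available since MATLAB R2016b):

% Minimal K-means sketch: alternate cluster assignment and centroid moves.
function [c, mu] = kmeansSketch(X, K, maxIter)
    m = size(X, 1);
    mu = X(randperm(m, K), :);             % initialize centroids at K random examples
    D = zeros(m, K);
    for iter = 1:maxIter
        for k = 1:K                        % cluster assignment step
            D(:, k) = sum((X - mu(k, :)).^2, 2);
        end
        [~, c] = min(D, [], 2);            % c(i) = index of the closest centroid
        for k = 1:K                        % move-centroid step
            if any(c == k)                 % skip clusters that lost all points
                mu(k, :) = mean(X(c == k, :), 1);
            end
        end
    end
end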
2 K-means clustering algorithm

❖ K-means algorithm
• Execution of algorithm

[Figures: six consecutive snapshots of the algorithm running on 2D data, alternating cluster assignment and centroid update steps until convergence]
2 K-means clustering algorithm

❖ K-means algorithm objective function

c^(i) = index of the cluster (1, 2, …, K) to which example x^(i) is currently assigned

μ_k = cluster centroid k (μ_k ∈ ℝ^n)

μ_c^(i) = cluster centroid of the cluster to which example x^(i) has been assigned

Optimization objective:

J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) Σ_{i=1}^{m} ‖x^(i) − μ_c^(i)‖²

min_{c^(1), …, c^(m), μ_1, …, μ_K} J(c^(1), …, c^(m), μ_1, …, μ_K)
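Given an assignment vector c and centroid matrix mu from the sketch above, this cost is a single MATLAB expression (same assumptions as before; mu(c, :) picks each example's centroid by row indexing):

J = mean(sum((X - mu(c, :)).^2, 2));   % average squared distance to the assigned centroid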
Strengths of k-means

• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are typically small, k-means is considered a
linear algorithm.
• K-means is the most popular clustering algorithm.
• Note: it terminates at a local optimum if SSE is used as the
objective. The global optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable if the mean
is defined.
– For categorical data, use k-modes, where the centroid is
represented by the most frequent attribute values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away
from other data points.
– Outliers could be errors in the data recording or
some special data points with very different
values.
Weaknesses of k-means: Problems with outliers

[Figure: outliers far from the main mass of points pull a centroid away from its natural cluster]
Weaknesses of k-means: To deal with outliers

• One method is to remove some data points in the clustering
process that are much further away from the centroids than
other data points.
– To be safe, we may want to monitor these possible outliers
over a few iterations and then decide to remove them.
• Another method is to perform random sampling. Since in
sampling we only choose a small subset of the data points,
the chance of selecting an outlier is very small.
– Assign the rest of the data points to the clusters by
distance or similarity comparison, or classification.
Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.

[Figure: with poor initial seeds, K-means converges to an unnatural clustering]
Weaknesses of k-means (cont …)

• If we use different seeds: good results.

There are some methods to help choose good seeds.
Weaknesses of k-means (cont …)

• The k-means algorithm is not suitable for discovering clusters
that are not hyper-ellipsoids (or hyper-spheres).
Common ways to represent clusters

• Use the centroid of each cluster to represent the cluster:
– compute the radius and
– standard deviation of the cluster to
determine its spread in each dimension.
– The centroid representation alone works well
if the clusters are of hyper-spherical shape.
– If clusters are elongated or of other
shapes, centroids are not sufficient.
3. Centroid Initialization and Choosing the Number of K
3 Centroid initialization and choosing the number of K

❖ Random initialization of centroids for K-means algorithm


• Number of clusters K < m
• Randomly pick K training examples
• Set μ_1, …, μ_K equal to these K examples
3 Centroid initialization and choosing the number of K

❖ Random initialization of centroids for K-means algorithm


• Problem

[Figures: "Ideal case of random initialization" vs. "Bad examples of random initialization"]
3 Centroid initialization and choosing the number of K

❖ Random initialization of centroids for K-means algorithm


• Solution

For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c^(1), …, c^(m), μ_1, …, μ_K.
    Compute the cost function J(c^(1), …, c^(m), μ_1, …, μ_K).
}

Pick the clustering with the lowest cost J(c^(1), …, c^(m), μ_1, …, μ_K).
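In MATLAB, this multiple-restart loop might look as follows (a sketch reusing the illustrative kmeansSketch function and cost expression from earlier; all names are assumptions):

bestJ = inf;
for trial = 1:100
    [c, mu] = kmeansSketch(X, K, 50);       % random initialization happens inside
    J = mean(sum((X - mu(c, :)).^2, 2));    % cost of this run
    if J < bestJ                            % keep the lowest-cost clustering
        bestJ = J; bestC = c; bestMu = mu;
    end
end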
3 Centroid initialization and choosing the number of K

❖ Choosing the number of clusters


• What is the right number of clusters? How to choose it?

3 Centroid initialization and choosing the number of K

❖ Choosing the number of clusters K


• Elbow method: run K-means for a range of values of K, plot the cost J as a
function of K, and pick the K at the "elbow" of the curve, where the cost
stops decreasing rapidly.

[Figure: cost J versus number of clusters K, with an elbow in the curve]
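A short sweep in MATLAB (again reusing the illustrative kmeansSketch function):

maxK = 10;
costs = zeros(1, maxK);
for K = 1:maxK
    [c, mu] = kmeansSketch(X, K, 50);
    costs(K) = mean(sum((X - mu(c, :)).^2, 2));
end
plot(1:maxK, costs, '-o'); xlabel('K'); ylabel('Cost J');   % look for the elbow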
3 Centroid initialization and choosing the number of K

❖ Choosing the number of clusters K


• Evaluate K-means based on a metric for how well it performs for the particular
purpose

[Figure: an example from machine-health monitoring, where Feature 1 and Feature 2 construct a Health Index for rubbing prognosis; as rubbing increases, the points separate into clusters for no rubbing, slight rubbing, and intensive rubbing]
4. Dimensionality Reduction
4 Dimensionality Reduction

❖ Motivation. Data Compression and Visualization.


• Reduce data from 2D to 1D
Mapping:
x^(1) ∈ ℝ² → z^(1) ∈ ℝ
x^(2) ∈ ℝ² → z^(2) ∈ ℝ
…
x^(m) ∈ ℝ² → z^(m) ∈ ℝ

[Figure: 2D points x^(1), x^(2), … in the (x1, x2) plane projected onto a line, giving 1D coordinates z^(1), z^(2), …]
4 Dimensionality Reduction

❖ Motivation. Data Compression and Visualization.


• Reduce data from 3D to 2D

x^(i) ∈ ℝ³ → z^(i) ∈ ℝ²

[Figure: 3D points in (x1, x2, x3) projected onto a plane with new coordinates (z1, z2)]
5. Principal Component Analysis
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) problem formulation


[Figure: 2D data in the (x1, x2) plane with a principal direction and the projection errors shown]

Reduce from n dimensions to k dimensions: find k vectors u^(1), u^(2), …, u^(k) onto which to project the data, so as to minimize the projection error.
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) algorithm


• Data Preprocessing

Given a training data set: x^(1), x^(2), …, x^(m)

Preprocessing (feature scaling / mean normalization):

μ_j = (1/m) Σ_{i=1}^{m} x_j^(i)

Replace each x_j^(i) with x_j^(i) − μ_j.

If different features are on different scales, scale features to have comparable
ranges of values:

x_j^(i) := (x_j^(i) − μ_j) / (max(x_j) − min(x_j))
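In MATLAB, this preprocessing is a few vectorized lines (a sketch; the m-by-n matrix X with one example per row is an assumption, and the elementwise operations rely on implicit expansion):

mu = mean(X, 1);                          % per-feature means, 1-by-n
Xnorm = X - mu;                           % mean normalization
spread = max(X, [], 1) - min(X, [], 1);   % per-feature range
Xnorm = Xnorm ./ spread;                  % feature scaling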
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) algorithm

Reduce data from n dimensions to k dimensions.

Compute the covariance matrix:

Σ = (1/m) Σ_{i=1}^{m} (x^(i))(x^(i))ᵀ

Compute the "eigenvectors" of the covariance matrix:

[U,S,V] = svd(Sigma);

Here we obtain:

U = [u^(1) u^(2) … u^(n)] ∈ ℝ^(n×n)

Reference: J. Shlens, "A Tutorial on Principal Component Analysis: Derivation, Discussion and Singular Value Decomposition", 2003.
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) algorithm


• Linear transformation of the original data

In the previous step, we obtained the eigenvectors from the singular value decomposition of the covariance matrix:

U = [u^(1) u^(2) … u^(n)] ∈ ℝ^(n×n)

Keeping only the first k columns gives U_reduce = [u^(1) u^(2) … u^(k)] ∈ ℝ^(n×k). Now we can transform the original data from its own feature dimension into a new feature space based on the principal components (x ∈ ℝ^n → z ∈ ℝ^k):

z^(i) = U_reduceᵀ x^(i) = [(u^(1))ᵀ; (u^(2))ᵀ; …; (u^(k))ᵀ] x^(i)

with dimensions (k×n)(n×1) = (k×1).
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• MATLAB implementation

After mean normalization (ensure every feature has zero mean) and optionally
feature scaling:

Σ = (1/m) Σ_{i=1}^{m} (x^(i))(x^(i))ᵀ

[U,S,V] = svd(Sigma);

Ureduce = U(:,1:k);

z = Ureduce'*x;

*Notice:
MATLAB also provides functions that compute principal components directly,
without the steps mentioned above. Have a look at pca (applies PCA directly
to the data) and pcacov (extracts principal components from a covariance
matrix). The results should be similar.
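Putting the pipeline together for a whole data set (a sketch under the same assumptions as before; with one example per row, the covariance matrix can be formed as a single matrix product):

m = size(X, 1);
mu = mean(X, 1);
Xnorm = X - mu;                        % mean normalization
Sigma = (1/m) * (Xnorm' * Xnorm);      % n-by-n covariance matrix
[U, S, ~] = svd(Sigma);
k = 2;                                 % illustrative target dimension
Ureduce = U(:, 1:k);
Z = Xnorm * Ureduce;                   % m-by-k projected data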
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• Reconstruction of data from its compressed representation

z^(i) = U_reduceᵀ x^(i)   (in the 2D example: z ∈ ℝ → x ∈ ℝ²)

x_approx^(i) = U_reduce z^(i) ≈ x^(i), with dimensions (n×k)(k×1) = (n×1)

[Figure: 1D coordinates z^(1), z^(2), … on the projection line mapped back to points x_approx^(1), x_approx^(2), … near the original x^(1), x^(2), … in the (x1, x2) plane]
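In MATLAB, reconstruction from the projected data Z of the pipeline sketch above is one line (adding mu back undoes the mean normalization, which is an assumption of that sketch):

Xapprox = Z * Ureduce' + mu;   % m-by-n; only approximate, since k < n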
5 Principal Component Analysis

❖ Choosing the number of principal components

Average squared projection error: (1/m) Σ_{i=1}^{m} ‖x^(i) − x_approx^(i)‖²

Total variation in the data: (1/m) Σ_{i=1}^{m} ‖x^(i)‖²

Typically, choose k to be the smallest value so that:

[(1/m) Σ_{i=1}^{m} ‖x^(i) − x_approx^(i)‖²] / [(1/m) Σ_{i=1}^{m} ‖x^(i)‖²] ≤ 0.01

"99% of the variance is retained"
5 Principal Component Analysis

❖ Choosing the number of principal components

Two ways to check the retained variance of the data, given [U,S,V] = svd(Sigma), where S is a diagonal matrix with entries S_11, S_22, …, S_nn:

Iterative algorithm:
Try PCA with k = 1.
Compute U_reduce, z^(1), z^(2), …, z^(m), x_approx^(1), …, x_approx^(m).
Check whether:

[(1/m) Σ_{i=1}^{m} ‖x^(i) − x_approx^(i)‖²] / [(1/m) Σ_{i=1}^{m} ‖x^(i)‖²] ≤ 0.01

If yes, stop; if no, set k := k + 1 and run the algorithm again.

Using S directly: for a given k, check:

1 − (Σ_{i=1}^{k} S_ii) / (Σ_{i=1}^{n} S_ii) ≤ 0.01

Pick the smallest value of k for which "99% of variance is retained".
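The second check takes only a few lines in MATLAB (a sketch using the S returned by the svd call above):

s = diag(S);                       % diagonal entries S11, ..., Snn
retained = cumsum(s) / sum(s);     % fraction of variance retained for each k
k = find(retained >= 0.99, 1);     % smallest k retaining 99% of the variance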
5 Principal Component Analysis

❖ Advice on applying PCA


• Supervised learning speedup

(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))

Extract the inputs as an unlabelled dataset: x^(1), x^(2), …, x^(m) ∈ ℝ^10000

Apply PCA: z^(1), z^(2), …, z^(m) ∈ ℝ^100

New training set:

(z^(1), y^(1)), (z^(2), y^(2)), …, (z^(m), y^(m))

Note: the mapping x^(i) → z^(i) should be defined by running PCA only on the training set.
The same mapping can then be applied to the examples x_cv^(i) and x_test^(i) in the
cross-validation and test sets.
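A MATLAB sketch of this discipline (variable names are assumptions; the point is that mu and Ureduce come from the training set only):

mTrain = size(Xtrain, 1);
mu = mean(Xtrain, 1);
Sigma = (1/mTrain) * ((Xtrain - mu)' * (Xtrain - mu));
[U, ~, ~] = svd(Sigma);
Ureduce = U(:, 1:k);
Ztrain = (Xtrain - mu) * Ureduce;    % used to train the supervised model
Ztest  = (Xtest  - mu) * Ureduce;    % same mapping; no refitting on test data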
5 Principal Component Analysis

❖ Advice on applying PCA


• Compression
─ Reduce memory/disk needed to store data
─ Speed up learning algorithm

• Visualisation

• Before implementing PCA, first try running whatever you want to do with the
original/raw data. Only if that does not do what you want should you implement
PCA and consider a different feature-space dimension.
How to choose a clustering algorithm

• Clustering research has a long history. A vast collection
of algorithms is available.
– We only introduced several main algorithms.
• Choosing the "best" algorithm is a challenge.
– Every algorithm has limitations and works well with certain
data distributions.
– It is very hard, if not impossible, to know what distribution
the application data follow. The data may not fully follow
any "ideal" structure or distribution required by the
algorithms.
– One also needs to decide how to standardize the data, how
to choose a suitable distance function, and how to select
other parameter values.
Choose a clustering algorithm (cont …)

• Due to these complexities, the common practice is to
– run several algorithms using different distance functions
and parameter settings, and
– then carefully analyze and compare the results.
• The interpretation of the results must be based on
insight into the meaning of the original data together
with knowledge of the algorithms used.
• Clustering is highly application dependent and to a
certain extent subjective (personal preferences).
Cluster Evaluation: hard problem

• The quality of a clustering is very hard to evaluate because
– we do not know the correct clusters.
• Some methods are used:
– User inspection
• Study centroids and spreads.
• Rules from a decision tree.
• For text documents, one can read some documents
in clusters.
Cluster evaluation: ground truth

• We use some labeled data (for classification).
• Assumption: each class is a cluster.
• After clustering, a confusion matrix is constructed. From the
matrix, we compute various measures: entropy, purity,
precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck). The
clustering method produces k clusters, which divide D
into k disjoint subsets, D1, D2, …, Dk.
Evaluation measures: Entropy

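For reference, the standard definitions (with Pr_i(c_j) the proportion of class c_j among the points of cluster D_i):

entropy(D_i) = − Σ_{j=1}^{k} Pr_i(c_j) log₂ Pr_i(c_j)

entropy_total(D) = Σ_{i=1}^{k} (|D_i| / |D|) · entropy(D_i)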
Evaluation measures: Purity

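For reference, the standard definitions (same notation as above):

purity(D_i) = max_j Pr_i(c_j)

purity_total(D) = Σ_{i=1}^{k} (|D_i| / |D|) · purity(D_i)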
A remark about ground truth evaluation

• Commonly used to compare different clustering
algorithms.
• A real-life data set for clustering has no class
labels.
– Thus, although an algorithm may perform very well on
some labeled data sets, there is no guarantee that it will
perform well on the actual application data at hand.
• The fact that it performs well on some labeled data
sets does give us some confidence in the quality
of the algorithm.
• This evaluation method is said to be based on
external data or information.
Evaluation based on internal information

• Intra-cluster cohesion (compactness):


– Cohesion measures how near the data points in a
cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used
measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids
should be far away from one another.
• In most applications, expert judgments are still the key.

Indirect evaluation

• In some applications, clustering is not the primary
task, but is used to help perform another task.
• We can use the performance on the primary task
to compare clustering methods.
• For instance, in an application, the primary task is
to provide recommendations on book purchasing
to online shoppers.
– If we can cluster books according to their features, we
might be able to provide better recommendations.
– We can evaluate different clustering algorithms based on
how well they help with the recommendation task.
– Here, we assume that the recommendation can be
reliably evaluated.

Summary

• Clustering has a long history and is still active


– There are a huge number of clustering algorithms
– More are still coming every year.
• We only introduced several main algorithms. There
are many others, e.g.,
– density based algorithm, sub-space clustering, scale-up
methods, neural networks based methods, fuzzy clustering,
co-clustering, etc.
• Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being devised
every year.
• Clustering is highly application dependent and to
some extent subjective.
