
UNSUPERVISED LEARNING AND DIMENSIONALITY REDUCTION

DR. M M MANJURUL ISLAM


Contents
1. Unsupervised Learning (Clustering Problem)
2. K-means clustering algorithm
3. Centroid initialization and choosing the number of K
4. Dimensionality Reduction
5. Principal Component Analysis
1. Unsupervised Learning: Introduction
1 Unsupervised Learning (Clustering Problem)

❖ Introduction to unsupervised learning


• Supervised learning example: Classification Problem

[Figure: labeled training examples plotted against features x1 and x2, separated into two classes]

Given a training set: {(x^(1), y^(1)), (x^(2), y^(2)), (x^(3), y^(3)), …, (x^(m), y^(m))}
1 Unsupervised Learning (Clustering Problem)

❖ Introduction to unsupervised learning


• Unsupervised learning example: Clustering Problem

[Figure: unlabeled examples plotted against features x1 and x2, grouped into Cluster 1 and Cluster 2]

Given a training set: {x^(1), x^(2), x^(3), …, x^(m)}
1 Unsupervised Learning (Clustering Problem)

❖ Introduction to unsupervised learning


• Applications of clustering

─ Market segmentation
─ Organize computing clusters
─ Astronomical data analysis


2. K-means Clustering Algorithm
2 K-means clustering algorithm

❖ K-means algorithm
• Inputs:
─ K (number of clusters)
─ Training set {x^(1), x^(2), x^(3), …, x^(m)}, x^(i) ∈ ℝ^n

• Algorithm flow:
Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ ℝ^n
Repeat {
    for i = 1 to m
        c^(i) := index (from 1 to K) of cluster centroid closest to x^(i)
    for k = 1 to K
        μ_k := average (mean) of points assigned to cluster k
}
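As a concrete illustration, a minimal MATLAB sketch of this loop (the function name kmeansSketch, the m-by-n data matrix X, and maxIter are assumptions of this sketch, not the lecture's code; X - mu(k, :) relies on implicit expansion, available since MATLAB R2016b):

% Minimal K-means sketch: alternate cluster assignment and centroid moves.
function [c, mu] = kmeansSketch(X, K, maxIter)
    m = size(X, 1);
    mu = X(randperm(m, K), :);             % initialize centroids at K random examples
    D = zeros(m, K);
    for iter = 1:maxIter
        for k = 1:K                        % cluster assignment step
            D(:, k) = sum((X - mu(k, :)).^2, 2);
        end
        [~, c] = min(D, [], 2);            % c(i) = index of the closest centroid
        for k = 1:K                        % move-centroid step
            if any(c == k)                 % skip clusters that lost all points
                mu(k, :) = mean(X(c == k, :), 1);
            end
        end
    end
end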
2 K-means clustering algorithm

❖ K-means algorithm
• Execution of algorithm

[Figures: six consecutive snapshots of the algorithm running on 2D data, alternating cluster assignment and centroid update steps until convergence]
2 K-means clustering algorithm

❖ K-means algorithm objective function

c^(i) = index of the cluster (1, 2, …, K) to which example x^(i) is currently assigned

μ_k = cluster centroid k (μ_k ∈ ℝ^n)

μ_c^(i) = cluster centroid of the cluster to which example x^(i) has been assigned

Optimization objective:

J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) Σ_{i=1}^{m} ‖x^(i) − μ_c^(i)‖²

min_{c^(1), …, c^(m), μ_1, …, μ_K} J(c^(1), …, c^(m), μ_1, …, μ_K)
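Given an assignment vector c and centroid matrix mu from the sketch above, this cost is a single MATLAB expression (same assumptions as before; mu(c, :) picks each example's centroid by row indexing):

J = mean(sum((X - mu(c, :)).^2, 2));   % average squared distance to the assigned centroid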
Strengths of k-means

• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are typically small, k-means is considered a
linear algorithm.
• K-means is the most popular clustering algorithm.
• Note: it terminates at a local optimum if SSE is used as the
objective. The global optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable if the mean
is defined.
– For categorical data, use k-modes, where the centroid is
represented by the most frequent attribute values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away
from other data points.
– Outliers could be errors in the data recording or
some special data points with very different
values.
Weaknesses of k-means: Problems with outliers

[Figure: outliers far from the main mass of points pull a centroid away from its natural cluster]
Weaknesses of k-means: To deal with outliers

• One method is to remove some data points in the clustering
process that are much further away from the centroids than
other data points.
– To be safe, we may want to monitor these possible outliers
over a few iterations and then decide to remove them.
• Another method is to perform random sampling. Since in
sampling we only choose a small subset of the data points,
the chance of selecting an outlier is very small.
– Assign the rest of the data points to the clusters by
distance or similarity comparison, or classification.
Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.

[Figure: with poor initial seeds, K-means converges to an unnatural clustering]
Weaknesses of k-means (cont …)

• If we use different seeds: good results.

There are some methods to help choose good seeds.
Weaknesses of k-means (cont …)

• The k-means algorithm is not suitable for discovering clusters
that are not hyper-ellipsoids (or hyper-spheres).
Common ways to represent clusters

• Use the centroid of each cluster to represent the cluster:
– compute the radius and
– standard deviation of the cluster to
determine its spread in each dimension.
– The centroid representation alone works well
if the clusters are of hyper-spherical shape.
– If clusters are elongated or of other
shapes, centroids are not sufficient.
3. Centroid Initialization and Choosing the Number of K
3 Centroid initialization and choosing the number of K

❖ Random initialization of centroids for K-means algorithm


• Number of clusters K < m
• Randomly pick K training examples
• Set μ_1, …, μ_K equal to these K examples
3 Centroid initialization and choosing the number of K

❖ Random initialization of centroids for K-means algorithm


• Problem

[Figures: "Ideal case of random initialization" vs. "Bad examples of random initialization"]
3 Centroid initialization and choosing the number of K

❖ Random initialization of centroids for K-means algorithm


• Solution

For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c^(1), …, c^(m), μ_1, …, μ_K.
    Compute the cost function J(c^(1), …, c^(m), μ_1, …, μ_K).
}

Pick the clustering with the lowest cost J(c^(1), …, c^(m), μ_1, …, μ_K).
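In MATLAB, this multiple-restart loop might look as follows (a sketch reusing the illustrative kmeansSketch function and cost expression from earlier; all names are assumptions):

bestJ = inf;
for trial = 1:100
    [c, mu] = kmeansSketch(X, K, 50);       % random initialization happens inside
    J = mean(sum((X - mu(c, :)).^2, 2));    % cost of this run
    if J < bestJ                            % keep the lowest-cost clustering
        bestJ = J; bestC = c; bestMu = mu;
    end
end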
3 Centroid initialization and choosing the number of K

❖ Choosing the number of clusters


• What is the right number of clusters? How to choose it?

3 Centroid initialization and choosing the number of K

❖ Choosing the number of clusters K


• Elbow method: run K-means for a range of values of K, plot the cost J as a
function of K, and pick the K at the "elbow" of the curve, where the cost
stops decreasing rapidly.

[Figure: cost J versus number of clusters K, with an elbow in the curve]
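A short sweep in MATLAB (again reusing the illustrative kmeansSketch function):

maxK = 10;
costs = zeros(1, maxK);
for K = 1:maxK
    [c, mu] = kmeansSketch(X, K, 50);
    costs(K) = mean(sum((X - mu(c, :)).^2, 2));
end
plot(1:maxK, costs, '-o'); xlabel('K'); ylabel('Cost J');   % look for the elbow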
3 Centroid initialization and choosing the number of K

❖ Choosing the number of clusters K


• Evaluate K-means based on a metric for how well it performs for the particular
purpose

[Figure: an example from machine-health monitoring, where Feature 1 and Feature 2 construct a Health Index for rubbing prognosis; as rubbing increases, the points separate into clusters for no rubbing, slight rubbing, and intensive rubbing]
4. Dimensionality Reduction
4 Dimensionality Reduction

❖ Motivation. Data Compression and Visualization.


• Reduce data from 2D to 1D
Mapping:
x^(1) ∈ ℝ² → z^(1) ∈ ℝ
x^(2) ∈ ℝ² → z^(2) ∈ ℝ
…
x^(m) ∈ ℝ² → z^(m) ∈ ℝ

[Figure: 2D points x^(1), x^(2), … in the (x1, x2) plane projected onto a line, giving 1D coordinates z^(1), z^(2), …]
4 Dimensionality Reduction

❖ Motivation. Data Compression and Visualization.


• Reduce data from 3D to 2D

x^(i) ∈ ℝ³ → z^(i) ∈ ℝ²

[Figure: 3D points in (x1, x2, x3) projected onto a plane with new coordinates (z1, z2)]
5. Principal Component Analysis
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) problem formulation


[Figure: 2D data in the (x1, x2) plane with a principal direction and the projection errors shown]

Reduce from n dimensions to k dimensions: find k vectors u^(1), u^(2), …, u^(k) onto which to project the data, so as to minimize the projection error.
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) algorithm


• Data Preprocessing

Given a training data set: x^(1), x^(2), …, x^(m)

Preprocessing (feature scaling / mean normalization):

μ_j = (1/m) Σ_{i=1}^{m} x_j^(i)

Replace each x_j^(i) with x_j^(i) − μ_j.

If different features are on different scales, scale features to have comparable
ranges of values:

x_j^(i) := (x_j^(i) − μ_j) / (max(x_j) − min(x_j))
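In MATLAB, this preprocessing is a few vectorized lines (a sketch; the m-by-n matrix X with one example per row is an assumption, and the elementwise operations rely on implicit expansion):

mu = mean(X, 1);                          % per-feature means, 1-by-n
Xnorm = X - mu;                           % mean normalization
spread = max(X, [], 1) - min(X, [], 1);   % per-feature range
Xnorm = Xnorm ./ spread;                  % feature scaling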
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) algorithm

Reduce data from n dimensions to k dimensions.

Compute the covariance matrix:

Σ = (1/m) Σ_{i=1}^{m} (x^(i))(x^(i))ᵀ

Compute the "eigenvectors" of the covariance matrix:

[U,S,V] = svd(Sigma);

Here we obtain:

U = [u^(1) u^(2) … u^(n)] ∈ ℝ^(n×n)

Reference: J. Shlens, "A Tutorial on Principal Component Analysis: Derivation, Discussion and Singular Value Decomposition", 2003.
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) algorithm


• Linear transformation of the original data

In the previous step, we obtained the eigenvectors from the singular value decomposition of the covariance matrix:

U = [u^(1) u^(2) … u^(n)] ∈ ℝ^(n×n)

Keeping only the first k columns gives U_reduce = [u^(1) u^(2) … u^(k)] ∈ ℝ^(n×k). Now we can transform the original data from its own feature dimension into a new feature space based on the principal components (x ∈ ℝ^n → z ∈ ℝ^k):

z^(i) = U_reduceᵀ x^(i) = [(u^(1))ᵀ; (u^(2))ᵀ; …; (u^(k))ᵀ] x^(i)

with dimensions (k×n)(n×1) = (k×1).
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• MATLAB implementation

After mean normalization (ensure every feature has zero mean) and optionally
feature scaling:

Σ = (1/m) Σ_{i=1}^{m} (x^(i))(x^(i))ᵀ

[U,S,V] = svd(Sigma);

Ureduce = U(:,1:k);

z = Ureduce'*x;

*Notice:
MATLAB also provides functions that compute principal components directly,
without the steps mentioned above. Have a look at pca (applies PCA directly
to the data) and pcacov (extracts principal components from a covariance
matrix). The results should be similar.
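Putting the pipeline together for a whole data set (a sketch under the same assumptions as before; with one example per row, the covariance matrix can be formed as a single matrix product):

m = size(X, 1);
mu = mean(X, 1);
Xnorm = X - mu;                        % mean normalization
Sigma = (1/m) * (Xnorm' * Xnorm);      % n-by-n covariance matrix
[U, S, ~] = svd(Sigma);
k = 2;                                 % illustrative target dimension
Ureduce = U(:, 1:k);
Z = Xnorm * Ureduce;                   % m-by-k projected data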
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• Reconstruction of data from its compressed representation

z^(i) = U_reduceᵀ x^(i)   (in the 2D example: z ∈ ℝ → x ∈ ℝ²)

x_approx^(i) = U_reduce z^(i) ≈ x^(i), with dimensions (n×k)(k×1) = (n×1)

[Figure: 1D coordinates z^(1), z^(2), … on the projection line mapped back to points x_approx^(1), x_approx^(2), … near the original x^(1), x^(2), … in the (x1, x2) plane]
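In MATLAB, reconstruction from the projected data Z of the pipeline sketch above is one line (adding mu back undoes the mean normalization, which is an assumption of that sketch):

Xapprox = Z * Ureduce' + mu;   % m-by-n; only approximate, since k < n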
5 Principal Component Analysis

❖ Choosing the number of principal components

Average squared projection error: (1/m) Σ_{i=1}^{m} ‖x^(i) − x_approx^(i)‖²

Total variation in the data: (1/m) Σ_{i=1}^{m} ‖x^(i)‖²

Typically, choose k to be the smallest value so that:

[(1/m) Σ_{i=1}^{m} ‖x^(i) − x_approx^(i)‖²] / [(1/m) Σ_{i=1}^{m} ‖x^(i)‖²] ≤ 0.01

"99% of the variance is retained"
5 Principal Component Analysis

❖ Choosing the number of principal components

Two ways to check the retained variance of the data, given [U,S,V] = svd(Sigma), where S is a diagonal matrix with entries S_11, S_22, …, S_nn:

Iterative algorithm:
Try PCA with k = 1.
Compute U_reduce, z^(1), z^(2), …, z^(m), x_approx^(1), …, x_approx^(m).
Check whether:

[(1/m) Σ_{i=1}^{m} ‖x^(i) − x_approx^(i)‖²] / [(1/m) Σ_{i=1}^{m} ‖x^(i)‖²] ≤ 0.01

If yes, stop; if no, set k := k + 1 and run the algorithm again.

Using S directly: for a given k, check:

1 − (Σ_{i=1}^{k} S_ii) / (Σ_{i=1}^{n} S_ii) ≤ 0.01

Pick the smallest value of k for which "99% of variance is retained".
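The second check takes only a few lines in MATLAB (a sketch using the S returned by the svd call above):

s = diag(S);                       % diagonal entries S11, ..., Snn
retained = cumsum(s) / sum(s);     % fraction of variance retained for each k
k = find(retained >= 0.99, 1);     % smallest k retaining 99% of the variance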
5 Principal Component Analysis

❖ Advice on applying PCA


• Supervised learning speedup

(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))

Extract the inputs as an unlabelled dataset: x^(1), x^(2), …, x^(m) ∈ ℝ^10000

Apply PCA: z^(1), z^(2), …, z^(m) ∈ ℝ^100

New training set:

(z^(1), y^(1)), (z^(2), y^(2)), …, (z^(m), y^(m))

Note: the mapping x^(i) → z^(i) should be defined by running PCA only on the training set.
The same mapping can then be applied to the examples x_cv^(i) and x_test^(i) in the
cross-validation and test sets.
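A MATLAB sketch of this discipline (variable names are assumptions; the point is that mu and Ureduce come from the training set only):

mTrain = size(Xtrain, 1);
mu = mean(Xtrain, 1);
Sigma = (1/mTrain) * ((Xtrain - mu)' * (Xtrain - mu));
[U, ~, ~] = svd(Sigma);
Ureduce = U(:, 1:k);
Ztrain = (Xtrain - mu) * Ureduce;    % used to train the supervised model
Ztest  = (Xtest  - mu) * Ureduce;    % same mapping; no refitting on test data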
5 Principal Component Analysis

❖ Advice on applying PCA


• Compression
─ Reduce memory/disk needed to store data
─ Speed up learning algorithm

• Visualisation

• Before implementing PCA, first try running whatever you want to do with the
original/raw data. Only if that does not do what you want should you implement
PCA and consider a different feature-space dimension.
How to choose a clustering algorithm

• Clustering research has a long history. A vast collection
of algorithms is available.
– We only introduced several main algorithms.
• Choosing the "best" algorithm is a challenge.
– Every algorithm has limitations and works well with certain
data distributions.
– It is very hard, if not impossible, to know what distribution
the application data follow. The data may not fully follow
any "ideal" structure or distribution required by the
algorithms.
– One also needs to decide how to standardize the data, how
to choose a suitable distance function, and how to select
other parameter values.
Choose a clustering algorithm (cont …)

• Due to these complexities, the common practice is to
– run several algorithms using different distance functions
and parameter settings, and
– then carefully analyze and compare the results.
• The interpretation of the results must be based on
insight into the meaning of the original data together
with knowledge of the algorithms used.
• Clustering is highly application dependent and to a
certain extent subjective (personal preferences).
Cluster Evaluation: hard problem

• The quality of a clustering is very hard to evaluate because
– we do not know the correct clusters.
• Some methods are used:
– User inspection
• Study centroids and spreads.
• Rules from a decision tree.
• For text documents, one can read some documents
in clusters.
Cluster evaluation: ground truth

• We use some labeled data (for classification).
• Assumption: each class is a cluster.
• After clustering, a confusion matrix is constructed. From the
matrix, we compute various measures: entropy, purity,
precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck). The
clustering method produces k clusters, which divide D
into k disjoint subsets, D1, D2, …, Dk.
Evaluation measures: Entropy

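For reference, the standard definitions (with Pr_i(c_j) the proportion of class c_j among the points of cluster D_i):

entropy(D_i) = − Σ_{j=1}^{k} Pr_i(c_j) log₂ Pr_i(c_j)

entropy_total(D) = Σ_{i=1}^{k} (|D_i| / |D|) · entropy(D_i)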
Evaluation measures: Purity

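For reference, the standard definitions (same notation as above):

purity(D_i) = max_j Pr_i(c_j)

purity_total(D) = Σ_{i=1}^{k} (|D_i| / |D|) · purity(D_i)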
A remark about ground truth evaluation

• Commonly used to compare different clustering
algorithms.
• A real-life data set for clustering has no class
labels.
– Thus, although an algorithm may perform very well on
some labeled data sets, there is no guarantee that it will
perform well on the actual application data at hand.
• The fact that it performs well on some labeled data
sets does give us some confidence in the quality
of the algorithm.
• This evaluation method is said to be based on
external data or information.
Evaluation based on internal information

• Intra-cluster cohesion (compactness):


– Cohesion measures how near the data points in a
cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used
measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids
should be far away from one another.
• In most applications, expert judgments are still the key.

Indirect evaluation

• In some applications, clustering is not the primary
task, but is used to help perform another task.
• We can use the performance on the primary task
to compare clustering methods.
• For instance, in an application, the primary task is
to provide recommendations on book purchasing
to online shoppers.
– If we can cluster books according to their features, we
might be able to provide better recommendations.
– We can evaluate different clustering algorithms based on
how well they help with the recommendation task.
– Here, we assume that the recommendation can be
reliably evaluated.

Summary

• Clustering has a long history and is still active


– There are a huge number of clustering algorithms
– More are still coming every year.
• We only introduced several main algorithms. There
are many others, e.g.,
– density based algorithm, sub-space clustering, scale-up
methods, neural networks based methods, fuzzy clustering,
co-clustering, etc.
• Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being devised
every year.
• Clustering is highly application dependent and to
some extent subjective.
