Live 2 - AI - K Means Clustering
k - means Clustering
Pham Viet Cuong - Dept. Control Eng. & Automation, FEEE, HCMUT 2
k - Means Clustering
ü Example:
v Document clustering
§ Web search engines often return thousands of pages, which is difficult
for users to sift through
§ Clustering can be used to group retrieved documents into
categories
v Customer segmentation
v Recommendation engines
v Image compression
ü Supervised or unsupervised?
ü Requirements
v An integer k
v A set of training data (without labels)
v A metric to measure similarity
ü Algorithm
v Pick k random points as cluster centers
v Repeat until convergence
§ Assign data points to closest cluster center
§ Update each cluster center to be the mean of its assigned points
Convergence: no point's assignment changes from one iteration to the next
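The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance as the metric, picks the k initial centers by sampling k of the data points, and keeps the old center when a cluster ends up empty. The names `k_means`, `squared_distance`, `seed`, and `max_iter` are hypothetical and do not come from the slides.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` into k groups; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct data points chosen at random.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Convergence: no point's assignment changed.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignments = k_means(points, k=2)
print(sorted(centers))
print(assignments)
```

On this toy data the two centers should settle near (1/3, 1/3) and (31/3, 31/3), with the first three points in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization.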
ü Example 1
ü Example: Image segmentation
v Segmentation: partition an image into regions, each of which has a
reasonably homogeneous visual appearance
ü Example: Geyser eruptions
v Eruption time (mins)
v Waiting time to next eruption (mins)
ü Example: Image compression
v Original image: 396*396*24 = 3,763,584 bits
v Compressed image: 30*24 + 396*396*4 = 627,984 bits
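The bit counts above can be checked with a few lines of Python. The first part reproduces the slide's arithmetic exactly as written. The helper `palette_compressed_bits` is a hypothetical generalization that is not taken from the slides: it assumes a palette of k colors stored at 24 bits each plus one index of ceil(log2 k) bits per pixel.

```python
def palette_compressed_bits(n_pixels, k, bits_per_color=24):
    """Bits for a k-color palette plus one palette index per pixel.

    Assumption: each index uses ceil(log2(k)) bits.
    """
    index_bits = (k - 1).bit_length()  # equals ceil(log2(k)) for k >= 1
    return k * bits_per_color + n_pixels * index_bits


n_pixels = 396 * 396

# Figures exactly as given on the slide.
original_bits = n_pixels * 24
slide_compressed_bits = 30 * 24 + n_pixels * 4

print(original_bits)                          # 3763584
print(slide_compressed_bits)                  # 627984
print(original_bits / slide_compressed_bits)  # roughly 6x smaller

# Hypothetical palette model with k = 16 colors (4-bit indices).
print(palette_compressed_bits(n_pixels, 16))  # 627648
```

In both versions the per-pixel index term dominates, so the saving comes almost entirely from replacing 24 bits per pixel with 4.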
ü Properties
v Guaranteed to converge in a finite number of iterations
v Running time per iteration (N data points, k clusters)
§ Assign data points to closest cluster center: O(kN)
§ Update each cluster center to be the mean of its assigned points: O(N)
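The convergence guarantee can be seen from the objective that k-means decreases. The notation below is an assumption introduced for illustration (it does not appear on the slides): x_n are the N data points, \mu_j the k cluster centers, and r_{nj} = 1 when point n is assigned to cluster j. The argument assumes squared Euclidean distance.

```latex
J = \sum_{n=1}^{N} \sum_{j=1}^{k} r_{nj} \, \lVert x_n - \mu_j \rVert^2,
\qquad r_{nj} \in \{0, 1\}, \quad \sum_{j=1}^{k} r_{nj} = 1
```

The assignment step minimizes J over the r_{nj} with the centers fixed, and the update step minimizes J over the \mu_j with the assignments fixed, because the mean minimizes the sum of squared distances to a set of points. J therefore never increases, and since there are at most k^N distinct assignments, the algorithm must stop after finitely many iterations. The result is a local minimum of J, not necessarily the global one.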
ü How to measure similarity?
v Similarity is subjective
v Depends on data, cases, users, etc.
v Not always straightforward which metrics work well
v “Trial and error” can be used
v Examples of similarity measures: Euclidean, Manhattan, cosine distance
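The three measures named above can be written as short Python functions. This is a sketch for points given as equal-length tuples of numbers; the function names are illustrative, and cosine distance is taken here as 1 minus cosine similarity, which is undefined for a zero vector.

```python
import math


def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    """1 - cos(angle between p and q); ignores vector length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)  # fails for zero vectors


print(euclidean((0, 0), (3, 4)))        # 5.0
print(manhattan((0, 0), (3, 4)))        # 7
print(cosine_distance((1, 0), (0, 1)))  # 1.0
print(cosine_distance((1, 0), (2, 0)))  # 0.0
```

The last line shows why the choice matters: (1, 0) and (2, 0) are identical under cosine distance because they point in the same direction, yet they are 1 unit apart under the Euclidean and Manhattan measures.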
ü How to choose k?
v Elbow method
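A rough sketch of the elbow method: run k-means for several values of k, record the within-cluster sum of squared distances (WCSS), and look for the k after which the improvement flattens out. Several things here are assumptions made for illustration and are not from the slides: the toy data, the deterministic farthest-first initialization (used instead of random centers so the output is reproducible), and the `elbow_k` heuristic, which picks the k where the drop in WCSS shrinks the most. In practice the elbow is often read off a plot by eye.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(points, k):
    """Assumption: start at the first point, then repeatedly add the point
    farthest from all centers chosen so far (replaces random init)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def k_means_wcss(points, k, max_iter=100):
    """Run k-means and return the within-cluster sum of squared distances."""
    centers = farthest_first_centers(points, k)
    assignments = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new == assignments:
            break
        assignments = new
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centers[a]) for p, a in zip(points, assignments))


def elbow_k(wcss):
    """Hypothetical heuristic: k with the largest change in WCSS improvement.
    Expects consecutive integer keys."""
    ks = sorted(wcss)
    return max(
        ks[1:-1],
        key=lambda k: (wcss[k - 1] - wcss[k]) - (wcss[k] - wcss[k + 1]),
    )


# Toy data with three visibly separate groups.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (5, 10), (5, 11), (6, 10)]

wcss = {k: k_means_wcss(points, k) for k in range(1, 6)}
for k in sorted(wcss):
    print(k, round(wcss[k], 3))
print("elbow at k =", elbow_k(wcss))
```

On this data the WCSS falls steeply up to k = 3 and barely changes afterwards, so the heuristic selects k = 3. Real data rarely shows such a sharp bend.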
ü Drawbacks
ü Sources:
v http://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture14.pdf
v https://www.slideshare.net/annafensel/kmeans-clustering-122651195
v https://en.wikipedia.org/wiki/Elbow_method_(clustering)
v https://www2.stat.duke.edu/courses/Fall02/sta290/datasets/geyser