K Means


K-Means clustering method

1. Load the required libraries (install them first if needed):


library("readxl")
library("ggplot2")
library("dplyr")
library("ggfortify")
library("factoextra")
library("cluster")
2. Read the data from the Excel file 'data 1.xlsx', then select columns 2 and 3:
data <- read_excel("data 1.xlsx")
df <- select(data, 2:3)
3. Determine the value of K.
A. Using the Elbow Method:
# Plot the total within-cluster sum of squares (WSS) for k = 1 to 24
fviz_nbclust(df, kmeans, method = "wss", k.max = 24) + theme_minimal() + ggtitle("The Elbow Method")
Output:

According to the Elbow Method, the optimal value of K is 5.
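The quantity plotted by the elbow method can also be computed directly. The following is a minimal sketch in base R, assuming a small simulated matrix named toy (a stand-in, not the data from 'data 1.xlsx') and a reduced range of k:

```r
# Minimal sketch of the elbow computation in base R.
# `toy` is simulated stand-in data (an assumption), not the document's data.
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)
colnames(toy) <- c("x", "y")

k_values <- 1:8

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve stops dropping sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For k = 1 the value equals the total sum of squares of the data, which gives a quick sanity check on the curve.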


B. Using the Gap Statistic Method:
# The gap statistic compares the total within-cluster variation for different values of k
# with its expected value under a null reference distribution of the data.
# The estimated optimal number of clusters is the value of k that maximizes the gap statistic.
gap_stat <- clusGap(df, FUN = kmeans, nstart = 30, K.max = 24, B = 50)
fviz_gap_stat(gap_stat) + theme_minimal() + ggtitle("fviz_gap_stat: Gap Statistic")
Output:

According to the Gap Statistic Method, the optimal value of K is 1. A single cluster does not partition the data at all, so we will ignore this result.
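The numbers behind the gap plot are stored in the object returned by clusGap, so the chosen k can be inspected without the plot. The sketch below assumes simulated stand-in data (toy) and smaller K.max and B than above, purely to keep it fast. It compares the plain maximum of the gap with one of the standard-error rules provided by cluster::maxSE:

```r
library(cluster)  # clusGap() and maxSE()

# Simulated stand-in data (an assumption), two well-separated groups
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.5), ncol = 2)
)

gap <- clusGap(toy, FUNcluster = kmeans, nstart = 10, K.max = 6, B = 20)

# One row per k, with columns logW, E.logW, gap and SE.sim
gap_tab <- gap$Tab
print(gap_tab)

# k with the largest gap value
k_global <- which.max(gap_tab[, "gap"])

# k chosen by a standard-error rule (first local maximum within one SE)
k_se <- maxSE(gap_tab[, "gap"], gap_tab[, "SE.sim"], method = "firstSEmax")

c(global_max = k_global, first_se_max = k_se)
```

The two rules can disagree, which is worth checking before accepting or discarding a gap statistic result.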
C. Using the Silhouette Method:
# The average silhouette method computes the mean silhouette width of the observations for different values of k.
# The optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values.
fviz_nbclust(df, kmeans, method = "silhouette", k.max = 24) + theme_minimal() + ggtitle("The Silhouette Plot")
Output:

According to the Silhouette Method, the optimal value of K is 8.
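The average silhouette width used by this method can be computed by hand with cluster::silhouette. The sketch below is illustrative only: toy is simulated stand-in data and the range of k is shortened, both assumptions made to keep the example small.

```r
library(cluster)  # silhouette()

# Simulated stand-in data (an assumption), three well-separated groups
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

d <- dist(toy)      # pairwise Euclidean distances
k_values <- 2:8     # the silhouette is undefined for k = 1

# Mean silhouette width of the k-means partition for each k
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# Candidate k with the largest average silhouette width
best_k <- k_values[which.max(avg_sil)]
best_k
```

Because the silhouette needs at least two clusters, this method can never return K = 1, unlike the gap statistic.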

4. K-Means clustering method


A. K=5:
# Perform k-means clustering with k = 5 clusters
km_c5 <- kmeans(df, centers = 5, nstart = 25)
# Plot the results of the k-means model
fviz_cluster(km_c5, data = df)
Output:

Find the mean of each variable within each cluster, then add a cluster column to the data:
# Compute the mean of each variable in each cluster
aggregate(df, by = list(cluster = km_c5$cluster), FUN = mean)
# Append the cluster assignment of each observation to the data
final_data <- cbind(df, cluster = km_c5$cluster)
Output:
Cluster     times       accel
1        42.00000     5.12500
2        10.94545    -2.90000
3        35.58182   -22.14545
4        33.22000    53.18000
5        26.30000   -86.30000
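The per-cluster means returned by aggregate should coincide with the centers stored in the fitted k-means object, which gives a simple consistency check. Below is a minimal sketch, assuming simulated stand-in data (toy) that only borrows the column names times and accel from the output above:

```r
# Simulated stand-in data (an assumption); only the column names
# mirror the output shown above.
set.seed(1)
toy <- data.frame(
  times = c(rnorm(20, 10, 2), rnorm(20, 30, 2), rnorm(20, 50, 2)),
  accel = c(rnorm(20, 0, 5), rnorm(20, -80, 5), rnorm(20, 20, 5))
)

km <- kmeans(toy, centers = 3, nstart = 25)

# Mean of each variable within each cluster
cluster_means <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)

# The same values are stored as the fitted cluster centers
same_as_centers <- isTRUE(all.equal(
  as.matrix(cluster_means[, c("times", "accel")]),
  km$centers,
  check.attributes = FALSE
))
same_as_centers

# Attach the cluster label to each observation and count cluster sizes
labelled <- cbind(toy, cluster = km$cluster)
table(labelled$cluster)
```

The cluster numbers themselves are arbitrary labels and can change between runs unless the random seed is fixed.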
B. K=8:
# Perform k-means clustering with k = 8 clusters
km_c8 <- kmeans(df, centers = 8, nstart = 25)
# Plot the results of the k-means model
fviz_cluster(km_c8, data = df)
Output:

Find the mean of each variable within each cluster, then add a cluster column to the data:
# Compute the mean of each variable in each cluster
aggregate(df, by = list(cluster = km_c8$cluster), FUN = mean)
# Append the cluster assignment of each observation to the data
final_data <- cbind(df, cluster = km_c8$cluster)
Output:
Cluster     times        accel
1        32.28571   -18.928571
2        32.42500     5.012500
3        53.22857    -1.714286
4        26.20000  -107.000000
5        33.73333    73.200000
6         9.44000    -2.650000
7        33.00000    44.600000
8        29.93333   -49.566667
Consolidated code:
# Load the required libraries
library("readxl")
library("ggplot2")
library("dplyr")
library("ggfortify")
library("factoextra")
library("cluster")
#----------------------------------
# Load the data from the Excel file
data <- read_excel("data 1.xlsx")
View(data)
#-----------------------------------
# Select the columns used for clustering
df <- select(data, 2:3)
#------------------------------------
set.seed(1) # Fix the random seed so that the results are reproducible.
# The Elbow Method:
# Plot the total within-cluster sum of squares (WSS) for k = 1 to 24
fviz_nbclust(df, kmeans, method = "wss", k.max = 24) + theme_minimal() + ggtitle("The Elbow Method")
#------------------------------------
# The Gap Statistic Method:
# The gap statistic compares the total within-cluster variation
# for different values of k with its expected value
# under a null reference distribution of the data.
# The estimated optimal number of clusters is the value of k that maximizes the gap statistic.
gap_stat <- clusGap(df, FUN = kmeans, nstart = 30, K.max = 24, B = 50)
fviz_gap_stat(gap_stat) + theme_minimal() + ggtitle("fviz_gap_stat: Gap Statistic")
#-------------------------------------------------------
# The Silhouette Method:
# The average silhouette method computes the mean silhouette width of the observations for different values of k.
# The optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values.
fviz_nbclust(df, kmeans, method = "silhouette", k.max = 24) + theme_minimal() + ggtitle("The Silhouette Plot")
#-------------------------------------------------------------
# Perform k-means clustering with k = 5 clusters
km_c5 <- kmeans(df, centers = 5, nstart = 25)
# Plot the results of the k-means model
fviz_cluster(km_c5, data = df)
# Compute the mean of each variable in each cluster
aggregate(df, by = list(cluster = km_c5$cluster), FUN = mean)
# Append the cluster assignment of each observation to the data
final_data <- cbind(df, cluster = km_c5$cluster)
# View the final data
final_data
#--------------------------------------------------------------
# Perform k-means clustering with k = 8 clusters
km_c8 <- kmeans(df, centers = 8, nstart = 25)
# Plot the results of the k-means model
fviz_cluster(km_c8, data = df)
# Compute the mean of each variable in each cluster
aggregate(df, by = list(cluster = km_c8$cluster), FUN = mean)
# Append the cluster assignment of each observation to the data
# (this overwrites the final_data created for k = 5)
final_data <- cbind(df, cluster = km_c8$cluster)
# View the final data
final_data
