Ambo University: Institute of Technology
Department of Computer Science
Assignment of Selected Topic in Computer Science
Module Title: Selected Topic in Computer Science
Course Code: CoSc6005
Title: Explain the performance metrics used to evaluate clustering algorithms and classification algorithms, with examples.
Prepared By:
ID No
Contents
2. Classification Algorithm
I. Conclusion
2. Classification Algorithm
Classification is the method of applying a known structure to new data. To understand this, consider an example: an e-mail program may attempt to classify an incoming e-mail as genuine or as spam.
We have also compared four classification algorithms (J48, OneR, Naïve Bayes, Decision Table) on the basis of MAE, RAE, RRSE and RMSE.
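To make these error metrics concrete, here is a plain-Python sketch (illustrative only, not the WEKA implementation used in the comparison) of how MAE, RMSE, RAE and RRSE are computed from lists of actual and predicted values:

```python
import math

def mae(actual, predicted):
    # Mean Absolute Error: average magnitude of the prediction errors.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root Mean Squared Error: like MAE, but penalizes large errors more.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def rae(actual, predicted):
    # Relative Absolute Error: total absolute error relative to the error of
    # the trivial predictor that always outputs the mean of the actual values.
    mean_a = sum(actual) / len(actual)
    return (sum(abs(a - p) for a, p in zip(actual, predicted))
            / sum(abs(a - mean_a) for a in actual))

def rrse(actual, predicted):
    # Root Relative Squared Error: squared-error analogue of RAE.
    mean_a = sum(actual) / len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / sum((a - mean_a) ** 2 for a in actual))
```

A value below 1 for RAE or RRSE means the model beats the always-predict-the-mean baseline.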
Clustering techniques are broadly divided into partitioning, hierarchical and density-based methods.
a) Partitioning algorithms: learn clusters directly by dividing the data points into a fixed number of groups.
b) Hierarchical clustering: builds clusters gradually and is less sensitive to noise.
c) Density-based algorithms: identify clusters as areas highly populated with data. These algorithms are less sensitive to outliers and can discover clusters of irregular shapes.
A. Simple K-means
The K-means algorithm was first proposed by Stuart Lloyd as a technique for pulse-code modulation in 1957 [6]. It is a classical and well-known clustering algorithm, and the most commonly used partitioning algorithm because it is easy to implement and efficient in execution time. Its time complexity is O(tKn), where n is the number of data points, K is the number of clusters and t is the number of iterations. It partitions the data points into K non-overlapping clusters by finding K centroids (center points) and then assigning each point to the cluster associated with its nearest centroid.
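The assignment/update loop described above can be sketched in a few lines of plain Python (an illustrative toy version, not an optimized implementation; a real run would add a convergence check instead of a fixed iteration count):

```python
import math
import random

def kmeans(points, k, iterations=100):
    # Choose k distinct data points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return centroids, clusters

random.seed(42)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

Each of the t iterations touches all n points for each of the K centroids, which is where the O(tKn) complexity comes from.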
B. DBSCAN
DBSCAN was proposed by Martin Ester et al. in 1996 and is one of the most common clustering algorithms [8]. It is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of the corresponding points. Like linkage-based clustering, it connects points within certain distance thresholds; however, it only connects points that satisfy a density criterion (a minimum number of objects within a given radius). A cluster of arbitrary shape is formed, consisting of all density-connected objects. DBSCAN separates data points into three classes: core points, border points and noise points.
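A minimal pure-Python sketch of this procedure (illustrative only; the radius and density threshold follow the Eps/MinPts notation used later in the text, and noise is labelled -1):

```python
import math

def dbscan(points, eps, min_pts):
    # labels[i]: None = unvisited, -1 = noise, otherwise a cluster id.
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within radius eps of point i (including i).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # not a core point: provisionally noise
            continue
        cluster += 1                  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:    # j is also a core point: keep expanding
                queue.extend(js)
    return labels
```

For example, with four points in a tight square and one far away, the square forms a single density-connected cluster and the distant point is labelled noise.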
C. Hierarchical Clustering
The hierarchical clustering algorithm (HCA), also called connectivity-based clustering, is based on the core idea that objects are more related to nearby objects than to objects farther away. It is a method of cluster analysis that seeks to build a hierarchy of clusters, usually presented as a dendrogram. It is generally classified into agglomerative and divisive methods, depending on how the hierarchy is formed. In the agglomerative (bottom-up) approach each point starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy, while in the divisive (top-down) approach all points start in one cluster and splits are performed recursively as one moves down the hierarchy; the divisive approach has complexity O(2^n), which is worse.
These algorithms join objects into clusters by measuring their distance. They do not provide a single partitioning of the dataset, but rather an extensive hierarchy of clusters that merge with each other at certain distances.
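The agglomerative (bottom-up) variant can be sketched as a toy single-linkage implementation (illustrative only; the naive O(n^3) pairwise search below stands in for the optimized algorithms used in practice):

```python
import math

def agglomerative(points, target_k):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the pair of clusters whose closest members are nearest
        # (single linkage), then merge them.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Recording the distance at which each merge happens yields exactly the dendrogram described above.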
D. Make Density Based Clustering Algorithm
The make-density-based clustering algorithm wraps another clustering algorithm internally and returns both a distribution and a density. It is very helpful when clusters are uneven, and it can also be used when the data contains noise or outliers. The algorithm tries to find clusters according to the density of data points in a region: the main idea is that, for each cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of instances (MinPts). Points of the same density lying within the same area are connected to form clusters. In this way we obtain separate low-density regions (sets of points separated by low density) and high-density regions (sets of points separated by high density); the high-density regions are tighter than the low-density ones.
The advantages, disadvantages and applications of the basic clustering algorithms are summarized below.

1. Hierarchical algorithms
   Advantages: can be used for problems which involve point linkages; help to discover clusters of different sizes.
   Disadvantages: very expensive for huge datasets; high sensitivity to the setting of input parameters.
   Applications: wireless sensor networks; scientific literature; geostatic data.
5. Sample Example
The above-mentioned algorithms have also been compared in terms of other evaluation metrics such as accuracy, precision and F1-measure, which are calculated using the formulas given below:
Accuracy - the most widely used performance measure; it is calculated as the ratio of the number of correctly predicted observations to the total number of observations.
Precision - the ratio of the number of correctly predicted positive observations to the total number of predicted positive observations.
Recall (Sensitivity) - the ratio of the number of correctly predicted positive observations to all observations in the actual positive class.
The above-mentioned algorithms were also compared using these three metrics, with the values obtained from the confusion matrix; for example,

Accuracy = (TP + TN) / (TP + FP + FN + TN)

The table below shows the obtained results.
F1 score - the harmonic mean of precision and recall; this score therefore takes both false positives and false negatives into account. It is also called the F score or the F measure.
F-measure = 2 * (precision * recall) / (precision + recall)
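These four formulas can be collected into one small helper; a plain-Python sketch (illustrative, operating directly on the four confusion-matrix counts; the example counts are hypothetical):

```python
def classification_metrics(tp, fp, fn, tn):
    # Derive the four standard metrics from a binary confusion matrix.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted positives, how many were right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives,
# 2 false negatives, 8 true negatives.
print(tuple(round(v, 6) for v in classification_metrics(8, 2, 2, 8)))
```

Because F1 is the harmonic mean of precision and recall, it is dragged down by whichever of the two is worse.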
S.No  Clustering Algorithm            Accuracy  Precision  Recall  F-Measure
1     Expectation Maximization (EM)   0.7623    0.7654     0.5876  0.6648
2     CLOPE                           0.7815    0.6872     0.5847  0.6318
3     DBSCAN                          0.9675    0.9132     0.8287  0.8689
4     Filtered cluster                0.6754    0.6156     0.5843  0.5995
5     Farthest first                  0.7532    0.7245     0.6789  0.7010
6     COBWEB                          0.8765    0.8567     0.7756  0.8141
7     K-means clustering              0.9134    0.9072     0.7862  0.8424
8     CLARA                           0.9078    0.8967     0.7432  0.8128
The obtained results show that DBSCAN performs well in terms of all the metrics compared with the other algorithms. The clustering algorithms were also compared on the basis of time taken and number of clusters formed over the student dataset. For the given dataset, the EM algorithm took the most time to perform clustering, whereas the farthest-first algorithm took the least. In terms of clustered instances, the DBSCAN algorithm formed more clusters, whereas the farthest-first and filtered-cluster algorithms formed fewer. So, according to time taken, the farthest-first algorithm is preferred over the other algorithms, and according to clustered instances, the DBSCAN algorithm is preferred.
The algorithms were also tested in terms of accuracy, precision, recall and F-measure. DBSCAN gives higher accuracy and F-measure values than all the other algorithms. The chart above shows the performance of the various clustering algorithms; the results show that DBSCAN has the highest performance in terms of all the metrics. Hence it can be concluded that DBSCAN outperforms all the other algorithms on our student performance analysis dataset.
8. Dataset Description for Example
The sample dataset used for comparing classification algorithms is “diabetes diagnosis data” available
in csv format. This dataset consists of 768 instances and 9 attributes. Table shown below gives the
description of diabetes diagnosis dataset.
PG Concentration  Plasma glucose concentration at 2 hours in an oral glucose tolerance test (Numeric)
Diagnosis Class variable (0 or 1) (Sick/ Healthy)
Table 3 below shows the experimental results obtained while comparing the clustering algorithms. Table 4 shows the six parameters used for evaluating the accuracy of the algorithms: Kappa statistic, TP rate, precision, recall, F-measure and ROC area.
The figures below show the graphical representation of the results for the comparison of the different clustering algorithms on the basis of accuracy rate, time taken and cluster distribution.
Figure 1. Comparison of Accuracy of K-means, Hierarchical and Density Based clustering
Table 5 shows the four basic error rate parameters for the evaluation of four classification algorithms.
Figure 5. Graphical Representation of Error rate Evaluation
I. Conclusion
II. References
1. Bhoopender Singh, Gaurav Dubey, "A comparative analysis of different data mining using WEKA", International Journal of Innovative Research and Studies, ISSN 2319-9725, Volume 2, Issue 5, pp. 380-391, May 2013.
2. Naveeta Mehta, Shilpa Dang, "A Review of Clustering Techniques in various Applications for Effective Data Mining", International Journal of Research in IT & Management, ISSN 2231-4334, Volume 1, Issue 2, pp. 50-66, June 2011.
3. Rui Xu, Donald Wunsch II, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Volume 16, No. 3, pp. 645-678, May 2005.
4. Ranjini K, Rajalingam N, "Performance Analysis of Hierarchical Clustering Algorithm", International Journal of Advanced Networking and Applications, ISSN 1006-1011, 2011.
5. UCI Machine Learning Repository, archive.ics.uci.edu/ml.