
AMBO UNIVERSITY

INSTITUTE OF TECHNOLOGY
Department of Computer Science
Assignment for Selected Topic in Computer Science
Module Title: Selected Topic in Computer Science
Course Code: CoSc6005
Title: Explain the performance metrics used to evaluate clustering algorithms and
classification algorithms, with examples.

Prepared By: Abay Mulisa (ID No: MET/201/12)

Submitted to: Dr. Vel

Date: 10/5/2020 G.C.


Table of Contents

1. Clustering Algorithm
2. Classification Algorithm
3. Clustering Techniques and their Functions
4. Advantages, Disadvantages and Applications of Clustering Algorithms
5. Sample Example of both Algorithms
6. Evaluation Metrics of Algorithms
7. Comparison of Clustering Algorithms
8. Dataset Description for Example
I. Conclusion
II. References


1. Clustering Algorithm

• Clustering is an unsupervised classification mechanism in which a set of patterns (data), usually multidimensional, is divided into groups (clusters) such that members of the same group are similar according to a predefined criterion. Clustering is a separation of data into groups of related objects. Each group, called a cluster, consists of data that are similar (homogeneous) to each other and dissimilar (heterogeneous) to the data of other groups. Clustering of a set forms a partition of its elements chosen to minimize some measure of dissimilarity between members of the same cluster. It is especially helpful for organizing documents for retrieval and for supporting browsing.
• Cluster analysis is an important technology in text mining. It is an iterative process of information discovery, or of interactive multi-objective optimization, that involves trial and error. It divides a dataset into several meaningful clusters that reflect the dataset's natural structure. Commonly used clustering algorithms include Simple K-means, DBSCAN and Hierarchical clustering. A clustering algorithm partitions a dataset into several groups such that the similarity within a group is larger than that among groups. Clustering algorithms are useful in many fields, such as spatial data analysis, earthquake study, image processing, data mining, learning theory and pattern recognition.
• Clustering is the method of exploring the data in groups and structures that are similar in some manner, without relying on data with known structures.
• In this work we compare three clustering algorithms (K-means, DBSCAN and Hierarchical clustering) on the basis of the number of clusters, clustered instances, accuracy and time taken to build the model.

2. Classification Algorithm

• Classification is a classic data mining technique based on machine learning. Basically, classification is used to assign each item in a dataset to one of a predefined set of classes or groups. In classification, we build software that can learn how to assign data items to groups. Clustering, by contrast, is a data mining technique that forms meaningful or useful clusters of objects with similar characteristics using an automatic technique; each group, called a cluster, consists of objects that are similar to one another and dissimilar to the objects of other groups.
• Classification is the method of applying known structure to new data.
• To illustrate, an e-mail program may attempt to classify an e-mail as genuine or as spam.
• We also compare four classification algorithms (J48, OneR, Naïve Bayes and Decision Table) on the basis of MAE, RAE, RRSE and RMSE.

3. Clustering Techniques and their Functions

Clustering techniques are broadly divided into partitioning, hierarchical and density-based methods.

a) Partitioning algorithms: identify clusters as areas highly populated with data and learn the clusters directly.

b) Hierarchical clustering: builds clusters gradually and is less sensitive to noise.

c) Density-based clustering algorithms: discover dense connected components of data, which may have flexible shapes. These algorithms are less sensitive to outliers and can discover clusters of irregular shapes.

A. Simple K-means

The K-means algorithm was first proposed by Stuart Lloyd in 1957, as a technique for pulse-code modulation [6]. It is a classical and well-known clustering algorithm, and the most commonly used partitional clustering algorithm because it is easy to implement and efficient in terms of execution time. Its time complexity is O(tKn), where n is the number of data points, K is the number of clusters and t is the number of iterations. It partitions the data points into K non-overlapping clusters by finding K centroids (center points) and assigning each point to the cluster associated with its nearest centroid.
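
As an illustration, the sketch below runs K-means with scikit-learn on synthetic data. The feature matrix, the choice of K = 3 and the random seed are assumptions for the example, not values taken from the experiments reported here.

```python
# Minimal K-means sketch (scikit-learn); data and K=3 are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=42)
X = rng.normal(size=(300, 2))            # 300 two-dimensional points

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)           # assign each point to its nearest centroid

print(kmeans.cluster_centers_)           # the K centroids
print(kmeans.inertia_)                   # sum of squared distances to the centroids
print(labels[:10])                       # cluster index of the first ten points
```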

B. DBSCAN

DBSCAN was proposed by Martin Ester et al. in 1996 and is one of the most common clustering algorithms [8]. It is a density-based clustering algorithm: it finds a number of clusters starting from the estimated density distribution of the corresponding nodes. The algorithm connects points within certain distance thresholds, similar to linkage-based clustering, but it only connects points that satisfy a density criterion (a minimum number of objects within a radius). An arbitrarily shaped cluster is formed, consisting of all density-connected objects. DBSCAN separates data points into three classes:

• Hub points: points in the interior (center) of a cluster.
• Edge points: points that fall within the neighborhood of a hub point but are not hub points themselves.
• Noise points: any point that is neither a hub point nor an edge point.

To find a cluster, DBSCAN starts with an arbitrary instance p in the dataset D and retrieves all instances of D with respect to epsilon (Eps) and minimum points (minPts). minPts is the minimum number of points required to exist in a neighborhood for it to be declared a cluster, and Eps is the radius of the neighborhood of a point under a distance metric (Euclidean, Manhattan or Minkowski). The algorithm uses a spatial data structure to locate points within Eps distance of the core points of the clusters.
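
A minimal DBSCAN sketch with scikit-learn follows; the two-moons data and the Eps/minPts values are illustrative assumptions. Points labelled -1 correspond to the noise points described above, and core points correspond to the "hub" points.

```python
# Hedged DBSCAN sketch (scikit-learn); eps and min_samples are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)      # Eps radius and minPts threshold
labels = db.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))      # noise points get the label -1
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```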

C. Hierarchical Clustering Algorithm

The hierarchical clustering algorithm (HCA), also called connectivity-based clustering, is based on the core idea that objects are more related to nearby objects than to objects far away. It is a method of cluster analysis that seeks to build a hierarchy of clusters, and its result is usually presented in a dendrogram. It is generally classified into agglomerative and divisive methods, depending on how the hierarchies are formed.

• Agglomerative: a "bottom-up" approach. It starts by placing each object in its own cluster, then merges these small clusters into larger and larger clusters until all objects are in a single cluster or until a termination condition is satisfied. Its complexity is O(n³), which makes it too slow for large datasets.
• Divisive: a "top-down" approach. It starts with all objects in one cluster, then performs splits recursively as one moves down the hierarchy. Its complexity is O(2^n), which is even worse.

These algorithms join objects into clusters by measuring their distance. They do not provide a single partitioning of the dataset; instead, they provide an extensive hierarchy of clusters that merge with each other at particular distances.
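
The sketch below builds an agglomerative (bottom-up) hierarchy with SciPy and then cuts it into a fixed number of clusters; the synthetic data and the Ward linkage choice are assumptions for illustration, not choices from this assignment.

```python
# Agglomerative hierarchy sketch (SciPy); data and 'ward' linkage are assumed.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(50, 2))

Z = linkage(X, method="ward")                     # build the merge hierarchy
labels = fcluster(Z, t=3, criterion="maxclust")   # cut it into 3 flat clusters

# scipy.cluster.hierarchy.dendrogram(Z) would draw the hierarchy
# when a matplotlib backend is available.
print(labels[:10])
```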

D. Make Density Based Clustering Algorithm

The make density based clustering algorithm wraps another clustering algorithm internally and returns both a distribution and a density. It is very helpful when the clusters are uneven. The algorithm tries to find clusters according to the density of the data points in a region: the main idea is that, for each cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of instances (minPts). It can also be used when the data contain noise or outliers. Points of the same density that lie within the same respective areas are connected while forming clusters. In this way we obtain separate clusters for low-density regions (a set of points separated by low density) and high-density regions (a set of points separated by high density); the high-density regions are tighter than the low-density regions.

4. Advantages, Disadvantages and Applications of Clustering Algorithms

The advantages, disadvantages and applications of the basic clustering algorithm families are summarized below.

1) Hierarchical Algorithms

Advantages:
• Embedded flexibility regarding the level of granularity.
• Can be used for problems that involve point linkages.

Disadvantages:
• Inability to make corrections once the splitting/merging decision is made.
• Lack of interpretability regarding the cluster descriptors.
• Vagueness of the termination criterion.
• Very expensive for huge datasets.
• Severe effectiveness degradation in high-dimensional spaces due to the curse-of-dimensionality phenomenon.

Applications: pattern recognition, image segmentation, wireless sensor networks, city planning, spatial data analysis.

2) Density-Based Clustering Algorithms

Advantages:
• Help to discover clusters of different sizes.
• Resistance to noise and outliers.

Disadvantages:
• High sensitivity to the setting of the input parameters.
• Poor cluster descriptors.
• Unsuitable for high-dimensional datasets because of the curse-of-dimensionality phenomenon; density in high-dimensional spaces is ill-defined.

Applications: scientific literature, satellite images, X-ray crystallography, anomaly detection, geostatistics.

3) Partitioning Clustering Algorithms

Advantages:
• Relatively scalable and simple.
• Suitable for datasets with compact, spherical clusters that are well-separated.

Disadvantages:
• Reliance on the user to specify the number of clusters in advance.
• High sensitivity to the initialization phase, noise and outliers.
• Frequent entrapments into local optima.
• Inability to deal with non-convex clusters of varying size and density.
• Poor cluster descriptors.

Applications: computer vision, scientific literature, market segmentation, earthquake study.

4) Density Based Algorithm (make density based)

Advantages:
• Returns both a distribution and a density.
• Useful when the clusters are not normal, when the data has noise and when there are outliers in the data.
• Gives results close to the K-means algorithm.

Disadvantages:
• Sensitive to the clustering parameters minPts and Eps.
• Sampling affects the density measures.
• Datasets with altering densities are tricky.

Applications: land use, X-ray crystallography, geostatistics, earthquake study, satellite images.

5. Sample Example of both Algorithms

The table below depicts the number of attributes and instances of the dataset used, together with the results obtained by each clustering algorithm.
| S.No | Clustering Algorithm | Attributes | Instances | Clustered Instances | Time taken to build the cluster (sec) | Sum of squared errors | Iterations performed |
|------|----------------------|------------|-----------|---------------------|---------------------------------------|-----------------------|----------------------|
| 1 | Expectation maximization (EM), log likelihood = -9.7878 | 32 | 8124 | 14 | 8613.74 | 19 | 23 |
| 2 | CLOPE | 32 | 8124 | 23 | 6.27 | 25 | 20 |
| 3 | DBSCAN (Epsilon = 0.9, minPts = 6) | 32 | 8124 | 8124 | 112.24 | 12 | 15 |
| 4 | Filtered cluster | 32 | 8124 | 2 | 1.14 | 20 | 11 |
| 5 | Farthest first | 32 | 8124 | 2 | 0.06 | 17 | 11 |
| 6 | COBWEB (splits = 87, merges = 90) | 32 | 8124 | 172 | 0.92 | 16 | 14 |
| 7 | K-means clustering | 32 | 8124 | K as defined | 4.567 | 19 | 23 |
| 8 | CLARA | 32 | 8124 | 1200 | 0.75 | 18 | 20 |

6. Evaluation Metrics of Algorithms

The above-mentioned algorithms have also been compared in terms of other evaluation metrics, such as accuracy, precision and F1-measure, which are calculated using the formulas given below:

• Accuracy: the most widely used performance measure, calculated as the ratio of the number of correctly predicted observations to the total number of observations.
• Precision: the ratio of the number of correctly predicted positive observations to the total number of predicted positive observations.
• Recall (Sensitivity): the ratio of the number of correctly predicted positive observations to all observations in the actual class.

The algorithms were compared using these metrics, with the values obtained from the confusion matrix. The table below shows the results.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

• F1 score: the harmonic mean (a weighted average) of precision and recall, so it takes both false positives and false negatives into account. It is also called the F score or the F measure.

F-measure = 2 * (precision * recall) / (precision + recall)
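
The sketch below computes all four metrics from binary confusion-matrix counts; the TP/TN/FP/FN values are made up purely for illustration.

```python
# Compute accuracy, precision, recall and F1 from (hypothetical) counts.
tp, tn, fp, fn = 50, 35, 10, 5           # illustrative confusion-matrix counts

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```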

| S.No | Clustering Algorithm | Accuracy | Precision | Recall | F-Measure |
|------|----------------------|----------|-----------|--------|-----------|
| 1 | Expectation maximization (EM) | 0.7623 | 0.7654 | 0.5876 | 0.664818 |
| 2 | CLOPE | 0.7815 | 0.6872 | 0.5847 | 0.63182 |
| 3 | DBSCAN | 0.9675 | 0.9132 | 0.8287 | 0.8689 |
| 4 | Filtered cluster | 0.6754 | 0.6156 | 0.5843 | 0.599542 |
| 5 | Farthest first | 0.7532 | 0.7245 | 0.6789 | 0.700959 |
| 6 | COBWEB | 0.8765 | 0.8567 | 0.7756 | 0.814135 |
| 7 | K-means clustering | 0.9134 | 0.9072 | 0.7862 | 0.842377 |
| 8 | CLARA | 0.9078 | 0.8967 | 0.7432 | 0.812766 |

The obtained results show that DBSCAN performs well in terms of all the metrics compared with the other algorithms. The clustering algorithms were also compared on the basis of the time taken and the number of clusters formed over the student dataset. For the given dataset, the EM algorithm took the most time to perform clustering, whereas the farthest first algorithm took the least time. In terms of clustered instances, the DBSCAN algorithm formed the largest number of clusters, whereas the farthest first and filtered cluster algorithms formed the fewest. So, by time taken, the farthest first algorithm is preferred over the other algorithms, and by clustered instances, the DBSCAN algorithm is preferred.

7. Comparison of Clustering Algorithms

The algorithms were also tested in terms of accuracy, precision, recall and F-measure. The DBSCAN algorithm gives higher accuracy and F-measure values than all the other algorithms. The table above shows the performance of the various clustering algorithms, and the results show that DBSCAN has the highest performance on all the metrics. Hence it can be concluded that DBSCAN outperforms all the other algorithms on our student performance analysis dataset.

8. Dataset Description for Example

The sample dataset used for comparing the classification algorithms is the "diabetes diagnosis" dataset, available in CSV format. It consists of 768 instances and 9 attributes. The tables below describe the bank dataset (used for the clustering experiments) and the diabetes diagnosis dataset.

Table 1. Bank Dataset Description

| Attribute | Description |
|-----------|-------------|
| Id | A unique identification number |
| Age | Age of customer in years (numeric) |
| Sex | MALE / FEMALE |
| Region | inner_city / rural / suburban / town |
| Income | Income of customer (numeric) |
| Married | Is the customer married (YES/NO) |
| Children | Number of children (numeric) |
| Car | Does the customer own a car (YES/NO) |
| save_acct | Does the customer have a savings account (YES/NO) |
| current_acct | Does the customer have a current account (YES/NO) |
| Pep | Did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO) |
| Mortgage | Does the customer have a mortgage (YES/NO) |

Table 2. Diabetes Dataset Description

| Attribute | Description |
|-----------|-------------|
| Pregnancies | Number of times pregnant (numeric) |
| PG Concentration | Plasma glucose concentration at 2 hours in an oral glucose tolerance test (numeric) |
| Diastolic BP | Diastolic blood pressure (mm Hg) (numeric) |
| Tri Fold Thick | Triceps skin fold thickness (mm) (numeric) |
| Serum Ins | 2-hour serum insulin (mu U/ml) (numeric) |
| BMI | Body mass index (weight in kg / (height in m)^2) (numeric) |
| DP Function | Diabetes pedigree function (numeric) |
| Age | Age (years) (numeric) |
| Diagnosis | Class variable (0 or 1) (Sick / Healthy) |

Results of the bank dataset on different clustering algorithms

Table 3 shows the experimental results obtained while comparing the clustering algorithms. The figures below give a graphical representation of the results of the comparison of the different clustering algorithms on the basis of accuracy rate, time taken and cluster distribution.

[Figure 1. Comparison of the accuracy of K-means, Hierarchical and Density Based clustering]

[Figure 2. Time taken by the K-means, Hierarchical and Density Based clustering for the datasets]

[Figure 3. Cluster distribution]

Results of the diabetes dataset on different classification algorithms

Table 4 shows the six parameters used for evaluating the accuracy of the algorithms: Kappa statistic, TP rate, precision, recall, F-measure and ROC area.

[Table 4. Accuracy parameters for the diabetes dataset]

[Figure 4. Graphical representation of the accuracy parameters]

Table 5 shows the four basic error rate parameters for the evaluation of the four classification algorithms.

Table 5. Error Rate Evaluation Parameters for Diabetes Dataset

| Algorithm | MAE | RMSE | RAE | RRSE |
|-----------|-----|------|-----|------|
| J48 | 0.2383 | 0.3452 | 52.4339% | 72.4207% |
| Naïve Bayes | 0.2811 | 0.4133 | 61.8486% | 86.7082% |
| Decision Table | 0.3063 | 0.3800 | 67.3862% | 79.7336% |
| OneR | 0.2357 | 0.4855 | 51.8551% | 101.8515% |
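
As a sketch, the four error measures above can be computed from a vector of numeric predictions as follows; y_true and y_pred are made-up values, and RAE and RRSE are taken relative to a simple baseline that always predicts the mean of the actual values (the convention used when reporting these measures).

```python
# Sketch of MAE, RMSE, RAE and RRSE; y_true and y_pred are illustrative.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))                      # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))             # root mean squared error
rae = (np.sum(np.abs(y_true - y_pred)) /                    # relative to mean predictor
       np.sum(np.abs(y_true - y_true.mean())))
rrse = np.sqrt(np.sum((y_true - y_pred) ** 2) /             # root relative squared error
               np.sum((y_true - y_true.mean()) ** 2))

print(f"MAE={mae:.4f} RMSE={rmse:.4f} RAE={rae:.2%} RRSE={rrse:.2%}")
```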

[Figure 5. Graphical representation of the error rate evaluation]

I. Conclusion

This assignment makes a detailed survey of the performance metrics of clustering and classification algorithms and of their techniques, advantages, disadvantages and applications, and describes the different clustering techniques available in data mining. It also compares the performance of different clustering algorithms using different metrics, among which the DBSCAN algorithm performs well on all the measures; in future, our proposed algorithms will therefore be based on improving this algorithm to produce better results. A performance-based comparative study of clustering and classification algorithms is performed here on two different datasets, and the experimental results of the various algorithms are depicted in the form of tables and graphs. From Figure 1 and Figure 2 it is clear that DBSCAN is the best algorithm, as it takes less time (0.03 seconds) to build the model and gives higher accuracy than the other clustering algorithms. It is evident from Figure 4 and Figure 5 that the J48 classification algorithm gives the best performance compared to the other studied algorithms: J48 gives the highest accuracy rate and the minimum error rate. The Decision Table algorithm has the second-lowest error rate and also has good overall performance. As seen in the graph, OneR has a high error rate and poor performance compared to the other algorithms.

II. References

1. Bhoopender Singh and Gaurav Dubey, "A comparative analysis of different data mining techniques using WEKA", International Journal of Innovative Research and Studies, ISSN 2319-9725, Volume 2, Issue 5, pp. 380-391, May 2013.
2. Naveeta Mehta and Shilpa Dang, "A Review of Clustering Techniques in various Applications for Effective Data Mining", International Journal of Research in IT & Management, ISSN 2231-4334, Volume 1, Issue 2, pp. 50-66, June 2011.
3. Rui Xu and Donald Wunsch II, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Volume 16, No. 3, pp. 645-678, May 2005.
4. Ranjini K and Rajalinngum N, "Performance Analysis of Hierarchical Clustering Algorithm", International Journal of Advanced Networking and Applications, pp. 1006-1011, 2011.
5. UCI Machine Learning Repository, archive.ics.uci.edu/ml.

