KNN Model-Based Approach in Classification
Gongde Guo¹, Hui Wang¹, David Bell², Yaxin Bi², and Kieran Greer¹

¹ School of Computing and Mathematics, University of Ulster,
  Newtownabbey, BT37 0QB, Northern Ireland, UK
  {G.Guo,H.Wang,Krc.Greer}@ulst.ac.uk
² School of Computer Science, Queen's University Belfast,
  Belfast, BT7 1NN, UK
  {DA.Bell,Y.Bi}@qub.ac.uk
1 Introduction
kNN has a high cost of classifying new instances. This is because nearly all of the computation takes place at classification time rather than when the training examples are first encountered. Although kNN has been applied to text categorization since the early days of research in this area [3] and has been shown to be one of the most effective methods on the Reuters corpus of newswire stories, a benchmark corpus in text categorization, its inefficiency as a lazy learning method with no pre-built model prevents it from being applied in areas where dynamic classification over a large repository is needed. There are techniques [10, 11], such as indexing the training examples, that significantly reduce the computation required at query time, but they are beyond the scope of this paper.
In this paper we attempt to solve these problems by presenting a kNN-type classification method called kNNModel. The method constructs a model from the training data and classifies new data using that model. The model is a set of representatives of the training data, which can be viewed as regions in the data space.
The rest of the paper is organised as follows. Section 2 discusses related research in this area. Section 3 introduces the basic idea of the proposed modelling and classification algorithm; the modelling and classification processes are illustrated by an example with the help of some graphs. The experimental results are described and discussed in Section 4. Section 5 concludes the paper with a discussion of open problems and directions for further research.
2 Related Work
This work follows on from our previous research on data reduction (DR) [4]. The advantage of DR is that both raw data and reduced data can be represented by hyper relations. The collection of hyper relations can be made into a complete Boolean algebra in a natural way, so for any collection of hyper tuples its unique least upper bound (lub) can be found as a reduction. Experimental results show that DR achieves a relatively high reduction rate whilst preserving classification accuracy. However, it is relatively slow in its basic form of model construction, since much time is spent trying possible merges.
As the k-nearest-neighbours classifier requires storing the whole training set, which may be too costly when this set is large, many researchers have attempted to remove redundancy from the training set to alleviate this problem [5, 6, 7, 8]. Hart [5] proposed a computationally simple local search method, the Condensed Nearest Neighbour (CNN) rule, which minimises the number of stored patterns by storing only a subset of the training set for classification. The basic idea is that patterns in the training set may be very similar, and some add no extra information and thus may be discarded. Gates [6] proposed the Reduced Nearest Neighbour (RNN) rule, which aims to further reduce the stored subset after CNN has been applied; it simply removes from the subset those elements whose removal does not cause an error. Alpaydin [7] investigated voting schemes over multiple learners to improve classification accuracy, and Kubat et al. [8] proposed an approach that selects three very small groups of examples such that, when used as 1-NN subclassifiers, each tends to err in a different part of the instance space; simple voting then corrects many failures of the individual subclassifiers. Experimental results for these methods on some public datasets are reported in [9].
The proposed kNN model-based approach differs from DR and from the other condensed nearest neighbour methods. It constructs a model by finding a set of representatives, augmented with some extra information, from the training data based on a similarity principle. The created representatives can be seen as regions in the data space and are then used for classification.
3 Modelling and Classification Algorithm
kNN is a case-based learning method, which keeps all the training data for classification. Being a lazy learning method makes it unsuitable for many applications, such as dynamic web mining over a large repository. One way to improve its efficiency is to find a set of representatives that stand for the whole training data in classification, viz. to build an inductive learning model from the training dataset and use this model (the representatives) for classification. Many existing algorithms, such as decision trees or neural networks, were designed to build such models, and one standard for evaluating different algorithms is their performance. As kNN is a simple but effective method for classification, and has been shown to be one of the most effective methods on the Reuters corpus of newswire stories in text categorization, this motivates us to build a model for kNN that improves its efficiency whilst preserving its classification accuracy.
As shown in Fig. 1, a training dataset of 36 data points belonging to two classes, {square, circle}, is distributed in a 2-dimensional data space.
Fig. 1. The distribution of data points.
Fig. 2. The first obtained representative, centred at di, with Sim(di) and Num(di)=9.
If we use Euclidean distance as our similarity measure, it is clear that in many local regions the data points with the same class label lie close to each other. In each such local region, the central data point di (see Fig. 2 for an example), together with some extra information, namely Num(di), the number of data points inside the local region, and Sim(di), the similarity of the most distant data point inside the local region to di, might be an ideal representative of this local region. If we take these representatives as a model of the whole training dataset, the number of data points involved in classification is significantly reduced, thereby improving efficiency. Obviously, if a new data point is covered by a representative it will be classified with the class label of this representative. If not, we calculate the distance of the new data point to each representative's nearest boundary, treat each representative's nearest boundary as a data point, and then classify the new data point in the spirit of kNN.
In the model construction process, each data point has its largest local neighbourhood, which covers the maximal number of data points with the same class label. Based on these local neighbourhoods, the largest of them overall (called the largest global neighbourhood) is obtained in each cycle. This largest global neighbourhood can be seen as a representative standing for all the data points it covers. For data points not covered by any representative, we repeat the above operation until all the data points have been covered by chosen representatives. Obviously, we need not choose a specific k for our method in the model construction process: the number of data points covered by a representative can be seen as an optimal k, but it differs from representative to representative and is generated automatically during model construction. Further, using the list of chosen representatives as a model for classification not only reduces the amount of data involved in classification but also significantly improves efficiency. From this point of view, our proposed method overcomes the two shortcomings inherent in the kNN method.
Let D be a collection of n class-labelled data tuples {d1, d2, …, dn}. A tuple di ∈ D could, for example, be a document represented as a vector di = <wi1, wi2, …, wim>, where wij is a normalised TF-IDF weight as used in text categorization. For generality, from now on we use the term 'data tuple' for all kinds of data in different applications, so as not to limit our algorithm to specific applications. Likewise, the term 'similarity measure' can stand for any similarity measure, such as Euclidean distance or cosine similarity, provided it is suitable for the application at hand. For simplicity, from now on we use Euclidean distance as the default similarity measure when describing the following algorithms.
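As a small illustration of these conventions, the following Python snippet (not taken from the paper; the vectors and values are made up) represents two data tuples as weight vectors and computes both of the similarity measures just mentioned.

```python
import numpy as np

# Two hypothetical data tuples d_i = <w_i1, ..., w_im>, e.g. normalised TF-IDF weights.
d_i = np.array([0.12, 0.00, 0.58, 0.80])
d_j = np.array([0.10, 0.33, 0.49, 0.80])

euclidean_distance = np.linalg.norm(d_i - d_j)   # default measure used in the algorithms below
cosine_similarity = d_i @ d_j / (np.linalg.norm(d_i) * np.linalg.norm(d_j))
print(euclidean_distance, cosine_similarity)
```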
The detailed model construction algorithm is described as follows:
(1) Select a similarity measure and create a similarity matrix from the given
training dataset.
(2) Set to ‘ungrouped’ the tag of all data tuples.
(3) For each ‘ungrouped’ data tuple, find its largest local neighbourhood which
covers the largest number of neighbours with the same category.
(4) Among all the local neighbourhoods, find the data tuple di whose neighbourhood Ni is the largest global neighbourhood, add a representative <Cls(di), Sim(di), Num(di), Rep(di)> to M to represent all the data tuples covered by Ni, and then set to 'grouped' the tag of all the data tuples covered by Ni.
(5) Repeat step 3 and step 4 until all the data tuples in the training dataset have
been set to ‘grouped’.
(6) Model M consists of all the representatives collected from the above learning
process.
In the above algorithm, M denotes the created model. The fields of a representative <Cls(di), Sim(di), Num(di), Rep(di)> are, respectively, the class label of di; the lowest similarity to di among the data tuples covered by Ni; the number of data tuples covered by Ni; and a representation of di itself. In step (4), if more than one neighbourhood has the same maximal number of neighbours, we choose the one with the minimal value of Sim(di), viz. the one with the highest density, as the representative.
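The following Python sketch shows one possible reading of steps (1)–(5), including the tie-break on Sim(di). It is an illustration rather than the authors' implementation; details such as skipping already 'grouped' tuples when growing a neighbourhood are assumptions.

```python
import numpy as np

def build_knn_model(X, y):
    """Sketch of kNNModel construction with error-tolerant degree r = 0.

    X: (n, m) array of data tuples; y: (n,) array of class labels.
    Returns a list of representatives <Cls(di), Sim(di), Num(di), Rep(di)>.
    """
    n = len(X)
    # Step (1): similarity matrix (here: pairwise Euclidean distances).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    grouped = np.zeros(n, dtype=bool)               # Step (2): all tuples start 'ungrouped'.
    model = []

    def largest_local_neighbourhood(i):
        """Grow a ball around X[i] covering only same-class, still ungrouped tuples."""
        covered, radius = [], 0.0
        for j in np.argsort(dist[i]):
            if grouped[j]:
                continue                            # assumption: grouped tuples are ignored
            if y[j] != y[i]:
                break                               # stop before a tuple of another class
            covered.append(j)
            radius = dist[i, j]                     # Sim(di): distance to farthest covered tuple
        return covered, radius

    while not grouped.all():                        # Step (5): repeat until all tuples grouped.
        best_key, best = None, None
        for i in np.flatnonzero(~grouped):          # Step (3): per-tuple largest neighbourhood.
            covered, radius = largest_local_neighbourhood(i)
            key = (len(covered), -radius)           # Step (4): most tuples, then smallest Sim.
            if best_key is None or key > best_key:
                best_key, best = key, (i, covered, radius)
        i, covered, radius = best
        model.append((y[i], radius, len(covered), X[i]))   # <Cls, Sim, Num, Rep>
        grouped[covered] = True
    return model
```

Building the full pairwise distance matrix mirrors step (1) but costs O(n²) memory; a production implementation would likely compute neighbourhoods more lazily.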
The classification algorithm is described as follows:
(1) For a new data tuple dt to be classified, calculate its similarity to all
representatives in the model M.
(2) If dt is covered by only one representative <Cls(dj), Sim(dj), Num(dj), Rep(dj)>, viz. the Euclidean distance from dt to dj is smaller than Sim(dj), dt is classified as the category of dj.
(3) If dt is covered by at least two representatives with different categories, classify dt as the category of the representative with the largest Num(dj), viz. the neighbourhood covering the largest number of data tuples in the training dataset.
(4) If no representative in the model M covers dt, classify dt as the category of the representative whose boundary is closest to dt.
The Euclidean distance from dt to a representative di's nearest boundary equals the Euclidean distance from di to dt minus Sim(di).
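A minimal sketch of classification steps (1)–(4), assuming representatives in the 4-tuple form produced by the construction sketch above (again illustrative, not the authors' code):

```python
import numpy as np

def classify(model, d_t):
    """Classify a new tuple d_t using a list of (cls, sim, num, rep) representatives."""
    covering, boundary = [], []
    for cls, sim, num, rep in model:
        d = np.linalg.norm(d_t - rep)       # Step (1): distance to each representative.
        if d <= sim:
            covering.append((num, cls))     # Steps (2)-(3): d_t falls inside this region.
        boundary.append((d - sim, cls))     # Distance to this representative's nearest boundary.
    if covering:
        return max(covering)[1]             # Covered: representative with the largest Num wins.
    return min(boundary)[1]                 # Step (4): otherwise, the closest boundary wins.
```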
To improve the classification accuracy of kNNModel, we implemented two different pruning methods. The first removes from the model M those representatives that cover only a few data tuples, removes the data tuples covered by these representatives from the training dataset, and then constructs the model again from the revised training dataset. The second modifies step 3 of the model construction algorithm to allow each largest local neighbourhood to cover up to r (called the error-tolerant degree) data tuples whose category differs from the majority category in the neighbourhood. This modification integrates the pruning work into the model construction process itself. Experimental results for both methods are reported in the next section.
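A sketch of the first pruning method is given below; it assumes a construction routine of the same shape as the build_knn_model sketch above is passed in, and the threshold min_num=2 simply mirrors the Num(di)<2 example used later in this section. All names are illustrative.

```python
import numpy as np

def prune_and_rebuild(X, y, build_model, min_num=2):
    """First pruning method: drop representatives covering fewer than min_num tuples,
    remove the training tuples they cover, and rebuild the model from what remains.

    X: (n, m) array of data tuples; y: (n,) array of labels;
    build_model: any routine returning (cls, sim, num, rep) representatives.
    """
    keep = np.ones(len(X), dtype=bool)
    for cls, sim, num, rep in build_model(X, y):
        if num < min_num:
            # Same-class tuples within Sim(di) of this small representative are removed.
            inside = (np.linalg.norm(X - rep, axis=1) <= sim) & (y == cls)
            keep &= ~inside
    return build_model(X[keep], y[keep])
```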
The best way to grasp the idea is by means of an example, so we illustrate the model construction and classification process graphically.
A training dataset of 36 data tuples is divided into two classes, denoted as square and circle; the distribution of the data tuples in a 2-dimensional data space is shown in Fig. 3.
In Fig. 4, the fine-line circle covers 9 data tuples (Num(di)=9) with the same class label as di, namely circle (in this example we use the first pruning method, viz. we assign 0 to r). This representative covers the maximal number of neighbours with the same class label in the first cycle. Sim(di) is the Euclidean distance from di to the most distant data tuple from di in Ni.
After the first cycle, we obtain the first representative <Cls(di), Sim(di), Num(di), Rep(di)>, add it to the model M, and then move on to the next cycle. At the end of the second cycle we add another representative <Cls(dj), Sim(dj), Num(dj), Rep(dj)> to the model M, as shown in Fig. 5. This process is repeated until all the data tuples in the training dataset have been set to 'grouped' (represented by an empty circle or square).
Fig. 3. The distribution of data tuples.
Fig. 4. The first obtained representative, centred at di, with Sim(di) and Num(di)=9.
Fig. 5. The second obtained representative.
Fig. 6. The model before pruning.
Fig. 7. The model after pruning.
Fig. 8. The distribution of test data tuples.
At the end, the ten representatives shown in Fig. 6 are obtained from the training dataset and stored in the model M. Seven of the ten representatives cover more than 2 data tuples each and are drawn as fine-line circles; the other three representatives each cover only one data tuple and are drawn as bold-line circles.
In this situation, pruning can be done by removing from the model M the representatives that cover only a few data tuples (for example, Num(di)<2). All the data tuples covered by these representatives are removed from the training dataset as well. After that, we construct the model again from the revised training dataset. After pruning and model reconstruction, we obtain the final model M; see Fig. 7 for a graphical illustration.
In Fig. 8, the four triangles represent test data tuples. According to the classification algorithm described above, these four test data tuples are classified, from left to right, as circle, square, circle, and square respectively.
If we use the second pruning method and assign 1 to r, the model construction process is illustrated in the following figures:
Fig. 11. The third representative.
Fig. 12. The final model.
4 Experimental Results
Experiments using 5-fold cross-validation have been carried out to evaluate the prediction accuracy of kNNModel and to compare the results with C5.0 and kNN as benchmarks. The C5.0 used is the implementation in the Clementine software package.
Six public datasets were chosen from the UCI machine learning repository. Some information about these datasets is listed in Table 1, where the column headings are as follows: NA, number of attributes; NN, number of nominal attributes; NO, number of ordinal attributes; NB, number of binary attributes; NE, number of examples; CD, class distribution.
The testing accuracies of C5.0, kNN, and kNNModel under 5-fold cross-validation are compared in Table 2, and the data reduction rate in the final model of kNNModel is listed in Table 3. As Euclidean distance is used as the similarity measure for kNN and kNNModel, the six datasets were pre-processed (normalisation and feature selection) before classification. In the experiments we assign 1 to r and use information gain as the feature selection measure.
Table 1. Some information about the six datasets

Dataset    NA  NN  NO  NB  NE   CD
Glass       9   0   9   0  214  70:17:76:0:13:9:29
Iris        4   0   4   0  150  50:50:50
Heart      13   3   7   3  270  120:150
Wine       13   0  13   0  178  59:71:48
Diabetes    8   0   8   0  768  268:500
Aust       14   4   6   4  690  383:307
Table 3. The number of representatives and the average reduction rate in the final model
Note that in Table 2 and Table 3, N>i means that each representative in the final model of kNNModel covers at least i+1 data tuples of the training dataset. N is not an essential parameter and can be removed from the kNNModel algorithm by the pruning process; the experimental results for different values of N are listed here to demonstrate the relationship between classification accuracy and reduction rate for the kNNModel algorithm.
We also carried out experiments assigning different values to r and N in order to find the best classification accuracy for kNNModel and to see the influence of r and N. With N=1, the influence of different values of r (0 to 15) on the classification accuracy of kNNModel when tested on the Aust and Diabetes datasets is shown in Figure 13 and Figure 14.
The experimental results of kNNModel without N on the six datasets are listed in Table 5, where CA means classification accuracy and RR means reduction rate. For kNN, the CA is the average classification accuracy over k = 1, 3, 5.
From the experimental results, it is clear that the average classification accuracy of our proposed kNNModel method on the six datasets is better than that of C5.0 under 5-fold cross-validation and is comparable to that of kNN. Moreover, kNNModel significantly improves on the efficiency of kNN by keeping only a few representatives for classification: the experimental results show that the average reduction rate is 90.41%.
5 Conclusions
In this paper we have presented a novel solution for dealing with the shortcomings of kNN. To overcome the problems of low efficiency and dependency on k, we select a few representatives from the training dataset, with some extra information, to represent the whole training dataset. In selecting each representative we use an optimal k, different for each representative and determined by the dataset itself, which eliminates the dependency on k without user intervention. Experimental results on six public datasets show that kNNModel is a quite competitive method for classification: its average classification accuracy is comparable with that of C5.0 and kNN, and kNNModel significantly reduces the number of data tuples kept in the final model for classification, with a 90.41% reduction rate on average. It could be a good replacement for kNN in many applications such as dynamic web mining over a large repository.
Further research is required into how to improve the classification accuracy of marginal data which fall outside the regions of the representatives.
References
1. D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. The MIT Press (2001)
2. H. Wang: Nearest Neighbours without k: A Classification Formalism Based on Probability. Technical report, Faculty of Informatics, University of Ulster, N. Ireland, UK (2002)
3. F. Sebastiani: Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1 (2002) 1–47
4. H. Wang, I. Düntsch, D. Bell: Data Reduction Based on Hyper Relations. In: Proceedings of KDD-98, New York (1998) 349–353
5. P. Hart: The Condensed Nearest Neighbour Rule. IEEE Transactions on Information Theory 14 (1968) 515–516
6. G. Gates: The Reduced Nearest Neighbour Rule. IEEE Transactions on Information Theory 18 (1972) 431–433
7. E. Alpaydin: Voting over Multiple Condensed Nearest Neighbors. Artificial Intelligence Review 11 (1997) 115–132
8. M. Kubat, M. Cooperson Jr.: Voting Nearest-Neighbour Subclassifiers. In: Proceedings of the 17th International Conference on Machine Learning (ICML-2000), Stanford, CA (2000) 503–510
9. D. R. Wilson, T. R. Martinez: Reduction Techniques for Exemplar-Based Learning Algorithms. Machine Learning 38(3) (2000) 257–286
10. T. Mitchell: Machine Learning. MIT Press and McGraw-Hill (1997)
11. C. M. Bishop: Neural Networks for Pattern Recognition. Oxford University Press, UK (1995)