Contents
1. Introduction
2. The Problem
3. Proposed Solution
4. The Experiment
5. Experimental Results
6. Conclusion
7. References
1. Introduction
Feature selection is the process of selecting a subset of relevant features for use in model construction. Data may contain features that are either redundant or irrelevant, and these can be removed without incurring much loss of information. Redundancy and irrelevance are two distinct notions, since a relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.
Feature selection techniques should be distinguished from feature extraction. Feature extraction
creates new features from functions of the original features, whereas feature selection returns a
subset of the features. Feature selection techniques are often used in domains where there are
many features and comparatively few samples (or data points). Archetypal cases for the
application of feature selection include the analysis of written texts and DNA microarray data,
where there are many thousands of features, and a few tens to hundreds of samples.
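To make the distinction concrete, the short sketch below (using NumPy, with a small illustrative data matrix that is not taken from this report) contrasts feature selection, which keeps a subset of the original columns, with feature extraction, which derives new features from the originals:

import numpy as np

# Toy data: 5 samples described by 4 original features (illustrative values only).
X = np.array([[1.0, 0.2, 3.1, 0.0],
              [0.9, 0.1, 2.8, 1.0],
              [1.1, 0.3, 3.0, 0.0],
              [0.2, 0.9, 0.5, 1.0],
              [0.1, 1.0, 0.4, 0.0]])

# Feature selection: keep a subset of the original columns unchanged.
selected = [0, 2]                    # indices of the retained features
X_selected = X[:, selected]          # still the original measurements

# Feature extraction: create new features as functions of the originals,
# here by projecting onto the first principal component.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_extracted = X_centered @ Vt[:1].T  # one new, derived feature per sample

print(X_selected.shape, X_extracted.shape)   # (5, 2) (5, 1)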
2. The Problem
Practical pattern classification and knowledge discovery problems require selection of a subset of
attributes or features (from a much larger set) to represent the patterns to be classified.
Many practical pattern classification tasks (e.g., medical diagnosis) require learning an
appropriate classification function that assigns a given input pattern (typically represented as
a vector of attribute or feature values) to one of a finite set of classes. The choice of features,
attributes, or measurements used to represent the patterns presented to a classifier affects:
The accuracy of the classification function that can be learned using an inductive learning algorithm (e.g., a decision tree induction algorithm or a neural network learning algorithm): The attributes used to describe the patterns implicitly define a pattern language. If the language is not expressive enough, it fails to capture the information that is necessary for classification; hence, regardless of the learning algorithm used, the accuracy of the learned classification function is limited by this lack of information.

The time needed for learning a sufficiently accurate classification function: For a given representation of the classification function, the attributes used to describe the patterns implicitly determine the search space that needs to be explored by the learning algorithm. An abundance of irrelevant attributes can unnecessarily increase the size of the search space, and hence the time needed for learning a sufficiently accurate classification function.

The number of examples needed for learning a sufficiently accurate classification function: All other things being equal, the larger the number of attributes used to describe the patterns in a domain of interest, the larger the number of examples needed to learn a classification function to a desired accuracy.

The cost of performing classification using the learned classification function: In many practical applications, e.g., medical diagnosis, patterns are described using observable symptoms as well as the results of diagnostic tests. Different diagnostic tests can have different costs as well as risks associated with them. For instance, an invasive exploratory surgery can be much more expensive and risky than, say, a blood test.
This presents us with a feature subset selection problem in the automated design of pattern
classifiers. The feature subset selection problem refers to the task of identifying and selecting a
useful subset of attributes, to be used to represent the patterns, from a larger set of often mutually
redundant, possibly irrelevant, attributes with different associated measurement costs and/or
risks. An example of such a scenario of significant practical interest is the task of
selecting a subset of clinical tests (each with a different financial cost, diagnostic value, and
associated risk) to be performed as part of a medical diagnosis task. Other examples of the feature
subset selection problem include large-scale data mining applications, power system control, and
so on.
3. Proposed Solution
This work applies a genetic algorithm approach to the multi-criteria optimization problem of feature subset
selection. Experiments demonstrate the feasibility of this approach for feature subset selection in
the automated design of neural networks for pattern classification and knowledge discovery.
Feature subset selection in the context of many practical problems (e.g., diagnosis) presents an
instance of a multi-criteria optimization problem. The multiple criteria to be optimized include
the accuracy of classification and the cost and risk associated with classification, which in turn
depend on the selection of the attributes used to describe the patterns. Evolutionary algorithms offer a
particularly attractive approach to multi-criteria optimization problems. This paper explores a
wrapper-based multi-criteria approach to feature subset selection using a genetic algorithm in
conjunction with a relatively fast inter-pattern distance-based neural network learning algorithm.
However, the general approach can be used with any inductive learning algorithm.
Neural networks, densely interconnected networks of relatively simple computing elements,
offer an attractive framework for the design of pattern classifiers for real-world, real-time pattern
classification tasks on account of their potential for parallelism and fault and noise tolerance. The
classification function realized by a neural network is determined by the functions computed by
the neurons, the connectivity of the network, and the parameters (weights) associated with the
connections. It is well known that multi-layer networks of non-linear computing elements (e.g.,
threshold neurons) can realize any classification function. While evolutionary algorithms are
generally quite effective for rapid global search of large search spaces in multi-modal
optimization problems, neural networks offer a particularly attractive approach to fine-tuning
solutions once promising regions in the search space have been identified. Against this
background, genetic algorithms offer an attractive approach to feature subset selection for neural
network pattern classifiers.
However, the use of genetic algorithms for feature subset selection for neural network pattern
classifiers trained using traditional neural network training algorithms presents some practical
problems:
Traditional neural network learning algorithms (e.g., backpropagation) perform an error
gradient guided search for a suitable setting of weights in the weight space determined by
a user-specified network architecture. This ad hoc choice of network architecture often
inappropriately constrains the search for an appropriate setting of weights. For example,
if the network has fewer neurons than necessary, the learning algorithm will fail to find the
desired classification function. If the network has far more neurons than necessary, it can
result in overfitting of the training data, leading to poor generalization. In either case, it
becomes difficult to evaluate the usefulness of a feature subset employed to
describe (or represent) the training patterns used to train the neural network.
Gradient-based learning algorithms, although mathematically well founded for unimodal
search spaces, can get caught in local minima of the error function. This can complicate
the evaluation of the usefulness of a feature subset employed to describe the training
patterns used to train the neural networks.
A typical run of a genetic algorithm involves many generations. In each generation, the
evaluation of an individual (a feature subset) involves training neural networks and
computing their accuracy and cost. This can make the fitness evaluation rather expensive,
since gradient-based algorithms are typically quite slow. The problem is further
exacerbated by the fact that multiple neural networks have to be used to sample the space
of ad hoc choices of network architecture in order to get a reliable fitness estimate for each
feature subset represented in the population.
DistAl is a simple and fast constructive neural network learning algorithm for pattern
classification. The results presented in this paper are based on experiments using neural networks
constructed by DistAl. The key idea behind DistAl is to add hidden neurons one at a time based
on a greedy strategy which ensures that the hidden neuron correctly classifies a maximal subset
of training patterns belonging to a single class. Correctly classified examples can then be
eliminated from further consideration. The process terminates when the training set becomes
empty, i.e., when the network correctly classifies the entire training set. When this
happens, the training set is linearly separable in the transformed space defined by the
hidden neurons. In fact, it is possible to set the weights on the hidden-to-output neuron
connections without going through an iterative process. It is straightforward to show that DistAl
is guaranteed to converge to 100% classification accuracy on any finite training set in time that is
polynomial in the number of training patterns. Experiments reported in [Yang et al., 1997] show
that DistAl, despite its simplicity, yields classifiers that compare quite favorably with those
generated using more sophisticated (and substantially more computationally demanding)
constructive learning algorithms. This makes DistAl an attractive choice for experimenting with
evolutionary approaches to feature subset selection for neural network pattern classifiers.
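As a rough illustration of the greedy covering loop described above (not the full DistAl algorithm, which uses inter-pattern distances to define spherical threshold units and sets the output weights directly), the following Python sketch captures the add-a-hidden-unit-then-eliminate-covered-patterns structure; the helper best_single_class_region is hypothetical:

import numpy as np

def best_single_class_region(X, y):
    """Hypothetical helper: choose a centre pattern and a distance threshold
    (a 'sphere' in inter-pattern distance space) that covers as many of the
    given patterns of a single class as possible without covering any pattern
    of another class. Assumes no two identical patterns carry different labels."""
    best = None
    for centre in X:
        d = np.linalg.norm(X - centre, axis=1)       # distances to all patterns
        for radius in np.unique(d):
            covered = d <= radius
            labels = y[covered]
            if len(set(labels)) == 1:                # region is pure (one class)
                if best is None or covered.sum() > best[0]:
                    best = (covered.sum(), centre, radius, labels[0])
    return best                                      # (count, centre, radius, class)

def distal_like_fit(X, y):
    """Greedy constructive loop in the spirit of DistAl: add one hidden unit per
    iteration, each covering a maximal pure subset of the remaining training
    patterns, then eliminate the covered patterns."""
    hidden_units = []
    remaining = np.ones(len(X), dtype=bool)
    while remaining.any():
        _, centre, radius, cls = best_single_class_region(X[remaining], y[remaining])
        hidden_units.append((centre, radius, cls))
        d = np.linalg.norm(X - centre, axis=1)
        remaining &= ~((d <= radius) & (y == cls))   # remove correctly covered patterns
    return hidden_units                              # each pass covers at least one pattern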
4. The Experiment
Experiments were run using a standard genetic algorithm with a rank-based selection strategy. The
reported results are based on 10-fold cross-validation for each classification task with the
following parameter settings:

Population Size: 50
Number of Generations: 20
Probability of Crossover: 0.6
Probability of Mutation: 0.001
Probability of Selection: 0.6
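The exact ranking scheme behind the rank-based selection is not specified here; the sketch below shows one common linear-ranking scheme in Python, where the selective_pressure value and the helper names are assumptions chosen for illustration rather than settings taken from the experiments:

import numpy as np

def rank_based_probabilities(fitnesses, selective_pressure=1.6):
    """Linear ranking: the worst individual's weight is proportional to (2 - sp)
    and the best individual's to sp; intermediate ranks are interpolated. The
    weights sum to the population size, so normalizing gives probabilities."""
    n = len(fitnesses)
    order = np.argsort(fitnesses)                # indices from worst to best
    ranks = np.empty(n)
    ranks[order] = np.arange(n)                  # rank 0 = worst, n - 1 = best
    sp = selective_pressure
    weights = (2 - sp) + 2 * (sp - 1) * ranks / (n - 1)
    return weights / weights.sum()

def select_parents(population, fitnesses, rng=None):
    """Sample a new parent pool (with replacement) using rank-based probabilities."""
    rng = rng or np.random.default_rng()
    p = rank_based_probabilities(np.asarray(fitnesses, dtype=float))
    idx = rng.choice(len(population), size=len(population), p=p)
    return [population[i] for i in idx]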
Each individual in the population represents a candidate solution to the feature subset selection
problem. Let m be the total number of attributes available to choose from to represent the
patterns to be classified. In a medical diagnosis task, these would be observable symptoms and a
set of possible diagnostic tests that can be performed on the patient. (Note that given m such
attributes, there exist 2^m possible feature subsets; thus, for large values of m, exhaustive search
is not feasible.) Each individual is represented by a binary vector of dimension m. If a bit is 1,
the corresponding attribute is selected; a value of 0 indicates that the corresponding attribute is
not selected.
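For illustration, this binary encoding can be written out as follows; the attribute names, the value of m, and the random initialization are invented for the example and are not taken from the report:

import numpy as np

rng = np.random.default_rng(42)

m = 8                                              # total number of available attributes
attribute_names = [f"attr_{j}" for j in range(m)]  # placeholder attribute names

# An individual is a binary vector of dimension m; here it is initialised at random.
individual = rng.integers(0, 2, size=m)            # e.g. [1 0 1 1 0 0 1 0]

# Decoding: the positions of the 1-bits give the selected attributes,
# i.e. the input columns for the corresponding neural network.
selected_idx = np.flatnonzero(individual)
selected_attrs = [attribute_names[j] for j in selected_idx]

# A network built for this individual has len(selected_idx) input nodes, and the
# training patterns are restricted to those columns, e.g. X_subset = X[:, selected_idx].
print(individual, selected_attrs)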
The fitness of an individual is determined by evaluating the neural network constructed by
DistAl using a training set whose patterns are represented using only the selected subset of
features. If an individual has n bits turned on, the corresponding neural network has n input
nodes.
The fitness function rewards high classification accuracy and penalizes high measurement cost, where:
fitness(x) is the fitness of the feature subset represented by the bit vector x,
accuracy(x) is the test accuracy of the neural network classifier trained by DistAl using
the feature subset represented by x, and
costmax is an upper bound on the costs of candidate solutions.
In this case, costmax is simply the sum of the costs associated with all of the attributes. This is clearly a
somewhat ad hoc choice. However, it does discourage trivial solutions (e.g., a zero-cost solution
with very low accuracy) from being selected over reasonable solutions that yield high
accuracy at a moderate cost. In practice, defining suitable trade-offs between the multiple
objectives has to be based on knowledge of the domain. In general, it is a non-trivial task to
combine multiple optimization criteria into a single fitness function. A wide variety of approaches
have been examined in the utility theory literature.
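Since the exact fitness formula is not reproduced here, the following sketch shows one plausible combination of the two criteria consistent with the definitions above; the 0.1 weighting of the cost term and the helper names (attribute_costs, accuracy_fn) are assumptions for illustration, not the formula used in the original experiments:

def fitness(individual, attribute_costs, accuracy_fn):
    """One plausible fitness combining accuracy and cost (illustrative only;
    the formula used in the original experiments may differ).

    individual      -- binary vector over the m available attributes
    attribute_costs -- positive measurement cost of each attribute (length m)
    accuracy_fn     -- callable returning the test accuracy (in [0, 1]) of a
                       classifier (e.g. a DistAl-built network) trained on the subset
    """
    cost_max = sum(attribute_costs)             # upper bound: cost of selecting everything
    cost = sum(c for bit, c in zip(individual, attribute_costs) if bit)
    acc = accuracy_fn(individual)
    # Accuracy dominates; the cost term is bounded via cost_max and weighted down
    # (0.1 is an assumed weight), so a cheap but inaccurate subset cannot outrank
    # an accurate subset of moderate cost.
    return acc + 0.1 * (1.0 - cost / cost_max)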
5. Experimental Results
The results for 10 different training/test splits based on 10-fold cross-validation were averaged
and are shown in the following tables.
5.1 Experiment 1:
In this experiment, selection in the genetic algorithm was based only on accuracy, so that its
performance could be compared with that obtained using the entire set of attributes. The results
(in terms of the number of attributes chosen (Dimension), the generalization accuracy (Accuracy),
and the network size (Hidden)) are shown in the table below:
The results indicate that the networks constructed using the GA-selected subset of attributes compare
quite favorably with networks that use all of the attributes. The generalization accuracy
increased significantly in every case, with comparable network size (but substantially fewer connections).
5.2 Experiment 2:
Selections were made based on both the generalization accuracy and the measurement cost of the
attributes. The 3P, Hepatitis, HeartCle, and Pima datasets were used for this experiment (with
random costs assigned in 3P). The results are shown in the table below:
As the table shows, the combined fitness function of accuracy and cost outperformed the
accuracy-only fitness function in every aspect: dimension, generalization accuracy, and network size.
This is not surprising, because the former tries to minimize cost (while maximizing accuracy),
which cuts down the dimension, whereas the latter emphasizes only the accuracy. Some of the
runs resulted in feature subsets that did not necessarily have minimum cost. This suggests the
possibility of improving the results through a more principled choice of a fitness function that
combines accuracy and cost.
6. Conclusion
The results presented in this paper indicate that genetic algorithms offer an attractive approach to
solving the feature subset selection problem (under different cost and performance constraints)
in the inductive learning of neural network pattern classifiers. This finds applications in the
cost-sensitive design of classifiers for tasks such as medical diagnosis and computer vision, among
others. Other applications of interest include automated data mining and knowledge discovery
from datasets with an abundance of irrelevant or redundant attributes. In such cases, identifying a
relevant subset that adequately captures the regularities in the data can be particularly useful. The
GA-based approach to feature subset selection does not rely on the monotonicity assumptions
used in traditional approaches to feature selection, which often limit their applicability to
real-world classification and knowledge acquisition tasks.
Some directions for further research include: application of GA-based approaches to feature
subset selection to large-scale pattern classification tasks that arise in power systems control,
gene sequence recognition, and data mining and knowledge discovery; extensive experimental
(and, whenever feasible, theoretical) comparison of the performance of the proposed approach
with that of conventional methods for feature subset selection; and more principled design of
multi-objective fitness functions for feature subset selection using domain knowledge as well as
mathematically well-founded tools of multi-attribute utility theory. Some of these topics are the
focus of our ongoing research.
7. References
[1] Yang, J., & Honavar, V. (1997). Feature Subset Selection Using a Genetic Algorithm. Tech. rept. TR #97-02a. Iowa State University.
[2] Yang, J., Parekh, R., & Honavar, V. (1997). DistAl: An Inter-pattern Distance-based Constructive Learning Algorithm. Tech. rept. ISU-CS-TR 97-05. Iowa State University.