Training Neural Networks With GA Hybrid Algorithms
1 Introduction
Research interest in Artificial Neural Networks (ANNs) stems from the appealing properties they exhibit: adaptability, learning capability, and the ability to generalize. Nowadays, ANNs receive a lot of attention from the international research community, with a large number of studies concerning training, structure design, and real-world applications ranging from classification to robot control and vision [1].
Training is a capital process in supervised learning, in which a pattern set made up of pairs of inputs and expected outputs is known beforehand and is used to compute the set of weights that enables the ANN to learn it. One of the most popular training algorithms in the domain of neural networks is Backpropagation (the generalized delta rule) [2], a gradient-descent method. Other techniques, such as evolutionary algorithms (EAs), have also been applied to the training problem in the past [3, 4], trying to avoid the local minima that so often appear in complex problems. Although training is a main issue in ANN design, many other works are devoted to evolving the layered structure of the ANN or even the elementary behavior of the neurons composing it. For example, in [5] a definition of neurons, layers, and the associated training problem is analyzed by using parallel genetic algorithms; also, in [6] both the architecture of the network and its weights are evolved by using the EPNet evolutionary system. It is really difficult to survey this topic exhaustively; however, the work of Yao [7] represents an excellent starting point to get acquainted with the research in training ANNs.
The motivation of the present work is manifold. First, we want to offer a standard presentation of results that promotes and facilitates future comparisons. This sounds like common sense, but authors seldom follow standard rules for comparisons such as Prechelt's structured set of recommendations [8], a de facto standard for many ANN researchers. A second contribution is to include in our study not only the well-known Genetic Algorithm (GA) and Backpropagation, but also the Levenberg-Marquardt (LM) approach [9] and two additional hybrids. The potential advantages of using LM merit a detailed study. We have selected a benchmark from the field of Medicine, composed of three classification problems: diagnosis of breast cancer, diagnosis of diabetes in Pima Indians, and diagnosis of heart disease.
The remainder of the article is organized as follows. Section 2 introduces the
Artificial Neural Network computation model. Next, we give a brief description
of the algorithms under analysis (Section 3). The details of the experiments and
their results are shown in Section 4. Finally, we summarize our conclusions and
future work in Section 5.
[Fig. 1. Structure of a multilayer perceptron: an input layer, a hidden layer, and an output layer. Each neuron computes the weighted sum of its inputs A1, ..., AN (connection weights W1, ..., WN, plus a bias θ) and passes the result through an activation function f(x) to produce its output y.]
$$\mathrm{SEP} = 100 \cdot \frac{o_{\max} - o_{\min}}{P \cdot S} \sum_{p=1}^{P} \sum_{i=1}^{S} \left(t_i^p - o_i^p\right)^2 . \quad (1)$$
where $t_i^p$ and $o_i^p$ are, respectively, the $i$-th components of the expected vector and the actual current output vector for pattern $p$; $o_{\min}$ and $o_{\max}$ are the minimum and maximum values of the output neurons, $S$ is the number of output neurons, and $P$ is the number of patterns.
In classification problems we can use still another measure: the Classification Error Percentage (CEP). CEP is the percentage of incorrectly classified patterns, and it is a usual complement to either of the raw error values (SEP or the well-known MSE), since CEP reports the quality of the trained ANN in a high-level manner.
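As an illustrative sketch (our own code, not the paper's), both measures can be computed directly from Eq. (1) and the definition of CEP, assuming targets and outputs are stored as arrays of shape (P, S) and the predicted class is the output neuron with the largest value:

```python
import numpy as np

def sep(targets, outputs, o_min=0.0, o_max=1.0):
    """Squared Error Percentage, Eq. (1): 100 * (o_max - o_min) / (P*S) * sum of squared errors."""
    P, S = targets.shape
    return 100.0 * (o_max - o_min) / (P * S) * np.sum((targets - outputs) ** 2)

def cep(targets, outputs):
    """Classification Error Percentage: fraction of patterns whose predicted class
    (largest output) differs from the expected class, as a percentage."""
    wrong = np.argmax(outputs, axis=1) != np.argmax(targets, axis=1)
    return 100.0 * np.mean(wrong)
```

For example, with two patterns and two output neurons, a pattern can have a low squared error yet still be misclassified, which is why both measures are worth reporting.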
3 The Algorithms
In our study we use several algorithms to train ANNs: the Backpropagation algorithm, the Levenberg-Marquardt algorithm, a Genetic Algorithm, a hybrid of the Genetic Algorithm and Backpropagation, and a hybrid of the Genetic Algorithm and Levenberg-Marquardt. We briefly describe them in the following paragraphs.
3.1 Backpropagation
The Backpropagation algorithm (BP) [2] is a classical domain-dependent technique for supervised training. It works by measuring the output error, calculating the gradient of this error, and adjusting the ANN weights (and biases) in the descending gradient direction. Hence, BP is a gradient-descent local search procedure, expected to stagnate in local optima on complex landscapes.
First, we define the squared error of the ANN for a set of patterns:
$$E = \sum_{p=1}^{P} \sum_{i=1}^{S} \left(t_i^p - o_i^p\right)^2 . \quad (2)$$
The actual value of the previous expression depends on the weights of the network. The basic BP algorithm (without momentum in our case) calculates the gradient of E (over all the patterns in our case) and updates the weights by moving them along the gradient-descent direction. This can be summarized with the expression ∆w = −η∇E, where the parameter η > 0 is the learning rate that controls the learning speed. The pseudo-code of the BP algorithm is shown in Fig. 2.
InitializeWeights;
while not StopCriterion do
    for all i, j do
        w_ij := w_ij − η · ∂E/∂w_ij;
    endfor;
endwhile;

Fig. 2. Pseudo-code of the BP algorithm
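As a hedged numeric sketch (our own illustrative code, not the paper's implementation), one full-batch BP epoch for a one-hidden-layer MLP with sigmoid units, applying ∆w = −η∇E to the error of Eq. (2), could look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_epoch(X, T, W1, b1, W2, b2, eta=0.01):
    """One full-batch gradient-descent step on E = sum over patterns of (t - o)^2.
    X: inputs (P, n_in); T: targets (P, n_out); weights/biases updated in place."""
    # Forward pass
    H = sigmoid(X @ W1 + b1)              # hidden activations, shape (P, n_hidden)
    O = sigmoid(H @ W2 + b2)              # outputs, shape (P, n_out)
    # Backward pass: deltas for the squared error (no momentum)
    dO = 2.0 * (O - T) * O * (1.0 - O)    # delta at the output layer
    dH = (dO @ W2.T) * H * (1.0 - H)      # delta back-propagated to the hidden layer
    # Move the weights along the descending gradient direction
    W2 -= eta * H.T @ dO
    b2 -= eta * dO.sum(axis=0)
    W1 -= eta * X.T @ dH
    b1 -= eta * dH.sum(axis=0)
    return np.sum((T - O) ** 2)           # squared error before the update
```

Calling `bp_epoch` repeatedly drives the error downward until the search stagnates, which is exactly the local-optimum behavior discussed above.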
3.2 Levenberg-Marquardt
The Levenberg-Marquardt algorithm (LM) [9] is an approximation to the Newton method that is also used for training ANNs. The Newton method approximates the error of the network with a second-order expression, in contrast to the Backpropagation algorithm, which uses a first-order expression. LM is popular in the ANN domain (it is even considered the first approach to try on an unseen MLP training task), although it is not that popular in the metaheuristics field.
LM updates the ANN weights as follows:
$$\Delta w = -\left[ \mu I + \sum_{p=1}^{P} J^p(w)^T J^p(w) \right]^{-1} \nabla E(w) . \quad (3)$$
where J^p(w) is the Jacobian matrix of the error vector e^p(w) evaluated at w, and I is the identity matrix. The error vector e^p(w) is the error of the network for pattern p, that is, e^p(w) = t^p − o^p(w). The parameter µ is increased or decreased at each step: if the error is reduced, µ is divided by a factor β; otherwise it is multiplied by β. Levenberg-Marquardt performs the steps detailed in Fig. 3. It calculates the network output, the error vectors, and the Jacobian matrix for each pattern. Then it computes ∆w using (3) and recalculates the error with w + ∆w as the network weights. If the error has decreased, µ is divided by β, the new weights are kept, and the process starts again; otherwise, µ is multiplied by β, ∆w is recalculated with the new value of µ, and the process iterates again.
InitializeWeights;
while not StopCriterion do
    Calculate e^p(w) for each pattern;
    e1 := Σ_{p=1..P} e^p(w)^T e^p(w);
    Calculate J^p(w) for each pattern;
    repeat
        Calculate ∆w;
        e2 := Σ_{p=1..P} e^p(w + ∆w)^T e^p(w + ∆w);
        if (e1 <= e2) then
            µ := µ · β;
        endif;
    until (e2 < e1);
    µ := µ / β;
    w := w + ∆w;
endwhile;

Fig. 3. Pseudo-code of the LM algorithm
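To make the µ/β damping scheme concrete, here is a hedged sketch (our own code, not the paper's) of the same loop applied to a small generic least-squares problem, fitting y = exp(a·x); the model, bounds, and tolerance are illustrative choices:

```python
import numpy as np

def lm_fit(x, y, w0, mu=1e-3, beta=10.0, iters=100, tol=1e-12):
    """Levenberg-Marquardt with the mu/beta update of Fig. 3, on f(x) = exp(a*x)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        r = np.exp(w[0] * x) - y                      # residual vector e(w)
        e1 = r @ r
        if e1 < tol:                                  # stop criterion
            break
        J = (x * np.exp(w[0] * x))[:, None]           # Jacobian of residuals, (P, 1)
        for _ in range(50):                           # damping loop (bounded for safety)
            # delta_w = -[mu*I + J^T J]^{-1} J^T e(w), cf. Eq. (3)
            dw = -np.linalg.solve(mu * np.eye(len(w)) + J.T @ J, J.T @ r)
            r2 = np.exp((w + dw)[0] * x) - y
            if r2 @ r2 < e1:                          # error decreased: accept step
                break
            mu *= beta                                # otherwise raise damping, retry
        mu /= beta                                    # relax damping after acceptance
        w = w + dw
    return w
```

Large µ makes the step behave like small-step gradient descent; small µ makes it behave like the (Gauss-)Newton step, which is what gives LM its speed near a solution.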
t := 0;
Initialize: P(0) := {a_1(0), ..., a_µ(0)} ∈ I^µ;
Evaluate: P(0) : {Φ(a_1(0)), ..., Φ(a_µ(0))};
while ι(P(t)) ≠ true do    // Reproductive loop
    Select: P'(t) := s_{Θs}(P(t));
    Recombine: P''(t) := ⊗_{Θc}(P'(t));
    Mutate: P'''(t) := m_{Θm}(P''(t));
    Evaluate: P'''(t) : {Φ(a'''_1(t)), ..., Φ(a'''_λ(t))};
    Replace: P(t+1) := r_{Θr}(P'''(t) ∪ Q);
    t := t + 1;
endwhile;
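The reproductive loop above can be sketched as a minimal binary GA (our own illustrative code, not the paper's): roulette selection, single-point crossover (SPX), bit-flip mutation, and elitist replacement. The OneMax fitness (count of 1 bits) is just a stand-in for evaluating an encoded set of ANN weights:

```python
import random

def ga(length=32, pop_size=20, pc=1.0, pm=None, generations=200, seed=1):
    rnd = random.Random(seed)
    pm = pm if pm is not None else 1.0 / length        # bit-flip rate 1/length
    fitness = lambda ind: sum(ind)                     # OneMax stand-in fitness
    pop = [[rnd.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)[:]
    for _ in range(generations):
        # Select: roulette wheel, proportional to fitness
        parents = rnd.choices(pop, weights=[fitness(i) for i in pop], k=pop_size)
        # Recombine: single-point crossover (SPX) with probability pc
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            if rnd.random() < pc:
                cut = rnd.randrange(1, length)
                children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
            else:
                children += [a[:], b[:]]
        # Mutate: independent bit flips with probability pm
        for c in children:
            for i in range(length):
                if rnd.random() < pm:
                    c[i] ^= 1
        # Replace: elitist replacement keeps the best individual found so far
        cand = max(children, key=fitness)
        if fitness(cand) > fitness(best):
            best = cand[:]
        pop = children
        pop[0] = best[:]
    return best
```

For real ANN training the bit string would encode the weights and the fitness would be the (negated) network error, but the loop structure is the same.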
4 Empirical Study
After discussing the algorithms, we present in this section the experiments per-
formed and their results. The benchmark for training and the parameters of the
algorithms are presented in the next subsection. The analysis of the results is
shown in Subsection 4.2.
4.1 Computational Experiments
We tackle three classification problems, which consist of determining the class that a given input vector belongs to. Each pattern in the training pattern set contains an input vector and its desired output vector, both formed by real numbers. However, in classification problems the output of the network must be interpreted as a class, and this interpretation can be performed in different ways [8]. One of them consists in assigning an output neuron to each class: when an input vector is presented to the network, the network response is the class associated with the output neuron with the largest value. This method is known as winner-takes-all, and it is the one employed in this work.
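With one output neuron per class, winner-takes-all reduces to an argmax over the output activations (illustrative values below, not taken from the experiments):

```python
import numpy as np

output = np.array([0.12, 0.81, 0.35])   # activations of three output neurons
predicted_class = int(np.argmax(output))  # index of the winning neuron
```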
The instances solved here belong to the Proben1¹ benchmark [8]: Cancer (diagnosis of breast cancer), Diabetes (diagnosis of diabetes in Pima Indians), and Heart (diagnosis of heart disease).
The MLP used for every problem has three layers (input, hidden, and output), with six neurons in the hidden layer. The number of neurons in the input and output layers depends on the concrete instance. The activation function of the neurons is the sigmoid function. Table 1 summarizes the network architecture for each instance.
To evaluate an ANN, we split the pattern set into two subsets: the training set and the test set. The ANN is trained by each algorithm using the training pattern set, and then it is evaluated on the unseen test pattern set. The training set for each instance consists of approximately the first 75% of the examples, while the last 25% constitutes the test set. The exact number of patterns for each instance is given in Table 1 to ease future comparisons.
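The split is a simple prefix/suffix partition; a minimal sketch for the Cancer instance (counts taken from Table 1, the pattern list itself is a stand-in) would be:

```python
# Proben1-style split: first ~75% of patterns for training, last 25% for test.
patterns = list(range(699))            # stand-in for the 699 Cancer patterns
n_train = 525                          # training count per Table 1
train, test = patterns[:n_train], patterns[n_train:]
```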
¹ Available from ftp://ftp.ira.uka.de/pub/neuron/proben1.tar.gz.

Table 1. MLP architecture and pattern distribution for all instances

Instance    Architecture    Training patterns    Test patterns
Cancer      9 - 6 - 2       525                  174
Diabetes    8 - 6 - 2       576                  192
Heart       35 - 6 - 2      690                  230

After presenting the problems, we now turn to the parameters of the algorithms (Table 2). To set the parameters of the pure algorithms, we performed some preliminary experiments and chose the values with the best results. The hybrid algorithms GABP and GALM use the same parameters as their elementary components. However, the mutation operator of the GA is not applied; instead, it is replaced by BP or LM, respectively. BP and LM are applied, with an associated probability pt, to only one individual generated after recombination at each iteration. When applied, BP/LM performs one single epoch.
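This hybridization rule can be sketched as follows (our own hedged illustration; `local_search_epoch` is a placeholder for one epoch of either BP or LM, not a function from the paper):

```python
import random

def maybe_refine(offspring, local_search_epoch, pt=1.0, rnd=random):
    """With probability pt, apply one epoch of local search (BP or LM)
    to exactly one individual generated after recombination."""
    if offspring and rnd.random() < pt:
        i = rnd.randrange(len(offspring))              # pick one offspring at random
        offspring[i] = local_search_epoch(offspring[i])  # one single epoch of BP/LM
    return offspring
```

Replacing mutation with a single local-search epoch keeps the GA's exploration while letting the problem-specific refiner exploit promising regions cheaply.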
Table 2. Parameters of the algorithms (BC = Cancer, DI = Diabetes, HE = Heart)

                                BC        DI        HE
BP      Epochs                  1000      1000      500
        η                       0.01      0.01      0.001
LM      Epochs                  1000      1000      500
        µ                       0.001     0.001     0.001
        β                       10        10        10
GA      Population size         64
        Selection               Roulette (2 inds.)
        Recombination           SPX (pc = 1.0)
        Mutation                Bit-flip (pm = 1/length)
        Replacement             Elitist
        Stop criterion          1064 evals.
GAxx    pt                      1.0       1.0       0.5
        Epochs of xx            1         1         1
A first conclusion is that the GA always obtains a higher CEP than BP, LM, and the hybrids (except for Heart with GABP). This is not surprising, since the GA performs a rather explorative search in this kind of problem. BP is slightly more accurate than LM on all the instances, which we did not expect after the accurate behavior of LM in other studies.
With respect to the hybrid algorithms, the results do confirm our working hypothesis: GALM is more accurate than GABP. In fact, this is noticeable since BP performed better than LM. Of course, we are not claiming that this holds for any ANN training problem. However, we do make a clear claim after these results: GABP has received too much attention from the community, while GALM might have worked out lower error percentages. To help the reader, we also display these results in a graph in Fig. 5.
We have traced the evolution of each algorithm on the Cancer instance to better explain how the different algorithms work (Fig. 6). We measure the SEP of the network at each epoch of the algorithm. For the population-based algorithms (GA, GABP, and GALM) we trace the SEP of the best-fitness network. Each trace line represents the average SEP over 50 independent runs. We can observe that LM is the fastest algorithm, followed by BP, which confirms the intuition on the speed of local search compared to GAs and hybrids. BP and LM clearly stagnate in a solution before 200 epochs. The GA is the slowest algorithm, and its hybridization with BP, and especially with LM, accelerates the evolution. An interesting observation is that the algorithms with the lowest SEP (BP and LM) do not always obtain the lowest CEP (best classification) on the test patterns. For example, GALM, which exhibits the lowest CEP, has only a modest SEP value during training. This is due to overtraining of the network by the BP and LM algorithms, and it confirms the necessity of reporting both ANN errors and classification percentages in this field of research.
[Figure: average SEP (y-axis, 0–25) versus epochs (x-axis, 0–1000) for GA, GABP, GALM, BP, and LM.]
Fig. 6. Average evolution of SEP for the algorithms on the Cancer instance
There are many interesting works related to neural network training that also solve the instances tackled here. Unfortunately, some of their results are not comparable with ours because they use a different definition of the training and test sets; this is why we consider it a capital issue to adhere to a standard way of evaluation such as the one proposed by Prechelt [8]. However, we did find some works allowing meaningful comparisons.
For the Cancer instance, the best mean CEP we found in the literature [22] is 1.1%, a lower accuracy than the 0.02% obtained with our GALM hybrid. In [23], a CEP close to 2% is achieved for this instance, so our GALM is roughly one hundred times more accurate. That work uses 524 patterns for the training set and the rest for the test set, that is, almost exactly our configuration with only one pattern changed (a minor detail), and therefore the results can be compared. The same holds for the work of Yao and Liu [6], where the EPNet algorithm produces neural networks of lower quality (1.4% CEP).
For the Diabetes instance, a CEP of 30.11% is reported in [24] (outperformed by our BP, LM, and GALM) with the same network architecture as in our work. In [6], a CEP of 22.4% is reported for this instance (outperformed by our BP with 21.76%).
Finally, [24] reports a CEP of 45.71% for the Heart instance using the same architecture. In this case, all our algorithms except GABP outperform that result.
In summary, while we have obtained some of the most accurate results reported for these three instances, further progress on other instances is still needed, always keeping in mind the importance of reporting results in a standardized manner.
5 Conclusions
In this work we have tackled the neural network training problem with five algorithms: two well-known problem-specific algorithms, Backpropagation and Levenberg-Marquardt; a general metaheuristic, the Genetic Algorithm; and two hybrid algorithms combining the Genetic Algorithm with the problem-specific techniques. To compare the algorithms, we solved three classification problems from the domain of Medicine: the diagnosis of breast cancer, the diagnosis of diabetes in Pima Indians, and the diagnosis of heart disease.
Our results show that the problem-specific algorithms (BP and LM) achieve a lower classification error than the genetic algorithm, numerically confirming what intuition can only suggest. The hybrid algorithm GALM outperforms the classification error of the problem-specific algorithms on two of the three instances. This makes GALM look like a promising algorithm for neural network training. Moreover, many of the classification errors obtained in this work are below those found in the literature, which represents a cutting-edge result. As future work we plan to add new algorithms to the analysis and to apply them to more instances, especially in the domain of Bioinformatics.
Acknowledgments
This work has been partially funded by the Ministry of Science and Technology and FEDER under contract TIC2002-04498-C05-02 (the TRACER project, http://tracer.lcc.uma.es).
References
1. Alander, J.T.: Indexed Bibliography of Genetic Algorithms and Neural Networks. Technical Report 94-1-NN, University of Vaasa, Department of Information Technology and Production Economics (1994)
2. Rumelhart, D., Hinton, G., Williams, R.: Learning Representations by Back-propagating Errors. Nature 323 (1986) 533–536
3. Cotta, C., Alba, E., Sagarna, R., Larrañaga, P.: Adjusting Weights in Artificial
Neural Networks using Evolutionary Algorithms. In Larrañaga, P., Lozano, J., eds.:
Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation,
Kluwer Academic Publishers (2001) 357–373
4. Cantú-Paz, E.: Pruning Neural Networks with Distribution Estimation Algorithms.
In Erick Cantú-Paz et al., ed.: GECCO 2003, LNCS 2723, Springer-Verlag (2003)
790–800
5. Alba, E., Aldana, J.F., Troya, J.M.: Full Automatic ANN Design: A Genetic
Approach. In Mira, J., Cabestany, J., Prieto, A., eds.: New Trends in Neural
Computation, Springer-Verlag (1993) 399–404
6. Yao, X., Liu, Y.: A New Evolutionary System for Evolving Artificial Neural Networks. IEEE Transactions on Neural Networks 8 (1997) 694–713
7. Yao, X.: Evolving Artificial Neural Networks. Proceedings of the IEEE 87 (1999)
1423–1447
8. Prechelt, L.: Proben1 — A Set of Neural Network Benchmark Problems and
Benchmarking Rules. Technical Report 21, Fakultät für Informatik Universität
Karlsruhe, 76128 Karlsruhe, Germany (1994)
9. Hagan, M.T., Menhaj, M.B.: Training Feedforward Networks with the Marquardt
Algorithm. IEEE Transactions on Neural Networks 5 (1994)
10. McClelland, J.L., Rumelhart, D.E.: Parallel Distributed Processing: Explorations
in the Microstructure of Cognition. The MIT Press (1986)
11. Rosenblatt, F.: Principles of Neurodynamics. Spartan Books, New York (1962)
12. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of
Michigan Press, Ann Arbor, Michigan (1975)
13. Davis, L., ed.: Handbook of Genetic Algorithms. Van Nostrand Reinhold, New
York (1991)
14. Cotta, C., Troya, J.M.: On Decision-Making in Strong Hybrid Evolutionary Algorithms. Tasks and Methods in Applied Artificial Intelligence, Lecture Notes in Artificial Intelligence 1415 (1998) 418–427
15. Bennett, K.P., Mangasarian, O.L.: Robust Linear Programming Discrimination
of Two Linearly Inseparable Sets. Optimization Methods and Software 1 (1992)
23–34
16. Mangasarian, O.L., Setiono, R., Wolberg, W.H.: Pattern Recognition via Linear
Programming: Theory and Application to Medical Diagnosis. In Coleman, T.F.,
Li, Y., eds.: Large-Scale Numerical Optimization. SIAM Publications, Philadelphia
(1990) 22–31
17. Wolberg, W.H.: Cancer Diagnosis via Linear Programming. SIAM News 23 (1990)
1–18
18. Wolberg, W.H., Mangasarian, O.L.: Multisurface Method of Pattern Separation
for Medical Diagnosis Applied to Breast Cytology. In: Proceedings of the National
Academy of Sciences. Volume 87., U.S.A (1990) 9193–9196
19. Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., Johannes, R.S.: Using
the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus. In:
Proceedings of the Twelfth Symposium on Computer Applications in Medical Care,
IEEE Computer Society Press (1988) 261–265
20. Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S.,
Guppy, K., Lee, S., Froelicher, V.: International Application of a New Probability
Algorithm for the Diagnosis of Coronary Artery Disease. American Journal of
Cardiology (1989) 304–310
21. Gennari, J.H., Langley, P., Fisher, D.: Models of Incremental Concept Formation.
Artificial Intelligence 40 (1989) 11–61
22. Ragg, T., Gutjahr, S., Sa, H.: Automatic Determination of Optimal Network
Topologies Based on Information Theory and Evolution. In: Proceedings of the
23rd EUROMICRO Conference, Budapest, Hungary (1997)
23. Land, W.H., Albertelli, L.E.: Breast Cancer Screening Using Evolved Neural Networks. In: IEEE International Conference on Systems, Man, and Cybernetics, 1998. Volume 2., IEEE Computer Society Press (1998) 1619–1624
24. Erhard, W., Fink, T., Gutzmann, M.M., Rahn, C., Doering, A., Galicki, M.: The Improvement and Comparison of Different Algorithms for Optimizing Neural Networks on the MasPar MP-2. In Heiss, M., ed.: Neural Computation – NC'98, ICSC Academic Press (1998) 617–623