Artificial Neural Network - Quick Guide
Neural networks are parallel computing devices that are, in essence, an attempt to build a computer model of the brain. The main objective is to develop a system that performs various computational tasks faster than traditional systems. These tasks include pattern recognition and classification, approximation, optimization, and data clustering.
The historical review shows that significant progress has been made in this field. Neural network
based chips are emerging and applications to complex problems are being developed. Surely,
today is a period of transition for neural network technology.
Biological Neuron
A nerve cell (neuron) is a special biological cell that processes information. According to one estimate, there is a huge number of neurons, approximately $10^{11}$, with numerous interconnections, approximately $10^{15}$.
Schematic Diagram
Working of a Biological Neuron
As shown in the above diagram, a typical neuron consists of the following four parts, with the help of which we can explain its working −
Dendrites − They are tree-like branches responsible for receiving information from other neurons the cell is connected to. In a sense, they are the ears of the neuron.
Soma − It is the cell body of the neuron and is responsible for processing the information received from the dendrites.
Axon − It is just like a cable through which the neuron sends information.
Synapses − They are the connections between the axon of one neuron and the dendrites of other neurons.
Before taking a look at the differences between Artificial Neural Network (ANN) and Biological Neural Network (BNN), let us take a look at the similarities based on the terminology between these two.

| Biological Neural Network (BNN) | Artificial Neural Network (ANN) |
| --- | --- |
| Soma | Node |
| Dendrites | Input |
| Axon | Output |
The following table shows the comparison between ANN and BNN based on some criteria.

| Criteria | BNN | ANN |
| --- | --- | --- |
| Processing | Massively parallel, slow but superior to ANN | Massively parallel, fast but inferior to BNN |
| Size | $10^{11}$ neurons and $10^{15}$ interconnections | $10^2$ to $10^4$ nodes (mainly depends on the type of application and the network designer) |
| Learning | They can tolerate ambiguity | Very precise, structured and formatted data is required to tolerate ambiguity |
| Storage capacity | Stores the information in the synapse | Stores the information in continuous memory locations |
The following diagram represents the general model of ANN followed by its processing.
For the above general model of artificial neural network, the net input can be calculated as
follows −
$$y_{in} = x_1 w_1 + x_2 w_2 + x_3 w_3 \dots x_m w_m$$

i.e., net input −

$$y_{in} = \sum_{i=1}^{m} x_i w_i$$
The output can be calculated by applying the activation function over the net input.
$$Y = F(y_{in})$$
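As a concrete illustration, the following minimal Python sketch (not part of the original text; the inputs, weights and threshold are arbitrary example values) computes the net input and applies a simple threshold activation over it:

```python
import numpy as np

# Arbitrary example inputs x1..xm and weights w1..wm
x = np.array([0.5, 0.3, 0.2])
w = np.array([0.4, 0.7, 0.2])

# Net input: y_in = sum_i x_i * w_i
y_in = np.dot(x, w)

# A simple threshold activation F applied over the net input
def F(y_in, theta=0.5):
    return 1 if y_in > theta else 0

Y = F(y_in)
print(y_in, Y)  # 0.45 -> 0
```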
Network Topology
A network topology is the arrangement of a network along with its nodes and connecting lines.
According to the topology, ANN can be classified as the following kinds −
Feedforward Network
It is a non-recurrent network having processing units/nodes in layers and all the nodes in a layer
are connected with the nodes of the previous layers. The connection has different weights upon
them. There is no feedback loop means the signal can only flow in one direction, from input to
output. It may be divided into the following two types −
Single layer feedforward network − The concept is of feedforward ANN having only
one weighted layer. In other words, we can say the input layer is fully connected to the
output layer.
Feedback Network
As the name suggests, a feedback network has feedback paths, which means the signal can flow in both directions using loops. This makes it a non-linear dynamic system, which changes continuously until it reaches a state of equilibrium. It may be divided into the following types −
Recurrent networks − They are feedback networks with closed loops. Following are
the two types of recurrent networks.
Fully recurrent network − It is the simplest neural network architecture because all
nodes are connected to all other nodes and each node works as both input and output.
Jordan network − It is a closed loop network in which the output will go to the input
again as feedback as shown in the following diagram.
Adjustments of Weights or Learning
Learning, in artificial neural network, is the method of modifying the weights of connections
between the neurons of a specified network. Learning in ANN can be classified into three
categories namely supervised learning, unsupervised learning, and reinforcement learning.
Supervised Learning
As the name suggests, this type of learning is done under the supervision of a teacher. This learning process is dependent on the teacher.
During the training of ANN under supervised learning, the input vector is presented to the
network, which will give an output vector. This output vector is compared with the desired output
vector. An error signal is generated, if there is a difference between the actual output and the
desired output vector. On the basis of this error signal, the weights are adjusted until the actual
output is matched with the desired output.
Unsupervised Learning
As the name suggests, this type of learning is done without the supervision of a teacher. This
learning process is independent.
During the training of ANN under unsupervised learning, the input vectors of similar type are
combined to form clusters. When a new input pattern is applied, then the neural network gives
an output response indicating the class to which the input pattern belongs.
There is no feedback from the environment as to what the desired output should be or whether it is correct or incorrect. Hence, in this type of learning, the network itself must discover the patterns and features in the input data, and the relation of the input data to the output.
Reinforcement Learning
As the name suggests, this type of learning is used to reinforce or strengthen the network over some critic information. This learning process is similar to supervised learning; however, we might have much less information.
During the training of network under reinforcement learning, the network receives some
feedback from the environment. This makes it somewhat similar to supervised learning.
However, the feedback obtained here is evaluative not instructive, which means there is no
teacher as in supervised learning. After receiving the feedback, the network performs
adjustments of the weights to get better critic information in future.
Activation Functions
An activation function may be defined as the transformation applied over the net input to obtain the exact output. In ANN, activation functions are applied over the net input to compute the output. Following are some activation functions of interest −
Linear activation function − It is also called the identity function as it performs no input editing. It can be defined as −

$$F(x) = x$$
Binary sigmoidal function − This activation function performs input editing between 0 and 1. It is positive in nature, always bounded, and strictly increasing. It can be defined as −

$$F(x) = sigm(x) = \frac{1}{1 + \exp(-x)}$$
Bipolar sigmoidal function − This activation function performs input editing between -1 and 1. It can be positive or negative in nature. It is always bounded, which means its output cannot be less than -1 or more than 1. Like the sigmoid function, it is strictly increasing in nature. It can be defined as −
$$F(x) = sigm(x) = \frac{2}{1 + \exp(-x)} - 1 = \frac{1 - \exp(-x)}{1 + \exp(-x)}$$
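For illustration, here is a small Python sketch (an addition to the text, using NumPy) of the binary and bipolar sigmoidal functions defined above:

```python
import numpy as np

def binary_sigmoid(x):
    # Bounded between 0 and 1, strictly increasing
    return 1.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid(x):
    # Bounded between -1 and 1: 2/(1 + exp(-x)) - 1
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

xs = np.array([-2.0, 0.0, 2.0])
print(binary_sigmoid(xs))   # approximately [0.119 0.5 0.881]
print(bipolar_sigmoid(xs))  # approximately [-0.762 0. 0.762]
```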
As stated earlier, ANN is completely inspired by the way the biological nervous system, i.e. the human brain, works. The most impressive characteristic of the human brain is its ability to learn, hence the same feature is acquired by ANN.
Basically, learning means adapting in itself as and when there is a change in the environment. ANN is a complex system, or more precisely we can say that it is a complex adaptive system, which can change its internal structure based on the information passing through it.
Why Is It Important?
Being a complex adaptive system, learning in ANN implies that a processing unit is capable of changing its input/output behavior due to a change in the environment. The importance of learning in ANN arises because, once a particular network is constructed, its activation function and input/output vectors are fixed. Hence, to change the input/output behavior, we need to adjust the weights.
Classification
It may be defined as the process of learning to distinguish sample data into different classes by finding common features between samples of the same class. For example, to perform training of ANN, we have some training samples with unique features, and to perform its testing we have some testing samples with other unique features. Classification is an example of supervised learning.
We know that, during ANN learning, to change the input/output behavior, we need to adjust the
weights. Hence, a method is required with the help of which the weights can be modified. These
methods are called Learning rules, which are simply algorithms or equations. Following are
some learning rules for the neural network −
Hebbian Learning Rule
Basic Concept − This rule is based on a proposal given by Hebb, who wrote −
“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes
part in firing it, some growth process or metabolic change takes place in one or both cells such
that A’s efficiency, as one of the cells firing B, is increased.”
From the above postulate, we can conclude that the connections between two neurons might be
strengthened if the neurons fire at the same time and might weaken if they fire at different times.
Mathematical Formulation − According to Hebbian learning rule, following is the formula to
increase the weight of connection at every time step.
$$\Delta w_{ji}(t) = \alpha x_i(t) \cdot y_j(t)$$
Here, $\Delta w_{ji}(t)$ = increment by which the weight of connection increases at time step t

$\alpha$ = the positive and constant learning rate

$x_i(t)$ = the input value from the pre-synaptic neuron at time step t

$y_j(t)$ = the output of the post-synaptic neuron at the same time step t
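The following Python sketch (illustrative only; the activity vectors and learning rate are assumed example values) applies the Hebbian update above to a small weight matrix:

```python
import numpy as np

def hebb_update(w, x, y, alpha=0.1):
    # Delta w_ji(t) = alpha * x_i(t) * y_j(t): strengthens the weights
    # between units that are active at the same time
    return w + alpha * np.outer(y, x)

# Illustrative values: 3 inputs, 2 outputs
w = np.zeros((2, 3))
x = np.array([1.0, 0.0, 1.0])   # pre-synaptic activity
y = np.array([1.0, 0.0])        # post-synaptic activity
w = hebb_update(w, x, y)
print(w)  # only weights into the active output unit change
```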
Perceptron Learning Rule
This rule is an error-correcting supervised learning algorithm for single-layer feedforward networks with a linear activation function, introduced by Rosenblatt.
Basic Concept − Being supervised in nature, to calculate the error there would be a comparison between the desired/target output and the actual output. If any difference is found, then a change must be made to the weights of the connections.
Mathematical Formulation − To explain its mathematical formulation, suppose we have ‘n’
number of finite input vectors, x(n), along with its desired/target output vector t(n), where n = 1
to N.
Now the output ‘y’ can be calculated, as explained earlier, on the basis of the net input, and the activation function applied over that net input can be expressed as follows −
$$y = f(y_{in}) = \begin{cases} 1, & y_{in} > \theta \\ 0, & y_{in} \leqslant \theta \end{cases}$$

Where $\theta$ is the threshold.
The updating of weight can be done in the following two cases −
Case I − when t ≠ y, then

$$w(new) = w(old) + tx$$

Case II − when t = y, then there is no change in weight.
Delta Learning Rule (Widrow-Hoff Rule)
It was introduced by Bernard Widrow and Marcian Hoff, and is also called the Least Mean Square (LMS) method, as it minimizes the error over all training patterns. It is a kind of supervised learning algorithm with a continuous activation function.
Basic Concept − The base of this rule is the gradient-descent approach, which continues until the error is minimized. The delta rule updates the synaptic weights so as to minimize the difference between the net input to the output unit and the target value.
Mathematical Formulation − To update the synaptic weights, delta rule is given by
$$\Delta w_i = \alpha \cdot x_i \cdot e_j$$

Here, $\Delta w_i$ = weight change for the i-th pattern;

$\alpha$ = the positive and constant learning rate;

$x_i$ = the input value from the pre-synaptic neuron;

$e_j = (t - y_{in})$, the difference between the desired/target output and the actual net input $y_{in}$.

The updated weight is then given by −

$$w(new) = w(old) + \Delta w$$
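A minimal Python sketch of the delta rule for a single linear output unit follows (an illustrative addition; the data and learning rate are assumed values):

```python
import numpy as np

def delta_rule_update(w, x, t, alpha=0.1):
    y_in = np.dot(w, x)           # net input (linear activation)
    e = t - y_in                  # error e = (t - y_in)
    return w + alpha * e * x      # Delta w_i = alpha * x_i * e

# Illustrative single-output example
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.5])
t = 1.0
for _ in range(20):               # repeated presentations shrink the error
    w = delta_rule_update(w, x, t)
print(w, np.dot(w, x))            # the net input approaches the target 1.0
```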
Competitive Learning Rule (Winner-takes-all)
Basic Concept of Competitive Network − This network is just like a single layer feedforward network with feedback connections between the outputs. The connections between the outputs are of the inhibitory type, shown by dotted lines, which means the competitors never support themselves.
Basic Concept of Competitive Learning Rule − As said earlier, there will be a competition
among the output nodes. Hence, the main concept is that during training, the output unit with
the highest activation to a given input pattern, will be declared the winner. This rule is also called
Winner-takes-all because only the winning neuron is updated and the rest of the neurons are
left unchanged.
Mathematical formulation − Following are the three important factors for the mathematical formulation of this learning rule −

Condition to be a winner − Suppose a neuron $y_k$ wants to be the winner; then there would be the following condition −

$$y_k = \begin{cases} 1 & \text{if } v_k > v_j \text{ for all } j, j \neq k \\ 0 & \text{otherwise} \end{cases}$$
Condition of sum total of weight − Another constraint over the competitive learning rule is that the sum total of weights to a particular output neuron is going to be 1. For example, if we consider neuron k then −

$$\sum_{j} w_{kj} = 1 \quad \text{for all } k$$
Change of weight for winner − If a neuron does not respond to the input pattern, then no learning takes place in that neuron. However, if a particular neuron wins, then the corresponding weights are adjusted as follows −

$$\Delta w_{kj} = \begin{cases} -\alpha(x_j - w_{kj}), & \text{if neuron } k \text{ wins} \\ 0, & \text{if neuron } k \text{ loses} \end{cases}$$
Here $\alpha$ is the learning rate.

This clearly shows that we are favoring the winning neuron by adjusting its weight; if a neuron loses, we need not bother to re-adjust its weight.
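The sketch below (an illustrative addition) implements one winner-takes-all step in Python. Note that it moves the winner's weights toward the input, i.e. it uses Δw = α(x − w) for the winner, the commonly used convention, and renormalizes the winner's weights so they keep summing to 1:

```python
import numpy as np

def competitive_update(W, x, alpha=0.5):
    # Winner: the output unit whose induced local field v_k = w_k . x is largest
    v = W @ x
    k = int(np.argmax(v))
    # Only the winner learns; its weights move toward the input
    W[k] += alpha * (x - W[k])
    # Keep the sum of the winner's weights equal to 1 (the constraint above)
    W[k] /= W[k].sum()
    return W, k

W = np.array([[0.6, 0.4], [0.3, 0.7]])  # each row sums to 1
x = np.array([1.0, 0.0])
W, winner = competitive_update(W, x)
print(winner, W)
```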
Outstar Learning Rule
This rule, introduced by Grossberg, is concerned with supervised learning because the desired outputs are known. It is also called Grossberg learning.
Basic Concept − This rule is applied over the neurons arranged in a layer. It is specially
designed to produce a desired output d of the layer of p neurons.
Mathematical Formulation − The weight adjustments in this rule are computed as follows −

$$\Delta w_j = \alpha (d - w_j)$$

Here d is the desired neuron output and $\alpha$ is the learning rate.
Supervised Learning
As the name suggests, supervised learning takes place under the supervision of a teacher.
This learning process is dependent. During the training of ANN under supervised learning, the
input vector is presented to the network, which will produce an output vector. This output vector
is compared with the desired/target output vector. An error signal is generated if there is a
difference between the actual output and the desired/target output vector. On the basis of this
error signal, the weights would be adjusted until the actual output is matched with the desired
output.
Perceptron
Developed by Frank Rosenblatt using the McCulloch and Pitts model, the perceptron is the basic operational unit of artificial neural networks. It employs the supervised learning rule and is able to classify data into two classes.
Links − It has a set of connection links that carry weights, including a bias that always has weight 1.
Adder − It adds the inputs after they are multiplied by their respective weights.
Activation function − It limits the output of the neuron. The most basic activation function is a Heaviside step function, which has two possible outputs. This function returns 1 if the input is positive, and 0 for any negative input.
Training Algorithm
Perceptron network can be trained for a single output unit as well as multiple output units.

Training Algorithm for Single Output Unit

Step 1 − Initialize the following to start the training −

Weights
Bias
Learning rate $\alpha$

For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning rate must be set equal to 1.
Step 2 − Continue steps 3-8 when the stopping condition is not true.

Step 3 − Continue steps 4-6 for every training vector x.

Step 4 − Activate each input unit as follows −

$$x_i = s_i \ (i = 1 \text{ to } n)$$

Step 5 − Now obtain the net input with the following relation −
$$y_{in} = b + \sum_{i=1}^{n} x_i \cdot w_i$$
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output.
$$f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \leqslant y_{in} \leqslant \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases}$$
Step 7 − Adjust the weight and bias as follows −

Case 1 − if y ≠ t then,

$$w_i(new) = w_i(old) + \alpha t x_i$$

$$b(new) = b(old) + \alpha t$$
Case 2 − if y = t then,
$$w_i(new) = w_i(old)$$

$$b(new) = b(old)$$
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
Step 8 − Test for the stopping condition, which would happen when there is no change in
weight.
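The whole single-output training algorithm can be sketched in Python as follows (an illustrative addition; the bipolar AND data set and the parameter values are assumed for the example):

```python
import numpy as np

def train_perceptron(samples, targets, alpha=1.0, theta=0.0, max_epochs=100):
    # Step 1: weights and bias start at 0
    w, b = np.zeros(samples.shape[1]), 0.0
    for _ in range(max_epochs):
        changed = False
        for x, t in zip(samples, targets):
            y_in = b + np.dot(x, w)                       # Step 5: net input
            # Step 6 activation: 1 / 0 / -1 around the threshold theta
            y = 1 if y_in > theta else (-1 if y_in < -theta else 0)
            if y != t:                                    # Step 7, Case 1
                w += alpha * t * x
                b += alpha * t
                changed = True
        if not changed:                                   # Step 8: stop
            break
    return w, b

# Bipolar AND function as a toy training set (illustrative data)
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
T = np.array([1, -1, -1, -1])
print(train_perceptron(X, T))
```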
The following diagram is the architecture of perceptron for multiple output classes.
Training Algorithm for Multiple Output Units

Step 1 − Initialize the following to start the training −
Weights
Bias
Learning rate $\alpha$
For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning
rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every training vector x.
Step 4 − Activate each input unit as follows −
$$x_i = s_i \ (i = 1 \text{ to } n)$$
Step 5 − Now obtain the net input with the following relation −

$$y_{inj} = b_j + \sum_{i=1}^{n} x_i w_{ij}$$
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output for each output unit j =
1 to m −
$$f(y_{inj}) = \begin{cases} 1 & \text{if } y_{inj} > \theta \\ 0 & \text{if } -\theta \leqslant y_{inj} \leqslant \theta \\ -1 & \text{if } y_{inj} < -\theta \end{cases}$$
Step 7 − Adjust the weight and bias for i = 1 to n and j = 1 to m as follows −
Case 1 − if yj ≠ tj then,
$$w_{ij}(new) = w_{ij}(old) + \alpha t_j x_i$$

$$b_j(new) = b_j(old) + \alpha t_j$$
Case 2 − if yj = tj then,
$$w_{ij}(new) = w_{ij}(old)$$

$$b_j(new) = b_j(old)$$
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
Step 8 − Test for the stopping condition, which will happen when there is no change in weight.
Adaptive Linear Neuron (Adaline)
Some important points about Adaline are as follows −
It uses the delta rule for training to minimize the Mean-Squared Error (MSE) between the actual output and the desired/target output.
The weights and the bias are adjustable.
Architecture
The basic structure of Adaline is similar to perceptron having an extra feedback loop with the
help of which the actual output is compared with the desired/target output. After comparison on
the basis of training algorithm, the weights and bias will be updated.
Training Algorithm
Step 1 − Initialize the following to start the training −

Weights
Bias
Learning rate $\alpha$
For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning
rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every bipolar training pair s:t.
Step 4 − Activate each input unit as follows −
$$x_i = s_i \ (i = 1 \text{ to } n)$$
Step 5 − Obtain the net input with the following relation −

$$y_{in} = b + \sum_{i=1}^{n} x_i w_i$$
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output −
$$f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} \geqslant 0 \\ -1 & \text{if } y_{in} < 0 \end{cases}$$
Step 7 − Adjust the weight and bias as follows −
Case 1 − if y ≠ t then,
$$w_i(new) = w_i(old) + \alpha(t - y_{in}) x_i$$

$$b(new) = b(old) + \alpha(t - y_{in})$$
Case 2 − if y = t then,
$$w_i(new) = w_i(old)$$

$$b(new) = b(old)$$
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
$(t - y_{in})$ is the computed error.
Step 8 − Test for the stopping condition, which will happen when there is no change in weight or
the highest weight change occurred during training is smaller than the specified tolerance.
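A compact Python sketch of this Adaline training loop is given below (an illustrative addition; the bipolar AND data, learning rate and tolerance are assumed values):

```python
import numpy as np

def train_adaline(samples, targets, alpha=0.1, tol=1e-4, max_epochs=100):
    w, b = np.zeros(samples.shape[1]), 0.0
    for _ in range(max_epochs):
        biggest_change = 0.0
        for x, t in zip(samples, targets):
            y_in = b + np.dot(x, w)        # net input
            e = t - y_in                   # (t - y_in) is the computed error
            w += alpha * e * x             # w(new) = w(old) + alpha(t - y_in)x
            b += alpha * e                 # b(new) = b(old) + alpha(t - y_in)
            biggest_change = max(biggest_change,
                                 float(np.max(np.abs(alpha * e * x))))
        if biggest_change < tol:           # stopping condition
            break
    return w, b

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)  # bipolar AND
T = np.array([1, -1, -1, -1], dtype=float)
w, b = train_adaline(X, T)
print(w, b, np.where(X @ w + b >= 0, 1, -1))  # bipolar step for classification
```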
Multiple Adaptive Linear Neuron (Madaline)
Some important points about Madaline are as follows −
It is just like a multilayer perceptron, where Adaline acts as a hidden unit between the input and the Madaline layer.
The weights and the bias between the input and Adaline layers, as we see in the Adaline architecture, are adjustable.
The Adaline and Madaline layers have fixed weights and bias of 1.
Training can be done with the help of the Delta rule.
Architecture
The architecture of Madaline consists of “n” neurons of the input layer, “m” neurons of the
Adaline layer, and 1 neuron of the Madaline layer. The Adaline layer can be considered as the
hidden layer as it is between the input layer and the output layer, i.e. the Madaline layer.
Training Algorithm
By now we know that only the weights and bias between the input and the Adaline layer are to
be adjusted, and the weights and bias between the Adaline and the Madaline layer are fixed.
Step 1 − Initialize the following to start the training −
Weights
Bias
Learning rate $\alpha$
For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning
rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-7 for every bipolar training pair s:t.
Step 4 − Activate each input unit as follows −

$$x_i = s_i \ (i = 1 \text{ to } n)$$
Step 5 − Obtain the net input at each hidden layer, i.e. the Adaline layer with the following
relation −
$$Q_{inj} = b_j + \sum_{i=1}^{n} x_i w_{ij} \quad j = 1 \text{ to } m$$
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output at the Adaline and the
Madaline layer −
$$f(x) = \begin{cases} 1 & \text{if } x \geqslant 0 \\ -1 & \text{if } x < 0 \end{cases}$$

Output at the Adaline unit −

$$Q_j = f(Q_{inj})$$

Final output of the network −

$$y = f(y_{in})$$
i.e. $$y_{in} = b_0 + \sum_{j=1}^{m} Q_j v_j$$
Step 7 − Calculate the error and adjust the weights as follows −

Case 1 − if y ≠ t and t = 1 then,

$$w_{ij}(new) = w_{ij}(old) + \alpha(1 - Q_{inj}) x_i$$

$$b_j(new) = b_j(old) + \alpha(1 - Q_{inj})$$
In this case, the weights would be updated on Qj where the net input is close to 0 because t = 1.
Case 2 − if y ≠ t and t = -1 then,
$$w_{ik}(new) = w_{ik}(old) + \alpha(-1 - Q_{ink}) x_i$$

$$b_k(new) = b_k(old) + \alpha(-1 - Q_{ink})$$
In this case, the weights would be updated on Qk where the net input is positive because t = -1.
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
Case 3 − if y = t then, there is no change in the weights.
Step 8 − Test for the stopping condition, which will happen when there is no change in weight or
the highest weight change occurred during training is smaller than the specified tolerance.
Back Propagation Neural Networks

Architecture
As shown in the diagram, the architecture of BPN has three interconnected layers having
weights on them. The hidden layer as well as the output layer also has bias, whose weight is
always 1, on them. As is clear from the diagram, the working of BPN is in two phases. One
phase sends the signal from the input layer to the output layer, and the other phase back
propagates the error from the output layer to the input layer.
Training Algorithm
For training, BPN will use binary sigmoid activation function. The training of BPN will have the
following three phases.
Phase 1 − Feed Forward Phase
Phase 2 − Back Propagation of error
Phase 3 − Updating of weights
Step 1 − Initialize the weights and the learning rate. For easy calculation and simplicity, take some small random values.
Step 2 − Continue step 3-11 when the stopping condition is not true.
Step 3 − Continue step 4-10 for every training pair.
Phase 1
Step 4 − Each input unit receives input signal xi and sends it to the hidden unit for all i = 1 to n
Step 5 − Calculate the net input at the hidden unit using the following relation −
$$Q_{inj} = b_{0j} + \sum_{i=1}^{n} x_i v_{ij} \quad j = 1 \text{ to } p$$
Here $b_{0j}$ is the bias on hidden unit j, and $v_{ij}$ is the weight on unit j of the hidden layer coming from unit i of the input layer.
Now calculate the net output by applying the following activation function
$$Q_j = f(Q_{inj})$$
Send these output signals of the hidden layer units to the output layer units.
Step 6 − Calculate the net input at the output layer unit using the following relation −
$$y_{ink} = b_{0k} + \sum_{j=1}^{p} Q_j w_{jk} \quad k = 1 \text{ to } m$$
Here $b_{0k}$ is the bias on output unit k, and $w_{jk}$ is the weight on unit k of the output layer coming from unit j of the hidden layer.
Calculate the net output by applying the following activation function
$$y_k = f(y_{ink})$$
Phase 2
Step 7 − Compute the error correcting term, in correspondence with the target pattern received
at each output unit, as follows −
$$\delta_k = (t_k - y_k) f'(y_{ink})$$
On this basis, update the weight and bias as follows −
$$\Delta w_{jk} = \alpha \delta_k Q_j$$

$$\Delta b_{0k} = \alpha \delta_k$$
Then, send $\delta_k$ back to the hidden layer.
Step 8 − Now each hidden unit sums its delta inputs from the output units −

$$\delta_{inj} = \sum_{k=1}^{m} \delta_k w_{jk}$$
The error term can be calculated as follows −

$$\delta_j = \delta_{inj} f'(Q_{inj})$$
On this basis, update the weight and bias as follows −

$$\Delta v_{ij} = \alpha \delta_j x_i$$

$$\Delta b_{0j} = \alpha \delta_j$$
Phase 3

Step 9 − Each output unit ($y_k$, k = 1 to m) updates the weight and bias as follows −

$$w_{jk}(new) = w_{jk}(old) + \Delta w_{jk}$$

$$b_{0k}(new) = b_{0k}(old) + \Delta b_{0k}$$
Step 10 − Each hidden unit ($z_j$, j = 1 to p) updates the weight and bias as follows −

$$v_{ij}(new) = v_{ij}(old) + \Delta v_{ij}$$

$$b_{0j}(new) = b_{0j}(old) + \Delta b_{0j}$$
Step 11 − Check for the stopping condition, which may be either the number of epochs reached
or the target output matches the actual output.
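The three phases can be sketched end to end in Python as follows. This is an illustrative addition, not the text's own code: it uses the binary sigmoid throughout, the XOR function as assumed toy data, and small random initial weights; convergence may require a different seed or more epochs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny BPN: n inputs -> p hidden -> m outputs; v = input-to-hidden weights,
# w = hidden-to-output weights, matching the names used in the text.
rng = np.random.default_rng(0)
n, p, m, alpha = 2, 4, 1, 0.5
v, b0j = rng.normal(0, 0.5, (n, p)), np.zeros(p)   # Step 1: small random values
w, b0k = rng.normal(0, 0.5, (p, m)), np.zeros(m)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR (toy data)
T = np.array([[0], [1], [1], [0]], dtype=float)

for epoch in range(5000):
    for x, t in zip(X, T):
        # Phase 1: feed forward
        Q = sigmoid(b0j + x @ v)
        y = sigmoid(b0k + Q @ w)
        # Phase 2: back propagation of error; f'(x) = f(x)(1 - f(x)) for the sigmoid
        delta_k = (t - y) * y * (1 - y)
        delta_j = (delta_k @ w.T) * Q * (1 - Q)
        # Phase 3: updating of weights and biases
        w += alpha * np.outer(Q, delta_k); b0k += alpha * delta_k
        v += alpha * np.outer(x, delta_j); b0j += alpha * delta_j

print(np.round(sigmoid(b0k + sigmoid(b0j + X @ v) @ w)))  # should approach T
```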
Mathematical Formulation

For the derivation, let $z_j$ denote the output of hidden unit j, $v_{ij}$ the input-to-hidden weights, and $w_{jk}$ the hidden-to-output weights, so that

$$y_{ink} = \sum_{j} z_j w_{jk}$$

And $y_{inj} = \sum_{i} x_i v_{ij}$

The error, which must be minimized, is −

$$E = \frac{1}{2} \sum_{k} [t_k - y_k]^2$$

By using the chain rule, we have −

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \left( \frac{1}{2} \sum_{k} [t_k - y_k]^2 \right)$$

$$= \frac{\partial}{\partial w_{jk}} \left( \frac{1}{2} [t_k - f(y_{ink})]^2 \right)$$

$$= -[t_k - y_k] \frac{\partial}{\partial w_{jk}} f(y_{ink})$$

$$= -[t_k - y_k] f'(y_{ink}) \frac{\partial}{\partial w_{jk}} (y_{ink})$$

$$= -[t_k - y_k] f'(y_{ink}) z_j$$

Now let us say $\delta_k = [t_k - y_k] f'(y_{ink})$.

The weights on connections to the hidden unit $z_j$ can be treated in the same way −

$$\frac{\partial E}{\partial v_{ij}} = -\sum_{k} \delta_k \frac{\partial}{\partial v_{ij}} (y_{ink})$$

Putting in the value of $y_{ink}$ and defining the hidden-unit error term gives −

$$\delta_j = \sum_{k} \delta_k w_{jk} f'(z_{inj})$$

The weight updates are therefore −

$$\Delta w_{jk} = -\alpha \frac{\partial E}{\partial w_{jk}} = \alpha \delta_k z_j$$

$$\Delta v_{ij} = -\alpha \frac{\partial E}{\partial v_{ij}} = \alpha \delta_j x_i$$
Unsupervised Learning
As the name suggests, this type of learning is done without the supervision of a teacher. This
learning process is independent. During the training of ANN under unsupervised learning, the
input vectors of similar type are combined to form clusters. When a new input pattern is applied,
then the neural network gives an output response indicating the class to which input pattern
belongs. In this case, there would be no feedback from the environment as to what the desired output should be or whether it is correct or incorrect. Hence, in this type of learning the network itself must discover the patterns and features in the input data, and the relation of the input data to the output.
Winner-Takes-All Networks
These kinds of networks are based on the competitive learning rule and use the strategy of choosing the neuron with the greatest total input as the winner. The connections between the output neurons show the competition between them; one of them would be ‘ON’, which means it would be the winner, and the others would be ‘OFF’.
Following are some of the networks based on this simple concept using unsupervised learning.
Hamming Network
In most of the neural networks using unsupervised learning, it is essential to compute the distance and perform comparisons. One such network is the Hamming network, which clusters given input vectors into different groups. Following are some important features of Hamming networks −
Lippmann started working on Hamming networks in 1987.
Max Net
This is also a fixed weight network, which serves as a subnet for selecting the node having the
highest input. All the nodes are fully interconnected and there exists symmetrical weights in all
these weighted interconnections.
Architecture
It uses an iterative mechanism in which each node receives inhibitory inputs from all other nodes through connections. The single node whose value is maximum would be active, or the winner, and the activations of all other nodes would become inactive. Max Net uses the identity activation function with −
$$f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$
The task of this net is accomplished by the self-excitation weight of +1 and the mutual inhibition magnitude, which is set like $[0 < \varepsilon < \frac{1}{m}]$, where “m” is the total number of nodes.
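A minimal Python sketch of the Max Net iteration follows (an illustrative addition; the initial activations are assumed example values and ε is chosen inside the valid range):

```python
import numpy as np

def maxnet(activations, epsilon=None):
    a = np.array(activations, dtype=float)
    m = len(a)
    if epsilon is None:
        epsilon = 1.0 / (2 * m)       # any value with 0 < epsilon < 1/m works
    while np.count_nonzero(a) > 1:    # iterate until a single node survives
        # Self-excitation of +1 and inhibition of epsilon from every other node,
        # passed through f(x) = x for x > 0, else 0
        a = np.maximum(0.0, a - epsilon * (a.sum() - a))
    return a

print(maxnet([0.2, 0.4, 0.6, 0.8]))   # only the largest activation survives
```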
Competitive Learning in ANN
It is concerned with unsupervised training, in which the output nodes compete with each other to represent the input pattern. To understand this learning rule, we must understand the competitive net, which is explained as follows −
Basic Concept of Competitive Network
This network is just like a single layer feed-forward network having feedback connection
between the outputs. The connections between the outputs are inhibitory type, which is shown
by dotted lines, which means the competitors never support themselves.
Basic Concept of Competitive Learning Rule
As said earlier, there would be competition among the output nodes so the main concept is -
during training, the output unit that has the highest activation to a given input pattern, will be
declared the winner. This rule is also called Winner-takes-all because only the winning neuron is
updated and the rest of the neurons are left unchanged.
Mathematical Formulation
Following are the three important factors for mathematical formulation of this learning rule −
Condition to be a winner
Suppose if a neuron yk wants to be the winner, then there would be the following
condition
$$y_k = \begin{cases} 1 & \text{if } v_k > v_j \text{ for all } j, j \neq k \\ 0 & \text{otherwise} \end{cases}$$
It means that if any neuron, say, yk wants to win, then its induced local field (the output
of the summation unit), say vk, must be the largest among all the other neurons in the
network.
Condition of the sum total of weight
Another constraint over the competitive learning rule is the sum total of weights to a
particular output neuron is going to be 1. For example, if we consider neuron k then
$$\sum_{j} w_{kj} = 1 \quad \text{for all } k$$
Change of weight for winner
If a neuron does not respond to the input pattern, then no learning takes place in that neuron. However, if a particular neuron wins, the corresponding weights are adjusted as follows −

$$\Delta w_{kj} = \begin{cases} -\alpha(x_j - w_{kj}), & \text{if neuron } k \text{ wins} \\ 0, & \text{if neuron } k \text{ loses} \end{cases}$$
Here $\alpha$ is the learning rate.

This clearly shows that we are favoring the winning neuron by adjusting its weight; if a neuron loses, we need not bother to re-adjust its weight.
K-means Clustering Algorithm
K-means is one of the most popular clustering algorithms, in which we use the concept of a partition procedure. We start with an initial partition and repeatedly move patterns from one cluster to another, until we get a satisfactory result.
Algorithm
Step 1 − Select k points as the initial centroids. Initialize k prototypes $(w_1, \dots, w_k)$; for example, we can identify them with randomly chosen input vectors −

$$W_j = i_p, \quad \text{where } j \in \{1, \dots, k\} \text{ and } p \in \{1, \dots, n\}$$

Step 2 − Repeat steps 3-5 until the cluster membership no longer changes.
Step 3 − For each input vector ip where p ∈ {1,…,n}, put ip in the cluster Cj* with the nearest
prototype wj* having the following relation
$$|i_p - w_{j*}| \leq |i_p - w_j|, \quad j \in \{1, \dots, k\}$$
Step 4 − For each cluster Cj, where j ∈ { 1,…,k}, update the prototype wj to be the centroid of
all samples currently in Cj , so that
$$w_j = \frac{\sum_{i_p \in C_j} i_p}{|C_j|}$$

Step 5 − Calculate the total quantization error −

$$E = \sum_{j=1}^{k} \sum_{i_p \in w_j} |i_p - w_j|^2$$
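A short Python sketch of these steps is given below (an illustrative addition; the sample points and k are assumed values):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: prototypes initialized with randomly chosen input vectors
    w = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Step 3: assign every point to the cluster with the nearest prototype
        d = np.linalg.norm(points[:, None, :] - w[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: move each prototype to the centroid of its current cluster
        new_w = np.array([points[labels == j].mean(axis=0)
                          if np.any(labels == j) else w[j] for j in range(k)])
        if np.allclose(new_w, w):     # membership stable -> stop
            break
        w = new_w
    return w, labels

pts = np.array([[0, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
print(kmeans(pts, k=2))
```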
Neocognitron
It is a multilayer feedforward network, which was developed by Fukushima in the 1980s. This model is based on supervised learning and is used for visual pattern recognition, mainly hand-written characters. It is basically an extension of the Cognitron network, which was also developed by Fukushima in 1975.
Architecture
It is a hierarchical network, which comprises many layers and there is a pattern of connectivity
locally in those layers.
As we have seen in the above diagram, the neocognitron is divided into different connected layers, and each layer has two types of cells. Explanation of these cells is as follows −
S-Cell − It is called a simple cell, which is trained to respond to a particular pattern or a group of
patterns.
C-Cell − It is called a complex cell, which combines the output from S-cell and simultaneously
lessens the number of units in each array. In another sense, C-cell displaces the result of S-cell.
Training Algorithm
Training of the neocognitron progresses layer by layer. The weights from the input layer to the first layer are trained and frozen. Then, the weights from the first layer to the second layer are trained, and so on. The internal calculations between the S-cell and C-cell depend upon the weights coming from the previous layers. Hence, we can say that the training algorithm depends upon the calculations in the S-cell and C-cell.
Calculations in S-cell
The S-cell possesses the excitatory signal received from the previous layer and possesses
inhibitory signals obtained within the same layer.
$$\theta = \sqrt{\sum \sum t_i c_i^2}$$

The scaled input of the S-cell can be calculated as follows −

$$x = \frac{1 + e}{1 + vw_0} - 1$$

Here, $e = \sum_i c_i w_i$

The output of the S-cell is then −

$$s = \begin{cases} x, & \text{if } x \geq 0 \\ 0, & \text{if } x < 0 \end{cases}$$
Calculations in C-cell
The net input of the C-cell is −

$$C = \sum_i s_i x_i$$
Here, si is the output from S-cell and xi is the fixed weight from S-cell to C-cell.
The final output is as follows −
$$C_{out} = \begin{cases} \dfrac{C}{a + C}, & \text{if } C > 0 \\ 0, & \text{otherwise} \end{cases}$$
Here ‘a’ is the parameter that depends on the performance of the network.
Learning Vector Quantization
Learning Vector Quantization (LVQ), different from Vector Quantization (VQ) and Kohonen Self-Organizing Maps (KSOM), is basically a competitive network that uses supervised learning.
We may define it as a process of classifying the patterns where each output unit represents a
class. As it uses supervised learning, the network will be given a set of training patterns with
known classification along with an initial distribution of the output class. After completing the
training process, LVQ will classify an input vector by assigning it to the same class as that of the
output unit.
Architecture
The following figure shows the architecture of LVQ, which is quite similar to the architecture of KSOM. As we can see, there are “n” input units and “m” output units. The layers are fully interconnected and have weights on them.
Parameters Used
Following are the parameters used in LVQ training process as well as in the flowchart
x = training vector (x1,...,xi,...,xn)
Training Algorithm
Step 3 − Continue with steps 4-9, if the condition for stopping this algorithm is not met.
Step 4 − Obtain the winning unit by calculating the square of the Euclidean distance for j = 1 to m −

$$D(j) = \sum_{i=1}^{n} (x_i - w_{ij})^2$$
Step 7 − Calculate the new weight of the winning unit by the following relation −
if T = $C_j$ then $$w_j(new) = w_j(old) + \alpha[x - w_j(old)]$$

if T ≠ $C_j$ then $$w_j(new) = w_j(old) - \alpha[x - w_j(old)]$$
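One LVQ training step can be sketched in Python as follows (an illustrative addition; the reference vectors, their classes and the input are assumed example values):

```python
import numpy as np

def lvq_step(W, classes, x, t, alpha=0.1):
    # Winner J: smallest squared Euclidean distance D(j)
    D = ((W - x) ** 2).sum(axis=1)
    j = int(D.argmin())
    if classes[j] == t:                 # T = Cj: attract the winner
        W[j] += alpha * (x - W[j])
    else:                               # T != Cj: repel the winner
        W[j] -= alpha * (x - W[j])
    return W

W = np.array([[0.0, 0.0], [1.0, 1.0]])   # reference vectors with known classes
classes = [0, 1]
W = lvq_step(W, classes, x=np.array([0.2, 0.1]), t=0)
print(W)   # the class-0 vector moved toward the input
```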
Flowchart
Variants
Three other variants namely LVQ2, LVQ2.1 and LVQ3 have been developed by Kohonen.
Complexity in all these three variants, due to the concept that the winner as well as the runner-
up unit will learn, is more than in LVQ.
LVQ2
As discussed in the concept of the other variants of LVQ above, the condition for LVQ2 is formed by a window. This window is based on the following parameters −

x − the current input vector

$y_c$ − the reference vector closest to x

$y_r$ − the other reference vector, which is next closest to x

$d_c$ − the distance of x from $y_c$

$d_r$ − the distance of x from $y_r$

The input vector x falls in the window if −

$$\frac{d_c}{d_r} > 1 - \theta \quad \text{and} \quad \frac{d_r}{d_c} > 1 + \theta$$

Here, $\theta$ is a small positive window-width parameter.
The updating formulae are as follows −

$$y_c(t + 1) = y_c(t) - \alpha(t)[x(t) - y_c(t)] \quad \text{(belongs to a different class)}$$

$$y_r(t + 1) = y_r(t) + \alpha(t)[x(t) - y_r(t)] \quad \text{(belongs to the same class)}$$

Here $\alpha$ is the learning rate.
LVQ2.1
In LVQ2.1, we will take the two closest vectors namely yc1 and yc2 and the condition for window
is as follows −
$$\text{Min}\left[\frac{d_{c1}}{d_{c2}}, \frac{d_{c2}}{d_{c1}}\right] > (1 - \theta)$$

$$\text{Max}\left[\frac{d_{c1}}{d_{c2}}, \frac{d_{c2}}{d_{c1}}\right] < (1 + \theta)$$
The updating formulae are as follows −

$$y_{c1}(t + 1) = y_{c1}(t) - \alpha(t)[x(t) - y_{c1}(t)] \quad \text{(belongs to a different class)}$$

$$y_{c2}(t + 1) = y_{c2}(t) + \alpha(t)[x(t) - y_{c2}(t)] \quad \text{(belongs to the same class)}$$

Here, $\alpha$ is the learning rate.
LVQ3
In LVQ3, we will take the two closest vectors namely yc1 and yc2 and the condition for window is
as follows −
$$\text{Min}\left[\frac{d_{c1}}{d_{c2}}, \frac{d_{c2}}{d_{c1}}\right] > (1 - \theta)(1 + \theta)$$

Here $\theta \approx 0.2$
The updating formulae, applied when the two closest vectors and x belong to the same class, are as follows −

$$y_{c1}(t + 1) = y_{c1}(t) + \beta(t)[x(t) - y_{c1}(t)]$$

$$y_{c2}(t + 1) = y_{c2}(t) + \beta(t)[x(t) - y_{c2}(t)]$$

Here $\beta$ is a multiple of the learning rate $\alpha$, with $\beta = m\alpha(t)$ for every 0.1 < m < 0.5.
Adaptive Resonance Theory
This network was developed by Stephen Grossberg and Gail Carpenter in 1987. It is based on competition and uses an unsupervised learning model. Adaptive Resonance Theory (ART) networks, as the name suggests, are always open to new learning (adaptive) without losing the old patterns (resonance). Basically, an ART network is a vector classifier which accepts an input vector and classifies it into one of the categories depending upon which of the stored patterns it resembles the most.
Operating Principle
The main operation of ART classification can be divided into the following phases −
Recognition phase − The input vector is compared with the classification presented at
every node in the output layer. The output of the neuron becomes “1” if it best matches
with the classification applied, otherwise it becomes “0”.
Comparison phase − In this phase, a comparison of the input vector to the comparison
layer vector is done. The condition for reset is that the degree of similarity would be less
than vigilance parameter.
Search phase − In this phase, the network will search for reset as well as the match
done in the above phases. Hence, if there would be no reset and the match is quite
good, then the classification is over. Otherwise, the process would be repeated and the
other stored pattern must be sent to find the correct match.
ART1
It is a type of ART that is designed to cluster binary vectors. We can understand this with its architecture.
Architecture of ART1
Cluster Unit (F2 layer) − This is a competitive layer. The unit having the largest net input is selected to learn the input pattern. The activations of all other cluster units are set to 0.
Reset Mechanism − The work of this mechanism is based upon the similarity between the top-down weight and the input vector. Now, if the degree of this similarity is less than the vigilance parameter, then the cluster is not allowed to learn the pattern, and a reset happens.
Supplement Unit − Actually, the issue with the reset mechanism is that the layer F2 must be inhibited under certain conditions and must also be available when some learning happens. That is why two supplemental units, namely G1 and G2, are added along with the reset unit R. They are called gain control units. These units receive and send signals to the other units present in the network. ‘+’ indicates an excitatory signal, while ‘−’ indicates an inhibitory signal.
Parameters Used
ρ − Vigilance parameter
||x|| − Norm of vector x
Algorithm
Step 1 − Initialize the learning rate, the vigilance parameter, and the weights as follows −
$$\alpha > 1 \quad \text{and} \quad 0 < \rho \leq 1$$

$$0 < b_{ij}(0) < \frac{\alpha}{\alpha - 1 + n} \quad \text{and} \quad t_{ij}(0) = 1$$
Step 2 − Continue step 3-9, when the stopping condition is not true.
Step 3 − Continue step 4-6 for every training input.
Step 4 − Set activations of all F1(a) and F1 units as follows
F2 = 0 and F1(a) = input vectors
Step 5 − Input signal from F1(a) to F1(b) layer must be sent like
$$s_i = x_i$$
Step 6 − For every inhibited F2 node −

$$y_j = \sum_i b_{ij} x_i \quad \text{(the condition is } y_j \neq -1\text{)}$$
$$x_i = s_i t_{Ji}$$
Step 10 − Now, after calculating the norm of vector x and vector s, we need to check the reset
condition as follows −
If ||x||/||s|| < vigilance parameter ρ, then inhibit node J and go to step 7.
Step 11 − Weight updating for node J can be done as follows −

$$b_{ij}(new) = \frac{\alpha x_i}{\alpha - 1 + ||x||}$$

$$t_{ij}(new) = x_i$$
Step 12 − The stopping condition for algorithm must be checked and it may be as follows −
Do not have any change in weight.
Reset is not performed for units.
Maximum number of epochs reached.
Kohonen Self-Organizing Feature Maps
Suppose we have some pattern of arbitrary dimensions; however, we need it in one or two dimensions. Then the process of feature mapping would be very useful to convert the wide pattern space into a typical feature space. Now, the question arises: why do we require a self-organizing feature map? The reason is that, along with the capability to convert arbitrary dimensions into 1-D or 2-D, it must also have the ability to preserve the neighbor topology.
Rectangular Grid Topology
This topology has 24 nodes in the distance-2 grid, 16 nodes in the distance-1 grid, and 8 nodes in the distance-0 grid, which means the difference between each rectangular grid is 8 nodes. The winning unit is indicated by #.
Hexagonal Grid Topology
This topology has 18 nodes in the distance-2 grid, 12 nodes in the distance-1 grid, and 6 nodes in the distance-0 grid, which means the difference between each hexagonal grid is 6 nodes. The winning unit is indicated by #.
Architecture
The architecture of KSOM is similar to that of the competitive network. With the help of
neighborhood schemes, discussed earlier, the training can take place over the extended region
of the network.
Training Algorithm

Step 1 − Initialize the weights, the learning rate $\alpha$, and the neighborhood topological scheme.
Step 2 − Continue step 3-9, when the stopping condition is not true.
Step 3 − Continue step 4-6 for every input vector x.
Step 4 − Calculate Square of Euclidean Distance for j = 1 to m
$$D(j) = \sum_{i=1}^{n} (x_i - w_{ij})^2$$

Step 5 − Obtain the winning unit J where D(j) is minimum.
Step 6 − Calculate the new weight of the winning unit by the following relation −
$$w_{ij}(new) = w_{ij}(old) + \alpha[x_i - w_{ij}(old)]$$
Step 7 − Update the learning rate $\alpha$ by the following relation −

$$\alpha(t + 1) = 0.5\alpha(t)$$
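The whole loop can be sketched in Python as follows (an illustrative addition; for brevity the neighborhood radius is taken as 0, i.e. only the winner is updated, and the data are assumed example values):

```python
import numpy as np

def train_ksom(data, m, epochs=20, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.random((m, data.shape[1]))          # Step 1: initialize the weights
    for _ in range(epochs):
        for x in data:
            D = ((x - w) ** 2).sum(axis=1)      # Step 4: squared distances
            J = int(D.argmin())                 # winning unit
            w[J] += alpha * (x - w[J])          # Step 6: update the winner
        alpha *= 0.5                            # Step 7: decay the learning rate
    return w

data = np.array([[0.1, 0.2], [0.15, 0.1], [0.9, 0.8], [0.85, 0.95]])
print(train_ksom(data, m=2))   # one weight vector settles in each data cloud
```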
Associative Memory Network
These kinds of neural networks work on the basis of pattern association, which means they can store different patterns and, at the time of giving an output, can produce one of the stored patterns by matching it with the given input pattern. These types of memories are also called Content-Addressable Memory (CAM). Associative memory makes a parallel search with the stored patterns as data files.
Following are the two types of associative memories we can observe −
Auto Associative Memory
Hetero Associative Memory
Auto Associative Memory

Architecture
As shown in the following figure, the architecture of Auto Associative memory network has ‘n’
number of input training vectors and similar ‘n’ number of output target vectors.
Training Algorithm
For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 (i = 1 to n, j = 1 to n)
Step 2 − Perform steps 3-4 for each input vector.

Step 3 − Activate each input unit as follows −

$$x_i = s_i \ (i = 1 \text{ to } n)$$

Step 4 − Activate each output unit as follows −

$$y_j = s_j \ (j = 1 \text{ to } n)$$

Step 5 − Adjust the weights as follows −

$$w_{ij}(new) = w_{ij}(old) + x_i y_j$$
Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to n
$$y_{inj} = \sum_{i=1}^{n} x_i w_{ij}$$
Step 5 − Apply the following activation function to calculate the output −

$$y_j = f(y_{inj}) = \begin{cases} +1 & \text{if } y_{inj} > 0 \\ -1 & \text{if } y_{inj} \leqslant 0 \end{cases}$$
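Storage and recall can be sketched in a few lines of Python (an illustrative addition; the stored pattern and the noisy probe are assumed example values):

```python
import numpy as np

# Store one bipolar pattern with the Hebb rule: w_ij += x_i * y_j (x == y here)
s = np.array([1, -1, 1, -1])
W = np.outer(s, s).astype(float)
np.fill_diagonal(W, 0)             # no self-connections

# Recall from a noisy version of the stored pattern
noisy = np.array([1, 1, 1, -1])    # one flipped component
y_in = noisy @ W                   # y_inj = sum_i x_i w_ij
y = np.where(y_in > 0, 1, -1)      # +1 if y_in > 0, else -1
print(y)                           # recovers [ 1 -1  1 -1]
```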
Hetero Associative Memory

Architecture
As shown in the following figure, the architecture of Hetero Associative Memory network has ‘n’
number of input training vectors and ‘m’ number of output target vectors.
Training Algorithm
For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 (i = 1 to n, j = 1 to m)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit as follows −
$$x_i = s_i \ (i = 1 \text{ to } n)$$

Step 4 − Activate each output unit as follows −

$$y_j = s_j \ (j = 1 \text{ to } m)$$
Step 5 − Adjust the weights as follows −

$$w_{ij}(new) = w_{ij}(old) + x_i y_j$$
Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to m;
$$y_{inj} = \sum_{i=1}^{n} x_i w_{ij}$$
Step 5 − Apply the following activation function to calculate the output −

$$y_j = f(y_{inj}) = \begin{cases} +1 & \text{if } y_{inj} > 0 \\ 0 & \text{if } y_{inj} = 0 \\ -1 & \text{if } y_{inj} < 0 \end{cases}$$
Hopfield neural network was invented by Dr. John J. Hopfield in 1982. It consists of a single
layer which contains one or more fully connected recurrent neurons. The Hopfield network is
commonly used for auto-association and optimization tasks.
Discrete Hopfield Network
A discrete Hopfield network operates in a discrete fashion; in other words, it can be said that the input and output patterns are discrete vectors, which can be either binary (0, 1) or bipolar (+1, -1) in nature. The network has symmetrical weights with no self-connections, i.e., $w_{ij} = w_{ji}$ and $w_{ii} = 0$.
Architecture
Following are some important points to keep in mind about discrete Hopfield network −
This model consists of neurons with one inverting and one non-inverting output.
The output of each neuron should be the input of other neurons but not the input of self.
The output from Y1 going to Y2, Yi and Yn have the weights w12, w1i and w1n respectively.
Similarly, other arcs have the weights on them.
Training Algorithm
During training of discrete Hopfield network, weights will be updated. As we know that we can
have the binary input vectors as well as bipolar input vectors. Hence, in both the cases, weight
updates can be done with the following relation
Case 1 − Binary input patterns
For a set of binary patterns s(p), p = 1 to P
Here, s(p) = s1(p), s2(p),..., si(p),..., sn(p)
Weight Matrix is given by
$$w_{ij} = \sum_{p=1}^{P} [2s_i(p) - 1][2s_j(p) - 1] \quad \text{for } i \neq j$$
Case 2 − Bipolar input patterns

For a set of bipolar patterns s(p), p = 1 to P

Here, s(p) = s1(p), s2(p),..., si(p),..., sn(p)

Weight Matrix is given by

$$w_{ij} = \sum_{p=1}^{P} [s_i(p)][s_j(p)] \quad \text{for } i \neq j$$
Testing Algorithm
Step 1 − Initialize the weights, which are obtained from training algorithm by using Hebbian
principle.
Step 2 − Perform steps 3-9 if the activations of the network are not yet consolidated.
Step 3 − For each input vector X, perform steps 4-8.
Step 4 − Make initial activation of the network equal to the external input vector X as follows −
$$y_i = x_i \quad \text{for } i = 1 \text{ to } n$$
Step 5 − For each unit $Y_i$, perform steps 6-9.

Step 6 − Calculate the net input of the network as follows −

$$y_{ini} = x_i + \sum_{j} y_j w_{ji}$$
Step 7 − Apply the activation as follows over the net input to calculate the output −
$$y_i = \begin{cases} 1 & \text{if } y_{ini} > \theta_i \\ y_i & \text{if } y_{ini} = \theta_i \\ 0 & \text{if } y_{ini} < \theta_i \end{cases}$$

Here $\theta_i$ is the threshold.
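The testing procedure can be sketched in Python as follows (an illustrative addition; a single stored binary pattern and a zero threshold are assumed for the example):

```python
import numpy as np

def hopfield_recall(W, x, theta=0.0, max_sweeps=10):
    y = x.astype(float).copy()                # Step 4: y_i = x_i
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(x)):               # one unit updates at a time
            y_in = x[i] + y @ W[:, i]         # Step 6: y_ini = x_i + sum_j y_j w_ji
            new = 1.0 if y_in > theta else (y[i] if y_in == theta else 0.0)
            if new != y[i]:
                y[i], changed = new, True
        if not changed:                       # activations consolidated
            break
    return y

s = np.array([1, 0, 1])                       # stored binary pattern
b = 2 * s - 1                                 # binary weight rule uses [2s - 1]
W = np.outer(b, b).astype(float)
np.fill_diagonal(W, 0)
print(hopfield_recall(W, np.array([1, 0, 0])))  # settles on the stored pattern
```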
Energy Function Evaluation
An energy function is defined as a function that is a bounded and non-increasing function of the state of the system.
The energy function $E_f$, also called the Lyapunov function, determines the stability of the discrete Hopfield network, and is characterized as follows −
$$E_f = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j w_{ij} - \sum_{i=1}^{n} x_i y_i + \sum_{i=1}^{n} \theta_i y_i$$
Condition − In a stable network, whenever the state of node changes, the above energy
function will decrease.
Suppose that node i changes its state from $y_i^{(k)}$ to $y_i^{(k+1)}$; then the energy change $\Delta E_f$ is given by the following relation −

$$\Delta E_f = E_f(y_i^{(k+1)}) - E_f(y_i^{(k)})$$

$$= -\left(\sum_{j=1}^{n} w_{ij} y_j^{(k)} + x_i - \theta_i\right)\left(y_i^{(k+1)} - y_i^{(k)}\right)$$

$$= -(net_i) \Delta y_i$$

Here $\Delta y_i = y_i^{(k+1)} - y_i^{(k)}$
The change in energy depends on the fact that only one unit can update its activation at a time.
Continuous Hopfield Network
In comparison with the discrete Hopfield network, a continuous network has time as a continuous variable. It is also used in auto-association and optimization problems such as the travelling salesman problem.
Model − The model or architecture can be built up by adding electrical components such as amplifiers, which can map the input voltage to the output voltage over a sigmoid activation function.
The energy function in this case is −

$$E_f = \frac{1}{2} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} y_i y_j w_{ij} - \sum_{i=1}^{n} x_i y_i + \frac{1}{\lambda} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} w_{ij} g_{ri} \int_{0}^{y_i} a^{-1}(y) \, dy$$
Boltzmann Machine
These are stochastic learning processes having recurrent structure and are the basis of the
early optimization techniques used in ANN. Boltzmann Machine was invented by Geoffrey
Hinton and Terry Sejnowski in 1985. More clarity can be observed in the words of Hinton on
Boltzmann Machine.
“A surprising feature of this network is that it uses only locally available information. The change
of weight depends only on the behavior of the two units it connects, even though the change
optimizes a global measure” - Ackley, Hinton 1985.
Some important points about Boltzmann Machine −
They use recurrent structure.
They consist of stochastic neurons, which have one of the two possible states, either 1
or 0.
Some of the neurons in this are adaptive (free state) and some are clamped (frozen
state).
If we apply simulated annealing on discrete Hopfield network, then it would become
Boltzmann Machine.
Architecture
The following diagram shows the architecture of Boltzmann machine. It is clear from the
diagram, that it is a two-dimensional array of units. Here, weights on interconnections between
units are –p where p > 0. The weights of self-connections are given by b where b > 0.
Training Algorithm
As Boltzmann machines have fixed weights, there is no training algorithm, since we do not need to update the weights in the network. However, to test the network we have to set the weights as well as find the consensus function (CF).
Boltzmann machine has a set of units Ui and Uj and has bi-directional connections on them.
We are considering the fixed weight say wij.
wij ≠ 0 if Ui and Uj are connected.
There also exists a symmetry in weighted interconnection, i.e. wij = wji.
wii also exists, i.e. there would be the self-connection between units.
For any unit Ui, its state ui would be either 1 or 0.
The main objective of Boltzmann Machine is to maximize the Consensus Function (CF) which
can be given by the following relation
$$CF = \sum_{i} \sum_{j \leqslant i} w_{ij} u_i u_j$$
Now, when the state changes from either 1 to 0 or from 0 to 1, then the change in consensus
can be given by the following relation −
$$\Delta CF = (1 - 2u_i)\left(w_{ii} + \sum_{j \neq i} u_j w_{ij}\right)$$
Generally, unit Ui does not change its state, but if it does then the information would be residing
local to the unit. With that change, there would also be an increase in the consensus of the
network.
Probability of the network to accept the change in the state of the unit is given by the following
relation −
$$AF(i, T) = \frac{1}{1 + \exp\left[-\frac{\Delta CF(i)}{T}\right]}$$
Here, T is the controlling parameter. It will decrease as CF reaches the maximum value.
Testing Algorithm
Step 5 − Calculate the change in consensus as follows −

$$\Delta CF = (1 - 2u_i)\left(w_{ii} + \sum_{j \neq i} u_j w_{ij}\right)$$
Step 6 − Calculate the probability that this network would accept the change in state
$$AF(i, T) = \frac{1}{1 + \exp\left[-\frac{\Delta CF(i)}{T}\right]}$$
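The consensus change and the acceptance probability can be sketched in Python as follows (an illustrative addition; the weights, the initial state and the cooling schedule are assumed example values):

```python
import numpy as np

rng = np.random.default_rng(0)

def consensus_change(W, u, i):
    # Delta CF = (1 - 2 u_i) * (w_ii + sum_{j != i} u_j w_ij)
    s = W[i, i] + sum(W[i, j] * u[j] for j in range(len(u)) if j != i)
    return (1 - 2 * u[i]) * s

def maybe_flip(W, u, i, T):
    # Accept the flip with probability AF(i, T) = 1 / (1 + exp(-DeltaCF(i)/T))
    af = 1.0 / (1.0 + np.exp(-consensus_change(W, u, i) / T))
    if rng.random() < af:
        u[i] = 1 - u[i]
    return u

W = np.array([[1.0, -0.5], [-0.5, 1.0]])   # b on the diagonal, -p off-diagonal
u = np.array([0, 1])
for T in [10.0, 5.0, 1.0, 0.1]:   # T decreases as CF approaches its maximum
    for i in range(len(u)):
        u = maybe_flip(W, u, i, T)
print(u)
```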
Brain-State-in-a-Box Network
Mathematical Formulations
The node function used in BSB network is a ramp function, which can be defined as follows −
$$f(net) = \min(1, \max(-1, net))$$
The state of each node is updated as follows −

$$x_i(t + 1) = f\left(\sum_{j=1}^{n} w_{i,j} x_j(t)\right)$$
The weights for storing P patterns $v_p$ are given by −

$$w_{ij} = \frac{1}{P} \sum_{p=1}^{P} (v_{p,i} v_{p,j})$$
Optimization is an action of making something such as design, situation, resource, and system
as effective as possible. Using a resemblance between the cost function and energy function,
we can use highly interconnected neurons to solve optimization problems. Such a kind of neural
network is Hopfield network, that consists of a single layer containing one or more fully
connected recurrent neurons. This can be used for optimization.
Points to remember while using a Hopfield network for optimization −
The energy function of the network must be at a minimum.
It will find a satisfactory solution rather than select one out of the stored patterns.
The quality of the solution found by the Hopfield network depends significantly on the initial state of the network.
Travelling Salesman Problem (TSP) is a classical optimization problem in which a salesman has
to travel n cities, which are connected with each other, keeping the cost as well as the distance
travelled minimum. For example, the salesman has to travel a set of 4 cities A, B, C, D and the
goal is to find the shortest circular tour, A-B-C–D, so as to minimize the cost, which also
includes the cost of travelling from the last city D to the first city A.
Matrix Representation
Actually each tour of n-city TSP can be expressed as n × n matrix whose ith row describes the
ith city’s location. This matrix, M, for 4 cities A, B, C, D can be expressed as follows −
$$M = \begin{bmatrix} A: & 1 & 0 & 0 & 0 \\ B: & 0 & 1 & 0 & 0 \\ C: & 0 & 0 & 1 & 0 \\ D: & 0 & 0 & 0 & 1 \end{bmatrix}$$
While considering the solution of this TSP by Hopfield network, every node in the network
corresponds to one element in the matrix.
To be the optimized solution, the energy function must be minimum. On the basis of the
following constraints, we can calculate the energy function as follows −
Constraint-I
First constraint, on the basis of which we will calculate energy function, is that one element must
be equal to 1 in each row of matrix M and other elements in each row must equal to 0 because
each city can occur in only one position in the TSP tour. This constraint can mathematically be
written as follows −
$$\sum_{j=1}^{n} M_{x,j} = 1 \quad \text{for } x \in \{1, \dots, n\}$$
Now the energy function to be minimized, based on the above constraint, will contain a term
proportional to −
$$\sum_{x=1}^{n} \left(1 - \sum_{j=1}^{n} M_{x,j}\right)^2$$
Constraint-II
As we know, in TSP one city can occur in any position in the tour hence in each column of
matrix M, one element must equal to 1 and other elements must be equal to 0. This constraint
can mathematically be written as follows −
$$\sum_{x=1}^{n} M_{x,j} = 1 \quad \text{for } j \in \{1, \dots, n\}$$
Now the energy function to be minimized, based on the above constraint, will contain a term
proportional to −
$$\sum_{j=1}^{n} \left(1 - \sum_{x=1}^{n} M_{x,j}\right)^2$$
Let us suppose that a square (n × n) matrix denoted by C represents the cost matrix of the TSP for n cities, where n > 0. Following are some parameters used while calculating the cost function −
$C_{x,y}$ − The element of the cost matrix denoting the cost of travelling from city x to y.
Adjacency of the elements of A and B can be shown by the following relation −
$$M_{x,i} = 1 \quad \text{and} \quad M_{y,i \pm 1} = 1$$
As we know, in Matrix the output value of each node can be either 0 or 1, hence for every pair of
cities A, B we can add the following terms to the energy function −
$$\sum_{i=1}^{n} C_{x,y} M_{x,i} (M_{y,i+1} + M_{y,i-1})$$
On the basis of the above cost function and constraint value, the final energy function E can be
given as follows −
$$E = \frac{1}{2} \sum_{i=1}^{n} \sum_{x} \sum_{y \neq x} C_{x,y} M_{x,i} (M_{y,i+1} + M_{y,i-1}) + \left[\gamma_1 \sum_{x} \left(1 - \sum_{i} M_{x,i}\right)^2 + \gamma_2 \sum_{i} \left(1 - \sum_{x} M_{x,i}\right)^2\right]$$

Here, $\gamma_1$ and $\gamma_2$ are two weighing constants.
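The energy function above can be evaluated directly in Python (an illustrative addition; the cost matrix and the penalty constants γ1, γ2 are assumed example values, and the position index is taken modulo n so the tour wraps around from the last city back to the first):

```python
import numpy as np

def tsp_energy(M, C, gamma1=10.0, gamma2=10.0):
    n = len(C)
    E = 0.0
    # Tour-length term: cost of city y visited just after or just before city x
    for x in range(n):
        for y in range(n):
            if y == x:
                continue
            for i in range(n):
                E += 0.5 * C[x, y] * M[x, i] * (M[y, (i + 1) % n] + M[y, (i - 1) % n])
    # Constraint terms: one position per city and one city per position
    E += gamma1 * np.sum((1 - M.sum(axis=1)) ** 2)
    E += gamma2 * np.sum((1 - M.sum(axis=0)) ** 2)
    return E

C = np.array([[0, 1, 4, 1], [1, 0, 1, 4],
              [4, 1, 0, 1], [1, 4, 1, 0]], dtype=float)
M = np.eye(4)                    # the tour A-B-C-D
print(tsp_energy(M, C))          # constraints satisfied, only the tour cost remains
```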
Gradient Descent
We can understand the main working idea of gradient descent with the help of the following steps −
First, start with an initial guess of the solution.
Then, take the gradient of the function at that point.
Later, repeat the process by stepping the solution in the negative direction of the
gradient.
By following the above steps, the algorithm will eventually converge where the gradient is zero.
Mathematical Concept
Suppose we have a function f(x) and we are trying to find the minimum of this function. Following are the steps to find the minimum of f(x) −
First, give some initial value of x.
Now, take the gradient ∇f of the function, with the intuition that the gradient will give the slope of the curve at that x, and its direction will point to the increase in the function, in order to find out the best direction to minimize it.
Now change x as follows −
$$x_{n+1} = x_n - \theta \nabla f(x_n)$$
Here, θ > 0 is the training rate (step size) that forces the algorithm to take small jumps.
Actually, a wrong step size θ may not reach convergence, hence a careful selection of the step size is very important. The following points must be remembered while choosing the step size −
Do not choose too large a step size, otherwise it will have a negative impact, i.e. it will diverge rather than converge.
Do not choose too small a step size, otherwise it will take a lot of time to converge.
Some options with regards to choosing the step size −
One option is to choose a fixed step size.
Another option is to choose a different step size for every iteration.
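As a minimal sketch (an illustrative addition, with a fixed step size and an assumed one-dimensional example function), gradient descent can be written as −

```python
def gradient_descent(grad_f, x0, theta=0.1, tol=1e-6, max_iters=1000):
    x = x0
    for _ in range(max_iters):
        g = grad_f(x)
        if abs(g) < tol:          # converged: the gradient is (near) zero
            break
        x = x - theta * g         # x_{n+1} = x_n - theta * grad f(x_n)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # approaches 3.0
```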
Simulated Annealing
The basic concept of Simulated Annealing (SA) is motivated by the annealing in solids. In the
process of annealing, if we heat a metal above its melting point and cool it down then the
structural properties will depend upon the rate of cooling. We can also say that SA simulates the
metallurgy process of annealing.
Use in ANN
Algorithm
Nature has always been a great source of inspiration to all mankind. Genetic Algorithms (GAs)
are search-based algorithms based on the concepts of natural selection and genetics. GAs are
a subset of a much larger branch of computation known as Evolutionary Computation.
GAs were developed by John Holland and his students and colleagues at the University of Michigan, most notably David E. Goldberg, and have since been tried on various optimization problems with a high degree of success.
In GAs, we have a pool or a population of possible solutions to the given problem. These
solutions then undergo recombination and mutation (like in natural genetics), producing new
children, and the process is repeated over various generations. Each individual (or candidate
solution) is assigned a fitness value (based on its objective function value) and the fitter
individuals are given a higher chance to mate and yield more “fitter” individuals. This is in line
with the Darwinian Theory of “Survival of the Fittest”.
In this way, we keep “evolving” better individuals or solutions over generations, till we reach a
stopping criterion.
Genetic Algorithms are sufficiently randomized in nature, however they perform much better
than random local search (in which we just try various random solutions, keeping track of the
best so far), as they exploit historical information as well.
Advantages of GAs
GAs have various advantages which have made them immensely popular. These include −
Does not require any derivative information (which may not be available for many real-
world problems).
Is faster and more efficient as compared to the traditional methods.
Has very good parallel capabilities.
Optimizes both continuous and discrete functions as well as multi-objective problems.
Provides a list of “good” solutions and not just a single solution.
Always gets an answer to the problem, which gets better over the time.
Useful when the search space is very large and there are large number of parameters
involved.
Limitations of GAs
Like any technique, GAs also suffer from a few limitations. These include −
GAs are not suited for all problems, especially problems which are simple and for which
derivative information is available.
Fitness value is calculated repeatedly, which might be computationally expensive for
some problems.
Being stochastic, there are no guarantees on the optimality or the quality of the
solution.
If not implemented properly, GA may not converge to the optimal solution.
GA – Motivation
Genetic Algorithms have the ability to deliver a “good-enough” solution “fast-enough”. This makes GAs attractive for use in solving optimization problems. The reasons why GAs are needed are as follows −
Traditional calculus based methods work by starting at a random point and by moving in the
direction of the gradient, till we reach the top of the hill. This technique is efficient and works
very well for single-peaked objective functions like the cost function in linear regression.
However, in most real-world situations we have very complex problems, called landscapes, made of many peaks and many valleys, which cause such methods to fail, as they suffer from an inherent tendency of getting stuck at the local optima, as shown in the following figure.
Some difficult problems like the Travelling Salesman Problem (TSP), have real-world
applications like path finding and VLSI Design. Now imagine that you are using your GPS
Navigation system, and it takes a few minutes (or even a few hours) to compute the “optimal”
path from the source to destination. Delay in such real-world applications is not acceptable and
therefore a “good-enough” solution, which is delivered “fast” is what is required.
Before studying the fields where ANN has been used extensively, we need to understand why
ANN would be the preferred choice of application.
Areas of Application
Following are some of the areas where ANN is being used. This suggests that ANN has an interdisciplinary approach in its development and applications.
Speech Recognition
Character Recognition
It is an interesting problem which falls under the general area of Pattern Recognition. Many
neural networks have been developed for automatic recognition of handwritten characters,
either letters or digits. Following are some ANNs which have been used for character
recognition −
Multilayer neural networks such as Backpropagation neural networks.
Neocognitron
Though back-propagation neural networks have several hidden layers, the pattern of connection
from one layer to the next is localized. Similarly, neocognitron also has several hidden layers
and its training is done layer by layer for such kind of applications.
Signature Verification Application
Signatures are one of the most useful ways to authorize and authenticate a person in legal transactions. The signature verification technique is a non-vision based technique.
For this application, the first approach is to extract the feature or rather the geometrical feature
set representing the signature. With these feature sets, we have to train the neural networks
using an efficient neural network algorithm. This trained neural network will classify the
signature as being genuine or forged under the verification stage.
Human Face Recognition
It is one of the biometric methods used to identify a given face. It is a challenging task because of the need to characterize “non-face” images. However, if a neural network is well trained, then it can divide the images into two classes, namely images having faces and images that do not have faces.
First, all the input images must be preprocessed. Then, the dimensionality of that image must
be reduced. And, at last it must be classified using neural network training algorithm. Following
neural networks are used for training purposes with preprocessed image −
Fully-connected multilayer feed-forward neural network trained with the help of back-
propagation algorithm.
For dimensionality reduction, Principal Component Analysis (PCA) is used.