Hebbian Learning and Gradient Descent Learning: Neural Computation: Lecture 5
Hebbian Learning
Hebb's postulate (1949): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
In more familiar terminology, that can be stated as the Hebbian Learning rule: change the connection weight between two neurons in proportion to the product of their activations. Then, in the notation used for Perceptrons, that can be written as the weight update

    Δwij = η · outj · ini      (1)

in which η is a small positive learning rate.
There is strong physiological evidence that this type of learning does take place in the
region of the brain known as the hippocampus.
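As a concrete illustration, here is a minimal NumPy sketch of one Hebbian update for a single layer of linear neurons; the layer sizes, input values and learning rate are illustrative assumptions, not part of the lecture:

    import numpy as np

    eta = 0.01                               # learning rate η
    inputs = np.array([0.5, 1.0, -0.3])      # input activations in_i
    W = np.random.randn(3, 2) * 0.1          # weights w_ij: 3 inputs to 2 outputs
    outputs = inputs @ W                     # linear outputs out_j
    W += eta * np.outer(inputs, outputs)     # Hebbian update: Δw_ij = η·out_j·in_i

Note that repeated application of this update only ever strengthens correlated activity, which is exactly the instability discussed next.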
L5-2
Modified Hebbian Learning
An obvious problem with the above rule is that it is unstable – chance coincidences
will build up the connection strengths, and all the weights will tend to increase
indefinitely. Consequently, the basic learning rule (1) is often supplemented by some form of weight decay or saturation that keeps the weights bounded.
Another way to stop the weights increasing indefinitely involves normalising them, so that each output neuron's weight vector is constrained to have length one. This is achieved by the weight update

    wij → ( wij + η · outj · ini ) / ( ∑k ( wkj + η · outj · ink )² )^1/2

which, using a small η and a linear neuron approximation, leads to Oja's Learning Rule:

    Δwij = η · outj · ( ini − outj · wij )
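A minimal sketch of Oja's rule for a single linear output neuron, assuming the same illustrative NumPy setting as before:

    import numpy as np

    eta = 0.01
    w = np.random.randn(3) * 0.1        # weights for one output neuron
    x = np.array([0.5, 1.0, -0.3])      # input pattern in_i
    out = x @ w                         # linear output
    w += eta * out * (x - out * w)      # Oja: Δw_i = η·out·(in_i − out·w_i)

The extra −η·out²·wi decay term is what keeps the weight vector bounded: its length tends towards one.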
L5-3
Hebbian versus Perceptron Learning
It is instructive to compare the Hebbian and Oja learning rules with the Perceptron
learning weight update rule we derived previously, namely:

    Δwij = η · ( targj − outj ) · ini

There is clearly some similarity, but the absence of the target outputs targj from the Hebbian rule means that Hebbian learning is never going to get a Perceptron to learn a set of training data.
There exist variations of Hebbian learning, such as Contrastive Hebbian Learning,
that do provide powerful supervised learning for biologically plausible networks.
However, it has been shown that, for many relevant cases, much simpler non-
biologically plausible algorithms end up producing the same functionality as these
biologically plausible Hebbian-type learning algorithms.
For the purposes of this module, we shall therefore pursue simpler non-Hebbian
approaches for formulating learning algorithms for our artificial neural networks.
L5-4
Learning by Error Minimisation
The general requirement for learning is an algorithm that adjusts the network weights wij
to minimise the difference between the actual outputs outj and the desired outputs targj. An appropriate measure of that difference is

    E(wij) = ½ ∑p ∑j ( targj − outj )²

This is known as the Sum Squared Error (SSE) function – it is the total squared error summed over all the output units j and all the training patterns p. The process of training a neural network corresponds to minimising such an error function.
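In NumPy terms, with the patterns stored as rows, the SSE is a one-line computation; the arrays here are illustrative placeholders:

    import numpy as np

    targ = np.array([[1.0, 0.0], [0.0, 1.0]])   # targ_j^p, shape (npatterns, noutputs)
    out  = np.array([[0.9, 0.2], [0.1, 0.7]])   # out_j^p, same shape
    E = 0.5 * np.sum((targ - out) ** 2)         # sum over patterns p and outputs j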
We shall first look at the particularly simple special case of single layer regression
networks, for which there is a simple direct computation that achieves the minimum,
and then go on to look at the iterative procedures that are necessary for more complex networks that have other structures, activation functions and error functions.
L5-5
Single Layer Regression Networks
Regression or Function Approximation problems involve approximating an underlying
function from a set of noisy data. In this case, the appropriate output activation function
is normally the simple linear activation function f(x) = x, rather than some kind of step
function as required for classification problems. This means the outputs of a single
layer regression network take on the simple form:
    outj = ∑i=1…n ini wij − θj = ∑i=0…n ini wij
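The second form absorbs the threshold θj into the weights as an extra weight w0j attached to a constant input in0. A quick numerical check of that equivalence, assuming the common convention in0 = −1 with w0j = θj (the array values are illustrative):

    import numpy as np

    x = np.array([0.5, 1.0, -0.3])      # in_1 … in_n
    W = np.random.randn(3, 2) * 0.1     # w_ij for i = 1 … n
    theta = np.array([0.2, -0.1])       # thresholds θ_j

    out_explicit = x @ W - theta                 # Σ_{i=1..n} in_i·w_ij − θ_j
    x0 = np.concatenate(([-1.0], x))             # extended input with in_0 = −1
    W0 = np.vstack((theta, W))                   # extended weights with w_0j = θ_j
    assert np.allclose(out_explicit, x0 @ W0)    # Σ_{i=0..n} in_i·w_ij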
Then, if we train the network by minimizing the Sum Squared Error on the outputs, the
derivative of the error with respect to each weight must be zero at the minimum, i.e.
    ∂/∂wkl [ ½ ∑p ∑j ( targj − ∑i ini wij )² ] = 0
This is a set of simultaneous linear equations, one for each weight wkl, and finding the weights in this case reduces to solving that set of equations for the wij.
L5-6
Learning by Simple Matrix Inversion
If we introduce explicit training pattern labels p and compute the derivative we have
    ∑p ( targpl − ∑i inpi wil ) inpk = 0
and this set of equations can be written in conventional matrix form as

    inᵀ ( targ − in w ) = 0

which has a formal solution for the weights in terms of the input pseudo-inverse in†:

    w = in† targ = ( inᵀ in )⁻¹ inᵀ targ
Thus, it is possible to compute the required network weights directly from the inputs and targets using standard matrix pseudo-inversion techniques.
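As a sketch, this direct solution is one call to a standard linear-algebra routine; the data arrays below are random placeholders:

    import numpy as np

    inp  = np.random.randn(20, 4)     # input matrix in, one row per pattern p
    targ = np.random.randn(20, 2)     # target matrix targ, one row per pattern

    w = np.linalg.pinv(inp) @ targ    # w = in† targ
    # equivalently, and numerically preferable, solve the least squares problem:
    w2, *_ = np.linalg.lstsq(inp, targ, rcond=None)
    assert np.allclose(w, w2)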
Unfortunately, this simple matrix inversion approach will only work for this particularly
simple case. In general, one has to use an iterative weight update procedure.
L5-7
Gradient Descent Learning
We have already seen how iterative weight updates work in Hebbian learning and the
Perceptron Learning rule. The aim now is to develop a learning algorithm that
minimises a cost function (such as Sum Squared Error) by making appropriate iterative
adjustments to the weights wij. The idea is to apply a series of small updates to the
weights wij → wij + Δwij until the cost E(wij) is “small enough”.
For the Perceptron Learning Rule, we determined the direction that each weight needed
to change to bring the output closer to the right side of the decision boundary, and then
updated the weight by a small step in that direction.
Now we want to determine the direction that the weight vector needs to change to best
reduce the chosen cost function. A systematic procedure for doing that requires
knowledge of how the cost E(wij) varies as the weights wij change, i.e. the gradient of E
with respect to wij. Then, if we repeatedly adjust the weights by small steps against the
gradient, we will move through weight space, descending along the gradients towards a
minimum of the cost function.
L5-8
Computing Gradients and Derivatives
The branch of mathematics concerned with computing gradients is called Differential
Calculus. The relevant general idea is straightforward. Consider a function y = f(x) :
[Figure: graph of a function y = f(x), showing a small change Δx in x producing a corresponding change Δy in f(x).]
The gradient of f(x), at a particular value of x, is the rate of change of f(x) as we change
x, and that can be approximated by Δy/Δx for small Δx. It can be written exactly as
    ∂f(x)/∂x = limΔx→0 Δy/Δx = limΔx→0 [ f(x + Δx) − f(x) ] / Δx
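This limit is easy to check numerically: for small Δx the ratio Δy/Δx settles onto the true gradient. A tiny sketch with an illustrative function:

    def f(x):
        return 3.0 * x ** 2              # example function; exact derivative 6x

    x, dx = 2.0, 1e-6
    print((f(x + dx) - f(x)) / dx)       # ≈ 12.0, i.e. 6x at x = 2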
L5-9
Examples of Computing Derivatives Analytically
Some simple examples illustrate how derivatives can be computed:
f(x) = a·x + b  ⇒  ∂f(x)/∂x = limΔx→0 ( [ a·(x + Δx) + b ] − [ a·x + b ] ) / Δx = a

f(x) = a·x²  ⇒  ∂f(x)/∂x = limΔx→0 ( a·(x + Δx)² − a·x² ) / Δx = 2a·x
Other derivatives can be found in the same way. Some particularly useful ones are:
f(x) = a·x^n  ⇒  ∂f(x)/∂x = n·a·x^(n−1)          f(x) = loge(x)  ⇒  ∂f(x)/∂x = 1/x

f(x) = e^(ax)  ⇒  ∂f(x)/∂x = a·e^(ax)            f(x) = sin(x)  ⇒  ∂f(x)/∂x = cos(x)
L5-10
Gradient Descent Minimisation
If we want to change the value of x to minimise a function f(x), what we need to do
depends on the gradient of f(x) at the current value of x. There are three cases:
If ∂f/∂x > 0 then f(x) increases as x increases, so we should decrease x.

If ∂f/∂x < 0 then f(x) decreases as x increases, so we should increase x.

If ∂f/∂x = 0 then f(x) is at a maximum or minimum, so we should not change x.

All three cases are covered by the single update Δx = −η · ∂f/∂x, where η is a small positive constant.
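A minimal sketch of this update in action, minimising the illustrative function f(x) = (x − 3)²; the choice of function, η and starting point are arbitrary:

    def df(x):
        return 2.0 * (x - 3.0)    # gradient of f(x) = (x − 3)²

    x, eta = 0.0, 0.1
    for _ in range(100):
        x -= eta * df(x)          # Δx = −η·∂f/∂x
    print(x)                      # ends very close to the minimum at x = 3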
L5-11
Gradients in More Than One Dimension
It might not be obvious that one needs the gradient/derivative itself in the weight update
equation, rather than just the sign of the gradient. So, consider the two dimensional
function shown as a contour plot with its minimum inside the smallest ellipse:
[Figure: contour plot of a function of x1 and x2, with the minimum inside the smallest ellipse and representative gradient vectors drawn perpendicular to the contours.]
A few representative gradient vectors are shown. By definition, they will always be
perpendicular to the contours, and the closer the contours, the larger the vectors. It is
now clear that we need to take the relative magnitudes of the x1 and x2 components of
the gradient vectors into account if we are to head towards the minimum efficiently.
L5-12
Training a Single Layer Feed-forward Network
Now that we know how gradient descent weight update rules can lead to minimisation of a neural network's output errors, it is straightforward to train any single layer network:
1. Take the set of training patterns you wish the network to learn
{ ini^p, targj^p : i = 1 … ninputs, j = 1 … noutputs, p = 1 … npatterns }
2. Set up the network with ninputs input units fully connected to noutputs output
units via connections with weights wij
3. Generate random initial weights, e.g. from the range [–smwt, +smwt]
4. Select an appropriate error function E(wij) and learning rate η
5. Apply the weight update Δwij = −η · ∂E(wij)/∂wij to each weight wij for each
training pattern p. One set of updates of all the weights for all the training
patterns is called one epoch of training.
6. Repeat step 5 until the network error function is “small enough”.
This will produce a trained neural network, but steps 4 and 5 can still be difficult…
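For the simple linear single layer case, though, the whole procedure fits in a few lines. A minimal sketch using the SSE gradient derived over the next few pages; the array sizes, η and stopping threshold are illustrative assumptions:

    import numpy as np

    # 1. training patterns, one row per pattern p
    inputs  = np.random.randn(50, 4)                  # in_i^p
    targets = inputs @ np.random.randn(4, 2)          # targ_j^p, from a known linear map

    # 2.-3. single layer of weights with small random initial values
    smwt = 0.1
    W = np.random.uniform(-smwt, smwt, (4, 2))        # w_ij

    # 4. learning rate for the SSE error function
    eta = 0.01

    # 5.-6. epochs of gradient descent updates until the error is small enough
    for epoch in range(1000):
        out = inputs @ W                     # linear outputs out_j^p
        err = targets - out                  # discrepancies targ_j^p − out_j^p
        W += eta * inputs.T @ err            # Δw_ij = η Σ_p in_i^p (targ_j^p − out_j^p)
        if 0.5 * np.sum(err ** 2) < 1e-6:    # SSE "small enough"
            break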
L5-13
Gradient Descent Error Minimisation
We will look at how to choose the error function E next lecture. Suppose, for now, that
we want to train a single layer network by adjusting its weights wij to minimise the SSE:
    E(wij) = ½ ∑p ∑j ( targj − outj )²
We have seen that we can do this by making a series of gradient descent weight updates:
    Δwkl = −η · ∂E(wij)/∂wkl
If the transfer function for the output neurons is f(x), and the activations of the previous
layer of neurons are ini, then the outputs are outj = f( ∑i ini wij ), and
    Δwkl = −η · ∂/∂wkl [ ½ ∑p ∑j ( targj − f( ∑i ini wij ) )² ]
Dealing with equations like this is easy if we use the chain rules for derivatives.
L5-14
Chain Rules for Computing Derivatives
Computing complex derivatives can be done in stages. First, suppose f(x) = g(x).h(x)
    ∂f(x)/∂x = limΔx→0 [ g(x+Δx)·h(x+Δx) − g(x)·h(x) ] / Δx
             = limΔx→0 [ ( g(x) + (∂g(x)/∂x)·Δx ) · ( h(x) + (∂h(x)/∂x)·Δx ) − g(x)·h(x) ] / Δx

giving the product rule

    ∂f(x)/∂x = (∂g(x)/∂x) · h(x) + g(x) · (∂h(x)/∂x)
We can similarly deal with nested functions. Suppose f(x) = g(h(x))
    ∂f(x)/∂x = limΔx→0 [ g(h(x)) + (∂g(h(x))/∂h(x))·Δh(x) − g(h(x)) ] / Δx
             = limΔx→0 [ g(h(x)) + (∂g(h(x))/∂h(x))·(∂h(x)/∂x)·Δx − g(h(x)) ] / Δx

giving the chain rule

    ∂f(x)/∂x = ( ∂g(h(x))/∂h(x) ) · ( ∂h(x)/∂x )
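For example, for the nested function f(x) = sin(a·x²), taking g(h) = sin(h) and h(x) = a·x² gives

    ∂f(x)/∂x = cos(a·x²) · 2a·x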
L5-15
Using the Chain Rule on the Weight Update Equation
The algebra gets rather messy, but after repeated application of the chain rule, and some
tidying up, we end up with a very simple weight update equation:
    Δwkl = −η · ∂/∂wkl [ ½ ∑p ∑j ( targj − f( ∑i ini wij ) )² ]

    Δwkl = −η · ∑p ∑j ∂/∂wkl [ ½ ( targj − f( ∑i ini wij ) )² ]

    Δwkl = −η · ∑p ∑j ½ · 2 · ( targj − f( ∑i ini wij ) ) · ( − ∂/∂wkl f( ∑m inm wmj ) )

    Δwkl = η · ∑p ∑j ( targj − f( ∑i ini wij ) ) · f′( ∑n inn wnj ) · ∂/∂wkl ( ∑m inm wmj )
L5-16
    Δwkl = η · ∑p ∑j ( targj − f( ∑i ini wij ) ) · f′( ∑n inn wnj ) · ( ∑m inm · ∂wmj/∂wkl )

    Δwkl = η · ∑p ∑j ( targj − f( ∑i ini wij ) ) · f′( ∑n inn wnj ) · ( ∑m inm · δmk · δjl )

    Δwkl = η · ∑p ∑j ( targj − f( ∑i ini wij ) ) · f′( ∑n inn wnj ) · ink · δjl

    Δwkl = η · ∑p ( targl − f( ∑i ini wil ) ) · f′( ∑n inn wnl ) · ink
L5-17
Notice that these weight updates involve the derivative f′(x), so the activation function f(x) must be differentiable for gradient descent learning to work. This is clearly no good for simple classification Perceptrons, which use the step function step(x) as their threshold function, because that has zero derivative everywhere except at x = 0, where it is infinite. We will need to use a smoothed version of a step function, like a sigmoid. However, for a linear activation function f(x) = x, appropriate for regression, we have f′(x) = 1, giving the weight updates

    Δwkl = η · ∑p ( targl − outl ) · ink

which is often known as the Delta Rule, because each update Δwkl is simply proportional to the relevant input ink and the corresponding output discrepancy deltal = targl − outl. This update rule is exactly the same as the Perceptron Learning Rule we derived earlier.
L5-18
Delta Rule vs. Perceptron Learning Rule
We have seen that the Delta Rule and the Perceptron Learning Rule for training Single
Layer Perceptrons have exactly the same weight update equation.
However, the two algorithms were obtained from very different theoretical starting
points. The Perceptron Learning Rule was derived from a consideration of how we
should shift around the decision hyper-planes for step function outputs, while the Delta
Rule emerged from a gradient descent minimisation of the Sum Squared Error for a
linear output activation function.
The Perceptron Learning Rule will converge to zero error and no weight changes in a
finite number of steps if the problem is linearly separable, but otherwise the weights
will keep oscillating. On the other hand, the Delta Rule will (for sufficiently small η)
always converge to a set of weights for which the error is a minimum, though the
convergence to the precise target values will generally proceed at an ever decreasing
rate proportional to the output discrepancies deltal.
L5-19
Overview and Reading
1. We began with a brief look at Hebbian Learning and Oja’s Rule.
2. We then considered how neural network weight learning could be put
into the form of minimising an appropriate output error or cost function.
3. We looked at the special case of how simple matrix pseudo-inversion
could be used to find the weights for single layer regression networks.
4. Then we saw how to compute the gradients/derivatives that enable us to
formulate efficient general error minimisation algorithms, and how they
can be used to derive the Delta Rule for training single layer networks.
Reading
1. Bishop: Sections 3.1, 3.2, 3.3, 3.4, 3.5
2. Haykin-1999: Sections 2.2, 2.4, 3.3, 3.4, 3.5
3. Gurney: Sections 5.1, 5.2, 5.3
4. Beale & Jackson: Section 4.4
5. Callan: Sections 2.1, 2.2
L5-20