Shortcomings in Single Layer Neural Networks: Most Real World Problems Are Not
Feed Forward Backpropagation Neural Networks
Very General Nature
Applicable to a variety of practical problems
NETtalk
Signature Classification
Disease Classification
Hand Written Character Recognition
Combat Outcome Prediction
Earthquake Prediction
Etc…
Objective:
To achieve a balance between Memorization and
Generalization
Memorization: Ability to respond correctly to the input patterns used
for training
Generalization: Ability to give reasonable responses to input that is
similar but not identical to that used in training
Architecture
Consists of multiple layers
Layers of units other than the input
and output are called hidden units
Unidirectional weight connections
and biases
Activation functions
Use of sigmoid functions
Nonlinear Operation: Ability to solve
practical problems
Differentiable: Makes theoretical
assessment easier
Derivative can be expressed in terms of the function itself: computational efficiency
Activation function is the same for all neurons in the same layer
Input layer just passes the signal on without processing (a linear operation)

$$z_j = f(z\_in_j), \qquad z\_in_j = \sum_{i=0}^{n} x_i\, v_{ij}, \quad x_0 = 1, \quad j = 1,\dots,p$$

$$y_k = f(y\_in_k), \qquad y\_in_k = \sum_{j=0}^{p} z_j\, w_{jk}, \quad z_0 = 1, \quad k = 1,\dots,m$$
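As an illustration, the forward pass defined by these equations can be written in a few lines of NumPy (a minimal sketch; the function and array names are mine, not from the slides):

```python
import numpy as np

def sigmoid(x):
    # binary sigmoid f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(x, V, W):
    """One forward pass through a single-hidden-layer net.

    x: (n,) input vector; V: (n+1, p) input-to-hidden weights with the
    biases in row 0; W: (p+1, m) hidden-to-output weights, biases in row 0.
    """
    x_aug = np.concatenate(([1.0], x))   # x0 = 1 (bias input)
    z = sigmoid(x_aug @ V)               # hidden activations z_j = f(z_in_j)
    z_aug = np.concatenate(([1.0], z))   # z0 = 1 (bias unit)
    y = sigmoid(z_aug @ W)               # outputs y_k = f(y_in_k)
    return z, y
```

Placing the biases in row 0 of each weight matrix matches the convention x0 = z0 = 1 in the sums above.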
Architecture: Activation functions
[Figure: the binary sigmoid f1(x) and its derivative f'1(x), and the bipolar sigmoid f2(x) and its derivative f'2(x), plotted for x in [-8, 8]]
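The derivative identities behind these plots, f'1(x) = f1(x)[1 − f1(x)] and f'2(x) = ½[1 + f2(x)][1 − f2(x)], can be checked numerically (a small sketch; the function names are mine):

```python
import numpy as np

def f1(x):
    # binary sigmoid, range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def f2(x):
    # bipolar sigmoid, range (-1, 1)
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def df1(x):
    # derivative in terms of the function itself: f1'(x) = f1(x) (1 - f1(x))
    return f1(x) * (1.0 - f1(x))

def df2(x):
    # derivative in terms of the function itself: f2'(x) = 0.5 (1 + f2(x)) (1 - f2(x))
    return 0.5 * (1.0 + f2(x)) * (1.0 - f2(x))
```

Comparing these against central finite differences confirms the identities to numerical precision.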
Training
During training we are presented with input
patterns and their targets
At the output layer we can compute the error
between the targets and actual output and
use it to compute weight updates through the
Delta Rule
But the error cannot be calculated at the hidden units, as their targets are not known
Therefore we propagate the error at the output units back to the hidden units to find the required weight changes (backpropagation)
3 Stages
Feed-forward of the input training pattern
Calculation and Backpropagation of the
associated error
Weight Adjustment
Based on minimization of the SSE (Sum of Squared Errors)
Backpropagation training cycle
[Diagram: feed-forward of the input pattern → backpropagation of the associated error → weight update]
Backpropagation
Proof for the Learning Rule
Proof for the Learning Rule…
[Figure: network with input units x_i, hidden units z_j, and output units y_k]
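Condensed, the gradient-descent derivation behind this proof runs as follows (standard material, stated in the slide's notation). The error for one pattern is

$$E = \frac{1}{2}\sum_{k}(t_k - y_k)^2 .$$

Differentiating with respect to a hidden-to-output weight by the chain rule,

$$\frac{\partial E}{\partial w_{jk}} = -(t_k - y_k)\, f'(y\_in_k)\, z_j = -\delta_k\, z_j ,$$

and with respect to an input-to-hidden weight,

$$\frac{\partial E}{\partial v_{ij}} = -\Big[\sum_{k}\delta_k\, w_{jk}\Big] f'(z\_in_j)\, x_i = -\delta_j\, x_i .$$

Gradient descent with learning rate α then gives the weight corrections $\Delta w_{jk} = \alpha\, \delta_k\, z_j$ and $\Delta v_{ij} = \alpha\, \delta_j\, x_i$.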
Training Algorithm…
Output-layer error term and weight correction:
$$\delta_k = (t_k - y_k)\, f'(y\_in_k), \qquad \Delta w_{jk} = \alpha\, \delta_k\, z_j$$
Training Algorithm…
Hidden-layer error term and weight correction:
$$\delta\_in_j = \sum_{k=1}^{m} \delta_k\, w_{jk}, \qquad \delta_j = \delta\_in_j\, f'(z\_in_j), \qquad \Delta v_{ij} = \alpha\, \delta_j\, x_i$$
Training Algorithm…
$$w_{jk}(\text{new}) = w_{jk}(\text{old}) + \Delta w_{jk}, \qquad v_{ij}(\text{new}) = v_{ij}(\text{old}) + \Delta v_{ij}$$
Effect of Learning Rate
Controls the change in synaptic weights
The smaller the learning rate the smoother the
trajectory in the weight space
Slower rate of learning
If learning rate is made too large (for speedy
convergence) the network may become unstable
(oscillatory)
Stopping Criterion
The Backpropagation (BP) algorithm cannot be shown to converge
There is no well-defined criterion for stopping its operation
Criterion Used
BP is considered to have converged when the Euclidean
norm of the gradient vector reaches a sufficiently small
threshold
Ideally, the gradient is zero at a minimum
Drawbacks
For successful trials, learning times may be large
Calculation of the gradient is required
BP is considered to have converged when the absolute
rate of change in the average squared error per epoch is
sufficiently small
May result in premature termination of the learning process
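The two stopping tests described above might be coded as simple predicates (a sketch; the threshold values are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_norm_stop(grad, threshold=1e-3):
    # stop when the Euclidean norm of the gradient vector is sufficiently small
    return np.linalg.norm(grad) < threshold

def error_rate_stop(prev_mse, curr_mse, threshold=1e-4):
    # stop when the absolute change in average squared error per epoch is small
    return abs(prev_mse - curr_mse) < threshold
```

The first test requires computing the gradient explicitly; the second is cheaper but, as noted above, risks premature termination.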
Application Procedure
Involves only the feed-forward phase (fast!)
Solution of the XOR Problem
Initial weights (2-2-1 network):
v01 = −0.8, v11 = 0.5, v21 = 0.4
v02 = 0.1, v12 = 0.9, v22 = 1.0
w01 = −0.3, w11 = −1.2, w21 = 1.1

[Diagram: inputs x1, x2 feed hidden units Z1, Z2 through weights v_ij; Z1, Z2 feed the output through weights w_jk]

$$z_j = f(z\_in_j), \qquad z\_in_j = \sum_{i=0}^{2} x_i\, v_{ij}, \quad x_0 = 1, \quad j = 1, 2$$

$$y_k = f(y\_in_k), \qquad y\_in_k = \sum_{j=0}^{2} z_j\, w_{jk}, \quad z_0 = 1, \quad k = 1$$
Solution of the XOR Problem…
We consider the input pattern as (1,1) with target = 0 and α=0.1
Feed Forward
$$z_1 = \mathrm{sigmoid}(V_1^T X) = 1\big/\big(1 + e^{-(1 \cdot 0.5 + 1 \cdot 0.4 - 0.8)}\big) = 0.5250$$

$$z_2 = \mathrm{sigmoid}(V_2^T X) = 1\big/\big(1 + e^{-(1 \cdot 0.9 + 1 \cdot 1.0 + 0.1)}\big) = 0.8808$$
Solution of the XOR Problem…
Weight Update
v01 = −0.7962, v11 = 0.5038, v21 = 0.4038
v02 = 0.0985, v12 = 0.8985, v22 = 0.9985
w01 = −0.3127, w11 = −1.2067, w21 = 1.0888
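The full update step can be reproduced numerically; the sketch below uses the slide's initial weights, with signs implied by the feed-forward values (the function and variable names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_step(x, t, V, W, alpha=0.1):
    """One backpropagation step for a 2-2-1 net; biases sit in row 0."""
    x_aug = np.concatenate(([1.0], x))
    z = sigmoid(x_aug @ V)                 # hidden activations
    z_aug = np.concatenate(([1.0], z))
    y = sigmoid(z_aug @ W)                 # output activation
    delta_k = (t - y) * y * (1.0 - y)      # output error term (t - y) f'(y_in)
    delta_in = W[1:] @ delta_k             # propagate error back to hidden units
    delta_j = delta_in * z * (1.0 - z)     # hidden error terms
    W_new = W + alpha * np.outer(z_aug, delta_k)
    V_new = V + alpha * np.outer(x_aug, delta_j)
    return V_new, W_new

V = np.array([[-0.8, 0.1],   # v01, v02
              [ 0.5, 0.9],   # v11, v12
              [ 0.4, 1.0]])  # v21, v22
W = np.array([[-0.3],        # w01
              [-1.2],        # w11
              [ 1.1]])       # w21
V_new, W_new = bp_step(np.array([1.0, 1.0]), np.array([0.0]), V, W)
```

Rounded to four decimals, `V_new` and `W_new` match the updated weights listed on this slide.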
Solution of the XOR Problem…
[Figure: sum-squared error versus epoch on a logarithmic scale; the SSE drops from about 10^0 to below 10^-4 in roughly 200 epochs]
Number of hidden layers
How to handle multiple layers?
Some modifications to the training algorithm
Training Algorithm
[Figure: network with input units x_i, several hidden layers, and output units y_k]
Training Algorithm…
At the output layer the error term is, as before,
$$\delta_k = (t_k - y_k)\, f'(y\_in_k)$$
Training Algorithm…
For a unit j in any hidden layer, the error term is propagated back from the layer above it:
$$\delta_j = f'(z\_in_j) \sum_{k} \delta_k\, w_{jk}$$
Training Algorithm…
Each weight change is the learning rate times the error term of the unit above times the activation of the unit below, e.g.
$$\Delta w_{jk} = \alpha\, \delta_k\, z_j$$
Number of hidden layers…
Theoretically, one hidden layer is sufficient for a backpropagation net to approximate any continuous mapping from the input patterns to the output patterns to an arbitrary degree of accuracy
However, two hidden layers may make training easier in some situations
BP Improvement Heuristics
Randomize the order of presentation of training examples
from one epoch to the next
Makes the search in weight space stochastic over the
learning cycles
Avoids limit cycles in the evolution of the synaptic
weight vectors
Sequential vs. Batch Updating
Sequential Learning (online, pattern, stochastic) mode
Weight update is performed after presenting each training pattern
Optimal in reference to online operation
Requires less storage
Due to randomization in the order of input presentation, the search is stochastic, so trapping in a local minimum can be avoided
Highly effective when the data is large and highly redundant
Difficult to establish theoretical results for convergence
Gradient estimation can be poor
BP Improvement Heuristics…
Sequential vs. Batch Updating…
Batch Updating
Weight update performed after the presentation of all the training
examples that constitute an epoch
For N training examples, the error over an epoch is given by

$$E = \frac{1}{2N} \sum_{n=1}^{N} \sum_{k} e_k^2(n)$$

and the corresponding weight correction by

$$\Delta w_{jk} = -\frac{\alpha}{N} \sum_{n=1}^{N} e_k(n)\, \frac{\partial e_k(n)}{\partial w_{jk}}$$
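The two update modes can be contrasted on a single linear unit trained with the delta rule (a simplification of the multilayer case; the names and toy data are illustrative):

```python
import numpy as np

def sequential_epoch(w, X, T, lr):
    # update after every example (online mode)
    for x, t in zip(X, T):
        e = t - w @ x            # per-example error
        w = w + lr * e * x       # immediate weight update
    return w

def batch_epoch(w, X, T, lr):
    # accumulate the gradient over the whole epoch, then update once
    g = np.zeros_like(w)
    for x, t in zip(X, T):
        e = t - w @ x
        g += e * x
    return w + lr * g / len(X)   # averaged update, matching the epoch error E
```

In sequential mode each error is computed with the already-updated weights, so the two modes generally follow different trajectories through weight space.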
Weight Initialization
The weight update equation involves the derivative of the activation function
In the saturation regions of the sigmoid the derivative is close to zero, so weights that drive units into saturation make learning very slow; initial weights should therefore be small random values
[Figure: binary sigmoid and its derivative f'1(x); the derivative approaches zero in the saturation regions]
Weight Initialization: Nguyen-Widrow Method…
For n input units and p hidden units, compute the scale factor β = 0.7 p^(1/n); initialize the input-to-hidden weights to random values between −0.5 and 0.5; rescale each hidden unit's weight vector to length β; set each hidden bias to a random value between −β and β
Weight Initialization: NW Example
Training a 2-4-1 Backpropagation Neural Network
to solve the XOR Problem
Original Weight Initialization: Random values
between -0.5 and +0.5, LR = 0.2
Convergence: When SSE < 0.05
Using NW initialization requires fewer epochs for convergence
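A sketch of the Nguyen-Widrow initialization as described by Fausett (the function signature and parameter names are mine):

```python
import numpy as np

def nguyen_widrow_init(n_in, n_hidden, rng=None):
    """Initialize input-to-hidden weights with the Nguyen-Widrow method.

    n_in: number of input units; n_hidden: number of hidden units.
    Returns (weights of shape (n_in, n_hidden), biases of shape (n_hidden,)).
    """
    rng = np.random.default_rng(rng)
    beta = 0.7 * n_hidden ** (1.0 / n_in)           # scale factor
    v = rng.uniform(-0.5, 0.5, size=(n_in, n_hidden))
    v = beta * v / np.linalg.norm(v, axis=0)        # each column rescaled to length beta
    b = rng.uniform(-beta, beta, size=n_hidden)     # biases in (-beta, beta)
    return v, b
```

For the 2-4-1 XOR network above this gives β = 0.7·4^(1/2) = 1.4.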
BP Improvement Heuristics…
Antisymmetric activation functions (e.g. bipolar
sigmoids) offer better performance
Functions with Slower saturation
Non-saturating Functions
Useful when the targets are continuous and
not binary
Behavior of Different Activation Functions
[Figure: comparison of the bipolar sigmoid y = 2/(1+e^{-x}) − 1, the scaled arctangent y = 2 tan^{-1}(x)/π, and the non-saturating y = sign(x) log(1 + sign(x) x), plotted for x in [−8, 8]]
BP Improvement Heuristics
Example
XOR (2-4-1) [Fausett]
Other functions
Radial Basis Functions
BP Improvement Heuristics…
Maximizing the information content
Every training pattern presented to the BP Algorithm
should be chosen on the basis that its information
content is the largest possible for the task at hand
Use of an example that results in the largest training error
Use of an example that is radically different from all those
previously used
Use of an Emphasizing scheme
Involves presenting more difficult patterns than easy ones to the
network
The degree of difficulty can be determined by examining the error
it produces compared to previous iterations
Distribution of examples within an epoch presented to the
network is distorted
The presence of outliers would decrease generalization capability
BP Improvement Heuristics…
Data Representation
One factor in the weight correction expression is the activation of the lower unit; units whose activations are zero will not learn
Thus training can be improved if the input is represented in bipolar form and the bipolar sigmoid is used for the activation function
Convergence can also be improved if the target values are not at the asymptotes of the sigmoid function, as those values can never be reached
BP Improvement Heuristics
Data Representation: Example (XOR) [Fausett]
Data Representation…
In many applications, the data may be given by either a
continuous valued variable or a “set of ranges”
Temperature can be a continuous variable (requiring
one continuous output neuron)
Or have ranges defining Frozen, Chilled, Temperate, Hot
(requiring 4 neurons with bipolar values)
In general it is easier for the net to learn a set of distinct
responses than a continuous response
However breaking truly continuous data into artificial distinct
categories can make it more difficult for the net to learn examples
in the border regions of these categories
Solution: fuzzy logic can help
Continuous-valued inputs or targets should not be used to represent distinct qualities, such as the letters of the alphabet
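For instance, the four temperature ranges might be encoded with one bipolar target neuron per category (a sketch; the range boundaries are illustrative assumptions):

```python
# bipolar encoding of temperature ranges: one neuron per category
RANGES = [("Frozen", -273.0, 0.0), ("Chilled", 0.0, 10.0),
          ("Temperate", 10.0, 30.0), ("Hot", 30.0, 1000.0)]

def encode_temperature(t):
    # target vector: +1 for the matching range, -1 for all others
    return [1 if lo <= t < hi else -1 for _, lo, hi in RANGES]
```

A continuous representation would instead use a single output neuron with the temperature scaled away from the sigmoid's asymptotes.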
BP Improvement Heuristics
Data Normalization (Preprocessing)
Mean Removal
Mean should be made close to zero, or else small in comparison to the standard deviation of the data
Decorrelation
Using PCA
Covariance Equalization
The decorrelated input variables should be scaled so that
their covariances are approximately equal
Different synaptic weights in the network should learn at
the same rate
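The three preprocessing steps can be sketched with NumPy, using an eigendecomposition of the covariance matrix for the PCA stage (a minimal illustration; the function name is mine):

```python
import numpy as np

def preprocess(X):
    """Mean removal, decorrelation (PCA), covariance equalization.

    X: (N, d) data matrix. Returns the transformed data, whose
    covariance is approximately the identity matrix.
    """
    Xc = X - X.mean(axis=0)                 # 1. mean removal
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)    # principal axes of the data
    Xd = Xc @ eigvec                        # 2. decorrelation (project onto axes)
    Xe = Xd / np.sqrt(eigval)               # 3. equalize the covariances
    return Xe
```

After the transform each input variable has zero mean and unit variance with no cross-correlations, so the synaptic weights tend to learn at similar rates.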
[Figure: four scatter plots of a two-dimensional data set with axes X1 and X2: (a) original data, (b) after mean removal, (c) after decorrelation, (d) after covariance equalization]
Improving Generalization…
Cross Validation: (Early Stopping)
Form two disjoint subsets from the given data: one for training and one for cross-validation
Perform weight updates based on the training data
At intervals during training, compute the error on the cross-validation data set
Continue training as long as the error on the cross-validation set is not increasing
When the error starts to increase, the net has begun to memorize the training dataset and training must be stopped
Improving Generalization…
Variations to Cross Validation…
Hold-One-Out
Divide the N examples into K > 1 subsets
Train on K−1 subsets and validate on the held-out subset
Repeat for all K choices of the held-out subset and average the cross-validation error
Requires a large amount of data
Leave-One-Out
Train the net on N-1 examples and validate on 1, average
over the N possible trials
Useful when the amount of training data is small
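Both variations reduce to partitioning the example indices into folds (a sketch; with k = N this yields Leave-One-Out):

```python
import random

def k_fold_splits(n, k, seed=0):
    """Partition indices 0..n-1 into k folds; yield (train, validation) pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal subsets
    for i in range(k):
        train = [j for m, f in enumerate(folds) for j in f if m != i]
        yield train, folds[i]               # folds[i] is the held-out subset
```

Each of the k trials trains on the remaining folds and validates on the held-out one; the k validation errors are then averaged.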