
Shortcomings in Single Layer Neural Networks

 Most real world problems are not linearly separable
 SLNNs are unable to create a nonlinear separating boundary
 This limits their applicability to practical problems
 This inhibited the growth of neural networks until the 1980s, when the Generalized Delta Rule led to the development of a generalized method for weight updates in multilayer networks
 Developed by Rumelhart, Hinton & Williams (1986a, 1986b) and McClelland & Rumelhart (1988)
Feed Forward Backpropagation Neural Networks
 Very general nature
 Applicable to a variety of practical problems
 NETtalk
 Signature classification
 Disease classification
 Handwritten character recognition
 Combat outcome prediction
 Earthquake prediction
 Etc.
 Objective:
 To achieve a balance between memorization and generalization
 Memorization: ability to respond correctly to the input patterns used for training
 Generalization: ability to give reasonable responses to input that is similar, but not identical, to that used in training
Architecture
 Consists of multiple layers
 Layers of units other than the input and output are called hidden units
 Unidirectional weight connections and biases
 Activation functions
 Use of sigmoid functions
 Nonlinear operation: ability to solve practical problems
 Differentiable: makes theoretical assessment easier
 Derivative can be expressed in terms of the function itself: computational efficiency
 Activation function is the same for all neurons in the same layer
 Input layer just passes on the signal without processing (linear operation)

Hidden layer: $z_j = f(z\_in_j)$, $z\_in_j = \sum_{i=0}^{n} x_i v_{ij}$, with $x_0 = 1$, $j = 1 \ldots p$

Output layer: $y_k = f(y\_in_k)$, $y\_in_k = \sum_{j=0}^{p} z_j w_{jk}$, with $z_0 = 1$, $k = 1 \ldots m$
Architecture: Activation functions
[Figure: Binary sigmoid f1(x) with its derivative f'1(x) (left) and bipolar sigmoid f2(x) with its derivative f'2(x) (right), plotted for x in [-8, 8].]
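The "derivative in terms of the function itself" property can be written out directly; a small sketch, assuming the standard definitions f1(x) = 1/(1 + e^(-x)) and f2(x) = 2 f1(x) - 1:

```python
import numpy as np

def f1(x):
    # Binary sigmoid, range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def df1(y):
    # f1'(x) = f1(x) * (1 - f1(x)), computed from the function value y = f1(x)
    return y * (1.0 - y)

def f2(x):
    # Bipolar sigmoid, range (-1, 1)
    return 2.0 * f1(x) - 1.0

def df2(y):
    # f2'(x) = 0.5 * (1 + f2(x)) * (1 - f2(x))
    return 0.5 * (1.0 + y) * (1.0 - y)

x = np.linspace(-8.0, 8.0, 9)
print(df1(f1(x)))  # peaks at 0.25 at x = 0, near zero in the saturation regions
print(df2(f2(x)))  # peaks at 0.5 at x = 0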
Training
 During training we are presented with input patterns and their targets
 At the output layer we can compute the error between the targets and the actual output, and use it to compute weight updates through the Delta Rule
 But the error cannot be calculated at the hidden units, as their targets are not known
 Therefore we propagate the error at the output units back to the hidden units to find the required weight changes (backpropagation)
 3 stages
 Feed-forward of the input training pattern
 Calculation and backpropagation of the associated error
 Weight adjustment
 Based on minimization of the SSE (Sum of Squared Errors)
Backpropagation training cycle
[Figure: The backpropagation training cycle: feed-forward of the input pattern, backpropagation of the error, weight update.]
Backpropagation: Proof for the Learning Rule
 Change in $w_{jk}$ affects only $y_k$
 Use of gradient descent minimization
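The derivation the slide refers to can be reconstructed as follows (a sketch, assuming the squared-error cost $E = \frac{1}{2}\sum_k (t_k - y_k)^2$ that the SSE criterion above implies):

$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial y\_in_k} \frac{\partial y\_in_k}{\partial w_{jk}} = -(t_k - y_k)\, f'(y\_in_k)\, z_j$

$\Delta w_{jk} = -\alpha \frac{\partial E}{\partial w_{jk}} = \alpha\, \delta_k\, z_j, \qquad \delta_k = (t_k - y_k)\, f'(y\_in_k)$

Because a change in $w_{jk}$ affects only $y_k$, only the $k$-th term of the sum in $E$ survives the differentiation.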
Proof for the Learning Rule…
 Change in $v_{ij}$ affects all of $y_1 \ldots y_m$
 Change in $v_{ij}$ affects only $z_j$
 Use of gradient descent minimization
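The hidden-layer case follows the same pattern (again a sketch; because a change in $v_{ij}$ affects every output through $z_j$, the chain rule sums over all $k$):

$\frac{\partial E}{\partial v_{ij}} = \sum_{k=1}^{m} \frac{\partial E}{\partial y\_in_k} \frac{\partial y\_in_k}{\partial z_j} \frac{\partial z_j}{\partial v_{ij}} = -\Big(\sum_{k=1}^{m} \delta_k\, w_{jk}\Big)\, f'(z\_in_j)\, x_i$

$\Delta v_{ij} = -\alpha \frac{\partial E}{\partial v_{ij}} = \alpha\, \delta_j\, x_i, \qquad \delta_j = f'(z\_in_j) \sum_{k=1}^{m} \delta_k\, w_{jk}$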
Training Algorithm

 Step 1 (Feedforward): present the input pattern; each input unit broadcasts its signal $x_i$, each hidden unit computes $z_j = f(z\_in_j)$, and each output unit computes $y_k = f(y\_in_k)$
Training Algorithm…

 Step 2 (Backpropagation of error, output layer): each output unit computes its error term $\delta_k = (t_k - y_k)\, f'(y\_in_k)$ and the weight correction terms $\Delta w_{jk} = \alpha\, \delta_k\, z_j$
Training Algorithm…

 Step 3 (Backpropagation of error, hidden layer): each hidden unit computes its error term $\delta_j = f'(z\_in_j) \sum_k \delta_k\, w_{jk}$ and the weight correction terms $\Delta v_{ij} = \alpha\, \delta_j\, x_i$
Training Algorithm…

 Step 4 (Weight update): $w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$ and $v_{ij} \leftarrow v_{ij} + \Delta v_{ij}$; repeat from Step 1 until the stopping condition is met
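Steps 1 through 4 combine into one training step; a compact Python/NumPy sketch for a single-hidden-layer network (binary sigmoid, array layout as in the earlier feed_forward sketch; the default learning rate is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, t, v, w, alpha=0.1):
    """One backpropagation training step.
    v: (n+1, p) and w: (p+1, m) weight matrices, row 0 = bias weights."""
    # Step 1: feedforward
    x = np.concatenate(([1.0], x))
    z = np.concatenate(([1.0], sigmoid(x @ v)))
    y = sigmoid(z @ w)
    # Step 2: output error terms, delta_k = (t_k - y_k) f'(y_in_k)
    delta_k = (t - y) * y * (1.0 - y)
    # Step 3: hidden error terms, delta_j = f'(z_in_j) sum_k delta_k w_jk
    delta_j = z[1:] * (1.0 - z[1:]) * (w[1:] @ delta_k)
    # Step 4: weight update
    w += alpha * np.outer(z, delta_k)
    v += alpha * np.outer(x, delta_j)
    return v, w, 0.5 * np.sum((t - y) ** 2)
```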
Effect of Learning Rate
 Controls the size of the change in the synaptic weights
 The smaller the learning rate, the smoother the trajectory in weight space
 But the slower the rate of learning
 If the learning rate is made too large (for speedy convergence), the network may become unstable (oscillatory)
Stopping Criterion
 The Backpropagation (BP) algorithm cannot, in general, be shown to converge
 There is no well-defined criterion for stopping its operation
 Criteria used
 BP is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small threshold
 Ideally, the gradient is zero at a minimum
 Drawbacks
 Learning times may be long, even for successful trials
 Calculation of the gradient vector is required
 BP is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small
 May result in premature termination of the learning process
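A sketch of the second criterion (the tolerance value is an assumption for illustration):

```python
def has_converged(epoch_errors, tol=1e-4):
    """Stop when the absolute change in the average squared error
    per epoch falls below tol. epoch_errors: one value per epoch."""
    if len(epoch_errors) < 2:
        return False
    return abs(epoch_errors[-1] - epoch_errors[-2]) < tol
```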
Application Procedure
 Involves only the feed-forward phase (fast!)

Solution of the XOR Problem
 Network: 2-2-1 (inputs $x_1, x_2$; hidden units $Z_1, Z_2$; output unit $Y_1$), with bias weights $v_{01}, v_{02}, w_{01}$

[Figure: Network diagram showing the weights $v_{11}, v_{21}, v_{12}, v_{22}, w_{11}, w_{21}$ and the bias connections.]

 Initial weights:

$v_{01} = -0.8, \quad v_{11} = 0.5, \quad v_{21} = 0.4$
$v_{02} = 0.1, \quad v_{12} = 0.9, \quad v_{22} = 1.0$
$w_{01} = -0.3, \quad w_{11} = -1.2, \quad w_{21} = 1.1$

$z_j = f(z\_in_j), \quad z\_in_j = \sum_{i=0}^{2} x_i v_{ij}, \quad x_0 = 1, \; j = 1, 2$
$y_k = f(y\_in_k), \quad y\_in_k = \sum_{j=0}^{2} z_j w_{jk}, \quad z_0 = 1, \; k = 1$
Solution of the XOR Problem…
 We consider the input pattern $(1, 1)$ with target $t = 0$ and $\alpha = 0.1$
 Feedforward:

$z_1 = \mathrm{sigmoid}(V_1^T X) = 1 / \big(1 + e^{-(1 \cdot 0.5 + 1 \cdot 0.4 - 0.8)}\big) = 0.5250$
$z_2 = \mathrm{sigmoid}(V_2^T X) = 1 / \big(1 + e^{-(1 \cdot 0.9 + 1 \cdot 1.0 + 0.1)}\big) = 0.8808$
$y_1 = \mathrm{sigmoid}(W^T Z) = 1 / \big(1 + e^{-(0.5250 \cdot (-1.2) + 0.8808 \cdot 1.1 - 0.3)}\big) = 0.5097$

 Backpropagation:

$e = t - y_1 = 0 - 0.5097 = -0.5097$
$\delta_{k1} = y_1 (1 - y_1)\, e = 0.5097 \cdot (1 - 0.5097) \cdot (-0.5097) = -0.1274$
$\Delta w_{21} = \alpha\, z_2\, \delta_{k1} = 0.1 \cdot 0.8808 \cdot (-0.1274) = -0.0112$
$\Delta w_{11} = \alpha\, z_1\, \delta_{k1} = 0.1 \cdot 0.5250 \cdot (-0.1274) = -0.0067$
$\Delta w_{01} = \alpha \cdot (1) \cdot \delta_{k1} = 0.1 \cdot (-0.1274) = -0.0127$
Solution of the XOR Problem…
 Backpropagation (continued):

$\delta_{j1} = z_1 (1 - z_1)\, \delta_{k1}\, w_{11} = 0.5250 \cdot (1 - 0.5250) \cdot (-0.1274) \cdot (-1.2) = 0.0381$
$\delta_{j2} = z_2 (1 - z_2)\, \delta_{k1}\, w_{21} = 0.8808 \cdot (1 - 0.8808) \cdot (-0.1274) \cdot 1.1 = -0.0147$
$\Delta v_{11} = \alpha\, x_1\, \delta_{j1} = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$
$\Delta v_{21} = \alpha\, x_2\, \delta_{j1} = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$
$\Delta v_{01} = \alpha \cdot (1) \cdot \delta_{j1} = 0.1 \cdot 0.0381 = 0.0038$
$\Delta v_{12} = \alpha\, x_1\, \delta_{j2} = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$
$\Delta v_{22} = \alpha\, x_2\, \delta_{j2} = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$
$\Delta v_{02} = \alpha \cdot (1) \cdot \delta_{j2} = 0.1 \cdot (-0.0147) = -0.0015$
Solution of the XOR Problem…
 Weight update (new weights):

$v_{01} = -0.8 + 0.0038 = -0.7962$
$v_{11} = 0.5 + 0.0038 = 0.5038$
$v_{21} = 0.4 + 0.0038 = 0.4038$
$v_{02} = 0.1 - 0.0015 = 0.0985$
$v_{12} = 0.9 - 0.0015 = 0.8985$
$v_{22} = 1.0 - 0.0015 = 0.9985$
$w_{01} = -0.3 - 0.0127 = -0.3127$
$w_{11} = -1.2 - 0.0067 = -1.2067$
$w_{21} = 1.1 - 0.0112 = 1.0888$

 The training process is repeated until the sum of squared errors is less than 0.001
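The hand calculation above can be reproduced with a short script (a sketch; the printed values reproduce the slide's numbers to four decimal places):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

alpha, x, t = 0.1, np.array([1.0, 1.0]), 0.0
v = np.array([[-0.8, 0.1],    # v01, v02 (bias weights)
              [ 0.5, 0.9],    # v11, v12
              [ 0.4, 1.0]])   # v21, v22
w = np.array([-0.3, -1.2, 1.1])  # w01, w11, w21

# Feedforward
xb = np.concatenate(([1.0], x))
z = sigmoid(xb @ v)                        # [0.5250, 0.8808]
zb = np.concatenate(([1.0], z))
y = sigmoid(zb @ w)                        # 0.5097

# Backpropagation
delta_k = (t - y) * y * (1 - y)            # -0.1274
delta_j = z * (1 - z) * (w[1:] * delta_k)  # [0.0381, -0.0147]

# Weight update
w += alpha * delta_k * zb                  # [-0.3127, -1.2067, 1.0888]
v += alpha * np.outer(xb, delta_j)         # matches the updated v values above
print(z, y, delta_k, delta_j, w, v, sep="\n")
```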
Solution of the XOR Problem…

[Figure: Sum-squared network error over 224 epochs, plotted on a logarithmic scale; the error falls from roughly 10^0 to below the 10^-3 goal.]
Number of hidden layers
 How do we handle multiple hidden layers?
 Some modifications to the training algorithm are needed
Training Algorithm

 Step 1 (Feedforward): as before, except that the hidden-layer computation $z_j = f(z\_in_j)$ is repeated for each hidden layer in turn, with each layer taking the previous layer's activations as its input, before the output units compute $y_k$
Training Algorithm…

 Step 2 (Output error terms): each output unit computes $\delta_k = (t_k - y_k)\, f'(y\_in_k)$, as before
Training Algorithm…

 Step 3 (Hidden error terms): the computation $\delta_j = f'(z\_in_j) \sum_k \delta_k\, w_{jk}$ is repeated for each hidden layer in turn, starting from the layer nearest the output and propagating the error terms backwards
Training Algorithm…

 Step 4 (Weight update): every layer's weights are updated from its correction terms, as before
Number of hidden layers…
 Theoretically, one hidden layer is sufficient for a backpropagation net to approximate any continuous mapping from the input patterns to the output patterns to an arbitrary degree of accuracy
 However, two hidden layers may make training easier in some situations
BP Improvement Heuristics
 Randomize the order of presentation of training examples from one epoch to the next
 Makes the search in weight space stochastic over the learning cycles
 Avoids limit cycles in the evolution of the synaptic weight vectors
 Sequential vs. batch updating
 Sequential learning (online, pattern, stochastic) mode
 Weight update is performed after presenting each training example
 Well suited to online operation
 Requires less storage
 Due to the randomization in the order of input presentation, the search is stochastic, so trapping in a local minimum can be avoided
 Highly effective when the data is large and highly redundant
 Difficult to establish theoretical results for convergence
 The gradient estimate can be poor
BP Improvement Heuristics…
 Sequential vs. Batch Updating…
 Batch updating
 Weight update is performed after the presentation of all the training examples that constitute an epoch
 For N training examples, the error calculation and weight correction are given by

$E = \frac{1}{2N} \sum_{n=1}^{N} \sum_{k} e_k^2(n)$

$\Delta w_{jk} = -\frac{\alpha}{N} \sum_{n=1}^{N} e_k(n)\, \frac{\partial e_k(n)}{\partial w_{jk}}$

 Provides an accurate estimate of the gradient vector
 Convergence to a local minimum is guaranteed under simple conditions
 Easier to parallelize the algorithm
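A sketch of one batch-mode epoch (Python/NumPy, same array layout as the earlier sketches; the corrections are accumulated over the epoch and applied once, averaged over N):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def batch_epoch(X, T, v, w, alpha):
    """One batch-mode epoch. X: (N, n) inputs, T: (N, m) targets;
    v: (n+1, p), w: (p+1, m) with bias weights in row 0."""
    dv, dw = np.zeros_like(v), np.zeros_like(w)
    for x, t in zip(X, T):
        xb = np.concatenate(([1.0], x))
        z = sigmoid(xb @ v)
        zb = np.concatenate(([1.0], z))
        y = sigmoid(zb @ w)
        delta_k = (t - y) * y * (1 - y)
        delta_j = z * (1 - z) * (w[1:] @ delta_k)
        dw += np.outer(zb, delta_k)   # accumulate corrections over the epoch
        dv += np.outer(xb, delta_j)
    w += alpha * dw / len(X)          # single averaged update per epoch
    v += alpha * dv / len(X)
    return v, w
```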
Weight Initialization
 The weight update equation involves the value of the derivative of the activation function
 If the weights are too large, the net inputs will lie in the saturation regions, where the values of the derivative of the sigmoid function are very small, causing a slowdown of the learning
 If the weights are taken to be too small, the algorithm may operate on a very flat area around the origin (especially for antisymmetric activation functions), again leading to slow learning
 The initial weights determine the speed of convergence
 The initial weights also determine whether the net reaches a global or only a local minimum
 A common procedure is to initialize the weights between -0.5 and +0.5 (or between -1 and +1)

[Figure: Binary sigmoid f1(x) and its derivative f'1(x); the derivative is near zero in the saturation regions.]
Weight Initialization…
 Nguyen-Widrow weight initialization
 This approach is based on a geometrical analysis of the response of the hidden neurons to a single input; the analysis is extended to the case of several inputs by using Fourier transforms
 Provides a method for initializing the weights between the input and hidden layers
 Weights from the hidden units to the output units are initialized to random values between -0.5 and +0.5
 NW distributes the initial weights so that, for each input pattern, the net input to one of the hidden units is likely to be in the range in which that hidden neuron will learn most rapidly
Weight Initialization: Nguyen-Widrow Method…

[Slide shows the NW initialization procedure; a sketch follows below.]
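A sketch of the NW procedure as commonly stated (e.g., in Fausett): compute a scale factor $\beta = 0.7\, p^{1/n}$ for $n$ input units and $p$ hidden units, draw the input-to-hidden weights uniformly from $[-0.5, 0.5]$, rescale each hidden unit's weight vector to length $\beta$, and draw the biases from $[-\beta, \beta]$:

```python
import numpy as np

def nguyen_widrow_init(n, p, rng=None):
    """Nguyen-Widrow initialization for an n-input, p-hidden-unit layer.
    Returns an (n+1, p) matrix with the bias weights in row 0."""
    rng = rng or np.random.default_rng()
    beta = 0.7 * p ** (1.0 / n)
    v = rng.uniform(-0.5, 0.5, size=(n, p))
    v = beta * v / np.linalg.norm(v, axis=0)  # each column scaled to norm beta
    bias = rng.uniform(-beta, beta, size=p)
    return np.vstack([bias, v])
```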
Weight Initialization: NW Example
 Training a 2-4-1 Backpropagation neural network to solve the XOR problem
 Original weight initialization: random values between -0.5 and +0.5, LR = 0.2
 Convergence: when SSE < 0.05
 Using NW initialization requires fewer epochs for convergence
BP Improvement Heuristics…
 Antisymmetric activation functions (e.g. bipolar sigmoids) offer better performance
 Functions with slower saturation
 Non-saturating functions
 Useful when the targets are continuous and not binary
Behavior of Different Activation Functions
[Figure: Behavior of different activation functions for x in [-8, 8]: the bipolar sigmoid y = 2/(1 + e^(-x)) - 1, the scaled arctangent y = 2 tan^(-1)(x)/π, and the non-saturating y = sign(x) log(1 + sign(x) x).]
BP Improvement Heuristics
 Example
 XOR (2-4-1) [Fausett]
 Other functions
 Radial basis functions
BP Improvement Heuristics…
 Maximizing the information content
 Every training pattern presented to the BP algorithm should be chosen on the basis that its information content is the largest possible for the task at hand
 Use of an example that results in the largest training error
 Use of an example that is radically different from all those previously used
 Use of an emphasizing scheme
 Involves presenting more difficult patterns to the network than easy ones
 The degree of difficulty of a pattern can be determined by examining the error it produces compared to previous iterations
 However, the distribution of examples within an epoch presented to the network is distorted
 And the presence of outliers would decrease the generalization capability
BP Improvement Heuristics…
 Data representation
 One factor in the weight correction expression is the activation of the lower (source) unit; weights from units whose activations are zero will not learn
 Thus training can be improved if the input is represented in bipolar form and the bipolar sigmoid is used for the activation function
 Convergence can be improved if the target values are not at the asymptotes of the sigmoid function, as those values can never be reached
BP Improvement Heuristics
Data Representation: Example (XOR) [Fausett]

Data Representation…
 In many applications, the data may be given either by a continuous-valued variable or by a "set of ranges"
 Temperature can be a continuous variable (requiring one continuous output neuron)
 Or it can have ranges defining Frozen, Chilled, Temperate, Hot (requiring 4 neurons with bipolar values)
 In general, it is easier for the net to learn a set of distinct responses than a continuous response
 However, breaking truly continuous data into artificial distinct categories can make it more difficult for the net to learn examples in the border regions of these categories
 Solution: fuzzy logic can help
 Continuous-valued inputs or targets should not be used to represent distinct qualities, such as the letters of the alphabet
BP Improvement Heuristics
 Data normalization (preprocessing)
 Mean removal
 The mean should be made close to zero, or else small in comparison to the standard deviation of the data
 Decorrelation
 Using PCA
 Covariance equalization
 The decorrelated input variables should be scaled so that their covariances are approximately equal
 So that the different synaptic weights in the network learn at approximately the same rate
[Figure: Four scatter plots of the same two-dimensional data (X1 vs. X2) after successive preprocessing steps: (a) original data, (b) mean removal, (c) decorrelation using PCA, (d) covariance equalization and normalization.]
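A sketch of this preprocessing pipeline in NumPy (illustrative; X is assumed to be an (N, d) data matrix with a non-degenerate covariance):

```python
import numpy as np

def preprocess(X):
    """Mean removal, PCA decorrelation, covariance equalization.
    X: (N, d) data matrix; returns the transformed matrix."""
    Xc = X - X.mean(axis=0)                  # (a -> b) mean removal
    cov = np.cov(Xc, rowvar=False)           # sample covariance, (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)   # principal axes of the data
    Xd = Xc @ eigvecs                        # (b -> c) decorrelation: rotate onto the axes
    Xn = Xd / np.sqrt(eigvals)               # (c -> d) equalize the variances to 1
    return Xn
```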
Improving Generalization…
 Cross-validation (early stopping)
 Form two disjoint subsets (one for training and one for cross-validation) from the given data
 Perform weight updates based on the training data
 At intervals during training, compute the error on the cross-validation data set
 Continue training as long as the error on the cross-validation set is not increasing
 When that error starts to increase, the net has begun to memorize the training data set, and training must be stopped
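A sketch of the early stopping loop (reusing the batch_epoch and feed_forward helpers sketched earlier; the checking interval and the SSE error measure are illustrative assumptions):

```python
import numpy as np

def data_sse(X, T, v, w):
    """Sum of squared errors over a data set (feedforward only)."""
    return sum(0.5 * np.sum((t - feed_forward(x, v, w)[1]) ** 2)
               for x, t in zip(X, T))

def train_early_stopping(Xtr, Ttr, Xval, Tval, v, w,
                         alpha=0.1, interval=10, max_epochs=10000):
    best = np.inf
    for epoch in range(max_epochs):
        v, w = batch_epoch(Xtr, Ttr, v, w, alpha)
        if epoch % interval == 0:
            err = data_sse(Xval, Tval, v, w)   # validation error at intervals
            if err > best:                     # rising validation error: stop
                break
            best = err
    return v, w
```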
Improving Generalization…
 Variations on cross-validation
 Hold-one-out (K-fold cross-validation)
 Divide the N examples into K > 1 subsets
 Train on K - 1 subsets and measure the validation error on the held-out subset
 Repeat for all K possible trials
 Average the cross-validation error over the K trials
 Requires a large amount of data
 Leave-one-out
 Train the net on N - 1 examples and validate on the remaining one; average over the N possible trials
 Useful when the amount of training data is small
