Unit 5
• The study of artificial neural networks (ANNs) has been inspired by the observation that biological learning systems are built of very complex webs of interconnected neurons
• The human information-processing system consists of neurons: basic building-block cells that communicate information to and from various parts of the body
• Simplest model of a neuron: a threshold unit, or processing element (PE)
• It collects inputs and produces an output if the sum of the inputs exceeds an internal threshold value
[Figure: biological neuron, showing the nucleus, cell body, and dendrites]
Examples:
1. Speech phoneme recognition
2. Image classification
3. Financial prediction
• The input to the neural network is a 30x32 grid of pixel intensities obtained from
a forward-pointed camera mounted on the vehicle.
The learning problem is to determine a weight vector that causes the perceptron to produce the correct +1 or −1 output for each of the given training examples.
• If the training examples are not linearly separable, the delta rule converges toward
a best-fit approximation to the target concept.
• The key idea behind the delta rule is to use gradient descent to search the
hypothesis space of possible weight vectors to find the weights that best fit the
training examples.
To understand the delta training rule, consider the task of training an unthresholded perceptron, that is, a linear unit for which the output o is given by

    o(x) = w · x
    E(w) = 1/2 Σ_{d∈D} (t_d − o_d)^2

Where,
• D is the set of training examples,
• t_d is the target output for training example d,
• o_d is the output of the linear unit for training example d,
• E(w) is simply half the squared difference between the target output t_d and the linear unit output o_d, summed over all training examples.
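As a quick check of this definition, here is a short sketch that evaluates E(w) for a linear unit; the toy data and targets below are hypothetical illustrative values, not from the slides.

```python
import numpy as np

# Hypothetical toy data: 4 examples, each with a bias input x0 = 1
# plus two features; targets t_d are +1/-1.
X = np.array([[1.0, 0.5, 1.5],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 2.0]])
t = np.array([1.0, -1.0, 1.0, -1.0])

def error(w, X, t):
    """E(w) = 1/2 * sum over d of (t_d - o_d)^2, with o_d = w . x_d."""
    o = X @ w          # linear-unit outputs for all examples
    return 0.5 * np.sum((t - o) ** 2)

w = np.zeros(3)
print(error(w, X, t))  # every o_d = 0, so E = 1/2 * (1 + 1 + 1 + 1) = 2.0
```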
How to calculate the direction of steepest descent along the error surface?
The direction of steepest descent can be found by computing the derivative of E with respect to each component of the vector w. This vector derivative is called the gradient of E with respect to w, written as

    ∇E(w) = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
The gradient descent training rule is

    Δw = −η ∇E(w)

• Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search.
• The negative sign is present because we want to move the weight vector in the direction that decreases E
• This training rule can also be written in its component form

    Δw_i = η (t − o) x_i

where t, o, and x_i are the target value, unit output, and ith input for the training example in question
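A minimal sketch of batch gradient descent with the delta rule, repeatedly applying Δw_i = η Σ_d (t_d − o_d) x_id; the toy data and the learning rate are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical toy data (bias input x0 = 1 plus two features).
X = np.array([[1.0, 0.5, 1.5],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 2.0]])
t = np.array([1.0, -1.0, 1.0, -1.0])

def E(w):
    return 0.5 * np.sum((t - X @ w) ** 2)

eta = 0.05         # learning rate: small enough for stable descent here
w = np.zeros(3)
e_start = E(w)

for _ in range(200):
    o = X @ w
    w += eta * X.T @ (t - o)   # Delta w_i = eta * sum_d (t_d - o_d) * x_id, all i at once

print(E(w) < e_start)          # the training error decreased
```

Each epoch computes all outputs first and then takes one weight step from the summed gradient, which is what distinguishes this batch rule from the stochastic version described later.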
For stochastic gradient descent, the error is defined separately for each training example d:

    E_d(w) = 1/2 (t_d − o_d)^2

Where, t_d and o_d are the target value and the unit output value for training example d.
• Stochastic gradient descent iterates over the training examples d in D, at each iteration altering the weights according to the gradient with respect to E_d(w)
• The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to our original error function E(w)
• By making the value of η sufficiently small, stochastic gradient descent can be
made to approximate true gradient descent arbitrarily closely
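The incremental version can be sketched like this; the toy data is hypothetical, and η is kept small so that, as the slide notes, the per-example steps approximate true gradient descent.

```python
import numpy as np

# Hypothetical toy data, as above.
X = np.array([[1.0, 0.5, 1.5],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 2.0]])
t = np.array([1.0, -1.0, 1.0, -1.0])

eta = 0.02      # small eta: each per-example step stays close to the batch gradient
w = np.zeros(3)

for _ in range(300):                  # epochs
    for x_d, t_d in zip(X, t):        # update after EVERY example, not once per epoch
        o_d = w @ x_d
        w += eta * (t_d - o_d) * x_d  # Delta w_i = eta * (t_d - o_d) * x_id

final_error = 0.5 * np.sum((t - X @ w) ** 2)
```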
• Sigmoid unit-a unit very much like a perceptron, but based on a smoothed,
differentiable threshold function.
• The sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. In the case of the sigmoid unit, however, the thresholded output is a continuous function of its input.
• More precisely, the sigmoid unit computes its output o as

    o = σ(w · x), where σ(y) = 1 / (1 + e^(−y))
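A minimal sigmoid-unit sketch, assuming plain Python lists for the weight and input vectors:

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y)): a smoothed, differentiable threshold."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(w, x):
    net = sum(wi * xi for wi, xi in zip(w, x))  # linear combination w . x
    return sigmoid(net)                          # continuous output in (0, 1)

print(sigmoid_unit([0.0, 0.0], [3.0, -1.0]))    # net = 0, so the output is 0.5
```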
• The BACKPROPAGATION Algorithm learns the weights for a multilayer network, given a
network with a fixed set of units and interconnections. It employs gradient descent to attempt to
minimize the squared error between the network output values and the target values for these
outputs.
• In the BACKPROPAGATION algorithm, we consider networks with multiple output units rather than single units as before, so we redefine E to sum the errors over all of the network output units:

    E(w) = 1/2 Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)^2

where,
• outputs is the set of output units in the network
• t_kd and o_kd are the target and output values associated with the kth output unit
• d is a training example
• Deriving the stochastic gradient descent rule: Stochastic gradient descent involves
iterating through the training examples one at a time, for each training example d
descending the gradient of the error Ed with respect to this single example
• For each training example d, every weight w_ji is updated by adding to it Δw_ji, where

    Δw_ji = −η ∂E_d/∂w_ji,   with E_d(w) = 1/2 Σ_{k∈outputs} (t_k − o_k)^2

Here outputs is the set of output units in the network, t_k is the target value of unit k for training example d, and o_k is the output of unit k given training example d.
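One stochastic backprop step can be sketched on a tiny 2-2-1 sigmoid network; the weights and the training example below are arbitrary illustrative values, not taken from the slides.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

eta = 0.5
x, t = [1.0, 0.0], 1.0              # one (hypothetical) training example and target
w_hid = [[0.1, 0.2], [0.3, -0.1]]   # w_hid[j][i]: input i -> hidden unit j
w_out = [0.2, -0.3]                 # w_out[j]: hidden unit j -> output unit

def forward(w_hid, w_out, x):
    h = [sigmoid(sum(w_hid[j][i] * x[i] for i in range(len(x))))
         for j in range(len(w_hid))]
    o = sigmoid(sum(w_out[j] * h[j] for j in range(len(h))))
    return h, o

h, o = forward(w_hid, w_out, x)

# Error terms: delta_k = o_k (1 - o_k)(t_k - o_k) for the output unit,
# delta_h = o_h (1 - o_h) * sum_k w_kh * delta_k for each hidden unit.
delta_o = o * (1 - o) * (t - o)
delta_h = [h[j] * (1 - h[j]) * w_out[j] * delta_o for j in range(2)]

# Weight updates: Delta w_ji = eta * delta_j * x_ji
w_out = [w_out[j] + eta * delta_o * h[j] for j in range(2)]
w_hid = [[w_hid[j][i] + eta * delta_h[j] * x[i] for i in range(2)]
         for j in range(2)]

_, o_new = forward(w_hid, w_out, x)  # the output has moved toward the target t = 1
```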
Introduction
Neural network representation
Appropriate problems for Neural Network Learning
Perceptron
Multilayer Networks and the Backpropagation Algorithm
Remarks on the Backpropagation algorithm
Introduction: How can we answer these problems?
Psychological studies:
• How do animals learn, forget, recognize and perform various types of
tasks?
Human Brain Biological neurons:
[Figure: biological neuron, showing dendrites, cell body, axon, and synapse]
What Is an Artificial Neural Network (ANN)?
Structure of ANN:
Topological units of ANN:
A neural network is a mathematical model, inspired by human biological neurons, and is now leading in solving many real-world problems.
ANN topology units include:
• Neurons
• Layers
• Connecting links
• Initial weights (on links)
• Bias and activation function
• Inputs/outputs
Properties of ANNs
Neurons:
(node = unit)
[Figure: a single node, showing input links carrying activations a_j over weights W_j,i into the input function in_i, the activation function g, and the output a_i = g(in_i) sent along the output links]
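The node computation a_i = g(in_i), with in_i = Σ_j W_j,i a_j, can be sketched directly; the hard-threshold activation used here is just one illustrative choice of g.

```python
def node_output(a_in, weights, g):
    """a_i = g(in_i), where in_i = sum_j W_j,i * a_j."""
    in_i = sum(a_j * w_j for a_j, w_j in zip(a_in, weights))
    return g(in_i)

# With a hard threshold as g, the node behaves as a threshold unit:
step = lambda net: 1 if net >= 0 else 0
print(node_output([1.0, 0.5], [2.0, -1.0], step))  # in_i = 1.5, output 1
```

Passing g as a parameter mirrors the figure: the input function and the activation function are separate stages of one node.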
ANN Architecture:
[Figure: a single neuron, showing input values x_1 … x_m weighted by w_1 … w_m, a bias b, a summing function producing the local field v, and an activation function producing the output y]
g = Activation functions for units
Purpose of activation function:
An activation function is added for the hidden and output layers.
The purpose of an activation function is to add some kind of non-linear property to the input.
Without activation functions, the only mathematical operation during forward propagation would be dot products between an input vector and a weight matrix, which is linear.
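A quick sketch of why this matters: two layers with no activation function collapse into a single linear map (the weight shapes below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer 1: 3 inputs -> 4 units
W2 = rng.standard_normal((2, 4))   # layer 2: 4 units -> 2 outputs
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # forward pass with no activation function
one_layer = (W2 @ W1) @ x          # the equivalent single linear layer
print(np.allclose(two_layers, one_layer))  # True: no expressive power was gained
```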
Why a non-linear activation function? (e.g., sigmoid)
• Linear functions always exhibit a constant slope.
• The slope pictures the rate of change of the dependent variable (y) with respect to the independent variable (x).
• Real-world problems emphasize finding points of minimal or maximal change: minimization of loss or maximization of gain.
• With linear activation functions these changes are constant, so points of minimal and maximal change cannot be distinguished.
• Non-linear activation functions have gradients that vary between points.
• Based on these gradients, by descent or ascent, we can address all the minimization and maximization problems.
Why a non-linear activation function? (e.g., sigmoid)
• The non-linear exponential function e^x is unbounded, with range (0, ∞).
• The gradients of e^x likewise lie anywhere in (0, ∞).
• Finding the optimal gradients within this infinite range is quite difficult.
• Hence such functions, though non-linear, cannot be taken as activation functions.
• What is needed instead are activation functions that show active gradients within shorter intervals.
• One such is the sigmoid activation function, which confines the gradients to be analyzed to the small interval (0, 1).
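A small sketch of the sigmoid's gradient behavior; σ'(y) = σ(y)(1 − σ(y)) follows from the definition, and it is largest at y = 0.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_grad(y):
    s = sigmoid(y)
    return s * (1 - s)    # sigma'(y) = sigma(y) * (1 - sigma(y))

print(sigmoid_grad(0.0))  # 0.25, the largest possible slope
print(sigmoid_grad(10.0)) # the gradient fades toward 0 for large |y|
```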
Role of Bias:
Perceptron:
Example network:
Number of training pairs needed?
For example, for ε = 0.1, a net with 80 weights will require 800 training patterns to be assured of getting 90% of the test patterns correct (assuming it got 95% of the training examples correct).
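The arithmetic of the example, using the rule of thumb it implies (patterns on the order of W / ε):

```python
# Rule of thumb implied by the example above: with a tolerated test-error
# fraction eps, a net with W weights needs roughly W / eps training patterns.
W = 80          # number of weights in the example network
eps = 0.1       # tolerated error rate (90% test accuracy)
patterns = W / eps
print(patterns)  # 800.0, matching the slide's figure
```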
How long should a net be trained?
If you train the net for too long, then you run the risk of
overfitting.
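The slide does not prescribe a stopping method; one common guard against overfitting is early stopping on a held-out validation set. Below is a sketch assuming caller-supplied `train_step` and `val_error` callables (a hypothetical interface, not from the slides).

```python
def train_with_early_stopping(train_step, val_error, max_epochs=1000, patience=5):
    """Stop once validation error has not improved for `patience` epochs."""
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        train_step()              # one epoch of weight updates
        err = val_error()         # error on a held-out validation set
        if err < best:
            best, stale = err, 0  # improvement: reset the patience counter
        else:
            stale += 1
            if stale >= patience:
                break             # overfitting suspected: stop training
    return best
```

For example, if the validation errors come out as 5, 4, 3, 2, 1 and then plateau at 2, training stops a few epochs after the minimum and the best recorded error (1) is returned.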
Implementing Backprop – Design Decisions
Backpropagation
Performs gradient descent over entire network weight vector
Convergence of Backpropagation
Back-propagation Using Gradient Descent
Advantages
Relatively simple implementation
Standard method and generally works well
Disadvantages
Slow and inefficient
Can get stuck in local minima, resulting in sub-optimal solutions
Learning rate
Back propagation Example:
The output from the 2nd iteration is 0.5830
Setting the parameter values
COSC4P76 B.Ombuki-Berman
Neural Networks: Advantages
•Distributed representations
•Simple computations
•Parallel processing
Neural Networks: Disadvantages
•Training is slow
•Interpretability is hard
Regression:
Forecasting (prediction based on past history), e.g., predicting the behavior of the stock market
Pattern association:
Retrieve an image from a corrupted one
…
Clustering:
clients' profiles
disease subtypes
…
Applications