
Neural Networks: From Activation Function To Back Propagation

Main Components of Neural Networks:

• Neurons: a set of functions

• Layers: groupings of neurons

• Weights & Biases: numerical values

• Activation Function: mathematical function applied to get outputs

Working of Neural Networks:

The computation in a neural network is based on three main steps:

1. For each neuron in a layer, multiply the input by the weight.

2. Then, for each layer, sum all the (input) x (weight) products of its neurons together and add the bias.

3. Finally, apply the activation function to that sum to compute the new output.


Understanding The Process
• Each neuron takes in an input as shown in the image below.

• Inputs are fed into neuron 1, neuron 2 and neuron 3 as they belong to the Input Layer.

• Each neuron has a weight associated with it. When an input enters a neuron, the input is multiplied by that neuron’s weight.

• For instance, weight 1 will be applied to the input of Neuron 1. If weight 1 is 0.8 and the input is 1, then Neuron 1 computes 0.8:
1 * 0.8 = 0.8
• The sum of (weight x input) over the neurons in a layer is then calculated. As an example, the calculated value on the hidden layer in the image will be:
(Weight 4 x Input To Neuron 4) + (Weight 5 x Input To Neuron 5)

• Finally, an activation function is applied. The output calculated by the neurons becomes the input to the activation function, which then computes a new output.

• Assume, the activation function is: If (input > 1) Then 0 Else 1

• The output from activation function is then fed to the subsequent layers.
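
The three steps above can be sketched in a few lines of Python. This is only an illustration: the function names neuron_output and step_activation are hypothetical, and the activation rule is copied as it is assumed in the text (If input > 1 Then 0 Else 1).

```python
def step_activation(x):
    # Rule assumed in the text: If (input > 1) Then 0 Else 1
    return 0 if x > 1 else 1


def neuron_output(inputs, weights, activation):
    # Steps 1 and 2: multiply each input by its weight and sum the products.
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    # Step 3: apply the activation function to the sum.
    return activation(weighted_sum)


# Example from the text: an input of 1 with a weight of 0.8 gives 1 * 0.8.
single = sum(i * w for i, w in zip([1.0], [0.8]))
print(single)

# 0.8 is not greater than 1, so the assumed rule returns 1.
print(neuron_output([1.0], [0.8], step_activation))
```

A real network would also add a bias to the weighted sum, as in the formula Y = f(Σ(inputs * weights) + bias).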
Activation Functions in Neural Networks
• An activation function is a mathematical function applied to a neuron’s weighted sum of inputs. When a neuron computes the weighted sum of its inputs, the result is passed to the activation function, which determines the neuron’s output.
• For a threshold-type activation function, the neuron is activated and an output is computed only if the computed value is above the required threshold.
• This output is then passed on to the next layer. During training, the resulting error is propagated back to the previous layers, which lets the network alter the weights on its neurons.
• Activation functions are important for learning complicated, non-linear mappings between the inputs and the output variable. They introduce non-linear properties to the network.
• Specifically, in a neural network we take the sum of products of the inputs (X) and their corresponding weights (W), add a bias, and apply an activation function f(x) to it to get the output of that layer, which is fed as an input to the next layer.
Y = f(Σ(inputs * weights) + bias)
• Here f(x) is the activation function, which can be any of several mathematical functions.
• Some of the popular activation functions are:

• Sigmoid function/ Logistic function


• Tanh function
• ReLU function
• Softmax function
Sigmoid Function/Logistic Function
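• It is an activation function of the form f(x) = 1 / (1 + exp(-x)).
• Its range is between 0 and 1.
• Sigmoids saturate and kill gradients: for inputs of large magnitude the curve flattens, so its derivative is close to zero.
• Plotting the sigmoid function for different values of x gives an S-shaped curve that rises from 0 to 1.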
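
As a rough illustration of the saturation point, the sketch below evaluates the sigmoid and its derivative at a few values of x. The derivative used here is the standard identity f'(x) = f(x) * (1 - f(x)); the function names and the sample points are chosen only for this example.

```python
import math


def sigmoid(x):
    # f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    # Standard identity: f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks towards 0 as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.6f}  gradient={sigmoid_derivative(x):.6f}")
```

The very small gradients at x = -10 and x = 10 are what "saturate and kill gradients" refers to.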
Tanh function

• Its mathematical formula is f(x) = (exp(2x) - 1) / (exp(2x) + 1).

• Its output is zero-centered because its range is between -1 and 1,

i.e. -1 < output < 1

• Hence optimization is easier with tanh than with sigmoid.

• But it still suffers from the vanishing gradient problem.

• Plotting the tanh function for different values of x gives an S-shaped curve that passes through the origin.
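
A small check, assuming nothing beyond Python’s standard math module, that the formula above agrees with the built-in math.tanh. The helper name tanh_from_formula is hypothetical.

```python
import math


def tanh_from_formula(x):
    # f(x) = (exp(2x) - 1) / (exp(2x) + 1)
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)


# Compare the formula with the library implementation at a few points.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, tanh_from_formula(x), math.tanh(x))
```

For very large x the exp(2x) term can overflow, so library code normally calls math.tanh directly; the formula is written out here only to mirror the slide.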
ReLU - Rectified Linear Unit
• It is an activation function of the form R(x) = max(0, x),
i.e. if x < 0, R(x) = 0
and if x >= 0, R(x) = x
• As the mathematical form shows, it is very simple and efficient to compute.
• It mitigates the vanishing gradient problem, because its gradient is 1 for every positive input and does not saturate.
• Its limitation is that it should only be used within the hidden layers of a neural network model.
• Another problem with ReLU is that some neurons can get stuck outputting 0. Their gradient is then 0, so their weights are no longer updated and they never activate again. In short, ReLU can result in dead neurons.
• To fix the problem of dying neurons, a modification called Leaky ReLU was introduced: Leaky ReLU(x) = max(0.1x, x).
It introduces a small slope for negative inputs to keep the updates alive.
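
Both variants are short enough to write out directly. In this sketch the slope of 0.1 follows the text; other sources often use smaller slopes, and the function names are chosen for this example.

```python
def relu(x):
    # R(x) = max(0, x)
    return max(0.0, x)


def leaky_relu(x, slope=0.1):
    # max(slope * x, x) keeps a small negative output for x < 0.
    # This form assumes 0 < slope < 1.
    return max(slope * x, x)


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), leaky_relu(x))
```

For negative inputs relu returns 0 (zero gradient, hence the dead-neuron risk), while leaky_relu still returns a small non-zero value.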
Softmax

• The softmax function generalizes the sigmoid function to more than two classes, which makes it useful when we are trying to handle multi-class classification problems.
• The sigmoid function is able to handle just two classes. The softmax function squeezes the output for each class between 0 and 1 by exponentiating each output and dividing by the sum of the exponentiated outputs, so the results add up to 1.
• This essentially gives the probability of the input being in a particular class. It can be defined as:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

• For example, suppose we have the outputs [1.2, 0.9, 0.75]. When we apply the softmax function we get [0.42, 0.31, 0.27]. We can now use these as the probabilities for the value to be in each class.
• The softmax function is ideally used in the output layer of a classifier, where we are trying to obtain the probabilities that define the class of each input.
• For regression, we can use a linear function at the output layer.
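
The softmax example with the outputs [1.2, 0.9, 0.75] can be reproduced with a few lines of Python. This is a minimal sketch; subtracting the maximum score is a common numerical-stability step that is not mentioned in the text and does not change the result.

```python
import math


def softmax(scores):
    # Shift by the maximum so that exp() cannot overflow; the ratios are unchanged.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.2, 0.9, 0.75])
print([round(p, 2) for p in probs])  # [0.42, 0.31, 0.27]
print(sum(probs))
```

The rounded values match the ones quoted above, and the probabilities sum to 1 up to floating-point error.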
Choosing the right Activation Function

• Depending on the properties of the problem, we can make a better choice of activation function for quick convergence of the network.

• Sigmoid functions generally work better in the case of classifiers.

• Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient problem.

• ReLU function is a general activation function and is used in most cases these days.

• If we encounter dead neurons in our network, the leaky ReLU function is a good choice.
 

Error and Loss Function

• After generating a predicted value, the model measures the difference between the actual output and the predicted output value.

• This difference is known as the error; the function that computes it is the loss function.

• Different loss functions will give different errors for the same prediction, and thus have a considerable effect on the performance of the model.

• One of the most widely used loss functions is mean squared error, which averages the squared differences between the actual values and the predicted values.

• Different loss functions are used to deal with different types of tasks, e.g. regression and classification.

Error = P - P’ (P: actual value, P’: predicted value)
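
A minimal sketch of mean squared error in Python. The sample values below are made up purely for illustration and do not come from the slides.

```python
def mean_squared_error(actual, predicted):
    # Average of the squared differences between actual and predicted values.
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)


actual = [1.0, 2.0, 3.0]      # hypothetical target values
predicted = [1.5, 2.0, 2.0]   # hypothetical model outputs

# Squared differences are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3.
print(mean_squared_error(actual, predicted))
```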
Understand Back-Propagation
• The goal of back-propagation is to reduce the loss function/error of the model and  optimize the weights so that the neural
network can learn how to correctly map arbitrary inputs to outputs.
• The current error is typically propagated backwards to a previous layer, where it is used to modify the weights and bias in
such a way that the error is minimized.
• To understand back-propagation, we first need to understand how a neural network works, so we take one example: a network with two inputs (i1, i2), two hidden neurons (h1, h2) and two output neurons (o1, o2).
• In order to have some numbers to work with, here are the initial weights, the biases, and the training inputs/outputs:

• The inputs i1 and i2 are 0.05 and 0.10 respectively.

• The weights w1 to w8 are 0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55 respectively.

• The biases b1 and b2 are 0.35 and 0.60.

• The expected output values of the model for o1 and o2 are 0.01 and 0.99.
The Forward Pass

• In this stage the neural network computes its prediction from the initial input values i1 and i2 and the weights and biases given above.

• For each neuron we take the inputs, multiply them by their respective weights, add the bias, and apply the activation function to the result.

• Here we use the sigmoid function.

• This is how we calculate the total net input for the first hidden layer neuron h1:
net_h1 = w1 * i1 + w2 * i2 + b1 * 1
net_h1 = 0.15 * 0.05 + 0.2 * 0.1 + 0.35 * 1 = 0.3775

• Then we apply the sigmoid function to get the output of h1:

out_h1 = 1 / (1 + e^(-net_h1)) = 1 / (1 + e^(-0.3775)) = 0.593269992

• We perform the same process for h2:

out_h2 = 0.596884378
• We repeat the same process for the output layer, using the hidden layer outputs as inputs. For o1:

net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1

net_o1 = 0.4 * 0.593269992 + 0.45 * 0.596884378 + 0.6 * 1 = 1.105905967

out_o1 = 1 / (1 + e^(-net_o1)) = 1 / (1 + e^(-1.105905967)) = 0.75136507

• We find o2 by applying the same process:

out_o2 = 0.772928465
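
The forward pass can be reproduced with the sketch below. The wiring is an assumption, because the network diagram is not reproduced here: w3 and w4 are taken to feed h2, and w7 and w8 to feed o2. With that wiring the results match the values quoted above; note that 0.772928465 is the output of o2 after the sigmoid, not its net input.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Inputs, weights and biases from the example.
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Hidden layer (assumed wiring: w1, w2 -> h1 and w3, w4 -> h2).
net_h1 = w1 * i1 + w2 * i2 + b1
net_h2 = w3 * i1 + w4 * i2 + b1
out_h1 = sigmoid(net_h1)
out_h2 = sigmoid(net_h2)

# Output layer (assumed wiring: w5, w6 -> o1 and w7, w8 -> o2).
net_o1 = w5 * out_h1 + w6 * out_h2 + b2
net_o2 = w7 * out_h1 + w8 * out_h2 + b2
out_o1 = sigmoid(net_o1)
out_o2 = sigmoid(net_o2)

print(f"out_h1={out_h1:.9f} out_h2={out_h2:.9f}")
print(f"out_o1={out_o1:.9f} out_o2={out_o2:.9f}")
```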
Calculating the total error
• We calculate the error for each output neuron using the squared error function and sum them to get the total error:

E_total = Σ 1/2 (target - output)^2

• The target output for o1 is 0.01 but the neural network outputs 0.75136507, therefore its error is:

E_o1 = 1/2 (target_o1 - out_o1)^2 = 1/2 (0.01 - 0.75136507)^2 = 0.274811083

• We repeat the same process to find the error of o2:

E_o2 = 0.023560026

• So the total error of the neural network is the sum of both errors:

E_total = E_o1 + E_o2 = 0.274811083 + 0.023560026 = 0.298371109
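
The error calculation is easy to check numerically. The sketch below plugs in the outputs quoted in the forward pass (rounded as printed there), so the last digits may differ slightly from a full-precision run.

```python
targets = [0.01, 0.99]
outputs = [0.75136507, 0.772928465]  # out_o1 and out_o2 from the forward pass

# Squared error per output neuron: 1/2 * (target - output)^2
errors = [0.5 * (t - o) ** 2 for t, o in zip(targets, outputs)]
e_total = sum(errors)

print(f"E_o1={errors[0]:.9f} E_o2={errors[1]:.9f} E_total={e_total:.9f}")
```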


The Backwards Pass
• Our goal with back propagation is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and for the network as a whole.

Output Layer

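• Consider w5. We want to know how much a change in w5 affects the total error, i.e. ∂E_total/∂w5.

• This is known as the partial derivative of E_total with respect to w5.

• By applying the chain rule we can expand this derivative as:

∂E_total/∂w5 = ∂E_total/∂out_o1 * ∂out_o1/∂net_o1 * ∂net_o1/∂w5

• Visually, here’s what we’re doing: [figure]

• Now we need to find the value of each factor in this equation.
• First, how much does the total error change with respect to the output out_o1?

E_total = 1/2 (target_o1 - out_o1)^2 + 1/2 (target_o2 - out_o2)^2
∂E_total/∂out_o1 = -(target_o1 - out_o1) = -(0.01 - 0.75136507) = 0.74136507

• When we take the partial derivative of the total error with respect to out_o1, the quantity 1/2 (target_o2 - out_o2)^2 becomes zero because out_o1 does not affect it, which means we’re taking the derivative of a constant, which is zero.

• Next, how much does the output of o1 change with respect to its total net input? The partial derivative of the logistic function is the output multiplied by 1 minus the output:

∂out_o1/∂net_o1 = out_o1 * (1 - out_o1) = 0.75136507 * (1 - 0.75136507) = 0.186815602

• Finally, how much does the total net input of o1 change with respect to w5?

net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1
∂net_o1/∂w5 = out_h1 = 0.593269992

• Now we put all the values together:

∂E_total/∂w5 = 0.74136507 * 0.186815602 * 0.593269992 = 0.082167041

• To decrease the error, we then subtract this value from the current weight (optionally multiplied by some learning rate, eta, which we’ll set to 0.5):

w5_new = w5 - eta * ∂E_total/∂w5 = 0.4 - 0.5 * 0.082167041 = 0.35891648

• We repeat this process to get the new weights for w6, w7, and w8:

w6_new = 0.408666186
w7_new = 0.511301270
w8_new = 0.561370121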
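
The chain rule for the output layer can be sketched as follows. The forward-pass values are hard-coded from the text, and the wiring (w5 and w6 into o1, w7 and w8 into o2) is an assumption, since the diagram is not reproduced here. The name output_delta is hypothetical.

```python
eta = 0.5  # learning rate used in the example

# Values quoted in the forward pass.
out_h1, out_h2 = 0.593269992, 0.596884378
out_o1, out_o2 = 0.75136507, 0.772928465
target_o1, target_o2 = 0.01, 0.99
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55


def output_delta(target, out):
    # dE/dout * dout/dnet = -(target - out) * out * (1 - out)
    return -(target - out) * out * (1.0 - out)


delta_o1 = output_delta(target_o1, out_o1)
delta_o2 = output_delta(target_o2, out_o2)

# dnet/dw is the hidden output that the weight multiplies.
grad_w5 = delta_o1 * out_h1
grad_w6 = delta_o1 * out_h2
grad_w7 = delta_o2 * out_h1
grad_w8 = delta_o2 * out_h2

new_w5 = w5 - eta * grad_w5
new_w6 = w6 - eta * grad_w6
new_w7 = w7 - eta * grad_w7
new_w8 = w8 - eta * grad_w8

print(f"dE/dw5={grad_w5:.9f}")
print(f"w5={new_w5:.9f} w6={new_w6:.9f} w7={new_w7:.9f} w8={new_w8:.9f}")
```

Note that w7 and w8 increase, because out_o2 is below its target of 0.99 and its gradient is therefore negative.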
Hidden Layer
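• We continue the backwards pass by calculating new values for w1, w2, w3, and w4.
• For w1, we need to calculate:

∂E_total/∂w1 = ∂E_total/∂out_h1 * ∂out_h1/∂net_h1 * ∂net_h1/∂w1

• Visually, we are doing: [figure]

• We’re going to use a similar process as we did for the output layer, but slightly different to account for the fact that the output of each hidden layer neuron contributes to the output (and therefore error) of multiple output neurons.
• We know that out_h1 affects both out_o1 and out_o2, therefore ∂E_total/∂out_h1 needs to take into consideration its effect on both output neurons:

∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1

• Starting with ∂E_o1/∂out_h1:

∂E_o1/∂out_h1 = ∂E_o1/∂net_o1 * ∂net_o1/∂out_h1

• We can calculate ∂E_o1/∂net_o1 using values we calculated earlier:

∂E_o1/∂net_o1 = ∂E_o1/∂out_o1 * ∂out_o1/∂net_o1 = 0.74136507 * 0.186815602 = 0.138498562

• And ∂net_o1/∂out_h1 is equal to w5:

net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1
∂net_o1/∂out_h1 = w5 = 0.40

• Bringing all the values together:

∂E_o1/∂out_h1 = 0.138498562 * 0.40 = 0.055399425

• Following the same process for ∂E_o2/∂out_h1, we get:

∂E_o2/∂out_h1 = -0.019049119

• So,

∂E_total/∂out_h1 = 0.055399425 + (-0.019049119) = 0.036350306

• Now that we have ∂E_total/∂out_h1, we need to figure out ∂out_h1/∂net_h1 and then ∂net_h1/∂w for each weight:

∂out_h1/∂net_h1 = out_h1 * (1 - out_h1) = 0.593269992 * (1 - 0.593269992) = 0.241300709

• We calculate the partial derivative of the total net input to h1 with respect to w1 the same way as we did for the output neuron:

net_h1 = w1 * i1 + w2 * i2 + b1 * 1
∂net_h1/∂w1 = i1 = 0.05

• Putting it all together:

∂E_total/∂w1 = 0.036350306 * 0.241300709 * 0.05 = 0.000438568

• We can now update w1:

w1_new = w1 - eta * ∂E_total/∂w1 = 0.15 - 0.5 * 0.000438568 = 0.149780716

• Repeating this for w2, w3, and w4:

w2_new = 0.19956143
w3_new = 0.24975114
w4_new = 0.29950229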
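
The hidden-layer step can be checked with the sketch below. It hard-codes the forward-pass values from the text and assumes the same wiring as before (w1, w2 into h1; w3, w4 into h2; w5, w6 into o1; w7, w8 into o2). The gradients are computed with the original values of w5 to w8, not the already updated ones.

```python
eta = 0.5
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55  # original values, not yet updated
out_h1, out_h2 = 0.593269992, 0.596884378
out_o1, out_o2 = 0.75136507, 0.772928465
target_o1, target_o2 = 0.01, 0.99

# dE/dnet for each output neuron.
delta_o1 = -(target_o1 - out_o1) * out_o1 * (1.0 - out_o1)
delta_o2 = -(target_o2 - out_o2) * out_o2 * (1.0 - out_o2)

# Each hidden output affects both output neurons, so the effects are summed.
dE_dout_h1 = delta_o1 * w5 + delta_o2 * w7
dE_dout_h2 = delta_o1 * w6 + delta_o2 * w8

# Multiply by the sigmoid derivative of each hidden neuron.
delta_h1 = dE_dout_h1 * out_h1 * (1.0 - out_h1)
delta_h2 = dE_dout_h2 * out_h2 * (1.0 - out_h2)

# dnet/dw is the input that the weight multiplies.
grad_w1 = delta_h1 * i1
new_w1 = w1 - eta * grad_w1
new_w2 = w2 - eta * delta_h1 * i2
new_w3 = w3 - eta * delta_h2 * i1
new_w4 = w4 - eta * delta_h2 * i2

print(f"dE/dout_h1={dE_dout_h1:.9f} dE/dw1={grad_w1:.9f}")
print(f"w1={new_w1:.9f} w2={new_w2:.9f} w3={new_w3:.9f} w4={new_w4:.9f}")
```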


• Finally, we’ve updated all of our weights in the model.

• When we fed forward the 0.05 and 0.1 inputs originally, the error on the network was 0.298371109.

• After this first round of back propagation, the total error is now down to 0.291027924.

• It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to
0.0000351085.

• At that point, when we feed forward 0.05 and 0.1, the two output neurons generate 0.015912196 (vs 0.01 target) and 0.984065734 (vs 0.99 target).
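
The whole procedure can be put into a small training loop. This is a sketch under two assumptions that the slides do not state explicitly: the wiring of w1 to w8 is as in the earlier calculations, and only the weights are updated while the biases b1 and b2 stay fixed. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w, i1, i2, b1, b2):
    out_h1 = sigmoid(w[0] * i1 + w[1] * i2 + b1)
    out_h2 = sigmoid(w[2] * i1 + w[3] * i2 + b1)
    out_o1 = sigmoid(w[4] * out_h1 + w[5] * out_h2 + b2)
    out_o2 = sigmoid(w[6] * out_h1 + w[7] * out_h2 + b2)
    return out_h1, out_h2, out_o1, out_o2


def total_error(w, i1, i2, b1, b2, t1, t2):
    _, _, out_o1, out_o2 = forward(w, i1, i2, b1, b2)
    return 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2


def train_step(w, i1, i2, b1, b2, t1, t2, eta):
    out_h1, out_h2, out_o1, out_o2 = forward(w, i1, i2, b1, b2)
    # Output-layer deltas: dE/dnet for o1 and o2.
    d_o1 = -(t1 - out_o1) * out_o1 * (1.0 - out_o1)
    d_o2 = -(t2 - out_o2) * out_o2 * (1.0 - out_o2)
    # Hidden-layer deltas use the weights from before this update.
    d_h1 = (d_o1 * w[4] + d_o2 * w[6]) * out_h1 * (1.0 - out_h1)
    d_h2 = (d_o1 * w[5] + d_o2 * w[7]) * out_h2 * (1.0 - out_h2)
    grads = [d_h1 * i1, d_h1 * i2, d_h2 * i1, d_h2 * i2,
             d_o1 * out_h1, d_o1 * out_h2, d_o2 * out_h1, d_o2 * out_h2]
    # Gradient descent on the weights only (biases are held fixed here).
    return [wi - eta * g for wi, g in zip(w, grads)]


i1, i2, b1, b2 = 0.05, 0.10, 0.35, 0.60
t1, t2, eta = 0.01, 0.99, 0.5
weights = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]

initial_error = total_error(weights, i1, i2, b1, b2, t1, t2)
weights = train_step(weights, i1, i2, b1, b2, t1, t2, eta)
error_after_one = total_error(weights, i1, i2, b1, b2, t1, t2)

# Repeat the same update until 10,000 steps have been taken in total.
for _ in range(9999):
    weights = train_step(weights, i1, i2, b1, b2, t1, t2, eta)
final_error = total_error(weights, i1, i2, b1, b2, t1, t2)
final_outputs = forward(weights, i1, i2, b1, b2)[2:]

print(f"initial={initial_error:.9f} after one step={error_after_one:.9f}")
print(f"after 10,000 steps={final_error:.10f} outputs={final_outputs}")
```

Under these assumptions, the printed errors can be compared against the figures quoted above; if biases were also trained, the numbers would differ.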
