Activation Function To Back Propagation
2. Then, for each layer, sum together all the (input * weight) products of its neurons.
• Inputs are fed into neuron 1, neuron 2 and neuron 3 as they belong to the Input Layer.
• Each neuron has a weight associated with it. When an input enters a neuron, the neuron's weight is multiplied by the input.
• For instance, Weight 1 will be applied to the input of Neuron 1. If Weight 1 is 0.8 and the input is 1, then Neuron 1 will compute 0.8:
1 * 0.8 = 0.8
• The sum of (weight * input) over the neurons in a layer is then calculated. As an example, the value calculated at the hidden layer in the figure will be:
(Weight 4 x Input To Neuron 4) + (Weight 5 x Input To Neuron 5)
• Finally, an activation function is applied. The output calculated by the neurons becomes the input to the activation function, which then computes a new output.
• The output of the activation function is then fed to the subsequent layer.
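To make these steps concrete, here is a minimal Python sketch of one neuron's forward computation (weighted sum plus bias, passed through an activation function). The input values, weights, bias, and the choice of a sigmoid activation are illustrative assumptions, not values taken from the figure.

```python
import math

def sigmoid(x):
    # Logistic activation: squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias):
    # Weighted sum of inputs plus bias, passed through the activation function
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

# Example: two inputs feeding one hidden neuron (illustrative numbers)
print(neuron_output(inputs=[1.0, 0.5], weights=[0.8, 0.2], bias=0.0))
```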
Activation Functions in Neural Networks
• An activation function is a mathematical formula (algorithm) that is activated under certain circumstances. When neurons compute the weighted sum of their inputs, the result is passed to the activation function, which checks whether the computed value is above the required threshold.
• If the computed value is above the required threshold, the activation function is activated and an output is computed.
• This output is then passed on to the next or previous layers (depending on the complexity of the network), which can help the neural network alter the weights on its neurons.
• Activation functions are important for learning complicated, non-linear functional mappings between the inputs and the output variable. They introduce non-linear properties to the network.
• Specifically, in a NN we take the sum of products of the inputs (X) and their corresponding weights (W), apply an activation function f(x) to it to get the output of that layer, and feed it as input to the next layer:
Y = f(Σ(input * weight) + bias)
• Here f(x) is the activation function, which can be one of several different mathematical functions.
• Some of the popular activation functions are the sigmoid, tanh, ReLU, and softmax functions.
• Evaluating the tanh function for different values of x gives an S-shaped curve bounded between -1 and 1.
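As a quick illustration (the sample x values below are arbitrary), evaluating tanh at a few points shows how it squashes its input into the range (-1, 1):

```python
import math

# tanh squashes any real input into the open interval (-1, 1)
for x in [-3, -1, 0, 1, 3]:
    print(f"tanh({x:+d}) = {math.tanh(x):+.4f}")
```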
ReLU - Rectified Linear Units
• It's an activation function of the form R(x) = max(0, x),
i.e. if x < 0, R(x) = 0
and if x >= 0, R(x) = x
• Hence, from the mathematical form of this function we can see that it is very simple and efficient.
• It rectifies the vanishing gradient problem.
• But its limitation is that it should only be used within Hidden layers of a Neural Network Model.
• Another problem with ReLU is that some neurons stop updating their weights and never activate again. Simply put, ReLU can result in dead neurons.
• To fix this problem of dying neurons, another modification was introduced, called Leaky ReLU = max(0.1x, x).
It introduces a small slope for negative inputs to keep the updates alive.
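A small sketch of both functions as defined above; the test inputs are arbitrary:

```python
def relu(x):
    # R(x) = max(0, x): positive values pass through, negatives are clipped to 0
    return max(0.0, x)

def leaky_relu(x, slope=0.1):
    # Leaky ReLU = max(slope * x, x): a small slope for negative inputs keeps
    # the gradient (and therefore the weight updates) alive
    return max(slope * x, x)

for x in [-2.0, -0.5, 0.0, 1.5]:
    print(x, relu(x), leaky_relu(x))
```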
Softmax
• The softmax function is also a type of sigmoid function but is useful when we are trying to handle classification problems.
• The sigmoid function is able to handle just two classes. The softmax function would squeeze the outputs for each class
between 0 and 1 and would also divide by the sum of the outputs.
• This essentially gives the probability of the input being in a particular class. It can be defined as:
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
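A minimal sketch of the softmax computation described above; the three raw scores are made-up example values:

```python
import math

def softmax(scores):
    # Shift by the maximum score for numerical stability; softmax is unchanged
    # by a constant shift because it appears in numerator and denominator alike.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores for three classes turned into probabilities that sum to 1
print(softmax([2.0, 1.0, 0.1]))   # ≈ [0.659, 0.242, 0.099]
```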
• Depending upon the properties of the problem, we are able to make a better choice of activation function for quicker convergence of the network.
• Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient problem.
• ReLU function is a general activation function and is used in most cases these days.
• If we encounter a case of dead neurons in our networks the leaky ReLU function is the best choice.
Loss Functions
• After generating a predicted value for the actual output value, the model tries to measure the difference between the actual output value and the predicted output value.
• Different loss functions will give different errors for the same prediction, and thus have a considerable effect
on the performance of the model.
• One of the most widely used loss functions is mean squared error, which calculates the square of the difference between the actual value and the predicted value.
• Different loss functions are used to deal with different types of tasks, i.e. regression and classification.
Error = P – P’
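A minimal sketch of the mean squared error calculation; the actual/predicted values below are illustrative only:

```python
def mean_squared_error(actual, predicted):
    # Average of the squared differences between actual and predicted values
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(mean_squared_error(actual=[0.01, 0.99], predicted=[0.75, 0.77]))
```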
Understand Back-Propagation
• The goal of back-propagation is to reduce the loss function/error of the model and optimize the weights so that the neural
network can learn how to correctly map arbitrary inputs to outputs.
• The current error is typically propagated backwards to a previous layer, where it is used to modify the weights and bias in
such a way that the error is minimized.
• To understand back-propagation, we first need to understand the working of a neural network, so we work through an example.
• In order to have some numbers to work with, here are the initial weights, the biases, and training inputs/outputs:
• The expected output values of the model for o1 and o2 are 0.01 and 0.99, respectively.
The Forward Pass
• In this stage, the neural network tries to predict the expected output values from the initial input values i1 and i2 and the weights and biases given above.
• For this, we take the total net input to each neuron (the inputs times their respective weights, plus the bias) and apply the activation function to it.
• This is how we calculate the total net input for the first hidden-layer neuron h1:
net_h1 = w1 * i1 + w2 * i2 + b1 * 1
net_h1 = 0.15 * 0.05 + 0.2 * 0.1 + 0.35 * 1 = 0.3775
• We then squash this net input with the logistic (sigmoid) activation to get the output of h1:
out_h1 = 1 / (1 + e^(-0.3775)) = 0.593269992
• Carrying out the same process for h2, and then repeating it for the output-layer neurons (using the hidden-layer outputs as their inputs), gives:
out_h2 = 0.596884378
out_o1 = 0.75136507
out_o2 = 0.772928465
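The forward pass can be checked with a short Python sketch. The text spells out only i1, i2, w1, w2 and b1; the remaining parameters (w3, w4, w5-w8 and b2) appear only in the original figure, so the values used below are assumptions chosen to be consistent with the outputs quoted in this example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs and parameters; w3, w4, w5-w8 and b2 are assumed from the original
# figure, chosen so the results match the numbers quoted in the text.
i1, i2 = 0.05, 0.10
w1, w2, w3, w4, b1 = 0.15, 0.20, 0.25, 0.30, 0.35
w5, w6, w7, w8, b2 = 0.40, 0.45, 0.50, 0.55, 0.60

net_h1 = w1 * i1 + w2 * i2 + b1          # 0.3775
out_h1 = sigmoid(net_h1)                 # 0.593269992
net_h2 = w3 * i1 + w4 * i2 + b1          # 0.3925
out_h2 = sigmoid(net_h2)                 # 0.596884378

net_o1 = w5 * out_h1 + w6 * out_h2 + b2  # 1.105905967
out_o1 = sigmoid(net_o1)                 # 0.75136507
net_o2 = w7 * out_h1 + w8 * out_h2 + b2  # 1.224921404
out_o2 = sigmoid(net_o2)                 # 0.772928465

print(out_o1, out_o2)
```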
Calculating the total error
• We calculate the error for each output neuron using the squared error function and sum them to get the total error:
E_total = Σ 1/2 * (target - output)^2
• The target output for o1 is 0.01 but the neural network output 0.75136507, therefore its error is:
E_o1 = 1/2 * (0.01 - 0.75136507)^2 = 0.274811083
• Repeating this process for o2 (target 0.99, output 0.772928465) gives:
E_o2 = 1/2 * (0.99 - 0.772928465)^2 = 0.023560026
• So, the total error of the neural network is the sum of both errors:
E_total = E_o1 + E_o2 = 0.274811083 + 0.023560026 = 0.298371109
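The same error numbers can be reproduced directly from the outputs and targets quoted above:

```python
# Squared error per output neuron, summed to get the total network error
target_o1, target_o2 = 0.01, 0.99
out_o1, out_o2 = 0.75136507, 0.772928465

E_o1 = 0.5 * (target_o1 - out_o1) ** 2   # 0.274811083
E_o2 = 0.5 * (target_o2 - out_o2) ** 2   # 0.023560026
print(E_o1 + E_o2)                       # E_total = 0.298371109
```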
Output Layer
• Consider w5. We want to know how much a change in w5 affects the total error, aka ∂E_total/∂w5. By applying the chain rule:
∂E_total/∂w5 = ∂E_total/∂out_o1 * ∂out_o1/∂net_o1 * ∂net_o1/∂w5
• First, how much does the total error change with respect to out_o1? When we take the partial derivative of the total error with respect to out_o1, the quantity 1/2 * (target_o2 - out_o2)^2 becomes zero, because out_o1 does not affect it, which means we're taking the derivative of a constant, which is zero. So:
∂E_total/∂out_o1 = -(target_o1 - out_o1) = -(0.01 - 0.75136507) = 0.74136507
• Next, how much does the output of o1 change with respect to its total net input? The partial derivative of the logistic function is the output multiplied by 1 minus the output:
∂out_o1/∂net_o1 = out_o1 * (1 - out_o1) = 0.75136507 * (1 - 0.75136507) = 0.186815602
• Finally, how much does the total net input of o1 change with respect to w5? Since net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1, this is simply out_h1:
∂net_o1/∂w5 = out_h1 = 0.593269992
• Now we put all of these values together:
∂E_total/∂w5 = 0.74136507 * 0.186815602 * 0.593269992 = 0.082167041
• To decrease the error, we then subtract this value from the current weight (optionally multiplied by some learning rate, eta, which we'll set to 0.5):
w5_new = w5 - eta * ∂E_total/∂w5 = w5 - 0.5 * 0.082167041
• We repeat this process to get the new weights for w6, w7, and w8:
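The sketch below reproduces the w5 update numerically. The initial value w5 = 0.40 (and, for the other updates, w6 = 0.45, w7 = 0.50, w8 = 0.55) is an assumption taken to be consistent with the rest of the worked numbers, since the figure with the initial weights is not shown in the text:

```python
# Values carried over from the forward pass and error calculation above
out_h1, out_o1, target_o1 = 0.593269992, 0.75136507, 0.01
w5, eta = 0.40, 0.5   # w5 = 0.40 is assumed from the original figure

# Chain rule: dE_total/dw5 = dE_total/dout_o1 * dout_o1/dnet_o1 * dnet_o1/dw5
d_E_d_out_o1 = -(target_o1 - out_o1)       # 0.74136507
d_out_o1_d_net_o1 = out_o1 * (1 - out_o1)  # 0.186815602
d_net_o1_d_w5 = out_h1                     # 0.593269992

d_E_d_w5 = d_E_d_out_o1 * d_out_o1_d_net_o1 * d_net_o1_d_w5
print(d_E_d_w5)                            # 0.082167041
print(w5 - eta * d_E_d_w5)                 # updated w5 ≈ 0.35891648
# The w6, w7 and w8 updates follow the same pattern, using out_h2 and the
# corresponding error terms for o1 and o2.
```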
Hidden Layer
• We continue the backwards pass by calculating new values for w1, w2, w3, and w4.
• So, for w1 we need to calculate how much a change in w1 affects the total error, i.e. ∂E_total/∂w1. By the chain rule:
∂E_total/∂w1 = ∂E_total/∂out_h1 * ∂out_h1/∂net_h1 * ∂net_h1/∂w1
• Starting with ∂E_total/∂out_h1: because out_h1 affects both out_o1 and out_o2, it contributes to both output errors, so:
∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1
• Now that we have ∂E_total/∂out_h1, we need to figure out ∂out_h1/∂net_h1 (again the logistic derivative, out_h1 * (1 - out_h1)) and then ∂net_h1/∂w for each weight.
• We calculate the partial derivative of the total net input to h1 with respect to w1 the same way as we did for the output neuron:
∂net_h1/∂w1 = i1 = 0.05
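Continuing the sketch, the hidden-layer gradient for w1 combines the error signals from both output neurons; as before, w5 = 0.40 and w7 = 0.50 are assumed initial values from the original figure:

```python
# Values from the forward pass above
out_h1 = 0.593269992
out_o1, out_o2 = 0.75136507, 0.772928465
target_o1, target_o2 = 0.01, 0.99
i1, w1, w5, w7, eta = 0.05, 0.15, 0.40, 0.50, 0.5

# Error signal ("delta") at each output neuron
delta_o1 = -(target_o1 - out_o1) * out_o1 * (1 - out_o1)   #  0.138498562
delta_o2 = -(target_o2 - out_o2) * out_o2 * (1 - out_o2)   # -0.038098236

# out_h1 feeds both output neurons, so both contribute to dE_total/dout_h1
d_E_d_out_h1 = delta_o1 * w5 + delta_o2 * w7               # 0.036350306
d_out_h1_d_net_h1 = out_h1 * (1 - out_h1)                  # 0.241300709
d_net_h1_d_w1 = i1                                         # 0.05

d_E_d_w1 = d_E_d_out_h1 * d_out_h1_d_net_h1 * d_net_h1_d_w1
print(w1 - eta * d_E_d_w1)                                 # updated w1 ≈ 0.149780716
```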
• When we fed forward the 0.05 and 0.1 inputs originally, the error on the network was 0.298371109.
• After this first round of back propagation, the total error is now down to 0.291027924.
• It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to
0.0000351085.
• At that point, when we feed forward 0.05 and 0.1, the two output neurons generate 0.015912196 (vs 0.01 target) and
0.984065734 (vs 0.99 target).
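Putting everything together, the sketch below repeats the forward pass and one backpropagation update 10,000 times on this single training example. As before, the parameters not spelled out in the text (w3, w4, w5-w8, b2) are assumed values consistent with the worked numbers, and the bias terms are kept fixed during training:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs, targets and initial parameters (w3, w4, w5-w8 and b2 are assumed
# from the original figure; w1, w2, b1 and the inputs/targets are in the text)
i1, i2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]   # w1 .. w8
b1, b2 = 0.35, 0.60
eta = 0.5

for _ in range(10000):
    # Forward pass
    out_h1 = sigmoid(w[0] * i1 + w[1] * i2 + b1)
    out_h2 = sigmoid(w[2] * i1 + w[3] * i2 + b1)
    out_o1 = sigmoid(w[4] * out_h1 + w[5] * out_h2 + b2)
    out_o2 = sigmoid(w[6] * out_h1 + w[7] * out_h2 + b2)

    # Output-layer error signals
    d_o1 = -(t1 - out_o1) * out_o1 * (1 - out_o1)
    d_o2 = -(t2 - out_o2) * out_o2 * (1 - out_o2)

    # Hidden-layer error signals (computed with the old output-layer weights)
    d_h1 = (d_o1 * w[4] + d_o2 * w[6]) * out_h1 * (1 - out_h1)
    d_h2 = (d_o1 * w[5] + d_o2 * w[7]) * out_h2 * (1 - out_h2)

    # Gradient-descent updates for all eight weights
    w = [
        w[0] - eta * d_h1 * i1, w[1] - eta * d_h1 * i2,
        w[2] - eta * d_h2 * i1, w[3] - eta * d_h2 * i2,
        w[4] - eta * d_o1 * out_h1, w[5] - eta * d_o1 * out_h2,
        w[6] - eta * d_o2 * out_h1, w[7] - eta * d_o2 * out_h2,
    ]

# After training, the outputs should be close to the 0.01 / 0.99 targets
out_h1 = sigmoid(w[0] * i1 + w[1] * i2 + b1)
out_h2 = sigmoid(w[2] * i1 + w[3] * i2 + b1)
print(sigmoid(w[4] * out_h1 + w[5] * out_h2 + b2))   # ≈ 0.0159
print(sigmoid(w[6] * out_h1 + w[7] * out_h2 + b2))   # ≈ 0.9841
```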