Gradient Descent
Optimization algorithms are responsible for reducing the loss and producing the most accurate results possible. The weights are initialized using some initialization strategy and are updated after each epoch according to the update equation. Some of the best results are achieved using Gradient Descent.
The goal of gradient descent is to minimize a given function, which in our case is the loss function of the neural network. To achieve this goal, it performs two steps iteratively: compute the gradient of the function at the current point, then take a step in the direction opposite to the gradient.
So, the idea is to pass the training set through the hidden layers of the neural network and then update the parameters of the layers by computing the gradients of the loss with respect to those parameters.
Think of it like this: suppose a man is at the top of a valley and wants to get to the bottom. He goes down the slope, deciding his next position based on his current position, and stops when he reaches the bottom of the valley. There are different ways in which that man (the weights) can go down the slope.
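The man's descent can be sketched in a few lines of Python on a toy convex function; the function and learning rate below are illustrative assumptions, not something specified in the text:

```python
# Minimize f(w) = (w - 3)^2; its gradient is f'(w) = 2 * (w - 3).
# Both the function and the learning rate are arbitrary choices for illustration.
def grad(w):
    return 2.0 * (w - 3.0)

w = 10.0   # initial position (the top of the valley)
lr = 0.1   # learning rate: how big each step down the slope is

for step in range(100):
    w = w - lr * grad(w)   # move against the gradient

print(round(w, 4))  # converges toward the minimum at 3.0
```

Each iteration is exactly the two steps described above: evaluate the gradient at the current position, then step in the opposite direction.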
Batch Gradient Descent: Batch Gradient Descent involves calculations over the full training set at each step, as a result of which it is very slow on very large training data. Thus, it becomes very computationally expensive to do Batch GD. However, it is great for convex or relatively smooth error manifolds, and Batch GD scales well with the number of features. In the Batch Gradient Descent algorithm, the entire dataset is loaded at a time, which makes it computationally intensive.
Another drawback is that there is a chance the iterates get stuck at a local minimum or saddle point and never converge to the minimum.
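As a sketch of what "calculations over the full training set at each step" means, here is batch gradient descent on a hypothetical linear-regression toy problem; the dataset, model, and learning rate are made-up illustrations, not from the text:

```python
import numpy as np

# Toy dataset (an assumption for illustration): y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 1.0 + 0.01 * rng.standard_normal(200)

w, b = 0.0, 0.0
lr = 0.5

for epoch in range(500):
    pred = w * x + b
    err = pred - y
    # Gradients of the mean squared error, averaged over the ENTIRE
    # training set at every single step -- the defining trait of Batch GD.
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # close to the true values 2 and 1
```

Note that every update touches all 200 examples; with 5 million examples, each of those 500 steps would require 5 million gradient evaluations.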
Suppose our dataset has 5 million examples; then, just to take one step, the model has to calculate the gradients over all 5 million examples. This is not an efficient way to proceed. To tackle this problem we have Stochastic Gradient Descent.
SGD tries to solve the main problem of Batch Gradient Descent, which is the use of the whole training data to calculate gradients at each step. SGD is stochastic in nature, i.e., it picks a "random" instance of the training data at each step and then computes the gradient, making it much faster since there is much less data to manipulate at a single time, unlike Batch GD.
In Stochastic Gradient Descent (SGD), we consider just one example at a time to take
a single step. We do the following steps in one epoch for SGD:
1. Take an example
2. Feed it to Neural Network
3. Calculate its gradient
4. Use the gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the examples in the training dataset
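The five steps above can be sketched as follows, again on a hypothetical linear-regression toy problem (the dataset, model, and learning rate are illustrative assumptions):

```python
import numpy as np

# Same toy dataset as an assumption for illustration: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 1.0 + 0.01 * rng.standard_normal(200)

w, b = 0.0, 0.0
lr = 0.1

for epoch in range(20):
    for i in rng.permutation(len(x)):   # step 1: take one (random) example
        pred = w * x[i] + b             # step 2: feed it to the model
        err = pred - y[i]
        grad_w = 2.0 * err * x[i]       # step 3: gradient from this single example
        grad_b = 2.0 * err
        w -= lr * grad_w                # step 4: update the weights
        b -= lr * grad_b
                                        # step 5: the loop repeats for every example
```

Each update here costs one gradient evaluation instead of a full pass over the dataset, which is exactly why SGD scales to very large training sets.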
Drawbacks compared:
Batch GD: a slow and computationally expensive algorithm.
SGD: faster and less computationally expensive than Batch GD.
2. Exploding and Vanishing Gradients
By the chain rule, the gradients of layers deeper into the network are computed as products of the derivatives of the layers above them. If those derivatives are large, the gradient will increase exponentially as we propagate down the model until it eventually explodes; this is what we call the exploding gradient problem.
The resulting large changes in the model's weights create a very unstable network; at extreme values the weights become so large that they cause overflow, resulting in NaN weight values that can no longer be updated.
On the other hand, the accumulation of small gradients results in a model that is incapable of learning meaningful insights, since the weights and biases of the initial layers, which tend to learn the core features from the input data (X), will not be updated effectively. In the worst-case scenario the gradient will be 0, at which point those layers stop learning altogether; this is the vanishing gradient problem.
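The chain-rule multiplication described above can be illustrated numerically. The per-layer derivative values below are arbitrary choices, picked only to show the two regimes:

```python
# Backpropagation multiplies one local derivative per layer (chain rule).
# If each derivative is below 1, the product shrinks toward 0 (vanishing);
# if each is above 1, the product blows up (exploding).
def end_to_end_gradient(per_layer_derivative, num_layers):
    g = 1.0
    for _ in range(num_layers):
        g *= per_layer_derivative
    return g

print(end_to_end_gradient(0.5, 30))   # ~9.3e-10: vanishing
print(end_to_end_gradient(1.5, 30))   # ~1.9e+05: exploding
```

With 30 layers, even modest per-layer derivatives of 0.5 or 1.5 already produce an end-to-end gradient that is effectively zero or dangerously large.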
How to know?
Exploding Gradients
There are a few subtle signs that you may use to determine whether your model is suffering from the exploding gradient problem:
1. The model is not learning much on the training data, resulting in a poor loss.
2. The model shows large changes in loss on each update due to its instability.
In addition to these subtle signs of exploding gradients, there are some much more transparent signs, for instance:
Model weights grow exponentially and become very large when training the model.
Vanishing Gradient
There are also ways to detect whether your deep network is suffering from the vanishing gradient problem:
The model improves very slowly during the training phase, and it is also possible that training stops very early, meaning that any further training does not improve the model.
The weights closer to the output layer of the model witness more of a change, whereas the layers closer to the input layer change little or not at all.
Model weights shrink exponentially and become very small when training the model.
There are many approaches to addressing exploding and vanishing gradients; this section outlines a few.
Reducing the number of layers: this solution can be used in both scenarios (exploding and vanishing gradients). However, by reducing the number of layers in our network, we give up some of our model's complexity, since having more layers makes the network more capable of representing complicated functions.
Checking for and limiting the size of the gradients while our model trains, known as gradient clipping, is another solution.
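One common way to implement this check is to rescale the gradient whenever its norm exceeds a threshold (clipping by norm); the threshold value below is an arbitrary illustration:

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """Rescale the list of gradient arrays if their overall L2 norm exceeds max_norm."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads

# An exploding gradient gets scaled back before the weight update.
grads = [np.array([30.0, 40.0])]          # L2 norm = 50
clipped = clip_by_norm(grads, max_norm=5.0)
print(clipped[0])                          # [3. 4.], norm reduced to 5
```

The clipped gradient keeps its direction but has a bounded magnitude, so a single extreme update can no longer destabilize the weights.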
3. Weight Initialization
A more careful choice of the random initialization for your network tends to reduce the risk of both vanishing and exploding gradients.