Unit 2
Feedforward networks are so named because all information flows in a forward direction only. The data enters at the input nodes, travels through the hidden layers, and eventually exits at the output nodes. The network has no links that would allow information leaving the output nodes to be fed back into the network.
A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that results in the best function approximation.
A feedforward neural network serves as the basis of many everyday applications, such as the image recognition used in the Google Photos app.
Input layer
It contains the neurons that receive the input. The data is subsequently passed on to the next layer. The total number of neurons in the input layer is equal to the number of features in the input data.
Hidden layer
This is the intermediate layer, which is concealed between the input and output layers. This layer has a large number of neurons that perform transformations on the inputs before passing them on to the output layer.
Output layer
It is the last layer, and its size depends on the model's construction. The output layer represents the predicted feature, since you know the desired outcome.
Neuron weights
Weights describe the strength of a connection between neurons. Small adjustments to the weights and biases produce only small changes in the classified output; learning therefore consists of finding a method for improving performance through such minor adjustments.
a = σ(Wx + b)
Where,
b = biases
W = weights
a = output vector
x = input
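As a small illustration of one layer's computation (a minimal sketch assuming the formula a = σ(Wx + b) above with σ the sigmoid; the weights, biases, and inputs simply reuse the numbers of the backpropagation example later in this unit):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: 2 inputs feeding 2 neurons in one layer.
x = np.array([0.05, 0.10])                   # input vector
W = np.array([[0.15, 0.20], [0.25, 0.30]])   # one row of weights per neuron
b = np.array([0.35, 0.35])                   # biases
a = sigmoid(W @ x + b)                       # output vector a = sigma(Wx + b)
print(a)  # [0.59326999 0.59688438]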
A neural network's loss function is used to determine whether the learning process needs to be adjusted. The output layer contains as many neurons as there are classes, and the loss measures the difference between the network's outputs and the target values.
The Gradient Descent algorithm repeatedly calculates the next point using the gradient at the current location, scales it (by a learning rate), and subtracts the obtained value from the current position (takes a step). It subtracts the value because we want to decrease the function (to maximize it, we would add instead).
There is a crucial parameter η which scales the gradient and hence controls the step size. In machine learning it is termed the learning rate, and it has a substantial effect on performance.
• The smaller the learning rate, the longer gradient descent takes to converge, or it may reach the maximum number of iterations before finding the optimum.
• If the learning rate is too large, the algorithm may not converge to the optimal point and can even diverge.
The algorithm can be summarized in the following steps:
1. choose a starting point (initialization)
2. calculate the gradient at this point
3. make a scaled step in the opposite direction of the gradient (objective: minimize)
4. repeat steps 2 and 3 until a stopping criterion is met (e.g., a maximum number of iterations or a sufficiently small step)
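The following minimal Python sketch illustrates these steps on the simple function f(x) = x²; the function, starting point, and learning rate are illustrative assumptions, not values from the text.

# Minimal gradient descent sketch for f(x) = x**2 (illustrative example).
def gradient_descent(grad, start, learning_rate=0.1, n_steps=50, tol=1e-8):
    x = start
    for _ in range(n_steps):
        step = learning_rate * grad(x)   # scale the gradient by the learning rate
        x = x - step                     # step opposite to the gradient (minimize)
        if abs(step) < tol:              # stop when the steps become negligible
            break
    return x

# f(x) = x**2 has gradient 2x and its minimum at x = 0.
print(gradient_descent(grad=lambda x: 2 * x, start=5.0))  # approaches 0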
The main features of backpropagation are the iterative, recursive, and efficient method by which it calculates the updated weights to improve the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.
Now, how is the error function used in backpropagation, and how does backpropagation work? Let us start with an example and work through it mathematically to understand exactly how backpropagation updates the weights.
Input values
X1=0.05
X2=0.10
Initial weights
w1=0.15 w5=0.40
w2=0.20 w6=0.45
w3=0.25 w7=0.50
w4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1, we multiply the inputs by the corresponding weights and add the bias:
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
We then pass H1 and H2 through the sigmoid activation function:
out_H1=1/(1+e^(−0.3775))=0.593269992
out_H2=1/(1+e^(−0.3925))=0.596884378
To find the value of y1, we multiply the hidden-layer outputs out_H1 and out_H2 by the corresponding weights and add the bias:
y1=out_H1×w5+out_H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
y2=out_H1×w7+out_H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214
Passing y1 and y2 through the sigmoid gives the final outputs out_y1=1/(1+e^(−1.10590597))=0.75136507 and out_y2=1/(1+e^(−1.2249214))=0.772928465. Our target values are 0.01 and 0.99, so out_y1 and out_y2 do not yet match the targets T1 and T2.
Now, we will find the total error, which is simply the sum of the squared differences between the outputs and the target outputs:
Etotal = Σ ½(target − output)² = ½(0.01 − 0.75136507)² + ½(0.99 − 0.772928465)² = 0.274811083 + 0.023560026 = 0.298371109
Now, we will backpropagate this error to update the weights using a backward pass.
From equation (2), it is clear that we cannot partially differentiate it with respect to w5, because w5 does not appear in it. We therefore split equation (1) into multiple terms using the chain rule so that we can easily differentiate it with respect to w5:
∂Etotal/∂w5 = ∂Etotal/∂out_y1 × ∂out_y1/∂y1 × ∂y1/∂w5
Now, we calculate each term one by one to differentiate Etotal with respect to w5. Putting the value of e^(−y) into equation (5), and then substituting these values into equation (3), gives the final result.
Now, we will calculate the updated weight w5new with the help of the following formula (using a learning rate η = 0.5):
w5new = w5 − η × ∂Etotal/∂w5
In the same way, we calculate w6new, w7new, and w8new, and this gives us the following values:
w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121
From equation (2), it is clear that we cannot partially differentiate it with respect to w1, because w1 does not appear in it. We therefore split equation (1) into multiple terms using the chain rule so that we can easily differentiate it with respect to w1:
∂Etotal/∂w1 = ∂Etotal/∂out_H1 × ∂out_H1/∂H1 × ∂H1/∂w1
Now, we calculate each term one by one to differentiate Etotal with respect to w1. We find the value of ∂Etotal/∂out_H1 by putting values into equations (18) and (19), since out_H1 affects both output neurons. We calculate the partial derivative of the total net input to H1 with respect to w1 in the same way as we did for the output neuron. Finally, we put these values into equation (13) to find the final result.
Now, we will calculate the updated weight w1new with the help of the following formula:
w1new = w1 − η × ∂Etotal/∂w1
In the same way, we calculate w2new, w3new, and w4new, and this gives us the following values:
w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229
We have now updated all the weights. When we initially fed forward the inputs 0.05 and 0.1, the error on the network was 0.298371109. After the first round of backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error falls to 0.0000351085. At this point, when we feed forward 0.05 and 0.1, the output neurons generate 0.015912196 and 0.984065734, i.e., values close to our targets of 0.01 and 0.99.
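The numbers in this worked example can be verified with a short script. The sketch below assumes the sigmoid activation and a learning rate of 0.5 (consistent with the updated weights shown above) and performs one forward pass plus the backward-pass updates for w5 and w1.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs, initial weights, biases and targets from the worked example.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99
eta = 0.5  # learning rate (assumed; it reproduces the updated weights above)

# Forward pass
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)        # 0.593269992
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)        # 0.596884378
o1 = sigmoid(h1 * w5 + h2 * w6 + b2)        # 0.751365070
o2 = sigmoid(h1 * w7 + h2 * w8 + b2)        # 0.772928465
e_total = 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2
print(e_total)                              # 0.298371109

# Backward pass: output-layer weight w5
delta_o1 = (o1 - t1) * o1 * (1 - o1)        # dEtotal/dy1 via the chain rule
print(w5 - eta * delta_o1 * h1)             # w5new = 0.358916480

# Backward pass: hidden-layer weight w1
delta_o2 = (o2 - t2) * o2 * (1 - o2)
d_e_d_h1 = delta_o1 * w5 + delta_o2 * w7    # error reaching H1 from both outputs
print(w1 - eta * d_e_d_h1 * h1 * (1 - h1) * x1)  # w1new = 0.149780716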
Nesterov Accelerated Gradient (NAG) resolves the overshooting problem of momentum-based gradient descent by adding a look-ahead term to the update equation. The intuition behind NAG can be summarized as 'look before you leap'. Let us try to understand this through an example.
As we can see, with momentum-based gradient descent the steps become larger and larger due to the accumulated momentum, and we overshoot at the 4th step. We then have to take steps in the opposite direction to reach the minimum point.
However, the update in NAG happens in two steps. First, a partial step to reach the
look-ahead point, and then the final update. We calculate the gradient at the look-ahead
point and then use it to calculate the final update. If the gradient at the look-ahead point
is negative, our final update will be smaller than that of a regular momentum-based
gradient. Like in the above example, the updates of NAG are similar to that of the
momentum-based gradient for the first three steps because the gradient at that point
and the look-ahead point are positive. But at step 4, the gradient of the look-ahead point
is negative.
In NAG, the first partial update 4a will be used to go to the look-ahead point and then
the gradient will be calculated at that point without updating the parameters. Since the
gradient at step 4b is negative, the overall update will be smaller than the momentum-
based gradient descent.
We can see in the above example that the momentum-based gradient descent takes six
steps to reach the minimum point, while NAG takes only five steps.
This looking ahead helps NAG to converge to the minimum points in fewer steps and
reduce the chances of overshooting.
How does NAG actually work?
We saw how NAG solves the problem of overshooting by ‘looking ahead’. Let us see
how this is calculated and the actual math behind it.
Update rule for gradient descent:
wt+1 = wt − η∇wt
In this equation, the weight (W) is updated in each iteration. η is the learning rate, and
∇wt is the gradient.
Update rule for momentum-based gradient descent:
In this, momentum is added to the conventional gradient descent equation. The update
equation is
wt+1 = wt − updatet
updatet is calculated by:
updatet = γ · updatet−1 + η∇wt
This is how the gradient of all the previous updates is added to the current update.
Update rule for NAG:
wt+1 = wt − updatet
While calculating updatet, we include the look-ahead gradient (∇wlook_ahead).
updatet = γ · updatet−1 + η∇wlook_ahead
∇wlook_ahead is the gradient evaluated at the look-ahead point, which is calculated by:
wlook_ahead = wt − γ · updatet−1
This look-ahead gradient is used in our update and prevents overshooting.
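A minimal sketch of both update rules follows; the quadratic loss f(w) = w², the learning rate η = 0.1, and the momentum factor γ = 0.9 are illustrative assumptions.

# Illustrative comparison of momentum and NAG updates on f(w) = w**2.
def grad(w):
    return 2 * w  # gradient of f(w) = w**2

def momentum_step(w, update_prev, eta=0.1, gamma=0.9):
    update = gamma * update_prev + eta * grad(w)             # gradient at the current point
    return w - update, update

def nag_step(w, update_prev, eta=0.1, gamma=0.9):
    w_look_ahead = w - gamma * update_prev                   # partial (look-ahead) step
    update = gamma * update_prev + eta * grad(w_look_ahead)  # gradient at the look-ahead point
    return w - update, update

w_m, u_m = 5.0, 0.0
w_n, u_n = 5.0, 0.0
for step in range(6):
    w_m, u_m = momentum_step(w_m, u_m)
    w_n, u_n = nag_step(w_n, u_n)
    print(step + 1, round(w_m, 4), round(w_n, 4))  # NAG overshoots the minimum (0) less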
Stochastic Gradient Descent (SGD):
Stochastic Gradient Descent (SGD) is a variant of the Gradient
Descent algorithm that is used for optimizing machine learning models. It
addresses the computational inefficiency of traditional Gradient Descent
methods when dealing with large datasets in machine learning projects.
In SGD, instead of using the entire dataset for each iteration, only a single
random training example (or a small batch) is selected to calculate the
gradient and update the model parameters. This random selection introduces
randomness into the optimization process, hence the term "stochastic" in
Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when
dealing with large datasets. By using a single example or a small batch, the
computational cost per iteration is significantly reduced compared to
traditional Gradient Descent methods that require processing the entire
dataset.
One thing to note is that, because SGD is generally noisier than typical Gradient
Descent, it usually takes a higher number of iterations to reach the minimum,
owing to the randomness in its descent. Even though it requires more iterations
than typical Gradient Descent, it is still computationally much less expensive.
Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for
optimizing a learning algorithm.
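As a minimal sketch of this idea, the following loop fits a straight line with SGD; the synthetic data (y = 2x + 1), the learning rate, and the iteration count are illustrative assumptions. At each iteration a single randomly chosen example, rather than the full dataset, is used to compute the gradient.

import random

# Synthetic, noise-free data following y = 2x + 1 (illustrative).
data = [(x, 2 * x + 1) for x in range(-10, 11)]
w, b = 0.0, 0.0
eta = 0.01  # learning rate

for _ in range(5000):
    x, y = random.choice(data)      # one random training example per iteration
    error = (w * x + b) - y         # prediction error for this single example
    w -= eta * error * x            # gradient of 0.5 * error**2 with respect to w
    b -= eta * error                # gradient of 0.5 * error**2 with respect to b

print(w, b)  # close to the true values 2 and 1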
AdaGrad (Adaptive Gradient)
• The idea behind this method is that it enables the learning rate to adapt to the geometry of the loss function, allowing it to converge more quickly in steep gradient directions while being more conservative in flatter directions. This can result in quicker convergence and improved generalization.
• Accumulate the running sum of the squared gradients: Gt = Gt−1 + (∇wt)²
• Apply the scaled update to the parameters: wt+1 = wt − (η / √(Gt + ε)) ∇wt (see the sketch after this list)
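A minimal AdaGrad sketch follows; the loss f(w) = w², the base learning rate, and the iteration count are illustrative assumptions.

import math

# AdaGrad sketch on f(w) = w**2 (illustrative function and hyperparameters).
def grad(w):
    return 2 * w

w = 5.0
eta = 1.0     # base learning rate
g_acc = 0.0   # running sum of squared gradients
eps = 1e-8    # small constant to avoid division by zero

for _ in range(100):
    g = grad(w)
    g_acc += g ** 2                          # accumulate the squared gradients
    w -= (eta / math.sqrt(g_acc + eps)) * g  # effective step shrinks as g_acc grows
print(w)

Because the accumulated sum only grows, the effective learning rate keeps shrinking, which is exactly the conservative behaviour described above for flat gradient directions.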
Adam
Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University of Toronto in their 2015 ICLR paper titled "Adam: A Method for Stochastic Optimization". The name Adam is derived from adaptive moment estimation; it is not an acronym and is not written as "ADAM".
One of Adam's key features is its ability to adaptively adjust the learning rate for each network weight individually. Unlike SGD, which maintains a single learning rate throughout training, Adam computes individual learning rates from estimates of the first moment (mean) and the second moment (uncentered variance, i.e., computed without subtracting the mean) of the gradients. By incorporating both moments, Adam achieves an adaptive learning rate that can efficiently navigate the optimization landscape during training. This adaptivity combines features of AdaGrad and RMSProp to provide efficient and adaptive updates to the network weights.
Learning Objectives
• Adam adjusts learning rates individually for each parameter, allowing for efficient per-parameter updates.
• Ultimately, Adam's primary goal is to stabilize the training process and help neural networks converge quickly.
• It aims to optimize model parameters efficiently, swiftly navigating through both steep and flat regions of the loss surface.
The Adam optimizer has several benefits, due to which it is used widely. It offers a faster running time, low memory requirements, and requires less tuning than most other optimization algorithms. Its update rule is:
mt = β1·mt−1 + (1 − β1)·∇wt
vt = β2·vt−1 + (1 − β2)·(∇wt)²
m̂t = mt / (1 − β1^t),  v̂t = vt / (1 − β2^t)
wt+1 = wt − η·m̂t / (√v̂t + ε)
These formulas represent the working of the Adam optimizer. Here β1 and β2 are the exponential decay rates for the first and second moment estimates (typically 0.9 and 0.999).
If the Adam optimizer uses the good properties of all these algorithms and is the best available optimizer, then why shouldn't you use Adam in every application, and what was the need to learn about the other algorithms in depth? This is because even Adam has some downsides. It tends to prioritize faster computation, whereas algorithms like stochastic gradient descent focus on the data points themselves. That is why algorithms like SGD often generalize the data in a better manner, at the cost of slower computation.
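A minimal Adam sketch follows; the loss f(w) = w², the learning rate η = 0.1, and the iteration count are illustrative assumptions, while β1 = 0.9, β2 = 0.999, and ε = 1e-8 are the defaults suggested in the paper.

import math

# Adam sketch on f(w) = w**2 (illustrative function; standard default decay rates).
def grad(w):
    return 2 * w

w = 5.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0  # first-moment (mean) and second-moment (uncentered variance) estimates

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g              # update biased first moment
    v = beta2 * v + (1 - beta2) * g ** 2         # update biased second moment
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (math.sqrt(v_hat) + eps)  # adaptive per-parameter step
print(w)  # close to the minimum at 0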
Eigenvalue Decomposition
For a nonzero eigenvector v, the equation (A − λI)v = 0 requires A − λI to compress some direction to zero; it is therefore not invertible, and its determinant is zero.
Thus, we can evaluate the eigenvalues by finding the values of λ for which
det(A − λI) = 0
Once we have the eigenvalues, we can solve Av = λv to find the corresponding eigenvector(s).
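As a small worked check (the 2×2 matrix below is an illustrative choice), the eigenvalues of A = [[2, 1], [1, 2]] satisfy det(A − λI) = (2 − λ)² − 1 = 0, giving λ = 1 and λ = 3; the sketch below verifies this and the relation Av = λv numerically with NumPy.

import numpy as np

# Illustrative 2x2 example: A = [[2, 1], [1, 2]].
A = np.array([[2.0, 1.0], [1.0, 2.0]])

# det(A - lambda*I) = (2 - lambda)**2 - 1 = 0  gives lambda = 1 and lambda = 3.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # eigenvalues 1 and 3 (order may vary)

# Verify A v = lambda v for each eigenpair (eigenvectors are the columns).
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))  # True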