Unit 2

Feedforward Neural Network

A feedforward neural network is a foundational component of modern machine learning: it supports pattern recognition and classification, non-linear regression, and function approximation.

A feedforward neural network is a type of artificial neural network in which the connections between nodes do not form a loop.

Often referred to as a multi-layered network of neurons, feedforward neural

networks are so named because all information flows in a forward manner only.

The data enters the input nodes, travels through the hidden layers, and eventually

exits the output nodes. The network is devoid of links that would allow the

information exiting the output node to be sent back into the network.

The purpose of feedforward neural networks is to approximate functions.

Here’s how it works:

Suppose there is a classifier y = f*(x), which assigns an input x to a category y.

The feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that gives the closest approximation to f*.

As shown in the Google Photos app, a feedforward neural network serves as the

foundation for object detection in photos.


A Feedforward Neural Network’s Layers

The following are the components of a feedforward neural network:

Input layer

It contains the neurons that receive the input. The data is then passed on to the next layer. The total number of neurons in the input layer equals the number of features in the dataset.

Hidden layer

This is the intermediate layer, situated between the input and output layers. It contains neurons that apply transformations to the inputs before passing the results on to the output layer.

Output layer

This is the final layer, and its structure depends on the model’s construction. It produces the predicted output, which is compared against the known desired outcome during training.

Neuron weights

Weights describe the strength of the connection between neurons. They are commonly initialized to small random values, often in the range 0 to 1, and are adjusted as the network learns.

Cost Function in Feedforward Neural Network

The cost function is an important component of a feedforward neural network. Generally, minor adjustments to weights and biases have little effect on the categorized data points. A smooth cost function is therefore needed, so that performance can be improved by making small adjustments to the weights and biases.

The mean square error cost function is defined as follows:

C(w, b) = (1 / 2n) Σx ‖y(x) − a‖²

Where,

w = weights collected in the network

b = biases

n = number of training inputs

a = output vector of the network for input x

x = input

y(x) = desired output for input x

‖v‖ = the norm (length) of vector v

Loss Function in Feedforward Neural Network

A neural network’s loss function is used to determine whether the learning process needs to be adjusted. The output layer contains as many neurons as there are classes, and the loss measures the difference between the predicted and actual probability distributions.

The cross-entropy loss for binary classification is as follows:

L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

The cross-entropy loss for multi-class classification is as follows:

L = −Σc yc log(ŷc)

where y is the true label and ŷ is the predicted probability.

Gradient Learning Algorithm

The Gradient Descent algorithm repeatedly calculates the next point using the gradient at the current position, scales it by a learning rate, and subtracts the obtained value from the current position (makes a step). It subtracts the value because we want to minimize the function (to maximize it we would add). This procedure may be written as:

pn+1 = pn − η∇f(pn)

There’s a crucial parameter η which adjusts the gradient and hence affects the step

size. In machine learning, it is termed learning rate and has a substantial effect on

performance.

• The smaller the learning rate, the longer gradient descent takes to converge, and it may reach the maximum number of iterations before finding the optimal point.

• If the learning rate is too large, the algorithm may not converge to the optimal point (it jumps around) or may even diverge altogether.

In summary, the Gradient Descent method’s steps are:

1. pick a starting point (initialization)

2. compute the gradient at this point

3. make a scaled step in the direction opposite to the gradient (objective: minimize)

4. repeat steps 2 and 3 until one of the conditions is met:

• maximum number of iterations reached

• step size is smaller than the tolerance.
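To make these steps concrete, here is a minimal Python sketch of the procedure, assuming a toy quadratic objective f(x, y) = x² + 2y² (the function, starting point, and hyperparameter values are illustrative choices, not part of the original text):

import numpy as np

def gradient_descent(grad_f, start, learning_rate=0.1, max_iters=1000, tol=1e-6):
    point = np.asarray(start, dtype=float)        # 1. pick a starting point
    for _ in range(max_iters):                    # 4. repeat until a condition is met
        step = learning_rate * grad_f(point)      # 2.-3. gradient, scaled by the learning rate
        point = point - step                      # step opposite to the gradient
        if np.linalg.norm(step) < tol:            # stop when the step is below the tolerance
            break
    return point

# Gradient of f(x, y) = x^2 + 2y^2 is (2x, 4y); the minimum is at (0, 0).
print(gradient_descent(lambda p: np.array([2 * p[0], 4 * p[1]]), start=[3.0, -2.0]))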


Applications of Feedforward Neural Network

These neural networks are utilized in a wide variety of applications. Several of them are described below:

• Physiological feedforward system: Here, feedforward control is exemplified by the normal anticipatory regulation of the heartbeat prior to exercise by the central involuntary system.

• Gene regulation and feedforward: Here, a recurring motif predominates throughout the known networks, and this motif has been demonstrated to act as a feedforward system for detecting non-temporary atmospheric alteration.

• Automating and managing machines

• Parallel feedforward compensation with derivative: This is a relatively recent approach for converting the non-minimum-phase part of an open-loop transfer system into a minimum-phase part.

Backpropagation Process in Deep Neural Network
Backpropagation is one of the important concepts of a neural network. Our task is to classify the data as well as possible. For this, we have to update the weights and biases, but how can we do that in a deep neural network? In the linear regression model, we use gradient descent to optimize the parameters. Similarly, here we also use the gradient descent algorithm, with the gradients computed by backpropagation.

For a single training example, the backpropagation algorithm calculates the gradient of the error function. Backpropagation can be written as a function of the neural network. Backpropagation algorithms are a set of methods used to efficiently train artificial neural networks following a gradient descent approach that exploits the chain rule.

The main features of backpropagation are that it is an iterative, recursive, and efficient method for calculating the updated weights, improving the network until it can perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.

Now, how is the error function used in backpropagation, and how does backpropagation work? Let us start with an example and work through it mathematically to understand exactly how the weights are updated using backpropagation.

Input values
X1=0.05
X2=0.10

Initial weights
w1=0.15 w5=0.40
w2=0.20 w6=0.45
w3=0.25 w7=0.50
w4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60

Target Values
T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass
To find the value of H1, we multiply the input values by the weights and add the bias:

H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775

To calculate the final output of H1, we apply the sigmoid function:

H1final = 1/(1 + e^(−0.3775)) = 0.593269992

We will calculate the value of H2 in the same way as H1

H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925

To calculate the final output of H2, we apply the sigmoid function:

H2final = 1/(1 + e^(−0.3925)) = 0.596884378


Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.

To find the value of y1, we multiply the outputs of H1 and H2 by the corresponding weights and add the bias:

y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597

To calculate the final output of y1, we apply the sigmoid function:

y1final = 1/(1 + e^(−1.10590597)) = 0.75136507

We will calculate the value of y2 in the same way as y1

y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214

To calculate the final output of y2, we apply the sigmoid function:

y2final = 1/(1 + e^(−1.2249214)) = 0.772928465

Our target values are 0.01 and 0.99. Our y1 and y2 values do not match the target values T1 and T2.

Now we will find the total error, which is simply the sum of the squared differences between the outputs and the target outputs. The total error is calculated as

Etotal = Σ ½(target − output)²

So, the total error is

E1 = ½(T1 − y1final)² = ½(0.01 − 0.75136507)² = 0.274811083
E2 = ½(T2 − y2final)² = ½(0.99 − 0.772928465)² = 0.023560026
Etotal = E1 + E2 = 0.298371109
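The forward pass and the error computation can be verified with a short Python sketch (a direct transcription of the numbers above, with the sigmoid applied at each neuron):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99

H1 = sigmoid(x1 * w1 + x2 * w2 + b1)   # 0.593269992
H2 = sigmoid(x1 * w3 + x2 * w4 + b1)   # 0.596884378
y1 = sigmoid(H1 * w5 + H2 * w6 + b2)   # 0.75136507
y2 = sigmoid(H1 * w7 + H2 * w8 + b2)   # 0.772928465

E_total = 0.5 * (T1 - y1) ** 2 + 0.5 * (T2 - y2) ** 2
print(E_total)                          # 0.298371109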

Now, we will backpropagate this error to update the weights using a backward pass.

Backward pass at the output layer


To update a weight, we calculate the error corresponding to that weight from the total error. The error on weight w is calculated by differentiating the total error with respect to w.

We perform the backward pass starting with the last weights, so first consider w5. The total error does not contain w5 directly, so we cannot partially differentiate it with respect to w5 as written. We therefore split the derivative into multiple terms using the chain rule, so that each term can be differentiated easily:

∂Etotal/∂w5 = ∂Etotal/∂y1final × ∂y1final/∂y1 × ∂y1/∂w5

Now we calculate each term one by one and multiply them together to find the final result, ∂Etotal/∂w5.

Now we will calculate the updated weight w5new with the help of the following formula (the values below are consistent with a learning rate η = 0.5):

w5new = w5 − η × ∂Etotal/∂w5

In the same way, we calculate w6new, w7new, and w8new, which gives us the following values:
w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121

Backward pass at the hidden layer

Now we will backpropagate to the hidden layer and update the weights w1, w2, w3, and w4, as we did for the weights w5, w6, w7, and w8.

We will calculate the error at w1. The total error does not contain w1 directly, so again we cannot partially differentiate it with respect to w1 as written. We split the derivative into multiple terms using the chain rule so that we can differentiate it easily with respect to w1:

∂Etotal/∂w1 = ∂Etotal/∂H1final × ∂H1final/∂H1 × ∂H1/∂w1

Now we calculate each term one by one.

Etotal does not contain an H1final term directly, but it depends on H1final through both outputs, so we split again:

∂Etotal/∂H1final = ∂E1/∂H1final + ∂E2/∂H1final

E1 and E2 do not contain an H1final term either, so each part is split once more through the corresponding output neuron:

∂E1/∂H1final = ∂E1/∂y1final × ∂y1final/∂y1 × ∂y1/∂H1final
∂E2/∂H1final = ∂E2/∂y2final × ∂y2final/∂y2 × ∂y2/∂H1final

The first two factors of each product were already computed during the backward pass at the output layer, and since y1 = H1final×w5 + H2final×w6 + b2 and y2 = H1final×w7 + H2final×w8 + b2, we have ∂y1/∂H1final = w5 and ∂y2/∂H1final = w7. Adding the two parts gives ∂Etotal/∂H1final.

Next we need the derivative of the sigmoid output with respect to its net input:

∂H1final/∂H1 = H1final × (1 − H1final)

Finally, we calculate the partial derivative of the total net input to H1 with respect to w1 the same way as we did for the output neuron:

∂H1/∂w1 = x1

We multiply these three terms together to find the final result, ∂Etotal/∂w1.

Now we will calculate the updated weight w1new with the help of the following formula:

w1new = w1 − η × ∂Etotal/∂w1

In the same way, we calculate w2new, w3new, and w4new, which gives us the following values:

w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229

We have updated all the weights. We found an error of 0.298371109 on the network when we fed forward the inputs 0.05 and 0.1. After the first round of backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, when we feed forward 0.05 and 0.1, the output neurons generate 0.015912196 and 0.984065734, i.e., values close to our targets.
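The entire worked example can be reproduced with a compact vectorized sketch. It assumes the learning rate η = 0.5 that is consistent with the updated weights above and, as in the example, leaves the biases fixed:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])            # inputs x1, x2
t = np.array([0.01, 0.99])            # targets T1, T2
W_h = np.array([[0.15, 0.20],         # w1, w2 (into H1)
                [0.25, 0.30]])        # w3, w4 (into H2)
W_o = np.array([[0.40, 0.45],         # w5, w6 (into y1)
                [0.50, 0.55]])        # w7, w8 (into y2)
b1, b2, eta = 0.35, 0.60, 0.5

for _ in range(10000):
    h = sigmoid(W_h @ x + b1)                  # forward pass, hidden layer
    y = sigmoid(W_o @ h + b2)                  # forward pass, output layer
    delta_o = (y - t) * y * (1 - y)            # dE/dnet at the output layer
    delta_h = (W_o.T @ delta_o) * h * (1 - h)  # dE/dnet at the hidden layer
    W_o -= eta * np.outer(delta_o, h)          # gradient-descent weight updates
    W_h -= eta * np.outer(delta_h, x)

h = sigmoid(W_h @ x + b1)
y = sigmoid(W_o @ h + b2)
print(0.5 * np.sum((t - y) ** 2))   # total error, about 0.000035
print(y)                            # about [0.0159, 0.9841]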

Drawbacks of gradient descent


The main drawback of gradient descent is that it depends only on the learning rate and the gradient at that particular step. The gradient at a plateau, or at a saddle point of our function, will be close to zero, so the step size becomes very small or even zero. Thus, the update of our parameters is very slow on a gentle slope.

Let us look at an example. The starting point of our model is ‘A’. The loss function decreases rapidly on the path AB because of the higher gradient. But as the gradient decreases from B to C, the learning becomes negligible. The gradient at point ‘C’ is zero, and it is the saddle point of our function. Even after many iterations, we will be stuck at ‘C’ and will not reach the desired minimum ‘D’.

This problem is solved by using momentum in our gradient descent.

Momentum-based gradient descent :-


Momentum-based gradient descent is an optimization method that
enhances conventional gradient descent by considering the gradient's
velocity. The neural network's weights are changed in momentum-
based gradient descent using the weighted sum of the previous
gradient and the current gradient.
Momentum-based Gradient Optimizer is a technique used in optimization
algorithms, such as Gradient Descent, to accelerate the convergence of the
algorithm and overcome local minima. In the Momentum-based Gradient
Optimizer, a fraction of the previous update is added to the current update,
which creates a momentum effect that helps the algorithm to move faster
towards the minimum.
The momentum term can be viewed as a moving average of the gradients. The
larger the momentum term, the smoother the moving average, and the more
resistant it is to changes in the gradients. The momentum term is typically set
to a value between 0 and 1, with a higher value resulting in a more stable
optimization process.
The update rule for the Momentum-based Gradient Optimizer can be
expressed as follows:

v = beta * v - learning_rate * gradient
parameters = parameters + v

// Where v is the velocity vector, beta is the momentum term,
// learning_rate is the step size,
// gradient is the gradient of the cost function with respect to the parameters,
// and parameters are the parameters of the model.
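As a runnable illustration, here is the same update rule applied in Python to a one-dimensional quadratic (the objective and the hyperparameter values are illustrative choices):

import numpy as np

def momentum_gd(grad_f, start, learning_rate=0.1, beta=0.9, iters=200):
    parameters = np.asarray(start, dtype=float)
    v = np.zeros_like(parameters)                 # velocity starts at zero
    for _ in range(iters):
        gradient = grad_f(parameters)
        v = beta * v - learning_rate * gradient   # weighted sum of past and current gradients
        parameters = parameters + v               # step along the accumulated velocity
    return parameters

# f(x) = x^2 has gradient 2x; beta = 0.9 is the usual starting point.
print(momentum_gd(lambda x: 2 * x, start=[5.0]))  # approaches 0, with some oscillation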

The Momentum-based Gradient Optimizer has several advantages over the


basic Gradient Descent algorithm, including faster convergence, improved
stability, and the ability to overcome local minima. It is widely used in deep
learning applications and is an important optimization technique for training
deep neural networks.
Momentum-based gradient descent is a potent deep-learning
optimization strategy that can hasten convergence and enhance
outcomes. Momentum-based gradient descent is able to shift
direction more smoothly by accounting for the gradient's momentum,
which promotes faster convergence and possibly better outcomes. The
momentum coefficient can be chosen in many different ways;
however, a reasonable starting point is 0.9, which can then be changed
depending on the situation and dataset.
But this added momentum causes a different type of problem. We actually cross the
minimum point and have to take a U-turn to get to the minimum point. Momentum-
based gradient descent oscillates around the minimum point, and we have to take a lot
of U-turns to reach the desired point. Despite these oscillations, momentum-based
gradient descent is faster than conventional gradient descent.
To reduce these oscillations, we can use Nesterov Accelerated Gradient.

Nesterov Accelerated Gradient (NAG)

NAG resolves this problem by adding a look ahead term in our equation. The intuition
behind NAG can be summarized as ‘look before you leap’. Let us try to understand this
through an example.
As we can see, in momentum-based gradient descent the steps become larger and larger due to the accumulated momentum, and we overshoot at the 4th step. We then have to take steps in the opposite direction to reach the minimum point.
However, the update in NAG happens in two steps. First, a partial step to reach the
look-ahead point, and then the final update. We calculate the gradient at the look-ahead
point and then use it to calculate the final update. If the gradient at the look-ahead point
is negative, our final update will be smaller than that of a regular momentum-based
gradient. Like in the above example, the updates of NAG are similar to that of the
momentum-based gradient for the first three steps because the gradient at that point
and the look-ahead point are positive. But at step 4, the gradient of the look-ahead point
is negative.
In NAG, the first partial update 4a will be used to go to the look-ahead point and then
the gradient will be calculated at that point without updating the parameters. Since the
gradient at step 4b is negative, the overall update will be smaller than the momentum-
based gradient descent.
We can see in the above example that the momentum-based gradient descent takes six
steps to reach the minimum point, while NAG takes only five steps.
This looking ahead helps NAG to converge to the minimum points in fewer steps and
reduce the chances of overshooting.
How does NAG actually work?
We saw how NAG solves the problem of overshooting by ‘looking ahead’. Let us see
how this is calculated and the actual math behind it.
Update rule for gradient descent:
wt+1 = wt − η∇wt
In this equation, the weight (W) is updated in each iteration. η is the learning rate, and
∇wt is the gradient.
Update rule for momentum-based gradient descent:
In this, momentum is added to the conventional gradient descent equation. The update
equation is
wt+1 = wt − updatet
updatet is calculated by:
updatet = γ · updatet−1 + η∇wt

This is how the gradient of all the previous updates is added to the current update.
Update rule for NAG:
wt+1 = wt − updatet
While calculating updatet, we will include the look-ahead gradient (∇wlook_ahead):
updatet = γ · updatet−1 + η∇wlook_ahead
The look-ahead point is calculated by:
wlook_ahead = wt − γ · updatet−1
and ∇wlook_ahead is the gradient evaluated at that point. This look-ahead gradient is used in our update and prevents overshooting.
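A minimal Python sketch of this two-step NAG update, following the equations above on an illustrative quadratic objective:

import numpy as np

def nag(grad_f, w, learning_rate=0.1, gamma=0.9, iters=200):
    update = np.zeros_like(w)
    for _ in range(iters):
        w_look_ahead = w - gamma * update                             # partial step using history
        update = gamma * update + learning_rate * grad_f(w_look_ahead)
        w = w - update                                                # final update
    return w

# f(w) = w^2 has gradient 2w.
print(nag(lambda w: 2 * w, w=np.array([5.0])))  # approaches 0 with fewer oscillations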
Stochastic Gradient Descent (SGD):
Stochastic Gradient Descent (SGD) is a variant of the Gradient
Descent algorithm that is used for optimizing machine learning models. It
addresses the computational inefficiency of traditional Gradient Descent
methods when dealing with large datasets in machine learning projects.
In SGD, instead of using the entire dataset for each iteration, only a single random training example (or a small batch) is selected to calculate the gradient and update the model parameters. This random selection introduces randomness into the optimization process, hence the term “stochastic” in Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when
dealing with large datasets. By using a single example or a small batch, the
computational cost per iteration is significantly reduced compared to
traditional Gradient Descent methods that require processing the entire
dataset.

Stochastic Gradient Descent Algorithm


• Initialization: Randomly initialize the parameters of the model.
• Set Parameters: Determine the number of iterations and the learning
rate (alpha) for updating the parameters.
• Stochastic Gradient Descent Loop: Repeat the following steps until
the model converges or reaches the maximum number of iterations:
a. Shuffle the training dataset to introduce randomness.
b. Iterate over each training example (or a small batch) in the
shuffled order.
c. Compute the gradient of the cost function with respect to the
model parameters using the current training example (or
batch).
d. Update the model parameters by taking a step in the direction of
the negative gradient, scaled by the learning rate.
e. Evaluate the convergence criteria, such as the change in the
cost function between iterations.
• Return Optimized Parameters: Once the convergence criteria are met
or the maximum number of iterations is reached, return the optimized
model parameters.
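A minimal sketch of these steps in Python, fitting a linear model with one random example per update (the synthetic dataset and the hyperparameter values are illustrative):

import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=50):
    rng = np.random.default_rng(0)
    n, d = X.shape
    w, b = np.zeros(d), 0.0                      # initialization
    for _ in range(epochs):
        for i in rng.permutation(n):             # a. shuffle the training dataset
            error = X[i] @ w + b - y[i]          # b.-c. gradient from a single example
            w -= learning_rate * error * X[i]    # d. step against the gradient
            b -= learning_rate * error
    return w, b

# Synthetic data generated from y = 3x + 1 plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 1 + 0.1 * rng.normal(size=200)
print(sgd_linear_regression(X, y))               # roughly w ≈ [3.0], b ≈ 1.0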
In SGD, since only one sample from the dataset is chosen at random for each iteration, the path taken by the algorithm to reach the minima is usually noisier than that of your typical Gradient Descent algorithm. But that doesn’t matter all that much, because the exact path taken by the algorithm does not matter as long as we reach the minimum, ideally with a significantly shorter training time.
The path taken by Batch Gradient Descent is shown below:

Batch gradient optimization path


A path taken by Stochastic Gradient Descent looks as follows –

stochastic gradient optimization path

One thing to note is that, as SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minima because of the randomness in its descent. Even though it requires more iterations to reach the minima than typical Gradient Descent, each iteration is still computationally much less expensive.
Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for
optimizing a learning algorithm.
AdaGrad (Adaptive Gradient Descent):-

AdaGrad is a well-known optimization method that is used in ML and DL.


Duchi, Hazan, and Singer proposed it in 2011 as a way of adjusting
the learning rate during training.

• AdaGrad’s concept is to modify the learning rate for every


parameter in a model depending on the parameter’s previous
gradients.
• Specifically, it accumulates the sum of the squares of the gradients over time, one sum per parameter, and divides the base learning rate by the square root of this sum. This reduces the learning rate for parameters with big gradients while keeping it comparatively high for parameters with modest gradients.

• The idea behind this particular method is that it enables the learning
rate to adapt to the geometry of the loss function, allowing it to
converge quicker in steep gradient directions while being more
conservative in flatter gradient directions. This may result in quicker
convergence and improved generalization.

• However, this method has significant downsides. One of the most significant concerns is that the cumulative gradient magnitudes may get quite big over time, resulting in a meager effective learning rate that can inhibit further learning. Adam and RMSProp, two contemporary optimization algorithms, combine an adaptive learning rate method with other strategies to limit the growth of the accumulated gradient magnitudes over time.

Adaptive Gradients, or AdaGrad for short, is an extension of the gradient descent optimization algorithm that allows the step size in each dimension to be automatically adapted based on the gradients (partial derivatives) seen for that variable over the course of the search.
AdaGrad dynamically incorporates knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. AdaGrad has been released in two versions. Diagonal AdaGrad (the version used in practice) maintains and adapts one learning rate per dimension; the second version, known as Full AdaGrad, maintains one learning rate per direction (i.e., a full PSD matrix).

The Adaptive Gradient Algorithm (AdaGrad) is an algorithm for gradient-based optimization. The learning rate is adapted component-wise to the parameters by incorporating knowledge of past observations. It performs larger updates (i.e., higher learning rates) for parameters related to infrequent features and smaller updates (i.e., lower learning rates) for frequent ones. As a result, it is well-suited to dealing with sparse data (as in NLP or image recognition). Giving each parameter its own learning rate improves performance on problems with sparse gradients.
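A minimal sketch of diagonal AdaGrad in Python (the badly scaled quadratic objective and the hyperparameter values are illustrative):

import numpy as np

def adagrad(grad_f, theta, learning_rate=0.5, eps=1e-8, iters=1000):
    cache = np.zeros_like(theta)                # running sum of squared gradients, per parameter
    for _ in range(iters):
        g = grad_f(theta)
        cache += g ** 2                         # accumulate the gradient history
        theta -= learning_rate * g / (np.sqrt(cache) + eps)   # per-parameter step size
    return theta

# f(x, y) = x^2 + 100*y^2 has gradient (2x, 200y); AdaGrad rebalances the two directions.
print(adagrad(lambda p: np.array([2 * p[0], 200 * p[1]]), np.array([5.0, 5.0])))
# approaches the minimum at (0, 0)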

Advantages of Using AdaGrad


• It eliminates the need to manually tune the learning rate
• Convergence is faster and more reliable than simple SGD when the scaling of the weights is unequal
• It is not very sensitive to the size of the master step (the global learning rate)

Benefits of using AdaGrad


The following are the benefits of utilizing the AdaGrad optimizer:

• Easy to use– It’s a reasonably straightforward optimization technique


and may be applied to various models.
• No need for manual tuning– There is no need to manually tune hyperparameters, since this optimization method automatically adjusts the learning rate for each parameter.
• Adaptive learning rate– Modifies the learning rate for each parameter
depending on the parameter’s past gradients. This implies that for
parameters with big gradients, the learning rate is lowered, while for
parameters with small gradients, the learning rate is raised, allowing the
algorithm to converge quicker and prevent overshooting the ideal
solution.
• Adaptability to noisy data– This method provides the ability to smooth
out the impacts of noisy data by assigning lesser learning rates to
parameters with strong gradients owing to noisy input.
• Handling sparse data efficiently– It is particularly good at dealing with
sparse data, which is prevalent in NLP and recommendation systems.
This is performed by giving sparse parameters faster learning rates,
which may speed convergence.
In the end, AdaGrad has the potential to be a strong optimization technique for
machine learning and deep learning, especially when the data is sparse, noisy,
or has a high number of parameters.

RMSProp (Root Mean Squared Propagation):-


RMSProp (Root Mean Squared Propagation) is an adaptive learning rate optimization algorithm. It is an extension of the popular Adaptive Gradient Algorithm and is designed to dramatically reduce the amount of computational effort used in training neural networks. This algorithm works by keeping an exponentially decaying average of the squared gradients and dividing the learning rate by its square root, which shrinks the step size for parameters whose gradients are consistently large and enlarges it where gradients are small. In this way, RMSProp is able to smoothly adjust the learning rate for each of the parameters in the network, providing better performance than regular Gradient Descent alone.
The RMSprop algorithm utilizes exponentially weighted moving averages of
squared gradients to update the parameters. Here is the mathematical
equation for RMSprop:
1. Initialize parameters:
• Learning rate: α
• Exponential decay rate for averaging: γ
• Small constant for numerical stability: ε
• Initial parameter values: θ
2. Initialize accumulated gradients (exponentially weighted average):
• Accumulated squared gradient for each parameter: E0 = 0
3. Repeat until convergence or maximum iterations:
• Compute the gradient of the objective function with respect to the parameters: gt = ∇θJ(θ)
• Update the exponentially weighted average of the squared gradients: Et = γ·Et−1 + (1 − γ)·gt²
• Update the parameters: θ = θ − α·gt / √(Et + ε)

where,
• gt is the gradient of the loss function with respect to the parameters at time t
• γ is a decay factor
• Et is the exponentially weighted average of the squared gradients
• α is the learning rate
• ε is a small constant to prevent division by zero
This process is repeated for each parameter in the optimization problem, and it
helps adjust the learning rate for each parameter based on the historical
gradients. The exponential moving average allows the algorithm to give more
importance to recent gradients and dampen the effect of older gradients,
providing stability during optimization.
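The equations above translate directly into a short Python sketch (the objective and hyperparameter values are again illustrative choices):

import numpy as np

def rmsprop(grad_f, theta, alpha=0.01, gamma=0.9, eps=1e-8, iters=2000):
    E = np.zeros_like(theta)                    # exponentially weighted average of g^2
    for _ in range(iters):
        g = grad_f(theta)
        E = gamma * E + (1 - gamma) * g ** 2    # Et = γ·Et−1 + (1 − γ)·gt²
        theta -= alpha * g / np.sqrt(E + eps)   # θ = θ − α·gt/√(Et + ε)
    return theta

# The same badly scaled quadratic as before: f(x, y) = x^2 + 100*y^2.
print(rmsprop(lambda p: np.array([2 * p[0], 200 * p[1]]), np.array([5.0, 5.0])))
# approaches (0, 0)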

Adam :-
Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University of Toronto in their 2015 ICLR paper (poster) titled “Adam: A Method for Stochastic Optimization”. The algorithm is called Adam. It is not an acronym and is not written as “ADAM”.

Adam optimizer, short for Adaptive Moment Estimation optimizer, is an


optimization algorithm commonly used in deep learning. It is an
extension of the stochastic gradient descent (SGD) algorithm and is
designed to update the weights of a neural network during training.
The name “Adam” is derived from “adaptive moment estimation,” highlighting its

ability to adaptively adjust the learning rate for each network weight individually.

Unlike SGD, which maintains a single learning rate throughout training, Adam

optimizer dynamically computes individual learning rates based on the past

gradients and their second moments.

The creators of Adam optimizer incorporated the beneficial features of other

optimization algorithms such as AdaGrad and RMSProp. Similar to RMSProp, Adam

optimizer considers the second moment of the gradients, but unlike RMSProp, it

calculates the uncentered variance of the gradients (without subtracting the mean).

By incorporating both the first moment (mean) and second moment (uncentered

variance) of the gradients, Adam optimizer achieves an adaptive learning rate that

can efficiently navigate the optimization landscape during training. This adaptivity

helps in faster convergence and improved performance of the neural network.

In summary, Adam optimizer is an optimization algorithm that extends SGD by

dynamically adjusting learning rates based on individual weights. It combines the

features of AdaGrad and RMSProp to provide efficient and adaptive updates to the

network weights during deep learning training.

Learning Objectives

• Adam adjusts learning rates individually for each parameter, allowing for efficient

optimization and convergence, especially in complex loss landscapes.


• Incorporating bias correction mechanisms to counter initialization bias in the first

moments facilitates faster convergence during early training stages.

• Ultimately, Adam’s primary goal is to stabilize the training process and help neural

networks converge to optimal solutions.

• It aims to optimize model parameters efficiently, swiftly navigating through steep and

flat regions of the loss function.

Adam Optimizer Formula

The Adam optimizer has several benefits, due to which it is used widely. It is adopted as a benchmark for deep learning papers and recommended as a default optimization algorithm. Moreover, the algorithm is straightforward to implement, has a fast running time and low memory requirements, and requires less tuning than most other optimization algorithms.

The update rule of the Adam optimizer combines both moment estimates:

mt = β1·mt−1 + (1 − β1)·gt
vt = β2·vt−1 + (1 − β2)·gt²
m̂t = mt / (1 − β1^t)
v̂t = vt / (1 − β2^t)
θt+1 = θt − α·m̂t / (√v̂t + ε)

These formulas represent the working of the Adam optimizer. Here β1 and β2 represent the decay rates of the moving averages of the gradient and the squared gradient.

If the Adam optimizer combines the good properties of all these algorithms and is the best available optimizer, why shouldn’t you use Adam in every application? And why did we need to learn about the other algorithms in depth? This is because even Adam has some downsides. It tends to prioritize fast computation, whereas algorithms like stochastic gradient descent focus on the data points. That is why algorithms like SGD often generalize better, at the cost of slower computation. So, the optimization algorithm can be picked depending on the requirements and the type of data.
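For reference, here is a minimal Python sketch of the Adam update described above, using hyperparameters close to the defaults in the paper (β1 = 0.9, β2 = 0.999, ε = 1e−8; the step size and the quadratic objective are illustrative choices):

import numpy as np

def adam(grad_f, theta, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, iters=2000):
    m = np.zeros_like(theta)    # first moment: moving average of gradients
    v = np.zeros_like(theta)    # second moment: moving average of squared gradients
    for t in range(1, iters + 1):
        g = grad_f(theta)
        m = beta1 * m + (1 - beta1) * g         # update biased first moment
        v = beta2 * v + (1 - beta2) * g ** 2    # update biased second moment
        m_hat = m / (1 - beta1 ** t)            # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# f(x, y) = x^2 + 100*y^2, gradient (2x, 200y).
print(adam(lambda p: np.array([2 * p[0], 200 * p[1]]), np.array([5.0, 5.0])))
# approaches (0, 0)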


The above visualizations create a better picture in mind and help in comparing the

results of various optimization algorith


Adam vs RMSProp
RMSProp is often compared to the Adam (Adaptive Moment Estimation)
optimization algorithm, another popular optimization method for deep learning.
Both algorithms combine elements of momentum and adaptive learning rates
to improve the optimization process, but Adam uses a slightly different
approach to compute the moving averages and adjust the learning rates.
Adam is generally more popular and widely used than the RMSProp optimizer,
but both algorithms can be effective in different settings.

Eigenvalue Decomposition

The process of decomposing a matrix into its eigenvectors and eigenvalues is known as eigenvalue decomposition or eigendecomposition. A matrix can also be transformed into an eigenbasis (the basis matrix where every column is an eigenvector).

Transformations and multiplications of matrices are computationally expensive. Thousands or millions of dimensions are commonplace in machine learning applications. Consider the task of performing matrix transformations on matrices with millions of dimensions: even the most powerful computers have their limits. However, as previously stated, operations on diagonal matrices are substantially simpler. If we can decompose a matrix into a diagonal form before performing any expensive operation, it makes our lives, as well as the lives of our computers, a lot easier.
Eigenvectors and Eigenvalues
Eigenvectors are conventionally normalized to unit vectors, meaning their length or magnitude is 1.0. They are also known as right vectors, which simply means column vectors (as opposed to row vectors, or left vectors). Eigenvalues are the coefficients applied to eigenvectors that determine the length or magnitude of the vectors. A negative eigenvalue, for example, may scale the eigenvector in the opposite direction.
Suppose that we have a matrix A, and we can find a number λ and a vector v such
that
Av=λv
We say that v is an eigenvector for A and λ is an eigenvalue.
How to Find Eigenvalues
Let's see how we can find them. By moving λv to the left-hand side and factoring out the vector, we can see that the above is identical to:

(A − λI)v = 0

Because the preceding equation must compress some direction to zero, A − λI is not invertible, and thus its determinant is zero. We can therefore find the eigenvalues by solving:

det(A − λI) = 0
Once we have the eigenvalues, we can solve Av = λv to find the corresponding eigenvector(s).
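In practice we rarely do this by hand; for example, NumPy's linear algebra routines solve det(A − λI) = 0 and recover the eigenvectors for us (the 2×2 matrix below is an illustrative choice):

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Characteristic polynomial: (4−λ)(3−λ) − 2 = (λ − 5)(λ − 2), so λ = 5 and λ = 2.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                           # 5 and 2 (order may vary)
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))       # True: each column satisfies Av = λv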
