UNIT 3
The objective we actually care about is defined as an expectation taken across the data-generating distribution p_data rather than just over the finite training set: J*(θ) = E_(x, y)∼p_data [L(f(x; θ), y)].
Challenges in Neural Network Optimization
There are plenty of challenges in deep learning optimization, but most of them are related to the nature of the model's gradient. Below, I've listed some of the most common challenges in deep learning optimization that you are likely to run into:
b) Flat Regions: In deep learning optimization, flat regions are common: they behave like a local minimum along some directions and a local maximum along others. That duality, together with the near-zero gradient in such regions, often causes optimization to get stuck.
c) Inexact Gradients: In many deep learning models the cost function is intractable, which forces an inexact estimation of the gradient. In these cases, the inexact gradients introduce a second layer of uncertainty into the model.
Stochastic Gradient Descent: This has already been described before, but there are certain things that should be kept in mind regarding SGD. The learning rate ϵ is a very important parameter for SGD, and in general it should be reduced after each epoch. This is because the random sampling of mini-batches acts as a source of noise, which might make SGD keep oscillating around the minimum without actually reaching it. This is shown below:
The true gradient of the total cost function (computed over the entire dataset) actually becomes 0 when we reach the minimum, so BGD can use a fixed learning rate. For SGD, the following conditions on the learning rates guarantee convergence under convexity assumptions:
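The standard sufficient conditions on the sequence of learning rates ϵ_k are that Σ ϵ_k = ∞ while Σ ϵ_k² < ∞ (both sums taken over all iterations k). Intuitively, the learning rates must remain large enough in total to reach the minimum from any starting point, yet shrink fast enough that the noise introduced by mini-batch sampling is eventually averaged out.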
Setting the learning rate too low makes training proceed slowly, which might leave the algorithm stuck at a high cost value. Setting it too high leads to large oscillations, which might even push the learning outside the optimal region. The best approach is to monitor the first several iterations and set the learning rate somewhat higher than the best-performing one, but not so high that it causes instability.
A big advantage of SGD is that the time taken to compute a weight update doesn't grow with the number of training examples, since each update is computed after observing a mini-batch of samples whose size is independent of the total number of training examples. Theoretically, for a convex problem, BGD drives the error to O(1/k) after k iterations, whereas SGD only reaches O(1/√k). However, SGD compensates for this slower asymptotic rate with the rapid progress it makes in the first few iterations and with its ability to make cheap, frequent updates on a large training set.
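As a rough illustration of both points (the decaying learning rate and the dataset-size-independent update cost), here is a minimal NumPy sketch; grad_fn, the batch size and the schedule constants ϵ_0, ϵ_τ, τ are placeholders, and the linear decay schedule is just one common choice:

    import numpy as np

    def linear_decay(k, eps0, eps_tau, tau):
        # Decay linearly from eps0 to eps_tau over the first tau updates,
        # then hold the learning rate constant at eps_tau.
        a = min(k / tau, 1.0)
        return (1 - a) * eps0 + a * eps_tau

    def sgd_step(theta, X, y, grad_fn, k, batch_size=32,
                 eps0=0.1, eps_tau=0.001, tau=10_000):
        # The gradient is estimated on a mini-batch, so the cost of one
        # update does not grow with the total number of training examples.
        idx = np.random.choice(len(X), batch_size, replace=False)
        g = grad_fn(theta, X[idx], y[idx])  # average gradient over the batch
        return theta - linear_decay(k, eps0, eps_tau, tau) * g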
The step size (earlier equal to the learning rate times the gradient) now depends on how large and aligned the sequence of gradients is. If the gradient at each iteration points in the same direction (say g), the contributions keep accumulating and the step size grows. Once it reaches a constant (terminal) velocity, the step size becomes ϵ||g|| / (1 − α). Thus, using α = 0.9 corresponds to a maximum speed 10 times that of plain gradient descent. Common values of α are 0.5, 0.9 and 0.99.
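As a minimal sketch, the standard momentum update described above can be written as follows (lr, alpha and the mini-batch gradient g are placeholders):

    def momentum_step(theta, v, g, lr, alpha=0.9):
        # v accumulates an exponentially decaying sum of past gradients.
        # If the gradient keeps pointing in the same direction g, the step
        # size approaches the terminal velocity lr * ||g|| / (1 - alpha).
        v = alpha * v - lr * g
        return theta + v, v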
The intuition behind Nesterov momentum is that, at a point θ in the parameter space, the momentum update is going to shift the point by αv, so we will soon end up in the vicinity of θ + αv. Thus, it might be better to compute the gradient at that look-ahead point instead. The figure below describes this visually:
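In code, the difference from standard momentum is only where the gradient is evaluated: at the look-ahead point θ + αv rather than at θ itself (grad_fn, lr and alpha are placeholders):

    def nesterov_step(theta, v, grad_fn, lr, alpha=0.9):
        # Evaluate the gradient at the point the momentum is about to carry us to.
        g = grad_fn(theta + alpha * v)
        v = alpha * v - lr * g
        return theta + v, v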
Biases are often chosen heuristically (zero mostly) and only the weights are
randomly initialized, almost always from a Gaussian or uniform distribution.
The scale of the distribution is of utmost concern. Large weights might have
better symmetry-breaking effect but might lead to chaos (extreme sensitivity to
small perturbations in the input) and exploding values during forward & back
propagation. As an example of how large weights might lead to chaos, consider a slight noise ϵ added to the input. Now, if we applied just a simple linear transformation like W * x, the noise would add a term W * ϵ to the output. If the weights are large, this ends up making a significant contribution to the output. SGD and its variants tend to halt in areas near the
initial values, thereby expressing a prior that the path to the final parameters
from the initial values is discoverable by steepest descent algorithms. A more
mathematical explanation for the symmetry breaking can be found in
the Appendix.
U(a, b) represents the uniform distribution whose probability density is 1/(b − a) for every value between a and b, inclusive, and 0 for every other value.
These initializations have already been incorporated into the most commonly used deep learning frameworks, so you can just specify which initializer to use and the framework takes care of sampling appropriately. For example, Keras, a widely used deep learning framework, has a module called initializers, where the second of the two distributions mentioned above is implemented as glorot_uniform.
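For example, a Dense layer in Keras can be given this initializer explicitly (it is also the default for Dense layers); the layer size and activation here are purely illustrative:

    from tensorflow import keras

    # glorot_uniform samples weights from U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)); biases default to zeros.
    layer = keras.layers.Dense(
        128,
        activation="relu",
        kernel_initializer="glorot_uniform",
        bias_initializer="zeros",
    )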
One drawback of using 1 / √m as the standard deviation is that the weights end up being small when a layer has a large number of input/output units. Motivated by the idea of keeping the total amount of input to each unit independent of the number of input units m, sparse initialization sets each unit to have exactly k non-zero weights. However, gradient descent takes a long time to correct weights that happen to be initialized to inappropriately large values, so this initialization might cause problems.
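A rough NumPy sketch of this scheme, assuming a weight matrix of shape (fan_in, fan_out); the value of k and the Gaussian scale are illustrative choices, not prescribed by the text:

    import numpy as np

    def sparse_init(fan_in, fan_out, k=15, scale=1.0, rng=None):
        # Each output unit receives exactly k non-zero incoming weights, so
        # the total input to a unit does not shrink as fan_in grows.
        rng = rng or np.random.default_rng()
        W = np.zeros((fan_in, fan_out))
        for j in range(fan_out):
            idx = rng.choice(fan_in, size=min(k, fan_in), replace=False)
            W[idx, j] = rng.normal(0.0, scale, size=len(idx))
        return W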
If the weights are too small, the range of activations across the mini-batch will
shrink as the activations propagate forward through the network. By repeatedly
identifying the first layer with unacceptably small activations and increasing its
weights, it is possible to eventually obtain a network with reasonable initial
activations throughout.
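A sketch of that procedure, assuming hypothetical layer objects that expose a weight matrix W and a forward() method; the target scale and tolerance are illustrative:

    def rescale_small_layers(layers, x_batch, target_std=1.0, tol=0.1, max_passes=20):
        # Repeatedly find the first layer whose activations on a mini-batch
        # are unacceptably small, scale its weights up, and re-run the
        # forward pass from the start.
        for _ in range(max_passes):
            h = x_batch
            adjusted = False
            for layer in layers:
                h = layer.forward(h)
                std = h.std()
                if std < target_std - tol:
                    layer.W *= target_std / max(std, 1e-8)
                    adjusted = True
                    break
            if not adjusted:
                break
        return layers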
The biases are relatively easier to choose. Setting the biases to zero is compatible with most weight initialization schemes, except in a few cases: for example, for an output unit, to prevent saturation at initialization, or when a unit acts as a gate for making a decision.
This figure illustrates the need to reduce the learning rate when the gradient is large, in the case of a single parameter: 1) one step of gradient descent with a large gradient value; 2) the result of reducing the learning rate, which moves towards the minimum; 3) what would happen if the learning rate were not reduced: the step would have jumped over the minimum.
However, accumulation of squared gradients from the very beginning can lead to an excessive and premature decrease in the learning rate. Consider a model with only 2 parameters (for simplicity), where both initial gradients are 1000. After some iterations, the gradient of one of the parameters has reduced to 100, but that of the other parameter is still around 750. However, because of the accumulation at each update, the accumulated gradients would still have almost the same value. For example (summing the raw gradient magnitudes for simplicity), let the accumulated gradient at each step for Parameter 1 be 1000 + 900 + 700 + 400 + 100 = 3100, giving 1/3100 ≈ 0.0003, and that for Parameter 2 be 1000 + 900 + 850 + 800 + 750 = 4300, giving 1/4300 ≈ 0.0002. This leads to a similar decrease in the learning rates for both parameters, even though the parameter with the smaller gradient might have its learning rate reduced far too much, leading to slower learning.

Figure: accumulated gradients in AdaGrad can cause the learning rate to be reduced far too much in the later stages, leading to slower learning.
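For reference, a minimal sketch of the AdaGrad update that produces this behaviour (the learning rate and the small constant delta are typical placeholder values):

    import numpy as np

    def adagrad_step(theta, g, r, lr=0.01, delta=1e-7):
        # r accumulates squared gradients from the very first iteration, so
        # the per-parameter effective rate lr / (delta + sqrt(r)) only shrinks.
        r = r + g * g
        theta = theta - lr * g / (delta + np.sqrt(r))
        return theta, r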
The optimization algorithms that we've looked at till now involve computing only the first derivative. But there are many methods that involve higher-order derivatives as well. The main problem with these algorithms is that they are not practically feasible in their vanilla form, so certain methods are used to approximate the required quantities. We explain three such methods, all of which use the empirical risk as the objective function:
We know that we get a critical point for any function f(x) by solving f'(x) = 0. Applying this to the second-order approximation of the cost function around a point θ0 gives the following critical point, the Newton update (refer to the Appendix for proof): θ* = θ0 − H⁻¹ ∇θ J(θ0), where H is the Hessian of J evaluated at θ0.
However, if there is strong negative curvature, i.e. the eigenvalues of the Hessian are largely negative, α needs to be sufficiently high to offset them, in which case the regularized Hessian H + αI becomes dominated by the diagonal αI term. The update then reduces to the standard gradient divided by α: θ* = θ0 − (1/α) ∇θ J(θ0).
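A minimal sketch of the regularized (damped) Newton update this passage refers to, θ ← θ − (H + αI)⁻¹ ∇J(θ); grad and hess are placeholder callables returning the gradient vector and the Hessian matrix:

    import numpy as np

    def damped_newton_step(theta, grad, hess, alpha=0.0):
        # Regularize the Hessian by adding alpha along its diagonal. When
        # alpha dominates the eigenvalues of H, (H + alpha*I)^(-1) is roughly
        # I / alpha and the step reduces to grad(theta) / alpha.
        H = hess(theta) + alpha * np.eye(theta.shape[0])
        return theta - np.linalg.solve(H, grad(theta))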
Going back to the earlier example of y*, let the activations of layer l − 1 be given by h(l−1). Then h(l−1) = x W1 W2 … W(l−1). Now, if x is drawn from a unit Gaussian, then h(l−1) also comes from a Gaussian, although not one of zero mean and unit variance, since it is a linear transformation of x. BN makes it zero mean and unit variance. Therefore, y* = h(l−1) Wl, and the learning now becomes much simpler, as the parameters at the lower layers mostly do not have any effect. This simplicity was admittedly achieved by rendering the lower layers useless; in a realistic deep network with non-linearities, however, the lower layers remain useful. Finally, the complete reparameterization of BN is given by replacing H with γH' + β. This is done to retain the expressive power of the network, with the benefit that the mean is now determined solely by β rather than by a complicated interaction between the lower-layer parameters. Also, of the two choices of normalizing X or XW + B, the authors recommend the latter, specifically XW, since B becomes redundant because of β. Practically, this means that when we are using a Batch Normalization layer, the bias of the preceding layer should be turned off. In a deep learning framework like Keras, this can be done by setting the parameter use_bias=False in the Convolutional layer.
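A minimal Keras sketch of this arrangement (layer sizes and the input shape are illustrative): the convolution's own bias is disabled because the β parameter of the following BatchNormalization layer plays the same role.

    from tensorflow import keras

    model = keras.Sequential([
        # Bias turned off: BatchNormalization's beta replaces it.
        keras.layers.Conv2D(64, 3, padding="same", use_bias=False,
                            input_shape=(32, 32, 3)),
        keras.layers.BatchNormalization(),
        keras.layers.Activation("relu"),
    ])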
This cost function describes the learning problem called sparse coding.
Here, H refers to the sparse representation of X and W is the set of weights used
to linearly decode H to retrieve X. An explanation of why this cost function
enforces the learning of a sparse representation of X follows. The first term of
the cost function penalizes values far from 0 (positive or negative, because of the modulus operator |H|). This forces most of the values to be 0, thereby making H sparse. The second term is pretty self-explanatory in that it penalizes the difference between X and H linearly transformed by W, thereby enforcing them to take the same value. In this way, H is learned as a sparse “representation” of X. The cost function generally also includes a regularization term such as weight decay, which has been omitted here for simplicity.

Here, we can divide the entire list of parameters into two sets, W and H. Minimizing the cost function with respect to either of these sets of parameters is a convex problem. Coordinate Descent (CD) refers to minimizing the cost function with respect to only 1 parameter at a time. It has been shown that by repeatedly cycling through all the parameters, we are guaranteed to arrive at a local minimum. If, instead of 1 parameter, we take a set of parameters at a time, as we did above with W and H, it is called block coordinate descent (the interested reader should explore Alternating Minimization). CD makes sense if either the parameters can be clearly separated into independent groups, or if optimizing with respect to a certain set of parameters is significantly more efficient than with respect to the others.
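A rough NumPy sketch of block coordinate descent for sparse coding, assuming a cost of the form λ·Σ|H| + ‖X − HW‖²; the sub-problem solvers (a single soft-thresholding step for H, ordinary least squares for W) and the constants are simplified placeholders rather than an efficient implementation:

    import numpy as np

    def sparse_coding_bcd(X, n_components, lam=0.1, n_iter=50, seed=0):
        # Alternate between the two convex sub-problems: minimize over H
        # with W fixed, then over W with H fixed.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.normal(size=(n_components, d))
        H = np.zeros((n, n_components))
        for _ in range(n_iter):
            # H-step: one proximal-gradient (soft-thresholding) step on the L1 term.
            step = 1.0 / max(np.linalg.norm(W @ W.T, 2), 1e-8)
            Z = H - step * (H @ W - X) @ W.T
            H = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)
            # W-step: least-squares fit of X from the current codes H.
            W = np.linalg.lstsq(H, X, rcond=None)[0]
        return H, W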