UNIT 3
The objective we actually care about is defined as an expectation taken across the data-generating distribution p_data rather than just over the finite training set: J*(θ) = E_(x, y)∼p_data [L(f(x; θ), y)].
Challenges in Neural Network Optimization
There are plenty of challenges in deep learning optimization, but most of them are related to the nature of the model's gradient. Below, I've listed some of the most common challenges in deep learning optimization that you are likely to run into:
b) Flat Regions: In deep learning optimization, flat regions are common: they behave like a local minimum along some directions and a local maximum along others. That duality, together with the near-zero gradient in such regions, often causes optimization to get stuck.
c) Inexact Gradients: In many deep learning models the cost function is intractable, which forces an inexact estimation of the gradient. In these cases, the inexact gradients introduce a second layer of uncertainty into the model.
Stochastic Gradient Descent: This has already been described before, but there are certain things that should be kept in mind regarding SGD. The learning rate ϵ is a very important parameter for SGD, and in general it should be reduced after each epoch. This is because the random sampling of mini-batches acts as a source of noise, which might make SGD keep oscillating around the minimum without actually reaching it. This is shown below:
The true gradient of the total cost function (computed over the entire dataset) actually becomes 0 when we reach the minimum, so BGD can use a fixed learning rate. For SGD, the following conditions on the learning rates guarantee convergence under convexity assumptions:
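The standard sufficient conditions on the sequence of learning rates ϵ_k are that Σ ϵ_k = ∞ while Σ ϵ_k² < ∞ (both sums taken over all iterations k). Intuitively, the learning rates must remain large enough in total to reach the minimum from any starting point, yet shrink fast enough that the noise introduced by mini-batch sampling is eventually averaged out.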
Setting the learning rate too low makes training proceed slowly, which might leave the algorithm stuck at a high cost value. Setting it too high leads to large oscillations, which might even push the learning outside the optimal region. The best approach is to monitor the first several iterations and set the learning rate somewhat higher than the best-performing one, but not so high that it causes instability.
A big advantage of SGD is that the time taken to compute a weight update doesn't grow with the number of training examples, since each update is computed after observing a mini-batch of samples whose size is independent of the total number of training examples. Theoretically, for a convex problem, BGD drives the error to O(1/k) after k iterations, whereas SGD only reaches O(1/√k). However, SGD compensates for this slower asymptotic rate with the rapid progress it makes in the first few iterations and with its ability to make cheap, frequent updates on a large training set.
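As a rough illustration of both points (the decaying learning rate and the dataset-size-independent update cost), here is a minimal NumPy sketch; grad_fn, the batch size and the schedule constants ϵ_0, ϵ_τ, τ are placeholders, and the linear decay schedule is just one common choice:

    import numpy as np

    def linear_decay(k, eps0, eps_tau, tau):
        # Decay linearly from eps0 to eps_tau over the first tau updates,
        # then hold the learning rate constant at eps_tau.
        a = min(k / tau, 1.0)
        return (1 - a) * eps0 + a * eps_tau

    def sgd_step(theta, X, y, grad_fn, k, batch_size=32,
                 eps0=0.1, eps_tau=0.001, tau=10_000):
        # The gradient is estimated on a mini-batch, so the cost of one
        # update does not grow with the total number of training examples.
        idx = np.random.choice(len(X), batch_size, replace=False)
        g = grad_fn(theta, X[idx], y[idx])  # average gradient over the batch
        return theta - linear_decay(k, eps0, eps_tau, tau) * g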
The step size (earlier equal to the learning rate times the gradient) now depends on how large and aligned the sequence of gradients is. If the gradient at each iteration points in the same direction (say g), the contributions keep accumulating and the step size grows. Once it reaches a constant (terminal) velocity, the step size becomes ϵ||g|| / (1 − α). Thus, using α = 0.9 corresponds to a maximum speed 10 times that of plain gradient descent. Common values of α are 0.5, 0.9 and 0.99.
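As a minimal sketch, the standard momentum update described above can be written as follows (lr, alpha and the mini-batch gradient g are placeholders):

    def momentum_step(theta, v, g, lr, alpha=0.9):
        # v accumulates an exponentially decaying sum of past gradients.
        # If the gradient keeps pointing in the same direction g, the step
        # size approaches the terminal velocity lr * ||g|| / (1 - alpha).
        v = alpha * v - lr * g
        return theta + v, v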
The intuition behind Nesterov momentum is that, at a point θ in the parameter space, the momentum update is going to shift the point by αv, so we will soon end up in the vicinity of θ + αv. Thus, it might be better to compute the gradient at that look-ahead point instead. The figure below describes this visually:
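In code, the difference from standard momentum is only where the gradient is evaluated: at the look-ahead point θ + αv rather than at θ itself (grad_fn, lr and alpha are placeholders):

    def nesterov_step(theta, v, grad_fn, lr, alpha=0.9):
        # Evaluate the gradient at the point the momentum is about to carry us to.
        g = grad_fn(theta + alpha * v)
        v = alpha * v - lr * g
        return theta + v, v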
Biases are often chosen heuristically (zero mostly) and only the weights are
randomly initialized, almost always from a Gaussian or uniform distribution.
The scale of the distribution is of utmost concern. Large weights might have
better symmetry-breaking effect but might lead to chaos (extreme sensitivity to
small perturbations in the input) and exploding values during forward & back
propagation. As an example of how large weights might lead to chaos, consider a slight noise ϵ added to the input. Now, if we applied just a simple linear transformation like W * x, the noise would add a term W * ϵ to the output. If the weights are large, this ends up making a significant contribution to the output. SGD and its variants tend to halt in areas near the
initial values, thereby expressing a prior that the path to the final parameters
from the initial values is discoverable by steepest descent algorithms. A more
mathematical explanation for the symmetry breaking can be found in
the Appendix.
U(a, b) represents the uniform distribution whose probability density is 1/(b − a) for every value between a and b, inclusive, and 0 for every other value.
These initializations have already been incorporated into the most commonly used deep learning frameworks, so you can just specify which initializer to use and the framework takes care of sampling appropriately. For example, Keras, a widely used deep learning framework, has a module called initializers, where the second of the two distributions mentioned above is implemented as glorot_uniform.
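For example, a Dense layer in Keras can be given this initializer explicitly (it is also the default for Dense layers); the layer size and activation here are purely illustrative:

    from tensorflow import keras

    # glorot_uniform samples weights from U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)); biases default to zeros.
    layer = keras.layers.Dense(
        128,
        activation="relu",
        kernel_initializer="glorot_uniform",
        bias_initializer="zeros",
    )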
One drawback of using 1 / √m as the standard deviation is that the weights end up being small when a layer has a large number of input/output units. Motivated by the idea of keeping the total amount of input to each unit independent of the number of input units m, sparse initialization sets each unit to have exactly k non-zero weights. However, gradient descent takes a long time to correct weights that happen to be initialized to inappropriately large values, so this initialization might cause problems.
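A rough NumPy sketch of this scheme, assuming a weight matrix of shape (fan_in, fan_out); the value of k and the Gaussian scale are illustrative choices, not prescribed by the text:

    import numpy as np

    def sparse_init(fan_in, fan_out, k=15, scale=1.0, rng=None):
        # Each output unit receives exactly k non-zero incoming weights, so
        # the total input to a unit does not shrink as fan_in grows.
        rng = rng or np.random.default_rng()
        W = np.zeros((fan_in, fan_out))
        for j in range(fan_out):
            idx = rng.choice(fan_in, size=min(k, fan_in), replace=False)
            W[idx, j] = rng.normal(0.0, scale, size=len(idx))
        return W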
If the weights are too small, the range of activations across the mini-batch will
shrink as the activations propagate forward through the network. By repeatedly
identifying the first layer with unacceptably small activations and increasing its
weights, it is possible to eventually obtain a network with reasonable initial
activations throughout.
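A sketch of that procedure, assuming hypothetical layer objects that expose a weight matrix W and a forward() method; the target scale and tolerance are illustrative:

    def rescale_small_layers(layers, x_batch, target_std=1.0, tol=0.1, max_passes=20):
        # Repeatedly find the first layer whose activations on a mini-batch
        # are unacceptably small, scale its weights up, and re-run the
        # forward pass from the start.
        for _ in range(max_passes):
            h = x_batch
            adjusted = False
            for layer in layers:
                h = layer.forward(h)
                std = h.std()
                if std < target_std - tol:
                    layer.W *= target_std / max(std, 1e-8)
                    adjusted = True
                    break
            if not adjusted:
                break
        return layers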
The biases are relatively easier to choose. Setting the biases to zero is compatible with most weight initialization schemes, except in a few cases: for example, for an output unit, to prevent saturation at initialization, or when a unit acts as a gate for making a decision.
This figure illustrates the need to reduce the learning rate when the gradient is large, in the case of a single parameter: 1) one step of gradient descent with a large gradient value; 2) the result of reducing the learning rate, which moves towards the minimum; 3) what would happen if the learning rate were not reduced: the step would have jumped over the minimum.
However, accumulation of squared gradients from the very beginning can lead to an excessive and premature decrease in the learning rate. Consider a model with only 2 parameters (for simplicity), where both initial gradients are 1000. After some iterations, the gradient of one of the parameters has reduced to 100, but that of the other parameter is still around 750. However, because of the accumulation at each update, the accumulated gradients would still have almost the same value. For example (summing the raw gradient magnitudes for simplicity), let the accumulated gradient at each step for Parameter 1 be 1000 + 900 + 700 + 400 + 100 = 3100, giving 1/3100 ≈ 0.0003, and that for Parameter 2 be 1000 + 900 + 850 + 800 + 750 = 4300, giving 1/4300 ≈ 0.0002. This leads to a similar decrease in the learning rates for both parameters, even though the parameter with the smaller gradient might have its learning rate reduced far too much, leading to slower learning.

Figure: accumulated gradients in AdaGrad can cause the learning rate to be reduced far too much in the later stages, leading to slower learning.
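For reference, a minimal sketch of the AdaGrad update that produces this behaviour (the learning rate and the small constant delta are typical placeholder values):

    import numpy as np

    def adagrad_step(theta, g, r, lr=0.01, delta=1e-7):
        # r accumulates squared gradients from the very first iteration, so
        # the per-parameter effective rate lr / (delta + sqrt(r)) only shrinks.
        r = r + g * g
        theta = theta - lr * g / (delta + np.sqrt(r))
        return theta, r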
The optimization algorithms that we've looked at till now involve computing only the first derivative. But there are many methods that involve higher-order derivatives as well. The main problem with these algorithms is that they are not practically feasible in their vanilla form, so certain methods are used to approximate the required quantities. We explain three such methods, all of which use the empirical risk as the objective function:
We know that we get a critical point for any function f(x) by solving f'(x) = 0. Applying this to the second-order approximation of the cost function around a point θ0 gives the following critical point, the Newton update (refer to the Appendix for proof): θ* = θ0 − H⁻¹ ∇θ J(θ0), where H is the Hessian of J evaluated at θ0.
However, if there is strong negative curvature, i.e. the eigenvalues of the Hessian are largely negative, α needs to be sufficiently high to offset them, in which case the regularized Hessian H + αI becomes dominated by the diagonal αI term. The update then reduces to the standard gradient divided by α: θ* = θ0 − (1/α) ∇θ J(θ0).
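A minimal sketch of the regularized (damped) Newton update this passage refers to, θ ← θ − (H + αI)⁻¹ ∇J(θ); grad and hess are placeholder callables returning the gradient vector and the Hessian matrix:

    import numpy as np

    def damped_newton_step(theta, grad, hess, alpha=0.0):
        # Regularize the Hessian by adding alpha along its diagonal. When
        # alpha dominates the eigenvalues of H, (H + alpha*I)^(-1) is roughly
        # I / alpha and the step reduces to grad(theta) / alpha.
        H = hess(theta) + alpha * np.eye(theta.shape[0])
        return theta - np.linalg.solve(H, grad(theta))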
Going back to the earlier example of y*, let the activations of layer l − 1 be given by h(l−1). Then h(l−1) = x W1 W2 … W(l−1). Now, if x is drawn from a unit Gaussian, then h(l−1) also comes from a Gaussian, although not one of zero mean and unit variance, since it is a linear transformation of x. BN makes it zero mean and unit variance. Therefore, y* = h(l−1) Wl, and the learning now becomes much simpler, as the parameters at the lower layers mostly do not have any effect. This simplicity was admittedly achieved by rendering the lower layers useless; in a realistic deep network with non-linearities, however, the lower layers remain useful. Finally, the complete reparameterization of BN is given by replacing H with γH' + β. This is done to retain the expressive power of the network, with the benefit that the mean is now determined solely by β rather than by a complicated interaction between the lower-layer parameters. Also, of the two choices of normalizing X or XW + B, the authors recommend the latter, specifically XW, since B becomes redundant because of β. Practically, this means that when we are using a Batch Normalization layer, the bias of the preceding layer should be turned off. In a deep learning framework like Keras, this can be done by setting the parameter use_bias=False in the Convolutional layer.
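A minimal Keras sketch of this arrangement (layer sizes and the input shape are illustrative): the convolution's own bias is disabled because the β parameter of the following BatchNormalization layer plays the same role.

    from tensorflow import keras

    model = keras.Sequential([
        # Bias turned off: BatchNormalization's beta replaces it.
        keras.layers.Conv2D(64, 3, padding="same", use_bias=False,
                            input_shape=(32, 32, 3)),
        keras.layers.BatchNormalization(),
        keras.layers.Activation("relu"),
    ])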
This cost function describes the learning problem called sparse coding.
Here, H refers to the sparse representation of X and W is the set of weights used
to linearly decode H to retrieve X. An explanation of why this cost function
enforces the learning of a sparse representation of X follows. The first term of
the cost function penalizes values far from 0 (positive or negative, because of the modulus operator |H|). This forces most of the values to be 0, thereby making H sparse. The second term is pretty self-explanatory in that it penalizes the difference between X and H linearly transformed by W, thereby enforcing them to take the same value. In this way, H is learned as a sparse “representation” of X. The cost function generally also includes a regularization term such as weight decay, which has been omitted here for simplicity.

Here, we can divide the entire list of parameters into two sets, W and H. Minimizing the cost function with respect to either of these sets of parameters is a convex problem. Coordinate Descent (CD) refers to minimizing the cost function with respect to only 1 parameter at a time. It has been shown that by repeatedly cycling through all the parameters, we are guaranteed to arrive at a local minimum. If, instead of 1 parameter, we take a set of parameters at a time, as we did above with W and H, it is called block coordinate descent (the interested reader should explore Alternating Minimization). CD makes sense if either the parameters can be clearly separated into independent groups, or if optimizing with respect to a certain set of parameters is significantly more efficient than with respect to the others.
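A rough NumPy sketch of block coordinate descent for sparse coding, assuming a cost of the form λ·Σ|H| + ‖X − HW‖²; the sub-problem solvers (a single soft-thresholding step for H, ordinary least squares for W) and the constants are simplified placeholders rather than an efficient implementation:

    import numpy as np

    def sparse_coding_bcd(X, n_components, lam=0.1, n_iter=50, seed=0):
        # Alternate between the two convex sub-problems: minimize over H
        # with W fixed, then over W with H fixed.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.normal(size=(n_components, d))
        H = np.zeros((n, n_components))
        for _ in range(n_iter):
            # H-step: one proximal-gradient (soft-thresholding) step on the L1 term.
            step = 1.0 / max(np.linalg.norm(W @ W.T, 2), 1e-8)
            Z = H - step * (H @ W - X) @ W.T
            H = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)
            # W-step: least-squares fit of X from the current codes H.
            W = np.linalg.lstsq(H, X, rcond=None)[0]
        return H, W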