
AdaS: Adaptive Scheduling of Stochastic Gradients

Mahdi S. Hosseini and Konstantinos N. Plataniotis


University of Toronto, The Edward S. Rogers Sr. Department of Electrical & Computer Engineering
Toronto, Ontario, M5S 3G4, Canada
[email protected]
arXiv:2006.06587v1 [cs.LG] 11 Jun 2020

Abstract
The choice of step-size used in Stochastic Gradient Descent (SGD) optimization is
empirically selected in most training procedures. Moreover, the use of scheduled
learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to
tune the step-size requires extensive practical experience, offers limited insight
into how the parameters update, and is not consistent across applications. This
work attempts to answer a question of interest to both researchers and practition-
ers, namely “how much knowledge is gained in iterative training of deep neural
networks?” Answering this question introduces two useful metrics derived from
the singular values of the low-rank factorization of convolution layers in deep
neural networks. We introduce the notions of “knowledge gain” and “mapping
condition” and propose a new algorithm called Adaptive Scheduling (AdaS) that
utilizes these derived metrics to adapt the SGD learning rate proportionally to
the rate of change in knowledge gain over successive iterations. Experimentation
reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and
superior generalization over existing adaptive learning methods; and (b) lack of
dependence on a validation set to determine when to stop training. Code is available
at https://github.com/mahdihosseini/AdaS.

1 Introduction
Stochastic Gradient Descent (SGD), a first-order optimization method [26, 3, 4, 29], has become
the mainstream method for training over-parametrized models such as deep neural networks [17, 8].
Attempting to augment this method, SGD with momentum [24, 34] accumulates the historically
aligned gradients which helps in navigating past ravines and towards a more optimal solution. It
eventually converges faster and exhibits better generalization compared to vanilla SGD. However,
as the step-size (aka global learning rate) is mainly fixed for momentum SGD, it blindly follows
these past gradients and can eventually overshoot an optimum and cause oscillatory behavior. From a
practical standpoint (e.g. in the context of deep neural networks [2, 28, 17]) it is even more concerning
to deploy a fixed global learning rate, as it often leads to poorer convergence, requires extensive tuning,
and exhibits strong performance fluctuations over the selection range.
A handful of adaptive gradient methods [7, 36, 40, 13, 6, 25, 18, 20, 1, 27, 37, 22] have been
introduced over the past decade to address these issues. These methods can be
represented in the general form:
$$\Phi_k \longleftarrow \Phi_{k-1} - \frac{\eta_k}{\psi(g_1, \cdots, g_k)}\, \phi(g_1, \cdots, g_k), \qquad (1)$$
where for some kth iteration, gi is the stochastic gradient obtained at the ith iteration, φ(g1 , · · · , gk )
is the gradient estimation, and ηk /ψ(g1 , · · · , gk ) is the adaptive learning rate, where ψ(g1 , · · · , gk )
generally relates to the square of the gradients. Each adaptive method therefore attempts to modify
the gradient estimation (through the use of momentum) or the adaptive learning rate (through a

Preprint. Under review.


different choice of ψ(g_1, · · · , g_k)). Furthermore, it is also common to subject η_k to a manually set
schedule for better performance and stronger theoretical convergence guarantees.
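To make the general form (1) concrete, the following minimal NumPy sketch (ours, not from the paper) shows how two common optimizers instantiate the gradient estimate φ and the denominator ψ; the function and state names are illustrative assumptions only.

import numpy as np

def adagrad_step(theta, g, state, eta=1e-2, eps=1e-8):
    # AdaGrad as an instance of Eq. (1): phi(g_1..k) = g_k and
    # psi(g_1..k) = sqrt(sum of squared past gradients) + eps.
    state["sum_sq"] = state.get("sum_sq", np.zeros_like(g)) + g**2
    return theta - eta / (np.sqrt(state["sum_sq"]) + eps) * g

def adam_step(theta, g, state, k, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # AdaM as an instance of Eq. (1): phi = bias-corrected first moment,
    # psi = sqrt of the bias-corrected second moment (k is 1-indexed).
    state["m"] = b1 * state.get("m", np.zeros_like(g)) + (1 - b1) * g
    state["v"] = b2 * state.get("v", np.zeros_like(g)) + (1 - b2) * g**2
    m_hat = state["m"] / (1 - b1**k)
    v_hat = state["v"] / (1 - b2**k)
    return theta - eta / (np.sqrt(v_hat) + eps) * m_hat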
Such methods were first introduced in [7] (AdaGrad) by regulating the update size with the accumu-
lated second order statistical measures of gradients which provides a robust framework for sparse
gradient updates. The issue of vanishing learning rate caused by equally weighted accumulation of
gradients is the main drawback of AdaGrad that was raised in [36] (RMSProp), which utilizes the
exponential decaying average of gradients instead of accumulation. A variant of first-order gradient
measures was also introduced in [40] (AdaDelta), which solves the decaying learning rate problem
using an accumulation window, providing a robust framework toward hyper-parameter tuning issues.
The adaptive moment estimation in [13] (AdaM) was introduced later to leverage both first and
second moment measures of gradients for parameter updating. AdaM can be seen as the culmination
of all three adaptive optimizers (AdaGrad, RMSProp, and AdaDelta): it solves the vanishing learning
rate problem and offers a more refined adaptive learning rate that improves both convergence speed and
generalization. Further improvements were made on AdaM using Nesterov momentum
[6], long-term memory of past gradients [25], rectified estimations [18], a dynamic bound on the learning
rate [20], the hyper-gradient descent method [1], and loss-based step-sizes [27]. Methods based on
line-search techniques [37] and coin betting [22] have also been introduced to avoid the bottlenecks
caused by hyper-parameter tuning in SGD.
The AdaM optimizer, as well as its other variants, has attracted many practitioners in deep learning
for two main reasons: (1) it requires minimal hyper-parameter tuning effort; and (2) it provides an
efficient convergence optimization framework. Despite the ease of implementation of such optimizers,
there is a growing concern about their poor “generalization” capabilities. They perform well on the
given-samples i.e. training data (at times even better performance can be achieved compared to non-
adaptive methods such as in [19, 32, 33]), but perform poorly on the out-of-samples i.e. test/evaluation
data [38]. Despite various research efforts taken for adaptive learning methods, the non-adaptive
SGD based optimizers (such as scheduled learning methods including Warmup Techniques [19],
Cyclical-Learning [32, 33], and Step-Decaying [8]) are still considered the gold standard
for achieving better performance, at the price of more training epochs and/or costly tuning of
hyper-parameter configurations for different datasets and models.
Our goal in this paper is twofold: (1) we address the above issues by proposing a new approach for
adaptive methods in SGD optimization; and (2) we introduce new probing metrics that enable the
monitoring and evaluation of quality of learning within layers of a Convolutional Neural Network
(CNN). Unlike the general trend in most adaptive methods where the raw measurements from
gradients are utilized to adapt the step-size and regulate the gradients (through different choices of
adaptive learning rate or gradient estimation), we take a different approach and focus our efforts on
the scheduling of the learning rate ηk independently for each convolutional block. Specifically, we
first ask “how much of the gradients are useful for SGD updates?” and then translate this into a new
concept we call the “knowledge gain”, which is measured from the energy of low-rank factorization
of convolution weights in deep layers. The knowledge gain defines the usefulness of gradients and
adapts the next step-size ηk for SGD updates. We summarize our contributions as follows:

1. The new concepts of “knowledge gain” and “mapping condition” are introduced to measure
the quality of convolution weights during iterative training and to answer the
questions: how well are the layers trained after a certain number of epochs? Is
enough information obtained through the sequence of updates?
2. We propose a new adaptive scheduling algorithm for SGD called “AdaS” which introduces
minimal computational overhead over vanilla SGD and guarantees the increase of knowledge
gain over consecutive epochs. AdaS adaptively schedules η_k for every conv block and both
generalizes well and outperforms previous adaptive methods, e.g. AdaM. A gain-factor
hyper-parameter is tuned in AdaS to trade off between fast convergence and greedy
performance. Code is available at https://github.com/mahdihosseini/AdaS.
3. Thorough experiments are conducted on image classification problems using various datasets
and CNN models. We adopt different optimizers and compare their convergence speed and
generalization characteristics to our AdaS optimizer.
4. A new probing tool based on knowledge gain and mapping condition is introduced to measure
the quality of network training without requiring test/evaluation data. We investigate the
relationship between our new quality metrics and performance results.

2 Knowledge Gain in CNN Training
Central to our work is the notion of knowledge gain measured from convolutional weights of
CNNs. Consider the convolutional weights of a particular layer in a CNN defined by a four-
way array (aka fourth-order tensor) Φ ∈ RN1 ×N2 ×N3 ×N4 , where N1 and N2 are the height and
width of the convolutional kernels, and N3 and N4 correspond to the number of input and output
channels, respectively. The feature mapping under this convolution operation follows $F_O(:,:,\ell_4) = \sum_{\ell_3=1}^{N_3} F_I(:,:,\ell_3) \ast \Phi(:,:,\ell_3,\ell_4)$, where $F_I$ and $F_O$ are the input and output feature maps stacked in
3D volumes, and $\ell_4 \in \{1, \ldots, N_4\}$ is the output index. The well-posedness of this feature mapping
can be studied by the generalized spectral decomposition (i.e. SVD) form of the tensor arrays using
the Tucker model [14, 30] in full-core tensor mode
$$\Phi = \sum_{\ell_1=1}^{N_1} \sum_{\ell_2=1}^{N_2} \sum_{\ell_3=1}^{N_3} \sum_{\ell_4=1}^{N_4} G(\ell_1, \ell_2, \ell_3, \ell_4)\, u_{\ell_1} \circ u_{\ell_2} \circ u_{\ell_3} \circ u_{\ell_4}, \qquad (2)$$

where the core $G$ (containing the singular values) is called an $(N_1, N_2, N_3, N_4)$-tensor, $u_{\ell_i} \in \mathbb{R}^{N_i}$ is the
factor basis for the decomposition, and $\circ$ is the outer product operation. We use similar notation as in [30]
for brevity. Note that $\Phi$ can be at most of rank $(N_1, N_2, N_3, N_4)$.
The tensor array in (2) is (usually) initialized by random noise sampling for CNN training, such that the
mapping under this tensor randomly spans the output dimensions (i.e. the diffusion of knowledge is
fully random in the beginning, with no learned structure). Throughout iterative training,
more knowledge is gained and the tensor comes to lie in a mixture of a low-rank manifold and perturbing
noise. Therefore, it makes sense to decompose (factorize) the observed tensor within each layer
of the CNN as $\Phi = \hat\Phi + E$, splitting it into a low-rank tensor $\hat\Phi$ with a small core such that the
residual error $||E||_F^2 = ||\Phi - \hat\Phi||_F^2$ is minimized. A
similar framework is also used in CNN compression [16, 35, 12, 39]. A handful of techniques (e.g.
CP/PARAFAC, TT, HT, truncated-MLSVD, Compression) can be found in [14, 23, 9, 30] to estimate
such a small-core tensor. The majority of these solutions are iterative, and we therefore take a more
careful approach toward such low-rank decomposition.
An equivalent representation of the tensor decomposition (2) is the vector form $x \triangleq \mathrm{vec}(\Phi) = (U_1 \otimes U_2 \otimes U_3 \otimes U_4)\, g$, where $\mathrm{vec}(\cdot)$ stacks all tensor elements column-wise, $g = \mathrm{vec}(G)$, $\otimes$ is the Kronecker product, and $U_i$ is a factor matrix containing all bases
$u_{\ell_i}$ stacked in column form. Since we are interested in the input and output channels of CNNs for
decomposition, we use mode-3 and mode-4 vector expressions, yielding two matrices
$$\Phi_3 = (U_1 \otimes U_2 \otimes U_4)\, G_3 U_3^T \quad \text{and} \quad \Phi_4 = (U_1 \otimes U_2 \otimes U_3)\, G_4 U_4^T, \qquad (3)$$
where $\Phi_3 \in \mathbb{R}^{N_1 N_2 N_4 \times N_3}$, $\Phi_4 \in \mathbb{R}^{N_1 N_2 N_3 \times N_4}$, and $G_3$ and $G_4$ are likewise reshaped forms of
the core tensor $G$. The tensor decomposition (3) is the representation equivalent to (2), decomposed
at mode-3 and mode-4. Recall the matrix (two-way array) decomposition, e.g. SVD, such that
$\Phi_3 = U \Sigma V^T$, where $U \equiv U_1 \otimes U_2 \otimes U_4$, $V \equiv U_3$, and $\Sigma \equiv G_3$ [30]. In other words, to
decompose a tensor on a given mode, we first unfold the tensor (on the given mode) and then apply a
decomposition method of interest such as SVD.
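As an illustration of the mode-3 and mode-4 unfoldings in (3), the following NumPy sketch (ours, not part of the paper) reshapes a four-way kernel of shape (N1, N2, N3, N4) into Φ3 and Φ4; the exact row ordering within each unfolding does not affect the singular values used below. Note that frameworks such as PyTorch store conv weights as (N4, N3, N1, N2), so a permutation to the paper's convention may be needed first.

import numpy as np

def unfold_modes_3_4(phi):
    # phi: conv weight tensor of shape (N1, N2, N3, N4) as in Section 2.
    n1, n2, n3, n4 = phi.shape
    # mode-3 unfolding: rows enumerate (N1, N2, N4), columns enumerate N3
    phi3 = np.transpose(phi, (0, 1, 3, 2)).reshape(n1 * n2 * n4, n3)
    # mode-4 unfolding: rows enumerate (N1, N2, N3), columns enumerate N4
    phi4 = phi.reshape(n1 * n2 * n3, n4)
    return phi3, phi4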
The presence of noise, however, is still a barrier to a better understanding of the latter reshaped
forms. Similar to [16], we revise our goal into the low-rank matrix factorizations $\Phi_3 = \hat\Phi_3 + E_3$
and $\Phi_4 = \hat\Phi_4 + E_4$, where a global analytical solution is given by the Variational Bayesian Matrix
Factorization (VBMF) technique in [21] as a re-weighted SVD of the observation matrix. This
method avoids the need to implement an iterative algorithm.
Using the above decomposition framework, we introduce the following two definitions.
Definition 1. (Knowledge Gain). For convolutional weights in deep CNNs, define the knowledge
gain across a particular channel (i.e. the d-th dimension) as
$$G_{d,p}(\Phi) = \frac{1}{N_d \cdot \sigma_1^p(\hat\Phi_d)} \sum_{i=1}^{N_d'} \sigma_i^p(\hat\Phi_d), \qquad (4)$$
where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{N_d'}$ are the low-rank singular values of a single-channel convolutional
weight in descending order, $N_d' = \mathrm{rank}\{\hat\Phi_d\}$, $d$ stands for the dimension index, and $p \in \{1, 2\}$.

The notion of knowledge gain on the input tensor $\Phi$ in (4) is in fact a direct measure of the norm
energy of the factorized matrix.
Remark 1. Recall for $p = 2$ that the summation of squared singular values from Definition 1 is
equivalent to the Frobenius norm, i.e. $\sum_{i=1}^{N_d'} \sigma_i^2(\hat\Phi_d) = \mathrm{Tr}\{\hat\Phi_d^T \hat\Phi_d\} = ||\hat\Phi_d||_F^2$ [11]. Also, for $p = 1$
the summation of singular values is bounded by $||\hat\Phi_d||_F \leq \sum_{i=1}^{N_d'} \sigma_i(\hat\Phi_d) \leq \sqrt{N_d'}\, ||\hat\Phi_d||_F$.
These energies indicate a distance from matrix separability obtained by the low-rank factorization
(similar to the index of inseparability in neurophysiology [5]). In other words,
they measure the span of the space obtained by the low-rank structure. The division factor $N_d$ in (4) also
normalizes the gain $G_p \in [0, 1]$ as a fraction of channel capacity. In this study we are mainly interested
in the third- and fourth-dimension measures (i.e. $d \in \{3, 4\}$).
Definition 2. (Mapping Condition). For convolutional weights in deep CNNs, define the mapping
condition across a particular channel (i.e. the d-th dimension) as
$$\kappa_d(\Phi) = \sigma_1(\hat\Phi_d) / \sigma_{N_d'}(\hat\Phi_d), \qquad (5)$$
where $\sigma_1$ and $\sigma_{N_d'}$ are the maximum and minimum low-rank singular values of a single-channel
convolutional weight, respectively.
Recall the matrix-vector calculation that maps an input vector into an output vector, where
numerical stability is characterized by the matrix condition number, i.e. the ratio of maximum
to minimum singular values [11]. The convolution operations in CNNs follow a similar concept
by mapping input feature maps into output features. Accordingly, the mapping condition of a
convolutional layer in a CNN is defined by (5) as a direct measurement of the condition number of the
low-rank factorized matrices, indicating the well-posedness of the convolution operation.
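A minimal sketch of how the metrics in (4) and (5) could be computed from an unfolded weight matrix is given below; it uses a plain SVD with a simple energy threshold as a stand-in for the EVBMF step of [21], and the threshold value and helper names are our own assumptions rather than the paper's implementation.

import numpy as np

def low_rank_spectrum(mat, energy=0.99):
    # Stand-in for EVBMF [21]: keep the smallest leading set of singular
    # values capturing `energy` of the total spectral energy (assumption).
    s = np.linalg.svd(mat, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    rank = int(np.searchsorted(cum, energy)) + 1
    return s[:rank]

def knowledge_gain(mat, p=1):
    # Eq. (4): normalized sum of low-rank singular values; the channel
    # dimension N_d is the column count of the unfolding (non-zero weights assumed).
    s = low_rank_spectrum(mat)
    n_d = mat.shape[1]
    return float(np.sum(s**p) / (n_d * s[0]**p))

def mapping_condition(mat):
    # Eq. (5): ratio of largest to smallest retained singular value.
    s = low_rank_spectrum(mat)
    return float(s[0] / s[-1])

For a conv block, the block-level scores are then averaged over the two unfoldings, e.g. G = (knowledge_gain(phi3) + knowledge_gain(phi4)) / 2, as done later in Algorithm 1.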

3 Adapting Stochastic Gradient Descent with Knowledge Gain


As an optimization method in deep learning, SGD typically attempts to minimize the loss functions
of large networks [3, 4, 17, 8]. Consider the updates on the convolutional weights Φk using this
optimization
$$\Phi_k \longleftarrow \Phi_{k-1} - \eta_k \nabla \tilde f_k(\Phi_{k-1}) \quad \text{for} \quad k \in \{(t-1)K + 1, \cdots, tK\}, \qquad (6)$$
where $t$ and $K$ correspond to the epoch index and the number of mini-batches, respectively, $\nabla \tilde f_k(\Phi_{k-1}) = \frac{1}{|\Omega_k|} \sum_{i \in \Omega_k} \nabla f_i(\Phi_{k-1})$ is the average stochastic gradient on the $k$th mini-batch, randomly
selected from a batch of $n$ samples $\Omega_k \subset \{1, \cdots, n\}$, and $\eta_k$ defines the step-size taken in the
opposite direction of the average gradient. The selection of the step-size $\eta_k$ can be either adaptive with
respect to statistical measures of the gradients [7, 40, 36, 13, 6, 25, 20, 37, 18] or subject
to change under different scheduled learning regimes [19, 32, 33].
In the scheduled learning rate method, the step-size is usually fixed for every $t$th epoch (i.e. for all
$K$ mini-batch updates) and changes according to the schedule assignment for the next epoch (i.e.
$\eta_k \equiv \eta(t)$). We set up our problem by accumulating all observed gradients throughout the $K$ mini-batch
updates within the $t$th epoch:
$$\Phi_{k_b} = \Phi_{k_a} - \eta(t) \sum_{k=k_a}^{k_b} \nabla \tilde f_k(\Phi_k), \qquad (7)$$
where $k_a = (t-1)K + 1$ and $k_b = tK$. Note that the significance of the updates in (7) from the $k_a$th to the $k_b$th
iteration is controlled by the step-size $\eta(t)$, which directly impacts the rate of knowledge gain.
Here we provide satisfying conditions on the step-size for increasing the knowledge gain in SGD.
Theorem 1. (Increasing Knowledge Gain for SGD). Using the knowledge gain from Definition 1
and setting the step-size of Stochastic Gradient Descent (SGD) proportionate to
$$\eta = \zeta \left[ G(\Phi_{k_b}) - G(\Phi_{k_a}) \right] \qquad (8)$$
will guarantee the monotonic increase of the knowledge gain, i.e. $G(\Phi_{k_b}) \geq G(\Phi_{k_a})$, for some
existing lower bound $\eta \geq \eta_0$ and $\zeta \geq 0$.

The proof of Theorem 1 is provided in Appendix-A. The step-size in (8) is proportional to the
change in knowledge gain through the SGD updating scheme, where its value is updated every epoch.
Therefore, the computational overhead over vanilla SGD is limited to calculating the knowledge
gain for each convolutional layer of the CNN once per epoch. This overhead is minimal due to the
analytical solution provided by the low-rank factorization method (EVBMF) in [21].

4 AdaS Algorithm

We formulate the update rule for AdaS using SGD with Momentum as follows

η(t, `) ← β · η(t − 1, `) + ζ · [Ḡ(t, `) − Ḡ(t − 1, `)] (9)


v`k ← α · v`k−1 − η(t, `) · g`k (10)
θ`k ← θ`k−1 + v`k (11)

where k is the current mini-batch, t is the current epoch iteration, ` is the conv block index, Ḡ(·) is
the average knowledge gain obtained from both mode-3 and mode-4 decompositions, v is the velocity
term, and θ are the learnable parameters.
The pseudo-code for our proposed algorithm AdaS is presented in Algorithm 1. Each convolution
block in the CNN is assigned an index ℓ ∈ {1, ..., L}, and all learnable parameters (e.g. conv weights, biases, batch-
norms, etc.) are referenced through this index. The goal in AdaS is first to retrieve the convolutional weights
within each block, second to apply low-rank matrix factorization to the unfolded tensors, and finally
to approximate the overall knowledge gain G(t, ℓ) and mapping condition κ(t, ℓ). The approximation is done
once every epoch and introduces minimal computational overhead over the rest of the optimization
framework. The learning rate is computed relative to the rate of change in knowledge gain over two
consecutive epoch updates (from previous to current). The learning rate η(t, ℓ) is then further updated
by an exponential moving average, controlled by the gain-factor hyper-parameter β, to accumulate the
history of knowledge gain over the sequence of epochs. In effect, β controls the trade-off between
convergence speed and training accuracy of AdaS. An ablative study on the effect of this parameter is
provided in Appendix-B. The computed step-sizes for all conv blocks are then passed to the
SGD optimization framework for adaptation. Note that the same step-size is used within each block
for all learnable parameters. Code is available at https://github.com/mahdihosseini/AdaS.

Algorithm 1: Adaptive Scheduling (AdaS) for SGD with Momentum

Require: batch size n, # of epochs T, # of conv blocks L, initial step-sizes {η(0, ℓ)} for ℓ = 1..L, initial
momentum vectors {v_ℓ^0} for ℓ = 1..L, initial parameter vectors {θ_ℓ^0} for ℓ = 1..L, SGD momentum
rate α ∈ [0, 1), AdaS gain factor β ∈ [0, 1), knowledge gain hyper-parameter ζ = 1,
minimum learning rate η_min > 0
for t = 1 : T do
    for ℓ = 1 : L do
        1. unfold tensors using (3): Φ_3 ← mode-3(Φ_ℓ^t) and Φ_4 ← mode-4(Φ_ℓ^t)
        2. apply low-rank factorization [21]: Φ̂_3 ← EVBMF(Φ_3) and Φ̂_4 ← EVBMF(Φ_4)
        3. compute average knowledge gain using (4): G(t, ℓ) ← [G_{3,1}(Φ) + G_{4,1}(Φ)] / 2
        4. compute average mapping condition using (5): κ(t, ℓ) ← [κ_3(Φ) + κ_4(Φ)] / 2
        5. compute step momentum: η(t, ℓ) ← β · η(t−1, ℓ) + ζ · [G(t, ℓ) − G(t−1, ℓ)]
        6. lower-bound the learning rate: η(t, ℓ) ← max(η(t, ℓ), η_min)
    end
    randomly shuffle the dataset and generate K mini-batches {Ω_k ⊂ {1, ..., n}} for k = 1..K
    for k = (t−1)K + 1 : tK do
        1. compute gradient: g_ℓ^k ← (1/|Ω_k|) Σ_{i ∈ Ω_k} ∇_Φ f((x^(i), y^(i)); Φ_ℓ^{k−1}), ℓ ∈ {1, ..., L}
        2. compute the velocity term: v_ℓ^k ← α · v_ℓ^{k−1} − η(t, ℓ) · g_ℓ^k
        3. apply update: θ_ℓ^k ← θ_ℓ^{k−1} + v_ℓ^k
    end
end
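The per-epoch learning-rate computation of Algorithm 1 (steps 1-6, for one conv block) can be sketched in Python as follows, reusing the unfolding and metric helpers sketched in Section 2; the default values and the SVD stand-in for EVBMF are assumptions, and the authors' reference implementation lives in the linked repository.

def adas_lr_update(prev_lr, prev_gain, conv_weight, beta=0.9, zeta=1.0, lr_min=1e-4):
    # One AdaS epoch step for a single conv block (Eq. 9 / Algorithm 1).
    # conv_weight: the block's 4-D kernel arranged as (N1, N2, N3, N4).
    phi3, phi4 = unfold_modes_3_4(conv_weight)          # step 1
    gain = 0.5 * (knowledge_gain(phi3, p=1)             # steps 2-3, with the
                  + knowledge_gain(phi4, p=1))          #   SVD stand-in for EVBMF
    lr = beta * prev_lr + zeta * (gain - prev_gain)     # step 5
    return max(lr, lr_min), gain                        # step 6

The mini-batch loop of Algorithm 1 then applies the usual SGD-with-momentum update (10)-(11), using this per-block learning rate in place of a single global step-size.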

5 Experiments
We compare our AdaS algorithm to several adaptive and non-adaptive optimizers in the context of
image classification. In particular, we implement AdaS with SGD with momentum; four adaptive
methods, i.e. AdaGrad [7], RMSProp [36], AdaM [13], and AdaBound [20]; and two non-adaptive
momentum SGDs guided by scheduled learning techniques, i.e. OneCycleLR (also known as the
super-convergence method) [33] and SGD with StepLR (step decaying) [8]. We further investigate
the dependence of CNN training quality on knowledge gain and mapping condition and provide
useful insights into the suitability of different optimizers for training different models. For details on
ablative studies and a complete set of experiments, please refer to Appendix-B.

5.1 Experimental Setup

We investigate the efficacy of AdaS with respect to variations in the number of deep layers using
VGG16 [31] and ResNet34 [10] and the number of classes using the standard CIFAR-10 and CIFAR-
100 datasets [15] for training. The details of pre-processing steps, network implementation and
training/testing frameworks are adopted from the CIFAR GitHub repository1 using PyTorch. We set
the initial learning rates of AdaGrad, RMSProp and AdaBound to η0 = {1e-2, 3e-4, 1e-3} per their
suggested default values. We further followed the suggested tuning in [38] for AdaM (η0 = 3e-4 for
VGG16 and η0 = 1e-3 for ResNet34) and SGD-StepLR (η0 = 1e-1 dropping half magnitude every
25 epochs) to achieve the best performance. For SGD-1CycleLR we set 50 epochs for the whole
cycle and found the best configuration (η0 = 3e-2 for VGG16 and η0 = 4e-2 for ResNet34). To
configure the best initial learning rate for AdaS, we performed a dense grid search and found the
values for VGG16 and ResNet34 to be η0 = {5e-3, 3e-2}. Despite the differences in optimal values
that are independently obtained for each network, the optimizer performance is fairly robust relative
to changes in these values. Each model is trained for 250 epochs in 5 independent runs and average
test accuracy and training losses are reported. The mini-batch size is also set to |Ωk | = 128.
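For concreteness, the following PyTorch sketch shows one way the baseline optimizers above could be instantiated with the quoted initial learning rates; the momentum value of 0.9, the reading of "half magnitude every 25 epochs" as gamma = 0.5, and the use of η0 as the OneCycleLR max_lr are our assumptions, and AdaS itself (as well as AdaBound) should be taken from its authors' repository rather than re-implemented from this sketch.

import torch

def build_baseline(name, params, steps_per_epoch, arch="ResNet34"):
    # Baseline optimizers with the initial learning rates quoted above.
    if name == "AdaGrad":
        return torch.optim.Adagrad(params, lr=1e-2), None
    if name == "RMSProp":
        return torch.optim.RMSprop(params, lr=3e-4), None
    if name == "AdaM":
        return torch.optim.Adam(params, lr=3e-4 if arch == "VGG16" else 1e-3), None
    if name == "SGD-StepLR":
        opt = torch.optim.SGD(params, lr=1e-1, momentum=0.9)
        # "half magnitude every 25 epochs" read here as gamma = 0.5 (assumption)
        sched = torch.optim.lr_scheduler.StepLR(opt, step_size=25, gamma=0.5)
        return opt, sched
    if name == "SGD-1CycleLR":
        lr0 = 3e-2 if arch == "VGG16" else 4e-2
        opt = torch.optim.SGD(params, lr=lr0, momentum=0.9)
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=lr0, epochs=50, steps_per_epoch=steps_per_epoch)
        return opt, sched
    raise ValueError(name)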

Figure 1: Training performance using different optimizers across two different datasets (i.e. CIFAR10
and CIFAR100) and two different CNNs (i.e. VGG16 and ResNet34). Panels (a)-(d) show test accuracy
and panels (e)-(h) show training loss for VGG16/ResNet34 on CIFAR10/CIFAR100.

5.2 Image Classification Problem

We first empirically evaluate the effect of the gain-factor β on AdaS convergence by defining eight
different grid values (i.e. β ∈ {0.8, 0.825, · · · , 0.975}). The trade-off between the selection of
different values of β is demonstrated in Figure 1 (complete ablation study is provided in Appendix-B).
1
https://github.com/kuangliu/pytorch-cifar

Here, lower β translates to faster convergence, whereas higher values yield better final
performance, at the cost of requiring more training epochs. The performance comparison of
optimizers is also overlaid in the same figure, where AdaS (with lower β) surpasses all adaptive and
non-adaptive methods by a large margin in both test accuracy and training loss during the initial
stages of training (i.e. epoch < 50), whereas SGD-StepLR and AdaS (with higher β) eventually
overtake the other methods with more training epochs. Furthermore, AdaGrad, RMSProp, AdaM, and
AdaBound all achieve similar or sometimes even lower training losses compared to AdaS (and to
the two non-adaptive methods), but attain lower test accuracies. Similar observations were also
reported in [38], where adaptive optimizers generalize worse than non-adaptive methods.
In retrospect, we claim here that AdaS resolves this issue by generalizing better than other adaptive
optimizers.
We further provide quantitative results on the convergence of all optimizers trained on ResNet34 in
Table 1 with a fixed number of training epochs. The rank consistency of AdaS (using two different
gain factors of low β = 0.85 and high β = 0.95 values) over other optimizers is evident. For instance,
AdaSβ=0.850 gains 3.63% test accuracy (with half confidence interval) over the second best optimizer
AdaM on CIFAR-100 trained with 25 epochs.

Table 1: Image classification performance (test accuracy) of ResNet34 on CIFAR-10 and CIFAR-100
with fixed budget epochs. Four adaptive (AdaGrad, RMSProp, AdaM, AdaS) and one non-adaptive
(SGD-StepLR) optimizers are deployed for comparison.

CIFAR-10
Epoch   AdaGrad           RMSProp           AdaM              SGD-StepLR        AdaS (β=0.850)    AdaS (β=0.950)
25      0.8859 ± 0.47%    0.8916 ± 0.71%    0.8957 ± 0.73%    0.8325 ± 2.79%    0.9136 ± 0.23%    0.8611 ± 1.67%
50      0.9017 ± 0.61%    0.9086 ± 0.58%    0.9154 ± 0.35%    0.8653 ± 1.67%    0.9370 ± 0.13%    0.9209 ± 0.52%
75      0.9103 ± 0.18%    0.9139 ± 0.78%    0.9211 ± 0.26%    0.9067 ± 0.38%    0.9372 ± 0.14%    0.9472 ± 0.20%
100     0.9109 ± 0.25%    0.9159 ± 0.77%    0.9271 ± 0.27%    0.9225 ± 0.29%    0.9372 ± 0.12%    0.9510 ± 0.18%
150     0.9134 ± 0.37%    0.9269 ± 0.33%    0.9307 ± 0.33%    0.9464 ± 0.09%    0.9370 ± 0.13%    0.9510 ± 0.11%
200     0.9140 ± 0.15%    0.9287 ± 0.30%    0.9317 ± 0.30%    0.9544 ± 0.10%    0.9368 ± 0.11%    0.9508 ± 0.15%
250     0.9149 ± 0.24%    0.9290 ± 0.29%    0.9322 ± 0.36%    0.9543 ± 0.08%    0.9370 ± 0.17%    0.9516 ± 0.12%

CIFAR-100
Epoch   AdaGrad           RMSProp           AdaM              SGD-StepLR        AdaS (β=0.850)    AdaS (β=0.950)
25      0.6221 ± 0.70%    0.6341 ± 1.14%    0.6653 ± 0.46%    0.5545 ± 1.45%    0.7016 ± 0.27%    0.5981 ± 1.55%
50      0.6515 ± 0.27%    0.6769 ± 0.62%    0.6866 ± 0.46%    0.6217 ± 1.68%    0.7479 ± 0.23%    0.7123 ± 0.51%
75      0.6618 ± 0.38%    0.6837 ± 0.50%    0.6975 ± 0.49%    0.6611 ± 1.79%    0.7491 ± 0.26%    0.7714 ± 0.30%
100     0.6658 ± 0.38%    0.6928 ± 0.29%    0.6978 ± 0.27%    0.6878 ± 0.97%    0.7494 ± 0.26%    0.7752 ± 0.31%
150     0.6691 ± 0.31%    0.6996 ± 0.48%    0.7045 ± 0.42%    0.7740 ± 0.46%    0.7481 ± 0.22%    0.7768 ± 0.32%
200     0.6697 ± 0.25%    0.7039 ± 0.50%    0.7061 ± 0.33%    0.7763 ± 0.42%    0.7484 ± 0.29%    0.7765 ± 0.22%
250     0.6702 ± 0.23%    0.7025 ± 0.29%    0.7111 ± 0.37%    0.7765 ± 0.32%    0.7483 ± 0.21%    0.7760 ± 0.22%

Figure 2: Evolution of knowledge gain versus mapping condition across different training epochs
using ResNet34 on CIFAR10, for (a) AdaGrad, (b) RMSProp, (c) AdaM, (d) AdaBound, (e) SGD-StepLR,
(f) SGD-1CycleLR, and (g)-(l) AdaS with β ∈ {0.800, 0.850, 0.900, 0.925, 0.950, 0.975}. The transition
of color shades corresponds to different convolution blocks. The transparency of the scatter plots
corresponds to epoch convergence: higher transparency is inversely related to the training epoch. For
complete results for different optimizers and models, please refer to Appendix-B.

5.3 Dependence of the Quality of Network Training on Knowledge Gain

Both the knowledge gain G and the mapping condition κ can be used to probe an intermediate layer
of a CNN and quantify the quality of training with respect to different parameter settings. Such
quantification does not require test or evaluation data, so one can directly measure the "expected
performance" of the network throughout the training updates. Our first observation links the
knowledge gain measure to the relative success of each method in test accuracy. For instance, by
raising the gain factor β in AdaS, the deeper layers of the CNN eventually gain further knowledge,
as shown in Figure 2. This directly affects the success in test performance. Also, deploying different
optimizers yields different behavior in the knowledge gain obtained in different layers of the CNN.
Table 2 lists all four numerical measurements of Test Accuracy, Training Loss, Knowledge Gain, and
Mapping Condition for different optimizers. Note the rank-order correlation between knowledge gain
and test accuracy. Although the knowledge gains for RMSProp and AdaM are high, their mapping
conditions are also high, which deteriorates the overall performance of the network.

Table 2: Performance of ResNet34 on the CIFAR10 dataset reported at the final training epoch for each
optimizer. Average scores are reported for G and κ across all convolutional blocks.

Metric                  AdaGrad   RMSProp   AdaM      AdaBound  SGD-1CycleLR  SGD-StepLR  AdaS (β=0.80)  AdaS (β=0.85)  AdaS (β=0.90)  AdaS (β=0.95)
Test Accuracy           0.9150    0.9290    0.9322    0.9274    0.9413        0.9543      0.9302         0.9370         0.9446         0.9516
Training Loss           1.25e-3   6.35e-3   5.66e-3   2.30e-3   4.90e-3       9.70e-4     6.10e-3        2.33e-3        1.27e-3        8.69e-4
Knowledge Gain (G)      0.1790    0.3011    0.2973    0.1288    0.2965        0.3180      0.2063         0.2467         0.2791         0.2938
Mapping Condition (κ)   7.482     18.124    18.484    4.833     7.478         10.086      5.524          6.661          7.839          8.783

Our second observation is made by studying the effect of the mapping condition and how it relates to
the possible lack of generalizability of each optimizer. Although adaptive optimizers (e.g. RMSProp
and AdaM) yield lower training loss, they over-fit to perturbing features (mainly caused by incomplete
second-order statistical measures, e.g. the diagonal Hessian approximation) and accordingly hamper
generalization [38]. We suspect this unwanted phenomenon is related to the mapping condition within
CNN layers. In fact, a mixture of both the average κ and the average G can help to better realize how well
each optimizer generalizes from training to testing evaluations.
We conclude by identifying that an ideal optimizer would yield G → 1 and κ → 1 across all
layers within a CNN. We highlight that increases in κ correlate with greater disentanglement between
intermediate input and output layers, hampering the flow of information. Further, we identify that
increases in knowledge gain strengthen the carriage of information through the network, which enables
greater performance.

6 Conclusion
We have introduced a new adaptive method called AdaS to address the combined goals of fast convergence
and high-precision performance of SGD in deep neural networks, all in a unified optimization
framework. The method applies a low-rank approximation framework to each convolution layer,
identifies how much knowledge is gained through the progression of epoch training, and adapts the
SGD learning rate proportionally to the rate of change in knowledge gain. AdaS adds minimal
computational overhead to the regular SGD algorithm and accordingly provides a well-generalized
framework to trade off between convergence speed and final performance. Furthermore, AdaS
provides an optimization framework in which validation data is no longer required and the
stopping criterion for training can be obtained directly from the training loss. Empirical evaluations
reveal the possible existence of a lower bound on the SGD step-size that can monotonically increase
the knowledge gain independently for each network convolution layer and accordingly improve the
overall performance. AdaS is capable of significant improvements in generalization over traditional
adaptive methods (i.e. AdaM) while maintaining their rapid convergence characteristics. We highlight
that these improvements come through the application of AdaS to simple SGD with momentum.
We further identify that since AdaS adaptively tunes the learning rates η(t, ℓ) independently for all
convolutional blocks, it can be deployed with adaptive methods such as AdaM, replacing the traditional
scheduling techniques. We postulate that such deployments of AdaS with adaptive gradient
updates could introduce greater robustness to the choice of initial learning rate, and we leave this
exploration as future work. Finally, we emphasize that, without loss of generality, AdaS can be deployed
on fully-connected networks, where the weight matrices can be directly fed into the low-rank
factorization for metric evaluations.

Broader Impact

The content of this research is of broad interest to both researchers and practitioners of computer
science and engineering for training deep learning models. The method proposed in this paper
introduces a new optimization tool that can be adopted for training a variety of models such as
Convolutional Neural Networks (CNNs). The proposed optimizer generalizes well, offering both fast
convergence and superior performance compared to existing off-the-shelf optimizers. The method
further introduces a new concept that measures how well a CNN model is trained by probing different
layers of the network to obtain a quality measure of training. This metric can be of broad interest to
computer scientists and engineers for developing efficient models tailored to specific applications and
datasets.

References
[1] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online
learning rate adaptation with hypergradient descent. In International Conference on Learning Representa-
tions, 2018.
[2] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural
networks: Tricks of the trade, pages 437–478. Springer, 2012.
[3] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMP-
STAT’2010, pages 177–186. Springer, 2010.
[4] Léon Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436.
Springer, 2012.
[5] Didier A Depireux, Jonathan Z Simon, David J Klein, and Shihab A Shamma. Spectro-temporal response
field characterization with dynamic ripples in ferret primary auditory cortex. Journal of neurophysiology,
85(3):1220–1234, 2001.
[6] Timothy Dozat. Incorporating nesterov momentum into adam. 2016.
[7] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, Cambridge, MA, USA,
2016. http://www.deeplearningbook.org.
[9] Lars Grasedyck, Daniel Kressner, and Christine Tobler. A literature survey of low-rank tensor approxima-
tion techniques. GAMM-Mitteilungen, 36(1):53–78, 2013.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[11] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
[12] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of
deep convolutional neural networks for fast and low power mobile applications. In 4th International
Conference on Learning Representations, ICLR 2016.
[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[14] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500,
2009.
[15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[16] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up
convolutional neural networks using fine-tuned cp-decomposition. In 3rd International Conference
on Learning Representations, ICLR 2015.
[17] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[18] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei
Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning
Representations, 2020.
[19] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In 5th
International Conference on Learning Representations, ICLR 2017.
[20] Liangchen Luo, Yuanhao Xiong, and Yan Liu. Adaptive gradient methods with dynamic bound of learning
rate. In International Conference on Learning Representations, 2019.
[21] Shinichi Nakajima, Masashi Sugiyama, S Derin Babacan, and Ryota Tomioka. Global analytic solution
of fully-observed variational bayesian matrix factorization. Journal of Machine Learning Research,
14(Jan):1–37, 2013.

[22] Francesco Orabona and Tatiana Tommasi. Training deep networks without learning rates through coin
betting. In Advances in Neural Information Processing Systems, pages 2160–2170, 2017.
[23] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317,
2011.
[24] Boris Polyak. Some methods of speeding up the convergence of iteration methods. Ussr Computational
Mathematics and Mathematical Physics, 4:1–17, 12 1964.
[25] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International
Conference on Learning Representations, 2018.
[26] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical
statistics, pages 400–407, 1951.
[27] Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning. In
Advances in Neural Information Processing Systems, pages 6433–6443, 2018.
[28] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In International Conference on
Machine Learning, pages 343–351, 2013.
[29] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average
gradient. Mathematical Programming, 162(1-2):83–112, 2017.
[30] Nicholas D Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E Papalexakis, and
Christos Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions
on Signal Processing, 65(13):3551–3582, 2017.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In
International Conference on Learning Representations, 2015.
[32] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on
Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.
[33] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large
learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications,
volume 11006, page 1100612. International Society for Optics and Photonics, 2019.
[34] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and
momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
[35] Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, and E. Weinan. Convolutional neural networks with
low-rank regularization. In 4th International Conference on Learning Representations, ICLR 2016.
[36] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of
its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
[37] Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien.
Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural
Information Processing Systems, pages 3727–3740, 2019.
[38] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value
of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems,
pages 4148–4158, 2017.
[39] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank
and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 7370–7379, 2017.
[40] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Appendix-A: Proof of Theorems

The two proofs below correspond to the proof of Theorem 1 for p = 2 and p = 1, respectively.

Proof. (Theorem 1, p = 2) For simplicity of notation, we use the replacements $A = \Phi_{k_a}$,
$B = \sum_{k=k_a}^{k_b} \nabla \tilde f_k(\Phi_k)$, and $C = \Phi_{k_b}$, so that the SGD update in (7) becomes $C = A - \eta B$. Using
Definition 1, the knowledge gain of the matrix $C$ (assumed to be a column matrix, $N \leq M$) is
expressed by
$$G(C) = \frac{1}{N \sigma_1^2(C)} \sum_{i=1}^{N'} \sigma_i^2(C) = \frac{1}{N ||C||_2^2} \mathrm{Tr}\{C^T C\}. \qquad (12)$$
An upper bound on the first singular value can be calculated by first recalling its equivalence to the $\ell_2$-norm
and then applying the triangle inequality:
$$\sigma_1^2(C) = ||C||_2^2 = ||A - \eta B||_2^2 \leq ||A||_2^2 + \eta^2 ||B||_2^2 + 2\eta ||A||_2 ||B||_2. \qquad (13)$$
By substituting (13) into (12) and expanding the terms in the trace, a lower bound on $G(C)$ is given by
$$G(C) \geq \frac{1}{N \gamma} \left[ \mathrm{Tr}\{A^T A\} - 2\eta\, \mathrm{Tr}\{A^T B\} + \eta^2\, \mathrm{Tr}\{B^T B\} \right], \qquad (14)$$
where $\gamma = ||A||_2^2 + \eta^2 ||B||_2^2 + 2\eta ||A||_2 ||B||_2$. The latter inequality can be revised to
$$G(C) \geq \frac{1}{N \gamma} \left[ \frac{\gamma}{||A||_2^2} \mathrm{Tr}\{A^T A\} + \left(1 - \frac{\gamma}{||A||_2^2}\right) \mathrm{Tr}\{A^T A\} - 2\eta\, \mathrm{Tr}\{A^T B\} + \eta^2\, \mathrm{Tr}\{B^T B\} \right] = G(A) + \frac{1}{N \gamma} \underbrace{\left[ \left(1 - \frac{\gamma}{||A||_2^2}\right) \mathrm{Tr}\{A^T A\} - 2\eta\, \mathrm{Tr}\{A^T B\} + \eta^2\, \mathrm{Tr}\{B^T B\} \right]}_{D}. \qquad (15)$$
Therefore, the bound in (15) is revised to
$$G(C) - G(A) \geq \frac{1}{N \gamma} D. \qquad (16)$$
The monotonicity of the knowledge gain in (16) is guaranteed if $D \geq 0$. The remaining term $D$ can
be expressed as a quadratic function of $\eta$:
$$D(\eta) = \left[ \mathrm{Tr}\{B^T B\} - \frac{||B||_2^2}{||A||_2^2} \mathrm{Tr}\{A^T A\} \right] \eta^2 - \left[ 2\, \mathrm{Tr}\{A^T B\} + 2 \frac{||B||_2}{||A||_2} \mathrm{Tr}\{A^T A\} \right] \eta, \qquad (17)$$
where the condition for $D(\eta) \geq 0$ in (17) is
$$\eta \geq \max\left\{ 2\, \frac{\mathrm{Tr}\{A^T B\} + \frac{||B||_2}{||A||_2} \mathrm{Tr}\{A^T A\}}{\mathrm{Tr}\{B^T B\} - \frac{||B||_2^2}{||A||_2^2} \mathrm{Tr}\{A^T A\}},\; 0 \right\}. \qquad (18)$$
Hence, the lower bound in (18) guarantees the monotonicity of the knowledge gain
through the update scheme $C = A - \eta B$.
Our final inspection is to check whether substituting the step-size (8) into (16) still satisfies the inequality.
Following the substitution, the inequality should satisfy
$$\eta \geq \zeta\, \frac{1}{N \gamma} D. \qquad (19)$$
We have shown that $D(\eta) \geq 0$ for the lower bound in (18), so the inequality in (19) also holds
for some $\zeta \geq 0$, and the proof is done.
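As a numerical sanity check (ours, not part of the original paper), the following NumPy snippet draws random matrices A and B, computes the lower bound (18), and verifies that any step-size above it yields G(C) ≥ G(A) for p = 2; all names are illustrative.

import numpy as np

def gain_p2(mat):
    # Knowledge gain for p = 2: Tr(M^T M) / (N * sigma_1(M)^2), with N taken
    # as the column count (low-rank truncation omitted for this check).
    s = np.linalg.svd(mat, compute_uv=False)
    return np.sum(s**2) / (mat.shape[1] * s[0]**2)

rng = np.random.default_rng(0)
checked = 0
while checked < 100:
    A = rng.standard_normal((64, 16))   # stand-in for Phi_{k_a}
    B = rng.standard_normal((64, 16))   # stand-in for the accumulated gradients
    trAA, trBB, trAB = np.sum(A * A), np.sum(B * B), np.sum(A * B)
    nA, nB = np.linalg.norm(A, 2), np.linalg.norm(B, 2)
    a = trBB - (nB**2 / nA**2) * trAA   # leading coefficient of D(eta) in (17)
    if a <= 0:
        continue                        # bound (18) assumes a positive leading term
    eta0 = max(2 * (trAB + (nB / nA) * trAA) / a, 0.0)
    eta = 1.01 * eta0 + 1e-6            # any step-size above the lower bound (18)
    C = A - eta * B
    assert gain_p2(C) >= gain_p2(A) - 1e-12
    checked += 1
print("Theorem 1 (p = 2) bound verified on", checked, "random instances")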

Proof. (Theorem 1, p = 1) Following notation similar to the previous proof, the knowledge gain of the matrix $C$ is
expressed by
$$G(C) = \frac{1}{N \sigma_1(C)} \sum_{i=1}^{N'} \sigma_i(C). \qquad (20)$$
By stacking all singular values in vector form (and recalling the $\ell_1$ and $\ell_2$ norm inequality),
$$\left( \sum_{i=1}^{N'} \sigma_i(C) \right)^2 = ||\sigma(C)||_1^2 \geq ||\sigma(C)||_2^2 = \sum_{i=1}^{N'} \sigma_i^2(C) = \mathrm{Tr}\{C^T C\},$$
and by substituting the matrix composition $C = A - \eta B$, the following inequality holds:
$$\left( \sum_{i=1}^{N'} \sigma_i(C) \right)^2 \geq \mathrm{Tr}\{A^T A\} - 2\eta\, \mathrm{Tr}\{A^T B\} + \eta^2\, \mathrm{Tr}\{B^T B\}. \qquad (21)$$
An upper bound on the first singular value can be calculated by recalling its equivalence to the $\ell_2$-norm and
the triangle inequality:
$$\sigma_1^2(C) = ||C||_2^2 = ||A - \eta B||_2^2 \leq ||A||_2^2 + 2\eta ||A||_2 ||B||_2 + \eta^2 ||B||_2^2. \qquad (22)$$
By substituting the lower bound (21) and the upper bound (22) into (20), a lower bound on the knowledge
gain is given by
$$G^2(C) \geq \frac{1}{N^2 \gamma} \left[ \mathrm{Tr}\{A^T A\} - 2\eta\, \mathrm{Tr}\{A^T B\} + \eta^2\, \mathrm{Tr}\{B^T B\} \right],$$
where $\gamma = ||A||_2^2 + 2\eta ||A||_2 ||B||_2 + \eta^2 ||B||_2^2$. The latter inequality can be revised to
$$G^2(C) \geq \frac{1}{N^2 \gamma} \left[ \frac{N'' \gamma}{||A||_2^2} \mathrm{Tr}\{A^T A\} + \underbrace{\left(1 - \frac{N'' \gamma}{||A||_2^2}\right) \mathrm{Tr}\{A^T A\} - 2\eta\, \mathrm{Tr}\{A^T B\} + \eta^2\, \mathrm{Tr}\{B^T B\}}_{D} \right], \qquad (23)$$
where the lower bound of the first summand term is given by
$$\frac{N'' \gamma}{||A||_2^2} \mathrm{Tr}\{A^T A\} = \frac{N'' \gamma}{||A||_2^2} \sum_{i=1}^{N''} \sigma_i^2(A) = \frac{N'' \gamma}{\sigma_1^2(A)} ||\sigma(A)||_2^2 \geq \frac{\gamma}{\sigma_1^2(A)} ||\sigma(A)||_1^2 = \gamma N^2 G^2(A).$$
Therefore, the bound in (23) is revised to
$$G^2(C) \geq G^2(A) + \frac{1}{N^2 \gamma} D. \qquad (24)$$
Note that $\gamma \geq 0$ (the step-size $\eta \geq 0$ is always positive), and the only condition for the bound in (24) to
hold is $D \geq 0$. Here the remaining term $D$ can be expressed as a quadratic function of the step-size, i.e.
$D(\eta) = a\eta^2 + b\eta + c$, where
$$a = \mathrm{Tr}\{B^T B\} - N'' \frac{||B||_2^2}{||A||_2^2} \mathrm{Tr}\{A^T A\}, \qquad b = -2\, \mathrm{Tr}\{A^T B\} - N'' \frac{||B||_2}{||A||_2} \mathrm{Tr}\{A^T A\}, \qquad c = -(N'' - 1)\, \mathrm{Tr}\{A^T A\}.$$
The quadratic function can be factorized as $D(\eta) = a(\eta - \eta_1)(\eta - \eta_2)$, where the roots are $\eta_1 = (-b + \sqrt{\Delta})/(2a)$ and $\eta_2 = (-b - \sqrt{\Delta})/(2a)$, and $\Delta = b^2 - 4ac$. Here $c \leq 0$, and assuming $a \geq 0$, then $\Delta \geq 0$.
Accordingly, $\eta_1 \geq 0$ and $\eta_2 \leq 0$. For the function $D(\eta)$ to yield a positive value, both factors
should be either positive (i.e. $\eta - \eta_1 \geq 0$ and $\eta - \eta_2 \geq 0$) or negative (i.e. $\eta - \eta_1 \leq 0$ and
$\eta - \eta_2 \leq 0$). Here, only the positive conditions hold, which yields $\eta \geq \eta_1$. The assumption $a \geq 0$
is equivalent to $\sum_{i=1}^{N'''} \sigma_i^2(B)/\sigma_1^2(B) \geq N'' \sum_{i=1}^{N''} \sigma_i^2(A)/\sigma_1^2(A)$. This condition strongly holds
in the beginning epochs due to the random initialization of the weights, where the low-rank matrix $A$ is indeed
an empty matrix at epoch 0. As epoch training progresses, the condition loosens and might not
hold. Therefore, the monotonicity of the knowledge gain for $p = 1$ could be violated in the interim
process.

Appendix-B: AdaS Ablation Study
The ablative analysis of the AdaS optimizer is studied here with respect to different parameter settings.
Figure 3 demonstrates the AdaS performance with respect to different ranges of the gain-factor β. Figure 4
demonstrates the knowledge gain for different datasets and networks with respect to different gain-factor
settings over successive epochs. Similarly, Figure 5 demonstrates the rank gain (i.e. the ratio of
non-zero singular values of the low-rank structure to the channel size) over successive epochs.
Mapping conditions are shown in Figure 6, and Figure 7 demonstrates the learning rate approximation
through the AdaS algorithm over successive epochs of training. The evolution of knowledge gain versus mapping
condition is also shown in Figures 8 and 9.

Figure 3: Ablative study of the AdaS momentum rate β over two different datasets (i.e. CIFAR10 and
CIFAR100) and two CNNs (i.e. VGG16 and ResNet34). Top row, panels (a)-(d): test accuracies;
bottom row, panels (e)-(h): training losses, for CIFAR10/VGG16, CIFAR10/ResNet34, CIFAR100/VGG16,
and CIFAR100/ResNet34.

Figure 4: Ablative study of the AdaS momentum rate β versus knowledge gain G over two different
datasets (i.e. CIFAR10 and CIFAR100) and two CNNs (i.e. VGG16 and ResNet34), with one panel per
dataset/network combination and β ∈ {0.80, 0.85, 0.90, 0.95}. The transition in color shades from light
to dark lines corresponds to the first through the last convolution layers in each network.

Figure 5: Ablative study of the AdaS momentum rate β versus rank gain (i.e. the ratio of non-zero
singular values of the low-rank structure, rank{Φ̂}, to the channel size) over two different datasets
(i.e. CIFAR10 and CIFAR100) and two CNNs (i.e. VGG16 and ResNet34), with one panel per
dataset/network combination and β ∈ {0.80, 0.85, 0.90, 0.95}. The transition in color shades from light
to dark lines corresponds to the first through the last convolution layers in each network.

Figure 6: Ablative study of the AdaS momentum rate β versus mapping condition κ over two different
datasets (i.e. CIFAR10 and CIFAR100) and two CNNs (i.e. VGG16 and ResNet34), with one panel per
dataset/network combination and β ∈ {0.80, 0.85, 0.90, 0.95}. The transition in color shades from light
to dark lines corresponds to the first through the last convolution layers in each network.

Figure 7: Ablative study of the AdaS momentum rate β versus the approximated learning rate η(t, ℓ)
over two different datasets (i.e. CIFAR10 and CIFAR100) and two CNNs (i.e. VGG16 and ResNet34),
with one panel per dataset/network combination and β ∈ {0.80, 0.85, 0.90, 0.95}. The transition in color
shades from light to dark lines corresponds to the first through the last convolution layers in each network.

Figure 8: Evolution of knowledge gain versus mapping condition over epochs of training on the
CIFAR10 dataset for VGG16 (top rows) and ResNet34 (bottom rows), using AdaGrad, RMSProp, AdaM,
AdaBound, StepLR, OneCycleLR, and AdaS with β ∈ {0.80, 0.85, 0.90, 0.925, 0.95, 0.975}. The transition
of color shades corresponds to different convolution blocks. The transparency of the scatter plots
corresponds to the convergence in epochs: the higher the transparency, the faster the convergence.

Figure 9: Evolution of knowledge gain versus mapping condition over epochs of training on the
CIFAR100 dataset for VGG16 (top rows) and ResNet34 (bottom rows), using AdaGrad, RMSProp, AdaM,
AdaBound, StepLR, OneCycleLR, and AdaS with β ∈ {0.80, 0.85, 0.90, 0.925, 0.95, 0.975}. The transition
of color shades corresponds to different convolution blocks. The transparency of the scatter plots
corresponds to the convergence in epochs: the higher the transparency, the faster the convergence.

