1 Introduction
Stochastic Gradient Descent (SGD), a first-order optimization method Robbins and Monro (1951); Bottou (2010, 2012); Schmidt et al. (2017)
, has become the mainstream method for training over-parameterized models such as deep neural networks
LeCun et al. (2015); Goodfellow et al. (2016). Attempting to augment this method, SGD with momentum Polyak (1964); Sutskever et al. (2013) accumulates the historically aligned gradients, which helps in navigating past ravines and toward a more optimal solution. It eventually converges faster and exhibits better generalization compared to vanilla SGD. However, since the step-size (aka global learning rate) is typically fixed for momentum SGD, the method blindly follows past gradients and can eventually overshoot an optimum and cause oscillatory behavior. From a practical standpoint (e.g. in the context of deep neural networks Bengio (2012); Schaul et al. (2013); LeCun et al. (2015)), deploying a fixed global learning rate is even more concerning, as it often leads to poorer convergence, requires extensive tuning, and exhibits strong performance fluctuations over the selection range. A handful of methods have been introduced over the past decade to solve the latter issues based on adaptive gradient methods Duchi et al. (2011); Tieleman and Hinton (2012); Zeiler (2012); Kingma and Ba (2014); Dozat (2016); Reddi et al. (2018); Liu et al. (2020); Luo et al. (2019); Baydin et al. (2018); Rolinek and Martius (2018); Vaswani et al. (2019); Orabona and Tommasi (2017). These methods can be represented in the general form:
(1)  $x_{t+1} = x_t - \eta_t \odot m_t$, where $m_t = \phi_t(g_1, \ldots, g_t)$ and $\eta_t = \alpha / \sqrt{\psi_t(g_1, \ldots, g_t)}$

where, for the $t$-th iteration, $g_t$ is the stochastic gradient obtained at that iteration, $m_t$ is the gradient estimation, and $\eta_t$ is the adaptive learning rate, where $\psi_t$ generally relates to the square of the gradients. Each adaptive method therefore attempts to modify the gradient estimation (through the use of momentum in $\phi_t$) or the adaptive learning rate (through a different choice of $\psi_t$). Furthermore, it is also common to subject $\alpha$ to a manually set schedule for more optimal performance and better theoretical convergence guarantees. Such methods were first introduced in Duchi et al. (2011) (AdaGrad), which regulates the update size with the accumulated second-order statistical measures of gradients, providing a robust framework for sparse gradient updates. The vanishing learning rate caused by the equally weighted accumulation of gradients is the main drawback of AdaGrad; it was addressed in Tieleman and Hinton (2012)
(RMSProp), which utilizes an exponentially decaying average of squared gradients instead of their accumulation. A variant of first-order gradient measures was also introduced in
Zeiler (2012) (AdaDelta), which solves the decaying learning rate problem using an accumulation window, providing a robust framework toward hyper-parameter tuning issues. The adaptive moment estimation in
Kingma and Ba (2014) (AdaM) was introduced later, leveraging both first and second moment measures of gradients for parameter updating. AdaM can be seen as the culmination of the three preceding adaptive optimizers (AdaGrad, RMSProp and AdaDelta), solving the vanishing learning rate problem and offering a more optimal adaptive learning rate that improves both rapid convergence and generalization capabilities. Further improvements were made on AdaM using Nesterov momentum
Dozat (2016), long-term memory of past gradients Reddi et al. (2018), rectified estimations Liu et al. (2020), dynamic bounds on the learning rate Luo et al. (2019), the hypergradient descent method Baydin et al. (2018), and loss-based step-sizes Rolinek and Martius (2018). Methods based on line-search techniques Vaswani et al. (2019) and coin betting Orabona and Tommasi (2017) have also been introduced to avoid bottlenecks caused by hyper-parameter tuning issues in SGD. The AdaM optimizer, as well as its variants, has attracted many practitioners in deep learning for two main reasons: (1) it requires minimal hyper-parameter tuning effort; and (2) it provides an efficient convergence optimization framework. Despite the ease of implementation of such optimizers, there is a growing concern about their poor
“generalization” capabilities. They perform well on the given samples, i.e. training data (at times achieving even better performance than non-adaptive methods, such as in Loshchilov and Hutter (2017); Smith (2017); Smith and Topin (2019)), but perform poorly on the out-of-samples, i.e. test/evaluation data Wilson et al. (2017). Despite various research efforts on adaptive learning methods, non-adaptive SGD-based optimizers (such as scheduled learning methods including warmup techniques Loshchilov and Hutter (2017), cyclical learning Smith (2017); Smith and Topin (2019), and step decaying Goodfellow et al. (2016)) are still considered the gold-standard frameworks for achieving better performance, at the price of more epochs for training and/or costly tuning for optimal hyper-parameter configurations given different datasets and models.
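As a concrete instance of the general adaptive form (1), the sketch below implements the AdaM update on a toy quadratic. The function name, hyper-parameter values, and test problem are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One step of the general adaptive form (1) instantiated as AdaM:
    m_t estimates the gradient (the role of phi), v_t the squared
    gradient (the role of psi), and the effective step-size is
    alpha / sqrt(v_t)."""
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    theta, m, v = adam_step(theta, theta, m, v, t)
print(np.linalg.norm(theta))  # a small residual near the optimum
```

Note how the per-coordinate division by $\sqrt{v_t}$ makes the initial step magnitudes roughly $\alpha$ regardless of gradient scale, which is exactly the behavior the fixed-step-size momentum SGD above lacks.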
Our goal in this paper is twofold: (1) we address the above issues by proposing a new approach for adaptive methods in SGD optimization; and (2) we introduce new probing metrics that enable the monitoring and evaluation of the quality of learning within layers of a Convolutional Neural Network (CNN). Unlike the general trend in most adaptive methods, where raw measurements from gradients are utilized to adapt the step-size and regulate the gradients (through different choices of adaptive learning rate or gradient estimation), we take a different approach and focus our efforts on scheduling the learning rate
independently for each convolutional block. Specifically, we first ask “how much of the gradients are useful for SGD updates?” and then translate this into a new concept we call the “knowledge gain”, which is measured from the energy of the low-rank factorization of convolution weights in deep layers. The knowledge gain defines the usefulness of gradients and adapts the next step-size for SGD updates. We summarize our contributions as follows:
The new concepts of “knowledge gain” and “mapping condition” are introduced to measure the quality of convolution weights during iterative training and to answer the following questions: how well are the layers trained after a certain number of epochs? Is enough information obtained via the sequence of updates?

We propose a new adaptive scheduling algorithm for SGD called “AdaS”, which introduces minimal computational overhead over vanilla SGD and guarantees the increase of knowledge gain over consecutive epochs. AdaS adaptively schedules the learning rate for every conv block and both generalizes well and outperforms previous adaptive methods, e.g. AdaM. A tuning parameter called the gain-factor is used in AdaS to trade off between fast convergence and greedy performance. Code is available at https://github.com/mahdihosseini/AdaS.

Thorough experiments are conducted on image classification problems using various datasets and CNN models. We adopt different optimizers and compare their convergence speed and generalization characteristics to those of our AdaS optimizer.

A new probing tool based on knowledge gain and mapping condition is introduced to measure the quality of network training without requiring test/evaluation data. We investigate the relationship between our new quality metrics and performance results.
2 Knowledge Gain in CNN Training
Central to our work is the notion of knowledge gain
measured from the convolutional weights of CNNs. Consider the convolutional weights of a particular layer in a CNN, defined by a four-way array (aka fourth-order tensor)
$W \in \mathbb{R}^{N_1 \times N_2 \times N_3 \times N_4}$, where $N_1$ and $N_2$ are the height and width of the convolutional kernels, and $N_3$ and $N_4$ correspond to the number of input and output channels, respectively. The feature mapping under this convolution operation follows $F_O^j = \sum_{i} F_I^i \ast W_{:,:,i,j}$, where $F_I$ and $F_O$ are the input and output feature maps stacked in 3D volumes, and $j$ is the output index. The well-posedness of this feature mapping can be studied by the generalized spectral decomposition (i.e. SVD) form of the tensor array using the Tucker model Kolda and Bader (2009); Sidiropoulos et al. (2017) in full-core tensor mode

(2)  $W = \sum_{i_1=1}^{N_1} \sum_{i_2=1}^{N_2} \sum_{i_3=1}^{N_3} \sum_{i_4=1}^{N_4} g_{i_1 i_2 i_3 i_4}\, u^1_{i_1} \circ u^2_{i_2} \circ u^3_{i_3} \circ u^4_{i_4}$

where the core $G = [g_{i_1 i_2 i_3 i_4}]$ (containing singular values) is called a tensor, $u^d_{i_d}$ is the factor basis for decomposition, and $\circ$ is the outer product operation. We use similar notations as in Sidiropoulos et al. (2017) for brevity. Note that $G$ can be at most of rank $(N_1, N_2, N_3, N_4)$.
The tensor array in (2) is (usually) initialized by random noise sampling for CNN training, such that the mapping under this tensor randomly spans the output dimensions (i.e. the diffusion of knowledge is fully random in the beginning, with no learned structure). Throughout an iterative training framework, more knowledge is gained, lying in the tensor space as a mixture of a low-rank manifold and perturbing noise. Therefore, it makes sense to decompose (factorize) the observed tensor within each layer of the CNN as $W = \widehat{W} + E$. This decomposes the observed tensor array into a low-rank tensor $\widehat{W}$ containing the small-core tensor $\widehat{G}$ such that the error residues $E$ are minimized. A similar framework is also used in CNN compression Lebedev et al. (2015); Tai et al. (2016); Kim et al. (2016); Yu et al. (2017). A handful of techniques (e.g. CP/PARAFAC, TT, HT, truncated MLSVD, Compression) can be found in Kolda and Bader (2009); Oseledets (2011); Grasedyck et al. (2013); Sidiropoulos et al. (2017) to estimate such a small-core tensor. The majority of these solutions are iterative, and we therefore take more careful consideration toward such low-rank decompositions.
An equivalent representation of the tensor decomposition (2) is the vector form $\operatorname{vec}(W) = (U^4 \otimes U^3 \otimes U^2 \otimes U^1) \operatorname{vec}(G)$, where $\operatorname{vec}(\cdot)$ stacks all tensor elements column-wise, $\otimes$ is the Kronecker product, and each $U^d$ is a factor matrix containing all bases stacked in column form. Since we are interested in the input and output channels of CNNs for decomposition, we use mode-3 and mode-4 expressions, yielding two matrices

(3)  $W_{(3)} = U^3\, G_{(3)}\, (U^4 \otimes U^2 \otimes U^1)^T, \qquad W_{(4)} = U^4\, G_{(4)}\, (U^3 \otimes U^2 \otimes U^1)^T$

where $W_{(3)} \in \mathbb{R}^{N_3 \times N_1 N_2 N_4}$, $W_{(4)} \in \mathbb{R}^{N_4 \times N_1 N_2 N_3}$, and $G_{(3)}$ and $G_{(4)}$ are likewise reshaped forms of the core tensor $G$. The tensor decomposition (3) is the representation equivalent to (2), decomposed at mode-3 and mode-4. Recall the matrix (two-way array) decomposition, e.g. SVD, such that $W_{(d)} = U \Sigma V^T$, where $U$ and $V$ are orthonormal factor matrices and $\Sigma$ contains the singular values Sidiropoulos et al. (2017). In other words, to decompose a tensor on a given mode, we first unfold the tensor (on the given mode) and then apply a decomposition method of interest such as SVD.
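To make the mode-wise unfolding concrete, the sketch below (with an arbitrary random array standing in for a conv kernel; the shapes are assumptions) builds the mode-3 and mode-4 matrices and reads off their singular values:

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the
    remaining axes, giving a matrix whose rows index the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# A toy conv kernel W of shape (N1, N2, N3, N4): spatial dims x in/out channels.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3, 8, 16))

W3 = unfold(W, 2)   # mode-3 matrix: (N3, N1*N2*N4) = (8, 144)
W4 = unfold(W, 3)   # mode-4 matrix: (N4, N1*N2*N3) = (16, 72)

s3 = np.linalg.svd(W3, compute_uv=False)   # singular values, descending
s4 = np.linalg.svd(W4, compute_uv=False)
print(W3.shape, W4.shape, len(s3), len(s4))  # (8, 144) (16, 72) 8 16
```

The column ordering inside each unfolded matrix depends on the unfolding convention, but the singular values used by the metrics below are unaffected by that choice.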
The presence of noise, however, is still a barrier to a better understanding of the latter reshaped forms. Similar to Lebedev et al. (2015), we revise our goal into low-rank matrix factorizations of $W_{(3)}$ and $W_{(4)}$, where a global analytical solution is given by the Variational Bayesian Matrix Factorization (VBMF) technique in Nakajima et al. (2013) as a re-weighted SVD of the observation matrix. This method avoids the need to implement an iterative algorithm.
Using the above decomposition framework, we introduce the following two definitions.
Definition 1.
(Knowledge Gain). For convolutional weights in deep CNNs, define the knowledge gain across a particular channel (i.e. the $d$-th dimension) as

(4)  $G_{d,p}(W) = \dfrac{1}{N_d\, \sigma_1^p} \displaystyle\sum_{i=1}^{N'_d} \sigma_i^p$

where $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_{N'_d}$ are the low-rank singular values of a single-channel convolutional weight in descending order, $N'_d \leq N_d$, $d$ stands for the dimension index, and $p \in \{1, 2\}$.
The notion of knowledge gain on the input tensor in (4) is in fact a direct measure of the norm energy of the factorized matrix.
Remark 1.
The energies here indicate the distance measure from matrix separability obtained from the low-rank factorization (similar to the index of inseparability in neurophysiology Depireux et al. (2001)). In other words, it measures the space span obtained by the low-rank structure. The division factors in (4) also normalize the gain as a fraction of channel capacity. In this study we are mainly interested in the third- and fourth-dimension measures (i.e. $d \in \{3, 4\}$).
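A minimal sketch of the knowledge-gain computation, assuming the form of Definition 1 with exponent $p$ and using a crude hard threshold on the singular values as a stand-in for the EVBMF low-rank estimate (the tolerance value is an assumption, not the paper's method):

```python
import numpy as np

def knowledge_gain(W_d, p=1, rel_tol=1e-3):
    """Knowledge gain of an unfolded weight matrix W_d (shape N_d x M):
    the normalized energy of the retained low-rank singular values.
    `rel_tol` is a crude hard threshold standing in for the EVBMF rank."""
    s = np.linalg.svd(W_d, compute_uv=False)   # descending order
    s = s[s > rel_tol * s[0]]                  # keep the "low-rank" part
    return float(np.sum(s ** p) / (W_d.shape[0] * s[0] ** p))

# A full-rank isotropic matrix attains the maximum gain of 1 ...
print(knowledge_gain(np.eye(4)))                  # 1.0
# ... while a rank-1 matrix of the same size carries only 1/N_d of it.
u = np.ones((4, 1))
print(knowledge_gain(u @ u.T))                    # 0.25
```

The normalization by $N_d \sigma_1^p$ keeps the gain in $[0, 1]$, so values from differently sized layers remain comparable.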
Definition 2.
(Mapping Condition). For convolutional weights in deep CNNs, define the mapping condition across a particular channel (i.e. the $d$-th dimension) as

(5)  $\kappa_d(W) = \sigma_{\max} / \sigma_{\min}$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the maximum and minimum low-rank singular values of a single-channel convolutional weight, respectively.
Recall the matrix-vector calculation form $y = Ax$, mapping the input vector $x$ into the output vector $y$, where its numerical stability is defined by the matrix condition number, the relative ratio of maximum to minimum singular values Horn and Johnson (2012). The convolution operations in CNNs follow a similar concept by mapping input feature images into output features. Accordingly, the mapping condition of the convolutional layers in a CNN is defined by (5) as a direct measurement of the condition number of the low-rank factorized matrices, indicating the well-posedness of the convolution operation.
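A minimal sketch of the mapping condition of Definition 2, again using a hard singular-value threshold as a stand-in for the VBMF low-rank estimate (an assumption, not the paper's exact procedure):

```python
import numpy as np

def mapping_condition(W_d, rel_tol=1e-3):
    """Mapping condition: ratio of the largest to the smallest retained
    (low-rank) singular value of an unfolded weight matrix W_d."""
    s = np.linalg.svd(W_d, compute_uv=False)   # descending order
    s = s[s > rel_tol * s[0]]                  # retained low-rank part
    return float(s[0] / s[-1])

# An orthogonal mapping is perfectly conditioned (kappa = 1) ...
print(mapping_condition(np.eye(5)))                     # 1.0
# ... while anisotropic scaling degrades the conditioning.
print(mapping_condition(np.diag([10.0, 5.0, 1.0])))     # 10.0
```

Restricting the ratio to the retained singular values matters: without the threshold, near-zero noise modes would blow the condition number up for every layer.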
3 Adapting Stochastic Gradient Descent with Knowledge Gain
As an optimization method in deep learning, SGD typically attempts to minimize the loss functions of large networks
Bottou (2010, 2012); LeCun et al. (2015); Goodfellow et al. (2016). Consider the updates on the convolutional weights using this optimization

(6)  $W_{k+1} \leftarrow W_k - \eta(t)\, \bar{g}_k$

where $t$ and $k$ correspond to the epoch index and the mini-batch iteration, respectively, $\bar{g}_k$ is the average stochastic gradient on the $k$-th mini-batch that is randomly selected from a batch of samples, and $\eta(t)$ defines the step-size taken toward the opposite direction of the average gradients. The selection of the step-size can be either adaptive with respect to statistical measures from the gradients Duchi et al. (2011); Zeiler (2012); Tieleman and Hinton (2012); Kingma and Ba (2014); Dozat (2016); Reddi et al. (2018); Luo et al. (2019); Vaswani et al. (2019); Liu et al. (2020) or subject to change in different scheduled learning regimes Loshchilov and Hutter (2017); Smith (2017); Smith and Topin (2019); Loshchilov and Hutter (2017).
In the scheduled learning rate method, the step-size is usually fixed for every $t$-th epoch (i.e. for all mini-batch updates) and changes according to the schedule assignment for the next epoch (i.e. $t+1$). We set up our problem by accumulating all observed gradients throughout the $K$ mini-batch updates within the $t$-th epoch

(7)  $W_{t+1} = W_t - \eta(t)\, \hat{g}_t, \qquad \hat{g}_t = \sum_{k=1}^{K} \bar{g}_k$

where $W_t$ and $W_{t+1}$ are the weights at the beginning of the $t$-th and $(t+1)$-th epochs. Note that the significance of the updates in (7) from the $t$-th to the $(t+1)$-th epoch is controlled by the step-size $\eta(t)$, which directly impacts the rate of the knowledge gain.
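For momentum-free SGD at a fixed step-size, the accumulation in (7) is a telescoping identity: the epoch-level change in the weights equals the step-size times the sum of the observed mini-batch gradients, even though each gradient is evaluated at a fresh iterate. A quick numerical check on a toy least-squares objective (names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
eta, K = 0.1, 20
w0 = rng.standard_normal(5)
w = w0.copy()
A = rng.standard_normal((K, 5))           # one mock mini-batch row per step

grads = []
for k in range(K):                        # K mini-batch updates in one epoch
    g = A[k] * (A[k] @ w)                 # gradient of 0.5 * (A[k] @ w) ** 2
    grads.append(g)
    w = w - eta * g                       # plain SGD, fixed step-size

# Epoch-level view (7): the net update is eta times the accumulated gradients.
assert np.allclose(w, w0 - eta * np.sum(grads, axis=0))
print("accumulated-gradient identity holds")
```

This is what lets the analysis reason about one epoch as a single update with gradient $\hat{g}_t$; momentum or a per-step learning-rate change would break the exact identity.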
Here we provide satisfying conditions on the stepsize for increasing the knowledge gain in SGD.
Theorem 1.
(Increasing Knowledge Gain for SGD). Using the knowledge gain (4) from Definition 1 and setting the step-size of Stochastic Gradient Descent (SGD) proportionate to the knowledge gain,

(8)  $\eta(t+1) \propto G(W_t)$

will guarantee the monotonic increase of the knowledge gain, i.e. $G(W_{t+1}) \geq G(W_t)$, for some existing lower bound on the step-size.
The proof of Theorem 1 is provided in Appendix A. The step-size in (8) is proportional to the knowledge gain through the updating scheme in SGD, where we update its value every epoch. Therefore, the computational overhead over vanilla SGD is limited to calculating the knowledge gain for each convolutional layer in the CNN once per epoch. This overhead is minimal due to the empirical solution provided by the low-rank factorization method (EVBMF) in Nakajima et al. (2013).
4 AdaS Algorithm
We formulate the update rule for AdaS using SGD with momentum as follows:

(9)  $\eta_b(t) = \beta\, \eta_b(t-1) + \zeta\, [\bar{G}_b(t) - \bar{G}_b(t-1)]$

(10)  $u_k = \alpha\, u_{k-1} - \eta_b(t)\, \bar{g}_k$

(11)  $x_{k+1} = x_k + u_k$

where $k$ is the current mini-batch iteration, $t$ is the current epoch, $b$ is the conv block index, $\bar{G}_b$ is the average knowledge gain obtained from both mode-3 and mode-4 decompositions, $u$ is the velocity term, $x$ are the learnable parameters, $\beta$ is the gain-factor, $\zeta$ is a scaling hyper-parameter, and $\alpha$ is the momentum coefficient.
The pseudo-code for our proposed algorithm, AdaS, is presented in Algorithm 1. Each convolution block in the CNN is assigned an index $b$, where all learnable parameters (e.g. conv weights, biases, batch-norms, etc.) are called using this index. The goal in AdaS is firstly to retrieve the convolutional weights within each block, secondly to apply low-rank matrix factorization to the unfolded tensors, and finally to approximate the overall knowledge gain $\bar{G}_b$ and mapping condition $\bar{\kappa}_b$. The approximation is done once every epoch and introduces minimal computational overhead over the rest of the optimization framework. The learning rate is computed relative to the rate of change in knowledge gain over two consecutive epoch updates (from previous to current). The learning rate is then further smoothed by an exponential moving average, controlled by a hyper-parameter called the gain-factor, to accumulate the history of knowledge gain over the sequence of epochs. In effect, the gain-factor controls the trade-off between convergence speed and training accuracy of AdaS. An ablative study on the effect of this parameter is provided in Appendix B. The computed step-sizes for all conv blocks are then passed through the SGD optimization framework for adaptation. Note that the same step-size is used within each block for all learnable parameters. Code is available at https://github.com/mahdihosseini/AdaS.
5 Experiments
We compare our AdaS algorithm to several adaptive and non-adaptive optimizers in the context of image classification. In particular, we implement AdaS with SGD with momentum; four adaptive methods, i.e. AdaGrad Duchi et al. (2011), RMSProp Goodfellow et al. (2016), AdaM Kingma and Ba (2014), and AdaBound Luo et al. (2019); and two non-adaptive momentum SGDs guided by scheduled learning techniques, i.e. OneCycleLR (also known as the super-convergence method) Smith and Topin (2019) and SGD with StepLR (step decaying) Goodfellow et al. (2016). We further investigate the dependence of CNN training quality on the knowledge gain and mapping condition and provide useful insights into the usefulness of different optimizers for training different models. For details on ablative studies and a complete set of experiments, please refer to Appendix B.
5.1 Experimental Setup
We investigate the efficacy of AdaS with respect to variations in the number of deep layers using VGG16 Simonyan and Zisserman (2015) and ResNet34 He et al. (2016), and in the number of classes using the standard CIFAR-10 and CIFAR-100 datasets Krizhevsky et al. (2009) for training. The details of the pre-processing steps, network implementation and training/testing frameworks are adopted from the CIFAR GitHub repository^{1}^{1}1https://github.com/kuangliu/pytorch-cifar
using PyTorch. We set the initial learning rates of AdaGrad, RMSProp and AdaBound to
their suggested default values. We further followed the tuning suggested in Wilson et al. (2017) for AdaM and SGD-StepLR (dropping the step-size by half a magnitude every fixed number of epochs) to achieve the best performance. For SGD-1CycleLR we set the cycle length in epochs and found the best configuration separately for VGG16 and ResNet34. To configure the best initial learning rate for AdaS, we performed a dense grid search separately for VGG16 and ResNet34. Despite the differences in the optimal values obtained independently for each network, the optimizer performance is fairly robust relative to changes in these values. Each model is trained for a fixed number of epochs over independent runs, and the average test accuracy and training losses are reported. The mini-batch size is also fixed across all experiments.

5.2 Image Classification Problem
We first empirically evaluate the effect of the gain-factor on AdaS convergence by defining eight different grid values. The trade-off between different selections of the gain-factor is demonstrated in Figure 1 (the complete ablation study is provided in Appendix B). Here, a lower gain-factor translates to faster convergence, whereas setting it to higher values yields better final performance, at the cost of requiring more epochs for training. The performance comparison of optimizers is also overlaid in the same figure, where AdaS (with a lower gain-factor) surpasses all adaptive and non-adaptive methods by a large margin in both test accuracy and training loss during the initial stages of training, whereas SGD-StepLR and AdaS (with a higher gain-factor) eventually overtake the other methods with more training epochs. Furthermore, AdaGrad, RMSProp, AdaM, and AdaBound all achieve similar or sometimes even lower training losses compared to AdaS (including the two non-adaptive methods), but attain lower test accuracies. Similar contradictory results were reported in Wilson et al. (2017), where adaptive optimizers generalized worse than non-adaptive methods. In retrospect, we claim here that AdaS solves this issue by generalizing better than other adaptive optimizers.
We further provide quantitative results on the convergence of all optimizers trained on ResNet34 in Table 1 with a fixed number of training epochs. The rank consistency of AdaS (using two different gain-factors of low and high value) over the other optimizers is evident. For instance, AdaS gains higher test accuracy (with half confidence interval) over the second-best optimizer, AdaM, on CIFAR-100 at a fixed epoch budget.

[Table 1: test accuracies of AdaGrad, RMSProp, AdaM, SGD-StepLR, and AdaS (two gain-factors) on CIFAR-10 and CIFAR-100 at fixed epoch budgets.]
5.3 Dependence of the Quality of Network Training to Knowledge Gain
Both concepts of knowledge gain and mapping condition can be used to probe an intermediate layer of a CNN and quantify the quality of training with respect to different parameter settings. Such quantification does not require test or evaluation data: one can directly measure the “expected performance” of the network throughout the training updates. Our first observation is made by linking the knowledge gain measure to the relative success of each method in test accuracy performance. For instance, by raising the gain-factor in AdaS, the deeper layers of the CNN eventually gain further knowledge, as shown in Figure 2. This directly affects the success in test performance results. Also, deploying different optimizers yields different knowledge-gain behavior in different layers of the CNN. Table 2 lists all four numerical measurements of test accuracy, training loss, knowledge gain, and mapping condition for different optimizers. Note the rank-order correlation between knowledge gain and test accuracy. Although the knowledge gains for both RMSProp and AdaM are high, their mapping conditions are also high, which deteriorates the overall performance of the network.
[Table 2: test accuracy, training loss, knowledge gain ($\bar{G}$), and mapping condition ($\bar{\kappa}$) for AdaGrad, RMSProp, AdaM, AdaBound, SGD-1CycleLR, SGD-StepLR, and AdaS under different gain-factors.]
Our second observation is made by studying the effect of the mapping condition and how it relates to the possible lack of generalizability of each optimizer. Although adaptive optimizers (e.g. RMSProp and AdaM) yield lower training losses, they overfit to perturbing features (mainly caused by incomplete second-order statistical measures, e.g. the diagonal Hessian approximation) and accordingly hamper their own generalization Wilson et al. (2017). We suspect this unwanted phenomenon is related to the mapping condition within CNN layers. In fact, a mixture of both the average knowledge gain $\bar{G}$ and the average mapping condition $\bar{\kappa}$ can help to better realize how well each optimizer generalizes across training/testing evaluations.
We conclude by identifying that an ideal optimizer would yield high knowledge gain $\bar{G}$ and low mapping condition $\bar{\kappa}$ across all layers within a CNN. We highlight that an increase in $\bar{\kappa}$ correlates with greater disentanglement between intermediate input and output layers, hampering the flow of information. Further, we identify that increases in knowledge gain strengthen the carriage of information through the network, which enables greater performance.
6 Conclusion
We have introduced a new adaptive method called AdaS to solve the issue of combining fast convergence and high-precision performance of SGD in deep neural networks, all in a unified optimization framework. The method applies a low-rank approximation framework to each convolution layer, identifies how much knowledge is gained in the progression of epoch training, and adapts the SGD learning rate proportionally to the rate of change in knowledge gain. AdaS adds minimal computational overhead to the regular SGD algorithm and accordingly provides a well-generalized framework to trade off between convergence speed and performance. Furthermore, AdaS provides an optimization framework in which validation data is no longer required and the stopping criterion for training can be obtained directly from the training loss. Empirical evaluations reveal the possible existence of a lower bound on the SGD step-size that can monotonically increase the knowledge gain independently in each network convolution layer and accordingly improve the overall performance. AdaS is capable of significant improvements in generalization over traditional adaptive methods (i.e. AdaM) while maintaining their rapid convergence characteristics. We highlight that these improvements come through the application of AdaS to simple SGD with momentum. We further identify that since AdaS adaptively tunes the learning rates independently for all convolutional blocks, it can be deployed with adaptive methods such as AdaM, replacing the traditional scheduling techniques. We postulate that such deployments of AdaS with adaptive gradient updates could introduce greater robustness to the initial learning rate choice, and we leave this exploration as future work. Finally, we emphasize that, without loss of generality, AdaS can be deployed on fully connected networks, where the weight matrices can be directly fed into the low-rank factorization for metric evaluations.
Broader Impact
The content of this research is of broad interest to both researchers and practitioners of computer science and engineering for training deep learning models in machine learning. The method proposed in this paper introduces a new optimization tool that can be adopted for training a variety of models such as Convolutional Neural Networks (CNNs). The proposed optimizer has strong generalizability: it combines fast convergence speed with superior performance compared to existing off-the-shelf optimizers. The method further introduces a new concept that measures how well a CNN model is trained by probing different layers of the network to obtain a quality measure for training. This metric can be of broad interest to computer scientists and engineers developing efficient models tailored to specific applications and datasets.
References
 Online learning rate adaptation with hypergradient descent. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1.
 Practical recommendations for gradientbased training of deep architectures. In Neural networks: Tricks of the trade, pp. 437–478. Cited by: §1.
 Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §1, §3.
 Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pp. 421–436. Cited by: §1, §3.
 Spectrotemporal response field characterization with dynamic ripples in ferret primary auditory cortex. Journal of neurophysiology 85 (3), pp. 1220–1234. Cited by: §2.
 Incorporating nesterov momentum into adam. Cited by: §1, §1, §3.
 Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (61), pp. 2121–2159. External Links: Link Cited by: §1, §1, §3, §5.
 Deep learning. MIT press, Cambridge, MA, USA. Note: http://www.deeplearningbook.org Cited by: §1, §1, §3, §5.
 A literature survey of lowrank tensor approximation techniques. GAMMMitteilungen 36 (1), pp. 53–78. Cited by: §2.

 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.1.
 Matrix analysis. Cambridge university press. Cited by: §2, Remark 1.
 Compression of deep convolutional neural networks for fast and low power mobile applications. (English (US)). Note: 4th International Conference on Learning Representations, ICLR 2016 Cited by: §2.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §1, §1, §3, §5.
 Tensor decompositions and applications. SIAM review 51 (3), pp. 455–500. Cited by: §2, §2.
 Learning multiple layers of features from tiny images. Cited by: §5.1.
 Speedingup convolutional neural networks using finetuned cpdecomposition. Note: 3rd International Conference on Learning Representations, ICLR 2015 Cited by: §2, §2.
 Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1, §3.

 On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §3.
 SGDR: stochastic gradient descent with warm restarts. Note: 5th International Conference on Learning Representations, ICLR 2017 Cited by: §1, §3.
 Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §3, §5.
 Global analytic solution of fullyobserved variational bayesian matrix factorization. Journal of Machine Learning Research 14 (Jan), pp. 1–37. Cited by: §2, §3, 1.
 Training deep networks without learning rates through coin betting. In Advances in Neural Information Processing Systems, pp. 2160–2170. Cited by: §1, §1.
 Tensortrain decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317. Cited by: §2.
 Some methods of speeding up the convergence of iteration methods. Ussr Computational Mathematics and Mathematical Physics 4, pp. 1–17. External Links: Document Cited by: §1.
 On the convergence of adam and beyond. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §3.
 A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §1.
 L4: practical lossbased stepsize adaptation for deep learning. In Advances in Neural Information Processing Systems, pp. 6433–6443. Cited by: §1, §1.
 No more pesky learning rates. In International Conference on Machine Learning, pp. 343–351. Cited by: §1.
 Minimizing finite sums with the stochastic average gradient. Mathematical Programming 162 (12), pp. 83–112. Cited by: §1.
 Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing 65 (13), pp. 3551–3582. Cited by: §2, §2, §2.
 Very deep convolutional networks for largescale image recognition. In International Conference on Learning Representations, Cited by: §5.1.
 Superconvergence: very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for MultiDomain Operations Applications, Vol. 11006, pp. 1100612. Cited by: §1, §3, §5.
 Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. Cited by: §1, §3.
 On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §1.
 Convolutional neural networks with lowrank regularization. (English (US)). Note: 4th International Conference on Learning Representations, ICLR 2016 Cited by: §2.
 Lecture 6.5rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §1, §1, §3.

 Painless stochastic gradient: interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, pp. 3727–3740. Cited by: §1, §1, §3.
 The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158. Cited by: §1, §5.1, §5.2, §5.3.
 On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7370–7379. Cited by: §2.
 Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §1, §1, §3.
Appendix A: Proof of Theorems
The two proofs in this appendix correspond to the proof of Theorem 1 for $p = 1$ and $p = 2$, respectively.
Proof.
(Theorem 1) For simplicity of notation, let $A$ denote the unfolded weight matrix, $\hat{G}$ the accumulated gradients, and $\eta$ the step-size. The SGD update in (7) then changes to $A_{t+1} = A_t - \eta \hat{G}_t$. Using Definition 1, the knowledge gain of the matrix $A$ (assumed to be a column matrix) is expressed by
(12) 
An upper bound on the first singular value can be calculated by first recalling its equivalence to the $\ell_2$ (spectral) norm and then applying the triangle inequality
(13) 
By substituting (13) into (12) and expanding the terms in the trace, a lower bound on the knowledge gain is given by
(14) 
The latter inequality can be revised to
(15) 
Therefore, the bound in (15) is revised to
(16) 
The monotonicity of the knowledge gain in (16) is guaranteed if the remaining term is non-negative. The remaining term can be expressed as a quadratic function of the step-size
(17) 
where the condition for positivity in (17) is
(18) 
Hence, the lower bound in (18) guarantees the monotonicity of the knowledge gain through the update scheme.
Our final inspection is to check whether substituting the step-size (8) into (16) still upholds the inequality condition in (16). Following the substitution, the inequality should satisfy
(19) 
We have found that, for some lower bound in (18), the inequality in (19) also holds, and the proof is done. ∎
Proof.
(Theorem 1) Following similar notation as in the previous proof, the knowledge gain of the matrix is expressed by
(20) 
By stacking all singular values in vector form (and recalling the inequality between the $\ell_1$ and $\ell_2$ norms), and by substituting the matrix decomposition, the following inequality holds
(21) 
An upper bound on the first singular value can be calculated by recalling its equivalence to the $\ell_2$ norm and the triangle inequality as follows
(22) 
By substituting the lower bound (21) and the upper bound (22) into (20), a lower bound on the knowledge gain is given by
The latter inequality can be revised to
(23) 
where, the lower bound of the first summand term is given by
Therefore, the bound in (23) is revised to
(24) 
Note that the step-size is always positive, and the only condition for the bound in (24) to hold is the non-negativity of the remaining term. Here the remaining term can be expressed as a quadratic function of the step-size, where
The quadratic function can be factorized by its two roots. For the function to yield a positive value, both factorized elements should be either positive or negative; here, only the positive conditions hold, which yields the stated bound. The underlying assumption holds strongly for the beginning epochs due to the random initialization of weights, where the low-rank matrix is indeed an empty matrix at the first epoch. As epoch training progresses this condition loosens and might not hold. Therefore, the monotonicity of the knowledge gain for $p = 2$ could be violated in the interim process. ∎
Appendix B: AdaS Ablation Study
The ablative analysis of the AdaS optimizer is studied here with respect to different parameter settings. Figure 3 demonstrates AdaS performance with respect to a range of gain-factors. Figure 4 demonstrates the knowledge gain for different datasets and networks with respect to different gain-factor settings over successive epochs. Similarly, Figure 5 demonstrates the rank gain (aka the ratio of non-zero singular values of the low-rank structure to the channel size) over successive epochs. Mapping conditions are shown in Figure 6, and Figure 7 demonstrates the learning rate approximation through the AdaS algorithm over successive training epochs. The evolution of knowledge gain versus mapping condition is shown in Figures 8 and 9.





