Deep networks have achieved excellent performance in a variety of domains, such as computer vision he2016deep , language modeling zaremba2014recurrent , and speech recognition graves2013speech . The most popular optimizer is stochastic gradient descent (SGD) robbins1951stochastic , which is simple and has low per-iteration complexity. Its convergence rate is also well-established ghadimi2013stochastic ; bottou2018optimization . However, vanilla SGD is sensitive to the choice of stepsize and requires careful tuning. To improve the efficiency and robustness of SGD, many variants have been proposed, such as momentum acceleration polyak1964some ; nesterov1983method ; sutskever2013importance and adaptive stepsizes duchi2011adaptive ; hinton2012rmsprop ; zeiler2012adadelta ; kingma2014adam .
Though variants with coordinate-wise adaptive stepsizes (such as Adam kingma2014adam ) have been shown to be effective in accelerating convergence, their generalization performance is often worse than that of SGD wilson2017marginal . To improve generalization performance, attempts have been made to use a layer-wise stepsize singh2015layer ; yang2017lars ; adam2017normalized ; zhou2018adashift , which assigns different stepsizes to different layers or normalizes the layer-wise gradient. However, there has been no theoretical analysis of this empirical success. More generally, the network parameters can also be partitioned into arbitrary blocks instead of simply into layers.
Recently, it has been shown that coordinate-wise adaptive gradient descent is closely related to sign-based gradient descent pmlr-v80-balles18a ; pmlr-v80-bernstein18a . Theoretical arguments and empirical evidence suggest that using the gradient sign impedes generalization pmlr-v80-balles18a . To close the generalization gap, a partial adaptive parameter for the second-order momentum has been proposed chen2018closing . With a smaller partial adaptive parameter, the adaptive gradient algorithm behaves less like sign descent and more like SGD.
Moreover, in methods with coordinate-wise adaptive stepsizes, a small parameter (commonly denoted $\epsilon$) is typically added to avoid numerical problems in practical implementations. It is discussed in zaheer2018adaptive that this parameter controls the adaptivity of the algorithm, and that using a larger value can reduce adaptivity and empirically helps Adam match the generalization performance of SGD. This implies that coordinate-wise adaptivity may be too strong for good generalization performance.
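The effect of this $\epsilon$ parameter on adaptivity can be illustrated with a short numerical sketch (the stepsizes, second-moment values, and $\epsilon$ values below are chosen purely for illustration and are not from the paper):

```python
import numpy as np

# Sketch: how epsilon in a coordinate-wise adaptive update controls adaptivity.
# With a large epsilon, the effective per-coordinate stepsizes become nearly
# uniform, so the update behaves more like (SGD-style) non-adaptive descent.
def effective_stepsizes(v, lr=0.01, eps=1e-8):
    """Per-coordinate stepsize lr / (sqrt(v) + eps) for second-moment estimate v."""
    return lr / (np.sqrt(v) + eps)

v = np.array([1e-6, 1e-2, 1.0])           # coordinates with very different second moments
small = effective_stepsizes(v, eps=1e-8)  # highly adaptive: stepsizes differ by ~1000x
large = effective_stepsizes(v, eps=10.0)  # nearly uniform: almost SGD-like

ratio_small = small.max() / small.min()
ratio_large = large.max() / large.min()
assert ratio_small > 100 * ratio_large    # large eps drastically reduces adaptivity
```
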
In this paper, by revisiting the derivation of Adagrad, we consider partitioning the model parameters into blocks as in singh2015layer ; yang2017lars ; adam2017normalized ; zhou2018adashift , and propose the use of a blockwise stepsize. By allowing this blockwise stepsize to depend on the corresponding gradient block, we obtain the notion of blockwise adaptivity. Intuitively, adapting to parameter blocks is less aggressive than adapting to individual coordinates, and this reduced adaptivity can strike a better balance between adaptivity and generalization. Moreover, since blockwise adaptivity is not coordinate-wise, it does not suffer from the performance deterioration associated with sign-based gradient descent.
Specifically, we consider the expected risk minimization problem $\min_x f(x) = \mathbb{E}_{\xi}[\ell(x; \xi)]$ (1), where $\ell$ is some possibly nonconvex loss function and $\xi$ is a random sample. The expected risk measures the generalization performance on unseen data bottou2018optimization , and reduces to the empirical risk when a finite training set is considered. We show theoretically that the proposed blockwise adaptive gradient descent can be faster than its counterpart with a coordinate-wise adaptive stepsize. Using tools from uniform stability bousquet2002stability ; hardt2016train , we also show that blockwise adaptivity has a potentially lower generalization error than coordinate-wise adaptivity. Empirically, blockwise adaptive gradient descent converges faster and obtains better generalization performance than its coordinate-wise counterpart (Adam) and Nesterov’s accelerated gradient (NAG) sutskever2013importance .
Notations. For an integer $n$, $[n] = \{1, \dots, n\}$. For a vector $x$, $x^\top$ denotes its transpose, $\mathrm{Diag}(x)$ is a diagonal matrix with $x$ on its diagonal, $\sqrt{x}$ is the element-wise square root of $x$, $x^2$ is the coordinate-wise square of $x$, $\|x\|_2 = \sqrt{x^\top x}$, and $\|x\|_A = \sqrt{x^\top A x}$, where $A$ is a positive semidefinite (psd) matrix; $x \geq c$ means $x_i \geq c$ for all $i$. For two vectors $x$ and $y$, $x / y$ and $x \cdot y$ denote the element-wise division and dot product, respectively. For a square matrix $X$, $X^{-1}$ is its inverse, and $X \succeq 0$ means that $X$ is psd.
2 Related Work
Adagrad duchi2011adaptive is the first adaptive gradient method with a coordinate-wise stepsize for online convex learning. It is particularly useful for sparse learning, as parameters of rare features can take large steps. Its stepsize schedule is competitive with the best coordinate-wise stepsize in hindsight mcmahan2010adaptive . Recently, its convergence rate with a global adaptive stepsize in nonconvex optimization has been established ward2018adagrad . It is shown that Adagrad converges to a stationary point at the optimal $O(1/\sqrt{T})$ rate (up to a logarithmic factor), where $T$ is the total number of iterations.
Recall that the SGD iterate $x_{t+1}$ is the solution to the problem $\min_x \langle g_t, x \rangle + \frac{1}{2\eta} \|x - x_t\|_2^2$, where $g_t$ is the gradient of the loss function at iteration $t$, and $x$ is the parameter vector. To incorporate information about the curvature of the gradient sequence $\{g_t\}$, the $\ell_2$-norm in the SGD update can be replaced by the Mahalanobis norm $\|\cdot\|_{H_t}$, leading to duchi2011adaptive :
where $H_t$ is a positive definite matrix. This is an instance of mirror descent nemirovsky1983problem . Its regret bound has a gradient-related term $\sum_{t=1}^{T} \|g_t\|_{H_t^{-1}}^2$. Adagrad’s stepsize can be obtained by examining a similar objective duchi2011adaptive :
where $v \geq 0$, $\langle \mathbf{1}, v \rangle \leq c$, and $c$ is some constant. At optimality, $v_i \propto \|g_{1:T,i}\|_2$, where $g_{1:T,i} = [g_{1,i}, \dots, g_{T,i}]^\top$. As $v$ at time $t$ cannot depend on the $g_s$’s with $s > t$, this suggests $v_{t,i} = \|g_{1:t,i}\|_2 = \sqrt{\sum_{s=1}^{t} g_{s,i}^2}$. Theoretically, this choice of $v_t$ leads to a regret bound that is competitive with the best post-hoc optimal bound mcmahan2010adaptive .
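The resulting diagonal Adagrad update, where each coordinate's stepsize is set by its accumulated squared gradients, can be sketched in a few lines (the learning rate, test function, and iteration count below are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Sketch of the diagonal Adagrad update implied by the derivation above:
# each coordinate's stepsize is proportional to 1 / sqrt(sum of its squared gradients).
def adagrad_step(x, g, accum, lr=0.1, eps=1e-8):
    accum += g * g                        # coordinate-wise sum of squared gradients
    x -= lr * g / (np.sqrt(accum) + eps)  # per-coordinate adaptive stepsize
    return x, accum

x = np.array([1.0, 1.0])
accum = np.zeros(2)
for _ in range(100):
    g = 2 * x                             # gradient of f(x) = ||x||^2
    x, accum = adagrad_step(x, g, accum)
assert np.all(np.abs(x) < 1.0)            # iterates move toward the minimizer
```
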
To solve the expected risk minimization problem in (1), an Adagrad variant called weighted AdaEMA was recently proposed in zou2018sufficient . It employs a weighted average of past squared gradients for the stepsize, together with momentum acceleration. This is a general coordinate-wise adaptive method that includes many Adagrad variants, such as Adam and RMSprop, as special cases.
3 Blockwise Adaptive Descent
3.1 Blockwise vs Coordinate-wise Adaptivity
Let $n$ be the sample size, $d$ be the input dimensionality, and $m$ be the output dimensionality. Consider an $L$-layer neural network with output $H_L = \phi_L(W_L \phi_{L-1}(\cdots \phi_1(W_1 X)))$, where $X \in \mathbb{R}^{d \times n}$ is the input matrix and $W_1, \dots, W_L$ are the weight matrices. The activation functions $\phi_1, \dots, \phi_L$ are assumed to be bijective (e.g., tanh and leaky ReLU). For simplicity, assume that the hidden dimensionalities are the same for all layers. Training this neural network with the square loss corresponds to solving the nonlinear optimization problem $\min_{W_1, \dots, W_L} \|H_L - Y\|_F^2$, where $Y$ is the label matrix. Consider training the network layer-by-layer, starting from the bottom one. For layer $l$, $W_l^{t+1} = W_l^t - \eta_t G_l^t$, where $G_l^t$ is a stochastic gradient evaluated at $W_l^t$ at time $t$, and $\eta_t$ is the stepsize, which may be adaptive in that it depends on the past gradients of that layer. This layer-wise training is analogous to block coordinate descent, with each layer being a block. The optimization subproblem for the $l$th layer can be rewritten as (4), where $H_{l-1}$ is the input hidden representation at the $l$th layer.
Assume that the $\phi_l$’s are invertible. If $W_l$ is initialized to zero, and $H_{l-1}$ has full row rank, then the critical point that the update converges to is also the minimum-norm solution of (4) in expectation.
As the stepsize can depend on the gradients, Proposition 1 shows that blockwise adaptivity can find the minimum-norm solution of (4). In contrast, coordinate-wise adaptivity fails to find the minimum-norm solution even for the underdetermined linear least squares problem wilson2017marginal . Another benefit of using a blockwise stepsize is that the optimizer’s extra memory cost is reduced: a coordinate-wise stepsize requires additional memory proportional to the number of parameters, while a blockwise stepsize only needs memory proportional to the number of blocks. A deep network generally has millions of parameters but only tens of layers. If we set the blocks to be the layers, the memory reduction can be significant.
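The memory argument above is easy to make concrete. The layer shapes below are hypothetical, chosen only to show the scale of the difference between per-parameter and per-block accumulators:

```python
# Sketch (hypothetical layer sizes): extra optimizer memory for the
# second-moment accumulator under coordinate-wise vs blockwise adaptivity.
layer_shapes = [(512, 1024), (1024, 1024), (1024, 10)]  # assumed weight matrices

coordwise_mem = sum(r * c for r, c in layer_shapes)  # one accumulator per parameter
blockwise_mem = len(layer_shapes)                    # one accumulator per layer/block

assert coordwise_mem == 512 * 1024 + 1024 * 1024 + 1024 * 10
assert blockwise_mem == 3   # memory reduction of several orders of magnitude
```
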
There have been some recent attempts to use a layer-wise stepsize in deep networks, either by assigning a specific adaptive stepsize to each layer or by normalizing the layer-wise gradient singh2015layer ; yang2017lars ; adam2017normalized ; zhou2018adashift . However, justifications and convergence analyses are still lacking.
3.2 Blockwise Adaptive Learning Rate with Momentum
Let the gradient $g_t$ be partitioned into $B$ blocks $(g_{t,\mathcal{I}_1}, \dots, g_{t,\mathcal{I}_B})$, where $\mathcal{I}_b$ is the set of indices in block $b$, and $g_{t,\mathcal{I}_b}$ is the subvector of $g_t$ belonging to block $b$. Inspired by problem (3) in the derivation of Adagrad, we consider the following variant, which imposes a block structure on $v$:111We assume the indices in each block are consecutive; otherwise, we can simply reorder the elements of the gradient. Note that reordering does not change the result, as the objective is invariant to the ordering of the coordinates.
where the entries of $v$ within each block share a common value. It can be easily shown that at optimality, the shared value for block $b$ is proportional to the accumulated gradient norm $\sqrt{\sum_{t=1}^{T} \|g_{t,\mathcal{I}_b}\|_2^2}$ of that block. When $H_t$ in (2) is partitioned by the same block structure, this suggests incorporating the accumulated gradient norm up to time $t$ into $H_t$ for block $b$. Thus, we consider the following update rule with a blockwise adaptive stepsize:
Here, $\epsilon$ is a hyperparameter that prevents numerical issues. When each block contains a single coordinate, this update rule reduces to Adagrad. In Appendix A, we show that it can outperform Adagrad in online convex learning.
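The blockwise update just described can be sketched as follows. The notation is an assumption on our part (in particular, accumulating the block's average squared gradient is one natural reading; the learning rate and blocks are illustrative):

```python
import numpy as np

# Minimal sketch of a blockwise Adagrad-style update as in (7): every
# coordinate in a block shares one adaptive stepsize, driven by the block's
# accumulated (average) squared gradient.
def blockwise_adagrad_step(x, g, blocks, accum, lr=0.1, eps=1e-8):
    for b, idx in enumerate(blocks):
        accum[b] += np.mean(g[idx] ** 2)                   # blockwise accumulation
        x[idx] -= lr * g[idx] / (np.sqrt(accum[b]) + eps)  # shared stepsize for the block
    return x, accum

blocks = [np.array([0, 1]), np.array([2])]
x = np.array([1.0, 1.0, 1.0])
accum = np.zeros(2)
for _ in range(100):
    x, accum = blockwise_adagrad_step(x, 2 * x, blocks, accum)  # gradient of ||x||^2
assert np.all(np.abs(x) < 1.0)
```

Setting every block to a single coordinate recovers the coordinate-wise Adagrad update.
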
As $v_t$ in (6) is increasing w.r.t. $t$, the update in (7) suffers from a vanishing stepsize, making slow progress on nonconvex problems such as deep network training. To alleviate this problem, a weighted moving average with momentum has been used in many Adagrad variants, such as RMSprop, Adam and weighted AdaEMA zou2018sufficient . In this paper, we adopt weighted AdaEMA with a blockwise adaptive stepsize. The proposed procedure, which will be called blockwise adaptive gradient with momentum (BAGM), is shown in Algorithm 1. When each block contains a single coordinate, BAGM reduces to weighted AdaEMA. As weighted AdaEMA includes many Adagrad variants, the proposed BAGM also covers the corresponding blockwise variants. In Algorithm 1, $m_t$ serves as an exponential moving-average momentum, and $\{\beta_t\}$ is a sequence of momentum parameters. The weights $\{a_t\}$ assign different weights to the past gradients in the accumulation of the variance, as:
In this paper, we consider the three weight sequences introduced in zou2018convergence . S.1: a constant weight sequence; S.2: a polynomially growing weight sequence, for which the fraction in (8) decreases as $t$ increases; S.3: an exponentially growing weight sequence, which can be shown to be equivalent to using the exponential moving average estimate $v_t = \theta v_{t-1} + (1 - \theta) g_t^2$ for some $\theta \in (0, 1)$. With the S.3 sequence, a constant momentum parameter, and each block containing a single coordinate, the proposed algorithm reduces to Adam.
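A BAGM-style step with the exponential-moving-average weight sequence (S.3) and momentum can be sketched as follows. All hyperparameter values are illustrative assumptions; with one coordinate per block this becomes an Adam-like update:

```python
import numpy as np

# Sketch of one BAGM-style step: momentum is an EMA of gradients, and each
# block keeps one EMA of its average squared gradient (the S.3 weighting),
# giving a single shared stepsize per block.
def bagm_step(x, g, blocks, m, v, lr=0.01, beta1=0.9, theta=0.999, eps=1e-3):
    m[:] = beta1 * m + (1 - beta1) * g                            # momentum: EMA of gradients
    for b, idx in enumerate(blocks):
        v[b] = theta * v[b] + (1 - theta) * np.mean(g[idx] ** 2)  # blockwise EMA of squared grads
        x[idx] -= lr * m[idx] / (np.sqrt(v[b]) + eps)             # one shared stepsize per block
    return x, m, v

blocks = [np.array([0, 1]), np.array([2])]
x = np.array([1.0, 1.0, 1.0])
m, v = np.zeros(3), np.zeros(2)
for _ in range(500):
    x, m, v = bagm_step(x, 2 * x, blocks, m, v)  # gradient of ||x||^2
assert np.linalg.norm(x) < 0.5
```
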
3.3 Convergence Analysis
We make the following assumptions.
$f$ in (1) is lower-bounded (i.e., $\inf_x f(x) > -\infty$) and $L$-smooth.
Each block of the stochastic gradient has a bounded second moment, i.e., $\mathbb{E}\|g_{t,\mathcal{I}_b}\|_2^2 \leq \sigma_b^2$ for all $b$, where the expectation is taken w.r.t. the random sample $\xi_t$.
Assumption 2 implies that the variance of each block of the stochastic gradient is also upper-bounded by $\sigma_b^2$ (i.e., $\mathbb{E}\|g_{t,\mathcal{I}_b} - \mathbb{E} g_{t,\mathcal{I}_b}\|_2^2 \leq \sigma_b^2$).
We make the following assumption on the sequence $\{\beta_t\}$: it is upper-bounded by some constant strictly less than one. This implies that we can use, for example, a constant $\beta_t$, or an increasing sequence.
(i) is non-decreasing; (ii) grows slowly such that is non-decreasing and for some ; (iii) .
The stepsize is chosen such that it is “almost” non-increasing, i.e., there exists a non-increasing sequence that bounds the stepsize from above and below up to positive constant factors for all $t$ zou2018sufficient .
As in weighted AdaEMA, we define a sequence of virtual estimates of the second moment: . Let be its maximum over all blocks and training iterations, where the expectation is taken over all random ’s. Let for and . For a constant such that , define , where is the largest index for which . When , we set .
The following Theorem provides a bound related to the gradients.
The following corollary shows that the bound also holds with high probability.
With probability at least , we have .
When each coordinate forms its own block, we obtain the same non-asymptotic convergence rates as in zou2018sufficient . Note that SGD is analogous to BAGM with a single block, as both use a single stepsize for all coordinates and their convergence rates depend on the same second-moment upper bound in Assumption 2. With a decreasing stepsize, SGD also has a convergence rate of $O(1/\sqrt{T})$, which can be seen by setting the stepsize to $O(1/\sqrt{T})$ in (2.4) of ghadimi2013stochastic . Thus, our rate is as good as that of SGD.
Next, we compare the effect of the block structure on convergence. As the bound depends on the gradient sequence, a direct comparison is difficult. Instead, we study a looser upper bound. First, we introduce the following assumption, which is stronger than Assumption 2 (which only bounds the expectation).
With Assumption 6, a deterministic analogue of the second-moment bound in Assumption 2 follows easily. We can then define a looser upper bound by replacing the corresponding expectation-based quantity with this deterministic one. We proceed to compare the convergence using a coordinate-wise stepsize and a blockwise stepsize. When making this comparison, we assume that Assumption 2 is tight for each block $\mathcal{I}_b$, the set of indices in block $b$. The following corollary shows that a blockwise stepsize can have faster convergence than a coordinate-wise stepsize.
Assume that Assumption 6 holds. Let and be the values of for and , respectively. Define , and . Let . Then, .
Note that can be larger than as . Corollary 3 then indicates that blockwise adaptive stepsize will lead to improvement if . Assume that the upper bound is tight so that . Thus, , and the above condition is likely to hold when is close to . From the definitions of , and , we can see that they get close to when are close to (i.e., has low variability). In particular, when for all (note that ). This is empirically verified in Appendix C.2.1.
3.4 Uniform Stability and Generalization Error
Given a sample $S$ of $n$ examples drawn i.i.d. from an underlying unknown data distribution $\mathcal{D}$, one often learns the model by minimizing the empirical risk $\frac{1}{n} \sum_{i=1}^{n} \ell(x; \xi_i)$, where the learned model $A(S)$ is the output of a possibly randomized algorithm $A$ (e.g., SGD) run on the data $S$.
Let $S$ and $S'$ be two samples of size $n$ that differ in only one example. Algorithm $A$ is $\epsilon$-uniformly stable if $\sup_{\xi} \mathbb{E}_A[\ell(A(S); \xi) - \ell(A(S'); \xi)] \leq \epsilon$ hardt2016train .
The generalization error hardt2016train is defined as the expected gap between the expected risk and the empirical risk of $A(S)$, where the expectation is taken w.r.t. the sample $S$ and the randomness of $A$. It is shown in hardt2016train that the generalization error is bounded by the uniform stability of $A$: if $A$ is $\epsilon$-uniformly stable, its generalization error is at most $\epsilon$. In other words, the more uniformly stable an algorithm is, the lower its generalization error.
Let $x_t$ and $\tilde{x}_t$ be the $t$th iterates of BAGM on $S$ and $S'$, respectively, and let $\Delta_t = \mathbb{E}\|x_t - \tilde{x}_t\|$. The following shows how $\Delta_t$ (and hence uniform stability) grows with $t$.
Using Proposition 2, we can study how the number of blocks affects the growth of $\Delta_t$. The first term on the RHS of the bound is small when the number of blocks is near one extreme, while the second term is small when the number of blocks approaches the other extreme. As a result, for some intermediate number of blocks, $\Delta_t$, and thus the generalization error, grows more slowly than in either the single-block or the coordinate-wise extreme.
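The quantity $\Delta_t$ in this stability argument can be illustrated numerically: run the same optimizer on two datasets differing in a single example and track the distance between the iterates. The sketch below uses plain SGD on a toy least-squares problem (the data, learning rate, and epoch count are assumptions for illustration, not the paper's setting):

```python
import numpy as np

# Illustrative sketch (not the paper's proof): track the parameter distance
# delta = ||x_T - x~_T|| between two runs on datasets differing in one example,
# the quantity whose growth controls uniform stability.
rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d)
A2, y2 = A.copy(), y.copy()
A2[0] = rng.normal(size=d)
y2[0] = 10.0                                    # the two samples differ only in example 0

def sgd(A, y, order, lr=0.01, epochs=3):
    x = np.zeros(d)
    for _ in range(epochs):
        for i in order:
            x -= lr * (A[i] @ x - y[i]) * A[i]  # least-squares stochastic gradient step
    return x

order = rng.permutation(n)                      # same sample path on both runs
delta = np.linalg.norm(sgd(A, y, order) - sgd(A2, y2, order))
assert delta > 0                                # the runs diverge, but boundedly
```
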
4 Experiments
In this section, we perform experiments on CIFAR-10 (Section 4.1), ImageNet (Section 4.2), and WikiText-2 (Section 4.3). All experiments are run on an AWS p3.16 instance with 8 NVIDIA V100 GPUs. We introduce four block construction strategies:
B.1: Use a single adaptive stepsize for each parameter tensor/matrix/vector. A parameter tensor can be the kernel tensor in a convolutional layer, a parameter matrix can be the weight matrix in a fully-connected layer, and a parameter vector can be a bias vector;
B.2: Use an adaptive stepsize for each output dimension of the parameter matrix/vector in a fully-connected layer, and an adaptive stepsize for each output channel in a convolutional layer;
B.3: Use an adaptive stepsize for each output dimension of the parameter matrix/vector in a fully-connected layer, and an adaptive stepsize for each kernel in a convolutional layer;
B.4: Use an adaptive stepsize for each input dimension of the parameter tensor/matrix, and an adaptive stepsize for each parameter vector.
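The four strategies imply very different numbers of blocks. The sketch below counts them for a hypothetical convolution kernel (the shape and the exact mapping of B.1–B.4 to tensor axes are our assumptions for illustration):

```python
import numpy as np

# Sketch: number of blocks each strategy would create for an assumed conv
# kernel of shape (out_channels, in_channels, kH, kW).
W = np.zeros((64, 32, 3, 3))

b1_blocks = 1                           # B.1: one block per parameter tensor
b2_blocks = W.shape[0]                  # B.2: one block per output channel
b3_blocks = W.shape[0] * W.shape[1]     # B.3: one block per (kH, kW) kernel
b4_blocks = W.shape[1]                  # B.4: one block per input dimension

assert (b1_blocks, b2_blocks, b3_blocks, b4_blocks) == (1, 64, 2048, 32)
```
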
We compare the proposed BAGM (with block construction strategies B.1–B.4) with the following baselines: (i) Nesterov’s accelerated gradient (NAG) sutskever2013importance ; and (ii) Adam kingma2014adam . These two algorithms are widely used in deep networks zaremba2014recurrent ; he2016deep ; vaswani2017attention . NAG provides a strong baseline with good generalization performance, while Adam serves as a fast counterpart with a coordinate-wise adaptive stepsize.
As a grid search over all hyperparameters is computationally expensive, we only tune the most important ones using a validation set and fix the rest. For BAGM, we use a constant momentum parameter and the exponentially increasing weight sequence S.3 from Section 3.2. For Adam, we also fix its second-moment parameter and tune its momentum parameter. Note that with such configurations, Adam is a special case of BAGM with each coordinate as a block (i.e., weighted AdaEMA). For all adaptive methods, we use a large $\epsilon$ as suggested in zaheer2018adaptive .
4.1 ResNet on CIFAR-10
We train deep residual networks from the MXNet Gluon CV model zoo (https://github.com/dmlc/gluon-cv/blob/master/gluoncv/model_zoo/model_zoo.py) on the CIFAR-10 data set. We use the 56-layer and 110-layer networks as in he2016deep . 10% of the training data is carved out as a validation set. We perform a grid search using the validation set for the initial stepsize and momentum parameter on ResNet56. The obtained hyperparameters are then also used on ResNet110. We follow a setup similar to he2016deep . Details are in Appendix C.2.
Table 1 shows the testing errors of the various methods. With a large $\epsilon$, the testing performance of Adam matches that of NAG. This agrees with zaheer2018adaptive that a larger $\epsilon$ reduces adaptivity and improves generalization performance. It also agrees with Proposition 2 that the bound is smaller when $\epsilon$ is larger. Specifically, Adam has a lower testing error than NAG on ResNet56 but a higher one on ResNet110. For both models, BAGM reduces the testing error over Adam for all block construction strategies used. In particular, with all but one of the block schemes, BAGM outperforms NAG.
[Table 1 columns: test error (%) on CIFAR-10; top-1 and top-5 validation errors (%) on ImageNet.]
Convergence of the training, testing, and generalization errors (the absolute difference between training and testing errors) is shown in Figure 1. (To reduce clutter, we only show results for the best-performing block construction scheme, which gives the lowest testing error among the proposed schemes; the figure with full results is in Appendix C.2.) As can be seen, on both models, this BAGM variant converges to a lower training error than Adam. This agrees with Corollary 3 that blockwise adaptive methods can converge faster than their counterparts with element-wise adaptivity. Moreover, its generalization error is smaller than Adam’s, which agrees with Proposition 2 that blockwise adaptivity can have a slower growth of generalization error. On both models, it gives the smallest generalization error, while NAG has the highest generalization error on ResNet56. Overall, the proposed methods can accelerate convergence and improve generalization performance.
4.2 ImageNet Classification
In this experiment, we train a 50-layer ResNet model on ImageNet russakovsky2015imagenet . The data set has 1000 classes, 1.28M training samples, and 50,000 validation images. As the data set does not come with labels for its test set, we evaluate generalization performance on the validation set. We use the ResNet50_v1d network from the MXNet Gluon CV model zoo. We train the FP16 (half-precision) model on 8 GPUs, each processing 128 images per iteration. More details are in Appendix C.3.
Performance on the validation set is shown in Table 1. As can be seen, BAGM with all the block schemes achieves lower top-1 errors than Adam and NAG. For the top-5 error, the best BAGM scheme obtains the lowest value, followed by another BAGM variant. Overall, the best BAGM scheme has the strongest performance on both CIFAR-10 and ImageNet.
4.3 Word-Level Language Modeling
In this section, we train the AWD-LSTM word-level language model merity2017regularizing on the WikiText-2 (WT2) data set merity2016pointer . We use the publicly available implementation in the Gluon NLP toolkit (https://gluon-nlp.mxnet.io/). We perform a grid search over the initial learning rate and momentum parameter as in Section 4.1, and set the weight decay as in merity2017regularizing . More details on the setup are in Appendix C.4. As there are no convolutional layers, B.2 and B.3 are the same. Table 2 shows the testing perplexities (lower is better). As can be seen, all adaptive methods achieve lower test perplexities than NAG, and BAGM obtains the best result.
In this paper, we proposed adapting the stepsize for each parameter block, instead of for each individual parameter as in Adam and RMSprop. Convergence and uniform stability analyses show that the proposed method can have faster convergence and lower generalization error than its counterpart with a coordinate-wise adaptive stepsize. Experiments on image classification and language modeling confirm these theoretical results.
- (1) L. Balles and P. Hennig. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. In Proceedings of the International Conference on Machine Learning, pages 404–413, 2018.
- (2) J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the International Conference on Machine Learning, pages 560–569, 2018.
- (3) B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Annual Workshop on Computational Learning Theory, pages 144–152. ACM, 1992.
- (4) L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- (5) O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(3):499–526, 2002.
- (6) N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
- (7) J. Chen and Q. Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763, 2018.
- (8) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7):2121–2159, 2011.
- (9) B. S. Everitt. The Cambridge dictionary of statistics. Cambridge University Press, 2006.
- (10) S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- (11) A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649, 2013.
- (12) M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the International Conference on Machine Learning, pages 1225–1234, 2016.
- (13) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- (14) D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference for Learning Representations, 2015.
- (15) I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations, 2017.
- (16) H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Annual Conference on Computational Learning Theory, page 244, 2010.
- (17) S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. In Proceedings of the International Conference on Learning Representations, 2018.
- (18) S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. In Proceedings of the International Conference on Learning Representations, 2017.
- (19) A. Nemirovski and D.B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
- (20) Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.
- (21) B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- (22) J. D. Rennie and N. Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, 2005.
- (23) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
- (24) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- (25) O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the International Conference on Machine Learning, pages 71–79, 2013.
- (26) B. Singh, S. De, Y. Zhang, T. Goldstein, and G. Taylor. Layer-specific adaptive learning rates for deep networks. In Proceedings of the International Conference on Machine Learning and Applications, pages 364–368, 2015.
- (27) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, pages 1139–1147, 2013.
- (28) T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp, COURSERA: Neural networks for machine learning, 2012.
- (29) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- (30) R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
- (31) A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
- (32) Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1707.03888, 2017.
- (33) A. W. Yu, Q. Lin, R. Salakhutdinov, and J. Carbonell. Normalized gradient with adaptive stepsize method for deep neural network training. arXiv preprint arXiv:1707.04822, 2017.
- (34) M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 9793–9803, 2018.
- (35) W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- (36) M. D. Zeiler. ADADELTA: An adaptive learning rate method. Preprint arXiv:1212.5701, 2012.
- (37) C. Zhang, Q. Liao, A. Rakhlin, K. Sridharan, B. Miranda, N. Golowich, and T. Poggio. Theory of deep learning III: Generalization properties of SGD. Technical report, Center for Brains, Minds and Machines (CBMM), 2017.
- (38) H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations, 2018.
- (39) Z. Zhou, Q. Zhang, G. Lu, H. Wang, W. Zhang, and Y. Yu. Adashift: Decorrelation and convergence of adaptive learning rate methods. arXiv preprint arXiv:1810.00143, 2018.
- (40) F. Zou and L. Shen. On the convergence of weighted adagrad with momentum for training deep neural networks. arXiv preprint arXiv:1808.03408v2, 2018.
- (41) F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu. A sufficient condition for convergences of adam and rmsprop. arXiv preprint arXiv:1811.09358, 2018.
Appendix A Online Convex Learning
In online learning, the learner picks a prediction $x_t$ at round $t$, and then suffers a loss $f_t(x_t)$. The goal of the learner is to choose the $x_t$'s so as to achieve a low regret w.r.t. an optimal predictor in hindsight. The regret over $T$ rounds is defined as $R(T) = \sum_{t=1}^{T} f_t(x_t) - \min_x \sum_{t=1}^{T} f_t(x)$.
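The regret definition above can be made concrete with a toy online learner (the losses, learning rate, and data distribution below are illustrative assumptions):

```python
import numpy as np

# Toy sketch of the regret in (9): cumulative loss of an online learner minus
# that of the best fixed predictor in hindsight, for squared loss in 1D.
rng = np.random.default_rng(1)
targets = rng.normal(loc=2.0, size=50)

x, lr, preds = 0.0, 0.1, []
for z in targets:
    preds.append(x)
    x -= lr * 2 * (x - z)                  # online gradient step on f_t(x) = (x - z)^2

best = targets.mean()                       # optimal fixed predictor in hindsight
regret = sum((p - z) ** 2 for p, z in zip(preds, targets)) \
       - sum((best - z) ** 2 for z in targets)
assert regret > 0                           # the learner pays for not knowing the future
```
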
a.1 Proposed Algorithm
The proposed procedure, which will be called blockwise adaptive gradient (BAG), is shown in Algorithm 2. Compared with Adagrad, each block, instead of each coordinate, has its own learning rate.
a.2 Regret Analysis
First, we make the following assumptions.
Each $f_t$ in (9) is convex but possibly nonsmooth. There exists a subgradient $g_t \in \partial f_t(x_t)$ with bounded norm for all $t$.
Each parameter block stays within a ball around the corresponding optimal block throughout the iterations. In other words, $\|x_{t,\mathcal{I}_b} - x^*_{\mathcal{I}_b}\| \leq D_b$ for all $t$ and $b$, where $x^*_{\mathcal{I}_b}$ is the subvector of the optimal $x^*$ for block $b$.
When each coordinate forms its own block, by setting the per-block constants to a common value, the regret bound reduces to that of Adagrad in Theorem 5 of duchi2011adaptive .
By Jensen’s inequality, the last term of (10) is minimized in one extreme of the block structure. However, the comparison with Adagrad on the first term is indeterminate because of the constant involved.
In the following, we provide an example showing that when the gradient magnitudes of elements in the same block have the same upper bound, blockwise adaptive learning rates can lead to lower regret than coordinate-wise adaptive learning rates (as in Adagrad). This indicates that blockwise adaptive methods can be beneficial for training deep networks, whose architectures are naturally divided into blocks in which parameters are likely to have gradients of similar magnitudes.
Let be the hinge loss for a linear model:
where is the label and is the feature vector. Assume that input is partitioned into blocks. For each in input block , with probability , for some given , and otherwise. Then, , and the expected gradient magnitudes for elements in the same input block have the same upper bound. Taking expectation of the gradient terms in (10), we have, for all ’s,