A New Adaptive Gradient Method with Gradient Decomposition

07/18/2021 ∙ by Zhou Shao, et al. ∙ Peking University

Adaptive gradient methods, especially Adam-type methods (such as Adam, AMSGrad, and AdaBound), have been proposed to speed up the training process with an element-wise scaling term on the learning rates. However, they often generalize poorly compared with stochastic gradient descent (SGD) and its accelerated schemes such as SGD with momentum (SGDM). In this paper, we propose a new adaptive method called DecGD, which simultaneously achieves good generalization like SGDM and rapid convergence like Adam-type methods. In particular, DecGD decomposes the current gradient into the product of two terms: a surrogate gradient and a loss-based vector. Our method adjusts the learning rates adaptively according to the current loss-based vector instead of the squared gradients used in Adam-type methods. The intuition behind the adaptive learning rates of DecGD is that a good optimizer, in general, needs to decrease the learning rates as the loss decreases, which is similar to the learning rate decay scheduling technique. Therefore, DecGD obtains rapid convergence in the early phases of training and controls the effective learning rates according to the loss-based vectors, which helps lead to better generalization. Convergence analysis is given in both convex and non-convex settings. Finally, empirical results on widely used tasks and models demonstrate that DecGD shows better generalization performance than SGDM and rapid convergence like Adam-type methods.


1 Introduction

Consider the following stochastic optimization problem:

$\min_{\theta \in \mathcal{F}} \; \mathbb{E}_{\xi}\left[ f(\theta; \xi) \right],$    (1)

where $\xi$ is a random variable, $f(\theta; \xi)$ is the instantaneous loss parameterized by $\theta$ on a sample $\xi$, and $\mathcal{F}$ is a closed convex set. This problem is a common learning task in machine learning and deep learning, and much effort has been devoted to stochastic optimization methods for solving it. Stochastic gradient descent (SGD) is the dominant first-order method for this problem Krizhevsky et al. (2012); Graves et al. (2013); Lecun et al. (1998). SGD is usually run in mini-batch form in order to meet the requirements of computing power and to achieve better generalization performance Heskes and Kappen (1993); LeCun et al. (2012). However, SGD has the following main drawbacks. First, SGD uses the negative gradients of the loss as descent directions, which yields slow convergence near local minima. Second, SGD scales the gradients uniformly in all directions, which may result in poor performance as well as limited training speed. Last but not least, when applied to machine learning and deep learning tasks, the learning rate decay schedule of SGD is painstakingly hard to tune manually; yet one has to decay the learning rates as the algorithm proceeds in order to control the variance of the stochastic gradients and achieve convergence, due to the high-dimensional non-convexity of machine learning and deep learning optimization problems.

To tackle the aforementioned issues, considerable effort has been devoted to this problem and several remarkable variants have been proposed recently. Accelerated schemes and adaptive methods are the two dominant categories of variants. Accelerated schemes, such as Nesterov's accelerated gradient (NAG) Nesterov (1983) and SGD with momentum (SGDM) Polyak (1964), employ momentum to adjust descent directions, which can help achieve faster convergence and better generalization performance than other variants. However, they also suffer from the third drawback of SGD, so one needs to spend much effort on tuning and decaying learning rates manually. On the other hand, adaptive methods aim to alleviate this issue by automatically decaying the learning rates and scaling them non-uniformly. The first prominent algorithm in this line of research is AdaGrad Duchi et al. (2011), which divides element-wise by accumulated squared historical gradients. AdaGrad performs well when gradients are sparse, but its performance degrades in dense or non-convex settings, which is attributed to the rapid decay of its learning rates. To this end, several methods have been proposed that scale gradients down by the square roots of exponential moving averages of squared historical gradients (called the EMA mechanism), thereby focusing only on recent gradients. This mechanism is very popular, and several well-known variants, including AdaDelta Zeiler (2012), RMSprop Tieleman and Hinton (2012) and Adam Kingma and Ba (2015), are based on it. In particular, Adam is a combination of momentum and the EMA mechanism; it converges fast in the early training phases and is easier to tune than SGD, making it the default algorithm across various deep learning frameworks.

Despite Adam's popularity, there have also been concerns about its convergence and generalization properties. In particular, EMA based methods may not converge to the optimal solution even in simple convex settings Reddi et al. (2018), which stems from the fact that the effective learning rates of EMA based methods can potentially increase over time in a fairly rapid manner. For convergence, it is important to have the effective learning rates decrease over iterations, or at least increase in a controlled way Zaheer et al. (2018). Moreover, this problem persists even if learning rate scheduling (decay) is applied. Recently, considerable effort has been devoted to improving EMA based methods Reddi et al. (2018); Luo et al. (2019); Zaheer et al. (2018); Chen et al. (2020); Liu et al. (2020); Zhuang et al. (2020) in order to narrow the generalization gap between EMA based methods and SGD. However, anyone pursuing the best generalization ability of a model still has to choose SGD as the default optimizer rather than Adam, because there is not enough evidence that those Adam-type methods, which claim to improve the generalization ability of the EMA mechanism, can match or surpass SGD in general tasks. Therefore, a natural question is whether it is possible to develop a new adaptive method, different from EMA based methods, that overcomes the aforementioned issues and obtains even better generalization than SGD. To the best of our knowledge, however, there have been few efforts to propose new adaptive mechanisms whose starting points differ from the EMA mechanism.

Contributions

In the light of this background, we list the main contributions of our paper.

  • We propose a new adaptive method, called DecGD, which differs from Adam-type methods. DecGD decomposes gradients into the product of two terms: surrogate gradients and loss-based vectors. Our method achieves adaptivity in SGD through the loss-based vectors, with the intuition that a good optimizer, in general, needs to decrease the learning rates as the loss decreases, similar to the learning rate decay scheduling technique. DecGD overcomes the aforementioned drawbacks of SGD and achieves comparable or even better generalization than SGD with momentum.

  • We theoretically analyze the convergence of DecGD in both convex and non-convex settings.

  • We conduct extensive empirical experiments for DecGD and compare with several representative methods. Empirical results show that DecGD is robust to hyperparameters and learning rates. Moreover, our method converges as fast as Adam-type methods and shows the best generalization performance in most tasks.

Related Work

The literature on stochastic methods is vast, and we review a few closely related works on improving SGD or Adam. These methods can be roughly divided into two families: EMA based methods and others. For EMA based methods, much effort has been devoted to closing the generalization gap between Adam and the SGD family. AMSGrad Reddi et al. (2018) controls the increase of the effective learning rates over iterations, AdaBound Luo et al. (2019) clips them, Yogi Zaheer et al. (2018) considers the mini-batch size, PAdam Chen et al. (2020) modifies the square root, RAdam Liu et al. (2020) rectifies the variance of learning rates, AdamW Loshchilov and Hutter (2019) decouples weight decay from gradient descent, and AdaBelief Zhuang et al. (2020) centralizes the second order momentum in Adam. To the best of our knowledge, there exist several methods that achieve adaptivity in SGD differently from Adam-type methods. AdaGD Malitsky and Mishchenko (2017) focuses on the local geometry and uses the gradients of the latest two steps to adaptively adjust learning rates, with a convergence guarantee that depends only on the local smoothness in a neighborhood of the local minima. Besides, AEGD Liu and Tian (2020) is closest to our work and uses the same composite function of the loss as ours, namely $\sqrt{f(\theta)+c}$, where $f$ is the loss and $c$ is a constant such that $f(\theta)+c>0$ for all $\theta$ in the feasible region. However, the intuition behind AEGD, which is far different from our method, is to realize gradient descent with a stable energy defined as the above composite function of the loss. Note that the energy in AEGD equals this definition only at the first step and is then updated monotonically as the algorithm proceeds. Hence, the stable energy seems of limited meaning, because it unconditionally decreases over iterations.

Notation

For a vector $\theta$, we denote its $i$-th coordinate by $\theta_i$. We use $\theta_t$ to denote $\theta$ in the $t$-th iteration and $\theta_{t,i}$ for the $i$-th coordinate of $\theta$ in the $t$-th iteration. Furthermore, we use $\|\cdot\|$ to denote the $\ell_2$-norm and $\|\cdot\|_\infty$ to denote the $\ell_\infty$-norm. Given two vectors $a$ and $b$, we use $a \odot b$ to denote their element-wise product and $a^2$ to denote the element-wise square. We use $a / b$ to denote element-wise division.

2 DecGD

As summarized before, attempts to improve SGD's performance in machine learning and deep learning tasks should consider the following three directions: improving the descent directions, scaling non-uniformly, and combining with the learning rate decay scheduling technique. Momentum based methods focus on the first aspect, while adaptive methods achieve adaptivity through element-wise operations motivated by the second direction. In particular, the dominant adaptive methods are Adam-type methods such as Adam, AdaBound and AdaBelief, which employ the second raw moment or second central moment of stochastic gradients (the EMA mechanism) to achieve adaptivity. However, although much effort has been devoted to generalization ability, there is not enough evidence that Adam-type methods generalize better than SGD. Therefore, SGD may need a new and different kind of adaptivity.

In the light of this background, we propose a new adaptive variant of SGD, called DecGD, which decomposes gradients into the product of two terms: surrogate gradients and loss-based vectors. DecGD achieves a new form of adaptivity in SGD via the loss-based vectors; the pseudo-code is shown in Algorithm 1.

  Input: initial point, learning rate, constant $c$, momentum parameter, flag AMS
  Initialize the first order momentum and the loss based vector
  for t = 1 to T do
      (compute the scaled surrogate gradients)
      (update the first order momentum)
      (update the loss based vector)
      (if AMS, enforce a monotonic decrease of the loss based vector)
      (update the parameters)
  end for
Algorithm 1 DecGD (default initialization with AMS=False)

Intuition

The intuition behind DecGD is that the loss can help adjust the learning rates over iterations. As the algorithm proceeds, one typically needs to decay the learning rates for convergence; thus, the learning rate decay scheduling technique is often applied during training. Note that both momentum based methods and adaptive methods benefit from being combined with the learning rate decay scheduling technique. Hence, adjusting learning rates according to the loss information is feasible. DecGD employs a decomposition of gradients to access this loss information.
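For reference, the learning rate decay scheduling mentioned above is the common practice of shrinking the step size on a fixed schedule as training proceeds. The following is a minimal sketch of a standard step-decay schedule; the drop factor and interval are hypothetical values chosen for illustration, not values taken from this paper:

```python
def step_decay(initial_lr, epoch, drop=0.1, epochs_per_drop=50):
    """Standard step decay: multiply the learning rate by `drop`
    every `epochs_per_drop` epochs (hypothetical values)."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# e.g. with initial_lr = 0.1: epochs 0-49 use 0.1, epochs 50-99 use 0.01, and so on
print([step_decay(0.1, e) for e in (0, 49, 50, 100)])
```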

Gradient Decomposition

Consider a composite function of the loss $f$:

$g(\theta) = \sqrt{f(\theta) + c},$    (2)

where the objective loss $f$ is a lower bounded function and $c$ is a constant s.t. $f(\theta) + c > 0$ for all $\theta$. Taking the derivative of $f$, we can decompose the gradient into the product of two terms

$\nabla f(\theta) = 2\sqrt{f(\theta) + c} \cdot \nabla g(\theta),$    (3)

where $\nabla f(\theta)$ and $\nabla g(\theta)$ are the gradients of $f$ and $g$ respectively. Note that $2\sqrt{f(\theta)+c}$, which has the same monotonicity as $f$, carries the current information of the loss, and $\nabla g(\theta)$ is a scaled version of $\nabla f(\theta)$ with the factor $1/(2\sqrt{f(\theta)+c})$, which is a constant for a given $\theta$. Thus, $-\nabla g(\theta)$ is also a descent direction, because $\langle -\nabla g(\theta), \nabla f(\theta) \rangle = -\|\nabla f(\theta)\|^2 / (2\sqrt{f(\theta)+c}) < 0$ whenever $\nabla f(\theta) \neq 0$. In conclusion, we decompose the gradient of the loss $f$ into a surrogate gradient $\nabla g(\theta)$, which is a descent direction for optimizing $f$, and a loss scalar $2\sqrt{f(\theta)+c}$.
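As a quick numerical check of the decomposition above (a minimal sketch, assuming the composite function $g(\theta)=\sqrt{f(\theta)+c}$ as written in (2)), the following snippet compares $\nabla f(\theta)$ with $2\sqrt{f(\theta)+c}\,\nabla g(\theta)$; the quadratic loss and the value of $c$ are hypothetical choices used only for illustration:

```python
import numpy as np

c = 1.0  # constant with f(theta) + c > 0

def f(theta):
    # hypothetical quadratic loss, only used to illustrate the identity
    return 0.5 * np.sum(theta ** 2)

def grad_f(theta):
    return theta

def g(theta):
    return np.sqrt(f(theta) + c)

def grad_g(theta, eps=1e-6):
    # numerical gradient of g via central differences
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (g(theta + e) - g(theta - e)) / (2 * eps)
    return grad

theta = np.array([0.3, -1.2, 2.0])
lhs = grad_f(theta)
rhs = 2.0 * np.sqrt(f(theta) + c) * grad_g(theta)
print(np.allclose(lhs, rhs, atol=1e-5))  # True: grad f = 2 * sqrt(f + c) * grad g
```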

Update Rules

Based on the above decomposition, we now present the update rule of DecGD. To specify an optimizer, all we need is to calculate the learning rate (step size) and to choose the update direction. For example, the update scheme of vanilla SGD is:

$\theta_{t+1} = \theta_t - \alpha \, g_t,$    (4)

where $t$ indexes the iterations, $\alpha$ is a constant learning rate, $g_t$ is the gradient at time step $t$, and $-g_t$ is the steepest descent direction.

First, we consider the update direction of DecGD. As mentioned above, $-\nabla g(\theta)$ is a descent direction, namely a scaled version of the steepest descent direction. We employ momentum to achieve acceleration:

(5)

where the constant controls the exponential decay rate and the momentum is initialized to zero. The above formula shows how DecGD updates its direction.

We next consider calculating the learning rate of DecGD. In machine learning and deep learning optimization problems, the learning rate at a given time step is usually a constant rather than obtained by traditional line search, for the sake of computational cost. To avoid the second issue of SGD, the learning rate is often multiplied element-wise by a vector to achieve adaptivity. DecGD, motivated by the learning rate decay scheduling technique, employs a loss based vector which comes from the loss scalar in the above gradient decomposition and has the following update rule

(6)

where the loss based vector is initialized as in Algorithm 1. Note that applying the loss scalar $2\sqrt{f(\theta)+c}$, the second term of the gradient decomposition, directly has two disadvantages. First, it is a scalar for a given $\theta$, so we would fail to achieve a non-uniform adaptive learning rate. Besides, in that case DecGD would reduce exactly to SGD or SGD with momentum. Therefore, we employ a first order Taylor polynomial to approximate the loss term linearly. In detail, we start with a suitably initialized vector and update it element-wise to obtain a non-uniform loss based vector across directions. The update rule (6) can be viewed as a momentum version of the first order Taylor polynomial. With the same aim as AMSGrad, DecGD provides a switch controlling whether to rectify the loss based vector to ensure that it does not increase too quickly.

DecGD employs the following scheme to update the parameters based on the above decomposition:

(7)

Finally, we note that the memory and time complexities of DecGD are both linear in the number of parameters, the same as Adam. However, DecGD has one fewer hyperparameter than Adam (the 'AMS' flag in Adam indicates whether to enable AMSGrad).
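To make the moving parts above concrete, here is a rough sketch of the kind of update loop described: a momentum over surrogate gradients $\nabla g = \nabla f / (2\sqrt{f+c})$, and an element-wise loss based vector that scales the learning rate. The concrete formulas below (the EMA form of the momentum, the Taylor-style update of the loss based vector, its initialization, and the final parameter update) are illustrative assumptions on our part, not the exact update rules (5)-(7) of DecGD:

```python
import numpy as np

def decgd_like_step(theta, m, v, loss, grad, lr=0.01, c=1.0, beta=0.9):
    """One step of a DecGD-like update (illustrative sketch, not the exact method)."""
    scale = 2.0 * np.sqrt(loss + c)          # loss scalar 2 * sqrt(f + c)
    surrogate = grad / scale                 # surrogate gradient, i.e. gradient of sqrt(f + c)
    m = beta * m + (1.0 - beta) * surrogate  # first order momentum (assumed EMA form)
    v = v - lr * surrogate * m               # loss based vector, Taylor-style update (assumed)
    v = np.maximum(v, 1e-8)                  # keep the element-wise scaling positive
    theta = theta - lr * v * m               # element-wise scaled parameter update
    return theta, m, v

# toy usage on a quadratic f(theta) = 0.5 * ||theta||^2
theta = np.array([1.0, -2.0, 0.5])
m = np.zeros_like(theta)
v = np.full_like(theta, np.sqrt(0.5 * np.sum(theta ** 2) + 1.0))  # assumed init near sqrt(f + c)
for _ in range(200):
    loss = 0.5 * np.sum(theta ** 2)
    theta, m, v = decgd_like_step(theta, m, v, loss, grad=theta, lr=0.1)
print(theta)  # approaches zero
```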

Relevance and Difference

With the same aim of achieving adaptivity in SGD, both DecGD and Adam-type methods employ element-wise operations to scale the learning rates non-uniformly and apply momentum to the update directions. The difference is that Adam-type methods use the squared gradients to approximate the Hessian and thereby obtain second order information. In particular, AdaBelief Zhuang et al. (2020), which centralizes the second order momentum, considers the variance of the gradients; this is in fact a kind of Hessian approximation for obtaining second order information. AdaGD Malitsky and Mishchenko (2017) considers the local Lipschitz constant, which is actually similar to AdaBelief. In contrast, DecGD relies on zeroth order information (the loss value itself), motivated by practical applications. It seems possible to integrate DecGD with Adam-type methods to exploit both zeroth order and approximate second order information for better convergence. This topic remains open.

3 Convergence Analysis

We discuss the convergence of DecGD in both convex and non-convex situations. The convergence for convex objective functions is shown in the online convex optimization framework Duchi et al. (2011); Reddi et al. (2018); Hazan (2019); Alacaoglu et al. (2020), similar to Adam Kingma and Ba (2015), AMSGrad Reddi et al. (2018), AdaBound Luo et al. (2019) and AdaBelief Zhuang et al. (2020). Furthermore, we analyze the convergence for stochastic non-convex optimization problems, following previous work Zhuang et al. (2020); Chen et al. (2019). This situation is more in line with actual scenarios of machine learning and deep learning tasks.

3.1 Online Convex Optimization

In online optimization, at each round $t$ we incur a loss function $f_t$. After a decision $\theta_t$ is picked by the algorithm, we have the following regret to minimize:

$R(T) = \sum_{t=1}^{T} f_t(\theta_t) - \min_{\theta \in \mathcal{F}} \sum_{t=1}^{T} f_t(\theta).$    (8)

The standard assumptions Duchi et al. (2011); Reddi et al. (2018); Hazan (2019); Alacaoglu et al. (2020) in the setting of online convex optimization are as follows:

Assumption 1.

(1) $\mathcal{F}$ is a compact convex set; (2) each $f_t$ is a convex lower semi-continuous (lsc) function; (3) the gradients of the $f_t$ and the diameter of $\mathcal{F}$ are bounded.

We propose the following lemma:

Lemma 1.

Let $f$ be a lower bounded function and $c$ a constant s.t. $f(\theta)+c>0$, and let $g(\theta)=\sqrt{f(\theta)+c}$. If $f$ has bounded gradients, then $g$ has bounded gradients too and $g$ is bounded in the feasible region.

Remark 1.

The above lemma shows that the two terms arising from the decomposition of the gradient are both bounded. In particular, assumption (3) in the standard Assumption 1 yields corresponding bounds on the surrogate gradients and on the loss based terms.

Therefore, we can obtain the following assumptions for DecGD, which follow entirely from the standard Assumption 1:

Assumption 2.

(1) $\mathcal{F}$ is a compact convex set; (2) each composite loss is a convex lsc function; (3) the surrogate gradients, the loss based terms and the diameter of $\mathcal{F}$ are bounded.

The key results are as follows:

Theorem 1.

Under Assumption 2, with appropriate choices of the decaying learning rates and the momentum parameter, DecGD has the following bound on the regret:

(9)

The following result falls as an immediate corollary of the above results:

Corollary 1.

Suppose the learning rate in Theorem 1 decays with a suitable factor; then we have

(10)
Corollary 2.

Under the same assumptions as in Theorem 1, DecGD has the following bound on the average regret:

(11)
Remark 2.

Theorem 1 implies that the regret of DecGD is upper bounded by $O(\sqrt{T})$, similar to Adam Kingma and Ba (2015), AMSGrad Reddi et al. (2018), AdaBound Luo et al. (2019) and AdaBelief Zhuang et al. (2020). Besides, the condition in Corollary 1 can be relaxed and still ensures a regret bound of $O(\sqrt{T})$.

3.2 Stochastic Non-convex Optimization

We discuss the convergence in the stochastic non-convex setting, which is more in line with actual scenarios of machine learning and deep learning tasks than online convex optimization. The standard assumptions Zhuang et al. (2020); Chen et al. (2019) are as follows:

Assumption 3.

(1) $f$ is lower bounded and differentiable; (2) the noisy gradient is unbiased and the noise is independent across iterations; (3) at step $t$, the algorithm can access a bounded noisy gradient, and the true gradient is also bounded.

Similarly, the above assumptions yield the following assumptions for DecGD according to Lemma 1:

Assumption 4.

(1) the composite loss is lower bounded and differentiable; (2) the noisy gradient is unbiased and the noise is independent across iterations; (3) at step $t$, the algorithm can access a bounded noisy gradient of the composite loss, and the true gradient is also bounded.

The key result is as follows:

Theorem 2.

Under Assumption 4, with appropriate choices of the decaying learning rates and the momentum parameter, DecGD satisfies

(12)

where $C_1$, $C_2$, $C_3$ are constants independent of $d$ and $T$, and $C_4$ is a constant independent of $T$.

Remark 3.

Theorem 2 implies that DecGD has a convergence rate in the stochastic non-convex setting similar to that of Adam-type methods Chen et al. (2019); Zhuang et al. (2020). Besides, Theorem 3.1 in Chen et al. (2019) needs to specify a bound on each update, whereas DecGD does not. The proof follows the general framework in Chen et al. (2019), and it is possible that the above bound is loose. A sharper convergence analysis remains open.

4 Experiments

In this section, we study the generalization performance of our method and of several representative optimization methods. Besides SGD with momentum (SGDM), we test two families of optimizers: Adam-type methods (Adam, AMSGrad, AdaBound, AdaBelief) and other adaptive methods (AEGD and our method DecGD). We conduct experiments on popular deep learning tasks to test performance in the stochastic setting. In particular, several neural network architectures are chosen, including multilayer perceptrons, deep convolutional neural networks and deep recurrent neural networks. Concretely, we focus on the following experiments: a multilayer perceptron (MLP) on the MNIST dataset Lecun et al. (1998); ResNet-34 He et al. (2016) and DenseNet-121 Huang et al. (2017) on the CIFAR-10 dataset Krizhevsky and Hinton (2009); ResNet-34 and DenseNet-121 on the CIFAR-100 dataset Krizhevsky and Hinton (2009); and LSTMs on the Penn Treebank dataset Marcus et al. (1993).

4.1 Details

Hyperparameters

For SGDM and AEGD, we employ a grid search over learning rates. We set the momentum in SGDM to its default value. Note that we follow the best learning rate for SGDM reported in Zhuang et al. (2020) for LSTMs on the Penn Treebank dataset. For Adam, AMSGrad, AdaBound and AdaBelief, we employ a grid search over learning rates and tune over several values of the two momentum parameters. For the other parameters of the above Adam-type methods, we follow the settings reported in Zhuang et al. (2020); Luo et al. (2019) to achieve the best performance on CIFAR-10, CIFAR-100 and Penn Treebank, and use the default values for the other experiments. For DecGD, we use the default values of the hyperparameters, the default learning rate for CIFAR-10 and CIFAR-100, and a warm-up learning rate for LSTMs.
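As a concrete illustration of the tuning protocol described above, here is a minimal grid-search sketch; the candidate learning rates and momentum values are hypothetical placeholders rather than the grids used in the paper, and `train_and_evaluate` stands in for a full training run:

```python
from itertools import product

# hypothetical grids, not the exact values used in the paper
learning_rates = [1e-1, 1e-2, 1e-3, 1e-4]
momentums = [0.9, 0.99]

def train_and_evaluate(optimizer_name, lr, momentum):
    """Placeholder: train the model with the given optimizer settings
    and return validation accuracy."""
    raise NotImplementedError

def grid_search(optimizer_name):
    # exhaustively try every (lr, momentum) pair and keep the best one
    best_score, best_config = float("-inf"), None
    for lr, momentum in product(learning_rates, momentums):
        score = train_and_evaluate(optimizer_name, lr, momentum)
        if score > best_score:
            best_score, best_config = score, (lr, momentum)
    return best_config, best_score
```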

Figure 1: Performance of various optimizers on popular deep learning tasks: (a) DecGD with different hyperparameter values; (b) MLP on MNIST; (c) ResNet-34 on CIFAR-10; (d) DenseNet-121 on CIFAR-10; (e) ResNet-34 on CIFAR-100; (f) DenseNet-121 on CIFAR-100; (g) one-layer LSTM on PTB; (h) two-layer LSTM on PTB; (i) three-layer LSTM on PTB. In (a)-(f), higher is better; in (g)-(i), lower is better. The results in (c), (d), (g), (h), (i), except for AEGD, AMSGrad and our method DecGD, are those reported in AdaBelief.
SGDM AEGD Adam AMSGrad AdaBound AdaBelief DecGD
ResNet-34
95.44
DenseNet-121
95.64
Table 1: Test accuracy on CIFAR-10
SGDM AEGD Adam AMSGrad AdaBound AdaBelief DecGD
ResNet-34
78.85
DenseNet-121
79.99
Table 2: Test accuracy on CIFAR-100
SGDM Adam AdaBound AdaBelief DecGD
1-layer
82.97
2-layer
66.29
3-layer
61.23
Table 3: Test perplexity of LSTMs on PTB

MLP on MNIST

We conduct this experiment to test the performance of the aforementioned optimizers with an MLP on MNIST, following the experimental settings reported in AdaBound Luo et al. (2019). The MLP is a fully connected neural network with only one hidden layer, trained for a fixed number of epochs. Figure 1(b) shows the empirical result. Note that all optimization algorithms achieve a low test error, and our method DecGD and AMSGrad achieve slightly better performance than the other methods on the test set.

ResNet-34 and DenseNet-121 on CIFAR-10

CIFAR-10 is a more complex dataset than MNIST. We use more advanced and powerful deep convolutional neural networks, ResNet-34 and DenseNet-121, to test the various optimization methods on this classification task. We employ a fixed budget of 200 epochs and a fixed mini-batch size. Figures 1(c) and 1(d) show the empirical results. The code is modified from the official implementation of AdaBelief; except for AEGD and DecGD, the data of the other optimizers are those reported in Zhuang et al. (2020). As expected, the overall performance of each algorithm on ResNet-34 is similar to that on DenseNet-121. We note that DecGD shows the best generalization performance on DenseNet-121. For ResNet-34, the error of DecGD is slightly lower than that of AdaBelief, giving the best performance; for DenseNet-121, DecGD also surpasses AdaBelief. We find that classical adaptive methods such as Adam and AdaBound show rapid descent in the early period of training but mediocre generalization on the test set. The empirical results show that our method overcomes this drawback and achieves even better generalization performance than SGDM.

ResNet-34 and DenseNet-121 on CIFAR-100

CIFAR-100 is similar to CIFAR-10, but the total number of classes is 100. Therefore, CIFAR-100 is more difficult and closer to reality than CIFAR-10. We choose the same architectures, ResNet-34 and DenseNet-121, employ a fixed budget of 200 epochs and a fixed mini-batch size. Figures 1(e) and 1(f) show the empirical results. The code is modified from the official implementation of AdaBelief. Note that DecGD achieves the best generalization performance on both ResNet-34 and DenseNet-121. Concretely, DecGD surpasses AdaBelief on both ResNet-34 and DenseNet-121, and shows generalization ability far beyond the other methods on CIFAR-100.

Robustness to change

Considering the popularity of classification tasks, we test the performance of DecGD under different hyperparameter values for better application in these tasks. We select ResNet-18 on the CIFAR-10 dataset and choose the hyperparameter from a range of candidate values. Figure 1(a) shows the result. Note that DecGD is robust to these different values and the default value achieves slightly better performance. Consequently, we use the default value for almost all deep learning experiments.

LSTMs on Penn Treebank dataset

We test our method on the Penn Treebank dataset with one-, two- and three-layer LSTMs respectively, following the experimental setting in AdaBelief Zhuang et al. (2020). One difference is that AdaBelief improves these experiments by applying learning rate scheduling at two fixed epochs in their official implementation. Except for our method, the results of the other methods are those reported in AdaBelief; the code is modified from the official implementation of AdaBelief. The perplexities (ppl) are reported in Figures 1(g), 1(h) and 1(i), except for AEGD and AMSGrad due to their worse performance. To our knowledge, AdaBelief has been the best optimizer on the Penn Treebank dataset. Note that our method DecGD achieves performance similar to AdaBelief in all three experiments: for the one-layer LSTM, DecGD surpasses AdaBelief and achieves the lowest perplexity; for the two-layer LSTM, DecGD and AdaBelief show the best performance and DecGD is higher than AdaBelief by a small margin; for the three-layer LSTM, DecGD shows lower perplexity than the other methods except AdaBelief and is higher than AdaBelief by a small margin. Considering the magnitude of the perplexities, the gap between AdaBelief and DecGD is very small.

4.2 Analysis

We select popular computer vision and natural language processing tasks to investigate the generalization performance of our proposed method. As shown above, DecGD, as a new adaptive method, shows excellent generalization performance, even better than SGDM. Besides, DecGD is robust to hyperparameters and achieves the best performance with the default learning rate in most cases, especially on the CIFAR-100 dataset.

5 Conclusion

We have introduced DecGD, a simple and computationally efficient adaptive algorithm for non-convex stochastic optimization. The method is aimed at large-scale optimization problems in the sense of large datasets and/or high-dimensional parameter spaces, such as those found in machine learning and deep neural networks. The practical intuition and excellent performance of DecGD show that our method is worth further research.

Despite excellent performance of our method, there still remains several directions to explore in the future:

  • First, we prove a bound for our method DecGD in the non-convex setting. However, empirical results show that the generalization performance of DecGD is better than that of many methods with a similar bound, such as Adam. A tighter bound for DecGD needs to be explored in the future.

  • Furthermore, as mentioned before, DecGD and Adam can be integrated with each other. This topic remains open.

  • Finally, several works aim to find new ways to generate adaptive learning rates that differ from Adam-type methods. As this kind of work increases, how to measure the quality of adaptive learning rates becomes more and more important. However, there are few works on this topic.

6 Broader Impact

Optimization is at the core of machine learning and deep learning. To the best of our knowledge, there exist few works on adaptivity different from Adam-type methods or the EMA mechanism, and DecGD shows better performance than those methods. DecGD can benefit applications in all models whose parameter gradients can be estimated numerically.

References

  • A. Alacaoglu, Y. Malitsky, P. Mertikopoulos, and V. Cevher (2020) A new regret analysis for Adam-type algorithms. In International Conference on Machine Learning, pp. 202–210.
  • J. Chen, D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu (2020) Closing the generalization gap of adaptive gradient methods in training deep neural networks. In International Joint Conference on Artificial Intelligence.
  • X. Chen, S. Liu, R. Sun, and M. Hong (2019) On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv:1808.02941.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research (JMLR).
  • A. Graves, A. Mohamed, and G. E. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649.
  • E. Hazan (2019) Introduction to online convex optimization. arXiv:1909.05207.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • T. M. Heskes and B. Kappen (1993) On-line learning processes in artificial neural networks. In J.G. Taylor (Ed.), North-Holland Mathematical Library, Vol. 51, pp. 199–233.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  • A. Krizhevsky and G. E. Hinton (2009) Learning multiple layers of features from tiny images. Technical report.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105.
  • Y. A. LeCun, L. Bottou, G. B. Orr, and K. R. Müller (2012) Efficient backprop. In Neural Networks, G. Montavon, G. B. Orr, and K. Müller (Eds.), Lecture Notes in Computer Science, pp. 9–48.
  • Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE, pp. 2278–2324.
  • H. Liu and X. Tian (2020) AEGD: adaptive gradient descent with energy. arXiv:2010.05109.
  • L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2020) On the variance of the adaptive learning rate and beyond. In Proceedings of the Eighth International Conference on Learning Representations (ICLR).
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. arXiv:1711.05101.
  • L. Luo, Y. Xiong, Y. Liu, and X. Sun (2019) Adaptive gradient methods with dynamic bound of learning rate. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Y. Malitsky and K. Mishchenko (2017) Adaptive gradient descent without descent. CoRR.
  • M. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: the Penn Treebank.
  • Y. Nesterov (1983) A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27:372–376.
  • B. T. Polyak (1964) Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:791–803.
  • S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of Adam and beyond. In Proceedings of the International Conference on Learning Representations (ICLR).
  • T. Tieleman and G. Hinton (2012) RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
  • M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar (2018) Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems.
  • M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. CoRR.
  • J. Zhuang, T. Tang, Y. Ding, S. Tatikonda, N. Dvornek, X. Papademetris, and J. Duncan (2020) AdaBelief optimizer: adapting stepsizes by the belief in observed gradients. In Advances in Neural Information Processing Systems (NeurIPS).

Appendix A Proof of Lemma 1

Proof.

Because , , then

If is bounded, is bounded. Therefore, at least one of and is bounded.

First, consider is bounded and is unbounded, the only possible case is that

Obviously, this case doesn’t exist because if , must have gradients with a limit of .

Second, consider is bounded and is unbounded, the only possible case is that

However, if has gradients with a limit of 0, must be finite. Thus, this case does not exist either.

Finally, the only remaining case is that both terms are bounded. This proves that if $f$ has bounded gradients, then $g$ has bounded gradients too and $g$ is bounded in the feasible region. ∎

Appendix B Proof of Theorem 1

Proof.

We first replace the element-wise product with the corresponding diagonal matrix and obtain

where . If AMS is true, this matrix is monotonically decreasing. We aim to minimize the following regret:

Let be the optimal solution, the above regret is

Because

we have

where the last inequality follows from the Cauchy–Schwarz inequality and Young's inequality.

Thus, we obtain

and the regret is as follows:

We divide the right-hand side into three parts:

Consider part 1 and apply Assumption 2:

Then, part 2 is as follows:

Finally, we give the upper bound of part 3 by applying Assumption 2:

Hence, we get the final regret bound:

Appendix C Proof of Corollary 1

Proof.

The inequalities yield

Appendix D Proof of Theorem 2

We follow the proof in [3] and recall Theorem 3.1 from [3].

Theorem 3.1

[3] For an Adam-type method under the following assumptions:

  • The objective is lower bounded and differentiable.

  • Both the true and noisy gradients are bounded.

  • The noise in the stochastic gradients is unbiased and independent across iterations.

Assume further that the relevant parameter sequence is non-increasing; then:

where $C_1$, $C_2$, $C_3$ are constants independent of $d$ and $T$, and $C_4$ is a constant independent of $T$.

We note that Theorem 3.1 in [3] gives the convergence bound for generalized Adam [3]. However, this general framework does not require the EMA mechanism (i.e., squared gradients); it covers a more general form of adaptivity built from first order gradients. DecGD belongs to this general framework, with the adaptive term of the framework corresponding to the loss based vector in DecGD. Therefore, we can apply the above theorem to our proof for DecGD.
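For context, the generalized Adam framework of [3] takes roughly the following form; this restatement is our paraphrase from memory, not a quotation from [3]:

```latex
% Generalized Adam of Chen et al. (2019), paraphrased: the adaptive term
% \hat{v}_t may be any function of the past gradients, not necessarily an
% exponential moving average of their squares (the EMA mechanism).
\[
\begin{aligned}
m_t &= \beta_{1,t}\, m_{t-1} + (1-\beta_{1,t})\, g_t, \\
\hat{v}_t &= h_t(g_1, \dots, g_t), \\
x_{t+1} &= x_t - \alpha_t\, m_t / \sqrt{\hat{v}_t}.
\end{aligned}
\]
```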

Proof.

According to the above theorem and Assumption 4, we obtain

where $C_1$, $C_2$, $C_3$ are constants independent of $d$ and $T$, and $C_4$ is a constant independent of $T$.

We divide the right-hand side into three parts: