Domain-independent Dominance of Adaptive Methods

12/04/2019 ∙ by Pedro Savarese, et al. ∙ The University of Chicago, Toyota Technological Institute at Chicago

From a simplified analysis of adaptive methods, we derive AvaGrad, a new optimizer which outperforms SGD on vision tasks when its adaptability is properly tuned. We observe that the power of our method is partially explained by a decoupling of learning rate and adaptability, greatly simplifying hyperparameter search. In light of this observation, we demonstrate that, against conventional wisdom, Adam can also outperform SGD on vision tasks, as long as the coupling between its learning rate and adaptability is taken into account. In practice, AvaGrad matches the best results, as measured by generalization accuracy, delivered by any existing optimizer (SGD or adaptive) across image classification (CIFAR, ImageNet) and character-level language modelling (Penn Treebank) tasks. This latter observation, together with AvaGrad's decoupling of hyperparameters, could make it the preferred optimizer for deep learning, replacing both SGD and Adam.

1 Introduction

Deep network architectures are becoming increasingly complex, often containing parameters that can be grouped according to multiple functionalities, such as gating, attention, convolution, and generation. Such parameter groups should arguably be treated differently during training, as their gradient statistics might be highly distinct. Adaptive gradient methods designate parameter-wise learning rates based on gradient histories, treating such parameter groups differently and, in principle, promising to be better suited for training complex neural network architectures.

Nonetheless, advances in neural architectures have not been matched by progress in adaptive gradient descent algorithms. SGD is still prevalent, in spite of the development of seemingly more sophisticated adaptive alternatives, such as RMSProp (Dauphin et al., 2015) and Adam (Kingma & Ba, 2015). Such adaptive methods have been observed to yield poor generalization compared to SGD in classification tasks (Wilson et al., 2017), and hence have been mostly adopted for training complex models (Vaswani et al., 2017; Arjovsky et al., 2017). For relatively simple architectures, such as ResNets (He et al., 2016a) and DenseNets (Huang et al., 2017), SGD is still the dominant choice.

At a theoretical level, concerns have also emerged about the current crop of adaptive methods. Recently, Reddi et al. (2018) identified cases, even in the stochastic convex setting, where Adam (Kingma & Ba, 2015) fails to converge. Modifications to Adam that provide convergence guarantees have been formulated, but have shortcomings. AMSGrad (Reddi et al., 2018) requires non-increasing learning rates, while AdamNC (Reddi et al., 2018) and AdaBound (Luo et al., 2019) require that adaptivity be gradually eliminated during training. Moreover, while most of the recently proposed variants do not provide formal guarantees for non-convex problems, the few convergence rate analyses currently in the literature (Zaheer et al., 2018; Chen et al., 2019) do not match SGD's. Section 3 details the convergence rates of the most popular Adam variants, along with their shortcomings.

Our contribution is marked improvements to adaptive optimizers, from both theoretical and practical perspectives. At the theoretical level, we focus on convergence guarantees, deriving new algorithms:


  • Delayed Adam. Inspired by Zaheer et al. (2018)'s analysis of Adam, Section 4 proposes a simple modification for adaptive gradient methods which yields a provable convergence rate of $O(1/\sqrt{T})$ in the stochastic non-convex setting – the same as SGD. Our modification can be implemented by swapping two lines of code and preserves adaptivity without incurring extra memory costs. To illustrate these results, we present a non-convex problem where Adam fails to converge to a stationary point, while Delayed Adam – Adam with our proposed modification – provably converges at a rate of $O(1/\sqrt{T})$.

  • AvaGrad. Inspecting the convergence rate of Delayed Adam, we show that it would improve with an adaptive global learning rate, which self-regulates based on global statistics of the gradient second moments. Following this insight, Section 5 proposes a new adaptive method, AvaGrad, whose hyperparameters decouple learning rate and adaptability.

Through extensive experiments, Section 6 demonstrates that AvaGrad is not merely a theoretical exercise. AvaGrad performs as well as both SGD and Adam in their respectively favored usage scenarios. Along this experimental journey, we happen to disprove some conventional wisdom, finding adaptive optimizers, including Adam, to be superior to SGD for training CNNs. The caveat is that, excepting AvaGrad, these methods are sensitive to hyperparameter values. AvaGrad is a uniquely attractive adaptive optimizer, yielding near best results over a wide range of hyperparameters. Implementation can be found at https://github.com/lolemacs/avagrad.

2 Preliminaries

2.1 Notation

For vectors $u, v \in \mathbb{R}^d$, we use the following notation: $u / v$ for element-wise division, $\sqrt{v}$ for element-wise square root, $u + v$ for element-wise addition, and $u \odot v$ for element-wise multiplication. Moreover, $\|v\|$ is used to denote the $\ell_2$-norm of $v$; other norms will be specified whenever used (e.g., $\|v\|_\infty$).

For subscripts and vector indexing, we adopt the following convention: the subscript $t$ is used to denote an object related to the $t$-th iteration of an algorithm (e.g., $w_t$ denotes the iterate at time step $t$); the subscript $i$ is used for indexing: $w_i$ denotes the $i$-th coordinate of $w$. When used together, $t$ precedes $i$: $w_{t,i}$ denotes the $i$-th coordinate of $w_t$.
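As a minimal illustration of this notation (a sketch with variable names of our own choosing, not code from the paper), the operations above correspond to the following NumPy expressions:

```python
import numpy as np

u = np.array([1.0, 4.0, 9.0])
v = np.array([4.0, 1.0, 16.0])

div = u / v                          # element-wise division
root = np.sqrt(v)                    # element-wise square root
added = u + v                        # element-wise addition
prod = u * v                         # element-wise (Hadamard) product
l2 = np.linalg.norm(v)               # the default (l2) norm
linf = np.linalg.norm(v, np.inf)     # other norms are written explicitly when used

# Indexing convention: w[t] plays the role of the iterate w_t, and w[t][i]
# plays the role of its i-th coordinate w_{t,i} (0-indexed here, 1-indexed
# in the paper).
w = [np.zeros(3) for _ in range(4)]
w[2][1] = 1.0
```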

2.2 Stochastic Non-Convex Optimization

In the stochastic non-convex setting, we are concerned with the optimization problem

(1)   $\min_{w \in \mathbb{R}^d} f(w) := \mathbb{E}_{s \sim \mathcal{D}} \left[ f_s(w) \right],$

where $\mathcal{D}$ is a probability distribution over a set $\mathcal{S}$ of “data points”. We also assume that $f_s$ is $L$-smooth in $w$, as is typically done in non-convex optimization:

(2)   $\| \nabla f_s(w) - \nabla f_s(w') \| \le L \| w - w' \| \quad \text{for all } w, w' \in \mathbb{R}^d, \ s \in \mathcal{S}.$

Methods for stochastic non-convex optimization are evaluated in terms of the number of iterations, or gradient evaluations, required to achieve small loss gradients. This differs from the stochastic convex setting, where convergence is measured w.r.t. suboptimality $f(w_t) - \min_w f(w)$. We assume that the algorithm takes a sequence of data points $(s_1, \dots, s_T)$, from which it deterministically computes a sequence of parameter settings $(w_1, \dots, w_T)$ together with a distribution $P_T$ over $\{1, \dots, T\}$. We say an algorithm has a convergence rate of $O(g(T))$ if $\mathbb{E}\left[ \| \nabla f(w_k) \|^2 \right] \le O(g(T))$, where $k \sim P_T$ and, as defined above, $f(w) = \mathbb{E}_{s \sim \mathcal{D}}[f_s(w)]$.
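A small sketch of how this convergence criterion can be evaluated, assuming access to the full gradient of $f$; the function and argument names here are illustrative and not taken from the paper:

```python
import numpy as np

def expected_grad_norm_sq(grad_f, iterates, probs):
    """E_{k ~ P_T}[ ||grad f(w_k)||^2 ]: the criterion used in the stochastic
    non-convex setting. `grad_f` returns the full gradient of f, `iterates`
    holds w_1, ..., w_T, and `probs` is the distribution P_T over iterates."""
    norms_sq = np.array([np.linalg.norm(grad_f(w)) ** 2 for w in iterates])
    return float(np.dot(probs, norms_sq))

# Example on f(w) = 0.5 * ||w||^2 (so grad f(w) = w) with a uniform P_T:
iterates = [np.array([1.0, 1.0]), np.array([0.5, 0.5]), np.array([0.1, 0.1])]
uniform = np.full(len(iterates), 1.0 / len(iterates))
print(expected_grad_norm_sq(lambda w: w, iterates, uniform))
```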

We also assume that the functions $f_s$ have bounded gradients: there exists some $G_\infty$ such that $\|\nabla f_s(w)\|_\infty \le G_\infty$ for all $w \in \mathbb{R}^d$ and $s \in \mathcal{S}$. Throughout the paper, we also let $G_2$ denote an upper bound on $\|\nabla f_s(w)\|$.

3 Related Work

Here we present a brief overview of optimization methods commonly used for training neural networks, along with their convergence rate guarantees for stochastic smooth non-convex problems. We consider methods which, at each iteration $t$, receive or compute a gradient estimate

(3)   $g_t := \nabla f_{s_t}(w_t), \quad s_t \sim \mathcal{D},$

and perform an update of the form

(4)   $w_{t+1} = w_t - \alpha_t \cdot \eta_t \odot m_t,$

where $\alpha_t \in \mathbb{R}$ is the global learning rate, $\eta_t \in \mathbb{R}^d$ are the parameter-wise learning rates, and $m_t$ is the update direction, typically defined as

(5)   $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t.$

Non-momentum methods such as SGD, AdaGrad, and RMSProp (Dauphin et al., 2015; Duchi et al., 2011) have $\beta_1 = 0$ (i.e., $m_t = g_t$), while momentum SGD and Adam (Kingma & Ba, 2015) have $\beta_1 > 0$. Note that while $\alpha_t$ can always be absorbed into $\eta_t$, representing the update in this form will be convenient throughout the paper.

SGD uses the same learning rate for all parameters, i.e., $\eta_{t,i} = 1$ for all $i$. Although SGD is simple and offers no adaptation, it has a convergence rate of $O(1/\sqrt{T})$ with either constant, increasing, or decreasing learning rates (Ghadimi & Lan, 2013), and is widely used when training deep networks, especially CNNs (He et al., 2016a; Huang et al., 2017). At the heart of its convergence proof is the fact that $\mathbb{E}_{s_t}\left[ \eta_t \odot g_t \right] = \eta_t \odot \nabla f(w_t)$.

Popular adaptive methods such as RMSProp (Dauphin et al., 2015), AdaGrad (Duchi et al., 2011), and Adam (Kingma & Ba, 2015) have $\eta_t = 1 / (\sqrt{v_t} + \epsilon)$, where $v_t$ is given by

(6)   $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

(for AdaGrad, $v_t$ is instead the running sum of squared gradients). As $v_t$ is an estimate of the second moments of the gradients, the optimizer designates smaller learning rates for parameters with larger uncertainty in their stochastic gradients. However, in this setting $\eta_t$ and $g_t$ are no longer independent, hence $\mathbb{E}_{s_t}\left[ \eta_t \odot g_t \right] \neq \mathbb{E}_{s_t}\left[ \eta_t \right] \odot \nabla f(w_t)$ in general. This “bias” can cause RMSProp and Adam to present convergence issues, even in the stochastic convex setting (Reddi et al., 2018).
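A minimal sketch of the generic update in Equations (3)-(6), with bias correction and other optimizer-specific details omitted; setting $\beta_1 = 0$ recovers RMSProp-style methods, and replacing $\eta_t$ by an all-ones vector recovers SGD. The function name is ours:

```python
import numpy as np

def adaptive_step(w, grad, m, v, alpha, beta1, beta2, eps):
    """One step of the generic update in Equations (3)-(6), no bias correction:

    m_t     = beta1 * m_{t-1} + (1 - beta1) * g_t       # update direction, Eq. (5)
    v_t     = beta2 * v_{t-1} + (1 - beta2) * g_t**2    # second-moment estimate, Eq. (6)
    eta_t   = 1 / (sqrt(v_t) + eps)                     # parameter-wise learning rates
    w_{t+1} = w_t - alpha * eta_t * m_t                 # Eq. (4)
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    eta = 1.0 / (np.sqrt(v) + eps)
    w = w - alpha * eta * m
    return w, m, v
```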

Recently, Zaheer et al. (2018) showed that, with a constant learning rate, RMSProp and Adam have a convergence rate of $O(1/T + \sigma^2)$, where $\sigma^2$ bounds the variance of the stochastic gradients, hence their result does not generally guarantee convergence. Chen et al. (2019) showed that AdaGrad and AMSGrad enjoy a convergence rate of $O(\log T / \sqrt{T})$ when a decaying learning rate is used. Note that both methods constrain $\eta_t$ in some form: the former because its accumulated second moments cause adaptability to diminish with $t$, and the latter because it explicitly enforces $\hat{v}_t \ge \hat{v}_{t-1}$ for all $t$ ($\eta_t$ is point-wise non-increasing). In both cases, the method is less adaptive than Adam, and yet the analyses so far have not delivered a convergence rate that matches SGD's.

4 SGD-like Convergence without Constrained Rates

Input: $w_1 \in \mathbb{R}^d$, step sizes $\{\alpha_t\}_{t \ge 1}$, $\beta_1, \beta_2 \in [0, 1)$, $\epsilon > 0$

1: Set $m_0 = 0$, $v_0 = 0$
2: for $t = 1$ to $T$ do
3:     Draw $s_t \sim \mathcal{D}$
4:     Compute $g_t = \nabla f_{s_t}(w_t)$
5:     $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
6:     $\eta_t = 1 / (\sqrt{v_{t-1}} + \epsilon)$
7:     $w_{t+1} = w_t - \alpha_t \cdot \eta_t \odot m_t$
8:     $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
9: end for
Algorithm 1 Delayed Adam. The modification relative to Adam is that $\eta_t$ is computed from $v_{t-1}$, i.e., $v_t$ is updated only after the parameter update.
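A minimal NumPy sketch of the difference between Adam and Delayed Adam as described in Algorithm 1; bias correction is omitted and the function names are ours (a sketch, not the paper's reference implementation):

```python
import numpy as np

def adam_step(w, grad, m, v, alpha, beta1, beta2, eps):
    # Standard Adam without bias correction: eta_t is computed from v_t, which
    # depends on the current sample through g_t, so eta_t and g_t are correlated.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    w = w - alpha * m / (np.sqrt(v) + eps)
    return w, m, v

def delayed_adam_step(w, grad, m, v, alpha, beta1, beta2, eps):
    # Delayed Adam: the parameter update and the v-update are swapped, so the
    # step uses v_{t-1} and the parameter-wise learning rates are independent
    # of the current sample s_t.
    m = beta1 * m + (1.0 - beta1) * grad
    w = w - alpha * m / (np.sqrt(v) + eps)      # uses v_{t-1}
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # v_t computed only after the update
    return w, m, v
```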

We first take a step back to note the following: to show that Adam might not converge in the stochastic convex setting, Reddi et al. (2018) provide a stochastic linear problem where Adam fails to converge w.r.t. suboptimality. Since non-convex optimization is evaluated w.r.t. the norm of the gradients, a different instance is required to characterize Adam's behavior in this setting.

The following result shows that even for a quadratic problem, Adam indeed does not converge to a stationary point:

Theorem 1.

For any $\epsilon \ge 0$ and constant $\beta_2 \in [0, 1)$, there is a stochastic convex optimization problem for which Adam does not converge to a stationary point.

Proof.

The full proof is given in Appendix A. The argument closely follows Reddi et al. (2018); we explicitly present a stochastic optimization problem:

(7)

We show that, for a suitable choice of the problem's constants (as functions of $\beta_2$ and $\epsilon$), Adam moves in expectation towards a point at which the gradient is non-zero, and that the feasibility constraint does not make that point stationary. ∎

This result, like the one in Reddi et al. (2018), relies on the fact that $\eta_t$ and $g_t$ are correlated: upon a draw of the rare sample, the parameter-wise learning rates decrease significantly and Adam takes a small step in the correct direction. On the other hand, a sequence of common samples increases $\eta_t$ and Adam takes larger steps in the incorrect direction.

Instead of enforcing $\eta_t$ to be point-wise non-increasing in $t$ (Reddi et al., 2018), which forces the optimizer to take small steps even for a long sequence of common samples, we propose to simply have $\eta_t$ be independent of $s_t$. As extra motivation for this approach, note that successful proof strategies (Zaheer et al., 2018) for analyzing adaptive methods include the following step:

(8)   $\mathbb{E}_{s_t}\!\left[ \langle \nabla f(w_t), \eta_t \odot g_t \rangle \right] = \langle \nabla f(w_t), \eta_{t-1} \odot \nabla f(w_t) \rangle + \mathbb{E}_{s_t}\!\left[ \langle \nabla f(w_t), (\eta_t - \eta_{t-1}) \odot g_t \rangle \right],$

where bounding the second term, seen as a form of bias, is a key part of recent convergence analyses. Replacing $v_t$ by $v_{t-1}$ in the update equation of Adam removes this bias and can be implemented by simply swapping two lines of code (updating $v_t$ only after $w_{t+1}$), yielding a simple convergence analysis without hindering the adaptability of the method in any way. Algorithm 1 provides pseudo-code for Adam with this modification applied, yielding Delayed Adam. The following Theorem shows that this modification is enough to guarantee an SGD-like convergence rate of $O(1/\sqrt{T})$ in the stochastic non-convex setting for general adaptive gradient methods.

Theorem 2.

Consider any optimization method which updates parameters as follows:

(9)   $w_{t+1} = w_t - \alpha_t \cdot \eta_t \odot g_t,$

where $\alpha_t > 0$, $\eta_t > 0$ (element-wise), and $\alpha_t$, $\eta_t$ are independent of $s_t$.

Assume that $f$ is $L$-smooth and that $\|\nabla f_s(w)\|_\infty \le G_\infty$ for all $w \in \mathbb{R}^d$ and $s \in \mathcal{S}$. Moreover, let $f^\star = \min_w f(w)$.

For a fixed learning rate $\alpha_t = \alpha$, if $\eta_{t,i} > 0$ for all $t$ and $i$, then:

(10)

where $P_T$ assigns probabilities to the iterates $1, \dots, T$ from which $k$ is drawn, as in Section 2.2.

Proof.

The full proof is given in Appendix B, along with the analysis for the case with momentum ($\beta_1 > 0$) in Appendix B.1, which yields a similar rate. ∎

The convergence rate depends on the parameter-wise learning rates $\eta_t$, which are random variables for Adam-like algorithms. However, if there are constants $c_\ell$ and $c_u$ such that $c_\ell \le \eta_{t,i} \le c_u$ for all $t$ and $i$, then a rate of $O(1/\sqrt{T})$ is guaranteed. This is the case for Delayed Adam, where $\frac{1}{G_\infty + \epsilon} \le \eta_{t,i} \le \frac{1}{\epsilon}$ for all $t$ and $i$. Theorem 2 also requires that $\alpha_t$ and $\eta_t$ are independent of $s_t$, which can be assured to hold by applying a “delay” to their respective computations, if necessary (i.e., replacing $v_t$ by $v_{t-1}$, as in Delayed Adam).
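For concreteness, a short derivation of the stated bounds on $\eta_{t,i}$ for Delayed Adam, using only the bounded-gradient assumption from Section 2.2 (a sketch of the argument, not the appendix proof):

```latex
% Bounds on the parameter-wise learning rates of Delayed Adam.
% Since v_{t-1,i} is an exponentially weighted average of squared gradient
% coordinates with v_0 = 0, and |g_{k,i}| <= G_inf for every k, we have
% 0 <= v_{t-1,i} <= G_inf^2, and therefore:
\begin{align*}
0 \le v_{t-1,i} \le G_\infty^2
  \;\Longrightarrow\; \epsilon \le \sqrt{v_{t-1,i}} + \epsilon \le G_\infty + \epsilon
  \;\Longrightarrow\; \frac{1}{G_\infty + \epsilon}
      \le \eta_{t,i} = \frac{1}{\sqrt{v_{t-1,i}} + \epsilon}
      \le \frac{1}{\epsilon}.
\end{align*}
```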

Additionally, the assumption that a single sample should not affect the distribution $P_T$ is required since $P_T$ is conditioned on the samples (unlike in standard SGD analyses, where $P_T$ is deterministic), and it is expected to hold as $T \to \infty$. Practitioners typically use the last iterate or perform early stopping; in this case, whether the assumption holds or not does not affect the behavior of the algorithm. Nonetheless, we also show in Appendix B.2 a similar result that does not require this assumption, and which also yields a $O(1/\sqrt{T})$ convergence rate provided that the parameter-wise learning rates are bounded from above and below.

5 AvaGrad: An Adaptive Method with Adaptive Variance

Input: $w_1 \in \mathbb{R}^d$, step sizes $\{\alpha_t\}_{t \ge 1}$, $\beta_1, \beta_2 \in [0, 1)$, $\epsilon > 0$

1: Set $m_0 = 0$, $v_0 = 0$
2: for $t = 1$ to $T$ do
3:     Draw $s_t \sim \mathcal{D}$
4:     Compute $g_t = \nabla f_{s_t}(w_t)$
5:     $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
6:     $\eta_t = 1 / (\sqrt{v_{t-1}} + \epsilon)$
7:     $w_{t+1} = w_t - \alpha_t \cdot \frac{\eta_t}{\|\eta_t\|_2 / \sqrt{d}} \odot m_t$
8:     $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
9: end for
Algorithm 2 AvaGrad. The proposed scaling normalizes the parameter-wise learning rates $\eta_t$ by $\|\eta_t\|_2 / \sqrt{d}$ in the update (line 7).

Now, we consider the implications of Theorem 2 for Delayed Adam, where $\eta_{t,i} = 1 / (\sqrt{v_{t-1,i}} + \epsilon)$, and hence $\frac{1}{G_\infty + \epsilon} \le \eta_{t,i} \le \frac{1}{\epsilon}$ for all $t$ and $i$.

For a fixed learning rate $\alpha$, chosen a priori (that is, without knowledge of the realization of $v_t$), we can optimize $\alpha$ to minimize the worst-case rate using the bounds $\frac{1}{G_\infty + \epsilon} \le \eta_{t,i} \le \frac{1}{\epsilon}$. This yields a learning rate that scales linearly with $\epsilon$ and a convergence rate whose constant grows with $G_\infty / \epsilon$, suggesting that, at least in the worst case, $\epsilon$ should be chosen to be as large as possible, and the learning rate should scale linearly with $\epsilon$ (hence also being large).

What if we allow $\alpha_t$ to vary at each time step? For example, choosing $\alpha_t \propto \sqrt{d} / \|\eta_t\|_2$ yields a convergence rate whose dependence on the parameter-wise learning rates differs from the fixed-$\alpha$ case in a few aspects. Most notably, if we consider different scalings of the gradients (e.g., small second moments and varying $\epsilon$), the convergence rate with fixed $\alpha$ can get arbitrarily worse, while with the time-varying choice it remains unchanged. In particular, when $\epsilon$ is either very large or very small, the fixed-$\alpha$ rate carries a dependence on $\epsilon$ that can become large, while the time-varying choice does not. Lastly, multiplying the learning rate by $\sqrt{d} / \|\eta_t\|_2$ removes its dependence on $\epsilon$ in the worst-case setting, making the two hyperparameters more separable.

The choice $\alpha_t = \alpha \cdot \sqrt{d} / \|\eta_t\|_2$, motivated by the above facts, yields a method which we name AvaGrad – Adaptive VAriance Gradients – presented as pseudo-code in Algorithm 2, with the proposed scaling in the update step. We call it an “adaptive variance” method since, if we scale the variance of the gradients up or down, and hence also $v_t$ and $\eta_t$, the convergence guarantee in Theorem 2 does not change, while for a fixed learning rate (as is commonly done in practice, except for discrete decays during training (Zagoruyko & Komodakis, 2016; Merity et al., 2018)) it can get arbitrarily bad.
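A minimal NumPy sketch of one AvaGrad-style step, under our reading that the proposed scaling normalizes $\eta_t$ by $\|\eta_t\|_2 / \sqrt{d}$; bias correction is omitted, the function name is ours, and the official implementation at https://github.com/lolemacs/avagrad should be treated as authoritative:

```python
import numpy as np

def avagrad_step(w, grad, m, v, alpha, beta1, beta2, eps):
    # AvaGrad-style step (sketch): eta_t is computed from v_{t-1}, as in
    # Delayed Adam, and then normalized by ||eta_t||_2 / sqrt(d), so that the
    # overall scale of the parameter-wise learning rates is factored out of
    # the effective step size.
    d = w.size
    m = beta1 * m + (1.0 - beta1) * grad
    eta = 1.0 / (np.sqrt(v) + eps)                       # uses v_{t-1}
    eta_hat = eta / (np.linalg.norm(eta) / np.sqrt(d))   # RMS-normalized rates
    w = w - alpha * eta_hat * m
    v = beta2 * v + (1.0 - beta2) * grad ** 2            # v_t updated afterwards
    return w, m, v
```

The normalization keeps the root-mean-square of the effective parameter-wise rates at one, which is why rescaling the gradient variance (and hence $v_t$) leaves the step size essentially unchanged.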

6 Experiments

6.1 Synthetic Data

To illustrate empirically the implications of Theorem 1 and Theorem 2, we set up a synthetic stochastic optimization problem with the same form as the one used in the proof of Theorem 1:

(11)

This function has a stationary point, and it satisfies the conditions of Theorem 1 for our choice of constants. We proceed to perform stochastic optimization with Adam, AMSGrad, and Delayed Adam, using a constant learning rate. For simplicity, we let $P_T$ be uniform over $\{1, \dots, T\}$, since the learning rate is constant.

Figure 1: Plots of Adam, AMSGrad, and Delayed Adam trained on the synthetic example in Equation 11, which has a stationary point. Left: the expected iterate $\mathbb{E}[w_k]$, with $k$ sampled uniformly from $\{1, \dots, t\}$, for each iteration $t$. As predicted by our theoretical results, Adam moves away from the stationary point, while Delayed Adam converges to it. Right: the expected squared norm of the gradient, for $k$ sampled uniformly from $\{1, \dots, t\}$. Delayed Adam converges significantly faster than AMSGrad, while Adam fails to converge.

Figure 1 shows the progress of the expected iterate and of the expected squared gradient norm at each iteration $t$: as expected, Adam fails to converge to the stationary point, while both AMSGrad and Delayed Adam converge. Note that Delayed Adam converges significantly faster, likely because it places no constraint on the parameter-wise learning rates.

6.2 Image Classification with CNNs

Our theory suggests that, in the worst case, $\epsilon$ should be chosen as large as possible, at which point the learning rate should scale linearly with it. As a first experiment to assess this hypothesis, we analyze the interaction between $\alpha$ and $\epsilon$ when training a Wide ResNet 28-4 (Zagoruyko & Komodakis, 2016) on the CIFAR-10 dataset (Krizhevsky, 2009). Our training follows Zagoruyko & Komodakis (2016): images are channel-wise normalized, randomly cropped, and horizontally flipped during training. The learning rate is decayed by a factor of 5 at epochs 60, 120, and 160, and the model is trained for a total of 200 epochs with weight decay. Appendix C describes additional experimental details.

We use a validation set of 5,000 images to evaluate the performance of SGD and different adaptive gradient methods: Adam, AMSGrad, AdaBound (Luo et al., 2019; Savarese, 2019), AdaShift (Zhou et al., 2019), and our proposed algorithm, AvaGrad. We also assess whether performing weight decay as proposed in Loshchilov & Hutter (2019), instead of standard $\ell_2$ regularization, positively impacts the performance of adaptive methods: we do this by evaluating AdamW and AvaGradW.

We run each adaptive method with $\epsilon$ set to different powers of 10, from its default value up to a value large enough that adaptability should be almost completely removed from the algorithm. We also vary the learning rate of each method over different powers of 10, each scaled by a small set of multipliers. Figure 2 shows the results for Adam and AvaGrad. Our main findings are twofold:


  • The optimal $\epsilon$ for every adaptive method is considerably larger than the values typically used in practice, with Adam, AMSGrad, and AvaGradW favoring large values and AvaGrad and AdamW favoring even larger ones. For Adam and AMSGrad, the optimal learning rate is correspondingly much larger than the default.

  • All adaptive methods, except for AdaBound, outperform SGD in terms of validation performance. Note that for SGD the optimal learning rate is 0.1, matching the value used in work such as He et al. (2016a); Zagoruyko & Komodakis (2016); Xie et al. (2017), which presented state-of-the-art results at time of publication.

Figure 2: Validation error of a Wide ResNet 28-4 trained on the CIFAR-10 dataset with Adam (left) and AvaGrad (right), for different values of the learning rate $\alpha$ and parameter $\epsilon$, where larger $\epsilon$ yields less adaptability. Best performance is achieved with low adaptability (large $\epsilon$).

However, the fact that adaptive methods outperform SGD in this setting is not conclusive, since they are executed with more hyperparameter settings (varying $\epsilon$ as well as $\alpha$). Moreover, the main motivation for adaptive methods is to be less sensitive to hyperparameter values; performing an extensive grid search defeats their purpose.

Aiming for a fair comparison between SGD and adaptive methods, we also train a Wide ResNet 28-10 on both CIFAR-10 and CIFAR-100, evaluating the test performance of each adaptive method with the optimal values for $\alpha$ and $\epsilon$ found in the previous experiment. For SGD, we confirmed that a learning rate of 0.1 still yielded the best validation performance with the new architecture, hence the fact that we transfer hyperparameters from the Wide ResNet 28-4 runs does not unfairly advantage adaptive methods in the comparison with SGD. With a larger network and a different task (CIFAR-100), this experiment should also capture how hyperparameters of adaptive methods transfer between tasks and models.

On CIFAR-10, SGD achieves 3.86% test error (comparable to the error reported in Zagoruyko & Komodakis (2016)) and is outperformed by both Adam (3.64%) and AvaGrad (3.80%). On CIFAR-100, SGD (19.05%) is outperformed by Adam (18.96%), AMSGrad (18.97%), AdaShift (18.88%), AvaGrad (18.76%), and AvaGradW (19.04%). We believe these results are surprising, as they show that adaptive methods can yield state-of-the-art performance when training CNNs as long as their adaptability is correctly controlled via $\epsilon$.

As a final evaluation of the role of adaptability when training convolutional networks, we repeat the previous experiment on the ImageNet dataset (Russakovsky et al., 2015), training a ResNet 50 (He et al., 2016b) with SGD and different adaptive methods, transferring the hyperparameters from our original CIFAR-10 results. Training follows Gross & Wilber (2016): the network is trained for 100 epochs, the learning rate is decayed by a factor of 10 at epochs 30, 60, and 90, and weight decay is applied. SGD yields 24.01% top-1 validation error, underperforming Adam (23.45%), AMSGrad (23.46%), AvaGrad (23.58%), and AvaGradW (23.49%). Again, note that the hyperparameters used for SGD match the ones in He et al. (2016a), He et al. (2016b), and Gross & Wilber (2016): an initial learning rate of 0.1 with a momentum of 0.9. Table 1 summarizes these results.

Method | CIFAR-10 (Test Err %) | CIFAR-100 (Test Err %) | ImageNet (Top-1 Val Err %) | Penn Treebank (Test Bits per Character)
SGD | 3.86 (0.08) | 19.05 (0.24) | 24.01 | 1.238
Adam | 3.64 (0.06) | 18.96 (0.21) | 23.45 | 1.182
AMSGrad | 3.90 (0.17) | 18.97 (0.09) | 23.46 | 1.187
AdaBound | 5.40 (0.24) | 22.76 (0.17) | 27.99 | 2.863
AdaShift | 4.08 (0.11) | 18.88 (0.06) | N/A | 1.274
AdamW | 4.11 (0.17) | 20.13 (0.22) | 27.10 | 1.230
AvaGrad | 3.80 (0.02) | 18.76 (0.20) | 23.58 | 1.179
AvaGradW | 3.97 (0.02) | 19.04 (0.37) | 23.49 | 1.175

Table 1: Test performance of SGD and popular adaptive methods on benchmark tasks. Red indicates results with the recommended optimizer, following the paper that proposed each model, and improved performance is given in blue. The best result for each task is in bold, and numbers in parentheses are standard deviations over 3 runs on CIFAR.

6.3 Language Modelling with RNNs

It is perhaps not very surprising that, to perform optimally in the image classification tasks studied previously, adaptive gradient methods required large values of $\epsilon$, and hence were barely adaptive. Here, we consider a task where state-of-the-art results are typically achieved by adaptive methods with low values of $\epsilon$: language modelling with recurrent networks. In particular, we perform character-level language modelling on the Penn Treebank dataset (Marcus et al., 1994; Mikolov et al., 2010) with LSTMs, following Merity et al. (2018). The model is trained for 500 epochs, and the learning rate is decayed at epochs 300 and 400. A batch size of 128 and a BPTT length of 150 are used, along with weight decay and dropout.

Figure 3: Validation bits-per-character (lower is better) of a 3-layer LSTM with 300 hidden units, trained on the Penn Treebank dataset with Adam (left) and AvaGrad (right), for different values of the learning rate $\alpha$ and parameter $\epsilon$, where larger $\epsilon$ yields less adaptability. Best performance is achieved with high adaptability (small $\epsilon$).

We evaluate the validation performance of SGD, Adam, AMSGrad, AdaShift, AdaBound, AdamW, AvaGrad, and AvaGradW with varying learning rate $\alpha$ and adaptability parameter $\epsilon$, when training a 3-layer LSTM with 300 hidden units in each layer. Figure 3 shows that, in this task, smaller values of $\epsilon$ are indeed optimal: Adam, AMSGrad, and AvaGrad performed best with small $\epsilon$. The optimal learning rate for both Adam and AMSGrad agrees with the value used in Merity et al. (2018). AvaGrad and AvaGradW likewise performed best with small $\epsilon$, though at different learning rates.

Next, we train a 3-layer LSTM with 1000 hidden units per layer (the same model used in Merity et al. (2018), where it was trained with Adam), choosing the values of $\alpha$ and $\epsilon$ which yielded the best validation performance in the previous experiment. For SGD, we again confirmed that its previously chosen learning rate performed best on the validation set. Table 1 (right column) reports all results. In this setting, AvaGrad and AvaGradW outperform Adam, achieving test bits-per-character of 1.179 and 1.175, respectively, compared to 1.182 for Adam. The poor performance of AdaBound could be caused by convergence issues or by the default values of its hyperparameters: it was shown in Savarese (2019) that the bound functions strongly affect the optimizer's behavior and might require careful tuning.

6.4 Hyperparameter Separability and Domain-Independence

One of the main motivations behind AvaGrad is that it removes the dependence between the learning rate $\alpha$ and the adaptability parameter $\epsilon$, at least in the worst-case rate of Theorem 2. Observing the heatmaps in Figures 2 and 3, we can see that AvaGrad indeed offers more separability between $\alpha$ and $\epsilon$ when compared to Adam: for AvaGrad, $\epsilon$ has little to no interaction with the learning rate $\alpha$, as opposed to Adam, where the optimal $\alpha$ increases linearly with $\epsilon$. For language modelling on Penn Treebank, the optimal learning rate for AvaGrad was the same for every choice of $\epsilon$, while for image classification on CIFAR-10 it was the same for all values of $\epsilon$ except one. This suggests that AvaGrad enables a grid search over $\alpha$ and $\epsilon$ (quadratic complexity) to be broken into two line searches over $\alpha$ and $\epsilon$ separately (linear complexity), as sketched below.
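The sketch below illustrates this practical consequence; `train_and_validate` is a hypothetical user-supplied routine (not part of the paper or its code release) that returns a validation error for a given configuration, and `anchor_lr` is a placeholder starting point:

```python
import itertools

def grid_search(train_and_validate, lrs, epss):
    # Joint search over (lr, eps): |lrs| * |epss| training runs (quadratic cost).
    return min(itertools.product(lrs, epss),
               key=lambda cfg: train_and_validate(lr=cfg[0], eps=cfg[1]))

def two_line_searches(train_and_validate, lrs, epss, anchor_lr):
    # Two independent line searches: |epss| + |lrs| runs (linear cost),
    # exploiting the observation that for AvaGrad the optimal learning rate
    # barely depends on eps.
    best_eps = min(epss, key=lambda e: train_and_validate(lr=anchor_lr, eps=e))
    best_lr = min(lrs, key=lambda l: train_and_validate(lr=l, eps=best_eps))
    return best_lr, best_eps
```

With $n$ candidate values per hyperparameter, the joint grid trains $n^2$ models, whereas the two line searches train $2n$.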

7 Conclusion

As neural architectures become more complex, with parameters having highly heterogeneous roles, parameter-wise learning rates are often necessary for training. However, adaptive methods have both theoretical and empirical gaps, with SGD outperforming them in some tasks and offering stronger theoretical convergence guarantees. In this paper, we close this gap by first providing a convergence rate guarantee that matches SGD's, and then by showing that, with proper hyperparameter tuning, adaptive methods can dominate in both computer vision and natural language processing tasks. Key to our finding is AvaGrad, our proposed optimizer, whose adaptability is decoupled from its learning rate.

Our experimental results show that proper tuning of the learning rate together with the adaptability of the method is necessary to achieve optimal results in different domains, where distinct neural network architectures are used across tasks. By enabling this tuning to be performed in linear time, AvaGrad takes a leap towards efficient domain-agnostic training of general neural architectures.

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv e-prints, 2017.
  • Chen et al. (2019) Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of adam-type algorithms for non-convex optimization. ICLR, 2019.
  • Dauphin et al. (2015) Yann Dauphin, Harm de Vries, Junyoung Chung, and Yoshua Bengio. RMSProp and equilibrated adaptive learning rates for non-convex optimization. CoRR, 2015.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
  • Ghadimi & Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013.
  • Gross & Wilber (2016) Sam Gross and Martin Wilber. Training and investigating residual nets. https://github.com/facebook/fb.resnet.torch, 2016.
  • He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016a.
  • He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. ECCV, 2016b.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. CVPR, 2017.
  • Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Loshchilov & Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019.
  • Luo et al. (2019) Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.
  • Marcus et al. (1994) Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The penn treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, 1994.
  • Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv e-prints, 2017.
  • Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An Analysis of Neural Language Modeling at Multiple Scales. arXiv e-prints, 2018.
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Honza Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. INTERSPEECH, 2010.
  • Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. ICLR, 2018.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. IJCV, 115(3), 2015.
  • Savarese (2019) Pedro Savarese. On the Convergence of AdaBound and its Connection to SGD. arXiv e-prints, 2019.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 2014.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv e-prints, 2017.
  • Wilson et al. (2017) Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The Marginal Value of Adaptive Gradient Methods in Machine Learning. NIPS, 2017.
  • Xie et al. (2017) Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CVPR, 2017.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. BMVC, 2016.
  • Zaheer et al. (2018) Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. NIPS, 2018.
  • Zhou et al. (2019) Zhiming Zhou, Qingru Zhang, Guansong Lu, Hongwei Wang, Weinan Zhang, and Yong Yu. Adashift: Decorrelation and convergence of adaptive learning rate methods. ICLR, 2019.

Appendix A Proof of Theorem 1

Proof.

Consider the following stochastic optimization problem:

(12)

where . Note that , and is minimized at .

The proof closely follows the linear example of Reddi et al. (2018), which concerns convergence in suboptimality. We assume w.l.o.g. that . Consider:

(13)
(14)

where the expectation is over all the randomness in the algorithm up to time $t$, as are all expectations to follow in the proof. Note that for . For we bound as follows:

(15)

Hence, .

As for , we have, from Jensen’s inequality:

(16)

Now, remember that , hence:

(17)

and thus:

(18)

Plugging in the bounds for and in Equation 14:

(19)

Hence, for large enough , and , while the above quantity becomes non-negative, and hence . In other words, Adam will, in expectation, drift away from the stationary point, towards , at which point . For example, implies that . To see that is not a stationary point due to the feasibility constraints, check that : that is, the negative gradient points towards the feasible region. ∎

Appendix B Proof of Theorem 2

Proof.

Throughout the proof we use the following notation for clarity:

(20)

We start from the fact that is -smooth:

(21)

and use the update :

(22)

where in the first step we used the fact that , in the second we used , in the third we used Cauchy-Schwarz, and in the fourth we used , along with .

Now, taking the expectation over , and using the fact that , and that , are both independent of :

(23)

where in the second step we used and .

Re-arranging, we get:

(24)

Now, defining , where , dividing by and summing over :

(25)

Now, taking the conditional expectation over all samples given :

(26)

where in the second step we used which follows from the assumption that , and the third step follows from a telescoping sum, along with the fact that . Now, using :

(27)

Then, taking the expectation over :

(28)

Now, let :

(29)

where we used the fact that .

Now, recall that :

(30)

Setting and checking that