1 Introduction

We consider first-order optimization methods, which are concerned with problems of the following form:

$$\min_{x \in \mathcal{X}} f(x),$$

where $\mathcal{X} \subseteq \mathbb{R}^d$ is the feasible set of solutions and $f : \mathcal{X} \to \mathbb{R}$ is the objective function. First-order methods typically operate in an iterative fashion: at each step $t$, the current candidate solution $x_t$ is updated using both zeroth- and first-order information about $f$ (e.g., $f(x_t)$ and $\nabla f(x_t)$, or unbiased estimates of each). Methods such as gradient descent and its stochastic counterpart can be written as:

$$x_{t+1} = \Pi_{\mathcal{X}}\left(x_t - \alpha_t g_t\right),$$

where $\alpha_t$ is the learning rate at step $t$, $g_t$ is the update direction (e.g., $g_t = \nabla f(x_t)$ for deterministic gradient descent), and $\Pi_{\mathcal{X}}$ denotes a projection onto $\mathcal{X}$. The behavior of vanilla gradient-based methods is well understood under different frameworks and assumptions ($O(\sqrt{T})$ regret in the online convex framework (Zinkevich, 2003), $O(1/\sqrt{T})$ suboptimality in the stochastic convex framework, and so on).
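As a minimal illustration of the update rule above (our own sketch, not tied to any particular library), consider projected gradient descent on a box-constrained quadratic, where the Euclidean projection onto the box is an element-wise clip:

```python
import numpy as np

def projected_sgd_step(x, g, lr, lo=-1.0, hi=1.0):
    """One gradient step followed by projection onto the box [lo, hi]^d,
    whose Euclidean projection is a simple element-wise clip."""
    return np.clip(x - lr * g, lo, hi)

# Minimize f(x) = ||x - c||^2 / 2 over the box [-1, 1]^2; the gradient is x - c.
c = np.array([2.0, 0.5])          # unconstrained minimizer (lies outside the box)
x = np.zeros(2)
for t in range(1, 1001):
    g = x - c                     # exact gradient of f at x
    x = projected_sgd_step(x, g, lr=0.5 / np.sqrt(t))
# x approaches the projection of c onto the box, namely (1.0, 0.5).
```

The decaying learning rate $\alpha_t = \alpha/\sqrt{t}$ matches the schedule used in the regret analyses discussed later.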
In contrast with SGD, adaptive gradient methods such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), and Adam (Kingma and Ba, 2015) compute a different learning rate for each parameter of the model. In particular, the parameters are updated according to the following rule:

$$x_{t+1} = \Pi_{\mathcal{X}}\left(x_t - \eta_t \odot m_t\right),$$

where $\eta_t \in \mathbb{R}^d$ are parameter-wise learning rates and $\odot$ denotes element-wise multiplication. For Adam, we have $\eta_t = \alpha_t / \sqrt{v_t}$ and $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, with $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$, where $g_t$ captures first-order information about the objective function (e.g., $g_t = \nabla f_t(x_t)$ in the stochastic setting).
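For concreteness, the per-parameter update described above can be sketched as follows (a simplified Adam-style step without bias correction; not the authors' implementation):

```python
import numpy as np

def adam_step(x, g, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update: exponential moving averages of the gradient
    and its square yield per-parameter learning rates lr / (sqrt(v) + eps).
    Bias correction is omitted for brevity."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g**2     # second-moment estimate v_t
    x = x - lr / (np.sqrt(v) + eps) * m    # element-wise step sizes
    return x, m, v

# Minimize f(x) = ||x||^2, whose gradient is 2x.
x = np.array([1.0, -3.0])
m, v = np.zeros(2), np.zeros(2)
for _ in range(2000):
    x, m, v = adam_step(x, 2 * x, m, v, lr=0.01)
```

Note that coordinates with persistently small squared gradients receive larger effective steps, which is the adaptivity the bound functions of AdaBound later restrict.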
Adaptive methods have become popular due to their flexibility in terms of hyperparameters, which typically require less tuning than those of SGD. In particular, Adam is currently the de facto optimizer for training complex models such as BERT (Devlin et al., 2018) and VQ-VAE (van den Oord et al., 2017).
Recently, it has been observed that Adam suffers from both theoretical and empirical gaps. Reddi et al. (2018) showed that Adam can fail to converge even in the stochastic convex setting, while Wilson et al. (2017) formally demonstrated that Adam can cause poor generalization – a fact often observed when training simpler CNN-based models such as ResNets (He et al., 2016). While the theoretical gap has been closed in Reddi et al. (2018) with AMSGrad – an Adam variant with provable convergence for online convex problems – achieving SGD-like performance with adaptive methods has remained an open problem.
AdaBound (Luo et al., 2019) is a recently proposed adaptive gradient method that aims to bridge the empirical gap between Adam-like methods and SGD. It consists of enforcing dynamic bounds on the parameter-wise learning rates $\eta_t$ such that, as $t$ goes to infinity, $\eta_t$ converges to a vector whose components are all equal – hence degenerating to SGD. AdaBound comes with a $O(\sqrt{T})$ regret rate in the online convex setting, yielding an immediate $O(1/\sqrt{T})$ guarantee in the stochastic convex framework due to Cesa-Bianchi et al. (2004). Moreover, empirical experiments suggest that it is capable of outperforming SGD in image classification tasks – problems where adaptive methods have historically failed to provide competitive results.
In Section 3, we highlight issues in the convergence rate proof of AdaBound (Theorem 4 of Luo et al. (2019)), and present a stochastic convex problem on which AdaBound can take arbitrarily long to converge. More importantly, we show that this problem contradicts the convergence guarantee of AdaBound while satisfying all of its assumptions, implying that Theorem 4 of Luo et al. (2019) is indeed incorrect. In Section 4, we introduce a new assumption which yields a $O(\sqrt{T})$ regret guarantee without assuming that the bound functions are monotonic or that they converge to the same limit. Driven by the new guarantee, in Section 5 we re-evaluate the performance of AdaBound on the CIFAR dataset, and observe that its performance can be matched by a specific form of SGD with momentum (SGDM), whose computational cost is significantly smaller than that of Adam-like methods.
2 Notation

For vectors $a, b \in \mathbb{R}^d$ and scalar $c$, we use the following notation: $a/b$ for element-wise division ($(a/b)_i = a_i / b_i$), $\sqrt{a}$ for element-wise square root ($(\sqrt{a})_i = \sqrt{a_i}$), $a + c$ for element-wise addition ($(a + c)_i = a_i + c$), and $a \odot b$ for element-wise multiplication ($(a \odot b)_i = a_i b_i$). Moreover, $\|a\|$ is used to denote the $\ell_2$-norm of $a$; other norms will be specified whenever used (e.g., $\|a\|_\infty$).

For subscripts and vector indexing, we adopt the following convention: the subscript $t$ is used to denote an object related to the $t$-th iteration of an algorithm (e.g., $x_t$ denotes the iterate at time step $t$), while the subscript $i$ is used for indexing: $x_i$ denotes the $i$-th coordinate of $x$. When used together, $t$ precedes $i$: $x_{t,i}$ denotes the $i$-th coordinate of $x_t$.
3 AdaBound’s Arbitrarily Slow Convergence
AdaBound is given as Algorithm 1, following Luo et al. (2019). It consists of an update rule similar to Adam's, except for the extra element-wise clipping operation $\mathrm{Clip}(\,\cdot\,, \eta_l(t), \eta_u(t))$ applied to the Adam-style learning rates $\alpha/\sqrt{v_t}$, which assures that $\eta_l(t) \le \sqrt{t}\,\eta_{t,i} \le \eta_u(t)$ for all $i \in [d]$. The bound functions are chosen such that $\eta_l(t)$ is non-decreasing, $\eta_u(t)$ is non-increasing, and $\lim_{t \to \infty} \eta_l(t) = \lim_{t \to \infty} \eta_u(t) = \alpha^*$, for some $\alpha^* > 0$. It then follows that $\sqrt{t}\,\eta_t \to \alpha^* \mathbf{1}$, thus AdaBound degenerates to SGD in the time limit.
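As an illustration of this degeneration (a schematic sketch with hypothetical bound functions; not the released implementation), the clipped learning-rate computation can be written as:

```python
import numpy as np

def adabound_lr(v, t, alpha, eta_l, eta_u):
    """Element-wise clipped learning rates: the Adam-style rates alpha/sqrt(v)
    are clipped to [eta_l(t), eta_u(t)] and then decayed by 1/sqrt(t)."""
    raw = alpha / (np.sqrt(v) + 1e-16)          # Adam-like per-parameter rates
    return np.clip(raw, eta_l(t), eta_u(t)) / np.sqrt(t)

# Hypothetical bound functions converging to alpha* = 0.1, as described above:
eta_l = lambda t: 0.1 - 0.1 / (1e-3 * t + 1)    # non-decreasing, -> 0.1
eta_u = lambda t: 0.1 + 0.1 / (1e-3 * t)        # non-increasing, -> 0.1

v = np.array([1e-8, 1.0, 1e8])                  # small, moderate, large second moments
lr_early = adabound_lr(v, t=1, alpha=1e-3, eta_l=eta_l, eta_u=eta_u)
lr_late = adabound_lr(v, t=10**7, alpha=1e-3, eta_l=eta_l, eta_u=eta_u)
# Early on the three rates differ widely (loose bounds); late in training all
# three approach 0.1 / sqrt(t), i.e., SGD with a decaying learning rate.
```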
In (Luo et al., 2019), the authors present the following Theorem:
Its proof claims that $\eta_{t,i} \le \eta_{t-1,i}$ follows from the definition of $\eta_t$ in AdaBound – a fact that only holds in general if $\eta_u(t)/\sqrt{t} \le \eta_l(t-1)/\sqrt{t-1}$ for all $t > 1$. Even for the bound functions considered in Luo et al. (2019) and used in the released code, this requirement is not satisfied for any $t > 1$. Finally, it is also possible to show that AMSBound does not meet this requirement either, hence the proof of Theorem 5 of Luo et al. (2019) is also problematic.
It turns out that the convergence of AdaBound in the stochastic convex case can be arbitrarily slow, even for bound functions that satisfy the assumptions in Theorem 1:

Theorem 2. For any constants $c \ge 0$ and $T' \in \mathbb{N}$, and initial step size $\alpha > 0$, there exist bound functions $\eta_l, \eta_u$ such that $\eta_l$ is non-decreasing, $\eta_u$ is non-increasing, $\lim_{t \to \infty} \eta_l(t) = \lim_{t \to \infty} \eta_u(t) = \alpha^*$ for some $\alpha^* > 0$, and a stochastic convex optimization problem for which the iterates produced by AdaBound satisfy $\mathbb{E}[f(x_t)] - \min_{x \in \mathcal{X}} f(x) \ge c$ for all $t \le T'$, where $f$ denotes the expected objective.
Proof. We consider the same stochastic problem as presented in Reddi et al. (2018), for which Adam fails to converge: a one-dimensional problem over $\mathcal{X} = [-1, 1]$, where $f_t$ is chosen i.i.d. as follows:

$$f_t(x) = \begin{cases} Cx, & \text{with probability } \frac{1+\delta}{C+1} \\ -x, & \text{otherwise.} \end{cases}$$

Here, $C$ is taken to be large in terms of $T'$ and $\alpha$, and $\delta = c$. Now, consider the following bound functions, where $\alpha^*$ is any value in $\left[\frac{\alpha}{C}, \frac{\alpha}{\sqrt{1-\beta_2}}\right]$:

$$\eta_l(t) = \begin{cases} \frac{\alpha}{C}, & t \le T' \\ \alpha^* - \frac{T'}{t}\left(\alpha^* - \frac{\alpha}{C}\right), & t > T' \end{cases} \qquad \eta_u(t) = \begin{cases} \frac{\alpha}{\sqrt{1-\beta_2}}, & t \le T' \\ \alpha^* + \frac{T'}{t}\left(\frac{\alpha}{\sqrt{1-\beta_2}} - \alpha^*\right), & t > T' \end{cases}$$
and check that $\eta_l$ and $\eta_u$ are non-decreasing and non-increasing in $t$, respectively, and that $\lim_{t \to \infty} \eta_l(t) = \lim_{t \to \infty} \eta_u(t) = \alpha^*$. We will show that such bound functions can be effectively ignored for $t \le T'$. Check that, for all $t$:

$$1 - \beta_2 \le (1 - \beta_2) \sum_{k=1}^{t} \beta_2^{t-k}\, \nabla f_k(x_k)^2 = v_t \le C^2,$$

where we used the fact that $1 \le |\nabla f_k(x_k)| \le C$ and that the coefficients $(1-\beta_2)\beta_2^{t-k}$ are positive and sum to at most $1$ (the $k = t$ coefficient being exactly $1 - \beta_2$). Hence, we have, for all $t$:

$$\frac{\alpha}{C} \le \frac{\alpha}{\sqrt{v_t}} \le \frac{\alpha}{\sqrt{1-\beta_2}}.$$

Since $\eta_l(t) \le \alpha/\sqrt{v_t} \le \eta_u(t)$ for all $t \le T'$, the clipping operation acts as an identity mapping and $\eta_t = \alpha / \sqrt{t\, v_t}$. Therefore, in this setting, AdaBound produces the same iterates as Adam up to time $T'$. We can then invoke Theorem 3 of Reddi et al. (2018), and have that, with $C$ large enough (as a function of $T'$), for all $t \le T'$, we have $\mathbb{E}[x_t] \ge 0$. In particular, with $f(x) = \mathbb{E}_{f_t}[f_t(x)] = \delta x$ and $x^* = -1$, $\mathbb{E}[f(x_t)] - f(x^*) \ge \delta$ for all $t \le T'$, and hence $\mathbb{E}[f(x_t)] - \min_{x \in \mathcal{X}} f(x) \ge \delta$. Setting $\delta = c$ finishes the proof. ∎
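The mechanism above can be checked with a short simulation (our own sketch, with hypothetical constants $C$, $\delta$, and $T'$): the Adam-style rates $\alpha/\sqrt{v_t}$ never leave the interval $[\alpha/C, \alpha/\sqrt{1-\beta_2}]$, so any bound functions sandwiching this interval up to time $T'$ leave the clipping inactive:

```python
import numpy as np

rng = np.random.default_rng(0)
C, delta, T_prime = 100.0, 0.1, 1000        # hypothetical constants
alpha, beta1, beta2 = 1e-3, 0.9, 0.999

x, m, v = 0.0, 0.0, 0.0
for t in range(1, T_prime + 1):
    # Stochastic gradient of f_t at x: C w.p. (1 + delta)/(C + 1), else -1.
    g = C if rng.random() < (1 + delta) / (C + 1) else -1.0
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    raw = alpha / np.sqrt(v)                # rate before clipping
    # v_t is a (sub-)convex combination of squared gradients in [1, C^2],
    # so raw always lies in [alpha/C, alpha/sqrt(1 - beta2)]:
    assert alpha / C <= raw <= alpha / np.sqrt(1 - beta2)
    x = np.clip(x - (raw / np.sqrt(t)) * m, -1.0, 1.0)
```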
While the bound functions considered in the theorem above might seem artificial, the same result holds for bound functions of the form $\eta_l(t) = 0.1 - \frac{0.1}{(1-\beta_2)t + 1}$ and $\eta_u(t) = 0.1 + \frac{0.1}{(1-\beta_2)t}$, considered in Luo et al. (2019) and in the publicly released implementation of AdaBound:
Corollary 1. Theorem 2 also holds for the bound functions $\eta_l(t) = 0.1 - \frac{0.1}{(1-\beta_2)t + 1}$ and $\eta_u(t) = 0.1 + \frac{0.1}{(1-\beta_2)t}$ with $C\, \eta_l(T') \le \alpha \le \sqrt{1-\beta_2}\; \eta_u(T')$.
Proof. Check that, for all $t \le T'$:

$$\eta_l(t) \le \eta_l(T') \le \frac{\alpha}{C} \le \frac{\alpha}{\sqrt{v_t}} \le \frac{\alpha}{\sqrt{1-\beta_2}} \le \eta_u(T') \le \eta_u(t),$$

where the outer inequalities follow from the monotonicity of $\eta_l$ and $\eta_u$. Hence, for the stochastic problem in Theorem 2, we also have that $\eta_l(t) \le \alpha/\sqrt{v_t} \le \eta_u(t)$ for all $t \le T'$, and the clipping operation again acts as an identity mapping. ∎
Note that it is straightforward to prove a similar result for the online convex setting by invoking Theorem 2 instead of Theorem 3 of Reddi et al. (2018) – this would immediately imply that Theorem 1 is incorrect. Instead, Theorem 2 was presented in the convex stochastic setup as it yields a stronger result, and it almost immediately implies that Theorem 1 cannot hold:
Proposition 1. There exists an instance where Theorem 1 does not hold.

Proof. Assume that Theorem 1 holds. The problem constructed in Theorem 2 satisfies all of its assumptions, so the regret of AdaBound on any sampled sequence $\{f_t\}$ satisfies

$$R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) \le O(\sqrt{T}),$$

where we used the fact that the bound functions in Theorem 2 meet the requirements of Theorem 1. Pick $T'$ large enough such that $R_{T'}/T' < c$. Taking expectation over sequences $\{f_t\}$ and dividing by $T'$:

$$\frac{1}{T'} \sum_{t=1}^{T'} \mathbb{E}[f(x_t)] - \min_{x \in \mathcal{X}} f(x) < c.$$

However, Theorem 2 assures $\mathbb{E}[f(x_t)] - \min_{x \in \mathcal{X}} f(x) \ge c$ for all $t \le T'$, raising a contradiction. ∎
Note that while the above result shows that Theorem 1 is indeed incorrect, it does not imply that AdaBound fails to converge – Theorem 2 only shows that its convergence can be arbitrarily slow.
4 A New Guarantee
The results in the previous section suggest that Theorem 1 fails to capture all relevant properties of the bound functions. Although it is indeed possible to show that AdaBound asymptotically degenerates to SGD, it is not clear whether a $O(\sqrt{T})$ regret rate can be guaranteed for general bound functions.
It turns out that replacing the previous requirements on the bound functions by the assumption that, for some constant $L \ge 0$,

$$\frac{\sqrt{t}}{\eta_l(t)} - \frac{\sqrt{t-1}}{\eta_u(t-1)} \le L\left(\sqrt{t} - \sqrt{t-1}\right)$$

holds for all $t$ suffices to guarantee a regret of $O(\sqrt{T})$:

Theorem 3. Let $\{x_t\}$ and $\{v_t\}$ be the sequences obtained from Algorithm 1, $\alpha_t = \alpha/\sqrt{t}$, $\beta_1 = \beta_{1,1}$, $\beta_{1,t} \le \beta_1$ for all $t \in [T]$ and $\beta_1/\sqrt{\beta_2} < 1$. Suppose $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{X}$ and $\|\nabla f_t(x)\|_\infty \le G_\infty$ for all $t \in [T]$ and $x \in \mathcal{X}$. Assume that $\frac{\sqrt{t}}{\eta_l(t)} - \frac{\sqrt{t-1}}{\eta_u(t-1)} \le L(\sqrt{t} - \sqrt{t-1})$ for all $t \in [T]$ and some $L \ge 0$. For $x_t$ generated using the AdaBound algorithm, we have the following bound on the regret:

$$R_T = O\!\left((1 + L)\sqrt{T}\right),$$

where the hidden constant depends on $d$, $D_\infty$, $G_\infty$, $\alpha$, $\beta_1$, and $\beta_2$.
Proof. We start from an intermediate result of the original proof of Theorem 4 of Luo et al. (2019):

Lemma 1. For the setting in Theorem 3, we have:

$$R_T \le T_1 + T_2.$$

Proof. The result follows from the proof of Theorem 4 in Luo et al. (2019), up to (but not including) Equation 6. ∎
We will proceed to bound the two terms $T_1$ and $T_2$ from the above lemma. Starting with $T_1$:
In the second inequality we used $\beta_{1,t} \le \beta_1$, in the third the definition of $\eta_t$ along with the fact that $\eta_l(t)/\sqrt{t} \le \eta_{t,i} \le \eta_u(t)/\sqrt{t}$ for all $t \in [T]$ and $i \in [d]$, in the fifth the assumption that $\frac{\sqrt{t}}{\eta_l(t)} - \frac{\sqrt{t-1}}{\eta_u(t-1)} \le L(\sqrt{t} - \sqrt{t-1})$, and in the sixth we used the bound on the feasible region, along with the telescoping identity $\sum_{t=1}^{T} (\sqrt{t} - \sqrt{t-1}) = \sqrt{T}$ and $\|x_t - x^*\|_\infty \le D_\infty$ for all $t \in [T]$.
For $T_2$, we have:

where we used the bound on the feasible region, and the fact that $\eta_{t,i} \le \eta_u(t)/\sqrt{t}$ for all $t \in [T]$ and $i \in [d]$. ∎
The above regret guarantee is similar to the one in Theorem 4 of Luo et al. (2019), except for the additional term involving $L$, which accounts for the assumption introduced. Note that Theorem 3 does not require $\eta_u$ to be non-increasing, $\eta_l$ to be non-decreasing, nor that $\lim_{t \to \infty} \eta_l(t) = \lim_{t \to \infty} \eta_u(t)$.
It is easy to see that the assumption indeed holds for the bound functions in Luo et al. (2019):
Lemma 2. For the bound functions $\eta_l(t) = 0.1 - \frac{0.1}{\gamma t + 1}$ and $\eta_u(t) = 0.1 + \frac{0.1}{\gamma t}$ (with $\gamma = 1 - \beta_2$ recovering the form in Luo et al. (2019)), if $\gamma > 0$, we have, for all $t \ge 1$:

$$\frac{\sqrt{t}}{\eta_l(t)} - \frac{\sqrt{t-1}}{\eta_u(t-1)} \le \left(10 + \frac{60}{\gamma}\right)\left(\sqrt{t} - \sqrt{t-1}\right),$$

with the convention that the $\sqrt{t-1}/\eta_u(t-1)$ term is dropped for $t = 1$.
Proof. First, check that $\frac{\sqrt{t}}{\eta_l(t)} = 10\sqrt{t} + \frac{10}{\gamma\sqrt{t}}$ and $\frac{\sqrt{t-1}}{\eta_u(t-1)} \ge 10\sqrt{t-1} - \frac{10}{\gamma\sqrt{t-1}}$ for $t \ge 2$. Then, we have:

$$\frac{\sqrt{t}}{\eta_l(t)} - \frac{\sqrt{t-1}}{\eta_u(t-1)} \le 10\left(\sqrt{t} - \sqrt{t-1}\right) + \frac{10}{\gamma\sqrt{t}} + \frac{10}{\gamma\sqrt{t-1}} \le \left(10 + \frac{60}{\gamma}\right)\left(\sqrt{t} - \sqrt{t-1}\right).$$

In the first inequality we used $\frac{1}{1+x} \ge 1 - x$ for all $x \ge 0$, and in the last the fact that $\sqrt{t} - \sqrt{t-1} = \frac{1}{\sqrt{t} + \sqrt{t-1}} \ge \frac{1}{2\sqrt{t}}$ for all $t \ge 1$, which gives $\frac{1}{\sqrt{t}} \le 2(\sqrt{t} - \sqrt{t-1})$ and, for $t \ge 2$, $\frac{1}{\sqrt{t-1}} \le 2\sqrt{2}\,(\sqrt{t} - \sqrt{t-1})$, along with $20 + 20\sqrt{2} \le 60$. The case $t = 1$ follows directly, since $\frac{1}{\eta_l(1)} = 10 + \frac{10}{\gamma} \le 10 + \frac{60}{\gamma}$. ∎
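The inequality in the lemma can also be checked numerically (our own script; the bound functions and the constant $L$ follow the forms above, with a hypothetical $\gamma = 10^{-3}$):

```python
import math

gamma = 1e-3                                    # convergence-speed parameter (hypothetical value)
eta_l = lambda t: 0.1 - 0.1 / (gamma * t + 1)   # lower bound function
eta_u = lambda t: 0.1 + 0.1 / (gamma * t)       # upper bound function
L = 10 + 60 / gamma

# Check sqrt(t)/eta_l(t) - sqrt(t-1)/eta_u(t-1) <= L * (sqrt(t) - sqrt(t-1)),
# dropping the sqrt(0)/eta_u(0) term at t = 1 as in the lemma.
for t in [1, 2, 3, 10, 100, 10**4, 10**6, 10**8]:
    lhs = math.sqrt(t) / eta_l(t) - (math.sqrt(t - 1) / eta_u(t - 1) if t > 1 else 0.0)
    rhs = L * (math.sqrt(t) - math.sqrt(t - 1))
    assert lhs <= rhs, (t, lhs, rhs)
```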
With this in hand, we have the following regret bound for AdaBound:

Corollary 2. Suppose $\eta_l(t) = 0.1 - \frac{0.1}{\gamma t + 1}$, $\eta_u(t) = 0.1 + \frac{0.1}{\gamma t}$, and $L = 10 + \frac{60}{\gamma}$ for the setting in Theorem 3. Then, we have:

$$R_T = O\!\left(\left(1 + \frac{1}{\gamma}\right)\sqrt{T}\right).$$
It is easy to check that the previous results also hold for AMSBound (Algorithm 3 in Luo et al. (2019)), since no assumptions were made on the point-wise behavior of $\alpha/\sqrt{v_t}$.
5 Experiments on AdaBound and SGD
Unfortunately, the regret bound in Corollary 2 is minimized in the limit $\gamma \to \infty$, where AdaBound immediately degenerates to SGD. To inspect whether this fact has empirical value or is just an artifact of the presented analysis, we evaluate the performance of AdaBound when training neural networks on the CIFAR dataset (Krizhevsky, 2009) with an extremely small value for the parameter $\gamma$.
Note that $\gamma = 10^{-3}$ was used for the CIFAR results in Luo et al. (2019), for which we have $\eta_l(t) \approx 0.05$ and $\eta_u(t) \approx 0.19$ after only 3 epochs ($\approx 390$ iterations per epoch for a batch size of 128), hence we believe results with considerably smaller/larger values for $\gamma$ are required to understand its impact on the performance of AdaBound.
We trained a Wide ResNet-28-2 (Zagoruyko and Komodakis, 2016) using the same settings as in Luo et al. (2019) and its released code (https://github.com/Luolc/AdaBound, version 2e928c3): 200 epochs of training, a weight decay of $5 \times 10^{-4}$, a learning rate decay of factor 10 at epoch 150, and a batch size of 128. For AdaBound, we used the authors' implementation with $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\alpha^* = 0.1$, and for SGD we used $\alpha = 0.1$ with momentum $\beta = 0.9$. Experiments were done in PyTorch.
To clarify our network choice, note that the model used in Luo et al. (2019) is not a ResNet-34 from He et al. (2016), but a variant used in DeVries and Taylor (2017), often referred to as ResNet-34. In particular, the ResNet-34 from He et al. (2016) consists of 3 stages and less than 0.5M parameters, while the network used in Luo et al. (2019) has 4 stages and around 21M parameters. The network we used has roughly 1.5M parameters.
Our preliminary results suggest that the final test performance of AdaBound is monotonically increasing with $\gamma$ – more interestingly, there is no significant difference throughout training between $\gamma = 10^{-3}$ and $\gamma = 1$ (for the latter, we have $\eta_l(1) = 0.05$ and $\eta_u(1) = 0.2$).
To see why AdaBound with large $\gamma$ behaves so differently from SGDM, check that the momentum updates slightly differ between the two: for AdaBound, we have:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t,$$

while, for the implementation of SGDM used in Luo et al. (2019), we have:

$$m_t = \beta m_{t-1} + (1 - \tau)\, g_t,$$

where $\tau$ is the dampening factor. The results in Luo et al. (2019) use $\tau = 0$, which can cause $m_t$ to be larger by a factor of $\frac{1}{1-\beta}$ compared to AdaBound's. In principle, setting $\tau = \beta$ in SGDM should yield dynamics similar to AdaBound's as long as $\gamma$ is not extremely small.
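The factor-of-$\frac{1}{1-\beta}$ gap between the two recursions can be seen directly with a constant gradient (a pure-Python sketch; `tau` denotes the dampening argument):

```python
beta = 0.9          # momentum coefficient
g = 1.0             # a constant gradient, for illustration

m_ada, m_plain, m_damp = 0.0, 0.0, 0.0
for _ in range(200):
    m_ada = beta * m_ada + (1 - beta) * g    # AdaBound / Adam-style momentum
    m_plain = beta * m_plain + g             # SGDM with dampening tau = 0
    m_damp = beta * m_damp + (1 - beta) * g  # SGDM with tau = beta

# In steady state, m_plain exceeds m_ada by a factor of 1 / (1 - beta) = 10,
# while tau = beta reproduces the AdaBound-style recursion exactly.
```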
Figure 1 presents our main empirical results: setting $\gamma$ to an extremely small value causes noticeable performance degradation compared to $\gamma = 10^{-3}$ in AdaBound, as Corollary 2 might suggest. Moreover, setting $\tau = \beta$ in SGDM causes a dramatic performance increase throughout training. In particular, it slightly outperforms AdaBound in terms of final test accuracy (average over 5 runs), while being comparably fast and consistent in terms of progress during optimization.
We believe SGDM with $\tau = \beta$ (which is currently not the default in either PyTorch or TensorFlow) might be a reasonable alternative to adaptive gradient methods in some settings, as it also requires fewer computational resources: SGDM's update requires considerably fewer float operations than Adam's and AdaBound's, and the memory costs are $d$ for SGDM (the momentum statistics) against $2d$ for Adam and AdaBound (momentum and second-moment statistics). Moreover, AdaBound has 5 hyperparameters ($\alpha$, $\beta_1$, $\beta_2$, $\gamma$, $\alpha^*$), while SGDM with $\tau = \beta$ has only 2 ($\alpha$, $\beta$). Studying the effectiveness of 'dampened' SGDM, however, requires extensive experiments which are out of the scope of this technical report.
Lastly, we evaluated whether performing bias correction on this form of SGDM affects its performance. More specifically, when adding bias correction to SGDM we divide the learning rate at step $t$ by a factor of $1 - \beta^t$ (compared to the factor $(1 - \beta_1^t)\big/\sqrt{1 - \beta_2^t}$ of AdaBound).
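To illustrate the correction (our own sketch, assuming the $\tau = \beta$ form of SGDM above): with a constant gradient, the dampened buffer equals $(1 - \beta^t)\, g$, so dividing by $1 - \beta^t$ recovers the gradient scale exactly:

```python
beta = 0.9
g = 1.0                                  # constant gradient, for illustration

m = 0.0
for t in range(1, 51):
    m = beta * m + (1 - beta) * g        # dampened (tau = beta) momentum buffer
    m_hat = m / (1 - beta**t)            # bias-corrected estimate
    # With a constant gradient, m = (1 - beta**t) * g, so m_hat recovers g.
```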
6 Conclusion

In this technical report, we identified issues in the proof of the main theorem of Luo et al. (2019), which presents a $O(\sqrt{T})$ regret rate guarantee for AdaBound. We presented an instance where the statement does not hold, and provided a $O(\sqrt{T})$ regret guarantee under different – and arguably less restrictive – assumptions. Finally, we observed empirically that AdaBound with a theoretically optimal $\gamma$ indeed yields superior performance, although it then degenerates to a specific form of momentum SGD. Our experiments suggest that this form of SGDM (with a dampening factor equal to its momentum) performs competitively with AdaBound on CIFAR.
Acknowledgments

We are indebted to Rachit Nimavat for proofreading the manuscript and for extensive discussions, and thank Sudarshan Babu and Liangchen Luo for helpful comments.
References

- N. Cesa-Bianchi, A. Conconi, and C. Gentile (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
- T. DeVries and G. W. Taylor (2017). Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552.
- J. Duchi, E. Hazan, and Y. Singer (2011). Adaptive subgradient methods for online learning and stochastic optimization. JMLR.
- K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. CVPR.
- D. P. Kingma and J. Ba (2015). Adam: a method for stochastic optimization. ICLR.
- A. Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report.
- L. Luo, Y. Xiong, Y. Liu, and X. Sun (2019). Adaptive gradient methods with dynamic bound of learning rate. ICLR (arXiv:1902.09843).
- S. J. Reddi, S. Kale, and S. Kumar (2018). On the convergence of Adam and beyond. ICLR.
- T. Tieleman and G. Hinton (2012). Lecture 6.5 – RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
- A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017). Neural discrete representation learning. NIPS (arXiv:1711.00937).
- A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017). The marginal value of adaptive gradient methods in machine learning. NIPS.
- S. Zagoruyko and N. Komodakis (2016). Wide residual networks. BMVC.
- M. Zinkevich (2003). Online convex programming and generalized infinitesimal gradient ascent. ICML.