# On the Convergence of AdaBound and its Connection to SGD

Adaptive gradient methods such as Adam have gained extreme popularity due to their success in training complex neural networks and their lower sensitivity to hyperparameter tuning compared to SGD. However, it has recently been shown that Adam can fail to converge and might cause poor generalization; this has led to the design of new, sophisticated adaptive methods which attempt to generalize well while being theoretically reliable. In this technical report we focus on AdaBound, a promising, recently proposed optimizer. We present a stochastic convex problem for which AdaBound can provably take arbitrarily long to converge in terms of a factor which is not accounted for in the convergence rate guarantee of Luo et al. (2019). We present a new $O(\sqrt T)$ regret guarantee under different assumptions on the bound functions, and provide empirical results on CIFAR suggesting that a specific form of momentum SGD can match AdaBound's performance while having fewer hyperparameters and lower computational costs.


## 1 Introduction

We consider first-order optimization methods which are concerned with problems of the following form:

$$\min_{x \in \mathcal{F}} f(x) \tag{1}$$

where $\mathcal{F}$ is the feasible set of solutions and $f$ is the objective function. First-order methods typically operate in an iterative fashion: at each step $t$, the current candidate solution $x_t$ is updated using both zeroth- and first-order information about $f$ (e.g., $f(x_t)$ and $\nabla f(x_t)$, or unbiased estimates of each). Methods such as gradient descent and its stochastic counterpart can be written as:

$$x_{t+1} = \Pi_{\mathcal{F}}\left(x_t - \alpha_t \cdot m_t\right) \tag{2}$$

where $\alpha_t$ is the learning rate at step $t$, $m_t$ is the update direction (e.g., $m_t = \nabla f(x_t)$ for deterministic gradient descent), and $\Pi_{\mathcal{F}}$ denotes a projection onto $\mathcal{F}$. The behavior of vanilla gradient-based methods is well-understood under different frameworks and assumptions ($O(\sqrt T)$ regret in the online convex framework (Zinkevich, 2003), $O(1/\sqrt T)$ suboptimality in the stochastic convex framework, and so on).
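As a concrete illustration of the update in Eq. (2), here is a minimal NumPy sketch of one projected (sub)gradient step over a box-shaped feasible set (the function name and the box bounds are our own choices for illustration):

```python
import numpy as np

def projected_gd_step(x, grad, alpha, lo=-1.0, hi=1.0):
    """One step of Eq. (2) with F = [lo, hi]^d: take a gradient step,
    then project back onto the box (coordinate-wise clipping)."""
    return np.clip(x - alpha * grad, lo, hi)
```

For a box-shaped feasible set the projection is just coordinate-wise clipping; general convex sets require their own projection operators.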

Adaptive gradient methods such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2015) propose to compute a different learning rate for each parameter in the model. In particular, the parameters are updated according to the following rule:

$$x_{t+1} = \Pi_{\mathcal{F}}\left(x_t - \eta_t \odot m_t\right) \tag{3}$$

where $\eta_t \in \mathbb{R}^d$ are parameter-wise learning rates and $\odot$ denotes element-wise multiplication. For Adam, we have $\eta_t = \alpha_t/\sqrt{v_t}$ and $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$ with $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$, where $g_t$ captures first-order information of the objective function (e.g., $g_t = \nabla f_t(x_t)$ in the stochastic setting).
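The Adam-style update above can be sketched as follows (a minimal NumPy version without bias correction or projection; `eps` is the usual numerical-stability constant added in implementations):

```python
import numpy as np

def adam_step(x, m, v, g, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step without bias correction, following Eq. (3):
    x <- x - eta * m, with parameter-wise rates eta = alpha / sqrt(v)."""
    m = beta1 * m + (1 - beta1) * g         # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g ** 2    # second-moment estimate v_t
    eta = alpha / (np.sqrt(v) + eps)        # parameter-wise learning rates
    return x - eta * m, m, v
```

Note how the effective step size for each coordinate shrinks when that coordinate's recent gradients are large, which is the "adaptivity" discussed throughout this report.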

Adaptive methods have become popular due to their flexibility in terms of hyperparameters, which typically require less tuning than SGD's. In particular, Adam is currently the de facto optimizer for training complex models such as BERT (Devlin et al., 2018) and VQ-VAE (van den Oord et al., 2017).

Recently, it has been observed that Adam has both theoretical and empirical gaps. Reddi et al. (2018) showed that Adam can fail to converge even in the stochastic convex setting, while Wilson et al. (2017) have formally demonstrated that Adam can cause poor generalization, a fact often observed when training simpler CNN-based models such as ResNets (He et al., 2016). While the theoretical gap has been closed in Reddi et al. (2018) with AMSGrad, an Adam variant with provable convergence for online convex problems, achieving SGD-like performance with adaptive methods has remained an open problem.

AdaBound (Luo et al., 2019) is a recently proposed adaptive gradient method that aims to bridge the empirical gap between Adam-like methods and SGD, and consists of enforcing dynamic bounds on the parameter-wise learning rates $\eta_t$ such that, as $t$ goes to infinity, $\eta_t$ converges to a vector whose components are all equal, hence degenerating to SGD. AdaBound comes with a $O(\sqrt T)$ regret rate in the online convex setting, yielding an immediate $O(1/\sqrt T)$ convergence guarantee in the stochastic convex framework due to Cesa-Bianchi et al. (2006). Moreover, empirical experiments suggest that it is capable of outperforming SGD in image classification tasks, problems where adaptive methods have historically failed to provide competitive results.

In Section 3, we highlight issues in the convergence rate proof of AdaBound (Theorem 4 of Luo et al. (2019)), and present a stochastic convex problem for which AdaBound can take arbitrarily long to converge. More importantly, we show that the presented problem leads to a contradiction with the convergence guarantee of AdaBound while satisfying all of its assumptions, implying that Theorem 4 of Luo et al. (2019) is indeed incorrect. In Section 4, we introduce a new assumption which yields a $O(\sqrt T)$ regret guarantee without assuming that the bound functions are monotonic nor that they converge to the same limit. Driven by the new guarantee, in Section 5 we re-evaluate the performance of AdaBound on the CIFAR dataset, and observe that its performance can be matched by a specific form of momentum SGD (SGDM), whose computational cost is significantly smaller than that of Adam-like methods.

## 2 Notation

For vectors $a, b \in \mathbb{R}^d$ and scalar $c$, we use the following notation: $a/b$ for element-wise division ($(a/b)_i = a_i/b_i$), $\sqrt{a}$ for element-wise square root ($(\sqrt{a})_i = \sqrt{a_i}$), $a + c$ for element-wise addition ($(a+c)_i = a_i + c$), and $a \odot b$ for element-wise multiplication ($(a \odot b)_i = a_i b_i$). Moreover, $\|a\|$ is used to denote the $\ell_2$-norm: other norms will be specified whenever used (e.g., $\|a\|_\infty$).

For subscripts and vector indexing, we adopt the following convention: the subscript $t$ is used to denote an object related to the $t$-th iteration of an algorithm (e.g., $x_t$ denotes the iterate at time step $t$); the subscript $i$ is used for indexing: $x_i$ denotes the $i$-th coordinate of $x$. When used together, $t$ precedes $i$: $x_{t,i}$ denotes the $i$-th coordinate of $x_t$.

## 3 AdaBound’s Arbitrarily Slow Convergence

AdaBound is given as Algorithm 1, following (Luo et al., 2019). It consists of an update rule similar to Adam's, except for the extra element-wise clipping operation $\hat\eta_t = \text{Clip}\left(\alpha/\sqrt{v_t},\, \eta_l(t),\, \eta_u(t)\right)$, which assures that $\eta_l(t) \le \hat\eta_{t,i} \le \eta_u(t)$ for all $i$. The bound functions are chosen such that $\eta_l$ is non-decreasing, $\eta_u$ is non-increasing, and $\lim_{t\to\infty}\eta_l(t) = \lim_{t\to\infty}\eta_u(t) = \alpha^*$, for some $\alpha^* > 0$. It then follows that $\hat\eta_{t,i} \to \alpha^*$ for every coordinate $i$, thus AdaBound degenerates to SGD in the time limit.
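Since the pseudocode of Algorithm 1 is not reproduced here, a single AdaBound step can be sketched as follows (a minimal NumPy sketch based on the description above and on Luo et al. (2019); the projection step is omitted and the function signature is our own):

```python
import numpy as np

def adabound_step(x, m, v, g, t, alpha, eta_l, eta_u, beta1=0.9, beta2=0.999):
    """One AdaBound step: an Adam-style update whose per-parameter learning
    rates are clipped to [eta_l(t), eta_u(t)] before the 1/sqrt(t) decay."""
    m = beta1 * m + (1 - beta1) * g             # first-moment EMA
    v = beta2 * v + (1 - beta2) * g ** 2        # second-moment EMA
    eta_hat = np.clip(alpha / np.sqrt(v), eta_l(t), eta_u(t))  # element-wise clip
    eta = eta_hat / np.sqrt(t)                  # decayed, bounded learning rates
    return x - eta * m, m, v
```

With constant bound functions $\eta_l = \eta_u = \alpha^*$, the clip makes the update exactly momentum SGD with step size $\alpha^*/\sqrt t$, which is the degenerate behavior described above.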

In (Luo et al., 2019), the authors present the following Theorem:

###### Theorem 1.

(Theorem 4 of Luo et al. (2019)) Let $\{x_t\}$ and $\{v_t\}$ be the sequences obtained from Algorithm 1, $\beta_1 = \beta_{11}$, $\beta_{1t} \le \beta_1$ for all $t \in [T]$ and $\beta_1/\sqrt{\beta_2} < 1$. Suppose $\eta_l(t+1) \ge \eta_l(t) > 0$, $\eta_u(t+1) \le \eta_u(t)$, $\eta_l(t) \to \alpha^*$ as $t \to \infty$, $\eta_u(t) \to \alpha^*$ as $t \to \infty$, and $R_\infty = \eta_u(1)$. Assume that $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$ and $\|\nabla f_t(x)\| \le G_2$ for all $t \in [T]$ and $x \in \mathcal{F}$. For $x_t$ generated using the AdaBound algorithm, we have the following bound on the regret:

$$R_T \le \frac{D_\infty^2\sqrt T}{2(1-\beta_1)}\sum_{i=1}^d \hat\eta_{T,i}^{-1} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{t=1}^T\sum_{i=1}^d \beta_{1t}\,\eta_{t,i}^{-1} + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \tag{4}$$

Its proof claims that $\eta_{t,i}^{-1} \ge \eta_{t-1,i}^{-1}$ follows from the definition of $\eta_t$ in AdaBound, a fact that only holds in general if $\sqrt{t}\,\eta_l(t-1) \ge \sqrt{t-1}\,\eta_u(t)$ for all $t$. Even for the bound functions considered in Luo et al. (2019) and used in the released code, this requirement is not satisfied for any $t$. Finally, it is also possible to show that AMSBound does not meet this requirement either, hence the proof of Theorem 5 of Luo et al. (2019) is also problematic.

It turns out that the convergence of AdaBound in the stochastic convex case can be arbitrarily slow, even for bound functions that satisfy the assumptions in Theorem 1:

###### Theorem 2.

For any constant $K \in \mathbb{N}$ and initial step size $\alpha > 0$, there exist bound functions such that $\eta_l(t+1) \ge \eta_l(t) > 0$, $\eta_u(t+1) \le \eta_u(t)$, $\lim_{t\to\infty}\eta_l(t) = \lim_{t\to\infty}\eta_u(t) = \alpha^*$ for some $\alpha^* > 0$, and a stochastic convex optimization problem for which the iterates produced by AdaBound satisfy $\mathbb{E}[f(x_t)] - \min_{x\in\mathcal{F}}f(x) \ge 0.02$ for all $t \le K$.

###### Proof.

We consider the same stochastic problem as presented in Reddi et al. (2018), for which Adam fails to converge. In particular, a one-dimensional problem over $\mathcal{F} = [-1, 1]$, where $f_t$ is chosen i.i.d. as follows:

$$f_t(x) = \begin{cases} Cx\,, & \text{with probability } p \coloneqq \frac{1+\delta}{C+1}\\ -x\,, & \text{with probability } 1-p\end{cases} \tag{5}$$

Here, $C$ is taken to be large in terms of $\beta_1$ and $\beta_2$, and $\delta = 0.02$. Now, consider the following bound functions:

$$\eta_l(t) = \alpha/C\,,\qquad \eta_u(t;K) = \begin{cases}\alpha/\sqrt{1-\beta_2}\,, & \text{for } t \le K\\ \alpha/C\,, & \text{otherwise}\end{cases} \tag{6}$$

and check that $\eta_l(t)$ and $\eta_u(t;K)$ are non-decreasing and non-increasing in $t$, respectively, and $\lim_{t\to\infty}\eta_l(t) = \lim_{t\to\infty}\eta_u(t;K) = \alpha/C$. We will show that such bound functions can be effectively ignored for $t \le K$. Check that, for all $t$:

$$v_t = (1-\beta_2)\sum_{i=1}^t \beta_2^{t-i}g_i^2 \le C^2(1-\beta_2^t) \le C^2\,,\qquad v_t = (1-\beta_2)\sum_{i=1}^t \beta_2^{t-i}g_i^2 \ge 1-\beta_2^t \ge 1-\beta_2 \tag{7}$$

where we used the facts that $g_t^2 \le C^2$ and that $g_t^2 \ge 1$. Hence, we have, for $t \le K$:

$$\eta_l(t) = \frac{\alpha}{C} \le \frac{\alpha}{\sqrt{v_t}} \le \frac{\alpha}{\sqrt{1-\beta_2}} = \eta_u(t;K) \tag{8}$$

Since $\eta_l(t) \le \alpha/\sqrt{v_t} \le \eta_u(t;K)$ for all $t \le K$, the clipping operation acts as an identity mapping and $\hat\eta_t = \alpha/\sqrt{v_t}$. Therefore, in this setting, AdaBound produces the same iterates as Adam. We can then invoke Theorem 3 of Reddi et al. (2018), and have that, with $C$ large enough (as a function of $\delta$, $\beta_1$, $\beta_2$), for all $t$, we have $\mathbb{E}[x_t] \ge 0$. In particular, with $\delta = 0.02$, $\mathbb{E}[f(x_t)] - f(x^*) = \delta\left(\mathbb{E}[x_t] + 1\right) \ge 0.02$ for all $t \le K$. Setting $\alpha^* = \alpha/C$ finishes the proof. ∎
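The key step of the proof, namely that the clipping is vacuous on this problem, can be checked numerically (a small simulation of the second-moment recursion under the gradient distribution of Eq. (5); the specific constants below are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
C, delta, beta2, alpha, K = 10.0, 0.02, 0.999, 0.1, 2000
p = (1 + delta) / (C + 1)

v = 0.0
for t in range(1, K + 1):
    # Gradient of f_t from Eq. (5): C with probability p, -1 otherwise.
    g = C if rng.random() < p else -1.0
    v = beta2 * v + (1 - beta2) * g ** 2
    # By Eqs. (7)-(8), alpha/sqrt(v_t) never leaves [alpha/C, alpha/sqrt(1-beta2)],
    # so AdaBound's clipping acts as an identity for t <= K.
    assert alpha / C <= alpha / np.sqrt(v) <= alpha / np.sqrt(1 - beta2) + 1e-12
```

The simulation confirms that, for these bound functions, AdaBound's iterates coincide with Adam's on the construction above.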

While the bound functions considered in the Theorem above might seem artificial, the same result holds for bound functions of the form $\eta_l(t) = 1 - \frac{1}{\gamma t + 1}$ and $\eta_u(t) = 1 + \frac{1}{\gamma t}$, considered in Luo et al. (2019) and in the publicly released implementation of AdaBound:

###### Claim 1.

Theorem 2 also holds for the bound functions $\eta_l(t) = 1 - \frac{1}{\gamma t + 1}$ and $\eta_u(t) = 1 + \frac{1}{\gamma t}$ with

$$\gamma = \frac{1}{K}\cdot\min\left(\frac{\alpha}{C - \alpha},\; \frac{\sqrt{1-\beta_2}}{\alpha}\right) \tag{9}$$
###### Proof.

Check that, for all $t \le K$:

$$\eta_l(t) \le 1 - \frac{1}{\gamma K + 1} \le 1 - \frac{1}{\frac{\alpha}{C-\alpha} + 1} = 1 - \frac{C-\alpha}{C} = \frac{\alpha}{C}\,,\qquad \eta_u(t) \ge 1 + \frac{1}{\gamma K} \ge 1 + \frac{\alpha}{\sqrt{1-\beta_2}} \ge \frac{\alpha}{\sqrt{1-\beta_2}} \tag{10}$$

Hence, for the stochastic problem in Theorem 2, we also have that $\eta_l(t) \le \alpha/\sqrt{v_t} \le \eta_u(t)$ for all $t \le K$. ∎

Note that it is straightforward to prove a similar result for the online convex setting by invoking Theorem 2 instead of Theorem 3 of Reddi et al. (2018) – this would immediately imply that Theorem 1 is incorrect. Instead, Theorem 2 was presented in the stochastic convex setup as it yields a stronger result, and it almost immediately implies that Theorem 1 might not hold:

###### Corollary 1.

There exists an instance where Theorem 1 does not hold.

###### Proof.

Consider AdaBound with the bound functions presented in Theorem 2 and $\beta_1 = 0$. For any sequence $\{f_t\}$ drawn for the stochastic problem in Theorem 2, setting $R_\infty = \eta_u(1) = \alpha/\sqrt{1-\beta_2}$ and $G_2 = C$ in Theorem 1 yields, for $T = K$:

$$R_K = \sum_{t=1}^K\left(f_t(x_t) - f_t(x^*)\right) \le \frac{2dC\sqrt K}{\alpha} + \frac{(2\sqrt K - 1)\alpha C^2}{\sqrt{1-\beta_2}} \tag{11}$$

where we used the fact that $D_\infty = 2$ and $\hat\eta_{K,i} \ge \alpha/C$. Pick $K$ large enough such that the right-hand side of (11) is smaller than $0.01K$. Taking expectation over sequences $\{f_t\}$ and dividing by $K$:

$$\frac{1}{K}\sum_{t=1}^K \mathbb{E}\left[f(x_t)\right] - f(x^*) < 0.01 \tag{12}$$

However, Theorem 2 assures $\mathbb{E}[f(x_t)] - f(x^*) \ge 0.02$ for all $t \le K$, raising a contradiction. ∎

Note that while the above result shows that Theorem 1 is indeed incorrect, it does not imply that AdaBound might fail to converge.

## 4 A New Guarantee

The results in the previous section suggest that Theorem 1 fails to capture all relevant properties of the bound functions. Although it is indeed possible to show that $\lim_{T\to\infty} R_T/T = 0$, it is not clear whether a $O(\sqrt T)$ regret rate can be guaranteed for general bound functions.

It turns out that replacing the previous requirements on the bound functions by the assumption that $\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le M$ for all $t$ suffices to guarantee a regret of $O(\sqrt T)$:

###### Theorem 3.

Let $\{x_t\}$ and $\{v_t\}$ be the sequences obtained from Algorithm 1, $\beta_1 = \beta_{11}$, $\beta_{1t} \le \beta_1$ for all $t \in [T]$ and $\beta_1/\sqrt{\beta_2} < 1$. Suppose $\eta_l(t) > 0$ and $\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le M$ for all $t \in [T]$, and let $R_\infty = \max_{t\in[T]}\eta_u(t)$. Assume that $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$ and $\|\nabla f_t(x)\| \le G_2$ for all $t \in [T]$ and $x \in \mathcal{F}$. For $x_t$ generated using the AdaBound algorithm, we have the following bound on the regret:

$$R_T \le \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + \sum_{i=1}^d\left[\eta_{1,i}^{-1} + \sum_{t=1}^T\beta_{1t}\,\eta_{t,i}^{-1}\right]\right] + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \tag{13}$$
###### Proof.

We start from an intermediate result of the original proof of Theorem 4 of Luo et al. (2019):

###### Lemma 1.

For the setting in Theorem 3, we have:

$$R_T \le \underbrace{\sum_{t=1}^T \frac{1}{2(1-\beta_{1t})}\left[\left\|\eta_t^{-1/2}\odot(x_t - x^*)\right\|^2 - \left\|\eta_t^{-1/2}\odot(x_{t+1} - x^*)\right\|^2\right]}_{S_1} + \underbrace{\sum_{t=1}^T \frac{\beta_{1t}}{2(1-\beta_{1t})}\left\|\eta_t^{-1/2}\odot(x_t - x^*)\right\|^2}_{S_2} + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \tag{14}$$
###### Proof.

The result follows from the proof of Theorem 4 in Luo et al. (2019), up to (but not including) Equation 6. ∎

We will proceed to bound $S_1$ and $S_2$ from the above Lemma. Starting with $S_1$:

$$\begin{aligned}
S_1 &= \sum_{i=1}^d\sum_{t=1}^T \frac{1}{2(1-\beta_{1t})}\left[\eta_{t,i}^{-1}(x_{t,i}-x_i^*)^2 - \eta_{t,i}^{-1}(x_{t+1,i}-x_i^*)^2\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\left[\frac{\eta_{t,i}^{-1}}{2(1-\beta_{1t})} - \frac{\eta_{t-1,i}^{-1}}{2(1-\beta_{1(t-1)})}\right](x_{t,i}-x_i^*)^2\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\frac{\left[\eta_{t,i}^{-1}-\eta_{t-1,i}^{-1}\right]}{2(1-\beta_{1(t-1)})}(x_{t,i}-x_i^*)^2\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\left[\frac{\sqrt t}{\eta_l(t)}-\frac{\sqrt{t-1}}{\eta_u(t-1)}\right]\frac{(x_{t,i}-x_i^*)^2}{2(1-\beta_{1(t-1)})}\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\frac{1}{\sqrt t}\left[\frac{t}{\eta_l(t)}-\frac{t-1}{\eta_u(t-1)}\right]\frac{(x_{t,i}-x_i^*)^2}{2(1-\beta_{1(t-1)})}\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\frac{M}{\sqrt t}\cdot\frac{(x_{t,i}-x_i^*)^2}{2(1-\beta_{1(t-1)})}\right] \\
&\le \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^d\left[\eta_{1,i}^{-1} + 2M(\sqrt T - 1)\right] = \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + \sum_{i=1}^d\eta_{1,i}^{-1}\right]
\end{aligned} \tag{15}$$

In the second inequality we used $\beta_{1t} \le \beta_{1(t-1)}$, in the third the definition of $\eta_t$ along with the fact that $\eta_l(t)/\sqrt t \le \eta_{t,i} \le \eta_u(t)/\sqrt t$ for all $t$ and $i$, in the fifth the assumption that $\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le M$, and in the sixth we used the bound $D_\infty$ on the feasible region, along with $\sum_{t=2}^T \frac{1}{\sqrt t} \le 2(\sqrt T - 1)$ and $\beta_{1t} \le \beta_1$ for all $t$.

For $S_2$, we have:

$$S_2 = \sum_{i=1}^d\sum_{t=1}^T \frac{1}{2(1-\beta_{1t})}\beta_{1t}\,\eta_{t,i}^{-1}(x_{t,i}-x_i^*)^2 \le \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^d\sum_{t=1}^T \beta_{1t}\,\eta_{t,i}^{-1} \tag{16}$$

where we used the bound on the feasible region, and the fact that for all .

Combining (15) and (16) into (14), we get:

$$R_T \le \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + \sum_{i=1}^d\left[\eta_{1,i}^{-1} + \sum_{t=1}^T\beta_{1t}\,\eta_{t,i}^{-1}\right]\right] + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \tag{17}$$

which concludes the proof. ∎

The above regret guarantee is similar to the one in Theorem 4 of Luo et al. (2019), except for the term $2dM(\sqrt T - 1)$, which accounts for the assumption introduced. Note that Theorem 3 does not require $\eta_u$ to be non-increasing, $\eta_l$ to be non-decreasing, nor that $\lim_{t\to\infty}\eta_l(t) = \lim_{t\to\infty}\eta_u(t)$.

It is easy to see that the assumption $\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le M$ indeed holds for the bound functions in Luo et al. (2019):

###### Proposition 1.

For the bound functions

$$\eta_l(t) = 1 - \frac{1}{\gamma t + 1}\,,\qquad \eta_u(t) = 1 + \frac{1}{\gamma t} \tag{18}$$

if $\gamma > 0$, we have:

$$\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le 3 + 2\gamma^{-1} \tag{19}$$
###### Proof.

First, check that $\frac{t}{\eta_l(t)} = t + \frac{1}{\gamma}$ and $\frac{t-1}{\eta_u(t-1)} = \frac{\gamma(t-1)^2}{\gamma(t-1)+1} = (t-1) - \frac{t-1}{\gamma(t-1)+1}$. Then, we have:

$$\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} = 1 + \frac{1}{\gamma} + \frac{t-1}{\gamma(t-1)+1} \le 1 + \frac{2}{\gamma} \le 3 + 2\gamma^{-1} \tag{20}$$

In the first inequality we used $\frac{t-1}{\gamma(t-1)+1} \le \frac{1}{\gamma}$ for all $t \ge 1$, which is equivalent to $\gamma(t-1) \le \gamma(t-1) + 1$. ∎
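Proposition 1 can also be checked numerically (a brute-force sweep over ranges of $t$ and $\gamma$; the grids below are arbitrary, and $t \ge 2$ avoids the undefined $\eta_u(0)$):

```python
def eta_l(t, gamma):
    return 1 - 1 / (gamma * t + 1)

def eta_u(t, gamma):
    return 1 + 1 / (gamma * t)

def lhs(t, gamma):
    # The quantity bounded in Proposition 1 / assumed in Theorem 3.
    return t / eta_l(t, gamma) - (t - 1) / eta_u(t - 1, gamma)

for gamma in (1e-3, 1e-2, 0.1, 1.0, 10.0):
    for t in range(2, 5001):  # t >= 2 so that eta_u(t - 1) is defined
        assert lhs(t, gamma) <= 3 + 2 / gamma
```

The sweep passes for every tested pair, matching the bound in Eq. (19).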

With this in hand, we have the following regret bound for AdaBound:

###### Corollary 2.

Suppose $\eta_l(t) = 1 - \frac{1}{\gamma t + 1}$, $\eta_u(t) = 1 + \frac{1}{\gamma t}$, and $\beta_{1t} = \beta_1/t$ for all $t \in [T]$ in Theorem 3. Then, we have:

$$R_T \le \frac{5\sqrt T}{1-\beta_1}(1+\gamma^{-1})\left(dD_\infty^2 + G_2^2\right) \tag{21}$$
###### Proof.

From the bound in Theorem 3, it follows that:

$$\begin{aligned}
R_T &\le \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + \sum_{i=1}^d\left[\eta_{1,i}^{-1} + \sum_{t=1}^T\beta_{1t}\,\eta_{t,i}^{-1}\right]\right] + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \\
&\le \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + d(1+\gamma^{-1}) + d\beta_1(1+\gamma^{-1})\sum_{t=1}^T\frac{1}{\sqrt t}\right] + \frac{(2\sqrt T - 1)(1+\gamma^{-1})G_2^2}{1-\beta_1} \\
&\le \frac{D_\infty^2 d(1+\gamma^{-1})}{2(1-\beta_1)}\left[6(\sqrt T - 1) + 1 + 2\beta_1\sqrt T\right] + \frac{2\sqrt T(1+\gamma^{-1})G_2^2}{1-\beta_1} \\
&\le \frac{\sqrt T(1+\gamma^{-1})}{1-\beta_1}\left[4dD_\infty^2 + 2G_2^2\right] \le \frac{5\sqrt T}{1-\beta_1}(1+\gamma^{-1})\left(dD_\infty^2 + G_2^2\right)
\end{aligned} \tag{22}$$

In the first inequality we used the facts that $\eta_{t,i}^{-1} \le \frac{\sqrt t}{\eta_l(t)} \le \sqrt t(1+\gamma^{-1})$, that $\beta_{1t} = \beta_1/t$, and that $R_\infty = \eta_u(1) = 1 + \gamma^{-1}$. In the second, that $M \le 3 + 2\gamma^{-1} \le 3(1+\gamma^{-1})$ from Proposition 1, along with $\sum_{t=1}^T t^{-1/2} \le 2\sqrt T$. In the third, that $6(\sqrt T - 1) + 1 + 2\beta_1\sqrt T \le 8\sqrt T$ since $\beta_1 < 1$. ∎

It is easy to check that the previous results also hold for AMSBound (Algorithm 3 in Luo et al. (2019)), since no assumptions were made on the point-wise behavior of $\eta_t$.

###### Remark 1.

Theorem 3 and Corollary 2 also hold for AMSBound.

## 5 Experiments on AdaBound and SGD

Unfortunately, the regret bound in Corollary 2 is minimized in the limit $\gamma \to \infty$, where AdaBound immediately degenerates to SGD. To inspect whether this fact has empirical value or is just an artifact of the presented analysis, we evaluate the performance of AdaBound when training neural networks on the CIFAR dataset (Krizhevsky, 2009) with an extremely small value for the parameter $\gamma$.

Note that $\gamma = 10^{-3}$ was used for the CIFAR results in Luo et al. (2019), for which we have $\eta_u(t) \le 2$ and $\eta_l(t) \ge 1/2$ after only 3 epochs ($\approx 390$ iterations per epoch for a batch size of 128), hence we believe results with considerably smaller/larger values for $\gamma$ are required to understand its impact on the performance of AdaBound.
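As a rough sanity check of how quickly the default bound functions tighten (assuming the normalized bound functions of Proposition 1, $\gamma = 10^{-3}$, and CIFAR's 50000 training images with batch size 128):

```python
gamma = 1e-3                     # default gamma in the released AdaBound code
iters_per_epoch = 50000 // 128   # CIFAR with batch size 128 -> 390 iterations
t = 3 * iters_per_epoch          # step count after 3 epochs

eta_l = 1 - 1 / (gamma * t + 1)  # lower bound function (normalized, alpha* = 1)
eta_u = 1 + 1 / (gamma * t)      # upper bound function

# After only 3 epochs, both bounds are already within a factor of 2 of the limit.
assert 0.5 < eta_l < 1 < eta_u < 2
```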

We trained a Wide ResNet-28-2 (Zagoruyko and Komodakis, 2016) using the same settings in Luo et al. (2019) and its released code (https://github.com/Luolc/AdaBound, version 2e928c3): 200 epochs, a weight decay of $5\times 10^{-4}$, a learning rate decay of factor 10 at epoch 150, and batch size of 128. For AdaBound, we used the author's implementation with $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\alpha^* = 0.1$, and for SGD we used $\alpha = 0.1$, $\beta_1 = 0.9$. Experiments were done in PyTorch.

To clarify our network choice, note that the model used in Luo et al. (2019) is not a ResNet-34 from He et al. (2016), but a variant used in DeVries and Taylor (2017), often referred to as ResNet-34. In particular, the ResNet-34 from He et al. (2016) consists of 3 stages and less than 0.5M parameters, while the network used in Luo et al. (2019) has 4 stages and around 21M parameters. The network we used has roughly 1.5M parameters.

Our preliminary results suggest that the final test performance of AdaBound is monotonically increasing with $\gamma$. More interestingly, there is no significant difference throughout training between large values of $\gamma$ and the limit $\gamma \to \infty$ (for the latter, the clipping maps every parameter-wise learning rate to $\alpha^*$, and AdaBound degenerates to momentum SGD from the first iteration).

To see why AdaBound with large $\gamma$ behaves so differently than SGDM, check that the momentum updates slightly differ between the two: for AdaBound, we have:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \tag{23}$$

while, for the implementation of SGDM used in Luo et al. (2019), we have:

$$m_t = \beta_1 m_{t-1} + (1-\kappa)g_t \tag{24}$$

where $\kappa$ is the dampening factor. The results in Luo et al. (2019) use $\kappa = 0$, which can cause $m_t$ to be larger by a factor of $\frac{1}{1-\beta_1} = 10$ compared to AdaBound. In principle, setting $\kappa = \beta_1$ in SGDM should yield dynamics similar to AdaBound's as long as $\gamma$ is not extremely small.
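The gap between the two momentum buffers, and the effect of setting the dampening equal to $\beta_1$, can be illustrated with a short simulation (a sketch; `sgdm_step` mirrors the PyTorch-style buffer update of Eq. (24)):

```python
def sgdm_step(m, g, beta1=0.9, dampening=0.0):
    """PyTorch-style momentum buffer: m <- beta1 * m + (1 - dampening) * g."""
    return beta1 * m + (1 - dampening) * g

g = 1.0                  # constant gradient, to expose steady-state behavior
m_ema, m_plain = 0.0, 0.0
for _ in range(300):
    m_ema = sgdm_step(m_ema, g, dampening=0.9)      # dampening = beta1: EMA (Eq. 23)
    m_plain = sgdm_step(m_plain, g, dampening=0.0)  # PyTorch/TF default

assert abs(m_ema - 1.0) < 1e-9     # EMA buffer converges to g
assert abs(m_plain - 10.0) < 1e-9  # default buffer converges to g / (1 - beta1)
```

At steady state the default buffer is $1/(1-\beta_1) = 10\times$ larger than the EMA buffer, which is why the two variants need very different learning rates to behave comparably.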

Figure 1 presents our main empirical results: setting an extremely small $\gamma$ causes noticeable performance degradation compared to larger values in AdaBound, as Corollary 2 might suggest. Moreover, setting $\kappa = \beta_1$ in SGDM causes a dramatic performance increase throughout training. In particular, it slightly outperforms AdaBound in terms of final test accuracy (averaged over 5 runs), while being comparably fast and consistent in terms of progress during optimization.

We believe SGDM with $\kappa = \beta_1$ (which is currently not the default in either PyTorch or TensorFlow) might be a reasonable alternative to adaptive gradient methods in some settings, as it also requires less computational resources: per step, AdaBound and Adam update two moment buffers and perform element-wise divisions and square roots (plus clipping, for AdaBound), while SGDM updates a single momentum buffer; in terms of memory, AdaBound and Adam store two auxiliary vectors of size $d$ against SGDM's one. Moreover, AdaBound has 5 hyperparameters ($\alpha$, $\beta_1$, $\beta_2$, $\alpha^*$, $\gamma$), while SGDM with $\kappa = \beta_1$ has only 2 ($\alpha$, $\beta_1$). Studying the effectiveness of ‘dampened’ SGDM, however, requires extensive experiments which are out of the scope of this technical report.

Lastly, we evaluated whether performing bias correction on this form of SGDM affects its performance. More specifically, we divide the learning rate at step $t$ by a factor of $1 - \beta_1^t$. We observed that bias correction has no significant effect on the average performance, but yields smaller variance: the standard deviation of the final test accuracy over 5 runs decreased when bias correction was used.
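The bias correction described above amounts to the following (a one-line sketch; the function name is ours):

```python
def bias_corrected_lr(alpha, beta1, t):
    """Divide the step size at (1-indexed) step t by 1 - beta1**t,
    mirroring Adam-style first-moment bias correction."""
    return alpha / (1 - beta1 ** t)
```

Early steps get a proportionally larger step size (a factor of 10 at $t = 1$ for $\beta_1 = 0.9$), compensating for the zero-initialized momentum buffer; the factor decays to 1 as $t$ grows.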

## 6 Discussion

In this technical report, we identified issues in the proof of the main Theorem of Luo et al. (2019), which presents a $O(\sqrt T)$ regret rate guarantee for AdaBound. We presented an instance where the statement does not hold, and provided a $O(\sqrt T)$ regret guarantee under different, and arguably less restrictive, assumptions. Finally, we observed empirically that AdaBound with a theoretically optimal $\gamma$ indeed yields superior performance, although it degenerates to a specific form of momentum SGD. Our experiments suggest that this form of SGDM (with a dampening factor equal to its momentum) performs competitively to AdaBound on CIFAR.

### Acknowledgements

We are in debt to Rachit Nimavat for proofreading the manuscript and the extensive discussion, and thank Sudarshan Babu and Liangchen Luo for helpful comments.

## References

• N. Cesa-Bianchi, A. Conconi, and C. Gentile (2006) On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory. Cited by: §1.
• J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Cited by: §1.
• T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552. Cited by: §5.
• J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. JMLR. Cited by: §1.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. CVPR. Cited by: §1, §5.
• D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §1.
• A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §5.
• L. Luo, Y. Xiong, Y. Liu, and X. Sun (2019) Adaptive gradient methods with dynamic bound of learning rate. ICLR (arXiv:1902.09843). Cited by: On the Convergence of AdaBound and its Connection to SGD, §1, §3, §4, §5, §6, Theorem 1.
• S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. ICLR. Cited by: §1, §3, §3.
• T. Tieleman and G. Hinton (2012) Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. Cited by: §1.
• A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural Discrete Representation Learning. arXiv:1711.00937. Cited by: §1.
• A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017) The Marginal Value of Adaptive Gradient Methods in Machine Learning. NIPS. Cited by: §1.
• S. Zagoruyko and N. Komodakis (2016) Wide residual networks. BMVC. Cited by: §5.
• M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. ICML. Cited by: §1.