
## 1 Introduction

Along with the rise of deep learning, various first-order stochastic optimization methods have emerged. Among them, the most fundamental one is stochastic gradient descent (SGD), and Nesterov's Accelerated Gradient (NAG) method Nesterov (1983) is also a well-known acceleration algorithm. Recently, many adaptive stochastic optimization methods have been proposed, such as AdaGrad Duchi et al. (2010), RMSProp Tieleman and Hinton (2012), AdaDelta Zeiler (2012) and Adam Kingma and Ba (2014). These algorithms can be written in the following general form:

$$x_{t+1} = x_t - \frac{\alpha_t}{\psi(g_1,\dots,g_t)}\,\phi(g_1,\dots,g_t), \tag{1}$$

where $g_i$ is the gradient obtained in the $i$-th time step, $\alpha_t/\psi(g_1,\dots,g_t)$ the adaptive learning rate, and $\phi(g_1,\dots,g_t)$ the gradient estimation. There have been extensive studies on the design of gradient estimations, which can be traced back to classical momentum methods Polyak (1964) and NAG Nesterov (1983). In this paper, however, we focus more on how to understand and improve the adaptive learning rate.

Adam Kingma and Ba (2014) is perhaps the most widely used adaptive stochastic optimization method which uses an exponential moving average (EMA) to estimate the square of the gradient scale, so that the learning rate can be adjusted adaptively. More specifically, Adam takes the form of (1) with

$$\psi(g_1,\dots,g_t) = \sqrt{V_t}, \qquad V_t = \mathrm{diag}(v_t), \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2. \tag{2}$$

We shall call $v_t$ the re-scaling term of Adam and its variants, since it serves as a coordinate-wise re-scaling of the gradients. Despite its fast convergence and ease of implementation, Adam is also known for its non-convergence and poor generalization in some cases Reddi et al. (2018), Wilson et al. (2017). More recently, Balles and Hennig (2018) pointed out, both theoretically and empirically, that generalization is mainly determined by the sign effect rather than the adaptive learning rate, and that the sign effect is problem-dependent. In this paper, we mainly deal with the non-convergence issue and only empirically compare the generalization ability of different Adam variants.
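As a rough illustration of how (1) and (2) fit together, the following minimal sketch performs one coordinate-wise Adam-style step. The function name `adam_step`, the default hyper-parameters and the small constant `eps` are assumptions made for this example; bias correction is omitted, and the gradient estimate $\phi$ is taken to be the exponential moving average of gradients that Adam uses.

```python
import math

def adam_step(x, g, m, v, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One coordinate-wise step in the form of (1)-(2).

    x, g, m, v are equal-length lists of floats.
    eps is a hypothetical safeguard against division by zero; it is not part of (2).
    """
    new_x, new_m, new_v = [], [], []
    for xi, gi, mi, vi in zip(x, g, m, v):
        mi = beta1 * mi + (1.0 - beta1) * gi        # phi: EMA of the gradients
        vi = beta2 * vi + (1.0 - beta2) * gi * gi   # v_t: EMA of squared gradients, as in (2)
        new_x.append(xi - alpha * mi / (math.sqrt(vi) + eps))  # divide by psi = sqrt(v_t)
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v

x, m, v = adam_step([1.0], [2.0], [0.0], [0.0])
```

Because $v_t$ enters only through the division, each coordinate is re-scaled independently, which is what makes the re-scaling term the natural object of study.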

As for the non-convergence issue, Reddi et al. (2018) suggested that the EMA of $v_t$ in Adam is the cause. The main problem lies in the following quantity:

$$\Gamma_{t+1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t},$$

which essentially measures the change in the inverse of the learning rate with respect to time. Algorithms that use EMA to estimate the scale of the gradients cannot guarantee the positive semi-definiteness of $\Gamma_t$, and that causes the non-convergence of Adam. To fix this issue, Reddi et al. (2018) proposed AMSGrad, which adds one more step to (2). AMSGrad is claimed by its authors to have a “long-term memory” of past gradients.
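To make the role of this quantity concrete, the sketch below evaluates it for a scalar Adam-style $v_t$ with $\alpha_t = \alpha/\sqrt{t}$ on a hand-picked gradient sequence (one large gradient followed by small ones). The sequence, the function name and the hyper-parameters are illustrative assumptions, not taken from the paper.

```python
import math

def adam_gamma_sequence(grads, alpha=0.1, beta2=0.9):
    """Return [Gamma_2, ..., Gamma_T] for scalar gradients, with alpha_t = alpha / sqrt(t)."""
    v = 0.0
    prev = None
    gammas = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g       # EMA of squared gradients
        cur = math.sqrt(v) / (alpha / math.sqrt(t))  # sqrt(V_t) / alpha_t
        if prev is not None:
            gammas.append(cur - prev)                # Gamma_t
        prev = cur
    return gammas

grads = [10.0] + [0.01] * 29   # one large gradient, then small ones
gammas = adam_gamma_sequence(grads)
print(min(gammas) < 0)          # the EMA forgets the large gradient, so some Gamma_t turn negative
```

Once the large gradient has decayed out of the moving average faster than $\sqrt{t}$ grows, the difference becomes negative, which is the situation the convergence analysis cannot handle.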

Another explanation of the cause of non-convergence was recently proposed by Zhou et al. (2018). The authors observed that Adam may diverge because a small gradient may have a large step size, which leads to a large update. Therefore, if the small $g_t$ with large step size is often in the wrong direction, it could lead to divergence. Thus, they proposed a modification to Adam called AdaShift, which replaces $g_t^2$ with $g_{t-n}^2$ for some manually chosen $n$ when calculating $v_t$.

## 2 Related Work

Adam is widely used in both academia and industry. However, it is also one of the least well-understood algorithms. In recent years, some remarkable works have provided us with a better understanding of the algorithm and proposed different variants of it. Most of these works focus on how to interpret or modify the re-scaling term $v_t$ of (2).

As mentioned above, Reddi et al. (2018) and Zhou et al. (2018) focused on the non-convergence issue of Adam and proposed their own modified algorithms. Wilson et al. (2017) pointed out the generalization issue of adaptive optimization algorithms. Based on the assumption that $v_t$ is an estimate of the second moment of $g_t$, Balles and Hennig (2018) dissected Adam into a sign-based direction and a variance adaptation magnitude. They also pointed out that the sign-based direction is the decisive factor for generalization performance, and that it is problem-dependent. This in a way addressed the generalization issue raised in Wilson et al. (2017).

However, the interpretation of $v_t$ as an estimate of the second moment may not be correct, since Chen and Gu (2019) showed that $v_t^{1/2}$ in the Adam update (2) can be replaced by $v_t^{p}$ for any $0 < p \le \frac{1}{2}$. The modified algorithm is called Padam. In our supplementary material, we also prove a convergence theorem for a “p-norm” form of NosAdam, where the re-scaling term can essentially be viewed as a “p-moment” of $g_t$. These discoveries cast doubt on the second moment assumption, since neither the convergence analysis nor the empirical performance seems to depend much on it.

The true role of $v_t$, however, remains a mystery. In AdaGrad Duchi et al. (2010), which is a special case of NosAdam, the authors used the metaphor that “the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features.” They were suggesting that $v_t$ to some extent balances the update speeds of different features according to their abundance in the data set. This understanding might be supported by a previous work called SWATS (Switching from Adam to SGD) Keskar and Socher (2017), which uses Adam for earlier epochs and then fixes the re-scaling term $v_t$ for later epochs. This suggests that there may be some sort of “optimal” re-scaling term, and that we can keep using it once we obtain a good enough estimate.

Despite all the previous efforts, our understanding of the re-scaling term is still very limited. In this paper, we investigate the issue from a loss landscape approach, and this provides us with some deeper understanding of when and how different Adam-like algorithms can perform well or poorly.

## 3 Nostalgic Adam Algorithm

In this section, we introduce the Nostalgic Adam (NosAdam) algorithm, followed by a discussion of its convergence. Let us first consider a general situation where we allow the parameter $\beta_2$ in Equation (2) to change with time $t$. Without loss of generality, we may let $\beta_{2,t} = B_{t-1}/B_t$. Then, the NosAdam algorithm reads as in Algorithm 1. As for Adam and its variants, the condition $\Gamma_t \succeq 0$ is crucial in ensuring convergence. We will also see that to ensure positive semi-definiteness of $\Gamma_t$, the algorithm naturally requires weighting the past gradients more than the more recent ones when calculating $v_t$. To see this, we first present the following lemma.

###### Lemma 3.1.

The positive semi-definiteness of $\frac{V_t}{\alpha_t^2} - \frac{V_{t-1}}{\alpha_{t-1}^2}$ is tightly satisfied if $\frac{B_t}{t}$ is non-increasing.

###### Proof.
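$$\begin{aligned}
\frac{V_t}{\alpha_t^2} &= \frac{t}{\alpha^2}\sum_{j=1}^{t}\Big(\prod_{k=1}^{t-j}\beta_{2,t-k+1}\Big)(1-\beta_{2,j})\,g_j^2 \\
&= \frac{t}{\alpha^2}\sum_{j=1}^{t}\frac{B_{t-1}}{B_t}\cdots\frac{B_j}{B_{j+1}}\,\frac{B_j-B_{j-1}}{B_j}\,g_j^2 \\
&= \frac{t}{B_t\,\alpha^2}\sum_{j=1}^{t} b_j\, g_j^2 \;\ge\; \frac{t-1}{B_{t-1}\,\alpha^2}\sum_{j=1}^{t-1} b_j\, g_j^2 \\
&= \frac{V_{t-1}}{\alpha_{t-1}^2}
\end{aligned}$$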

Here, “tightly satisfied” in the lemma means that the conclusion cannot be strengthened: if $\frac{B_t}{t}$ is increasing, then $\frac{V_t}{\alpha_t^2} \succeq \frac{V_{t-1}}{\alpha_{t-1}^2}$ can very easily be violated, since $g_t$ can be infinitesimal.

Again, without loss of generality, we can write $B_t$ as $\sum_{k=1}^{t} b_k$. Then, it is not hard to see that $\frac{B_t}{t}$ is non-increasing if and only if $b_t$ is non-increasing. Noting that $v_t = \sum_{k=1}^{t} \frac{b_k}{B_t}\, g_k^2$, we can see that a sufficient condition for the positive semi-definiteness of $\Gamma_t$ is that in the weighted average $v_t$, the weights of the gradients should be non-increasing w.r.t. $t$. In other words, we should weight the past gradients more than the more recent ones.
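The recursion described in this section can be sketched as follows for the hyperharmonic choice $b_k = k^{-\gamma}$. This is a minimal, unconstrained sketch under stated assumptions: the projection step is omitted, `eps` is a hypothetical numerical safeguard, `grad_fn` is a user-supplied callable, and the default hyper-parameters are illustrative only.

```python
import math

def nosadam_hh(grad_fn, x0, steps, alpha=0.1, beta1=0.9, gamma=0.1, eps=1e-8):
    """Sketch of a NosAdam-style loop with b_k = k**(-gamma) and beta_{2,t} = B_{t-1} / B_t.

    grad_fn(x, t) must return the gradient as a list with the same length as x.
    Returns the final iterate and the final re-scaling term v.
    """
    x = list(x0)
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    B_prev = 0.0                              # B_0 = 0
    for t in range(1, steps + 1):
        g = grad_fn(x, t)
        B_t = B_prev + t ** (-gamma)          # B_t = sum_{k<=t} b_k
        beta2_t = B_prev / B_t                # time-varying beta_2
        alpha_t = alpha / math.sqrt(t)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2_t * v[i] + (1.0 - beta2_t) * g[i] * g[i]
            x[i] -= alpha_t * m[i] / (math.sqrt(v[i]) + eps)
        B_prev = B_t
    return x, v

# With gamma = 0 every b_k equals 1, so v_t is the plain average of squared gradients.
_, v_avg = nosadam_hh(lambda x, t: [float(t)], [0.0], steps=4, gamma=0.0)
```

With $\gamma = 0$ the weights are uniform, which gives the AdaGrad-like averaging mentioned in the related work; larger $\gamma$ shifts more weight toward the earliest gradients.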

From Algorithm 1, we can see that $v_t$ can either decrease or increase depending on the relationship between $v_{t-1}$ and $g_t^2$, which is the reason why NosAdam circumvents the flaw of AMSGrad (Figure 4). Convergence of NosAdam is also guaranteed, as stated by the following theorem.

###### Theorem 3.2 (Convergence of NosAdam).

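Let $\{x_t\}$ and $\{v_t\}$ be the sequences defined in Algorithm 1, $\alpha_t = \alpha/\sqrt{t}$, $\beta_{1,1} = \beta_1$, $\beta_{1,t} \le \beta_1$ for all $t$. Assume that $\mathcal{F}$ has bounded diameter $D_\infty$ and $\|\nabla f_t(x)\|_\infty \le G_\infty$ for all $t$ and $x \in \mathcal{F}$. Furthermore, let $\{\beta_{2,t}\}$ be such that the following conditions are satisfied: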

1. $\dfrac{B_t}{t} \le \dfrac{B_{t-1}}{t-1}$
2. $\dfrac{B_t}{t\, b_t^2} \ge \dfrac{B_{t-1}}{(t-1)\, b_{t-1}^2}$

Then for $\{x_t\}$ generated using NosAdam, we have the following bound on the regret

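$$R_T \le \frac{D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T}\,v_{T,i}^{1/2} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{t=1}^{T}\sum_{i=1}^{d}\frac{\beta_{1,t}\,v_{t,i}^{1/2}}{\alpha_t} + \frac{\alpha\beta_1}{(1-\beta_1)^3}\sum_{i=1}^{d}\sqrt{\frac{B_T}{T}\,\frac{\sum_{t=1}^{T} b_t\, g_{t,i}^2}{b_T^2}}$$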

Here, we have adopted the notation of online optimization introduced in Zinkevich (2003). At each time step $t$, the optimization algorithm picks a point $x_t$ in its feasible set $\mathcal{F}$. Let $f_t$ be the loss function corresponding to the underlying mini-batch, so that the algorithm incurs loss $f_t(x_t)$. We evaluate our algorithm using the regret, which is defined as the sum of all the previous differences between the loss of the online prediction and the loss incurred by the best fixed parameter point in $\mathcal{F}$ over all the previous steps, i.e.

$$R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x\in\mathcal{F}}\sum_{t=1}^{T} f_t(x). \tag{3}$$

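Denote by $\mathcal{S}_+^d$ the set of all positive definite $d \times d$ matrices. The projection operator $\Pi_{\mathcal{F},A}(y)$ for $A \in \mathcal{S}_+^d$ is defined as $\arg\min_{x\in\mathcal{F}} \|A^{1/2}(x-y)\|$ for $y \in \mathbb{R}^d$. Finally, we say $\mathcal{F}$ has bounded diameter $D_\infty$ if $\|x-y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$.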
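For intuition about the projection operator, consider the special case of a box-shaped feasible set and a diagonal positive definite matrix. Assuming the projection minimizes the weighted distance $\|A^{1/2}(x-y)\|$ over the feasible set (the form used in the appendix), the objective separates across coordinates, so the minimizer is plain coordinate-wise clipping regardless of the weights. The function name and the box-shaped set below are illustrative assumptions.

```python
def project_box_weighted(y, lower, upper, a_diag):
    """Weighted projection of y onto the box [lower, upper]^d for A = diag(a_diag).

    The objective sum_i a_i * (x_i - y_i)**2 is separable and every a_i > 0,
    so each coordinate is minimized independently by clipping y_i to the box.
    """
    if any(a <= 0 for a in a_diag):
        raise ValueError("a_diag must be strictly positive")
    return [min(max(yi, lower), upper) for yi in y]

projected = project_box_weighted([2.0, -3.0, 0.5], -1.0, 1.0, [1.0, 4.0, 9.0])
```

For feasible sets that are not axis-aligned boxes, the weights do change the projected point, and no such shortcut is available in general.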

One notable characteristic of NosAdam, which makes it rather different from the analysis by Reddi et al. (2018), is that the conditions on $B_t$ and $b_t$ are data-independent and very easy to check. In particular, if we choose $B_t$ as a hyperharmonic series, i.e. $B_t = \sum_{k=1}^{t} k^{-\gamma}$, then the convergence criteria are automatically satisfied. We shall call this special case NosAdam-HH, and its convergence result is summarized in the following corollary.

###### Corollary 3.2.1.

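Suppose $\beta_{1,t} = \beta_1\lambda^{t-1}$, $b_k = k^{-\gamma}$ with $\gamma \ge 0$, thus $B_t = \sum_{k=1}^{t} k^{-\gamma}$, and $\alpha_t = \alpha/\sqrt{t}$ in Algorithm 1. Then $B_t$ and $b_t$ satisfy the constraints in Theorem A.1, and we have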

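$$R_T \le \frac{D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T}\,v_{T,i}^{1/2} + \frac{D_\infty^2\, G_\infty\,\beta_1}{2(1-\beta_1)}\,\frac{1}{(1-\lambda)^2}\cdot d + \frac{2\alpha\beta_1}{(1-\beta_1)^3}\,G_\infty\sqrt{T}$$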

Our theory shows that the proposed NosAdam achieves a convergence rate of $O(1/\sqrt{T})$, which is so far the best known convergence rate.
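Because the two conditions of the convergence theorem involve only $b_t$ and $B_t$, they can be checked numerically without any data. The sketch below does so for power-law weights $b_k = k^{-\gamma}$, reading condition 2 as $\frac{B_t}{t\,b_t^2} \ge \frac{B_{t-1}}{(t-1)\,b_{t-1}^2}$; the function name and the floating-point tolerance are assumptions of this example.

```python
def power_weights_satisfy_conditions(gamma, T, tol=1e-12):
    """Check conditions 1 and 2 for b_k = k**(-gamma) at every step t = 2..T."""
    B_prev, b_prev = None, None
    B = 0.0
    for t in range(1, T + 1):
        b = t ** (-gamma)
        B += b
        if t >= 2:
            # Condition 1: B_t / t is non-increasing.
            cond1 = B / t <= B_prev / (t - 1) + tol
            # Condition 2: B_t / (t * b_t^2) is non-decreasing.
            cond2 = B / (t * b * b) >= B_prev / ((t - 1) * b_prev * b_prev) - tol
            if not (cond1 and cond2):
                return False
        B_prev, b_prev = B, b
    return True

print(power_weights_satisfy_conditions(0.1, 1000))   # hyperharmonic weights
print(power_weights_satisfy_conditions(-0.5, 10))    # increasing weights, for contrast
```

Weights that grow with $k$ put more mass on recent gradients and already break condition 1 at $t = 2$, which matches the requirement that past gradients be weighted at least as much as recent ones.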

## 4 Why Nostalgic?

In this section, we investigate more about the mechanism behind Adam and AMSGrad, and analyze the pros and cons of being “nostalgic”.

As mentioned in Section 1, Reddi et al. (2018) proved that if $\Gamma_t$ is positive semi-definite, Adam converges. Otherwise, it may diverge. An example of divergence given by Reddi et al. (2018) is

$$f_t(x) = \begin{cases} Cx, & t \bmod 3 = 1 \\ -x, & \text{otherwise}, \end{cases} \tag{4}$$

where $C$ is slightly larger than 2. The correct optimization direction should be $-1$, while Adam would go towards $1$. To fix this, they proposed AMSGrad, which ensures $\Gamma_t \succeq 0$ by updating $v_t$ as follows

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad \hat{v}_t = \max(\hat{v}_{t-1}, v_t),$$

where $\hat{v}_t$ is used in the update step.
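The example (4) and the AMSGrad modification can be explored with a small simulation. The following is a sketch under assumptions that the text above does not state: the feasible set is taken to be $[-1, 1]$, momentum is switched off, $C = 2.5$, $\beta_2 = 1/(1+C^2)$ and $\alpha_t = \alpha/\sqrt{t}$. All names are hypothetical.

```python
import math

C = 2.5  # "slightly larger than 2"; the exact value is an assumption

def grad(t):
    """Gradient of f_t in (4): C when t mod 3 == 1, otherwise -1."""
    return C if t % 3 == 1 else -1.0

def total_loss(x_fixed, T):
    """Cumulative loss of a fixed point over T rounds of (4)."""
    return sum(grad(t) * x_fixed for t in range(1, T + 1))

def run(T, use_max, alpha=0.1, x0=0.0):
    """Scalar Adam (use_max=False) or AMSGrad (use_max=True) without momentum."""
    beta2 = 1.0 / (1.0 + C * C)
    x, v, v_hat = x0, 0.0, 0.0
    for t in range(1, T + 1):
        g = grad(t)
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = max(v_hat, v) if use_max else v   # AMSGrad keeps the running maximum
        x -= (alpha / math.sqrt(t)) * g / math.sqrt(v_hat)
        x = min(1.0, max(-1.0, x))                # projection onto the assumed set [-1, 1]
    return x

x_adam = run(3000, use_max=False)
x_ams = run(3000, use_max=True)
print(x_adam, x_ams, total_loss(-1.0, 3000), total_loss(1.0, 3000))
```

Over each period of three rounds the gradients sum to $C - 2 > 0$, so the best fixed point is $-1$; comparing the two printed iterates shows how keeping the running maximum changes the direction of drift.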

However, this example is not representative of real situations. Also, the explanation of “long-term memory” by Reddi et al. (2018) is not very illustrative. In the remaining part of this section, we discuss some more realistic scenarios and try to understand the pros and cons of different algorithms.

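We start by analyzing the different weighting strategies used when calculating $v_t$. For Adam,

$$v_t^{(\mathrm{Adam})} = \sum_{k=1}^{t} (1-\beta_2)\,\beta_2^{\,t-k}\, g_k^2,$$

and the weight increases exponentially in $k$. For NosAdam,

$$v_t^{(\mathrm{NosAdam})} = \sum_{k=1}^{t} \frac{b_k}{B_t}\, g_k^2,$$

and for NosAdam-HH, $b_k = k^{-\gamma}$ is the $k$-th term of a hyperharmonic series. For AMSGrad, $\hat{v}_t$ is data-dependent and therefore cannot be expressed explicitly. However, $\hat{v}_t$ is chosen to be the largest in $\{v_s\}_{s \le t}$. Therefore, it can be seen as a shifted version of $v_t$, i.e. $\hat{v}_t = v_{t-n}$, where $n$ depends on the data. This is similar to AdaShift, where $n$ is chosen manually. Figure 1 plots the first 100 weights of Adam, NosAdam and AMSGrad, where $\beta_2$, $\gamma$ and $n$ are chosen as 0.9, 0.1 and 20, respectively.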
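The two explicit weighting schemes can be tabulated directly. The sketch below computes the weight that each of the first $t$ squared gradients receives in $v_t$ for Adam (unrolling the EMA from $v_0 = 0$) and for NosAdam-HH, using $\beta_2 = 0.9$ and $\gamma = 0.1$ as in the text; the function names are illustrative.

```python
def adam_weights(t, beta2=0.9):
    """Weight of g_k^2 in Adam's v_t for k = 1..t: (1 - beta2) * beta2**(t - k)."""
    return [(1.0 - beta2) * beta2 ** (t - k) for k in range(1, t + 1)]

def nosadam_hh_weights(t, gamma=0.1):
    """Weight of g_k^2 in NosAdam-HH's v_t for k = 1..t: b_k / B_t with b_k = k**(-gamma)."""
    b = [k ** (-gamma) for k in range(1, t + 1)]
    B_t = sum(b)
    return [bk / B_t for bk in b]

w_adam = adam_weights(100)
w_nos = nosadam_hh_weights(100)
print(w_adam[0], w_adam[-1])   # oldest vs. newest gradient under Adam
print(w_nos[0], w_nos[-1])     # oldest vs. newest gradient under NosAdam-HH
```

Adam's weights grow geometrically toward the newest gradient, while the NosAdam-HH weights decay slowly from the oldest gradient onward and always sum to one.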

From the above analysis, we can see that $v_t$ of Adam is mainly determined by its most recent gradients. Therefore, when $g_t$ keeps being small, the adaptive learning rate can be large, which may lead to oscillation of the sequence and an increased chance of being trapped in a local minimum. On the other hand, NosAdam adopts a more stable calculation of $v_t$, since it relies on all the past gradients.

We support the above discussion with an example of an objective function with a bowl-shaped landscape, where the global minimum is at the bottom of the bowl with many local minima surrounding it. The explicit formula of the objective function is

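$$f(x,y,z) = -a\,e^{-b\left((x-\pi)^2+(y-\pi)^2+(z-\pi)^2\right)} - c\sum_{i}\cos(x)\cos(y)\,e^{-\beta\left(\left(x-r\sin\left(\frac{i}{2}\right)-\pi\right)^2+\left(y-r\cos\left(\frac{i}{2}\right)-\pi\right)^2\right)}.$$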

Figure 2a shows one slice of the function. In the function, $a$ and $b$ determine the depth and width of the global minimum, and $c$, $r$, $\beta$ determine the depth, location and width of the local minima. In this example, $a$, $b$, $c$, $r$, $\beta$ are set to 30, 0.007, 0.25, 1, 20, respectively.
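For readers who want to experiment with this landscape, the sketch below evaluates the objective with the parameter values quoted above. Several points are assumptions of this example rather than facts from the paper: the $(z-\pi)^2$ term is placed inside the first exponent, the angle of the $i$-th local minimum is read as $i/2$, and the number of local minima `n_bumps` is hypothetical because the range of $i$ is not given.

```python
import math

def bowl(x, y, z, a=30.0, b=0.007, c=0.25, r=1.0, beta=20.0, n_bumps=12):
    """Bowl-shaped objective with small local minima placed on a circle of radius r.

    n_bumps is a hypothetical choice for the unspecified range of the index i.
    """
    # Wide global basin centered at (pi, pi, pi).
    main = -a * math.exp(-b * ((x - math.pi) ** 2 + (y - math.pi) ** 2 + (z - math.pi) ** 2))
    # Narrow wells around the center, assuming angle i / 2 for the i-th well.
    bumps = 0.0
    for i in range(1, n_bumps + 1):
        cx = r * math.sin(i / 2) + math.pi
        cy = r * math.cos(i / 2) + math.pi
        bumps += math.cos(x) * math.cos(y) * math.exp(-beta * ((x - cx) ** 2 + (y - cy) ** 2))
    return main - c * bumps

center_value = bowl(math.pi, math.pi, math.pi)
far_value = bowl(math.pi + 100.0, math.pi + 100.0, math.pi + 100.0)
```

Evaluating `bowl` on a grid in $x$ and $y$ for a fixed $z$ gives a slice that can be compared against the qualitative description of the figure.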

Figure 2b shows different trajectories of Adam and NosAdam when they are initialized at the same point on the side of the bowl. As expected, the trajectory of Adam (yellow) passes the global minimum and ends up trapped in one of the surrounding valleys, while NosAdam (red) gradually converges to the global minimum.

There are also situations in which NosAdam can work poorly. Because NosAdam is nostalgic, it requires a relatively good initial point to achieve good performance, though this is commonly required by most optimization algorithms. Adam, however, can sometimes be less affected by bad initialization due to its specific way of calculating $v_t$. This gives it a chance of jumping out of a local minimum (and a chance of jumping out of the global minimum as well, as shown in Figure 2). To demonstrate this, we let both Adam and NosAdam initialize in valley A (see Figure 5). We can see that the trajectory of Adam manages to jump out of the valley, while it is more difficult for NosAdam to do so.

We note that although NosAdam requires good initialization, this does not necessarily mean initializing near the global minimum. Since the algorithm is nostalgic, as long as the initial gradients point in the right direction, the algorithm may still converge to the global minimum even though the initialization is far away from it. As we can see from Figure 4, NosAdam converges because all of the gradients are good ones at the beginning of the algorithm, which generates enough momentum to help the sequence dash through the region with sharp local minima.

Like any Adam-like algorithm, the convergence of NosAdam depends on the loss landscape and the initialization. However, if the landscape is as shown in the above figures, then NosAdam has a better chance of converging than Adam and AMSGrad. In practice, it is therefore helpful to first examine the loss landscape before selecting an algorithm, although doing so is time-consuming in general. Nonetheless, earlier studies showed that neural networks with skip connections, like ResNet and DenseNet, lead to coercive loss functions similar to the one shown in the above figures Li et al. (2018).

## 5 Experiments

In this section, we conduct some experiments to compare NosAdam with Adam and its variant AMSGrad. We consider the task of multi-class classification using logistic regression, multi-layer fully connected neural networks and deep convolutional neural networks on MNIST LeCun and CIFAR-10 Krizhevsky et al. The results generally indicate that NosAdam is a promising algorithm that works well in practice.

Throughout our experiments, we fix $\beta_1$ to be 0.9 and $\beta_2$ to be 0.999 for Adam and AMSGrad, and search over $\gamma$ for NosAdam. The initial learning rate is chosen from a candidate set, and the results are reported using the best set of hyper-parameters. All the experiments are done using PyTorch 0.4.

Logistic Regression To investigate the performance of the algorithms on convex problems, we evaluate Adam, AMSGrad and NosAdam on a multi-class logistic regression problem using the MNIST dataset. To be consistent with the theory, we set the step size $\alpha_t = \alpha/\sqrt{t}$. We set the minibatch size to be 128. According to Figure (a), the three algorithms have very similar performance.

Multilayer Fully Connected Neural Networks We first train a simple fully connected neural network with 1 hidden layer (with 100 neurons and ReLU as the activation function) for the multi-class classification problem on MNIST. We use a constant step size $\alpha_t = \alpha$ throughout this set of experiments. The results are shown in Figure (b). We can see that NosAdam slightly outperforms AMSGrad, while Adam is much worse than both NosAdam and AMSGrad and oscillates a lot. This is due to the difference in the definition of $v_t$ for each algorithm: in AMSGrad and NosAdam, $v_t$ gradually becomes stationary and stays at a good re-scaling value, while $v_t$ in Adam does not have such a property.

Deep Convolutional Neural Networks Finally, we train a deep convolutional neural network on CIFAR-10. Wide Residual Network Zagoruyko and Komodakis (2016) is known to be able to achieve high accuracy with far fewer layers than ResNet He et al. (2015). In our experiment, we choose Wide ResNet28. The model is trained on 4 GPUs with a minibatch size of 100. The initial learning rate is decayed at epoch 50 and epoch 100 by a factor of 0.1. In our experiments, the optimal performance is usually achieved when the learning rate is around 0.02 for all three algorithms. For reproducibility, an anonymous link to the code will be provided in the supplementary material.
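The step decay described above can be written as a small helper. This is only a sketch: the function name is hypothetical, and it assumes epochs are counted from 0 and that each decay takes effect from the milestone epoch onward, which the text does not specify.

```python
def step_decay_lr(initial_lr, epoch, milestones=(50, 100), factor=0.1):
    """Learning rate at a given epoch when it is multiplied by `factor` at each milestone."""
    n_decays = sum(1 for m in milestones if epoch >= m)  # milestones already reached
    return initial_lr * factor ** n_decays

schedule = [step_decay_lr(0.02, e) for e in (0, 49, 50, 99, 100)]
```

In a PyTorch training loop the same behavior is usually obtained from a built-in multi-step scheduler rather than a hand-written function.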

Our results are shown in Figure 7. We observe that NosAdam works slightly better than AMSGrad and Adam in terms of both convergence speed and generalization. This indicates that NosAdam is a promising alternative to Adam and its variants.

## 6 Discussion

In this paper, we suggested that the past gradients should be weighted more when designing the adaptive learning rate. In fact, our original intuition came from the mathematical analysis of the convergence of Adam-like algorithms. Based on this observation, we then proposed a new algorithm called Nostalgic Adam (NosAdam) and provided a convergence analysis. We also discussed the pros and cons of NosAdam compared to Adam and AMSGrad using a simple example, which gave us a better idea of when NosAdam can be effective.

For future work, we believe that loss landscape analysis and the design of a strategy to choose different algorithms adaptively based on the loss landscape would be worth pursuing. Hopefully, we can design an optimization algorithm that adaptively adjusts its re-scaling term in order to fully exploit the local geometry of the loss landscape.

## Acknowledgments

This work would not have existed without the support of BICMR and School of Mathematical Sciences, Peking University. Bin Dong is supported in part by Beijing Natural Science Foundation (Z180001).

## Appendix A Convergence of p-NosAdam

In this appendix, we use the same notation as in the paper “Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate”. We are going to prove a more general convergence theorem. In the original paper, we propose NosAdam, as shown in Algorithm 1. In fact, NosAdam can be considered a particular case of a more general algorithm, in which we replace $g_t^2$ in the calculation of $v_t$ by $g_t^p$, and $\sqrt{V_t}$ in the update equation by $V_t^{1/p}$. We call this algorithm p-NosAdam, as shown in Algorithm 2. NosAdam is p-NosAdam when $p = 2$.
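The generalization from NosAdam to p-NosAdam can be sketched as a single scalar update. This is an illustrative sketch, not the paper's Algorithm 2: it assumes the hyperharmonic weights $b_k = k^{-\gamma}$, a step size $\alpha_t = \alpha / t^{1/p}$ as suggested by the proof of Lemma A.3, the use of $|g_t|^p$ so that non-even $p$ stay well defined, and a hypothetical safeguard `eps`.

```python
def p_nosadam_update(x, g, m, v, t, B_prev, p=2.0, alpha=0.1, beta1=0.9, gamma=0.1, eps=1e-8):
    """One scalar p-NosAdam-style step; returns the new (x, m, v, B_t)."""
    B_t = B_prev + t ** (-gamma)                        # B_t = B_{t-1} + b_t
    beta2_t = B_prev / B_t                              # time-varying beta_2
    alpha_t = alpha / t ** (1.0 / p)                    # assumed step-size schedule
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2_t * v + (1.0 - beta2_t) * abs(g) ** p     # "p-moment" accumulator
    x = x - alpha_t * m / (v ** (1.0 / p) + eps)        # re-scale by v^(1/p)
    return x, m, v, B_t

# p = 2 recovers the NosAdam-style update; p = 3 is shown for comparison.
x2, m2, v2, B2 = p_nosadam_update(1.0, 2.0, 0.0, 0.0, t=1, B_prev=0.0, p=2.0)
x3, m3, v3, B3 = p_nosadam_update(1.0, 2.0, 0.0, 0.0, t=1, B_prev=0.0, p=3.0)
```

On the first step the accumulator equals $|g_1|^p$, so $v^{1/p} = |g_1|$ for every $p$ and both calls take the same step; differences between values of $p$ only appear once several gradients have been averaged.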

In the remaining part of this appendix, we prove the convergence theorem of p-NosAdam. The resulting regret bound is stated in Theorem A.1.

###### Theorem A.1 (Convergence of p-NosAdam).

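Let $\{x_t\}$ and $\{v_t\}$ be the sequences defined in p-NosAdam, $\alpha_t = \alpha/t^{1/p}$, $\beta_{1,1} = \beta_1$, $\beta_{1,t} \le \beta_1$ for all $t$. Assume that $\mathcal{F}$ has bounded diameter $D_\infty$ and $\|\nabla f_t(x)\|_\infty \le G_\infty$ for all $t$ and $x \in \mathcal{F}$. Furthermore, let $\{\beta_{2,t}\}$ be such that the following conditions are satisfied: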

1. $\dfrac{B_t}{t} \le \dfrac{B_{t-1}}{t-1}$
2. $\dfrac{B_t}{t\, b_t^p} \ge \dfrac{B_{t-1}}{(t-1)\, b_{t-1}^p}$

Then for $\{x_t\}$ generated using p-NosAdam, we have the following bound on the regret

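$$R_T \le \frac{D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d} T^{\frac{1}{p}}\, v_{T,i}^{\frac{1}{p}} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{t=1}^{T}\sum_{i=1}^{d}\frac{\beta_{1,t}\, v_{t,i}^{\frac{1}{p}}}{\alpha_t} + \frac{\alpha(\beta_1+1)}{(1-\beta_1)^3}\sum_{i=1}^{d}\Big(\sum_{t=1}^{T} b_t\, g_{t,i}^p\Big)^{\frac{p-1}{p}}\Big(\frac{B_T}{T\, b_T^p}\Big)^{\frac{1}{p}}$$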

### Proof of Theorem A.1:

Recall that

$$R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x\in\mathcal{F}}\sum_{t=1}^{T} f_t(x). \tag{5}$$

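Let $x^* = \arg\min_{x\in\mathcal{F}}\sum_{t=1}^{T} f_t(x)$. Therefore $R_T = \sum_{t=1}^{T} \big(f_t(x_t) - f_t(x^*)\big)$.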

To prove this theorem, we will use the following lemmas.

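###### Lemma A.2.

$$\begin{aligned}
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le \sum_{t=1}^{T}\Big[&\frac{1}{2\alpha_t(1-\beta_{1t})}\Big(\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2 - \|V_t^{\frac{1}{2p}}(x_{t+1}-x^*)\|^2\Big) + \frac{\alpha_t}{2(1-\beta_{1t})}\|V_t^{-\frac{1}{2p}} m_t\|^2 \\
&+ \frac{\alpha_t\beta_{1t}}{2(1-\beta_{1t})}\|V_t^{-\frac{1}{2p}} m_{t-1}\|^2 + \frac{\beta_{1t}}{2\alpha_t(1-\beta_{1t})}\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2\Big]
\end{aligned}$$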

#### Proof of Lemma A.2:

We begin with the following observation:

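$$x_{t+1} = \Pi_{\mathcal{F},\,V_t^{1/p}}\big(x_t - \alpha_t V_t^{-\frac{1}{p}} m_t\big) = \arg\min_{x\in\mathcal{F}} \big\|V_t^{\frac{1}{2p}}\big(x - (x_t - \alpha_t V_t^{-\frac{1}{p}} m_t)\big)\big\|$$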

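Using Lemma 4 in Reddi et al. (2018) (the weighted projection is non-expansive, applied here to the points $x_t - \alpha_t V_t^{-1/p} m_t$ and $x^*$, noting that $\Pi_{\mathcal{F},V_t^{1/p}}(x^*) = x^*$), we have the following: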

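$$\begin{aligned}
\|V_t^{\frac{1}{2p}}(x_{t+1}-x^*)\|^2 &\le \|V_t^{\frac{1}{2p}}(x_t - \alpha_t V_t^{-\frac{1}{p}} m_t - x^*)\|^2 \\
&= \|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2 + \alpha_t^2\|V_t^{-\frac{1}{2p}} m_t\|^2 - 2\alpha_t\langle m_t, x_t-x^*\rangle \\
&= \|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2 + \alpha_t^2\|V_t^{-\frac{1}{2p}} m_t\|^2 - 2\alpha_t\langle \beta_{1t} m_{t-1} + (1-\beta_{1t}) g_t,\; x_t-x^*\rangle
\end{aligned}$$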

Rearranging the above inequality, we have

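$$\begin{aligned}
\langle g_t, x_t-x^*\rangle \le{}& \frac{1}{2\alpha_t(1-\beta_{1t})}\Big[\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2 - \|V_t^{\frac{1}{2p}}(x_{t+1}-x^*)\|^2\Big] \\
&+ \frac{\alpha_t}{2(1-\beta_{1t})}\|V_t^{-\frac{1}{2p}} m_t\|^2 - \frac{\beta_{1t}}{1-\beta_{1t}}\langle m_{t-1}, x_t-x^*\rangle \\
\le{}& \frac{1}{2\alpha_t(1-\beta_{1t})}\Big[\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2 - \|V_t^{\frac{1}{2p}}(x_{t+1}-x^*)\|^2\Big] \\
&+ \frac{\alpha_t}{2(1-\beta_{1t})}\|V_t^{-\frac{1}{2p}} m_t\|^2 + \frac{\alpha_t\beta_{1t}}{2(1-\beta_{1t})}\|V_t^{-\frac{1}{2p}} m_{t-1}\|^2 \\
&+ \frac{\beta_{1t}}{2\alpha_t(1-\beta_{1t})}\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2
\end{aligned}$$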

The second inequality follows from simple application of Cauchy-Schwarz and Young’s inequality. We now use the standard approach of bounding the regret at each step using convexity of the function in the following manner:

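$$\begin{aligned}
\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le{}& \sum_{t=1}^{T}\langle g_t, x_t-x^*\rangle \\
\le{}& \sum_{t=1}^{T}\Big[\frac{1}{2\alpha_t(1-\beta_{1t})}\Big(\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2 - \|V_t^{\frac{1}{2p}}(x_{t+1}-x^*)\|^2\Big) + \frac{\alpha_t}{2(1-\beta_{1t})}\|V_t^{-\frac{1}{2p}} m_t\|^2 \\
&\quad + \frac{\alpha_t\beta_{1t}}{2(1-\beta_{1t})}\|V_t^{-\frac{1}{2p}} m_{t-1}\|^2 + \frac{\beta_{1t}}{2\alpha_t(1-\beta_{1t})}\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2\Big]
\end{aligned}$$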

This completes the proof of Lemma A.2.

Based on this lemma, we are going to find the corresponding upper bound for each term in the above regret bound inequality.

For the first term in Lemma A.2, we have Lemma A.3.

###### Lemma A.3.

When $\frac{B_t}{t}$ is non-increasing, $\frac{V_t}{\alpha_t^p} - \frac{V_{t-1}}{\alpha_{t-1}^p}$ is positive semi-definite, and

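$$\sum_{t=1}^{T}\frac{1}{2\alpha_t(1-\beta_{1t})}\Big(\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2 - \|V_t^{\frac{1}{2p}}(x_{t+1}-x^*)\|^2\Big) \le \frac{T^{1/p}}{2(1-\beta_1)}\,\frac{V_T^{1/p}}{\alpha}\,D_\infty^2$$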

#### Proof of Lemma A.3:

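$$\begin{aligned}
\frac{V_t}{\alpha_t^p} &= \frac{t}{\alpha^p}\sum_{j=1}^{t}\Big(\prod_{k=1}^{t-j}\beta_{2,t-k+1}\Big)(1-\beta_{2,j})\,g_j^p \\
&= \frac{t}{\alpha^p}\sum_{j=1}^{t}\frac{B_{t-1}}{B_t}\cdots\frac{B_j}{B_{j+1}}\,\frac{B_j-B_{j-1}}{B_j}\,g_j^p \\
&= \frac{t}{B_t\,\alpha^p}\sum_{j=1}^{t} b_j\, g_j^p \;\ge\; \frac{t-1}{B_{t-1}\,\alpha^p}\sum_{j=1}^{t-1} b_j\, g_j^p \\
&= \frac{V_{t-1}}{\alpha_{t-1}^p}
\end{aligned}$$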

which means $\frac{V_t}{\alpha_t^p} - \frac{V_{t-1}}{\alpha_{t-1}^p}$ is positive semi-definite.

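$$\begin{aligned}
&\sum_{t=1}^{T}\frac{1}{2\alpha_t(1-\beta_{1t})}\Big(\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2 - \|V_t^{\frac{1}{2p}}(x_{t+1}-x^*)\|^2\Big) \\
&\le \frac{1}{2(1-\beta_1)}\sum_{t=1}^{T}\frac{1}{\alpha_t}\Big(\|V_t^{\frac{1}{2p}}(x_t-x^*)\|^2 - \|V_t^{\frac{1}{2p}}(x_{t+1}-x^*)\|^2\Big) \\
&\le \frac{1}{2(1-\beta_1)}\Big(\|V_1^{\frac{1}{2p}}(x_1-x^*)\|^2 - \|V_T^{\frac{1}{2p}}(x_{T+1}-x^*)\|^2\Big) + \frac{1}{2(1-\beta_1)}\sum_{t=2}^{T}\Big(\frac{V_t^{1/p}}{\alpha_t} - \frac{V_{t-1}^{1/p}}{\alpha_{t-1}}\Big)(x_t-x^*)^2 \\
&\le \frac{1}{2(1-\beta_1)}\|V_1^{\frac{1}{2p}}\|^2 D_\infty^2 + \frac{1}{2(1-\beta_1)}\sum_{t=2}^{T}\Big(\frac{V_t^{1/p}}{\alpha_t} - \frac{V_{t-1}^{1/p}}{\alpha_{t-1}}\Big)D_\infty^2 \\
&= \frac{T^{1/p}}{2(1-\beta_1)}\,\frac{V_T^{1/p}}{\alpha}\,D_\infty^2
\end{aligned}$$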

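The third inequality uses the fact that $\frac{V_t^{1/p}}{\alpha_t} - \frac{V_{t-1}^{1/p}}{\alpha_{t-1}}$ is positive semi-definite, together with the bounded diameter $D_\infty$ of $\mathcal{F}$.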

This completes the proof of Lemma A.3.

For the second and the third terms in Lemma A.2, we have Lemma A.4.

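###### Lemma A.4.

$$\frac{\alpha_t}{2(1-\beta_{1t})}\|V_t^{-\frac{1}{2p}} m_t\|^2 + \frac{\alpha_t\beta_{1t}}{2(1-\beta_{1t})}\|V_t^{-\frac{1}{2p}} m_{t-1}\|^2 \le \frac{p}{2(p-1)}\,\frac{\alpha(1+\beta_1)}{(1-\beta_1)^3}\,S_T^{\frac{p-1}{p}}\Big(\frac{B_T}{T\, b_T^p}\Big)^{\frac{1}{p}}$$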

#### Proof of Lemma A.4:

For the second term in Lemma A.2 :

 T∑t=1αt||V−1/2ptmt||2 = T−1∑t=1αt||V−1/2ptmt||2+αTd∑i=1m2T,iv1/pT,i ≤ T−1∑t=1αt||V−1/2ptmt||2+αTd∑i=1(∑Tj=1(1−β1j)ΠT−jk=1β1(T−k+1)gj,i)2)(∑Tj=1ΠT−jk=1β2(T−k+1)(1−β2j)g2j,i)1/p ≤ T−1∑t=1αt||V−1/2ptmt||2 +αTd∑i=1(∑Tj=1(1−β1j)ΠT−jk=1β1(T−k+1))(∑Tj=1(1−β1j)ΠT−jk=1β1(T−k+1)g2j,i)(