
## 1 Introduction

First-order optimization algorithms with adaptive learning rates play an important role in deep learning due to their efficiency in solving large-scale optimization problems. Denote $g_t$ as the gradient of the loss function $f$ with respect to its parameters $\theta$ at timestep $t$; then the general update rule of these algorithms can be written as follows (Reddi et al., 2018):

$$\theta_{t+1} = \theta_t - \frac{\alpha_t}{\sqrt{v_t}}\, m_t. \qquad (1)$$

In the above equation, $m_t$ is a function of the historical gradients; $v_t$ is a $d$-dimensional vector with non-negative elements, which adapts the learning rate for the $d$ elements of $m_t$ respectively; $\alpha_t$ is the base learning rate; and $\frac{\alpha_t}{\sqrt{v_t}}$ is the adaptive step size for $m_t$.

One common choice of $m_t$ is the exponential moving average of the gradients, used in Momentum (Qian, 1999) and Adam (Kingma & Ba, 2014), which helps alleviate gradient oscillations. The commonly used $v_t$ in the deep learning community is the exponential moving average of squared gradients, as in Adadelta (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014) and Nadam (Dozat, 2016).

Adam (Kingma & Ba, 2014) is a typical adaptive learning rate method, which combines exponential moving averages of the first and second moments with bias correction. In general, Adam is robust and efficient in both dense and sparse gradient cases, and is popular in deep learning research. However, Adam has been shown to fail to converge to an optimal solution in certain cases. Reddi et al. (2018) point out that the key issue in the convergence proof of Adam lies in the quantity

$$\Gamma_t \triangleq \frac{\sqrt{v_t}}{\alpha_t} - \frac{\sqrt{v_{t-1}}}{\alpha_{t-1}}, \qquad (2)$$

which is assumed to be positive, but unfortunately, such an assumption does not always hold in Adam. They provide a set of counterexamples and demonstrate that the violation of the positiveness of $\Gamma_t$ leads to undesirable convergence behavior in Adam.

Reddi et al. (2018) then propose two variants, AMSGrad and AdamNC, to address the issue by keeping $\Gamma_t$ positive. Specifically, AMSGrad defines $\hat{v}_t$ as the historical maximum of $v_t$, i.e., $\hat{v}_t = \max_{i \le t} v_i$, and replaces $v_t$ with $\hat{v}_t$ to keep the denominator non-decreasing, which forces $\Gamma_t$ to be positive; AdamNC instead forces $v_t$ to have "long-term memory" of past gradients and calculates $v_t$ as their average to make it stable. Though these two algorithms solve the non-convergence problem of Adam to a certain extent, they turn out to be inefficient in practice: they have to maintain a very large $v_t$ once a large gradient appears, and a large $v_t$ decreases the adaptive learning rate $\frac{\alpha_t}{\sqrt{v_t}}$ and slows down the training process.

In this paper, we provide a new insight into adaptive learning rate methods, which brings a new perspective on solving the non-convergence issue of Adam. Specifically, in Section 3, we study the counterexamples provided by Reddi et al. (2018) by analyzing the accumulated step size of each gradient $g_t$. We observe that in common adaptive learning rate methods, a large gradient tends to receive a relatively small step size, while a small gradient is likely to receive a relatively large step size. We show that the unbalanced step sizes stem from the inappropriate positive correlation between $v_t$ and $g_t$, and we argue that this is the fundamental cause of the non-convergence issue of Adam.

In Section 4, we further prove that decorrelating $v_t$ and $g_t$ leads to an equal and unbiased expected step size for each gradient, thus solving the non-convergence issue of Adam. We subsequently propose AdaShift, a decorrelated variant of adaptive learning rate methods, which achieves the decorrelation between $v_t$ and $g_t$ by calculating $v_t$ using temporally shifted gradients. Finally, in Section 5, we study the performance of the proposed AdaShift, and demonstrate that it solves the non-convergence issue of Adam, while still maintaining competitive performance with Adam in terms of both training speed and generalization.

## 2 Preliminaries

In Adam, $m_t$ and $v_t$ are defined as the exponential moving averages of $g_t$ and $g_t^2$:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t \quad \text{and} \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad (3)$$

where $\beta_1$ and $\beta_2$ are the exponential decay rates for $m_t$ and $v_t$, respectively, with $m_0 = 0$ and $v_0 = 0$. They can also be written as:

$$m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}\, g_i \quad \text{and} \quad v_t = (1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}\, g_i^2. \qquad (4)$$

To avoid the bias in the estimation of the expected value at the initial timesteps, Kingma & Ba (2014) propose applying bias correction to $m_t$ and $v_t$. Using $m_t$ as an instance, it works as follows:

$$m_t = \frac{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} g_i}{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}} = \frac{\sum_{i=1}^{t}\beta_1^{t-i} g_i}{\sum_{i=1}^{t}\beta_1^{t-i}} = \frac{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} g_i}{1-\beta_1^{t}}. \qquad (5)$$
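Taken together, Equations (3)-(5) plus the update rule give the familiar Adam step. Below is a minimal single-parameter sketch (the small `eps` added for numerical stability is standard practice and not part of the equations above):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of Eq. (3), bias correction of
    Eq. (5), then the parameter update of Eq. (1)."""
    m = beta1 * m + (1 - beta1) * g            # first moment, Eq. (3)
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment, Eq. (3)
    m_hat = m / (1 - beta1 ** t)               # bias-corrected m_t, Eq. (5)
    v_hat = v / (1 - beta2 ** t)               # bias-corrected v_t
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (1)
    return theta, m, v
```

Note that at $t = 1$ the bias correction makes $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$ exactly, so the first step has magnitude close to $\alpha$ regardless of the gradient scale.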

#### Online optimization problem.

An online optimization problem consists of a sequence of cost functions $f_1(\theta), \dots, f_T(\theta)$, where the optimizer predicts the parameter $\theta_t$ at each timestep $t$ and evaluates it on a previously unknown cost function $f_t$. The performance of the optimizer is usually evaluated by the regret $R(T) = \sum_{t=1}^{T}\left[f_t(\theta_t) - f_t(\theta^*)\right]$, which is the sum of the differences between the online prediction $f_t(\theta_t)$ and the best fixed-point prediction $f_t(\theta^*)$ over all previous steps, where $\theta^* = \arg\min_{\theta\in\mathcal{F}} \sum_{t=1}^{T} f_t(\theta)$ is the best fixed-point parameter from a feasible set $\mathcal{F}$.

#### Counterexamples.

Reddi et al. (2018) highlight that for any fixed $\beta_1$ and $\beta_2$, there exists an online optimization problem where Adam has non-zero average regret, i.e., Adam does not converge to the optimal solution. The counterexample in the sequential version is given as follows:

$$f_t(\theta) = \begin{cases} C\theta, & \text{if } t \bmod d = 1; \\ -\theta, & \text{otherwise}, \end{cases} \qquad (6)$$

where $C$ is a relatively large constant and $d$ is the length of an epoch. In Equation 6, most gradients of $f_t(\theta)$ with respect to $\theta$ are $-1$, but the large positive gradient $C$ at the beginning of each epoch makes the overall gradient of each epoch positive, which means that one should decrease $\theta$ to minimize the loss. However, according to Reddi et al. (2018), the accumulated update of $\theta$ in Adam under some circumstances is the opposite (i.e., $\theta$ is increased), thus Adam cannot converge in such a case. Reddi et al. (2018) argue that the reason for the non-convergence of Adam is that the positiveness assumption on $\Gamma_t$ does not always hold.
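This behavior is easy to reproduce numerically. The sketch below runs Adam on Equation 6 with illustrative parameters ($C = 3$, $d = 3$, $\beta_1 = 0$, and a deliberately small $\beta_2 = 0.1$ to make the effect pronounced); although the correct direction is to decrease $\theta$, Adam steadily increases it, while a large $\beta_2$ restores the correct direction for this fixed problem:

```python
import numpy as np

def run_adam_on_counterexample(steps=3000, C=3.0, d=3,
                               alpha=0.01, beta1=0.0, beta2=0.1):
    """Simulate Adam on f_t(theta) = C*theta if t mod d == 1 else -theta."""
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = C if t % d == 1 else -1.0          # gradient of f_t w.r.t. theta
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / np.sqrt(v_hat)
    return theta

# Although the average gradient per epoch is (C - (d - 1)) / d > 0,
# Adam drifts in the wrong (positive) direction for a small beta2,
# while a large beta2 (e.g. 0.999) keeps it on the correct direction.
```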

#### Basic Solutions

Reddi et al. (2018) propose maintaining the strict positiveness of $\Gamma_t$ as a solution, for example, by keeping $v_t$ non-decreasing or using an increasing $\beta_2$. In fact, keeping $\Gamma_t$ positive is not the only way to guarantee the convergence of Adam. Another important observation is that for any fixed sequential online optimization problem with infinitely repeating epochs (e.g., Equation 6), Adam will converge as long as $\beta_1$ is large enough. Formally, we have the following theorem:

###### Theorem 1 (The influence of β1).

For any fixed sequential online convex optimization problem with infinitely repeating epochs of finite length $d$, suppose the gradients are bounded and the average gradient $g^*$ of an epoch is nonzero in every dimension. Then, for any fixed $\beta_2$ and any $\epsilon > 0$, there exists a $\beta_1$ close enough to $1$ such that Adam has average regret $R(T)/T < \epsilon$.

The intuition behind Theorem 1 is that if $\beta_1 \to 1$, then $m_t$ approaches the average gradient of an epoch, according to Equation 5. Therefore, no matter what the adaptive learning rate $\frac{\alpha_t}{\sqrt{v_t}}$ is, Adam will always converge along the correct direction.

## 3 The cause of non-convergence: unbalanced step size

In this section, we study the non-convergence issue by analyzing the counterexamples provided by Reddi et al. (2018). We show that the fundamental problem of common adaptive learning rate methods is that $v_t$ is positively correlated with the scale of the gradient $g_t$, which results in a small step size for a large gradient, and a large step size for a small gradient. We argue that such an unbalanced step size is the cause of non-convergence.

We will first define the *net update factor* for analyzing the accumulated influence of each gradient $g_t$, then apply the net update factor to study the behavior of Adam, using Equation 6 as an example. The argument will then be extended to the stochastic online optimization problem and to general cases.

### 3.1 Net update factor

When $\beta_1 \neq 0$, due to the exponential moving average effect of $m_t$, the influence of gradient $g_t$ extends to all subsequent timesteps. For timestep $i$ ($i \ge t$), the weight of $g_t$ in $m_i$ is $(1-\beta_1)\beta_1^{i-t}$. We accordingly define a new tool for our analysis: the net update $\mathrm{net}(g_t)$ of each gradient $g_t$, which is its accumulated influence on the entire optimization process:

$$\mathrm{net}(g_t) \triangleq \sum_{i=t}^{\infty} \frac{\alpha_i}{\sqrt{v_i}}\left[(1-\beta_1)\beta_1^{i-t} g_t\right] = k(g_t)\cdot g_t, \quad \text{where} \quad k(g_t) = \sum_{i=t}^{\infty} \frac{\alpha_i}{\sqrt{v_i}}(1-\beta_1)\beta_1^{i-t}, \qquad (7)$$

and we call $k(g_t)$ the net update factor of $g_t$, which is the equivalent accumulated step size for gradient $g_t$. Note that $k(g_t)$ depends on $\{v_i\}_{i=t}^{\infty}$, and in Adam, if $\beta_2 \neq 0$, then all elements of $\{v_i\}_{i=t}^{\infty}$ are related to $g_t$. Therefore, $k(g_t)$ is a function of $g_t$.

It is worth noting that in the Momentum method, $v_t$ is equivalently set to $1$. Therefore, with a fixed learning rate $\alpha$, we have $k(g_t) = \alpha$ and $\mathrm{net}(g_t) = \alpha\, g_t$, which means that the accumulated influence of each gradient in Momentum is the same as in vanilla SGD (Stochastic Gradient Descent). Hence, the convergence of Momentum is similar to that of vanilla SGD. However, in adaptive learning rate methods, $v_t$ is a function of the past gradients, which makes the convergence analysis nontrivial.

### 3.2 Analysis on sequential online optimization counterexamples

Note that $v_t$ appears in the definition of the net update factor (Equation 7). Before further analyzing the convergence of Adam using the net update factor, we first study the pattern of $v_t$ in the sequential online optimization problem in Equation 6. Since Equation 6 is deterministic, we can derive the limit formula of $v_t$ as follows:

###### Theorem 2 (Limit of vt).

In the sequential online optimization problem in Equation 6, denote $\beta_1$ and $\beta_2$ as the decay rates, $d$ as the length of an epoch, $n$ as the index of the epoch, and $i$ as the index of the timestep within one epoch. Then the limit of $v_{nd+i}$ as $n \to \infty$ is:

$$\lim_{n\to\infty} v_{nd+i} = \frac{1-\beta_2}{1-\beta_2^{d}}\,(C^2-1)\,\beta_2^{\,i-1} + 1. \qquad (8)$$
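Equation 8 can be sanity-checked numerically by simulating $v_t$ on the gradient sequence of Equation 6 and comparing the late-epoch values with the closed form (parameter values here are illustrative):

```python
def v_limit_empirical(C=3.0, d=3, beta2=0.1, epochs=200):
    """Simulate v_t on the repeating gradients (C, -1, ..., -1);
    return the d values of the final epoch."""
    v, last = 0.0, []
    for t in range(1, epochs * d + 1):
        g = C if t % d == 1 else -1.0
        v = beta2 * v + (1 - beta2) * g * g
        if t > (epochs - 1) * d:               # record the final epoch
            last.append(v)
    return last

def v_limit_formula(i, C=3.0, d=3, beta2=0.1):
    """Closed form of Eq. (8): lim_n v_{nd+i}."""
    return (1 - beta2) / (1 - beta2 ** d) * (C * C - 1) * beta2 ** (i - 1) + 1
```

After a few hundred epochs the simulated values agree with the closed form to machine precision; note that $v_{nd+1}$ (right after the large gradient $C$) is the largest value in the epoch, and it decays geometrically over the epoch.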

Given the formula of $\lim_{n\to\infty} v_{nd+i}$ in Equation 8, we now study the net update factor of each gradient. We start with the simple case where $\beta_1 = 0$. In this case we have

$$\lim_{n\to\infty} k(g_{nd+i}) = \lim_{n\to\infty} \frac{\alpha_{nd+i}}{\sqrt{v_{nd+i}}}. \qquad (9)$$

Since the limit of $v_{nd+i}$ in each epoch monotonically decreases with increasing index $i$ according to Equation 8, the limit of $k(g_{nd+i})$ monotonically increases within each epoch. Specifically, the first gradient $C$ of each epoch represents the correct updating direction, but its influence is the smallest in the epoch. In contrast, the net update factors of the subsequent gradients $-1$ are relatively larger, though they indicate a wrong updating direction.

The above problem stems from the inappropriate correlation between $v_t$ and $g_t$. Recall that $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, where $v_{t-1}$ is independent of $g_t$. When a new gradient $g_t$ arrives, if $g_t$ is large, $v_t$ is likely to be larger; and if $g_t$ is small, $v_t$ is also likely to be smaller. As a result, a large gradient is likely to have a small net update factor, while a small gradient is likely to have a large net update factor in Adam.

We further consider the general case where $\beta_1 \neq 0$. The result is presented in the following theorem:

###### Theorem 3 (Unbalanced net update factor).

In the sequential online optimization problem in Equation 6, when $\beta_1 \neq 0$, the limit of the net update factor of epoch $n$ is:

$$\lim_{n\to\infty} k(g_{nd+i}) = \sum_{t=nd+i}^{\infty} \frac{(1-\beta_1)\,\beta_1^{\,t-nd-i}}{\sqrt{\dfrac{1-\beta_2}{1-\beta_2^{d}}(C^2-1)\,\beta_2^{\,(t-1)\bmod d}+1}}. \qquad (10)$$

And there exists a $j$ ($1 < j \le d$) such that

$$\lim_{n\to\infty} k(C) = \lim_{n\to\infty} k(g_{nd+1}) \qquad (11)$$

and

$$\lim_{n\to\infty} k(g_{nd+j}) > \lim_{n\to\infty} k(g_{nd+j+1}) > \cdots > \lim_{n\to\infty} k(g_{nd+d+1}) = \lim_{n\to\infty} k(C), \qquad (12)$$

where $k(C)$ denotes the net update factor for the gradient $C$.

Theorem 3 tells us that, in the sequential online optimization problem in Equation 6, the net update factors are unbalanced. Specifically, the net update factor of the large gradient $C$ is the smallest in the entire epoch, while the small gradients $-1$ have larger net update factors. Such unbalanced net update factors can lead Adam in a wrong accumulated update direction.

### 3.3 Analysis on stochastic online optimization counterexamples

The counterexamples are also extended to stochastic cases in Reddi et al. (2018), where a finite set of cost functions appears in a stochastic order. Compared with the sequential online optimization counterexample, the stochastic version is more general and closer to the practical situation. For the simplest one-dimensional case, at each timestep $t$, the cost function $f_t$ is chosen i.i.d. as:

$$f_t(\theta) = \begin{cases} C\theta, & \text{with probability } p = \dfrac{1+\delta}{C+1}; \\[4pt] -\theta, & \text{with probability } 1-p = \dfrac{C-\delta}{C+1}, \end{cases} \qquad (13)$$

where $\delta$ is a small positive constant that is smaller than $C$. The expected cost function of the above problem is $\mathbb{E}[f_t(\theta)] = \delta\theta$; therefore, one should decrease $\theta$ to minimize the loss. Reddi et al. (2018) prove that when $C$ is large enough, the expectation of the accumulated parameter update in Adam is positive and results in increasing $\theta$.

To conduct a more rigorous study of the stochastic online optimization problem in Equation 13, we derive the expectation of the net update factor for each gradient in the following theorem:

###### Theorem 4 (Unbalanced net update factor in stochastic online optimization problem).

In the stochastic online optimization problem in Equation 13, the expectations of the net update factors are as follows:

$$k(C) = \sum_{t=0}^{\infty} (1-\beta_1)\beta_1^{t} \left[ \frac{1}{\sqrt{(1-\beta_2)\beta_2^{t} C^2 + \left(1+\beta_2^{t+1}-\beta_2^{t}\right)\mathbb{E}[g_i^2]}} + \frac{D_t}{8\left[(1-\beta_2)\beta_2^{t} C^2 + \left(1+\beta_2^{t+1}-\beta_2^{t}\right)\mathbb{E}[g_i^2]\right]^{5/2}} \right], \qquad (14)$$

and

$$k(-1) = \sum_{t=0}^{\infty} (1-\beta_1)\beta_1^{t} \left[ \frac{1}{\sqrt{(1-\beta_2)\beta_2^{t} + \left(1+\beta_2^{t+1}-\beta_2^{t}\right)\mathbb{E}[g_i^2]}} + \frac{D_t}{8\left[(1-\beta_2)\beta_2^{t} + \left(1+\beta_2^{t+1}-\beta_2^{t}\right)\mathbb{E}[g_i^2]\right]^{5/2}} \right], \qquad (15)$$

where $k(C)$ denotes the net update factor for gradient $C$, $k(-1)$ denotes the net update factor for gradient $-1$, and $D_t$ is a positive value.

Though the formulas of the net update factors in the stochastic case are more complicated than those in the deterministic case, the analysis is actually easier: gradients with the same scale share the same expected net update factor, so we only need to compare $k(C)$ and $k(-1)$. We can see that each term in the infinite series of $k(C)$ is smaller than the corresponding one in $k(-1)$; therefore, the accumulated influence of the gradient $C$ is smaller than that of the gradient $-1$.

The above observation can also be interpreted as a direct consequence of the inappropriate correlation between $v_t$ and $g_t$: given $g_t$, not only does $v_t$ positively correlate with $g_t$, but the entire infinite sequence $\{v_i\}_{i=t}^{\infty}$ positively correlates with $g_t$. Since the net update factor $k(g_t)$ negatively correlates with each $v_i$ in $\{v_i\}_{i=t}^{\infty}$, it also negatively correlates with $g_t$. That is, $k(g_t)$ for a large gradient is likely to be smaller, while $k(g_t)$ for a small gradient is likely to be larger.

The unbalanced net update factors cause the non-convergence problem of Adam, as well as of all other adaptive learning rate methods where $v_t$ correlates with $g_t$. All these counterexamples follow the same pattern: the large gradient is along the "correct" direction, while the small gradients are along the opposite direction. Because the accumulated influence of a large gradient is small while the accumulated influence of a small gradient is large, Adam may update the parameters in the wrong direction. Even when Adam updates the parameters in the right direction overall, the unbalanced net update factors are still unfavorable, since they slow down convergence.

## 4 The proposed method: decorrelation via temporal shifting

According to the previous discussion, we conclude that the main cause of the non-convergence of Adam is the inappropriate correlation between $v_t$ and $g_t$. Currently we have two possible solutions: (1) making $v_t$ act like a constant, which weakens the correlation, e.g., using a large $\beta_2$ or keeping $v_t$ non-decreasing (Reddi et al., 2018); (2) using a large $\beta_1$ (Theorem 1), where the aggressive momentum term helps to mitigate the impact of unbalanced net update factors. However, neither of them solves the problem fundamentally.

The dilemma caused by $v_t$ forces us to rethink its role. In adaptive learning rate methods, $v_t$ plays the role of estimating the second moment of the gradients, which reflects the scale of the gradient on average. With the adaptive learning rate $\frac{\alpha_t}{\sqrt{v_t}}$, the update step of $g_t$ is scaled down by $\sqrt{v_t}$, achieving rescaling invariance with respect to the scale of $g_t$, which is practically useful for making the training process easy to control and the training system robust. However, the current scheme of $v_t$, i.e., $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, introduces a positive correlation between $v_t$ and $g_t$, which results in reducing the effect of large gradients and increasing the effect of small gradients, and finally causes the non-convergence problem. Therefore, the key is to let $v_t$ be a quantity that reflects the scale of the gradients, while at the same time being decorrelated from the current gradient $g_t$. Formally, we have the following theorem:

###### Theorem 5 (Decorrelation leads to convergence).

For any fixed online optimization problem with infinite repetitions of a finite set of cost functions $\{f_1(\theta), \dots, f_n(\theta)\}$, assuming $\beta_1$ and $\alpha_t$ are fixed, if $v_t$ follows a fixed distribution and is independent of the current gradient $g_t$, then the expected net update factor for each gradient is identical.

Let $P_v$ denote the distribution of $v_t$. In the infinitely repeating online optimization scheme, the expectation of the net update factor for each gradient is

$$\mathbb{E}[k(g_t)] = \sum_{i=t}^{\infty} \mathbb{E}_{v_i \sim P_v}\!\left[\frac{\alpha_i}{\sqrt{v_i}}\,(1-\beta_1)\beta_1^{i-t}\right]. \qquad (16)$$

Given that each $v_i$ is independent of $g_t$, the expectation of the net update factor is independent of $g_t$ and remains the same for different gradients. With the expected net update factor being a fixed constant, the convergence of the adaptive learning rate method reduces to that of vanilla SGD.

Momentum (Qian, 1999) can be viewed as setting $v_t$ to a constant, which makes $v_t$ and $g_t$ independent. Furthermore, in our view, using an increasing $\beta_2$ (AdamNC) or keeping $v_t$ at its historical maximum (AMSGrad) also serves to make $v_t$ almost fixed. However, fixing $v_t$ is not a desirable solution, because it damages the adaptability of Adam in adapting the step size.

We will next introduce the proposed solution for making $v_t$ independent of $g_t$, which is based on a temporal independence assumption among the gradients. The proposed solution decorrelates $v_t$ and $g_t$ with a simple temporal shifting.

### 4.1 Temporal decorrelation

In a practical setting, $f_t(\theta)$ usually involves different mini-batches $x_t$, i.e., $f_t(\theta) = f(x_t; \theta)$. Given the randomness of mini-batch sampling, we assume that the mini-batches are independent of each other, and further assume that $f$ keeps unchanged over time; then the gradient of each mini-batch is independent of the others. We then present the temporal decorrelation algorithm as follows.

The key change is that the update rule for $v_t$ now involves the temporally shifted gradient $g_{t-n}$ (for some shift $n \ge 1$) instead of $g_t$ (line 4), which makes $v_t$ and $g_t$ temporally shifted and hence decorrelated. Note that in the sequential online optimization problem, the assumption that the gradients are independent of each other does not hold. However, in the stochastic online optimization problem and in practical neural network settings, our assumption generally holds.
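The idea can be sketched as follows, using a shift of $n = 1$ and no first moment (a simplified sketch, not the full algorithm). Notably, even on the sequential counterexample of Equation 6, where the independence assumption does not strictly hold, the temporal shift already restores the correct update direction under the same parameters that break Adam:

```python
import numpy as np

def run_shifted(steps=3000, C=3.0, d=3, alpha=0.01, beta2=0.1):
    """v_t is computed from the previous gradient g_{t-1} (temporal shift n=1),
    so the step size of g_t is decorrelated from its own magnitude."""
    theta, v, g_prev = 0.0, 1.0, None          # v starts at 1 to avoid div-by-zero
    for t in range(1, steps + 1):
        g = C if t % d == 1 else -1.0
        if g_prev is not None:
            v = beta2 * v + (1 - beta2) * g_prev ** 2   # shifted second moment
            theta -= alpha * g / np.sqrt(v)
        g_prev = g
    return theta
```

With the shift, the large gradient $C$ is divided by a $v$ built from the small gradients (so its step is large), and vice versa, which is exactly the rebalancing of net update factors discussed above.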

### 4.2 Temporal-spatial decorrelation

In most optimization schemes there are many parameters, i.e., the dimension of $\theta$ is high; thus $g_t$ and $v_t$ are also high-dimensional. However, in Algorithm 1, $v_t$ is computed element-wise; that is, we only use the $i$-th dimension of $g_{t-n}$ to calculate the $i$-th dimension of $v_t$. In other words, it only makes use of the independence between $g_{t-n}[i]$ and $g_t[i]$, where $g_t[i]$ denotes the $i$-th element of $g_t$.

Actually, in the case of high-dimensional $g_t$ and $v_t$, we can further assume that all elements of the gradient at a previous timestep $g_{t-n}$ are independent of the $i$-th dimension of the current gradient $g_t[i]$. Thus all elements in $g_{t-n}$ can be used to compute $v_t[i]$ without introducing correlation. We propose introducing a function $\phi$ over all elements of $g_{t-n}$ to achieve this goal, i.e., $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\phi(g_{t-n}^2)$. For ease of reference, we call the elements of $g_{t-n}$ other than $g_{t-n}[i]$ the spatial elements of $g_{t-n}$, and call $\phi$ the spatial function or spatial operation.

There is no restriction on the choice of $\phi$, and we use $\phi = \max$ for most of our experiments, which turns out to be a good choice. The max operation has the side effect of turning the adaptive learning rate into a scalar shared across dimensions. Importantly, we no longer interpret $v_t$ as the second moment of $g_t$. It is merely a random variable that is independent of $g_t$, while at the same time reflecting the overall gradient scale. We leave further investigations of $\phi$ as future work.

### 4.3 Block-wise temporal-spatial decorrelation

In practical settings, e.g., deep neural networks, $\theta$ usually consists of many parameter blocks, e.g., the weight and bias of each layer. In deep neural networks, the gradient scales (i.e., the variances) of different layers tend to be different (Glorot & Bengio, 2010; He et al., 2015). Different gradient scales make it hard to find a single learning rate that is suitable for all layers when using SGD or Momentum. Traditional adaptive learning rate methods apply element-wise rescaling for each gradient dimension, which achieves rescaling invariance and somewhat mitigates this problem. However, Adam sometimes generalizes worse than SGD (Wilson et al., 2017; Keskar & Socher, 2017), which might relate to the excessive learning rate adaptation in Adam.

In our temporal-spatial decorrelation scheme, we can solve the "different gradient scales" issue more naturally, by applying $\phi$ block-wise, outputting a shared adaptive learning rate scalar for each block. This makes the algorithm work like SGD with an adaptive learning rate, where each block has its own adaptive learning rate while the relative gradient scales among in-block elements are kept unchanged. The corresponding algorithm is illustrated in Algorithm 2, where the parameters, including the related $g_t$ and $v_t$, are divided into blocks. Each block contains the parameters of the same type or of the same layer of the neural network.

### 4.4 Incorporating first moment: moving averaging windows

First moment estimation, i.e., defining $m_t$ as a moving average of $g_t$, is an important technique of modern first-order optimization algorithms, as it alleviates mini-batch oscillations. In this section, we extend our algorithm to incorporate first moment estimation.

We have argued that $v_t$ needs to be decorrelated from $g_t$. Similarly, when introducing first moment estimation, we need to make $v_t$ and $m_t$ independent to keep the expected net update factor unbiased. Based on our assumptions of temporal and spatial independence, we further keep out the latest $n$ gradients $g_{t-n+1}, \dots, g_t$, and update $v_t$ and $m_t$ via

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\phi(g_{t-n}^2) \quad \text{and} \quad m_t = \frac{\sum_{i=0}^{n-1}\beta_1^{i}\, g_{t-i}}{\sum_{i=0}^{n-1}\beta_1^{i}}. \qquad (17)$$

In Equation 17, $\beta_1$ plays the role of the decay rate for the temporal elements. It can be viewed as a truncated version of the exponential moving average that is applied only to the latest few gradients. Since we use truncation, it is feasible to use a large $\beta_1$ without the risk of relying on overly old gradients. In the extreme case $\beta_1 = 1$, it becomes vanilla averaging.
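The pieces above can be combined into a compact sketch: a keep-out window of $n$ gradients, the truncated moving average of Equation 17 for $m_t$, and a reduce-max spatial operation for $v_t$. Class and parameter names here are illustrative, not the authors' reference implementation:

```python
import numpy as np
from collections import deque

class AdaShiftSketch:
    """Illustrative sketch of the update of Eq. (17): a keep-out window of n
    gradients, and a spatial operation phi applied to the shifted-out gradient."""

    def __init__(self, n=10, alpha=0.01, beta1=0.9, beta2=0.999,
                 phi=np.max, eps=1e-8):
        self.n, self.alpha, self.beta2 = n, alpha, beta2
        self.phi, self.eps = phi, eps
        self.window = deque(maxlen=n + 1)       # stores g_{t-n}, ..., g_t
        self.v = 0.0
        self.w = beta1 ** np.arange(n)          # weights beta1^0, ..., beta1^{n-1}

    def step(self, theta, g):
        self.window.append(np.asarray(g, dtype=float))
        if len(self.window) <= self.n:          # not enough history yet
            return theta
        g_shifted = self.window[0]              # g_{t-n}, decorrelated from g_t
        self.v = self.beta2 * self.v + (1 - self.beta2) * self.phi(g_shifted ** 2)
        recent = list(self.window)[:0:-1]       # g_t, g_{t-1}, ..., g_{t-n+1}
        m = sum(wi * gi for wi, gi in zip(self.w, recent)) / self.w.sum()
        return theta - self.alpha * m / (np.sqrt(self.v) + self.eps)
```

With $\phi = \max$ the second-moment accumulator becomes a scalar shared across dimensions, matching the reduce-max (max-AdaShift) variant; passing an identity $\phi$ recovers the purely temporal (non-AdaShift) variant element-wise.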

The pseudo code of the algorithm that unifies all the proposed techniques is presented in the Appendix, with the following parameters: spatial operation $\phi$, keep number $n$, base learning rate $\alpha$, and decay rates $\beta_1$ and $\beta_2$.

## 5 Experiments

In this section, we empirically study the proposed method and compare it with Adam, AMSGrad and SGD on various tasks, in terms of training performance and generalization. Unless otherwise stated, the reported result for each algorithm is the best we found via hyperparameter grid search. The anonymized code is provided at http://bit.ly/2NDXX6x.

### 5.2 Logistic Regression and Multilayer Perceptron on MNIST

We further compare the proposed method with Adam, AMSGrad and SGD using Logistic Regression and a Multilayer Perceptron on MNIST, where the Multilayer Perceptron has two hidden layers of equal width with no internal activation. The results are shown in Figure 3. We find that in Logistic Regression, these learning algorithms achieve very similar final results in terms of both training speed and generalization. For the Multilayer Perceptron, we compare Adam, AMSGrad and AdaShift with the reduce-max spatial operation (max-AdaShift) and without a spatial operation (non-AdaShift). We observe that max-AdaShift achieves the lowest training loss, while non-AdaShift shows mild training loss oscillation and at the same time achieves better generalization. The worse generalization of max-AdaShift may be due to overfitting in this task, and the better generalization of non-AdaShift may stem from the regularization effect of its relatively unstable step size.

### 5.3 DenseNet and ResNet on CIFAR-10

ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) are two typical modern neural networks, which are efficient and widely used. We test our algorithm with a ResNet and a DenseNet on the CIFAR-10 dataset. We plot the best results of Adam, AMSGrad and AdaShift in Figure 4 and Figure 5 for ResNet and DenseNet, respectively. We can see that AMSGrad is relatively worse in terms of both training speed and generalization. Adam and AdaShift achieve competitive results, with AdaShift generally slightly better, especially in the test accuracy of ResNet and the training loss of DenseNet.

### 5.4 DenseNet with Tiny-ImageNet

We further increase the complexity of the dataset, switching from CIFAR-10 to Tiny-ImageNet, and compare the performance of Adam, AMSGrad and AdaShift with DenseNet. The results are shown in Figure 6, from which we can see that the training curves of Adam and AdaShift basically overlap, but AdaShift achieves higher test accuracy than Adam. AMSGrad has relatively higher training loss, and its test accuracy is relatively lower in the initial stage.

### 5.5 Generative model and Recurrent model

We also test our algorithm on the training of a generative model and a recurrent model. We choose WGAN-GP (Gulrajani et al., 2017), which involves a Lipschitz continuity condition (hard to optimize), and Neural Machine Translation (NMT) (Luong et al., 2017), which involves the typical recurrent unit LSTM. In Figure 6(a), we compare the performance of Adam, AMSGrad and AdaShift in training the WGAN-GP discriminator, given a fixed generator. We notice that AdaShift is significantly better than Adam, while the performance of AMSGrad is relatively unsatisfactory. The test performance of NMT in terms of BLEU is shown in Figure 6(b), where AdaShift achieves a higher BLEU score than Adam and AMSGrad.

## 6 Conclusion

In addition, based on our new perspective on adaptive learning rate methods, $v_t$ is no longer necessarily the second moment of $g_t$, but rather a random variable that is independent of $g_t$ and reflects the overall gradient scale. Thus, it is valid to calculate $v_t$ with the spatial elements of previous gradients. We further found that when the spatial operation $\phi$ outputs a shared scalar for each block, the resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale within each block is maintained. The experimental results demonstrate that AdaShift is able to solve the non-convergence issue of Adam. In the meantime, AdaShift achieves competitive or even better training and testing performance when compared with Adam.

## References

• Dozat (2016) Timothy Dozat. Incorporating Nesterov momentum into Adam. In ICLR Workshop, 2016.
• Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
• Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.
• He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
• He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
• Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
• Keskar & Socher (2017) Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017.
• Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
• Luong et al. (2017) Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial, 2017.
• Qian (1999) Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
• Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
• Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
• Wilson et al. (2017) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158, 2017.
• Zeiler (2012) Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

## Appendix A The relation among β1, β2 and C

To provide an intuitive impression of the relation among $\beta_1$, $\beta_2$, $C$ and the convergence of Adam, we fix the sequential problem of Equation 6, vary $\beta_1$ and $\beta_2$, and let Adam run for 2000 timesteps (iterations). The final value of $\theta$ is shown in Figure 7(a). It suggests that for a fixed sequential online optimization problem, both $\beta_1$ and $\beta_2$ determine the direction and speed of the Adam optimization process. Furthermore, we also study the threshold value of $C$ at which Adam switches to the incorrect direction, for each fixed pair of $\beta_1$ and $\beta_2$. To simplify the experiments, we keep the overall gradient of each epoch fixed. The result is shown in Figure 7(b), which suggests that, with a larger $\beta_1$ or a larger $\beta_2$, a larger $C$ is needed to make Adam stride in the opposite direction. In other words, large $\beta_1$ and $\beta_2$ make non-convergence rare.

We also conduct the experiment in the stochastic problem to analyze the relation among $\beta_1$, $\beta_2$, $C$ and the convergence behavior of Adam. The results are shown in Figure 7(c) and Figure 7(d), and the observations are similar to the previous ones: a larger $C$ causes non-convergence more easily, and a larger $\beta_1$ or $\beta_2$ helps to resolve the non-convergence issue. In this experiment, the remaining hyperparameters are fixed.

###### Theorem 6 (Critical condition).

In the sequential online optimization problem of Equation 6, let $\alpha_t$ be fixed, and define $S(\beta_1, \beta_2, C)$ to be the sum of the limits of the step updates in a $d$-step epoch:

$$S(\beta_1, \beta_2, C) \triangleq \sum_{i=1}^{d}\,\lim_{n\to\infty} \frac{m_{nd+i}}{\sqrt{v_{nd+i}}}\,. \qquad (18)$$

Setting $S(\beta_1, \beta_2, C) = 0$ and assuming $C$ and $d$ are large enough that the higher-order terms vanish, we get the equation:

$$C + 1 = \frac{\left(1-\beta_1^{d}\right)\left(\sqrt{\beta_2^{d}}-\beta_1^{d}\right)\left(1-\sqrt{\beta_2}\right)}{\left(1-\beta_1\right)\left(\sqrt{\beta_2}-\beta_1\right)\left(1-\sqrt{\beta_2^{d}}\right)}\,. \qquad (19)$$

Equation 19, though quite complex, tells us that both $\beta_1$ and $\beta_2$ are closely related to the counterexamples, and that there exists a critical condition among these parameters.

## Appendix C Correlation between gt and vt

In order to verify the correlation between $g_t$ and $v_t$ in Adam and AdaShift, we conduct experiments to calculate the corresponding correlation coefficients. We train the Multilayer Perceptron on MNIST until convergence and gather the gradients of the second hidden layer at each step. Based on these data, we calculate the correlation coefficients between temporally shifted gradients, between spatial elements of shifted gradients, and between $g_t$ and $v_t$, over the last epochs, using the Pearson correlation coefficient, which is formulated as follows:

$$\rho = \frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}}\,.$$
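For reference, the coefficient above can be computed directly as follows (a small sketch, numerically equivalent to `np.corrcoef` for 1-D samples):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1-D samples, per the formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / (np.sqrt((xm ** 2).sum()) * np.sqrt((ym ** 2).sum()))
```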

To verify the temporal correlation between $g_t$ and $g_{t-n}$, we vary the shift $n$ and calculate the average temporal correlation coefficient over all variables. Results are shown in Table 1.

To verify the spatial correlation between $g_t$ and the spatial elements of $g_{t-n}$, we again vary the shift $n$, randomly sample pairs of elements, and calculate the average spatial correlation coefficient over all selected pairs. Results are shown in Table 2.

To verify the correlation between $g_t$ and $v_t$ within Adam, we calculate the average correlation coefficient between $g_t$ and $v_t$ over all variables. The result is 0.435885276.

To verify the correlation between $g_t$ and $v_t$ within non-AdaShift and within max-AdaShift, we vary the keep number $n$ and calculate the average correlation coefficient over all variables. The results are shown in Table 3 and Table 4.

## Appendix D Proof of Theorem 1

###### Proof.

With bias correction, the formulation of $m_t$ is written as follows:

$$m_t = \frac{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}g_i}{(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}} = \frac{\sum_{i=1}^{t}\beta_1^{t-i}g_i}{\sum_{i=1}^{t}\beta_1^{t-i}}. \qquad (20)$$

According to L'Hôpital's rule, we have:

$$\lim_{\beta_1\to 1}\sum_{i=1}^{t}\beta_1^{t-i} = \lim_{\beta_1\to 1}\frac{1-\beta_1^{t}}{1-\beta_1} = t.$$

Thus,

$$\lim_{\beta_1\to 1} m_t = \frac{\sum_{i=1}^{t}g_i}{t}.$$
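Both limits can be checked numerically by evaluating the sums at $\beta_1$ close to $1$ (a quick sketch with illustrative values):

```python
def weight_sum(beta1, t):
    """sum_{i=1}^{t} beta1^{t-i}, which tends to t as beta1 -> 1."""
    return sum(beta1 ** (t - i) for i in range(1, t + 1))

def m_t(beta1, grads):
    """Bias-corrected m_t of Eq. (20); tends to the plain average as beta1 -> 1."""
    t = len(grads)
    num = sum(beta1 ** (t - i) * g for i, g in enumerate(grads, start=1))
    return num / weight_sum(beta1, t)
```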

According to the definition of the limit, let $g^*$ denote the average gradient of an epoch; then for any $\epsilon > 0$ there exists a $\beta_1$ close enough to $1$ such that

$$\|m_t - g^*\|_\infty < \epsilon.$$

We set $\epsilon$ to be $\min_i |g^*[i]|/2$; then for each dimension $i$ of $m_t$,

$$\frac{g^*[i]}{2} \le m_t[i] \le \frac{3\,g^*[i]}{2}.$$

So $m_t$ shares the same sign as $g^*$ in every dimension.

Given that it is a convex optimization problem, let the optimal parameter be $\theta^*$, and let $G$ bound the maximum gradient scale. Since $m_t$ shares the sign of $g^*$ in every dimension, with a sufficiently small step size we have

$$\lim_{t\to\infty}\|\theta_t-\theta^*\|_\infty < \frac{\epsilon}{2G}. \qquad (21)$$

Given the gradient bound $G$, this implies $f_t(\theta_t) - f_t(\theta^*) < \epsilon/2$, which yields the average regret

$$R(T)/T = \sum_{t=1}^{T}\left[f_t(\theta_t)-f_t(\theta^*)\right]/T < \frac{\epsilon}{2}. \qquad (22)$$

Letting $\epsilon \to 0$ completes the proof. ∎