1 Introduction
First-order optimization algorithms with adaptive learning rates play an important role in deep learning due to their efficiency in solving large-scale optimization problems. Denote $g_t$ as the gradient of the loss function $f(\theta)$ with respect to its parameters $\theta$ at timestep $t$; then the general updating rule of these algorithms can be written as follows (Reddi et al., 2018):
$$\theta_{t+1} = \theta_t - \frac{\alpha_t}{\sqrt{v_t}} m_t. \tag{1}$$
In the above equation, $m_t$ is a function of the historical gradients; $v_t$ is an $n$-dimensional vector with non-negative elements, which adapts the learning rate for the $n$ elements in $g_t$ respectively; $\alpha_t$ is the base learning rate; and $\alpha_t / \sqrt{v_t}$ is the adaptive step size for $m_t$.
One common choice of $m_t$ is the exponential moving average of the gradients used in Momentum (Qian, 1999) and Adam (Kingma & Ba, 2014), which helps alleviate gradient oscillations. The $v_t$ commonly used in the deep learning community is the exponential moving average of squared gradients, as in Adadelta (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014) and Nadam (Dozat, 2016).
Adam (Kingma & Ba, 2014) is a typical adaptive learning rate method, which combines exponential moving averages of the first and second moments with bias correction. In general, Adam is robust and efficient in both dense and sparse gradient cases, and is popular in deep learning research. However, Adam has been shown to fail to converge to the optimal solution in certain cases. Reddi et al. (2018) point out that the key issue in the convergence proof of Adam lies in the quantity
$$\Gamma_t = \frac{\sqrt{v_t}}{\alpha_t} - \frac{\sqrt{v_{t-1}}}{\alpha_{t-1}}, \tag{2}$$
which is assumed to be positive, but unfortunately, such an assumption does not always hold in Adam. They provide a set of counterexamples and demonstrate that violating the positiveness of $\Gamma_t$ leads to undesirable convergence behavior in Adam.
Reddi et al. (2018) then propose two variants, AMSGrad and AdamNC, to address the issue by keeping $\Gamma_t$ positive. Specifically, AMSGrad defines $\hat{v}_t$ as the historical maximum of $v_t$, i.e., $\hat{v}_t = \max\{v_i\}_{i=1}^{t}$, and replaces $v_t$ with $\hat{v}_t$ to keep it non-decreasing, which forces $\Gamma_t$ to be positive; AdamNC forces $v_t$ to have "long-term memory" of past gradients and calculates $v_t$ as their average to make it stable. Though these two algorithms solve the non-convergence problem of Adam to a certain extent, they turn out to be inefficient in practice: they have to maintain a very large $v_t$ once a large gradient appears, and a large $v_t$ decreases the adaptive learning rate $\alpha_t / \sqrt{v_t}$ and slows down the training process.
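The inefficiency described above can be illustrated with a minimal sketch on scalar gradients. The function name `amsgrad_second_moment` and the values of `beta2` and the gradient sequence are illustrative assumptions, not settings from the paper; the sketch only tracks the second-moment estimate and its running maximum.

```python
def amsgrad_second_moment(grads, beta2=0.9):
    """Return a list of (v_t, v_hat_t) pairs for a scalar gradient sequence."""
    v, v_hat = 0.0, 0.0
    trace = []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g  # Adam's moving average of g^2
        v_hat = max(v_hat, v)                  # AMSGrad keeps the running maximum
        trace.append((v, v_hat))
    return trace


# One large gradient followed by many small ones (hypothetical sequence).
trace = amsgrad_second_moment([10.0] + [1.0] * 50)
v_last, v_hat_last = trace[-1]
print(v_last, v_hat_last)
```

In this run the moving average decays back toward the scale of the small gradients, while the running maximum stays at the value set by the single large gradient, so the effective step size remains small.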
In this paper, we provide a new insight into adaptive learning rate methods, which brings a new perspective on solving the non-convergence issue of Adam. Specifically, in Section 3, we study the counterexamples provided by Reddi et al. (2018) by analyzing the accumulated step size of each gradient $g_t$. We observe that in the common adaptive learning rate methods, a large gradient tends to have a relatively small step size, while a small gradient is likely to have a relatively large step size. We show that the unbalanced step sizes stem from the inappropriate positive correlation between $v_t$ and $g_t$, and we argue that this is the fundamental cause of the non-convergence issue of Adam.
In Section 4, we further prove that decorrelating $v_t$ and $g_t$ leads to an equal and unbiased expected step size for each gradient, thus solving the non-convergence issue of Adam. We subsequently propose AdaShift, a decorrelated variant of adaptive learning rate methods, which achieves decorrelation between $v_t$ and $g_t$ by calculating $v_t$ using temporally shifted gradients. Finally, in Section 5, we study the performance of the proposed AdaShift, and demonstrate that it solves the non-convergence issue of Adam while maintaining performance competitive with Adam in terms of both training speed and generalization.
2 Preliminaries
Adam.
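In Adam, $m_t$ and $v_t$ are defined as the exponential moving averages of $g_t$ and $g_t^2$:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad\text{and}\quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \tag{3}$$
where $\beta_1 \in [0, 1)$ and $\beta_2 \in [0, 1)$ are the exponential decay rates for $m_t$ and $v_t$, respectively, with $m_0 = 0$ and $v_0 = 0$. They can also be written as:
$$m_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i \quad\text{and}\quad v_t = (1-\beta_2) \sum_{i=1}^{t} \beta_2^{t-i} g_i^2. \tag{4}$$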
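To avoid the bias in the estimation of the expected value at the initial timesteps, Kingma & Ba (2014) propose to apply bias correction to $m_t$ and $v_t$. Using $v_t$ as an instance, it works as follows:
$$v_t = \frac{(1-\beta_2) \sum_{i=1}^{t} \beta_2^{t-i} g_i^2}{(1-\beta_2) \sum_{i=1}^{t} \beta_2^{t-i}} = \frac{(1-\beta_2) \sum_{i=1}^{t} \beta_2^{t-i} g_i^2}{1-\beta_2^{t}}. \tag{5}$$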
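As a concrete illustration of the moving averages and the bias correction, the following minimal sketch computes bias-corrected moments for a scalar gradient sequence. The function name `adam_moments` and the default decay rates are assumptions made for this example.

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return bias-corrected (m_hat, v_hat) after consuming scalar `grads`."""
    if not grads:
        raise ValueError("grads must contain at least one gradient")
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g       # first-moment moving average
        v = beta2 * v + (1.0 - beta2) * g * g   # second-moment moving average
    t = len(grads)
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat, v_hat


m_hat, v_hat = adam_moments([2.0] * 5)
print(m_hat, v_hat)
```

For a constant gradient sequence, the corrected estimates recover the gradient and its square up to floating-point error, which is the purpose of the correction: without it, both averages would be biased toward their zero initialization during the first timesteps.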
Online optimization problem.
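An online optimization problem consists of a sequence of cost functions $f_1(\theta), \dots, f_t(\theta), \dots, f_T(\theta)$, where the optimizer predicts the parameter $\theta_t$ at each timestep $t$ and evaluates it on an unknown cost function $f_t(\theta)$. The performance of the optimizer is usually evaluated by the regret $R(T) \triangleq \sum_{t=1}^{T} [f_t(\theta_t) - f_t(\theta^*)]$, which is the sum of the differences between the online prediction and the best fixed-point parameter prediction over all previous steps, where $\theta^* = \arg\min_{\theta \in \vartheta} \sum_{t=1}^{T} f_t(\theta)$ is the best fixed-point parameter from a feasible set $\vartheta$.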
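The regret can be made concrete with a small sketch. It assumes the feasible set is approximated by a finite list of candidate parameters, which is a simplification introduced for this example; the helper name `regret` and the linear costs are likewise hypothetical.

```python
def regret(costs, predictions, candidates):
    """Regret of online `predictions` against the best fixed candidate.

    costs:       list of callables f_t(theta)
    predictions: list of theta_t, one per cost function
    candidates:  finite list standing in for the feasible set (assumption)
    """
    online = sum(f(theta) for f, theta in zip(costs, predictions))
    best_fixed = min(sum(f(theta) for f in costs) for theta in candidates)
    return online - best_fixed


# Three hypothetical linear costs on the candidates {-1, 0, 1}.
costs = [lambda th: 3.0 * th, lambda th: -th, lambda th: -th]
print(regret(costs, [0.0, 0.0, 0.0], [-1.0, 0.0, 1.0]))
```

The average regret is this quantity divided by the number of timesteps; an optimizer is said to converge in this setting when the average regret goes to zero.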
Counterexamples.
Reddi et al. (2018) highlight that for any fixed $\beta_1$ and $\beta_2$, there exists an online optimization problem where Adam has non-zero average regret, i.e., Adam does not converge to the optimal solution. The counterexamples in the sequential version are given as follows:
$$f_t(\theta) = \begin{cases} C\theta, & \text{if } t \bmod d = 1; \\ -\theta, & \text{otherwise}, \end{cases} \tag{6}$$
where $C$ is a relatively large constant and $d$ is the length of an epoch. In Equation 6, most gradients of $f_t(\theta)$ with respect to $\theta$ are $-1$, but the large positive gradient $C$ at the beginning of each epoch makes the overall gradient of each epoch positive, which means that one should decrease $\theta_t$ to minimize the loss. However, according to Reddi et al. (2018), the accumulated update of $\theta$ in Adam is, under some circumstances, in the opposite direction (i.e., $\theta_t$ is increased), so Adam cannot converge in such cases. Reddi et al. (2018) argue that the reason for the non-convergence of Adam is that the positivity assumption on $\Gamma_t$ does not always hold in Adam.
Basic Solutions
Reddi et al. (2018) propose maintaining the strict positiveness of $\Gamma_t$ as a solution, for example, keeping $v_t$ non-decreasing or using an increasing $\beta_2$. In fact, keeping $\Gamma_t$ positive is not the only way to guarantee the convergence of Adam. Another important observation is that for any fixed sequential online optimization problem with infinitely repeating epochs (e.g., Equation 6), Adam will converge as long as $\beta_1$ is large enough. Formally, we have the following theorem:
Theorem 1 (The influence of $\beta_1$).
For any fixed sequential online convex optimization problem with infinitely repeating of finite length epochs, let the length of an epoch be , if such that and such that holds for all , then, for any fixed , there exists a such that Adam has average regret ;
3 The cause of non-convergence: unbalanced step size
In this section, we study the non-convergence issue by analyzing the counterexamples provided by Reddi et al. (2018). We show that the fundamental problem of common adaptive learning rate methods is that $v_t$ is positively correlated with the scale of the gradient $g_t$, which results in a small step size for a large gradient and a large step size for a small gradient. We argue that such an unbalanced step size is the cause of non-convergence.
We first define the net update factor for analyzing the accumulated influence of each gradient $g_t$, then apply the net update factor to study the behavior of Adam using Equation 6 as an example. The argument is then extended to the stochastic online optimization problem and general cases.
3.1 Net update factor
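When $\beta_1 \neq 0$, due to the exponential moving effect of $m_t$, the influence of $g_t$ exists in all of its following timesteps. For timestep $t+i$ ($i \geq 0$), the weight of $g_t$ is $(1-\beta_1)\beta_1^{i}$. We accordingly define a new tool for our analysis: the net update $\mathrm{net}(g_t)$ of each gradient $g_t$, which is its accumulated influence on the entire optimization process:
$$\mathrm{net}(g_t) \triangleq \sum_{i=0}^{\infty} \frac{\alpha_{t+i}}{\sqrt{v_{t+i}}} \left[(1-\beta_1)\beta_1^{i} g_t\right] = k(g_t) \cdot g_t, \quad\text{where}\quad k(g_t) = \sum_{i=0}^{\infty} \frac{\alpha_{t+i}}{\sqrt{v_{t+i}}} (1-\beta_1)\beta_1^{i}, \tag{7}$$
and we call $k(g_t)$ the net update factor of $g_t$, which is the equivalent accumulated step size for gradient $g_t$. Note that $k(g_t)$ depends on $\{v_{t+i}\}_{i=0}^{\infty}$, and in Adam, if $\beta_2 \neq 0$, then all elements in $\{v_{t+i}\}_{i=0}^{\infty}$ are related to $g_t$. Therefore, $k(g_t)$ is a function of $g_t$.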
It is worth noticing that in the Momentum method, $v_t$ is equivalently set as $1$. Therefore, we have $k(g_t) = \alpha_t$ and $\mathrm{net}(g_t) = \alpha_t g_t$ (taking the base learning rate to be constant), which means that the accumulated influence of each gradient in Momentum is the same as in vanilla SGD (Stochastic Gradient Descent). Hence, the convergence of Momentum is similar to that of vanilla SGD. However, in adaptive learning rate methods, $v_t$ is a function of the past gradients, which makes their convergence non-trivial.
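The net update factor can be evaluated numerically by truncating the infinite sum. The sketch below is an illustration under stated assumptions: the function name `net_update_factor`, the `horizon` truncation, and the list-based inputs are all introduced here and are not part of the original method.

```python
import math


def net_update_factor(alphas, vs, t, beta1, horizon=1000):
    """Truncated k(g_t) = sum_i alpha_{t+i} / sqrt(v_{t+i}) * (1 - beta1) * beta1**i.

    alphas, vs: sequences indexed by timestep (0-based here, an assumption).
    horizon:    number of terms kept from the infinite sum (assumption).
    """
    total = 0.0
    for i in range(horizon):
        if t + i >= len(vs):
            break
        weight = (1.0 - beta1) * beta1 ** i
        total += alphas[t + i] / math.sqrt(vs[t + i]) * weight
    return total


# Momentum-like case: v_t fixed to 1 and a constant learning rate.
k = net_update_factor([0.1] * 2000, [1.0] * 2000, t=0, beta1=0.9)
print(k)
```

With $v_t$ fixed to 1 the geometric weights sum to one, so the factor equals the base learning rate regardless of the gradient, which is the Momentum case discussed above. With $\beta_1 = 0$ only the first term survives and the factor reduces to $\alpha_t / \sqrt{v_t}$.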
3.2 Analysis on sequential online optimization counterexamples
Note that $v_t$ appears in the definition of the net update factor (Equation 7). Before further analyzing the convergence of Adam using the net update factor, we first study the pattern of $v_t$ in the sequential online optimization problem in Equation 6. Since Equation 6 is deterministic, we can derive the formula of $v_t$ as follows:
Theorem 2 (Limit of $v_t$).
In the sequential online optimization problem in Equation 6, denote $\beta_1, \beta_2 \in [0, 1)$ as the decay rates, $d$ as the length of an epoch, $n$ as the index of the epoch, and $i \in \{1, 2, \dots, d\}$ as the index of the timestep within one epoch. Then the limit of $v_{nd+i}$ when $n \to \infty$ is:
$$\lim_{n \to \infty} v_{nd+i} = \frac{1-\beta_2}{1-\beta_2^{d}} (C^2 - 1) \beta_2^{i-1} + 1. \tag{8}$$
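The limit in Theorem 2 can be checked numerically. The sketch below iterates the second-moment recursion of Equation 3 on the gradients of Equation 6 and compares the last epoch against the closed form; the function names and the particular values of $C$, $d$ and $\beta_2$ are arbitrary choices made for illustration.

```python
def v_limit(i, C, d, beta2):
    """Closed-form limit of v at position i (1-based) inside an epoch."""
    return (1.0 - beta2) / (1.0 - beta2 ** d) * (C * C - 1.0) * beta2 ** (i - 1) + 1.0


def simulate_v(C, d, beta2, epochs):
    """Run v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 and return the last epoch."""
    v = 0.0
    last_epoch = []
    for _ in range(epochs):
        last_epoch = []
        for i in range(1, d + 1):
            g = C if i == 1 else -1.0   # gradients of the sequential counterexample
            v = beta2 * v + (1.0 - beta2) * g * g
            last_epoch.append(v)
    return last_epoch


simulated = simulate_v(C=3.0, d=3, beta2=0.9, epochs=500)
for i, v in enumerate(simulated, start=1):
    print(i, v, v_limit(i, 3.0, 3, 0.9))
```

The simulated values agree with the closed form and decrease with the position $i$ inside the epoch: the estimate is largest exactly at the timestep where the large gradient arrives.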
Given the formula of $v_t$ in Equation 8, we now study the net update factor of each gradient. We start with the simple case where $\beta_1 = 0$. In this case we have
$$\lim_{n \to \infty} k(g_{nd+i}) = \lim_{n \to \infty} \frac{\alpha_{nd+i}}{\sqrt{v_{nd+i}}}. \tag{9}$$
Since the limit of $v_{nd+i}$ in each epoch monotonically decreases as the index $i$ increases according to Equation 8, the limit of $k(g_{nd+i})$ monotonically increases within each epoch. Specifically, the first gradient $g_{nd+1} = C$ in epoch $n$ represents the correct updating direction, but its influence is the smallest in this epoch. In contrast, the net update factors of the subsequent gradients $-1$ are relatively larger, though they indicate a wrong updating direction.
The above problem stems from the inappropriate correlation between $v_t$ and $g_t$. Recall that $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, and assume that $v_{t-1}$ is independent of $g_t$. When a new gradient $g_t$ arrives, if $g_t$ is large, $v_t$ is likely to be larger; and if $g_t$ is small, $v_t$ is also likely to be smaller. As a result, a large gradient is likely to have a small net update factor, while a small gradient is likely to have a large net update factor in Adam.
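The effect can be reproduced with a short simulation of Adam with $\beta_1 = 0$ on the sequential counterexample. This is a sketch under assumed settings: the values $C = 3$, $d = 3$, $\beta_2 = 0.1$ and $\alpha = 0.01$ are chosen here for illustration, the parameter is left unconstrained, and the function name is hypothetical.

```python
import math


def run_adam_sequential(C=3.0, d=3, beta2=0.1, alpha=0.01, epochs=100):
    """Adam with beta1 = 0 on f_t(theta) = C*theta (t mod d == 1), else -theta."""
    theta, v, t = 0.0, 0.0, 0
    for _ in range(epochs):
        for i in range(1, d + 1):
            t += 1
            g = C if i == 1 else -1.0               # gradient does not depend on theta
            v = beta2 * v + (1.0 - beta2) * g * g
            v_hat = v / (1.0 - beta2 ** t)          # bias correction
            theta -= alpha * g / math.sqrt(v_hat)   # m_t = g_t because beta1 = 0
    return theta


print(run_adam_sequential())
```

The gradients of one epoch sum to $C - (d - 1) = 1 > 0$, so the parameter should decrease. In this configuration the large gradient is divided by a large $\sqrt{v_t}$ while the small gradients are divided by values close to one, and the parameter drifts upward instead.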
We further consider the general case where $\beta_1 \neq 0$. The result is presented in the following theorem:
Theorem 3 (Unbalanced net update factor).
In the sequential online optimization problem in Equation 6, when , the limit of net update factor of epoch is:
(10) 
And there exists such that
(11) 
and
(12) 
where denotes the net update factor for gradient .
Theorem 3 tells us that, in the sequential online optimization problem in Equation 6, the net update factors are unbalanced. Specifically, the net update factor for the large gradient $C$ is the smallest in the entire epoch, while all gradients $-1$ have larger net update factors. Such unbalanced net update factors may lead Adam to a wrong accumulated update direction.
3.3 Analysis on stochastic online optimization counterexamples
The counterexamples are also extended to stochastic cases in Reddi et al. (2018), where a finite set of cost functions appears in a stochastic order. Compared with the sequential online optimization counterexample, the stochastic version is more general and closer to the practical situation. For the simplest one-dimensional case, at each timestep $t$, the function $f_t(\theta)$ is chosen i.i.d. as:
$$f_t(\theta) = \begin{cases} C\theta, & \text{with probability } p = \frac{1+\delta}{C+1}; \\ -\theta, & \text{with probability } 1-p = \frac{C-\delta}{C+1}, \end{cases} \tag{13}$$
where $\delta$ is a small positive constant that is smaller than $C$. The expected cost function of the above problem is $F(\theta) = \frac{1+\delta}{C+1} C\theta - \frac{C-\delta}{C+1} \theta = \delta\theta$; therefore, one should decrease $\theta$ to minimize the loss. Reddi et al. (2018) prove that when $C$ is large enough, the expectation of the accumulated parameter update in Adam is positive and results in increasing $\theta$.
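The expected gradient of the stochastic counterexample can be verified exactly with rational arithmetic. The helper name `expected_gradient` and the sample values of $C$ and $\delta$ are illustrative assumptions.

```python
from fractions import Fraction


def expected_gradient(C, delta):
    """E[g] when g = C with probability p = (1 + delta) / (C + 1), else g = -1."""
    p = (1 + delta) / (C + 1)
    return p * C + (1 - p) * (-1)


# Arbitrary sample values; the result equals delta for any valid C and delta.
print(expected_gradient(Fraction(10), Fraction(1, 10)))
```

The expectation equals $\delta$, a small positive number, even though the gradient $-1$ is drawn far more often than $C$. This is why the correct direction is carried by the rare large gradient.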
To conduct a more rigorous study of the stochastic online optimization problem in Equation 13, we derive the expectation of the net update factor for each gradient in the following theorem:
Theorem 4 (Unbalanced net update factor in stochastic online optimization problem).
In the stochastic online optimization problem in Equation 13, assuming , the expectation of net update factors are as follows:
,  (14) 
and
,  (15) 
where denotes the net update factor for and denotes the net update factor for . is a positive value.
Though the formulas of the net update factors in the stochastic case are more complicated than those in the deterministic case, the analysis is actually easier: gradients with the same scale share the same expected net update factor, so we only need to analyze the expected net update factors of the two gradient values, $k(C)$ and $k(-1)$. We can see that each term in the infinite series of $k(C)$ is smaller than the corresponding one in $k(-1)$; therefore, the accumulated influence of gradient $C$ is smaller than that of gradient $-1$.
The above observation can also be interpreted as a direct consequence of the inappropriate correlation between $v_t$ and $g_t$: given $g_t$, not only does $v_t$ positively correlate with $g_t$, but the entire infinite sequence $\{v_{t+i}\}_{i=0}^{\infty}$ also positively correlates with $g_t$. Since the net update factor $k(g_t)$ negatively correlates with each $v_{t+i}$ in $\{v_{t+i}\}_{i=0}^{\infty}$, it also negatively correlates with $g_t$. That is, $k(g_t)$ for a large gradient is likely to be smaller, while $k(g_t)$ for a small gradient is likely to be larger.
The unbalanced net update factor causes the non-convergence problem of Adam as well as of all other adaptive learning rate methods where $v_t$ correlates with $g_t$. All these counterexamples follow the same pattern: the large gradient is along the "correct" direction, while the small gradient is along the opposite direction. Because the accumulated influence of a large gradient is small while the accumulated influence of a small gradient is large, Adam may update parameters along the wrong direction. Even if Adam updates parameters along the right direction in general, the unbalanced net update factors are still unfavorable since they slow down the convergence.
4 The proposed method: decorrelation via temporal shifting
According to the previous discussion, we conclude that the main cause of the non-convergence of Adam is the inappropriate correlation between $v_t$ and $g_t$. Currently we have two possible solutions: (1) making $v_t$ act like a constant, which weakens the correlation, e.g., using a large $\beta_2$ or keeping $v_t$ non-decreasing (Reddi et al., 2018); (2) using a large $\beta_1$ (Theorem 1), where the aggressive momentum term helps to mitigate the impact of unbalanced net update factors. However, neither of them solves the problem fundamentally.
The dilemma caused by $v_t$ forces us to rethink its role. In adaptive learning rate methods, $v_t$ plays the role of estimating the second moments of gradients, which reflects the scale of the gradient on average. With the adaptive learning rate $\alpha_t / \sqrt{v_t}$, the update step of $g_t$ is scaled down by $\sqrt{v_t}$ and achieves rescaling invariance with respect to the scale of $g_t$, which is practically useful for making the training process easy to control and the training system robust. However, the current scheme of $v_t$, i.e., $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, brings a positive correlation between $v_t$ and $g_t$, which reduces the effect of large gradients, increases the effect of small gradients, and finally causes the non-convergence problem. Therefore, the key is to let $v_t$ be a quantity that reflects the scale of the gradients while at the same time being decorrelated from the current gradient $g_t$. Formally, we have the following theorem:
Theorem 5 (Decorrelation leads to convergence).
For any fixed online optimization problem with infinitely repeating of a finite set of cost functions , assuming and is fixed, we have, if follows a fixed distribution and is independent of the current gradient , then the expected net update factor for each gradient is identical.
Let denote the distribution of . In the infinitely repeating online optimization scheme, the expectation of net update factor for each gradient is
(16) 
Given that $v_t$ is independent of $g_t$, the expectation of the net update factor $k(g_t)$ is independent of $g_t$ and remains the same for different gradients. With the expected net update factor being a fixed constant, the convergence of the adaptive learning rate method reduces to that of vanilla SGD.
Momentum (Qian, 1999) can be viewed as setting $v_t$ as a constant, which makes $v_t$ and $g_t$ independent. Furthermore, in our view, using an increasing $\beta_2$ (AdamNC) or keeping $\hat{v}_t$ as the largest $v_t$ (AMSGrad) also amounts to making $v_t$ almost fixed. However, fixing $v_t$ is not a desirable solution, because it damages the ability of Adam to adapt its step size.
We next introduce the proposed solution to make $v_t$ independent of $g_t$, which is based on a temporal independence assumption among gradients. The proposed solution decorrelates $v_t$ and $g_t$ with a simple temporal shifting.
4.1 Temporal decorrelation
In practical settings, $f_t(\theta)$ usually involves different mini-batches $x_t$, i.e., $f_t(\theta) = f(\theta; x_t)$. Given the randomness of mini-batches, we assume that the mini-batches $x_t$ are independent of each other and further assume that $f(\theta; x)$ remains unchanged over time; then the gradients $g_t = \nabla f(\theta; x_t)$ of the mini-batches are independent of each other. We present the resulting temporal decorrelation algorithm as Algorithm 1.
The key change is that the update rule for $v_t$ now involves $g_{t-n}$ instead of $g_t$ (line 4), which makes $v_t$ and $g_t$ temporally shifted and hence decorrelated. Note that in the sequential online optimization problem, the assumption "$g_t$ is independent of each other" does not hold. However, in the stochastic online optimization problem and practical neural network settings, our assumption generally holds.
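The temporal shift can be sketched as follows for a scalar parameter. This is an illustration rather than the paper's listing: the warm-up that skips updates until $n$ gradients are available, the bias correction of $v_t$, the `eps` term, and the use of a precomputed gradient sequence are all assumptions made to keep the example self-contained.

```python
import math
from collections import deque


def adashift_temporal(grads, n=1, alpha=0.1, beta2=0.9, theta0=0.0, eps=1e-8):
    """Update a scalar theta where v_t is built from g_{t-n} instead of g_t."""
    theta, v, updates = theta0, 0.0, 0
    window = deque(maxlen=n)        # holds g_{t-n}, ..., g_{t-1}
    history = []
    for g in grads:
        if len(window) == n:
            g_shifted = window[0]   # g_{t-n}
            updates += 1
            v = beta2 * v + (1.0 - beta2) * g_shifted ** 2
            v_hat = v / (1.0 - beta2 ** updates)   # bias correction (assumption)
            theta -= alpha * g / (math.sqrt(v_hat) + eps)
        window.append(g)            # evicts the oldest gradient when full
        history.append(theta)
    return history


print(adashift_temporal([2.0, 100.0, 2.0]))
```

In the example, the large gradient 100 is scaled by the magnitude of the previous gradient rather than by its own, so its step is not shrunk by itself. This is the decorrelation the temporal shift is meant to achieve.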
4.2 Temporal-spatial decorrelation
In most optimization schemes, there exist many parameters, i.e., the dimension of $\theta$ is high, and thus $g_t$ and $v_t$ are also of high dimension. However, in Algorithm 1, $v_t$ is computed element-wise; that is, we only use the $i$-th dimension of $g_{t-n}$ to calculate the $i$-th dimension of $v_t$. In other words, it only makes use of the independence between $g_{t-n}[i]$ and $g_t[i]$, where $g_t[i]$ denotes the $i$-th element of $g_t$.
Actually, in the case of high-dimensional $g_t$ and $v_t$, we can further assume that all elements of the gradient $g_{t-n}$ at the previous timestep are independent of the $i$-th dimension of $g_t$. Thus all elements in $g_{t-n}$ can be used to compute $v_t$ without introducing correlation. We propose introducing a function $\phi$ over all elements of $g_{t-n}^2$ to achieve this goal, i.e., $v_t = \beta_2 v_{t-1} + (1-\beta_2)\phi(g_{t-n}^2)$. For ease of reference, we name the elements of $g_{t-n}$ other than $g_{t-n}[i]$ the spatial elements of $g_{t-n}$ and name $\phi$ the spatial function or spatial operation.
There is no restriction on the choice of $\phi$, and we use $\phi(x) = \max_i x[i]$ for most of our experiments, which is shown to be a good choice. The $\max_i x[i]$ operation has a side effect that turns the adaptive learning rate $v_t$ into a shared scalar. An important point here is that we no longer interpret $v_t$ as the second moment of $g_t$. It is merely a random variable that is independent of $g_t$ while at the same time reflecting the overall gradient scale. We leave further investigations on $\phi$ as future work.
4.3 Block-wise temporal-spatial decorrelation
In practical settings, e.g., deep neural networks, $\theta$ usually consists of many parameter blocks, e.g., the weight and bias for each layer. In deep neural networks, the gradient scales (i.e., the variance) for different layers tend to be different (Glorot & Bengio, 2010; He et al., 2015). Different gradient scales make it hard to find a learning rate that is suitable for all layers when using SGD and Momentum methods. Traditional adaptive learning rate methods apply element-wise rescaling for each gradient dimension, which achieves rescaling invariance and partly solves the above problem. However, Adam sometimes does not generalize better than SGD (Wilson et al., 2017; Keskar & Socher, 2017), which might relate to the excessive learning rate adaptation in Adam.
In our temporal-spatial decorrelation scheme, we can solve the "different gradient scales" issue more naturally, by applying $\phi$ block-wise and outputting a shared adaptive learning rate scalar $v_t[i]$ for each block. It makes the algorithm work like an adaptive learning rate SGD, where each block has an adaptive learning rate while the relative gradient scale among in-block elements remains unchanged. The corresponding algorithm is illustrated in Algorithm 2, where the parameters $\theta_t$, including the related $g_t$ and $v_t$, are divided into $M$ blocks. Every block contains the parameters of the same type or same layer in the neural network.
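A minimal sketch of the block-wise update of $v_t$ is given below. It assumes each block is a flat list of scalars and that the temporally shifted gradients $g_{t-n}$ are already available; the function name and the data layout are hypothetical.

```python
def blockwise_v_update(v_prev, shifted_grad_blocks, beta2=0.999, phi=max):
    """Update one shared scalar v per block from the shifted gradients g_{t-n}.

    v_prev:              list with one scalar per block
    shifted_grad_blocks: list of blocks, each a list of elements of g_{t-n}
    phi:                 spatial operation applied to the squared elements
    """
    return [
        beta2 * v + (1.0 - beta2) * phi(g * g for g in block)
        for v, block in zip(v_prev, shifted_grad_blocks)
    ]


# Two hypothetical blocks with different gradient scales.
print(blockwise_v_update([0.0, 0.0], [[1.0, -3.0], [0.5]], beta2=0.5))
```

Because every element of a block is divided by the same $\sqrt{v_t[i]}$, the relative scale of the gradients inside the block is preserved, while blocks with different gradient scales still receive different effective learning rates.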
4.4 Incorporating first moment: moving averaging windows
First moment estimation, i.e., defining $m_t$ as a moving average of $g_t$, is an important technique of modern first-order optimization algorithms, which alleviates mini-batch oscillations. In this section, we extend our algorithm to incorporate first moment estimation.
We have argued that $v_t$ needs to be decorrelated from $g_t$. Similarly, when introducing the first moment estimation, we need to make $v_t$ and $m_t$ independent to make the expected net update factor unbiased. Based on our assumption of temporal and spatial independence, we further keep out the latest $n$ gradients $\{g_{t-i}\}_{i=0}^{n-1}$, and update $v_t$ and $m_t$ via
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\phi(g_{t-n}^2) \quad\text{and}\quad m_t = \frac{\sum_{i=0}^{n-1} \beta_1^{i} g_{t-i}}{\sum_{i=0}^{n-1} \beta_1^{i}}. \tag{17}$$
In Equation 17, $\beta_1$ plays the role of the decay rate for temporal elements. It can be viewed as a truncated version of the exponential moving average that is only applied to the latest few elements. Since we truncate, it is feasible to use a large $\beta_1$ without the risk of using gradients that are too old. In the extreme case where $\beta_1 = 1$, it becomes vanilla averaging.
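The truncated moving average for the first moment can be sketched in a few lines. The function name `windowed_first_moment` and the newest-first ordering of its input are assumptions made for this example.

```python
def windowed_first_moment(latest_grads, beta1):
    """Normalized, truncated moving average over the kept-out gradients.

    latest_grads[i] is g_{t-i}, so index 0 holds the newest gradient.
    """
    weights = [beta1 ** i for i in range(len(latest_grads))]
    weighted_sum = sum(w * g for w, g in zip(weights, latest_grads))
    return weighted_sum / sum(weights)


window = [1.0, 2.0, 3.0]   # g_t, g_{t-1}, g_{t-2}
print(windowed_first_moment(window, 1.0))   # plain average of the window
print(windowed_first_moment(window, 0.0))   # only the newest gradient
print(windowed_first_moment(window, 0.5))
```

Normalizing by the sum of the weights removes the need for a separate bias-correction step, and the two extreme decay rates recover plain averaging and the newest gradient, respectively.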
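The pseudo code of the algorithm that unifies all proposed techniques is presented in the Appendix with the following parameters: the spatial operation $\phi$, the number of kept-out gradients $n$, the decay rates $\beta_1$ and $\beta_2$, and the base learning rate $\alpha_t$.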
Summary
The key difference between Adam and the proposed method is that the latter temporally shifts the gradient $g_t$ by $n$ steps, i.e., it uses $g_{t-n}$ for calculating $v_t$ and uses the kept-out $n$ gradients to evaluate $m_t$ (Equation 17), which makes $v_t$ and $m_t$ decorrelated and consequently solves the non-convergence issue. In addition, based on our new perspective on adaptive learning rate methods, $v_t$ is not necessarily the second moment, and it is valid to further involve the spatial elements of previous gradients in the calculation of $v_t$. We found that when the spatial operation $\phi$ outputs a shared scalar for each block, the resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale in each block is maintained. This is what we call "block-wise adaptive learning rate SGD". We name the proposed method, which makes use of temporal shifting to decorrelate $v_t$ and $m_t$, AdaShift, which means "ADAptive learning rate method with temporal SHIFTing".
5 Experiments
In this section, we empirically study the proposed method and compare it with Adam, AMSGrad and SGD on various tasks in terms of training performance and generalization. Unless otherwise stated, the reported result for each algorithm is the best we have found via parameter grid search. The anonymous code is provided at http://bit.ly/2NDXX6x.
5.1 Online Optimization Counterexamples
Firstly, we verify our analysis on the stochastic online optimization problem in Equation 13, where we set and . We compare Adam, AMSGrad and AdaShift in this experiment. For fair comparison, we set , and for all these methods. The results are shown in Figure 0(a). We can see that Adam tends to increase $\theta$, that is, the accumulated update of $\theta$ in Adam is along the wrong direction, while AMSGrad and AdaShift update $\theta$ in the correct direction. Furthermore, given the same learning rate, AdaShift decreases $\theta$ faster than AMSGrad, which validates our argument that AMSGrad has a relatively higher $v_t$ that slows down the training. In this experiment, we also verify Theorem 1. As shown in Figure 0(b), Adam is also able to converge to the correct direction with a sufficiently large $\beta_1$ and $\beta_2$. Note that (1) AdaShift still converges with the fastest speed; (2) a small $\beta_1$ (the light-blue line in Figure 0(b)) does not make Adam converge to the correct direction. We do not conduct the experiments on the sequential online optimization problem in Equation 6, because it does not fit our temporal independence assumption. To make it converge, one can use a large $\beta_1$ or $\beta_2$, or set $v_t$ as a constant.
5.2 Logistic Regression and Multilayer Perceptron on MNIST
We further compare the proposed method with Adam, AMSGrad and SGD by using Logistic Regression and Multilayer Perceptron on MNIST, where the Multilayer Perceptron has two hidden layers and each has hidden units with no internal activation. The results are shown in Figure 3 and Figure 3, respectively. We find that in Logistic Regression, these learning algorithms achieve very similar final results in terms of both training speed and generalization. In Multilayer Perceptron, we compare Adam, AMSGrad and AdaShift with the reduce-max spatial operation (max-AdaShift) and without spatial operation (non-AdaShift). We observe that max-AdaShift achieves the lowest training loss, while non-AdaShift has mild training loss oscillation and at the same time achieves better generalization. The worse generalization of max-AdaShift may be due to overfitting in this task, and the better generalization of non-AdaShift may stem from the regularization effect of its relatively unstable step size.
5.3 DenseNet and ResNet on CIFAR-10
ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) are two typical modern neural networks, which are efficient and widely used. We test our algorithm with ResNet and DenseNet on the CIFAR-10 dataset. We use a layer ResNet and layer DenseNet in our experiments. We plot the best results of Adam, AMSGrad and AdaShift in Figure 4 and Figure 5 for ResNet and DenseNet, respectively. We can see that AMSGrad is relatively worse in terms of both training speed and generalization. Adam and AdaShift achieve comparable results, while AdaShift is generally slightly better, especially in the test accuracy of ResNet and the training loss of DenseNet.
5.4 DenseNet with Tiny-ImageNet
We further increase the complexity of the dataset, switching from CIFAR-10 to Tiny-ImageNet, and compare the performance of Adam, AMSGrad and AdaShift with DenseNet. The results are shown in Figure 6, from which we can see that the training curves of Adam and AdaShift basically overlap, but AdaShift achieves higher test accuracy than Adam. AMSGrad has relatively higher training loss, and its test accuracy is relatively lower at the initial stage.
5.5 Generative model and Recurrent model
We also test our algorithm on the training of a generative model and a recurrent model. We choose WGAN-GP (Gulrajani et al., 2017), which involves a Lipschitz continuity condition (which is hard to optimize), and Neural Machine Translation (NMT) (Luong et al., 2017), which involves the typical recurrent unit LSTM, respectively. In Figure 6(a), we compare the performance of Adam, AMSGrad and AdaShift in the training of the WGAN-GP discriminator, given a fixed generator. We notice that AdaShift is significantly better than Adam, while the performance of AMSGrad is relatively unsatisfactory. The test performance of NMT in terms of BLEU is shown in Figure 6(b), where AdaShift achieves a higher BLEU than Adam and AMSGrad.
6 Conclusion
In this paper, we study the non-convergence issue of adaptive learning rate methods from the perspective of the equivalent accumulated step size of each gradient, i.e., the net update factor defined in this paper. We show that there exists an inappropriate correlation between $v_t$ and $g_t$, which leads to an unbalanced net update factor for each gradient. We demonstrate that such unbalanced step sizes are the fundamental cause of the non-convergence of Adam, and we further prove that decorrelating $v_t$ and $g_t$ leads to an unbiased expected step size for each gradient, thus solving the non-convergence problem of Adam. Finally, we propose AdaShift, a novel adaptive learning rate method that decorrelates $v_t$ and $g_t$ by calculating $v_t$ using the temporally shifted gradient $g_{t-n}$.
In addition, based on our new perspective on adaptive learning rate methods, $v_t$ is no longer necessarily the second moment of $g_t$, but a random variable that is independent of $g_t$ and reflects the overall gradient scale. Thus, it is valid to calculate $v_t$ with the spatial elements of previous gradients. We further found that when the spatial operation $\phi$ outputs a shared scalar for each block, the resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale in each block is maintained. The experimental results demonstrate that AdaShift is able to solve the non-convergence issue of Adam. Meanwhile, AdaShift achieves competitive and even better training and testing performance when compared with Adam.
References

Dozat (2016) Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256, 2010.
Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.
He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. 2017.
Keskar & Socher (2017) Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Luong et al. (2017) Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt, 2017.
Qian (1999) Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7fRZ.
Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
Wilson et al. (2017) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158, 2017.
Zeiler (2012) Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A The relation among , and
To provide an intuitive impression on the relation among and the convergence of Adam, we let , initialize , vary and among and let Adam go through 2000 timesteps (iterations). The final result of is shown in Figure 7(a). It suggests that for a fixed sequential online optimization problem, both and determine the direction and speed of the Adam optimization process. Furthermore, we also study the threshold point of and , under which Adam will change to the incorrect direction, for each fixed and that vary among . To simplify the experiments, we keep such that the overall gradient of each epoch is . The result is shown in Figure 7(b), which suggests that, with a larger or a larger , a larger is needed to make Adam move in the opposite direction. In other words, large and make non-convergence rare.
We also conduct the experiment in the stochastic problem to analyze the relation among , , and the convergence behavior of Adam. Results are shown in Figure 7(c) and Figure 7(d), and the observations are similar to the previous ones: a larger causes non-convergence more easily, and a larger or helps to some extent to resolve the non-convergence issue. In this experiment, we set .
Theorem 6 (Critical condition).
In the sequential online optimization problem Equation 6, let being fixed, define to be the sum of the limits of step updates in a step epoch:
(18) 
Let , assuming and are large enough such that , we get the equation:
(19) 
Equation 19, though being quite complex, tells that both and are closely related to the counterexamples, and there exists a critical condition among these parameters.
Appendix B The AdaShift Pseudo Code
Appendix C Correlation between $g_t$ and $v_t$
In order to verify the correlation between $g_t$ and $v_t$ in Adam and AdaShift, we conduct experiments to calculate the correlation coefficient between $g_t$ and $v_t$. We train the Multilayer Perceptron on MNIST until convergence and gather the gradient of the second hidden layer at each step. Based on these data, we calculate $v_t$ and the correlation coefficient between $g_t[i]$ and $g_{t-n}[i]$, between $g_t[i]$ and $g_{t-n}[j]$, and between $g_t$ and $v_t$ of the last epochs using the Pearson correlation coefficient, which is formulated as follows:
$$\rho(x, y) = \frac{\sum_{k} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k} (x_k - \bar{x})^2}\,\sqrt{\sum_{k} (y_k - \bar{y})^2}}.$$
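For reference, the Pearson correlation coefficient between two equally long sequences can be computed as in the sketch below. The helper names `pearson` and `lagged_correlation` are hypothetical, and the toy inputs are not data from the experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def lagged_correlation(series, n):
    """Correlation between series[t] and series[t - n] for a lag n >= 1."""
    return pearson(series[n:], series[:-n])


print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(lagged_correlation([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], 1))
```

Applied to one gradient coordinate recorded over training steps, `lagged_correlation` corresponds to the temporal correlation at lag $n$. Both sequences must have non-zero variance, otherwise the denominator is zero.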
To verify the temporal correlation between $g_t[i]$ and $g_{t-n}[i]$, we range $n$ from 1 to 10 and calculate the average temporal correlation coefficient of all variables $i$. Results are shown in Table 1.
Table 1: Average temporal correlation coefficient for each $n$.
n     coefficient
1     0.000368929
2     0.000989286
3     0.001540511
4     0.00116966
5     0.001613395
6     0.001211721
7     0.000357474
8     0.00082293
9     0.001755237
10    0.001267641
To verify the spatial correlation between $g_t[i]$ and $g_{t-n}[j]$, we again range $n$ from 1 to 10, randomly sample some pairs of $i$ and $j$, and calculate the average spatial correlation coefficient of all the selected pairs. Results are shown in Table 2.
Table 2: Average spatial correlation coefficient for each $n$.
n     coefficient
1     0.000609471
2     0.001948853
3     0.001426661
4     0.000904615
5     0.000329359
6     0.000971337
7     0.000644563
8     0.00137805
9     0.001147973
10    0.000592037
To verify the correlation between and within Adam, we calculate and the average correlation coefficient between and of all variables . The result is 0.435885276.
To verify the correlation between and within non-AdaShift and between and within max-AdaShift, we range the keep number $n$ from 1 to 10 to calculate and the average correlation coefficient of all variables . The results are shown in Table 3 and Table 4, respectively.
Table 3: Average correlation coefficient within non-AdaShift for each keep number $n$.
n     coefficient
1     0.010897023
2     0.010952548
3     0.010890854
4     0.010853069
5     0.010810747
6     0.010777789
7     0.01075946
8     0.010739279
9     0.010728553
10    0.010720019
Table 4: Average correlation coefficient within max-AdaShift for each keep number $n$.
n     coefficient
1     0.000706289
2     0.000794959
3     0.00076306
4     0.000712474
5     0.000668459
6     0.000623162
7     0.000566573
8     0.000542046
9     0.000598015
10    0.000592707
Appendix D Proof of Theorem 1
Proof.
With bias correction, the formulation of is written as follows
(20) 
According to L'Hôpital's rule, we can derive the following:
Thus,
According to the definition of the limit, let , we have, , , such that
We set to be , then for each dimension of , i.e. ,
So, shares the same sign with in every dimension.
Given it is a convex optimization problem, let the optimal parameter be , and the maximum step size is that holds , we have,
(21) 
Given , we have , which implies the average regret
(22) 
∎
Appendix E Proof of Theorem 2
Proof.
Let .