Stochastic Optimization with Non-stationary Noise

06/08/2020 · Jingzhao Zhang et al. · MIT, IBM

We investigate stochastic optimization problems under relaxed assumptions on the distribution of noise that are motivated by empirical observations in neural network training. Standard results on optimal convergence rates for stochastic optimization assume either there exists a uniform bound on the moments of the gradient noise, or that the noise decays as the algorithm progresses. These assumptions do not match the empirical behavior of optimization algorithms used in neural network training where the noise level in stochastic gradients could even increase with time. We address this behavior by studying convergence rates of stochastic gradient methods subject to changing second moment (or variance) of the stochastic oracle as the iterations progress. When the variation in the noise is known, we show that it is always beneficial to adapt the step-size and exploit the noise variability. When the noise statistics are unknown, we obtain similar improvements by developing an online estimator of the noise level, thereby recovering close variants of RMSProp. Consequently, our results reveal an important scenario where adaptive stepsize methods outperform SGD.


1 Introduction

Stochastic gradient descent (SGD) is one of the most popular optimization methods in machine learning because of its computational efficiency compared to traditional full gradient methods. Great progress has been made in understanding the performance of SGD under different smoothness and convexity conditions (Ghadimi and Lan, 2013, 2012; Nemirovsky and Yudin, 1983; Rakhlin et al., 2012; Agarwal et al., 2009; Arjevani et al., 2019; Drori and Shamir, 2019). These results show that with a fixed step size, SGD can achieve the minimax optimal convergence rate for both convex and nonconvex optimization problems, provided the gradient noise is uniformly bounded.

Yet, despite the theoretical minimax optimality of SGD, adaptive gradient methods (Tieleman and Hinton, 2012; Kingma and Ba, 2014; Duchi et al., 2011) have become the methods of choice for training deep neural networks, and have received a surge of attention recently (Levy, 2017; Ward et al., 2019; Li and Orabona, 2019; Zhou et al., 2018; Staib et al., 2019; Chen et al., 2019; Zou and Shen, 2018; Zhou et al., 2019; Agarwal et al., 2018; Levy et al., 2018; Zou et al., 2019; Liu et al., 2019; Ma and Yarats, 2019; Huang et al., 2019; Zhang et al., 2020, 2019; Liu et al., 2020). Instead of using fixed stepsizes, these methods construct their stepsizes adaptively using the current and past gradients. But despite advances in the literature on adaptivity, theoretical understanding of the benefits of adaptation is still quite limited.

We provide a different perspective on understanding the benefits of adaptivity by considering it in the context of non-stationary gradient noise, i.e., the noise intensity varies with iteration. Surprisingly, this setting is rarely studied, even for SGD. To our knowledge, this is the first work to formally study stochastic gradient methods in this varying noise scenario. Our main goal is to show that:

Adaptive step-sizes can guarantee faster rates than SGD when the noise is non-stationary.

We focus on this goal based on several empirical observations (Section 2), which lead us to model the noise of stochastic gradient oracles via the following more realistic assumptions:

$\mathbb{E}\big[\|g_k\|^2\big] \le \sigma_k^2 \qquad \text{or} \qquad \mathbb{E}\big[\|g_k - \nabla f(x_k)\|^2\big] \le s_k^2, \qquad (1)$

where $g_k$ is the stochastic gradient and $\nabla f(x_k)$ the true gradient at iteration $k$. The second moments $\sigma_k^2$ and variances $s_k^2$ are independent of the algorithm.

Assumption (1) relaxes the standard assumption on SGD that uniformly bounds the variance, and helps model gradient methods that operate with iteration-dependent noise intensity. Intuitively, one should prefer smaller stepsizes when the noise is large and vice versa. Thus, under non-stationarity, an ideal algorithm should adapt its stepsize to the parameters $\sigma_k$ or $s_k$, suggesting a potential benefit of adaptive stepsizes.

Contributions.

The primary contribution of our paper is to show that a stochastic optimization method with adaptive stepsizes can achieve a faster rate of convergence (by a factor polynomial in the number of iterations $T$) than fixed-step SGD. We first analyze an idealized setting where the noise intensities are known, using it to illustrate how to select noise-dependent stepsizes that are provably more effective (Theorem 1). Next, we study the case of unknown noise, where we show, under an appropriate smoothness assumption on the noise variation, that a variant of RMSProp (Tieleman and Hinton, 2012) achieves the idealized convergence rate (Theorem 3); remarkably, this variant does not require knowledge of the noise levels. Finally, we generalize our results to nonconvex settings (Theorems 12 and 14).

2 Motivating observation: nonstationary noise in neural network training

Figure 1: Panels: (a) ResNet18 on CIFAR10, (b) AWD-LSTM on PTB, (c) Transformer on En-De translation. We empirically evaluate the second moment (blue) and variance (orange) of stochastic gradients during the training of neural networks. The magnitude of these quantities changes significantly as the iteration count increases, by a factor of roughly 10 for the ResNet and by a much larger factor for the Transformer. This phenomenon motivates us to consider a setting with non-stationary noise.

Neural network training involves optimizing an empirical risk minimization problem of the form $\min_x f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$, where each $f_i$ represents the loss function with respect to the $i$-th data point or minibatch. Stochastic methods optimize this objective by randomly sampling an incremental gradient $\nabla f_i(x)$ at each iteration and using it as an unbiased estimate of the full gradient $\nabla f(x)$. The noise intensity of this stochastic gradient is measured by its second moment or variance, defined as:

  1. Second moment: $\mathbb{E}_i\big[\|\nabla f_i(x)\|^2\big]$;

  2. Variance: $\mathbb{E}_i\big[\|\nabla f_i(x) - \nabla f(x)\|^2\big]$, where $\nabla f(x)$ is the full gradient.

To illustrate how these quantities evolve over iterations, we empirically evaluate them for three popular neural network training tasks: ResNet18 on the CIFAR10 dataset for image classification (code: https://github.com/kuangliu/pytorch-cifar), AWD-LSTM on the PTB dataset for language modelling (code: https://github.com/salesforce/awd-lstm-lm), and a Transformer on WMT16 En-De for machine translation (code: https://github.com/jadore801120/attention-is-all-you-need-pytorch). The results are shown in Figure 1, where both the second moments and variances are evaluated using the default training procedure of the original code.
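For concreteness, the sketch below (not the paper's code; the toy least-squares problem and all names are our own illustrative choices) shows how these two quantities can be estimated by Monte-Carlo sampling of minibatch gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 1000, 20, 32                      # illustrative sizes
A, b = rng.normal(size=(n, d)), rng.normal(size=n)   # toy least-squares data
x = rng.normal(size=d)                               # current iterate

def minibatch_grad(x, idx):
    """Gradient of the average squared loss over the minibatch indexed by idx."""
    r = A[idx] @ x - b[idx]
    return 2.0 * A[idx].T @ r / len(idx)

full_grad = 2.0 * A.T @ (A @ x - b) / n

# Monte-Carlo estimates of the two noise measures at the current iterate.
samples = [minibatch_grad(x, rng.choice(n, batch_size, replace=False))
           for _ in range(200)]
second_moment = np.mean([np.linalg.norm(g) ** 2 for g in samples])
variance = np.mean([np.linalg.norm(g - full_grad) ** 2 for g in samples])
print(second_moment, variance, np.linalg.norm(full_grad) ** 2)
```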

On one hand, the variation of the second moment/variance has a very different shape in each of the considered tasks. In the CIFAR experiment, the noise intensity is quite steady after the initial phase, indicating fast convergence of the training model. In LSTM training, the noise level increases and then levels off. In Transformer training, the noise level increases very quickly in the early epochs, reaches a maximum, and then decreases gradually.

On the other hand, the preferred optimization algorithms in these tasks also differ. For CIFAR10, SGD with momentum is the most popular choice, while for language models, adaptive methods such as Adam or RMSProp are the default. This discrepancy is usually taken for granted, based on empirical validation, and little theoretical understanding of it exists in the literature.

Based on the observations made in Figure 1, a natural candidate emerges to explain this discrepancy in the choice of algorithms: the performance of the different stochastic algorithms varies according to the characteristics of the gradient noise encountered during training. Despite this behavior, noise level modeling has drawn surprisingly limited attention in prior work. Moulines and Bach (2011) study convergence of SGD assuming each component function is convex and smooth; extensions to the variation of the full covariance matrix appear in Gadat and Panloup (2017). A more fine-grained assumption is that the variance grows with the gradient norm, e.g., as $c_0 + c_1\|\nabla f(x)\|^2$, or with the suboptimality $f(x) - f^*$ (Bottou et al., 2018; Jofré and Thompson, 2019; Rosasco et al., 2019).

Unfortunately, these known assumptions fail to express the variation of noise observed in Figure 1. Indeed, the norm of the full gradient, represented as the difference between the orange and the blue lines, is significantly smaller than the noise level. This suggests that the noise variation is not due to the gradient norm, but due to some implicit properties of the objective function. The limitations of the existing assumptions motivate us to introduce the following assumption on the noise.

Assumption 1 (non-stationary noise oracle).

The stochasticity of the problem is governed by a sequence of second moments $\{\sigma_k^2\}$ or variances $\{s_k^2\}$ such that, at the $k$-th iteration, the gradient oracle returns an unbiased gradient $g_k$ with $\mathbb{E}[g_k] = \nabla f(x_k)$, and either

  (a) the second moment is bounded, $\mathbb{E}\big[\|g_k\|^2\big] \le \sigma_k^2$; or

  (b) the variance is bounded, $\mathbb{E}\big[\|g_k - \nabla f(x_k)\|^2\big] \le s_k^2$.

By introducing time-dependent second moments and variances, we aim to understand how the variation of noise influences the convergence rate of optimization algorithms. Under our assumption, the change of the noise level is decoupled from its location: the parameters $\sigma_k$ or $s_k$ depend only on the iteration number $k$, not on the specific location where the gradient is evaluated. This assumption holds, for example, when the noise is additive, namely $g_k = \nabla f(x_k) + \xi_k$ with $\xi_k$ independent of $x_k$. Even though this iterate independence of the noise may seem restrictive, it is already more relaxed than the standard assumption on SGD that requires the noise to be uniformly bounded by a fixed constant. Thus, our relaxed assumption helps us take the first step toward our goal: characterizing the convergence rate of adaptive algorithms under non-stationary noise.
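As a minimal illustration of the additive-noise case, the following sketch (our own construction, with illustrative names) implements a gradient oracle whose noise level depends only on the iteration counter:

```python
import numpy as np

class NonstationaryOracle:
    """Returns grad f(x) + xi_k, where xi_k has an iteration-dependent level sigma[k]."""

    def __init__(self, grad_fn, sigma_schedule, seed=0):
        self.grad_fn = grad_fn        # callable returning the true gradient of f
        self.sigma = sigma_schedule   # sigma[k]: noise level at iteration k
        self.rng = np.random.default_rng(seed)

    def __call__(self, x, k):
        noise = self.rng.normal(scale=self.sigma[k] / np.sqrt(x.size), size=x.size)
        return self.grad_fn(x) + noise   # unbiased; E||noise||^2 = sigma[k]^2
```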

To avoid redundancy, we present our results mainly in terms of the second moment parameters $\sigma_k$ and defer the discussion of the variance parameters $s_k$ to Section 5. One reason we prioritize the second moment over the variance is to draw a connection with the well-known adaptive method RMSProp (Tieleman and Hinton, 2012).

3 The benefit of adaptivity under nonstationary noise

In this section, we investigate the influence of nonstationary noise in an idealized setting where the noise parameters are known. To simplify the presentation, we will first focus on the convex setting. Similar results also hold for nonconvex problems and are noted later in Section 5.

Let $f$ be convex and differentiable. We consider the problem $\min_x f(x)$, where the stochastic gradient $g_k$ is given by a non-stationary noise oracle satisfying Assumption 1. We assume that the optimum is attained at $x^*$ and denote by $f^* = f(x^*)$ the minimum of the objective. We are interested in studying the convergence rate of a stochastic algorithm with update rule

$x_{k+1} = x_k - \eta_k\, g_k, \qquad (2)$

where the stepsizes $\eta_k$ are oblivious of the iterates $x_k$.

Theorem 1.

Under Assumption 1(a), the weighted average $\bar{x}_T = \frac{\sum_{k=0}^{T-1}\eta_k x_k}{\sum_{k=0}^{T-1}\eta_k}$ of the iterates obtained by the update rule (2) satisfies the suboptimality bound

$\mathbb{E}\big[f(\bar{x}_T) - f^*\big] \;\le\; \frac{\|x_0 - x^*\|^2 + \sum_{k=0}^{T-1}\eta_k^2\sigma_k^2}{2\sum_{k=0}^{T-1}\eta_k}. \qquad (3)$

The theorem follows from standard analysis, yet it leads to valuable observations explained below.

Corollary 2.

Let $D = \|x_0 - x^*\|$. We have the following two convergence rate bounds for SGD:

  1. SGD with constant stepsize: if $\eta_k = \frac{D}{\sqrt{\sum_{j=0}^{T-1}\sigma_j^2}}$, then
     $\mathbb{E}\big[f(\bar{x}_T) - f^*\big] \;\le\; \frac{D}{\sqrt{T}}\cdot\sqrt{\tfrac{1}{T}\sum_{k=0}^{T-1}\sigma_k^2}$;   (constant baseline)

  2. SGD with idealized stepsize: if $\eta_k = \frac{D}{\sigma_k\sqrt{T}}$, then
     $\mathbb{E}\big[f(\bar{x}_T) - f^*\big] \;\le\; \frac{D}{\sqrt{T}}\cdot\Big(\tfrac{1}{T}\sum_{k=0}^{T-1}\tfrac{1}{\sigma_k}\Big)^{-1}$.   (idealized baseline)

To facilitate comparison, in Corollary 2 we have normalized the convergence rates with respect to the conventional $\frac{D}{\sqrt{T}}$ rate. The constant baseline carries an additional factor given by the root-mean-square of the $\sigma_k$, while the factor in the idealized baseline is inversely proportional to the average of the $1/\sigma_k$. In particular, by Jensen's inequality,

$\Big(\tfrac{1}{T}\sum_{k}\tfrac{1}{\sigma_k}\Big)^{-1} \;\le\; \tfrac{1}{T}\sum_{k}\sigma_k \;\le\; \sqrt{\tfrac{1}{T}\sum_{k}\sigma_k^2},$

implying that the idealized baseline is always at least as good as the constant baseline. This observation is rather expected, as the stepsizes are adapted to the noise in an idealized way. It is worth noting that the constant baseline also benefits from explicit knowledge of the $\sigma_k$, replacing the uniform upper bound $\sigma_{\max}$ by the root-mean-square of the noise levels.

If the actual values $\sigma_k$ are unavailable, but a uniform upper bound $\sigma_{\max}$ with $\sigma_k \le \sigma_{\max}$ is known, then replacing all the $\sigma_k$ by $\sigma_{\max}$ in both stepsize choices recovers the standard $O\big(\tfrac{D\sigma_{\max}}{\sqrt{T}}\big)$ result (Nemirovski et al., 2009).
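The comparison between the two baselines thus reduces to an inequality between means of the noise levels. The short check below (with an arbitrary noise schedule of our choosing) verifies numerically that the idealized factor, the harmonic mean of the $\sigma_k$, never exceeds the constant-baseline factor, their root-mean-square:

```python
import numpy as np

# An arbitrary rise-then-fall noise schedule (illustrative only).
sigma = np.concatenate([np.linspace(0.1, 10.0, 500), np.linspace(10.0, 1.0, 500)])

idealized_factor = 1.0 / np.mean(1.0 / sigma)   # harmonic mean of the sigma_k
constant_factor = np.sqrt(np.mean(sigma ** 2))  # root-mean-square of the sigma_k

assert idealized_factor <= constant_factor      # Jensen's inequality
print(idealized_factor, constant_factor)
```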

To further illustrate the difference in the convergence rates, we consider the following synthetic noise model, mimicking our observations in the training of the Transformer (see Figure 1(c)).

Example 1.

Consider a piece-wise linear noise model in which the noise level $\sigma_k$ first increases linearly with the iteration $k$ and then decreases, as in Figure 1(c).

For this noise model, the ratio between the maximum and the minimum noise level is large. Following the bounds developed in Corollary 2, the constant baseline maintains the standard convergence rate, whereas the idealized baseline converges strictly faster. Hence a nontrivial acceleration, polynomial in $T$, is obtained by using the idealized stepsize, and this acceleration can be arbitrarily large as $T$ increases.

This example is encouraging, showing that the speedup due to adaptive stepsizes can be polynomial in the number of iterations, especially when the ratio between the maximum and the minimum noise level is large. However, implementing these idealized stepsizes requires explicit knowledge of the $\sigma_k$, which is unrealistic. The goal in the rest of the paper is to show that approximating the moment bound in an online fashion can also achieve a convergence rate comparable to the idealized setting.

4 Adaptive methods: Online estimation of noise

From now on, we assume that the moment bounds $\sigma_k$ are not given. To handle the non-stationarity, we estimate the noise intensity with an exponential moving average, a technique commonly used in adaptive methods. More precisely, the moment estimator $\hat{\sigma}_k^2$ is constructed recursively as

$\hat{\sigma}_k^2 = \beta\,\hat{\sigma}_{k-1}^2 + (1-\beta)\,\|g_k\|^2, \qquad$ (ExpMvAvg)

where $g_k$ is the $k$-th stochastic gradient and $\beta \in (0,1)$ is the decay parameter. We then choose the stepsize inversely proportional to $\hat{\sigma}_k$ (plus a correction constant $\delta$), leading to Algorithm 1.

Indeed, Algorithm 1 can be viewed as a "norm" version of RMSProp (Tieleman and Hinton, 2012): while RMSProp performs the exponential moving average coordinate-wise, we use the full norm of $g_k$ to update the moment estimator $\hat{\sigma}_k^2$. Such a norm-based simplification has also been used in the uniformly bounded noise setting (Levy, 2017; Ward et al., 2019; Li and Orabona, 2019; Levy et al., 2018); we leave the more refined coordinate-wise version as a topic of future research. Another important component of the stepsize is the correction constant $\delta$ in the denominator. This constant provides a safety threshold when $\hat{\sigma}_k$ underestimates the true noise level; such corrections are common in practical implementations of adaptive methods, and even beyond, e.g., as the so-called exploration bonus in reinforcement learning (Strehl and Littman, 2008; Azar et al., 2017; Jin et al., 2018a).

To show the convergence of the algorithm, we need to impose a regularity assumption on the sequence of noise intensities; otherwise, previous estimates may not provide any information about the next noise level.

1:  Initialize $x_0$ and the noise estimate $\hat{\sigma}^2 \leftarrow 0$.
2:  for $k = 0, 1, \dots, T-1$ do
3:     Evaluate the stochastic gradient $g_k$ at $x_k$.
4:     $\hat{\sigma}^2 \leftarrow \beta\,\hat{\sigma}^2 + (1-\beta)\,\|g_k\|^2$, as in (ExpMvAvg).
5:     $x_{k+1} = x_k - \frac{\eta}{\hat{\sigma} + \delta}\,g_k$.
6:  end for
7:  return the averaged iterate $\bar{x}_T$.
Algorithm 1 Adaptive SGD (stepsize scale $\eta$, decay $\beta$, correction constant $\delta$)
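To make the procedure concrete, here is a minimal Python sketch of Algorithm 1 as we read it; the hyperparameter values are placeholders rather than the choices analyzed in Theorem 3, and we use the freshly updated estimate in the stepsize (as RMSProp does), glossing over indexing details that the analysis may treat differently:

```python
import numpy as np

def adaptive_sgd(oracle, x0, T, eta=0.1, beta=0.99, delta=1e-3):
    """Norm-version RMSProp / adaptive SGD sketch.

    oracle(x, k) should return an unbiased stochastic gradient at x for iteration k,
    e.g. the NonstationaryOracle sketched in Section 2.
    """
    x = x0.copy()
    sigma2_hat = 0.0                      # running estimate of E||g_k||^2
    iterates = [x.copy()]
    for k in range(T):
        g = oracle(x, k)                  # stochastic gradient at iteration k
        sigma2_hat = beta * sigma2_hat + (1 - beta) * np.linalg.norm(g) ** 2
        x = x - eta / (np.sqrt(sigma2_hat) + delta) * g
        iterates.append(x.copy())
    # The analysis uses a weighted average of the iterates; a plain average
    # is used here purely for simplicity.
    return np.mean(iterates, axis=0)
```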
Assumption 2.

We assume that a uniform upper bound on the noise levels is given, i.e., $\sigma_{\max}$ such that $\sigma_k \le \sigma_{\max}$ for all $k$. Moreover,

  (a) the fourth moment of $g_k$ is bounded, namely $\mathbb{E}\big[\|g_k\|^4\big] \le c\,\sigma_k^4$ for a constant $c$;

  (b) the total variation of the noise levels is bounded:

  $\sum_{k=1}^{T-1} \big|\sigma_{k+1}^2 - \sigma_k^2\big| \;\le\; 2\,\sigma_{\max}^2. \qquad (4)$

The fourth moment assumption ensures concentration of $\|g_k\|^2$, which is necessary for a finite-sample analysis; in particular, it is satisfied when $g_k$ follows a sub-Gaussian distribution. The second assumption is less straightforward and deserves a broader discussion. A key role of the bounded variation is to rule out infinite oscillation, such as a pathological setting where the noise level alternates between its maximum and its minimum on consecutive iterations, in which case the total variation scales with the number of iterations $T$. We emphasize that the constant in (4) is not special and can easily be replaced by any other fixed constant. When $\sigma_k$ is increasing in the first half of the iterations and decreasing in the second half, as in the Transformer experiment and Example 1, the total variation bound (4) is satisfied. More generally, one can allow $m$ piece-wise monotone fragments by scaling the constant with $m$.
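The bounded-variation condition is easy to check numerically. The snippet below (assuming the total-variation form stated in (4), which is our reading of the assumption) contrasts a rise-then-fall schedule, which satisfies it with a small constant, against a rapidly oscillating one, which does not:

```python
import numpy as np

def total_variation(sigma2):
    """Sum of |sigma_{k+1}^2 - sigma_k^2| over the schedule."""
    return np.sum(np.abs(np.diff(sigma2)))

rise_then_fall = np.concatenate([np.linspace(0.0, 1.0, 500), np.linspace(1.0, 0.1, 500)])
oscillating = np.tile([1.0, 0.0], 500)

print(total_variation(rise_then_fall), rise_then_fall.max())  # stays within ~2 * max
print(total_variation(oscillating), oscillating.max())        # grows linearly with T
```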

With the above assumptions, we are now ready to present our convergence analysis.

Theorem 3.

Under Assumptions 1 and 2, for $T$ large enough, the iterates generated by Algorithm 1 with suitable choices of the parameters $\beta$ and $\delta$ satisfy, with probability at least a fixed constant, a suboptimality bound of the same form as the idealized baseline with each $\sigma_k$ replaced by $\sigma_k + \delta$, up to logarithmic factors (see Appendix C for the proof and the precise parameter choices).

Remark 4.

Our result directly implies a high-probability convergence rate by restarting the algorithm logarithmically many times; this introduces an additional logarithmic dependency in the complexity, as in standard high-probability results (Nemirovski et al., 2009; Jin et al., 2018b; Li and Orabona, 2019; Fang et al., 2018).

The key to proving the theorem is to effectively bound the estimation error of $\hat{\sigma}_k^2$, relying on concentration and on the bounded variation in Assumption 2. In particular, the choice of the decay parameter $\beta$ is critical, as it determines how fast the contribution of past gradients decays. Because of the non-stationarity of the noise, the online estimator is biased; the proposed choice of $\beta$ carefully balances the bias error and the variance error, leading to sublinear regret, see Appendix B.

Due to the correction constant $\delta$, the obtained convergence rate depends inversely on $\sigma_k + \delta$ rather than on the idealized $1/\sigma_k$ dependency. This additional term makes the comparison less straightforward, and we now discuss several scenarios to better understand it.

4.1 Discussion of the convergence rate

Table 1: Comparison of the convergence rates of the constant baseline, the adaptive method (Algorithm 1), and the idealized baseline under the noise model of Example 1.

To illustrate the difference between the convergence rates, we first consider the synthetic noise model introduced in Example 1. The detailed comparison is presented in Table 1, where we observe two regimes depending on the exponent of the noise model.

In both regimes, the adaptive method achieves a non-trivial improvement, polynomial in $T$, over the (constant baseline). Even though the improvement might seem insignificant, it is the first result showing a plausible non-trivial advantage of adaptive methods over SGD under non-stationary noise. Further, note that the adaptive convergence rate does not always match the (idealized baseline) when the noise variation is large. This discrepancy comes from the correction term $\delta$, which makes the stepsize more conservative than it should be, especially when the noise level $\sigma_k$ is small.

Note that the above comparison relies on the specific noise model of Example 1. We now formalize some simple conditions that allow comparison in more general settings.

Corollary 5.

If the ratio between the maximum and the minimum noise level grows sufficiently slowly with $T$, then the adaptive method converges at the same rate as the (idealized baseline), up to logarithmic factors.

This result is remarkable since the adaptive method does not require any knowledge of the $\sigma_k$, and yet it achieves the idealized rate. In other words, the exponential moving average estimator successfully adapts to the variation of the noise, allowing faster convergence than constant stepsizes.

Corollary 6.

Let $\bar{\sigma}^2 = \frac{1}{T}\sum_{k=0}^{T-1}\sigma_k^2$ be the average second moment. If the ratio between the maximum noise level and $\bar{\sigma}$ grows sufficiently slowly with $T$, then the adaptive method is no slower than the (constant baseline), up to logarithmic factors.

The condition in Corollary 6 is strictly weaker than that in Corollary 5, which means that even though an adaptive method may not match the idealized baseline, it can still be nontrivially better than the constant baseline. This happens, for example, in the second regime of Table 1, where the adaptive method is faster than the constant baseline but does not match the idealized one. Indeed, there is a limit to the improvement one can expect from our current analysis, quantified in Corollary 7 below.

Corollary 7.

Recall that $\sigma_{\max}$ is an upper bound on the noise levels, i.e., $\sigma_k \le \sigma_{\max}$ for all $k$. Therefore:

  1. The convergence rate of the constant baseline is no slower than $\frac{D\,\sigma_{\max}}{\sqrt{T}}$.

  2. The convergence rate of the adaptive method is no faster than the constant baseline improved by a factor polynomial in $T$, determined by the choice of $\delta$.

The order of the maximum improvement is determined by the specific choice of $\delta$ in Theorem 3. Indeed, the correction term $\delta$ is helpful when the estimator $\hat{\sigma}_k$ underestimates the true value $\sigma_k$, as it avoids the singularity at zero; hence the choice of $\delta$ is tied to the average deviation between $\hat{\sigma}_k$ and $\sigma_k$. Under a stronger concentration assumption, the maximum improvement can be strengthened further, as shown in Appendix D.

The noise model in Example 1 provides a favorable scenario where the maximum improvement is attained. However, in some scenarios, the convergence rate of an adaptive method can be slower than the constant baseline.

Adversarial scenario.

Suppose the noise level is constant at every iteration except one, where it spikes to a much larger value. Then both the constant and the idealized baselines retain their convergence rates, whereas the adaptive method converges more slowly: the abrupt change inflates the exponential moving average estimator, which then needs a non-negligible number of iterations to decay back to the constant level. Clearly, the estimator becomes less informative under such abrupt changes.

Overall, it is difficult to give a complete characterization of the possible noise variations. Corollaries 5 and 6 show that when the ratio between the maximum and the minimum (or average) second moment does not grow too fast, adaptive methods do improve upon SGD.

5 Extensions of Theorem 3

In this section, we discuss several extensions of Theorem 3. The results are nontrivial, but the analysis is largely the same; hence we defer the exact statements and proofs to the appendices.

Addressing the variance oracle

So far, we have focused on the noise oracle based on the second moment $\sigma_k^2$ and its connection with existing adaptive methods. However, this assumption has an unnatural aspect: it is hard to argue that $\sigma_k^2$ is iterate independent, since $\mathbb{E}\big[\|g_k\|^2\big] = \|\nabla f(x_k)\|^2 + \mathbb{E}\big[\|g_k - \nabla f(x_k)\|^2\big]$ depends on the gradient at the current iterate. Even though the influence of the gradient norm may be minor when the variance is high (as in Figure 1), it still affects the second moment. In contrast, the variance $s_k^2$ is an intrinsic quantity of the noise model and can plausibly be iterate independent; hence the variance oracle is theoretically more sound. We now present the modifications needed to adapt to the variance oracle.

First, to estimate the variance, we query two independent stochastic gradients $g_k$ and $g_k'$ at the same iterate and construct the estimator $\hat{s}_k^2$ via a recursion analogous to (ExpMvAvg). Second, a smoothness condition on $f$ is required, i.e., $L$-Lipschitz continuity of the gradient of $f$. In this case, it is necessary to ensure that the stepsize does not exceed $1/L$, which translates into an additional constraint on the correction constant $\delta$. Note that the $L$-smoothness condition is not required for the second moment oracle; this is why the second moment assumption is usually imposed in the non-differentiable setting (see Section 6.1 of Bubeck (2014)). The complete algorithm for the variance oracle is given in Algorithm 2, and the convergence results are essentially the same with $\sigma_k$ replaced by $s_k$; we refer to Appendix G for more details.
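A sketch of the variance estimator as we understand it is given below; the exact recursion used by the authors is not legible in the text, so this uses the natural unbiased point estimate $\tfrac{1}{2}\|g_k - g_k'\|^2$ built from the two independent gradients at the same iterate, combined with the same exponential moving average as before:

```python
import numpy as np

def update_variance_estimate(s2_hat, g, g_prime, beta=0.99):
    """One step of a variance estimator: EMA of 0.5 * ||g - g'||^2.

    g and g_prime are two independent stochastic gradients at the same iterate,
    so 0.5 * E||g - g'||^2 equals the variance of the gradient noise.
    """
    point_estimate = 0.5 * np.linalg.norm(g - g_prime) ** 2
    return beta * s2_hat + (1 - beta) * point_estimate
```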

Extension to nonconvex setting

We also extend our analysis to the nonconvex smooth setting, in which case we characterize convergence via the gradient norm $\|\nabla f(x_k)\|$, i.e., convergence to a stationary point. The conclusions are very similar to those in the convex setting; the results (Theorems 12 and 14) are deferred to Appendix E.

Variants on stepsizes

To go beyond the second moment of the noise, one can apply an estimator analogous to (ExpMvAvg) built from $\|g_k\|^p$ when the $p$-th moment of the gradient is bounded, yielding stepsizes inversely proportional to the estimated $p$-th root, in the spirit of Adamax (Kingma and Ba, 2014).
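A hedged sketch of this $p$-th moment variant (the exact form is not spelled out above, so the functions and parameters below are illustrative):

```python
import numpy as np

def update_pth_moment(est_p, g, p=3.0, beta=0.99):
    """EMA of ||g||^p; est_p stores the current estimate raised to the p-th power."""
    return beta * est_p + (1 - beta) * np.linalg.norm(g) ** p

def stepsize_from_estimate(est_p, p=3.0, eta=0.1, delta=1e-3):
    """Stepsize inversely proportional to the estimated p-th root, plus a correction."""
    return eta / (est_p ** (1.0 / p) + delta)
```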

Online estimator

Note that we have chosen the exponential moving average estimator because it is the most popular choice in practice, but other estimators could be used to approximate the noise level. The logarithmic factor in Lemma 8 is mainly due to the non-uniform accumulation of the error created by the exponential moving average. An alternative is a uniform sliding-window estimator that averages $\|g_j\|^2$ over the last $W$ iterations, where $W$ is a predefined window size; in that case, the logarithmic factor in Lemma 8 can be removed. However, implementing this uniform estimator requires memory proportional to $W$, so we choose the exponential moving average for its simple implementation.
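For comparison, a uniform sliding-window estimator can be sketched as follows (illustrative names; note the $O(W)$ memory for the buffer versus $O(1)$ for the exponential moving average):

```python
from collections import deque
import numpy as np

class WindowEstimator:
    """Uniform average of the last `window` squared gradient norms (O(window) memory)."""

    def __init__(self, window):
        self.buf = deque(maxlen=window)

    def update(self, g):
        self.buf.append(np.linalg.norm(g) ** 2)
        return np.mean(self.buf)
```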

6 Experiments

In this section, we describe two sets of experiments that verify the faster convergence of Algorithm 1 compared to vanilla SGD. The first experiment solves a linear regression problem with injected noise. The second trains an AWD-LSTM model (Merity et al., 2018) on the PTB dataset for language modeling.

Figure 2: Left: the injected noise level over iterations. Middle: average loss trajectory over 10 runs for four algorithms: the standard baseline, the idealized baseline, Algorithm 1, and Algorithm 2. The comparison between the idealized and standard baselines confirms that adapting stepsizes inversely to the noise level leads to faster convergence and smaller variance. Right: average and standard deviation of the function suboptimality, normalized by the average MSE of the standard baseline.

6.1 Synthetic experiments

In the synthetic experiment, we generate a random linear regression dataset using the sklearn library (scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html). The stochastic oracle returns the full gradient with injected Gaussian noise, whose coordinate-wise standard deviation is shown in the left panel of Figure 2. We then run the four algorithms discussed in this work: the standard baseline, the idealized baseline, Algorithm 1, and Algorithm 2. We fine-tune the stepsize of each algorithm by grid search over an exponentially spaced grid, repeat each experiment for 10 runs, and report the average training trajectory as well as the function suboptimality in Figure 2. We observe that the performance ranks as follows: idealized baseline, Algorithm 2, Algorithm 1, standard baseline.
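A rough sketch of this synthetic setup is given below; the noise schedule, stepsize, and other hyperparameters are hand-picked placeholders rather than the values used for Figure 2:

```python
import numpy as np
from sklearn.datasets import make_regression

A, b = make_regression(n_samples=500, n_features=20, noise=1.0, random_state=0)
x = np.zeros(A.shape[1])
T = 2000
# Rise-then-fall schedule for the injected Gaussian noise (illustrative).
sigma = np.concatenate([np.linspace(0.1, 5.0, T // 2), np.linspace(5.0, 0.5, T - T // 2)])

rng = np.random.default_rng(0)
sigma2_hat, eta, beta, delta = 0.0, 0.5, 0.99, 1e-3    # hand-picked placeholders
for k in range(T):
    grad = 2.0 * A.T @ (A @ x - b) / len(b)            # full gradient of the MSE
    g = grad + rng.normal(scale=sigma[k], size=x.size) # oracle: gradient + injected noise
    sigma2_hat = beta * sigma2_hat + (1 - beta) * np.linalg.norm(g) ** 2
    x -= eta / (np.sqrt(sigma2_hat) + delta) * g       # adaptive stepsize, as in Alg 1
print(np.mean((A @ x - b) ** 2))                       # final training MSE
```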

6.2 Neural Network training

We demonstrate how the proposed algorithm performs in real-world neural network training. We implement our algorithm in the AWD-LSTM codebase described in Merity et al. (2018). The original codebase trains the network using clipped gradient descent followed by averaged SGD (ASGD) to prevent overfitting. As generalization error is beyond our scope, we focus on the first phase (which takes about 200 epochs) by removing the ASGD training part. We see from Figure 3 that our proposed algorithm achieves performance similar to the fine-tuned clipped SGD baseline provided by Merity et al. (2018). This confirms that Algorithm 1 is a practical algorithm. However, further exploration on more state-of-the-art architectures is required to conclude its effectiveness.

Figure 3: Training loss and validation loss for LSTM language modelling from Merity et al. (2018). The baseline is provided by the authors; the other curve corresponds to Algorithm 1.

7 Conclusions

This paper studies convergence rates of stochastic gradient methods in an empirically motivated setting where the noise level changes over iterations. We show that, under mild assumptions, one can converge faster than fixed-step SGD by a factor polynomial in the number of iterations by applying online noise estimation and adaptive stepsizes. Our analysis therefore provides one explanation for the recent success of adaptive methods in neural network training.

There is much more to be done along the line of non-stationary stochastic optimization. Under our current analysis, there is a gap between the adaptive method and the idealized method when the noise variation is large (see second row in Table 1). A natural question to ask is whether one could reduce this gap, or alternatively, is there any threshold preventing the adaptive method from getting arbitrarily close to the idealized baseline? Moreover, could one attain further acceleration by combining momentum or coordinate-wise update techniques? Answering these questions would provide more insight and lead to a better understanding of adaptive methods.

Perhaps a more fundamental question concerns iterate dependency: the setting where the second moments $\sigma_k^2$ or the variances $s_k^2$ are functions of the current iterate $x_k$, not just of the iteration index $k$. Significant effort is needed to address this additional correlation under appropriate regularity conditions. We believe our work lays the foundation for addressing this challenging research problem.

8 Acknowledgement

This work was partially supported by the MIT-IBM Watson AI Lab. SS and JZ also acknowledge support from NSF CAREER grant Number 1846088.

References

  • A. Agarwal, M. J. Wainwright, P. L. Bartlett, and P. K. Ravikumar (2009) Information-theoretic lower bounds on the oracle complexity of convex optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: §1.
  • N. Agarwal, B. Bullins, X. Chen, E. Hazan, K. Singh, C. Zhang, and Y. Zhang (2018) The case for full-matrix adaptive regularization. arXiv preprint arXiv:1806.02958. Cited by: §1.
  • Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth (2019) Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365. Cited by: §1.
  • M. G. Azar, I. Osband, and R. Munos (2017) Minimax regret bounds for reinforcement learning. In International Conferences on Machine Learning (ICML), Cited by: §4.
  • L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. Siam Review 60 (2), pp. 223–311. Cited by: §2.
  • S. Bubeck (2014) Convex optimization: algorithms and complexity. arXiv preprint arXiv:1405.4980. Cited by: Appendix E, §5.
  • X. Chen, S. Liu, R. Sun, and M. Hong (2019) On the convergence of a class of ADAM-type algorithms for non-convex optimization. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • Y. Drori and O. Shamir (2019) The complexity of finding stationary points with stochastic gradient descent. arXiv preprint arXiv:1910.01845. Cited by: §1.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §1.
  • C. Fang, C. J. Li, Z. Lin, and T. Zhang (2018) SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: Remark 4.
  • S. Gadat and F. Panloup (2017) Optimal non-asymptotic bound of the Ruppert-Polyak averaging without strong convexity. arXiv preprint arXiv:1709.03342. Cited by: §2.
  • S. Ghadimi and G. Lan (2012) Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: a generic algorithmic framework. SIAM Journal on Optimization 22 (4), pp. 1469–1492. Cited by: §1.
  • S. Ghadimi and G. Lan (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. Cited by: §1.
  • H. Huang, C. Wang, and B. Dong (2019) Nostalgic Adam: weighting more of the past gradients when designing the adaptive learning rate. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Cited by: §1.
  • C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan (2018a) Is q-learning provably efficient?. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: §4.
  • C. Jin, P. Netrapalli, and M. I. Jordan (2018b) Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference On Learning Theory, pp. 1042–1085. Cited by: Remark 4.
  • A. Jofré and P. Thompson (2019) On variance reduction for stochastic smooth convex optimization with multiplicative noise. Mathematical Programming 174 (1-2), pp. 253–292. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §1, §5.
  • K. Y. Levy, A. Yurtsever, and V. Cevher (2018) Online adaptive methods, universality and acceleration. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: §1, §4.
  • K. Levy (2017) Online to offline conversions, universality and adaptive minibatch sizes. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: §1, §4.
  • X. Li and F. Orabona (2019) On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, Cited by: §1, §4, Remark 4.
  • L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2019) On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. Cited by: §1.
  • M. Liu, Y. Mroueh, J. Ross, W. Zhang, X. Cui, P. Das, and T. Yang (2020) Towards better understanding of adaptive gradient algorithms in generative adversarial nets. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • J. Ma and D. Yarats (2019) On the adequacy of untuned warmup for adaptive optimization. arXiv preprint arXiv:1910.04209. Cited by: §1.
  • S. Merity, N. S. Keskar, and R. Socher (2018) Regularizing and optimizing LSTM language models. In International Conference on Learning Representations (ICLR), Cited by: Figure 3, §6.2, §6.
  • E. Moulines and F. R. Bach (2011) Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: §2.
  • A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009) Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19 (4), pp. 1574–1609. Cited by: §3, Remark 4.
  • A. S. Nemirovsky and D. B. Yudin (1983) Problem complexity and method efficiency in optimization.. Wiley. Cited by: §1.
  • A. Rakhlin, O. Shamir, and K. Sridharan (2012) Making gradient descent optimal for strongly convex stochastic optimization. In International Conferences on Machine Learning (ICML), Cited by: §1.
  • L. Rosasco, S. Villa, and B. C. Vũ (2019) Convergence of stochastic proximal gradient algorithm. Applied Mathematics & Optimization, pp. 1–27. Cited by: §2.
  • M. Staib, S. J. Reddi, S. Kale, S. Kumar, and S. Sra (2019) Escaping saddle points with adaptive gradient methods. In International Conferences on Machine Learning (ICML), Cited by: §1.
  • A. L. Strehl and M. L. Littman (2008) An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences 74 (8), pp. 1309–1331. Cited by: §4.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: Stochastic Optimization with Non-stationary Noise, §1, §1, §2, §4.
  • R. Ward, X. Wu, and L. Bottou (2019) AdaGrad stepsizes: sharp convergence over nonconvex landscapes. In International Conferences on Machine Learning (ICML), Cited by: §1, §4.
  • J. Zhang, T. He, S. Sra, and A. Jadbabaie (2020) Why gradient clipping accelerates training: a theoretical justification for adaptivity. In International Conference on Learning Representations, Cited by: §1.
  • J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. J. Reddi, S. Kumar, and S. Sra (2019) Why Adam beats SGD for attention models. arXiv preprint arXiv:1912.03194. Cited by: §1.
  • D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu (2018) On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671. Cited by: §1.
  • Z. Zhou, Q. Zhang, G. Lu, H. Wang, W. Zhang, and Y. Yu (2019) AdaShift: decorrelation and convergence of adaptive learning rate methods. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu (2019) A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11127–11135. Cited by: §1.
  • F. Zou and L. Shen (2018) On the convergence of weighted adagrad with momentum for training deep neural networks. arXiv preprint arXiv:1808.03408. Cited by: §1.

Appendix A Proof of Theorem 1

Proof.

For the update rule (2), the iterate suboptimality satisfies

$\|x_{k+1} - x^*\|^2 = \|x_k - x^*\|^2 - 2\eta_k\langle g_k, x_k - x^*\rangle + \eta_k^2\|g_k\|^2.$

Rearranging and taking the expectation with respect to $g_k$ conditioned on $x_k$, and using $\mathbb{E}[g_k] = \nabla f(x_k)$ together with $\mathbb{E}\big[\|g_k\|^2\big] \le \sigma_k^2$, we have

$2\eta_k\,\mathbb{E}\big[\langle \nabla f(x_k), x_k - x^*\rangle\big] \le \|x_k - x^*\|^2 - \mathbb{E}\big[\|x_{k+1} - x^*\|^2\big] + \eta_k^2\sigma_k^2.$

Summing over $k = 0, \dots, T-1$ and taking total expectations, we get

$\sum_{k=0}^{T-1} 2\eta_k\,\mathbb{E}\big[\langle \nabla f(x_k), x_k - x^*\rangle\big] \le \|x_0 - x^*\|^2 + \sum_{k=0}^{T-1}\eta_k^2\sigma_k^2.$

Then, from convexity, $f(x_k) - f^* \le \langle \nabla f(x_k), x_k - x^*\rangle$, and hence

$\mathbb{E}\big[f(\bar{x}_T) - f^*\big] \le \frac{\sum_{k}\eta_k\,\mathbb{E}\big[f(x_k) - f^*\big]}{\sum_k \eta_k} \le \frac{\|x_0 - x^*\|^2 + \sum_k\eta_k^2\sigma_k^2}{2\sum_k\eta_k},$

where $\bar{x}_T = \frac{\sum_k \eta_k x_k}{\sum_k \eta_k}$. Corollary 2 follows by plugging in the particular choices of the stepsizes.

Appendix B Key Lemma

Lemma 8.

Under Assumption 2, and with a suitable choice of the decay parameter $\beta$, the total estimation error $\sum_{k=0}^{T-1}\mathbb{E}\big|\hat{\sigma}_k^2 - \sigma_k^2\big|$ of the estimator (ExpMvAvg) is sublinear in $T$, up to a logarithmic factor.

Proof.

On a high level, we decompose the error into a bias term and a variance term. We use the total variation assumption to bound the bias term and the exponential moving average to reduce the variance; we then pick $\beta$ to balance the two terms.

By the triangle inequality, the estimation error decomposes into a bias term and a variance term:

(5)

We first bound the bias term. By the definition of $\hat{\sigma}_k^2$ in (ExpMvAvg), we have

Hence, by recursion,

Therefore, the bias term can be bounded by

The first inequality follows from the triangle inequality, and the third inequality uses the geometric sum over the decay weights. To bound the variance term, we remark that

Hence, from the independence of the gradients, we have

where the final quantity is an upper bound on the variance. The first inequality follows from Jensen's inequality, the second equality uses the conditional independence of the gradients, and the last inequality follows from Assumption 2.

We distinguish two cases. In the first case, we simply bound the coefficient by 1, i.e.,

In the second case, we have

The second inequality follows from an elementary bound. Therefore, in this case,

We then substitute the above bound into the variance term in (5).

Summing up the variance term and the bias term yields,

(6)

Taking the choice of $\beta$ stated in the lemma yields

(7)

Appendix C Proof of Theorem 3

On a high level, the difference between the adaptive stepsize and the idealized stepsize is controlled by the estimation error $|\hat{\sigma}_k^2 - \sigma_k^2|$, which has sublinear regret according to Lemma 8. We then carefully integrate this regret bound to control the deviation from the idealized algorithm, reaching the conclusion.

Proof.

By the update rule of , we have,

Noting the independence of the stepsize from the fresh randomness, and taking the expectation conditioned on the past iterates, this leads to

Recalling the definitions above, taking expectations and summing over the iterations, we get

Hence by Markov’s inequality, with probability at least ,

(8)

We can now upper bound the right-hand side; indeed,

(9)

The last inequality follows from the parameter choice, which ensures that

Hence, from Eq. (8), we have with probability at least ,

(10)

Next, by denoting , we lower bound the left hand side,