1 Introduction
First-order optimization methods play a key role in solving large-scale machine learning problems due to their low iteration complexity and scalability to large data sets. In many cases, these methods operate with noisy first-order information, either because the gradient is estimated from random draws or from a subset of the components of the underlying objective function Bach and Moulines (2011); Cohen et al. (2018); Flammarion and Bach (2015); Ghadimi and Lan (2012, 2013); Jain et al. (2018); Vaswani et al. (2018); d'Aspremont (2008); Devolder et al. (2014), or because noise is injected intentionally for privacy or algorithmic considerations Bassily et al. (2014); Neelakantan et al. (2015); Raginsky et al. (2017); Gao et al. (2018a, b). A fundamental question in this setting is how to design fast algorithms with optimal convergence rate, matching the lower bounds in terms of target accuracy and other important parameters in both the deterministic and stochastic cases (i.e., with or without gradient errors). In this paper, we design an optimal first-order method to solve the problem
(1) $\min_{x \in \mathbb{R}^d} f(x)$
where, for scalars $0 < \mu \le L$, $\mathcal{S}_{\mu,L}(\mathbb{R}^d)$ is the set of continuously differentiable functions that are strongly convex with modulus $\mu$ and have Lipschitz-continuous gradients with constant $L$, which implies that every $f \in \mathcal{S}_{\mu,L}(\mathbb{R}^d)$ satisfies, for all $x, y \in \mathbb{R}^d$ (see e.g. Nesterov (2004)),
(2a) $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2,$
(2b) $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2.$
We denote the optimal value of problem (1) by $f^*$, which is attained at the unique optimal point $x^*$. Also, the ratio $\kappa = L/\mu$ is called the condition number of $f$.
We assume that the gradient information is available through a stochastic oracle which, at each iteration $k$, given the current iterate, provides the noisy gradient $\nabla f(x_k) + w_k$, where $\{w_k\}$ is a sequence of independent random variables such that for all $k$:
(3a) $\mathbb{E}[w_k] = 0,$
(3b) $\mathbb{E}\left[\|w_k\|^2\right] \le \sigma^2.$
This oracle model is commonly considered in the literature (see e.g. Ghadimi and Lan (2012, 2013); Bubeck et al. (2015)), and is more general than the additive noise model where the gradient is corrupted by additive stochastic noise, see e.g. Cohen et al. (2018); Bassily et al. (2014).
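As an illustration, the oracle model in (3) can be simulated with Gaussian noise; the Gaussian choice, the helper name `make_oracle`, and the per-coordinate scaling are our own illustrative assumptions, since the model only imposes the two moment conditions.

```python
import numpy as np

def make_oracle(grad, sigma, seed=0):
    """Stochastic first-order oracle in the spirit of (3): returns
    grad(x) + w with E[w] = 0 and E[||w||^2] = sigma^2.
    Gaussian noise is one admissible instance of the model."""
    rng = np.random.default_rng(seed)
    def oracle(x):
        # Scale per-coordinate noise so that E[||w||^2] equals sigma^2.
        w = (sigma / np.sqrt(x.size)) * rng.standard_normal(x.shape)
        return grad(x) + w
    return oracle
```

Any other zero-mean noise with bounded second moment (e.g., bounded uniform noise, or mini-batch subsampling error) fits the same interface.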
In this setting, the performance of many algorithms is characterized by the expected error of the iterates (in terms of suboptimality in function values), which admits a bound given as the sum of two terms: a bias term that captures the decay of the initialization error and is independent of the noise parameter $\sigma$, and a variance term that depends on $\sigma$ and is independent of the initial point $x_0$. A lower bound on the bias term follows from the seminal work of Nemirovsky and Yudin (1983), which showed that without noise ($\sigma = 0$) and after $k$ iterations, the expected function suboptimality cannot be smaller than¹
(4) $\mathbb{E}[f(x_k)] - f^* \;\ge\; \frac{\mu}{2}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2k}\|x_0 - x^*\|^2.$

¹This lower bound is shown with an additional assumption on the class of first-order methods considered.
With noise, Raginsky and Rakhlin (2011) provided the following (much larger) lower bound² on the function suboptimality, which also provides a lower bound on the variance term:
(5) $\mathbb{E}[f(x_k)] - f^* \;=\; \Omega\!\left(\frac{\sigma^2}{\mu k}\right).$

²The authors show this result for a particular choice of the problem parameters; nonetheless, it can be generalized by scaling the problem parameters properly.
Several algorithms have been proposed in the recent literature attempting to achieve these lower bounds.³ Xiao (2010) obtains performance guarantees in expected suboptimality for an accelerated version of the dual averaging method. Dieuleveut et al. (2017) consider quadratic objective functions and develop an algorithm with averaging that achieves an improved error bound. Hu et al. (2009) consider general strongly convex and smooth functions and achieve an error bound with a similar dependence under the assumption of bounded noise. Ghadimi and Lan (2012) and Chen et al. (2012) extend this result to the noise model in (3) by introducing the accelerated stochastic approximation algorithm (ACSA) and the optimal regularized dual averaging algorithm (ORDA), respectively. Both ACSA and ORDA have multistage versions, presented in Ghadimi and Lan (2013) and Chen et al. (2012), in which the authors improve the bias term to the optimal rate by exploiting knowledge of the noise magnitude and the optimality gap, i.e., an upper bound on $f(x_0) - f^*$, in the operation of the algorithm. Another closely related paper is Cohen et al. (2018), which proposed AGD+ and showed, under the additive noise model, that it admits a family of error bounds whose constants grow with a tunable parameter; in particular, a suitable choice of that parameter achieves the variance lower bound (5).

³Here we review the error bounds of these methods after $k$ iterations, highlighting the dependence on the noise magnitude, the iteration count, and the initial point $x_0$, while suppressing the dependence on $\mu$ and $L$.
In this paper, we introduce the class of Multistage Accelerated Stochastic Gradient (MASG) methods, which are universally optimal: they achieve the lower bounds both in the noiseless deterministic case and in the noisy stochastic case, up to absolute constants. MASG proceeds in stages, each running a stochastic version of Nesterov's accelerated method Nesterov (2004) with a specific restart and parameterization. Given an arbitrary length and constant stepsize for the first stage, and geometrically growing lengths and shrinking stepsizes for the following stages, we first provide a general convergence rate result for MASG (see Theorem 3.4). Given the computational budget $N$, a specific choice for the length of the first stage is shown to achieve the optimal error bound without requiring knowledge of the noise bound or the initial optimality gap (see Corollary 3.8). To the best of our knowledge, this is the first algorithm that achieves such a lower bound under these informational assumptions.
In Table 1, we provide a comparison of our algorithm with other algorithms in terms of required assumptions and optimality of their results in both bias and variance terms. In particular, we consider ACSA Ghadimi and Lan (2012), Multistage ACSA Ghadimi and Lan (2013), ORDA and Multistage ORDA Chen et al. (2012), and the algorithm proposed in Cohen et al. (2018).
Algorithm | Requires σ² | Requires Δ | Requires N or ε | Opt. Bias | Opt. Var.
ACSA | ✗ | ✗ | ✗ | ✗ | ✓
Multi. ACSA | ✓ | ✓ | ✗ | ✓ | ✓
ORDA | ✗ | ✗ | ✗ | ✗ | ✓
Multi. ORDA | ✓ | ✓ | ✗ | ✓ | ✓
Cohen et al. | ✗ | ✗ | ✗ | ✗ | ✓
MASG (with parameters in Corollary 3.7) | ✗ | ✗ | ✗ | ✗ | ✓
MASG (with parameters in Corollary 3.8) | ✗ | ✗ | ✓ [N] | ✓ | ✓
MASG (with parameters in Corollary 3.9) | ✗ | ✓ | ✓ [ε] | ✓ | ✓
Our paper builds on an analysis of Nesterov's accelerated stochastic method with a specific momentum parameter, presented in Section 2, which may be of independent interest. This analysis follows from a dynamical-system representation and study of first-order methods, which has recently gained attention in the literature Lessard et al. (2016); Hu and Lessard (2017); Aybat et al. (2018). In Section 3, we present the MASG algorithm and characterize its behavior under different assumptions, as summarized in Table 1. In particular, we show that it achieves the optimal convergence rate with a given budget of $N$ iterations. In Section 4, we show how additional information, such as the noise magnitude and the initial optimality gap, can be leveraged in our framework to improve practical performance. Finally, in Section 5, we provide numerical results comparing our algorithm with some of the most recent methods in the literature.
1.1 Preliminaries and Notation
Let $I_d$ and $0_d$ represent the $d \times d$ identity and zero matrices. For a matrix $A$, $\mathrm{tr}(A)$ and $\det(A)$ denote the trace and determinant of $A$, respectively. Also, for scalars $i \le j$ and $k \le l$, we use $A_{[i:j,\,k:l]}$ to denote the submatrix of $A$ formed by rows $i$ to $j$ and columns $k$ to $l$. We use the superscript $\top$ to denote the transpose of a vector or a matrix, depending on the context. Throughout this paper, all vectors are represented as column vectors. Let $\mathbb{S}^d_+$ denote the set of all $d \times d$ symmetric positive semidefinite matrices. For two matrices $A$ and $B$, their Kronecker product is denoted by $A \otimes B$. For scalars $0 < \mu \le L$, $\mathcal{S}_{\mu,L}(\mathbb{R}^d)$ is the set of continuously differentiable functions that are strongly convex with modulus $\mu$ and have Lipschitz-continuous gradients with constant $L$.

2 Modeling Accelerated Gradient Method as a Dynamical System
In this section, we study Nesterov's Accelerated Stochastic Gradient method (ASG) Nesterov (2004) with the stochastic first-order oracle in (3):
(6a) $x_{k+1} = y_k - \alpha\left(\nabla f(y_k) + w_k\right),$
(6b) $y_{k+1} = (1+\beta)\,x_{k+1} - \beta\,x_k,$
where $\alpha$ is the stepsize and $\beta$ is the momentum parameter. It is worth noting that Nesterov's analysis in Nesterov (2004) for (6) applies to particular choices of $\alpha$ and $\beta$; more importantly, it does not guarantee convergence for other stepsizes. The momentum parameter $\beta = \frac{1-\sqrt{\alpha\mu}}{1+\sqrt{\alpha\mu}}$ has also been studied in other papers in the literature, e.g., Nitanda (2014); Wai et al. (2018); Shi et al. (2018). Here, in the following lemma, we provide a new motivation for it by showing that, for quadratic functions in the noiseless setting, this momentum parameter achieves the fastest asymptotic convergence rate for a fixed stepsize $\alpha$. The proof of this lemma is provided in Appendix A.
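For concreteness, a minimal sketch of the recursion (6), assuming the momentum choice $\beta = (1-\sqrt{\alpha\mu})/(1+\sqrt{\alpha\mu})$ discussed above; the function name and interface are illustrative, not part of the paper.

```python
import numpy as np

def asg(grad, x0, alpha, mu, n_iters, noise_std=0.0, seed=0):
    """Stochastic Nesterov accelerated gradient (ASG), a sketch of (6).

    Uses the momentum beta = (1 - sqrt(alpha*mu)) / (1 + sqrt(alpha*mu))
    motivated by Lemma 2.1; noise_std controls the oracle noise."""
    rng = np.random.default_rng(seed)
    beta = (1 - np.sqrt(alpha * mu)) / (1 + np.sqrt(alpha * mu))
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        y = x + beta * (x - x_prev)                    # extrapolation (6b)
        g = grad(y) + noise_std * rng.standard_normal(y.shape)
        x_prev, x = x, y - alpha * g                   # gradient step (6a)
    return x
```

With $\sigma = 0$ and $\alpha = 1/L$ this reduces to the deterministic accelerated method analyzed in Lemma 2.1.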
Lemma 2.1.
Let $f$ be a strongly convex quadratic function of the form $f(x) = \frac{1}{2}x^\top Q x - b^\top x$, where $Q$ is a $d$ by $d$ symmetric positive definite matrix with all its eigenvalues in the interval $[\mu, L]$. Consider the deterministic ASG iterations, i.e., $w_k \equiv 0$, as shown in (6), with constant stepsize $\alpha \le 1/L$. Then the fastest asymptotic convergence rate, i.e., the smallest $\rho$ that satisfies $\|x_k - x^*\| \le (\rho + \epsilon_k)^k\,\|x_0 - x^*\|$ for some nonnegative sequence $\{\epsilon_k\}$ going to zero, is $\rho = 1 - \sqrt{\alpha\mu}$, and it is achieved by $\beta = \frac{1-\sqrt{\alpha\mu}}{1+\sqrt{\alpha\mu}}$. As a consequence, for this choice of $\beta$, the iterates converge linearly to $x^*$ with rate $\rho = 1 - \sqrt{\alpha\mu}$.
Our analysis builds on the reformulation of a first-order optimization algorithm as a linear dynamical system. Following Lessard et al. (2016); Hu and Lessard (2017), we write the ASG iterations as
(7a) $\xi_{k+1} = A\,\xi_k + B\left(\nabla f(y_k) + w_k\right),$
(7b) $y_k = C\,\xi_k,$
where $\xi_k = [x_k^\top,\ x_{k-1}^\top]^\top$ is the state vector and $A$, $B$, and $C$ are system matrices with appropriate dimensions, defined as the Kronecker products $A = \hat{A} \otimes I_d$, $B = \hat{B} \otimes I_d$, and $C = \hat{C} \otimes I_d$ with
(8) $\hat{A} = \begin{bmatrix} 1+\beta & -\beta \\ 1 & 0 \end{bmatrix}, \qquad \hat{B} = \begin{bmatrix} -\alpha \\ 0 \end{bmatrix}, \qquad \hat{C} = \begin{bmatrix} 1+\beta & -\beta \end{bmatrix}.$
We can also relate the state $\xi_k$ to the iterate $x_k$ in a linear fashion through the identity $x_k = \left([1\ \ 0] \otimes I_d\right)\xi_k$.
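As a sanity check, the state-space form (7)-(8) can be instantiated numerically; the block layout of the state as $(x_k;\ x_{k-1})$ is our assumed convention, which the paper's exact formulation may order differently.

```python
import numpy as np

def asg_state_matrices(alpha, beta, d):
    """System matrices for one ASG step viewed as a linear system:
    state xi_k = (x_k; x_{k-1}), input u_k = grad f(y_k) + w_k,
    xi_{k+1} = A xi_k + B u_k,  y_k = C xi_k  (sketch of (7)-(8))."""
    A_hat = np.array([[1 + beta, -beta],
                      [1.0, 0.0]])
    B_hat = np.array([[-alpha], [0.0]])
    C_hat = np.array([[1 + beta, -beta]])
    I = np.eye(d)
    return np.kron(A_hat, I), np.kron(B_hat, I), np.kron(C_hat, I)
```

One application of these matrices reproduces exactly one step of the recursion (6).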
We study the evolution of the ASG method through the following Lyapunov function, which also arises in the study of deterministic accelerated gradient methods:
(9) $V_P(\xi_k) = (\xi_k - \xi^*)^\top P\,(\xi_k - \xi^*) + f(x_k) - f^*,$
where $P$ is a symmetric positive semidefinite matrix and $\xi^* = [x^{*\top},\ x^{*\top}]^\top$.
where is a symmetric positive semidefinite matrix. In particular, we first state the following lemma which can be derived by a simple adaptation of the proof of Proposition 4.6 in Aybat et al. (2018) to our setting where the noise assumption is less restrictive. Its proof can be found in Appendix B.
Lemma 2.2.
Let $f \in \mathcal{S}_{\mu,L}(\mathbb{R}^d)$. Consider the ASG iterations given by (6). Assume there exist $\rho \in [0,1)$ and a symmetric positive semidefinite matrix $P$, possibly depending on $\rho$, such that
(10) 
where
Let . Then, for every ,
(11) 
We use this lemma to derive the following theorem, which characterizes the behavior of the ASG method for the stepsize $\alpha \le 1/L$ and momentum $\beta = \frac{1-\sqrt{\alpha\mu}}{1+\sqrt{\alpha\mu}}$ (see the proof in Appendix C).
Theorem 2.3.
3 A Class of Multistage ASG Algorithms
In this section, we introduce a class of multistage ASG algorithms, presented in Algorithm 1, which we denote by MASG. The main idea is to run ASG with properly chosen parameters at each stage, for a given number of stages. In addition, each stage depends on the previous one: the first two initial iterates of a new stage are both set to the last iterate of the previous stage.
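The stage structure can be sketched as follows; the doubling stage lengths and geometrically shrinking stepsizes mirror the description above, but the exact constants are illustrative rather than the parameters of Theorem 3.4.

```python
import numpy as np

def masg(grad, x0, mu, L, n1, n_stages, noise_std=0.0, seed=0):
    """Multistage ASG sketch: stage k runs ASG with a stage-specific
    stepsize, restarting with both initial iterates set to the last
    iterate of the previous stage.  After stage 1, stage lengths double
    and stepsizes shrink geometrically (illustrative constants)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(1, n_stages + 1):
        alpha = 1.0 / L if k == 1 else 1.0 / (L * 4 ** (k - 1))
        n_k = n1 if k == 1 else n1 * 2 ** (k - 1)
        beta = (1 - np.sqrt(alpha * mu)) / (1 + np.sqrt(alpha * mu))
        x_prev = x                      # restart: both initial iterates equal
        for _ in range(n_k):
            y = x + beta * (x - x_prev)
            g = grad(y) + noise_std * rng.standard_normal(y.shape)
            x_prev, x = x, y - alpha * g
    return x
```

With noise, the shrinking stepsizes damp the variance term in later stages; without noise, the first stage alone already contracts linearly.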
To analyze Algorithm 1, we first characterize, in the following theorem, the evolution of the iterates within one stage through the Lyapunov function in (9). The details of the proof are provided in Appendix D.
Theorem 3.1.
We use this result to choose a stepsize, given the number of iterations, that achieves an approximately optimal decay in the variance term; this yields the following corollary for the MASG algorithm run with a single stage. Its proof can be found in Appendix E.
Corollary 3.2.
Let $f \in \mathcal{S}_{\mu,L}(\mathbb{R}^d)$. Consider running MASG, i.e., Algorithm 1, for only one stage with $n$ iterations and a stepsize chosen in terms of a scalar parameter. Then,
(15) 
provided that .
For subsequent analysis, we define the state vector of each stage accordingly, where $K$ denotes the number of stages. We analyze the performance of each stage with respect to a stage-dependent Lyapunov function. The following lemma relates the performance bounds for consecutive choices of Lyapunov functions, building on our specific restarting mechanism.
Lemma 3.3.
Let $f \in \mathcal{S}_{\mu,L}(\mathbb{R}^d)$. Consider MASG, i.e., Algorithm 1. Then, for every stage $k$,
(16) 
Proof.
The proof can be found in Appendix F. ∎
Now, we are ready to state and prove the main result of the paper (see proof in Appendix G):
Theorem 3.4.
Let $f \in \mathcal{S}_{\mu,L}(\mathbb{R}^d)$. Consider running MASG, i.e., Algorithm 1, with the following parameters:
for any choice of the first-stage length and stepsize. The last iterate of each stage then satisfies the following bound for all stages $k \ge 1$:
(17) 
We next define $N_k$ as the number of iterations needed to run MASG for $k$ stages, i.e.,
(18) 
Note for and with parameters given in Theorem 3.4,
(19) 
Also, we define sequence such that is the iterate generated by MASG algorithm at the end of gradient steps for , i.e., , for , and for we set where and .
Remark 3.5.
In the absence of noise, i.e., $\sigma = 0$, the result of Theorem 3.4 recovers the linear convergence rate of deterministic accelerated gradient methods as a special case. Indeed, running MASG for only one stage comprising all iterations guarantees a linearly decaying bound on the expected suboptimality.
The next theorem characterizes the behavior of MASG after running for $N$ iterations with the parameters in the preceding theorem; its proof is provided in Appendix H.
Theorem 3.6.
Using Theorem 3.6, as stated in the following corollary, we can obtain a convergence rate result similar to ACSA Ghadimi and Lan (2012) and ORDA Chen et al. (2012) without assuming any knowledge of the noise magnitude or the initial optimality gap (see Appendix I for the proof).
Corollary 3.7.
Under the premise of Theorem 3.6, choosing the first-stage length appropriately, the suboptimality error of MASG after $N$ iterations admits the following upper bound:
We continue this section by pointing out a few important special cases of our result. We first show in the next corollary how our algorithm is universally optimal and capable of achieving the lower bounds (4) and (5) simultaneously.
Corollary 3.8.
Proof.
The proof is straightforward by using (20) and noting that . ∎
Note that the bounds in Theorems 3.4 and 3.6 and in Corollaries 3.7 and 3.8 can be seen as the sum of two separate bias and variance terms.
The lower bound can also be stated as the minimum number of iterations needed to find an $\epsilon$-solution, i.e., to find $\bar{x}$ such that $\mathbb{E}[f(\bar{x})] - f^* \le \epsilon$, for any given $\epsilon > 0$. In the following corollary, under the additional assumption that a bound on the initial optimality gap is known, we state this version of the lower bound. The proof is provided in Appendix J.
Corollary 3.9.
Recall that we presented a comparison of different versions of MASG with other state-of-the-art algorithms in Table 1. In particular, this table shows that Multistage ACSA Ghadimi and Lan (2013) and Multistage ORDA Chen et al. (2012) also achieve the lower bounds, provided that the noise parameters are known; note that we do not make this extra assumption for MASG. In the next remark, we compare MASG with these two algorithms from another perspective.
Remark 3.10.
In addition to the desirable property that the MASG parameters are independent of the noise magnitude, the MASG iteration complexity bound in (21) has a better constant in front of the variance term, which is the dominant term of the bound, compared to the bounds provided for Multistage ACSA and Multistage ORDA. In fact, while our constant is less than 50, the constants in Multistage ACSA and Multistage ORDA are 384 and 1024, respectively.
Finally, we conclude this section by taking a closer look at how MASG is related to the ACSA and Multistage ACSA algorithms proposed in Ghadimi and Lan (2012, 2013). In Appendix K, for the sake of completeness, we state the ACSA algorithm and show that it can be cast as an ASG method in (6) with a specific varying stepsize rule. In fact, the ACSA iterations can be written as follows:
(22a)  
(22b) 
As a consequence, Multistage ACSA is a variant of the MASG algorithm that uses a different length for each stage and employs a specific varying stepsize rule together with a different selection of the momentum parameter at each stage. That said, the pattern of growth of the stage lengths in Multistage ACSA is very similar to that of MASG. In particular, the stage length of Multistage ACSA increases by almost a factor of two in every stage for sufficiently large stage index, similar to the stage lengths of MASG in Theorem 3.4. Moreover, it can be verified that, for the specific parameter sequences the authors suggest, the maximum stepsize at each stage decreases by almost a factor of four (for large enough stage index), which again parallels the behavior of the stepsize parameter of MASG.
4 MASG: An improved bias-variance tradeoff
In Section 3, we described a universal algorithm that requires knowledge of neither the initial suboptimality gap nor the noise magnitude to operate. However, as we argue in this section, our framework is flexible: additional information about the magnitude of either quantity can be leveraged to improve practical performance.
We first note that several algorithms in the literature assume that an upper bound on the initial optimality gap is known or can be estimated, as summarized in Table 1. This assumption is reasonable in a variety of applications where there is a natural lower bound on the optimal value $f^*$. For example, in supervised learning scenarios such as support vector machines, regression, or logistic regression problems, the loss function takes nonnegative values Vapnik (2013). Similarly, the noise level may be known or estimated; for instance, in private risk minimization Bassily et al. (2014), the noise is added by the user to ensure privacy and is therefore a known quantity. There is a natural, well-known tradeoff between constant and decaying stepsizes (decaying with the number of iterations) in stochastic gradient algorithms. Since the noise is multiplied by the stepsize, a stepsize decaying with the number of iterations leads to a decay in the variance term; however, it also slows down the decay of the bias term, which is controlled essentially by the behavior of the underlying deterministic accelerated gradient (AG) algorithm and is best with a constant stepsize (note that when $\sigma = 0$, the bias term recovers the known performance bounds for the AG algorithm). The main idea behind the MASG algorithm (which allows it to achieve the lower bounds) is to exploit this tradeoff and switch to decaying stepsizes at the right time, i.e., when the bias term is sufficiently small that the variance term dominates and should be handled with a decaying stepsize. This insight is visible in Theorem 3.4, which gives guidance on the choice of the stepsize at every stage to achieve the lower bounds. Theorem 3.4 shows that if MASG is run with a constant stepsize in the first stage, then the variance term admits a bound that does not decay with the number of iterations in that stage. However, in later stages, the stepsize is decreased as the number of iterations grows, and this results in a decay of the variance term. Overall, the choice of the length of the first stage, $n_1$, has a major impact in practice, which we highlight in our numerical experiments.
If an estimate of the noise magnitude or the initial optimality gap is known, it is desirable to choose $n_1$ as small as possible while ensuring that the bias term becomes smaller than the variance term by the end of the first stage. More specifically, applying Theorem 3.1 to the first stage, one can choose $n_1$ to balance the variance and bias terms. The relevant term, as shown in the proof of Lemma 3.3, can be bounded by
(23) 
This choice allows one to fine-tune the switching point at which the decaying stepsizes begin, as a function of the noise magnitude and the initial gap. In scenarios where the noise level is small or the initial gap is large, $n_1$ is chosen large enough to guarantee a fast decay in the bias term. We emphasize that this modified MASG algorithm requires knowledge of the noise magnitude and the initial gap only for selecting $n_1$; the rest of the parameters can be chosen as in Theorem 3.4, independently of both. Finally, the following theorem provides theoretical guarantees for this choice of $n_1$ within our framework. The proof is omitted as it is similar to the proofs of Theorems 3.4 and 3.6.
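A hedged sketch of how a rule like (23) can be evaluated, assuming (as a stand-in for the elided bound) that the bias decays like $(1 - 1/\sqrt{\kappa})^{n_1}$ from the initial gap and should drop below a variance floor of order $\sigma^2/\mu$; the constant `c_bias` and both expressions are our illustrative assumptions, not the paper's exact formula.

```python
import math

def first_stage_length(kappa, delta0, sigma2, mu, c_bias=2.0):
    """Smallest n1 such that the (assumed) bias bound
    c_bias * delta0 * (1 - 1/sqrt(kappa))**n1 falls below the
    (assumed) variance floor sigma2 / mu."""
    target = sigma2 / mu
    rate = 1.0 - 1.0 / math.sqrt(kappa)
    n1 = 1
    while c_bias * delta0 * rate ** n1 > target and n1 < 10**7:
        n1 += 1
    return n1
```

Consistent with the discussion above, small noise or a large initial gap yields a longer first stage.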
Theorem 4.1.
In the next section, we present numerical experiments that illustrate the performance of our proposed algorithms and compare them to existing methods from the literature.
5 Numerical Experiments
In this section, we demonstrate the numerical performance of Algorithm 1 with parameters specified by Corollary 3.7 and by Theorem 4.1 (the variant of Section 4), respectively, and compare them with other methods from the literature.
In our first experiment, we consider the strongly convex quadratic objective $f(x) = \frac{1}{2}x^\top(Q + \lambda I)x - b^\top x$, where $Q$ is the Laplacian of a cycle graph, $b$ is a random vector, and $\lambda$ is a regularization parameter. We assume the gradients are corrupted by additive Gaussian noise. We note that this example has been previously considered in the literature as a problem instance where Standard ASG (ASG iterations with the standard choice of parameters) performs badly compared to Standard GD (gradient descent with the standard choice of stepsize) Hardt (2014). In Figures 1 and 2, we compare both MASG variants with Standard GD, Standard AG, AGD+ Cohen et al. (2018), Multistage ACSA Ghadimi and Lan (2013), and the Flammarion-Bach algorithm proposed in Flammarion and Bach (2015). We initialize all the methods from the same point, and run Multistage ACSA, the Flammarion-Bach algorithm, and MASG with access to the same estimate of the noise level. Figures 1 and 2 show the average performance of all the algorithms over 10 sample runs for two total iteration budgets, respectively, as the noise level is varied. The simulation results reveal that both MASG variants typically exhibit a faster decay of the error in the beginning and outperform the other algorithms when the number of iterations is small to moderate. The speedup obtained by the MASG variants is more prominent in the figures when the noise level is smaller. However, as the number of iterations grows, the performance of the algorithms becomes similar, as the variance term dominates. In addition, we would like to highlight that when the noise is small, using $n_1$ as suggested in (23), the variant of Section 4 runs stage one longer than the basic MASG; hence, it enjoys the linear rate of decay for more iterations before the variance term becomes dominant.
For the second set of experiments, we consider a regularized logistic regression problem for binary classification. In particular, we generate a random matrix $A$ and a random vector $v$, and compute the label vector containing the signs of the inner products of the rows of $A$ with $v$. Our goal is to recover $v$ by optimizing a regularized logistic objective when the gradient of the loss function is corrupted with additive Gaussian noise. We compare both MASG variants with Standard GD, Standard AG, AGD+ Cohen et al. (2018), and Multistage ACSA Ghadimi and Lan (2013). Figures 3 and 4 illustrate the behavior of the algorithms for two iteration budgets with the same noise levels as before. It can be seen that both MASG variants usually start faster and do not perform worse than the other algorithms across scenarios; moreover, they outperform the other algorithms when the iteration budget is limited or the noise level is small. Furthermore, note that in the setting where the noise is large, the variant with $n_1$ chosen as in (23) behaves better than the basic MASG, as it terminates the first stage earlier; this is helpful since, with large noise, the variance term becomes dominant after only a few iterations of the first stage.
6 Conclusion
In this work, we considered strongly convex smooth optimization problems where we have access to noisy estimates of the gradients. We proposed a multistage method that adapts the parameters of Nesterov's accelerated gradient method at each stage to achieve the optimal rate. Our method is universal in the sense that it does not require knowledge of the noise characteristics to operate, and it achieves the optimal rate in both the deterministic and stochastic settings. We provided numerical experiments showing that our algorithm can be faster than existing approaches in practice.
References
Aybat et al. [2018] Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, and Asuman Ozdaglar. Robust accelerated gradient methods for smooth strongly convex functions. arXiv preprint arXiv:1805.10579, 2018.
Bach and Moulines [2011] Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Neural Information Processing Systems (NIPS), Spain, 2011.
Bassily et al. [2014] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464–473. IEEE, 2014.
Bubeck et al. [2015] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
Chen et al. [2012] Xi Chen, Qihang Lin, and Javier Pena. Optimal regularized dual averaging methods for stochastic optimization. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 395–403. Curran Associates, Inc., 2012.
Cohen et al. [2018] Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. On acceleration with noise-corrupted gradients. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1019–1028, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR.
d'Aspremont [2008] A. d'Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008. doi: 10.1137/060676386.
de Klerk [2002] E. de Klerk. Aspects of Semidefinite Programming: Interior Point Algorithms and Selected Applications, volume 65. Springer Science & Business Media, 2002.
Devolder et al. [2014] O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
Dieuleveut et al. [2017] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.
Flammarion and Bach [2015] N. Flammarion and F. Bach. From averaging to acceleration, there is only a step-size. In Conference on Learning Theory, pages 658–695, 2015.
Gao et al. [2018a] X. Gao, M. Gürbüzbalaban, and L. Zhu. Global convergence of stochastic gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv e-prints, September 2018a.
Gao et al. [2018b] Xuefeng Gao, Mert Gurbuzbalaban, and Lingjiong Zhu. Breaking reversibility accelerates Langevin dynamics for global non-convex optimization. arXiv e-prints, art. arXiv:1812.07725, December 2018b.
Ghadimi and Lan [2012] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
Ghadimi and Lan [2013] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
Hardt [2014] M. Hardt. Robustness versus acceleration. http://blog.mrtz.org/2014/08/18/robustness-versus-acceleration.html, August 2014.
Hu and Lessard [2017] Bin Hu and Laurent Lessard. Dissipativity theory for Nesterov's accelerated method. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1549–1557, International Convention Centre, Sydney, Australia, 2017. PMLR.
Hu et al. [2009] Chonghai Hu, Weike Pan, and James T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems 22, pages 781–789. Curran Associates, Inc., 2009.
Jain et al. [2018] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 545–604. PMLR, 2018.
Lessard et al. [2016] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
Neelakantan et al. [2015] Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
Nemirovsky and Yudin [1983] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
Nesterov [2004] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer, 2004.
Nitanda [2014] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.
O'Donoghue and Candès [2015] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, June 2015. ISSN 1615-3383.
Raginsky et al. [2017] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.
Raginsky and Rakhlin [2011] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dynamics in convex programming. IEEE Transactions on Information Theory, 57(10):7036–7056, 2011.
Shi et al. [2018] Bin Shi, Simon S. Du, Michael I. Jordan, and Weijie J. Su. Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907, 2018.
Vapnik [2013] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
Vaswani et al. [2018] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. arXiv preprint arXiv:1810.07288, 2018.
Wai et al. [2018] Hoi-To Wai, Wei Shi, Cesar A. Uribe, Angelia Nedich, and Anna Scaglione. On curvature-aided incremental aggregated gradient methods. arXiv preprint arXiv:1806.00125, 2018.
Xiao [2010] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
Appendix A Proof of Lemma 2.1
Let us denote the asymptotic convergence rate of the ASG method as a function of $\alpha$ and $\beta$ by $\rho(\alpha, \beta)$. It is well-known that this rate has the following characterization (see e.g. Lessard et al. [2016], O'Donoghue and Candès [2015]):
(24) 
where and is defined as:
(25) 
with . Note that, since , we have for ; therefore, if and only if , which is equivalent to .
Using the fact that and is decreasing in , we obtain ; hence, for , we have both and . As a consequence, (24) implies that for , we have
(26) 
Moreover, for , the two branches in (25) take the same value for and ; therefore, when is set to this critical value, we also get for . Note that (26) is an increasing function of for any ; thus, given , the smallest rate possible is equal to , which is the rate given in the statement of the lemma, and it is achieved by .
Now, we consider the case . From (24), if , then we also have . Thus, showing suffices to claim that for any , the best possible rate is , and this can be achieved by setting . Indeed, as we discussed above, for the case , we have ; thus,
Therefore, to show , we just need to prove
(27) 
Taking the square of both sides of (27), it follows that (27) is equivalent to
and this holds when . Therefore, for any , we have for , which completes the proof.
Appendix B Proof of Lemma 2.2
We first state the following lemma which is an extension of Lemma 4.1 in Aybat et al. [2018] for ASG.
Lemma B.1.
Let where and consider the function . Then we have