1 Introduction
In this paper, we consider stochastic composite optimization problems of the form
(1) $\min_{x \in \mathbb{R}^p} \left\{ F(x) := f(x) + \psi(x) \right\} \quad \text{with} \quad f(x) = \mathbb{E}_{\xi}\left[\tilde{f}(x, \xi)\right],$
where $f$ is a convex $L$-smooth function (meaning differentiable with $L$-Lipschitz continuous gradient) and $\psi$ is a possibly nonsmooth convex lower-semicontinuous function. For instance, $\psi$ may be the $\ell_1$-norm, which is known to induce sparsity, or an indicator function that takes the value $+\infty$ outside of a convex set $\mathcal{C}$ and $0$ inside [22]. The random variable $\xi$ corresponds to data samples. When the amount of training data is finite, the expectation can be replaced by a finite sum, a setting that has attracted a lot of attention in machine learning recently, see, e.g., [14, 15, 20, 26, 36, 43, 53] for incremental algorithms and [1, 27, 31, 34, 47, 55, 56] for accelerated variants.
Yet, as noted in [8], one is typically not interested in minimizing the empirical risk (that is, a finite sum of functions) with high precision; instead, one should focus on the expected risk involving the true (unknown) data distribution. When one can draw an infinite number of samples from this distribution, the true risk (1) may be minimized by using appropriate stochastic optimization techniques. Unfortunately, fast methods designed for deterministic objectives do not apply to this setting; methods based on stochastic approximations instead admit optimal “slow” rates that are typically $O(1/\sqrt{k})$ for convex functions and $O(1/k)$ for strongly convex ones, depending on the exact assumptions made on the problem, where $k$ is the number of noisy gradient evaluations [39].
Better understanding the gap between deterministic and stochastic optimization is one goal of this paper. Specifically, we are interested in Nesterov’s acceleration of gradient-based approaches [40, 41]. In a nutshell, gradient descent or its proximal variant applied to a $\mu$-strongly convex $L$-smooth function achieves an exponential convergence rate $O((1-\mu/L)^k)$ in the worst case in function values, and a sublinear rate $O(L/k)$ if the function is simply convex ($\mu = 0$). By interleaving the algorithm with clever extrapolation steps, Nesterov showed that faster convergence could be achieved, and the previous convergence rates become $O((1-\sqrt{\mu/L})^k)$ and $O(L/k^2)$, respectively. Whereas no clear geometrical intuition seems to appear in the literature to explain why acceleration occurs, proof techniques to show accelerated convergence [5, 41, 51] and extensions to a large class of other gradient-based algorithms are now well established [1, 11, 34, 42, 47].
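To make the gap between the two deterministic rates concrete, the following minimal sketch compares plain gradient descent with a constant-momentum Nesterov scheme on a strongly convex quadratic. The test problem, the function names, and the constant extrapolation coefficient $(\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})$ are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def gradient_descent(grad, x0, L, n_iters):
    """Plain gradient descent with step size 1/L."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x - grad(x) / L
    return x

def nesterov_accelerated(grad, x0, L, mu, n_iters):
    """Gradient steps interleaved with a constant extrapolation step
    (classical choice for mu-strongly convex, L-smooth functions)."""
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    x, y = x0.copy(), x0.copy()
    for _ in range(n_iters):
        x_next = y - grad(y) / L          # gradient step at the extrapolated point
        y = x_next + beta * (x_next - x)  # extrapolation
        x = x_next
    return x

# Hypothetical test problem: f(x) = 0.5 * sum_i d_i * x_i^2, minimized at 0 with f* = 0.
d = np.linspace(1e-3, 1.0, 50)
mu, L = d.min(), d.max()

def f(x):
    return 0.5 * np.sum(d * x ** 2)

def grad_f(x):
    return d * x

x0 = np.ones(50)
gap_gd = f(gradient_descent(grad_f, x0, L, 1000))
gap_agd = f(nesterov_accelerated(grad_f, x0, L, mu, 1000))
print(f"gradient descent gap: {gap_gd:.3e}")
print(f"accelerated gap:      {gap_agd:.3e}")
```

With a condition number of 1000, the unaccelerated method is limited by the factor $(1-\mu/L)$ per iteration on the slowest direction, whereas the extrapolated variant contracts at roughly $(1-\sqrt{\mu/L})$.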
Yet, the effect of Nesterov’s acceleration on stochastic objectives remains poorly understood, since existing unaccelerated algorithms such as stochastic mirror descent [39] and its variants already achieve the optimal asymptotic rate. Besides, negative results also exist, showing that Nesterov’s method may be unstable when the gradients are computed approximately [13, 17]. Nevertheless, several approaches such as [4, 12, 16, 18, 19, 24, 29, 30, 52] have managed to show that acceleration may be useful to forget the algorithm’s initialization faster and reach a region dominated by the noise of stochastic gradients; then, “good” methods are expected to asymptotically converge with a rate exhibiting an optimal dependency in the noise variance [39], but with no dependency on the initialization. A major challenge is then to achieve the optimal rate for these two regimes.
In this paper, we consider an optimization method $\mathcal{M}$ with the following property: given an auxiliary strongly convex objective function $h$, we assume that $\mathcal{M}$ is able to produce iterates $(z_t)_{t \geq 0}$ with expected linear convergence to a noise-dominated region, that is, such that
(2) $\mathbb{E}[h(z_t) - h^\star] \leq C (1-\tau)^t \left(h(z_0) - h^\star\right) + B\sigma^2,$
where $C, \tau, B > 0$, $h^\star$ is the minimum function value, and $\sigma^2$ is an upper bound on the variance of stochastic gradients accessed by $\mathcal{M}$, which we assume to be uniformly bounded. Whereas such an assumption has limitations, it remains the most standard one for stochastic optimization (see [10, 44] for more realistic settings in the smooth case). The class of methods satisfying (2) is relatively large. For instance, when $h$ is $L$-smooth, the stochastic gradient descent method (SGD) with constant step size $1/L$ and iterate averaging satisfies (2) with $\tau = \mu/L$ (where $\mu$ denotes the strong convexity parameter of $h$), $B = 1/L$, and $C = 1$, see [29].
Main contribution.
In this paper, we extend the Catalyst approach [34] to stochastic problems. Under mild conditions, our approach is able to turn $\mathcal{M}$ into a converging algorithm with a worst-case expected complexity that decomposes into two parts: the first one exhibits an accelerated convergence rate in the sense of Nesterov and shows how fast one forgets the initial point; the second one corresponds to the stochastic regime and typically depends (optimally in many cases) on $\sigma^2$. Note that even though we only make assumptions about the behavior of $\mathcal{M}$ on strongly convex subproblems (2), we also treat the case where the objective (1) is convex, but not strongly convex.
To illustrate the versatility of our approach, we consider the stochastic finite-sum problem [7, 23, 32, 54], where the objective (1) decomposes into components and is a stochastic perturbation, coming, e.g., from data augmentation or noise injected during training to improve generalization or privacy (see [29, 36]). The underlying finite-sum structure may also result from clustering assumptions on the data [23], or from distributed computing [32], a setting beyond the scope of our paper. Whereas it was shown in [29] that classical variance-reduced stochastic optimization methods such as SVRG [53], SDCA [47], SAGA [14], or MISO [36] can be made robust to noise, the analysis of [29] is only able to accelerate the SVRG approach. With our acceleration technique, all of the aforementioned methods can be modified such that they find a point satisfying with global iteration complexity, for the strongly convex case,
(3) 
The term on the left is the optimal complexity for finite-sum optimization [1, 2], up to logarithmic terms in hidden in the notation, and the term on the right is the optimal complexity for strongly convex stochastic objectives [18] where is due to the perturbations . As with Catalyst [34], the price to pay compared to non-generic direct acceleration techniques [1, 29] is a logarithmic factor.
Other contributions.
In this paper, we generalize the analysis of Catalyst [34, 45] to handle various new cases. Beyond the ability to deal with stochastic optimization problems, our approach (i) improves Catalyst by allowing subproblems of the form (2) to be solved approximately in expectation, which is more realistic than the deterministic requirement made in [34] and which is also critical for stochastic optimization, (ii) leads to new accelerated stochastic gradient descent algorithms for composite optimization with guarantees similar to those of [18, 19, 29], and (iii) handles the analysis of accelerated proximal gradient descent methods with inexact computation of proximal operators, improving the results of [46] while also treating the stochastic setting.
2 Relation with Inexact and Stochastic Proximal Point Methods
Catalyst is based on the inexact accelerated proximal point algorithm [21], which consists in solving approximately a sequence of subproblems and updating two sequences $(x_k)_{k \geq 0}$ and $(y_k)_{k \geq 0}$ by
(4) $x_k \approx \arg\min_{x \in \mathbb{R}^p} \left\{ h_k(x) := F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 \right\} \quad \text{and} \quad y_k = x_k + \beta_k (x_k - x_{k-1}),$
where $\beta_k$ in $(0,1)$ is obtained from Nesterov’s acceleration principles [41], and $\kappa$ is a well-chosen regularization parameter. The method $\mathcal{M}$ is used to obtain an approximate minimizer of $h_k$ by using an appropriate computational budget; when $\mathcal{M}$ converges linearly, it may be shown that the resulting algorithm (4) enjoys a better worst-case complexity than if $\mathcal{M}$ was used directly on $F$, see [34].
Since asymptotic linear convergence is out of reach when $F$ is a stochastic objective, a classical strategy consists in replacing $F$ in (4) by a finite-sum approximation obtained by random sampling, leading to deterministic subproblems. Typically used without Nesterov’s acceleration (with $y_k = x_k$), this strategy is often called the stochastic proximal point method [3, 6, 28, 49, 50]. The point of view we adopt in this paper is different and is based on the minimization of surrogate functions $h_k$ related to (4), but which are more general and may take other forms than $F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2$.
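The outer loop of an inexact accelerated proximal point scheme of the kind described in this section can be sketched as follows for a deterministic, smooth, strongly convex objective. This is only an illustration under stated assumptions: the inner solver is plain gradient descent with a fixed budget, the extrapolation coefficient is the constant $(1-\sqrt{q})/(1+\sqrt{q})$ with $q = \mu/(\mu+\kappa)$, and the test problem and all function names are hypothetical.

```python
import numpy as np

def inner_gradient_descent(grad_h, z0, step, n_steps):
    """Inner solver: a fixed number of gradient steps on the subproblem."""
    z = z0.copy()
    for _ in range(n_steps):
        z = z - step * grad_h(z)
    return z

def accelerated_proximal_point(grad_F, x0, L, mu, kappa, n_outer, n_inner):
    """Approximately minimize F(x) + (kappa/2)||x - y||^2, then extrapolate."""
    q = mu / (mu + kappa)
    beta = (1.0 - np.sqrt(q)) / (1.0 + np.sqrt(q))  # assumed constant momentum
    x_prev, y = x0.copy(), x0.copy()
    for _ in range(n_outer):
        # Gradient of the regularized subproblem centered at the current y.
        def grad_h(z, center=y):
            return grad_F(z) + kappa * (z - center)
        # Warm start the inner solver at the previous outer iterate.
        x = inner_gradient_descent(grad_h, x_prev, 1.0 / (L + kappa), n_inner)
        y = x + beta * (x - x_prev)
        x_prev = x
    return x_prev

# Hypothetical test problem: F(x) = 0.5 * sum_i d_i * x_i^2 with minimum value 0.
d = np.linspace(1e-3, 1.0, 50)
mu, L = d.min(), d.max()

def F(x):
    return 0.5 * np.sum(d * x ** 2)

def grad_F(x):
    return d * x

x0 = np.ones(50)
x_out = accelerated_proximal_point(grad_F, x0, L, mu, kappa=0.1, n_outer=100, n_inner=50)
print(f"F(x0) = {F(x0):.3e}, F(x_out) = {F(x_out):.3e}")
```

The regularization $\kappa$ makes each subproblem much better conditioned than the original objective, which is what allows a simple inner solver to reach the required accuracy with a small budget.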
3 Preliminaries: Basic Multi-Stage Schemes
In this section, we present two simple multi-stage mechanisms to improve the worst-case complexities of stochastic optimization methods, before introducing acceleration principles.
Basic restart with mini-batching or decaying step sizes.
Consider an optimization method $\mathcal{M}$ with convergence rate (2) and assume that there exists a hyperparameter to control a trade-off between the bias $B\sigma^2$ in (2) and the computational complexity. Specifically, we assume that the bias can be reduced by an arbitrary factor $\eta < 1$, while paying a factor $1/\eta$ in terms of complexity per iteration (or $\tau$ may be reduced by a factor $\eta$, thus slowing down convergence). This may occur in two cases:

by using a mini-batch of size $1/\eta$ to sample gradients, which replaces $\sigma^2$ by $\eta\sigma^2$;

or the method uses a step size proportional to $\eta$ that can be chosen arbitrarily small.
For instance, stochastic gradient descent with constant step size and iterate averaging is compatible with both scenarios [29]. Then, consider a target accuracy and define the sequences and for . We may now solve successively the problem up to accuracy (e.g., with a constant number of steps of when using mini-batches of size to reduce the bias) and by using the solution of iteration as a warm restart. As shown in Appendix B, the scheme converges and the worst-case complexity to achieve the accuracy in expectation is
(5) 
For instance, one may run SGD with constant step size at stage with iterate averaging as in [29], which yields , , and . Then, the left term is the classical complexity of the (unaccelerated) gradient descent algorithm for deterministic objectives, whereas the right term is the optimal complexity for stochastic optimization in . Similar restart principles appear for instance in [4] in the design of a multistage accelerated SGD algorithm.
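The restart mechanism with growing mini-batches can be illustrated by the sketch below. It is a simplified instance under explicit assumptions: the objective is a hypothetical quadratic with additive Gaussian gradient noise, SGD uses the last iterate rather than iterate averaging, the mini-batch size doubles at each stage, and the number of steps per stage is fixed by hand rather than derived from the theory.

```python
import numpy as np

def minibatch_sgd(x0, d, sigma, batch, n_steps, rng):
    """SGD with step size 1/L on f(x) = 0.5 * sum_i d_i x_i^2.
    Each gradient is the average of `batch` noisy gradients."""
    L = d.max()
    x = x0.copy()
    for _ in range(n_steps):
        noise = sigma * rng.standard_normal((batch, x.size)).mean(axis=0)
        x = x - (d * x + noise) / L
    return x

def restarted_sgd(x0, d, sigma, n_stages, steps_per_stage, rng):
    """Multi-stage scheme: warm restart with a doubled mini-batch at each stage."""
    x = x0.copy()
    for k in range(n_stages):
        x = minibatch_sgd(x, d, sigma, 2 ** k, steps_per_stage, rng)
    return x

d = np.linspace(0.1, 1.0, 10)   # hypothetical curvature profile (mu = 0.1, L = 1)
sigma = 1.0                      # assumed noise level of a single stochastic gradient
x0 = 10.0 * np.ones(10)

def f(x):
    return 0.5 * np.sum(d * x ** 2)

rng = np.random.default_rng(0)
n_stages, steps_per_stage = 10, 20
x_restart = restarted_sgd(x0, d, sigma, n_stages, steps_per_stage, rng)

# Baseline with the same number of sampled gradients but a mini-batch of size 1.
budget = steps_per_stage * (2 ** n_stages - 1)
x_single = minibatch_sgd(x0, d, sigma, 1, budget, rng)

F_restart, F_single = f(x_restart), f(x_single)
print(f"restart with growing mini-batches: {F_restart:.3e}")
print(f"constant mini-batch of size 1:     {F_single:.3e}")
```

The baseline stalls at a noise-dominated level proportional to $\sigma^2$, while each stage of the restart scheme lowers that level by shrinking the effective variance.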
Restart: from sublinear to linear rate with strong convexity.
A natural question is whether asking for a linear rate in (2) for strongly convex problems is a strong requirement. Here, we show that a sublinear rate is in fact sufficient for our needs by generalizing a restart technique introduced in [19] for stochastic optimization, which was previously used for deterministic objectives in [25].
Specifically, consider an optimization method such that the convergence rate (2) is replaced by
(6) 
where and is a minimizer of . Assume now that is strongly convex with and consider restarting times the method , each time running for iterations. Then, it may be shown (see Appendix B) that the relation (2) holds with , , and . If a mini-batch or step size mechanism is available, we may then proceed as before and obtain a converging scheme with complexity (5), e.g., by using mini-batches of exponentially increasing sizes once the method reaches a noise-dominated region, and by using a restart frequency of order .
4 Generic Multi-Stage Approaches with Acceleration
We are now in a position to introduce a generic acceleration framework that generalizes (4). Specifically, given some point at iteration , we consider a surrogate function related to a parameter , an approximation error , and an optimization method that satisfy the following properties:

is strongly convex, where is the strong convexity parameter of ;

for , which is deterministic given the past information up to iteration and is given in Alg. 1;

can provide the exact minimizer of and a point (possibly equal to ) such that where .
The generic acceleration framework is presented in Algorithm 1. Note that the conditions on bear similarities with estimate sequences introduced by Nesterov [41]. However, the choices of and the proof technique are significantly different, as we will see with various examples below. We also assume at the moment that the exact minimizer of is available, which differs from the Catalyst framework [34]; the case with approximate minimization will be presented in Section 4.1.
(7) 
The proof of the proposition is given in Appendix C and is based on an extension of the analysis of Catalyst [34]. Next, we present various application cases leading to algorithms with acceleration.
Accelerated proximal gradient method.
When is deterministic and the proximal operator of (see Appendix A for the definition) can be computed in closed form, choose and define
(9) 
Consider that minimizes in closed form: . Then, () is obvious; () holds from the convexity of , and () with follows from classical inequalities for smooth functions [41]. Finally, we recover accelerated convergence rates [5, 41].
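As an illustration of an accelerated proximal gradient method with a closed-form proximal operator, here is a minimal sketch on a Lasso problem. It uses the FISTA extrapolation rule of [5] rather than the parameters of Algorithm 1, which are not reproduced here; the data, the regularization weight `lam`, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1 (closed form)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def fista_lasso(A, b, lam, n_iters):
    """Accelerated proximal gradient for 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iters):
        grad = A.T @ (A @ y - b)
        x_next = soft_threshold(y - grad / L, lam / L)   # proximal gradient step at y
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # extrapolation
        x, t = x_next, t_next
    return x

# Hypothetical synthetic data with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_true = np.zeros(20)
x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(40)
lam = 0.1

def objective(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x_hat = fista_lasso(A, b, lam, 200)
print(f"objective at 0:     {objective(np.zeros(20)):.3e}")
print(f"objective at x_hat: {objective(x_hat):.3e}")
```

Each iteration costs one gradient of the smooth part and one soft-thresholding, so the extrapolation step adds essentially no overhead compared to the unaccelerated proximal gradient method.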
Accelerated proximal point algorithm.
Accelerated stochastic gradient descent with prox.
A more interesting choice of surrogate is
(10) 
where and is an unbiased estimate of , that is, , with variance bounded by , following classical assumptions from the stochastic optimization literature [18, 19, 24]. Then, () and () are satisfied given that is convex. To characterize (), consider that minimizes in closed form: , and define , which is deterministic given . Then, from (10),
When taking expectations, the last term on the right disappears since :
(11) 
where we used the nonexpansiveness of the proximal operator [38]. Therefore, () holds with . The resulting algorithm is similar to [29] and offers the same guarantees. The novelty of our approach is then a unified convergence proof for the deterministic and stochastic cases.
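The nonexpansiveness property invoked above can be checked numerically. The sketch below compares a proximal step taken with a noisy gradient to the same step taken with the exact gradient, and verifies that their distance never exceeds the step size times the norm of the noise. The choice of the $\ell_1$-norm as regularizer, the step size `eta`, and the random vectors standing in for gradients are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(0)
eta, lam = 0.1, 0.5   # hypothetical step size and regularization weight
worst_ratio = 0.0
for _ in range(1000):
    y = rng.standard_normal(20)
    grad = rng.standard_normal(20)    # stands in for the exact gradient at y
    noise = rng.standard_normal(20)   # zero-mean perturbation of the gradient
    x_noisy = soft_threshold(y - eta * (grad + noise), eta * lam)
    x_exact = soft_threshold(y - eta * grad, eta * lam)
    # Nonexpansiveness: ||prox(u) - prox(v)|| <= ||u - v|| = eta * ||noise||.
    ratio = np.linalg.norm(x_noisy - x_exact) / (eta * np.linalg.norm(noise))
    worst_ratio = max(worst_ratio, ratio)

print(f"largest observed ratio: {worst_ratio:.4f}")
```

This is what lets the effect of gradient noise on the iterate be bounded by the noise variance, regardless of the nonsmooth term.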
Corollary 2 (Complexity of proximal stochastic gradient algorithm, ).
which is of the form (2) with and . Interestingly, the optimal complexity can be obtained by using the first restart strategy presented in Section 3, see Eq. (5), either by using increasing minibatches or decreasing step sizes.
When the objective is convex, but not strongly convex, Proposition 1 gives a bias term that increases linearly with . Yet, the following corollary exhibits an optimal rate with finite horizon, when both and an upper bound on are available. Even though impractical, the result shows that our analysis recovers the optimal dependency in the noise level, as in [19, 29] and others.
Corollary 3 (Complexity of proximal stochastic gradient algorithm, ).
While all the previous examples use the choice , we will see in Section 4.2 cases where we may choose . Before that, we introduce a variant when is not available.
4.1 Variant with Inexact Minimization
In this variant, presented in Algorithm 2, is not available and we assume that also satisfies:

given , can provide a point such that .
The next proposition, proven in Appendix C, gives us some insight on how to achieve acceleration.
To maintain the accelerated rate, the sequence needs to converge at a similar speed as in Proposition 1, but the dependency in is slightly worse. Specifically, when is strongly convex, we may have both and decreasing at a rate with , but we pay a factor compared to (8). When , the accelerated rate is preserved whenever and , but we pay a factor compared to (8).
Catalyst [34].
When using defined in (4), we recover the convergence rates of [34]. In such a case since . In order to analyze the complexity of minimizing each with and derive the global complexity of the multistage algorithm, the next proposition, proven in Appendix C, characterizes the quality of the initialization .
where . Following [34], we may now analyze the global complexity. For instance, when is strongly convex, we may choose with . Then, it is possible to show that Proposition 4 yields and from the inequality and (12), we have . Consider now a method that behaves as (2). When , can be obtained in iterations of after initializing with . This allows us to obtain the global complexity . For example, when is the proximal gradient descent method, and yield the global complexity of an accelerated method.
Our results improve upon Catalyst [34] in two aspects that are crucial for stochastic optimization: (i) we allow the subproblems to be solved in expectation, whereas Catalyst requires the stronger condition ; (ii) Proposition 5 removes the requirement of [34] to perform a full gradient step for initializing the method in the composite case (see Prop. 12 in [34]).
Proximal gradient descent with inexact prox [46].
Stochastic Catalyst.
With Proposition 5, we are in a position to consider stochastic problems when using a method that converges linearly as (2) with for minimizing . As in Section 3, we also assume that there exists a mini-batch/step-size parameter that can reduce the bias by a factor while paying a factor in terms of inner-loop complexity. As above, we discuss the strongly convex case and choose the same sequence . In order to minimize up to accuracy , we set such that . Then, the complexity to minimize with when using the initialization becomes , leading to the global complexity
(13) 
Details about the derivation are given in Appendix B. The left term corresponds to the Catalyst accelerated rate, but it may be shown that the term on the right is suboptimal. Indeed, consider to be ISTA with . Then, , , and the right term becomes , which is suboptimal by a factor . Whereas this result is a negative one, suggesting that Catalyst is not robust to noise, we show in Section 4.2 how to circumvent this for a large class of algorithms.
Accelerated stochastic proximal gradient descent with inexact prox.
Finally, consider defined in (10) but the proximal operator is computed approximately, which, to our knowledge, has never been analyzed in the stochastic context. Then, it may be shown (see Appendix B for details) that Proposition 4 holds with . Then, an interesting question is how small should be to guarantee the optimal dependency with respect to as in Corollary 2. In the strongly convex case, Proposition 4 simply gives such that .
4.2 Exploiting methods providing strongly convex surrogates
Among various application cases, we have seen an extension of Catalyst to stochastic problems. To achieve convergence, the strategy requires a mechanism to reduce the bias in (2), e.g., by using mini-batches or decreasing step sizes. Yet, the approach suffers from two issues: (i) some of the parameters are based on unknown quantities such as ; (ii) the worst-case complexity exhibits a suboptimal dependency in , typically of order when . Whereas practical workarounds for the first point are discussed in Section 5, we now show how to solve the second one in many cases, by using Algorithm 1 with a particular surrogate provided by the optimization method. Consider indeed a method satisfying (2) and which is able, after steps, to produce a point such that
(14) 
where is a function satisfying (), (), and that can be minimized in closed form and ; () is also satisfied with since . In other words, is used to perform approximate minimization of , but we consider cases where also provides another surrogate with closed-form minimizer that satisfies the conditions required to use Algorithm 1, which has better convergence guarantees than Algorithm 2 (same convergence rate up to a better factor).
As shown in Appendix D, even though (14) looks technical, a large class of optimization techniques are able to provide the condition (14), including many variants of proximal stochastic gradient descent methods with variance reduction such as SAGA [14], MISO [36], SDCA [47], or SVRG [53].
Whereas (14) seems to be a minor modification of (2), an important consequence is that it will allow us to gain a factor in complexity when , corresponding precisely to the suboptimality factor. Indeed, we may notice that
Therefore, even though the surrogate needs only be minimized approximately, the condition (14) allows us to use Algorithm 1 instead of Algorithm 2. Since the dependency with respect to is better than (by ), we then have the following result:
Proposition 6 (Stochastic Catalyst with Optimality Gaps, ).
Consider Algorithm 1 with a method and surrogate satisfying (14) when is used to minimize by using as a warm restart. Assume that is strongly convex and that there exists a parameter that can reduce the bias by a factor while paying a factor in terms of inner-loop complexity.
Choose and . Then, the complexity to solve (14) and compute is , and the global complexity to obtain is
The term on the left is the accelerated rate of Catalyst for deterministic problems, whereas the term on the right is potentially optimal for strongly convex problems, as illustrated in the next table. We provide indeed practical choices for the parameters , leading to various values of , for the proximal stochastic gradient descent method with iterate averaging as well as variants of SAGA, MISO, and SVRG that can cope with stochastic perturbations, which are discussed in Appendix D. All the values below are given up to universal constants to simplify the presentation.
Method  Complexity after Catalyst  

prox-SGD  
SAGA/MISO/SVRG with 
In this table, and the methods SAGA/MISO/SVRG are applied to the stochastic finite-sum problem discussed in Section 1 with smooth functions. As in the deterministic case, we note that when , there is no acceleration for SAGA/MISO/SVRG since the complexity of the unaccelerated method is , which is independent of the condition number and already optimal [29]. In comparison, the logarithmic terms in that are hidden in the notation do not appear for a variant of the SVRG method with direct acceleration introduced in [29]. Here, our approach is more generic. Note also that for prox-SGD and SAGA/MISO/SVRG cannot be compared to each other since the source of randomness is larger for prox-SGD, see [7, 29].
5 Experiments
In this section, we perform numerical evaluations by following [29], which was notably able to make SVRG and SAGA robust to stochastic noise, and accelerate SVRG. More details and experiments are given in Appendix E.
Datasets, formulations and methods.
We consider logistic regression and support vector machines with the squared hinge loss, as in [29], see Appendix E. Studying the squared hinge loss is interesting since its gradients are unbounded on the optimization domain, which may break the bounded noise assumption. The regularization parameter acts as the strong convexity constant and is chosen among the smallest values one would try when performing parameter search. Specifically, we consider and , where is the number of training points. Following [7, 29, 54], we consider DropOut perturbations [48] with rate (no noise), and , and consider three datasets used in [29], alpha, gen, ckncifar, see Appendix E.
Practical questions and implementation.
In all setups, we choose the parameters according to theory, as described in the previous section, following Catalyst [33]. For composite problems, Proposition 5 suggests using as a warm start for inner-loop problems. For smooth ones, [34] shows that in fact, other choices such as are appropriate and lead to similar complexity results. In our experiments with smooth losses, we use , which has been shown to perform consistently better.
The strategy for discussed in Proposition 6 suggests using constant step sizes for a while in the inner loop, typically of order for the methods we consider, before using an exponentially decreasing schedule. Unfortunately, even though theory suggests a rate of decay in , it does not provide useful insight on when decaying should start, since the theoretical time requires knowing . A similar issue arises in stochastic optimization techniques involving iterate averaging [9]. We adopt a similar heuristic as in this literature and start decaying after epochs, with . Finally, we discuss the number of iterations of to perform in the inner loop. When , the theoretical value is of order , and we choose exactly iterations (one epoch), as in Catalyst [34]. After the step sizes start decaying (), we use , according to theory.
Experiments and conclusions.
We run each experiment five times with a different random seed and average the results. All curves also display one standard deviation. Appendix E contains numerous experiments, where we vary the amount of noise, the type of approach (SVRG vs. SAGA), the amount of regularization, and the choice of loss function. In Figure 1, we show a subset of these curves. Most of them show that acceleration may be useful even in the stochastic optimization regime, consistently with [29], but that all acceleration methods may not perform well for very ill-conditioned problems with , which are unrealistic in the context of empirical risk minimization.
Acknowledgments
This work was supported by the ERC grant SOLARIS (number 714381).
References

AllenZhu [2017]
Z. AllenZhu.
Katyusha: The first direct acceleration of stochastic gradient
methods.
In
Proceedings of Symposium on Theory of Computing (STOC)
, 2017.  Arjevani and Shamir [2016] Y. Arjevani and O. Shamir. Dimensionfree iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems (NIPS), 2016.
 Asi and Duchi [2018] H. Asi and J. C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. preprint arXiv:1810.05633, 2018.
 Aybat et al. [2019] N. S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. preprint arXiv:1901.08022, 2019.
 Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 Bertsekas [2011] D. P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163, 2011.
 Bietti and Mairal [2017] A. Bietti and J. Mairal. Stochastic optimization with variance reduction for infinite datasets with finitesum structure. In Advances in Neural Information Processing Systems (NIPS), 2017.
 Bottou and Bousquet [2008] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2008.
 Bottou et al. [2016] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for largescale machine learning, 2016. URL https://arxiv.org/abs/1606.04838. quantization overview.
 Bottou et al. [2018] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for largescale machine learning. SIAM Review, 60(2):223–311, 2018.
 Chambolle and Pock [2015] A. Chambolle and T. Pock. A remark on accelerated block coordinate descent for computing the proximity operators of a sum of convex functions. SMAI Journal of Computational Mathematics, 1:29–54, 2015.
 Cohen et al. [2018] M. B. Cohen, J. Diakonikolas, and L. Orecchia. On acceleration with noisecorrupted gradients. In Proceedings of the International Conferences on Machine Learning (ICML), 2018.
 d’Aspremont [2008] A. d’Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.
 Defazio et al. [2014a] A. Defazio, F. Bach, and S. LacosteJulien. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.
 Defazio et al. [2014b] A. Defazio, T. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conferences on Machine Learning (ICML), 2014b.
 Devolder [2011] O. Devolder. Stochastic first order methods in smooth convex optimization. Technical report, Université catholique de Louvain, 2011.
 Devolder et al. [2014] O. Devolder, F. Glineur, and Y. Nesterov. Firstorder methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(12):37–75, 2014.
 Ghadimi and Lan [2012] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
 Ghadimi and Lan [2013] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
 Gower et al. [2018] R. M. Gower, P. Richtárik, and F. Bach. Stochastic quasigradient methods: Variance reduction via Jacobian sketching. preprint arXiv:1805.02632, 2018.
 Güler [1992] O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
 HiriartUrruty and Lemaréchal [1996] J.B. HiriartUrruty and C. Lemaréchal. Convex analysis and minimization algorithms. II. Springer, 1996.
 Hofmann et al. [2015] T. Hofmann, A. Lucchi, S. LacosteJulien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems (NIPS), 2015.
 Hu et al. [2009] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (NIPS). 2009.
 Iouditski and Nesterov [2014] A. Iouditski and Y. Nesterov. Primaldual subgradient methods for minimizing uniformly convex functions. preprint arXiv:1401.1792, 2014.
 Konečnỳ and Richtárik [2017] J. Konečnỳ and P. Richtárik. Semistochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, 3:9, 2017.
 Kovalev et al. [2019] D. Kovalev, S. Horvath, and P. Richtarik. Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. preprint arXiv:1901.08689, 2019.
 Kulis and Bartlett [2010] B. Kulis and P. L. Bartlett. Implicit online learning. In Proceedings of the International Conferences on Machine Learning (ICML), 2010.
 Kulunchakov and Mairal [2019] A. Kulunchakov and J. Mairal. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. preprint arXiv:1901.08788, 2019.
 Lan [2012] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
 Lan and Zhou [2018a] G. Lan and Y. Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 171(1–2):167–215, 2018a.
 Lan and Zhou [2018b] G. Lan and Y. Zhou. Random gradient extrapolation for distributed and stochastic optimization. SIAM Journal on Optimization, 28(4):2753–2782, 2018b.
 Lin et al. [2015] H. Lin, J. Mairal, and Z. Harchaoui. A Universal Catalyst for FirstOrder Optimization. In 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 3384–3392, Montreal, Canada, Dec. 2015. MIT Press. URL https://hal.inria.fr/hal01160728. main paper (9 pages) + appendix (21 pages).
 Lin et al. [2018] H. Lin, J. Mairal, and Z. Harchaoui. Catalyst acceleration for firstorder convex optimization: from theory to practice. Journal of Machine Learning Research (JMLR), 18(212):1–54, 2018.
 Lin et al. [2019] H. Lin, J. Mairal, and Z. Harchaoui. An inexact variable metric proximal point algorithm for generic quasiNewton acceleration. preprint arXiv:1610.00960, 2019.
 Mairal [2015] J. Mairal. Incremental majorizationminimization optimization with application to largescale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
 Mairal [2016] J. Mairal. Endtoend kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
 Moreau [1965] J.J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletins de la Socitété Mathématique de France, 93(2):273–299, 1965.
 Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
 Nesterov [1983] Y. Nesterov. A method of solving a convex programming problem with convergence rate (1/). Soviet Mathematics Doklady, 27(2):372–376, 1983.
 Nesterov [2004] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
 Nesterov [2012] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
 Nguyen et al. [2017] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
 Nguyen et al. [2018] L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
 Paquette et al. [2018] C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal, and Z. Harchaoui. Catalyst acceleration for gradient-based nonconvex optimization. preprint arXiv:1703.10993, 2018.
 Schmidt et al. [2011] M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2011.
 Shalev-Shwartz and Zhang [2016] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1):105–145, 2016.

 Srivastava et al. [2014] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014.
 Toulis et al. [2016] P. Toulis, D. Tran, and E. Airoldi. Towards stability and optimality in stochastic gradient descent. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
 Toulis et al. [2018] P. Toulis, T. Horel, and E. M. Airoldi. Stable Robbins-Monro approximations through stochastic proximal updates. preprint arXiv:1510.00967, 2018.
 Tseng [2008] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. 2008. unpublished.
 Xiao [2010] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research (JMLR), 11(Oct):2543–2596, 2010.
 Xiao and Zhang [2014] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
 Zheng and Kwok [2018] S. Zheng and J. T. Kwok. Lightweight stochastic optimization for minimizing finite sums with infinite data. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
 Zhou [2019] K. Zhou. Direct acceleration of SAGA using sampled negative momentum. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
 Zhou et al. [2018] K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
Appendix A Useful Results and Definitions
In this section, we present auxiliary results and definitions.
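Definition 7 (Proximal operator).
Given a convex lower-semicontinuous function $\psi$ defined on $\mathbb{R}^p$, the proximal operator of $\psi$ is defined as the unique solution of the strongly convex problem
$$\text{Prox}_{\psi}[y] = \arg\min_{x \in \mathbb{R}^p} \left\{ \psi(x) + \frac{1}{2}\|x - y\|^2 \right\}.$$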
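As an illustration of the definition, here is a minimal sketch for the special case where the nonsmooth function is a scaled $\ell_1$ norm, for which the proximal operator is the classical soft-thresholding operator. The function names `prox_l1` and `prox_objective`, the weight `lam`, and the test vector are assumptions made for this example and do not come from the paper.

```python
def prox_l1(y, lam):
    """Prox of psi(x) = lam * ||x||_1 at y (componentwise soft-thresholding)."""
    out = []
    for yi in y:
        # Shrink the magnitude by lam, clipping at zero, and keep the sign.
        shrunk = max(abs(yi) - lam, 0.0)
        out.append(shrunk if yi >= 0 else -shrunk)
    return out


def prox_objective(x, y, lam):
    """psi(x) + 0.5 * ||x - y||^2, the function minimized by the prox."""
    return sum(lam * abs(xi) + 0.5 * (xi - yi) ** 2 for xi, yi in zip(x, y))


# Hypothetical input: entries with magnitude below lam are set to zero.
y = [3.0, -0.5, 1.2, -2.0]
lam = 1.0
x_star = prox_l1(y, lam)
print(x_star, prox_objective(x_star, y, lam))
```

Because the objective is strongly convex, perturbing any coordinate of `x_star` should only increase `prox_objective`, which gives a cheap sanity check of the closed form.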
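Lemma 8 (Convergence rate of the sequences $(\alpha_k)_{k \geq 0}$ and $(A_k)_{k \geq 0}$).
Consider the sequence $(\alpha_k)_{k \geq 0}$ in $(0,1]$ defined by the recursion
$$\alpha_k^2 = (1-\alpha_k)\alpha_{k-1}^2 + q\alpha_k \quad \text{with} \quad 0 \leq q < 1,$$
and define $A_k = \prod_{t=1}^{k} (1-\alpha_t)$. Then,

• if $q = 0$ and $\alpha_0 = 1$, then, for all $k \geq 1$,
$$\frac{2}{(k+2)^2} \leq A_k = \alpha_k^2 \leq \frac{4}{(k+2)^2};$$

• if $\alpha_0 = \sqrt{q}$, then for all $k \geq 1$,
$$A_k = (1-\sqrt{q})^k;$$

• if $\alpha_0 = 1$, then for all $k \geq 1$,
$$A_k \leq \min\left( (1-\sqrt{q})^k, \frac{4}{(k+2)^2} \right).$$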
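The three regimes can be checked numerically. The sketch below assumes the standard recursion $\alpha_k^2 = (1-\alpha_k)\alpha_{k-1}^2 + q\alpha_k$ with $A_k = \prod_{t=1}^{k}(1-\alpha_t)$, and the bounds $2/(k+2)^2 \leq A_k = \alpha_k^2 \leq 4/(k+2)^2$, $A_k = (1-\sqrt{q})^k$, and $A_k \leq \min((1-\sqrt{q})^k, 4/(k+2)^2)$. The helper names, the horizon `n`, and the value `q = 0.01` are hypothetical choices for illustration.

```python
import math


def alpha_sequence(alpha0, q, n):
    """Return [alpha_0, ..., alpha_n] for alpha_k^2 = (1 - alpha_k) * alpha_{k-1}^2 + q * alpha_k."""
    alphas = [alpha0]
    for _ in range(n):
        a_prev_sq = alphas[-1] ** 2
        # alpha_k is the positive root of x^2 + (a_prev_sq - q) * x - a_prev_sq = 0.
        b = a_prev_sq - q
        alphas.append(0.5 * (-b + math.sqrt(b * b + 4.0 * a_prev_sq)))
    return alphas


def A_sequence(alphas):
    """Return [A_1, ..., A_n] with A_k = prod_{t=1..k} (1 - alpha_t)."""
    out, prod = [], 1.0
    for a in alphas[1:]:
        prod *= 1.0 - a
        out.append(prod)
    return out


n, q, tol = 50, 0.01, 1e-12
ks = range(1, n + 1)

# First regime: q = 0 and alpha_0 = 1.
alphas1 = alpha_sequence(1.0, 0.0, n)
A1 = A_sequence(alphas1)
first_ok = all(
    2.0 / (k + 2) ** 2 <= A1[k - 1] <= 4.0 / (k + 2) ** 2
    and abs(A1[k - 1] - alphas1[k] ** 2) < tol
    for k in ks
)

# Second regime: alpha_0 = sqrt(q), so alpha_k stays constant.
A2 = A_sequence(alpha_sequence(math.sqrt(q), q, n))
second_ok = all(abs(A2[k - 1] - (1.0 - math.sqrt(q)) ** k) < tol for k in ks)

# Third regime: alpha_0 = 1 with q > 0.
A3 = A_sequence(alpha_sequence(1.0, q, n))
third_ok = all(
    A3[k - 1] <= min((1.0 - math.sqrt(q)) ** k, 4.0 / (k + 2) ** 2) + tol
    for k in ks
)

print(first_ok, second_ok, third_ok)
```

Solving the quadratic in closed form at each step avoids an inner root-finding loop; the small tolerance only absorbs floating-point rounding.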
Proof.
We prove the three points, one by one.
First point.
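Let us prove the first point when $q = 0$ and $\alpha_0 = 1$. The relation $A_k = \alpha_k^2$ is obvious for all $k \geq 1$ and the relation $\alpha_k \leq \frac{2}{k+2}$ holds for $k = 0$. By induction, let us assume that we have the relation $\alpha_{k-1} \leq \frac{2}{k+1}$ and let us show that it propagates for $\alpha_k$. Assume, by contradiction, that $\alpha_k > \frac{2}{k+2}$, meaning that $1 - \alpha_k < \frac{k}{k+2}$. Then,
$$\alpha_k^2 = (1-\alpha_k)\alpha_{k-1}^2 < \frac{k}{k+2} \cdot \frac{4}{(k+1)^2} = \frac{4k}{(k+2)(k+1)^2} \leq \frac{4}{(k+2)^2},$$
and we obtain a contradiction. Therefore, $\alpha_k \leq \frac{2}{k+2}$ and the induction hypothesis allows us to conclude for all $k \geq 0$. Then, note [45] that we also have for all $k \geq 1$,
$$A_k = \prod_{t=1}^{k} (1-\alpha_t) \geq \prod_{t=1}^{k} \left(1 - \frac{2}{t+2}\right) = \frac{2}{(k+1)(k+2)} \geq \frac{2}{(k+2)^2}.$$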
Second point.
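The second point is obvious by induction: if $\alpha_{k-1} = \sqrt{q}$, then $\alpha_k = \sqrt{q}$ satisfies the recursion since $(1-\sqrt{q})q + q\sqrt{q} = q$, so that $\alpha_k = \sqrt{q}$ for all $k$ and $A_k = (1-\sqrt{q})^k$.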
Third point.
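For the third point, we simply assume $\alpha_0 = 1$ such that $\alpha_0 \geq \sqrt{q}$. Then, the relation $\alpha_k \geq \sqrt{q}$ and therefore $A_k \leq (1-\sqrt{q})^k$ are easy to show by induction. Then, consider the sequence $(\alpha'_k)_{k \geq 0}$ defined recursively by $\alpha_k'^2 = (1-\alpha'_k)\alpha_{k-1}'^2$ with $\alpha'_0 = 1$. From the first point, we have that $A'_k = \prod_{t=1}^{k} (1-\alpha'_t) \leq \frac{4}{(k+2)^2}$. We will show that $\alpha_k \geq \alpha'_k$ for all $k \geq 0$, which will be sufficient to conclude since then we would have $A_k \leq A'_k$. First, we note that $\alpha_0 = \alpha'_0$; then, assume that $\alpha_{k-1} \geq \alpha'_{k-1}$ and also assume by contradiction that $\alpha_k < \alpha'_k$.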