 # A Generic Acceleration Framework for Stochastic Composite Optimization

In this paper, we introduce various mechanisms to obtain accelerated first-order stochastic optimization algorithms when the objective function is convex or strongly convex. Specifically, we extend the Catalyst approach originally designed for deterministic objectives to the stochastic setting. Given an optimization method with mild convergence guarantees for strongly convex problems, the challenge is to accelerate convergence to a noise-dominated region, and then achieve convergence with an optimal worst-case complexity depending on the noise variance of the gradients. A side contribution of our work is also a generic analysis that can handle inexact proximal operators, providing new insights about the robustness of stochastic algorithms when the proximal operator cannot be exactly computed.


## 1 Introduction

In this paper, we consider stochastic composite optimization problems of the form

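$$\min_{x \in \mathbb{R}^p} \left\{ F(x) := f(x) + \psi(x) \right\} \quad \text{with} \quad f(x) = \mathbb{E}_{\xi}\big[\tilde{f}(x, \xi)\big], \tag{1}$$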

where $f$ is a convex $L$-smooth function (meaning differentiable with $L$-Lipschitz continuous gradient) and $\psi$ is a possibly non-smooth convex lower-semicontinuous function. For instance, $\psi$ may be the $\ell_1$-norm, which is known to induce sparsity, or an indicator function that takes the value $+\infty$ outside of a convex set and $0$ inside. The random variable $\xi$ corresponds to data samples. When the amount of training data is finite, the expectation $\mathbb{E}_{\xi}[\tilde{f}(x, \xi)]$ can be replaced by a finite sum, a setting that has attracted a lot of attention in machine learning recently, see, e.g., [14, 15, 20, 26, 36, 43, 53] for incremental algorithms and [1, 27, 31, 34, 47, 55, 56] for accelerated variants.

Yet, as previously noted in the literature, one is typically not interested in minimizing the empirical risk (that is, a finite sum of functions) with high precision; instead, one should focus on the expected risk involving the true (unknown) data distribution. When one can draw an infinite number of samples from this distribution, the true risk (1) may be minimized by using appropriate stochastic optimization techniques. Unfortunately, fast methods designed for deterministic objectives do not apply to this setting; methods based on stochastic approximations admit optimal "slow" rates that are typically $O(1/\sqrt{k})$ for convex functions and $O(1/k)$ for strongly convex ones, depending on the exact assumptions made on the problem, where $k$ is the number of noisy gradient evaluations.

Better understanding the gap between deterministic and stochastic optimization is one goal of this paper. Specifically, we are interested in Nesterov's acceleration of gradient-based approaches [40, 41]. In a nutshell, gradient descent or its proximal variant applied to a $\mu$-strongly convex $L$-smooth function achieves an exponential convergence rate $O((1-\mu/L)^k)$ in the worst case in function values, and a sublinear rate $O(L/k)$ if the function is simply convex ($\mu = 0$). By interleaving the algorithm with clever extrapolation steps, Nesterov showed that faster convergence could be achieved, and the previous convergence rates become $O((1-\sqrt{\mu/L})^k)$ and $O(L/k^2)$, respectively. Whereas no clear geometrical intuition seems to appear in the literature to explain why acceleration occurs, proof techniques to show accelerated convergence [5, 41, 51] and extensions to a large class of other gradient-based algorithms are now well established [1, 11, 34, 42, 47].
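To give a feel for the gap between the unaccelerated and accelerated linear rates, the following sketch counts how many iterations each worst-case bound needs to shrink the initial gap by a factor `eps`. The function name and the numerical values are illustrative assumptions, and the constants hidden in the $O(\cdot)$ notation are ignored.

```python
import math

def iterations_for_linear_rate(contraction, eps):
    """Smallest k such that (1 - contraction)**k <= eps."""
    if not 0.0 < contraction < 1.0 or not 0.0 < eps < 1.0:
        raise ValueError("contraction and eps must lie in (0, 1)")
    return math.ceil(math.log(eps) / math.log(1.0 - contraction))

mu, L, eps = 1e-4, 1.0, 1e-6          # hypothetical problem constants
k_plain = iterations_for_linear_rate(mu / L, eps)             # rate (1 - mu/L)^k
k_accel = iterations_for_linear_rate(math.sqrt(mu / L), eps)  # rate (1 - sqrt(mu/L))^k
print(k_plain, k_accel)
```

For an ill-conditioned problem such as this one, the accelerated bound needs roughly $\sqrt{L/\mu}$ times fewer iterations.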

Yet, the effect of Nesterov's acceleration on stochastic objectives remains poorly understood since existing unaccelerated algorithms such as stochastic mirror descent and its variants already achieve the optimal asymptotic rate. Besides, negative results also exist, showing that Nesterov's method may be unstable when the gradients are computed approximately [13, 17]. Nevertheless, several approaches such as [4, 12, 16, 18, 19, 24, 29, 30, 52] have managed to show that acceleration may be useful to forget the algorithm's initialization faster and reach a region dominated by the noise of stochastic gradients; then, "good" methods are expected to asymptotically converge with a rate exhibiting an optimal dependency in the noise variance $\sigma^2$, but with no dependency on the initialization. A major challenge is then to achieve the optimal rate for these two regimes.

In this paper, we consider an optimization method $\mathcal{M}$ with the following property: given an auxiliary strongly convex objective function $h$, we assume that $\mathcal{M}$ is able to produce iterates $(z_t)_{t \geq 0}$ with expected linear convergence to a noise-dominated region, that is, such that

$$\mathbb{E}[h(z_t) - h^\star] \leq C(1-\tau)^t \big(h(z_0) - h^\star\big) + B\sigma^2, \tag{2}$$

where $C$, $\tau$, and $B$ are constants that depend on the method, $h^\star$ is the minimum function value, and $\sigma^2$ is an upper bound on the variance of stochastic gradients accessed by the method $\mathcal{M}$, which we assume to be uniformly bounded. Whereas such an assumption has limitations, it remains the most standard one for stochastic optimization (see [10, 44] for more realistic settings in the smooth case). The class of methods satisfying (2) is relatively large. For instance, when $h$ is $L$-smooth, the stochastic gradient descent method (SGD) with constant step size and iterate averaging satisfies (2) for suitable values of $\tau$, $B$, and $C$.
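The behavior described by (2), fast decay of the initial gap followed by a plateau whose level is driven by the noise, can be observed on a toy problem. The sketch below runs plain SGD (without the iterate averaging mentioned above) on a one-dimensional quadratic with additive Gaussian gradient noise; the objective, the step size, and all names are assumptions chosen for illustration.

```python
import random

def sgd_quadratic(mu, step, sigma, n_steps, z0, seed=0):
    """Run SGD on h(z) = mu/2 * z**2 with additive Gaussian gradient noise.

    Returns the list of function values h(z_t); the minimum value is h* = 0.
    """
    rng = random.Random(seed)
    z, values = z0, []
    for _ in range(n_steps):
        values.append(0.5 * mu * z * z)
        noisy_grad = mu * z + rng.gauss(0.0, sigma)  # unbiased, variance sigma**2
        z -= step * noisy_grad
    values.append(0.5 * mu * z * z)
    return values

# Without noise, the gap contracts by (1 - step * mu)**2 at every step.
noiseless = sgd_quadratic(mu=0.1, step=1.0, sigma=0.0, n_steps=50, z0=10.0)
# With noise, the gap stops decreasing once it reaches a noise-dominated level.
noisy = sgd_quadratic(mu=0.1, step=1.0, sigma=1.0, n_steps=2000, z0=10.0)
tail_average = sum(noisy[-1000:]) / 1000
print(noiseless[-1], tail_average)
```

Reducing `step` or averaging several noisy gradients lowers the plateau at the price of more computation, which is the trade-off exploited by the multi-stage schemes discussed later.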

#### Main contribution.

In this paper, we extend the Catalyst approach to stochastic problems. Under mild conditions, our approach is able to turn such a method $\mathcal{M}$ into a converging algorithm with a worst-case expected complexity that decomposes into two parts: the first one exhibits an accelerated convergence rate in the sense of Nesterov and shows how fast one forgets the initial point; the second one corresponds to the stochastic regime and typically depends (optimally in many cases) on the noise variance $\sigma^2$. Note that even though we only make assumptions about the behavior of $\mathcal{M}$ on strongly convex sub-problems (2), we also treat the case where the objective (1) is convex, but not strongly convex.

To illustrate the versatility of our approach, we consider the stochastic finite-sum problem [7, 23, 32, 54], where the objective (1) decomposes into $n$ components and $\xi$ is a stochastic perturbation, coming, e.g., from data augmentation or noise injected during training to improve generalization or privacy (see [29, 36]). The underlying finite-sum structure may also result from clustering assumptions on the data, or from distributed computing, a setting beyond the scope of our paper. Whereas it has been shown that classical variance-reduced stochastic optimization methods such as SVRG, SDCA, SAGA, or MISO can be made robust to noise, the corresponding analysis is only able to accelerate the SVRG approach. With our acceleration technique, all of the aforementioned methods can be modified such that they find a point $\hat{x}$ satisfying $\mathbb{E}[F(\hat{x}) - F^\star] \leq \varepsilon$ with global iteration complexity, for the $\mu$-strongly convex case,

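$$\tilde{O}\left(\left(n + \sqrt{\frac{nL}{\mu}}\right)\log\left(\frac{F(x_0) - F^\star}{\varepsilon}\right) + \frac{\sigma^2}{\mu\varepsilon}\right). \tag{3}$$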

The term on the left is the optimal complexity for finite-sum optimization [1, 2], up to logarithmic terms hidden in the $\tilde{O}$ notation, and the term on the right is the optimal complexity for $\mu$-strongly convex stochastic objectives, where $\sigma^2$ is due to the perturbations $\xi$. As with Catalyst, the price to pay compared to non-generic direct acceleration techniques [1, 29] is a logarithmic factor.

#### Other contributions.

In this paper, we generalize the analysis of Catalyst [34, 45] to handle various new cases. Beyond the ability to deal with stochastic optimization problems, our approach (i) improves Catalyst by allowing sub-problems of the form (2) to be solved approximately in expectation, which is more realistic than the deterministic requirement made by Catalyst and which is also critical for stochastic optimization, (ii) leads to new accelerated stochastic gradient descent algorithms for composite optimization with similar guarantees as [18, 19, 29], (iii) handles the analysis of accelerated proximal gradient descent methods with inexact computation of proximal operators, improving upon previous results while also treating the stochastic setting.

## 2 Relation with Inexact and Stochastic Proximal Point Methods

Catalyst is based on the inexact accelerated proximal point algorithm, which consists in solving approximately a sequence of sub-problems and updating two sequences $(x_k)_{k \geq 0}$ and $(y_k)_{k \geq 0}$ by

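$$x_k \approx \operatorname*{argmin}_{x \in \mathbb{R}^p} \left\{ h_k(x) := F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 \right\} \quad \text{and} \quad y_k = x_k + \beta_k (x_k - x_{k-1}), \tag{4}$$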

where $\beta_k$ in $(0,1)$ is obtained from Nesterov's acceleration principles, and $\kappa$ is a well-chosen regularization parameter. A method $\mathcal{M}$ is used to obtain an approximate minimizer of $h_k$ by using an appropriate computational budget; when $\mathcal{M}$ converges linearly, it may be shown that the resulting algorithm (4) enjoys a better worst-case complexity than if $\mathcal{M}$ was used directly on $F$.

Since asymptotic linear convergence is out of reach when $F$ is a stochastic objective, a classical strategy consists in replacing $F$ in (4) by a finite-sum approximation obtained by random sampling, leading to deterministic sub-problems. Typically without Nesterov's acceleration (with $y_k = x_k$), this strategy is often called the stochastic proximal point method [3, 6, 28, 50, 49]. The point of view we adopt in this paper is different and is based on the minimization of surrogate functions related to (4), but which are more general and may take other forms than $F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2$.
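The deterministic iteration (4) can be sketched on a toy problem as follows. This is a minimal illustration rather than the algorithm analyzed in the paper: the objective is an assumed diagonal quadratic, the inner method is plain gradient descent with a fixed number of steps, and the constant extrapolation weight `beta` computed from $q = \mu/(\mu+\kappa)$ is an assumed (standard) choice for the strongly convex case.

```python
import math

def catalyst_quadratic(a, x0, kappa, n_outer, n_inner):
    """Inexact accelerated proximal point iteration (4) on F(x) = 0.5 * sum(a_i * x_i**2).

    Each sub-problem h_k(x) = F(x) + kappa/2 * ||x - y_{k-1}||^2 is solved
    approximately by n_inner gradient steps, warm-started at x_{k-1}.
    """
    mu, L = min(a), max(a)
    q = mu / (mu + kappa)
    beta = (1.0 - math.sqrt(q)) / (1.0 + math.sqrt(q))  # assumed extrapolation weight
    step = 1.0 / (L + kappa)                            # h_k is (L + kappa)-smooth
    x_prev, y = list(x0), list(x0)
    for _ in range(n_outer):
        x = list(x_prev)                                # warm start at x_{k-1}
        for _ in range(n_inner):
            grad = [ai * xi + kappa * (xi - yi) for ai, xi, yi in zip(a, x, y)]
            x = [xi - step * gi for xi, gi in zip(x, grad)]
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
    return x_prev

def objective(a, x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

a = [1e-3, 1e-1, 1.0]                 # hypothetical curvatures: mu = 1e-3, L = 1
x_start = [1.0, 1.0, 1.0]
x_final = catalyst_quadratic(a, x_start, kappa=0.1, n_outer=200, n_inner=20)
print(objective(a, x_start), objective(a, x_final))
```

The inner loop never solves a sub-problem exactly, yet the outer extrapolation still drives the objective down quickly on this example.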

## 3 Preliminaries: Basic Multi-Stage Schemes

In this section, we present two simple multi-stage mechanisms to improve the worst-case complexities of stochastic optimization methods, before introducing acceleration principles.

Consider an optimization method $\mathcal{M}$ with convergence rate (2) and assume that there exists a hyper-parameter to control a trade-off between the bias $B\sigma^2$ and the computational complexity. Specifically, we assume that the bias can be reduced by an arbitrary factor $\eta < 1$, while paying a factor $1/\eta$ in terms of complexity per iteration (or $\tau$ may be reduced by a factor $\eta$, thus slowing down convergence). This may occur in two cases:

• by using a mini-batch of size $1/\eta$ to sample gradients, which replaces $\sigma^2$ by $\eta\sigma^2$;

• or the method uses a step size proportional to $\eta$ that can be chosen arbitrarily small.

For instance, stochastic gradient descent with constant step size and iterate averaging is compatible with both scenarios. Then, consider a target accuracy $\varepsilon$ and define a sequence of decreasing accuracies $(\varepsilon_k)_{k \geq 0}$ together with corresponding factors $(\eta_k)_{k \geq 0}$. We may now successively solve the problem up to accuracy $\varepsilon_k$ (e.g., with a constant number of steps of $\mathcal{M}$ when using mini-batches of size $1/\eta_k$ to reduce the bias) by using the solution of iteration $k-1$ as a warm restart. As shown in Appendix B, the scheme converges and the worst-case complexity to achieve the accuracy $\varepsilon$ in expectation is

 (5)

For instance, one may run SGD with constant step size at stage $k$ with iterate averaging. Then, the left term is the classical complexity $O((L/\mu)\log(1/\varepsilon))$ of the (unaccelerated) gradient descent algorithm for deterministic objectives, whereas the right term is the optimal complexity $O(\sigma^2/(\mu\varepsilon))$ for stochastic optimization. Similar restart principles appear, for instance, in the design of multistage accelerated SGD algorithms.
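The multi-stage mechanism can be sketched by propagating the bound (2) through successive stages. The choices below (halving the target accuracy at each stage, a mini-batch size making the bias at most half the target, and a number of steps making the contraction factor at most 1/4) are illustrative assumptions and not necessarily the constants used in Appendix B.

```python
import math

def multistage_schedule(C, tau, B, sigma2, gap0, eps):
    """Plan stages so that the bound (2) guarantees E[gap] <= eps_k after stage k."""
    # (1 - tau)**steps <= exp(-tau * steps) <= 1 / (4 * C)
    steps = max(1, math.ceil(math.log(4.0 * C) / tau))
    stages, eps_k = [], gap0
    while eps_k > eps:
        eps_k /= 2.0
        # mini-batch large enough so that B * sigma2 / batch <= eps_k / 2
        batch = max(1, math.ceil(2.0 * B * sigma2 / eps_k))
        stages.append((eps_k, batch, steps))
    return stages

def guaranteed_gap(C, tau, B, sigma2, gap0, stages):
    """Propagate the upper bound (2) through the stages, warm-starting each one."""
    gap = gap0
    for _, batch, steps in stages:
        gap = C * (1.0 - tau) ** steps * gap + B * sigma2 / batch
    return gap

# Hypothetical constants for a method satisfying (2).
stages = multistage_schedule(C=1.0, tau=0.1, B=1.0, sigma2=1.0, gap0=1.0, eps=1e-3)
final_gap = guaranteed_gap(1.0, 0.1, 1.0, 1.0, 1.0, stages)
total_cost = sum(batch * steps for _, batch, steps in stages)  # gradient evaluations
print(len(stages), final_gap, total_cost)
```

The mini-batch sizes grow geometrically, so the total cost is dominated by the last stages and scales like $1/\varepsilon$, while the number of stages only grows logarithmically.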

#### Restart: from sub-linear to linear rate with strong convexity.

A natural question is whether asking for a linear rate in (2) for strongly convex problems is a strong requirement. Here, we show that a sublinear rate is in fact sufficient for our needs by generalizing a restart technique introduced in  for stochastic optimization, which was previously used for deterministic objectives in .

Specifically, consider an optimization method $\mathcal{M}$ such that the convergence rate (2) is replaced by

$$\mathbb{E}[h(z_t) - h^\star] \leq \frac{D\|z_0 - z^\star\|^2}{2t^d} + \frac{B\sigma^2}{2}, \tag{6}$$

where $D, d > 0$ and $z^\star$ is a minimizer of $h$. Assume now that $h$ is $\mu$-strongly convex with $D \geq \mu$ and consider restarting $s$ times the method $\mathcal{M}$, each time running $\mathcal{M}$ for $t' = \lceil (2D/\mu)^{1/d} \rceil$ iterations. Then, it may be shown (see Appendix B) that the relation (2) holds with $t = st'$, $C = 1$, and $\tau = 1/(2t')$. If a mini-batch or step size mechanism is available, we may then proceed as before and obtain a converging scheme with complexity (5), e.g., by using mini-batches of exponentially increasing sizes once the method reaches a noise-dominated region, and by using a restart frequency of order $1/\tau$.
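The restart argument can be checked numerically. The sketch below combines (6) with the strong convexity inequality $\|z_0 - z^\star\|^2 \leq 2(h(z_0) - h^\star)/\mu$, so that one run of `t` iterations maps a gap `g` to at most `D * g / (mu * t**d) + B * sigma2 / 2`; the function names and constants are hypothetical.

```python
import math

def restart_length(D, mu, d):
    """Iterations per restart so that one run at least halves the gap (noise aside)."""
    return math.ceil((2.0 * D / mu) ** (1.0 / d))

def gap_bound_after_restarts(D, mu, d, B, sigma2, gap0, n_restarts):
    """Apply bound (6) repeatedly, each run starting from the previous output."""
    t = restart_length(D, mu, d)
    gap = gap0
    for _ in range(n_restarts):
        gap = D / (mu * t ** d) * gap + B * sigma2 / 2.0
    return gap

D, mu, d, B, sigma2 = 10.0, 0.1, 2, 1.0, 0.01   # hypothetical constants
t_restart = restart_length(D, mu, d)
gap = gap_bound_after_restarts(D, mu, d, B, sigma2, gap0=1.0, n_restarts=20)
# Linear convergence to a noise-dominated region, as in (2) with tau = 1 / (2 * t_restart).
linear_bound = (1.0 - 1.0 / (2 * t_restart)) ** (20 * t_restart) * 1.0 + B * sigma2
print(t_restart, gap, linear_bound)
```

Each restart halves the contribution of the initial gap, and the noise terms sum to at most $B\sigma^2$ as a geometric series.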

## 4 Generic Multi-Stage Approaches with Acceleration

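We are now in a position to introduce a generic acceleration framework that generalizes (4). Specifically, given some point $y_{k-1}$ at iteration $k$, we consider a surrogate function $h_k$ related to a parameter $\kappa > 0$, an approximation error $\delta_k \geq 0$, and an optimization method $\mathcal{M}$ that satisfy the following properties:

• $h_k$ is $(\kappa+\mu)$-strongly convex, where $\mu$ is the strong convexity parameter of $F$;

• $\mathbb{E}[h_k(x) \mid \mathcal{F}_{k-1}] \leq F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2$ for $x = \alpha_{k-1} x^\star + (1-\alpha_{k-1}) x_{k-1}$, which is deterministic given the past information $\mathcal{F}_{k-1}$ up to iteration $k-1$, and where $\alpha_{k-1}$ is given in Alg. 1;

• $\mathcal{M}$ can provide the exact minimizer $x_k^\star$ of $h_k$ and a point $x_k$ (possibly equal to $x_k^\star$) such that $\mathbb{E}[F(x_k)] \leq \mathbb{E}[h_k^\star] + \delta_k$, where $h_k^\star = \min_x h_k(x)$.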

The generic acceleration framework is presented in Algorithm 1. Note that the conditions on $h_k$ bear similarities with estimate sequences introduced by Nesterov. However, the choices of $h_k$ and the proof technique are significantly different, as we will see with various examples below. We also assume at the moment that the exact minimizer $x_k^\star$ of $h_k$ is available, which differs from the Catalyst framework; the case with approximate minimization will be presented in Section 4.1.

###### Proposition 1 (Convergence analysis for Algorithm 1).

Consider Algorithm 1. Then,

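$$\mathbb{E}[F(x_k) - F^\star] \leq \begin{cases} (1-\sqrt{q})^k \left( 2(F(x_0) - F^\star) + \sum_{j=1}^{k} (1-\sqrt{q})^{-j} \delta_j \right) & \text{if } \mu \neq 0, \\[4pt] \dfrac{2}{(k+1)^2} \left( \kappa \|x_0 - x^\star\|^2 + \sum_{j=1}^{k} \delta_j (j+1)^2 \right) & \text{otherwise.} \end{cases} \tag{8}$$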

The proof of the proposition is given in Appendix C and is based on an extension of the analysis of Catalyst . Next, we present various application cases leading to algorithms with acceleration.

When $f$ is deterministic and the proximal operator of $\psi$ (see Appendix A for the definition) can be computed in closed form, choose $\kappa = L - \mu$ and define

$$h_k(x) := f(y_{k-1}) + \nabla f(y_{k-1})^\top (x - y_{k-1}) + \frac{L}{2}\|x - y_{k-1}\|^2 + \psi(x). \tag{9}$$

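Consider a method $\mathcal{M}$ that minimizes $h_k$ in closed form: $x_k = x_k^\star = \mathrm{Prox}_{\psi/L}\left[y_{k-1} - \frac{1}{L}\nabla f(y_{k-1})\right]$. Then, the first property above is obvious; the second one holds from the convexity of $f$, and the third one with $\delta_k = 0$ follows from classical inequalities for $L$-smooth functions. Finally, we recover accelerated convergence rates [5, 41].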

#### Accelerated proximal point algorithm.

We consider $h_k$ given in (4) with exact minimization (thus an unrealistic setting, but conceptually interesting). Then, the three properties above are satisfied with $\delta_k = 0$ and we recover the accelerated rates of the accelerated proximal point algorithm.

#### Accelerated stochastic gradient descent with prox.

A more interesting choice of surrogate is

$$h_k(x) := f(y_{k-1}) + g_k^\top (x - y_{k-1}) + \frac{\kappa+\mu}{2}\|x - y_{k-1}\|^2 + \psi(x), \tag{10}$$

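where $\kappa \geq L - \mu$ and $g_k$ is an unbiased estimate of $\nabla f(y_{k-1})$, that is, $\mathbb{E}[g_k \mid \mathcal{F}_{k-1}] = \nabla f(y_{k-1})$, with variance bounded by $\sigma^2$, following classical assumptions from the stochastic optimization literature [18, 19, 24]. Then, the first two properties above are satisfied given that $f$ is convex. To characterize the third one, consider a method $\mathcal{M}$ that minimizes $h_k$ in closed form: $x_k = x_k^\star = \mathrm{Prox}_{\psi/(\kappa+\mu)}\left[y_{k-1} - \frac{1}{\kappa+\mu} g_k\right]$, and define $u_{k-1} := \mathrm{Prox}_{\psi/(\kappa+\mu)}\left[y_{k-1} - \frac{1}{\kappa+\mu}\nabla f(y_{k-1})\right]$, which is deterministic given $\mathcal{F}_{k-1}$. Then, from (10),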

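$$\begin{aligned} F(x_k) &\leq h_k(x_k) + (\nabla f(y_{k-1}) - g_k)^\top (x_k - y_{k-1}) \qquad \text{(from $L$-smoothness of $f$)} \\ &= h_k^\star + (\nabla f(y_{k-1}) - g_k)^\top (x_k - u_{k-1}) + (\nabla f(y_{k-1}) - g_k)^\top (u_{k-1} - y_{k-1}). \end{aligned}$$

When taking expectations, the last term on the right disappears since $\mathbb{E}[g_k \mid \mathcal{F}_{k-1}] = \nabla f(y_{k-1})$ and both $u_{k-1}$ and $y_{k-1}$ are deterministic given $\mathcal{F}_{k-1}$:

$$\mathbb{E}[F(x_k)] \leq \mathbb{E}[h_k^\star] + \mathbb{E}\big[\|g_k - \nabla f(y_{k-1})\| \, \|x_k - u_{k-1}\|\big] \leq \mathbb{E}[h_k^\star] + \frac{1}{\kappa+\mu}\mathbb{E}\big[\|g_k - \nabla f(y_{k-1})\|^2\big] \leq \mathbb{E}[h_k^\star] + \frac{\sigma^2}{\kappa+\mu}, \tag{11}$$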

where we used the non-expansiveness of the proximal operator. Therefore, the third property holds with $\delta_k = \sigma^2/(\kappa+\mu)$. The resulting algorithm is similar to existing accelerated stochastic gradient methods and offers the same guarantees. The novelty of our approach is then a unified convergence proof for the deterministic and stochastic cases.
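The closed-form minimization of the surrogate (10) and the non-expansiveness argument used in (11) can be illustrated as follows. The sketch assumes $\psi = \lambda\|\cdot\|_1$, whose proximal operator is the soft-thresholding operator; this choice of $\psi$, the constants, and the function names are assumptions made for the example.

```python
import random

def soft_threshold(v, threshold):
    """Proximal operator of threshold * |.| applied to each coordinate of v."""
    return [max(abs(vi) - threshold, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]

def surrogate_minimizer(y, g, kappa, mu, lam):
    """Closed-form minimizer of (10) when psi = lam * ||.||_1 (an assumed choice)."""
    c = kappa + mu
    return soft_threshold([yi - gi / c for yi, gi in zip(y, g)], lam / c)

def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

rng = random.Random(0)
kappa, mu, lam = 0.9, 0.1, 0.05                   # hypothetical constants
y = [rng.uniform(-1, 1) for _ in range(5)]
grad = [rng.uniform(-1, 1) for _ in range(5)]     # stands for the exact gradient at y
g = [gi + rng.gauss(0.0, 0.3) for gi in grad]     # noisy unbiased estimate
x_k = surrogate_minimizer(y, g, kappa, mu, lam)   # stochastic step
u = surrogate_minimizer(y, grad, kappa, mu, lam)  # deterministic reference point
lhs = norm([a - b for a, b in zip(x_k, u)])
rhs = norm([a - b for a, b in zip(g, grad)]) / (kappa + mu)
print(lhs, rhs)   # non-expansiveness gives lhs <= rhs
```

The inequality `lhs <= rhs` is the step that turns the cross term of (11) into a variance term divided by $\kappa+\mu$.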

###### Corollary 2 (Complexity of proximal stochastic gradient algorithm, μ>0).

Consider Algorithm 1 with $h_k$ defined in (10). When $f$ is $\mu$-strongly convex, choose $\kappa = L - \mu$. Then,

$$\mathbb{E}[F(x_k) - F^\star] \leq \left(1 - \sqrt{\frac{\mu}{L}}\right)^k (F(x_0) - F^\star) + \frac{\sigma^2}{\sqrt{\mu L}},$$

which is of the form (2) with $\tau = \sqrt{\mu/L}$ and $B = 1/\sqrt{\mu L}$. Interestingly, the optimal complexity can be obtained by using the first restart strategy presented in Section 3, see Eq. (5), either by using increasing mini-batches or decreasing step sizes.
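The noise term of Corollary 2 can be traced back to Proposition 1 with a short computation. The sketch below assumes $q = \mu/(\mu+\kappa)$, the usual Catalyst definition, which is not restated in this section; with $\kappa = L - \mu$ this gives $q = \mu/L$, and (11) gives $\delta_j = \sigma^2/(\kappa+\mu) = \sigma^2/L$ for all $j$.

```latex
\begin{aligned}
(1-\sqrt{q})^k \sum_{j=1}^{k} (1-\sqrt{q})^{-j}\,\delta_j
  &= \frac{\sigma^2}{L} \sum_{i=0}^{k-1} (1-\sqrt{q})^{i} \\
  &\leq \frac{\sigma^2}{L\sqrt{q}}
   = \frac{\sigma^2}{L}\sqrt{\frac{L}{\mu}}
   = \frac{\sigma^2}{\sqrt{\mu L}}.
\end{aligned}
```

The geometric series is what prevents the constant per-iteration error $\delta_j$ from accumulating in the strongly convex case.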

When the objective is convex, but not strongly convex, Proposition 1 gives a bias term that increases linearly with $k$. Yet, the following corollary exhibits an optimal rate with finite horizon, when both $\sigma^2$ and an upper bound on $\|x_0 - x^\star\|^2$ are available. Even though non-practical, the result shows that our analysis recovers the optimal dependency in the noise level, as in [19, 29] and others.

###### Corollary 3 (Complexity of proximal stochastic gradient algorithm, μ=0).

Consider a fixed budget $K$ of iterations of Algorithm 1 with $h_k$ defined in (10). When $\kappa = \max\left(L, \sigma (K+1)^{3/2} / \|x_0 - x^\star\|\right)$,

$$\mathbb{E}[F(x_K) - F^\star] \leq \frac{2L\|x_0 - x^\star\|^2}{(K+1)^2} + \frac{3\sigma\|x_0 - x^\star\|}{\sqrt{K+1}}.$$
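To see how the two terms of Corollary 3 compare as the budget grows, the bound can be evaluated directly. The helper below is only a calculator for the right-hand side; the name `radius` stands for $\|x_0 - x^\star\|$ and the numerical values are hypothetical.

```python
import math

def corollary3_bound(L, sigma, radius, K):
    """Return the two terms of the bound of Corollary 3 for a budget of K iterations."""
    deterministic = 2.0 * L * radius ** 2 / (K + 1) ** 2   # accelerated O(1/K^2) term
    stochastic = 3.0 * sigma * radius / math.sqrt(K + 1)   # noise-driven O(1/sqrt(K)) term
    return deterministic, stochastic

for K in (10, 100, 1000, 10000):
    det, sto = corollary3_bound(L=1.0, sigma=0.1, radius=1.0, K=K)
    print(K, det, sto)
```

The first term vanishes quickly, so for large budgets the bound is governed by the noise level $\sigma$.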

While all the previous examples use the choice $x_k = x_k^\star$, we will see in Section 4.2 cases where we may choose $x_k \neq x_k^\star$. Before that, we introduce a variant when $x_k^\star$ is not available.

### 4.1 Variant with Inexact Minimization

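In this variant, presented in Algorithm 2, $x_k^\star$ is not available and we assume that the method $\mathcal{M}$ also satisfies:

• given $\varepsilon_k \geq 0$, $\mathcal{M}$ can provide a point $x_k$ such that $\mathbb{E}[h_k(x_k) - h_k^\star] \leq \varepsilon_k$.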

The next proposition, proven in Appendix C, gives us some insight on how to achieve acceleration.

###### Proposition 4 (Convergence analysis for Algorithm 2).

Consider Alg. 2. Then, for any $\gamma \in (0, 1]$,

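$$\mathbb{E}[F(x_k) - F^\star] \leq \begin{cases} \left(1 - \dfrac{\sqrt{q}}{2}\right)^k \left( 2(F(x_0) - F^\star) + 4\sum_{j=1}^{k} \left(1 - \dfrac{\sqrt{q}}{2}\right)^{-j} \left( \delta_j + \dfrac{\varepsilon_j}{\sqrt{q}} \right) \right) & \text{if } \mu \neq 0, \\[8pt] \dfrac{2e^{1+\gamma}}{(k+1)^2} \left( \kappa\|x_0 - x^\star\|^2 + \sum_{j=1}^{k} \left( (j+1)^2 \delta_j + \dfrac{(j+1)^{3+\gamma}\varepsilon_j}{\gamma} \right) \right) & \text{if } \mu = 0. \end{cases}$$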

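To maintain the accelerated rate, the sequence $(\delta_k)_{k \geq 0}$ needs to converge at a similar speed as in Proposition 1, but the dependency in $\varepsilon_k$ is slightly worse. Specifically, when $F$ is $\mu$-strongly convex, we may have both $(\varepsilon_k)_{k \geq 0}$ and $(\delta_k)_{k \geq 0}$ decreasing at a rate $O((1-\rho)^k)$ with $\rho < \sqrt{q}/2$, but we pay a factor $1/\sqrt{q}$ compared to (8). When $\mu = 0$, the accelerated $O(1/k^2)$ rate is preserved whenever $\varepsilon_k = O(1/k^{4+2\gamma})$ and $\delta_k = O(1/k^{3+\gamma})$, but we pay a factor $O(1/\gamma^2)$ compared to (8).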

#### Catalyst.

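When using $h_k$ defined in (4), we recover the convergence rates of Catalyst. In such a case, $\delta_k = \varepsilon_k$ since $\mathbb{E}[F(x_k)] \leq \mathbb{E}[h_k(x_k)] \leq \mathbb{E}[h_k^\star] + \varepsilon_k$. In order to analyze the complexity of minimizing each $h_k$ with a method $\mathcal{M}$ and derive the global complexity of the multi-stage algorithm, the next proposition, proven in Appendix C, characterizes the quality of the initialization $x_{k-1}$.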

###### Proposition 5 (Warm restart for Catalyst).

Consider Alg. 2 with $h_k$ defined in (4). Then, for $k \geq 2$,

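$$\mathbb{E}[h_k(x_{k-1}) - h_k^\star] \leq \frac{3\varepsilon_{k-1}}{2} + 54\kappa \max\left( \|x_{k-1} - x^\star\|^2, \|x_{k-2} - x^\star\|^2, \|x_{k-3} - x^\star\|^2 \right), \tag{12}$$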

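where $x_{-1} = x_0$. We may now analyze the global complexity. For instance, when $F$ is $\mu$-strongly convex, we may choose a sequence $(\varepsilon_k)_{k \geq 0}$ that decreases linearly. Then, it is possible to show that Proposition 4 bounds $\mathbb{E}[F(x_k) - F^\star]$ in terms of $\varepsilon_k$, and from the strong convexity inequality $\frac{\mu}{2}\|x_k - x^\star\|^2 \leq F(x_k) - F^\star$ and (12), we obtain a bound on $\mathbb{E}[h_k(x_{k-1}) - h_k^\star]$ in terms of $\varepsilon_{k-1}$. Consider now a method $\mathcal{M}$ that behaves as (2). When $\sigma = 0$, $x_k$ can be obtained in $\tilde{O}(1/\tau)$ iterations of $\mathcal{M}$ after initializing with $x_{k-1}$. This allows us to obtain the global complexity $\tilde{O}\left(\frac{1}{\tau\sqrt{q}}\log(1/\varepsilon)\right)$. For example, when $\mathcal{M}$ is the proximal gradient descent method, appropriate choices of $\kappa$ and $\tau$ yield the global complexity $\tilde{O}\left(\sqrt{L/\mu}\,\log(1/\varepsilon)\right)$ of an accelerated method.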

Our results improve upon Catalyst in two aspects that are crucial for stochastic optimization: (i) we allow the sub-problems to be solved in expectation, whereas Catalyst requires the stronger condition $h_k(x_k) - h_k^\star \leq \varepsilon_k$; (ii) Proposition 5 removes the requirement to perform a full gradient step for initializing the method $\mathcal{M}$ in the composite case (see Prop. 12 in ).

#### Proximal gradient descent with inexact prox.

The surrogate (10) with inexact minimization can be treated in the same way as Catalyst, which provides a unified proof for both problems. Then, we recover the results of , while allowing inexact minimization to be performed in expectation.

#### Stochastic Catalyst.

With Proposition 5, we are in a position to consider stochastic problems when using a method $\mathcal{M}$ that converges linearly as (2) with $\sigma^2 \neq 0$ for minimizing $h_k$. As in Section 3, we also assume that there exists a mini-batch/step-size parameter that can reduce the bias by a factor $\eta < 1$ while paying a factor $1/\eta$ in terms of inner-loop complexity. As above, we discuss the strongly-convex case and choose the same sequence $(\varepsilon_k)_{k \geq 0}$. In order to minimize $h_k$ up to accuracy $\varepsilon_k$, we set the factor $\eta_k$ such that the bias $\eta_k B\sigma^2$ is of order $\varepsilon_k$. Then, the complexity to minimize $h_k$ with $\mathcal{M}$ when using the initialization $x_{k-1}$ becomes $\tilde{O}(1/(\tau\eta_k))$, leading to the global complexity

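$$\tilde{O}\left( \frac{1}{\tau\sqrt{q}} \log\left(\frac{F(x_0) - F^\star}{\varepsilon}\right) + \frac{B\sigma^2}{q^{3/2}\tau\varepsilon} \right). \tag{13}$$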

Details about the derivation are given in Appendix B. The left term corresponds to the Catalyst accelerated rate, but it may be shown that the term on the right is sub-optimal. Indeed, consider the method $\mathcal{M}$ to be ISTA with $\kappa = L - \mu$. Then, $B$ is of order $1/L$, $\tau$ is of constant order, and with $q = \mu/L$ the right term becomes $\tilde{O}\left(\sqrt{L/\mu}\,\sigma^2/(\mu\varepsilon)\right)$, which is sub-optimal by a factor $\sqrt{L/\mu}$. Whereas this result is a negative one, suggesting that Catalyst is not robust to noise, we show in Section 4.2 how to circumvent this for a large class of algorithms.

#### Accelerated stochastic proximal gradient descent with inexact prox.

Finally, consider $h_k$ defined in (10) but with a proximal operator that is computed approximately, which, to our knowledge, has never been analyzed in the stochastic context. Then, it may be shown (see Appendix B for details) that Proposition 4 holds for an appropriate choice of $\delta_k$. Then, an interesting question is how small $\varepsilon_k$ should be to guarantee the optimal dependency with respect to $\sigma^2$ as in Corollary 2. In the strongly-convex case, Proposition 4 simply requires $\varepsilon_k/\sqrt{q}$ to be of the same order as $\delta_k$.

### 4.2 Exploiting methods M providing strongly convex surrogates

Among various application cases, we have seen an extension of Catalyst to stochastic problems. To achieve convergence, the strategy requires a mechanism to reduce the bias $B\sigma^2$ in (2), e.g., by using mini-batches or decreasing step sizes. Yet, the approach suffers from two issues: (i) some of the parameters are based on unknown quantities such as $\sigma^2$; (ii) the worst-case complexity exhibits a sub-optimal dependency in $\sigma^2$, typically of order $1/\sqrt{q}$. Whereas practical workarounds for the first point are discussed in Section 5, we now show how to solve the second one in many cases, by using Algorithm 1 with a particular surrogate provided by the optimization method. Consider indeed a method $\mathcal{M}$ satisfying (2) and which is able, after $T$ steps, to produce a point $x_k$ such that

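$$\mathbb{E}[H_k(x_k) - h_k^\star] \leq C(1-\tau)^T \big( H_k(x_{k-1}) - H_k^\star + \xi_{k-1} \big) + B\sigma^2 \quad \text{with} \quad H_k(x) = F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2, \tag{14}$$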

where $h_k$ is a function that satisfies the first two properties required by Algorithm 1 and that can be minimized in closed form; the third property is also satisfied for an appropriate $\delta_k$ since $F(x_k) \leq H_k(x_k)$. In other words, the method $\mathcal{M}$ is used to perform approximate minimization of $H_k$, but we consider cases where $\mathcal{M}$ also provides another surrogate $h_k$ with closed-form minimizer that satisfies the conditions required to use Algorithm 1, which has better convergence guarantees than Algorithm 2 (same convergence rate up to a better factor).

As shown in Appendix D, even though (14) looks technical, a large class of optimization techniques are able to provide the condition (14), including many variants of proximal stochastic gradient descent methods with variance reduction such as SAGA, MISO, SDCA, or SVRG.

Whereas (14) seems to be a minor modification of (2), an important consequence is that it will allow us to gain a factor $1/\sqrt{q}$ in complexity, corresponding precisely to the sub-optimality factor. Indeed, even though the surrogate $H_k$ needs only be minimized approximately, the condition (14) allows us to use Algorithm 1 instead of Algorithm 2. The dependency with respect to $\delta_k$ being better than that with respect to $\varepsilon_k$ (by a factor $1/\sqrt{q}$), we have then the following result:

###### Proposition 6 (Stochastic Catalyst with Optimality Gaps, μ>0).

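Consider Algorithm 1 with a method $\mathcal{M}$ and surrogate $h_k$ satisfying (14) when $\mathcal{M}$ is used to minimize $H_k$ by using $x_{k-1}$ as a warm restart. Assume that $F$ is $\mu$-strongly convex and that there exists a parameter that can reduce the bias $B\sigma^2$ by a factor $\eta < 1$ while paying a factor $1/\eta$ in terms of inner-loop complexity.

With appropriate choices of the sequences $(\delta_k)_{k \geq 0}$ and $(\eta_k)_{k \geq 0}$, the complexity to solve (14) and compute $x_k$ is $\tilde{O}(1/(\tau\eta_k))$, and the global complexity to obtain $\mathbb{E}[F(x_k) - F^\star] \leq \varepsilon$ is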

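$$\tilde{O}\left( \frac{1}{\tau\sqrt{q}} \log\left(\frac{F(x_0) - F^\star}{\varepsilon}\right) + \frac{B\sigma^2}{q\tau\varepsilon} \right).$$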

The term on the left is the accelerated rate of Catalyst for deterministic problems, whereas the term on the right is potentially optimal for strongly convex problems, as illustrated in the next table. We provide indeed practical choices for the parameter $\kappa$, leading to various values of $B$, $\tau$, and $q$, for the proximal stochastic gradient descent method with iterate averaging as well as variants of SAGA, MISO, and SVRG that can cope with stochastic perturbations, which are discussed in Appendix D. All the values below are given up to universal constants to simplify the presentation.

Method  Complexity after Catalyst
prox-SGD
SAGA/MISO/SVRG with

In this table, the methods SAGA/MISO/SVRG are applied to the stochastic finite-sum problem discussed in Section 1 with $n$ $L$-smooth functions. As in the deterministic case, we note that when $n \geq L/\mu$, there is no acceleration for SAGA/MISO/SVRG since the complexity of the unaccelerated method $\mathcal{M}$ is then independent of the condition number and already optimal. In comparison, the logarithmic terms that are hidden in the $\tilde{O}$ notation do not appear for a previously introduced variant of the SVRG method with direct acceleration. Here, our approach is more generic. Note also that $\sigma^2$ for prox-SGD and SAGA/MISO/SVRG cannot be compared to each other since the source of randomness is larger for prox-SGD, see [7, 29].

## 5 Experiments

In this section, we perform numerical evaluations by following , which was notably able to make SVRG and SAGA robust to stochastic noise, and accelerate SVRG. More details and experiments are given in Appendix E.

#### Datasets, formulations and methods.

We consider $\ell_2$-logistic regression and support vector machine with the squared hinge loss, see Appendix E. Studying the squared hinge loss is interesting since its gradients are unbounded on the optimization domain, which may break the bounded noise assumption. The regularization parameter $\mu$ acts as the strong convexity constant and is chosen among the smallest values one would try when performing parameter search. Specifically, we consider values of $\mu$ proportional to $1/n$ (e.g., $\mu = 1/(100n)$), where $n$ is the number of training points. Following [7, 29, 54], we consider DropOut perturbations with several rates $\delta$, including $\delta = 0$ (no noise) and $\delta = 0.1$, and consider three datasets, alpha, gen, and ckn-cifar, see Appendix E.
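A DropOut perturbation of a data sample can be sketched as follows. This is a generic illustration, not the experimental code: the function name is hypothetical, and the rescaling of surviving coordinates by $1/(1-\delta)$, which keeps the perturbed vector unbiased, is an assumption of this sketch.

```python
import random

def dropout(features, rate, rng):
    """Zero each coordinate independently with probability `rate`.

    Surviving coordinates are rescaled by 1 / (1 - rate) so that the perturbed
    vector is unbiased; this rescaling is an assumption of this sketch.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must lie in [0, 1)")
    scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else scale * v for v in features]

rng = random.Random(0)
sample = [0.5, -1.0, 2.0, 0.0]
clean = dropout(sample, 0.0, rng)      # rate 0: the sample is returned unchanged
perturbed = dropout(sample, 0.1, rng)  # a fresh perturbation is drawn at every call
print(clean, perturbed)
```

Because a new perturbation is drawn each time a sample is visited, the objective remains an expectation even when the number of training points is finite, which is the stochastic finite-sum setting considered here.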

We consider variants of SVRG and SAGA that use decreasing step sizes in the presence of noise (otherwise, they do not converge). We use the suffix "-d" each time decreasing step sizes are used. We also consider Katyusha in the noiseless case, and an accelerated SVRG method.

#### Practical questions and implementation.

In all setups, we choose the parameter $\kappa$ according to theory, as described in the previous section, following Catalyst. For composite problems, Proposition 5 suggests using $x_{k-1}$ as a warm start for inner-loop problems. For smooth ones, it has been shown that other choices such as $y_{k-1}$ are in fact appropriate and lead to similar complexity results. In our experiments with smooth losses, we use $y_{k-1}$, which has been shown to perform consistently better.

The strategy for $\eta_k$ discussed in Proposition 6 suggests using constant step sizes for a while in the inner loop, before using an exponentially decreasing schedule. Unfortunately, even though theory suggests a rate of decay, it does not provide useful insight on when decaying should start since the theoretical time requires knowing $\sigma^2$. A similar issue arises in stochastic optimization techniques involving iterate averaging. We adopt a similar heuristic as in this literature and start decaying after a fixed number of epochs. Finally, we discuss the number of iterations of $\mathcal{M}$ to perform in the inner loop. When $\eta_k = 1$, the theoretical value is of order $\tilde{O}(1/\tau)$, and we choose exactly $n$ iterations (one epoch), as in Catalyst. After starting to decay the step sizes ($\eta_k < 1$), we increase this number by a factor $1/\eta_k$, according to theory.

#### Experiments and conclusions.

We run each experiment five times with a different random seed and average the results. All curves also display one standard deviation. Appendix E contains numerous experiments, where we vary the amount of noise, the type of approach (SVRG vs. SAGA), the amount of regularization $\mu$, and the choice of loss function. In Figure 1, we show a subset of these curves. Most of them show that acceleration may be useful even in the stochastic optimization regime, consistent with earlier findings, but that all acceleration methods may not perform well for very ill-conditioned problems, which are unrealistic in the context of empirical risk minimization.

Figure 1: Accelerating SVRG-like (top) and SAGA (bottom) methods for $\ell_2$-logistic regression with $\mu = 1/(100n)$ (bottom) for $\delta = 0.1$. All plots are on a logarithmic scale for the objective function value, and the x-axis denotes the number of epochs. The colored tubes around each curve denote one standard deviation across 5 runs. They do not look symmetric because of the logarithmic scale.

### Acknowledgments

This work was supported by the ERC grant SOLARIS (number 714381).

## References

• Allen-Zhu  Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of Symposium on Theory of Computing (STOC), 2017.
• Arjevani and Shamir  Y. Arjevani and O. Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems (NIPS), 2016.
• Asi and Duchi  H. Asi and J. C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. preprint arXiv:1810.05633, 2018.
• Aybat et al.  N. S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. preprint arXiv:1901.08022, 2019.
• Beck and Teboulle  A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
• Bertsekas  D. P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163, 2011.
• Bietti and Mairal  A. Bietti and J. Mairal. Stochastic optimization with variance reduction for infinite datasets with finite-sum structure. In Advances in Neural Information Processing Systems (NIPS), 2017.
• Bottou and Bousquet  L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2008.
• Bottou et al.  L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning, 2016. URL https://arxiv.org/abs/1606.04838.
• Bottou et al.  L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
• Chambolle and Pock  A. Chambolle and T. Pock. A remark on accelerated block coordinate descent for computing the proximity operators of a sum of convex functions. SMAI Journal of Computational Mathematics, 1:29–54, 2015.
• Cohen et al.  M. B. Cohen, J. Diakonikolas, and L. Orecchia. On acceleration with noise-corrupted gradients. In Proceedings of the International Conferences on Machine Learning (ICML), 2018.
• d’Aspremont  A. d’Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.
• Defazio et al. [2014a] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.
• Defazio et al. [2014b] A. Defazio, T. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conferences on Machine Learning (ICML), 2014b.
• Devolder  O. Devolder. Stochastic first order methods in smooth convex optimization. Technical report, Université catholique de Louvain, 2011.
• Devolder et al.  O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
• Ghadimi and Lan  S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
• Ghadimi and Lan  S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
• Gower et al.  R. M. Gower, P. Richtárik, and F. Bach. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. preprint arXiv:1805.02632, 2018.
• Güler  O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
• Hiriart-Urruty and Lemaréchal  J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms. II. Springer, 1996.
• Hofmann et al.  T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems (NIPS), 2015.
• Hu et al.  C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (NIPS), 2009.
• Iouditski and Nesterov  A. Iouditski and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. preprint arXiv:1401.1792, 2014.
• Konečný and Richtárik  J. Konečný and P. Richtárik. Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, 3:9, 2017.
• Kovalev et al.  D. Kovalev, S. Horváth, and P. Richtárik. Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. preprint arXiv:1901.08689, 2019.
• Kulis and Bartlett  B. Kulis and P. L. Bartlett. Implicit online learning. In Proceedings of the International Conferences on Machine Learning (ICML), 2010.
• Kulunchakov and Mairal  A. Kulunchakov and J. Mairal. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. preprint arXiv:1901.08788, 2019.
• Lan  G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
• Lan and Zhou [2018a] G. Lan and Y. Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 171(1–2):167–215, 2018a.
• Lan and Zhou [2018b] G. Lan and Y. Zhou. Random gradient extrapolation for distributed and stochastic optimization. SIAM Journal on Optimization, 28(4):2753–2782, 2018b.
• Lin et al.  H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), pages 3384–3392, 2015.
• Lin et al.  H. Lin, J. Mairal, and Z. Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research (JMLR), 18(212):1–54, 2018.
• Lin et al.  H. Lin, J. Mairal, and Z. Harchaoui. An inexact variable metric proximal point algorithm for generic quasi-Newton acceleration. preprint arXiv:1610.00960, 2019.
• Mairal  J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
• Mairal  J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
• Moreau  J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93(2):273–299, 1965.
• Nemirovski et al.  A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
• Nesterov  Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1983.
• Nesterov  Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
• Nesterov  Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
• Nguyen et al.  L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the International Conferences on Machine Learning (ICML), 2017.
• Nguyen et al.  L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In Proceedings of the International Conferences on Machine Learning (ICML), 2018.
• Paquette et al.  C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal, and Z. Harchaoui. Catalyst acceleration for gradient-based non-convex optimization. preprint arXiv:1703.10993, 2018.
• Schmidt et al.  M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2011.
• Shalev-Shwartz and Zhang  S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1):105–145, 2016.
• Srivastava et al.  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014.
• Toulis et al.  P. Toulis, D. Tran, and E. Airoldi. Towards stability and optimality in stochastic gradient descent. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
• Toulis et al.  P. Toulis, T. Horel, and E. M. Airoldi. Stable Robbins-Monro approximations through stochastic proximal updates. preprint arXiv:1510.00967, 2018.
• Tseng  P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. 2008. unpublished.
• Xiao  L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research (JMLR), 11(Oct):2543–2596, 2010.
• Xiao and Zhang  L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
• Zheng and Kwok  S. Zheng and J. T. Kwok. Lightweight stochastic optimization for minimizing finite sums with infinite data. In Proceedings of the International Conferences on Machine Learning (ICML), 2018.
• Zhou  K. Zhou. Direct acceleration of SAGA using sampled negative momentum. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
• Zhou et al.  K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In Proceedings of the International Conferences on Machine Learning (ICML), 2018.

## Appendix A Useful Results and Definitions

In this section, we present auxiliary results and definitions.

###### Definition 7 (Proximal operator).

Given a convex lower-semicontinuous function $\psi$ defined on $\mathbb{R}^p$, the proximal operator of $\psi$ is defined as the unique solution of the strongly convex problem

$$\text{Prox}_{\psi}[y] = \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \left\{ \frac{1}{2}\|y - x\|^2 + \psi(x) \right\}.$$
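As an illustration of Definition 7, the sketch below evaluates the proximal operator in closed form for one particular choice, $\psi = \lambda\|\cdot\|_1$ (the $\ell_1$-norm mentioned in the introduction), where it reduces to soft-thresholding. The function names `prox_l1` and `prox_objective`, the parameter `lam`, and the test vector are illustrative assumptions, not part of the paper.

```python
import numpy as np


def prox_l1(y, lam=1.0):
    """Proximal operator of psi(x) = lam * ||x||_1 (soft-thresholding).

    Hypothetical helper: the l1 choice of psi is only an example.
    """
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)


def prox_objective(x, y, lam=1.0):
    """Value of 0.5 * ||y - x||^2 + lam * ||x||_1, the problem defining the prox."""
    return 0.5 * np.sum((y - x) ** 2) + lam * np.sum(np.abs(x))


y = np.array([3.0, -0.5, 1.2, -2.0])
x_star = prox_l1(y, lam=1.0)

# Sanity check: random perturbations of x_star should never decrease the
# objective, since x_star is the unique minimizer of a strongly convex problem.
rng = np.random.default_rng(0)
for _ in range(100):
    x_pert = x_star + 0.1 * rng.standard_normal(y.shape)
    assert prox_objective(x_pert, y) >= prox_objective(x_star, y)
```

Entries of `y` with magnitude below `lam` are set to zero, while the others are shrunk toward zero by `lam`, which is the sparsity-inducing effect of the $\ell_1$-norm.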
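###### Lemma 8 (Convergence rate of the sequences $(\alpha_k)_{k \ge 0}$ and $(A_k)_{k \ge 0}$).

Consider the sequence $(\alpha_k)_{k \ge 0}$ in $(0,1]$ defined by the recursion

$$\alpha_k^2 = (1-\alpha_k)\,\alpha_{k-1}^2 + q\,\alpha_k \quad \text{with} \quad 0 \le q < 1,$$

and define $A_k = \prod_{t=1}^{k} (1-\alpha_t)$. Then,

• if $q = 0$ and $\alpha_0 = 1$, then, for all $k \ge 1$,

$$\frac{2}{(k+2)^2} \le A_k = \alpha_k^2 \le \frac{4}{(k+2)^2}.$$
• if $\alpha_0 = \sqrt{q}$, then for all $k \ge 1$,

$$A_k = (1-\sqrt{q})^k \quad \text{and} \quad \alpha_k = \sqrt{q}.$$
• if $\alpha_0 = 1$, then for all $k \ge 1$,

$$A_k \le \min\left( (1-\sqrt{q})^k, \frac{4}{(k+2)^2} \right) \quad \text{and} \quad \alpha_k \ge \max\left( \sqrt{q}, \frac{\sqrt{2}}{k+2} \right).$$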
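Before turning to the proof, the three cases of Lemma 8 can be checked numerically. The following is a minimal sketch, assuming that $\alpha_k$ is taken as the positive root of the quadratic recursion; the helper names `alpha_sequence`, `A_sequence` and `check_lemma_bounds`, the horizon `n`, the value `q = 0.01` and the tolerance `tol` are illustrative choices, not taken from the paper.

```python
import math


def alpha_sequence(alpha0, q, n):
    """Return [alpha_0, ..., alpha_n] for the recursion
    alpha_k^2 = (1 - alpha_k) * alpha_{k-1}^2 + q * alpha_k."""
    alphas = [alpha0]
    for _ in range(n):
        a2 = alphas[-1] ** 2
        # alpha_k is the positive root of x^2 + (a2 - q) * x - a2 = 0.
        b = a2 - q
        alphas.append((-b + math.sqrt(b * b + 4.0 * a2)) / 2.0)
    return alphas


def A_sequence(alphas):
    """Return [A_0, ..., A_n] with A_k = prod_{t=1}^k (1 - alpha_t), A_0 = 1."""
    A = [1.0]
    for a in alphas[1:]:
        A.append(A[-1] * (1.0 - a))
    return A


def check_lemma_bounds(n=200, q=0.01, tol=1e-9):
    """Numerically check the three cases of the lemma for k = 1, ..., n."""
    sq = math.sqrt(q)

    # Case 1: q = 0 and alpha_0 = 1.
    al = alpha_sequence(1.0, 0.0, n)
    A = A_sequence(al)
    for k in range(1, n + 1):
        assert math.isclose(A[k], al[k] ** 2, rel_tol=tol)
        assert 2.0 / (k + 2) ** 2 <= A[k] * (1 + tol)
        assert A[k] <= 4.0 / (k + 2) ** 2 * (1 + tol)

    # Case 2: alpha_0 = sqrt(q), the sequence stays constant.
    al = alpha_sequence(sq, q, n)
    A = A_sequence(al)
    for k in range(1, n + 1):
        assert math.isclose(al[k], sq, rel_tol=tol)
        assert math.isclose(A[k], (1.0 - sq) ** k, rel_tol=tol)

    # Case 3: alpha_0 = 1 with the same q.
    al = alpha_sequence(1.0, q, n)
    A = A_sequence(al)
    for k in range(1, n + 1):
        assert A[k] <= min((1.0 - sq) ** k, 4.0 / (k + 2) ** 2) * (1 + tol)
        assert al[k] >= max(sq, math.sqrt(2.0) / (k + 2)) * (1 - tol)
    return True


check_lemma_bounds()
```

Such a check is of course no substitute for the proof, but it makes the two regimes visible: the $4/(k+2)^2$ bound is the smaller one for small $k$, and the linear rate $(1-\sqrt{q})^k$ takes over for large $k$.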
###### Proof.

We prove the three points, one by one.

#### First point.

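Let us prove the first point when $q = 0$ and $\alpha_0 = 1$. The relation $A_k = \alpha_k^2$ is obvious for all $k \ge 1$ and the relation $\alpha_k \le \frac{2}{k+2}$ holds for $k = 0$. By induction, let us assume that we have the relation $\alpha_{k-1} \le \frac{2}{k+1}$ and let us show that it propagates for $\alpha_k$. Assume, by contradiction, that $\alpha_k > \frac{2}{k+2}$, meaning that $1 - \alpha_k < \frac{k}{k+2}$. Then,

$$\alpha_k^2 = (1-\alpha_k)\,\alpha_{k-1}^2 \le (1-\alpha_k)\frac{4}{(k+1)^2} < \frac{4k}{(k+2)(k+1)^2} = \frac{4}{(k+2)\left(k+2+\frac{1}{k}\right)} < \frac{4}{(k+2)^2},$$

and we obtain a contradiction. Therefore, $\alpha_k \le \frac{2}{k+2}$ and the induction hypothesis allows us to conclude for all $k \ge 1$. Then, note that we also have for all $k \ge 1$,

$$A_k = \prod_{t=1}^{k} (1-\alpha_t) \ge \prod_{t=1}^{k} \left(1 - \frac{2}{t+2}\right) = \frac{2}{(k+1)(k+2)} \ge \frac{2}{(k+2)^2}.$$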

#### Second point.

The second point is obvious by induction.
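For completeness, the induction step can be sketched as follows, using only the recursion of Lemma 8. Suppose that $\alpha_{k-1} = \sqrt{q}$. Substituting $\alpha_{k-1}^2 = q$ into the recursion gives

$$\alpha_k^2 = (1-\alpha_k)\,q + q\,\alpha_k = q, \qquad \text{hence} \qquad \alpha_k = \sqrt{q},$$

since $\alpha_k$ is nonnegative. The expression for $A_k$ then follows directly from its definition:

$$A_k = \prod_{t=1}^{k} (1-\alpha_t) = \prod_{t=1}^{k} (1-\sqrt{q}) = (1-\sqrt{q})^k.$$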

#### Third point.

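For the third point, we simply assume $\alpha_0 = 1$ such that $\alpha_0 \ge \sqrt{q}$. Then, the relation $\alpha_k \ge \sqrt{q}$ and therefore $A_k \le (1-\sqrt{q})^k$ are easy to show by induction. Then, consider the sequence $(u_k)_{k \ge 0}$ defined recursively by $u_k^2 = (1-u_k)\,u_{k-1}^2$ with $u_0 = 1$. From the first point, we have that $u_k \ge \frac{\sqrt{2}}{k+2}$. We will show that $\alpha_k \ge u_k$ for all $k \ge 0$, which will be sufficient to conclude since then we would have $A_k \le \prod_{t=1}^{k} (1-u_t) = u_k^2 \le \frac{4}{(k+2)^2}$. First, we note that $\alpha_0 = u_0$; then, assume that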