 # Estimate Sequences for Variance-Reduced Stochastic Composite Optimization

In this paper, we propose a unified view of gradient-based algorithms for stochastic convex composite optimization by extending the concept of estimate sequence introduced by Nesterov. This point of view covers the stochastic gradient descent method, variants of the approaches SAGA, SVRG, and has several advantages: (i) we provide a generic proof of convergence for the aforementioned methods; (ii) we show that this SVRG variant is adaptive to strong convexity; (iii) we naturally obtain new algorithms with the same guarantees; (iv) we derive generic strategies to make these algorithms robust to stochastic noise, which is useful when data is corrupted by small random perturbations. Finally, we show that this viewpoint is useful to obtain new accelerated algorithms in the sense of Nesterov.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider convex optimization problems of the form

 minx∈Rp{F(x):=f(x)+ψ(x)}, (1)

where is -strongly convex and -smooth (differentiable with -Lipschitz continuous gradient), and  is convex lower-semicontinuous. For instance, may be the -norm but may also be the indicator function of a convex set  for constrained problems (Hiriart-Urruty & Lemaréchal, 1996).

More specifically, we focus on stochastic objectives, where  is an expectation or a finite sum of convex functions

 f(x)=Eξ[~f(x,ξ)]   or   f(x)=1nn∑i=1fi(x). (2)

On the left,

is a random variable representing a data point drawn according to some distribution and

measures the fit of some model parameter  to the data point .

While the finite-sum setting is a particular case of expectation, the deterministic nature of the resulting cost function drastically changes the performance guarantees an optimization method may achieve to solve (1). In particular, when an algorithm is only allowed to access unbiased measurements of the objective and gradient, it may be shown that the worst-case convergence rate in expected function value cannot be better than in general, where is the number of iterations (Nemirovski et al., 2009; Agarwal et al., 2012).

Even though this pessimistic result applies to the general stochastic case, linear convergence rates can be obtained for deterministic finite sums. For instance, linear rates are achieved by SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014), SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), SDCA (Shalev-Shwartz & Zhang, 2016), MISO (Mairal, 2015), Katyusha (Allen-Zhu, 2017), MiG (Zhou et al., 2018), SARAH (Nguyen et al., 2017), accelerated SAGA (Zhou, 2019) or Lan & Zhou (2018a). In the non-convex case, a recent focus has been on improving convergence rates for finding first-order stationary points (Lei et al., 2017; Paquette et al., 2018; Fang et al., 2018), which is however beyond the scope of our paper. A common interpretation is to see these algorithms as performing SGD steps with an estimate of the gradient that has lower variance.

In this paper, we are interested in providing a unified view of such stochastic optimization algorithms, but we also want to investigate their robustness to random pertubations. Specifically, we consider objective functions with an explicit finite-sum structure such as (2) when only noisy estimates of the gradients  are available. Such a setting may occur for various reasons. Perturbations may be injected during training to achieve better generalization (Srivastava et al., 2014)

, perform stable feature selection

(Meinshausen & Bühlmann, 2010), for privacy-aware learning (Wainwright et al., 2012) or to improve model robustness (Zheng et al., 2016).

Each point indexed by  is then corrupted by a random perturbation and the function  may be written as

 f(x)=1nn∑i=1fi(x)   with   fi(x)=Eρi[~fi(x,ρi)]. (3)

Since the exact gradients cannot be computed, all the aforementioned variance-reduction methods do not apply and the standard approach is to use SGD. Typically, the variance of the gradient estimate then decomposes into two parts , where is due to data sampling and to the random perturbation. In such a context, variance reduction consists of designing algorithms whose convergence rate depends on , which is potentially much smaller than . The SAGA and SVRG methods have been adapted for such a purpose by Hofmann et al. (2015a), though the resulting algorithms have non-zero asymptotic error; the MISO method was adapted by Bietti & Mairal (2017) at the cost of a memory overhead of , whereas other variants of SAGA and SVRG have been proposed by Zheng & Kwok (2018)

for linear models in machine learning.

In this paper, we extend estimate sequences introduced by Nesterov (2004), which are typically used to design accelerated algorithms. Estimate sequences have been used before for stochastic optimization (Devolder, 2011; Lin et al., 2014; Lu & Xiao, 2015), but not for the following generic purpose:

First, we make a large class of variance-reduction methods robust to stochastic perturbations. More precisely, by using a sampling strategy  to select indices of the sum (3) at each iteration, when each function is convex and -smooth, the worst-case iteration complexity of our approaches in function values—that is, the number of iterations to guarantee —is upper bounded by

 O((n+LQμ)log(F(x0)−F∗ε))+O(ρQ~σ2με),

where and depends on

. For the uniform distribution, we have

and , whereas a non-uniform may yield . The term on the left corresponds to the complexity of variance-reduction methods without perturbation, and is the optimal sublinear rate of convergence for a stochastic optimization problem when the gradient estimates have variance . In contrast, a variant of SGD applied to (3) has worst-case complexity , with potentially .

Second, we design a new accelerated algorithm which, to our knowledge, is the first one to achieve the complexity

 O((n+√nLQμ)log(F(x0)−F∗ε))+O(ρQ~σ2με),

where the term on the left matches the optimal complexity for finite sums when  (Arjevani & Shamir, 2016), which has been achieved by Allen-Zhu (2017); Zhou et al. (2018); Zhou (2019); Kovalev et al. (2019). The general stochastic finite-sum problem with was also considered with an acceleration mechanism by Lan & Zhou (2018b) for distributed optimization being optimal in terms of communication rounds, but not in the global complexity.

Note that we treat here only the strongly convex case, but similar results can be obtained when , as shown in a long version of this paper (Kulunchakov & Mairal, 2019).

## 2 A Generic Framework

In this section, we introduce stochastic estimate sequences and show how they can handle variance reduction.

### 2.1 A Classical Iteration Revisited

Consider an algorithm that performs the following updates:

[leftmargin=0pt,innerleftmargin=6pt,innerrightmargin=6pt]

 xk←Proxηkψ[xk--1−ηkgk],0.1cm (A)

where and is the filtration representing all information up to iteration , is a step size, and is the proximal operator (Moreau, 1962) defined for any scalar as the unique solution of

 Proxηψ[u]:=argminx∈Rp{ηψ(x)+12∥x−u∥2}. (4)

Key to our analysis, we interpret (A) as the iterative minimization of quadratic surrogate functions.

#### Interpretation with stochastic estimate sequence.

Consider

 d0(x)=d∗0+γ02∥x−x0∥2, (5)

with and

is a scalar value that is left unspecified at the moment. Then, it is easy to show that

in (A) minimizes defined recursively for as

 dk(x)=(1−δk)dk--1(x)+δklk(x), (6)

where

 lk(x)=f(xk--1)+g⊤k(x−xk--1)+μ2∥x−xk--1∥2+ψ(xk)+ψ′(xk)⊤(x−xk),

satisfy the system of equations

 δk=ηkγk   and   γk=(1−δk)γk--1+μδk, (7)

(note that yields for all ), and

 ψ′(xk)=1ηk(xk--1−xk)−gk.

Here, is a subgradient in . By simply using the definition of the proximal operator (4) and considering first-order optimality conditions, we indeed have that and  coincides with the minimizer of . This allows us to write in the form

 dk(x)=d∗k+γk2∥x−xk∥2   for all k≥0.

The construction (6) is akin to that of estimate sequences introduced by Nesterov (2004), which are typically used for designing accelerated gradient-based optimization algorithms. In this section, we are however not interested in acceleration, but instead in stochastic optimization and variance reduction. One of the main properties of estimate sequences that we will nevertheless use is their ability to behave asymptotically as a lower bound. Indeed, we have

 E[dk(x∗)]≤Γkd0(x∗)+(1−Γk)F∗, (8)

where is a minimizer of and . The inequality comes from strong convexity since , leading to the relation . Then, by unrolling the recursion, we obtain (8). When  converges to zero, the contribution of the initial surrogate disappears and behaves as a lower bound of .

#### Relation with existing algorithms.

The iteration (A) encompasses many approaches such as ISTA (proximal gradient uses the exact gradient (Beck & Teboulle, 2009; Nesterov, 2013) or proximal variants of the stochastic gradient descent method to deal with a composite objective (Lan, 2012). Of interest to us, the variance-reduced stochastic optimization approaches SVRG (Xiao & Zhang, 2014) and SAGA (Defazio et al., 2014) also follow (A) but with an unbiased gradient estimator  whose variance reduces over time. Specifically, they use

 gk=∇fik(xk--1)−zikk--1+¯zk--1  with  ¯zk--1=1nn∑i=1zik--1, (9)

where is an index chosen uniformly in at random, and each is equal to a gradient , where  is one of the previous iterates. The motivation is that given two random variables and , it is possible to define a new variable which has the same expectation as  but potentially a lower variance if is positively correlated with . SVRG uses the same anchor point for all , where is updated every iterations. Typically, the memory cost of SVRG is that of storing the variable and the gradient , which is thus . On the other hand, SAGA updates only at iteration , such that if . Thus, SAGA requires storing gradients. While in general the overhead cost in memory is of order , it may be reduced to when dealing with linear models in machine learning (see Defazio et al., 2014). Note that variants with non-uniform sampling of the indices have been proposed by Xiao & Zhang (2014); Schmidt et al. (2015), which we discuss later.

In order to make our proofs consistent for SAGA and SVRG (and MISO in Kulunchakov & Mairal, 2019), we consider a variant of SVRG with a randomized gradient updating schedule (Hofmann et al., 2015a). Remarkably, this variant was shown to provide benefits over the fixed schedule in a concurrent work (Kovalev et al., 2019) when .

### 2.2 Gradient Estimators and New Algorithms

In this paper, we consider the gradient estimators below. For all of them, we define the variance to be

 σ2k=E[∥gk−∇f(xk--1)∥2].

#### Ista.

Simply consider and .

#### Sgd.

We assume that has variance bounded by . Typically, when , a data point is drawn at iteration and . Even though the bounded variance assumption has limitations, it remains the most standard one for stochastic optimization and more realistic settings (such as (Bottou et al., 2018; Nguyen et al., 2018) for the smooth case) are left for future work.

#### random-SVRG.

For finite sums, we consider a variant of SVRG with random update of the anchor point , proposed originally in (Hofmann et al., 2015b), combined with non-uniform sampling. Specifically, is defined as

 gk=1qikn(~∇fik(xk--1)−zikk--1)+¯zk--1, (10)

where and denotes a perturbed gradient operator. For instance, if for all , where  is a stochastic perturbation, instead of accessing , we draw a perturbation and observe

 ~∇fik(xk--1)=∇~fik(xk--1,ρk)=∇fik(xk--1)+ζk,

where the perturbation has zero mean given and its variance is bounded by . When considering the setting without perturbation, we simply have .

Similar to the previous case, the variables and also correspond to noisy estimates of the gradients. Specifically,

 zik=~∇fi(~xk)    and   ¯zk=1nn∑i=1zik,

where is an anchor point that is updated on average every  iterations. Whereas the classical SVRG approach updates

on a fixed schedule, we perform random updates: with probability

, we choose and recompute ; otherwise is kept unchanged. In comparison with the fixed schedule, the analysis with the random one is simplified and can be unified with that of SAGA. This approach is described in Algorithm 1.

In terms of memory, the random-SVRG gradient estimator requires to store an anchor point  and the average gradients . The ’s do not need to be stored; only the  random seeds to produce the perturbations are kept into memory, which allows us to compute at iteration , with the same perturbation for index  that was used to compute when the anchor point was last updated. The overall cost is thus .

#### Saga.

The estimator has a form similar to (10) but with a different choice of variables . Unlike SVRG that stores an anchor point , the SAGA estimator requires storing and incrementally updating the auxiliary variables , while maintaining the relation . We consider variants such that each gradient  is corrupted by a random perturbation; to deal with non-uniform sampling, we use a similar strategy as Schmidt et al. (2015). The corresponding algorithm is available in Appendix A.

## 3 Convergence Analysis and Robustness

In Section 3.1, we present a general convergence result and the analysis with variance-reduction is presented in Section 3.2. All proofs are in the appendix.

### 3.1 Generic Convergence Result

The following proposition gives a key relation between , the surrogate , and the variance .

###### Proposition 1 (Key relation).

For iteration (A), assuming , we have for all ,

 δk(E[F(xk)]−F∗)+E[dk(x∗)−d∗k]≤(1−δk)E[dk--1(x∗)−d∗k--% 1]+ηkδkσ2k, (11)

where is a minimizer of and .

Then, without making further assumption on , we have the following general convergence result, which is a direct consequence of the averaging Lemma C.8 in the appendix, inspired in part by Ghadimi & Lan (2012):

###### Theorem 1 (General convergence result).

Under the same assumptions as in Proposition 1, by using the averaging strategy of Lemma C.8, which produces an iterate ,

 E[F(^xk)−F∗+γk2∥xk−x∗∥2]≤Γk(F(x0)−F∗+γ02∥x0−x∗∥2+k∑t=1δtηtσ2tΓt),

where and is a minimizer of .

Theorem 1 allows us to recover convergence rates for various algorithms. In the corollary below, we consider the stochastic setting with constant step sizes; the algorithm converges with the same rate as the deterministic problem to a noise-dominated region of radius . The proof simply uses Lemma C.6, which provides the convergence rate of and uses the relation from Lemma C.5 in the appendix.

###### Corollary 1 (Prox-SGD with constant step-size).

Assume in Theorem 1 that , and choose and . Then,

 E[F(^xk)−F∗+μ2∥xk−x∗∥2]≤(1−μL)k(F(x0)−F∗+μ2∥x0−x∗∥2)+σ2L. (12)

We now show that it is also possible to obtain converging algorithms by using decreasing step sizes.

###### Corollary 2 (Prox-SGD with decreasing step-sizes).

Assume that we target an accuracy smaller than . First, use iteration (A) as in Theorem 1 with a constant step-size and , leading to the convergence rate (12), until . Then, restart the optimization procedure with decreasing step-sizes . The resulting number of gradient evaluations to achieve is upper bounded by

 O(Lμlog(F(x0)−F∗ε))+O(σ2με).

We note that the dependency in with the rate is optimal for strongly convex functions (Nemirovski et al., 2009). Unfortunately, estimating

is not easy and knowing in practice when to start decreasing the step sizes in SGD algorithms is an open problem. The corollary simply supports the heuristic consisting of adopting a constant step size long enough until the iterates oscillate without much progress, before decreasing the step sizes

(see Bottou et al., 2018).

### 3.2 Faster Convergence with Variance Reduction

Stochastic variance-reduced algorithms rely on gradient estimates whose variance decreases as fast as the objective. Our framework provides a unified proof of convergence for our variants of SVRG and SAGA and makes them robust to stochastic perturbations. Specifically, we consider the minimization of a finite sum of functions as in (3), but each observation of the gradient is corrupted by a random noise variable. The next proposition extends the proof of SVRG (Xiao & Zhang, 2014) and characterizes .

###### Proposition 2 (Generic Upper-Bound on Variance).

Consider the optimization problem (1) when is a finite sum of functions where each is convex and -smooth with . Then, the random-SVRG and SAGA gradient estimates defined in Section 2.2 satisfy

 σ2k≤4LQE[F(xk--1)−F∗]+2nE[n∑i=11nqi∥uik--1−∇fi(x∗)∥2]+3ρQ~σ2, (13)

where , , and for all and , is equal to without noise—that is

 uik=∇fi(~xk)   for random-SVRGujkk=∇fjk(xk)   and   ujk=ujk--1   if   j≠jk   for SAGA.

Next, we apply this result to Proposition 1.

###### Proposition 3 (Lyapunov function).

Consider the same setting as Proposition 2 and the same gradient estimators. When using the construction of from Sections 2.1, and assuming and is non-increasing with , we have for all , with ,

 δk6E[F(xk)−F∗]+Tk≤(1−τk)Tk--1+3ρQηkδk~σ2,where  Tk=5LQηkδkE[F(xk)−F∗]+E[dk(x∗)−d∗k]+5ηkδk2E[1nn∑i=11qin∥uik−ui∗∥2]. (14)

From the Lyapunov function, we obtain a general convergence result for the variance-reduced stochastic algorithms.

###### Theorem 2 (Convergence with variance-reduction).

Consider the same setting as Proposition 3. Then, by using the averaging strategy described in Algorithm 1,

 E[F(^xk)−F∗+6τkδkTk]≤Θk(F(x0)−F∗+6τkδkT0+18ρQτk~σ2δkk∑t=1ηtδtΘt),

where .

The theorem is a direct application of the averaging Lemma C.8 to Proposition 3. From this generic convergence theorem, we now study particular cases.

###### Corollary 3 (Variance-reduction with constant η).

Consider the same setting as in Theorem 2, with , , and . Then,

 E[F(^xk)−F∗+36LQτ∥xk−x∗∥2]≤8Θk(F(x0)−F∗)+3ρQ~σ22LQ.

This first corollary shows that the algorithm achieves a linear convergence rate to a noise-dominated region. Interestingly, the algorithm without averaging does not require computing and produces iterates without using the strong convexity constant . This shows that all estimators we consider can become adaptive to .

Moreover, we note that the non-uniform strategy slightly degrades the dependency in : indeed, and if  is uniform, but with non-uniform , we have instead (which is better) and (which is worse).

###### Corollary 4 (Variance-reduction with decreasing ηk).

Consider the same setting as in Theorem 2 and target an accuracy , with . Then, use a constant step-size strategy with until we find a point such that . Then, restart the optimization with decreasing step-sizes . The number of gradient evaluations to achieve is upper bounded by

 O((n+LQμ)log(F(x0)−F∗ε))+O(ρQ~σ2με).

The corollary shows that variance-reduction algorithms may exhibit an optimal dependency on the noise level .

## 4 Accelerated Stochastic Algorithms

We now consider the following iteration, involving an extrapolation sequence , which is a classical mechanism from accelerated first-order algorithms (Beck & Teboulle, 2009; Nesterov, 2013). Given a sequence of step-sizes with for all , and , we consider the sequences and that satisfy

 δk=√ηkγk   for all k≥0γk=(1−δk)γk--1+δkμ   for all k≥1.

Then, for , we consider the iteration [leftmargin=0pt,innerleftmargin=6pt,innerrightmargin=6pt]

 (B)

where . Iteration (B) resembles the accelerated SGD approaches of Hu et al. (2009); Ghadimi & Lan (2012); Lin et al. (2014) but is slightly simpler since it involves two sequences of variables instead of three.

### 4.1 Convergence Analysis without Variance Reduction

Consider then the stochastic estimate sequence introduced in (6) with defined as in (5) and

 lk(x)=f(yk--1)+g⊤k(x−yk--1)+μ2∥x−yk--1∥2+ψ(xk)+ψ′(xk)⊤(x−xk), (15)

and is in by definition of the proximal operator. As in Section 2, asymptotically becomes a lower bound on since (8) remains satisfied. This time, the iterate does not minimize , and we denote by instead its minimizer, allowing us to write in the canonical form

 dk(x)=d∗k+γk2∥x−vk∥2.

The first lemma highlights classical relations between the iterates , and the minimizers , which also appears in (Nesterov, 2004, p. 78) for constant . Note that the construction of stochastic estimate sequence resembles that of (Devolder, 2011; Lin et al., 2014). The main difference lies in the choice of function in (15), which yields a different algorithm and slightly stronger guarantees.

###### Lemma 1 (Relations between yk and xk).

The sequences and produced by iteration (B) satisfy for all , with ,

 yk=θkxk+(1−θk)vk    with    θk=γk+1γk+δk+1μ.

Then, the next lemma will be used to prove that , where is a noise term, such that , according to (8).

###### Lemma 2 (Key lemma for acceleration).

Consider the same sequences as in Lemma 1. Then, for all ,

 E[F(xk)]≤E[lk(yk--1)]+(Lη2k2−ηk)E[∥~gk∥2]+ηkσ2k,

with and .

Finally, we obtain the following convergence result.

###### Theorem 3 (Convergence of accelerated SGD).

Under the assumptions of Lemma 1, we have for all ,

 E[F(xk)−F∗+γk2∥vk−x∗∥2]≤Γk(F(x0)−F∗+γ02∥x0−x∗∥2+k∑t=1ηtσ2tΓt),

where, as before, .

We now specialize the theorem to various practical cases.

###### Corollary 5 (Prox accelerated SGD with constant η).

Assume that  has constant variance , and choose and with Algorithm (B). Then,

 E[F(xk)−F∗]≤(1−√μL)k(F(x0)−F∗+μ2∥x0−x∗∥2)+σ2√μL.

We now show that with decreasing step sizes, we obtain an algorithm with optimal complexity similar to (Ghadimi & Lan, 2013; Cohen et al., 2018; Aybat et al., 2019), though we use a two-stages algorithm only.

###### Corollary 6 (Prox accelerated SGD with decreasing ηk).

Target an accuracy smaller than . First, use a constant step-size with within Algorithm (B) until . Then, we restart the optimization procedure with decreasing step-sizes . The number of gradient evaluations to achieve is upper bounded by

 O(√Lμlog(F(x0)−F∗ε))+O(σ2με).

### 4.2 Accelerated Algorithm with Variance Reduction

Next, we show how to build accelerated algorithms with the random-SVRG gradient estimator. First, we control the variance of the estimator in a similar manner to Katyusha (Allen-Zhu, 2017), as stated in the next proposition. Note that the estimator here does not require storing the seed of the random perturbations, unlike in the previous section, and does not rely on an averaging procedure (hence preserving the potential sparsity of the solution when is sparsity-inducing).