## 1 Introduction

We consider convex optimization problems of the form

$$\min_{x \in \mathbb{R}^p} \Big\{ F(x) \triangleq f(x) + \psi(x) \Big\}, \tag{1}$$

where $f$ is $\mu$-strongly convex and $L$-smooth (differentiable with an $L$-Lipschitz continuous gradient), and $\psi$ is convex and lower-semicontinuous. For instance, $\psi$ may be the $\ell_1$-norm, but it may also be the indicator function of a convex set for constrained problems (Hiriart-Urruty & Lemaréchal, 1996).

More specifically, we focus on stochastic objectives, where $f$ is an expectation or a finite sum of convex functions:

$$f(x) = \mathbb{E}_{\xi}\big[\tilde{f}(x,\xi)\big] \quad \text{or} \quad f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x). \tag{2}$$

On the left, $\xi$ is a random variable representing a data point drawn according to some distribution, and $\tilde{f}(x,\xi)$ measures the fit of some model parameter $x$ to the data point $\xi$. While the finite-sum setting is a particular case of expectation, the deterministic nature of the resulting cost function drastically changes the performance guarantees an optimization method may achieve to solve (1). In particular, when an algorithm is only allowed to access unbiased measurements of the objective and gradient, it may be shown that the worst-case convergence rate in expected function value cannot be better than $O(1/t)$ in general, where $t$ is the number of iterations (Nemirovski et al., 2009; Agarwal et al., 2012).

Even though this pessimistic result applies to the general stochastic case, linear convergence rates can be obtained for deterministic finite sums. For instance, linear rates are achieved by SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014), SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), SDCA (Shalev-Shwartz & Zhang, 2016), MISO (Mairal, 2015), Katyusha (Allen-Zhu, 2017), MiG (Zhou et al., 2018), SARAH (Nguyen et al., 2017), accelerated SAGA (Zhou, 2019), or the method of Lan & Zhou (2018a). In the non-convex case, a recent focus has been on improving convergence rates for finding first-order stationary points (Lei et al., 2017; Paquette et al., 2018; Fang et al., 2018), which is however beyond the scope of our paper. A common interpretation is to see these algorithms as performing SGD steps with an estimate of the gradient that has lower variance.

In this paper, we are interested in providing a unified view of such stochastic optimization algorithms, but we also want to investigate their *robustness* to random perturbations. Specifically, we consider objective functions with an explicit finite-sum structure such as (2) when only noisy estimates of the gradients are available.
Such a setting may occur for various reasons: perturbations may be injected during training to achieve better generalization (Srivastava et al., 2014), to perform stable feature selection (Meinshausen & Bühlmann, 2010), to ensure privacy-aware learning (Wainwright et al., 2012), or to improve model robustness (Zheng et al., 2016). Each data point indexed by $i$ is then corrupted by a random perturbation $\rho$ and the function $f_i$ may be written as

$$f_i(x) = \mathbb{E}_{\rho}\big[\tilde{f}_i(x,\rho)\big]. \tag{3}$$

Since the exact gradients $\nabla f_i(x)$ cannot be computed, none of the aforementioned variance-reduction methods applies and the standard approach is to use SGD. Typically, the variance of the gradient estimate then decomposes into two parts $\sigma^2 = \sigma_{\mathrm{s}}^2 + \tilde{\sigma}^2$, where $\sigma_{\mathrm{s}}^2$ is due to data sampling and $\tilde{\sigma}^2$ to the random perturbations. In such a context, variance reduction consists of designing algorithms whose convergence rate depends on $\tilde{\sigma}^2$, which is potentially much smaller than $\sigma^2$. The SAGA and SVRG methods have been adapted for such a purpose by Hofmann et al. (2015a), though the resulting algorithms have a non-zero asymptotic error; the MISO method was adapted by Bietti & Mairal (2017) at the cost of a memory overhead, whereas other variants of SAGA and SVRG have been proposed by Zheng & Kwok (2018) for linear models in machine learning.
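As an illustration of the variance decomposition above, the following Monte Carlo sketch (a toy setup of ours, not the paper's experiments) checks numerically that the total variance of a perturbed stochastic gradient splits into a sampling term and a perturbation term:

```python
import numpy as np

# Check sigma^2 = sigma_s^2 + sigma_tilde^2 for g = grad f_i(x) + rho,
# with i uniform and rho an independent zero-mean Gaussian perturbation.
# All names and values below are illustrative stand-ins.
rng = np.random.default_rng(0)
n, p, trials = 5, 3, 200_000
G = rng.normal(size=(n, p))        # stand-ins for the gradients grad f_i(x)
std_pert = 0.3                     # std of each perturbation coordinate

idx = rng.integers(n, size=trials)
noise = std_pert * rng.normal(size=(trials, p))
samples = G[idx] + noise           # noisy stochastic gradients

full_grad = G.mean(axis=0)
total_var = np.mean(np.sum((samples - full_grad) ** 2, axis=1))
sampling_var = np.mean(np.sum((G - full_grad) ** 2, axis=1))  # sampling term
pert_var = p * std_pert ** 2                                  # perturbation term
```

Because the perturbation is independent of the sampled index and has zero mean, the cross term vanishes in expectation, so `total_var` matches `sampling_var + pert_var` up to Monte Carlo error.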

In this paper, we extend estimate sequences introduced by Nesterov (2004), which are typically used to design accelerated algorithms. Estimate sequences have been used before for stochastic optimization (Devolder, 2011; Lin et al., 2014; Lu & Xiao, 2015), but not for the following generic purpose:

First, we make a large class of variance-reduction methods robust to stochastic perturbations. More precisely, by using a sampling distribution $Q$ to select indices of the sum (3) at each iteration, when each function $f_i$ is convex and $L_i$-smooth, the worst-case iteration complexity of our approaches in function values—that is, the number of iterations to guarantee $\mathbb{E}[F(x_t)] - F^\star \leq \varepsilon$—is upper bounded by

$$O\left(\left(n + \frac{\bar{L}_Q}{\mu}\right)\log\left(\frac{1}{\varepsilon}\right) + \frac{\rho_Q\,\tilde{\sigma}^2}{\mu\varepsilon}\right),$$

where $\bar{L}_Q$ and $\rho_Q$ depend on $Q$. For the uniform distribution, we have $\rho_Q = 1$ and $\bar{L}_Q = \max_i L_i$, whereas a non-uniform $Q$ may yield $\bar{L}_Q = \frac{1}{n}\sum_{i=1}^{n} L_i$. The term on the left corresponds to the complexity of variance-reduction methods without perturbation, and the term on the right is the optimal sublinear rate of convergence for a stochastic optimization problem when the gradient estimates have variance $\tilde{\sigma}^2$. In contrast, a variant of SGD applied to (3) has worst-case complexity $O(\sigma^2/(\mu\varepsilon))$, with potentially $\sigma^2 \gg \tilde{\sigma}^2$.

Second, we design a new accelerated algorithm which, to our knowledge, is the first one to achieve the complexity

$$O\left(\left(n + \sqrt{\frac{n\bar{L}_Q}{\mu}}\right)\log\left(\frac{1}{\varepsilon}\right) + \frac{\rho_Q\,\tilde{\sigma}^2}{\mu\varepsilon}\right),$$

where the term on the left matches the optimal complexity for finite sums when $\tilde{\sigma} = 0$ (Arjevani & Shamir, 2016), which has been achieved by Allen-Zhu (2017); Zhou et al. (2018); Zhou (2019); Kovalev et al. (2019). The general stochastic finite-sum problem with $\tilde{\sigma} > 0$ was also considered with an acceleration mechanism by Lan & Zhou (2018b) for distributed optimization, which is optimal in terms of communication rounds, but not in global complexity.

Note that we treat here only the strongly convex case, but similar results can be obtained when $\mu = 0$, as shown in a long version of this paper (Kulunchakov & Mairal, 2019).

## 2 A Generic Framework

In this section, we introduce stochastic estimate sequences and show how they can handle variance reduction.

### 2.1 A Classical Iteration Revisited

Consider an algorithm that performs the following updates:


$$x_t = \mathrm{prox}_{\eta_t\psi}\left[x_{t-1} - \eta_t g_t\right] \quad \text{with} \quad \mathbb{E}\left[g_t \mid \mathcal{F}_{t-1}\right] = \nabla f(x_{t-1}), \tag{A}$$

where $\mathcal{F}_{t-1}$ is the filtration representing all information up to iteration $t-1$, $\eta_t$ is a step size, and $\mathrm{prox}_{\lambda\psi}$ is the proximal operator (Moreau, 1962) defined for any scalar $\lambda > 0$ as the unique solution of

$$\mathrm{prox}_{\lambda\psi}[u] \triangleq \arg\min_{x \in \mathbb{R}^p}\left\{\lambda\psi(x) + \frac{1}{2}\|x - u\|^2\right\}. \tag{4}$$
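The two examples of $\psi$ mentioned in the introduction admit a closed-form proximal operator; here is a minimal NumPy sketch (the function names are ours):

```python
import numpy as np

def prox_l1(u, lam):
    """Prox of lam * ||.||_1, i.e. argmin_x { lam*||x||_1 + 0.5*||x - u||^2 }.
    The closed form is the soft-thresholding operator."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def prox_indicator_box(u, lo, hi):
    """Prox of the indicator of the box [lo, hi]^p: the scaling lam is
    irrelevant for indicator functions, and the prox reduces to the
    Euclidean projection onto the set."""
    return np.clip(u, lo, hi)
```

For more general convex sets, the same principle applies: the prox of an indicator function is the Euclidean projection onto the set.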

Key to our analysis, we interpret (A) as the iterative minimization of quadratic surrogate functions.

#### Interpretation with stochastic estimate sequence.

Consider the quadratic function

$$d_0(x) = d_0^\star + \frac{\gamma_0}{2}\|x - x_0\|^2, \tag{5}$$

where $\gamma_0 \geq \mu$ and $d_0^\star$ is a scalar value that is left unspecified at the moment. Then, it is easy to show that $x_t$ in (A) minimizes the surrogate $d_t$ defined recursively for $t \geq 1$ as

$$d_t(x) = (1-\delta_t)\,d_{t-1}(x) + \delta_t\,\hat{l}_t(x), \tag{6}$$

where the sequences $(\gamma_t)$ and $(\delta_t)$ satisfy the system of equations

$$\gamma_t = (1-\delta_t)\gamma_{t-1} + \delta_t\mu \quad \text{and} \quad \delta_t = \eta_t\gamma_t \tag{7}$$

(note that $\gamma_0 = \mu$ yields $\gamma_t = \mu$ for all $t$), and

$$\hat{l}_t(x) = f(x_{t-1}) + g_t^\top(x - x_{t-1}) + \frac{\mu}{2}\|x - x_{t-1}\|^2 + \psi(x_t) + \xi_t^\top(x - x_t).$$

Here, $\xi_t$ is a subgradient in $\partial\psi(x_t)$. By simply using the definition of the proximal operator (4) and considering first-order optimality conditions, we indeed have that $x_t = x_{t-1} - \eta_t(g_t + \xi_t)$ and $x_t$ coincides with the minimizer of $d_t$. This allows us to write $d_t$ in the canonical form

$$d_t(x) = d_t^\star + \frac{\gamma_t}{2}\|x - x_t\|^2.$$

The construction (6) is akin to that of estimate sequences introduced by Nesterov (2004), which are typically used for designing accelerated gradient-based optimization algorithms. In this section, we are however not interested in acceleration, but instead in stochastic optimization and variance reduction. One of the main properties of estimate sequences that we will nevertheless use is their ability to behave asymptotically as a lower bound. Indeed, we have

$$\mathbb{E}[d_t(x^\star)] \leq (1-\delta_t)\,\mathbb{E}[d_{t-1}(x^\star)] + \delta_t F^\star \leq F^\star + \Gamma_t\big(d_0(x^\star) - F^\star\big) \quad \text{with} \quad \Gamma_t = \prod_{s=1}^{t}(1-\delta_s), \tag{8}$$

where $x^\star$ is a minimizer of $F$ and $F^\star = F(x^\star)$. The inequality comes from strong convexity, since $\mathbb{E}[\hat{l}_t(x^\star) \mid \mathcal{F}_{t-1}] \leq F(x^\star)$, leading to the relation $\mathbb{E}[d_t(x^\star)] \leq (1-\delta_t)\,\mathbb{E}[d_{t-1}(x^\star)] + \delta_t F^\star$. Then, by unrolling the recursion, we obtain (8). When $\Gamma_t$ converges to zero, the contribution of the initial surrogate $d_0$ disappears and $d_t(x^\star)$ behaves asymptotically as a lower bound of $F^\star$.

#### Relation with existing algorithms.

The iteration (A) encompasses many approaches, such as ISTA (the proximal gradient method), which uses the exact gradient $g_t = \nabla f(x_{t-1})$ (Beck & Teboulle, 2009; Nesterov, 2013), or proximal variants of the stochastic gradient descent method that deal with a composite objective (Lan, 2012). Of interest to us, the variance-reduced stochastic optimization approaches SVRG (Xiao & Zhang, 2014) and SAGA (Defazio et al., 2014) also follow (A), but with an unbiased gradient estimator whose variance reduces over time. Specifically, they use

$$g_t = \nabla f_{i_t}(x_{t-1}) - z_{i_t} + \bar{z} \quad \text{with} \quad \bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i, \tag{9}$$

where $i_t$ is an index chosen uniformly at random in $\{1,\ldots,n\}$, and each auxiliary variable $z_i$ is equal to a gradient $\nabla f_i(\tilde{x}_i)$, where $\tilde{x}_i$ is one of the previous iterates. The motivation is that given two random variables $X$ and $Y$, it is possible to define a new variable $Z = X - Y + \mathbb{E}[Y]$, which has the same expectation as $X$ but potentially a lower variance if $Y$ is positively correlated with $X$. SVRG uses the same anchor point $\tilde{x}_i = \tilde{x}$ for all $i$, where $\tilde{x}$ is updated every $m$ iterations. Typically, the memory cost of SVRG is that of storing the variable $\tilde{x}$ and the average gradient $\bar{z}$, which is thus $O(p)$. On the other hand, SAGA updates only $z_{i_t} = \nabla f_{i_t}(x_{t-1})$ at iteration $t$, such that the other $z_i$'s are kept unchanged. Thus, SAGA requires storing $n$ gradients. While in general the overhead cost in memory is of order $O(np)$, it may be reduced to $O(n)$ when dealing with linear models in machine learning (see Defazio et al., 2014). Note that variants with non-uniform sampling of the indices have been proposed by Xiao & Zhang (2014); Schmidt et al. (2015), which we discuss later.
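The common structure of these estimators can be sketched in a few lines. The helper below (our own naming, not the paper's) implements the generic correction of (9); it is unbiased because the stored gradients average to the correction term:

```python
import numpy as np

def vr_estimate(grad_i, x, z, i):
    """Variance-reduced estimate in the spirit of (9):
    g = grad f_i(x) - z_i + mean(z), where z stores one past gradient per
    index (SVRG: all taken at a common anchor; SAGA: at per-index iterates).
    `grad_i(j, y)` returns grad f_j(y); names are illustrative."""
    return grad_i(i, x) - z[i] + z.mean(axis=0)
```

Averaging this estimate over a uniformly random index recovers the full gradient, which is the unbiasedness property used throughout the analysis.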

In order to make our proofs consistent for SAGA and SVRG (and MISO in Kulunchakov & Mairal, 2019), we consider a variant of SVRG with a randomized gradient updating schedule (Hofmann et al., 2015a). Remarkably, this variant was shown to provide benefits over the fixed schedule in a concurrent work (Kovalev et al., 2019) when .

### 2.2 Gradient Estimators and New Algorithms

In this paper, we consider the gradient estimators below. For all of them, we define the variance of the estimator as $\sigma_t^2 = \mathbb{E}\big[\|g_t - \nabla f(x_{t-1})\|^2\big]$.

#### ISTA.

Simply consider the exact gradient $g_t = \nabla f(x_{t-1})$, such that $\sigma_t = 0$.

#### SGD.

We assume that the estimate $g_t$ has variance bounded by $\sigma^2$. Typically, when $f$ is an expectation as in (2), a data point $\xi_t$ is drawn at iteration $t$ and $g_t = \nabla\tilde{f}(x_{t-1},\xi_t)$. Even though the bounded-variance assumption has limitations, it remains the most standard one for stochastic optimization, and more realistic settings (such as those of Bottou et al., 2018; Nguyen et al., 2018, for the smooth case) are left for future work.

#### random-SVRG.

For finite sums, we consider a variant of SVRG with a random update of the anchor point, proposed originally in (Hofmann et al., 2015b), combined with non-uniform sampling. Specifically, $g_t$ is defined as

$$g_t = \frac{1}{q_{i_t} n}\left(\tilde{\nabla} f_{i_t}(x_{t-1}) - \tilde{z}_{i_t}\right) + \bar{z} \quad \text{with} \quad \bar{z} = \frac{1}{n}\sum_{i=1}^{n}\tilde{z}_i, \tag{10}$$

where $i_t$ is drawn from a distribution $Q = \{q_1,\ldots,q_n\}$ and $\tilde{\nabla}$ denotes a perturbed gradient operator. For instance, if $f_i(x) = \mathbb{E}_{\rho}[\tilde{f}_i(x,\rho)]$ for all $i$, where $\rho$ is a stochastic perturbation, instead of accessing $\nabla f_i(x)$, we draw a perturbation $\rho$ and observe

$$\tilde{\nabla} f_i(x) = \nabla\tilde{f}_i(x,\rho),$$

where the perturbation $\tilde{\nabla} f_i(x) - \nabla f_i(x)$ has zero mean and its variance is bounded by $\tilde{\sigma}^2$. When considering the setting without perturbation, we simply have $\tilde{\nabla} = \nabla$.

Similar to the previous case, the variables $\tilde{z}_i$ and $\bar{z}$ also correspond to noisy estimates of the gradients. Specifically, $\tilde{z}_i = \tilde{\nabla} f_i(\tilde{x})$, where $\tilde{x}$ is an anchor point that is updated on average every $n$ iterations. Whereas the classical SVRG approach updates $\tilde{x}$ on a fixed schedule, we perform random updates: with probability $1/n$, we choose $\tilde{x} = x_t$ and recompute $\bar{z}$; otherwise $\tilde{x}$ is kept unchanged. In comparison with the fixed schedule, the analysis with the random one is simplified and can be unified with that of SAGA. This approach is described in Algorithm 1. In terms of memory, the random-SVRG gradient estimator requires storing an anchor point $\tilde{x}$ and the average gradient estimate $\bar{z}$. The $\tilde{z}_i$'s do not need to be stored; only the random seeds used to produce the perturbations are kept in memory, which allows us to recompute $\tilde{z}_{i_t} = \tilde{\nabla} f_{i_t}(\tilde{x})$ at iteration $t$, with the same perturbation for index $i_t$ that was used when the anchor point was last updated. The overall cost is thus $O(p)$.
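As an illustration of this scheme, here is a minimal NumPy sketch of iteration (A) with a random-SVRG-style estimator, using uniform sampling and no perturbation for simplicity (the names and signature are ours, not Algorithm 1 verbatim):

```python
import numpy as np

def random_svrg(grad_i, n, x0, eta, steps, rng, prox=lambda u, s: u):
    """Sketch of iteration (A) with a random-SVRG estimator: a prox step
    using the variance-reduced estimate, and an anchor point refreshed
    with probability 1/n at each iteration."""
    x, anchor = x0.copy(), x0.copy()
    z = np.stack([grad_i(i, anchor) for i in range(n)])  # gradients at anchor
    zbar = z.mean(axis=0)
    for _ in range(steps):
        i = int(rng.integers(n))
        g = grad_i(i, x) - z[i] + zbar       # variance-reduced estimate
        x = prox(x - eta * g, eta)           # proximal (here: identity) step
        if rng.random() < 1.0 / n:           # random anchor update
            anchor = x.copy()
            z = np.stack([grad_i(j, anchor) for j in range(n)])
            zbar = z.mean(axis=0)
    return x
```

Note how only the anchor and the average gradient need to persist across iterations when the per-index gradients can be recomputed on demand, matching the $O(p)$ memory discussion above.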

#### SAGA.

The estimator $g_t$ has a form similar to (10), but with a different choice of auxiliary variables $\tilde{z}_i$. Unlike SVRG, which stores an anchor point $\tilde{x}$, the SAGA estimator requires storing and incrementally updating the $n$ auxiliary variables $\tilde{z}_i$, while maintaining the relation $\bar{z} = \frac{1}{n}\sum_{i=1}^{n}\tilde{z}_i$. We consider variants such that each gradient observation is corrupted by a random perturbation; to deal with non-uniform sampling, we use a strategy similar to that of Schmidt et al. (2015). The corresponding algorithm is given in Appendix A.
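A single SAGA-style update can be sketched as follows (uniform sampling, no perturbation; the naming is ours):

```python
import numpy as np

def saga_step(x, table, grad_i, i, eta, prox=lambda u, s: u):
    """One SAGA-style update: build the estimate
    g = grad f_i(x) - y_i + mean(y), take a prox step, then overwrite the
    stored gradient y_i. `table` plays the role of the auxiliary variables."""
    gi = grad_i(i, x)
    g = gi - table[i] + table.mean(axis=0)   # unbiased, variance-reduced
    table[i] = gi                            # incremental table update
    return prox(x - eta * g, eta)
```

In practice the running mean of the table is maintained incrementally rather than recomputed, which keeps the per-iteration cost independent of $n$.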

## 3 Convergence Analysis and Robustness

In Section 3.1, we present a general convergence result, and the analysis with variance reduction is presented in Section 3.2. All proofs are in the appendix.

### 3.1 Generic Convergence Result

The following proposition gives a key relation between the quantity $F(x_t) - F^\star$, the surrogate $d_t$, and the variance $\sigma_t^2$.

###### Proposition 1 (Key relation).

Then, without making further assumptions on $\sigma_t$, we have the following general convergence result, which is a direct consequence of the averaging Lemma C.8 in the appendix, inspired in part by Ghadimi & Lan (2012):

###### Theorem 1 (General convergence result).

Theorem 1 allows us to recover convergence rates for various algorithms. In the corollary below, we consider the stochastic setting with constant step sizes; the algorithm converges at the same rate as for the deterministic problem, up to a noise-dominated region whose radius depends on $\sigma^2$. The proof simply uses Lemma C.6, which provides the convergence rate of the surrogates, and the relation from Lemma C.5 in the appendix.

###### Corollary 1 (Prox-SGD with constant step-size).

We now show that it is also possible to obtain converging algorithms by using decreasing step sizes.

###### Corollary 2 (Prox-SGD with decreasing step-sizes).

Assume that we target an accuracy $\varepsilon$. First, use iteration (A) as in Theorem 1 with a constant step size, leading to the convergence rate (12), until the iterates reach the noise-dominated region. Then, restart the optimization procedure with decreasing step sizes. The resulting number of gradient evaluations to achieve an $\varepsilon$-accurate solution is upper bounded by

We note that the dependency on $\sigma^2$ with the rate $O(\sigma^2/(\mu t))$ is optimal for strongly convex functions (Nemirovski et al., 2009). Unfortunately, estimating $\sigma^2$ is not easy, and knowing in practice when to start decreasing the step sizes in SGD algorithms is an open problem. The corollary simply supports the heuristic consisting of adopting a constant step size long enough, until the iterates oscillate without much progress, before decreasing the step sizes (see Bottou et al., 2018).

### 3.2 Faster Convergence with Variance Reduction

Stochastic variance-reduced algorithms rely on gradient estimates whose variance decreases as fast as the objective. Our framework provides a unified proof of convergence for our variants of SVRG and SAGA and makes them robust to stochastic perturbations. Specifically, we consider the minimization of a finite sum of functions as in (3), where each observation of a gradient is corrupted by a random noise variable. The next proposition extends the proof of SVRG (Xiao & Zhang, 2014) and characterizes the variance $\sigma_t^2$ of the corresponding gradient estimates.

###### Proposition 2 (Generic Upper-Bound on Variance).

Next, we apply this result to Proposition 1.

###### Proposition 3 (Lyapunov function).

From the Lyapunov function, we obtain a general convergence result for the variance-reduced stochastic algorithms.

###### Theorem 2 (Convergence with variance-reduction).

The theorem is a direct application of the averaging Lemma C.8 to Proposition 3. From this generic convergence theorem, we now study particular cases.

###### Corollary 3 (Variance-reduction with constant ).

This first corollary shows that the algorithm achieves a linear convergence rate to a noise-dominated region. Interestingly, the variant *without averaging* does not require computing an averaged iterate and produces its iterates without using the strong convexity constant $\mu$. *This shows that all the estimators we consider can become adaptive to $\mu$.*

Moreover, we note that the non-uniform strategy slightly degrades the dependency on the noise term: indeed, $\rho_Q = 1$ and $\bar{L}_Q = \max_i L_i$ if $Q$ is uniform, but with a non-uniform $Q$, we have instead $\bar{L}_Q = \frac{1}{n}\sum_{i=1}^{n} L_i$ (which is better) and $\rho_Q \geq 1$ (which is worse).

###### Corollary 4 (Variance-reduction with decreasing ).

Consider the same setting as in Theorem 2 and target an accuracy $\varepsilon$. First, use a constant step-size strategy until we find a point inside the noise-dominated region. Then, restart the optimization with decreasing step sizes. The number of gradient evaluations to achieve an $\varepsilon$-accurate solution is upper bounded by

The corollary shows that variance-reduction algorithms may exhibit an optimal dependency on the noise level $\tilde{\sigma}^2$.
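The constant-then-decreasing schedule shared by Corollaries 2 and 4 can be sketched as a simple rule; the exact constants and restart test of the paper are omitted here, so this only captures the shape of the schedule:

```python
def two_stage_step_size(t, t_switch, eta_const):
    """Hedged sketch of a constant-then-decreasing schedule: keep a constant
    step size until the restart time t_switch, then decay as O(1/t).
    `t_switch` and `eta_const` stand in for the paper's problem-dependent
    constants."""
    return eta_const if t < t_switch else eta_const * t_switch / t
```

The first stage drives the iterates into the noise-dominated region at a fast rate; the $O(1/t)$ decay of the second stage then yields the optimal sublinear dependency on the noise.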

## 4 Accelerated Stochastic Algorithms

We now consider the following iteration, which involves an extrapolation sequence $(y_t)$, a classical mechanism from accelerated first-order algorithms (Beck & Teboulle, 2009; Nesterov, 2013). Given a sequence of step sizes $(\eta_t)$ bounded for all $t$, and $\gamma_0 \geq \mu$, we consider the sequences $(\gamma_t)$ and $(\delta_t)$ that satisfy (7).

Then, for $t \geq 1$, we consider the iteration

$$x_t = \mathrm{prox}_{\eta_t\psi}\left[y_{t-1} - \eta_t g_t\right] \quad \text{and} \quad y_t = x_t + \beta_t(x_t - x_{t-1}), \tag{B}$$

where $\mathbb{E}[g_t \mid \mathcal{F}_{t-1}] = \nabla f(y_{t-1})$ and $\beta_t$ is an extrapolation parameter. Iteration (B) resembles the accelerated SGD approaches of Hu et al. (2009); Ghadimi & Lan (2012); Lin et al. (2014), but is slightly simpler since it involves two sequences of variables instead of three.
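The two-sequence structure of such accelerated iterations can be sketched as follows, with a generic momentum parameter standing in for the paper's extrapolation sequence:

```python
import numpy as np

def accelerated_prox_step(x_prev, y, grad, eta, beta, prox=lambda u, s: u):
    """Generic sketch of one extrapolated update: a proximal gradient step
    taken at the extrapolated point y, followed by extrapolation with a
    momentum parameter beta (a stand-in for the paper's sequence)."""
    x = prox(y - eta * grad(y), eta)     # prox step at the extrapolated point
    y_next = x + beta * (x - x_prev)     # extrapolation (momentum)
    return x, y_next
```

Only the pair $(x_t, y_t)$ needs to be carried between iterations, which is the "two sequences instead of three" simplification mentioned above.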

### 4.1 Convergence Analysis without Variance Reduction

Consider then the stochastic estimate sequence introduced in (6), with $d_0$ defined as in (5) and

$$\hat{l}_t(x) = f(y_{t-1}) + g_t^\top(x - y_{t-1}) + \frac{\mu}{2}\|x - y_{t-1}\|^2 + \psi(x_t) + \xi_t^\top(x - x_t), \tag{15}$$

where $\xi_t$ is in $\partial\psi(x_t)$ by definition of the proximal operator. As in Section 2, $d_t$ asymptotically becomes a lower bound on $F^\star$ since (8) remains satisfied. This time, the iterate $x_t$ does not minimize $d_t$, and we denote by $v_t$ instead its minimizer, allowing us to write $d_t$ in the canonical form

$$d_t(x) = d_t^\star + \frac{\gamma_t}{2}\|x - v_t\|^2.$$

The first lemma highlights classical relations between the iterates $x_t$, $y_t$ and the minimizers $v_t$, which also appear in (Nesterov, 2004, p. 78) for constant step sizes. Note that the construction of our stochastic estimate sequence resembles that of (Devolder, 2011; Lin et al., 2014). The main difference lies in the choice of the function $\hat{l}_t$ in (15), which yields a different algorithm and slightly stronger guarantees.

###### Lemma 1 (Relations between and ).

The sequences $(x_t)$ and $(v_t)$ produced by iteration (B) satisfy, for all $t \geq 1$, with $v_0 = x_0$,

Then, the next lemma will be used to prove that , where is a noise term, such that , according to (8).

###### Lemma 2 (Key lemma for acceleration).

Finally, we obtain the following convergence result.

###### Theorem 3 (Convergence of accelerated SGD).

We now specialize the theorem to various practical cases.

###### Corollary 5 (Prox accelerated SGD with constant ).

We now show that, with decreasing step sizes, we obtain an algorithm with optimal complexity similar to (Ghadimi & Lan, 2013; Cohen et al., 2018; Aybat et al., 2019), though we use a two-stage algorithm only.

###### Corollary 6 (Prox accelerated SGD with decreasing ).

Target an accuracy $\varepsilon$. First, use a constant step size within iteration (B) until the iterates reach the noise-dominated region. Then, restart the optimization procedure with decreasing step sizes. The number of gradient evaluations to achieve an $\varepsilon$-accurate solution is upper bounded by

### 4.2 Accelerated Algorithm with Variance Reduction

Next, we show how to build accelerated algorithms with the random-SVRG gradient estimator. First, we control the variance of the estimator in a manner similar to Katyusha (Allen-Zhu, 2017), as stated in the next proposition. Note that the estimator here does not require storing the seeds of the random perturbations, unlike in the previous section, and does not rely on an averaging procedure (hence preserving the potential sparsity of the solution when $\psi$ is sparsity-inducing).
