We consider convex optimization problems of the form
$$\min_{x \in \mathbb{R}^p} \Big\{ F(x) := f(x) + \psi(x) \Big\}, \qquad (1)$$
where $f$ is $\mu$-strongly convex and $L$-smooth (differentiable with an $L$-Lipschitz continuous gradient), and $\psi$ is convex and lower-semicontinuous. For instance, $\psi$ may be the $\ell_1$-norm, but it may also be the indicator function of a convex set for constrained problems (Hiriart-Urruty & Lemaréchal, 1996).
More specifically, we focus on stochastic objectives, where $f$ is an expectation or a finite sum of convex functions:
$$f(x) = \mathbb{E}_{\xi}\big[\tilde f(x,\xi)\big] \quad \text{or} \quad f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x). \qquad (2)$$
On the left, $\xi$ is a random variable representing a data point drawn according to some distribution, and $\tilde f(x,\xi)$ measures the fit of a model parameter $x$ to the data point $\xi$.
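As a concrete illustration, the following sketch evaluates a toy instance of a composite objective of this form, with $f$ an $\ell_2$-regularized least-squares finite sum (strongly convex and smooth) and $\psi$ the $\ell_1$-norm; all names and data here are illustrative assumptions, not part of the paper.

```python
import numpy as np

def composite_objective(x, A, b, lam_l2, lam_l1):
    """Evaluate F(x) = f(x) + psi(x) for a toy instance of (1)-(2):
    f is an l2-regularized least-squares finite sum (mu-strongly convex,
    L-smooth) and psi is the l1-norm (convex, nonsmooth)."""
    n = A.shape[0]
    residuals = A @ x - b
    f = (0.5 / n) * residuals @ residuals + 0.5 * lam_l2 * x @ x
    psi = lam_l1 * np.abs(x).sum()
    return f + psi

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)
F0 = composite_objective(np.zeros(5), A, b, lam_l2=0.1, lam_l1=0.05)
```

At $x = 0$ the nonsmooth term vanishes and the value reduces to the average squared residual, which gives a quick sanity check.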
While the finite-sum setting is a particular case of expectation, the deterministic nature of the resulting cost function drastically changes the performance guarantees an optimization method may achieve when solving (1). In particular, when an algorithm is only allowed to access unbiased measurements of the objective and gradient, it may be shown that the worst-case convergence rate in expected function value cannot be better than $O(1/t)$ in general, where $t$ is the number of iterations (Nemirovski et al., 2009; Agarwal et al., 2012).
Even though this pessimistic result applies to the general stochastic case, linear convergence rates can be obtained for deterministic finite sums. For instance, linear rates are achieved by SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014), SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), SDCA (Shalev-Shwartz & Zhang, 2016), MISO (Mairal, 2015), Katyusha (Allen-Zhu, 2017), MiG (Zhou et al., 2018), SARAH (Nguyen et al., 2017), accelerated SAGA (Zhou, 2019) or Lan & Zhou (2018a). In the non-convex case, a recent focus has been on improving convergence rates for finding first-order stationary points (Lei et al., 2017; Paquette et al., 2018; Fang et al., 2018), which is however beyond the scope of our paper. A common interpretation is to see these algorithms as performing SGD steps with an estimate of the gradient that has lower variance.
In this paper, we are interested in providing a unified view of such stochastic optimization algorithms, but we also want to investigate their robustness to random perturbations. Specifically, we consider objective functions with an explicit finite-sum structure such as (2) when only noisy estimates of the gradients are available. Such a setting may occur for various reasons: perturbations may be injected during training to achieve better generalization (Srivastava et al., 2014), to perform stable feature selection (Meinshausen & Bühlmann, 2010), for privacy-aware learning (Wainwright et al., 2012), or to improve model robustness (Zheng et al., 2016).
Each point indexed by $i$ is then corrupted by a random perturbation $\rho$, and the function $f_i$ may be written as
$$f_i(x) = \mathbb{E}_{\rho}\big[\tilde f_i(x,\rho)\big]. \qquad (3)$$
Since the exact gradients cannot be computed, none of the aforementioned variance-reduction methods applies, and the standard approach is to use SGD. Typically, the variance of the gradient estimate then decomposes into two parts $\sigma_{\mathrm{tot}}^2 = \sigma_s^2 + \tilde\sigma^2$, where $\sigma_s^2$ is due to data sampling and $\tilde\sigma^2$ to the random perturbation. In such a context, variance reduction consists of designing algorithms whose convergence rate depends on $\tilde\sigma^2$, which is potentially much smaller than $\sigma_{\mathrm{tot}}^2$. The SAGA and SVRG methods have been adapted for such a purpose by Hofmann et al. (2015a), though the resulting algorithms have non-zero asymptotic error; the MISO method was adapted by Bietti & Mairal (2017) at the cost of a memory overhead of $O(np)$, whereas other variants of SAGA and SVRG have been proposed by Zheng & Kwok (2018) for linear models in machine learning.
In this paper, we extend estimate sequences introduced by Nesterov (2004), which are typically used to design accelerated algorithms. Estimate sequences have been used before for stochastic optimization (Devolder, 2011; Lin et al., 2014; Lu & Xiao, 2015), but not for the following generic purpose:
First, we make a large class of variance-reduction methods robust to stochastic perturbations. More precisely, by using a sampling strategy to select indices of the sum (3) at each iteration, when each function $f_i$ is convex and $L_i$-smooth, the worst-case iteration complexity of our approaches in function values—that is, the number of iterations to guarantee $\mathbb{E}[F(x_t)] - F^* \le \varepsilon$—is upper bounded by
$$O\!\left(\left(n + \frac{\bar L}{\mu}\right)\log\!\left(\frac{1}{\varepsilon}\right)\right) + O\!\left(\frac{\tilde\sigma_Q^2}{\mu\varepsilon}\right),$$
where $\bar L$ and $\tilde\sigma_Q^2$ depend on the sampling distribution $Q$. For the uniform distribution, we have $\bar L = \max_i L_i$ and $\tilde\sigma_Q^2 = \tilde\sigma^2$, whereas a non-uniform $Q$ may yield $\bar L = \frac{1}{n}\sum_{i=1}^{n} L_i$. The term on the left corresponds to the complexity of variance-reduction methods without perturbation, and $O(\tilde\sigma_Q^2/(\mu\varepsilon))$ is the optimal sublinear rate of convergence for a stochastic optimization problem when the gradient estimates have variance $\tilde\sigma_Q^2$. In contrast, a variant of SGD applied to (3) has worst-case complexity $O(\sigma_{\mathrm{tot}}^2/(\mu\varepsilon))$, with potentially $\sigma_{\mathrm{tot}}^2 \gg \tilde\sigma_Q^2$.
Second, we design a new accelerated algorithm which, to our knowledge, is the first one to achieve the complexity
$$O\!\left(\left(n + \sqrt{\frac{n\bar L}{\mu}}\right)\log\!\left(\frac{1}{\varepsilon}\right)\right) + O\!\left(\frac{\tilde\sigma_Q^2}{\mu\varepsilon}\right),$$
where the term on the left matches the optimal complexity for finite sums when $\tilde\sigma^2 = 0$ (Arjevani & Shamir, 2016), which has been achieved by Allen-Zhu (2017); Zhou et al. (2018); Zhou (2019); Kovalev et al. (2019). The general stochastic finite-sum problem with $\tilde\sigma^2 > 0$ was also considered with an acceleration mechanism by Lan & Zhou (2018b) for distributed optimization; their method is optimal in terms of communication rounds, but not in terms of global complexity.
Note that we treat here only the strongly convex case, but similar results can be obtained when $\mu = 0$, as shown in a long version of this paper (Kulunchakov & Mairal, 2019).
2 A Generic Framework
In this section, we introduce stochastic estimate sequences and show how they can handle variance reduction.
2.1 A Classical Iteration Revisited
Consider an algorithm that performs the following updates: for $t \ge 1$,
$$x_t = \operatorname{prox}_{\eta_t \psi}\big(x_{t-1} - \eta_t g_t\big) \quad \text{with} \quad \mathbb{E}\big[g_t \mid \mathcal{F}_{t-1}\big] = \nabla f(x_{t-1}), \qquad (\mathrm{A})$$
where $\mathcal{F}_{t-1}$ is the filtration representing all information up to iteration $t-1$, $\eta_t$ is a step size, and $\operatorname{prox}_{\lambda\psi}$ is the proximal operator (Moreau, 1962) defined for any scalar $\lambda > 0$ as the unique solution of
$$\operatorname{prox}_{\lambda\psi}(u) := \operatorname*{arg\,min}_{x \in \mathbb{R}^p}\left\{\lambda\psi(x) + \frac{1}{2}\|u - x\|^2\right\}. \qquad (4)$$
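For intuition, when $\psi$ is the $\ell_1$-norm the proximal operator has a well-known closed form, soft-thresholding. The sketch below (purely illustrative; the helper names are ours) implements it together with one proximal-gradient update of the form (A):

```python
import numpy as np

def prox_l1(u, lam):
    """Proximal operator of psi = lam * ||.||_1 (soft-thresholding):
    argmin_x { lam * ||x||_1 + 0.5 * ||u - x||^2 }, solved coordinate-wise."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def prox_grad_step(x, g, eta, lam):
    """One update of the form (A) with a (possibly stochastic) gradient estimate g."""
    return prox_l1(x - eta * g, eta * lam)
```

With `lam = 0` the operator reduces to the identity and the update becomes a plain (stochastic) gradient step.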
Key to our analysis, we interpret (A) as the iterative minimization of quadratic surrogate functions.
Interpretation with stochastic estimate sequences.
Consider the function $d_0(x) = d_0^* + \frac{\mu}{2}\|x - x_0\|^2$, where $d_0^*$ is a scalar value that is left unspecified at the moment. Then, it is easy to show that $x_t$ in (A) minimizes the function $d_t$ defined recursively for $t \ge 1$ as
$$d_t(x) = (1-\delta_t)\,d_{t-1}(x) + \delta_t\, l_t(x), \qquad (6)$$
where $\delta_t$ and $l_t$ satisfy the system of equations
$$\delta_t = \mu\eta_t \quad\text{and}\quad l_t(x) = f(x_{t-1}) + g_t^\top(x - x_{t-1}) + \frac{\mu}{2}\|x - x_{t-1}\|^2 + \psi(x_t) + \psi'(x_t)^\top(x - x_t)$$
(note that $\eta_t = 1/L$ yields $\delta_t = \mu/L$ for all $t$). Here, $\psi'(x_t)$ is a subgradient in $\partial\psi(x_t)$. By simply using the definition of the proximal operator (4) and considering first-order optimality conditions, we indeed have $x_{t-1} - x_t - \eta_t g_t \in \eta_t\,\partial\psi(x_t)$, and $x_t$ coincides with the minimizer of $d_t$. This allows us to write $d_t$ in the canonical form
$$d_t(x) = d_t^* + \frac{\mu}{2}\|x - x_t\|^2 \quad\text{with}\quad d_t^* = \min_{x} d_t(x).$$
The construction (6) is akin to that of estimate sequences introduced by Nesterov (2004), which are typically used for designing accelerated gradient-based optimization algorithms. In this section, we are however not interested in acceleration, but instead in stochastic optimization and variance reduction. One of the main properties of estimate sequences that we will nevertheless use is their ability to behave asymptotically as a lower bound. Indeed, we have
$$\mathbb{E}\big[d_t(x^*)\big] \le (1-\Gamma_t)\,F^* + \Gamma_t\, d_0(x^*), \qquad (8)$$
where $x^*$ is a minimizer of $F$ and $\Gamma_t = \prod_{k=1}^{t}(1-\delta_k)$. The inequality comes from strong convexity, since $\mathbb{E}[l_t(x^*)\mid\mathcal{F}_{t-1}] \le F(x^*)$, leading to the relation $\mathbb{E}[d_t(x^*)] \le (1-\delta_t)\,\mathbb{E}[d_{t-1}(x^*)] + \delta_t F^*$. Then, by unrolling the recursion, we obtain (8). When $\Gamma_t$ converges to zero, the contribution of the initial surrogate $d_0$ disappears and $\mathbb{E}[d_t]$ behaves asymptotically as a lower bound of $F$.
Relation with existing algorithms.
The iteration (A) encompasses many approaches, such as ISTA (the proximal gradient method), which uses the exact gradient $g_t = \nabla f(x_{t-1})$ (Beck & Teboulle, 2009; Nesterov, 2013), or proximal variants of the stochastic gradient descent method to deal with a composite objective (Lan, 2012). Of interest to us, the variance-reduced stochastic optimization approaches SVRG (Xiao & Zhang, 2014) and SAGA (Defazio et al., 2014) also follow (A), but with an unbiased gradient estimator whose variance reduces over time. Specifically, they use
$$g_t = \nabla f_{i_t}(x_{t-1}) - z_{i_t}^{t-1} + \frac{1}{n}\sum_{i=1}^{n} z_i^{t-1},$$
where $i_t$ is an index chosen uniformly at random in $\{1,\dots,n\}$, and each auxiliary variable $z_i^{t-1}$ is equal to a gradient $\nabla f_i(\tilde x_i)$, where $\tilde x_i$ is one of the previous iterates. The motivation is that given two random variables $X$ and $Y$, it is possible to define a new variable $Z = X - Y + \mathbb{E}[Y]$, which has the same expectation as $X$ but potentially a lower variance if $Y$ is positively correlated with $X$. SVRG uses the same anchor point $\tilde x_i = \tilde x$ for all $i$, where $\tilde x$ is updated every $m$ iterations. Typically, the memory cost of SVRG is that of storing the variable $\tilde x$ and the gradient $\nabla f(\tilde x)$, which is thus $O(p)$. On the other hand, SAGA updates $\tilde x_i$ only at the iterations $t$ at which index $i$ is drawn, such that $z_i^t = z_i^{t-1}$ if $i \neq i_t$. Thus, SAGA requires storing $n$ gradients. While in general the overhead cost in memory is of order $O(np)$, it may be reduced to $O(n)$ when dealing with linear models in machine learning (see Defazio et al., 2014). Note that variants with non-uniform sampling of the indices have been proposed by Xiao & Zhang (2014); Schmidt et al. (2015), which we discuss later.
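The control-variate principle invoked above can be checked numerically; the small sketch below (with arbitrary illustrative values) confirms that $Z = X - Y + \mathbb{E}[Y]$ preserves the mean while shrinking the variance when $Y$ tracks $X$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Control-variate principle: Z = X - Y + E[Y] keeps the mean of X but has
# lower variance when Y is positively correlated with X (here, the empirical
# mean of Y stands in for its expectation).
X = rng.standard_normal(100_000)
Y = X + 0.1 * rng.standard_normal(100_000)  # strongly correlated with X
Z = X - Y + Y.mean()
assert abs(Z.mean() - X.mean()) < 1e-9      # same mean
assert Z.var() < X.var()                    # much smaller variance
```

In the variance-reduced estimators, $Y$ plays the role of the stored gradient at the anchor point, so the variance vanishes as the anchors approach the optimum.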
In order to make our proofs consistent for SAGA and SVRG (and MISO in Kulunchakov & Mairal, 2019), we consider a variant of SVRG with a randomized schedule for updating the anchor point (Hofmann et al., 2015a). Remarkably, this variant was shown to provide benefits over the fixed schedule in a concurrent work (Kovalev et al., 2019).
2.2 Gradient Estimators and New Algorithms
In this paper, we consider the gradient estimators below. For all of them, we define the variance of the estimator to be $\sigma_t^2 = \mathbb{E}\big[\|g_t - \nabla f(x_{t-1})\|^2\big]$.
Stochastic gradient estimator.
Simply consider $g_t = \nabla \tilde f(x_{t-1}, \xi_t)$, where $\xi_t$ is drawn i.i.d. from the data distribution. We assume that $g_t$ has variance bounded by $\sigma^2$. Typically, when $f$ is an expectation as in (2), a data point $\xi_t$ is drawn at iteration $t$ and the gradient of the corresponding loss is used. Even though the bounded-variance assumption has limitations, it remains the most standard one for stochastic optimization, and more realistic settings (such as (Bottou et al., 2018; Nguyen et al., 2018) for the smooth case) are left for future work.
Random-SVRG estimator.
For finite sums, we consider a variant of SVRG with random update of the anchor point, proposed originally in (Hofmann et al., 2015b), combined with non-uniform sampling. Specifically, $g_t$ is defined as
$$g_t = \frac{1}{q_{i_t} n}\big(\tilde\nabla f_{i_t}(x_{t-1}) - z_{i_t}^{t-1}\big) + \bar z^{t-1} \quad\text{with}\quad \bar z^{t-1} = \frac{1}{n}\sum_{i=1}^{n} z_i^{t-1}, \qquad (10)$$
where $i_t$ is drawn from a distribution $Q = \{q_1,\dots,q_n\}$ and $\tilde\nabla$ denotes a perturbed gradient operator. For instance, if $f_i(x) = \mathbb{E}_\rho[\tilde f_i(x,\rho)]$ for all $i$, where $\rho$ is a stochastic perturbation, instead of accessing $\nabla f_i(x)$, we draw a perturbation $\rho_t$ and observe
$$\tilde\nabla f_i(x) = \nabla \tilde f_i(x, \rho_t) = \nabla f_i(x) + \zeta_t,$$
where the perturbation $\zeta_t$ has zero mean given $\mathcal{F}_{t-1}$ and its variance is bounded by $\tilde\sigma^2$. When considering the setting without perturbation, we simply have $\tilde\nabla f_i = \nabla f_i$.
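A minimal sketch of such a perturbed first-order oracle, assuming additive Gaussian noise and a hypothetical `grad_fn` for the component gradients (both are illustrative choices, not the paper's):

```python
import numpy as np

def perturbed_grad(grad_fn, x, i, noise_scale, rng):
    """Perturbed gradient oracle: return grad f_i(x) + zeta, where zeta has
    zero mean and standard deviation noise_scale per coordinate (an
    additive-noise special case of the perturbation model)."""
    zeta = noise_scale * rng.standard_normal(x.shape)
    return grad_fn(x, i) + zeta

# With noise_scale = 0 the oracle reduces to the exact component gradient.
grad_fn = lambda x, i: (i + 1.0) * x      # hypothetical component gradients
g = perturbed_grad(grad_fn, np.ones(2), 1, 0.0, np.random.default_rng(0))
```

The oracle is unbiased by construction, which is the only property the analysis requires of the perturbation.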
Similar to the previous case, the variables $z_i^{t-1}$ and $\bar z^{t-1}$ also correspond to noisy estimates of the gradients. Specifically,
$$z_i^{t-1} = \tilde\nabla f_i(\tilde x^{t-1}) \quad\text{and}\quad \bar z^{t-1} = \frac{1}{n}\sum_{i=1}^{n} z_i^{t-1},$$
where $\tilde x^{t-1}$ is an anchor point that is updated on average every $n$ iterations. Whereas the classical SVRG approach updates the anchor point on a fixed schedule, we perform random updates: with probability $1/n$, we choose $\tilde x^t = x_t$ and recompute $\bar z^t$; otherwise, $\tilde x^t$ is kept unchanged. In comparison with the fixed schedule, the analysis with the random one is simplified and can be unified with that of SAGA. This approach is described in Algorithm 1.
In terms of memory, the random-SVRG gradient estimator requires storing an anchor point $\tilde x^{t-1}$ and the average of gradients $\bar z^{t-1}$. The $z_i^{t-1}$'s do not need to be stored; only the random seeds used to produce the perturbations are kept in memory, which allows us to recompute $z_i^{t-1} = \tilde\nabla f_i(\tilde x^{t-1})$ at iteration $t$, with the same perturbation for index $i$ that was used when the anchor point $\tilde x^{t-1}$ was last updated. The overall memory cost is thus $O(p)$.
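The seed-storage trick can be sketched as follows: seeding a generator per (index, anchor) pair makes the perturbed gradient reproducible, so the $z_i$'s can be recomputed on demand instead of stored (the helper and the additive-noise model are illustrative assumptions):

```python
import numpy as np

def perturbed_grad_with_seed(grad_fn, x, i, seed, noise_scale):
    """Recompute the perturbed gradient for index i deterministically from a
    stored seed: replaying the seed reproduces the exact same perturbation,
    so z_i never needs to be stored explicitly (O(p) memory instead of O(np))."""
    rng = np.random.default_rng(seed)
    return grad_fn(x, i) + noise_scale * rng.standard_normal(x.shape)

# Replaying the seed reproduces the exact same noisy gradient at the anchor.
grad_fn = lambda x, i: 2.0 * x  # hypothetical per-index gradient
anchor = np.ones(3)
g1 = perturbed_grad_with_seed(grad_fn, anchor, 0, seed=42, noise_scale=0.1)
g2 = perturbed_grad_with_seed(grad_fn, anchor, 0, seed=42, noise_scale=0.1)
assert np.array_equal(g1, g2)
```

Only one integer per index needs to be kept (or a single seed from which per-index seeds are derived), which is how the $O(p)$ memory cost is achieved.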
SAGA estimator.
The estimator has a form similar to (10), but with a different choice of variables $z_i^{t-1}$. Unlike SVRG, which stores an anchor point, the SAGA estimator requires storing and incrementally updating the $n$ auxiliary variables $z_i^{t-1}$, while maintaining the relation $\bar z^{t-1} = \frac{1}{n}\sum_{i=1}^{n} z_i^{t-1}$. We consider variants such that each gradient observed is corrupted by a random perturbation; to deal with non-uniform sampling, we use a strategy similar to that of Schmidt et al. (2015). The corresponding algorithm is available in Appendix A.
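To make the bookkeeping concrete, here is a minimal sketch of one SAGA-style update with uniform sampling and an $\ell_1$ regularizer; it illustrates the general idea only, not the exact algorithm of Appendix A (which additionally handles perturbations and non-uniform sampling):

```python
import numpy as np

def saga_step(x, z, z_bar, grad_fn, eta, lam, rng):
    """One SAGA-style update (uniform sampling, l1 regularizer): draw an index,
    form the variance-reduced estimate of the form (10), take a proximal step,
    then update the stored gradient table z and its running average z_bar."""
    n = z.shape[0]
    i = rng.integers(n)
    g_new = grad_fn(x, i)
    g = g_new - z[i] + z_bar                   # variance-reduced estimate
    u = x - eta * g
    x_next = np.sign(u) * np.maximum(np.abs(u) - eta * lam, 0.0)  # prox of l1
    z_bar = z_bar + (g_new - z[i]) / n         # maintain z_bar = mean of table
    z[i] = g_new
    return x_next, z, z_bar
```

On a noise-free quadratic finite sum, repeated calls drive the iterate to the exact minimizer, illustrating the linear convergence that plain SGD cannot achieve.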
3 Convergence Analysis and Robustness
3.1 Generic Convergence Result
The following proposition gives a key relation between the expected objective value $\mathbb{E}[F(x_t)]$, the surrogate $d_t$, and the variance $\sigma_t^2$ of the gradient estimates.
Proposition 1 (Key relation).
For iteration (A), assuming $\eta_t \le 1/L$, we have for all $t \ge 1$,
$$\mathbb{E}\big[F(x_t)\big] \le \mathbb{E}\big[d_t^*\big] + \eta_t\,\sigma_t^2,$$
where $d_t^* = \min_x d_t(x)$; combined with (8) evaluated at a minimizer $x^*$ of $F$, this relation controls $\mathbb{E}[F(x_t)] - F^*$.
Then, without making further assumptions on $\sigma_t^2$, we have the following general convergence result, which is a direct consequence of the averaging Lemma C.8 in the appendix, inspired in part by Ghadimi & Lan (2012):
Theorem 1 (General convergence result).
Theorem 1 allows us to recover convergence rates for various algorithms. In the corollary below, we consider the stochastic setting with constant step sizes; the algorithm converges at the same rate as for the deterministic problem, up to a noise-dominated region of radius $O(\eta\sigma^2)$. The proof simply uses Lemma C.6, which provides the convergence rate of the quantities $\mathbb{E}[d_t^*]$, together with a relation from Lemma C.5 in the appendix.
Corollary 1 (Prox-SGD with constant step-size).
Assume in Theorem 1 that the variance is bounded by $\sigma^2$, and choose a constant step-size $\eta_t = \eta \le 1/L$, so that $\delta_t = \mu\eta$. Then,
$$\mathbb{E}\big[F(\bar x_t) - F^*\big] \le (1-\mu\eta)^t\big(d_0(x^*) - F^*\big) + \eta\sigma^2, \qquad (12)$$
where $\bar x_t$ is an averaged iterate.
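The behavior described by the corollary is easy to reproduce on a toy strongly convex problem: with a constant step size, the iterates contract linearly and then hover in a noise-dominated region (all constants below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Prox-SGD (iteration (A)) with a constant step size on the strongly convex
# quadratic f(x) = 0.5 * ||x||^2 with additive gradient noise and psi = 0:
# linear contraction at first, then oscillation in a noise-dominated region.
x, eta, sigma = np.full(3, 10.0), 0.1, 0.5
for t in range(2000):
    g = x + sigma * rng.standard_normal(3)   # unbiased gradient, variance ~ sigma^2
    x = x - eta * g                          # prox of psi = 0 is the identity
# The final error is governed by eta * sigma^2, far below the initial distance.
assert np.linalg.norm(x) < 1.0
```

Shrinking `eta` shrinks the final region but slows the initial linear phase, which is exactly the trade-off the decreasing-step-size corollary resolves.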
We now show that it is also possible to obtain converging algorithms by using decreasing step sizes.
Corollary 2 (Prox-SGD with decreasing step-sizes).
Assume that we target an accuracy $\varepsilon$ smaller than $\sigma^2/L$. First, use iteration (A) as in Theorem 1 with a constant step-size $\eta_t = 1/L$, leading to the convergence rate (12), until the noise-dominated region is reached. Then, restart the optimization procedure with decreasing step-sizes $\eta_t = \min\!\big(1/L,\, 2/(\mu(t+2))\big)$. The resulting number of gradient evaluations to achieve $\mathbb{E}[F(x_t) - F^*] \le \varepsilon$ is upper bounded by
$$O\!\left(\frac{L}{\mu}\log\!\left(\frac{F(x_0)-F^*}{\varepsilon}\right)\right) + O\!\left(\frac{\sigma^2}{\mu\varepsilon}\right).$$
We note that the dependency on $\sigma^2/(\mu\varepsilon)$ with the rate $O(1/t)$ is optimal for strongly convex functions (Nemirovski et al., 2009). Unfortunately, estimating the quantity $F(x_t) - F^*$ is not easy, and knowing in practice when to start decreasing the step sizes in SGD algorithms is an open problem. The corollary simply supports the heuristic consisting of adopting a constant step size long enough, until the iterates oscillate without much progress, before decreasing the step sizes (see Bottou et al., 2018).
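The two-stage schedule suggested by Corollary 2 can be sketched as a simple step-size rule (the constants and the switching point are illustrative assumptions, not the paper's exact choices):

```python
def step_size_schedule(t, L, mu, t_switch):
    """Two-stage schedule in the spirit of Corollary 2: a constant step size
    1/(2L) until iteration t_switch (linear progress toward the noise region),
    then a decay in O(1/(mu * t)), which yields the optimal O(sigma^2/(mu*eps))
    dependency on the noise."""
    if t < t_switch:
        return 1.0 / (2.0 * L)
    return min(1.0 / (2.0 * L), 2.0 / (mu * (t - t_switch + 2)))
```

In practice `t_switch` is unknown, which is precisely the open problem mentioned above; heuristics monitor the progress of the objective to decide when to switch.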
3.2 Faster Convergence with Variance Reduction
Stochastic variance-reduced algorithms rely on gradient estimates whose variance decreases as fast as the objective. Our framework provides a unified proof of convergence for our variants of SVRG and SAGA and makes them robust to stochastic perturbations. Specifically, we consider the minimization of a finite sum of functions as in (3), but each observation of the gradient is corrupted by a random noise variable. The next proposition extends the proof of SVRG (Xiao & Zhang, 2014) and characterizes the variance $\sigma_t^2$ of the gradient estimates.
Proposition 2 (Generic Upper-Bound on Variance).
Next, we apply this result to Proposition 1.
Proposition 3 (Lyapunov function).
From the Lyapunov function, we obtain a general convergence result for the variance-reduced stochastic algorithms.
Theorem 2 (Convergence with variance-reduction).
Corollary 3 (Variance-reduction with constant step-sizes).
Consider the same setting as in Theorem 2, with constant step-sizes $\eta_t = \eta$ and $\delta_t = \mu\eta$. Then,
This first corollary shows that the algorithm achieves a linear convergence rate to a noise-dominated region. Interestingly, the algorithm without averaging does not require computing averaged iterates and produces its iterates without using the strong convexity constant $\mu$. This shows that all estimators we consider can become adaptive to $\mu$.
Moreover, we note that the non-uniform strategy slightly degrades the dependency on the noise: indeed, $\bar L = \max_i L_i$ and $\tilde\sigma_Q^2 = \tilde\sigma^2$ if $Q$ is uniform, but with a non-uniform $Q$, we have instead $\bar L = \frac{1}{n}\sum_{i=1}^{n} L_i$ (which is better) and a larger $\tilde\sigma_Q^2 \ge \tilde\sigma^2$ (which is worse).
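A quick numerical illustration of this trade-off with heterogeneous smoothness constants (the values are arbitrary):

```python
import numpy as np

# With heterogeneous smoothness constants L_i, non-uniform sampling
# (probabilities proportional to L_i) replaces max_i L_i by the mean of the
# L_i in the complexity bound, which can be dramatically smaller.
L_i = np.array([1.0] * 99 + [100.0])
L_bar_uniform = L_i.max()      # governs the uniform-sampling complexity
L_bar_nonuniform = L_i.mean()  # governs the non-uniform-sampling complexity
assert L_bar_nonuniform < L_bar_uniform
```

Here a single badly conditioned component inflates the uniform constant by a factor of about fifty, while the noise term only degrades mildly.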
Corollary 4 (Variance-reduction with decreasing step-sizes).
Consider the same setting as in Theorem 2 and target an accuracy $\varepsilon$. First, use a constant step-size strategy until we find a point $\hat x$ whose objective gap is dominated by the noise. Then, restart the optimization with decreasing step-sizes. The number of gradient evaluations to achieve $\mathbb{E}[F(x_t) - F^*] \le \varepsilon$ is upper bounded by
$$O\!\left(\left(n + \frac{\bar L}{\mu}\right)\log\!\left(\frac{F(x_0)-F^*}{\varepsilon}\right)\right) + O\!\left(\frac{\tilde\sigma_Q^2}{\mu\varepsilon}\right).$$
The corollary shows that variance-reduction algorithms may exhibit an optimal dependency on the noise level $\tilde\sigma^2$.
4 Accelerated Stochastic Algorithms
We now consider the following iteration, which involves an extrapolation sequence $(y_t)_{t \ge 0}$, a classical mechanism from accelerated first-order algorithms (Beck & Teboulle, 2009; Nesterov, 2013). Given a sequence of step-sizes $(\eta_t)_{t \ge 1}$ with $\eta_t \le 1/L$ for all $t$, and $y_0 = x_0$, we consider sequences $(\delta_t)_{t \ge 1}$ and $(\beta_t)_{t \ge 1}$ that satisfy
$$\frac{\delta_t^2}{\eta_t} = (1-\delta_t)\,\frac{\delta_{t-1}^2}{\eta_{t-1}} + \mu\,\delta_t \quad\text{and}\quad \beta_t = \frac{\delta_{t-1}(1-\delta_{t-1})}{\delta_{t-1}^2 + \delta_t}.$$
Then, for $t \ge 1$, we consider the iteration
$$x_t = \operatorname{prox}_{\eta_t\psi}\big(y_{t-1} - \eta_t g_t\big), \qquad y_t = x_t + \beta_t\,(x_t - x_{t-1}), \qquad (\mathrm{B})$$
where $\mathbb{E}[g_t \mid \mathcal{F}_{t-1}] = \nabla f(y_{t-1})$. Iteration (B) resembles the accelerated SGD approaches of Hu et al. (2009); Ghadimi & Lan (2012); Lin et al. (2014), but is slightly simpler since it involves two sequences of variables instead of three.
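A minimal sketch of one such accelerated update, assuming $\psi = 0$ and a constant momentum parameter (the exact $\beta_t$ in (B) follows a recursion; the classical constant value used below is an illustrative simplification):

```python
import numpy as np

def accelerated_step(x_prev, y_prev, grad_at_y, eta, beta, prox):
    """One update in the spirit of (B): a proximal step taken from the
    extrapolated point y, followed by a momentum (extrapolation) update."""
    x = prox(y_prev - eta * grad_at_y, eta)
    y = x + beta * (x - x_prev)
    return x, y

# Deterministic sanity check on an ill-conditioned quadratic with psi = 0.
D = np.array([1.0, 100.0])              # Hessian diagonal: mu = 1, L = 100
prox_id = lambda u, s: u                # prox of psi = 0 is the identity
x = y = np.array([1.0, 1.0])
beta = (10.0 - 1.0) / (10.0 + 1.0)      # (sqrt(kappa)-1)/(sqrt(kappa)+1), kappa = 100
for _ in range(300):
    x, y = accelerated_step(x, y, D * y, 1.0 / 100.0, beta, prox_id)
```

With these classical constants the iterates contract at the accelerated rate, which depends on $\sqrt{L/\mu}$ rather than $L/\mu$.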
4.1 Convergence Analysis without Variance Reduction
where $\psi'(x_t)$ is a subgradient in $\partial\psi(x_t)$ by definition of the proximal operator. As in Section 2, $d_t$ asymptotically becomes a lower bound on $F$ since (8) remains satisfied. This time, the iterate $x_t$ does not minimize $d_t$, and we denote by $v_t$ instead its minimizer, allowing us to write $d_t$ in the canonical form
$$d_t(x) = d_t^* + \frac{\mu}{2}\|x - v_t\|^2.$$
The first lemma highlights classical relations between the iterates $(x_t, y_t)$ and the minimizers $v_t$, which also appear in (Nesterov, 2004, p. 78) for constant step-sizes. Note that the construction of stochastic estimate sequences resembles that of (Devolder, 2011; Lin et al., 2014). The main difference lies in the choice of the function $l_t$ in (15), which yields a different algorithm and slightly stronger guarantees.
Lemma 1 (Relations between $x_t$, $y_t$, and $v_t$).
The sequences $(x_t)_{t \ge 0}$ and $(v_t)_{t \ge 0}$ produced by iteration (B) satisfy, for all $t \ge 1$, with $v_0 = x_0$,
Then, the next lemma will be used to prove that $x_t$ approximately minimizes $d_t$ up to a noise term, so that $\mathbb{E}[F(x_t)]$ can be controlled according to (8).
Lemma 2 (Key lemma for acceleration).
Consider the same sequences as in Lemma 1. Then, for all $t \ge 1$,
with $\Gamma_t = \prod_{k=1}^{t}(1-\delta_k)$ and $\Gamma_0 = 1$.
Finally, we obtain the following convergence result.
Theorem 3 (Convergence of accelerated SGD).
Under the assumptions of Lemma 1, we have for all $t \ge 1$,
where, as before, $\Gamma_t = \prod_{k=1}^{t}(1-\delta_k)$.
We now specialize the theorem to various practical cases.
Corollary 5 (Prox accelerated SGD with constant step-sizes).
Assume that $g_t$ has constant variance $\sigma^2$, and choose $\eta_t = \eta \le 1/L$ and $\delta_t = \sqrt{\mu\eta}$ in Algorithm (B). Then,
We now show that with decreasing step sizes, we obtain an algorithm with optimal complexity, similar to (Ghadimi & Lan, 2013; Cohen et al., 2018; Aybat et al., 2019), though we use only a two-stage algorithm.
Corollary 6 (Prox accelerated SGD with decreasing step-sizes).
Target an accuracy $\varepsilon$ smaller than the noise-dominated level. First, use a constant step-size $\eta_t = \eta \le 1/L$ within Algorithm (B) until the noise-dominated region is reached. Then, we restart the optimization procedure with decreasing step-sizes. The number of gradient evaluations to achieve $\mathbb{E}[F(x_t) - F^*] \le \varepsilon$ is upper bounded by
$$O\!\left(\sqrt{\frac{L}{\mu}}\,\log\!\left(\frac{F(x_0)-F^*}{\varepsilon}\right)\right) + O\!\left(\frac{\sigma^2}{\mu\varepsilon}\right).$$
4.2 Accelerated Algorithm with Variance Reduction
Next, we show how to build accelerated algorithms with the random-SVRG gradient estimator. First, we control the variance of the estimator in a manner similar to Katyusha (Allen-Zhu, 2017), as stated in the next proposition. Note that the estimator here does not require storing the seeds of the random perturbations, unlike in the previous section, and does not rely on an averaging procedure (hence preserving the potential sparsity of the solution when $\psi$ is sparsity-inducing).