We consider convex optimization problems of the form
where is convex and -smooth111A function is -smooth when it is differentiable and its derivative is Lipschitz continuous with constant ., and we call its strong convexity modulus with respect to the Euclidean norm.222Then, means that the function is convex but not strongly convex. The function is convex lower semi-continuous and is not assumed to be necessarily differentiable. For instance, may be the
-norm, which is very popular in signal processing and machine learning for its sparsity-inducing properties(see Mairal et al., 2014, and references therein); may also be the extended-valued indicator function of a convex set that takes the value outside of and inside such that the previous setting encompasses constrained problems (see Hiriart-Urruty and Lemaréchal, 1996).
More specifically, we focus on stochastic objective functions, which are of utmost importance in machine learning, where is an expectation or a finite sum of smooth functions
On the left,
is a random variable representing a data point drawn according to some distribution andmeasures the fit of some model parameter to the data point . Whereas in machine learning formulations the explicit form of the data distribution is unknown, it is assumed that it is possible to draw from it random samples Either an infinite number of such i.i.d. samples are available and the problem of interest is to minimize (1) with , or one has access to a finite training set only, leading to the finite-sum setting on the right of (2), called empirical risk (Vapnik, 2000).
While the finite-sum setting is obviously a particular case of expectation with a discrete probability distribution, thedeterministic nature of the resulting cost function drastically changes the performance guarantees an optimization method may achieve to solve (1). In particular, when an algorithm is only allowed to access unbiased measurements of the objective and gradient—which we assume is the case when is an expectation—it may be shown that the worst-case convergence rate in expected function value cannot be better than in general, where is the number of iterations (Nemirovski et al., 2009; Agarwal et al., 2012). Such a sublinear rate of convergence is notably achieved by stochastic gradient descent (SGD) algorithms or their variants (see Bottou et al., 2018, for a review).
Even though this pessimistic result applies to the general stochastic case, linear convergence rates can be obtained for the finite-sum setting (Schmidt et al., 2017). Specifically, a large body of work in machine learning has led to many randomized incremental approaches such as SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014a), SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014), SDCA (Shalev-Shwartz and Zhang, 2016), MISO (Mairal, 2015), Katyusha (Allen-Zhu, 2017), or the method of Lan and Zhou (2018). These algorithms have about the same cost per-iteration as the stochastic gradient descent method, since they access only a single or two gradients at each iteration, and they may achieve lower computational complexity than accelerated gradient descent methods (Nesterov, 1983, 2004, 2013; Beck and Teboulle, 2009) in expectation. A common interpretation is to see these algorithms as performing SGD steps with an estimate of the full gradient that has lower variance (Xiao and Zhang, 2014).
More precisely, accelerated gradient methods applied to the deterministic finite-sum problem when is -strongly convex are able to provide an iterate such that after evaluations of gradients in the worst case; in contrast, variance-reduced stochastic optimization methods without acceleration achieve the same guarantee in expectation with complexity , where is the maximum Lipschitz constant of the gradients , or the average Lipschitz constant if a non-uniform sampling strategy is used. From a worst-case complexity point of view, the usefulness of these variance-reduced stochastic methods depend on and on the discrepancy between and . Indeed, when is large enough, the complexity of these incremental approaches is simply , which is independent of the condition number and always better than non-incremental first-order methods. Moreover, even though there is no guarantee that , large speed-ups over accelerated first-order methods have been reported on many classical machine learning datasets for incremental approaches (Defazio et al., 2014a; Schmidt et al., 2017), suggesting that is of the same order of magnitude as in many practical cases. Note also that accelerated algorithms for finite sums have also been proposed by Shalev-Shwartz and Zhang (2016); Allen-Zhu (2017); Lan and Zhou (2018); Lin et al. (2018), which we will discuss later.
In this paper, we are interested in providing a unified view of stochastic optimization algorithms, with and without variance reduction, but we also want to investigate their robustness to random pertubations. Specifically, we may consider objective functions with an explicit finite-sum structure such as (2) when only noisy estimates of the gradients are available. Such a setting may occur for various reasons. For instance, perturbations may be injected during training in order to achieve better generalization on new test data (Srivastava et al., 2014)
, perform stable feature selection(Meinshausen and Bühlmann, 2010), improve the model robustness (Zheng et al., 2016), or for privacy-aware learning (Wainwright et al., 2012). Each training point indexed by is corrupted by a random perturbation and the resulting function may be written as
) is a finite sum of functions, we now assume that one has now only access to unbiased estimates of the gradientsdue to the stochastic nature of . Then, all the aforementioned variance-reduction methods do not apply anymore and the standard approach to address this problem is to ignore the finite-sum structure and use SGD or one of its variants. At each iteration, an estimate of the full gradient is obtained by randomly drawing an index in along with a perturbation. Typically, the variance of the gradient estimate then decomposes into two parts , where is due to the random sampling of the index and is due to the random data perturbation. In such a context, variance reduction consists of building gradient estimates with variance , which is potentially much smaller than . The SAGA and SVRG methods were adapted for such a purpose by Hofmann et al. (2015), though the resulting algorithms have non-zero asymptotic error; the MISO method was adapted by Bietti and Mairal (2017) at the cost of a memory overhead of , whereas other variants of SAGA and SVRG were proposed by Zheng and Kwok (2018) for linear models in machine learning.
The framework we adopt is that of estimate sequences introduced by Nesterov (2004), which consists of building iteratively a quadratic model of the objective. Typically, estimate sequences may be used to analyze the convergence of existing algorithms, but also to design new ones, in particular with acceleration. Our construction is however slightly different than the original one since it is based on stochastic estimates of the gradients, and some classical properties of estimate sequences are satisfied only approximately. We note that estimate sequences have been used before for stochastic optimization (Hu et al., 2009; Nitanda, 2014), but not in a generic fashion as we do and not for the same purpose. In summary, our construction leads to the following contributions:
We revisit many stochastic optimization algorithms dealing with composite problems; in particular, we consider methods with variance reduction such as SVRG, SAGA, SDCA, or MISO. Interestingly, we show that all these algorithms admit variants that are adaptive to the strong convexity constant , when only a lower bound is available, a property that was not known for SDCA or MISO.
We provide improvements to the previous algorithms by making them robust to stochastic perturbations. We analyze these approaches under a non-uniform sampling strategy where is the probability of drawing example at each iteration. Typically, when the gradients have different Lipschitz constants
, the uniform distributionyields complexities that depend on , whereas a non-uniform may yield . For strongly convex problems, the resulting worst-case iteration complexity for minimizing (3)—that is, the number of iterations to guarantee —is upper bounded by
where depends on the sampling strategy and for uniform distributions. The term on the left corresponds to the complexity of the variance-reduction methods for a deterministic objective without perturbation, and is the optimal sublinear rate of convergence for a stochastic optimization problem when the gradient estimates have variance . In contrast, a variant of stochastic gradient descent for composite optimization applied to (3) has worst-case complexity , with potentially . Note that the non-uniform sampling strategy potentially reduces and improves the left part, whereas it increases and degrades the dependency on the noise . Whereas non-uniform sampling strategies for incremental methods are now classical (Xiao and Zhang, 2014; Schmidt et al., 2015), the robustness to stochastic perturbations has not been studied for all these methods and existing approaches such as (Hofmann et al., 2015; Bietti and Mairal, 2017; Zheng and Kwok, 2018) have various limitations as discussed earlier.
We show that our construction of estimate sequence naturally leads to an accelerated stochastic gradient method for composite optimization, similar in spirit to (Ghadimi and Lan, 2012, 2013; Hu et al., 2009), but slightly simpler. The resulting complexity in terms of gradient evaluations for -strongly convex objective is
which, to the best of our knowledge, has only been achieved by Ghadimi and Lan (2013). When the objective is convex, but non-strongly convex, we also provide a sublinear convergence rate for finite horizon. Given a budget of iterations, the algorithm returns an iterate such that
which is also optimal for stochastic first-order optimization (Ghadimi and Lan, 2012).
We design a new accelerated algorithm for finite sums based on the SVRG gradient estimator, with complexity, for -strongly convex functions,
where the term on the left is the classical optimal complexity for finite sums (Arjevani and Shamir, 2016). Note that when (deterministic setting) we recover a similar complexity as Katyusha (Allen-Zhu, 2017). When the problem is convex but not strongly convex, given a budget of iterations that is large enough, the algorithm returns a solution such that
where the term on the right is potentially better than (4) for large when (see discussion above on full variance vs. variance due to small stochastic perturbations). When the objective is deterministic (), the term (6) yields a complexity in , which is potentially better than the complexity of accelerated gradient descent, unless is significantly smaller than .
This paper is organized as follows. Section 2 introduces the proposed framework based on estimate sequences; Section 3 is devoted to the convergence analysis; Section 4 presents a variant of SVRG with acceleration; Section 5 presents various experiments to compare the effectiveness of the proposed approaches, and Section 6 concludes the paper.
2 Proposed Framework Based on Stochastic Estimate Sequences
In this section, we present two generic stochastic optimization algorithms to address the composite problem (1). Then, we show their relation to variance-reduction methods.
2.1 A Classical Iteration Revisited
Consider an algorithm that performs the following updates: [leftmargin=0pt,innerleftmargin=6pt,innerrightmargin=6pt]
where is the filtration representing all information up to iteration , is an unbiased estimate of the gradient , is a step size, and is the proximal operator (Moreau, 1962) defined for any scalar as the unique solution of
The iteration (A) is generic and encompasses many existing algorithms, which we review later. Key to our analysis, we are interested in a simple interpretation corresponding to the iterative minimization of strongly convex surrogate functions.
Interpretation with stochastic estimate sequence.
Consider now the function
is a scalar value that is left unspecified at the moment. Then, it is easy to show thatin (A) minimizes the following quadratic function defined for as
where satisfy the system of equations
We note that is a subgradient in . By simply using the definition of the proximal operator (7) and considering first-order optimality conditions, we indeed have that and coincides with the minimizer of . This allows us to write in the generic form
The construction (9) is akin to that of estimate sequences introduced by Nesterov (2004), which are typically used for designing accelerated gradient-based optimization algorithms. In this section, we are however not interested in acceleration, but instead in stochastic optimization and variance reduction. One of the main property of estimate sequences that we will nevertheless use is their ability do behave asymptotically as a lower bound of the objective function near the optimum. Indeed, we have
where and . The first inequality comes from a strong convexity inequality since , and the second inequality is obtained by unrolling the relation obtained between and . When converges to zero, the contribution of the initial surrogate disappears and behaves as a lower bound of .
Relation with existing algorithms.
The iteration (A) encompasses many approaches such as ISTA (proximal gradient descent), which uses the exact gradient leading to deterministic iterates (Beck and Teboulle, 2009; Nesterov, 2013) or proximal variants of the stochastic gradient descent method to deal with a composite objective (see Lan, 2012, for instance). Of interest for us, the variance-reduced stochastic optimization approaches SVRG (Xiao and Zhang, 2014) and SAGA (Defazio et al., 2014a) also follow the iteration (A) but with an unbiased gradient estimator whose variance reduces over time. Specifically, the basic form of these estimators is
where is an index chosen uniformly in at random, and each auxiliary variable is equal to the gradient , where is one of the previous iterates. The motivation is that given two random variables and , it is possible to define a new variable which has the same expectation as but potentially a lower variance if is positively correlated with . SVRG and SAGA are two different approaches to build such positively correlated variables. SVRG uses the same anchor point for all , where is updated every iterations. Typically, the memory cost of SVRG is that of storing the variable and the gradient , which is thus . On the other hand, SAGA updates only at iteration , such that if . Thus, SAGA requires storing gradients. While in general the overhead cost in memory is of order , it may be reduced to when dealing with linear models in machine learning (see Defazio et al., 2014a). Note that variants with non-uniform sampling of the indices have been proposed by Xiao and Zhang (2014); Schmidt et al. (2015).
2.2 Another Algorithm with a Different Estimate Sequence
In the previous section, we have interpreted the classical iteration (A) as the iterative minimization of the stochastic surrogate (9). Here, we show that a slightly different construction leads to a new algorithm. To obtain a lower bound, we have indeed used basic properties of the proximal operator to obtain a subgradient and we have exploited the following convexity inequality
Another natural choice to build a lower bound consists then of using directly instead of , leading to the construction
where is assumed to be the minimizer of the composite function , is defined as in Section 2.1, and is a minimizer of . To initialize the recursion, we define then as
with is the minimizer of and is the minimum value of ; is left unspecified since it does not affect the algorithm. Typically, one may choose to be a minimizer of such that . Unlike in the previous section, the surrogates are not quadratic, but they remain -strongly convex. It is also easy to check that the relation (11) still holds.
The corresponding algorithm.
It is also relatively easy to show that the iterative minimization of the stochastic lower bounds (13) leads to the following iterations [leftmargin=0pt,innerleftmargin=6pt,innerrightmargin=6pt]
As we will see, the convergence analysis for algorithm (A) also holds for algorithm (B) such that both variants enjoy similar theoretical properties. In one case, the function appears explicitly, whereas a lower bound is used in the other case. The introduction of the variable allows us to write the surrogates in the canonical form
where is constant and the inequality on the right is due to the strong convexity of .
Relation to existing approaches.
The approach (B) is related to several optimization methods. When the objective is a deterministic finite sum, the MISO algorithm (Mairal, 2015), one variant of SDCA (Shalev-Shwartz and Zhang, 2016), and Finito (Defazio et al., 2014b) adopt similar strategies and perform the update (B), even though they were derived from a significantly different point of view. For instance, SDCA is a dual coordinate ascent approach, whereas MISO is explicitly derived from the iterative surrogate minimization we adopt in this paper. While the links between (B) and the previous approaches are not necessarily obvious when looking at the original description of these methods, it may be shown that they indeed perform such an update with a gradient estimator of the form (12) where , where is the strong convexity constant of the objective and if . Whereas such estimator requires storing gradients in general, the cost may be also reduced to when dealing with a linear model in machine learning with a quadratic regularization function . Variants with non-uniform sampling for the index appear also in the literature (Csiba et al., 2015; Bietti and Mairal, 2017).
2.3 Gradient Estimators and New Algorithms
exact gradient, with , when the problem is deterministic ();
stochastic gradient, when we assume that has bounded variance. Typically, when , a data point is drawn at iteration and .
random-SVRG: for finite sums, we consider a variant of the SVRG gradient estimator with non-uniform sampling and a random update of the anchor point . Specifically, is also an unbiased estimator of , defined as
where is sampled from a distribution and denotes that the gradient is perturbed by a zero-mean noise variable with variance . More precisely, if for all , where is a stochastic perturbation, instead of accessing , we draw a perturbation and observe
where the perturbation has zero mean given and its variance is bounded by . When there is no perturbation, we simply have and .
Similar to the previous case, the variables and also correspond to possibly noisy estimates of the gradients. Specifically,
where is an anchor point that is updated on average every iterations. Whereas the classical SVRG approach (Xiao and Zhang, 2014) updates on a fixed schedule, we prefer to perform random updates: with probability , we choose and recompute ; otherwise is kept unchanged. In comparison with the fixed schedule, the analysis with the random one is simplified and can be unified with that of SAGA/SDCA or MISO. The use of this estimator with iteration (A) is illustrated in Algorithm 1. It is then easy to modify it to use variant (B) instead.
In terms of memory, the random-SVRG gradient estimator requires to store an anchor point and the average gradients . The ’s do not need to be stored; only the random seeds to produce the perturbations are kept into memory, which allows us to compute at iteration , with the same perturbation for index that was used to compute when the anchor point was last updated. The overall cost is thus .1:Input: in (initial point); (number of iterations); (step sizes); (if averaging);2:Initialization: ; ;3:for do4: Sample according to the distribution ;5: Compute the gradient estimator, possibly corrupted by random perturbations:6: Obtain the new iterate7: With probability ,8: Otherwise, with probability , keep and ;9: Optional: Use the online averaging strategy using obtained from (10):10:end for11:Output: or if averaging.
Algorithm 1 Variant (A) with random-SVRG estimator
SAGA: The estimator has a form similar to (14) but with a different choice of variables . Unlike SVRG that stores an anchor point , the SAGA estimator requires storing and incrementally updating the auxiliary variables for , while maintaining the relation . We consider variants such that each time a gradient is computed, it is corrupted by a zero-mean random perturbation with variance . The procedure is described in Algorithm 2 for variant (A), with a more general estimator that encompasses SAGA/SDCA/MISO, as detailed next.
Note that to deal with non-uniform sampling, we draw uniformly in an index for updating a variable . When is already drawn from a uniform distribution, we may choose instead , which saves computations and does not affect the convergence results. The reason for using an additional index in the non-uniform sampling case removes a difficulty in the convergence proof, a strategy also adopted by Schmidt et al. (2015) for a variant of SAGA with non-uniform sampling.
SDCA/MISO: To put SAGA, MISO and SDCA under the same umbrella, we introduce a scalar in , which we will explain in the sequel, and a correcting term involving that appears only when the sampling distribution is not uniform:
The resulting algorithm corresponds to a variant of SAGA when ; when instead the gradient estimator is used in the context of variant (B), the choice then leads to MISO/SDCA-like procedures. The motivation for introducing the parameter comes from regularized empirical risk minimization problems, where the functions may have the form , where in is a data point; then, is a lower bound on the strong convexity modulus, and is proportional to , which is assumed to be already in memory. When there is no noise (meaning ), storing the variables then requires only additional scalars.
After having introduced our algorithms and before presenting the convergence analysis, we summarize here their new features.
robustness to noise: As mentioned already in the introduction, we introduce mechanisms to deal with stochastic perturbations.
adaptivity to the strong convexity when : Algorithms 1 and 2 without averaging do not require knowing the strong convexity constant (MISO will simply need a lower-bound , which is often trivial to obtain). As shown in the next section, no averaging simply leads to a slightly worse expected convergence rate.
3 Convergence Analysis and Robustness
We now present the convergence analysis of the algorithms described previously. In Section 3.1, we present a general convergence result. Then, we present specific results for the variance-reduction approaches in Section 3.2, including strategies to make them robust to stochastic noise. Acceleration will be discussed in the next section.
3.1 Generic Convergence Result Without Variance Reduction
Key to our complexity results, the following proposition gives a first relation between the quantity , the surrogate , and the variance of the gradient estimates.
Proposition 1 (Key relation).
where the first inequality comes from Lemma 7—it is in fact an equality when considering Algorithm (A)—and the second inequality simply uses the assumption , which yields . Finally, the last inequality uses a classical upper-bound for -smooth functions presented in Lemma 5. Then, after taking expectations,
where we have defined the following quantity
In the previous relations, we have used twice the fact that , for all that is deterministic given such as or . We may now use the non-expansiveness property of the proximal operator (Moreau, 1965) to control the quantity , which gives us
This relation can now be combined with (11) when , and we obtain (16). It is also easy to see that the proof also works with variant (B). The convergence analysis is identical, except that we take to be
and the same result follows. ∎
Then, without making further assumption on , we have the following general convergence result, which is a direct consequence of the averaging Lemma 13, inspired by Ghadimi and Lan (2012), and presented in the appendix:
Theorem 1 (General convergence result).
Theorem 1 allows us to recover convergence rates for various algorithms. Note that the effect of the averaging strategy is to remove the factor in front of on the left part of (17), thus improving the convergence rate by a factor . The price to pay is an additional constant term . For variant (A), the quantity is equal to , whereas it may be larger for (B). Indeed, we may simply say that for variant (B), where is a subgradient in .
As an illustration, we now provide various corollaries of the convergence result for variant (A) only. Note that the following lemma may further refine the upper-bound:
Lemma 1 (Auxiliary lemma for stochastic gradient descent iteration).
The purpose of the lemma is to replace the dependency in by in the convergence rate (when ), at the price of one extra iteration. Whereas the latter naturally upper-bounds the former in the smooth case, we do not have in the composite case, which makes result (19) slightly stronger than (18).
In the corollary below, we consider the stochastic setting showing that with constant step sizes, the algorithm converges with the same rate as the deterministic problem to a noise-dominated region of radius . The proof simply uses Lemma 11, which provides the convergence rate of