1 Introduction
We consider convex optimization problems of the form
(1) 
where is convex and smooth^{1}^{1}1A function is smooth when it is differentiable and its derivative is Lipschitz continuous with constant ., and we call its strong convexity modulus with respect to the Euclidean norm.^{2}^{2}2Then, means that the function is convex but not strongly convex. The function is convex lower semicontinuous and is not assumed to be necessarily differentiable. For instance, may be the
norm, which is very popular in signal processing and machine learning for its sparsityinducing properties
(see Mairal et al., 2014, and references therein); may also be the extendedvalued indicator function of a convex set that takes the value outside of and inside such that the previous setting encompasses constrained problems (see HiriartUrruty and Lemaréchal, 1996).More specifically, we focus on stochastic objective functions, which are of utmost importance in machine learning, where is an expectation or a finite sum of smooth functions
(2) 
On the left,
is a random variable representing a data point drawn according to some distribution and
measures the fit of some model parameter to the data point . Whereas in machine learning formulations the explicit form of the data distribution is unknown, it is assumed that it is possible to draw from it random samples Either an infinite number of such i.i.d. samples are available and the problem of interest is to minimize (1) with , or one has access to a finite training set only, leading to the finitesum setting on the right of (2), called empirical risk (Vapnik, 2000).While the finitesum setting is obviously a particular case of expectation with a discrete probability distribution, the
deterministic nature of the resulting cost function drastically changes the performance guarantees an optimization method may achieve to solve (1). In particular, when an algorithm is only allowed to access unbiased measurements of the objective and gradient—which we assume is the case when is an expectation—it may be shown that the worstcase convergence rate in expected function value cannot be better than in general, where is the number of iterations (Nemirovski et al., 2009; Agarwal et al., 2012). Such a sublinear rate of convergence is notably achieved by stochastic gradient descent (SGD) algorithms or their variants (see Bottou et al., 2018, for a review).Even though this pessimistic result applies to the general stochastic case, linear convergence rates can be obtained for the finitesum setting (Schmidt et al., 2017). Specifically, a large body of work in machine learning has led to many randomized incremental approaches such as SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014a), SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014), SDCA (ShalevShwartz and Zhang, 2016), MISO (Mairal, 2015), Katyusha (AllenZhu, 2017), or the method of Lan and Zhou (2018). These algorithms have about the same cost periteration as the stochastic gradient descent method, since they access only a single or two gradients at each iteration, and they may achieve lower computational complexity than accelerated gradient descent methods (Nesterov, 1983, 2004, 2013; Beck and Teboulle, 2009) in expectation. A common interpretation is to see these algorithms as performing SGD steps with an estimate of the full gradient that has lower variance (Xiao and Zhang, 2014).
More precisely, accelerated gradient methods applied to the deterministic finitesum problem when is strongly convex are able to provide an iterate such that after evaluations of gradients in the worst case; in contrast, variancereduced stochastic optimization methods without acceleration achieve the same guarantee in expectation with complexity , where is the maximum Lipschitz constant of the gradients , or the average Lipschitz constant if a nonuniform sampling strategy is used. From a worstcase complexity point of view, the usefulness of these variancereduced stochastic methods depend on and on the discrepancy between and . Indeed, when is large enough, the complexity of these incremental approaches is simply , which is independent of the condition number and always better than nonincremental firstorder methods. Moreover, even though there is no guarantee that , large speedups over accelerated firstorder methods have been reported on many classical machine learning datasets for incremental approaches (Defazio et al., 2014a; Schmidt et al., 2017), suggesting that is of the same order of magnitude as in many practical cases. Note also that accelerated algorithms for finite sums have also been proposed by ShalevShwartz and Zhang (2016); AllenZhu (2017); Lan and Zhou (2018); Lin et al. (2018), which we will discuss later.
In this paper, we are interested in providing a unified view of stochastic optimization algorithms, with and without variance reduction, but we also want to investigate their robustness to random pertubations. Specifically, we may consider objective functions with an explicit finitesum structure such as (2) when only noisy estimates of the gradients are available. Such a setting may occur for various reasons. For instance, perturbations may be injected during training in order to achieve better generalization on new test data (Srivastava et al., 2014)
, perform stable feature selection
(Meinshausen and Bühlmann, 2010), improve the model robustness (Zheng et al., 2016), or for privacyaware learning (Wainwright et al., 2012). Each training point indexed by is corrupted by a random perturbation and the resulting function may be written as(3) 
Whereas (3
) is a finite sum of functions, we now assume that one has now only access to unbiased estimates of the gradients
due to the stochastic nature of . Then, all the aforementioned variancereduction methods do not apply anymore and the standard approach to address this problem is to ignore the finitesum structure and use SGD or one of its variants. At each iteration, an estimate of the full gradient is obtained by randomly drawing an index in along with a perturbation. Typically, the variance of the gradient estimate then decomposes into two parts , where is due to the random sampling of the index and is due to the random data perturbation. In such a context, variance reduction consists of building gradient estimates with variance , which is potentially much smaller than . The SAGA and SVRG methods were adapted for such a purpose by Hofmann et al. (2015), though the resulting algorithms have nonzero asymptotic error; the MISO method was adapted by Bietti and Mairal (2017) at the cost of a memory overhead of , whereas other variants of SAGA and SVRG were proposed by Zheng and Kwok (2018) for linear models in machine learning.The framework we adopt is that of estimate sequences introduced by Nesterov (2004), which consists of building iteratively a quadratic model of the objective. Typically, estimate sequences may be used to analyze the convergence of existing algorithms, but also to design new ones, in particular with acceleration. Our construction is however slightly different than the original one since it is based on stochastic estimates of the gradients, and some classical properties of estimate sequences are satisfied only approximately. We note that estimate sequences have been used before for stochastic optimization (Hu et al., 2009; Nitanda, 2014), but not in a generic fashion as we do and not for the same purpose. In summary, our construction leads to the following contributions:

We revisit many stochastic optimization algorithms dealing with composite problems; in particular, we consider methods with variance reduction such as SVRG, SAGA, SDCA, or MISO. Interestingly, we show that all these algorithms admit variants that are adaptive to the strong convexity constant , when only a lower bound is available, a property that was not known for SDCA or MISO.

We provide improvements to the previous algorithms by making them robust to stochastic perturbations. We analyze these approaches under a nonuniform sampling strategy where is the probability of drawing example at each iteration. Typically, when the gradients have different Lipschitz constants
, the uniform distribution
yields complexities that depend on , whereas a nonuniform may yield . For strongly convex problems, the resulting worstcase iteration complexity for minimizing (3)—that is, the number of iterations to guarantee —is upper bounded bywhere depends on the sampling strategy and for uniform distributions. The term on the left corresponds to the complexity of the variancereduction methods for a deterministic objective without perturbation, and is the optimal sublinear rate of convergence for a stochastic optimization problem when the gradient estimates have variance . In contrast, a variant of stochastic gradient descent for composite optimization applied to (3) has worstcase complexity , with potentially . Note that the nonuniform sampling strategy potentially reduces and improves the left part, whereas it increases and degrades the dependency on the noise . Whereas nonuniform sampling strategies for incremental methods are now classical (Xiao and Zhang, 2014; Schmidt et al., 2015), the robustness to stochastic perturbations has not been studied for all these methods and existing approaches such as (Hofmann et al., 2015; Bietti and Mairal, 2017; Zheng and Kwok, 2018) have various limitations as discussed earlier.

We show that our construction of estimate sequence naturally leads to an accelerated stochastic gradient method for composite optimization, similar in spirit to (Ghadimi and Lan, 2012, 2013; Hu et al., 2009), but slightly simpler. The resulting complexity in terms of gradient evaluations for strongly convex objective is
which, to the best of our knowledge, has only been achieved by Ghadimi and Lan (2013). When the objective is convex, but nonstrongly convex, we also provide a sublinear convergence rate for finite horizon. Given a budget of iterations, the algorithm returns an iterate such that
(4) which is also optimal for stochastic firstorder optimization (Ghadimi and Lan, 2012).

We design a new accelerated algorithm for finite sums based on the SVRG gradient estimator, with complexity, for strongly convex functions,
(5) where the term on the left is the classical optimal complexity for finite sums (Arjevani and Shamir, 2016). Note that when (deterministic setting) we recover a similar complexity as Katyusha (AllenZhu, 2017). When the problem is convex but not strongly convex, given a budget of iterations that is large enough, the algorithm returns a solution such that
(6) where the term on the right is potentially better than (4) for large when (see discussion above on full variance vs. variance due to small stochastic perturbations). When the objective is deterministic (), the term (6) yields a complexity in , which is potentially better than the complexity of accelerated gradient descent, unless is significantly smaller than .
This paper is organized as follows. Section 2 introduces the proposed framework based on estimate sequences; Section 3 is devoted to the convergence analysis; Section 4 presents a variant of SVRG with acceleration; Section 5 presents various experiments to compare the effectiveness of the proposed approaches, and Section 6 concludes the paper.
2 Proposed Framework Based on Stochastic Estimate Sequences
In this section, we present two generic stochastic optimization algorithms to address the composite problem (1). Then, we show their relation to variancereduction methods.
2.1 A Classical Iteration Revisited
Consider an algorithm that performs the following updates: [leftmargin=0pt,innerleftmargin=6pt,innerrightmargin=6pt]
(A) 
where is the filtration representing all information up to iteration , is an unbiased estimate of the gradient , is a step size, and is the proximal operator (Moreau, 1962) defined for any scalar as the unique solution of
(7) 
The iteration (A) is generic and encompasses many existing algorithms, which we review later. Key to our analysis, we are interested in a simple interpretation corresponding to the iterative minimization of strongly convex surrogate functions.
Interpretation with stochastic estimate sequence.
Consider now the function
(8) 
with and
is a scalar value that is left unspecified at the moment. Then, it is easy to show that
in (A) minimizes the following quadratic function defined for as(9) 
where satisfy the system of equations
(10) 
and
We note that is a subgradient in . By simply using the definition of the proximal operator (7) and considering firstorder optimality conditions, we indeed have that and coincides with the minimizer of . This allows us to write in the generic form
The construction (9) is akin to that of estimate sequences introduced by Nesterov (2004), which are typically used for designing accelerated gradientbased optimization algorithms. In this section, we are however not interested in acceleration, but instead in stochastic optimization and variance reduction. One of the main property of estimate sequences that we will nevertheless use is their ability do behave asymptotically as a lower bound of the objective function near the optimum. Indeed, we have
(11) 
where and . The first inequality comes from a strong convexity inequality since , and the second inequality is obtained by unrolling the relation obtained between and . When converges to zero, the contribution of the initial surrogate disappears and behaves as a lower bound of .
Relation with existing algorithms.
The iteration (A) encompasses many approaches such as ISTA (proximal gradient descent), which uses the exact gradient leading to deterministic iterates (Beck and Teboulle, 2009; Nesterov, 2013) or proximal variants of the stochastic gradient descent method to deal with a composite objective (see Lan, 2012, for instance). Of interest for us, the variancereduced stochastic optimization approaches SVRG (Xiao and Zhang, 2014) and SAGA (Defazio et al., 2014a) also follow the iteration (A) but with an unbiased gradient estimator whose variance reduces over time. Specifically, the basic form of these estimators is
(12) 
where is an index chosen uniformly in at random, and each auxiliary variable is equal to the gradient , where is one of the previous iterates. The motivation is that given two random variables and , it is possible to define a new variable which has the same expectation as but potentially a lower variance if is positively correlated with . SVRG and SAGA are two different approaches to build such positively correlated variables. SVRG uses the same anchor point for all , where is updated every iterations. Typically, the memory cost of SVRG is that of storing the variable and the gradient , which is thus . On the other hand, SAGA updates only at iteration , such that if . Thus, SAGA requires storing gradients. While in general the overhead cost in memory is of order , it may be reduced to when dealing with linear models in machine learning (see Defazio et al., 2014a). Note that variants with nonuniform sampling of the indices have been proposed by Xiao and Zhang (2014); Schmidt et al. (2015).
2.2 Another Algorithm with a Different Estimate Sequence
In the previous section, we have interpreted the classical iteration (A) as the iterative minimization of the stochastic surrogate (9). Here, we show that a slightly different construction leads to a new algorithm. To obtain a lower bound, we have indeed used basic properties of the proximal operator to obtain a subgradient and we have exploited the following convexity inequality
Another natural choice to build a lower bound consists then of using directly instead of , leading to the construction
(13) 
where is assumed to be the minimizer of the composite function , is defined as in Section 2.1, and is a minimizer of . To initialize the recursion, we define then as
with is the minimizer of and is the minimum value of ; is left unspecified since it does not affect the algorithm. Typically, one may choose to be a minimizer of such that . Unlike in the previous section, the surrogates are not quadratic, but they remain strongly convex. It is also easy to check that the relation (11) still holds.
The corresponding algorithm.
It is also relatively easy to show that the iterative minimization of the stochastic lower bounds (13) leads to the following iterations [leftmargin=0pt,innerleftmargin=6pt,innerrightmargin=6pt]
(B) 
As we will see, the convergence analysis for algorithm (A) also holds for algorithm (B) such that both variants enjoy similar theoretical properties. In one case, the function appears explicitly, whereas a lower bound is used in the other case. The introduction of the variable allows us to write the surrogates in the canonical form
where is constant and the inequality on the right is due to the strong convexity of .
Relation to existing approaches.
The approach (B) is related to several optimization methods. When the objective is a deterministic finite sum, the MISO algorithm (Mairal, 2015), one variant of SDCA (ShalevShwartz and Zhang, 2016), and Finito (Defazio et al., 2014b) adopt similar strategies and perform the update (B), even though they were derived from a significantly different point of view. For instance, SDCA is a dual coordinate ascent approach, whereas MISO is explicitly derived from the iterative surrogate minimization we adopt in this paper. While the links between (B) and the previous approaches are not necessarily obvious when looking at the original description of these methods, it may be shown that they indeed perform such an update with a gradient estimator of the form (12) where , where is the strong convexity constant of the objective and if . Whereas such estimator requires storing gradients in general, the cost may be also reduced to when dealing with a linear model in machine learning with a quadratic regularization function . Variants with nonuniform sampling for the index appear also in the literature (Csiba et al., 2015; Bietti and Mairal, 2017).
2.3 Gradient Estimators and New Algorithms
In this paper, we consider the iterations (A) and (B) with the following gradient estimators that are variants of the ones above. For all of them, we define the variance to be

exact gradient, with , when the problem is deterministic ();

stochastic gradient, when we assume that has bounded variance. Typically, when , a data point is drawn at iteration and .

randomSVRG: for finite sums, we consider a variant of the SVRG gradient estimator with nonuniform sampling and a random update of the anchor point . Specifically, is also an unbiased estimator of , defined as
(14) where is sampled from a distribution and denotes that the gradient is perturbed by a zeromean noise variable with variance . More precisely, if for all , where is a stochastic perturbation, instead of accessing , we draw a perturbation and observe
where the perturbation has zero mean given and its variance is bounded by . When there is no perturbation, we simply have and .
Similar to the previous case, the variables and also correspond to possibly noisy estimates of the gradients. Specifically,
where is an anchor point that is updated on average every iterations. Whereas the classical SVRG approach (Xiao and Zhang, 2014) updates on a fixed schedule, we prefer to perform random updates: with probability , we choose and recompute ; otherwise is kept unchanged. In comparison with the fixed schedule, the analysis with the random one is simplified and can be unified with that of SAGA/SDCA or MISO. The use of this estimator with iteration (A) is illustrated in Algorithm 1. It is then easy to modify it to use variant (B) instead.
In terms of memory, the randomSVRG gradient estimator requires to store an anchor point and the average gradients . The ’s do not need to be stored; only the random seeds to produce the perturbations are kept into memory, which allows us to compute at iteration , with the same perturbation for index that was used to compute when the anchor point was last updated. The overall cost is thus .
1:Input: in (initial point); (number of iterations); (step sizes); (if averaging);2:Initialization: ; ;3:for do4: Sample according to the distribution ;5: Compute the gradient estimator, possibly corrupted by random perturbations:6: Obtain the new iterate7: With probability ,8: Otherwise, with probability , keep and ;10:end for11:Output: or if averaging.Algorithm 1 Variant (A) with randomSVRG estimator 
SAGA: The estimator has a form similar to (14) but with a different choice of variables . Unlike SVRG that stores an anchor point , the SAGA estimator requires storing and incrementally updating the auxiliary variables for , while maintaining the relation . We consider variants such that each time a gradient is computed, it is corrupted by a zeromean random perturbation with variance . The procedure is described in Algorithm 2 for variant (A), with a more general estimator that encompasses SAGA/SDCA/MISO, as detailed next.
Note that to deal with nonuniform sampling, we draw uniformly in an index for updating a variable . When is already drawn from a uniform distribution, we may choose instead , which saves computations and does not affect the convergence results. The reason for using an additional index in the nonuniform sampling case removes a difficulty in the convergence proof, a strategy also adopted by Schmidt et al. (2015) for a variant of SAGA with nonuniform sampling.

SDCA/MISO: To put SAGA, MISO and SDCA under the same umbrella, we introduce a scalar in , which we will explain in the sequel, and a correcting term involving that appears only when the sampling distribution is not uniform:
(15) The resulting algorithm corresponds to a variant of SAGA when ; when instead the gradient estimator is used in the context of variant (B), the choice then leads to MISO/SDCAlike procedures. The motivation for introducing the parameter comes from regularized empirical risk minimization problems, where the functions may have the form , where in is a data point; then, is a lower bound on the strong convexity modulus, and is proportional to , which is assumed to be already in memory. When there is no noise (meaning ), storing the variables then requires only additional scalars.
New Features.
After having introduced our algorithms and before presenting the convergence analysis, we summarize here their new features.

robustness to noise: As mentioned already in the introduction, we introduce mechanisms to deal with stochastic perturbations.

adaptivity to the strong convexity when : Algorithms 1 and 2 without averaging do not require knowing the strong convexity constant (MISO will simply need a lowerbound , which is often trivial to obtain). As shown in the next section, no averaging simply leads to a slightly worse expected convergence rate.
3 Convergence Analysis and Robustness
We now present the convergence analysis of the algorithms described previously. In Section 3.1, we present a general convergence result. Then, we present specific results for the variancereduction approaches in Section 3.2, including strategies to make them robust to stochastic noise. Acceleration will be discussed in the next section.
3.1 Generic Convergence Result Without Variance Reduction
Key to our complexity results, the following proposition gives a first relation between the quantity , the surrogate , and the variance of the gradient estimates.
Proposition 1 (Key relation).
Proof.
We first consider the variant (A) and later show how to modify the convergence proofs to accommodate the variant (B).
where the first inequality comes from Lemma 7—it is in fact an equality when considering Algorithm (A)—and the second inequality simply uses the assumption , which yields . Finally, the last inequality uses a classical upperbound for smooth functions presented in Lemma 5. Then, after taking expectations,
where we have defined the following quantity
In the previous relations, we have used twice the fact that , for all that is deterministic given such as or . We may now use the nonexpansiveness property of the proximal operator (Moreau, 1965) to control the quantity , which gives us
This relation can now be combined with (11) when , and we obtain (16). It is also easy to see that the proof also works with variant (B). The convergence analysis is identical, except that we take to be
and the same result follows. ∎
Then, without making further assumption on , we have the following general convergence result, which is a direct consequence of the averaging Lemma 13, inspired by Ghadimi and Lan (2012), and presented in the appendix:
Theorem 1 (General convergence result).
Theorem 1 allows us to recover convergence rates for various algorithms. Note that the effect of the averaging strategy is to remove the factor in front of on the left part of (17), thus improving the convergence rate by a factor . The price to pay is an additional constant term . For variant (A), the quantity is equal to , whereas it may be larger for (B). Indeed, we may simply say that for variant (B), where is a subgradient in .
As an illustration, we now provide various corollaries of the convergence result for variant (A) only. Note that the following lemma may further refine the upperbound:
Lemma 1 (Auxiliary lemma for stochastic gradient descent iteration).
The purpose of the lemma is to replace the dependency in by in the convergence rate (when ), at the price of one extra iteration. Whereas the latter naturally upperbounds the former in the smooth case, we do not have in the composite case, which makes result (19) slightly stronger than (18).
In the corollary below, we consider the stochastic setting showing that with constant step sizes, the algorithm converges with the same rate as the deterministic problem to a noisedominated region of radius . The proof simply uses Lemma 11, which provides the convergence rate of and uses also the relation
Comments
There are no comments yet.