# On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions

Recent advances in randomized incremental methods for minimizing L-smooth μ-strongly convex finite sums have culminated in tight complexity of Õ((n+√(n L/μ))log(1/ϵ)) and O(n+√(nL/ϵ)), where μ>0 and μ=0, respectively, and n denotes the number of individual functions. Unlike incremental methods, stochastic methods for finite sums do not rely on an explicit knowledge of which individual function is being addressed at each iteration, and as such, must perform at least Ω(n^2) iterations to obtain O(1/n^2)-optimal solutions. In this work, we exploit the finite noise structure of finite sums to derive a matching O(n^2)-upper bound under the global oracle model, showing that this lower bound is indeed tight. Following a similar approach, we propose a novel adaptation of SVRG which is both compatible with stochastic oracles, and achieves complexity bounds of Õ((n^2+n√(L/μ))log(1/ϵ)) and O(n√(L/ϵ)), for μ>0 and μ=0, respectively. Our bounds hold w.h.p. and match in part existing lower bounds of Ω̃(n^2+√(nL/μ)log(1/ϵ)) and Ω̃(n^2+√(nL/ϵ)), for μ>0 and μ=0, respectively.

## 1 Introduction

Many tasks in machine learning and statistics reduce to finite-sum minimization problems of the form

$$\min_{w\in\mathbb{R}^d} F(w) \coloneqq \frac{1}{n}\sum_{i=1}^{n} f_i(w), \tag{1.1}$$

where the individual functions $f_i$ are $L$-smooth and either $\mu$-strongly convex or convex. The large datasets often encountered in modern applications have led to high interest in optimization methods which can efficiently cope with a large number of individual functions.

In this work, we measure the efficiency of optimization algorithms through the framework of oracle complexity. Concretely, we assume the existence of an external procedure, typically referred to as an oracle, which upon receiving a query, reveals some information about the function at hand (e.g., function values, gradients or Hessians). The oracle complexity of a given optimization algorithm is then the number of oracle calls required to obtain an $\epsilon$-optimal solution, i.e., a point $\hat{w}$ that satisfies $F(\hat{w})-F^*\le\epsilon$ w.h.p. or $\mathbb{E}[F(\hat{w})-F^*]\le\epsilon$, where the expectation is over randomness in the algorithm and oracle.

Throughout, we focus on two types of oracles for finite sums: incremental oracles, which allow one to control which individual function is being referred to in each iteration, and stochastic oracles, which provide information on a randomly chosen $f_i$ without revealing its index (see Section 2 for a formal exposition). Although the difference between these two types of oracle may seem insignificant at first glance, the optimal attainable performances of randomized incremental methods (compatible with incremental oracles), such as SAG [SLRB13], SVRG [JZ13] and SDCA [SSZ13], are significantly better than those attainable by stochastic methods (compatible with stochastic oracles). (Footnote 1: We follow here a nomenclatural convention that distinguishes cases where deterministic problems are addressed via random methods, as in (1.1), from cases that are inherently stochastic, as in (3.21) below. This distinction is made formal in Section 2.)

First-order oracles. A notable example where incremental methods outperform stochastic methods is the case of first-order oracles, which provide function values and gradients. When the $f_i$ are smooth and strongly convex, SAG, SDCA and SVRG enjoy exponential rates of $\mathcal{O}(\log(1/\epsilon))$ (disregarding other problem parameters for now), while the stochastic vanilla SGD obtains significantly slower rates of $\mathcal{O}(1/\epsilon)$, e.g., [SSSSC11]. The fact that randomized incremental methods can converge exponentially fast is not altogether trivial. In particular, it implies that the variance of the iterates must also decay exponentially fast, which explains why such methods are often referred to as variance-reduced methods.

The main algorithmic idea behind variance reduction is to wisely incorporate past information acquired from the oracle (see [JZ13]). This naturally raises the following question. Is it possible to use the same idea to design stochastic, rather than incremental, variance-reduced methods that achieve exponential convergence rates for finite-sum problems?

In this work, we answer this question in the affirmative. Specifically, we exploit the unique finite noise structure of finite sums to form an estimator that recovers the full gradient of $F$ w.h.p. Based on this estimator, we propose a novel adaptation of SVRG which does not rely on the indices of the individual functions. Combined with the Catalyst acceleration scheme [LMH15], this yields upper complexity bounds of

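$$\tilde{\mathcal{O}}\big(n^2+n\sqrt{L/\epsilon}\big) \quad\text{and}\quad \tilde{\mathcal{O}}\big((n^2+n\sqrt{L/\mu})\log(1/\epsilon)\big), \tag{1.2}$$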

for $\mu=0$ and $\mu>0$, respectively, which hold w.h.p. In particular, this shows that the additional finite-sum structure can indeed be used to achieve exponential rates for stochastic methods in the strongly convex case (and a $1/\sqrt{\epsilon}$ rate in the smooth convex case), which is impossible for general-purpose stochastic methods [NY83, AWBR09, RR11]. Perhaps surprisingly, although this rate cannot be achieved through a naive empirical average gradient estimator, as we prove in Section 3, a simple rounding correction is all that is needed to obtain an exponential convergence rate w.h.p. (in the strongly convex case).

Although the quadratic dependence on $n$ may seem rather pessimistic, any stochastic method that obtains an $\mathcal{O}(1/n^2)$-optimal solution must issue at least $\tilde{\Omega}(n^2)$ oracle queries [Arj17]. In fact, in the class of stochastic first-order methods, one cannot hope to perform better than

$$\tilde{\Omega}\big(n^2+\sqrt{nL/\epsilon}\big) \quad\text{and}\quad \tilde{\Omega}\big(n^2+\sqrt{nL/\mu}\,\log(1/\epsilon)\big), \tag{1.3}$$

for $\mu=0$ and $\mu>0$, respectively (see (2.16) below for details). The complexity bounds stated in (1.2) provide, therefore, a fair idea of the complexity of obtaining high-accuracy solutions in applications where the indices of the individual functions are not known, or not used.

Global oracles. The $\tilde{\Omega}(n^2)$ lower bound in [Arj17] applies in fact to the much broader class of global oracles, under which a complete specification of a randomly chosen individual function is provided. Clearly, any local information, such as gradients, Hessians, and other high-order derivatives, as well as global information, such as the minimizers of $f_i$ along a given direction (i.e., steepest descent steps), can be extracted through global oracles. An important setting in which one is granted such ‘privileged access’ is empirical risk minimization (ERM),

$$F(w)=\sum_{i=1}^{n}\ell\big(w;(x_i,y_i)\big), \tag{1.4}$$

where $\ell$, the loss function, is known a priori, and $(x_i,y_i)$ denote the training samples. Since each individual function is fully parameterized by its associated sample, this implies global access to the randomly chosen individual functions. In this work, by using a similar idea to the one we use for first-order oracles, we show that the global $\tilde{\Omega}(n^2)$ lower complexity bound is tight.

Our contributions can be summarized as follows:

• We show that (unlike existing variance-reduced methods which directly rely on the indices of the individual functions) it is possible to minimize smooth strongly-convex finite-sum problems—without using the indices—and still enjoy exponential convergence rates.

• To this end, we combine the SVRG method with a novel biased quantized gradient estimator which recovers the average function w.h.p. We further apply the Catalyst framework to improve the dependence on the condition number and to extend the proposed algorithm to convex functions (which are not necessarily strongly convex).

• We prove that variance reduction cannot be obtained by using SGD or SVRG with a naive gradient estimator, as the rate at which the variance decays in this case is too slow, which in turn leads to polynomial convergence rates. Thus, perhaps surprisingly, the seemingly negligible quantization correction we propose is essential for obtaining exponential (rather than polynomial) rates.

• Using an estimator similar to the quantized gradient estimator, we show that the $\tilde{\Omega}(n^2)$ lower bound established in [Arj17] for stochastic global methods for finite sums is essentially tight.

The paper is structured as follows. In Section 2, we introduce the main ingredients of the oracle complexity framework and discuss relevant lower complexity bounds. Section 3 surveys existing approaches for minimizing finite sums and their limitations in the context of stochastic oracles. In Section 4, we present our quantized estimator for categorical random variables. With this, we establish in Section 5 a tight upper bound for stochastic global oracles and present a variant of SVRG, which does not use indices explicitly, and extend its applicability through the Catalyst framework.

## 2 Setup

We study the problem of finding an $\epsilon$-optimal solution in the framework of oracle complexity. First, we briefly review its main components. For further details on information-based complexity, see [TW80].

Function classes  We denote by $\Sigma$ the class of generic finite-sum problems of the form (1.1). Similarly, let $\Sigma_\mu$ denote the class of finite sums with $\mu$-strongly convex individual functions, where for any $u,w\in\mathbb{R}^d$ and $i\in[n]$,

$$f_i(u)\ge f_i(w)+\langle\nabla f_i(w),u-w\rangle+\frac{\mu}{2}\|u-w\|^2. \tag{2.5}$$

If $\mu=0$, then $f_i$ is merely convex. Likewise, we use $\Sigma^L_\mu$ to indicate that the individual functions are further assumed to be $L$-smooth, i.e., for any $u,w\in\mathbb{R}^d$ and $i\in[n]$,

$$f_i(u)\le f_i(w)+\langle\nabla f_i(w),u-w\rangle+\frac{L}{2}\|u-w\|^2. \tag{2.6}$$

Lastly, we always assume that the initial suboptimality of $F$ at the initialization point $w_0$ is bounded from above by

$$F(w_0)-F^*\le\Delta, \tag{2.7}$$

where $\Delta$ is some positive real scalar assumed to be known.

Oracle Classes  Different ways of accessing a given optimization problem are modeled through different oracles, and these, in turn, accommodate different classes of optimization methods. In this work, we consider the following four distinct types of oracles:

1. Incremental first-order oracle defined with a parameter $B$ as

$$I_\nabla:(\mathbb{R}^d)^B\times[n]\to(\mathbb{R}\times\mathbb{R}^d)^B,\qquad (w_1,\dots,w_B,i)\mapsto\big(f_i(w_1),\nabla f_i(w_1),\dots,f_i(w_B),\nabla f_i(w_B)\big). \tag{2.8}$$

The oracle allows the user to obtain the gradient of an individual function they desire, at $B$ different points (note that we avoid explicitly stating $B$ in $I_\nabla$ to allow a cleaner notation). Having $B\ge2$ is necessary for implementing the SVRG method in Section 5.

2. Incremental global oracle defined by

$$I_f:[n]\to(\mathbb{R}^d\to\mathbb{R}),\qquad i\mapsto f_i. \tag{2.9}$$

In this oracle model, the entire function is accessible to the user. Hence one can simply issue $n$ queries to get $f_1,\dots,f_n$, and then return an exact minimizer of $F$. Under the $I_f$-model, one completely disregards the cost of computing local information, such as gradients. It can be shown that $\Omega(n)$ queries are necessary to obtain solutions of sufficiently high accuracy (e.g., Lemma 2 in [AS16]).

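3. Stochastic first-order oracle defined with a parameter $B$ as

$$S_\nabla:(\mathbb{R}^d)^B\to(\mathbb{R}\times\mathbb{R}^d)^B,\qquad (w_1,\dots,w_B)\mapsto\big(f_i(w_1),\nabla f_i(w_1),\dots,f_i(w_B),\nabla f_i(w_B)\big),\ \text{where } i\sim\mathrm{Unif}([n]). \tag{2.10}$$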

The class of stochastic oracles, the main focus of this work, is used to model methods which do not or cannot explicitly rely on the individual function index. This happens for example when ‘non-enumerable’ data augmentation is used (see, e.g., [LCB07]). Stochastic methods are used more broadly to address stochastic optimization problems of the form $\min_{w}\mathbb{E}_{\xi}[f(w;\xi)]$, where one is given access to a first-order estimate of the gradient of the objective at a given point through a randomly drawn $\xi$. Note that existing variance-reduced methods are not directly implementable with stochastic oracles, as we discuss in Section 3.

4. Stochastic global oracle defined by

$$S_f \text{ returns } f_i,\ \text{where } i\sim\mathrm{Unif}([n]). \tag{2.11}$$

This oracle is used to study the fundamental statistical limitations of minimizing finite-sum problems, as all other computational aspects are disregarded under this oracle model. As mentioned earlier, an important instance of this setting is ERM. In this case, the global stochastic oracle complexity is typically referred to as sample complexity.
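The four oracle types above can be sketched as follows. This is a minimal illustration, not the paper's formalism: the class name `FiniteSum`, the method names, and the one-dimensional quadratic components are hypothetical choices made for the example, and each component is modeled as a pair of a value function and a gradient function.

```python
import random
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]
# Assumption: a component f_i is modeled as (value function, gradient function).
Component = Tuple[Callable[[Vector], float], Callable[[Vector], Vector]]


class FiniteSum:
    """Hypothetical container exposing the four oracle types for f_1, ..., f_n."""

    def __init__(self, components: Sequence[Component], seed: int = 0):
        self.components = list(components)
        self.rng = random.Random(seed)

    def incremental_first_order(self, points: Sequence[Vector], i: int):
        # I_grad: the caller chooses the index i and queries B = len(points) points.
        f, grad = self.components[i]
        return [(f(w), grad(w)) for w in points]

    def incremental_global(self, i: int) -> Component:
        # I_f: the caller chooses i and receives the whole function f_i.
        return self.components[i]

    def stochastic_first_order(self, points: Sequence[Vector]):
        # S_grad: the index is drawn uniformly and never revealed; the same
        # hidden f_i answers all B points of a single query.
        f, grad = self.rng.choice(self.components)
        return [(f(w), grad(w)) for w in points]

    def stochastic_global(self) -> Component:
        # S_f: a uniformly drawn f_i is returned without its index.
        return self.rng.choice(self.components)


def quadratic(center: float) -> Component:
    # Illustrative component f(w) = 0.5 * (w - center)^2 in one dimension.
    return (lambda w: 0.5 * (w[0] - center) ** 2,
            lambda w: (w[0] - center,))


problem = FiniteSum([quadratic(-1.0), quadratic(1.0)])
answers = problem.stochastic_first_order([(0.0,), (2.0,)])  # one call with B = 2
```

Note that a single stochastic call with $B=2$ returns two gradients of the same hidden component, which is the property the SVRG correction term relies on.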

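Minimax Oracle Complexity  Next, we define general (minimax) oracle complexity. Given a class of functions $\mathcal{F}$ and a suitable oracle $\mathcal{O}$, we denote by $\mathcal{A}(\mathcal{O},k)$ the class of all optimization algorithms that access instances in $\mathcal{F}$ by issuing at most $k$ $\mathcal{O}$-queries. (Footnote 2: Strictly speaking, when studying complexity of optimization algorithms, one must carefully define what is meant by ‘algorithm’ and ‘oracle’. We shall not need this level of formality in this work.) Let $w_{A(f)}$ be the final iterate returned by algorithm $A$ when applied to an $f\in\mathcal{F}$. With this, we define two related notions of minimax complexity:

$$\mathcal{M}_{\mathcal{F},\mathcal{O}}(\epsilon)\doteq\inf\Big\{k\in\mathbb{N}\ \Big|\ \exists A\in\mathcal{A}(\mathcal{O},k)\ \text{s.t.}\ \sup_{f\in\mathcal{F}}\mathbb{E}\big[f(w_{A(f)})-f^*\big]\le\epsilon\Big\}, \tag{2.12}$$

$$\mathcal{M}_{\mathcal{F},\mathcal{O}}(\epsilon,\delta)\doteq\inf\Big\{k\in\mathbb{N}\ \Big|\ \exists A\in\mathcal{A}(\mathcal{O},k)\ \text{s.t.}\ \sup_{f\in\mathcal{F}}\mathbb{P}\big(f(w_{A(f)})-f^*>\epsilon\big)<\delta\Big\}, \tag{2.13}$$

where, here and throughout, $f^*$ denotes the infimum of $f$ over its domain.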

A straightforward application of Markov’s inequality shows that the first complexity notion (2.12), which holds in expectation, implies the second complexity notion (2.13), which holds w.h.p. We note in passing that, in spite of being a well-accepted framework in the field of continuous optimization, oracle complexity does not take into account the computational resources required to implement and process oracle calls, and should therefore be regarded as a lower bound on the real computational complexity.
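The Markov step can be written out as follows. This is a sketch under the assumption that an algorithm $A$ attains expected suboptimality at most $\epsilon\delta/2$; the factor $1/2$ is chosen here only to obtain the strict inequality required by the high-probability definition.

```latex
% Assumption: \mathbb{E}[f(w_{A(f)}) - f^*] \le \epsilon\delta/2 for every f in \mathcal{F}.
% Markov's inequality applies because f(w_{A(f)}) - f^* is nonnegative.
\mathbb{P}\big(f(w_{A(f)}) - f^* > \epsilon\big)
  \le \frac{\mathbb{E}\big[f(w_{A(f)}) - f^*\big]}{\epsilon}
  \le \frac{\epsilon\delta/2}{\epsilon}
  = \frac{\delta}{2} < \delta,
\qquad\text{hence}\qquad
\mathcal{M}_{\mathcal{F},\mathcal{O}}(\epsilon,\delta)
  \le \mathcal{M}_{\mathcal{F},\mathcal{O}}(\epsilon\delta/2).
```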

Lower Oracle Complexity Bounds  Some oracles are more expressive than others. For example, it is straightforward to implement the stochastic first-order oracle $S_\nabla$ through the global oracle $S_f$. Therefore, by the definitions of these two oracles above and by [Arj17, Theorem 1], we have

$$\mathcal{M}_{\Sigma^L_\mu,S_\nabla}(\epsilon)\ge\mathcal{M}_{\Sigma^L_\mu,S_f}(\epsilon)\ge\tilde{\Omega}(n^2), \tag{2.14}$$

provided that $\epsilon=\mathcal{O}(1/n^2)$. Here and below, we omit the dependence on the initial suboptimality $\Delta$ for simplicity. Likewise, $I_\nabla$ can be used to implement $S_\nabla$ by calling $I_\nabla$ with $i\sim\mathrm{Unif}([n])$, hence

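$$\mathcal{M}_{\Sigma^L_\mu,S_\nabla}(\epsilon)\ge\mathcal{M}_{\Sigma^L_\mu,I_\nabla}(\epsilon)\ge\tilde{\Omega}\big(\sqrt{nL/\mu}\,\log(1/\epsilon)\big). \tag{2.15}$$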

Combining both bounds gives

$$\mathcal{M}_{\Sigma^L_\mu,S_\nabla}(\epsilon)\ge\tilde{\Omega}\big(n^2+\sqrt{nL/\mu}\,\log(1/\epsilon)\big). \tag{2.16}$$

As mentioned earlier, we also have an $\Omega(n)$ lower bound, but this additional bound is absorbed in the $n^2$ term in (2.16). Using similar arguments, one obtains a bound for the $L$-smooth case, $\mathcal{M}_{\Sigma^L_0,S_\nabla}(\epsilon)\ge\tilde{\Omega}\big(n^2+\sqrt{nL/\epsilon}\big)$. Fact 1 summarizes these lower complexity bounds, for reference in later sections. We leave a treatment of the high-probability counterparts to future work.

###### Fact 1.

The following bounds hold for minimizing finite-sum functions.

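• For an incremental first-order oracle [WS16, AS16]:

$$\mathcal{M}_{\Sigma^L_\mu,I_\nabla}(\epsilon)\ge\tilde{\Omega}\big(n+\sqrt{nL/\mu}\,\log(1/\epsilon)\big), \tag{2.17}$$

$$\mathcal{M}_{\Sigma^L_0,I_\nabla}(\epsilon)\ge\tilde{\Omega}\big(n+\sqrt{nL/\epsilon}\big). \tag{2.18}$$

• For a stochastic finite-sum oracle:

$$\mathcal{M}_{\Sigma^L_\mu,S_\nabla}(\epsilon)\ge\tilde{\Omega}\big(n^2+\sqrt{nL/\mu}\,\log(1/\epsilon)\big), \tag{2.19}$$

$$\mathcal{M}_{\Sigma^L_0,S_\nabla}(\epsilon)\ge\tilde{\Omega}\big(n^2+\sqrt{nL/\epsilon}\big). \tag{2.20}$$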

## 3 Approaches for Minimizing Finite Sums

Next, we review existing approaches for minimizing finite-sum problems via randomized incremental and stochastic methods. This brief survey also serves as a motivational exposition for the main question we ask in this work, namely, is it possible to apply variance-reduced techniques using first-order stochastic oracles (defined in (2.10))?

For concreteness of the following discussion, let us consider the class $\Sigma^L_\mu$ with $\mu>0$. A natural approach for addressing finite-sum problems without relying on the indices is to re-express (1.1) as a stochastic optimization problem,

$$\min_{w\in\mathbb{R}^d}\mathbb{E}_{i\sim\mathrm{Unif}([n])}\big[f_i(w)\big], \tag{3.21}$$

and apply generic stochastic methods, such as SGD [RM51]. The oracle complexity of vanilla SGD is $\mathcal{O}(1/\epsilon)$ (e.g., [NJLS09]), which, although inferior to incremental methods in terms of $\epsilon$, does not depend on $n$. This makes SGD particularly suited for settings where $n$ is very large and one desires solutions of moderate accuracy.

Other, more recent variants of SGD use, e.g., mini-batches, importance sampling, or fixed step sizes, to achieve exponential convergence rates, but only up to a certain noise level (see [MB11, NWS14, NW16, QRG19]). It is possible to gradually reduce the convergence noise level, but that would effectively imply polynomial complexity bounds (hence, giving the same rates attainable by vanilla SGD). This should come as no surprise: any general-purpose stochastic first-order method designed for (3.21) is bound to polynomial rates [NY83, AWBR09, RR11].

In contrast to this, if an incremental first-order oracle is given, one can compute the full gradient of $F$ at each iteration by simply iterating over the individual functions, and then use vanilla Gradient Descent (GD) and Accelerated Gradient Descent (AGD, [Nes04]). This yields oracle complexity bounds of

$$\tilde{\mathcal{O}}\big(n(L/\mu)\log(1/\epsilon)\big) \quad\text{and}\quad \tilde{\mathcal{O}}\big(n\sqrt{L/\mu}\,\log(1/\epsilon)\big), \tag{3.22}$$

for GD and AGD, respectively, where $\tilde{\mathcal{O}}$ hides some logarithmic factors in the problem parameters.

Recently, the new class of variance-reduced methods, such as SAG, SDCA, SVRG, SAGA [DBLJ14], SDCA without duality [Sha16], or MISO/Finito [Mai15, DD14], was shown to enjoy significantly better rates of $\tilde{\mathcal{O}}\big((n+L/\mu)\log(1/\epsilon)\big)$, or

$$\tilde{\mathcal{O}}\big((n+\sqrt{nL/\mu})\log(1/\epsilon)\big) \tag{3.23}$$

when acceleration schemes are used [LMH15, SSZ16, AZ17]. This rate is tight and cannot be improved in the class of incremental first-order methods [WS16, AS16].
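To get a feel for the gap between (3.22) and (3.23), the bounds can be evaluated numerically. This is only an illustrative sketch: the function names are hypothetical, constants and hidden logarithmic factors are dropped, and the problem sizes are arbitrary.

```python
import math


def gd_calls(n: int, L: float, mu: float, eps: float) -> float:
    # GD bound from (3.22), constants and hidden log factors dropped.
    return n * (L / mu) * math.log(1 / eps)


def agd_calls(n: int, L: float, mu: float, eps: float) -> float:
    # AGD bound from (3.22), constants and hidden log factors dropped.
    return n * math.sqrt(L / mu) * math.log(1 / eps)


def accelerated_vr_calls(n: int, L: float, mu: float, eps: float) -> float:
    # Accelerated variance-reduced bound from (3.23), same simplifications.
    return (n + math.sqrt(n * L / mu)) * math.log(1 / eps)


# Arbitrary illustrative sizes: n = 10^4 components, condition number 10^6.
n, L, mu, eps = 10_000, 1e6, 1.0, 1e-6
gd = gd_calls(n, L, mu, eps)
agd = agd_calls(n, L, mu, eps)
vr = accelerated_vr_calls(n, L, mu, eps)
print(f"GD: {gd:.2e}  AGD: {agd:.2e}  accelerated VR: {vr:.2e}")
```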

Despite their favorable rates, variance-reduced methods cannot be directly implemented through stochastic first-order oracles. Indeed, algorithms like SAG/SAGA/MISO/SDCA keep a list of $n$ vectors in memory, one for each individual function. The $i$th vector is then updated using gradients of $f_i$, which are acquired throughout the optimization process. Clearly, one must know which individual function is being addressed at each iteration in order to perform such updates. Another notable example of a variance-reduced method is SVRG, which generates new iterates using the full gradient at some reference point $\tilde{w}$, i.e.,

$$w_t=w_{t-1}-\eta\big(\nabla f_{i_t}(w_{t-1})-\nabla f_{i_t}(\tilde{w})+\nabla F(\tilde{w})\big).$$

When the stochastic oracle allows multiple simultaneous queries (i.e., $B\ge2$ in (2.10)), it is possible to evaluate the term $\nabla f_{i_t}(w_{t-1})-\nabla f_{i_t}(\tilde{w})$. However, one cannot evaluate the full gradient $\nabla F(\tilde{w})$ by simply iterating over the indices of the individual functions.

A natural idea to address this issue is to replace the full gradient by an empirical estimator that simply averages the sampled gradients. Unfortunately, this naive estimator does not lead to the desired exponential convergence.

###### Theorem 1.

Assume that in each iteration we make a fixed number $m$ of stochastic oracle calls to form the empirical estimator

$$\hat{\nabla}F(w)=\frac{1}{m}\sum_{i=1}^{m}g_i(w),$$

where each $g_i(w)$ is uniformly sampled among the individual gradients $\nabla f_1(w),\dots,\nabla f_n(w)$, and apply SGD and SVRG with $\hat{\nabla}F$ and constant stepsize $\eta$. Then we have the following lower bounds:

• Lower bound in expectation: There exists such that for sufficiently small , for any choice of and stepsize , the number of stochastic oracle calls required to ensure is at least .

• Lower bound in high probability:

There exists such that for sufficiently small and , for any choice of and stepsize , the number of stochastic oracle calls required to ensure with probability is at least .

We prove Theorem 1 by carefully tracing the random iterates produced by SGD (with different step sizes) when applied on the finite-sum function (for some even $n$)

$$F(w)\coloneqq\frac{1}{n}\Big(\frac{n}{4}(w-1)^2+\frac{n}{4}(w+1)^2\Big),\quad w\in\mathbb{R}.$$

It is straightforward to show that one full iteration of SVRG is equivalent to one SGD step. Hence both methods can be effectively addressed in the same way.
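The obstruction can be observed numerically on this toy function. The sketch below assumes the natural decomposition into $n/2$ components $\tfrac12(w-1)^2$ and $n/2$ components $\tfrac12(w+1)^2$, so that a sampled gradient is $w-c$ with $c$ uniform on $\{-1,+1\}$ and the exact gradient is $F'(w)=w$. The helper name `naive_gradient`, the batch size, and the number of trials are illustrative choices; the simulation is not the proof technique of Theorem 1.

```python
import random


def naive_gradient(w: float, m: int, rng: random.Random) -> float:
    """Average of m sampled gradients of the toy problem.

    Assumption: half of the components are 0.5*(w - 1)^2 and half are
    0.5*(w + 1)^2, so one sampled gradient equals w - c with c in {-1, +1}.
    """
    return sum(w - rng.choice((-1.0, 1.0)) for _ in range(m)) / m


rng = random.Random(0)
m, trials = 100, 2000
w_star = 0.0  # minimizer of F, where the exact gradient F'(w) = w vanishes

errors = [naive_gradient(w_star, m, rng) for _ in range(trials)]
mean_squared_error = sum(e * e for e in errors) / trials
print(f"mean squared error {mean_squared_error:.4f} versus 1/m = {1 / m:.4f}")
```

Even at the minimizer the naive estimate carries noise of variance $1/m$, which does not shrink as the iterates approach the optimum; this is the behavior that the rounding correction of the next section removes.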

We note that the high probability bound is not directly implied by the bound in expectation. The former requires one to guarantee that the distribution of the iterates does not concentrate around the mean of the iterates, as this sequence can converge exponentially fast. The main tool we use to ensure anti-concentration of the iterates is the Berry-Esseen theorem. One may notice that the dependency on in the high probability lower bound is not as good as the one in the expectation lower bound, which may be improvable with a different proof technique. Our main message here is to show that the naive estimator does not provide exponential convergence rate.

## 4 A Biased Estimator with Quantization

It turns out that the high variance exhibited by the naive gradient estimator that leads to the lower bounds above, can be addressed by a simple rounding procedure. We first describe this procedure in a general setting: estimating the parameters of categorical random variables.

We start by introducing relevant definitions. Given a real number $x$, we let $\mathrm{rnd}(x)$ denote the closest integer to $x$, with ties between two equally close integers broken by a fixed convention. A random variable $X$ is said to be $n$-categorical if $X$ is discrete with finite support $\{s_1,\dots,s_q\}\subseteq V$, where $V$ is some vector space, and

$$P(X=s_i)=\frac{n_i}{n},\quad i\in[q],$$

for some nonnegative integers $n_1,\dots,n_q$ which sum up to $n$, that is, $\sum_{i=1}^{q}n_i=n$. Clearly, stochastic oracles over finite-sum functions induce a categorical distribution on their potential set of answers.

The next simple lemma is a key insight for our analysis.

###### Lemma 1.

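Let $X$ be an $n$-categorical random variable. Let $X_1,\dots,X_m$ be i.i.d. samples of $X$ and let $Z_i$ be the empirical counter of category $i$ defined by

$$Z_i=\sum_{j=1}^{m}\mathbb{1}_{X_j=s_i}.$$

If $m\ge2n^2\log(2n/\delta)$, then with probability at least $1-\delta$ we have that for every $i\in[q]$,

$$\mathrm{rnd}\Big(\frac{nZ_i}{m}\Big)=n_i.$$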

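Proof  Define $p_i=n_i/n$. By Hoeffding’s inequality, we have that for every $i\in[q]$,

$$P\Big(\Big|\frac{Z_i}{m}-p_i\Big|\ge\frac{1}{2n}\Big)\le2\exp\Big(-\frac{m}{2n^2}\Big)\le\frac{\delta}{n},$$

where the last inequality is due to the assumption that $m\ge2n^2\log(2n/\delta)$. It follows now by the union bound, using that the number of categories satisfies $q\le n$, that

$$P\Big(\exists i\text{ such that }\Big|\frac{Z_i}{m}-p_i\Big|\ge\frac{1}{2n}\Big)\le\sum_{i=1}^{q}P\Big(\Big|\frac{Z_i}{m}-p_i\Big|\ge\frac{1}{2n}\Big)\le\delta.$$

Thus, with probability at least $1-\delta$, for all $i\in[q]$, we have $\big|\frac{Z_i}{m}-p_i\big|<\frac{1}{2n}$, implying that

$$\Big|\frac{nZ_i}{m}-n_i\Big|<\frac{1}{2}\implies\mathrm{rnd}\Big(\frac{nZ_i}{m}\Big)=n_i.$$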
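The sample-size requirement in the proof can be checked numerically. The sketch below uses hypothetical helper names and simply evaluates the Hoeffding bound from the proof; it assumes at most $n$ categories, as in the union bound.

```python
import math


def required_samples(n: int, delta: float) -> int:
    """Smallest integer m with m >= 2 n^2 log(2n / delta)."""
    return math.ceil(2 * n * n * math.log(2 * n / delta))


def union_failure_bound(m: int, n: int) -> float:
    """n times the per-category Hoeffding bound 2 exp(-m / (2 n^2))."""
    return n * 2 * math.exp(-m / (2 * n * n))


n, delta = 10, 0.01
m = required_samples(n, delta)
print(m, union_failure_bound(m, n))
```

The quadratic growth of `required_samples` in `n` is the source of the $n^2$ term in the upper bounds of Section 5.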

Therefore, with appropriate quantization, we can recover the distribution parameters with high probability. More importantly, the estimator

$$\hat{X}^q_n=\frac{1}{n}\sum_{i=1}^{q}\mathrm{rnd}\Big(\frac{nZ_i}{m}\Big)s_i \tag{4.24}$$

satisfies $\hat{X}^q_n=\mathbb{E}[X]$ with probability at least $1-\delta$ as long as $m\ge2n^2\log(2n/\delta)$. It is worth noting that the quantized estimator is not unbiased. A simple example follows by a straightforward computation (see the appendix for full details).

We emphasize that one does not need to know the support set in advance to implement this estimator; the counter can be implemented on the fly. During the sampling procedure, we keep in memory the set of sample ‘types’ seen up to some point, and update it accordingly. Lemma 1 implies that with probability at least $1-\delta$ all different categories are seen and the corresponding probabilities are recovered.
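A minimal sketch of the on-the-fly implementation of (4.24) is given below. The function name `quantized_mean`, the tuple representation of vectors, and the rounding rule `floor(x + 1/2)` for ties are assumptions made for this illustration (Python's built-in `round` is avoided because it rounds exact halves to the nearest even integer).

```python
import math
import random
from collections import Counter


def quantized_mean(sample, n: int, delta: float):
    """Quantized estimator (4.24) for an n-categorical variable.

    `sample` is a zero-argument callable returning one draw as a tuple of
    floats. The support is not given in advance: categories are discovered
    while sampling, via the keys of the counter.
    """
    m = math.ceil(2 * n * n * math.log(2 * n / delta))
    counts = Counter(sample() for _ in range(m))
    dim = len(next(iter(counts)))
    total = [0.0] * dim
    for value, z in counts.items():
        n_hat = math.floor(n * z / m + 0.5)  # assumed tie rule: round half up
        for j in range(dim):
            total[j] += n_hat * value[j]
    return tuple(t / n for t in total)


rng = random.Random(0)
# Illustrative population with n = 5: multiplicities 1, 2, 2 and exact mean 2.0.
population = [(0.0,), (1.0,), (1.0,), (4.0,), (4.0,)]
estimate = quantized_mean(lambda: rng.choice(population), n=5, delta=1e-6)
print(estimate)
```

Because the recovered multiplicities are integers, the estimate is exactly the population mean whenever all roundings succeed, rather than merely close to it.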

## 5 Application to Stochastic Oracles

Having presented our quantized estimator for categorical random variables, we now use it to design optimization algorithms for finite sums which do not rely on the indices of the individual functions.

Stochastic global oracle  We consider first the oracle complexity of the stochastic global oracle. In this case, a straightforward application of the quantized estimator (4.24) with support set $\{f_1,\dots,f_n\}$ yields the following upper complexity bound.

###### Theorem 2.

The minimax complexity of a stochastic global oracle for the finite-sum problem (1.1) is bounded by

$$\mathcal{M}_{\Sigma,S_f}(\epsilon,\delta)\le2n^2\log\Big(\frac{2n}{\delta}\Big). \tag{5.25}$$

By the lower complexity bound established in [Arj17], the bound stated in Theorem 2 is tight up to logarithmic factors. Also, note that the bound stated in Theorem 2 applies to any type of individual functions (including non-convex and non-smooth functions).

Stochastic First-order Oracle  Our quantized estimator can also be used to recover full gradients of $F$. This enables us to implement a ‘quantized’ variant of SVRG (Q-SVRG), over $L$-smooth and $\mu$-strongly convex individual functions, which is compatible with the stochastic oracle $S_\nabla$. A better dependence on the condition number is then achieved by applying the Catalyst acceleration framework. Moreover, the Catalyst framework also allows us to extend the scope of Q-SVRG to cases where the individual functions are only assumed to be convex, rather than strongly convex.

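Each iteration of SVRG starts by obtaining a full gradient of $F$ at some reference point $\tilde{w}$. Next, SVRG generates $m$ iterates by setting

$$w_t=w_{t-1}-\eta\big(\nabla f_{i_t}(w_{t-1})-\nabla f_{i_t}(\tilde{w})+\tilde{\mu}\big), \tag{5.26}$$

where $\tilde{\mu}=\nabla F(\tilde{w})$, $i_t\sim\mathrm{Unif}([n])$; and the step size $\eta$ and the number of inner iterations $m$ are assumed to be fixed throughout the optimization process. Lastly, one uses the average of the points $w_1,\dots,w_m$ as the reference point for the next iteration.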

The difficulty in implementing SVRG via stochastic first-order oracles is that one cannot simply form the full gradient at the reference point by sequentially iterating over the individual functions. To remedy this, we use the quantized gradient estimator which allows us to recover the full gradient w.h.p. This is formally stated as follows (see full proof in Section 7.2).

###### Lemma 2.

One can compute the full gradient of $F$ at a given point with success probability $1-\delta$ via $2n^2\log(2n/\delta)$ $S_\nabla$-calls.

With a union bound, Lemma 2 can be further used to obtain $K$ exact gradients (see Section 7.3 for details).

###### Corollary 3.

One can compute the full gradients of $F$ at $K$ different points in $\mathbb{R}^d$ with success probability of $1-\delta$ via $2n^2K\log(2nK/\delta)$ $S_\nabla$-calls.

Corollary 3 implies that w.p. at least $1-\delta$ over the oracle randomness, we can implement $K$ full iterations of SVRG using an overall number of $2n^2K\log(2nK/\delta)+Km$ $S_\nabla$-oracle calls, from which we conclude the following result.
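Putting the pieces together, a quantized SVRG loop over a stochastic oracle might look as follows. This is a hedged one-dimensional sketch rather than the authors' implementation: the function names, the scalar quadratics $f_i(w)=\tfrac12a_i(w-c_i)^2$, the step size, and the epoch counts are illustrative assumptions, and the oracle returns gradients only.

```python
import math
import random
from collections import Counter


def make_stochastic_oracle(grads, rng):
    """S_grad with a hidden index: one random f_i answers every query point."""
    def oracle(points):
        g = rng.choice(grads)
        return [g(w) for w in points]
    return oracle


def quantized_full_gradient(oracle, w, n, delta):
    # Recover the exact full gradient at w w.h.p. via the estimator (4.24);
    # sampled gradient values play the role of the categories.
    m = math.ceil(2 * n * n * math.log(2 * n / delta))
    counts = Counter(oracle([w])[0] for _ in range(m))
    return sum(math.floor(n * z / m + 0.5) * g for g, z in counts.items()) / n


def q_svrg(oracle, w0, n, eta, inner_steps, epochs, delta):
    w_ref = w0
    for _ in range(epochs):
        # Split the failure budget across epochs (union bound, as in Corollary 3).
        mu_tilde = quantized_full_gradient(oracle, w_ref, n, delta / epochs)
        w, running_sum = w_ref, 0.0
        for _ in range(inner_steps):
            g_w, g_ref = oracle([w, w_ref])  # B = 2: same hidden index for both
            w = w - eta * (g_w - g_ref + mu_tilde)
            running_sum += w
        w_ref = running_sum / inner_steps  # average of the inner iterates
    return w_ref


# Illustrative problem: f_i(w) = 0.5 * a_i * (w - c_i)^2 with minimizer w* = 1.
a = [1.0, 2.0, 3.0, 4.0]
c = [-1.0, 0.0, 1.0, 2.0]
grads = [lambda w, a_i=a_i, c_i=c_i: a_i * (w - c_i) for a_i, c_i in zip(a, c)]
oracle = make_stochastic_oracle(grads, random.Random(0))
w_hat = q_svrg(oracle, w0=5.0, n=4, eta=0.03, inner_steps=200, epochs=20, delta=0.01)
print(w_hat)
```

The index of the sampled component never leaves the oracle; the algorithm only sees gradient values, and distinct components that happen to share a gradient value at the reference point are simply merged into one category.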

###### Lemma 4.

With notations as above,

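$$\mathcal{M}_{\Sigma^L_\mu,S_\nabla}(\epsilon,\delta)=\tilde{\mathcal{O}}\Big(\Big(n^2+\frac{L}{\mu}\Big)\log\Big(\frac{1}{\delta\epsilon}\Big)\Big).$$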

Proof  The proof of the lemma follows by a direct application of Theorem 1 in [JZ13]. In particular, the same analysis holds conditioned on the event that the full gradients are exactly recovered at each reference point. Formally, for $k\in\mathbb{N}$, we denote by $E_k$ the event that the first $k$ full gradients are all exactly estimated. We have

$$\mathbb{E}\big[F(\hat{w}_k)-F^*\mid E_k\big]\le\alpha^k\Delta,$$

where the convergence factor $\alpha$ is given by

$$\alpha=\frac{1}{\mu\eta(1-2L\eta)m}+\frac{2L\eta}{1-2L\eta}.$$

Setting $\eta=\Theta(1/L)$ (with $\eta<1/(4L)$) and $m=\Theta(L/\mu)$ sufficiently large yields a constant $\alpha<1$. Therefore, by Markov’s inequality,

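$$P\big(F(\hat{w})-F^*>\epsilon\mid E_K\big)\le\frac{\alpha^K\Delta}{\epsilon}.$$

Then, choosing $K=\mathcal{O}\big(\log(\Delta/(\delta\epsilon))\big)$ large enough that $\alpha^K\Delta/\epsilon\le\delta/2$, we have

$$P\big(F(\hat{w})-F^*>\epsilon\mid E_K\big)\le\frac{\delta}{2}. \tag{5.27}$$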

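On the other hand, based on Lemma 2 and a union bound, by using $2n^2K\log(4nK/\delta)$ $S_\nabla$-oracle calls, we recover all $K$ full gradients with probability at least $1-\delta/2$. Together with (5.27), this yields

$$P\big(F(\hat{w})-F^*>\epsilon\big)\le P(E_K^c)+P\big(F(\hat{w})-F^*>\epsilon\mid E_K\big)P(E_K)\le\delta.$$

Therefore the overall number of $S_\nabla$-oracle calls required to obtain an $\epsilon$-optimal solution is bounded from above by

$$2n^2K\log\Big(\frac{4nK}{\delta}\Big)+Km=\tilde{\mathcal{O}}\Big(\Big(n^2+\frac{L}{\mu}\Big)\log\Big(\frac{\Delta}{\delta\epsilon}\Big)\Big).$$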

In other words, replacing the naive estimator by the quantized estimator is all we need to obtain an exponential convergence result. We emphasize that this result does not contradict the lower bound in Theorem 1 because Theorem 1 relies on the explicit form of the empirical estimator. Before moving on, it is worth noting that our algorithm requires the stochastic oracle to allow multiple simultaneous queries, i.e., $B\ge2$ in (2.10).

Next, to improve the dependence on condition number, we apply the Catalyst acceleration framework [LMH15] to Q-SVRG. Catalyst acts as a wrapper that takes as input an algorithm that converges exponentially fast and outputs an accelerated algorithm.

In detail, at iteration $k$, we replace the original objective function $F$ by an auxiliary objective $G_k$ defined by

$$G_k(w)=F(w)+\frac{\beta}{2}\|w-u_k\|^2,$$

where $\beta$ is a well-chosen regularization parameter and $u_k$ is obtained by extrapolating solutions of previous subproblems. We optimize $G_k$ up to accuracy $\epsilon_k$ and use the solution to warm-start the next subproblem.

For the complexity analysis of the Catalyst acceleration framework, assume that a given optimization algorithm $A$ converges exponentially fast for the smooth and strongly convex problems $G_k$, i.e.,

$$G_k(x_t)-G_k^*\le(1-\tau_A)^t\big(G_k(x_0)-G_k^*\big). \tag{5.28}$$

Then, applying Catalyst on $A$ yields [LMH15] a global complexity bound of

$$\tilde{\mathcal{O}}\Big(\frac{1}{\tau_A}\sqrt{\frac{\mu+\beta}{\mu}}\log\Big(\frac{1}{\epsilon}\Big)\Big),$$

for finding an $\epsilon$-optimal solution. We remark that $\beta$ is a free parameter and hence can be chosen to minimize the overall complexity bound.

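In our case, the convergence rate parameter is

$$\frac{1}{\tau_A}=\Big(n^2+\frac{L+\beta}{\mu+\beta}\Big)\log(1/\delta).$$

Therefore, the total number of $S_\nabla$-calls is given by

$$\tilde{\mathcal{O}}\Big(\Big(n^2+\frac{L+\beta}{\mu+\beta}\Big)\sqrt{\frac{\mu+\beta}{\mu}}\log(1/\delta)\log(1/\epsilon)\Big). \tag{5.29}$$

Minimizing the total number of $S_\nabla$-calls with respect to $\beta$ and plugging in the resulting value gives the desired accelerated complexity bound

$$\tilde{\mathcal{O}}\Big(\Big(n^2+n\sqrt{\frac{L}{\mu}}\Big)\log(1/\delta)\log(1/\epsilon)\Big).$$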
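One way to see where the $n\sqrt{L/\mu}$ term comes from is to evaluate the $\beta$-dependent factor of (5.29) at a specific choice. The computation below is a sketch under the assumption $\beta=L/n^2$ in the regime $L/\mu\ge n^2$; this choice is made for illustration and is not necessarily the exact minimizer used by the authors.

```latex
% Assumption: \beta = L/n^2 and L/\mu \ge n^2, so that \beta \ge \mu.
\frac{L+\beta}{\mu+\beta} \le \frac{L+\beta}{\beta} = n^2 + 1,
\qquad
\sqrt{\frac{\mu+\beta}{\mu}} \le \sqrt{\frac{2\beta}{\mu}}
  = \frac{\sqrt{2}}{n}\sqrt{\frac{L}{\mu}},
\\
\Big(n^2 + \frac{L+\beta}{\mu+\beta}\Big)\sqrt{\frac{\mu+\beta}{\mu}}
  \le (2n^2+1)\,\frac{\sqrt{2}}{n}\sqrt{\frac{L}{\mu}}
  = \mathcal{O}\Big(n\sqrt{\frac{L}{\mu}}\Big).
% In the complementary regime L/\mu < n^2, taking \beta = 0 gives
% n^2 + L/\mu \le 2n^2, so both regimes fit in O(n^2 + n\sqrt{L/\mu}).
```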

For randomized incremental methods, the acceleration occurs in the ill-conditioned regime where $L/\mu\ge n$. Here, due to the augmented cost of evaluating the full gradient, acceleration only occurs in the extremely ill-conditioned regime in which $L/\mu\ge n^2$.

The Catalyst framework is also useful as a means of extending the applicability of Q-SVRG to smooth convex finite sums. This follows from the fact that the subproblems $G_k$ are always strongly convex when $\beta>0$.

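In this case, by [LMH15], if $A$ is an algorithm which satisfies (5.28), then applying Catalyst yields a global complexity bound of

$$\tilde{\mathcal{O}}\Big(\frac{1}{\tau_A}\sqrt{\frac{\beta}{\epsilon}}\log\Big(\frac{1}{\epsilon}\Big)\Big)$$

for finding an $\epsilon$-optimal solution. Again, minimizing the global complexity with respect to the parameter $\beta$, we obtain the following complexity bound

$$\tilde{\mathcal{O}}\Big(\sqrt{\frac{Ln^2\log(1/\delta)}{\epsilon}}\,\log\Big(\frac{1}{\delta\epsilon}\Big)\Big).$$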

Both complexity bounds are summarized as follows.

###### Theorem 3.

With notations as above,

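$$\mathcal{M}_{\Sigma^L_\mu,S_\nabla}(\epsilon,\delta)=\tilde{\mathcal{O}}\Big(\big(n^2+n\sqrt{L/\mu}\big)\log\Big(\frac{1}{\delta\epsilon}\Big)\Big),$$

$$\mathcal{M}_{\Sigma^L_0,S_\nabla}(\epsilon,\delta)=\tilde{\mathcal{O}}\Big(\big(n^2+n\sqrt{L/\epsilon}\big)\log\Big(\frac{1}{\delta\epsilon}\Big)\Big).$$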

The bounds stated in Theorem 3 are partly complemented by the lower bounds given in (2.19) and (2.20), namely, $\tilde{\Omega}\big(n^2+\sqrt{nL/\mu}\,\log(1/\epsilon)\big)$ and $\tilde{\Omega}\big(n^2+\sqrt{nL/\epsilon}\big)$. Specifically, in both cases, the proposed SVRG variant is tight w.r.t. the global term $n^2$. However, the first-order term of Q-SVRG corresponds to that of deterministic methods, namely, $n\sqrt{L/\mu}$ (in the strongly convex case), whereas the first-order term of randomized incremental methods is $\sqrt{nL/\mu}$, and thus misses a factor of $\sqrt{n}$.

## 6 Discussion and Future Work

In this paper, we showed that although current variance-reduced finite-sum methods directly rely on the indices of the individual functions, it is possible to achieve variance reduction for obtaining exponential convergence rates even without this knowledge. Although this variance reduction cannot be achieved by simply averaging over the gradients, a simple rounding procedure suffices to obtain an exponential convergence rate w.h.p.

The cost of not having access to or disregarding the indices of the individual functions (as is often done in practice) is an expensive $n^2$-term in the upper complexity bound, which is inevitable for stochastic methods compatible with stochastic oracles. This leads to a factor of $n$ in the first-order term (rather than the $\sqrt{n}$-factor exhibited by incremental methods) which we suspect is tight for stochastic methods. We leave addressing this gap to future work.

One limitation of our approach is the requirement of issuing two or more queries simultaneously (i.e., $B\ge2$ in (2.10)). This assumption is necessary to compute the expression

 ∇fit(w)−∇fit(~w)

for the SVRG update rule. Replacing it by

 ∇fit(w)−∇fjt(~w)

introduces additional variance and breaks the current analysis. Since our quantized estimator is still applicable when $B=1$, one can still implement GD or AGD and obtain an exponential convergence rate of $\tilde{\mathcal{O}}\big(n^2\sqrt{L/\mu}\,\log(1/\epsilon)\big)$ (for AGD), which is nontrivial to achieve in the stochastic setting.

That said, to the best of our knowledge, no variance reduction technique is applicable with $B=1$. This leads to an interesting open question: is it possible to exploit the finite-sum structure and achieve an exponential convergence rate better than $\tilde{\mathcal{O}}\big(n^2\sqrt{L/\mu}\,\log(1/\epsilon)\big)$ in the stochastic first-order setting with $B=1$? Addressing this question will provide further understanding of the variance reduction technique.

### Acknowledgments

This research was supported by The Defense Advanced Research Projects Agency (grant number YFA17N66001-17-1-4039). The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

## References

• [Arj17] Yossi Arjevani. Limitations on variance-reduction and acceleration schemes for finite sums optimization. In Advances in Neural Information Processing Systems, pages 3540–3549, 2017.
• [AS16] Yossi Arjevani and Ohad Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems, pages 3540–3548, 2016.
• [AWBR09] Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
• [AZ17] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194–8244, 2017.
• [Ber41] Andrew C Berry. The accuracy of the gaussian approximation to the sum of independent variates. Transactions of the american mathematical society, 49(1):122–136, 1941.
• [DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
• [DD14] Aaron Defazio, Justin Domke, et al. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pages 1125–1133, 2014.
• [Ess42] C.-G. Esseen. On the liapunoff limit of error in the theory of probability. Arkiv för Matematik, Astronomi och Fysik, A28, 1942.
• [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
• [KS10] V Yu Korolev and Irina G Shevtsova. On the upper bound for the absolute constant in the berry–esseen inequality. Theory of Probability & Its Applications, 54(4):638–658, 2010.
• [LCB07] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. Large scale kernel machines, pages 301–320, 2007.
• [LMH15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.
• [Mai15] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
• [MB11] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
• [Nes04] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.
• [NJLS09] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
• [NW16] Deanna Needell and Rachel Ward. Batched stochastic gradient descent with weighted sampling. In International Conference Approximation Theory, pages 279–306. Springer, 2016.
• [NWS14] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in neural information processing systems, pages 1017–1025, 2014.
• [NY83] AS Nemirovsky and DB Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience, New York, 1983.
• [QRG19] Xun Qian, Peter Richtárik, Robert M. Gower, Alibek Sailanbayev, Nicolas Loizou, and Egor Shulgin. SGD with arbitrary sampling: General analysis and improved rates. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 5200–5209, 2019.
• [RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
• [RR11] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dynamics in convex programming. Information Theory, IEEE Transactions on, 57(10):7036–7056, 2011.
• [Sha16] Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 747–754, 2016.
• [SLRB13] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.
• [SSSSC11] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
• [SSZ13] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
• [SSZ16] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.
• [TW80] Joseph Frederick Traub and Henryk Wozniakowski. A general theory of optimal algorithms. Academic Press New York, 1980.
• [WS16] Blake E Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.

## 7 Supplementary Material

### 7.1 Quantized Estimator is Biased

Consider , i.e. there are category, each having probability . Now assume , then all the possible couples (up to permutation) of are , , , , . The corresponding are , , , , . Note that all the couples sum up to expect, . Thus the estimator is biased. (If it is unbiased, the sum of the expectation should be , but here it is .)
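The bias of a rounded count estimator can be checked by exact enumeration. The following sketch uses a hypothetical instance that is not necessarily the one discussed above: `n = 3` functions, `m = 5` oracle answers, and a gradient shared by `n1 = 1` function, so that its count $Z$ is binomial with success probability $1/3$. The helper `rnd` (round half up) is an assumed convention; no ties occur for these values.

```python
from fractions import Fraction
from math import comb, floor


def rnd(x):
    """Round to the nearest integer (half up); no ties arise below."""
    return floor(x + Fraction(1, 2))


# Hypothetical instance: n functions, m answers, multiplicity n1.
n, m, n1 = 3, 5, 1
p = Fraction(n1, n)  # probability that one answer hits this gradient

# Exact expectation of rnd(n * Z / m) with Z ~ Binomial(m, p).
expected = sum(
    comb(m, k) * p**k * (1 - p) ** (m - k) * rnd(Fraction(n * k, m))
    for k in range(m + 1)
)
print(expected)  # 263/243
```

Since the expectation differs from the true multiplicity `n1 = 1`, the rounded estimator is biased in this instance, even though it recovers the multiplicity exactly with high probability once `m` is large enough.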

### 7.2 Proof of lem:full_gradient

Proof  Let $g_1,\dots,g_n$ be the gradients of the individual functions at some point $w$, and let $g'_1,\dots,g'_q$ denote the set of distinct gradients (note that $q\le n$, with strict inequality if two functions share the same gradient). Denote by $n_i$ the number of individual functions whose gradient equals $g'_i$. Note that the full gradient can be equivalently expressed as:

$$g=\frac{1}{n}\sum_{i=1}^{n}g_i=\frac{1}{n}\sum_{i=1}^{q}n_i\,g'_i.$$

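Let $m$ denote the number of answers returned by the first-order oracle, and let $\hat{g}'_1,\dots,\hat{g}'_{\hat{q}}$ denote the corresponding set of distinct gradients (here $\hat{q}\le q$, with strict inequality if one of the gradients was not sampled). We let $Z_i$ denote the number of answers equal to $\hat{g}'_i$, and estimate the gradient through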

$$\hat{g}=\frac{1}{n}\sum_{i=1}^{\hat{q}}\mathrm{rnd}\!\left(\frac{nZ_i}{m}\right)\hat{g}'_i.$$

By lem:categorical_estimation, with high probability and up to a permutation of the indices, we have $\hat{q}=q$, $\hat{g}'_i=g'_i$ and $\mathrm{rnd}(nZ_i/m)=n_i$ for every $i\in[q]$, in which case

$$\hat{g}=\frac{1}{n}\sum_{i=1}^{q}n_i\,g'_i=g.$$
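The estimator used in this proof can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the gradients `grads`, the sample size `m = 2000`, and the function name `quantized_full_gradient` are all hypothetical, and gradients are represented as integer tuples so that the arithmetic is exact. The oracle is simulated by sampling with replacement, without exposing any index to the estimator.

```python
import random
from collections import Counter
from fractions import Fraction


def quantized_full_gradient(answers, n):
    """Estimate the full gradient from index-free oracle answers.

    Each distinct gradient is weighted by rnd(n * Z_i / m), where Z_i is
    the number of times it was observed among the m answers.
    """
    m = len(answers)
    counts = Counter(answers)  # Z_i for every distinct sampled gradient
    dim = len(answers[0])
    estimate = [Fraction(0)] * dim
    for g, z_count in counts.items():
        weight = round(n * z_count / m)  # estimated multiplicity n_i
        for d in range(dim):
            estimate[d] += Fraction(weight * g[d], n)
    return estimate


# Hypothetical gradients of n = 4 individual functions (two coincide).
grads = [(1, 0), (1, 0), (0, 2), (-3, 1)]
n = len(grads)
exact = [Fraction(sum(g[d] for g in grads), n) for d in range(2)]

# Simulated stochastic oracle: m answers drawn uniformly with replacement.
rng = random.Random(0)
answers = rng.choices(grads, k=2000)

estimate = quantized_full_gradient(answers, n)
print(estimate == exact)
```

With this many answers, every empirical frequency lies well within $1/(2n)$ of the true one, so the rounding step recovers each multiplicity and the estimate coincides with the full gradient rather than merely approximating it.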

### 7.3 Proof of corr:k_full_gradients

Proof  By lem:full_gradient, -calls suffice to compute the full gradient at a given point with failure probability of at most . Hence, by the union bound, full gradients can be obtained with failure probability of at most by using -calls.

### 7.4 Proof of thm:naive estimator

Proof  Assume is even, and we define . In this case, and the minimum . We now consider applying gradient descent (GD) with the naive unbiased gradient estimator on .

At iteration $k$, we issue $m$ stochastic oracle queries, which is equivalent to picking $m$ points $z_{k,1},\dots,z_{k,m}$ independently at random. The update is then given by

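$$\begin{aligned}
x_{k+1} &= x_k-\alpha g(x_k) \\
&= x_k-\frac{\alpha}{m}\sum_{j=1}^{m}(x_k+z_{k,j}) \\
&= x_k-\alpha\Big(x_k+\underbrace{\frac{1}{m}\sum_{j=1}^{m}z_{k,j}}_{z_k}\Big)=(1-\alpha)x_k-\alpha z_k.
\end{aligned}$$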

Note that $z_k\sim\frac{2}{m}B(m,1/2)-1$, where $B(m,1/2)$ is the binomial distribution. Therefore,

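$$\begin{aligned}
\mathbb{E}\big[\|x_{k+1}-x^*\|^2\big] &= (1-\alpha)^2x_k^2+\alpha^2\,\mathbb{E}[z_k^2] \\
&= (1-\alpha)^2x_k^2+\frac{4\alpha^2}{m^2}\mathrm{Var}\big(B(m,1/2)\big) \\
&= (1-\alpha)^2\|x_k-x^*\|^2+\frac{\alpha^2}{m}.
\end{aligned}$$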

Unrolling this recursion yields

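$$\begin{aligned}
\mathbb{E}\big[\|x_k-x^*\|^2\big] &= (1-\alpha)^{2k}\,\mathbb{E}\big[\|x_0-x^*\|^2\big]+\frac{\alpha^2}{m}\sum_{i=0}^{k-1}(1-\alpha)^{2i} \\
&= (1-\alpha)^{2k}\,\mathbb{E}\big[\|x_0-x^*\|^2\big]+\frac{\alpha}{m}\cdot\frac{1-(1-\alpha)^{2k}}{2-\alpha}.
\end{aligned}$$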
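The closed form above can be sanity-checked with exact rational arithmetic. The sketch below assumes, as the variance term in the derivation suggests, that $z_k$ is distributed as $\frac{2}{m}B-1$ with $B\sim B(m,1/2)$; the function names and the parameter values (`alpha = 1/4`, `m = 8`, `k = 10`, initial squared distance 4) are illustrative choices only.

```python
from fractions import Fraction
from math import comb


def second_moment_zk(m):
    """Exact E[z_k^2] for z_k = (2/m) * B - 1, with B ~ Binomial(m, 1/2)."""
    return sum(
        Fraction(comb(m, b), 2**m) * (Fraction(2 * b, m) - 1) ** 2
        for b in range(m + 1)
    )


def iterate(e0, alpha, m, k):
    """Apply e <- (1 - alpha)^2 * e + alpha^2 * E[z_k^2] for k steps."""
    e = e0
    noise = second_moment_zk(m)
    for _ in range(k):
        e = (1 - alpha) ** 2 * e + alpha**2 * noise
    return e


def closed_form(e0, alpha, m, k):
    """Closed-form value of the expected squared distance after k steps."""
    r = (1 - alpha) ** (2 * k)
    return r * e0 + alpha / m * (1 - r) / (2 - alpha)


alpha, m, k, e0 = Fraction(1, 4), 8, 10, Fraction(4)
print(second_moment_zk(m) == Fraction(1, m))
print(iterate(e0, alpha, m, k) == closed_form(e0, alpha, m, k))
```

Both comparisons hold exactly, which confirms that the noise term contributes $\alpha^2/m$ per step and that the geometric sum collapses to $\frac{\alpha}{m}\cdot\frac{1-(1-\alpha)^{2k}}{2-\alpha}$.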

Note that $F(x)-F^*=\frac{1}{2}\|x-x^*\|^2$. Therefore, recalling that $\Delta=F(x_0)-F^*$, we have

$$\mathbb{E}\big[F(x_k)-F^*\big]=(1-\alpha)^{2k}\Delta+\frac{\alpha}{2m}\cdot\frac{1-(1-\alpha)^{2k}}{2-\alpha}.$$

In order to guarantee $\mathbb{E}[F(x_k)-F^*]\le\epsilon$, it is necessary to have both

$$(1-\alpha)^{2k}\Delta\le\epsilon\quad\text{and}\quad\frac{\alpha}{2m}\cdot\frac{1-(1-\alpha)^{2k}}{2-\alpha}\le\epsilon.$$

This implies

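$$k\ge\frac{\log(\Delta/\epsilon)}{-2\log(1-\alpha)}\quad\text{and}\quad m\ge\frac{\alpha\big(1-(1-\alpha)^{2k}\big)}{2(2-\alpha)\epsilon}\ge\frac{\alpha\big(1-\epsilon/\Delta\big)}{2\epsilon}.$$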

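Let $T$ denote the total number of oracle calls. If $\alpha\ge 1/2$, then

$$T\ge m\ge\frac{1-\epsilon/\Delta}{4\epsilon}.$$

If $\alpha<1/2$, then

$$T\ge km\ge\frac{(1-\epsilon/\Delta)\log(\Delta/\epsilon)}{8\epsilon}.$$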

Therefore, in both cases, the complexity of obtaining an $\epsilon$-optimal solution of $F$ is lower bounded by $\Omega(1/\epsilon)$.

High probability result:
We consider the same function . Without loss of generality, we assume . Note that , bounding is equivalent to bounding