Stochastic Variance Reduction via Accelerated Dual Averaging for Finite-Sum Optimization

In this paper, we introduce a simplified and unified method for finite-sum convex optimization, named Stochastic Variance Reduction via Accelerated Dual Averaging (SVR-ADA). In the nonstrongly convex and smooth setting, SVR-ADA can attain an O(1/n)-accurate solution in nloglog n number of stochastic gradient evaluations, where n is the number of samples; meanwhile, SVR-ADA matches the lower bound of this setting up to a loglog n factor. In the strongly convex and smooth setting, SVR-ADA matches the lower bound in the regime n< O(κ) while it improves the rate in the regime n≫κ to O(nloglog n +nlog(1/(nϵ))/log(n/κ)), where κ is the condition number. SVR-ADA improves complexity of the best known methods without use of any additional strategy such as optimal black-box reduction, and it leads to a unified convergence analysis and simplified algorithm for both the nonstrongly convex and strongly convex settings. Through experiments on real datasets, we also show the superior performance of SVR-ADA over existing methods for large-scale machine learning problems.


1 Introduction

In this paper, we study the following composite convex optimization problem:

$$\min_{x\in\mathbb{R}^d} f(x) := g(x) + l(x) := \frac{1}{n}\sum_{i=1}^{n} g_i(x) + l(x), \quad (1)$$

where $g$ is a convex function given as a finite sum of convex, smooth sample functions $g_i$, and $l$ is convex, possibly nonsmooth, but admits an efficient proximal operator. In this paper, we mainly assume that each $g_i$ is $L$-smooth and $l$ is $\sigma$-strongly convex ($\sigma \ge 0$). If $\sigma = 0$, then the problem is nonstrongly convex. If $\sigma > 0$, then the problem is strongly convex and we define the corresponding condition number $\kappa := L/\sigma$. Instances of problem (1) appear widely in statistical learning, operational research, and signal processing. For instance, in machine learning, if $g_i(x) = \ell_i(\langle a_i, x\rangle)$, where $\ell_i$ is a convex loss and $a_i$ is the data vector of the $i$-th sample, then problem (1) is also called regularized empirical risk minimization (ERM). Important instances of ERM include ridge regression, Lasso, logistic regression, and the support vector machine.
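As a concrete illustration (ours, not from the paper), the finite-sum template (1) can be instantiated as ridge regression with $g_i(x)=\frac{1}{2}(a_i^\top x-b_i)^2$ and $l(x)=\frac{\lambda}{2}\|x\|^2$. A minimal Python sketch with synthetic data:

```python
import numpy as np

# Hypothetical sketch: ridge regression as a finite-sum instance of (1),
# with g_i(x) = (1/2)(a_i^T x - b_i)^2 and l(x) = (lam/2)||x||^2.
# All data and names here are illustrative, not from the paper.
rng = np.random.default_rng(0)
n, d, lam = 200, 10, 1e-2
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def g_i(x, i):
    # one smooth sample function
    return 0.5 * (A[i] @ x - b[i]) ** 2

def f(x):
    # f(x) = (1/n) sum_i g_i(x) + l(x)
    return np.mean([g_i(x, i) for i in range(n)]) + 0.5 * lam * (x @ x)

# Each g_i is L_i-smooth with L_i = ||a_i||^2; a global L is their maximum.
L = max(np.linalg.norm(A[i]) ** 2 for i in range(n))

x0 = np.zeros(d)
# At x = 0 the objective equals the average of (1/2) b_i^2.
assert abs(f(x0) - 0.5 * np.mean(b ** 2)) < 1e-9
```

The sketch only fixes notation: the algorithms discussed below access this objective exclusively through individual gradients $\nabla g_i$ and the proximal operator of $l$.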

1.1 Context and Motivation

In the large-scale setting where $n$ is large, first-order methods become the natural choice for solving (1) due to their better scalability. However, when $n$ is very large, even accessing the full gradient $\nabla g(x)$ becomes prohibitively expensive. To alleviate this difficulty, a common approach is to use a stochastic gradient $\nabla g_i(x)$, with $i$ sampled uniformly at random from $\{1,\dots,n\}$, to replace the full gradient in each iteration, a.k.a. stochastic gradient descent (SGD). In the stochastic setting, the goal of solving (1) becomes to find an expected $\epsilon$-accurate solution $x$ satisfying $\mathbb{E}[f(x)] - f(x^*) \le \epsilon$, where $x^*$ is an exact minimizer of (1). Typically, the complexity or convergence rate of such an algorithm is evaluated by the number of stochastic gradient evaluations needed to achieve the $\epsilon$-accurate solution.

Due to the need of accessing only stochastic gradients, SGD has a low per-iteration cost. However, SGD has a very slow convergence rate due to the constant variance $\mathbb{E}\|\nabla g_i(x)-\nabla g(x)\|^2$ of its gradient estimates. To improve the rate of SGD while still maintaining its low per-iteration cost, a remarkable line of progress in the past decade exploits the finite-sum structure of $g$ in (1) to reduce the variance of the stochastic gradients. In such variance reduction methods, instead of directly using $\nabla g_i(x)$, we compute a full gradient $\nabla g(\tilde{x})$ of an anchor point $\tilde{x}$ beforehand. Then we use the following variance reduced gradient

$$\tilde{\nabla} g_i(x) := \nabla g_i(x) - \nabla g_i(\tilde{x}) + \nabla g(\tilde{x}) \quad (2)$$

as a proxy for the full gradient during each iteration. As a result, the amortized per-iteration cost is still the same as that of SGD. However, the variance reduced gradient (2) is unbiased, and its variance, which is controlled by the distance between $x$ and the anchor point $\tilde{x}$, vanishes asymptotically as both approach the optimum; thus the convergence rate of SGD can be substantially improved.
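To make the effect of (2) concrete, the following sketch (ours; the least-squares data and all names are illustrative) compares the variance of the plain stochastic gradient with that of the variance reduced gradient when the iterate is close to the anchor:

```python
import numpy as np

# Illustrative least-squares setup: grad_i(x) = a_i (a_i^T x - b_i).
rng = np.random.default_rng(1)
n, d = 500, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    return A.T @ (A @ x - b) / n

x_anchor = rng.standard_normal(d)                 # anchor point, plays \tilde{x}
x = x_anchor + 1e-3 * rng.standard_normal(d)      # current iterate near the anchor
g_full = full_grad(x)
g_anchor = full_grad(x_anchor)

var_plain, var_vr = 0.0, 0.0
for i in range(n):
    # plain stochastic gradient deviation from the full gradient
    var_plain += np.sum((grad_i(x, i) - g_full) ** 2) / n
    # variance reduced gradient, eq. (2)
    vr = grad_i(x, i) - grad_i(x_anchor, i) + g_anchor
    var_vr += np.sum((vr - g_full) ** 2) / n

# The reduced variance shrinks with ||x - x_anchor||, so it is far smaller here,
# while the estimator remains unbiased (its average over i is the full gradient).
assert var_vr < var_plain
```

Averaging the variance reduced estimator over $i$ recovers $\nabla g(x)$ exactly, which is the unbiasedness claimed in the text.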

SAG [RSB12] is one of the first variance reduction methods, although it uses a biased estimate of the full gradient. SVRG [JZ13] directly solves (1) and explicitly uses the unbiased estimate (2) to reduce variance. SAGA [DBLJ14] then provides an alternative to (2) that avoids precomputing the gradient of an anchor point, at the price of an increased memory cost. Based on [SSZ14], the Catalyst approach [LMH15] combines Nesterov's acceleration with variance reduction methods in a black-box manner. [AZ17] proposed the first direct approach, named Katyusha (a.k.a. accelerated SVRG), to combine variance reduction and a form of Nesterov's acceleration in a principled manner. [WS16] gave a tight lower complexity bound for finite-sum stochastic optimization and showed the tightness of Katyusha up to a logarithmic factor. MiG [ZSC18] follows and simplifies Katyusha by using only negative momentum to produce acceleration.

In Table 1, we give the complexity results of representative variance reduction methods for both the nonstrongly convex and strongly convex settings (as well as the results of this paper). The literature on variance reduction is too rich to list in full here. (For instance, when the objective (1) is strongly convex, one can also use randomized coordinate descent/ascent methods on the dual or primal-dual formulation of (1) to solve (1) indirectly, such as the non-accelerated SDCA [SSZ13] and the accelerated variants Acc-SDCA [SSZ14], APCG [LLX14] and SPDC [ZX15]. Variance reduced methods have also been widely applied to distributed computing [RHS15, LPLJ17] and nonconvex optimization [RHS16, RSPS16].) As Katyusha [AZ17] remains state of the art for nearly all (if not all) convex settings, we mainly use Katyusha as the baseline for comparison.

To understand where we stand with these complexity results, in the nonstrongly convex setting we are particularly interested in attaining a solution with a proper accuracy such as $\epsilon = O(1/n)$. (This is because, in the context of large-scale statistical learning, due to statistical limits [BB08, SSS08], even under some strong regularity conditions [BB08], an $O(1/n)$ accuracy is sufficient.) To attain this accuracy, Katyusha (as well as SVRG) needs about $O(n\log n)$ stochastic gradient evaluations, whereas the lower bound [WS16] implies that we may only need $O(n)$. Before this work, it was not known whether this logarithmic-factor gap could be further improved.

In the strongly convex setting, the rate of Katyusha is optimal in the case $n \le O(\kappa)$, according to the lower bound given by [WS16]. However, when $n \gg \kappa$ (which is common in the statistical learning context, e.g., [BE02]), the rate of the accelerated Katyusha is no better than that of non-accelerated methods such as SVRG and SAGA. Before this work, it was not known whether the rate in the regime $n \gg \kappa$ could be further improved, or whether accelerated variance reduced methods can indeed be better than non-accelerated ones there.

1.2 Our Approach and Contributions

We believe that the above issues with Katyusha are the result of the acceleration strategy it adopts, which follows the idea of combining gradient descent with mirror descent [AZ17]. The mirror descent step can only exploit the strong convexity in the current iteration and cannot exploit the strong convexity along the whole optimization trajectory. As a result, although such a combination can lead to optimal dependence on the condition number $\kappa$, it leads to suboptimal dependence on the sample size $n$, particularly when $n \gg \kappa$ in the strongly convex setting or when $\epsilon = O(1/n)$ in the nonstrongly convex setting. Although no improved complexity results were obtained after Katyusha, MiG [ZSC18] simplifies Katyusha by showing that it suffices for acceleration to use only negative momentum and perform one mirror descent step in each iteration.

In this paper, we follow the simple algorithmic framework of MiG [ZSC18] but consider a different acceleration strategy, based on Nesterov's idea of combining gradient descent and dual averaging [Nes15]; hence the name stochastic variance reduction via accelerated dual averaging (SVR-ADA). The superiority of dual averaging in exploiting the structure of the regularizer was elaborated in [Xia10] (a NeurIPS test-of-time award paper). Using dual averaging, [Xia10] shows that one can accumulate the weight of a sparse regularizer such as the $\ell_1$-norm to produce sparser solutions than the composite mirror descent of [DSSST10] in the online optimization context. Here, we use dual averaging to accumulate the weight (a.k.a. the strong convexity constant) of the $\ell_2$-norm along the optimization trajectory. (The strong convexity implies that $f$ can be written as the sum of a convex part and $\frac{\sigma}{2}\|\cdot\|^2$.) As we will show, this leads to better complexity dependence on the sample size in the finite-sum optimization context. To fully exploit the potential of dual averaging, we introduce a generalized estimation sequence for the finite-sum setting and a simple but novel initialization strategy that substantially improves the initial convergence behavior. As a result, SVR-ADA simplifies the convergence analysis significantly and improves the state-of-the-art convergence rates in both the nonstrongly and strongly convex settings.

Efficiency.

In the nonstrongly convex setting, as shown in Table 1, our SVR-ADA method achieves the rate

$$O\left(n\log\log n + \sqrt{\frac{nL}{\epsilon}}\right), \quad (3)$$

which matches the lower bound up to a $\log\log n$ factor. So to attain a solution with $\epsilon = O(1/n)$ we only need $O(n\log\log n)$ stochastic gradient evaluations, where the $\log\log n$ factor is due to the superlinear rate in the initial stage, whereas the best previously known result is $O(n\log n)$. Practically speaking, the $\log\log n$ factor can be treated as a small constant: for instance, when $n = 2^{64}$, we have $\log_2\log_2 n = 6$. Thus, for nonstrongly convex problems, SVR-ADA can attain an $O(1/n)$-accurate solution with essentially $O(n)$ iterations, practically matching the lower bound!

In the strongly convex setting (here we use the indicator function $\mathbb{1}\{\cdot\}$, with value $1$ when the event is true and $0$ otherwise), as shown in Table 1, SVR-ADA first attains an initial $O(1/n)$ accuracy within $O(n\log\log n)$ stochastic gradient evaluations. Then, when $n = O(\kappa)$ and $\epsilon$ is small enough, the bound of SVR-ADA becomes

$$O\left(n\log\log n + \sqrt{n\kappa}\,\log\frac{L}{n\epsilon}\right), \quad (4)$$

which matches the lower bound given the reduced initial error $O(1/n)$. In contrast, the best existing methods such as Katyusha do not theoretically enjoy such good initial behavior. Meanwhile, for this setting, if we do not consider the superlinear behavior in the initial stage, then the rate of SVR-ADA can also be characterized as $O((n+\sqrt{n\kappa})\log(1/\epsilon))$.

Meanwhile, when $n \gg \kappa$, the rate of SVR-ADA becomes

$$O\left(n\log\log n + \frac{n\log(L/(n\epsilon))}{\log(n/\kappa)}\right). \quad (5)$$

Again, if we do not consider the superlinear behavior, then for $n \gg \kappa$ the bound can also be characterized as $O(n\log(1/\epsilon)/\log(n/\kappa))$. In this regime, as shown in Table 1, both non-accelerated and accelerated methods previously had the rate $O(n\log(1/\epsilon))$. Thus SVR-ADA improves the best known result by a $\log(n/\kappa)$ factor, which is significant when $n$ is very large and $\kappa$ is comparatively small. For the first time, our results show that in the regime $n \gg \kappa$, accelerated methods can be better than non-accelerated methods for finite-sum optimization.

Simplicity.

In the nonstrongly convex setting, besides the improved complexity result, SVR-ADA is also much simpler than the best known method of combining Katyusha with an optimal black-box reduction strategy [AZH16], as SVR-ADA directly matches the lower bound without needing any additional strategy. In addition, the optimal black-box reduction strategy for Katyusha (and similarly the geometrically increasing epoch-length strategy for SVRG [AZY16]) needs to estimate unknown quantities such as the initial objective gap $f(\tilde{x}_0)-f(x^*)$ and the initial distance $\|\tilde{x}_0-x^*\|$, which complicates the implementation. In the strongly convex setting, SVR-ADA uses a natural uniform average in the algorithm implementation, while Katyusha uses a weighted average that also complicates the implementation.

Unification.

Different from Katyusha$^{\mathrm{ns}}$ and Katyusha$^{\mathrm{sc}}$, which use different parameter settings and independent convergence analyses, SVR-ADA uses the same parameter setting for both the nonstrongly convex and strongly convex settings. (As we will discuss, the Varag [LLZ19] algorithm is also unified across both settings.) The only difference is that in SVR-ADA we set the parameter $\sigma = 0$ in the nonstrongly convex setting, while we set $\sigma > 0$ in the strongly convex setting. Meanwhile, based on a "generalized estimation sequence," we conduct a unified convergence analysis for both settings; the only difference is in the values of two predefined sequences of positive numbers.

1.3 Other Related Works

Regarding the lower bound under sampling with replacement.

When problem (1) is $\sigma$-strongly convex and $L$-smooth with $\sigma > 0$, [LZ18a] provided a lower bound stronger than that of [WS16]: to find an $\epsilon$-solution $x$ such that $\mathbb{E}\|x-x^*\|^2 \le \epsilon$, any randomized incremental gradient method needs at least

$$\Omega\left(\left(n+\sqrt{n\kappa}\right)\log\frac{1}{\epsilon}\right) \quad (6)$$

iterations when the dimension is sufficiently large. Our upper bound (5) is measured by the objective gap $\mathbb{E}[f(\tilde{x}_s)]-f(x^*)$. By strong convexity, when $n \gg \kappa$, if we convert (5) into a bound on the Euclidean distance $\mathbb{E}\|x-x^*\|^2$, our rate remains of the same order. At first sight, in this regime our upper bound is actually better than the lower bound (6) by a $\log(n/\kappa)$ factor, which seems rather surprising. Nevertheless, by carefully examining the assumptions under which [LZ18a] derived the lower bound (6), one sees that the randomized incremental gradient methods considered in [LZ18a] are restricted to those that sample with replacement exclusively. As it turns out, the SVR-ADA algorithm proposed in this paper does not satisfy this assumption. SVR-ADA is based on the two-loop structure of SVRG: in the outer loop, we compute the full gradient of an anchor point; in the inner loop, we compute stochastic gradients by sampling with replacement. In the outer loop of SVRG, the step of computing the full gradient can be viewed as $n$ stochastic gradient steps, with zero step size, by (implicitly) sampling without replacement. (The outer loop of SVRG or of our algorithm cannot be interpreted as stochastic gradient steps by sampling with replacement, as we cannot pick all $n$ samples with probability $1$ when sampling with replacement $n$ times.) Thus, SVR-ADA does not fall into the class of randomized incremental algorithms that [LZ18a] examined, and the lower bound (6) does not apply to SVR-ADA.

Meanwhile, although Katyusha and MiG (and also Varag [LLZ19] and ASVRG [SJZ18]) are all accelerated versions of SVRG, none of them obtains the improved upper bound (5). We believe this stems from the simple fact we have discussed: all of them achieve acceleration in the style of mirror descent [AZO17], which cannot effectively exploit the strong convexity along the optimization path. Thus our improved rate in (5) can be attributed to the combined merits of the (implicit) sampling without replacement of SVRG and the acceleration effect of Nesterov-style dual averaging.

Remark 1 (Sampling without Replacement)

Very recently, the superiority of sampling without replacement has also been verified theoretically [HS18, GOP19, RGP20, AS20]. In particular, [AS20] showed that for strongly convex and smooth finite-sum problems, SGD without replacement (also known as random reshuffling) needs a number of stochastic gradient evaluations that is tight and significantly better than the rate of SGD with replacement [HK14]. Meanwhile, in practice, sampling without replacement is also more widely used in training deep neural networks for its better efficiency [Bot09, RR13].
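The distinction between the two sampling schemes can be made concrete with a small sketch (ours; purely illustrative): one epoch of index draws with replacement may repeat or miss samples, while random reshuffling touches each sample exactly once:

```python
import numpy as np

# Illustrative contrast between one epoch of index draws with replacement
# and one epoch of random reshuffling (without replacement).
rng = np.random.default_rng(2)
n = 8

with_replacement = rng.integers(0, n, size=n)   # may repeat / miss samples
reshuffled = rng.permutation(n)                 # each sample exactly once

# Random reshuffling visits every index; with replacement, visiting all n
# indices in n draws has probability n!/n^n, which is why a full-gradient
# pass cannot be emulated by with-replacement sampling.
assert sorted(reshuffled) == list(range(n))
assert np.unique(with_replacement).size <= n
```

This is exactly the observation behind the footnote above: a full-gradient computation corresponds to an implicit without-replacement pass over all $n$ samples.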

Relation to the algorithm Varag.

In this work, we mainly compare the proposed SVR-ADA with Katyusha, because it provided the first accelerated convergence rates among direct solvers for finite-sum problems and remains state of the art. Besides MiG [ZSC18], another work closely related to SVR-ADA is Varag [LLZ19], which is a unified version of Katyusha for both the strongly convex and nonstrongly convex settings and does not need black-box reduction to achieve the same rate as Katyusha in the nonstrongly convex setting. However, SVR-ADA is not only unified and simple; it also provides the improved convergence results (3) and (5) for the nonstrongly convex and strongly convex settings, respectively. Note that despite the rich literature on convex finite-sums, only a few works [RSB12, DBLJ14, SSZ14, AZY16, AZ17] have provided improved complexity results for the two basic settings. Meanwhile, our analysis is based on a generalized estimation sequence and is significantly different from that of the acceleration methods [AZ17, ZSC18, LLZ19] that follow the style of mirror descent. Furthermore, similar to MiG, we need to keep track of only one variable vector in the inner loop, which gives SVR-ADA an edge in sparse and asynchronous settings [ZSC18].

Besides accelerated versions of SVRG, there are also two accelerated versions of SAGA, namely Point-SAGA [Def16] and SSNM, as well as a randomized primal-dual gradient method RPDG [LZ18a] and a randomized gradient extrapolation method RGEM [LZ18b]. All these methods sample with replacement exclusively and attain the corresponding lower bound (6).

2 Algorithm: Variance Reduction via Accelerated Dual Averaging

Throughout, let $x^*$ denote a minimizer of (1). We first introduce a couple of standard assumptions about the smoothness and convexity of problem (1).

Assumption 1

Each $g_i$ ($i = 1,\dots,n$) is convex and $L$-smooth ($L > 0$), i.e., for all $x, y \in \mathbb{R}^d$, $\|\nabla g_i(x) - \nabla g_i(y)\| \le L\|x-y\|$.

By Assumption 1 and $g = \frac{1}{n}\sum_{i=1}^n g_i$, we can verify that $g$ is also $L$-smooth, i.e., $\|\nabla g(x) - \nabla g(y)\| \le L\|x-y\|$. Furthermore, we assume $l$ satisfies:

Assumption 2

$l$ is $\sigma$-strongly convex ($\sigma \ge 0$), i.e., for all $x, y \in \mathbb{R}^d$ and any subgradient $\xi \in \partial l(x)$, $l(y) \ge l(x) + \langle \xi, y-x\rangle + \frac{\sigma}{2}\|y-x\|^2$.

To realize acceleration with dual averaging, we recursively define the following generalized estimation sequence for the finite-sum problem (1):

$$\psi_{s,k}(z) := \psi_{s,k-1}(z) + a_s\left(g(y_{s,k}) + \langle \tilde{\nabla}_{s,k},\, z - y_{s,k}\rangle + l(z)\right), \quad (7)$$

with the initialization $\psi_{1,0}(z) := \frac{1}{2}\|z - \tilde{x}_0\|^2$, $\psi_{2,0} := m\,\psi_{1,1}$, and $\psi_{s+1,0} := \psi_{s,m}$ (for $s \ge 2$), where $m$ denotes the number of inner iterations per epoch and $\{a_s\}$ is a sequence of positive numbers to be specified later. Here $\{y_{s,k}\}$ is a sequence of vectors that will be generated by our algorithm, and $\{\tilde{\nabla}_{s,k}\}$ is a sequence of variance reduced stochastic gradients evaluated at $y_{s,k}$. If $m = 1$ and $\tilde{\nabla}_{s,k} = \nabla g(y_{s,k})$, then (7) reduces to the classical definition of the estimation sequence by Nesterov [Nes15]. In the finite-sum setting, we set $m = \Theta(n)$ to amortize the computational cost of the full gradient over each epoch, where $n$ is the number of sample functions in (1). We then say $\psi_{s,k}$ is the estimation sequence at the (inner) $k$-th iteration of the $s$-th epoch.

For convenience, we define $A_s := A_{s-1} + m a_s$ with $A_0 := 0$. Algorithm 1 then summarizes the proposed Stochastic Variance Reduction via Accelerated Dual Averaging (SVR-ADA) method. As one may see, besides the steps that update the estimation sequence, such as Steps 2-4, 6 and 12, Algorithm 1 mainly follows the framework of MiG [ZSC18], the simplified variant of Katyusha. The main differences are that we introduce novel but effective initialization steps in Steps 3-4 and replace the mirror descent step of MiG with Step 12, a dual averaging step:

$$z_{s,k} := \arg\min_{z} \psi_{s,k}(z). \quad (8)$$

To be self-contained, note that we compute the full gradient at the anchor point $\tilde{x}_{s-1}$ in Step 7 and compute the variance reduced stochastic gradient $\tilde{\nabla}_{s,k}$ in Steps 10 and 11. The convex combination in Step 9 and the dual averaging Step 12 are used to achieve acceleration. The settings for $a_s$, $A_s$ and $\tilde{x}_s$ in Step 14 are derived from our analysis. Notice that we update $\tilde{x}_s$ as a natural uniform average for both the nonstrongly convex and strongly convex settings, while Katyusha (as well as MiG) uses a weighted average for the strongly convex setting and a uniform average for the nonstrongly convex setting.
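To illustrate the dual averaging step (8), consider the special case (our simplification, not the paper's exact update) where the accumulated estimation sequence is a quadratic centered at an initial point plus an accumulated linear term and an accumulated strong convexity term; the argmin is then available in closed form:

```python
import numpy as np

# Minimal sketch of a dual-averaging step of the form (8): with
#   psi(z) = (alpha0/2)||z - z0||^2 + <G, z> + (S/2)||z||^2,
# where G accumulates weighted (variance reduced) gradients and S the
# accumulated strong convexity, the minimizer has a closed form.
# All names here are illustrative assumptions, not the paper's notation.
def dual_averaging_argmin(z0, G, alpha0, S):
    # Setting the gradient alpha0*(z - z0) + G + S*z to zero gives:
    return (alpha0 * z0 - G) / (alpha0 + S)

z0 = np.array([1.0, -2.0])
G = np.array([0.5, 0.5])
alpha0, S = 2.0, 1.0
z = dual_averaging_argmin(z0, G, alpha0, S)

# Optimality check: the gradient of the model vanishes at the argmin.
grad = alpha0 * (z - z0) + G + S * z
assert np.allclose(grad, 0.0)
```

The key point, mirrored in the analysis below, is that the quadratic coefficient $\alpha_0 + S$ grows as strong convexity accumulates, which is what dual averaging exploits along the whole trajectory.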

In our algorithm, a key new setting is that $\psi_{2,0}$ is initialized as $m$ times $\psi_{1,1}$, which, as we will see in our proof, helps completely cancel the $O(n\log n)$ term that appears in the complexity result of Katyusha. Furthermore, from the definition of $\psi_{s,k}$ in (7) and Assumption 2, one can show (see the proof for details) that the strong convexity of the estimation sequence accumulates with the weights $a_s$ along the trajectory. As we will see in the proof, this accumulation of strong convexity is crucial for our algorithm to achieve better convergence than non-accelerated methods such as SVRG and SAGA, even in the regime $n \gg \kappa$.

To be more precise, under these settings, we can prove (in Section 3 and the appendix) the following convergence guarantee for the proposed SVR-ADA algorithm:

Theorem 1

Let $\{\tilde{x}_s\}$ be generated by Algorithm 1. Under Assumptions 1 and 2, taking expectation over all the randomness of the history, we have

$$\mathbb{E}[f(\tilde{x}_s)] - f(x^*) \le \frac{\|\tilde{x}_0 - x^*\|^2}{2A_s}, \quad (9)$$

where

$$A_s \ge \max\left\{\frac{m}{2L}\left(\frac{2}{m}\right)^{2^{-(s-1)}},\ \frac{1}{L}\left(1+\sqrt{\frac{\sigma m}{2L}}\right)^{s-1}\right\} \quad (10)$$

and, with $s_0 := 1 + \lceil \log_2\log_2(m/2)\rceil$, besides the lower bounds in (10), for $s \ge s_0$ we also have

$$A_s \ge \max\left\{\frac{m}{32L}\left(s - s_0 + 2\sqrt{2}\right)^2,\ \frac{m}{4L}\left(1+\sqrt{\frac{m\sigma}{2L}}\right)^{s-s_0}\right\}. \quad (11)$$

Theorem 1 gives a unified convergence result for both the nonstrongly convex and strongly convex settings. As we see in (9), the expected error in the function value is simply bounded by a term inversely proportional to $A_s$.

To see the implications of Theorem 1 for the complexity of the algorithm: by the first term in (10), whether strongly convex or not, SVR-ADA can attain an $O(L/n)$-accurate solution within $O(\log\log n)$ epochs, and thus $O(n\log\log n)$ stochastic gradient evaluations. This superlinear phenomenon is due to our novel initialization in Step 4 of Algorithm 1. The best previously known convergence rate in the initial stage is by Varag [LLZ19], which shows a linear convergence rate in the initial stage for convex finite-sums, whether strongly convex or not. In contrast, the corresponding rate of SVR-ADA is superlinear. To the best of our knowledge, there has been no prior theoretical justification of a superlinear phenomenon for variance reduced first-order methods.

By the second term in (10), in the strongly convex setting we have an accelerated linear convergence rate from the start. Note that regardless of whether $n \gg \kappa$ or not, the contraction ratio of SVR-ADA is always $\left(1+\sqrt{m\sigma/(2L)}\right)^{-1}$, which tends to $0$ as $m\sigma/L \to \infty$. However, for all the existing variance reduction methods such as Katyusha and Varag, when $n \gg \kappa$ the contraction ratio is at least a constant.

Then, based on the prompt decrease in the superlinear initial stage, we also provide two new lower bounds for $A_s$ in (11). By the first term in (11), whether strongly convex or not, SVR-ADA has at least an accelerated sublinear rate. By the second term in (11), in the strongly convex setting SVR-ADA maintains an accelerated linear rate.

Thus, by Theorem 1 and setting $m = \Theta(n)$, we obtain our improved convergence rates for both the nonstrongly convex and strongly convex settings in Table 1.
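As a numeric sanity check of Theorem 1 (under our reconstruction of the recursion $A_s = A_{s-1} + \sqrt{mA_{s-1}(1+\sigma A_{s-1})/(2L)}$ with $A_1 = 1/L$; the parameter values are illustrative), one can verify that $A_s$ dominates both lower bounds in (10):

```python
import math

# Iterate the (reconstructed) recursion for A_s and check both lower
# bounds in (10). Parameters m, L, sigma are made-up illustrative values.
m, L, sigma = 1024, 1.0, 1e-3
A = 1.0 / L                       # A_1 = 1/L (our reconstruction)
for s in range(2, 30):
    A = A + math.sqrt(m * A * (1 + sigma * A) / (2 * L))
    bound1 = (m / (2 * L)) * (2 / m) ** (2.0 ** (-(s - 1)))   # superlinear term
    bound2 = (1 / L) * (1 + math.sqrt(sigma * m / (2 * L))) ** (s - 1)  # linear term
    assert A >= max(bound1, bound2)
```

The first bound saturates near $m/(2L)$ after about $\log_2\log_2 m$ epochs (the superlinear phase), after which the geometric second bound takes over; this matches the two regimes described above.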

The generalized estimation sequence and the associated analysis are the key to proving our main result, Theorem 1, although estimation sequences are commonly regarded as difficult to understand; indeed, [AZO17] proposed linear coupling of gradient descent and mirror descent as an alternative. Meanwhile, there is a rich body of work on understanding acceleration via the discretization of continuous-time dynamics [SBC14], and the estimation sequence itself has a principled explanation from a continuous-time perspective [DO19, SJM19]. In Section 3, we show that the estimation sequence analysis leads to a very concise, unified, and principled convergence analysis for both the strongly convex and nonstrongly convex settings.

3 Convergence Analysis

When using an estimation sequence to prove a convergence rate, the main task is to give lower and upper bounds on $\psi_{s,k}(z_{s,k})$, where $z_{s,k}$ is defined in (8). The lower bound is given in terms of the objective value at the current iterate and the estimation sequence at the previous iteration, while the upper bound is in terms of the objective value at the optimal solution. (For simplicity, we only need to upper bound $\psi_{s,m}(z_{s,m})$.) Then, by telescoping and concatenating the lower and upper bounds, we prove the rate in terms of the objective gap (in expectation).

First, for the initial Step 3 of Algorithm 1, by the smoothness of $g$ and the setting of $A_1$, we have Lemma 1.

Lemma 1 (The initial step)

It follows that

$$\psi_{1,1}(z_{1,1}) \ge A_1 f(\tilde{x}_1). \quad (12)$$

Proof. See Section A.

Lemma 1 will be used to cancel the error introduced by the variance in the following iterations. After entering the main loop, using smoothness and convexity together with the careful settings of $a_s$ and $y_{s,k}$, we obtain the lower bound on $\psi_{s,k}(z_{s,k})$ in Lemma 2.

Lemma 2 (Lower bound)

We have

$$\psi_{s,k}(z_{s,k}) \;\ge\; \cdots \;-\; A_{s-1}\langle \tilde{\nabla}_{s,k},\, \tilde{x}_{s-1} - y_{s,k}\rangle \;-\; A_{s-1}\, l(\tilde{x}_{s-1}). \quad (13)$$

Proof. See Section B.

In Lemma 2, the inner product involving $\tilde{\nabla}_{s,k}$ is the variance term we need to bound; the bound is given in Lemma 3, based on the standard derivation in [AZ17].

Lemma 3 (Variance reduction)

Taking expectation over the random choice of $i$ in the $k$-th iteration of the $s$-th epoch, we have

$$\mathbb{E}\left[\|\tilde{\nabla}_{s,k} - \nabla g(y_{s,k})\|^2\right] \le 2L\left(g(\tilde{x}_{s-1}) - g(y_{s,k}) - \langle \nabla g(y_{s,k}),\, \tilde{x}_{s-1} - y_{s,k}\rangle\right). \quad (14)$$

Proof. See Section C.
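The bound (14) can also be checked numerically; the sketch below (ours, with synthetic least-squares data where $L = \max_i \|a_i\|^2$) compares the expected squared deviation of the variance reduced gradient against the right-hand side of (14):

```python
import numpy as np

# Numeric sanity check of the variance bound (14) for least squares,
# g_i(x) = (1/2)(a_i^T x - b_i)^2. All data here is illustrative.
rng = np.random.default_rng(3)
n, d = 300, 15
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
L = max(np.linalg.norm(A[i]) ** 2 for i in range(n))   # smoothness constant

def g(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b) / n

def grad_i(x, i):
    return A[i] * (A[i] @ x - b[i])

y = rng.standard_normal(d)     # plays the role of y_{s,k}
xt = rng.standard_normal(d)    # plays the role of the anchor \tilde{x}_{s-1}

gy = grad(y)
gt = grad(xt)
# Left side: E ||\tilde\nabla - \nabla g(y)||^2 over a uniform i.
lhs = np.mean([np.sum((grad_i(y, i) - grad_i(xt, i) + gt - gy) ** 2)
               for i in range(n)])
# Right side of (14): a Bregman-divergence term of g.
rhs = 2 * L * (g(xt) - g(y) - gy @ (xt - y))
assert lhs <= rhs + 1e-9
```

For this quadratic instance the inequality follows from the standard second-moment argument, so the check must pass for any draw of the data.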

Then, combining Lemma 2 and Lemma 3, the inner products in (13) and (14) cancel each other in expectation. Therefore, after combining Lemma 2 and Lemma 3, telescoping the resulting inequality from $k = 1$ to $m$, and using the definition $\psi_{s+1,0} := \psi_{s,m}$, we obtain Lemma 4.

Lemma 4 (Recursion)

Taking expectation over the randomness of epoch $s$, it follows that

$$\mathbb{E}\left[\psi_{s+1,0}(z_{s+1,0}) - \psi_{s,0}(z_{s,0})\right] \ge \mathbb{E}\left[mA_s f(\tilde{x}_s) - mA_{s-1} f(\tilde{x}_{s-1})\right]. \quad (15)$$

Proof. See Section D.

Besides the lower bound (15), by convexity and the optimality of $x^*$, we can also provide the upper bound in Lemma 5.

Lemma 5 (Upper bound)

Taking expectation over all the history, we have

$$\mathbb{E}\left[\psi_{s,m}(z_{s,m})\right] \le mA_s f(x^*) + \frac{m}{2}\|\tilde{x}_0 - x^*\|^2. \quad (16)$$

Proof. See Section E.

Finally, by combining Lemmas 1, 4 and 5, we prove Theorem 1 as follows.

Proof of Theorem 1.

Proof. Taking expectation over all the randomness of the history and telescoping (15) from $2$ to $s$, we have

$$\mathbb{E}\left[\psi_{s+1,0}(z_{s+1,0}) - \psi_{2,0}(z_{2,0})\right] \ge \mathbb{E}\left[mA_s f(\tilde{x}_s) - mA_1 f(\tilde{x}_1)\right]. \quad (17)$$

Meanwhile, by Lemma 1 and the setting $\psi_{2,0} := m\,\psi_{1,1}$, we have

$$\psi_{2,0}(z_{2,0}) = m\,\psi_{1,1}(z_{2,0}) = m\,\psi_{1,1}(z_{1,1}) \ge mA_1 f(\tilde{x}_1). \quad (18)$$

So, combining (17) and (18), and by the setting $\psi_{s+1,0} := \psi_{s,m}$, we have

$$\mathbb{E}\left[\psi_{s,m}(z_{s,m})\right] = \mathbb{E}\left[\psi_{s+1,0}(z_{s+1,0})\right] \ge \mathbb{E}\left[mA_s f(\tilde{x}_s)\right]. \quad (19)$$

Then combining Lemma 5 and (19), we have

$$\mathbb{E}\left[mA_s f(\tilde{x}_s)\right] \le \mathbb{E}\left[\psi_{s,m}(z_{s,m})\right] \le mA_s f(x^*) + \frac{m}{2}\|\tilde{x}_0 - x^*\|^2. \quad (20)$$

After a simple rearrangement of (20), we obtain (9).

Next we derive the lower bounds on $A_s$ from the condition in Step 6 of Algorithm 1 and the setting $A_1 = 1/L$. To show the lower bound given by the first term in (10), note that

$$A_s = A_{s-1} + \sqrt{\frac{mA_{s-1}(1+\sigma A_{s-1})}{2L}} \ge \sqrt{\frac{mA_{s-1}(1+\sigma A_{s-1})}{2L}} \ge \sqrt{\frac{mA_{s-1}}{2L}}, \quad (21)$$

so we have

$$\frac{2LA_s}{m} \ge \left(\frac{2LA_{s-1}}{m}\right)^{\frac{1}{2}} \ge \left(\frac{2LA_1}{m}\right)^{2^{-(s-1)}}. \quad (22)$$

Then, by the setting $A_1 = 1/L$, we have

$$A_s \ge \frac{m}{2L}\left(\frac{2}{m}\right)^{2^{-(s-1)}}. \quad (23)$$

Meanwhile, for the second term in (10), we also have

$$A_s \ge A_{s-1} + \sqrt{\frac{mA_{s-1}(1+\sigma A_{s-1})}{2L}} \ge A_{s-1} + \sqrt{\frac{m\sigma}{2L}}\,A_{s-1} = \left(1+\sqrt{\frac{m\sigma}{2L}}\right)A_{s-1} \ge \left(1+\sqrt{\frac{m\sigma}{2L}}\right)^{s-1} A_1 \ge \frac{1}{L}\left(1+\sqrt{\frac{m\sigma}{2L}}\right)^{s-1}. \quad (24)$$

Thus the lower bounds in (10) are proved. Then, with $s_0 := 1 + \lceil \log_2\log_2(m/2)\rceil$, we have

$$A_{s_0} \ge \frac{m}{2L}\left(\frac{2}{m}\right)^{2^{-(s_0-1)}} = \frac{m}{2L}\left(\frac{2}{m}\right)^{2^{-\lceil \log_2\log_2(m/2)\rceil}} \ge \frac{m}{2L}\left(\frac{2}{m}\right)^{2^{-\log_2\log_2(m/2)}} = \frac{m}{4L}.$$

Meanwhile, for any $s \ge 2$ we have

$$A_s \ge A_{s-1} + \sqrt{\frac{mA_{s-1}(1+\sigma A_{s-1})}{2L}} \ge A_{s-1} + \sqrt{\frac{mA_{s-1}}{2L}}. \quad (25)$$

Thus we can use mathematical induction to prove the first lower bound in (11).

First, for $s = s_0$ we have $A_{s_0} \ge \frac{m}{4L} = \frac{m}{32L}(2\sqrt{2})^2$.

Then, assuming $A_{s-1} \ge \frac{m}{32L}\left(s-1-s_0+2\sqrt{2}\right)^2$ for some $s-1 \ge s_0$, we get

$$A_s \ge A_{s-1} + \sqrt{\frac{mA_{s-1}}{2L}} \ge \frac{m}{32L}\left(s-1-s_0+2\sqrt{2}\right)^2 + \frac{m}{8L}\left(s-1-s_0+2\sqrt{2}\right) \ge \frac{m}{32L}\left(s-s_0+2\sqrt{2}\right)^2. \quad (26)$$

Thus the first lower bound in (11) is proved.

Meanwhile, for $s \ge s_0$ we also have

$$A_s \ge A_{s-1} + \sqrt{\frac{mA_{s-1}(1+\sigma A_{s-1})}{2L}} \ge \left(1+\sqrt{\frac{m\sigma}{2L}}\right)A_{s-1} \ge \left(1+\sqrt{\frac{m\sigma}{2L}}\right)^{s-s_0} A_{s_0} \ge \frac{m}{4L}\left(1+\sqrt{\frac{m\sigma}{2L}}\right)^{s-s_0}. \quad (27)$$

Thus the second lower bound in (11) is proved.

4 Experiments

In this section, to verify the theoretical results and show the empirical performance of the proposed SVR-ADA method, we conduct numerical experiments on large-scale machine learning datasets: a9a, covtype, mnist and cifar10, downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The a9a and covtype datasets are for binary classification, while mnist and cifar10 are for multi-class classification. To make the comparison easier, we normalize the Euclidean norm of each data vector to $1$. The problem we solve is the $\ell_2$-norm regularized (multinomial) logistic regression problem:

$$\min_{w\in\mathbb{R}^{d\times(c-1)}} f(w) := \frac{1}{n}\sum_{j=1}^{n}\left(-\sum_{i=1}^{c-1} y_j^{(i)}\, w^{(i)\top} x_j + \log\Big(1+\sum_{i=1}^{c-1}\exp\big(w^{(i)\top} x_j\big)\Big)\right) + \frac{\lambda}{2}\sum_{i=1}^{c-1}\|w^{(i)}\|_2^2, \quad (28)$$

where $n$ is the number of samples, $c$ denotes the number of classes (for a9a and covtype, $c = 2$; for mnist and cifar10, $c = 10$), $\lambda$ denotes the regularization parameter, $y_j$ is a one-hot vector or the zero vector (the zero vector indicates that the class of the $j$-th sample is $c$), and $w$ denotes the variable to optimize. For the two-class datasets a9a and covtype, we present results for several choices of the regularization parameter $\lambda$, and likewise for the ten-class datasets mnist and cifar10. For $\lambda = 0$ the problem is unregularized and thus nonstrongly convex; for this setting, we compare SVR-ADA with the state-of-the-art variance reduction methods SVRG [JZ13], Katyusha [AZ17], and MiG [ZSC18]. The remaining settings for a9a and covtype (and for mnist and cifar10) correspond to the strongly convex setting with a large condition number and with a small condition number, respectively; for both, we compare SVR-ADA with SVRG, Katyusha and MiG.
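A direct transcription of the objective (28) can make the notation concrete (our sketch; the data, sizes, and the zero-vector encoding of the last class are illustrative assumptions):

```python
import numpy as np

# Sketch of objective (28): multinomial logistic regression with c-1
# weight vectors and an l2 regularizer. Data and sizes are synthetic.
rng = np.random.default_rng(4)
n, d, c, lam = 50, 8, 3, 1e-4
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalize rows as in the paper
labels = rng.integers(0, c, size=n)              # last class encoded as zero vector
Y = np.zeros((n, c - 1))
for j, lab in enumerate(labels):
    if lab < c - 1:
        Y[j, lab] = 1.0                          # one-hot for the first c-1 classes

def f(W):
    # W has shape (d, c-1); scores has shape (n, c-1)
    scores = X @ W
    loss = np.mean(-np.sum(Y * scores, axis=1)
                   + np.log1p(np.sum(np.exp(scores), axis=1)))
    return loss + 0.5 * lam * np.sum(W ** 2)

W0 = np.zeros((d, c - 1))
# At W = 0 every class is equally likely, so the loss equals log(c).
assert abs(f(W0) - np.log(c)) < 1e-9
```

Each summand over $j$ plays the role of one sample function $g_j$ in (1), and the $\ell_2$ term is the $\sigma$-strongly convex part $l$ with $\sigma = \lambda$.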

All four algorithms compared have a similar outer-inner loop structure, and we use the same number of inner iterations for all of them. For these algorithms, the common parameter to tune is the one with respect to the Lipschitz constant. (For logistic regression with normalized data, the Lipschitz constant is globally upper bounded [ZX15], but in practice we can use a smaller value, tuned over a grid; in our experiments, due to the normalization of the datasets, the algorithms diverge only when this parameter is set too small, and otherwise always converge.) Following the tradition of ERM experiments, we use the number of "passes" over the entire dataset as the x-axis. All four algorithms are implemented in C++ under the same framework, while the figures are produced using Python.

For the datasets a9a and covtype, as shown in Figure 1, when $\lambda = 0$, SVR-ADA decreases the error promptly in the initial stage, which validates our theoretical result of attaining an $O(1/n)$-accurate solution within $O(\log\log n)$ passes over the dataset. An interesting phenomenon is that the other variance reduction methods share this behavior with SVR-ADA in the empirical evaluations (in fact, MiG is slightly faster on both the a9a and covtype datasets). This poses an open problem: can this superlinear phenomenon in the initial stage be theoretically justified for SVRG, Katyusha, and MiG?

As shown in Figure 1, in the large condition number setting, SVR-ADA performs significantly better than Katyusha and MiG. This is partly because the accumulation of strong convexity by dual averaging helps better cancel the error from the randomness and allows SVR-ADA to choose a more aggressive parameter with respect to the Lipschitz constant. In the small condition number setting, as shown in Figure 1, SVR-ADA is again significantly better than Katyusha and MiG, which validates our superior theoretical result (5). Meanwhile, although with weaker theoretical guarantees, SVRG can be competitive with SVR-ADA in this regime, which motivates us to conduct, in future work, a tighter convergence analysis for non-accelerated methods in the small condition number setting.

As we can see, although there are some minor differences across the tasks and datasets shown in Figure 1 and Figure 2, the general behavior is very consistent. In both figures, our method SVR-ADA achieves state-of-the-art performance and is much faster than the non-accelerated SVRG algorithm in the nonstrongly convex setting and in the strongly convex setting with a large condition number. Meanwhile, in the strongly convex setting with a small condition number, SVR-ADA remains competitive with the non-accelerated SVRG algorithm and is much faster than the other two accelerated algorithms, Katyusha and MiG.

5 Conclusion and Discussion

In this work, for the finite-sum convex optimization problem, we propose to combine accelerated dual averaging with stochastic variance reduction for designing more efficient algorithms. The complexity of our algorithm practically matches the theoretical lower bounds for both settings, and leads to significant improvements in performance guarantees over existing methods. It is somewhat surprising that with such a simple and unified approach, the convergence rates can be further improved for the well-studied convex finite-sum optimization problem. As a result, we believe that the general approach of this paper may have potential extensions to nonconvex optimization, distributed settings and minimax problems.

Acknowledgements

Chaobing would like to thank Yaodong Yu for pointing out the closely related recent work of [LLZ19].

References

• [AS20] Kwangjun Ahn and Suvrit Sra. On tight convergence rates of without-replacement SGD. arXiv preprint arXiv:2004.08657, 2020.
• [AZ17] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
• [AZH16] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. In Advances in Neural Information Processing Systems, pages 1614–1622, 2016.
• [AZO17] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
• [AZY16] Zeyuan Allen-Zhu and Yang Yuan. Improved svrg for non-strongly-convex or sum-of-non-convex objectives. In International conference on machine learning, pages 1080–1089, 2016.
• [BB08] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pages 161–168, 2008.
• [BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
• [Bot09] Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the Symposium on Learning and Data Science, Paris, 2009.
• [DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
• [Def16] Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in neural information processing systems, pages 676–684, 2016.
• [DO19] Jelena Diakonikolas and Lorenzo Orecchia. The approximate duality gap technique: A unified theory of first-order methods. SIAM Journal on Optimization, 29(1):660–689, 2019.
• [DSSST10] John C Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pages 14–26, 2010.
• [GOP19] Mert Gürbüzbalaban, Asu Ozdaglar, and PA Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, pages 1–36, 2019.
• [HK14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
• [HS18] Jeffery Z HaoChen and Suvrit Sra. Random shuffling beats SGD after finite epochs. arXiv preprint arXiv:1806.10077, 2018.
• [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
• [LLX14] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.
• [LLZ19] Guanghui Lan, Zhize Li, and Yi Zhou. A unified variance-reduced accelerated gradient method for convex optimization. In Advances in Neural Information Processing Systems, pages 10462–10472, 2019.
• [LMH15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.
• [LPLJ17] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
• [LZ18a] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical programming, 171(1-2):167–215, 2018.
• [LZ18b] Guanghui Lan and Yi Zhou. Random gradient extrapolation for distributed and stochastic optimization. SIAM Journal on Optimization, 28(4):2753–2782, 2018.
• [Nes98] Yurii Nesterov. Introductory lectures on convex programming volume i: Basic course. Lecture notes, 1998.
• [Nes15] Yurii Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
• [RGP20] Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of sgd without replacement. arXiv preprint arXiv:2002.10400, 2020.
• [RHS