1 Introduction
In this paper, we study the following composite convex optimization problem:
(1)  $\min_{x\in\mathbb{R}^d}\ \big\{P(x):=f(x)+\psi(x)\big\},\qquad f(x)=\frac{1}{n}\sum_{i=1}^{n}f_i(x),$
where $f$ is a convex function given as a finite sum of $n$ convex, smooth sample functions $f_i$, and $\psi$ is convex, possibly nonsmooth, but admits an efficient proximal operator. In this paper, we mainly assume that each $f_i$ is $L$-smooth and $\psi$ is $\mu$-strongly convex with $\mu\ge 0$. If $\mu=0$, the problem is nonstrongly convex. If $\mu>0$, the problem is strongly convex and we define the corresponding condition number $\kappa:=L/\mu$. Instances of problem (1) appear widely in statistical learning, operations research, and signal processing. For instance, in machine learning, if $f_i(x)=\ell_i(\langle a_i,x\rangle)$, where $\ell_i$ is a convex loss and $a_i$ is the data vector of the $i$th sample, then problem (1) is also called regularized empirical risk minimization (ERM). Important instances of ERM include ridge regression, Lasso, logistic regression, and support vector machines.
1.1 Context and Motivation
In the large-scale setting where $n$ is large, first-order methods become the natural choice for solving (1) due to their better scalability. However, when $n$ is very large, even accessing the full gradient becomes prohibitively expensive. To alleviate this difficulty, a common approach is to replace the full gradient in each iteration with a stochastic gradient $\nabla f_i(x)$ for a randomly sampled index $i$, i.e., stochastic gradient descent (SGD). In the stochastic setting, the goal of solving (1) becomes finding an expected $\epsilon$-accurate solution $x$ satisfying $\mathbb{E}[P(x)]-P(x^\ast)\le\epsilon$, where $x^\ast$ is an exact minimizer of (1). Typically, the complexity or convergence rate of such an algorithm is measured by the number of stochastic gradient evaluations needed to achieve an $\epsilon$-accurate solution.
Because it accesses only stochastic gradients, SGD has a low per-iteration cost. However, SGD converges very slowly due to the nonvanishing variance of its gradient estimates. To improve the rate of SGD while maintaining its low per-iteration cost, a remarkable line of progress in the past decade exploits the finite-sum structure of $f$ in (1) to reduce the variance of the stochastic gradients. In such variance reduction methods, instead of directly using $\nabla f_i(x)$, we compute a full gradient $\nabla f(\tilde{x})$ at an anchor point $\tilde{x}$ beforehand. Then we use the following variance-reduced gradient
(2)  $\tilde{\nabla}_i := \nabla f_i(x) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})$
as a proxy for the full gradient in each iteration. As a result, the amortized per-iteration cost is still the same as that of SGD. However, the variance-reduced gradient (2) is unbiased, and its variance vanishes asymptotically as the iterates and the anchor point approach the optimum, so the convergence rate of SGD can be substantially improved.
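To make the estimator (2) concrete, here is a minimal pure-Python sketch on a made-up scalar quadratic finite sum (the data `c`, `b` and all helper names are illustrative, not from the paper): averaging the estimator over all indices recovers the full gradient (unbiasedness), and its variance shrinks as the anchor point approaches the query point.

```python
import random

# Toy finite sum: f_i(x) = 0.5 * c_i * (x - b_i)^2, so grad f_i(x) = c_i * (x - b_i).
# The data c_i, b_i are made-up illustration values, not from the paper.
random.seed(0)
n = 1000
c = [random.uniform(0.5, 2.0) for _ in range(n)]
b = [random.uniform(-1.0, 1.0) for _ in range(n)]

def grad_i(i, x):
    return c[i] * (x - b[i])

def full_grad(x):
    return sum(grad_i(i, x) for i in range(n)) / n

def vr_grad(i, x, x_anchor, g_anchor):
    """Variance-reduced estimator (2): grad f_i(x) - grad f_i(x~) + grad f(x~)."""
    return grad_i(i, x) - grad_i(i, x_anchor) + g_anchor

x, x_anchor = 0.7, 0.5
g_anchor = full_grad(x_anchor)

# Unbiasedness: averaging the estimator over all i recovers the full gradient.
mean_est = sum(vr_grad(i, x, x_anchor, g_anchor) for i in range(n)) / n
assert abs(mean_est - full_grad(x)) < 1e-9

# Variance shrinks as the anchor approaches the query point.
def variance(x, x_anchor):
    g_a = full_grad(x_anchor)
    g = full_grad(x)
    return sum((vr_grad(i, x, x_anchor, g_a) - g) ** 2 for i in range(n)) / n

assert variance(0.7, 0.69) < variance(0.7, -0.5)
```

The second assertion is the whole point of the anchor: the estimator's variance scales with the distance between the query point and the anchor, which is why periodically refreshing the anchor makes the variance vanish along the optimization path.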
To this end, SAG [RSB12] is one of the first variance reduction methods, though it uses a biased estimate of the full gradient. SVRG [JZ13] directly solves (1) and explicitly uses the unbiased estimator (2) to reduce variance. SAGA [DBLJ14] then provides an alternative to (2) that avoids precomputing the gradient at an anchor point, at the price of increased memory cost. Based on [SSZ14], the Catalyst approach [LMH15] combines Nesterov's acceleration with variance reduction methods in a black-box manner. [AZ17] proposed the first direct approach, named Katyusha (i.e., accelerated SVRG), to combine variance reduction with a Nesterov-type acceleration scheme in a principled manner. [WS16] gave a tight lower complexity bound for finite-sum stochastic optimization and showed the tightness of Katyusha up to a logarithmic factor. MiG [ZSC18] follows and simplifies Katyusha by using only negative momentum to produce acceleration. In Table 1, we give the complexity results of representative variance reduction methods for both the nonstrongly convex and strongly convex settings (as well as the results of this paper). The literature on variance reduction is too rich to list in full here (for instance, when the objective (1) is strongly convex, one can also use randomized coordinate descent/ascent methods on the dual or primal-dual formulation of (1) to solve (1) indirectly, such as the nonaccelerated SDCA [SSZ13] and the accelerated versions AccSDCA [SSZ14], APCG [LLX14], and SPDC [ZX15]; variance-reduced methods have also been widely applied to distributed computing [RHS15, LPLJ17] and nonconvex optimization [RHS16, RSPS16]). As Katyusha [AZ17] remains the state of the art for nearly all (if not all) convex settings, we mainly use Katyusha as the baseline for comparison.
Algorithm | Nonstrongly Convex | Strongly Convex
SVRG [JZ13, XZ14] | — |
SAGA [DBLJ14] | |
SVRG++ [AZY16] | | —
Katyusha [AZ17] | — |
Katyusha [AZ17] | — |
Katyusha [AZ17] by reduction of [AZH16] | — |
SVR-ADA (This Paper) | ^{1} |
Lower bound [WS16] | |
For this setting, the bound is also valid.
To understand where we stand with these complexity results: in the nonstrongly convex setting, we are particularly interested in attaining a solution with a moderate accuracy such as $\epsilon=O(1/n)$ (this is because, in the context of large-scale statistical learning, due to statistical limits [BB08, SSS08], even under some strong regularity conditions [BB08], such an accuracy is sufficient). To attain this accuracy, Katyusha (as well as SVRG++) needs about $O(n\log n)$ iterations, whereas the lower bound of [WS16] implies that $O(n)$ iterations may suffice. Before this work, it was not known whether this logarithmic-factor gap could be narrowed.
In the strongly convex setting, the rate of Katyusha is optimal for the case $\kappa\ge n$ according to the lower bound given by [WS16]. However, when $n\ge\kappa$ (which is common in the statistical learning context, e.g., [BE02]), the rate of the accelerated Katyusha is no better than that of nonaccelerated methods such as SVRG and SAGA. Before this work, it was not known whether the rate in the regime $n\ge\kappa$ could be further improved, nor whether accelerated variance-reduced methods can indeed be better than nonaccelerated ones.
1.2 Our Approach and Contributions
We believe that the above issues with Katyusha are the result of the acceleration strategy that it adopts, which follows the idea of combining gradient descent with mirror descent [AZ17]. The mirror descent step can only exploit the strong convexity in the current iteration and cannot exploit the strong convexity accumulated along the whole optimization trajectory. As a result, although such a combination can lead to optimal dependence on the condition number, it leads to suboptimal dependence on the sample size $n$, particularly when $n\ge\kappa$ in the strongly convex setting or when a moderate accuracy suffices in the nonstrongly convex setting. Although no improved rates were obtained after Katyusha, MiG [ZSC18] simplifies Katyusha by showing that using only negative momentum and performing one mirror descent step per iteration is enough to achieve acceleration.
In this paper, we follow the simple algorithmic framework of MiG [ZSC18] but consider a different acceleration strategy based on Nesterov's idea of combining gradient descent and dual averaging [Nes15], hence the name stochastic variance reduction via accelerated dual averaging (SVR-ADA). The superiority of dual averaging in exploiting the structure of the regularizer has been elaborated in [Xia10] (a NeurIPS Test of Time award paper). By using dual averaging, [Xia10] shows that one can accumulate the weight of a sparsity-inducing regularizer such as the $\ell_1$ norm to produce sparser solutions than the composite mirror descent of [DSSST10] in the online optimization context. Here, we use dual averaging to accumulate the weight (i.e., the strong convexity constant) of the squared $\ell_2$ norm (the $\mu$-strong convexity implies that $\psi$ can be written as the sum of a convex part and $\frac{\mu}{2}\|\cdot\|^2$) along the optimization trajectory. As we will show, this leads to better complexity dependence on the sample size in the finite-sum optimization context. To fully exploit the potential of dual averaging, we introduce a generalized estimation sequence for the finite-sum setting and a simple but novel initialization strategy that substantially improves the initial convergence behavior. As a result, SVR-ADA significantly simplifies the convergence analysis and improves the state-of-the-art convergence rates in both the nonstrongly and strongly convex settings.
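The weight-accumulation effect of dual averaging can be illustrated with the simple scalar form of $\ell_1$-RDA from [Xia10]: the full regularization weight is compared against the *averaged* gradient, so any coordinate whose average gradient stays below the threshold is set exactly to zero. This is a simplified sketch under our own parameter names, not the paper's algorithm:

```python
import math

def soft_threshold(v, tau):
    # shrink toward zero: sign(v) * max(|v| - tau, 0)
    return math.copysign(max(abs(v) - tau, 0.0), v)

def rda_l1_step(grad_sum, t, lam, gamma):
    """One step of simple l1-RDA [Xia10] in the scalar case: the minimizer of
    <g_bar, x> + lam*|x| + (gamma / sqrt(t)) * x^2 / 2, where g_bar is the
    AVERAGE of the past t gradients. The full weight lam is applied at every
    step, so a coordinate whose average gradient stays below lam is exactly 0.
    Parameter names are ours; illustrative only."""
    g_bar = grad_sum / t
    return -(math.sqrt(t) / gamma) * soft_threshold(g_bar, lam)

# Small average gradient (0.05 < lam = 0.1): the iterate is exactly zero.
assert rda_l1_step(0.5, 10, lam=0.1, gamma=1.0) == 0.0
# Large average gradient: the iterate moves against the gradient.
assert rda_l1_step(5.0, 10, lam=0.1, gamma=1.0) < 0.0
```

The contrast with composite mirror descent is that mirror descent applies only one step's worth of regularization per iteration, while dual averaging applies the accumulated weight; the same mechanism, applied to the strong convexity constant instead of an $\ell_1$ weight, is what SVR-ADA exploits.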
Efficiency.
In the nonstrongly convex setting, as shown in Table 1, our SVR-ADA method achieves the rate
(3)  $O\big(n\log\log n+\sqrt{nL/\epsilon}\big),$
which matches the lower bound up to a $\log\log n$ factor. So to attain a solution with $\epsilon=O(1/n)$ we only need $O(n\log\log n)$ iterations, thanks to the superlinear rate in the initial stage, whereas the best previously known result is $O(n\log n)$. Practically speaking, the $\log\log n$ factor can be treated as a small constant: for any realistic sample size it is a single-digit number (for instance, $\log_2\log_2 n<6$ even for $n=10^{12}$). Thus, for nonstrongly convex problems, SVR-ADA can attain an accurate solution with essentially $O(n)$ iterations, practically matching the lower bound!
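To get a feel for how small a $\log\log n$ factor is in practice, a quick computation over illustrative sample sizes (of our own choosing, not from the paper):

```python
import math

# log2(log2(n)) grows extremely slowly: even for n = 10^12 it stays below 6,
# so the extra factor in the rate behaves like a small constant in practice.
for n in (1e6, 1e9, 1e12):
    print(f"n = {n:.0e}: log2(log2(n)) = {math.log2(math.log2(n)):.2f}")
```

For comparison, a plain $\log_2 n$ factor at $n=10^{12}$ is about 40, an order of magnitude larger.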
In the strongly convex setting (we use the indicator function with value $1$ when the event is true and $0$ when the event is false), as shown in Table 1, SVR-ADA first attains an initial accuracy with $O(n\log\log n)$ iterations. Then, when $\kappa\ge n$, the bound of SVR-ADA becomes
(4) 
which matches the lower bound with the initialized error. In contrast, the best existing methods such as Katyusha do not enjoy such good initial behavior in theory. Meanwhile, for this setting, if we do not consider the superlinear behavior in the initial stage, the rate of SVR-ADA can also be characterized by the classical accelerated linear rate.
Meanwhile, when $n\ge\kappa$, the rate of SVR-ADA becomes
(5) 
Again, if we do not consider the superlinear behavior, the bound also admits a linear-rate characterization. In this regime, as shown in Table 1, both nonaccelerated and accelerated methods previously had the rate $O\big(n\log(1/\epsilon)\big)$. Thus SVR-ADA improves the best known result by a factor that is significant when $n/\kappa$ is very large and $\epsilon$ is small. For the first time, our results show that in the regime $n\ge\kappa$, accelerated methods can be better than nonaccelerated methods for finite-sum optimization.
Simplicity.
In the nonstrongly convex setting, besides the improved complexity result, SVR-ADA is also much simpler than the best previously known method of combining Katyusha with an optimal black-box reduction strategy [AZH16], as SVR-ADA matches the lower bound directly, without any additional strategy. In addition, the optimal black-box reduction strategy for Katyusha (and similarly the geometrically increasing epoch-length strategy for SVRG++ [AZY16]) needs to estimate unknown quantities such as the initial objective gap $P(x_0)-P(x^\ast)$ and the initial distance $\|x_0-x^\ast\|$, which complicates the implementation. In the strongly convex setting, SVR-ADA uses a natural uniform average in the algorithm implementation, while Katyusha uses a weighted average that also complicates the implementation.
Unification.
Different from Katyusha and its nonstrongly convex variant, which use different parameter settings and independent convergence analyses, SVR-ADA uses the same parameter setting for both the nonstrongly convex and strongly convex settings (as we will discuss, the Varag algorithm [LLZ19] is also unified over both settings). The only difference is that in SVR-ADA we set $\mu=0$ in the nonstrongly convex setting and $\mu>0$ in the strongly convex setting. Meanwhile, based on a “generalized estimation sequence,” we conduct a unified convergence analysis for both settings; the only difference is the values of two predefined sequences of positive numbers.
1.3 Other Related Works
Regarding the lower bound under sampling with replacement.
When the problem (1) is strongly convex and smooth, [LZ18a] provided a lower bound stronger than that of [WS16]: to find an $\epsilon$-accurate solution, any randomized incremental gradient method needs at least
(6)  $\Omega\big((n+\sqrt{n\kappa})\log(1/\epsilon)\big)$
iterations when the dimension is sufficiently large. Our upper bound (5) is measured by the objective gap. At first sight, after converting the objective gap to the Euclidean distance $\|x-x^\ast\|^2$ via strong convexity, in the regime $n\ge\kappa$ our upper bound appears to beat the lower bound (6), which seems rather surprising. Nevertheless, by examining carefully the assumption of [LZ18a] in deriving the lower bound (6), one sees that the randomized incremental gradient methods of [LZ18a] refer exclusively to methods that sample with replacement. As it turns out, the SVR-ADA algorithm proposed in this paper does not satisfy the assumption of [LZ18a]. SVR-ADA is based on the two-loop structure of SVRG: in the outer loop, we compute the full gradient at an anchor point; in the inner loop, we compute stochastic gradients by sampling with replacement. In the outer loop of SVRG, the step of computing the full gradient can be viewed as $n$ stochastic gradient steps with step size $0$ by (implicitly) sampling without replacement. (The outer loop of SVRG or of our algorithm cannot be interpreted as stochastic gradient steps by sampling with replacement, as we cannot pick all $n$ samples with probability $1$ by sampling with replacement $n$ times.) Thus, SVR-ADA does not fall into the class of randomized incremental algorithms that [LZ18a] examined, and the lower bound (6) does not apply to SVR-ADA.
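The parenthetical claim above can be quantified: the probability that $n$ uniform draws with replacement cover all $n$ indices is exactly $n!/n^n$, which decays roughly like $e^{-n}$ by Stirling's formula. A small illustration (our own, not from the paper):

```python
import math

def coverage_prob(n):
    """P(n uniform draws WITH replacement hit all n distinct indices) = n!/n^n."""
    return math.factorial(n) / n**n

# Already tiny for small n, and decaying roughly like e^{-n}:
assert abs(coverage_prob(2) - 0.5) < 1e-12   # 2!/2^2 = 1/2
assert coverage_prob(20) < 1e-7              # on the order of 1e-8
```

So a full-gradient pass is statistically very different from $n$ with-replacement samples, which is why the outer loop escapes the with-replacement lower-bound model.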
Meanwhile, although Katyusha, MiG (and also Varag [LLZ19] and ASVRG [SJZ18]) are all accelerated versions of SVRG, they cannot obtain the improved upper bound (5). We believe this stems from the simple fact discussed above: all of them achieve acceleration in the style of mirror descent [AZO17], which cannot effectively exploit the strong convexity accumulated along the optimization path. Thus our improved rate in (5) can be attributed to the combined merits of the (implicit) sampling without replacement in SVRG and the acceleration effect of Nesterov's dual averaging.
Remark 1 (Sampling without Replacement)
Very recently, the superiority of sampling without replacement has also been verified theoretically [HS18, GOP19, RGP20, AS20]. In particular, [AS20] has shown that for strongly convex and smooth finite-sum problems, SGD without replacement (also known as random reshuffling) attains a tight rate that is significantly better than the corresponding rate of SGD with replacement [HK14]. Meanwhile, in practice, sampling without replacement is also more widely used in training deep neural networks for its better efficiency [Bot09, RR13].

Relation to the algorithm Varag.
In this work, we mainly compare the proposed SVR-ADA algorithm with Katyusha, because Katyusha provides the first accelerated convergence rates among direct solvers for finite-sum problems and remains the state of the art. Besides MiG [ZSC18], another work closely related to SVR-ADA is Varag [LLZ19], a unified version of Katyusha for both the strongly convex and nonstrongly convex settings that does not need black-box reduction to achieve the same rate as Katyusha in the nonstrongly convex setting. However, SVR-ADA is not only unified and simple; it also provides the improved convergence results (3) and (5) for the nonstrongly convex and strongly convex settings, respectively. Note that despite the rich literature on convex finite sums, only a few works [RSB12, DBLJ14, SSZ14, AZY16, AZ17] have provided improved complexity results for the two basic settings. Meanwhile, our analysis is based on a generalized estimation sequence and is significantly different from that of the acceleration methods [AZ17, ZSC18, LLZ19] that follow the style of mirror descent. Furthermore, similar to MiG, we need to keep track of only one variable vector in the inner loop, which gives SVR-ADA an edge in sparse and asynchronous settings [ZSC18].
Besides accelerated versions of SVRG, there are also two accelerated variants of SAGA, Point-SAGA [Def16] and SSNM [Def16], as well as a randomized primal-dual gradient method RPDG [LZ18a] and a randomized gradient extrapolation method RGEM [LZ18b]. All of these methods use sampling with replacement exclusively and attain the corresponding lower bound (6).
2 Algorithm: Variance Reduction via Accelerated Dual Averaging
For simplicity, we consider the Euclidean setting throughout. We first introduce a couple of standard assumptions about the smoothness and convexity of problem (1).
Assumption 1
Each $f_i$ is convex and $L$-smooth ($L>0$), i.e., for all $x,y$, $\|\nabla f_i(x)-\nabla f_i(y)\|\le L\|x-y\|$.
By Assumption 1 and the finite-sum structure, one can verify that $f$ is also $L$-smooth. Furthermore, we assume $\psi$ satisfies:
Assumption 2
$\psi$ is $\mu$-strongly convex ($\mu\ge 0$), i.e., for all $x,y$ and any subgradient $g\in\partial\psi(x)$, $\psi(y)\ge\psi(x)+\langle g,y-x\rangle+\frac{\mu}{2}\|y-x\|^2$.
To realize acceleration with dual averaging, we recursively define the following generalized estimation sequence for the finitesum problem (1):
(7) 
with the initialization given below, where $m$ denotes the epoch length (as we will see, $m$ is the number of inner iterations of our algorithm) and $\{a_k\}$ is a sequence of positive numbers to be specified later. Here $\{x_k\}$ is the sequence of vectors generated by our algorithm, and $\{\tilde{\nabla}_k\}$ is the sequence of variance-reduced stochastic gradients evaluated at $\{x_k\}$. In the deterministic special case, one can verify that (7) is equivalent to the classical definition of an estimation sequence by Nesterov [Nes15]. In the finite-sum setting, we choose the epoch length to amortize the computational cost per epoch, where $n$ is the number of sample functions in (1). We then speak of the estimation sequence at the (inner) $k$th iteration of the $s$th epoch. For convenience, we also define the corresponding accumulated weights. Algorithm 1 then summarizes the proposed Stochastic Variance Reduction via Accelerated Dual Averaging (SVR-ADA) method. As one may see, besides the steps updating the estimation sequence, namely Steps 2–4, 6, and 12, Algorithm 1 mainly follows the framework of MiG [ZSC18], the simplified variant of Katyusha. The main differences are the novel but effective initialization in Steps 3–4 and the replacement of the mirror descent step of MiG with the dual averaging step in Step 12:
(8) 
To be self-contained, note that we compute the full gradient at the anchor point in Step 7 and the variance-reduced stochastic gradient in Steps 10 and 11. The convex combination in Step 9 and the dual averaging in Step 12 are used to achieve acceleration. The parameter settings in Step 14 are derived from our analysis. Notice that we update the output iterate as a natural uniform average for both the nonstrongly convex and strongly convex settings, while Katyusha (as well as MiG) uses a weighted average for the strongly convex setting and a uniform average for the nonstrongly convex setting.
In our algorithm, a key new ingredient is the initialization in Steps 3–4, which, as we will see in our proof, helps completely cancel the corresponding error term in the complexity result of Katyusha. From the definition of the estimation sequence in (7), one can show (see the proof for details) that the sequence accumulates strong convexity along the iterations. As we will see in the proof, this accumulation of strong convexity is crucial for our algorithm to achieve better convergence than nonaccelerated methods such as SVRG and SAGA, even in the regime $n\ge\kappa$.
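The accumulation of strong convexity can be seen in a schematic scalar model of our own (not the paper's exact Step 12): if the dual-averaging subproblem is $\min_y \langle G,y\rangle + \frac{A\mu}{2}y^2 + \frac{1}{2}(y-y_0)^2$, where $G$ is the weighted sum of past gradients and $A$ the accumulated weight, then its closed-form solution shows an effective curvature $1+A\mu$ that grows along the trajectory:

```python
def dual_averaging_step(y0, G, A, mu):
    """Minimizer of <G, y> + (A*mu/2)*y^2 + (1/2)*(y - y0)^2 in the scalar
    case: setting the derivative to zero gives y = (y0 - G) / (1 + A*mu).
    The effective curvature 1 + A*mu grows with the accumulated weight A;
    this is the accumulation of strong convexity along the trajectory.
    A schematic model of our own, not the paper's exact update."""
    return (y0 - G) / (1.0 + A * mu)

# With no accumulated weight the step reduces to a plain proximal step...
assert dual_averaging_step(1.0, 0.0, 0.0, 0.5) == 1.0
# ...and the same gradient sum moves the iterate less as A grows.
step_small_A = abs(dual_averaging_step(0.0, 1.0, 1.0, 0.5))
step_large_A = abs(dual_averaging_step(0.0, 1.0, 10.0, 0.5))
assert step_large_A < step_small_A
```

A mirror descent step, by contrast, would regularize with only the current iteration's strong convexity, so its effective curvature stays bounded instead of growing with the accumulated weight.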
To be more precise, under these settings, we can prove (in Section 3 and the appendix) the following convergence guarantee for the proposed SVRADA algorithm:
Theorem 1
Theorem 1 gives a unified convergence result for both the nonstrongly convex and the strongly convex settings. As we see in (9), the expected error in the function value admits a simple single-term bound.
To see the implications of Theorem 1 for the complexity of the algorithm: by the first term in (10), whether strongly convex or not, SVR-ADA attains an accurate initial solution within $O(\log\log n)$ epochs, i.e., $O(n\log\log n)$ stochastic gradient evaluations. This superlinear phenomenon is due to our novel initialization in Step 4 of Algorithm 1. The best previously known behavior in the initial stage is that of Varag [LLZ19], which shows a linear convergence rate in the initial stage for convex finite sums whether strongly convex or not. In contrast, the corresponding rate of SVR-ADA is superlinear. To the best of our knowledge, there has been no prior theoretical justification of a superlinear phenomenon for variance-reduced first-order methods.
By the second term in (10), in the strongly convex setting we have an accelerated linear convergence rate from the start. Note that the contraction ratio of SVR-ADA vanishes as $n\to\infty$, whereas for all existing variance reduction methods such as Katyusha and Varag, the contraction ratio is at least a constant.
Then, based on the prompt decrease in the superlinear initial stage, we also provide two new lower bounds in (11). By the first term in (11), whether strongly convex or not, SVR-ADA enjoys at least an accelerated sublinear rate. By the second term in (11), in the strongly convex setting SVR-ADA maintains an accelerated linear rate.
Thus, by Theorem 1 with the appropriate parameter choice, we obtain our improved convergence rates for both the nonstrongly convex and strongly convex settings in Table 1.
The generalized estimation sequence and the associated analysis are the key to proving our main result, Theorem 1, although the estimation sequence technique is commonly considered difficult to understand; indeed, [AZO17] proposed the linear coupling of gradient descent and mirror descent as an alternative. Meanwhile, there is a large body of work on understanding acceleration via the discretization of continuous-time dynamics [SBC14], and the estimation sequence itself has a principled explanation from a continuous-time perspective [DO19, SJM19]. In Section 3, we show that the estimation sequence analysis leads to a very concise, unified, and principled convergence analysis for both the strongly convex and nonstrongly convex settings.
3 Convergence Analysis
When using an estimation sequence to prove a convergence rate, the main task is to derive a lower bound and an upper bound for the estimation sequence. The lower bound is given in terms of the objective value at the current iterate and the estimation sequence at the previous iteration, while the upper bound is in terms of the objective value at the optimal solution (for simplicity, we only need the upper bound in expectation). Then, by telescoping and chaining the lower and upper bounds, we prove the rate in terms of the expected objective gap.
First, for the initial Step 3 of Algorithm 1, by the smoothness of $f$ and our parameter setting, we have Lemma 1.
Lemma 1 (The initial step)
It follows that
(12) 
Proof. See Section A.
Lemma 1 will be used to cancel the error introduced by the initialization in the subsequent iterations. After entering the main loop, by using the smoothness and convexity properties together with the careful parameter settings, we obtain the lower bound in Lemma 2.
Lemma 2 (Lower bound)
we have
(13)  
Proof. See Section B.
In Lemma 2, the remaining term is the variance that we need to control; its bound is given in Lemma 3, based on the standard derivation in [AZ17].
Lemma 3 (Variance reduction)
Taking expectation over the random choice of the sample index in the $k$th iteration of the $s$th epoch, we have
(14) 
Proof. See Section C.
Then, by combining Lemma 2 and Lemma 3, we find that the inner product terms in (13) and (14) cancel each other in expectation. Therefore, after combining Lemma 2 and Lemma 3, telescoping the resulting inequality over the inner iterations of an epoch, and using the definitions above, we obtain Lemma 4.
Lemma 4 (Recursion)
Taking expectation over the randomness of epoch $s$, it follows that
(15) 
Proof. See Section D.
Besides the lower bound (15), by the convexity of the objective and the optimality of $x^\ast$, we can also provide the upper bound in Lemma 5.
Lemma 5 (Upper bound)
Taking expectation over all the history, we have
(16) 
Proof. See Section E.
Proof of Theorem 1.
Proof. Taking expectation over all the history and telescoping (15) over the epochs, we have
(17) 
Meanwhile, by Lemma 1 and our parameter setting, we have
(18) 
Combining (17) and (18), and again using our parameter setting, we have
(19) 
Then combining Lemma 5 and (19), we have
(20) 
Next, we give the lower bound implied by the condition in Step 6 of Algorithm 1. To show the lower bound given by the first term in (10), we note that
(21) 
so we have
(22) 
Then, by our parameter setting, we have
(23) 
Meanwhile, we have
(25) 
Thus we can use mathematical induction to prove the first lower bound in (11):
First, we have
Meanwhile, we also have
(27)  
Thus the second lower bound in (11) is proved.
4 Experiments
In this section, to verify the theoretical results and show the empirical performance of the proposed SVR-ADA method, we conduct numerical experiments on large-scale machine learning datasets. The datasets we use are a9a, covtype, mnist, and cifar10, downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The a9a and covtype datasets are for binary classification, while mnist and cifar10 are for multiclass classification. To make comparisons easier, we normalize the Euclidean norm of each data vector to $1$. The problem we solve is the squared-$\ell_2$-norm regularized (multinomial) logistic regression problem:
(28) 
where $n$ is the number of samples, the number of classes is 2 for a9a and covtype and 10 for mnist and cifar10, $\lambda$ denotes the regularization parameter, $y_i$ is a one-hot vector or the zero vector (the zero vector encodes one designated class of the $i$th sample), and the matrix of class weight vectors is the variable to optimize. For the two-class datasets "a9a" and "covtype", we present results for several values of the regularization parameter $\lambda$, including $\lambda=0$; we do the same for the ten-class datasets "mnist" and "cifar10". For $\lambda=0$, the corresponding problem is unregularized and thus nonstrongly convex. For this setting, we compare SVR-ADA with the state-of-the-art variance reduction methods SVRG [JZ13], Katyusha [AZ17], and MiG [ZSC18]. The remaining settings for a9a and covtype (and for mnist and cifar10) correspond to the strongly convex setting with a large condition number and with a small condition number, respectively. For both settings, we again compare SVR-ADA with SVRG, Katyusha, and MiG.
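For concreteness, here is a pure-Python sketch of the binary (two-class) version of the regularized objective and its gradient; the helper name and toy data are ours, labels are $y_i\in\{-1,+1\}$, and the multinomial case used for mnist/cifar10 is analogous:

```python
import math

def logreg_loss_grad(w, X, y, lam):
    """Squared-l2-regularized binary logistic regression:
      (1/n) * sum_i log(1 + exp(-y_i * <a_i, w>)) + (lam/2) * ||w||^2,
    with labels y_i in {-1, +1}. Pure-Python sketch on small lists;
    the helper name and toy data below are ours, not the paper's code."""
    n, d = len(X), len(w)
    loss = 0.5 * lam * sum(wj * wj for wj in w)
    grad = [lam * wj for wj in w]
    for a, yi in zip(X, y):
        margin = yi * sum(aj * wj for aj, wj in zip(a, w))
        loss += math.log1p(math.exp(-margin)) / n
        coef = -yi / ((1.0 + math.exp(margin)) * n)  # d/dw of the logistic term
        for j in range(d):
            grad[j] += coef * a[j]
    return loss, grad

# At w = 0 the loss is log(2) regardless of the data:
loss, grad = logreg_loss_grad([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1, -1], 0.1)
assert abs(loss - math.log(2)) < 1e-12
assert abs(grad[0] + 0.25) < 1e-12 and abs(grad[1] - 0.25) < 1e-12
```

Each term of the sum plays the role of one sample function $f_i$ in (1), with the $\frac{\lambda}{2}\|w\|^2$ term acting as the $\lambda$-strongly convex regularizer $\psi$.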
All four algorithms compared share a similar outer-inner structure, and we set the number of inner iterations to the same value for all of them. For these algorithms, the common parameter to tune is the smoothness parameter, i.e., the estimate of the Lipschitz constant (for logistic regression with normalized data, the Lipschitz constant is globally upper bounded [ZX15], but in practice a smaller value, tuned over a grid, works better; in our experiments, due to the normalization of the datasets, all four algorithms diverge if this parameter is set too small and converge otherwise). Following the tradition of ERM experiments, we use the number of "passes" over the entire dataset as the x-axis. All four algorithms are implemented in C++ under the same framework, and the figures are produced with Python.
For the datasets a9a and covtype, as shown in Figure 1, when $\lambda=0$, SVR-ADA decreases the error promptly in the initial stage, which validates our theoretical result of attaining an accurate solution within a small number of passes over the entire dataset. An interesting phenomenon is that the other variance reduction methods exhibit the same initial behavior as SVR-ADA in our empirical evaluations (in fact, MiG is slightly faster on both a9a and covtype). This poses the open problem of whether this superlinear phenomenon in the initial stage can be theoretically justified for SVRG, Katyusha, and MiG.
As shown in Figure 1, in the large-condition-number setting, SVR-ADA performs significantly better than Katyusha and MiG. This is partly because the accumulation of strong convexity by dual averaging helps cancel the error from the randomness and allows SVR-ADA to choose a more aggressive smoothness parameter. In the small-condition-number setting, as also shown in Figure 1, SVR-ADA is again significantly better than Katyusha and MiG, which validates our superior theoretical result in (5). Meanwhile, although with weaker theoretical guarantees, SVRG can be competitive with SVR-ADA in the small-condition-number setting, which motivates a tighter convergence analysis for nonaccelerated methods in this regime as future work.
As we see, despite some minor differences among the tasks/datasets shown in Figure 1 and Figure 2, the general behavior is very consistent. In both figures, our method SVR-ADA achieves the state of the art: it is much faster than the nonaccelerated SVRG algorithm in the nonstrongly convex setting and in the strongly convex setting with a large condition number, while in the strongly convex setting with a small condition number it remains competitive with SVRG and is much faster than the other two accelerated algorithms, Katyusha and MiG.
5 Conclusion and Discussion
In this work, for the finite-sum convex optimization problem, we propose to combine accelerated dual averaging with stochastic variance reduction to design more efficient algorithms. The complexity of our algorithm practically matches the theoretical lower bounds for both settings and significantly improves the performance guarantees of existing methods. It is somewhat surprising that with such a simple and unified approach, the convergence rates can be further improved for the well-studied convex finite-sum optimization problem. We believe that the general approach of this paper has potential extensions to nonconvex optimization, distributed settings, and minimax problems.
Acknowledgements
Chaobing would like to thank Yaodong Yu for pointing out the closely related recent work of [LLZ19].
References
[AS20] Kwangjun Ahn and Suvrit Sra. On tight convergence rates of without-replacement SGD. arXiv preprint arXiv:2004.08657, 2020.

[AZ17] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
[AZH16] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. In Advances in Neural Information Processing Systems, pages 1614–1622, 2016.
[AZO17] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
[AZY16] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning, pages 1080–1089, 2016.
 [BB08] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pages 161–168, 2008.
 [BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.

[Bot09] Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the Symposium on Learning and Data Science, Paris, 2009.
[DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
 [Def16] Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in neural information processing systems, pages 676–684, 2016.
 [DO19] Jelena Diakonikolas and Lorenzo Orecchia. The approximate duality gap technique: A unified theory of firstorder methods. SIAM Journal on Optimization, 29(1):660–689, 2019.
[DSSST10] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pages 14–26, 2010.
 [GOP19] Mert Gürbüzbalaban, Asu Ozdaglar, and PA Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, pages 1–36, 2019.
[HK14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
[HS18] Jeffery Z. HaoChen and Suvrit Sra. Random shuffling beats SGD after finite epochs. arXiv preprint arXiv:1806.10077, 2018.
 [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
 [LLX14] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.
 [LLZ19] Guanghui Lan, Zhize Li, and Yi Zhou. A unified variancereduced accelerated gradient method for convex optimization. In Advances in Neural Information Processing Systems, pages 10462–10472, 2019.
 [LMH15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for firstorder optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.

[LPLJ17] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. In 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[LZ18a] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 171(1-2):167–215, 2018.
 [LZ18b] Guanghui Lan and Yi Zhou. Random gradient extrapolation for distributed and stochastic optimization. SIAM Journal on Optimization, 28(4):2753–2782, 2018.
[Nes98] Yurii Nesterov. Introductory Lectures on Convex Programming, Volume I: Basic Course. Lecture notes, 1998.
[Nes15] Yurii Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
[RGP20] Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. arXiv preprint arXiv:2002.10400, 2020.
 [RHS