Many tasks in machine learning and statistics reduce to finite-sum minimization problems of the form
$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1.1)$$
where the individual functions $f_i$ are $L$-smooth and $\mu$-strongly convex or convex. The large datasets often encountered in modern applications have led to high interest in optimization methods which can efficiently cope with a large number of individual functions.
In this work, we measure efficiency of optimization algorithms through the framework of oracle complexity. Concretely, we assume the existence of an external procedure, typically referred to as an oracle, which upon receiving a query, reveals some information about the function at hand (e.g., function values, gradients or Hessians). The oracle complexity of a given optimization algorithm is then the number of oracle calls required to obtain an $\epsilon$-optimal solution, i.e., a point $\bar{x}$ that satisfies $F(\bar{x}) - \min_x F(x) \le \epsilon$ w.h.p. or $\mathbb{E}[F(\bar{x})] - \min_x F(x) \le \epsilon$, where the expectation is over randomness in the algorithm and oracle.
Throughout, we focus on two types of oracles for finite sums: incremental oracles, which allow one to control which individual function is being referred to in each iteration, and stochastic oracles, which provide information on a randomly chosen $f_i$—without revealing its index (see sec:settings for a formal exposition). Although the difference between these two types of oracle may seem insignificant at first glance, the optimal attainable performances of randomized incremental methods (compatible with incremental oracles), such as SAG [SLRB13], SVRG [JZ13] and SDCA [SSZ13], are significantly better than those attainable by stochastic methods (compatible with stochastic oracles).\footnote{We follow here a nomenclatural convention that distinguishes cases where deterministic problems are addressed via random methods, as in (1.1), from cases that are inherently stochastic, as in (3.21) below. This distinction is made formal in sec:settings.}
First-order oracles. A notable example where incremental methods outperform stochastic methods is the case of first-order oracles, which provide function values and gradients. When the $f_i$ are smooth and strongly convex, SAG, SDCA and SVRG enjoy exponential rates of $\tilde{O}(n\log(1/\epsilon))$ (disregarding other problem parameters for now), while the stochastic vanilla SGD obtains significantly slower rates of $O(1/\epsilon)$, e.g., [SSSSC11]. The fact that randomized incremental methods can converge exponentially fast is not altogether trivial. In particular, it implies that the variance of the iterates must also decay exponentially fast, which explains why such methods are often referred to as variance-reduced methods.
The main algorithmic idea behind variance reduction is to wisely incorporate past information acquired from the oracle (see [JZ13]). This naturally raises the following question: is it possible to use the same idea to design stochastic, rather than incremental, variance-reduced methods that achieve exponential convergence rates for finite-sum problems?
In this work, we answer this question in the affirmative. Specifically, we exploit the unique finite noise structure of finite sums to form an estimator that recovers the full gradient of $F$ w.h.p. Based on this estimator, we propose a novel adaptation of SVRG which does not rely on the indices of the individual functions. Combined with the Catalyst acceleration scheme [LMH15], this yields upper complexity bounds of
$$\tilde{O}\big((n^2 + n\sqrt{\kappa})\log(1/\epsilon)\big) \quad \text{and} \quad \tilde{O}\big(n^2 + n\sqrt{L/\epsilon}\big) \qquad (1.2)$$
for the strongly convex and smooth convex cases, respectively, which hold w.h.p. In particular, this shows that the additional finite-sum structure can indeed be used to achieve exponential rates for stochastic methods in the strongly convex case (and an $\tilde{O}(n/\sqrt{\epsilon})$ rate in the smooth convex case), which is impossible for general-purpose stochastic methods [NY83, AWBR09, RR11]. Perhaps surprisingly, although this rate cannot be achieved through a naive empirical average gradient estimator, as we prove in Section 3, a simple rounding correction is all that is needed to obtain an exponential convergence rate w.h.p. (in the strongly convex case).
Although the quadratic dependence on $n$ may seem rather pessimistic, any stochastic method that obtains an $\epsilon$-optimal solution must issue at least $\Omega(n^2)$ oracle queries [Arj17]. In fact, in the class of stochastic first-order methods, one cannot hope to perform better than
$$\tilde{\Omega}\big(n^2 + \sqrt{n\kappa}\log(1/\epsilon)\big) \quad \text{and} \quad \tilde{\Omega}\big(n^2 + \sqrt{nL/\epsilon}\big)$$
for the strongly convex and smooth convex cases, respectively (see (2.16) below for details). The complexity bounds stated in (1.2) provide, therefore, a fair idea of the complexity of obtaining high-accuracy solutions in applications where the indices of the individual functions are not known, or not used.
Global oracles. The $\Omega(n^2)$-lower bound in [Arj17] applies in fact to the much broader class of global oracles, under which a complete specification of a randomly chosen individual function is provided. Clearly, any local information, such as gradients, Hessians, and other high-order derivatives, as well as global information, such as the minimizers of $f_i$ along a given direction (i.e., steepest descent steps), can be extracted through global oracles. An important setting in which one is granted such ‘privileged access’ is empirical risk minimization (ERM),
$$\min_{x} \; \frac{1}{n}\sum_{i=1}^{n} \ell(x; z_i),$$
where $\ell$, the loss function, is known a priori, and $z_1, \dots, z_n$ denote the training samples. Since each individual function is fully parameterized by its associated sample, this implies global access to the randomly chosen individual functions. In this work, by using a similar idea to the one we use for first-order oracles, we show that the global $\Omega(n^2)$-lower complexity bound is tight.
Our contributions can be summarized as follows:
We show that (unlike existing variance-reduced methods which directly rely on the indices of the individual functions) it is possible to minimize smooth strongly-convex finite-sum problems—without using the indices—and still enjoy exponential convergence rates.
To this end, we combine the SVRG method with a novel biased quantized gradient estimator which recovers the full gradient of the average function $F$ w.h.p. We further apply the Catalyst framework to improve the dependence on the condition number and to extend the proposed algorithm to convex functions (which are not necessarily strongly-convex).
We prove that variance reduction cannot be obtained by using SGD or SVRG with a naive gradient estimator, as the rate at which the variance decays in this case is too slow, which in turn leads to polynomial convergence rates. Thus, perhaps surprisingly, the seemingly negligible quantization correction we propose is essential for obtaining exponential (rather than polynomial) rates.
Using an estimator similar to the quantized gradient estimator, we show that the $\Omega(n^2)$-lower bound established in [Arj17] for stochastic global methods for finite sums is essentially tight.
The paper is structured as follows. In sec:settings, we introduce the main ingredients of the oracle complexity framework and discuss relevant lower complexity bounds. sec:related surveys existing approaches for minimizing finite sums and their limitations in the context of stochastic oracles. In sec:estimator, we present our quantized estimator for categorical random variables. With this, we establish in sec:app a tight upper bound for stochastic global oracles and present a variant of SVRG, which does not use indices explicitly, and extend its applicability through the Catalyst framework.
We study the problem of finding an $\epsilon$-optimal solution in the framework of oracle complexity. First, we briefly review its main components. For further details on information-based complexity, see [TW80].
Function classes. We denote by $\mathcal{F}^n$ the class of generic finite-sum problems of the form (1.1). Similarly, let $\mathcal{F}^n_{\mu}$ denote the class of finite sums with $\mu$-strongly convex individual functions, where for any $x, y \in \mathbb{R}^d$ and $i \in [n]$,
$$f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2.$$
If $\mu = 0$, then $f_i$ is merely convex. Likewise, we use $\mathcal{F}^n_{L,\mu}$ to indicate that the individual functions are further assumed to be $L$-smooth, i.e., for any $x, y \in \mathbb{R}^d$ and $i \in [n]$,
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|.$$
Lastly, we always assume that the initial suboptimality of $F$ at the initialization point $x_0$ is bounded from above by
$$F(x_0) - \inf_x F(x) \le \Delta,$$
where $\Delta$ is some positive real scalar assumed to be known.
Oracle Classes Different ways of accessing a given optimization problem are modeled through different oracles, and these, in turn, accommodate different classes of optimization methods. In this work, we consider the following four distinct types of oracles:
Incremental first-order oracle, defined with a parameter $b \in [n]$ as
$$\mathcal{O}^{\mathrm{inc}}_b : \big((x_1, i_1), \dots, (x_b, i_b)\big) \mapsto \big(\nabla f_{i_1}(x_1), \dots, \nabla f_{i_b}(x_b)\big).$$
The oracle allows the user to obtain the gradient of any individual function they desire, at $b$ different points (note that we avoid explicitly stating $b$ in the notation to allow a cleaner presentation). Having $b \ge 2$ is necessary for implementing the SVRG method in sec:app.
Incremental global oracle, defined by
$$\mathcal{O}^{\mathrm{inc}}_{\mathrm{glob}} : i \mapsto f_i.$$
In this oracle model, the entire function $f_i$ is accessible to the user. Hence one can simply issue $n$ queries to get $f_1, \dots, f_n$, and then return an exact minimizer of $F$. Under this oracle model, one completely disregards the cost of computing local information, such as gradients. It can be shown that $\Omega(n)$ queries are necessary to obtain solutions of sufficiently high accuracy (e.g., Lemma 2 in [AS16]).
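To make the global-oracle model concrete, here is a minimal sketch (our own toy example, not from the paper) in which each $f_i$ is a hypothetical one-dimensional quadratic; $n$ global queries reveal all components, after which an exact minimizer of $F$ is returned in closed form.

```python
import numpy as np

# Toy model of the incremental global oracle: each hypothetical component is
# f_i(x) = 0.5*a_i*x^2 + b_i*x, and a query on index i reveals (a_i, b_i),
# i.e., a complete specification of f_i.
rng = np.random.default_rng(0)
n = 5
a = rng.uniform(1.0, 2.0, n)    # per-component curvatures
b = rng.normal(0.0, 1.0, n)     # per-component linear terms

def global_oracle(i):
    return a[i], b[i]

# Issue n queries to reconstruct F(x) = (1/n) sum_i f_i(x), then return an
# exact minimizer: F'(x) = mean(a)*x + mean(b) = 0  =>  x* = -mean(b)/mean(a).
A = np.mean([global_oracle(i)[0] for i in range(n)])
B = np.mean([global_oracle(i)[1] for i in range(n)])
x_star = -B / A
assert abs(A * x_star + B) < 1e-12   # exact stationarity after n queries
```

Under this model the "cost" is exactly the $n$ queries; all the arithmetic of minimizing the reconstructed $F$ is free.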
Stochastic first-order oracle, defined with a parameter $b$ as
$$\mathcal{O}^{\mathrm{sto}}_b : (x_1, \dots, x_b) \mapsto \big(\nabla f_i(x_1), \dots, \nabla f_i(x_b)\big), \qquad i \sim \mathrm{Uniform}([n]),$$
where the index $i$ is drawn uniformly at random and is not revealed to the user. The class of stochastic oracles, the main focus of this work, is used to model methods which do not or cannot explicitly rely on the individual function index. This happens for example when ‘non-enumerable’ data augmentation is used (see, e.g., [LCB07]). Stochastic methods are used more broadly to address stochastic optimization problems of the form $\min_x \mathbb{E}_z[f(x; z)]$, where one is given access to a first-order estimate of the gradient of $f$ at a given point through a randomly drawn $z$. Note that existing variance-reduced methods are not directly implementable with stochastic oracles, as we discuss in sec:related.
Stochastic global oracle, defined by
$$\mathcal{O}^{\mathrm{sto}}_{\mathrm{glob}} : \varnothing \mapsto f_i, \qquad i \sim \mathrm{Uniform}([n]),$$
which, upon each query, returns a complete specification of a uniformly drawn individual function without revealing its index. This oracle is used to study the fundamental statistical limitations of minimizing finite-sum problems, as all other computational aspects are disregarded under this oracle model. As mentioned earlier, an important instance of this setting is ERM. In this case, the global stochastic oracle complexity is typically referred to as sample complexity.
Minimax Oracle Complexity. Next, we define the minimax oracle complexity. Given a class of functions $\mathcal{F}$ and a suitable oracle $\mathcal{O}$, we denote by $\mathcal{A}_m(\mathcal{O})$ the class of all optimization algorithms that access instances in $\mathcal{F}$ by issuing at most $m$ $\mathcal{O}$-queries.\footnote{Strictly speaking, when studying complexity of optimization algorithms, one must carefully define what is meant by ‘algorithm’ and ‘oracle’. We shall not need this level of formality in this work.} Let $x_A$ be the final iterate returned by algorithm $A \in \mathcal{A}_m(\mathcal{O})$ when applied to an $F \in \mathcal{F}$. With this, we define two related notions of minimax complexity:
$$m_{\epsilon}(\mathcal{F}, \mathcal{O}) = \min\big\{ m \;:\; \exists A \in \mathcal{A}_m(\mathcal{O}) \text{ s.t. } \sup_{F \in \mathcal{F}} \mathbb{E}[F(x_A)] - F^* \le \epsilon \big\},$$
$$m_{\epsilon,\delta}(\mathcal{F}, \mathcal{O}) = \min\big\{ m \;:\; \exists A \in \mathcal{A}_m(\mathcal{O}) \text{ s.t. } \sup_{F \in \mathcal{F}} \mathbb{P}\big(F(x_A) - F^* > \epsilon\big) \le \delta \big\},$$
where, here and throughout, $F^*$ denotes the infimum of $F$ over its domain.
A straightforward application of Markov’s inequality shows that the first complexity notion eq:comp_exp, which holds in expectation, implies the second complexity notion eq:comp_whp, which holds w.h.p.. We note in passing that, in spite of being a well-accepted framework in the field of continuous optimization, oracle complexity does not take into account the computational resources required to implement and process oracle calls, and should therefore be regarded as a lower bound on the real computational complexity.
Lower Oracle Complexity Bounds. Some oracles are more expressive than others. For example, it is straightforward to implement the stochastic first-order oracle through the global oracle $\mathcal{O}^{\mathrm{sto}}_{\mathrm{glob}}$. Therefore, by the definitions above and by [Arj17, Theorem 1], we have
$$m_{\epsilon}\big(\mathcal{F}^n_{L,\mu}, \mathcal{O}^{\mathrm{sto}}_b\big) \ge \tilde{\Omega}(n^2),$$
provided that $\epsilon$ is sufficiently small. Here and below, we omit the dependence on the initial suboptimality $\Delta$ for simplicity. Likewise, $\mathcal{O}^{\mathrm{inc}}_b$ can be used to implement $\mathcal{O}^{\mathrm{sto}}_b$ by calling it with a uniformly drawn index, hence
$$m_{\epsilon}\big(\mathcal{F}^n_{L,\mu}, \mathcal{O}^{\mathrm{sto}}_b\big) \ge m_{\epsilon}\big(\mathcal{F}^n_{L,\mu}, \mathcal{O}^{\mathrm{inc}}_b\big) \ge \tilde{\Omega}\big(n + \sqrt{n\kappa}\log(1/\epsilon)\big).$$
Combining both bounds gives
$$m_{\epsilon}\big(\mathcal{F}^n_{L,\mu}, \mathcal{O}^{\mathrm{sto}}_b\big) \ge \tilde{\Omega}\big(n^2 + \sqrt{n\kappa}\log(1/\epsilon)\big). \qquad (2.16)$$
As mentioned earlier, we also have the $\Omega(n)$ bound, but this additional bound is absorbed in the $n^2$ term in (2.16). Using similar arguments, one obtains a bound for the $L$-smooth convex case, $\tilde{\Omega}\big(n^2 + \sqrt{nL/\epsilon}\big)$. Fact 1 summarizes these lower complexity bounds, for reference in later sections. We leave a treatment of the h.p. counterparts to future work.
3 Approaches for Minimizing Finite Sums
Next, we review existing approaches for minimizing finite-sum problems via randomized incremental and stochastic methods. This brief survey also serves as a motivational exposition for the main question we ask in this work, namely, is it possible to apply variance-reduced techniques using first-order stochastic oracles (defined in eq:stoch_orc)?
For concreteness of the following discussion, let us consider the class $\mathcal{F}^n_{L,\mu}$ with $\mu > 0$. A natural approach for addressing finite-sum problems without relying on the indices is to re-express (1.1) as a stochastic optimization problem,
$$\min_{x} \; \mathbb{E}_{i \sim \mathrm{Uniform}([n])}\big[f_i(x)\big], \qquad (3.21)$$
and apply generic stochastic methods, such as SGD [RM51]. The oracle complexity of vanilla SGD is $\tilde{O}(1/(\mu\epsilon))$ (e.g., [NJLS09]), which, although inferior to incremental methods in terms of the dependence on $\epsilon$, does not depend on $n$. This makes SGD particularly suited for settings where $n$ is very large and one desires solutions of moderate accuracy.
Other, more recent variants of SGD use, e.g., mini-batches, importance sampling, or fixed step sizes, to achieve exponential convergence rates—but only up to a certain noise level (see [MB11, NWS14, NW16, QRG19]). It is possible to gradually reduce the convergence noise level, but that would effectively imply polynomial complexity bounds (hence, giving the same rates attainable by vanilla SGD). This should come as no surprise—any general-purpose stochastic first-order method designed for (3.21) is bound to polynomial rates [NY83, AWBR09, RR11].
In contrast to this, if an incremental first-order oracle is given, one can compute the full gradient of $F$ at each iteration by simply iterating over the individual functions, and then use vanilla Gradient Descent (GD) or Accelerated Gradient Descent (AGD, [Nes04]). This yields oracle complexity bounds of
$$\tilde{O}\big(n\kappa\log(1/\epsilon)\big) \quad \text{and} \quad \tilde{O}\big(n\sqrt{\kappa}\log(1/\epsilon)\big)$$
for GD and AGD, respectively, where $\tilde{O}$ hides some logarithmic factors in the problem parameters.
Recently, the new class of variance-reduced methods, such as SAG, SDCA, SVRG, SAGA [DBLJ14], SDCA without duality [Sha16], or MISO/Finito [Mai15, DD14], was shown to enjoy significantly better rates of $\tilde{O}\big((n + \kappa)\log(1/\epsilon)\big)$, or, when combined with acceleration schemes [LMH15, SSZ16, AZ17, WS16],
$$\tilde{O}\big((n + \sqrt{n\kappa})\log(1/\epsilon)\big).$$
Despite their favorable rates, variance-reduced methods cannot be directly implemented through stochastic first-order oracles. Indeed, algorithms like SAG/SAGA/MISO/SDCA keep a list of vectors in memory, one for each individual function. The $i$th vector is then updated using gradients of $f_i$, which are acquired throughout the optimization process. Clearly, one must know which individual function is being addressed at each iteration in order to perform such updates. Another notable example of a variance-reduced method is SVRG, which generates new iterates using the full gradient at some reference point $\tilde{x}$, i.e.,
$$x_{t+1} = x_t - \eta\big(\nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x}) + \nabla F(\tilde{x})\big).$$
When the stochastic oracle allows multiple simultaneous queries (i.e., $b \ge 2$ in eq:stoch_orc), it is possible to evaluate the term $\nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x})$. However, one cannot evaluate the full gradient $\nabla F(\tilde{x})$ by simply iterating over the indices of the individual functions.
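For reference, the standard incremental SVRG loop can be sketched as follows (a toy instance with hypothetical quadratic components of our own choosing; note how the correction term requires querying the same index $i$ at both $x_t$ and the reference point, which an incremental oracle provides for free):

```python
import numpy as np

# Sketch of the standard SVRG loop of [JZ13] on hypothetical components
# f_i(x) = 0.5*(x - c_i)^2. The correction term needs gradients of the SAME
# component f_i at both x_t and the reference point x_ref; an incremental
# oracle makes this possible because the index i is chosen by the user.
rng = np.random.default_rng(0)
n = 50
c = rng.normal(0.0, 1.0, n)
grad = lambda i, x: x - c[i]                    # incremental oracle: grad f_i(x)

eta, x_ref = 0.5, 5.0
for epoch in range(30):
    full_grad = np.mean([grad(i, x_ref) for i in range(n)])   # n oracle calls
    x = x_ref
    for _ in range(2 * n):
        i = rng.integers(n)
        # variance-reduced estimate: unbiased, and it vanishes at the optimum
        g = grad(i, x) - grad(i, x_ref) + full_grad
        x -= eta * g
    x_ref = x

assert abs(x_ref - c.mean()) < 1e-6   # linear convergence to x* = mean(c)
```

The question addressed next is precisely which part of this loop survives when the oracle refuses to reveal $i$: the inner correction term survives (via $b \ge 2$ paired queries), but the full-gradient step does not.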
A natural idea to address this issue is to replace the full gradient by an empirical estimator that simply averages the sampled gradients. Unfortunately, this naive estimator does not lead to the desired exponential convergence.
Assume that in each iteration we make a fixed number $b$ of stochastic oracle calls to form the empirical estimator
$$\tilde{g} = \frac{1}{b}\sum_{j=1}^{b} \nabla f_{i_j}(x),$$
where each $i_j$ is uniformly sampled among the $n$ individual gradients $\nabla f_1(x), \dots, \nabla f_n(x)$, and apply SGD or SVRG with this estimator and a constant stepsize $\eta$. Then we have the following lower bounds:
Lower bound in expectation: There exists $F \in \mathcal{F}^n_{L,\mu}$ such that for sufficiently small $\epsilon$, for any choice of $b$ and stepsize $\eta$, the number of stochastic oracle calls required to ensure $\mathbb{E}[F(x_t)] - F^* \le \epsilon$ is at least polynomial in $1/\epsilon$.
Lower bound in high probability: There exists $F \in \mathcal{F}^n_{L,\mu}$ such that for sufficiently small $\epsilon$ and $\delta$, for any choice of $b$ and stepsize $\eta$, the number of stochastic oracle calls required to ensure $F(x_t) - F^* \le \epsilon$ with probability $1 - \delta$ is at least polynomial in $1/\epsilon$.
We prove thm:naive estimator by carefully tracing the random iterates produced by SGD (with different step sizes) when applied on the finite-sum function (for some even $n$)
$$F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad f_i(x) = \frac{1}{2}x^2 + \sigma_i x,$$
where $\sigma_i = 1$ for $i \le n/2$ and $\sigma_i = -1$ otherwise.
It is straightforward to show that one full iteration of SVRG is equivalent to one SGD step. Hence both methods can be effectively addressed in the same way.
We note that the high probability bound is not directly implied by the bound in expectation. The former requires one to guarantee that the distribution of the iterates does not concentrate around the mean of the iterates, as this sequence can converge exponentially fast. The main tool we use to ensure anti-concentration of the iterates is the Berry-Esseen theorem. One may notice that the dependency on $\epsilon$ in the high probability lower bound is not as good as the one in the expectation lower bound; this may be improvable with a different proof technique. Our main message here is to show that the naive estimator does not provide an exponential convergence rate.
4 A Biased Estimator with Quantization
It turns out that the high variance exhibited by the naive gradient estimator, which leads to the lower bounds above, can be addressed by a simple rounding procedure. We first describe this procedure in a general setting: estimating the parameters of categorical random variables.
We start by introducing relevant definitions. Given a real number $r$, we let $[r]$ denote the closest integer to $r$, where by convention $[k + 1/2] = k + 1$ for any integer $k$. A random variable $X$ is said to be $n$-categorical if $X$ is discrete with finite support $\{x_1, \dots, x_k\} \subseteq V$, where $V$ is some vector space, and
$$\mathbb{P}(X = x_j) = \frac{n_j}{n}, \qquad j \in [k],$$
for some nonnegative integers $n_1, \dots, n_k$ which sum up to $n$, that is, $\sum_{j=1}^{k} n_j = n$. Clearly, stochastic oracles over finite-sum functions induce a categorical distribution on their potential set of answers.
The next simple lemma is a key insight for our analysis.
Let $X$ be an $n$-categorical random variable with support $\{x_1, \dots, x_k\}$. Let $X_1, \dots, X_N$ be i.i.d. samples of $X$ and let $\hat{N}_j$ be the empirical counter of category $j$ defined by
$$\hat{N}_j = \big|\{ i \in [N] : X_i = x_j \}\big|.$$
If $N \ge 2n^2\log(2k/\delta)$, then with probability at least $1 - \delta$ we have that for every $j \in [k]$,
$$\left[\frac{n\hat{N}_j}{N}\right] = n_j.$$
Proof Define $\hat{p}_j = \hat{N}_j/N$ and $p_j = n_j/n$. By Hoeffding’s inequality, we have that for every $j \in [k]$,
$$\mathbb{P}\left(|\hat{p}_j - p_j| \ge \frac{1}{2n}\right) \le 2\exp\left(-\frac{N}{2n^2}\right) \le \frac{\delta}{k},$$
where the last inequality is due to the assumption that $N \ge 2n^2\log(2k/\delta)$. It follows now by the union bound that
$$\mathbb{P}\left(\exists j \in [k] : |\hat{p}_j - p_j| \ge \frac{1}{2n}\right) \le \delta.$$
Thus, with probability at least $1 - \delta$, for all $j \in [k]$, we have $|n\hat{p}_j - n_j| < 1/2$, implying that $[n\hat{p}_j] = n_j$.
Therefore, with appropriate quantization, we can recover the distribution parameters with high probability. More importantly, the estimator
$$\hat{X} = \frac{1}{n}\sum_{j=1}^{k} \left[\frac{n\hat{N}_j}{N}\right] x_j$$
satisfies $\hat{X} = \mathbb{E}[X]$ with probability at least $1 - \delta$ as long as $N \ge 2n^2\log(2k/\delta)$. It is worth noting that the quantized estimator is not unbiased. A simple example follows by a straightforward computation for $n = 2$ (see sec:biased_est_proof in the appendix for full details).
We emphasize that one does not need to know the support set in advance to implement this estimator; the counter can be implemented on the fly. During the sampling procedure, we keep in memory the set of sample ‘types’ seen up to some point, and update it accordingly. lem:categorical_estimation implies that with probability at least $1 - \delta$ all different categories are seen and the corresponding probabilities are recovered.
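The estimation procedure can be sketched as follows (a toy simulation; the sample-size constant is our choice, following the Hoeffding argument above rather than any constant stated in the paper):

```python
import math
from collections import Counter

import numpy as np

# Toy simulation of the quantized estimator for an n-categorical variable.
# The sample size N follows the Hoeffding/union-bound argument of the lemma;
# the exact constant below is our own (conservative) choice.
rng = np.random.default_rng(0)
n = 6
support = np.array([-2.0, 0.5, 3.0])   # distinct categories x_j
counts = np.array([1, 2, 3])           # integers n_j with sum n
probs = counts / n

delta = 1e-3
N = int(8 * n**2 * math.log(2 * len(support) / delta)) + 1
samples = rng.choice(support, size=N, p=probs)

# Empirical counters built on the fly, without knowing the support upfront.
hist = Counter(samples)
quantized = {x: round(n * cnt / N) for x, cnt in hist.items()}

# W.h.p. the integer counts n_j are recovered exactly...
assert sorted(quantized.values()) == sorted(counts.tolist())
# ...so the quantized estimator coincides with E[X] exactly (zero error),
# unlike the naive empirical mean of the samples.
x_hat = sum(q * x for x, q in quantized.items()) / n
assert abs(x_hat - float(probs @ support)) < 1e-12
```

The key point the sketch illustrates: once the rounding succeeds, the estimator's error is exactly zero rather than merely small, which is what later turns SVRG's conditional analysis into an exponential rate.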
5 Application to Stochastic Oracles
Having presented our quantized estimator for categorical random variables, we now use it to design optimization algorithms for finite sums which do not rely on the indices of the individual functions.
Stochastic global oracle. We consider first the oracle complexity of the stochastic global oracle. In this case, a straightforward application of the quantized estimator eq:quane with support set $\{f_1, \dots, f_n\}$ yields the following upper complexity bound.
The minimax complexity of a stochastic global oracle for the finite-sum problem (1.1) is bounded by
$$O\big(n^2\log(n/\delta)\big).$$
By the lower complexity bound established in [Arj17], the bound stated in thm:global_comp is tight up to logarithmic factors. Also, note that the bound stated in thm:global_comp applies to any type of individual functions (including non-convex and non-smooth functions).
Stochastic First-order Oracle. Our quantized estimator can also be used to recover full gradients of $F$. This enables us to implement a ‘quantized’ variant of SVRG (Q-SVRG), over $L$-smooth and $\mu$-strongly convex individual functions—which is compatible with the stochastic oracle $\mathcal{O}^{\mathrm{sto}}_b$. A better dependence on the condition number is then achieved by applying the Catalyst acceleration framework. Moreover, the Catalyst framework also allows us to extend the scope of Q-SVRG to cases where the individual functions are only assumed to be convex, rather than strongly convex.
Each iteration of SVRG starts by obtaining a full gradient of $F$ at some reference point $\tilde{x}$. Next, SVRG generates iterates by setting
$$x_{t+1} = x_t - \eta\big(\nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x}) + \nabla F(\tilde{x})\big),$$
where $x_1 = \tilde{x}$ and $i_t$ is drawn uniformly at random from $[n]$; the stepsize $\eta$ and the number of inner iterations $T$ are assumed to be fixed throughout the optimization process. Lastly, one uses the average of the points $x_1, \dots, x_T$ as the reference point for the next iteration.
The difficulty in implementing SVRG via stochastic first-order oracles is that one cannot simply form the full gradient at the reference point by sequentially iterating over the individual functions. To remedy this, we use the quantized gradient estimator which allows us to recover the full gradient w.h.p. This is formally stated as follows (see full proof in 7.2).
One can compute the full gradient of $F$ at a given point with success probability $1 - \delta$ via $O\big(n^2\log(n/\delta)\big)$ $\mathcal{O}^{\mathrm{sto}}$-calls.
With a union bound, lem:full_gradient can be further used to obtain $k$ exact gradients (see 7.3 for details).
One can compute the full gradients of $F$ at $k$ different points with success probability of $1 - \delta$ via $O\big(kn^2\log(nk/\delta)\big)$ $\mathcal{O}^{\mathrm{sto}}$-calls.
corr:k_full_gradients implies that w.p. at least $1 - \delta$ over the oracle randomness, we can implement $k$ full iterations of SVRG using an overall number of $O\big(k(n^2\log(nk/\delta) + T)\big)$ $\mathcal{O}^{\mathrm{sto}}$-oracle calls, from which we conclude the following result.
With notations as above,
$$m_{\epsilon,\delta}\big(\mathcal{F}^n_{L,\mu}, \mathcal{O}^{\mathrm{sto}}_b\big) \le \tilde{O}\big((n^2 + \kappa)\log(\Delta/(\delta\epsilon))\big).$$
Proof The proof of the theorem follows by a direct application of Theorem 1 in [JZ13]. In particular, the same analysis holds conditioned on the event that the full gradients are exactly recovered at each reference point. Formally, given $\epsilon$, $\delta$ and $k$, we denote by $E_k$ the event that all the first $k$ full gradients are exactly estimated. We have
$$\mathbb{E}\big[F(\tilde{x}_k) - F^* \,\big|\, E_k\big] \le \rho^k\big(F(\tilde{x}_0) - F^*\big),$$
where the convergence factor $\rho$ is given by,
$$\rho = \frac{1}{\mu\eta(1 - 2L\eta)T} + \frac{2L\eta}{1 - 2L\eta}.$$
Setting $\eta = 1/(10L)$ and $T = 50\kappa$ yields $\rho \le 1/2$. Therefore, by Markov’s inequality,
$$\mathbb{P}\big(F(\tilde{x}_k) - F^* \ge \epsilon \,\big|\, E_k\big) \le \frac{\rho^k \Delta}{\epsilon}.$$
Then, setting $k = \log_2\big(2\Delta/(\delta\epsilon)\big)$ and recovering the $k$ full gradients with failure probability at most $\delta/2$, we have
$$\mathbb{P}\big(F(\tilde{x}_k) - F^* \ge \epsilon\big) \le \mathbb{P}\big(F(\tilde{x}_k) - F^* \ge \epsilon \,\big|\, E_k\big) + \mathbb{P}\big(\bar{E}_k\big) \le \delta.$$
Therefore the overall number of $\mathcal{O}^{\mathrm{sto}}$-oracle calls required to obtain an $\epsilon$-optimal solution is bounded from above by
$$O\big(k(n^2\log(nk/\delta) + \kappa)\big) = \tilde{O}\big((n^2 + \kappa)\log(\Delta/(\delta\epsilon))\big).$$
In other words, replacing the naive estimator by the quantized estimator is all we need to obtain an exponential convergence result. We emphasize that this result does not contradict the lower bound in Theorem 1, because Theorem 1 relies on the explicit form of the empirical estimator. Before moving on, it is worth noting that our algorithm requires the stochastic oracle to allow multiple simultaneous queries, i.e., $b \ge 2$ in (2.10).
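A toy sketch of the resulting Q-SVRG scheme (with hypothetical quadratic components and illustrative sample sizes of our own choosing; the oracle hides the sampled index but answers $b = 2$ simultaneous queries):

```python
import numpy as np
from collections import Counter

# Toy sketch of Q-SVRG: recover the full gradient at the reference point via
# the quantized estimator (no indices ever revealed), then run SVRG inner
# steps using paired queries (b = 2: the same hidden i at both points).
# Components f_i(x) = 0.5*(x - c_i)^2 are hypothetical.
rng = np.random.default_rng(0)
n = 4
c = np.array([-1.0, 0.0, 0.5, 2.5])     # distinct minimizers => distinct grads

def sto_oracle(x1, x2):
    """Stochastic first-order oracle with b = 2: one hidden index i."""
    i = rng.integers(n)
    return x1 - c[i], x2 - c[i]

def quantized_full_grad(x, num_samples=2000):
    hist = Counter(g for g, _ in (sto_oracle(x, x) for _ in range(num_samples)))
    # round each empirical frequency to the nearest multiple of 1/n
    return sum(round(n * cnt / num_samples) * g for g, cnt in hist.items()) / n

eta, x_ref = 0.5, 10.0
for epoch in range(20):
    mu_ref = quantized_full_grad(x_ref)     # equals grad F(x_ref) w.h.p.
    x = x_ref
    for _ in range(2 * n):
        g_x, g_ref = sto_oracle(x, x_ref)   # same hidden component at both
        x -= eta * (g_x - g_ref + mu_ref)
    x_ref = x

assert abs(x_ref - c.mean()) < 1e-6
```

The per-reference-point cost of `quantized_full_grad` is the $\tilde{O}(n^2)$ overhead discussed above; the inner loop is identical to standard SVRG except that the sampled index stays hidden inside the oracle.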
Next, to improve the dependence on condition number, we apply the Catalyst acceleration framework [LMH15] to Q-SVRG. Catalyst acts as a wrapper that takes as input an algorithm that converges exponentially fast and outputs an accelerated algorithm.
In detail, at iteration $k$, we replace the original objective function by an auxiliary objective $F_k$ defined by
$$F_k(x) = F(x) + \frac{\tau}{2}\|x - y_{k-1}\|^2,$$
where $\tau$ is a well-chosen regularization parameter and $y_{k-1}$ is obtained by extrapolating solutions of previous subproblems. We optimize $F_k$ up to a prescribed accuracy and use the solution to warm-start the next subproblem.
For the complexity analysis of the Catalyst acceleration framework, assume that a given optimization algorithm $\mathcal{M}$ converges exponentially fast for the smooth and strongly convex problems $F_k$, i.e.,
$$F_k(x_t) - F_k^* \le C(1 - \rho_{\mathcal{M}})^t\big(F_k(x_0) - F_k^*\big).$$
Then, applying Catalyst on $\mathcal{M}$ yields [LMH15] a global complexity bound of
$$\tilde{O}\left(\frac{1}{\rho_{\mathcal{M}}}\sqrt{\frac{\mu + \tau}{\mu}}\,\log(1/\epsilon)\right)$$
for finding an $\epsilon$-optimal solution. We remark that $\tau$ is a free parameter and hence can be chosen to minimize the overall complexity bound.
In our case, the convergence rate parameter of Q-SVRG applied to $F_k$ satisfies
$$\frac{1}{\rho_{\mathcal{M}}} = \tilde{O}\left(n^2 + \frac{L + \tau}{\mu + \tau}\right).$$
Therefore, the total number of $\mathcal{O}^{\mathrm{sto}}$-calls is given by
$$\tilde{O}\left(\left(n^2 + \frac{L + \tau}{\mu + \tau}\right)\sqrt{\frac{\mu + \tau}{\mu}}\,\log(1/\epsilon)\right).$$
Minimizing the total number of $\mathcal{O}^{\mathrm{sto}}$-calls with respect to $\tau$ yields $\tau = L/n^2 - \mu$. Plugging in this value of $\tau$ gives the desired accelerated complexity bound
$$\tilde{O}\big((n^2 + n\sqrt{\kappa})\log(1/\epsilon)\big).$$
For randomized incremental methods, the acceleration occurs in the ill-conditioned regime where $\kappa > n$. Here, due to the augmented cost of evaluating the full gradient, acceleration only occurs in the extremely ill-conditioned regime in which $\kappa > n^2$.
The Catalyst framework is also useful as a means of extending the applicability of Q-SVRG to smooth convex finite sums. This follows by the fact that the subproblems $F_k$ are always strongly convex when $\tau > 0$. In this case, applying Catalyst yields a global complexity bound of
$$\tilde{O}\left(\left(n^2 + \frac{L + \tau}{\tau}\right)\sqrt{\frac{\tau}{\epsilon}}\right)$$
for finding an $\epsilon$-optimal solution. Again, minimizing the global complexity with respect to the parameter $\tau$ yields $\tau = L/n^2$, by which we obtain the following complexity bound
$$\tilde{O}\big(n^2 + n\sqrt{L/\epsilon}\big).$$
Both complexity bounds are summarized as follows.
With notations as above,
$$m_{\epsilon,\delta}\big(\mathcal{F}^n_{L,\mu}, \mathcal{O}^{\mathrm{sto}}_b\big) \le \tilde{O}\big((n^2 + n\sqrt{\kappa})\log(1/\epsilon)\big), \qquad m_{\epsilon,\delta}\big(\mathcal{F}^n_{L,0}, \mathcal{O}^{\mathrm{sto}}_b\big) \le \tilde{O}\big(n^2 + n\sqrt{L/\epsilon}\big).$$
The bounds stated in thm:acc_rates are partly complemented by the lower bounds given in eq:lb_n2_fs_sc and eq:lb_n2_fs_nsc, namely, $\tilde{\Omega}\big(n^2 + \sqrt{n\kappa}\log(1/\epsilon)\big)$ and $\tilde{\Omega}\big(n^2 + \sqrt{nL/\epsilon}\big)$. Specifically, in both cases, the proposed SVRG variant is tight w.r.t. the global term $n^2$. However, the first-order term of Q-SVRG corresponds to that of deterministic methods, namely, $n\sqrt{\kappa}\log(1/\epsilon)$ (in the strongly convex case), whereas the first-order term of randomized incremental methods is $\sqrt{n\kappa}\log(1/\epsilon)$, and thus misses a factor of $\sqrt{n}$.
6 Discussion and Future Work
In this paper, we showed that although current variance-reduced finite-sum methods directly rely on the indices of the individual functions, it is possible to achieve variance reduction, and thereby exponential convergence rates, even without this knowledge. Although this variance reduction cannot be achieved by simply averaging over the gradients, a simple rounding procedure suffices to obtain an exponential convergence rate w.h.p.
The cost of not having access to or disregarding the indices of the individual functions (as is often done in practice) is an expensive $n^2$-term in the upper complexity bound—which is inevitable for stochastic methods compatible with $\mathcal{O}^{\mathrm{sto}}$. This leads to a factor of $n$ in the first-order term (rather than the $\sqrt{n}$-factor exhibited by incremental methods), which we suspect is tight for stochastic methods. We leave addressing this gap to future work.
One limitation of our approach is the requirement of issuing two or more queries simultaneously (i.e., $b \ge 2$ in eq:stoch_orc). This assumption is necessary to compute the expression
$$\nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x})$$
for the SVRG update rule. Replacing it by
$$\nabla f_{i_t}(x_t) - \nabla f_{j_t}(\tilde{x}),$$
where $i_t$ and $j_t$ are drawn independently, introduces additional variance and breaks the current analysis. Since our quantized estimator is still applicable when $b = 1$, one can still implement GD or AGD and obtain an exponential convergence rate of $\tilde{O}\big(n^2\sqrt{\kappa}\log(1/\epsilon)\big)$, which is nontrivial to achieve in the stochastic setting.
That said, to the best of our knowledge, no variance reduction technique is applicable with $b = 1$. This leads to an interesting open question: is it possible to exploit the finite-sum structure and achieve an exponential convergence rate better than that of AGD in the stochastic first-order setting with $b = 1$? Addressing this question will provide further understanding of the variance reduction technique.
This research was supported by The Defense Advanced Research Projects Agency (grant number YFA17N66001-17-1-4039). The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.
- [Arj17] Yossi Arjevani. Limitations on variance-reduction and acceleration schemes for finite sums optimization. In Advances in Neural Information Processing Systems, pages 3540–3549, 2017.
- [AS16] Yossi Arjevani and Ohad Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems, pages 3540–3548, 2016.
- [AWBR09] Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
- [AZ17] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194–8244, 2017.
- [Ber41] Andrew C Berry. The accuracy of the gaussian approximation to the sum of independent variates. Transactions of the american mathematical society, 49(1):122–136, 1941.
- [DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
- [DD14] Aaron Defazio, Justin Domke, et al. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pages 1125–1133, 2014.
- [Ess42] C.-G. Esseen. On the liapunoff limit of error in the theory of probability. Arkiv för Matematik, Astronomi och Fysik, A28, 1942.
- [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
- [KS10] V Yu Korolev and Irina G Shevtsova. On the upper bound for the absolute constant in the berry–esseen inequality. Theory of Probability & Its Applications, 54(4):638–658, 2010.
- [LCB07] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. Large scale kernel machines, pages 301–320, 2007.
- [LMH15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.
- [Mai15] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
- [MB11] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
- [Nes04] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.
- [NJLS09] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
- [NW16] Deanna Needell and Rachel Ward. Batched stochastic gradient descent with weighted sampling. In International Conference Approximation Theory, pages 279–306. Springer, 2016.
- [NWS14] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in neural information processing systems, pages 1017–1025, 2014.
- [NY83] AS Nemirovsky and DB Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience, New York, 1983.
- [QRG19] Xun Qian, Peter Richtárik, Robert M. Gower, Alibek Sailanbayev, Nicolas Loizou, and Egor Shulgin. SGD with arbitrary sampling: General analysis and improved rates. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 5200–5209, 2019.
- [RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
- [RR11] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dynamics in convex programming. Information Theory, IEEE Transactions on, 57(10):7036–7056, 2011.
- [Sha16] Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 747–754, 2016.
- [SLRB13] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.
- [SSSSC11] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
- [SSZ13] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
- [SSZ16] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.
- [TW80] Joseph Frederick Traub and Henryk Wozniakowski. A general theory of optimal algorithms. Academic Press New York, 1980.
- [WS16] Blake E Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.
7 Supplementary Material
7.1 Quantized Estimator is Biased
Consider $n = 2$, i.e., there are two categories, each having probability $1/2$. Now assume $N = 4$; then all the possible couples (up to permutation) of empirical counts $(\hat{N}_1, \hat{N}_2)$ are $(4,0)$, $(3,1)$, $(2,2)$, $(1,3)$, $(0,4)$. The corresponding quantized couples $\big([2\hat{N}_1/4], [2\hat{N}_2/4]\big)$ are $(2,0)$, $(2,1)$, $(1,1)$, $(1,2)$, $(0,2)$. Note that all the couples sum up to $2$ except $(2,1)$ and $(1,2)$. Thus the estimator is biased. (If it were unbiased, the expectation of the sum of the quantized counts should be $2$, but here it is $5/2$.)
7.2 Proof of lem:full_gradient
Proof Let $\nabla f_1(x), \dots, \nabla f_n(x)$ be the gradients of the individual functions corresponding to some point $x$ in $\mathbb{R}^d$, and let $\{g_1, \dots, g_k\}$ denote the set of distinct gradients (note that $k \le n$, with strict inequality if two functions share the same gradient). Denote $n_j = |\{i \in [n] : \nabla f_i(x) = g_j\}|$. Note that the full gradient can be equivalently expressed as:
$$\nabla F(x) = \frac{1}{n}\sum_{j=1}^{k} n_j g_j.$$
Let $G_1, \dots, G_N$ be the answers of the first-order oracle, and let $\{\hat{g}_1, \dots, \hat{g}_{\hat{k}}\}$ denote the corresponding set of distinct gradients (here $\hat{k} \le k$, with strict inequality if one of the gradients was not sampled). We let $\hat{N}_j = |\{i \in [N] : G_i = \hat{g}_j\}|$, and estimate the gradient through
$$\hat{\nabla} = \frac{1}{n}\sum_{j=1}^{\hat{k}} \left[\frac{n\hat{N}_j}{N}\right] \hat{g}_j.$$
By lem:categorical_estimation we have that with probability at least $1 - \delta$ and up to permutation of the indices, for every $j \in [k]$, $[n\hat{N}_j/N] = n_j$, in which case
$$\hat{\nabla} = \frac{1}{n}\sum_{j=1}^{k} n_j g_j = \nabla F(x).$$
7.3 Proof of corr:k_full_gradients
By lem:full_gradient, $O\big(n^2\log(nk/\delta)\big)$ $\mathcal{O}^{\mathrm{sto}}$-calls suffice to compute the full gradient at a given point with failure probability of at most $\delta/k$. Hence, by the union bound, $k$ full gradients can be obtained with failure probability of at most $\delta$ by using
$$O\big(kn^2\log(nk/\delta)\big)$$
$\mathcal{O}^{\mathrm{sto}}$-calls.
7.4 Proof of thm:naive estimator
Proof Assume $n$ is even, and we define $f_i(x) = \frac{1}{2}x^2 + \sigma_i x$, where $\sigma_i = 1$ for $i \le n/2$ and $\sigma_i = -1$ otherwise. In this case, $F(x) = \frac{1}{2}x^2$ and the minimum $F^* = 0$. We now consider applying gradient descent (GD) with the naive unbiased gradient estimator on $F$.
At iteration $t$, we sample $b$ stochastic oracles, which is equivalent to picking $b$ indices at random independently; then the update is given by
$$x_{t+1} = x_t - \eta(x_t + \bar{\sigma}_t), \qquad \bar{\sigma}_t = \frac{1}{b}\sum_{j=1}^{b}\sigma_{i_j}.$$
Note that $\bar{\sigma}_t = \frac{2B_t - b}{b}$, where
$$B_t \sim \mathrm{Bin}(b, 1/2)$$
is the binomial distribution. Therefore,
$$x_{t+1} = (1 - \eta)x_t - \eta\bar{\sigma}_t.$$
A simple telescopic summing yields,
$$x_{t+1} = (1 - \eta)^{t+1}x_0 - \eta\sum_{s=0}^{t}(1 - \eta)^{t-s}\bar{\sigma}_s.$$
Note that $\mathbb{E}[\bar{\sigma}_s] = 0$ and $\mathrm{Var}(\bar{\sigma}_s) = 1/b$. Therefore, recalling that $F(x) = \frac{1}{2}x^2$, we have
$$\mathbb{E}[F(x_{t+1})] = \frac{1}{2}(1 - \eta)^{2(t+1)}x_0^2 + \frac{\eta^2}{2b}\sum_{s=0}^{t}(1 - \eta)^{2(t-s)}.$$
In order to guarantee $\mathbb{E}[F(x_t)] \le \epsilon$, it is necessary to have both
$$\frac{1}{2}(1 - \eta)^{2t}x_0^2 \le \epsilon \qquad \text{and} \qquad \frac{\eta^2}{2b}\sum_{s<t}(1 - \eta)^{2s} \le \epsilon.$$
Let $m = bt$ denote the total number of oracle calls. If $\eta \le 4b\epsilon$, then the first condition requires
$$t = \Omega\left(\frac{\log(x_0^2/\epsilon)}{\eta}\right) = \Omega\left(\frac{\log(x_0^2/\epsilon)}{b\epsilon}\right), \quad \text{hence} \quad m = bt = \Omega\left(\frac{\log(x_0^2/\epsilon)}{\epsilon}\right).$$
If $\eta > 4b\epsilon$, then for $t$ large enough the variance term satisfies
$$\frac{\eta^2}{2b}\sum_{s<t}(1 - \eta)^{2s} \ge \frac{\eta}{4b} > \epsilon.$$
Therefore in both cases the complexity of obtaining an $\epsilon$-optimal solution of $F$ is lower bounded by $\Omega(\log(1/\epsilon)/\epsilon)$.
High probability result:
We consider the same function $F$. Without loss of generality, let us assume $x_0 > 0$. Note that $F(x) = \frac{1}{2}x^2$; bounding $F(x_t)$ is thus equivalent to bounding $|x_t|$