An optimization problem central to machine learning and statistics is that of finite sums:
$$\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w), \tag{1}$$
where the individual functions $f_i$ are assumed to possess some favorable analytical properties, such as Lipschitz-continuity, smoothness or strong convexity (see [Nes04] for details). We measure the iteration complexity of a given optimization algorithm by determining how many evaluations of individual functions (via some external oracle procedure, along with their gradient, Hessian, etc.) are needed in order to obtain an $\epsilon$-solution, i.e., a point $\hat{w}$ which satisfies $\mathbb{E}[F(\hat{w}) - \min_{w} F(w)] < \epsilon$ (where the expectation is taken w.r.t. the algorithm and the oracle randomness).
Arguably, the simplest way of minimizing finite sum problems is by using optimization algorithms for general optimization problems. For concreteness of the following discussion, let us assume for the moment that the individual functions are $L$-smooth and $\mu$-strongly convex. In this case, by applying vanilla Gradient Descent (GD) or Accelerated Gradient Descent (AGD, [Nes04]), one obtains iteration complexity of
$$\tilde{O}(n \kappa \ln(1/\epsilon)) \quad \text{or} \quad \tilde{O}(n \sqrt{\kappa} \ln(1/\epsilon)), \tag{2}$$
respectively, where $\kappa := L/\mu$ denotes the condition number of the problem and $\tilde{O}$ hides logarithmic factors in the problem parameters. However, whereas such bounds enjoy logarithmic dependence on the accuracy level, the multiplicative dependence on $n$ renders this approach unsuitable for modern applications where $n$ is very large.
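To make the gap between the GD and AGD rates concrete, the following is a minimal sketch on a toy diagonal quadratic. The helper names (`run_gd`, `run_agd`), the test function, and the constant-momentum form of AGD are illustrative assumptions and are not taken from the paper. Each full-gradient step on a finite sum costs $n$ individual gradient evaluations, so the oracle cost is $n$ times the returned iteration count.

```python
import math

def objective(w, h):
    # F(w) = 0.5 * sum_j h_j * w_j^2, minimized at w = 0 with F* = 0.
    return 0.5 * sum(hj * wj * wj for hj, wj in zip(h, w))

def grad(w, h):
    return [hj * wj for hj, wj in zip(h, w)]

def run_gd(h, L, eps, max_iter=100000):
    """Gradient descent with step 1/L; returns iterations until F(w) < eps."""
    w = [1.0] * len(h)
    for k in range(max_iter):
        if objective(w, h) < eps:
            return k
        g = grad(w, h)
        w = [wj - gj / L for wj, gj in zip(w, g)]
    return max_iter

def run_agd(h, L, mu, eps, max_iter=100000):
    """Accelerated GD with constant momentum (assumed strongly convex variant)."""
    sqrt_kappa = math.sqrt(L / mu)
    beta = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)
    w = [1.0] * len(h)
    w_prev = list(w)
    for k in range(max_iter):
        if objective(w, h) < eps:
            return k
        # Extrapolate, then take a gradient step from the extrapolated point.
        y = [wj + beta * (wj - pj) for wj, pj in zip(w, w_prev)]
        g = grad(y, h)
        w_prev = w
        w = [yj - gj / L for yj, gj in zip(y, g)]
    return max_iter

if __name__ == "__main__":
    h = [1.0, 10.0, 100.0]  # curvatures, so mu = 1 and L = 100
    print("GD iterations: ", run_gd(h, 100.0, 1e-6))
    print("AGD iterations:", run_agd(h, 100.0, 1.0, 1e-6))
```

On this example the AGD count should scale with $\sqrt{\kappa}$ rather than $\kappa$, but in both cases the per-iteration cost is still multiplied by $n$.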
A different approach to tackling a finite sum problem is to reformulate it as a stochastic optimization problem, i.e., $\min_{w} \mathbb{E}_{i \sim U([n])}[f_i(w)]$, and then apply a general stochastic method, such as SGD, which allows iteration complexity of $O(1/\epsilon)$ or $O(1/\epsilon^2)$ (depending on the problem parameters). These methods offer rates which do not depend on $n$, and are therefore attractive for situations where one seeks a solution of relatively low accuracy. An evident drawback of these methods is their broad applicability for stochastic optimization problems, which may conflict with the goal of efficiently exploiting the unique noise structure of finite sums (indeed, in the general stochastic setting, these rates cannot be improved, e.g., [AWBR09, RR11]).
In recent years, a major breakthrough was made when stochastic methods specialized to finite sums (first SAG [SLRB13] and SDCA [SSZ13], and then SAGA [DBLJ14], SVRG [JZ13], SDCA without duality [SS15], and others) were shown to obtain iteration complexity of
$$\tilde{O}((n + \kappa) \ln(1/\epsilon)). \tag{3}$$
The ability of these algorithms to enjoy both logarithmic dependence on the accuracy parameter and an additive dependence on $n$ is widely attributed to the fact that the noise of finite sum problems distributes over a finite set of size $n$. Perhaps surprisingly, in this paper we show that another key ingredient is crucial, namely, a means of knowing which individual function is being referred to by the oracle at each iteration. In particular, this shows that variance-reduction mechanisms (see, e.g., [DBLJ14, Section 3]) cannot be applied without explicitly knowing the ‘identity’ of the individual functions. On the more practical side, this result shows that when data augmentation (e.g., [LCB07]) is done without an explicit enumeration of the added samples, it is impossible to obtain iteration complexity as stated in (3) (see [BM16] for relevant upper bounds).
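To illustrate why a variance-reduction mechanism relies on the identity of the individual functions, here is a minimal sketch of a SAGA-style update on a one-dimensional least-squares finite sum. The problem instance $f_i(w) = \frac{1}{2}(a_i w - b_i)^2$, the step size $1/(3L)$, and all function names are assumptions made for illustration. The point to notice is the lookup `table[i]`: the stored gradient of the sampled function can only be retrieved and overwritten if the index `i` is known.

```python
import random

def saga_least_squares(a, b, num_steps=2000, seed=0):
    """SAGA-style iteration for f_i(w) = 0.5 * (a_i * w - b_i)^2 in one dimension."""
    n = len(a)
    L = max(ai * ai for ai in a)       # smoothness of the individual functions
    step = 1.0 / (3.0 * L)             # assumed step size
    rng = random.Random(seed)
    w = 0.0
    # One stored gradient per individual function, addressed by its index.
    table = [ai * (ai * w - bi) for ai, bi in zip(a, b)]
    table_mean = sum(table) / n
    for _ in range(num_steps):
        i = rng.randrange(n)           # the index must be known here
        g_new = a[i] * (a[i] * w - b[i])
        # Variance-reduced direction: new gradient minus the stored one plus the mean.
        w -= step * (g_new - table[i] + table_mean)
        table_mean += (g_new - table[i]) / n
        table[i] = g_new
    return w

def least_squares_minimizer(a, b):
    """Closed-form minimizer of (1/n) * sum_i 0.5 * (a_i * w - b_i)^2."""
    return sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
```

If the oracle returned `g_new` without revealing `i`, neither the subtraction of `table[i]` nor the table update could be carried out, which is the situation studied in Section 2.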
Although variance-reduction mechanisms are essential for obtaining an additive dependence on $n$ (as shown in (3)), they do not necessarily yield ‘accelerated’ rates which depend on the square root of the condition number (as shown in (2) for AGD). Recently, generic acceleration schemes were used by [LMH15] and accelerated SDCA [SSZ16] to obtain iteration complexity of
$$\tilde{O}((n + \sqrt{n \kappa}) \ln(1/\epsilon)). \tag{4}$$
The first category of lower bounds exploits the degree of freedom offered by a- (or an infinite-) dimensional space to show that any first-order and a certain class of second-order methods cannot obtain better rates than (4) in the regime where the number of iterations is less than . The second category of lower bounds is based on maintaining the complexity of the functional form of the iterates, thereby establishing bounds for first-order and coordinate-descent algorithms whose step sizes are oblivious to the problem parameters (e.g., SAG, SAGA, SVRG, SDCA, SDCA without duality) for any number of iterations, regardless of and .
In this work, we further extend the theory of oblivious finite sum algorithms, by showing that if a first-order and a coordinate-descent oracle are used, then acceleration is not possible without explicit knowledge of the strong convexity parameter. This implies that in cases where only a poor estimate of the strong convexity parameter is available, faster rates may be obtained through ‘adaptive’ algorithms (see relevant discussions in [SLRB13, AS16b]).
Next, we show that in the smooth and convex case, oblivious finite sum algorithms which, on average, apply the same update rule at each iteration (e.g., SAG, SDCA, SVRG, SVRG++ [AZY16], and typically, other algorithms with a variance-reduction mechanism as described in [DBLJ14, Section 3]), are bound to iteration complexity of , where denotes the smoothness parameter (rather than ). To show this, we employ a restarting scheme (see [AS16b]) which explicitly introduces the strong convexity parameter into algorithms that are designed for smooth and convex functions. Finally, we use this scheme to establish a tight dimension-free lower bound for smooth and convex finite sums which holds for oblivious algorithms with a first-order and a coordinate-descent oracle.
To summarize, our contributions (in order of appearance) are the following:
In Section 2, we prove that in the setting of stochastic optimization, having finitely supported noise (as in finite sum problems) is not sufficient for obtaining linear convergence rates with a linear dependence on $n$; one must also know exactly which individual function is being referred to by the oracle at each iteration. Deriving similar results for various settings, we show that SDCA, accelerated SDCA, SAG, SAGA, SVRG, SVRG++ and other finite sum algorithms must have a proper enumeration of the individual functions in order to obtain their stated convergence rate.
In Section 3.1, we lay the foundations of the framework of general CLI algorithms (see [AS16a]), which enables us to formally address oblivious algorithms (e.g., when step sizes are scheduled regardless of the function at hand). In Section 3.2, we improve upon [AS16b], by showing that (in this generalized framework) the optimal iteration complexity of oblivious, deterministic or stochastic, finite sum algorithms with both first-order and coordinate-descent oracles cannot perform better than , unless the strong convexity parameter is provided explicitly. In particular, the richer expressive power of this framework allows addressing incremental gradient methods, such as Incremental Gradient Descent [Ber97] and Incremental Aggregated Gradient [BHG07, IAG].
In Section 3.3, we show that, in the -smooth and convex case, the optimal complexity bound (in terms of the accuracy parameter) of oblivious algorithms whose update rules are (on average) fixed for any iteration is (rather than , as obtained, e.g., by accelerated SDCA). To show this, we first invoke a restarting scheme (used by [AS16b]) to explicitly introduce strong convexity into algorithms for finite sums with smooth and convex individuals, and then apply the result derived in Section 3.2.
2 The Importance of Individual Identity
In the following, we address the stochastic setting of finite sum problems (1) where one is equipped with a stochastic oracle which, upon receiving a call, returns some individual function chosen uniformly at random and hides its index. We show that not knowing the identity of the function returned by the oracle (as opposed to an incremental oracle which addresses the specific individual functions chosen by the user), significantly harms the optimal attainable performance. To this end, we reduce the statistical problem of estimating the bias of a noisy coin into that of optimizing finite sums. This reduction (presented below) makes an extensive use of elementary definitions and tools from information theory, all of which can be found in [CT12].
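The difference between the two oracle types discussed above can be sketched as follows. This is only an interface illustration under assumed names (`make_quadratic_grads`, `incremental_oracle`, `stochastic_oracle`) with toy individual functions $f_i(w) = \frac{1}{2}(w - t_i)^2$; it is not the construction used in the reduction.

```python
import random

def make_quadratic_grads(targets):
    """Gradients of the toy individuals f_i(w) = 0.5 * (w - t_i)^2."""
    return [lambda w, t=t: w - t for t in targets]

def incremental_oracle(grads, w, i):
    """The caller chooses the index i, so the answer can be attributed to f_i."""
    return grads[i](w)

def stochastic_oracle(grads, w, rng):
    """The index is drawn uniformly inside the oracle and is never returned."""
    i = rng.randrange(len(grads))
    return grads[i](w)

if __name__ == "__main__":
    grads = make_quadratic_grads([1.0, 2.0, 6.0])
    rng = random.Random(0)
    print(incremental_oracle(grads, 0.0, 1))  # gradient of the second individual
    print(stochastic_oracle(grads, 0.0, rng)) # gradient of an unidentified individual
```

An algorithm restricted to `stochastic_oracle` only ever sees the answer, which is the setting in which the lower bounds of this section apply.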
First, given , we define the following finite sum problem
is w.l.o.g. assumed to be odd, and are some functions (to be defined later). We then define the following discrepancy measure between and for different values of (see also [AWBR09]),
where . It is easy to verify that no solution can be -optimal for both and , at the same time. Thus, by running a given optimization algorithm long enough to obtain -solution w.h.p., we can deduce the value of . Also, note that, one can simplify the computation of by choosing convex such that . Indeed, in this case, we have (in particular, ), and since is convex, it must attain its minimum at , which yields
Next, we let
be drawn uniformly at random, and then use the given optimization algorithm to estimate the bias of a random variable which, conditioned on , takes w.p. , and w.p. . To implement the stochastic oracle described above, conditioned on , we draw i.i.d. copies of , denoted by , and return , if , and , otherwise. Now, if is such that
for both and , then by Markov's inequality, we have that
(note that is a non-negative random variable). We may now try to guess the value of using the following estimator
whose probability of error, as follows by Inequality (8), is
Lastly, we show that the existence of an estimator for with high probability of success implies that . To this end, note that the corresponding conditional dependence structure of this probabilistic setting can be modeled as follows: . Thus, we have
where and denote the Shannon entropy function and the binary entropy function, respectively, follows by the data processing inequality (in terms of entropy), follows by Fano’s inequality and follows from Equation (9). Applying standard entropy identities, we get
where follows from Bayes rule, follows by the fact that , conditioned on , are i.i.d. and
follows from the chain rule and the fact that conditioning reduces entropy. Combining this with Inequality (10) and rearranging, we have
The minimal number of stochastic oracle calls required to obtain -optimal solution for problem (5) is .
Instantiating this scheme for of various analytical properties yields the following.
Theorem 1. When solving a finite sum problem (defined in (1)) with a stochastic oracle, one needs at least oracle calls in order to obtain an accuracy level of:
for smooth and strongly convex individuals with condition .
for -smooth and convex individuals.
if , and , otherwise, for -Lipschitz continuous and -strongly convex individuals.
One can easily verify that are -smooth and convex functions, and that the minimizer of is . By Equation (7), we get .
over the unit ball. Clearly, are -Lipschitz continuous and -strongly convex functions. It can be verified that the minimizer of is . Therefore, by Equation (7), we see that in this case we have
A few conclusions can be readily made from Theorem 1. First, if a given optimization algorithm obtains an iteration complexity of an order of , up to logarithmic factors (including the norm of the minimizer which, in our construction, is of an order of and coupled with the accuracy parameter), for solving smooth and strongly convex finite sum problems with a stochastic oracle, then
Thus, the following holds for optimization of finite sums with smooth and strongly convex individuals.
In order to obtain linear convergence rate with linear dependence on , one must know the index of the individual function addressed by the oracle.
This implies that variance-reduction methods such as SAG, SAGA, SDCA and SVRG (possibly combined with acceleration schemes), which exhibit linear dependence on , cannot be applied when data augmentation is used. In general, this conclusion also holds for cases when one applies general first-order optimization algorithms, such as AGD, on finite sums, as this typically results in a linear dependence on . Secondly, if a given optimization algorithm obtains an iteration complexity of an order of for solving smooth and convex finite sum problems with a stochastic oracle, then . Therefore, and , indicating that an iteration complexity of an order of , as obtained by, e.g., SVRG++, is not attainable with a stochastic oracle. Similar reasoning based on the Lipschitz and strongly convex case in Theorem 1 shows that the iteration complexity guaranteed by accelerated SDCA is also not attainable in this setting.
3 Oblivious Optimization Algorithms
In the previous section, we discussed different situations under which variance-reduction schemes are not applicable. Now, we turn to study under what conditions one can apply acceleration schemes. First, we define the framework of oblivious CLI algorithms. Next, we show that, for this family of algorithms, knowing the strong convexity parameter is crucial for obtaining accelerated rates. We then describe a restarting scheme through which we establish that stationary algorithms (whose update rules are, on average, the same for every iteration) for smooth and convex functions are sub-optimal. Finally, we use this reduction to derive a tight lower bound for smooth and convex finite sums on the iteration complexity of any oblivious algorithm (not just stationary).
In the sequel, following [AS16a], we present the analytic framework through which we derive iteration complexity bounds. This, perhaps pedantic, formulation will allow us to study somewhat subtle distinctions between optimization algorithms. First, we give a rigorous definition for a class of optimization problems which emphasizes the role of prior knowledge in optimization.
Definition 1 (Class of Optimization Problems).
A class of optimization problems is an ordered triple , where is a family of functions defined over some domain designated by , is the side-information given prior to the optimization process and is a suitable oracle procedure which upon receiving and in some parameter set , returns for a given (we shall omit the subscript in when is clear from the context).
In finite sum problems, comprises functions as defined in (1); the side-information may contain the smoothness parameter , the strong convexity parameter and the number of individual functions ; and the oracle may allow one to query about a specific individual function (as in the case of incremental oracle, and as opposed to the stochastic oracle discussed in Section 2). We now turn to define CLI optimization algorithms (see [AS16a] for a more comprehensive discussion).
Definition 2 (CLI).
An optimization algorithm is called a Canonical Linear Iterative (CLI) optimization algorithm over a class of optimization problems , if given an instance and initialization points , where is some index set, it operates by iteratively generating points such that for any ,
holds, where are parameters chosen, stochastically or deterministically, by the algorithm, possibly based on the side-information. If the parameters do not depend on previously acquired oracle answers, we say that the given algorithm is oblivious. For notational convenience, we assume that the solution returned by the algorithm is stored in .
Throughout the rest of the paper, we shall be interested in oblivious CLI algorithms (for brevity, we usually omit the ‘CLI’ qualifier) equipped with the following two incremental oracles:
where , denotes the ’th -dimensional unit vector and . We restrict the oracle parameters such that only one individual function is allowed to be accessed at each iteration. We remark that the family of oblivious algorithms with a first-order and a coordinate-descent oracle is wide and subsumes SAG, SAGA, SDCA, SDCA without duality, SVRG, SVRG++ to name a few. Also, note that coordinate-descent steps w.r.t. partial gradients can be implemented using the generalized first-order oracle by setting to be some principal minor of the unit matrix (see, e.g., RDCM in [Nes12]). Further, similarly to [AS16a], we allow both first-order and coordinate-descent oracles to be used during the same optimization process.
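As a rough illustration of obliviousness, the sketch below runs a one-point iterative scheme in which every oracle parameter is fixed by a schedule before any oracle answer is seen. It assumes a scalar version of the generalized first-order oracle of the form $a \nabla f_i(w) + b w + c$; this form, the schedule $\alpha_k = 1/(k+1)$, the cyclic index order, and all function names are assumptions made for the example rather than the definitions used in the paper.

```python
def make_quadratic_grads(targets):
    """Gradients of the toy individuals f_i(w) = 0.5 * (w - t_i)^2."""
    return [lambda w, t=t: w - t for t in targets]

def first_order_oracle(grads, w, a, b, c, i):
    """Assumed scalar form of a generalized first-order oracle answer."""
    return a * grads[i](w) + b * w + c

def run_oblivious_cli(grads, w0, num_steps):
    """Incremental gradient steps whose parameters depend only on the step counter."""
    n = len(grads)
    w = w0
    for k in range(num_steps):
        alpha = 1.0 / (k + 1)   # step size fixed in advance (oblivious)
        i = k % n               # index schedule fixed in advance, one individual per step
        # With a = -alpha, b = 1, c = 0 the oracle answer is a plain gradient step.
        w = first_order_oracle(grads, w, -alpha, 1.0, 0.0, i)
    return w

if __name__ == "__main__":
    grads = make_quadratic_grads([1.0, 2.0, 6.0])
    print(run_oblivious_cli(grads, 10.0, 9))  # approaches the mean of the targets
```

A non-oblivious variant would, for instance, choose `alpha` by a line search on the observed gradients; such an algorithm falls outside the family considered in the rest of this section.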
3.2 No Strong Convexity Parameter, No Acceleration for Finite Sum Problems
Having described our analytic approach, we now turn to present some concrete applications. Below, we show that in the absence of a good estimation for the strong convexity parameter, the optimal iteration complexity of oblivious algorithms is . Our proof is based on the technique used in [AS16a, AS16b] (see [AS16a, Section 2.3] for a brief introduction of the technique).
Given , we define the following set of optimization problems (over with )
parametrized by (note that the individual functions are identical. We elaborate more on this below). It can be easily verified that the condition number of , which we denote by , is , and that the corresponding minimizers are with norm .
If we are allowed to use a different optimization algorithm for different in this setting, then we know that the optimal iteration complexity is of an order of . However, if we are allowed to use only one single algorithm, then we show that the optimal iteration complexity is of an order of . The proof goes as follows. First, note that in this setting, the oracles defined in (3.1) take the following form,
Now, since the oracle answers are linear in and the ’th iterate is a -fold composition of sums of the oracle answers, it follows that forms a -dimensional vector of univariate polynomials in of degree with (possibly random) coefficients (formally, see Lemma 3, Appendix A). Denoting the polynomial of the first coordinate of by , we see that for any ,
where the first inequality follows by Jensen's inequality and the second inequality by focusing on the first coordinate of and . Lastly, since the coefficients of do not depend on , we have by Lemma 4 in Appendix A, that there exists , such that for any it holds that
by which we derive the following.
Theorem 2. The iteration complexity of oblivious finite sum optimization algorithms equipped with a first-order and a coordinate-descent oracle whose side-information does not contain the strong convexity parameter is .
The part of the lower bound holds for any type of finite sum algorithm and is proved in [AS16a, Theorem 5]. The lower bound stated in Theorem 2 is tight up to logarithmic factors and is attained by, e.g., SAG [SLRB13]. Although relying on a finite sum with identical individual functions may seem somewhat disappointing, it suggests that some variance-reduction schemes can only give optimal dependence in terms of , and that obtaining optimal dependence in terms of the condition number needs to be done through other (acceleration) mechanisms (e.g., [LMH15]). Lastly, note that this bound holds for any number of iterations (regardless of the problem parameters).
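The polynomial structure exploited in the proof above can be checked on a toy case. The sketch below uses a hypothetical scalar problem $f_\eta(w) = \frac{\eta}{2} w^2 - w$ (not the problem class of the paper) and an oblivious gradient step $w_{k+1} = w_k - \alpha_k(\eta w_k - 1)$ with $w_0 = 0$. Because the step sizes do not depend on $\eta$, the $k$'th iterate is a polynomial in $\eta$ whose coefficients are tracked explicitly; in this toy setting its degree is at most $k-1$.

```python
def poly_step(coeffs, alpha):
    """Apply w -> w - alpha * (eta * w - 1) to a polynomial in eta.

    coeffs[j] is the coefficient of eta**j; an empty list is the zero polynomial.
    """
    shifted = [0.0] + coeffs          # coefficients of eta * w(eta)
    padded = coeffs + [0.0]           # w(eta), padded to the same length
    nxt = [c - alpha * s for c, s in zip(padded, shifted)]
    nxt[0] += alpha                   # the constant term contributed by "+ alpha * 1"
    return nxt

def iterate_polynomial(alphas):
    """Coefficients of the final iterate as a polynomial in eta, starting from w_0 = 0."""
    coeffs = []
    for alpha in alphas:
        coeffs = poly_step(coeffs, alpha)
    return coeffs

def evaluate(coeffs, eta):
    return sum(c * eta ** j for j, c in enumerate(coeffs))

def run_numeric(alphas, eta):
    """The same iteration carried out directly for one fixed value of eta."""
    w = 0.0
    for alpha in alphas:
        w = w - alpha * (eta * w - 1.0)
    return w
```

Since the minimizer $1/\eta$ of this toy problem is not a polynomial in $\eta$, a single fixed-degree polynomial cannot track it uniformly as $\eta$ becomes small, which is the intuition behind reducing the lower bound to a polynomial approximation question.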
3.3 Stationary Algorithms for Smooth and Convex Finite Sums are Sub-optimal
In the previous section, we showed that not knowing the strong convexity parameter reduces the optimal attainable iteration complexity. In this section, we use this result to show that whereas general optimization algorithms for smooth and convex finite sum problems obtain iteration complexity of , the optimal iteration complexity of stationary algorithms (whose expected update rules are fixed) is .
The proof (presented below) is based on a general restarting scheme (see Scheme 1) used in [AS16b]. The scheme allows one to apply algorithms which are designed for -smooth and convex problems on smooth and strongly convex finite sums by explicitly incorporating the strong convexity parameter. The key feature of this reduction is its ability to ‘preserve’ the exponent of the iteration complexity from an order of in the non-strongly convex case to an order of in the strongly convex case, where denotes some quantity which may depend on but not on , and is some positive constant.
|Scheme 1|Restarting Scheme|
|Given|An optimization algorithm for smooth convex functions with for any initialization point|
|Restart the step size schedule of|
|Run for iterations|
|Set to be the point returned by|
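A restarting scheme of this kind can be sketched as follows. The sketch assumes that the base algorithm has a sublinear rate of the form $L\|w_0 - w^*\|^2 / k^{\alpha}$ and that each epoch runs for $\lceil (4L/\mu)^{1/\alpha} \rceil$ iterations, which under strong convexity is enough to halve the suboptimality; the epoch length, the function names, and the use of plain gradient descent as the base algorithm are assumptions made for illustration.

```python
import math

def gradient_descent(grad, w0, num_iters, step_schedule):
    """Base algorithm for smooth convex functions; step_schedule maps a local counter to a step."""
    w = list(w0)
    for k in range(num_iters):
        step = step_schedule(k)
        g = grad(w)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w

def restarted(base_algorithm, grad, w0, L, mu, num_epochs, step_schedule, alpha=1.0):
    """Run the base algorithm in epochs, restarting its step size schedule each time."""
    iters_per_epoch = math.ceil((4.0 * L / mu) ** (1.0 / alpha))  # assumed epoch length
    w = list(w0)
    for _ in range(num_epochs):
        # Each call starts the local counter at 0 and is initialized at the
        # point returned by the previous epoch.
        w = base_algorithm(grad, w, iters_per_epoch, step_schedule)
    return w

if __name__ == "__main__":
    h = [1.0, 10.0]                                  # curvatures, so mu = 1 and L = 10
    grad = lambda w: [hj * wj for hj, wj in zip(h, w)]
    print(restarted(gradient_descent, grad, [1.0, 1.0], 10.0, 1.0, 3, lambda k: 0.1))
```

With a constant step size the restart changes nothing, since the schedule looks the same after every reset; this is the sense in which a stationary algorithm is left unchanged by the scheme.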
The proof goes as follows. Suppose is a stationary CLI optimization algorithm for -smooth and convex finite sum problems equipped with oracles (3.1). Also, assume that its convergence rate for is of an order of , for some . First, observe that in this case we must have . For otherwise, we get , implying that, simply by scaling , one can optimize to any level of accuracy using at most iterations, which contradicts [AS16a, Theorem 5]. Now, by [AS16b, Lemma 1], Scheme 1 produces a new algorithm whose iteration complexity for smooth and strongly convex finite sums with condition number is
Finally, stationary algorithms are invariant under this restarting scheme. Therefore, the new algorithm cannot depend on . Thus, by Theorem 2, it must hold that and that , proving the following.
Theorem 3. If the iteration complexity of a stationary optimization algorithm for smooth and convex finite sum problems equipped with a first-order and a coordinate-descent oracle is of the form of the l.h.s. of (16), then it must be at least .
We note that this lower bound is tight and is attained by, e.g., SDCA.
3.4 A Tight Lower Bound for Smooth and Convex Finite Sums
We now turn to derive a lower bound for finite sum problems with smooth and convex individual functions using the restarting scheme shown in the previous section. Note that here we allow any oblivious optimization algorithm, not just stationary. The technique shown in Section 3.2 of reducing an optimization problem into a polynomial approximation problem was used in [AS16a] to derive lower bounds for various settings. The smooth and convex case was proved only for , and a generalization for seems to reduce to a non-trivial approximation problem. Here, using Scheme 1, we are able to avoid this difficulty by reducing the non-strongly convex case to the strongly convex case, for which a lower bound for a general is known.
The proof follows the same lines as the proof of Theorem 3. Given an oblivious optimization algorithm for finite sums with smooth and convex individuals equipped with oracles (3.1), we again apply Scheme 1 to get an algorithm for the smooth and strongly convex case, whose iteration complexity is as in (16). Now, crucially, oblivious algorithms are invariant under Scheme 1 (that is, when applied on a given oblivious algorithm, Scheme 1 produces another oblivious algorithm). Therefore, using [AS16a, Theorem 2], we obtain the following.
Theorem 4. If the iteration complexity of an oblivious optimization algorithm for smooth and convex finite sum problems equipped with a first-order and a coordinate-descent oracle is of the form of the l.h.s. of (16), then it must be at least
This bound is tight and is obtained by, e.g., accelerated SDCA [SSZ16]. Optimality in terms of and can be obtained simply by applying Accelerated Gradient Descent [Nes04], or alternatively, by using an accelerated version of SVRG as presented in [Nit16]. More generally, one can apply acceleration schemes, e.g., [LMH15], to get an optimal dependence on .
We thank Raanan Tvizer and Maayan Maliach for several helpful and insightful discussions.
- [AS16a] Yossi Arjevani and Ohad Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems, pages 3540–3548, 2016.
- [AS16b] Yossi Arjevani and Ohad Shamir. On the iteration complexity of oblivious first-order optimization algorithms. In Proceedings of the 33rd International Conference on Machine Learning, pages 908–916, 2016.
- [AS16c] Yossi Arjevani and Ohad Shamir. Oracle complexity of second-order methods for finite-sum problems. arXiv preprint arXiv:1611.04982, 2016.
- [AWBR09] Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
- [AZY16] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. Technical report, arXiv preprint, 2016.
- [Ber97] Dimitri P Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.
- [BHG07] Doron Blatt, Alfred O Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.
- [BM16] Alberto Bietti and Julien Mairal. Stochastic optimization with variance reduction for infinite datasets with finite-sum structure. arXiv preprint arXiv:1610.00970, 2016.
- [CT12] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
- [DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
- [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
- [Lan15] Guanghui Lan. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.
- [LCB07] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. Large scale kernel machines, pages 301–320, 2007.
- [LMH15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.
- [Nes04] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.
- [Nes12] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
- [Nit16] Atsushi Nitanda. Accelerated stochastic gradient descent for minimizing finite sums. In Artificial Intelligence and Statistics, pages 195–203, 2016.
- [RR11] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dynamics in convex programming. Information Theory, IEEE Transactions on, 57(10):7036–7056, 2011.
- [SLRB13] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.
- [SS15] Shai Shalev-Shwartz. Sdca without duality. arXiv preprint arXiv:1502.06177, 2015.
- [SSZ13] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
- [SSZ16] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.
- [WS16] Blake E Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.
Appendix A Technical Lemmas
Let be the binary entropy function. Then,
Proof First, note that the first two derivatives of are
We show that the following function
is non-negative on (note that, since is continuous, it is bounded from below on and its minimum is attained on some local minimum in ). Let us locate all the extrema points of in . We have that,
Therefore, , and since
it follows that , which implies that is a local minimum of . We claim that there are exactly two more extrema points of which are in fact local maximum points. To this end, note that
where . Therefore, by Rolle’s Theorem, does not vanish in , and vanishes exactly once in and exactly once in . Since, is strictly negative in , it follows that the other two stationary points of are local maxima of . All in all, we have that if is a local minimum of , then , which implies that
concluding the proof.
Proof Let be an oblivious stochastic CLI, and suppose we apply on the class of problems (14) parametrized by , using oracles (15). We use mathematical induction to show that for any , the coordinates of the ’th iterate produced by such process can be expressed as distributions over , where denotes the set of all real polynomials with degree .
As the first iterate is allowed to depend only on and , the base case is trivial. For the inductive step, assume that any coordinate of can be expressed as a distribution over . Now, for any , the oracle answers of
|Generalized first-order oracle:|
|Steepest coordinate-descent oracle:|
form a distribution over , as the random quantities involved in the expressions ( and ) do not depend on (due to obliviousness) and the rest of the terms are either constant or linear in . Lastly, are computed by simply summing up all the oracle answers, and as such, form again distributions over .
Let be a real polynomial of degree , and let . Then, there exists such that for any it holds that
Proof Assume for the sake of contradiction that for any , there exists such that
and denote the corresponding coefficients by . We show by induction that for all . For we have that since for any there exists some such that
it holds, by continuity, that
Now, if for then
which contradicts our assumption, thus concluding the proof.