The Complexity of Making the Gradient Small in Stochastic Convex Optimization

02/13/2019 · Dylan Foster, et al. · MIT · Weizmann Institute of Science · Cornell University · Toyota Technological Institute at Chicago

We give nearly matching upper and lower bounds on the oracle complexity of finding ε-stationary points (‖∇F(x)‖ ≤ ε) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimization complexity and sample complexity, and reveals some surprising differences between the complexity of stochastic optimization versus learning. Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model. In other words, the separation in complexity between the two models can be exponential, and the folklore understanding that smoothness is required to find stationary points is only weakly true for statistical learning. Our upper bounds are based on extensions of a recent "recursive regularization" technique proposed by Allen-Zhu (2018). We show how to extend the technique to achieve near-optimal rates, and in particular show how to leverage the extra information available in the global oracle model. Our algorithm for the global model can be implemented efficiently through finite-sum methods, and suggests an interesting new computational-statistical tradeoff.


1 Introduction

Success in convex optimization is typically defined as finding a point whose value is close to the minimum possible value. Information-based complexity of optimization attempts to understand the minimal amount of effort required to reach a desired level of suboptimality under different oracle models for access to the function (Nemirovski and Yudin, 1983; Traub et al., 1988). This complexity—for both deterministic and stochastic convex optimization—is tightly understood across a wide variety of settings (Nemirovski and Yudin, 1983; Traub et al., 1988; Agarwal et al., 2009; Braun et al., 2017), and efficient algorithms that achieve optimal complexity are well known.

Recently, there has been a surge of interest in optimization for non-convex functions. In this case, finding a point with near-optimal function value is typically intractable under standard assumptions—both computationally and information-theoretically. For this reason, a standard task in non-convex optimization is to find an ε-stationary point, i.e., a point x where the gradient is small: ‖∇F(x)‖ ≤ ε.

In stochastic non-convex optimization, there has been a flurry of recent research on algorithms with provable guarantees for finding near-stationary points (Ghadimi and Lan, 2013, 2016; Reddi et al., 2016; Allen-Zhu, 2017; Lei et al., 2017; Jin et al., 2017; Zhou et al., 2018; Fang et al., 2018). However, the stochastic oracle complexity of finding near-stationary points is not yet well understood, so we do not know whether existing algorithms are optimal or how we might hope to improve upon them.

Recent work by Carmon et al. (2017a, b) establishes tight bounds on the deterministic first-order oracle complexity of finding near-stationary points of smooth functions, both convex and non-convex. For convex problems, they prove that accelerated gradient descent is optimal both for finding approximate minimizers and approximate stationary points, while for non-convex problems, gradient descent is optimal for finding approximate stationary points. The picture is simple and complete: the same deterministic first-order methods that are good at finding approximate minimizers are also good at finding approximate stationary points, even for non-convex functions.

However, when one turns to the stochastic oracle complexity of finding near-stationary points, the picture is far from clear. Even for stochastic convex optimization, the oracle complexity is not yet well understood. This paper takes a first step toward resolving the general case by providing nearly tight upper and lower bounds on the oracle complexity of finding near-stationary points in stochastic convex optimization, both for first-order methods and for global (i.e., statistical learning) methods. At first glance, this might seem trivial, since exact minimizers are equivalent to exact stationary points for convex functions. When it comes to finding approximate stationary points, however, the situation is considerably more complex, and the equivalence does not yield optimal quantitative rates. For example, while stochastic gradient descent (SGD) is (worst-case) optimal for stochastic convex optimization with a first-order oracle, it appears to be far from optimal for finding near-stationary points.

Table 1: Upper and lower bounds on the complexity of finding $x$ such that $\|\nabla F(x)\| \le \epsilon$ for convex problems with $H$-Lipschitz gradients, where $\sigma^2$ is a bound on the variance of the gradient estimates. $\tilde{O}$ and $\tilde{\Omega}$ suppress logarithmic factors.

Domain-bounded, $\|x_0 - x^\star\| \le D$:
  Deterministic first-order oracle:  Upper $\tilde{O}\big(\sqrt{HD/\epsilon}\big)$ (Nesterov, 2012);  Lower $\Omega\big(\sqrt{HD/\epsilon}\big)$ (Carmon et al., 2017b)
  Sample complexity:  Upper $\tilde{O}\big(\sigma^2/\epsilon^2\big)$ (cor:sample-complexity-bounded);  Lower $\Omega\big(\sigma^2/\epsilon^2\big)$ (thm:statistical-lower-bound)
  Stochastic first-order oracle:  Upper $\tilde{O}\big(\sqrt{HD/\epsilon} + \sigma^2/\epsilon^2\big)$ (cor:sgd3-bounded-domain);  Lower $\tilde{\Omega}\big(\sqrt{HD/\epsilon} + \sigma^2/\epsilon^2\big)$ (thm:first-order-lower-bound)

Range-bounded, $F(x_0) - F^\star \le \Delta$:
  Deterministic first-order oracle:  Upper $\tilde{O}\big(\sqrt{H\Delta}/\epsilon\big)$ (Carmon et al., 2017b);  Lower $\Omega\big(\sqrt{H\Delta}/\epsilon\big)$ (Carmon et al., 2017b)
  Sample complexity:  Upper $\tilde{O}\big(\sigma^2/\epsilon^2\big)$ (cor:sample-complexity-bounded);  Lower $\Omega\big(\sigma^2/\epsilon^2\big)$ (thm:statistical-lower-bound)
  Stochastic first-order oracle:  Upper $\tilde{O}\big(\sqrt{H\Delta}/\epsilon + \sigma^2/\epsilon^2\big)$ (cor:sgd3-bounded-domain);  Lower $\tilde{\Omega}\big(\sqrt{H\Delta}/\epsilon + \sigma^2/\epsilon^2\big)$ (thm:first-order-lower-bound)

1.1 Contributions

We present a nearly tight analysis of the local stochastic oracle complexity and global stochastic oracle complexity (“sample complexity”) of finding approximate stationary points in stochastic convex optimization. Briefly, the highlights are as follows:

  • We give upper and lower bounds on the local and global stochastic oracle complexity that match up to log factors. In particular, we show that the local stochastic complexity of finding stationary points is (up to log factors) characterized as the sum of the deterministic oracle complexity and the sample complexity.

  • As a consequence of this two-pronged approach, we show that the gap between the local stochastic complexity and the sample complexity of finding near-stationary points can be exponentially large: the former depends polynomially on the smoothness parameter, while the latter depends on it only logarithmically.

  • We obtain the above results through new algorithmic improvements. We show that the recursive regularization technique introduced by Allen-Zhu (2018) for local stochastic optimization can be combined with empirical risk minimization to obtain logarithmic dependence on smoothness in the global model, and that the resulting algorithms can be implemented efficiently.

Complexity results are summarized in tab:results. Here we discuss the conceptual contributions in more detail.

Decomposition of stochastic first-order complexity.

For stochastic optimization of convex functions, there is a simple and powerful connection between three oracle complexities: Deterministic, local stochastic, and global stochastic. For many well-known problem classes, the stochastic first-order complexity is equal to the sum (equivalently, maximum) of the deterministic first-order complexity and the sample complexity. This decomposition of the local stochastic complexity into an “optimization term” plus a “statistical term” inspires optimization methods, guides analysis, and facilitates comparison of different algorithms. It indicates that “one pass” stochastic approximation algorithms like SGD are optimal for stochastic optimization in certain parameter regimes, so that we do not have to resort to sample average approximation or methods that require multiple passes over data.
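In symbols, using the minimax complexity notation $m_{\mathcal{O}}(\mathcal{F}, \epsilon)$ defined in (9) below and the oracle notation introduced in the next section, this heuristic decomposition can be written as the following informal identity (log factors and constants suppressed; see tab:results for the precise statements):

    $m_{\mathcal{O}_{\sigma}}(\mathcal{F}, \epsilon) \;\asymp\; \underbrace{m_{\mathcal{O}_{\mathrm{det}}}(\mathcal{F}, \epsilon)}_{\text{optimization term}} \;+\; \underbrace{m_{\mathcal{O}_{\mathrm{glob}}}(\mathcal{F}, \epsilon)}_{\text{statistical term}}.$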

We establish that the same decomposition holds for the task of finding approximate stationary points. Such a characterization should not be taken for granted, and it is not clear a priori that it should hold for finding stationary points. Establishing the result requires both developing new algorithms with near-optimal sample complexity in the global model, and improving previous local stochastic methods (Allen-Zhu, 2018) to match the optimal deterministic complexity.

Gap between sample complexity and stochastic first-order complexity.

For non-smooth convex objectives, finding an approximate stationary point can require finding an exact minimizer of the function (consider the absolute value function); indeed, ensuring an approximate stationary point is impossible for non-smooth instances, even with an unbounded number of first-order oracle accesses. Therefore, as one would expect, the deterministic and stochastic first-order oracle complexities of finding near-stationary points scale polynomially with the smoothness constant, even in low dimensions. Surprisingly, we show that the sample complexity depends at most logarithmically on the smoothness. In fact, in one dimension the dependence on smoothness can be removed entirely.

Improved methods.

Our improved sample complexity results for the global stochastic oracle/statistical learning model are based on a new algorithm which uses the recursive regularization (or, "SGD3") approach introduced by Allen-Zhu (2018). The method iteratively solves a sequence of subproblems via regularized empirical risk minimization (RERM). Solving the subproblems through RERM allows the method to exploit global access to the stochastic samples. Since the method enjoys only logarithmic dependence on smoothness (as well as on the initial suboptimality or distance to the optimum), it provides a better alternative to any stochastic first-order method whenever the smoothness is large relative to the variance of the gradient estimates. Since RERM is a finite-sum optimization problem, standard finite-sum optimization methods can be used to implement the method efficiently; the result is that we can beat the sample complexity of stochastic first-order methods with only modest computational overhead.

For the local stochastic model, we improve the SGD3 method of Allen-Zhu (2018) so that the "optimization" term matches the optimal deterministic oracle complexity. This leads to a quadratic improvement in terms of the initial distance to the optimum (the "radius" of the problem). We also extend the analysis to the setting where the initial suboptimality is bounded but the radius is not, a common setting in the analysis of non-convex optimization algorithms and one in which recursive regularization had not previously been analyzed.

2 Setup

We consider the problem of finding an ε-stationary point in the stochastic convex optimization setting. That is, for a convex function $F$, our goal is to find a point $x$ such that

(1)  $\|\nabla F(x)\| \le \epsilon$,

given access to $F$ only through an oracle. (Here, and for the rest of the paper, $\|\cdot\|$ denotes the Euclidean norm.) Formally, the problem is specified by a class of functions to which $F$ belongs and by the type of oracle through which we access $F$. We outline these now.

Function classes.

Recall that $F$ is said to be $H$-smooth if its gradient is $H$-Lipschitz, i.e.,

(2)  $\|\nabla F(x) - \nabla F(y)\| \le H\|x - y\|$ for all $x, y$,

and is said to be $\lambda$-strongly convex if

(3)  $F(y) \ge F(x) + \langle \nabla F(x), y - x \rangle + \frac{\lambda}{2}\|y - x\|^2$ for all $x, y$.

We focus on two classes of objectives, both of which are defined relative to an arbitrary initial point provided to the optimization algorithm.

  1. Domain-bounded functions.

    (4)  $\mathcal{F}_D^d(H, \lambda) = \big\{ F : \mathbb{R}^d \to \mathbb{R} \;\big|\; F \text{ is convex, } H\text{-smooth, and } \lambda\text{-strongly convex, with } \|x_0 - x^\star\| \le D \big\}$
  2. Range-bounded functions.

    (5)  $\mathcal{F}_\Delta^d(H, \lambda) = \big\{ F : \mathbb{R}^d \to \mathbb{R} \;\big|\; F \text{ is convex, } H\text{-smooth, and } \lambda\text{-strongly convex, with } F(x_0) - F^\star \le \Delta \big\}$

We emphasize that while the classes are defined in terms of a strong convexity parameter, our main complexity results concern the non-strongly convex case where $\lambda = 0$. The strongly convex classes are used for intermediate results. We also note that our main results hold in arbitrary dimension, and so we drop the dimension superscript $d$ except when it is pertinent to the discussion.

Oracle classes.

An oracle accepts an argument $x \in \mathbb{R}^d$ and provides (possibly noisy/stochastic) information about the objective $F$ around the point $x$. The oracle's output belongs to an information space $\mathcal{I}$. We consider three distinct types of oracles:

  1. Deterministic first-order oracle. Denoted $\mathcal{O}_{\mathrm{det}}$, with $\mathcal{I} = \mathbb{R} \times \mathbb{R}^d$. When queried at a point $x$, the oracle returns

    (6)  $\mathcal{O}_{\mathrm{det}}(x) = \big(F(x), \nabla F(x)\big).$
  2. Stochastic first-order oracle. Denoted $\mathcal{O}_{\sigma}$, with $\mathcal{I} = \mathbb{R} \times \mathbb{R}^d$. The oracle is specified by a function $f(x; z)$ and a distribution $\mathcal{D}$ over $z$ with the property that $\mathbb{E}_{z \sim \mathcal{D}}[f(x; z)] = F(x)$ and $\mathbb{E}_{z \sim \mathcal{D}}\|\nabla f(x; z) - \nabla F(x)\|^2 \le \sigma^2$. When queried at a point $x$, the oracle draws an independent $z \sim \mathcal{D}$ and returns

    (7)  $\mathcal{O}_{\sigma}(x) = \big(f(x; z), \nabla f(x; z)\big).$
  3. Stochastic global oracle. Denoted $\mathcal{O}_{\mathrm{glob}}$, with $\mathcal{I}$ a space of functions on $\mathbb{R}^d$. The oracle is specified by a function $f(x; z)$ and a distribution $\mathcal{D}$ over $z$ with the same properties as above. When queried, the oracle draws an independent $z \sim \mathcal{D}$ and returns the complete specification of the component function, specifically,

    (8)  $\mathcal{O}_{\mathrm{glob}}(x) = f(\,\cdot\,; z).$

    For consistency with the other oracles, we say that $\mathcal{O}_{\mathrm{glob}}$ accepts an argument $x$, even though this argument is ignored. The global oracle captures the statistical learning problem, in which $f(x; z)$ is the loss of a model $x$ evaluated on an instance $z$, and this component function is fully known to the optimizer. Consequently, we use the terms "global stochastic complexity" and "sample complexity" interchangeably.

For the stochastic oracles, while $F$ itself is required to have properties such as convexity or smoothness, the components $f(\,\cdot\,; z)$ need not have these properties unless stated otherwise.
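To make the three access models concrete, the following is a small Python sketch of the oracle interfaces on a least-squares instance $f(x; z) = \frac{1}{2}(\langle a, x \rangle - b)^2$ with $z = (a, b)$; the class names and the toy distribution are our own illustration, not part of the formal setup.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x, z):                      # component loss f(x; z) for z = (a, b)
        a, b = z
        return 0.5 * (a @ x - b) ** 2

    def grad_f(x, z):
        a, b = z
        return (a @ x - b) * a

    def sample_z(d=5):                # z ~ D: a noisy linear-model example
        a = rng.normal(size=d)
        return a, a @ np.ones(d) + rng.normal()

    class DeterministicFirstOrderOracle:
        # O_det(x) = (F(x), grad F(x)); for this toy, F(x) = 0.5*(||x - 1||^2 + 1).
        def query(self, x):
            w = x - np.ones(x.size)
            return 0.5 * (w @ w + 1.0), w

    class StochasticFirstOrderOracle:
        # O_sigma(x) = (f(x; z), grad f(x; z)) for a fresh z ~ D on every query.
        def query(self, x):
            z = sample_z(x.size)
            return f(x, z), grad_f(x, z)

    class GlobalOracle:
        # O_glob(x) = the entire component function f(. ; z); x is ignored.
        def query(self, x=None):
            z = sample_z()
            return (lambda x: f(x, z)), (lambda x: grad_f(x, z))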

Minimax oracle complexity.

Given a function class $\mathcal{F}$ and an oracle $\mathcal{O}$ with information space $\mathcal{I}$, we define the minimax oracle complexity of finding an $\epsilon$-stationary point as

(9)  $m_{\mathcal{O}}(\mathcal{F}, \epsilon) = \inf_{\{A_t\}} \sup_{F \in \mathcal{F}} \min\big\{ T \in \mathbb{N} : \mathbb{E}\,\|\nabla F(x_T)\| \le \epsilon \big\},$

where the iterate $x_t$ is defined recursively as $x_t = A_t\big(\mathcal{O}(x_1), \ldots, \mathcal{O}(x_{t-1})\big)$ for mappings $A_t : \mathcal{I}^{t-1} \to \mathbb{R}^d$, and the expectation is over the stochasticity of the oracle $\mathcal{O}$. (See sec:first-order for a discussion of randomized algorithms.)

Recap: Deterministic first-order oracle complexity.

To position our new results on stochastic optimization, we must first recall what is known about the deterministic first-order oracle complexity of finding near-stationary points. This complexity is tightly understood, with

$m_{\mathcal{O}_{\mathrm{det}}}\big(\mathcal{F}_D(H, 0), \epsilon\big) = \Theta\big(\sqrt{HD/\epsilon}\big)$ and $m_{\mathcal{O}_{\mathrm{det}}}\big(\mathcal{F}_\Delta(H, 0), \epsilon\big) = \Theta\big(\sqrt{H\Delta}/\epsilon\big)$

up to logarithmic factors (Nesterov, 2012; Carmon et al., 2017b). The algorithm that achieves these rates is accelerated gradient descent (AGD).

3 Stochastic First-Order Complexity of Finding Stationary Points

Interestingly, the usual variants of stochastic gradient descent do not appear to be optimal in the stochastic model. A first concern is that they do not yield the correct dependence on the desired stationarity $\epsilon$.

As an illustrative example, let $F$ be convex and $H$-smooth with $\|x_0 - x^\star\| \le D$, and let any stochastic first-order oracle with variance $\sigma^2$ be given. We adopt the naive approach of bounding stationarity by function value suboptimality. The standard analysis of stochastic gradient descent (e.g., Dekel et al. (2012)) implies that after $T$ iterations, $\mathbb{E}[F(\bar{x}_T)] - F^\star \lesssim \frac{HD^2}{T} + \frac{\sigma D}{\sqrt{T}}$, and thus, using the smoothness inequality $\|\nabla F(x)\| \le \sqrt{2H(F(x) - F^\star)}$,

$\mathbb{E}\,\|\nabla F(\bar{x}_T)\| \lesssim \Big(\frac{H^2 \sigma^2 D^2}{T}\Big)^{1/4}$ once the statistical term dominates,

so that roughly $T \propto \epsilon^{-4}$ queries are needed. This $\epsilon^{-4}$ dependence on $\epsilon$ is considerably worse than the $\epsilon^{-2}$ dependence enjoyed for function suboptimality.

In recent work, Allen-Zhu (2018) proposed a new recursive regularization approach and used it in an algorithm called SGD3 that obtains the correct $\epsilon^{-2}$ dependence. (Allen-Zhu (2018) also shows that simple variants of SGD improve the poor $\epsilon$ dependence somewhat, but they fall short of the $\epsilon^{-2}$ statistical rate one should hope for.) Starting from any $x_0$ and target accuracy $\epsilon$, SGD3 iteratively augments the objective with increasingly strong regularizers, "zooming in" on an approximate stationary point. Specifically, in the first iteration, SGD is used to find $x_1$, an approximate minimizer of $F^{(1)} := F$. The objective is then augmented with a strongly convex regularizer, so that $F^{(2)}(x) = F^{(1)}(x) + \lambda_1\|x - x_1\|^2$. In the second round, SGD is initialized at $x_1$ and used to find $x_2$, an approximate minimizer of $F^{(2)}$. This process is repeated, with $F^{(k+1)}(x) = F^{(k)}(x) + \lambda_k\|x - x_k\|^2$ and the regularization weight $\lambda_k$ doubling in each round. Allen-Zhu (2018) shows that SGD3 finds an $\epsilon$-stationary point using at most

(10)  $\tilde{O}\Big(\frac{HD}{\epsilon} + \frac{\sigma^2}{\epsilon^2}\Big)$

local stochastic oracle queries. This oracle complexity has a familiar structure: it resembles the sum of an "optimization term" ($HD/\epsilon$) and a "statistical term" ($\sigma^2/\epsilon^2$). While we show that the statistical term is tight up to logarithmic factors (thm:first-order-lower-bound), the optimization term does not match the $\sqrt{HD/\epsilon}$ lower bound for the deterministic setting (Carmon et al., 2017b).

0:  A function $F$, an oracle $\mathcal{O}$, an allotted number of oracle accesses $T$, an initial point $x_0$, a strong convexity parameter $\lambda$, and an optimization subroutine $\mathcal{A}$.
  $K \leftarrow \lceil \log_2(H/\lambda) \rceil$,  $F^{(1)} \leftarrow F$,  $\lambda_1 \leftarrow \lambda$.
  for $k = 1, \ldots, K$ do
      $x_k$ is output of $\mathcal{A}$ used to optimize $F^{(k)}$, initialized at $x_{k-1}$, using $T/K$ oracle accesses
      $F^{(k+1)}(x) \leftarrow F^{(k)}(x) + \lambda_k\|x - x_k\|^2$,  $\lambda_{k+1} \leftarrow 2\lambda_k$
  end for
  return  $x_K$
Algorithm 1 Recursive Regularization Meta-Algorithm

Our first result closes this gap. The key idea is to view SGD3 as a template algorithm, in which the SGD inner loop used by Allen-Zhu (2018) can be swapped out for an arbitrary optimization method $\mathcal{A}$. This template, alg:meta-algorithm, forms the basis for all the new methods in this paper. (The idea of replacing the sub-algorithm in SGD3 was also used by Davis and Drusvyatskiy (2018), who showed that recursive regularization with a projected subgradient method can be used to find near-stationary points of the Moreau envelope of any Lipschitz function.)
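To make the template concrete, the following is a minimal runnable Python sketch of alg:meta-algorithm. It is our own illustration: it assumes a Euclidean setting, a generic stochastic gradient oracle grad_F, and an interchangeable inner solver (plain SGD below, playing the role SGD takes in SGD3); the constants in the regularization and step-size schedules are chosen for simplicity and differ from the tuned values in the formal analysis.

    import numpy as np

    def recursive_regularization(grad_F, x0, lam, H, subroutine, budget):
        # Minimal sketch of alg:meta-algorithm. grad_F(x) returns a stochastic
        # gradient of F; `subroutine` is the interchangeable inner solver A.
        K = max(1, int(np.ceil(np.log2(H / lam))))  # rounds; regularization doubles
        centers, weights = [], []                   # added terms lam_k * ||x - x_k||^2

        def grad_aug(x):                            # stochastic gradient of current F^(k)
            g = grad_F(x)
            for c, w in zip(centers, weights):
                g = g + 2.0 * w * (x - c)           # gradient of w * ||x - c||^2
            return g

        x = np.asarray(x0, dtype=float)
        for k in range(K):
            lam_k = lam * 2.0 ** k                  # strong convexity of round k
            x = subroutine(grad_aug, x, lam_k, budget // K)
            centers.append(x.copy())
            weights.append(lam_k)
        return x

    def sgd(grad, x, mu, n, L=1.0):
        # Plain SGD inner loop (the choice made in SGD3); step size 1/(L + mu*t).
        x = x.copy()
        for t in range(1, n + 1):
            x = x - grad(x) / (L + mu * t)
        return x

    # Toy usage: F(x) = E[0.5 * ||x - z||^2] with z ~ N(1, I), so grad F(x) = x - 1.
    rng = np.random.default_rng(0)
    grad_F = lambda x: x - (np.ones(5) + rng.normal(size=5))
    x_hat = recursive_regularization(grad_F, np.zeros(5), lam=0.01, H=1.0,
                                     subroutine=sgd, budget=20000)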

To obtain optimal complexity in the local stochastic oracle model, we use a variant of the accelerated stochastic approximation method ("AC-SA") due to Ghadimi and Lan (2012) as the subroutine. Pseudocode for AC-SA is provided in alg:ac-sa. We use a variant, denoted AC-SA² (alg:gltwo), which is equivalent to AC-SA except that the step size schedule is restarted halfway through. This leads to slightly different dependence on the smoothness and domain size parameters, which is important for controlling the final rate when the method is invoked within alg:meta-algorithm.

Toward proving the tight upper bound in tab:results, we first show that alg:meta-algorithm with AC-SA² as its subroutine guarantees fast convergence for strongly convex, domain-bounded objectives.

Theorem 1.

For any $\lambda$-strongly convex $F \in \mathcal{F}_D(H, \lambda)$ and any stochastic first-order oracle $\mathcal{O}_\sigma$, alg:meta-algorithm using AC-SA² as its subroutine finds a point $x$ with $\mathbb{E}\,\|\nabla F(x)\| \le \epsilon$ using

$\tilde{O}\Big(\sqrt{H/\lambda} + \frac{\sigma^2}{\epsilon^2}\Big)$

total stochastic first-order oracle accesses.

The analysis of this algorithm is detailed in app:proofs-first-order and carefully follows the original analysis of SGD3 (Allen-Zhu, 2018). The essential component of the analysis is lem:zeyuan_auxillary_lemma, which bounds $\|\nabla F(x_K)\|$ in terms of the optimization error of each invocation of AC-SA² on the increasingly strongly convex subproblems $F^{(k)}$.

0:  A function $F$, a stochastic first-order oracle $\mathcal{O}_\sigma$, an initial point $x_0$, and an allotted number of oracle accesses $T$
  $x_0^{ag} \leftarrow x_0$
  for $t = 1, \ldots, T$ do
     $\alpha_t \leftarrow 2/(t+1)$
     $\gamma_t \leftarrow 4H/(t(t+1))$
     $x_t^{md} \leftarrow \frac{(1-\alpha_t)(\lambda+\gamma_t)}{\gamma_t+(1-\alpha_t^2)\lambda}\, x_{t-1}^{ag} + \frac{\alpha_t[(1-\alpha_t)\lambda+\gamma_t]}{\gamma_t+(1-\alpha_t^2)\lambda}\, x_{t-1}$
     $\big(f(x_t^{md}; z_t), \nabla f(x_t^{md}; z_t)\big) \leftarrow \mathcal{O}_\sigma(x_t^{md})$
     $x_t \leftarrow \operatorname{argmin}_x \big\{ \alpha_t\big[\langle \nabla f(x_t^{md}; z_t), x \rangle + \tfrac{\lambda}{2}\|x_t^{md} - x\|^2\big] + \big[(1-\alpha_t)\tfrac{\lambda}{2} + \tfrac{\gamma_t}{2}\big]\|x_{t-1} - x\|^2 \big\}$
     $x_t^{ag} \leftarrow \alpha_t x_t + (1-\alpha_t)\, x_{t-1}^{ag}$
  end for
  return  $x_T^{ag}$
Algorithm 2 AC-SA

Our final result for non-strongly convex objectives uses alg:meta-algorithm with AC-SA² on the regularized objective $\hat{F}(x) = F(x) + \frac{\lambda}{2}\|x - x_0\|^2$. The performance guarantee is as follows, and concerns both domain-bounded and range-bounded functions.

Corollary 1.

For any $F \in \mathcal{F}_D(H, 0)$ and any stochastic first-order oracle $\mathcal{O}_\sigma$, alg:meta-algorithm with AC-SA² as its subroutine, applied to $\hat{F}(x) = F(x) + \frac{\lambda}{2}\|x - x_0\|^2$ for $\lambda \propto \epsilon/D$, yields a point $x$ such that $\mathbb{E}\,\|\nabla F(x)\| \le \epsilon$ using

$\tilde{O}\Big(\sqrt{\frac{HD}{\epsilon}} + \frac{\sigma^2}{\epsilon^2}\Big)$

total stochastic first-order oracle accesses.
For any $F \in \mathcal{F}_\Delta(H, 0)$ and any $\mathcal{O}_\sigma$, the same algorithm with $\lambda \propto \epsilon^2/\Delta$ yields a point $x$ with $\mathbb{E}\,\|\nabla F(x)\| \le \epsilon$ using

$\tilde{O}\Big(\frac{\sqrt{H\Delta}}{\epsilon} + \frac{\sigma^2}{\epsilon^2}\Big)$

total stochastic first-order oracle accesses.

This follows easily from thm:acc-sgd3-strongly-convex and is proven in app:proofs-first-order. Intuitively, when $\lambda$ is chosen appropriately, the gradient of the regularized objective $\hat{F}$ does not significantly deviate from the gradient of $F$, but the number of iterations required to find an $\epsilon$-stationary point of $\hat{F}$ is still controlled.
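A one-line version of this perturbation intuition for the domain-bounded case (our informal sketch; constants are illustrative):

    $\|\nabla \hat{F}(x) - \nabla F(x)\| = \lambda\|x - x_0\|$, so for iterates with $\|x - x_0\| \lesssim D$ and $\lambda \propto \epsilon/D$,
    $\|\nabla F(x)\| \;\le\; \|\nabla \hat{F}(x)\| + \lambda\|x - x_0\| \;\lesssim\; \|\nabla \hat{F}(x)\| + \epsilon,$

so any $O(\epsilon)$-stationary point of the $\lambda$-strongly convex surrogate $\hat{F}$ is an $O(\epsilon)$-stationary point of $F$ itself.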

We now provide nearly tight lower bounds on the stochastic first-order oracle complexity. A notable feature of the lower bound is that it shows that some of the logarithmic terms in the upper bound, which are not present in the optimal oracle complexity for function value suboptimality, are necessary.

Theorem 2.

For any $H, \Delta, \sigma > 0$ and any sufficiently small $\epsilon$, the stochastic first-order oracle complexity for range-bounded functions is lower bounded by the deterministic term $\Omega\big(\sqrt{H\Delta}/\epsilon\big)$ plus a statistical term of order $\sigma^2/\epsilon^2$ carrying an additional logarithmic factor.
For any $H, D, \sigma > 0$ and any sufficiently small $\epsilon$, the stochastic first-order complexity for domain-bounded functions is lower bounded by $\Omega\big(\sqrt{HD/\epsilon}\big)$ plus a statistical term of the same order.

The proof, detailed in app:proofs-lower-bounds, combines the existing lower bound on the deterministic first-order oracle complexity (Carmon et al., 2017b) with a new lower bound for the statistical term. The approach is to show that any algorithm for finding near-stationary points can be used to solve noisy binary search (NBS), and then apply a known lower bound for NBS (Feige et al., 1994; Karp and Kleinberg, 2007). It is possible to extend the lower bound to randomized algorithms; see discussion in Carmon et al. (2017b).

0:  A function $F$, a stochastic first-order oracle $\mathcal{O}_\sigma$, an initial point $x_0$, and an allotted number of oracle accesses $T$
   $x_1 \leftarrow$ AC-SA$(F, \mathcal{O}_\sigma, x_0, T/2)$
   $x_2 \leftarrow$ AC-SA$(F, \mathcal{O}_\sigma, x_1, T/2)$
  return  $x_2$
Algorithm 3 AC-SA²
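Below is a small Python sketch of AC-SA and the restarted variant, following the generic updates of Ghadimi and Lan (2012) specialized to the unconstrained Euclidean case. The function names and toy oracle are our own, and the parameter schedule ($\alpha_t = 2/(t+1)$, $\gamma_t = 4H/(t(t+1))$) is one standard choice rather than a verbatim transcription of the paper's tuning.

    import numpy as np

    def ac_sa(grad_f, x0, H, lam, T):
        # Accelerated stochastic approximation (AC-SA) for an H-smooth,
        # lam-strongly convex objective; grad_f(x) returns a stochastic gradient.
        x = x0.copy()          # x_t
        x_ag = x0.copy()       # aggregated ("ag") iterate that is returned
        for t in range(1, T + 1):
            alpha = 2.0 / (t + 1)
            gamma = 4.0 * H / (t * (t + 1))
            denom = gamma + (1.0 - alpha ** 2) * lam
            x_md = ((1 - alpha) * (lam + gamma) / denom) * x_ag \
                 + (alpha * ((1 - alpha) * lam + gamma) / denom) * x
            g = grad_f(x_md)
            # Closed-form argmin of alpha*(<g, x> + lam/2 ||x_md - x||^2)
            #                     + ((1-alpha)*lam/2 + gamma/2)*||x_{t-1} - x||^2
            x = (alpha * lam * x_md + ((1 - alpha) * lam + gamma) * x
                 - alpha * g) / (lam + gamma)
            x_ag = alpha * x + (1 - alpha) * x_ag
        return x_ag

    def ac_sa_squared(grad_f, x0, H, lam, T):
        # AC-SA^2: run AC-SA for T/2 steps, then restart the schedule from its output.
        x_mid = ac_sa(grad_f, x0, H, lam, T // 2)
        return ac_sa(grad_f, x_mid, H, lam, T - T // 2)

    # Toy check on F(x) = E[0.5 * ||x - z||^2], z ~ N(1, 0.01*I): the unique
    # stationary point is x = 1, and H = lam = 1 here.
    rng = np.random.default_rng(1)
    grad_f = lambda x: x - (np.ones(3) + 0.1 * rng.normal(size=3))
    print(ac_sa_squared(grad_f, np.zeros(3), H=1.0, lam=1.0, T=2000))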

4 Sample Complexity of Finding Stationary Points

Having tightly bounded the stochastic first-order oracle complexity of finding approximate stationary points, we now turn to sample complexity. If the heuristic reasoning that stochastic first-order complexity should decompose into sample complexity plus deterministic first-order complexity ($m_{\mathcal{O}_\sigma} \asymp m_{\mathcal{O}_{\mathrm{det}}} + m_{\mathcal{O}_{\mathrm{glob}}}$) is correct, then one would expect the sample complexity to be $\tilde{\Theta}(\sigma^2/\epsilon^2)$ for both domain-bounded and range-bounded functions.

A curious feature of this putative sample complexity is that it does not depend on the smoothness of the function. This is somewhat surprising: if the function is non-smooth in the vicinity of its minimizer, there may be only a single $\epsilon$-stationary point, and an algorithm would need to return exactly that point using only a finite sample. We show that the sample complexity is in fact almost independent of the smoothness constant, with only a mild logarithmic dependence. We also provide nearly tight lower bounds.

For the global setting, a natural algorithm to try is regularized empirical risk minimization (RERM), which returns $\hat{x} = \operatorname{argmin}_x \frac{1}{n}\sum_{i=1}^n f(x; z_i) + \frac{\lambda}{2}\|x - x_0\|^2$. (While it is also tempting to try constrained ERM, this does not succeed even for function value suboptimality (Shalev-Shwartz et al., 2009).) For any domain-bounded function $F \in \mathcal{F}_D(H, 0)$, a standard analysis of ERM based on stability (Shalev-Shwartz et al., 2009) shows that $\mathbb{E}\,\|\nabla F(\hat{x})\| \lesssim \lambda D + \sqrt{H\sigma^2/(\lambda n)}$. Choosing $\lambda \propto \epsilon/D$ and $n \propto HD\sigma^2/\epsilon^3$ yields an $\epsilon$-stationary point. This upper bound, however, has two shortcomings. First, it scales with $HD\sigma^2/\epsilon^3$ rather than the $\sigma^2/\epsilon^2$ we hoped for and, second, it does not approach a single sample as $\sigma \to 0$, which one should expect in the noise-free case. The stochastic first-order algorithm from the previous section has better sample complexity, but the number of samples still does not approach one when $\sigma \to 0$.

We fix both issues by combining regularized ERM with the recursive regularization approach, giving an upper bound that nearly matches the $\Omega(\sigma^2/\epsilon^2)$ sample complexity lower bound. The key tool is a sharp analysis of regularized ERM, stated in the appendix as thm:erm_variance, that obtains the correct dependence on the variance $\sigma^2$.

As in the previous section, we first prove an intermediate result for the strongly convex case. Unlike sec:first-order, where $F$ was required to be convex but the components $f(\,\cdot\,; z)$ were not, we must assume here that $f(\,\cdot\,; z)$ is convex for all $z$. (We are not aware of any analysis of ERM for strongly convex losses that does not make such an assumption; it would be interesting to know whether it can be removed.)

Theorem 3.

For any $\lambda$-strongly convex $F \in \mathcal{F}_D(H, \lambda)$ and any global stochastic oracle with the restriction that $f(\,\cdot\,; z)$ is convex for all $z$, alg:meta-algorithm with regularized ERM as its subroutine finds $x$ with $\mathbb{E}\,\|\nabla F(x)\| \le \epsilon$ using at most

$\tilde{O}\Big(\frac{\sigma^2}{\epsilon^2}\Big)$, with logarithmic factors depending on $H/\lambda$,

total samples.

The proof is given in app:proofs-sample-complexity. As before, we handle the non-strongly convex case by applying the algorithm to the regularized objective $\hat{F}(x) = F(x) + \frac{\lambda}{2}\|x - x_0\|^2$.

Corollary 2.

For any $F \in \mathcal{F}_D(H, 0)$ and any global stochastic oracle with the restriction that $f(\,\cdot\,; z)$ is convex for all $z$, alg:meta-algorithm with regularized ERM as its subroutine, when applied to $\hat{F}$ with $\lambda \propto \epsilon/D$, finds a point $x$ with $\mathbb{E}\,\|\nabla F(x)\| \le \epsilon$ using at most

$\tilde{O}\Big(\frac{\sigma^2}{\epsilon^2}\Big)$, with logarithmic factors depending on $HD/\epsilon$,

total samples.
For any $F \in \mathcal{F}_\Delta(H, 0)$ and any global stochastic oracle with the same restriction, the same approach with $\lambda \propto \epsilon^2/\Delta$ finds an $\epsilon$-stationary point using at most

$\tilde{O}\Big(\frac{\sigma^2}{\epsilon^2}\Big)$, with logarithmic factors depending on $H\Delta/\epsilon^2$,

total samples.

This follows immediately from thm:sample-complexity-strong-convexity by choosing $\lambda$ small enough that the added regularization perturbs the gradient by at most $O(\epsilon)$. Details are deferred to app:proofs-sample-complexity.

With this new sample complexity upper bound, we proceed to provide an almost-tight lower bound.

Theorem 4.

For any $\sigma > 0$ and any sufficiently small $\epsilon > 0$, the sample complexity to find an $\epsilon$-stationary point is lower bounded as $\Omega(\sigma^2/\epsilon^2)$. (This lower bound applies both to deterministic and randomized optimization algorithms.)

This lower bound is similar to constructions used to prove lower bounds for finding an approximate minimizer (Nemirovski and Yudin, 1983; Nesterov, 2004; Woodworth and Srebro, 2016). However, our lower bound applies to functions with simultaneously bounded domain and range, so extra care must be taken to ensure that these properties hold. The construction also ensures that $f(\,\cdot\,; z)$ is convex for all $z$. The proof is located in app:proofs-lower-bounds.

Discussion: Efficient implementation.

cor:sample-complexity-bounded provides a bound on the number of samples needed to find a near-stationary point. A convenient property of the method is that the ERM objective solved in each iteration is convex, smooth, strongly convex, and has finite-sum structure with one component per sample. These subproblems can therefore be solved with a number of gradient computations nearly linear in the number of samples, using a first-order finite-sum optimization algorithm such as Katyusha (Allen-Zhu, 2017). This implies that the method can be implemented with total gradient complexity nearly linear in the sample complexity over all iterations, both in the bounded-domain and the bounded-range case. Thus, the algorithm is not just sample-efficient, but also computationally efficient, albeit slightly less so than the algorithm from sec:first-order.
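As a concrete (and deliberately simplified) illustration of this finite-sum structure, the following Python sketch builds one round's regularized ERM subproblem and solves it with full-batch gradient descent; in the theory, this inner solve would instead be handled by an accelerated finite-sum method such as Katyusha. All names are our own.

    import numpy as np

    def rerm_subproblem_solver(fs_grads, center, lam, x0, steps=500, lr=None):
        # Solve min_x (1/n) * sum_i f_i(x) + (lam/2)*||x - center||^2 by gradient
        # descent on the finite sum; fs_grads is a list of per-sample gradient fns.
        n = len(fs_grads)
        x = x0.copy()
        lr = lr if lr is not None else 1.0 / (1.0 + lam)   # crude step size
        for _ in range(steps):
            g = sum(gi(x) for gi in fs_grads) / n + lam * (x - center)
            x = x - lr * g
        return x

    # One round on n fresh samples of f(x; z) = 0.5*||x - z||^2:
    rng = np.random.default_rng(2)
    samples = [rng.normal(loc=1.0, size=4) for _ in range(50)]
    fs_grads = [(lambda x, z=z: x - z) for z in samples]   # grad f(x; z) = x - z
    x_hat = rerm_subproblem_solver(fs_grads, center=np.zeros(4),
                                   lam=0.1, x0=np.zeros(4))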

Removing smoothness entirely in one dimension.

The gap between the upper and lower bounds on the statistical complexity is quite interesting. We conclude from cor:sample-complexity-bounded that the sample complexity depends at most logarithmically on the smoothness constant, which raises the question of whether it must depend on smoothness at all. We now show that, for the special case of functions in one dimension, smoothness is not necessary. In other words, all that is required to find an $\epsilon$-stationary point is Lipschitzness.

Theorem 5.

Consider any convex, $L$-Lipschitz function $F : \mathbb{R} \to \mathbb{R}$ that is bounded from below (the lower bound does not enter the sample complexity quantitatively), and any global stochastic oracle with the restriction that $f(\,\cdot\,; z)$ is convex for all $z$. There exists an algorithm which uses $\tilde{O}(L^2/\epsilon^2)$ samples and outputs a point $\hat{x}$ such that $\mathbb{E}\big[\min_{g \in \partial F(\hat{x})} |g|\big] \le \epsilon$.

The algorithm computes the empirical risk minimizer on each of several independent samples, and then returns the point that has the smallest empirical gradient norm on a held-out validation sample. The proof uses the fact that any function as in the theorem statement has a single left-most and a single right-most $\epsilon$-stationary point. As long as the empirical function's derivative is close to $F$'s at those two points, we argue that the ERM lies between them with constant probability, and is thus an $\epsilon$-stationary point of $F$. We boost the confidence by repeating this a logarithmic number of times. A rigorous argument is included in app:proofs-sample-complexity. Unfortunately, arguments of this type do not appear to extend to more than one dimension: the boundary of the set of $\epsilon$-stationary points will generally be uncountable, and thus it is not apparent that the empirical gradient will be uniformly close to the population gradient. It remains open whether smoothness is needed in two dimensions or more.

The algorithm succeeds even for non-differentiable functions, and requires neither strong convexity nor knowledge of a point $x_0$ for which $\|x_0 - x^\star\|$ or $F(x_0) - F^\star$ is bounded. In fact, the assumption of Lipschitzness (more generally, $\sigma$-sub-Gaussianity of the gradients) is only required to obtain an in-expectation statement. Without this assumption, it can still be shown that ERM finds an $\epsilon$-stationary point with constant probability using a comparable number of samples.
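A minimal Python sketch of this one-dimensional procedure, under our own simplifications: each ERM is computed on a fixed grid rather than exactly, and the "empirical gradient" at a candidate is a finite-difference slope of the validation average (the formal argument works with subgradients directly). All names, tolerances, and grid choices are illustrative.

    import numpy as np

    def one_dim_erm_validate(sample_f, n, k=9, grid=None):
        # sample_f() draws z ~ D and returns the 1-D loss f(., z) as a callable.
        # Compute k independent ERMs, then return the candidate whose empirical
        # (finite-difference) derivative on a fresh validation set is smallest.
        grid = np.linspace(-10, 10, 2001) if grid is None else grid
        candidates = []
        for _ in range(k):
            fs = [sample_f() for _ in range(n)]
            emp = lambda x: sum(f(x) for f in fs) / n
            candidates.append(grid[np.argmin([emp(x) for x in grid])])  # grid ERM
        val = [sample_f() for _ in range(n)]
        def val_slope(x, h=1e-3):            # finite-difference validation derivative
            lo = sum(f(x - h) for f in val) / n
            hi = sum(f(x + h) for f in val) / n
            return abs(hi - lo) / (2 * h)
        return min(candidates, key=val_slope)

    # Toy run: f(x; z) = |x - z| with z ~ N(0, 1) is non-smooth, yet F(x) = E|x - z|
    # has its stationary point at x = 0, which the procedure locates approximately.
    rng = np.random.default_rng(3)
    sample_f = lambda: (lambda x, z=rng.normal(): np.abs(x - z))
    print(one_dim_erm_validate(sample_f, n=200))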

5 Discussion

We have proven nearly tight bounds on the oracle complexity of finding near-stationary points in stochastic convex optimization, both for local stochastic oracles and global stochastic oracles. We hope that the approach of jointly studying stochastic first-order complexity and sample complexity will find use more broadly in non-convex optimization. To this end, we close with a few remarks and open questions.

  1. Is smoothness necessary for finding $\epsilon$-stationary points? While the logarithmic factor separating the upper and lower bounds we provide for the stochastic first-order oracle complexity is fairly inconsequential, the gap between the upper and lower bounds on the sample complexity is quite interesting. In particular, thm:statistical-lower-bound and cor:sample-complexity-bounded together show that for range-bounded functions

    $\Omega\Big(\frac{\sigma^2}{\epsilon^2}\Big) \;\le\; m_{\mathcal{O}_{\mathrm{glob}}}\big(\mathcal{F}_\Delta(H, 0), \epsilon\big) \;\le\; O\Big(\frac{\sigma^2}{\epsilon^2}\,\mathrm{polylog}\Big(\frac{H\Delta}{\epsilon^2}\Big)\Big),$

    and similarly for the domain-bounded case. Can the polylogarithmic smoothness factor on the right-hand side be removed entirely? Or, in other words, is it possible to find near-stationary points in the statistical learning model without smoothness? (For a general non-smooth function $F$, a point $x$ is said to be $\epsilon$-stationary if there exists $g \in \partial F(x)$ with $\|g\| \le \epsilon$.) By thm:sample-complexity-non-smooth, we know that this is possible in one dimension.

  2. Tradeoff between computational complexity and sample complexity. Suppose our end goal is to find a near-stationary point in the statistical learning setting, but we wish to do so efficiently. For range-bounded functions, if we use alg:meta-algorithm with AC-SA² as a subroutine, we require $\tilde{O}(\sqrt{H\Delta}/\epsilon + \sigma^2/\epsilon^2)$ samples, and the total computational effort (measured by the number of gradient operations) is of the same order. On the other hand, if we use alg:meta-algorithm with RERM as a subroutine and implement RERM with Katyusha, then we obtain an improved sample complexity of $\tilde{O}(\sigma^2/\epsilon^2)$, but at the cost of a larger number of gradient operations. Tightly characterizing such computational-statistical tradeoffs in this and related settings is an interesting direction for future work.

  3. Active stochastic oracle. For certain stochastic first-order optimization algorithms based on variance reduction (SCSG (Lei et al., 2017), SPIDER (Fang et al., 2018)), a gradient must be computed at multiple points for the same sample $z$. We refer to such algorithms as using an "active query" first-order stochastic oracle, which is a stronger oracle than the classical first-order stochastic oracle (see Woodworth et al. (2018) for more discussion). It would be useful to characterize the exact oracle complexity in this model, and in particular to understand how many active queries are required to obtain logarithmic dependence on smoothness as in the global case.

  4. Complexity of finding stationary points for smooth non-convex functions. An important open problem is to characterize the minimax oracle complexity of finding near-stationary points for smooth non-convex functions, both for local and global stochastic oracles. For a deterministic first-order oracle, the optimal rate is $\Theta(H\Delta/\epsilon^2)$ (Carmon et al., 2017a). In the stochastic setting, a simple $\Omega(\sigma^2/\epsilon^2)$ sample complexity lower bound follows from the convex case, but this is not known to be tight.

Acknowledgements

We would like to thank Srinadh Bhojanapalli and Robert D. Kleinberg for helpful discussions. Part of this work was completed while DF was at Cornell University and supported by the Facebook Ph.D. fellowship. OS is partially supported by a European Research Council (ERC) grant. OS and NS are partially supported by an NSF/BSF grant. BW is supported by the NSF Graduate Research Fellowship under award 1754881.

References

  • Agarwal et al. (2009) Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
  • Allen-Zhu (2017) Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
  • Allen-Zhu (2018) Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. In Advances in Neural Information Processing Systems, pages 1165–1175. 2018.
  • Braun et al. (2017) Gábor Braun, Cristóbal Guzmán, and Sebastian Pokutta. Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Transactions on Information Theory, 63(7):4709–4724, 2017.
  • Carmon et al. (2017a) Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017a.
  • Carmon et al. (2017b) Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points II: First-order methods. arXiv preprint arXiv:1711.00841, 2017b.
  • Davis and Drusvyatskiy (2018) Damek Davis and Dmitriy Drusvyatskiy. Complexity of finding near-stationary points of convex functions stochastically. arXiv preprint arXiv:1802.08556, 2018.
  • Dekel et al. (2012) Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
  • Fang et al. (2018) Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687–697, 2018.
  • Feige et al. (1994) Uriel Feige, Prabhakar Raghavan, David Peleg, and Eli Upfal. Computing with noisy information. SIAM Journal on Computing, 23(5):1001–1018, 1994.
  • Ghadimi and Lan (2012) Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
  • Ghadimi and Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Ghadimi and Lan (2016) Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2):59–99, 2016. doi: 10.1007/s10107-015-0871-8.
  • Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732, 2017.
  • Karp and Kleinberg (2007) Richard M Karp and Robert Kleinberg. Noisy binary search and its applications. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 881–890. Society for Industrial and Applied Mathematics, 2007.
  • Lei et al. (2017) Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2348–2358, 2017.
  • Nemirovski and Yudin (1983) Arkadii Semenovich Nemirovski and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
  • Nesterov (2004) Yurii Nesterov. Introductory lectures on convex optimization: a basic course. 2004.
  • Nesterov (2012) Yurii Nesterov. How to make the gradients small. Optima, 88:10–11, 2012.
  • Reddi et al. (2016) Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczós, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, pages 314–323, 2016.
  • Shalev-Shwartz et al. (2009) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In Conference on Learning Theory, 2009.
  • Traub et al. (1988) Joseph F Traub, Grzegorz W Wasilkowski, and Henryk Woźniakowski. Information-based complexity. 1988.
  • Woodworth and Srebro (2016) Blake Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems 29, pages 3639–3647. 2016.
  • Woodworth et al. (2018) Blake Woodworth, Jialei Wang, Brendan McMahan, and Nathan Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in Neural Information Processing Systems 31, pages 8505–8515, 2018.
  • Zhou et al. (2018) Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 3925–3936. 2018.

Appendix A Proofs from sec:first-order: Upper Bounds

Theorem 6 (Proposition 9 of Ghadimi and Lan (2012)).

For any $H$-smooth, $\lambda$-strongly convex $F$ and any stochastic first-order oracle $\mathcal{O}_\sigma$, the AC-SA algorithm returns a point $x_T^{ag}$ after making $T$ oracle accesses such that

$\mathbb{E}\big[F(x_T^{ag})\big] - F^\star \le O\Big(\frac{H\|x_0 - x^\star\|^2}{T^2} + \frac{\sigma^2}{\lambda T}\Big).$

Lemma 1.

For any $H$-smooth, $\lambda$-strongly convex $F$ and any stochastic first-order oracle $\mathcal{O}_\sigma$, the AC-SA² algorithm returns a point $x_2$ after making $T$ oracle accesses such that

$\mathbb{E}\big[F(x_2)\big] - F^\star \le O\Big(\frac{H^2\|x_0 - x^\star\|^2}{\lambda T^4} + \Big(1 + \frac{H}{\lambda T^2}\Big)\frac{\sigma^2}{\lambda T}\Big).$

Proof.

By thm:ghadimi_lan_AC_SA, the first instance of AC-SA outputs $x_1$ such that

(11)  $\mathbb{E}\big[F(x_1)\big] - F^\star \le O\Big(\frac{H\|x_0 - x^\star\|^2}{T^2} + \frac{\sigma^2}{\lambda T}\Big)$,

and since $F$ is $\lambda$-strongly convex,

(12)  $\mathbb{E}\,\|x_1 - x^\star\|^2 \le \frac{2}{\lambda}\big(\mathbb{E}\big[F(x_1)\big] - F^\star\big).$

Also by thm:ghadimi_lan_AC_SA, the second instance of AC-SA outputs $x_2$ such that

(13)  $\mathbb{E}\big[F(x_2)\big] - F^\star \le O\Big(\frac{H\,\mathbb{E}\|x_1 - x^\star\|^2}{T^2} + \frac{\sigma^2}{\lambda T}\Big)$
(14)  $\le O\Big(\frac{H}{\lambda T^2}\big(\mathbb{E}\big[F(x_1)\big] - F^\star\big) + \frac{\sigma^2}{\lambda T}\Big)$
(15)  $\le O\Big(\frac{H^2\|x_0 - x^\star\|^2}{\lambda T^4} + \Big(1 + \frac{H}{\lambda T^2}\Big)\frac{\sigma^2}{\lambda T}\Big).$

Lemma 2 (Claim 6.2 of Allen-Zhu (2018)).

Let $x_k^\star = \operatorname{argmin}_x F^{(k)}(x)$, and suppose that for every $k$ the iterates of alg:meta-algorithm satisfy the per-round error guarantee $\mathbb{E}\,\|x_k - x_k^\star\|^2 \le \delta_k$ delivered by the subroutine. Then:

  1. For all $k$, the distance between consecutive subproblem minimizers, $\|x_k^\star - x_{k+1}^\star\|$, is controlled by $\|x_k - x_k^\star\|$.

  2. For every $k$, the distance from $x_{k-1}$ to $x_k^\star$ (the initialization error of round $k$) is controlled by the accumulated errors of the previous rounds.

  3. For all $k$, $\|\nabla F(x_K)\|$ is bounded by a weighted sum of the per-round errors $\lambda_k\|x_k - x_k^\star\|$.

Theorem 1 (restated). For any $\lambda$-strongly convex $F \in \mathcal{F}_D(H, \lambda)$ and any stochastic first-order oracle $\mathcal{O}_\sigma$, alg:meta-algorithm using AC-SA² as its subroutine finds a point $x$ with $\mathbb{E}\,\|\nabla F(x)\| \le \epsilon$ using $\tilde{O}\big(\sqrt{H/\lambda} + \sigma^2/\epsilon^2\big)$ total stochastic first-order oracle accesses.

Proof.

As in lem:zeyuan_auxillary_lemma, let $x_k^\star = \operatorname{argmin}_x F^{(k)}(x)$ for each $k$. The objective in the final iteration is $F^{(K)}$, so

(16)
(17)
(18)
(19)
(20)
(21)
(22)

Above, eq:thm1-triangle-ineq-1 and eq:thm1-triangle-ineq-2 rely on the triangle inequality; eq:thm1-strong-convexity follows from the -strong convexity of ; eq:thm1-lem4 applies the third conclusion of lem:zeyuan_auxillary_lemma; eq:thm1-smoothness uses the fact that is -smooth; and finally eq:thm1-choice-of-T uses that .

We chose the subroutine $\mathcal{A}$ to be AC-SA², applied to $F^{(k)}$, initialized at $x_{k-1}$, and using $T/K$ stochastic gradients. Therefore,

(23)
Using part two of lem:zeyuan_auxillary_lemma, for each $k$ we can bound the initialization distance $\|x_{k-1} - x_k^\star\|$, thus
(24)

We can therefore bound

(25)