Lower Bounds for Higher-Order Convex Optimization

10/27/2017 ∙ by Naman Agarwal, et al.

State-of-the-art methods in convex and non-convex optimization employ higher-order derivative information, either implicitly or explicitly. We explore the limitations of higher-order optimization and prove that even for convex optimization, a polynomial dependence on the approximation guarantee and higher-order smoothness parameters is necessary. As a special case, we show Nesterov's accelerated cubic regularization method to be nearly tight.




1 Introduction

State-of-the-art optimization for machine learning has shifted from gradient-based methods, namely stochastic gradient descent and its derivatives [DHS11, JZ13], to methods based on higher moments. Notably, the fastest theoretical running times for both convex [ABH16, XYRK16, BBN16] and non-convex [AAZB17, CDHS16] optimization are attained by algorithms that either explicitly or implicitly exploit second-order information and third-order smoothness. Of particular interest is Newton's method, due to recent efficient implementations that run in near-linear time in the input representation. The hope was that Newton's method, and perhaps higher-order methods, can achieve iteration complexity that is independent of the condition number of the problem as well as of the dimensionality, both of which are extremely high for many large-scale applications.

In this paper we explore the limitations of higher-order iterative optimization, and show that unfortunately, these hopes cannot be attained without stronger assumptions on the underlying optimization problem. To the best of our knowledge, our results are the first lower bound for $k^{th}$-order optimization for $k \geq 2$ that includes higher-order smoothness. (Footnote 1: After the writing of the first manuscript we were made aware of the work by Arjevani et al. [ASS17], which provides lower bounds for these settings as well.)

We consider the problem of $k^{th}$-order optimization. We model a $k^{th}$-order algorithm as follows. Given a $k$-times differentiable function $f: \mathbb{R}^d \to \mathbb{R}$, at every iteration $i$ the algorithm outputs a point $x_i$ and receives as input the tuple $[f(x_i), \nabla f(x_i), \nabla^2 f(x_i), \ldots, \nabla^k f(x_i)]$, i.e. the value of the function and its $k$ derivatives at $x_i$. (Footnote 2: An iteration is equivalent to an oracle call to $k^{th}$-order derivatives in this model.) The goal of the algorithm is to output a point $x_T$ such that

$$f(x_T) - \min_{x} f(x) \leq \varepsilon.$$

For the $k^{th}$-order derivatives to be informative, one needs to bound their rate of change, or Lipschitz constant. This is called $k^{th}$-order smoothness, and we denote it by $L_{k+1}$. In particular we assume that

$$\|\nabla^k f(x) - \nabla^k f(y)\| \leq L_{k+1} \|x - y\|$$

where $\|\cdot\|$ is defined as the induced operator norm with respect to the Euclidean norm. Our main theorem shows the following limitation of $k^{th}$-order iterative optimization algorithms:

Theorem 1.1.

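For every number $L_{k+1}$ and $k^{th}$-order algorithm ALG (deterministic or randomized), there exists an $\varepsilon_0(L_{k+1})$ such that for all $\varepsilon \leq \varepsilon_0(L_{k+1})$, there exists a $k$-differentiable convex function $f: B_d \to \mathbb{R}$ with $k^{th}$-order smoothness coefficient $L_{k+1}$ such that ALG cannot output a point $\hat{x}$ such that

$$f(\hat{x}) \leq \min_{x \in B_d} f(x) + \varepsilon$$

in number of iterations fewer than

$$c_k \left(\frac{L_{k+1}}{\varepsilon}\right)^{2/(5k+1)}$$

where $c_k$ is a constant depending on $k$ and $B_d$ is defined to be the unit ball in $d$ dimensions.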

Note that although the bound is stated for constrained optimization over the unit ball, it can be extended to an unconstrained setting via the addition of an appropriately scaled regularization term. We leave this adaptation for a full version of this paper. Further, as is common with lower bounds, the underlying dimension is assumed to be large enough, and the required dimension differs between the deterministic and randomized versions. Theorems 4.1 and 5.1 make the dependence precise.
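To make the oracle model behind Theorem 1.1 concrete, here is a minimal sketch for the case of second-order access. The class name `SecondOrderOracle`, the test function (a sum of x^4/4 + x^2/2 terms) and the toy `newton_method` are illustrative assumptions and do not appear in the paper; the only point is that one iteration corresponds to one oracle call returning the function value together with its derivatives.

```python
class SecondOrderOracle:
    """Hypothetical k = 2 oracle: each call returns (f(x), grad f(x), hess f(x))."""

    def __init__(self):
        self.calls = 0  # one oracle call corresponds to one iteration in the model

    def query(self, x):
        self.calls += 1
        # Illustrative smooth convex function: f(x) = sum_i (x_i^4 / 4 + x_i^2 / 2)
        value = sum(xi ** 4 / 4 + xi ** 2 / 2 for xi in x)
        grad = [xi ** 3 + xi for xi in x]
        hess = [[(3 * xi ** 2 + 1) if i == j else 0.0 for j in range(len(x))]
                for i, xi in enumerate(x)]
        return value, grad, hess


def newton_method(oracle, x0, steps):
    """Toy second-order algorithm: plain Newton steps (the Hessian here is diagonal)."""
    x = list(x0)
    for _ in range(steps):
        _, grad, hess = oracle.query(x)
        x = [xi - g / hess[i][i] for i, (xi, g) in enumerate(zip(x, grad))]
    return x


oracle = SecondOrderOracle()
x_final = newton_method(oracle, [0.5, -0.3], steps=5)
```

A lower bound in this model counts `oracle.calls`: it limits how accurate `x_final` can be after a given number of queries, regardless of how the algorithm uses the returned tuple.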

Comparison to existing bounds.

For the case of $k = 2$, the most efficient methods known are the cubic regularization technique proposed by Nesterov [Nes08] and an accelerated hybrid proximal extragradient method proposed by Monteiro and Svaiter [MS13]. The best known upper bound in this setting is $O\left((L_3/\varepsilon)^{2/7}\right)$ [MS13]. We show a lower bound of $\Omega\left((L_3/\varepsilon)^{2/11}\right)$, demonstrating that the upper bound is nearly tight. For the case of $k > 2$, Baes [Bae09] proves an upper bound of $O\left((L_{k+1}/\varepsilon)^{1/(k+1)}\right)$. In comparison, Theorem 1.1 proves a lower bound of $\Omega\left((L_{k+1}/\varepsilon)^{2/(5k+1)}\right)$.

1.1 Related work

The literature on convex optimization is too vast to survey; the reader is referred to [BV04, Nes04]. Lower bounds for convex optimization were studied extensively in the seminal work of [NY78]. In particular, tight first-order optimization lower bounds were established assuming first-order smoothness (see also [Nes04] for a concise presentation of the lower bound). In a recent work, [AS16] presented a lower bound when given access to second-order derivatives. However, a key limitation of the bound established by [AS16] (as remarked by the authors themselves) was that the constructed function was not third-order smooth. Indeed, the lower bound established by [AS16] can be overcome when the function is third-order smooth (ref. [Nes08]). The upper bounds for higher-order oracles (assuming appropriate smoothness) were established by [Bae09]. Higher-order smoothness has been leveraged recently in the context of non-convex optimization [AAZB17, CDHS16, AZ17]. In a surprising new discovery, [CHDS17] show that, assuming higher-order smoothness, the bounds for first-order optimization can be improved without having explicit access to higher-order oracles. This is a property observed in our lower bound too: as shown in the proof, the higher-order derivatives at the points queried by the algorithm are always 0. For further details regarding first-order lower bounds in various settings we refer the reader to [AWBR09, WS16b, AS16, ASSS15] and the references therein. In parallel and independently, Arjevani et al. [ASS17] also obtain lower bounds for deterministic higher-order optimization. In comparison, their lower bound is stronger in terms of the exponent than the ones proved in this paper, and matches the upper bound for $k = 2$. However, our construction and proof are simple (based on the well-known technique of smoothing) and our bounds hold for randomized algorithms as well, as opposed to their deterministic lower bounds.

1.2 Overview of Techniques

Our lower bound is inspired by the lower bound presented in [CHW12a]. In particular, we construct the function as a piecewise linear convex function defined by

$$f(x) = \max_{i} \langle a_i, x \rangle$$

with carefully constructed vectors $a_i$, and restricting the domain to be the unit ball. The key idea here is that querying a point reveals information about at most one hyperplane. The optimal point, however, can be shown to require information about all the hyperplanes.

Unfortunately the above function is not differentiable. We therefore smooth the function by the ball smoothing operator (defined in Definition 2.2), which averages the function in a small Euclidean ball around a point. We show (c.f. Corollary 2.4) that iterative application of the smoothing operator ensures $k$-differentiability as well as boundedness of the $k^{th}$-order smoothness.

Two key issues arise due to the above smoothing. Firstly, although the smoothing operator leaves the function unchanged in regions far away from the intersection of the hyperplanes, this is not the case for points lying near the intersection. Indeed, querying a point near the intersection of the hyperplanes can potentially leak information about multiple hyperplanes at once. To avoid this, we carefully shift the linear hyperplanes, making them affine, and then argue that this shifting forces a sufficient gap between the points queried by the algorithm and the intersections, leaving sufficient room for smoothing. Secondly, such a smoothing is well known to introduce a dependence on the dimension in the smoothness coefficients. Our key insight here is that for the class of functions being considered for the lower bound (c.f. Definition 2.1) smoothing can be achieved without a dependence on the dimension (c.f. Lemma 2.3). This is essential to achieving dimension-free lower bounds and we believe this characterization can be of intrinsic interest.
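The two ingredients described above (a maximum of linear functions, and averaging over a small ball) can be illustrated with a short Monte Carlo sketch. The helper names `max_linear`, `sample_unit_ball` and `ball_smooth`, the choice of the standard basis as hyperplanes, and the sampling-based estimate are all assumptions made for illustration; the paper works with the exact expectation and with carefully constructed vectors.

```python
import math
import random


def max_linear(vectors, x):
    """f(x) = max_i <a_i, x>; also return the index of the active hyperplane."""
    values = [sum(a * b for a, b in zip(vec, x)) for vec in vectors]
    best = max(range(len(vectors)), key=lambda i: values[i])
    return values[best], best


def sample_unit_ball(dim, rng):
    """Uniform sample from the Euclidean unit ball (Gaussian direction, radius U^(1/dim))."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random() ** (1.0 / dim)
    return [radius * v / norm for v in g]


def ball_smooth(f, x, delta, samples, rng):
    """Monte Carlo estimate of the average of f over the ball of radius delta around x."""
    total = 0.0
    for _ in range(samples):
        v = sample_unit_ball(len(x), rng)
        total += f([xi + delta * vi for xi, vi in zip(x, v)])
    return total / samples


rng = random.Random(0)
hyperplanes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f = lambda x: max_linear(hyperplanes, x)[0]

# Far from any intersection only hyperplane 0 is active inside the whole ball,
# so smoothing leaves the value (essentially) unchanged.
far_point = [0.5, 0.1, 0.0]
value_far, active = max_linear(hyperplanes, far_point)
smoothed_far = ball_smooth(f, far_point, delta=0.1, samples=2000, rng=rng)

# At the intersection of all hyperplanes the smoothed value differs from f,
# because the ball sees several linear pieces at once.
kink = [0.0, 0.0, 0.0]
smoothed_kink = ball_smooth(f, kink, delta=0.1, samples=2000, rng=rng)
```

In this toy instance a query at `far_point` only exposes the single active hyperplane, while a query at `kink` mixes several of them, which is the leakage the shifting argument is designed to rule out.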

1.3 Organization of the paper

We begin by providing requisite notations and definitions for the smoothing operator and proving the relevant lemmas regarding smoothing in Section 2. In Section 3 we provide the construction of our hard function. In Section 4 we state and prove our main theorem (Theorem 4.1) showing the lower bound against deterministic algorithms. We also prove Theorem 1.1 based on Theorem 4.1 in this section. In Section 5 we state and prove Theorem 5.1, showing the lower bound against randomized algorithms.

2 Preliminaries

2.1 Notation

We use $B_d$ to refer to the $d$-dimensional unit ball. We suppress the $d$ from the notation when it is clear from the context. Let $\Gamma$ be an $r$-dimensional linear subspace of $\mathbb{R}^d$. We denote by $M_\Gamma$ an $r \times d$ matrix which contains an orthonormal basis of $\Gamma$ as rows. Let $\Gamma^\perp$ denote the orthogonal complement of $\Gamma$. Given a vector $v$ and a subspace $\Gamma$, let $v \perp \Gamma$ denote the perpendicular component of $v$ w.r.t. $\Gamma$. We now define the notion of a $\Gamma$-invariant function.

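Definition 2.1 ($\Gamma$-invariance).

Let $\Gamma$ be an $r$-dimensional linear subspace of $\mathbb{R}^d$. A function $f: \mathbb{R}^d \to \mathbb{R}$ is said to be $\Gamma$-invariant if for all $x \in \mathbb{R}^d$ and $y$ belonging to the subspace $\Gamma^\perp$, i.e. $M_\Gamma y = 0$, we have that

$$f(x) = f(x + y).$$

Equivalently, there exists a function $g: \mathbb{R}^r \to \mathbb{R}$ such that for all $x$, $f(x) = g(M_\Gamma x)$.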
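A small numerical check can make the invariance definition tangible. The sketch below assumes the subspace is spanned by the first two coordinate directions of a four-dimensional space and uses a hypothetical function `g`; neither choice comes from the paper. It verifies that a function of the form "g applied to the coordinates of x in the subspace" does not change when x is moved along the orthogonal complement.

```python
import random

# Rows of M form an orthonormal basis of the subspace span(e1, e2) inside R^4.
M = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]


def project(M, x):
    """Coordinates of x in the subspace, i.e. the matrix-vector product M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def g(z):
    # Hypothetical low-dimensional function (a max of linear functions and 0).
    return max(z[0], z[1], 0.0)


def f(x):
    # f depends on x only through M x, so it is invariant by construction.
    return g(project(M, x))


rng = random.Random(1)
max_gap = 0.0
for _ in range(100):
    x = [rng.uniform(-1, 1) for _ in range(4)]
    # y lies in the orthogonal complement: its first two coordinates vanish, so M y = 0.
    y = [0.0, 0.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]
    shifted = [a + b for a, b in zip(x, y)]
    max_gap = max(max_gap, abs(f(shifted) - f(x)))
```

Moving x inside the subspace itself can of course change the value, for example `f([1.0, 0.0, 0.0, 0.0])` differs from `f([0.0, 0.0, 0.0, 0.0])`.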

A function $f$ is defined to be $G$-Lipschitz with respect to a norm $\|\cdot\|$ if it satisfies

$$|f(x) - f(y)| \leq G \|x - y\|.$$

Lipschitzness for the rest of the paper will be measured in the $\ell_2$ norm.

2.2 Smoothing

In this section we define the smoothing operator and the requisite properties.

Definition 2.2 (Smoothing operator).

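Given an $r$-dimensional subspace $\Gamma \subseteq \mathbb{R}^d$ and a parameter $\delta > 0$, define the operator $S_{\delta,\Gamma}$ (referred to henceforth as the smoothing operator) as

$$S_{\delta,\Gamma} f(x) = \mathbb{E}_{v \in \Gamma, \|v\| \leq 1}\left[f(x + \delta v)\right].$$

As a shorthand we define $f_{\delta,\Gamma} = S_{\delta,\Gamma} f$. Further, for any $t \in \mathbb{N}$ define $S^t_{\delta,\Gamma} f = S_{\delta,\Gamma}(\ldots S_{\delta,\Gamma}(f))$, i.e. the smoothing operator applied on $f$ iteratively $t$ times.

When $\Gamma = \mathbb{R}^d$ we suppress the notation from $f_{\delta,\Gamma}$ to $f_\delta$. Following is the main lemma we prove regarding the smoothing operator.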
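As a rough illustration of the smoothing operator and of its iterated application, the following sketch estimates the smoothed value by sampling. It assumes, in line with the description in Section 1.2, that the operator averages the function over a ball of radius delta inside the subspace; the function names, the one-dimensional subspace and the test function |x_1| are hypothetical choices for the example.

```python
import math
import random


def sample_ball_in_subspace(basis, rng):
    """Uniform point of the unit ball of span(basis); basis rows are orthonormal."""
    r = len(basis)
    g = [rng.gauss(0.0, 1.0) for _ in range(r)]
    norm = math.sqrt(sum(c * c for c in g))
    radius = rng.random() ** (1.0 / r)
    coeffs = [radius * c / norm for c in g]
    dim = len(basis[0])
    return [sum(coeffs[j] * basis[j][i] for j in range(r)) for i in range(dim)]


def smooth(f, x, delta, basis, t, samples, rng):
    """Monte Carlo estimate of the t-fold smoothing of f at x.

    Applying the averaging operator t times is the same as averaging f over
    x + delta * (v_1 + ... + v_t) with independent v_j from the unit ball.
    """
    total = 0.0
    for _sample in range(samples):
        point = list(x)
        for _step in range(t):
            v = sample_ball_in_subspace(basis, rng)
            point = [p + delta * vi for p, vi in zip(point, v)]
        total += f(point)
    return total / samples


rng = random.Random(0)
basis = [[1.0, 0.0, 0.0]]        # subspace span(e1) inside R^3
f = lambda x: abs(x[0])          # 1-Lipschitz, invariant w.r.t. span(e1), kink at x[0] = 0
origin = [0.0, 0.0, 0.0]
once = smooth(f, origin, 0.5, basis, t=1, samples=20000, rng=rng)
twice = smooth(f, origin, 0.5, basis, t=2, samples=20000, rng=rng)
```

For this particular function the exact values at the origin are delta/2 = 0.25 after one smoothing and 2*delta/3 (about 0.333) after two, so the estimates should land close to those numbers. Since each smoothing step moves the argument by at most delta, a G-Lipschitz function changes by at most t*delta*G under t smoothing steps.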

Lemma 2.3.

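Let $\Gamma$ be an $r$-dimensional linear subspace of $\mathbb{R}^d$ and $f$ be $\Gamma$-invariant and $G$-Lipschitz. Let $f_{\delta,\Gamma}$ be the smoothing of $f$ with parameter $\delta$ with respect to $\Gamma$. Then we have the following properties.

  1. $f_{\delta,\Gamma}$ is differentiable and also $G$-Lipschitz and $\Gamma$-invariant.

  2. $\nabla f_{\delta,\Gamma}$ is $\frac{rG}{\delta}$-Lipschitz.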


As stated before being -invariant implies that there exists a function such that . Therefore we have that

where . The representation of as implies that is invariant. Further the above equality implies that . A standard argument using Stokes’ theorem shows that is differentiable even when is not 333We need to be not differentiable in a measure 0 set which is always the case with our constructions and that (Lemma 1 [FKM05]), where is the -dimensional sphere, i.e.

The first inequality follows from Jensen’s inequality and the second inequality follows from noticing that being -Lipschitz implies that is -Lipschitz. We now have that

being -Lipschitz immediately gives us

Corollary 2.4.

Given a -Lipschitz continuous function and an -dimensional subspace such that is -invariant, we have that the function is -times differentiable . Moreover we have that for any


We will argue inductively. The base case () is a direct consequence of the function being -Lipschitz. Suppose the theorem holds for . To argue about we will consider the function for and for a unit vector . We will first consider the case . Using the inductive hypothesis and the fact that smoothing and derivative commute for differentiable functions we have that

Note that the inductive hypothesis implies that is -Lipschitz and so is via Lemma 2.3. Therefore we have that

We now consider the case when . By Lemma 2.3 we know that is differentiable and therefore we have that is times differentiable. Further we have that

A direct application of Lemma 2.3 gives that

Further it is immediate to see that

which implies using the fact that is Lipschitz that

3 Construction of the hard function

In this section we describe the construction of our hard function . Our construction is inspired by the information-theoretic hard instance of zero sum games of [CHW12b]. The construction of the function will be characterized by a sequence of vectors , and parameters . We assume . To make the dependence explicit we denote the hard function as

For brevity in the rest of the section we suppress from the notation, however all the quantities defined in the section depend on them. To define we will define auxiliary vectors and auxiliary functions . Given a sequence of vectors , let for be defined as the subspace spanned by the vectors . Further inductively define vectors as follows. If (i.e. the perpendicular component of on the subspace is not zero), define

If indeed belongs to the subspace , then is defined to be an arbitrary unit vector in the perpendicular subspace . Further define an auxiliary function

Given the parameter , now define the following functions

With these definitions in place we can now define the hard function parameterized by . Let be the subspace spanned by


i.e. is constructed by smoothing -times with respect to the parameters . We now collect some important observations regarding the function .
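The inductive definition of the auxiliary vectors resembles a Gram-Schmidt process with a fallback. The sketch below is one possible reading and not the paper's exact construction: it assumes each auxiliary vector is the normalized component of the corresponding input vector perpendicular to the span of the auxiliary vectors built so far, and that the "arbitrary unit vector" in the degenerate case may be taken from the standard basis. The function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perpendicular_component(x, orthonormal):
    """Remove from x its projection onto the span of the orthonormal vectors."""
    residual = list(x)
    for a in orthonormal:
        coeff = dot(residual, a)
        residual = [ri - coeff * ai for ri, ai in zip(residual, a)]
    return residual


def auxiliary_vectors(points, tol=1e-9):
    """One unit vector per input point, each orthogonal to all previous ones."""
    vectors = []
    for x in points:
        dim = len(x)
        # Try the point itself first; if it already lies in the current span,
        # fall back to standard basis vectors (an arbitrary choice, assumed here).
        candidates = [list(x)] + [
            [1.0 if i == j else 0.0 for i in range(dim)] for j in range(dim)
        ]
        for cand in candidates:
            perp = perpendicular_component(cand, vectors)
            norm = math.sqrt(dot(perp, perp))
            if norm > tol:
                vectors.append([p / norm for p in perp])
                break
        else:
            raise ValueError("more points than dimensions: no perpendicular direction left")
    return vectors


points = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
aux = auxiliary_vectors(points)
```

In this example the second point lies in the span of the first, so the fallback branch is exercised, and the resulting three vectors form an orthonormal set.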

Observation 3.1.

is convex and continuous. Moreover it is 1-Lipschitz and is invariant with the respect to the r dimensional subspace .

Note that is a max of linear functions and hence convex. Since smoothing preserves convexity we have that is convex. 1-Lipschitzness follows by noting that by definition . It can be seen that is -invariant and therefore by Lemma 2.3 we get that is -invariant.

Observation 3.2.

is -differentiable with the Lipschitz constants for all .

Above is a direct consequence of Corollary 2.4 and the fact that is 1-Lipschitz and invariant with respect to the -dimensional subspace . Corollary 2.4 also implies that


Setting , we get that . Therefore the following inequality follows from Equation (3.2) and by noting that


The following lemma provides a characterization of the derivatives of at the points .

Lemma 3.3.

Given a sequence of vectors and parameters , let be a sequence of functions defined as

If the parameters are such that

then we have that


We will first note the following about the smoothing operator . At any point all the derivatives and the function value of for any function depend only on the values of the function in a ball of radius at most around the point . Consider the function and for any . Note that by definition of the functions , for any such that

we have that . Therefore to prove the lemma it is sufficient to show that

Let us first note the following facts. By construction we have that . This immediately implies that


Further using the fact that , we have that


Further note that by construction which implies . Again using the fact that , we have that


The above equations in particular imply that as long as , we have that


which as we argued before is sufficient to prove the lemma. ∎

4 Main Theorem and Proof

Theorem 1.1 follows immediately from the following main technical result (Theorem 4.1) by an appropriate choice of parameters. (Footnote 4: For the randomized version the statement follows the same way from Theorem 5.1.)

Theorem 4.1.

For any integer , any , and and any -order deterministic algorithm working on , there exists a convex function for , such that for steps of the algorithm every point queried by the algorithm is such that

Moreover the function is guaranteed to be -differentiable with Lipschitz constants bounded as


We first prove Theorem 1.1 in the deterministic case using Theorem 4.1.

Proof of Theorem 1.1 Deterministic case.

Given an algorithm ALG and numbers define . For any pick a number such that

Let be the function constructed in Theorem 4.1 for parameters and define the hard function

Note that by the guarantee in Equation (4.1) we get that is -order smooth with coefficient at most . Note that since this is a scaling of the original hard function the lower bound applies directly and therefore ALG cannot achieve accuracy

in fewer than the chosen number of iterations, which finishes the proof of the theorem. ∎

We now provide the proof of Theorem 4.1.

Proof of Theorem 4.1.

Define the following parameters


Consider a deterministic algorithm . Since is deterministic let the first point played by the algorithm be fixed to be . We now define a series of functions inductively for all as follows


The above definitions simulate the deterministic algorithm Alg with respect to changing functions . is the input the algorithm will receive if it queried point and the function was . is the next point the algorithm Alg will query on round given the inputs over the previous rounds. Note that thus far these quantities are tools defined for analysis. Since Alg is deterministic these quantities are all deterministic and well defined. We will now prove that the function defined in the series above satisfies the properties required by the Theorem 4.1.

Bounded Lipschitz Constants Using Corollary 2.4 and the fact that has Lipschitz constant bounded by 1 we get that the function has higher order Lipschitz constants bounded above as

Suboptimality Let be the points queried by the algorithm Alg when executed on . We need to show that


Equation 4.7 follows as a direct consequence of the following two claims.

Claim 4.2.

We have that for all , where is defined by Equation 4.6.

Claim 4.3.

We have that

To remind the reader were variables defined by Equation (4.6) and are the points played by the algorithm Alg when run on . Claim 4.2 shows that even though was constructed using the outputs produced by the algorithm does not change. Claim 4.2 and Claim 4.3 derive Equation 4.7 in a straightforward manner thus finishing the proof of Theorem 4.1. ∎

We now provide the proofs of Claim 4.2 and Claim 4.3.

Proof of Claim 4.2.

We will prove the claim inductively. The base case is immediate because Alg is deterministic and therefore the first point queried by it is always the same. Further note that for is defined inductively as follows.


It is now sufficient to show that


where is as defined in Equation (4.5). To see this note that

Equation (4.10) is a direct consequence of Lemma 3.3 by noting that which is true by definition of these parameters. ∎

Proof of Claim 4.3.

Using Lemma 3.3 we have that . Further Equation (3.6) implies that

Now using (3.3) using we get that every point in is such that

The above follows by the choice of parameters and being large enough. This finishes the proof of Claim 4.3. ∎

5 Lower Bounds against Randomized Algorithms

In this section we prove the version of Theorem 4.1 for randomized algorithms. The key idea underlying the proof remains the same. However, since we cannot simulate the algorithm anymore, we choose the vectors forming the subspace randomly from $\mathbb{R}^d$ for a large enough $d$. This ensures that no algorithm with few queries can discover, with reasonable probability, the subspace in which the function is non-invariant. Naturally, the dimension required for the randomized version of Theorem 4.1 is now larger than the tight dimension we achieved in the case of deterministic algorithms.
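The intuition that a random low-dimensional basis is nearly uncorrelated with any fixed vector in high dimension can be checked numerically. The sketch below is illustrative only: the helper names, the number of basis vectors, the dimensions and the number of trials are assumptions, and it says nothing about the precise dimension needed in Theorem 5.1.

```python
import math
import random


def random_orthonormal(r, d, rng):
    """r orthonormal vectors in R^d via Gram-Schmidt on Gaussian samples."""
    basis = []
    while len(basis) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:
            c = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-9:
            basis.append([vi / norm for vi in v])
    return basis


def max_correlation(x, basis):
    """Largest absolute inner product between x and a basis vector."""
    return max(abs(sum(xi * bi for xi, bi in zip(x, b))) for b in basis)


rng = random.Random(0)
r = 5
average_by_dim = {}
for d in (10, 400):
    x = [1.0] + [0.0] * (d - 1)   # fixed unit vector, chosen before the basis is drawn
    trials = [max_correlation(x, random_orthonormal(r, d, rng)) for _ in range(50)]
    average_by_dim[d] = sum(trials) / len(trials)
```

The average maximum correlation shrinks as the ambient dimension grows, roughly like one over the square root of the dimension, which is why a query chosen without knowledge of the basis is unlikely to align with it.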

Theorem 5.1.

For any integer , any , , and any -order (potentially randomized algorithm), there exists a differentiable convex function for , such that with probability at least (over the randomness of the algorithm) for steps of the algorithm every point queried by the algorithm is at least sub-optimal over the unit ball. Moreover the function is guaranteed to be -differentiable with Lipschitz constants bounded as


We provide a randomized construction for the function . The construction is the same as in Section 3 but we repeat it here for clarity. We sample a random dimensional basis . Let be the subspace spanned by and be the perpendicular subspace. Further define an auxiliary function

Given a parameter , now define the following functions


i.e. smoothing with respect to . The hard function we propose is the random function with parameters set as and . We restate facts which can be derived analogously to those derived in Section 3 (c.f. Equations (3.2),(3.3)).


The following key lemma will be the main component of the proof.

Lemma 5.2.

Let be the points queried by a randomized algorithm throughout its execution on the function . With probability at least (over the randomness of the algorithm and the selection of ) the following event happens

Using the above lemma we first demonstrate the proof of Theorem 5.1. We will assume the event in Lemma 5.2 happens.

Bounded Lipschitz Constants Using Corollary 2.4, the fact that has Lipschitz constant bounded by 1 and that is invariant with respect to the dimensional subspace , we get that the function has higher order Lipschitz constants bounded above as

Sub-optimality : The event in the lemma implies that which implies that and from Equation (5.2) we get that

Now using Equation (5.2) we get that every is such that

The last inequality follows by the choice of parameters. This finishes the proof of Theorem 5.1. ∎

Proof of Lemma 5.2.

We will use the following claims to prove the lemma. For any vector , define the event . The event we care about then is

Claim 5.3.

if holds, then all depend only on .

Claim 5.4.

For any , if we have that holds then we have that with probability at least (over the choice of and the randomness of the algorithm) the event happens.

Claim 5.3 is a robust version of the argument presented in the proof of Theorem 4.1. Claim 5.4 is a byproduct of the fact that in high dimensions the correlation between a fixed vector and a random small basis is small. Claim 5.3 is used to prove Claim 5.4. Lemma 5.2 now follows via a simple inductive argument using Claim 5.4 which is as follows

Proof of Claim 5.3.

As noted before the smoothing operator is such that at any point all the derivatives of depend only on the value of in a ball of radius around the point . Therefore it is sufficient to show that for the function , for every such that we have that depends only on . To ensure this, it is enough to ensure that for every such we have that which is what we prove next. Let us first note the following facts. By the definition of we have that . This immediately implies that


Now since we know each is -Lipschitz, this also gives us


By the event we also know that . This implies as above


The above equations imply that as long as (which is true by the choice of parameters), we have that


which is sufficient to prove Claim 5.3. ∎

Proof of Claim 5.4.

Consider any . Given is true for all , applying Claim 5.3 for all , implies that all the information that the algorithm possesses is only a function of and the internal randomness of the algorithm. Further we can assume that the basis is chosen by the inductive process which picks uniformly randomly from the subspace . Therefore we have that the choice of the remaining basis

is uniformly distributed in a subspace of dimension

completely independent of any vector the algorithm might play. Since we wish to bound the absolute value of the inner product we can assume . (Footnote 6: Otherwise the absolute value of the inner product is only lower.) The lemma now reduces to the following quantity: consider a fixed unit vector in a . Consider picking a dimensional subspace of given by the basis uniformly randomly. We wish to bound the probability that

The rest of the argument follows the one in [WS16a] (Proof of Lemma 7). Note that for this probability amounts to the surface area of a sphere above the caps of radius