Near-Optimal Lower Bounds For Convex Optimization For All Orders of Smoothness

12/02/2021
by Ankit Garg, et al. (Google, Microsoft)

We study the complexity of optimizing highly smooth convex functions. For a positive integer p, we want to find an ϵ-approximate minimum of a convex function f, given oracle access to the function and its first p derivatives, assuming that the pth derivative of f is Lipschitz. Recently, three independent research groups (Jiang et al., PMLR 2019; Gasnikov et al., PMLR 2019; Bubeck et al., PMLR 2019) developed a new algorithm that solves this problem with Õ(1/ϵ^{2/(3p+1)}) oracle calls for constant p. This is known to be optimal (up to log factors) for deterministic algorithms, but known lower bounds for randomized algorithms do not match this bound. We prove a new lower bound that matches this bound (up to log factors), and holds not only for randomized algorithms, but also for quantum algorithms.



1 Introduction

In recent years, several optimization algorithms have been proposed, especially for machine learning problems, that achieve improved performance by exploiting the smoothness of the function to be optimized. Specifically, these algorithms have better performance when the function’s first, second, or higher order derivatives are Lipschitz 

[Nes08, Bae09, MS13, Nes19, GDG19, JWZ19, BJL19b].

In this paper we study the problem of minimizing a highly smooth convex function given black-box access to the function and its higher order derivatives. The simplest example of the family of problems we consider here is the problem of approximately minimizing a convex function f: ℝ^d → ℝ, given access to an oracle that on input x outputs (f(x), ∇f(x)), under the assumption that the function’s first derivative, its gradient ∇f, has bounded Lipschitz constant. This problem can be solved by Nesterov’s accelerated gradient descent, and it is known that this algorithm is optimal (in high dimension) among deterministic and randomized algorithms [Nes83, NY83].
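To make this p = 1 baseline concrete, the following is a minimal NumPy sketch of one standard formulation of Nesterov's accelerated gradient method; the objective, step sizes, and iteration count are illustrative and not taken from the cited works.

```python
import numpy as np

def nesterov_agd(grad, x0, L, num_iters):
    """Minimal sketch of Nesterov's accelerated gradient method for a convex
    function with L-Lipschitz gradient (the p = 1 case). One of several
    equivalent formulations; parameter names here are illustrative."""
    x, y, lam = x0.copy(), x0.copy(), 0.0
    for _ in range(num_iters):
        lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2
        gamma = (1 - lam) / lam_next
        y_next = x - grad(x) / L                 # plain gradient step
        x = (1 - gamma) * y_next + gamma * y     # momentum combination
        y, lam = y_next, lam_next
    return y

# Example: minimize f(x) = ||A x - b||^2, whose gradient is 2 A^T (A x - b)
# and whose gradient Lipschitz constant is 2 * ||A||_op^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L = 2 * np.linalg.norm(A, 2) ** 2
x_star = nesterov_agd(lambda x: 2 * A.T @ (A @ x - b), np.zeros(5), L, 200)
```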

More generally, for any positive integer p, consider the pth-order optimization problem: For known p, L, R, and ϵ, we have a p-times differentiable convex function f: ℝ^d → ℝ whose pth derivative has Lipschitz constant at most L, which means

‖∇^p f(x) − ∇^p f(y)‖ ≤ L ‖x − y‖   for all x, y ∈ ℝ^d, (1)

where ‖·‖ is the ℓ₂ norm (for vectors) or the induced norm (for operators; for a symmetric tensor T, the induced norm is ‖T‖ = max_{‖u‖ = 1} |T[u, …, u]|). Our goal is to find an ϵ-approximate minimum of this function in a ball of radius R, which is any x ∈ B(0, R) that satisfies

f(x) ≤ min_{y ∈ B(0, R)} f(y) + ϵ, (2)

where B(0, R) is the ℓ₂-ball of radius R around the origin. We can access the function through a pth order oracle, which when queried with a point x ∈ ℝ^d outputs

(f(x), ∇f(x), ∇²f(x), …, ∇^p f(x)). (3)

As usual, ∇^k f denotes the kth derivative of f.

Our primary object of study will be the minimum query cost of an algorithm that solves the problem, i.e., the number of queries (or calls) to the oracle in eq. 3 that an algorithm has to make. (For simplicity we assume that the oracle’s output is computed to arbitrarily many bits of precision; this only makes our results stronger, since we prove lower bounds in this paper.) For a fixed p, it seems like this problem has 4 independent parameters, d, L, R, and ϵ, but the parameters are not all independent since we can scale the input and output spaces of the function to affect the latter 3 parameters. Thus the complexity of any algorithm can be written as a function of d and LR^{p+1}/ϵ. In this paper we focus on the high-dimensional setting where d may be much larger than the other parameters, and the best algorithms in this regime have complexity that only depends on LR^{p+1}/ϵ with no dependence on d.

As noted, the p = 1 problem has been studied since the early 80s [Nes83, NY83], and the p = 2 problem has also been considered [Nes08, MS13]. In particular, it was known [Bae09, Nes19] that the pth-order problem can be solved by a deterministic algorithm that makes

O_p((LR^{p+1}/ϵ)^{1/(p+1)}) (4)

oracle calls (note that the query complexity does not have any dependence on the dimension d; of course, actually implementing each query will take poly(d) time, but we only count the number of queries here), where the subscript p in the big-O (or big-Ω) notation means the constant in the big-O can depend on p. In other words, this notation means that we treat p as a constant.

In an exciting recent development, new algorithms were proposed for all p (with very similar complexity) by three independent groups of researchers: Gasnikov, Dvurechensky, Gorbunov, Vorontsova, Selikhanovych, and Uribe [GDG19]; Jiang, Wang, and Zhang [JWZ19]; Bubeck, Jiang, Lee, Li and Sidford [BJL19b]. All three groups develop deterministic algorithms that make

Õ_p((LR^{p+1}/ϵ)^{2/(3p+1)}) (5)

oracle calls.

This algorithm is nearly optimal among deterministic algorithms, since the works [Nes19, ASS19] showed that any deterministic algorithm that solves this problem must make Ω_p((LR^{p+1}/ϵ)^{2/(3p+1)}) queries. However, for randomized algorithms, the known lower bound is weaker. Agarwal and Hazan [AH18] showed that any randomized algorithm must make

(6)

queries. To the best of our knowledge, no lower bounds are known in the setting of high-dimensional smooth convex optimization against quantum algorithms, although quantum lower bounds are known in the low-dimensional setting [CCLW20, vAGGdW20] and for non-smooth convex optimization [GKNS21].

In this work, we close the gap (up to log factors) between the known algorithm and the randomized lower bound for all p. Furthermore, our lower bound also holds against quantum algorithms.

Theorem 1.

Fix any p ∈ ℕ. For all L, R, ϵ > 0, there exists a dimension d and a set of d-dimensional convex functions with pth-order Lipschitz constant L (i.e., satisfying eq. 1) such that any randomized or quantum algorithm that outputs an ϵ-approximate minimum (satisfying eq. 2) for any function f in this set must make

Ω̃_p((LR^{p+1}/ϵ)^{2/(3p+1)}) (7)

queries to a pth order oracle for f (as in eq. 3).

In fact, this lower bound holds even against highly parallel randomized algorithms, where the algorithm can make poly(d) queries in each round and we only count the total number of query rounds (and not the total number of queries). See [BJL19a] for previous work in this setting, including speedups for first-order convex optimization in the low-dimensional setting.

In this introduction, we have deliberately avoided explaining the quantum model of computation to make the results accessible to readers without a background in quantum computing. The entire paper is written so that the randomized lower bound is fully accessible to any reader who does not wish to understand the quantum model and quantum lower bound. For readers familiar with quantum computing, we note that the only thing to be changed to get the quantum model is to modify the oracle in eq. 3 to support queries in quantum superposition. This is done in the usual way, by defining a unitary implementation of the oracle, which allows quantum algorithms to make superposition queries and potentially solve the problem more efficiently than randomized algorithms.

2 High level overview

Let us first consider the lower bound against randomized algorithms. Let us also first look at the special setting where the function itself is Lipschitz (rather than its derivatives), while we still assume access to the gradient (or first-derivative) oracle. To be more precise, the oracle returns subgradients, since gradients need not be defined at all points for Lipschitz convex functions. For this setting, known popularly as nonsmooth convex optimization, the optimal lower bound of Ω(1/ϵ²) is in fact a classical result [NY83]. The proof of this result is very elegant and has been used subsequently to prove several other related lower bounds, such as for parallel randomized algorithms [Nem94], quantum algorithms [GKNS21], etc. Since our proof builds on this framework, we now review it.

Nonsmooth lower bound instance.

The lower bound instance for nonsmooth convex optimization is the function F, chosen as

F(x) = max_{i ∈ [k]} (⟨v_i, x⟩ − iγ), (8)

with k ∈ ℕ, a small offset γ > 0, and where {v_1, …, v_k} comprises orthonormal vectors chosen uniformly at random. The argument essentially shows that

  (i) in order to find an approximate minimizer, one needs to know (essentially) all of the v_i’s, and

  (ii) with high probability, each query reveals at most one new vector v_i.

This yields a lower bound of k queries for achieving an error of roughly 1/√k. Since F is an O(1)-Lipschitz function, rewriting this bound in terms of the error ϵ yields the Ω(1/ϵ²) randomized lower bound.
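As a concrete illustration of this construction, here is a small NumPy sketch of a hard instance of the form of eq. 8 together with its subgradient oracle; the exact offsets and parameter choices in the paper may differ, so treat the constants below as placeholders.

```python
import numpy as np

def make_nonsmooth_instance(d, k, gamma, rng):
    """Sketch of a Nemirovski-Yudin-style hard instance (cf. eq. 8):
    F(x) = max_i ( <v_i, x> - i * gamma ) for random orthonormal v_1, ..., v_k.
    Querying the subgradient oracle at a point typically reveals only the
    single vector v_i attaining the maximum."""
    V = np.linalg.qr(rng.standard_normal((d, k)))[0].T   # k random orthonormal rows
    offsets = gamma * np.arange(1, k + 1)
    def oracle(x):
        scores = V @ x - offsets
        i = int(np.argmax(scores))
        return scores[i], V[i]       # function value and a subgradient
    return oracle

oracle = make_nonsmooth_instance(d=500, k=20, gamma=1e-3, rng=np.random.default_rng(1))
val, g = oracle(np.zeros(500))       # at the origin the maximizing index is i = 1
```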

For the p = 1 setting, known popularly as smooth convex optimization, the optimal lower bound of Ω((1/ϵ)^{1/2}) is also a classical result originally proven in [NY83]. However, the proof of this result in [NY83] is quite complicated and is not widely known. The recent papers [GN15, DG20] provide a much simpler proof of the result by taking the lower bound construction for the nonsmooth setting described above and applying smoothing, which we now review.

Smoothing.

Smoothing refers to the process of approximating a given Lipschitz function F by another function F̃ whose derivatives up to some order are Lipschitz continuous. Since we will be applying this operation to (8), we will describe smoothing in this context.

Definition 1.

An operator S which takes a Lipschitz convex function F to another convex function S(F) is called a (p, δ, μ)-smoothing operation if it satisfies the following:

  1. Smoothness: the pth-order derivatives of S(F) are Lipschitz continuous with parameter μ, and

  2. Approximation: For any x, we have |S(F)(x) − F(x)| ≤ δ.

If we can design a smoothing operation as per the above definition with μ as small as possible, and further ensure that property (ii) above holds, i.e., with high probability each query to the first p derivatives of S(F) reveals at most one new vector v_i, then the proof strategy of the lower bound for nonsmooth convex optimization can be executed on the smoothed instance S(F), thereby giving us a lower bound for pth-order smooth convex optimization. This is the key idea of [GN15, DG20]. Further, the smaller μ is, the better the bound we obtain. However, since F can have discontinuous derivatives, there is a tension between the approximation property, which tries to keep S(F) close to F, and the smoothness property. So one cannot make μ very small after fixing δ. For the rest of this section, we fix the approximation parameter δ in Definition 1.

For the p = 1 setting, there is a well-known smoothing operation known as Moreau/inf-conv smoothing [BC11], which obtains the best possible smoothness parameter μ = Θ(1/δ). This gives the tight query lower bound of Ω((1/ϵ)^{1/2}) for smooth convex optimization.
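For intuition, the sketch below evaluates the Moreau envelope of the one-dimensional function |y| by brute force; the choice of function, parameter λ, and grid are illustrative, and the 1/λ gradient-Lipschitz behaviour it exhibits is the sense in which inf-convolution smoothing trades approximation error for smoothness.

```python
import numpy as np

def moreau_envelope_1d(f, x, lam, ys):
    """Brute-force 1-D Moreau / inf-convolution smoothing:
    (M_lam f)(x) = min_y [ f(y) + (x - y)^2 / (2 * lam) ]."""
    return np.min(f(ys) + (x - ys) ** 2 / (2 * lam))

ys = np.linspace(-5, 5, 20001)
lam = 0.5
# Smoothing f(y) = |y| gives the Huber function: x^2/(2*lam) for |x| <= lam,
# and |x| - lam/2 otherwise, whose gradient is (1/lam)-Lipschitz.
print(moreau_envelope_1d(np.abs, 0.2, lam, ys))   # ~ 0.2**2 / (2*0.5) = 0.04
print(moreau_envelope_1d(np.abs, 2.0, lam, ys))   # ~ 2.0 - 0.25 = 1.75
```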

However, there is no known generalization of inf-conv smoothing for p ≥ 2, so one needs to use a different smoothing operator to extend this proof strategy to query lower bounds for higher-order smooth convex optimization. Given any p, [AH18] indeed construct such a smoothing, called randomized smoothing, which maps Lipschitz convex functions to convex functions whose first p derivatives are Lipschitz. In the general p setting, a smoothing operator with the optimal smoothness parameter would give the optimal lower bound of Ω̃_p((1/ϵ)^{2/(3p+1)}). However, the randomized smoothing of [AH18] achieves a weaker (larger) smoothness parameter, leading to a suboptimal lower bound for pth-order smooth convex optimization.

We design an improved smoothing operation for the specific class of functions in eq. 8, achieving the optimal smoothness parameter, using two key ideas. The first idea is the softmax function with parameter ρ, defined as smax_ρ(x) = (1/ρ) log(∑_i exp(ρ x_i)). If we apply smax_ρ, with a suitable choice of ρ, to functions of the form (8), we obtain the smoothed instance

smax_ρ(⟨v_1, x⟩ − γ, ⟨v_2, x⟩ − 2γ, …, ⟨v_k, x⟩ − kγ), (9)

and we can show that it satisfies Definition 1 with the optimal value of μ. However, any query on the derivatives of this function reveals information about all the vectors v_1, …, v_k simultaneously, since for instance the gradient is given by

∑_{i=1}^{k} w_i(x) v_i,   where   w_i(x) = exp(ρ(⟨v_i, x⟩ − iγ)) / ∑_j exp(ρ(⟨v_j, x⟩ − jγ)). (10)

Consequently, it cannot be directly used to obtain a lower bound. The second idea is that even though the function value and derivatives of the smoothed instance have contributions from all the v_i’s, the contribution is heavily dominated (i.e., up to a negligible error) by the term attaining the maximum, whenever that term exceeds all the others by a sufficient margin.
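The following short sketch makes the issue in eq. 10 concrete: the gradient of the softmax-smoothed instance is a convex combination of all the hidden vectors, so even a single derivative query carries some information about every v_i. The offsets and parameter values are illustrative.

```python
import numpy as np

def softmax_instance_grad(V, offsets, x, rho):
    """Gradient of smax_rho(<v_1,x> - offsets_1, ..., <v_k,x> - offsets_k):
    a convex combination sum_i w_i v_i over *all* rows of V (cf. eq. 10)."""
    s = V @ x - offsets
    w = np.exp(rho * (s - s.max()))
    w /= w.sum()                      # softmax weights, all strictly positive
    return V.T @ w

rng = np.random.default_rng(2)
V = np.linalg.qr(rng.standard_normal((100, 5)))[0].T   # 5 hidden orthonormal vectors
g = softmax_instance_grad(V, 1e-3 * np.arange(1, 6), np.zeros(100), rho=50.0)
print(np.abs(V @ g))                  # every hidden vector has nonzero overlap with g
```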

Based on this insight, we design a new Lipschitz convex function given by the maximum of k auxiliary functions, one for each v_i, each of which is built from the softmax function with appropriately chosen parameters. The key property of this construction is that any two of the auxiliary functions that come close to attaining the maximum at a point have nearly the same value there, and hence (by Lemma 3) nearly the same gradient. This implies that near points where the maximum has a discontinuous gradient, i.e., points where the maximizing term changes, the resulting discontinuity in the gradient is small. In contrast, the change in the gradient of the original instance (8) near its points of discontinuity is of constant size. If we apply randomized smoothing to this new function, the resulting function can then be shown to have a small pth-order Lipschitz constant. The precise details, proved in Lemma 4, are technical and form the bulk of this paper. The same proof strategy immediately yields the same bound on the number of rounds for parallel randomized algorithms, as long as the number of queries in each round is at most polynomial in the dimension; such a number of queries is still not sufficient to obtain information about more than one vector per round. Finally, the same proof strategy can be adapted to the quantum setting using the hybrid argument [BBBV97]. See Section 6 for more details.

3 Smoothing preliminaries

In this section we look at some smoothing functions and their properties. The proofs of these properties can be found in Appendix A.

Let B(x, r) denote the ball of radius r around x.

Definition 2 (Randomized smoothing).

For any function f: ℝ^d → ℝ and real-valued δ > 0, the randomized smoothing operator S_δ produces a new function S_δ f from f with the same domain and range, defined as

(S_δ f)(x) = E_{y ∈ B(x, δ)}[f(y)], (11)

where the expectation is over y drawn uniformly at random from B(x, δ).

This smoothing turns non-smooth functions into smooth functions. If we start with a function that is Lipschitz, then after randomized smoothing the resulting function’s first derivative will be defined and Lipschitz [AH18]. Since we want to construct functions with Lipschitz pth derivatives, we define a p-fold version of randomized smoothing. Recall that p is the same as in the introduction (i.e., we are proving lower bounds on the pth order optimization problem). This operation also depends on a parameter δ that we will fix later.
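A hedged Monte Carlo sketch of the randomized smoothing operator of Definition 2 (averaging f uniformly over a small ball) is given below; the ball-sampling recipe and sample count are standard choices, not specifics from the paper.

```python
import numpy as np

def randomized_smoothing(f, x, delta, num_samples, rng):
    """Monte Carlo estimate of (S_delta f)(x) = E_{y ~ Unif(B(x, delta))} [ f(y) ].
    A uniform point in the ball is a uniform direction times delta * U**(1/d)."""
    d = x.shape[0]
    dirs = rng.standard_normal((num_samples, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = delta * rng.uniform(size=(num_samples, 1)) ** (1.0 / d)
    return np.mean([f(x + z) for z in dirs * radii])

# Example: smoothing the nonsmooth function f(x) = ||x||_2 near the origin.
rng = np.random.default_rng(3)
print(randomized_smoothing(np.linalg.norm, np.zeros(10), delta=0.1,
                           num_samples=20000, rng=rng))
```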

Definition 3 (Smoothing).

The smoothing operator on input outputs the function

(12)

The main properties we require from this smoothing are as follows.

Lemma 1.

For the smoothing operator defined above, the following statements hold true.

  1. For any functions for which are well-defined, .

  2. The value only depends on the values of within a radius of .

  3. The gradient and higher order derivatives of at depend only on the values of within .

  4. If is -Lipschitz in a ball of radius around , then is also -Lipschitz at .

  5. Let be -Lipschitz in a ball of radius around . Then is -times differentiable, and for any , is -Lipschitz in a -ball around with .

  6. Let be -Lipschitz in a ball of radius around . Then .

  7. If is a convex function, then is also a convex function.

We also use the softmax function introduced earlier.

Definition 4 (Softmax).

For a real number ρ > 0, the softmax function is defined as

smax_ρ(x) = (1/ρ) · log(∑_i exp(ρ x_i)). (13)

Let us also define, for , as

(14)

We note the following smoothness properties of softmax.

Lemma 2.

The following are true of the function smax_ρ for any ρ > 0.

  1. The first derivative of smax_ρ can be computed as

    ∂ smax_ρ(x) / ∂ x_j = exp(ρ x_j) / ∑_i exp(ρ x_i). (15)

  2. smax_ρ is 1-Lipschitz and convex.

  3. The higher-order derivatives of smax_ρ satisfy

    (16)
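As a numerical companion to Definition 4 and Lemma 2, the sketch below checks two standard facts about smax_ρ: it over-approximates the maximum by at most (log k)/ρ, and its gradient is a probability vector, which is one way to see that it is 1-Lipschitz and convex. The check itself is illustrative and not part of the paper.

```python
import numpy as np

def smax(x, rho):
    """smax_rho(x) = (1/rho) * log( sum_i exp(rho * x_i) ), computed stably."""
    m = x.max()
    return m + np.log(np.exp(rho * (x - m)).sum()) / rho

def smax_grad(x, rho):
    """Gradient of smax_rho: the softmax probability vector (cf. eq. 15)."""
    w = np.exp(rho * (x - x.max()))
    return w / w.sum()

rng = np.random.default_rng(4)
x, rho = rng.standard_normal(50), 20.0
assert x.max() <= smax(x, rho) <= x.max() + np.log(x.size) / rho
g = smax_grad(x, rho)
print(g.sum(), np.linalg.norm(g))   # sums to 1; Euclidean norm at most 1
```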

We will also need the following lemma, which roughly states that if two softmax functions have nearly the same value at a point, then their gradients at that point are also nearly the same.

Lemma 3.

Let and . If

(17)

Then

(18)

4 Function construction and properties

In this section we define the class of functions used in our randomized (and quantum) lower bound, and state the properties of the function that will be exploited in the lower bound.

Let γ, k, ρ, and δ be parameters to be defined shortly. Here γ and k are the parameters used in the high-level overview (eq. 8), ρ is the parameter required to define smax_ρ, and δ is the parameter used in the definition of the smoothing operator.

Function construction.

Given a list of orthonormal vectors v_1, …, v_k ∈ ℝ^d, which we collectively call V, we recall the vector notation from the overview:

(19)

We can now define our hard function class as follows.

Definition 5.

Let V = {v_1, …, v_k} be a set of orthonormal vectors. The functions f_1, …, f_k, F̂, and F depend on V as follows. Define, for each i ∈ [k], the function f_i as

(20)

Define F̂ = max_{i ∈ [k]} f_i, and let F be obtained by applying the smoothing operator of Definition 3 to F̂.

Note that in the above definition we apply the softmax function to a transformed version of x rather than to x itself. However, since this argument is obtained by applying a unitary transformation to x and then translating, the observations about smax_ρ in Lemmas 2 and 3 also hold for each f_i.

We set , (or ), .

Function properties.

We now state some properties of the function F that will be used to show the lower bounds.

Lemma 4.

For any choice of V, the function F is convex and p-times differentiable, and its pth derivatives satisfy the Lipschitz bound

(21)

where the parameters are as set above.

The proof relies on the fact that smax_ρ is smooth and hence each f_i is smooth. If F̂ coincides with a single f_i in a neighborhood of a point, then F̂ would also be smooth there (by Lemma 1, item 4). If F̂ depends on multiple f_i’s in a neighborhood of a point, then we know that at least two of the softmaxes involved in the definitions of the f_i’s have nearly the same value in that neighborhood, and by Lemma 3 they have nearly the same gradient. This makes F̂ nearly smooth, which will allow us to say that F is smooth at that point (by Lemma 1, item 5).

Proof of Lemma 4.

Each f_i is an instance of softmax applied to a transformed version of x, plus a constant. Since this argument is the vector x transformed by a unitary and then translated, the smoothness and convexity properties of smax_ρ also apply to each f_i. Hence each f_i is convex and p-times differentiable, and its pth derivatives are Lipschitz (see Lemma 2). The function F̂, being a maximum over convex functions, is also convex. By the properties of the smoothing operator (Lemma 1), the function F is also convex.

Let . Let be the minimum number such that there is a point for which . We can rewrite as follows: . We call the smooth term and the non-smooth term. We know that has an upper bound on the Lipschitzness of its -th order derivatives. If all points satisfy , then the non-smooth term is and so it does not change the smoothness of . will maintain this smoothness (see item 4 of Lemma 1).

If the non-smooth term is non-zero at some point in , then we wish to show that the non-smooth term has a small Lipschitz constant in . This would imply, via item 5 of Lemma 1, that the -th order derivative of the smoothing of the non-smooth term with would have a small Lipschitz constant. Towards this let be any point in . Let be the set . The set of subgradients of the non-smooth term at is the convex hull of . So if we show that for an arbitrary , , then we know that the non-smooth part is -Lipschitz at . If , then the gradient is zero. Let us take an (since is the smallest, in fact ). By convexity of the ball and the continuity of and , there must be a point in for which . Note that .

The statement translates to

(22)

Expanding the expression for and we get

(23)
(24)

Since , we have that for any unit vector . Hence

(25)

For all , . So we can conclude from the above that

(26)

Now by Lemma 3, .

Hence the non-smooth part of is -Lipschitz in . The th derivatives of are thus by Lemma 1, -Lipschitz. We know that and , simplifying our bound to . Furthermore, and . Hence we can rewrite our upper bound as . ∎

We now see how to prove the query lower bound on optimizing this function class. In order to do so, we need to introduce some intermediate functions. Let and . Let an oracle call to a function at a point be denoted by . The following results will hold when the set of orthonormal vectors is chosen uniformly at random (or Haar randomly). The next two lemmas about these intermediate functions form the backbone of our lower bound.

Lemma 5.

Fix any . Let be distributed Haar randomly. Conditioned on any fixing of , any query in the unit ball will satisfy with probability at least .

Lemma 6.

Let be distributed Haar randomly. Conditioned on any fixing of , any point in the unit ball will be -optimal for with probability at most .

The proofs of both of these use the following lemma.

Lemma 7.

Fix any . Conditioned on any fixing of , any query in the unit ball will, with probability , satisfy .

Proof.

Note that for , is distributed uniformly at random from a unit sphere in . The following useful concentration statement about random unit vectors follows from [Bal97, Lemma 2.2].

Proposition 2.

Let x ∈ ℝ^d be a nonzero vector. Then for a random unit vector v ∈ ℝ^d and all t ≥ 0,

Pr_v[⟨v, x⟩ ≥ t‖x‖] ≤ e^{−dt²/2}. (27)

Using Proposition 2 and the fact that we have that for any in the unit ball,

Applying a union bound for each of the vectors , we have that with probability at least , . (We use the constant in the lemma statement only because it is a nicer constant than .) ∎
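The tail bound in Proposition 2 can be sanity-checked numerically; the sketch below estimates the probability that a random unit vector has inner product larger than t/√d with a fixed unit direction and compares it with an exp(−t²/2)-type bound. Constants and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, trials, t = 400, 200000, 3.5
v = rng.standard_normal((trials, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)   # random unit vectors
x = np.zeros(d); x[0] = 1.0                     # any fixed unit vector
empirical = np.mean(np.abs(v @ x) >= t / np.sqrt(d))
print(empirical, 2 * np.exp(-t ** 2 / 2))       # empirical tail vs. a bound of that form
```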

Proof of Lemma 5.

To show that , we will show that for all , . Let be the event that satisfies . We will show that . Hence let us assume holds.

We know that satisfies for all . Hence for any , . To show that , it is sufficient to show that for all .

Note that if and only if

Since , the following statement which we will show is in fact stronger.

This can be rewritten as , or .

We know this last statement is true because the RHS is at most which is smaller than , which is (recall that and ).

Since is true with probability  Lemma 7, the lemma follows. ∎

Proof of Lemma 6.

Again, let be the event that satisfies . Let us assume holds.

The value of can be lower bound as follows. Since , . Since is -Lipschitz, (because ).

For , we know each is at most . This is at most

This in turn is at most since and . So .

Since , and so does not optimize .

Since holds with probability , the lemma follows. ∎

To see how Lemmas 5 and 6 above lead us to our lower bound, fix any algorithm making at most k − 1 queries and consider the following experiment. For each t, starting from the first query, when the algorithm makes its tth query do the following.

  • Sample v_t from the space orthogonal to the vectors v_1 to v_{t−1}.

  • Provide the algorithm the value that the corresponding intermediate function returns on the query. Note that this function depends only on the vectors sampled so far.

It follows that the output of the algorithm is independent of the vector v_k (conditioned on the vectors v_1 through v_{k−1}). Now we use Lemma 6 to say that with high probability the output of the algorithm is not ϵ-optimal for F. We can now use Lemma 5 along with the hybrid argument to conclude that with high probability the transcript of this query algorithm is the same as the actual transcript (i.e., the transcript produced had all the vectors been sampled at the beginning and all the queries been made to F). Since the transcripts are the same with high probability, the outputs of the algorithms are also the same with high probability. Hence even when all the queries are made to F, with high probability the output is not ϵ-optimal for F. This proof is made formal as the proof of Theorem 3.
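To make the structure of this experiment explicit, here is a heavily hedged pseudocode-style sketch in Python. The objects `algorithm` and `intermediate_oracle` are placeholders for the query algorithm and the intermediate functions (they are not notation from the paper); the point is only that each hidden vector is sampled lazily, orthogonal to the ones already drawn, and that each answer depends only on the vectors drawn so far.

```python
import numpy as np

def lazy_sampling_experiment(algorithm, intermediate_oracle, d, num_queries, rng):
    """Sketch of the experiment: v_t is sampled only when the t-th query arrives,
    uniformly from the subspace orthogonal to v_1, ..., v_{t-1}, and the query is
    answered by an oracle that depends only on the vectors revealed so far."""
    V = np.zeros((0, d))                      # rows are the vectors revealed so far
    transcript = []
    for _ in range(num_queries):
        x = algorithm.next_query(transcript)
        g = rng.standard_normal(d)
        g -= V.T @ (V @ g)                    # project out the revealed directions
        v = g / np.linalg.norm(g)
        V = np.vstack([V, v])
        answer = intermediate_oracle(x, V)    # uses only the vectors drawn so far
        transcript.append((x, answer))
    return algorithm.output(transcript), V
```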

5 Lower bounds

We can now establish the randomized lower bound using Lemma 4, Lemma 5, and Lemma 6.

Theorem 3.

Let A be a randomized query algorithm making at most k − 1 queries to F. When V is distributed Haar randomly, the probability that the output of A is ϵ-optimal for F is vanishingly small.

Proof.

Let the success probability of A be q when V is distributed Haar randomly. We can fix the randomness of A to get a deterministic algorithm with success probability at least q on the same distribution.

Let us denote the transcript of A as the sequence of its queries and answers together with its output, where the tth entry records the tth query made and the final entry is the output of the algorithm. Note that these are random variables that depend only on V. We now create hybrid transcripts, one for each i. The ith hybrid transcript is defined as the transcript of A when, for every t ≤ i, its tth oracle call (which is supposed to be to F) is replaced with an oracle call to the corresponding intermediate function. Note that

  • For any , .

  • is a function of