How to trap a gradient flow

01/09/2020
by Sébastien Bubeck, et al.

We consider the problem of finding an ε-approximate stationary point of a smooth function on a compact domain of R^d. In contrast with dimension-free approaches such as gradient descent, we focus here on the case where d is finite, and potentially small. This viewpoint was explored in 1993 by Vavasis, who proposed an algorithm which, for any fixed finite dimension d, improves upon the O(1/ε^2) oracle complexity of gradient descent. For example for d=2, Vavasis' approach obtains the complexity O(1/ε). Moreover for d=2 he also proved a lower bound of Ω(1/√(ε)) for deterministic algorithms (we extend this result to randomized algorithms). Our main contribution is an algorithm, which we call gradient flow trapping (GFT), and the analysis of its oracle complexity. In dimension d=2, GFT closes the gap with Vavasis' lower bound (up to a logarithmic factor), as we show that it has complexity O(√(log(1/ε)/ε)). In dimension d=3, we show a complexity of O(log(1/ε)/ε), improving upon Vavasis' O(1 / ε^1.2). In higher dimensions, GFT has the remarkable property of being a logarithmic parallel depth strategy, in stark contrast with the polynomial depth of gradient descent or Vavasis' algorithm. In this higher dimensional regime, the total work of GFT improves quadratically upon the only other known polylogarithmic depth strategy for this problem, namely naive grid search.


1 Introduction

Let f be a smooth function (i.e., the map x ↦ ∇f(x) is 1-Lipschitz, and f is possibly non-convex). We aim to find an ε-approximate stationary point, i.e., a point x such that ‖∇f(x)‖ ≤ ε. It is an elementary exercise to verify that for smooth and bounded functions, gradient descent finds such a point in O(1/ε^2) steps, see e.g., Nesterov (2004). Moreover, it was recently shown in Carmon et al. (2019) that this result is optimal, in the sense that any procedure with only black-box access to f (e.g., to its value and gradient) must, in the worst case, make Ω(1/ε^2) queries before finding an ε-approximate stationary point. This situation is akin to the non-smooth convex case, where the same result (optimality of gradient descent at complexity Θ(1/ε^2)) holds true for finding an ε-approximate optimal point (i.e., x such that f(x) − min f ≤ ε), Nemirovski and Yudin (1983); Nesterov (2004).
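To recall where the O(1/ε^2) rate comes from, here is the standard calculation, sketched for a 1-smooth function f bounded in absolute value by B and ignoring the constraint set:

    f(x_{k+1}) ≤ f(x_k) − ½‖∇f(x_k)‖^2   for the step x_{k+1} = x_k − ∇f(x_k).

If ‖∇f(x_k)‖ > ε for every k < K, then summing this inequality gives f(x_0) − f(x_K) ≥ Kε^2/2, while boundedness gives f(x_0) − f(x_K) ≤ 2B, so that K ≤ 4B/ε^2 = O(1/ε^2).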

There is an important footnote to both of these results (convex and non-convex), namely that optimality only holds in arbitrarily high dimension (specifically, the hard instances in both cases require the dimension d to grow polynomially with 1/ε). It is well-known that in the convex case this large dimension requirement is actually necessary, for cutting plane type strategies (e.g., center of gravity) can find ε-approximate optimal points on compact domains in O(d log(1/ε)) queries. It is natural to ask: is there some analogue to cutting planes for non-convex optimization? (We note that a different perspective on this question from the one developed in this paper was investigated in (Hinder, 2018), where the author asks whether one can adapt actual cutting planes to non-convex settings. In particular Hinder (2018) shows that one can improve upon gradient descent with a cutting plane method, under a higher order smoothness assumption, namely third order instead of first order here.) In dimension d = 1 it is easy to see that one can indeed do a binary search to find an approximate stationary point of a smooth non-convex function on an interval. The first non-trivial case is thus dimension d = 2, which is the focus of this paper (although we also obtain new results in high dimensions, and in particular our approach achieves polylogarithmic parallel depth, see below for details).

This problem, of finding an approximate stationary point of a smooth function on a compact domain of ℝ^d, was studied in 1993 by Stephen A. Vavasis in (Vavasis, 1993). From an algorithmic perspective, his main observation is that in finite dimensional spaces one can speed up gradient descent by using a warm start. Specifically, observe that gradient descent only needs O(δ/ε^2) queries when starting from a δ-approximate optimal point. Leveraging smoothness (see e.g., Lemma 2 below), observe that the best point on a δ-net of the domain will be O(δ^2)-approximate optimal. Thus starting gradient descent from the best point on a δ-net one obtains the complexity O(δ^{-d} + δ^2/ε^2) in dimension d. Optimizing over δ, one obtains a O(ε^{-2d/(d+2)}) complexity. In particular for d = 2 this yields a O(1/ε) query strategy. In addition to this algorithmic advance, Vavasis also proved a lower bound of Ω(1/√ε) for deterministic algorithms. In this paper we close the gap up to a logarithmic term. Our main contribution is a new strategy loosely inspired by cutting planes, which we call gradient flow trapping (GFT), with complexity O(√(log(1/ε)/ε)). We also extend Vavasis' lower bound to randomized algorithms, by connecting the problem with unpredictable walks in probability theory (Benjamini et al., 1998).
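To see how the warm start trade-off balances out, here is the back-of-the-envelope computation (constants ignored; the exponents match the bounds quoted above):

    δ^{-d} (grid queries) + δ^2/ε^2 (gradient descent from the warm start)

is minimized at δ = ε^{2/(d+2)}, giving a total of ε^{-2d/(d+2)} queries. For d = 2 this is 1/ε, and for d = 3 it is ε^{-6/5} = ε^{-1.2}, which is exactly Vavasis' bound recalled in the abstract.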

Although we focus on d = 2 for the description and analysis of GFT in this paper, one can in fact easily generalize to higher dimensions. Before stating our results there, we first make precise the notion of approximate stationary points, and we also introduce the parallel query model.

1.1 Approximate stationary point

We focus on the constraint set [0,1]^d, although this is not necessary and we make this choice mainly for ease of exposition. Let us fix a differentiable function f : [0,1]^d → ℝ such that ‖∇f(x) − ∇f(y)‖ ≤ ‖x − y‖ for all x, y ∈ [0,1]^d. Our goal is to find a point x ∈ [0,1]^d such that for any y ∈ [0,1]^d,

f(y) ≥ f(x) − ε ‖x − y‖ − ½ ‖x − y‖^2 .

We say that such an x is an ε-stationary point (its existence is guaranteed by the extreme value theorem). In particular if x is in the interior of [0,1]^d this means that ‖∇f(x)‖ ≤ ε. More generally, for x ∈ [0,1]^d (possibly on the boundary), let us define the projected gradient ∇̄f(x) at x as the gradient with its infeasible coordinates zeroed out: coordinate i of ∇̄f(x) is set to 0 when x_i = 0 and ∂_i f(x) > 0, or when x_i = 1 and ∂_i f(x) < 0, and equals ∂_i f(x) otherwise (so that ∇̄f(x) = ∇f(x) whenever x is in the interior).

It is standard to show (see also Vavasis (1993)) that x is an ε-stationary point of f if and only if ‖∇̄f(x)‖ ≤ ε.
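For concreteness, here is a minimal sketch of this projected gradient computation on the box; the only point is the coordinate-wise zeroing described above.

```python
import numpy as np

def projected_gradient(grad, x, lo=0.0, hi=1.0):
    """Zero out the coordinates of the gradient that would push x outside [lo, hi]^d.

    At an interior point this returns the gradient itself; x is epsilon-stationary
    exactly when the norm of the returned vector is at most epsilon."""
    g = np.array(grad, dtype=float)
    x = np.asarray(x, dtype=float)
    g[(x <= lo) & (g > 0)] = 0.0   # cannot decrease a coordinate already at the lower bound
    g[(x >= hi) & (g < 0)] = 0.0   # cannot increase a coordinate already at the upper bound
    return g

# usage: at the corner (0, 1) only the feasible components of the gradient survive
print(projected_gradient([0.3, -0.2], [0.0, 1.0]))   # -> [0. 0.]
```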

1.2 Parallel query model

In the classical black-box model, the algorithm can sequentially query an oracle at points x_1, x_2, … and obtain the corresponding values f(x_1), f(x_2), … (technically we consider here the zeroth order oracle model; it is clear that one can obtain a first order oracle model from it, at the expense of a multiplicative dimension blow-up in the complexity, and in the context of this paper such an extra factor is small, so we do not dwell on the distinction between zeroth order and first order). An extension of this model, first considered in (Nemirovski, 1994), is as follows: instead of submitting queries one by one sequentially, the algorithm can submit any number of queries in parallel. One can then count the depth, defined as the number of rounds of interaction with the oracle, and the total work, defined as the total number of queries.

It seems that the parallel complexity of finding stationary points has not been studied before. As far as we know, the only low-depth algorithm (say, depth polylogarithmic in 1/ε) is the naive grid search: simply query all the points on an ε-net of [0,1]^d (it is guaranteed that one point in such a net is an ε-stationary point). This strategy has depth 1, and total work O(ε^{-d}). As we explain next, the high-dimensional version of GFT has depth logarithmic in 1/ε for any fixed dimension, and its total work improves at least quadratically upon grid search.
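As a point of comparison for the parallel model, here is a minimal sketch of that grid-search baseline: every query is submitted in a single round, so the depth is 1, while the total work is the size of the net. The test function is an arbitrary placeholder.

```python
import itertools
import math

def grid_search(f, eps, d):
    """Depth-1 baseline: query an eps-net of [0,1]^d in one parallel round and
    return the best point found. Total work is about (1/eps)^d queries."""
    n = math.ceil(1.0 / eps)                          # points per axis
    axis = [i / n for i in range(n + 1)]              # an eps-net of [0,1]
    batch = list(itertools.product(axis, repeat=d))   # the single batch of parallel queries
    values = [f(x) for x in batch]                    # one round of interaction with the oracle
    best = min(range(len(batch)), key=values.__getitem__)
    return batch[best], values[best]

if __name__ == "__main__":
    f = lambda x: math.cos(3 * x[0]) + 0.5 * (x[1] - 0.5) ** 2   # a smooth non-convex toy function
    print(grid_search(f, 0.05, 2))
```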

1.3 Complexity bounds for GFT

In this paper we give a complete proof of the following near-optimal result in dimension d = 2:

Theorem 1

Let ε ∈ (0, 1). The gradient flow trapping algorithm (see Section 4) finds an ε-stationary point with O(√(log(1/ε)/ε)) queries to the value of f.

It turns out that there is nothing inherently two-dimensional about GFT. At a very high level, one can think of GFT as making hyperplane cuts, just like standard cutting plane methods in convex optimization. While in the convex case those hyperplane cuts are simply obtained from gradients, here we obtain them by querying a fine net on a carefully selected small set of hyperplanes. Note also that the meaning of a “cut” is much more delicate than for traditional cutting plane methods (here we use those cuts to “trap” gradient flows). All of these ideas are more easily expressed in dimension d = 2, but generalizing them to higher dimensions presents no new difficulties (besides heavier notation). In Section 4.4 we prove the following result:

Theorem 2

For any fixed dimension d, the high-dimensional version of GFT finds an ε-stationary point in depth O(log(1/ε)) and in total work O((log(1/ε)/ε)^{(d−1)/2}), where the constants depend only on d.

In particular we see that the three-dimensional version of GFT has complexity O(log(1/ε)/ε). This improves upon the previous state of the art complexity of O(1/ε^1.2) (Vavasis, 1993). However, contrary to the two-dimensional case, we believe that here GFT is suboptimal. As we discuss in Section 5.3, we conjecture that in dimension d = 3 the true complexity is strictly smaller than 1/ε.

In dimensions d ≥ 4, the total work given by Theorem 2 is worse than the total work of Vavasis' algorithm. On the other hand, the depth of Vavasis' algorithm is of the same order as its total work, in stark contrast with GFT, which maintains a logarithmic depth even in higher dimensions. Among algorithms with polylogarithmic depth, the total work given in Theorem 2 is more than a quadratic improvement (in fixed dimension) over the previous state of the art (namely naive grid search).

1.4 Paper organization

The rest of the paper (besides Section 5 and Section 6) is dedicated to motivating, describing and analyzing our gradient flow trapping strategy in dimension 2 (from now on we fix d = 2, unless specified otherwise). In Section 2 we make a basic “local to global” observation about gradient flow which forms the basis of our “trapping” strategy. Section 3 is an informal section on how one could potentially use this local to global phenomenon to design an algorithm, and we outline some of the difficulties one has to overcome. In Section 4 we formally describe our new strategy and analyze its complexity. In Section 5 we extend Vavasis' lower bound to randomized algorithms. Finally we conclude the paper in Section 6 by introducing several open problems related to higher dimensions.

2 A local to global phenomenon for gradient flow

We begin with some definitions. For an axis-aligned rectangle R ⊆ [0,1]^2 we denote its volume by vol(R) and its diameter by diam(R). We further define the aspect ratio of R as the ratio between the lengths of its longer and shorter sides. The edges of R are its four sides, that is, the four maximal axis-aligned segments contained in its boundary, and the boundary of R, which we denote ∂R, is the union of all edges.

If S is a segment and δ > 0, we say that N ⊆ S is a δ-net of S, if for any x ∈ S, there exists some y ∈ N such that ‖x − y‖ ≤ δ. We will always assume implicitly that if N is a δ-net of S, then the endpoints of S are elements of N.

We denote Q_δ(f, S) for the largest value one can obtain by minimizing f on a δ-net of S. Formally,

Q_δ(f, S) = sup_N min_{y ∈ N} f(y),

where the supremum is taken over all δ-nets N of S. We say that a pair of segment/point (S, x) in [0,1]^2 (where S is not a subset of an edge of [0,1]^2) satisfies the property P_δ(α) for some α, δ > 0 if there exists a δ-net N of S such that

min_{y ∈ N} f(y) > f(x) − α · d(x, S) + δ^2/2,

where

d(x, S) = min_{y ∈ S} ‖x − y‖

denotes the distance from x to S. When S is a subset of ∂[0,1]^2 we always say that (S, x) satisfies P_δ(α) (for any α and any δ).
For an axis-aligned rectangle R ⊆ [0,1]^2 and x ∈ R, we say that (R, x) satisfies P_δ(α) if, for any S among the four edges of R, one has that (S, x) satisfies P_δ(α). We refer to x as the pivot for R.
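The primitive behind these definitions is straightforward to implement: place points at most δ apart on a segment (endpoints included) and take the minimum of f over them. The sketch below also checks the trapping condition in the form stated above; it is only an illustration, and the exact constants used in the paper may differ.

```python
import numpy as np

def delta_net(a, b, delta):
    """Points of the segment [a, b] spaced at most delta apart, endpoints included."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    m = max(1, int(np.ceil(np.linalg.norm(b - a) / delta)))
    return [a + (b - a) * t / m for t in range(m + 1)]

def dist_to_segment(x, a, b):
    """Euclidean distance d(x, S) from the point x to the segment S = [a, b]."""
    x, a, b = (np.asarray(v, float) for v in (x, a, b))
    ab = b - a
    t = np.clip(np.dot(x - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return float(np.linalg.norm(x - (a + t * ab)))

def satisfies_property(f, x, a, b, alpha, delta):
    """Check the trapping property for the pair (S, x) with the uniform net above."""
    best = min(f(y) for y in delta_net(a, b, delta))
    return best > f(x) - alpha * dist_to_segment(x, a, b) + delta ** 2 / 2
```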

Our main observation is as follows:

Lemma 1

Let R be a rectangle such that (R, x) satisfies P_δ(α) for some α, δ > 0 and x ∈ R. Then R must contain an α-stationary point (in fact the gradient flow emanating from x must visit an α-stationary point before exiting R).

This lemma will be our basic tool to develop cutting plane-like strategies for non-convex optimization. From “local” information (values on a net of the boundary of R) one deduces a “global” property (the existence of an approximate stationary point in R).

Proof. Let us assume by contradiction that R does not contain an α-stationary point, and consider the unit-speed gradient flow constrained to stay in [0,1]^2. That is, X is the piecewise differentiable function defined by X(0) = x and X′(t) = −∇̄f(X(t))/‖∇̄f(X(t))‖, where ∇̄f is the projected gradient defined in the previous section. Since there is no stationary point in R, it must be that the gradient flow exits R. Let us denote T = inf{t ≥ 0 : X(t) ∉ R}, p = X(T), and S an edge of R such that p ∈ S. Remark that S cannot be part of an edge of [0,1]^2. Furthermore, for any t ∈ [0, T], one has

d/dt f(X(t)) = −‖∇̄f(X(t))‖ < −α.

In particular T ≥ d(x, S), so that

f(p) < f(x) − α · d(x, S).

Lemma 2 below shows that for any δ > 0 one has Q_δ(f, S) ≤ f(p) + δ^2/2, and thus together with the above display it shows that (S, x) does not satisfy P_δ(α), which is a contradiction.

Lemma 2

For any segment S and δ > 0 one has:

Q_δ(f, S) ≤ min_{x ∈ S} f(x) + δ^2/2 .
Proof. Let x* ∈ S be such that f(x*) = min_{x ∈ S} f(x). If x* is an endpoint of S, then we are done since we require the endpoints of S to be in the δ-nets. Otherwise x* is in the relative interior of S, and thus one has ⟨∇f(x*), v⟩ = 0 for any v parallel to S. In particular by smoothness one has, for any y ∈ S:

f(y) ≤ f(x*) + ½ ‖y − x*‖^2 .

Moreover for any δ-net N of S there exists y ∈ N such that ‖y − x*‖ ≤ δ, and thus min_{y ∈ N} f(y) ≤ f(x*) + δ^2/2, which concludes the proof.

Our algorithmic approach to finding stationary points will be to somehow shrink the domain of consideration over time. At first it can be slightly unclear how the newly created boundaries interact with the definition of stationary points. To dispel any mystery, it might be useful to keep in mind the following observation, which states that if (R, x) satisfies the property, then x cannot be on a boundary of R which was not part of the original boundary of [0,1]^2.

Lemma 3

Let R be a rectangle such that (R, x) satisfies P_δ(α) for some α, δ > 0. Then x ∉ ∂R \ ∂[0,1]^2.

Proof. Let S be an edge of R which is not a subset of ∂[0,1]^2. Then by definition of P_δ(α), and by invoking Lemma 2, one has:

f(x) − α · d(x, S) + δ^2/2 < Q_δ(f, S) ≤ min_{z ∈ S} f(z) + δ^2/2 .

In particular if x ∈ S then d(x, S) = 0, and thus f(x) < min_{z ∈ S} f(z) ≤ f(x), which is a contradiction.

3 From Lemma 1 to an algorithm

Lemma 1 naturally leads to the following algorithmic idea (for sake of simplicity in this discussion we replace squares by circles): given some current candidate point x in some well-conditioned domain D (e.g., such that the domain contains, and is contained in, balls centered at x of comparable sizes), query a δ-net on the circle ∂D, and denote y for the best point found on this net. If one finds a significant enough improvement, say f(y) sufficiently below f(x), then this is great news, as it means that one obtained a per query improvement much larger than that of gradient descent (which only improves the value by roughly ε^2 per query). On the other hand if no such improvement is found, then the gradient flow from x must visit an ε-stationary point inside D. (In “essence” (∂D, x) then satisfies the property P; this is only slightly informal since we defined the property for rectangles while ∂D is a circle. In particular the improvement threshold is chosen slightly smaller than what would suffice in the rectangular case, to account for an extra term due to polygonal approximation of the circle. We encourage the reader to ignore this irrelevant technicality.) In other words one can now hope to restrict the domain of consideration to a region inside D, which is a constant fraction smaller than the original domain.

Optimistically this strategy would give an improved rate for bounded smooth functions (since at any given scale one could make only a limited number of improvement steps). In particular, together with the warm start, this would tentatively already improve upon the state-of-the-art rate of Vavasis.

There is however a difficulty in the induction part of the argument. Indeed, what we know after a shrinking step is that the current point x satisfies the guarantee of the property with respect to the previous circle. Now we would like to query a net on the new, smaller circle. Say that after such querying we find that we can't shrink, namely we found no point improving significantly upon f(x), and in particular, up to the discretization error, the guarantee holds for every point of the new circle. Could the gradient flow from x escape the original circle without visiting an ε-stationary point? Unfortunately the answer is yes. Indeed, because of the discretization error of order δ^2, one cannot rule out that there would be a point on the previous circle whose value lies slightly below the certified level, and since such a point is only at a small distance from the new circle, it could be attained from x by a gradient flow without ε-stationary points. Of course one could say that instead of satisfying the property with the original parameter we now only satisfy it with a slightly worse one, and try to control the increase of the approximation guarantee, but such an approach would not improve upon the O(1/ε^2) complexity of gradient descent (simply because we could additively worsen the approximation guarantee too many times).

The core part of the above argument will remain in our full algorithm (querying a δ-net to shrink the domain). However it is made more difficult by the discretization error, as we just saw. We also note that this discretization issue does not appear in discrete spaces, which is one reason why discrete spaces are much easier than continuous spaces for local optimization problems.

Technically we observe that the whole issue of discretization comes from the fact that when we update the center, we move closer to the boundary, which we “pay” for in the distance term d(x, S) of the property, and we cannot “afford” it because of the discretization error term that we suffer when we update. Thus this issue would disappear if in our induction hypothesis we had a stronger guarantee for the boundary. Our strategy will work in two steps: first we give a querying strategy which ensures that one can always shrink the domain with the stronger guarantee for the newly created boundary, and secondly we give a method to essentially turn a weaker boundary guarantee into the stronger one.

4 Gradient flow trapping

We say that a pair (R, x) is a domain if R ⊆ [0,1]^2 is an axis-aligned rectangle with aspect ratio bounded by 3, and x ∈ R. The gradient flow trapping (GFT) algorithm is decomposed into two subroutines:

  1. The first algorithm, which we call the parallel trap, takes as input a domain (R, x) satisfying the property P with appropriate parameters. It returns a domain (R′, x′) satisfying the property and such that vol(R′) is smaller than vol(R) by a constant factor. The precise parameters and the query cost of this step are given in Section 4.2.

  2. The second algorithm, which we call edge fixing, takes as input a domain (R, x) satisfying the property P (for some parameters) and such that some of the edges of R also satisfy a stronger version of the property. It returns a domain (R′, x′) such that either (i) R′ = R and one more edge satisfies the stronger property (while all edges still satisfy the weaker one), or (ii) all edges of R′ satisfy the stronger property and furthermore vol(R′) is smaller than vol(R) by a constant factor. The precise parameters and the query cost of this step are given in Section 4.3.

Equipped with these subroutines, GFT proceeds as follows. Initialize R_1 = [0,1]^2 together with an arbitrary pivot x_1 ∈ R_1 (every edge of R_1 lies on ∂[0,1]^2, so (R_1, x_1) trivially satisfies the required property). For k = 1, 2, …:

  • If every edge of the current domain (R_k, x_k) satisfies the stronger property, call parallel trap on (R_k, x_k), and let (R_{k+1}, x_{k+1}) be its output (updating the associated parameters accordingly).

  • Otherwise call edge fixing on (R_k, x_k), and let (R_{k+1}, x_{k+1}) be its output. Depending on whether the output falls in case (i) or case (ii) above, update the associated parameters and the set of edges known to satisfy the stronger property accordingly.

We terminate once the diameter of the current rectangle is smaller than a threshold of order ε, and return the current pivot (a schematic sketch of this loop is given below).
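Schematically, and with all the parameter bookkeeping elided, the loop reads as follows; `parallel_trap` and `edge_fixing` stand for the two subroutines of this section, and the details of how the threshold and the edge bookkeeping are maintained are deliberately left out of this sketch.

```python
def gradient_flow_trapping(f, eps, parallel_trap, edge_fixing):
    """Structural sketch of the GFT loop: alternate between the two subroutines
    until the current rectangle is small enough, then return the pivot."""
    rect = ((0.0, 0.0), (1.0, 1.0))   # R_1 = [0,1]^2, stored as (lower-left, upper-right)
    pivot = (0.5, 0.5)                # any initial pivot works: all edges of R_1 lie on the unit square
    fixed_edges = set()               # edges currently known to satisfy the stronger property

    while diameter(rect) > eps:
        if len(fixed_edges) == 4:     # every edge satisfies the stronger property
            rect, pivot = parallel_trap(f, rect, pivot)
            fixed_edges = set()       # bookkeeping restarts on the new rectangle (sketch only)
        else:
            rect, pivot, fixed_edges = edge_fixing(f, rect, pivot, fixed_edges)
    return pivot

def diameter(rect):
    (x0, y0), (x1, y1) = rect
    return ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
```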

Next we give the complexity analysis of GFT, assuming the claimed properties of the subroutines parallel trap and edge fixing in 1. and 2. above. We then proceed to describe the subroutines in detail, and prove that they satisfy the claimed properties.

4.1 Complexity analysis of GFT

The following three lemmas give a proof of Theorem 1.

Lemma 4

GFT stops after at most O(log(1/ε)) steps.

Proof. First note that at least one out of five steps of GFT reduces the volume of the domain by a constant factor (since one can do at most four steps in a row of edge fixing without volume decrease). Thus on average the volume decrease per step is at least a constant factor, i.e., vol(R_k) decreases geometrically with k. In particular since R_k has aspect ratio smaller than 3, it is easy to verify that diam(R_k) ≤ √(6 vol(R_k)). Thus for any k larger than a constant multiple of log(1/ε), one must have diam(R_k) below the termination threshold. Thus we see that GFT performs at most O(log(1/ε)) steps.
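In symbols, writing c < 1 for the constant factor (left unspecified here) by which the volume shrinks over any five consecutive steps, and using vol(R_1) = 1, the chain of estimates is:

    vol(R_k) ≤ c^{⌊k/5⌋},    diam(R_k) ≤ √(6 vol(R_k)) ≤ √6 · c^{⌊k/5⌋/2},

so the diameter drops below any threshold of order ε once k exceeds a constant multiple of log(1/ε).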

Lemma 5

When GFT stops, its pivot is an O(ε)-stationary point.

Proof. First note that the property P is maintained throughout the run, with parameters of order ε. In particular by Lemma 1, the final rectangle R_k must contain an O(ε)-stationary point, and since its diameter is less than a threshold of order ε, it must be (by smoothness) that the pivot x_k is an O(ε)-stationary point.

Lemma 6

GFT makes at most O(√(log(1/ε)/ε)) queries before it stops.

Proof. As we saw in the proof of Lemma 4, the volume (and hence the diameter) of R_k decreases geometrically with k. Furthermore the k-th step requires a number of queries proportional to the length of the segments queried at that step divided by the net resolution. Thus the total number of queries is bounded by a geometric sum whose value is O(√(log(1/ε)/ε)).

4.2 A parallel trap

Let (R, x) be a domain. We define two segments ℓ_1 and ℓ_2 in R as follows. Assume that R is a translation of [0, a] × [0, b] with a ≥ b. For sake of notation assume that in fact R = [0, a] × [0, b] and write x = (x_1, x_2) for the pivot (in practice one always ensures this situation with a simple change of variables). Now we define ℓ_1 and ℓ_2 to be two parallel segments of R, orthogonal to its longer side, which split R into three pieces of comparable size (see Figure 1).

Figure 1: The parallel trap

The parallel trap algorithm queries a δ-net on both ℓ_1 and ℓ_2 (which costs at most O(diam(R)/δ) queries). Denote y for the best point (in terms of value) found on the union of those nets. That is, denoting N for the union of the two queried δ-nets,

y ∈ argmin_{z ∈ N} f(z) .

One has the following possibilities (see Figure 2 for an illustration):

  • If then we set and .

  • Otherwise we set . If we set , and if we set .

The above construction is justified by the following lemma (a trivial consequence of the definitions), and it proves in particular the properties of parallel trap described in 1. at the beginning of Section 4.

Lemma 7

The rectangle R′ has aspect ratio smaller than 3, and it satisfies the volume guarantee of the parallel trap. Furthermore if (R, x) satisfies the input property, then (R′, x′) satisfies the output property.

Proof. The first sentence is trivial to verify. For the second sentence, first note that for any edge of one has for since by assumption one has for and furthermore . Next observe that has at most one new edge with respect to , and this edge is at distance at least from , thus in particular one has for . Furthermore by definition , and thus , or in other words satisfies .

Figure 2: The three possible cases for R′, marked in red.

4.3 Edge fixing

Let (R, x) be a domain satisfying the property P for some parameters, and with some edges possibly also satisfying the stronger property. Denote S for the closest edge to x that does not satisfy the stronger property, and let r = d(x, S). We will consider three candidate smaller rectangles, R_1, R_2 and R_3 (we need three candidates to ensure that the domain will shrink), as well as three candidate pivots (in addition to x), x_1, x_2 and x_3. The rectangles R_i are obtained from R by cutting along segments parallel to S, at distances determined by r. The possible output of edge fixing will be either (R_i, x_i) for some i, or (R, x_j) for some j (see Figure 3 for a demonstration of how to construct R_1).

To guarantee the properties described in 2. at the beginning of Section 4 we will prove the following: if the output is (R_i, x_i) for some i then all edges will satisfy the stronger property (Lemma 10 below) and the domain has shrunk (Lemma 8 below), and if the output is (R, x_j) then one more edge satisfies the stronger property compared to (R, x), while all edges still satisfy at least the weaker one (Lemma 9 below).

Lemma 8

For any i, the volume of R_i is smaller than the volume of R by a constant factor. Furthermore if the aspect ratio of R is smaller than 3, then so is the aspect ratio of R_i.

Proof. Let us denote a_i for the length of R_i in the axis of S (the edge whose distance to x defines r), and b_i for its length in the orthogonal direction (and similarly define a and b for R).

Since one has . Furthermore and , so that . This implies that .

For the second statement observe that the two side lengths of R are comparable (the first inequality is by assumption on the aspect ratio of R, the second inequality is by definition of r). Given this estimate, the construction of R_i implies that its aspect ratio is smaller than 3, which concludes the proof.

Queries and choice of output.

The edge fixing algorithm queries a δ-net on a segment associated with each R_i, i ∈ {1, 2, 3} (thus three nets in total), and we define y_i to be the best point found on each respective net.

If for all one has

(1)

then the output keeps the rectangle R (with a new pivot). Otherwise denote i for the smallest index which violates (1), and set the output to (R_i, x_i).

Figure 3: Edge fixing: the rectangle R_1 is marked in red.
Lemma 9

If then satisfies . Furthermore for any edge of , if satisfies (respectively ) then so does .

Proof. Since it means that . In particular since satisfies one has , and thus now one has which means that satisfies .

Let us now turn to some other edge of . Certainly if satisfies then so does since . But, in fact, even is preserved since by the triangle inequality (and ) one has

Lemma 10

If the output is (R_i, x_i) for some i, then all edges of R_i satisfy the stronger property.

Proof. By construction, if , then for any edge of one has . Furthermore one has , and thus by definition one then has for whenever . If then by the triangle inequality, , and moreover is also an edge with respect to . Thus from the definition of , satisfies . Also by our choice of , we know that . Hence satisfies as well.

4.4 Generalization to higher dimensions

As explained in the introduction, there is no reason to restrict GFT to dimension 2 and, in fact, the algorithm may be readily adapted to higher-dimensional spaces, such as [0,1]^d for d > 2. We now detail the necessary changes and derive the complexity announced in Theorem 2.

First, if H is an affine hyperplane and x ∈ [0,1]^d, we define the property P for the pair (H ∩ [0,1]^d, x) in the obvious way (i.e., same definition except that we consider a δ-net of H ∩ [0,1]^d). Similarly for (K, x), when K is an axis-aligned hyperrectangle.

Gradient flow trapping in higher dimensions replaces every line by a hyperplane, and every rectangle by a hyperrectangle. In particular at each step GFT maintains a domain (K, x), where K is a hyperrectangle with aspect ratio bounded by 3, and x ∈ K. The two subroutines are adapted as follows:

  1. Parallel trap works exactly in the same way, except that the two lines ℓ_1 and ℓ_2 are replaced by two corresponding affine hyperplanes. In particular the query cost of this step is now of order (diam(K)/δ)^{d−1}, and the volume shrinks by at least a constant factor (depending only on d).

  2. In edge fixing, we now have three hyperrectangles R_1, R_2, R_3, and we need to query nets on their faces. Thus the total cost of this step is again of order (diam(K)/δ)^{d−1}. Moreover, suppose that the domain does not shrink at the end of this step and the output is a domain (K, x′) for some other pivot x′. In this case we know that K has some face F, such that (F, x) did not satisfy the stronger property, but (F, x′) does satisfy it. It follows that we can run edge fixing at most 2d times before the domain shrinks.

We can now analyze the complexity of the high-dimensional version of GFT:

Proof.[Of Theorem 2] First observe that, if K is a hyperrectangle in ℝ^d with aspect ratio bounded by 3, then we have the following inequality,

diam(K) ≤ 3 √d · vol(K)^{1/d} .
By repeating the same calculations done in Lemma 4 and the observation about parallel trap and edge fixing made above, we see that the domain shrinks at least once in every 2d + 1 steps, so that at step k the volume of the domain has decreased geometrically in k (at a rate depending only on d), and hence, by the inequality above, so has its diameter. Since the algorithm stops when the diameter falls below a threshold of order ε, the number of steps is of order log(1/ε) for any fixed dimension d. The total work done by the algorithm is evident now by considering the number of queries at each step.

5 Lower bound for randomized algorithms

In this section, we show that any randomized algorithm must make at least 1/√ε queries (up to logarithmic factors) in order to find an ε-stationary point. This extends the lower bound in (Vavasis, 1993), which applied only to deterministic algorithms. In particular, it shows that, up to logarithmic factors, adding randomness cannot improve the algorithm described in the previous section.

For an algorithm A, a function f and ε > 0 we denote by T(A, f, ε) the number of queries made by A in order to find an ε-stationary point of f. Our goal is to bound from below

inf_A sup_f E[T(A, f, ε)],

where the infimum is taken over all random algorithms A and the supremum is taken over all smooth functions f. The expectation is with respect to the randomness of A. By Yao's minimax principle we have the equality

inf_A sup_f E[T(A, f, ε)] = sup_D inf_A E_{f ∼ D}[T(A, f, ε)].

Here, on the right hand side, A is a deterministic algorithm and D is a distribution over smooth functions. The rest of this section is devoted to proving the following theorem:

Theorem 3

Let be a decreasing function such that

and set

(2)

Then,

Remark that one may take in the theorem. In this case , and , which is the announced lower bound.

One of the main tools utilized in our proof is the construction introduced in (Vavasis, 1993). We now present the relevant details.

5.1 A reduction to monotone path functions

Let G_n stand for the grid graph: its vertices are the points of {0, 1, …, n}^2 and its edges connect vertices at ℓ_1-distance 1.

We say that a sequence of vertices v_1, …, v_m is a monotone path in G_n if v_1 = (0, 0) and, for every i < m, v_{i+1} equals either v_i + (1, 0) or v_i + (0, 1). In other words, the path starts at the origin and continues each step by either going right or up. If P = (v_1, …, v_m) is a monotone path, we associate to it a monotone path function f_P : G_n → ℝ, defined so that its values strictly decrease along the path and take a fixed default value pattern off the path.

By a slight abuse of notation, we will sometimes refer to the path function and the path itself as the same entity. For a vertex v_i on the path we write (v_1, …, v_i) for the corresponding prefix of the path. If v is a vertex at which f_P takes the default off-path value, we say that v does not lie on the path.
We denote the set of all monotone path functions on G_n by M_n. It is clear that if f_P ∈ M_n then the endpoint of the path is the only local minimum of f_P and hence the global minimum.
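To make the last claim concrete, here is a toy instantiation; the specific off-path values (the ℓ_1 norm) are a natural illustrative choice and not necessarily the ones used in (Vavasis, 1993).

```python
import random

def random_monotone_path(n):
    """A monotone path in {0,...,n}^2: start at (0,0) and repeatedly go right or up."""
    path, v = [(0, 0)], (0, 0)
    while v != (n, n):
        moves = [m for m in ((1, 0), (0, 1)) if v[0] + m[0] <= n and v[1] + m[1] <= n]
        m = random.choice(moves)
        v = (v[0] + m[0], v[1] + m[1])
        path.append(v)
    return path

def path_function(path):
    """Toy monotone path function: value -i at the i-th path vertex, ||v||_1 off the path."""
    on_path = {v: -i for i, v in enumerate(path)}
    return lambda v: on_path.get(v, v[0] + v[1])

def is_local_min(f, v, n):
    """A vertex is a local minimum if no grid neighbor has a strictly smaller value."""
    nbrs = [(v[0] + dx, v[1] + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= v[0] + dx <= n and 0 <= v[1] + dy <= n]
    return all(f(v) <= f(w) for w in nbrs)

if __name__ == "__main__":
    n = 6
    path = random_monotone_path(n)
    f = path_function(path)
    mins = [(i, j) for i in range(n + 1) for j in range(n + 1) if is_local_min(f, (i, j), n)]
    assert mins == [path[-1]]   # the endpoint of the path is the unique local (and global) minimum
    print("unique local minimum:", mins[0])
```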

Informally, the main construction in (Vavasis, 1993) shows that for every monotone path function f_P there is a corresponding smooth function F_P on [0,1]^2, which 'traces' the path P in [0,1]^2 and preserves its structure. In particular, finding an ε-stationary point of F_P is not easier than finding the minimum of f_P.

To formally state the result we fix ε > 0 and assume for simplicity that the corresponding grid size n is an integer. We henceforth identify G_n with a uniform grid in [0,1]^2 in the following way: to each vertex v ∈ G_n we associate a square of side length of order 1/n. If P is a monotone path, then the associated region of [0,1]^2 is the closure of the union of the squares of the vertices of P.

Lemma 11 (Section 3, (Vavasis, 1993))

Let P be a monotone path in G_n. Then there exists a function F_P : [0,1]^2 → ℝ with the following properties:

  1. F_P is smooth.