On the Gap Between Strict-Saddles and True Convexity: An Omega(log d) Lower Bound for Eigenvector Approximation

04/14/2017 ∙ by Max Simchowitz, et al.

We prove a query complexity lower bound on rank-one principal component analysis (PCA). We consider an oracle model where, given a symmetric matrix M ∈ R^{d×d}, an algorithm is allowed to make T exact queries of the form w^(i) = Mv^(i) for i ∈ {1,...,T}, where v^(i) is drawn from a distribution which depends arbitrarily on the past queries and measurements {v^(j), w^(j)}_{1 ≤ j ≤ i-1}. We show that for a small constant ϵ, any adaptive, randomized algorithm which can find a unit vector v for which v^⊤Mv > (1-ϵ)‖M‖, with even small probability, must make T = Ω(log d) queries. In addition to settling a widely-held folk conjecture, this bound demonstrates a fundamental gap between convex optimization and "strict-saddle" non-convex optimization, of which PCA is a canonical example: in the former, first-order methods can have dimension-free iteration complexity, whereas in PCA, the iteration complexity of gradient-based methods must necessarily grow with the dimension. Our argument proceeds via a reduction to estimating the rank-one spike in a deformed Wigner model. We establish lower bounds for this model by developing a "truncated" analogue of the χ^2 Bayes-risk lower bound of Chen et al.




1 Introduction

A major open problem in machine learning and optimization is identifying classes of non-convex problems that admit efficient optimization procedures. Motivated by the empirical successes of matrix factorization/completion [86, 68, 49], sparse coding [64], phase retrieval [34] and deep neural networks [48, 12], a growing body of theoretical work has demonstrated that many gradient and local-search heuristics, inspired by convex optimization, enjoy sound theoretical guarantees in a wide variety of non-convex problems [22, 18, 62, 44, 10, 11, 79, 80, 15, 85, 82]. Notably, Ge et al. [39] introduced a polynomial-time noisy gradient algorithm for computing approximate local minima of non-convex objectives which have the “strict-saddle property”: that is, objectives whose first-order stationary points are either local minima, or saddle points at which the Hessian has a strictly negative eigenvalue. It has since been shown that many well-studied non-convex problems can be formulated as “strict saddle” objectives whose local minimizers are all globally optimal (or near-optimal) [78, 50, 40, 15], thereby admitting efficient optimization by local search.

Recently, Jin et al. [45] proposed a gradient algorithm which finds an approximate local minimum of a strict saddle objective in a number of iterations which matches first-order methods for comparable convex problems, up to poly-logarithmic factors in the dimension. This might seem to suggest that, from the perspective of first-order optimization, strict saddle objectives and truly convex problems are identical. But there is a caveat: unlike the algorithm proposed by Jin et al. [45], the iteration complexity of first-order methods for optimizing truly convex functions typically has no explicit dependence on the ambient dimension [21, 61]. This raises the question:

Does the iteration complexity of first order methods for strict saddle problems necessarily depend on the ambient dimension? Stated otherwise, is there a gap in the complexity of first-order optimization for “almost-convex” and “truly convex” problems?

This paper answers the above questions in the affirmative by considering perhaps the simplest and most benign strict saddle problem: approximating the top eigenvector of a symmetric matrix M ∈ R^{d×d}, also known as rank-one PCA. The latter is best cast as a strict-saddle problem with objective function v ↦ v^⊤Mv to be maximized, subject to the smooth equality constraint ‖v‖ = 1 [39, 50]. We show that the gradient query complexity of rank-one PCA necessarily scales with the ambient dimension, even in the “easy” regime where the eigengap is bounded away from zero.

More precisely, we consider an oracle model, where given a symmetric matrix M ∈ R^{d×d}, an algorithm is allowed to make T exact queries of the form w^(i) = Mv^(i) for i ∈ {1,...,T}, where v^(i) is drawn from a distribution which may depend arbitrarily on the past queries {v^(j), w^(j)}_{1 ≤ j ≤ i-1}; these queries are precisely the rescaled gradients of the objective v ↦ v^⊤Mv. We show (Theorem 2.1) that any adaptive, randomized algorithm which finds a unit vector for which for any symmetric matrix whose second-eigenvalue is at most times its leading eigenvalue in magnitude must make at least queries. This matches the performance of the power method and Lanczos algorithms as long as is bounded away from one.
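To make the "rescaled gradient" remark concrete, here is a short worked identity. It assumes the objective is the quadratic form f(v) = v^⊤Mv from the abstract; the symbol f is introduced here only for illustration.

```latex
\nabla_v \left( v^\top M v \right) = (M + M^\top)\, v = 2 M v
\quad \text{for symmetric } M,
\qquad \text{so} \qquad
w^{(i)} = M v^{(i)} = \tfrac{1}{2}\, \nabla_v \left( v^\top M v \right) \Big|_{v = v^{(i)}} .
```

Under this assumption, each exact query returns half the gradient of the quadratic form at the queried point, which is why a query lower bound can be read as a lower bound on gradient evaluations.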

In fact, we show that if is bounded by a small constant times , then the probability of finding a unit vector with objective value at least is as small as . We also show (Theorem 2.3) that given any , it takes adaptive queries to test if the operator norm of is above the threshold , or below . Our lower bounds are based on the widely studied deformed Wigner random matrix model [51, 33], suggesting that the factor should be regarded as necessary for “typical” symmetric matrices , not just for some exceptionally adversarial instances.

1.1 Proof Techniques

We reduce the problem of top-eigenvector computation to adaptively estimating the rank-one component of the deformation of a Wigner matrix [5, 33, 51] in our query model. Here, controls the eigengap, and is drawn uniformly from the unit sphere. Unlike many lower bounds for active learning [38, 3, 43], it is insufficient to assume that the algorithm may take the most informative measurements in hindsight, since that would entail estimating with only -measurements. As a first pass, we use a recursive application of Fano’s method, similar to the strategy adopted in Price and Woodruff [66] for proving lower bounds on adaptive estimation. This method bounds the rate at which information is accumulated by controlling the information obtained from the -th measurement in terms of the information gained from measurements . Unfortunately, in our setting, this technique can only establish a lower bound of queries.

To sharpen our results, we adopt an argument based on a χ²-divergence analogue of Fano’s inequality, introduced in Chen et al. [26]. But whereas the KL-divergence computations in Fano’s inequality allow us to decompose the information obtained at each round into a sum, the adaptivity of the algorithm introduces correlations between the likelihood ratios that appear in the computations. Thus, we need to carefully truncate the distributions that arise in our lower bound construction, and restrict them to some carefully-defined “good events”. This permits us to bound the rate of information-accumulation.

In general, the probability of these good events conditioned on the spike may vary, and thus treating the truncated probabilities as conditional distributions introduces serious complications. To simplify things, we observe that the theory of f-divergences, from which Fano’s inequality and the χ²-analogue in Chen et al. [26] are derived, can be generalized straightforwardly to non-normalized measures, i.e., truncated probability distributions. We therefore derive a general version of the Bayes-risk lower bound from Chen et al. [26] for non-normalized distributions, which, when specialized to χ², enables us to prove a sharp lower bound of Ω(log d) queries.

To prove the lower bound on testing the spectral norm of , we reduce the problem to that of testing the null hypothesis for a Wigner matrix , against an alternative hypothesis , where is drawn uniformly on the sphere and some sufficiently positive . The bound mainly follows from Pinsker’s inequality (similarly to the combinatorial hypothesis testing lower bound in Addario-Berry et al. [2]) but again with the added nuance of the need to truncate our likelihood ratios due to the adaptivity of the algorithm.

1.2 Related Work

Oracle Lower Bounds for Optimization. In their seminal work, Nemirovskii and Yudin [60] established lower bounds on the number of calls an algorithm must make to a gradient-oracle in order to approximately optimize a convex function. While they match known upper bounds in terms of dependence on relevant parameters (accuracy, condition number, Lipschitz constant), the constructions are regarded as brittle [8]: the construction considers a worst-case initialization, and makes the strong assumption that the point whose gradient is queried lies in the affine space spanned by the gradients queried up to that iterate. Arjevani and Shamir [9] address some of the weaknesses of the lower bounds in [60] (e.g., allowing some randomization), but at the expense of placing more restrictive assumptions on the class of optimization algorithms considered. In contrast, our lower bound places no assumptions on how the algorithm chooses to make its successive queries.

In recent years, lower bounds have been established for stochastic convex optimization [3, 43] where each gradient- or function-value oracle query is corrupted with i.i.d. noise. While these lower bounds are information-theoretic, and thus unconditional, they do not hold in the setting considered in this work, where we are allowed to make exact, noiseless queries. As mentioned above, the strategy for proving lower bounds in the exact-oracle model is quite different from that in the noisy-oracle setting.

Active Learning and Adaptive Data Analysis. Our proof technique casts eigenvector computation as a type of sequential estimation problem; such problems have been studied at length in the context of sparse recovery and active adaptive compressed sensing [7, 66, 25, 24]. Due to the noiseless oracle model, our setting is most similar to [66], whereas [7, 25, 24] study measurement noise. Our setting also exhibits similarities to the stochastic linear bandit problem [73]. More broadly, query complexity has received much recent attention in the context of communication-complexity [6, 59], in which lower bounds on query complexity imply corresponding bounds against communication via lifting theorems. Similar ideas also arise in understanding the implications of memory constraints on statistical learning [77, 76, 70].

Local Search for Non-Convex Optimization. As mentioned in the introduction, there has been a flurry of recent work establishing the efficacy and correctness of local search algorithms in numerous non-convex problems, including Dictionary Learning [11, 80], Matrix Factorization [15, 82, 85], Matrix Completion [40, 16], Phase Retrieval [18, 79, 22], and training neural networks [44]. Particular attention has been devoted to avoiding saddle points in nonconvex landscapes [50, 39, 45, 78], which, without further regularity assumptions, are known to render the task of finding even local minimizers computationally hard [55]. Recent work has also considered second-order algorithms for non-convex optimization [80, 4, 23]. However, to the best of the authors’ knowledge, the lower bounds presented in this paper are the first which show a gap in the iteration complexity of first-order methods for convex and “benign” non-convex objectives.

PCA, Low-Rank Matrix Approximation and Norm-Estimation. The growing interest in non-convexity has also spurred new results in eigenvector computation, motivated in part by the striking resemblance between eigenvector approximation algorithms (e.g., the power method, the Lanczos algorithm [31], Oja’s algorithm [71], and newer, variance-reduced stochastic gradient approaches [35, 71]) and analogous first-order convex optimization procedures. Recent works have also studied PCA in the streaming [71], communication-bounded [37, 13], and online learning settings [36]. More generally, eigenvector approximation is widely regarded as a fundamental algorithmic primitive in machine learning [46], numerical linear algebra [31], optimization, and numerous graph-related learning problems [75, 65, 63]. While lower bounds have been established for PCA in the memory- and communication-limited settings [19, 70], we are unaware of lower bounds that pertain to the noiseless query model studied in this work.

Rank-one PCA may be regarded as one of the simplest low-rank matrix approximation problems [42]. The numerical linear algebra community has studied low-rank matrix approximation far more broadly, with an eye towards computation-, memory-, and communication-efficient algorithms [58, 57, 67], as well as algorithms which take advantage of the sparsity of their inputs [28, 74, 27]. Previous work has also studied the problem of estimating functions of a matrix’s singular values [53], including the special case of estimating Schatten p-norms [52]. To the best of our knowledge, lower bounds for sketching concern the cases where the sketches are chosen non-adaptively.

2 Statement of Main Results

Let denote the -norm on , and let denote the unit sphere. Let denote the set of symmetric matrices, and for , we let denote its eigenvalues in decreasing order, denote the corresponding eigenvectors, and overload to denote its operator norm.

Definition 2.1 (Eigenratio).

For , we define the set of matrices with positive leading eigenvector and bounded eigenratio between its first and second eigenvalues:


The iteration complexity of rank-one PCA is typically stated in terms of the eigengap . This work instead focuses on lower bounds which hold when the eigengap is close to , motivating our parameterization in terms of the eigenratio instead. We now define our query model:

Definition 2.2 (Query Model).

An adaptive query algorithm with query complexity T is an algorithm which, for rounds i ∈ {1,...,T}, queries an oracle with a vector v^(i), and receives a noiseless response w^(i) = Mv^(i). At the end of T rounds, the algorithm returns an output vector. The queries v^(i) and the output are allowed to be randomized and adaptive, in that v^(i) is a function of {(v^(1), w^(1)), ..., (v^(i-1), w^(i-1))}, as well as some initial random seed. We say that the algorithm is deterministic if, for all i, v^(i) is a deterministic function of {(v^(1), w^(1)), ..., (v^(i-1), w^(i-1))}. We say that the algorithm is non-adaptive if, for all i, the distribution of v^(i) is independent of the observations w^(1), ..., w^(i-1) (but may depend on the past queries v^(1), ..., v^(i-1)).

Example 2.1.

The Power Method and Lanczos algorithms [31] are both randomized, adaptive query methods. Even though the iterates of the Lanczos and power methods converge to the top eigenvector at different rates, they are nearly identical algorithms from our query-complexity perspective: both identify on the Krylov space . The only difference is that the Lanczos algorithm selects in a more intelligent manner than the power method. Running the power method from a deterministic initialization would be a non-randomized algorithm. Any non-randomized algorithm, even an adaptive one, must take queries in the worst case, since queries can only identify a matrix up to a dimensional subspace. Randomized, but non-adaptive algorithms need to take queries as well, as established formally in Li et al. [52].
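As an illustration of how the power method fits the query model, the following is a minimal sketch in Python with NumPy. The class name `MatVecOracle`, the function names, and the test matrix are all hypothetical and chosen only for this example; the oracle simply counts exact matrix-vector queries.

```python
import numpy as np


class MatVecOracle:
    """Hypothetical oracle: answers exact queries w = M v and counts them."""

    def __init__(self, M):
        self._M = M
        self.num_queries = 0

    def query(self, v):
        self.num_queries += 1
        return self._M @ v


def power_method(oracle, d, T, rng):
    """Run T rounds of the power method using only oracle queries."""
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)           # random unit-norm initialization
    for _ in range(T):
        w = oracle.query(v)          # one exact query per round
        v = w / np.linalg.norm(w)    # next query depends on the last response
    return v


def demo(d=200, T=60, seed=0):
    rng = np.random.default_rng(seed)
    # Illustrative test matrix (an assumption, not the paper's construction):
    # top eigenvalue 1 along u, all remaining eigenvalues equal to 0.5.
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    M = 0.5 * np.eye(d) + 0.5 * np.outer(u, u)
    oracle = MatVecOracle(M)
    v_hat = power_method(oracle, d, T, rng)
    return float(v_hat @ M @ v_hat), oracle.num_queries
```

Each round's query is the normalized previous response, so the method is adaptive, and the random initialization makes it randomized. Calling `demo()` returns the Rayleigh quotient of the output together with the number of queries used.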

2.1 Lower Bound for Estimation

We let denote probability taken with respect to the randomness of and a fixed as input, and denote probability with respect to and drawn from a distribution . The main result of this work is the following distributional lower bound:

Theorem 2.1 (Main Theorem).

There exist universal positive constants , , , and such that the following holds: for all , and , there exists a distribution supported on such that the output of any adaptive query algorithm with query complexity satisfies


Thus, since a distributional lower bound implies a worst-case lower bound, we have

Corollary 2.2.

Any adaptive -query algorithm with output which satisfies

for all must make queries.

The constraint that implies that there is a large eigengap. In this regime, our lower bound matches the power method, which yields such that in iterations. We also note that, while is not necessarily positive semi-definite in our construction, one can simply add a multiple of to enforce this constraint [Footnote 1: Proposition 3.2 and the proof of Theorem 3.3 show that is bounded on .], and this only changes by a constant.

2.2 Lower Bound for Testing

We now consider the problem of testing whether the operator norm of a symmetric matrix is below a threshold , or above a threshold .

Definition 2.3 (Adaptive Testing Algorithm).

An adaptive detection algorithm makes adaptive, possibly randomized queries as per Definition 2.2, and at the end of rounds, returns a test which is a function of , and some initial random seed.

Our second result establishes a lower bound on the sum of type-I and type-II errors incurred when testing between distributions on matrices with separated operator norms:

Theorem 2.3 (Detection Lower Bound).

There exist universal positive constants , , , such that the following holds: for all and , there exist two distributions and on such that

and (2.3)

Moreover, for any binary test returned by an adaptive -query algorithm , we have


This implies a worst-case lower bound for testing, matched by the power method for large :

Corollary 2.4.

Any randomized adaptive -query algorithm which can test whether or with probability of error requires at least queries.

3 Reduction to Estimation

We now construct the distribution used to prove Theorems 2.1 and 2.3. We begin by constructing a family of distributions on , indexed by , and place a prior on . We then show that if is drawn from the marginal distribution, then with good probability, lies in for an appropriate , and that any for which is large must be close to . Hence, establishing the desired lower bound is reduced to a lower bound on estimating . The construction is based on the Gaussian Orthogonal Ensemble, also known as the Wigner Model [5].

Definition 3.1 (Gaussian Orthogonal Ensemble (GOE)).

We say that if the entries are independent, for , , for , , and for , . We also define the constant


For a precise, non-asymptotic upper bound on , we direct the reader to Bandeira and van Handel [14]; an asymptotic bound can be found in Anderson et al. [5], and non-asymptotic bounds with looser constants are shown by Vershynin [83]. We now define the generative process for our lower bound:

Definition 3.2 (Deformed Wigner Model).

Let , and a distribution supported on

(e.g., the uniform distribution.) We then independently draw

and , and set . We also let denote the law of conditioned on .
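A generative process of this kind can be sketched in a few lines of Python with NumPy. The scaling used here is an assumption made for illustration, since the formulas in the definitions above did not survive extraction: off-diagonal entries of the Gaussian matrix have variance 1/d, diagonal entries have variance 2/d, and the spike enters as `lam * u u^T` with `u` uniform on the unit sphere. The names `sample_goe`, `sample_deformed_wigner`, and `lam` are hypothetical.

```python
import numpy as np


def sample_goe(d, rng):
    """Symmetric Gaussian matrix.

    Assumed scaling: off-diagonal variance 1/d, diagonal variance 2/d.
    """
    G = rng.standard_normal((d, d))
    return (G + G.T) / np.sqrt(2.0 * d)


def sample_deformed_wigner(d, lam, rng):
    """Return (M, u) with M = W + lam * u u^T, u uniform on the unit sphere."""
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)           # normalized Gaussian is uniform on the sphere
    W = sample_goe(d, rng)
    M = W + lam * np.outer(u, u)     # rank-one deformation of the Wigner matrix
    return M, u


def top_eigenpair(M):
    """Largest eigenvalue and its eigenvector for a symmetric matrix."""
    evals, evecs = np.linalg.eigh(M)  # eigenvalues returned in ascending order
    return float(evals[-1]), evecs[:, -1]
```

Under this assumed scaling and a spike strength `lam` above one, standard random-matrix results predict a top eigenvalue near `lam + 1/lam` whose eigenvector has substantial overlap with `u`, which is the sense in which optimizing over such a matrix amounts to estimating the spike.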

In the sequel, we will take our algorithm to be fixed. Abusing notation slightly, we will therefore let denote the law of and under , conditioned on . We now state our main technical result, which establishes a lower bound on estimating in the setting of Definition 3.2, which we prove using Corollary 5.4 in Appendix D.3.

Proposition 3.1 (Main Estimation Result).

Let and let be generated from the deformed Wigner model, Definition 3.2. Let be the output of an adaptive -query algorithm with input . Then for any , we have


where is a universal constant, (observe that .)

Proposition 3.1 states that, until queries have been made, the probability of having an inner product with of at least is tiny, i.e., . The following proposition establishes that, if is drawn from the deformed Wigner model, then with high probability, lies in for , and that optimizing entails estimating :

Proposition 3.2.

Fix , and . Let , then the following three assertions simultaneously hold with probability at least :

  1. , and for all , ,

  2. for ,

  3. Let as above. For any and any , if , then

Remark 3.1.

As , then and ; thus for large , optimizing is essentially equivalent to estimating . The above proposition also lets us take to be as small as , or equivalently, arbitrarily close to 1. In this regime, we show in Appendix A.2 that behaves like , provided that . Thus, as the eigengap decreases, must be ever-closer to to ensure that overlaps with the spike . Nevertheless, we can still ensure non-negligible overlap between and for values of arbitrarily close to 1.

We now state a more detailed version of Theorem 2.1. A formal version of Theorem 2.3 is established in Section 6.

Theorem 3.3 (Formal Statement of Theorem 2.1).

There exist an absolute constant , such that for any and , there exists a distribution supported on such that, for any randomized, adaptive -query algorithm , we have


where is defined in Equation (3.7).

If is bounded away from and bounded away from zero, then by Remark 3.1, the , and we recover Theorem 2.1 by observing that the quantity in the exponent is then . However, Theorem 3.3 is more general because, in view of Remark 3.1, it permits to be arbitrarily close to .

Proof of Theorem 3.3.

Set . Now we apply Proposition 3.2 with this , and with , so that the conditions of Proposition 3.2 hold with probability at least . If , then Equation (3.7) implies . Then, we invoke Proposition 3.1 with that value of , and condition on the event in Proposition 3.2. Finally, we use the bound . ∎

3.1 Computing the Conditional Likelihoods

We begin by introducing some useful simplifications. First, in the spirit of Yao’s minimax duality principle [84], we assume that is deterministic [Footnote 2: Indeed, let , and be any event measurable with respect to , , . Then, given a randomized adaptive query algorithm such that , we can view as a superposition of deterministic algorithms , where is a random seed. Then, .]. Second, we assume that are orthonormal. This is without loss of generality because one can reconstruct the response to a query by simply querying the projection of onto the orthogonal complement of the previous queries , and normalizing. Finally, we introduce a simplification which will make our queries resemble queries of the form .

Observation 3.1.

For , we let denote the orthogonal projection onto the complement of the span of . We may assume without loss of generality that, rather than returning responses , the oracle returns responses .

This is valid because once queries , it knows , and thus, since and are symmetric, it also knows . Thus, throughout, we will take . We also let denote the data collected after the -th measurement is taken, and let denote the σ-algebra generated by . The collection forms a filtration, and since our algorithm is deterministic, is -measurable. We show that, with our modified measurements , the query-observation pairs have Gaussian likelihoods conditional on and .
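The reduction to orthonormal queries described at the start of this subsection can be checked numerically. The sketch below, in Python with NumPy, is only an illustration: the helper `orthonormal_query` and the random test matrix are hypothetical. It replaces each arbitrary query by its normalized component orthogonal to the earlier queries, and reconstructs the response to the original query by linearity.

```python
import numpy as np


def orthonormal_query(v, basis):
    """Split v into its part in span(basis) and a new unit direction q.

    Returns (q, coeffs, r) with v = sum_j coeffs[j] * basis[j] + r * q.
    Assumes basis is orthonormal and v is not already in its span.
    """
    coeffs = [float(b @ v) for b in basis]
    resid = v - sum((c * b for c, b in zip(coeffs, basis)), np.zeros_like(v))
    r = float(np.linalg.norm(resid))
    return resid / r, coeffs, r


def demo(d=50, T=5, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, d))
    M = (A + A.T) / 2.0                       # arbitrary symmetric test matrix
    basis, responses = [], []
    max_err = 0.0
    for _ in range(T):
        v = rng.standard_normal(d)            # arbitrary, non-orthonormal query
        q, coeffs, r = orthonormal_query(v, basis)
        w_q = M @ q                           # the only oracle call this round
        # Reconstruct M v by linearity from the stored orthonormal responses.
        w_v = r * w_q + sum(
            (c * w for c, w in zip(coeffs, responses)), np.zeros(d)
        )
        max_err = max(max_err, float(np.linalg.norm(w_v - M @ v)))
        basis.append(q)
        responses.append(w_q)
    return max_err
```

`demo()` returns the largest reconstruction error over the rounds, which should sit at floating-point roundoff level. This reflects the point made in the text that restricting to orthonormal queries loses no information.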

Lemma 3.4 (Conditional Likelihoods).

Under , the law of conditioned on , we have


In particular, is conditionally independent of given and .

Lemma 3.4 is proved in Appendix B. We remark that is rank-deficient, with its kernel being equal to the span of . Nevertheless, because the mean vector lies in the orthogonal complement of , computing can be understood as , where denotes the Moore-Penrose pseudo-inverse [42]. We write


and we will use the following equality and inequality frequently and without comment:

and (3.11)

These just follow from the facts that is an orthogonal projection and , and so .

4 A First Attempt: a Lower Bound of

Adaptive estimation lower bounds are often shown by considering the most informative measurements an algorithm could take if it knew the true hidden parameter [38, 43, 3, 72]. Unfortunately, this line of attack is insufficient for a non-vacuous lower bound in our setting: if an oracle tells the algorithm to measure at a unit vector for which is at least , then we would have , and so by Proposition 3.2, we would verify that is close to . Of course, it is highly unlikely that our first measurement is close to the true ; indeed, if is drawn uniformly from the sphere , then with high probability. But what about the second measurement, or the third? What is to stop the algorithm from rapidly learning to take highly informative measurements? To show this cannot happen, we will adopt a simple recursive strategy:

  1. We relate the information collected at stage to the inner products , .

  2. We bound the inner product of with the -st query by its inner products with all past queries as

    where is a constant depending on .

To demonstrate the above proof strategy, we start by establishing a sub-optimal lower bound of . Then in Section 5, we introduce a more refined machinery to sharpen the bound to . First, we observe that the mutual information between and (see, e.g., Cover and Thomas [29]) is controlled by the inner products , :

Proposition 4.1.

Let be an isotropic probability distribution supported on , and denote the law of conditioned on . Then for all integers ,

where (4.12)

We prove Proposition 4.1 in Appendix C.1. We now recursively bound the mutual information , using an argument similar to Price and Woodruff [66], with the exception that we will rely on a more recent continuum formulation of Fano’s inequality [32, 26] to control the information :

Proposition 4.2 (Global Fano [32]).

Let be a prior over a measure space , and let denote a family of distributions over a space indexed by . Then, if is an action space,

is a loss function, and

is a measurable map, we have


We will apply Proposition 4.2 at each query stage : we let , denote the rank-one spike, the prior over , to be the data collected at the end of round , and to be the law of conditioned on . We use the action space , our actions will be the -st query, , and the loss function we consider is


for some fixed . This leads to the following bound, proved in Appendix C.1.

Proposition 4.3.

Let denote an isotropic distribution on the sphere , which satisfies the following concentration bound for some constants and all ,


Then, the sequence satisfies the following recursion: for all , , we have


Therefore, integrating over yields


In Appendix F, we prove that if is the uniform distribution on , then we can take , and in the above proposition. This relies on the following concentration result:

Lemma 4.4 (Spherical Concentration).

Let where is the uniform distribution over . Then for all ,


Hence, from Equation (4.17), grows by at most after each query. Hence, until queries are taken, will be , which entails that will not be accurately estimated. It is worth understanding why this spurious factor appears using . The main weakness with Proposition 4.2 is that the denominator contains the logarithm of the “best-guess probability” . This results in a very weak tail bound on of , which incurs a factor when integrated. To overcome this weakness, we will work instead with estimates based on the χ² divergence, which will be a lot more careful in taking advantage of the small value of .
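The basic fact behind the spherical concentration used in this section, namely that a fixed unit vector has squared overlap of order 1/d with a uniformly random point on the sphere, can be checked by simulation. The Python sketch below is illustrative only: the function name `overlap_stats` and the sample sizes are hypothetical, and it does not reproduce the specific tail bound of Lemma 4.4.

```python
import numpy as np


def overlap_stats(d, n_samples, seed=0):
    """Monte Carlo statistics of d * <v, u>^2 for fixed v = e_1, u uniform on the sphere."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_samples, d))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)  # each row is uniform on the sphere
    scaled_sq = d * u[:, 0] ** 2                      # rescaled squared overlap with e_1
    return float(scaled_sq.mean()), float(scaled_sq.max())
```

By symmetry the expected squared overlap is exactly 1/d, so the first returned value should be close to one. The second value gives a feel for how rarely a single uninformed query achieves an overlap much larger than the typical scale.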

5 A Sharper Lower Bound on Estimation

In this section, we use more refined machinery based on the χ²-divergence to sharpen the lower bound from to . When bounding the KL-divergences in Proposition 4.1, the proof crucially relies upon the fact that the log-likelihoods decompose into a sum, and could thus be bounded using linearity of expectation. This is no longer the case when working with the squares of likelihood-ratios which arise in the χ² divergence, because the adaptivity of the queries can introduce strong correlations between likelihood ratios arising from subsequent measurements. To remedy this, we will proceed by designing a sequence of “good truncation events” for each , on which the likelihood ratios will be well-behaved. We now fix some positive numbers to be specified later, and define the events for and integer by


For an arbitrary probability measure on a space , and an event , we use the following notation to denote the truncated (non-normalized) measure


In the sequel, we will be working with the measures . Note that these are no longer actual probability measures, since their total mass is , which is in general strictly less than one. In Appendix E, we show that f-divergences, a family of measures of distance between distributions which includes both the KL and the χ² divergence [30, 26], generalize straightforwardly to non-negative measures which are not normalized (e.g., truncated probability distributions). Leaving the full generality to the appendix, we will use a “generalized χ²-divergence” between non-normalized measures which modifies the classical χ²-divergence (for a comparison to the classical divergence, see the discussion following Remark E.1).

Definition 5.1 (χ²-divergence).

Let denote two nonnegative measures on a space , such that , and is absolutely continuous with respect to [Footnote 3: That is, for every , implies that .]. We define


When is a probability distribution, the above can be written as .

In Appendix E, we prove a generalization of the f-divergence Bayes risk lower bounds of Chen et al. [26], which we state here for the χ² divergence:

Proposition 5.1.

Adopting the setup of Proposition 4.2, let be a distribution over a space , be a family of probability measures on , be an action space, a binary loss function, and let denote a measurable map from to . Given a family of -measurable events, let denote the truncated measure as per Equation (5.20). Set

Then, for any nonnegative measure on , we have

Remark 5.1.

Even though the above proposition is an analogue of Corollary 7 in Chen et al. [26], it cannot be derived merely as a consequence of that bound, and is sharper and easier to use than bounds that would arise by replacing the truncated distributions with conditional distributions . See Remark E.2 for further discussion.

As in Section 4, we take , to be the distribution of conditioned on , and as in our above discussion, will define the truncation events from Equation (5.19); the index for which we apply Proposition 5.1 will always be clear from context. If we were to follow the case, we would bound the corresponding mutual information quantity by taking to be the law of induced by when . We would then apply Jensen’s inequality (in view of Lemma E.1) to upper bound the corresponding mutual information quantity by the average -divergence between and , where and are both drawn i.i.d. from .

This argument does not work in our setting, because once we restrict to the events , two measures and may no longer be absolutely continuous, and thus have an infinite -divergence. Instead, we apply Proposition 5.1 with the measure to denote the (un-truncated) probability law of under the random matrix . Since is un-truncated, and since the matrix has a continuous density, all the measures , and thus , are absolutely continuous with respect to it. Thus, we can use the events to control the divergence as follows:

Lemma 5.2 (Upper Bound on Likelihood Ratios).

Let , and be the event in Equation (5.19). For and , define the expected conditional likelihood ratio


Moreover, let . Then,


The above Lemma specializes Lemma C.2, proved in Appendix C.2. Noting that the conditional laws and have Gaussian densities, a computation detailed in Lemma C.3 yields


Using the bound (Equation (3.11)), and that on , we get