1 Introduction
A major open problem in machine learning and optimization is identifying classes of nonconvex problems that admit efficient optimization procedures. Motivated by the empirical successes of matrix factorization/completion [86, 68, 49], sparse coding [64], phase retrieval [34], and deep neural networks [48, 12], a growing body of theoretical work has demonstrated that many gradient and local-search heuristics inspired by convex optimization enjoy sound theoretical guarantees in a wide variety of nonconvex problems [22, 18, 62, 44, 10, 11, 79, 80, 15, 85, 82]. Notably, Ge et al. [39] introduced a polynomial-time noisy gradient algorithm for computing approximate local minima of nonconvex objectives which have the “strict saddle property”: that is, objectives whose first-order stationary points are either local minima or saddle points at which the Hessian has a strictly negative eigenvalue. It has since been shown that many well-studied nonconvex problems can be formulated as “strict saddle” objectives whose local minimizers are all globally optimal (or near-optimal) [78, 50, 40, 15], thereby admitting efficient optimization by local search.
Recently, Jin et al. [45] proposed a gradient algorithm which finds an approximate local minimum of a strict saddle objective in a number of iterations which matches first-order methods for comparable convex problems, up to polylogarithmic factors in the dimension. This might seem to suggest that, from the perspective of first-order optimization, strict saddle objectives and truly convex problems are identical. But there is a caveat: unlike the algorithm proposed by Jin et al. [45], the iteration complexity of first-order methods for optimizing truly convex functions typically has no explicit dependence on the ambient dimension [21, 61]. This raises the question:
Does the iteration complexity of first-order methods for strict saddle problems necessarily depend on the ambient dimension? Stated otherwise, is there a gap in the complexity of first-order optimization for “almost-convex” and “truly convex” problems?
This paper answers the above questions in the affirmative by considering perhaps the simplest and most benign strict saddle problem: approximating the top eigenvector of a symmetric matrix, also known as rank-one PCA. The latter is best cast as a strict-saddle problem with an objective function to be maximized subject to a smooth equality constraint [39, 50]. We show that the gradient query complexity of rank-one PCA necessarily scales with the ambient dimension, even in the “easy” regime where the eigengap is bounded away from zero.
More precisely, we consider an oracle model where, given a symmetric matrix , an algorithm is allowed to make exact queries of the form for , where is drawn from a distribution which may depend arbitrarily on the past queries; these queries are precisely the rescaled gradients of the objective . We show (Theorem 2.1) that any adaptive, randomized algorithm which finds a unit vector for which , for any symmetric matrix whose second eigenvalue is at most times its leading eigenvalue in magnitude, must make at least queries. This matches the performance of the power method and Lanczos algorithms as long as is bounded away from one.
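For concreteness, the query model can be sketched in code: an oracle that answers exact matrix-vector queries and counts how many have been made. This is an illustrative sketch of the access model, not code from the paper; the 2x2 matrix is a hypothetical example.

```python
def matvec(A, v):
    """Return A v for a symmetric matrix A given as a list of rows."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

class MatVecOracle:
    """Answers exact queries v -> A v and counts how many were made."""
    def __init__(self, A):
        self._A = A
        self.num_queries = 0

    def query(self, v):
        self.num_queries += 1
        return matvec(self._A, v)

# Hypothetical example: a 2x2 symmetric matrix with eigenvalues 3 and 1.
oracle = MatVecOracle([[2.0, 1.0], [1.0, 2.0]])
print(oracle.query([1.0, 0.0]))   # prints [2.0, 1.0], the first column of A
print(oracle.num_queries)         # prints 1
```

An adaptive algorithm in this model is simply any procedure that decides each new query vector from the responses collected so far.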
In fact, we show that if is bounded by a small constant times , then the probability of finding a unit vector with objective value at least is as small as . We also show (Theorem 2.3) that given any , it takes adaptive queries to test whether the operator norm of is above the threshold , or below . Our lower bounds are based on the widely studied deformed Wigner random matrix model [51, 33], suggesting that the factor should be regarded as necessary for “typical” symmetric matrices, not just for some exceptionally adversarial instances.
1.1 Proof Techniques
We reduce the problem of top-eigenvector computation to adaptively estimating the rank-one component of the deformation of a Wigner matrix [5, 33, 51] in our query model. Here, controls the eigengap, and is drawn uniformly from the unit sphere. Unlike many lower bounds for active learning [38, 3, 43], it is insufficient to assume that the algorithm may take the most informative measurements in hindsight, since that would entail estimating with only measurements. As a first pass, we use a recursive application of Fano’s method, similar to the strategy adopted by Price and Woodruff [66] for proving lower bounds on adaptive estimation. This method bounds the rate at which information is accumulated by controlling the information obtained from the th measurement in terms of the information gained from measurements . Unfortunately, in our setting, this technique can only establish a lower bound of queries. To sharpen our results, we adopt an argument based on a divergence analogue of Fano’s inequality, introduced in Chen et al. [26]. But whereas the divergence computations in Fano’s inequality allow us to decompose the information obtained at each round into a sum, the adaptivity of the algorithm introduces correlations between the likelihood ratios that appear in the computations. Thus, we need to carefully truncate the distributions that arise in our lower bound construction, restricting them to certain carefully defined “good events”. This permits us to bound the rate of information accumulation.
In general, the probability of these good events conditioned on the spike may vary, and thus treating the truncated probabilities as conditional distributions introduces serious complications. To simplify things, we observe that the theory of divergences, from which Fano’s inequality and the analogue in Chen et al. [26] are derived, can be generalized straightforwardly to non-normalized measures, i.e., truncated probability distributions. We therefore derive a general version of the Bayes-risk lower bound from Chen et al. [26] for non-normalized distributions, which, when specialized to , enables us to prove a sharp lower bound of queries.
To prove the lower bound on testing the spectral norm of , we reduce the problem to that of testing the null hypothesis for a Wigner matrix against an alternative hypothesis , where is drawn uniformly on the sphere and is sufficiently positive. The bound mainly follows from Pinsker’s inequality (similarly to the combinatorial hypothesis testing lower bound in Addario-Berry et al. [2]), but again with the added nuance of needing to truncate our likelihood ratios due to the adaptivity of the algorithm.
1.2 Related Work
Oracle Lower Bounds for Optimization. In their seminal work, Nemirovskii and Yudin [60] established lower bounds on the number of calls an algorithm must make to a gradient oracle in order to approximately optimize a convex function. While these bounds match known upper bounds in terms of dependence on relevant parameters (accuracy, condition number, Lipschitz constant), the constructions are regarded as brittle [8]: they consider a worst-case initialization, and make the strong assumption that the point whose gradient is queried lies in the affine space spanned by the gradients queried up to that iterate. Arjevani and Shamir [9] address some of the weaknesses of the lower bounds of [60] (e.g., allowing some randomization), but at the expense of placing more restrictive assumptions on the class of optimization algorithms considered. In contrast, our lower bound places no assumptions on how the algorithm chooses to make its successive queries.
In recent years, lower bounds have been established for stochastic convex optimization [3, 43], where each gradient or function-value oracle query is corrupted with i.i.d. noise. While these lower bounds are information-theoretic, and thus unconditional, they do not hold in the setting considered in this work, where we are allowed to make exact, noiseless queries. As mentioned above, the strategy for proving lower bounds in the exact-oracle model is quite different from that in the noisy-oracle setting.
Active Learning and Adaptive Data Analysis. Our proof technique casts eigenvector computation as a type of sequential estimation problem; such problems have been studied at length in the context of sparse recovery and active, adaptive compressed sensing [7, 66, 25, 24]. Due to the noiseless oracle model, our setting is most similar to [66], whereas [7, 25, 24] study measurement noise. Our setting also exhibits similarities to the stochastic linear bandit problem [73]. More broadly, query complexity has received much recent attention in the context of communication complexity [6, 59], in which lower bounds on query complexity imply corresponding bounds against communication via lifting theorems. Similar ideas also arise in understanding the implications of memory constraints on statistical learning [77, 76, 70].
Local Search for Non-Convex Optimization. As mentioned in the introduction, there has been a flurry of recent work establishing the efficacy and correctness of local search algorithms for numerous nonconvex problems, including dictionary learning [11, 80], matrix factorization [15, 82, 85], matrix completion [40, 16], phase retrieval [18, 79, 22], and training neural networks [44]. Particular attention has been devoted to avoiding saddle points in nonconvex landscapes [50, 39, 45, 78], which, without further regularity assumptions, are known to render even the task of finding local minimizers computationally hard [55]. Recent work has also considered second-order algorithms for nonconvex optimization [80, 4, 23]. However, to the best of the authors’ knowledge, the lower bounds presented in this paper are the first to show a gap in the iteration complexity of first-order methods for convex and “benign” nonconvex objectives.
PCA, Low-Rank Matrix Approximation and Norm Estimation. The growing interest in nonconvexity has also spurred new results in eigenvector computation, motivated in part by the striking resemblance between eigenvector approximation algorithms (e.g., the power method, the Lanczos algorithm [31], Oja’s algorithm [71], and newer, variance-reduced stochastic gradient approaches [35, 71]) and analogous first-order convex optimization procedures. Recent works have also studied PCA in the streaming [71], communication-bounded [37, 13], and online learning settings [36]. More generally, eigenvector approximation is widely regarded as a fundamental algorithmic primitive in machine learning [46], numerical linear algebra [31], optimization, and numerous graph-related learning problems [75, 65, 63]. While lower bounds have been established for PCA in the memory- and communication-limited settings [19, 70], we are unaware of lower bounds that pertain to the noiseless query model studied in this work. Rank-one PCA may be regarded as one of the simplest low-rank matrix approximation problems [42]. The numerical linear algebra community has studied low-rank matrix approximation far more broadly, with an eye towards computation-, memory-, and communication-efficient algorithms [58, 57, 67], as well as algorithms which take advantage of the sparsity of their inputs [28, 74, 27]. Previous work has also studied the problem of estimating functions of a matrix’s singular values [53], including the special case of estimating Schatten norms [52]. To the best of our knowledge, lower bounds for sketching concern the cases where the sketches are chosen non-adaptively.
2 Statement of Main Results
Let denote the norm on , and let denote the unit sphere. Let denote the set of symmetric matrices, and for , we let denote its eigenvalues in decreasing order, denote the corresponding eigenvectors, and overload to denote its operator norm.
Definition 2.1 (Eigenratio).
For , we define the set of matrices with positive leading eigenvalue and bounded eigenratio between their first and second eigenvalues:
(2.1) 
The iteration complexity of rank-one PCA is typically stated in terms of the eigengap . This work focuses on lower bounds which hold when the eigengap is close to , motivating our parameterization in terms of the eigenratio instead. We now define our query model:
Definition 2.2 (Query Model).
An adaptive query algorithm with query complexity is an algorithm which, for rounds , queries an oracle with a vector , and receives a noiseless response . At the end of the rounds, the algorithm returns a vector . The queries and output are allowed to be randomized and adaptive, in that is a function of , as well as some initial random seed. We say that is deterministic if, for all , is a deterministic function of . We say that is non-adaptive if, for all , the distribution of is independent of the observations (but may depend on the past queries).
Example 2.1.
The power method and Lanczos algorithms [31] are both randomized, adaptive query methods. Even though the iterates of the Lanczos and power methods converge to the top eigenvector at different rates, they are nearly identical algorithms from our query-complexity perspective: both produce iterates lying in the Krylov space . The only difference is that the Lanczos algorithm selects its output from this space in a more intelligent manner than the power method. Running the power method from a deterministic initialization would be a non-randomized algorithm. Any non-randomized algorithm, even an adaptive one, must take queries in the worst case, since queries can only identify a matrix up to a dimensional subspace. Randomized but non-adaptive algorithms need to take queries as well, as established formally in Li et al. [52].
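To make Example 2.1 concrete, here is a minimal power method phrased as an adaptive query algorithm in the sense of Definition 2.2: each iterate costs exactly one oracle call. The sketch and the small test matrix are our own illustrative assumptions, not the paper's implementation.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def power_method(query, d, num_iters, seed=0):
    """Power method as an adaptive query algorithm: the (t+1)st query
    is the normalized response to the t-th query."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
    for _ in range(num_iters):
        v = normalize(query(v))
    return v

# Hypothetical 2x2 test matrix: eigenvalues 3 (eigenvector (1,1)/sqrt(2)) and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
v = power_method(lambda w: [sum(a * x for a, x in zip(row, w)) for row in A],
                 d=2, num_iters=50)
print(abs(dot(v, normalize([1.0, 1.0]))) > 0.999)  # prints True
```

With eigenratio 1/3, the overlap with the top eigenvector converges geometrically, so 50 queries far exceed what this toy instance needs.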
2.1 Lower Bound for Estimation
We let denote probability taken with respect to the randomness of and a fixed as input, and denote probability with respect to and drawn from a distribution . The main result of this work is the following distributional lower bound:
Theorem 2.1 (Main Theorem).
There exist universal positive constants , , , and such that the following holds: for all and , there exists a distribution supported on such that the output of any adaptive query algorithm with query complexity satisfies
(2.2) 
Thus, since a distributional lower bound implies a worstcase lower bound, we have
Corollary 2.2.
Any adaptive query algorithm with output which satisfies
for all must make queries.
The constraint that implies that there is a large eigengap. In this regime, our lower bound matches the power method, which yields such that in iterations. We also note that, while is not necessarily positive semidefinite in our construction, one can simply add a multiple of the identity to enforce this constraint (Proposition 3.2 and the proof of Theorem 3.3 show that is bounded on ), and this only changes by a constant.
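The identity-shift trick mentioned above is easy to check numerically: adding a multiple of the identity shifts every eigenvalue by the same amount, so it can enforce positive semidefiniteness while leaving the additive eigengap unchanged. The 2x2 example below is hypothetical, chosen so the matrix is not PSD before the shift.

```python
import math

def eig2(a, b, c):
    """Eigenvalues (largest first) of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean + radius, mean - radius

# A hypothetical matrix that is not PSD: eigenvalues 1 and -3.
l1, l2 = eig2(-1.0, 2.0, -1.0)
# Shifting by 3*I makes it PSD; the additive eigengap l1 - l2 is unchanged.
s1, s2 = eig2(-1.0 + 3.0, 2.0, -1.0 + 3.0)
print((l1, l2), (s1, s2))  # prints (1.0, -3.0) (4.0, 0.0)
```

Note that while the additive gap is preserved, the ratio of the eigenvalues does change under the shift, which is why the text tracks the effect on the eigenratio separately.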
2.2 Lower Bound for Testing
We now consider the problem of testing whether the operator norm of a symmetric matrix is below a threshold , or above a threshold .
Definition 2.3 (Adaptive Testing Algorithm).
An adaptive detection algorithm makes adaptive, possibly randomized queries as per Definition 2.2, and at the end of rounds, returns a test which is a function of , and some initial random seed.
Our second result establishes a lower bound on the sum of type I and type II errors incurred when testing between distributions on matrices with separated operator norms:
Theorem 2.3 (Detection Lower Bound).
There exist universal positive constants , , , such that the following holds: for all and , there exist two distributions and on such that
and  (2.3) 
Moreover, for any binary test returned by an adaptive query algorithm , we have
(2.4) 
This implies a worstcase lower bound for testing, matched by the power method for large :
Corollary 2.4.
Any randomized adaptive query algorithm which can test whether or with probability of error requires at least queries.
3 Reduction to Estimation
We now construct the distribution used to prove Theorems 2.1 and 2.3. We begin by constructing a family of distributions on , indexed by , and place a prior on . We then show that if is drawn from the marginal distribution, then with good probability, lies in for an appropriate , and that any for which is large must be close to . Hence, establishing the desired lower bound reduces to proving a lower bound on estimating . The construction is based on the Gaussian Orthogonal Ensemble, also known as the Wigner model [5].
Definition 3.1 (Gaussian Orthogonal Ensemble (GOE)).
We say that if the entries are independent, for , , for , , and for , . We also define the constant
(3.5) 
For a precise, nonasymptotic upper bound on , we direct the reader to Bandeira and van Handel [14]; an asymptotic bound can be found in Anderson et al. [5], and nonasymptotic bounds with looser constants are shown by Vershynin [83]. We now define the generative process for our lower bound:
Definition 3.2 (Deformed Wigner Model).
Let , and let be a distribution supported on (e.g., the uniform distribution). We then independently draw and , and set . We also let denote the law of conditioned on . In the sequel, we will take our algorithm to be fixed. Abusing notation slightly, we will therefore let denote the law of and under , conditioned on . We now state our main technical result, which establishes a lower bound on estimating in the setting of Definition 3.2, and which we prove using Corollary 5.4 in Appendix D.3.
Proposition 3.1 (Main Estimation Result).
Let and let be generated from the deformed Wigner model, Definition 3.2. Let be the output of an adaptive query algorithm with input . Then for any , we have
(3.6) 
where is a universal constant (observe that ).
Proposition 3.1 states that, until queries have been made, the probability of having an inner product with of at least is tiny, i.e., . The following proposition establishes that, if is drawn from the deformed Wigner model, then with high probability, lies in for , and that optimizing entails estimating :
Proposition 3.2.
Fix and , and let . Then the following three assertions simultaneously hold with probability at least :
1. , and for all , ;
2. for ;
3. with as above, for any and any , if , then
(3.7)
Remark 3.1.
As , and ; thus for large , optimizing is essentially equivalent to estimating . The above proposition also lets us take to be as small as , or equivalently, arbitrarily close to 1. In this regime, we show in Appendix A.2 that behaves like , provided that . Thus, as the eigengap decreases, must be ever closer to to ensure that overlaps with the spike . Nevertheless, we can still ensure non-negligible overlap between and for values of arbitrarily close to 1.
We now state a more detailed version of Theorem 2.1. A formal version of Theorem 2.3 is established in Section 6.
Theorem 3.3 (Formal Statement of Theorem 2.1).
There exists an absolute constant such that for any and , there exists a distribution supported on such that, for any randomized, adaptive query algorithm , we have
(3.8) 
where is defined in Equation (3.7).
If is bounded away from and bounded away from zero, then by Remark 3.1, , and we recover Theorem 2.1 by observing that the quantity in the exponent is then . However, Theorem 3.3 is more general because, in view of Remark 3.1, it permits to be arbitrarily close to .
Proof of Theorem 3.3.
3.1 Computing the Conditional Likelihoods
We begin by introducing some useful simplifications. First, in the spirit of Yao’s minimax duality principle [84], we assume that is deterministic. (Indeed, let and be any event measurable with respect to , , . Then, given a randomized adaptive query algorithm such that , we can view as a superposition of deterministic algorithms , where is a random seed; then, .) Second, we assume that the queries are orthonormal. This is without loss of generality, because one can reconstruct the response to a query by simply querying the projection of onto the orthogonal complement of the previous queries and normalizing. Finally, we introduce a simplification which will make our queries resemble queries of the form .
Observation 3.1.
Let denote the orthogonal projection onto the complement of the span of . We may assume without loss of generality that, rather than returning responses , the oracle returns responses .
This is valid because once queries , it knows , and thus, since and are symmetric, it also knows . Thus, throughout, we will take . We also let denote the data collected after the th measurement is taken, and let denote the σ-algebra generated by . The collection forms a filtration, and since our algorithm is deterministic, is measurable. We show that, with our modified measurements , the query-observation pairs have Gaussian likelihoods conditional on and .
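The reduction to orthonormal queries used above amounts to one Gram-Schmidt step per round: project the new query onto the orthogonal complement of the previous (orthonormal) queries and renormalize. A minimal sketch of this step, as our own illustration:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(v, past):
    """Project v onto the orthogonal complement of the (orthonormal) past
    queries and renormalize: one Gram-Schmidt step."""
    w = list(v)
    for q in past:
        c = dot(w, q)
        w = [wi - c * qi for wi, qi in zip(w, q)]
    norm = math.sqrt(dot(w, w))
    return [wi / norm for wi in w]

q1 = [1.0, 0.0, 0.0]
q2 = project_out([1.0, 1.0, 0.0], [q1])
print(q2)            # prints [0.0, 1.0, 0.0]
print(dot(q1, q2))   # prints 0.0
```

The point of the reduction is that nothing is lost: the component of the new query lying in the span of past queries produces a response the algorithm could already compute on its own.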
Lemma 3.4 (Conditional Likelihoods).
Under , the law of conditioned on , we have
(3.9) 
In particular, is conditionally independent of given and .
Lemma 3.4 is proved in Appendix B. We remark that is rank-deficient, with its kernel equal to the span of . Nevertheless, because the mean vector lies in the orthogonal complement of , computing can be understood as , where denotes the Moore-Penrose pseudo-inverse [42]. We write
(3.10) 
and we will use the following equality and inequality frequently and without comment:
and  (3.11) 
These just follow from the facts that is an orthogonal projection and , and so .
4 A First Attempt: a Lower Bound of
Many adaptive estimation lower bounds are shown by considering the most informative measurements an algorithm could take if it knew the true hidden parameter [38, 43, 3, 72]. Unfortunately, this line of attack is insufficient for a nonvacuous lower bound in our setting: if an oracle tells the algorithm to measure at a unit vector for which is at least , then we would have , and so by Proposition 3.2, we could verify that is close to . Of course, it is highly unlikely that our first measurement is close to the true ; indeed, if is drawn uniformly from the sphere , then with high probability. But what about the second measurement, or the third? What is to stop the algorithm from rapidly learning to take highly informative measurements? To show that this cannot happen, we will adopt a simple recursive strategy:

1. We relate the information collected at stage to the inner products , .
2. We bound the inner product of with the st query by its inner products with all past queries as
where is a constant depending on .
To demonstrate the above proof strategy, we start by establishing a suboptimal lower bound of . Then, in Section 5, we introduce more refined machinery to sharpen the bound to . First, we observe that the mutual information between and (see, e.g., Cover and Thomas [29]) is controlled by the inner products , :
Proposition 4.1.
Let be an isotropic probability distribution supported on , and denote the law of conditioned on . Then for all integers ,
where  (4.12) 
We prove Proposition 4.1 in Appendix C.1. We now recursively bound the mutual information , using an argument similar to Price and Woodruff [66], with the exception that we will rely on a more recent continuum formulation of Fano’s inequality [32, 26] to control the information :
Proposition 4.2 (Global Fano [32]).
Let be a prior over a measure space , and let denote a family of distributions over a space indexed by . Then, if is an action space, is a loss function, and is a measurable map, we have
(4.13)
We will apply Proposition 4.2 at each query stage : we let denote the rank-one spike, the prior over , the data collected at the end of round , and the law of conditioned on . We take the action space to be ; our actions will be the st query , and the loss function we consider is
(4.14) 
for some fixed . This leads to the following bound, proved in Appendix C.1.
Proposition 4.3.
Let denote an isotropic distribution on the sphere , which satisfies the following concentration bound for some constants and all ,
(4.15) 
Then, the sequence satisfies the following recursion: for all , , we have
(4.16) 
Therefore, integrating over yields
(4.17) 
In Appendix F, we prove that if is the uniform distribution on , then we can take , and in the above proposition. This relies on the following concentration result:
Lemma 4.4 (Spherical Concentration).
Let where is the uniform distribution over . Then for all ,
(4.18) 
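Lemma 4.4 can be sanity-checked empirically: the overlap between a fixed unit vector and fresh uniform draws from the sphere concentrates at the 1/sqrt(d) scale, which is the starting point of the recursive argument above. The following small Monte-Carlo sketch is our own illustration, with arbitrary choices of dimension and sample size:

```python
import math
import random

def uniform_sphere(d, rng):
    """Uniform draw from the unit sphere via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)
d = 400
u = uniform_sphere(d, rng)
# Empirical check: |<u, v>| for independent uniform v is O(1/sqrt(d)).
overlaps = [abs(dot(u, uniform_sphere(d, rng))) for _ in range(200)]
mean_overlap = sum(overlaps) / len(overlaps)
print(mean_overlap < 3.0 / math.sqrt(d))  # prints True
```

In this run the average overlap sits near 0.04, roughly the sqrt(2/(pi*d)) value predicted for a standard Gaussian coordinate scaled by 1/sqrt(d).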
Hence, from Equation (4.17), grows by at most after each query. Thus, until queries are taken, will be , which entails that will not be accurately estimated. It is worth understanding why this spurious factor appears using . The main weakness of Proposition 4.2 is that the denominator contains the logarithm of the “best-guess probability” . This results in a very weak tail bound on of , which incurs a factor when integrated. To overcome this weakness, we will work instead with estimates based on the divergence, which take much better advantage of the small value of .
5 A Sharper Lower Bound on Estimation
In this section, we use more refined machinery based on the divergence to sharpen the lower bound from to . The bounds on the divergences in Proposition 4.1 crucially rely on the fact that the log-likelihoods decompose into a sum, and can thus be bounded using linearity of expectation. This is no longer the case when working with the squares of likelihood ratios which arise in the divergence, because the adaptivity of the queries can introduce strong correlations between likelihood ratios arising from subsequent measurements. To remedy this, we design a sequence of “good truncation events” for each , on which the likelihood ratios are well-behaved. We now fix some positive numbers to be specified later, and define the events for and integer by
(5.19) 
For an arbitrary probability measure on a space and an event , we use the following notation to denote the truncated (non-normalized) measure
(5.20) 
In the sequel, we will be working with the measures . Note that these are no longer actual probability measures, since their total mass is , which is in general strictly less than one. In Appendix E, we show that divergences (a family of distances between distributions which includes both the and the divergences [30, 26]) generalize straightforwardly to nonnegative measures which are not normalized, e.g., truncated probability distributions. Leaving the full generality to the appendix, we will use a “generalized divergence” between non-normalized measures which modifies the classical divergence (for a comparison to the classical divergence, see the discussion following Remark E.1).
Definition 5.1 (divergence).
Let denote two nonnegative measures on a space , such that , and is absolutely continuous with respect to (that is, for every , implies that ). We define
(5.21)
When is a probability distribution, the above can be written as .
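As a concrete reference point, the classical chi-squared divergence between discrete probability distributions, which Definition 5.1 generalizes to non-normalized measures, can be computed directly. This snippet is an illustrative aside of our own, not part of the paper's development:

```python
def chi_sq(P, Q):
    """Classical chi-squared divergence sum_x (P(x) - Q(x))^2 / Q(x) between
    discrete distributions given as lists of probabilities.  Requires Q(x) > 0
    wherever P(x) > 0, i.e., P absolutely continuous with respect to Q."""
    return sum((p - q) ** 2 / q for p, q in zip(P, Q) if q > 0)

print(chi_sq([0.5, 0.5], [0.5, 0.5]))  # prints 0.0 (identical distributions)
print(chi_sq([0.6, 0.4], [0.5, 0.5]))  # ~0.04
```

The absolute-continuity requirement in the docstring mirrors the hypothesis of Definition 5.1: once measures are truncated to events, this condition can fail between pairs of truncated measures, which is exactly the difficulty the untruncated reference measure resolves in the sequel.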
In Appendix E, we prove a generalization of the divergence Bayes risk lower bounds of Chen et al. [26], which we state here for the divergence:
Proposition 5.1.
Adopting the setup of Proposition 4.2, let be a distribution over a space , be a family of probability measures on , be an action space, a binary loss function, and let denote a measurable map from to . Given a family of measurable events, let denote the truncated measure as per Equation (5.20). Set
Then, for any nonnegative measure on , we have
(5.22) 
Remark 5.1.
Even though the above proposition is an analogue of Corollary 7 in Chen et al. [26], it cannot be derived merely as a consequence of that bound, and is sharper and easier to use than bounds that would arise by replacing the truncated distributions with conditional distributions . See Remark E.2 for further discussion.
As in Section 4, we take to be the distribution of conditioned on , and, as in the discussion above, we define the truncation events as in Equation (5.19); the index at which we apply Proposition 5.1 will always be clear from context. If we were to follow the case, we would bound the corresponding mutual information quantity by taking to be the law of induced by when . We would then apply Jensen’s inequality (in view of Lemma E.1) to upper bound the corresponding mutual information quantity by the average divergence between and , where and are both drawn i.i.d. from .
This argument does not work in our setting because, once we restrict to the events , two measures and may no longer be mutually absolutely continuous, and would thus have an infinite divergence. Instead, we apply Proposition 5.1 with the measure taken to be the (untruncated) probability law of under the random matrix . Since is untruncated, and since the matrix has a continuous density, all the measures , and thus , are absolutely continuous with respect to it. Thus, we can use the events to control the divergence as follows:
Lemma 5.2 (Upper Bound on Likelihood Ratios).
Let , and be the event in Equation (5.19). For and , define the expected conditional likelihood ratio
(5.23) 
Moreover, let . Then,
(5.24) 