Tight Query Complexity Lower Bounds for PCA via Finite Sample Deformed Wigner Law

04/04/2018, by Max Simchowitz, et al.

We prove a query complexity lower bound for approximating the top r-dimensional eigenspace of a matrix. We consider an oracle model where, given a symmetric matrix M∈R^d × d, an algorithm Alg is allowed to make T exact queries of the form w^(i) = Mv^(i) for i in {1,...,T}, where v^(i) is drawn from a distribution which depends arbitrarily on the past queries and measurements {v^(j),w^(j)}_{1 ≤ j ≤ i-1}. We show that for every gap∈ (0,1/2], there exists a distribution over matrices M for which 1) gap_r(M) = Ω(gap) (where gap_r(M) is the normalized gap between the r-th and (r+1)-st largest-magnitude eigenvalues of M), and 2) any algorithm Alg which takes fewer than const×r log d/√(gap) queries fails (with overwhelming probability) to identify a matrix V∈R^d × r with orthonormal columns for which 〈V, MV〉> (1 - const×gap)∑_i=1^r λ_i(M). Our bound requires only that d is a small polynomial in 1/gap and r, and matches the upper bounds of Musco and Musco '15. Moreover, it establishes a strict separation between convex optimization and randomized, "strict-saddle" non-convex optimization, of which PCA is a canonical example: in the former, first-order methods can have dimension-free iteration complexity, whereas in PCA, the iteration complexity of gradient-based methods must necessarily grow with the dimension.

1 Statement of Main Results

Let ‖·‖ denote the Euclidean norm on R^d, let S^{d-1} denote the unit sphere, and let Stief(d, r) denote the Stiefel manifold consisting of matrices V ∈ R^{d × r} such that V^T V = I_r. Let S^d denote the set of d × d symmetric matrices, and for M ∈ S^d, we let λ_1(M) ≥ … ≥ λ_d(M) denote its eigenvalues in decreasing order, u_1(M), …, u_d(M) denote the corresponding eigenvectors, and let ‖M‖_op and ‖M‖_F denote the operator and Frobenius norms. Finally, we define the eigengap of M as gap_r(M) := (σ_r(M) - σ_{r+1}(M))/σ_r(M), where σ_i(M) is the i-th singular value of M. We will also use the notation 〈V, MV〉 := tr(V^T M V). We now introduce a definition of our query model:

Definition 1.1 (Query Model).

A randomized adaptive query algorithm Alg with query complexity T and accuracy ε is an algorithm which, for rounds i = 1, …, T, queries an oracle with a vector v^(i) ∈ R^d, and receives a noiseless response w^(i) = Mv^(i). At the end of the T rounds, the algorithm returns a matrix V ∈ R^{d × r} with orthonormal columns. The queries and output are allowed to be randomized and adaptive, in that v^(i) is a function of {(v^(j), w^(j))}_{j ≤ i-1}, as well as some random seed.

The goal of Alg is to return a V satisfying

〈V, MV〉 ≥ (1 - ε·gap_r(M))·∑_i=1^r λ_i(M)

for some small ε. In the rank-one case, V is a single unit vector v ∈ S^{d-1}, and the above condition reduces to 〈v, Mv〉 ≥ (1 - ε·gap_1(M))·λ_1(M).
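To make the oracle model concrete, here is a minimal Python sketch (hypothetical names, not taken from the paper) of an oracle that answers exact matrix-vector queries and counts them; an algorithm in the sense of Definition 1.1 interacts with M only through such an oracle, choosing each query as a (possibly randomized) function of all past query-response pairs.

import numpy as np

class MatVecOracle:
    """Answers exact queries w = M v for a hidden symmetric matrix M, and counts them."""
    def __init__(self, M):
        self.M = M
        self.num_queries = 0

    def query(self, v):
        self.num_queries += 1
        return self.M @ v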

Example 1.1 (Examples of Randomized Query Algorithms).

In the rank-one case, the Power Method and Lanczos algorithms [17] are both randomized, adaptive query methods. Even though the iterates of the Lanczos and power methods converge to the top eigenvector at different rates, they make identical queries: namely, both operate on the Krylov space spanned by the initial query and its successive images under M. Lanczos differs from the Power Method by choosing the output to be the optimal vector in this Krylov space, rather than the last iterate. Observe that even in the rank-r case, our query model still permits each single vector-query to be chosen adaptively. Hence, our lower bound applies to subspace iterations (e.g., the block Krylov method of Musco and Musco [28]), and to algorithms which use deflation [4].
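As an illustration (a sketch with assumed names, not the paper's code), here is the power method in this query model, using the MatVecOracle sketch above, together with the Rayleigh-quotient-optimal vector in the span of the collected Krylov basis, which is what a Lanczos-style method would output:

import numpy as np

def power_method_queries(oracle, d, T, rng=np.random.default_rng(0)):
    """Issue T matrix-vector queries; return the last iterate and the Krylov basis."""
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    basis = [v]
    for _ in range(T):
        w = oracle.query(v)                # w = M v
        v = w / np.linalg.norm(w)
        basis.append(v)
    return v, np.column_stack(basis)

def best_in_span(M, K, r=1):
    """Best rank-r output confined to span(K): top eigenvectors of the compressed matrix."""
    Q, _ = np.linalg.qr(K)
    evals, evecs = np.linalg.eigh(Q.T @ M @ Q)
    # A real Lanczos implementation assembles the compressed matrix from the query
    # responses themselves; M appears here directly only to keep the sketch short.
    return Q @ evecs[:, -r:]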

To state our results, we construct, for every gap ∈ (0, 1/2], a distribution over matrices M under which gap_r(M) = Ω(gap) with high probability. To do so, we introduce the classical Gaussian Orthogonal Ensemble, or Wigner law [5]:

Definition 1.2 (Gaussian Orthogonal Ensemble (GOE)).

We say that W ~ GOE(d) if the entries {W_ij}_{i ≤ j} are independent, W_ii ~ N(0, 2/d) for i ∈ {1, …, d}, W_ij ~ N(0, 1/d) for i < j, and W_ij = W_ji for i > j.
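A sampling sketch (under the normalization written above, which is itself a reconstruction: off-diagonal variance 1/d, diagonal variance 2/d); the operator norm of such a matrix concentrates near 2 for large d:

import numpy as np

def sample_goe(d, rng=np.random.default_rng(0)):
    """Symmetric matrix with N(0, 1/d) off-diagonal and N(0, 2/d) diagonal entries."""
    G = rng.standard_normal((d, d)) / np.sqrt(d)
    return (G + G.T) / np.sqrt(2)

W = sample_goe(2000)
print(np.linalg.norm(W, 2))                # ~ 2 for large d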

In the rank-one case, we will then take our matrix to be M = λ uu^T + W, where W ~ GOE(d), u ~ Unif(S^{d-1}), and λ ≥ 1 is a parameter to be chosen. A critical result gives a finite-sample analogue of a classical result in random-matrix theory, which states that λ_1(M) ≈ λ + λ^-1 with high probability. On the other hand, ‖W‖_op concentrates around 2, and thus by eigenvalue interlacing λ_2(M) ≤ λ_1(W) ≈ 2. Motivated by this, we define the asymptotic gap associated with λ:

gap(λ) := (λ + λ^-1 - 2)/(λ + λ^-1)    (1.1)
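A numerical sanity check (a sketch; it assumes the GOE normalization and the rendering of Equation 1.1 given above):

import numpy as np

rng = np.random.default_rng(1)
d, lam = 2000, 1.5
G = rng.standard_normal((d, d)) / np.sqrt(d)
W = (G + G.T) / np.sqrt(2)                 # GOE(d), as above
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
M = lam * np.outer(u, u) + W
ev = np.sort(np.linalg.eigvalsh(M))[::-1]
print((ev[0] - ev[1]) / ev[0])             # empirical gap_1(M)
print((lam + 1/lam - 2) / (lam + 1/lam))   # asymptotic gap, about 0.077 for lam = 1.5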

It is well known that, for a fixed λ, gap_1(M) → gap(λ) as d → ∞. We give a finite-sample analogue:

Proposition 1.1 (Finite Sample Eigengap of Deformed Wigner).

Let M = λ uu^T + W, where W ~ GOE(d) and u ~ Unif(S^{d-1}) are independent. For , define the event

There exists a polynomially bounded function such that for , and , . Moreover, on , .

The explicit polynomial can be derived from a more precise statement, Theorem 5.1. We now state a more precise version of Theorem 1:

Theorem 1.2.

Fix a gap ∈ (0, 1/2] and any d at least as large as the polynomially bounded function of Proposition 1.1, and let λ be the solution to Equation 1.1. Let M = λ uu^T + W, where W ~ GOE(d) and u ~ Unif(S^{d-1}). Then for any Alg satisfying Definition 1.1, we have

(1.2)

where the probability is taken with respect to the randomness of the algorithm. Note that on , .

Observe as is bounded away from . Hence, if , is a large enough polynomial in , then if , or equivalently, , we see that the probability that is at most , proving Theorem 1.

In Appendix A, we present two additional results that follow as easy modifications of our proofs: Theorem A.1 presents an improved -dependence for , and generalizes to the setting where is allowed rounds of adaptivity, and makes a batch of queries per round; Theorem A.3 presents a modification of Theorem 1.2 which establishes a sharp lower bound of in the “easy” regime where approaches one. Our techniques can be adapted to show sharp lower bounds for the associated adaptive hypothesis-testing problem; we omit these arguments in the interest of brevity.

2 Proof Roadmap

2.1 Notation

In what follows, we shall use bold letters , , and to denote the random vectors and matrices which arise from the deformed Wigner law; blackboard font and will be used to denote laws governing these quantities. We will use standard typesetting (e.g. ) to denote fixed (non-random) quantities and vectors, as well as the problem dimension d and the rank r of the plant.

Quantities relating to will be in serif font; these include the queries , responses , and outputs and . The law of these quantities under will be denoted in bold serif.

Mathematical operators like are denoted in Roman or standard font, and asymptotic quantities like in Courier.

2.2 Reduction from Eigenvector Computation to Estimating

In this section, we show that an algorithm which adaptively finds a near-optimal implies the existence of a deterministic algorithm which plays a sequence of orthonormal queries for which is large. Our first step is to show that if is near-optimal, then has a large overlap with , in the following sense:

Lemma 2.1.

Given any , any , and under the event , if , then .

In the rank-one case, with , and , the above lemma just implies that a near-optimal satisfies . In the more general case, the lemma means that the image of needs to have “uniformly good” coverage of the planted matrix . The proof of Lemma 2.1 begins with the Löwner-order inequality

In the rank-one case, this reduces to

Hence, if we want , we must have that, since concentrates around ,

which gives the lower bound. For , the proof becomes more technical, and is deferred to Appendix B.1.

Next, we argue that the performance of the optimal is bounded by a quantity depending only on the query vectors. As a first simplification, we argue that we may assume without loss of generality that are orthonormal.

Observation 2.1.

We may assume that the queries are orthonormal, so that:

and that, rather than returning responses , the oracle returns responses , where we note that is the projection onto .

The assumption that the queries are orthonormal is valid since we can always reconstruct the responses to the original queries from those of an associated orthonormal sequence obtained via the Gram–Schmidt procedure. The reason we can assume the responses are of the form is that, since queries , it knows , and thus, since and are symmetric, it also knows , and thus can be reconstructed from . The next observation shows that it suffices to upper bound with
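A minimal sketch of this reduction (hypothetical variable names): orthonormalize the raw queries with a QR factorization, query the orthonormal vectors instead, and recover the original responses by linearity.

import numpy as np

rng = np.random.default_rng(2)
d, k = 50, 5
A = rng.standard_normal((d, d))
M = (A + A.T) / 2                          # any symmetric matrix
V = rng.standard_normal((d, k))            # arbitrary (possibly adaptive) queries
Q, R = np.linalg.qr(V)                     # V = Q R, with Q orthonormal
W_orth = M @ Q                             # responses to the orthonormal queries
print(np.allclose(W_orth @ R, M @ V))      # True: original responses recovered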

Observation 2.2.

We may assume without loss of generality that makes queries after outputting , and that .

This is valid because we can always modify the algorithm so that the queries ensure that

In this case, we have that for all (in particular, ),

Lastly, suppose it is the case that for any deterministic algorithm , and some bounds and . Then for any randomized algorithm , Fubini’s theorem implies

as well. Hence,

Observation 2.3.

We may assume that, for all , the query is deterministic given the previous query-observation pairs .

2.3 Lower Bounding the Estimation Problem

As discussed above, we need to present lower bounds for the problem of sequentially selecting measurements for which the associated measurement matrix has a large overlap with the planted matrix . Proving a lower bound for this sequential, statistical problem constitutes the main technical effort of this paper. We encode the entire history of up to time as ; in particular, describes the entire history of the algorithm.

Next, for , we let denote the law of where conditioned on . In the rank-one case, we denote the law of where conditioned on . We will also abuse notation slightly by letting denote the law obtained by running on , i.e. with . In the rank-one case, we have the following theorem, whose proof is outlined in Section D:

Theorem 2.2.

Let , where , and . Then for all ,

The above theorem essentially states that the quantity can grow at most geometrically at a rate of , with an initial value sufficiently large in terms of the probability and . In Section 4, we prove an analogous bound, which gives geometric control on :

Theorem 2.3.

Let , where . Then for and , and

In Section A.3, we combine Theorem 2.3, Lemma 2.1, and Observation 2.1 to prove Theorem 1.2. The final rate is a consequence of the fact that . As mentioned in the paragraph New Techniques, our main technical hammer for proving Theorems 2.2 and 2.3 is a novel data-processing lower bound (Proposition 3.4) which applies to “truncated” distributions; the techniques are explained in greater detail in Appendix F.
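For intuition, the following sketch (hypothetical names; illustrative only, and assuming the planted model as rendered in Section 1) tracks the potential that Theorems 2.2 and 2.3 control, namely the squared norm of the projection of the plant u onto the span of the queries (introduced formally in Section 3), along power-method queries: it starts near 1/d and grows roughly geometrically, which is exactly the type of growth the theorems bound for arbitrary algorithms.

import numpy as np

def potential(Vs, u):
    """Squared norm of the projection of u onto the span of the columns of Vs."""
    Q, _ = np.linalg.qr(Vs)
    return float(np.linalg.norm(Q.T @ u) ** 2)

rng = np.random.default_rng(8)
d, lam, T = 2000, 2.0, 15
G = rng.standard_normal((d, d)) / np.sqrt(d)
W = (G + G.T) / np.sqrt(2)                 # GOE(d), as above
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
M = lam * np.outer(u, u) + W
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
queries = [v]
print(0, potential(np.column_stack(queries), u))
for k in range(1, T + 1):
    v = M @ v
    v /= np.linalg.norm(v)
    queries.append(v)
    print(k, potential(np.column_stack(queries), u))   # grows geometrically until it saturates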

2.4 Conditional Likelihoods from Orthogonal Queries

We conclude with one further simplification which yields a closed form for the conditional distributions of our queries. Observe that it suffices to observe the queries , since our algorithm already “knows” the matrix from the previous queries. Hence,

Observation 2.4.

We may assume that we observe queries , where .

We now show that, with our modified measurements , the query-observation pairs in the rank-one case have Gaussian likelihoods conditional on and .

Lemma 2.4 (Conditional Likelihoods).

Let denote the orthogonal projection onto the orthogonal complement of . Under (the joint law of and on ), we have
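As a sanity check on this Gaussian structure (a sketch under the GOE normalization assumed earlier; it is not the lemma's exact statement, which also conditions on the plant and on the past queries): for a fixed unit query v, the unconditional response Wv is a centered Gaussian vector with covariance (I + vv^T)/d.

import numpy as np

rng = np.random.default_rng(7)
d, n = 40, 20000
v = np.zeros(d)
v[0] = 1.0                                 # a fixed unit query
samples = np.empty((n, d))
for t in range(n):
    G = rng.standard_normal((d, d)) / np.sqrt(d)
    W = (G + G.T) / np.sqrt(2)             # GOE(d), as above
    samples[t] = W @ v
emp_cov = samples.T @ samples / n
print(emp_cov[0, 0], 2 / d)                # ~ (1 + v_0^2)/d = 2/d
print(emp_cov[1, 1], 1 / d)                # ~ 1/d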

3 Proof of Theorem 2.2

In this section, we prove a lower bound for the rank-one planted perturbation. The arguments in this section will also serve as the bedrock for the rank-r case, and exemplify our proof strategy. Given any , we introduce the notation , which is just the squared Euclidean norm of the projection of onto the span of . This quantity will serve as a “potential function” which captures how much information the queries have collected about the planted solution , in a sense made precise in Proposition 3.4 below. The core of our argument is the following proposition, whose proof is given in the following subsection:

Proposition 3.1.

Let be a sequence such that , and for , . Then for all , one has

(3.3)

The above proposition states that, given two thresholds , the probability that exceeds the threshold on the event that does not exceed the threshold is small. Hence, for a sequence of thresholds , we have

Theorem 2.2 now follows by choosing the appropriate sequence , selecting appropriately, and verifying that the right-hand side of the above display is at most . For intuition, setting , we see that once gets large, it is enough to choose to ensure that the exponent in Equation (3.1) is a negative number of sufficiently large magnitude. The details are worked out in Appendix D. We now turn to the proof of Proposition 3.1.

3.1 Proving Proposition 3.1

To prove Proposition 3.1, we argue that if is much smaller than , then under the event , the algorithm does not have enough information about to select a new query vector for which . The following proposition is proved in Section 3.3, and arises as a special case of more general information-theoretic tools introduced in that section.

Proposition 3.2.

Let be any distribution supported on , and let . Then,

As is typical for data-processing inequalities, the above proposition relates the probability of the event to an “information” term capturing the size of powers of the likelihood ratios restricted to the event , and an “entropy” term, which captures how unlikely it would be to find a such that by just randomly guessing. We remark that Proposition 3.2 differs from many standard data-processing inequalities (e.g., Fano’s inequality or the bounds in Chen et al. [15]) in two ways: first, we use an unorthodox information measure: -powers of likelihood ratios for close to zero. This choice of divergence gives us granular control in the case when is close to one. As mentioned above, we will ultimately take by setting . Second, we consider the restriction of the likelihood ratios to the “low-information” event . As mentioned above, this is necessary to deal with the ill-behaved tails of the likelihoods. In Appendix F, we present additional general data-processing inequalities for truncated distributions.

Proposition 3.1 now follows readily from Proposition 3.2 by bounding the “entropy” and “information” terms. We use concentration of measure on the sphere to bound the entropy term as follows (see Appendix C.1 for the proof):

Lemma 3.3.

For any fixed and , we have
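The phenomenon behind the entropy term can be seen with a quick Monte Carlo sketch (illustrative constants, not the lemma's exact bound): for u drawn uniformly from the sphere and any fixed unit vector w, the overlap 〈u, w〉^2 has mean 1/d, and its upper tail decays exponentially in d·τ, so a query chosen without information about the plant is very unlikely to have a large overlap with it.

import numpy as np

rng = np.random.default_rng(5)
d, n, tau = 200, 500000, 0.05
g1 = rng.standard_normal(n)
rest = rng.chisquare(d - 1, size=n)
overlaps = g1**2 / (g1**2 + rest)          # law of <u, w>^2 when u ~ Unif(S^{d-1})
print(overlaps.mean())                     # ~ 1/d = 0.005
print((overlaps >= tau).mean())            # small, consistent with an exp(-Theta(d*tau)) tail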

We now state Proposition 3.4, which gives an upper bound on the information term. The proof is considerably more involved than that of Lemma 3.3, and so we present a sketch in Section 3.2 below.

Proposition 3.4.

For any and any fixed , we have

(3.4)

In particular, by taking an expectation over , we have that

This motivates the choice of as an information-potential, since it gives us direct control over bounds on the likelihood ratios. Proposition 3.1 now follows immediately from stringing together Proposition 3.2, Proposition 3.4 for the “information term”, and Lemma 3.3 for the “entropy term”.

3.2 Proof of Proposition 3.4 (“Information Term”)

The difficulty in Proposition 3.4 is that truncating to the event introduces correlations between the conditional likelihoods that don’t arise in the conditionally independent likelihoods of Lemma 2.4. Nevertheless, we use a careful peeling argument (Appendix C.3) to upper bound the information term, an expected product of likelihoods, by a product of expected conditional likelihoods which we can compute. Formally, we have

Proposition 3.5 (Generic upper bound on likelihood ratios).

Fix an , and fix . Define the likelihood functions

(3.5)

Then for any subset , we have

(3.6)

where denotes the first columns of .

Here, we remark that the tilde-notation represents fixed (deterministic) values of the corresponding random quantities. For example, in the event , is considered to be a deterministic matrix.

We can now invoke a computation of the -th moment of the likelihood ratios between two Gaussians, proved in Appendix C.4.

Lemma 3.6.

Let denote the distribution and denote , where . Then

(3.7)
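For intuition, the analogous standard computation for two isotropic Gaussians with a common covariance (a sketch; the lemma's exact parametrization may differ) gives E_{P_2}[(dP_1/dP_2)^q] = exp(q(q-1)·‖μ_1 - μ_2‖^2/(2σ^2)), which a quick Monte Carlo check confirms:

import numpy as np

rng = np.random.default_rng(6)
d, sigma, q, n = 10, 1.0, 0.3, 400000
mu1, mu2 = np.zeros(d), np.full(d, 0.2)
x = mu2 + sigma * rng.standard_normal((n, d))                             # samples from P_2
log_ratio = ((x - mu2)**2 - (x - mu1)**2).sum(axis=1) / (2 * sigma**2)    # log (dP_1/dP_2)(x)
print(np.exp(q * log_ratio).mean())                                       # Monte Carlo estimate
print(np.exp(q * (q - 1) * np.sum((mu1 - mu2)**2) / (2 * sigma**2)))      # closed form, ~0.959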

We are now in a position to prove Proposition 3.4:

Proof of Proposition 3.4.

Fix a , and apply Proposition 3.5 with and . In the language of Proposition 3.5, we have

Now, observe that is the density of and is the density of . Since , we have . Thus,

(3.8)

Hence, by Lemma 3.6, we have