An important phenomenon in the study of the computational characteristics of random problems is the appearance of statistical-to-computational gaps, wherein a problem may be solved by an inefficient algorithm—typically a brute-force search—but empirical evidence, heuristic formal calculations, and negative results for classes of powerful algorithms all suggest that the same problem cannot be solved by any algorithm running in polynomial time. Many examples of this phenomenon arise from Bayesian estimation tasks, in which the goal is to recover a signal from noisy observations. Bayesian problems exhibiting statistical-to-computational gaps in certain regimes include graph problems such as community detection [DKMZ11], estimation for models of structured matrices and tensors [LKZ15, HSS15], statistical problems arising from imaging and microscopy tasks [PWBM18a, BBLS18], and many others. A different family of examples comes from random optimization problems that are signal-free, where there is no "planted" structure to recover; rather, the task is simply to optimize a random objective function as effectively as possible. Notable instances of problems of this kind that exhibit statistical-to-computational gaps include finding a large clique in a random graph [Jer92], finding a large submatrix of a random matrix [GL18], and finding an approximate solution to a random constraint satisfaction problem [AC08].
In this paper, we study a problem from the latter class: the problem of maximizing the quadratic form $x^\top W x$ over a constraint set $\mathcal{S}_n \subseteq \mathbb{R}^n$, where $W$ is a random matrix drawn from the Gaussian orthogonal ensemble $\mathrm{GOE}(n)$. (Gaussian orthogonal ensemble: $W$ is symmetric with $W_{ij} \sim \mathcal{N}(0, 1/n)$ for $i < j$ and $W_{ii} \sim \mathcal{N}(0, 2/n)$, independently.) Unlike previous works that have studied whether an efficient algorithm can optimize and find $x \in \mathcal{S}_n$ that achieves a large objective value, we study whether an efficient algorithm can certify an upper bound on the objective over all $x \in \mathcal{S}_n$. In the notable case of the Sherrington-Kirkpatrick (SK) Hamiltonian [SK75, Pan13], where $\mathcal{S}_n = \{\pm 1/\sqrt{n}\}^n$, while there is an efficient algorithm believed to optimize arbitrarily close to the true maximum [Mon18], we show that, conditional on the correctness of the low-degree likelihood ratio method recently developed by [HS17, Hop18], there is no efficient algorithm to certify an upper bound that improves on a simple spectral certificate. Thus, the certification task for this problem exhibits a statistical-to-computational gap, while the optimization task does not.
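As an illustrative numerical sketch (not part of the formal development), the GOE normalization above can be checked empirically: sampling a matrix with the stated variances and computing its largest eigenvalue gives a value near $2$, which is exactly the simple spectral certificate discussed below, since every $x \in \{\pm 1/\sqrt{n}\}^n$ is a unit vector.

```python
import numpy as np

def sample_goe(n, rng):
    """Sample W ~ GOE(n): symmetric, off-diagonal variance 1/n, diagonal variance 2/n."""
    A = rng.normal(size=(n, n)) / np.sqrt(n)   # iid N(0, 1/n) entries
    return (A + A.T) / np.sqrt(2)              # symmetrize: variances become 1/n and 2/n

rng = np.random.default_rng(0)
n = 1500
W = sample_goe(n, rng)
lam_max = np.linalg.eigvalsh(W)[-1]
# Spectral certificate: for any unit vector x (in particular any x in
# {+-1/sqrt(n)}^n), x^T W x <= lam_max, and lam_max -> 2 as n grows.
print(round(lam_max, 2))
```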
Signal-free random optimization problems.
The general task we will be concerned with is the optimization of a random function: maximize $f(x)$ over $x \in \mathcal{S}$, where $f$ is drawn from a probability distribution $\mathbb{P}$ on some measurable space of functions. Sometimes, such a task arises in statistical estimation, in particular as a likelihood maximization task, where the random function is the likelihood of an observed dataset for a given parameter value $x$. But the same formal task, stripped of this statistical origin, is still very common: in statistical physics, random functions arise in models of magnetism in disordered media; in optimization, random functions encode uncertainty in the parameters of a problem; and in theoretical computer science, random instances of algorithmic tasks describe the average-case rather than worst-case computational characteristics of a problem. Below, we review a well-studied example showing the connection between a prominent statistical estimation problem and a related signal-free random optimization problem.
Consider the Rademacher-spiked Wigner model, a family of probability distributions indexed by $n \in \mathbb{N}$ and $\lambda \geq 0$. Letting $x \sim \mathrm{Unif}(\{\pm 1/\sqrt{n}\}^n)$ and $W \sim \mathrm{GOE}(n)$ independently, $\mathbb{P}_{n,\lambda}$ is the law of the random matrix $Y = \lambda x x^\top + W$. If $\lambda$ is fixed in advance, then the log-likelihood of $x$ for some observed data $Y$ is
$$L(x; Y) = \frac{n\lambda}{2}\, x^\top Y x + c,$$
where $c$ depends on $Y$ and $\lambda$ but not on $x$.
Thus, drawing $Y \sim \mathbb{P}_{n,\lambda}$ defines a random optimization problem,
$$\max_{x \in \{\pm 1/\sqrt{n}\}^n} x^\top Y x.$$
Success in the associated estimation problem corresponds to recovering the planted vector $x$ as the solution to this problem; the "overlap" $|\langle \hat{x}, x \rangle|$ of an estimate $\hat{x}$ with the truth is often used as a quantitative measure of success in this task.
A natural "signal-free" version of this problem arises by setting $\lambda = 0$. In this case, note that $Y \sim \mathrm{GOE}(n)$ does not actually depend on $x$, leaving us with the optimization
$$\max_{x \in \{\pm 1/\sqrt{n}\}^n} x^\top W x. \qquad (3)$$
Up to scaling and a change of sign, this task is the same as that of identifying the ground state configuration or energy in the Sherrington-Kirkpatrick (SK) spin glass model [SK75, Pan13]. For this reason we refer to (3) as the SK problem. Note that there is no "planted" solution with respect to which we may measure an algorithm's performance; rather, the quality of the solution $\hat{x}$ an algorithm obtains is measured only by the value of $\hat{x}^\top W \hat{x}$.
Computational tasks: optimization vs. certification.
Let us contrast two computational tasks of interest for a given optimization problem. The first, most obvious task is that of optimization: producing an algorithm computing $\hat{x} = \hat{x}(f) \in \mathcal{S}$ such that $f(\hat{x})$ is as large as possible (say, in expectation, or with high probability as the size of the problem diverges).
Another task is that of certification: producing instead an algorithm computing a number $c = c(f)$, such that for all $f$ and all $x \in \mathcal{S}$ we have $f(x) \leq c(f)$. The main additional challenge of certification over optimization is that $c$ must produce a valid upper bound for every possible value of the data $f$, no matter how unlikely $f$ is to occur under $\mathbb{P}$. Subject to this requirement, we seek to minimize $c(f)$ (again, in a suitable probabilistic sense when $f \sim \mathbb{P}$). Convex relaxations are a common approach to certification, where $\mathcal{S}$ is relaxed to a convex superset $\mathcal{S}' \supseteq \mathcal{S}$ admitting a sufficiently simple description that it is possible to optimize exactly over $\mathcal{S}'$ using convex optimization.
If $x^\star$ is the true maximizer of $f$, then for any pair of optimization and certification algorithms as above, we have
$$f(\hat{x}) \leq f(x^\star) \leq c(f).$$
Thus, in the case of a maximization problem, optimization algorithms approximate the true value from below, while certification algorithms approximate it from above. We are then interested in how tight either inequality is in the limit for a given sequence of random problems (indexed by some notion of problem size $n$). Of course, we can achieve "perfect" optimization and certification by exhaustive search over all $x \in \mathcal{S}$, but we are interested only in computationally efficient (polynomial-time in $n$) algorithms.
To make these definitions concrete, let us review a simple instance of each type of algorithm for the problem of Example 1.1.
Example 1.1 (Continued).
In the SK problem (3), two related spectral algorithms give simple examples of algorithms for both optimization and certification.
For certification, writing $\lambda_{\max}(W)$ for the largest eigenvalue of $W$, we may take advantage of the bound
$$\max_{x \in \{\pm 1/\sqrt{n}\}^n} x^\top W x \;\leq\; \max_{\|v\| = 1} v^\top W v \;=\; \lambda_{\max}(W) \;\to\; 2 \quad \text{almost surely.} \qquad (5)$$
For optimization, if $v_{\max}$ is the unit eigenvector corresponding to $\lambda_{\max}(W)$, then we may produce the feasible point $\hat{x} = \mathrm{sgn}(v_{\max})/\sqrt{n}$, where $\mathrm{sgn}$ denotes the $\{\pm 1\}$-valued sign function, applied entrywise. The vector $v_{\max}$ is distributed as an isotropically random unit vector in $\mathbb{R}^n$, so the quality of this solution may be computed as
$$\hat{x}^\top W \hat{x} = \frac{2}{\pi}\,\lambda_{\max}(W) + o(1) = \frac{4}{\pi} + o(1) \approx 1.273$$
with high probability as $n \to \infty$. (The error term in the first equality may be computed as $\sum_{i \geq 2} \lambda_i \langle v_i, \hat{x} \rangle^2 = o(1)$, where the sum is over all eigenvectors except $v_{\max}$. This analysis appeared in [ALR87], one of the first rigorous mathematical works on the energy landscape of the SK model.)
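These two spectral quantities can be contrasted numerically; the following minimal sketch (illustrative only, not part of the formal development) compares the value attained by the sign-rounded top eigenvector against the spectral certificate from the same eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1500
A = rng.normal(size=(n, n)) / np.sqrt(n)
W = (A + A.T) / np.sqrt(2)                  # W ~ GOE(n)
vals, vecs = np.linalg.eigh(W)
v_max = vecs[:, -1]                          # unit eigenvector for lambda_max
x_hat = np.sign(v_max) / np.sqrt(n)          # feasible point in {+-1/sqrt(n)}^n
opt_lower = x_hat @ W @ x_hat                # optimization value, ~ 4/pi ~ 1.27
cert_upper = vals[-1]                        # spectral certificate, ~ 2
print(round(opt_lower, 2), round(cert_upper, 2))
```

The gap between the two printed values brackets the true optimum $2P^\star \approx 1.526$ from below and above.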
On the other hand, from the origins of this problem in statistical physics, its true optimal value as $n \to \infty$ is known to approach
$$2 P^\star \approx 1.526,$$
where the constant $P^\star$ is expressed via the celebrated Parisi formula for the free energy of the SK model [Par79, Pan13, Tal06]. The approximate value we give above was estimated with numerical experiments in previous works (see, e.g., [Par80, CR02]).
The recent result of [Mon18] implies, assuming a widely-believed conjecture from statistical physics, that for any $\epsilon > 0$ there exists a polynomial-time optimization algorithm achieving with high probability a value of $2P^\star - \epsilon$ on the SK problem. This work builds on that of [ABM18, Sub18], and these works taken together formalize the heuristic idea from statistical physics that optimization is tractable for certain optimization problems exhibiting full replica symmetry breaking. On the other hand, there are few results addressing the SK certification problem. The only previous work we are aware of in this direction is [MS15], where a simple semidefinite programming relaxation is shown to achieve the same value as the spectral certificate (5).
Theorem 1.2 (Informal). Assuming Conjecture 3.1 (the low-degree conjecture for the spiked Wishart model), for any $\epsilon > 0$ there is no randomized polynomial-time algorithm to certify the value $2 - \epsilon$ on the SK problem (3).
As mentioned earlier, Theorem 1.2 reveals a striking gap between optimization and certification: it is possible to efficiently give a tight lower bound on the maximum objective value by exhibiting a specific solution $\hat{x}$, but impossible to efficiently give a tight upper bound. The same result in fact holds more generally for a wide variety of constraint sets other than $\{\pm 1/\sqrt{n}\}^n$ (see Corollary 3.9). Due to the high-dimensional nature of the problem, we expect that the value of a certification algorithm should concentrate tightly; thus we also expect Theorem 1.2 to still hold if the high-probability guarantee is replaced by any positive constant probability of success.
Our result has important consequences for convex programming. A natural approach for optimizing the SK problem (3) would be to use a convex programming relaxation such as a semidefinite program based on the sum-of-squares hierarchy [Sho87, Par00, Las01]. Such a method would rewrite the objective of the SK problem as $\langle W, X \rangle$ for $X = x x^\top$ for some $x \in \{\pm 1/\sqrt{n}\}^n$, and relax this constraint on $X$ to a weaker convex constraint for which the associated optimization problem can be solved efficiently. One can either hope that the relaxation is tight and outputs a valid solution $X = x x^\top$ for (3) (with high probability), or employ a rounding procedure to extract a valid solution from $X$. Note that the optimal value of any convex relaxation of (3) provides an upper bound on the optimal value of (3) and therefore gives a certification algorithm. Thus Theorem 1.2 implies that no polynomial-time computable convex relaxation of (3) can have value $2 - \epsilon$, and in particular cannot be tight. This suggests that any convex programming approach for optimization should fail to find a solution of value close to $2P^\star$ (even if a rounding procedure is used). This highlights a fundamental weakness of convex programs: even the most powerful convex programs (such as sum-of-squares relaxations) should fail to optimize (3), even though other methods succeed (namely, the message-passing algorithm of [Mon18]). (In contrast, simple rounded convex relaxations are believed to approximate many similar problems optimally in the worst-case, rather than average-case, setting [KKMO07].) An explanation for this suboptimality is that convex programs are actually solving a fundamentally harder problem: certification.
The SK problem is not the first known instance of a problem where perfect optimization is tractable but perfect certification appears to be hard. One example comes from random constraint satisfaction problems (CSPs).
Example 1.3 (Random $k$-SAT).
In the canonical random $k$-SAT problem, the decision variable is a Boolean assignment $x \in \{\mathtt{true}, \mathtt{false}\}^n$, and the optimization task is to maximize the number of satisfied clauses $c_1, \ldots, c_m$, each of which is a Boolean expression of the form $\ell_1 \vee \cdots \vee \ell_k$ where each literal $\ell_j$ is chosen uniformly among the variables $x_i$ and their Boolean negations $\neg x_i$. Let us write $f(x)$ for the number of satisfied clauses a given assignment $x$ achieves, a random function of $x$.
Prior work has used sum-of-squares lower bounds to argue for hardness of certification in problems such as random CSPs [KMOW17], planted clique [BHK16a], tensor injective norm [HSS15], community detection in hypergraphs [KBG17], and others. These results prove that the sum-of-squares hierarchy (of constant degree) fails to certify effectively, which suggests that all polynomial-time algorithms will fail to certify as well. In our case, it appears difficult to prove sum-of-squares lower bounds for the SK problem, so we instead take a novel approach based on a different heuristic for computational hardness, which we explain in the next section.
Overview of techniques.
The proof of our main result (Theorem 1.2) has two parts. First, we give a reduction from a certain hypothesis testing problem, the negatively-spiked Wishart model [Joh01, BBP05, BS06, PWBM18b], to the SK certification problem. We then use a method introduced by [HS17, Hop18] based on the low-degree likelihood ratio to give strong evidence that detection in the negatively-spiked Wishart model is computationally hard (in the relevant parameter regime).
In the spiked Wishart model, we either observe $N$ i.i.d. samples $y_1, \ldots, y_N \sim \mathcal{N}(0, I_n)$, or $N$ i.i.d. samples $y_1, \ldots, y_N \sim \mathcal{N}(0, I_n + \beta x x^\top)$, where the "spike" $x \in \{\pm 1/\sqrt{n}\}^n$ is a uniformly random hypercube vector, and $N \approx n/\gamma$. The goal is to distinguish between these two cases with probability $1 - o(1)$ as $n \to \infty$. In the negatively-spiked ($\beta < 0$) case with $\beta$ close to $-1$, this task amounts to deciding whether there is a hypercube vector that is nearly orthogonal to all of the samples $y_i$. For fixed $\beta$, a simple spectral method succeeds when $\gamma < \beta^2$ [BBP05, BS06], and we expect the problem to be computationally hard when $\gamma > \beta^2$.
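A minimal sketch of the two observation models (illustrative only; we take $\beta = -1$, in which case the planted covariance $I_n - x x^\top$ is exactly the projector onto the hyperplane orthogonal to $x$, so planted samples can be drawn by projecting standard Gaussians):

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 400, 2.0
N = int(n / gamma)
x = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)   # hypercube spike, ||x|| = 1

# Null model: y_i ~ N(0, I_n).
Y_null = rng.normal(size=(N, n))

# Planted model with beta = -1: y_i ~ N(0, I_n - x x^T), i.e., a standard
# Gaussian projected onto the hyperplane orthogonal to x.
G = rng.normal(size=(N, n))
Y_planted = G - np.outer(G @ x, x)

print(float(np.max(np.abs(Y_planted @ x))))   # ~ 0: every sample orthogonal to x
print(float(np.max(np.abs(Y_null @ x))))      # typical inner products of order 1
```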
Let us now intuitively explain the relation between the negatively-spiked Wishart model and the SK certification problem. Suppose we want to certify that
$$\max_{x \in \{\pm 1/\sqrt{n}\}^n} x^\top W x \leq 2 - \epsilon,$$
where $W \sim \mathrm{GOE}(n)$, for some small constant $\epsilon > 0$. Since the eigenvalues of $W$ approximately follow the semicircle distribution on $[-2, 2]$ [Wig93], we need to certify that the top $\delta n$-dimensional eigenspace of $W$ does not (approximately) contain a hypercube vector, for some small $\delta > 0$ depending on $\epsilon$. In particular, we need to distinguish between a uniformly random $\delta n$-dimensional subspace (the distribution of the actual top $\delta n$-dimensional eigenspace of $W$) and a $\delta n$-dimensional subspace that contains a hypercube vector. Equivalently, by taking orthogonal complements, we need to distinguish between a uniformly random $(1-\delta)n$-dimensional subspace and a $(1-\delta)n$-dimensional subspace that is orthogonal to a hypercube vector. This is essentially the problem of detection in the negatively-spiked Wishart model with $\beta = -1$ and $N = (1-\delta)n$ (i.e., $\gamma = \frac{1}{1-\delta}$), and these parameters lie in the "hard" regime $\gamma > \beta^2$.
More formally, our reduction constructs a distribution $\widetilde{\mathbb{P}}_n$ over symmetric $n \times n$ matrices such that $\max_{x \in \{\pm 1/\sqrt{n}\}^n} x^\top \widetilde{W} x \geq 2 - \epsilon$ with high probability when $\widetilde{W} \sim \widetilde{\mathbb{P}}_n$. This $\widetilde{\mathbb{P}}_n$ also has the property that (conditional on the hardness of the detection problem described above) it is computationally hard to distinguish between $\widetilde{\mathbb{P}}_n$ and $\mathrm{GOE}(n)$. Note that the existence of such a $\widetilde{\mathbb{P}}_n$ implies hardness of certification for the SK problem, because if an algorithm could certify that $\max_{x} x^\top W x \leq 2 - \epsilon$ when $W \sim \mathrm{GOE}(n)$, then it could distinguish $\mathrm{GOE}(n)$ from $\widetilde{\mathbb{P}}_n$.
This idea of "planting" a hidden solution (in our case, a hypercube vector $x$) in such a way that it is difficult to detect is referred to as quiet planting [ZK08, ZK11]. Roughly speaking, our quiet planting scheme draws $W \sim \mathrm{GOE}(n)$ and then rotates the top eigenspace of $W$ to align with a random hypercube vector $x$, while leaving the eigenvalues of $W$ unchanged. We remark that the more straightforward planting scheme, with $\widetilde{W} = W + \lambda x x^\top$, is not quiet because it changes the largest eigenvalue of $W$ [FP07]. The question of how to design optimal quiet planting schemes in general remains an interesting open problem.
The final ingredient in our proof is to argue that detection in the spiked Wishart model is computationally hard below the spectral threshold. We do this through a calculation involving the projection of the likelihood ratio between the "null" and "planted" distributions of this model onto the subspace of low-degree polynomials. This method may be viewed as an implementation of the intuitive idea that the correct strategy for quiet planting is to match the low-degree moments of the distributions $\mathbb{P}_n$ and $\mathbb{Q}_n$. We discuss the details of this "low-degree method" further in Section 2.4.
Our results on hardness in the spiked Wishart model may be of independent interest: our calculations indicate that, for a large class of spike priors, no polynomial-time algorithm can successfully distinguish the spiked and unspiked models below the classical spectral threshold [BBP05, BS06], both in the negatively-spiked and positively-spiked regimes.
2.1 Probability Theory
All our asymptotic notation (e.g., $O(\cdot)$, $o(\cdot)$, $\Omega(\cdot)$, $\omega(\cdot)$) pertains to the limit $n \to \infty$. We consider parameters of the problem (e.g., $\gamma$, $\beta$, $\epsilon$) to be held fixed as $n \to \infty$. Thus, the constants hidden by $O(\cdot)$ and $\Omega(\cdot)$ do not depend on $n$ but may depend on the other parameters.
If $(\Omega_n, \mathcal{F}_n, \mathbb{P}_n)$ is a sequence of probability spaces, and $(A_n)$ is a sequence of events with $A_n \in \mathcal{F}_n$, then we say $A_n$ holds with high probability if $\mathbb{P}_n(A_n) \to 1$ as $n \to \infty$.
It need not be the case that $\sigma^2$ is equal to the variance of $X$, but it can be shown that $\mathrm{Var}(X) \leq \sigma^2$ if $X$ is subgaussian with variance proxy $\sigma^2$. The name subgaussian refers to the fact that
$\exp(\sigma^2 t^2/2)$ is the moment-generating function of the Gaussian distribution $\mathcal{N}(0, \sigma^2)$. A random variable with law $\mathcal{N}(0, \sigma^2)$ is then trivially subgaussian with variance proxy $\sigma^2$. A more interesting example is that any bounded centered random variable is subgaussian: by Hoeffding's lemma, if $a \leq X \leq b$ almost surely and $\mathbb{E}[X] = 0$, then $X$ is subgaussian with variance proxy $(b-a)^2/4$.
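For a concrete instance of Hoeffding's lemma: a Rademacher variable is bounded in $[-1, 1]$ and centered, so it should be subgaussian with variance proxy $(1 - (-1))^2/4 = 1$. The sketch below (an illustration, not part of the formal development) checks the resulting MGF bound $\mathbb{E}[e^{tX}] = \cosh(t) \leq e^{t^2/2}$ on a grid of values of $t$.

```python
import math

# Rademacher X takes values +-1 with equal probability, so E[exp(tX)] = cosh(t).
# Hoeffding's lemma predicts the subgaussian bound exp(sigma^2 * t^2 / 2) with
# variance proxy sigma^2 = (1 - (-1))^2 / 4 = 1.
for t in [k / 10 for k in range(-50, 51)]:
    mgf = math.cosh(t)
    bound = math.exp(t * t / 2)
    assert mgf <= bound + 1e-12
print("MGF bound holds on the grid")
```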
We next give some background facts from random matrix theory. Their proofs and further information may be found in a general reference such as [AGZ10].
The Gaussian orthogonal ensemble $\mathrm{GOE}(n)$ is a probability distribution over symmetric matrices $W \in \mathbb{R}^{n \times n}$, under which $W_{ii} \sim \mathcal{N}(0, 2/n)$ and $W_{ij} \sim \mathcal{N}(0, 1/n)$ when $i \neq j$, where the entries $W_{ij}$ are independent for distinct pairs $\{i, j\}$ with $i \leq j$.
The choice of variances ensures the following crucial invariance property of $\mathrm{GOE}(n)$.
For any fixed orthogonal matrix $Q \in O(n)$, if $W \sim \mathrm{GOE}(n)$, then the law of $Q^\top W Q$ is also $\mathrm{GOE}(n)$.
Our scaling of the entries of $W$ is chosen to ensure a spectrum of constant width, as shown by the following classical result.
Let $W \sim \mathrm{GOE}(n)$. Then, almost surely, $\lambda_{\max}(W) \to 2$ as $n \to \infty$.
In particular, for any $\epsilon > 0$, $\lambda_{\max}(W) \leq 2 + \epsilon$ with high probability.
Furthermore, by Wigner's semicircle law [Wig93], the empirical distribution of eigenvalues of $W$ converges weakly to a semicircle distribution supported on $[-2, 2]$.
2.2 Constrained PCA
A constraint set is a sequence $\mathcal{S} = (\mathcal{S}_n)_{n \in \mathbb{N}}$ where $\mathcal{S}_n \subseteq \mathbb{R}^n$. The constrained principal component analysis (PCA) problem with constraint set $\mathcal{S}$, denoted $\mathrm{PCA}(\mathcal{S})$, is
$$M(W) = \max_{x \in \mathcal{S}_n} x^\top W x, \quad \text{where } W \sim \mathrm{GOE}(n).$$
We will work only with constraint sets supported on vectors of approximately unit norm.
Several problems previously considered in the literature may be described in the constrained PCA framework:
the positive PCA null model: $\mathcal{S}_n = \{x \in \mathbb{R}^n : \|x\| = 1,\ x_i \geq 0 \text{ for all } i\}$ [MR16].
Our results in this paper will apply to the first two examples: the SK model, and sparse PCA on a Wigner matrix when the sparsity scales proportionally to $n$.
Let $c$ be a (randomized) algorithm that takes a square matrix $W$ as input and outputs a number $c(W)$. We say that $c$ certifies the value $m$ on $\mathrm{PCA}(\mathcal{S})$ if
for any symmetric matrix $W \in \mathbb{R}^{n \times n}$, $c(W) \geq \max_{x \in \mathcal{S}_n} x^\top W x$, and
if $W \sim \mathrm{GOE}(n)$ then $c(W) \leq m$ with high probability.
For purposes of generality, we have allowed $c$ to be a randomized algorithm (i.e., it is allowed to use randomness in its computations, but the output must be a true upper bound almost surely). We do not expect certification algorithms to need randomness in an essential way, but sometimes it may be convenient, e.g., to obtain a random initialization for an iterative optimization procedure.
2.3 Spiked Wishart Models
A normalized spike prior is a sequence $\mathcal{X} = (\mathcal{X}_n)_{n \in \mathbb{N}}$ where $\mathcal{X}_n$ is a probability distribution over $\mathbb{R}^n$, such that if $x \sim \mathcal{X}_n$ then $\|x\| \to 1$ in probability as $n \to \infty$.
Definition 2.10 (Spiked Wishart model).
Let $\mathcal{X}$ be a normalized spike prior, let $\gamma > 0$, and let $\beta \geq -1$. Let $N = N(n) = \lfloor n/\gamma \rfloor$. We define two probability distributions over $N$-tuples $(y_1, \ldots, y_N)$ of vectors in $\mathbb{R}^n$:
Under $\mathbb{Q}_n$, draw $y_i \sim \mathcal{N}(0, I_n)$ independently for $i \in [N]$.
Under $\mathbb{P}_n$, draw $x \sim \mathcal{X}_n$. If $1 + \beta \|x\|^2 \geq 0$, then draw $y_i \sim \mathcal{N}(0, I_n + \beta x x^\top)$ independently for $i \in [N]$. Otherwise, draw $y_i \sim \mathcal{N}(0, I_n)$ independently for $i \in [N]$.
We call $\mathbb{P}_n$ the planted model and $\mathbb{Q}_n$ the null model. Taken together, we refer to these two distributions as the spiked Wishart model. For fixed $\gamma$ and $\beta$ we denote the sequence $(\mathbb{P}_n, \mathbb{Q}_n)_{n \in \mathbb{N}}$ by $\mathcal{W}(\gamma, \beta, \mathcal{X})$.
Several remarks on this definition are in order. First, we make the explicit choice $N = \lfloor n/\gamma \rfloor$ for concreteness, but our results apply to any choice of $N = N(n)$ for which $n/N \to \gamma$ as $n \to \infty$.
Second, often the Wishart model is described in terms of the distribution of the sample covariance matrix $\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N y_i y_i^\top$. We instead work directly with the samples $y_1, \ldots, y_N$ so as not to restrict ourselves to algorithms that only use the sample covariance matrix. (This modification only makes our results on computational hardness of detection more general.)
Finally, the definition of $\mathbb{P}_n$ has two cases in order to ensure that the covariance matrix $I_n + \beta x x^\top$ used as a parameter in the Gaussian distribution is positive semidefinite. We will work in the setting where the first case ($1 + \beta \|x\|^2 \geq 0$) occurs with high probability. Priors for which this case occurs almost surely will be especially important, so we define the following terminology for this situation.
Let $\beta \geq -1$ and let $\mathcal{X}$ be a normalized spike prior. We say that $\mathcal{X}$ is $\beta$-good if when $x \sim \mathcal{X}_n$ then $1 + \beta \|x\|^2 \geq 0$ almost surely.
We will often consider spike priors having i.i.d. entries.
Let $\pi$ be a probability distribution over $\mathbb{R}$ such that $\mathbb{E}_{u \sim \pi}[u] = 0$ and $\mathbb{E}_{u \sim \pi}[u^2] = 1$. Let $\mathrm{iid}(\pi)$ denote the normalized spike prior that draws each entry of $x$ independently as $u_i/\sqrt{n}$ with $u_i \sim \pi$. (We do not allow $\pi$ to depend on $n$.)
We will sometimes need to slightly modify the spike prior to ensure that it is $\beta$-good and has bounded norm.
For a normalized spike prior $\mathcal{X}$, let the $\beta$-truncation of $\mathcal{X}$ denote the following normalized spike prior $\mathcal{X}'$. To sample from $\mathcal{X}'_n$, first sample $x \sim \mathcal{X}_n$ and let
$$x' = \begin{cases} x & \text{if } 1 + \beta \|x\|^2 \geq 0 \text{ and } \|x\| \leq 2, \\ 0 & \text{otherwise.} \end{cases} \qquad (8)$$
If $\beta > -1$ then since $\mathcal{X}$ is normalized ($\|x\| \to 1$ in probability), the first case of (8) occurs with high probability. The upper bound $\|x\| \leq 2$ is for technical convenience, and there is nothing special about the constant $2$. Note also that the $\beta$-truncation of an i.i.d. prior is no longer i.i.d.
We consider the algorithmic task of distinguishing between $\mathbb{P}_n$ and $\mathbb{Q}_n$ in the following sense.
For sequences of distributions $(\mathbb{P}_n)$ and $(\mathbb{Q}_n)$ over measurable spaces $\Omega_n$, we say that an algorithm $f_n : \Omega_n \to \{\mathtt{p}, \mathtt{q}\}$ achieves strong detection between $(\mathbb{P}_n)$ and $(\mathbb{Q}_n)$ if
$$\lim_{n \to \infty} \left[ \mathbb{P}_n(f_n(Y) = \mathtt{q}) + \mathbb{Q}_n(f_n(Y) = \mathtt{p}) \right] = 0.$$
The celebrated BBP transition [BBP05] implies a spectral algorithm for strong detection in the spiked Wishart model whenever $\gamma < \beta^2$.
Let $\mathcal{X}$ be any normalized spike prior. If $\gamma < \beta^2$ then there exists a polynomial-time algorithm for strong detection in $\mathcal{W}(\gamma, \beta, \mathcal{X})$.
The algorithm computes the largest eigenvalue (if $\beta > 0$) or smallest eigenvalue (if $\beta < 0$) of the sample covariance matrix $\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N y_i y_i^\top$. This eigenvalue converges almost surely to a limiting value which is different under $\mathbb{P}_n$ and $\mathbb{Q}_n$.
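The spectral test above can be sketched numerically for a negative spike (illustrative parameters of our own choosing: $\gamma = 0.25 < \beta^2 = 0.64$, above the detection threshold; the limiting values quoted in the comments are the standard BBP-type predictions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, beta = 200, 800, -0.8        # gamma = n/N = 0.25 < beta^2 = 0.64 ("easy" regime)
x = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)

def min_cov_eig(Y):
    """Smallest eigenvalue of the sample covariance (1/N) sum_i y_i y_i^T."""
    return np.linalg.eigvalsh(Y.T @ Y / len(Y))[0]

Y_null = rng.normal(size=(N, n))                       # y_i ~ N(0, I_n)
G = rng.normal(size=(N, n))
# y_i ~ N(0, I_n + beta x x^T): rescale each sample's component along x.
Y_planted = G + (np.sqrt(1 + beta) - 1) * np.outer(G @ x, x)

# Under the null, the smallest eigenvalue approaches the bulk edge
# (1 - sqrt(gamma))^2 = 0.25; under the planted model it approaches the
# outlier (1 + beta)(1 + gamma/beta) ~ 0.14, so thresholding separates them.
threshold = 0.19
print(min_cov_eig(Y_null) > threshold, min_cov_eig(Y_planted) > threshold)
```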
In this paper we will argue (see Corollary 3.3) that if $\mathcal{X}$ is an i.i.d. subgaussian prior (in the sense of Definition 2.12 with $\pi$ subgaussian), then no polynomial-time algorithm achieves strong detection below the BBP threshold (i.e., when $\gamma > \beta^2$). It is known that for some priors, there is an exponential-time algorithm for strong detection below the BBP threshold [PWBM18b]. For very sparse priors, e.g., if $x$ is supported on $o(n)$ entries, we enter the sparse PCA regime where polynomial-time strong detection is sometimes possible below the BBP threshold (see, e.g., [JL04, DM14b]). Note that a normalized spike prior with this level of sparsity cannot take the form $\mathrm{iid}(\pi)$, since we require $\pi$ to be independent of $n$.
2.4 The Low-Degree Likelihood Ratio
The works [HS17, Hop18] proposed a strikingly simple method for predicting whether Bayesian inference problems are computationally easy or hard. This method is known to recover widely-conjectured computational thresholds for many high-dimensional inference problems such as planted clique, densest-$k$-subgraph, random constraint satisfaction, community detection in the stochastic block model, and sparse PCA (see [Hop18]). We now give an overview of this method.
Consider the problem of distinguishing two simple hypotheses $\mathbb{P}_n$ and $\mathbb{Q}_n$, which are probability distributions on some domain $\Omega_n$ (where typically the dimension of $\Omega_n$ grows with $n$). One example is the spiked Wishart model for some fixed choice of the parameters $\gamma$, $\beta$, $\mathcal{X}$. The idea is to take low-degree polynomials (for some notion of "low") as a proxy for polynomial-time algorithms and examine whether there is a sequence of low-degree polynomials that can distinguish $\mathbb{P}_n$ from $\mathbb{Q}_n$.
It will be convenient to take $\mathbb{Q}_n$ as the "null" distribution, which is often i.i.d. Gaussian (as in the Wishart case) or i.i.d. Rademacher (having entries i.i.d. taking values $\pm 1$ with equal probability). The distribution $\mathbb{Q}_n$ induces an inner product on square-integrable functions given by $\langle f, g \rangle = \mathbb{E}_{Y \sim \mathbb{Q}_n}[f(Y) g(Y)]$, as well as a corresponding norm $\|f\| = \langle f, f \rangle^{1/2}$. For $D \in \mathbb{N}$, let $\mathcal{V}_{\leq D}$ denote the space of polynomial functions of degree at most $D$. For a function $f$, let $f^{\leq D}$ denote the orthogonal projection (with respect to $\langle \cdot, \cdot \rangle$) of $f$ onto $\mathcal{V}_{\leq D}$. The following result relates the distinguishing power of low-degree polynomials to the so-called low-degree likelihood ratio.
Theorem 2.16 ([HS17]).
Let $\mathbb{P}_n$ and $\mathbb{Q}_n$ be probability distributions on $\Omega_n$ for each $n$. Suppose $\mathbb{P}_n$ is absolutely continuous with respect to $\mathbb{Q}_n$, so that the likelihood ratio $L_n = \mathrm{d}\mathbb{P}_n / \mathrm{d}\mathbb{Q}_n$ is defined. Then
$$\max_{f \in \mathcal{V}_{\leq D},\, f \neq 0} \frac{\mathbb{E}_{Y \sim \mathbb{P}_n}[f(Y)]}{\sqrt{\mathbb{E}_{Y \sim \mathbb{Q}_n}[f(Y)^2]}} = \|L_n^{\leq D}\|. \qquad (9)$$
We include the short proof here for completeness.
The left-hand side can be rewritten as
$$\max_{f \in \mathcal{V}_{\leq D},\, f \neq 0} \frac{\langle f, L_n \rangle}{\|f\|},$$
using the identity $\mathbb{E}_{Y \sim \mathbb{P}_n}[f(Y)] = \mathbb{E}_{Y \sim \mathbb{Q}_n}[L_n(Y) f(Y)] = \langle f, L_n \rangle$; so by basic linear algebra (in particular, the variational description of orthogonal projection), the maximum is attained by taking $f = L_n^{\leq D}$. ∎
Note that the left-hand side of (9) is a heuristic measure of whether there is a degree-$D$ polynomial that can distinguish $\mathbb{P}_n$ from $\mathbb{Q}_n$. Thus we expect $\|L_n^{\leq D}\| = \omega(1)$ if there is a degree-$D$ polynomial that achieves strong detection, and $\|L_n^{\leq D}\| = O(1)$ if there is no such polynomial.
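As a toy illustration of computing this norm (our own example, not one from the formal development): for the one-dimensional Gaussian mean-shift pair $\mathbb{P} = \mathcal{N}(\lambda, 1)$ versus $\mathbb{Q} = \mathcal{N}(0, 1)$, the likelihood ratio $L(y) = \exp(\lambda y - \lambda^2/2)$ has coefficient $\lambda^d/\sqrt{d!}$ on the degree-$d$ normalized Hermite polynomial, so $\|L^{\leq D}\|^2 = \sum_{d=0}^{D} \lambda^{2d}/d!$, which increases to the full norm $\|L\|^2 = e^{\lambda^2}$.

```python
import math

def low_degree_norm_sq(lam, D):
    """||L^{<=D}||^2 for P = N(lam, 1) vs Q = N(0, 1): the Hermite expansion
    of L(y) = exp(lam*y - lam^2/2) puts coefficient lam^d / sqrt(d!) on the
    degree-d normalized Hermite polynomial, so the squared norm is a partial
    sum of the series for exp(lam^2)."""
    return sum(lam ** (2 * d) / math.factorial(d) for d in range(D + 1))

lam = 1.5
full_norm_sq = math.exp(lam ** 2)    # ||L||^2 = exp(lam^2)
for D in [0, 2, 5, 10]:
    print(D, round(low_degree_norm_sq(lam, D), 4))
print(round(full_norm_sq, 4))
```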
We take $O(\log n)$-degree polynomials as a proxy for polynomial-time computable functions. One justification for this is that, in practice, many polynomial-time algorithms compute the leading eigenvalue of a symmetric matrix whose entries are constant-degree polynomials in the data; in fact, there is formal evidence that such low-degree spectral methods are as powerful as the sum-of-squares hierarchy [HKP17]. Typically, $O(\log n)$ rounds of power iteration are required to compute the leading eigenvalue, i.e., we can distinguish $\mathbb{P}_n$ from $\mathbb{Q}_n$ using the polynomial $Y \mapsto u^\top M(Y)^k u$ for some $k = O(\log n)$. The above motivates the following informal conjecture.
For "nice" distributions $\mathbb{P}_n$ and $\mathbb{Q}_n$, if $\|L_n^{\leq D}\| = O(1)$ for some $D = D(n) \geq (\log n)^{1+\epsilon}$ (for some $\epsilon > 0$), then there is no randomized polynomial-time algorithm for strong detection between $\mathbb{P}_n$ and $\mathbb{Q}_n$.
This conjecture is useful because the norm of the low-degree likelihood ratio, $\|L_n^{\leq D}\|$, can be computed (or at least bounded) for various distributions such as the stochastic block model [HS17] and the spiked tensor model [Hop18].
The "converse" of Conjecture 2.17 is not quite expected to be true. If $\|L_n^{\leq D}\| = \omega(1)$ for some $D = O(\log n)$, then we should expect an $n^{O(\log n)}$-time algorithm but not necessarily a polynomial-time algorithm. This is because not every $O(\log n)$-degree polynomial can be evaluated in polynomial time.
Conjecture 2.17 is informal in that we have not specified what is meant by "nice" distributions. See [Hop18] for a precise variant of Conjecture 2.17; however, this variant uses the more refined notion of coordinate degree and so does not directly apply to the calculations we perform in this paper. Roughly speaking, "nice" distributions $\mathbb{P}_n$ and $\mathbb{Q}_n$ are assumed to satisfy the following:
$\mathbb{Q}_n$ should be a product distribution, e.g., i.i.d. Gaussian or i.i.d. Rademacher;
$\mathbb{P}_n$ should be sufficiently symmetric with respect to permutations of its coordinates; and
we should be able to further add a small amount of noise to $\mathbb{P}_n$, ruling out distributions with brittle algebraic structure (such as random satisfiable instances of XOR-SAT, which can be identified using Gaussian elimination [CD99]).
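To make the last point concrete: XOR-SAT is "brittle" because each clause is a linear equation over GF(2), so satisfiability is decided exactly by Gaussian elimination, an algebraic algorithm invisible to low-degree or spectral heuristics. A minimal sketch (our own illustration):

```python
import random

def xorsat_satisfiable(rows, rhs):
    """Decide satisfiability of an XOR-SAT system by Gaussian elimination
    over GF(2).  Each row is a list of variable indices whose XOR must
    equal the corresponding bit of rhs."""
    pivots = {}                        # pivot bit -> (row mask, rhs bit)
    for row, b in zip(rows, rhs):
        mask = 0
        for i in row:
            mask ^= 1 << i
        while mask:
            p = mask.bit_length() - 1
            if p not in pivots:
                pivots[p] = (mask, b)  # register a new pivot row
                break
            pm, pb = pivots[p]
            mask ^= pm                 # eliminate the pivot variable
            b ^= pb
        else:
            if b == 1:
                return False           # equation reduced to 0 = 1
    return True

random.seed(0)
n, m = 30, 90
rows = [random.sample(range(n), 3) for _ in range(m)]
planted = [random.randrange(2) for _ in range(n)]
rhs_sat = [planted[i] ^ planted[j] ^ planted[k] for i, j, k in rows]
rhs_rand = [random.randrange(2) for _ in range(m)]   # overconstrained: almost surely unsatisfiable
print(xorsat_satisfiable(rows, rhs_sat), xorsat_satisfiable(rows, rhs_rand))
```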
3 Main Results
3.1 Spiked Wishart Models
We expect that Conjecture 2.17 applies to the spiked Wishart model. The following states this assumption formally.
Fix $\gamma > 0$, $\beta \geq -1$, and a normalized spike prior $\mathcal{X}$. Let $(\mathbb{P}_n)$ and $(\mathbb{Q}_n)$ be the sequences of planted and null models, respectively, of the spiked Wishart model $\mathcal{W}(\gamma, \beta, \mathcal{X})$. Define the likelihood ratio $L_n = \mathrm{d}\mathbb{P}_n / \mathrm{d}\mathbb{Q}_n$. If there exists some $\epsilon > 0$ and $D = D(n) \geq (\log n)^{1+\epsilon}$ such that $\|L_n^{\leq D}\| = O(1)$, then there is no randomized polynomial-time algorithm for strong detection in $\mathcal{W}(\gamma, \beta, \mathcal{X})$.
We now give bounds on $\|L_n^{\leq D}\|$.
Fix constants $\gamma > 0$ and $\beta \geq -1$.
1. Suppose $\gamma > \beta^2$. Let $\mathcal{X}$ be the $\beta$-truncation of $\mathrm{iid}(\pi)$, where $\pi$ is subgaussian with $\mathbb{E}[u] = 0$ and $\mathbb{E}[u^2] = 1$. Then, for any $D = D(n) = o(n/\log n)$, we have $\|L_n^{\leq D}\| = O(1)$.
2. Suppose $\gamma < \beta^2$. Let $\mathcal{X} = \mathrm{iid}(\pi)$, where $\pi$ is symmetric about zero with $\mathbb{E}[u^2] = 1$ and finite higher moments. Suppose also that $\mathcal{X}$ is $\beta$-good. Then, for any $D = D(n) = \omega(1)$, we have $\|L_n^{\leq D}\| = \omega(1)$.
Section 5 is devoted to the proof of Theorem 3.2. Part 1 of Theorem 3.2, when combined with Conjecture 3.1, implies that for i.i.d. subgaussian priors, strong detection is computationally hard below the BBP threshold.
Suppose Conjecture 3.1 holds. Fix constants $\gamma > 0$ and $\beta \geq -1$. Let $\pi$ be subgaussian with $\mathbb{E}[u] = 0$ and $\mathbb{E}[u^2] = 1$. Let $\mathcal{X}$ be either $\mathrm{iid}(\pi)$ or its $\beta$-truncation. If $\gamma > \beta^2$, then there is no randomized polynomial-time algorithm for strong detection in $\mathcal{W}(\gamma, \beta, \mathcal{X})$.
The case where $\mathcal{X}$ is the $\beta$-truncation of $\mathrm{iid}(\pi)$ follows immediately from Part 1 of Theorem 3.2. If strong detection is impossible for the $\beta$-truncation, then strong detection is also impossible for $\mathrm{iid}(\pi)$ itself, as these two spike priors differ with probability $o(1)$ (under the natural coupling). ∎
Part 2 of Theorem 3.2 is merely a sanity check to ensure that the low-degree likelihood ratio indeed exhibits a phase transition at $\gamma = \beta^2$ and does not predict computational hardness when $\gamma < \beta^2$ (as we know polynomial-time strong detection is possible in this regime; see Theorem 2.15). The assumption that $\pi$ is symmetric about zero is for convenience only and should not be essential.
In Part 1 of Theorem 3.2 and in Corollary 3.3, the requirement that $\mathcal{X}$ be a ($\beta$-truncated) i.i.d. prior can be relaxed. All that we require of $\mathcal{X}$ is that it is the $\beta$-truncation of a normalized spike prior that admits a local Chernoff bound, as described in Definition 5.11. One type of prior that does not have this property is one where $x$ is very sparse, e.g., supported on $o(n)$ entries; this is the sparse PCA regime (see e.g., [JL04, DM14b]) for which polynomial-time strong detection is sometimes possible below the BBP threshold.
Note that Part 1 of Theorem 3.2 holds for any $D = o(n/\log n)$, which is much larger than the $(\log n)^{1+\epsilon}$ required by Conjecture 3.1. Since we expect degree-$D$ polynomials to correspond to roughly $n^{\Theta(D)}$-time algorithms [Hop18], this suggests that the conclusion of Corollary 3.3 holds not only for polynomial-time algorithms but for $\exp(n^{1-\delta})$-time algorithms, for any $\delta > 0$. In other words, nearly-exponential time is required to achieve strong detection.
3.2 Constrained PCA
We now give a reduction from strong detection in the spiked Wishart model to certification in the constrained PCA problem.
Let $\mathcal{S}$ be a constraint set and let $\mathcal{X}$ be a normalized spike prior such that if $x \sim \mathcal{X}_n$ then $x \in \mathcal{S}_n$ with high probability. Suppose that for some $\epsilon > 0$ there is a randomized polynomial-time algorithm that certifies the value $2 - \epsilon$ on $\mathrm{PCA}(\mathcal{S})$. Then there exist $\gamma > 0$ and $\beta \in [-1, 0)$ (depending on $\epsilon$) such that there is a randomized polynomial-time algorithm for strong detection in $\mathcal{W}(\gamma, \beta, \mathcal{X})$.
We give the proof in Section 4. Note that the parameters above satisfy $\gamma > \beta^2$ (the "hard" regime).
Suppose Conjecture 3.1 holds. Let $\pi$ be subgaussian with $\mathbb{E}[u] = 0$ and $\mathbb{E}[u^2] = 1$. Let $\mathcal{S}$ be a constraint set such that, if $x \sim \mathrm{iid}(\pi)$, then $x \in \mathcal{S}_n$ with high probability. Then, for any $\epsilon > 0$, there is no randomized polynomial-time algorithm to certify the value $2 - \epsilon$ on $\mathrm{PCA}(\mathcal{S})$.
In particular, we obtain the hardness of improving on the spectral certificate in the SK model.
If Conjecture 3.1 holds, then, for any $\epsilon > 0$, there is no randomized polynomial-time algorithm to certify the value $2 - \epsilon$ on the SK problem $\mathrm{PCA}(\{\pm 1/\sqrt{n}\}^n)$.
Apply Corollary 3.9 with $\pi$ having the Rademacher distribution (equal to $\pm 1$ with equal probability) and $\mathcal{S}_n = \{\pm 1/\sqrt{n}\}^n$. ∎
4 Proof of Reduction from Spiked Wishart to Constrained PCA
Proof of Theorem 3.8.
Let $\mathcal{S}$ be a constraint set and let $\mathcal{X}$ be a normalized spike prior such that, if $x \sim \mathcal{X}_n$, then $x \in \mathcal{S}_n$ with high probability. Suppose that for some $\epsilon > 0$ there is a randomized polynomial-time algorithm $c$ that certifies the value $2 - \epsilon$ on $\mathrm{PCA}(\mathcal{S})$. We will show that this implies that there is a polynomial-time algorithm for strong detection in $\mathcal{W}(\gamma, \beta, \mathcal{X})$ with certain parameters $\gamma$ and $\beta$ (depending on $\epsilon$). Note that these parameters lie in the "hard" regime $\gamma > \beta^2$.
Our algorithm for detection in the Wishart model is as follows. Fix $\delta \in (0, 1)$ and $\beta \in [-1, 0)$, to be chosen later, with $\gamma$ such that $N = \lfloor n/\gamma \rfloor \leq (1 - \delta)n$. Since $\gamma > 1$ we have $N < n$ (for sufficiently large $n$). Given samples $y_1, \ldots, y_N$, let $V = \mathrm{span}(y_1, \ldots, y_N)$ and let $V^\perp$ be its orthogonal complement. We sample $\widetilde{W}$ having the distribution $\mathrm{GOE}(n)$ conditioned on the event that the span of the top $n - N$ eigenvectors of $\widetilde{W}$ is $V^\perp$. Concretely, we can obtain such a sample in the following way. Let $u_1, \ldots, u_{n-N}$ be a uniformly random orthonormal basis for $V^\perp$ and let $u_{n-N+1}, \ldots, u_n$ be a uniformly random orthonormal basis for $V$. Sample $W \sim \mathrm{GOE}(n)$ and let $\lambda_1 \geq \cdots \geq \lambda_n$ be the eigenvalues of $W$. Then, let $\widetilde{W} = \sum_{i=1}^n \lambda_i u_i u_i^\top$. Finally, run the certification algorithm for $\mathrm{PCA}(\mathcal{S})$ on $\widetilde{W}$. The detection algorithm then thresholds $c(\widetilde{W})$: it reports "planted" if $c(\widetilde{W}) > 2 - \epsilon$ and "null" otherwise.
We now prove that this indeed achieves strong detection in $\mathcal{W}(\gamma, \beta, \mathcal{X})$. First, if the samples are drawn from the null model $\mathbb{Q}_n$, then $V^\perp$ is a uniformly random $(n - N)$-dimensional subspace of $\mathbb{R}^n$, so by Proposition 2.4 the law of the matrix $\widetilde{W}$ constructed above is $\mathrm{GOE}(n)$. Thus $c(\widetilde{W}) \leq 2 - \epsilon$ with high probability by assumption, and therefore the algorithm outputs "null" with high probability, i.e., our algorithm correctly reports that the samples were drawn from the null model.
Next, suppose the samples are drawn from the planted model $\mathbb{P}_n$ with planted spike $x$. We will choose $\delta$ and $\beta$ so that $x^\top \widetilde{W} x > 2 - \epsilon$ with high probability. Since $x \in \mathcal{S}_n$ with high probability, this would imply $c(\widetilde{W}) \geq x^\top \widetilde{W} x > 2 - \epsilon$, so the algorithm will output "planted" with high probability, i.e., our algorithm will correctly report that the samples were drawn from the planted model.
It remains to show that $x^\top \widetilde{W} x > 2 - \epsilon$ with high probability. Let $\lambda_1 \geq \cdots \geq \lambda_n$ be the eigenvalues of $\widetilde{W}$ and let $u_1, \ldots, u_n$ be the corresponding (unit-norm) eigenvectors. By Proposition 2.5, with high probability, for all $i$, $|\lambda_i| \leq 2 + o(1)$. Furthermore, by the semicircle law [Wig93], $\lambda_{n-N} \geq 2 - g(\delta)$ with high probability, where $g$ is a function satisfying $g(\delta) \to 0$ as $\delta \to 0$ (recalling that $N \leq (1 - \delta)n$). Letting $\rho$ denote the norm of the orthogonal projection of $x$ onto $V^\perp$, we have, with high probability,
$$x^\top \widetilde{W} x = \sum_{i=1}^n \lambda_i \langle u_i, x \rangle^2 \geq (2 - g(\delta))\, \rho^2 - (2 + o(1))(\|x\|^2 - \rho^2).$$
Thus we need to upper bound $\|x\|^2 - \rho^2 = \|P x\|^2$, where $P$ denotes the orthogonal projection matrix onto $V$. Since $V$ is the span of $y_1, \ldots, y_N$, we have $P \preceq \hat{\Sigma} / \hat{\lambda}_{\min}$, where
$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N y_i y_i^\top$$
and $\hat{\lambda}_{\min}$ is the smallest nonzero eigenvalue of $\hat{\Sigma}$. (Here $\preceq$ denotes Loewner order.) Since $\hat{\Sigma}$ is a spiked Wishart matrix, it follows from Theorem 1.2 of [BS06] that its smallest nonzero eigenvalue converges almost surely to a positive limit as $n \to \infty$. Thus we have $\hat{\lambda}_{\min} = \Omega(1)$ with high probability. Therefore,
$$\|P x\|^2 = x^\top P x \leq \frac{x^\top \hat{\Sigma} x}{\hat{\lambda}_{\min}}.$$
We have $\mathbb{E}\, \langle y_i, x \rangle^2 = \|x\|^2 + \beta \|x\|^4 = 1 + \beta + o(1)$, and so $x^\top \hat{\Sigma} x = \frac{1}{N} \sum_{i=1}^N \langle y_i, x \rangle^2$ concentrates around $1 + \beta$; taking $\beta$ sufficiently close to $-1$ (and then $\delta$ sufficiently small) makes $\|P x\|^2$ small enough that $x^\top \widetilde{W} x > 2 - \epsilon$ with high probability, completing the proof.