# Optimal Average-Case Reductions to Sparse PCA: From Weak Assumptions to Strong Hardness

In the past decade, sparse principal component analysis has emerged as an archetypal problem for illustrating statistical-computational tradeoffs. This trend has largely been driven by a line of research aiming to characterize the average-case complexity of sparse PCA through reductions from the planted clique (PC) conjecture, which states that there is no polynomial-time algorithm to detect a planted clique of size $K = o(N^{1/2})$ in $G(N, 1/2)$. All previous reductions to sparse PCA either fail to show tight computational lower bounds matching existing algorithms or show lower bounds for formulations of sparse PCA other than its canonical generative model, the spiked covariance model. Also, these lower bounds all quickly degrade with the exponent in the PC conjecture. Specifically, when only given the PC conjecture up to $K = o(N^\alpha)$ where $\alpha < 1/2$, there is no sparsity level $k$ at which these lower bounds remain tight. If $\alpha < 1/3$, these reductions fail to even show the existence of a statistical-computational tradeoff at any sparsity $k$. We give a reduction from PC that yields the first full characterization of the computational barrier in the spiked covariance model, providing tight lower bounds at all sparsities $k$. We also show the surprising result that weaker forms of the PC conjecture up to clique size $K = o(N^\alpha)$ for any given $\alpha \in (0, 1/2]$ imply tight computational lower bounds for sparse PCA at sparsities $k = o(n^{\alpha/3})$. This shows that even a mild improvement in the signal strength needed by the best known polynomial-time sparse PCA algorithms would imply that the hardness threshold for PC is subpolynomial. This is the first instance of a suboptimal hardness assumption implying optimal lower bounds for another problem in unsupervised learning.

## 1 Introduction

Principal component analysis (PCA), the task of projecting multivariate samples onto the leading eigenvectors of their empirical covariance matrix, is one of the most popular dimension reduction techniques in statistics. However, in modern high-dimensional settings, PCA no longer provides a meaningful estimate of principal components. More precisely, the BBP transition in random matrix theory implies that PCA yields a statistically inconsistent estimator when the data is distributed according to the spiked covariance model baik2005phase ; paul2007asymptotics ; johnstone2009consistency . Sparse PCA was introduced in johnstoneSparse04 to alleviate this inconsistency in the high-dimensional setting and has found applications in a diverse range of fields. Examples include online visual tracking wang2013online naikal2011informative , image compression majumdar2009image , electrocardiography johnstone2009consistency , gene expression analysis zou2006sparse ; chun2009expression ; parkhomenko2009sparse ; chan2010using , RNA sequence classification tan2014classification and metabolomics studies allen2011sparse . Further background on sparse PCA can be found in wang2016statistical .

The canonical formulation of sparse PCA is the spiked covariance model, which has observed data $X_1, \dots, X_n$ distributed i.i.d. according to $\mathcal{N}(0, I_d + \theta v v^\top)$, where $v$ is a $k$-sparse unit vector and $\theta$ parametrizes the signal strength. The objective is either to estimate the direction of the spike or detect its existence. Detection is formulated as a hypothesis testing problem on the data with hypotheses

$$H_0: X \sim \mathcal{N}(0, I_d)^{\otimes n} \quad \text{and} \quad H_1: X \sim \mathcal{N}(0, I_d + \theta v v^\top)^{\otimes n}.$$
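As a concrete illustration of the two hypotheses, the following is a minimal sketch of how one might draw data from each. It is not taken from the paper: the function names, the choice of a uniformly random support, and the equal nonzero entries $1/\sqrt{k}$ for the spike are assumptions made for illustration. It uses the identity that $Z + \sqrt{\theta}\, g\, v$ with $Z \sim \mathcal{N}(0, I_d)$ and an independent $g \sim \mathcal{N}(0, 1)$ has covariance $I_d + \theta v v^\top$.

```python
import math
import random


def sample_spiked_covariance(n, d, k, theta, planted, rng):
    """Draw n samples in R^d from N(0, I_d) if planted is False, or from
    N(0, I_d + theta * v v^T) if planted is True.

    Assumption for illustration: v has entries 1/sqrt(k) on a uniformly
    random support of size k. Returns (samples, support).
    """
    support = sorted(rng.sample(range(d), k)) if planted else []
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        if planted:
            # X = Z + sqrt(theta) * g * v has covariance I_d + theta * v v^T.
            g = rng.gauss(0.0, 1.0)
            shift = math.sqrt(theta) * g / math.sqrt(k)
            for j in support:
                x[j] += shift
        samples.append(x)
    return samples, support


def variance_along(samples, direction):
    """Empirical second moment of the projections <x, direction>."""
    total = 0.0
    for x in samples:
        proj = sum(a * b for a, b in zip(x, direction))
        total += proj * proj
    return total / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, k, theta = 2000, 20, 4, 3.0
    X1, S = sample_spiked_covariance(n, d, k, theta, True, rng)
    v = [1.0 / math.sqrt(k) if j in S else 0.0 for j in range(d)]
    # Along the spike the variance should be close to 1 + theta.
    print("variance along v under H1:", variance_along(X1, v))
```

Under $H_1$ the variance along $v$ concentrates near $1 + \theta$, while under $H_0$ it concentrates near $1$ in every direction; the difficulty of detection comes from not knowing the support of $v$.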

There is an extensive literature on both the statistical limits and efficient algorithms for sparse PCA. A large number of algorithms solving sparse PCA have been proposed amini2009high ; ma2013sparse ; cai2013sparse ; berthet2013optimal ; berthet2013complexity ; shen2013consistency ; krauthgamer2015semidefinite ; deshpande2014sparse ; wang2016statistical . As shown in berthet2013optimal ; cai2015optimal ; wang2016statistical , the statistical limit of detection in the spiked covariance model is and the minimax rate of estimation under the norm is . These rates are achieved by some of the estimators in the papers listed above, none of which can be computed in polynomial time. In contrast, in the sparse regime of , the best known polynomial time algorithms require the much larger signal for detection and achieve a minimax rate of estimation under the norm of . This phenomenon raises the general question: is this statistical-computational gap inherent or are there better polynomial time algorithms?

Since the seminal paper of Berthet and Rigollet berthet2013complexity , sparse PCA has emerged as the archetypal example of a high-dimensional statistical problem with a statistical-computational gap and serves as an ideal case-study of the general phenomenon. Consequently, a line of research has aimed to characterize the average-case complexity of sparse PCA. This and prior work seeks to determine the computational phase diagram of sparse PCA by answering the following question.

###### Question 1.1.

For a given scaling of parameters , is sparse PCA information-theoretically impossible, efficiently solvable or in principle feasible but computationally hard?

There are several feasible approaches to providing an answer to this question. One taken in the literature is to show lower bounds for variants of sparse PCA in classes of algorithms such as the sum of squares hierarchy ma2015sum ; hopkins2017power and statistical query algorithms diakonikolas2016statistical ; lu2018edge . A more traditional complexity-theoretic approach is to show lower bounds against all polynomial time algorithms by reducing from a conjecturally hard problem. Since the seminal paper of Berthet and Rigollet, a growing line of research has shown computational lower bounds for sparse PCA through reductions from the planted clique (pc) conjecture berthet2013optimal ; berthet2013complexity ; wang2016statistical ; gao2017sparse ; brennan2018reducibility .

The pc problem is to detect or find a planted clique of size $K$ in the $N$-vertex Erdős-Rényi graph $G(N, 1/2)$. Because the largest clique in $G(N, 1/2)$ is with high probability of size roughly $2\log_2 N$, planted cliques of size can be detected in time by searching over all -node subsets. The pc conjecture is that there are no polynomial-time algorithms for this task if $K = o(N^{1/2})$, which we state formally in Section 2. Since its introduction in kuvcera1995expected and jerrum1992large , pc has been studied extensively. Spectral algorithms, approximate message passing, semidefinite programming, nuclear norm minimization and several other polynomial-time combinatorial approaches all appear to fail to recover the planted clique when alon1998finding ; feige2000finding ; mcsherry2001spectral ; feige2010finding ; ames2011nuclear ; dekel2014finding ; deshpande2015finding ; chen2016statistical . In support of the pc conjecture, it has been shown that cliques of size cannot be detected by the Metropolis process jerrum1992large , low-degree sum of squares relaxations barak2016nearly and statistical query algorithms feldman2013statistical . Recently, atserias2018clique showed that super-polynomial length regular resolution is required to certify that Erdős-Rényi graphs do not contain cliques of size . In the other direction, frieze2008new ; brubaker2009random provide an algorithm for finding cliques of size given the ability to compute the 2-norm of a certain random -parity tensor. (Footnote 1: Computing this 2-norm was shown to be hard in the worst case in hillar2013most , although its average-case complexity remains unknown.) Because pc is easier for larger $K$, the assumption that pc is hard for clique sizes $K = o(N^\alpha)$ is weaker for smaller $\alpha$.
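To make the planted clique problem and the exhaustive-search detector concrete, here is a minimal sketch using only the Python standard library. The function names and the small parameter values are assumptions chosen for illustration, not the paper's notation; the exhaustive search examines every vertex subset of the given size, so its cost grows roughly like $N^{\text{size}}$ and it is only practical for tiny graphs.

```python
import itertools
import random


def planted_clique_graph(n_vertices, clique_size, rng):
    """Sample G(N, 1/2) and plant a clique on a uniformly random vertex subset.

    Returns (adjacency sets, planted clique). clique_size = 0 gives the
    null model with no planted structure.
    """
    adj = {u: set() for u in range(n_vertices)}
    clique = set(rng.sample(range(n_vertices), clique_size))
    for u, w in itertools.combinations(range(n_vertices), 2):
        # Clique pairs are always joined; every other pair is a fair coin flip.
        if (u in clique and w in clique) or rng.random() < 0.5:
            adj[u].add(w)
            adj[w].add(u)
    return adj, clique


def has_clique(adj, size):
    """Exhaustively check every vertex subset of the given size."""
    for subset in itertools.combinations(adj, size):
        if all(w in adj[u] for u, w in itertools.combinations(subset, 2)):
            return True
    return False


if __name__ == "__main__":
    rng = random.Random(0)
    adj, clique = planted_clique_graph(20, 8, rng)
    print("planted clique:", sorted(clique))
    print("clique of size 8 found:", has_clique(adj, 8))
```

The search succeeds whenever the planted clique is larger than the cliques that typically occur in $G(N, 1/2)$, but it is far from polynomial time, which is the regime the pc conjecture concerns.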

Throughout this paper, we say that a computational lower bound is tight or optimal if it matches what is achievable by the best known polynomial-time algorithms. All of the previous reductions from pc to sparse PCA in berthet2013optimal ; berthet2013complexity ; wang2016statistical ; gao2017sparse ; brennan2018reducibility either fail to show tight computational lower bounds matching existing algorithms or show lower bounds for formulations of sparse PCA other than the spiked covariance model. In particular, a full characterization of the computational feasibility in the spiked covariance model has remained open. These lower bounds also all quickly degrade with the exponent in the pc conjecture. Specifically, when only given the pc conjecture up to $K = o(N^\alpha)$ where $\alpha < 1/2$, there is no sparsity level $k$ at which these lower bounds remain tight. If $\alpha < 1/3$, these reductions fail to even show the existence of a statistical-computational tradeoff at any sparsity $k$. This phenomenon is unsurprising: as the assumed lower bound for pc is loosened, it is intuitive that the resulting lower bound for sparse PCA always weakens as well.

The main results of this work are to resolve the feasibility diagram for a formulation of sparse PCA in the spiked covariance model and give an unintuitively strong reduction from pc. More precisely, our results are as follows:

• We show the surprising result that weaker forms of the pc conjecture up to clique size $K = o(N^\alpha)$ for any given $\alpha \in (0, 1/2]$ imply tight computational lower bounds for sparse PCA at sparsities $k = o(n^{\alpha/3})$.

• We give a reduction from pc that yields the first full characterization of the computational barrier in the spiked covariance model, providing tight lower bounds at all sparsities $k$. This completes the feasibility picture for the spiked covariance model depicted in Figure 1 and partially resolves a question raised in brennan2018reducibility .

This first result has several strong implications for the relationship between the computational complexity of sparse PCA and pc. Even a mild improvement in the signal strength needed by the best known polynomial-time sparse PCA algorithms, to below , would imply that the hardness threshold for pc is subpolynomial, rather than on the order $N^{1/2}$ as is widely conjectured. Whether or not the hardness threshold for planted clique is in fact at $N^{1/2}$ is irrelevant to the statistical-computational gap for sparse PCA in the practically relevant highly sparse regime. This result is also the first instance of a suboptimal hardness assumption implying optimal lower bounds for another problem in unsupervised learning. Before stating our results in Section 1.2, we review prior reductions to sparse PCA, state the resulting bounds and discuss why there is a fundamental limitation to the pre-existing techniques. We also give a high-level overview of how our reductions overcome these barriers.

### 1.1 Prior Reductions and Overcoming the Limits of Reducing Directly to Samples

As in most previous papers on sparse PCA, we primarily focus on the more sparse regime in which and satisfy the conditions and . The following is a brief overview of previous reductions to sparse PCA in the literature.

• Berthet-Rigollet (2013b): berthet2013optimal gives a reduction from planted clique to show hardness for solving sparse PCA with semidefinite programs that can be computed in polynomial time.

• Berthet-Rigollet (2013a): berthet2013complexity gives a reduction from planted clique to a composite vs. composite hypothesis testing formulation of sparse PCA, showing lower bounds for polynomial time algorithms that solve sparse PCA uniformly over all noise distributions satisfying a -dimensional sub-Gaussian concentration condition. In this formulation, berthet2013complexity shows tight computational lower bounds up to the threshold of when and , matching the best known polynomial time algorithms at these sparsities.

• Wang-Berthet-Samworth (2016): wang2016statistical shows computational lower bounds for the estimation task in a distributionally robust sub-Gaussian variant of sparse PCA similar to that in berthet2013complexity , again up to the conjectured threshold of .

• Gao-Ma-Zhou (2017): gao2017sparse shows the first computational lower bounds for sparse PCA in its canonical generative model, the spiked covariance model, in which the noise distribution is exactly a multivariate Gaussian. The reduction is to a simple vs. composite hypothesis testing formulation of sparse PCA and shows lower bounds up to the suboptimal threshold of when and . This lower bound is only tight when and .

• Brennan-Bresler-Huleihel (2018): brennan2018reducibility provides an alternative reduction based on random rotations to strengthen the lower bounds from gao2017sparse to hold for a simple vs. simple hypothesis testing formulation of the spiked covariance model up to the suboptimal threshold of . It also shows the first tight lower bounds in the high sparsity regime up to the threshold of when , which are matched by polynomial-time algorithms.

Assuming the hardness of planted clique with vertices and clique size , berthet2013optimal ; berthet2013complexity show lower bounds for sparse PCA with sparsities , samples and dimension , and satisfying (1) below. Given this same planted clique hardness assumption, gao2017sparse and brennan2018reducibility show lower bounds for sparse PCA with sparsity , samples and dimension , and satisfying (2), where

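$$(1) \quad \theta = \tilde{o}\left(\frac{k \cdot K}{n}\right) \qquad \qquad (2) \quad \theta = \tilde{o}\left(\frac{k^2}{n}\right).$$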

The bound (1), upon plugging in , implies hardness for any . When only given the planted clique conjecture up to $K = o(N^\alpha)$ for an $\alpha < 1/2$, there is no sparsity level $k$ at which either of these lower bounds remains tight to the conjectured boundary of . If $\alpha < 1/3$, then and these reductions fail to show computational lower bounds beyond the statistical limit of . A more detailed discussion of previous reductions from planted clique to sparse PCA is in Appendix B.

All of these prior reductions convert the rows of the pc adjacency matrix directly to samples from sparse PCA. This turns out to be lossy for a simple reason: while the planted clique itself is sparse both in its column and row support, sparse PCA data is independent across samples and thus structured only along one axis when arranged into an $n \times d$ matrix. Thus, directly converting a pc instance to samples from sparse PCA necessarily eliminates the structure across the rows and destroys some of the underlying signal.

The intuition behind our main reduction is that an instance of pc naturally corresponds to a shifted and rescaled empirical covariance matrix of samples from the spiked covariance model, rather than the samples themselves. This empirical covariance matrix, although it has dependent entries and is distributionally very different from planted clique, has sparse structure along both of its axes. We elaborate on this intuition in Appendix B. While previous reductions have all produced samples directly, we map an instance of pc to this empirical covariance matrix and then map to samples using an isotropic property of the Wishart distribution. This strategy yields an optimal relationship between that no longer degrades with in the assumption in the pc conjecture. Executing this strategy proves to be a delicate task because of dependence among the entries of the empirical covariance matrix, requiring a number of new average-case reduction primitives. Specifically, to overcome this dependence, we require several decomposition and comparison results for random matrices proven in Section 6 and crucially use a recent result in bubeck2016testing comparing goe to Wishart matrices. Our techniques are outlined in more detail in Section 1.3.

### 1.2 Summary of Results and Structure of the Paper

#### Strong Hardness from Weak Assumptions.

We now state our main result, which essentially shows that whether or not there are better polynomial time algorithms for pc is irrelevant to the statistical-computational gap for sparse PCA in the highly sparse regime of low , which consists of the “interpretable” levels of sparsity that often arise in applications. The following is an informal statement of our main theorem, which is formally stated in Corollary 5.1. Instead of only showing lower bounds from pc, we reduce from the more general planted dense subgraph detection problem, which entails testing for the presence of a planted in a where . The problem pc is recovered by setting and .

###### Theorem 1.1 (Informal Main Theorem).

Fix some and . If there is no randomized polynomial-time algorithm solving the planted dense subgraph detection problem with densities and on graphs of size with any dense subgraph size , then there is no randomized polynomial time algorithm solving detection in the spiked covariance model for all with , and .

Formulating our theorems in terms of planted dense subgraph means that they subsume statements based on pc hardness and have additional consequences such as for quasipolynomial time algorithms for sparse PCA, which are discussed in Section 9. The reduction proving our main theorem is sketched in Section 1.3 and formally described in Section 5. In Section 2, we formally introduce the models of sparse PCA we consider and the known statistical limits and polynomial-time algorithms for these models. In Sections 3 and 4, we introduce average-case reductions in total variation and a number of crucial subroutines that we will use in our reductions. In Sections 6 and 7, we prove our main theorem.

#### Completing the Computational Phase Diagram for the Spiked Covariance Model.

In Section 8, we give a reduction to the spiked covariance model combining a variant of the random rotations reduction of brennan2018reducibility with an internal subsampling reduction to yield tight lower bounds for all sparsities in a variant of the spiked covariance model based on the planted clique conjecture for cliques of size . Together with the lower bounds when in Theorem 8.5 of brennan2018reducibility , this yields the first complete set of computational lower bounds in a simple vs. composite formulation of the spiked covariance model of sparse PCA that are tight at all sparsities , partially resolving a question in brennan2018reducibility , which posed this problem for the simple vs. simple formulation. The fcspca simple vs. composite formulation of the spiked covariance model for which we characterize feasibility, as well as some other formulations of varying flexibility, are described in Section  2. The fcspca formulation allows a slight variation in the size of the support of and in the signal strength parameter, and is quite close to the original spiked covariance model.

The following is an informal statement of our second main theorem, which combines Corollary 8.1 with Theorem 8.5 of brennan2018reducibility . The associated diagram is given on the right of Figure 1.

###### Theorem 1.2 (Informal Second Main Theorem).

Fix some . If there is no randomized polynomial time algorithm solving the planted dense subgraph detection problem with densities and on graphs of size with dense subgraph size , then there is no randomized polynomial time algorithm solving detection in the fcspca simple vs. composite formulation of the spiked covariance model for all satisfying either:

• for sparsities and

• for sparsities and

In Section 8.2, we also show that a simple internal cloning reduction within sparse PCA shows that hardness at implies hardness at for any . This shows that the tight lower bounds derived from the planted clique conjecture up to in our main theorem can also be extended to lower bounds with that, although not tight, still show nontrivial statistical-computational gaps. In Section 9, we discuss simple extensions and implications of our results. We also state several problems left open in this work.

### 1.3 Overview of Techniques

To show our main result, it suffices to give a reduction mapping an instance of pc with to an instance of the spiked covariance model with parameters satisfying under both $H_0$ and $H_1$. Concretely, the desired should simultaneously map to and to , approximately under total variation, where $v$ is a $k$-sparse unit vector with nonzero entries . As mentioned above, our insight is that the optimal dependence can arise from first mapping approximately to the scaled empirical covariance matrix , and then mapping to . (Footnote 2: For technical reasons we actually map to where the are restrictions of the to coordinates containing all of the sparse principal component.) Note that if we represent edge versus no edge in planted clique by , then the expectations of the nondiagonal entries of the planted clique adjacency matrix and the empirical covariance matrix are equal up to a rescaling under both $H_0$ and $H_1$. As described next, one of the main challenges is capturing the dependencies among the random entries.
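The matching of expectations can be checked with a short calculation. The following is a sketch under stated assumptions rather than the paper's own derivation: we assume edges are encoded as $\pm 1$, that the spike $v$ has nonzero entries $1/\sqrt{k}$ on a support $S$ that coincides with the clique vertex set, and that the scaled empirical covariance matrix is $\sum_{t=1}^n X_t X_t^\top$. Writing $A$ for the adjacency matrix, for $i \neq j$ under $H_1$,

$$\mathbb{E}[A_{ij}] = \mathbf{1}\{i, j \in S\} \qquad \text{and} \qquad \mathbb{E}\left[ \sum_{t=1}^n X_t X_t^\top \right]_{ij} = n \theta \, v_i v_j = \frac{n\theta}{k} \cdot \mathbf{1}\{i, j \in S\},$$

while under $H_0$ both expectations are zero for $i \neq j$. The two mean matrices therefore agree off the diagonal after rescaling by $n\theta/k$.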

#### Empirical covariance matrix.

Under , it holds that is distributed as a isotropic Wishart matrix with degrees of freedom. A recent result in random matrix theory of bubeck2016testing implies that if , then converges in total variation to , where denotes the distribution of and . We see that the empirical covariance matrix has approximately independent entries and is quite simple in this case.

The distribution of is much more complicated under , exhibiting dependencies among the entries whose structure depends on the location of the planted spike. A useful characterization of the distribution seems hopeless in general, but a simplification occurs in the regime , whereby the entries become jointly Gaussian, although still dependent. In Section 6, we show that the distribution of under converges in total variation to where

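$$B = \sqrt{n} \cdot \left( \mathcal{N}(0, 1)^{\otimes d \times d} + \sqrt{\theta} \cdot v_S w_1^\top + \sqrt{\theta} \cdot w_2 v_S^\top + \theta g \, v_S v_S^\top \right)$$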

and and are independent. The independence between these terms implies that is close in total variation to the following independent sum involving three instances of simpler average-case problems:

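$$n \cdot I_m + \sqrt{\frac{n}{6}} \left( M + M^\top + C_L + C_L^\top + C_R + C_R^\top \right)$$

where $M$, $C_L$ and $C_R$ are such that

$$\begin{aligned}
H_0 &: M \sim \mathcal{N}(0, 1)^{\otimes d \times d} & H_1 &: M \sim \theta \sqrt{3} \cdot \left( \sqrt{n/2} + g \right) v v^\top + \mathcal{N}(0, 1)^{\otimes d \times d} \\
H_0 &: C_L \sim \mathcal{N}(0, 1)^{\otimes d \times d} & H_1 &: C_L \sim \sqrt{3\theta} \cdot w_1 v^\top + \mathcal{N}(0, 1)^{\otimes d \times d} \\
H_0 &: C_R \sim \mathcal{N}(0, 1)^{\otimes d \times d} & H_1 &: C_R \sim \sqrt{3\theta} \cdot v w_2^\top + \mathcal{N}(0, 1)^{\otimes d \times d}.
\end{aligned}$$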

Observe that $M$ is a variant of Gaussian biclustering with random signal strength parameter and $C_L$ and $C_R$ are weak instances of the spiked covariance model. Crucially, all of these matrices are expressed in terms of the spike direction in a way that makes it possible to map to them from an instance of planted clique without knowing the clique location. The reduction proceeds by cloning the planted clique instance to obtain multiple independent copies with slightly modified parameters, and then mapping to each of $M$, $C_L$ and $C_R$. To execute this step, we require a number of subroutines to overcome several distributional subtleties, as described next.
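As a sanity check on the normalization of this sum, consider the null hypothesis, under which the three matrices are independent with i.i.d. $\mathcal{N}(0, 1)$ entries. The following calculation is a sketch of why the factor $\sqrt{n/6}$ is the natural one; the symbol $W$ is introduced here only for illustration. Let $W = \sqrt{n/6} \left( M + M^\top + C_L + C_L^\top + C_R + C_R^\top \right)$. For $i \neq j$, the entry $W_{ij}$ is $\sqrt{n/6}$ times a sum of six independent standard Gaussians, and $W_{ii}$ is $\sqrt{n/6}$ times twice a sum of three, so

$$\mathrm{Var}(W_{ij}) = \frac{n}{6} \cdot 6 = n \quad (i \neq j) \qquad \text{and} \qquad \mathrm{Var}(W_{ii}) = \frac{n}{6} \cdot 4 \cdot 3 = 2n.$$

Thus $W$ is a symmetric Gaussian matrix whose off-diagonal entries have variance $n$ and whose diagonal entries have variance $2n$, which is the variance profile of $\sqrt{n}$ times a GOE-type matrix.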

#### Missing diagonal entries.

The covariance matrix describing the distribution of the spiked covariance model under has nontrivial signal on its diagonals, with th diagonal entry . However, the diagonal of the pc adjacency matrix is uniformly zero and thus does not contain any signal distinguishing clique vertices versus non-clique vertices. Without planting signal in the diagonal one cannot map to the desired distribution in total variation, and thus we are faced with the task of placing ’s in the correct entries without knowing the clique location. To produce the missing diagonal entries in the adjacency matrix of , we embed this adjacency matrix as a principal minor in a larger matrix using a procedure from brennan2019universality .

#### Further distributional maps.

A number of further distributional maps are described in detail in Section 4. To remove the symmetry of the pc adjacency matrix and produce multiple independent copies, we provide two cloning procedures. To Gaussianize the entries of these matrices, we use the rejection kernel framework introduced in brennan2018reducibility . We additionally introduce a reduction, termed random rotations, to produce the weak instances $C_L$ and $C_R$ of the spiked covariance model appearing above in the decomposition of the empirical covariance matrix . Because these have significantly reduced signal, we can use a lossy procedure. Now the reduction has arrived at an object close in total variation to and the task is to retrieve the samples from .

#### Inverse Wishart sampling.

In Section 6, it is shown that a consequence of the isotropy of independent multivariate Gaussians is that if is distributed as the scaled empirical covariance matrix of , then , where consists of the top rows of a matrix sampled from the Haar measure on the orthogonal group and . Here is the positive semidefinite square-root of . What is noteworthy is that the samples produced have distribution described by the true covariance , even though we started with the empirical covariance matrix . Therefore, generating and computing completes the reduction, after applying a further post-processing step of appropriately padding the data matrix.

#### Subsampling and cloning.

Finally in Section 8, we give subsampling and cloning internal reductions within sparse PCA which, when combined with our random rotation reduction, imply the lower bounds in our second main theorem. By internal reduction we mean that these procedures take as input and then output instances of sparse PCA, but with different parameters. This translates hardness from a given point in the feasibility diagram to other points and we apply it to a point where the random rotation reduction gives tight hardness, and , to deduce tight hardness for smaller and . The subsampling internal reduction is based on the observation that projecting down to a smaller coordinate subspace of data from the spiked covariance model results in another instance of the spiked covariance model. The formal statement follows by carefully accounting for the proportion of the support of the spike that is retained.
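The observation underlying the subsampling reduction can be written out in one line. The following is a sketch in notation introduced here for illustration: $T$ denotes the retained coordinate subset, $S$ the support of $v$, and we assume the nonzero entries of $v$ all have magnitude $1/\sqrt{k}$. If $X \sim \mathcal{N}(0, I_d + \theta v v^\top)$, then its restriction to $T$ satisfies

$$X_T \sim \mathcal{N}\left(0, \; I_{|T|} + \theta \, v_T v_T^\top \right) = \mathcal{N}\left(0, \; I_{|T|} + \theta \|v_T\|_2^2 \cdot u u^\top \right), \qquad u = \frac{v_T}{\|v_T\|_2}, \qquad \|v_T\|_2^2 = \frac{|S \cap T|}{k}.$$

So the projected data is again an instance of the spiked covariance model, now with sparsity $|S \cap T|$ and signal strength $\theta \cdot |S \cap T| / k$, which is why the proportion of the support that is retained has to be tracked carefully.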

### 1.4 Related Work on Statistical-Computational Gaps

This work is part of a growing body of literature giving rigorous evidence for computational-statistical gaps in high-dimensional inference problems. A more detailed survey of this area can be found in the introduction section of brennan2018reducibility .

#### Computational Lower Bounds for Sparse PCA.

In addition to average-case reductions, the average-case complexity of sparse PCA has been examined from the perspective of several restricted models of computation including the statistical query model and sum of squares semidefinite (SOS) programming hierarchy. Statistical query lower bounds for sparse PCA and related problems were shown in lu2018edge and diakonikolas2016statistical . Degree four SOS lower bounds for sparse PCA in the spiked covariance model were shown in ma2015sum . SOS lower bounds were also shown for the spiked Wigner model, a closely related model, in hopkins2017power . In brennan2018reducibility , average-case reductions from the planted clique conjecture were used to fully characterize the computational phase diagram of the spiked Wigner model. We remark that the entry-wise independence in the spiked Wigner model makes it an easier object to map to with average-case reductions. The worst-case complexity of sparse PCA as an approximation problem has also been considered in chan2016approximability .

#### Average-Case Reductions.

One reason behind the recent focus on showing hardness in restricted models of computation is that average-case reductions are inherently delicate, creating obstacles to obtaining satisfying hardness results. As described in Barak2017 , these technical obstacles have left us with an unsatisfying theory of average-case hardness. Unlike reductions in worst-case complexity, average-case reductions between natural decision problems need to precisely map the distributions on instances to one another in polynomial time without destroying the underlying signal. The delicate nature of this task has severely limited the development of techniques. For a survey of recent results in average-case complexity, we refer to bogdanov2006average .

In addition to those showing lower bounds for sparse PCA, there have been a number of average-case reductions from planted clique in both the computer science and statistics literature. These include reductions to testing $k$-wise independence alon2007testing , biclustering detection and recovery ma2015computational ; cai2015computational ; caiwu2018 , planted dense subgraph hajek2015computational , RIP certification wang2016average ; koiran2014hidden , matrix completion chen2015incoherence , minimum circuit size and minimum Kolmogorov time-bounded complexity hirahara2017average and sparse PCA berthet2013optimal ; berthet2013complexity ; wang2016statistical ; gao2017sparse . A web of reductions from planted clique to planted independent set, planted dense subgraph, sparse PCA, the spiked Wigner model, Gaussian biclustering and the subgraph stochastic block model was given in brennan2018reducibility . The planted clique conjecture has also been used as a hardness assumption for average-case reductions in cryptography juels2000hiding ; applebaum2010public , as described in Sections 2.1 and 6 of Barak2017 .

A number of average-case reductions in the literature have started with different average-case assumptions than the planted clique conjecture. Variants of planted dense subgraph have been used to show hardness in a model of financial derivatives under asymmetric information arora2011computational , link prediction baldin2018optimal , finding dense common subgraphs charikar2018finding and online local learning of the size of a label set awasthi2015label . Hardness conjectures for random constraint satisfaction problems have been used to show hardness in improper learning complexity daniely2014average , learning DNFs daniely2016complexity and hardness of approximation feige2002relations . There has also been a recent reduction from a hypergraph variant of the planted clique conjecture to tensor PCA zhang2017tensor .

### 1.5 Notation

In this paper, we adopt the following notation. Let denote the distribution law of a random variable and given two laws and , let denote where and are independent. Given a distribution , let denote the distribution of where the are i.i.d. according to . Similarly, let denote the distribution on with i.i.d. entries distributed as . Given a finite or measurable set , let denote the uniform distribution on . Let and denote total variation distance and divergence, respectively. Let denote a multivariate normal random vector with mean and covariance matrix , where is a positive semidefinite matrix. Let denote a -distribution with degrees of freedom. Let denote the set of all unit vectors with . Let , be the set of simple graphs on vertices and let the orthogonal group on be . Let denote the vector with if and if where .

## 2 Problem Formulations, Algorithms and Statistical Limits

We consider detection problems , wherein the algorithm is given a set of observations and tasked with distinguishing between two hypotheses:

• a uniform hypothesis $H_0$, under which observations are generated from the natural noise distribution for the problem; and

• a planted hypothesis $H_1$, under which observations are generated from the same noise distribution but with a latent planted sparse structure.

In all of the detection problems we consider, $H_0$ is a simple hypothesis consisting of a single distribution and $H_1$ is either also simple or a composite hypothesis consisting of several distributions. Typically, $H_0$ consists of the canonical noise distribution and $H_1$ consists of either the set of observation models associated with each possible planted sparse structure or a mixture over them. When $H_1$ is a composite hypothesis, it consists of a set of distributions $P_\theta$ where the parameter $\theta$ varies over a set $\Theta$. For notational convenience, we will also write $H_1$ for the set of distributions $\{P_\theta : \theta \in \Theta\}$, which is a singleton if $H_1$ is simple. As discussed in brennan2018reducibility and hajek2015computational , lower bounds for simple vs. simple hypothesis testing formulations are stronger and technically more difficult than for formulations involving composite hypotheses. The reductions in berthet2013complexity ; wang2016statistical are to composite vs. composite formulations of sparse PCA, the reduction in gao2017sparse is to a simple vs. composite formulation of the spiked covariance model and, when , the reduction in brennan2018reducibility is to the simple vs. simple formulation of the spiked covariance model.

Given an observation $X$, an algorithm $\mathcal{A}(X) \in \{0, 1\}$ solves the detection problem with nontrivial probability if there is an $\epsilon > 0$ such that its Type I+II error satisfies that

$$\limsup_{n \to \infty} \left( \mathbb{P}_{H_0}[\mathcal{A}(X) = 1] + \sup_{P \in H_1} \mathbb{P}_{X \sim P}[\mathcal{A}(X) = 0] \right) \le 1 - \epsilon$$

where $n$ is the parameter indicating the size of $X$. We refer to this quantity as the asymptotic Type I+II error of $\mathcal{A}$ for the problem $\mathcal{P}$. If the asymptotic Type I+II error of $\mathcal{A}$ is zero, then we say $\mathcal{A}$ solves the detection problem $\mathcal{P}$. A simple consequence of this definition is that if $\mathcal{A}$ achieves asymptotic Type I+II error $\epsilon$ for a composite testing problem with hypotheses $H_0$ and $H_1$, then it also achieves this same error on the simple problem with hypotheses $H_0$ and $H_1' : X \sim P$ where $P$ is any mixture of the distributions in $H_1$. We remark that simple vs. simple formulations are the hypothesis testing problems that correspond to average-case decision problems as in Levin’s theory of average-case complexity. In particular, showing a lower bound in Type I+II error for a simple vs. simple formulation shows that no polynomial time algorithm can solve with probability greater than where is a language separating and and . For example, could correspond to for planted clique. A survey of average-case complexity can be found in bogdanov2006average .
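The Type I+II error defined above can be estimated by simulation for any concrete test. The following is a minimal sketch for a simple vs. simple problem; the toy Gaussian hypotheses, the threshold `1.5` and all function names are illustrative assumptions and do not come from the paper. It also shows why an error of 1 is the trivial baseline: a test that ignores its input already achieves it.

```python
import random

def type_one_plus_two_error(test, sample_h0, sample_h1, trials, rng):
    """Monte Carlo estimate of P_H0[test = 1] + P_H1[test = 0]."""
    false_alarms = sum(test(sample_h0(rng)) == 1 for _ in range(trials))
    misses = sum(test(sample_h1(rng)) == 0 for _ in range(trials))
    return (false_alarms + misses) / trials

# Hypothetical toy problem (an assumption): H0 is N(0, 1), H1 is N(3, 1).
def sample_null(rng):
    return rng.gauss(0.0, 1.0)

def sample_alternative(rng):
    return rng.gauss(3.0, 1.0)

def threshold_test(x):
    # Output 1 (declare H1) when the observation exceeds the midpoint.
    return 1 if x > 1.5 else 0

def trivial_test(x):
    # Always declares H0: Type I error 0, Type II error 1.
    return 0

rng = random.Random(0)
err_threshold = type_one_plus_two_error(
    threshold_test, sample_null, sample_alternative, 5000, rng)
err_trivial = type_one_plus_two_error(
    trivial_test, sample_null, sample_alternative, 5000, rng)
```

For a composite $H_1$, the same estimate would have to be repeated for each distribution in $H_1$ and the worst case taken, matching the supremum in the definition.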

We now formally define planted clique, planted dense subgraph and sparse PCA. Let $G(n, p)$ denote an Erdős-Rényi random graph with edge probability $p$. Let $G(n, k, p, q)$ denote the graph formed by sampling $G(n, q)$ and replacing the induced graph on a subset of size $k$ chosen uniformly at random with a sample from $G(k, p)$. The planted dense subgraph problem is defined as follows.

###### Definition 2.1 (Planted Dense Subgraph).

The detection problem $\text{pds}(n, k, p, q)$ has hypotheses

$$H_0 : G \sim G(n, q) \quad \text{and} \quad H_1 : G \sim G(n, k, p, q)$$
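The two hypotheses can be sampled directly from their definitions. The sketch below is one straightforward way to do so with adjacency sets; the function names and the representation are assumptions made for illustration only.

```python
import random
from itertools import combinations

def sample_erdos_renyi(n, q, rng):
    """Adjacency sets of G(n, q) on vertices 0, ..., n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < q:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def sample_planted_dense_subgraph(n, k, p, q, rng):
    """Sample G(n, k, p, q): draw G(n, q), then replace the induced graph
    on a uniformly random k-subset S with a fresh sample from G(k, p)."""
    adj = sample_erdos_renyi(n, q, rng)
    subset = rng.sample(range(n), k)
    for u, v in combinations(subset, 2):
        # Erase the old edge, then resample it with probability p.
        adj[u].discard(v)
        adj[v].discard(u)
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj, set(subset)

rng = random.Random(0)
# With p = 1 the planted subgraph is a clique (the planted clique case).
graph, planted = sample_planted_dense_subgraph(30, 6, 1.0, 0.5, rng)
```

Returning the planted set alongside the graph is only a convenience for checking the sample; a detection algorithm sees the graph alone.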

The planted clique problem is then $\text{pc}(n, k, p) = \text{pds}(n, k, 1, p)$. There are many polynomial-time algorithms in the literature that find the planted clique in $G(n, k, 1, p)$, including approximate message passing, semidefinite programming, nuclear norm minimization and several combinatorial approaches feige2000finding ; mcsherry2001spectral ; feige2010finding ; ames2011nuclear ; dekel2014finding ; deshpande2015finding ; chen2016statistical . All of these algorithms require that $k = \Omega(\sqrt{n})$ if $p$ is constant, despite the fact that the largest clique in $G(n, p)$ contains $O(\log n)$ vertices with high probability. This leads to the following conjecture.

###### Conjecture 2.1 (pc Conjecture).

Fix some constant . Suppose that is a sequence of randomized polynomial time algorithms and is a sequence of positive integers satisfying that . Then if is an instance of , it holds that

$$\liminf_{n \to \infty} \left( \mathbb{P}_{H_0}[A_n(G) = 1] + \mathbb{P}_{H_1}[A_n(G) = 0] \right) \ge 1.$$

All of our hardness results can be obtained from this conjecture with . The pc Conjecture implies a similar barrier at for where are constants, which we refer to as the pds conjecture. If , then this follows from the hardness of and the reduction keeping each edge in the graph with probability . If , then applying the same reduction to the complement graph yields this barrier. We now give several formulations of the sparse PCA detection problem in the spiked covariance model, several of which were considered in brennan2018reducibility . The following is a general composite hypothesis testing formulation of the spiked covariance model that we will specialize afterwards, including as a simple hypothesis testing problem.

###### Definition 2.2 (Spiked Covariance Model of Sparse PCA).

The sparse PCA detection problem in the spiked covariance model has hypotheses

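$$H_0 : X_1, X_2, \dots, X_n \sim \mathcal{N}(0, I_d)^{\otimes n} \quad \text{and} \quad H_1 : X_1, X_2, \dots, X_n \sim \mathcal{N}(0, I_d + \theta' v v^\top)^{\otimes n} \text{ where } \theta' \in A_\theta \text{ and } v \in B_k$$

where $A_\theta$ is a subset of the positive real numbers and $B_k$ is a subset of the set of $k$-sparse unit vectors in $\mathbb{R}^d$.

Specializing this general definition with different sets $A_\theta$ and $B_k$, we arrive at the following variants of the spiked covariance model: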

• Uniform biased sparse PCA is denoted as ubspca where $A_\theta = \{\theta\}$ and $B_k$ is the set of all $k$-sparse unit vectors in $\mathbb{R}^d$ with nonzero coordinates all equal to $1/\sqrt{k}$.

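• Composite biased sparse PCA is denoted as cbspca where

$$A_\theta = \left[ \theta \left( 1 - \frac{\gamma}{\sqrt{k}} \right), \theta \left( 1 + \frac{\gamma}{\sqrt{k}} \right) \right] \quad \text{and} \quad B_k = \bigcup_{k' = k - \gamma \sqrt{k}}^{k} S_{k', d}$$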

for some fixed parameter satisfying that as .

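• Fully composite unbiased sparse PCA is denoted as fcspca where $A_\theta$ is the same as in cbspca and

$$B_k = \left\{ v \in \mathbb{S}^{d-1} : k - \tau \sqrt{k} \le \|v\|_0 \le k \text{ and } |v_i| \ge \frac{1}{\sqrt{k}} \text{ for } i \in \text{supp}(v) \right\}$$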

The canonical formulation of sparse PCA in the spiked covariance model is the simple vs. simple hypothesis testing formulation where is drawn uniformly at random from under . Note that randomly permuting the coordinates of any instance of ubspca exactly produces an instance of this simple vs. simple formulation. This implies that the two models are equivalent in the sense that an algorithm solves ubspca if and only if there is an algorithm solving this formulation with a runtime differing from that of by at most . Note that an instance of ubspca is an instance of cbspca which is an instance of fcspca, and thus lower bounds for these formulations are decreasing in strength. The lower bounds from our main theorem are for the strongest formulation ubspca and the complete set of tight lower bounds in our second main theorem are for fcspca. The relationships between these and other formulations of the spiked covariance model of PCA are further discussed in Appendix A.
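To make the simple vs. simple formulation concrete, the sketch below samples from it using the standard representation $X_i = g_i + \sqrt{\theta}\, z_i v$ with $g_i \sim \mathcal{N}(0, I_d)$ and $z_i \sim \mathcal{N}(0, 1)$ independent, which has covariance $I_d + \theta v v^\top$. It assumes that $v$ has exactly $k$ nonzero coordinates, all equal to $1/\sqrt{k}$, on a uniformly random support; the function names are hypothetical.

```python
import math
import random

def sample_spiked_covariance(n, d, k, theta, rng, planted=True):
    """Return n samples in R^d and the planted support (empty under H0)."""
    support = sorted(rng.sample(range(d), k)) if planted else []
    scale = math.sqrt(theta / k)  # sqrt(theta) * (1 / sqrt(k))
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # noise g_i
        if planted:
            z = rng.gauss(0.0, 1.0)  # shared scalar along the spike
            for j in support:
                x[j] += scale * z
        rows.append(x)
    return rows, support

def variance_along(rows, support):
    """Empirical second moment of <v, X> for v = 1_support / sqrt(|support|)."""
    k = len(support)
    proj = [sum(x[j] for j in support) / math.sqrt(k) for x in rows]
    return sum(t * t for t in proj) / len(proj)

rng = random.Random(1)
spiked_rows, spike_support = sample_spiked_covariance(2000, 8, 3, 3.0, rng)
null_rows, _ = sample_spiked_covariance(2000, 8, 3, 3.0, rng, planted=False)
```

Under $H_1$ the variance along $v$ concentrates around $1 + \theta$, while under $H_0$ it concentrates around 1 in every direction.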

#### Algorithms and Statistical Limits.

As mentioned in the introduction, the statistical limit of detection in the spiked covariance model is and the best-known algorithms solve the problem if . The information-theoretic lower bound for can be found in berthet2013optimal if and for all in cai2015optimal . Upper bounds at these two barriers are shown in berthet2013complexity if and with the following two algorithms:

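1. Semidefinite Programming: Form the empirical covariance matrix $\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$ and solve the convex program

$$\max_Z \ \text{Tr}(\hat{\Sigma} Z) \quad \text{s.t.} \quad \text{Tr}(Z) = 1, \ |Z|_1 \le k, \ Z \succeq 0$$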

Thresholding the resulting maximum solves the detection problem as long as .

2. $k$-Sparse Eigenvalue: Compute the $k$-sparse unit vector $u$ that maximizes $u^\top \hat{\Sigma} u$ and threshold the resulting maximum. This can be found by finding the leading eigenvector of each $k \times k$ principal submatrix of $\hat{\Sigma}$. This succeeds as long as .

Note that the semidefinite programming algorithm runs in polynomial time while $k$-sparse eigenvalue runs in exponential time.
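The $k$-sparse eigenvalue statistic can be sketched by brute force over supports, which makes the exponential cost visible: there are $\binom{d}{k}$ principal submatrices to examine. The code below is an illustration rather than the authors' implementation. It assumes the empirical covariance is the uncentered average of $X_i X_i^\top$, and it uses plain power iteration from a random start as a stand-in for an eigensolver.

```python
import math
import random
from itertools import combinations

def empirical_covariance(rows):
    """Uncentered empirical covariance (1/n) * sum_i X_i X_i^T."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / n for b in range(d)]
            for a in range(d)]

def top_eigenvalue(M, iters=300, seed=0):
    """Largest eigenvalue of a PSD matrix M via power iteration."""
    rng = random.Random(seed)
    m = len(M)
    u = [rng.gauss(0.0, 1.0) for _ in range(m)]
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[a][b] * u[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        u = [x / norm for x in w]
        # Rayleigh quotient u^T M u of the current unit vector.
        lam = sum(u[a] * sum(M[a][b] * u[b] for b in range(m))
                  for a in range(m))
    return lam

def k_sparse_eigenvalue(sigma, k):
    """Max of u^T sigma u over k-sparse unit vectors u (exponential time)."""
    d = len(sigma)
    return max(top_eigenvalue([[sigma[a][b] for b in S] for a in S])
               for S in combinations(range(d), k))

# Population covariance I + theta * v v^T with theta = 3, v = (1, 1, 0, 0) / sqrt(2).
sigma_example = [[2.5, 1.5, 0.0, 0.0],
                 [1.5, 2.5, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0]]
statistic = k_sparse_eigenvalue(sigma_example, 2)
```

On this example the maximizing support is the planted one and the statistic equals $1 + \theta = 4$; a detection test would compare the statistic against a threshold.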

## 3 Background on Average-Case Reductions

### 3.1 Reductions in Total Variation and the Computational Model

As introduced in berthet2013complexity and ma2015computational , we give approximate reductions in total variation to show that lower bounds for one hypothesis testing problem imply lower bounds for another. These reductions yield an exact correspondence between the asymptotic Type I+II errors of the two problems. This is formalized in the following lemma, which is Lemma 3.1 from brennan2018reducibility . Its proof is short and follows from the definition of total variation.

###### Lemma 3.1 (Lemma 3.1 in brennan2018reducibility ).

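Let $\mathcal{P}$ and $\mathcal{P}'$ be detection problems with hypotheses $H_0, H_1$ and $H_0', H_1'$, respectively. Let $X$ be an instance of $\mathcal{P}$ and let $Y$ be an instance of $\mathcal{P}'$. Suppose there is a polynomial time computable map $\mathcal{A}$ satisfying

$$d_{TV}\left( \mathcal{L}_{H_0}(\mathcal{A}(X)), \mathcal{L}_{H_0'}(Y) \right) + \sup_{P \in H_1} \inf_{\pi \in \Delta(H_1')} d_{TV}\left( \mathcal{L}_P(\mathcal{A}(X)), \int_{H_1'} \mathcal{L}_{P'}(Y) \, d\pi(P') \right) \le \delta$$

If there is a randomized polynomial time algorithm solving $\mathcal{P}'$ with Type I+II error at most $\epsilon$, then there is a randomized polynomial time algorithm solving $\mathcal{P}$ with Type I+II error at most $\epsilon + \delta$.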

If , then given a blackbox solver for , the algorithm that applies and then solves and requires only a single query to the blackbox. An algorithm that runs in randomized polynomial time refers to one that has access to independent random bits and must run in time where is the size of the instance of the problem. For clarity of exposition, in our reductions we assume that explicit expressions can be exactly computed and assume that and random variables can be sampled in operations. Note that this implies that random variables can be sampled in operations.

### 3.2 Properties of Total Variation

Throughout the proof of our main theorem, we will need a number of well-known facts and inequalities concerning total variation distance.

###### Fact 3.1.

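The distance $d_{TV}$ satisfies the following properties:

1. (Triangle Inequality) Given three distributions $P$, $Q$ and $R$ on a measurable space $(\mathcal{X}, \mathcal{B})$, it follows that

 $$d_{TV}(P, Q) \le d_{TV}(P, R) + d_{TV}(Q, R)$$

2. (Data Processing) Let $P$ and $Q$ be distributions on a measurable space $(\mathcal{X}, \mathcal{B})$ and let $f$ be a Markov transition kernel. If $A \sim P$ and $B \sim Q$ then

 $$d_{TV}(\mathcal{L}(f(A)), \mathcal{L}(f(B))) \le d_{TV}(P, Q)$$

3. (Tensorization) Let $P_1, P_2, \dots, P_n$ and $Q_1, Q_2, \dots, Q_n$ be distributions on a measurable space $(\mathcal{X}, \mathcal{B})$. Then

 $$d_{TV}\left( \prod_{i=1}^n P_i, \prod_{i=1}^n Q_i \right) \le \sum_{i=1}^n d_{TV}(P_i, Q_i)$$

4. (Conditioning on an Event) For any distribution $P$ on a measurable space $(\mathcal{X}, \mathcal{B})$ and event $A \in \mathcal{B}$, it holds that

 $$d_{TV}(P(\cdot | A), P) = 1 - P(A)$$

5. (Conditioning on a Random Variable) For any two pairs of random variables $(X, Y)$ and $(X', Y')$ each taking values in a measurable space $(\mathcal{X}, \mathcal{B})$, it holds that

 $$d_{TV}(\mathcal{L}(X), \mathcal{L}(X')) \le d_{TV}(\mathcal{L}(Y), \mathcal{L}(Y')) + \mathbb{E}_{y \sim Y}\left[ d_{TV}(\mathcal{L}(X | Y = y), \mathcal{L}(X' | Y' = y)) \right]$$

 where we define $d_{TV}(\mathcal{L}(X | Y = y), \mathcal{L}(X' | Y' = y)) = 1$ for all $y \notin \text{supp}(Y')$.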
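Several of these properties are easy to check numerically on finite sample spaces, where total variation is half the $\ell_1$ distance between probability mass functions. The sketch below represents distributions as dictionaries; the helper names and the example distributions are assumptions chosen for illustration.

```python
def tv(p, q):
    """Total variation distance between two pmfs given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

def product_law(laws):
    """Product distribution of independent coordinates, keyed by tuples."""
    out = {(): 1.0}
    for law in laws:
        out = {xs + (x,): w * px
               for xs, w in out.items() for x, px in law.items()}
    return out

def condition(p, event):
    """Conditional distribution P(. | event) for an event of positive mass."""
    mass = sum(w for x, w in p.items() if x in event)
    return {x: w / mass for x, w in p.items() if x in event}

P = {0: 0.5, 1: 0.5}
Q = {0: 0.25, 1: 0.75}
R = {0: 0.1, 1: 0.9}

# Triangle inequality: d(P, Q) <= d(P, R) + d(Q, R).
triangle_ok = tv(P, Q) <= tv(P, R) + tv(Q, R)

# Tensorization: d(P x P, Q x Q) <= d(P, Q) + d(P, Q).
tensor_lhs = tv(product_law([P, P]), product_law([Q, Q]))
tensor_ok = tensor_lhs <= 2 * tv(P, Q)

# Conditioning on an event: d(U(. | A), U) = 1 - U(A).
U = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
A = {0, 1}
conditioning_gap = tv(condition(U, A), U)
```

Here `tv(P, Q)` is 0.25, the two-fold product distance is 0.3125, and conditioning the uniform law on an event of mass 0.5 moves it by exactly 0.5.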

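Given an algorithm $\mathcal{A}$ and distribution $\mathcal{P}$ on inputs, let $\mathcal{A}(\mathcal{P})$ denote the distribution of $\mathcal{A}(X)$ induced by $X \sim \mathcal{P}$. If $\mathcal{A}$ has $k$ steps, let $\mathcal{A}_i$ denote the $i$th step of $\mathcal{A}$ and $\mathcal{A}_{i\text{-}j}$ denote the procedure formed by steps $i$ through $j$. Each time this notation is used, we clarify the intended initial and final variables when $\mathcal{A}_i$ and $\mathcal{A}_{i\text{-}j}$ are viewed as Markov kernels. The next lemma encapsulates the structure of all of our analyses of average-case reductions.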

###### Lemma 3.2.

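Let $\mathcal{A}$ be an algorithm that can be written as $\mathcal{A} = \mathcal{A}_m \circ \mathcal{A}_{m-1} \circ \cdots \circ \mathcal{A}_1$ for a sequence of steps $\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_m$. Suppose that the probability distributions $\mathcal{P}_0, \mathcal{P}_1, \dots, \mathcal{P}_m$ are such that $d_{TV}(\mathcal{A}_i(\mathcal{P}_{i-1}), \mathcal{P}_i) \le \epsilon_i$ for each $1 \le i \le m$. Then it follows that

$$d_{TV}(\mathcal{A}(\mathcal{P}_0), \mathcal{P}_m) \le \sum_{i=1}^m \epsilon_i$$

###### Proof.

This follows from a simple induction on $m$. Note that the case when $m = 1$ follows by definition. Now observe that by the data-processing and triangle inequalities in Fact 3.1, we have that if $\mathcal{B} = \mathcal{A}_{m-1} \circ \mathcal{A}_{m-2} \circ \cdots \circ \mathcal{A}_1$ then

$$\begin{aligned} d_{TV}(\mathcal{A}(\mathcal{P}_0), \mathcal{P}_m) &\le d_{TV}(\mathcal{A}_m \circ \mathcal{B}(\mathcal{P}_0), \mathcal{A}_m(\mathcal{P}_{m-1})) + d_{TV}(\mathcal{A}_m(\mathcal{P}_{m-1}), \mathcal{P}_m) \\ &\le d_{TV}(\mathcal{B}(\mathcal{P}_0), \mathcal{P}_{m-1}) + \epsilon_m \\ &\le \sum_{i=1}^m \epsilon_i \end{aligned}$$

where the last inequality follows from the induction hypothesis applied with $m - 1$ to $\mathcal{B}$. This completes the induction and proves the lemma. ∎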
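The lemma can be illustrated on a finite state space, where each step is a row-stochastic matrix acting on a probability vector. The sketch below computes both sides of the bound for a two-step example; the kernels, target distributions and function names are arbitrary assumptions chosen so the arithmetic is easy to follow.

```python
def push_forward(p, kernel):
    """Distribution of the output when input ~ p and kernel[x][y] = P(x -> y)."""
    m = len(kernel[0])
    return [sum(p[x] * kernel[x][y] for x in range(len(p))) for y in range(m)]

def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def lemma_sides(p0, kernels, targets):
    """Return (d_TV(A(P_0), P_m), sum_i eps_i) with
    eps_i = d_TV(A_i(P_{i-1}), P_i) and P_i = targets[i-1]."""
    eps, prev = [], p0
    for kernel, target in zip(kernels, targets):
        eps.append(tv(push_forward(prev, kernel), target))
        prev = target
    out = p0
    for kernel in kernels:  # run the full algorithm A = A_m o ... o A_1
        out = push_forward(out, kernel)
    return tv(out, targets[-1]), sum(eps)

p0 = [1.0, 0.0]
step1 = [[0.9, 0.1], [0.2, 0.8]]
step2 = [[0.5, 0.5], [0.3, 0.7]]
targets = [[0.8, 0.2], [0.5, 0.5]]
lhs, rhs = lemma_sides(p0, [step1, step2], targets)
```

In this example the per-step errors are 0.1 and 0.04, so the bound is 0.14, while the actual distance between the output of the two-step procedure and the final target is 0.02.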

## 4 Mapping to Submatrix Problems and χ2 Random Rotations

In this section, we introduce a number of subroutines that we will use in both of our reductions to sparse PCA in Sections 5 and 8.

### 4.1 Graph and Bernoulli Matrix Cloning

We begin with Graph-Clone and Bernoulli-Matrix-Clone, shown in Figure 2. The procedure Graph-Clone was introduced in brennan2019universality and produces several independent samples from a planted subgraph problem given a single sample. The procedure Bernoulli-Matrix-Clone is nearly identical, producing independent samples from planted Bernoulli submatrix problems given a single sample. Their properties as Markov kernels are captured in the next two lemmas.

Let denote the distribution of planted dense subgraph instances from conditioned on the subgraph being planted on the vertex set where . Similarly let denote the distribution of matrices in with independent entries where if and otherwise. Let denote the mixture of induced by picking the -subset uniformly at random.

###### Lemma 4.1 (Lemma 5.2 in brennan2019universality ).

Let , and satisfy that

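$$\frac{1 - p}{1 - q} \le \left( \frac{1 - P}{1 - Q} \right)^t \quad \text{and} \quad \left( \frac{P}{Q} \right)^t \le \frac{p}{q}$$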

Then the algorithm runs in time and satisfies that for each ,

$$\mathcal{A}(G(n, q)) \sim G(n, Q)^{\otimes t} \quad \text{and} \quad \mathcal{A}(G(n, S, p, q)) \sim G(n, S, P, Q)^{\otimes t}$$
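The two parameter inequalities in the lemma are simple to check for candidate values. The helper below is a hypothetical convenience, not part of Graph-Clone itself. It assumes $0 < q < 1$ and $0 < Q < 1$ so that the ratios are well defined, and it adds a small tolerance to absorb floating-point rounding.

```python
def clone_condition_holds(p, q, P, Q, t, tol=1e-12):
    """Check (1-p)/(1-q) <= ((1-P)/(1-Q))^t and (P/Q)^t <= p/q."""
    first = (1 - p) / (1 - q) <= ((1 - P) / (1 - Q)) ** t + tol
    second = (P / Q) ** t <= p / q + tol
    return first and second

# Planted clique input (p = 1, q = 1/2) cloned into t = 2 copies that
# keep P = 1: the second inequality then forces Q >= 1 / sqrt(2).
boundary_case = clone_condition_holds(1.0, 0.5, 1.0, 2 ** -0.5, 2)
too_sparse = clone_condition_holds(1.0, 0.5, 1.0, 0.6, 2)
```

With $p = P = 1$ the first inequality holds trivially, so the example isolates the constraint $Q^t \ge q$: two clones of a clique planted in $G(n, 1/2)$ need ambient density at least $1/\sqrt{2}$.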