# The power of sum-of-squares for detecting hidden structures

We study planted problems (finding hidden structures in random noisy inputs) through the lens of the sum-of-squares semidefinite programming hierarchy (SoS). This family of powerful semidefinite programs has recently yielded many new algorithms for planted problems, often achieving the best known polynomial-time guarantees in terms of accuracy of recovered solutions and robustness to noise. One theme in recent work is the design of spectral algorithms which match the guarantees of SoS algorithms for planted problems. Classical spectral algorithms are often unable to accomplish this: the twist in these new spectral algorithms is the use of spectral structure of matrices whose entries are low-degree polynomials of the input variables. We prove that for a wide class of planted problems, including refuting random constraint satisfaction problems, tensor and sparse PCA, densest-k-subgraph, community detection in stochastic block models, planted clique, and others, eigenvalues of degree-d matrix polynomials are as powerful as SoS semidefinite programs of roughly degree d. For such problems it is therefore always possible to match the guarantees of SoS without solving a large semidefinite program. Using related ideas on SoS algorithms and low-degree matrix polynomials (and inspired by recent work on SoS and the planted clique problem by Barak et al.), we prove new nearly-tight SoS lower bounds for the tensor and sparse principal component analysis problems. Our lower bounds for sparse principal component analysis are the first to suggest that going beyond existing algorithms for this problem may require sub-exponential time.

## 1 Introduction

Recent years have seen a surge of progress in algorithm design via the sum-of-squares (SoS) semidefinite programming hierarchy. Initiated by the work of [BBH12], who showed that polynomial time algorithms in the hierarchy solve all known integrality gap instances for Unique Games and related problems, a steady stream of works has developed efficient algorithms for both worst-case [BKS14, BKS15, BKS17, BGG16] and average-case problems [HSS15, GM15, BM16, RRS16, BGL16, MSS16a, PS17]. The insights from these works extend beyond individual algorithms to characterizations of broad classes of algorithmic techniques. In addition, for a large class of problems (including constraint satisfaction), the family of SoS semidefinite programs is now known to be as powerful as any semidefinite program (SDP) [LRS15].

In this paper we focus on recent progress in using sum-of-squares algorithms to solve average-case problems, and especially planted problems: problems that ask for the recovery of a planted signal perturbed by random noise. Key examples are finding solutions of random constraint satisfaction problems (CSPs) with planted assignments [RRS16] and finding planted optima of random polynomials over the $n$-dimensional unit sphere [RRS16, BGL16]. The latter formulation captures a wide range of unsupervised learning problems, and has led to many unsupervised learning algorithms with the best-known polynomial time guarantees [BKS15, BKS14, MSS16b, HSS15, PS17, BGG16].

In many cases, classical algorithms for such planted problems are spectral algorithms, i.e., they use the top eigenvector of a natural matrix associated with the problem input to recover a planted solution. The canonical algorithms for planted clique [AKS98], principal components analysis (PCA) [Pea01], and tensor decomposition (which is intimately connected to optimization of polynomials on the unit sphere) [Har70] are all based on this general scheme. In all of these cases, the algorithm employs the top eigenvector of a matrix which is either given as input (the adjacency matrix, for planted clique), or is a simple function of the input (the empirical covariance, for PCA).
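As a rough illustration of this classical scheme, the following sketch compares the top eigenvalue of the centered adjacency matrix on a sample from $G(n, 1/2)$ and on a sample with a planted clique. It is a minimal sketch, not the algorithm of [AKS98] itself: the helper names `sample_graph` and `top_eigenvalue`, the centering, and the sizes `n = 400`, `k = 100` are all assumptions made for the example.

```python
import numpy as np


def sample_graph(n, k, rng):
    """Adjacency matrix of G(n, 1/2); if k > 0, plant a clique on k random vertices."""
    upper = np.triu(rng.integers(0, 2, size=(n, n)), 1)
    A = upper + upper.T
    if k > 0:
        clique = rng.choice(n, size=k, replace=False)
        A[np.ix_(clique, clique)] = 1
        np.fill_diagonal(A, 0)
    return A


def top_eigenvalue(A):
    """Largest eigenvalue of the centered adjacency matrix (+1/-1 entries, zero diagonal)."""
    S = 2.0 * A - 1.0
    np.fill_diagonal(S, 0.0)
    return np.linalg.eigvalsh(S)[-1]  # eigvalsh returns eigenvalues in ascending order


rng = np.random.default_rng(0)
n, k = 400, 100  # illustrative sizes (assumed); k is a few multiples of sqrt(n)
null_stat = top_eigenvalue(sample_graph(n, 0, rng))
planted_stat = top_eigenvalue(sample_graph(n, k, rng))
print(f"uniform: {null_stat:.1f}, planted: {planted_stat:.1f}")
```

In the planted case the normalized indicator vector of the clique has Rayleigh quotient exactly $k - 1$, so the statistic is at least $k - 1$, while for the uniform sample it concentrates near $2\sqrt{n}$.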

Recent works have shown that one can often improve upon these basic spectral methods using SoS, yielding better accuracy and robustness guarantees against noise in recovering planted solutions. Furthermore, for worst-case problems (as opposed to the average-case planted problems we consider here) semidefinite programs are strictly more powerful than spectral algorithms. (For example, consider the contrast between the SDP algorithm for Max-Cut of Goemans and Williamson [GW94] and the spectral algorithm of Trevisan [Tre09]; or the SDP-based algorithms for coloring worst-case 3-colorable graphs [KT17] relative to the best spectral methods [AK97], which only work for random inputs.) A priori one might therefore expect that these new SoS guarantees for planted problems would not be achievable via spectral algorithms. But curiously enough, in numerous cases these stronger guarantees for planted problems can be achieved by spectral methods! The twist is that the entries of these matrices are low-degree polynomials in the input to the algorithm. The result is a new family of low-degree spectral algorithms with guarantees matching SoS but requiring only eigenvector computations instead of general semidefinite programming [HSSS16, RRS16, AOW15a].

This leads to the following question which is the main focus of this work.

Are SoS algorithms equivalent to low-degree spectral methods for planted problems?

We answer this question affirmatively for a wide class of distinguishing problems which includes refuting random CSPs, tensor and sparse PCA, densest-$k$-subgraph, community detection in stochastic block models, planted clique, and more. Our positive answer to this question implies that a light-weight algorithm (computing the top eigenvalue of a single matrix whose entries are low-degree polynomials in the input) can recover the performance guarantees of an often bulky semidefinite programming relaxation.

To complement this picture, we prove two new SoS lower bounds for particular planted problems, both variants of component analysis: sparse principal component analysis and tensor principal component analysis (henceforth sparse PCA and tensor PCA, respectively) [ZHT06, RM14]. For both problems there are nontrivial low-degree spectral algorithms, which have better noise tolerance than naive spectral methods [HSSS16, DM14b, RRS16, BGL16]. Sparse PCA, which is used in machine learning and statistics to find important coordinates in high-dimensional data sets, has attracted much attention in recent years for being apparently computationally intractable to solve with a number of samples which is more than sufficient for brute-force algorithms [KNV15, BR13b, MW15a]. Tensor PCA appears to exhibit similar behavior [HSS15]. That is, both problems exhibit information-computation gaps.

Our SoS lower bounds for both problems are the strongest formal evidence yet for information-computation gaps for these problems. We rule out the possibility of subexponential-time SoS algorithms which improve by polynomial factors on the signal-to-noise ratios tolerated by the known low-degree spectral methods. In particular, in the case of sparse PCA, it appeared possible prior to this work that one could recover, in quasipolynomial time, a -sparse unit vector in dimensions from samples from the distribution . Our lower bounds suggest that this is extremely unlikely; in fact this task probably requires polynomial SoS degree and hence time for SoS algorithms. This demonstrates that (at least with regard to SoS algorithms) both problems are much harder than the planted clique problem, previously used as a basis for reductions in the setting of sparse PCA [BR13b].

Our lower bounds for sparse and tensor PCA are closely connected to the failure of low-degree spectral methods in high noise regimes of both problems. We prove them both by showing that with noise beyond what known low-degree spectral algorithms can tolerate, even low-degree scalar algorithms (the result of restricting low-degree spectral algorithms to $1 \times 1$ matrices) would require subexponential time to detect and recover planted signals. We then show that in the restricted settings of tensor and sparse PCA, ruling out these weakened low-degree spectral algorithms is enough to imply a strong SoS lower bound.

### 1.1 SoS and spectral algorithms for robust inference

We turn to our characterization of SoS algorithms for planted problems in terms of low-degree spectral algorithms. First, a word on planted problems. Many planted problems have several formulations: search, in which the goal is to recover a planted solution, refutation, in which the goal is to certify that no planted solution is present, and distinguishing, where the goal is to determine with good probability whether an instance contains a planted solution or not. Often an algorithm for one version can be parlayed into algorithms for the others, but distinguishing problems are often the easiest, and we focus on them here.

A distinguishing problem is specified by two distributions on instances: a planted distribution supported on instances with a hidden structure, and a uniform distribution, where samples w.h.p. contain no hidden structure. Given an instance drawn with equal probability from the planted or the uniform distribution, the goal is to determine with probability greater than whether or not the instance comes from the planted distribution. For example:

Planted clique. Uniform distribution: $G(n, 1/2)$, the Erdős-Rényi distribution, which w.h.p. contains no clique of size . Planted distribution: The uniform distribution on graphs containing a -size clique, for some . (The problem gets harder as gets smaller, since the distance between the distributions shrinks.)

Planted xor Uniform distribution: a xor instance on variables and equations , where all the triples and the signs are sampled uniformly and independently. No assignment to will satisfy more than a -fraction of the equations, w.h.p. Planted distribution: The same, except the signs are sampled to correlate with for a randomly chosen , so that the assignment satisfies a -fraction of the equations. (The problem gets easier as gets larger, and the contradictions in the uniform case become more locally apparent.)
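The planted xor example can be made concrete with a small sampler. This is a minimal sketch under stated assumptions: equations are taken to be 3-XOR constraints $x_i x_j x_k = a_{ijk}$ over $x \in \{\pm 1\}^n$ on triples of distinct indices, and the correlation is modeled by a hypothetical parameter `eps`, so that each sign agrees with the hidden assignment with probability $1/2 + \texttt{eps}$. The function names and all numeric values are illustrative.

```python
import numpy as np


def sample_xor(n, m, eps, rng, planted=True):
    """Sample m random 3-XOR equations x_i * x_j * x_k = a over x in {-1, +1}^n.

    Planted case: each sign agrees with the hidden assignment with probability
    1/2 + eps (an assumed noise model). Uniform case: signs are independent coin flips.
    """
    triples = np.array([rng.choice(n, size=3, replace=False) for _ in range(m)])
    hidden = rng.choice([-1, 1], size=n)
    if planted:
        agree = rng.random(m) < 0.5 + eps
        signs = np.where(agree, 1, -1) * hidden[triples].prod(axis=1)
    else:
        signs = rng.choice([-1, 1], size=m)
    return triples, signs, hidden


def fraction_satisfied(triples, signs, x):
    """Fraction of equations satisfied by the assignment x."""
    return float(np.mean(x[triples].prod(axis=1) == signs))


rng = np.random.default_rng(0)
n, m, eps = 50, 5000, 0.2  # illustrative values (assumed)
t, s, hidden = sample_xor(n, m, eps, rng, planted=True)
p_frac = fraction_satisfied(t, s, hidden)
t, s, hidden = sample_xor(n, m, eps, rng, planted=False)
u_frac = fraction_satisfied(t, s, hidden)
print(f"planted: {p_frac:.3f}, uniform: {u_frac:.3f}")
```

The hidden assignment satisfies roughly a $1/2 + \texttt{eps}$ fraction of the planted equations, while in the uniform case any fixed assignment satisfies about half of them.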

We now formally define a family of distinguishing problems, in order to give our main theorem. Let be a set of instances corresponding to a product space (for concreteness one may think of to be the set of graphs on vertices, indexed by , although the theorem applies more broadly). Let , our uniform distribution, be a product distribution on .

With some decision problem in mind (e.g. does contain a clique of size ?), let be a set of solutions to ; again for concreteness one may think of as being associated with cliques in a graph, so that is the set of all indicator vectors on at least vertices.

For each solution , let be the uniform distribution over instances that contain . For example, in the context of planted clique, if is a clique on vertices , then would be the uniform distribution on graphs containing the clique . We define the planted distribution to be the uniform mixture over , .

The following is our main theorem on the equivalence of sum of squares algorithms for distinguishing problems and spectral algorithms employing low-degree matrix polynomials.

###### Theorem 1.1 (Informal).

Let , and let be sets of real numbers. Let be a family of instances over , and let be a decision problem over with the set of possible solutions to over . Let be a system of polynomials of degree at most in the variables and constant degree in the variables that encodes , so that

• for , with high probability the system is unsatisfiable and admits a degree- SoS refutation, and

• for , with high probability the system is satisfiable by some solution , and remains feasible even if all but an -fraction of the coordinates of are re-randomized according to .

Then there exists a matrix whose entries are degree- polynomials such that

where denotes the maximum non-negative eigenvalue.

The condition that a solution remain feasible if all but a fraction of the coordinates of are re-randomized should be interpreted as a noise-robustness condition. To see an example, in the context of planted clique, suppose we start with a planted distribution over graphs with a clique of size . If a random subset of vertices are chosen, and all edges not entirely contained in that subset are re-randomized according to the distribution, then with high probability at least of the vertices in remain in a clique, and so remains feasible for the problem : has a clique of size ?
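The re-randomization step in this example can be simulated directly. The sketch below is illustrative only: `keep_fraction` is a hypothetical name for the fraction of vertices that is kept, and the helper names and sizes are assumptions. It plants a clique, resamples every edge not entirely inside a random vertex subset, and checks that the clique vertices inside the subset still form a clique.

```python
import numpy as np


def is_clique(A, vertices):
    """True if every pair of distinct vertices in `vertices` is adjacent in A."""
    sub = A[np.ix_(vertices, vertices)]
    off_diag = ~np.eye(len(vertices), dtype=bool)
    return bool(np.all(sub[off_diag] == 1))


def rerandomize_outside(A, keep, rng):
    """Resample (as in G(n, 1/2)) every edge that is not entirely inside `keep`."""
    n = A.shape[0]
    inside = np.zeros(n, dtype=bool)
    inside[keep] = True
    fresh = np.triu(rng.integers(0, 2, size=(n, n)), 1)
    fresh = fresh + fresh.T
    protected = np.outer(inside, inside)  # pairs with both endpoints kept
    return np.where(protected, A, fresh)


rng = np.random.default_rng(1)
n, k, keep_fraction = 300, 60, 0.5  # illustrative values (assumed)
upper = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = upper + upper.T
clique = rng.choice(n, size=k, replace=False)
A[np.ix_(clique, clique)] = 1
np.fill_diagonal(A, 0)

keep = rng.choice(n, size=int(keep_fraction * n), replace=False)
B = rerandomize_outside(A, keep, rng)
survivors = np.intersect1d(clique, keep)
print(len(survivors), is_clique(B, survivors))
```

Edges among the surviving clique vertices are never resampled, so they remain a clique; their number is about `keep_fraction * k` in expectation.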

### 1.2 SoS and information-computation gaps

Computational complexity of planted problems has become a rich area of study. The goal is to understand which planted problems admit efficient (polynomial time) algorithms, and to study the information-computation gap phenomenon: many problems have noisy regimes in which planted structures can be found by inefficient algorithms, but (conjecturally) not by polynomial time algorithms. One example is the planted clique problem, where the goal is to find a large clique in a sample from the uniform distribution over graphs containing a clique of size for a small constant . While the problem is solvable for any by a brute-force algorithm requiring time, polynomial time algorithms are conjectured to require .

A common strategy to provide evidence for such a gap is to prove that powerful classes of efficient algorithms are unable to solve the planted problem in the (conjecturally) hard regime. SoS algorithms are particularly attractive targets for such lower bounds because of their broad applicability and strong guarantees.

In a recent work, Barak et al. [BHK16] show an SoS lower bound for the planted clique problem, demonstrating that when , SoS algorithms require time to solve planted clique. Intriguingly, they show that, in the case of planted clique, SoS algorithms requiring time can distinguish planted from random graphs only when there is a scalar-valued degree polynomial (here $A$ is the adjacency matrix of a graph) with

$$\mathbb{E}_{G(n,1/2)}\, p(A) = 0, \qquad \mathbb{E}_{\text{planted}}\, p(A) \geq n^{\Omega(1)} \cdot \left(\mathbb{V}_{G(n,1/2)}\, p(A)\right)^{1/2}.$$

That is, such a polynomial has much larger expectation under the planted distribution than its standard deviation under the uniform distribution. (The choice of $n^{\Omega(1)}$ is somewhat arbitrary, and could be replaced with or with small changes in the parameters.) By showing that as long as any such polynomial must have degree , they rule out efficient SoS algorithms when . Interestingly, this matches the spectral distinguishing threshold: the spectral algorithm of [AKS98] is known to work when .
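To make the scalar criterion concrete, consider the simplest candidate polynomial, the centered edge count $p(A) = \sum_{i<j} (2A_{ij} - 1)$. The sketch below is an illustration chosen for this example rather than a polynomial taken from [BHK16]; the helper names and sizes are assumptions. It evaluates $p$ on one uniform and one planted sample and compares against the exact ratio $\binom{k}{2} / \binom{n}{2}^{1/2}$ of planted mean to uniform standard deviation.

```python
import math
import numpy as np


def sample_graph(n, k, rng):
    """Adjacency matrix of G(n, 1/2); if k > 0, plant a clique on k random vertices."""
    upper = np.triu(rng.integers(0, 2, size=(n, n)), 1)
    A = upper + upper.T
    if k > 0:
        clique = rng.choice(n, size=k, replace=False)
        A[np.ix_(clique, clique)] = 1
        np.fill_diagonal(A, 0)
    return A


def centered_edge_count(A):
    """Degree-1 polynomial p(A) = sum_{i<j} (2*A_ij - 1); it has mean zero under G(n, 1/2)."""
    iu = np.triu_indices(A.shape[0], 1)
    return float(np.sum(2 * A[iu] - 1))


def predicted_advantage(n, k):
    """E_planted p / sqrt(Var_uniform p) = C(k, 2) / sqrt(C(n, 2))."""
    return math.comb(k, 2) / math.sqrt(math.comb(n, 2))


rng = np.random.default_rng(2)
n, k = 400, 100  # illustrative sizes (assumed)
sigma = math.sqrt(math.comb(n, 2))  # standard deviation of p under G(n, 1/2)
p_uniform = centered_edge_count(sample_graph(n, 0, rng))
p_planted = centered_edge_count(sample_graph(n, k, rng))
print(p_uniform / sigma, p_planted / sigma, predicted_advantage(n, k))
```

The ratio grows like $k^2 / n$, so this particular degree-1 polynomial separates the two distributions by a polynomial factor only once $k$ exceeds $\sqrt{n}$ by a polynomial factor.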

This stronger characterization of SoS for the planted clique problem, in terms of scalar distinguishing algorithms rather than spectral distinguishing algorithms, may at first seem insignificant. To see why the scalar characterization is more powerful, we point out that if the degree-moments of the planted and uniform distributions are known, determining the optimal scalar distinguishing polynomial is easy: given a planted distribution $\mu$ and a random distribution $\nu$ over instances $I$, one just solves a linear algebra problem in the coefficients of $p$ to maximize the expectation over $\mu$ relative to $\nu$:

$$\max_{p} \; \mathbb{E}_{I \sim \mu}\left[p^2(I)\right] \quad \text{s.t.} \quad \mathbb{E}_{I \sim \nu}\left[p^2(I)\right] = 1.$$

It is not difficult to show that the optimal solution to the above program has a simple form: it is the projection of the relative density of $\mu$ with respect to $\nu$ onto the degree- polynomials. So given a pair of distributions $\mu, \nu$, in time, it is possible to determine whether there exists a degree- scalar distinguishing polynomial. Answering the same question about the existence of a spectral distinguisher is more complex, and to the best of our knowledge cannot be done efficiently.

Given this powerful theorem for the case of the planted clique problem, one may be tempted to conjecture that this stronger, scalar distinguisher characterization of the SoS algorithm applies more broadly than just to the planted clique problem, and perhaps as broadly as Theorem 1.1. If this conjecture is true, given a pair of distributions $\mu$ and $\nu$ with known moments, it would be possible in many cases to efficiently and mechanically determine whether polynomial-time SoS distinguishing algorithms exist!

###### Conjecture 1.2.

In the setting of Theorem 1.1, the conclusion may be replaced with the conclusion that there exists a scalar-valued polynomial of degree so that

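$$\mathbb{E}_{\text{uniform}}\, p(I) = 0 \quad \text{and} \quad \mathbb{E}_{\text{planted}}\, p(I) \geq n^{\Omega(1)} \left(\mathbb{E}_{\text{uniform}}\, p(I)^2\right)^{1/2}$$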

To illustrate the power of this conjecture, in the beginning of Section 6 we give a short and self-contained explanation of how this predicts, via simple linear algebra, our -degree SoS lower bound for tensor PCA. As evidence for the conjecture, we verify this prediction by proving such a lower bound unconditionally.

We also note why Theorem 1.1 does not imply Conjecture 1.2. While, in the notation of that theorem, the entries of are low-degree polynomials in , the function is not (to the best of our knowledge) a low-degree polynomial in the entries of (even approximately). (This stands in contrast to, say, the operator norm or Frobenius norm of , both of which are exactly or approximately low-degree polynomials in the entries of .) This means that the final output of the spectral distinguishing algorithm offered by Theorem 1.1 is not a low-degree polynomial in the instance .

### 1.3 Exponential lower bounds for sparse PCA and tensor PCA

Our other main results are strong exponential lower bounds on the sum-of-squares method (specifically, against time or degree algorithms) for the tensor and sparse principal component analysis (PCA) problems. We prove the lower bounds by extending the techniques pioneered in [BHK16]. In the present work we describe the proofs informally, leaving full details to a forthcoming full version.

##### Tensor PCA

We start with the simpler case of tensor PCA, introduced by [RM14].

###### Problem 1.3 (Tensor PCA).

Given an order- tensor in , determine whether it comes from:

• Uniform Distribution: each entry of the tensor sampled independently from .

• Planted Distribution: a spiked tensor, where is sampled uniformly from , and where is a random tensor with each entry sampled independently from .

Here, we think of as a signal hidden by Gaussian noise. The parameter is a signal-to-noise ratio. In particular, as grows, we expect the distinguishing problem above to get easier.

Tensor PCA is a natural generalization of the PCA problem in machine learning and statistics. Tensor methods in general are useful when data naturally has more than two modalities: for example, one might consider a recommender system which factors in not only people and movies but also time of day. Many natural tensor problems are NP hard in the worst case. Though this is not necessarily an obstacle to machine learning applications, it is important to have average-case models in which to study algorithms for tensor problems. The spiked tensor setting we consider here is one such simple model.

Turning to algorithms: consider first the ordinary PCA problem in a spiked-matrix model. Given an matrix , the problem is to distinguish between the case where every entry of is independently drawn from the standard Gaussian distribution and the case when is drawn from a distribution as above with an added rank one shift in a uniformly random direction . A natural and well-studied algorithm, which solves this problem to information-theoretic optimality, is to threshold on the largest singular value/spectral norm of the input matrix. Equivalently, one thresholds on the maximizer of the degree two polynomial in

A natural generalization of this algorithm to the tensor PCA setting (restricting to $k = 3$ for simplicity in this discussion) is the maximum of the degree-three polynomial over the unit sphere, or equivalently the (symmetric) injective tensor norm of . This maximum can be shown to be much larger in the case of the planted distribution so long as . Indeed, this approach to distinguishing between planted and uniform distributions is information-theoretically optimal [PWB16, BMVX16]. Since recovering the spike and optimizing the polynomial on the sphere are equivalent, tensor PCA can be thought of as an average-case version of the problem of optimizing a degree- polynomial on the unit sphere (this problem is NP hard in the worst case, even to approximate [HL09, BBH12]).
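The objects in this discussion can be sketched numerically for third-order tensors. The code below makes several assumptions that are not fixed by the text: the spike is taken to be $\lambda \cdot v \otimes v \otimes v$ with $v$ having entries $\pm 1/\sqrt{n}$, the noise tensor is not symmetrized, and the names `spiked_tensor`, `cubic_form` and `unfolding_norm` are hypothetical. Since maximizing the cubic form over the sphere is hard in general, the sketch evaluates it at the planted vector and, as a swapped-in efficient statistic, computes the spectral norm of the $n \times n^2$ flattening.

```python
import numpy as np


def spiked_tensor(n, lam, rng):
    """T = lam * (v x v x v) + G, with iid N(0, 1) noise G and v in {+-1/sqrt(n)}^n (assumed)."""
    v = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)
    G = rng.standard_normal((n, n, n))
    T = lam * np.einsum("i,j,k->ijk", v, v, v) + G
    return T, v


def cubic_form(T, x):
    """Evaluate the degree-three polynomial <T, x (x) x (x) x>."""
    return float(np.einsum("ijk,i,j,k->", T, x, x, x))


def unfolding_norm(T):
    """Spectral norm of the n x n^2 flattening of T (a simple matrix surrogate)."""
    n = T.shape[0]
    return float(np.linalg.norm(T.reshape(n, n * n), 2))


rng = np.random.default_rng(3)
n, lam = 20, 100.0  # illustrative values (assumed); lam is far above the noise level
T_planted, v = spiked_tensor(n, lam, rng)
T_uniform, _ = spiked_tensor(n, 0.0, rng)

value_at_spike = cubic_form(T_planted, v)  # equals lam plus a standard Gaussian
planted_norm = unfolding_norm(T_planted)
uniform_norm = unfolding_norm(T_uniform)
print(value_at_spike, planted_norm, uniform_norm)
```

At the planted vector the cubic form equals $\lambda$ plus a standard Gaussian, which is why a large $\lambda$ makes the planted maximum stand out from the pure-noise case.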

Even in this average-case model, it is believed that there is a gap between which signal strengths allow recovery of by brute-force methods and which permit polynomial time algorithms. This is quite distinct from the vanilla PCA setting, where eigenvector algorithms solve the spike-recovery problem to information-theoretic optimality. Nevertheless, the best-known algorithms for tensor PCA arise from computing convex relaxations of this degree- polynomial optimization problem. Specifically, the SoS method captures the state of the art algorithms for the problem; it is known to recover the vector to error in polynomial time whenever [HSS15]. A major open question in this direction is to understand the complexity of the problem for . Algorithms (again captured by SoS) are known which run in time [RRS16, BGG16]. The following theorem shows that the sub-exponential algorithm above is in fact nearly optimal among SoS algorithms.

###### Theorem 1.4.

For a tensor , let

$$\mathrm{SoS}_d(T) = \max_{\tilde{\mathbb{E}}} \; \tilde{\mathbb{E}}\left[\langle T, x^{\otimes k} \rangle\right] \quad \text{such that } \tilde{\mathbb{E}} \text{ is a degree-}d \text{ pseudoexpectation and satisfies } \{\|x\|^2 = 1\}.$$

(For definitions of pseudoexpectations and related matters, see the survey [BS14].)

For every small enough constant , if has iid Gaussian or entries, , for every for some universal .

In particular for third order tensors (i.e. $k = 3$), since degree SoS is unable to certify that a random -tensor has maximum value much less than , this SoS relaxation cannot be used to distinguish the planted and random distributions above when . (In fact, our proof for this theorem will show somewhat more: that a large family of constraints, namely any valid constraint which is itself a low-degree polynomial of , could be added to this convex relaxation and the lower bound would still obtain.)

##### Sparse PCA

We turn to sparse PCA, which we formalize as the following planted distinguishing problem.

###### Problem 1.5 (Sparse PCA (λ,k)).

Given an symmetric real matrix , determine whether comes from:

• Uniform Distribution: each upper-triangular entry of the matrix is sampled iid from ; other entries are filled in to preserve symmetry.

• Planted Distribution: a random -sparse unit vector with entries is sampled, and is sampled from the uniform distribution above; then .

We defer significant discussion to Section 6, noting just a few things before stating our main theorem on sparse PCA. First, the planted model above is sometimes called the spiked Wigner model—this refers to the independence of the entries of the matrix . An alternative model for sparse PCA is the spiked Wishart model: is replaced by , where each , for some number of samples and some signal-strength . Though there are technical differences between the models, to the best of our knowledge all known algorithms with provable guarantees are equally applicable to either model; we expect that our SoS lower bounds also apply in the spiked Wishart model.

We generally think of as small powers of ; i.e. for some ; this allows us to generally ignore logarithmic factors in our arguments. As in the tensor PCA setting, a natural and information-theoretically optimal algorithm for sparse PCA is to maximize the quadratic form , this time over -sparse unit vectors. For from the uniform distribution standard techniques (-nets and union bounds) show that the maximum value achievable is with high probability, while for from the planted model of course . So, when one may distinguish the two models by this maximum value.

However, this maximization problem is NP hard for general quadratic forms [CPR16]. So, efficient algorithms must use some other distinguisher which leverages the randomness in the instances. Essentially only two polynomial-time-computable distinguishers are known. (If one studies the problem at much finer granularity than we do here, in particular studying up to low-order additive terms and how precisely it is possible to estimate the planted signal , then the situation is more subtle [DM14a].) If then the maximum eigenvalue of distinguishes the models. If then the planted model can be distinguished by the presence of large diagonal entries of . Notice both of these distinguishers fail for some choices of (that is, ) for which brute-force methods (optimizing over sparse ) could successfully distinguish planted from uniform ’s. The theorem below should be interpreted as an impossibility result for SoS algorithms in the regime. This is the strongest known impossibility result for sparse PCA among those ruling out classes of efficient algorithms (one reduction-based result is also known, which shows sparse PCA is at least as hard as the planted clique problem [BR13a]). It is also the first evidence that the problem may require subexponential (as opposed to merely quasi-polynomial) time.
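The two polynomial-time distinguishers mentioned above are easy to compute. The following is a minimal sketch in a spiked Wigner model, with assumptions labeled: the noise entries are taken to be standard Gaussian (including the diagonal), the planted matrix is taken to be $W + \lambda v v^\top$ with $v$ having entries in $\{0, \pm 1/\sqrt{k}\}$, and all function names and parameter values are illustrative. No thresholds are claimed; the code only computes the two statistics.

```python
import numpy as np


def wigner(n, rng):
    """Symmetric matrix with iid N(0, 1) entries on and above the diagonal (assumed noise model)."""
    upper = np.triu(rng.standard_normal((n, n)))
    return upper + np.triu(upper, 1).T


def planted_sparse_pca(n, k, lam, rng):
    """A = W + lam * v v^T, with v a random k-sparse unit vector with entries in {0, +-1/sqrt(k)}."""
    v = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    v[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    return wigner(n, rng) + lam * np.outer(v, v)


def max_eigenvalue(A):
    """First distinguisher: the largest eigenvalue of A."""
    return float(np.linalg.eigvalsh(A)[-1])


def max_diagonal(A):
    """Second distinguisher: the largest diagonal entry of A."""
    return float(np.max(np.diag(A)))


rng = np.random.default_rng(4)
n, k, lam = 200, 5, 50.0  # illustrative values (assumed), chosen so both statistics separate
A_uniform = wigner(n, rng)
A_planted = planted_sparse_pca(n, k, lam, rng)
print("max eigenvalue:", max_eigenvalue(A_uniform), max_eigenvalue(A_planted))
print("max diagonal:  ", max_diagonal(A_uniform), max_diagonal(A_planted))
```

With these parameters each diagonal entry on the support is shifted by `lam / k`, and the quadratic form at `v` is shifted by `lam`, so both statistics are visibly larger in the planted sample.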

###### Theorem 1.6.

If , let

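$$\mathrm{SoS}_{d,k}(A) = \max_{\tilde{\mathbb{E}}} \; \tilde{\mathbb{E}}\,\langle x, Ax \rangle \quad \text{s.t. } \tilde{\mathbb{E}} \text{ is degree } d \text{ and satisfies } \{x_i^3 = x_i,\ \|x\|^2 = k\}.$$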

There are absolute constants so that for every and , if , then for ,

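$$\mathbb{E}_{A \sim \{\pm 1\}^{\binom{n}{2}}}\, \mathrm{SoS}_{d,k}(A) \geq \min\left(n^{1/2 - \varepsilon} k,\ n^{\rho - \varepsilon} k\right).$$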

For more thorough discussion of the theorem, see Section 6.3.

### 1.4 Related work

##### On interplay of SoS relaxations and spectral methods

As we have already alluded to, many prior works explore the connection between SoS relaxations and spectral algorithms, beginning with the work of [BBH12] and including the followup works [HSS15, AOW15b, BM16] (plus many more). Of particular interest are the papers [HSSS16, MS16b], which use the SoS algorithms to obtain fast spectral algorithms, in some cases running in time linear in the input size (smaller even than the number of variables in the associated SoS SDP).

In light of our Theorem 1.1, it is particularly interesting to note cases in which the known SoS lower bounds match the known spectral algorithms. These problems include planted clique (upper bound: [AKS98], lower bound: [BHK16]; SDP lower bounds for the planted clique problem were known for smaller degrees of sum-of-squares relaxations and for other SDP relaxations before, see the references therein for details), strong refutations for random CSPs (upper bound: [AOW15b, RRS16], lower bounds: [Gri01b, Sch08, KMOW17]; there is a long line of work on algorithms for refuting random CSPs, and 3SAT in particular, and the listed papers contain additional references), and tensor principal components analysis (upper bound: [HSS15, RRS16, BGG16], lower bound: this paper).

We also remark that our work applies to several previously-considered distinguishing and average-case problems within the sum-of-squares algorithmic framework: block models [MS16a], densest-$k$-subgraph [BCC10]; for each of these problems, we have by Theorem 1.1 an equivalence between efficient sum-of-squares algorithms and efficient spectral algorithms, and it remains to establish exactly what the tradeoff is between efficiency of the algorithm and the difficulty of distinguishing, or the strength of the noise.

To the best of our knowledge, no previous work has attempted to characterize SoS relaxations for planted problems by simpler algorithms in the generality we do here. Some works have considered characterizing degree-2 SoS relaxations (i.e. basic semidefinite programs) in terms of simpler algorithms. One such example is recent work of Fan and Montanari [FM16] who showed that for some planted problems on sparse random graphs, a class of simple procedures called local algorithms performs as well as semidefinite programming relaxations.

##### On strong SoS lower bounds for planted problems

By now, there is a large body of work that establishes lower bounds on SoS SDPs for various average-case problems. Beginning with the work of Grigoriev [Gri01a], a long line of work has established tight lower bounds for random constraint satisfaction problems [Sch08, BCK15, KMOW17] and planted clique [MPW15, DM15, HKP15, RS15, BHK16]. The recent SoS lower bound for planted clique of [BHK16] was particularly influential to this work, setting the stage for our main line of inquiry. We also draw attention to previous work on lower bounds for the tensor PCA and sparse PCA problems in the degree-4 SoS relaxation [HSS15, MW15b]; our paper improves on this and extends our understanding of lower bounds for tensor and sparse PCA to any degree.

Tensor principal component analysis was introduced by Montanari and Richard [RM14], who identified the information-theoretic threshold for recovery of the planted component and analyzed the maximum likelihood estimator for the problem. The work of [HSS15] began the effort to analyze the sum-of-squares method for the problem and showed that it yields an efficient algorithm for recovering the planted component with strength . They also established that this threshold is tight for the sum-of-squares relaxation of degree 4. Following this, Hopkins et al. [HSSS16] showed how to extract a linear time spectral algorithm from the above analysis. Tomioka and Suzuki derived tight information-theoretic thresholds for detecting planted components by establishing tight bounds on the injective tensor norm of random tensors [TS14]. Finally, very recently, Raghavendra et al. and Bhattiprolu et al. independently showed sub-exponential time algorithms for tensor PCA [RRS16, BGL16]. Their algorithms are spectral and are captured by the sum-of-squares method.

### 1.5 Organization

In sec:low-deg-dist we set up and state our main theorem on SoS algorithms versus low-degree spectral algorithms. In sec:examp we show that the main theorem applies to numerous planted problems; we emphasize that checking each problem is very simple (and barely requires more than a careful definition of the planted and uniform distributions). In sec:moment-match and sec:proofofthm we prove the main theorem on SoS algorithms versus low-degree spectral algorithms.

In section 7 we prepare to prove our lower bound for tensor PCA by proving a structural theorem on factorizations of low-degree matrix polynomials with well-behaved Fourier transforms. In section 8 we prove our lower bound for tensor PCA, using some tools proved in section 9.

##### Notation

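For two matrices $A, B$, let $\langle A, B\rangle = \mathrm{Tr}(A^{\top} B)$. Let $\|A\|_{\mathrm{Fr}}$ denote the Frobenius norm of $A$, and $\|A\|$ its spectral norm. For matrix-valued functions $A, B$ over $\mathcal{I}$ and a distribution $\nu$ over $\mathcal{I}$, we will denote $\mathbb{E}_{I\sim\nu}\langle A(I), B(I)\rangle$ by $\langle A, B\rangle_{\nu}$ and $\langle A, A\rangle_{\nu}^{1/2}$ by $\|A\|_{\mathrm{Fr},\nu}$.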

For a vector of formal variables , we use to denote the vector consisting of all monomials of degree at most in these variables. Furthermore, let us denote .

## 2 Distinguishing Problems and Robust Inference

In this section, we set up the formal framework within which we will prove our main result.

#### Uniform vs. Planted Distinguishing Problems

We begin by describing a class of distinguishing problems. For a set $\Omega$ of real numbers, we will use $\mathcal{I} = \Omega^N$ to denote a space of instances indexed by $N$ variables. For the sake of concreteness, it will be useful to think of $\Omega$ as $\{0,1\}$; for example, we could have $N = \binom{n}{2}$ and $\mathcal{I} = \{0,1\}^{\binom{n}{2}}$ as the set of all graphs on $n$ vertices. However, the results that we will show here continue to hold in other contexts, where the space of all instances is $\mathbb{R}^N$ or $[q]^N$.

###### Definition 2.1 (Uniform Distinguishing Problem).

Suppose that $\mathcal{I}$ is the space of all instances, and suppose we have two distributions over $\mathcal{I}$: a product distribution $\nu$ (the “uniform” distribution), and an arbitrary distribution $\mu$ (the “planted” distribution).

In a uniform distinguishing problem, we are given an instance $I \in \mathcal{I}$ which is sampled with probability $\tfrac12$ from $\nu$ and with probability $\tfrac12$ from $\mu$, and the goal is to determine with probability greater than $\tfrac12 + \varepsilon$ which distribution $I$ was sampled from, for any constant $\varepsilon > 0$.
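As an illustration of the definition, the following minimal sketch simulates a uniform-versus-planted distinguishing problem for planted clique and runs a very simple test on it. Everything here is an assumption made for illustration: the function names, the parameters `n = 40` and `k = 20`, and the choice of the edge count as the test statistic. The edge count is a degree-1 polynomial in the instance variables, and it only succeeds because the clique in this toy setting is much larger than the regime the paper is concerned with.

```python
import random
from itertools import combinations

def sample_uniform(n, rng):
    """nu: each potential edge is present independently with probability 1/2."""
    return {e: rng.randint(0, 1) for e in combinations(range(n), 2)}

def sample_planted(n, k, rng):
    """mu: a G(n, 1/2) graph with a clique planted on k random vertices."""
    I = sample_uniform(n, rng)
    clique = rng.sample(range(n), k)
    for e in combinations(sorted(clique), 2):
        I[e] = 1
    return I

def edge_count_test(I, n, k):
    """Guess 'planted' iff the edge count exceeds the midpoint of the two means."""
    mean_uniform = len(I) / 2
    mean_planted = mean_uniform + k * (k - 1) / 4
    return sum(I.values()) > (mean_uniform + mean_planted) / 2

def accuracy(n, k, trials, seed=0):
    """Fraction of trials in which the test names the right distribution."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        planted = rng.random() < 0.5          # fair coin: mu or nu
        I = sample_planted(n, k, rng) if planted else sample_uniform(n, rng)
        correct += (edge_count_test(I, n, k) == planted)
    return correct / trials
```

Calling `accuracy(40, 20, 200)` estimates the success probability of this test; a value well above $\tfrac12$ means the test solves the distinguishing problem for these (easy) parameters.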

#### Polynomial Systems

In the uniform distinguishing problems that we are interested in, the planted distribution will be a distribution over instances that obtain a large value for some optimization problem of interest (e.g., the max clique problem). We define polynomial systems in order to formally capture optimization problems.

###### Program 2.2 (Polynomial System).

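Let $\Omega, \Sigma$ be sets of real numbers, let $n, N \in \mathbb{N}$, and let $\mathcal{I} = \Omega^N$ be a space of instances and $\mathcal{X} \subseteq \Sigma^n$ be a space of solutions. A polynomial system is a set of polynomial equalities

$$g_j(x, I) = 0 \quad \forall j \in [m],$$

where $\{g_j\}_{j\in[m]}$ are polynomials in the program variables $\{x_i\}_{i\in[n]}$, representing $x \in \mathcal{X}$, and in the instance variables $\{I_j\}_{j\in[N]}$, representing $I \in \mathcal{I}$. We define $\deg_{\mathrm{prog}}(g_j)$ to be the degree of $g_j$ in the program variables, and $\deg_{\mathrm{inst}}(g_j)$ to be the degree of $g_j$ in the instance variables.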

###### Remark 2.3.

For the sake of simplicity, the polynomial system in Program 2.2 has no inequalities. Inequalities can be incorporated into the program by converting each inequality into an equality with an additional slack variable. Our main theorem still holds, up to some minor modifications of the proof, as outlined in Section 4.

A polynomial system allows us to capture problem-specific objective functions as well as problem-specific constraints. For concreteness, consider a quadratic program which checks if a graph on $n$ vertices contains a clique of size $k$. We can express this with the polynomial system over program variables $x \in \mathbb{R}^n$ and instance variables $I \in \{0,1\}^{\binom{n}{2}}$, where $I_{ij} = 1$ iff there is an edge from $i$ to $j$, as follows:

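$$\Big\{\sum_{i\in[n]} x_i - k = 0\Big\} \;\cup\; \big\{x_i(x_i-1)=0\big\}_{i\in[n]} \;\cup\; \big\{(1-I_{ij})\,x_i x_j = 0\big\}_{\{i,j\}\in\binom{[n]}{2}}.$$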
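The clique system above can be checked mechanically on a small graph. The sketch below is only an illustration; the function names and the example graph are hypothetical. It evaluates each polynomial of the system at a candidate assignment and declares the assignment feasible when every polynomial vanishes.

```python
from itertools import combinations

def clique_system_residuals(x, edges, n, k):
    """Return the values of all polynomials of the clique system at (x, I)."""
    # I_ij = 1 iff {i, j} is an edge of the graph
    I = {frozenset(e): 1 for e in edges}
    residuals = [sum(x) - k]                                   # sum_i x_i - k
    residuals += [x[i] * (x[i] - 1) for i in range(n)]         # x_i (x_i - 1)
    residuals += [
        (1 - I.get(frozenset((i, j)), 0)) * x[i] * x[j]        # (1 - I_ij) x_i x_j
        for i, j in combinations(range(n), 2)
    ]
    return residuals

def is_feasible(x, edges, n, k):
    """x is feasible iff every polynomial in the system evaluates to zero."""
    return all(r == 0 for r in clique_system_residuals(x, edges, n, k))

# A 4-vertex graph whose vertices 0, 1, 2 form a triangle.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
triangle = [1, 1, 1, 0]      # indicator of {0, 1, 2}
not_clique = [1, 1, 0, 1]    # {0, 1, 3}: edges {0, 3} and {1, 3} are missing
```

Here `is_feasible(triangle, edges, 4, 3)` holds, while `not_clique` violates the constraints $(1-I_{ij})x_ix_j = 0$ for the two missing edges.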

#### Planted Distributions

We will be concerned with planted distributions of a particular form; first, we fix a polynomial system of interest and some set of feasible solutions for , so that the program variables represent elements of . Again, for concreteness, if is the set of graphs on vertices, we can take to be the set of indicators for subsets of at least vertices.

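For each fixed solution $x$, let $\mu|_x$ denote the uniform distribution over instances $I$ for which the polynomial system is feasible at $x$. The planted distribution $\mu$ is given by taking the uniform mixture over the $\mu|_x$, i.e., $\mu = \mathbb{E}_{x}[\mu|_x]$ with $x$ drawn uniformly from the set of solutions fixed above.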

#### SoS Relaxations

If we have a polynomial system $\{g_j\}_{j\in[m]}$ where $\deg_{\mathrm{prog}}(g_j) \le 2d$ for every $j \in [m]$, then the degree-$2d$ sum-of-squares SDP relaxation for the polynomial system in Program 2.2 can be written as follows.

###### Program 2.4 (SoS Relaxation for Polynomial System).

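Let $\mathcal{S} = \{g_j(x, I) = 0\}_{j\in[m]}$ be a polynomial system in instance variables $I \in \mathcal{I}$ and program variables $x \in \mathcal{X}$. If $\deg_{\mathrm{prog}}(g_j) \le 2d$ for all $j \in [m]$, then an SoS relaxation for $\mathcal{S}$ is

$$\langle G_j(I), X\rangle = 0 \quad \forall j \in [m], \qquad X \succeq 0,$$

where $X$ is an $[n]^{\le d} \times [n]^{\le d}$ matrix containing the variables of the SDP and $G_j(I)$ are $[n]^{\le d} \times [n]^{\le d}$ matrices containing the coefficients of $g_j(x, I)$ in $x$, so that the constraint $\langle G_j(I), X\rangle = 0$ encodes the constraint $g_j(x, I) = 0$ in the SDP variables. Note that the entries of $G_j$ are polynomials of degree at most $\deg_{\mathrm{inst}}(g_j)$ in the instance variables.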
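To make the encoding of a polynomial constraint as a linear constraint on the SDP matrix concrete, here is a small sketch for the lowest level of the hierarchy, where the monomial vector is $(1, x_1, \dots, x_n)$. The helper names and the particular way of splitting coefficients symmetrically are assumptions made for illustration, not the paper's construction. On a rank-one matrix $X = vv^{T}$ built from an actual assignment, the inner product $\langle G, X\rangle$ recovers the value of the polynomial.

```python
def monomials_deg1(x):
    """The vector (1, x_1, ..., x_n) of monomials of degree at most 1."""
    return [1.0] + [float(v) for v in x]

def outer(v):
    """Rank-one matrix v v^T."""
    return [[a * b for b in v] for a in v]

def inner(A, B):
    """Trace inner product <A, B> = sum over entries of A[a][b] * B[a][b]."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def G_boolean(n, i):
    """Coefficient matrix of g(x) = x_i (x_i - 1); x_i sits at index i + 1."""
    G = [[0.0] * (n + 1) for _ in range(n + 1)]
    G[i + 1][i + 1] = 1.0                    # coefficient of x_i^2
    G[0][i + 1] = G[i + 1][0] = -0.5         # coefficient of x_i, split symmetrically
    return G

def G_non_edge(n, i, j, I_ij):
    """Coefficient matrix of g(x, I) = (1 - I_ij) x_i x_j; entries are affine in I_ij."""
    G = [[0.0] * (n + 1) for _ in range(n + 1)]
    G[i + 1][j + 1] = G[j + 1][i + 1] = (1 - I_ij) / 2
    return G

x = [1, 0, 1]
X = outer(monomials_deg1(x))   # rank-one SDP solution coming from a 0/1 assignment
```

For instance, `inner(G_boolean(3, 0), X)` is zero because $x_0 \in \{0,1\}$, and `inner(G_non_edge(3, 0, 2, 0), X)` equals $x_0 x_2 = 1$, reflecting a violated non-edge constraint.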

#### Sub-instances

Suppose that is a family of instances; then given an instance and a subset , let denote the sub-instance consisting of coordinates within . Further, for a distribution over subsets of , let denote a subinstance generated by sampling . Let denote the set of all sub-instances of an instance , and let denote the set of all sub-instances of all instances.
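The operations on sub-instances are simple to state in code. The sketch below represents an instance as a dictionary from coordinates to values; the function names and the particular subsampling distribution (independent inclusion with probability `p`) are hypothetical choices for illustration only. It also includes the completion of a sub-instance by fresh coordinates, which is the operation written $I_S \circ \tilde{I}$ in the robust inference definition.

```python
import random

def restrict(I, S):
    """The sub-instance I_S: the coordinates of the instance I that lie in S."""
    return {c: I[c] for c in S}

def complete(I_S, I_prime):
    """Keep I_S on S and take every remaining coordinate from I_prime."""
    return {c: I_S.get(c, I_prime[c]) for c in I_prime}

def sample_subset(N, p, rng):
    """A hypothetical subsampling distribution: keep each coordinate with probability p."""
    return {c for c in range(N) if rng.random() < p}
```

For example, with `I = {0: 1, 1: 0, 2: 1, 3: 1}` and `S = {0, 2}`, completing `restrict(I, S)` with `{0: 0, 1: 1, 2: 0, 3: 0}` gives `{0: 1, 1: 1, 2: 1, 3: 0}`.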

#### Robust Inference

Our result will pertain to polynomial systems that define planted distributions whose solutions to sub-instances generalize to feasible solutions over the entire instance. We call this property “robust inference.”

###### Definition 2.5.

Let $\mathcal{I} = \Omega^N$ be a family of instances, let $\Theta$ be a distribution over subsets of $[N]$, let $\mathcal{S}$ be a polynomial system as in Program 2.2, and let $\mu$ be a planted distribution over instances feasible for $\mathcal{S}$. Then the polynomial system $\mathcal{S}$ is said to satisfy the robust inference property for probability distribution $\mu$ on $\mathcal{I}$ and subsampling distribution $\Theta$ if, given a subsampling $I_S$ of an instance $I$ from $\mu$, one can infer a setting of the program variables that remains feasible to $\mathcal{S}$ for most settings of $I_{\bar{S}}$.

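Formally, there exists a map $x$ from sub-instances to settings of the program variables such that

$$\Pr_{I\sim\mu,\; S\sim\Theta,\; \tilde{I}\sim\nu|_{I_S}}\big[\,x(I_S)\text{ is a feasible solution for } \mathcal{S} \text{ on } I_S\circ\tilde{I}\,\big] \ge 1-\varepsilon(n,d)$$

for some negligible function $\varepsilon(n,d)$. To specify the error probability, we will say that the polynomial system is $\varepsilon(n,d)$-robustly inferable.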

#### Main Theorem

We are now ready to state our main theorem.

###### Theorem 2.6.

Suppose that is a polynomial system as defined in prog:bopt, of degree at most in the program variables and degree at most in the instance variables. Let such that

1. The polynomial system is -robustly inferable with respect to the planted distribution and the sub-sampling distribution .

2. For , the polynomial system admits a degree- SoS refutation with numbers bounded by with probability at least .

Let be such that for any subset with ,

$$\Pr_{S\sim\Theta}[\alpha\subseteq S] \le \frac{1}{n^{8B}}.$$

There exists a degree matrix polynomial such that,

$$\frac{\mathbb{E}_{I\sim\mu}\big[\lambda^{+}_{\max}(Q(I))\big]}{\mathbb{E}_{I\sim\nu}\big[\lambda^{+}_{\max}(Q(I))\big]} \ge n^{B/2}.$$

###### Remark 2.7.

Our argument implies a stronger result that can be stated in terms of the eigenspaces of the subsampling operator. Specifically, suppose we define

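$$S_{\varepsilon} \stackrel{\mathrm{def}}{=} \big\{\alpha \;\big|\; \Pr_{S\sim\Theta}\{\alpha\subseteq S\} \le \varepsilon\big\}$$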

Then, the distinguishing polynomial exhibited by thm:main satisfies . This refinement can yield tighter bounds in cases where all monomials of a certain degree are not equivalent to each other. For example, in the Planted Clique problem, each monomial consists of a subgraph and the right measure of the degree of a sub-graph is the number of vertices in it, as opposed to the number of edges in it.

In sec:examp, we will make the routine verifications that the conditions of this theorem hold for a variety of distinguishing problems: planted clique (lem:pc-ex), refuting random CSPs (lem:csp-ex), stochastic block models (lem:sbm-ex), densest-$k$-subgraph (lem:dks-ex), tensor PCA (lem:tpca-ex), and sparse PCA (lem:spca-ex). Now we will proceed to prove the theorem.

## 3 Moment-Matching Pseudodistributions

We assume the setup from Section 2: we have a family of instances $\mathcal{I}$, a polynomial system $\mathcal{S}$ with a family of solutions $\mathcal{X}$, a “uniform” distribution $\nu$ which is a product distribution over $\mathcal{I}$, and a “planted” distribution $\mu$ over $\mathcal{I}$ defined by the polynomial system $\mathcal{S}$ as described in Section 2.

The contrapositive of thm:low-deg is that if is robustly inferable with respect to and a distribution over sub-instances , and if there is no spectral algorithm for distinguishing and , then with high probability there is no degree- SoS refutation for the polynomial system (as defined in prog:boptmat). To prove the theorem, we will use duality to argue that if no spectral algorithm exists, then there must exist an object which is in some sense close to a feasible solution to the SoS SDP relaxation.

Since each in the support of is feasible for by definition, a natural starting point is the SoS SDP solution for instances . With this in mind, we let be an arbitrary function from the support of over to PSD matrices. In other words, we take

$$\Lambda(I) = \hat{\mu}(I)\cdot M(I)$$

where is the relative density of with respect to , so that , and is some matrix valued function such that and for all . Our goal is to find a PSD matrix-valued function that matches the low-degree moments of in the variables , while being supported over most of (rather than just over the support of ).

The function $P$ is given by the following exponentially large convex program over matrix-valued functions.

###### Program 3.1 (Pseudodistribution Program).
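$$
\begin{aligned}
\min\quad & \|P\|^2_{\mathrm{Fr},\nu} && (3.1)\\
\text{s.t.}\quad & \langle Q, P\rangle_{\nu} = \langle Q, \Lambda'\rangle_{\nu} \quad \forall\, Q:\mathcal{I}\to\mathbb{R}^{[n]^{\le d}\times[n]^{\le d}},\ \deg_{\mathrm{inst}}(Q)\le D && (3.2)\\
& P \succeq 0\\
& \Lambda' = \Lambda + \eta\cdot\mathrm{Id},\quad 2^{-2^{2n}} > \eta > 0 && (3.3)
\end{aligned}
$$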

The constraint (3.2) fixes $\mathrm{Tr}_{\nu}(P)$ (take $Q = \mathrm{Id}$), and so the objective function (3.1) can be viewed as minimizing $\mathrm{Tr}_{\nu}(P^2)$, a proxy for the collision probability of the distribution, which is a measure of entropy.

###### Remark 3.2.

We have perturbed $\Lambda$ in (3.3) so that we can easily show that strong duality holds in the proof of Claim 3.4. For the remainder of the paper we ignore this perturbation, as we can accumulate the resulting error terms and set $\eta$ to be small enough so that they can be neglected.

The dual of the above program will allow us to relate the existence of an SoS refutation to the existence of a spectral algorithm.

###### Program 3.3 (Low-Degree Distinguisher).
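$$
\begin{aligned}
\max\quad & \langle \Lambda, Q\rangle_{\nu}\\
\text{s.t.}\quad & Q:\mathcal{I}\to\mathbb{R}^{[n]^{\le d}\times[n]^{\le d}},\ \deg_{\mathrm{inst}}(Q)\le D\\
& \|Q_{+}\|^2_{\mathrm{Fr},\nu} \le 1,
\end{aligned}
$$

where $Q_{+}$ is the projection of $Q$ to the PSD cone.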

###### Claim 3.4.

Program 3.3 is a manipulation of the dual of Program 3.1, so that if Program 3.1 has optimum $c$, Program 3.3 has optimum at least $\Omega(\sqrt{c})$.

Before we present the proof of the claim, we summarize its central consequence in the following theorem: if Program 3.1 has a large objective value (and therefore does not provide a feasible SoS solution), then there is a spectral algorithm.

###### Theorem 3.5.

Fix a function be such that . Let be the function that gives the largest non-negative eigenvalue of a matrix. Suppose then the optimum of prog:distrib is equal to only if there exists a low-degree matrix polynomial such that,

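$$\mathbb{E}_{I\sim\mu}\big[\lambda^{+}_{\max}(Q(I))\big] \ge \Omega\big(\sqrt{\mathrm{opt}}/n^{d}\big)$$

while,

$$\mathbb{E}_{I\sim\nu}\big[\lambda^{+}_{\max}(Q(I))\big] \le 1.$$
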
###### Proof.

By Claim 3.4, if the value of Program 3.1 is $\mathrm{opt}$, then there is a polynomial $Q$ that achieves a value of $\Omega(\sqrt{\mathrm{opt}})$ for the dual. It follows that

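$$\mathbb{E}_{I\sim\mu}\big[\lambda^{+}_{\max}(Q(I))\big] \ge \frac{1}{n^{d}}\,\mathbb{E}_{I\sim\mu}\big[\langle \mathrm{Id}, Q(I)\rangle\big] \ge \frac{1}{n^{d}}\,\langle \Lambda, Q\rangle_{\nu} = \Omega\big(\sqrt{\mathrm{opt}}/n^{d}\big),$$

while

$$\mathbb{E}_{I\sim\nu}\big[\lambda^{+}_{\max}(Q(I))\big] \le \sqrt{\mathbb{E}_{I\sim\nu}\big[\lambda^{+}_{\max}(Q(I))^{2}\big]} \le \sqrt{\mathbb{E}_{I\sim\nu}\,\|Q_{+}(I)\|^{2}_{\mathrm{Fr}}} \le 1.$$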
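The two pointwise matrix facts used in this proof are that the largest non-negative eigenvalue of a symmetric matrix is at least its trace divided by its dimension, and at most the Frobenius norm of its PSD projection. The sketch below checks both on $2\times 2$ symmetric matrices using the closed form for their eigenvalues. It is a sanity check under the assumption that the dimension plays the role of $n^{d}$; the function names are hypothetical.

```python
import math

def eigenvalues_sym2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mid = (a + c) / 2
    rad = math.hypot((a - c) / 2, b)
    return mid - rad, mid + rad

def lambda_max_plus(a, b, c):
    """Largest non-negative eigenvalue: max(lambda_max, 0)."""
    return max(eigenvalues_sym2(a, b, c)[1], 0.0)

def frob_norm_psd_part(a, b, c):
    """Frobenius norm of Q_+, the projection of Q onto the PSD cone."""
    return math.sqrt(sum(max(l, 0.0) ** 2 for l in eigenvalues_sym2(a, b, c)))

def check(a, b, c, dim=2, tol=1e-12):
    """Verify trace / dim <= lambda_max_plus <= ||Q_+||_Fr for [[a, b], [b, c]]."""
    lam = lambda_max_plus(a, b, c)
    trace = a + c                        # <Id, Q>
    return trace / dim <= lam + tol and lam <= frob_norm_psd_part(a, b, c) + tol
```

The first inequality is what turns $\langle \mathrm{Id}, Q(I)\rangle$ into a lower bound on $\lambda^{+}_{\max}$, and the second is what bounds $\lambda^{+}_{\max}(Q(I))^{2}$ by $\|Q_{+}(I)\|^{2}_{\mathrm{Fr}}$.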

It is interesting to note that the specific structure of the PSD matrix-valued function $M$ plays no role in the above argument: since $M$ serves as a proxy for monomials in the solution as represented by the program variables, it follows that the choice of how to represent the planted solution is not critical. Although seemingly counterintuitive, this is natural because the property of being distinguishable by low-degree distinguishers or by SoS SDP relaxations is a property of $\nu$ and $\mu$.

We wrap up the section by presenting a proof of Claim 3.4.

###### Proof of Claim 3.4.

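We take the Lagrangian dual of Program 3.1. Our dual variables will be some combination of low-degree matrix polynomials, $Q$, and a PSD matrix $A$:

$$L(P,Q,A) = \|P\|^{2}_{\mathrm{Fr},\nu} - \langle Q, P-\Lambda'\rangle_{\nu} - \langle A, P\rangle_{\nu} \qquad \text{s.t. } A \succeq 0.$$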

It is easy to verify that if $P$ is not PSD, then $A$ can be chosen so that the value of $L$ is $\infty$. Similarly, if there exists a low-degree polynomial upon which $P$ and $\Lambda'$ differ in expectation, $Q$ can be chosen as a multiple of that polynomial so that the value of $L$ is $\infty$.

Now, we argue that Slater’s conditions are met for Program 3.1, as $P = \Lambda'$ is strictly feasible. Thus strong duality holds, and therefore

$$\min_{P}\ \max_{A\succeq 0,\,Q} L(P,Q,A) \le \max_{A\succeq 0,\,Q}\ \min_{P} L(P,Q,A).$$

Taking the partial derivative of $L$ with respect to $P$, we have

$$\frac{\partial}{\partial P} L(P,Q,A) = 2\cdot P - Q - A,$$

where the first derivative is taken in the space of functions from $\mathcal{I}$ to matrices. By the convexity of $L$ as a function of $P$, it follows that if we set $P = \tfrac12(A+Q)$, we will have the minimizer. Substituting, it follows that

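$$
\begin{aligned}
\min_{P}\ \max_{A\succeq 0,\,Q} L(P,Q,A) &\le \max_{A\succeq 0,\,Q}\ \tfrac14\|A+Q\|^{2}_{\mathrm{Fr},\nu} - \big\langle Q, \tfrac12(A+Q)-\Lambda'\big\rangle_{\nu} - \tfrac12\langle A, A+Q\rangle_{\nu}\\
&= \max_{A\succeq 0,\,Q}\ \langle Q, \Lambda'\rangle_{\nu} - \tfrac14\|A+Q\|^{2}_{\mathrm{Fr},\nu}
\end{aligned}
\tag{3.4}
$$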
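The substitution leading to (3.4) is a short algebraic identity, and it can be checked numerically. In the sketch below, plain vectors with the standard dot product stand in for matrix-valued functions with the $\nu$-weighted inner product; this is a simplifying assumption, as are the function names and the sample values. The PSD constraint on $A$ plays no role in this step, so it is not modelled.

```python
def dot(u, v):
    """Stand-in for the inner product <., .>_nu."""
    return sum(a * b for a, b in zip(u, v))

def lagrangian(P, Q, A, Lam):
    """L(P, Q, A) = ||P||^2 - <Q, P - Lam> - <A, P>."""
    diff = [p - l for p, l in zip(P, Lam)]
    return dot(P, P) - dot(Q, diff) - dot(A, P)

def closed_form(Q, A, Lam):
    """<Q, Lam> - (1/4) ||A + Q||^2, the value of L at P = (A + Q) / 2."""
    s = [a + q for a, q in zip(A, Q)]
    return dot(Q, Lam) - dot(s, s) / 4

Q = [1.0, -2.0, 0.5]
A = [0.5, 3.0, 0.0]
Lam = [2.0, 1.0, -1.0]
P_star = [(a + q) / 2 for a, q in zip(A, Q)]   # stationary point: 2P - Q - A = 0
```

Evaluating `lagrangian(P_star, Q, A, Lam)` reproduces `closed_form(Q, A, Lam)`, and moving away from `P_star` in any direction increases the Lagrangian, since it is a convex quadratic in `P`.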

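Now it is clear that the maximizing choice of $A$ is to set $A = -Q_{-}$, the negation of the negative-semi-definite projection of $Q$. Thus (3.4) simplifies to

$$
\begin{aligned}
\min_{P}\ \max_{A\succeq 0,\,Q} L(P,Q,A) &\le \max_{Q}\ \langle Q, \Lambda'\rangle_{\nu} - \tfrac14\|Q_{+}\|^{2}_{\mathrm{Fr},\nu}\\
&\le \max_{Q}\ \langle Q, \Lambda\rangle_{\nu} + \eta\,\mathrm{Tr}_{\nu}(Q_{+}) - \tfrac14\|Q_{+}\|^{2}_{\mathrm{Fr},\nu},
\end{aligned}
\tag{3.5}
$$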

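where we have used the shorthand $\mathrm{Tr}_{\nu}(Q_{+}) = \mathbb{E}_{I\sim\nu}\,\mathrm{Tr}(Q(I)_{+})$. Now suppose that the low-degree matrix polynomial $Q^{*}$ achieves a right-hand-side value of

$$\langle Q^{*}, \Lambda\rangle_{\nu} + \eta\cdot\mathrm{Tr}_{\nu}(Q^{*}_{+}) - \tfrac14\|Q^{*}_{+}\|^{2}_{\mathrm{Fr},\nu} \ge c.$$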

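Consider $Q' = Q^{*}/\|Q^{*}_{+}\|_{\mathrm{Fr},\nu}$. Clearly $\|Q'_{+}\|_{\mathrm{Fr},\nu} = 1$. Now, multiplying the above inequality through by the scalar $1/\|Q^{*}_{+}\|_{\mathrm{Fr},\nu}$, we have that

$$
\begin{aligned}
\langle Q', \Lambda\rangle_{\nu} &\ge \frac{c}{\|Q^{*}_{+}\|_{\mathrm{Fr},\nu}} - \eta\cdot\frac{\mathrm{Tr}_{\nu}(Q^{*}_{+})}{\|Q^{*}_{+}\|_{\mathrm{Fr},\nu}} + \frac14\|Q^{*}_{+}\|_{\mathrm{Fr},\nu}\\
&\ge \frac{c}{\|Q^{*}_{+}\|_{\mathrm{Fr},\nu}} - \eta\cdot n^{d} + \frac14\|Q^{*}_{+}\|_{\mathrm{Fr},\nu}.
\end{aligned}
$$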

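Therefore $\langle Q', \Lambda\rangle_{\nu}$ is at least $\Omega(\sqrt{c}) - \eta n^{d}$, as if $\|Q^{*}_{+}\|_{\mathrm{Fr},\nu} \ge \sqrt{c}$ then the third term gives the lower bound, and otherwise the first term gives the lower bound.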
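The case analysis above concerns the scalar function $t \mapsto c/t + t/4$, where $t$ stands for $\|Q^{*}_{+}\|_{\mathrm{Fr},\nu}$. As a quick sanity check, the sketch below confirms numerically that this function never drops below $\sqrt{c}$ and attains that value at $t = 2\sqrt{c}$, which is what the AM-GM inequality predicts. The function name and the test values are illustrative assumptions.

```python
import math

def scaled_bound(c, t):
    """c / t + t / 4: the first and third terms of the bound, with t > 0."""
    return c / t + t / 4

def min_over_grid(c, steps=2000, t_max=100.0):
    """Smallest value of scaled_bound(c, t) over a uniform grid of t in (0, t_max]."""
    return min(scaled_bound(c, t_max * s / steps) for s in range(1, steps + 1))
```

For example, `min_over_grid(9.0)` is close to `3.0`, so the additive error term is the only loss beyond a constant factor in $\sqrt{c}$.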

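Thus by substituting $Q'$, the square root of the maximum of (3.5), within an additive $\eta n^{d}$, lower-bounds the maximum of the program

$$
\begin{aligned}
\max\quad & \langle Q, \Lambda\rangle_{\nu}\\
\text{s.t.}\quad & Q:\mathcal{I}\to\mathbb{R}^{[n]^{\le d}\times[n]^{\le d}},\ \deg_{\mathrm{inst}}(Q)\le D\\
& \|Q_{+}\|^{2}_{\mathrm{Fr},\nu} \le 1.
\end{aligned}
$$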

This concludes the proof. ∎

## 4 Proof of Theorem 2.6

We will prove thm:main by contradiction. Let us assume that there exists no degree- matrix polynomial that distinguishes from . First, the lack of distinguishers implies the following fact about scalar polynomials.

###### Lemma 4.1.

Under the assumption that there are no degree- distinguishers, for every degree- scalar polynomial ,

$$\|Q\|^{2}_{\mathrm{Fr},\mu} \le n^{B}\,\|Q\|^{2}_{\mathrm{Fr},\nu}$$

###### Proof.

Suppose not, then the degree- matrix polynomial will be a distinguisher between and . ∎

##### Constructing Λ

First, we will use the robust inference property of to construct a pseudo-distribution . Recall again that we have defined to be the relative density of with respect to , so that . For each subset , define a PSD matrix-valued function as,

$$\Lambda_{S}(I) = \mathbb{E}_{I'_{\bar{S}}}\big[\hat{\mu}(I_S\circ I'_{\bar{S}})\big]\cdot x(I_S)^{\le d}\big(x(I_S)^{\le d}\big)^{T}$$

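where we use $I_S$ to denote the restriction of $I$ to $S \subseteq [N]$, and $I_S\circ I'_{\bar{S}}$ to denote the instance given by completing the sub-instance $I_S$ with the setting $I'_{\bar{S}}$. Notice that $\Lambda_S$ is a function depending only on $I_S$; this fact will be important to us. Define $\Lambda = \mathbb{E}_{S\sim\Theta}\,\Lambda_S$. Observe that $\Lambda$ is a PSD matrix-valued function that satisfies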

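$$\langle \Lambda_{\emptyset,\emptyset}, 1\rangle_{\nu} = \mathbb{E}_{I\sim\nu}\,\mathbb{E}_{S\sim\Theta}\,\mathbb{E}_{I'_{\bar{S}}\sim\nu}\big[\hat{\mu}(I_S\circ I'_{\bar{S}})\big] = \mathbb{E}_{S}\,\mathbb{E}_{I_{\bar{S}}}\,\mathbb{E}_{I_S\circ I'_{\bar{S}}\sim\nu}\big[\hat{\mu}(I_S\circ I'_{\bar{S}})\big] = 1 \tag{4.1}$$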
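The normalization in (4.1) relies only on $\nu$ being a product distribution: resampling the coordinates outside $S$ from $\nu$ leaves a $\nu$-distributed instance, so the averaged relative density still integrates to 1. The following toy computation checks this exactly on three-bit instances. The planted distribution used here and all names are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

N = 3
instances = list(product((0, 1), repeat=N))
nu = {I: Fraction(1, len(instances)) for I in instances}    # uniform product measure

# A hypothetical planted distribution: uniform over instances with at least two ones.
support = [I for I in instances if sum(I) >= 2]
mu = {I: (Fraction(1, len(support)) if I in support else Fraction(0)) for I in instances}
mu_hat = {I: mu[I] / nu[I] for I in instances}              # relative density of mu w.r.t. nu

subsets = [frozenset(s) for r in range(N + 1) for s in combinations(range(N), r)]

def complete(I, I_prime, S):
    """Coordinates in S come from I, the remaining ones from I_prime."""
    return tuple(I[c] if c in S else I_prime[c] for c in range(N))

def averaged_density(I, S):
    """Average of mu_hat over completions of I_S; depends on I only through I_S."""
    return sum(nu[Ip] * mu_hat[complete(I, Ip, S)] for Ip in instances)

def total_mass(S):
    """Expectation over I ~ nu of the averaged density, which (4.1) says equals 1."""
    return sum(nu[I] * averaged_density(I, S) for I in instances)
```

Since `total_mass(S)` equals 1 for every subset `S`, the same holds after averaging over any subsampling distribution on subsets.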

Since $\Lambda$ is an average over the $\Lambda_S$, each of which is a feasible solution with high probability, $\Lambda$ is close to a feasible solution to the SDP relaxation for the polynomial system. The following lemma formalizes this intuition.

Define , and use to denote the orthogonal projection into .

###### Lemma 4.2.

Suppose prog:bopt satisfies the -robust inference property with respect to planted distribution and subsampling distribution and if for all then for every , we have

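$$\langle \Lambda, G\rangle_{\nu} \le \sqrt{\varepsilon}\cdot K\cdot\Big(\mathbb{E}_{S\sim\Theta}\,\mathbb{E}_{\tilde{I}_{\bar{S}}\sim\nu}\,\mathbb{E}_{I\sim\mu}\,\big\|G(I_S\circ \tilde{I}_{\bar{S}})\big\|_{2}^{2}\Big)^{1/2}$$
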
###### Proof.

We begin by expanding the left-hand side by substituting the definition of . We have

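$$
\begin{aligned}
\langle \Lambda, G\rangle_{\nu} &= \mathbb{E}_{S\sim\Theta}\,\mathbb{E}_{I\sim\nu}\,\langle \Lambda_S(I_S), G(I)\rangle\\
&= \mathbb{E}_{S\sim\Theta}\,\mathbb{E}_{I\sim\nu}\,\mathbb{E}_{I'_{\bar{S}}\sim\nu}\ \hat{\mu}(I_S\circ I'_{\bar{S}})\cdot\big\langle x(I_S)^{\le d}\big(x(I_S)^{\le d}\big)^{T}, G(I)\big\rangle
\end{aligned}
$$

And because the inner product is zero if $x(I_S)$ is a feasible solution,

$$
\begin{aligned}
&\le \mathbb{E}_{S\sim\Theta}\,\mathbb{E}_{I\sim\nu}\,\mathbb{E}_{I'_{\bar{S}}\sim\nu}\ \hat{\mu}(I_S\circ I'_{\bar{S}})\cdot\mathbb{I}\big[x(I_S)\text{ is infeasible for } \mathcal{S}(I)\big]\cdot\big\|x(I_S)^{\le d}\big\|_{2}^{2}\cdot\|G(I)\|_{\mathrm{Fr}}\\
&\le \mathbb{E}_{S\sim\Theta}\,\mathbb{E}_{I\sim\nu}\,\mathbb{E}_{I'_{\bar{S}}\sim\nu}\ \hat{\mu}(I_S\circ I'_{\bar{S}})\cdot\mathbb{I}\big[x(I_S)\text{ is infeasible for } \mathcal{S}(I)\big]\cdot K\cdot\|G(I)\|_{\mathrm{Fr}}
\end{aligned}
$$

And now letting ~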