High-dimensional estimation via sum-of-squares proofs

07/30/2018 · by Prasad Raghavendra, et al.

Estimation is the computational task of recovering a hidden parameter x associated with a distribution D_x, given a measurement y sampled from the distribution. High dimensional estimation problems arise naturally in statistics, machine learning, and complexity theory. Many high dimensional estimation problems can be formulated as systems of polynomial equations and inequalities, and thus give rise to natural probability distributions over polynomial systems. Sum-of-squares proofs provide a powerful framework to reason about polynomial systems, and further there exist efficient algorithms to search for low-degree sum-of-squares proofs. Understanding and characterizing the power of sum-of-squares proofs for estimation problems has been a subject of intense study in recent years. On one hand, there is a growing body of work utilizing sum-of-squares proofs for recovering solutions to polynomial systems when the system is feasible. On the other hand, a general technique referred to as pseudocalibration has been developed towards showing lower bounds on the degree of sum-of-squares proofs. Finally, the existence of sum-of-squares refutations of a polynomial system has been shown to be intimately connected to the existence of spectral algorithms. In this article we survey these developments.


1 Introduction

In estimation problems, the goal is to recover a structured object from an observed input which partially obfuscates it. Formally, an estimation problem is specified by a family of distributions D_x over measurements, parametrized by a hidden parameter x. The input consists of a sample y drawn from D_x for some x, and the goal is to recover the value of the parameter x. We refer to x as the hidden variable or the parameter, and to the sample y as the measurement or the instance.

Often, it is information-theoretically impossible to recover the hidden variable x, in that its value is not completely determined by the measurement y. Further, even when recovery is information-theoretically possible, in many high-dimensional settings it is computationally intractable to recover x. For these reasons, we often seek to recover x approximately by minimizing the expected loss for an appropriate loss function. For example, if x̂(y) denotes the estimate for x given the measurement y, a natural goal would be to minimize the expected mean-square loss E‖x̂(y) − x‖².

In many cases, we can formulate such a minimization problem as a feasibility problem for a system of polynomial equations. By classical NP-completeness results, general polynomial systems in many variables are computationally intractable in the worst case. In our context, an estimation problem gives rise to a distribution over polynomial systems that encode it. We wish to study a typical system drawn from this distribution. If the underlying distributions are sufficiently well-behaved, polynomial systems yield an avenue to design algorithms for high-dimensional estimation problems.

In this survey, our tool for studying such polynomial systems will be sum-of-squares (SoS) proofs. Sum-of-squares proofs yield a complete proof system for reasoning about polynomial systems [Kri64, Ste74]. More importantly, SoS proofs are constructive: the problem of finding a sum-of-squares proof can be formulated as a semidefinite program, and thus algorithms for convex optimization can be used to find a sum-of-squares proof when one exists. Low-degree SoS proofs can be found efficiently, and the computational complexity of the algorithm grows exponentially with the degree of the polynomials involved in the proof.
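To make the reduction to semidefinite programming concrete, here is a minimal sketch, assuming the numpy and cvxpy packages (the example polynomial and all variable names are chosen here purely for illustration), of searching for a degree-4 sum-of-squares certificate of nonnegativity of a univariate polynomial; the general case replaces the monomial vector (1, x, x²) by all monomials of degree at most half the proof degree in n variables.

```python
import cvxpy as cp
import numpy as np

# Certify that p(x) = x^4 - 2x^2 + 1 is a sum of squares by finding a PSD
# "Gram" matrix Q with p(x) = m(x)^T Q m(x), where m(x) = (1, x, x^2).
Q = cp.Variable((3, 3), symmetric=True)
constraints = [
    Q >> 0,                       # Q must be positive semidefinite
    Q[0, 0] == 1,                 # constant coefficient of p
    2 * Q[0, 1] == 0,             # coefficient of x
    2 * Q[0, 2] + Q[1, 1] == -2,  # coefficient of x^2
    2 * Q[1, 2] == 0,             # coefficient of x^3
    Q[2, 2] == 1,                 # coefficient of x^4
]
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print(prob.status)  # "optimal" means a degree-4 SoS certificate exists

# Extracting the squares from Q: if Q = V^T V, then p(x) = sum_i (V m(x))_i^2.
V = np.linalg.cholesky(Q.value + 1e-6 * np.eye(3)).T
print(np.round(V, 3))
```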

The study of low-degree SoS proofs in the context of estimation problems suggests a rich family of questions. For natural estimation problems, if a polynomial system drawn from the corresponding distribution is feasible, can one harness sum-of-squares proofs towards solving the polynomial system? (surprisingly, the answer is often yes!) If a system from this distribution is typically infeasible, what is the smallest degree of a sum-of-squares refutation? Are there structural characterizations of the degree of SoS refutations in terms of the properties of the distribution? Is there a connection between the existence of low-degree SoS proofs and the spectra of random matrices associated with the distribution (yielding efficient spectral algorithms)? Over the past few years, significant strides have been made on all these fronts, exposing the contours of a rich theory that remains largely hidden. This survey will be devoted to expounding some of the major developments in this context.

1.1 Estimation problems

We will start by describing a few estimation problems that will be recurring examples in our survey.

Example 1.1 (-clique).

Fix a positive integer . In the -clique problem, a clique of size is planted within a random graph drawn from the Erdős-Rényi distribution denoted . The goal is to recover the -clique.

Formally, the structured family is parametrized by subsets . For a subset , the distribution over measurements is specified by the following sampling procedure:

  • Sample a graph from the Erdős-Rényi distribution and set where denotes the clique on the vertices in .

An application of the second moment method [GM75] shows that for all , the clique can be exactly recovered with high probability given the graph . However, for any , there is no known polynomial-time algorithm for the problem, the best algorithm being a brute-force search running in time . Improving upon this runtime is an open problem dating back to Karp in 1976 [Kar76]; save for the spectral algorithm of Alon et al. for [AKS98a], the only progress has been in proving lower bounds against broad classes of algorithms (e.g. [Jer92, FK03, FGR17, BHK16]).

We will now see how to encode the problem as a polynomial system. For pairs , let denote the natural -encoding of the graph , namely, for all . Set . We will refer to the variables as instance variables as they specify the input to the problem. The variables will be referred to as the hidden variables. We encode each constraint as a polynomial equality or inequality:

the hidden variables are Boolean
if two vertices are not adjacent in the graph, then they are not both in the clique
at least the desired number of vertices are in the clique
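In symbols, one standard encoding (with notation chosen here for illustration: $x \in \{0,1\}^n$ the indicator vector of the clique, $E(G)$ the edge set of the given graph, and $\omega$ the target clique size) is

$$x_i^2 = x_i \ \text{ for all } i, \qquad x_i x_j = 0 \ \text{ for all pairs } \{i,j\} \notin E(G), \qquad \sum_{i} x_i \ \ge\ \omega.$$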

Note that when we are solving the estimation problem, the instance variables are given, and the hidden variables are the unknowns in the polynomial system. It is easy to check that the only feasible solutions for this system of polynomial equations are Boolean vectors supported on cliques of at least the desired size in the given graph.

Refutation and distinguishing.

For every estimation problem that we will encounter in this survey, we can associate two related computational problems termed refutation and distinguishing. In estimation problems, we typically think of instances as having structure: we sample from a structured distribution , and we wish to recover the hidden variables that give structure to . But there may also be instances which do not have structure. The goal of refutation is to certify that there is no hidden structure, when there is none.

A null distribution is a probability distribution over instances for which there is no hidden structure . For example, in the -clique problem, the corresponding null distribution is the Erdős-Rényi random graph (without a planted clique). With high probability, a graph has no clique with significantly more than vertices. Therefore, for a fixed , given a graph , the goal of a refutation algorithm is to certify that has no clique of size . Equivalently, the goal of a refutation algorithm is to certify the infeasibility of the associated polynomial system.

The most rudimentary computational task associated with estimation and refutation is that of distinguishing. The setup of the distinguishing problem is as follows. Fix a prior distribution on the hidden variables , which in turn induces a distribution on , obtained by first sampling and then sampling . The input consists of a sample which is with equal probability drawn from the structured distribution or the null distribution . The computational task is to identify which distribution the sample is drawn from, with a probability of success for some constant . For example, the structured distribution for -clique is obtained by setting the prior distribution of to be uniform on subsets of of size . In the distinguishing problem, the input is a graph drawn from either or the null distribution , and the algorithm is required to identify the distribution. For every problem included in this survey, the distinguishing task is formally no harder than estimation or refutation, i.e., the existence of algorithms for estimation or refutation immediately implies a distinguishing algorithm.

Example 1.2 (tensor PCA).

The family of structured distributions is parametrized by unit vectors . A sample from consists of a -tensor , where is a symmetric -tensor whose entries are i.i.d. Gaussian random variables sampled from . The goal is to recover a vector that is as close as possible to .

A canonical strategy to recover given is to maximize the degree- polynomial associated with the symmetric tensor . Specifically, if we set , then one can show that with high probability over . If then . Furthermore, when it can be shown that is close to the unique maximizer of the function . So the problem of recovering can be encoded as the following polynomial system:

the candidate vector lies on the unit sphere
the polynomial associated with the tensor has large value at the candidate vector
In the distinguishing and refutation versions of this problem, we will take the null distribution to be the distribution over -tensors with independent Gaussian entries sampled from (equivalent to the distribution of the noise from ). For a -tensor , the maximum of over the unit ball is referred to as the injective tensor norm of the tensor , and is denoted by . If then with high probability over choice of [ABAČ]. Thus when , the refutation version of the tensor PCA problem reduces to certifying an upper bound on . If we could compute exactly, then we could certify that for as large as .

The injective tensor norm is known to be computationally intractable in the worst case [Gur03, Gha10, BBH12]. Understanding the function for random tensors is a deep topic in probability theory and statistical physics (e.g. [ABAČ]). As an estimation problem, tensor PCA was first considered by [MR14], and inspired multiple follow-up works concerned with spectral and SoS algorithms (e.g. [HSS15, HSSS16, RRS17, BGL17]).
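To make the planted model concrete, here is a minimal sketch, assuming numpy (the dimension n and signal strength lam are illustrative parameters, not values from this survey), that generates a spiked 3-tensor and compares the value of the associated cubic polynomial at the planted direction with its value at a random direction; the algorithmic challenge is, of course, to find the planted direction without knowing it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 8.0                       # dimension and signal strength (illustrative)

# Planted unit vector and symmetric Gaussian noise tensor.
v = rng.standard_normal(n); v /= np.linalg.norm(v)
A = rng.standard_normal((n, n, n))
A = (A + A.transpose(0, 2, 1) + A.transpose(1, 0, 2)
       + A.transpose(1, 2, 0) + A.transpose(2, 0, 1) + A.transpose(2, 1, 0)) / 6

T = lam * np.einsum('i,j,k->ijk', v, v, v) + A   # spiked 3-tensor

def poly(T, x):
    """Evaluate the cubic form <T, x ⊗ x ⊗ x>."""
    return np.einsum('ijk,i,j,k->', T, x, x, x)

u = rng.standard_normal(n); u /= np.linalg.norm(u)
print("value at planted direction:", poly(T, v))   # roughly lam plus noise
print("value at random direction: ", poly(T, u))   # typically of constant order
```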

Example 1.3 (Matrix & Tensor Completion).

In matrix completion, the hidden parameter is a rank- matrix . For a parameter , the measurement consists of a partial matrix revealing a subset of entries of , namely for a subset with . The probability distribution over measurements is obtained by picking the set to be a uniformly random subset of entries.

To formulate a polynomial system for recovering a rank- matrix consistent with the measurement , we will use a matrix of variables , and write the following system of constraints on it:

the candidate low-rank matrix is consistent with the measurement

Tensor completion is the analogous problem in which the hidden parameter is a higher-order tensor of some fixed order. The corresponding polynomial system is again over a matrix of variables with columns, with the following system of constraints:

the candidate low-rank tensor is consistent with the measurement

1.2 Sum-of-squares proofs

The sum-of-squares (SoS) proof system is a restricted class of proofs for reasoning about polynomial systems. Fix a set of polynomial inequalities in variables . We will refer to these inequalities as the axioms. Starting with the axioms , a sum-of-squares proof of is given by an identity of the form,

where are real polynomials. It is clear that any identity of the above form manifestly certifies that the polynomial , whenever each for real . The degree of the sum-of-squares proof is the maximum degree of all the summands, i.e., .
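A common way to write such an identity, for axioms $p_1 \ge 0, \dots, p_m \ge 0$ (a reconstruction for concreteness; the exact normal form may differ, e.g. in whether products of axioms are allowed), is

$$p \;=\; \sigma_0 + \sum_{i=1}^{m} \sigma_i \, p_i,$$

where $\sigma_0, \sigma_1, \dots, \sigma_m$ are sums of squares of polynomials; since sums of squares are nonnegative at every real point, such an identity certifies that $p(x) \ge 0$ whenever all the axioms hold at $x$.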

Sum-of-squares proofs extend naturally to polynomial systems that involve a set of equalities along with a set of inequalities . We can extend the definition syntactically by replacing each equality by a pair of inequalities and .

We will use the notation to denote the assertion that there exists a degree- sum-of-squares proof of from the set of axioms . The superscript in the notation indicates that the sum-of-squares proof is an identity of polynomials where is the formal variable. We will drop the subscript or superscript when it is clear from the context, and just write . Sum-of-squares proofs can also be used to certify the infeasibility of, or refute, the polynomial system. In particular, a degree- sum-of-squares refutation of a polynomial system is an identity of the form,

(1.1)

where is at most .
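In the same spirit, a refutation as in Eq. 1.1 is typically written (again a reconstruction, with the same caveat) as

$$-1 \;=\; \sigma_0 + \sum_{i=1}^{m} \sigma_i \, p_i,$$

where $\sigma_0, \dots, \sigma_m$ are sums of squares and every summand has degree at most $d$; no real point can satisfy all of the axioms, since the right-hand side would then be nonnegative while the left-hand side is $-1$.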

The sum-of-squares proof system has been an object of study starting with the work of Hilbert and Minkowski more than a century ago (see [Rez00] for a survey). With no restriction on degree, Stengle’s Positivstellensatz implies that sum-of-squares proofs form a complete proof system, i.e., if the axioms imply , then there is an SoS proof of this fact.

The algorithmic implications of the sum-of-squares proof system were first realized in the works of Parrilo [Par00] and Lasserre [Las01], who independently arrived at families of algorithms for polynomial optimization using semidefinite programming (SDP). Specifically, these works observed that semidefinite programming can be used to find a degree- SoS proof in time , if one exists. This family of algorithms (called a hierarchy, as we have an algorithm for each even integer degree-) is referred to as the sum-of-squares SDP hierarchy. We say that the SoS algorithm is low-degree if does not grow with .

The SoS hierarchy has since emerged as a powerful tool for algorithm design. On the one hand, the first few levels of the SoS hierarchy systematically capture a vast majority of algorithms in combinatorial optimization and approximation algorithms developed over several decades. Furthermore, the low-degree SoS SDP hierarchy holds the promise of yielding improved approximations to NP-hard combinatorial optimization problems, approximations that would beat the long-standing and universal barrier posed by the notorious unique games conjecture [Tre12, BS14].

More recently, the low-degree SoS SDP hierarchy has proved to be a very useful tool in designing algorithms for high-dimensional estimation problems, wherein the inputs are drawn from a natural probability distribution. For this survey, we organize the recent work on this topic into three lines of work.

  • When the polynomial system for an estimation problem is feasible, can sum-of-squares proofs be harnessed to retrieve the solution? The answer is yes for many estimation problems, including tensor decomposition, matrix and tensor completion, and clustering problems. Furthermore, there is a simple and unifying principle that underlies all of these applications. Specifically, the underlying principle asserts that if there is a low-degree SoS proof that all solutions to the system are close to the hidden variable , then a low-degree SoS SDP can be used to actually retrieve . We will discuss this broad principle and several of its implications in Section 2.

  • When the polynomial system is infeasible, what is the smallest degree at which it admits a sum-of-squares proof of infeasibility? The degree of the sum-of-squares refutation is critical for the run-time of the SoS SDP-based algorithm. Recent work by Barak et al. [BHK16], developed in the context of the work on -clique, introduces a technique referred to as “pseudocalibration” for proving lower bounds on the degree of SoS refutations. Section 3 is devoted to the heuristic technique of pseudocalibration, and the mystery surrounding its effectiveness.

  • Can the existence of degree- sum-of-squares refutations be characterized in terms of (spectral) properties of the underlying distribution? In Section 4, we will discuss a result that shows a connection between the existence of low-degree sum-of-squares refutations and the spectra of certain low-degree matrices associated with the distribution. This connection implies that under fairly mild conditions, SoS SDP-based algorithms are no more powerful than a much simpler and more lightweight class of algorithms referred to as spectral algorithms. Roughly speaking, a spectral algorithm proceeds by constructing a matrix out of the input instance, and then using the eigenvalues of the matrix to recover the desired outcome; a minimal example is sketched below.
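As a concrete instance of this template, here is a minimal sketch, assuming numpy (the sizes are illustrative and this is a distinguishing rather than a recovery algorithm), of a spectral test for planted clique: the top eigenvalue of the ±1 adjacency matrix is on the order of √n under the null distribution, and is noticeably larger once a clique of size well above √n is planted.

```python
import numpy as np

rng = np.random.default_rng(1)

def top_eigenvalue_statistic(G):
    """Spectral test statistic: top eigenvalue of the +/-1 adjacency matrix."""
    M = 2.0 * G - 1.0            # map edges to +1, non-edges to -1
    np.fill_diagonal(M, 0.0)
    return np.linalg.eigvalsh(M)[-1]

def sample_null(n):
    """Erdős–Rényi graph G(n, 1/2) as a symmetric 0/1 adjacency matrix."""
    U = rng.random((n, n)) < 0.5
    G = np.triu(U, 1)
    return (G + G.T).astype(float)

def sample_planted(n, k):
    """G(n, 1/2) with a clique planted on k random vertices."""
    G = sample_null(n)
    S = rng.choice(n, size=k, replace=False)
    G[np.ix_(S, S)] = 1.0
    np.fill_diagonal(G, 0.0)
    return G

n, k = 400, 60                          # illustrative sizes with k well above sqrt(n)
print("null:   ", top_eigenvalue_statistic(sample_null(n)))        # about 2*sqrt(n)
print("planted:", top_eigenvalue_statistic(sample_planted(n, k)))  # roughly k
```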

Notation.

For a positive integer , we use to denote the set . We sometimes use to denote the set of all subsets of of size , and to denote the set of all multi-subsets of cardinality at most .

If and is a multiset, then we will use the shorthand to denote the monomial . We will also use to denote the vector containing all monomials in of degree at most (including the constant monomial ), where . Let denote the space of polynomials of degree at most in variables .

For a function , we will say if for some universal constant . We say that if .

If is a distribution over the probability space , then we use the notation for sampled according to . For an event , we will use as the indicator that occurs. We use to denote the Erdős-Rényi distribution with parameter , or the distribution over graphs where each edge is included independently with probability .

If is an matrix, we use to denote ’s largest eigenvalue. When , then denotes ’s trace. If is an matrix as well, then we use to denote the matrix inner product. We use to denote the Frobenius norm of , . For a subset , we will use to denote the indicator vector of in . We will also use to denote the all-1’s vector.

For two matrices we use to denote both the Kronecker product of and , and the order- tensor given by taking and reshaping it with modes for the rows and columns of and of . We also use to denote the -th Kronecker power of . For an order- tensor and for a permutation of , we denote by the matrix reshaping given by ordering the modes of so that index the rows and index the columns.

Pseudoexpectations.

For a polynomial system in variables consisting of inequalities , we can write an SDP of size which finds a degree- sum-of-squares refutation, if one exists (see [Rot13] for more discussion).

If there is no degree- refutation, the dual semidefinite program computes in time a linear functional over degree- polynomials which we term a pseudoexpectation. Formally, a degree- pseudoexpectation is a linear functional over polynomials of degree at most with the properties that , for all and polynomials such that , and whenever .
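As a small concrete instance, a degree-2 pseudoexpectation can be represented by its moment matrix; the following is a minimal sketch, assuming numpy and cvxpy (the variable names and the choice of Booleanity axioms are illustrative), in which the positive-semidefiniteness of the moment matrix is exactly what makes the pseudoexpectation of every squared linear polynomial nonnegative.

```python
import cvxpy as cp
import numpy as np

n = 3
# Moment matrix M indexed by the monomial vector (1, x_1, ..., x_n):
# M[0, 0] = E~[1], M[0, i] = E~[x_i], M[i, j] = E~[x_i x_j].
M = cp.Variable((n + 1, n + 1), symmetric=True)
constraints = [M >> 0, M[0, 0] == 1]

# Example axioms: Booleanity, imposed as E~[x_i^2] = E~[x_i].
constraints += [M[i, i] == M[0, i] for i in range(1, n + 1)]

# Any feasible M defines a degree-2 pseudoexpectation: for a linear polynomial
# q(x) = c0 + <c, x>, E~[q^2] = (c0, c)^T M (c0, c) >= 0 by the PSD constraint.
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print(np.round(M.value, 3))
```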

Claim 1.4.

If there exists a degree- pseudoexpectation for the polynomial system , then does not admit a degree- refutation.

Proof.

Suppose admits a degree- refutation. Applying the pseudoexpectation operator to the left-hand-side of Eq. 1.1, we have . Applying to the right-hand-side of Eq. 1.1, the first summand must be non-negative by definition of since it is a sum of squares, and the second summand is non-negative, since we assumed that satisfies the constraints of . This yields a contradiction. ∎

The properties above imply that when , then if is a degree- pseudoexpectation operator for the polynomial system defined by , as well. This implies that satisfies several useful inequalities; for example, the Cauchy-Schwarz inequality.

Claim 1.5.

If is a degree- pseudoexpectation and if are polynomials of degree at most , then .

Proof.

We have the following polynomial equality of degree at most :

Applying to both sides, using that , we have our conclusion. ∎
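One standard way to complete this argument (a reconstruction, under the assumption that the claim is the inequality of Claim 1.5 in its usual form) is to note that for every $t > 0$,

$$0 \;\le\; \tilde{\mathbb{E}}\big[(t\,p - t^{-1} q)^2\big] \;=\; t^2\, \tilde{\mathbb{E}}[p^2] - 2\,\tilde{\mathbb{E}}[pq] + t^{-2}\, \tilde{\mathbb{E}}[q^2],$$

so that $2\,\tilde{\mathbb{E}}[pq] \le t^2\, \tilde{\mathbb{E}}[p^2] + t^{-2}\, \tilde{\mathbb{E}}[q^2]$; optimizing over $t$ gives $\tilde{\mathbb{E}}[pq] \le \sqrt{\tilde{\mathbb{E}}[p^2]}\,\sqrt{\tilde{\mathbb{E}}[q^2]}$.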

Other versions of the Cauchy-Schwarz inequality can be shown to hold for pseudoexpectations as well; see e.g. [BBH12] for details.

2 Algorithms for high-dimensional estimation

In this section, we prove an algorithmic meta-theorem for high-dimensional estimation that provides a unified perspective on the best known algorithms for a wide range of estimation problems. This unifying perspective allows us to obtain algorithms with significantly better guarantees than what is known to be possible with other methods. We illustrate the power of this meta-theorem by applying it to matrix and tensor completion, tensor decomposition, and clustering.

2.1 Algorithmic meta-theorem for estimation

We consider the following general class of estimation problems, which will turn out to capture a plethora of interesting problems in a useful way: in this class, an estimation problem is specified by a set of pairs, where the first component is called the parameter and the second is called the measurement. (In contrast to the discussion of estimation problems in Section 1, for every parameter we have a set of possible measurements, as opposed to a distribution over measurements. We can model distributions over measurements in this way by considering a set of “typical measurements”. The viewpoint in terms of sets of possible measurements corresponds more closely to the kind of algorithms we consider.) Nature chooses a pair, we are given the measurement, and our goal is to (approximately) recover the parameter.

For example, we can encode compressed sensing with measurement matrix and sparsity bound by the following set of pairs,

Similarly, we can encode matrix completion with observed entries and rank bound by the set of pairs,

For both examples, the measurement was a simple (linear) function of the parameter. This is not always the case; consider for example the following clustering problem. There are two distinct centers , and we observe samples such that each sample is closer to either or . Then we can encode the problem of finding and as follows,
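As a rough illustration (with notation chosen here: $A$ the measurement matrix, $k$ the sparsity bound, $\Omega$ the set of observed entries, $r$ the rank bound, and $\mu_1, \mu_2$ the cluster centers), these sets of pairs take roughly the form

$$\{(x, Ax) : \|x\|_0 \le k\}, \qquad \{(X, X_\Omega) : \operatorname{rank}(X) \le r\}, \qquad \big\{\big((\mu_1,\mu_2), (y_1,\dots,y_m)\big) : \text{each } y_i \text{ is close to } \mu_1 \text{ or to } \mu_2\big\},$$

where $X_\Omega$ denotes the restriction of $X$ to the entries in $\Omega$.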

Identifiability.

In general, an estimation problem may be ill-posed in the sense that, even ignoring computational efficiency, it may not be possible to (approximately) recover the parameter from a measurement, because the same measurement may arise from two far-apart parameters.

For a pair , we say that identifies exactly if for all . Similarly, we say that identifies up to error if for all . We say that is identifiable (up to error ) if every satisfies that identifies (up to error ).

For example, for compressed sensing , it is not difficult to see that every -sparse vector is identifiable if every subset of at most columns of is linearly independent. For tensor decomposition, a sufficient condition under which the observation is enough to identify (up to a permutation of its columns) is that the columns of are linearly independent.

From identifiability proofs to efficient algorithms.

By itself, identifiability typically only implies that there exists an inefficient algorithm to recover a vector close to the parameter from the observation (e.g. by brute-force search over the set of all ). But perhaps surprisingly, the notion of identifiability in a broader sense can also help us understand if there exists an efficient algorithm for this task. Concretely, if the proof of identifiability is captured by the sum-of-squares proof system at low degree, then there exists an efficient algorithm to (approximately) recover from .

In order to formalize this phenomenon, let the set be described by polynomial equations, where the description involves a vector-valued polynomial together with auxiliary variables. (We allow auxiliary variables here because they might make it easier to describe the set. The algorithms we consider depend on the algebraic description we choose, and different descriptions can lead to different algorithmic guarantees. In general, it is not clear which description is best; typically, however, the more auxiliary variables the better.) In other words, the set is a projection of the variety given by these polynomials. The following theorem shows that there is an efficient algorithm to (approximately) recover the parameter given the measurement if there exists a low-degree proof of the fact that the polynomial equations imply that any solution is (close to) the true parameter.

Theorem 2.1 (Meta-theorem for efficient estimation).

Let be a vector-valued polynomial and let the triples satisfy . Suppose , where . Then, every degree- pseudo-distribution consistent with the constraints satisfies

Furthermore, for every , there exists a polynomial-time algorithm (with running time ) that, given a vector-valued polynomial and a vector , outputs a vector with the following guarantee: if with a proof of bit-complexity at most , then . (In order to be able to state running times in a simple way, we assume that the total bit-complexity of and of the vector-valued polynomial, in the monomial basis, is bounded by a fixed polynomial in .)

Despite not being explicitly stated, the above theorem is the basis for many recent advances in algorithms for estimation problems through the sum-of-squares method [BKS15, BKS14, HSS15, MSS16, BM16, PS17, KSS18, HL18].

Proof.

Let be a degree- pseudo-distribution with . Since degree- sum-of-squares proofs are sound for degree- pseudo-distributions, we have . In particular, . By Cauchy–Schwarz for pseudo-distributions (Claim 1.5), every vector satisfies . By choosing appropriately, we obtain the desired conclusion about .

Given a measurement , the algorithm computes a degree- pseudo-distribution that satisfies up to error and outputs . We are guaranteed that such a pseudo-distribution exists, e.g. the distribution that places all its probability mass on the vector . If the proof has bit-complexity , it follows that satisfies up to error . In particular, . By the same argument as before, it follows that . ∎

2.2 Matrix and tensor completion

In matrix completion, we observe a few entries of a low-rank matrix and the goal is to fill in the missing entries. This problem is studied extensively from both practical and theoretical perspectives. One of its practical applications is in recommender systems, which was the basis of the famous Netflix Prize competition. Here, we may observe a few movie ratings for each user, and the goal is to infer a user’s preferences for movies that the user hasn’t rated yet.

In terms of provable guarantees, the best known polynomial-time algorithm for matrix completion is based on a semidefinite programming relaxation. Let be a rank- matrix such that its left and right singular vectors are -incoherent, i.e., they satisfy and for all and . (Random unit vectors satisfy this notion of -incoherence; in this sense, incoherent vectors behave similarly to random vectors.) The algorithm observes the partial matrix that contains a random cardinality subset of the entries of . If , then with high probability over the choice of the algorithm recovers exactly [CR09, Gro11, Rec11, Che15]. This bound on is nearly optimal, in that appears to be necessary because an -by- rank- matrix has degrees of freedom (the entries of its singular vectors).
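The convex relaxation analyzed in these references can be phrased as nuclear-norm minimization; here is a minimal sketch, assuming numpy and cvxpy (the dimension, rank, and sampling rate are illustrative, and this is an illustration of the relaxation rather than a verbatim implementation from the cited works).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, r = 30, 2                                    # illustrative dimension and rank

# Random rank-r matrix and a random set of observed entries.
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
M = U @ V.T
mask = (rng.random((n, n)) < 0.5).astype(float)  # observe roughly half the entries

# Nuclear-norm minimization subject to agreeing with the observed entries.
X = cp.Variable((n, n))
objective = cp.Minimize(cp.normNuc(X))
constraints = [cp.multiply(mask, X) == cp.multiply(mask, M)]
cp.Problem(objective, constraints).solve()

# For incoherent low-rank matrices with enough observations, the optimum
# typically coincides with M up to solver accuracy.
print("relative recovery error:",
      np.linalg.norm(X.value - M) / np.linalg.norm(M))
```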

In this section, we will show how the above algorithm is captured by sum-of-squares and, in particular, Theorem 2.1. We remark that this fact follows directly by inspecting the analysis of the original algorithm [CR09, Gro11, Rec11, Che15]. The advantage of sum-of-squares here is two-fold: First, it provides a unified perspective on algorithms for matrix completion and other estimation problems. Second, the sum-of-squares approach for matrix completion extends in a natural way to tensor completion (in a way that the original approach for matrix completion does not).

Identifiability proof for matrix completion.

For the sake of clarity, we consider a simplified setup where the matrix is assumed to be a rank- projector, so that it can be written in terms of -incoherent orthonormal vectors . The following theorem shows that, with high probability over the choice of the revealed entries, the matrix is identified by the partial matrix . Furthermore, the proof of this fact is captured by sum-of-squares. Together with Theorem 2.1, the following theorem implies that there exists a polynomial-time algorithm to recover the matrix from the partial matrix.

Theorem 2.2 (implicit in [CR09, Gro11, Rec11, Che15]).

Let be an -dimensional projector and orthonormal with incoherence . Let be a random symmetric subset of size . Consider the system of polynomial equations in the -by- matrix variable ,

Suppose . Then, with high probability over the choice of ,

Proof.

The analyses of the aforementioned algorithms for matrix completion [CR09, Gro11, Rec11, Che15] show the following: let be the complement of in . Then if satisfies our incoherence assumptions, with high probability over the choice of , there exists a symmetric matrix with and . (Current proofs of the existence of this matrix proceed by an ingenious iterative construction, alternately projecting onto two affine subspaces; the analysis of this construction is based on matrix concentration bounds. We refer to the prior literature for details [Gro11, Rec11, Che15].) As we will see, the existence of this matrix also implies that the above proof of identifiability exists.

Since and , we have

Since and contains the equation , we have . At the same time, we have

where the first step uses and the second step uses because and contains the equation . Combining the lower and upper bound on , we obtain

Together with the facts and , we obtain as desired. ∎

Identifiability proof for tensor completion.

Tensor completion is the analog of matrix completion for tensors. We observe a few of the entries of an unknown low-rank tensor and the goal is to fill in the missing entries. In terms of provable guarantees, the best known polynomial-time algorithms are based on sum-of-squares, both for exact recovery [PS17] (of tensors with orthogonal low-rank decompositions) and approximate recovery [BM16] (of tensors with general low-rank decompositions).

Unlike for matrix completion, there appears to be a big gap between the number of observed entries required by efficient and inefficient algorithms. For 3-tensors, all known efficient algorithms require observed entries (ignoring the dependence on incoherence) whereas information-theoretically observed entries are enough. The gap for higher-order tensors becomes even larger. It is an interesting open question to close this gap or give formal evidence that the gap is inherent.

As for matrix completion, we consider the simplified setup that the unknown tensor has the form for incoherent, orthonormal vectors . The following theorem shows that with high probability, is identifiable from random entries of and this fact has a low-degree sum-of-squares proof.

Theorem 2.3 ([PS17]).

Let be orthonormal vectors with incoherence and let be their 3-tensor. Let be a random symmetric subset of size . Consider the system of polynomial equations in the -by- matrix variable with columns ,

Suppose . Then, with high probability over the choice of ,

Proof.

Let be the matrix with columns . Analogous to the proof for matrix completion, the heart of the proof is the existence of a 3-tensor that satisfies the following properties: , , and

(2.1)

These properties imply that are the unique global maximizers of the cubic polynomial over the unit sphere. (We remark that for matrix completion, the spectral properties of the matrix imply that the unique global optimizers of the quadratic polynomial are the unit vectors in the span of .)

The proof that this tensor exists follows the same approach as the proof of existence of the matrix for matrix completion in Theorem 2.2 and proceeds by an iterative construction [Rec11, Gro11]. The main difference is due to the fact that for matrix completion we only need to ensure spectral properties, whereas here we need to ensure the existence of the (higher-degree) sum-of-squares proofs of Eq. 2.1. We refer to previous literature for details of the proof that such a tensor exists with high probability over the choice of [PS17].

Similar to the proof for matrix completion, we have by the properties of that and . By Eq. 2.1 and linearity,

Because includes the equations and because the final term is a sum of squares, we conclude that for all and for all with . We also have the following claim:

Claim 2.4.

When are orthogonal and and , then

We give the (easy) proof of Claim 2.4 in Appendix A. Thus, from the orthonormality of the ,

Together with the facts and , we obtain as desired. ∎

2.3 Overcomplete tensor decomposition

Tensor decomposition refers to the following general class of estimation problems: Given (a noisy version of) a -tensor of the form , the goal is to (approximately) recover one, most, or all of the component vectors . It turns out that under mild conditions on the components , the noise, and the tensor order , this estimation task is possible information theoretically. For example, generic components with are identified by their 3-tensor [CO12] (up to a permutation of the components). Our concern will be what conditions on the components, the noise, and the tensor order allow us to efficiently recover the components.

Besides being significant in its own right, tensor decomposition is a surprisingly versatile and useful primitive to solve other estimation problems. Concrete examples of problems that can be reduced to tensor decomposition are latent Dirichlet allocation models, mixtures of Gaussians, independent component analysis, noisy-or Bayes nets, and phylogenetic tree reconstruction [LCC07, MR05, AFH12, HK13, BCMV14, BKS15, MSS16, AGMR17]. Through these reductions, better algorithms for tensor decomposition can lead to better algorithms for a large number of other estimation problems.

Toward better understanding the capabilities of efficient algorithms for tensor decomposition, we focus in this section on the following more concrete version of the problem.

Problem 2.5 (Tensor decomposition, single component recovery, constant error).

Given an order- tensor with component vectors , find a vector that is close to one of the component vectors in the sense that . (This notion of closeness ignores the sign of the components. If the tensor order is odd, the sign can often be recovered as part of some postprocessing; if the tensor order is even, the sign of the components is not identified.)

Algorithms for Problem 2.5 can often be used to solve a priori more difficult versions of the tensor decomposition problem that ask to recover most or all of the components, or that require the error to be arbitrarily small.

A classical spectral algorithm attributed to Jennrich [Har70, LRA93] can solve Problem 2.5 for up to generic components if the tensor order is at least . (Concretely, the algorithm works for 3-tensors with linearly independent components.) Essentially the same algorithm works for up to generic components (here, the vectors are assumed to be linearly independent) if the tensor order is at least . A more sophisticated algorithm [LCC07] solves Problem 2.5 for up to generic components (concretely, the vectors are assumed to be linearly independent) if the tensor order is at least . However, these algorithms and their analyses break down if the tensor order is only 3 and the number of components is , even if the components are random vectors.
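Because Jennrich's algorithm is short, here is a minimal sketch, assuming numpy (noiseless setting with linearly independent components; all names are chosen here for illustration), of the simultaneous-diagonalization idea: contract the 3-tensor along two random directions and read the components off an eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 8, 4                                   # dimension and number of components

# Ground-truth components (linearly independent w.h.p.) and their 3-tensor.
A = rng.standard_normal((n, r))
T = np.einsum('ia,ja,ka->ijk', A, A, A)

def jennrich(T, r):
    """Recover the components of T = sum_a a_a ⊗ a_a ⊗ a_a (noiseless case)."""
    n = T.shape[0]
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    Tx = np.einsum('ijk,k->ij', T, x)        # = A diag(A^T x) A^T
    Ty = np.einsum('ijk,k->ij', T, y)        # = A diag(A^T y) A^T
    # The columns of A are eigenvectors of Tx @ pinv(Ty), with generically
    # distinct nonzero eigenvalues <a, x>/<a, y>.
    vals, vecs = np.linalg.eig(Tx @ np.linalg.pinv(Ty))
    order = np.argsort(-np.abs(vals))
    return np.real(vecs[:, order[:r]])

A_hat = jennrich(T, r)
# Check recovery up to scaling/sign: correlation of each component with its best match.
corr = np.abs((A / np.linalg.norm(A, axis=0)).T @ (A_hat / np.linalg.norm(A_hat, axis=0)))
print(np.round(corr.max(axis=1), 3))          # should all be close to 1
```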

In this and the subsequent section, we will discuss a polynomial-time algorithm based on sum-of-squares that goes beyond these limitations of previous approaches.

Theorem 2.6 ([MSS16] building on [BKS15, GM15, HSSS16]).

There exists a polynomial-time algorithm to solve Problem 2.5 for tensor order 3 and