# The Kikuchi Hierarchy and Tensor PCA

For the tensor PCA (principal component analysis) problem, we propose a new hierarchy of algorithms that are increasingly powerful yet require increasing runtime. Our hierarchy is analogous to the sum-of-squares (SOS) hierarchy but is instead inspired by statistical physics and related algorithms such as belief propagation and AMP (approximate message passing). Our level-ℓ algorithm can be thought of as a (linearized) message-passing algorithm that keeps track of ℓ-wise dependencies among the hidden variables. Specifically, our algorithms are spectral methods based on the Kikuchi Hessian matrix, which generalizes the well-studied Bethe Hessian matrix to the higher-order Kikuchi free energies. It is known that AMP, the flagship algorithm of statistical physics, has substantially worse performance than SOS for tensor PCA. In this work we `redeem' the statistical physics approach by showing that our hierarchy gives a polynomial-time algorithm matching the performance of SOS. Our hierarchy also yields a continuum of subexponential-time algorithms, and we prove that these achieve the same (conjecturally optimal) tradeoff between runtime and statistical power as SOS. Our results hold for even-order tensors, and we conjecture that they also hold for odd-order tensors. Our methods suggest a new avenue for systematically obtaining optimal algorithms for Bayesian inference problems, and our results constitute a step toward unifying the statistical physics and sum-of-squares approaches to algorithm design.

## Authors

• 23 publications
• 13 publications
• 20 publications
• ### Decomposing Overcomplete 3rd Order Tensors using Sum-of-Squares Algorithms

Tensor rank and low-rank tensor decompositions have many applications in...
04/21/2015 ∙ by Rong Ge, et al. ∙ 0

• ### A unifying tutorial on Approximate Message Passing

Over the last decade or so, Approximate Message Passing (AMP) algorithms...
05/05/2021 ∙ by Oliver Y. Feng, et al. ∙ 8

• ### The power of sum-of-squares for detecting hidden structures

We study planted problems---finding hidden structures in random noisy in...
10/13/2017 ∙ by Samuel B. Hopkins, et al. ∙ 0

• ### Fourier PCA and Robust Tensor Decomposition

Fourier PCA is Principal Component Analysis of a matrix obtained from hi...
06/25/2013 ∙ by Navin Goyal, et al. ∙ 0

• ### A statistical model for tensor PCA

We consider the Principal Component Analysis problem for large tensors o...
11/04/2014 ∙ by Andrea Montanari, et al. ∙ 0

• ### Noisy Tensor Completion via the Sum-of-Squares Hierarchy

In the noisy tensor completion problem we observe m entries (whose locat...
01/26/2015 ∙ by Boaz Barak, et al. ∙ 0

• ### The DLR Hierarchy of Approximate Inference

We propose a hierarchy for approximate inference based on the Dobrushin,...
07/04/2012 ∙ by Michal Rosen-Zvi, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

High-dimensional Bayesian inference problems are widely studied, including planted clique [Jer92, AKS98], sparse PCA [JL04], and community detection [DKMZ11b, DKMZ11a], just to name a few. For these types of problems, two general strategies, or meta-algorithms, have emerged. The first is rooted in statistical physics and includes the belief propagation (BP) algorithm [Pea86, YFW03] along with variants such as approximate message passing (AMP) [DMM09], and related spectral methods such as linearized BP [KMM13, BLM15], and the Bethe Hessian [SKZ14]. The second meta-algorithm is the sum-of-squares (SOS) hierarchy [Sho87, Par00, Las01] (a hierarchy of increasingly powerful semidefinite programming relaxations to polynomial optimization problems) along with spectral methods inspired by it [HSS15, HSSS16]. Both of these meta-algorithms are known to achieve statistically-optimal performance for many problems. Furthermore, when they fail to perform a task, this is often seen as evidence that no polynomial-time algorithm can succeed. Such reasoning takes the form of free energy barriers (in statistical physics) [LKZ15a, LKZ15b] or SOS lower bounds (e.g., [BHK16]). Thus, we generally expect both meta-algorithms to have optimal statistical performance among all efficient algorithms.

A fundamental question is whether we can unify statistical physics and SOS, showing that the two approaches yield, or at least predict, the same performance on a large class of problems. However, one barrier to this comes from the tensor principal component analysis (PCA) problem [RM14], on which the two meta-algorithms seem to have very different performance. For , in the order- tensor PCA (or spiked tensor) problem we observe a -fold tensor

 \bY=λ\bx⊗p∗+\bG

where is a signal-to-noise ratio (SNR) parameter, is a planted signal that we hope to recover, and is a symmetric noise tensor with entries. Information-theoretically, it is possible to recover (in the limit , with fixed) when111Here we are ignoring log factors, so can be understood to mean . [RM14, LML17]. The sum-of-squares hierarchy gives a polynomial-time algorithm to recover provided [HSS15], and there are SOS lower bounds suggesting that no polynomial-time algorithm can do better [HSS15, HKP17]. However, AMP, the flagship algorithm of statistical physics, is suboptimal (if ) and fails to succeed unless [RM14]. Various other ‘local’ algorithms such as the tensor power method, Langevin dynamics, and gradient descent also fail below this ‘local’ threshold [RM14, AGJ18]. This casts serious doubts on the optimality of the statistical physics approach.

In this paper we resolve the above discrepancy and ‘redeem’ the statistical physics approach. The Bethe free energy associated with AMP is merely the first level of a hierarchy of Kikuchi free energies [YFW03, Kik51, Kik94]. From these Kikuchi free energies, we derive a hierarchy of increasingly powerful algorithms for tensor PCA, similar in spirit to generalized belief propagation [YFW03]. Roughly speaking, our level- algorithm can be thought of as an iterative message-passing algorithm that reasons about -wise dependencies among the hidden variables and has time- and space-complexity . Specifically, the level- algorithm is a spectral method on (a submatrix of) the Kikuchi Hessian matrix, which generalizes the Bethe Hessian matrix [SKZ14].

For order- tensor PCA, we show that level of the Kikuchi hierarchy recovers a variant of the tensor unfolding algorithm [RM14, HSS15], which succeeds down to the SOS threshold (provably for even and conjecturally for odd ), closing the gap between SOS and statistical physics. Furthermore, by taking levels for various values of , we obtain a continuum of subexponential-time algorithms that achieve a certain tradeoff between runtime and the signal-to-noise ratio—exactly the same tradeoff curve that SOS is known to achieve222The strongest known SOS results only apply to a variant of the spiked tensor model with Rademacher-valued observations, but we do not expect this difference to be important; see Section 2.6. We prove that our algorithm matches the SOS bound only for even . [BGG16, BGL16]. This motivates a new informal meta-conjecture: for Bayesian inference problems, the SOS hierarchy and the Kikuchi hierarchy achieve the optimal tradeoff between runtime and statistical power.

Our results redeem the statistical physics approach to algorithm design and give hope that the Kikuchi hierarchy provides a systematic way to derive optimal algorithms for a large class of Bayesian inference problems. We see this as a step toward unifying the statistical physics and SOS approaches.

## 2 Preliminaries and Prior Work

### 2.1 Notation

Our asymptotic notation (e.g., ) pertains to the limit (large dimension) and may hide constants depending on (tensor order), which we think of as fixed. We say an event occurs with high probability

if it occurs with probability

.

A tensor is an ( times) multi-array with entries denoted by , where . We call the order of and the dimension of

. For a vector

, the rank-1 tensor is defined by . A tensor is symmetric if for any permutation . For a symmetric tensor, if , we will often write .

### 2.2 The Spiked Tensor Model

A general formulation of the spiked tensor model is as follows. For an integer , let be an asymmetric tensor with entries i.i.d. . We then symmetrize this tensor and obtain a symmetric tensor as follows

 \bG:=1√p!∑π∈Sp˜\bGπ,

where is the symmetric group of permutations of , and . Note that if are distinct then .

Now we let (the spike prior

) be a (Borel) probability distribution over

, and draw a ‘signal’ vector (or spike) from . Then we let be the tensor

 \bY=λ\bx⊗p∗+\bG. (1)

We will often focus on the Rademacher-spiked tensor model where is the -product of the symmetric Rademacher measure. (In this case the entries of are drawn i.i.d. uniformly from .) We will sometimes state results without specifying the prior , in which case the result holds for any prior (normalized so that ). Finally, let denote the law of the tensor . We will consider the limit with held fixed. The parameter may depend on .

Our algorithms will depend only on the entries where the indices are distinct: that is, on the collection

 {\bYE=λ\bxE∗+\bGE:E⊆[n],|E|=p},

where for a vector we write .

Perhaps one of the simplest statistical tasks is binary hypothesis testing, which in our case amounts to, upon receiving a tensor with the promise that it was sampled from with , telling whether or . We refer to for as the planted distribution, and as the null distribution.

We say that an algorithm (or test) achieves strong detection between and if

 limn→∞\PPλ(f(\bY)=1)=1andlimn→∞\PP0(f(\bY)=0)=1.

Additionally we say that achieves weak detection between and

if the sum of Type-I and Type-II errors remains strictly below 1:

 limsupn→∞{\PP0(f(\bY)=1)+\PPλ(f(\bY)=0)}<1.

An additional goal is to recover the planted vector . Note that when is even, it is impossible to resolve the sign of the signal (i.e., and are equally likely). Thus, our goal is to recover up to sign.

The normalized correlation between vectors is

 \corr(^\bx,\bx)=|⟨^\bx,\bx⟩|∥^\bx∥∥\bx∥.

achieves weak recovery if is lower-bounded by a strictly positive constant—and we write —with high probability, and achieves strong recovery if with high probability.

We expect that strong detection and weak recovery are generally equally difficult, although formal implications are not known in either direction. We will see in Section 2.5 that in some regimes, weak recovery and strong recovery are equivalent.

##### The Matrix Case.

When , the spiked tensor model reduces to the spiked Wigner model

. We know from random matrix theory that when

with

, strong detection is possible by thresholding the maximum eigenvalue of

, and weak recovery is achieved by the leading eigenvector

[FP07]. For many spike priors (including Rademacher), strong detection and weak recovery are statistically impossible when  [MRZ15, DAM15, PWBM18] (although weak detection is still possible below  [EKJ18]

), so there is a sharp phase transition at

. A more sophisticated algorithm is AMP (approximate message passing) [DMM09, FR18, LKZ15a, DAM15], which can be thought of as a modification of the matrix power method which uses certain nonlinear transformations to exploit the structure of the spike prior. For many spike priors (including Rademacher), AMP is known to achieve the information-theoretically optimal correlation with the true spike [DAM15, DMK16]. However, like PCA (top eigenvector), AMP achieves zero correlation (asymptotically) when . For certain spike priors (e.g., sparse priors), statistical-to-computational gaps can appear whereby it is statistically possible to succeed for some but we do not expect that polynomial-time algorithms can do so [LKZ15a, LKZ15b, BMV18].

##### The Tensor Case.

The tensor case was introduced by [RM14], who also proposed various algorithms. Information-theoretically, there is a sharp phase transition similar to the matrix case: for many spike priors (including Rademacher), if then weak recovery is possible when and impossible when , for a particular constant depending on and  [LML17]. Strong detection undergoes the same transition, being possible if and impossible otherwise [Che17, CHL18]. In fact, it is shown in these works that even weak detection is impossible below , in sharp contrast with the matrix case. There are polynomial-time algorithms (e.g., SOS) that succeed at both strong detection and strong recovery when (for any spike prior) [RM14, HSS15, HSSS16], which is above the information-theoretic threshold by a factor that diverges with . There are also SOS lower bounds suggesting that (for many priors) no polynomial-time algorithm can succeed at strong detection or weak recovery when [HSS15, HKP17]. Thus, unlike the matrix case, we expect a large statistical-to-computational gap, i.e., a “possible-but-hard” regime when .

### 2.3 The Tensor Power Method and Local Algorithms

Various algorithms have been proposed and analyzed for tensor PCA [RM14, HSS15, HSSS16, ADGM16, AGJ18]. We will present two such algorithms that are simple, representative, and will be relevant to the discussion. The first is the tensor power method [RM14].

(Tensor Power Method) For a vector and a tensor , let denote the vector

 \bY{\bu}i=∑j1,…,jp−1\bYi,j1,…,jp−1\buj1⋯\bujp−1.

The tensor power method begins with an initial guess (e.g., chosen at random) and repeatedly iterates the update rule until converges.

The tensor power method appears to only succeed when [RM14], which is worse (by a factor that diverges with ) than the SOS threshold (). The AMP algorithm of [RM14] is a more sophisticated variant of the tensor power method. Like the power method, AMP fails unless [RM14]. This discrepancy between AMP and SOS is what motivated the current work. Two other related algorithms, gradient descent and Langevin dynamics, also fail unless [AGJ18]. Following [AGJ18], we refer to all of these algorithms (tensor power method, AMP, gradient descent, Langevin dynamics) as local algorithms, and we refer to the corresponding threshold as the local threshold. Here ‘local’ is not a precise notion, but roughly speaking, local algorithms keep track of a current guess and iteratively update it to a nearby vector that is favorable (e.g., in terms of log-likelihood).

### 2.4 The Tensor Unfolding Algorithm

We have seen that local algorithms do not seem able to reach the SOS threshold. Let us now describe one of the simplest algorithms that does reach this threshold: tensor unfolding. Tensor unfolding was first proposed by [RM14], where it was shown to succeed when and conjectured to succeed when (the SOS threshold). For the case , the same algorithm was later reinterpreted as a spectral relaxation of SOS, and proven to succeed333The analysis of [HSS15] applies to a close variant of the spiked tensor model in which the noise tensor is asymmetric. We do not expect this difference to be important. when [HSS15], confirming the conjecture of [RM14]. We now present the tensor unfolding method, restricting to the case for simplicity. There is a natural extension to all [RM14], and (a close variant of) this algorithm will in fact appear as level in our hierarchy of algorithms (see Section 3 and Appendix C).

(Tensor Unfolding) Given an order-3 tensor , flatten it to an matrix , i.e., let . Compute the leading eigenvector of .

If we use the matrix power method to compute the leading eigenvector, we can restate the tensor unfolding method as an iterative algorithm: keep track of state vectors and , initialize randomly, and alternate between applying the update steps and . We will see later (see Section 4.1) that this can be interpreted as a message-passing algorithm between singleton indices, represented by , and pairs of indices, represented by . Thus, tensor unfolding is not ‘local’ in the sense of Section 2.3 because it keeps a state of size (keeping track of pairwise information) instead of size . We can, however, think of it as local on a ‘lifted’ space, and this allows it to surpass the local threshold.

Other methods have also been shown to achieve the SOS threshold , including SOS itself and various spectral methods inspired by it [HSS15, HSSS16].

### 2.5 Boosting and Linearization

One fundamental difference between the matrix case () and tensor case () is the following boosting property. The following result, implicit in [RM14], shows that for , if is substantially above the information-theoretic threshold (i.e., ) then weak recovery can be boosted to strong recovery via a single power iteration. We give a proof in Appendix D. Let with any spike prior (supported on ). Suppose we have an initial guess satisfying . Obtain from via a single iteration of the tensor power method: . There exists a constant such that with high probability,

 \corr(ˆ\bx,\bx∗)≥1−cλ−1τ1−pn(1−p)/2.

In particular, if is any constant and then .

For , since we do not expect polynomial-time algorithms to succeed when , this implies an “all-or-nothing” phenomenon: for a given , the optimal polynomial-time algorithm will either achieve correlation that is asymptotically 0 or asymptotically 1. This is in stark contrast to the matrix case where, for , the optimal correlation is a constant (in ) depending on both and the spike prior .

In light of this boosting result, we are primarily concerned with achieving the optimal threshold for weak recovery (instead of the optimal correlation), which suggests we can use a simple ‘linearized’ spectral method instead of a non-linear AMP-type algorithm. To illustrate this in the matrix case (), one needs to use AMP in order to achieve optimal correlation, but one can achieve the optimal threshold using linearized AMP, which boils down to simply computing the top eigenvector. Similarly, in the related setting of community detection in the stochastic block model, one needs to use belief propagation to achieve optimal correlation [DKMZ11b, DKMZ11a, MNS14], but one can achieve the optimal threshold using a linearized version of belief propagation, which is a spectral method on the non-backtracking walk matrix [KMM13, BLM15] or the related Bethe Hessian [SKZ14]. Our spectral methods for tensor PCA are based on the Kikuchi Hessian, which is a generalization of the Bethe Hessian.

### 2.6 Subexponential-time Algorithms for Tensor PCA

The degree- sum-of-squares algorithm is a large semidefinite program that requires runtime to solve. Oftentimes the regime of interest is when is constant, so that the algorithm runs in polynomial time. However, one can also explore the power of subexponential-time algorithms by letting for , which gives an algorithm with runtime roughly . Results of this type are known for tensor PCA [BGG16, BGL16, RRS17]. The strongest such results are for a different variant of tensor PCA, which we now define.

In the order- discrete spiked tensor model with spike prior (supported on ) and SNR parameter , we draw a spike and then for each , we independently observe a

-valued random variable

with444Note that for, e.g., the Rademacher spike prior with . . This model differs from our usual one in that the observations are conditionally Rademacher instead of Gaussian, but we do not believe this makes an important difference. However, for technical reasons, the known SOS results are strongest in this discrete setting.

[[BGL16, HSS15]] For any , there is an algorithm with runtime that achieves strong detection and strong recovery in the order- discrete spiked tensor model (with any spike prior) whenever

 λ≥ℓ−(p−2)/4n−p/4polylog(n).

The work of [BGL16] shows how to certify an upper bound on the injective norm of a random -valued tensor, which immediately implies the algorithm for strong detection. When combined with [HSS15], this can also be made into an algorithm for strong recovery (see Lemma 4.4 of [HSS15]). Similar (but weaker) SOS results are also known for the standard spiked tensor model (see [RRS17] and arXiv version 1 of [BGL16]), and we expect that Theorem 2.6 also holds for this case.

When for , Theorem 2.6 implies that we have an algorithm with runtime that succeeds when

. Note that this interpolates smoothly between the polynomial-time threshold (

) when , and the information-theoretic threshold () when . We will prove (for even) that our algorithms achieve this same tradeoff, and we expect this tradeoff to be optimal.

## 3 Main results

In this section we present our main results about detection and recovery in the spiked -tensor model. We propose a hierarchy of spectral methods, which are directly derived from the hierarchy of Kikuchi free energies

. Specifically, the symmetric difference matrix (defined below) appears (approximately) as a submatrix of the Hessian of the Kikuchi free energy. The full details of this heuristic derivation are given later (see Section

4 and Appendix E), and for now we simply state the algorithms and results.

We will restrict our attention to the Rademacher-spiked tensor model, which is the setting in which we derived our algorithms. However, we show in Appendix B that the same algorithm works for a large class of priors (at least for strong detection).

We will also restrict to the case where the tensor order is even. In the odd- case we give an algorithm but are only able to prove sub-optimal results, and we defer the details to Appendix C. Our approach (for even) requires the introduction of two matrices:

##### The symmetric difference matrix of order ℓ.

Let be even and fix an integer , and consider the symmetric matrix indexed by sets of size , having entries

 \bMS,T={\bYS\symdTif |S\symdT|=p,0otherwise. (2)

Here, denotes the symmetric difference between sets. The leading eigenvector of is intended to be an estimate of where . The following voting matrix is a natural rounding scheme to extract an estimate of from such a vector.

##### The voting matrix.

To a vector we associate the following symmetric ‘voting’ matrix having entries

 \bVii(\bv)=0∀i∈[n],and\bVij(\bv)=12∑S,T∈([n]ℓ)\bvS\bvT\boneS\symdT={i,j}∀i≠j. (3)

Now we are in position to formulate our algorithms for detection and recovery. Let us define the important quantity

 dℓ\defeq(n−ℓp/2)(ℓp/2). (4)

This is the number of sets of size such that for a given set of size .

[Detection for even ]

1. Compute the top eigenvalue of the symmetric difference matrix .

2. Reject the null hypothesis

(i.e., return ‘1’) if .

[Recovery for even ]

1. Compute a (unit-norm) leading eigenvector555We define a leading eigenvector to be an eigenvector whose eigenvalue is maximal (although our results also hold for an eigenvector whose eigenvalue is maximal in absolute value). of .

2. Form the associated voting matrix .

3. Compute a leading eigenvector of , and output .

The next two theorems characterize the performance of Algorithms 3 and 3 for the strong detection and recovery tasks, respectively. The proofs can be found in Appendix A. Consider the Rademacher-spiked tensor model with even. For all and , we have

 \PP0(λmax(\bM)≥λdℓ2)∨\PPλ(λmax(\bM)≤λdℓ2)≤2nℓe−λ2dℓ/8.

Therefore, Algorithm 3 achieves strong detection between and if as .

Consider the Rademacher-spiked tensor model with even (and ). Let be the output of Algorithm 3. There exists an absolute constant such that for all and , if and , then with probability at least .

If , we have , and so the above theorems imply that strong detection and strong recovery are possible as soon as . Comparing with Theorem 2.6, this scaling coincides with the guarantees achieved by the level- SOS algorithm of [BGL16], up to a possible discrepancy in logarithmic factors.

Due to the particularly simple structure of the symmetric difference matrix (in particular, the fact that its entries are simply entries of ), the proof of detection (Theorem 3) follows from a straightforward application of the matrix Chernoff bound (see Appendix A.2). In contrast, the corresponding SOS results [BGG16, BGL16, RRS17] work with more complicated matrices involving high powers of the entries of , and the analysis is much more involved.

Our proof of recovery is unusual in that the signal component of , call it , is not rank-one; it even has a vanishing spectral gap when . Thus, the leading eigenvector of does not correlate well with the leading eigenvector of . While this may seem to render recovery hopeless at first glance, this is not the case, due to the fact that many

eigenvectors (actually, eigenspaces) of

contain non-trivial information about the spike , as opposed to only the top one. We prove this by exploiting the special structure of through the Johnson scheme, and using tools from Fourier analysis on a slice of the hypercube, in particular a Poincaré-type inequality by [Fil16].

##### Removing the logarithmic factor

Both Theorem 3 and Theorem 3 involve a logarithmic factor in in the lower bound on SNR . These log factors are an artifact of the matrix Chernoff bound, and we believe they can be removed. This suggests the following precise conjecture on the power of polynomial-time algorithms. Fix even and let be constant (not depending on ). There exists a constant with as (with fixed) such that if then Algorithm 3 and Algorithm 3 (which run in time ) achieve strong detection and strong recovery, respectively. Specifically, we expect for large . It is conceivable that strong recovery requires the additional boosting step of Proposition 2.5.

## 4 Motivating the Symmetric Difference Matrices

In this section we motivate the symmetric difference matrices used in our algorithms. In Section 4.1 we give some high-level intuition, including an explanation of how our algorithms can be thought of as iterative message-passing procedures among subsets of size . In Section 4.2 we give a more principled derivation based on the Kikuchi Hessian, with many of the calculations deferred to Appendix E.

### 4.1 Intuition: Higher-Order Message-Passing and Maximum Likelihood

As stated previously, our algorithms will choose to ignore the entries for which are not distinct; these entries turn out to be unimportant asymptotically. We restrict to the Rademacher-spiked tensor model, as this yields a clean and simple derivation. The posterior distribution for the spike given the observed tensor is

 \PP(\bx∣\bY) ∝exp{−12∑i1<⋯

over the domain . We take to be even; a similar derivation works for odd . Now fix . We can write the above as

 \PP(\bx∣\bY)∝exp{λN∑|S\symdT|=p\bYS\symdT\bxS\bxT}, (6)

where the sum is over ordered pairs

of sets with and , and where is the number of terms with a given symmetric difference .

A natural message-passing algorithm to maximize the log-likelihood is the following. For each of size , keep track of a variable , which is intended to be an estimate of . Note that there are consistency constraints that must obey, such as when ; we will relax the problem and will not require our vector to obey such constraints. Instead we simply attempt to maximize

 1∥\bu∥2∑|S\symdT|=p\bYS\symdT\buS\buT (7)

over all . To do this, we iterate the update equations

 \buS←∑T:|S\symdT|=p\bYS\symdT\buT. (8)

We call and neighbors if . Intuitively, each neighbor of sends a message to , indicating ’s opinion about . We update to be the sum of all incoming messages.

Now note that the sum in (7) is simply where is the symmetric difference matrix, and (8) can be written as

 \bu←\bM\bu.

Thus our natural message-passing scheme is precisely power iteration against , and so we should take the leading eigenvector of as our estimate of (up to a scaling factor). Finally, defining our voting matrix and taking its leading eigenvector is a natural method for rounding to a vector of the form where , thus restoring the consistency constraints we ignored before.

Indeed, if we carry out this procedure on all subsets then this works as intended, and no rounding is necessary: Consider the matrix . It is easy to verify that the eigenvectors of are precisely the Fourier basis vectors on the hypercube, namely vectors of the form where and . Moreover, the eigenvalue associated to is

 12n(\bu\bx)⊤¯¯¯¯¯¯¯¯\bM\bu\bx=12n∑S,T⊆[n]:|S\symdT|=p\bYS\symdT\bxS\bxT=∑|E|=p\bYE\bxE.

This is the expression in the log-likelihood in (5). Thus the leading eigenvector of is exactly where is the maximum-likelihood estimate of .

This procedure succeeds all the way down to the statistical threshold , but takes exponential time. Our contribution can be viewed as showing that even when we restrict to the submatrix of supported on subsets of size , the leading eigenvector still allows us to recover whenever the SNR is sufficiently large. Proving this requires us to perform Fourier analysis over a slice of the hypercube rather than the simpler setting of the entire hypercube, which we do by appealing to the Johnson scheme and some results of [Fil16].

### 4.2 Variational Inference and Kikuchi Free Energy

We now introduce the Kikuchi approximations to the free energy (or simply the Kikuchi free energies) of the above posterior (5) [YFW03, Kik51, Kik94], the principle on which our algorithms are derived. For concreteness we restrict to the case of the Rademacher-spiked tensor model, but the Kikuchi free energies can be defined for general graphical models (see [YFW03]).

The posterior distribution in (5) is a Gibbs distribution with random Hamiltonian , and inverse temperature . We let be the partition function of the model, and denote by its free energy. It is a classical fact that the Gibbs distribution has the following variational characterization. Fix a finite domain (e.g., ), and . Consider the optimization problem

 infμF(μ), (9)

where the supremum is over probability distributions supported on , and define the free energy functional of by

 F(μ)\defeq\Ex\bx∼μ[H(\bx)]−1βS(μ), (10)

where is the Shannon entropy of , i.e., . Then the unique optimizer of (9) is the Gibbs distribution . If we specialize this statement to our setting, and (as one can check) . We refer to [WJ08] for more background.

In light of the above variational characterization, a natural algorithmic strategy to learn the posterior distribution is to minimize the free energy functional over distributions . However, this is a priori intractable because (for a high-dimensional domain such as ) an exponential number of parameters are required to represent . The idea underlying the belief propagation algorithm [Pea86, YFW03] is to work only with locally-consistent marginals, or beliefs, instead of a complete distribution . Standard belief propagation works with beliefs on singleton variables and on pairs of variables. The Bethe free energy is a proxy for the free energy that only depends on these beliefs, and belief propagation is a certain procedure that iteratively updates the beliefs in order to locally minimize the Bethe free energy. The level- Kikuchi free energy is a generalization of the Bethe free energy that depends on -wise beliefs and gives (in principle) increasingly better approximations of as increases. Our algorithms are based on the principle of locally minimizing Kikuchi free energy, which we define next.

We now define the level- Kikuchi approximation to the free energy. We require , i.e., the Kikuchi level needs to be at least as large as the interactions present in the data (although the case could be handled by defining a modified graphical model with auxiliary variables). The Bethe free energy is the case .

For with , let denote the belief on , which is a probability mass function over

representing our belief about the joint distribution of

. Let denote the set of beliefs on -wise interactions for all . Following [YFW03], the Kikuchi free energy is a real-valued functional of having the form (in our case, ) where the first term is the ‘energy’ term

 E(b)=−λ∑|S|=p\bYS∑\bxS∈{±1}SbS(\bxS)\bxS.

(This is a proxy for the term in (10).) And the second term is the ‘entropy’ term

 S=∑0<|S|≤rcSSS,SS(b)=−∑\bxS∈{±1}SbS(\bxS)logbS(\bxS),

where the overcounting numbers are . These are defined so that for any with ,

 ∑T⊇S,|T|≤rcT=1, (11)

which corrects for overcounting. Notice that and each take the form of an ‘expectation’ with respect to the beliefs ; these would be actual expectations were the beliefs the marginals of an actual probability distribution. This situation is to be contrasted with the notion of a pseudo-expectation, which plays a central role in the theory underlying the sum-of-squares algorithm.

Our algorithms are based on the Kikuchi Hessian, a generalization of the Bethe Hessian matrix that was introduced in the setting of community detection [SKZ14]

. The Bethe Hessian is the Hessian of the Bethe free energy with respect to the moments of the beliefs, evaluated at belief propagation’s so-called “uninformative fixed point.” The bottom eigenvector of the Bethe Hessian is a natural estimator for the planted signal because it represents the best direction for local improvement of the Bethe free energy, starting from belief propagation’s uninformative starting point. We generalize this method and compute (a heuristic approximation of) the analogous Kikuchi Hessian matrix. The full derivation is given in Appendix

E. The order- symmetric difference matrix (2) appears as a submatrix of the level- Kikuchi Hessian whenever .

## 5 Conclusion

We have presented a hierarchy of spectral algorithms for tensor PCA, inspired by variational inference and statistical physics. In particular, the core idea of our approach is to locally minimize the Kikuchi free energy. We specifically implemented this via the Kikuchi Hessian, but there may be many other viable approaches to minimizing Kikuchi free energy (such as generalized belief propagation [YFW03]). Broadly speaking, we conjecture that for many average-case problems, algorithms based on Kikuchi free energy and algorithms based on sum-of-squares should achieve essentially the same tradeoff (namely the optimal tradeoff) between runtime and statistical power. One direction for further work is to verify that this analogy holds for problems other than tensor PCA. Perhaps one benefit of the Kikuchi hierarchy over the sum-of-squares hierarchy is that it has allowed us to systematically obtain spectral methods for tensor PCA, simply by computing a certain Hessian matrix. On the other hand, designing spectral methods based on SOS seems to require some creativity on a case-by-case basis. We are hopeful that the Kikuchi hierarchy will provide a roadmap for systematically deriving optimal algorithms for a large class of problems.

## Acknowledgments

We thank Alex Russell for suggesting the matrix Chernoff bound (Theorem A.2). For helpful discussions, we thank Afonso Bandeira, Sam Hopkins, Florent Krzakala, Tselil Schramm, Jonathan Shi, and Lenka Zdeborová. This project started during the workshop Spin Glasses and Related Topics held at the Banff International Research Station (BIRS) in Sept.–Oct., 2018. We thank our hosts at BIRS as well as the workshop organizers: Antonio Auffinger, Wei-Kuo Chen, Dmitry Panchenko, and Lenka Zdeborová.

## References

• [ABČ13] Antonio Auffinger, Gérard Ben Arous, and Jiří Černỳ. Random matrices and complexity of spin glasses. Communications on Pure and Applied Mathematics, 66(2):165–201, 2013.
• [ADGM16] Anima Anandkumar, Yuan Deng, Rong Ge, and Hossein Mobahi. Homotopy analysis for tensor PCA. arXiv preprint arXiv:1610.09322, 2016.
• [AGJ18] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholds for tensor PCA. arXiv preprint arXiv:1808.00921, 2018.
• [AKS98] Noga Alon, Michael Krivelevich, and Benny Sudakov. Finding a large hidden clique in a random graph. Random Structures & Algorithms, 13(3-4):457–466, 1998.
• [BGG16] Vijay VSP Bhattiprolu, Mrinalkanti Ghosh, Venkatesan Guruswami, Euiwoong Lee, and Madhur Tulsiani. Multiplicative approximations for polynomial optimization over the unit sphere. In Electronic Colloquium on Computational Complexity (ECCC), volume 23, page 1, 2016.
• [BGL16] Vijay Bhattiprolu, Venkatesan Guruswami, and Euiwoong Lee. Sum-of-squares certificates for maxima of random tensors on the sphere. arXiv preprint arXiv:1605.00903, 2016.
• [BHK16] Boaz Barak, Samuel B Hopkins, Jonathan Kelner, Pravesh Kothari, Ankur Moitra, and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 428–437. IEEE, 2016.
• [BLM15] Charles Bordenave, Marc Lelarge, and Laurent Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular ramanujan graphs. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1347–1357. IEEE, 2015.
• [BMV18] Jess Banks, Cristopher Moore, Roman Vershynin, Nicolas Verzelen, and Jiaming Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. IEEE Transactions on Information Theory, 64(7):4872–4894, 2018.
• [Bur17] Amanda Burcroff. Johnson schemes and certain matrices with integral eigenvalues. Technical report, University of Michigan, 2017.
• [Che17] Wei-Kuo Chen. Phase transition in the spiked random tensor with rademacher prior. arXiv preprint arXiv:1712.01777, 2017.
• [CHL18] Wei-Kuo Chen, Madeline Handschy, and Gilad Lerman. Phase transition in random tensors with multiple spikes. arXiv preprint arXiv:1809.06790, 2018.
• [DAM15] Yash Deshpande, Emmanuel Abbe, and Andrea Montanari. Asymptotic mutual information for the two-groups stochastic block model. arXiv preprint arXiv:1507.08685, 2015.
• [DKMZ11a] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.
• [DKMZ11b] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Physical Review Letters, 107(6):065701, 2011.
• [DMK16] Mohamad Dia, Nicolas Macris, Florent Krzakala, Thibault Lesieur, and Lenka Zdeborová. Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula. In Advances in Neural Information Processing Systems, pages 424–432, 2016.
• [DMM09] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
• [EKJ18] Ahmed El Alaoui, Florent Krzakala, and Michael I Jordan. Fundamental limits of detection in the spiked wigner model. arXiv preprint arXiv:1806.09588, 2018.
• [Fil16] Yuval Filmus. An orthogonal basis for functions over a slice of the Boolean hypercube. The Electronic Journal of Combinatorics, 23(1):P1–23, 2016.
• [FP07] Delphine Féral and Sandrine Péché. The largest eigenvalue of rank one deformation of large wigner matrices. Communications in mathematical physics, 272(1):185–228, 2007.
• [FR18] Alyson K Fletcher and Sundeep Rangan. Iterative reconstruction of rank-one matrices in noise. Information and Inference: A Journal of the IMA, 7(3):531–562, 2018.
• [GS10] Chris Godsil and Sung Y Song. Association schemes. Handbook of Combinatorial Designs,, pages 325–330, 2010.
• [HKP17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 720–731. IEEE, 2017.
• [HSS15] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Conference on Learning Theory, pages 956–1006, 2015.
• [HSSS16] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing

, pages 178–191. ACM, 2016.
• [Jer92] Mark Jerrum. Large cliques elude the metropolis process. Random Structures & Algorithms, 3(4):347–359, 1992.
• [JL04] Iain M Johnstone and Arthur Yu Lu. Sparse principal components analysis. Unpublished manuscript, 7, 2004.
• [Kik51] Ryoichi Kikuchi. A theory of cooperative phenomena. Phys. Rev., 81:988, 1951.
• [Kik94] Ryoichi Kikuchi. Special issue in honor of R. Kikuchi. Progr. Theor. Phys. Suppl, 115, 1994.
• [KMM13] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013.
• [Las01] Jean B Lasserre. An explicit exact SDP relaxation for nonlinear 0-1 programs. In

International Conference on Integer Programming and Combinatorial Optimization

, pages 293–303. Springer, 2001.
• [LKZ15a] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 680–687. IEEE, 2015.
• [LKZ15b] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Phase transitions in sparse PCA. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 1635–1639. IEEE, 2015.
• [LML17] Thibault Lesieur, Léo Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborová. Statistical and computational phase transitions in spiked tensor estimation. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 511–515. IEEE, 2017.
• [MNS14] Elchanan Mossel, Joe Neeman, and Allan Sly. Belief propagation, robust reconstruction and optimal recovery of block models. In Conference on Learning Theory, pages 356–370, 2014.
• [MRZ15] Andrea Montanari, Daniel Reichman, and Ofer Zeitouni. On the limitation of spectral methods: From the gaussian hidden clique problem to rank-one perturbations of gaussian tensors. In Advances in Neural Information Processing Systems, pages 217–225, 2015.
• [Oli10] Roberto Oliveira. Sums of random Hermitian matrices and an inequality by Rudelson. Electronic Communications in Probability, 15:203–212, 2010.
• [Par00] Pablo A Parrilo. Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. PhD thesis, California Institute of Technology, 2000.
• [Pea86] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial intelligence, 29(3):241–288, 1986.
• [PWBM18] Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra. Optimality and sub-optimality of PCA I: Spiked random matrix models. The Annals of Statistics, 46(5):2416–2451, 2018.
• [RM14] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014.
• [RRS17] Prasad Raghavendra, Satish Rao, and Tselil Schramm. Strongly refuting random CSPs below the spectral threshold. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 121–131. ACM, 2017.
• [Sch79] Alexander Schrijver. A comparison of the Delsarte and Lovász bounds. IEEE Trans. Information Theory, 25(4):425–429, 1979.
• [Sho87] Naum Z Shor. Class of global minimum bounds of polynomial functions. Cybernetics, 23(6):731–734, 1987.
• [SKZ14] Alaa Saade, Florent Krzakala, and Lenka Zdeborová. Spectral clustering of graphs with the Bethe Hessian. In Advances in Neural Information Processing Systems, pages 406–414, 2014.
• [Tro12] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
• [Wat90] William C Waterhouse. The absolute-value estimate for symmetric multilinear forms. Linear Algebra and its Applications, 128:97–105, 1990.
• [WJ08] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference.

Foundations and Trends® in Machine Learning

, 1(1–2):1–305, 2008.
• [YFW03] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Understanding belief propagation and its generalizations. Exploring artificial intelligence in the new millennium, 8:236–239, 2003.

## Appendix A Analysis of Symmetric Difference and Voting Matrices

We adopt the notation for and . Recall the matrix indexed by sets of size , having entries

 \bMS,T=\bYS\symdT\bone|S\symdT|=pwhere\bYS\symdT=λxS\symdT+\bGS\symdT. (12)

First, observe that we can restrict our attention to the case where the spike is the all-ones vector without loss of generality. To see this, conjugate by a diagonal matrix with diagonal entries and obtain where where

. By symmetry of the Gaussian distribution,

are i.i.d.  random variables. Therefore, the two matrices have the same spectrum and the eigenvectors of can be obtained from those of by pre-multiplying by . So from now on we write

 \bM=λ\bX+\bZ, (13)

where and , where is a collection of i.i.d.  r.v.’s.

### a.1 Structure of \bX

The matrix is the adjacency matrix of a regular graph on vertices, where vertices are represented by sets, and two sets and are connected by an edge if , or equivalently . This matrix belongs to the Bose-Mesner algebra of the -Johnson association scheme. This is the algebra of real- or complex-valued symmetric matrices where the entry depends only on the size of the intersection . In addition to this set of matrices being an algebra, it is a commutative algebra, which means that all such matrices are simultaneously diagonalizable and share the same eigenvectors. We refer to [Sch79, GS10] for more background.

Filmus [Fil16] provides a common eigenbasis for this algebra: for , let be a sequence of distinct elements of . Let denote its total length. Now define a vector having coordinates

 uφS=m∏i=1(\boneai∈S−\bonebi∈S),|S|=ℓ.

In the case , is the empty sequence and we have (the all-ones vector).

Each is an eigenvector of . Furthermore, the linear space for is an eigenspace of (i.e., all vectors with sequences of length of have the same eigenvalue ). Lastly , and . (By convention, .)

###### Proof.

The first two statements are the content of Lemma 4.3 in [Fil16]. The dimension of is given in Lemma 2.1 in [Fil16]. ∎

We note that are not linearly independent; an orthogonal basis, called the Young basis, consisting of linear combinations of the ’s is given explicitly in [Fil16].

We see from the above proposition that has distinct eigenvalues , each one corresponding to the eigenspace . The first eigenvalue is the degree of the graph :

 μ0=dℓ=(ℓp/2)(n−ℓp/2). (14)

We provide an explicit formula for all the remaining eigenvalues: The eigenvalues of are as follows:

 μm=min(m,p/2)∑s=0(−1)s(ms)(ℓ−mp/2−s)(n−ℓ−mp/2−s),0≤m≤ℓ. (15)
###### Proof.

These are the so-called Eberlein polynomials, which are polynomials in of degree (see, e.g., [Sch79]). We refer to [Bur17] for formulae in more general contexts, but we give a proof here for completeness. Let and . Note that is nonzero if and only if for each . By symmetry, we can assume that and . Then is the sum over all sets , such that and for each , of where . For each there are choices of this set of indices, giving the first binomial. Adding these to and removing these from contributes to . To achieve , we also need to remove elements of