
# Statistical mechanics of low-rank tensor decomposition

Often, large, high dimensional datasets collected across multiple modalities can be organized as a higher order tensor. Low-rank tensor decomposition then arises as a powerful and widely used tool to discover simple low dimensional structures underlying such data. However, we currently lack a theoretical understanding of the algorithmic behavior of low-rank tensor decompositions. We derive Bayesian approximate message passing (AMP) algorithms for recovering arbitrarily shaped low-rank tensors buried within noise, and we employ dynamic mean field theory to precisely characterize their performance. Our theory reveals the existence of phase transitions between easy, hard and impossible inference regimes, and displays an excellent match with simulations. Moreover, it reveals several qualitative surprises compared to the behavior of symmetric, cubic tensor decomposition. Finally, we compare our AMP algorithm to the most commonly used algorithm, alternating least squares (ALS), and demonstrate that AMP significantly outperforms ALS in the presence of noise.


## 1 Introduction

The ability to take noisy, complex data structures and decompose them into smaller, interpretable components in an unsupervised manner is essential to many fields, from machine learning and signal processing anandkumar2014tensor ; sidiropoulos2017tensor to neuroscience williams2017unsupervised . In datasets that can be organized as an order-2 data matrix, many popular unsupervised structure discovery algorithms, like PCA, ICA, SVD or other spectral methods, can be unified under the rubric of low-rank matrix decomposition. More complex data consisting of measurements across multiple modalities can be organized as higher dimensional data arrays, or higher order tensors. Often, one can find simple structures in such data by approximating the data tensor as a sum of rank-1 tensors. Such decompositions are known by the name of rank decomposition, CANDECOMP/PARAFAC, or CP decomposition (see kolda2009tensor for an extensive review).

The most widely used algorithm to perform rank decomposition is alternating least squares (ALS) carroll1970analysis ; harshman1970foundations , which uses convex optimization techniques on different slices of the tensor. However, a major disadvantage of ALS is that it does not perform well in the presence of highly noisy measurements. Moreover, its theoretical properties are not well understood. Here we derive and analyze an approximate message passing (AMP) algorithm for optimal Bayesian recovery of arbitrarily shaped, high-order low-rank tensors buried in noise. As a result, we obtain an AMP algorithm that both outperforms ALS and admits an analytic theory of its performance limits.

AMP algorithms have a long history dating back to early work on the statistical physics of perceptron learning mezard1989space ; kabashima2004bp (see advani2013statistical for a review). The term AMP was coined by Donoho, Maleki and Montanari in their work on compressed sensing donoho2009message (see also kabashima2009typical ; rangan2009asymptotic ; ganguli2010statistical ; ganguli2010short ; advani2016statistical ; advani2016equivalence for replica approaches to compressed sensing and high dimensional regression). AMP approximates belief propagation in graphical models, and a rigorous analysis of AMP was carried out in bayati2011dynamics . For a rank-one matrix estimation problem, AMP was first introduced and analyzed in rangan2012iterative . This framework has been extended in a beautiful body of work by Krzakala, Zdeborova and collaborators to various low-rank matrix factorization problems in lesieur2015mmse ; lesieur2017constrained ; lesieur2017statistical ; lesieur2015phase . Also, recently, low-rank tensor decomposition through AMP was studied in lesieur2017statistical , but their analysis was limited to symmetric tensors, which are then necessarily cubic in shape.

However, tensors that occur naturally in the wild are almost never cubic in shape, nor are they symmetric. The reason is that the different modes of an order-$p$ tensor correspond to measurements across very different modalities, resulting in very different numbers of dimensions across modes, yielding highly irregularly shaped, non-cubic tensors with no symmetry properties. For example, in EEG studies different tensor modes could correspond to time, spatial scale, and electrodes acar2007multiway . In fMRI studies the modes could span channels, time, and patients hunyadi2017tensor . In neurophysiological measurements they could span neurons, time, and conditions seely2016tensor or neurons, time, and trials williams2017unsupervised . In studies of visual cortex, modes could span neurons, time and stimuli rabinowitz2015attention .

Thus, given that tensors in the wild are almost never cubic, nor symmetric, to bridge the gap between theory and experiment, we go beyond prior work to derive and analyze Bayes optimal AMP algorithms for arbitrarily shaped high order and low rank tensor decomposition with different priors for different tensor modes, reflecting their different measurement types. We find that the low-rank decomposition problem admits two phase transitions separating three qualitatively different inference regimes: (1) the easy regime at low noise where AMP works, (2) the hard regime at intermediate noise where AMP fails but the ground truth tensor is still possible to recover, if not in a computationally tractable manner, and (3) the impossible regime at high noise where it is believed no algorithm can recover the ground-truth low rank tensor.

From a theoretical perspective, our analysis reveals several surprises relative to the analysis of symmetric cubic tensors in lesieur2017statistical . First, for symmetric tensors, it was shown that the easy inference regime cannot exist, unless the prior over the low rank factor has non-zero mean. In contrast, for non-symmetric tensors, one tensor mode can have zero mean without destroying the existence of the easy regime, as long as the other modes have non-zero mean. Furthermore, we find that in the space of all possible tensor shapes, the hard regime has the largest width along the noise axis when the shape is cubic, thereby indicating that tensor shape can have a strong effect on inference performance, and that cubic tensors have highly non-generic properties in the space of all possible tensor shapes.

Before continuing, we note some connections to the statistical mechanics literature. Indeed, AMP is closely equivalent to the TAP equations and the cavity method thouless1977solution ; crisanti1995thouless in glassy spin systems. Furthermore, the posterior distribution of noisy tensor factorization is equivalent to that of $p$-spin magnetic systems crisanti1992sphericalp , as we show below in section 2.2. For Bayes-optimal inference, the phase space of the problem is reduced to the Nishimori line nishimori2001statistical . This ensures that the system does not exhibit replica-symmetry breaking. Working in the Bayes-optimal setting thus significantly simplifies the statistical analysis of the model. Furthermore, it allows theoretical insights into the inference phase transitions, as we shall see below. In practice, for many applications the prior or underlying rank of the tensors is not known a priori. The algorithms we present here can also be applied in a non-Bayes-optimal setting, where the parametric form of the prior cannot be determined. In that case, the theoretical asymptotics we describe here may not hold. However, approximately Bayes-optimal settings can be recovered through parameter learning using expectation-maximization algorithms dempster1977maximum . We discuss these consequences in section 4. Importantly, the connection to the statistical physics of magnetic systems allows the adaptation of many tools and intuitions developed extensively in the past few decades, see e.g. zdeborova2016statistical . We discuss more connections to statistical mechanics as we proceed below.

## 2 Low rank decomposition using approximate message passing

In the following we define the low-rank tensor decomposition problem and present a derivation of AMP algorithms designed to solve this problem, as well as a dynamical mean field theory analysis of their performance. A full account of the derivations can be found in the supplementary material.

### 2.1 Low-rank tensor decomposition

Consider a general tensor $Y$ of order $p$, whose components $Y_a$ are given by a set of $p$ indices, $a = \{i_1, \ldots, i_p\}$. Each index $i_\alpha$ is associated with a specific mode $\alpha$ of the tensor. The dimension of mode $\alpha$ is $N_\alpha$, so the index $i_\alpha$ ranges from $1$ to $N_\alpha$. If $N_\alpha = N$ for all $\alpha$ then the tensor is said to be cubic. Otherwise we define $N = \left(\prod_\alpha N_\alpha\right)^{1/p}$ as the geometric mean of all dimensions $N_\alpha$, and denote $n_\alpha = N_\alpha / N$ so that $\prod_\alpha n_\alpha = 1$. We employ the shorthand notation $w_a$, where $a$ is a set of $p$ numbers indicating a specific element of $w$. A rank-1 tensor of order $p$ is the outer product of $p$ vectors (order-1 tensors) $x_\alpha \in \mathbb{R}^{N_\alpha}$, where $\alpha = 1, \ldots, p$. A rank-$r$ tensor of order $p$ has a special structure that allows it to be decomposed into a sum of $r$ rank-1 tensors, each of order $p$. The goal of the rank decomposition is to find all the factors $x^\rho_\alpha$, for $\rho = 1, \ldots, r$ and $\alpha = 1, \ldots, p$, given a tensor of order $p$ and rank $r$. In the following, we will use $x_a \in \mathbb{R}^r$ to denote the vector of values at each entry of the tensor, spanning the $r$ rank-1 components. In a low-rank decomposition it is assumed that $r \ll N$. In noisy low-rank decomposition, individual elements $Y_a$ are noisy measurements of a low-rank tensor [Figure 1.A]. A comprehensive review on tensor decomposition can be found in kolda2009tensor .
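To make the CP structure concrete, here is a minimal numerical sketch (in Python with NumPy; the function name and shapes are our own illustrative choices) that builds a rank-$r$, order-$p$ tensor as a sum of $r$ outer products, with the $N^{-(p-1)/2}$ scaling used in eq. (1) below:

```python
import numpy as np

def cp_tensor(factors):
    """Build an order-p, rank-r tensor from a list of p factor matrices.

    factors[alpha] has shape (N_alpha, r); column rho holds the rank-1
    component for mode alpha.  The result is the sum over rho of the
    outer products, scaled by N^{-(p-1)/2} with N the geometric mean
    of the mode dimensions.
    """
    p = len(factors)
    dims = [f.shape[0] for f in factors]
    N = np.prod(dims) ** (1.0 / p)          # geometric mean of mode dimensions
    r = factors[0].shape[1]
    w = np.zeros(dims)
    for rho in range(r):                     # sum of r rank-1 outer products
        comp = factors[0][:, rho]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[:, rho])
        w += comp
    return w / N ** ((p - 1) / 2.0)

rng = np.random.default_rng(0)
factors = [rng.normal(1.0, 1.0, size=(n, 2)) for n in (20, 30, 40)]  # p=3, r=2
w = cp_tensor(factors)
print(w.shape)  # (20, 30, 40)
```

Note that the modes may have different dimensions (here 20, 30, 40), i.e. the tensor is neither cubic nor symmetric.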

We state the problem of low-rank noisy tensor decomposition as follows: given a rank-$r$ tensor $w$ of order $p$,

$$w_a = \frac{1}{N^{\frac{p-1}{2}}} \sum_{\rho=1}^{r} \prod_{\alpha=1}^{p} x^{\rho}_{\alpha i_\alpha}, \tag{1}$$

we would like to find all the underlying factors $x^\rho_{\alpha i}$. We note that we have used the shorthand notation $i$ to refer to the index $i_\alpha$, which ranges from $1$ to $N_\alpha$, i.e., the dimensionality of mode $\alpha$ of the tensor.

Now consider a noisy measurement $Y$ of the rank-$r$ tensor $w$, given by

$$Y = w + \sqrt{\Delta}\,\epsilon, \tag{2}$$

where $\epsilon$ is a random noise tensor of the same shape as $w$, whose elements are distributed i.i.d. according to a standard normal distribution, yielding a total noise variance $\Delta$ [Fig. 1.A]. The underlying factors $x^\rho_{\alpha i}$ are sampled i.i.d. from a prior distribution $P_\alpha$ that may vary between the modes $\alpha$. This model is a generalization of the spiked-tensor models studied in lesieur2017statistical ; richard2014statistical .

We study the problem in the thermodynamic limit where $N \to \infty$ while $r, n_\alpha \sim O(1)$. In that limit, the mean-field theory we derive below becomes exact. The achievable performance in the decomposition problem depends on the signal-to-noise ratio (SNR) between the underlying low-rank tensor (the signal) and the noise variance $\Delta$. In eq. (1) we have scaled the signal with $N^{-(p-1)/2}$ so that the SNR (signal variance divided by noise variance) is proportional to the ratio between the number of unknowns, $O(N)$, and the number of measurements, $O(N^p)$, making the inference problem neither trivially easy nor always impossible. From a statistical physics perspective, this same scaling ensures that the posterior distribution over the factors given the data corresponds to a Boltzmann distribution whose Hamiltonian has extensive energy proportional to $N$, which is necessary for nontrivial phase transitions to occur.
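As a quick sanity check on this scaling, the following sketch (illustrative parameter values, not those used in the paper's figures) generates a rank-1, order-3 instance of the model (1)-(2) and verifies that the per-entry signal second moment scales as $N^{-(p-1)}$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, Delta = 3, 60, 0.5
mu, sigma = 1.0, 1.0

# rank-1 ground-truth factors, one vector per mode, drawn i.i.d. from N(mu, sigma^2)
xs = [rng.normal(mu, sigma, size=N) for _ in range(p)]

# eq. (1): w_a = N^{-(p-1)/2} x_{1 i1} x_{2 i2} x_{3 i3}
w = np.einsum('i,j,k->ijk', *xs) / N ** ((p - 1) / 2.0)

# eq. (2): noisy measurement with an i.i.d. standard-normal noise tensor
Y = w + np.sqrt(Delta) * rng.standard_normal(w.shape)

# the per-entry signal second moment is (mu^2 + sigma^2)^p / N^{p-1}, so the SNR
# shrinks like the ratio of O(N) unknowns per mode to O(N^p) measurements
print(np.mean(w ** 2) * N ** (p - 1))   # ~ (mu^2 + sigma^2)^3 = 8, up to finite-N fluctuations
```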

### 2.2 Tensor decomposition as a Bayesian inference problem

In Bayesian inference, one wants to compute properties of the posterior distribution

$$P(\mathbf{x}\,|\,Y) = \frac{1}{Z(Y)} \prod_{\rho=1}^{r} \prod_{\alpha=1}^{p} \prod_{i=1}^{N_\alpha} P_\alpha\!\left(x^{\rho}_{\alpha i}\right) \prod_a P_{\mathrm{out}}\!\left(Y_a \,|\, w_a\right). \tag{3}$$

Here $P_{\mathrm{out}}$ is an element-wise output channel that introduces independent noise into individual measurements. For additive white Gaussian noise, the output channel in eq. (3) is given by

$$P_{\mathrm{out}}\!\left(Y_a \,|\, w_a\right) = \frac{1}{\sqrt{2\pi\Delta}}\, e^{g(Y_a, w_a)}, \qquad g(Y, w) = -\frac{(Y - w)^2}{2\Delta}, \tag{4}$$

where $g(Y, w)$ is a quadratic cost function. The denominator $Z(Y)$ in (3) is a normalization factor, or the partition function in statistical physics. In a Bayes-optimal setting, the priors $P_\alpha$, as well as the rank $r$ and the noise variance $\Delta$, are known.

The channel universality property lesieur2015mmse states that for low-rank decomposition problems, where $w_a \to 0$ as $N \to \infty$, any output channel is equivalent to simple additive white Gaussian noise, as defined in eq. (2). Briefly, the output channel can be developed as a power series in $w_a$. For low-rank estimation problems we have $w_a \sim O(N^{-(p-1)/2})$ [eq. (1)], and we can keep only the leading terms in the expansion. One can show that the remaining terms are equivalent to random Gaussian noise, with variance equal to the inverse of the Fisher information of the channel [see supplementary material for further details]. Thus, non-additive and non-Gaussian measurement noise at the level of individual elements can be replaced with an effective additive Gaussian noise, making the theory developed here much more generally applicable to diverse noise scenarios.

The motivation behind the analysis below is the observation that the posterior (3), with the quadratic cost function (4), is equivalent to a Boltzmann distribution of a magnetic system at equilibrium, where the $x_a$ can be thought of as the $r$-dimensional vectors of a spherical (xy)-spin model crisanti1992sphericalp .

### 2.3 Approximate message passing on factor graphs

To solve the problem of low-rank decomposition we frame it as a graphical model with an underlying bipartite factor graph. The variable nodes in the graph represent the unknowns $x_{\alpha i}$ and the factor nodes correspond to the measurements $Y_a$. The edges in the graph are between a factor node $a$ and the variable nodes in its neighbourhood $\partial a$ [Figure 1.B]. More precisely, for each factor node $a = \{i_1, \ldots, i_p\}$, the set of variable nodes in the neighbourhood $\partial a$ are precisely the nodes $\alpha i_\alpha$ for $\alpha = 1, \ldots, p$. Again, in the following we will use the shorthand notation $\alpha i$ for $\alpha i_\alpha$. The state of a variable node $\alpha i$ is defined as the marginal probability distribution $\eta_{\alpha i}(x)$ over the $r$ components of the vector $x_{\alpha i}$. The estimators for the values of the factors are given by the means of each of the marginal distributions.

In the approximate message passing framework, the state of each node (also known as a 'belief') is transmitted to all other variable nodes via its adjacent factor nodes [Fig. 1.C]. The state of each node is then updated by marginalizing over all the incoming messages, weighted by the cost function and observations in the factor nodes they passed through on the way in:

$$\eta_{\alpha i}(x) = \frac{P_\alpha(x)}{Z_{\alpha i}} \prod_{a \in \partial \alpha i} \; \prod_{\beta j \in \partial a \setminus \alpha i} \operatorname{Tr}_{x_{\beta j}}\, \eta_{\beta j}(x_{\beta j})\, e^{g(Y_a, w_a)}. \tag{5}$$

Here $P_\alpha$ is the prior for each factor associated with mode $\alpha$, and $Z_{\alpha i}$ is the partition function for normalization. The first product in (5) spans all factor nodes $a$ adjacent to variable node $\alpha i$. The second product is over all variable nodes $\beta j$ adjacent to each of the factor nodes, excluding the target node $\alpha i$. The trace denotes the marginalization of the cost function over all incoming distributions.

The mean of the marginalized posterior at node $\alpha i$ is given by

$$\hat{x}_{\alpha i} = \int \mathrm{d}x\, \eta_{\alpha i}(x)\, x \;\in\; \mathbb{R}^r, \tag{6}$$

and its covariance is

$$\hat{\sigma}^2_{\alpha i} = \int \mathrm{d}x\, \eta_{\alpha i}(x)\, x x^T - \hat{x}_{\alpha i} \hat{x}_{\alpha i}^T \;\in\; \mathbb{R}^{r \times r}. \tag{7}$$

Eq. (5) defines an iterative process for updating the beliefs in the network. In what follows, we use mean-field arguments to derive iterative equations for the means and covariances of these beliefs in (6)-(7). This is possible given the assumption that incoming messages into each node are probabilistically independent. Independence is a good assumption when short loops in the underlying graphical model can be neglected. One way this can occur is if the factor graph is sparse yedidia2001bethe ; mezard2009information . Such graphs can be approximated by a directed acyclic graph; in statistical physics this is known as the Bethe approximation bethe1935statistical . Alternatively, in low-rank tensor decomposition, the statistical independence of incoming messages originates from weak pairwise interactions that scale as $N^{-(p-1)/2}$. Loops correspond to higher-order interaction terms, which become negligible in the thermodynamic limit bayati2011dynamics ; zdeborova2016statistical .

Exploiting these weak interactions we construct an accurate mean-field theory for AMP. Each node $\alpha i$ receives messages from every node $\beta j$ with $\beta \neq \alpha$, through all the factor nodes that are connected to both nodes [Fig. 1.D]. Under the independence assumption of incoming messages, we can use the central limit theorem to express the state of node $\alpha i$ in (5) as

$$\eta_{\alpha i}(x) = \frac{P_\alpha(x)}{Z_\alpha(A_{\alpha i}, u_{\alpha i})} \exp\!\left(u_{\alpha i}^T x - \tfrac{1}{2}\, x^T A_{\alpha i}\, x\right), \tag{8}$$

where $u_{\alpha i}$ and $A_{\alpha i}$ are the mean and covariance of the local incoming messages, respectively. The distribution is normalized by the partition function

$$Z_\alpha(A, u) = \int \mathrm{d}x\, P_\alpha(x) \exp\!\left(u^T x - \tfrac{1}{2}\, x^T A\, x\right). \tag{9}$$

The mean and covariance of the distribution, eqs. (6) and (7), are given by derivatives of the log partition function

$$\hat{x}_{\alpha i} = \frac{\partial}{\partial u_{\alpha i}} \log Z_\alpha, \qquad \hat{\sigma}_{\alpha i} = \frac{\partial^2}{\partial u_{\alpha i}\, \partial u_{\alpha i}^T} \log Z_\alpha. \tag{10}$$

Finally, by expanding $g(Y_a, w_a)$ in eq. (5) to quadratic order in $w_a$, and averaging over the posterior, one can find self-consistent equations for $u_{\alpha i}$ and $A_{\alpha i}$ in terms of $\hat{x}$ and $\hat{\sigma}$ [see supplemental material for details].
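For intuition, eq. (10) can be checked numerically in the simplest scalar ($r = 1$) case with a Gaussian prior, for which the partition function (9) is a Gaussian integral with a closed form. The sketch below is our own check, assuming the convention $Z(A, u) = \int \mathrm{d}x\, P(x)\, e^{u x - A x^2 / 2}$; it compares finite differences of $\log Z$ against the closed-form posterior mean and variance:

```python
import numpy as np

mu, sigma = 0.5, 1.2     # scalar (r = 1) Gaussian prior N(mu, sigma^2)
A, u = 0.7, 0.3          # illustrative local-field parameters

# grid quadrature of Z(A, u) = \int dx P(x) exp(u x - A x^2 / 2)
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
P = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def logZ(u_val):
    return np.log(np.sum(P * np.exp(u_val * x - A * x ** 2 / 2)) * dx)

# eq. (10): mean and covariance as first and second derivatives of log Z
eps = 1e-3
xhat_num = (logZ(u + eps) - logZ(u - eps)) / (2 * eps)
var_num = (logZ(u + eps) - 2 * logZ(u) + logZ(u - eps)) / eps ** 2

# closed form: the tilted Gaussian posterior has variance 1/(A + 1/sigma^2)
# and mean (u + mu/sigma^2) / (A + 1/sigma^2)
var_cf = 1.0 / (A + 1.0 / sigma ** 2)
xhat_cf = (u + mu / sigma ** 2) * var_cf
print(xhat_num, xhat_cf)
print(var_num, var_cf)
```

For a Gaussian prior $\log Z$ is exactly quadratic in $u$, so the finite differences match the closed form up to quadrature error.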

### 2.4 AMP algorithms

Using equations (10), and the self-consistent equations for $u_{\alpha i}$ and $A_{\alpha i}$, we construct an iterative algorithm whose dynamics converge to the solution of the self-consistent equations [see supplemental material for details]. The resulting update equations for the parameters are

$$u^t_{\alpha i} = \frac{n_\alpha}{\Delta N^{(p-1)/2}} \sum_{a \in \partial \alpha i} Y_a \left( \prod^{\odot}_{(\beta, j) \in \partial a \setminus (\alpha, i)} \hat{x}^t_{\beta j} \right) - \frac{1}{\Delta}\, \hat{x}^{t-1}_{\alpha i} \sum_{\beta \neq \alpha} \Sigma^t_\beta \odot D^{t, t-1}_{\alpha \beta}, \tag{11}$$

$$A^t_\alpha = \frac{1}{\Delta} \prod^{\odot}_{\beta \neq \alpha} \left( \frac{1}{n_\beta N} \sum_{j=1}^{N_\beta} \hat{x}^t_{\beta j}\, \hat{x}^{t\,T}_{\beta j} \right), \tag{12}$$

$$\hat{x}^{t+1}_{\alpha i} = \frac{\partial}{\partial u^t_{\alpha i}} \log Z_\alpha(A^t_\alpha, u^t_{\alpha i}), \tag{13}$$

$$\hat{\sigma}^{t+1}_{\alpha i} = \frac{\partial^2}{\partial u^t_{\alpha i}\, \partial u^{t\,T}_{\alpha i}} \log Z_\alpha(A^t_\alpha, u^t_{\alpha i}). \tag{14}$$

The second term on the RHS of (11) is given by

$$D^{t,t-1}_{\alpha\beta} = \prod^{\odot}_{\gamma \neq \alpha, \beta} \left( \frac{1}{N} \sum_k \hat{x}^t_{\gamma k}\, \hat{x}^{t-1,\,T}_{\gamma k} \right), \qquad \Sigma^t_\alpha = \frac{1}{N} \sum_i \hat{\sigma}^t_{\alpha i}. \tag{15}$$

This term originates from the exclusion of the target node $\alpha i$ from the product in equations (5) and (8). In statistical physics it corresponds to an Onsager reaction term due to the removal of the node, yielding a cavity field mezard1985j . In the above, the notations $\odot$ and $\prod^{\odot}$ denote component-wise multiplication between two and multiple tensors, respectively.

Note that in the derivation of the iterative update equations above, we have implicitly used the assumption that we are in the Bayes-optimal regime, which simplifies eqs. (11)-(14) [see supplementary material for details]. The AMP algorithms can be derived without the assumption of Bayes-optimality, resulting in a slightly more complicated set of algorithms [see supplementary material for details]. However, the further analytic treatment that is the focus of the current work, and the derivation of the dynamic mean-field theory we present below, are applicable in the Bayes-optimal regime, where there is no replica-symmetry breaking and the estimators are self-averaging. Once the update equations converge, the estimates for the factors and their covariances are given by the fixed-point values of equations (13) and (14) respectively. A statistical treatment of the convergence in typical settings is presented in the following section.
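To illustrate how eqs. (11)-(15) fit together, here is a minimal sketch of the resulting algorithm, specialized to the simplest case of a rank-1, cubic, order-3 tensor with the same Gaussian prior on every mode. This is an illustrative reimplementation under stated assumptions (simultaneous mode updates, uninformative initialization around the prior mean), not the authors' reference code:

```python
import numpy as np

def amp_rank1_p3(Y, Delta, mu, sigma, iters=30, seed=0):
    """AMP of eqs. (11)-(14) specialized to a rank-1 cubic order-3 tensor
    with the Gaussian prior N(mu, sigma^2) on every mode (scalar order
    parameters; all three modes updated simultaneously)."""
    N = Y.shape[0]
    rng = np.random.default_rng(seed)
    # uninformative initialization: small fluctuations around the prior mean
    xh = [mu + 0.1 * rng.standard_normal(N) for _ in range(3)]
    xh_old = [np.zeros(N) for _ in range(3)]
    var = [np.full(N, sigma ** 2) for _ in range(3)]
    subs = ('ajk,j,k->a', 'iak,i,k->a', 'ija,i,j->a')  # contract the other two modes
    for _ in range(iters):
        q = [np.mean(x ** 2) for x in xh]              # per-mode second moments
        d = [np.mean(x * xo) for x, xo in zip(xh, xh_old)]
        Sig = [np.mean(v) for v in var]                # average covariances, eq. (15)
        u, A = [], []
        for a in range(3):
            b, c = [m for m in range(3) if m != a]
            # eq. (11): matched filter on Y minus the Onsager reaction term
            u.append(np.einsum(subs[a], Y, xh[b], xh[c]) / (Delta * N)
                     - xh_old[a] * (Sig[b] * d[c] + Sig[c] * d[b]) / Delta)
            A.append(q[b] * q[c] / Delta)              # eq. (12)
        xh_old = [x.copy() for x in xh]
        for a in range(3):
            # eqs. (13)-(14): posterior mean and variance for a Gaussian prior
            denom = A[a] + 1.0 / sigma ** 2
            xh[a] = (u[a] + mu / sigma ** 2) / denom
            var[a] = np.full(N, 1.0 / denom)
    return xh

# demo: recover planted rank-1 factors at moderate noise (easy regime)
rng = np.random.default_rng(1)
N, Delta, mu, sigma = 100, 0.3, 1.0, 1.0
x0 = [rng.normal(mu, sigma, size=N) for _ in range(3)]
Y = np.einsum('i,j,k->ijk', *x0) / N + np.sqrt(Delta) * rng.standard_normal((N, N, N))
xh = amp_rank1_p3(Y, Delta, mu, sigma)
overlaps = [abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in zip(xh, x0)]
print(overlaps)
```

In the easy regime the estimated factors acquire a high normalized overlap with the planted ones; the parameter values above are illustrative, not those of the paper's figures.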

### 2.5 Dynamic mean-field theory

To study the performance of the algorithm defined by eqs. (11)-(14), we use another mean-field approximation that estimates the evolution of the inference error. As before, the mean-field theory becomes exact in the thermodynamic limit. We begin by defining order parameters that measure the correlation of the estimators $\hat{x}^t_{\alpha i}$ with the ground truth values $x_{\alpha i}$ for each mode of the tensor

$$M^t_\alpha = \frac{1}{n_\alpha N} \sum_{i=1}^{N_\alpha} \hat{x}^t_{\alpha i}\, x^T_{\alpha i} \;\in\; \mathbb{R}^{r \times r}. \tag{16}$$

Technically, the algorithm is permutation invariant, so one should not expect the high correlation values to necessarily appear on the diagonal of $M^t_\alpha$. In the following, we derive an update equation for $M^t_\alpha$, which will describe the performance of the algorithm across iterations.

An important property of Bayes-optimal inference is that there is no statistical difference between functions operating on the ground truth values, or on values sampled uniformly from the posterior distribution. In statistical physics this property is known as one of the Nishimori conditions nishimori2001statistical . These conditions allow us to derive a simple equation for the update of the order parameters (16). For example, from (12) one easily finds that in Bayes-optimal settings $A^t_\alpha = \bar{M}^t_\alpha$, with $\bar{M}^t_\alpha$ defined in (18) below. Furthermore, averaging the expression for $u^t_{\alpha i}$ over the posterior, we find that [supplemental material]

$$\mathbb{E}_{P(\mathbf{w}|Y)}\!\left[u^t_{\alpha i}\right] = \bar{M}^t_\alpha\, x_{\alpha i}, \tag{17}$$

where

$$\bar{M}^t_\alpha \equiv \frac{n_\alpha}{\Delta} \prod^{\odot}_{\beta \neq \alpha} M^t_\beta. \tag{18}$$

Similarly, the covariance matrix of $u^t_{\alpha i}$ under the posterior is

$$\mathrm{COV}_{P(\mathbf{w}|Y)}\!\left[u^t_{\alpha i}\right] = \bar{M}^t_\alpha. \tag{19}$$

Finally, using eq. (13) for the estimation of $\hat{x}_{\alpha i}$, and the definition of $M_\alpha$ in (16), we find a dynamical equation for the evolution of the order parameters:

$$M^{t+1}_\alpha = \mathbb{E}_{P_\alpha(x),\, z}\!\left[ f_\alpha\!\left( \bar{M}^t_\alpha,\; \bar{M}^t_\alpha x_{\alpha i} + \sqrt{\bar{M}^t_\alpha}\, z \right) x^T_{\alpha i} \right], \tag{20}$$

where $f_\alpha$ is the estimator of $\hat{x}_{\alpha i}$ from (13). The average in (20) is over the prior $P_\alpha$ and over the standard Gaussian vector $z \in \mathbb{R}^r$. The average over $z$ represents fluctuations in the mean in (17), due to the covariance in (19).

Finally, the performance of the algorithm is given by the fixed point of (20),

$$M^{*}_\alpha = \mathbb{E}_{P_\alpha(x),\, z}\!\left[ f_\alpha\!\left( \bar{M}^{*}_\alpha,\; \bar{M}^{*}_\alpha x_{\alpha i} + \sqrt{\bar{M}^{*}_\alpha}\, z \right) x^T_{\alpha i} \right], \qquad \text{where } \bar{M}^{*}_\alpha \equiv \frac{n_\alpha}{\Delta} \prod^{\odot}_{\beta \neq \alpha} M^{*}_\beta. \tag{21}$$

As we will see below, the inference error can be calculated from the fixed point order parameters in a straightforward manner.

## 3 Phase transitions in generic low-rank tensor decomposition

The dynamics of $M^t_\alpha$ depend on the SNR via the noise level $\Delta$. To study this dependence, we solve equations (20) and (21) with specific priors. Below we present the solution using Gaussian priors. In the supplementary material we also solve for Bernoulli and Gauss-Bernoulli distributions, and discuss mixed cases where each mode of the tensor is sampled from a different prior. Given our choice of scaling in (1), we expect phase transitions at $O(1)$ values of $\Delta$, separating three regimes where inference is: (1) easy at small $\Delta$; (2) hard at intermediate $\Delta$; and (3) impossible at large $\Delta$. For simplicity we focus on the case of rank $r = 1$, where the order parameters in (16) become scalars, which we denote $m_\alpha$.

### 3.1 Solution with Gaussian priors

We study the case where the factors $x_{\alpha i}$ are sampled from normal distributions with mode-dependent mean $\mu_\alpha$ and variance $\sigma^2_\alpha$. The mean-field update equation (20) can be written as

$$m^{t+1}_\alpha = \frac{\mu_\alpha^2\, \sigma_\alpha^{-2} + \left(\sigma_\alpha^2 + \mu_\alpha^2\right) \bar{m}^t_\alpha}{\sigma_\alpha^{-2} + \bar{m}^t_\alpha}, \tag{22}$$

where $\bar{m}^t_\alpha = \frac{n_\alpha}{\Delta} \prod_{\beta \neq \alpha} m^t_\beta$, as in (18). We define the average inference error over all modes

$$\mathrm{MSE} = \frac{1}{p} \sum_\alpha \frac{\left\| \hat{x}_\alpha - x_\alpha \right\|^2_2}{N_\alpha\, \sigma_\alpha^2} = \frac{1}{p} \sum_\alpha \left( 1 + \frac{\mu_\alpha^2}{\sigma_\alpha^2} - \frac{m^{*}_\alpha}{\sigma_\alpha^2} \right), \tag{23}$$

where $m^{*}_\alpha$ is the fixed point of eq. (22). Though we focus here on the case $r = 1$ for simplicity, the theory is equally applicable to higher-rank tensors.
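The scalar recursion (22), together with (18) and (23), is straightforward to iterate numerically. The sketch below uses illustrative parameter values; the mode fractions $n_\alpha$ are passed explicitly so that non-cubic shapes can also be explored:

```python
import numpy as np

def state_evolution(Delta, mu, sigma, n, iters=200):
    """Iterate the scalar (r = 1) update (22) with mbar from (18) and return
    the fixed-point overlaps m_a and the average error of eq. (23).
    mu, sigma, n are per-mode arrays; n_a = N_a / N are the mode fractions."""
    mu, sigma, n = (np.asarray(v, dtype=float) for v in (mu, sigma, n))
    p = len(n)
    m = mu ** 2 + 0.0 * n            # uninformative initialization: m_a = mu_a^2
    for _ in range(iters):
        mbar = np.array([n[a] / Delta * np.prod(np.delete(m, a))
                         for a in range(p)])                        # eq. (18)
        m = ((mu ** 2 / sigma ** 2 + (sigma ** 2 + mu ** 2) * mbar)
             / (1.0 / sigma ** 2 + mbar))                           # eq. (22)
    mse = np.mean(1.0 + mu ** 2 / sigma ** 2 - m / sigma ** 2)      # eq. (23)
    return m, mse

# cubic order-3 tensor with unit-mean, unit-variance priors on every mode
ones = np.ones(3)
m_easy, mse_easy = state_evolution(Delta=0.3, mu=ones, sigma=ones, n=ones)
m_hard, mse_hard = state_evolution(Delta=20.0, mu=ones, sigma=ones, n=ones)
print(mse_easy, mse_hard)   # low error at small noise, high error at large noise
```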

Solutions to the theory in (22) and (23) are plotted in Fig. 2.A together with numerical simulations of the algorithm (11)-(14) for order-3 tensors generated randomly according to (2). The theory and simulations match perfectly. The AMP dynamics for general tensor decomposition is qualitatively similar to that of rank-1 symmetric matrix and tensor decompositions, despite the fact that such symmetric objects possess only one mode. As a consequence, the space of order parameters for these symmetric problems is only one-dimensional; in contrast, for the general case we consider here, it is $p$-dimensional. Indeed, the $p$ order parameters $m_\alpha$ are all simultaneously and correctly predicted by our theory.

For low levels of noise, the iterative dynamics converge to a stable fixed point of (22) with low MSE. As the noise increases beyond a bifurcation point $\Delta_{\mathrm{alg}}$, a second stable fixed point with high MSE emerges. Above this point AMP may not converge to the true factors. The basins of attraction of the two stable fixed points are separated by a $(p-1)$-dimensional sub-manifold in the $p$-dimensional order parameter space of $\{m_\alpha\}$. If the initial values have sufficiently high overlap with the true factors, then the AMP dynamics will converge to the low error fixed point; we refer to this as the informative initialization, as it requires prior knowledge about the true structure. For uninformative initializations, the dynamics will converge to the high error fixed point almost surely in the thermodynamic limit.

At a higher level of noise, $\Delta_{\mathrm{IT}}$, another pitchfork bifurcation occurs and the high error fixed point becomes the only stable point. With noise levels above $\Delta_{\mathrm{IT}}$, the dynamic mean-field equations will always converge to a high error fixed point. In this regime AMP cannot overcome the high noise and inference is impossible.

From eq. (22), it can be easily checked that if all the prior means $\mu_\alpha = 0$, then the high error fixed point with $m_\alpha = 0$ is stable for any finite $\Delta$. This implies that $\Delta_{\mathrm{alg}} = 0$, and there is no easy regime for inference, so AMP with uninformed initialization will never find the true solution. This difficulty was previously noted for the low-rank decomposition of symmetric tensors lesieur2017statistical , and it was further shown there that the prior mean must be non-zero for the existence of an easy inference regime. However, for general tensors there is higher flexibility; one mode can have a zero mean without destroying the existence of an easy regime. To show this we solved (22) with different prior means for different modes, and we plot the phase boundaries in Fig. 2.B-C. For the $p = 3$ case, $\Delta_{\mathrm{IT}}$ is finite even if two of the priors have zero mean. Interestingly, the algorithmic transition $\Delta_{\mathrm{alg}}$ is finite if at most one prior has zero mean. Thus, the general tensor decomposition case is qualitatively different than the symmetric case in that an easy regime can exist even when a tensor mode has zero mean.
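To illustrate the role of the prior mean, consider a zero-mean Rademacher ($\pm 1$) prior on all modes — an illustrative example of our own choosing, not one of the priors solved in the supplementary material. Its denoiser is $f(A, u) = \tanh(u)$, so for a symmetric initialization on a cubic order-3 tensor the state evolution (20) reduces to a scalar recursion that we can iterate from informative and uninformative starts:

```python
import numpy as np

# Gauss-Hermite quadrature nodes/weights for expectations over z ~ N(0, 1)
t, wts = np.polynomial.hermite.hermgauss(200)
z, wts = np.sqrt(2.0) * t, wts / np.sqrt(np.pi)

def se_step(m, Delta):
    """One step of the scalar state evolution for a cubic order-3 tensor with
    a Rademacher prior on all modes: f(A, u) = tanh(u), mbar = m^2 / Delta."""
    mbar = m ** 2 / Delta
    return float(np.sum(wts * np.tanh(mbar + np.sqrt(mbar) * z)))

def run(m0, Delta, iters=100):
    m = m0
    for _ in range(iters):
        m = se_step(m, Delta)
    return m

Delta = 0.1
m_informative = run(1.0, Delta)   # initialized at the ground-truth overlap
m_uninformed = run(0.0, Delta)    # uninformed start: m = 0 is a fixed point
print(m_informative, m_uninformed)
```

The uninformed initialization stays pinned at $m = 0$ for any $\Delta$, while the informative one recovers the factors at low noise — the hard regime produced by all-zero-mean priors.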

### 3.2 Non-cubic tensors

The shape of the tensor, defined by the different mode dimensions $N_\alpha = n_\alpha N$, has an interesting effect on the phase transition boundaries, which can be studied using (22). In figure 2.D the two transitions, $\Delta_{\mathrm{alg}}$ and $\Delta_{\mathrm{IT}}$, are plotted as a function of the shape of the tensor. Over the space of all possible tensor shapes, the boundary between the hard and impossible regimes, $\Delta_{\mathrm{IT}}$, is maximized, or pushed furthest to the right in Fig. 2.D, when the shape takes the special cubic form where all dimensions are equal ($n_\alpha = 1$). This diminished size of the impossible regime at the cubic point can be understood by noting that the cubic tensor has the highest ratio between the number of observed data points, $\prod_\alpha N_\alpha = N^p$, and the number of unknowns, $\sum_\alpha N_\alpha$.

Interestingly, the algorithmic transition $\Delta_{\mathrm{alg}}$ is lowest at this point. This means that although the ratio of observations to unknowns is the highest, algorithms may not converge, as the width of the hard regime is maximized. To explain this observation, we note that in (18), the noise can be rescaled independently in each mode by defining $\Delta_\alpha \equiv \Delta / n_\alpha$. It follows that for non-cubic tensors the worst-case effective noise across modes will be necessarily higher than in the cubic case. As a consequence, moving from cubic to non-cubic tensors lowers the minimum noise level at which the uninformative solution is stable, thereby extending the hard regime to the left in Fig. 2.D.

## 4 Bayesian AMP compared to maximum a-posteriori (MAP) methods

We now compare the performance of AMP to one of the most commonly used algorithms in practice, namely alternating least squares (ALS) kolda2009tensor . ALS is motivated by the observation that optimizing one mode while holding the rest fixed is a simple least-squares subproblem harshman1970foundations ; carroll1970analysis . Typically, ALS performs well at low noise levels, but here we explore how well it compares to AMP at high noise levels, in the scaling regime defined by (1) and (2), where inference can be non-trivial.
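For reference, here is a minimal rank-1 ALS sketch for an order-3 tensor in the same scaling. This is our own illustrative implementation; initializing at the all-ones vectors is an assumption suited to nonzero-mean factors, not part of standard ALS:

```python
import numpy as np

def als_rank1_p3(Y, iters=50):
    """Rank-1 ALS for an order-3 tensor scaled as in eq. (1),
    Y ~ (1/N) x o y o z + noise.  Each step solves the least-squares
    subproblem for one mode with the other two held fixed."""
    N = Y.shape[0]
    xs = [np.ones(N) for _ in range(3)]    # all-ones init (nonzero-mean assumption)
    subs = ('ajk,j,k->a', 'iak,i,k->a', 'ija,i,j->a')
    for _ in range(iters):
        for a in range(3):
            b, c = [m for m in range(3) if m != a]
            # closed-form LS solution: x_a = N <Y, x_b o x_c> / (|x_b|^2 |x_c|^2)
            xs[a] = (N * np.einsum(subs[a], Y, xs[b], xs[c])
                     / (np.dot(xs[b], xs[b]) * np.dot(xs[c], xs[c])))
    return xs

# demo: low-noise recovery of a planted rank-1 tensor
rng = np.random.default_rng(2)
N, Delta = 80, 0.02
x0 = [rng.normal(1.0, 1.0, size=N) for _ in range(3)]
Y = np.einsum('i,j,k->ijk', *x0) / N + np.sqrt(Delta) * rng.standard_normal((N, N, N))
xs = als_rank1_p3(Y)
overlaps = [abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in zip(xs, x0)]
print(overlaps)
```

At low noise the overlaps are near 1; as the comparison below shows, the advantage of AMP appears at higher noise levels.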

In Fig. 3 we compare the performance of ALS with that of AMP on the same underlying large tensors with varying amounts of noise. First, we note that ALS does not exhibit a sharp phase transition, but rather a smooth cross-over between solvable and unsolvable regimes. Second, the robustness of ALS to noise is much lower than that of AMP. This difference becomes more substantial as the size of the tensors is increased [data not shown].

One can understand the difference in performance by noting that ALS is akin to a MAP estimator, while Bayesian AMP attempts to find the minimum mean squared error (MMSE) solution. AMP does so by marginalizing probabilities at every node. Thus AMP is expected to produce better inferences when the posterior distribution is rough and dominated by noise. From a statistical physics perspective, ALS is a zero-temperature method, and so it is subject to replica symmetry breaking. AMP, on the other hand, is Bayes-optimal and thus operates at the Nishimori temperature nishimori2001statistical . At this temperature the system does not exhibit replica symmetry breaking, and the true global ground state can be found in the easy regime, when $\Delta < \Delta_{\mathrm{alg}}$.

## 5 Summary

In summary, our work partially bridges the gap between theory and practice by creating new AMP algorithms that can flexibly assign different priors to different modes of a high-order tensor, thereby enabling AMP to handle arbitrarily shaped high order tensors that actually occur in the wild. Moreover, our theoretical analysis reveals interesting new phenomena governing how irregular tensor shapes can strongly affect inference performance and the positions of phase boundaries, and highlights the special, non-generic properties of cubic tensors. Finally, we hope the superior performance of our flexible AMP algorithms relative to ALS will promote the adoption of AMP in the wild. Code to reproduce all simulations presented in this paper is available at https://github.com/ganguli-lab/tensorAMP.

## Acknowledgments

We thank Alex Williams for useful discussions. We thank the Center for Theory of Deep Learning at the Hebrew University (J.K.), and the Burroughs-Wellcome, McKnight, James S. McDonnell, and Simons Foundations, and the Office of Naval Research and the National Institutes of Health (S.G.) for support.

## Appendix

In the following sections, we derive the approximate message passing (AMP) algorithms for arbitrary low-rank tensors. The derivation follows similar lines as the derivation in lesieur2017constrained ; zdeborova2016statistical for the matrix case, only here the model is generalized to higher orders $p > 2$. We then derive the dynamical mean-field theory (also known as state evolution) and find the phase transitions of the inference problem using non-linear analysis of the recursive dynamical equations of the order parameters. Lastly, we explicitly solve the equations for several simple examples of order-3 tensors with a mixture of different prior distributions. For notational simplicity, we start the derivation assuming all modes are of the same dimension $N$, which we take to the thermodynamic limit, $N \to \infty$. Then, we generalize to non-cubic tensors. We emphasize that even in the cubic case the tensor is non-symmetric; each mode is independent and is drawn i.i.d. from a prior distribution, which is potentially different for each mode.

## Appendix A Message passing on factorized graph

### a.1 Factor graph for tensor decomposition – notations

We consider a given low-rank tensor

$$w^0_a = \frac{1}{N^{\frac{p-1}{2}}} \sum_{\rho=1}^{r} \prod_{\alpha=1}^{p} x^{0\rho}_{\alpha i_\alpha}. \tag{24}$$

Here the vectors $x^{0\rho}_\alpha$ denote the ground-truth values of the estimation problem. Each entry of the tensor is denoted with a lower-case latin letter $a$. The notation $a$ stands for the set of indices that define that tensor element,

$$a = \{i_1, i_2, \ldots, i_p\}. \tag{25}$$

However, we only have access to noisy measurements of the ground-truth tensor, denoted by

$$Y_a = w^0_a + \sqrt{\Delta}\, \epsilon_a, \tag{26}$$

where $\epsilon$ is a random tensor whose elements are i.i.d. Gaussians with zero mean and unit variance. We assume no covariance between two measurements, $\mathbb{E}[\epsilon_a \epsilon_b] = \delta_{ab}$.

The goal of the low-rank decomposition is to find the estimators $\hat{x}$ that minimize the mean square error

$$\hat{x} = \operatorname*{argmin}_{x} \sum_{\alpha}\sum_{i} \left(x_{\alpha i} - x^0_{\alpha i}\right)^2.$$
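As a concrete illustration of the observation model (24)-(26), the spiked tensor can be generated numerically. The following is a minimal sketch assuming Gaussian priors for all modes; the function name `spiked_tensor` and its default parameters are illustrative, not from the paper:

```python
import numpy as np

def spiked_tensor(N, r, p=3, Delta=1.0, seed=0):
    """Generate Y = N^{-(p-1)/2} sum_rho x_1 (x) ... (x) x_p + sqrt(Delta) * eps,
    with one N x r factor matrix per mode, entries drawn i.i.d. standard normal."""
    rng = np.random.default_rng(seed)
    X0 = [rng.standard_normal((N, r)) for _ in range(p)]  # ground-truth factors
    W0 = np.zeros((N,) * p)
    for rho in range(r):
        comp = X0[0][:, rho]
        for alpha in range(1, p):
            comp = np.multiply.outer(comp, X0[alpha][:, rho])  # rank-1 component
        W0 += comp
    W0 /= N ** ((p - 1) / 2)                  # scaling from Eq. (24)
    Y = W0 + np.sqrt(Delta) * rng.standard_normal(W0.shape)   # Eq. (26)
    return Y, X0

Y, X0 = spiked_tensor(N=12, r=2)
```

The weak $N^{-(p-1)/2}$ scaling of each tensor element is exactly what later justifies the weak-interaction (Bethe) approximation in the message-passing derivation.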

To solve the Bayesian inference problem using message passing, we frame it as a bipartite graphical model. Each of the variable nodes corresponds to an estimator $x_{\alpha i}$ [see figure 1.b in the main text]. We use the notation $\partial a$ to denote all the variable nodes neighboring the factor node $a$, and the notation $\partial a \setminus \alpha i$ to denote all the variable nodes adjacent to $a$ excluding the node $\alpha i$. The cardinality of the set of all factor nodes is $N^p$.

Each variable node on the graph is connected to $N^{p-1}$ factor nodes. The set of neighboring factor nodes connected to the variable node $\alpha i$ is denoted as

$$\partial \alpha i = \{a \,|\, \alpha i \in a\}. \qquad (27)$$

### a.2 Weakly connected graph

An underlying assumption in belief propagation and message passing algorithms is that the incoming messages into each node are statistically independent. This can be achieved, for example, in sparse graphs, where around each node the graph can be approximately treated as a tree (directed acyclic graph), without recurring loops. In the physics literature such an approximation is often referred to as a Bethe lattice. In the current model this is possible due to the $N^{-(p-1)/2}$ scaling of individual elements, as defined in Eq. (1) in the main text. Since correlations in the messages are due to loops in the underlying graph that pass through several nodes, we neglect them when the interactions are sufficiently weak mezard1986sk .

### a.3 Message passing

We start by defining two different types of messages (beliefs): messages outgoing from a variable node $\alpha i$ into a factor node $a$, denoted $\eta_{\alpha i \to a}(x_{\alpha i})$, and messages going from factor nodes into variable nodes, denoted $\tilde\eta_{b \to \alpha i}(x_{\alpha i})$. Messages are the marginal probabilities at each node, measuring the posterior probability density of the estimator at the source node. A message outgoing from the variable node $\alpha i$ to a factor node $a$ can be written in terms of the product of messages originating from all its connected factor nodes excluding $a$,

$$\eta_{\alpha i \to a}(x_{\alpha i}) = \frac{P_\alpha(x_{\alpha i})}{Z_{\alpha i \to a}} \prod_{b \in \partial \alpha i \setminus a} \tilde\eta_{b \to \alpha i}(x_{\alpha i}). \qquad (28)$$

The denominator is a normalization factor

$$Z_{\alpha i \to a} = \mathrm{Tr}_{x_{\alpha i}}\, P_\alpha(x_{\alpha i}) \prod_{b \in \partial \alpha i \setminus a} \tilde\eta_{b \to \alpha i}(x_{\alpha i}).$$

Incoming messages into variable nodes are obtained by marginalizing over the distributions of all the incoming messages. A message outgoing from a factor node $b$ into a variable node $\alpha i$ is given by

$$\tilde\eta_{b \to \alpha i}(x_{\alpha i}) = \frac{1}{Z_{b \to \alpha i}} \prod_{\beta j \in \partial b \setminus \alpha i} \mathrm{Tr}_{x_{\beta j}}\, \eta_{\beta j \to b}(x_{\beta j}) \exp g\!\left(Y_b,\, N^{\frac{1-p}{2}} w_b\right), \qquad (29)$$

where

$$w_b \equiv \sum_{\rho=1}^{r} \prod_{\alpha=1}^{p} x_{\rho\alpha i_\alpha}. \qquad (30)$$

The normalization factor in the denominator of eq. (29) is given by

$$Z_{b \to \alpha i} = \mathrm{Tr}_{x_{\alpha i}} \prod_{\beta j \in \partial b \setminus \alpha i} \mathrm{Tr}_{x_{\beta j}}\, \eta_{\beta j \to b}(x_{\beta j}) \exp g\!\left(Y_b,\, N^{\frac{1-p}{2}} w_b\right). \qquad (31)$$

The cost function in the exponent can be expanded as a power series in $N^{(1-p)/2} w_b$,

$$\tilde\eta_{b \to \alpha i}(x_{\alpha i}) = \frac{1}{Z_{b \to \alpha i}} \prod_{\beta j \in \partial b \setminus \alpha i} \mathrm{Tr}_{x_{\beta j}}\, \eta_{\beta j \to b}(x_{\beta j}) \times e^{g(Y_b,0)} \left[ 1 + \frac{1}{N^{(p-1)/2}} S_b w_b + \frac{1}{N^{p-1}} \left(R_b - S_b^2\right) w_b^2 + O\!\left(\frac{1}{N^{3(p-1)/2}}\right) \right], \qquad (32)$$

where $S_b$ and $R_b$ are defined through the first and second derivatives of the cost function evaluated at $w_b = 0$:

$$S_b \equiv \left.\frac{\partial g(Y_b, w_b)}{\partial w_b}\right|_{w_b=0}, \qquad (33)$$

$$R_b \equiv \left(\left.\frac{\partial g(Y_b, w_b)}{\partial w_b}\right|_{w_b=0}\right)^{\!2} + \left.\frac{\partial^2 g(Y_b, w_b)}{\partial w_b^2}\right|_{w_b=0}. \qquad (34)$$
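For a Gaussian output channel matching the noise model (26), $g(Y,w) = -(Y-w)^2/(2\Delta)$, these definitions evaluate in closed form to $S_b = Y_b/\Delta$ and $R_b = Y_b^2/\Delta^2 - 1/\Delta$. The following sketch checks this against finite differences; the helper names are ours:

```python
import numpy as np

Delta = 0.5
g = lambda Y, w: -(Y - w) ** 2 / (2 * Delta)  # Gaussian log-likelihood channel

def S_R(Y, h=1e-5):
    """Finite-difference evaluation of S = g' and R = (g')^2 + g'' at w = 0."""
    g1 = (g(Y, h) - g(Y, -h)) / (2 * h)               # first derivative
    g2 = (g(Y, h) - 2 * g(Y, 0.0) + g(Y, -h)) / h**2  # second derivative
    return g1, g1 ** 2 + g2                           # S_b, R_b

Y = 1.3
S, R = S_R(Y)
# closed form for this channel: S = Y/Delta, R = (Y/Delta)^2 - 1/Delta
```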

#### a.3.1 Belief propagation

The mean values of outgoing messages from variable node $\alpha i$ into factor node $a$ are obtained by integrating over the marginal probabilities $\eta_{\alpha i \to a}$:

$$\hat{x}_{\alpha i \to a} = \int dx_{\alpha i}\, \eta_{\alpha i \to a}(x_{\alpha i})\, x^T_{\alpha i} \in \mathbb{R}^r. \qquad (35)$$

Note that we have used the transpose of the vector $x_{\alpha i}$, which will become useful for the notation below. Their covariance matrix is equal to

$$\hat\sigma_{\alpha i \to a} = \int dx_{\alpha i}\, \eta_{\alpha i \to a}(x_{\alpha i})\, x_{\alpha i} x^T_{\alpha i} - \hat{x}_{\alpha i \to a}\, \hat{x}^T_{\alpha i \to a} \in \mathbb{R}^{r \times r}. \qquad (36)$$

Using the first- and second-order statistics, we can write explicit expressions for the moments of $w_b$ appearing in the expansion (32) above. The first moment reads

$$\prod_{\beta j \in \partial b \setminus \alpha i} \int dx_{\beta j}\, \eta_{\beta j \to b}(x_{\beta j})\, w_b = \prod_{\beta j \in \partial b \setminus \alpha i} \int dx_{\beta j}\, \eta_{\beta j \to b}(x_{\beta j}) \sum_{\rho=1}^{r} \prod_{\beta j \in \partial b} x_{\rho\beta j} = x^T_{\alpha i} \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}_{\beta j \to b}. \qquad (37)$$

Similarly, the second moment of $w_b$ is given by

$$\prod_{\beta j \in \partial b \setminus \alpha i} \int dx_{\beta j}\, \eta_{\beta j \to b}(x_{\beta j})\, w_b^2 = x^T_{\alpha i} \left[\odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \left(\sigma_{\beta j \to b} + \hat{x}_{\beta j \to b}\, \hat{x}^T_{\beta j \to b}\right)\right] x_{\alpha i}. \qquad (38)$$

Introducing the explicit moments back into eq. (32), the incoming messages into variable nodes are given by

$$\tilde\eta_{b \to \alpha i}(x_{\alpha i}) = \frac{e^{g(Y_b,0)}}{Z_{b \to \alpha i}} \left[ 1 + \frac{1}{N^{(p-1)/2}} S_b\, x^T_{\alpha i} \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}_{\beta j \to b} + \frac{1}{N^{p-1}} \left(R_b - S_b^2\right) x^T_{\alpha i} \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \left(\sigma_{\beta j \to b} + \hat{x}_{\beta j \to b}\, \hat{x}^T_{\beta j \to b}\right) x_{\alpha i} \right] + O\!\left(\frac{1}{N^{3(p-1)/2}}\right). \qquad (39)$$

Since we are interested in the marginals at the variable nodes, we substitute this result into the expression for messages outgoing from a variable node (28), and obtain

$$\eta_{\alpha i \to a}(x_{\alpha i}) = \frac{P_\alpha(x_{\alpha i})}{Z_{\alpha i \to a}} \prod_{b \in \partial \alpha i \setminus a} \tilde\eta_{b \to \alpha i}(x_{\alpha i}) = \frac{P_\alpha(x_{\alpha i})}{Z_{\alpha i \to a}} \prod_{b \in \partial \alpha i \setminus a} \frac{e^{g(Y_b,0)}}{Z_{b \to \alpha i}} \times \exp \sum_{b \in \partial \alpha i \setminus a} \left[ \frac{1}{N^{(p-1)/2}} S_b\, x^T_{\alpha i} \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}_{\beta j \to b} + \frac{1}{N^{p-1}}\, x^T_{\alpha i} \left( R_b \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \left(\sigma_{\beta j \to b} + \hat{x}_{\beta j \to b}\, \hat{x}^T_{\beta j \to b}\right) - S_b^2 \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}_{\beta j \to b}\, \hat{x}^T_{\beta j \to b} \right) x_{\alpha i} \right]. \qquad (40)$$

Note that $e^{g(Y_b,0)}$ is a constant and can be absorbed into the normalization factor. We define the two order parameters

$$u^T_{\alpha i \to a} = \frac{1}{N^{(p-1)/2}} \sum_{b \in \partial \alpha i \setminus a} S_b \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}^T_{\beta j \to b} \in \mathbb{R}^r, \qquad (41)$$

and

$$A_{\alpha i \to a} = \frac{1}{N^{p-1}} \sum_{b \in \partial \alpha i \setminus a} \left[ S_b^2 \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}_{\beta j \to b}\, \hat{x}^T_{\beta j \to b} - R_b \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \left(\sigma_{\beta j \to b} + \hat{x}_{\beta j \to b}\, \hat{x}^T_{\beta j \to b}\right) \right] \in \mathbb{R}^{r \times r}. \qquad (42)$$

Using the order parameters we rewrite equation (40) as

$$\eta_{\alpha i \to a}(x_{\alpha i}) = \frac{P_\alpha(x_{\alpha i})}{Z_{\alpha i \to a}} \exp\left(-x^T_{\alpha i}\, A_{\alpha i \to a}\, x_{\alpha i} + u^T_{\alpha i \to a}\, x_{\alpha i}\right). \qquad (43)$$

The normalization, or partition function $Z_{\alpha i \to a}$, can be written in terms of the order parameters $A_{\alpha i \to a}$ and $u_{\alpha i \to a}$ as

$$Z_{\alpha i \to a} = \mathrm{Tr}_{x_{\alpha i}}\, P_\alpha(x_{\alpha i}) \exp\left(-x^T_{\alpha i}\, A_{\alpha i \to a}\, x_{\alpha i} + u^T_{\alpha i \to a}\, x_{\alpha i}\right). \qquad (44)$$

Finally, the moments of the local variables with distribution (43) can be found directly from the partition function by standard derivatives. The mean is given by

$$\hat{x}_{\alpha i \to a} = \frac{\partial}{\partial u_{\alpha i \to a}} \log Z_{\alpha i \to a}(A_{\alpha i \to a}, u_{\alpha i \to a}) \equiv f(A_{\alpha i \to a}, u_{\alpha i \to a}), \qquad (45)$$

and the covariance matrices are

$$\sigma_{\alpha i \to a} = \frac{\partial^2}{\partial u_{\alpha i \to a}\, \partial u^T_{\alpha i \to a}} \log Z_{\alpha i \to a}(A_{\alpha i \to a}, u_{\alpha i \to a}) = \frac{\partial}{\partial u_{\alpha i \to a}} f(A_{\alpha i \to a}, u_{\alpha i \to a}). \qquad (46)$$
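As a concrete example of the denoiser $f$: for a standard-normal prior $P_\alpha = \mathcal{N}(0, I_r)$, the partition function (44) is a Gaussian integral with precision matrix $2A + I$, so the mean and covariance are available in closed form. A minimal sketch, assuming this particular prior (function names are ours):

```python
import numpy as np

def f_gauss(A, u):
    """Posterior mean and covariance for prior P(x) = N(0, I_r):
    Z(A,u) ∝ ∫ exp(-x^T A x + u^T x) N(x; 0, I) dx is Gaussian
    with precision M = 2A + I, mean M^{-1} u, covariance M^{-1}."""
    M = 2 * A + np.eye(len(u))
    cov = np.linalg.inv(M)
    return cov @ u, cov

A = np.array([[0.5, 0.1], [0.1, 0.3]])
u = np.array([1.0, -2.0])
mean, cov = f_gauss(A, u)

# scalar sanity check (r = 1): numerically verify mean = d(log Z)/du
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
prior = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
logZ = lambda a, b: np.log(np.sum(np.exp(-a * xs**2 + b * xs) * prior) * dx)
A1, u1, h = 0.4, 0.7, 1e-5
num_mean = (logZ(A1, u1 + h) - logZ(A1, u1 - h)) / (2 * h)  # ≈ u1 / (2*A1 + 1)
```

For non-Gaussian priors (e.g., sparse or bounded supports) the integral in (44) generally has no closed form and the denoiser is evaluated by one-dimensional quadrature per rank component.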

### a.4 AMP algorithms

The mean-field equations, describing the equilibrium of the local estimators, can be turned into an iterative algorithm by alternately calculating the statistics of the messages given the estimators, using eqs. (41) and (42), and then re-evaluating the estimators using eqs. (45) and (46). Using a superscript $t$ to denote the time step of the algorithm iteration, we can write the iterative equations as

$$u^t_{\alpha i \to a} = \frac{1}{N^{(p-1)/2}} \sum_{b \in \partial \alpha i \setminus a} S_b \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}^t_{\beta j \to b}, \qquad (47)$$

$$A^t_{\alpha i \to a} = \frac{1}{N^{p-1}} \sum_{b \in \partial \alpha i \setminus a} \left[ S_b^2 \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}^t_{\beta j \to b}\, \hat{x}^{t\,T}_{\beta j \to b} - R_b \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \left(\sigma^t_{\beta j \to b} + \hat{x}^t_{\beta j \to b}\, \hat{x}^{t\,T}_{\beta j \to b}\right) \right], \qquad (48,\,49)$$

$$\hat{x}^{t+1}_{\alpha i \to a} = \frac{\partial}{\partial u^t_{\alpha i \to a}} \log Z_{\alpha i \to a}(A^t_{\alpha i \to a}, u^t_{\alpha i \to a}), \qquad (50)$$

$$\sigma^{t+1}_{\alpha i \to a} = \frac{\partial^2}{\partial u^t_{\alpha i \to a}\, \partial u^{t\,T}_{\alpha i \to a}} \log Z_{\alpha i \to a}(A^t_{\alpha i \to a}, u^t_{\alpha i \to a}). \qquad (51)$$
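To make the update structure concrete, here is a minimal numerical sketch of the iteration (47)-(51). It is an illustration, not the full algorithm of the paper: we specialize to rank $r=1$, $p=3$, the Gaussian channel ($S = Y/\Delta$, $R = (Y/\Delta)^2 - 1/\Delta$) and a standard-normal prior, and we use the simplified target-agnostic messages of the next section without the Onsager-type correction derived there. All function and variable names are ours:

```python
import numpy as np

def amp_rank1(Y, Delta=1.0, T=30, seed=0):
    """Sketch of updates (47)-(51): rank-1, p=3, Gaussian channel and prior.
    Target-agnostic messages; Onsager correction omitted (illustrative only)."""
    N = Y.shape[0]
    S = Y / Delta                        # first channel derivative at w = 0
    R = (Y / Delta) ** 2 - 1.0 / Delta   # R = S^2 + g''
    rng = np.random.default_rng(seed)
    x = [0.1 * rng.standard_normal(N) for _ in range(3)]   # per-mode estimators
    sig = [np.ones(N) for _ in range(3)]                   # per-mode variances
    for _ in range(T):
        for a in range(3):
            b, c = [m for m in range(3) if m != a]
            Sa, Ra = np.moveaxis(S, a, 0), np.moveaxis(R, a, 0)
            # u_i = N^{-(p-1)/2} sum_{jk} S_{ijk} x_b[j] x_c[k]  (eq. 47)
            u = np.einsum('ijk,j,k->i', Sa, x[b], x[c]) / N
            # A from eq. (48)-(49), with the same mode-a contraction
            A = (np.einsum('ijk,j,k->i', Sa**2, x[b]**2, x[c]**2)
                 - np.einsum('ijk,j,k->i', Ra,
                             sig[b] + x[b]**2, sig[c] + x[c]**2)) / N**2
            # Gaussian-prior denoiser: mean and variance of eq. (43)
            x[a] = u / (2 * A + 1.0)
            sig[a] = 1.0 / (2 * A + 1.0)
    return x, sig
```

The per-iteration cost here is dominated by the tensor contractions, $O(N^p)$, which is the point of the target-agnostic reduction: the directed-message equations (47)-(51) taken literally would track one message per edge of the factor graph.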

### a.5 Approximate message passing – local mean-field approximation for the messages

In the equations above (47)-(51), the number of overall messages (and thus calculations) scales with the number of edges in the factor graph, i.e., as $O(N^p)$. However, the dependence of each message on the state of the target node is weak. Therefore, the values of $\hat{x}_{\alpha i \to a}$ and $\sigma_{\alpha i \to a}$ are very close to their means when marginalized over all target nodes $a$. The local deviations about that mean scale as $N^{-(p-1)/2}$. For that reason, we can consider the statistics of all outgoing messages from each node (i.e., average over all the adjacent edges), and assume small fluctuations due to the state of the targets. This procedure essentially performs a mean-field approximation at every node. The result will be the AMP equations, which scale with the number of variable nodes rather than with the number of edges in the graph. In physics, this is analogous to the cavity method.

To apply this reasoning to the equations, we define the order parameters $u_{\alpha i}$ and $A_{\alpha i}$, which explicitly exclude the dependence on the target node:

$$u^t_{\alpha i} = \frac{1}{N^{(p-1)/2}} \sum_{b \in \partial \alpha i} S_b \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}^t_{\beta j \to b}, \qquad (52)$$

$$A^t_{\alpha i} = \frac{1}{N^{p-1}} \sum_{b \in \partial \alpha i} \left[ S_b^2 \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \hat{x}^t_{\beta j \to b}\, \hat{x}^{t\,T}_{\beta j \to b} - R_b \odot\!\!\prod_{\beta j \in \partial b \setminus \alpha i} \left(\sigma^t_{\beta j \to b} + \hat{x}^t_{\beta j \to b}\, \hat{x}^{t\,T}_{\beta j \to b}\right) \right]. \qquad (53)$$

The difference between the non-directed and the directed messages is the component that depends on the target node $a$. For the mean messages, the correction term scales as $N^{-(p-1)/2}$ and is given by

$$\delta u^t_{\alpha i \to a} = u^t_{\alpha i} - u^t_{\alpha i \to a} = \frac{1}{N^{(p-1)/2}} S_a \odot\!\!\prod_{\beta j \in \partial a \setminus \alpha i} \hat{x}^t_{\beta j \to a}. \qquad (54)$$

For the fluctuations in the local messages about their mean, the correction term scales as

$$A^t_{\alpha i} - A^t_{\alpha i \to a} \sim O\!\left(N^{1-p}\right), \qquad (55)$$

and we will be neglecting it.

To transform the equations for the local message statistics to use only the target-agnostic variables $u_{\alpha i}$ and $A_{\alpha i}$, we calculate the difference between the two mean values