# Passed & Spurious: analysing descent algorithms and local minima in spiked matrix-tensor model

In this work we analyse quantitatively the interplay between the loss landscape and the performance of descent algorithms in a prototypical inference problem, the spiked matrix-tensor model. We study a loss function that is the negative log-likelihood of the model. We analyse the number of local minima at a fixed distance from the signal/spike with the Kac-Rice formula, and locate the trivialization of the landscape at large signal-to-noise ratios. We evaluate in a closed form the performance of a gradient flow algorithm using integro-differential PDEs as developed in the physics of disordered systems for the Langevin dynamics. We analyse the performance of an approximate message passing algorithm estimating the maximum likelihood configuration via its state evolution. We conclude by comparing the above results: while we observe a drastic slow-down of the gradient flow dynamics even in the region where the landscape is trivial, both of the analysed algorithms are shown to perform well even in the part of the parameter region where spurious local minima are present.


## 1 Introduction

A central question in computational sciences is the algorithmic feasibility of optimization in high-dimensional non-convex landscapes. This question is particularly important in learning and inference problems, where the value of the optimized function is not the ultimate criterion for the quality of the result; instead, the generalization error or the closeness to a ground-truth signal is more relevant.

Recent years brought a popular line of research into this question, where various works show for a variety of systems that spurious local minima are not present in certain regimes of parameters, and conclude that consequently optimization algorithms shall succeed; without the aim of being exhaustive, these include [2, 3, 4, 5, 6, 7, 8, 9, 10]. The spuriousness of a minimum is in some works defined by its distance from the global minimum, in other works as a local minimizer that leads to bad generalization or bad accuracy in the reconstruction of the ground-truth signal. These two notions are not always equivalent; certainly the latter is more relevant and will be used in the present work.

Many of the existing works stop at the statement that the absence of spurious local minimizers leads to algorithmic feasibility and the presence of such spurious local minima leads to algorithmic difficulty, at least as far as gradient-descent-based algorithms are concerned. At the same time, even gradient-descent-based algorithms may perform well when spurious local minima are present, because the basins of attraction of the spurious minima may be small and the dynamics might be able to avoid them. In the other direction, even if spurious local minima are absent, algorithms might take a long time to find a minimizer for entropic reasons that in high-dimensional problems may play a crucial role.

Main results: In this work we provide a case-study of a high-dimensional inference problem – the spiked matrix-tensor model – for which we are able to describe and quantitatively compare the following:

• With the use of the Kac-Rice formula [11, 12] we compute the expected number of local minimizers of the associated likelihood at a given correlation with the ground truth signal.

• With the use of the recently introduced Langevin-state-evolution [1], that is a generalization of an approach well known in physics for analysis of the Langevin dynamics [13, 14], we give a closed-form description for the behaviour of the gradient flow (GF) algorithm in the limit of large system sizes.

• We analyze the state evolution of the maximum-likelihood version of the approximate message passing algorithm (ML-AMP).

We show that the above two algorithms (GF and ML-AMP) achieve the same error in the regime where they succeed. That same value of the error is also deduced from the position of the minima strongly correlated with the signal as obtained from the Kac-Rice approach (precise statement below). We quantify the region of parameters in which the two above algorithms succeed and show that, up to the degree of accuracy of the extrapolation of the GF performance, the two lines are not the same. We also show that the algorithmic performance is not driven by the absence of spurious local minima. These results are summarized in Fig. 1. They show that, in order to obtain a complete picture for settings beyond the present model, the precise interplay between the absence of spurious local minima and algorithmic performance remains to be further investigated.

## 2 Problem definition

In this paper we consider the spiked matrix-tensor model as studied in [1]. This is a statistical inference problem where the ground-truth signal x^* is sampled uniformly on the N-dimensional sphere, \sum_i (x^*_i)^2 = N. We then obtain two types of observations about the signal, a symmetric matrix Y, and an order-p symmetric tensor T, that given the signal are obtained as

 Y_{ij} = \frac{x^*_i x^*_j}{\sqrt{N}} + \xi_{ij}, \qquad (1)
 T_{i_1,\dots,i_p} = \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x^*_{i_1}\cdots x^*_{i_p} + \xi_{i_1,\dots,i_p} \qquad (2)

for i<j and i_1<\dots<i_p, using the symmetry of Y and T to obtain the other non-diagonal components. Here \xi_{ij} and \xi_{i_1,\dots,i_p} are, for each i<j and each i_1<\dots<i_p, independent Gaussian random numbers of zero mean and variance \Delta_2 and \Delta_p, respectively.
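As an illustration, the generative model of Eqs. (1)-(2) can be sampled directly at small N. The sketch below is ours (the function name and the dictionary storage of the tensor entries are our choices, not from the paper); it stores only the entries with ordered indices:

```python
import math
from itertools import combinations

import numpy as np

def sample_spiked_matrix_tensor(N, p, delta2, deltap, seed=0):
    """Sample x* on the sphere of radius sqrt(N) and the observations Y, T of Eqs. (1)-(2)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)
    x_star = x * math.sqrt(N) / np.linalg.norm(x)        # ||x*||^2 = N
    # Matrix channel, Eq. (1): spike plus symmetric Gaussian noise of variance delta2.
    Y = np.outer(x_star, x_star) / math.sqrt(N)
    noise = math.sqrt(delta2) * rng.standard_normal((N, N))
    upper = np.triu(noise, k=1)
    Y += upper + upper.T                                  # symmetrize the off-diagonal noise
    # Tensor channel, Eq. (2): one Gaussian per ordered index tuple i1 < ... < ip.
    prefac = math.sqrt(math.factorial(p - 1)) / N ** ((p - 1) / 2)
    T = {idx: prefac * np.prod(x_star[list(idx)])
              + math.sqrt(deltap) * rng.standard_normal()
         for idx in combinations(range(N), p)}
    return x_star, Y, T
```

The dictionary representation keeps only the N-choose-p independent tensor entries, mirroring the "i_1 < ... < i_p plus symmetry" convention of the text.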

The goal in this spiked matrix-tensor inference problem is to estimate the signal x^* from the knowledge of the matrix Y and the tensor T. If only the matrix was present, this model reduces to the well known model of low-rank perturbation of a random symmetric matrix, closely related to the spiked covariance model [15]. If on the contrary only the tensor is observed, then the above model reduces to the spiked tensor model as introduced in [16] and studied in a range of subsequent papers.

In this paper we study the matrix-tensor model where the two observations are combined. Our motivation is similar to the one exposed in [1]: we aim to access a regime in which it is algorithmically tractable to obtain good performance with the corresponding message passing algorithms, yet challenging (e.g. leading to non-convex optimization) with sampling or gradient-descent-based algorithms; this happens in the region of the noise parameters \Delta_2 and \Delta_p identified in [1].

In this paper we focus on algorithms that aim to find the maximum likelihood estimator. The negative log-likelihood (Hamiltonian in the physics language, or loss function in the machine learning language) of the spiked matrix-tensor model reads

 \mathcal{L}(x) = \sum_{i<j} \frac{1}{2\Delta_2}\left(Y_{ij} - \frac{x_i x_j}{\sqrt N}\right)^2 + \sum_{i_1<\dots<i_p} \frac{1}{2\Delta_p}\left(T_{i_1,\dots,i_p} - \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x_{i_1}\cdots x_{i_p}\right)^2 \qquad (3)

where x is constrained to the sphere, \|x\|_2^2 = N.
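A direct, if inefficient, evaluation of Eq. (3) can serve as a sanity check; the sketch below is our own (naming and data layout assumed as in the sampling snippet above, with the tensor stored entry-wise for ordered indices). With the noise switched off, the loss vanishes at the planted signal:

```python
import math
from itertools import combinations

import numpy as np

def loss(x, Y, T, p, delta2, deltap):
    """Negative log-likelihood of Eq. (3), up to x-independent constants."""
    N = len(x)
    # Matrix channel: squared residuals over the ordered pairs i < j.
    l2 = sum((Y[i, j] - x[i] * x[j] / math.sqrt(N)) ** 2
             for i, j in combinations(range(N), 2)) / (2.0 * delta2)
    # Tensor channel: squared residuals over the ordered index tuples.
    prefac = math.sqrt(math.factorial(p - 1)) / N ** ((p - 1) / 2)
    lp = sum((T[idx] - prefac * np.prod(x[list(idx)])) ** 2
             for idx in T) / (2.0 * deltap)
    return l2 + lp
```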

In the high-dimensional, N \to \infty, noisy regime the maximum-likelihood estimator is not always optimal, as it provides in general a larger error than the Bayes-optimal estimator computing the marginals of the posterior, studied in [1]. At the same time, the log-likelihood (3) can be seen as a loss function that is non-convex and high-dimensional. The tractability and properties of such minimization problems are among the most debated questions in machine learning these days, and are worth a detailed investigation in the present model.

## 3 Landscape characterization

The first goal of this paper is to characterize the structure of local minima of the loss function (equivalently, local maxima of the log-likelihood), eq. (3), as a function of the noise parameters \Delta_2 and \Delta_p. We compute the average number of local minimizers having a given correlation m with the ground-truth signal x^*. This leads to a so-called complexity function \Sigma(m), defined as the logarithm of the expected number of local minima at correlation m with the ground truth.

A typical example of this function, resulting from our analysis, is depicted in Fig. 2 for fixed p and \Delta_p and several values of \Delta_2. We see from the figure that at large \Delta_2 local minima appear only in a narrow range of values of m close to zero; as \Delta_2 decreases, the support of \Sigma(m) widens. At yet smaller values of \Delta_2 the support becomes disconnected, so that it consists of an interval of values of m close to zero and of two (one negative, one positive) isolated points. For yet smaller \Delta_2 the complexity for values of m close to zero becomes negative, signalling what we call a trivialization of the landscape, where all remaining local minima are (in the leading order in N) as correlated with the ground truth as the global minima. The support of \Sigma(m) in the trivialized region consists of two separated points. We call this value of \Delta_2 the trivialization threshold. In the phase diagram of Fig. 1 the trivialization of the energy landscape happens above the purple dashed line.

We use the Kac-Rice formula to determine the complexity [17, 18]. Given an arbitrary continuous function, the Kac counting formula computes the number of points where the function crosses a given value. The number of minima can thus be characterized by applying Kac's formula to the gradient of the loss (3), counting how many times the gradient crosses zero, under the condition of a positive definite Hessian, in order to count only local minima and not saddles. Since the spiked matrix-tensor model is characterized by a random landscape, due to the noises \xi_{ij} and \xi_{i_1,\dots,i_p}, we consider the expected number of minima, obtaining the Kac-Rice formula [17, 18].
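The counting idea behind the Kac-Rice formula can be illustrated in one dimension: for a smooth function, local minima are zero crossings of the derivative at which the second derivative is positive. The toy construction below is ours (a random trigonometric polynomial, not a model from the paper) and simply detects those crossings on a fine grid:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5
a, b = rng.standard_normal(K), rng.standard_normal(K)
ks = np.arange(1, K + 1)

def d1(x):
    """First derivative of f(x) = sum_k a_k cos(kx) + b_k sin(kx)."""
    return (-a * ks * np.sin(np.outer(x, ks)) + b * ks * np.cos(np.outer(x, ks))).sum(axis=1)

def d2(x):
    """Second derivative of f."""
    return (-a * ks**2 * np.cos(np.outer(x, ks)) - b * ks**2 * np.sin(np.outer(x, ks))).sum(axis=1)

x = np.linspace(0.0, 2.0 * np.pi, 200001)
g = d1(x)
# Kac counting: grid intervals where f' changes sign, i.e. f' crosses zero.
crossings = np.nonzero(np.sign(g[:-1]) != np.sign(g[1:]))[0]
# Keep only crossings with positive curvature: these are the local minima.
minima = crossings[d2(x[crossings]) > 0.0]
```

On a periodic function minima and maxima alternate, so the minima are exactly half of the zero crossings of f'; the Kac-Rice formula averages precisely this kind of count over the randomness.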

For mathematical convenience we will consider the rescaled configurations \sigma = x/\sqrt N and the rescaled signal \sigma^* = x^*/\sqrt N. Call \phi_{G,F_2,F_p} the joint probability density of the gradient G of the loss and of F_2 and F_p, the contributions of the matrix and of the tensor to the loss, respectively. Given the values N\epsilon_2 and N\epsilon_p of the two contributions to the loss, and the correlation m between the configuration and the ground truth, which we impose using a Dirac delta, the averaged number of minimizers is

 \mathcal{N}(m,\epsilon_2,\epsilon_p;\Delta_2,\Delta_p) = e^{N\tilde\Sigma_{\Delta_2,\Delta_p}(m,\epsilon_2,\epsilon_p)} = \int_{S^{N-1}} \mathbb{E}\left[\det H \,\middle|\, G=0,\ F_2=N\epsilon_2,\ F_p=N\epsilon_p,\ H\succ 0\right] \phi_{G,F_2,F_p}(\sigma,0,\epsilon_2,\epsilon_p)\,\delta(m-\sigma\cdot\sigma^*)\,d\sigma. \qquad (4)

We rewrite the loss Eq. (3), neglecting terms that are constant with respect to the configuration and thus do not contribute to the complexity,

 \hat{\mathcal{L}} = -\frac{\sqrt{(p-1)!}}{\Delta_p N^{(p-1)/2}} \sum_{i_1<\dots<i_p} T_{i_1,\dots,i_p}\, x_{i_1}\cdots x_{i_p} - \frac{1}{\Delta_2\sqrt N} \sum_{i<j} Y_{ij}\, x_i x_j. \qquad (5)

In the following we will use small letters f_2, f_p, g, h to denote the losses, gradient and Hessian constrained on the sphere, and capital letters F_2, F_p, G, H for the same quantities unconstrained. Define I_N the N-dimensional identity matrix. The following lemma characterizes the distribution of f_k, g and h.

###### Lemma 1.

Given the loss function Eq. (5) and a configuration \sigma such that the correlation with the signal is \sigma\cdot\sigma^* = m, there exists a reference frame such that the joint probability distribution of f_k, g and h is given by

 f_k \sim \frac{1}{k\Delta_k} m^k + \sqrt{\frac{1}{k\Delta_k}}\,\frac{1}{\sqrt N}\, Z_k; \qquad (6)
 g \sim \left(\frac{1}{\Delta_p} m^{p-1} + \frac{1}{\Delta_2} m\right)\sqrt{1-m^2}\; e_1 - \sqrt{\frac{1}{\Delta_p}+\frac{1}{\Delta_2}}\,\frac{1}{\sqrt N}\, \tilde Z; \qquad (7)
 h \sim \left(\frac{p-1}{\Delta_p} m^{p-2} + \frac{1}{\Delta_2}\right)(1-m^2)\, e_1 e_1^T + \sqrt{\frac{p-1}{\Delta_p}+\frac{1}{\Delta_2}}\,\sqrt{\frac{N-1}{N}}\, W - (p f_p + 2 f_2)\, I_{N-1}; \qquad (8)

with Z_2 and Z_p standard Gaussians, \tilde Z a standard multivariate Gaussian, and W a random matrix from the Gaussian orthogonal ensemble.

###### Proof sketch.

Starting from Eq. (5), split the contributions of the matrix and of the tensor into F_2 and F_p, two Gaussian variables, and impose the spherical constraint with a Lagrange multiplier \mu:

 f_2(\sigma)+f_p(\sigma) = F_2(\sigma)+F_p(\sigma) - \frac{\mu}{2}\Big(\sum_i \sigma_i^2 - 1\Big), \qquad (9)
 g_i(\sigma) = G_i(\sigma) - \mu\,\sigma_i, \qquad (10)
 h_{ij}(\sigma) = H_{ij}(\sigma) - \mu\,\delta_{ij}. \qquad (11)

The expression for \mu at a critical point can be derived as follows. Given g = 0, multiply Eq. (10) by \sigma_i and sum over the indices; since F_2 and F_p are homogeneous of degrees 2 and p, we obtain \mu = \sigma\cdot G(\sigma) = 2F_2(\sigma) + pF_p(\sigma). We now restrict our study to the unconstrained random variables and substitute \mu = 2F_2 + pF_p. Since the quantities F_2, F_p, G, H are linear functionals of Gaussians, they are distributed as Gaussian random variables and can therefore be characterized by computing expected values and covariances. Starting from the losses coming from the matrix and the tensor in Eq. (5), F_2(\sigma) and F_p(\sigma) respectively, consider the moments with respect to the realization of the noises \xi_{ij} and \xi_{i_1,\dots,i_p}. For k \in \{2,p\} the first moment leads to

 \mathbb{E}[F_k(\sigma)] = \frac{N}{k\Delta_k}\,(\sigma\cdot\sigma^*)^k + O(1). \qquad (12)

Let us consider the second moment, now with two different configurations \sigma and \tau,

 \mathbb{E}[F_k(\sigma)F_k(\tau)] = \frac{N}{k\Delta_k}\,(\sigma\cdot\tau)^k + O(1). \qquad (13)

Using standard results for derivatives of Gaussians (see e.g. [17] Eq. 5.5.4) we obtain the means and covariances of the random variables by taking derivatives with respect to \sigma and \tau. We then set \tau = \sigma, imposing the spherical constraint and using \sigma\cdot\sigma^* = m.

The last step is the definition of a convenient reference frame. Align the configuration along the last coordinate, \sigma = e_N, and the signal with a combination of the first and last coordinates, \sigma^* = \sqrt{1-m^2}\, e_1 + m\, e_N. Finally, project on the sphere by discarding the last coordinate. ∎

We can now rewrite the determinant of the conditioned Hessian by grouping the multiplicative factor in front of the GOE matrix in Eq. (8),

 \det h = \left(\frac{p-1}{\Delta_p}+\frac{1}{\Delta_2}\right)^{\frac{N-1}{2}} \left(\frac{N}{N-1}\right)^{-\frac{N-1}{2}} \det\!\left[W - t_N I_{N-1} + \theta_N\, e_1 e_1^T\right] \qquad (14)

with t_N and \theta_N given by

 t_N \to t = \frac{2p\epsilon_p + 2\epsilon_2}{\sqrt{\frac{p-1}{\Delta_p}+\frac{1}{\Delta_2}}}, \qquad (15)
 \theta_N \to \theta = \frac{\frac{p-1}{\Delta_p} m^{p-2} + \frac{1}{\Delta_2}}{\sqrt{\frac{p-1}{\Delta_p}+\frac{1}{\Delta_2}}}\,(1-m^2) \qquad (16)

in the large-N limit. Therefore the Hessian behaves like a GOE matrix shifted by t, with a rank-one perturbation of strength \theta. This exact problem has already been studied in [12], and we can thus deduce the expression for the complexity as

 \tilde\Sigma_{\Delta_2,\Delta_p}(m,\epsilon_2,\epsilon_p) = \frac{1}{2}\log\frac{\frac{p-1}{\Delta_p}+\frac{1}{\Delta_2}}{\frac{1}{\Delta_p}+\frac{1}{\Delta_2}} + \frac{1}{2}\log(1-m^2) - \frac{1}{2}\,\frac{\left(\frac{m^{p-1}}{\Delta_p}+\frac{m}{\Delta_2}\right)^2}{\left(\frac{1}{\Delta_p}+\frac{1}{\Delta_2}\right)(1-m^2)} - 2p\Delta_p\left(\epsilon_p-\frac{m^p}{2p\Delta_p}\right)^2 - 4\Delta_2\left(\epsilon_2-\frac{m^2}{4\Delta_2}\right)^2 + \Phi(t) - L(\theta,t), \qquad (17)

with

 \Phi(t) = \frac{t^2}{4} + \mathbb{1}_{|t|>2}\left[\log\left(\sqrt{\frac{t^2}{4}-1}+\frac{|t|}{2}\right) - \frac{|t|}{4}\sqrt{t^2-4}\right]

and L(\theta,t) the correction due to the rank-one perturbation, whose explicit form is given in [12].

We note at this point that in the limit \Delta_2 \to \infty, corresponding to the pure spiked tensor model, the above expression reduces exactly to the complexity derived in [12]. The following theorem states that to the leading order Eq. (17) represents the complexity of our problem.
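The function \Phi(t) is the only piece of Eq. (17) with a non-analytic point, at the edge |t| = 2 of the semicircle. A minimal sketch transcribing the formula as written above (the function name is ours):

```python
import math

def phi(t):
    """Phi(t) entering the complexity (17): t^2/4 inside the semicircle bulk
    |t| <= 2, with an extra term when t lies outside the bulk."""
    val = t * t / 4.0
    if abs(t) > 2.0:
        val += math.log(math.sqrt(t * t / 4.0 - 1.0) + abs(t) / 2.0) \
               - abs(t) / 4.0 * math.sqrt(t * t - 4.0)
    return val
```

The extra term vanishes continuously as |t| approaches 2 from above, so \Phi is continuous, even in t, and equal to 1 at the edge.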

###### Theorem 1.

Given \Delta_2 and \Delta_p, for any m \in (-1,1) and any \epsilon_2, \epsilon_p it holds

 \lim_{N\to\infty} \frac{1}{N}\log \mathbb{E}\,\mathcal{N}(m,\epsilon_2,\epsilon_p;\Delta_2,\Delta_p) = \tilde\Sigma_{\Delta_2,\Delta_p}(m,\epsilon_2,\epsilon_p) \qquad (18)
###### Proof sketch.

The proof comes immediately from [12] Thm. 2, see also Sec. 4.1. ∎

The quantity that we are interested in is the projection of Eq. (17) to the maximal values of \epsilon_2 and \epsilon_p:

 \Sigma(m) = \max_{\epsilon_2,\epsilon_p} \tilde\Sigma_{\Delta_2,\Delta_p}(m,\epsilon_2,\epsilon_p). \qquad (19)

Eq. (19) allows us to understand whether, at a given correlation with the signal, there is an exponentially large expected number of minima, see Fig. 2. It thus allows us to locate the parameters for which the landscape is trivial.

We computed the expected number of minima, i.e. the so-called annealed average. The annealed average might be dominated by rare samples and in general provides only an upper bound for typical samples. The quenched complexity, i.e. the average of the logarithm of the number of minima, is more involved. The quenched calculation was done in the case of the pure spiked tensor model [19]. It is interesting to notice that in [19] the authors found that the annealed complexity does not differ from the quenched complexity in the relevant regime. This, combined with analogous preliminary results for the spiked matrix-tensor model, suggests that considering the quenched complexity would not change the conclusions of this paper presented in the phase diagrams of Fig. 1.

## 4 Gradient flow

In this section we analyze the performance of the gradient flow on the loss function (3),

 \frac{d}{dt}x_i(t) = -\mu(t)\, x_i(t) - \frac{\delta \mathcal{L}}{\delta x_i(t)}, \qquad (20)

where the Lagrange parameter \mu(t) is set in a way to ensure the spherical constraint \|x(t)\|_2^2 = N. Our aim is to understand the final correlation between the ground-truth signal and the configuration reached by the gradient flow at large but finite time, while N \to \infty.

The gradient flow (20) can be seen as the zero-temperature limit of the Langevin algorithm

 \frac{d}{dt}x_i(t) = -\mu(t)\, x_i(t) - \frac{\delta \mathcal{L}}{\delta x_i(t)} - \eta_i(t), \qquad (21)

with \eta_i(t) being the Langevin noise with zero mean and covariance \langle\eta_i(t)\eta_j(t')\rangle = 2T\,\delta_{ij}\,\delta(t-t'), where T has the physical meaning of a temperature and the notation \langle\cdot\rangle stands for the average over the noises \xi and \eta. As we take the limit T \to 0, the noise distribution becomes peaked around zero, effectively recovering the gradient flow.
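Before turning to the exact N → ∞ description used in the paper, the gradient flow can be simulated directly at finite N by Euler-discretizing Eq. (20) and projecting back onto the sphere after each step. The sketch below is our own finite-size check (for p = 3 and illustrative parameter values, with a dense symmetrized tensor), not the analysis of the paper; it verifies that at low noise the flow acquires correlation with the signal:

```python
import math

import numpy as np

def gradient_flow(N=30, delta2=0.2, deltap=2.0, dt=0.01, steps=2000, seed=0):
    """Euler-discretized spherical gradient flow, Eq. (20), for p = 3 at finite N."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(N)
    x_star *= math.sqrt(N) / np.linalg.norm(x_star)
    # Dense symmetric observations (all index orderings filled in; p = 3).
    E = math.sqrt(delta2) * rng.standard_normal((N, N))
    Y = np.outer(x_star, x_star) / math.sqrt(N) + (E + E.T) / math.sqrt(2.0)
    T = np.einsum('i,j,k->ijk', x_star, x_star, x_star) * math.sqrt(2.0) / N
    Ep = math.sqrt(deltap) * rng.standard_normal((N, N, N))
    T += (Ep + Ep.transpose(0, 2, 1) + Ep.transpose(1, 0, 2) + Ep.transpose(1, 2, 0)
          + Ep.transpose(2, 0, 1) + Ep.transpose(2, 1, 0)) / math.sqrt(6.0)
    # Random start on the sphere, then discretized flow with projection.
    x = rng.standard_normal(N)
    x *= math.sqrt(N) / np.linalg.norm(x)
    for _ in range(steps):
        grad = -(Y @ x) / (delta2 * math.sqrt(N)) \
               - (math.sqrt(2.0) / (2.0 * deltap * N)) * np.einsum('ijk,j,k->i', T, x, x)
        x -= dt * grad
        x *= math.sqrt(N) / np.linalg.norm(x)   # spherical constraint in place of mu(t)
    return abs(np.dot(x, x_star)) / N           # final overlap |m|
```

The projection step plays the role of the Lagrange multiplier \mu(t); diagonal entries of the observations are kept for simplicity, which is harmless at this level of illustration.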

The performance of the Langevin algorithm was characterized recently in [1] using equations developed in the physics of disordered systems [13, 14]. In [1] this characterization was given for an arbitrary temperature and compared to the landscape of the Bayes-optimal estimator [20]. Here we hence summarize and use the results of [1] corresponding to the limit T \to 0.

The Langevin dynamics at generic temperature is, in the large-size limit N \to \infty, characterized by a set of PDEs for the self-correlation C(t,t') = \langle x(t)\cdot x(t')\rangle/N, the response function R(t,t'), and the correlation with the signal m(t) = \langle x(t)\cdot x^*\rangle/N. Ref. [1] established that as the gradient flow evolves these quantities satisfy eqs. (74)-(76) of that paper. Taking the zero-temperature limit of those equations we obtain

 \frac{\partial}{\partial t} C(t,t') = -\tilde\mu(t)\, C(t,t') + Q'(m(t))\, m(t') + \int_0^t dt''\, R(t,t'')\, Q''(C(t,t''))\, C(t',t'') + \int_0^{t'} dt''\, R(t',t'')\, Q'(C(t,t'')), \qquad (22)
 \frac{\partial}{\partial t} R(t,t') = -\tilde\mu(t)\, R(t,t') + \int_{t'}^{t} dt''\, R(t,t'')\, Q''(C(t,t''))\, R(t'',t'), \qquad (23)
 \frac{\partial}{\partial t} m(t) = -\tilde\mu(t)\, m(t) + Q'(m(t)) + \int_0^t dt''\, R(t,t'')\, m(t'')\, Q''(C(t,t'')), \qquad (24)

with Q(x) = x^2/(2\Delta_2) + x^p/(p\Delta_p) and \tilde\mu(t) the rescaled spherical constraint. Boundary conditions for the equations are C(t,t) = 1 for all t and R(t,t') = 0 for t' > t (causality). An additional equation for \tilde\mu(t) is obtained by fixing t' = t in Eq. (22). In the context of disordered systems these equations have been established rigorously for the related case of the matrix-tensor model without the spike [21].

### 4.1 Performance of the gradient flow

Eqs. (22-24) are integrated numerically, yielding the large-size-limit performance of the gradient flow algorithm. An example of this evolution is given in Fig. 3. The code will be made available and linked to this paper. For consistency, we confirm numerically that at large times the gradient flow reaches values of the correlation that correspond exactly to the correlation of the minima correlated with the signal as obtained from the Kac-Rice approach.

As the variance \Delta_2 increases, the time the gradient flow takes to acquire a good correlation with the signal increases. We define the convergence time \tau as the time it takes to reach 1/2 of the final plateau. The dependence of \tau on \Delta_2 is consistent with a power-law divergence at a threshold value of \Delta_2. This is illustrated in Fig. 4, where we plot the convergence time as a function of \Delta_2 and show the power-law fit in the inset. The resulting threshold points are collected and plotted in Fig. 1, dotted blue line.

From Fig. 4 we see that the gradient flow algorithm undergoes a considerable slow-down even in the region where the landscape is trivial, i.e. does not have spurious local minimizers. At the same time, the divergence of the convergence time happens only well inside the phase where spurious local minimizers do exist.

## 5 Maximum-likelihood approximate message passing

Approximate Message Passing (AMP) is a popular iterative algorithm [22] with the key advantage of being analyzable via its state evolution [23]. The maximum-likelihood AMP (ML-AMP) algorithm studied in this paper is a generalization of the AMP for the pure spiked tensor model from [16] to the spiked matrix-tensor model. We will show that its fixed points correspond to stationary points of the loss function (3). This should be contrasted with the Bayes-optimal AMP (BO-AMP) that was studied in [1] and aims to approximate the marginals of the corresponding posterior probability distribution. The ML-AMP instead aims to estimate the maximum-likelihood solution, \hat x = \mathrm{argmin}_{\|x\|_2^2=N}\, \mathcal{L}(x). In information theory the BO-AMP would correspond to the sum-product algorithm, while the present one to the max-sum algorithm. In statistical physics language the BO-AMP corresponds to temperature one, while the present one to zero temperature. In the appendix, Sec. C, we provide a schematic derivation of the ML-AMP as the zero-temperature limit of the BO-AMP, using a scheme similar to [24]. The ML-AMP iterates the following equations on the estimate \hat x^t:

 B^t_i = \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}} \sum_{k_2<\dots<k_p} \frac{T_{i,k_2,\dots,k_p}}{\Delta_p}\, \hat x^t_{k_2}\cdots \hat x^t_{k_p} + \frac{1}{\sqrt N}\sum_k \frac{Y_{ik}}{\Delta_2}\, \hat x^t_k - r^t\, \hat x^{t-1}_i, \qquad (25)
 \hat x^{t+1}_i = \sqrt N\, \frac{B^t_i}{\|B^t\|_2}, \qquad (26)
 \hat\sigma^{t+1}_k = \frac{\sqrt N}{\|B^t\|_2}, \qquad (27)

with \|\cdot\|_2 the \ell_2-norm and r^t the Onsager reaction term

 r^t = \frac{1}{\Delta_2}\frac{1}{N}\sum_k \hat\sigma^t_k + \frac{p-1}{\Delta_p}\frac{1}{N}\sum_k \hat\sigma^t_k \left(\frac{1}{N}\sum_k \hat x^t_k\, \hat x^{t-1}_k\right)^{p-2}. \qquad (28)

### 5.1 ML-AMP & stationary points of the loss

Using an argument similar to Prop. 5.1 in [25], we can show that a fixed point of ML-AMP corresponds to a stationary point of the loss Eq. (3) with a ridge regularizer.

###### Property 1.

Given a fixed point (B^*, \hat x^*, r^*) of ML-AMP, \hat x^* satisfies the stationarity condition of the loss.

###### Proof sketch.

Let us denote by B^*, \hat x^*, r^* the fixed point of Eqs. (25)-(28). From Eq. (26) and Eq. (25) we have

 \left(\frac{1}{\sqrt N}\|B^*\|_2 + r^*\right)\hat x^*_i = \frac{1}{\sqrt N}\sum_k \frac{Y_{ik}}{\Delta_2}\, \hat x^*_k + \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}} \sum_{k_2<\dots<k_p} \frac{T_{i,k_2,\dots,k_p}}{\Delta_p}\, \hat x^*_{k_2}\cdots \hat x^*_{k_p},

which is exactly a solution of the stationarity condition obtained by differentiating Eq. (3) with respect to x_i, with the spherical constraint enforced by a Lagrange multiplier \mu:

 0 = -\mu\, x_i + \frac{1}{\sqrt N}\sum_k \frac{Y_{ik}}{\Delta_2}\, x_k + \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}} \sum_{k_2<\dots<k_p} \frac{T_{i,k_2,\dots,k_p}}{\Delta_p}\, x_{k_2}\cdots x_{k_p}. \qquad (29)

Moreover, ML-AMP by construction preserves the spherical constraint at every iteration, thus the fixed-point value of the Lagrange multiplier in (29) is the one enforcing the spherical constraint. ∎

### 5.2 State evolution

The evolution of ML-AMP can be tracked through a set of equations called state evolution (SE). The state evolution can be characterized via a single order parameter: m^t = \hat x^t\cdot x^*/N, the correlation of the ML-AMP estimate with the ground-truth signal at time t. According to the SE, as proven in [23, 16], this parameter evolves in the large-N limit as

 m^{t+1} = \frac{\frac{m^t}{\Delta_2} + \frac{(m^t)^{p-1}}{\Delta_p}}{\sqrt{\frac{1}{\Delta_2}+\frac{1}{\Delta_p}+\left(\frac{m^t}{\Delta_2}+\frac{(m^t)^{p-1}}{\Delta_p}\right)^2}}, \qquad (30)

and the mean square error correspondingly

 \mathrm{MSE}^t = 2\,(1-m^t). \qquad (31)

A derivation of this state evolution is presented in the appendix, Sec. C.
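The scalar recursion (30) and the error (31) are straightforward to iterate numerically. A minimal sketch (our naming, p = 3 for concreteness):

```python
import math

def se_map(m, delta2, deltap, p=3):
    """One step of the ML-AMP state evolution, Eq. (30)."""
    z = m / delta2 + m ** (p - 1) / deltap
    return z / math.sqrt(1.0 / delta2 + 1.0 / deltap + z * z)

def se_fixed_point(delta2, deltap, p=3, m0=1e-3, iters=10000):
    """Iterate Eq. (30) from a small positive overlap and return the limit."""
    m = m0
    for _ in range(iters):
        m = se_map(m, delta2, deltap, p)
    return m

def mse(m):
    """Mean squared error of Eq. (31)."""
    return 2.0 * (1.0 - m)
```

Started from a small positive overlap, the iteration either flows back to m = 0 or to the informative fixed point, depending on the noise parameters; the values of \Delta_2 and \Delta_p in the test below are illustrative choices of ours.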

Analysis of the simple scalar SE, Eq. (30), allows us to identify the error reached by the ML-AMP algorithm. We first observe that m = 0 is always a fixed point. For the performance of ML-AMP it is the stability of this fixed point that determines whether ML-AMP will be able to find a positive correlation with the signal or not. Analyzing Eq. (30) we obtain that m = 0 is a stable fixed point for \Delta_2 > \Delta_2^{\rm ML\text{-}AMP}(\Delta_p), where

 \Delta_2^{\rm ML\text{-}AMP}(\Delta_p) = \frac{-\Delta_p + \sqrt{\Delta_p^2 + 4\Delta_p}}{2}. \qquad (32)

Consequently, for \Delta_2 > \Delta_2^{\rm ML\text{-}AMP}(\Delta_p) the ML-AMP algorithm converges to m = 0, i.e. zero correlation with the signal. The line \Delta_2^{\rm ML\text{-}AMP}(\Delta_p) is plotted in Fig. 1. For \Delta_2 < \Delta_2^{\rm ML\text{-}AMP}(\Delta_p) the ML-AMP algorithm converges to a positive correlation with the signal, depicted in Fig. 5. In Fig. 5 we also observe that this correlation agrees (up to the numerical precision) with the position of the maximum of the complexity function \Sigma(m) with the largest value of m, which is also depicted in the figure. The trivialization of the landscape occurs at a strictly smaller value of \Delta_2, thus showing that in the intermediate region the ML-AMP algorithm is able to ignore a good portion of the spurious local minima and to converge to the local minima best correlated with the signal.
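The threshold (32) can be cross-checked against the linearization of Eq. (30) around m = 0: for p > 2 only the matrix channel contributes at linear order, and \Delta_2^{\rm ML\text{-}AMP} is exactly where the slope of the SE map at the origin equals one. A small sketch (function names are ours):

```python
import math

def delta2_ml_amp(deltap):
    """Spinodal of Eq. (32): the value of Delta_2 below which m = 0 is unstable."""
    return (-deltap + math.sqrt(deltap * deltap + 4.0 * deltap)) / 2.0

def slope_at_zero(delta2, deltap):
    """Derivative of the SE map, Eq. (30), at m = 0 (valid for p > 2)."""
    return (1.0 / delta2) / math.sqrt(1.0 / delta2 + 1.0 / deltap)
```

A slope larger than one means an infinitesimal initial correlation grows, so ML-AMP escapes the uninformative fixed point; a slope smaller than one means it decays.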

In Fig. 5 we also compare to the MSE obtained by the Bayes-optimal AMP, which provably minimizes the MSE in the case depicted in the figure [1]. We see that the gap between the Bayes-optimal error and the one reached by the loss-minimization approach rapidly goes to zero as the signal-to-noise ratio increases.

## 6 Discussion

We analyzed the behavior of two descent algorithms optimizing the rough high-dimensional loss landscape of the spiked matrix-tensor model. We used the Kac-Rice formula to count the average number of minima of the loss function having a given correlation with the signal. Analyzing the resulting formula, we defined and located where the energy landscape becomes trivial in the sense that spurious local minima disappear. We analyzed the performance of the gradient flow via integro-differential state-evolution-like equations. We numerically solved these equations and extrapolated the divergence of their convergence time. We delimited a region of parameters for which the gradient flow is able to avoid the spurious minima and obtain a good correlation with the signal in time linear in the input size. We also analyzed the maximum-likelihood AMP algorithm and located the region of parameters in which this algorithm works, which is larger than the (numerically extrapolated) region for which the gradient flow works. We found that in cases where both algorithms converge to an informative minimum, the corresponding error is the same for both and also corresponds to the position of the minima well correlated with the signal in the Kac-Rice approach. The relation between the existence or absence of spurious local minima in the loss landscape of a generic optimization problem and the actual performance of optimization algorithms is yet to be understood. Our analysis of the spiked matrix-tensor model brings a case-study where we were able to specify this relation quantitatively.

## Acknowledgments

We thank G. Ben Arous, G. Biroli, C. Cammarota, G. Folena, and V. Ros for precious discussions. We acknowledge funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe; from the French National Research Agency (ANR) grant PAIL; and from ”Investissements d’Avenir” LabEx PALM (ANR-10-LABX-0039-PALM) (SaMURai and StatPhysDisSys). This research was supported in part by the National Science Foundation under Grant No. PHY-1748958.

## Appendix A Kac-Rice formula

### A.1 p-odd cases

In the cases in which the order p of the tensor is odd, we encounter an interesting phenomenon due to the different symmetries of the two types of observation. The matrix is symmetric under inverting the sign of the signal, x^* \to -x^*, while the tensor is not symmetric under this inversion for odd p. This creates an asymmetry in the complexity, Fig. 6 (to be compared with Fig. 2), and causes a shift toward lower correlations of the band characterizing the non-informative minima. Therefore, observing when the complexity at m = 0 becomes negative does not guarantee that the non-informative minima disappeared. To do so, one must check that the whole non-informative band disappears. This should be contrasted with the case of even p, where a maximum of the complexity is always at m = 0. These two definitions of the threshold have a little, but not negligible, difference, see Fig. 7. Observe that as \Delta_p increases the peak of the complexity decreases, since the loss Eq. (3) tends to the simple matrix-factorization problem, where the landscape is characterized by two isolated minima. This implies that the two definitions become indistinguishable for large \Delta_p. In the main text we use the definition taking into account the maximum (even when it is not strictly at m = 0) because it gives a more accurate characterization of the trivialization threshold.

## Appendix B Gradient flow

### B.1 Dependence on the initial conditions

The dynamics of the gradient flow shows a dependence on the initial condition, because formally zero correlation is an (unstable) fixed point of the GF state evolution. In practice we observe, for both GF and ML-AMP, that instability of this fixed point is sufficient for good performance of the algorithm. However, this makes the definition of the convergence time depend on the initial condition.

We observed from our numerical solution of the GF state-evolution equations that the initial condition adds an additive contribution to the convergence times. Thus, by fitting this term and rescaling the convergence times, the different initializations collapse onto a single curve, see the inset of Fig. 8. Finally, the collapsed points were used to extrapolate the critical line, as shown in the main text, Fig. 4.

## Appendix C AMP

### C.1 From AMP to ML-AMP

In this section we consider the spiked matrix-tensor model in a Bayesian way. We show how the Bayes-optimal AMP leads to the maximum-likelihood AMP using a temperature-like parameter T. We introduce the AMP algorithm for a generic T, and show that as T \to 0 we recover the ML-AMP as presented in the main text. The probability distribution we consider is

 P(x|Y,T) \propto e^{-\mu \|x\|^2/2} \prod_{i<j} e^{-\frac{1}{2\Delta_2 T}\left(Y_{ij}-\frac{x_i x_j}{\sqrt N}\right)^2} \prod_{i_1<\dots<i_p} e^{-\frac{1}{2\Delta_p T}\left(T_{i_1,\dots,i_p}-\frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x_{i_1}\cdots x_{i_p}\right)^2}. \qquad (33)

The scheme for deriving an AMP estimating the marginals of such a probability distribution can be found in [24, 1]. It consists in making a Gaussian assumption on the distribution of the messages in the belief propagation (BP) algorithm and neglecting the node dependence of the messages. A final consideration used in order to derive the algorithm is that the spherical constraint can be imposed by rescaling the estimate to the sphere at every iteration. The resulting AMP algorithm iterates on the following equations:

 B^t_i = \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}} \sum_{k_2<\dots<k_p} \frac{T_{i,k_2,\dots,k_p}}{\Delta_p T}\, \hat x^t_{k_2}\cdots \hat x^t_{k_p} + \frac{1}{\sqrt N}\sum_k \frac{Y_{ik}}{\Delta_2 T}\, \hat x^t_k - r^t\, \hat x^{t-1}_i, \qquad (34)
 \sigma^{t+1}_k = \frac{T\sqrt N}{\|B^t\|_2}, \qquad (35)
 \hat x^{t+1}_i = \sqrt N\, \frac{B^t_i}{\|B^t\|_2}, \qquad (36)

with \|\cdot\|_2 the \ell_2-norm and r^t the Onsager reaction term

 r^t = \frac{1}{\Delta_2 T^2}\frac{1}{N}\sum_k \sigma^t_k + \frac{p-1}{\Delta_p T^2}\frac{1}{N}\sum_k \sigma^t_k \left(\frac{1}{N}\sum_k \hat x^t_k\, \hat x^{t-1}_k\right)^{p-2}. \qquad (37)

In the limit T \to 0 the AMP defined by Eqs. (34-37) is equivalent to ML-AMP, Eqs. (25-28). To see this we define the rescaled variables \tilde B^t = T B^t and \tilde\sigma^t = \sigma^t/T^2. Taking the limit T \to 0, the expression for \tilde\sigma^t, Eq. (35), and the expression for \hat x^{t+1}, Eq. (36), simplify to Eq. (27) and Eq. (26), respectively. Dropping the tildes we obtain the ML-AMP as presented in the main text.

### C.2 State evolution

The generic-T version of AMP has a slightly more complicated SE that depends on two order parameters: the already introduced m^t and the self-overlap q^t = \hat x^t\cdot\hat x^t/N of the estimator. The SE equations are:

 m^{t+1} = \frac{2\, z^t(T)}{1+\sqrt{1+4\, y^t(T)}}, \qquad (38)
 q^{t+1} = \frac{4\, y^t(T)}{\left(1+\sqrt{1+4\, y^t(T)}\right)^2} \qquad (39)

and

 \mathrm{MSE}^t = 1 - 2m^t + q^t, \qquad (40)

with y^t(T) and z^t(T) two functions of m^t, q^t and the temperature T.

Given a positive initial overlap, in the limit T \to 0 the AMP SE Eqs. (38-39) simplify to a single equation corresponding to the ML-AMP SE, Eq. (41). This is seen by taking the limit T \to 0 of Eq. (39), which gives y^t(T) \to \infty, implying q^{t+1} \to 1. Then, using this result for q, we show that Eq. (38) tends to Eq. (30).

### C.3 Derivation of spinodals

From the SE Eq. (30) we can obtain analytical equations for the spinodals, the thresholds of stability of the different ML-AMP fixed points. We have m^{t+1} = f_{SE}(z(m^t)) with

 f_{SE}(z) = \frac{z}{\sqrt{z^2+\gamma}}, \qquad (41)

where z(m) = \frac{m}{\Delta_2} + \frac{m^{p-1}}{\Delta_p} and \gamma = \frac{1}{\Delta_2} + \frac{1}{\Delta_p}. Observe that f_{SE} is monotonically increasing in z. The spinodals then follow from the condition that a fixed point of the SE becomes marginally stable, i.e. f_{SE}(z(m)) = m together with \frac{d}{dm} f_{SE}(z(m)) = 1.