Marvels and Pitfalls of the Langevin Algorithm in Noisy High-dimensional Inference

12/21/2018, by Stefano Sarao Mannelli et al.

Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performance of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked matrix-tensor model. The typical behaviour of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution is compared with the one of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is sub-optimal with respect to the one given by AMP. We conjecture this phenomenon to be due to the residual glassiness present in that region of parameters. Finally, we show how a landscape-annealing protocol, that uses the Langevin algorithm but violates the Bayes-optimality condition, can approach the performance of AMP.

1 Motivation

Algorithms based on noisy variants of gradient descent [1, 2] stand at the roots of many modern applications of data science, and are used in a wide range of high-dimensional non-convex optimization problems. The widespread use of stochastic gradient descent in deep learning [3] is certainly one of the most prominent examples. For such algorithms, the existing theoretical analyses mostly concentrate on convex functions, convex relaxations, or on regimes where spurious local minima become irrelevant. For problems with complicated landscapes where, instead, useful convex relaxations are not known and spurious local minima cannot be ruled out, the theoretical understanding of the behaviour of gradient-descent-based algorithms remains poor and represents a major avenue of research.

The goal of this paper is to contribute to such an understanding in the context of statistical learning, and to transfer ideas and techniques developed for glassy dynamics [4] to the analysis of non-convex relaxations in high-dimensional inference. In statistical learning, the minimization of a cost function is not the goal per se, but rather a way to uncover an unknown structure in the data. One common way to model and analyze this situation is to generate data with a hidden structure, and to see if the structure can be recovered. This is easily set up as a teacher-student scenario [5, 6]: first a teacher generates latent variables and uses them as input of a prescribed model to generate a synthetic dataset; then the student observes the dataset and tries to infer the values of the latent variables. The analysis of this setting has been carried out rigorously in a wide range of teacher-student models for high-dimensional inference and learning tasks as diverse as planted clique [7], generalized linear models such as compressed sensing or phase retrieval [8], factorization of matrices and tensors [9, 10], or simple models of neural networks [11]. In these works, the information-theoretically optimal performance, the one obtained by an ideal Bayes-optimal estimator not limited in time and memory, has been computed.

The main question is, of course, how practical algorithms, operating in polynomial time with respect to the problem size, compare to these ideal performances. The last decade brought remarkable progress in our understanding of the performances achievable computationally. In particular, many algorithms based on message passing [12, 6], spectral methods [13], and semidefinite programs (SDP) [14] were analyzed. Depending on the signal-to-noise ratio, these algorithms were shown to be very efficient in many of those tasks. Interestingly, all these algorithms fail to reach good performance in the same region of the parameter space, and this striking observation has led to the identification of a well-defined hard phase. This is a regime of parameters in which the underlying statistical problem can be solved information-theoretically, but no efficient algorithms are known, rendering the problem essentially unsolvable for large instances. This stream of ideas is currently gaining momentum and impacting research in statistics, probability, and computer science.

The performance of noisy-gradient-descent algorithms, which are certainly the ones most used in practice, remains an entirely open question. Do they reach the same performance as message passing and SDPs? Can they enter the hard phase, do they cease to be efficient at the same point as the other approaches, or are they worse? The ambition of the present paper is to address these questions by analyzing the performance of the Langevin algorithm in the high-dimensional limit of a spiked matrix-tensor model, defined in detail in the next section.

We argue that this spiked matrix-tensor problem is both generic and relevant. Similar models have played a fundamental role in statistics and random matrix theory [15, 16]. Tensor factorization is also an important topic in mathematics and machine learning [17, 18, 19, 20, 21], and is widely used in data analysis [22]. At variance with the pure spiked tensor case [17], this mixed matrix-tensor model has the advantage that the algorithmic threshold appears at the same scale as the information-theoretic one, similarly to what is observed in simple models of neural networks [8, 11].

In the present paper, we analyse the behavior of a noisy gradient descent algorithm, also called the Langevin algorithm. We explicitly compare its performance to that of the Bayes-optimal estimator and to the best efficient algorithm known so far, the approximate message passing (AMP) algorithm [12, 6]. In particular, contrary to what has been anticipated in [23, 24], but as surmised in [25], we observe that the performance of the Langevin algorithm is hampered by the many spurious metastable states still present in the AMP-easy phase. We show that the Langevin algorithm can approach AMP performance if one uses a landscape-annealing protocol in which, instead of reducing the temperature as done in thermal annealing, one anneals the strength of the contribution of the tensor. A number of intriguing conclusions can be drawn from these results and are likely to be valid beyond the considered model.

Finally, the possibility to describe analytically the behavior of the Langevin algorithm in this model is enabled by the existence of the Crisanti-Horner-Sommers-Cugliandolo-Kurchan (CHSCK) equations in spin glass theory, describing the behavior of the Langevin dynamics in the so-called spherical $p$-spin model [26, 27], where the method can be rigorously justified [28]. These equations were a key development in the statistical physics of disordered systems that led to a detailed understanding of, and predictions about, the slow dynamics of glasses [4]. In this paper, we bring these powerful methods and ideas into the realm of statistical learning.

2 The spiked matrix-tensor model

We now detail the spiked matrix-tensor problem: a teacher generates an $N$-dimensional vector $x^*$ by choosing each of its components independently from a Gaussian distribution of zero mean and unit variance. In the large-$N$ limit this is equivalent to a flat distribution over the $N$-dimensional hypersphere defined by $\sum_i (x^*_i)^2 = N$. The teacher then also generates a symmetric matrix $Y$ and a symmetric order-$p$ tensor $T$ as

$$Y_{ij} = \frac{x^*_i x^*_j}{\sqrt{N}} + \xi_{ij}\,, \qquad T_{i_1 \dots i_p} = \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x^*_{i_1} \cdots x^*_{i_p} + \xi_{i_1 \dots i_p}\,, \qquad (1)$$

where $\xi_{ij}$ and $\xi_{i_1 \dots i_p}$ are the iid Gaussian components of a symmetric random matrix and tensor of zero mean and variance $\Delta_2$ and $\Delta_p$, respectively; they correspond to the noise corrupting the signal of the teacher. In the limits $\Delta_p \to \infty$ and $\Delta_2 \to \infty$ the above model reduces to the canonical spiked Wigner model [29] and spiked tensor model [17], respectively. The goal of the student is to infer the vector $x^*$ from the knowledge of the matrix $Y$, of the tensor $T$, of the values $\Delta_2$ and $\Delta_p$, and of the spherical prior. The scaling with $N$ specified in Eq. (1) is chosen in such a way that the information-theoretically best achievable error varies between a perfectly reconstructed spike and a random guess from the flat measure on the hypersphere. Here, and in the rest of the paper, $x^*$ denotes the $N$-dimensional vector and $x^*_i$ its components.

Analogous matrix-tensor models, where next to an order-$p$ tensor one observes a matrix created from the same spike, are studied e.g. in [22] in the context of topic modeling, or in [17]. From the optimization-theory point of view, this model is highly non-trivial, being high-dimensional and non-convex. For the purpose of the present paper this model is chosen because it involves three key ingredients: (a) It is in the class of models for which the Langevin algorithm can be analyzed exactly in the large-$N$ limit. (b) The different phase transitions, both algorithmic and information-theoretic, discussed hereafter all happen when both $\Delta_2$ and $\Delta_p$ are of order one. (c) The AMP algorithm is in this model conjectured to be optimal among polynomial algorithms. It is the second and third ingredients that are not present in the pure spiked tensor model [17], making it unsuitable for our present study. We note that the Langevin algorithm was recently analyzed for the pure spiked tensor model in [20] in a regime where the noise variance $\Delta_p$ is very small, but we also note that in that model algorithms such as tensor unfolding and semidefinite programming work better, up to considerably larger noise [17, 18].
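To make the generative process concrete, here is a minimal NumPy sketch of the teacher's side for $p = 3$, following the scaling of Eq. (1) as reconstructed above. The function name is ours, and the symmetrisation of the noise is only approximate on coinciding indices; this is an illustration, not the authors' code [40].

```python
import numpy as np
from math import factorial

def generate_spiked_matrix_tensor(N, p=3, delta2=1.0, deltap=1.0, seed=0):
    """Draw a spike x* and the noisy observations (Y, T) following Eq. (1), for p = 3."""
    assert p == 3, "sketch written for p = 3 only"
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)                       # teacher's spike, i.i.d. N(0, 1)

    # Spiked matrix: rank-one signal plus a symmetric Gaussian noise of variance delta2.
    g = rng.standard_normal((N, N)) * np.sqrt(delta2)
    xi2 = (g + g.T) / np.sqrt(2.0)                   # symmetric; off-diagonal entries ~ N(0, delta2)
    Y = np.outer(x, x) / np.sqrt(N) + xi2

    # Spiked order-3 tensor: rank-one signal plus a symmetrised Gaussian noise of variance deltap.
    g3 = rng.standard_normal((N, N, N)) * np.sqrt(deltap)
    xi3 = sum(np.transpose(g3, perm) for perm in
              [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]) / np.sqrt(6.0)
    T = np.sqrt(factorial(p - 1)) / N ** ((p - 1) / 2) * np.einsum('i,j,k->ijk', x, x, x) + xi3
    return x, Y, T

x_star, Y, T = generate_spiked_matrix_tensor(N=50, delta2=0.5, deltap=1.0)
```

The student only receives $Y$, $T$, the variances $\Delta_2$, $\Delta_p$ and the spherical prior.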

3 Bayes-optimal estimation and message-passing

In this section we present the performance of the Bayes-optimal estimator and of the approximate message passing algorithm. This theory is based on a straightforward adaptation of analogous results known for the pure spiked matrix model [29, 30, 9] and for the pure spiked tensor model [17, 10].

The Bayes-optimal estimator is defined as the one that, among all estimators, minimizes the mean-squared error (MSE) with the spike $x^*$. Starting from the posterior probability distribution

$$P(x \mid Y, T) = \frac{1}{Z(Y,T)}\, P_X(x) \prod_{i<j} e^{-\frac{1}{2\Delta_2}\left(Y_{ij} - \frac{x_i x_j}{\sqrt N}\right)^2} \prod_{i_1 < \dots < i_p} e^{-\frac{1}{2\Delta_p}\left(T_{i_1 \dots i_p} - \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x_{i_1} \cdots x_{i_p}\right)^2}\,, \qquad (2)$$

the Bayes-optimal estimator reads

$$\hat x_i^{\rm MMSE} = \int \mathrm{d}x \; x_i \, P(x \mid Y, T)\,. \qquad (3)$$

To simplify notation, and to make contact with the energy landscape and the statistical physics notation, it is convenient to introduce the energy cost function, or Hamiltonian, as

$$\mathcal H(x) = -\sum_{i<j} \frac{Y_{ij}\, x_i x_j}{\Delta_2 \sqrt N} \;-\; \sum_{i_1 < \dots < i_p} \frac{\sqrt{(p-1)!}\; T_{i_1 \dots i_p}\, x_{i_1} \cdots x_{i_p}}{\Delta_p\, N^{(p-1)/2}}\,, \qquad (4)$$

so that, keeping in mind that the spherical constraint $\sum_i x_i^2 = N$ is satisfied, the posterior is written as $P(x \mid Y, T) = e^{-\mathcal H(x)}/Z$, where $Z$ is the normalizing partition function.
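As a concrete illustration, here is a small NumPy sketch of this cost function and of its gradient for $p = 3$, with prefactors taken from our reconstruction of Eqs. (1) and (4) and diagonal terms neglected (harmless at large $N$); the function names are our own.

```python
import numpy as np

def hamiltonian(x, Y, T, delta2, delta3):
    """Energy of Eq. (4) for p = 3 (diagonal terms neglected, which is harmless at large N)."""
    N = x.shape[0]
    e2 = -0.5 / (delta2 * np.sqrt(N)) * (x @ Y @ x)                                 # matrix channel
    e3 = -np.sqrt(2.0) / (6.0 * delta3 * N) * np.einsum('ijk,i,j,k->', T, x, x, x)  # tensor channel
    return e2 + e3

def grad_hamiltonian(x, Y, T, delta2, delta3):
    """Gradient of the energy, i.e. the drift entering the Langevin dynamics of Sec. 4."""
    N = x.shape[0]
    g2 = -(Y @ x) / (delta2 * np.sqrt(N))
    g3 = -np.sqrt(2.0) / (2.0 * delta3 * N) * np.einsum('ijk,j,k->i', T, x, x)
    return g2 + g3
```

In the Bayes-optimal setting the posterior is then proportional to $e^{-\mathcal H(x)}$ restricted to the sphere.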

With the use of the replica theory and its recent proofs from [9, 31, 10] one can establish rigorously that the mean squared error achieved by the Bayes-optimal estimator is given as $\mathrm{MMSE} = 1 - m^*$, where $m^*$ is the global maximizer of the so-called free entropy of the problem

$$\Phi(m) = \frac12 \left[ \frac{m^2}{2\Delta_2} + \frac{m^p}{p\,\Delta_p} + m + \log(1-m) \right]. \qquad (5)$$

This expression is derived, and proven, in the appendix Sec. B.1.
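Assuming the scalar form of the free entropy reconstructed in Eq. (5) (the proven expression is given in appendix Sec. B.1), the Bayes-optimal error can be read off by locating its global maximizer numerically, as in the following illustrative sketch; the grid search and function name are ours.

```python
import numpy as np

def bayes_optimal_mse(delta2, deltap, p=3, grid_points=200000):
    """Scan m in [0, 1) for the global maximizer of the reconstructed free entropy (5)."""
    m = np.linspace(0.0, 1.0 - 1e-9, grid_points)
    phi = 0.5 * (m**2 / (2.0 * delta2) + m**p / (p * deltap) + m + np.log(1.0 - m))
    m_star = m[np.argmax(phi)]
    return 1.0 - m_star                               # MMSE = 1 - m*

print(bayes_optimal_mse(delta2=0.5, deltap=1.0))      # low matrix noise: MMSE well below 1
print(bayes_optimal_mse(delta2=2.0, deltap=5.0))      # strong noise: MMSE = 1, no better than random
```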

We now turn to approximate message-passing (AMP) [17, 10], which is the best algorithm known so far for this problem. AMP is an iterative algorithm inspired by the work of Thouless, Anderson and Palmer in statistical physics [32]. We give its explicit form in Sec. B.1. Most remarkably, the performance of AMP can be evaluated by tracking its evolution with the iteration time $t$, and it is given in terms of the (possibly local) maximum of the above free entropy that is reached as a fixed point of the following iterative process

$$m^{t+1} = \frac{\frac{m^t}{\Delta_2} + \frac{(m^t)^{p-1}}{\Delta_p}}{1 + \frac{m^t}{\Delta_2} + \frac{(m^t)^{p-1}}{\Delta_p}}\,, \qquad (6)$$

with initial condition $m^{t=0} = \epsilon$ with $\epsilon$ small and positive. Eq. (6) is called the State Evolution of AMP and its validity is proven for closely related models in [33]. We denote $m^{\rm AMP}$ the corresponding fixed point and $\mathrm{MSE}_{\rm AMP} = 1 - m^{\rm AMP}$ the corresponding estimation error.
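The following short Python sketch iterates this state evolution from an uninformative initialization, using the update rule exactly as we reconstructed it in Eq. (6); it is a toy illustration, not the authors' released integrators [40].

```python
import numpy as np

def state_evolution(delta2, deltap, p=3, eps=1e-8, iters=5000):
    """Iterate the state evolution (6) from an uninformative initialization m = eps."""
    m = eps
    for _ in range(iters):
        snr = m / delta2 + m ** (p - 1) / deltap      # effective signal-to-noise ratio
        m = snr / (1.0 + snr)
    return m                                           # fixed point m_AMP; MSE_AMP = 1 - m_AMP

for d2 in (0.5, 0.9, 1.5):
    m_amp = state_evolution(delta2=d2, deltap=1.0)
    print(f"delta2 = {d2}: m_AMP = {m_amp:.3f}, MSE_AMP = {1.0 - m_amp:.3f}")
```

By construction, the fixed point reached this way is a stationary point of the free entropy (5), which is why AMP may end up in a local rather than the global maximizer.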

The phase diagram presented in Fig. 1 summarizes this theory for the spiked matrix-tensor model with $p=3$. It is deduced by investigating the local maxima of the scalar function (5). Notably, we observe that the phase diagram in terms of $\Delta_2$ and $\Delta_p$ splits into three phases:

  • Easy, in green, for $\Delta_2 < 1$ and any $\Delta_p$: The fixed point of the state evolution (6) is the global maximizer of the free entropy (5), and $\mathrm{MSE}_{\rm AMP} = \mathrm{MMSE} < 1$.

  • Hard, in orange, for $\Delta_2 > 1$ and low $\Delta_p$: The fixed point of the state evolution (6) is not the global maximizer of the free entropy (5), and $\mathrm{MSE}_{\rm AMP} = 1 > \mathrm{MMSE}$.

  • Impossible, in red, for $\Delta_2 > 1$ and high $\Delta_p$: The fixed point of the state evolution (6) is the global maximizer of the free entropy (5), and $\mathrm{MSE}_{\rm AMP} = \mathrm{MMSE} = 1$.

For the spiked matrix-tensor model with $p > 3$ the phase diagram is slightly richer and is presented in the appendix.

Figure 1: Phase diagram of the spiked matrix-tensor model with $p=3$ (a matrix plus an order-3 tensor are observed). In the easy (green) region AMP achieves the optimal error, smaller than a random pick from the prior. In the impossible region (red) the optimal error is as bad as a random pick from the prior, and AMP achieves it as well. In the hard region (orange) the optimal error is low, but AMP does not find an estimator better than a random pick from the prior. The performance of the Langevin algorithm is strictly worse than that of AMP, in the sense that the hard region increases up to the line depicted in green dots. The blue dashed-dotted line delimits the region of existence of stable 1RSB states.

4 Langevin Algorithm and its Analysis

We now turn to the core of the paper and the analysis of the Langevin algorithm. In statistics, the most commonly used way to compute the Bayes-optimal estimator (3) is to attempt to sample the posterior distribution (2) and use several independent samples to compute the expectation in (3). In order to do that, one needs to set up a stochastic dynamics on $x$ whose stationary measure at long times is given by the posterior measure (2). The Langevin algorithm is one of the possibilities (others include notably Monte Carlo Markov chains). The common bottleneck is that the time needed to achieve stationarity can in general be exponential in the system size, in which case the algorithm is practically useless. However, this is not always the case, and there are regions in parameter space where one can expect the relaxation to the posterior measure to happen on finite timescales. It is therefore crucial to understand where this happens and what the associated relaxation timescales are.

The Langevin algorithm on the hypersphere with Hamiltonian given by Eq. (4) reads

$$\dot x_i(t) = -\mu(t)\, x_i(t) - \frac{\partial \mathcal H(x(t))}{\partial x_i(t)} + \eta_i(t)\,, \qquad (7)$$

where $\eta_i(t)$ is a zero-mean noise term, with $\langle \eta_i(t)\,\eta_j(t') \rangle = 2T\,\delta_{ij}\,\delta(t-t')$, where the average is with respect to the realizations of the noise and $T=1$ corresponds to sampling the posterior. The Lagrange multiplier $\mu(t)$ is chosen in such a way that the dynamics remains on the hypersphere. In the large-$N$ limit it can be expressed in terms of the first and second terms of (4) evaluated along the trajectory $x(t)$.
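For intuition, here is a minimal discretised simulation of this dynamics in NumPy. Instead of computing the Lagrange multiplier $\mu(t)$ explicitly, it simply renormalises $x$ onto the sphere after every step, a common practical substitute; the step size and function names are our own choices, and this toy simulation is not the analysis tool used in the paper, which instead integrates the LSE equations introduced below.

```python
import numpy as np

def langevin_on_sphere(grad, x0, temperature=1.0, dt=1e-3, steps=200000,
                       x_star=None, record_every=1000, seed=0):
    """Discretised version of the Langevin dynamics (7).

    `grad` is a callable returning the gradient of the Hamiltonian (for instance
    `grad_hamiltonian` from the sketch above); instead of the Lagrange multiplier
    mu(t), the iterate is projected back onto the sphere |x|^2 = N at every step.
    `x_star` is only used to monitor the correlation with the signal.
    """
    rng = np.random.default_rng(seed)
    N = x0.shape[0]
    x = x0 * np.sqrt(N) / np.linalg.norm(x0)
    overlaps = []
    for t in range(steps):
        noise = rng.standard_normal(N) * np.sqrt(2.0 * temperature * dt)
        x = x - dt * grad(x) + noise
        x *= np.sqrt(N) / np.linalg.norm(x)            # enforce the spherical constraint
        if x_star is not None and t % record_every == 0:
            overlaps.append(float(x @ x_star) / N)     # correlation with the spike
    return x, overlaps
```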

The presented spiked matrix-tensor model falls into the particular class of spherical $2{+}p$-spin glasses [34, 35] for which the performance of the Langevin algorithm can be tracked exactly in the large-$N$ limit via a set of integro-partial differential equations [26, 27], the CHSCK equations mentioned above. We call this generalised version of the CHSCK equations the Langevin State Evolution (LSE) equations, in analogy with the state evolution of AMP.

In order to write the LSE equations, we define three dynamical correlation functions,

$$C(t,t') = \frac{1}{N} \sum_{i=1}^N x_i(t)\, x_i(t')\,, \qquad (8)$$
$$R(t,t') = \frac{1}{N} \sum_{i=1}^N \frac{\delta x_i(t)}{\delta h_i(t')}\,, \qquad (9)$$
$$\overline{C}(t) = \frac{1}{N} \sum_{i=1}^N x_i(t)\, x^*_i\,, \qquad (10)$$

where $h_i(t')$ is a pointwise external field applied at time $t'$ to the Hamiltonian as $\mathcal H \to \mathcal H - \sum_i h_i(t)\, x_i(t)$. We note that the correlation functions defined above depend on the realization of the thermal history (i.e. of the noise $\eta$) and on the disorder (here the matrix $Y$ and the tensor $T$). However, in the large-$N$ limit they all concentrate around their averages. We thus consider their averages in the $N \to \infty$ limit, $C(t,t') = \lim_{N\to\infty} \frac1N \sum_i \mathbb{E}\!\left[ x_i(t)\, x_i(t')\right]$, and analogously for $R$ and $\overline{C}$. Standard field-theoretical methods [36] or dynamical cavity method arguments [37] can then be used to obtain a closed set of integro-differential equations for the averaged dynamical correlation functions, describing the average global evolution of the system under the Langevin algorithm. The resulting LSE equations are (see the appendix Sec. C for a complete derivation)

(11)

where we have defined auxiliary kernels combining the matrix and tensor contributions (see the appendix Sec. C). The Lagrange multiplier $\mu(t)$ is fixed by the spherical constraint through the condition $C(t,t) = 1$. Furthermore, causality implies that $R(t,t') = 0$ if $t' > t$. Finally, the Ito convention on the stochastic equation (7) fixes the value of the response at equal times.

5 Behavior of the Langevin algorithm

In order to assess the performance of the Langevin algorithm and compare it with AMP, we notice that the correlation with the signal, $\overline{C}(t)$, is directly related to the accuracy of the algorithm. We solve the differential equations (11) numerically along the lines of [38, 39]; for a detailed procedure see the appendix Sec. D, codes available at [40]. In Fig. 2 we plot the correlation with the spike as a function of the running time for $p=3$, fixed $\Delta_2$, and several values of $\Delta_p$. In the inset of the plot we compare it to the same quantity obtained from the state evolution of the AMP algorithm.

Figure 2: Evolution of the correlation with the signal in the Langevin algorithm at fixed noise on the matrix ($\Delta_2$) and different noises on the tensor ($\Delta_p$). As we approach the estimated transition, the time required to jump towards the solution diverges. Inset: the behavior of the correlation with the signal as a function of the iteration time for the AMP algorithm, for the same values of $\Delta_p$.

For the Langevin algorithm in Fig. 2 we see a striking pattern. One would expect that as the noise decreases the inference problem gets easier: the correlation with the signal is larger and is reached sooner in the iteration. This is, after all, exactly what we observe for the AMP algorithm in the inset of Fig. 2. Also for the Langevin algorithm the plateau reached at large times becomes higher (better accuracy) as the noise is reduced. Furthermore, the height of the plateau coincides with that reached by AMP, thus testifying that the algorithm reached equilibrium. However, contrary to AMP, the relaxation time of the Langevin algorithm increases dramatically as $\Delta_p$ diminishes (notice the log scale on the x-axis of Fig. 2, as compared to the linear scale of the inset). We define the relaxation time $\tau$ as the time it takes for the correlation to reach a given threshold value. We then plot this equilibration time in the insets of Fig. 3 as a function of the noise $\Delta_p$ or $\Delta_2$, having fixed $\Delta_2$ (upper panel) or $\Delta_p$ (lower panel), respectively.

Figure 3: Power-law fit of the relaxation times of the Langevin algorithm; the point of divergence marks the limit of the Langevin-easy region. These fits have been performed both for fixed $\Delta_2$ (blue circles and yellow crosses) and for fixed $\Delta_p$ (pink circles). The circles are obtained with the numerical solution of the LSE that uses a dynamical grid, while the crosses are obtained using a fixed grid (details in Sec. D.1).

The data are clearly consistent with a divergence of the relaxation time $\tau$ at finite values of $\Delta_p$ and $\Delta_2$, which we denote $\Delta_p^{\rm Langevin}$ and $\Delta_2^{\rm Langevin}$. We fit the data with a power law diverging at these points, obtaining the fits shown in Fig. 3 both in the case of fixed $\Delta_2$ and in the case of fixed $\Delta_p$. We are not able to strictly prove that the divergence of the relaxation time truly occurs, but our results imply at the very least that past these values the Langevin algorithm (7) is not a practical solver for the spiked mixed matrix-tensor problem. We call the region of $\Delta_2$ and $\Delta_p$ where the AMP algorithm works optimally without problems yet the Langevin algorithm does not the Langevin-hard region. Both $\Delta_p^{\rm Langevin}$ and $\Delta_2^{\rm Langevin}$ are then plotted in Fig. 1 with green points (circles and stars) and consistently delimit the Langevin-hard region, which extends considerably into the region where the AMP algorithm works optimally in a small number of iterations. Our main conclusion is thus that the Langevin algorithm designed to sample the posterior measure works efficiently in a considerably smaller region of parameters than AMP, as quantified in Fig. 1.
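The kind of fit described above can be reproduced with a few lines of SciPy. The sketch below assumes the fitted form is a power law diverging at a finite noise value, $\tau \simeq a\,(\Delta_p - \Delta_p^{\rm Langevin})^{-\delta}$, and uses synthetic stand-in numbers (not the paper's measured relaxation times) purely to exercise the fitting routine.

```python
import numpy as np
from scipy.optimize import curve_fit

def diverging_power_law(delta_p, a, delta_c, exponent):
    """tau(delta_p) = a * (delta_p - delta_c)^(-exponent), diverging as delta_p -> delta_c+."""
    return a * (delta_p - delta_c) ** (-exponent)

# Synthetic stand-in data generated from a known power law, only to exercise the fit;
# in practice delta_p_values and tau_values come from the numerical solution of the LSE (11).
rng = np.random.default_rng(1)
true_params = (2.0, 0.7, 1.5)
delta_p_values = np.linspace(0.8, 2.0, 15)
tau_values = diverging_power_law(delta_p_values, *true_params) \
    * rng.lognormal(mean=0.0, sigma=0.05, size=delta_p_values.size)

popt, _ = curve_fit(diverging_power_law, delta_p_values, tau_values,
                    p0=[1.0, 0.5, 1.0],
                    bounds=([0.0, 0.0, 0.0], [np.inf, delta_p_values.min() - 1e-3, np.inf]))
a_fit, delta_langevin, delta_exp = popt
print(f"estimated divergence point {delta_langevin:.3f} and exponent {delta_exp:.2f}")
```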

Fig. 4 presents another way to depict the observed data: the correlation reached after a fixed running time is plotted as a function of the tensor noise variance $\Delta_p$. The results of AMP are depicted with dotted lines and, as one would expect, decrease monotonically as the noise increases. The equilibrium value (black dashed) is reached within a few dozen iterations. On the contrary, the correlation reached by the Langevin algorithm after the same time is non-monotonic and close to zero for small values of the noise, signalling again a relaxation time that diverges as $\Delta_p$ is decreased.

6 Glassy nature of the Langevin-hard phase

We notice that for $\Delta_p \to 0$ and any finite $\Delta_2$, after a suitable rescaling of time and temperature, the LSE equations coincide with the ones of the pure spiked tensor ($\Delta_2 \to \infty$) at very low temperature, slightly perturbed by additional terms. The glassy nature of the landscape of that model [21, 20] qualitatively justifies why the Langevin algorithm remains trapped in one of the spurious minima, hence leading to a Langevin-hard phase for small $\Delta_p$ and any finite $\Delta_2$ (in particular in the AMP-easy phase $\Delta_2 < 1$).

In order to obtain a quantitative estimate of the extent of the glassy region, we repeated the analysis of [25] in the present model (details in the appendix Sec. E). In particular, we compute the entropy of metastable states (finite-temperature extensions of spurious minima), a.k.a. the complexity, using the one-step replica symmetry breaking (1RSB) assumption for the structure of these states. We then analyze the stability of the 1RSB solution with respect to further levels of RSB and draw the region of parameters for which states of positive complexity exist and are stable. This happens to be below the blue dashed-dotted line depicted in Fig. 1.

We observe that indeed in the regime where the 1RSB solution indicates the existence of an exponential number of spurious metastable states, the Langevin algorithm does not work. Moreover, the curve delimiting the existence of stable 1RSB states has the same trend as the boundary of the Langevin-hard regime. Yet the Langevin-hard phase extends to larger values of $\Delta_p$, see Fig. 1. This quantitative discrepancy can be due either to the fact that we did not take into account effects of full-step replica symmetry breaking [41], or to the relation between the static replica calculation and the behavior of the Langevin algorithm being more complex than anticipated in the literature [41]. We leave this point for investigation in future work.

Figure 4: Correlation with the signal for AMP after a fixed number of iterations and for the Langevin algorithm after a fixed running time, at fixed $\Delta_2$.

7 Better performance by annealing the landscape

A generic strategy to avoid metastable states is simulated annealing [42], where the thermal noise is given by $\langle \eta_i(t)\,\eta_j(t') \rangle = 2\,T(t)\,\delta_{ij}\,\delta(t-t')$ and the temperature $T(t)$ is time dependent and slowly decreased during the dynamics, starting from very large values towards the target temperature, $T=1$ in our case. We tried this algorithm and found that it does not succeed in recovering the signal.

As discussed above, it is the tensor part of the Hamiltonian that induces glassiness. Therefore, we consider a protocol in which the contribution of the tensor is weak at the initial stages of the dynamics and increases gradually towards its Bayes-optimal strength. Specifically, we focus on a time-dependent Hamiltonian in which the strength of the tensor term is changed dynamically following an exponential schedule whose target value is the Bayes-optimal one, and whose characteristic timescale we denote $\tau_{\rm ann}$. In Fig. 5 we show that if $\tau_{\rm ann}$ is sufficiently large, the Langevin-hard phase is destroyed and this procedure approaches the performance of AMP. More specifically, before the tensor component becomes relevant, i.e. for times much smaller than $\tau_{\rm ann}$, the modified Langevin algorithm tends to the equilibrium solution of the pure spiked matrix model. The optimal AMP reconstruction is subsequently reached when the tensor component gains its full weight, which occurs later for slower annealing. Conversely, if $\tau_{\rm ann}$ is too small, the Langevin algorithm does not have the time to escape the glassy region of the tensor component before its contribution to the dynamics becomes relevant and irremediably hampers the final reconstruction of the signal.
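A minimal sketch of such a landscape-annealing protocol is the following: the weight of the tensor part of the gradient is ramped up on a timescale `tau_ann`, starting from essentially zero. The particular schedule $1 - e^{-t/\tau_{\rm ann}}$, the function names, and the spherical projection replacing the Lagrange multiplier are our own illustrative choices, not necessarily the exact protocol used for Fig. 5.

```python
import numpy as np

def annealed_langevin(grad2, gradp, x0, tau_ann, dt=1e-3, steps=400000, seed=0):
    """Langevin dynamics in which the tensor part of the gradient is switched on gradually.

    `grad2` and `gradp` are callables for the gradients of the matrix and tensor terms of
    the Hamiltonian (e.g. the two terms of `grad_hamiltonian` in the earlier sketch); the
    weight of the tensor term grows from 0 to its Bayes-optimal value 1 on a timescale tau_ann.
    """
    rng = np.random.default_rng(seed)
    N = x0.shape[0]
    x = x0 * np.sqrt(N) / np.linalg.norm(x0)
    for step in range(steps):
        t = step * dt
        weight = 1.0 - np.exp(-t / tau_ann)            # illustrative exponential schedule
        noise = rng.standard_normal(N) * np.sqrt(2.0 * dt)
        x = x - dt * (grad2(x) + weight * gradp(x)) + noise
        x *= np.sqrt(N) / np.linalg.norm(x)            # spherical constraint
    return x
```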

It is interesting to underline that the above finding is somewhat paradoxical from the point of view of Bayesian inference. In the present setting we know perfectly the model that generated the data and all its parameters, yet we see that for the Langevin algorithm it is computationally advantageous to mismatch the parameter $\Delta_p$ in order to reach faster convergence to equilibrium. This is particularly striking given that for AMP it has been proven in [7] that mismatching the parameters can never improve the performance.

Figure 5: Behaviour of the correlation with the signal in the Langevin algorithm where the tensor-related temperature is annealed with characteristic timescale $\tau_{\rm ann}$, in the Langevin-hard regime at fixed $\Delta_2$ and $\Delta_p$. We show the behavior for several annealing rates (solid lines) and compare to the quenched Langevin algorithm (dashed line close to zero), and to the value reached by AMP for the full model (upper dotted line) and for the pure matrix model (lower dotted line). We see that unless the annealing is very fast, the correlation reaches the AMP value; when the annealing is very slow it simply takes longer to reach it.

8 Perspectives

In this work we have investigated the performance of the Langevin algorithm considered as a tool to sample the posterior measure in the spiked matrix-tensor model. We have shown that the Langevin algorithm fails to find the signal in part of the AMP-easy region. Our analysis is based on the Langevin State Evolution equations that describe the evolution of the algorithm in the large-size limit.

We also found a landscape-annealing protocol under which the Langevin algorithm is able to match the performance of AMP, by relying on knowledge about the generative model. It would be an interesting direction for future work to investigate whether the performance of the Langevin algorithm can be improved in some model-blind manner.

While we studied here the spiked matrix-tensor model, we expect that our findings are universal because they are due to the glassiness of the hard phase and therefore should apply to any local sampling dynamics, e.g. to Monte Carlo Markov chains. An interesting extension of this work would investigate algorithms closer to stochastic gradient descent and models closer to current neural network architectures.

Acknowledgments

We thank G. Folena and G. Ben Arous for precious discussions. We thank K. Miyazaki for sharing his code for the numerical integration of CHSCK equations. We acknowledge funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe; from the French National Research Agency (ANR) grant PAIL; from ”Investissements d’Avenir” LabEx PALM (ANR-10-LABX-0039-PALM) (SaMURai and StatPhysDisSys); and from the Simons Foundation (#454935, Giulio Biroli).

References

  • [1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
  • [2] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  • [3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [4] Jean-Philippe Bouchaud, Leticia F Cugliandolo, Jorge Kurchan, and Marc Mézard. Out of equilibrium dynamics in spin-glasses and other glassy systems. Spin glasses and random fields, pages 161–223, 1998.
  • [5] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A, 45:6056–6091, Apr 1992.
  • [6] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016.
  • [7] Yash Deshpande and Andrea Montanari. Finding hidden cliques of size $\sqrt{N/e}$ in nearly linear time. Foundations of Computational Mathematics, 15(4):1069–1128, 2015.
  • [8] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Phase transitions, optimal errors and optimality of message-passing in generalized linear models. in COLT’18, arXiv preprint arXiv:1708.03395, 2017.
  • [9] Jean Barbier, Mohamad Dia, Nicolas Macris, Florent Krzakala, Thibault Lesieur, and Lenka Zdeborová. Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula. In Advances in Neural Information Processing Systems 29, page 424–432. 2016.
  • [10] Thibault Lesieur, Léo Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborová. Statistical and computational phase transitions in spiked tensor estimation. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 511–515. IEEE, 2017.
  • [11] Benjamin Aubin, Antoine Maillard, Jean Barbier, Florent Krzakala, Nicolas Macris, and Lenka Zdeborová. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In Advances in Neural Information Processing Systems, 2018.
  • [12] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, Nov 2009.
  • [13] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Science, 110:20935–20940, December 2013.
  • [14] Samuel B Hopkins and David Steurer. Bayesian estimation from few samples: community detection and related problems. arXiv preprint arXiv:1710.00264, 2017.
  • [15] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.
  • [16] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–693, 2009.
  • [17] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014.
  • [18] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Conference on Learning Theory, pages 956–1006, 2015.
  • [19] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions. In Advances in Neural Information Processing Systems, pages 3653–3663, 2017.
  • [20] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholds for tensor PCA. arXiv preprint arXiv:1808.00921, 2018.
  • [21] Valentina Ros, Gerard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy models: ruggedness, arrangements of local minima and phase transitions. arXiv preprint arXiv:1804.02686, to appear in PRX, 2018.
  • [22] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
  • [23] Florent Krzakala and Lenka Zdeborová. Hiding quiet solutions in random constraint satisfaction problems. Physical review letters, 102(23):238701, 2009.
  • [24] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.
  • [25] Fabrizio Antenucci, Silvio Franz, Pierfrancesco Urbani, and Lenka Zdeborová. On the glassy nature of the hard phase in inference problems. to appear in Phys. Rev. X, arXiv preprint arXiv:1805.05857, 2018.
  • [26] A Crisanti, H Horner, and H-J Sommers. The spherical -spin interaction spin-glass model. Zeitschrift für Physik B Condensed Matter, 92(2):257–271, 1993.
  • [27] Leticia F Cugliandolo and Jorge Kurchan. Analytical solution of the off-equilibrium dynamics of a long-range spin-glass model. Physical Review Letters, 71(1):173, 1993.
  • [28] Gerard Ben Arous, Amir Dembo, and Alice Guionnet. Cugliandolo-Kurchan equations for dynamics of spin-glasses. Probability theory and related fields, 136(4):619–660, 2006.
  • [29] Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse PCA. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 2197–2201. IEEE, 2014.
  • [30] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications. Journal of Statistical Mechanics: Theory and Experiment, 2017(7):073403, 2017.
  • [31] Marc Lelarge and Léo Miolane. Fundamental limits of symmetric low-rank matrix estimation. Probability Theory and Related Fields, pages 1–71, 2016.
  • [32] David J Thouless, Philip W Anderson, and Robert G Palmer. Solution of ‘Solvable model of a spin glass’. Philosophical Magazine, 35(3):593–601, 1977.
  • [33] Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115–144, 2013.
  • [34] Andrea Crisanti and Luca Leuzzi. Spherical 2+p spin-glass model: An exactly solvable model for glass to spin-glass transition. Physical Review Letters, 93(21):217203, 2004.
  • [35] Andrea Crisanti and Luca Leuzzi. Spherical 2+p spin-glass model: An analytically solvable model with a glass-to-glass transition. Physical Review B, 73(1):014412, 2006.
  • [36] Paul Cecil Martin, ED Siggia, and HA Rose. Statistical dynamics of classical systems. Physical Review A, 8(1):423, 1973.
  • [37] Marc Mézard, Giorgio Parisi, and Miguel-Angel Virasoro. Spin glass theory and beyond. World Scientific Publishing, 1987.
  • [38] Bongsoo Kim and Arnulf Latz. The dynamics of the spherical p-spin model: From microscopic to asymptotic. EPL (Europhysics Letters), 53(5):660, 2001.
  • [39] Ludovic Berthier, Giulio Biroli, J-P Bouchaud, Walter Kob, Kunimasa Miyazaki, and DR Reichman. Spontaneous and induced dynamic fluctuations in glass formers. I. General results and dependence on ensemble and dynamics. The Journal of chemical physics, 126(18):184503, 2007.
  • [40] Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Langevin state evolution integrators, 2018. Available at: https://github.com/sphinxteam/spiked_matrix-tensor.
  • [41] Tommaso Rizzo. Replica-symmetry-breaking transitions and off-equilibrium dynamics. Physical Review E, 88(3):032135, 2013.
  • [42] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983.
  • [43] Andrea Crisanti and Luca Leuzzi. Exactly solvable spin–glass models with ferromagnetic couplings: The spherical multi-p-spin model in a self-induced field. Nuclear Physics B, 870(1):176–204, 2013.
  • [44] Marc Mézard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
  • [45] Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning, pages 208–240. Springer, 2004.
  • [46] Satish Babu Korada and Nicolas Macris. Exact solution of the gauge symmetric p-spin glass model on a complete graph. Journal of Statistical Physics, 136(2):205–230, 2009.
  • [47] Florent Krzakala, Jiaming Xu, and Lenka Zdeborová. Mutual information in rank-one matrix estimation. In 2016 IEEE Information Theory Workshop (ITW), pages 71–75, Sept 2016.
  • [48] Michael Aizenman, Robert Sims, and Shannon L Starr. Extended variational principle for the sherrington-kirkpatrick spin-glass model. Physical Review B, 68(21):214403, 2003.
  • [49] Jean Barbier and Nicolas Macris. The adaptive interpolation method: A simple scheme to prove replica formulas in bayesian inference. to appear in Probability Theory and Related Fields, arXiv preprint arXiv:1705.02780, 2017.
  • [50] Ahmed El Alaoui and Florent Krzakala. Estimation in the spiked wigner model: A short proof of the replica formula. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1874–1878, June 2018.
  • [51] Jean-Christophe Mourrat. Hamilton-Jacobi equations for mean-field disordered systems. arXiv preprint arXiv:1811.01432, 2018.
  • [52] Dongning Guo, S. Shamai, and S. Verdú. Mutual information and minimum mean-square error in gaussian channels. IEEE Transactions on Information Theory, 51(4):1261–1282, April 2005.
  • [53] Hans-Otto Georgii. Gibbs measures and phase transitions, volume 9. Walter de Gruyter, 2011.
  • [54] Nicolas Macris. Griffith-Kelly-Sherman correlation inequalities: A useful tool in the theory of error correcting codes. IEEE Transactions on Information Theory, 53(2):664–683, Feb 2007.
  • [55] Andrea Montanari. Estimating random variables from random sparse observations. European Transactions on Telecommunications, 19(4):385–403, 2008.
  • [56] Amin Coja-Oghlan, Florent Krzakala, Will Perkins, and Lenka Zdeborova. Information-theoretic thresholds from the cavity method. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 146–157, 2017.
  • [57] Hidetoshi Nishimori. Statistical physics of spin glasses and information processing: an introduction, volume 111. Clarendon Press, 2001.
  • [58] Francesco Guerra and Fabio Lucio Toninelli. The thermodynamic limit in mean field spin glass models. Communications in Mathematical Physics, 230(1):71–79, 2002.
  • [59] Federico Ricci-Tersenghi, Guilhem Semerjian, and Lenka Zdeborová. Typology of phase transitions in Bayesian inference problems. arXiv preprint arXiv:1806.11013.
  • [60] Leticia F Cugliandolo. Course 7: Dynamics of glassy systems. In Slow Relaxations and nonequilibrium dynamics in condensed matter, pages 367–521. Springer, 2003.
  • [61] Tommaso Castellani and Andrea Cavagna. Spin glass theory for pedestrians. Journal of Statistical Mechanics: Theory and Experiment, 2005:P05012, 2005.
  • [62] Elisabeth Agoritsas, Giulio Biroli, Pierfrancesco Urbani, and Francesco Zamponi. Out-of-equilibrium dynamical mean-field equations for the perceptron model. Journal of Physics A: Mathematical and Theoretical, 51(8):085002, 2018.
  • [63] Ludovic Berthier, Giulio Biroli, Jean-Philippe Bouchaud, Walter Kob, Kunimasa Miyazaki, and David R Reichman. Spontaneous and induced dynamic correlations in glass formers. II. Model calculations and comparison to numerical simulations. The Journal of chemical physics, 126(18):184504, 2007.
  • [64] Rémi Monasson. Structural glass transition and the entropy of the metastable states. Physical review letters, 75(15):2847, 1995.
  • [65] Francesco Zamponi. Mean field theory of spin glasses. arXiv preprint arXiv:1008.4844, 2010.
  • [66] Andrea Crisanti and H-J Sommers. The spherical p-spin interaction spin glass model: the statics. Zeitschrift für Physik B Condensed Matter, 87(3):341–354, 1992.
  • [67] LF Cugliandolo and J Kurchan. Weak ergodicity breaking in mean-field spin-glass models. Philosophical Magazine B, 71(4):501–514, 1995.
  • [68] LF Cugliandolo and J Kurchan. On the out-of-equilibrium relaxation of the Sherrington-Kirkpatrick model. Journal of Physics A: Mathematical and General, 27(17):5749, 1994.
  • [69] Leticia F Cugliandolo and Jorge Kurchan. Aging and effective temperatures in the low temperature mode-coupling equations. Progress of Theoretical Physics Supplement, 126:407–414, 1997.

Appendix

Appendix A Definition of the spiked matrix-tensor model

We consider a teacher-student setting in which the teacher constructs a matrix and a tensor from a randomly sampled signal and the student is asked to recover the signal from the observation of the matrix and tensor provided by the teacher [6].

The signal $x^*$ is an $N$-dimensional vector whose entries are real i.i.d. random variables sampled from the normal distribution (i.e. the prior is $P_X(x^*) = \prod_{i=1}^N \mathcal N(x^*_i; 0, 1)$). The teacher generates from the signal a symmetric matrix $Y$ and a symmetric tensor $T$ of order $p$. Those two objects are then transmitted through two noisy channels with variances $\Delta_2$ and $\Delta_p$, so that at the end one has two noisy observations given by

$$Y_{ij} = \frac{x^*_i x^*_j}{\sqrt N} + \xi_{ij}\,, \qquad (12)$$
$$T_{i_1 \dots i_p} = \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x^*_{i_1} \cdots x^*_{i_p} + \xi_{i_1 \dots i_p}\,, \qquad (13)$$

where, for $i<j$ and $i_1 < \dots < i_p$, $\xi_{ij}$ and $\xi_{i_1 \dots i_p}$ are i.i.d. random variables distributed according to $\mathcal N(0, \Delta_2)$ and $\mathcal N(0, \Delta_p)$. The $\xi_{ij}$ and $\xi_{i_1 \dots i_p}$ form a symmetric random matrix and tensor, respectively. Given $Y$ and $T$ the inference task is to reconstruct the signal $x^*$.

In order to solve this problem we consider the Bayesian approach. This starts from the assumption that both the matrix and the tensor have been produced by a process of the same kind as the one described by Eqs. (12)-(13). Furthermore, we assume to know the statistical properties of the channels, namely the two variances $\Delta_2$ and $\Delta_p$, and the prior on the signal. Given this, the posterior probability distribution over the signal is obtained through the Bayes formula

$$P(x \mid Y, T) = \frac{P(Y, T \mid x)\, P_X(x)}{P(Y, T)}\,, \qquad (14)$$

where

$$P(Y, T \mid x) = \prod_{i<j} \frac{e^{-\frac{1}{2\Delta_2}\left(Y_{ij} - \frac{x_i x_j}{\sqrt N}\right)^2}}{\sqrt{2\pi\Delta_2}} \prod_{i_1<\dots<i_p} \frac{e^{-\frac{1}{2\Delta_p}\left(T_{i_1\dots i_p} - \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x_{i_1}\cdots x_{i_p}\right)^2}}{\sqrt{2\pi\Delta_p}}\,. \qquad (15)$$

Therefore we have

$$P(x \mid Y, T) = \frac{1}{Z} \prod_{i=1}^N e^{-\frac{x_i^2}{2}} \prod_{i<j} e^{-\frac{1}{2\Delta_2}\left(Y_{ij} - \frac{x_i x_j}{\sqrt N}\right)^2} \prod_{i_1<\dots<i_p} e^{-\frac{1}{2\Delta_p}\left(T_{i_1\dots i_p} - \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x_{i_1}\cdots x_{i_p}\right)^2}\,, \qquad (16)$$

where $Z$ is a normalization constant.

Plugging Eqs. (12)-(13) into Eq. (16) allows to rewrite the posterior measure in the form of a Boltzmann distribution of the mixed $p$-spin Hamiltonian [34, 35, 43]

(17)

so that

(18)

In the following we will refer to $Z$ as the partition function. We note here that in the large-$N$ limit, using a Gaussian prior on the variables is equivalent to considering a flat measure over the $N$-dimensional hypersphere $\sum_{i=1}^N x_i^2 = N$. This choice will be used when we describe the Langevin algorithm, and in that case the last term in the Hamiltonian becomes an irrelevant constant.

Appendix B Approximate Message Passing, state evolution and phase diagrams

Approximate Message Passing (AMP) is a powerful iterative algorithm to compute the local magnetizations given the observed matrix and tensor. It is rooted in the cavity method of the statistical physics of disordered systems [32, 37] and has been recently developed in the context of statistical inference [12], where in the Bayes-optimal case it has been conjectured to be optimal among all local iterative algorithms. Among the properties that make AMP extremely useful is the fact that its performance can be analyzed in the thermodynamic limit. Indeed, in such a limit its dynamical evolution is described by the so-called State Evolution (SE) equations [12]. In this section we derive the AMP equations and their SE description for the spiked matrix-tensor model and solve them to obtain the phase diagram of the model as a function of the variances $\Delta_2$ and $\Delta_p$ of the two noisy channels.

B.1 Approximate Message Passing and Bethe free entropy

AMP can be obtained as a relaxed Gaussian closure of the Belief Propagation (BP) algorithm. The derivation that we present follows the same lines as [10, 30]. The posterior probability can be represented as a factor graph where all the variables are represented by circles and are linked to squares representing the interactions [44].

Figure 6: The factor graph representation of the posterior measure of the matrix-tensor factorization model. The variable nodes, represented with white circles, are the components $x_i$ of the signal, while the black squares are factor nodes that denote the interactions between the variable nodes appearing in the interaction terms of the Boltzmann distribution in Eqs. (17)-(18). There are three types of factor nodes: the prior, which depends on a single variable $x_i$; the factor encoding the probability of observing a matrix element $Y_{ij}$ given the values of the variables $x_i$ and $x_j$; and finally the factor encoding the probability of observing a tensor element $T_{i_1 \dots i_p}$. The posterior, apart from the normalization factor, is simply given by the product of all the factor nodes.

This representation is very convenient for writing down the BP equations. In the BP algorithm we iteratively update, until convergence, a set of quantities that are beliefs about the (cavity) magnetization of the nodes. The intuitive reasoning behind how BP works is the following. Given the current state of the variable nodes, take a factor node and exclude one node among its neighbors. The remaining neighbors, through the factor node, express a belief on the state of the excluded node. This belief is mathematically described by a probability distribution called a message, which depends on which factor node is selected. At the same time, another belief on the state of the excluded node is given by the rest of the network, excluding the factor node previously taken into account. All these messages travel on the factor graph carrying partial information on the real magnetization of the single nodes, and they are iterated until convergence (note that the pointwise convergence of the algorithm depends on the situation). The iterative scheme is described by the following equations

(19)
(20)
(21)
(22)

and we have omitted the normalization constants that guarantee that the messages are probability distributions. When the messages have converged to a fixed point, the estimation of the local magnetizations can be obtained through the computation of the real marginal probability distribution of the variables given by

(23)

We note that the computational cost of one iteration of BP scales with the number of factor nodes, which grows as $N^p$. Furthermore, Eqs. (19)-(22) are iterative equations for continuous functions and are therefore extremely hard to solve when dealing with continuous variables. The advantage of AMP is to reduce drastically the computational complexity of the algorithm by closing the equations on a Gaussian ansatz for the messages. This is justified in the present context since the factor graph is fully connected and therefore each iteration step of the algorithm involves sums of a large number of independent random variables that give rise to Gaussian distributions. Gaussian random variables are characterized by their mean and covariance, which are readily obtained by expanding the factor nodes, whose arguments are small (of order $1/\sqrt N$) in the large-$N$ limit.

Once the BP equations are relaxed onto Gaussian messages, the final step to obtain the AMP algorithm is the so-called TAPyfication procedure [30, 32], which exploits the fact that removing one node or one factor produces only a weak perturbation to the real marginals, so that the messages can be described in terms of the real marginals of the variable nodes themselves. By applying this scheme we obtain the AMP equations, which are written in terms of a set of auxiliary variables and of the means and variances of the marginals of the variable nodes. The AMP iterative equations are

(24)
(25)
(26)
(27)
(28)
(29)
(30)

It can be shown that these equations can be obtained as saddle point equations from the so-called Bethe free entropy, defined as $\Phi_{\rm Bethe} = \frac1N \log Z_{\rm Bethe}$, where $Z_{\rm Bethe}$ is the Bethe approximation to the partition function, which is defined as the normalization of the posterior measure. The expression of the Bethe free entropy per variable can be computed in a standard way (see [44]) and it is given by

(31)

where the terms appearing in it are a set of normalization factors associated with the variable and factor nodes. Using the Gaussian approximation for the messages and employing the same TAPyfication procedure used to get the AMP equations, we obtain the Bethe free entropy density as

(32)

where we used the variables defined in Eqs. (24)-(27) for the sake of compactness, and the remaining function is defined as

(33)

B.2 Averaged free entropy and its proof

Eq. (32) represents the Bethe free entropy for a single realization of the factor nodes in the large-size limit. Here we wish to discuss the actual, exact value of this free entropy, that is, the large-$N$ limit of the expectation of $\frac1N \log Z$ over the realization of the problem.

The quantity $\frac1N \log Z$ is a random variable, since it depends a priori on the planted signal and on the noise in the matrix and in the tensor. However, since the free entropy is an intensive quantity, statistical physics intuition suggests that it should be self-averaging and concentrate around its mean value in the large-$N$ limit [37]. In fact, this is easily proven. First, since the spherical model has rotational symmetry, the planted assignment could be any vector on the hypersphere, and we might as well suppose it is the uniform one: the true source of fluctuations is then the noise in the matrix and in the tensor. These can be controlled by noticing that the free entropy is a Lipschitz function of the Gaussian random variables $\xi_{ij}$ and $\xi_{i_1\dots i_p}$. Indeed, its derivatives with respect to these variables can be expressed in terms of thermal averages of the corresponding interaction terms, so that the Lipschitz constant is controlled by the overlap between two copies (replicas) of the system; since this overlap is bounded by one on the sphere, the Lipschitz constant is bounded. Therefore, by Gaussian concentration of Lipschitz functions (the Tsirelson-Ibragimov-Sudakov inequality [45]), we have for some constant $C > 0$:
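As a reminder of the tool invoked here, the standard form of Gaussian concentration for an $L$-Lipschitz function $f$ of a vector $\xi$ of i.i.d. standard Gaussian variables reads, in its textbook version [45] (the constants in the precise statement used in the paper may differ):

$$\Pr\left[\, \left| f(\xi) - \mathbb{E}\, f(\xi) \right| \ge t \,\right] \;\le\; 2\, \exp\!\left( -\frac{t^2}{2 L^2} \right) \qquad \text{for all } t > 0\,.$$

Applied to the free entropy per variable, whose Lipschitz constant with respect to the noise vanishes as $N$ grows, this yields the announced concentration around the average.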