1 Introduction
Machine learning has achieved astonishing success across real-world problems, such as image classification, speech recognition, and text processing, and physical problems, from quantum physics [4], to astrophysics [5], to high-energy physics [6]. Despite these practical successes, many aspects still lack theoretical understanding. Practitioners have identified several prescriptions for constructing working machine learning applications, but it is often unclear why those recipes are effective. Consider a typical classification task, where a dataset consisting of pictures of cats and dogs is provided to the machine with the correct labels. What follows is the minimization of a cost function. Given new images of pets, the goal of the machine is to correctly classify them into cats and dogs, thus successfully generalizing from what it has seen.
The optimization process itself is puzzling. In general, the cost function is high-dimensional and non-convex. Intuition would suggest that a random initialization would lead to some spurious, non-informative local minimum, with very little hope of achieving good generalization. Instead, in practice even vanilla gradient descent often leads to good generalization. Part of the computer science community has analysed the problem geometrically by studying the properties of the cost function [7, 8, 9, 10, 11]. They consider generative models where a signal is observed through a noisy channel; the strength of the signal with respect to the strength of the noise, the signal-to-noise ratio (SNR), can be tuned, changing the landscape. They showed that, in a variety of problems, as the SNR becomes sufficiently large all the minima become equally good or the spurious minima disappear, thus making the landscape trivial.
In this work we review the recent effort towards an understanding of the learning dynamics using the tools of disordered systems [2, 3], and we discuss the difference in performance between message-passing algorithms and algorithms for sampling a high-dimensional potential [1]. The relation between the two approaches becomes apparent from the point of view of Bayesian statistics. Let x̂ be the guess on the hidden signal and Y the observation; we can express how plausible it is to observe Y given our guess, i.e. the likelihood P(Y | x̂). Bayes' formula allows us to invert the likelihood into the posterior probability P(x̂ | Y), which also includes prior information on the guess, such as sparsity or norm constraints. We can write an approximate expression

P(x̂ | Y) ≈ [P(Y | x̂) P(x̂)]^β / Z(β) = e^{−β H(x̂)} / Z(β).  (1)
In the last equality we identify the terms with a Gibbs distribution with inverse temperature β and Hamiltonian H(x̂) = −log[P(Y | x̂) P(x̂)]. Given the posterior, we can estimate the signal by considering the expected value:

x̂_est = ⟨x̂⟩ = ∫ x̂ P(x̂ | Y) dx̂.  (2)

Observe that when the inverse temperature parameter β equals 1, Eq. (1) is the posterior probability of the problem. As β tends to infinity, the cost H dominates and optimizing will maximize the likelihood.
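As a toy illustration of Eqs. (1) and (2), consider a single scalar observation y = x* + noise with a binary signal x ∈ {−1, +1}. The sketch below (our own minimal example, not taken from the reviewed works) computes the Gibbs estimator at inverse temperature β, interpolating between the Bayesian posterior mean (β = 1) and the maximum-likelihood point estimate (β → ∞):

```python
import numpy as np

def gibbs_estimate(y, beta, delta=1.0):
    """Gibbs estimator <x> for a binary signal x in {-1,+1} observed
    through a Gaussian channel y = x + noise of variance delta."""
    xs = np.array([-1.0, 1.0])
    # H(x) is the negative log-likelihood; the flat binary prior only adds a constant.
    H = (y - xs) ** 2 / (2 * delta)
    w = np.exp(-beta * (H - H.min()))  # subtract the min for numerical stability
    return float(np.sum(xs * w) / np.sum(w))

# beta = 1 recovers the posterior mean tanh(y/delta);
# beta -> infinity recovers the likelihood maximizer sign(y).
```

For this simple channel the estimator has the closed form tanh(β y / δ), so the sketch can be checked directly against the analytical answer.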
In the eyes of a statistical physicist, the expected value would rather be called a magnetization, as it is formally identical to the magnetization of a system under the action of the Hamiltonian H. However, the exact computation of this expected value is prohibitive in large dimension; in fact, it is as complex as evaluating the partition function. In order to avoid such complications, numerous ingenious techniques have been considered in the past to obtain an approximate estimation. Two main approaches consist of approximating the posterior and sampling the posterior.

The idea of adapting the approximations proposed for disordered systems to computer science problems is not recent, and early works appeared in the 80s and 90s [12, 13]. Ideas from physics were transferred to problems in signal processing and optimisation, providing both theoretical understanding and practical algorithms based on the cavity method and its variations [14, 15, 16]. Those methods have the advantage of being at the same time algorithms and analytical tools. In many problems they were proved to be asymptotically optimal [17, 18, 19], in the sense that, information-theoretically, they achieve the best possible performance in polynomial time.

The best known algorithms that sample the posterior are Monte Carlo and the Langevin algorithm. Studies of the Langevin algorithm in disordered systems have their roots in the late 70s [20, 21, 22, 23]. Although the dynamics was understood for some recurrent neural networks in the long-time regime [24, 25], generalizing and solving the corresponding equations is very difficult even in the simplest models of statistical inference [26]. Consequently, the analysis of the performance of gradient-based algorithms such as the Langevin algorithm remains an open problem. Progress on this question was recently made in a series of works [1, 2, 3] that we review here.
2 Spiked matrix-tensor model
The model that we study in this report is the spiked matrix-tensor model, known in physics as a planted version of the spherical mixed p-spin model [27, 23, 28]. Planting is a technique introduced to study statistical inference and learning problems using the same methods as for their optimization counterparts [29]. Planting appears as an additional ferromagnetic bias towards a planted solution (or ground truth) in the Hamiltonian. In its application to inference, planting permits the introduction of a signal, the ground-truth solution, in the formulation of the problem. In the neural-network-learning language this formulation is called the teacher-student scenario: the teacher knows the ground truth and uses it to generate data; the student has to use the data to infer the ground truth.

The spiked matrix-tensor model was introduced in [1, 2, 3] in order to build an inference problem for which the behaviour of the gradient-based dynamics is exactly solvable. For the sake of simplicity we will consider p = 3, which means that the teacher samples the ground truth x* and generates the data, a matrix Y and an order-3 tensor T. The process is noisy and the data that the student receives, Y and T, have an intrinsic Gaussian noise of variance Δ₂ and Δ₃ respectively. The two observations are rescaled in order to have an extensive free energy in the size of the system N. The generative process is represented in Fig. 1. Substituting the data into the posterior Eq. (1) and absorbing constant terms into the prefactor, we obtain, up to numerical prefactors, the Hamiltonian

H(x) = − Σ_{i<j} ξ_ij x_i x_j / (√N Δ₂) − Σ_{i<j<k} ξ_ijk x_i x_j x_k / (N Δ₃) − N [m² / (2Δ₂) + m³ / (2Δ₃)],  (3)

where ξ_ij and ξ_ijk are the realizations of the noise and m = x·x*/N is the overlap with the signal. Observe that the noise terms (ξ_ij, ξ_ijk) in the equation are rescaled by √N (N, respectively) in order to have a problem that is neither impossibly hard (very high noise) nor trivially easy (very small noise). Under this choice of scaling of the noise, we observe different transitions for values of Δ₂ and Δ₃ of order one.
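The generative process can be sketched in a few lines of code. The snippet below follows the description above, though the exact normalization conventions (here 1/√N for the matrix channel and 1/N for the tensor channel) are our assumption and may differ from the prefactors used in [1, 2, 3]:

```python
import numpy as np

def generate_data(n, delta2, delta3, seed=0):
    """Teacher: sample a ground truth on the sphere |x|^2 = n and produce
    a noisy matrix Y and a noisy order-3 tensor T (assumed normalizations)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x *= np.sqrt(n) / np.linalg.norm(x)  # spherical constraint |x|^2 = n
    # Matrix channel: rank-one spike plus Gaussian noise of variance delta2.
    Y = np.outer(x, x) / np.sqrt(n) + np.sqrt(delta2) * rng.standard_normal((n, n))
    # Tensor channel: rank-one order-3 spike plus Gaussian noise of variance delta3.
    T = np.einsum('i,j,k->ijk', x, x, x) / n + np.sqrt(delta3) * rng.standard_normal((n, n, n))
    return x, Y, T
```

In the actual model Y and T are symmetric objects; symmetrizing the noise in this sketch would only change its variance by a constant factor, which we ignore here.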
The spiked matrix-tensor model is a natural candidate for our analysis as it has a high-dimensional non-convex energy landscape. The algorithmic transition, after which algorithms start to detect the signal, occurs at the same noise scaling as the information-theoretic transition for detection. The model is analytically tractable using different methods, allowing us to experiment and compare. We remark that the pure spiked tensor model does not have the algorithmic and information-theoretic transitions occurring in the same scaling regime of the noise, and is thus a less interesting candidate for our analysis.
3 Sampling algorithms vs approximate algorithms
We are going to consider an algorithmic version of the cavity method [14] as an example of an approximation algorithm. This algorithm was developed independently in the information theory and Bayesian inference communities under the name of belief propagation [30, 31, 32]. In the case of fully connected models, belief propagation can be simplified by assuming a Gaussian structure in the beliefs, leading to the Approximate Message Passing (AMP) algorithm [33, 16]. AMP presents numerous remarkable features: it provably achieves optimal performance in many problems including the spiked matrix-tensor model [17, 18, 19, 1], and its average behaviour can be followed analytically by a set of equations called state evolution [34]. The state evolution equations allow one to draw the phase diagram of this model, see Fig. 2; this was done in [1], generalizing the results of [18] on the spiked tensor model. The phase diagram can now be used as a baseline for the behaviour of the sampling algorithms.

In order to sample from the posterior probability it is necessary to design a dynamics that has the posterior probability as its stationary measure at large times. A typical sampling algorithm with this objective is the Langevin algorithm. Given the Hamiltonian of a spherical system, the Langevin dynamics describes the evolution of the system coupled with a thermal bath at temperature T = 1/β:
ẋ_i(t) = −∂H(x(t))/∂x_i − μ(t) x_i(t) + η_i(t),  (4)

where μ(t) is a Lagrange multiplier that imposes the spherical constraint and η_i(t) is the Langevin noise, with ⟨η_i(t)⟩ = 0 and ⟨η_i(t) η_j(t′)⟩ = 2T δ_ij δ(t − t′). In the late 70s, field-theoretic techniques [21] for the study of Langevin dynamics were adapted to disordered systems, providing a set of PDEs for the evolution of a few relevant observables. More recently, the results of these techniques have been proved with mathematical rigour in the mixed p-spin model [35, 36]. Those methods have been generalized to the study of planted systems [37] and applied to the present problem in [1, 3]. Two variants of the dynamical mean-field theory were used to derive the corresponding equations: the dynamical cavity method was used in [1], and the generating-functional formalism in [3]. The equations obtained characterize the evolution of the alignment of the system with the ground truth, m(t) = (1/N) Σ_i x_i(t) x*_i; the self-correlation at different times, C(t,t′) = (1/N) Σ_i x_i(t) x_i(t′); and the response to a perturbation of the Hamiltonian at a previous time, R(t,t′).
∂m(t)/∂t = −μ(t) m(t) + Q′(m(t)) + ∫₀^t dt″ R(t,t″) Q″(C(t,t″)) m(t″),  (5)

∂C(t,t′)/∂t = −μ(t) C(t,t′) + Q′(m(t)) m(t′) + ∫₀^{t′} dt″ R(t′,t″) Q′(C(t,t″)) + ∫₀^t dt″ R(t,t″) Q″(C(t,t″)) C(t′,t″),  (6)

∂R(t,t′)/∂t = −μ(t) R(t,t′) + ∫_{t′}^t dt″ R(t,t″) Q″(C(t,t″)) R(t″,t′),  (7)

where Q(x) = x²/(2Δ₂) + x³/(2Δ₃),
with the boundary condition C(t,t) = 1 for all t; the spherical constraint allows one to derive an additional equation for the Lagrange multiplier μ(t). The spiked matrix-tensor model has the nice feature of having a closed form for these equations, allowing an easier evaluation of the numerical solution by propagation from the initial conditions. In [1, 2] the limits of the Langevin algorithm and of gradient descent (respectively) have been evaluated numerically by extrapolation from the numerical solutions, see Fig. 2. In general the dynamical equations do not close, thus a self-consistent loop is necessary in order to evaluate a numerical solution, limiting the times accessible in the numerics [38].
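A direct finite-N simulation of the Langevin dynamics of Eq. (4) is straightforward with an Euler discretization. The sketch below is our own simplification: it keeps only the matrix channel of the Hamiltonian (with an assumed normalization) and replaces the Lagrange multiplier μ(t) by a hard re-projection onto the sphere after each step:

```python
import numpy as np

def langevin(Y, n, temp, delta2, dt=0.01, steps=2000, seed=1):
    """Euler-discretized Langevin dynamics for the simplified matrix-only
    Hamiltonian H(x) = -x^T Y x / (2 sqrt(n) delta2), assumed normalization."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x *= np.sqrt(n) / np.linalg.norm(x)
    for _ in range(steps):
        grad = -(Y @ x) / (np.sqrt(n) * delta2)        # gradient of H
        noise = np.sqrt(2 * temp * dt) * rng.standard_normal(n)
        x = x - dt * grad + noise
        x *= np.sqrt(n) / np.linalg.norm(x)            # hard spherical constraint
    return x
```

Tracking the overlap m(t) = x(t)·x*/N along such runs is how the finite-size behaviour shown in Fig. 3 can be reproduced qualitatively.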
An alternative can be derived from the work [39], where the authors proposed an ansatz for the large-time behaviour of the spin model, which assumes two time scales. The authors also showed that the dynamics is attracted by states, called threshold states, characterized by a Hessian that displays marginality, i.e. its spectrum touches zero. In [3], these two ideas are used to derive the analytical threshold of the Langevin dynamics and of gradient descent, by assuming that initially the dynamics tends to the threshold states and at later times it increases the alignment with the ground truth. The growth of the alignment is exponential, and the phase transition occurs when the growth exponent crosses zero. Analytical and numerical results are shown in Fig. 2, in perfect agreement.

The results suggest that sampling algorithms have a worse algorithmic threshold than AMP. This idea was foreseen in [40], where the authors used a large-deviation analysis [41] to find exponentially many atypical glassy states in the landscape. They conjectured that the presence of these atypical glassy states may block the dynamics of sampling algorithms. The same analysis was also performed in the spiked matrix-tensor model, confirming their findings [1].
Another signature of the different transitions appears in the evolution of AMP and of the Langevin dynamics, Fig. 3. For a fixed value of Δ₂, we can compare evolutions for different values of Δ₃. As the system gets closer to the transition, the time to find the signal increases. We can thus observe that AMP maintains the same typical time to find the solution for the different values of Δ₃; instead, the typical time of the Langevin dynamics increases exponentially as Δ₃ becomes smaller. This illustrates the counterintuitive finding that making the problem simpler by decreasing the noise in the tensor actually harms the Langevin evolution.
4 Gradient flow and geometry
It was already clear in [39] that the temperature enters in a smooth way in the dynamical equations; thus, by studying the limit T → 0 we can derive the behaviour of the gradient descent dynamics. In machine learning, gradient descent and its several variations (e.g. stochastic gradient descent) are usually used to minimize the cost function. Currently, very few problems are amenable to an analytical analysis of the dynamics.
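In the T → 0 limit the Langevin noise drops out of Eq. (4), leaving gradient flow. A minimal projected-gradient-descent sketch on the matrix channel alone, with a normalization that is our assumption rather than the convention of [2], reads:

```python
import numpy as np

def gradient_descent(Y, n, lr=0.05, steps=500, seed=2):
    """Gradient descent on H(x) = -x^T Y x / (2 sqrt(n)) with a hard
    spherical constraint |x|^2 = n (the T -> 0 limit of Langevin dynamics)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x *= np.sqrt(n) / np.linalg.norm(x)
    for _ in range(steps):
        x = x + lr * (Y @ x) / np.sqrt(n)        # move against the gradient of H
        x *= np.sqrt(n) / np.linalg.norm(x)      # re-project onto the sphere
    return x
```

On a spiked matrix with weak noise this recovers the signal up to a global sign, consistent with the trivialization picture discussed in this section.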
In the 80s [42] and in the early 2000s [43, 44, 45] there was an effort to understand the geometrical structure of the energy landscape in disordered models. Given the number of critical points of the model, 𝒩, the studies focused on the annealed (and quenched) complexity, defined as Σ_ann = lim_{N→∞} (1/N) log⟨𝒩⟩ (and Σ_quen = lim_{N→∞} (1/N) ⟨log 𝒩⟩, respectively). The authors used an expression that enumerates the number of critical points, namely the Kac-Rice formula [46], computed using replica theory. Recently, another approach for the evaluation of the Kac-Rice formula has been proposed that uses random matrix theory, giving fruitful results in the p-spin model (planted and unplanted) [47, 48, 49]. In [2] the analysis was generalized to the spiked matrix-tensor model, allowing to distinguish between regions where exponentially many minima are present and regions where only the good minima appear. The line that separates them is the trivialization transition line. If gradient descent is run above this line, provided that the time discretization is fine enough, we have a guarantee of finding the good minimum. Many papers [8, 9, 10, 11] have focused on this aspect, showing in several problems that when the SNR is strong enough the bad minima disappear, or all the minima become equally good. In the spiked matrix-tensor model, the geometrical trivialization and the gradient descent transition can both be pinpointed on the phase diagram. The results, Fig. 2, show that gradient descent starts to detect the signal before the trivialization transition has occurred. Although the algorithmic threshold of gradient descent cannot occur after the trivialization transition, it might appear counterintuitive that the two lines do not coincide and are separated by a distance of order one. In [3] the puzzle was solved, Fig. 4. The authors showed that, moving from a low-SNR region where the algorithm fails, the algorithmic transition of gradient descent appears when the dominant minima (the threshold states, or threshold minima) develop an instability, a Baik-Ben Arous-Péché instability [50], becoming saddles with a single negative direction that points toward the signal. In this region there are still exponentially many minima that do not carry information on the signal; nevertheless, the dynamics is first attracted by the saddles at the threshold, which shield the system from the bad minima and point in the right direction.

5 Conclusions
In this manuscript we analyze recent progress on the understanding of the dynamics in inference problems using the tools developed in statistical physics. The attention is focused on the spiked matrix-tensor model (a planted spherical mixed p-spin model) as a prototypical example of inference. The results on this model unveil unexpected behaviours of the dynamics and explain them from both a dynamical and a geometrical perspective. The techniques briefly summarized in this work can be extended to other models, and some of the findings can be verified numerically. These are thrilling directions that we hope to pursue in the future.
Acknowledgments
We thank G. Biroli, C. Cammarota, F. Krzakala and P. Urbani for the collaborations that led to these results and G. Bassignana, S. Goldt and O. Scarlatella for reading the draft of the manuscript. We acknowledge funding from the ERC under the European Union's Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe.
References
 [1] Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Marvels and pitfalls of the Langevin algorithm in noisy high-dimensional inference. arXiv preprint arXiv:1812.09066, 2018.
 [2] Stefano Sarao Mannelli, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pages 4333–4342, 2019.
 [3] Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, and Lenka Zdeborová. Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models. In Advances in Neural Information Processing Systems, pages 8676–8686, 2019.
 [4] Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017.
 [5] Nicholas M Ball and Robert J Brunner. Data mining and machine learning in astronomy. International Journal of Modern Physics D, 19(07):1049–1106, 2010.
 [6] Alexander Radovic, Mike Williams, David Rousseau, Michael Kagan, Daniele Bonacorsi, Alexander Himmel, Adam Aurisano, Kazuhiro Terao, and Taritree Wongjirad. Machine learning at the energy and intensity frontiers of particle physics. Nature, 560(7716):41, 2018.
 [7] Afonso S Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rank approach for semidefinite programs arising in synchronization and community detection. In Conference on Learning Theory, pages 361–382, 2016.
 [8] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
 [9] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
 [10] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pages 1233–1242, 2017.
 [11] Simon S Du, Jason D Lee, Yuandong Tian, Aarti Singh, and Barnabas Poczos. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. In International Conference on Machine Learning, pages 1338–1347, 2018.
 [12] Marc Mézard, Giorgio Parisi, and MiguelAngel Virasoro. Spin glass theory and beyond. World Scientific Publishing, 1987.
 [13] Andreas Engel and Christian Van den Broeck. Statistical mechanics of learning. Cambridge University Press, 2001.
 [14] Marc Mézard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
 [15] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016.
 [16] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications. Journal of Statistical Mechanics: Theory and Experiment, 2017(7):073403, 2017.
 [17] Léo Miolane. Fundamental limits of lowrank matrix estimation: the nonsymmetric case. arXiv preprint arXiv:1702.00473, 2017.
 [18] Thibault Lesieur, Léo Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborová. Statistical and computational phase transitions in spiked tensor estimation. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 511–515. IEEE, 2017.
 [19] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
 [20] Paul Cecil Martin, ED Siggia, and HA Rose. Statistical dynamics of classical systems. Physical Review A, 8(1):423, 1973.
 [21] C De Dominicis. Dynamics as a substitute for replicas in systems with quenched random impurities. Physical Review B, 18(9):4913, 1978.
 [22] Theodore R Kirkpatrick and Devarajan Thirumalai. Dynamics of the structural glass transition and the p-spin-interaction spin-glass model. Physical Review Letters, 58(20):2091, 1987.
 [23] A Crisanti, H Horner, and HJ Sommers. The spherical p-spin interaction spin-glass model. Zeitschrift für Physik B Condensed Matter, 92(2):257–271, 1993.
 [24] Haim Sompolinsky, Andrea Crisanti, and HansJurgen Sommers. Chaos in random neural networks. Physical review letters, 61(3):259, 1988.
 [25] ACC Coolen. Statistical mechanics of recurrent neural networks II. Dynamics. arXiv preprint cond-mat/0006011, 2000.

 [26] Elisabeth Agoritsas, Giulio Biroli, Pierfrancesco Urbani, and Francesco Zamponi. Out-of-equilibrium dynamical mean-field equations for the perceptron model. Journal of Physics A: Mathematical and Theoretical, 51(8):085002, 2018.
 [27] David J Gross and Marc Mézard. The simplest spin glass. Nuclear Physics B, 240(4):431–452, 1984.
 [28] Andrea Crisanti and Luca Leuzzi. Spherical 2+p spin-glass model: An exactly solvable model for glass to spin-glass transition. Physical Review Letters, 93(21):217203, 2004.
 [29] Florent Krzakala and Lenka Zdeborová. Hiding quiet solutions in random constraint satisfaction problems. Physical review letters, 102(23):238701, 2009.
 [30] Robert Gallager. Low-density parity-check codes. IRE Transactions on Information Theory, 8(1):21–28, 1962.
 [31] Judea Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science, University of California, Los Angeles, 1982., 1982.
 [32] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial intelligence, 29(3):241–288, 1986.
 [33] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, Nov 2009.
 [34] Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115–144, 2013.
 [35] Gérard Ben Arous, Amir Dembo, and Alice Guionnet. Cugliandolo-Kurchan equations for dynamics of spin-glasses. Probability Theory and Related Fields, 136(4):619–660, 2006.
 [36] Amir Dembo and Eliran Subag. Dynamics for spherical spin glasses: disorder dependent initial conditions. arXiv preprint arXiv:1908.01126, 2019.
 [37] Chiara Cammarota and Giulio Biroli. Aging and relaxation near random pinning glass transitions. EPL (Europhysics Letters), 98(1):16011, 2012.
 [38] Felix Roy, Giulio Biroli, Guy Bunin, and Chiara Cammarota. Numerical implementation of dynamical mean field theory for disordered systems: application to the LotkaVolterra model of ecosystems. Journal of Physics A: Mathematical and Theoretical, 2019.
 [39] Leticia F Cugliandolo and Jorge Kurchan. Analytical solution of the off-equilibrium dynamics of a long-range spin-glass model. Physical Review Letters, 71(1):173, 1993.
 [40] Fabrizio Antenucci, Silvio Franz, Pierfrancesco Urbani, and Lenka Zdeborová. Glassy nature of the hard phase in inference problems. Physical Review X, 9(1):011020, 2019.
 [41] Rémi Monasson. Structural glass transition and the entropy of the metastable states. Physical review letters, 75(15):2847, 1995.
 [42] AJ Bray and MA Moore. Metastable states, internal field distributions and magnetic excitations in spin glasses. Journal of Physics C: Solid State Physics, 14(19):2629, 1981.
 [43] Andrea Crisanti, Luca Leuzzi, Giorgio Parisi, and Tommaso Rizzo. Complexity in the Sherrington-Kirkpatrick model in the annealed approximation. Physical Review B, 68(17):174401, 2003.
 [44] Andrea Cavagna, Irene Giardina, Giorgio Parisi, and Marc Mézard. On the formal equivalence of the TAP and thermodynamic methods in the SK model. Journal of Physics A: Mathematical and General, 36(5):1175, 2003.
 [45] A Crisanti, L Leuzzi, G Parisi, and T Rizzo. Quenched computation of the dependence of complexity on the free energy in the Sherrington-Kirkpatrick model. Physical Review B, 70(6):064423, 2004.
 [46] Robert J Adler and Jonathan E Taylor. Random fields and geometry. Springer Science & Business Media, 2009.
 [47] Antonio Auffinger, Gérard Ben Arous, and Jiří Černý. Random matrices and complexity of spin glasses. Communications on Pure and Applied Mathematics, 66(2):165–201, 2013.
 [48] Gérard Ben Arous, Song Mei, Andrea Montanari, and Mihai Nica. The landscape of the spiked tensor model. arXiv preprint arXiv:1711.05424, 2017.
 [49] Valentina Ros, Gérard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions. Physical Review X, 9(1):011003, 2019.

 [50] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.