A central question in computational sciences is the algorithmic feasibility of optimization in high-dimensional non-convex landscapes. This question is particularly important in learning and inference problems where the value of the optimized function is not the ultimate criterium for quality of the result, instead the generalization error or the closeness to a ground-truth signal is more relevant.
Recent years brought a popular line of research into this question where various works show for a variety of systems that spurious local minima are not present in certain regimes of parameters and conclude that consequently optimization algorithms shall succeed, without the aim of being exhaustive these include [2, 3, 4, 5, 6, 7, 8, 9, 8, 10]. The spuriosity of a minima is in some works defined by their distance from the global minima, in other works as local minimizers that lead to bad generalization or bad accuracy in reconstruction of the ground truth signal. These two notions are not always equivalent, and certainly the later is more relevant and will be used in the present work.
Many of the existing works stop at the statement that absence of spurious local minimizers leads to algorithmic feasibility and the presence of such spurious local minima leads to algorithmic difficulty, at least as far as gradient-descent-based algorithms are concerned. At the same time, even gradient-descent-based algorithms may be able to perform well even when spurious local minima are present. This is because the basins of attraction of the spurious minimas may be small and the dynamics might be able to avoid them. In the other direction, even if spurious local minima are absent, algorithms might take long time to find a minimizer for entropic reasons that in high-dimensional problems may play a crucial role.
Main results: In this work we provide a case-study of a high-dimensional inference problems – the spiked matrix-tensor model – for which we are able to describe and quantitatively compare the following:
With the use of the recently introduced Langevin-state-evolution , that is a generalization of an approach well known in physics for analysis of the Langevin dynamics [13, 14], we give a closed-form description for the behaviour of the gradient flow (GF) algorithm in the limit of large system sizes.
We analyze the state evolution of the maximum-likelihood version of the approximate message passing algorithm (ML-AMP).
We show that the above two algorithms (GF and ML-AMP) achieve the same error in the regime where they succeed. That same value of the error is also deduced from the position of all the minima strongly correlated with the signal as obtained from the Kac-Rice approach (precise statement below). We quantify the region of parameters in which the two above algorithms succeed and show that, up to the degree of accurancy of the extrapolation of the GF performance, the two lines are not the same. We also show that the algorithmic performance is not driven by the absence of spurious local minima. These results are summarized in Fig. 1. These result show, that in order to obtain a complete picture for settings beyond the present model, the precise interplay between absence of spurious local minima and algorithmic performance remains to be further investigated.
2 Problem definition
, consistently with the spiked matrix model. These three lines, related to minimization of the landscape, and their mutual positions, are the main result of this paper. The colors in the background, separated by the black dashed-dotted lines, show for comparison the phase diagram for the Bayes-optimal inference, related to the ability to approximate the marginals of the corresponding posterior probability distribution, and are taken from. In the red region obtaining a positive correlation with the signal in information-theoretically impossible. In the green region it is possible to obtain optimal correlation with the signal using the Bayes-optimal AMP (BO-AMP). And in the orange the region the BO-AMP is not able to reach the Bayes-optimal performance.
In this paper we consider the spiked matrix-tensor model as studied in . This is a statistical inference problem where the ground truth signal is sampled uniformly on the -dimensional sphere, . We then obtain two types of observations about the signal, a symmetric matrix , and an order symmetric tensor , that given the signal are obtained as
for and , using and symmetries to obtain the other non-diagonal components. Here and are for each and each
independent Gaussian random numbers of zero mean and varianceand , respectively.
The goal in this spiked matrix-tensor inference problem is to estimate the signal from the knowledge of the matrix and tensor . If only the matrix was present, this model reduces to well known model of low-rank perturbation of a random symmetric matrix, closely related to the spiked covariance model . If on the contrary only the tensor is observed then the above model reduces to the spiked tensor model as introduced in  and studies in a range of subsequent papers.
In this paper we study the matrix-tensor model where the two observations are combined. Our motivation is similar to the one exposed in , that is, we aim to access a regime in which it is algorithmically tractable to obtain good performance with corresponding message passing algorithms yet it is challenging (e.g. leading to non-convex optimization) with sampling or gradient descent based algorithms, this happens when both and , while .
In this paper we focus on algorithms that aim to find the maximum likelihood estimator. The negative log-likelihood (Hamiltonian in the physics language, or loss function in the machine learning language) of the spiked matrix-tensor reads
where is constrained to the sphere.
In a high-dimensional, , noisy regime the maximum-likelihood estimator is not always optimal as it provides in general larger error than the Bayes-optimal estimator computing the marginals of the posterior, studied in . At the same time the log-likelihood (3) can be seen as a loss function, that is non-convex and high-dimensional. The tractability and properties of such minimization problems are the most questioned in machine learning these days, and are worth detailed investigation in the present model.
3 Landscape characterization
The first goal of this paper is to characterize the structure of local minima of the loss function (equivalently local maxima of the log-likelihood) eq. (3) as a function of the noise parameters and . We compute the average number of local minimizers having a given correlation with the ground truth signal . This leads to a so-called complexity function defined as the logarithm of the expected number of local minima at correlation with the ground truth.
A typical example of this function, resulting from our analysis, is depicted in Fig. 2 for , , and several values of . We see from the figure that at large local minima appear only in a narrow range of values of close to zero, as decreases the support of widens. At yet smaller values of the support becomes disconnected so that it is supported on an interval of value close to and on two (one negative, one positive) isolated points. For yet smaller the complexity for values of close to zero becomes negative, signalling what we call a trivialization of the landscape, where all remaining local minima are (in the leading order in ) as correlated with the ground truth as the global minima. The support of in the trivialized regions consists of two separated points. We call the value of at which the trivialization happens . In the phase diagram of Fig. 1 the trivialization of the energy landscape happens above the purple dashed line.
We use the Kac-Rice formula to determine the complexity [17, 18]. Given an arbitrary continuous function, the Kac counting formula allows to compute the number of points where the function crosses a given value. The number of minima can be characterized using Kac’s formula on the gradient of the loss (3), counting how many time the gradient crosses the zero value, under the condition of having a positive definite Hessian in order to count only local minima and not saddles. Since the spiked matrix-tensor model is characterized by a random landscape, due to the noise and , we will consider the expected number of minima obtaining the Kac-Rice formula [17, 18].
For mathematical convenience we will consider the rescaled configurations , and rescaled signal . Call the joint probability density of the gradient of the loss, and of the and the contributions of the matrix and tensor to the loss, respectively. Given the value of the two contributions to the loss and , and the correlation between the configuration and ground truth that we impose using a Dirac’s delta, the averaged number of minimizers is
Rewrite the loss Eq. (3) neglecting terms that are constant with respect to the configuration and thus do not contribute to the complexity
In the following we will use small letters , , , to characterize losses, gradient and Hessian constrained on the sphere and capital letters for the same quantities unconstrained. Define the
-dimensional identity matrix. The following lemma characterizes.
Given the loss function Eq. (5) and a configuration such that the correlation and the signal is , then there exists a reference frame such that the joint probability distribution of , and is given by
with standard Gaussians and , a standard multivariate Gaussian and a random matrix from the Gaussian orthogonal ensemble.
a random matrix from the Gaussian orthogonal ensemble.
Starting from Eq. (5), split the contributions of the matrix and tensor in and , two Gaussian variables and impose the spherical constrain with a Lagrange multiplier .
The expression for in a critical point can be derived as follows. Given , multiply Eq. (10) by , sum over the indices and we obtain:
. We now restrict our study to the unconstrained random variables and substitute. Since the quantities , , , , are linear functionals of Gaussians they will be distributed as Gaussian random variables and therefore can be characterized by computing expected values and covariances. Starting from the losses coming from the matrix and the tensor in Eq. (5), and
, respectively, consider the moments with respect to the realization of the noise,, . For the first moment leads to
Let’s consider the second moment but having two different configurations and ,
Using standard results for derivatives of Gaussians (see e.g.  Eq. 5.5.4) we can obtain means and covariances of the random variables taking derivatives with respect to and . Then set , imposing the spherical constrain and using .
The last step is the definition of a convenient reference frame . Align the configuration along the last coordinate and the signal with a combination of the first and last coordinates . Finally, project on the sphere by discarding the last coordinate. ∎
We can now rewrite the determinant of the conditioned Hessian by grouping the multiplicative factor in front of the GOE in Eq. (8)
with and given by
in the large -limit. Therefore the Hessian behaves like a GOE shifted by with a rank one perturbation of strength . This exact same problem has already been studied in  and we can thus deduce the expression for the complexity as
We note at this point that for the case of the pure spiked tensor model the above expression reduces exactly to the complexity derived in . The following theorem states that to the leading order Eq. (17) represents the complexity of our problem.
Given and , for any and it holds
The proof comes immediately from  Thm. 2, see also Sec. 4.1. ∎
The quantity that we are interested in is the projection of Eq. (17) to the maximal values of and :
Eq. (19) allows to understand if at a given correlation with the signal, there are regions with an exponential expected number of minima, see Fig. 2. Thus it allows to locate parameters where the landscapes is trivial.
We computed the expected number of minima, i.e. the so-called annealed average. The annealed average might be dominated by rare samples, and in general provides only an upper bound for typical samples. The quenched complexity, i.e. the average of the logarithm of the number of minima, is more involved. The quenched calculation was done in the case of a the spiked tensor model . It is interesting to notice that in  the authors found that the annealed complexity does not differ from the quenched complexity for . This combined with analogous preliminary results for the spiked matrix-tensor model, suggest that considering the quenched complexity would not change the conclusions of this paper presented in the phase diagrams Fig. 1.
4 Gradient flow analysis
In this section we analyze the performance of the gradient flow descent in the loss function (3)
where the Lagrange parameter is set in a way to ensure the spherical constraint . Our aim is to understand the final correlation between the ground truth signal and the configuration reached by the gradient flow in large but finite time, while .
The gradient flow (20) can be seen as a zero-temperature limit of the Langevin algorithm where
with being the Langevin noise with zero mean and covariance , where has the physical meaning of temperature, the notation stands for the average over the noises and . As we take the limit , the noise becomes peaked around zero, effectively recovering the gradient flow.
The performance of the Langevin algorithm was characterized recently in  using equations developed in physics of disordered systems [13, 14]. In  this characterization was given for an arbitrary temperature and compared to the landscape of the Bayes-optimal estimator . Here we hence summarize and use the results of  corresponding to the limit .
The Langevin dynamics with generic temperature is in the large size limit, , characterized by a set of PDEs for the self-correlation , the response function , and the correlation with the signal . Ref.  established that as the gradient flow evolves these quantities satisfy eqs. (74)-(76) in that paper. Taking the zero-temperature limit in those equations we obtain
with and the rescaled spherical constraint. Boundary conditions for the equations are , for all and . An additional equation for is obtained by fixing in Eq. (22). In the context of disordered systems those equations have been established rigorously for a related case of the matrix-tensor model without the spike .
4.1 Performance of the gradient flow
Eqs. (22-24) are integrated numerically showing the large-size-limit performance of the gradient flow algorithm. Example of this evolution is given in Fig. 3 for , . The code will be made available and linked to this paper. For consistency we confirm numerically that at large times the gradient flow reaches values of the correlation that correspond exactly to the value of the correlation of the minima correlated to the signal as obtained in the Kac-Rice approach.
As the variance increases the time it takes to the gradient flow to acquire good correlation with the signal increases. We define the convergence time as the time it takes to reach 1/2 of the final plateau. The dependence of on is consistent with a power law divergence at . This is illustrated in Fig. 4 where we plot the convergence time as a function of and show the power-law fit in the inset. The points are collected and plotted in Fig. 1, dotted blue line.
From Fig. 4 we see that the gradient flow algorithm undergoes a considerable slow-down even in the region where the landscape is trivial, i.e. does not have spurious local minimizers. At the same time divergence of the convergence time happens only well inside the phase where spurious local minimizers do exist.
5 Maximum-likelihood approximate message passing
Approximate Message Passing (AMP) is a popular iterative algorithm  with a key advantage of being analyzable via its state evolution . The maximum-likehood AMP (ML-AMP) algorithm studied in this paper is a generalization of AMP for the pure spiked tensor model from  to the spiked matrix-tensor model. We will show that its fixed points correspond to stationary points of the loss function (3). This should be contrasted with the Bayes-optimal AMP (BO-AMO) that was studied in  and aims to approximate the marginals of the corresponding posterior probability distribution. The ML-AMP instead aims to estimate the maximum-likelihood solution, . In information theory the BO-AMP would correspond to the sum-product algorithm, while the present one to the max-sum algorithm. In statistical physics language the BO-AMP corresponds to temperature one, while the present one to zero temperature. In the appendix, Sec. C, we provide a schematic derivation of the ML-AMP as a zero-temperature limit of the BO-AMP, using a scheme similar to .
The ML-AMP algorithm reads
with the -norm and the Onsager reaction term
5.1 ML-AMP & stationary points of the loss
Given a fixed point of ML-AMP, then satisfies the stationary condition of the loss.
which is exactly solution of the derivative of Eq. (3) with respect to when the spherical constraint is enforced by a Lagrange multiplier
Moreover ML-AMP by construction preserves the spherical constrain at every time iteration, thus the fixed point value of the Lagrange multiplier in (29) is the one leading to the spherical constraint. ∎
5.2 State evolution
The evolution of ML-AMP can be tracked through a set of equations called state evolution (SE). The state evolution can be characterized via an order parameter: , the correlation of the ML-AMP-estimator with the ground truth signal at time . According to the SE, as proven in [23, 16], this parameter evolves in the large limit as
and the mean square error correspondingly
A derivation of this state evolution is presented in the appendix, Sec. C.
Analysis of the simple scalar SE, Eq. (30), allows to identify the error reached by the ML-AMP algorithm. We first observe that is always a fixed point. For the performance of ML-AMP is the stability of this fixed point that determines whether the ML-AMP will be able to find a positive correlation with the signal or not. Analyzing Eq. (30) we obtain that the is a stable fixed point for where
Consequently for the ML-AMP algorithm converges to , i.e. zero correlation with the signal. The line is the line plotted in Fig. 1. For and , we obtain that for the ML-AMP algorithm converges to a positive correlation with the signal, depicted in Fig. 5. In Fig. 5 we also observe that this correlation agrees (up to the numerical precision) with the position of the maximum having largest value of in the complexity function , this is also depicted in the figure. The trivialization of the landscape occurs at , thus showing that for the ML-AMP algorithm is able to ignore a good portion of the spurious local minima and to converge to the local minima best correlated with the signal.
We analyzed the behavior of two descent algorithms in optimizing a rough high-dimensional loss landscape of the spiked matrix-tensor model. We used the Kac-Rice formula to count the average number of minima of the loss function having a given correlation with the signal. Analyzing the resulting formula we defined and located where the energy landscape becomes trivial in the sence that spurious local minima disappear. We analyzed the performance of gradient flow via integro-differential state-evolution-like equations. We numerically solved the equations and extrapolated the divergence of their convergence-time. We delimited a region of parameters for which the gradient flow is able to avoid the spurious minima and obtain a good correlation with the signal in time linear in the input size. We also analyzed the maximum-likelihood AMP algorithm, located the region of parameters in which this algorithm works, which is larger than the (numerically extrapolated) region for which the gradient flow works. We found that in cases when both the algorithms converge to an informative minima, the corresponding error is the same in both and also corresponds to the position of all the minima well correlated with the signal in the Kac-Rice approach. The relation between existence or absence of spurious local minima in the loss landscapes of a generic optimization problems and the actual performance of optimization algorithm is yet to be understood. Our analysis of the spiked matrix-tensor model brings a case-study where we were able to specify this relation quantitatively.
We thank G. Ben Arous, G. Biroli, C. Cammarota, G. Folena, and V. Ros for precious discussions. We acknowledge funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe; from the French National Research Agency (ANR) grant PAIL; and from ”Investissements d’Avenir” LabEx PALM (ANR-10-LABX-0039-PALM) (SaMURai and StatPhysDisSys). This research was supported in part by the National Science Foundation under Grant No. PHY-1748958.
-  Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Marvels and pitfalls of the langevin algorithm in noisy high-dimensional inference. arXiv preprint arXiv:1812.09066, 2018.
-  Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
-  Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
-  Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
-  C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
-  Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
-  Dohyung Park, Anastasios Kyrillidis, Constantine Carmanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. In Artificial Intelligence and Statistics, pages 65–74, 2017.
-  Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
-  Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pages 1233–1242, 2017. arXiv preprint arXiv:1704.00708.
-  Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint arXiv:1702.08580, 2017.
-  Yan V Fyodorov. Complexity of random energy landscapes, glass transition, and absolute value of the spectral determinant of random matrices. Physical review letters, 92(24):240601, 2004.
-  Gerard Ben Arous, Song Mei, Andrea Montanari, and Mihai Nica. The landscape of the spiked tensor model. arXiv preprint arXiv:1711.05424, 2017.
-  A Crisanti, H Horner, and H-J Sommers. The spherical -spin interaction spin-glass model. Zeitschrift für Physik B Condensed Matter, 92(2):257–271, 1993.
-  Leticia F Cugliandolo and Jorge Kurchan. Analytical solution of the off-equilibrium dynamics of a long-range spin-glass model. Physical Review Letters, 71(1):173, 1993.
-  Iain M Johnstone. Annals of statistics, pages 295–327, 2001.
-  Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014.
-  Robert J Adler and Jonathan E Taylor. Random fields and geometry. Springer Science & Business Media, 2009.
-  Yan V Fyodorov. High-dimensional random fields and random matrix theory. Markov Processes Relat. Fields, 21:483–518, 2015.
Valentina Ros, Gerard Ben Arous, Giulio Biroli, and Chiara Cammarota.
Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions.Physical Review X, 9(1):011003, 2019.
-  Fabrizio Antenucci, Silvio Franz, Pierfrancesco Urbani, and Lenka Zdeborová. On the glassy nature of the hard phase in inference problems. arXiv preprint arXiv:1805.05857, 2018.
-  Gerard Ben Arous, Amir Dembo, and Alice Guionnet. Cugliandolo-Kurchan equations for dynamics of spin-glasses. Probability theory and related fields, 136(4):619–660, 2006.
-  David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, Nov 2009.
-  Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115–144, 2013.
-  Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications. Journal of Statistical Mechanics: Theory and Experiment, 2017(7):073403, 2017.
-  Andrea Montanari. Graphical models concepts in compressed sensing. Compressed Sensing: Theory and Applications, pages 394–438, 2012.
-  Federico Ricci-Tersenghi, Guilhem Semerjian, and Lenka Zdeborová. Typology of phase transitions in bayesian inference problems. preprint:arXiv:1806.11013.
Appendix A Kac-Rice formula
a.1 -odd cases
In the cases in which the order of the tensor
is odd we encounter an interesting phenomenon due to the different symmetries of the two types of observation. The matrix is symmetric by inverting the sign of the signal,, while the tensor is not symmetric for odd . This creates an asymmetry in the complexity, Fig. 6 (to be compared with Fig. 2) and causes a shift toward lower correlations of the band characterizing the non-informative minima. Therefor observing when the complexity at becomes negative does not guarantee that the non-informative minima disappeared. To do so, one must check that the whole non-informative band disappears. This should be contrasted with the case of even where a maximum of the complexity is always at . These two definitions of the threshold have little, but not negligible, difference, see Fig. 7. Observe that as increases the peak of the complexity decreases, since the loss Eq. (3) tends to the simple matrix-factorization problem where the landscape is characterized by two isolated minima. This implies that the two definitions become indistinguishable for large . In the main text we use the definition taking into account the maximum (even when it is not strictly at ) because gives a more accurate characterization of the trivialization threshold.
Appendix B Gradient Flow
b.1 Dependence on the initial conditions
The dynamics of the gradient flow shows a dependence on the initial conditions, because formally zero correlation is a (unstable) fixed point of the GF state evolution. In practice we observe for both GF and ML-AMP that instability of the fixed point is sufficient for good performance of the algorithm. However, this makes the definition of the convergence time depend of the initial condition.
We observed from our numerical solution of the GF state evolution equations that the initial condition add a factor to the convergence times. Thus by fitting this term and rescaling the convergence times, the different initializations collapse into a single curve, see inset of Fig. 8. Finally, the collapsed points were used to extrapolate the critical line as shown in the main text, Fig. 4.
Appendix C Amp
c.1 From AMP to ML-AMP
In this section we consider the spiked-tensor model in a Bayesian way. We show how the Bayes-optimal AMP leads to the Maximum Likelihood AMP using a temperature-like parameter . We will introduce the algorithm AMP for a generic , and show that as we recover ML-AMP as presented in the main text. The probability distribution we consider is
The scheme for deriving AMP estimating marginals of such a probability distribution can be found in [24, 1] and consist in making a Gaussian assumption on the distribution of the messages in the belief propagation (BP) algorithm and neglecting the node-dependence in the messages. A final consideration to be used in order to derive the algorithm is that the spherical constrain can be imposed by setting at every iteration. The resulting AMP algorithm will iterate on the following equations:
with the -norm and the Onsager reaction term
In the limit AMP defined by Eqs. (34-37) is equivalent to ML-AMP, Eqs. (25-28). To see this we define the rescaled variables , and . Taking the limit the expression for Eq. (35) and the expression for Eq. (36) simplify as Eq. (26) and as Eq. (26) respectively. Dropping the tildes we obtain ML-AMP as presented in the main text.
c.2 State evolution
The generic version of AMP has a slightly more complicated SE that depends of two order parameters: the already introduced and the self-overlap of the estimator. The SE equations are:
with and .
c.3 Derivation of spinodals
From SE Eq. (30) we can obtain analytical equations for the spinodals, the threshold of stability of the different ML-AMP fixed points. We have with
with and . Observe that: . We can now define either or