Who is Afraid of Big Bad Minima? Analysis of Gradient-Flow in a Spiked Matrix-Tensor Model

07/18/2019 ∙ by Stefano Sarao Mannelli, et al.

Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient flow originating from statistical physics. We show that there is a well-defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have a strong negative direction towards the global minimum, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscape.


1 Introduction

A common theme in machine learning and optimisation is to understand the behaviour of gradient-descent methods for non-convex problems with many minima. Despite the non-convexity, such methods often successfully optimise models such as neural networks, matrix completion and tensor factorisation. This has motivated a recent surge of research attempting to characterise the properties of the loss landscape that may shed light on the reasons for this success; without aiming to be exhaustive, these include [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].

Over the last few years, a popular line of research has shown, for a variety of systems, that spurious local minima are not present in certain regimes of parameters. When the signal-to-noise ratio is large enough, the success of gradient descent can thus be understood by a trivialisation transition in the loss landscape: either there is only a single minimum, or all minima become "good", and no spurious minima can trap the dynamics. This is what happens, for instance, in the limit of small noise and abundance of data for matrix completion and tensor factorisation [3, 8], or for some very large neural networks [1, 2]. However, it is often observed in practice that these guarantees fall short of explaining the success of gradient descent, which is empirically observed to find good minima very far from the regime under mathematical control. In fact, gradient-descent-based algorithms may be able to perform well even when spurious local minima are present, because the basins of attraction of the spurious minima may be small and the dynamics might be able to avoid them. Understanding this behaviour requires, however, a very detailed characterisation of the dynamics and of the landscape, a feat which is not yet possible in full generality.

A fruitful direction is the study of Gaussian random functions on the $N$-dimensional sphere, known as $p$-spin spherical spin-glass models in the physics literature, and as isotropic models in the Gaussian-process literature [12, 13, 14, 15, 16]. In statistics and machine learning, these models have appeared following the studies of spiked matrix and tensor models [17, 18, 19]. In particular, a very recent work [20] showed explicitly that for a spiked matrix-tensor model the gradient-flow algorithm indeed reaches the global minimum even when spurious local minima are present, and the authors estimated numerically the corresponding regions of parameters. In this work we consider this very same model and explain the mechanism by which the spurious local minima are avoided, and we develop a quantitative theoretical framework that we believe has a strong potential to be generic and extendable to a much broader range of models in high-dimensional inference and neural networks.

The Spiked Matrix-Tensor Model. The spiked matrix-tensor model has recently been proposed as a prototypical model for non-convex high-dimensional optimisation in which several non-trivial regimes of cost landscapes can be displayed quantitatively by tuning the parameters [21, 20]. In this model, one aims at reconstructing a hidden vector (i.e. the spike) from the observation of a noisy version of both the rank-one matrix and the rank-one tensor created from the spike. We use the following notation: bold lowercase symbols represent vectors, bold uppercase symbols represent matrices or tensors, and the dot denotes the scalar product. The model is defined as follows: given a signal (or spike) $\mathbf{x}^*$, uniformly sampled on the $N$-dimensional hyper-sphere of radius $\sqrt{N}$, we are given a tensor $\mathbf{T}$ and a matrix $\mathbf{Y}$ such that

(1)
(2)

where $\xi_{i_1 \dots i_p}$ and $\xi_{ij}$ are Gaussian random variables of variance $\Delta_p$ and $\Delta_2$, respectively. Neglecting constant terms, the maximum-likelihood estimation of the ground truth, $\mathbf{x}^*$, corresponds to the minimization of the following loss function:

(3)

The first two contributions in the last equation will be denoted $\mathcal{L}_2$ and $\mathcal{L}_p$ in the following. While the matrix observations correspond to a quadratic term and thus to a simple loss landscape, the additional order-$p$ tensor contributes towards a rough non-convex loss landscape. As $\Delta_p \to \infty$ the tensor information becomes irrelevant and the landscape becomes trivial, while in the opposite limit $\Delta_2 \to \infty$ the matrix information becomes irrelevant and the landscape becomes extremely rough and complex, as analysed recently in [16, 22].
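For concreteness, the following sketch generates an instance of the model and evaluates the loss for $p=3$. It is a minimal illustration, not the authors' code: the noise symmetrisation and the exact prefactors used here (e.g. $\sqrt{2}/N$ for the tensor channel and $1/\sqrt{N}$ for the matrix channel) are our assumptions about the conventions.

```python
import numpy as np

def spiked_matrix_tensor(N, delta2, delta3, rng):
    """Sample a p=3 instance: signal on the sphere of radius sqrt(N),
    noisy rank-one matrix Y and noisy rank-one tensor T (illustrative
    conventions)."""
    x_star = rng.standard_normal(N)
    x_star *= np.sqrt(N) / np.linalg.norm(x_star)   # uniform on the sphere
    xi2 = rng.standard_normal((N, N)) * np.sqrt(delta2)
    Y = np.outer(x_star, x_star) / np.sqrt(N) + (xi2 + xi2.T) / np.sqrt(2)
    xi3 = rng.standard_normal((N, N, N)) * np.sqrt(delta3)
    xi3 = sum(xi3.transpose(p) for p in                     # symmetrise the noise
              [(0,1,2),(0,2,1),(1,0,2),(1,2,0),(2,0,1),(2,1,0)]) / 6
    T = np.sqrt(2) / N * np.einsum('i,j,k->ijk', x_star, x_star, x_star) + xi3
    return x_star, Y, T

def loss(x, Y, T, delta2, delta3):
    """Loss (3) with constant terms dropped: matrix plus tensor channel."""
    N = len(x)
    L2 = -x @ Y @ x / (2 * delta2 * np.sqrt(N))
    L3 = -np.sqrt(2) * np.einsum('ijk,i,j,k->', T, x, x, x) / (6 * delta3 * N)
    return L2 + L3
```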

We shall consider the behaviour of the gradient-flow (GF) algorithm aiming to minimise the loss:

(4)

where $\mu(t)$ enforces that $\mathbf{x}(t)$ belongs to the hyper-sphere of radius $\sqrt{N}$ and will be referred to as the spherical constraint. The algorithm is initialised in a random point drawn uniformly on the hyper-sphere, thus initially having no correlation with the ground-truth signal. We view gradient flow as a prototype of the gradient-descent-based algorithms that are the work-horse of current machine learning.
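Continuing the sketch above, a minimal Euler discretisation of Eq. (4) might look as follows; replacing the Lagrange-multiplier term $\mu(t)$ by an explicit renormalisation onto the sphere at each step is our simplification.

```python
def gradient_flow(x0, Y, T, delta2, delta3, dt=0.01, steps=5000):
    """Euler scheme for gradient flow on the sphere (p=3 conventions as
    above); the spherical constraint is enforced by rescaling x instead
    of integrating mu(t)."""
    N = len(x0)
    x = x0.copy()
    for _ in range(steps):
        grad = -(Y @ x) / (delta2 * np.sqrt(N)) \
               - np.sqrt(2) * np.einsum('ijk,j,k->i', T, x, x) / (2 * delta3 * N)
        x = x - dt * grad
        x *= np.sqrt(N) / np.linalg.norm(x)   # retraction onto the sphere
    return x
```

Initialising x0 uniformly on the sphere, as in the model sketch, reproduces the uninformed initialisation used throughout the paper.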

Figure 1: Cartoon illustrating the mechanism by which gradient flow avoids spurious local minima in the spiked matrix-tensor model. As the signal-to-noise ratio (snr) is increased, the spurious local minima that attract the randomly initialised GF algorithm develop a single negative direction towards the global minimum before the others, in particular the lower-cost spurious local minima, do. This has drastic consequences for the GF algorithm. In region (i), for $\Delta_2 > \Delta_2^{\rm GF}$, the algorithm goes down the landscape, eventually reaches the high-energy threshold minima and remains stuck. In region (ii), however, these threshold minima are turned into saddles with a strong negative direction towards the signal. The algorithm initially reaches these minima-turned-saddles and, surfing on the negative slope, it then turns towards the "good" minima correlated with the signal, avoiding the exponentially many spurious minima at lower energies. The main technical contribution of this paper is a quantitative description of this scenario, including a simple formula for the corresponding threshold $\Delta_2^{\rm GF}$, eq. (5). As the snr is further increased, the negative direction appears in lower and lower minima until the trivialisation transition in region (iii): for $\Delta_2 < \Delta_2^{\rm triv}$, all the spurious minima have been turned into saddles.

Main Contributions. The first main result of this paper is the expression for the threshold $\Delta_2^{\rm GF}$ below which the gradient-flow algorithm finds a configuration correlated with the hidden spike. This threshold is established in the asymptotic limit of large $N$, at fixed $\Delta_2$, $\Delta_p$ and $p$, and reads:

(5)

We find that (i) for $\Delta_2 < \Delta_2^{\rm GF}$ the gradient flow reaches in finite time the global minimum, well correlated with the signal, while (ii) for $\Delta_2 > \Delta_2^{\rm GF}$ the algorithm remains uncorrelated with the signal for all times that do not grow as $N$ grows. We contrast it with the threshold $\Delta_2^{\rm triv}$, established in [20], below which the energy landscape does not present any spurious local minima. Note that $\Delta_2^{\rm GF}$ is smaller than the threshold below which the best known algorithm, approximate message passing, works [21].

The second main result of this paper is the insight we obtain on the behaviour of gradient flow in the loss landscape, summarised in Fig. 1. The key point is to consider the fate of the spurious local minima that attract the GF algorithm when the signal-to-noise ratio is increased. As the snr increases, these minima turn into saddles with a single negative direction towards the signal (a phenomenon that we analyse in the next section, and that turns out to be linked to the BBP transition [23] in random matrix theory), and this happens well before all the other spurious local minima disappear. We present two ways to quantify this insight:

(a) We use the Kac-Rice formula for the number of stationary points, as derived for the present model in [20], where it was used to quantify the region with no spurious local minima. Here we focus on a BBP-type phase transition that is crucial in the derivation of this formula and deduce the GF threshold (5) from it.

(b) We use the CHSCK equations [24, 25] for a closed-form description of the behaviour of gradient flow, as derived and numerically solved in [21, 20]. Building on the dynamical theory of mean-field spin glasses, we determine precisely when and how the algorithm escapes the manifold of zero overlap with the signal, leading again to the threshold (5).

Both these arguments rely on reasoning common in theoretical physics. From a mathematically rigorous point of view the threshold (5) remains a conjecture, and its rigorous proof is an interesting challenge for future work. We note that both the Kac-Rice approach [16] and the CHSCK equations [26] have been made rigorous in closely related problems. We believe that the reported results are not limited to the present model, and we will investigate analytically and numerically other models and real-data-based learning in order to validate this theory and to understand its limitations.

Figure 2: The phase diagram shows the different regions of gradient-flow behaviour for the spiked matrix-tensor model with $p=3$. In the region shaded in red (light and dark), GF does not correlate with the signal, while it does in the grey and green regions. In the dark-red region obtaining correlation with the signal is information-theoretically impossible [21]. The possible region is divided by a red dashed line: below that line even the best known algorithms are unable to obtain correlation with the signal [21]. The green region is characterised by a trivial landscape, i.e. all the spurious minima disappear [20]. The grey region is where gradient flow succeeds in converging despite the presence of spurious minima. Black crosses mark the points where the gradient-flow threshold was obtained numerically in [20]; they agree perfectly with our theoretical prediction of the threshold (5), marked by the grey dashed line. The coloured circles are points that we will use to illustrate the different features of these regions.

2 Probing the Landscape by the Kac-Rice Method

The statistical properties of the landscape associated to the loss function (3) can be studied by the Kac-Rice method, which traces back to the statistical physics literature, see [27] for an overview, and was developed mathematically in [13, 14] and recently extended in [22].

The quantities of interest are the number of critical points at a given energy, $\mathcal{N}$, and the Hessian matrix evaluated at those critical points. We analyse the logarithm of $\mathcal{N}$ per degree of freedom, called the complexity. Since the complexity is a random quantity we compute its upper bound, the annealed complexity $\Sigma_{\rm a} = \frac{1}{N}\ln \mathbb{E}[\mathcal{N}]$, along the lines of [16, 20]. We have also computed its typical value, the quenched complexity $\Sigma_{\rm q} = \frac{1}{N}\mathbb{E}[\ln \mathcal{N}]$, along the lines of [22], i.e. non-rigorously, using the replica-symmetry assumption (see SM Sec. A). In what follows we focus on the complexity of stationary points with no correlation with the signal, in which case analytical and numerical arguments (see SM Sec. A.2.1) indicate that $\Sigma_{\rm a}$ and $\Sigma_{\rm q}$ are either very close numerically or possibly equal. Thus, in the following, we will simply refer to the complexity without further specification.

In the Kac-Rice analysis the statistics of the Hessian, $\mathcal{H}$, at critical points plays a key role. It was shown in [20], and the argumentation is reproduced in the SM Sec. A.2, that $\mathcal{H}$ has a simple form for the loss (3). It is a matrix formed by the sum of three contributions: a random matrix belonging to the Gaussian orthogonal ensemble (GOE), a matrix proportional to the identity, and a rank-one projector in the direction of the signal. The expression of $\mathcal{H}$ for critical points with null overlap with the signal and with energies $E_2$ and $E_p$ in the two channels reads:

(6)

with a shift $\mu$, a semicircle half-width $2\sigma$ and a rank-one strength $\theta$, whose explicit expressions in terms of $E_2$, $E_p$, $\Delta_2$ and $\Delta_p$ are given in the SM. The normalisation is chosen such that the GOE part has a semicircular spectrum supported on $[-2\sigma, 2\sigma]$.
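The structure of (6) is easy to probe numerically. The sketch below samples a matrix of the form GOE + shift + rank-one projector with the notation above ($\mu$, $\sigma$, $\theta$); the mapping from $(E_2, E_p, \Delta_2, \Delta_p)$ to these coefficients is left out, so any values passed in are placeholders.

```python
import numpy as np

def hessian_sample(N, mu, sigma, theta, rng):
    """Sample from the ensemble of Eq. (6): sigma*GOE (semicircle on
    [-2*sigma, 2*sigma]) shifted by mu*Identity, minus a rank-one
    perturbation of strength theta*sigma along a fixed unit vector
    playing the role of the signal direction."""
    A = rng.standard_normal((N, N))
    W = (A + A.T) / np.sqrt(2 * N)      # GOE, semicircle on [-2, 2]
    v = np.zeros(N); v[0] = 1.0         # stand-in for the signal direction
    return sigma * W + mu * np.eye(N) - theta * sigma * np.outer(v, v), v
```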

The Fate of the Spurious: The initial condition for the gradient-flow algorithm is a random configuration uniformly drawn on the hyper-sphere. Such an initial condition clearly belongs to the large manifold of configurations uncorrelated with the ground-truth signal. We aim to investigate how the gradient flow manages to escape from this initial manifold. For this purpose we focus on the properties of the landscape in the subspace where the overlap with the signal is zero, $m = 0$.

In Fig. 3, we plot the complexity at $m=0$ as a function of the energy for the parameter points marked with circles of the corresponding colour in Fig. 2. We use discontinuous lines for the complexity of critical points that have at least one negative direction, and full lines for the complexity of local minima. A finding of [20], which holds for any value of $\Delta_p$, is that for small snr the majority of critical points with zero overlap with the signal at low enough energies are spurious minima; they disappear when the snr is increased above a $\Delta_p$-dependent value, corresponding to the green region of Fig. 2. In this part of the phase diagram there are no spurious minima and the global minimum is correlated with the signal; this is an "easy" landscape for gradient flow, which is therefore expected to succeed there. The main open question concerns the behaviour for smaller values of the snr: when does the existence of spurious minima, appearing in panels (b) and (c) of Fig. 3, start to be harmful to gradient flow?

Figure 3: Complexity curves counting the critical points at overlap $m=0$, at fixed $\Delta_p$, for three values of $\Delta_2$ (from left to right). The lines are dotted when the complexity is dominated by critical points having an extensive number of negative eigenvalues, dashed when only one eigenvalue is negative, and full when the points have only positive (or null) eigenvalues, i.e. they are minima. The complexity of the minima is drawn in full lines with the same colours, and it merges with the complexity of stationary points when it becomes dominant.

In order to answer this question, we investigate more closely the nature of the spurious minima at different energies. We focus in particular on their Hessian, which plays a crucial role in understanding which spurious minima have the largest basin of attraction and, hence, can trap the algorithm. For low signal-to-noise ratio and large $N$, the spectrum of (6) is a shifted Wigner semicircle with support $[\mu - 2\sigma, \mu + 2\sigma]$. The effect of the projector (third contribution to the RHS of (6)) on the support of the spectrum is negligible as long as $\theta < 1$, as follows from the work on low-rank perturbations of random GOE matrices [23]. Moreover, the most numerous critical points at fixed energy are characterised by a shift $\mu$ that is a monotonically decreasing function of the energy, see Fig. 6 in the SM. Thus, moving towards higher energies, the spectrum of the Hessian shifts to the left, which indicates smaller curvature and wider minima. The transition between minima and saddles takes place at the threshold energy $E_{\rm th}$ at which $\mu = 2\sigma$, i.e. where the left edge of the Wigner semicircle law touches zero; the numerical value of $E_{\rm th}$ is obtained in the Appendix Sec. A.2.3. Above this energy, critical points have an extensive number of downward directions, as found also in spin-glass models [14, 27]. Putting the above findings together, the minima at $E_{\rm th}$ are the most numerous and the marginally stable ones. Therefore, they are the natural candidates for having the largest basin of attraction and the highest influence on the randomly initialised algorithm. This reasonable guess is at the basis of the theory of glassy dynamics in physics [25]. We take it as a working hypothesis for now, and we confirm it analytically and numerically in what follows.

Going back to the algorithm: when the signal-to-noise ratio is small we therefore expect that the configuration slowly approaches at long times the ubiquitous "threshold minima" characterised by the energy $E_{\rm th}$ and zero overlap with the signal. The last missing piece is unveiling what makes those minima unstable at large snr. We show below that it is a transition, called BBP (Baik-Ben Arous-Péché) [23], which takes place in the spectrum of the Hessian. Increasing the snr at fixed $\Delta_p$, as in Fig. 3, leads to a larger $\theta$. When $\theta$ becomes larger than one, an eigenvalue, equal to $\mu - \sigma(\theta + 1/\theta)$, pops out on the left of the Wigner semicircle, and its corresponding eigenvector develops a finite overlap with the signal [23]. This implies the development of an unstable direction for the threshold minima trapping the dynamics, as they were already on the verge of instability. Indeed, as soon as the isolated eigenvalue pops out, an unstable downward direction towards the signal emerges in the landscape around them, at which point the algorithmic threshold for gradient flow takes place. Interestingly, many other spurious minima at lower energy also undergo the BBP transition, but they remain stable for longer, as their isolated eigenvalue is still positive when it pops out from the semicircle. In conclusion, our analysis of the landscape suggests a dynamical transition for signal estimation by gradient flow given by

$\theta(E_{\rm th}) = 1$,     (7)

which leads to a very simple expression for the transition line $\Delta_2^{\rm GF}(\Delta_p)$, Eq. (5). This theoretical prediction is shown in Fig. 2 as a dashed grey line: the agreement with the numerical estimation from [20] (black crosses) is perfect.
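Using the hessian_sample sketch from above, the BBP mechanism can be checked directly: for $\theta > 1$ the smallest eigenvalue detaches from the left edge $\mu - 2\sigma$ and sits near $\mu - \sigma(\theta + 1/\theta)$, while its eigenvector acquires a squared overlap close to $1 - 1/\theta^2$ with the signal direction. With $\mu = 2\sigma$ (marginal threshold minima) the detached eigenvalue is negative, which is exactly the unstable direction discussed in the text. The parameter values below are illustrative only.

```python
rng = np.random.default_rng(0)
N, mu, sigma = 2000, 2.0, 1.0    # mu = 2*sigma: marginally stable threshold minima
for theta in (0.5, 1.5, 3.0):
    H, v = hessian_sample(N, mu, sigma, theta, rng)
    evals, evecs = np.linalg.eigh(H)
    print(f"theta={theta}: lambda_min={evals[0]:+.3f} "
          f"(BBP prediction {mu - sigma*(theta + 1/theta):+.3f} for theta>1), "
          f"overlap^2={(evecs[:, 0] @ v)**2:.3f} "
          f"(prediction {max(0.0, 1 - 1/theta**2):.3f})")
```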

Our analysis unveils that the key property of the loss landscape determining the performance of the gradient-flow algorithm is the (in)stability, in the direction of the signal, of the minima with the largest basin of attraction. These are the most numerous and the highest in energy, a condition that likely holds for many high-dimensional estimation problems.

The other spurious minima, which are potentially more trapping than the threshold ones and still stable at the algorithmic transition just derived, are actually completely innocuous, since a random initial condition does not lie in their basin of attraction with probability one in the large $N$ limit. This benign role of very bad spurious minima might appear surprising; it is due to the high-dimensionality of the non-convex loss function. Indeed, it does not happen in finite-dimensional cases, in which a random initial condition has instead a finite probability of falling into bad minima, if those are present.

3 Probing the Gradient-Flow Dynamics

3.1 Closed-Form Dynamical Equations

In the large $N$ limit the gradient-flow dynamics of the spiked matrix-tensor model can be analysed using techniques originally developed in statistical-physics studies of spin glasses [28, 29, 30] and later put on a rigorous basis in [26]. Three observables play a key role in this theory:

(i) The overlap (or correlation) of the estimator at two different times: $C(t,t') = \lim_{N\to\infty} \mathbf{x}(t)\cdot\mathbf{x}(t')/N$.

(ii) The change (or response) of the estimator at time $t$ due to an infinitesimal perturbation of the loss at time $t'$, i.e. $\mathcal{L} \to \mathcal{L} - \mathbf{h}(t')\cdot\mathbf{x}(t')$ in Eq. (4): $R(t,t') = \lim_{N\to\infty} \frac{1}{N}\sum_i \delta x_i(t)/\delta h_i(t')$.

(iii) The average overlap of the estimator with the ground truth: $\bar{C}(t) = \lim_{N\to\infty} \mathbf{x}(t)\cdot\mathbf{x}^*/N$.

For $N \to \infty$ the above quantities converge to a non-fluctuating limit, i.e. they concentrate with respect to the randomness in the initial condition and in the generative process, and they satisfy closed equations. Following the works of Crisanti-Horner-Sommers-Cugliandolo-Kurchan (CHSCK) [29, 30] and their recent extension to the spiked matrix-tensor model [21, 20], the above quantities satisfy:

(8)
(9)
(10)
(11)

with initial conditions $C(0,0) = 1$ and $\bar{C}(0) = 0$. The additional function $\mu(t)$, and its associated equation, are due to the spherical constraint; $\mu(t)$ plays the role of a Lagrange multiplier and guarantees that the solution of the previous equations is such that $C(t,t) = 1$ at all times. The derivation of these equations can be found in [21] and in the SM Sec. B. It is obtained using a heuristic theoretical-physics approach and can very plausibly be made fully rigorous by generalising the work of [26, 31].

This set of equations can be solved numerically as described in [21]. The numerical estimation of the algorithmic threshold of gradient flow, reproduced in Fig. 2, was obtained in [20]. We have also directly simulated the gradient flow Eq. (4) and compared the result to the one obtained from solving Eqs. (8-11). As shown in the SM Sec. C, we find very good agreement even for large yet finite sizes.
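As an illustration of such a direct check, the sketch below (reusing the $p=3$ helpers from Section 1) runs a single gradient-flow trajectory and prints the energy density and the overlap $\bar{C}(t)$. The system size, noise levels and time grid are arbitrary choices of ours, and a real comparison with Eqs. (8-11) would average over many instances.

```python
rng = np.random.default_rng(1)
N, delta2, delta3 = 64, 0.7, 1.0           # illustrative parameters only
x_star, Y, T = spiked_matrix_tensor(N, delta2, delta3, rng)

x = rng.standard_normal(N)
x *= np.sqrt(N) / np.linalg.norm(x)        # random initialisation on the sphere
dt = 0.01
for step in range(5001):
    if step % 500 == 0:
        print(f"t={step*dt:6.1f}  energy={loss(x, Y, T, delta2, delta3)/N:+.4f}  "
              f"overlap={x @ x_star / N:+.4f}")
    grad = -(Y @ x) / (delta2 * np.sqrt(N)) \
           - np.sqrt(2) * np.einsum('ijk,j,k->i', T, x, x) / (2 * delta3 * N)
    x = x - dt * grad
    x *= np.sqrt(N) / np.linalg.norm(x)
```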

Figure 4: Right panel: energy as a function of time for the set of parameters indicated by small circles in Fig. 2. The horizontal dotted lines correspond to the value of the threshold energy $E_{\rm th}$, as derived both from the Kac-Rice approach in Appendix Sec. A.2.3 and from the large-time behaviour of the dynamics in Appendix Sec. B.2.6. Left panel: eigenvalue distribution of the Hessian of the threshold states for the same set of parameters. When $\Delta_2$ becomes smaller than $\Delta_2^{\rm GF}$ an isolated eigenvalue appears; it has been highlighted using vertical arrows. Concomitantly, the energy as a function of time first approaches the plateau and eventually departs from it and reaches the energy of the global minimum.

Surfing on saddles: Armed with the dynamical equations, we now confirm the prediction of the threshold (5) based on the Kac-Rice-type landscape analysis. In the SM we check that the minima trapping the dynamics are indeed the marginally stable ones at $E_{\rm th}$, see Figs. 7 and 8 in the SM, and we show that the energy can be expressed in terms of $C$, $R$ and $\bar{C}$. In the right panel of Fig. 4 we then plot the energy as a function of time obtained from the numerical solution of Eqs. (8-11) for the same points and colour code as in Figs. 2 and 3. For the two smaller values of the signal-to-noise ratio the energy converges to a plateau value at $E_{\rm th}$ (dotted line), whereas for the largest one the energy plateaus close to $E_{\rm th}$ but then eventually drifts away and reaches a lower value, corresponding to the global minimum correlated with the signal. This behaviour can be understood in terms of the spectral properties of the Hessian (6) of the minima trapping the dynamics. In the left panel of Fig. 4 we plot the corresponding density of eigenvalues of $\mathcal{H}$ for the same parameters used in the right panel. This is an illustration of the dynamical phenomenon explained in the previous section: when the signal-to-noise ratio is large enough, threshold minima become unstable because a negative eigenvalue, associated with a downward direction towards the signal, emerges. In this case the dynamics first seems to converge to the threshold minima and then, at long times, drifts away along the unstable direction. The larger the signal-to-noise ratio, the more unstable the downward direction and, hence, the shorter the intermediate trapping time.

3.2 Gradient-flow Threshold from Dynamical Theory

We now show that the very same prediction (5) for the algorithmic threshold of gradient flow can be obtained directly by analysing the dynamical equations (8-11), without using results from the Kac-Rice analysis, thus establishing a firm and novel connection between the behaviour of the gradient-flow algorithm and Kac-Rice landscape approaches.

For small signal-to-noise ratios, when $\bar{C}(t)$ remains zero at all times, the dynamical equations (8-11) are identical to the well-known ones in spin-glass theory; for reviews see [32, 33]. These equations have been studied extensively for decades in statistical physics, and a range of results about their behaviour has been established. Here we describe the results that are important for our analysis and devote the SM Sec. B.2 to a more extended presentation. It was shown analytically in [29] that the behaviour of the dynamics at large times is captured by an asymptotic solution of Eqs. (8-11) that verifies several remarkable properties. The ones of interest to us are that, for $\bar{C} = 0$ and large times:

(i) $C(t,t')$ stays close to one when $t - t'$ is finite; it becomes less than one only when $t - t'$ diverges with $t$ and $t'$.

(ii) $C(t,t') = C_{\rm TTI}(t-t') + C_{\rm ag}(t,t')$, where TTI stands for time-translational invariance and ag stands for aging. Here $C_{\rm TTI}$ goes to zero on a finite time-scale, whereas $C_{\rm ag}$ varies on timescales diverging with $t$ and $t'$. Moreover, the response verifies the so-called "weak long-term memory" property: for any finite $t'$, the integrated response $\int_0^{t'} R(t,s)\,ds$ is arbitrarily small at large $t$. We refer to this functional form for $C$ and $R$ as the aging ansatz, adopting the physics terminology.

These properties are confirmed by our numerical solution, see for instance Fig. 9 in the SM. The interpretation of these dynamical properties is that at long times the system descends in the energy landscape and approaches the marginally stable minima. Concomitantly, the dynamics slows down and takes place along the almost flat directions associated with the vanishing eigenvalues of the Hessian.

We recall that in the previous paragraphs we assumed null correlation with the signal, $\bar{C} = 0$. In order to find the algorithmic threshold beyond which gradient flow develops a positive correlation, we study the instability of the aging solution as a function of the signal-to-noise ratio. Our strategy is to start with an arbitrarily small overlap, $\bar{C}(0) = \epsilon$, and determine whether it grows at long times, thus indicating an instability towards the signal. Since the initial condition for the overlap is uncorrelated with the signal, for sufficiently small $\epsilon$ the functions $C$ and $R$ reach their asymptotic form before $\bar{C}$ becomes of order one. We can thus plug the asymptotic aging ansatz for $C$ and $R$ into the dynamical equation for $\bar{C}$:

(12)

In the linear approximation the solution has the form $\bar{C}(t) \simeq \epsilon\, e^{\lambda t}$, and we take $\lambda$ arbitrarily small since we want to locate the algorithmic threshold, where $\lambda \to 0^+$. Since $\bar{C}$ evolves arbitrarily slowly, whereas $C_{\rm TTI}$ relaxes to zero on a finite timescale, the second term on the RHS of eq. (12) simplifies to a term directly proportional to $\bar{C}(t)$, with a coefficient that does not depend on $t$ (since $t$ can be taken arbitrarily large and $C_{\rm TTI}$ relaxes to zero on finite time-scales). The remaining contribution of the aging part to the last term of (12) is an integral over the aging regime: using that $\bar{C}$ is bounded by $\epsilon\, e^{\lambda t}$ and that the kernel cuts off the integral at a time that does not diverge with $t$, we can use the "weak long-term memory" property to conclude that this last term is arbitrarily small compared to $\bar{C}(t)$ and hence can be neglected with respect to the previous ones. Collecting all the pieces together we find:

(13)

This is solved by $\bar{C}(t) \propto e^{\lambda t}$, with a rate $\lambda$ fixed by the coefficients of the linearised equation, which therefore justifies a posteriori our assumption of exponential growth. The condition for the instability of the aging solution towards the signal solution is therefore given by

(14)

From the analysis of the asymptotic aging solution presented in SM Sec. B.2, the coefficients entering this condition can be evaluated in closed form; the resulting instability condition coincides with the one found from the study of the landscape, and thus leads to the transition line eq. (5).
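To summarise the logic of this section in one schematic display (our condensed rendering; the explicit coefficient, worked out in SM Sec. B.2, is what turns it into Eq. (5)):

```latex
% Schematic linearisation of Eq. (12) around the aging (zero-overlap) solution.
% A denotes an effective constant collecting the asymptotic plateau values of
% the kernels built from C and R; its explicit form is derived in SM Sec. B.2.
\frac{\mathrm{d}\bar{C}(t)}{\mathrm{d}t} \;\simeq\; A\,\bar{C}(t)
\quad\Longrightarrow\quad
\bar{C}(t) \simeq \epsilon\, e^{A t},
\qquad
\text{instability towards the signal} \;\Longleftrightarrow\; A > 0 .
```

The transition line (5) is then the locus $A = 0$ in the $(\Delta_2, \Delta_p)$ plane.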

Acknowledgments

We thank Pierfrancesco Urbani for many related discussions. We acknowledge funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe; from the French National Research Agency (ANR) grant PAIL; and from the Simons Foundation (#454935, Giulio Biroli).

References

Appendix A Kac-Rice method

A.1 Summary of the Kac-Rice complexity

Figure 5: Curves of the complexity of critical points (dotted and dashed curves, from Eq. (17)) and of minima (full curves, from Eq. (19)) at overlap value $m=0$, at fixed $\Delta_p$, for different values of $\Delta_2$. The figure shows qualitatively the same features as Fig. 3, but displays the full positive part of the complexity for the four cases discussed in the main text. Zooms of the curves of the annealed complexity of critical points and minima where they cross zero at negative loss are shown in the panels labelled from (a) to (d).

In this section we introduce the Kac-Rice formula and we show how to reduce it to an explicit expression for the spiked matrix-tensor model. The Kac-Rice formula evaluates the expected number of critical points of a rough function subject to a number of conditions. For an inference problem it is interesting to focus on the expected number of critical points constrained to have a given loss and a given overlap with the ground truth. For convenience we consider the loss rescaled by the dimension. The Kac-Rice formula then reads

(15)

where the expectation is over the noise in the problem at fixed parameters, and the last factor is the joint probability density of the loss and its gradient.
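To build intuition for formula (15), here is a self-contained one-dimensional sanity check (our own toy example, not part of the paper): for a random trigonometric polynomial on the circle, Rice's formula gives the expected number of critical points in closed form, which we compare against direct counting.

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 8, 2000
ks = np.arange(1, K + 1)
c = 1.0 / ks ** 2                        # spectral weights (arbitrary choice)
# f(theta) = sum_k a_k cos(k theta) + b_k sin(k theta), a_k, b_k ~ N(0, c_k).
# Rice: E[# zeros of f' on [0, 2pi)] = 2 * sqrt(Var f'' / Var f').
theory = 2 * np.sqrt((ks ** 4 @ c) / (ks ** 2 @ c))

grid = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
S, Cg = np.sin(np.outer(ks, grid)), np.cos(np.outer(ks, grid))
counts = []
for _ in range(trials):
    a = rng.standard_normal(K) * np.sqrt(c)
    b = rng.standard_normal(K) * np.sqrt(c)
    fp = (b * ks) @ Cg - (a * ks) @ S    # f'(theta) on the grid
    counts.append(np.sum(fp * np.roll(fp, -1) < 0))   # sign changes of f'
print(f"Rice formula: {theory:.3f}   empirical: {np.mean(counts):.3f}")
```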

The quantity of interest is the density of the logarithm of the number of critical points, $\frac{1}{N}\ln\mathcal{N}$. It should be noted that, since the random variable representing the number of critical points fluctuates at the exponential scale, a correct estimation of the expected value of this quantity is not $\frac{1}{N}\ln\mathbb{E}[\mathcal{N}]$, as would be immediately obtained by using the result of the Kac-Rice formula [20], but $\frac{1}{N}\mathbb{E}[\ln\mathcal{N}]$. These two quantities are called, respectively, the annealed and quenched complexities. Using Jensen's inequality one observes that the annealed complexity is just an upper bound of the quenched one. However, for mathematical convenience most studies have focused on the former. Eventually the second moment of the number of critical points was evaluated [34], by an extension of the Kac-Rice formula to higher moments [35], to prove that the two are equivalent in some models [34]. The quenched complexity has been evaluated in a related model in a non-rigorous way by studying the $n$-th moment and applying the replica trick, the so-called replicated Kac-Rice approach [22]. Given a random variable $X$, the replica trick says

$\mathbb{E}[\ln X] = \lim_{n\to 0} \frac{\ln \mathbb{E}[X^n]}{n}$,     (16)

but instead of considering an arbitrary real $n$, the study is done using an integer number of replicas $n$ and performing an analytic continuation of the result to $n \to 0$. The replica trick has already been used in a plethora of applications and, although not rigorous, it was found correct in all naturally motivated cases that have later been approached by other techniques. An important mathematical literature has developed in order to understand the method.
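As a quick numerical illustration of Eq. (16) (a toy check of ours, with a log-normal $X$ for which both sides are known exactly):

```python
import numpy as np

# For X = exp(g), g ~ N(m, s^2): E[ln X] = m and ln E[X^n] = m*n + s^2*n^2/2,
# so ln E[X^n] / n -> m as n -> 0, matching the direct quenched average.
rng = np.random.default_rng(0)
m, s = 0.3, 0.8
X = np.exp(rng.normal(m, s, size=10**6))
for n in (0.5, 0.1, 0.01):
    print(n, np.log(np.mean(X ** n)) / n)   # approaches m = 0.3 as n -> 0
print(np.mean(np.log(X)))                    # direct E[ln X]
```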

In the next section we sketch the derivation of the quenched Kac-Rice formula and we provide all the information needed to determine the annealed one. Since the threshold is determined by considering configurations with arbitrarily small overlap $m$, we focus on that case. Remarkably, we find that as $m \to 0$ the quenched complexity computed within the replica symmetric (RS) approximation is equal to the annealed one. (We expect the RS approximation to be correct for $m \to 0$ in the whole phase diagram, and hence that quenched and annealed complexities coincide there. For larger $m$, results from replica theory [36] suggest that one needs to go beyond the RS approximation at least in some parts of the phase diagram.) We show that the corresponding Hessian is Eq. (6) of the main text, i.e. it is proportional to a GOE matrix translated by $\mu$ and perturbed by a low-rank perturbation of strength $\theta$ that in the annealed case is of rank one. Thus we find that the complexity of the stationary points is [20]

(17)

with

(18)

and with $\mu$, $\sigma$ and $\theta$ as already introduced in the main text. Finally, studying the eigenvalues of the Hessian so as to constrain them to the positive semi-axis, we find the complexity of minima [20]

(19)

with

(20)

In Fig. 5 we show the two complexities, of the stationary points and of the minima, in the parameter space discussed in the main text, with discontinuous lines for Eq. (17) and full lines for Eq. (19), respectively. A positive complexity means an exponential number of critical points (minima). The region where exponentially many minima appear is highlighted in the small panels, showing the coexistence of exponentially many minima and saddles.

A.2 Derivation of the quenched complexity

We proceed with the computation of the quenched Kac-Rice complexity for the spiked matrix-tensor model, following the replicated Kac-Rice prescription introduced for the spiked pure-tensor model [22]. Following the replica trick Eq. (16), this requires the evaluation of the $n$-th moment of the number of minima using the Kac-Rice formula, which is given by [35]

(21)

where the last factor is the joint probability density of the loss and its gradients evaluated on the replicated configurations. We hereby sketch the main computation steps and present the results that are most relevant for the theory presented in this paper, i.e. for $m \to 0$. The details of the computation and further results for general $m$ will be presented in a dedicated work elsewhere.

It is convenient to consider free variables in $\mathbb{R}^N$ and constrain them to the sphere using a Lagrange multiplier $\mu$, so that critical points satisfy $\nabla\mathcal{L}(\mathbf{x}) + \mu\,\mathbf{x} = 0$. Projecting this condition on $\mathbf{x}$ and using the spherical constraint, we obtain a simple expression for the multiplier, $\mu = -\mathbf{x}\cdot\nabla\mathcal{L}/N$. Moreover, separating the two terms of the loss that represent the contributions of the two channels, $\mathcal{L} = \mathcal{L}_2 + \mathcal{L}_p$, and using that these are homogeneous of degree 2 and $p$ in $\mathbf{x}$ (Euler's theorem), the multiplier takes the even simpler form $\mu = -(2\mathcal{L}_2 + p\,\mathcal{L}_p)/N$. To take advantage of this simple formula, in the following we work with the contributions of the two channels to the loss function separately, and we impose the constraint on their sum only at the end.
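A short numerical check of this Euler-theorem identity, using the $p=3$ conventions of our earlier sketch (the factor $1/N$ matches the constraint written as $|\mathbf{x}|^2 = N$):

```python
rng = np.random.default_rng(2)
N, delta2, delta3 = 64, 1.0, 1.0
x_star, Y, T = spiked_matrix_tensor(N, delta2, delta3, rng)
x = rng.standard_normal(N); x *= np.sqrt(N) / np.linalg.norm(x)
L2 = -x @ Y @ x / (2 * delta2 * np.sqrt(N))
L3 = -np.sqrt(2) * np.einsum('ijk,i,j,k->', T, x, x, x) / (6 * delta3 * N)
grad = -(Y @ x) / (delta2 * np.sqrt(N)) \
       - np.sqrt(2) * np.einsum('ijk,j,k->i', T, x, x) / (2 * delta3 * N)
# mu from projecting the gradient equals -(2 L2 + 3 L3)/N:
print(-x @ grad / N, -(2 * L2 + 3 * L3) / N)
```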

The use of Cartesian coordinates allows us to evaluate easily the moments and covariances by means of standard derivatives:

(22)
(23)

Taking derivatives of these equations gives all the covariances of the loss, the gradient and the Hessian. For instance, we can easily see that the covariance of the Hessian is given by:

(24)

where the two reference frames are those associated with replicas $a$ and $b$, respectively.

Remark 1 (annealed Hessian)

In particular, notice that if there is only one replica, using an orthogonal basis whose last direction is aligned with the replica and projecting on the sphere by discarding that coordinate, we obtain a simple expression:

(25)

with the deltas representing Kronecker deltas. This is the expression of a GOE matrix. We can likewise compute the mean by deriving twice with respect to a pair of coordinates. Following [16] we make another convenient choice of basis, imposing that the signal lies in the space spanned by the first and the last basis vectors. This gives,

(26)

which, when the overlap with the signal vanishes, equals

(27)

Wrapping together Eq. (25), Eq. (27) and the expression for the Lagrange multiplier, which acts as a translation, we obtain the Hessian presented in the main text, Eq. (6). Observe, however, that the Hessian we are interested in is not simply the Hessian of the loss, but rather the Hessian of the loss conditioned on a given loss and a given (null) gradient. Using Eq. (23) to compute the covariance of the Hessian and the loss, and of the Hessian and the gradient, in this basis, one observes that these random variables are uncorrelated. Thus the conditioning does not affect the distribution of the Hessian of the loss, and Eq. (6) is recovered.

Eq. (22) and Eq. (23) are the basic ingredients required to continue with the analysis. In the next two sections we first compute the joint density of the loss and its gradient, and then compute the expected value of the determinant of the Hessians. In the final section we put the results together, obtaining the complexities already presented in the summary A.1.

A.2.1 Joint probability density.

In order to evaluate the joint probability density we focus on the covariance matrix of the loss and its gradient, which, using Eq. (23), is given by:

(28)

The joint density corresponds to the probability of observing a zero gradient on the sphere and a given loss in this multivariate Gaussian variable. Thus, taking into account the first moments of the loss and the gradient, obtained from Eq. (22), we define the auxiliary vector

(29)

The probability density is given by

(30)

This expression can be evaluated by observing that there is a small set of vectors that is closed under the action of the covariance matrix Eq. (28). This set is composed of the following four vectors

(31)
(32)
(33)
(34)

where $\mathbf{0}$ denotes a null vector of the appropriate dimension. Indeed, the auxiliary vector can be rewritten in terms of the elements of this newly defined set of vectors as follows

(35)

At this point we exploit the fact that this set of vectors is closed under the action of the covariance matrix: we can invert its action on the set alone, without the need to evaluate the inverse of the full covariance matrix. Using this trick, the integrand in Eq. (30) can be evaluated. The result contains the dependence on the configurations of the replicas only through their mutual overlaps and the overlap of each of them with the ground truth, i.e. the magnetisation. Hence, in this formulation the problem of evaluating a free integral over vectors on the sphere has been translated into the task of evaluating an integral over the possible choices of the overlap matrix, provided that we include the multiplying factor that comes from the volume of configurations compatible with that choice of overlaps and the condition on the magnetisations.

The next step is to make an ansatz on the form of the matrix of these overlaps, which must be consistent with the condition on the vector of magnetisations required in the Kac-Rice formula. The simplest ansatz, called the replica symmetric ansatz, assumes that the overlaps of different replicas are independent of the replica indices, i.e.

$Q_{aa} = 1, \qquad Q_{ab} = q \;\; (a \neq b), \qquad m_a = m. \qquad$ (36)

Note that the replica symmetric ansatz is compatible with the condition imposed in the Kac-Rice formula. Within this ansatz the probability density can be evaluated as a function of $q$ and $m$ for arbitrary $n$, and the analytic continuation $n \to 0$ can finally be taken to evaluate the quenched complexity. The expression for generic $q$ is too long and convoluted to be reported here. However, in the limit $q \to 0$ it reduces to the expression of the probability density of losses and gradients evaluated in the annealed computation, which has the following nice expression

(37)

We must also consider the normalisation of the density, which is given by

(38)

Finally, we come back to the volume term. Constraining the configurations to a given overlap with each other and with the ground truth produces a volume term that can be easily evaluated as

(39)

and for one single replica (which is what is needed in the computation of the annealed complexity) it is simply

(40)

Under the replica symmetric assumption we make a Laplace approximation that allows us to evaluate the quenched complexity as an extremisation, over the overlap $q$, of the entire expression, which depends on $q$ through the volume term and the probability density. An interesting remark concerns the limit $q \to 0$ of the quenched joint probability density: in that case the quenched and annealed joint probabilities coincide modulo a volume factor. We checked numerically that as $m \to 0$ the extremising value of $q$ goes to zero, which is why the quenched RS complexity coincides with the annealed one in the regime used in the main text.