1 Introduction
Nonconvex formulations are the most popular approach for a number of problems across high-dimensional statistics and machine learning. Over the last few years, substantial effort has been devoted to establishing rigorous guarantees for these methods in the context of important applications. A small subset of examples includes matrix completion
[KMO10, GLM16], phase retrieval [CC17, SQW16], high-dimensional regression with missing data [LW12], two-layer neural networks
[JSA15, ZSJ17], and so on. The general picture that emerges from these studies –as formalized in [MBM16]– is that nonconvex losses can sometimes be ‘benign,’ and allow for nearly optimal statistical estimation using gradient descent-type optimization algorithms. Roughly speaking, this happens when the population risk does not have flat regions, i.e. regions in which the gradient is small and the Hessian is nearly rank-deficient.

In this paper we explore the flip side of this picture, namely what happens when the population risk has large ‘flat regions.’ We focus on a simple problem, tensor principal component analysis under the spiked tensor model, and show that the empirical risk can easily become extremely complex in these cases. This picture matches recent computational complexity results on the same model.
The spiked tensor model [MR14]
captures –in a highly simplified fashion– a number of statistical estimation tasks in which we need to extract information from a noisy highdimensional data tensor, see e.g.
[LL10, Mør11, LMWY13, KSS13]. We are given a tensor Y ∈ (ℝⁿ)^{⊗k} of the form

(1) Y = λ v₀^{⊗k} + W,
where W ∈ (ℝⁿ)^{⊗k} is a noise tensor, and we would like to estimate the unit vector v₀ ∈ ℝⁿ. The parameter λ ≥ 0 corresponds to the signal-to-noise ratio. The noise tensor W is distributed as W_{i₁⋯i_k} = (1/k!) Σ_σ G_{i_{σ(1)}⋯i_{σ(k)}}, where the sum runs over permutations σ of the set {1, …, k}, and the (G_{i₁⋯i_k}) are i.i.d. centered Gaussians. Throughout the paper we assume k ≥ 3.
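To fix ideas, here is a small numerical sketch of sampling from the model (1). The normalization of the noise (i.i.d. N(0, 1/n) entries averaged over index permutations) is an assumption made for illustration, not necessarily the paper's exact convention, and all function names are ours.

```python
import math
from itertools import permutations

import numpy as np

def spiked_tensor(n, k, lam, rng):
    """Sample Y = lam * v0^(x k) + W under an assumed normalization:
    W symmetrizes a tensor of i.i.d. N(0, 1/n) entries over index permutations."""
    v0 = rng.standard_normal(n)
    v0 /= np.linalg.norm(v0)                      # unit-norm planted vector
    G = rng.standard_normal((n,) * k) / np.sqrt(n)
    # Average G over all k! permutations of its axes to get a symmetric tensor.
    W = sum(np.transpose(G, p) for p in permutations(range(k))) / math.factorial(k)
    signal = v0
    for _ in range(k - 1):                        # build the rank-one tensor v0^(x k)
        signal = np.multiply.outer(signal, v0)
    return lam * signal + W, v0
```

Both the signal and the symmetrized noise are invariant under permutations of the indices, so the sampled Y is a symmetric tensor.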
We say that the weak recovery problem is solvable for this model if there exists an estimator v̂ (a measurable function of Y) such that

(2) lim inf_{n→∞} E{|⟨v̂(Y), v₀⟩|} ≥ ε,

for some ε > 0. It was proven in [MR14] that weak recovery is solvable provided λ ≥ c for a suitable constant c, and in [MRZ15] that it is unsolvable for λ ≤ c′, for some constant c′ > 0. In fact, for λ ≤ c′ it is altogether impossible to distinguish between the distribution (1) and the null model λ = 0. A sharp threshold for the weak recovery problem was established in [LML17] (see also [BMM17] for related results), and better lower bounds for the hypothesis testing problem were proved in [PWB16].
In light of these contributions, it is fair to say that optimal statistical estimation for the model (1) is well understood. In contrast, many questions are still open for what concerns computationally efficient procedures. Consider the maximum likelihood estimator, which requires solving

(3) maximize ⟨Y, v^{⊗k}⟩
    subject to ‖v‖₂ = 1.
It was shown in [MR14] that the maximum likelihood estimator achieves weak recovery, cf. Eq. (2), provided that λ ≥ c₀ for some constant c₀.¹ However, solving the problem (3) (maximizing a homogeneous degree-k polynomial over the unit sphere) is NP-hard for all k ≥ 3 [BGL16a].

¹Indeed, an exact characterization of the critical value should be possible using the ‘one-step replica symmetry breaking’ formula proven in [Tal06]. A non-rigorous analysis of the implications of this formula was carried out in [GS00].
Note that the population risk associated to the problem (3) is

(4) E{⟨Y, v^{⊗k}⟩} = λ ⟨v₀, v⟩^k.
For k ≥ 3, the (Riemannian) gradient and Hessian of this population risk vanish on the hyperplane orthogonal to v₀: {v : ⟨v, v₀⟩ = 0}. In the intuitive language used above, the population risk has a large flat region. Since most of the volume of the sphere concentrates around this hyperplane [Led05], this is expected to have a dramatic impact on the optimization problem (3).

Polynomial-time computable estimators have been studied in a number of papers. In particular, [MR14] considers a spectral algorithm based on tensor unfolding and proves that it succeeds for k even, provided λ ≥ c_k n^{(k−2)/4}. (Here and below, we denote by c_k a constant that might depend on k but is independent of n.) This result was generalized in [HSS15] to arbitrary k, using a sophisticated semidefinite programming relaxation from the sum-of-squares hierarchy. A lower-complexity spectral algorithm that succeeds under the same condition was developed in [HSSS16], and further results can be found in [ADGM16, BGL16b]. However, no polynomial-time algorithm is known that achieves weak recovery for λ = o(n^{(k−2)/4}), and it is possible that statistical estimation in the spiked tensor model is hard in this regime.
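The unfolding idea is easy to illustrate for k = 3: flatten Y into an n × n² matrix and take its top left singular vector. The following is a hedged sketch of that idea (our function names; the test-style usage below uses non-symmetrized noise for simplicity), not the exact algorithm or constants of [MR14].

```python
import numpy as np

def unfold_estimate(Y):
    """Spectral estimate of the spike via tensor unfolding (k = 3 sketch):
    reshape the tensor into an n x n^2 matrix and return its top left
    singular vector, computed as the top eigenvector of M M^T."""
    n = Y.shape[0]
    M = Y.reshape(n, n * n)
    _, V = np.linalg.eigh(M @ M.T)   # eigenvalues in ascending order
    return V[:, -1]
```

The estimate is defined up to sign, so its quality is measured by |⟨v̂, v₀⟩|.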
A large gap between known polynomialtime algorithms and statistical limits arises in the tensor completion problem, which shares many similarities with the spiked tensor model [GRY11, YZ15, MS16]. In the setting of tensor completion, hardness under Feige’s hypothesis was proven in [BM16] for a certain regime of the number of observed entries.
Here we reconsider the maximum likelihood estimator and we explore the landscape of the optimization problem (3). In what regime is it hard to maximize the function v ↦ ⟨Y, v^{⊗k}⟩ for a typical realization of the random tensor Y? In [MR14] a power iteration algorithm was studied that attempts to compute the maximum likelihood estimator, and it was proven to be successful for λ ≥ c_k n^{(k−2)/2}. What is the origin of this threshold? In this paper we compute the expected number of critical points of the likelihood function to the leading exponential order.
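For concreteness, a minimal sketch of tensor power iteration for k = 3 (the map v ↦ Y(v, v, ·), normalized at each step); the iteration count and names are ours, and this is an illustration rather than the exact procedure analyzed in [MR14].

```python
import numpy as np

def tensor_power_iteration(Y, v_init, iters=100):
    """Tensor power iteration for a k = 3 tensor:
    repeat v <- Y(v, v, .) / ||Y(v, v, .)||."""
    v = v_init / np.linalg.norm(v_init)
    for _ in range(iters):
        w = np.einsum('ijk,j,k->i', Y, v, v)  # contract two slots with v
        v = w / np.linalg.norm(w)
    return v
```

With a strong spike and a mildly informative initialization, the iteration locks onto the planted vector; with an uninformative initialization it can instead be trapped, which is the phenomenon discussed below.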
Let us summarize the qualitative picture that emerges from our results. For clarity of exposition, we summarize only our results on local maxima, but similar results will be presented for generic critical points.
The expected number of local maxima grows exponentially with the dimension n. We compute the exponential growth rate, denoted by Σ_max(u, m), as a function of the value u of the cost function and of the scalar product m = ⟨v, v₀⟩. Namely, the expected number of local maxima with value close to u and scalar product close to m grows as exp{n Σ_max(u, m)}, with Σ_max given explicitly below. The exponent Σ_max and its variants are referred to as ‘complexity’ functions. In Figure 1 we plot the exponential growth rate of the number of local maxima as a function of the scalar product m, for the case k = 3. (We also plot the analogous quantity for general critical points.)
The expected number of local maxima with scalar product m ≈ 0, i.e. lying close to the hyperplane orthogonal to the unknown vector v₀, is exponentially large. The complexity function decreases as |m| increases, i.e. as we move away from this hyperplane, and eventually vanishes.
For λ sufficiently large (in particular, for λ larger than an explicit threshold given in Section 2.4), the complexity reveals an interesting structure. It is positive in an interval of values of m around zero, and becomes nonpositive outside this interval. However, it increases again and touches zero for m close to one (for k even it also touches zero at the opposite value of m, by symmetry). In other words, all the local maxima are either very close to the unknown vector v₀ (and to the global maximum) or they lie in a narrow spherical annulus nearly orthogonal to v₀.
It is interesting to discuss the behavior of local ascent optimization algorithms in such a landscape. While at this point the discussion is necessarily heuristic, it points at some interesting directions for future work. The expected exponential number of local maxima in the annulus suggests that algorithms can converge to a local maximum that is well correlated with v₀ only if they are initialized outside that annulus. In other words, the initialization must have a sufficiently large scalar product with v₀. If no side information is available on v₀, a random initialization will be used. This achieves |⟨v, v₀⟩| of order n^{−1/2} with positive probability, and hence will escape the uninformative local maxima provided λ ≥ c_k n^{(k−2)/2}. Remarkably, this is the same scaling as the threshold for power iteration obtained in [MR14]. It would be interesting to make this connection rigorous.

Let us emphasize that our results only concern the expected number of critical points. As is customary with random variables that fluctuate on the exponential scale, this is not necessarily close to the typical number of critical points. While we expect that several qualitative features found in this work will hold when considering the typical number, a rigorous justification is still open (see Section 3 for further discussion of this point).

2 Main results
Our main results concern the number of critical points and the number of local maxima of the function f(v) = ⟨Y, v^{⊗k}⟩ introduced in Eq. (3), where Y is distributed as per Eq. (1).
Throughout, we denote by ∇f(v) and ∇²f(v) the Euclidean gradient and Hessian of f at v, and by grad f(v) and Hess f(v) the Riemannian gradient and Hessian of f at v ∈ S^{n−1}, the unit sphere in ℝⁿ. The completed real line is denoted by ℝ̄ ≡ ℝ ∪ {±∞}. For a set A, we denote by cl(A) its closure, and by int(A) its interior.
2.1 Complexity of critical points
For any Borel sets B ⊆ ℝ and D ⊆ [−1, 1], we define Crt_n(B, D) to be the number of critical points of f with function value in B and correlation with v₀ in D:

(5) Crt_n(B, D) = #{v ∈ S^{n−1} : grad f(v) = 0, f(v) ∈ B, ⟨v, v₀⟩ ∈ D}.
We define the complexity function Σ(u, m) as
(6) 
where
(7) 
Theorem 2.1. Let B ⊆ ℝ and D ⊆ [−1, 1] be Borel sets, and assume λ ≥ 0 is fixed. Then, we have

(8) lim sup_{n→∞} (1/n) log E[Crt_n(B, D)] ≤ sup_{(u,m) ∈ cl(B) × cl(D)} Σ(u, m),

(9) lim inf_{n→∞} (1/n) log E[Crt_n(B, D)] ≥ sup_{(u,m) ∈ int(B) × int(D)} Σ(u, m),

where Σ is the function defined in Eq. (6).
2.2 Complexity of local maxima
For any Borel sets B ⊆ ℝ and D ⊆ [−1, 1], we define Crt_n^{max}(B, D) to be the number of local maxima of f with function value in B and correlation with v₀ in D:

(10) Crt_n^{max}(B, D) = #{v ∈ S^{n−1} : v is a local maximum of f, f(v) ∈ B, ⟨v, v₀⟩ ∈ D}.
We define the function Σ_max(u, m) as
(11) 
where
(12) 
We also note that
(13) 
Theorem 2.2. Let B ⊆ ℝ and D ⊆ [−1, 1] be Borel sets, and assume λ ≥ 0 is fixed. Then, we have

(14) lim sup_{n→∞} (1/n) log E[Crt_n^{max}(B, D)] ≤ sup_{(u,m) ∈ cl(B) × cl(D)} Σ_max(u, m),

(15) lim inf_{n→∞} (1/n) log E[Crt_n^{max}(B, D)] ≥ sup_{(u,m) ∈ int(B) × int(D)} Σ_max(u, m),

where Σ_max is the function defined in Eq. (11).
2.3 Evaluating the complexity function
The expressions for Σ and Σ_max given in the previous section can be easily evaluated numerically: the figures in this section present such evaluations. Throughout this section we consider k = 3, but the behavior for other values of k is qualitatively similar (with the important difference that, for k even, the landscape is symmetric under a change of sign of v). In Figure 1 we plot the regions of the plane in which Σ and Σ_max are nonnegative. By Markov's inequality, the probability that any critical point or any local maximum is present outside these regions is exponentially small.
As anticipated in the introduction, we can identify two sets of local maxima:

Uninformative local maxima. These have small value u of the objective and small correlation m with the ground truth v₀. They are also exponentially more numerous, and we expect them to trap local ascent algorithms.

Good local maxima. These have large value u of the objective and large correlation m with the ground truth v₀. Reaching such a local maximum results in accurate estimation.
Figure 3 shows the evolution of the two ‘projections’ of Σ_max, which give the exponential growth rate of the number of local maxima as a function of the objective value u and of the scalar product m, respectively. Similar plots for the total number of critical points are found in Figure 4. We can identify several regimes of the signal-to-noise ratio λ:

For λ small enough, we know that the landscape is qualitatively similar to the pure noise case λ = 0: local maxima are uninformative. While they are spread along the m direction, this is purely because of random fluctuations. Local maxima with m ≈ 0 are exponentially more numerous and have larger value.

As λ crosses a threshold, the complexity develops a secondary maximum that touches zero at a value of m close to one. This signals a group of local maxima (or possibly only one of them) that are highly correlated with v₀. These are good local maxima, but they have smaller value than the best uninformative local maxima. Maximum likelihood estimation still fails.

Above a third threshold in λ, good local maxima acquire a larger value of the objective than uninformative ones. Maximum likelihood succeeds. However, the most numerous local maxima are still uncorrelated with the signal and are likely to trap algorithms.
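As a crude numerical counterpart to this discussion (not the Kac-Rice computation of this paper), one can count distinct endpoints of projected gradient ascent started from many random points on the sphere. All parameters and names below are illustrative choices of ours.

```python
import numpy as np

def local_maxima_by_ascent(Y, n_starts=40, iters=1500, step=0.05, tol=0.05, seed=0):
    """Collect distinct near-critical ascent endpoints (approximate local
    maxima) of f(v) = <Y, v x v x v> on the unit sphere, for a symmetric
    k = 3 tensor Y, via projected gradient ascent from random starts.
    A numerical illustration only."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    found = []
    for _ in range(n_starts):
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        for _ in range(iters):
            g = 3.0 * np.einsum('ijk,j,k->i', Y, v, v)  # Euclidean gradient
            g -= (g @ v) * v                             # project onto tangent space
            v += step * g
            v /= np.linalg.norm(v)
        # deduplicate endpoints that landed in the same basin
        if not any(np.linalg.norm(v - u) < tol for u in found):
            found.append(v)
    return found
```

Even for pure noise and small n, such a search typically returns several distinct endpoints, in line with the exponential proliferation of local maxima described above.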
Let us emphasize once more that this qualitative picture is obtained by considering the expected number of critical points. In order to confirm that it holds for a typical realization of Y, it would be important to compute the typical number as well.
2.4 Explicit formula for complexity of critical points at a given location
The projection of the complexity onto the m coordinate, which gives the exponential growth rate of the expected number of critical points at a given scalar product m, has a simple explicit formula on the hemisphere m ≥ 0. This is derived using elementary calculus, by analyzing Eq. (6).
Proposition 2.3. The projection has the following explicit formula for m ∈ [0, 1]:
(16) 
where
(17)  
(18)  
(19)  
(20) 
Analysis of this formula confirms some of the qualitative observations from Section 2.3. For λ very small, the landscape is qualitatively similar to the pure noise case λ = 0. For larger λ, the function captures the behavior of possible “good” critical points which may exist at m close to one. Further analysis of this function is carried out in Proposition 2.4.
Proposition 2.4. The function is nonpositive for all m. Moreover, it vanishes if and only if λ satisfies:
(21) 
In particular, if we define λ* by:

(22)

then for λ < λ* the function is strictly negative, while for λ > λ* it has a unique zero in the domain under consideration.
The critical value λ* identified in Proposition 2.4 represents a qualitative change in the energy landscape. When λ < λ*, “good” critical points are exponentially rare. On the other hand, when λ > λ* the function has a unique zero. This is the only location in the region m close to one where critical points are not exponentially rare, and it represents the best correlation with the signal that is achievable.
2.5 Proof ideas
The proof of Theorems 2.1 and 2.2 relies on a representation of the expected number of critical points of a given index using the Kac-Rice formula. This approach was pioneered in [Fyo04, ABČ13] to study the pure noise case λ = 0 of the present problem.
Evaluating the expression produced by the Kac-Rice formula requires estimating the expected determinant of the Hessian. In the case λ = 0 considered in [ABČ13], the Hessian is distributed as a shifted GOE matrix, i.e. a matrix from the Gaussian Orthogonal Ensemble plus a multiple of the identity. This fact, together with the explicitly known joint distribution of the eigenvalues of a GOE matrix, is used in [ABČ13] to express the expected determinant in terms of the distribution of one eigenvalue, and a normalization that is computed using Selberg's integral.

In the present case, the Hessian is distributed as a rank-one deformation of the previous matrix. Instead of an exact representation, we use the asymptotic distribution of the eigenvalues of this matrix, as well as its large deviation properties, obtained in [Mai07].
3 Related literature
The complexity of random functions has been the object of a large amount of work within statistical physics, in particular in the context of mean field glasses and spin glasses. The function of interest is –typically– the Hamiltonian or energy function, and its local minima are believed to capture the long-time behavior of dynamics, as well as thermodynamic properties.
In particular, the energy function (3) was first studied by Crisanti and Sommers in [CS95], for the pure noise case λ = 0. This is referred to as the spherical k-spin model in the physics literature. The paper [CS95] uses nonrigorous methods from statistical physics to derive the complexity function, which corresponds to the λ = 0 complexity in the notation used here. An alternative derivation using random matrix theory was proposed by Fyodorov
[Fyo04]. Connections with thermodynamic quantities can be found in [CHS93]. The impact of the rough energy landscape on the behavior of Langevin dynamics was studied in a number of papers, see e.g. [CHS93, BCKM98].

A mathematically rigorous calculation of the expected number of critical points of any index –and the associated complexity– was first carried out in [ABČ13], again for the pure noise case λ = 0. (See also [AB13] for mathematically rigorous results on the complexity of some more general “pure noise” random surfaces.) As mentioned above, the expected number of critical points is not necessarily representative of typical instances. However, for the pure noise case λ = 0, it was expected that the number of critical points concentrates around its expectation. This was recently confirmed by Subag via a second moment calculation
[Sub15]. (See also [Sub17] and [SZ17] for additional information about the landscape geometry.)

Finally, the typical number of critical points of the spiked tensor model and variants is derived in the forthcoming article [BBRC18] using nonrigorous methods from statistical physics. This computation indicates that the typical and expected numbers of critical points do not always coincide for the spiked tensor model, contrary to what happens in the pure noise case λ = 0. Moreover, [BBRC18] studies generalized spiked models for which a large number of additional local maxima appear that are strongly correlated with the spike.
4 Proofs
In this section we prove Theorems 2.1 and 2.2. We begin by introducing some definitions and notations in Section 4.1. We next state some useful lemmas in Section 4.2, with proofs in Sections 4.3 and 4.4. Finally, we prove Theorems 2.1 and 2.2 in Sections 4.5 and 4.6.
4.1 Definitions and notations
We will generally use lower case symbols (e.g. a) for scalars, lower case boldface symbols (e.g. v) for vectors, and upper case boldface symbols (e.g. A) for matrices. The identity matrix in n dimensions is denoted by I_n, and the canonical basis of ℝⁿ by e₁, …, e_n. Given a vector v, we write P_v for the orthogonal projector onto the subspace spanned by v, and P_v^⊥ for the projector onto the orthogonal subspace.

For a symmetric matrix A, we denote by λ₁(A) ≥ λ₂(A) ≥ ⋯ ≥ λ_n(A) the eigenvalues of A in decreasing order. We will also write λ_max(A) = λ₁(A) and λ_min(A) = λ_n(A) for the maximum and minimum eigenvalues.
We denote by GOE(n) the Gaussian Orthogonal Ensemble in n dimensions. Namely, for a symmetric random matrix A ∈ ℝ^{n×n}, we write A ~ GOE(n) if the entries (A_{ij})_{i≤j} are independent, with A_{ij} ~ N(0, 1/(2n)) for i < j and A_{ii} ~ N(0, 1/n).
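Under this convention (off-diagonal variance 1/(2n), diagonal variance 1/n — our reading of the normalization, consistent with the [−√2, √2] support mentioned in the footnote of Section 4.2), sampling from GOE(n) is a two-liner:

```python
import numpy as np

def sample_goe(n, rng):
    """A ~ GOE(n): symmetric, A_ij ~ N(0, 1/(2n)) off-diagonal, A_ii ~ N(0, 1/n)."""
    B = rng.standard_normal((n, n))
    return (B + B.T) / (2.0 * np.sqrt(n))
```

With this normalization the empirical spectral distribution concentrates on the interval [−√2, √2] for large n.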
For a sequence of nonnegative functions f_n on a set D, n ≥ 1, we say that (f_n) is exponentially finite on D if

(23) lim sup_{n→∞} (1/n) log f_n(x) < ∞ for all x ∈ D.

We say that (f_n) is exponentially vanishing on D if

(24) lim_{n→∞} (1/n) log f_n(x) = −∞ for all x ∈ D.

We say that (f_n) is exponentially trivial on D if

(25) lim_{n→∞} (1/n) log f_n(x) = 0 for all x ∈ D.

We say that (f_n) is exponentially decaying on D if

(26) lim sup_{n→∞} (1/n) log f_n(x) < 0 for all x ∈ D.
For a metric space X, we denote the open ball centered at x with radius r by B(x, r). On ℝⁿ, we will always use the Euclidean distance. For v ∈ S^{n−1} and r > 0, the corresponding open ball in S^{n−1} is denoted by B_{S^{n−1}}(v, r). Let 𝒫(ℝ) be the space of probability measures on ℝ. We will equip 𝒫(ℝ) with the Dudley (bounded-Lipschitz) distance: for two probability measures μ, ν ∈ 𝒫(ℝ), this is defined as

(27) d(μ, ν) = sup { |∫ f dμ − ∫ f dν| : ‖f‖_∞ + ‖f‖_Lip ≤ 1 },

where the supremum is over bounded Lipschitz functions f: ℝ → ℝ. The open ball B(μ, r) ⊆ 𝒫(ℝ) contains the probability measures with Dudley distance less than r to μ.
Suppose μ is a probability measure on ℝ. We denote by G_μ the Stieltjes transform of μ, defined by (here conv denotes the convex hull and ℂ₊ the upper half plane)

(28) G_μ(z) = ∫ μ(dx) / (z − x), z ∈ ℂ₊ ∪ (ℝ \ conv(supp(μ))).

G_μ is always injective, so we can define its inverse G_μ^{−1} on the image of G_μ. We denote by R_μ the R-transform of μ, defined by

(29) R_μ(z) = G_μ^{−1}(z) − 1/z.

We denote by μ_sc the semicircular law, with density (1/π)√(2 − x²) on [−√2, √2]. The Stieltjes transform of the semicircular law is

(30) G_{μ_sc}(z) = z − √(z² − 2),

and its R-transform is

(31) R_{μ_sc}(z) = z/2.
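These transforms can be checked numerically in the normalization assumed here (semicircle supported on [−√2, √2]); the sample size and tolerances below are arbitrary.

```python
import numpy as np

# Empirical Stieltjes transform of a large GOE matrix vs. the closed form
# G(z) = z - sqrt(z^2 - 2) for the semicircle law on [-sqrt(2), sqrt(2)].
n = 1000
rng = np.random.default_rng(0)
B = rng.standard_normal((n, n))
A = (B + B.T) / (2.0 * np.sqrt(n))        # GOE(n) in this normalization
eigs = np.linalg.eigvalsh(A)

z = 2.0                                    # a real point outside the support
G_emp = float(np.mean(1.0 / (z - eigs)))   # (1/n) sum_i 1/(z - lambda_i)
G_sc = z - np.sqrt(z * z - 2.0)

# R-transform consistency: R(G(z)) = G^{-1}(G(z)) - 1/G(z) = z - 1/G(z),
# which should equal G(z)/2 if R(w) = w/2.
R_at_G = z - 1.0 / G_sc
```

The first comparison tests the random matrix against the analytic Stieltjes transform; the second is a purely algebraic identity between Eqs. (29)-(31).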
4.2 Preliminary lemmas
We start by stating a form of the Kac-Rice formula that will be a key tool for our proof. Essentially the same statement was used in [ABČ13], and we refer to [AT09] for general proofs and broader context. Let f be a centered Gaussian field on S^{n−1}, and let {(U_i, φ_i)}_{i≤I} be a finite atlas on S^{n−1}. Assume that, for all v and all i, the joint distribution of the field and its first two derivatives in the chart φ_i is nondegenerate, and

(32)

for some positive constants. For Borel sets B ⊆ ℝ and D ⊆ [−1, 1], let

(33)

Then, using ω to denote the usual surface measure on S^{n−1}, and denoting by p_v the density of the gradient grad f(v), we have

(34) E[Crt(B, D)] = ∫_{S^{n−1}} E{ |det Hess f(v)| 1{f(v) ∈ B, ⟨v, v₀⟩ ∈ D} | grad f(v) = 0 } p_v(0) ω(dv).
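The Kac-Rice mechanism is easiest to see in one dimension. The following self-contained sketch (unrelated to the sphere-valued field of this paper) checks the expected zero count of a random trigonometric polynomial against the Rice prediction E[#zeros on [0, 2π)] = 2√(λ₂/λ₀), where λ₀ = Var f(t) and λ₂ = Var f′(t); the Monte Carlo parameters are arbitrary.

```python
import numpy as np

# f(t) = sum_{j=1}^{J} (a_j cos(jt) + b_j sin(jt)), a_j, b_j i.i.d. N(0,1):
# lambda_0 = J, lambda_2 = sum_j j^2, so E[#zeros] = 2*sqrt(lambda_2/lambda_0).
J = 3
theory = 2.0 * np.sqrt(sum(j * j for j in range(1, J + 1)) / J)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 4000, endpoint=False)
cos_basis = np.array([np.cos(j * t) for j in range(1, J + 1)])
sin_basis = np.array([np.sin(j * t) for j in range(1, J + 1)])

counts = []
for _ in range(2000):
    f = rng.standard_normal(J) @ cos_basis + rng.standard_normal(J) @ sin_basis
    # zeros of the periodic sample path = sign changes around the circle
    counts.append(int(np.sum(np.sign(f) != np.sign(np.roll(f, -1)))))
mc = float(np.mean(counts))
```

The Monte Carlo average agrees with the Rice prediction 2√(14/3) ≈ 4.32 for J = 3, illustrating the counting identity that (34) generalizes to Gaussian fields on the sphere.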
The next lemma specializes the last formula to our specific choice of f, cf. Eq. (3). Its proof can be found in Section 4.3. We have
(35)  
(36) 
where
(37) 
is the area of the (n−1)-dimensional sphere, and p_v is the density of grad f(v) at the origin. Further, the joint distribution of f(v), grad f(v), and Hess f(v) is given by
(38)  
(39)  
(40) 
where the random terms appearing above are independent.
The next lemma provides a simplified expression. Its proof is deferred to Section 4.4. We have
(41)  
and
(42)  
where
(43)  
(44)  
(45)  
(46) 
Further, is exponentially trivial.
The next lemma contains a well known fact that we will use several times in the proofs. It follows immediately from the joint distribution of the eigenvalues in the GOE ensemble [AGZ10]; see for instance [Mai07]. [Joint density of the eigenvalues of the spiked model] Let M = A + θ vv^⊤, where A ~ GOE(n), θ ≥ 0, and v is a deterministic unit vector. The joint density of the eigenvalues of M is given by
(47) 
where λ = (λ₁, …, λ_n) denotes the vector of eigenvalues of M, and the last factor is the spherical integral, defined by

(48)

with the Haar probability measure on the orthogonal group of size n. Next, we state a lemma regarding the large deviations of the largest eigenvalue of the spiked model, proven in [Mai07]. (Notice that the formula in [Mai07] contains a typo, which is corrected here. Also, the normalization used here differs from the one in [Mai07]: here the empirical spectral distribution converges to a semicircle supported on [−√2, √2].) [Large deviation of the largest eigenvalue of the spiked model [Mai07]] Let M = A + θ vv^⊤, where A ~ GOE(n), and denote by λ₁(M) the largest eigenvalue of M. Then we have
(49)  
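While we cannot reproduce the rate function here, the normalization can be sanity-checked numerically: with the semicircle on [−√2, √2], the typical largest eigenvalue of the rank-one deformation A + θ vv^⊤ should be θ + 1/(2θ) once θ > 1/√2. This is the BBP-type formula rescaled to the present convention — our own rescaling, stated as an assumption.

```python
import numpy as np

n, theta = 2000, 1.5
rng = np.random.default_rng(0)
B = rng.standard_normal((n, n))
A = (B + B.T) / (2.0 * np.sqrt(n))        # GOE, spectrum ~ [-sqrt(2), sqrt(2)]
v = np.zeros(n)
v[0] = 1.0                                 # any unit vector (orthogonal invariance)
lmax = float(np.linalg.eigvalsh(A + theta * np.outer(v, v))[-1])
pred = theta + 1.0 / (2.0 * theta)         # BBP-type location of the top eigenvalue
```

For θ = 1.5 the prediction is 11/6 ≈ 1.833, and the sampled largest eigenvalue matches it up to finite-n fluctuations.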