The landscape of the spiked tensor model

11/15/2017
by Gérard Ben Arous, et al.

We consider the problem of estimating a large rank-one tensor u^⊗k ∈ (R^n)^⊗k, k ≥ 3, in Gaussian noise. Earlier work characterized a critical signal-to-noise ratio λ_Bayes = O(1) above which an ideal estimator achieves strictly positive correlation with the unknown vector of interest. Remarkably, no polynomial-time algorithm is known that achieves this goal unless λ ≥ C n^((k-2)/4), and even powerful semidefinite programming relaxations appear to fail for 1 ≪ λ ≪ n^((k-2)/4). In order to elucidate this behavior, we consider the maximum likelihood estimator, which requires maximizing a degree-k homogeneous polynomial over the unit sphere in n dimensions. We compute the expected number of critical points and local maxima of this objective function and show that it is exponential in the dimension n, and we give exact formulas for the exponential growth rate. We show that (for λ larger than a constant) critical points are either very close to the unknown vector u, or are confined in a band of width Θ(λ^(-1/(k-1))) around the maximum circle that is orthogonal to u. For local maxima, this band shrinks to be of size Θ(λ^(-1/(k-2))). These ‘uninformative’ local maxima are likely to cause the failure of optimization algorithms.


1 Introduction

Non-convex formulations are the most popular approach for a number of problems across high-dimensional statistics and machine learning. Over the last few years, substantial effort has been devoted to establishing rigorous guarantees for these methods in the context of important applications. A small subset of examples includes matrix completion [KMO10, GLM16], phase retrieval [CC17, SQW16], high-dimensional regression with missing data [LW12], two-layer neural networks [JSA15, ZSJ17], and so on. The general picture that emerges from these studies –as formalized in [MBM16]– is that non-convex losses can sometimes be ‘benign,’ and allow for nearly optimal statistical estimation using gradient descent-type optimization algorithms. Roughly speaking, this happens when the population risk does not have flat regions, i.e. regions in which the gradient is small and the Hessian is nearly rank-deficient.

In this paper we explore the flipside of this picture, namely what happens when the population risk has large ‘flat regions.’ We focus on a simple problem, tensor principal component analysis under the spiked tensor model, and show that the empirical risk can easily become extremely complex in these cases. This picture matches recent computational complexity results on the same model.

The spiked tensor model [MR14] captures –in a highly simplified fashion– a number of statistical estimation tasks in which we need to extract information from a noisy high-dimensional data tensor, see e.g. [LL10, Mør11, LMWY13, KSS13]. We are given a tensor Y of the form

(1)

where W is a noise tensor, and we would like to estimate the unit vector u. The parameter λ corresponds to the signal-to-noise ratio. The noise tensor W is a symmetric Gaussian tensor, obtained by averaging a tensor with i.i.d. Gaussian entries over all permutations of its k indices. Throughout the paper we take k ≥ 3.
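As an illustration, here is a minimal numerical sketch of the model (1), generating Y = λ u^⊗k + W with a symmetrized Gaussian noise tensor. The normalization used below (division by k!, unit-variance entries) is an assumption for illustration only, not necessarily the paper's exact convention.

import itertools
import math
import numpy as np

def spiked_tensor(n, k, lam, seed=None):
    """Return (Y, u) with Y = lam * u^{outer k} + W, W a symmetrized Gaussian tensor."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)

    # Rank-one signal u^{outer k}, built by repeated outer products.
    signal = u
    for _ in range(k - 1):
        signal = np.multiply.outer(signal, u)

    # Symmetrize an i.i.d. standard Gaussian tensor over all index permutations
    # (the 1/k! normalization is an assumption).
    g = rng.standard_normal((n,) * k)
    noise = sum(np.transpose(g, perm) for perm in itertools.permutations(range(k)))
    noise /= math.factorial(k)

    return lam * signal + noise, u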

We say that the weak recovery problem is solvable for this model if there exists an estimator (a measurable function of the data) such that

(2)

for some ε > 0. It was proven in [MR14] that weak recovery is solvable provided λ exceeds a large enough constant, and in [MRZ15] that it is unsolvable when λ is below a small enough constant. In fact, for λ small enough it is altogether impossible to distinguish between the distribution (1) and the pure-noise null model. A sharp threshold for the weak recovery problem was established in [LML17] (see also [BMM17] for related results), and better lower bounds for the hypothesis testing problem were proved in [PWB16].
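For concreteness, one standard formalization of the weak recovery requirement in Eq. (2), stated here as an assumption about the intended definition rather than as the paper's exact display, is:

\[
\liminf_{n\to\infty}\ \mathbb{E}\,\bigl|\langle \hat{u}(\mathbf{Y}),\, u\rangle\bigr| \;\ge\; \varepsilon
\qquad \text{for some } \varepsilon > 0 \text{ independent of } n .
\]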

In light of these contributions, it is fair to say that optimal statistical estimation for the model (1) is well understood. In contrast, many questions are still open for what concerns computationally efficient procedures. Consider the maximum likelihood estimator, which requires solving

maximize ⟨Y, x^⊗k⟩ (3)
subject to ‖x‖₂ = 1.

It was shown in [MR14] that the maximum likelihood estimator achieves weak recovery, cf. Eq. (2), provided that λ is larger than some constant.¹ However, solving the problem (3) (maximizing a homogeneous degree-k polynomial over the unit sphere) is NP-hard for all k ≥ 3 [BGL16a].

¹ Indeed, an exact characterization of this constant should be possible using the ‘one-step replica symmetry breaking’ formula proven in [Tal06]. A non-rigorous analysis of the implications of this formula was carried out in [GS00], yielding a prediction for its value.
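Problem (3) can be attacked by generic local-ascent heuristics. The sketch below (projected Riemannian gradient ascent, with an assumed step size and iteration budget) is meant only to make the objective and its gradient concrete; it is not the estimator analyzed in the paper.

import numpy as np

def contract(Y, x, times):
    """Contract the first `times` indices of the symmetric tensor Y with copies of x."""
    out = Y
    for _ in range(times):
        out = np.tensordot(out, x, axes=([0], [0]))
    return out

def objective(Y, x, k):
    return contract(Y, x, k)           # <Y, x^{outer k}> for symmetric Y

def euclidean_gradient(Y, x, k):
    return k * contract(Y, x, k - 1)   # gradient of the degree-k form

def projected_gradient_ascent(Y, k, steps=500, lr=0.05, seed=None):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(Y.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(steps):
        g = euclidean_gradient(Y, x, k)
        g -= np.dot(g, x) * x          # Riemannian gradient: project onto tangent space
        x = x + lr * g
        x /= np.linalg.norm(x)         # retract back to the unit sphere
    return x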

Note that the population risk associated to the problem (3) is

(4)

For k ≥ 3, the (Riemannian) gradient and Hessian of the population risk vanish on the hyperplane orthogonal to u, i.e. on the set of points with ⟨u, x⟩ = 0. In the intuitive language used above, the population risk has a large flat region. Since most of the volume of the sphere concentrates around this hyperplane [Led05], this is expected to have a dramatic impact on the optimization problem (3).
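To make the flatness claim concrete, assume for the sake of illustration that the population risk is proportional to λ⟨u, x⟩^k (a natural guess given (1) and (3); the exact normalization is not the paper's claim). Writing P_x^⊥ = I − xx^T for the projector orthogonal to x, a direct computation gives

\[
\nabla R(x) = k\lambda\,\langle u, x\rangle^{k-1}\, u,
\qquad
\mathrm{grad}\,R(x) = k\lambda\,\langle u, x\rangle^{k-1}\, P_{x}^{\perp}u,
\qquad
\nabla^{2} R(x) = k(k-1)\lambda\,\langle u, x\rangle^{k-2}\, u u^{\mathsf T}.
\]

For k ≥ 3 all three expressions vanish at every x with ⟨u, x⟩ = 0, so the Riemannian gradient and Hessian are identically zero on the whole equator {x ∈ S^{n-1} : ⟨u, x⟩ = 0}.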

Polynomial-time computable estimators have been studied in a number of papers. In particular, [MR14] considers a spectral algorithm based on tensor unfolding and proves that it succeeds for k even, provided λ ≥ C n^((k-2)/4). (Here and below, we denote by C a constant that might depend on k but is independent of n.) This result was generalized in [HSS15] to arbitrary k, using a sophisticated semidefinite programming relaxation from the sum-of-squares hierarchy. A lower-complexity spectral algorithm that succeeds under the same condition was developed in [HSSS16], and further results can be found in [ADGM16, BGL16b]. However, no polynomial-time algorithm is known that achieves weak recovery for λ ≪ n^((k-2)/4), and it is possible that statistical estimation in the spiked tensor model is hard in this regime.

A large gap between known polynomial-time algorithms and statistical limits arises in the tensor completion problem, which shares many similarities with the spiked tensor model [GRY11, YZ15, MS16]. In the setting of tensor completion, hardness under Feige’s hypothesis was proven in [BM16] for a certain regime of the number of observed entries.

Here we reconsider the maximum likelihood estimator and explore the landscape of the optimization problem (3). In what regime is it hard to maximize this function for a typical realization of the random tensor Y? In [MR14] a power iteration algorithm was studied that attempts to compute the maximum likelihood estimator, and it was proven to be successful for λ ≥ C n^((k-2)/2). What is the origin of this threshold at n^((k-2)/2)? In this paper we compute the expected number of critical points of the likelihood function to the leading exponential order.
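For reference, tensor power iteration of the kind studied in [MR14] repeatedly applies the map x ↦ Y[x^⊗(k−1), ·] and renormalizes. The sketch below is a generic implementation of that idea; the initialization and stopping rule are assumptions, not the exact scheme of [MR14].

import numpy as np

def tensor_power_iteration(Y, k, iters=100, seed=None):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(Y.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = Y
        for _ in range(k - 1):                       # contract k-1 indices with x
            y = np.tensordot(y, x, axes=([0], [0]))  # leaves the vector Y[x,...,x,·]
        x = y / np.linalg.norm(y)
    return x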

Figure 1: Complexity of the spiked tensor model of order at signal-to-noise ratio : exponential growth rate of the number of critical points , as a function of the scalar product . Left: complexity for the total number of critical points . Right: complexity for local maxima .

Let us summarize the qualitative picture that emerges from our results. For clarity of exposition, we summarize only our results on local maxima, but similar results will be presented about generic critical points.

The expected number of local maxima grows exponentially with the dimension . We compute the exponential growth rate, denoted by , as a function of the value of the cost function and of the scalar product . Namely, the expected number of local maxima with and is , with given explicitly below. The exponent and its variants , , and so on, are referred to as ‘complexity’ functions. In Figure 1 we plot , which is the exponential growth rate of the number of local maxima with scalar product , for the case , . (We also plot the analogous quantity for general critical points, .)

The expected number of local maxima with small scalar product, i.e. lying close to the space orthogonal to the unknown vector u, is exponentially large. The complexity function decreases as the scalar product increases, i.e. as we move away from this plane, and eventually vanishes.

For λ sufficiently large (in particular, above a threshold given explicitly in Section 2.4), the complexity reveals an interesting structure. It is positive in an interval of small scalar products and becomes non-positive outside this interval. However, it increases again and touches zero for scalar product close to one (for k even it also touches zero near minus one, by symmetry). In other words, all the local maxima are either very close to the unknown vector u (and to the global maximum) or they lie on a narrow spherical annulus orthogonal to u.

It is interesting to discuss the behavior of local ascent optimization algorithms in such a landscape. While at this point the discussion is necessarily heuristic, it points at some interesting directions for future work. The exponentially large expected number of local maxima in the annulus around the equator suggests that algorithms can converge to a local maximum that is well correlated with u only if they are initialized outside that annulus. In other words, the correlation of the initialization with u must exceed the width of the annulus, which is of order λ^(-1/(k-2)). If no side information is available on u, a random initialization will be used. This achieves a correlation of order n^(-1/2) with positive probability, and hence will escape the uninformative local maxima provided λ ≫ n^((k-2)/2). Remarkably, this is the same scaling as the threshold for power iteration obtained in [MR14]. It would be interesting to make this connection rigorous.

Let us emphasize that our results only concern the expected number of critical points. As is customary with random variables that fluctuate on the exponential scale, this is not necessarily close to the typical number of critical points. While we expect that several qualitative features found in this work will hold when considering the typical number, a rigorous justification is still open (see Section 3 for further discussion of this point).

The rest of the paper is organized as follows. We state formally our main results in Section 2, which also sketches the main ideas of the proofs. We will then review earlier literature in Section 3, and present proofs in Section 4.

2 Main results

Our main results concern the number of critical points and the number of local maxima of the objective function introduced in Eq. (3), where the data tensor is distributed as per Eq. (1).

Throughout, we denote by and be the Euclidean gradients and Hessians of at , and and be the Riemannian gradients and Hessians at . The completed real line is denoted by . For a set , we denote by its closure, and its interior.

2.1 Complexity of critical points

For any Borel sets and , we define to be the number of critical points of with function value in and correlation in :

(5)

We define the complexity function as

(6)

where

(7)

Theorem 2.1. For any Borel sets as above, assume λ is fixed. Then, we have

(8)
(9)

2.2 Complexity of local maxima

For any Borel set and , we define to be the number of local maxima of with function value in and correlation in :

(10)

We define the corresponding complexity function as

(11)

where

(12)

and , . We also note that

(13)

Theorem 2.2. For any Borel sets as above, assume λ is fixed. Then, we have

(14)
(15)

2.3 Evaluating the complexity function

Figure 2: Spiked tensor model with and . The black region corresponds to non-negative complexity: (left) and (right). The arrows indicate the point where the complexity touches zero, in correspondence with the ‘good’ local maxima.


Figure 3: Complexity (exponential growth rate of the expected number of local maxima) in the spiked tensor model with and (from top to bottom) . Left column: complexity as a function of the objective value , . Right column: complexity as a function of the scalar product , .


Figure 4: Complexity (exponential growth rate of the expected number of critical points) in the spiked tensor model with and (from top to bottom) . Left column: complexity as a function of the objective value , . Right column: complexity as a function of the scalar product , .

The expressions for the complexity functions given in the previous section can be easily evaluated numerically: the figures in this section demonstrate such evaluations. Throughout this section we consider one representative value of k, but the behavior for other values of k is qualitatively similar (with the important difference that, for k even, the landscape is symmetric under a change of sign of the correlation). In Figure 2 we plot the region of the plane in which the complexities of critical points and of local maxima are non-negative. By Markov's inequality, the probability that any critical point or any local maximum is present outside these regions is exponentially small.
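The Markov-inequality step behind the last claim is the usual first-moment bound. Schematically, writing Crt_n(B, Q) for the counts of Section 2.1 and Σ(B, Q) for the corresponding complexity exponent (generic names used here only for illustration),

\[
\mathbb{P}\bigl(\mathrm{Crt}_n(B,Q) \ge 1\bigr)
\;\le\; \mathbb{E}\bigl[\mathrm{Crt}_n(B,Q)\bigr]
\;=\; e^{\,n\,\Sigma(B,Q)\,+\,o(n)} \;\longrightarrow\; 0
\qquad \text{whenever } \Sigma(B,Q) < 0 .
\]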

As anticipated in the introduction, we can identify two sets of local maxima:

  • Uninformative local maxima. These have small (i.e. small value of the objective) and small (small correlation with the ground truth ). They are also exponentially more numerous and we expect them to trap descent algorithms.

  • Good local maxima. These have large (i.e. large value of the objective) and large (large correlation with the ground truth ). Reaching such a local maximum results in accurate estimation.

Figure 3 shows the evolution of the two ‘projections’ and which give the exponential growth rate of the number of local maxima as functions of the objective value and the scalar product . Similar plots for the total number of critical points are found in Figure 4. We can identify several regimes of the signal-to-noise ratio :

  1. For λ small enough, we know that the landscape is qualitatively similar to the pure noise case: local maxima are uninformative. While they are somewhat spread along the correlation direction, this is purely because of random fluctuations. Local maxima with vanishing correlation are exponentially more numerous and have larger value.

  2. As λ crosses a threshold, the complexity develops a secondary maximum that touches zero at correlation close to one. This signals a group of local maxima (or possibly only one of them) that are highly correlated with u. These are good local maxima, but they have smaller value than the best uninformative local maxima. Maximum likelihood estimation still fails.

  3. Above a third threshold in , good local maxima acquire a larger value of the objective than uninformative ones. Maximum likelihood succeeds. However, the most numerous local maxima are still uncorrelated with the signal and are likely to trap algorithms.

Let us emphasize once more that this qualitative picture is obtained by considering the expected number of critical points. In order to confirm that it holds for a typical realization of , it would be important to compute the typical number as well.

2.4 Explicit formula for complexity of critical points at a given location

The projection , which gives the expected number of critical points at a given scalar product , has a simple explicit formula in the hemisphere . This is derived using elementary calculus by analyzing equation (6).

The projection has the following explicit formula for :

(16)

where

(17)
(18)
(19)
(20)

Analysis of this formula confirms some of the qualitative observations from Section 2.3. For λ very small, the landscape is qualitatively similar to the pure noise case. For larger λ, the function captures the behavior of possible “good” critical points which may exist at large correlation. Further analysis of this function is carried out in Proposition 2.4.

The function is non-positive, for all . Moreover, if and only if satisfies:

(21)

In particular, if we set:

(22)

then we have that if , then and if , then has a unique zero in the domain .

The critical point identified in Proposition 2.4 represents a qualitative change in the energy landscape. When , then and “good” critical points are exponentially rare. On the other hand, when then and has a unique zero. This is the only location in the region where critical points are not exponentially rare, and this represents the best correlation with the signal that is achievable.

The proofs of the two propositions above are deferred to Appendix B.

2.5 Proof ideas

The proofs of Theorems 2.1 and 2.2 rely on a representation of the expected number of critical points of a given index using the Kac-Rice formula. This approach was pioneered in [Fyo04, ABČ13] to study the pure noise case of the present problem.

Evaluating the expression produced by the Kac-Rice formula requires estimating the expected determinant of the (Riemannian) Hessian. In the case considered in [ABČ13], this Hessian is distributed as a scaled and shifted matrix from the Gaussian Orthogonal Ensemble. This fact, together with the explicitly known joint distribution of the eigenvalues of GOE matrices, is used in [ABČ13] to express the expected determinant in terms of the distribution of one eigenvalue, and a normalization that is computed using Selberg's integral.

In the present case, the Hessian is distributed as a rank-one deformation of the previous matrix. Instead of an exact representation, we use the asymptotic distribution of the eigenvalues of this matrix, as well as their large deviation properties, obtained in [Mai07].

3 Related literature

The complexity of random functions has been the object of a large amount of work within statistical physics, in particular in the context of mean field glasses and spin glasses. The function of interest is –typically– the Hamiltonian or energy function, and its local minima are believed to capture the long-time behavior of dynamics, as well as thermodynamic properties.

In particular, the energy function (3) was first studied by Crisanti and Sommers in [CS95], for the pure noise case. This is referred to as the spherical p-spin model in the physics literature. The paper [CS95] uses non-rigorous methods from statistical physics to derive the complexity function, which corresponds to the complexity functions defined in Section 2 in the notation used here. An alternative derivation using random matrix theory was proposed by Fyodorov [Fyo04]. Connections with thermodynamic quantities can be found in [CHS93]. The impact of the rough energy landscape on the behavior of Langevin dynamics was studied in a number of papers, see e.g. [CHS93, BCKM98].

A mathematically rigorous calculation of the expected number of critical points of any index –and the associated complexity– was first carried out in [ABČ13], again for the pure noise case. (See also [AB13] for mathematically rigorous results on the complexity of some more general “pure noise” random surfaces.) As mentioned above, the expected number of critical points is not necessarily representative of typical instances. However, for the pure noise case, it was expected that the number of critical points concentrates around its expectation. This was recently confirmed by Subag via a second moment calculation [Sub15]. (See also [Sub17] and [SZ17] for additional information about the landscape geometry.)

Finally, the typical number of critical points of the spiked tensor model and variants is derived in the forthcoming article [BBRC18] using non-rigorous methods from statistical physics. This computation indicates that the typical and expected numbers of critical points do not always coincide for the spiked tensor model, contrary to what happens for the pure noise case. Also, [BBRC18] studies generalized spiked models for which a large number of additional local maxima appear that are strongly correlated with the spike.

4 Proofs

In this section we prove Theorems 2.1 and 2.2. We begin by introducing some definitions and notations in Section 4.1. We next state some useful lemmas in Section 4.2, with proofs in Sections 4.3 and 4.4. Finally, we prove Theorems 2.1 and 2.2 in Sections 4.5 and 4.6.

4.1 Definitions and notations

We will generally use lower-case symbols for scalars, lower-case boldface symbols for vectors, and upper-case boldface symbols for matrices. The identity matrix in n dimensions is denoted by I_n, and the canonical basis of R^n is denoted by e_1, …, e_n. Given a vector v, we write P_v for the orthogonal projector onto the subspace spanned by v, and P_v^⊥ for the projector onto the orthogonal subspace.

For a symmetric matrix M, we denote by λ_1(M) ≥ λ_2(M) ≥ … ≥ λ_n(M) its eigenvalues in decreasing order. We will also write λ_max(M) and λ_min(M) for the maximum and minimum eigenvalues.

We denote by GOE(n) the Gaussian Orthogonal Ensemble in n dimensions. Namely, for a symmetric random matrix W in R^{n×n}, we write W ∼ GOE(n) if the entries (W_ij)_{i ≤ j} are independent centered Gaussians, with the variance of the diagonal entries equal to twice that of the off-diagonal entries.
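As a concrete reference, a GOE(n) matrix can be sampled as the symmetrization of an i.i.d. Gaussian matrix. The normalization below (off-diagonal variance 1/n, diagonal variance 2/n) is one common convention, assumed here for illustration; the paper's own scaling may differ.

import numpy as np

def sample_goe(n, seed=None):
    """Symmetric matrix with independent centered Gaussian entries (up to symmetry):
    off-diagonal variance 1/n, diagonal variance 2/n."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n)) / np.sqrt(n)
    return (a + a.T) / np.sqrt(2)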

For a sequence of functions , , we say that is exponentially finite on a set , if

(23)

We say that is exponentially vanishing on a set , if

(24)

We say that is exponentially trivial on a set , if

(25)

We say is exponentially decaying on a set , if

(26)

For a metric space , we denote the open ball at with radius by . In , we will always use Euclidean distance. For any and , the open ball in is denoted by . Let be the space of probability measures on . We will equip with the Dudley distance: for two probability measures , this is defined as

(27)

The corresponding open ball contains the probability measures at Dudley distance less than the given radius from its center.
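For reference, one standard form of the Dudley (bounded-Lipschitz) distance is the following; the exact normalization of the test functions is an assumption here and may differ from the paper's Eq. (27).

\[
d_{\mathrm{D}}(\mu,\nu)
= \sup\Bigl\{\, \Bigl|\int f\,\mathrm{d}\mu - \int f\,\mathrm{d}\nu\Bigr|
\;:\; \|f\|_{\infty} \le 1,\ \|f\|_{\mathrm{Lip}} \le 1 \,\Bigr\}.
\]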

Suppose μ is a probability measure on R. We denote its Stieltjes transform, defined by (here conv denotes the convex hull and ℍ the upper half plane)

(28)

The Stieltjes transform is always injective, so we can define its inverse. The R-transform is then defined by

(29)

We denote as the semi-circular law. The Stieltjes transform for the semi-circular law is

(30)

and its R-transform is

(31)
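For orientation, here are the standard definitions under one common convention (Cauchy-transform sign, semicircle supported on [-2, 2]); the paper's sign convention and variance normalization may differ.

\[
G_{\mu}(z) = \int \frac{\mathrm{d}\mu(x)}{z - x}, \qquad
R_{\mu}(z) = G_{\mu}^{-1}(z) - \frac{1}{z},
\]
and, for the semicircle law with density \( \tfrac{1}{2\pi}\sqrt{4-x^{2}} \) on \([-2,2]\),
\[
G(z) = \frac{z - \sqrt{z^{2}-4}}{2}, \qquad R(z) = z .
\]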

4.2 Preliminary lemmas

We start by stating a form of the Kac-Rice formula that will be a key tool in our proof. Essentially the same statement was used in [ABČ13], and we refer to [AT09] for general proofs and broader context. Let be a centered Gaussian field on and let be a finite atlas on . Set and define , . Assume that for all and all , the joint distribution of is non-degenerate, and

(32)

for some and . For Borel sets and , let

(33)

Then, using to denote the usual surface measure on , and denoting by the density of at , we have

(34)
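In schematic form (with generic notation, and suppressing the regularity conditions, which are those of [AT09]), the Kac-Rice identity used here reads

\[
\mathbb{E}\bigl[\mathrm{Crt}(B)\bigr]
= \int_{\mathcal{M}}
  \mathbb{E}\Bigl[\,\bigl|\det \mathrm{Hess}\, f(x)\bigr|\,
  \mathbf{1}\{f(x)\in B\} \,\Big|\, \mathrm{grad}\, f(x) = 0\Bigr]\,
  p_{x}(0)\,\mathrm{d}\sigma(x),
\]

where p_x denotes the density of the Riemannian gradient grad f(x) evaluated at 0, and σ is the surface measure on the manifold M.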

The next lemma specializes the last formula to our specific choice of objective, cf. Eq. (3). Its proof can be found in Section 4.3. We have

(35)
(36)

where

(37)

is the area of the -th dimensional sphere with radius , and is the density of at . Further the joint distribution of , , and is given by

(38)
(39)
(40)

where , and are independent.

The next lemma provides a simplified expression. Its proof is deferred to Section 4.4. We have

(41)

and

(42)

where, for ,

(43)
(44)
(45)
(46)

Further, is exponentially trivial.

The next lemma contains a well-known fact that we will use several times in the proofs. It follows immediately from the joint distribution of the eigenvalues in the GOE ensemble [AGZ10]; see for instance [Mai07]. [Joint density of the eigenvalues of the spiked model] Let , where and . The joint density of the eigenvalues of is given by

(47)

where denotes the vector , and is the spherical integral defined by

(48)

with the Haar probability measure on the orthogonal group of size n. Next, we state a lemma regarding the large deviations of the largest eigenvalue of the spiked model, proven in [Mai07]. (Notice that the formula in [Mai07] contains a typo that is corrected here. Also, the normalization used here is different from the one in [Mai07]: here the empirical spectral distribution converges to a semicircle with a different support.) [Large deviation of the largest eigenvalue of the spiked model [Mai07]] Let M be the spiked matrix of the previous lemma, and denote by λ_max(M) its largest eigenvalue. Then we have

(49)
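To illustrate the spiked-matrix phenomenology behind this lemma, the following numerical sketch tracks the largest eigenvalue of a rank-one deformation of a GOE matrix. The normalization (bulk edge at 2, BBP-type transition at spike strength 1, with λ_max ≈ θ + 1/θ above it) is the standard one and is assumed here for illustration; the paper's normalization differs.

import numpy as np

def spiked_goe_top_eigenvalue(n, theta, seed=None):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n)) / np.sqrt(n)
    w = (a + a.T) / np.sqrt(2)               # GOE(n), semicircle bulk on [-2, 2]
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    return np.linalg.eigvalsh(theta * np.outer(v, v) + w)[-1]

for theta in (0.5, 1.0, 2.0):
    lam = np.mean([spiked_goe_top_eigenvalue(400, theta, seed) for seed in range(5)])
    print(f"theta={theta:.1f}  largest eigenvalue ≈ {lam:.3f}")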