Statistical thresholds for Tensor PCA

12/08/2018 · Aukosh Jagannath, et al. · Harvard University

We study the statistical limits of testing and estimation for a rank one deformation of a Gaussian random tensor. We compute the sharp thresholds for hypothesis testing and estimation by maximum likelihood and show that they are the same. Furthermore, we find that the maximum likelihood estimator achieves the maximal correlation with the planted vector among measurable estimators above the estimation threshold. In this setting, the maximum likelihood estimator exhibits a discontinuous BBP-type transition: below the critical threshold the estimator is orthogonal to the planted vector, but above the critical threshold, it achieves positive correlation which is uniformly bounded away from zero.


1. Introduction

Suppose that we are given an observation which is a rank-one tensor in high dimension, subject to additive Gaussian noise. That is,

(1.1)

where the planted vector lies on the unit sphere, the noise is an i.i.d. Gaussian tensor, and the deformation parameter is called the signal-to-noise ratio. (We note here that none of our results are changed if one symmetrizes the noise, i.e., if we work with the symmetric Gaussian tensor.) Throughout this paper, we assume that the planted vector is drawn from an uninformative prior, namely the uniform distribution on the sphere. We study the fundamental limits of two natural statistical tasks. The first task is that of hypothesis testing: for what range of signal-to-noise ratios is it statistically possible to distinguish the law of the observation in the absence of a signal, the null hypothesis, from its law in the presence of a signal, the alternative? The second task is one of estimation: for what range of signal-to-noise ratios does the maximum likelihood estimator achieve asymptotically positive inner product with the planted vector?
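To fix ideas, here is a minimal simulation sketch of the model just described. The explicit normalization is an assumption on our part (we take an observation of the form lambda * sqrt(N) * x0^(k-fold outer product) + W with i.i.d. standard Gaussian noise entries, in the spirit of the spiked tensor literature); the helper names sample_spiked_tensor and correlation are hypothetical and not from the paper.

```python
import numpy as np

def rank_one_tensor(x, k):
    """Return the k-fold outer product x ⊗ ... ⊗ x as a k-dimensional array."""
    t = x
    for _ in range(k - 1):
        t = np.multiply.outer(t, x)
    return t

def sample_spiked_tensor(lam, x0, k, rng):
    """Sample Y = lam * sqrt(N) * x0^{⊗k} + W with W i.i.d. N(0,1) entries.
    (Assumed normalization; the paper's exact scaling may differ.)"""
    N = x0.shape[0]
    W = rng.standard_normal((N,) * k)
    return lam * np.sqrt(N) * rank_one_tensor(x0, k) + W

def correlation(xhat, x0):
    """Absolute inner product |<xhat, x0>|, the quantity studied in the paper."""
    return abs(np.dot(xhat, x0))

rng = np.random.default_rng(0)
N, k, lam = 30, 3, 3.0
x0 = rng.standard_normal(N)
x0 /= np.linalg.norm(x0)                  # uniform point on the unit sphere
Y = sample_spiked_tensor(lam, x0, k, rng)
print(Y.shape, correlation(x0, x0))       # (30, 30, 30) 1.0
```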

When the tensor has only two indices, this amounts to hypothesis testing and estimation for the well-known spiked matrix model. Here, maximum likelihood estimation corresponds to computing the top eigenvector of the observed matrix. This problem was proposed as a natural statistical model of principal component analysis [33]. It is a fundamental result of random matrix theory that there is a critical threshold below which the spectral theories of the observed matrix and of the pure noise matrix are asymptotically equivalent, but above which the maximum likelihood estimator achieves asymptotically positive inner product with the planted vector (called the correlation of the estimator with the planted vector), where the correlation increases continuously from zero to one as the signal-to-noise ratio tends to infinity after the dimension [24, 4, 50, 5, 25, 16, 12]. This transition is called the BBP transition after the authors of [4] and has received a tremendous amount of attention in the random matrix theory community. Far richer information is also known, such as universality, large deviations, and fluctuation theorems; for a small sample of work in this direction, see [42, 11, 14, 13]. More recently, it has been shown that the BBP transition is also the transition for hypothesis testing [46]. See also [22, 7, 52, 6, 38, 39, 2] for analyses of the testing and estimation problems with different prior distributions.
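As a quick numerical illustration of the BBP picture in the matrix case (a sketch under an assumed normalization, not the construction used in the references above): for a rank-one deformation M = lambda * x0 x0^T + G / sqrt(N) of a GOE-type matrix, the top eigenvector is asymptotically orthogonal to the spike below lambda = 1 and has correlation approaching sqrt(1 - 1/lambda^2) above it.

```python
import numpy as np

def spiked_wigner(lam, x0, rng):
    """M = lam * x0 x0^T + G / sqrt(N), with G a GOE-type symmetric Gaussian matrix.
    (Assumed normalization, chosen so that the BBP threshold sits at lam = 1.)"""
    N = x0.shape[0]
    A = rng.standard_normal((N, N))
    G = (A + A.T) / np.sqrt(2.0)           # symmetric, off-diagonal variance ~ 1
    return lam * np.outer(x0, x0) + G / np.sqrt(N)

rng = np.random.default_rng(1)
N = 2000
x0 = rng.standard_normal(N); x0 /= np.linalg.norm(x0)
for lam in [0.5, 1.0, 1.5, 2.0, 3.0]:
    M = spiked_wigner(lam, x0, rng)
    evals, evecs = np.linalg.eigh(M)
    corr = abs(evecs[:, -1] @ x0)          # correlation of the top eigenvector with the spike
    pred = np.sqrt(max(0.0, 1 - 1 / lam**2))
    print(f"lam={lam:3.1f}  empirical={corr:.3f}  BBP prediction={pred:.3f}")
```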

Our goal in this paper is to understand the case of tensors of order at least three, which is called the spiked tensor problem. This was introduced in [53] as a natural generalization of the above to testing and estimation problems where the data has more than two indices or requires higher moments, which occurs throughout data science [41, 23, 3]. In this setting, it is known that there is an order-one lower bound on the threshold for hypothesis testing, which is asymptotically tight in the order of the tensor [53, 51], and an order-one upper bound on the threshold for estimation via maximum likelihood [46, 51]. On the other hand, if one imposes a more informative, product prior distribution, i.e., i.i.d. coordinates drawn from a fixed distribution, the threshold for minimum mean-square error (MMSE) estimation has been computed exactly [37, 40, 8], as has the threshold for hypothesis testing under the additional assumption that the prior is compactly supported [51, 18, 19]. We note that by a standard approximation argument (see prop:IT below), the results of [40, 8] also imply a sharp threshold above which the MMSE estimator achieves non-trivial correlation for the uniform prior considered here.

The authors of [10] and [54] began a deep geometric approach to this problem by studying the geometry of the sub-level sets of the log-likelihood function. In [10], the authors compute the (normalized) logarithm of the expected number of local minima below a certain energy level via the Kac-Rice approach and show that there is a transition point: below it, this quantity is negative at any strictly positive correlation, while above it, it has a zero at a correlation bounded away from zero. In [54], the authors study the (normalized) logarithm of the (random) number of local minima via a novel (but non-rigorous) replica-theoretic approach. In particular, they predict that this problem exhibits a much more dramatic transition than the BBP transition. They argued that there are in fact two transitions for the log-likelihood. First, below the lower transition, all local maxima of the log-likelihood achieve only asymptotically vanishing correlation. Between the two transitions, there is a local maximum of the log-likelihood with non-trivial correlation, but the maximum likelihood estimator still has vanishing correlation. Finally, above the upper transition, the maximum likelihood estimator has strictly positive correlation. In particular, if we view the limiting correlation of the maximum likelihood estimator as a function of the signal-to-noise ratio, they predict that it has a jump discontinuity at the upper transition. Finally, they predict that the upper transition should correspond to the hypothesis testing threshold. We verify several of these predictions.

We obtain here the sharp thresholds for hypothesis testing and for estimation by maximum likelihood and show that they are equal. Furthermore, we compute the asymptotic correlation between the maximum likelihood estimator and the planted vector, that is, their Euclidean inner product. We find that the maximum likelihood estimator achieves the maximal correlation among measurable estimators, and that this correlation is discontinuous at the critical threshold. This is in contrast to the matrix setting, where the transition is continuous. As a consequence of these results, the critical threshold is also the threshold for multiple hypothesis testing: above it, the maximum likelihood is able to distinguish between all of the hypotheses. Finally, as a consequence of our arguments, we compute the maximum of the log-likelihood at fixed correlation and find that, above the lower transition, it has a local maximum at a strictly positive correlation.

These testing and estimation problems have received a tremendous amount of attention recently, as they are expected to be an extreme example of statistical problems that admit a statistical-to-algorithmic gap: the thresholds for estimation and detection are both of order one in the dimension, whereas the thresholds for efficient testing and estimation are expected to diverge polynomially in the dimension. Indeed, this problem is known to be NP-hard in the worst case [29]. Sharp algorithmic thresholds have been shown for semi-definite and spectral relaxations of the maximum likelihood problem [31, 30, 34], as well as for optimization of the likelihood itself via Langevin dynamics [9]. Upper bounds have also been obtained for message passing and power iteration [53], as well as for gradient descent [9]. Our work complements these results by providing sharp statistical thresholds for maximum likelihood estimation and hypothesis testing.
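For concreteness, the following is a sketch of tensor power iteration for a 3-tensor, one of the algorithm families referenced above. It is an illustrative implementation only, not the specific procedure analyzed in [53], and it assumes the normalization and hypothetical helper names of the earlier sketch.

```python
import numpy as np

def tensor_apply(Y, x):
    """Contract a 3-tensor Y against x in two slots: (Y[·, x, x])_i = sum_{j,l} Y[i,j,l] x[j] x[l]."""
    return np.einsum('ijl,j,l->i', Y, x, x)

def power_iteration(Y, n_iter=200, seed=0):
    """Plain tensor power iteration from a random start (no restarts or other refinements).
    Returns a unit vector approximating a local maximizer of <Y, x ⊗ x ⊗ x>."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(Y.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        x = tensor_apply(Y, x)
        x /= np.linalg.norm(x)
    return x

# Example usage with the sampler from the earlier sketch (hypothetical helper names):
# Y = sample_spiked_tensor(lam=5.0, x0=x0, k=3, rng=rng)
# xhat = power_iteration(Y)
# print(abs(xhat @ x0))
```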

Let us now discuss our main results and methods. We begin this paper by computing the sharp threshold for hypothesis testing. There have been two approaches to this in the literature to date. One is by a modified second moment method [46, 51], which yields results that become sharp only in the limit of large tensor order. The other approach, which we take here, is to control the fluctuations of the log-likelihood; it yields sharp results for any fixed order. The key idea behind this approach is to prove a correspondence between the statistical threshold for hypothesis testing and a phase transition, called the “replica symmetry breaking” transition, in a corresponding statistical physics problem. For more on this connection see sec:connection-to-sg below.

Previous approaches to making this connection precise apply to the bounded i.i.d. prior setting [37, 51, 18, 19]. There one may apply a deep, inductive argument of Talagrand [56], related to the “cavity method” [43], to control these fluctuations. This approach uses the boundedness and product structure of the prior in an essential way, neither of which holds in our setting (though we note here the work [48], which applies when the signal-to-noise ratio is sufficiently small). Our main technical contribution in this direction is a simpler, large deviations based approach which allows us to obtain the sharp threshold without using the cavity method. This argument applies with little modification to the product prior setting as well, though we do not investigate this here.

We then turn to computing the threshold for maximum likelihood estimation. We begin by directly computing the almost sure limit of the normalized maximum likelihood, which is an immediate consequence of the results of [32, 20]. Combining this with the results of [40] (and a standard approximation argument), we then obtain a sharp estimate for the correlation between the MLE and the planted vector above the threshold, and find that it matches that of the Bayes-optimal estimator, confirming a prediction from [27]. The fact that the MLE has non-trivial correlation down to the information-theoretic threshold is surprising in this setting, as it is not expected to be true for all prior distributions. See, e.g., [26].

1.1. Main results

Let us begin by stating our first result regarding hypothesis testing. Consider an observation of a random tensor, and let the alternative hypothesis be the law of (1.1) at a given positive signal-to-noise ratio and the null hypothesis its law with the signal-to-noise ratio set to zero. Define, for each signal-to-noise ratio,

(1.2)

and let

(1.3)

Our goal is to show that the null and the alternative are mutually contiguous when the signal-to-noise ratio is below the critical value defined in (1.3), and that above this value there is a sequence of tests which asymptotically distinguishes these distributions. More precisely, we obtain the following stronger result regarding the total variation distance between these hypotheses, which we state in the case of even tensor order for simplicity.

Theorem 1.1.

For even tensor order, the total variation distance between the null and the alternative tends to zero when the signal-to-noise ratio lies below the critical threshold and tends to one when it lies above it.

The preceding result shows us that the transition for hypothesis testing occurs at the critical threshold defined in (1.3). Let us now turn to the corresponding results regarding maximum likelihood estimation.

It is straightforward to show that maximizing the log-likelihood is equivalent to maximizing the inner product of the observation with a rank-one tensor generated by a point on the unit sphere. The maximum likelihood estimator (MLE) is then defined as the maximizer of this objective. (As shown in Proposition A.1, the objective admits almost surely a unique maximizer over the sphere if the tensor order is odd, and two maximizers, differing by a sign, if the order is even. In the case of even order, the MLE is simply picked uniformly at random among the two.)

(1.4)
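To make the definition concrete, here is a sketch of approximating the maximizer numerically by gradient ascent with renormalization onto the sphere, for the assumed objective <Y, x ⊗ x ⊗ x> in the normalization of the earlier sketches. With random restarts and small dimensions this gives a rough proxy for the MLE, not a certified global maximizer.

```python
import numpy as np

def objective(Y, x):
    """<Y, x ⊗ x ⊗ x> for a 3-tensor Y."""
    return np.einsum('ijl,i,j,l->', Y, x, x, x)

def gradient(Y, x):
    """Euclidean gradient of the objective above (general unsymmetrized form)."""
    return (np.einsum('ijl,j,l->i', Y, x, x)
            + np.einsum('ijl,i,l->j', Y, x, x)
            + np.einsum('ijl,i,j->l', Y, x, x))

def spherical_ascent(Y, n_restarts=10, n_steps=500, lr=0.01, seed=0):
    """Gradient ascent with renormalization onto the unit sphere, best of several restarts."""
    rng = np.random.default_rng(seed)
    best_x, best_val = None, -np.inf
    for _ in range(n_restarts):
        x = rng.standard_normal(Y.shape[0]); x /= np.linalg.norm(x)
        for _ in range(n_steps):
            x = x + lr * gradient(Y, x)
            x /= np.linalg.norm(x)
        val = objective(Y, x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val
```

Note that in moderate dimensions such local methods are only expected to succeed well above the statistical threshold, consistent with the algorithmic gap discussed above.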

Our second result is that the preceding transition is also the transition above which maximum likelihood estimation yields an estimator achieving positive correlation with the planted vector.

Define the following auxiliary function:

(1.5)

As shown in Lemma A.4, this function admits a unique positive maximizer above the critical signal-to-noise ratio, so that the quantities below are well-defined. Let us also define the unique positive zero of

(1.6)

Finally, let

(1.7)

We then have the following.

Theorem 1.2.

Fix the tensor order and the signal-to-noise ratio. The following limit holds almost surely:

(1.8)

Furthermore, for signal-to-noise ratios away from the critical threshold, we have

(1.9)

As a consequence of cor:upper_IT, the maximum likelihood estimator achieves the maximal correlation. Unlike the matrix case, the transition in the correlation is not continuous. See Figure 1.1.

Figure 1.1. Asymptotic correlation as a function of the signal-to-noise ratio, for different values of the tensor order.

Regarding the second threshold

While the intermediate regime and the expected transition at the lower threshold are not relevant for testing and estimation, there is still a natural interpretation from the perspective of the landscape of the maximum likelihood. In [10, 54], this is explained in terms of the complexity. There is also an explanation in terms of the optimization of the maximum likelihood. We end this section with a brief discussion of this phase. First, consider the following definition:

(1.10)

Consider the constrained maximum likelihood,

(1.11)

This limit exists and is given by an explicit variational problem (see (5.5) below). Above the lower transition, consider the (unique) positive, strict local maximizer of the relevant scalar function; by lem:gauss_scalar, this is well-defined and is bounded away from zero. In [54], it is argued by the replica method that the constrained maximum likelihood has a local maximum at this point for all signal-to-noise ratios above the lower transition. Establishing this rigorously is a key step in our proof of thm:max_likelihood. In particular, we prove the following, which is a direct consequence of lem:up_qs below.

Proposition 1.3.

For all signal-to-noise ratios above the lower transition, the constrained maximum likelihood has a strict local maximum at this point.

It is easy to verify (by direct differentiation) that the location of this local maximum is strictly increasing in the signal-to-noise ratio. We also have, by lem:up_qs and lem:x_k, that between the two transitions the value at this strict local maximum is strictly less than the maximum likelihood. In fact, (5.5) can be solved numerically, as it can be shown that one may reduce this variational problem, in the setting we consider here, to a two-parameter family of problems in three real variables. This is discussed in rem:rigorous below. In particular, see Figure 1.2 for an illustration of these two transitions in a representative case.

Figure 1.2. Asymptotic constrained maximum likelihood as a function of the correlation, shown for several values of the signal-to-noise ratio. For small signal-to-noise ratios, the function is (numerically) seen to be monotone. A secondary local maximum appears at the lower transition; this local maximum is bounded away from zero correlation. Finally, at the information-theoretic threshold, the function is maximized at this second point.
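The constrained maximum likelihood in (1.11) can also be explored numerically in small dimensions. The sketch below (same assumed normalization and hypothetical names as before) fixes the overlap q with the planted vector x0, parametrizes the constraint set as q*x0 + sqrt(1-q^2)*u with u a unit vector orthogonal to x0, and runs a crude projected ascent over u.

```python
import numpy as np

def constrained_ascent(Y, x0, q, n_steps=500, lr=0.01, seed=0):
    """Heuristically maximize <Y, x ⊗ x ⊗ x> over {x on the sphere : <x, x0> = q}
    by ascending over the orthogonal part u. Returns the best value found."""
    rng = np.random.default_rng(seed)
    N = x0.shape[0]
    u = rng.standard_normal(N)
    u -= (u @ x0) * x0
    u /= np.linalg.norm(u)
    for _ in range(n_steps):
        x = q * x0 + np.sqrt(1 - q**2) * u
        # gradient of <Y, x ⊗ x ⊗ x> assuming Y symmetric; for an unsymmetrized Y
        # use the three-term form from the earlier sketch
        g = 3 * np.einsum('ijl,j,l->i', Y, x, x)
        g -= (g @ x0) * x0                        # project out the x0 direction
        u += lr * np.sqrt(1 - q**2) * g
        u -= (u @ x0) * x0
        u /= np.linalg.norm(u)
    x = q * x0 + np.sqrt(1 - q**2) * u
    return np.einsum('ijl,i,j,l->', Y, x, x, x)

# Evaluating this over a grid of q values for several signal-to-noise ratios gives a
# small-dimensional caricature of the curves in Figure 1.2.
```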

Acknowledgements

A.J. would like to thank G. Ben Arous for encouraging the preparation of this paper. A.J. and L.M. would like to thank the organizers of the BIRS workshop “Spin Glasses and Related Topics” where part of this research was conducted. This work was conducted while A.J. was supported by NSF OISE-1604232 and P.L. was partially supported by the NSF Graduate Research Fellowship Program under grant DGE-1144152.

2. Proof of thm:main-thm and connection to spin glasses

In this section, we prove thm:main-thm. In particular, we connect the phase transition for the hypothesis testing problem to a phase transition in a class of models from statistical physics, which is proved in the remaining sections.

Let us begin by explaining this connection. First note that the null hypothesis is a centered Gaussian distribution on the space of tensors of the given order, whereas the alternative corresponds to one with a random mean. Thus, by Gaussian change of density, the likelihood ratio can be written as an average, with respect to the uniform measure on the sphere, of an exponential tilt of the observation. Observe that the total variation distance satisfies

(2.1)

We will show that this probability tends to zero when the signal-to-noise ratio lies below the critical threshold.
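For concreteness, under the normalization assumed in the earlier sketches (observation $Y = \lambda\sqrt{N}\,x_0^{\otimes k} + W$ with i.i.d. standard Gaussian entries of $W$ and $x_0$ uniform on $\mathbb{S}^{N-1}$; the paper's own constants may differ), the Gaussian change of density gives

\[ L(Y) \;=\; \frac{dP_\lambda}{dP_0}(Y) \;=\; \int_{\mathbb{S}^{N-1}} \exp\Big( \lambda\sqrt{N}\,\langle Y, x^{\otimes k}\rangle \;-\; \frac{N\lambda^2}{2} \Big)\, d\mu(x), \]

where $\mu$ denotes the uniform measure on the sphere.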

Let us now make the following change of notation, motivated by statistical physics. For a point on the unit sphere, define

(2.2)

We view this as a function on the sphere; it is called the Hamiltonian of the spherical p-spin glass model in the statistical physics literature [21]. The log-likelihood ratio then has an interpretation in terms of what is called a “free energy” in the statistical physics literature. More precisely, define the free energy of the spherical p-spin model at a given temperature parameter by

(2.3)

and observe that under the null hypothesis,

(2.4)

The key conceptual step in our proof is to connect the phase transition for hypothesis testing to what is called the “replica symmetry breaking” transition in statistical physics. While it is not within the scope of this paper to provide a complete description of this transition, we note that one expects it to be reflected in the limiting properties of the free energy: when the temperature parameter is small, the free energy should fluctuate around its annealed value, but when the parameter is large, it should be strictly smaller. A sharp transition is expected to occur at the critical threshold. For an in-depth discussion of replica symmetry breaking transitions see [43]. In the remainder of this section we reduce the proof of our main result to showing that the phase transition for the fluctuations of the free energy does in fact occur at this threshold. We then prove that this phase transition exists in the next two sections.
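As a rough numerical illustration of the free energy just described, here is a sketch under our assumed Hamiltonian H_N(x) = sqrt(N) * <W, x ⊗ x ⊗ x> on the unit sphere, which has variance N, so that the annealed value of the free energy at parameter beta is beta^2/2; this is an assumption, not the paper's exact normalization.

```python
import numpy as np

def hamiltonian(W, X):
    """H_N(x) = sqrt(N) * <W, x ⊗ x ⊗ x> for each unit-norm row x of X (order k = 3).
    With this scaling each H_N(x) has variance N."""
    N = W.shape[0]
    return np.sqrt(N) * np.einsum('ijl,ai,aj,al->a', W, X, X, X)

def free_energy_mc(W, beta, n_samples=50000, seed=0):
    """Crude Monte Carlo estimate of F_N(beta) = (1/N) log E_x exp(beta H_N(x)),
    with x uniform on the unit sphere. Biased low for larger beta or dimension."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    X = rng.standard_normal((n_samples, N))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    H = hamiltonian(W, X)
    m = H.max()                              # stabilized log-mean-exp
    return (beta * m + np.log(np.mean(np.exp(beta * (H - m))))) / N

rng = np.random.default_rng(2)
N = 12
W = rng.standard_normal((N, N, N))
for beta in [0.2, 0.5, 1.0]:
    print(beta, free_energy_mc(W, beta), beta**2 / 2)   # compare to the annealed value
```

Naive uniform sampling is only indicative for small dimension and small beta, but it conveys the object whose fluctuations are being studied.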

Let us turn to this reduction. By (2.1) and the equivalence noted above,

(2.5)

We have the following theorem of Talagrand, which we state in a weak form for the sake of exposition. Here and in the following, unless otherwise specified, expectations and variances will always denote integration with respect to the law of the Gaussian random tensor.

Theorem 2.1 (Talagrand [55]).

For every value of the parameter, the expected free energy is a convergent sequence. Furthermore, its limit is at most the annealed value, with equality if and only if the parameter does not exceed the critical threshold.

With this in hand, it suffices to show the following.

Theorem 2.2.

For even tensor order and any parameter strictly between zero and the critical threshold, there is a constant such that, for every sufficiently large dimension, the following estimate holds:

Proof.

The proof of this theorem will constitute the next two sections. Let us begin by making the following elementary observations, which will reduce the theorem to certain fluctuation theorems. To this end, observe that by Chebyshev’s inequality,

(2.6)

The key point in the following will be to quantify the rate of convergence in Talagrand's theorem below the critical threshold. This rate of convergence will also allow us to control the variance of the free energy. More precisely, in the subsequent sections we will prove the following two theorems.

Theorem 2.3.

Fix a parameter strictly below the critical threshold. For any tolerance there is a constant such that, for every sufficiently large dimension, the following estimate on the expected free energy holds:

Theorem 2.4.

For even tensor order and any parameter strictly below the critical threshold, the following variance bound holds for every sufficiently large dimension:

The desired result then follows upon combining thm:conv-means and thm:variance-bound with (2.6). ∎

We can now prove the main theorem.

Proof of thm:main-thm.

Suppose first that the parameter lies below the critical threshold. Then, by thm:FE-decay combined with (2.5), the total variation distance vanishes.

Suppose now that the parameter lies above the critical threshold. Note that the free energy concentrates around its mean by Gaussian concentration (see, e.g., [15, Theorem 5.6]). By Jensen's inequality the expected free energy is at most the annealed value, and by Talagrand's theorem we know that its limit is strictly smaller than the annealed value, so that for some positive constant the free energy falls below the annealed value by a definite amount with overwhelming probability. The desired result then follows by using this to lower bound the right side of (2.5). ∎
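For reference, the Gaussian concentration inequality invoked above, in the Tsirelson-Ibragimov-Sudakov form given in [15, Theorem 5.6]: if $f$ is $L$-Lipschitz and $g$ is a standard Gaussian vector, then

\[ \mathbb{P}\big( |f(g) - \mathbb{E} f(g)| \ge t \big) \;\le\; 2\, e^{-t^2/(2L^2)} \qquad \text{for all } t > 0. \]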

3. Rate of convergence of the mean and decay of variance

In this section, we prove thm:conv-means. In the following we will make frequent use of the Gibbs measure associated with the Hamiltonian in (2.2), normalized to be a probability measure; observe that the normalization constant is given exactly by the exponential of the unnormalized free energy. Here and in the remaining sections, the bracket notation will denote expectation with respect to this (random) measure. We will suppress the dependence on the temperature parameter whenever it is unambiguous, as it will always be fixed. Throughout this section, the parameter will always be fixed and strictly less than the critical threshold. It will also be useful to define a quantity built from the Euclidean inner product, which is evidently related to the large deviations rate function for the corresponding overlap event. To simplify notation, we also introduce a piece of shorthand used below. The starting point for our analysis is an estimate of the rate of convergence of the expected free energy to its limit.
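To illustrate the Gibbs measure numerically (a sketch reusing the assumed Hamiltonian and hypothetical helper names from the earlier free-energy sketch; these are not the paper's definitions): for small dimensions one can approximate Gibbs expectations by self-normalized importance sampling with uniform proposals on the sphere.

```python
import numpy as np

def gibbs_expectation(W, beta, f, n_samples=50000, seed=0):
    """Approximate <f> under the Gibbs measure proportional to exp(beta * H_N(x)) d mu(x),
    using uniform proposals on the sphere and self-normalized weights.
    (hamiltonian() is the assumed k = 3 Hamiltonian from the earlier sketch.)"""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    X = rng.standard_normal((n_samples, N))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    H = hamiltonian(W, X)
    logw = beta * H
    w = np.exp(logw - logw.max())
    w /= w.sum()                        # self-normalized importance weights
    return np.sum(w * f(X))

# Example: the Gibbs average of the overlap with a fixed unit vector e1.
# e1 = np.eye(N)[0]
# gibbs_expectation(W, beta=0.5, f=lambda X: X @ e1)
```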

Proof of Theorem 2.3.

In the following, we use the shorthand introduced above. By Jensen's inequality, the desired lower bound on the expected free energy holds.

Let us now turn to an upper bound. Recall the Hamiltonian from (2.2), and observe that it is centered with covariance given by a function of the inner product of the index points. Differentiating the expected free energy in the parameter, it then follows that the derivative can be written as a Gibbs average, where the first equality is by definition of the Gibbs measure and the second follows by Gaussian integration by parts for Gibbs expectations, (A.6). We now claim that

(3.1)

for some constant and all sufficiently small parameters. With this claim in hand, we may apply Gronwall's inequality and the lower bound from above to obtain the desired estimate. Let us now turn to the proof of this claim.

Observe that the relevant maps are uniformly Lipschitz, so that Gaussian concentration of measure (see for instance [15, Theorem 5.6]) implies that there are constants, depending only on the parameter and the tensor order, such that the corresponding bounds hold with high probability. Thus, on this event,

(3.2)

As we shall show in Corollary 4.4, for every tolerance there is a constant such that, for every sufficiently large dimension and all admissible overlaps, the corresponding bound holds. Then, on the same event,

(3.3)

Consequently, if we take the centers of a partition of the overlap interval into intervals of a suitable mesh size and sum the resulting bounds, we obtain an estimate in which we use the concentration bound above. From this it follows, by an elementary inequality, that

(3.4)

for some constant, all sufficiently small mesh sizes, and every sufficiently large dimension. The claim (3.1) then follows from the properties of the rate function. For this last claim, observe that the rate function is convex with an explicit value and right derivative at the origin, which yields the desired comparison. ∎

Notice that by the above argument, we also have the following.

Corollary 3.1.

For any tolerance and any parameter strictly below the critical threshold, there is a constant such that, for every sufficiently large dimension, the following holds:

Proof.

By thm:conv-means, we have the corresponding estimate on the expected free energy for every sufficiently large dimension. Combining this with (3.1) yields the desired bound. ∎

We are now in a position to prove the variance decay.

Proof of Theorem 2.4.

By the Gaussian Poincaré inequality (see for instance [15, Theorem 3.20]), the variance of the free energy is controlled by the same Gibbs average estimated in Corollary 3.1. The result then follows by combining these bounds. ∎
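For reference, the Gaussian Poincaré inequality in the form cited ([15, Theorem 3.20]): for a standard Gaussian vector $g$ and any sufficiently smooth $f$,

\[ \operatorname{Var}\big(f(g)\big) \;\le\; \mathbb{E}\,\| \nabla f(g) \|^2 . \]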

4. The Parisi functional and large deviations

The main technical tool we need is a bound on the following expected value, which is related to the large deviations behavior described in the previous section. We relate the relevant quantities to explicit Parisi-type formulas. For the appropriate ranges of the parameters, define

(4.1)
(4.2)

Then we have the following from [55]. See also [49, 36] for alternative presentations.

Theorem 4.1.

There exists a constant such that, for every admissible choice of the parameters, we have

(4.3)

where the remaining notation is defined in the proof below.

Proof.

We first observe that, by symmetry of the Hamiltonian, it suffices to prove the same estimate for the corresponding one-sided quantity. We apply [49, Eq. 2.22] with the choice of parameters

(4.4)

to obtain the stated bound, where the error term in [49, Eq. 2.22] is controlled in terms of i.i.d. Gaussian random variables with the variance given in [49, Eq. 2.14] and a universal constant. Using the elementary bound of Lemma A.7, it follows that this error term is of the claimed order. Modifying the constant appropriately yields the desired result. ∎

Lemma 4.2.

For the relevant range of parameters, there are constants such that, for every admissible choice of the remaining quantities, we have the following bound:

Proof.

Observe that the functional is smooth and that the point of interest is a critical point; its Hessian has an eigenvector, with a strictly negative eigenvalue, along which the functional decreases. It follows that, in a neighborhood of this critical point, the functional is bounded above at a definite quadratic rate, with constants independent of the dimension. Combining this with (4.3), we obtain the stated estimate. Choosing the free parameter appropriately and decreasing the constant if necessary, the result follows. ∎

Lemma 4.3.

For the relevant range of parameters, there are constants such that the following holds for every admissible choice of the remaining quantities:

Proof.

Notice that the quantity of interest can be expressed in terms of the function defined by (A.1). By Lemma A.4, the relevant bound on this function holds for all admissible parameters. Note also its value at the endpoint, so that for every parameter the quantity of interest is bounded by a definite margin. Observe that the function is upper semi-continuous. Thus, for any tolerance, there exists a constant such that the function is bounded accordingly on the relevant range. In particular, for such parameters, the stated estimate holds for every sufficiently large dimension, which implies the desired result. ∎

Combining these two results, we obtain the following.

Corollary 4.4.

For any parameter strictly below the critical threshold and all sufficiently small tolerances, there is a constant such that, for every sufficiently large dimension, the stated bound holds for all admissible overlaps.

Proof.

Fix the parameters. By lem:parisi-1, there is a constant such that the corresponding bound holds for every sufficiently large dimension and all admissible choices. Now let the constant given by lem:parisi-2 be fixed as well; taking the minimum of the two constants, the result follows. ∎

5. Estimation

In this section, we prove Theorem 1.2. We begin by providing a lower bound for the maximum likelihood at every signal-to-noise ratio, using results on the ground state of the mixed p-spin model recently proved in [32, 20]. We then use the information-theoretic bound from [40] on the maximal correlation achievable by any estimator to obtain the matching upper bound. We end by proving the desired result for the correlation itself. In the remainder of this paper, for ease of notation, we let

(5.1)

where the Hamiltonian is as in (2.2).

5.1. Variational formula for the ground state of the mixed p-spin model

We begin by recalling the following variational formula for the ground state of the mixed p-spin model. Consider the Gaussian process, indexed by points of the sphere, built from i.i.d. standard Gaussian coefficients; its covariance is a fixed function of the inner product of the index points. Consider also the class of functions on the unit interval that are positive, non-increasing, and concave, and the associated functional on this class. Set

(5.2)

Let us recall the following variational formula, stated in the natural notation for the mixture considered here.

Theorem 5.1 ([20, 32]).

For all mixtures of the above form, the normalized ground state energy converges to the value of the variational formula, almost surely and in L¹.

Remark 5.2.

While the results of [20, 32] are stated in a slightly different setting, they still hold here after the appropriate replacement. To see this, simply note that the Crisanti-Sommers formula still holds in this setting by the main result of [17]. The reformulation from [32, Eq. (1.0.1)] is then adjusted under this replacement by simply repeating the integration by parts argument from [32, Lemma 6.1.1]. From here the arguments are unchanged under the above replacement.

5.2. The lower bound

By Borell’s inequality, the constrained maximum likelihood (1.11) concentrates around its mean with sub-Gaussian tails. In particular, combining this with Borell-Cantelli we see that

(5.3)
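For reference, Borell's inequality (the Borell-TIS inequality) states that for a centered Gaussian process $(H(x))_{x\in T}$ with $\sigma^2 = \sup_{x\in T}\mathbb{E}\,H(x)^2 < \infty$,

\[ \mathbb{P}\Big( \big| \sup_{x\in T} H(x) - \mathbb{E}\sup_{x\in T} H(x) \big| \ge t \Big) \;\le\; 2\, e^{-t^2/(2\sigma^2)} \qquad \text{for all } t>0, \]

which gives the sub-Gaussian concentration used here.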

Clearly, the unconstrained maximum dominates the constrained maximum for every choice of the constraint. Recall the definition from (1.10) and the accompanying properties (see, e.g., Lemma A.4). Applying this with the particular choice of constraint provided by Lemma A.4, Lemma 5.6 below will immediately yield the following lower bound.

Lemma 5.3.

For all signal-to-noise ratios,

(5.4)

We now turn to the proof of lem:up_qs. We begin by observing the following explicit representation for the constrained maximum likelihood.

Lemma 5.4.

For every choice of the constraint, the limit in (1.11) exists and

(5.5)

where the auxiliary quantity is defined in the proof below.

Proof.

We begin by observing that, by rotational invariance, we may take the planted vector to be a fixed coordinate direction. Decomposing a point on the sphere into its component along this direction and an orthogonal part, the Hamiltonian can then be written in terms of i.i.d. standard Gaussians, so that the constrained maximum can be expressed through an auxiliary function. This auxiliary function is a Gaussian process with covariance