The Overlap Gap Property in Principal Submatrix Recovery

August 26, 2019 · David Gamarnik, Aukosh Jagannath, and Subhabrata Sen · MIT, University of Waterloo, Harvard University

We study support recovery for a k × k principal submatrix with elevated mean λ/N, hidden in an N× N symmetric mean zero Gaussian matrix. Here λ>0 is a universal constant, and we assume k = N ρ for some constant ρ∈ (0,1). We establish that the MLE recovers a constant proportion of the hidden submatrix if and only if λ≳√(1/ρlog1/ρ). The MLE is computationally intractable in general, and in fact, for ρ>0 sufficiently small, this problem is conjectured to exhibit a statistical-computational gap. To provide rigorous evidence for this, we study the likelihood landscape for this problem, and establish that for some ε>0 and √(1/ρlog1/ρ)≪λ≪1/ρ^1/2 + ε, the problem exhibits a variant of the Overlap-Gap-Property(OGP). As a direct consequence, we establish that a family of local MCMC based algorithms do not achieve optimal recovery. Finally, we establish that for λ > 1/ρ, a simple spectral method recovers a constant proportion of the hidden submatrix.


1. Introduction

In this paper, we study support recovery for a planted principal submatrix in a large symmetric Gaussian matrix. Formally, we observe an N × N symmetric matrix Y,

(1.1)  Y = M + W.

Throughout, we assume that W is a GOE random matrix; in other words, the entries (W_ij)_{i ≤ j} are independent Gaussian random variables, with the off-diagonal entries i.i.d. N(0, 1/N), and the diagonal entries i.i.d. N(0, 2/N). Regarding the mean matrix M, we assume that there exists a subset S ⊆ [N], |S| = k, such that

M_ij = λ/N if i, j ∈ S, and M_ij = 0 otherwise,

where λ > 0 is a constant independent of N. Equivalently, the observed matrix may be re-written as

(1.2)  Y = (λ/N) σσ^⊤ + W,

where σ ∈ {0,1}^N, with Σ_i σ_i = k and support S. Throughout the subsequent discussion, we will denote the set of such Boolean vectors as

Σ_{N,k} = { σ ∈ {0,1}^N : Σ_i σ_i = k }.
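For concreteness, the following short Python sketch samples an instance of the model (1.2). The function name sample_instance and the numerical parameter values are ours, and the noise normalization (off-diagonal variance 1/N, diagonal variance 2/N) is the one assumed above; this is a simulation sketch, not part of the analysis.

    import numpy as np

    def sample_instance(N, rho, lam, rng=None):
        """Sample (Y, sigma) from the planted principal submatrix model (1.2):
        Y = (lam / N) * sigma sigma^T + W, with W a GOE matrix under the
        normalization assumed above, and sigma a uniformly random 0/1 vector
        with exactly k = rho * N ones."""
        rng = np.random.default_rng(rng)
        k = int(round(rho * N))

        # Planted support: a uniformly random k-subset of [N].
        sigma = np.zeros(N)
        sigma[rng.choice(N, size=k, replace=False)] = 1.0

        # Symmetric Gaussian noise (off-diagonal N(0, 1/N), diagonal N(0, 2/N)).
        G = rng.normal(scale=1.0 / np.sqrt(N), size=(N, N))
        W = (G + G.T) / np.sqrt(2.0)
        np.fill_diagonal(W, rng.normal(scale=np.sqrt(2.0 / N), size=N))

        Y = (lam / N) * np.outer(sigma, sigma) + W
        return Y, sigma

    # Example: N = 500, rho = 0.05 (so k = 25), moderate signal strength.
    Y, sigma = sample_instance(N=500, rho=0.05, lam=8.0, rng=0)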

In the setting introduced above, the following statistical questions are natural.

  1. (Detection) Can we detect the presence of the planted submatrix, i.e., can we consistently test H₀ : λ = 0 against H₁ : λ > 0?

  2. (Recovery) How large should λ be so that the support of σ can be recovered reliably?

  3. (Efficient Recovery) When can support recovery be accomplished using a computationally feasible procedure?

Here, we study support recovery in the special case k = ρN, for some ρ ∈ (0,1). To ensure that this problem is well-defined for all N, we work along a sequence k = k_N such that k_N/N → ρ as N → ∞. Note that in this case, the corresponding submatrix detection question [21] is trivial, and a test which rejects for large values of the sum of the matrix entries consistently detects the underlying submatrix for any λ > 0. Motivated by the sparse PCA problem, we will study support recovery in this setup in the double limit N → ∞ followed by ρ → 0. Deshpande and Montanari [28] initiated a study of the problem (1.2), and established that Bayes optimal recovery of the matrix can be accomplished using an Approximate Message Passing based algorithm, whenever ρ is sufficiently large. In [50, 51], the authors analyze optimal Bayesian estimation in the ρ → 0 regime, and based on the behavior of the fixed points of a state-evolution system, conjecture the existence of an algorithmically hard phase in this problem; specifically, they conjecture that the minimum signal required for accurate support recovery using feasible algorithms should be significantly higher than the information theoretic minimum. This conjecture has been quoted repeatedly in various follow-up works [30, 48, 49], but to the best of our knowledge, it has not been rigorously established in the prior literature. In this paper, we study the likelihood landscape of this problem, and provide rigorous evidence for the veracity of this conjecture.
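As a sanity check on the claim above that detection is trivial in this regime, the sum test can be implemented directly; the sketch below reuses numpy and sample_instance from the earlier block, the standardization constant assumes the GOE normalization used there, and the rejection threshold is an illustrative choice.

    def sum_test(Y, threshold=3.0):
        """Reject the null (no planted submatrix) when the standardized sum of
        all entries of Y is large.  Under the assumed normalization the null sum
        has standard deviation close to sqrt(2 N), while the planted part shifts
        it by lam * rho**2 * N, which dominates for any fixed lam > 0 as N grows."""
        N = Y.shape[0]
        return Y.sum() / np.sqrt(2.0 * N) > threshold

    # The standardized shift is about lam * rho**2 * sqrt(N / 2), so a denser or
    # larger instance makes the effect visible already at moderate N:
    Y_dense, _ = sample_instance(N=800, rho=0.3, lam=8.0, rng=1)
    print(sum_test(Y_dense))   # typically True; for pure noise, typically False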

From a purely conceptual viewpoint, the existence of a computationally hard phase in problem (1.2) is particularly striking. In the context of rank one matrix estimation contaminated with additive Gaussian noise (1.2), it is known that if the spike is sampled uniformly at random from the unit sphere, PCA recovers the underlying signal whenever its detection is possible [58]. In contrast, for rank one tensor estimation under additive Gaussian noise [67], there exists an extensive gap between the threshold for detection [42] and the threshold where tractable algorithms are successful [13, 40, 71]. Thus at first glance, the matrix and tensor problems appear to behave very differently. However, as the present paper establishes, once the planted spike is sufficiently sparse, a hard phase re-appears in the matrix problem.

This problem has natural connections to the planted clique problem [5], sparse PCA [6], biclustering [17, 32, 69], and community detection [1, 56, 59]. All these problems are expected to exhibit a statistical-computational gap: there are regimes where optimal statistical performance might be impossible to achieve using computationally feasible statistical procedures. The potential existence of such fundamental computational barriers has attracted significant attention recently in Statistics, Machine Learning, and Theoretical Computer Science. A number of widely distinct approaches have been used to understand this phenomenon better; prominent ones include average case reductions [16, 19, 20, 22, 37, 53], convex relaxations [11, 23, 29, 39, 52, 54], query lower bounds [31, 68], and direct analysis of specific algorithms [10, 25, 47].

In comparison to the approaches originating from Computer Science and Optimization, a completely different perspective on this problem comes from statistical physics, particularly the study of spin glasses. This approach seeks to understand algorithmic hardness in random optimization problems as a consequence of the underlying geometry of the problem, specifically the structure of the near optimizers. The Overlap Gap Property (OGP) has emerged as a recurrent theme in this context (for an illustration, see Figure 1). At a high level, OGP looks at approximate optimizers in a problem, and establishes that any two such approximate optimizers must either be close to each other, or far away. In other words, the one-dimensional set of distances between the near optimal states is disconnected. This property has been established for various problems arising from theoretical computer science and combinatorial optimization, for instance random constraint satisfaction [2, 33, 55], maximum independent set [32, 66], and a max-cut problem on hypergraphs [24]. Further, OGP has been shown to act as a barrier to the success of a family of “local algorithms” on sparse random graphs [24, 26, 32, 33]. This perspective has been introduced to the study of inference problems arising in Statistics and Machine Learning [13, 32, 34, 35, 36] in recent works by the first two authors, which have yielded new insights into algorithmic barriers in these problems. As an aside, we note that exciting new developments in the study of mean field spin glasses [3, 57, 70] establish that in certain problems without OGP, it is actually possible to obtain approximate optimizers using appropriately designed polynomial time algorithms. This lends further credence to the belief that OGP captures a fundamental algorithmic barrier in random optimization problems; understanding this phenomenon in greater depth remains an exciting direction for future research.

1.1. Results

We initiate the study of reliable support recovery in the setting of (1.2). To introduce this notion, let us begin with the following definitions and observations. For σ ∈ Σ_{N,k}, define the support of σ to be the subset S(σ) ⊆ [N] such that i ∈ S(σ) if and only if σ_i = 1.

To estimate the support, it is evidently equivalent to produce an estimator σ̂ that takes values in the Boolean hypercube {0,1}^N. Observe that if σ̂ is drawn uniformly at random from Σ_{N,k}, then the intersection of the support of σ̂ and σ satisfies |S(σ̂) ∩ S(σ)| = ⟨σ̂, σ⟩ ≈ ρk with high probability, where ⟨·,·⟩ denotes the usual Euclidean inner product. For an estimator σ̂ of σ, in the following, we call the normalized inner product ⟨σ̂, σ⟩/k the overlap of the estimator with σ, or simply the overlap. We are interested in the statistical and algorithmic feasibility of recovering a non-trivial fraction of the support. To this end, we introduce the notion of reliable recovery, which is defined as follows.

Definition 1.1 (Reliable Recovery).

A sequence of estimators σ̂ = σ̂(Y) is said to recover the support reliably if there exists δ > 0, independent of N, such that the overlap satisfies ⟨σ̂, σ⟩/k ≥ ρ + δ with high probability as N → ∞.
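In code, the overlap is simply a normalized inner product. The helpers below (our own naming, reusing the earlier sketch) also illustrate the baseline overlap of a uniformly random guess, which has mean ρ.

    def overlap(sigma_hat, sigma):
        """Normalized inner product <sigma_hat, sigma> / k, where k = ||sigma||_1."""
        return float(sigma_hat @ sigma) / float(sigma.sum())

    def random_guess(N, k, rng=None):
        """A uniformly random element of Sigma_{N,k}; its overlap with the
        planted sigma concentrates around rho = k / N."""
        rng = np.random.default_rng(rng)
        guess = np.zeros(N)
        guess[rng.choice(N, size=k, replace=False)] = 1.0
        return guess

    k = int(sigma.sum())
    print(overlap(random_guess(len(sigma), k, rng=2), sigma))  # has mean rho = 0.05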

We study the question of reliable support recovery in this context, and exhibit a regime of parameters where this problem exhibits OGP. This provides rigorous evidence for the existence of a computationally hard phase in this problem, as suggested in [50, 51]. To substantiate this claim, we establish that OGP acts as a barrier to the success of certain Markov-chain based algorithms. Finally, we show that for very large signal strengths λ, reliable support recovery is easy, and can be achieved by rounding the leading eigenvector of Y.

To state our results in detail, we first introduce the Maximum Likelihood Estimator (MLE) in this setting. Let

(1.3)  H_N(σ′) = ⟨σ′, Y σ′⟩,  σ′ ∈ Σ_{N,k},

and consider the MLE for σ,

(1.4)  σ̂_MLE = argmax_{σ′ ∈ Σ_{N,k}} H_N(σ′).
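Since the MLE maximizes a quadratic form over the exponentially large set Σ_{N,k}, it can only be computed by enumeration at toy sizes. The brute-force sketch below, reusing the helpers defined earlier, is meant solely to make the definition concrete, not as a practical algorithm.

    from itertools import combinations

    def mle_bruteforce(Y, k):
        """Exhaustive maximizer of <s, Y s> over all 0/1 vectors with exactly k
        ones; exponential in N and therefore usable only on tiny instances."""
        N = Y.shape[0]
        best_val, best_s = -np.inf, None
        for support in combinations(range(N), k):
            s = np.zeros(N)
            s[list(support)] = 1.0
            val = s @ Y @ s
            if val > best_val:
                best_val, best_s = val, s
        return best_s

    # Tiny illustration (N = 12, k = 3) with a strong signal.
    Y_small, sigma_small = sample_instance(N=12, rho=0.25, lam=20.0, rng=3)
    print(overlap(mle_bruteforce(Y_small, k=3), sigma_small))  # typically close to 1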

Our first result is information theoretic, and derives the minimum signal strength required for the MLE to recover the underlying support in a reliable manner.

Theorem 1.2.

If λ ≤ c√((1/ρ) log(1/ρ)) for a sufficiently small constant c > 0, the MLE does not recover the support reliably. On the other hand, if λ ≥ C√((1/ρ) log(1/ρ)) for a sufficiently large constant C > 0, the MLE recovers the support reliably.

Thus for any ρ ∈ (0,1) and λ ≥ C√((1/ρ) log(1/ρ)), there exists at least one estimator, namely the MLE, which performs reliable recovery. However, the MLE is computationally intractable in general. Our next result analyzes the likelihood landscape for this problem, and establishes that the problem exhibits OGP for certain parameters in this phase.

To this end, we introduce a version of the overlap gap property in this context. Consider the constrained maximum likelihood,

(1.5)

which denotes the maximum likelihood subject to the additional constraint of achieving overlap q with σ. For any q, fix a sequence of achievable overlaps q_N such that q_N → q as N → ∞. We establish in Theorem 2.1 below that for all such q, the (suitably normalized) constrained maximum likelihood converges to a limit E(q) as N → ∞, and that E(q) can be computed by a deterministic Parisi-type [61] variational problem. In what follows, we refer to E(q) as the constrained ground state energy, or simply the constrained energy. We are now in a position to define the overlap gap property.

Definition 1.3.

We say that the model (1.2) with sparsity ρ and signal-to-noise ratio λ exhibits the overlap gap property (OGP) if the function E(q) has a local maximum q₁ and a global maximum q₂, such that

  1. q₁ < q₂.

  2. there exists δ > 0 such that q₂ − q₁ ≥ δ.

  3. There are no local maxima in the interval (q₁, q₂).

  4. These local maxima satisfy E(q₁) < E(q₂).

Put simply, the overlap gap property states that the MLE achieves an overlap q₂ that is substantially better than ρ, but in the interval between the trivial overlap ρ and q₂, the constrained energy has another local maximum. Heuristically, this suggests that a local optimization procedure, if initialized uniformly at random (and thus starting at overlap approximately ρ), will get trapped at a local maximum which is sub-optimal compared to the true global optimum in terms of both the likelihood and the overlap. We illustrate this notion visually in Figure 1.

Figure 1. An illustration of OGP. Observe that under the conditions of Definition 1.3, there exists a certain threshold energy such that the set of possible overlaps has a gap.

Our main result establishes that the planted submatrix problem (1.2) admits the overlap gap property in the limit of high sparsity and moderate signal-to-noise ratios.

Theorem 1.4.

There exist constants ε > 0, C > 0 and ρ₀ > 0 such that for ρ < ρ₀ sufficiently small, and for all λ with C√((1/ρ) log(1/ρ)) ≤ λ ≤ 1/ρ^{1/2 + ε}, the planted sub-matrix problem has the overlap gap property.

Note that by Theorem 1.2, reliable recovery is possible in the entire regime covered by Theorem 1.4; however, the likelihood landscape exhibits OGP in this part of the parameter space. Observe that the hard phase becomes more prominent as ρ → 0.

Finally, we establish that the OGP established above acts as a barrier to a family of local MCMC type algorithms. To this end, consider a Gibbs distribution π_β on the configuration space Σ_{N,k} with

π_β(σ′) ∝ exp(β H_N(σ′)),

for some β > 0 and Y defined as in (1.2). Note that for any fixed N, as β → ∞, a sample from π_β approximates the MLE (1.4). Thus a simple proxy for the MLE seeks to sample from the distribution π_β for β sufficiently large. It is natural to use local Markov chains to sample from this distribution. Specifically, construct a graph with vertices being the elements of Σ_{N,k}, and add an edge between two states if they are at Hamming distance 2. Finally, let (σ_t)_{t ≥ 0} denote a nearest-neighbor Markov chain on this graph, started from some σ_0, that is reversible with respect to the stationary distribution π_β. The following theorem establishes hitting time bounds for any such Markov chain.
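Before stating the theorem, a minimal concrete instance of such a chain is a Metropolis walk whose proposals swap one selected index with one unselected index (a move of Hamming distance 2), targeting a Gibbs measure proportional to exp(β⟨σ′, Yσ′⟩). The sketch below, reusing the earlier helpers, is our own illustration of one member of this family of dynamics, not the specific chain analyzed in Section 5.

    def local_metropolis(Y, s0, beta, n_steps, rng=None):
        """Nearest-neighbour Metropolis chain on Sigma_{N,k}: each proposal swaps
        a selected coordinate with an unselected one (Hamming distance 2) and is
        accepted with the Metropolis rule for the measure proportional to
        exp(beta * <s, Y s>), which makes the chain reversible for that measure."""
        rng = np.random.default_rng(rng)
        s = s0.copy()
        energy = s @ Y @ s
        for _ in range(n_steps):
            i = rng.choice(np.flatnonzero(s == 1.0))
            j = rng.choice(np.flatnonzero(s == 0.0))
            proposal = s.copy()
            proposal[i], proposal[j] = 0.0, 1.0
            new_energy = proposal @ Y @ proposal
            if np.log(rng.uniform()) < beta * (new_energy - energy):
                s, energy = proposal, new_energy
        return s

    # Started from a uniformly random state (overlap about rho), on the instance above.
    s_T = local_metropolis(Y, random_guess(len(sigma), k, rng=4), beta=5.0, n_steps=2000, rng=5)
    print(overlap(s_T, sigma))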

Theorem 1.5.

For λ and ρ as in Theorem 1.4, there exist overlap values q₁ < q₂, with q₁, q₂ → 0 as ρ → 0, and a constant c > 0 such that, with high probability, if the chain (σ_t) is initialized at overlap at most q₁, then its exit time from the set of states with overlap at most q₂ satisfies

The proof of this result immediately follows by combining Theorem 1.4 with Corollary 5.4. Thus OGP indeed acts as a barrier to these algorithms, and furnishes rigorous evidence of a hard phase in this problem.

As an aside, we also note the following consequence of Theorem 2.1. Our proof of Theorem 1.4 identifies the maximizer of E(q); specializing to the null model, in which the observation is a symmetric matrix with i.i.d. Gaussian entries above the diagonal, we have,

(1.6)

To each k × k principal submatrix of the matrix, assign a score, which corresponds to the sum of its entries. The LHS of (1.6) represents the largest score, as we scan over all k × k principal submatrices. This extends the results of [17] to the case of principal submatrices of size k = ρN. We note that in spin-glass terminology, the score of the largest k × k submatrix (not necessarily principal) corresponds to the ground state of a bipartite spin-glass model [12], which is out of reach of current techniques.
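The score just described is easy to state in code: for each k-subset of indices, sum the entries of the corresponding principal submatrix, and take the maximum over subsets. The exhaustive evaluation below (feasible only at toy sizes, with our own naming, reusing the earlier helpers) makes this quantity concrete.

    def max_principal_submatrix_score(A, k):
        """Maximum, over all k-subsets S of [N], of the sum of the entries of the
        k x k principal submatrix A[S, S]; exhaustive and exponential in N."""
        N = A.shape[0]
        best = -np.inf
        for S in combinations(range(N), k):
            idx = list(S)
            best = max(best, float(A[np.ix_(idx, idx)].sum()))
        return best

    # Best 3 x 3 principal submatrix score of a small pure-noise (lam = 0) instance.
    W_small, _ = sample_instance(N=12, rho=0.25, lam=0.0, rng=6)
    print(max_principal_submatrix_score(W_small, k=3))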

Returning to the planted model (1.2), we complete our discussion by establishing that when the signal λ is appropriately large, reliable support recovery is easily achieved using a spectral algorithm. To this end, we introduce the following two-step estimation algorithm, which rounds the leading eigenvector of Y (a code sketch follows the list below).

  1. Let v̂ denote the eigenvector corresponding to the largest eigenvalue of the matrix Y, normalized so that ‖v̂‖₂ = 1. Denote by Ŝ₀ ⊆ [N] the set of coordinates of v̂ that exceed a suitable threshold.

  2. If |Ŝ₀| ≤ k, sample k − |Ŝ₀| elements uniformly at random from [N] \ Ŝ₀ and augment them to the set Ŝ₀ in order to construct Ŝ.

  3. Otherwise, sample k elements uniformly at random from Ŝ₀, and denote this set as Ŝ.

  4. Finally, construct σ̂ = 1_Ŝ, i.e., a vector with ones corresponding to the entries in Ŝ, and zeros otherwise.
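The sketch below implements one natural reading of the rounding scheme above, reusing the earlier helpers; the thresholding rule in Step 1 (entries of the leading eigenvector exceeding 0.5/√k, after a sign fix) is our own illustrative choice, since the precise threshold is part of the analysis in Section 3.

    def spectral_round(Y, k, thresh_scale=0.5, rng=None):
        """Spectral estimator following Steps 1-4 above: threshold the leading
        eigenvector of Y (illustrative threshold thresh_scale / sqrt(k)), then
        adjust the selected set to have size exactly k."""
        rng = np.random.default_rng(rng)
        N = Y.shape[0]
        _, eigvecs = np.linalg.eigh(Y)
        v = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
        v = v if v.sum() >= 0 else -v           # resolve the global sign ambiguity
        S0 = set(np.flatnonzero(v >= thresh_scale / np.sqrt(k)).tolist())

        if len(S0) < k:
            # Step 2: augment with uniformly random coordinates from the complement.
            extra = rng.choice(sorted(set(range(N)) - S0), size=k - len(S0), replace=False)
            S = S0.union(int(i) for i in extra)
        elif len(S0) > k:
            # Step 3: subsample k coordinates uniformly at random from S0.
            S = set(int(i) for i in rng.choice(sorted(S0), size=k, replace=False))
        else:
            S = S0

        sigma_hat = np.zeros(N)                 # Step 4: indicator vector of S
        sigma_hat[sorted(S)] = 1.0
        return sigma_hat

    # With lam well above 1/rho (here lam * rho = 3), the overlap is typically large.
    Y_big, sigma_big = sample_instance(N=500, rho=0.05, lam=60.0, rng=7)
    print(overlap(spectral_round(Y_big, k=25), sigma_big))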

Observe that the set Ŝ depends on the choice of threshold in Step 1. We keep this dependence implicit for notational convenience. Then we have the following lemma.

Lemma 1.6.

For any ρ ∈ (0,1) and λ > 1/ρ, there exists a choice of the threshold such that, with high probability as N → ∞, σ̂ recovers the support reliably.

Outline: The remainder of this paper is structured as follows. Theorem 1.2 is established in Section 4. To derive this result, as a first step, Lemma 4.1 establishes a tight scaling limit for the maximum score in the unplanted problem as N → ∞. This is accomplished using first and second moment arguments, coupled with Gaussian concentration. Theorem 1.2 follows in a relatively straightforward manner, given Lemma 4.1. Theorem 1.4 is the main contribution of this paper, and is established in Section 2. The proof depends crucially on Theorem 2.1, which derives a Parisi type variational problem for the restricted energy E(q). In turn, to derive Theorem 2.1, we first establish a finite temperature Parisi formula (Proposition 6.1) in Section 8, and subsequently compute a limit of this formula as the temperature converges to zero (see Sections 6 and 7). Similar zero temperature formulae have been instrumental in establishing properties of mean field spin glasses in the low-temperature regime (see e.g. [7, 8, 43, 45]). We emphasize that even with Theorem 2.1, the proof of Theorem 1.4 is subtle, and critically depends on understanding the scaling of the variational formula as ρ → 0. Lemma 1.6 is established in Section 3. Finally, Section 5 studies the effect of OGP on this problem, and establishes that certain local Markov chain based recovery algorithms are stymied by this structural barrier.

Acknowledgments. SS thanks Yash Deshpande for introducing him to the problem. DG gratefully acknowledges the support of ONR grant N00014-17-1-2790. AJ gratefully acknowledges the partial support of NSF grant NSF OISE-1604232.

2. Proof of the overlap gap property

To establish Theorem 1.4, we first require the following variational formula for the limiting constrained energy (1.5). Let denote the space of non-negative Radon measures on equipped with the weak-* topology. Let denote the set

For and , let solve the Cauchy problem

(2.1)

where denotes the positive part, and is the Laplace operator. Note that any is locally absolutely continuous with respect to the Lebesgue measure on , so that is almost surely well-defined. As explained in [43], there is a unique weak solution to this PDE. Consider then the functional

We then have the following,

Theorem 2.1.

For as above, we have that

almost surely.

We defer the proof of this result to Section 6. Let us now complete the proof of the overlap gap property, Theorem 1.4, assuming Theorem 2.1.

Proof of Theorem 1.4.

By Theorem 2.1, we have that

Observe that setting , one may make the linear transformation without changing the value of the functional. Thus it suffices to consider the problem

(2.2)

where are those with . By a standard argument (see e.g. [43]), we have that is jointly strictly convex in . Thus there is a unique minimum. Danskin’s envelope theorem [27] implies

(2.3)

where denote the maximizers of (2.2) corresponding to . On the other hand, by a standard differentiable dependence argument, is classically differentiable in for (see, e.g., [14, Lemma A.5]). Thus we may differentiate in to obtain the following fixed point equation for the optimal

(2.4)

Recall that solves the PDE (2.1), and thus is given by the solution to

(2.5)

where is an elliptic operator. Observe that is the infinitesimal generator of the diffusion , given by the solution to the stochastic differential equation

(2.6)

By Ito’s lemma we have, for , , the stochastic representation formula for ,

In particular, for , we have

(2.7)

Thus (2.4) and (2.3) identify with quantiles of .

Next, we derive bounds on the Lagrange multipliers by analyzing the diffusion itself. Since is weakly differentiable, we see that its weak derivative weakly solves (2.5) as well. Thus

In particular we obtain the maximum principle . Finally we require the following estimate on for the unique minimizer .

Lemma 2.2.

We have that

We defer the proof of this estimate to sec:apriori-proof and complete the proof assuming this estimate.

By applying the maximum principle for to the drift term in (2.6), we see that we may bound the above probability by

Combining this with the stationary point conditions for (2.4), and Lemma 2.2, we have, setting Φ as the CDF of a standard Gaussian random variable,

(2.8)

Armed with these bounds on the optimizers , we can complete the proof in a relatively straightforward manner. First, observe that at , (2.4) ensures that , and consequently, (2.3) implies

Second, note that as , (2.3), combined with (2.8), implies that . Third, let us evaluate the derivative at for some . To determine the sign of the derivative at this point, we will perform our analysis as ρ → 0. (2.3) implies

Using the bounds (2.8), we have,

for some and . The last inequality above uses as . Thus if we choose , then .

Finally, with the choice of detailed above, we evaluate at , for some , to be chosen appropriately. In this case, we again have, using (2.3),

(2.9)

Again, using the bounds derived in (2.8), we have,

If we take where , for , and sufficiently small, we obtain,

The above calculation establishes that there are such that for sufficiently small, and , there exist such that

As is continuous, this implies that there are at least two local maxima, one in and one in , yielding the desired multiple local maxima. It remains to show that the local maximum in is the greater of the two.

To this end, let denote a local maximum in , so that

where here and in the following we let

(2.10)

Recall again that a classical application of Slepian’s comparison inequality [18], comparing

to an IID process with the same variance, yields

(2.11)

for any . Applying (2.11) in the case , yields

(2.12)

By the preceding discussion, . Consequently,

On the other hand, by the preceding, for some . Thus, if we let denote a local maximum with , then

As do not vary in , increasing and slightly, we see that we may take small enough so that

as desired. This completes the proof. ∎

3. A Simple Rounding Scheme when λ > 1/ρ

In this section, we establish that if the SNR is sufficiently large, it is algorithmically easy to reliably recover the support of the hidden principal sub-matrix. Specifically, we establish that for λ > 1/ρ, there exists a simple spectral algorithm which reliably recovers the hidden support. To this end, we start with a simple lemma, which will be critical in the subsequent analysis.

Lemma 3.1.

Let

be jointly distributed random variables, with

. Further, assume , , and for some universal constant . Then

and

Proof.

First, note that implies . Further, implies . By the Paley-Zygmund inequality,

This implies the lower bound

On the other hand, Chebyshev's inequality implies

Thus,

Finally, using Chebychev’s inequality,

. Bayes Theorem implies

This completes the proof. ∎

Armed with Lemma 3.1, we turn to the proof of Lemma 1.6.

Proof of Lemma 1.6.

Recall that , and thus for λ > 1/ρ, the celebrated BBP phase transition [9, 15] implies that there exists a universal constant such that, w.h.p. as N → ∞,

In the subsequent discussion, we condition on this good event. Consider the two-dimensional measure . Set and note that satisfy the hypotheses of Lemma 3.1. Lemma 3.1 implies that

On the event , we have,

On the other hand, if ,

Thus we have,

Further, direct computation reveals

where we denote . Upon observing that , we have, . Thus by Chebyshev's inequality,

This implies that, w.h.p. over the sampling process, the constructed estimator recovers the support reliably. This completes the proof. ∎

4. Statistical feasibility of estimation: proof of Theorem 1.2

We prove Theorem 1.2 in two parts. Let us begin with the first part of Theorem 1.2, regarding the unreliability of support recovery. Before turning to the proof, we introduce some notation and the following lemma regarding the maximum likelihood of the null model.

Lemma 4.1.

We have, as N → ∞,

We defer the proof of this result to the end of the section. In the following, for , let

Proof of part 1 of Theorem 1.2.

Fix , for some with as . Suppose, by way of contradiction, that the maximum likelihood estimator reliably recovers the support. In this case, there exists a universal constant δ > 0 such that, with high probability,

(4.1)

Using (1.5) and Theorem 2.1, we have,

Next, observe that setting in the proof of Theorem 1.4, it follows that is maximized at . Thus we have, using Lemma 4.1,

(4.2)

On the other hand,

Finally, using Slepian’s inequality (2.11) with , we have,

where denotes the hypergeometric distribution with sample size , success states, and observed successes. By Stirling's formula,

(4.3)

Combining these estimates yields

Comparing this to (4.2), we obtain a contradiction for small enough, which completes the proof.

We now turn to the proof of reliable recovery for large enough λ.

Proof of Part 2 of Theorem 1.2.

Fix , and let , to be specified later. By Slepian's inequality (2.11) and the concentration of Gaussian maxima [18, Theorem 5.8], with high probability as N → ∞,