In this paper, we study support recovery for a planted principal submatrix in a large symmetric Gaussian matrix. Formally, we observe a symmetric matrix ,
Throughout, we assume that
is a GOE random matrix; in other words,
are independent Gaussian random variables, with i.i.d. , and i.i.d. . Regarding the mean matrix , we assume that there exists , , such that
where is a constant independent of . Equivalently, the observed matrix may be rewritten as
where , with
. Throughout the subsequent discussion, we will denote the set of such Boolean vectors as .
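For illustration, the planted model above can be simulated as follows. This is a minimal sketch: the names `n`, `k`, and `lam` for the dimension, support size, and signal strength, and the GOE normalization (off-diagonal variance 1, diagonal variance 2, no scaling by the dimension), are our illustrative choices, since the exact scaling is elided here.

```python
import numpy as np

def sample_planted_model(n, k, lam, rng):
    # GOE noise: symmetric, off-diagonal entries N(0, 1), diagonal entries N(0, 2).
    G = rng.standard_normal((n, n))
    W = (G + G.T) / np.sqrt(2.0)
    # Planted support: a uniformly random k-subset, encoded as a 0/1 vector v.
    v = np.zeros(n)
    v[rng.choice(n, size=k, replace=False)] = 1.0
    # Mean lam added on the planted k x k principal block.
    Y = lam * np.outer(v, v) + W
    return Y, v

rng = np.random.default_rng(0)
Y, v = sample_planted_model(200, 20, 3.0, rng)
```

The planted block of `Y` has elevated mean `lam`, while the rest of the matrix is pure GOE noise.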
In the setting introduced above, the following statistical questions are natural.
(Detection) Can we detect the presence of the planted submatrix, i.e., can we consistently test
(Recovery) How large should be, such that the support of can be recovered reliably?
(Efficient Recovery) When can support recovery be accomplished using a computationally feasible procedure?
Here, we study support recovery in the special case , for some . To ensure that this problem is well-defined for all , we work along a sequence such that and . Note that in this case, the corresponding submatrix detection question is trivial, and a test which rejects for large values of the sum of the matrix entries consistently detects the underlying submatrix for any . Motivated by the sparse PCA problem, we will study support recovery in this setup in the double limit , following . Deshpande and Montanari  initiated a study of the problem (1.2), and established that Bayes optimal recovery of the matrix can be accomplished using an Approximate Message Passing based algorithm, whenever is sufficiently large (specifically, ). In [50, 51], the authors analyze optimal Bayesian estimation in this regime, and, based on the behavior of the fixed points of a state-evolution system, conjecture the existence of an algorithmically hard phase in this problem; specifically, they conjecture that the minimum signal required for accurate support recovery using feasible algorithms should be significantly higher than the information-theoretic minimum. This conjecture has been quoted repeatedly in various follow-up works [30, 48, 49], but to the best of our knowledge, it has not been rigorously established in the prior literature. In this paper, we study the likelihood landscape of this problem, and provide rigorous evidence for this conjecture.
From a purely conceptual viewpoint, the existence of a computationally hard phase in problem (1.2) is particularly striking. In the context of rank one matrix estimation contaminated with additive Gaussian noise (1.2), it is known that if the spike is sampled uniformly at random from the unit sphere, PCA recovers the underlying signal, whenever its detection is possible 
. In contrast, for rank one tensor estimation under additive Gaussian noise, there exists an extensive gap between the threshold for detection , and the threshold where tractable algorithms are successful [13, 40, 71]. Thus at first glance, the matrix and tensor problems appear to behave very differently. However, as the present paper establishes, once the planted spike is sufficiently sparse, a hard phase re-appears in the matrix problem.
All these problems are expected to exhibit a statistical-computational gap: there are regimes where optimal statistical performance might be impossible to achieve using computationally feasible statistical procedures. The potential existence of such fundamental computational barriers has attracted significant attention recently in Statistics, Machine Learning, and Theoretical Computer Science. A number of widely distinct approaches have been used to understand this phenomenon better; prominent ones include average-case reductions [16, 19, 20, 22, 37, 53], convex relaxations [11, 23, 29, 39, 52, 54], query lower bounds [31, 68], and direct analysis of specific algorithms [10, 25, 47].
In comparison to the approaches originating from Computer Science and Optimization, a completely different perspective on this problem comes from statistical physics, particularly the study of spin glasses. This approach seeks to understand algorithmic hardness in random optimization problems as a consequence of the underlying geometry of the problem, specifically the structure of the near optimizers. The Overlap Gap Property (OGP) has emerged as a recurrent theme in this context (for an illustration, see Fig 1.3). At a high level, OGP looks at approximate optimizers in a problem, and establishes that any two such approximate optimizers must either be close to each other, or far away; in other words, the one-dimensional set of distances between the near-optimal states is disconnected. This property has been established for various problems arising from theoretical computer science and combinatorial optimization, for instance random constraint satisfaction [2, 33, 55], Max Independent Set [32, 66], and a max-cut problem on hypergraphs . Further, OGP has been shown to act as a barrier to the success of a family of “local algorithms” on sparse random graphs [24, 26, 32, 33]. This perspective has been introduced to the study of inference problems arising in Statistics and Machine Learning in recent works by the first two authors [13, 32, 34, 35, 36], which have yielded new insights into algorithmic barriers in these problems. As an aside, we note that exciting new developments in the study of mean-field spin glasses [3, 57, 70] establish that in certain problems without OGP, it is actually possible to obtain approximate optimizers using appropriately designed polynomial-time algorithms. This lends further credence to the belief that OGP captures a fundamental algorithmic barrier in random optimization problems; understanding this phenomenon in greater depth remains an exciting direction for future research.
We initiate the study of reliable support recovery in the setting of (1.2). To introduce this notion, let us begin with the following definitions and observations. For , define the support of to be the subset , such that
To estimate the support, it is evidently equivalent to produce an estimator that takes values in the Boolean hypercube . Observe that if is drawn uniformly at random, then the intersection of the support of and satisfies
where denotes the usual Euclidean inner product. For an estimator of , in the following, we call the normalized inner product the overlap of the estimator with , or simply the overlap. We are interested in the statistical and algorithmic feasibility of recovering a non-trivial fraction of the support. To this end, we introduce the notion of reliable recovery, which is defined as follows.
Definition 1.1 (Reliable Recovery).
A sequence of estimators is said to recover the support reliably if there exists , independent of , such that
with high probability as .
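Concretely, encoding both the planted vector and the estimator as 0/1 vectors, the overlap and the reliable-recovery criterion of Definition 1.1 can be computed as follows. This is a minimal sketch; the normalization of the inner product by the support size is our reading of the elided display.

```python
def overlap(v_hat, v):
    # Normalized inner product <v_hat, v> / <v, v> between a Boolean
    # estimate v_hat and the planted Boolean vector v.
    inner = sum(a * b for a, b in zip(v_hat, v))
    return inner / sum(b * b for b in v)

def recovers_reliably(v_hat, v, eps):
    # Definition 1.1: the estimator recovers the support reliably if its
    # overlap exceeds some eps > 0 independent of the dimension.
    return overlap(v_hat, v) >= eps

v = [1, 1, 1, 1, 0, 0, 0, 0]
v_hat = [1, 1, 0, 0, 1, 0, 0, 0]  # recovers half of the planted support
```

Here `overlap(v_hat, v)` equals 1/2, so this estimator recovers reliably for any threshold below 1/2.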
We study the question of reliable support recovery in this context, and exhibit a regime of parameters where this problem exhibits OGP. This provides rigorous evidence for the existence of a computationally hard phase in this problem, as suggested in [50, 51]. To substantiate this claim, we establish that OGP acts as a barrier to the success of certain Markov-chain based algorithms. Finally, we show that for very large signal strengths , reliable support recovery is easy, and can be achieved by rounding the leading eigenvector of .
To state our results in detail, we first introduce the Maximum Likelihood Estimator (MLE) in this setting. Let
and consider the MLE for ,
Our first result is information theoretic, and derives the minimum signal strength required for the MLE to recover the underlying support in a reliable manner.
If , the MLE does not recover the support reliably. On the other hand, if , then for any , the MLE recovers the support reliably.
Thus for any and , there exists at least one estimator, namely, the MLE, which performs reliable recovery. However, the MLE is computationally intractable in general. Our next result analyzes the likelihood landscape for this problem, and establishes that the problem exhibits OGP for certain parameters in this phase.
To this end, we introduce a version of the overlap gap property in this context. Consider the constrained maximum likelihood,
which denotes the maximum likelihood subject to the additional constraint of achieving overlap with . For any , fix a sequence such that and . We establish in Theorem 2.1 below that for all , we have that
as and that can be computed by a deterministic Parisi-type  variational problem. In what follows, we refer to as the constrained ground state energy, or simply the constrained energy. We are now in a position to define the overlap gap property.
We say that the model (1.2) with sparsity and signal-to-noise ratio exhibits the overlap gap property (OGP) if the function has a local maximum , and a global maximum , such that
there exists such that .
There are no local maxima in the interval .
These local maxima satisfy
Put simply, the overlap gap property states that the MLE achieves an overlap that is substantially better than , but in the interval , the constrained energy has another local maximum. Heuristically, this suggests that a local optimization procedure, if initialized uniformly at random (and thus starting at overlap ), will get trapped at a local maximum which is sub-optimal compared to the true global optimum in terms of both the likelihood and the overlap. We illustrate this notion visually in Figure 1.
Our main result establishes that the planted submatrix problem (1.2) admits the overlap gap property in the limit of high sparsity and moderate signal-to-noise ratios.
There exist constants such that for sufficiently small, and for all , the planted submatrix problem has the overlap gap property.
Note that by Theorem 1.2, reliable recovery is possible in the entire regime covered by Theorem 1.4; however, the likelihood landscape exhibits OGP in this part of the parameter space. Observe that the hard phase becomes more prominent as .
Finally, we establish that the OGP established above acts as a barrier to a family of local MCMC type algorithms. To this end, consider a Gibbs distribution on the configuration space with
for some and defined as in (1.2). Note that for any fixed , as , a sample from approximates the MLE (1.4). Thus a simple proxy for the MLE seeks to sample from the distribution for sufficiently large. It is natural to use local Markov chains to sample from this distribution. Specifically, construct a graph whose vertices are the elements of , and add an edge between two states if they are at Hamming distance . Finally, let denote a nearest-neighbor Markov chain on this graph, started from , that is reversible with respect to the stationary distribution . The following theorem establishes hitting time bounds for any such Markov chain.
For as in Theorem 1.4, there exist points with as and an such that for some , with probability , if , then the exit time of , , satisfies
The proof of this result follows immediately by combining Theorem 1.4 with Corollary 5.4. Thus OGP indeed acts as a barrier to these algorithms, and furnishes rigorous evidence of a hard phase in this problem.
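The local dynamics considered here can be sketched as a Metropolis chain. This is a minimal sketch under stated assumptions: the move set (swapping a 1 and a 0, i.e., Hamming distance 2, preserving the number of ones) and the inverse temperature `beta` are our illustrative choices, since the exact neighborhood structure is elided in the text.

```python
import math
import random

def submatrix_score(sigma, Y):
    # Score sigma^T Y sigma of the principal submatrix indexed by the
    # support of the 0/1 vector sigma.
    idx = [i for i, s in enumerate(sigma) if s == 1]
    return sum(Y[i][j] for i in idx for j in idx)

def metropolis_step(sigma, Y, beta, rng):
    # One nearest-neighbor move: swap a 1 and a 0, accepted via the
    # Metropolis rule, which is reversible for the Gibbs measure
    # proportional to exp(beta * score).
    ones = [i for i, s in enumerate(sigma) if s == 1]
    zeros = [i for i, s in enumerate(sigma) if s == 0]
    proposal = list(sigma)
    proposal[rng.choice(ones)] = 0
    proposal[rng.choice(zeros)] = 1
    delta = submatrix_score(proposal, Y) - submatrix_score(sigma, Y)
    if delta >= 0 or rng.random() < math.exp(beta * delta):
        return proposal
    return sigma

rng = random.Random(0)
n, k = 12, 3
# A small symmetric Gaussian matrix as the chain's environment.
Y = [[0.0] * n for _ in range(n)]
for a in range(n):
    for b in range(a, n):
        Y[a][b] = Y[b][a] = rng.gauss(0.0, 1.0)

sigma = [1] * k + [0] * (n - k)
for _ in range(200):
    sigma = metropolis_step(sigma, Y, beta=1.0, rng=rng)
```

The swap move keeps the chain on the configuration space of vectors with exactly k ones, matching the constrained state space of the hitting-time bound above.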
As an aside, we also note that Theorem 2.1 implies
Our proof of Theorem 1.4 establishes that is maximized at . Thus, observing that , where is a matrix of i.i.d. random variables, we have,
To each principal submatrix of the matrix , assign a score corresponding to the sum of its entries. The LHS of (1.6) represents the largest score, as we scan over all principal submatrices of . This extends the results of  to the case of principal submatrices of size . We note that in spin-glass terminology, the score of the largest submatrix (not necessarily principal) corresponds to the ground state of a bipartite spin-glass model, which is out of reach of current techniques.
Returning to the planted model (1.2), we complete our discussion by establishing that when the signal is appropriately large, reliable support recovery is easily achieved using a spectral algorithm. To this end, we introduce the following two-step estimation algorithm, which rounds the leading eigenvector of .
Let denote the eigenvector corresponding to the largest eigenvalue of the matrix , with . Denote .
If , sample elements at random from and augment to the set in order to construct .
Otherwise, sample elements uniformly at random from , and denote this set as .
Finally, construct , i.e., a vector with ones corresponding to the entries in , and zeros otherwise.
Observe that the set depends on . We keep this dependence implicit for notational convenience. Then we have the following lemma.
For any and , there exists such that with high probability as , recovers the support reliably.
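A simplified sketch of the two-step rounding scheme above is as follows. We hedge that this replaces the thresholding and random augmentation/subsampling steps by a direct selection of the k largest-magnitude coordinates of the leading eigenvector; the instance parameters `n`, `k`, and `lam` are illustrative choices.

```python
import numpy as np

def spectral_round(Y, k):
    # Simplified rounding: keep the k largest-magnitude coordinates of the
    # leading eigenvector of Y, and output the corresponding 0/1 vector.
    _, eigvecs = np.linalg.eigh(Y)
    v1 = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    top_k = np.argsort(np.abs(v1))[-k:]  # indices of the k largest |entries|
    u = np.zeros(Y.shape[0])
    u[top_k] = 1.0
    return u

# Strong-signal instance: planted mean 5 on a 40 x 40 principal block.
rng = np.random.default_rng(1)
n, k, lam = 200, 40, 5.0
G = rng.standard_normal((n, n))
W = (G + G.T) / np.sqrt(2.0)
v = np.zeros(n)
v[:k] = 1.0
Y = lam * np.outer(v, v) + W
u = spectral_round(Y, k)
```

In this strong-signal regime, the leading eigenvector aligns closely with the planted support, so the rounded estimator achieves overlap near 1.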
Outline: The remainder of this paper is structured as follows. Theorem 1.2 is established in Section 4. To derive this result, as a first step, Lemma 4.1 establishes a tight scaling limit for the maximum score in the unplanted problem as
. This is accomplished using first and second moment arguments, coupled with Gaussian concentration. Theorem 1.2 follows in a relatively straightforward manner, given Lemma 4.1. Theorem 1.4 is the main contribution of this paper, and is established in Section 2. The proof depends crucially on Theorem 2.1, which derives a Parisi-type variational problem for the restricted energy . In turn, to derive Theorem 2.1, we first establish a finite-temperature Parisi formula (Proposition 6.1) in Section 8, and subsequently compute the limit of this formula as the temperature converges to zero (see Sections 6 and 7). Similar zero-temperature formulae have been instrumental in establishing properties of mean-field spin glasses in the low-temperature regime (see, e.g., [7, 8, 43, 45]). We emphasize that even with Theorem 2.1, the proof of Theorem 1.4 is subtle, and critically depends on understanding the scaling of the variational formula as . Lemma 1.6 is established in Section 3. Finally, Section 5 studies the effect of OGP on this problem, and establishes that certain local Markov chain based recovery algorithms are stymied by this structural barrier.
Acknowledgments. SS thanks Yash Deshpande for introducing him to the problem. DG gratefully acknowledges the support of ONR grant N00014-17-1-2790. A.J. gratefully acknowledges the partial support of NSF grant OISE-1604232.
2. Proof of the overlap gap property
To establish Theorem 1.4, we first require the following variational formula for the limiting constrained energy (1.5). Let denote the space of non-negative Radon measures on equipped with the weak-* topology. Let denote the set
For and , let solve the Cauchy problem
where denotes the positive part, and is the Laplace operator. Note that any is locally absolutely continuous with respect to the Lebesgue measure on , so that is almost surely well-defined. As explained in , there is a unique weak solution to this PDE. Consider then the functional
We then have the following,
For as above, we have that
Proof of Theorem 1.4.
By thm:var-form-gs, we have that
Observe that setting , one may make the linear transformation without changing the value of the functional. Thus it suffices to consider the problem
where denote the maximizers of (2.2) corresponding to . On the other hand, by a standard differentiable dependence argument, is classically differentiable in for (see, e.g., [14, Lemma A.5]). Thus we may differentiate in to obtain the following fixed point equation for the optimal
Recall that solves the PDE (2.1), and thus is given by the solution to
where is an elliptic operator. Observe that is the infinitesimal generator of the diffusion , given by the solution to the stochastic differential equation
By Itô’s lemma, we have, for , , the stochastic representation formula for ,
In particular, for , we have
Next, we derive bounds on the Lagrange multipliers by analyzing the diffusion itself. Since is weakly differentiable, we see that its weak derivative weakly solves (2.5) as well. Thus
In particular, we obtain the maximum principle . Finally, we require the following estimate on for the unique minimizer .
We have that
We defer the proof of this estimate to a later section, and complete the proof of the theorem assuming it.
By applying the maximum principle for to the drift term in (2.6), we see that we may bound the above probability by
Second, note that as , (2.3), combined with (2.8), implies that . Third, let us evaluate the at for some . To determine the sign of the derivative at this point, we will perform our analysis as . (2.3) implies
Using the bounds (2.8), we have,
for some and . The last inequality above uses as . Thus if we choose , then .
Finally, with the choice of detailed above, we evaluate at , for some , to be chosen appropriately. In this case, we again have, using (2.3),
Again, using the bounds derived in (2.8), we have,
If we take where , for , and sufficiently small, we obtain,
The above calculation establishes that there are such that for sufficiently small, and , there exist such that
As is continuous, this implies that there are at least two local maxima, one in and one in , yielding the desired multiple local maxima. It remains to show that the local maximum in is the greater of the two.
To this end, let denote a local maximum in , so that
where here and in the following we let
Recall again that a classical application of Slepian’s comparison inequality , comparing
to an i.i.d. process with the same variance, yields
for any . Applying (2.11) in the case , yields
By the preceding discussion, . Consequently,
On the other hand, by the preceding, for some . Thus, if we let denote a local maximum with , then
As do not vary in , increasing and slightly, we see that we may take small enough so that
as desired. This completes the proof. ∎
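The Slepian comparison used in (2.11) can be sanity-checked by Monte Carlo: positively correlated Gaussians have a smaller expected maximum than i.i.d. Gaussians with the same variance. The equicorrelation structure (correlation `rho = 0.5`), the number of coordinates `m`, and the sample size are our illustrative choices, not quantities from the text.

```python
import random

random.seed(0)
m, trials, rho = 50, 2000, 0.5

def mean_max(correlated):
    # Monte Carlo estimate of E[max of m standard Gaussians], either i.i.d.
    # or equicorrelated with correlation rho (marginal variance 1 either way).
    total = 0.0
    for _ in range(trials):
        shared = random.gauss(0.0, 1.0)  # common component (used if correlated)
        xs = []
        for _ in range(m):
            g = random.gauss(0.0, 1.0)
            if correlated:
                xs.append(rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * g)
            else:
                xs.append(g)
        total += max(xs)
    return total / trials

iid_max = mean_max(False)
corr_max = mean_max(True)
```

As Slepian's inequality predicts, the expected maximum of the correlated process is strictly smaller than that of the i.i.d. comparison process.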
3. A simple rounding scheme when
In this section, we establish that if the SNR is sufficiently large, it is algorithmically easy to reliably recover the support of the hidden principal submatrix. Specifically, we establish that for , there exists a simple spectral algorithm which reliably recovers the hidden support. To this end, we start with a simple lemma, which will be critical in the subsequent analysis.
Let be jointly distributed random variables, with . Further, assume , , and for some universal constant . Then
First, note that implies . Further, implies . By the Paley-Zygmund inequality,
This implies the lower bound
On the other hand, Chebyshev's inequality implies
Finally, using Chebyshev's inequality,
. Bayes' theorem implies
This completes the proof. ∎
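The Paley–Zygmund step in the proof above can be verified exactly on a small discrete distribution. The distribution below (uniform on {1, 2, 3}) and the threshold parameter are arbitrary illustrative choices; all moments are exact rationals, so the inequality can be checked without floating-point error.

```python
from fractions import Fraction

# Z uniform on {1, 2, 3}.
support = [Fraction(1), Fraction(2), Fraction(3)]
p = Fraction(1, 3)
EZ = sum(p * z for z in support)       # E[Z] = 2
EZ2 = sum(p * z * z for z in support)  # E[Z^2] = 14/3

theta = Fraction(1, 2)
# Paley–Zygmund: P(Z > theta * E[Z]) >= (1 - theta)^2 * E[Z]^2 / E[Z^2]
prob = sum(p for z in support if z > theta * EZ)  # P(Z > 1) = 2/3
bound = (1 - theta) ** 2 * EZ ** 2 / EZ2          # = 3/14
```

Here the lower bound 3/14 is indeed dominated by the true probability 2/3, as the inequality asserts.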
Proof of Lemma 1.6.
Recall that , and thus for
, the celebrated BBP phase transition [9, 15] implies that there exists a universal constant such that w.h.p. as
On the event , we have,
On the other hand, if ,
Thus we have,
Further, direct computation reveals
where we denote . Upon observing that , we have, . Thus by Chebyshev's inequality,
This implies that w.h.p. over the sampling process, and establishes that the constructed estimator recovers the support reliably. This completes the proof. ∎
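The BBP phase transition invoked in the proof above can be observed numerically. The normalization below (GOE scaled by the square root of the dimension, so the bulk spectrum fills [-2, 2], with a rank-one spike of strength `theta` along a fixed unit vector) and the instance sizes are our illustrative choices: above the transition `theta > 1`, the top eigenvalue separates from the bulk edge and concentrates near `theta + 1/theta`.

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta = 400, 3.0

# GOE matrix normalized so that the bulk spectrum converges to [-2, 2].
G = rng.standard_normal((n, n))
W = (G + G.T) / np.sqrt(2.0 * n)

# Rank-one spike of strength theta along a fixed unit vector u.
u = np.ones(n) / np.sqrt(n)
spiked = W + theta * np.outer(u, u)

top_null = np.linalg.eigvalsh(W)[-1]         # near the bulk edge 2
top_spiked = np.linalg.eigvalsh(spiked)[-1]  # near theta + 1/theta
```

The outlier eigenvalue of the spiked matrix is well separated from the null top eigenvalue, which is what makes the leading eigenvector informative about the planted direction.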
4. Statistical feasibility of estimation: proof of Theorem 1.2
We prove Theorem 1.2 in two parts. Let us begin with the first part, regarding the unreliability of support recovery. Before turning to the proof, we introduce some notation, along with the following lemma regarding the maximum likelihood of the null model.
We have, as ,
We defer the proof of this result to the end of the section. In the following, for , let
Proof of part 1 of Theorem 1.2.
Fix , for some with as . Suppose, by way of contradiction, that the maximum likelihood estimator reliably recovers the support. In this case, there exists a universal constant, , such that with high probability,
On the other hand,
Finally, using Slepian’s inequality (2.11) with , we have,
where denotes the hypergeometric distribution with sample size , success states , and observed successes. By Stirling's formula,
Combining these estimates yields
Comparing this to (4.2), we obtain a contradiction for small enough, which completes the proof.
We now turn to the proof of reliable recovery for large enough .