Many problems in high-dimensional statistics are believed to exhibit gaps between what can be achieved information-theoretically (or statistically, i.e., with unbounded computational power) and what is possible with bounded computational power (e.g., in polynomial time). Examples include finding planted cliques [Jer92, DM15b, MPW15, BHK19] or dense communities [DKMZ11b, DKMZ11a, HS17] in random graphs, extracting variously structured principal components of random matrices [BR13, LKZ15a, LKZ15b] or tensors [HSS15, HKP17], and solving or refuting random constraint satisfaction problems [ACO08, KMOW17].
Although current techniques cannot prove that such average-case problems require super-polynomial time (even assuming $\mathsf{P} \neq \mathsf{NP}$), various forms of rigorous evidence for hardness have been proposed. These include:
In these notes, we survey another emerging method, which we call the low-degree method, for understanding computational hardness in average-case problems. In short, we explore a conjecture that the behavior of a certain quantity – the second moment of the low-degree likelihood ratio – reveals the computational complexity of a given statistical task. We find the low-degree method particularly appealing because it is simple, widely applicable, and can be used to study a wide range of time complexities (e.g., polynomial, quasipolynomial, or nearly-exponential). Furthermore, rather than simply positing a certain “optimal algorithm,” the underlying conjecture captures an interpretable structural feature that seems to dictate whether a problem is easy or hard. Finally, and perhaps most importantly, predictions using the low-degree method have been carried out for a variety of average-case problems, and so far have always reproduced widely-believed results.
Historically, the low-degree method arose from the study of the sum-of-squares (SoS) semidefinite programming hierarchy. In particular, the method is implicit in the pseudo-calibration approach to proving SoS lower bounds [BHK19]. Two concurrent papers [HS17, HKP17] later articulated the idea more explicitly. In particular, Hopkins and Steurer [HS17] were the first to demonstrate that the method can capture sharp thresholds of computational feasibility such as the Kesten–Stigum threshold for community detection in the stochastic block model. The low-degree method was developed further in the PhD thesis of Hopkins [Hop18], which includes a precise conjecture about the complexity-theoretic implications of low-degree predictions. In comparison to sum-of-squares lower bounds, the low-degree method is much simpler to carry out and appears to always yield the same results for natural average-case problems.
In these notes, we aim to provide a self-contained introduction to the low-degree method; we largely avoid reference to SoS and instead motivate the method in other ways. We will briefly discuss the connection to SoS in Section 4.2.2, but we refer the reader to [Hop18] for an in-depth exposition of these connections.
These notes are organized as follows. In Section 1, we present the low-degree method and motivate it as a computationally-bounded analogue of classical statistical decision theory. In Section 2, we show how to carry out the low-degree method for a general class of additive Gaussian noise models. In Section 3, we specialize this analysis to two classical problems: the spiked Wigner matrix and spiked Gaussian tensor models. Finally, in Section 4, we discuss various forms of heuristic and formal evidence for correctness of the low-degree method; in particular, we highlight a formal connection between low-degree lower bounds and the failure of spectral methods (Theorem 4.2.3).
1 Towards a Computationally-Bounded Decision Theory
1.1 Statistical-to-Computational Gaps in Hypothesis Testing
The field of statistical decision theory (see, e.g., [LR06, LC12] for general references) is concerned with the question of how to decide optimally (in some quantitative sense) between several statistical conclusions. The simplest example, and the one we will mainly be concerned with here, is that of simple hypothesis testing: we observe a dataset that we believe was drawn from one of two probability distributions, and want to make an inference (by performing a statistical test) about which distribution we think the dataset was drawn from.
However, one important practical aspect of statistical testing usually is not included in this framework, namely the computational cost of actually performing a statistical test. In these notes, we will explore ideas from a line of recent research about how one mathematical method of classical decision theory might be adapted to predict the capabilities and limitations of computationally bounded statistical tests.
The basic problem that will motivate us is the following. Suppose $(\mathbb{P}_n)_{n \in \mathbb{N}}$ and $(\mathbb{Q}_n)_{n \in \mathbb{N}}$ are two sequences of probability distributions over a common sequence of measurable spaces $((\Omega_n, \mathcal{F}_n))_{n \in \mathbb{N}}$. (In statistical parlance, we will think throughout of $\mathbb{P}_n$ as the model of the alternative hypothesis and $\mathbb{Q}_n$ as the model of the null hypothesis. Later on, we will consider hypothesis testing problems where the distributions $\mathbb{P}_n$ include a “planted” structure, making the notation a helpful mnemonic.) Suppose we observe $Y \in \Omega_n$ which is drawn from one of $\mathbb{P}_n$ or $\mathbb{Q}_n$. We hope to recover this choice of distribution in the following sense.
We say that a sequence of events $(A_n)_{n \in \mathbb{N}}$ with $A_n \in \mathcal{F}_n$ occurs with high probability (in $n$) if the probability of $A_n$ tends to 1 as $n \to \infty$.
A sequence of (measurable) functions $f_n : \Omega_n \to \{\mathtt{p}, \mathtt{q}\}$ is said to strongly distinguish $\mathbb{P}_n$ and $\mathbb{Q}_n$ if $f_n(Y) = \mathtt{p}$ with high probability when $Y \sim \mathbb{P}_n$, and $f_n(Y) = \mathtt{q}$ with high probability when $Y \sim \mathbb{Q}_n$. If such $f_n$ exist, we say that $\mathbb{P}_n$ and $\mathbb{Q}_n$ are statistically distinguishable. (Footnote 1: We will only consider this so-called strong version of distinguishability, where the probability of success must tend to 1 as $n \to \infty$, as opposed to the weak version where this probability need only be bounded above $\frac{1}{2}$. For high-dimensional problems, the strong version typically coincides with important notions of estimating the planted signal (see Section 4.2.6), whereas the weak version is often trivial.)
In our computationally bounded analogue of this definition, let us for now only consider polynomial time tests (we will later consider various other restrictions on the time complexity of , such as subexponential time). Then, the analogue of Definition 1.1 is the following.
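$\mathbb{P}_n$ and $\mathbb{Q}_n$ are said to be computationally distinguishable if there exists a sequence of functions $f_n : \Omega_n \to \{\mathtt{p}, \mathtt{q}\}$, measurable and computable in time polynomial in $n$, such that $f_n$ strongly distinguishes $\mathbb{P}_n$ and $\mathbb{Q}_n$.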
Clearly, computational distinguishability implies statistical distinguishability. On the other hand, a multitude of theoretical evidence suggests that statistical distinguishability does not in general imply computational distinguishability. Occurrences of this phenomenon are called statistical-to-computational (stat-comp) gaps. Typically, such a gap arises in the following slightly more specific way. Suppose the sequence $\mathbb{P}_n$ has a further dependence on a signal-to-noise parameter $\lambda > 0$, so that $\mathbb{P}_n = \mathbb{P}_{\lambda, n}$. This parameter should describe, in some sense, the strength of the structure present under $\mathbb{P}$ (or, in some cases, the number of samples received). The following is one canonical example.
[Planted Clique Problem [Jer92, Kuč95]] Under the null model $\mathbb{Q}_n$, we observe an $n$-vertex Erdős-Rényi graph $G(n, 1/2)$, i.e., each pair $\{i, j\}$ of vertices is connected with an edge independently with probability $1/2$. The signal-to-noise parameter $\lambda$ is an integer $1 \le \lambda \le n$. Under the planted model $\mathbb{P}_{\lambda, n}$, we first choose a random subset of vertices $S \subseteq [n]$ of size $|S| = \lambda$ uniformly at random. We then observe a graph where each pair $\{i, j\}$ of vertices is connected with probability $1$ if $\{i, j\} \subseteq S$ and with probability $1/2$ otherwise. In other words, the planted model consists of the union of $G(n, 1/2)$ with a planted clique (a fully-connected subgraph) on $\lambda$ vertices.
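As $\lambda$ varies, the problem of testing between $\mathbb{P}_{\lambda, n}$ and $\mathbb{Q}_n$ can change from statistically impossible, to statistically possible but computationally hard, to computationally easy. That is, there exists a threshold $\lambda_{\mathrm{stat}}$ such that for any $\lambda > \lambda_{\mathrm{stat}}$, $\mathbb{P}_{\lambda, n}$ and $\mathbb{Q}_n$ are statistically distinguishable, but for $\lambda < \lambda_{\mathrm{stat}}$ are not. There also exists a threshold $\lambda_{\mathrm{comp}}$ such that for any $\lambda > \lambda_{\mathrm{comp}}$, $\mathbb{P}_{\lambda, n}$ and $\mathbb{Q}_n$ are computationally distinguishable, and (conjecturally) for $\lambda < \lambda_{\mathrm{comp}}$ are not. Clearly we must have $\lambda_{\mathrm{comp}} \ge \lambda_{\mathrm{stat}}$, and a stat-comp gap corresponds to strict inequality $\lambda_{\mathrm{comp}} > \lambda_{\mathrm{stat}}$. For instance, the two models in the planted clique problem are statistically distinguishable when $\lambda \ge (2 + \varepsilon) \log_2 n$ (since $2 \log_2 n$ is the typical size of the largest clique in $G(n, 1/2)$), so $\lambda_{\mathrm{stat}} = 2 \log_2 n$. However, the best known polynomial-time distinguishing algorithms only succeed when $\lambda = \Omega(\sqrt{n})$ [Kuč95, AKS98], and so (conjecturally) $\lambda_{\mathrm{comp}} \approx \sqrt{n}$, a large stat-comp gap.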
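To make the two models concrete, here is a minimal Python sketch that samples from the null and planted distributions and applies a simple edge-counting test. The function names, the seed, the sizes, and the z-score threshold of 6 are illustrative assumptions; edge counting is only one elementary polynomial-time test, not the algorithms of [Kuč95, AKS98]. It can only succeed when the clique adds noticeably more edges than the fluctuation of the edge count under the null, which requires the clique size to be well above $\sqrt{n}$.

```python
import math
import random


def sample_graph(n, clique_size, rng):
    """Sample G(n, 1/2) as an edge set, with a planted clique if clique_size > 0."""
    planted = set(rng.sample(range(n), clique_size))
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            in_clique = i in planted and j in planted
            # Clique pairs are always connected; all other pairs are fair coin flips.
            if in_clique or rng.random() < 0.5:
                edges.add((i, j))
    return edges


def edge_count_z_score(n, edges):
    """Standardize the edge count by its mean and standard deviation under G(n, 1/2)."""
    pairs = n * (n - 1) // 2
    mean = pairs / 2
    std = math.sqrt(pairs) / 2
    return (len(edges) - mean) / std


def edge_count_test(n, edges, threshold=6.0):
    """Output 'p' (planted) if the edge count is unusually large, else 'q' (null)."""
    return "p" if edge_count_z_score(n, edges) > threshold else "q"


rng = random.Random(0)            # fixed seed, chosen arbitrarily
n = 200
null_graph = sample_graph(n, 0, rng)
planted_graph = sample_graph(n, 60, rng)   # clique size 60, well above sqrt(200)
print(edge_count_z_score(n, null_graph), edge_count_test(n, null_graph))
print(edge_count_z_score(n, planted_graph), edge_count_test(n, planted_graph))
```

With a clique of size close to $2 \log_2 n$ the same test would be powerless, even though the models remain statistically distinguishable there.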
The remarkable method we discuss in these notes allows us, through a relatively straightforward calculation, to predict the threshold $\lambda_{\mathrm{comp}}$ for many of the known instances of stat-comp gaps. We will present this method as a modification of a classical second moment method for studying $\lambda_{\mathrm{stat}}$.
1.2 Classical Asymptotic Decision Theory
In this section, we review some basic tools available from statistics for understanding statistical distinguishability.
We retain the same notations from the previous section in the later parts, but in the first part of the discussion will only be concerned with a single pair of distributions $\mathbb{P}$ and $\mathbb{Q}$ defined on a single measurable space $(\Omega, \mathcal{F})$.
For the sake of simplicity, let us assume in either case that $\mathbb{P}_n$ (or $\mathbb{P}$) is absolutely continuous with respect to $\mathbb{Q}_n$ (or $\mathbb{Q}$, as appropriate). (Footnote 2: For instance, what will be relevant in the examples we consider later, any pair of non-degenerate multivariate Gaussian distributions satisfy this assumption.)
1.2.1 Basic Notions
We first define the basic objects used to make hypothesis testing decisions, and some ways of measuring their quality.
A test is a measurable function $f : \Omega \to \{\mathtt{p}, \mathtt{q}\}$.
The type I error of $f$ is the event of falsely rejecting the null hypothesis, i.e., of having $f(Y) = \mathtt{p}$ when $Y \sim \mathbb{Q}$. The type II error of $f$ is the event of falsely failing to reject the null hypothesis, i.e., of having $f(Y) = \mathtt{q}$ when $Y \sim \mathbb{P}$. The probabilities of these errors are denoted
$$\alpha(f) := \mathbb{Q}(f(Y) = \mathtt{p}), \qquad \beta(f) := \mathbb{P}(f(Y) = \mathtt{q}).$$
The probability $1 - \beta(f)$ of correctly rejecting the null hypothesis is called the power of $f$. There is a tradeoff between type I and type II errors. For instance, the trivial test that always outputs $\mathtt{p}$ will have maximal power, but will also have maximal probability of type I error, and vice-versa for the trivial test that always outputs $\mathtt{q}$. Thus, typically one fixes a tolerance for one type of error, and then attempts to design a test that minimizes the probability of the other type.
1.2.2 Likelihood Ratio Testing
We next present the classical result showing that it is in fact possible to identify the test that is optimal in the sense of the above tradeoff. (Footnote 3: It is important to note that, from the point of view of statistics, we are restricting our attention to the special case of deciding between two “simple” hypotheses, where each hypothesis consists of the dataset being drawn from a specific distribution. Optimal testing is more subtle for “composite” hypotheses in parametric families of probability distributions, a more typical setting in practice. The mathematical difficulties of this extended setting are discussed thoroughly in [LR06].)
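Let $\mathbb{P}$ be absolutely continuous with respect to $\mathbb{Q}$. The likelihood ratio of $\mathbb{P}$ and $\mathbb{Q}$ is
$$L(Y) := \frac{d\mathbb{P}}{d\mathbb{Q}}(Y).$$
(Footnote 4: For readers not familiar with the Radon–Nikodym derivative: if $\mathbb{P}$, $\mathbb{Q}$ are discrete distributions then $L(Y) = \mathbb{P}(Y) / \mathbb{Q}(Y)$; if $\mathbb{P}$, $\mathbb{Q}$ are continuous distributions with density functions $p$, $q$ (respectively) then $L(Y) = p(Y) / q(Y)$.)
The thresholded likelihood ratio test with threshold $\eta$ is the test
$$L_\eta(Y) := \begin{cases} \mathtt{p} & \text{if } L(Y) > \eta, \\ \mathtt{q} & \text{if } L(Y) \le \eta. \end{cases}$$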
Let us first present a heuristic argument for why thresholding the likelihood ratio might be a good idea. Specifically, we will show that the likelihood ratio is optimal in a particular “$L^2$ sense” (which will be of central importance later), i.e., when its quality is measured in terms of first and second moments of a testing quantity.
For (measurable) functions $f, g : \Omega \to \mathbb{R}$, define the inner product and norm induced by $\mathbb{Q}$:
$$\langle f, g \rangle := \mathbb{E}_{Y \sim \mathbb{Q}}[f(Y) g(Y)], \qquad \|f\| := \sqrt{\langle f, f \rangle}.$$
Let $L^2(\mathbb{Q})$ denote the Hilbert space consisting of functions $f$ for which $\|f\| < \infty$, endowed with the above inner product and norm. (Footnote 5: For a more precise definition of $L^2(\mathbb{Q})$ (in particular including issues around functions differing on sets of measure zero) see a standard reference on real analysis such as [SS09].)
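If $\mathbb{P}$ is absolutely continuous with respect to $\mathbb{Q}$, then the unique solution $f^*$ of the optimization problem
$$\text{maximize } \mathbb{E}_{Y \sim \mathbb{P}}[f(Y)] \quad \text{subject to } \mathbb{E}_{Y \sim \mathbb{Q}}[f(Y)^2] = 1$$
is the (normalized) likelihood ratio
$$f^* = L / \|L\|,$$
and the value of the optimization problem is $\|L\|$.
We may rewrite the objective as
$$\mathbb{E}_{Y \sim \mathbb{P}}[f(Y)] = \mathbb{E}_{Y \sim \mathbb{Q}}[L(Y) f(Y)] = \langle L, f \rangle,$$
and rewrite the constraint as $\|f\| = 1$. The result now follows since $\langle L, f \rangle \le \|L\| \cdot \|f\| = \|L\|$ by the Cauchy-Schwarz inequality, with equality if and only if $f$ is a scalar multiple of $L$. ∎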
In words, this means that if we want a function to be as large as possible in expectation under $\mathbb{P}$ while remaining bounded (in the $L^2$ sense) under $\mathbb{Q}$, we can do no better than the likelihood ratio. We will soon return to this type of reasoning in order to devise computationally-bounded statistical tests.
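The optimality statement can be checked numerically on a finite sample space, where the likelihood ratio is simply the ratio of probability mass functions. The sketch below uses a hypothetical pair of distributions on four points (the numbers are chosen only for illustration) and compares the normalized likelihood ratio against randomly drawn competitors that satisfy the same second-moment constraint under the null.

```python
import math
import random

# Hypothetical toy distributions on a four-point sample space.
Q = [0.25, 0.25, 0.25, 0.25]   # null
P = [0.40, 0.30, 0.20, 0.10]   # alternative

L = [p / q for p, q in zip(P, Q)]                         # likelihood ratio dP/dQ
norm_L = math.sqrt(sum(q * l * l for q, l in zip(Q, L)))  # norm of L under Q


def objective(f):
    """E_{Y ~ P}[f(Y)], the quantity being maximized."""
    return sum(p * v for p, v in zip(P, f))


def normalize(f):
    """Rescale f so that E_{Y ~ Q}[f(Y)^2] = 1."""
    norm_f = math.sqrt(sum(q * v * v for q, v in zip(Q, f)))
    return [v / norm_f for v in f]


f_star = [l / norm_L for l in L]   # the claimed optimizer L / ||L||

rng = random.Random(0)
best_random = max(
    objective(normalize([rng.gauss(0, 1) for _ in Q])) for _ in range(10_000)
)
print(norm_L, objective(f_star), best_random)
```

In this example the value attained by the normalized likelihood ratio coincides with its norm, and no random competitor exceeds it, as the Cauchy-Schwarz argument predicts.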
The following classical result shows that the above heuristic is accurate, in that the thresholded likelihood ratio tests achieve the optimal tradeoff between type I and type II errors.
[Neyman–Pearson Lemma [NP33]] Fix an arbitrary threshold $\eta \ge 0$. Among all tests $f$ with $\alpha(f) \le \alpha(L_\eta) = \mathbb{Q}(L(Y) > \eta)$, $L_\eta$ is the test that maximizes the power $1 - \beta(f)$.
We provide the standard proof of this result in Appendix A.1 for completeness. (The proof is straightforward but not important for understanding the rest of these notes, and it can be skipped on a first reading.)
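As a worked illustration of thresholded likelihood ratio tests and their error probabilities, consider the hypothetical one-dimensional pair $\mathbb{P} = \mathcal{N}(\mu, 1)$ and $\mathbb{Q} = \mathcal{N}(0, 1)$ with $\mu > 0$, for which $L(y) = \exp(\mu y - \mu^2/2)$. This example and the function names below are assumptions made for illustration; the sketch computes the type I and type II error probabilities exactly from the normal CDF.

```python
import math


def normal_cdf(t):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def likelihood_ratio(y, mu):
    """L(y) = dP/dQ(y) for P = N(mu, 1) and Q = N(0, 1)."""
    return math.exp(mu * y - 0.5 * mu * mu)


def lrt(y, mu, eta):
    """Thresholded likelihood ratio test: output 'p' iff L(y) > eta."""
    return "p" if likelihood_ratio(y, mu) > eta else "q"


def error_probabilities(mu, eta):
    """Exact (type I, type II) error probabilities of the test above, for mu > 0."""
    # L is increasing in y when mu > 0, so L(y) > eta  <=>  y > cutoff.
    cutoff = math.log(eta) / mu + mu / 2.0
    alpha = 1.0 - normal_cdf(cutoff)    # Q(L(Y) > eta)
    beta = normal_cdf(cutoff - mu)      # P(L(Y) <= eta)
    return alpha, beta


for eta in (0.5, 1.0, 2.0):
    alpha, beta = error_probabilities(mu=2.0, eta=eta)
    print(f"eta={eta}: alpha={alpha:.4f}  beta={beta:.4f}  power={1 - beta:.4f}")
```

Raising the threshold lowers the type I error at the cost of power, which is the tradeoff described above.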
1.2.3 Le Cam’s Contiguity
Since the likelihood ratio is, in the sense of the Neyman–Pearson lemma, an optimal statistical test, it stands to reason that it should be possible to argue about statistical distinguishability solely by computing with the likelihood ratio. We present one simple method by which such arguments may be made, based on a theory introduced by Le Cam [Le 60].
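We will work again with sequences of probability measures $(\mathbb{P}_n)_{n \in \mathbb{N}}$ and $(\mathbb{Q}_n)_{n \in \mathbb{N}}$, and will denote by $L_n$ the likelihood ratio $d\mathbb{P}_n / d\mathbb{Q}_n$. Norms and inner products of functions are those of $L^2(\mathbb{Q}_n)$. The following is the crucial definition underlying the arguments to come.
A sequence $(\mathbb{P}_n)$ of probability measures is contiguous to a sequence $(\mathbb{Q}_n)$, written $\mathbb{P}_n \triangleleft \mathbb{Q}_n$, if whenever $A_n \in \mathcal{F}_n$ with $\mathbb{Q}_n(A_n) \to 0$ (as $n \to \infty$), then $\mathbb{P}_n(A_n) \to 0$ as well.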
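If $\mathbb{P}_n \triangleleft \mathbb{Q}_n$ or $\mathbb{Q}_n \triangleleft \mathbb{P}_n$, then $\mathbb{Q}_n$ and $\mathbb{P}_n$ are statistically indistinguishable (in the sense of Definition 1.1, i.e., no test can have both type I and type II error probabilities tending to 0).
We give the proof for the case $\mathbb{P}_n \triangleleft \mathbb{Q}_n$, but the other case may be shown by a symmetric argument. For the sake of contradiction, let $(f_n)_{n \in \mathbb{N}}$ be a sequence of tests distinguishing $\mathbb{P}_n$ and $\mathbb{Q}_n$, and let $A_n = \{Y : f_n(Y) = \mathtt{p}\}$. Then, $\mathbb{P}_n(A_n^c) \to 0$ and $\mathbb{Q}_n(A_n) \to 0$. But, by contiguity, $\mathbb{Q}_n(A_n) \to 0$ implies $\mathbb{P}_n(A_n) \to 0$ as well, so $\mathbb{P}_n(A_n^c) \to 1$, a contradiction. ∎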
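It therefore suffices to establish contiguity in order to prove negative results about statistical distinguishability. The following classical second moment method gives a means of establishing contiguity through a computation with the likelihood ratio.
[Second Moment Method for Contiguity] If $\|L_n\|^2 := \mathbb{E}_{Y \sim \mathbb{Q}_n}[L_n(Y)^2]$ remains bounded as $n \to \infty$ (i.e., $\limsup_{n \to \infty} \|L_n\|^2 < \infty$), then $\mathbb{P}_n \triangleleft \mathbb{Q}_n$.
Let $A_n \in \mathcal{F}_n$. Then, using the Cauchy–Schwarz inequality,
$$\mathbb{P}_n(A_n) = \mathbb{E}_{Y \sim \mathbb{P}_n}[\mathbb{1}_{A_n}(Y)] = \mathbb{E}_{Y \sim \mathbb{Q}_n}[L_n(Y) \mathbb{1}_{A_n}(Y)] \le \sqrt{\mathbb{E}_{Y \sim \mathbb{Q}_n}[L_n(Y)^2]} \cdot \sqrt{\mathbb{Q}_n(A_n)},$$
and so $\mathbb{Q}_n(A_n) \to 0$ implies $\mathbb{P}_n(A_n) \to 0$. ∎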
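The Cauchy–Schwarz step in this proof bounds the probability of any event under the alternative by the likelihood ratio norm times the square root of its probability under the null. The sketch below checks this on a hypothetical one-dimensional example, taking the null to be $\mathcal{N}(0, 1)$ and the alternative to be $\mathcal{N}(\mu, 1)$, for which the second moment of the likelihood ratio under the null equals $\exp(\mu^2)$. The events are upper tails $\{y > t\}$; the example and all names are illustrative assumptions.

```python
import math


def upper_tail(t):
    """P(z > t) for z ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


mu = 1.5
# For P = N(mu, 1) and Q = N(0, 1): L(y) = exp(mu*y - mu^2/2), so E_Q[L^2] = exp(mu^2).
norm_L = math.exp(0.5 * mu * mu)

for t in (0.0, 2.0, 4.0, 6.0, 8.0):
    q_event = upper_tail(t)          # Q(A) for the event A = {y > t}
    p_event = upper_tail(t - mu)     # P(A) for the same event
    bound = norm_L * math.sqrt(q_event)
    print(f"t={t}: Q(A)={q_event:.3e}  P(A)={p_event:.3e}  bound={bound:.3e}")
```

As the null probability of the event shrinks, the bound forces the alternative probability to shrink as well, which is exactly the mechanism behind contiguity when the second moment stays bounded.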
This second moment method has been used to establish contiguity for various high-dimensional statistical problems (see, e.g., [MRZ15, BMV18, PWBM18, PWB16]). Typically the null hypothesis $\mathbb{Q}_n$ is a “simpler” distribution than $\mathbb{P}_n$ and, as a result, $d\mathbb{P}_n / d\mathbb{Q}_n$ is easier to compute than $d\mathbb{Q}_n / d\mathbb{P}_n$. In general, and essentially for this reason, establishing $\mathbb{Q}_n \triangleleft \mathbb{P}_n$ is often more difficult than $\mathbb{P}_n \triangleleft \mathbb{Q}_n$, requiring tools such as the small subgraph conditioning method (introduced in [RW92, RW94] and used in, e.g., [MNS15, BMNN16]). Fortunately, one-sided contiguity $\mathbb{P}_n \triangleleft \mathbb{Q}_n$ is sufficient for our purposes.
Note that $\|L_n\|$, the quantity that controls contiguity per the second moment method, is the same as the optimal value of the optimization problem in Proposition 1.2.2:
$$\left\{ \text{maximize } \mathbb{E}_{Y \sim \mathbb{P}_n}[f(Y)] \quad \text{subject to } \mathbb{E}_{Y \sim \mathbb{Q}_n}[f(Y)^2] = 1 \right\} = \|L_n\|.$$
We might then be tempted to conjecture that $\mathbb{P}_n$ and $\mathbb{Q}_n$ are statistically distinguishable if and only if $\|L_n\| \to \infty$ as $n \to \infty$. However, this is incorrect: there are cases when $\mathbb{P}_n$ and $\mathbb{Q}_n$ are not distinguishable, yet a rare “bad” event under $\mathbb{P}_n$ causes $\|L_n\|$ to diverge. To overcome this failure of the ordinary second moment method, some previous works (e.g., [BMNN16, BMV18, PWB16, PWBM18]) have used conditional second moment methods to show indistinguishability, where the second moment method is applied to a modified $\mathbb{P}_n$ that conditions on these bad events not occurring.
1.3 Basics of the Low-Degree Method
We now describe the low-degree analogues of the notions described in the previous section, which together constitute a method for restricting the classical decision-theoretic second moment analysis to computationally-bounded tests. The premise of this low-degree method is to take low-degree multivariate polynomials in the entries of the observation as a proxy for efficiently-computable functions. The ideas in this section were first developed in a sequence of works in the sum-of-squares optimization literature [BHK19, HS17, HKP17, Hop18].
In the computationally-unbounded case, Proposition 1.2.2 showed that the likelihood ratio optimally distinguishes $\mathbb{P}_n$ from $\mathbb{Q}_n$ in the $L^2$ sense. Following the same heuristic, we will now find the low-degree polynomial that best distinguishes $\mathbb{P}_n$ from $\mathbb{Q}_n$ in the $L^2$ sense. In order for polynomials to be defined, we assume here that $\Omega_n \subseteq \mathbb{R}^N$ for some $N = N(n)$, i.e., our data (drawn from $\mathbb{P}_n$ or $\mathbb{Q}_n$) is a real-valued vector (which may be structured as a matrix, tensor, etc.).
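Let $\mathcal{V}_n^{\le D} \subset L^2(\mathbb{Q}_n)$ denote the linear subspace of polynomials $\Omega_n \to \mathbb{R}$ of degree at most $D$. Let $\mathcal{P}^{\le D} : L^2(\mathbb{Q}_n) \to \mathcal{V}_n^{\le D}$ denote the orthogonal projection operator to this subspace. (Footnote 6: To clarify, orthogonal projection is with respect to the inner product induced by $\mathbb{Q}_n$ (see Definition 1.2.2).) Finally, define the $D$-low-degree likelihood ratio ($D$-LDLR) as $L_n^{\le D} := \mathcal{P}^{\le D} L_n$.
We now have a low-degree analogue of Proposition 1.2.2, which first appeared in [HS17, HKP17]. The unique solution $f^*$ of the optimization problem
$$\text{maximize } \mathbb{E}_{Y \sim \mathbb{P}_n}[f(Y)] \quad \text{subject to } \mathbb{E}_{Y \sim \mathbb{Q}_n}[f(Y)^2] = 1, \; f \in \mathcal{V}_n^{\le D}$$
is the (normalized) $D$-LDLR
$$f^* = L_n^{\le D} / \|L_n^{\le D}\|,$$
and the value of the optimization problem is $\|L_n^{\le D}\|$.
As in the proof of Proposition 1.2.2, we can restate the optimization problem as maximizing $\langle L_n, f \rangle$ subject to $\|f\| = 1$ and $f \in \mathcal{V}_n^{\le D}$. Since $\mathcal{V}_n^{\le D}$ is a linear subspace of $L^2(\mathbb{Q}_n)$, the result is then simply a restatement of the variational description and uniqueness of the orthogonal projection in $L^2(\mathbb{Q}_n)$ (i.e., the fact that $L_n^{\le D}$ is the unique closest element of $\mathcal{V}_n^{\le D}$ to $L_n$). ∎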
The following informal conjecture is at the heart of the low-degree method. It states that a computational analogue of the second moment method for contiguity holds, with $L_n^{\le D}$ playing the role of the likelihood ratio $L_n$. Furthermore, it postulates that polynomials of degree roughly $\log(n)$ are a proxy for polynomial-time algorithms. This conjecture is based on [HS17, HKP17, Hop18], particularly Conjecture 2.2.4 of [Hop18].
[Informal] For “sufficiently nice” sequences of probability measures $\mathbb{P}_n$ and $\mathbb{Q}_n$, if there exists $\varepsilon > 0$ and $D = D(n) \ge (\log n)^{1 + \varepsilon}$ for which $\|L_n^{\le D}\|$ remains bounded as $n \to \infty$, then there is no polynomial-time algorithm that strongly distinguishes (see Definition 1.1) $\mathbb{P}_n$ and $\mathbb{Q}_n$.
We will discuss this conjecture in more detail later (see Section 4), including the informal meaning of “sufficiently nice” and a variant of the LDLR based on coordinate degree considered by [HKP17, Hop18] (see Section 4.2.4). A more general form of the low-degree conjecture (Hypothesis 2.1.5 of [Hop18]) states that degree-$D$ polynomials are a proxy for time-$n^{\tilde{\Theta}(D)}$ algorithms, allowing one to probe a wide range of time complexities. We will see that the converse of these low-degree conjectures often holds in practice; i.e., if $\|L_n^{\le D}\| \to \infty$, then there exists a distinguishing algorithm of runtime roughly $n^D$. As a result, the behavior of $\|L_n^{\le D}\|$ precisely captures the (conjectured) power of computationally-bounded testing in many settings.
The remainder of these notes is organized as follows. In Section 2, we work through the calculations of $L_n$, $L_n^{\le D}$, and their norms for a general family of additive Gaussian noise models. In Section 3, we apply this analysis to a few specific models of interest: the spiked Wigner matrix and spiked Gaussian tensor models. In Section 4, we give some further discussion of Conjecture 1.3, including evidence (both heuristic and formal) in its favor.
2 The Additive Gaussian Noise Model
We will now describe a concrete class of hypothesis testing problems and analyze them using the machinery introduced in the previous section. The examples we discuss later (spiked Wigner matrix and spiked tensor) will be specific instances of this general class.
2.1 The Model
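[Additive Gaussian Noise Model] Let $N = N(n) \in \mathbb{N}$ and let $X = X_n \in \mathbb{R}^N$ (the “signal”) be drawn from some distribution $\mathcal{P}_n$ (the “prior”) over $\mathbb{R}^N$. Let $Z \in \mathbb{R}^N$ (the “noise”) have i.i.d. entries distributed as $\mathcal{N}(0, 1)$. Then, we define $\mathbb{P}_n$ and $\mathbb{Q}_n$ as follows.
Under $\mathbb{P}_n$, observe $Y = X + Z$.
Under $\mathbb{Q}_n$, observe $Y = Z$.
One typical situation takes $X$ to be a low-rank matrix or tensor. The following is a particularly important and well-studied special case, which we will return to in Section 3.2.
[Wigner Spiked Matrix Model] Consider the additive Gaussian noise model with $N = n^2$, $\mathbb{R}^N$ identified with $n \times n$ matrices with real entries, and $\mathcal{P}_n$ defined by $X = \lambda x x^\top \in \mathbb{R}^{n \times n}$, where $\lambda = \lambda(n) > 0$ is a signal-to-noise parameter and $x \in \mathbb{R}^n$ is drawn from some distribution $\mathcal{X}_n$ over $\mathbb{R}^n$. Then, the task of distinguishing $\mathbb{P}_n$ from $\mathbb{Q}_n$ amounts to distinguishing $\lambda x x^\top + Z$ from $Z$ where $Z \in \mathbb{R}^{n \times n}$ has i.i.d. entries distributed as $\mathcal{N}(0, 1)$. (This variant is equivalent to the more standard model in which the noise matrix is symmetric; see Appendix A.2.)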
This problem is believed to exhibit stat-comp gaps for some choices of $\mathcal{X}_n$ but not others; see, e.g., [LKZ15a, LKZ15b, KXZ16, BMV18, PWBM18]. At a heuristic level, the typical sparsity of vectors under $\mathcal{X}_n$ seems to govern the appearance of a stat-comp gap.
In the spiked Wigner problem, as in many others, one natural statistical task besides distinguishing the null and planted models is to non-trivially estimate the vector $x$ given $Y \sim \mathbb{P}_n$, i.e., to compute an estimate $\hat{x} = \hat{x}(Y)$ such that $|\langle \hat{x}, x \rangle| / (\|\hat{x}\| \cdot \|x\|) > \varepsilon$ with high probability, for some constant $\varepsilon > 0$. Typically, for natural high-dimensional problems, non-trivial estimation of $x$ is statistically or computationally possible precisely when it is statistically or computationally possible (respectively) to strongly distinguish $\mathbb{P}_n$ and $\mathbb{Q}_n$; see Section 4.2.6 for further discussion.
2.2 Computing the Classical Quantities
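Suppose $\mathbb{P}_n$ and $\mathbb{Q}_n$ are as defined in Definition 2.1, with a sequence of prior distributions $(\mathcal{P}_n)_{n \in \mathbb{N}}$. Then, the likelihood ratio of $\mathbb{P}_n$ and $\mathbb{Q}_n$ is
$$L_n(Y) = \frac{d\mathbb{P}_n}{d\mathbb{Q}_n}(Y) = \mathbb{E}_{X \sim \mathcal{P}_n}\left[\exp\left(-\frac{1}{2}\|X\|^2 + \langle X, Y \rangle\right)\right].$$
Suppose $\mathbb{P}_n$ and $\mathbb{Q}_n$ are as defined in Definition 2.1, with a sequence of prior distributions $(\mathcal{P}_n)_{n \in \mathbb{N}}$. Then,
$$\|L_n\|^2 = \mathbb{E}_{X^1, X^2 \sim \mathcal{P}_n}\left[\exp\left(\langle X^1, X^2 \rangle\right)\right], \tag{4}$$
where $X^1, X^2$ are drawn independently from $\mathcal{P}_n$.
We apply the important trick of rewriting a squared expectation as an expectation over the two independent “replicas” $X^1, X^2$ appearing in the result:
$$\|L_n\|^2 = \mathbb{E}_{Y \sim \mathbb{Q}_n}\left[\left(\mathbb{E}_{X \sim \mathcal{P}_n} \exp\left(\langle Y, X \rangle - \frac{1}{2}\|X\|^2\right)\right)^2\right] = \mathbb{E}_{Y \sim \mathbb{Q}_n} \mathbb{E}_{X^1, X^2 \sim \mathcal{P}_n} \exp\left(\langle Y, X^1 + X^2 \rangle - \frac{1}{2}\|X^1\|^2 - \frac{1}{2}\|X^2\|^2\right),$$
where $X^1$ and $X^2$ are drawn independently from $\mathcal{P}_n$. We now swap the order of the expectations,
$$= \mathbb{E}_{X^1, X^2 \sim \mathcal{P}_n}\left[\exp\left(-\frac{1}{2}\|X^1\|^2 - \frac{1}{2}\|X^2\|^2\right) \mathbb{E}_{Y \sim \mathbb{Q}_n} \exp\left(\langle Y, X^1 + X^2 \rangle\right)\right],$$
and the inner expectation may be evaluated explicitly using the moment-generating function of a Gaussian distribution (if $y \sim \mathcal{N}(0, 1)$, then for any fixed $t \in \mathbb{R}$, $\mathbb{E}[\exp(ty)] = \exp(t^2/2)$),
$$= \mathbb{E}_{X^1, X^2 \sim \mathcal{P}_n} \exp\left(-\frac{1}{2}\|X^1\|^2 - \frac{1}{2}\|X^2\|^2 + \frac{1}{2}\|X^1 + X^2\|^2\right),$$
from which the result follows by expanding the term inside the exponential. (Footnote 7: Two techniques from this calculation are elements of the “replica method” from statistical physics: (1) writing a power of an expectation as an expectation over independent “replicas” and (2) changing the order of expectations and evaluating the moment-generating function. The interested reader may see [MPV87] for an early reference, or [MM09, BPW18] for two recent presentations.) ∎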
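The replica identity can be sanity-checked numerically in the simplest possible case. The sketch below assumes a hypothetical scalar prior ($N = 1$) that puts mass $1/2$ on each of $\pm a$; it computes the second moment of the likelihood ratio under the null directly by quadrature and compares it with the expectation of $\exp(X^1 X^2)$ over two independent draws from the prior. In this toy case both equal $\cosh(a^2)$. The quadrature routine and its parameters are illustrative choices.

```python
import math


def gaussian_expectation(g, lo=-12.0, hi=12.0, steps=24_000):
    """Trapezoid-rule approximation of E[g(y)] for y ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        y = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * g(y) * math.exp(-0.5 * y * y)
    return total * h / math.sqrt(2.0 * math.pi)


# Hypothetical scalar prior (N = 1): X = +a or -a with probability 1/2 each.
a = 1.0
prior = [(0.5, a), (0.5, -a)]


def likelihood_ratio(y):
    """L(y) = E_X exp(-X^2/2 + X*y) in the additive Gaussian noise model."""
    return sum(w * math.exp(-0.5 * x * x + x * y) for w, x in prior)


# Left-hand side: E_{y ~ Q}[L(y)^2], computed directly by quadrature.
direct = gaussian_expectation(lambda y: likelihood_ratio(y) ** 2)
# Right-hand side: E exp(X1 * X2) over two independent replicas.
replica = sum(w1 * w2 * math.exp(x1 * x2) for w1, x1 in prior for w2, x2 in prior)
print(direct, replica, math.cosh(a * a))
```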
2.3 Computing the Low-Degree Quantities
In this section, we will show that the norm of the LDLR (see Section 1.3) takes the following remarkably simple form under the additive Gaussian noise model. Suppose $\mathbb{P}_n$ and $\mathbb{Q}_n$ are as defined in Definition 2.1, with a sequence of prior distributions $(\mathcal{P}_n)_{n \in \mathbb{N}}$. Let $L_n^{\le D}$ be as in Definition 1.3. Then,
$$\|L_n^{\le D}\|^2 = \mathbb{E}_{X^1, X^2 \sim \mathcal{P}_n}\left[\sum_{d=0}^{D} \frac{1}{d!} \langle X^1, X^2 \rangle^d\right], \tag{5}$$
where $X^1, X^2$ are drawn independently from $\mathcal{P}_n$.
Note that (5) can be written as $\mathbb{E}_{X^1, X^2}[\exp^{\le D}(\langle X^1, X^2 \rangle)]$, where $\exp^{\le D}(t)$ denotes the degree-$D$ truncation of the Taylor series of $\exp(t)$. This can be seen as a natural low-degree analogue of the full second moment (4). However, the low-degree Taylor series truncation in $\exp^{\le D}$ is conceptually distinct from the low-degree projection in $L_n^{\le D}$, because the latter corresponds to truncation in the Hermite orthogonal polynomial basis (see below), while the former corresponds to truncation in the monomial basis.
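We first give a brief and informal review of the multivariate Hermite polynomials (see Appendix B or the reference text [Sze39] for further information). The univariate Hermite polynomials are a sequence $h_k(x) \in \mathbb{R}[x]$ for $k \ge 0$, with $\deg h_k = k$. (Footnote 8: We will not actually use the definition of the univariate Hermite polynomials (although we will use certain properties that they satisfy as needed), but the definition is included for completeness in Appendix B.) They may be normalized as $\hat{h}_k = h_k / \sqrt{k!}$, and with this normalization satisfy the orthonormality conditions
$$\mathbb{E}_{y \sim \mathcal{N}(0, 1)}\left[\hat{h}_k(y) \hat{h}_\ell(y)\right] = \delta_{k\ell}. \tag{6}$$
The multivariate Hermite polynomials in $N$ variables are indexed by $\alpha \in \mathbb{N}^N$, and are merely products of the $h_k$: $H_\alpha(x) = \prod_{i=1}^N h_{\alpha_i}(x_i)$. They also admit a normalized variant $\hat{H}_\alpha(x) = \prod_{i=1}^N \hat{h}_{\alpha_i}(x_i)$, and with this normalization satisfy the orthonormality conditions
$$\mathbb{E}_{Y \sim \mathcal{N}(0, I_N)}\left[\hat{H}_\alpha(Y) \hat{H}_\beta(Y)\right] = \delta_{\alpha\beta},$$
which may be inferred directly from (6).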
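The orthogonality relations can be verified exactly in small cases. The sketch below builds the univariate (probabilists') Hermite polynomials from the standard three-term recurrence $h_{k+1}(x) = x h_k(x) - k h_{k-1}(x)$ with $h_0 = 1$, $h_1 = x$, which is assumed here to agree with the definition in Appendix B, and evaluates Gaussian inner products exactly from the moments of $\mathcal{N}(0, 1)$. The function names are illustrative.

```python
import math


def hermite(k_max):
    """Coefficient lists (lowest degree first) of h_0, ..., h_{k_max}.

    Uses the recurrence h_{k+1}(x) = x*h_k(x) - k*h_{k-1}(x), h_0 = 1, h_1 = x.
    """
    polys = [[1], [0, 1]]
    for k in range(1, k_max):
        shifted = [0] + polys[k]            # coefficients of x * h_k
        lower = polys[k - 1] + [0, 0]       # h_{k-1}, padded to the same length
        polys.append([s - k * l for s, l in zip(shifted, lower)])
    return polys[: k_max + 1]


def gaussian_moment(m):
    """E[y^m] for y ~ N(0, 1): zero for odd m, (m - 1)!! for even m."""
    if m % 2 == 1:
        return 0
    return math.prod(range(1, m, 2))


def inner_product(f, g):
    """E[f(y) g(y)] for y ~ N(0, 1), with f and g given as coefficient lists."""
    return sum(
        a * b * gaussian_moment(i + j)
        for i, a in enumerate(f)
        for j, b in enumerate(g)
    )


h = hermite(6)
gram = [[inner_product(h[k], h[l]) for l in range(7)] for k in range(7)]
for row in gram:
    print(row)
```

The Gram matrix comes out diagonal with entries $k!$, so dividing $h_k$ by $\sqrt{k!}$ gives an orthonormal family.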
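The collection of those $\hat{H}_\alpha$ for which $|\alpha| := \sum_{i=1}^N \alpha_i \le D$ form an orthonormal basis for $\mathcal{V}_n^{\le D}$ (which, recall, is the subspace of polynomials of degree $\le D$). Thus we may expand
$$L_n^{\le D}(Y) = \sum_{|\alpha| \le D} \langle L_n, \hat{H}_\alpha \rangle \hat{H}_\alpha(Y) = \sum_{|\alpha| \le D} \frac{1}{\prod_{i=1}^N \alpha_i!} \langle L_n, H_\alpha \rangle H_\alpha(Y),$$
and in particular we have
$$\|L_n^{\le D}\|^2 = \sum_{|\alpha| \le D} \frac{1}{\prod_{i=1}^N \alpha_i!} \langle L_n, H_\alpha \rangle^2.$$
Our main task is then to compute quantities of the form $\langle L_n, H_\alpha \rangle$. Note that these can be expressed either as $\mathbb{E}_{Y \sim \mathbb{Q}_n}[L_n(Y) H_\alpha(Y)]$ or $\mathbb{E}_{Y \sim \mathbb{P}_n}[H_\alpha(Y)]$. We will give three techniques for carrying out this calculation, each depending on a different identity satisfied by the Hermite polynomials. Each will give a proof of the following remarkable formula, which shows that the quantities $\langle L_n, H_\alpha \rangle$ are simply the moments of $\mathcal{P}_n$.
For any $\alpha \in \mathbb{N}^N$,
$$\langle L_n, H_\alpha \rangle = \mathbb{E}_{X \sim \mathcal{P}_n}\left[\prod_{i=1}^N X_i^{\alpha_i}\right].$$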
Proof of Theorem 2.3.
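One way the moment formula leads to the expression for the norm of the LDLR can be sketched as follows. This is a reconstruction of the standard argument, written under the assumption that the Hermite expansion of the squared norm and the moment formula for $\langle L_n, H_\alpha \rangle$ read as stated above; it is not necessarily the authors' exact derivation. The steps are: substitute the moments, write the squared expectation using two independent replicas, group the multi-indices by total degree $d = |\alpha|$, and apply the multinomial theorem.

```latex
\begin{align*}
\|L_n^{\le D}\|^2
  &= \sum_{|\alpha| \le D} \frac{1}{\prod_{i=1}^N \alpha_i!}
     \Big( \mathbb{E}_{X \sim \mathcal{P}_n} \prod_{i=1}^N X_i^{\alpha_i} \Big)^2 \\
  &= \mathbb{E}_{X^1, X^2 \sim \mathcal{P}_n}
     \sum_{|\alpha| \le D} \frac{1}{\prod_{i=1}^N \alpha_i!}
     \prod_{i=1}^N (X^1_i X^2_i)^{\alpha_i} \\
  &= \mathbb{E}_{X^1, X^2 \sim \mathcal{P}_n}
     \sum_{d=0}^{D} \frac{1}{d!} \sum_{|\alpha| = d}
     \binom{d}{\alpha_1, \dots, \alpha_N}
     \prod_{i=1}^N (X^1_i X^2_i)^{\alpha_i} \\
  &= \mathbb{E}_{X^1, X^2 \sim \mathcal{P}_n}
     \sum_{d=0}^{D} \frac{1}{d!} \langle X^1, X^2 \rangle^d .
\end{align*}
```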
We now proceed to the three proofs of Proposition 2.3. For the sake of brevity, we omit here the (standard) proofs of the three Hermite polynomial identities these proofs are based on, but the interested reader may review those proofs in Appendix B.
2.3.1 Proof 1: Hermite Translation Identity
The first (and perhaps simplest) approach to proving Proposition 2.3 uses the following formula for the expectation of a Hermite polynomial evaluated on a Gaussian random variable of non-zero mean.
For any $k \ge 0$ and $\mu \in \mathbb{R}$,
$$\mathbb{E}_{y \sim \mathcal{N}(\mu, 1)}[h_k(y)] = \mu^k.$$
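The translation identity states that the expectation of the $k$-th Hermite polynomial under a unit-variance Gaussian with mean $\mu$ equals $\mu^k$. It can be confirmed exactly for small $k$ by expanding $h_k(\mu + z)$ with the binomial theorem and using the moments of a standard Gaussian $z$. The sketch below does this in rational arithmetic, again assuming the probabilists' Hermite polynomials given by the recurrence $h_{k+1}(x) = x h_k(x) - k h_{k-1}(x)$; the function names and the test value of $\mu$ are illustrative.

```python
import math
from fractions import Fraction


def hermite_coeffs(k):
    """Coefficients (lowest degree first) of h_k via h_{j+1} = x*h_j - j*h_{j-1}."""
    prev, cur = [1], [0, 1]
    if k == 0:
        return prev
    for j in range(1, k):
        nxt = [s - j * l for s, l in zip([0] + cur, prev + [0, 0])]
        prev, cur = cur, nxt
    return cur


def gaussian_moment(m):
    """E[z^m] for z ~ N(0, 1): zero for odd m, (m - 1)!! for even m."""
    if m % 2 == 1:
        return 0
    return math.prod(range(1, m, 2))


def shifted_gaussian_mean(coeffs, mu):
    """E[f(mu + z)] for z ~ N(0, 1), expanding (mu + z)^j by the binomial theorem."""
    total = Fraction(0)
    for j, c in enumerate(coeffs):
        for i in range(j + 1):
            total += c * math.comb(j, i) * mu ** (j - i) * gaussian_moment(i)
    return total


mu = Fraction(3, 2)   # arbitrary test value
for k in range(9):
    print(k, shifted_gaussian_mean(hermite_coeffs(k), mu), mu ** k)
```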
2.3.2 Proof 2: Gaussian Integration by Parts
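The second approach to proving Proposition 2.3 uses the following generalization of a well-known integration by parts formula for Gaussian random variables. If $f : \mathbb{R} \to \mathbb{R}$ is $k$ times continuously differentiable and $f(y)$ and its first $k$ derivatives are bounded by $O(\exp(|y|^\alpha))$ for some $\alpha \in (0, 2)$, then
$$\mathbb{E}_{y \sim \mathcal{N}(0, 1)}[h_k(y) f(y)] = \mathbb{E}_{y \sim \mathcal{N}(0, 1)}\left[\frac{d^k f}{dy^k}(y)\right].$$
(The better-known case is $k = 1$, where one may substitute $h_1(x) = x$.)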
2.3.3 Proof 3: Hermite Generating Function
Finally, the third approach to proving Proposition 2.3 uses the following generating function for the Hermite polynomials.
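For any $x, y \in \mathbb{R}$,
$$\exp\left(xy - \frac{1}{2}x^2\right) = \sum_{k=0}^{\infty} \frac{1}{k!} x^k h_k(y).$$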
Proof 3 of Proposition 2.3.
Now that we have the simple form (5) for the norm of the LDLR, it remains to investigate its convergence or divergence (as $n \to \infty$) using problem-specific statistics of $X$. In the next section we give some examples of how to carry out this analysis.
3 Examples: Spiked Matrix and Tensor Models
In this section, we perform the low-degree analysis for a particular important case of the additive Gaussian model: the order-$p$ spiked Gaussian tensor model, also referred to as the tensor PCA (principal component analysis) problem. This model was introduced by [RM14] and has received much attention recently. The special case $p = 2$ of the spiked tensor model is the so-called spiked Wigner matrix model which has been widely studied in random matrix theory, statistics, information theory, and statistical physics; see [Mio18] for a survey.
In concordance with prior work, our low-degree analysis of these models illustrates two representative phenomena: the spiked Wigner matrix model exhibits a sharp computational phase transition, whereas the spiked tensor model (with $p \ge 3$) has a “soft” tradeoff between statistical power and runtime which extends through the subexponential-time regime. A low-degree analysis of the spiked tensor model has been carried out previously in [HKP17, Hop18]; here we give a sharper analysis that more precisely captures the power of subexponential-time algorithms.
In Section 3.1, we carry out our low-degree analysis of the spiked tensor model. In Section 3.2, we devote additional attention to the special case of the spiked Wigner model, giving a refined analysis that captures its sharp phase transition and applies to a variety of distributions of “spikes.”
3.1 The Spiked Tensor Model
We begin by defining the model.
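An $n$-dimensional order-$p$ tensor $T \in (\mathbb{R}^n)^{\otimes p}$ is a multi-dimensional array with $p$ dimensions each of length $n$, with entries denoted $T_{i_1, \ldots, i_p}$ where $i_j \in [n]$. For a vector $x \in \mathbb{R}^n$, the rank-one tensor $x^{\otimes p} \in (\mathbb{R}^n)^{\otimes p}$ has entries $(x^{\otimes p})_{i_1, \ldots, i_p} = x_{i_1} x_{i_2} \cdots x_{i_p}$.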
[Spiked Tensor Model] Fix an integer $p \ge 2$. The order-$p$ spiked tensor model is the additive Gaussian noise model (Definition 2.1) with $X = \lambda x^{\otimes p}$, where $\lambda = \lambda(n) > 0$ is a signal-to-noise parameter and $x \in \mathbb{R}^n$ (the “spike”) is drawn from some probability distribution $\mathcal{X}_n$ over $\mathbb{R}^n$ (the “prior”), normalized so that $\|x\|^2 / n \to 1$ in probability as $n \to \infty$. In other words:
Under $\mathbb{P}_n$, observe $Y = \lambda x^{\otimes p} + Z$.
Under $\mathbb{Q}_n$, observe $Y = Z$.
Here, $Z$ is a tensor with i.i.d. entries distributed as $\mathcal{N}(0, 1)$. (Footnote 9: This model is equivalent to the more standard model in which the noise is symmetric with respect to permutations of the indices; see Appendix A.2.)
Throughout this section we will focus for the sake of simplicity on the Rademacher spike prior, where $x$ has i.i.d. entries $x_i \sim \mathrm{Unif}(\{\pm 1\})$. We focus on the problem of strongly distinguishing $\mathbb{P}_n$ and $\mathbb{Q}_n$ (see Definition 1.1), but, as is typical for high-dimensional problems, the problem of estimating $x$ seems to behave in essentially the same way (see Section 4.2.6).
We first state our results on the behavior of the LDLR for this model.
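Consider the order-$p$ spiked tensor model with $x$ drawn from the Rademacher prior, $x_i \sim \mathrm{Unif}(\{\pm 1\})$ i.i.d. for $i \in [n]$. Fix sequences $D = D(n)$ and $\lambda = \lambda(n)$. For constants $0 < A_p < B_p$ depending only on $p$, we have the following. (Footnote 10: Concretely, one may take and .)
If $\lambda \le A_p n^{-p/4} D^{(2-p)/4}$ for all sufficiently large $n$, then $\|L_n^{\le D}\| = O(1)$.
If $\lambda \ge B_p n^{-p/4} D^{(2-p)/4}$ and $D \le \frac{2}{p} n$ for all sufficiently large $n$, and $D = \omega(1)$, then $\|L_n^{\le D}\| = \omega(1)$.
(Here we are considering the limit $n \to \infty$ with $p$ held fixed, so $O(1)$ and $\omega(1)$ may hide constants depending on $p$.)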
Before we prove this, let us interpret its meaning. If we take degree-$D$ polynomials as a proxy for $n^{\tilde{\Theta}(D)}$-time algorithms (as discussed in Section 1.3), our calculations predict that an $n^{O(D)}$-time algorithm exists when $\lambda \gg n^{-p/4} D^{(2-p)/4}$ but not when $\lambda \ll n^{-p/4} D^{(2-p)/4}$. (Here we ignore log factors, so we use $A \ll B$ to mean $A \le B / \mathrm{polylog}(n)$.) These predictions agree precisely with the previously established statistical-versus-computational tradeoffs in the spiked tensor model! It is known that polynomial-time distinguishing algorithms exist when $\lambda \gg n^{-p/4}$ [RM14, HSS15, ADGM16, HSSS16], and sum-of-squares lower bounds suggest that there is no polynomial-time distinguishing algorithm when $\lambda \ll n^{-p/4}$ [HSS15, HKP17].
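For the Rademacher spike, the squared norm of the LDLR can be evaluated exactly at finite $n$, which gives a feel for how it responds to the signal strength and the degree. In the spiked tensor model the replica overlap is $\langle X^1, X^2 \rangle = \lambda^2 \langle x^1, x^2 \rangle^p$, and $\langle x^1, x^2 \rangle$ is a sum of $n$ independent signs, so formula (5) reduces to a sum over moments of a shifted binomial variable. The sketch below is an illustration under these assumptions; the function names, the choice $n = 60$, $p = 3$, and the constants multiplying $n^{-p/4}$ are arbitrary, and a finite-$n$ table can only hint at the asymptotic statement.

```python
import math
from fractions import Fraction


def rademacher_overlap_moment(n, m):
    """Exact E[<x1, x2>^m] for independent x1, x2 uniform on {-1, +1}^n."""
    # <x1, x2> has the law of n - 2*B with B ~ Binomial(n, 1/2).
    total = sum(math.comb(n, b) * (n - 2 * b) ** m for b in range(n + 1))
    return Fraction(total, 2 ** n)


def ldlr_norm_squared(n, p, lam, D):
    """sum_{d=0}^{D} lam^(2d)/d! * E[<x1, x2>^(p*d)] for the Rademacher spike."""
    return sum(
        lam ** (2 * d) / math.factorial(d) * float(rademacher_overlap_moment(n, p * d))
        for d in range(D + 1)
    )


n, p = 60, 3
for c in (0.3, 1.0):                  # lam = c * n^(-p/4), with c chosen arbitrarily
    lam = c * n ** (-p / 4)
    values = [ldlr_norm_squared(n, p, lam, D) for D in (2, 6, 10)]
    print(c, values)
```

For the smaller constant the values barely move as $D$ grows over this range, while for the larger constant they increase rapidly, in line with the qualitative picture described above.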
Furthermore, one can study the power of subexponential-time algorithms, i.e., algorithms of runtime $\exp(n^{\delta + o(1)})$ for a constant $\delta \in (0, 1)$. Such algorithms are known to exist when