Many high-dimensional statistical inference problems exhibit a gap between what can be achieved by the optimal statistical procedure and what can be achieved by the best known polynomial-time algorithms. As a canonical example, finding a planted $k$-clique in a $G(n, 1/2)$ Erdős–Rényi random graph is statistically possible when $k$ exceeds $2\log_2 n$ (via exhaustive search) but all known polynomial-time algorithms require $k = \Omega(\sqrt{n})$, giving rise to a large conjectured “possible but hard” regime in between [jerrum-clique, alon-clique, sos-clique]. Such so-called statistical-to-computational gaps are prevalent in many other key learning problems including sparse PCA (principal component analysis) [BR-reduction], community detection [sbm-hard], tensor PCA [MR-tensor-pca], and random constraint satisfaction problems [sos-csp], just to name a few. Unfortunately, since these are average-case problems where the input is drawn from a specific distribution, current techniques appear unable to prove computational hardness in these conjectured “hard” regimes based on standard complexity assumptions such as $\mathsf{P} \neq \mathsf{NP}$.
Still, a number of different methods have emerged for understanding these gaps and providing “rigorous evidence” for computational hardness of statistical problems. Many involve studying the power of restricted classes of algorithms that are tractable to analyze, including statistical query (SQ) algorithms [kearnsSQ1998, sq-clique], the sum-of-squares (SoS) hierarchy [parrilo-sos, lasserre-sos], low-degree polynomial algorithms [HS-bayesian, hopkins2017power, hopkins2018thesis], approximate message passing [amp], MCMC methods [jerrum-clique], and various notions of “local” algorithms [GS-local, zadikCOLT17, alg-tensor-pca]. As it turns out, the best known poly-time algorithms for a surprisingly wide range of statistical problems actually do belong to these restricted classes. As such, the above frameworks have been very successful at providing concrete explanations for statistical-to-computational gaps and at allowing researchers to predict the location of the “hard” regime in new problems based on where such restricted classes of algorithms succeed or fail.
However, there are notorious exceptions where the above predictions turn out to be false. For example, the problem of learning parity (even in the absence of noise) is hard for SQ, SoS, and low-degree polynomials [kearnsSQ1998, grigoriev2001linear, schoenebeck2008linear], yet it admits a simple poly-time solution via Gaussian elimination. To the best of our knowledge, prior to the present work, learning parities with no noise and other similar noiseless inference models based on linear equations, such as random 3-XOR-SAT, were the only examples where some polynomial-time method (which appears to always be Gaussian elimination) works while the SoS hierarchy and low-degree methods have been proven to fail.
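To make the parity example concrete, the following is a minimal sketch of noiseless parity learning by Gaussian elimination over GF(2). The function names (`parity`, `learn_parity`), the bitmask encoding, and the choice of dimension and sample count are illustrative assumptions rather than anything taken from the cited works.

```python
import random

def parity(x):
    """Parity (mod 2) of the number of set bits in the integer x."""
    return bin(x).count("1") & 1

def learn_parity(samples, labels, d):
    """Solve <a_i, s> = y_i (mod 2) by Gaussian elimination over GF(2).

    Each sample a_i is a d-bit integer. Returns one consistent secret s
    (free variables set to 0), or None if the system is inconsistent.
    """
    # Store each equation as a bitmask; bit d holds the label.
    eqs = [a | (y << d) for a, y in zip(samples, labels)]
    pivots = []  # (column, fully reduced equation)
    for col in range(d):
        idx = next((i for i, e in enumerate(eqs) if (e >> col) & 1), None)
        if idx is None:
            continue  # no pivot in this column: free variable
        piv = eqs.pop(idx)
        # Eliminate this column from all other equations.
        eqs = [e ^ piv if (e >> col) & 1 else e for e in eqs]
        pivots = [(c, p ^ piv if (p >> col) & 1 else p) for c, p in pivots]
        pivots.append((col, piv))
    if any(e == (1 << d) for e in eqs):
        return None  # a leftover equation reads 0 = 1
    s = 0
    for col, p in pivots:
        if (p >> d) & 1:
            s |= 1 << col
    return s

# Illustrative instance: d = 16 bits, d + 40 uniformly random samples.
rng = random.Random(0)
d = 16
secret = rng.getrandbits(d)
samples = [rng.getrandbits(d) for _ in range(d + 40)]
labels = [parity(a & secret) for a in samples]
recovered = learn_parity(samples, labels, d)
```

The procedure relies on every label being exact: flipping even a few labels typically makes the linear system inconsistent, which is the brittleness discussed in the text.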
In this work, we identify a new class of problems where the SoS hierarchy and low-degree lower bounds are provably bypassed by a polynomial-time algorithm. This class of problems is not based on linear equations, and the suggested optimal algorithm is not based on Gaussian elimination but on lattice basis reduction methods, which specifically seek to find a “short” vector in a lattice. Similar lattice-based methods have in recent years been able to “close” various statistical-to-computational gaps [NEURIPS2018_ccc0aa1b, andoni2017correspondence, song2021cryptographic], yet this is the first example we are aware of in which they are able to “close a gap” where the SoS hierarchy is known to fail to do so.
The problems we analyze can be motivated from several angles in theoretical computer science and machine learning, and can be thought of as important special cases of well-studied problems such as Planted Vector in a Subspace, Gaussian Clustering, and Non-Gaussian Component Analysis. While our result is more general, one specific problem that we solve is the following: for a hidden unit vector $u \in S^{d-1}$, we observe $n$ independent samples of the form
$$z_i \sim N(x_i u,\; I_d - u u^\top), \qquad i = 1, \dots, n, \tag{1}$$
where $x_i$ are i.i.d. uniform $\pm 1$, and the goal is to recover the hidden signs $x_i$ and the hidden direction $u$ (up to a global sign flip). Prior to our work, the best known poly-time algorithm required $n \geq \tilde{\Omega}(d^2)$ samples [planted-vec-ld] (here and throughout, the notation $\tilde{\Omega}$ hides logarithmic factors). Furthermore, this was believed to be unimprovable due to lower bounds against SoS algorithms and low-degree polynomials [lifting-sos, affine-planes, planted-vec-ld, unknown-cov]. Nevertheless, we give a poly-time algorithm under the much weaker assumption $n \geq d+1$. In fact, this sample complexity is essentially optimal for the previous recovery problem, as shown by our information-theoretic lower bound (see Section 4). Our result makes use of the Lenstra–Lenstra–Lovász (LLL) algorithm for lattice basis reduction [lenstra1982factoring], a powerful algorithmic paradigm that has seen recent, arguably surprising, success in solving to information-theoretic optimality a few different “noiseless” statistical inference problems, some even in regimes where it was conjectured that no polynomial-time method works: Discrete Regression [NEURIPS2018_ccc0aa1b, LLL_TIT], Phase Retrieval [andoni2017correspondence, song2021cryptographic], Cosine Neuron Learning [song2021cryptographic], and Continuous Learning with Errors [bruna2020continuous, song2021cryptographic] (in the exponentially-small noise regime). Yet, to the best of our knowledge, this work is the first to establish the success of an LLL-based method in a regime where low-degree and SoS lower bounds both suggest computational intractability. This raises the question of whether LLL can “close” any other conjectured statistical-to-computational gaps. We believe that understanding the power and limitations of the LLL approach is an important direction for future research towards understanding the computational complexity of inference.
We also point out one weakness of the LLL approach: our algorithm is brittle to the specifics of the model, and relies on the observations being “noiseless” in some sense. For instance, our algorithm only solves the model in (1) because the label values lie exactly in $\{-1, +1\}$ and the covariance has quadratic form along the hidden direction exactly equal to zero (or, similarly to other LLL applications [NEURIPS2018_ccc0aa1b], of exponentially small magnitude). If we were to perturb the model slightly, say by adding an inverse-polynomial amount of noise to the labels, our algorithm would break down because of the known non-robustness properties of the LLL algorithm. In fact, a noisy version (with inverse-polynomial noise) of one problem that we solve is the homogeneous Continuous Learning with Errors problem (hCLWE), which is provably hard based on the standard assumption [micciancio2009lattice, Conjecture 1.2] from lattice-based cryptography that certain worst-case lattice problems are hard against quantum algorithms [bruna2020continuous]. All existing algorithms for statistical problems based on LLL suffer from the same lack of robustness. In this sense, there is a strong analogy between LLL and the other known successful polynomial-time method for noiseless inference, namely the Gaussian elimination approach to learning parity: both exploit very precise algebraic structure in the problem and break down in the presence of even a small (inverse-polynomial) amount of noise.
As discussed above, our results “break” certain computational lower bounds based on SoS and low-degree polynomials. Still, we believe that these types of lower bounds are interesting and meaningful, but some care should be taken when interpreting them. It is in fact already well-established that such lower bounds can sometimes be beaten on “noiseless” problems (a key example being Gaussian elimination). However, there are some subtleties in how “noiseless” should be defined here, and whether fundamental problems with statistical-to-computational gaps such as planted clique—which has implications for many other inference problems via average-case reductions (e.g. [BR-reduction, MW-reduction, HWX-pds, BBH-reduction])—should be considered “noiseless.” We discuss these issues further in Section 1.3.
1.1 Main contribution
The main inference setting we consider in this work is as follows. Let be the growing ambient dimension the data lives in, be the number of samples, and be what we coin as the spacing parameter. We consider arbitrary labels on the one-dimensional lattice: where for , under the weak constraints and . We also consider an arbitrary direction and an unknown covariance matrix with and “reasonable” spectrum, in the sense that
does not have exponentially small or large eigenvalues in directions orthogonal to the hidden direction (see Assumption 3). In particular, the covariance choice of (1) is permissible per our assumptions but much more generality is possible. Our goal is to learn both the labels and the hidden direction (up to a global sign flip applied to both the labels and the direction) from independent samples
Exact recovery with polynomial-time algorithm.
Our main algorithmic result is informally stated as follows.
[Informal statement of Theorem 3.1] Under the above setting, if the number of samples is $d+1$, where $d$ is the ambient dimension, then there is an LLL-based algorithm (Algorithm 1) which terminates in polynomial time and outputs exactly, up to a global sign flip, both the correct labels and the correct hidden direction with high probability.
Footnote 3: During the final stages of this project, Ilias Diakonikolas and Daniel Kane [diakonikolas2021personal] shared with the authors through personal communication that they have independently obtained a similar result to ours, using similar techniques.
Table 1: Problems | Low-degree LB | SoS LB | Previous Best | Our Results
Now, as explained in Section 1, our theorem has algorithmic implications for three previously studied problems: Planted Vector in a Subspace, Gaussian Clustering, and hCLWE, which is an instance of Non-Gaussian Component Analysis. In all three settings, the previous best algorithms required samples (formally this is true for the dense case of the planted vector in a subspace setting, but samples are required for many sparse settings as well). As explained, in many of these cases lower bounds have been established against the classes of low-degree methods and the SoS hierarchy. In this work, we show that LLL can surpass these lower bounds and succeed with samples in all three problems. We provide more context in Section 1.2 and exact statements in Section 3.2. A high-level description of our contributions can be found in Table 1.
Information-theoretic lower bound for exact recovery of the hidden direction.
One can naturally wonder whether for our setting there is something even better than LLL that can be achieved with bounded computational resources or even unbounded ones. We complement the previous result with an information-theoretic lower bound showing that any estimation procedure cannot succeed at exact recovery of the hidden vector using at most samples. This means LLL is information-theoretically optimal for recovering the hidden direction up to at most one additional sample.
[Informal statement of Theorem 4.1] Under the setting of (1), if there is no estimation procedure that can guarantee with probability larger than exact recovery of the hidden direction up to a global sign flip.
1.2 Relation to prior work
Non-Gaussian component analysis.
Non-Gaussian component analysis (NGCA) is the problem of identifying a non-Gaussian direction in random data. Concretely, this is a generalization of (1) where the labels are drawn i.i.d. from some distribution on the real line. When this distribution is the Rademacher distribution, we recover the problem in (1) as a special case.
The NGCA problem was first introduced in [blanchard2006search], inspiring a long line of algorithmic results [kawanabe2006estimating, sugiyama2008approximating, diederichs2010sparse, diederichs2013sparse, bean2014non, sasaki2016sufficient, nordhausen2017asymptotic, vempala2011structure, tan2018polynomial, goyal2019non]. This problem has also played a key role in many hardness results in the statistical query (SQ) model: starting from the work of [sq-robust], various special cases of NGCA have provided SQ-hard instances for a variety of learning tasks [sq-robust, diakonikolas2018list, diakonikolas2019efficient, diakonikolas2021statistical, bubeck2019adversarial, goel2020statistical, diakonikolas2020near, diakonikolas2021optimality, goel2020superpolynomial, diakonikolas2020algorithms, diakonikolas2020hardness, spiked-transport]. More recently, a special case of NGCA has also been shown to be hard under the widely-believed assumption that certain worst-case lattice problems are hard [bruna2020continuous]. NGCA is a special case of the more general spiked transport model [spiked-transport].
Our main result (Theorem 1.1) solves NGCA with only samples in the case where is supported arbitrarily on an exponentially large subset of a 1-dimensional discrete lattice. While this case is essentially the noiseless, equispaced version of the “parallel Gaussian pancakes” problem, which was first introduced and shown to be SQ-hard by [sq-robust], our result does not bypass known SQ lower bounds [sq-robust, bruna2020continuous] as the hard construction involves Gaussian pancakes with non-negligible “thickness”.
Planted vector in a subspace.
A line of prior work has studied the problem of finding a “structured” vector planted in an otherwise random -dimensional subspace of , where . A variety of algorithms have been proposed and analyzed in the case where the planted vector is sparse [demanet2014scaling, barak2014rounding, qu2016finding, sos-spectral, qu2020finding]. One canonical Gaussian generative model for this problem turns out to be equivalent to NGCA (with the same parameters), where the entrywise distribution of the planted vector corresponds to the non-Gaussian distribution in NGCA. More specifically, the subspace in question is the column span of the matrix whose rows are the NGCA samples; see e.g. Lemma 4.21 of [planted-vec-ld] for the formal equivalence.
Motivated by connections to the Sherrington–Kirkpatrick model from spin glass theory, the setting of a planted hypercube (i.e. -valued) vector in a subspace has received recent attention; this is equivalent to the problem in (1). Specifically, sum-of-squares lower bounds have been given for refuting the existence of a hypercube vector in (or close to) a purely random subspace. The first such result [lifting-sos] shows failure of degree-4 SoS when . A later improvement [affine-planes] shows failure of degree- SoS when and conjectures that this condition can be improved to (see Conjectures 8.1 and 8.2 of [affine-planes]).
The state-of-the-art algorithmic result for recovering a planted vector in a subspace is [planted-vec-ld], which builds on [sos-spectral] and in particular analyzes a spectral method proposed by [sos-spectral]. For a planted -sparse Rademacher vector ( entries are nonzero, and these nonzero entries are ), this spectral method succeeds at recovering the vector provided [planted-vec-ld]. On the other hand, if then all low-degree polynomial algorithms fail, implying in particular that all spectral methods (in a large class) fail [planted-vec-ld]. These results cover the special case of a planted hypercube vector (), in which case there is a spectral method that succeeds when , and failure of low-degree and spectral methods when .
The above results suggest inherent computational hardness of planted sparse Rademacher vector when . However, perhaps surprisingly, our main result (Theorem 1.1) implies that this problem can actually be solved via LLL in polynomial time whenever . Thus, LLL beats all low-degree algorithms whenever . However, our algorithm requires the entries of the planted vector to exactly lie in , whereas the spectral method of [sos-spectral, planted-vec-ld] succeeds under more general conditions.
We remark that the planted hypercube vector problem is closely related to the negatively-spiked Wishart model with a hypercube spike, which can be thought of as a model for generating the orthogonal complement of the subspace. The work of [sk-cert] gives low-degree lower bounds for this negative Wishart problem and conjectures hardness when (Conjecture 3.1). However, our results do not refute this conjecture because the conjecture is for a slightly noisy version of the problem (since the SNR parameter is taken to be strictly greater than ).
Our model (1) is an instance of a broader clustering problem (2) under Gaussian mixtures. In the binary case, it consists of i.i.d. samples of the Gaussian mixture , and , with unknown mean and covariance . The goal of clustering is to infer the mixture variables from the observations . Clustering algorithms have been analysed extensively, both from the statistical and computational perspective.
The statistical performance is driven by the signal-to-noise ratio, in the sense that the error rate for recovering the mixture labels is [friedman1989regularized]. Exact recovery of the vector of labels is thus possible only when .
Recently, [unknown-cov] showed that the maximum likelihood estimator (MLE) for the missing labels corresponds to a Max-Cut problem, which recovers the solution when . Moreover, the authors argued that while SNR drives the inherent statistical difficulty of the estimation problem, a relaxed quantity presumably controls the computational difficulty. In particular, the largest such gap is attained in the covariance choice of (1), for which while . In this regime, they identified a gap between the statistical and computational performance of multiple existing algorithms, raising the crucial question of whether such guarantees can be obtained using polynomial-time algorithms. Several previous works [brubaker2008isotropic, moitrav2010mixture, bakshi2020robustly, cai2019chime, flammarion2017robust] introduce algorithms that either require larger sample complexity, or have non-optimal error rates, for instance based on k-means relaxations [royer2017adaptive, mixon2017clustering, giraud2019partial, li2020birds]. Leveraging existing SoS lower bounds on the associated Max-Cut problem (see Section 1.3), [unknown-cov] suggest a statistical-to-computational gap for exact recovery in the binary Gaussian mixture. Our main result (Theorem 1.1) implies that this problem can be solved via the LLL basis reduction method in polynomial time whenever . Thus in the present work, we refute this conjecture for under a weak “niceness” assumption on the covariance matrix .
LLL-based statistical recovery.
Our algorithm is based on the breakthrough use of LLL for solving average-case subset sum problems in polynomial time, specifically the works of [Lagarias85] and [FriezeSubset]. In these works, it is established that while the (promise) Subset-Sum problem is NP-hard, for some integer-valued distributions on the input weights it becomes polynomial-time solvable by applying the LLL basis reduction algorithm on a carefully designed lattice. Building on these ideas, [NEURIPS2018_ccc0aa1b, LLL_TIT] proposed a new algorithm for noiseless discrete regression and discrete phase retrieval which provably solves these problems using only one sample, surpassing previous local-search lower bounds based on the so-called Overlap Gap Property [zadikCOLT17]. Again using the Subset-Sum ideas, the LLL approach has also “closed” the gap for noiseless phase retrieval [andoni2017correspondence, song2021cryptographic], which was conjectured to be hard because of the failure of approximate message passing in this regime [maillard2020phase]. Furthermore, for the problem of noiseless Continuous LWE (CLWE), the LLL algorithm has been shown to succeed with samples in [bruna2020continuous], and later in [song2021cryptographic] with the information-theoretically optimal samples.
Our work adds a perhaps important new conceptual angle to the power of LLL for noiseless inference. A common feature of all the above inference models where LLL has been successfully applied is that they fall into the class of generalized linear models (GLMs). A GLM is generally defined as follows: for some hidden direction and “activation” function, one observes i.i.d. samples of the form where and are i.i.d. random variables. Our work shows how to successfully apply LLL and achieve statistically optimal performance for the clustering setting (2), which importantly does not admit a GLM formulation. We consider this a potentially interesting conceptual contribution of the present work, since many “hard” inference settings, such as the planted clique model, also do not belong in the class of GLMs.
1.3 “Noiseless” problems and implications for SoS/low-degree lower bounds
SoS and low-degree lower bounds.
The sum-of-squares (SoS) hierarchy [parrilo-sos, lasserre-sos, sos-csp, sos-clique] (see [barak2016proofs, sos-survey, fleming2019semialgebraic] for a survey) and low-degree polynomials [HS-bayesian, hopkins2017power, hopkins2018thesis] (see [kunisky2019notes] for a survey) are two restricted classes of algorithms that are often studied in the context of statistical-to-computational gaps. These are not the only two such frameworks, but we will focus on these two because our result “breaks” lower bounds in these two frameworks. SoS is a powerful hierarchy of semidefinite programming relaxations. Low-degree polynomial algorithms are simply multivariate polynomials in the entries of the input, of degree logarithmic in the input dimension; notably, these can capture all spectral methods (subject to some technical conditions), i.e., methods based on the leading eigenvalue/eigenvector of some matrix constructed from the input (see Theorem 4.4 of [kunisky2019notes]). Both SoS and low-degree polynomials have been widely successful at obtaining the best known algorithms for a wide variety of high-dimensional “planted” problems, where the goal is to recover a planted signal buried in noisy data. While there is no formal connection between SoS and low-degree algorithms, they are believed to be roughly equivalent in power [hopkins2017power]. It is often informally conjectured that SoS and/or low-degree methods are as powerful as the best poly-time algorithms for “natural” high-dimensional planted problems (nebulously defined). As a result, lower bounds against SoS and/or low-degree methods are often considered strong evidence for inherent computational hardness of statistical problems.
Issue of noise-robustness.
In light of the above, it is tempting to conjecture optimality of SoS and/or low-degree methods among all poly-time methods for a wide variety of statistical problems. While this conjecture seems to hold up for a surprisingly long and growing list of problems, there are, of course, limits to the class of problems for which this holds. As discussed previously, a well-known counterexample is the problem of learning parity (or the closely-related XOR-SAT problem), where Gaussian elimination succeeds in a regime where SoS and low-degree algorithms provably fail. This counterexample is often tossed aside by the following argument: “Gaussian elimination is a brittle algebraic algorithm that breaks down if a small amount of noise is added to the labels, whereas SoS/low-degree methods are more robust to noise and are therefore capturing the limits of poly-time robust inference, which is a more natural notion anyway. If we restrict ourselves to problems that are sufficiently noisy then SoS/low-degree methods should be optimal.” However, we note that in our setting, SoS/low-degree methods are strictly suboptimal for a problem that does have plenty of Gaussian noise; the issue is that the signal and noise have a particular joint structure that preserves certain exact algebraic relationships in the data. This raises an important question: what exactly makes a problem “noisy” or “noiseless”, and under what kinds of noise should we believe that SoS/low-degree methods are unbeatable? In the following, we describe one possible answer.
The low-degree conjecture.
The “low-degree conjecture” of Hopkins [hopkins2018thesis, Conjecture 2.2.4] formalized one class of statistical problems for which low-degree polynomials are believed to be optimal among poly-time algorithms. These are certain hypothesis testing problems where the goal is to decide whether the input was drawn from a null (i.i.d. noise) distribution or a planted distribution (containing a planted signal). In our setting, one should imagine testing between samples drawn from the model (1) and samples drawn i.i.d. from . Computational hardness of hypothesis testing generally implies hardness of the associated recovery/estimation/learning problem (which in our case is to recover and ) as in Theorem 3.1 of [planted-vec-ld]. The class of testing problems considered in Hopkins’ conjecture has two main features: first, the problem should be highly symmetric, which is typical for high-dimensional statistical problems (although Hopkins’ precise notion of symmetry does not quite hold for the problems we consider in this paper). Second, and most relevant to our discussion, the problem should be noise-tolerant. More precisely, Hopkins’ conjecture states that if low-degree polynomials fail to distinguish a null distribution from a planted distribution , then no poly-time algorithm can distinguish from a noisy version of . For our setting, the appropriate “noise operator” to apply to (which was refined in [HW-counter]) is to replace each sample by
where independently from , for an arbitrarily small constant . This has the effect of replacing with where . This noise is designed to “defeat” brittle algorithms such as Gaussian elimination, and indeed our LLL-based algorithm is also expected to be defeated by this type of noise.
To summarize, the problem we consider in this paper is not noise-tolerant in the sense of Hopkins’ conjecture because the Gaussian noise depends on the signal (specifically, there is no noise in the direction of ) whereas Hopkins posits that the noise should be oblivious to the signal. Thus, in hindsight we should perhaps not be too surprised that LLL was able to beat SoS/low-degree for this problem. In other words, our result does not falsify the low-degree conjecture or the sentiment behind it (low-degree algorithms are optimal for noisy problems), with the caveat that one must be careful about the precise meaning of “noisy.” We feel that this lesson carries an often-overlooked conceptual message that may have consequences for other fundamental statistical problems such as planted clique, which we discuss next.
The planted clique conjecture posits that there is no polynomial-time algorithm for distinguishing between a random graph and a graph with a clique planted on random vertices (by adding all edges between the clique vertices), when . The planted clique conjecture is central to the study of statistical-to-computational gaps because it has been used as a primitive to deduce computational hardness of many other problems via a web of average-case reductions (e.g. [BR-reduction, MW-reduction, HWX-pds, BBH-reduction]). A refutation of the planted clique conjecture would be a major breakthrough that could cast doubts on whether other statistical-to-computational gaps are “real” or whether the gap can be closed by a better algorithm. As a result, it is important to ask ourselves why we believe the planted clique conjecture. Aside from the fact that it has resisted all algorithmic attempts so far, the primary concrete evidence for the conjecture comes in the form of lower bounds against SoS and low-degree polynomials [sos-clique, hopkins2018thesis]. However, it is perhaps unclear whether these should really be thought of as strong evidence for inherent hardness because (like the problem we study in the paper) planted clique is not noise-tolerant in the sense of Hopkins’ conjecture (discussed above). Specifically, the natural noise operator would be to independently resample a small constant fraction of the edges, which would destroy the clique structure. In other words, the conjecture of Hopkins only implies that a noisy variant of planted clique (namely planted dense subgraph) is hard when .
While we do not have any concrete reason to believe that LLL could be used to solve planted clique, we emphasize that planted clique is in some sense a “noiseless” problem and so we do not seem to have a principled reason to conjecture its hardness based on SoS and low-degree lower bounds. On the other hand, we should perhaps be somewhat more confident in the “planted dense subgraph conjecture” because planted dense subgraph is a truly noisy problem in the sense of [hopkins2018thesis, Conjecture 2.2.4].
The key component of our algorithmic results is the LLL lattice basis reduction algorithm. The LLL algorithm receives as input linearly independent vectors and outputs an integer linear combination of them with “small” norm. Specifically, let us define the lattice generated by integer vectors as simply the set of integer linear combinations of these vectors.
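[Lattice] Given linearly independent $b_1, \dots, b_n \in \mathbb{Z}^n$, let
$$\Lambda = \Lambda(b_1, \dots, b_n) = \Big\{ \sum_{i=1}^{n} z_i b_i \;:\; z_i \in \mathbb{Z},\ i = 1, \dots, n \Big\},$$
which we refer to as the lattice generated by integer-valued $b_1, \dots, b_n$. We also refer to $(b_1, \dots, b_n)$ as an (ordered) basis for the lattice $\Lambda$.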
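The LLL algorithm solves a search problem called the approximate Shortest Vector Problem (SVP) on a lattice $\Lambda$, given a basis of it.
[Shortest Vector Problem] An instance of the algorithmic $\gamma$-approximate SVP for a lattice $\Lambda \subseteq \mathbb{Z}^n$ is as follows. Given a lattice basis $b_1, \dots, b_n \in \mathbb{Z}^n$ for the lattice $\Lambda$, find a nonzero vector $\hat{x} \in \Lambda$ such that
$$\|\hat{x}\| \leq \gamma \cdot \mu(\Lambda),$$
where $\mu(\Lambda) = \min_{x \in \Lambda,\, x \neq 0} \|x\|$.
The following theorem holds for the performance of the LLL algorithm, whose details can be found in [lenstra1982factoring] or [lovasz1986algorithmic].
[[lenstra1982factoring]] There is an algorithm (namely the LLL lattice basis reduction algorithm), which receives as input a basis for a lattice $\Lambda$ given by $b_1, \dots, b_n \in \mathbb{Z}^n$ and which
(i) returns a nonzero vector $\hat{x} \in \Lambda$ satisfying $\|\hat{x}\|_2 \leq 2^{n/2} \mu(\Lambda)$,
(ii) terminates in time polynomial in $n$ and $\log\left(\max_{i} \|b_i\|_\infty\right)$.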
In this work, we use the LLL algorithm for an integer relation detection application, a problem which we formally define below.
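[Integer relation detection] An instance of the integer relation detection problem is as follows. Given a vector $b = (b_1, \dots, b_n) \in \mathbb{R}^n$, find an $m \in \mathbb{Z}^n \setminus \{0\}$ such that $\langle b, m \rangle = \sum_{i=1}^{n} b_i m_i = 0$. In this case, $m$ is said to be an integer relation for the vector $b$.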
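As an illustration of how lattice basis reduction can be used for integer relation detection, the following is a minimal sketch on a toy integer-valued input. It uses a textbook LLL with exact rational arithmetic and Lovász parameter 3/4, applied to the standard lattice whose basis rows are a unit vector followed by a scaled coordinate of the input. The function names, the scaling constant `M`, and the example vector are illustrative assumptions; this is not Algorithm 1 of this paper, which additionally has to handle truncated real-valued inputs.

```python
from fractions import Fraction

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(B):
    """Exact Gram-Schmidt: orthogonal vectors B* and coefficients mu."""
    Bs, mu = [], [[Fraction(0)] * len(B) for _ in B]
    for i, b in enumerate(B):
        v = [Fraction(x) for x in b]
        for j in range(i):
            mu[i][j] = dot(b, Bs[j]) / dot(Bs[j], Bs[j])
            v = [vi - mu[i][j] * wj for vi, wj in zip(v, Bs[j])]
        Bs.append(v)
    return Bs, mu

def lll(B, delta=Fraction(3, 4)):
    """Textbook LLL reduction of the rows of B (linearly independent integer vectors)."""
    B = [list(b) for b in B]
    k = 1
    while k < len(B):
        # Size-reduce b_k against b_{k-1}, ..., b_0.
        for j in range(k - 1, -1, -1):
            _, mu = gram_schmidt(B)
            q = round(mu[k][j])
            if q:
                B[k] = [x - q * y for x, y in zip(B[k], B[j])]
        Bs, mu = gram_schmidt(B)
        # Lovasz condition: advance if satisfied, otherwise swap and step back.
        if dot(Bs[k], Bs[k]) >= (delta - mu[k][k - 1] ** 2) * dot(Bs[k - 1], Bs[k - 1]):
            k += 1
        else:
            B[k], B[k - 1] = B[k - 1], B[k]
            k = max(k - 1, 1)
    return B

def integer_relation(a, M=10**6):
    """Search for m != 0 with <m, a> = 0 via the lattice with rows (e_i, M * a_i)."""
    n = len(a)
    basis = [[int(i == j) for j in range(n)] + [M * a[i]] for i in range(n)]
    v = lll(basis)[0]
    return v[:n], v[n]  # candidate relation and M * <m, a>

# Toy input with the planted relation a[0] + 2 * a[1] - a[2] = 0.
a = [314159, 271828, 314159 + 2 * 271828]
m, residual = integer_relation(a)
```

Any lattice vector here has last coordinate equal to `M` times the inner product of its first coordinates with `a`, so a vector shorter than `M` must encode an exact relation. Since the planted relation gives a lattice vector of norm $\sqrt{6}$ and LLL returns a vector within a factor $2^{(n-1)/2}$ of the shortest one, a large `M` forces the first reduced vector to be a relation in this toy instance.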
To define our class of problems, we make use of the following two standard objects. [Bernoulli–Rademacher vector] We say that a random vector is a Bernoulli–Rademacher vector with parameter and write , if the entries of are i.i.d. with
[Discrete Gaussian on ] Let be real numbers. We define the discrete Gaussian distribution with width supported on the scaled integer lattice to be the distribution whose probability mass function at each is proportional to .
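For intuition, the following sketch tabulates a discrete Gaussian on a scaled integer lattice numerically. It assumes the weight convention $\exp(-x^2/(2s^2))$ for width $s$; other works use $\exp(-\pi x^2/s^2)$, and the convention should be matched to the definition above before relying on any numbers. The names `discrete_gaussian_pmf`, `tail_mass`, and the parameter `cutoff` (the number of widths after which the support is truncated) are hypothetical.

```python
import math

def discrete_gaussian_pmf(c, s, cutoff=12):
    """Approximate pmf on {c*k : |k| <= K}, with weights exp(-x^2 / (2 s^2)).

    The infinite support c*Z is truncated at roughly `cutoff` widths,
    beyond which the remaining mass is negligible in floating point.
    """
    K = int(math.ceil(cutoff * s / c))
    weights = {k: math.exp(-((c * k) ** 2) / (2 * s * s)) for k in range(-K, K + 1)}
    Z = sum(weights.values())  # normalization constant
    return {c * k: w / Z for k, w in weights.items()}

def tail_mass(pmf, t):
    """Probability that |x| > t under the tabulated pmf."""
    return sum(p for x, p in pmf.items() if abs(x) > t)

# Illustrative parameters: spacing c = 0.5, width s = 1.
pmf = discrete_gaussian_pmf(c=0.5, s=1.0)
```

Evaluating `tail_mass(pmf, t)` for increasing `t` shows the rapid, Gaussian-like decay that the tail bound below quantifies.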
The following tail bound on the discrete Gaussian will be useful in Section 3.2, in which we reduce the hCLWE model (also known as “Gaussian pancakes”) (Model 3.2) to our general model (Model 3) which is the central problem for our LLL-based algorithm.
[Adapted from [song2021cryptographic, Claim I.6]] Let be a real number, and let be the discrete Gaussian of width 1 supported on such that the probability mass function at is given by
where is the normalization constant. Then, the following bound holds.
In particular, for ,
The first inequality in (4) follows from [song2021cryptographic, Claim 1.6]. The fact that follows from our assumption that . Since , we have
Now notice that the terms of the series are decaying in a geometric fashion for since
It follows that
3 The LLL-based algorithm and its implications
We now present the main contribution of this work, which is an LLL-based polynomial-time algorithm that provably solves the general problem defined in Model 3 with access to only samples.
We deal formally with samples coming from -dimensional Gaussians, which have as their mean some unknown multiple of an unknown unit vector , and also some unknown covariance which nullifies and satisfies the following weak “separability” condition.
[Weak separability of the spectrum] Fix a unit vector . We say that a positive semi-definite is -weakly separable if for some constant it holds that
All other eigenvalues of lie in the interval
Notice that in particular the canonical case is -weakly separable as all eigenvalues of are equal to one, besides the zero eigenvalue which has multiplicity one and eigenvector
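The spectrum of the canonical case can be checked numerically. The sketch below builds the identity minus the rank-one projector onto a random unit vector, which is the unique matrix matching the description above (eigenvalue one everywhere except a simple zero eigenvalue along the hidden direction). The dimension `d = 6` and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6  # illustrative dimension

# Random unit vector playing the role of the hidden direction.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

# Canonical covariance: identity minus the projector onto u.
sigma = np.eye(d) - np.outer(u, u)
eigenvalues = np.linalg.eigvalsh(sigma)  # returned in ascending order

print(np.round(eigenvalues, 10))     # one (numerically) zero eigenvalue, the rest equal to one
print(np.linalg.norm(sigma @ u))     # sigma nullifies u, up to rounding error
```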
Under the weak separability assumption we establish the following generic result. [Our general model] Let , known spacing level satisfying for some constant and arbitrary . Consider also an arbitrary and an arbitrary unknown which is -weakly separable per Assumption 3. Conditional on and , we then draw independent samples where . The goal is to use to recover both and up to a global sign flip, with probability over the samples
In the following section we discuss the guarantee of our proposed algorithm. Afterwards, we discuss how our result implies that learning in polynomial time with samples is possible for: (1) the “planted sparse vector” problem (Model 3.2), (2) the homogeneous Continuous Learning with Errors (hCLWE) problem, also informally called the “Gaussian pancakes” problem (Model 3.2), and finally (3) Gaussian clustering (Model 3.2). As mentioned earlier in Section 1, our result bypasses known lower bounds for SoS and low-degree polynomials which suggested that the regime would not be achievable by polynomial-time algorithms.
In what follows and the rest of the paper, for some and we denote by the truncation of to its first bits after zero.
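A minimal sketch of this truncation follows. It assumes that truncation rounds toward zero, that is, it keeps the sign and takes the floor of the scaled absolute value; this convention is an assumption of the sketch. The function name `truncate_bits` is hypothetical.

```python
from fractions import Fraction
import math

def truncate_bits(x, n_bits):
    """Keep the first n_bits binary digits after the point, rounding toward zero.

    Assumption: truncation acts on |x| and the sign is restored afterwards.
    """
    x = Fraction(x)
    scale = 2 ** n_bits
    sign = -1 if x < 0 else 1
    return sign * Fraction(math.floor(abs(x) * scale), scale)

x = Fraction(5, 3)
print(truncate_bits(x, 3))        # 13/8, since floor(8 * 5/3) = 13
print(truncate_bits(-x, 3))       # -13/8
print(x - truncate_bits(x, 3))    # error 1/24, below 2**-3
```

Two properties matter for the lattice construction: multiplying the truncated number by 2 to the power of the number of bits gives an integer, and the truncation error is smaller than 2 to the minus that number of bits.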
3.1 The algorithm and the main guarantee
Our proposed algorithm for solving Model 3 is described in Algorithm 1. Specifically we assume the algorithm receives independent samples according to Model 3. As we see in the following theorem, the algorithm is able to recover exactly (up to a global sign flip) both the hidden direction and the hidden labels in polynomial time.
Algorithm 1, given as input independent samples from Model 3 with hidden direction , covariance and true labels satisfies the following with probability : there exists such that the algorithm’s outputs and satisfy
Moreover, Algorithm 1 terminates in steps.
We now provide intuition behind the algorithm’s success. Note that for the unknown and it holds that
In the first step, the algorithm checks a certain general-position condition on the received samples, which naturally is satisfied almost surely for our random data. In the following crucial three steps, the algorithm attempts to recover only the hidden integer labels without learning . To do this, it exploits a certain random integer linear relation that the labels ’s satisfy and which, importantly, does not involve any information about the unknown beyond its existence. The key observation leading to this relation is the following. Since we have vectors in a -dimensional space, there exist scalars (depending on the ’s) such that . These are exactly the ’s that the algorithm computes in the second step. Using them, observe that the following linear equation holds, due to (5),
and therefore since it gives the following integer linear equation
Again note that the ’s can be computed from the given samples , so this is an equation whose sole unknowns are the labels . With this integer linear equation in mind, the algorithm in the following step employs the powerful LLL algorithm applied to an appropriate lattice. This application of LLL is based on the breakthrough works of [lagarias1984knapsack, FriezeSubset] for solving random subset-sum problems in polynomial time, as well as its recent manifestations for solving various other noiseless inference settings such as binary regression [NEURIPS2018_ccc0aa1b] and phase retrieval [andoni2017correspondence, song2021cryptographic]. To get some intuition for this connection, notice that in the case , (7) is really a (promise) subset-sum relation with weights and unknown subset for which the corresponding ’s sum to . Now, after some careful technical work, including an appropriate truncation argument to work with integer-valued data, and various anti-concentration arguments such as the Carbery–Wright anticoncentration toolkit [Carbery2001DistributionalAL], one can show that the LLL step indeed recovers a constant multiple of the labels with probability (see also the next paragraph for more details on this). At this point, it is relatively straightforward to recover using the linear equations (5).
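The key observation above can be checked numerically: coefficients computed from the samples alone satisfy a linear relation whose only unknowns are the integer labels. The sketch below is not Algorithm 1; it illustrates only the relation, under several assumptions made for illustration: the canonical covariance (identity minus the projector onto the hidden direction), Rademacher labels, the normalization in which the last sample is expressed as a combination of the first ones, and arbitrary values `d = 5` and `gamma = 2.0`.

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma = 5, 2.0  # illustrative dimension and spacing

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                        # hidden unit direction
proj = np.eye(d) - np.outer(u, u)             # canonical covariance, nullifies u
z = rng.choice([-1, 1], size=d + 1)           # hidden integer labels (Rademacher here)

# Sample i has mean gamma * z_i * u and noise supported on the complement of u.
noise = rng.standard_normal((d + 1, d)) @ proj
samples = gamma * z[:, None] * u[None, :] + noise   # one sample per row

# Coefficients lam with sum_i lam_i * x_i = x_{d+1}, computed from samples only.
X = samples[:d].T                             # columns x_1, ..., x_d
lam = np.linalg.solve(X, samples[d])

# Taking inner products with u removes the noise and leaves a relation on the labels.
residual = lam @ z[:d] - z[d]
print(abs(residual))                          # zero up to floating-point error
```

The coefficients `lam` never use `u` or `z`, yet the residual vanishes, which is the sense in which the relation involves the hidden direction only through its existence.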
We close by presenting the key technical lemma, which ensures that LLL recovers the hidden labels by finding a “short” vector in the lattice defined by the columns of the matrix in Algorithm 1. Notice that if the truncation at bits were not present, that is, if we were “allowed” to construct the lattice basis with the non-integer numbers instead of , then a direct calculation based on (7) would give that the hidden labels are embedded in an element of the lattice simply because we would have
As this “hidden vector” in the lattice is -independent (and is taken to be very large) this naturally suggests that this vector may be “short” compared to the others in the lattice. The following lemma states that with probability , this is indeed the case. The random lattice generated by the columns of indeed does not contain any “spurious” short vectors other than the vector of the hidden labels and, naturally, its integer multiples. This implies that the LLL algorithm, despite its approximation ratio, will indeed return the integer relation that is “hidden in” the ’s.
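The role of the large multiplier can be illustrated on a toy knapsack-type lattice in the spirit of the subset-sum constructions cited above. This is a sketch under stated assumptions, not the matrix of Algorithm 1: the basis columns are a scaled integer weight stacked on top of a unit vector, and the weights, the multiplier `M`, and all function names are hypothetical. An integer relation among the weights produces a lattice vector whose length does not depend on `M`, whereas any coefficient vector that is not a relation produces a vector of length at least `M`.

```python
def knapsack_basis(weights, M):
    """Columns (M * w_i, e_i) of a toy knapsack-type lattice basis."""
    n = len(weights)
    columns = []
    for i, w in enumerate(weights):
        unit = [1 if j == i else 0 for j in range(n)]
        columns.append([M * w] + unit)
    return columns

def lattice_vector(columns, coeffs):
    """Integer combination of the basis columns."""
    dim = len(columns[0])
    return [sum(c * col[k] for c, col in zip(coeffs, columns)) for k in range(dim)]

def squared_norm(v):
    return sum(x * x for x in v)

weights = [3, 5, 7, -15]        # 3 + 5 + 7 - 15 = 0
relation = [1, 1, 1, 1]         # an integer relation for the weights
non_relation = [1, 1, 0, 1]     # 3 + 5 - 15 = -7, not a relation

for M in (10, 10**6):
    cols = knapsack_basis(weights, M)
    print(M,
          squared_norm(lattice_vector(cols, relation)),      # stays 4 for every M
          squared_norm(lattice_vector(cols, non_relation)))  # equals 49 * M**2 + 3
```

Exact Python integers are used so that the comparison remains valid for very large multipliers.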
[No spurious short vectors] Let for some constant and . Let be an arbitrary unit vector, an arbitrary unknown -weakly separable matrix, and let for be arbitrary but not all zero. Moreover, let be independent samples from , and let be the matrix constructed in Algorithm 1 using as input and -bit precision. Then, with probability over the samples, for any such that is not an integer multiple of , the following holds:
3.2 Implications of the success of LLL
The algorithmic guarantee described in Theorem 3.1 has interesting implications for various recently studied problems.
The planted sparse vector problem.
As we motivated in the introduction, a model of notable recent interest is the so-called “planted sparse vector” setting, where one plants a sparse vector in an unknown subspace. [Planted sparse vector] Let be such that , and let the sparsity level . First draw and let be an orthogonal matrix chosen uniformly at random. Then, we sample i.i.d. . Denote by the matrix whose columns we perceive as generating the “hidden subspace”:
The statistician observes the rotated matrix . The goal is, given access to , to recover the -sparse vector .
As explained in Section 1, all low-degree polynomial methods, and therefore all spectral methods (in a large class) fail for this model when [planted-vec-ld]. Furthermore, in the dense case where and , even degree- SoS algorithms are known to fail [affine-planes].
Yet, using our Theorem 3.1 we can prove that for any the LLL algorithm solves Model 3.2 with samples in polynomial time. In particular, this improves the state of the art for this task, and provably beats all low-degree algorithms, when . The corollary is based on the equivalence (up to rescaling) between Model 3.2 and some appropriate sub-case of Model 3 [unknown-cov].
As established in [planted-vec-ld, Lemma 4.21], for a sample from Model 3.2 in a setting where the planted sparse vector is , the distribution of is a sample from Model 3 with hidden direction , spacing and covariance . Note that the covariance satisfies Assumption 3 for straightforward reasons, and so does the condition on the spacing, as and . On top of this, since the output of is not the zero vector with probability . Hence, combining the above with Theorem 3.1, we immediately conclude the desired result. ∎
Continuous Learning with Errors.
As we motivated in the introduction, a second model that is of interest, due to its connections to fundamental problems in lattice-based cryptography, is the homogeneous continuous learning with errors (hCLWE).
[hCLWE] Let be a real number and let be a discrete Gaussian of width 1 supported on . Let . First draw a random unit vector and i.i.d. . Conditional on and , draw independent samples where . The statistician observes the matrix with rows . The goal is, given access to , to recover the unit vector .
It is known that for , if we add any inverse-polynomial Gaussian noise in the hidden direction of hCLWE, even detecting the existence of such discrete structure is hard, under standard worst-case hardness assumptions from lattice-based cryptography [bruna2020continuous]. Moreover, there are SQ lower bounds, which are unconditional, for this noisy version [sq-robust, bruna2020continuous]. As a direct corollary of our Theorem 3.1 we show that in the noiseless case where no Gaussian noise is added, LLL can recover exactly the hidden direction in polynomial time with samples. We remark that while [bruna2020continuous] claim that their LLL-based algorithm for (inhomogeneous) CLWE could be generalized to the hCLWE setting, their algorithm uses samples, which is suboptimal.
The proof follows from Claim 3.2. ∎
Finally, a third model which has been extensively studied in the theoretical machine learning community is the (Bayesian) Gaussian clustering model.
[Gaussian clustering] Let . Fix some unknown positive semi-definite matrix . Now draw a random unit vector and i.i.d. uniform Rademacher labels . Conditional on and , draw independent samples where . The statistician observes the matrix with rows . The goal is, given the observation matrix , to recover (up to a global sign flip) the labels .
As explained in the introduction, the recent work by [unknown-cov] shows that for Model 3.2, exact reconstruction of the labels is possible as long as if is invertible, and of course also in the regime where . The authors of [unknown-cov] show that for any such it is possible to achieve exact reconstruction with samples by some computationally inefficient method, and construct and analyze computationally efficient methods that work with samples. On top of this, they conjecture that the regime is computationally hard based on various forms of rigorous evidence such as failure of SoS and low-degree methods. Notably, the failure of the SoS hierarchy transfers even to the case in the regime . As a direct corollary of Theorem 3.1, we refute their conjecture for any covariance matrix which nullifies and satisfies a weak “niceness” assumption, namely Assumption 3. We show that under this assumption, exact reconstruction of the labels (and therefore of the clusters) is possible in polynomial time with only samples.
4 Information-theoretic lower bound
In this section we establish an information-theoretic lower bound associated with problem (1), and in particular (2), for parameter recovery of the hidden direction . This lower bound shows that the sample complexity for our LLL-based algorithm is optimal in the following sense: for exact recovery of the hidden direction (up to a global sign flip), even samples are not sufficient.
4.1 Optimality of samples for parameter recovery
In this section we establish that when , , and for , one cannot information-theoretically exactly recover the hidden direction from independent samples . This establishes the optimality of our LLL approach which achieves a much more generic guarantee in Theorem 3.1 in recovering exactly, up to one additional sample compared to the information-theoretic lower bound.
In fact, our lower bound is somewhat stronger. We assume that the statistician also has exact knowledge of the signs , and yet we still establish that exact recovery of with probability larger than using at most samples is impossible. Notice that in this setting, where are known to the statistician, there is no global sign ambiguity with respect to the hidden direction .
[Parameter recovery] Let arbitrary be fixed and known to the statistician. Moreover, let be a uniformly random unit vector. For each let be a sample generated independently from Then it is information-theoretically impossible to construct an estimate which satisfies with probability larger than
5 Proof of Algorithm 1 correctness
5.1 Towards proving Theorem 3.1: auxiliary lemmas
We present here three auxiliary lemmas for proving Theorem 3.1 and Lemma 1. The first lemma establishes that given a small (in -norm) “approximate” integer relation between real numbers, one can appropriately truncate each real number to a sufficiently large number of bits, so that the truncated numbers satisfy a small (in -norm) integer relation between them. This lemma, which is an immediate implication of [song2021cryptographic, Lemma D.6], is important for the appropriate application of the LLL algorithm, which needs to receive integer-valued input. Recall that for a real number we denote by its truncation to its first bits after zero, i.e.
[“Rounding” approximate integer relations [song2021cryptographic, Lemma D.6]] Let be a number and let be such that for some constant . Moreover, suppose for some constant , a (real-valued) vector satisfies for some . Then for some sufficiently large constant , if , there is an which is equal to in the first coordinates, satisfies , and is an integer relation for the numbers
We need the following anticoncentration result. [Anticoncentration of misaligned integer combinations] Assume that for some constant. Let be an arbitrary unit vector and let be an arbitrary sequence of integers, which are not all equal to zero. Now for a sequence of integers , we define the (multi-linear) polynomial in variables by
where each is assumed to have a -dimensional vector form, denotes the matrix with as its columns, and each for denotes the matrix formed by swapping out the -th column of with .
Suppose ’s are drawn independently from for some and which is -weakly separable per Assumption 3 and eigenvalues . Then, for any it holds that
Furthermore, for some universal constant the following holds. If for any , where we denote , then for any
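The column-swapped matrices appearing in the lemma are the ones that arise in Cramer's rule, which expresses each coefficient of the last sample in terms of the first ones as a ratio of determinants. The sketch below checks this identity numerically. It is an illustration under assumptions (generic Gaussian points, dimension `d = 4`, hypothetical variable names) and does not reproduce the exact polynomial of the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4  # illustrative dimension

points = rng.standard_normal((d + 1, d))   # generic points x_1, ..., x_{d+1}
X = points[:d].T                           # columns x_1, ..., x_d
x_last = points[d]                         # x_{d+1}

# Coefficients of x_{d+1} in the basis x_1, ..., x_d.
lam_solve = np.linalg.solve(X, x_last)

# Cramer's rule: lam_i = det(X with column i replaced by x_{d+1}) / det(X).
det_X = np.linalg.det(X)
lam_cramer = np.empty(d)
for i in range(d):
    X_swapped = X.copy()
    X_swapped[:, i] = x_last
    lam_cramer[i] = np.linalg.det(X_swapped) / det_X

print(np.max(np.abs(lam_solve - lam_cramer)))  # agreement up to rounding error
```

Multiplying an integer combination of the coefficients by the common denominator therefore yields a combination of determinants, each of which is linear in every sample; this is the kind of multilinear structure to which the anticoncentration bound is applied.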
We first describe how (12) follows from (10) and (11). First, notice that under the assumption on the integer sequence not being a multiple of the sequence of integers it holds that for some with . In particular, using (11) we have
But now notice that from Assumption 3 and , it holds for some constant that
Hence, it holds that
Now we employ [v012a011, Theorem 1.4] (originally proved in [Carbery2001DistributionalAL]) which implies that for some universal constant since our polynomial is multilinear and has degree , it holds for any that
Using our lower bound on the variance we conclude the result.
Now we proceed with the mean and variance calculation. As this statement is about the first and second moment of and the determinant operator is invariant up to basis transformations, we may assume without loss of generality that