 # Learning Mixtures of Product Distributions via Higher Multilinear Moments

Learning mixtures of k binary product distributions is a central problem in computational learning theory, but one where there are wide gaps between the best known algorithms and lower bounds (even for restricted families of algorithms). We narrow many of these gaps by developing novel insights about how to reason about higher order multilinear moments. Our results include: 1) an n^{O(k^2)} time algorithm for learning mixtures of binary product distributions, giving the first improvement on the n^{O(k^3)} time algorithm of Feldman, O'Donnell and Servedio; 2) an n^{Ω(√k)} statistical query lower bound, improving on the n^{Ω(log k)} lower bound that is based on connections to sparse parity with noise; 3) an n^{O(log k)} time algorithm for learning mixtures of k subcubes. This special case can still simulate many other hard learning problems, but is much richer than any of them alone. As a corollary, we obtain more flexible algorithms for learning decision trees under the uniform distribution that work with stochastic transitions, when we are only given positive examples, and with a polylogarithmic number of samples for any fixed k. Our algorithms are based on a win-win analysis where we either build a basis for the moments or locate a degeneracy that can be used to simplify the problem, which we believe will have applications to other learning problems over discrete domains.


## 1 Introduction

### 1.1 Background

In this paper, we introduce and study the following natural problem: A mixture of subcubes is a distribution on the Boolean hypercube {0,1}^n where each sample is drawn as follows:

1. There are mixing weights π^1, …, π^k and centers μ^1, …, μ^k ∈ {0, 1/2, 1}^n.

2. We choose a center with probability equal to its mixing weight and then sample a point uniformly at random from its corresponding subcube. More precisely, if we choose the i-th center, each coordinate is independent and the j-th coordinate has expectation μ^i_j.
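As a concrete illustration, the two steps above can be sketched in a few lines of numpy (the function and variable names, and the toy weights and centers, are ours rather than from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(pis, centers, num):
    """Draw num samples from a mixture of subcubes.

    pis     : (k,) mixing weights summing to 1
    centers : (k, n) matrix with entries in {0, 1/2, 1}
    """
    k, n = centers.shape
    which = rng.choice(k, size=num, p=pis)        # pick a center per sample
    # coordinate j is an independent coin with bias centers[i, j]
    return (rng.random((num, n)) < centers[which]).astype(int)

pis = np.array([0.5, 0.5])
centers = np.array([[1.0, 0.5, 0.0],
                    [0.0, 0.5, 1.0]])
X = sample_mixture(pis, centers, 20000)
```

Note that coordinates with center value 1/2 are free coordinates of the subcube, while 0/1 coordinates are fixed; in the toy instance above the first and third coordinates are perfectly anti-correlated across the two components.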

Our goal is to give efficient algorithms for estimating the distribution in the PAC-style model of Kearns et al. [kearns1994learnability]. It is not always possible to learn the parameters, because two mixtures of subcubes (even with different numbers of components) can give rise to identical distributions. Instead, the goal is to output a distribution that is close to the true distribution in total variation distance.

The problem of learning mixtures of subcubes contains various classic problems in computational learning theory as a special case, and is itself a special case of others. For example, for any s-leaf decision tree, the uniform distribution on assignments that satisfy it is a mixture of s subcubes. Likewise, for any function that depends on just k variables (a k-junta), the uniform distribution on assignments that satisfy it is a mixture of at most 2^k subcubes. And when we allow the centers to instead be in [0,1]^n, it becomes the problem of learning mixtures of binary product distributions.

Each of these problems has a long history of study. Ehrenfeucht and Haussler [ehrenfeucht1989learning] gave an n^{O(log s)} time algorithm for learning s-leaf decision trees. Blum [blum1992rank] showed that s-leaf decision trees can be represented as O(log s)-width decision lists, and Rivest [rivest] gave an algorithm for learning w-width decision lists in time n^{O(w)}. Mossel, O'Donnell and Servedio [mossel2003learning] gave a roughly n^{0.70k} time algorithm for learning k-juntas (more precisely, n^{(ω/(ω+1))k}, where ω is the exponent for fast matrix multiplication). Valiant [valiant2012finding] gave an improved algorithm that runs in roughly n^{0.60k} time. Freund and Mansour [freund1999estimating] gave the first polynomial-time algorithm for learning mixtures of two product distributions. Feldman, O'Donnell and Servedio [fos] gave an n^{O(k^3)} time algorithm for learning mixtures of k product distributions.

What makes the problem of learning mixtures of subcubes an interesting compromise between expressive power and structure is that it admits surprisingly efficient learning algorithms. The main result in our paper is an n^{O(log k)} time algorithm for learning mixtures of k subcubes. We also give applications of our algorithm to learning k-leaf decision trees with at most O(log k) stochastic transitions on any root-to-leaf path (which also capture interesting scenarios where the transitions are deterministic but there are latent variables). Using our algorithm for learning mixtures of subcubes, we can approximate the error of the Bayes optimal classifier within an additive ε in n^{O(log k)} time with an inverse polynomial dependence on the accuracy parameter ε. The classic algorithms of [rivest, blum1992rank, ehrenfeucht1989learning] for learning decision trees with zero stochastic transitions achieve this runtime, but because they are Occam algorithms, they break down in the presence of stochastic transitions. Alternatively, the low-degree algorithm [linial1993constant] is able to get a constant factor approximation to the optimal error (again within an additive ε), while running in time n^{O(log(k/ε))}. The quasipolynomial dependence on 1/ε is inherent to the low-degree approach, because the degree needs to grow as the target accuracy decreases, which is undesirable when ε is small as a function of n.

In contrast, we show that mixtures of k subcubes are uniquely identified by their O(log k) order moments. Ultimately our algorithm for learning mixtures of subcubes will allow us to simultaneously match the polynomial dependence on 1/ε of Occam algorithms and achieve the flexibility of the low-degree algorithm in being able to accommodate stochastic transitions. We emphasize that proving identifiability from O(log k) order moments is only a first step in a much more technical argument: there are many subtleties about how we can algorithmically exploit the structure of these moments to solve our learning problem.

### 1.2 Our Results and Techniques

Our main result is an n^{O(log k)} time algorithm for learning mixtures of k subcubes.

Let ε, δ > 0 be given and let D be a mixture of k subcubes over {0,1}^n. There is an algorithm that, given samples from D, runs in time n^{O(log k)} ⋅ poly(1/ε, log(1/δ)) and outputs a mixture of subcubes D′ that satisfies d_TV(D, D′) ≤ ε with probability at least 1 − δ. Moreover, the sample complexity is polylogarithmic in n for any fixed k. (Throughout, the hidden constant depending on k is one we have made no attempt to optimize.)

The starting point for our algorithm is the following simple but powerful identifiability result:

[Informal] A mixture of k subcubes is uniquely determined by its O(log k) order moments.

In contrast, for many sorts of mixture models with k components, one typically needs order-k moments to establish identifiability [mv], and this translates to algorithms with running time at least n^{Ω(k)} and sometimes even much larger than that. In part, this is because the notion of identifiability we are aiming for needs to be weaker and as a result is more subtle. We cannot hope to learn the subcubes and their mixing weights, because there are mixtures of subcubes that can be represented in many different ways, sometimes with the same number of subcubes. But as distributions, two mixtures of subcubes are the same if they match on their first O(log k) moments. It turns out that proving this is equivalent to the following basic problem in linear algebra:

Given a matrix m ∈ {0, 1/2, 1}^{n×k}, what is the minimum d for which the set of all entrywise products of at most d rows of m spans the set of all entrywise products of rows of m?

We show that d can be at most log k, which is easily shown to be tight up to constant factors. We will return to a variant of this question later when we discuss why learning mixtures of product distributions requires much higher-order moments.
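To make the question above concrete, here is a hedged brute-force computation of the minimal spanning order on a small, hand-rolled instance (the instance and all names are ours, not from the paper):

```python
import numpy as np
from itertools import combinations

def entrywise_products(M, max_size):
    """Entrywise products of at most max_size rows of M; the empty
    product is the all ones vector."""
    n, k = M.shape
    rows = [np.ones(k)]
    for r in range(1, max_size + 1):
        for S in combinations(range(n), r):
            rows.append(np.prod(M[list(S)], axis=0))
    return np.array(rows)

def spanning_order(M):
    """Smallest d for which products of <= d rows span all products."""
    n, _ = M.shape
    full = np.linalg.matrix_rank(entrywise_products(M, n))
    for d in range(n + 1):
        if np.linalg.matrix_rank(entrywise_products(M, d)) == full:
            return d

# Columns: the four even-weight strings on three bits, plus the uniform
# center (all entries 1/2); rows are coordinates.
M = np.array([[0, 0, 1, 1, 0.5],
              [0, 1, 0, 1, 0.5],
              [0, 1, 1, 0, 0.5]])
```

On this parity-style instance the products of at most two rows only reach rank four while the full set of products has rank five, so the minimal spanning order is 3 = ⌈log₂ 5⌉, consistent with the logarithmic bound stated above.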

Unsurprisingly, our algorithm for learning mixtures of subcubes is based on the method of moments. But there is an essential subtlety. For any distribution on the hypercube {0,1}^n, we have x_i^2 = x_i for every coordinate. From a technical standpoint, this means that when we compute moments, there is never any reason to take a power of a coordinate larger than one. We call these multilinear moments, and characterizing the way that the multilinear moments determine the distribution (but cannot determine its parameters) is the central challenge. Note that multilinearity makes our problem quite different from typical settings where tensor decompositions can be applied.

Now collect the centers into an n × k matrix with entries in {0, 1/2, 1} that we call the marginals matrix and denote by m. The key step in our algorithm is constructing a basis for the entrywise products of rows from this matrix. However, we cannot afford to simply brute-force search for this basis among all sets of at most k entrywise products of up to log k rows of m, because the resulting algorithm would run in time roughly n^{k log k}. Instead we construct a basis incrementally.

The first challenge that we need to overcome is that we cannot directly observe the entrywise product of a set of rows of the marginals matrix. But we can observe its weighted inner product with various other vectors. More precisely, if u and v are respectively the entrywise products of subsets S and T of rows of some marginals matrix that realizes the distribution, and π is the associated vector of mixing weights, then the relation

 ∑_{i=1}^k π_i u_i v_i = \ED[x_{S∪T}]

holds if S and T are disjoint. When S and T intersect, this relation is no longer true, because in order to express the left hand side in terms of the rows of the marginals matrix we would need to take some powers to be larger than one, which no longer correspond to multilinear moments that can be estimated from samples.
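The relation above can be checked numerically on toy parameters (the parameters and names below are ours): it holds exactly when S and T are disjoint, and generally fails when they intersect.

```python
import numpy as np

pis = np.array([0.3, 0.7])          # mixing weights
m = np.array([[1.0, 0.0],           # marginals matrix: rows are
              [0.5, 0.5],           # coordinates, columns are centers
              [0.0, 1.0]])

def moment(S):
    """E_D[x_S] computed from the parameters: coordinates are
    independent within each component."""
    return float(np.prod(m[list(S)], axis=0) @ pis)

def lhs(S, T):
    """sum_i pi_i u_i v_i for u, v the entrywise products of the rows
    indexed by S and T."""
    u = np.prod(m[list(S)], axis=0)
    v = np.prod(m[list(T)], axis=0)
    return float(np.sum(pis * u * v))
```

For instance, with S = {2} and T = {2} the left hand side squares the 1/2 entries while the moment does not, which is exactly the multilinearity obstruction described above.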

Now suppose we are given a collection T_1, …, T_r of subsets of rows of m and we want to check if the vectors M_{T_1}, …, M_{T_r} (where M_T is the entrywise product of the rows of m indexed by T) are linearly independent. Set J = T_1 ∪ ⋯ ∪ T_r. We can define a helper matrix whose columns are indexed by the T_i's and whose rows are indexed by subsets S of [n] ∖ J. The entry in column T_i, row S is \ED[x_{S∪T_i}], and it is easy to show that if this helper matrix has rank r then the vectors M_{T_1}, …, M_{T_r} are indeed linearly independent.

The second challenge is that this is an imperfect test. Even if the helper matrix has rank less than r, the vectors M_{T_1}, …, M_{T_r} might still be linearly independent. Even worse, we can encounter situations where our current collection is not yet a basis, and yet for any set we try to add, we cannot certify that the associated entrywise product of rows is outside the span of the vectors we have so far. Our algorithm is based on a win-win analysis. We show that when we get stuck in this way, it is because there is some set J of O(log k) coordinates where the entrywise products of subsets of rows from J do not span the full k-dimensional space. We show how to identify such a J by repeatedly solving systems of linear equations. Once we identify such a J, it turns out that for any string s we can condition on x_J = s and the resulting conditional distribution will be a mixture of strictly fewer subcubes, which we can then recurse on.

### 1.3 Applications

We demonstrate the power of our n^{O(log k)} time algorithm for learning mixtures of subcubes by applying it to learning decision trees with stochastic transitions. Specifically, suppose we are given samples that are uniform on the hypercube, but instead of computing the label based on a k-leaf decision tree with deterministic transitions, some of the transitions are stochastic — they read a bit and based on its value proceed down either the left or right subtree with some unknown probabilities. Such models are popular in medicine [hazen1998stochastic] and finance [hespos1965stochastic], where features of the system are partially or completely unobserved and the transitions that depend on these features appear to an outside observer to be stochastic. Thus we can also think about decision trees with deterministic transitions but with latent variables as having stochastic transitions when we marginalize out the unobserved variables.

With stochastic transitions, it is no longer possible to perfectly predict the label even if you know the stochastic decision tree. This rules out many forms of learning, like the Occam algorithms of [ehrenfeucht1989learning, blum1992rank, rivest], that are based on succinctly explaining a large portion of the observed samples. It turns out that by accurately estimating the distribution on positive examples — via our algorithm for learning mixtures of subcubes — it is possible to approach the Bayes optimal classifier in n^{O(log k)} time and with only a polylogarithmic number of samples:

Let ε, δ > 0 be given and let D be a distribution on labelled examples generated by a stochastic decision tree under the uniform distribution. Suppose further that the stochastic decision tree has k leaves and along any root-to-leaf path there are at most O(log k) stochastic transitions. There is an algorithm that, given samples from D, runs in time n^{O(log k)} ⋅ poly(1/ε, log(1/δ)) and with probability at least 1 − δ outputs a classifier whose probability of error is at most opt + ε, where opt is the error of the Bayes optimal classifier. Moreover, the sample complexity is polylogarithmic in n for any fixed k.

Recall that the low-degree algorithm [linial1993constant] is able to learn k-leaf decision trees in time n^{O(log(k/ε))} by approximating them by low-degree polynomials. These results also generalize to stochastic settings [aiello1991learning]. Recently, Hazan, Klivans and Yuan [hazan2017hyperparameter] were able to improve the sample complexity even in the presence of adversarial noise, using the low-degree Fourier approximation approach together with ideas from compressed sensing for learning low-degree, sparse Boolean functions [stobbe2012learning]. Although our algorithm is tailored to handle stochastic rather than adversarial noise, it has a much tamer dependence on 1/ε, which yields much faster algorithms when ε is small as a function of n. Moreover, we achieve a considerably stronger (and nearly optimal) error guarantee of opt + ε rather than C ⋅ opt + ε for some constant C > 1. Our algorithm even works in the natural variations of the problem [denis1998pac, letouzey2000learning, de2014learning] where it is only given positive examples.

Lastly, we remark that [de2014learning] studied a similar setting where the learner is given samples from the uniform distribution over satisfying assignments of some Boolean function f, and the goal is to output a distribution close to this uniform distribution. Their techniques seem quite different from ours and also from the low-degree algorithm. Among their results, the one most relevant to ours is the incomparable result that there is a quasipolynomial-time learning algorithm when f is an s-term DNF formula.

### 1.4 More Results

As we discussed earlier, mixtures of subcubes are a special case of mixtures of binary product distributions. The best known algorithm for learning mixtures of k product distributions is due to Feldman, O'Donnell and Servedio [fos] and runs in time n^{O(k^3)}. A natural question which a number of researchers have thought about is whether the dependence on k can be improved, perhaps to n^{O(log k)}. This would match the best known statistical query (SQ) lower bound of n^{Ω(log k)} for learning mixtures of product distributions, which follows from the fact that the uniform distribution over inputs accepted by a decision tree is a mixture of product distributions, and therefore from Blum et al.'s SQ lower bound [blumsq].

As we will show, it turns out that mixtures of product distributions require much higher-order moments, even just to distinguish a mixture of k product distributions from the uniform distribution on {0,1}^n. As before, this turns out to be related to a basic problem in linear algebra:

For a given k, what is the largest m for which there is a collection of vectors v_1, …, v_m ∈ R^k such that the entries in the entrywise product of any nonempty proper subset of the vectors sum to zero, while the entries in the entrywise product of all m vectors do not sum to zero? (In Section 2.5 we discuss the relationship between Questions 1.2 and 1.4.)

We show a rather surprising construction that achieves m = Ω(√k). An obvious upper bound for m is O(k). It is not clear what the correct answer ought to be. In any case, we show that this translates to the following negative result:

[Informal] There is a family of mixtures of k product distributions that are all different as distributions but which match on all moments of order up to Ω(√k).

Given a construction for Question 1.4, the idea for building this family is the same idea that goes into the SQ lower bound for learning sparse parity [kearns1998efficient] and the SQ lower bound for density estimation of mixtures of Gaussians [diakonikolas2016statistical], namely that of hiding a low-dimensional moment-matching example inside a high-dimensional product measure. We leverage Lemma 1.4 to show an SQ lower bound for learning mixtures of product distributions that holds for small values of ε, which is exactly the scenario we are interested in, particularly in applications to learning stochastic decision trees.

[Informal] Any algorithm given inverse-polynomially accurate statistical query access to a mixture D of k binary product distributions that outputs a distribution D′ satisfying d_TV(D, D′) ≤ ε for sufficiently small ε must make at least n^{Ω(√k)} queries.

This improves upon the previously best known SQ lower bound of n^{Ω(log k)}, although for larger values of ε our construction breaks down. In any case, in a natural dimension-independent range of parameters, mixtures of product distributions are substantially harder to learn using SQ algorithms than the special case of mixtures of subcubes.

Finally, we leverage the insights we developed for reasoning about higher-order multilinear moments to give improved algorithms for learning mixtures of binary product distributions:

Let ε, δ > 0 be given and let D be a mixture of k binary product distributions. There is an algorithm that, given samples from D, runs in time n^{O(k^2)} ⋅ poly(1/ε, log(1/δ)) and outputs a mixture of binary product distributions D′ that satisfies d_TV(D, D′) ≤ ε with probability at least 1 − δ.

Here we can afford to brute-force search for a basis. However, a different issue arises. In the case of mixtures of subcubes, when a collection of vectors that come from entrywise products of rows are linearly independent, we can also upper bound their condition number, which allows us to get a handle on the fact that we only have access to the moments of the distribution up to some sampling noise. But when the centers are allowed to take on arbitrary values in [0,1], there is no a priori upper bound on the condition number. To handle sampling noise, instead of finding just any basis, we find a barycentric spanner. (Specifically, we find a barycentric spanner for just the rows of the marginals matrix, rather than for the set of entrywise products of rows of the marginals matrix.) We proceed via a similar win-win analysis as for mixtures of subcubes: in the case that the condition number poses an issue for learning the distribution, we argue that after conditioning on the coordinates of the barycentric spanner, the distribution is close to a mixture of fewer product distributions. A key step in showing this is to prove the following robust identifiability result, which may be of independent interest:

[Informal] Two mixtures of k product distributions are ε-far in statistical distance if and only if they differ noticeably on a moment of order O(k).

In fact this is tight in the sense that lower-order moments are insufficient to distinguish between some mixtures of product distributions (see the discussion in Section 2.5). Another important point is that in the case of mixtures of subcubes, exact identifiability by O(log k)-order moments (Lemma 1.2) is non-obvious but, once proven, can be bootstrapped in a black-box fashion to robust identifiability using the abovementioned condition number bound. On the other hand, for mixtures of product distributions, exact identifiability by O(k)-order moments is straightforward, but without a condition number bound it is much more challenging to turn this into a result about robust identifiability.

### 1.5 Organization

The rest of this paper is organized as follows:

• Section 2 — we set up basic definitions, notation, and facts about mixtures of product distributions and provide an overview of our techniques.

• Section 3 — we describe our algorithm for learning mixtures of subcubes and give the main ingredients in the proof of Theorem 2.

• Section 4 — we prove the statistical query lower bound of Theorem 1.4.

• Section 5 — we describe our algorithm for learning general mixtures of product distributions, prove a robust low-degree identifiability lemma in Section 5.4, give the main ingredients in the proof of Theorem 1.4, and conclude in Section 5.6 with a comparison of our techniques to those of [fos].

• Appendix A — we make precise the sampling tree-based framework that our algorithms follow.

• Appendix B — we complete the proof of Theorem 2.

• Appendix C — we complete the proof of Theorem 1.4.

• Appendix D — we make precise the connection between mixtures of subcubes and various classical learning theory problems, including stochastic decision trees, juntas, and sparse parity with noise, and prove Theorem 1.3.

## 2 Preliminaries

### 2.1 Notation and Definitions

Given a matrix A, we denote by A^j_i the entry of A in row i and column j. For a set S of rows, we denote by A|_S the restriction of A to the rows in S, and similarly A|^T is the restriction of A to the columns in T. We will let ‖A‖_max denote the maximum absolute value of any entry in A and ‖A‖_∞ denote the induced operator norm of A, that is, the maximum absolute row sum. We will also make frequent use of entrywise products of vectors and their relation to the multilinear moments of the mixture model.

The entrywise product of a collection of vectors v_1, …, v_r ∈ R^k, denoted ⨀_{i=1}^r v_i, is the vector whose j-th coordinate is ∏_{i=1}^r (v_i)_j. When r = 0, the empty entrywise product is the all ones vector.

Given a set S, we use 2^S to denote the powerset of S. Let U be the uniform distribution over {0,1}^n. Also let [n] = {1, …, n} for convenience. Let D(x) denote the density of D at the point x. Let \vec{1} be the all ones string of length n.

For S ⊆ [n], the S-moment of D is \E_D[∏_{i∈S} x_i]. We will sometimes use the shorthand \ED[x_S].

There can be many choices of mixing weights π^1, …, π^k and centers μ^1, …, μ^k that yield the same mixture of product distributions D. We will refer to any valid choice of parameters as a realization of D.

A mixture of product distributions D is a mixture of subcubes if there is a realization of D with mixing weights π^1, …, π^k and centers μ^1, …, μ^k for which each center has only {0, 1/2, 1} values.

In this paper, when referring to mixing weights, our superscript notation is only for indexing and never for powering.

There are three main matrices we will be concerned with.

The marginals matrix m is the n × k matrix obtained by concatenating the centers μ^1, …, μ^k of some realization, and we write m_i for its i-th row. The moment matrix M is the 2^n × k matrix whose rows are indexed by sets S ⊆ [n] and given by

 M_S = ⨀_{i∈S} m_i

Finally, the cross-check matrix C is the 2^n × 2^n matrix whose rows and columns are indexed by sets S, T ⊆ [n] and whose entries lie in [0,1] ∪ {?}, where

 C^T_S = \ED[x_{S∪T}] if S ∩ T = ∅, and C^T_S = ? otherwise.

We say that an entry of C is accessible if it is not equal to ?.

It is important to note that m and M depend on the choice of a particular realization of D, but C does not, because its entries are defined through the moments of D. The starting point for our algorithms is the following observation about the relationship between M and C:

Fix any realization of D with mixing weights π^1, …, π^k and centers μ^1, …, μ^k. Then:

1. For any set S ⊆ [n] we have M_S ⋅ π = \ED[x_S].

2. For any pair of sets S, T ⊆ [n] with S ∩ T = ∅ we have

 C^T_S = (M ⋅ diag(π) ⋅ M^⊤)^T_S

The idea behind our algorithms is to find a basis for the rows of M or, failing that, to find some coordinates to condition on which result in a mixture of fewer product distributions. The major complications come from the fact that we can only estimate the accessible entries of C from samples from our distribution. If we had access to all of them, it would be straightforward to use the above relationship between M and C to find a set of rows of M that span the row space.
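Both parts of Observation 2.1 can be sanity-checked numerically on a tiny instance (the parameters below are our own toy values). The moments are computed by brute-force enumeration of the density, independently of the matrix identities being verified:

```python
import numpy as np
from itertools import product

pis = np.array([0.25, 0.75])
m = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 0.5]])          # marginals matrix: n = 3, k = 2
n, k = m.shape

def density(x):
    """D(x) = sum_i pi_i prod_j Pr[x_j | center i]."""
    p = pis.copy()
    for j, bit in enumerate(x):
        p = p * (m[j] if bit else 1 - m[j])
    return float(p.sum())

def moment(S):
    """E_D[x_S] by enumeration over the hypercube."""
    return sum(density(x) for x in product([0, 1], repeat=n)
               if all(x[j] for j in S))

def M_row(S):
    """Row M_S of the moment matrix: entrywise product of rows of m."""
    return np.prod(m[list(S)], axis=0)
```

With these definitions, M_row(S) @ pis reproduces every moment, and for disjoint S and T the accessible entry of C equals M_row(S) @ diag(pis) @ M_row(T), exactly as the observation states.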

### 2.2 Rank of the Moment Matrix and Conditioning

First we will show that, without loss of generality, we can assume that the moment matrix M has full column rank. If it does not, we will be able to find a new realization of D as a mixture of strictly fewer product distributions.

A realization of D is a full rank realization if M has full column rank and all the mixing weights are nonzero. Furthermore, if D has a full rank realization with k centers, we will say D has rank k.

Fix a realization of D with mixing weights π^1, …, π^k and centers μ^1, …, μ^k, and let M be the moment matrix. If rank(M) = r < k then there are new mixing weights π′ such that:

1. π′ has at most r nonzeros

2. π′ and μ^1, …, μ^k also realize D.

Moreover, the submatrix consisting of the columns of M with nonzero mixing weight in π′ has rank r.

###### Proof.

We will proceed by induction on the number of nonzero mixing weights. If the columns of M corresponding to the support of the mixing weights are linearly dependent, there is a nonzero vector v in the kernel of M supported on those coordinates. The sum of the entries in v must be zero because the first row of M, indexed by S = ∅, is the all ones vector. Now if we move along the line π + t ⋅ v as we increase t (replacing v by −v if necessary), there is a first time t* at which a coordinate becomes zero. Let π′ = π + t* ⋅ v. By construction, π′ is nonnegative, its entries sum to one, and it has strictly fewer nonzeros. Note that as we change the mixing weights, the moment matrix stays the same, and M ⋅ π′ = M ⋅ π because v is in the kernel of M, so π′ with the same centers realizes D. We can continue in this fashion until the columns corresponding to the support of the mixing weights are linearly independent. The resulting submatrix must have rank r, because each time we update the weights we are adding a multiple of a vector in the kernel of M, so the column that is dropped lies in the span of the columns remaining in the support and the span of the supported columns never changes. ∎
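The weight-shifting step in the proof above can be sketched as follows (a hedged illustration with our own names; numerical rank computations and tolerances stand in for exact linear algebra):

```python
import numpy as np

def reduce_support(M, pis, tol=1e-9):
    """One step: if the columns of the moment matrix M supported by pis
    are dependent, move along a kernel direction until a weight hits
    zero. Returns weights realizing the same moments."""
    supp = np.flatnonzero(pis > tol)
    if np.linalg.matrix_rank(M[:, supp], tol=tol) == len(supp):
        return pis                       # supported columns independent
    v = np.zeros_like(pis)
    v[supp] = np.linalg.svd(M[:, supp])[2][-1]   # kernel direction
    neg = v < -tol                       # v sums to ~0, so nonempty
    t = np.min(-pis[neg] / v[neg])       # first weight to reach zero
    new = pis + t * v
    new[np.abs(new) < tol] = 0.0
    return new
```

Iterating this step until it is a no-op yields the mixing weights promised by the lemma: the moments M ⋅ π are preserved at every step because the update direction lies in the kernel of M.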

Thus when we fix an (unknown) realization of D in our analysis, we may as well assume that it is a full rank realization. This is true even if we restrict our attention to mixtures of subcubes, where the above lemma shows that if M does not have full column rank, there is a mixture of fewer subcubes that realizes D. Next we show that mixtures of product distributions behave nicely under conditioning:

Fix a realization of D with mixing weights π^1, …, π^k and centers μ^1, …, μ^k. Let S ⊆ [n] and s ∈ {0,1}^{|S|}. Then the conditional distribution D|_{x_S = s} can be realized as a mixture of product distributions with mixing weights π′ (computed in the proof below) and centers

 μ^1|_{[n]∖S}, μ^2|_{[n]∖S}, ⋯, μ^k|_{[n]∖S}
###### Proof.

Using Bayes' rule, we can write out the mixing weights explicitly as

 π′ = π ⨀ (⨀_{i∈S} γ^i) / Pr_D[x_S = s]

where we have abused notation and used ⨀ as an infix operator, and where γ^i is the vector whose j-th entry is μ^j_i when s_i = 1 and 1 − μ^j_i when s_i = 0. This follows because the map defining γ^i is the identity on μ^j_i when s_i = 1 and t ↦ 1 − t when s_i = 0. ∎

We can straightforwardly combine the two lemmas above to conclude that if D has rank r, then for any S ⊆ [n] and s ∈ {0,1}^{|S|} there is a realization of D|_{x_S = s} as a mixture of at most r product distributions. Moreover, if D was a mixture of subcubes, then so too is this realization of D|_{x_S = s}.
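The conditioning step can be sketched in code (our own names and toy parameters; for convenience the centers are stored here as rows, one row per component), with the reweighting of the lemma checked against the conditional density computed by brute force:

```python
import numpy as np

pis = np.array([0.2, 0.8])
mu = np.array([[1.0, 0.5, 0.0],     # center 1
               [0.0, 0.5, 1.0]])    # center 2
n = mu.shape[1]

def density(weights, centers, x):
    p = weights.copy()
    for j, bit in enumerate(x):
        p = p * (centers[:, j] if bit else 1 - centers[:, j])
    return float(p.sum())

def condition(weights, centers, S, s):
    """Bayes' rule: reweight each component by Pr[x_S = s | component]
    and restrict the centers to the remaining coordinates."""
    w = weights.copy()
    for j, bit in zip(S, s):
        w = w * (centers[:, j] if bit else 1 - centers[:, j])
    rest = [j for j in range(n) if j not in S]
    return w / w.sum(), centers[:, rest]
```

The returned pair is again a realization of a mixture of product distributions (of subcubes, if the input was), now over the coordinates outside S.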

### 2.3 Linear Algebraic Relations between M and C

Even though not all of the entries of C are accessible (i.e. can be estimated from samples from D), we can still use it to deduce linear algebraic properties among the rows of M. All of the results in this subsection are elementary consequences of Observation 2.1.

Let T_1, …, T_r ⊆ [n] and set J = T_1 ∪ ⋯ ∪ T_r; write R(J) for the collection of subsets of [n] ∖ J. If the columns

 C^{T_1}|_{R(J)}, C^{T_2}|_{R(J)}, ⋯, C^{T_r}|_{R(J)}

are linearly independent, then for any realization of D the rows M_{T_1}, …, M_{T_r} are also linearly independent.

###### Proof.

Fix any realization of D. Using Observation 2.1, we can write:

 C|^{T_1,…,T_r}_{R(J)} = M|_{R(J)} ⋅ diag(π) ⋅ (M^⊤)|^{T_1,…,T_r}

Now suppose for the sake of contradiction that the rows M_{T_1}, …, M_{T_r} are not linearly independent. Then there is a nonzero vector α so that ∑_{i=1}^r α_i M_{T_i} = 0, which by the above equation immediately implies that the columns of C|^{T_1,…,T_r}_{R(J)} are not linearly independent, which yields our contradiction. ∎

Next we prove a partial converse to the above lemma:

Fix a realization of D and let M|_{R(J)} have full column rank, where T_1, …, T_r ⊆ [n] and J = T_1 ∪ ⋯ ∪ T_r. If there are coefficients α_1, …, α_r so that

 ∑_{i=1}^r α_i C^{T_i}|_{R(J)} = 0

then the corresponding rows of M are linearly dependent too — i.e. ∑_{i=1}^r α_i M_{T_i} = 0.

###### Proof.

By the assumptions of the lemma, we have that

 M|_{R(J)} ⋅ diag(π) ⋅ (∑_{i=1}^r α_i M_{T_i})^⊤ = 0.

Now the fact that M|_{R(J)} has full column rank and that the mixing weights are nonzero implies that M|_{R(J)} ⋅ diag(π) has trivial kernel. Hence we conclude that ∑_{i=1}^r α_i M_{T_i} = 0, as desired. ∎

Of course, we don't actually have exact estimates of the moments of D, so in Appendix B we prove the sampling-noise-robust analogues of the two lemmas of this subsection (see Lemma B.1) needed to get an actual learning algorithm.

### 2.4 Technical Overview for Learning Mixtures of Subcubes

With these basic linear algebraic relations in hand, we can explain the intuition behind our algorithms. Our starting point is the observation that if we know a collection of sets T_1, …, T_k indexing a row basis of M, then we can guess one of the finitely many possibilities for the entries of the rows M_{T_1}, …, M_{T_k} (for a mixture of subcubes, the entries of the marginals matrix lie in {0, 1/2, 1}, so there are not too many candidates when the sets T_j are small). Using a correct guess, we can solve for the mixing weights using (1) from Observation 2.1. The point is that because T_1, …, T_k index a row basis of M, the system of equations

 M_{T_j} ⋅ π = \ED[x_{T_j}],  j = 1, …, k (1)

has a unique solution, which thus must be the true mixing weights in the realization.

has a unique solution which thus must be the true mixing weights in the realization . We can then solve for the remaining rows of using part 2 of Observation 2.1, i.e. for every we can solve

 MTj⋅diag(π)⋅m⊤i=\ED[xTj∪{i}]∀j=1,...,k. (2)

Again, because the rows are linearly independent and has no zero entries, we conclude that the true value of is the unique solution.
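The two linear-equation steps (1) and (2) can be sketched as follows, under the simplifying assumptions (ours) that the basis rows of the moment matrix have been guessed correctly and that exact moments, computed here from the true parameters, stand in for empirical estimates:

```python
import numpy as np

true_pis = np.array([0.25, 0.75])
m = np.array([[1.0, 0.0],            # marginals matrix, n = 3, k = 2
              [0.5, 1.0],
              [0.0, 0.5]])

def moment(S):                       # exact E_D[x_S]
    return float(np.prod(m[list(S)], axis=0) @ true_pis)

basis = [(), (0,)]                   # sets indexing a row basis of M
A = np.array([np.prod(m[list(T)], axis=0) for T in basis])

# step (1): solve M_{T_j} . pi = E_D[x_{T_j}] for the mixing weights
pis = np.linalg.solve(A, np.array([moment(T) for T in basis]))

# step (2): for each remaining coordinate i, solve
# M_{T_j} . diag(pi) . m_i^T = E_D[x_{T_j u {i}}]
recovered = {}
for i in (1, 2):
    b = np.array([moment(T + (i,)) for T in basis])
    recovered[i] = np.linalg.solve(A @ np.diag(pis), b)
```

Because the basis rows are linearly independent and the mixing weights are nonzero, both systems are nonsingular and the recovered weights and rows are exactly the true ones.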

There are three main challenges to implementing this strategy:

1. Identifiability. How do we know whether a given guess for M|_{T_1,…,T_k} is correct? More generally, how do we efficiently test whether a given distribution is close to the underlying mixture of subcubes?

2. Building a Basis. How do we produce a row basis for M without knowing m, let alone one for which max_j |T_j| is small enough that we can actually try all possibilities for M|_{T_1,…,T_k}?

3. Sampling Noise. Technically we only have approximate access to the moments of D, so even from a correct guess for M|_{T_1,…,T_k} we only obtain approximations to π and the remaining rows of M. How does sampling noise affect the quality of these approximations?

#### 2.4.1 Identifiability

As our algorithms will be based on the method of moments, an essential first question to answer is that of identifiability: what is the minimum d for which mixtures of k subcubes are uniquely identified by their moments of degree at most d? As alluded to in Section 1.2, it is enough to answer Question 1.2, which we can restate in our current notation as:

Given a matrix m ∈ {0, 1/2, 1}^{n×k} with associated moment matrix M, what is the minimum d for which the rows {M_S}_{|S| ≤ d} span all rows of M?

Let d(k) be the largest such minimum d for Question 2.4.1 among all m ∈ {0, 1/2, 1}^{n×k}. Note that d(k) = Ω(log k) just from considering a sparse parity with noise instance on Θ(log k) variables as a mixture of k subcubes. The reason getting upper bounds on d(k) is directly related to identifiability is that mixtures of k subcubes are uniquely identified by their moments of degree at most d(2k). Indeed, if (π^1, M_1) and (π^2, M_2) realize different distributions D_1 and D_2, then there must exist S for which

 (M_1)_S ⋅ π^1 = \E_{D_1}[x_S] ≠ \E_{D_2}[x_S] = (M_2)_S ⋅ π^2.

In other words, the vector (π^1, −π^2) does not lie in the right kernel of the matrix [M_1 | M_2] obtained by concatenating the columns of M_1 and M_2. But because [M_1 | M_2] is the moment matrix of the n × 2k matrix obtained by concatenating the two marginals matrices, its rows are spanned by the rows indexed by sets of size at most d(2k), so there in fact exists S of size at most d(2k) for which \E_{D_1}[x_S] ≠ \E_{D_2}[x_S]. Finally, note also that the reverse direction of this argument holds; that is, if mixtures of subcubes D_1 and D_2 agree on all moments of degree at most d(2k), then they are identical as distributions.

In Section 3.1, we show that d(k) ≤ log k. The idea is that there is a natural correspondence between 1) linear relations among the rows {M_S}_{|S| ≤ d} of M and 2) multilinear polynomials of degree at most d which vanish on the centers, i.e. the columns of m. The bound on d(k) then follows from cleverly constructing an appropriate low-degree multilinear polynomial.

Note that the above discussion only pertains to exact identifiability. For the purposes of our learning algorithm, we want robust identifiability, i.e. there is some d such that two mixtures of subcubes are far in statistical distance if and only if they differ noticeably on some moment of degree at most d. It turns out that it suffices to take d to be the same O(log k), and in Section 2.4.4 below, we sketch how we achieve this.

Once we have robust identifiability in hand, we have a way to resolve Challenge A above: to check whether a given guess for M|_{T_1,…,T_k} is correct, compute the moments of degree at most O(log k) of the corresponding candidate mixture of subcubes and compare them to empirical estimates of the moments of the underlying mixture. If they are close, then the mixture of subcubes we have learned is close to the true distribution.

As we will see below though, while the bound d(k) ≤ log k is a necessary first step to achieving a quasipolynomial running time for our learning algorithm, there will be many more steps and subtleties along the way to getting an actual algorithm.

#### 2.4.2 Building a Basis

We now describe how we address Challenge B. The key issue is that we do not have access to the entries of M (and M itself depends on the choice of a particular realization). Given the preceding discussion about Question 2.4.1, a naive way to circumvent this is simply to guess a basis from among all combinations of at most k rows M_S with |S| ≤ log k, but this would take time roughly n^{k log k}.

As we hinted at in Section 1.2, we will overcome the issue of not having access to M by using the accessible entries of C, which we can easily estimate by drawing samples from D, as a surrogate for M (see the two lemmas of Section 2.3). To this end, one might first try to use C to find a row basis for M by looking at the submatrix of C consisting of entries C^T_S with |S|, |T| ≤ log k and simply picking out a column basis for this submatrix. Of course, the crucial issue is that we can only use the accessible entries of C.

Instead, we will incrementally build up a row basis. Suppose at some point we have found a list of subsets indexing linearly independent rows of the moment matrix for some realization and are deciding whether to add some set to this list. By Lemmas 2.3 and 2.3, provided the relevant rank condition holds (see Section 2.4.3), the new row is linearly independent from the rows we have already chosen if and only if the corresponding column vector of accessible moments is linearly independent from the column vectors we have already chosen. (While the dimension of these column vectors is exponential in n, the discussion in Section 2.4.1 implies that it suffices to look only at the coordinates indexed by sets of size less than 2 log k.)

If we make the strong assumption that the relevant rank condition always holds in the course of running this procedure, the problem of finding a row basis reduces to the following basic question:

Given subsets indexing linearly independent rows of a moment matrix, as well as access to an oracle which on input a subset decides whether its row lies in the span of the chosen rows, how many oracle calls does it take to either find a subset whose row lies outside the span or successfully conclude that the chosen rows are a row basis?

Section 2.4.1 tells us it suffices to look at all remaining subsets of size less than 2 log k that have not yet been considered, which requires checking at most n^O(log k) subsets before we decide whether to add a new subset to our basis.

Later, in Section 3.4, we will show that the following alternative approach, which we call GrowByOne, suffices: simply consider all subsets of the form T ∪ {i} for T already in our list and coordinates i. If our list has up to this point been constructed in this incremental fashion, we prove that if no such subset can be added to the list and moreover the relevant rank condition holds for every candidate, then the list indexes a row basis for the moment matrix.

The advantages of GrowByOne are that 1) it only requires checking at most nk subsets before we decide whether to add a new subset to our basis, 2) it works even when the underlying distribution is a mixture of arbitrary product distributions, and 3) it simplifies our analysis of sampling noise.
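A toy version of GrowByOne might look as follows, assuming an oracle `row(T)` that returns the (estimated) moment vector indexed by the subset T; the interface and names are illustrative, and the numerical rank test stands in for the certified-full-rank check.

```python
# Hedged sketch of GrowByOne: grow a list of subsets indexing linearly
# independent moment-matrix rows, only ever trying one-element extensions
# T ∪ {i} of subsets already in the list.
import numpy as np

def grow_by_one(row, n, k):
    basis = [frozenset()]          # the empty set indexes the all-ones row
    vecs = [row(frozenset())]
    changed = True
    while changed:
        changed = False
        for T in list(basis):
            for i in range(n):
                cand = T | frozenset([i])
                if i in T or cand in basis or len(basis) >= k:
                    continue
                # add cand only if its row strictly increases the rank
                if np.linalg.matrix_rank(np.vstack(vecs + [row(cand)])) > len(vecs):
                    basis.append(cand)
                    vecs.append(row(cand))
                    changed = True
    return basis
```

Per decision, this checks at most nk candidate subsets, in contrast to brute-forcing over all n^O(log k) low-degree subsets.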

#### 2.4.3 Making Progress When Basis-Building Fails

The main subtlety is that the correctness of GrowByOne as outlined in Section 2.4.2 hinges on the relevant rank condition holding at every point in the algorithm. If it fails, then even when a candidate's column vector lies in the span of the chosen column vectors, we cannot conclude that its row lies in the span of the chosen rows. In particular, suppose we found that every candidate's column vector lies in the span of the chosen ones and therefore decided to add nothing more to the list. Then while Lemma 2.3 guarantees that the rows corresponding to our list are linearly independent, we can no longer ascertain that they span all the rows of the moment matrix.

The key idea is that if this is the case, then there must have been some candidate whose row falls outside the span of the rows we have chosen. We call the set of all such candidates the impostors. By Lemma 2.2, if a coordinate set is an impostor, then for any bitstring the distribution obtained by conditioning on those coordinates can be realized as a mixture of strictly fewer than k subcubes. The upshot is that even if the list output by GrowByOne does not correspond to a row basis, we can make progress by conditioning on the coordinates of an impostor and recursively learning mixtures of fewer subcubes.

On the other hand, the issue of actually identifying an impostor is quite delicate. Because there may be up to k levels of recursion, we cannot afford to simply brute-force over all possible coordinates. Instead, the idea is to pretend that the list output by GrowByOne actually corresponds to a row basis and use it to attempt to learn the parameters of the mixture. It turns out that either the resulting mixture is close to the true mixture on all low-degree moments, in which case robust identifiability implies we have successfully learned it, or it disagrees on some low-degree moment, and we show in Section 3.3 that this low-degree moment must contain an impostor.

#### 2.4.4 Sampling Noise

Obviously we only have access to empirical estimates of the entries of the moment matrix, so for instance, instead of checking whether a column lies in the span of other columns, we solve the corresponding regression problem. In this setting, the above arguments still carry over provided that the submatrices used are well-conditioned. We show in Section 3.5 that the former are well-conditioned via Cramer's rule, as they are matrices whose entries are low-degree powers of 1/2, and this on its own can already be used to show robust identifiability. By Observation 2.1, the other submatrices used in the above arguments are also well-conditioned provided that the mixing weights have no small entries. But if some mixing weights are small, intuitively we might as well ignore them and only attempt to learn the subcubes of the mixture which have non-negligible mixing weight.

In Section 3.5, we explain in greater detail the subtleties that go into dealing with these issues of sampling noise.

### 2.5 Technical Overview for SQ Lower Bound

To understand the limitations of the method of moments for more general mixtures of product distributions, we can first ask Question 2.4.1 more generally for arbitrary matrices, but in this case it is not hard to see that the minimum degree at which the rows span everything can be linear in k. Simply take the matrix of centers to have identical rows, each consisting of k distinct entries. Then, by usual properties of Vandermonde matrices, the low-degree rows will not span all the rows until the degree reaches k. (Note that by the connection between linear relations among rows of the moment matrix and multilinear polynomials vanishing on the rows of the matrix of centers, this example is also tight.)
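This Vandermonde behavior is easy to verify numerically. In the hedged sketch below (not from the paper), `rho` plays the role of the repeated row with k distinct entries, and the rank of the stacked entrywise powers keeps growing until all k powers are included.

```python
# Illustration: when every center equals the same vector rho with k distinct
# entries, the degree-d multilinear moments only span the entrywise powers
# rho^0, ..., rho^d, which form a Vandermonde system.
import numpy as np

k = 5
rho = np.linspace(0.1, 0.9, k)                    # k distinct entries in (0, 1)
powers = np.vstack([rho ** j for j in range(k)])  # rows rho^0, ..., rho^(k-1)

# Rank after including powers up to degree d, for d = 0, ..., k - 1.
ranks = [np.linalg.matrix_rank(powers[: d + 1]) for d in range(k)]
print(ranks)  # every new power is independent of the previous ones
```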

From such a matrix, we immediately get a pair of mixtures that agree on all low-degree moments but differ on some higher-degree moment: let the mixing weights of the two mixtures be, up to scaling, the positive and negative parts of an element in the kernel of the low-degree moment rows, and let the centers be the corresponding disjoint sets of columns. But this is not yet sufficient to establish the desired SQ lower bound.

Instead, we will exhibit a large collection of mixtures of product distributions that all agree with the uniform distribution on moments up to some degree but differ on some higher-degree moment. This will be enough to give an SQ lower bound of n^Ω(√k).

The general approach is to construct a mixture of product distributions over a small number of coordinates whose top-degree moment differs noticeably from, but whose other moments agree with, those of the uniform distribution. The collection of mixtures will then consist of all distributions given by placing this construction on some hidden subset of coordinates and the uniform distribution on the remaining coordinates. This general strategy of embedding a low-dimensional moment-matching distribution in some hidden set of coordinates is the same principle behind SQ lower bounds for learning sparse parity [kearns1998efficient], robust estimation and density estimation of mixtures of Gaussians [diakonikolas2016statistical], etc.

The main challenge is to actually construct such a mixture. We reduce this problem to Question 1.4 and give an explicit construction in Section 4.

### 2.6 Technical Overview for Learning Mixtures of Product Distributions

The main difficulty with learning mixtures of general product distributions is that moment matrices can be arbitrarily ill-conditioned, which makes it far more difficult to handle sampling noise. Indeed, with exact access to the accessible entries of the moment matrix, one can show that there is an algorithm for learning mixtures of general product distributions whose running time is governed by the answer to Question 1.4, though we omit the proof in this work. In the presence of sampling noise, it is not immediately clear how to adapt the approach from Section 2.4. The three main challenges are:

1. Robust Identifiability. For mixtures of subcubes, robust identifiability essentially followed from exact identifiability and a condition number bound. Now that the moment matrix can be arbitrarily ill-conditioned, how do we still show that two mixtures of product distributions that are far in statistical distance must differ noticeably on some low-degree moment?

2. Using Accessible Moments as a Proxy. Without a condition number bound, can approximate access to the accessible entries still be useful for deducing (approximate) linear algebraic relations among the rows of the moment matrix?

3. Guessing Entries of the Centers. The entries of the centers are arbitrary scalars now, rather than numbers from {0, 1/2, 1}. We can still try discretizing by guessing integer multiples of some small scalar, but how small must that scalar be for this to work?

For Challenge A, we will show that if two mixtures of product distributions are far in statistical distance, they must differ noticeably on some moment of low degree. Roughly, the proof is by induction on the total number of product distributions in the two mixtures, though the inductive step is rather involved and we defer the details to Section 5.4, which can be read independently of the other parts of the proof of Theorem 1.4.

Next, we make Challenges B and C more manageable by shifting our goal: instead of a row basis for , we would like a row basis for that is well-conditioned in an appropriate sense. Specifically, we want a row basis for such that if we express any other row of as a linear combination of this basis, the corresponding coefficients are small. This is precisely the notion of barycentric spanner introduced in [awerbuch2008online], where it was shown that any collection of vectors has a barycentric spanner. We can find a barycentric spanner for the rows of by simply guessing all possibilities. We then show that if is a barycentric spanner and is well-conditioned in an sense for all , then in analogy with Lemma 2.3, one can learn good approximations to the true coefficients expressing the remaining rows of in terms of . Furthermore, these approximations are good enough that it suffices to pick the discretization parameter in Challenge C to be , in which case the entries of can be guessed in time .
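The barycentric-spanner condition can be tested by linear regression. The sketch below is illustrative (the function name, the constant C, and the tolerance are assumptions, not the paper's notation): it checks that every row of A is a combination of the chosen rows with coefficients of magnitude at most C.

```python
# Hedged sketch: check whether the rows of A indexed by idx form a
# C-approximate barycentric spanner of all the rows of A.
import numpy as np

def is_barycentric_spanner(A, idx, C=1.0, tol=1e-8):
    B = A[idx]                                   # candidate spanning rows
    for r in range(A.shape[0]):
        coeffs, *_ = np.linalg.lstsq(B.T, A[r], rcond=None)
        if np.linalg.norm(B.T @ coeffs - A[r]) > tol:
            return False                         # row is not even in the span
        if np.max(np.abs(coeffs)) > C + tol:
            return False                         # coefficients too large
    return True
```

In the algorithm the spanner is over rows of the unknown moment matrix, so in practice this check would have to run on empirical estimates, which is where the conditioning requirements enter.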

If instead the system is ill-conditioned for some "impostor," we can now afford to simply brute-force search for the impostor, but we cannot appeal to Lemma 2.2 to argue as before that each of the conditional distributions is a mixture of fewer than k product distributions, because the moment matrix might still have rank k. Instead, we show in Section 5.5 that robust identifiability implies that these conditional distributions are close to mixtures of fewer than k product distributions, and this is enough for us to make progress and recursively learn.

## 3 Learning Mixtures of Subcubes in Quasipolynomial Time

### 3.1 Logarithmic Moments Suffice

Recall that a mixture of subcubes can represent the distribution on positive examples from a sparse parity with noise. It is well known that all moments of such a distribution below the sparsity are indistinguishable from those of the uniform distribution. Here we prove a converse and show that for mixtures of subcubes all of the relevant information is contained within the moments of degree O(log k). More precisely we show:

Let the distribution be a mixture of k subcubes and fix a realization where the centers are {0, 1/2, 1}-valued. Let M be the corresponding moment matrix. Then

 \{ M_T \mid |T| < 2\log k \}

span the rows of M.

###### Proof.

Fix any set S of size m = 2 log k. Without loss of generality suppose that S = {1, …, m}. We want to show that M_S lies in the span of the rows M_T with T ⊊ S. Our goal is to show that there are coefficients α_T so that

 \sum_{T \subseteq S} \alpha_T M_T = 0

and that α_S is nonzero. If we can do this, then we will be done. First we construct a multilinear polynomial

 p(x) = \prod_{i=1}^{m} (x_i - \lambda_i)

where each λ_i ∈ {0, 1/2, 1} and with the property that for every j, p(m_j|_S) = 0. If we had such a polynomial, we could expand

 p(x) = \sum_{T \subseteq S} \alpha_T \prod_{i \in T} x_i

By construction α_S = 1. And now for any j we can see that the j-th coordinate of ∑_{T ⊆ S} α_T M_T is exactly p(m_j|_S) = 0, which yields the desired linear dependence.

All that remains is to construct the polynomial p. We will do this by induction. Suppose we have constructed a polynomial p_t with t factors and let

 R_t = \{ j \mid p_t(m_j|_S) \neq 0 \}

In particular, R_t is the set of surviving columns. By the pigeonhole principle we can choose λ_{t+1} so that the new factor zeroes out a constant fraction of the surviving columns. Hence for some ℓ ≤ 2 log k we have R_ℓ = ∅, at which point we can choose

 p(x) = \left( \prod_{i=1}^{\ell} (x_i - \lambda_i) \right) \cdot \prod_{i=\ell+1}^{m} x_i

which completes the proof. ∎
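A toy version of this greedy construction (the function name and input format are assumptions, not the paper's) picks, at each coordinate, the value λ shared by the most surviving centers, so that by pigeonhole each factor removes a constant fraction of them:

```python
# Illustrative sketch: build factors (x_i - lambda_i) that jointly vanish on
# every center restricted to S. centers_S is a k x m array with entries in
# {0, 0.5, 1}; each chosen lambda kills every surviving center matching it.
import numpy as np
from collections import Counter

def build_vanishing_factors(centers_S):
    surviving = list(range(len(centers_S)))
    factors = []
    for col in range(len(centers_S[0])):
        if not surviving:
            break
        vals = Counter(centers_S[j][col] for j in surviving)
        lam, _ = vals.most_common(1)[0]   # pigeonhole: most frequent value
        factors.append((col, lam))
        surviving = [j for j in surviving if centers_S[j][col] != lam]
    assert not surviving, "need |S| large enough relative to log k"
    return factors
```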

Recall that . Now Lemma 3.1 implies that

 \rank(M|_R(J)) = \rank(M|_{R'}(J))

where R' is the set of all subsets T with |T| < 2 log k. Thus we can certify whether a candidate collection of rows is a basis by, instead of computing entire rows, working with the much smaller restrictions to the coordinates indexed by R'.

We remark that if the distribution were not a mixture of subcubes, but a general mixture of product distributions, then we would need to look at rows M_T with larger sets T in order to span the rows of M. This is necessary because we could set ρ to be a vector with k distinct entries in the range (0, 1) and set each row of the matrix of centers to be ρ. In this example, the entrywise product of ρ with itself k − 1 times is linearly independent of the vectors we get from taking the entrywise product between zero and k − 2 times. On the other hand, this is tight:

Let the distribution be a mixture of k product distributions and fix a realization. Let M be the corresponding moment matrix. Then

 \{ M_T \mid |T| < 2k \}

span the rows of M.

###### Proof.

The proof is almost identical to the proof of Lemma 3.1. The only difference is that we allow the λ_i to be arbitrary and, instead of reducing the size of R_t geometrically each time, we may only reduce it by one. ∎

### 3.2 Local Maximality

In the following three subsections, we explain in greater detail how to produce a row basis for the moment matrix, as outlined in Sections 2.4.2 and 2.4.3. Recall that Lemma 2.3 and Lemma 2.3 give us a way to certify that the sets we are adding to our list correspond to rows that are linearly independent of the ones we have selected so far. Motivated by these lemmas, we introduce the following key definitions:

Given a collection of subsets we say that is certified full rank if has full column rank, where .

Note that here the restriction to low-degree coordinates is chosen with Lemma 3.1 in mind.

Let be certified full column rank. Let . Suppose there is no

1. or

2. for

for which has full column rank, where . Then we say that is locally maximal.

We are working towards showing that any certified full rank and locally maximal collection spans a particular subset of the rows of the moment matrix. First we will show the following helper lemma:

Let and as usual. Suppose that

1. the rows of are a basis for the rows of and

2. for any and any , the row is in the row span of

Then the rows of are a basis for the rows of .

###### Proof.

We will proceed by induction. Suppose that the rows of are a basis for the rows of for some . Consider any