
# The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

In this paper we show that very large mixtures of Gaussians are efficiently learnable in high dimension. More precisely, we prove that a mixture with known identical covariance matrices whose number of components is a polynomial of any fixed degree in the dimension n is polynomially learnable as long as a certain non-degeneracy condition on the means is satisfied. It turns out that this condition is generic in the sense of smoothed complexity, as soon as the dimensionality of the space is high enough. Moreover, we prove that no such condition can possibly exist in low dimension and the problem of learning the parameters is generically hard. In contrast, much of the existing work on Gaussian Mixtures relies on low-dimensional projections and thus hits an artificial barrier. Our main result on mixture recovery relies on a new "Poissonization"-based technique, which transforms a mixture of Gaussians to a linear map of a product distribution. The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture. In addition, we combine our low-dimensional hardness results for Gaussian mixtures with Poissonization to show how to embed difficult instances of low-dimensional Gaussian mixtures into the ICA setting, thus establishing exponential information-theoretic lower bounds for underdetermined ICA in low dimension. To the best of our knowledge, this is the first such result in the literature. In addition to contributing to the problem of Gaussian mixture learning, we believe that this work is among the first steps toward better understanding the rare phenomenon of the "blessing of dimensionality" in the computational aspects of statistical inference.


## 1 Introduction

The question of recovering a probability distribution from a finite set of samples is one of the most fundamental questions of statistical inference. While classically such problems have been considered in low dimension, more recently inference in high dimension has drawn significant attention in statistics and computer science literature.

In particular, an active line of investigation in theoretical computer science has dealt with the question of learning a Gaussian Mixture Model in high dimension. This line of work was started in [14], where the first algorithm to recover parameters using a number of samples polynomial in the dimension was presented. The method relied on random projections to a low-dimensional space and required certain separation conditions for the means of the Gaussians. Significant work was done in order to weaken the separation conditions and to generalize the result (see e.g., [15, 5, 28, 1, 16]). Much of this work has polynomial sample and time complexity but requires strong separation conditions on the Gaussian components. The effort to weaken the separation conditions culminated in [7] and [20], where it was shown that arbitrarily small separation was sufficient for learning a general mixture with a fixed number of components in polynomial time. Moreover, a one-dimensional example given in [20] showed that an exponential dependence on the number of components was unavoidable unless strong separation requirements were imposed. Thus the question of polynomial learnability appeared to be settled. It is worth noting that, while quite different in many aspects, all of these papers used a general scheme similar to that in the original work [14], reducing high-dimensional inference to a small number of low-dimensional problems through appropriate projections.

However, a surprising result was recently proved in [18]. The authors showed that a mixture of $n$ Gaussians in dimension $n$ could be learned using a polynomial number of samples, assuming a non-degeneracy condition on the configuration of the means. The result in [18] is inherently high-dimensional, as that condition is never satisfied when the means belong to a lower-dimensional space. Thus the problem of learning a mixture gets progressively computationally easier as the dimension increases, a “blessing of dimensionality”! It is important to note that this was quite different from much of the previous work, which had primarily used projections to lower-dimensional spaces.

Still, there remained a large gap between the worst case impossibility of efficiently learning more than a fixed number of Gaussians in low dimension and the situation when the number of components is equal to the dimension. Moreover, it was not completely clear whether the underlying problem was genuinely easier in high dimension or our algorithms in low dimension were suboptimal. The one-dimensional example in [20] cannot answer this question as it is a specific worst-case scenario, which can be potentially ruled out by some genericity condition.

In our paper we take a step to eliminate this gap by showing that even very large mixtures of Gaussians can be polynomially learned. More precisely, we show that a mixture of $m$ Gaussians with equal known covariance can be polynomially learned as long as $m$ is bounded from above by a polynomial of the dimension $n$ and a certain more complex non-degeneracy condition for the means is satisfied. We show that if $n$ is high enough, these non-degeneracy conditions are generic in the smoothed complexity sense. Thus, for any fixed degree, generic mixtures whose number of components is a polynomial of that degree in the dimension can be polynomially learned.

Further, we prove that no such condition can exist in low dimension. A measure of non-degeneracy must be monotone in the sense that adding Gaussian components must make the condition number worse. However, we show that for points uniformly sampled from there are (with high probability) two mixtures of unit Gaussians with means on non-intersecting subsets of these points, whose distance is and which are thus not polynomially identifiable. More generally, in dimension the distance becomes . That is, the conditioning improves as the dimension increases, which is consistent with our algorithmic results.

To summarize, our contributions are as follows:

1. We show that for any , a mixture of Gaussians in dimension can be learned in time and number of samples polynomial in and a certain “condition number” . We show that if the dimension is sufficiently high, this results in an algorithm polynomial from the smoothed analysis point of view (Theorem 1). To do that we provide smoothed analysis of the condition number using certain results from [27] and anti-concentration inequalities. The main technical ingredient of the algorithm is a new “Poissonization” technique to reduce Gaussian mixture estimation to a problem of recovering a linear map of a product distribution known as underdetermined Independent Component Analysis (ICA). We combine this with the recent work on efficient algorithms for underdetermined ICA from [17] to obtain the necessary bounds.

2. We show that in low dimension polynomial identifiability fails in a certain generic sense (see Theorem 3). Thus the efficiency of our main algorithm is truly a consequence of the “blessing of dimensionality” and no comparable algorithm exists in low dimension. The analysis is based on results from approximation theory and Reproducing Kernel Hilbert Spaces.

Moreover, we combine the approximation theory results with the Poissonization-based technique to show how to embed difficult instances of low-dimensional Gaussian mixtures into the ICA setting, thus establishing exponential information-theoretic lower bounds for underdetermined Independent Component Analysis in low dimension. To the best of our knowledge, this is the first such result in the literature.

We discuss our main contributions more formally now. The notion of Khatri–Rao power of a matrix is defined in Section 2.

###### Theorem 1 (Learning a GMM with Known Identical Covariance).

Suppose and let . Let be an -dimensional GMM, i.e. , , and . Let be the matrix whose column is . If there exists so that , then Algorithm 2 recovers each to within accuracy with probability . Its sample and time complexity are at most

$$\mathrm{poly}\bigl(m^{d^2},\ \sigma^{d^2},\ u^{d^2},\ w^{d^2},\ d^{d^2},\ r^{d^2},\ 1/\epsilon,\ 1/\delta,\ 1/b,\ \log^{d^2}(1/(b\epsilon\delta))\bigr)$$

where , , , are bounds provided to the algorithm, and .

Given that the means have been estimated, the weights can be recovered using the tensor structure of higher order cumulants (see Section 2 for the definition of cumulants). This is shown in Appendix I.

We show that is large in the smoothed analysis sense, namely, if we start with a base matrix and perturb each entry randomly to get , then is likely to be large. More precisely,

###### Theorem 2.

For , let be an arbitrary matrix. Let be a randomly sampled matrix with each entry iid from , for . Then, for some absolute constant ,

$$\Pr\bigl(\sigma_{\min}((M+N)^{\odot 2}) \le \sigma^2/n^7\bigr) \le 2C/n.$$

We point out the simultaneous and independent work of [8], where the authors prove learnability results related to our Theorems 1 and 2. We now provide a comparison. The results in [8], which are based on tensor decompositions, are stronger in that they can learn mixtures of axis-aligned Gaussians (with non-identical covariance matrices) without requiring to know the covariance matrices in advance. Their results hold under a smoothed analysis setting similar to ours. To learn a mixture of roughly Gaussians up to an accuracy of their algorithm has running time and sample complexity and succeeds with probability at least , where the means are perturbed by adding an -dimensional Gaussian from . On the one hand, the success probability of their algorithm is much better (as a function of , exponentially close to as opposed to polynomially close to , as in our result). On the other hand, this comes at a high price in terms of the running time and sample complexity: The polynomial above has degree exponential in , unlike the degree of our bound which is polynomial in . Thus, in this respect, the two results can be regarded as incomparable points on an error vs running time (and sample complexity) trade-off curve. Our result is based on a reduction from learning GMMs to ICA which could be of independent interest given that both problems are extensively studied in somewhat disjoint communities. The technique of Poissonization is, to the best of our knowledge, new in the GMM setting. Moreover, our analysis can be used in the reverse direction to obtain hardness results for ICA.

Finally, in Section 6 we show that in low dimension the situation is very different from the high-dimensional generic efficiency given by Theorems 1 and 2: The problem is generically hard. More precisely, we show:

###### Theorem 3.

Let be a set of points uniformly sampled from . Then with high probability there exist two mixtures with equal number of unit Gaussians , centered on disjoint subsets of , such that, for some ,

 ∥p−q∥L1(Rn)

Combining the above lower bound with our reduction provides a similar lower bound for ICA; see a discussion on the connection with ICA below. Our lower bound gives an information-theoretic barrier. This is in contrast to conjectured computational barriers that arise in related settings based on the noisy parity problem (see [18] for pointers). The only previous information-theoretic lower bound for learning GMMs we are aware of is due to [20] and holds for two specially designed one-dimensional mixtures.

#### Connection with ICA.

A key observation of [18] is that methods based on the higher order statistics used in Independent Component Analysis (ICA) can be adapted to the setting of learning a Gaussian Mixture Model. In ICA, each sample is a linear combination of the columns of a mixing matrix, where the coefficients are latent random variables that are mutually independent and the column vectors give the directions in which each signal acts. The goal is to recover the column vectors up to inherent ambiguities. The ICA problem is typically posed when the number of latent signals is at most the dimensionality of the observed space (the “fully determined” setting), as recovery of the directions then allows one to demix the latent signals. The case where the number of latent source signals exceeds the dimensionality of the observed signal is the underdetermined ICA setting. (See [12, Chapter 9] for a recent account of algorithms for underdetermined ICA.) Two well-known algorithms for underdetermined ICA are given in [10] and [2]. Finally, [17] provides an algorithm with rigorous polynomial time and sampling bounds for underdetermined ICA in high dimension in the presence of Gaussian noise.

Nevertheless, our analysis of the mixture models can be embedded in ICA to show exponential information-theoretic hardness of performing ICA in low dimension, thus establishing the blessing of dimensionality for ICA as well.

###### Theorem 4.

Let be a set of random -dimensional unit vectors. Then with high probability, there exist two disjoint subsets of , such that when these two sets form the columns of matrices and respectively, there exist noisy ICA models and which are exponentially close as distributions in distance and satisfying: (1) The coordinate random variables of and are scaled Poisson random variables. For at least one coordinate random variable, , where is such that and are polynomially bounded away from 0. (2) The Gaussian noises and have polynomially bounded directional covariances.

We sketch the proof of Theorem 4 in Appendix G.

#### Discussion.

Most problems become harder in high dimension, often exponentially harder, a behavior known as “the curse of dimensionality.” Showing that a complex problem does not become exponentially harder often constitutes major progress in its understanding. In this work we demonstrate a reversal of this curse, showing that the lower dimensional instances are exponentially harder than those in high dimension. This seems to be a rare situation in statistical inference and computation. In particular, while high-dimensional concentration of mass can sometimes be a blessing of dimensionality, in our case the generic computational efficiency of our problem comes from anti-concentration.

We hope that this work will enable better understanding of this unusual phenomenon and its applicability to a wider class of computational and statistical problems.

## 2 Preliminaries

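The singular values of a matrix $A \in \mathbb{R}^{m \times n}$ will be ordered in decreasing order: $\sigma_1(A) \ge \sigma_2(A) \ge \dots \ge \sigma_{\min(m,n)}(A)$. By $\sigma_{\min}(A)$ we mean $\sigma_{\min(m,n)}(A)$.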

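For a real-valued random variable $X$, the cumulants of $X$ are polynomials in the moments of $X$. For $j \ge 1$, the $j$th cumulant is denoted $\mathrm{cum}_j(X)$. Denoting $m_j := \mathbb{E}[X^j]$, we have, for example: $\mathrm{cum}_1(X) = m_1$, $\mathrm{cum}_2(X) = m_2 - m_1^2$, and $\mathrm{cum}_3(X) = m_3 - 3 m_2 m_1 + 2 m_1^3$. In general, cumulants can be defined as certain coefficients of a power series expansion of the logarithm of the characteristic function of $X$: $\log \mathbb{E}[e^{itX}] = \sum_{j \ge 1} \mathrm{cum}_j(X) \frac{(it)^j}{j!}$. The first two cumulants are the same as the expectation and the variance, respectively. Cumulants have the property that for two independent random variables $X, Y$ we have $\mathrm{cum}_j(X + Y) = \mathrm{cum}_j(X) + \mathrm{cum}_j(Y)$ (assuming that the first $j$ moments exist for both $X$ and $Y$). Cumulants are degree-$j$ homogeneous, i.e. if $\alpha \in \mathbb{R}$ and $X$ is a random variable, then $\mathrm{cum}_j(\alpha X) = \alpha^j \mathrm{cum}_j(X)$. The first two cumulants of the standard Gaussian distribution are the mean, $0$, and the variance, $1$, and all subsequent Gaussian cumulants have value $0$.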
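The cumulant properties above can be checked on a small example. The sketch below converts the first four raw moments into cumulants using the standard moment-to-cumulant formulas; the function name `cumulants_from_moments` and the choice of a Poisson example are illustrative assumptions, not notation from the paper.

```python
def cumulants_from_moments(m1, m2, m3, m4):
    """Return the first four cumulants given raw moments m_j = E[X^j]."""
    k1 = m1
    k2 = m2 - m1**2
    k3 = m3 - 3 * m2 * m1 + 2 * m1**3
    k4 = m4 - 4 * m3 * m1 - 3 * m2**2 + 12 * m2 * m1**2 - 6 * m1**4
    return (k1, k2, k3, k4)

# Standard Gaussian: raw moments are 0, 1, 0, 3.
gaussian = cumulants_from_moments(0.0, 1.0, 0.0, 3.0)

# Poisson with rate lam: closed-form raw moments (illustrative choice of lam).
lam = 2.0
poisson = cumulants_from_moments(
    lam,
    lam + lam**2,
    lam + 3 * lam**2 + lam**3,
    lam + 7 * lam**2 + 6 * lam**3 + lam**4,
)
```

The Gaussian cumulants beyond the second vanish, while the Poisson cumulants do not; this kind of non-Gaussianity is what cumulant-based ICA methods rely on.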

#### Gaussian Mixture Model.

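For $i = 1, \dots, m$, define Gaussian random vectors $\eta_i$ with distribution $\mathcal{N}(\mu_i, \Sigma_i)$ where $\mu_i \in \mathbb{R}^n$ and $\Sigma_i \in \mathbb{R}^{n \times n}$. Let $h$ be an integer-valued random variable which takes on value $i \in \{1, \dots, m\}$ with probability $w_i > 0$; the $w_i$ are henceforth called weights. (Hence $\sum_{i=1}^m w_i = 1$.) Then, the random vector drawn as $Z = \eta_h$ is said to be a Gaussian Mixture Model (GMM) $w_1 \mathcal{N}(\mu_1, \Sigma_1) + \dots + w_m \mathcal{N}(\mu_m, \Sigma_m)$. The sampling of $Z$ can be interpreted as first picking one of the components $i$ according to the weights, and then sampling a Gaussian vector from component $i$. We will be primarily interested in the mixture of identical Gaussians of known covariance. In particular, there exists known $\Sigma$ such that $\Sigma_i = \Sigma$ for each $i$. Letting $\eta \sim \mathcal{N}(0, \Sigma)$, and denoting by $e_h$ the random variable which takes on the $i$th canonical vector $e_i$ with probability $w_i$, we can write the GMM model as follows: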

$$Z = [\mu_1 \mid \mu_2 \mid \cdots \mid \mu_m]\, e_h + \eta. \tag{1}$$

In this formulation, $e_h$ acts as a selector of a Gaussian mean. Conditioning on $h = i$, we have $Z \sim \mathcal{N}(\mu_i, \Sigma)$, which is consistent with the GMM model.

Given samples from the GMM, the goal is to recover the unknown parameters of the GMM, namely the means $\mu_i$ and the weights $w_i$.
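To make the selector formulation in equation (1) concrete, here is a minimal sampling sketch. It assumes NumPy; the function name `sample_gmm`, the example means, and the weights are hypothetical choices made only for illustration.

```python
import numpy as np

def sample_gmm(means, weights, cov, size, rng):
    """Draw samples Z = [mu_1|...|mu_m] e_h + eta, following equation (1).

    means: (n, m) array whose columns are the component means.
    weights: length-m mixing probabilities; cov: shared (n, n) covariance.
    """
    n, m = means.shape
    h = rng.choice(m, size=size, p=weights)   # component index for each sample
    e_h = np.eye(m)[h]                        # one-hot selector rows, shape (size, m)
    eta = rng.multivariate_normal(np.zeros(n), cov, size=size)
    return e_h @ means.T + eta                # shape (size, n)

rng = np.random.default_rng(0)
means = np.array([[0.0, 4.0, 0.0],
                  [0.0, 0.0, 4.0]])
weights = np.array([0.5, 0.3, 0.2])
samples = sample_gmm(means, weights, np.eye(2), size=20000, rng=rng)
```

The empirical mean of `samples` should be close to the weighted average of the component means, `means @ weights`.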

#### Underdetermined ICA.

In the basic formulation of ICA, the observed random variable $X \in \mathbb{R}^n$ is drawn according to the model $X = AS$, where $S \in \mathbb{R}^m$ is a latent random vector whose components $S_i$ are independent random variables, and $A \in \mathbb{R}^{n \times m}$ is an unknown mixing matrix. The probability distributions of the $S_i$ are unknown except that they are not Gaussian. The ICA problem is to recover $A$ to the extent possible. The underdetermined ICA problem corresponds to the case $m > n$. We cannot hope to recover $A$ fully because if we flip the sign of the $i$th column of $A$, or scale this column by some nonzero factor, then the resulting mixing matrix with an appropriately scaled $S_i$ will again generate the same distribution on $X$ as before. There is an additional ambiguity that arises from not having an ordering on the coordinates $S_i$: If $P$ is a permutation matrix, then $PS$ gives a new random vector with independent reordered coordinates, $AP^T$ gives a new mixing matrix with reordered columns, and $AP^T PS$ provides the same samples as $AS$ since $P^T$ is the inverse of $P$. As $AP^T$ is a permutation of the columns of $A$, this ambiguity implies that we cannot recover the order of the columns of $A$. However, it turns out that under certain genericity requirements, we can recover $A$ up to these necessary ambiguities, that is to say we can recover the directions (up to sign) of the columns of $A$, even in the underdetermined setting.
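The scaling and permutation ambiguities can be verified numerically. The following sketch, assuming NumPy, builds an alternative mixing matrix and signal vector that produce exactly the same observations; the specific permutation, scalings, and Poisson signals are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2, 4                                        # underdetermined: more signals than dimensions
A = rng.normal(size=(n, m))                        # mixing matrix
S = rng.poisson(3.0, size=(m, 5)).astype(float)    # 5 samples of a non-Gaussian latent vector

P = np.eye(m)[[2, 0, 3, 1]]                        # a permutation matrix
D = np.diag([2.0, -1.0, 0.5, -3.0])                # nonzero column scalings, including sign flips

A_alt = A @ P.T @ D                                # reordered and rescaled columns
S_alt = np.linalg.inv(D) @ P @ S                   # compensating change to the signals

same_samples = np.allclose(A @ S, A_alt @ S_alt)
```

Since the observations coincide while the mixing matrices differ, only the column directions (up to sign and order) can be identified from samples.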

In this paper, it will be important for us to work with an ICA model where there is Gaussian noise in the data: $X = AS + \eta$, where $\eta \sim \mathcal{N}(0, \Sigma)$ is an additive Gaussian noise independent of $S$, and the covariance of $\eta$ given by $\Sigma \in \mathbb{R}^{n \times n}$ is in general unknown and not necessarily spherical. We will refer to this model as the noisy ICA model.

We define the flattening operation from a tensor to a vector in the natural way. Namely, when and is a tensor, then where is a bijection with indices running from to . Roughly speaking, each index is being converted into a digit in a base number up to the final offset by 1. This is the same flattening that occurs to go from a tensor outer product of vectors to the Kronecker product of vectors.

The ICA algorithm from [17] to which we will be reducing learning a GMM relies on the shared tensor structure of the derivatives of the second characteristic function and the higher order multi-variate cumulants. This tensor structure motivates the following form of the Khatri-Rao product:

###### Definition 1.

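Given matrices $A \in \mathbb{R}^{a \times c}$ and $B \in \mathbb{R}^{b \times c}$, a column-wise Khatri-Rao product is defined by $A \odot B := [\mathrm{vec}(A_1 \otimes B_1) \mid \cdots \mid \mathrm{vec}(A_c \otimes B_c)]$, where $A_i$ is the $i$th column of $A$, $B_i$ is the $i$th column of $B$, $\otimes$ denotes the Kronecker product and $\mathrm{vec}(A_i \otimes B_i)$ is flattening of the tensor $A_i \otimes B_i$ into a vector. The related Khatri-Rao power is defined by $A^{\odot \ell} := A \odot \cdots \odot A$ ($\ell$ times).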

This form of the Khatri-Rao product arises when performing a change of coordinates under the ICA model using either higher order cumulants or higher order derivative tensors of the second characteristic function.
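A minimal sketch of the column-wise Khatri-Rao product and power, assuming NumPy, may help fix ideas. The helper names `khatri_rao` and `khatri_rao_power` and the example matrix are hypothetical; the example illustrates how three columns in dimension two, which cannot be linearly independent, become independent after taking the second power.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product: column i is the Kronecker product of A[:, i] and B[:, i]."""
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")
    cols = [np.kron(A[:, i], B[:, i]) for i in range(A.shape[1])]
    return np.stack(cols, axis=1)

def khatri_rao_power(A, ell):
    """Khatri-Rao product of A with itself, ell times in total."""
    out = A
    for _ in range(ell - 1):
        out = khatri_rao(out, A)
    return out

# Three directions in the plane: rank at most 2 before lifting.
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
square = khatri_rao_power(B, 2)    # shape (4, 3)
```

Here `B` has rank 2 but `square` has rank 3, which is the kind of non-degeneracy the conditioning parameter in Theorem 1 quantifies.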

#### ICA Results.

Theorem 22 (Appendix H.1, from [17]) allows us to recover up to the necessary ambiguities in the noisy ICA setting. The theorem establishes guarantees for an algorithm from [17] for noisy underdetermined ICA, UnderdeterminedICA. This algorithm takes as input a tensor order parameter , number of signals , access to samples according to the noisy underdetermined ICA model with unknown noise, accuracy parameter , confidence parameter , bounds on moments and cumulants and , a bound on the conditioning parameter , and a bound on the cumulant order . It returns approximations to the columns of up to sign and permutation.

## 3 Learning GMM means using underdetermined ICA: The basic idea

In this section we give an informal outline of the proof of our main result, namely learning the means of the components in GMMs via reduction to the underdetermined ICA problem. Our reduction will be discussed in two parts. The first part gives the main idea of the reduction and will demonstrate how to recover the means up to their norms and signs, i.e. we will get $\pm \mu_i / \|\mu_i\|$. We will then present the reduction in full. It combines the basic reduction with some preprocessing of the data to recover the $\mu_i$’s themselves. The reduction relies on some well-known properties of the Poisson distribution stated in the lemma below; its proof can be found in Appendix B.

###### Lemma 5.

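Fix a positive integer $k$, and let $p_i \ge 0$ be such that $p_1 + \cdots + p_k = 1$. If $X \sim \mathrm{Poisson}(\lambda)$ and $(Y_1, \dots, Y_k) \mid X \sim \mathrm{Multinomial}(X; p_1, \dots, p_k)$, then $Y_i \sim \mathrm{Poisson}(p_i \lambda)$ for all $i$ and $Y_1, \dots, Y_k$ are mutually independent.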
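The Poisson splitting property in the lemma can be checked empirically. The sketch below, assuming NumPy, draws a Poisson total, splits it multinomially, and inspects the resulting counts; the rate, probabilities, and the name `split_poisson` are illustrative assumptions.

```python
import numpy as np

def split_poisson(lam, probs, trials, rng):
    """Draw a Poisson(lam) total, split it multinomially, and repeat `trials` times."""
    totals = rng.poisson(lam, size=trials)
    return np.array([rng.multinomial(int(t), probs) for t in totals])

rng = np.random.default_rng(0)
lam, probs = 5.0, [0.2, 0.3, 0.5]
counts = split_poisson(lam, probs, trials=20000, rng=rng)

empirical_means = counts.mean(axis=0)     # expected to be close to lam * probs
empirical_vars = counts.var(axis=0)       # for a Poisson variable, variance equals mean
correlations = np.corrcoef(counts.T)      # off-diagonal entries expected near zero
```

Matching means and variances together with near-zero correlations are consistent with independent Poisson counts, although a simulation of this kind is only a sanity check and not a proof of independence.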

#### Basic Reduction: The main idea.

Recall the GMM from equation (1) is given by . Henceforth, we will set . We can write the GMM in the form , which is similar in form to the noisy ICA model, except that does not have independent coordinates. We now describe how a single sample of an approximate noisy ICA problem is generated.

The reduction involves two internal parameters and that we will set later. We generate a Poisson random variable , and we run the following experiment times: At the th step, generate sample from the GMM. Output the sum of the outcomes of these experiments: .

Let be the random variable denoting the number of times samples were taken from the th Gaussian component in the above experiment. Thus, . Note that are not observable although we know their sum. By Lemma 5, each has distribution , and the random variables are mutually independent. Let .

For a non-negative integer , we define where the are iid according to . In this definition, can be a random variable, in which case the are sampled independent of . Using to indicate that two random variables have the same distribution, then . If there were no Gaussian noise in the GMM (i.e. if we were sampling from a discrete set of points) then the model becomes simply , which is the ICA model without noise, and so we could recover up to necessary ambiguities. However, the model fails to satisfy even the assumptions of the noisy ICA model, both because is not independent of and because is not distributed as a Gaussian random vector.

As the covariance of the additive Gaussian noise is known, we may add additional noise to the samples of to obtain a good approximation of the noisy ICA model. Parameter , the second parameter of the reduction, is chosen so that with high probability we have . Conditioning on the event we draw according to the rule , where , , and are drawn independently conditioned on . Then, conditioned on , we have .

Note that we have only created an approximation to the ICA model. In particular, restricting can be accomplished using rejection sampling, but the coordinate random variables would no longer be independent. We have two models of interest: (1) , a noisy ICA model with no restriction on , and (2) the restricted model.

We are unable to produce samples from the first model, but it meets the assumptions of the noisy ICA problem. Pretending we have samples from model (1), we can apply Theorem 22 (Appendix H.1) to recover the Gaussian means up to sign and scaling. On the other hand, we can produce samples from model (2), and depending on the choice of , the statistical distance between models (1) and (2) can be made arbitrarily close to zero. It will be demonstrated that given an appropriate choice of , running UnderdeterminedICA on samples from model (2) is equivalent to running UnderdeterminedICA on samples from model (1) with high probability, allowing for recovery of the Gaussian mean directions up to some error.
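The sampling step of the basic reduction can be sketched as follows, assuming NumPy. The names `lam` (Poisson parameter) and `tau` (threshold), the helper `draw_gmm`, and the way extra noise is topped up so that the total noise covariance equals `tau * cov` are assumptions made for illustration; they follow the informal description above rather than the paper's exact subroutine.

```python
import numpy as np

def draw_gmm(means, weights, chol, size, rng):
    """Draw `size` GMM samples with component means in the columns of `means`."""
    n, m = means.shape
    h = rng.choice(m, size=size, p=weights)
    return means[:, h].T + rng.standard_normal((size, n)) @ chol.T

def poissonized_sample(means, weights, cov, lam, tau, rng):
    """Return one sample of the restricted model, or None on the failure event R > tau."""
    n = means.shape[0]
    chol = np.linalg.cholesky(cov)
    R = rng.poisson(lam)                      # number of GMM draws to sum
    if R > tau:
        return None                           # explicit failure instead of resampling
    Y = draw_gmm(means, weights, chol, R, rng).sum(axis=0)
    # Y carries noise N(0, R*cov); add N(0, (tau - R)*cov) so the total is N(0, tau*cov).
    extra = np.sqrt(tau - R) * (chol @ rng.standard_normal(n))
    return Y + extra

rng = np.random.default_rng(0)
means = np.array([[0.0, 4.0, 0.0],
                  [0.0, 0.0, 4.0]])
weights = np.array([0.5, 0.3, 0.2])
x = poissonized_sample(means, weights, np.eye(2), lam=3.0, tau=20, rng=rng)
```

With `tau` well above `lam`, failures are rare, and the average of many such samples should be close to `lam * means @ weights`, the mean of the ideal noisy ICA model.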

#### Full reduction.

To be able to recover the without sign or scaling ambiguities, we add an extra coordinate to the GMM as follows. The new means are with an additional coordinate whose value is for all , i.e. . Moreover, this coordinate has no noise. In other words, each Gaussian component now has an covariance matrix . It is easy to construct samples from this new GMM given samples from the original: If the original samples were , then the new samples are where . The reduction proceeds similarly to the above on the new inputs.

Unlike before, we will define the ICA mixing matrix to be such that it has unit norm columns. The role of matrix in the basic reduction will now be played by . Since we are normalizing the columns of , we have to scale the ICA signal obtained in the basic reduction to compensate for this: Define . Thus, the ICA models obtained in the full reduction are:

$$X' = A'S' + \eta'(\tau), \tag{2}$$

$$X' = \bigl(A'S' + \eta'(\tau)\bigr) \mid R \le \tau, \tag{3}$$

where we define . As before, we have an ideal noisy ICA model (2) from which we cannot sample, and an approximate noisy ICA model (3) which can be made arbitrarily close to (2) in statistical distance by choosing appropriately. With appropriate application of Theorem 22 to these models, we can recover estimates (up to sign) of the columns of .

By construction, the last coordinate of each now tells us both the sign and magnitude of each : Let be the vector consisting of the first coordinates of , and let be the last coordinate of . Then with the sign indeterminacy canceling in the division.
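The recovery step of the full reduction is simple enough to sketch directly. In the example below, the helper names `lift_and_normalize` and `recover_mean` and the example mean are hypothetical; the point is that dividing by the last coordinate removes both the normalization and any sign flip.

```python
import math

def lift_and_normalize(mu):
    """Append the noiseless coordinate 1 to a mean and scale the result to unit norm."""
    lifted = list(mu) + [1.0]
    norm = math.sqrt(sum(x * x for x in lifted))
    return [x / norm for x in lifted]

def recover_mean(column):
    """Divide the first n coordinates by the last one; a global sign cancels out."""
    last = column[-1]
    return [x / last for x in column[:-1]]

mu = [3.0, -4.0]
column = lift_and_normalize(mu)
flipped = [-x for x in column]      # an ICA routine may return the column with the opposite sign
recovered = recover_mean(flipped)
```

In practice the estimated column carries some error, so the division is well behaved only when the last coordinate is bounded away from zero, which is one reason bounds on the norms of the means enter the analysis.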

## 4 Correctness of the Algorithm and Reduction

Subroutine 1 captures the sampling process of the reduction: Let be the covariance matrix of the GMM, be an integer chosen as input, and a threshold value also computed elsewhere and provided as input. Let . If is larger than , the subroutine returns a failure notice and the calling algorithm halts immediately. A requirement, then, should be that the threshold is chosen so that the chance of failure is very small; in our case, is chosen so that the chance of failure is half of the confidence parameter given to Algorithm 2. The subroutine then goes through the process described in the full reduction: sampling from the GMM, lifting the sample by appending a 1, then adding a lifted Gaussian so that the total noise has distribution . The resulting sample is from the model given by (3).

Algorithm 2 works as follows: it takes as input the parameters of the GMM (covariance matrix, number of means), tensor order (as required by UnderdeterminedICA), error parameters, and bounds on certain properties of the weights and means. The algorithm then calculates various internal parameters: a bound on directional covariances, Poisson parameter , threshold parameter , error parameters to be split between the “Poissonization” process and the call to UnderdeterminedICA, and values explicitly needed by [17] for the analysis of UnderdeterminedICA. Other internal values needed by the algorithm are denoted by the constant and polynomial ; their values are determined by the proof of Theorem 1. Briefly, is a constant so that one can cleanly compute a value of that will involve a polynomial, called , of all the other parameters. The algorithm then calls UnderdeterminedICA, but instead of giving samples from the GMM, it allows access to Subroutine 1. It is then up to UnderdeterminedICA to generate samples as needed (bounded by the polynomial in Theorem 1). In the case that Subroutine 1 returns a failure, the entire algorithm halts and returns nothing. If no failure occurs, the matrix returned by UnderdeterminedICA will be the matrix of normalized means embedded in , and the algorithm de-normalizes, removes the last row, and then has approximations to the means of the GMM.

The bounds are used instead of actual values to allow flexibility — in the context under which the algorithm is invoked — on what the algorithm needs to succeed. However, the closer the bounds are to the actual values, the more efficient the algorithm will be.

#### Sketch of the correctness argument.

The proof of correctness of Algorithm 2 has two main parts. For brevity, the details can be found in Appendix A. In the first part, we analyze the sample complexity of recovering the Gaussian means using UnderdeterminedICA when samples are taken from the ideal noisy ICA model (2).

In the second part, we note that we do not have access to the ideal model (2), and that we can only sample from the approximate noisy ICA model (3) using the full reduction. Choosing appropriately, we use total variation distance to argue that with high probability, running UnderdeterminedICA with samples from the approximate noisy ICA model will produce equally valid results as running UnderdeterminedICA with samples from the ideal noisy ICA model. The total variation distance bound is explored in section A.2.

These ideas are combined in section A.3 to prove the correctness of Algorithm 2. One additional technicality arises from the implementation of Algorithm 2. Samples can be drawn from the noisy ICA model using rejection sampling on . In order to guarantee Algorithm 2 executes in polynomial time, when a sample of needs to be rejected, Algorithm 2 terminates in explicit failure. To complete the proof, we argue that with high probability, Algorithm 2 does not explicitly fail.

## 5 Smoothed Analysis

We start with a base matrix and add a perturbation matrix with each entry coming iid from for some . [We restrict the discussion to the second power for simplicity; extension to higher power is straightforward.] As in [17], it will be convenient to work with the multilinear part of the Khatri–Rao product: For a column vector define , a subvector of , given by for . Then for a matrix we have .

###### Theorem 6.

With the above notation, for any base matrix with dimensions as above, we have, for some absolute constant ,

$$\Pr\Bigl(\sigma_{\min}\bigl((M+N)^{\ominus 2}\bigr) \le \frac{\sigma^2}{n^7}\Bigr) \le \frac{2C}{n}.$$

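Theorem 2 follows immediately from the theorem above by noting that $\sigma_{\min}(A^{\odot 2}) \ge \sigma_{\min}(A^{\ominus 2})$: the rows of $A^{\ominus 2}$ are a subset of the rows of $A^{\odot 2}$, and deleting rows cannot increase the smallest singular value.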

###### Proof.

In the following, for a vector space (over the reals) denotes the distance between vector and subspace ; more precisely, . We will use a lower bound on , found in Appendix H.2.

With probability $1$, the columns of the matrix $(M+N)^{\ominus 2}$ are linearly independent. This can be proved along the lines of a similar result in [17]. Fix $k$ and let $u$ be a unit vector orthogonal to the subspace spanned by the columns of $(M+N)^{\ominus 2}$ other than column $k$. Vector $u$ is well-defined with probability $1$. Then the distance of the $k$’th column $C_k$ from the span of the rest of the columns is given by

$$u^T C_k = u^T (M_k + N_k)^{\ominus 2} = \sum_{1 \le i < j \le n} u_{ij}\,(M_{ik} + N_{ik})(M_{jk} + N_{jk}). \tag{4}$$

Now note that this is a quadratic polynomial in the random variables $N_{ik}$. We will apply the anticoncentration inequality of Carbery–Wright [9] to this polynomial to conclude that the distance between the $k$’th column of $(M+N)^{\ominus 2}$ and the span of the rest of the columns is unlikely to be very small (see Appendix H.3 for the precise result).

Using , the variance of our polynomial in (4) becomes

In our application, our random variables for are not standard Gaussians but are iid Gaussian with variance , and our polynomial does not have unit variance. After adjusting for these differences using the estimate on the variance of above, Lemma 24 gives .

Therefore, by the union bound over the choice of .

Now choosing , Lemma 23 gives . ∎

We note that while the above discussion is restricted to Gaussian perturbation, the same technique would work for a much larger class of perturbations. To this end, we would require a version of the Carbery-Wright anticoncentration inequality which is applicable in more general situations. We omit such generalizations here.
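The effect described by the smoothed analysis can be observed in a small numerical experiment. The sketch below, assuming NumPy, starts from a maximally degenerate base matrix and compares the smallest singular value of the Khatri-Rao square before and after a Gaussian perturbation. The dimensions, the value of `sigma`, the number of columns, and the helper name `sigma_min_kr_square` are assumptions for illustration only.

```python
import numpy as np

def sigma_min_kr_square(P):
    """Smallest singular value of the Khatri-Rao square of P (column i is P_i kron P_i)."""
    kr = np.stack([np.kron(P[:, i], P[:, i]) for i in range(P.shape[1])], axis=1)
    return np.linalg.svd(kr, compute_uv=False)[-1]

rng = np.random.default_rng(0)
n = 6
m = n * (n - 1) // 2                   # 15 columns, more than the dimension n
M = np.ones((n, m))                    # degenerate base matrix: all columns identical
sigma = 0.5
N = rng.normal(0.0, sigma, size=(n, m))

before = sigma_min_kr_square(M)        # numerically zero
after = sigma_min_kr_square(M + N)     # bounded away from zero after perturbation
threshold = sigma**2 / n**7            # the scale appearing in Theorem 2
```

A single random draw only illustrates the phenomenon; the theorem bounds the probability that `after` falls below `threshold`, and it does not claim this never happens.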

## 6 The curse of low dimensionality for Gaussian mixtures

In this section we prove Theorem 3, which informally says that for small there is a large class of superpolynomially close mixtures in with fixed variance. This goes beyond the specific example of exponential closeness given in [20] as we demonstrate that such mixtures are ubiquitous as long as there is no lower bound on the separation between the components.

Specifically, let be the cube . We will show that for any two sets of points and in , with fill (we say that has fill , if there is a point of within distance of any point of ), there exist two mixtures with means on disjoint subsets of , which are exponentially close in in the norm. Note that the fill of a sample from the uniform distribution on the cube can be bounded (with high probability) by .
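The notion of fill can be made concrete in one dimension. The sketch below is an illustration only: it assumes the cube is the unit interval `[0, 1]`, and the helper name `fill_1d` is hypothetical. In this case the fill is the largest of the gap before the first point, the gap after the last point, and half of every interior gap.

```python
import random

def fill_1d(points):
    """Exact fill of a finite point set in [0, 1].

    Returns the smallest h such that every point of [0, 1] lies within
    distance h of some element of `points`.
    """
    xs = sorted(points)
    gaps = [xs[0] - 0.0, 1.0 - xs[-1]]                   # boundary gaps
    gaps += [(b - a) / 2.0 for a, b in zip(xs, xs[1:])]  # interior half-gaps
    return max(gaps)

random.seed(0)
fills = {}
for m in (10, 100, 1000, 10000):
    sample = [random.random() for _ in range(m)]
    fills[m] = fill_1d(sample)
    print(m, fills[m])
```

Running it shows how the fill of a uniform sample behaves as the number of points grows, which is the quantity the high-probability bound above controls.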

We start by defining some of the key objects. Let be the unit Gaussian kernel. Let be the integral operator corresponding to the convolution with a unit Gaussian: . Let be any subset of points in . Let be the kernel matrix corresponding to , . It is known to be positive definite. For a function , the interpolant is defined as , where the coefficients are chosen so that . It is easy to see that such interpolant exists and is unique, obtained by solving a linear system involving .
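The construction of the interpolant can be sketched in one dimension. This is a hedged illustration rather than the paper's procedure: the kernel normalization `exp(-(x - z)^2)` is an assumption, the target is taken to be the convolution of that kernel with the constant function 1 on `[0, 1]` (which has a closed form through the error function), and the names `kernel`, `f`, and `fit_interpolant` are hypothetical.

```python
import math
import numpy as np

def kernel(x, z):
    """Assumed unit Gaussian kernel K(x, z) = exp(-(x - z)^2), as a matrix."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = np.atleast_1d(np.asarray(z, dtype=float))
    return np.exp(-(x[:, None] - z[None, :]) ** 2)

def f(x):
    """Integral of exp(-(x - z)^2) over z in [0, 1], in closed form."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    c = 0.5 * math.sqrt(math.pi)
    return np.array([c * (math.erf(1.0 - t) + math.erf(t)) for t in x])

def fit_interpolant(nodes, target):
    """Solve the kernel-matrix system so the interpolant matches target on nodes."""
    weights = np.linalg.solve(kernel(nodes, nodes), target(nodes))

    def f_interp(x):
        return kernel(x, nodes) @ weights

    return f_interp, weights

nodes = np.linspace(0.0, 1.0, 5)
f_X, w = fit_interpolant(nodes, f)

grid = np.linspace(0.0, 1.0, 201)
sup_error = float(np.max(np.abs(f(grid) - f_X(grid))))
print(sup_error)
```

Only a handful of nodes is used because the Gaussian kernel matrix becomes badly conditioned as the nodes get closer together; a linear solve in double precision is adequate at this size.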

We will need some properties of the Reproducing Kernel Hilbert Space corresponding to the kernel (see [29, Chapter 10] for an introduction). In particular, we need the bound and the reproducing property, . For a function of the form we have .

###### Lemma 7.

Let be any positive function with norm supported on and let . If has fill , then there exists such that

$$\|f - f_{X,k}\|_{L^\infty(\mathbb{R}^n)}$$

###### Proof.

From [24], Theorem 6.1 (taking ) we have that for some and sufficiently small Note that the norm is on while we need to control the norm on . To do that we need a bound on the RKHS norm of . This ultimately gives control of the norm over because there is a canonical isometric embedding of elements of interpreted as functions over into elements of interpreted as functions over . We first observe that for any , . Thus, from the reproducing property of RKHS, . Using properties of RKHS with respect to the operator (see, e.g., Proposition 10.28 of  [29])

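$$\begin{aligned}
\|f - f_{X,k}\|_{H}^2 &= \langle f - f_{X,k},\, f - f_{X,k} \rangle_{H} = \langle f - f_{X,k},\, f \rangle_{H} = \langle f - f_{X,k},\, K g \rangle_{H} \\
&= \langle f - f_{X,k},\, g \rangle_{L^2([0,1]^n)} \le \|f - f_{X,k}\|_{L^2([0,1]^n)} \, \|g\|_{L^2([0,1]^n)}
\end{aligned}$$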

Thus

###### Theorem 8.

Let and be any two subsets of with fill . Then there exist two Gaussian mixtures and (with positive coefficients summing to one, but not necessarily the same number of components), which are centered on two disjoint subsets of and such that for some

$$\|p - q\|_{L^1(\mathbb{R}^n)}$$

###### Proof.

To simplify the notation we assume that . The general case follows verbatim, except that the interval of integration, , and its complement need to be replaced by the sphere of radius and its complement respectively.

Let and be the interpolants, for some fixed sufficiently smooth (as above, ) positive function with . Using Lemma 7, we see that . Functions and are both linear combinations of Gaussians possibly with negative coefficients and so is . By collecting positive and negative coefficients we write

$$f_{X,k} - f_{Y,k} = p_1 - p_2, \tag{5}$$

where $p_1$ and $p_2$ are mixtures with positive coefficients only.

Put , , where and are disjoint subsets of . Now we need to ensure that the coefficients can be normalized to sum to .

Let , . From (5) and by integrating over the interval , and since is strictly positive on the interval, it is easy to see that . We have

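$$\begin{aligned}
|\alpha - \beta| &= \left| \int_{\mathbb{R}} p_1(x) - p_2(x) \, dx \right| \le \|p_1 - p_2\|_{L^1(\mathbb{R})}, \\
\|p_1 - p_2\|_{L^1(\mathbb{R})} &\le \int_{[-1/h,\, 1/h]} \|f_{X,k} - f_{Y,k}\|_{L^\infty(\mathbb{R})} \, dx + 2(\alpha + \beta) \int_{x \in [1/h,\, \infty)} K(0, x - 1) \, dx.
\end{aligned}$$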

Noticing that the first summand is bounded by and the integral in the second summand is even smaller (in fact, ), it follows immediately that for some and sufficiently small.

Hence, we have

Collecting exponential inequalities completes the proof. ∎
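The construction in the proof of Theorem 8 can be imitated numerically in one dimension. The sketch below is an illustration under stated assumptions, not the authors' code: the kernel is taken to be `exp(-(x - z)^2)`, the target function is the convolution of that kernel with the constant 1 on `[0, 1]`, the node sets `X` and `Y` are arbitrary disjoint grids, and the normalizing constants `alpha` and `beta` are assumed to be the total masses of the positive and negative parts. All function and variable names are hypothetical.

```python
import math
import numpy as np

def kernel(x, z):
    """Assumed unit Gaussian kernel K(x, z) = exp(-(x - z)^2), as a matrix."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = np.atleast_1d(np.asarray(z, dtype=float))
    return np.exp(-(x[:, None] - z[None, :]) ** 2)

def f(x):
    """Integral of exp(-(x - z)^2) over z in [0, 1], in closed form."""
    c = 0.5 * math.sqrt(math.pi)
    return np.array([c * (math.erf(1.0 - t) + math.erf(t)) for t in np.atleast_1d(x)])

def interpolation_weights(nodes):
    return np.linalg.solve(kernel(nodes, nodes), f(nodes))

X = np.linspace(0.0, 1.0, 5)        # first node set
Y = np.linspace(0.125, 0.875, 4)    # second node set, disjoint from X

# f_X - f_Y is one signed combination of Gaussians centered on X and Y.
centers = np.concatenate([X, Y])
coeffs = np.concatenate([interpolation_weights(X), -interpolation_weights(Y)])

# Collect positive and negative coefficients into two unnormalized mixtures.
pos, neg = coeffs > 0, coeffs < 0
mass = math.sqrt(math.pi)           # integral of exp(-t^2) over the real line
alpha = mass * coeffs[pos].sum()
beta = mass * (-coeffs[neg]).sum()

# Normalize each part to a density and measure their L1 distance on a grid.
grid = np.linspace(-10.0, 11.0, 8401)
dx = grid[1] - grid[0]
p = kernel(grid, centers[pos]) @ coeffs[pos] / alpha
q = kernel(grid, centers[neg]) @ (-coeffs[neg]) / beta
l1_distance = float(np.sum(np.abs(p - q)) * dx)
print(alpha, beta, l1_distance)
```

The two resulting mixtures have means on disjoint subsets of the combined node set by construction, since a center is assigned to one side or the other according to the sign of its coefficient.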

###### Proof of Theorem 3.

For convenience we will use a set of points instead of . Clearly it does not affect the exponential rate.

By a simple covering set argument (cutting the cube into cubes with size ) and basic probability, we see that the fill of points is at most with probability . Hence, given points, we have . We see that with a smaller probability (but still close to for large ), we can sample points times and still have the same fill.

Partitioning the set of points into disjoint subsets of points and applying Theorem 8 (to points) we obtain pairs of exponentially close mixtures with at most components each. If one of the pairs has the same number of components, we are done. If not, by the pigeonhole principle for at least two pairs of mixtures and the differences of the number of components (an integer number between and ) must coincide. Assume without loss of generality that has no more components than and has no more components than . Taking and completes the proof. ∎

## References

• [1] D. Achlioptas and F. McSherry. On spectral learning of mixture of distributions. In The 18th Annual Conference on Learning Theory, 2005.
• [2] L. Albera, A. Ferreol, P. Comon, and P. Chevalier. Blind Identification of Overcomplete MixturEs of sources (BIOME). Lin. Algebra Appl., 391:1–30, 2004.
• [3] N. Alon and J. H. Spencer. The probabilistic method. Wiley, 2004.
• [4] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012.
• [5] S. Arora and R. Kannan. Learning Mixtures of Arbitrary Gaussians. In 33rd ACM Symposium on Theory of Computing, 2001.
• [6] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In JMLR W&CP, volume 30: COLT, pages 270–287, 2013.
• [7] M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, pages 103–112. IEEE Computer Society, 2010.
• [8] A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis of tensor decompositions. CoRR, abs/1311.3651v4, 2014.
• [9] A. Carbery and J. Wright. Distributional and $L^q$ norm inequalities for polynomials over convex bodies in $\mathbb{R}^n$. Mathematical Research Letters, 8:233–248, 2001.
• [10] J.-F. Cardoso. Super-symmetric decomposition of the fourth-order cumulant tensor. Blind identification of more sources than sensors. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on, pages 3109–3112. IEEE, 1991.
• [11] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non-gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370, 1993.
• [12] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010.
• [13] A. Dasgupta. Probability for Statistics and Machine Learning. Springer, 2011.
• [14] S. Dasgupta. Learning Mixture of Gaussians. In 40th Annual Symposium on Foundations of Computer Science, 1999.
• [15] S. Dasgupta and L. Schulman. A Two Round Variant of EM for Gaussian Mixtures. In 16th Conference on Uncertainty in Artificial Intelligence, 2000.
• [16] J. Feldman, R. A. Servedio, and R. O’Donnell. PAC Learning Axis Aligned Mixtures of Gaussians with No Separation Assumption. In The 19th Annual Conference on Learning Theory, 2006.
• [17] N. Goyal, S. Vempala, and Y. Xiao. Fourier PCA. CoRR, http://arxiv.org/abs/1306.5825, 2013.
• [18] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In ITCS, pages 11–20, 2013.
• [19] M. Kendall, A. Stuart, and J. K. Ord. Kendall’s advanced theory of statistics. Vol. 1. Halsted Press, sixth edition, 1994. Distribution theory.
• [20] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2010), 2010.
• [21] E. Mossel, R. O’Donnell, and K. Oleszkiewicz. Noise stability of functions with low influences: Invariance and optimality. Annals of Math., 171:295–341, 2010.
• [22] O. A. Nielsen. An Introduction to Integration Theory and Measure Theory. Wiley, 1997.
• [23] B. Rennie and A. Dobson. On Stirling numbers of the second kind. Journal of Combinatorial Theory, 7(2):116 – 121, 1969.
• [24] C. Rieger and B. Zwicknagl. Sampling inequalities for infinitely smooth functions, with applications to interpolation and machine learning. Advances in Computational Mathematics, 32(1):103–129, 2010.
• [25] J. Riordan. Moment recurrence relations for binomial, poisson and hypergeometric frequency distributions. Annals of Mathematical Statistics, 8:103–111, 1937.
• [26] H. L. Royden, P. Fitzpatrick, and P. Hall. Real analysis, volume 4. Prentice Hall New York, 1988.
• [27] M. Rudelson and R. Vershynin. Smallest singular value of a random rectangular matrix. Comm. Pure Appl. Math., 62(12):1707–1739, 2009.
• [28] S. Vempala and G. Wang. A Spectral Algorithm for Learning Mixtures of Distributions. In 43rd Annual Symposium on Foundations of Computer Science, 2002.
• [29] H. Wendland. Scattered data approximation, volume 17. Cambridge University Press Cambridge, 2005.
• [30] A. Winkelbauer. Moments and Absolute Moments of the Normal Distribution. ArXiv e-prints, Sept. 2012.

## Appendix A Theorem 1 Proof Details

### A.1 Error Analysis of the Ideal Noisy ICA Model

The proposed full reduction from Section 3 provides us with two models. The first is a noisy ICA model from which we cannot sample:

$$\text{(Ideal ICA)} \qquad X' = A'S' + \eta'(\tau). \tag{6}$$

The second is a model that fails to satisfy the assumption that has independent coordinates, but it is a model from which we can sample:

$$\text{(Approximate ICA)} \qquad X' = \big(A'S' + \eta'(\tau)\big) \,\big|\, R \le \tau. \tag{7}$$

Both models rely on the choice of two parameters, and . The dependence on is explicit in the models. The dependence on can be summarized in the unrestricted model as independently of each other, and .

The probability of choosing will be seen to be exponentially small in . For this reason, running UnderdeterminedICA with polynomially many samples from model (6) will with high probability be equivalent to running the ICA Algorithm with samples from model (7). This notion will be made precise later using total variation distance.
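The size of the gap between the two models can be illustrated with a small computation. The sketch below rests on assumptions that are not spelled out in this passage: that `R` is a Poisson random variable with some mean `lam`, and that the approximate model is the ideal one conditioned on the event `R <= tau`. Under that reading, the total variation distance between the two models is at most the tail probability of `R` exceeding `tau`, since conditioning on an event moves a distribution by at most the probability of its complement. The function name `poisson_tail` and the parameter values are hypothetical.

```python
import math

def poisson_tail(lam, tau, terms=200):
    """P(R > tau) for R ~ Poisson(lam), summing the tail directly.

    Each term is computed through logarithms so that very small tail
    probabilities are not lost to cancellation.
    """
    total = 0.0
    for r in range(tau + 1, tau + 1 + terms):
        log_pmf = -lam + r * math.log(lam) - math.lgamma(r + 1)
        total += math.exp(log_pmf)
    return total

lam = 5.0  # assumed Poisson mean, for illustration only
tails = {tau: poisson_tail(lam, tau) for tau in (10, 20, 40)}
for tau, tail in tails.items():
    print(tau, tail)
```

The printed values show how quickly the tail shrinks once `tau` exceeds the mean, which is the behavior the argument relies on when it replaces samples from model (6) by samples from model (7).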

For the remainder of this subsection, we proceed as if samples are drawn from the ideal noisy ICA model (6). Thus, to recover the columns of , it suffices to run UnderdeterminedICA on samples of . Theorem 22 can be used for this analysis so long as we can obtain the necessary bounds on the cumulants of , moments of