are the component means and covariance matrices. Gaussian mixture models are arguably the most important latent variable models, with their study dating back over a century to Pearson.
Given the importance of these mixture models, it is natural to consider the problem of learning an unknown mixture from samples. There is a long line of work on this problem with different notions of error and different assumptions on the underlying mixture. A large number of papers ([6, 3, 23, 1, 17, 5]) have studied the problem of learning such mixtures under various separation assumptions between the components. In relatively recent work,  gave the first efficient algorithm for learning mixtures of two Gaussians in parameter distance. This was later improved to a nearly optimal algorithm by . Another pair of works ([20, 4]) generalizes this result to mixtures of
Gaussians. There have also been a number of papers that have studied the related problem of density estimation for such mixtures ([8, 22, 7, 20, 14, 19]).
) attempts to understand which statistical estimation tasks can still be accomplished in the presence of outliers. While the information-theoretic aspects of many of these problems have been well understood for some time, until recently all known estimators were either computationally intractable or had error rates that scaled poorly with dimension.
It was only recently that the first works on computationally efficient, high dimensional robust statistics ([10, 18]) overcame this obstacle. These papers provided the first techniques giving computationally efficient algorithms that learn high dimensional distributions such as Gaussians and product distributions to small error in total variational distance even in the presence of a constant fraction of adversarial noise. Since the publication of these works, there has been an explosion of results in this area with a number of papers expanding upon the original techniques, finding ways to make these algorithms more efficient and finding ways to apply them to new classes of distributions. For a survey of the recent work in this area see .
Since the inception of robust computational statistics, the problem of robustly learning a mixture of even two arbitrary Gaussians has remained a major open problem. Although a number of works (for example [10, 2, 9]) have solved special cases of this problem, the general case has remained elusive. In this paper, we resolve this problem, providing the first efficient, robust algorithm for learning an arbitrary (equal weight) mixture of two Gaussians.
1.2 Our Results
In order to introduce our results, we will first need to define the error model:
Definition 1.1 (Strong Contamination Model).
We say that an algorithm has access to -noisy samples from a distribution if the algorithm can access the following oracle a single time:
The algorithm picks a number , then i.i.d. samples from are generated: . An adversary is then allowed to inspect these samples and replace at most of them with arbitrarily chosen new samples. The algorithm is then given the list of (modified) samples.
We note that this is often referred to as the strong adversary model and is the strongest of the contamination models commonly studied in robust statistics. We also note that while as stated the algorithm can only make a single call to this oracle, it is not hard to see that at the cost of slightly increasing , it can simulate any polynomial number of calls simply by asking for a larger number of samples and randomly partitioning these samples into smaller subsets. Since the adversary does not know ahead of time what the partition is going to be, it will be unlikely that any subset will have more than corrupted samples (at least assuming that the number of samples in each part is large relative to and the number of parts).
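The partitioning argument above can be sketched in a few lines of code. This is an illustrative sketch only: the function name `split_into_batches` is ours, and the point is simply that since the adversary commits to its corruptions before the random partition is drawn, each batch inherits roughly the same corruption fraction with high probability.

```python
import random

def split_into_batches(samples, num_batches, seed=0):
    """Randomly partition samples into equal-sized batches.

    Because the adversary corrupts samples before the partition is
    chosen, each batch of size m contains roughly an eps-fraction of
    corruptions with high probability (for m large relative to 1/eps),
    so each batch simulates one independent call to the oracle."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    m = len(shuffled) // num_batches
    return [shuffled[i * m:(i + 1) * m] for i in range(num_batches)]

# Simulate 10 independent oracle calls from one batch of 1000 samples.
batches = split_into_batches(range(1000), 10)
```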
We also note that if and are distributions that are guaranteed to have total variational distance at most , then an algorithm can simulate (sufficiently many) -noisy samples from given access to -noisy samples from . This is because a sample from can be thought of as a sample from that is corrupted with a probability of at most . Given a sufficiently large number of samples , then with high probability at most of these samples will be corrupted in this way.
In these terms, our main theorem is easy to state:
Theorem 1.2. Let and be arbitrary Gaussians in . There exists an algorithm that given -noisy samples to runs in time and with probability at least returns a distribution so that .
1.3 Comparison with Prior Work
is the variance of the mixture. We note that this notion of parameter distance is somewhat different than the total variational distance considered in our work. However, it is not hard to show that this notion of parameter distance is stronger. In particular, the algorithms in these papers can be used to learn a mixture of two Gaussians to error in total variational distance in polynomial time (though this reduction is not entirely trivial). That being said, these algorithms work only in the non-robust setting and fail very quickly when even a small amount of noise is introduced. Furthermore, while the parameter distance metric might be considered more powerful than the total variational distance metric, the latter is more natural to consider when looking at robust learning. In particular, it is easy to see that in most settings it is impossible even information-theoretically to learn to better than error in total variational distance when only given access to -corrupted samples.
In terms of robust learning of mixtures of Gaussians, only limited results were known until this point. In  it was shown how to learn a single Gaussian robustly to small error in total variational distance. That paper also showed how to learn mixtures of identity covariance Gaussians in polynomial time for any constant . More recently,  and  independently showed how to learn mixtures of Gaussians robustly under the assumption that the component Gaussians were highly separated in total variational distance. Our paper for the first time solves this problem in nearly full generality. It learns an arbitrary mixture of two Gaussians robustly under only the assumption that the weights are equal.
It is shown in  that learning the sixth moments of a mixture of two Gaussians is sufficient to uniquely identify the components. At a high level, this will also be our strategy for learning the mixture. However, there are several problems with this strategy.
To begin with, even learning moments of a distribution at all is non-trivial in the presence of adversarial errors. A single corrupted sample can already change the empirical moments of a distribution by arbitrarily much. Fortunately, the robust statistics literature has figured out how to get around this in many cases. In particular, it has been known for some time how one can estimate the mean of a random variable robustly given that one knows that the covariance of is bounded. If one wants to learn higher moments of a distribution, one can apply this statement to learn an approximation to the mean of , assuming that the covariance of this random variable is bounded.
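As a concrete (non-robust) illustration of this reduction, the k-th moment tensor of a dataset is simply the sample mean of the tensor powers of the data points; a robust mean estimator would then be run on these tensor-valued samples in place of the plain average below. This is a minimal numpy sketch, not the paper's algorithm:

```python
import numpy as np

def empirical_moment_tensor(X, k):
    """Average of x^(tensor k) over the rows of X (an n-by-d array).

    A robust version would replace the plain mean here with a
    bounded-covariance robust mean estimator run on the tensors."""
    n, d = X.shape
    T = np.zeros((d,) * k)
    for x in X:
        P = x
        for _ in range(k - 1):
            P = np.tensordot(P, x, axes=0)  # build x (tensor) x ... (tensor) x
        T += P
    return T / n

X = np.array([[1.0, 0.0], [0.0, 1.0]])
M2 = empirical_moment_tensor(X, 2)  # second moment matrix of the data
```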
Of course boundedness will also be a problem for us. A mixture of arbitrary Gaussians will have no a priori bounds on its covariance. Thus, an important first step will be to normalize the mixture. In particular, we need a way to approximate the mean and covariance matrix of , so that by applying an appropriate affine transformation we can reduce to the problem where is close to mean and identity covariance.
It was shown in  how to robustly learn an unknown covariance Gaussian to small total variational distance error (which corresponds to learning the covariance matrix in a relative Frobenius norm metric). This result requires that we have a particular kind of relationship hold between the second moments of and the fourth moments of . This, unfortunately, does not hold for arbitrary mixtures of Gaussians, but we will show that it holds for mixtures where the components are not too far apart from each other in total variational distance.
Fortunately, we can use the result of  to learn our mixture in the case where the two components are substantially separated in total variational distance. This allows us to reduce to the case where the two components are relatively close, which is sufficient to allow us to perform the normalization procedure described above.
The next obstacle comes in estimating the higher moments of once we have normalized it. We know that we can estimate the moments of to small error given that the covariance of is bounded. Unfortunately, even under good circumstances, this is unlikely to be the case. Instead, we need to replace
by an appropriate tensor of Hermite polynomials. This, we can show, will have bounded covariance assuming that we are in the case where the individual components of are not too far separated.
Finally, given estimates of the higher moments of we need to be able to recover the individual components to relatively small error. This is helped by the assumption that our components are not too far separated, as it means that the distance to which we need to learn the parameters of the individual components is not too bad. For example, if one component had much smaller covariance than the other, then we might be forced to learn the parameters of that component to much higher precision in order to guarantee a small error in total variational distance. However, even learning the parameters from an approximation of the moments is non-trivial. A method is given in  that does this by considering a number of one-dimensional projections; however, this technique will lose dimension-dependent factors, which we cannot afford. Instead, we devise a new technique that involves random projections of higher moment tensors into lower dimensions, along with some guessing to remove some low rank noise.
1.5 Structure of the Paper
We begin in Section 2 with some basic notation and results that will be used throughout the rest of the paper. In Section 3, we deal with the special case where the component Gaussians have small overlap. Then in Section 4, we show how, if this is not the case, we can reduce to the situation where is mean and identity covariance. Once we have done this, Section 5 shows how we can robustly compute moments of and how to use them to approximate the individual components. Next, in Section 6, we show how to combine everything and prove Theorem 1.2. Finally, in Section 7, we discuss some ideas on how to further extend these results.
We will use to denote the set . For weights summing to and probability distributions, we use
to denote the probability distribution given by their mixture. In particular, the probability density function is given by the average of the density functions, .
2.2 Distance Between Gaussians
We will need the following approximation of the variational distance between two Gaussians:
The cases where or are proved in . The full result follows from noting that
For our purposes an -tensor will be an element in
. This can be thought of as a vector with coordinates. These coordinates, however, instead of being indexed by , are indexed by -tuples of integers from to . We will often use to denote the coordinate of the -tensor indexed by the -tuple . By abuse of notation, we will also sometimes use this to denote the entire tensor. This allows us to make use of Einstein summation notation. Given an -tensor and an -tensor written as a product with of their indices in common, such as:
This represents the -tensor given by summing over the shared indices. In particular,
The special case where (there are no shared indices) we abbreviate as . Similarly, we write to denote the -fold tensor power of with itself. In a further abuse of notation, we will at times associate vectors in with -tensors and matrices in with -tensors.
Another important special case is where . In this case is a -tensor, or equivalently a single real number given by the sum of the products of the corresponding entries of and . We call this the dot product of and , which we denote or
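These conventions correspond directly to numpy's `einsum`; the following small sketch is ours (array names and shapes are illustrative, not from the paper) and shows a contraction over shared indices, a tensor product with no shared indices, and the dot product:

```python
import numpy as np

A = np.arange(8.0).reshape(2, 2, 2)  # a 3-tensor
B = np.arange(4.0).reshape(2, 2)     # a 2-tensor

# Contracting two shared indices: C_i = sum_{j,k} A_{ijk} B_{jk},
# leaving a 1-tensor.
C = np.einsum('ijk,jk->i', A, B)

# No shared indices: the tensor product of A and B, a 5-tensor.
outer = np.einsum('ijk,lm->ijklm', A, B)

# All indices shared: the dot product <B, B>, a single real number.
dot = np.einsum('ij,ij->', B, B)
```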
2.3.1 Tensor Norms
It will be important for us to bound various norms of tensors. Perhaps the most fundamental such norm is the or Frobenius norm. In particular, given a tensor , we denote by the square root of the sum of the squares of the entries of . Equivalently,
Another relevant tensor norm involves a relationship between tensors and matrices. In particular, if is an -tensor and is a subset of we note that if is a -tensor, the product
is an -tensor. This allows us to interpret
as a linear transformation from the space of -tensors to the space of -tensors, or equivalently as a matrix. We will (assuming the subset is clear from context) use
to denote the largest singular value of this matrix.
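Concretely, grouping a subset of the indices flattens the tensor into a matrix, and this norm is the largest singular value of that matrix. A numpy sketch (the helper name is ours); the rank-one test tensor below also has Frobenius norm equal to this operator norm:

```python
import numpy as np

def flattened_operator_norm(T, left_axes):
    """Largest singular value of T viewed as a matrix whose rows are
    indexed by the axes in left_axes and whose columns by the rest."""
    right_axes = [a for a in range(T.ndim) if a not in left_axes]
    P = np.transpose(T, list(left_axes) + right_axes)
    rows = int(np.prod([T.shape[a] for a in left_axes]))
    M = P.reshape(rows, -1)
    return np.linalg.norm(M, ord=2)  # spectral norm of the flattening

T = np.zeros((2, 2, 2))
T[0, 0, 0] = 3.0  # rank-one tensor: every flattening has norm 3
```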
2.3.2 Symmetric Tensors
Many of the tensors that we will be working with will have a large degree of symmetry. Taking advantage of this will allow us to simplify some of our formulas. To be specific, if is an -tensor and is a permutation of , we define the -tensor by
Furthermore, if is any partition of the set , we use to denote the above average taken only over permutations that preserve .
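The full symmetrization (averaging over all permutations of the indices) can be sketched in numpy; this is an illustrative helper of ours, feasible only for tensors of small order since it enumerates all permutations:

```python
import itertools
import numpy as np

def symmetrize(T):
    """Average of T over all permutations of its n indices, i.e.
    Sym(T)_{i_1..i_n} = (1/n!) * sum_pi T_{i_pi(1)..i_pi(n)}."""
    n = T.ndim
    perms = list(itertools.permutations(range(n)))
    return sum(np.transpose(T, p) for p in perms) / len(perms)

T = np.zeros((2, 2))
T[0, 1] = 1.0
S = symmetrize(T)  # symmetric part: S[0,1] = S[1,0] = 0.5
```

Symmetrizing twice changes nothing, which makes the averaging easy to sanity-check.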
2.4 Robust Statistics
We will need a couple of basic results in robust statistics. We begin with one of the most basic results in the area, namely that an algorithm with access to noisy samples from a distribution with bounded covariance can efficiently approximate the mean of the distribution.
Let be a distribution on with , then there exists a polynomial time algorithm which given -noisy samples from returns a so that with high probability.
There are many proofs of Theorem 2.2. For completeness, and in order to assist with our next result we provide one below.
Let be a discrete probability distribution in with and let be a discrete measure on so that for a sufficiently small constant . Let be the normalization of to a probability distribution. Then either:
in which case
There exists an algorithm which given runs in polynomial time and returns a measure so that and .
Theorem 2.2 follows from Lemma 2.3 by letting be the empirical distribution over the uncorrupted samples (perhaps scaled down slightly so that the covariance is less than ) and starting with as the empirical distribution over the noisy samples handed to the algorithm. It is clear that and that is close to . The algorithm then iteratively applies Lemma 2.3, at each step finding a new distribution whose support is smaller and smaller, eventually terminating at a distribution so that is sufficiently close to .
We now prove Lemma 2.3.
We begin with a proof of the first statement. If , we note that . Thus, for some distributions and we can write and . We claim that and similarly that . The final result will follow from the triangle inequality. For the latter statement (the former follows similarly), we note that
Since , this implies that . Given that , we have .
For the latter result, we assume that
has largest eigenvalue. We find a unit vector so that . We define a function on by
We define by letting for not in the support of and otherwise letting , where is the maximum value of . It is clear that has smaller support than . It remains to show that it is closer to . We begin by comparing the amount of mass lost to the amount that would have been lost if were equal to .
Note that . On the other hand,
where we use the bound on from above. Note that if and is sufficiently small, we have that . This is enough. In particular, let be with the probability mass at decreased by a factor for each . We note that since this operation keeps the sign of each coordinate the same,
However, it is easy to see that and . Combining with the above inequality completes our proof.
We will also need an algorithm for learning the covariance of a random variable under appropriate conditions. The following is a generalization of the argument from  for learning the covariance matrix of a Gaussian (note here that is standing in for the random variable ).
Let be a distribution on , where is supported on the subset of corresponding to the symmetric, positive semi-definite matrices. Suppose that and that for any symmetric matrix we have that Then there exists a polynomial time algorithm that given sample access to an -corrupted version of for less than a sufficiently small multiple of returns a matrix so that with high probability
We begin by reducing to the case where is bounded. It is easy to see from the above bounds that has covariance bounded by . This implies that it is only with probability at most that . Replacing by , the conditional distribution on this event not happening, we note that and so this difference can be thought of as merely increasing our noise rate by . Furthermore, the bounded covariance implies that removing an -probability event changes the expectation of by at most , and so will not change the correctness of our approximation. Furthermore, although the removal of these extreme samples might decrease the covariance of , it cannot increase it substantially. This shows that it suffices for our algorithm to work for the bounded variable . We henceforth assume that is bounded in this way.
Let be the uniform distribution over the uncorrupted samples. Assuming that we took sufficiently many samples, it is easy to see that with high probability the following hold:
For every matrix ,
We assume throughout the following that the above hold.
The algorithm is given a discrete measure , the uniform distribution over the noisy samples. It is the case that . The algorithm will iteratively produce a sequence of such measures each with , and with smaller and smaller support until it eventually returns a hypothesis .
We begin by letting be the measure given by the pointwise minimum of and . We note that is obtained from by removing mass. Therefore, since has covariance , we have that . In particular, this means that . Since , we have that .
In particular, this means that if is a sufficiently large constant, then
This allows us to apply Lemma 2.3 to the distribution and the measure as they have distance at most and the former has covariance bounded by the identity. This either gives us a new measure with smaller support that is not too far from , or it is the case that
Calling and we have
Therefore . Hence,
as desired. ∎
2.5 Moment Computations
We will also be working heavily with higher moments of Gaussians and will need to perform some basic computations about them.
Let be a Gaussian in then
We note that it is enough to show that both sides are equal after dotting them with for any vector . The left hand side becomes . We note that is a Gaussian with mean and variance . Letting be the standard Gaussian, this yields . Expanding with the binomial theorem yields
When we dot the right hand side with on the other hand, each term in the sum contributes where is the number of pairs in the partition . The number of such partitions with exactly pairs is . This is because there are ways to choose which elements are in the pairs and once that is decided, ways to pair them up. Summing over gives the same expression as the above, proving our proposition. ∎
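The counting argument in this proof can be checked numerically: summing over the number of pairs j, the n-th moment of a Gaussian with mean mu and standard deviation sigma is the sum over j of C(n, 2j) (2j-1)!! mu^(n-2j) sigma^(2j). A small sketch of ours implementing exactly this count:

```python
from math import comb

def double_factorial(m):
    """(2j-1)!! = number of ways to pair up 2j elements (1 when j = 0)."""
    out = 1
    while m > 1:
        out *= m
        m -= 2
    return out

def gaussian_moment(n, mu, sigma):
    """E[(mu + sigma*g)^n] for g ~ N(0,1), via the pair-partition count:
    choose which 2j indices are paired, then pair them up."""
    return sum(comb(n, 2 * j) * double_factorial(2 * j - 1)
               * mu ** (n - 2 * j) * sigma ** (2 * j)
               for j in range(n // 2 + 1))
```

For a standard Gaussian this recovers the familiar double-factorial moments, e.g. the fourth moment 3 and the sixth moment 15.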
Unfortunately, the higher moments will often be difficult to compute directly (at least in the robust setting). Instead, we will need to get at them indirectly through a slightly different set of “moments”. The following tensors correspond to the standard multivariate Hermite polynomials.
Define the degree- Hermite polynomial tensor as
By the definition of the Hermite tensor and Proposition 2.5, we have that
Combining the two sums, we note that this is equivalent to partitions of into sets of size and with the sets of size being marked as type (coming from ) or type (coming from ). Thus, this is
However, if we fix and sum over the choices for each part of size of whether it is type or type , we get
which is easily seen to be equal to the desired quantity. ∎
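For intuition, in one dimension these tensors reduce to the probabilists' Hermite polynomials He_n, which can be generated by the recurrence He_{k+1}(x) = x He_k(x) - k He_{k-1}(x). The following helper of ours computes their coefficients; it is a one-dimensional illustration only, not the tensor construction above:

```python
def hermite_coeffs(n):
    """Coefficient list (lowest degree first) of the probabilists'
    Hermite polynomial He_n, via the standard recurrence
    He_{k+1}(x) = x*He_k(x) - k*He_{k-1}(x)."""
    prev, cur = [1], [0, 1]  # He_0 = 1, He_1 = x
    if n == 0:
        return prev
    for k in range(1, n):
        nxt = [0] + cur              # multiply He_k by x
        for i, c in enumerate(prev): # subtract k * He_{k-1}
            nxt[i] -= k * c
        prev, cur = cur, nxt
    return cur
```

For example He_3(x) = x^3 - 3x and He_4(x) = x^4 - 6x^2 + 3.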
Finally, we will also need to bound the covariance of the Hermite polynomial tensors. For this we have the following lemma.
If then equals
The proof is essentially the same as that of Lemma 2.7. The primary difference is that in our version of Equation (1) we will only allow to contain pairs that do not cross between different halves of . This means that for the partition , only pairs that do not cross can be type 1, and thus these pairs contribute rather than just . ∎
In order to learn our mixture of Gaussians in full generality, we will need to have different algorithms for different cases and will need to make several correct guesses in order to succeed. By considering all possible combinations of guesses, the algorithm will end up with a number of hypothesis distributions at least one of which is guaranteed to be close to the true one. From this point, we will need to run a tournament in order to find a hypothesis that is not too far away. This is by now a fairly standard procedure in learning theory, though we need to verify here that this can be done even with only access to -noisy samples.
Let be an unknown distribution and let be distributions with explicitly computable probability density functions that can be efficiently sampled from. Assume furthermore that . Then there exists an efficient algorithm that given access to -noisy samples from along with computes a so that with high probability
For each define the set to be the set of points where the probability density of is bigger than the probability density of . We note in particular that . Taking enough samples from , we can ensure that with high probability is within of the fraction of the uncorrupted samples lying in for each . Note that this will imply that , the fraction of the noisy samples lying in , is within of .
Additionally, for each we sample enough samples from to compute an approximation to to enough accuracy so that with high probability for all .
Our algorithm then returns any so that for all . This will necessarily exist because
which is at most for any for which .
However, such an will be sufficient. This is because
On the other hand, we also have that
If we take so that and take and , we have that , and combining this with the above, we have
Therefore, given any with for all will have .
This completes our proof. ∎
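The selection rule from this proof can be phrased as a small standalone routine. In the sketch below (names and numbers are illustrative, not from the paper), `pred[i][j]` stands for the mass hypothesis i assigns to the set S_ij, `emp[i][j]` for the empirical mass of the noisy samples on S_ij, and we return any hypothesis whose predictions match the empirical fractions everywhere up to the allowed slack:

```python
def select_hypothesis(pred, emp, slack):
    """pred[i][j]: mass hypothesis i assigns to the Scheffe-type set S_ij.
    emp[i][j]: empirical mass of the (noisy) samples on S_ij.
    Returns the first i whose predictions are within slack of the
    empirical fractions for every j, or None if no hypothesis qualifies."""
    n = len(pred)
    for i in range(n):
        if all(abs(pred[i][j] - emp[i][j]) <= slack for j in range(n)):
            return i
    return None

# Two hypotheses: hypothesis 1 is consistent with the data everywhere,
# while hypothesis 0 is off by 0.3 on one of the sets.
pred = [[0.5, 0.9], [0.2, 0.5]]
emp  = [[0.5, 0.6], [0.21, 0.5]]
chosen = select_hypothesis(pred, emp, slack=0.05)
```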
3 Separated Gaussians
We note that if and are separated in variational distance, then our problem is already solved by the work of .
If and are Gaussians with , then there is a polynomial time algorithm that given access to -noisy samples from learns to error .
We can henceforth assume that and have total variation distance at most with for some small positive constant . We would like to know what this entails.
Suppose that and are Gaussians with total variation distance at most . Let have covariance . Then we have that:
We note that this statement is invariant under affine transformations, so by applying such a transformation, we can assume that . We will proceed by contradiction. In particular, we will show that if either of the above conditions is violated, then . We also assume throughout that is sufficiently small.
For the first condition, assume that (without loss of generality) has an eigenvalue less than for a sufficiently small constant . In particular this means that there is a unit vector so that . We will show that . We note that and are one dimensional Gaussians whose mixture has unit variance. Let be the distance between the means and let be the variance of . We note that . In particular this implies that either or . In fact, if , it is easy to see that
since the standard deviations of these Gaussians differ by a factor of at least . Otherwise, if , then and the components are separated by standard deviations, which implies that the total variational distance is at least .
For the second condition, we will use some results from . For Gaussians they define where . They also show that if , we have that
We note that if , then
Note that if we apply an orthogonal change of variables so that and are simultaneously diagonalized, then the eigenvalues of are just the ratios of the eigenvalues of to the corresponding eigenvalues of . We note that since , we have that
However, if , we have that , and so the above is at least , and so must be . ∎
4 Covariance Approximation for Non-Separated Mixtures
Our next result allows us to robustly estimate the covariance of assuming that the component Gaussians are not too far apart.
Let be a mixture of Gaussians with . Let . There exists a polynomial time algorithm that given -noisy samples from for less than a sufficiently small constant returns a hypothesis so that
Let be the difference of independent copies of . Note that our algorithm can obtain -noisy samples of by subtracting pairs of -noisy samples from . Let . Note that is proportional to . It therefore suffices to show that satisfies the hypotheses of Theorem 2.2.
Let . Since this problem is invariant under linear changes of variables, we may assume for convenience that . Note that this problem is unaffected by translating to have mean . Let where . We note that . We also note by Lemma 3.2 that and
We note that is a mixture of , , and (where the primed versions of the variables are independent copies). Hence is a mixture of where each is one of these four distributions. Since the covariance of a mixture of distributions is the mixture of the covariances plus the covariance of the distribution over the component means, we need to show that:
For each of these distributions , and every matrix we have that .
For any two of these distributions and , we have that