Gaussian mixture models are among the most well-studied models in statistics, signal processing, and computer science with a venerable history spanning more than a century. Gaussian mixtures arise naturally as way of explaining data that arises from two or more homogeneous populations mixed in varying proportions. There have been numerous applications of gaussian mixtures in disciplines including astronomy, biology, economics, engineering and finance.
The most basic estimation problem when dealing with any mixture model is to approximately identify the parameters that specify the model given access to random samples. In the case of a gaussian mixture the model is determined by a collection of means, covariance matrices and mixture probabilities. A sample is drawn by first selecting a component according to the mixture probabilities and then sampling from the normal distribution specified by the corresponding mean and covariance. Already in 1894, Pearson[Pea94] proposed the problem of estimating the parameters of a mixture of two one-dimensional gaussians in the context of evolutionary biology. Pearson analyzed a population of crabs and found that a mixture of two gaussians faithfully explained the size of the crab “foreheads”. He concluded that what he observed was a mixture of two species rather than a single species and further speculated that “a family probably breaks up first into two species, rather than three or more, owing to the pressure at a given time of some particular form of natural selection.”
Fitting a mixture of two gaussians to the observed crab data was a formidable task at the time that required Pearson to come up with a good approach. His approach is based on the method of moments which uses the empirical moments of a distribution to distinguish between competing models. Given samples the -th empirical moment is defined as , which for sufficiently large will approximate the true moment A mixture of two one-dimensional gaussians has parameters so one might hope that moments are sufficient to identify the parameters. Pearson derived a ninth degree polynomial in the first moments and located the roots of this polynomial. Each root gives a candidate mixture that matches the first moments; there were two valid solutions, among which Pearson selected the one whose -th moment was closest to the observed empirical -th moment.
In this work, we extend the method proposed by Pearson and prove that the extended method reliably recovers the parameters of the unknown mixture. Moreover, we show that the sample complexity we achieve is essentially optimal. To illustrate the quantitative bound that we get, if the means and variances are separated by constants and the total variance of the mixture is then we show that up to constant factors it is necessary and sufficient to use samples to recover the parameters up to small additive error. Our work can be interpreted as providing an extension of Pearson’s 120-year old estimator that achieves an optimal convergence rate. We extend our result to arbitrary dimension using an apparently novel but surprisingly simple dimensionality reduction technique. This allows us to obtain the same sample complexity in any dimension up to a logarithmic loss in , which we can also show is necessary.
Closely related to our results is an important recent work of Kalai, Moitra and Valiant [KMV10] who gave the first proof of a computationally efficient estimator with an inverse-polynomial convergence rate for the problem we consider. In particular, they show that six moments suffice to identify a mixture of two one-dimensional gaussians. Moreover, the result is robust in the sense that if the parameters of a mixture with variance are separated by constants, then one of the first moments must differ by In particular, the first six empirical moments suffice provided that they’re within of the true moments (which happens for ). This then leads to an estimator up to some polynomial loss. They also show that a solution to the -dimensional problem extends to any dimension up to some loss that’s polynomial in and using a suitable dimensionality reduction technique. In contrast to their result which is within a polynomial factor of optimal, our result is within a constant factor of optimal in one dimension and within a factor of optimal in dimensions.
1.1 Problem description
A mixture of two -dimensional gaussians is specified by mixing probabilities such that two means and two covariance matrices A sample from is generated by first picking an integer from the distribution and then sampling from the -dimensional gaussian measure
The variance of a -dimensional mixture of two gaussians is . For a -dimensional mixture, it is useful to define its “variance” as the maximum variance of any coordinate,
Given samples from our goal is to recover the parameters that specify the mixture up to small additive error; this is known as parameter distance. It is easy to see that we can only hope to recover the components of the mixture up to permutation. For simplicity it is convenient to combine the error in estimating the parameters:
We say that mixture is -close to mixture if there is a permutation for which
We say that an algorithm -learns a mixture of two gaussians from samples, if given i.i.d. samples from , it outputs a mixture that is -close to with probability
Note that this definition does not require good recovery of the . If the two components of the mixture are indistinguishable, one cannot hope to recover the to additive error. On the other hand, if the components are well-separated, one can use that the overall mean is the -weighted average of the component means—or an analogous statement for the variances—to estimate the from estimates of the parameters. Our main theorem will give a more precise characterization of how well we estimate , but for simplicity we ignore it in much of the paper.
We also consider learning a mixture of Gaussians component-wise in the total variation norm.
We say that mixture is component-wise -close to mixture in total variation if there is a permutation for which
We say that an algorithm -learns a mixture of two gaussians in total variation from samples, if given i.i.d. samples from , it outputs a mixture that is component-wise -close to in total variation with probability
Why parameter distance?
We believe that proper learning of each component of a Gaussian mixture in the parameter distance is the most natural objective. Consider the simple example of estimating the height distribution of adult men and women from unlabeled population data, which is well approximated by a mixture of Gaussians [SWW02]. By using parameter distance, our results give a tight characterization of how precisely you can estimate the average male and female heights from a given number of samples. A guarantee in total variation norm is less easily interpreted.
Focusing on parameter distance also has technical advantages in our context. First, it leads to a cleaner quantitative analysis. Second, if the covariance matrices are (close to) sparse we can recover only the dominant entries of the covariance matrices and ignore the rest, decreasing our sample complexity. An affine invariant measure such as total variation distance could not benefit from sparsity this way. Nevertheless, to facilitate comparison with previous work, we state our results for total variation norm as well.
Finally, it is important that we learn each component rather than the mixture distribution. Finding a distribution that closely approximates the mixture distribution is easier but less useful than approximating the individual components, and is the focus of a different line of work that we discuss in the related work section. In fact, the individal parameters are a strong reason for modeling with mixtures of gaussians in the first place.
1.2 Main results
Our main theorem is a general result that achieves tight bounds in multiple parameter regimes. As a consequence it’s a little cumbersome to state, so we start with two simpler corollaries. The first corollary is that the algorithm -learns a mixture with samples.
Let be any mixture of -dimensional gaussians where and are bounded away from zero. Then Algorithm 3 can -learn with samples.
The th power dependence on arises because our algorithm uses the th moment. In fact, we will see that in general this result is tight: there exist distributions for which one cannot reliably estimate either the to or the to with samples.
However, for many distributions one can estimate the parameters with fewer samples. One important special case is when the two gaussians have means that are separated by standard deviations. In this case, our algorithm requires only samples.
Let be a mixture of -dimensional gaussians where and are bounded away from zero and . Then Algorithm 3 can -learn with samples.
This result is also tight: even if the samples from the mixture were labeled, it still would take samples to estimate the mean and variance of each gaussian to the desired precision. Our main theorem gives a smooth tradeoff between these two corollaries.
Let be any mixture of two gaussians with variance and bounded away from Then, given samples Algorithm 3 with probability outputs the parameters of a mixture so that for some permutation and all we have the following guarantees:
If , then , , and .
If , then and .
For any , the algorithm performs as well as assuming the mixture is a single gaussian: and
In essence, the theorem states that the algorithm can distinguish the two gaussians in the mixture if it has at least samples. Once this happens, the parameters can be estimated to relative accuracy with only a factor more samples. If the means are reasonably separated, then the first clause of the theorem provides the strongest bounds. If there is no separation in the means, we cannot hope to learn the means to relative accuracy, but we can still learn the variances to relative accuracy provided that they’re separated. This is the content of the second clause. If neither means nor variances are separated, our algorithm is no better or worse than treating the mixture as a single gaussian.
The only assumption present in our main theorem requires that be bounded away from zero. Making this assumption simplifies the proof on a syntactic level considerably. A polynomial dependence on the separation from could be extracted from our techniques, but we don’t know if this dependence would be optimal.
Our second main result is that the bound in Theorem 3.10 is essentially best possible among all estimators—even computationally inefficient ones. More concretely, we exhibit a pair of mixtures that satisfy the following strong bound on the squared Hellinger distance111For probability measures and with densities and respectively, the squared Hellinger distance is defined as between the two distributions.
There are two one-dimensional gaussian mixtures with variances and all of the , and separated by from each other such that the squared Hellinger distance satisfies
Denoting by the distribution obtained by taking independent samples from the squared Hellinger distance satisfies the direct sum rule Moreover, if then the total variation distance also satisfies In particular, in this case no statistical test can distinguish and from samples with high confidence and parameter estimation is therefore impossible. The following theorem follows, showing that Corollary 1.3 is optimal.
Consider any algorithm that, given samples of any gaussian mixture with variance , with probability learns either to or to . Then .
Since -learning the mixture requires learning both the and the to this precision, we get that Corollary 1.3 is tight. This also justifies our definition of -approximation in parameter distance meaning approximating the means to and the variances to .
Any algorithm that uses samples to -learn arbitrary mixtures of two -dimensional gaussians with and bounded away from zero requires .
We also note that our lower bound technique directly gives a lower bound of for the problem of learning a mixture of Gaussians for constant (Theorem 2.11). This is incomparable to the lower bound of, roughly, for due to [MV10]. Our bound is useful when is a small constant and is going to zero, while their bound is useful when both and are large.
Upper bound in arbitrary dimensions.
Our main result holds for the -dimensional problem up to replacing by in the sample complexity.
Let be any mixture of -dimensional gaussians where and are bounded away from zero. Then we can -learn with samples.
Notably, our bound is essentially dimension-free and incurs only a logarithmic dependence on The best previous bound for the problem is the bound due to [KMV10] that gives a polynomial dependence of for some large constant . The proof of our theorem is based on a new dimension-reduction technique for the mixture problem that is quite different from the one in [KMV10]. Apart from the quantitative improvement that it yields, it is also notably simpler.
Lower bound in higher dimension.
We can extend our lower bound (Theorem 2.5) to show that samples are necessary to achieve the guarantee of Theorem 4.11; one can embed a different instance of the hard distribution in each of the dimensions, and the guarantee requires that the algorithm solve all the copies. That this direct product is hard is shown in Theorem 2.7. Hence Theorem 4.11 is optimal up to the term, and optimal up to constant factors when .
Learning in total variation norm.
In Section 5 we derive various results for learning mixtures of gaussians in the total variation norm.
Let be any mixture of -dimensional gaussians where and are bounded away from zero. For any dimension Algorithm 1 -learns in total variation with samples.
dependence here is probably not close to optimal, the exponent is nonetheless several orders of magnitude smaller than the exponent of the polynomial dependence that follows from previous work. Interestingly, this large sample complexity of the general case can be improved if the covariances of the two Gaussians have similar eigenvalues and eigenvectors (e.g., they are isotropic):
Let be any mixture of -dimensional gaussians with covariance matrices and where the mixing probabilities and are bounded away from zero. Further suppose that there exists a constant such that
Then there is an algorithm that can -learn in total variation with samples.
1.3 Related Work
The body of related work on gaussian mixture models is too broad to survey here. We refer the reader to [KMV10] for a helpful discussion of work prior to 2010. Since then a number of works have further contributed to the topic. Moitra and Valiant [MV10] gave polynomial bounds for estimating the parameters of a mixture of gaussians based on the method of moments. Belkin and Sinha [BS10] achieved a similar result. It is an interesting question if our techniques extend to the case of gaussians, but as our lower bounds show the sample size must be at least which is prohibitive for small and even moderate .
Work of Chan et al. [CDSS13, CDSS14] implies an improper learning algorithm for a mixture of two single-dimensional gaussians that learns the overall mixture (not the components) in total variation distance to error using samples. An improper learning algorithm in general does not return a mixture of gaussians nor does it return an approximation to the individual components of the mixture.
Daskalakis and Kamath [DK14] strengthen this result by giving a proper learning algorithm for learning a one-dimensional mixture with the same sample complexity. However, unlike our algorithm, it does not learn the individal components of the mixture. Indeed, this is impossible in general given the stated sample complexity in light of our lower bound. Nonetheless, our bounds do imply a proper learning algorithm for the mixture itself (which is a strictly weaker task than learning both components). In the case where our algorithm for learning under total variation norm implies the best known bounds also for this weaker task when no assumptions are placed on the mixture.
1.4 Proof overview
We now give a high-level outline of our algorithmic approach (and the related approach of Pearson). The starting point for the method of moments is to set up a system of polynomial equations whose coefficients are determined by the moments of the mixture and whose variables are the unknown parameters. Solving the system of polynomial equations recovers the unknown parameters. The main stumbling block is that the roots of polynomials are notoriously unstable with respect to small perturbations in the coefficients. A famous example is Wilkinson’s polynomial. Perturbations arise inevitably in our context because we do not know the moments of the mixture model exactly but rather need to estimate them empirically from samples. Our main contribution is to exhibit a robust set of polynomial equations from which the parameters can be recovered. We hope that similar techniques may be useful in extending our results to other settings such as learning a mixture of more than two gaussians.
We begin by reparametrizing the gaussian mixture
in such a way to get parameters that are independent of adding
gaussian noise to the mixture. Formally, adding or subtracting the
same term from each of the variances leaves these parameters
unchanged. Assuming the overall mean of the mixture is , this
leaves us with free parameters that we call
Since these parameters are independent of adding gaussian noise it is
useful to also define the moments of the mixture in such a way that
they are independent under adding gaussian noise. This is accomplished
by considering what we call excess moments. The name is
inspired by the term excess kurtosis
excess kurtosis, a well-known measure of “peakedness” of a distribution introduced in [Pea94] that corresponds to the fourth excess moment. At this point, the third through sixth excess moments give us four equations in the three variables .
Three different precision regimes.
Our analysis distinguishes between three different parameter regimes. In the first parameter regime we know each excess moment for up to an additive error of This analysis is applicable when the means are separated and it leads to the first case in Theorem 3.10. The second regime is when the separation between the means is small, but we nevertheless know each excess moment up to error This analysis in this case applies when the variances are separated and leads to the second case in Theorem 3.10. Finally, when neither of the cases applies the two gaussians are indistinguishable and we simply fit a single gaussian. We show that we can figure out which parameter regime we’re in and run the appropriate algorithm.
We focus here on a discussion of the first parameter regime, since it is the most interesting case. The full argument is in Section 3.3.
Robustifying Pearson’s polynomial.
Expressing the excess moments in terms of our new parameters we can derive in a fairly natural manner a ninth degree polynomial whose coefficients depend on and so that has to satisfy The polynomial was already used by Pearson. Unfortunately, can have multiple roots and this is to be expected since moments are not sufficient to identify a mixture of two gaussians. Pearson computed the mixtures associated with each of the roots and threw out the invalid solutions (e.g. the ones that give imaginary variances), getting two valid mixtures that matched on the first five moments. He then chose the one whose sixth moment was closest to the observed sixth moment.
We proceed somewhat differently from Pearson after computing . First, we use the first 4 excess moments to compute an upper bound on . We show that the set of valid mixtures that match the first 5 moments correspond precisely to the roots of with . We then derive another ninth degree polynomial using and that we call We prove that is the only solution to the system of equations
This approach isn’t yet robust to small perturbations in the moments; for example, if has a double root at , it may have no nearby root after perturbation. We therefore consider the polynomial which we know is zero at We argue that is significantly nonzero for any significantly far from . This is the main technical claim we need.
For intuition of why this is the case, consider the normalization and the setting where . Because the excess moments are polynomials in we can think of as a polynomial in . We are interested in some region where every root of corresponds to a mixture matching the first six moments. Because six moments suffice to identify the mixture by [KMV10], has no roots in outside . This lets us show that has no roots over , which for a compact implies that over . Thus over the region of interest.
Now, with samples we can estimate all the to , which lets us estimate both and to . This means is estimated to . Since , this lets us find to . We then work back through our equations to get and to , which give the and to .
The analysis proceeds slightly differently in the setting where . In this setting the region of interest is not compact, because the parameter (which here equals ) is unbounded. However, we can show directly that the highest (12th) degree coefficient of in is bounded away from zero, getting that . Since the are now not constant, while we can estimate each to with samples, we only estimate and to . Since , this lets us estimate to . This is sufficient to recover to , which lets us recover to and then the and to .
In Section 4 we extend our theorem to arbitrary dimensional mixtures using two simple ideas. The first idea is used to reduce the -dimensional case to the -dimensional case and is straightforward. The second argument reduces the -dimensional case to the -dimensional and is only slightly more involved. How can we use an algorithm for to solve the problem in arbitrary dimension? Consider the case where differ in some entry . We can find by running our assumed algorithm for all pairs of variables. Each pair of variables leads to a two-dimensional mixture problem where the covariances are obtained by restricting to the corresponding entries. Once we have found an entry where we are in good shape. We now iterate over all and solve the -dimensional mixture problem on the variables to within accuracy This not only reveals an additional entry of the covariance matrix but it also tells us which of the two values for position is associated with which of the values for position This is because we solved the -dimensional problem to accuracy and we know that Hence, each newly recovered value for position must be close to the value that we previously recovered. This ensures that we do not mix up any entries and so we recover the covariance matrices entry by entry. A similar but simpler argument works for the means.
Finally, the four-dimensional problem reduces to one dimension by brute forcing over an -net of all possible four-dimensional solutions (which is now doable in polynomial time) and using the algorithm for to verify whether we picked a valid solution. The verification works by projecting the four-dimensional mixture in a random direction. Using anti-concentration results for quadratic forms in gaussian variables, we can show that any covariance matrix -far from the true covariance matrices will be ruled out with constant probability by each projection. Therefore projections will identify the covariance matrices among the possibilities. A union bound requires , giving overhead beyond the -dimensional algorithm.
2 Lower bounds
2.1 Mixtures with matching moments are very close under Gaussian noise
Our main lemma shows that if we have two gaussian mixtures whose first
moments are matching and we add a gaussian random variableto each mixture, then the resulting distributions are -close in squared Hellinger distance. The idea is illustrated in Figure 1.
Let be probability distributions that are absolutely
continuous with respect to the Lebesgue measure. Let
be probability distributions that are absolutely continuous with respect to the Lebesgue measure. Letand denote density functions of and respectively. Then, the squared Hellinger distance between and is defined as
Let and be distributions that are subgaussian with constant parameters and identical first moments for . Let and for . Then
We have that and are subgaussian with constant parameters, i.e., for any we have
and similarly for . Denote by density functions of respectively. We would like to bound
We split the integral (2) into two regimes, and for .
For the regime, we have
The challenging part is the regime.
For we have
Let be such that Let be such that Note that since all the parameters of are constant. In particular, denoting by the density of we have for every