Gaussian mixture models are one of the most widely used statistical models for clustering. In this model, we are given random samples, where each sample point $x \in \mathbb{R}^d$ is drawn independently from one of $k$ Gaussian components according to mixing weights $w_1, \dots, w_k$, where the $i$th Gaussian component has a mean $\mu_i$ and a covariance $\Sigma_i$. We focus on an important special case of the problem where each of the components is a spherical Gaussian, i.e., the covariance matrix of each component is a multiple of the identity, $\Sigma_i = \sigma_i^2 I$. If $f$ represents the p.d.f. of the Gaussian mixture, and $f_i$ represents the p.d.f. of the $i$th Gaussian component, then $f(x) = \sum_{i=1}^{k} w_i f_i(x)$.
The goal is to estimate the parameters $\{(w_i, \mu_i, \sigma_i)\}_{i=1}^{k}$ up to required accuracy $\delta > 0$ in time and number of samples that are polynomial in $k$, $d$, and $1/\delta$.
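As a concrete reference for the sampling model, here is a minimal Python sketch (helper names like `sample_mixture` are ours, not from the paper): each sample picks one of the $k$ means uniformly and adds independent unit-variance noise in every coordinate.

```python
import random

def sample_mixture(means, n, sigma=1.0, rng=None):
    """Draw n i.i.d. samples from a uniform mixture of spherical
    Gaussians in R^d with the given component means (list of tuples)."""
    rng = rng or random.Random(0)
    k = len(means)
    samples = []
    for _ in range(n):
        i = rng.randrange(k)  # uniform mixing weights w_i = 1/k
        samples.append(tuple(m + rng.gauss(0.0, sigma) for m in means[i]))
    return samples

# Two well-separated components in R^2.
means = [(0.0, 0.0), (10.0, 0.0)]
pts = sample_mixture(means, 4000)
avg_x = sum(p[0] for p in pts) / len(pts)
# The empirical mean should sit near the mixture mean (5, 0).
```

Here the weights and variances are fixed to the "standard mixture" setting studied below; general spherical mixtures would sample the component index according to the $w_i$ instead.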
Learning mixtures of Gaussians has a long and rich history, starting with the work of Pearson. (See Section 1.2 for an overview of prior work.) Most of the work on this problem, especially in the early years but also recently, is under the assumption that there is some minimum separation between the means of the components in the mixture. Starting with work by Dasgupta, and continuing with a long line of work (including [4, 45, 2, 29, 39, 17, 14, 30, 7, 8, 46, 19]), efficient algorithms were found under mild separation assumptions. Considering for simplicity the case of uniform mixtures (i.e., all weights are $1/k$) of standard Gaussians (i.e., spherical with unit variance), the best known result, due to Vempala and Wang, provides an efficient algorithm (both in terms of samples and running time) under separation of at least $\tilde{\Omega}(k^{1/4})$ between any two means.
A big open question in the area is whether efficient algorithms exist under weaker separation assumptions. It is known that when the separation is sufficiently small, a super-polynomial number of samples is required (e.g., [35, 3, 26]), but the gap between this lower bound and the above upper bound of $\tilde{\Omega}(k^{1/4})$ is quite wide. Can it be that efficient algorithms exist under only $O(\sqrt{\log k})$ separation? In fact, prior to this work, this was open even in the case of $d = 1$.
What is the minimum order of separation that is needed to learn the parameters of a mixture of $k$ spherical Gaussians up to accuracy $\delta$ using $\mathrm{poly}(k, d, 1/\delta)$ samples?
1.1 Our Results
By improving both the lower bounds and the upper bounds mentioned above, we characterize (up to constants) the minimum separation needed to learn the mixture from polynomially many samples. Our first result shows super-polynomial lower bounds when the separation is of the order of $\sqrt{\log k}$. In what follows, $\Delta_{\mathrm{param}}$ represents the "distance" between the parameters of the two mixtures of Gaussians (see Definition 2.2 for the precise definition).
Informal Theorem 1.2 (Lower Bounds).
For any sufficiently small $\varepsilon > 0$, there are two uniform mixtures of standard spherical Gaussians in $O(\log k)$ dimensions with means $\{\mu_i\}_{i=1}^k$ and $\{\mu'_i\}_{i=1}^k$ respectively, that are well separated, i.e.,
$$\forall i \ne j:\quad \|\mu_i - \mu_j\|_2 \ge \Omega(\sqrt{\varepsilon \log k}) \quad \text{and} \quad \|\mu'_i - \mu'_j\|_2 \ge \Omega(\sqrt{\varepsilon \log k}),$$
and whose parameter distance is large (at least $1/\mathrm{poly}(k)$), but which have very small statistical distance $k^{-\omega(1)}$.
The above statement implies that we need super-polynomially many samples to distinguish between the two mixtures, and hence to identify the parameters. See Theorem 3.1 for a formal statement of the result. In fact, these sample complexity lower bounds hold even when the means of the Gaussians are picked randomly in a ball of bounded radius in $O(\log k)$ dimensions. This rules out obtaining smoothed analysis guarantees for small dimensions (as opposed to [10, 3], which give polynomial-time algorithms for smoothed mixtures of Gaussians in higher dimensions).
Our next result shows that the $\sqrt{\log k}$ order of separation is tight – this separation suffices to learn the parameters of the mixture with polynomially many samples. We state the theorem for the special case of uniform mixtures of spherical Gaussians. (See Theorem 5.1 for the formal statement.)
Informal Theorem 1.3 (Tight Upper Bound in terms of $k$).
There exists a universal constant $c > 0$, such that given samples from a uniform mixture of standard spherical Gaussians in $\mathbb{R}^d$ with well-separated means, i.e.,
$$\forall i \ne j:\quad \|\mu_i - \mu_j\|_2 \ge c\sqrt{\log k},$$
there is an algorithm that for any $\delta > 0$ uses only $\mathrm{poly}(k, d, 1/\delta)$ samples and with high probability finds $\mu'_1, \dots, \mu'_k$ satisfying $\|\mu_i - \mu'_{\pi(i)}\|_2 \le \delta$ for some permutation $\pi$.
While the above algorithm uses only polynomially many samples, it is computationally inefficient.
Our next result shows that in constant dimensions, one can obtain a computationally efficient algorithm. In fact, in such low dimensions a separation of order $\sqrt{\log k}$ suffices.
Informal Theorem 1.4 (Efficient algorithm in low dimensions).
There exists a universal constant $c > 0$, such that given samples from a uniform mixture of standard spherical Gaussians in $\mathbb{R}^d$ with well-separated means, i.e., $\|\mu_i - \mu_j\|_2 \ge c\sqrt{\log k}$ for all $i \ne j$, there is an algorithm that for any $\delta > 0$ uses time (and samples) polynomial in $k$ and $1/\delta$ (with the exponent depending on the dimension $d$) and with high probability finds $\mu'_1, \dots, \mu'_k$ satisfying $\|\mu_i - \mu'_{\pi(i)}\|_2 \le \delta$ for some permutation $\pi$.
See Theorem 6.1 for a formal statement. An important feature of the above two algorithmic results is that the separation is independent of the accuracy $\delta$ that we desire in parameter estimation ($\delta$ can be arbitrarily small compared to $1/k$ and $1/d$). These results together give an almost tight characterization (up to constants) of the amount of separation needed to learn with polynomially many samples.
The core technical portion of Theorem 1.3 and Theorem 1.4 is a new iterative algorithm, which is the main algorithmic contribution of the paper. This algorithm takes coarse estimates of the means, and iteratively refines them to get arbitrarily good accuracy $\delta$. We now present an informal statement of the guarantees of the iterative algorithm.
Informal Theorem 1.5 (Iterative Algorithm Guarantees).
There exists a universal constant $c > 0$, such that given samples from a uniform mixture of standard spherical Gaussians in $\mathbb{R}^d$ with well-separated means, i.e., $\|\mu_i - \mu_j\|_2 \ge c\sqrt{\log k}$ for all $i \ne j$, and given initializers $z_1, \dots, z_k$ for the means that are sufficiently close to $\mu_1, \dots, \mu_k$, there is an iterative algorithm that, for any $\delta > 0$, runs in $\mathrm{poly}(k, d, 1/\delta)$ time (and samples) and, after $O(\log\log(1/\delta))$ iterations, finds with high probability $\mu'_1, \dots, \mu'_k$ such that $\|\mu_i - \mu'_i\|_2 \le \delta$ for each $i$.
The above theorem also holds when the weights and variances are unequal. See Theorem 4.1 for a formal statement. Note that in the above result, the desired accuracy $\delta$ can be arbitrarily small compared to $1/k$ and $1/d$, and the separation required does not depend on $\delta$. To prove the polynomial identifiability results (Theorems 1.3 and 1.4), we first find coarse estimates of the means that serve as initializers to this iterative algorithm, which then recovers the means up to arbitrarily fine accuracy independent of the separation.
The algorithm works by solving a system of non-linear equations that is obtained by estimating simple statistics (e.g., means) of the distribution restricted to certain carefully chosen regions. We prove that the system of non-linear equations satisfies a notion of “diagonal dominance” that allows us to leverage iterative algorithms like Newton’s method and achieve rapid (quadratic) convergence.
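The flavor of this convergence argument can be seen on a toy example. The following Python sketch (our own illustration; the equations are not the ones built from the mixture) runs Newton's method on a small "diagonally dominant" non-linear system, in which each equation is dominated by its own variable with weak coupling through the other; the residuals collapse at the rapid quadratic rate described above.

```python
import math

def F(v):
    # Weakly coupled system: each equation is dominated by its own variable.
    x, y = v
    return [x + 0.2 * math.tanh(y) - 1.0,
            y + 0.2 * math.tanh(x) - 2.0]

def J(v):
    # Exact Jacobian; the diagonal entries (1.0) dominate the off-diagonals.
    x, y = v
    sech2 = lambda t: 1.0 / math.cosh(t) ** 2
    return [[1.0, 0.2 * sech2(y)],
            [0.2 * sech2(x), 1.0]]

def newton_step(v):
    # Solve J d = F by Cramer's rule and update v <- v - d.
    f, j = F(v), J(v)
    det = j[0][0] * j[1][1] - j[0][1] * j[1][0]
    dx = (f[0] * j[1][1] - f[1] * j[0][1]) / det
    dy = (j[0][0] * f[1] - j[1][0] * f[0]) / det
    return [v[0] - dx, v[1] - dy]

v = [0.0, 0.0]                       # coarse initializer
errs = []
for _ in range(6):
    v = newton_step(v)
    errs.append(max(abs(t) for t in F(v)))
# Residuals shrink roughly quadratically: each is about the square of the last.
```

The diagonal dominance of `J` is what keeps its inverse well-behaved along the whole trajectory, mirroring (in miniature) the role of the dominance condition in the analysis.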
The techniques developed here can find such initializers using only $\mathrm{poly}(k, d)$ many samples, but use time that is exponential in $k$. This leads to the following natural open question:
Open Question 1.6.
Given a mixture of $k$ spherical Gaussians with equal weights and variances, and with separation
$$\forall i \ne j:\quad \|\mu_i - \mu_j\|_2 \ge c\sqrt{\log k}$$
for some sufficiently large absolute constant $c$, is there an algorithm that recovers the parameters up to accuracy $\delta$ in time $\mathrm{poly}(k, d, 1/\delta)$?
Our iterative algorithm shows that to resolve this open question affirmatively, it is enough to find initializers that are reasonably close to the true parameters. In fact, a simple amplification argument shows that initializers that are $O(\sqrt{\log k})$-close to the true means will suffice for this approach.
Our iterative algorithm is reminiscent of some commonly used iterative heuristics, such as Lloyd’s Algorithm and especially Expectation Maximization (EM). While these iterative methods are the practitioners’ method-of-choice for learning probabilistic models, they have been notoriously hard to analyze. We believe that the techniques developed here may also be useful to prove guarantees for these heuristics.
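For concreteness, here is a small Python sketch (our own illustration, not an algorithm from the paper) of the EM update for a uniform mixture of two unit-variance Gaussians in one dimension; with well-separated components and reasonable initializers, the iteration homes in on the true means.

```python
import math
import random

def em_step(data, mu, sigma=1.0):
    """One EM iteration for a uniform mixture of two unit-variance
    Gaussians in 1-D: the E-step computes posterior component weights,
    the M-step re-averages the data under those weights."""
    num = [0.0, 0.0]
    tot = [0.0, 0.0]
    for x in data:
        w = [math.exp(-0.5 * ((x - m) / sigma) ** 2) for m in mu]
        s = w[0] + w[1]
        for i in (0, 1):
            num[i] += (w[i] / s) * x
            tot[i] += w[i] / s
    return [num[i] / tot[i] for i in (0, 1)]

rng = random.Random(1)
true_mu = [-3.0, 3.0]
data = [rng.gauss(true_mu[rng.randrange(2)], 1.0) for _ in range(3000)]

mu = [-1.0, 1.0]          # crude initializers on the correct sides
for _ in range(20):
    mu = em_step(data, mu)
# With this separation, EM recovers the means up to sampling error.
```

Analyzing why such updates converge from coarse initializers (and how much separation they need) is exactly the kind of question the techniques in this paper target.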
1.2 Prior Work and Comparison of Results
Algorithmic results fall into two broad classes — separation-based results, and moment-based methods that do not assume explicit geometric separation.
The body of work that is most relevant to this paper assumes that there is some minimum separation between the means of the components in the mixture. The first polynomial time algorithmic guarantees for mixtures of Gaussians were given by Dasgupta, who showed how to learn mixtures of spherical Gaussians when the separation is of the order of $\sqrt{d}$. This was later improved by a series of works [4, 45, 2, 29, 17, 14, 30, 7] for both spherical Gaussians and general Gaussians. The algorithm of Vempala and Wang gives the best known result, and uses PCA along with distance-based clustering to learn mixtures of spherical Gaussians with separation $\tilde{\Omega}(k^{1/4})$.
We note that all these clustering-based algorithms require a separation that either implicitly or explicitly depends on the estimation accuracy $\delta$. (Such a dependency on $\delta$ seems necessary for clustering-based algorithms that cluster every point accurately with high probability.) Finally, although not directly related to our work, we note that a similar separation condition was shown to suffice also for non-spherical Gaussians, where separation is measured based on the variance along the direction of the line joining the respective means (as opposed, e.g., to the sum of maximum variances, which could be much larger).
Iterative methods like Expectation Maximization (EM) and Lloyd's algorithm (sometimes called the $k$-means heuristic) are commonly used in practice to learn mixtures of spherical Gaussians but, as mentioned above, are notoriously hard to analyze. Dasgupta and Schulman proved that a variant of the EM algorithm learns mixtures of spherical Gaussians with separation of the order of $d^{1/4}$. Kumar and Kannan and subsequent work showed that the spectral clustering heuristic (i.e., PCA followed by Lloyd's algorithm) provably recovers the clusters in a rather wide family of distributions which includes non-spherical Gaussians; in the special case of spherical Gaussians, their analysis requires separation of order $\sqrt{k}$.
Very recently, the EM algorithm was shown to succeed for mixtures of two spherical Gaussians with constant separation [8, 46, 19] (we note that in this setting with $k = 2$, polynomial time guarantees are also known using other algorithms like the method-of-moments, as we will see in the next paragraph). SDP-based algorithms have also been studied in the context of learning mixtures of spherical Gaussians, with a similar separation requirement. The question of how much separation between the components is necessary was also studied empirically by Srebro et al., who observed that iterative heuristics successfully learn the parameters under much smaller separation compared to known theoretical bounds.
In a series of influential results, algorithms based on the method-of-moments were developed by [28, 35, 9] for efficiently learning mixtures of Gaussians under arbitrarily small separation. To perform parameter estimation up to accuracy $\delta$, the running time of the algorithms is exponential in $k$ (this holds for mixtures of general Gaussians). This exponential dependence on $k$ is necessary in general, due to statistical lower bound results. The dependence of the running time on $\delta$ was improved in the case of mixtures of two Gaussians.
Several works use uniqueness of tensor decompositions (of order 3 and above) to implement the method of moments and give polynomial time algorithms, assuming the means are sufficiently high-dimensional and do not lie in certain degenerate configurations. Hsu and Kakade gave a polynomial time algorithm based on tensor decompositions to learn a mixture of spherical Gaussians when the means are linearly independent. This was extended by [25, 10, 3] to give smoothed analysis guarantees for learning "most" mixtures of spherical Gaussians when the means lie in sufficiently many dimensions. These algorithms do not assume any strict geometric separation conditions and learn the parameters in polynomial time (and samples) when these non-degeneracy assumptions hold. However, there are many settings where the Gaussian mixture consists of many clusters in a low-dimensional space, or has its means lying in a low-dimensional subspace or manifold, where these tensor decomposition guarantees do not apply. Besides, these algorithms based on tensor decompositions seem less robust to noise than clustering-based approaches and iterative algorithms, giving further impetus to the study of the latter algorithms, as we do in this paper.
Moitra and Valiant showed that $\exp(\Omega(k))$ samples are needed to learn the parameters of a mixture of $k$ Gaussians. In fact, their lower bound instance is one dimensional, with separation much smaller than ours. Anderson et al. proved a lower bound on sample complexity that is reminiscent of our Theorem 1.2: they obtain a super-polynomial lower bound under a smaller separation. This is in contrast to our lower bound, which allows separation as large as order $\sqrt{\log k}$.
Other Related work.
While most of the previous work deals with estimating the parameters (e.g., means) of the Gaussian components in the given mixture, a recent line of work [23, 40, 18] focuses on the task of learning a mixture of Gaussians (with possibly very different parameters) that is close to the given mixture in statistical distance (this is called "proper learning", since the hypothesis that is output is also a mixture of Gaussians). When identifiability using polynomial samples is known for the family of distributions, proper learning automatically implies parameter estimation. Algorithms for properly learning mixtures of spherical Gaussians [40, 18] give polynomial sample complexity bounds in some regimes (note that the statistical lower bounds do not apply here), but have running time that is exponential in $k$. Polynomial-time algorithms are also known for "improper learning" and density estimation for mixtures of Gaussians in low dimension (the hypothesis class that is output may not be a mixture of Gaussians) [15, 12]. We note that known algorithms have sample or time complexity that is exponential in $k$ or in $d$, even though proper learning and improper learning are easier tasks than parameter estimation. To the best of our knowledge, better bounds are not known under additional separation assumptions.
A related problem in the context of clustering graphs and detecting communities is the problem of learning a stochastic block model or planted partitioning model. Here, a sharp threshold phenomenon involving an analogous separation condition (between intra-cluster probability of edges and inter-cluster edge probability) is known under various settings [36, 32, 37] (see the recent survey by Abbe for details). In fact, the algorithm of Kumar and Kannan gives a general separation condition that specializes to separation between the means for mixtures of Gaussians, and separation between the intra-cluster and inter-cluster edge probabilities for the stochastic block model.
1.3 Overview of Techniques
Lower bound for $\sqrt{\log k}$ separation.
The sample complexity lower bound proceeds by showing a more general statement: in any large enough collection of uniform mixtures, for all but a small fraction of the mixtures, there is at least one other mixture in the collection that is close in statistical distance (see Theorem 3.2). For our lower bounds, we will just produce a large collection of uniform mixtures of well-separated spherical Gaussians in $O(\log k)$ dimensions, whose pairwise parameter distances are reasonably large. In fact, we can even pick the means of these mixtures randomly in a ball of bounded radius in $O(\log k)$ dimensions; w.h.p. most of these mixtures will need super-polynomially many samples to identify.
To show the above pigeonhole-style statement about large collections of mixtures, we will associate with a uniform mixture having means $\mu_1, \dots, \mu_k$ the following quantities that we call "mean moments," and we will use them as a proxy for the actual moments of the distribution:
$$M_r = \frac{1}{k} \sum_{i=1}^{k} \mu_i^{\otimes r}, \qquad r = 1, 2, \dots$$
The mean moments just correspond to the usual moments of a mixture of delta functions centered at $\mu_1, \dots, \mu_k$. Closeness in the low-order mean moments (measured in injective tensor norm) implies that the two corresponding distributions are close in statistical distance (see Lemma 3.7 and Lemma 3.8). The key step in the proof uses a careful packing argument to show that for most mixtures in a large enough collection, there is a different mixture in the collection that approximately matches in the low-order mean moments (see Lemma 3.6).
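The pigeonhole phenomenon behind this argument is easy to observe numerically. In the following Python sketch (a 1-D illustration with parameters of our choosing, where the injective norm of the order-$r$ moment reduces to the scalar $\frac{1}{k}\sum_i \mu_i^r$), a moderately large collection of mean configurations must contain a pair that nearly collides in its first few mean moments, simply because the moment vectors live in a bounded low-dimensional set.

```python
import random
from itertools import combinations

def mean_moments(means, R):
    """First R 'mean moments' of a uniform mixture of point masses
    at the given 1-D means: m_r = (1/k) * sum_i mu_i**r."""
    k = len(means)
    return [sum(m ** r for m in means) / k for r in range(1, R + 1)]

rng = random.Random(0)
# A collection of 1-D mean configurations, each with k = 3 means in [-1, 1].
collection = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(800)]
moms = [mean_moments(c, 3) for c in collection]

# Pigeonhole: the moment vectors live in a bounded 3-dimensional box, so
# among 800 configurations some pair must nearly collide in all moments.
best = min(max(abs(a - b) for a, b in zip(moms[i], moms[j]))
           for i, j in combinations(range(len(moms)), 2))
```

In the actual proof the collection size is chosen exponentially large in the dimension of the moment space, which forces the collision gap `best` to be small enough to translate into statistical closeness.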
Our iterative algorithm will function in both settings of interest: the high-dimensional setting with $\Omega(\sqrt{\log k})$ separation, and the low-dimensional setting where $d = O(1)$. For the purpose of this description, let us assume the target accuracy $\delta$ is arbitrarily small compared to $1/k$ and $1/d$. In our proposed algorithm, we will consider distributions obtained by restricting the support to certain regions around the initializers $z_1, \dots, z_k$, which are somewhat close to the true means $\mu_1, \dots, \mu_k$ respectively. Roughly speaking, we first partition the space into the Voronoi partition given by $z_1, \dots, z_k$, and then for each component $i \in [k]$, let $S_i$ denote the region containing $z_i$ (see Definition 4.3 for details). For each $i$, we consider only the samples in the set $S_i$, and let $u_i$ be the (sample) mean of these points, after subtracting $z_i$.
The regions are chosen in such a way that $S_i$ contains a large fraction of the probability mass from the $i$th component, while the total probability mass contributed to $S_i$ by the other components is relatively small. However, since $\delta$ can be an arbitrarily small function of $k$ and $d$, there can still be a relatively large contribution from the other components compared to $\delta$; for instance, in the low-dimensional case, a single neighboring component can contribute a non-negligible amount of mass. Hence, $u_i$ does not give a $\delta$-close estimate for $\mu_i$ (even up to scaling), unless the separation is significantly larger than the $\sqrt{\log k}$ order that we assume.
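This leakage effect is easy to see in one dimension. The following Python sketch (with illustrative numbers of our choosing) estimates a component mean by naively averaging the samples in its Voronoi region; the estimate stays biased away from the true mean no matter how many samples are used, because of both the truncation of the component's own tail and the mass leaking in from its neighbor.

```python
import random

rng = random.Random(0)
delta = 2.0                       # separation between the two true means (0 and delta)
n = 100000
data = [rng.gauss(0.0, 1.0) if rng.randrange(2) == 0
        else rng.gauss(delta, 1.0) for _ in range(n)]

# Naive estimate of the first mean: average the samples falling in the
# Voronoi region of the initializer 0, i.e. everything left of delta/2.
region = [x for x in data if x < delta / 2]
naive = sum(region) / len(region)

# The estimate is pulled away from the true mean 0 by a fixed amount:
# the region cuts off the first component's right tail and also admits
# mass leaking in from the second component. More samples do not help.
bias = abs(naive - 0.0)
```

This persistent bias is exactly why the algorithm cannot simply read off the restricted means, and instead treats them as statistics in a system of equations to be solved.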
Instead, we will use these statistics to set up a system of non-linear equations in which the unknowns are the true parameters, and solve for them using the Newton method. We will use the initializers $z_1, \dots, z_k$ to define the statistics that give our equations. Hence the unknown parameters satisfy, for each $i \in [k]$,
$$u_i = \frac{1}{k} \sum_{j=1}^{k} \int_{S_i} (x - z_i)\, g_{\mu_j, \sigma}(x)\, dx,$$
where $g_{\mu, \sigma}$ denotes the p.d.f. of a spherical Gaussian with mean $\mu$ and variance $\sigma^2$ in each direction.
Note that in the above equation, the only unknowns or variables are the true means $\mu_1, \dots, \mu_k$. After scaling the equations, and a suitable change of variables to make the system "dimensionless," we get a non-linear system of equations denoted by $F(x) = 0$, whose solution $x^*$ is given by the parameters of the mixture. The Newton algorithm uses the iterative update $x^{(t+1)} = x^{(t)} - F'(x^{(t)})^{-1} F(x^{(t)})$.
For the Newton method we need access to estimates for $F$ and for the derivative matrix (the Jacobian) $F'$ evaluated at the current point. The derivative of the $i$th equation with respect to $\mu_j$ corresponds to
$$\frac{1}{k} \int_{S_i} (x - z_i)\, \frac{(x - \mu_j)^\top}{\sigma^2}\, g_{\mu_j, \sigma}(x)\, dx,$$
where $g_{\mu_j, \sigma}$ represents the p.d.f. at a point due to a spherical Gaussian with mean $\mu_j$ and variance $\sigma^2$ in each direction. Unlike usual applications of the Newton method, we do not have closed-form expressions for the entries of the Jacobian, due to our definition of the sets $S_i$. However, we will instead be able to estimate the Jacobian by calculating the above expression (RHS) with a Gaussian of mean $z_j$ and the same variance. The Newton method can be shown to be robust to such errors in the Jacobian (see Theorem B.2).
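The following Python sketch mimics this situation in one dimension (the region, statistic, and constants are our own simplifications): the statistic is the mean of a Gaussian restricted to a half-line, computed only by quadrature, and Newton's method is run with a finite-difference derivative in place of a closed-form Jacobian. The iteration still recovers the true mean.

```python
import math

def region_mean(mu, c, lo=-10.0, hi=10.0, n=4000):
    """Mean of a unit-variance Gaussian with mean mu restricted to the
    region {x < c}, computed by quadrature: no closed form is needed."""
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n + 1):
        x = lo + i * h
        if x < c:
            p = math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)
            num += h * x * p
            den += h * p
    return num / den

c = 1.0
target = region_mean(0.0, c)     # statistic generated by the true mean 0

mu, step = 0.4, 1e-4             # coarse initializer for the unknown mean
for _ in range(10):
    s = region_mean(mu, c)
    deriv = (region_mean(mu + step, c) - s) / step   # finite-difference Jacobian
    mu -= (s - target) / deriv
```

The same idea scales to the full system: approximate Jacobian entries computed around the initializers are accurate enough for the (robust) Newton iteration to converge.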
We want to learn each of the means up to good accuracy; hence we will measure the error and convergence in the $\ell_\infty$ norm. This is important in low dimensions, since measuring convergence in the $\ell_2$ norm would introduce extra $\mathrm{poly}(k)$ factors, which are prohibitive for us since the means are separated by only $O(\sqrt{\log k})$. The convergence of Newton's method depends on upper bounding the operator norm of the inverse of the Jacobian and of the second derivative $F''$, with the initializers being chosen close enough to the true parameters that the iterates remain in the region of quadratic convergence.
The main technical effort for proving convergence is in showing that the inverse of the Jacobian, evaluated at any point in the neighborhood around the true parameters, is well-conditioned. We will show the convergence of the Newton method by establishing "diagonal dominance" properties of the Jacobian. This uses the separation between the means of the components, and the properties of the regions $S_i$ that we have defined. With $\Omega(\sqrt{\log k})$ separation in high dimensions, this uses standard facts about Gaussian concentration to argue that each of the off-diagonal blocks (in the $i$th row of the Jacobian) is small compared to the corresponding diagonal term. With the same order of separation in constant dimensions, we cannot hope to get such a uniform bound on all the off-diagonal blocks (a single off-diagonal block can itself be comparable to the corresponding diagonal entry). We will instead use careful packing arguments to establish the required diagonal dominance condition (see Lemma 4.13 for a statement).
Consider a mixture of $k$ spherical Gaussians in $\mathbb{R}^d$ with parameters $\{(w_i, \mu_i, \sigma_i)\}_{i=1}^{k}$. The $i$th component has mean $\mu_i$ and covariance $\sigma_i^2 I$. For $\mu \in \mathbb{R}^d$ and $\sigma > 0$, let $g_{\mu, \sigma}$ represent the p.d.f. of a spherical Gaussian centered at $\mu$ and with covariance $\sigma^2 I$. We will use $f$ to represent the p.d.f. of the mixture of Gaussians, and $f_i$ to represent the p.d.f. of the $i$th Gaussian component.
Definition 2.1 (Standard mixtures of Gaussians).
A standard mixture of Gaussians with means $\mu_1, \dots, \mu_k \in \mathbb{R}^d$ is a mixture of $k$ spherical Gaussians of the form $\{(1/k, \mu_i, 1)\}_{i=1}^{k}$.
A standard mixture is just a uniform mixture of spherical Gaussians with all covariances equal to the identity. Before we proceed, we define the following notion of parameter "distance" between mixtures of Gaussians:
Definition 2.2 (Parameter distance).
Given two mixtures of Gaussians in $\mathbb{R}^d$, $G = \{(w_i, \mu_i, \sigma_i)\}_{i=1}^{k}$ and $G' = \{(w'_i, \mu'_i, \sigma'_i)\}_{i=1}^{k}$, define $\Delta_{\mathrm{param}}(G, G')$ to be the minimum, over permutations $\pi$ of $[k]$, of the largest discrepancy between the matched parameters (weights, means, and variances, with the mean and variance terms suitably normalized by the variances). For standard mixtures, the definition simplifies to
$$\Delta_{\mathrm{param}}(G, G') = \min_{\pi} \max_{i \in [k]} \|\mu_i - \mu'_{\pi(i)}\|_2.$$
Note that this definition is invariant to scaling the variances (for convenience). We note that parameter distance is not a metric, but it is a convenient way of measuring closeness of parameters between two distributions.
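For standard mixtures, the simplified definition is straightforward to compute directly for small $k$; a Python sketch (restricted to means only, which is all that matters for standard mixtures; for large $k$ one would use a matching algorithm rather than brute force over permutations):

```python
import math
from itertools import permutations

def param_distance(mus, nus):
    """Parameter 'distance' for standard mixtures: minimize over all
    matchings of components, then take the largest Euclidean distance
    between matched means."""
    def d2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(max(d2(m, n) for m, n in zip(mus, perm))
               for perm in permutations(nus))

A = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
B = [(0.1, 0.0), (0.0, 5.2), (5.0, 0.0)]   # same components, listed in another order
d = param_distance(A, B)                    # best matching gives max gap 0.2
```

The minimum over permutations is what makes this a measure of closeness between unordered sets of components rather than between ordered parameter vectors.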
The distance between two individual Gaussian components can also be measured in terms of the total variation distance between the components. For instance, in the case of standard spherical Gaussians, a small parameter distance $\delta$ corresponds to a total variation distance of order $\delta$ between the corresponding components.
Definition 2.3 (-bounded mixtures).
For $B \ge 1$, a mixture of spherical Gaussians $\{(w_i, \mu_i, \sigma_i)\}_{i=1}^{k}$ in $\mathbb{R}^d$ is called $B$-bounded if for each $i \in [k]$, $\|\mu_i\|_2 \le B$ and $1/B \le \sigma_i \le B$. In particular, a standard mixture is $B$-bounded if for each $i \in [k]$, $\|\mu_i\|_2 \le B$.
Also, for a given mixture of spherical Gaussians $\{(w_i, \mu_i, \sigma_i)\}_{i=1}^{k}$, we will denote by $\sigma_{\max}$ and $\sigma_{\min}$ the largest and smallest standard deviations, and by $w_{\max}$ and $w_{\min}$ the largest and smallest mixing weights.
In the above notation, the bound $B$ can be thought of as a sufficiently large polynomial in the other parameters, since we are aiming for bounds that are polynomial in them. Since we can always scale the points by an arbitrary factor without affecting the performance of the algorithm, we can think of $B$ as the (multiplicative) range of values taken by the parameters. Since we want separation bounds independent of $B$, we will use the individual aspect ratios for variances and weights, given by $\sigma_{\max}/\sigma_{\min}$ and $w_{\max}/w_{\min}$.
Finally, we list some of the conventions used in this paper. We will denote by $N(\mu, \sigma^2)$ a normal random variable with mean $\mu$ and variance $\sigma^2$. For $X$ generated according to $N(0, 1)$, let $\Phi(t)$ denote the probability that $X > t$, and let $\Phi^{-1}(p)$ denote the quantile $t$ at which $\Phi(t) = p$. For any function $F$, $F'$ will denote the first derivative (or gradient) of the function, and $F''$ will denote the second derivative (or Hessian). We define $\|f\|_S$ to be the norm of $f$ restricted to the set $S$. Typically, we will use indices $i, j$ to represent components of the mixture, and we will use $p, q$ (and $r$) for coordinates. For a vector $x$, we will use $x(p)$ to denote its $p$th coordinate. Finally, we will use w.h.p. in statements about the success of algorithms to represent probability at least $1 - 1/\mathrm{poly}(k, d)$.
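The Gaussian tail probabilities and quantiles used throughout can be computed numerically from the error function; a small self-contained Python sketch (with helper names of our own):

```python
import math

def gauss_tail(t):
    """P[X > t] for X ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def quantile(p, lo=-10.0, hi=10.0):
    """The point t at which the upper-tail mass equals p, by bisection
    (gauss_tail is strictly decreasing on this interval)."""
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if gauss_tail(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

t95 = quantile(0.05)    # ~1.645: the familiar 95th-percentile point
med = quantile(0.5)     # ~0.0: the median of N(0, 1)
```

Bisection is used rather than a closed-form inverse because the standard library exposes `erf`/`erfc` but no quantile function.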
For any $p \ge 1$, given a matrix $A$, we define the operator norm $\|A\|_p = \sup_{x \ne 0} \|Ax\|_p / \|x\|_p$.
2.1 Notation and Preliminaries about Newton’s method
Consider a system of $m$ non-linear equations in $m$ variables $x = (x_1, \dots, x_m)$, written as $F(x) = 0$.
Let $J$ be the Jacobian of the system given by the non-linear function $F$, where the $(i, j)$ entry of $J$ is the partial derivative $\partial F_i / \partial x_j$ evaluated at $x$. Newton's method starts with an initial point $x^{(0)}$, and updates the solution using the iteration $x^{(t+1)} = x^{(t)} - J(x^{(t)})^{-1} F(x^{(t)})$.
Standard results show quadratic convergence of the Newton method for general normed spaces. We restrict our attention to the setting where both the range and domain of $F$ are $\mathbb{R}^m$, equipped with an appropriate norm to measure convergence.
Theorem 2.4 (Theorem 5.4.1 in ).
Assume $x^*$ is a solution to the equation $F(x) = 0$, where the inverse Jacobian $J^{-1}$ exists in a neighborhood of $x^*$, and $J$ is locally $L$-Lipschitz continuous in that neighborhood, i.e., $\|J(x) - J(y)\| \le L \|x - y\|$. Then the iterates satisfy $\|x^{(t+1)} - x^*\| \le L \|J^{-1}\| \cdot \|x^{(t)} - x^*\|^2$.
In particular, an initializer satisfying $\|x^{(0)} - x^*\| \le 1/(2 L \|J^{-1}\|)$ will guarantee convergence. A statement of the robust convergence of Newton's method in the presence of estimates is given in Theorem B.2 and Corollary B.4.
We want to learn each of the $k$ sets of parameters up to good accuracy; hence we will measure the error in the $\ell_\infty$ norm. To upper bound $\|J^{-1}\|_\infty$, we will use diagonal dominance properties of the matrix $J$. Note that $\|A\|_\infty$ is just the maximum $\ell_1$ norm of the rows of $A$. The following lemma bounds $\|A^{-1}\|_\infty$ for a diagonally dominant matrix $A$.
Lemma 2.5.
Consider any square matrix $A$ of size $m \times m$ satisfying $|A_{ii}| - \sum_{j \ne i} |A_{ij}| \ge \alpha$ for all $i \in [m]$. Then $\|A^{-1}\|_\infty \le 1/\alpha$.
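A bound of this form is classical (often attributed to Varah) and easy to check numerically. The following self-contained Python sketch inverts a small diagonally dominant matrix by Gauss-Jordan elimination and verifies that the infinity-norm of the inverse is at most $1/\alpha$, where $\alpha$ is the smallest dominance gap:

```python
def inverse(A):
    """Gauss-Jordan inverse of a small matrix (pure Python)."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[piv] = M[piv], M[c]
        p = M[c][c]
        M[c] = [v / p for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def inf_norm(A):
    # The infinity operator norm is the maximum l1 norm of the rows.
    return max(sum(abs(v) for v in row) for row in A)

# Each diagonal entry exceeds its off-diagonal row sum by at least alpha = 2.
A = [[3.0, 0.5, 0.5],
     [0.2, 4.0, 0.8],
     [0.6, 0.4, 3.0]]
alpha = min(abs(A[i][i]) - sum(abs(A[i][j]) for j in range(3) if j != i)
            for i in range(3))
bound_ok = inf_norm(inverse(A)) <= 1.0 / alpha
```

In the analysis, this lemma is what converts the diagonal dominance of the Jacobian into the well-conditioned inverse needed for Newton's method.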
3 Lower Bounds with $\sqrt{\log k}$ Separation
Here we show a sample complexity lower bound for learning standard mixtures of spherical Gaussians even when the separation is of the order of $\sqrt{\log k}$. In fact, this lower bound will also hold for a random mixture of Gaussians in $c \log k$ dimensions (for a sufficiently small constant $c$) with high probability. (In particular, this rules out polynomial-time smoothed analysis guarantees of the kind shown for higher-dimensional settings in [10, 3].)
For any large enough $k$ there exist constants $c, \varepsilon_0 > 0$, such that the following holds for all $d \le c \log k$ and $\varepsilon \le \varepsilon_0$. Let $\mathcal{D}$ be the distribution over standard mixtures of spherical Gaussians obtained by picking each of the $k$ means independently and uniformly from a ball of radius $R$ around the origin in $d$ dimensions. Let $G$ be a mixture chosen according to $\mathcal{D}$. Then with probability at least $1 - o(1)$ there exists another standard mixture of spherical Gaussians $\tilde{G}$ with means $\tilde{\mu}_1, \dots, \tilde{\mu}_k$ such that both mixtures are bounded and well separated, i.e., all means have length at most $R$ and
$$\forall i \ne j:\quad \|\mu_i - \mu_j\|_2 \ge \Omega(\sqrt{\varepsilon \log k}) \quad \text{and} \quad \|\tilde{\mu}_i - \tilde{\mu}_j\|_2 \ge \Omega(\sqrt{\varepsilon \log k}),$$
and their p.d.f.s satisfy $\|f - \tilde{f}\|_1 \le k^{-\omega(1)}$, even though their parameter distance is at least $1/\mathrm{poly}(k)$. Moreover, the trade-off between the statistical distance and the separation can be made explicit (see the remark below).
Remark. In Theorem 3.1, there is a trade-off between getting a smaller statistical distance and a larger separation between the means in the Gaussian mixture: pushing the separation up toward order $\sqrt{\log k}$ yields a statistical distance that is super-polynomially small, while relaxing the separation yields an even smaller statistical distance. At the other extreme, we can also choose the parameters to get lower bounds for mixtures of spherical Gaussians in one dimension with exponentially small statistical distance and inverse-polynomial separation between the means. We note that in this setting of parameters, our lower bound is nearly identical to that of [35]: they achieve statistical distance $e^{-\Omega(k)}$ (which is better than ours) with a similar inverse-polynomial separation in one dimension. One possible advantage of our bound is that it holds with a random choice of means, unlike their careful choice of means.
3.1 Proof of Theorem 3.1
The key to the proof of Theorem 3.1 is the following pigeonhole statement, which can be viewed as a bound on the covering number (or equivalently, the metric entropy) of the set of Gaussian mixtures.
Suppose we are given a collection $\mathcal{F}$ of standard mixtures of spherical Gaussians in $d$ dimensions that are $B$-bounded, i.e., $\|\mu_i\|_2 \le B$ for every mean of every mixture. There are universal constants $c_1, c_2 > 0$, such that for any $\varepsilon > 0$, if $|\mathcal{F}|$ exceeds a bound that is exponential in the dimension of the space of low-order mean moments, then for all but a small fraction of the mixtures $f$ in $\mathcal{F}$, there is another mixture in $\mathcal{F}$ with p.d.f. $\tilde{f}$ such that $\|f - \tilde{f}\|_1 \le \varepsilon$. Moreover, these bounds can be made explicit.
Notice that $k$ plays no role in the statement above. In fact, the proof also holds for mixtures with an arbitrary number of components and arbitrary weights.
Let $x_1, \dots, x_m$ be chosen independently and uniformly from the ball of radius $R$ in $\mathbb{R}^d$. Then for any $\beta > 0$, with probability at least $1 - m^2 \beta^d$, we have that for all $i \ne j$, $\|x_i - x_j\| > \beta R$.
For any fixed pair $i \ne j$, the probability that $\|x_i - x_j\| \le \beta R$ is at most $\beta^d$, because the volume of a ball of radius $\beta R$ is $\beta^d$ times that of a ball of radius $R$. The claim now follows by a union bound over the at most $m^2$ pairs. ∎
Proof of Theorem 3.1.
Set $m$ to be a sufficiently large polynomial in $k$, and consider the following probabilistic procedure. We first let $T$ be a set of $m$ points chosen independently and uniformly from the ball of radius $R$. We then output a mixture chosen uniformly from the collection $\mathcal{F}$, defined as the collection of all standard mixtures of spherical Gaussians obtained by selecting $k$ distinct means from $T$. Observe that the output of this procedure is distributed according to $\mathcal{D}$. Our goal is therefore to prove that with probability at least $1 - o(1)$, the output of the procedure satisfies the property in the theorem.
First, by Claim 3.4, with high probability (for a suitable choice of $\beta$), any two points in $T$ are at distance at least $\beta R$. It follows that in this case, the means in any mixture in $\mathcal{F}$ are at least $\beta R$ apart, and also that any two distinct mixtures in $\mathcal{F}$ have a parameter distance of at least $\beta R$, since they must differ in at least one of the means. Note that this separation is large enough for our choice of parameters.
To complete the proof, we notice that by our choice of parameters, the collection $\mathcal{F}$ is large enough to satisfy the size condition of Theorem 3.2. Hence, applying Theorem 3.2 to $\mathcal{F}$, for all but a small fraction of the mixtures in $\mathcal{F}$, there is another mixture in $\mathcal{F}$ that is close in total variation distance. We conclude that with probability at least $1 - o(1)$, a random mixture in $\mathcal{F}$ satisfies all the required properties, as desired. ∎
3.2 Proof of Theorem 3.2
It will be convenient to represent the p.d.f. of the standard mixture of spherical Gaussians with means $\mu_1, \dots, \mu_k$ as a convolution of a standard mean-zero Gaussian with a sum of delta functions centered at the means:
$$f = g_{0,1} * \left( \frac{1}{k} \sum_{i=1}^{k} \delta_{\mu_i} \right).$$
Instead of considering the moments of the mixture of Gaussians, we will consider moments of just the corresponding mixture of delta functions at the means. We will call them “mean moments,” and we will use them as a proxy for the actual moments of the distribution.
To prove Theorem 3.2 we will use three main steps. Lemma 3.6 will show, using the pigeonhole principle, that in any large enough collection of Gaussian mixtures, most mixtures have other mixtures in the collection which approximately match in their low-order mean moments. This closeness in moments will be measured using the symmetric injective tensor norm, defined for an order-$r$ symmetric tensor $T$ as
$$\|T\|_{\mathrm{inj}} = \max_{\|u\|_2 = 1} |\langle T, u^{\otimes r} \rangle|.$$
Next, Lemma 3.7 shows that two distributions that are close in their low-order mean moments are also close in the $L^2$ distance. This translates into small statistical ($L^1$) distance between the two distributions using Lemma 3.8.
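For mean moments, the injective norm is particularly convenient: since $M_r = \frac{1}{k}\sum_i \mu_i^{\otimes r}$, we have $\langle M_r, u^{\otimes r} \rangle = \frac{1}{k}\sum_i \langle \mu_i, u \rangle^r$, so no explicit tensor ever needs to be formed. A small Python sketch in two dimensions (approximating the maximum over unit vectors by an angular grid; the example means are ours):

```python
import math

def injective_norm(means, r, steps=20000):
    """Injective norm of the order-r mean-moment tensor of a set of
    2-D means: max over unit u of |(1/k) * sum_i <mu_i, u>^r|."""
    k = len(means)
    best = 0.0
    for s in range(steps):
        th = 2 * math.pi * s / steps
        u = (math.cos(th), math.sin(th))
        val = sum((m[0] * u[0] + m[1] * u[1]) ** r for m in means) / k
        best = max(best, abs(val))
    return best

mus = [(1.0, 0.0), (-1.0, 0.0)]     # a symmetric pair of means
odd = injective_norm(mus, 3)        # odd moments cancel exactly: ~0
even = injective_norm(mus, 2)       # maximized at u = (1, 0): value 1
```

The cancellation in the odd moments of symmetric configurations is exactly the kind of coincidence the pigeonhole argument exploits: very different mean sets can share many low-order moments.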
We will use the following standard packing claim.
Let $\|\cdot\|$ be an arbitrary norm on $\mathbb{R}^D$. If $x_1, \dots, x_N$ are such that $\|x_i\| \le B$ for all $i$, and $\|x_i - x_j\| > \varepsilon$ for all $i \ne j$, then $N \le (1 + 2B/\varepsilon)^D$. In particular, if $x_1, \dots, x_N$ are such that $\|x_i\| \le B$ for all $i$, then for all but $(1 + 2B/\varepsilon)^D$ of the indices $i$, there exists a $j \ne i$ such that $\|x_i - x_j\| \le \varepsilon$.
Let $K$ be the unit ball of the norm $\|\cdot\|$. Then by assumption, the sets $x_i + (\varepsilon/2) K$ are disjoint. But since they are all contained in $(B + \varepsilon/2) K$,
$$N \cdot (\varepsilon/2)^D \cdot \mathrm{vol}(K) \le (B + \varepsilon/2)^D \cdot \mathrm{vol}(K),$$
and hence $N \le (1 + 2B/\varepsilon)^D$. The "in particular" part follows by applying the first part to a maximal set of $\varepsilon$-separated points among the $x_i$. ∎
Suppose we are given a set $\mathcal{F}$ of standard mixtures of spherical Gaussians in $d$ dimensions with means of length at most $B$. Then for any integer $t \ge 1$ and any $\varepsilon > 0$, if $|\mathcal{F}|$ is large enough (exponential in the dimension of the space of moment tensors of order up to $t$), it holds that for all but a small fraction of the mixtures in $\mathcal{F}$, there is another mixture in $\mathcal{F}$ satisfying that for each $r = 1, \dots, t$, the $r$th mean moments of the two mixtures are within $\varepsilon$ in injective norm.
With any choice of means $\mu_1, \dots, \mu_k$ we can associate a vector of moments $v = (M_1, \dots, M_t)$ where for $r \in [t]$, $M_r = \frac{1}{k} \sum_{i=1}^{k} \mu_i^{\otimes r}$.
Notice that the image of this map lies in a direct sum of symmetric subspaces, where the $r$th subspace has dimension $\binom{d + r - 1}{r}$ (i.e., the number of ways of distributing $r$ identical balls into $d$ different bins). Hence the total dimension is $D = \sum_{r=1}^{t} \binom{d + r - 1}{r}$.
We define a norm on these vectors in terms of the maximum injective norm of the mean moments: $\|v\| = \max_{r \in [t]} \|M_r\|_{\mathrm{inj}}$. Since each of our means has length at most $B$, we have that $\|M_r\|_{\mathrm{inj}} \le B^r$, and hence $\|v\| \le \max(1, B^t)$.
Using Claim 3.5, if $|\mathcal{F}|$ is much larger than $(1 + 2\max(1, B^t)/\varepsilon)^D$, we have that for all but a small fraction of the Gaussian mixtures in $\mathcal{F}$, there is another mixture from $\mathcal{F}$ whose moment vector is $\varepsilon$-close, as required. ∎
Next we show that closeness in moments implies closeness in the $L^2$ distance. This follows from fairly standard Fourier-analytic techniques. We will first show that if the mean moments are close, then the low-order Fourier coefficients are close. This will then imply that the Fourier spectra of the corresponding Gaussian mixtures are close.
Suppose $f, \tilde{f}$ are the p.d.f.s of two standard mixtures of spherical Gaussians in $d$ dimensions, with means $\mu_1, \dots, \mu_k$ and $\tilde{\mu}_1, \dots, \tilde{\mu}_k$ respectively, that are both $B$-bounded. There exist universal constants $c_1, c_2 > 0$ such that for every $\varepsilon > 0$, if the following holds for