1 Introduction
is one of the most wellstudied and widelyused models in applied statistics and machine learning. An important special case of this model (the primary focus of this work) restricts the Gaussian components to have spherical covariance matrices; this probabilistic model is closely related to the (nonprobabilistic)
means clustering problem (MacQueen, 1967).The mixture of spherical Gaussians model is specified as follows. Let
be the probability of choosing component
, letbe the component mean vectors, and let
be the component variances. Define
so is a probability vector, and is the matrix whose columns are the component means. Let be the (observed) random vector given by
where
is the discrete random variable with
for , and is a random vector whose conditional distribution given (for some ) is the multivariate Gaussian with mean zero and covariance .The estimation task is to accurately recover the model parameters (component means, variances, and mixing weights) from independent copies of .
This work gives a procedure for efficiently and exactly recovering the parameters using a simple spectral decomposition of loworder moments of , under the following condition.
Condition 1 (Nondegeneracy).
The component means span a dimensional subspace (i.e., the matrix has column rank ), and the vector has strictly positive entries.
The proposed estimator is based on a spectral decomposition technique (Chang, 1996; Mossel and Roch, 2006; Anandkumar et al., 2012b), and is easily stated in terms of exact population moments of the observed . With finite samples, one can use a plugin estimator based on empirical moments of in place of exact moments. These empirical moments converge to the exact moments at a rate of , where is the sample size. As discussed in Section 3, sample complexity bounds for accurate parameter estimation can be derived using matrix perturbation arguments (Anandkumar et al., 2012b). Since only loworder moments are required by the plugin estimator, the sample complexity is polynomial in the relevant parameters of the estimation problem.
Related work.
The first estimators for the Gaussian mixture models were based on the methodofmoments, as introduced by Pearson (1894) (see also Lindsay and Basak, 1993, and the references therein). Roughly speaking, these estimators are based on finding parameters under which the Gaussian mixture distribution has moments approximately matching the observed empirical moments. Finding these parameters typically involves solving systems of multivariate polynomial equations, which is typically computationally challenging. Besides this, the order of the moments of some of the early momentbased estimators were either growing with the dimension or the number of components , which is undesirable because the empirical estimates of such highorder moments may only be reliable when the sample size is exponential in or
. Both the computational and sample complexity issues have been addressed in recent years, at least under various restrictions. For instance, several distancebased estimators require that the component means be wellseparated in Euclidean space, by at least some large factor times the directional standard deviation of the individual component distributions
(Dasgupta, 1999; Arora and Kannan, 2001; Dasgupta and Schulman, 2007; Vempala and Wang, 2002; Chaudhuri and Rao, 2008), but otherwise have polynomial computational and sample complexity. Some recent momentbased estimators avoid the minimum separation condition of distancebased estimators by requiring either computational or data resources exponential in the number of mixing components (but not the dimension ) (Belkin and Sinha, 2010; Kalai et al., 2010; Moitra and Valiant, 2010) or by making a nondegenerate multiview assumption (Anandkumar et al., 2012b).By contrast, the momentbased estimator described in this work does not require a minimum separation condition, exponential computational or data resources, or nondegenerate multiple views. Instead, it relies only on the nondegeneracy condition discussed above together with a spherical noise condition. The nondegeneracy condition is much weaker than an explicit minimum separation condition because the parameters can be arbitrarily close to being degenerate, as long as the sample size grows polynomially with a natural quantity measuring this closeness to degeneracy (akin to a condition number). Like other momentbased estimators, the proposed estimator is based on solving multivariate polynomial equations, although these solutions can be found efficiently because the problems are cast as eigenvalue decompositions of symmetric matrices, which are efficient to compute.
Recent work of Moitra and Valiant (2010)
demonstrates an informationtheoretic barrier to estimation for general Gaussian mixture models. More precisely, they construct a pair of onedimensional mixtures of Gaussians (with separated component means) such that the statistical distance between the two mixture distributions is exponentially small in the number of components. This implies that in the worst case, the sample size required to obtain accurate parameter estimates must grow exponentially with the number of components, even when the component distributions are nonnegligibly separated. A consequence of the present work is that natural nondegeneracy conditions preclude these worst case scenarios. The nondegeneracy condition in this work is similar to one used for bypassing computational (cryptographic) barriers to estimation for hidden Markov models
(Chang, 1996; Mossel and Roch, 2006; Hsu et al., 2012a; Anandkumar et al., 2012b).Finally, it is interesting to note that similar algebraic techniques have been developed for certain models in independent component analysis (ICA) (Comon, 1994; Cardoso and Comon, 1996; Hyvärinen and Oja, 2000; Comon and Jutten, 2010; Arora et al., 2012) and other closely related problems (Frieze et al., 1996; Nguyen and Regev, 2009). In contrast to the ICA setting, handling nonspherical Gaussian noise for mixture models appears to be a more delicate issue. These connections and open problems are further discussed in Section 3.
2 Momentbased estimation
This section describes a methodofmoments estimator for the spherical Gaussian mixture model.
The following theorem is the main structural result that relates the model parameters to observable moments.
Theorem 1 (Observable moment structure).
Assume Condition 1 holds. The average variance is the smallest eigenvalue of the covariance matrix . Let
be any unit norm eigenvector corresponding to the eigenvalue
. Define(where
denotes tensor product, and
is the coordinate basis for ). ThenRemark 1.
We note that in the special case where (i.e., the mixture components share a common spherical covariance matrix), the average variance is simply , and has a simpler form:
There is no need to refer to the eigenvectors of the covariance matrix or .
Proof of Theorem 1.
We first characterize the smallest eigenvalue of the covariance matrix of , as well as all corresponding eigenvectors . Let . The covariance matrix of is
Since the vectors for are linearly dependent (), the positive semidefinite matrix has rank . Thus, the smallest eigenvalues are exactly , while all other eigenvalues are strictly larger than . The strict separation of eigenvalues implies that every eigenvector corresponding to is in the null space of ; thus for all .
Now we can express , , and in terms of the parameters , , and . First,
where the last step uses the fact that , which implies that conditioned on , and . Next, observe that , so
Finally, for , we first observe that
(terms such as and vanish because ). We now claim that . This holds because
crucially using the fact that for and . By the same derivation, we have and . Therefore,
as claimed. ∎
Theorem 1 shows the relationship between (some functions of) the observable moments and the desired parameters. A simple estimator based on this moment structure is given in the following theorem. For a thirdorder tensor , we define for any vector .
Theorem 2 (Momentbased estimator).
The following can be added to the results of Theorem 1. Suppose are distinct and nonzero (which is satisfied almost surely, for instance, if is chosen uniformly at random from the unit sphere in ). Then the matrix
is diagonalizable (where denotes the MoorePenrose pseudoinverse); its nonzero eigenvalue / eigenvector pairs satisfy and for some permutation on and signs . The , , and are recovered (up to permutation) with
Proof.
Let be the thin SVD of (, , and ), so and since has rank by assumption. Also by assumption, the diagonal entries of are distinct and nonzero. Therefore, every nonzero eigenvalue of the symmetric matrix has geometric multiplicity one. Indeed, these nonzero eigenvalues are the diagonal entries of (up to some permutation on ), and the corresponding eigenvectors are the columns of up to signs:
Now, since
it follows that
The claims regarding and are also evident from the structure of and . ∎
An efficiently computable plugin estimator can be derived from Theorem 2. We state one such algorithm (called LearnGMM) in Appendix C; for simplicity, we restrict to the case where the components share the same common spherical covariance, i.e., . The following theorem provides a sample complexity bound for accurate estimation of the component means. Since only loworder moments are used, the sample complexity is polynomial in the relevant parameters of the estimation problem (in particular, the dimension and the number of mixing components ). It is worth noting that the polynomial is quadratic in the inverse accuracy parameter ; this owes to the fact that the empirical moments converge to the population moments at the usual
rate as per the central limit theorem.
Theorem 3 (Finite sample bound).
There exists a polynomial such that the following holds. Let be the matrix defined in Theorem 2, and be its
th largest singular value (for
). Let and . Pick any . Suppose the sample size satisfiesThen with probability at least over the random sample and the internal randomness of the algorithm, there exists a permutation on such that the returned by LearnGMM satisfy
for all .
The proof of Theorem 3 is given in Appendix C. It is also easy to obtain accuracy guarantees for estimating and . The role of Condition 1 enters by observing that if either or , as . The sample complexity bound then becomes trivial in this case, as the bound grows with and . Finally, we also note that LearnGMM is just one (easy to state) way to obtain an efficient algorithm based on the structure in Theorem 1. It is also possible to use, for instance, simultaneous diagonalization techniques (BunseGerstner et al., 1993) or orthogonal tensor decompositions (Anandkumar et al., 2012a) to extract the parameters from (estimates of) and ; these alternative methods are more robust to sampling error, and are therefore recommended for practical implementation.
3 Discussion
Multiview methods and a simpler algorithm in higher dimensions.
Some previous work of the authors on momentbased estimators for the Gaussian mixture model relies on a nondegenerate multiview assumption (Anandkumar et al., 2012b). In this work, it is shown that if each mixture component has an axisaligned covariance , then under some additional mild assumptions (which ultimately require ), a momentbased method can be used to estimate the model parameters. The idea is to partition the coordinates into three groups, inducing multiple “views” with each for some such that , , and are conditionally independent given . When the matrix of conditional means for each view has rank , then an efficient technique similar to that described in Theorem 2 will recover the parameters. Therefore, the problem is reduced to partitioning the coordinates so that the resulting matrices have rank .
In the case where each component covariance is spherical (), we may simply apply a random rotation to before (arbitrarily) splitting into the three views. Let
for a random orthogonal matrix
, and partition the coordinates so that with and. By the rotational invariance of the multivariate Gaussian distribution, the distribution of
is still a mixture of spherical Gaussians, and moreover, the matrix of conditional means for each view has rank with probability . To see this, observe that a random rotation in followed by a restriction to coordinates is simply a random projection from to , and that a random projection of a linear subspace of dimension (in particular, the range of ) to is almost surely injective as long as . Therefore it is sufficient to require so that it is possible to split into three views, each of dimension . To guarantee that the th largest singular value of each is bounded below in terms of the th largest singular value of (with high probability), we may require to be somewhat larger: certainly works (see Appendix B), and we conjecture for some is in fact sufficient.Spectral decomposition approaches for ICA.
The Gaussian mixture model shares some similarities to a standard model for independent component analysis (ICA) (Comon, 1994; Cardoso and Comon, 1996; Hyvärinen and Oja, 2000; Comon and Jutten, 2010). Here, let be a random vector with independent entries, and let be multivariate Gaussian random vector. We think of as an unobserved signal and as noise. The observed random vector is
for some , where and are assumed to be independent. (For simplicity, we only consider square , although it is easy to generalize to for .)
In contrast to this ICA model, the spherical Gaussian mixture model is one where would take values in , and the covariance of (given ) is spherical.
For ICA, a spectral decomposition approach related to the one described in Theorem 2 can be used to estimate the columns of (up to scale), without knowing the noise covariance . Such an estimator can be obtained from Theorem 4 using techniques commonplace in the ICA literature; its proof is given in Appendix A for completeness.
Theorem 4.
In the ICA model described above, assume , , and (i.e.
, the excess kurtosis is nonzero), and that
is nonsingular. Define bywhere . Suppose and are such that are distinct. Then the matrix
is diagonalizable; the eigenvalues are and each have geometric multiplicity one, and the corresponding eigenvectors are (up to scaling and permutation).
Again, choosing and as random unit vectors ensures the distinctness assumption is satisfied almost surely, and a finite sample analysis can be given using standard matrix perturbation techniques (Anandkumar et al., 2012b). A number of related deterministic algorithms based on algebraic techniques are discussed in the text of Comon and Jutten (2010). Recent work of Arora et al. (2012) provides a finite sample complexity analysis for an efficient estimator based on local search.
Nondegeneracy.
The nondegeneracy assumption (Condition 1) is quite natural, and its has the virtue of permitting tractable and consistent estimators. Although previous work has typically tied it with additional assumptions, this work shows that they are largely unnecessary.
One drawback of Condition 1 is that it prevents the straightforward application of these techniques to certain problem domains (e.g., automatic speech recognition (ASR), where the number of mixture components is typically enormous, but the dimension of observations is relatively small; alternatively, the span of the means has dimension ). To compensate, one may require multiple views, which are granted by a number of models, including hidden Markov models used in ASR (Hsu et al., 2012a; Anandkumar et al., 2012b), and combining these views in a tensor product fashion (Allman et al., 2009). This increases the complexity of the estimator, but that may be inevitable as estimation for certain nonsingular models is conjectured to be computationally intractable (Mossel and Roch, 2006).
Acknowledgements
We thank Dean Foster and Anima Anandkumar for helpful insights. We also thank Rong Ge and Sanjeev Arora for discussions regarding their recent work on ICA.
References
 Allman et al. (2009) E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099–3132, 2009.
 Anandkumar et al. (2012a) A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models, 2012a. Manuscript.
 Anandkumar et al. (2012b) A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, 2012b.
 Arora and Kannan (2001) S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In STOC, 2001.

Arora et al. (2012)
S. Arora, R. Ge, A. Moitra, and S. Sachdeva.
Provable ICA with unknown Gaussian noise, and implications for Gaussian mixtures and autoencoders.
In NIPS, 2012.  Belkin and Sinha (2010) M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, 2010.
 BunseGerstner et al. (1993) A. BunseGerstner, R. Byers, and V. Mehrmann. Numerical methods for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 14(4):927–949, 1993.
 Cardoso and Comon (1996) J.F. Cardoso and P. Comon. Independent component analysis, a survey of some algebraic methods. In IEEE International Symposium on Circuits and Systems Circuits and Systems Connecting the World, 1996.
 Chang (1996) J. T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137:51–73, 1996.
 Chaudhuri and Rao (2008) K. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In COLT, 2008.
 Comon (1994) P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
 Comon and Jutten (2010) P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press. Elsevier, 2010.
 Dasgupta (1999) S. Dasgupta. Learning mixutres of Gaussians. In FOCS, 1999.
 Dasgupta and Gupta (2003) S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60–65, 2003.
 Dasgupta and Schulman (2007) S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8(Feb):203–226, 2007.

Frieze et al. (1996)
A. M. Frieze, M. Jerrum, and R. Kannan.
Learning linear transformations.
In FOCS, 1996.  Hsu et al. (2011) D. Hsu, S. M. Kakade, and T. Zhang. An analysis of random design linear regression, 2011. arXiv:1106.2363.
 Hsu et al. (2012a) D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012a.
 Hsu et al. (2012b) D. Hsu, S. M. Kakade, and T. Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electronic Communications in Probability, 17(14):1–13, 2012b.
 Hyvärinen and Oja (2000) A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4–5):411–430, 2000.
 Kalai et al. (2010) A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In STOC, 2010.
 Laurent and Massart (2000) B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
 Lindsay and Basak (1993) B. G. Lindsay and P. Basak. Multivariate normal mixtures: a fast consistent method. Journal of the American Statistical Association, 88(422):468–476, 1993.
 Litvak et al. (2005) A. Litvak, A. Pajor, M. Rudelson, and N. TomczakJaegermann. Smallest singular values of random matrices and geometry of random polytopes. Advances in Mathematics, 195:491–523, 2005.
 MacQueen (1967) J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
 Moitra and Valiant (2010) A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In FOCS, 2010.
 Mossel and Roch (2006) E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. Annals of Applied Probability, 16(2):583–614, 2006.
 Nguyen and Regev (2009) P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. Journal of Cryptology, 22(2):139–160, 2009.
 Pearson (1894) K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society, London, A., page 71, 1894.
 Pisier (1989) G. Pisier. The volume of convex bodies and Banach space geometry. Cambridge University Press, 1989.
 Stewart and Sun (1990) G. W. Stewart and JiGuang Sun. Matrix Perturbation Theory. Academic Press, 1990.
 Titterington et al. (1985) D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical analysis of finite mixture distributions. Wiley, 1985.
 Vempala and Wang (2002) S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In FOCS, 2002.
Appendix A Connection to independent component analysis
Proof of Theorem 4.
It can be shown that
By the assumptions,
and therefore
The Hessian of is given by
Define the diagonal matrices
and observe that
By assumption, the diagonal entries of are distinct, and therefore
is diagonalizable, and every eigenvalue has geometric multiplicity one. ∎
Appendix B Incoherence and random rotations
The multiview technique of Anandkumar et al. (2012b) can be used to estimate mixtures of product distributions, which include, as special cases, mixtures of Gaussians with axisaligned covariances . Spherical covariances are, of course, also axisaligned. The idea is to randomly partition the coordinates into three groups, inducing multiple “views” with each for some such that , , and are conditionally independent given . When the matrix of conditional means for each view has rank , then an efficient technique similar to that described in Theorem 2 will recover the parameters (for details, see Anandkumar et al., 2012b, a).
Anandkumar et al. (2012b) show that if has rank and also satisfies a mild incoherence condition, then a random partitioning guarantees that each has rank , and lowerbounds the th largest singular value of each by that of . The condition is similar to the spreading condition of Chaudhuri and Rao (2008).
Define to be the largest diagonal entry of the orthoprojector to the range of . When has rank , we have ; it is maximized when and minimized when the range is spanned by a subset of the Hadamard basis of cardinality . Roughly speaking, if the matrix of conditional means has low coherence, then its fullrank property is witnessed by many partitions of ; this is made formal in the following lemma.
Lemma 1.
Assume has rank and that for some . With probability at least , a random partitioning of the dimensions into three groups (for each , independently pick uniformly at random and put in group ) has the following property. For each , the matrix obtained by selecting the rows of in group has full column rank, and the th largest singular value of is at least times that of .
For a mixture of spherical Gaussians, one can randomly rotate before applying the random coordinate partitioning. This is because if is an orthogonal matrix, then the distribution of is also a mixture of spherical Gaussians. Its matrix of conditional means is given by . The following lemma implies that multiplying a tall matrix by a random rotation causes the product to have low coherence.
Lemma 2 (Hsu et al., 2011).
Let be a fixed matrix with rank , and let be chosen uniformly at random among all orthogonal matrices. For any , with probability at least , the matrix satisfies
Appendix C Learning algorithm and finite sample analysis
In this section, we state and analyze a learning algorithm based on the estimator from Theorem 2, which assumed availability of exact moments of . The proposed algorithm only uses a finite sample to estimate moments, and also explicitly deals with the eigenvalue separation condition assumed in Theorem 2 via internal randomization.
c.1 Notation
For a matrix , we use to denote the th largest singular value of a matrix , and to denote its spectral norm (so ).
For a thirdorder tensor and , we use the notation to denote the thirdorder tensor given by
Note that this is the analogue of for a matrix and . For , we use to denote its operator (or supremum) norm .
c.2 Algorithm
The proposed algorithm, called LearnGMM, is described in Figure 1. The algorithm essentially implements the decomposition strategy in Theorem 2 using plugin moments. To simplify the analysis, we split our sample (say, initially of size ) in two: we use the first half for empirical moments ( and ) used in constructing , , , and ; and we use the second half for empirical moments ( and used in constructing . Observe that this ensures is independent of .
Let be i.i.d. copies of , and write . Let be an independent copy of . Furthermore, define the following moments and empirical moments:
So represents the first half of the sample, and represents the second half of the sample.