1 Introduction
In a Gaussian mixture model (GMM), the observed data comprise an i.i.d. sample from a mixture of $k$ Gaussians:

(1) $f(x) = \sum_{i=1}^{k} w_i\, \mathcal{N}(x;\, \mu_i, \Sigma_i),$

where $w_i$, $\mu_i$, and $\Sigma_i$ denote the weight, mean, and covariance matrix of the $i$-th mixture component. Parameters of the GMM are often estimated using the Expectation Maximization (EM) algorithm, which aims to find the maximizer of the log-likelihood objective. However, the log-likelihood function is not concave, so EM is only guaranteed to find its stationary points. This leads to the following natural and fundamental question in the study of EM and nonconvex optimization: how can EM escape spurious local maxima and saddle points to reach the maximum likelihood estimate (MLE)? In this work, we give theoretical and empirical evidence that overparameterizing the mixture model can help EM achieve this objective.
Our evidence is based on models in (1) whose mixture components share a known, common covariance, i.e., we fix $\Sigma_i = \Sigma$ for all $i$. First, we assume that the mixing weights $w_1, \dots, w_k$ are also fixed to known values. Under this model, which we call Model 1, EM finds a stationary point of the log-likelihood function in the parameter space of component means $\mu_1, \dots, \mu_k$. Next, we overparameterize Model 1 as follows. Despite the fact that the weights are fixed in Model 1, we now pretend that they are not. This gives a second model, which we call Model 2. Parameter estimation for Model 2 requires EM to estimate the mixing weights in addition to the component means. Finding the global maximizer of the log-likelihood over this enlarged parameter space seems more difficult for Model 2 than for Model 1, and perhaps needlessly so. However, in this paper we present theoretical and empirical evidence to the contrary.

For mixtures of two symmetric Gaussians (i.e., $k = 2$ and $\mu_2 = -\mu_1$), we prove that EM for Model 2 converges to the global maximizer of the log-likelihood objective from almost any initialization of the mean parameters, while EM for Model 1 fails to do so for many choices of the mixing weights. These results are established for idealized executions of EM in the infinite-sample-size limit, which we complement with finite-sample results.

We prove that the spurious local maxima in the (population) log-likelihood objective for Model 1 are eliminated in the objective for Model 2.

We present an empirical study showing that for more general mixtures of Gaussians, with a variety of model parameters and sample sizes, EM for Model 2 has a higher probability of finding the MLE than EM for Model 1 under random initializations.
Related work.
Since Dempster’s 1977 paper (Dempster et al., 1977), the EM algorithm has become one of the most popular algorithms for finding the MLE of mixture models. Due to its popularity, the convergence analysis of EM has attracted researchers’ attention for years. Local convergence of EM has been shown by Wu (1983); Xu and Jordan (1996); Tseng (2004); Chrétien and Hero (2008). Further, for certain models and under various assumptions about the initialization, EM has been shown to converge to the MLE (Redner and Walker, 1984; Balakrishnan et al., 2017; Klusowski and Brinda, 2016; Yan et al., 2017). Typically, the initialization is required to be sufficiently close to the true parameter values of the data-generating distribution. Much less is known about global convergence of EM, as the landscape of the log-likelihood function has not been well studied. For GMMs, Xu et al. (2016) and Daskalakis et al. (2017) study mixtures of two Gaussians with equal weights and show that the log-likelihood objective has only two global maxima and one saddle point, and that if EM is randomly initialized in a natural way, the probability that EM converges to this saddle point is zero. (Our Theorem 2 generalizes these results.) It is known that for mixtures of three or more Gaussians, global convergence is not generally possible (Jin et al., 2016).
The value of overparameterization for local or greedy search algorithms that aim to find a global minimizer of a nonconvex objective has been rigorously established in other domains. Matrix completion is a concrete example: the goal is to recover a low-rank matrix $M$ from observations of a few randomly chosen entries (Candès and Recht, 2009). A direct approach to this problem is to find the matrix of minimum rank that is consistent with the observed entries of $M$. However, this optimization problem is NP-hard in general, despite the fact that a rank-$r$ $n \times n$ matrix has only on the order of $nr$ degrees of freedom. An indirect approach to this matrix completion problem is to find a matrix of smallest nuclear norm, subject to the same constraints; this is a convex relaxation of the rank-minimization problem. By optimizing over all $n^2$ entries rather than a rank-$r$ factorization, Candès and Tao (2010) show that $M$ is exactly recovered via nuclear norm minimization (with high probability) as soon as the number of observed entries is on the order of $nr$, up to polylogarithmic factors. Notably, this combination of overparameterization with convex relaxation works well in many other problems, such as sparse PCA (d’Aspremont et al., 2005) and compressed sensing (Donoho, 2006). However, many problems, like ours, do not admit a straightforward convex relaxation. It is therefore important to understand mechanisms other than convex relaxation by which overparameterization can help solve a nonconvex problem.
Another line of work in which the value of overparameterization has been observed is deep learning. It is conjectured that overparameterization is a main reason for the success of local search algorithms in learning good parameters for neural networks (Livni et al., 2014; Safran and Shamir, 2017). Recently, Haeffele and Vidal (2015); Nguyen and Hein (2017, 2018); Soltani and Hegde (2018); and Du and Lee (2018) confirmed this observation for several architectures, including feedforward and convolutional neural networks.
2 Theoretical results
In this section, we present our main theoretical results concerning EM and two-component Gaussian mixture models.
2.1 Sample-based EM and Population EM
Without loss of generality, we assume the common covariance is the identity matrix $I_d$. We consider the following Gaussian mixture model:

(2) $X \sim w_1\, \mathcal{N}(\theta^*, I_d) + w_2\, \mathcal{N}(-\theta^*, I_d).$
The mixing weights $w_1$ and $w_2$ are fixed (i.e., assumed to be known). Without loss of generality, we also assume that $w_1 \ge 1/2$ (and, of course, $w_2 = 1 - w_1$). The only parameter to estimate is the mean vector $\theta^* \in \mathbb{R}^d$. The EM algorithm for this model uses the following iterations:
(3) $\lambda^{t+1} = \frac{1}{n}\sum_{i=1}^{n} \big(2\, r_{\lambda^t}(x_i) - 1\big)\, x_i, \qquad r_{\lambda}(x) = \frac{w_1\, \phi(x - \lambda)}{w_1\, \phi(x - \lambda) + w_2\, \phi(x + \lambda)},$ where $\phi$ denotes the $\mathcal{N}(0, I_d)$ density.
We refer to this algorithm as Sample-based EM1: it is the EM algorithm one would normally use when the mixing weights are known. In spite of this, we also consider an EM algorithm that pretends the weights are not known, and estimates them alongside the mean parameter. We refer to this algorithm as Sample-based EM2; it uses the following iterations:
(4) $r^t(x) = \dfrac{w^t\, \phi(x - \lambda^t)}{w^t\, \phi(x - \lambda^t) + (1 - w^t)\, \phi(x + \lambda^t)}, \qquad w^{t+1} = \frac{1}{n}\sum_{i=1}^{n} r^t(x_i), \qquad \lambda^{t+1} = \frac{1}{n}\sum_{i=1}^{n} \big(2\, r^t(x_i) - 1\big)\, x_i.$
This is the EM algorithm for a different Gaussian mixture model in which the weights $w_1$ and $w_2$ are not fixed (i.e., unknown), and hence must be estimated. Our goal is to study the global convergence properties of the above two EM algorithms on data from the first model, where the mixing weights are, in fact, known.
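To make the two sets of updates concrete, the following is a minimal one-dimensional sketch of both sample-based algorithms. This is our own illustration, not the paper's code; the parameter values and the names `em_model1` and `em_model2` are our choices.

```python
import numpy as np

def responsibilities(x, lam, w1):
    # E-step: posterior probability that each point came from the
    # component with mean +lam (the other component has mean -lam).
    a = w1 * np.exp(-0.5 * (x - lam) ** 2)
    b = (1.0 - w1) * np.exp(-0.5 * (x + lam) ** 2)
    return a / (a + b)

def em_model1(x, w1, lam0, iters=300):
    # Sample-based EM1: weights (w1, 1 - w1) are known and held fixed;
    # only the shared mean parameter lam is updated, as in (3).
    lam = lam0
    for _ in range(iters):
        r = responsibilities(x, lam, w1)
        lam = np.mean((2.0 * r - 1.0) * x)  # M-step under mu2 = -mu1
    return lam

def em_model2(x, lam0, iters=300):
    # Sample-based EM2: the weight is treated as unknown and estimated
    # alongside the mean, as in (4); initialized at w = 1/2.
    lam, w = lam0, 0.5
    for _ in range(iters):
        r = responsibilities(x, lam, w)
        w = np.mean(r)                       # new weight estimate
        lam = np.mean((2.0 * r - 1.0) * x)   # same M-step for the mean
    return lam, w

# Demo: theta* = 2, w1 = 0.7, and a sign-flipped initialization lam0 = -1.
rng = np.random.default_rng(0)
n = 100_000
signs = np.where(rng.random(n) < 0.7, 2.0, -2.0)
x = signs + rng.standard_normal(n)

lam1 = em_model1(x, 0.7, -1.0)   # stalls on the wrong side
lam2, w2 = em_model2(x, -1.0)    # recovers the mixture up to relabeling
```

On draws like this one, `em_model1` started at a negative initialization stalls near $-\theta^*$ with the weight $0.7$ attached to the wrong component, while `em_model2` lands near $(-\theta^*, 0.3)$, which is the same mixture as $(\theta^*, 0.7)$ after relabeling.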
We study idealized executions of the EM algorithms in the large-sample limit, where the algorithms are modified to be computed over an infinitely large i.i.d. sample drawn from the mixture distribution in (2). Specifically, we replace the empirical averages in (3) and (4) with expectations with respect to the mixture distribution. We obtain the following two modified EM algorithms, which we refer to as Population EM1 and Population EM2:

Population EM1:

(5) $\lambda^{t+1} = M(\lambda^t) := \mathbb{E}\big[\big(2\, r_{\lambda^t}(X) - 1\big)\, X\big],$ where the expectation is with respect to the true distribution of $X$ given in (2).

Population EM2: Set $w^0 = 1/2$ (using equal initial weights is a natural way to initialize EM when the weights are unknown), and run

(6) $w^{t+1} = f_w(\lambda^t, w^t) := \mathbb{E}\big[r^t(X)\big],$

(7) $\lambda^{t+1} = f_\lambda(\lambda^t, w^t) := \mathbb{E}\big[\big(2\, r^t(X) - 1\big)\, X\big],$

where $r^t$ is as in (4).
As $n \to \infty$, we can show that the performance of Sample-based EM converges to that of Population EM in probability. This argument has been used rigorously in many previous works on EM (Balakrishnan et al., 2017; Xu et al., 2016; Klusowski and Brinda, 2016; Daskalakis et al., 2017). The main goal of this section, however, is to study the dynamics of Population EM1 and Population EM2, and the landscape of the log-likelihood objectives of the two models.
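The population updates can be approximated to high accuracy by integrating against the true density on a fine grid. A sketch of this idealized execution, under the illustrative assumption $\theta^* = 2$, $w_1 = 0.7$ (our choices, not values from the paper):

```python
import numpy as np

THETA, W1 = 2.0, 0.7                 # true parameters for the demo
XS = np.linspace(-12.0, 12.0, 9601)  # integration grid
DX = XS[1] - XS[0]
PHI = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)
DENSITY = W1 * PHI(XS - THETA) + (1.0 - W1) * PHI(XS + THETA)

def pop_em1(lam):
    # One step of Population EM1, the map M in (5), via quadrature.
    a = W1 * PHI(XS - lam)
    b = (1.0 - W1) * PHI(XS + lam)
    r = a / (a + b)
    return np.sum((2.0 * r - 1.0) * XS * DENSITY) * DX

def pop_em2(lam, w):
    # One step of Population EM2, the maps (6) and (7).
    a = w * PHI(XS - lam)
    b = (1.0 - w) * PHI(XS + lam)
    r = a / (a + b)
    w_new = np.sum(r * DENSITY) * DX
    lam_new = np.sum((2.0 * r - 1.0) * XS * DENSITY) * DX
    return lam_new, w_new

# A sign-flipped start is not corrected by Population EM1 ...
lam = -1.0
for _ in range(200):
    lam = pop_em1(lam)
spurious = lam                       # settles at a negative fixed point

# ... while Population EM2 reaches a relabeled global maximizer.
lam2, w2 = -1.0, 0.5
for _ in range(200):
    lam2, w2 = pop_em2(lam2, w2)
```

Here $\theta^*$ is an exact fixed point of $M$ (the truth is a stationary point), the iteration from a negative start exhibits the spurious negative fixed point, and the overparameterized iteration converges to $(-\theta^*, w_2)$, the relabeled global maximizer.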
2.2 Main theoretical results
Let us first consider the special case $w_1 = w_2 = 1/2$. Then, it is straightforward to show that $w^t = 1/2$ for all $t$ in Population EM2. Hence, Population EM2 is equivalent to Population EM1. Global convergence of $\lambda^t$ to $\pm\theta^*$ for this case was recently established by Xu et al. (2016, Theorem 1) for almost all initial $\lambda^0$ (see also Daskalakis et al. (2017)).
We first show that the same global convergence may not hold for Population EM1 when $w_1 \neq 1/2$.
Theorem 1.
Consider Population EM1 in dimension one (i.e., $d = 1$), and assume $\theta^* > 0$. For any $\theta^*$, there exists a mixing weight $w_1 \in (1/2, 1)$ such that, given this $w_1$ and an initialization $\lambda^0 < 0$, the Population EM1 estimate converges to a fixed point inside $(-\theta^*, 0)$.
This theorem, which is proved in Appendix A, implies that if we use random initialization, Population EM1 may converge to the wrong fixed point with constant probability. We illustrate this in Figure 1. The iterates of Population EM1 converge to a fixed point of the function $M$ defined in (5). We have plotted this function for several different values of $w_1$ in the left panel of Figure 1. When $w_1$ is close to $1$, $M$ has only one fixed point, and that is at $\theta^*$. Hence, in this case, the estimates produced by Population EM1 converge to the true $\theta^*$. However, when we decrease the value of $w_1$ below a certain threshold (found numerically for the case $\theta^* = 1$), two other fixed points of $M$ emerge. These new fixed points are foils for Population EM1.
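The emergence and disappearance of spurious fixed points can be reproduced numerically by counting where the curve $M(\lambda)$ crosses the diagonal. A sketch (our own; the grids and weight values are illustrative, and we take $\theta^* = 1$):

```python
import numpy as np

def count_fixed_points(theta, w1):
    # Count crossings of M(lam) = lam for Population EM1, where M is
    # approximated by integrating against the true density on a fine grid.
    xs = np.linspace(-10.0, 10.0, 8001)
    dx = xs[1] - xs[0]
    phi = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)
    density = w1 * phi(xs - theta) + (1.0 - w1) * phi(xs + theta)
    # offset grid avoids landing exactly on a fixed point such as 0
    lam_grid = np.arange(-5.005, 5.01, 0.01)
    g = []
    for lam in lam_grid:
        a = w1 * phi(xs - lam)
        b = (1.0 - w1) * phi(xs + lam)
        r = a / (a + b)
        m = np.sum((2.0 * r - 1.0) * xs * density) * dx
        g.append(m - lam)
    return int(np.sum(np.diff(np.sign(g)) != 0))

counts = {w1: count_fixed_points(1.0, w1) for w1 in (0.50, 0.60, 0.80, 0.99)}
```

For $w_1 = 0.5$ the count is 3 (the fixed points $-\theta^*$, $0$, $\theta^*$ of the symmetric case), while for $w_1$ near $1$ it drops to 1; scanning intermediate values of $w_1$ locates the threshold below which the extra fixed points appear.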
From the failure of Population EM1, one may expect the overparameterized Population EM2 to fail as well. Yet, surprisingly, our second theorem proves the opposite: Population EM2 enjoys global convergence even when $w_1 \neq 1/2$.
Theorem 2.
For any $w_1 \in (0, 1)$, the Population EM2 estimate $(\lambda^t, w^t)$ converges to either $(\theta^*, w_1)$ or $(-\theta^*, w_2)$ for any initialization $\lambda^0$
except on the hyperplane
$\{\lambda : \langle \lambda, \theta^* \rangle = 0\}$. Furthermore, the convergence speed is geometric after some finite number of iterations, i.e., there exist a finite number $T$ and a constant $\kappa \in (0, 1)$ such that the following hold.

If $\langle \lambda^0, \theta^* \rangle > 0$, then for all $t \ge T$, $\lVert \lambda^t - \theta^* \rVert \le \kappa^{t - T}\, \lVert \lambda^T - \theta^* \rVert$.

If $\langle \lambda^0, \theta^* \rangle < 0$, then for all $t \ge T$, $\lVert \lambda^t + \theta^* \rVert \le \kappa^{t - T}\, \lVert \lambda^T + \theta^* \rVert$.
Theorem 2 implies that if we use random initialization for $\lambda^0$, then with probability one, the Population EM2 estimates converge to the true parameters (up to a relabeling of the two components, which leaves the mixture unchanged).
The failure of Population EM1 and the success of Population EM2 can be explained intuitively. Let $G_1$ and $G_2$, respectively, denote the true mixture components with parameters $(w_1, \theta^*)$ and $(w_2, -\theta^*)$. Due to the symmetry in Population EM1, we are assured that among the two estimated mixture components, one will have a positive mean and the other a negative mean: call these $\hat G_+$ and $\hat G_-$, respectively. Assume $\theta^* > 0$ and $w_1 > w_2$, and consider initializing Population EM1 with $\lambda^0 < 0$. This initialization incorrectly associates $\hat G_-$ with the larger weight $w_1$ instead of the smaller weight $w_2$. This causes, in the E-step of EM, the component $\hat G_-$ to become “responsible” for an overly large share of the overall probability mass, and in particular an overly large share of the mass from $G_1$ (which has a positive mean). Thus, in the M-step of EM, when the mean of the estimated component $\hat G_-$ is updated, it is pulled rightward towards $\theta^*$. It is possible that this rightward pull would cause the estimated mean of $\hat G_-$ to become positive, in which case the roles of $\hat G_-$ and $\hat G_+$ would switch, but this will not happen as long as $w_1$ is sufficiently bounded away from $1$ (but still larger than $w_2$). (When $w_1$ is indeed very close to $1$, almost all of the probability mass of the true distribution comes from $G_1$, which has positive mean. So, in the M-step discussed above, the rightward pull on the mean of $\hat G_-$ may be so strong that the updated mean estimate becomes positive. Since the model enforces that the mean estimates of $\hat G_-$ and $\hat G_+$ be negations of each other, the roles switch, and now it is $\hat G_+$ that becomes associated with the larger mixing weight $w_1$. In this case, owing to the symmetry assumption, Population EM1 may be able to successfully converge to $\theta^*$. We revisit this issue in the numerical study, where the symmetry assumption is removed.) The result is a bias in the estimation of $\theta^*$, which explains why the Population EM1 estimate converges to some $\bar\lambda \in (-\theta^*, 0)$ when $w_1$ is not too large.
Our discussion confirms that one way Population EM1 may fail (in dimension one) is if it is initialized with $\lambda^0$ of the “incorrect” sign (e.g., $\lambda^0 \theta^* < 0$). On the other hand, the performance of Population EM2 does not depend on the sign of the initial $\lambda^0$. Recall that the estimates of Population EM2 converge to the fixed points of the mapping $(f_\lambda, f_w)$ defined in (6) and (7). One can check that for all $(\lambda, w)$, we have
(8) $f_\lambda(-\lambda,\, 1 - w) = -f_\lambda(\lambda, w), \qquad f_w(-\lambda,\, 1 - w) = 1 - f_w(\lambda, w).$
Hence, $(\lambda, w)$ is a fixed point of $(f_\lambda, f_w)$ if and only if $(-\lambda, 1 - w)$ is a fixed point as well. Therefore, Population EM2 is insensitive to the sign of the initial $\lambda^0$. This property can be extended to mixtures of $k$ Gaussians as well: in those cases, the performance of EM for Model 2 is insensitive to permutations of the component parameters. Hence, because of this property, as we will confirm in our simulations, when the mixture components are well separated, EM for Model 2 performs well for most initializations, while EM for Model 1 fails in many cases.
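Since the invariance in (8) holds pointwise for the EM2 update operator, regardless of the data distribution, it can be checked to machine precision on any sample. A small sketch of ours:

```python
import numpy as np

def em2_step(x, lam, w):
    # One EM2 update: E-step responsibilities, then M-step for lam and w.
    a = w * np.exp(-0.5 * (x - lam) ** 2)
    b = (1.0 - w) * np.exp(-0.5 * (x + lam) ** 2)
    r = a / (a + b)
    return np.mean((2.0 * r - 1.0) * x), np.mean(r)

rng = np.random.default_rng(1)
x = np.concatenate([2.0 + rng.standard_normal(700),
                    -2.0 + rng.standard_normal(300)])

lam, w = 0.8, 0.35
lam_a, w_a = em2_step(x, lam, w)
lam_b, w_b = em2_step(x, -lam, 1.0 - w)   # flipped-and-swapped input
```

Up to floating-point roundoff, `lam_b == -lam_a` and `w_b == 1 - w_a`: the update commutes with the relabeling $(\lambda, w) \mapsto (-\lambda, 1 - w)$, so fixed points come in relabeled pairs.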
One limitation of our permutation-based explanation is that the argument only holds when the weights in Population EM2 are initialized to be uniform. However, the benefits of overparameterization are not limited to this case. Indeed, when we compare the landscapes of the log-likelihood objectives for (the mixture models corresponding to) Population EM1 and Population EM2, we find that overparameterization eliminates spurious local maxima that were obstacles for Population EM1.
Theorem 3.
For all mixing weights $w_1$, the log-likelihood objective optimized by Population EM2 has only one saddle point and no local maximizers besides the two global maximizers $(\theta^*, w_1)$ and $(-\theta^*, w_2)$.
The proof of this theorem is presented in Appendix C.
Remark 1.
Consider the landscape of the log-likelihood objective for Population EM2 and the point $(\bar\lambda, w_1)$, where $\bar\lambda$ is the local maximizer suggested by Theorem 1. Theorem 3 implies that we can still easily escape from this point, due to the nonzero gradient in the direction of the weight coordinate; thus $(\bar\lambda, w_1)$ is not even a saddle point. We emphasize that this is exactly the mechanism we hoped overparameterization would provide.
Remark 2.
Note that although $(\theta^*, w_1)$ and $(-\theta^*, w_2)$ are both fixed points for Population EM2, $-\theta^*$ is not a first-order stationary point of the log-likelihood objective of Model 1 if $w_1 \neq 1/2$.
Finally, to complete the analysis of EM for mixtures of two Gaussians, we present the following result, which applies to Sample-based EM2.
Theorem 4.
Let $(\lambda_n^t, w_n^t)$ be the estimates of Sample-based EM2 with sample size $n$. Suppose $\langle \lambda^0, \theta^* \rangle \neq 0$ and $t = t(n) \to \infty$ as $n \to \infty$. Then $\lambda_n^{t(n)}$ converges to $\theta^*$ if $\langle \lambda^0, \theta^* \rangle > 0$ and to $-\theta^*$ otherwise,
where the convergence is in probability.
2.3 Roadmap of the proof for Theorem 2
Our first lemma, proved in Appendix B.1, confirms that the sign of $\langle \lambda^t, \theta^* \rangle$ never changes along the iterations. In other words, the estimates of Population EM2 remain in the correct half-space, and the weight moves in the right direction, too.
Lemma 1.
If $\langle \lambda^0, \theta^* \rangle > 0$, we have $\langle \lambda^t, \theta^* \rangle > 0$ for all $t$. Otherwise, if $\langle \lambda^0, \theta^* \rangle < 0$, we have $\langle \lambda^t, \theta^* \rangle < 0$ for all $t$.
On account of Lemma 1 and the invariance in (8), we can assume without loss of generality that $\langle \lambda^0, \theta^* \rangle > 0$, and hence $\langle \lambda^t, \theta^* \rangle > 0$ for all $t$.
Let $d$ be the dimension of $\theta^*$. We reduce the case $d > 1$ to the case $d = 1$. This is achieved by proving that the angle between the two vectors $\lambda^t$ and $\theta^*$ is a decreasing function of $t$ and converges to $0$. The details appear in Appendix B.4. Hence, in the rest of this section we focus on the proof of Theorem 2 for $d = 1$.
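The reduction from $d > 1$ to $d = 1$ reflects the fact that the EM2 iterates align with the line spanned by $\theta^*$. A sample-based two-dimensional sketch (our own construction; with a large sample, the sample updates closely track the population ones):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([2.0, 0.0])                 # true mean direction
n, w1 = 200_000, 0.7
z = rng.random(n) < w1
x = np.where(z[:, None], theta, -theta) + rng.standard_normal((n, 2))

lam, w = np.array([-1.0, 1.0]), 0.5          # 135 degrees away from theta
angles = []
for _ in range(100):
    # E-step responsibilities for the component with mean +lam
    a = w * np.exp(-0.5 * np.sum((x - lam) ** 2, axis=1))
    b = (1.0 - w) * np.exp(-0.5 * np.sum((x + lam) ** 2, axis=1))
    r = a / (a + b)
    # M-step: update weight and the shared mean vector
    w = r.mean()
    lam = ((2.0 * r - 1.0)[:, None] * x).mean(axis=0)
    # angle between lam and the LINE spanned by theta (sign-insensitive)
    cos = abs(lam @ theta) / (np.linalg.norm(lam) * np.linalg.norm(theta))
    angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

The recorded angle between $\lambda^t$ and the line spanned by $\theta^*$ shrinks, and the iterate converges to approximately $\pm\theta^*$ with the matching weight.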
Let $f_w$ and $f_\lambda$ be shorthand for the two update functions defined in (6) and (7), for a fixed $w_1$. To prove that $(\lambda^t, w^t)$ converges to the fixed point $(\theta^*, w_1)$, we establish the following claims:

(C.1) There exists a set $B = [\underline\lambda, \overline\lambda] \times [\underline w, \overline w]$, with $0 < \underline\lambda \le \theta^* \le \overline\lambda$ and $0 < \underline w \le w_1 \le \overline w < 1$, such that $B$ contains the point $(\theta^*, w_1)$ and the iterates $(\lambda^t, w^t)$ for all sufficiently large $t$. Further, $f_\lambda(\lambda, w)$ is a nondecreasing function of $\lambda$ for a given $w$, and $f_w(\lambda, w)$ is a nondecreasing function of $w$ for a given $\lambda$.

(C.2) There is a reference curve $w = g(\lambda)$ defined on $\bar B$ (the closure of $B$) such that:

(C.2a) $g$ is continuous, decreasing, and passes through the point $(\theta^*, w_1)$, i.e., $g(\theta^*) = w_1$.

(C.2b) Given $\lambda$, the function $w \mapsto f_w(\lambda, w)$ has a stable fixed point in $(0, 1)$. Further, the position of any stable fixed point strictly above or strictly below the reference curve determines its position relative to $w_1$ in the corresponding way.

(C.2c) Given $w$, the function $\lambda \mapsto f_\lambda(\lambda, w)$ has a stable fixed point. Further, the position of any stable fixed point strictly above or strictly below the reference curve determines its position relative to $\theta^*$ in the corresponding way.
We illustrate C.1 and C.2 in Figure 2. Heuristically, we expect $(\theta^*, w_1)$ to be the only fixed point of the mapping $(f_\lambda, f_w)$ on $B$, and we expect the iterates $(\lambda^t, w^t)$ to move toward this fixed point. Hence, we can prove convergence of the iterates by establishing certain geometric relationships between the curves of fixed points of the two functions: C.1 confines the iterates to a region where such relationships hold, and the reference curve together with C.2 is the tool for characterizing the geometric relationships shown in the figure. Indeed, the next lemma implies that C.1 and C.2 are sufficient for convergence to the right point $(\theta^*, w_1)$.

Lemma 2 (Proved in Appendix B.2.1).
Suppose continuous functions $f_\lambda$ and $f_w$ satisfy C.1 and C.2. Then $(\theta^*, w_1)$ is the only fixed point of the mapping $(\lambda, w) \mapsto (f_\lambda(\lambda, w), f_w(\lambda, w))$ on $\bar B$, the closure of $B$. Further, if we initialize in $B$, the sequence defined by
$(\lambda^{t+1}, w^{t+1}) = \big(f_\lambda(\lambda^t, w^t),\, f_w(\lambda^t, w^t)\big)$ converges to $(\theta^*, w_1)$.
In our problem, we take $f_\lambda$ and $f_w$ to be the Population EM2 updates in (6) and (7). Then, according to Lemma 1 and the monotonicity of $f_\lambda$ and $f_w$, C.1 is satisfied.
To show C.2, we first define the reference curve by
(9) 
The claim C.2a holds by construction. To show C.2b, we establish an even stronger property of the weight update function $f_w$: for any fixed $\lambda$, the function $w \mapsto f_w(\lambda, w)$ has at most one other fixed point besides $0$ and $1$, and, most importantly, it has a unique stable fixed point. This is formalized in the following lemma.
Lemma 3 (Proved in Appendix B.2.2).
For any fixed $\lambda$, there are at most three fixed points of $w \mapsto f_w(\lambda, w)$ in $[0, 1]$. Further, there exists a unique stable fixed point $w^\dagger$, i.e., (i) $f_w(\lambda, w^\dagger) = w^\dagger$, and (ii) for all $w \neq w^\dagger$ in a neighborhood of $w^\dagger$, we have
(10) $\big| f_w(\lambda, w) - w^\dagger \big| < \big| w - w^\dagger \big|.$
We explain Lemma 3 in Figure 1. Note that, in the figure, we observe that $w \mapsto f_w(\lambda, w)$ is an increasing function with $f_w(\lambda, 0) = 0$ and $f_w(\lambda, 1) = 1$. Further, it is either concave, or it is piecewise concave-then-convex (there exists $w_0$ such that the function is concave on $(0, w_0)$ and convex on $(w_0, 1)$). Hence, if the derivative of $f_w(\lambda, \cdot)$ at $0$ is at most $1$, the only stable fixed point is at the boundary; otherwise, if the derivative is larger than $1$, there exists only one fixed point in $(0, 1)$, and it is the unique stable fixed point. The complete proof of C.2b is given in Appendix B.3.
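The qualitative shape claims behind Lemma 3 (the weight update is increasing in $w$ and fixes the endpoints $0$ and $1$) are easy to probe numerically. A sketch with illustrative values $\theta^* = 2$, $w_1 = 0.7$, $\lambda = 1$ (our choices):

```python
import numpy as np

THETA, W1, LAM = 2.0, 0.7, 1.0               # illustrative values
XS = np.linspace(-12.0, 12.0, 9601)          # integration grid
DX = XS[1] - XS[0]
PHI = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)
DENSITY = W1 * PHI(XS - THETA) + (1.0 - W1) * PHI(XS + THETA)

def f_w(lam, w):
    # Population EM2 weight update (6) for a fixed mean parameter lam.
    a = w * PHI(XS - lam)
    b = (1.0 - w) * PHI(XS + lam)
    r = a / (a + b)
    return np.sum(r * DENSITY) * DX

ws = np.linspace(0.0, 1.0, 201)
vals = np.array([f_w(LAM, w) for w in ws])   # shape of w -> f_w(lam, w)

w = 0.5                                      # iterate the weight update alone
for _ in range(500):
    w = f_w(LAM, w)
```

The sampled curve is increasing with $f_w(\lambda, 0) = 0$ and $f_w(\lambda, 1) = 1$, and iterating the update drives $w$ to a stable fixed point, matching the picture described above.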
The final step in applying Lemma 2 is to prove C.2c. However, $(\theta^*, w_1)$ lies on the reference curve and is a stable fixed point of $\lambda \mapsto f_\lambda(\lambda, w_1)$, a boundary case that violates C.2c as stated. To address this issue, since we can characterize the shape and the number of fixed points of $f_w$, a standard uniform-continuity argument lets us construct an adjusted reference curve, a small perturbation of $g$, that satisfies C.2a and C.2b. Then we can prove that the adjusted reference curve also satisfies C.2c; see Appendix B.3.1.
3 Numerical results
In this section, we present numerical results that show the value of overparameterization in some mixture models not covered by our theoretical results.
3.1 Setup
Our goal is to analyze the effect of the sample size, mixing weights, and the number of mixture components on the success of the two EM algorithms described in Section 2.1.
We implement EM for both Model 1 (where the weights are assumed to be known) and Model 2 (where the weights are not known), and run each algorithm multiple times with random initial mean estimates. We compare the two versions of EM by their (empirical) success probabilities, which we denote by $p_1$ and $p_2$, respectively. Success is defined in two ways, depending on whether EM is run with a finite sample or with an infinite-size sample (i.e., the population analogue of EM).
When EM is run using a finite sample, we do not expect to recover the component means exactly. Hence, success is declared when the means are recovered up to some expected error, according to the following measure:

(11) $\mathrm{err}(\hat\mu;\, \mu) = \min_{\pi \in \mathcal{P}_k}\ \Big( \sum_{i=1}^{k} \lVert \hat\mu_{\pi(i)} - \mu_i \rVert^2 \Big)^{1/2},$

where $\mathcal{P}_k$ is the set of all possible permutations on $\{1, \dots, k\}$. We declare success if the error is at most a threshold proportional to the asymptotic standard error of the MLE, computed from the Fisher information $\mathcal{I}(\mu)$ at the true value $\mu$. We adopt this criterion since it is well known that the MLE is asymptotically normal with covariance $\mathcal{I}(\mu)^{-1}/n$; the proportionality constant thus corresponds to an approximate coverage level.
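A simplified version of the error measure (11) can be computed by brute force over all $k!$ permutations (this sketch ignores the Fisher-information weighting used for the success threshold; that omission is our simplification):

```python
import itertools
import numpy as np

def perm_error(mu_hat, mu):
    # Smallest root-sum-squared distance between estimated and true
    # component means over all one-to-one matchings (permutations).
    k = len(mu)
    best = np.inf
    for p in itertools.permutations(range(k)):
        err = np.sqrt(sum(
            np.sum((np.asarray(mu_hat[j]) - np.asarray(mu[i])) ** 2)
            for i, j in enumerate(p)))
        best = min(best, err)
    return best

# Example: labels swapped relative to the truth, small estimation error.
e = perm_error([np.array([2.1]), np.array([-2.0])],
               [np.array([-2.0]), np.array([2.0])])
```

This brute-force matching is exponential in $k$, which is harmless for the small $k$ used here.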
When EM is run using an infinite-size sample, we declare EM successful when the error defined in (11) is below a small fixed tolerance.
3.2 Mixtures of two Gaussians
We first consider mixtures of two Gaussians in one dimension, i.e., $k = 2$ and $d = 1$. Unlike in our theoretical analysis, the mixture components are not constrained to be symmetric about the origin. For simplicity, we always generate the data with symmetric true means, but this information is not used by EM. Further, we vary the sample size $n$ (including the infinite-sample case), the separation between the component means, and the mixing weight $w_1$; this gives a total of 18 cases. For each case, we run EM with many random initializations and compute the empirical probability of success. When the sample is finite, the initial mean parameters are chosen uniformly at random from the sample. When the sample size is infinite, the initial mean parameters are chosen uniformly at random from a rectangle containing the true means.
A subset of the success probabilities is shown in Table 1; see Appendix F for the full set of results. Our simulations lead to the following empirical findings about the behavior of EM on data from well-separated mixtures. First, at the larger separation, EM for Model 2 finds the MLE almost always ($p_2 \approx 1$), while EM for Model 1 only succeeds about half the time ($p_1 \approx 0.5$). Second, at smaller separations, EM for Model 2 still has a higher chance of success than EM for Model 1, except when the weights $w_1$ and $w_2$ are almost equal. In that regime, the bias in Model 1 is not big enough to stand out from the error due to the finite sample, and hence Model 1 is preferable. Notably, unlike in the special model in (2), highly unbalanced weights do not help EM for Model 1, due to the lack of symmetry of the component means (i.e., the estimates need not satisfy $\hat\mu_2 = -\hat\mu_1$).
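The qualitative gap can be reproduced in a few lines. A scaled-down sketch of this experiment with our own parameter choices ($n = 2000$, true means $\pm 2$, $w_1 = 0.7$, 50 random initializations):

```python
import numpy as np

rng = np.random.default_rng(3)
MU, W1, N = np.array([2.0, -2.0]), 0.7, 2000   # true means and weight

def em_2comp(x, mu, weights=None, iters=300):
    # Generic 1-D two-component EM with unit variances. If `weights` is
    # given, it is held fixed (Model 1); otherwise it is estimated (Model 2).
    w = 0.5 if weights is None else weights
    for _ in range(iters):
        a = w * np.exp(-0.5 * (x - mu[0]) ** 2)
        b = (1.0 - w) * np.exp(-0.5 * (x - mu[1]) ** 2)
        r = a / (a + b)
        mu = np.array([np.sum(r * x) / np.sum(r),
                       np.sum((1.0 - r) * x) / np.sum(1.0 - r)])
        if weights is None:
            w = np.mean(r)
    return mu, w

def trial():
    z = rng.random(N) < W1
    x = np.where(z, MU[0], MU[1]) + rng.standard_normal(N)
    init = rng.choice(x, size=2, replace=False)    # random init from sample
    mu1, _ = em_2comp(x, init.copy(), weights=W1)  # Model 1: w known
    mu2, w2 = em_2comp(x, init.copy())             # Model 2: w estimated
    # success: the component carrying the larger weight has the right mean
    ok1 = abs(mu1[0] - MU[0]) < 0.3 and abs(mu1[1] - MU[1]) < 0.3
    big2 = mu2[0] if w2 > 0.5 else mu2[1]
    small2 = mu2[1] if w2 > 0.5 else mu2[0]
    ok2 = abs(big2 - MU[0]) < 0.3 and abs(small2 - MU[1]) < 0.3
    return ok1, ok2

results = [trial() for _ in range(50)]
p1 = np.mean([a for a, _ in results])
p2 = np.mean([b for _, b in results])
```

With these choices, Model 1 succeeds mainly when the random initialization happens to assign the fixed weights to the right components, while Model 2 recovers the mixture from most initializations, mirroring the pattern in Table 1.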
We conclude that overparameterization helps EM when the two mixture components are well separated and the mixing weights are not too close to each other.
3.3 Mixtures of three or four Gaussians
We now consider a setup with mixtures of three or four Gaussians. Specifically, we consider the following four cases, each using a larger sample size:

Case 1, mixture of three Gaussians on a line: , , with weights .

Case 2, mixture of three Gaussians on a triangle: , , with weights .

Case 3, mixture of four Gaussians on a line: , , , with weights .

Case 4, mixture of four Gaussians on a trapezoid: , , , with weights .
The other aspects of the simulations are the same as in the previous subsection.
The results are presented in Table 1. From the table, we confirm that EM for Model 2 (with unknown weights) has a higher success probability than EM for Model 1 (with known weights). Therefore, overparameterization helps in all four cases.
3.4 Explaining the disparity
As discussed in Section 2.2, the performance of the EM algorithm with unknown weights does not depend on the ordering of the initial means. We conjecture that, in general, this consequence of overparameterization is responsible for the boost observed in the performance of EM with unknown weights.
We support this conjecture by revisiting the previous simulations with a different way of running EM for Model 1. For each set of $k$ vectors selected for use as initial component means, we run EM $k!$ times, each using a different one-to-one assignment of these vectors to the components. We measure the empirical success probability, denoted $\tilde p_1$, based on the lowest observed error among these $k!$ runs of EM. The results are presented in Table 3 in Appendix F. In general, we observe that $\tilde p_1$ is comparable to $p_2$ in all cases we have studied, which supports our conjecture. However, this procedure is generally more time-consuming than EM for Model 2, since $k!$ executions of EM are required.
Success probabilities for mixtures of two Gaussians (Section 3.2); each entry is $p_1$ / $p_2$ (Model 1 / Model 2), with rows indexed by separation and columns by sample size:

0.799 / 0.500 | 0.497 / 0.800 | 0.499 / 0.899
0.504 / 1.000 | 0.514 / 1.000 | 0.506 / 1.000

Success probabilities for mixtures of three or four Gaussians (Section 3.3); each entry is $p_1$ / $p_2$:

Case 1: 0.164 / 0.900 | Case 2: 0.167 / 1.000 | Case 3: 0.145 / 0.956 | Case 4: 0.159 / 0.861
Acknowledgements
DH and JX were partially supported by NSF awards DMREF-1534910 and CCF-1740833, and JX was also partially supported by a Cheung-Kong Graduate School of Business Fellowship. We thank Jiantao Jiao for a helpful discussion about this problem.
References
 Balakrishnan et al. (2017) Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Statist., 45(1):77–120, 2017.
 Candès and Recht (2009) Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717, 2009.
 Candès and Tao (2010) Emmanuel J Candès and Terence Tao. The power of convex relaxation: Nearoptimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
 Chrétien and Hero (2008) Stéphane Chrétien and Alfred O Hero. On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics, 12:308–326, May 2008.
 Daskalakis et al. (2017) Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. In Proceedings of the 2017 Conference on Learning Theory, pages 704–710, 2017.
 d’Aspremont et al. (2005) Alexandre d’Aspremont, Laurent E Ghaoui, Michael I Jordan, and Gert R Lanckriet. A direct formulation for sparse pca using semidefinite programming. In Advances in neural information processing systems, pages 41–48, 2005.
 Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B, 39:1–38, 1977.
 Donoho (2006) David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

 Du and Lee (2018) Simon Du and Jason Lee. On the power of over-parametrization in neural networks with quadratic activation. In Proceedings of the 35th International Conference on Machine Learning, pages 1329–1338, 2018.
 Haeffele and Vidal (2015) Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
 Jin et al. (2016) Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J Wainwright, and Michael I Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, pages 4116–4124, 2016.
 Klusowski and Brinda (2016) J. M. Klusowski and W. D. Brinda. Statistical guarantees for estimating the centers of a two-component Gaussian mixture by EM. arXiv e-prints, August 2016.
 Koltchinskii (2011) V. Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems. In École d'été de probabilités de Saint-Flour XXXVIII, 2011.
 Livni et al. (2014) Roi Livni, Shai ShalevShwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
 Nguyen and Hein (2017) Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2603–2612, 2017.
 Nguyen and Hein (2018) Quynh Nguyen and Matthias Hein. Optimization landscape and expressivity of deep CNNs. In Proceedings of the 35th International Conference on Machine Learning, pages 3730–3739, 2018.
 Redner and Walker (1984) R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
 Safran and Shamir (2017) Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

 Soltani and Hegde (2018) Mohammadreza Soltani and Chinmay Hegde. Towards provable learning of polynomial neural networks using low-rank matrix estimation. In International Conference on Artificial Intelligence and Statistics, pages 1417–1426, 2018.
 Tseng (2004) Paul Tseng. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research, 29(1):27–44, Feb 2004.
 Wu (1983) C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, Mar 1983.
 Xu et al. (2016) Ji Xu, Daniel J Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems, pages 2676–2684, 2016.
 Xu and Jordan (1996) L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8:129–151, 1996.
 Yan et al. (2017) Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence of gradient EM on multi-component mixture of Gaussians. In Advances in Neural Information Processing Systems, pages 6956–6966, 2017.
Appendix A Proof of Theorem 1
Let us define . First, it is straightforward to show that
and

is concave for and .

is convex for and .
Hence, we have
(12) 
Therefore, if we can show that the curve of is strictly above the curve for all and , i.e.,
(13) 
then by (12), we have
(14) 
Further, since is continuous, we know there exists