The EM algorithm has a long history dating back to 1886 [1]. Its modern presentation was given in [2], and it has been applied widely to the maximum likelihood problem in mixture models or, equivalently, to minimizing the negative log-likelihood (also known as cross entropy). Some basic properties of EM have been derived. For example, EM monotonically decreases the loss and converges to critical points.
Mixture models are important in clustering. The most popular one is the Gaussian mixture model (GMM). The optimization problem is non-convex, which makes the convergence analysis hard. This is true even in simple settings such as one dimension or spherical covariance matrices. Under separability assumptions, a series of works has proposed new algorithms for learning GMMs efficiently [5, 6, 7, 8, 9, 10, 11, 12, 13]. For the EM algorithm itself, however, much less work has been done. One well-studied case is the mixture of two equally balanced components [14, 15, 16], where local and global convergence guarantees are shown. For more than two components, [17] shows that there exist arbitrarily bad local minima. There has also been some study of the convergence of EM on Gaussian mixtures with more than two components [18, 19].
GMMs are well suited in practice for continuous variables. For discrete data clustering, we need discrete mixture models. A common assumption is naive Bayes: within each component, the features are independent. This corresponds to the spherical covariance matrix setting in GMMs. Since discrete data can be converted to binary strings, in this paper we focus on Bernoulli mixture models (BMMs), which have been applied in text classification and digit recognition. However, the theory is less studied. Although the Bernoulli and Gaussian distributions are both in the exponential family, we will show that BMMs have different properties from GMMs in the sense that GD is stable (i.e., gets trapped) at so-called one-cluster regions of BMMs, but unstable at (i.e., escapes) all one-cluster regions of GMMs. On the other hand, the two mixture models share similarities that we will explain.
In the practice of clustering with mixture models, it is common for algorithms to converge to regions with only k clusters out of m desired clusters (where k < m). We call them k-cluster regions. We observed in experiments that EM escapes such regions exponentially faster than GD for GMMs. For BMMs, GD may get stuck at some k-cluster regions, and the probability can be very high, while EM always escapes such regions. These experimental results can be found in Section 6.
Theoretically, k-cluster regions are difficult to study. In this paper, we focus on one-cluster regions in mixtures of two components, which can be considered a first step towards the more difficult problem. We succeed in explaining the different escape rates for GMMs with two components. We find that GD escapes one-cluster regions linearly while EM escapes exponentially fast. For BMMs, our theorem shows that there exist one-cluster regions to which GD converges, given a small enough step size and a close enough neighborhood, whereas EM always escapes such regions exponentially fast. Such one-cluster regions can be arbitrarily worse than the global minimum. Experiments show that this contrast holds not only for mixtures of two Bernoullis, but also for BMMs with any number of components.
The rest of the paper is structured as follows. In Section 2, we give necessary background and notations. Following that, the main contributions are summarized in Section 3. In Section 4, we analyze the stability of EM and GD around one-cluster regions, for GMMs and BMMs with two components. The properties of such one-cluster regions are shown in Section 5. We provide supporting experiments in Section 6 and conclude in Section 7.
2 Background and Notations
A mixture model of m components has the distribution:
with the mixing vector on the dimensional probability simplex : . Here is the conditional distribution given cluster , and parameters . The sample space (i.e., space of x) is -dimensional. We study the population likelihood, where there are infinitely many samples and the sample distribution is the true distribution. This assumption is common (e.g., in 
where the expectation is over , the true distribution. We assume by default this notation of expectation in this section. The loss function above is also known as cross entropy loss. The true distribution is assumed to have the same form as the model distribution: , where . Notice that the population assumption is not necessary in this section. We could replace with a finite sample distribution, with i.i.d. samples drawn from .
For Gaussian mixtures, we consider the conditional distributions:
where , and the covariance matrix of each cluster .
For Bernoulli mixtures, we have , and:
where the Bernoulli distribution is denoted as. Slightly abusing the notation, we also use
to represent the joint distribution of x as a product of the marginal distributions. The expectation of the conditional distribution is:
so can be interpreted as the mean of cluster . The covariance matrix of cluster is , where we use to denote element-wise multiplication. Also, we always assume , and thus .
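As a quick sanity check of the product-of-Bernoullis model, the sketch below enumerates all binary vectors for a small example and confirms the mean and diagonal covariance stated above. It assumes the standard form P(x | mu) = prod_j mu_j^{x_j} (1 - mu_j)^{1 - x_j}; the function names and the example vector `mu` are illustrative and are not taken from the paper.

```python
import itertools
import numpy as np

def bernoulli_product_pmf(x, mu):
    """P(x | mu) for a binary vector x with independent Bernoulli features."""
    x = np.asarray(x)
    mu = np.asarray(mu, dtype=float)
    return float(np.prod(np.where(x == 1, mu, 1.0 - mu)))

def moments_by_enumeration(mu):
    """Total mass, mean and covariance of x, summing over all 2^d binary vectors."""
    mu = np.asarray(mu, dtype=float)
    xs = np.array(list(itertools.product([0, 1], repeat=mu.size)), dtype=float)
    probs = np.array([bernoulli_product_pmf(x, mu) for x in xs])
    mean = probs @ xs
    centered = xs - mean
    cov = (centered * probs[:, None]).T @ centered
    return float(probs.sum()), mean, cov

# Hypothetical cluster mean, chosen only for illustration.
mu = np.array([0.2, 0.5, 0.9])
total, mean, cov = moments_by_enumeration(mu)
print(total)   # total probability mass
print(mean)    # should match mu
print(cov)     # should match diag(mu * (1 - mu))
```

Enumeration is only practical for a handful of features, but it makes the independence assumption easy to verify directly.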
2.1 EM algorithm
The EM update map is:
Sometimes, it is better to interpret (2.7) in a different way. Define an unnormalized distribution . The corresponding partition function
and the normalized distribution can be written as:
where the integration is replaced with summation, given discrete mixture models. (2.7) can be rewritten as:
This interpretation will be useful for our analysis near one-cluster regions in Section 4.
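To make the two readings of the EM update concrete, here is a minimal finite-sample sketch for a Bernoulli mixture. It uses the textbook EM step (responsibilities, then weighted averages), which we assume is the sample analogue of the population update discussed above, and it checks that the new mixing coefficients equal the old ones multiplied by an empirical partition function Z_c = mean_i P(x_i | c) / p(x_i). All names, the synthetic data, and the initial values are hypothetical.

```python
import numpy as np

def component_likelihoods(X, mu):
    """(n, m) array of P(x_i | c) for product-of-Bernoullis components."""
    log_p = X @ np.log(mu).T + (1.0 - X) @ np.log(1.0 - mu).T
    return np.exp(log_p)

def neg_log_likelihood(X, pi, mu):
    """Empirical cross entropy loss: -mean_i log p(x_i)."""
    return float(-np.mean(np.log(component_likelihoods(X, mu) @ pi)))

def partition_functions(X, pi, mu):
    """Empirical Z_c = mean_i P(x_i | c) / p(x_i)."""
    lik = component_likelihoods(X, mu)
    return np.mean(lik / (lik @ pi)[:, None], axis=0)

def em_step(X, pi, mu):
    """One textbook EM step for a Bernoulli mixture."""
    weighted = component_likelihoods(X, mu) * pi
    resp = weighted / weighted.sum(axis=1, keepdims=True)  # responsibilities
    nk = resp.sum(axis=0)
    return nk / X.shape[0], (resp.T @ X) / nk[:, None]

# Synthetic data from a hypothetical two-component, three-feature mixture.
rng = np.random.default_rng(0)
true_mu = np.array([[0.8, 0.7, 0.2], [0.2, 0.3, 0.9]])
z = rng.random(500) < 0.4
X = (rng.random((500, 3)) < np.where(z[:, None], true_mu[0], true_mu[1])).astype(float)

pi0 = np.array([0.5, 0.5])
mu0 = np.array([[0.6, 0.4, 0.5], [0.4, 0.6, 0.5]])
first_pi, _ = em_step(X, pi0, mu0)
z0 = partition_functions(X, pi0, mu0)

pi, mu = pi0, mu0
losses = [neg_log_likelihood(X, pi, mu)]
for _ in range(20):
    pi, mu = em_step(X, pi, mu)
    losses.append(neg_log_likelihood(X, pi, mu))
```

The identity `first_pi == pi0 * z0` is the multiplicative view of the mixing-coefficient update: a cluster gains mass exactly when its partition function exceeds one.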
2.2 Gradient Descent
GD and its variants are the default optimization algorithms in deep learning. In mixture models, we have to deal with constrained optimization, and thus we consider projected gradient descent (PGD). From (2.1), (2.2) and (2.6), the derivative over is:
The equation above is valid for both GMMs and BMMs. However, for BMMs, raises numerical instability near the boundary of . Therefore, we use the following equivalent formula instead in the implementation:
After a GD step, we have to project back into the feasible space . For BMMs, we use Euclidean norm projection. The projection of the part is simply:
with and applied element-wise. Projection of is not needed for GMMs. For the part, we borrow the algorithm from , which essentially solves the following optimization problem:
The projected gradient descent is therefore:
with the step size.
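The projection step can be sketched as follows. The simplex projection uses the well-known sort-based algorithm for Euclidean projection onto the probability simplex; we assume it solves the same optimization problem as the cited algorithm, but it is not necessarily the same procedure. The function names, the `pgd_step` signature, and the example gradients are hypothetical, and the gradients are supplied by the caller.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1} (sort-based)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]  # last index kept positive
    tau = css[rho] / (rho + 1.0)               # shift enforcing sum = 1
    return np.maximum(v - tau, 0.0)

def pgd_step(pi, mu, grad_pi, grad_mu, step):
    """One projected gradient step; grad_* are gradients of the loss."""
    new_pi = project_to_simplex(pi - step * grad_pi)
    new_mu = np.clip(mu - step * grad_mu, 0.0, 1.0)  # BMM box constraint
    return new_pi, new_mu

# Two-component example: the projection removes the mean excess.
p_two = project_to_simplex(np.array([0.3, 0.7]) + 0.01 * np.array([2.0, 1.0]))
# Three-component example with a negative coordinate that gets clipped to zero.
p_three = project_to_simplex([-0.5, 0.3, 0.9])
# One full step with hypothetical gradients.
new_pi, new_mu = pgd_step(np.array([0.3, 0.7]), np.array([[0.99, 0.5]]),
                          np.array([-2.0, -1.0]), np.array([[-5.0, 0.0]]), 0.01)
```

For GMMs the clipping of the means would simply be skipped, since no projection of that part is needed.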
2.3 k-cluster region
We define a k-cluster region and a k-cluster point as:
A k-cluster region is a subset of the parameter space where ‖π‖_0 = k, with ‖·‖_0 denoting the number of non-zero elements. An element in a k-cluster region is called a k-cluster point.
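In code, membership in a k-cluster region reduces to counting the non-zero mixing coefficients. The helper below is a minimal sketch; the tolerance `tol` is our own assumption for floating-point use and is not part of the definition.

```python
import numpy as np

def active_clusters(pi, tol=1e-12):
    """Number of mixing coefficients that are (numerically) non-zero."""
    return int(np.count_nonzero(np.asarray(pi, dtype=float) > tol))

def is_k_cluster_point(pi, k, tol=1e-12):
    """True if exactly k clusters carry non-zero mass."""
    return active_clusters(pi, tol) == k

print(active_clusters([0.5, 0.0, 0.5]))
print(is_k_cluster_point([1.0, 0.0], 1))
```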
In this work, we focus on one-cluster regions and mixtures of two components. In Section 6, we will show experiments on k-cluster regions for an arbitrary number of components. A key observation throughout our theoretical analysis and experiments is that, with random initialization, GD often converges to a k-cluster region where k < m, whereas EM almost always escapes such k-cluster regions, both with random initialization and with initialization in the neighborhood of k-cluster regions. Now, let us study one-cluster regions for mixture models of two components. WLOG, assume . To study the stability near such regions, we consider , with sufficiently small. From (2.6), the responsibility functions can be approximated as:
Under this approximation, . For EM, converges to within one step based on (2.7). For GD, converges to at a linear rate (footnote 2) based on (2.12) (footnote 3). For convenience, we define the difference between and .
Footnote 2: In the context of optimization, “a sequence converges to at a linear rate” means that there exists a number such that .
Footnote 3: For BMMs, assume is not initialized on the boundary. The rate is upper bounded by the initialization as converges to .
2.4 Stable fixed point and stable fixed region
A fixed point is stable under map if there exists a small enough feasible neighborhood of such that for any , , where we use to denote a composition for times. Otherwise, the fixed point is called unstable. Similarly, a fixed region is stable under map if there exists a small enough feasible neighborhood of such that for any , exists and . We will use these definitions in Section 4.
3 Main Contributions
We summarize the main contributions of this paper, starting from mixtures of two Gaussians. This is a well-known model that has been widely studied. However, we are not aware of any result regarding k-cluster points. As a first result in this line of research, we show that EM is better than GD at escaping one-cluster regions:
Result 3.1. Consider a mixture of two Gaussians with , unit covariance matrices and true distribution
When is initialized to be small enough, and , EM increases exponentially fast, while GD increases linearly (footnote 4).
Footnote 4: Here, we use the usual notion of function growth. Do not confuse it with the rate of convergence.
This result indicates that EM escapes the neighborhood of one-cluster regions faster than GD: it increases the probability of cluster 1 at a rate that is exponentially faster. The escape rates of EM and GD are formally proven in Theorems 4.1 and 4.2. Our second result concerns mixtures of two Bernoullis:
Result 3.2. For mixtures of two Bernoullis, there exist one-cluster regions that can trap GD, but do not trap EM. If any one-cluster region traps EM, it will also trap GD.
This result is formally stated in Theorems 4.3 and 4.7. It shows that EM is also better than GD at escaping one-cluster regions in BMMs. Our third result, an informal summary of Theorems 5.1 and 5.2, concerns the loss value at one-cluster regions:
Result 3.3. The one-cluster regions stated in Result 3.2 are local minima, and they can be worse than the optimal value.
So far, we have seen that EM is better than GD in mixtures of two components, in the sense that EM escapes one-cluster regions exponentially faster than GD in GMMs, and that EM escapes local minima that trap GD in BMMs. Empirically, we show that this is true in general. In Section 6, we find that for BMMs with an arbitrary number of components and features, when we initialize the parameters randomly, EM always converges to an m-cluster point, i.e., all clusters are used in the model. In comparison, GD converges to a k-cluster point with high probability, where only some of the clusters are employed in fitting the data (i.e., k < m). This result, combined with our analysis for mixtures of two components, implies that EM can be more robust than GD in terms of avoiding certain types of bad local optima when learning mixture models.
4 Analysis Near One-Cluster Regions
In this section, we analyze the stability of EM and GD near one-cluster regions. Our theoretical results for EM and GD are verified empirically in Section 6.
4.1 Mixture of two Gaussians
We first consider mixtures of two Gaussians with identity covariance matrices, under the infinite-sample assumption. Due to translation invariance, we can choose the origin to be the midpoint of the two cluster means, such that: and .
4.1.1 EM algorithm
As long as is sufficiently small, the following theorem shows that EM will increase and therefore will not converge to a one-cluster solution. Recall from (2.9) that is multiplied by at every step of EM and therefore we show that almost everywhere, ensuring that will increase. Since converges to within one step, we only consider those points where . We can show that under mild assumptions, EM escapes one-cluster regions exponentially fast:
Theorem 4.1. Consider a mixture of two Gaussians with , unit covariance matrices and true distribution
When is initialized to be small enough, and , EM increases exponentially fast.
Proof. It suffices to show that and grows if initially . To calculate , we first compute :
This equation shows that corresponds to an unnormalized mixture of Gaussians with their means shifted by and their mixing coefficients rescaled in comparison to . If , then . So, describes how different deviates from . The partition function can be computed by integrating out :
In fact, which can be derived from . So, becomes:
Using the fact and that equality holds iff , we can show that when , . Now, let us show that increases. It suffices to prove that increases. From (4.1.1) and (2.9), we have the update equation for : , with
If , then and . So, with , and
where we use the superscript to denote the updated values and to denote the old values. Similarly, we can prove that will decrease if . Hence, from (4.4), increases under EM if initially . ∎
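The key quantity in the argument above can be checked numerically. The sketch below is our own illustration, not the paper's derivation: it assumes the small-mixing-weight approximation, under which the partition function of cluster 1 is Z1 ≈ E_{x~p*}[N(x; mu1, I) / N(x; mu2, I)], and it assumes a balanced true mixture so that the true mean sits at the origin. Under those assumptions the Gaussian moment generating function gives a closed form, which we compare against a one-dimensional Riemann sum. All names and numerical values are hypothetical.

```python
import numpy as np

def z1_closed_form(mu1, mu2, true_pi, true_means):
    """E_{x~p*}[N(x; mu1, I) / N(x; mu2, I)] for a unit-covariance true mixture."""
    d = mu1 - mu2
    shift = 0.5 * (d @ d) - 0.5 * (mu1 @ mu1 - mu2 @ mu2)
    return float(np.sum(true_pi * np.exp(true_means @ d + shift)))

def z1_quadrature(mu1, mu2, true_pi, true_means, lo=-15.0, hi=15.0, n=20001):
    """One-dimensional check of the same expectation by a Riemann sum."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    normal = lambda t, m: np.exp(-0.5 * (t - m) ** 2) / np.sqrt(2.0 * np.pi)
    p_true = sum(w * normal(x, m) for w, m in zip(true_pi, true_means))
    ratio = np.exp(-0.5 * (x - mu1) ** 2 + 0.5 * (x - mu2) ** 2)
    return float(np.sum(p_true * ratio) * dx)

true_pi = np.array([0.5, 0.5])

# 1-D: true means at -2 and 2, mu2 at the true mean (0), mu1 = 1.
z_closed = z1_closed_form(np.array([1.0]), np.array([0.0]),
                          true_pi, np.array([[-2.0], [2.0]]))
z_quad = z1_quadrature(1.0, 0.0, true_pi, [-2.0, 2.0])

# 2-D: mu1 orthogonal to the direction separating the true means.
z_orth = z1_closed_form(np.array([0.0, 1.0]), np.zeros(2),
                        true_pi, np.array([[-2.0, 0.0], [2.0, 0.0]]))
```

In the first example the closed form evaluates to cosh(2), well above one, so the mixing weight of cluster 1 is multiplied by a factor larger than one at that EM step. In the orthogonal example the factor is exactly one, which matches the measure-zero exception discussed in the text.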
The theorem above can be extended to any mixture of two Gaussians with known fixed covariance matrices for both clusters, using the transformation and . We prove this formally in Appendix C. One may wonder what happens if originally . If we choose randomly, happens with probability zero since the corresponding set has Lebesgue measure zero. Moreover, in numerical computation such points are extremely unstable and thus unlikely to be encountered. A similar condition appears in balanced mixtures of Gaussians as well, e.g., Theorem 1 in .
Effect of Separation
4.1.2 GD algorithm
Now, let us analyze the behavior of GD near one-cluster regions. After one GD iteration, the mixing coefficients become before projection (where since ). Assume a small step size such that . After projection based on (2.2), the updated mixing coefficients are:
Combining with (2.9), whenever , both EM and GD increase . In this sense, EM and GD agree. However, in GD changes little since the change is proportional to . This argument holds for any mixture model with two components. For GMMs, the update of is:
Hence, is on the line segment between and , and converges to at a linear rate . If , it is possible to have . For example, take , , , and , we have . In such cases, could either increase or decrease. In the worst case, stays small until converges to . Then, from (4.4), given . Assuming changes little in this process, we can conclude that with probability one GD escapes one-cluster regions. However, the growth of is linear compared to EM. Therefore, we have the following theorem:
Theorem 4.2. For mixtures of two Gaussians and arbitrary initialization , there exists a small enough such that when the mixture is initialized at , GD increases as a linear function of the number of time steps.
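The contrast between the additive GD update and the multiplicative EM update of the mixing coefficients can be written out for two components. The derivation below is a sketch in our own notation: \ell is the cross entropy loss, \alpha the step size, and Z_c = E_{x~p*}[P(x|c)/p(x)] the partition function from Section 2.1. It assumes Euclidean projection onto the simplex and a step small enough that both coordinates stay nonnegative.

```latex
\begin{align*}
\frac{\partial \ell}{\partial \pi_c}
  &= -\,\mathbb{E}_{x \sim p^*}\!\left[\frac{P(x \mid c)}{p(x)}\right] = -Z_c,
  && \text{so the unprojected step is } \pi_c + \alpha Z_c, \\
\pi_1^{\text{new}}
  &= \pi_1 + \alpha Z_1 - \tfrac{\alpha}{2}\,(Z_1 + Z_2)
   = \pi_1 + \tfrac{\alpha}{2}\,(Z_1 - Z_2), \\
\pi_2^{\text{new}}
  &= \pi_2 - \tfrac{\alpha}{2}\,(Z_1 - Z_2).
\end{align*}
```

The subtracted term is the common shift that restores the sum-to-one constraint. With Z_2 close to 1 near a one-cluster region, GD changes \pi_1 by roughly \alpha (Z_1 - 1)/2 per step, an additive change, whereas EM multiplies \pi_1 by Z_1.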
Notice that (4.8) is true for both GMMs and BMMs. The only difference is the computation of . From this equation, we know that if any one-cluster region traps EM, then and . On this configuration, for the GD algorithm, we know that decreases if it is not zero, and stays the same. Hence, this region traps GD as well. So, the following theorem holds:
Theorem 4.3. For two-component mixture models, if any one-cluster region is stable for EM, it is stable for GD.
4.2 Mixture of two Bernoullis
Now, let us see if Bernoulli mixtures have similar results. With the EM algorithm, can be easily obtained. From the definition of , we have:
Consider the unnormalized binary distribution with . Then the normalization factor (partition function) is . With this fact, one can derive that
From the intuition of Gaussian mixtures, we similarly define as the separation between the two true cluster means. As usual, describes the separation between and . With the notations above, we obtain:
where we denote and
The derivation of (4.14) is presented in Appendix A.1. Notice that each is an affine transformation of , as defined by (4.13). Hence, the domain of is a Cartesian product of closed intervals. We assume by default that is in the domain. Studying the update of instead of is more convenient, as we will see in the following subsections that plays an important role in expressing the convergence results. Also, we denote the covariance between features and :
We consider the case when there are no independent pairs, i.e., for any .
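The no-independent-pairs assumption is easy to check for a given true distribution. The sketch below computes the feature covariance matrix of a Bernoulli mixture from its parameters, using E[x_i x_j] = sum_c pi*_c mu*_ci mu*_cj for i ≠ j, and lists any pairs with zero covariance. The names `true_pi` and `true_mu` and the example values are hypothetical; the third feature is deliberately given the same mean in both clusters so that it violates the assumption.

```python
import numpy as np

def mixture_feature_covariance(true_pi, true_mu):
    """Covariance of x under p*(x) = sum_c pi*_c prod_j Bern(x_j; mu*_cj)."""
    mean = true_pi @ true_mu
    second = (true_mu * true_pi[:, None]).T @ true_mu   # E[x_i x_j], i != j
    cov = second - np.outer(mean, mean)
    np.fill_diagonal(cov, mean * (1.0 - mean))          # x_j^2 = x_j for binary x_j
    return cov

def independent_pairs(cov, tol=1e-12):
    """Index pairs (i, j), i < j, whose covariance vanishes up to tol."""
    d = cov.shape[0]
    return [(i, j) for i in range(d) for j in range(i + 1, d)
            if abs(cov[i, j]) < tol]

true_pi = np.array([0.3, 0.7])
true_mu = np.array([[0.9, 0.2, 0.6],
                    [0.1, 0.7, 0.6]])
cov = mixture_feature_covariance(true_pi, true_mu)
pairs = independent_pairs(cov)
```

For two components the off-diagonal entries reduce to pi*_1 pi*_2 (mu*_1i - mu*_2i)(mu*_1j - mu*_2j), so a pair is uncorrelated exactly when one of its features has the same mean in both clusters.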
4.2.1 Attractive one-cluster regions for GD
When and , GD will get stuck in one-cluster regions, as shown in the proof of Theorem 4.3. If , converges to at a linear rate:
Proposition 4.1 (attractive one-cluster regions for GD).
Denote . The one-cluster regions with , and are attractive for GD, given small enough step size. Here, we denote as an affine function of , in the form of (4.13).
For example, when , from (4.12), we have . If , then , and GD will get stuck. If , , and then GD will escape one-cluster regions. In the latter case, or .
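The two-feature case can be verified by brute force. The sketch below is our own check, not a reproduction of (4.12): under the small-mixing-weight approximation it computes Z1 = E_{x~p*}[P(x | mu1) / P(x | mu2)] by enumerating the four binary vectors, and compares it with the expression 1 + sigma_12 (mu1 - mu2)_1 (mu1 - mu2)_2 / prod_j mu2_j (1 - mu2_j), which we derived assuming mu2 equals the true mean. All names and numbers are hypothetical.

```python
import itertools
import numpy as np

def bernoulli_pmf(x, mu):
    """Product-of-Bernoullis probability of the binary vector x."""
    return float(np.prod(np.where(x == 1, mu, 1.0 - mu)))

def z1_by_enumeration(mu1, mu2, true_pi, true_mu):
    """E_{x~p*}[P(x | mu1) / P(x | mu2)], summing over all binary vectors."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=len(mu1)):
        x = np.array(bits)
        p_true = sum(w * bernoulli_pmf(x, m) for w, m in zip(true_pi, true_mu))
        total += p_true * bernoulli_pmf(x, mu1) / bernoulli_pmf(x, mu2)
    return total

def z1_two_feature_formula(mu1, mu2, true_pi, true_mu):
    """Closed form for two features; valid only when mu2 is the true mean."""
    mean = true_pi @ true_mu
    sigma12 = true_pi @ (true_mu[:, 0] * true_mu[:, 1]) - mean[0] * mean[1]
    delta = mu1 - mu2
    return float(1.0 + sigma12 * delta[0] * delta[1] / np.prod(mu2 * (1.0 - mu2)))

true_pi = np.array([0.5, 0.5])
true_mu = np.array([[0.8, 0.8], [0.2, 0.2]])
mu2 = true_pi @ true_mu                     # cluster 2 sits at the true mean

mu1_against = np.array([0.7, 0.3])          # offsets disagree in sign with sigma12 > 0
mu1_along = np.array([0.7, 0.7])            # offsets agree in sign
z_against = z1_by_enumeration(mu1_against, mu2, true_pi, true_mu)
z_along = z1_by_enumeration(mu1_along, mu2, true_pi, true_mu)
```

Here `z_against` falls below one, so the mixing weight of cluster 1 shrinks, while `z_along` exceeds one and the weight grows. This is the sign pattern that the positive regions introduced next are designed to capture.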
4.2.2 Positive regions
Now, let us study the behavior of EM. We will show that EM escapes one-cluster regions when , for the two-feature case. More generally with features, we should consider . In the field of optimization, both and are known as proper cones and a relevant order can be defined. They are critical in our analysis, so we define these two cones as positive regions:
Definition 4.1 (positive regions).
A positive region of is defined to be one of . We denote it as .
The positive regions are interesting because ensures that each element has the same sign as and ensures that each has the opposite sign of .
Theorem 4.4. For any number of features, in the positive regions.
Proof. So, if . Denote , then , given . We can similarly prove given . ∎
We also notice that positive regions are stable. Define to be the EM update of based on the EM update of and (4.13). For any number of features, the following lemma holds:
Lemma 4.1 (stability of positive regions).
For any number of features, given small, the two positive regions and are stable for EM, i.e., for all and for all .
Corollary 4.1. For any number of features, if is initialized in a positive region, then EM will escape one-cluster regions exponentially fast.
With GD, we have results similar to Corollary 4.1, but GD escapes one-cluster regions much more slowly, which can be derived in a similar way as in Section 4.1. What if is not initialized in positive regions? The following theorem tells us that if , no matter where is initialized, EM will almost always escape the one-cluster regions.
Theorem 4.5. For , given and , with the EM algorithm, , and uniform random initialization for , will converge to the positive regions at a linear rate with probability . Therefore, EM will almost surely escape one-cluster regions.
For general and , we conjecture that EM almost always escapes -cluster regions where , as described in Appendix B.
4.2.3 EM as an ascent method
So far, we have studied positive regions . We showed some nice properties of and proved that converges to almost everywhere when . In this section, we show that EM can be treated as an ascent method for :
Theorem 4.6. For in the feasible region, we have , with equality iff .
With this theorem, we can take a small step in the EM update direction: , with small. In this way, we can always increase until and EM escapes the one-cluster regions.
4.3 EM vs GD
Now, we are ready to prove one of the main results in our paper: for mixtures of two Bernoullis, there exist one-cluster regions that can trap GD but not EM. We first prove a lemma similar to Lemma 4.1.
Lemma 4.2. For all and , and . Similarly, for all and , and .
Proof. WLOG, we prove the lemma for . Notice that for any , can be read directly from (4.14). For , since . Hence, . ∎
This lemma leads to a main result in our work: for mixtures of two Bernoullis, there exist one-cluster regions with nonzero measure that trap GD, but not EM.
Theorem 4.7. For mixtures of two Bernoullis, given , and , there exist one-cluster regions that trap GD, but EM escapes such one-cluster regions exponentially fast.
Proof. Denote as the unit vector in the standard basis. For any , at and , from Lemma 4.2 and Theorem 4.4, we have and . Notice also that and for all . Therefore, moving in the opposite direction of the gradient, it is possible to find a neighborhood of , such that for all , and . This region traps GD due to Proposition 4.1, and EM escapes such one-cluster regions exponentially fast due to Corollary 4.1. ∎
5 One-Cluster Local Minima
It is possible to show that some one-cluster regions are local minima, as shown in the following theorem:
Theorem 5.1. The attractive one-cluster regions for GD defined in Proposition 4.1 are local minima of the cross entropy loss .
Proof. Impose and treat as a function of . Consider any small perturbation . If , then the loss will not decrease as is a local minimum of . If , the change of is determined by the first order. Only is nonzero, so the change is dominated by the first order:
when is small enough. Therefore, we conclude that is a local minimum. ∎
We call such minima one-cluster local minima. The following theorem shows that they are not global minima.
Theorem 5.2. Assume and . For mixtures of two Bernoullis, one-cluster local minima cannot be global. The gap between the one-cluster local minima and the global minimum could be as large as .
Proof. The global minimum is obtained when . Denote the optimal value as . We have: