# On the Behavior of the Expectation-Maximization Algorithm for Mixture Models

Finite mixture models are among the most popular statistical models used across data science disciplines. Despite their broad applicability, inference under these models typically leads to computationally challenging non-convex problems. While the Expectation-Maximization (EM) algorithm is the most popular approach for solving these non-convex problems, its behavior is not well understood. In this work, we focus on the case of mixtures of Laplacian (or Gaussian) distributions. We start by analyzing a simple, equally weighted mixture of two one-dimensional Laplacian distributions and show that every local optimum of the population maximum likelihood estimation problem is globally optimal. Then, we prove that the EM algorithm converges to the ground truth parameters almost surely with random initialization. Our result extends the existing results for the Gaussian distribution to the Laplacian distribution. We then numerically study the behavior of mixture models with more than two components. Motivated by our extensive numerical experiments, we propose a novel stochastic method for estimating the means of the components of a mixture model. Our numerical experiments show that our algorithm outperforms the naïve EM algorithm in almost all scenarios.


## I Introduction

The ability of finite mixture distributions [pearson1894contributions] to model the presence of subpopulations within an overall population has made them popular across almost all engineering and scientific disciplines [melnykov2010finite, zhang2015finite, titterington1985statistical, mclachlan2004finite]. While statistical identifiability for various mixture models has been widely studied [teicher1963identifiability, allman2009identifiability], the Gaussian mixture model (GMM) has drawn the most attention due to its wide applicability [day1969estimating, wolfe1970pattern]. Starting with Dasgupta [dasgupta1999learning], there have been multiple efforts to find algorithms with polynomial sample/time complexity for estimating GMM parameters [vempala2004spectral, arora2005learning, chaudhuri2008learning, dasgupta2007probabilistic, moitra2010settling, hsu2013learning, belkin2010polynomial]. Despite their statistical guarantees, these methods are not computationally efficient enough for many large-scale problems. Moreover, these results assume that the data is generated from an exact generative model, which never happens in practice. In contrast, methods based on solving the maximum likelihood estimation (MLE) problem are very popular due to their computational efficiency and the robustness of MLE against perturbations of the generative model [donoho1988automatic]. Although MLE-based methods are popular in practice, the theory behind their optimization algorithms (such as the EM method) is not well understood. Most existing algorithms with theoretical performance guarantees do not scale to modern applications of massive size, mainly due to the combinatorial and non-convex nature of the underlying optimization problems.

Recent advances in the field of non-convex optimization have led to a better understanding of mixture model inference algorithms such as the EM algorithm. For example, [balakrishnan2017statistical] proves that, under proper initialization, the EM algorithm converges exponentially fast to the ground truth parameters; however, no computationally efficient initialization approach is provided. [xu2016global] globally analyzes the EM algorithm applied to a mixture of two equally weighted Gaussian distributions. While [daskalakis2016ten] provides global convergence guarantees for the EM algorithm, [jin2016local] studies the landscape of the GMM likelihood function with three or more components and shows that there may be spurious local optima even in the simple case of an equally weighted GMM.

In this work, we revisit the EM algorithm under the Laplacian mixture model and the Gaussian mixture model. We first show that, similar to the Gaussian case, the maximum likelihood estimation objective has no spurious local optima in the symmetric Laplacian mixture model (LMM) with $K = 2$ components. This Laplacian mixture structure has a wide range of applications in medical image denoising, video retrieval, and blind source separation [bhowmick2006laplace, klein2014fisher, mitianoudis2005overcomplete, amin2007application, rabbani2009wavelet]. For the case of mixture models with $K \ge 3$ components, we propose a stochastic algorithm which utilizes the likelihood function as well as moment information of the mixture distribution. Our numerical experiments show that our algorithm outperforms the naïve EM algorithm in almost all scenarios.

## II Problem Formulation

The general mixture model distribution is defined as

$$P(x;\mathbf{w},K,\boldsymbol{\theta}) = \sum_{k=1}^{K} w_k f(x;\theta_k),$$

where $K$ is the number of mixture components; $w_k \ge 0$ is the mixing weight of the $k$-th component, with $\sum_{k=1}^{K} w_k = 1$; and $\boldsymbol{\theta} = (\theta_1,\ldots,\theta_K)$ is the distribution's parameter vector. Estimating the parameters of mixture models is central in many applications. This estimation is typically done by solving the MLE problem due to its intuitive justification and its robust behavior [donoho1988automatic].
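As a concrete illustration, the mixture density above can be evaluated numerically. The following minimal sketch (function names are ours, not the paper's) uses unit-variance Gaussian components:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma=1.0):
    # Component density f(x; mu) for a unit-variance Gaussian (sigma = 1).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, weights, means, component_pdf):
    # P(x) = sum_k w_k f(x; theta_k) for a one-dimensional mixture.
    return sum(w * component_pdf(x, m) for w, m in zip(weights, means))

# Equally weighted two-component GMM with means -1 and +1.
p = mixture_pdf(0.0, [0.5, 0.5], [-1.0, 1.0], gaussian_pdf)
```

By symmetry, the density at the midpoint equals the value of a single component evaluated one unit from its mean.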

The focus of our work is on population likelihood maximization, i.e., the regime in which the number of samples is very large. When the parameters $\mathbf{w}$ and $K$ are known, the law of large numbers shows that the MLE problem leads to the following population risk optimization problem [xu2016global, daskalakis2016ten, jin2016local]:

$$\theta^{*} = \operatorname*{argmax}_{\theta}\ \mathbb{E}\!\left[\log\left(\sum_{k=1}^{K} w_k f(x;\theta_k)\right)\right] \tag{1}$$

In this paper, we focus on the case of equally weighted mixture components, i.e., $w_k = 1/K$ for all $k$ [xu2016global, daskalakis2016ten, jin2016local, srebro2007there]. We also restrict ourselves to the two widely used families of Gaussian mixture models and Laplacian mixture models [bhowmick2006laplace, klein2014fisher, mitianoudis2005overcomplete, amin2007application, rabbani2009wavelet, vempala2004spectral, arora2005learning, chaudhuri2008learning, dasgupta2007probabilistic]. It is worth mentioning that even in these restricted scenarios, the above MLE problem is non-convex and highly challenging to solve.

## III EM for the case of K=2

Recently, it has been shown that the EM algorithm recovers the ground truth distribution for the equally weighted Gaussian mixture model with $K = 2$ components [xu2016global, daskalakis2016ten]. Here we extend this result to one-dimensional Laplacian mixture models.

Define the Laplacian distribution with the probability density function

$$L(x;\mu,b) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right),$$

where $\mu$ and $b$ control the mean and variance of the distribution. Thus, the equally weighted Laplacian mixture model with two components has probability density function

$$P(x;\mu_1,\mu_2,b) = \tfrac{1}{2}L(x;\mu_1,b) + \tfrac{1}{2}L(x;\mu_2,b).$$

In population level estimation, the overall mean of the data, i.e., $(\mu_1+\mu_2)/2$, can be estimated accurately. Hence, without loss of generality, we only need to estimate the normalized difference of the two means, i.e., $\mu^{*} = (\mu_1-\mu_2)/2$. Under this generic assumption, our observations are drawn from the distribution

$$P(x;\mu^{*},b) = \tfrac{1}{2}L(x;\mu^{*},b) + \tfrac{1}{2}L(x;-\mu^{*},b).$$
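This symmetric two-component LMM is easy to simulate, which is useful for sanity-checking the analysis that follows. A minimal sketch (the sample size, seed, and parameter values are our arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_pdf(x, mu, b=1.0):
    # L(x; mu, b) = exp(-|x - mu| / b) / (2b)
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

def symmetric_lmm_pdf(x, mu_star, b=1.0):
    # P(x; mu*, b) = 0.5 L(x; mu*, b) + 0.5 L(x; -mu*, b)
    return 0.5 * laplace_pdf(x, mu_star, b) + 0.5 * laplace_pdf(x, -mu_star, b)

def sample_symmetric_lmm(n, mu_star, b=1.0):
    # Pick a component uniformly at random, then draw a Laplace sample around its mean.
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs * mu_star + rng.laplace(loc=0.0, scale=b, size=n)

x = sample_symmetric_lmm(200_000, mu_star=1.5)
```

The overall mean of the samples is zero by symmetry, and the variance is $2b^2 + (\mu^*)^2$ (here $2 + 1.5^2 = 4.25$), consistent with the claim that only the normalized mean difference remains to be estimated.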

Our goal is to estimate the parameter $\mu^{*}$ from observations at the population level. Without loss of generality, and for simplicity of presentation, we set $b = 1$ and define the shorthands $L(x;\mu) \triangleq L(x;\mu,1)$ and $p_{\mu}(x) \triangleq P(x;\mu,1)$. Thus, the $t$-th step of the EM algorithm for estimating the ground truth parameter $\mu^{*}$ is:

$$\lambda_{t+1} = \frac{\mathbb{E}_{x\sim p_{\mu^{*}}}\!\left[x\,\frac{0.5\,L(x;\lambda_t)}{p_{\lambda_t}(x)}\right]}{\mathbb{E}_{x\sim p_{\mu^{*}}}\!\left[\frac{0.5\,L(x;\lambda_t)}{p_{\lambda_t}(x)}\right]}, \tag{2}$$

where $\lambda_t$ is the estimate of $\mu^{*}$ at the $t$-th iteration; see [daskalakis2016ten, xu2016global, jin2016local] for the similar Gaussian case. In the rest of the paper, without loss of generality, we assume that $\mu^{*} > 0$. Further, to simplify our analysis, we define the mapping

$$M(\lambda,\mu) \triangleq \frac{\mathbb{E}_{x\sim p_{\mu}}\!\left[x\,\frac{0.5\,L(x;\lambda)}{p_{\lambda}(x)}\right]}{\mathbb{E}_{x\sim p_{\mu}}\!\left[\frac{0.5\,L(x;\lambda)}{p_{\lambda}(x)}\right]}.$$

It is easy to verify that $M(\mu,\mu) = \mu$, $M(-\mu,\mu) = -\mu$, and $M(0,\mu) = 0$. In other words, $\pm\mu$ and $0$ are the fixed points of the EM algorithm. Using symmetry, we can simplify $M$ as

$$M(\lambda,\mu) = \mathbb{E}_{x\sim L(x;\mu)}\!\left[x\,\frac{L(x;\lambda)-L(x;-\lambda)}{L(x;\lambda)+L(x;-\lambda)}\right]. \tag{3}$$
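The mapping $M$ can be approximated by Monte Carlo, which also illustrates the fixed-point behavior claimed above. A sketch under our own sampling choices (sample size, seed, and starting point are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_pdf(x, mu):
    # L(x; mu) with b = 1.
    return 0.5 * np.exp(-np.abs(x - mu))

def em_map(lam, mu, n=100_000):
    """Monte Carlo estimate of M(lambda, mu): one population EM step (2),
    with samples drawn from the symmetric two-component LMM with mean mu."""
    signs = rng.choice([-1.0, 1.0], size=n)
    x = signs * mu + rng.laplace(size=n)
    p_lam = 0.5 * laplace_pdf(x, lam) + 0.5 * laplace_pdf(x, -lam)
    w = 0.5 * laplace_pdf(x, lam) / p_lam   # posterior weight of the +lambda component
    return np.sum(w * x) / np.sum(w)

# Iterating the map from a positive start drives lambda toward mu* = 1.
mu_star = 1.0
lam = 0.3
for _ in range(40):
    lam = em_map(lam, mu_star)
```

Up to Monte Carlo noise, `em_map(mu, mu)` returns `mu` and `em_map(0.0, mu)` returns `0`, matching the fixed points listed above, while the iteration converges toward $\mu^*$.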

Let us first establish a few lemmas on the behavior of the mapping $M$.

###### Lemma

The derivative of the mapping $M(\lambda,\mu)$ with respect to $\lambda$ is positive, i.e., $\frac{\partial M}{\partial \lambda} > 0$.

###### Proof

First notice that $\frac{\partial M}{\partial \lambda}$ is equal to

$$\mathbb{E}_{x\sim L(x;\mu)}\!\left[\frac{2x\left(\operatorname{sign}(x-\lambda)+\operatorname{sign}(x+\lambda)\right)e^{-|x-\lambda|-|x+\lambda|}}{\left(e^{-|x-\lambda|}+e^{-|x+\lambda|}\right)^{2}}\right].$$

We prove the lemma for the following two cases separately:

Case 1) $\lambda \ge \mu$:

$$\frac{\partial M}{\partial \lambda} = \frac{2(\lambda+1)e^{-\lambda}\left(e^{-\mu}+e^{\mu}\right)}{\left(e^{-\lambda}+e^{\lambda}\right)^{2}} = \frac{(\lambda+1)e^{-\lambda}\cosh(\mu)}{\cosh^{2}(\lambda)} > 0.$$

Case 2) $0 \le \lambda < \mu$:

$$\frac{\partial M}{\partial \lambda} = \frac{2e^{-\mu}\left[-\int_{-\infty}^{-\lambda}xe^{x}\,dx + \int_{\lambda}^{\mu}xe^{x}\,dx + e^{2\mu}\int_{\mu}^{\infty}xe^{-x}\,dx\right]}{\left(e^{\lambda}+e^{-\lambda}\right)^{2}} = \frac{2}{\left(e^{\lambda}+e^{-\lambda}\right)^{2}}\left[e^{-\mu}\left((\lambda+1)e^{-\lambda}-(\lambda-1)e^{\lambda}\right)+2\mu\right] > 0,$$

where the last inequality holds since $(\lambda-1)e^{\lambda-\mu} < \mu$ for $0 \le \lambda < \mu$.

###### Lemma

For $0 < \lambda < \eta$, we have

$$\frac{\partial}{\partial \eta}M(\lambda,\eta) = 1 - 2\,\frac{\lambda e^{-\eta}+e^{-\lambda}}{e^{\lambda}+e^{-\lambda}} \;>\; 1 - 2\,\frac{\lambda e^{-\lambda}+e^{-\lambda}}{e^{\lambda}+e^{-\lambda}} \;>\; 0.$$

###### Proof

When $0 < \lambda < \eta$, it is not hard to compute $M(\lambda,\eta)$ in closed form. Hence,

$$\frac{\partial M}{\partial \eta} = -e^{-\eta}\left\{\frac{\lambda+1}{e^{\lambda}+e^{-\lambda}}+\frac{\lambda-1}{e^{\lambda}+e^{-\lambda}}\right\}+\tanh(\lambda) = 1 - 2\,\frac{\lambda e^{-\eta}+e^{-\lambda}}{e^{\lambda}+e^{-\lambda}} > 1 - 2\,\frac{\lambda e^{-\lambda}+e^{-\lambda}}{e^{\lambda}+e^{-\lambda}} > 0,$$

where the last two inequalities are due to the facts that $e^{-\eta} < e^{-\lambda}$ and $2\lambda+1 < e^{2\lambda}$ for $\lambda > 0$.

###### Theorem

Without loss of generality, assume that $\lambda_0, \mu^{*} > 0$. Then the EM iterate defined in (2) is a contraction, i.e., $|\lambda_{t+1}-\mu^{*}| \le \kappa\,|\lambda_t-\mu^{*}|$, where $\kappa = \max\{\kappa_1,\kappa_2\} < 1$, $\kappa_1 = \frac{(\mu^{*}+1)e^{-\mu^{*}}}{\cosh(\mu^{*})}$, and $\kappa_2 = 2\,\frac{\lambda_0 e^{-\lambda_0}+e^{-\lambda_0}}{e^{\lambda_0}+e^{-\lambda_0}}$.

Theorem III shows that the EM iterates converge to the ground truth parameter $\mu^{*}$, which is the global optimum of the MLE problem.

###### Proof

First of all, according to the Mean Value Theorem, there exists $\xi$ between $\lambda_t$ and $\mu^{*}$ such that:

$$\frac{\lambda_{t+1}-\mu^{*}}{\lambda_t-\mu^{*}} = \frac{M(\lambda_t,\mu^{*})-M(\mu^{*},\mu^{*})}{\lambda_t-\mu^{*}} = \frac{\partial}{\partial \lambda}M(\lambda,\mu^{*})\Big|_{\lambda=\xi} > 0,$$

where the inequality is due to Lemma III. Thus, $\lambda_t - \mu^{*}$ does not change sign during the algorithm. Consider two different regions: $\lambda_t > \mu^{*}$ and $\lambda_t < \mu^{*}$. When $\lambda_t > \mu^{*}$, Case 1 in Lemma III implies that

$$\frac{\partial M}{\partial \lambda}\Big|_{\lambda=\xi} = \frac{(\xi+1)e^{-\xi}\cosh(\mu^{*})}{\cosh^{2}(\xi)} \le \frac{(\mu^{*}+1)e^{-\mu^{*}}}{\cosh(\mu^{*})} = \kappa_1 < 1.$$

The last two inequalities are due to the fact that $\xi \ge \mu^{*}$, and the fact that $(t+1)e^{-t}/\cosh^{2}(t)$ is a positive and decreasing function in $t$ with $(\mu^{*}+1)e^{-\mu^{*}} < \cosh(\mu^{*})$. On the other hand, when $\lambda_t < \mu^{*}$, the Mean Value Theorem implies that there exists $\eta$ between $\lambda_t$ and $\mu^{*}$ such that

$$\frac{\lambda_{t+1}-\mu^{*}}{\lambda_t-\mu^{*}} = \frac{\lambda_{t+1}-\lambda_t}{\lambda_t-\mu^{*}}+1 = \frac{M(\lambda_t,\mu^{*})-M(\lambda_t,\lambda_t)}{\lambda_t-\mu^{*}}+1 = 1 - \frac{\partial}{\partial \mu^{*}}M(\lambda_t,\mu^{*})\Big|_{\mu^{*}=\eta} \le 2\,\frac{\lambda_t e^{-\lambda_t}+e^{-\lambda_t}}{e^{\lambda_t}+e^{-\lambda_t}} \le 2\,\frac{\lambda_0 e^{-\lambda_0}+e^{-\lambda_0}}{e^{\lambda_0}+e^{-\lambda_0}} = \kappa_2 < 1,$$

where the last two inequalities are due to Lemma III and the facts that 1) $\lambda_t - \mu^{*}$ does not change sign, so that $\lambda_0 \le \lambda_t < \mu^{*}$, and 2) the function $2(te^{-t}+e^{-t})/(e^{t}+e^{-t})$ is positive and decreasing in $t$ with $\kappa_2 < 1$. Hence, $|\lambda_{t+1}-\mu^{*}| \le \max\{\kappa_1,\kappa_2\}\,|\lambda_t-\mu^{*}|$. Combining the above two cases completes the proof.

## IV Modified EM for the case of K≥3

In [srebro2007there], it was conjectured that every local optimum of the population level MLE problem for any equally weighted GMM is globally optimal. Recently, [jin2016local] rejected this conjecture by providing a counterexample with $K = 3$ components. Moreover, they showed that local optima can be arbitrarily far from the ground truth parameters, and that the EM algorithm with random initialization converges to these spurious local optima with positive probability. Motivated by [jin2016local], we numerically study the performance of the EM algorithm on both GMMs and LMMs.

Numerical Experiment 1: Figure 1 presents the convergence plots of the EM algorithm with four different initializations. Two of these initializations converge to the global optimum, while the other two fail to recover the ground truth parameters and are trapped in spurious local optima. To understand the performance of the EM algorithm with random initialization, we ran the EM algorithm for different numbers of components $K$ and dimensions $d$. First, we generate the $d$-dimensional mean vectors $\mu^{*}_k$, $k = 1,\ldots,K$. These vectors are the mean values of the different Gaussian components. For each Gaussian component, the variance is set to a fixed known value, so the mean vectors completely characterize the distribution of the GMM. Then, samples are randomly drawn from the generated GMM and the EM algorithm is run with 1000 different initializations, each for a fixed number of iterations. The table in Figure 1 shows the percentage of times that the EM algorithm converges to the ground truth parameters (the global optimal point) for different values of $K$ and $d$. As can be seen in this table, EM fails dramatically, especially for larger values of $K$.
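For reference, the naïve EM baseline used in this experiment can be sketched as follows for an equally weighted, unit-variance spherical GMM in which only the means are updated (the data sizes, seed, and separation below are our illustrative choices, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(2)

def em_gmm_equal_weights(x, mu0, n_iter=100, var=1.0):
    """Naive EM for an equally weighted spherical GMM: only the means are updated."""
    mu = mu0.copy()                                           # (K, d) current mean estimates
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample.
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (n, K) squared distances
        logp = -0.5 * d2 / var
        logp -= logp.max(axis=1, keepdims=True)               # stabilize the softmax
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted sample means.
        mu = (r[:, :, None] * x[:, None, :]).sum(0) / r.sum(0)[:, None]
    return mu

# Illustrative ground truth: K=2 well-separated components in d=2.
true_mu = np.array([[4.0, 0.0], [-4.0, 0.0]])
labels = rng.integers(0, 2, size=5000)
x = true_mu[labels] + rng.normal(size=(5000, 2))
mu_hat = em_gmm_equal_weights(x, mu0=rng.normal(size=(2, 2)))
```

With well-separated components and $K = 2$, a random initialization typically recovers both means; the failures discussed above appear as $K$ grows.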

By examining the spurious local optima in the previous numerical experiment, we noticed that many of them fail to satisfy the first moment condition. More specifically, we know that any global optimum $\widehat{\mu} = (\widehat{\mu}_1,\ldots,\widehat{\mu}_K)$ of the MLE problem (1) recovers the ground truth parameters up to permutation [teicher1963identifiability, allman2009identifiability]. Hence, any global optimum has to satisfy the first moment condition $\sum_{k=1}^{K}\widehat{\mu}_k = \sum_{k=1}^{K}\mu^{*}_k$. Without loss of generality, and by shifting all data points, we can assume that $\sum_{k=1}^{K}\mu^{*}_k = 0$. Thus, $\widehat{\mu}$ must satisfy the condition

$$\sum_{k=1}^{K}\widehat{\mu}_k = 0. \tag{4}$$

However, according to our numerical experiments, many spurious local optima fail to satisfy (4). To enforce condition (4), one can regularize the MLE cost function with the first-order moment condition and solve

$$\max_{\mu}\ \mathbb{E}_{\mu^{*}}\!\left[\log\left(\sum_{k=1}^{K}\frac{1}{K}f(x;\mu_k)\right)\right] - \frac{M}{2}\left\|\sum_{k=1}^{K}\mu_k\right\|_{2}^{2}, \tag{5}$$

where $M \ge 0$ is the regularization coefficient. To solve (5), we propose the following iterative algorithm:

$$\mu_{k}^{t+1} = \frac{\mathbb{E}_{\mu^{*}}\!\left[x\,w_{k}^{t}(x)\right] + MK\mu_{k}^{t} - M\sum_{j=1}^{K}\mu_{j}^{t}}{MK + \mathbb{E}_{\mu^{*}}\!\left[w_{k}^{t}(x)\right]}, \quad \forall k, \tag{6}$$

where $w_{k}^{t}(x) = \frac{f(x;\mu_{k}^{t})}{\sum_{j=1}^{K}f(x;\mu_{j}^{t})}$ is the posterior weight of the $k$-th component. This algorithm is based on the successive upper-bound minimization framework [razaviyayn2014parallel, razaviyayn2013unified, hong2016unified]. Notice that if we set $M = 0$ in (6), we recover the naïve EM algorithm. The following theorem establishes the convergence of the iterates in (6).

###### Theorem

Any limit point of the iterates generated by (6) is a stationary point of (5).

Proof Sketch: Let $g(\mu)$ denote the objective function of (5). Using the Cauchy-Schwarz and Jensen inequalities, one can construct a surrogate function $\widehat{g}(\mu,\mu^{t})$ that globally lower-bounds $g(\mu)$ and is tight at $\mu = \mu^{t}$. Thus, the assumptions of [razaviyayn2013unified, Proposition 1] are satisfied. Furthermore, notice that the iterate (6) is obtained from the update rule $\mu^{t+1} = \operatorname*{argmax}_{\mu}\widehat{g}(\mu,\mu^{t})$. Therefore, [razaviyayn2013unified, Theorem 1] implies that every limit point of the algorithm is a stationary point.
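Update (6) can be sketched with the expectations replaced by empirical means. The one-dimensional GMM below, including the data generation and initialization, is our illustrative choice, not the paper's exact experiment:

```python
import numpy as np

rng = np.random.default_rng(3)

def gauss_pdf(x, mu):
    # Unit-variance Gaussian component density f(x; mu).
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def regularized_em_step(x, mu, M):
    """One sample-average step of update (6) for a 1-D equally weighted GMM."""
    K = len(mu)
    f = gauss_pdf(x[:, None], mu[None, :])     # (n, K) component densities
    w = f / f.sum(axis=1, keepdims=True)       # posterior weights w_k(x)
    num = (w * x[:, None]).mean(axis=0) + M * K * mu - M * mu.sum()
    den = M * K + w.mean(axis=0)
    return num / den

# Illustrative data: K=3 components whose means sum to zero, as in condition (4).
true_mu = np.array([-3.0, 0.0, 3.0])
x = true_mu[rng.integers(0, 3, size=30_000)] + rng.normal(size=30_000)
mu = np.array([-2.0, 0.3, 2.0])                # illustrative initialization
for _ in range(200):
    mu = regularized_em_step(x, mu, M=0.1)
```

Setting `M=0` in `regularized_em_step` reduces it to the plain EM mean update, consistent with the remark above.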

Numerical Experiment 2: To evaluate the algorithm defined in (6), we repeat Numerical Experiment 1 with the proposed iterative method (6). Figure 2 shows the performance of the proposed method. As can be seen from the table, the regularized method still fails to recover the ground truth parameters in many scenarios. More specifically, although the regularization term enforces (4), it changes the likelihood landscape and hence introduces new spurious local optima.

In Numerical Experiment 2, we observed that many of the spurious local optima are tied to a fixed value of $M$. In other words, after getting stuck at a spurious local optimum, changing the value of $M$ helps us escape from that local optimum. Notice that the global optimal parameter $\mu^{*}$ is a solution of (5) for any value of $M$. Motivated by this observation, we consider the following objective function:

$$\max_{\mu}\ \mathbb{E}_{\lambda\sim\Lambda}\!\left[\mathbb{E}_{\mu^{*}}\!\left[\log\left(\sum_{k=1}^{K}\frac{1}{K}f(x;\mu_k)\right)\right] - \frac{\lambda}{2}\left\|\sum_{k=1}^{K}\mu_k\right\|_{2}^{2}\right], \tag{7}$$

where $\Lambda$ is some continuous distribution defined over $\mathbb{R}_{+}$. The idea behind this objective is that each sampled value of $\lambda$ leads to a different set of spurious local optima. However, if a point is a fixed point of the EM algorithm for every value of $\lambda$, it must be a stationary point of the MLE objective and it should also satisfy the first moment condition (4). Based on this objective function, we propose Algorithm 1 for estimating the ground truth parameters.
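The resulting stochastic scheme applies update (6) with a freshly drawn coefficient $\lambda \sim \Lambda$ at every iteration. In the sketch below, the choice $\Lambda = \mathrm{Uniform}(0, 0.5)$ and all other settings are our assumptions, not the paper's exact Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(4)

def gauss_pdf(x, mu):
    # Unit-variance Gaussian component density f(x; mu).
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def stochastic_regularized_em(x, mu0, n_iter=300):
    """Each iteration applies update (6) with a freshly sampled lambda ~ Lambda."""
    mu = mu0.copy()
    K = len(mu)
    for _ in range(n_iter):
        lam = rng.uniform(0.0, 0.5)            # assumed choice of Lambda
        f = gauss_pdf(x[:, None], mu[None, :])
        w = f / f.sum(axis=1, keepdims=True)   # posterior weights w_k(x)
        num = (w * x[:, None]).mean(axis=0) + lam * K * mu - lam * mu.sum()
        den = lam * K + w.mean(axis=0)
        mu = num / den
    return mu

# Illustrative data: K=3 components whose means sum to zero.
true_mu = np.array([-3.0, 0.0, 3.0])
x = true_mu[rng.integers(0, 3, size=30_000)] + rng.normal(size=30_000)
mu_hat = stochastic_regularized_em(x, mu0=np.array([-2.0, 0.3, 2.0]))
```

Because a point that is a fixed point for every sampled $\lambda$ must satisfy both stationarity of the MLE objective and condition (4), resampling $\lambda$ acts as the escape mechanism described above.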

Numerical Experiment 3: To evaluate the performance of Algorithm 1, we repeat the data generating procedure of Numerical Experiment 1 and run Algorithm 1 on the generated data. Figure 3 shows the performance of this algorithm. As can be seen from this figure, the proposed method significantly improves the percentage of times that a random initialization converges to the ground truth parameters, reaching the global optimum in a substantially larger fraction of trials than the naïve EM algorithm (comparing Fig. 1 and Fig. 3).

Remark: While the results in this section are presented only for the GMM, we have observed similar behavior for the LMM. Those results are omitted due to lack of space.

## V Conclusion

In this paper, we first studied the convergence behavior of the EM algorithm for the equally weighted Laplacian mixture model with two components. We showed that the EM algorithm with random initialization converges to the ground truth distribution with probability one. Moreover, we revisited the landscape of equally weighted mixture models with more than two components. Based on our numerical experiments, we proposed a modified EM approach which significantly improves the probability of recovering the ground truth parameters.