# Singularity, Misspecification, and the Convergence Rate of EM

A line of recent work has characterized the behavior of the EM algorithm in favorable settings in which the population likelihood is locally strongly concave around its maximizing argument. Examples include suitably separated Gaussian mixture models and mixtures of linear regressions. We consider instead over-fitted settings in which the likelihood need not be strongly concave, or, equivalently, in which the Fisher information matrix might be singular. In such settings, it is known that a global maximum of the likelihood based on n samples (the MLE) can have a non-standard n^-1/4 rate of convergence. How does the EM algorithm behave in such settings? Focusing on the simple setting of a two-component mixture fit to a multivariate Gaussian distribution, we study the behavior of the EM algorithm both when the mixture weights are different (unbalanced case) and when they are equal (balanced case). Our analysis reveals a sharp distinction between these cases: in the former, the EM algorithm converges geometrically to a point at Euclidean distance O((d/n)^1/2) from the true parameter, whereas in the latter, the convergence rate is exponentially slower, and the fixed point has a much lower O((d/n)^1/4) accuracy. The slower convergence in the balanced over-fitted case arises from the singularity of the Fisher information matrix. Analysis of this singular case requires the introduction of some novel techniques: in particular, we make use of a careful form of localization in the associated empirical process, and develop a recursive argument to progressively sharpen the statistical rate.


## 1 Introduction

The growth in the size and scope of modern data sets has presented the field of statistics with a number of challenges, among them the question of how to deal with various forms of heterogeneity. Mixture models provide a principled approach to modeling heterogeneous collections of data. In practice, it is frequently the case that the number of mixture components in the fitted model does not match the number of components in the data-generating mechanism. It is known that such mismatch can lead to substantially slower convergence rates for the maximum likelihood estimate (MLE) of the underlying parameters. In contrast, relatively little attention has been paid to the computational implications of this mismatch. In particular, the algorithm of choice for fitting finite mixture models is the EM algorithm, a general framework that encompasses various divide-and-conquer computational strategies. We seek a fundamental understanding of how EM behaves for over-fitted mixture models.

While density estimation in finite mixture models is relatively well understood [26, 9], characterizing the behavior of the MLE for the parameters has remained challenging. The main difficulty in analyzing the MLE in such settings can be attributed to the associated geometry of the parameters and the inevitable label switching between the mixture components [23, 25]. Such issues do not interfere with density estimation, since standard divergence measures like the Kullback-Leibler and Hellinger distances remain invariant under permutations of labels. An important contribution to the understanding of parameter estimation in finite mixture models was made by Chen [4]. He considered a class of over-fitted finite mixture models; here the term “over-fitted” means that the model to be fit has more mixture components than the distribution generating the data. In an interesting contrast to the usual n^-1/2 convergence rate of the MLE based on n samples, Chen showed that for estimating scalar location parameters in a certain class of over-fitted finite mixture models, the corresponding rate slows down to n^-1/4. Such a result is of practical interest, as methods that over-fit the number of components are often more feasible than methods that first estimate the number of components and subsequently estimate the parameters conditionally [24]. In subsequent work, Nguyen [22] and Heinrich et al. [11] characterized the (minimax) convergence rates of parameter estimation for mixture models in both exact-fitted and over-fitted settings using the Wasserstein distance.

While the works above address the statistical behavior of a global maximum of the log-likelihood, they do not consider the associated computational issues of obtaining such a maximum. Non-convexity of the log-likelihood function makes it impossible to guarantee that the iterative algorithms used in practice attain a global maximum. Perhaps the most widely used such algorithm is the expectation-maximization (EM) algorithm [8]. Early work on the EM algorithm [31] showed that its iterates converge to a local maximum of the log-likelihood function for a broad class of incomplete data models; this general class includes the fitting of mixture models as a special case. The EM algorithm has also been studied in the specific setting of Gaussian mixture models; here we find results both for the population EM algorithm, which is the idealized version of EM based on an infinite sample size, as well as for the sample-based EM algorithm that is used in practice. For Gaussian mixture models, the population EM algorithm is known to exhibit various convergence rates, ranging from geometric to sub-geometric depending on the overlap between the mixture components [33, 20]. Recently, Balakrishnan et al. [1] provided a general theoretical framework for analyzing the convergence of EM updates to a neighborhood of the MLE, and for providing non-asymptotic bounds on the Euclidean error of sample-based EM iterates. These results can be applied to the special case of well-specified two-component Gaussian location mixtures: under suitable separation conditions, their theory guarantees that (1) the population EM updates converge geometrically to the true parameters when initialized in a sufficiently small neighborhood of the truth, and (2) the sample-based EM updates achieve the standard minimax rate of order (d/n)^1/2, based on n i.i.d. samples from a finite mixture model in d dimensions.

Further work in this vein has characterized the behavior of EM in more general settings, including global convergence of population EM [32], guarantees of geometric convergence under less restrictive conditions on the mixture components [15, 7], analysis of EM with unknown mixture weights and covariances [3], convergence analysis with additional sparsity constraints [30, 35, 10], and extensions to more than two Gaussian components [34, 10]. Other works provide optimization-theoretic guarantees for EM by viewing it within a generalized surrogate function framework [17], and analyze the statistical properties of confidence intervals based on an EM estimator [6]. An assumption common to all of these papers is that there is no misspecification in the fitting of the Gaussian mixtures; in particular, it is assumed that the data are generated from a mixture model with the same number of components as the fitted model. As noted above, over-fitted mixture models are common in practice, so it is desirable to understand how the EM algorithm behaves in over-fitted settings.

The current paper aims to shed some light on the performance of the EM algorithm for over-fitted mixtures. We do so by providing a comprehensive study of over-fitted mixture models when fit to the simplest possible (non-mixture) data-generating mechanism: a multivariate normal distribution N(0, σ²I_d) in d dimensions with known scale parameter σ. This setting, despite its simplicity, suffices to uncover some rather interesting properties of EM in the over-fitted context. In particular, considering the behavior of the EM algorithm when it is used to fit two different types of models to data of this type, we obtain the following results.

• Two-mixture unbalanced fit: For our first model class, we study a mixture of two location-Gaussian distributions with unknown location, known variance, and unequal (fixed) weights for the two components. We establish that in this case the population EM updates converge at a geometric rate to the true parameter; as an immediate consequence, the sample-based EM algorithm converges in O(log(n/d)) steps to a ball of radius of order √(d/n). The fast convergence rate of EM in the unbalanced setting provides an antidote to the pessimistic belief that statistical estimators generically exhibit slow convergence for over-fitted mixtures.

• Two-mixture balanced fit: In the balanced version of the problem, in which the mixture weights are assumed to be equal for both components, we find that the EM algorithm behaves very differently. Beginning with the population version of the EM algorithm, we show that it converges to the true parameter from an arbitrary initialization. However, the rate of convergence varies as a function of the distance of the current iterate from the true parameter value, becoming exponentially slower as the iterates approach the true parameter. This behavior is in sharp contrast to well-specified settings [1, 7, 34], where the population updates are globally contractive. We also show that our rates for population EM are tight. By combining the slow convergence of population EM with a novel localization-based analysis of the underlying empirical process, we show that the sample-based EM iterates converge to a ball of radius of order (d/n)^1/4 around the true parameter. The n^-1/4 scaling of the Euclidean error matches known guarantees for the global maximum of the MLE [4]. The localization argument in our analysis is of independent interest, because such techniques are not required in analyzing the EM algorithm in well-specified settings when the population updates are globally contractive. We note that localization methods are known to be essential in deriving sharp statistical rates for M-estimators (e.g., [26, 2, 16]); to the best of our knowledge, their use in analyzing the EM algorithm is novel.

The remainder of the paper is organized as follows. In Section 2, we begin with a few simulations of EM in different settings that motivate the problem set-up considered later. After a brief description of EM for the set-ups analyzed in the paper, we then provide a thorough analysis of the convergence rates of EM when over-fitting Gaussian data with two components. In Section 3, we provide a discussion of our results and a few directions for future work. Proofs of our main results are presented in Section 4, with a few technical lemmas deferred to the Appendix.

##### Experimental settings:

Throughout the paper, we provide simulation experiments to demonstrate our theoretical results. We summarize a few common aspects of those experiments here. For population-level computations, we computed expectations using numerical integration on a sufficiently fine grid. In the experiments for sample-based EM with n samples, we declared convergence when one of the following two criteria was met: (1) the change in the iterates was small enough, or (2) the number of iterations exceeded a preset cap. We ran multiple trials for each setting and computed the error (which we define on a case-by-case basis) upon convergence across these experiments. In the majority of the runs (for each case), criterion (1) triggered convergence. In our plots for sample EM, we report the mean error on the y-axis, with bars denoting the standard deviation across these experiments. Furthermore, whenever a slope is provided, it is the slope of the best linear fit on the log-log scale for the quantity on the y-axis against the quantity on the x-axis. For instance, in Figure 1(b), for each value of n on the x-axis, the y-axis value of a green solid dot represents the error averaged over the experiments, with bars accounting for the deviation across them; the green dotted line and the corresponding slope in the legend denote the best linear fit for these dots over the experiments corresponding to one particular setting of the weight π.
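As a concrete illustration of this protocol, the following Python sketch implements the two stopping criteria described above; the tolerance, iteration cap, and toy update are illustrative choices of ours, not the paper's exact experimental values.

```python
import numpy as np

def run_em(update, theta0, tol=1e-8, max_iter=100_000):
    """Iterate an EM-style update until (1) the change in the iterates is
    small enough, or (2) an iteration cap is reached."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        theta_new = update(theta)
        if np.linalg.norm(theta_new - theta) <= tol:   # criterion (1)
            return theta_new, t + 1, "converged"
        theta = theta_new
    return theta, max_iter, "iteration cap"            # criterion (2)

# toy stand-in update: geometric contraction toward zero
theta_hat, iters, status = run_em(lambda th: 0.5 * th, np.ones(2))
```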

## 2 Behavior of EM for over-fitted Gaussian mixtures

In this section, we explore the wide range of behavior demonstrated by EM for different settings of over-specified location Gaussian mixtures. (In all model fits in this paper, we assume that the scale parameter is known and set equal to the true parameter; for the case of an unknown scale parameter, see the discussion in Section 3.) We begin with several simulations in Section 2.1 that illustrate fast and slow convergence of EM for various settings, and serve as motivation for the theoretical results derived later in the paper. We provide basic background on EM in Section 2.2, and describe the problems to be tackled. We then present our main results and a discussion of their consequences, distinguishing between the cases of unbalanced and balanced fits in Sections 2.3 and 2.4, respectively. In Section 2.5, we provide a high-level description of the novel proof techniques that we introduce to obtain sharp guarantees for the balanced mixture case.

### 2.1 Motivating simulations: Fast to slow convergence of EM

In order to illustrate the diversity in the behavior of EM, we begin with some simulations of its behavior for different types of over-fitted location Gaussian mixtures. In Section 2.1.1, we consider the effect of separation between the mixture components, and observe that weak or no separation leads to slow rates. This phenomenon is fairly generic, as shown by the additional simulations that we present in Section 2.1.2.

#### 2.1.1 Effect of signal strength on EM

For Gaussian location mixtures, an important aspect of the problem is the signal strength, which is measured as the separation between the means of mixture components relative to the spread in the components. For the special case of the two component mixture model

$$\pi N(\theta^*, \sigma^2 I_d) + (1 - \pi) N(-\theta^*, \sigma^2 I_d), \qquad (1)$$

the signal strength is given by the ratio ‖θ*‖₂/σ. When this ratio is large, we refer to it as the strong signal case; otherwise, it corresponds to the weak signal case.

In Figure 1, we show simulations for data generated from the model (1) in dimension d = 1 with noise variance σ² = 1, for three different values of the weight π. In all cases, we fit a two-component location-Gaussian mixture with fixed weights (π, 1 − π) and a symmetry constraint on the location parameters—more precisely, we fit the model π N(θ, σ²) + (1 − π) N(−θ, σ²) using EM, and take the solution as an estimate of θ*. The two panels demonstrate the rates for EM for two distinct cases of the data-generating mechanism: (a) in the strong signal case, we choose ‖θ*‖₂ large enough that the data has two well-separated mixture components, and (b) in the limiting case of no signal, we set θ* = 0, so that the two mixture components in the data-generating distribution collapse into one, and we are simply fitting a standard normal distribution.

In the strong signal case, it is well known [1, 7] that EM solutions have estimation error achieving the classical (parametric) rate n^-1/2; the empirical results in Figure 1(a) are simply a confirmation of these theoretical predictions. More interesting is the case of no signal, where the simulation results shown in panel (b) of Figure 1 reveal a different story. Indeed, in this case the statistical rate of EM undergoes a phase transition: while it achieves the parametric rate n^-1/2 when π ≠ 1/2, it slows down to n^-1/4 when π = 1/2. We return to these cases in more detail in Section 2.2.
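This no-signal phase transition can be reproduced with a short simulation. The sketch below runs sample EM for the symmetric two-component fit on standard normal data (d = 1, σ = 1); the sample size, initialization, and iteration count are illustrative choices of ours, not the paper's experimental settings.

```python
import numpy as np

def sample_em(x, pi, theta0=1.0, iters=500):
    """Sample EM for the symmetric 1-D fit pi*N(theta,1) + (1-pi)*N(-theta,1)."""
    theta = theta0
    for _ in range(iters):
        a = pi * np.exp(-0.5 * (x - theta) ** 2)
        b = (1.0 - pi) * np.exp(-0.5 * (x + theta) ** 2)
        w = a / (a + b)                       # posterior weight of +theta component
        theta = np.mean((2.0 * w - 1.0) * x)  # M-step
    return theta

rng = np.random.default_rng(0)
x = rng.standard_normal(20_000)            # no-signal data: theta* = 0
err_balanced = abs(sample_em(x, pi=0.5))   # expected scale ~ n^(-1/4)
err_unbalanced = abs(sample_em(x, pi=0.3)) # expected scale ~ n^(-1/2)
```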

#### 2.1.2 Slow rate of EM with general over-fitted location mixtures

Based on simulations, we have found that the slow rate of order n^-1/4 observed in Figure 1(b) is not limited to fitting a mixture model of the special symmetric form π N(θ, σ²I_d) + (1 − π) N(−θ, σ²I_d); rather, it is a more generic phenomenon that arises in many over-fitted models. In Figure 2, we plot the statistical error of the EM updates when we over-fit more general location mixtures to data generated from the standard normal distribution N(0, I_d). For these cases, we use the second-order Wasserstein metric to quantify the convergence rates of the EM updates. In our setting, where the true distribution corresponds to a point mass at θ* = 0, the second-order Wasserstein metric takes the form (Σᵢ πᵢ ‖θᵢ‖₂²)^1/2, i.e., it corresponds to a weighted Euclidean distance of the estimated parameters from the truth. Here πᵢ and θᵢ, respectively, denote the mixture weight and the location parameter of the i-th component of the fitted mixture. Let us summarize our findings for three cases (the exact EM updates for these settings are standard and can be found in the book [21]):

1. Unconstrained location parameters (Figure 2(a)): We consider the model fit π N(θ₁, σ²I_d) + (1 − π) N(θ₂, σ²I_d), where the mixture weight π is fixed a priori. In contrast to the cases considered in Figure 1—in which we imposed the symmetry constraint θ₂ = −θ₁—here we allow the location parameters θ₁ and θ₂ to be estimated independently, without any coupling constraint.

2. Unknown weights (Figure 2(b)): In addition to estimating the location parameters independently (as in the previous case), we assume that the mixture weights are also unknown and estimate them using EM.

3. Three mixtures (also in Figure 2(b)): We over-fit standard Gaussian data with three components: more precisely, we fit a three-component location mixture with fixed weights, and estimate the unconstrained location parameters independently using the EM algorithm.

In all these cases, we observe that EM solutions have statistical error with the slow scaling n^-1/4. Thus, these slow rates are a generic phenomenon with over-fitted mixture models.

### 2.2 Problem set-up

The wide range of behavior of the EM algorithm when applied to over-fitted location Gaussian mixtures is quite intriguing, and warrants a deeper investigation, in particular to characterize this behavior in a rigorous manner. In order to do so, we focus on the simplest instance that demonstrates the dichotomy between the fast and slow rates. As before, we assume that the data are generated according to a Gaussian density in d dimensions with covariance σ²I_d:

$$f^*(x) = \phi(x; 0, \sigma^2 I_d), \qquad (2)$$

where ϕ(·; μ, Σ) denotes the density of a N(μ, Σ) distribution. We study the performance of the EM algorithm when we over-fit this data-generating mechanism with a two-component isotropic location-Gaussian mixture model whose density is given by

$$f^\pi_\theta(x) = \pi\, \phi(x; \theta, \sigma^2 I_d) + (1 - \pi)\, \phi(x; -\theta, \sigma^2 I_d), \qquad (3)$$

where we assume that the mixture weight π ∈ (0, 1) and the variance σ² are known and fixed a priori. Recall that we simulated this particular set-up, with our findings reported in Figure 1(b).

For the model fit (3), the expected (population) log-likelihood is given by

$$L_\pi(\theta) = \mathbb{E}_X\big[\log f^\pi_\theta(X)\big] = \mathbb{E}\Big[\log\big(\pi\, \phi(X; \theta, \sigma^2 I_d) + (1 - \pi)\, \phi(X; -\theta, \sigma^2 I_d)\big)\Big]. \qquad (4)$$

Note that for any given weight π, the true model belongs to the family {f^π_θ : θ ∈ ℝ^d} with θ* = 0, since f^π_0 = f*. Noting that

$$\arg\max_{\theta} L_\pi(\theta) = \arg\min_{\theta} \mathrm{KL}\big(f^* \,\big\|\, f^\pi_\theta\big),$$

where KL(f* ∥ f^π_θ) denotes the Kullback-Leibler divergence between the distributions corresponding to the densities f* and f^π_θ, we see that finding a maximizer of the population log-likelihood yields the true parameter θ* = 0. In practical situations, when one has access to only n i.i.d. samples X₁, …, Xₙ, the most popular choice for estimating θ* is the maximum likelihood estimate (MLE):

$$\widehat{\theta}_n \in \arg\max_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \log f^\pi_\theta(X_i). \qquad (5)$$

In order to simplify notation, we frequently drop the dependence on the weight π when it is clear from context, writing f_θ, L, and L_n for the fitted density, the population log-likelihood, and the sample log-likelihood, respectively. In general, there is no closed-form expression for the maximizer of the population log-likelihood L or for the estimate θ̂ₙ. EM attempts to circumvent this difficulty via a minorization-maximization scheme. While population EM is a surrogate method for computing the maximizer of the population log-likelihood L, sample EM attempts to estimate θ̂ₙ. Although only sample EM is a practical algorithm, several recent works have first analyzed the convergence of population EM, and then leveraged the corresponding findings to understand the behavior of sample EM. Our theoretical analysis takes a similar path, so to facilitate the discussion we now describe the expressions for these updates for the model fit (3).

#### 2.2.1 Population versus sample EM updates

Useful for conceptual understanding, the population EM algorithm is an idealized algorithm in which all expectations are computed under the true (population) distribution, effectively taking the sample size to be infinite. We begin by describing this population algorithm, before turning to the sample-based EM algorithm that is actually used in practice.

In particular, given any point θ ∈ ℝ^d, the EM algorithm proceeds in two steps: (1) compute a surrogate function Q(·; θ) such that Q(θ′; θ) ≤ L(θ′) for all θ′, with equality when θ′ = θ; and (2) compute the maximizer of Q(·; θ) over θ′. These two steps are referred to as the E-step and the M-step, respectively. In the case of two-location mixtures, it is useful to describe a hidden-variable representation of the mixture model. Consider a binary indicator variable Z ∈ {0, 1} with marginal distribution P(Z = 1) = π and P(Z = 0) = 1 − π, and define the conditional distributions

$$(X \mid Z = 0) \sim N(-\theta, \sigma^2 I_d), \quad \text{and} \quad (X \mid Z = 1) \sim N(\theta, \sigma^2 I_d).$$

These marginal and conditional distributions define a joint distribution over the pair (X, Z), and by construction, the induced marginal distribution over X is a Gaussian mixture of the form (3). For EM, we first compute the conditional probability of Z = 1 given X = x:

$$w_\theta(x) = w^\pi_\theta(x) := \frac{\pi \exp\left( -\frac{\|\theta - x\|_2^2}{2\sigma^2} \right)}{\pi \exp\left( -\frac{\|\theta - x\|_2^2}{2\sigma^2} \right) + (1 - \pi) \exp\left( -\frac{\|\theta + x\|_2^2}{2\sigma^2} \right)}. \qquad (6)$$

Then, given a vector θ ∈ ℝ^d, the E-step in the population EM algorithm computes the minorization function Q(·; θ); doing so is equivalent to computing the expectation

$$Q(\theta'; \theta) = -\frac{1}{2}\, \mathbb{E}\left[ w_\theta(X) \left\|X - \theta'\right\|_2^2 + (1 - w_\theta(X)) \left\|X + \theta'\right\|_2^2 \right], \qquad (7)$$

where the expectation is taken over the true density of the random variable X specified in model (2). Next, in the M-step, we maximize the function Q(·; θ) over θ′; we capture this step by defining the population EM operator M:

$$M(\theta) = \arg\max_{\theta' \in \mathbb{R}^d} Q(\theta'; \theta) = \mathbb{E}\left[ (2 w_\theta(X) - 1)\, X \right], \qquad (8)$$

where the second equality follows by computing the gradient ∇_{θ′} Q(θ′; θ) and setting it to zero. In summary, for the two-location mixtures considered in this paper, the population EM algorithm generates the sequence θ^{t+1} = M(θ^t), where the operator M is defined in equation (8).

We obtain the sample EM update by replacing the expectation in equations (7) and (8) with an empirical average over observed samples. In particular, given a set of i.i.d. samples X₁, …, Xₙ, the sample EM operator takes the form

$$M_n(\theta) := \frac{1}{n} \sum_{i=1}^{n} (2 w_\theta(X_i) - 1)\, X_i. \qquad (9)$$

Thus, the sample EM algorithm generates the sequence of iterates θ^{t+1} = M_n(θ^t).
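A minimal Python sketch of these updates may help fix ideas: the functions below implement the weight function (6) and the sample EM operator (9); the balanced weight, σ = 1, dimension, and sample size are illustrative choices of ours.

```python
import numpy as np

def w_theta(X, theta, pi=0.5, sigma=1.0):
    """Posterior weight from equation (6) that a point came from the
    component centered at +theta."""
    a = pi * np.exp(-np.sum((X - theta) ** 2, axis=1) / (2 * sigma ** 2))
    b = (1 - pi) * np.exp(-np.sum((X + theta) ** 2, axis=1) / (2 * sigma ** 2))
    return a / (a + b)

def M_n(X, theta, pi=0.5, sigma=1.0):
    """Sample EM operator from equation (9): one E-step plus M-step."""
    w = w_theta(X, theta, pi, sigma)
    return np.mean((2 * w - 1)[:, None] * X, axis=0)

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 2))   # data from the true model (2), sigma = 1
theta = np.full(2, 0.5)
for _ in range(50):                  # sample EM iterates theta <- M_n(theta)
    theta = M_n(X, theta)
```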

##### Balanced versus unbalanced mixtures:

As noted in Figure 1(b), there is a stark difference in the convergence rate of sample EM between the case π ≠ 1/2 and the case π = 1/2. In this paper, we theoretically study the behavior of the EM algorithm in two distinct settings:

• Unbalanced mixtures: in this case, the mixture weights in equation (3) are assumed to be unequal, i.e., π = (1 − ρ)/2 and 1 − π = (1 + ρ)/2 for some ρ ∈ (0, 1).

• Balanced mixtures: in this case, the mixture weights are assumed to be equal, that is, π = 1/2.

For future reference, we summarize the two model fits when the true distribution is N(0, σ²I_d):

$$\text{Unbalanced mixture-fit:} \quad \pi N(\theta, \sigma^2 I_d) + (1 - \pi) N(-\theta, \sigma^2 I_d), \quad (\pi \neq 0.5 \text{ fixed}), \qquad (10a)$$

$$\text{Balanced mixture-fit:} \quad \tfrac{1}{2} N(\theta, \sigma^2 I_d) + \tfrac{1}{2} N(-\theta, \sigma^2 I_d). \qquad (10b)$$

To be clear, in both of the models above, only the parameter θ is allowed to vary; the mixture weights and the scale parameter σ are fixed.

In the sequel, we study the convergence of EM in these two settings, both for the population EM algorithm, with updates θ^{t+1} = M(θ^t), and for the sample-based EM sequence θ^{t+1} = M_n(θ^t). We now turn to a few interesting observations about population EM in the two cases described above.

#### 2.2.2 An interesting empirical phenomenon with population EM

In order to motivate the analysis to follow, Figure 3 illustrates how the population EM algorithm behaves in the unbalanced case (panel (a)) versus the balanced case (panel (b)). Each plot shows the distance ‖θ^t − θ*‖₂ = ‖θ^t‖₂ of the EM iterate from the true parameter on the vertical axis, versus the iteration number t on the horizontal axis. With the vertical axis on a log scale, a geometric convergence rate shows up as a negatively sloped line (disregarding transient effects in the first few iterations).

For the unbalanced mixtures in panel (a), we see that EM converges geometrically quickly, although the rate of convergence (corresponding to the slope of the line) tends towards zero as the mixture weight π tends towards 1/2 from below. For π = 1/2, we obtain a balanced mixture, and, as shown in the plot in panel (b), the convergence rate is now sub-geometric. In fact, the behavior of the iterates is extremely well characterized by the recursion ‖θ^{t+1}‖₂ ≈ ‖θ^t‖₂ (1 − ‖θ^t‖₂²/σ²).
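The sub-geometric behavior in panel (b) can be mimicked with a stylized scalar recursion θ ← θ(1 − θ²/σ²), consistent with the contraction factor established later in equation (16); under this model the error decays at the rate 1/√(2t). The recursion below is a caricature for intuition, with σ = 1 and the starting point chosen by us, not the exact population EM operator.

```python
import math

# Stylized model of balanced population EM (sigma = 1): theta <- theta*(1 - theta^2).
theta = 0.5
traj = [theta]
for _ in range(100_000):
    theta = theta * (1.0 - theta ** 2)
    traj.append(theta)

# Inverting u = 1/theta^2 gives u_{t+1} ~ u_t + 2, so theta_t ~ 1/sqrt(2 t).
scaled = traj[-1] * math.sqrt(2.0 * len(traj))
```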

#### 2.2.3 Nature of the log-likelihood and the Fisher information matrix

We now briefly discuss the fundamental difference between the unbalanced and balanced fits that drives the very different behavior of EM in the two settings: the nature of the log-likelihood and the associated Fisher information matrix (FIM). For the mixture fit (3) with mixture weights (π, 1 − π), the Fisher information matrix can be expressed as

$$[I_\pi(\theta)]_{ii} = -4\pi(1 - \pi)\, \mathbb{E}\left[ \frac{Y_i^2}{\big( \pi e^{\theta^\top Y} + (1 - \pi) e^{-\theta^\top Y} \big)^2} \right] + 1, \quad \text{for } i \in [d],$$

$$[I_\pi(\theta)]_{ij} = -4\pi(1 - \pi)\, \mathbb{E}\left[ \frac{Y_i Y_j}{\big( \pi e^{\theta^\top Y} + (1 - \pi) e^{-\theta^\top Y} \big)^2} \right], \quad \text{for } i, j \in [d],\; i \neq j,$$

where the expectation is taken under the true model, Y ∼ N(0, I_d). Clearly, at θ* = 0, we have

$$I_\pi(\theta^*) = \beta_\pi I_d, \quad \text{where } \beta_\pi = -4\pi(1 - \pi) + 1. \qquad (11)$$

Note that β_π = (1 − 2π)² > 0 for any π ≠ 1/2, whereas for π = 1/2 we have β_{1/2} = 0. Consequently, for any unbalanced fit with π ≠ 1/2, the FIM is non-singular at θ* = 0, and for the balanced fit with π = 1/2, it is singular. Indeed, when we computed the population log-likelihood for different values of π, the observations were consistent with equation (11): the log-likelihood is strongly concave around θ* for the unbalanced fit, and only weakly concave for the balanced fit. The results are plotted in Figure 4(a). (Figure 4(b) shows the corresponding sample likelihoods based on n samples; we observe that while the sample likelihood may have more critical points, its curvature closely resembles that of the corresponding population log-likelihood.) When the mixture weights are unbalanced (π ≠ 1/2), the population log-likelihood has more curvature, and in fact is (numerically) well approximated by a quadratic function of ‖θ‖₂ near θ* = 0; for the balanced fit with π = 1/2, the likelihood is quite flat near the origin, and is (numerically) well approximated by a quartic function of ‖θ‖₂. It is folklore in optimization that strongly concave functions can be maximized exponentially faster than weakly concave ones. Combining this observation with the computation above provides good intuition for why population EM exhibits fundamentally different rates of convergence for the two model fits, as observed in Figure 3.
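The singularity at π = 1/2 can be checked numerically. The sketch below evaluates the one-dimensional Fisher information at θ* = 0 (with σ = 1) by numerically differentiating the log-density in θ and integrating the squared score against the true N(0, 1) density; for π ≠ 1/2 the result should match β_π = (1 − 2π)², and for π = 1/2 it should vanish. The grid and step size are illustrative choices of ours.

```python
import numpy as np

def fisher_at_zero(pi, h=1e-5):
    """1-D Fisher information I_pi(theta) at theta* = 0 (sigma = 1), via a
    central difference of log f and quadrature of the squared score
    against the true N(0, 1) density."""
    x = np.linspace(-8.0, 8.0, 4001)
    dx = x[1] - x[0]
    phi = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)   # true density

    def logf(theta):
        comp = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
        return np.log(pi * comp(theta) + (1.0 - pi) * comp(-theta))

    score = (logf(h) - logf(-h)) / (2.0 * h)   # d/dtheta log f at theta = 0
    return np.sum(score ** 2 * phi) * dx

I_unbalanced = fisher_at_zero(0.3)   # should match beta_pi = (1 - 2*0.3)**2 = 0.16
I_balanced = fisher_at_zero(0.5)     # singular case: vanishes
```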

##### From population EM to sample EM:

It is well established [27] that when the FIM is invertible in a neighborhood of the true parameter, the MLE has the parametric rate n^-1/2, and that singularity of the FIM may lead to a slower than n^-1/2 rate for the MLE. As a result, for the singular (balanced) fit, we may expect a slower than parametric rate. Indeed, in the sequel, we show that the slow convergence of population EM—caused by the singularity of the FIM—has a direct consequence for sample EM. Not only do the sample EM updates have a slower convergence rate in the balanced case, but this slower convergence also leads to a fixed point with lower statistical accuracy, as already illustrated in Figure 1(b). In particular, consider the problem of fitting a d-dimensional mixture based on n samples: in the unbalanced case, we prove that any fixed point of the sample EM updates lies at a Euclidean distance of order (d/n)^1/2 from the truth, whereas in the balanced case, we show that such fixed points lie at distance of order (d/n)^1/4 from the truth. Thus, the EM algorithm is worse in the balanced case, both in terms of optimization speed and in terms of ultimate accuracy. Although this slower statistical rate is in accord with existing results for the MLE in over-fitted mixture models [4], the theory to follow takes a different route to rigorously justify why such behavior is to be expected with EM.

### 2.3 Behavior of EM for unbalanced mixtures

We now proceed to characterize the behavior of both the population and sample-based EM algorithms, beginning with the setting of unbalanced mixtures (10a). In particular, we assume that the fitted model has known weights π and 1 − π, where π ∈ (0, 1/2). The following result characterizes the behavior of the population EM updates for this problem:

###### Theorem 1.

For the unbalanced mixture-model fit (10a) to the true model (2), the population EM operator is globally strictly contractive, meaning that

$$\|M(\theta)\|_2 \leq (1 - \rho^2/2)\, \|\theta\|_2 \quad \text{for all } \theta \in \mathbb{R}^d, \qquad (12a)$$

where ρ = 1 − 2π ∈ (0, 1) denotes the measure of unbalancedness. Moreover, there are universal constants c, c′ such that, given any δ ∈ (0, 1) and a sample size n ≥ c σ² d log(1/δ)/ρ⁴, the sample EM sequence θ^{t+1} = M_n(θ^t) satisfies the bound

$$\|\theta^t\|_2 \leq \|\theta^0\|_2 \left[ (1 - \rho^2/2)^t + \frac{c' \sigma^2}{\rho^2} \sqrt{\frac{d \log(1/\delta)}{n}} \right], \qquad (12b)$$

with probability at least 1 − δ.

See Section 4.1 for the proof of this theorem.

The bulk of the effort in proving Theorem 1 lies in establishing the population-level guarantee (12a). Once this bound has been established, we follow the analysis scheme laid out by Balakrishnan et al. [1] to establish the sample-level guarantee (12b). In particular, we establish a non-asymptotic uniform law of large numbers (Lemma 1, stated in the sequel) that allows us to translate from population to sample EM iterates. Roughly, this lemma guarantees that, under suitable technical conditions, for any radius r > 0, tolerance δ ∈ (0, 1), and sufficiently large n, we have

$$\mathbb{P}\left[ \sup_{\|\theta\|_2 \leq r} \|M_n(\theta) - M(\theta)\|_2 \leq c_2 \sigma^2 r \sqrt{\frac{d \log(1/\delta)}{n}} \right] \geq 1 - \delta. \qquad (13)$$

This bound, when combined with the contractive behavior of the population EM iterates, allows us to establish the stated bound (12b) on the sample EM iterates.

As a consequence of the bound (12b), we can guarantee that the sample-based EM updates converge to an estimate with Euclidean error of order √(d/n) after a relatively small number of steps. More precisely, suppose that we run the algorithm for T* iterations, where

$$T^\star := \frac{\log(\rho^2/\sigma^2) + \log\big(n/(d \log(1/\delta))\big)}{-\log(1 - \rho^2/2)};$$

then with probability at least 1 − δ, we are guaranteed that

$$\|\theta^{T^\star}\|_2 \leq \frac{c \sigma^2}{\rho^2}\, \|\theta^0\|_2 \sqrt{\frac{d \log(1/\delta)}{n}}. \qquad (14)$$

The above theoretical prediction (14) is verified by the simulation study in Figure 1(b) for the univariate (d = 1) setting of the unbalanced mixture fit. Let us now study the dependence of the sample EM error on the dimension when fitting unbalanced mixtures. In order to do so, we simulated several runs of sample EM for various settings of d and n. In Figure 5, we present the scaling of the Euclidean error of the final EM iterate with respect to d and n, averaged over these runs. Linear fits on the log-log scale in these simulations suggest a rate close to √(d/n), as claimed in Theorem 1.

We note that this result provides some encouragement for the use of over-fitted mixtures in practice. Even though the true model is over-fit with a two-component Gaussian mixture, as long as the mixture weights are unbalanced (i.e., π ≠ 1/2), the EM algorithm yields an estimate with the classical minimax-optimal parametric rate √(d/n) for estimating the true location parameter θ* = 0. This positive result serves as an antidote to the popular belief that the price for overestimating the number of components in the mixture is necessarily a slower rate of parameter estimation.

As revealed in Theorem 1, the extent of unbalancedness in the mixture weights plays a crucial role in the geometric rate of convergence of the population EM. Indeed, in order to obtain an $\epsilon$-accurate estimate of $\theta^*$, population EM takes $\mathcal{O}\big(\log(1/\epsilon)/\rho^2\big)$ steps, where $\rho$ denotes the unbalancedness of the mixtures. When the mixture weights are bounded away from $1/2$, the parameter $\rho$ is bounded away from zero, and EM converges in $\mathcal{O}(\log(1/\epsilon))$ steps to an $\epsilon$-level of accuracy. However, when the mixtures become more balanced, that is, as the weight approaches $1/2$, or equivalently as $\rho$ approaches zero, the number of steps required to achieve $\epsilon$-accuracy scales as $1/\rho^2$. Note that in the limit $\rho \to 0$, this bound degenerates to infinity for any finite $\epsilon$. In fact, the bound (12a) from Theorem 1 simply states that the population EM operator is non-expansive for balanced mixtures, and does not provide rates of convergence for this case. We now describe an exact characterization of the EM updates when fitting balanced mixtures (10b) to the true model (2).
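As a back-of-the-envelope check of the $1/\rho^2$ scaling, one can count the number of geometric steps implied by a per-step contraction factor of $1 - \rho^2/2$, the factor appearing in the expression for $T_\star$; the short sketch below (an illustration of this scaling, not a statement from the paper) shows that halving $\rho$ roughly quadruples the step count.

```python
import math

def steps_for_accuracy(rho, eps):
    # Number of steps T with (1 - rho^2/2)^T <= eps, assuming the geometric
    # contraction factor 1 - rho^2/2 from the expression for T_star.
    return math.log(1 / eps) / -math.log(1 - rho**2 / 2)

for rho in [0.5, 0.25, 0.125]:
    print(f"rho = {rho:5.3f}: about {steps_for_accuracy(rho, 1e-3):8.1f} steps")
# Halving rho roughly quadruples the step count, i.e. the 1/rho^2 scaling.
```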

### 2.4 Behavior of population EM for balanced mixtures

We begin our study of balanced mixtures by showing that the population EM operator is globally convergent, albeit with a contraction parameter that depends on $\|\theta\|_2$, and degrades towards $1$ as $\|\theta\|_2 \to 0$. Our statement involves a constant $p \in (0, 1)$ defined in terms of a standard normal variate.

###### Theorem 2.

For the balanced mixtures setting (10b) of the true model (2), the population EM operator has the following properties. For all non-zero $\theta$, the operator satisfies the upper bound

$$\frac{\|M(\theta)\|_2}{\|\theta\|_2} \le \gamma_{\mathrm{up}}(\theta) := 1 - p + \frac{p}{1 + \frac{\|\theta\|_2^2}{2\sigma^2}} < 1. \qquad (15a)$$

Moreover, for all non-zero $\theta$ such that $\|\theta\|_2^2 \le \frac{5\sigma^2}{8}$, it satisfies the lower bound

$$\frac{\|M(\theta)\|_2}{\|\theta\|_2} \ge \gamma_{\mathrm{lo}}(\theta) := \frac{1}{1 + \frac{2\|\theta\|_2^2}{\sigma^2}}. \qquad (15b)$$

See Section 4.2 for the proof of Theorem 2.

The salient feature of Theorem 2 is that the contraction coefficient $\gamma_{\mathrm{up}}(\theta)$ is not globally bounded away from $1$, and in fact satisfies $\gamma_{\mathrm{up}}(\theta) \to 1$ as $\|\theta\|_2 \to 0$. In conjunction with the lower bound (15b), we see that

$$\frac{\|M(\theta)\|_2}{\|\theta\|_2} \asymp \Big(1 - \frac{\|\theta\|_2^2}{\sigma^2}\Big) \quad \text{for small } \|\theta\|_2. \qquad (16)$$

This precise contraction behavior of the population EM operator is in accord with the simulation study in Figure 3(b).
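The scaling (16) can also be checked numerically. The sketch below assumes the standard univariate reduction of the balanced-fit population EM operator, $M(\theta) = \mathbb{E}[X \tanh(\theta X/\sigma^2)]$ with $X \sim \mathcal{N}(0, \sigma^2)$ (this explicit formula is not written out in this excerpt), evaluated by Gauss-Hermite quadrature.

```python
import numpy as np

def population_em(theta, sigma=1.0, n_nodes=64):
    """Gauss-Hermite approximation of M(theta) = E[X * tanh(theta * X / sigma^2)],
    with X ~ N(0, sigma^2) -- the (assumed) univariate balanced-fit operator."""
    z, w = np.polynomial.hermite_e.hermegauss(n_nodes)  # nodes/weights for E[f(Z)], Z ~ N(0,1)
    x = sigma * z
    return np.sum(w * x * np.tanh(theta * x / sigma**2)) / np.sqrt(2 * np.pi)

# The contraction factor approaches 1 - theta^2 / sigma^2 as theta -> 0, matching (16).
for theta in [0.4, 0.2, 0.1, 0.05]:
    print(f"theta={theta:5.2f}  M(theta)/theta={population_em(theta) / theta:.5f}  "
          f"1 - theta^2={1 - theta**2:.5f}")
```

A Taylor expansion of $\tanh$ shows the ratio behaves as $1 - \theta^2/\sigma^2 + \mathcal{O}(\theta^4)$, so the two printed columns agree for small $\theta$.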

##### Two phases of convergence:

The preceding results show that the population EM updates should exhibit two phases of behavior. In the first phase, up to a relatively coarse accuracy of the order $\sigma$, the iterates exhibit geometric convergence. Concretely, we are guaranteed to have $\|\theta^{T_0}\|_2 \precsim \sigma$ after running the algorithm for a constant number of steps $T_0$. In the second phase, as the error decreases from $\sigma$ to a given $\epsilon \in (0, \sigma)$, the convergence rate becomes sub-geometric: concretely, we have

$$\big\|\theta^{T_0 + t}\big\|_2 \le \epsilon \quad \text{for } t \ge c \frac{\sigma^2}{\epsilon^2} \log(\sigma/\epsilon). \qquad (17)$$

Note that the conclusion (16) shows that for small enough $\epsilon$, the population EM takes on the order of $\sigma^2/\epsilon^2$ steps to find an $\epsilon$-accurate estimate of $\theta^*$. This rate is extremely slow in comparison to the geometric rate established for the unbalanced mixtures in Theorem 1, and demonstrates a phase transition in the behavior of EM between the two settings.
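The $\sigma^2/\epsilon^2$ scaling can be observed concretely by iterating the population operator numerically. As in the earlier sketch, we assume the univariate reduction $M(\theta) = \mathbb{E}[X \tanh(\theta X/\sigma^2)]$, evaluated by quadrature; halving the target accuracy $\epsilon$ should then roughly quadruple the step count.

```python
import numpy as np

def population_em(theta, sigma=1.0, n_nodes=64):
    # Quadrature evaluation of the (assumed) operator M(theta) = E[X tanh(theta X / sigma^2)]
    z, w = np.polynomial.hermite_e.hermegauss(n_nodes)
    x = sigma * z
    return np.sum(w * x * np.tanh(theta * x / sigma**2)) / np.sqrt(2 * np.pi)

def steps_to_accuracy(eps, theta0=0.9):
    # Count iterations of theta <- M(theta) until the iterate drops below eps.
    theta, steps = theta0, 0
    while theta > eps:
        theta, steps = population_em(theta), steps + 1
    return steps

s1, s2 = steps_to_accuracy(0.1), steps_to_accuracy(0.05)
print(s1, s2, s2 / s1)  # ratio should be close to 4, consistent with t ~ sigma^2 / eps^2
```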

Moreover, the sub-geometric rate derived above is also in stark contrast with the favorable behavior of EM for the exact-fitted setting established in past work. Balakrishnan et al. [1] showed that when the EM algorithm is used to fit a two-component Gaussian mixture with sufficiently large separation between the means relative to the standard deviation (known as the signal-to-noise ratio, or SNR for short), the population EM operator is contractive in a neighborhood of the true parameter, which implies a geometric rate of convergence in this same neighborhood. In later work on this same model, Daskalakis et al. [7] showed that the convergence is in fact geometric for any non-zero SNR. The model considered in Theorem 2 can be seen as the degenerate limit of zero SNR, and we see that the behavior of EM breaks down, as geometric convergence no longer holds.

### 2.5 New techniques for analyzing sample EM

We now turn to the problem of converting the preceding guarantees on population EM to associated guarantees for the sample EM updates that are implemented in practice. A direct application of the framework developed by Balakrishnan et al. [1] leads to a sub-optimal guarantee for sample EM in the case of balanced mixtures (see Section 2.5.1). This sub-optimality motivates the development of new methods for analyzing the behavior of the sample EM iterates, based on localization over a sequence of epochs. We sketch out these methods in Sections 2.5.2 and 2.5.3, and in Section 2.5.4 we draw these results together and state a theorem that provides a sharp rate of convergence for sample EM applied to balanced mixtures.

#### 2.5.1 A sub-optimal guarantee

Let us recall the set-up for the procedure suggested by Balakrishnan et al. [1], specializing to the case where the true parameter is $\theta^* = 0$, as in our specific set-up. Using the triangle inequality, the norm of the sample EM iterates can be upper bounded by a sum of two terms as follows:

$$\big\|\theta^{t+1}\big\|_2 = \big\|M_n(\theta^t)\big\|_2 \le \big\|M_n(\theta^t) - M(\theta^t)\big\|_2 + \big\|M(\theta^t)\big\|_2, \quad \text{valid for all } t = 0, 1, 2, \ldots \qquad (18)$$

The first term on the right-hand side corresponds to the deviations between the sample and population EM operators, and can be controlled via empirical process theory. The second term corresponds to the behavior of the (deterministic) population EM operator, as applied to the sample EM iterate $\theta^t$, and needs to be controlled via a result on population EM.

Theorem 5(a) from Balakrishnan et al. [1] is based on imposing generic conditions on each of these two terms, and then using them to derive a generic bound on the sample EM iterates. In the current context, their theorem can be summarized as follows. For a given tolerance $\epsilon > 0$, failure probability $\delta \in (0, 1)$, and starting radius $r$, suppose that there exists a function $\varepsilon(n, \delta)$, decreasing in the sample size $n$, such that

$$\sup_{\|\theta\|_2 \ge \epsilon} \frac{\|M(\theta)\|_2}{\|\theta\|_2} \le \kappa \quad \text{and} \quad \mathbb{P}\Big[\sup_{\|\theta\|_2 \le r} \big\|M_n(\theta) - M(\theta)\big\|_2 \le \varepsilon(n, \delta)\Big] \ge 1 - \delta. \qquad (19a)$$

Then for a sample size $n$ sufficiently large and $\epsilon$ sufficiently small to ensure that

$$(1 - \kappa)\epsilon \overset{(i)}{\le} \varepsilon(n, \delta) \overset{(ii)}{\le} (1 - \kappa) r, \qquad (19b)$$

the sample EM iterates are guaranteed to converge to a ball of radius of the order $\varepsilon(n, \delta)/(1 - \kappa)$ around the true parameter $\theta^* = 0$.

In order to apply this theorem to the current setting, we need to specify a choice of $\varepsilon(n, \delta)$ for which the bound on the empirical process holds. The following auxiliary result provides such control for us:

###### Lemma 1.

There are universal positive constants $c_1$ and $c_2$ such that for any positive radius $r$, any threshold $\delta \in (0, 1)$, and any sufficiently large sample size $n$, we have

$$\mathbb{P}\bigg[\sup_{\|\theta\|_2 \le r} \big\|M_n(\theta) - M(\theta)\big\|_2 \le c_2 \sigma^2 r \sqrt{\frac{d \log(1/\delta)}{n}}\bigg] \ge 1 - \delta. \qquad (20)$$

The proof of this lemma is based on Rademacher complexity arguments; see Appendix A for the details. In the proof, we establish the result with explicit values of the constants $c_1$ and $c_2$, but make no effort to obtain “good” constants.

With the choice $r = \mathcal{O}(1)$, Lemma 1 guarantees that the second inequality in line (19a) holds with $\varepsilon(n, \delta) \asymp \sqrt{d/n}$. On the other hand, Theorem 2 implies that for any $\theta$ such that $\|\theta\|_2 \ge \epsilon$, the population EM operator is contractive with parameter bounded above by $\kappa \asymp 1 - \epsilon^2/\sigma^2$. In order to satisfy inequality (i) in equation (19b), we solve the equation $(1 - \kappa)\epsilon \asymp \varepsilon(n, \delta)$. Tracking only the dependency on $d$ and $n$, we obtain (moreover, with this choice of $\epsilon$, inequality (ii) in equation (19b) is satisfied with a constant radius $r$, as long as $n$ is sufficiently large relative to $d$)

$$\frac{\sqrt{d/n}}{\epsilon^2} = \epsilon \quad \Longrightarrow \quad \epsilon = \mathcal{O}\big((d/n)^{1/6}\big), \qquad (21)$$

which shows that the Euclidean norm of the sample EM iterate is bounded by a term of order $(d/n)^{1/6}$.

While this $(d/n)^{1/6}$ rate is much slower than the classical $(d/n)^{1/2}$ rate that we established in the unbalanced case, it does not coincide with the $(d/n)^{1/4}$ rate that we observed in Figure 1(b) for the balanced setting with $d = 1$. Thus, the proof technique based on the framework of Balakrishnan et al. [1] appears to be sub-optimal. The sub-optimality of this approach necessitates the development of a more refined technique. Before sketching this technique, we would like to quantify empirically the convergence rate of sample EM in terms of both the dimension $d$ and the sample size $n$ in the balanced setting. In Figure 6, we summarize the results of these experiments. The two panels in the figure suggest a statistical rate of order $(d/n)^{1/4}$ for sample EM, and thereby provide further numerical evidence that the preceding discussion indeed yielded a sub-optimal rate.
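A small simulation in the same spirit can be run in a few lines. The sketch below assumes the univariate balanced-fit sample EM update $\theta \leftarrow \frac{1}{n}\sum_i x_i \tanh(\theta x_i/\sigma^2)$ (the standard EM update for a symmetric two-component fit, not written out in this excerpt), runs it on data drawn from $\mathcal{N}(0, 1)$, and regresses the final error against $n$ on the log-log scale.

```python
import numpy as np

def sample_em(n, rng, sigma=1.0, theta0=1.0):
    # Assumed balanced-fit sample EM update: theta <- mean(x * tanh(theta * x / sigma^2)).
    x = rng.normal(0.0, sigma, size=n)       # data drawn from the true model N(0, sigma^2)
    theta = theta0
    for _ in range(int(5 * np.sqrt(n))):     # the sub-geometric phase needs ~sqrt(n) steps
        theta = np.mean(x * np.tanh(theta * x / sigma**2))
    return abs(theta)

rng = np.random.default_rng(0)
ns = [100, 1000, 10000]
errs = [np.mean([sample_em(n, rng) for _ in range(10)]) for n in ns]
slope, _ = np.polyfit(np.log(ns), np.log(errs), 1)
print(errs, slope)  # slope should be close to -1/4 (cf. Figure 6), not -1/2
```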

#### 2.5.2 Localization over epochs

Let us try to understand why the preceding argument led to a sub-optimal bound. In brief, its “one-shot” nature suffers from two major deficiencies. First, the tolerance parameter $\epsilon$ is used both for measuring the contractivity of the updates, as in the first inequality in equation (19a), and for determining the final accuracy that we achieve. At earlier phases of the iteration, the algorithm will converge more quickly than the worst-case analysis based on the final accuracy suggests. A second deficiency is that the argument uses the radius $r$ only once, setting it to a constant to reflect the initialization at the start of the algorithm. This means that we failed to “localize” our bound on the empirical process in Lemma 1. At later iterations of the algorithm, the norm $\|\theta^t\|_2$ will be smaller, meaning that the empirical process can be more tightly controlled. We note that similar ideas of localizing an empirical process play a crucial role in obtaining sharp bounds on the error of $M$-estimation procedures [26, 2, 16].

In order to exploit this multi-phase behavior of the EM algorithm, we introduce a novel localization argument, in which the sequence of iterations is broken up into a sequence of different epochs, and we track the decay of the error carefully through each epoch. The epochs are set up in the following way:

• We index epochs by the integer $\ell = 0, 1, 2, \ldots$, and associate them with a sequence of scalars $\{\alpha_\ell\}_{\ell \ge 0}$ in the interval $[0, \tfrac{1}{4}]$. The input to epoch $\ell$ is the scalar $\alpha_\ell$, and the output from epoch $\ell$ is the scalar $\alpha_{\ell+1}$.

• For all iterations of the sample EM algorithm contained strictly within epoch $\ell$, the Euclidean norm of the sample EM iterate lies within the interval $\big[(d/n)^{\alpha_{\ell+1}}, (d/n)^{\alpha_\ell}\big]$.

• Upon completion of epoch $\ell$ at some iteration $T_\ell$, the EM algorithm returns an estimate $\theta^{T_\ell}$ such that $\|\theta^{T_\ell}\|_2 \precsim (d/n)^{\alpha_{\ell+1}}$, where

$$\alpha_{\ell+1} = \frac{1}{3}\alpha_\ell + \frac{1}{6}. \qquad (22)$$

Note that the new scalar $\alpha_{\ell+1}$ serves as the input to epoch $\ell + 1$.

The recursion (22) is crucial in our analysis: it tracks the evolution of the exponent $\alpha_\ell$ acting upon the ratio $d/n$, and the rate $(d/n)^{\alpha_{\ell+1}}$ is the bound on the Euclidean norm of the sample EM iterates achieved after epoch $\ell$.

A few properties of the recursion are worth noting. First, given our initialization $\alpha_0 = 0$, we see that $\alpha_1 = \tfrac{1}{6}$, which agrees with the outcome of our one-step analysis from above. Second, as the recursion is iterated, it converges from below to the fixed point $\alpha^* = \tfrac{1}{4}$. Thus, our argument will allow us to prove a bound arbitrarily close to $(d/n)^{1/4}$, as stated formally in Theorem 3 to follow.
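For intuition, the recursion (22) can be iterated directly from $\alpha_0 = 0$; the exponents, and the corresponding error bounds $(d/n)^{\alpha_\ell}$ (shown here for an illustrative ratio $d/n = 10^{-4}$), approach the $(d/n)^{1/4}$ limit geometrically.

```python
# Iterate the epoch recursion alpha_{l+1} = alpha_l / 3 + 1/6 from alpha_0 = 0.
d_over_n = 1e-4  # illustrative value of the ratio d/n
alpha = 0.0
for ell in range(8):
    alpha = alpha / 3 + 1 / 6
    print(f"epoch {ell + 1}: alpha = {alpha:.6f}, bound (d/n)^alpha = {d_over_n ** alpha:.5f}")
# alpha converges to the fixed point 1/4, i.e. the bound approaches (d/n)^(1/4)
```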

#### 2.5.3 How does the key recursion (22) arise?

Let us now sketch out how the key recursion (22) arises. Consider epoch $\ell$ specified by input $\alpha_\ell$, and consider an iterate $\theta^t$ such that $\|\theta^t\|_2 \le (d/n)^{\alpha_\ell}$. We begin by proving that this initial condition ensures that $\|\theta^{t'}\|_2$ remains below the level $(d/n)^{\alpha_\ell}$ for all future iterations $t' \ge t$ (Lemma 4 in the sequel). Given this guarantee, our second step is to apply Theorem 2 for the population EM operator, for all iterates $\theta^t$ such that $\|\theta^t\|_2 \ge (d/n)^{\alpha_{\ell+1}}$. Consequently, for these iterations, we have

$$\big\|M(\theta^t)\big\|_2 \le \bigg(1 - p + \frac{p}{1 + \frac{\|\theta^t\|_2^2}{2\sigma^2}}\bigg)\big\|\theta^t\big\|_2 \precsim \Big(1 - (d/n)^{2\alpha_{\ell+1}}\Big)\Big(\frac{d}{n}\Big)^{\alpha_\ell} \le \tilde{\gamma}\Big(\frac{d}{n}\Big)^{\alpha_\ell}, \qquad (23a)$$

where $\tilde{\gamma} := e^{-(d/n)^{2\alpha_{\ell+1}}}$. On the other hand, applying Lemma 1 for this epoch, we obtain that

$$\big\|M_n(\theta^t) - M(\theta^t)\big\|_2 \precsim \Big(\frac{d}{n}\Big)^{\alpha_\ell}\sqrt{\frac{d}{n}} = \Big(\frac{d}{n}\Big)^{\alpha_\ell + 1/2}, \qquad (23b)$$

for all iterations $t$ in the epoch. Unfolding the basic triangle inequality (18) for $T$ steps, we find that

$$\big\|\theta^{t+T}\big\|_2 \le \big\|M_n(\theta^t) - M(\theta^t)\big\|_2\big(1 + \tilde{\gamma} + \cdots + \tilde{\gamma}^{T-1}\big) + \tilde{\gamma}^T\big\|\theta^t\big\|_2 \le \frac{1}{1 - \tilde{\gamma}}\big\|M_n(\theta^t) - M(\theta^t)\big\|_2 + e^{-T(d/n)^{2\alpha_{\ell+1}}}\Big(\frac{d}{n}\Big)^{\alpha_\ell}.$$

The second term decays exponentially in $T$, and our analysis shows that it is dominated by the first term in the relevant regime of the analysis. Examining the first term, we find that $\theta^{t+T}$ has Euclidean norm of the order

$$\big\|\theta^{t+T}\big\|_2 \precsim \frac{1}{1 - \tilde{\gamma}}\big\|M_n(\theta^t) - M(\theta^t)\big\|_2 \approx \Big(\frac{d}{n}\Big)^{-2\alpha_{\ell+1}}\Big(\frac{d}{n}\Big)^{\alpha_\ell + 1/2} =: r. \qquad (24)$$

The epoch is said to be complete once $\|\theta^{t+T}\|_2 \le (d/n)^{\alpha_{\ell+1}}$. Disregarding constants, this condition is satisfied when $r = (d/n)^{\alpha_{\ell+1}}$, or equivalently when

$$\Big(\frac{d}{n}\Big)^{-2\alpha_{\ell+1}}\Big(\frac{d}{n}\Big)^{\alpha_\ell + 1/2} = \Big(\frac{d}{n}\Big)^{\alpha_{\ell+1}}.$$

Viewing this equation as a function of the pair $(\alpha_\ell, \alpha_{\ell+1})$ and solving for $\alpha_{\ell+1}$ in terms of $\alpha_\ell$ yields the recursion (22). Refer to Figures 7 and 8 for a visual illustration of the above localization argument.
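The exponent bookkeeping can be checked mechanically: equating the exponents of $d/n$ on the two sides gives $-2\alpha_{\ell+1} + \alpha_\ell + \tfrac{1}{2} = \alpha_{\ell+1}$, whose solution is exactly the recursion (22). A minimal sketch with exact rational arithmetic:

```python
from fractions import Fraction

def next_exponent(alpha):
    # Solve -2*a_next + alpha + 1/2 = a_next for a_next, which gives
    # a_next = alpha/3 + 1/6, i.e. the recursion (22).
    return (alpha + Fraction(1, 2)) / 3

print(next_exponent(Fraction(0)))       # 1/6: the one-step exponent from alpha_0 = 0
print(next_exponent(Fraction(1, 4)))    # 1/4: the fixed point of the recursion
```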

Of course, the preceding discussion is informal, and there remain many details to be addressed in order to obtain a formal proof. Nonetheless, it contains the essence of the argument used to establish a sharp rate of convergence for the sample EM iterates, which we now state.

#### 2.5.4 Upper and lower bounds on sample EM

We turn to statements of upper and lower bounds on the rate of the sample EM iterates for balanced Gaussian-location mixtures. We begin with a sharp upper bound, proved using the epoch-based localization technique that was just sketched:

###### Theorem 3.

There are universal positive constants such that for any scalars $\epsilon \in (0, \tfrac{1}{4})$ and $\delta \in (0, 1)$, any sufficiently large sample size $n$, and any sufficiently large iterate number $t$, the sample-based EM updates satisfy the bound

$$\big\|\theta^t\big\|_2 \le \bigg[\big\|\theta^0\big\|_2 \cdot \prod_{j=0}^{t-1}\gamma_{\mathrm{up}}(\theta^j)\bigg] + \sqrt{2}\,\sigma\bigg(\frac{\sigma^2 d}{n}\log\frac{\log(4/\epsilon)}{\delta}\bigg)^{\frac{1}{4} - \epsilon}, \qquad (25)$$

with probability at least $1 - \delta$.

See Section 4.3 for the detailed proof of this theorem, where we also provide explicit expressions for the bounds on the sample size and the number of time steps, with precise values of the constants that appear in the statement.

As we show in our proofs, once the iteration number $t$ satisfies the lower bound stated in the theorem, the second term on the right-hand side of the bound (25) dominates the first term; therefore, from this point onwards, the sample EM iterates have Euclidean norm of the order $(d/n)^{1/4 - \epsilon}$. Note that $\epsilon$ can be chosen arbitrarily close to zero, so at the expense of increasing the lower bound on the number of iterations $t$, we can obtain rates arbitrarily close to $(d/n)^{1/4}$.

We note that earlier studies of parameter estimation for over-fitted mixtures, in both the frequentist [4] and Bayesian settings [13, 22], have derived a rate of order $n^{-1/4}$ for the global maximum of the likelihood. To the best of our knowledge, Theorem 3 is the first algorithmic result showing that such rates apply to the fixed points and dynamics of the EM algorithm, which need not converge to global optima.

The preceding discussion was devoted to an upper bound on EM. Let us now match this upper bound, at least in the univariate case $d = 1$, by showing that any non-zero fixed point of the sample EM updates has Euclidean norm of the order