1 Introduction
Gaussian mixture models [21] have been used widely to model heterogeneous data in many applications arising from physical and biological sciences. In several scenarios, the data has a large number of subpopulations, and the mixture components in the data may not be wellseparated. In such settings, estimating the true number of components may be difficult, so that one may end up fitting a mixture model with a number of components larger than that actually present in the data. Such mixture fits, referred to as overspecified mixture distributions, are commonly used by practitioners in order to deal with uncertainty in the number of components in the data [23, 12]. However, a deficiency of such models is that they are singular, meaning that their Fisher information matrices are degenerate. Given the popularity of overspecified models in practice, it is important to understand how methods for parameter estimation, including maximum likelihood and the EM algorithm, behave when applied to such models.
1.1 Background and past work
In the context of singular mixture models, an important distinction is between those that are strongly versus weakly identifiable. Chen [5] studied the class of strongly identifiable models in which, while the Fisher information matrix may be degenerate at a point, it is not degenerate over a larger set. Studying overspecified Gaussian mixtures with known scale parameters, he showed that the accuracy of the MLE for the unknown location parameter is of the order , which should be contrasted with the classical rate achieved in regular settings. A line of followup work has extended this type of analysis to other types of strongly identifiable mixture models; see the papers [15, 23, 20, 13] as well as the references therein for more details.
A more challenging class of mixture models are those that are only weakly identifiable, meaning that the Fisher information is degenerate over some larger set. This stronger form of singularity arises, for instance, when the scale parameter in an overspecified Gaussian mixture is also unknown [4, 6, 17]. Ho et al. [14] characterized the behavior of MLE for a class of weakly identifiable models. They showed that the convergence rates of MLE in these models can be very slow, with the precise rates determined by algebraic relations among the partial derivatives. However, this past work has not addressed the computational complexity of computing the MLE in a weakly identifiable model.
The focus of this paper is the intersection of statistical and computational issues associated with fitting the parameters of weakly identifiable mixture models. In particular, we study the expectationmaximization (EM) algorithm [9, 27, 22], which is the most popular algorithm for computing (approximate) MLEs in mixture models. It is an instance of a minorizationmaximization algorithm, in which at each step a suitably chosen lower bound of the loglikelihood is maximized. There is now a lengthy line of work on the behavior of EM when applied to regular models. The classical papers [27, 24, 7] establish the asymptotic convergence of EM to a local maximum of the loglikelihood function for a general class of incomplete data models. Other papers [16, 29, 19] characterized the rate of convergence of EM for regular Gaussian mixtures. More recent years have witnessed a flurry of work on the behavior of EM for various kinds of regular mixture models [1, 26, 31, 28, 8, 30, 11, 2]; as a consequence, our understanding of EM in such cases is now relatively mature. More precisely, it is known that for Gaussian mixtures, EM converges in steps to parameter estimates that lie within Euclidean distance of the true location parameters, assuming minimal separation between the mixture components.
In our own recent work [10], we studied the behavior of EM for fitting a class of nonregular mixture models, namely those in which the Fisher information is degenerate at a point, but the model remains strongly identifiable. One such class of models are Gaussian location mixtures with known scale parameters that are overspecified, meaning that the number of components in the mixture fit exceeds the number of components in the data generating distribution. For such nonregular but strongly identifiable mixture models, our own past work [10] showed that the EM algorithm takes steps to converge to a Euclidean ball of radius around the true location parameter. Recall that for such models, the MLE is known to lie at a distance from the true parameter [5], so that even though its convergence rate as an optimization algorithm is slow, the EM algorithm nonetheless produces a solution with statistical error of the same order as the MLE. This past work does not consider the more realistic setting in which both the location and scale parameters are unknown, and the EM algorithm is used to fit both simultaneously. Indeed, as mentioned earlier, such models may become weakly identifiable due to algebraic relations among the partial derivatives [6, 17]. Thus, analyzing EM in the case of weakly identifiable mixtures is challenging for two reasons: the weak separation between the mixture components, and the algebraic interdependence of the partial derivatives of the loglikelihood.
The main contributions of this work are (a) to highlight the dramatic differences in the convergence behavior of the EM algorithm, depending on the structure of the fitted model relative to the datagenerating distribution; and (b) to analyze the EM algorithm under a few specific yet representative settings of weakly identifiable models, giving a precise analytical characterization of its convergence behavior.
1.2 Some illustrative simulations
In order to put our results into context, it is useful to study via simulation how the behavior of EM can vary dramatically for some simple models. In particular, consider the class of twocomponent mixture densities of the form
(1) 
where is the density of a
random vector. In the parameterization (
1), both the mean parameterand the variance parameter
are allowed to vary. Suppose that data is drawn from standard Gaussian model in dimension , whose density is given by(2) 
Note that the density belongs to the family , since we have with and . We refer to densities of the form (1) as isotropic locationscale Gaussian mixtures.
Given samples from a dimensional instance of this model, the sample EM algorithm generates a sequence of the form ; see equation (7c) for a precise definition. An abstract counterpart of the sample EM algorithm—not useful in practice but rather for theoretical understanding—is the population EM algorithm , obtained in the limit of an infinite sample size (cf. equation (9)).
Panel (a) in Figure 1 plots the Euclidean norm of the population EM iterate^{1}^{1}1In fact, our analysis makes use of two slightly different populationlevel operators and defined in equations (9) and (19) respectively. Figure 1 shows plots for the operator , but the results are qualitatively similar for the operator . versus the iteration number , with solid red line corresponding to and the dashdotted green line corresponding to . Observe that the algorithm converges far more slowly in the univariate case than the multivariate case. The theory to follow in this paper (see Theorems 1 and 2) provides explicit predictions for the rate at which the Euclidean error should decay; these theoretical predictions are plotted as a dashed blue line (univariate case ) and a dotted black line (). In practice, running the sample EM algorithm yields an estimate of the unknown location parameter . Panel (b) shows the scaling of the statistical estimation error of this sample EM estimate versus the sample size on a loglog scale. The three curves correspond to dimensions , along with leastsquares fits to the data. Note that the slow convergence of EM in dimension , as shown in panel (a), manifests itself in the lower statistical accuracy of the sample EM estimate.
This qualitative behavior is not limited only to the relatively simple class of Gaussian mixture models (1), nor is slow convergence of EM only an issue in dimension . More generally, consider mixture models of the form
(3a)  
(3b) 
Note that the model family (1) in dimension is a special case of the tied diagonal covariance fit, one in which we enforce the constraint .
In Figure 2, we compare and contrast the convergence behavior of EM for the fits (3a) and (3b) along with results using the simpler isotropic model (1). On one hand, from panel (a) we see that the population EM sequence for the tied diagonal fit converges very slowly when compared to the isotropic fit. (Population EM for the free covariance fit is numerically expensive and thereby omitted.) On the other hand, from panel (b), we see that the statistical error when using sample EM for the two fits (3a)(3b) is of order , which is substantially slower than the rate for fitting an isotropic covariance matrix.
(a)  (b) 
1.3 Our contributions
The main contribution of this paper is to provide a precise analytical characterization of the behavior of the EM algorithm for overspecified mixture models. Doing so requires introducing a number of new technical ideas to the analysis of optimization algorithms in statistical settings, to be detailed below. As demonstrated in the previous section, the EM algorithm can exhibit a range of convergence behaviors in various settings, both univariate and multivariate. In this paper, so as to bring our contributions into sharp focus, we direct our analytical attention to the class of isotropic locationscale mixtures (1), with the goal of obtaining a precise characterization of the phenomena illustrated in Figure 1. We analyze the behavior of the EM algorithm when it is used to fit the parameters of a density from the family (1), based on a data set generated in an i.i.d. fashion from the standard
dimensional Gaussian distribution
. We study both the population EM algorithm—meaning the idealized method obtained in the limit of infinite data—as well as the samplebased EM algorithm, as applied to a finite collection of data.Our main findings for both types of algorithms can be summarized as follows:

Multivariate singular Gaussian mixtures: For , we establish that the population EM algorithm requires and steps in order to achieve accuracy for the location and scale parameters, respectively. Using an argument introduced in our past work [10], we show that these rates of convergence for the population EM algorithm imply that the sample EM algorithm takes steps in order to converge to estimates, of the location and scale parameters respectively, that lie within distances and of the true location and scale parameters, respectively. The order accuracy in the scale parameter estimate demonstrates the benefit of borrowing strength from multiple dimensions in order to estimate a shared scalar parameter.

Univariate singular Gaussian mixtures: By contrast, in the univariate setting (), we find that the EM algorithm behaves very differently. The population EM algorithm requires and steps to reach accuracy for the location and scale parameters, respectively. Translating these rates to the sample EM algorithm, we prove that the EM estimate has statistical estimation error of the order and after order steps for the location and scale parameters respectively. Proving these rates requires a novel analysis, and herein lies the main technical contribution of our paper. Indeed, we show that all the analysis techniques introduced in past work on EM, including work on both the regular [1] and strongly identifiable cases [10], lead to suboptimal rates. Our novel method is a twostage approach. In the first stage, we prove that sample EM iterates converge to a ball around the true parameters after steps, whose radius scales like and , for the location and scale parameters respectively. In the second stage, we consider a corrected population EM algorithm, which enables a tighter analysis for the sample EM iterates when they lie in a close enough vicinity of the true parameters. This two stage analysis enables us to establish the optimal statistical error rates of order and of the sample EM iterates after order steps, for the location and scale parameter respectively.
The remainder of the paper is organized as follows. In Section 2, we begin with a description of the specific singular mixture fits and the associated EM updates. Then, we provide a description of algebraic relations between partial derivatives of the density in the univariate case, and conclude the section by discussing the challenges associated with our analysis with EM. Section 3 is devoted to our main results on the convergence of EM, first for multivariate and then for the univariate mixtures. In Section 4, we discuss the results and suggest possible directions for future work. We provide proofs of our theorems in Section 5 while deferring the proofs of more technical results to the Appendices.
Notation:
In the paper, the expression will be used to denote for some positive universal constant that does not change with . Additionally, we write if both and hold. Furthermore, we denote as the set for any . We define as the smallest integer greater than or equal to for any . The notation stands for the norm of vector . Finally, for any two densities and , the total variation distance between and is given by . Similarly, the squared Hellinger distance between and is defined as .
2 Overspecified locationscale Gaussian mixtures
In this section, we first provide explicit formulation of sample and population likelihood for locationscale Gaussian mixture models (1). Then, we present specific EM updates for these models. Finally, based on an intrinsic algebraic interdependence of the partial derivatives of the locationscale Gaussian density function, we briefly discuss the intuition of analyzing EM algorithm with these models.
2.1 Likelihood structure
Recall that we are analyzing the isotropic locationscale Gaussian mixture models (1). Given i.i.d. samples , the maximum likelihood estimate for the model fit (1) takes the form
(4) 
where is a compact subset of . Constraining the likelihood to the compact set suffices to ensure that the MLE is consistent [6].
The largesample limit of the sample likelihood is the population loglikelihood given by
(5) 
where the expectation is taken with respect to the true distribution of the data, namely, . Note that the true parameters achieve the the global maximum of the population loglikelihood .
2.2 EM algorithm
In general, the MLEs do not admit a closedform expression, so that the EM algorithm is used to approximate maximizers of loglikelihood function . The EM updates for Gaussian mixture models are standard, so we simply state them here. In terms of the shorthand notation , the Estep in the EM algorithm involves computing the function
(6a)  
where the weight function is given by  
(6b) 
The Mstep involves maximizing the function (6a) over the pair with fixed, which yields
(7a)  
(7b)  
Doing some straightforward algebra, the EM updates can be succinctly defined as  
(7c)  
and . For simplicity in presentation, we refer to the operator as the sample EM operator. 
One can gain some intuition into the behavior of EM by examining the algebraic structure of loglikelihood. In the univariate case (), the loglikelihood has two partial derivatives, one with respect to the location parameter and the other with respect to the scale parameter . By computing these partial derivatives, one finds that they satisfy the algebraic relation
(8) 
valid for all , and . In the multivariate setting (1) when the same parameter is shared across multiple dimensions, this algebraic relation no longer holds and we obtain faster convergence of EM estimates.
It is known from past work that algebraic relations of the form (8) have consequences for the behavior of the MLE [4, 3, 17, 14]. In the current context, by adapting techniques developed by Ho et al. [14], we can verify that the convergence rates of to the true parameters are of order and under the univariate singular setting (1). In the multivariate setting , the same analysis shows that the MLE achieves the faster rates and .
Given these differences in the algebraic structure of the likelihood and the behavior of the MLE, it is natural to wonder whether the same phenomena arise for the EM algorithm when applied to weakly identifiable mixtures. Note that the answer to this question is not a priori obvious, given the nonconvexity of the sample likelihood. Since the EM algorithm is only guaranteed to return local optima, it is quite possible that the accuracy of EM estimates could be substantially different than a global maximum of the likelihood—that is, the MLE.
3 Main results and their consequences
We now turn to formal statements of our main results on the behavior of EM. The analyses for the multivariate and the univariate settings are significantly different; accordingly, we devote separate sections to each, beginning with the (easier) multivariate case in Section 3.1 before turning to the (more challenging) univariate case in Section 3.2.
3.1 Guarantees for multivariate singular mixtures
In order to study the EM algorithm for fitting Gaussian mixtures of the form (1), we consider the operator
(9) 
In the sequel, we refer to it as a pseudopopulation operator, since it is neither a purely samplebased operator (due to the outer expectation), nor a purely population operator due to its dependence on the samples . In past work, Cai et al. [2] used this operator for analyzing the behavior of EM for regular case of locationscale mixtures of sample EM with unknown locationscale mixtures.
We start with a description of both the population EM and the sample EM updates for the model (1) under the multivariate setting of . We use to denote the interval . We now state our first main result for the multivariate singular setting:
Theorem 1.
Let , be fixed positive numbers and the sample size . Then, for any the pseudopopulation EM operator satisfies
(10a)  
with probability at least .^{2}^{2}2 Since the operator depends on the samples , only a high probability bound (and not a deterministic one) on the ratio is possible. Moreover, with any starting point such that , the sample EM sequence satisfies  
(10b) 
for all number of iterates with probability at least .
See Section 5.1 for
the proof.
The results in Theorem 1 establish the provably slow convergence of EM on both, the algorithmic and statistical fronts for the multivariate singular mixtures:

[label=()]

The bound (10a) shows that the pseudopopulation EM operator has a contraction parameter that scales like as . Therefore, we conclude that the norm of the pseudopopulation EM sequence will be accurately captured by the norm of the sequence ; which is in agreement with the results from the simulation study of Figure 1 for .

The bound (10b) shows that sample EM updates converge to a ball around with radius arbitrarily close to for .
Statistical rates for the scale parameter:
We now derive the convergence rates for the scale parameter using Theorem 1. Noting that , we obtain the following relation
Using standard chisquared bounds, we obtain that
with high probability. From the bound (10b), we also have . Putting the pieces together, we conclude that the statistical error for the scale parameter satisfies
(11) 
with high probability. Consequently, in the sequel, we focus primarily on the convergence rate for the EM estimates of the location parameter, as the corresponding guarantee for the scale parameter is readily implied by it.
Related work:
Note that the slow convergence behavior of EM described above is in stark contrast with the geometric convergence of EM location estimates to a parametric radius of order with wellspecified mixtures [1, 8]. Nonetheless, the results from Theorem 1 are reminiscent of the slow convergence of EM for overspecified mixtures with known scale parameter [10]. In fact, the results for the overspecified location case from the paper [10] when combined with Theorem 1 in this paper can be summarized as follows: for the multivariate singular fits (1), the convergence behavior of EM is unaffected whether the scale parameter is known or unknown. Moreover, the proof of Theorem 1 follows a similar roadmap as of the proofs in the paper [10]: We first establish the population contraction bounds (10a) and then using epochbasedlocalization argument derive the statistical rates (10b).^{3}^{3}3Moreover, similar to the arguments made in the paper [10], localization argument is necessary to derive a sharp rate. Indeed, a direct application of the framework introduced by Balakrishnan et al. [1] for our setting implies a suboptimal rate of order for the Euclidean error .
Initial condition:
We now make a few comments on the initial conditions on assumed in Theorem 1. At first sight, the condition and the consequent constraint might seem rather restrictive. However, we show that (Lemma 7 in Section B.7) for any satisfying , we have that the pseudopopulation EM update satisfies , with high probability. In light of these observations, the initialization condition is Theorem 1 is not overly restrictive.
3.2 EM for univariate singular mixtures
When we substitute in the upper bound (10a), we obtain that . In other words, the bound (10a) is not informative enough for , as it does not tell us whether the pseudopopulation EM operator is even contractive for overspecified univariate mixtures. We now briefly demonstrate that the knowledge of scale parameter in the univariate case has a significant effect on the nature of the loglikelihood function (defined in equation (5)). When the scale parameter is known, the population loglikelihood is given by the map . When the scale is unknown, recalling that the EM updates (7b) satisfy , we consider the map . The results are plotted in Figure 3, where we observe that for both cases the population loglikelihood functions are uniformly concave [33] with different orders around . More concretely, we observe the following numerical behaviors: for the location case, and, for the locationscale case. Such a difference in the population likelihood surface also affects the sample likelihood function as depicted in panel (b) in Figure 3, where we see that the local maximas in the function corresponding to the locationscale mixtures have higher magnitude than those in the location mixtures. These observed differences lead to a more involved theoretical analysis for the univariate setting, as we now describe.
(a)  (b) 
As discussed before, due to the relationship between the location and scale parameter, namely the updates (7c), it suffices to analyze the sample EM operator for the location parameter. For the univariate Gaussian mixtures, given samples , the sample EM operator and the corresponding pseudopopulation operator (10a) are given by
(12a)  
(12b) 
Let denote the interval where is a positive universal constant. We now characterize the guarantees for EM under the univariate setting:
Theorem 2.
Fix , , and let the sample size be sufficiently large . Then the pseudopopulation operator satisfies
(13a)  
with probability at least . Moreover for any initialization that satisfies , the sample EM sequence , satisfies  
(13b)  
with probability at least . 
See
Section 5.2 for
the proof.
The bound (13a) provides a sharp characterization of the local contractivity properties of pseudopopulation operator . In particular, we have that which is significantly slower than the contraction coefficient of order derived in Theorem 1 for the multivariate setting. As a direct consequence, we see that while the pseudopopulation EM sequence converges to a ball of radius around in iterations for the multivariate setting, it takes much more iterations for the univariate setting. Such a behavior is indeed in accordance with the observations made in Figure 1.
The bound (13b) shows that with high probability after steps the sample EM iterates converge to a ball around whose radius is arbitrarily close to . This scaling of order is, in fact, tight in the standard minimax sense (Proposition 1 in Appendix A). Thus, we conclude that Theorems 1 and 2, in conjunction, provide a rigorous justification of the empirical behavior of sample EM updates as illustrated in Figure 1(b).
Remark:
The proof of Theorem 2 follows a fundamentally different route when compared to that of Theorem 1. On one hand, the analysis of the univariate operator is technically more involved and requires new techniques when compared to the analysis of the operator in the multivariate setting. On the other hand, the localization argument when applied with the EM operators and leads to suboptimal rates of order for sample EM updates, which are slower when compared to the order established in Theorem 2. Such a suboptimal guarantee necessitates the introduction of a different population operator that we discuss in the sequel. We now first derive a few suboptimal guarantees using the operator in Section 3.2.1 and then in Section 3.2.2 we introduce a novel twostage localization argument that enables us to derive the correct statistical rate for the sample EM updates.
3.2.1 Suboptimal rates with pseudopopulation EM
Given the behavior of the EM pseudopopulation operator , the analysis of sample EM iterations requires uniform laws for the perturbations bounds between the sample EM operator and the operator . Later we establish (Lemma 6 stated in Section B.5) the following high probability bound: For any radius , we have
(14) 
Using the bound (12b) from Theorem 2 and the uniform bound (14), we now sketch the statistical rates for sample EM that can be obtained using (a) the generic procedure outlined by Balakrishnan et al. [1] and (b) the localization argument introduced by Dwivedi et al. [10]. As we show, both these arguments end up being suboptimal as they do not provide us the rate of order stated in Theorem 2. To summarize, the discussion to follow makes use of the following two bounds,
(15) 
both of which hold with high probability.
Suboptimal rate I:
The eventual radius of convergence obtained using Theorem 5(a) from the paper [1] can be determined by solving the simpler equation
(16a)  
where denotes the bound on the initialization radius . Tracking dependency only on , we obtain that  
(16b) 
where we use the fact that the initialization radius is a constant. This informal computation suggests that the the sample EM iterates for location parameter are bounded by a term of order . This rate is clearly suboptimal when compared to the EM rate of order from Theorem 2.
Suboptimal rate II:
Next we apply the more sophisticated localization argument from the paper [10] in order to obtain a sharper rate. Given the bounds (15), this argument leads to solving the equation
(17a)  
where, as before, denotes the (constant) bound on initialization radius. Given that , we find that  
(17b) 
This calculation allows us to conclude that the EM algorithm converges
to an estimate which is at a distance of order from the true
parameter, which is again
suboptimal compared to the rate of EM from Theorem 2.
Notice that this rate, while still suboptimal, is sharper than the rate
of order obtained in equation (16b);
this sharpening arises from the addiitonal factor in the
LHS of equation (17a) in comparison to the LHS of
equation (16a). We refer the readers to the
paper [10] for a detailed discussion on the
localization argument that leads to this additional factor.
Indeed both the conclusions above can be made rigorous; nonetheless, the bound (17b) is essential in the analysis that follows and hence, we formalize the corresponding statistical rate in the next corollary. Recall that denotes the interval where is a positive universal constant.
Corollary 1.
Given constants and , suppose that we generate the the samplelevel EM sequence starting from an initialization , and using a sample size lower bounded as . Then for all iterations , we have
(18) 
See Appendix B.6 for the proof of this corollary.
Before proceeding further, a few remarks are in order. The suboptimal bound (18) is not an artifact of the localization argument; instead it arises due to our choice of the pseudopopulation EM operator (12b). Indeed, as we show in the next section, a better choice of the population operator enables us to derive sharper rates for sample EM in the univariate setting of singular mixtures. However, a key assumption in the further derivation is that the sample EM iterates can converge to a ball of radius around in a finite number of steps. Thus, Corollary 1, with a choice of , plays a crucial role in the analysis to follow, as it enables us to claim that the aforementioned assumption does hold with high probability.
3.2.2 Towards a sharper analysis: Corrected population EM operator
The particular choice of the populationlike operator in equation (12b) was motivated by the previous works [2] with the locationscale Gaussian mixtures. Nonetheless as we noted above, this choice did not lead to the sharpest rate for sample EM iterates under the univariate singular mixtures. Upon a closer look, we realize that a “better” choice of the population operator is indeed possible in this setting. We now turn to a detailed description of this new operator and the associated consequences for the statistical rates for sample EM.
Define the following population EM operator:
(19) 
Unlike the pseudopopulation operator (9), this operator is indeed a population operator as it does not depend on samples . Note that, this operator is obtained when we replace the sum in the definition (7c) of the operator by its corresponding expectation . For this reason, we also refer to this operator as the corrected population operator. We now derive the contraction coefficient and the perturbation error for the new operator for :
Lemma 1.
The corrected population EM operator satisfies the following bounds:  
(20a)  
Furthermore, for any fixed and , we have that  
(20b) 
where is some universal constant.
Dimension 
(Theorem 1 & Lemma (6)) 
(Lemma (1)) 


similar  
(Contraction rate)  similar  
(Perturbation bound)  similar 
Using Lemma 1, we now compare and contrast the two operators and . While both the operators have similar contraction coefficient of order as , their perturbation bounds are significantly different. While, the error scales linearly with the radius (see bound (14)), from the inequality (20b) implies that the deviation error has a cubic scaling . In Figure 4, we plot the numerically computed perturbation bounds that verify the linear and cubic dependence on the radius for these two bounds. Moreover, for reader’s convenience, we summarize these similarities and differences in Table 1.
Another notable difference between the two perturbation bounds is the range of radius over which we prove their validity. With our tools, we establish that the perturbation bound (20b) for the operator is valid for any . On the other hand, the corresponding bound (14) for the operator is valid for any . The later difference is the key reason behind why the operator and the consequent result in Corollary 1 remain crucial in the analysis to follow.
A twostage analysis:
In lieu of the above observations, the proof of the sharp upper bound (13b) in Theorem 2 proceeds in two stages. In the first stage, invoking Corollary 1, we conclude that with high probability the sample EM iterates converge to a ball of radius at most after steps, where . Consequently, the sample EM iterates after steps satisfy the assumptions required to apply Lemma 1. Thereby, in the second stage of the proof, we invoke Lemma 1 and employ the contraction bound (20a) of the operator in conjunction with the cubic perturbation bound (20b). Using localization argument for this stage, we establish that the EM iterates obtain a statistical error of order in steps as stated in Theorem 2.
4 Discussion
In this paper, we established several results characterizing the convergence behavior of EM algorithm for overspecified locationscale Gaussian mixtures. Our results reveal an interesting phase transition with the convergence rates of EM between the univariate and multivariate settings. With overspecified models in dimension
, the EM updates based on samples recover noisy estimates of the true parameters with a rather large error of for the location parameter and for the scale parameter. Thus we see that the error scaling for the location estimation error is of the order with the sample size which is rather slow when compared to the usual parametric rate of order for the regular models. In the univariate setting, the story gets more interesting and we observe that these errors are even larger: for the location parameter and for the scale parameter. This phase transition from the multivariate to the univariate setting can be attributed to the interdependence among the parameters arising from an intrinsic linear dependence in the PDE structure in the univariate setting. We view our analysis of EM for singular Gaussian mixtures as a first step towards a rigorous understanding of EM for a broader class of weakly identifiable mixture models. Such a study would provide a better understanding of the singular models with weak identifiability. Note that, these models do arise in practice while fitting heterogeneous data owing to the following two main reasons: (a) overspecification is a common phenomenon in fitting mixture models due to weak separation between mixture components, and, (b) the parameters being estimated are often inherently dependent due to the algebraic structures of the class of kernel densities being fitted and the associated partial derivatives. We now discuss a few other directions that can serve as a natural followup of our work.The slow rate of order for EM is in a sense a worstcase guarantee. In the univariate case, for the entire class of two mixture Gaussian fits, MLE exhibits the slowest known statistical rate for the settings that we analyzed. More precisely, for certain asymmetric Gaussian mixture fits, the MLE convergence rates for the location parameter is faster than that of the symmetric equalweighted mixture considered in this paper. As a concrete example, for the univariate model fit with the standard isotropic Gaussian data, the MLE converges at the rate and respectively for the location and scale parameters. These specific convergence rates are also based on the solvability of a certain system of polynomial equations with specific constraints on the unknown parameters arising from the model fit. It is interesting to understand the effect of such a geometric structure of the global maxima on the convergence of EM algorithm.
Finally, we analyzed overspecified mixtures with only one extra component. Nonetheless, quite often one may fit a model with a larger number of overspecified components so as to avoid underspecification of the true model. The convergence rate of the MLE for such overspecified models is known to further deteriorate as a function of the number of extra components. It remains to understand how the EM algorithm responds to these more severe—and practically relevant—instances of overspecification.
5 Proofs
We now provide the proofs of our main results, with Sections 5.1 and 5.2 devoted to Theorems 1 and 2 respectively. In all cases, the proofs of technical and auxiliary results are deferred to the appendices.
5.1 Proof of Theorem 1
The first step in this proof is to characterize the local contractivity of the pseudopopulation operator . We then leverage this result to the convergence rate of sample EM updates under multivariate setting of singular Gaussian mixture.
5.1.1 Contractivity of the pseudopopulation EM operator
In order to simplify notation, we use the shorthand . Recalling the definition (9) of operator , we have
(21) 
We can find an orthonormal matrix such that , where is the first canonical basis in . Define the random vector . Since , we have that . On performing the change of variables , we find that