Challenges with EM in application to weakly identifiable mixture models

02/01/2019 · Raaz Dwivedi, et al.

We study a class of weakly identifiable location-scale mixture models for which the maximum likelihood estimates based on n i.i.d. samples are known to have lower accuracy than the classical n^{-1/2} error. We investigate whether the Expectation-Maximization (EM) algorithm also converges slowly for these models. We first demonstrate via simulation studies a broad range of over-specified mixture models for which the EM algorithm converges very slowly, both in one and higher dimensions. We provide a complete analytical characterization of this behavior for fitting data generated from a multivariate standard normal distribution using a two-component Gaussian mixture with varying location and scale parameters. Our results reveal distinct regimes in the convergence behavior of EM as a function of the dimension d. In the multivariate setting (d ≥ 2), when the covariance matrix is constrained to be a multiple of the identity matrix, the EM algorithm converges in order (n/d)^{1/2} steps and returns estimates that are at a Euclidean distance of order (n/d)^{-1/4} and (nd)^{-1/2} from the true location and scale parameter respectively. On the other hand, in the univariate setting (d = 1), the EM algorithm converges in order n^{3/4} steps and returns estimates that are at a Euclidean distance of order n^{-1/8} and n^{-1/4} from the true location and scale parameter respectively. Establishing the slow rates in the univariate setting requires a novel localization argument with two stages, with each stage involving an epoch-based argument applied to a different surrogate EM operator at the population level. We also show multivariate (d ≥ 2) examples, involving more general covariance matrices, that exhibit the same slow rates as the univariate case.

1 Introduction

Gaussian mixture models [21] have been used widely to model heterogeneous data in many applications arising from physical and biological sciences. In several scenarios, the data has a large number of sub-populations, and the mixture components in the data may not be well-separated. In such settings, estimating the true number of components may be difficult, so that one may end up fitting a mixture model with a number of components larger than that actually present in the data. Such mixture fits, referred to as over-specified mixture distributions, are commonly used by practitioners in order to deal with uncertainty in the number of components in the data [23, 12]. However, a deficiency of such models is that they are singular, meaning that their Fisher information matrices are degenerate. Given the popularity of over-specified models in practice, it is important to understand how methods for parameter estimation, including maximum likelihood and the EM algorithm, behave when applied to such models.

1.1 Background and past work

In the context of singular mixture models, an important distinction is between those that are strongly versus weakly identifiable. Chen [5] studied the class of strongly identifiable models in which, while the Fisher information matrix may be degenerate at a point, it is not degenerate over a larger set. Studying over-specified Gaussian mixtures with known scale parameters, he showed that the accuracy of the MLE for the unknown location parameter is of the order n^{-1/4}, which should be contrasted with the classical rate n^{-1/2} achieved in regular settings. A line of follow-up work has extended this type of analysis to other types of strongly identifiable mixture models; see the papers [15, 23, 20, 13] as well as the references therein for more details.

A more challenging class of mixture models are those that are only weakly identifiable, meaning that the Fisher information is degenerate over some larger set. This stronger form of singularity arises, for instance, when the scale parameter in an over-specified Gaussian mixture is also unknown [4, 6, 17]. Ho et al. [14] characterized the behavior of MLE for a class of weakly identifiable models. They showed that the convergence rates of MLE in these models can be very slow, with the precise rates determined by algebraic relations among the partial derivatives. However, this past work has not addressed the computational complexity of computing the MLE in a weakly identifiable model.

The focus of this paper is the intersection of statistical and computational issues associated with fitting the parameters of weakly identifiable mixture models. In particular, we study the expectation-maximization (EM) algorithm [9, 27, 22], which is the most popular algorithm for computing (approximate) MLEs in mixture models. It is an instance of a minorization-maximization algorithm, in which at each step a suitably chosen lower bound of the log-likelihood is maximized. There is now a lengthy line of work on the behavior of EM when applied to regular models. The classical papers [27, 24, 7] establish the asymptotic convergence of EM to a local maximum of the log-likelihood function for a general class of incomplete data models. Other papers [16, 29, 19] characterized the rate of convergence of EM for regular Gaussian mixtures. More recent years have witnessed a flurry of work on the behavior of EM for various kinds of regular mixture models [1, 26, 31, 28, 8, 30, 11, 2]; as a consequence, our understanding of EM in such cases is now relatively mature. More precisely, it is known that for Gaussian mixtures, EM converges in order log(n/d) steps to parameter estimates that lie within Euclidean distance of order (d/n)^{1/2} of the true location parameters, assuming a suitable minimal separation between the mixture components.

In our own recent work [10], we studied the behavior of EM for fitting a class of non-regular mixture models, namely those in which the Fisher information is degenerate at a point, but the model remains strongly identifiable. One such class of models are Gaussian location mixtures with known scale parameters that are over-specified, meaning that the number of components in the mixture fit exceeds the number of components in the data generating distribution. For such non-regular but strongly identifiable mixture models, our own past work [10] showed that the EM algorithm takes order (n/d)^{1/2} steps to converge to a Euclidean ball of radius (d/n)^{1/4} around the true location parameter. Recall that for such models, the MLE is known to lie at a distance of order n^{-1/4} from the true parameter [5], so that even though its convergence rate as an optimization algorithm is slow, the EM algorithm nonetheless produces a solution with statistical error of the same order as the MLE. This past work does not consider the more realistic setting in which both the location and scale parameters are unknown, and the EM algorithm is used to fit both simultaneously. Indeed, as mentioned earlier, such models may become weakly identifiable due to algebraic relations among the partial derivatives [6, 17]. Thus, analyzing EM in the case of weakly identifiable mixtures is challenging for two reasons: the weak separation between the mixture components, and the algebraic interdependence of the partial derivatives of the log-likelihood.

The main contributions of this work are (a) to highlight the dramatic differences in the convergence behavior of the EM algorithm, depending on the structure of the fitted model relative to the data-generating distribution; and (b) to analyze the EM algorithm under a few specific yet representative settings of weakly identifiable models, giving a precise analytical characterization of its convergence behavior.

1.2 Some illustrative simulations

In order to put our results into context, it is useful to study via simulation how the behavior of EM can vary dramatically for some simple models. In particular, consider the class of two-component mixture densities of the form

f_{θ, σ}(x) := ½ φ(x; −θ, σ² I_d) + ½ φ(x; θ, σ² I_d),   (1)

where φ(·; μ, Σ) denotes the density of a N(μ, Σ) random vector. In the parameterization (1), both the mean parameter θ and the variance parameter σ² are allowed to vary. Suppose that the data is drawn from the standard Gaussian model in dimension d, whose density is given by

f*(x) := φ(x; 0, I_d).   (2)

Note that the density f* belongs to the family {f_{θ, σ}}, since we have f* = f_{θ*, σ*} with θ* = 0 and σ* = 1. We refer to densities of the form (1) as isotropic location-scale Gaussian mixtures.
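To make the setup concrete, the following Python sketch evaluates the mixture density under the symmetric two-component parameterization written above and draws data from the true model (2); the function name mixture_density and the specific sample sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def mixture_density(x, theta, sigma):
    """Density (1): equal-weight mixture of N(-theta, sigma^2 I_d) and N(theta, sigma^2 I_d).

    x: array of shape (n, d); theta: array of shape (d,); sigma: positive scalar.
    """
    d = x.shape[1]
    norm_const = (2.0 * np.pi * sigma**2) ** (-d / 2.0)
    comp_plus = norm_const * np.exp(-np.sum((x - theta) ** 2, axis=1) / (2.0 * sigma**2))
    comp_minus = norm_const * np.exp(-np.sum((x + theta) ** 2, axis=1) / (2.0 * sigma**2))
    return 0.5 * comp_plus + 0.5 * comp_minus

# The data-generating distribution (2) is the standard Gaussian, i.e. theta* = 0 and sigma* = 1.
rng = np.random.default_rng(0)
n, d = 1000, 2
X = rng.standard_normal((n, d))
print(mixture_density(X, theta=np.zeros(d), sigma=1.0)[:3])  # equals the N(0, I_d) density at X[:3]
```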

Given n i.i.d. samples from a d-dimensional instance of this model, the sample EM algorithm generates a sequence of iterates of the form (θ^t, σ^t); see equation (7c) for a precise definition. An abstract counterpart of the sample EM algorithm—not useful in practice, but rather for theoretical understanding—is the population EM algorithm, obtained in the limit of an infinite sample size (cf. equation (9)).

Panel (a) in Figure 1 plots the Euclidean norm of the population EM iterate¹ versus the iteration number t for dimensions d = 1 and d = 2. (¹In fact, our analysis makes use of two slightly different population-level operators, defined in equations (9) and (19) respectively. Figure 1 shows plots for the first of these operators, but the results are qualitatively similar for the second.) Observe that the algorithm converges far more slowly in the univariate case than in the multivariate case. The theory to follow in this paper (see Theorems 1 and 2) provides explicit predictions for the rate at which the Euclidean error should decay; these theoretical predictions are plotted as a dashed blue line (univariate case d = 1) and a dotted black line (d = 2). In practice, running the sample EM algorithm yields an estimate of the unknown location parameter. Panel (b) shows the scaling of the statistical estimation error of this sample EM estimate versus the sample size n on a log-log scale. The three curves correspond to different dimensions, along with least-squares fits to the data. Note that the slow convergence of EM in dimension d = 1, as shown in panel (a), manifests itself in the lower statistical accuracy of the sample EM estimate.

Figure 1: (a) Convergence behavior of the population-like EM sequence (9) for singular Gaussian mixtures (1) in dimensions d = 1 and d = 2. The EM algorithm was used to fit data drawn from a standard normal distribution. In dimension d = 1, the population EM sequence converges very slowly to the true solution θ* = 0, and its Euclidean norm is tracked by a slowly decaying polynomial curve. In dimension d = 2, the population EM algorithm converges considerably more quickly than in dimension d = 1. The convergence behavior of the sample EM algorithm is qualitatively similar. (b) Scaling of the Euclidean error for the estimate computed using the sample EM algorithm with n samples, where both the location and scale parameters are estimated using EM. The plot shows the scaling of the error with respect to the sample size n.

This qualitative behavior is not limited only to the relatively simple class of Gaussian mixture models (1), nor is slow convergence of EM only an issue in dimension d = 1. More generally, consider mixture models of the form

½ φ(x; −θ, diag(σ₁², …, σ_d²)) + ½ φ(x; θ, diag(σ₁², …, σ_d²))   (tied diagonal covariance fit),   (3a)
½ φ(x; −θ, Σ) + ½ φ(x; θ, Σ)   (free covariance fit).   (3b)

In the fit (3a), the two components share a common diagonal covariance matrix, whereas in the fit (3b) they share a common unconstrained covariance matrix Σ. Note that the model family (1) in dimension d is a special case of the tied diagonal covariance fit, one in which we enforce the constraint σ₁ = ⋯ = σ_d.

In Figure 2, we compare and contrast the convergence behavior of EM for the fits (3a) and (3b), along with results using the simpler isotropic model (1). On one hand, from panel (a) we see that the population EM sequence for the tied diagonal fit converges very slowly when compared to the isotropic fit. (Population EM for the free covariance fit is numerically expensive and is therefore omitted.) On the other hand, from panel (b), we see that the statistical error when using sample EM for the two fits (3a)-(3b) is of order n^{-1/8}, which is substantially slower than the order (d/n)^{1/4} rate for fitting an isotropic covariance matrix.

Figure 2: Comparison of the algorithmic and statistical rates of convergence of population and sample EM for the multivariate fits (3a) and (3b), compared with the isotropic fit (1). (a) The population EM algorithm for the tied diagonal fit converges slowly relative to its behavior for the isotropic fit. While the EM optimization error for the isotropic fit is tracked accurately by one polynomially decaying sequence, the corresponding behavior in the case of the non-spherical fit is captured well by a more slowly decaying sequence. (b) Plots of the Euclidean distance versus sample size for the tied diagonal fit, and of the Wasserstein distance for the free-covariance fit. The statistical estimation error for both fits scales as n^{-1/8}, which is substantially slower than the order (d/n)^{1/4} convergence for the isotropic fit.

1.3 Our contributions

The main contribution of this paper is to provide a precise analytical characterization of the behavior of the EM algorithm for over-specified mixture models. Doing so requires introducing a number of new technical ideas to the analysis of optimization algorithms in statistical settings, to be detailed below. As demonstrated in the previous section, the EM algorithm can exhibit a range of convergence behaviors in various settings, both univariate and multivariate. In this paper, so as to bring our contributions into sharp focus, we direct our analytical attention to the class of isotropic location-scale mixtures (1), with the goal of obtaining a precise characterization of the phenomena illustrated in Figure 1. We analyze the behavior of the EM algorithm when it is used to fit the parameters of a density from the family (1), based on a data set generated in an i.i.d. fashion from the standard d-dimensional Gaussian distribution N(0, I_d). We study both the population EM algorithm—meaning the idealized method obtained in the limit of infinite data—as well as the sample-based EM algorithm, as applied to a finite collection of data.

Our main findings for both types of algorithms can be summarized as follows:

  • Multivariate singular Gaussian mixtures: For d ≥ 2, we establish that the population EM algorithm requires order ε^{-2} and ε^{-1} steps in order to achieve ε-accuracy for the location and scale parameters, respectively. Using an argument introduced in our past work [10], we show that these rates of convergence for the population EM algorithm imply that the sample EM algorithm takes order (n/d)^{1/2} steps in order to converge to estimates, of the location and scale parameters respectively, that lie within distances of order (d/n)^{1/4} and (nd)^{-1/2} of the true location and scale parameters. The order (nd)^{-1/2} accuracy in the scale parameter estimate demonstrates the benefit of borrowing strength from multiple dimensions in order to estimate a shared scalar parameter.

  • Univariate singular Gaussian mixtures: By contrast, in the univariate setting (d = 1), we find that the EM algorithm behaves very differently. The population EM algorithm requires order ε^{-6} and ε^{-3} steps to reach ε-accuracy for the location and scale parameters, respectively. Translating these rates to the sample EM algorithm, we prove that the EM estimate has statistical estimation error of the order n^{-1/8} and n^{-1/4} after order n^{3/4} steps for the location and scale parameters respectively. Proving these rates requires a novel analysis, and herein lies the main technical contribution of our paper. Indeed, we show that all the analysis techniques introduced in past work on EM, including work on both the regular [1] and strongly identifiable cases [10], lead to sub-optimal rates. Our novel method is a two-stage approach. In the first stage, we prove that the sample EM iterates converge, after a finite number of steps, to a ball around the true parameters whose radius decays polynomially in n for both the location and scale parameters, albeit at a sub-optimal rate. In the second stage, we consider a corrected population EM algorithm, which enables a tighter analysis of the sample EM iterates when they lie in a close enough vicinity of the true parameters. This two-stage analysis enables us to establish the optimal statistical error rates of order n^{-1/8} and n^{-1/4} of the sample EM iterates after order n^{3/4} steps, for the location and scale parameters respectively.

The remainder of the paper is organized as follows. In Section 2, we begin with a description of the specific singular mixture fits and the associated EM updates. Then, we describe the algebraic relations between the partial derivatives of the density in the univariate case, and conclude the section by discussing the challenges associated with the analysis of EM. Section 3 is devoted to our main results on the convergence of EM, first for multivariate and then for univariate mixtures. In Section 4, we discuss the results and suggest possible directions for future work. We provide proofs of our theorems in Section 5, while deferring the proofs of more technical results to the Appendices.

Notation:

In the paper, the expression a_n ≲ b_n will be used to denote a_n ≤ c b_n for some positive universal constant c that does not change with n. Additionally, we write a_n ≍ b_n if both a_n ≲ b_n and b_n ≲ a_n hold. Furthermore, we denote [m] as the set {1, 2, …, m} for any positive integer m. We define ⌈x⌉ as the smallest integer greater than or equal to x for any x ∈ ℝ. The notation ||v|| stands for the Euclidean norm of the vector v. Finally, for any two densities p and q, the total variation distance between p and q is given by TV(p, q) = ½ ∫ |p(x) − q(x)| dx. Similarly, the squared Hellinger distance between p and q is defined as h²(p, q) = ½ ∫ (√p(x) − √q(x))² dx.

2 Over-specified location-scale Gaussian mixtures

In this section, we first provide an explicit formulation of the sample and population likelihoods for the location-scale Gaussian mixture models (1). Then, we present the specific EM updates for these models. Finally, based on an intrinsic algebraic interdependence of the partial derivatives of the location-scale Gaussian density function, we briefly discuss the intuition for analyzing the EM algorithm with these models.

2.1 Likelihood structure

Recall that we are analyzing the isotropic location-scale Gaussian mixture models (1). Given n i.i.d. samples X_1, …, X_n, the maximum likelihood estimate for the model fit (1) takes the form

(θ̂_n, σ̂_n) ∈ arg max_{(θ, σ) ∈ Θ} (1/n) Σ_{i=1}^n log f_{θ, σ}(X_i),   (4)

where Θ is a compact subset of ℝ^d × ℝ_+. Constraining the likelihood to the compact set Θ suffices to ensure that the MLE is consistent [6].

The large-sample limit of the sample log-likelihood is the population log-likelihood given by

L(θ, σ) := E[ log f_{θ, σ}(X) ],   (5)

where the expectation is taken with respect to the true distribution of the data, namely, N(0, I_d). Note that the true parameters (θ*, σ*) = (0, 1) achieve the global maximum of the population log-likelihood L.

2.2 EM algorithm

In general, the MLEs do not admit a closed-form expression, so the EM algorithm is used to approximate maximizers of the log-likelihood function. The EM updates for Gaussian mixture models are standard, so we simply state them here. In terms of the shorthand notation η = (θ, σ), the E-step in the EM algorithm involves computing the function

Q_n(η'; η) = (1/n) Σ_{i=1}^n [ w_η(X_i) log φ(X_i; θ', (σ')² I_d) + (1 − w_η(X_i)) log φ(X_i; −θ', (σ')² I_d) ],   (6a)

where the weight function is given by

w_η(x) = φ(x; θ, σ² I_d) / ( φ(x; θ, σ² I_d) + φ(x; −θ, σ² I_d) ).   (6b)

The M-step involves maximizing the Q-function (6a) over the pair η' = (θ', σ') with η fixed, which yields

θ^{t+1} = (1/n) Σ_{i=1}^n ( 2 w_{η^t}(X_i) − 1 ) X_i,   (7a)

(σ^{t+1})² = (1/(nd)) Σ_{i=1}^n [ w_{η^t}(X_i) ||X_i − θ^{t+1}||² + (1 − w_{η^t}(X_i)) ||X_i + θ^{t+1}||² ].   (7b)

Doing some straightforward algebra, the EM updates can be succinctly defined as

θ^{t+1} = M_n(θ^t),  where  M_n(θ) := (1/n) Σ_{i=1}^n tanh( ⟨X_i, θ⟩ / σ_n²(θ) ) X_i  and  σ_n²(θ) := (1/(nd)) Σ_{i=1}^n ||X_i||² − ||θ||²/d,   (7c)

so that (σ^{t+1})² = σ_n²(θ^{t+1}). For simplicity in the presentation, we refer to the operator M_n as the sample EM operator.
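For concreteness, the following Python sketch implements one EM update of the form (6a)–(7c), assuming the symmetric isotropic parameterization (1) as written above; the simplification 2w − 1 = tanh(·) and the identity tying the scale to the location iterate are consequences of that parameterization, and the function name em_step is purely illustrative.

```python
import numpy as np

def em_step(X, theta):
    """One sample EM update for the fit 0.5*N(-theta, s^2 I_d) + 0.5*N(theta, s^2 I_d).

    X: data array of shape (n, d); theta: current location iterate of shape (d,).
    The scale is tied to the location iterate, as in the simplified update (7c).
    """
    n, d = X.shape
    mean_sq = np.sum(X ** 2) / (n * d)                       # (1/(nd)) * sum_i ||X_i||^2
    sigma2 = mean_sq - theta @ theta / d                     # current scale implied by (7c)
    w = 1.0 / (1.0 + np.exp(-2.0 * (X @ theta) / sigma2))    # E-step weights (6b)
    theta_new = ((2.0 * w - 1.0)[:, None] * X).mean(axis=0)  # M-step for the location, (7a)
    sigma2_new = mean_sq - theta_new @ theta_new / d         # simplified M-step for the scale, (7b)
    return theta_new, sigma2_new

# Toy usage: data from the standard normal, a few EM iterations in d = 2.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 2))
theta = np.array([0.5, 0.5])
for _ in range(20):
    theta, sigma2 = em_step(X, theta)
print(theta, sigma2)
```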

One can gain some intuition into the behavior of EM by examining the algebraic structure of the log-likelihood. In the univariate case (d = 1), the log-likelihood has two partial derivatives, one with respect to the location parameter θ and the other with respect to the scale parameter σ. By computing these partial derivatives, one finds that they satisfy the algebraic relation

∂² f_{θ, σ}(x) / ∂θ² = 2 ∂ f_{θ, σ}(x) / ∂(σ²),   (8)

valid for all x ∈ ℝ, θ ∈ ℝ and σ > 0. In the multivariate setting (1), when the same scale parameter is shared across multiple dimensions, this algebraic relation no longer holds, and we obtain faster convergence of the EM estimates.
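A relation of this type can be checked symbolically. The sympy snippet below verifies, for the univariate mixture density written in (1), that the second θ-derivative equals twice the derivative with respect to the variance; this reflects our reading of the relation (8), and the exact display in the original may differ in notation.

```python
import sympy as sp

x, theta = sp.symbols('x theta', real=True)
v = sp.symbols('v', positive=True)   # v denotes the variance sigma^2

def gaussian(mu):
    # univariate Gaussian density with mean mu and variance v
    return sp.exp(-(x - mu) ** 2 / (2 * v)) / sp.sqrt(2 * sp.pi * v)

f = sp.Rational(1, 2) * gaussian(theta) + sp.Rational(1, 2) * gaussian(-theta)
lhs = sp.diff(f, theta, 2)     # second derivative in the location parameter
rhs = 2 * sp.diff(f, v)        # twice the derivative in the variance parameter
print(sp.simplify(lhs - rhs))  # prints 0: the two partial derivatives are algebraically dependent
```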

It is known from past work that algebraic relations of the form (8) have consequences for the behavior of the MLE [4, 3, 17, 14]. In the current context, by adapting techniques developed by Ho et al. [14], we can verify that the convergence rates of the MLE to the true location and scale parameters are of order n^{-1/8} and n^{-1/4} under the univariate singular setting (1). In the multivariate setting d ≥ 2, the same analysis shows that the MLE achieves the faster rates n^{-1/4} and n^{-1/2} for the location and scale parameters, respectively.

Given these differences in the algebraic structure of the likelihood and the behavior of the MLE, it is natural to wonder whether the same phenomena arise for the EM algorithm when applied to weakly identifiable mixtures. Note that the answer to this question is not a priori obvious, given the non-convexity of the sample likelihood. Since the EM algorithm is only guaranteed to return local optima, it is quite possible that the accuracy of the EM estimates could be substantially different from that of a global maximum of the likelihood—that is, the MLE.

3 Main results and their consequences

We now turn to formal statements of our main results on the behavior of EM. The analyses for the multivariate and the univariate settings are significantly different; accordingly, we devote separate sections to each, beginning with the (easier) multivariate case in Section 3.1 before turning to the (more challenging) univariate case in Section 3.2.

3.1 Guarantees for multivariate singular mixtures

In order to study the EM algorithm for fitting Gaussian mixtures of the form (1), we consider the operator

M̄_n(θ) := E[ tanh( ⟨X, θ⟩ / σ_n²(θ) ) X ],  where the expectation is over X ~ N(0, I_d) and σ_n²(θ) := (1/(nd)) Σ_{i=1}^n ||X_i||² − ||θ||²/d.   (9)

In the sequel, we refer to it as a pseudo-population operator, since it is neither a purely sample-based operator (due to the outer expectation), nor a purely population operator, due to its dependence on the samples X_1, …, X_n through the scale σ_n²(θ). In past work, Cai et al. [2] used an operator of this type to analyze the behavior of sample EM for the regular case of mixtures with unknown location and scale parameters.
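As a sanity check of this description, the pseudo-population operator can be approximated by Monte Carlo: the tanh weight is averaged over fresh draws from N(0, I_d), while the scale retains its dependence on the observed samples. The sketch below follows our reconstruction of equation (9); the function name and the Monte Carlo size are illustrative choices.

```python
import numpy as np

def pseudo_population_em(theta, X, rng, num_mc=200_000):
    """Monte Carlo approximation of the pseudo-population operator (9).

    The outer average over the data is replaced by an expectation over a fresh
    population draw, but sigma_n^2(theta) is still computed from the samples X.
    """
    n, d = X.shape
    sigma2_n = np.sum(X ** 2) / (n * d) - theta @ theta / d   # sample-dependent scale
    X_pop = rng.standard_normal((num_mc, d))                  # draws approximating X ~ N(0, I_d)
    return (np.tanh(X_pop @ theta / sigma2_n)[:, None] * X_pop).mean(axis=0)

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 2))
print(pseudo_population_em(np.array([0.3, 0.0]), X, rng))
```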

We start with a description of both the population EM and the sample EM updates for the model (1) under the multivariate setting d ≥ 2, with guarantees stated over an interval of admissible values used in the initialization conditions. We now state our first main result for the multivariate singular setting:

Theorem 1.

Let two positive constants be fixed and let the sample size be sufficiently large. Then, for any admissible θ, the pseudo-population EM operator satisfies

(10a)

with probability at least

.² (²Since the operator depends on the samples X_1, …, X_n, only a high-probability bound, and not a deterministic one, on this ratio is possible.) Moreover, with any starting point θ⁰ satisfying the stated initialization condition, the sample EM sequence satisfies
(10b)

for all admissible iteration numbers t, with high probability.

See Section 5.1 for the proof.

The results in Theorem 1 establish the provably slow convergence of EM on both the algorithmic and statistical fronts for multivariate singular mixtures:

  1. The bound (10a) shows that the pseudo-population EM operator has a contraction parameter that scales like 1 − c||θ||² as θ → 0, so the contraction deteriorates as the iterates approach θ* = 0. Therefore, we conclude that the norm of the pseudo-population EM sequence is accurately captured by a sequence decaying at the rate t^{-1/2}, which is in agreement with the results of the simulation study of Figure 1 for d = 2.

  2. The bound (10b) shows that the sample EM updates converge to a ball around θ* = 0 with radius arbitrarily close to (d/n)^{1/4} for d ≥ 2.

Statistical rates for the scale parameter:

We now derive the convergence rates for the scale parameter using Theorem 1. Noting that the EM iterates satisfy (σ^t)² = (1/(nd)) Σ_{i=1}^n ||X_i||² − ||θ^t||²/d, we obtain the following relation:

|(σ^t)² − 1| ≤ | (1/(nd)) Σ_{i=1}^n ||X_i||² − 1 | + ||θ^t||²/d.

Using standard chi-squared bounds, we obtain that

| (1/(nd)) Σ_{i=1}^n ||X_i||² − 1 | ≲ 1/√(nd)

with high probability. From the bound (10b), we also have ||θ^t||² ≲ √(d/n) for all sufficiently large t. Putting the pieces together, we conclude that the statistical error for the scale parameter satisfies

|(σ^t)² − 1| ≲ 1/√(nd)   (11)

with high probability. Consequently, in the sequel, we focus primarily on the convergence rate for the EM estimates of the location parameter, as the corresponding guarantee for the scale parameter is readily implied by it.
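As an illustrative, purely numerical check of the scaling in (11), the following self-contained toy simulation iterates the simplified sample EM map from the earlier sketch on standard Gaussian data in d = 2 and prints the rescaled scale error; the sample sizes, step counts, and initialization are arbitrary choices, not the experimental setup of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_em(X, theta, num_steps=500):
    """Iterate the simplified sample EM map (7c) for the symmetric isotropic fit."""
    n, d = X.shape
    mean_sq = np.sum(X ** 2) / (n * d)
    for _ in range(num_steps):
        sigma2 = mean_sq - theta @ theta / d
        theta = (np.tanh(X @ theta / sigma2)[:, None] * X).mean(axis=0)
    return theta, mean_sq - theta @ theta / d

d = 2
for n in [1000, 4000, 16000]:
    X = rng.standard_normal((n, d))
    theta_hat, sigma2_hat = run_em(X, theta=0.5 * np.ones(d))
    # If (11) holds, the last column should stay roughly of constant order as n grows.
    print(n, np.linalg.norm(theta_hat), abs(sigma2_hat - 1.0) * np.sqrt(n * d))
```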

Related work:

Note that the slow convergence behavior of EM described above is in stark contrast with the geometric convergence of EM location estimates to a parametric radius of order (d/n)^{1/2} for well-specified mixtures [1, 8]. Nonetheless, the results from Theorem 1 are reminiscent of the slow convergence of EM for over-specified mixtures with known scale parameter [10]. In fact, the results for the over-specified location case from the paper [10], when combined with Theorem 1 in this paper, can be summarized as follows: for the multivariate singular fits (1), the convergence behavior of EM is unaffected by whether the scale parameter is known or unknown. Moreover, the proof of Theorem 1 follows a road-map similar to that of the proofs in the paper [10]: we first establish the population contraction bound (10a), and then, using an epoch-based localization argument, derive the statistical rates (10b).³ (³Moreover, similar to the arguments made in the paper [10], a localization argument is necessary to derive a sharp rate. Indeed, a direct application of the framework introduced by Balakrishnan et al. [1] to our setting implies a sub-optimal rate for the Euclidean error of the location iterates.)

Initial condition:

We now make a few comments on the initial conditions assumed in Theorem 1. At first sight, the stated condition on the initialization and the consequent constraint on the iterates might seem rather restrictive. However, we show (Lemma 7 in Section B.7) that for any initialization satisfying a much weaker condition, the pseudo-population EM update lands in the required region with high probability. In light of these observations, the initialization condition in Theorem 1 is not overly restrictive.

3.2 EM for univariate singular mixtures

When we substitute d = 1 in the upper bound (10a), the contraction coefficient reduces to one. In other words, the bound (10a) is not informative enough for d = 1, as it does not tell us whether the pseudo-population EM operator is even contractive for over-specified univariate mixtures. We now briefly demonstrate that knowledge of the scale parameter in the univariate case has a significant effect on the nature of the log-likelihood function (defined in equation (5)). When the scale parameter is known, the population log-likelihood is given by the map θ ↦ L(θ, 1). When the scale is unknown, recalling that the EM updates (7b) tie the scale to the location via σ²(θ) = 1 − θ² at the population level, we consider the map θ ↦ L(θ, σ(θ)). The results are plotted in Figure 3, where we observe that in both cases the population log-likelihood functions are uniformly concave [33], but with different orders of flatness around θ* = 0. More concretely, we observe the following numerical behaviors: L(θ, 1) − L(0, 1) ≍ −θ⁴ for the location case, and L(θ, σ(θ)) − L(0, σ(0)) ≍ −θ⁸ for the location-scale case. Such a difference in the population likelihood surface also affects the sample likelihood function, as depicted in panel (b) of Figure 3, where we see that the local maxima of the function corresponding to the location-scale mixtures have larger magnitude than those of the location mixtures. These observed differences lead to a more involved theoretical analysis for the univariate setting, as we now describe.

Figure 3: Plots of the log-likelihood for the over-specified location and location-scale Gaussian mixtures. In panel (a), we see that the population log-likelihood (5) for over-specified mixtures is weakly concave, although when the scale parameter is unknown the log-likelihood is much flatter than when it is known. Indeed, for the over-specified location-scale mixture the log-likelihood as a function of θ scales like −θ⁸ near θ* = 0, and as −θ⁴ for the over-specified location mixture. In panel (b), we see that the sample log-likelihood has a similar flatness as the population case. Moreover, the local maxima in the location-scale case have noticeably larger magnitude than those in the location case.

As discussed before, due to the relationship between the location and scale parameters given by the updates (7c), it suffices to analyze the sample EM operator for the location parameter. For univariate Gaussian mixtures, given samples X_1, …, X_n, the sample EM operator and the corresponding pseudo-population operator (9) are given by

M_n(θ) := (1/n) Σ_{i=1}^n X_i tanh( X_i θ / σ_n²(θ) ),   (12a)

M̄_n(θ) := E[ X tanh( X θ / σ_n²(θ) ) ],  where σ_n²(θ) := (1/n) Σ_{i=1}^n X_i² − θ².   (12b)

As before, our guarantees are stated over an interval of admissible values defined in terms of a positive universal constant. We now characterize the guarantees for EM under the univariate setting:

Theorem 2.

Fix two positive constants and let the sample size n be sufficiently large. Then the pseudo-population operator satisfies

(13a)
with probability at least . Moreover, for any initialization that satisfies the stated condition, the sample EM sequence satisfies
(13b)
with probability at least .

See Section 5.2 for the proof.

The bound (13a) provides a sharp characterization of the local contractivity properties of the pseudo-population operator. In particular, the contraction towards θ* is of much higher order in |θ|, and is thus significantly weaker than the contraction coefficient of order 1 − c||θ||² derived in Theorem 1 for the multivariate setting. As a direct consequence, we see that while the pseudo-population EM sequence converges to a ball of radius ε around θ* in order ε^{-2} iterations in the multivariate setting, it takes order ε^{-6} iterations in the univariate setting. Such behavior is indeed in accordance with the observations made in Figure 1.

The bound (13b) shows that, with high probability, after order n^{3/4} steps the sample EM iterates converge to a ball around θ* whose radius is arbitrarily close to n^{-1/8}. This scaling of order n^{-1/8} is, in fact, tight in the standard minimax sense (Proposition 1 in Appendix A). Thus, we conclude that Theorems 1 and 2, in conjunction, provide a rigorous justification of the empirical behavior of the sample EM updates as illustrated in Figure 1(b).

Remark:

The proof of Theorem 2 follows a fundamentally different route when compared to that of Theorem 1. On one hand, the analysis of the univariate pseudo-population operator is technically more involved and requires new techniques when compared to the analysis in the multivariate setting. On the other hand, the localization argument, when applied with the EM operators M_n and M̄_n, leads to sub-optimal rates for the sample EM updates, which are slower than the order n^{-1/8} established in Theorem 2. Such a sub-optimal guarantee necessitates the introduction of a different population operator that we discuss in the sequel. We first derive a few sub-optimal guarantees using the operator M̄_n in Section 3.2.1, and then in Section 3.2.2 we introduce a novel two-stage localization argument that enables us to derive the correct statistical rate for the sample EM updates.

3.2.1 Sub-optimal rates with pseudo-population EM

Given the behavior of the EM pseudo-population operator M̄_n, the analysis of the sample EM iterations requires uniform laws for the perturbation bounds between the sample EM operator M_n and the operator M̄_n. Later we establish (Lemma 6, stated in Section B.5) the following high-probability bound: for any radius r, we have

sup_{|θ| ≤ r} |M_n(θ) − M̄_n(θ)| ≲ r √(log(1/δ)/n)   (14)

with probability at least 1 − δ. Using the bound (13a) from Theorem 2 and the uniform bound (14), we now sketch the statistical rates for sample EM that can be obtained using (a) the generic procedure outlined by Balakrishnan et al. [1] and (b) the localization argument introduced by Dwivedi et al. [10]. As we show, both these arguments end up being sub-optimal, as they do not provide us with the rate of order n^{-1/8} stated in Theorem 2. To summarize, the discussion to follow makes use of the following two bounds,

|M̄_n(θ)| ≤ (1 − c θ⁶) |θ|   and   sup_{|θ| ≤ r} |M_n(θ) − M̄_n(θ)| ≲ r √(log(1/δ)/n),   (15)

both of which hold with high probability.

Sub-optimal rate I:

The eventual radius of convergence obtained using Theorem 5(a) from the paper [1] can be determined by solving the simpler equation

r (1 − γ(r)) ≍ ε_n(ω),  that is,  c r⁷ ≍ ω √(log(1/δ)/n),   (16a)

where ω denotes the bound on the initialization radius. Tracking the dependency only on n, we obtain that

r ≍ n^{-1/14},   (16b)

where we use the fact that the initialization radius ω is a constant. This informal computation suggests that the sample EM iterates for the location parameter are bounded by a term of order n^{-1/14}. This rate is clearly sub-optimal when compared to the EM rate of order n^{-1/8} from Theorem 2.

Sub-optimal rate II:

Next we apply the more sophisticated localization argument from the paper [10] in order to obtain a sharper rate. Given the bounds (15), this argument leads to solving the equation

c r⁷ ≍ ε_n(r) ≍ r √(log(1/δ)/n),   (17a)

where, as before, ω denotes the (constant) bound on the initialization radius, and the perturbation is now evaluated at the current radius r rather than at ω. Given that r ≤ ω, we find that

r ≍ n^{-1/12}.   (17b)

This calculation allows us to conclude that the EM algorithm converges to an estimate which is at a distance of order n^{-1/12} from the true parameter, which is again sub-optimal compared to the n^{-1/8} rate of EM from Theorem 2. Notice that this rate, while still sub-optimal, is sharper than the rate of order n^{-1/14} obtained in equation (16b); this sharpening arises from the additional localization factor in equation (17a) in comparison to equation (16a). We refer the readers to the paper [10] for a detailed discussion of the localization argument that leads to this additional factor.

Indeed, both the conclusions above can be made rigorous; nonetheless, the bound (17b) is essential in the analysis that follows, and hence we formalize the corresponding statistical rate in the next corollary. Recall the interval of admissible values, defined in terms of a positive universal constant, that was introduced before Theorem 2.

Corollary 1.

Given suitable constants, suppose that we generate the sample-level EM sequence starting from an admissible initialization and using a sufficiently large sample size. Then, for all admissible iterations t, we have

(18)

See Appendix B.6 for the proof of this corollary.

Before proceeding further, a few remarks are in order. The sub-optimal bound (18) is not an artifact of the localization argument; instead it arises due to our choice of the pseudo-population EM operator (12b). Indeed, as we show in the next section, a better choice of the population operator enables us to derive sharper rates for sample EM in the univariate setting of singular mixtures. However, a key assumption in the derivation to follow is that the sample EM iterates converge to a sufficiently small ball around θ* in a finite number of steps. Thus, Corollary 1, with a suitable choice of its parameters, plays a crucial role in the analysis to follow, as it enables us to claim that the aforementioned assumption does hold with high probability.

3.2.2 Towards a sharper analysis: Corrected population EM operator

The particular choice of the population-like operator in equation (12b) was motivated by previous work [2] on location-scale Gaussian mixtures. Nonetheless, as we noted above, this choice does not lead to the sharpest rate for the sample EM iterates for univariate singular mixtures. Upon a closer look, we realize that a "better" choice of the population operator is indeed possible in this setting. We now turn to a detailed description of this new operator and the associated consequences for the statistical rates of sample EM.

Define the following population EM operator:

M(θ) := E[ X tanh( X θ / σ²(θ) ) ],  where σ²(θ) := E[X²] − θ² = 1 − θ².   (19)

Unlike the pseudo-population operator (9), this operator is indeed a population operator, as it does not depend on the samples X_1, …, X_n. Note that this operator is obtained when we replace the empirical averages in the definition (7c) of the operator M_n by their population counterparts. For this reason, we also refer to this operator as the corrected population operator. We now derive the contraction coefficient and the perturbation error for the new operator in the univariate setting:

Lemma 1.
The corrected population EM operator satisfies the bound

|M(θ)| ≤ (1 − c₁ θ⁶) |θ|  for all admissible θ.   (20a)

Furthermore, for any fixed radius r in the admissible range and any δ ∈ (0, 1), we have that

sup_{|θ| ≤ r} |M_n(θ) − M(θ)| ≤ c₂ r³ √(log(1/δ)/n)  with probability at least 1 − δ,   (20b)

where c₂ is some universal constant.

See Section B.1 for the proof of Lemma 1.
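The slow contraction of the corrected operator can also be probed numerically. The sketch below evaluates our reconstruction of (19) by Gauss–Hermite quadrature and prints the gap θ − M(θ); the normalization by θ⁷ is our own numerical illustration of the very weak curvature near θ* = 0 under these assumptions, not a display copied from the paper.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def corrected_em_operator(theta, num_nodes=100):
    """Corrected population operator M(theta) = E[X tanh(theta X / sigma^2(theta))],
    with X ~ N(0, 1) and the tied scale sigma^2(theta) = 1 - theta^2 (d = 1)."""
    nodes, weights = hermegauss(num_nodes)      # quadrature against exp(-x^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)    # normalize to the N(0, 1) density
    sigma2 = 1.0 - theta ** 2
    return np.sum(weights * nodes * np.tanh(theta * nodes / sigma2))

for theta in [0.4, 0.3, 0.2, 0.1]:
    gap = theta - corrected_em_operator(theta)
    # For small theta the last ratio settles near a constant, indicating a high-order gap.
    print(f"theta = {theta:4.2f}   theta - M(theta) = {gap:.3e}   ratio to theta^7 = {gap / theta**7:.3f}")
```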

[Table 1 compares, for dimensions d ≥ 2 and d = 1, the contraction rate and the perturbation bound of the pseudo-population operator (Theorem 1 and Lemma 6) with those of the corrected population operator (Lemma 1).]

Table 1: Summary of the contraction rates and perturbation bounds of the pseudo-population operator and the corrected population operator, defined in equations (9) and (19) respectively. The perturbation bound denotes a high-probability bound on the error between the corresponding operator and the sample EM operator (7c). The word "similar" is used to denote that the corresponding entry is similar to the entry on its left. There is no difference between the two operators for d ≥ 2. However, in the univariate case, there is a stark difference in the perturbation bounds of the two operators: while the bound scales linearly with the radius r for the pseudo-population operator, the scaling is of order r³ for the corrected population operator. This difference leads to a sub-optimal guarantee in the statistical rate for sample EM (Corollary 1) when using only the pseudo-population operator in the analysis for the univariate singular fit.

Using Lemma 1, we now compare and contrast the two operators M̄_n and M. While both operators have a similar contraction coefficient as θ → 0, their perturbation bounds are significantly different. While the error sup_{|θ| ≤ r} |M_n(θ) − M̄_n(θ)| scales linearly with the radius r (see the bound (14)), the inequality (20b) implies that the deviation error sup_{|θ| ≤ r} |M_n(θ) − M(θ)| has a cubic scaling r³. In Figure 4, we plot numerically computed perturbation errors that verify the linear and cubic dependence on the radius for these two bounds. Moreover, for the reader's convenience, we summarize these similarities and differences in Table 1.

Figure 4: Plots of the perturbation errors for the two population EM operators, namely, the pseudo-population operator (12b) and the corrected population operator (19), with respect to the sample EM operator (12a), as a function of the radius r. From the least-squares fits on the log-log scale, we see that while the error for the pseudo-population operator scales linearly with r, the error for the corrected population operator has a cubic dependence on r. This behavior is in accordance with our theoretical results.

Another notable difference between the two perturbation bounds is the range of radii over which we prove their validity. With our tools, we establish the perturbation bound (20b) for the corrected operator M only over a restricted range of radii, whereas the corresponding bound (14) for the operator M̄_n holds over a wider range. This difference is the key reason why the operator M̄_n, and the consequent result in Corollary 1, remain crucial in the analysis to follow.

A two-stage analysis:

In light of the above observations, the proof of the sharp upper bound (13b) in Theorem 2 proceeds in two stages. In the first stage, invoking Corollary 1, we conclude that with high probability the sample EM iterates converge, after a finite number of steps, to a ball around θ* whose radius is small enough for the second stage. Consequently, the sample EM iterates after these initial steps satisfy the assumptions required to apply Lemma 1. Thereby, in the second stage of the proof, we invoke Lemma 1 and employ the contraction bound (20a) of the operator M in conjunction with the cubic perturbation bound (20b). Using a localization argument for this stage, we establish that the EM iterates attain a statistical error of order n^{-1/8} in order n^{3/4} steps, as stated in Theorem 2.
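To convey the intuition behind this two-stage scheme, here is a back-of-the-envelope version of the localization calculation; the exponents p and q below are placeholders standing in for the contraction and perturbation orders discussed above, and the display is our own schematic summary rather than an equation from the paper.

```latex
% If the population-level operator contracts as |M(\theta)| \le (1 - c|\theta|^{p})|\theta|
% and the sample operator deviates from it by at most \varepsilon_n(r) \asymp r^{q}/\sqrt{n}
% over the ball \{|\theta| \le r\}, then balancing contraction against perturbation gives
\[
  \underbrace{r\bigl(1 - \gamma(r)\bigr)}_{\text{contraction gain}}
  \;\asymp\;
  \underbrace{\varepsilon_n(r)}_{\text{perturbation}},
  \qquad
  \gamma(r) = 1 - c\,r^{p},
  \quad
  \varepsilon_n(r) \asymp \frac{r^{q}}{\sqrt{n}}
  \;\;\Longrightarrow\;\;
  r \;\asymp\; n^{-\frac{1}{2(p + 1 - q)}}.
\]
% A perturbation that vanishes faster near \theta^{*} (larger q, e.g. the cubic bound (20b)
% versus the linear bound (14)) yields a strictly smaller final radius at the same contraction
% order p, which is exactly the gain delivered by the corrected population operator.
```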

4 Discussion

In this paper, we established several results characterizing the convergence behavior of the EM algorithm for over-specified location-scale Gaussian mixtures. Our results reveal an interesting phase transition in the convergence rates of EM between the univariate and multivariate settings. With over-specified models in dimension d ≥ 2, the EM updates based on n samples recover noisy estimates of the true parameters with a rather large error of order (d/n)^{1/4} for the location parameter and (nd)^{-1/2} for the scale parameter. Thus we see that the error for the location estimate scales as n^{-1/4} with the sample size, which is rather slow when compared to the usual parametric rate of order n^{-1/2} for regular models. In the univariate setting, the story gets more interesting, and we observe that these errors are even larger: n^{-1/8} for the location parameter and n^{-1/4} for the scale parameter. This phase transition from the multivariate to the univariate setting can be attributed to the interdependence among the parameters arising from an intrinsic linear dependence in the PDE structure in the univariate setting. We view our analysis of EM for singular Gaussian mixtures as a first step towards a rigorous understanding of EM for a broader class of weakly identifiable mixture models. Such a study would provide a better understanding of singular models with weak identifiability. Note that these models do arise in practice when fitting heterogeneous data, owing to the following two main reasons: (a) over-specification is a common phenomenon in fitting mixture models due to weak separation between mixture components, and (b) the parameters being estimated are often inherently dependent due to the algebraic structure of the class of kernel densities being fitted and the associated partial derivatives. We now discuss a few other directions that can serve as natural follow-ups of our work.

The slow rate of order n^{-1/8} for EM is in a sense a worst-case guarantee. In the univariate case, within the entire class of two-component Gaussian mixture fits, the MLE exhibits its slowest known statistical rate in the settings that we analyzed. More precisely, for certain asymmetric Gaussian mixture fits, the MLE convergence rate for the location parameter is faster than that of the symmetric equal-weighted mixture considered in this paper. As a concrete example, for an asymmetric univariate model fit to standard isotropic Gaussian data, the MLE converges at polynomially faster rates for the location and scale parameters. These specific convergence rates are also based on the solvability of a certain system of polynomial equations with specific constraints on the unknown parameters arising from the model fit. It is interesting to understand the effect of such a geometric structure of the global maxima on the convergence of the EM algorithm.

Finally, we analyzed over-specified mixtures with only one extra component. Nonetheless, quite often one may fit a model with a larger number of over-specified components so as to avoid under-specification of the true model. The convergence rate of the MLE for such over-specified models is known to further deteriorate as a function of the number of extra components. It remains to understand how the EM algorithm responds to these more severe—and practically relevant—instances of over-specification.

5 Proofs

We now provide the proofs of our main results, with Sections 5.1 and 5.2 devoted to Theorems 1 and 2 respectively. In all cases, the proofs of technical and auxiliary results are deferred to the appendices.

5.1 Proof of Theorem 1

The first step in this proof is to characterize the local contractivity of the pseudo-population operator. We then leverage this result to derive the convergence rate of the sample EM updates under the multivariate setting of singular Gaussian mixtures.

5.1.1 Contractivity of the pseudo-population EM operator

In order to simplify notation, we use the shorthand σ² = σ_n²(θ). Recalling the definition (9) of the operator M̄_n, we have

M̄_n(θ) = E[ tanh( ⟨X, θ⟩ / σ² ) X ].   (21)

We can find an orthonormal matrix R such that Rθ = ||θ|| e₁, where e₁ is the first canonical basis vector in ℝ^d. Define the random vector Y := RX. Since X ~ N(0, I_d), we have that Y ~ N(0, I_d). On performing the change of variables y = Rx, we find that